Eric Lee — Creative Technologist

Process

Research Design

Evaluated AI vibe-coding tools (Claude, Codex) across five prompting strategies: zero-shot, few-shot, chain-of-thought, role-based, and constraint-driven prompting.
Ran two parallel evaluation tracks: AI self-evaluation (asking the model to score its own outputs) and independent human evaluation using the same criteria.
Metrics: System Usability Scale (SUS), User Engagement Scale Short Form (UES-SF), and User Experience Questionnaire (UEQ).

Key Finding

Across all five prompting strategies, AI self-evaluation scores were consistently 20–30 points higher than human assessment on the same outputs.
The gap was largest in creative and interaction design tasks — areas where quality is context-dependent and hard to specify in a prompt.
This finding has direct implications for any organization deploying AI in production: self-reported quality metrics from AI systems are not reliable indicators of actual output quality.

Framework

Instruction layer: how the model is prompted and constrained defines the ceiling on output quality. Vague instructions produce variable outputs; structured constraint layers reduce variance.
Evaluation layer: AI outputs require independent human evaluation, not self-assessment. Mixing evaluation frameworks (SUS for usability, UEQ for experience quality) provides more complete signal.
Feedback layer: closed-loop feedback from real usage — not just pre-deployment testing — is necessary to detect quality drift over time.
Trust threshold: no output should be delegated to end users without passing a defined quality gate that was designed independently of the AI system generating the output.

A practical framework applicable to any organization building AI-powered products — particularly in regulated or high-stakes environments where output quality is non-negotiable.
Presented internally at ProtoPie as a foundation for client AI consulting engagements.
Directly applicable to legal AI, medical AI, and enterprise AI contexts where the cost of low-quality outputs is high.