Process
1
Research Design
Building an evaluation methodology that catches what AI misses
- Evaluated AI vibe-coding tools (Claude, Codex) across five prompting strategies: zero-shot, few-shot, chain-of-thought, role-based, and constraint-driven prompting.
- Ran two parallel evaluation tracks: AI self-evaluation (asking the model to score its own outputs) and independent human evaluation using the same criteria.
- Metrics: System Usability Scale (SUS), User Engagement Scale Short Form (UES-SF), and User Experience Questionnaire (UEQ).
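Of the three instruments, SUS has the simplest scoring rule, and it shows how questionnaire responses become a single comparable number. The sketch below applies the standard published SUS rule (ten items on a 1-5 scale, alternating positive and negative wording, rescaled to 0-100). The study's actual scoring pipeline is not described here, so the function name and input shape are assumptions for illustration.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Convert ten SUS item responses (1-5 Likert) into a 0-100 score.

    Hypothetical helper: the name and signature are illustrative,
    not taken from the study.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")

    total = 0
    for index, response in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded: contribute (response - 1).
            total += response - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded: contribute (5 - response).
            total += 5 - response

    # The summed contributions range from 0 to 40; multiply by 2.5 to reach 0-100.
    return total * 2.5
```

Applying the same function to the AI's self-reported responses and to each human rater's responses would put both tracks on one scale.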
2
Key Finding
AI consistently overestimates the quality of its own outputs
- Across all five prompting strategies, AI self-evaluation scores ran 20–30 points higher than human evaluators' scores for the same outputs.
- The gap was largest in creative and interaction design tasks, where quality is context-dependent and hard to specify in a prompt.
- This finding has direct implications for any organization deploying AI in production: self-reported quality metrics from AI systems are not reliable indicators of actual output quality.
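The gap described above can be expressed as a simple difference of means per prompting strategy. The following is a minimal sketch, assuming both tracks report scores on the same 0-100 scale. The numbers are made-up placeholders, not the study's data, and the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence

# PLACEHOLDER values for illustration only; these are NOT results from the study.
example_scores: Dict[str, Dict[str, Sequence[float]]] = {
    "zero-shot": {"ai": [85.0, 90.0], "human": [60.0, 65.0]},
    "few-shot": {"ai": [88.0, 92.0], "human": [66.0, 70.0]},
}


def evaluation_gap(ai_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Mean AI self-score minus mean human score; positive means the AI overestimates."""
    return mean(ai_scores) - mean(human_scores)


def gaps_by_strategy(
    scores: Dict[str, Dict[str, Sequence[float]]],
) -> Dict[str, float]:
    """Compute the self-evaluation gap for each prompting strategy."""
    return {
        strategy: evaluation_gap(tracks["ai"], tracks["human"])
        for strategy, tracks in scores.items()
    }


if __name__ == "__main__":
    for strategy, gap in gaps_by_strategy(example_scores).items():
        print(f"{strategy}: AI self-score exceeds human score by {gap:.1f} points")
```

Reporting the gap per strategy, rather than one pooled number, makes it visible whether any prompting approach narrows the overestimate.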
3
Framework
AI Reliability Architecture — the layers that make delegation safe
- Instruction layer: how the model is prompted and constrained defines the ceiling on output quality. Vague instructions produce variable outputs; structured constraint layers reduce variance.
- Evaluation layer: AI outputs require independent human evaluation, not self-assessment. Combining evaluation frameworks (SUS for usability, UEQ for experience quality) provides a more complete signal.
- Feedback layer: closed-loop feedback from real usage — not just pre-deployment testing — is necessary to detect quality drift over time.
- Trust threshold: no output should reach end users without passing a defined quality gate that was designed independently of the AI system generating the output.
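The trust threshold and feedback layer can be sketched in a few lines. This is one possible shape under stated assumptions, not the framework's actual implementation: the class, function, and parameter names are hypothetical, scores are assumed to be on a 0-100 scale, and the threshold values would have to be set by people independent of the generating model.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class QualityGate:
    """Release threshold defined independently of the generating model (hypothetical)."""

    min_human_score: float  # assumed 0-100 scale; the value is a policy decision
    min_raters: int  # minimum number of independent human ratings required

    def passes(self, human_scores: Sequence[float]) -> bool:
        """Return True only if enough humans rated the output and their mean clears the bar."""
        if len(human_scores) < self.min_raters:
            return False
        return mean(human_scores) >= self.min_human_score


def has_drifted(
    baseline_scores: Sequence[float],
    recent_scores: Sequence[float],
    tolerance: float,
) -> bool:
    """Flag drift when the recent mean falls more than `tolerance` below the baseline mean."""
    return mean(baseline_scores) - mean(recent_scores) > tolerance


if __name__ == "__main__":
    gate = QualityGate(min_human_score=70.0, min_raters=3)
    print(gate.passes([72.0, 75.0, 68.0]))
    print(has_drifted([80.0, 82.0, 78.0], [70.0, 72.0, 68.0], tolerance=5.0))
```

The gate deliberately takes no AI self-score as input, so an overconfident model cannot influence whether its own output is released.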
Outcomes
- A practical framework applicable to any organization building AI-powered products — particularly in regulated or high-stakes environments where output quality is non-negotiable.
- Presented internally at ProtoPie as a foundation for client AI consulting engagements.
- Directly applicable to legal AI, medical AI, and enterprise AI contexts where the cost of low-quality outputs is high.