Process

1

Research Design

Building an evaluation methodology that catches what AI misses

  • Evaluated AI vibe-coding tools (Claude, Codex) under five prompting strategies: zero-shot, few-shot, chain-of-thought, role-based, and constraint-driven.
  • Ran two parallel evaluation tracks: AI self-evaluation (asking the model to score its own outputs) and independent human evaluation using the same criteria.
  • Metrics: System Usability Scale (SUS), User Engagement Scale Short Form (UES-SF), and User Experience Questionnaire (UEQ).
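
To make the SUS metric concrete, here is a minimal sketch of the standard SUS scoring rule for a single respondent. The function name `sus_score` is an assumption for illustration; the case study does not describe its scoring scripts, and UES-SF and UEQ use different scales that are not shown here.

```python
def sus_score(responses):
    """Return the 0-100 SUS score for one respondent.

    `responses` holds the ten item ratings in questionnaire order,
    each on the standard 1-5 agreement scale.
    """
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 item responses")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")

    total = 0
    for index, rating in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded: contribute rating - 1.
            total += rating - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded: contribute 5 - rating.
            total += 5 - rating

    # The summed contributions (0-40) are scaled to 0-100.
    return total * 2.5
```

Applying the same function to both the model's self-ratings and the human ratings keeps the two tracks on one scale, which is what makes a point-for-point comparison possible.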

2

Key Finding

AI consistently overestimates the quality of its own outputs

  • Across all five prompting strategies, AI self-evaluation scores ran 20–30 points higher than human scores for the same outputs.
  • The gap was largest in creative and interaction design tasks, where quality is context-dependent and hard to specify in a prompt.
  • This finding has direct implications for any organization deploying AI in production: self-reported quality metrics from AI systems are not reliable indicators of actual output quality.
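
The comparison behind this finding can be sketched as a per-strategy gap between the two evaluation tracks. This is an illustration only: `evaluation_gap` is a hypothetical helper, and the sample values are placeholders, not data from the study.

```python
from statistics import mean


def evaluation_gap(records):
    """Mean (AI self-score minus human score) per prompting strategy.

    `records` is an iterable of (strategy, ai_score, human_score) tuples,
    with both scores on the same scale.
    """
    differences = {}
    for strategy, ai_score, human_score in records:
        differences.setdefault(strategy, []).append(ai_score - human_score)
    return {strategy: mean(diffs) for strategy, diffs in differences.items()}


# Placeholder values for illustration only; NOT results from the study.
sample = [
    ("zero-shot", 85.0, 60.0),
    ("zero-shot", 80.0, 55.0),
    ("few-shot", 90.0, 70.0),
]
gaps = evaluation_gap(sample)
```

A positive gap means the model rated its own output higher than the human raters did. Grouping by strategy shows whether the overestimation depends on how the model was prompted.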

3

Framework

AI Reliability Architecture — the layers that make delegation safe

  • Instruction layer: how the model is prompted and constrained defines the ceiling on output quality. Vague instructions produce variable outputs; structured constraint layers reduce variance.
  • Evaluation layer: AI outputs require independent human evaluation, not self-assessment. Mixing evaluation frameworks (SUS for usability, UEQ for experience quality) provides a more complete signal.
  • Feedback layer: closed-loop feedback from real usage — not just pre-deployment testing — is necessary to detect quality drift over time.
  • Trust threshold: no output should reach end users without passing a defined quality gate that was designed independently of the AI system generating the output.
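
One way the trust threshold and feedback layer could look in code is sketched below. Everything here is an assumption for illustration: the `QualityGate` class, its threshold fields, and the `has_drifted` helper are hypothetical and are not part of the framework as presented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class QualityGate:
    min_human_score: float  # hypothetical threshold, same scale as the scores
    min_raters: int = 2     # hypothetical minimum number of independent raters

    def passes(self, human_scores, ai_self_score=None):
        """Approve an output using human scores only.

        `ai_self_score` is accepted so callers can log it, but it is
        deliberately ignored in the decision.
        """
        if len(human_scores) < self.min_raters:
            return False
        return mean(human_scores) >= self.min_human_score


def has_drifted(baseline_scores, recent_scores, tolerance):
    """Flag drift when recent mean quality falls below baseline by more than `tolerance`."""
    return mean(baseline_scores) - mean(recent_scores) > tolerance


gate = QualityGate(min_human_score=70.0)
```

The gate takes the model's self-score as an argument but never reads it, which mirrors the requirement that the gate be designed independently of the system it evaluates. The drift check compares post-deployment scores against a pre-deployment baseline, a simple stand-in for closed-loop feedback.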

Outcomes

  • A practical framework applicable to any organization building AI-powered products — particularly in regulated or high-stakes environments where output quality is non-negotiable.
  • Presented internally at ProtoPie as a foundation for client AI consulting engagements.
  • Directly applicable to legal AI, medical AI, and enterprise AI contexts where the cost of low-quality outputs is high.