AI Development Framework: Why Evaluation Comes First, Not Later
A few months ago, I watched a team present an AI application they'd built using a low-code platform. The tool worked. It answered questions, it looked polished, and the team was proud of it.
During the Q&A, someone mentioned that the platform made it easy to swap in different models. I asked whether they'd tested multiple models against each other to see which one actually performed best for their use case. The response from the team's senior leader was immediate: "We're just trying to get it working. Optimization comes later."
I didn't push back. But that moment stuck with me, because it captured a mindset I keep seeing in teams building AI applications. They treat evaluation the same way traditional software teams treat performance tuning: something you do after the thing works.
What that team needed wasn't a better model. They needed an AI development framework that treats evaluation as a first principle, not an afterthought.
The Paradigm Shift Most Teams Miss
Traditional software is deterministic. You write a function, give it input, and get the same output every time. If it works on Tuesday, it works on Wednesday. You can look at the result and verify it's correct. The code either does what it's supposed to, or it doesn't.
AI applications break every one of those assumptions.
Give the same prompt and input to the same model twice, and you might get different outputs. A prompt change that improves one category of responses might silently degrade another. A model update from your provider can change behavior without warning. You can't just look at the output and know it's right, because LLM outputs are fluent and confident whether they're accurate or not.
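That first failure of determinism is easy to see for yourself. A minimal sketch, assuming the OpenAI Python SDK as a stand-in for whichever provider you use, with an API key in the environment; the model name and prompt are illustrative:

```python
# Minimal non-determinism demo, assuming the OpenAI Python SDK
# (pip install openai) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()
prompt = "Summarize our refund policy for a customer in one sentence."

outputs = []
for _ in range(3):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,      # ordinary sampling, not greedy decoding
    )
    outputs.append(response.choices[0].message.content)

# Same prompt, same model, and usually three different strings.
for i, text in enumerate(outputs, 1):
    print(f"Run {i}: {text}")
```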
This is the core shift: your code isn't deterministic anymore. The question "does it work?" no longer has a self-evident answer.
In traditional development, determinism gives you a built-in verification mechanism. With AI, you have to build that mechanism yourself. That mechanism is evaluation, and it's not optimization. It's verification. Without it, you genuinely do not know if your system works.
Chip Huyen makes this point sharply in AI Engineering (O'Reilly, 2025). She identifies "vibe checks," the practice of manually eyeballing a few outputs and deciding they look good, as the number one pitfall in AI development. The problem is that LLM outputs look right. They're grammatically correct, confidently stated, and plausible. Teams mistake fluency for accuracy, and that confidence carries them right past the point where they should have started measuring.
Three Principles for AI Development That Works
After spending months studying the best thinking on this topic, starting with Marina Wyss's framework for AI engineering projects and then digging into Chip Huyen's AI Engineering, Anthropic's research on building effective agents, AWS's work on the Strands Agents SDK, and evaluation frameworks like RAGAS and DeepEval, I found strong consensus on three principles. Every credible source agrees on these, even though they frame them differently.
Principle 1: Start with the Problem, Not the Technology
Before writing a line of code, answer three questions. First, is an LLM even the right tool for this problem? If the input and output involve natural language with high variability, and the task requires judgment rather than simple lookup, and some degree of imprecision is acceptable, an LLM is probably a fit. If you need exact, deterministic, auditable answers every time, traditional code is likely better.
Second, what does "good" look like? You need ground truth before you build. That doesn't mean you need a single correct answer for every input. It means you need a rubric. For some tasks, correctness is binary: did it extract the right dollar amount? For open-ended generation, define scoring dimensions like accuracy, relevance, and completeness on a 1-5 scale. The point is to turn the vague question "is this good?" into something measurable.
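As a concrete sketch of what a rubric can look like in code, here is one with graded 1-5 dimensions and a binary gate; the dimensions, weights, and gating rule are illustrative, not prescriptive:

```python
# A rubric that turns "is this good?" into numbers. The dimensions,
# scale, and gating rule here are illustrative examples.
from dataclasses import dataclass

@dataclass
class RubricScore:
    accuracy: int         # 1-5: factually consistent with the source?
    relevance: int        # 1-5: does it address the actual question?
    completeness: int     # 1-5: does it cover everything it should?
    exact_field_ok: bool  # binary: e.g. did it extract the right dollar amount?

    def overall(self) -> float:
        # Average the graded dimensions; treat the binary check as a gate.
        graded = (self.accuracy + self.relevance + self.completeness) / 3
        return graded if self.exact_field_ok else 0.0

score = RubricScore(accuracy=4, relevance=5, completeness=3, exact_field_ok=True)
print(score.overall())  # 4.0
```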
Third, how will you measure it? Will you use exact matching, rubric-based scoring, LLM-as-judge, or a combination? If you're using an LLM to judge outputs, you need to validate it against human reviewers. Chip Huyen recommends scoring 10-20% of test cases with humans. If the LLM judge and humans disagree more than 15-20% of the time, the automated evaluation isn't trustworthy yet.
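The validation step itself is mechanical once you have both sets of verdicts. A minimal sketch, with illustrative data:

```python
# Compare LLM-judge verdicts against human verdicts on a sample of
# test cases. The verdict lists below are illustrative.

def disagreement_rate(llm: list[bool], human: list[bool]) -> float:
    assert len(llm) == len(human)
    return sum(l != h for l, h in zip(llm, human)) / len(llm)

llm_verdicts   = [True, True, False, True, False, True, True, False, True, True]
human_verdicts = [True, False, False, True, False, True, True, True, True, True]

rate = disagreement_rate(llm_verdicts, human_verdicts)
print(f"Disagreement: {rate:.0%}")  # 20%
if rate > 0.15:
    print("LLM judge isn't trustworthy yet; refine its rubric and re-validate.")
```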
The anti-pattern I keep seeing: teams skip all three questions and jump straight to "use the best model available." That's how you end up with a system that looks like it works but has no way to prove it.
Principle 2: Simplicity First, Complexity Earned
Every source I studied agrees on this, and it's the strongest consensus point across all of them. Anthropic's guidance on building effective agents says it directly: most applications don't need agents at all. Optimizing single LLM calls with retrieval and in-context examples is often enough. Chip Huyen names "starting too complex" as a specific pitfall. AWS built Strands Agents specifically because existing frameworks were getting in the way of what modern LLMs could do on their own.
But "start simple" is easy to nod along with and hard to follow. It needs teeth. Here's how to give it teeth: use an improvement hierarchy as a validation sequence. When your system isn't meeting its quality bar, the failure type tells you what tier to try next.
| Tier | What to Try | When |
|---|---|---|
| 1. Prompt | Better instructions, examples, format specification | Format or instruction-following failures |
| 2. Context | Add retrieval (RAG), improve chunking, add reranking | Knowledge gaps in the model's training |
| 3. Orchestration | Prompt chaining, routing, parallelization | Multi-step reasoning failures |
| 4. Model | Try a different model, model routing, fine-tuning | Reasoning ceiling on current model |
| 5. Agents | Autonomous tool use, planning loops | System needs to take actions, not just answer |
The key insight: complexity doesn't just add capability. It adds diagnostic difficulty. Every layer you add makes it harder to figure out what's going wrong when things go wrong. If you jump straight to a fully autonomous agent and it fails, you have five possible failure points and no way to isolate which one is the problem.
Don't move up the hierarchy until you can point to specific test cases where the current approach fails, and you've exhausted simpler improvements first. That's the decision gate. Without evaluation in place (Principle 1), you can't see the failure patterns, so you throw more complexity at the problem hoping something sticks.
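One way to make that gate concrete is to categorize every failed test case during error analysis and let the dominant category point at the next tier. The category names and mapping below are an illustrative sketch, not a standard:

```python
# Tally failure categories from an evaluation run and map the dominant
# category to the next tier of the hierarchy. Names are illustrative.
from collections import Counter

TIER_FOR_FAILURE = {
    "format": "1. Prompt",
    "missing_knowledge": "2. Context (RAG)",
    "multi_step_reasoning": "3. Orchestration",
    "reasoning_ceiling": "4. Model",
    "needs_action": "5. Agents",
}

# Each failed case gets a category during error analysis.
failures = ["missing_knowledge", "format", "missing_knowledge",
            "missing_knowledge", "format"]

dominant, count = Counter(failures).most_common(1)[0]
print(f"{count}/{len(failures)} failures are '{dominant}' "
      f"-> next tier to try: {TIER_FOR_FAILURE[dominant]}")
```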
Principle 3: Evaluation Is Verification, Not Optimization
This is the principle that changes everything, and it's why I didn't push back in that meeting. The framing needed to be sharper than "you should test your models." The real argument is this: "optimization comes later" assumes you know the system works and you're just making it better. With AI, you don't know it works without evaluation. The team I watched wasn't deferring optimization. They were skipping verification.
Evaluation has two distinct modes. Development evaluation runs your test set, compares prompt versions, and tests model changes. It answers "is this version better than the last one?" Production evaluation monitors real usage continuously, tracking quality scores, detecting drift, and catching degradation. It answers "is the system still working as well as it was?" Both are required. Neither substitutes for the other.
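Development-mode evaluation is simple enough to sketch end to end. Everything below is a stand-in: the toy model call, the exact-match scorer, and the two-case test set exist only to show the shape of the comparison:

```python
# Run two prompt versions over the same fixed test set and compare
# aggregate scores. The model call and scorer are stand-ins.

def call_model(prompt_template: str, case_input: str) -> str:
    return prompt_template.format(input=case_input)  # stand-in for an LLM call

def score(output: str, expected: str) -> float:
    return 1.0 if expected in output else 0.0  # stand-in for rubric scoring

def evaluate(prompt_template: str, test_set: list[dict]) -> float:
    scores = [score(call_model(prompt_template, case["input"]), case["expected"])
              for case in test_set]
    return sum(scores) / len(scores)

test_set = [
    {"input": "refund window", "expected": "refund window"},
    {"input": "billing cycle", "expected": "billing cycle"},
]

v1 = evaluate("Answer briefly: {input}", test_set)
v2 = evaluate("Answer briefly, citing the policy: {input}", test_set)
print(f"v1={v1:.2f}  v2={v2:.2f}")  # adopt v2 only if it beats v1 here
```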
The hardest part is measuring what matters. Teams default to measuring what's easy: latency, token count, cost per request. Those are important, but they don't tell you whether the outputs are correct. Quality feels subjective and therefore unmeasurable. But it's not, once you learn to decompose it.
Take AI-generated software requirements documentation as an example. "Did it produce good requirements?" feels impossibly subjective. But break it down:
- Did it capture all the requirements from the source material? Checkable against the source.
- Are the requirements written so they can be tested and verified? A rubric handles this.
- Are there contradictions between requirements? An automated cross-reference check.
- Is the language unambiguous? LLM-as-judge with a specific rubric.
- Does it follow the organization's format and standards? Pattern matching, fully automated.
That vague quality question just became 60-70% automated evaluation and 30-40% structured rubric scoring. This decomposition technique works for almost any AI output. The pattern is repeatable: start with the vague quality question, break it into the most specific sub-questions you can, automate what's automatable, and use rubric-based scoring only for what's genuinely subjective.
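Two of those checks are straightforward to automate. The requirement-ID pattern and keyword-coverage heuristic below are illustrative stand-ins for an organization's real standards:

```python
# Sketches of two automatable checks from the decomposition above:
# format compliance and source coverage. Both heuristics are illustrative.
import re

REQ_LINE = re.compile(r"^REQ-\d{3}: .+ shall .+")

def follows_format(doc: str) -> float:
    """Share of non-empty lines matching the 'REQ-NNN: ... shall ...' pattern."""
    lines = [line for line in doc.splitlines() if line.strip()]
    return sum(bool(REQ_LINE.match(line)) for line in lines) / len(lines)

def source_coverage(doc: str, source_terms: list[str]) -> float:
    """Share of source-material terms that appear somewhere in the output."""
    return sum(term.lower() in doc.lower() for term in source_terms) / len(source_terms)

doc = ("REQ-001: The system shall log every failed login attempt.\n"
       "REQ-002: The system shall lock an account after five failures.")
print(follows_format(doc))                         # 1.0
print(source_coverage(doc, ["login", "lockout"]))  # 0.5
```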
After 20 years of writing and reviewing requirements documentation in financial services, I know exactly what "good" looks like for that domain. That domain knowledge is what makes the rubric meaningful. Without it, you're guessing. With it, you have a measurement system you can trust.
Foundational Practices
Two practices run through every phase of AI development and deserve explicit attention.
Decision records. In traditional software, most design decisions are visible in the architecture. You can see the database choice, the service patterns, the code structure. In AI applications, the decisions that matter most are invisible unless you write them down: why you chose one model over another, why you chunk documents at 512 tokens instead of 256, what you tried that didn't work and why. These are AI-specific architectural decision records, and they matter more here than in any other kind of software development. They're also invaluable if you ever need to explain your work in an interview or hand the project to someone else.
Prompt versioning. Store prompts as separate versioned files, never as hardcoded strings in application code. Each version gets a brief note about what changed and why. Every evaluation run ties to a specific prompt version so you can trace improvements or regressions back to specific changes. This sounds like overhead until the first time a prompt change breaks something and you need to figure out what happened.
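A minimal sketch of what that looks like on disk, assuming an illustrative prompts/<name>/<version>.txt layout; none of these names are a prescribed standard:

```python
# Load prompts from versioned files and tie each evaluation run to the
# exact version it used. The directory layout is an assumed convention.
from pathlib import Path

PROMPTS_DIR = Path("prompts")

def load_prompt(name: str, version: str) -> str:
    # e.g. prompts/summarize/v3.txt, with a sibling note on what changed
    return (PROMPTS_DIR / name / f"{version}.txt").read_text()

def record_eval_run(name: str, version: str, score: float) -> dict:
    # Every stored result carries the prompt version that produced it.
    return {"prompt": name, "version": version, "score": score}

# prompt = load_prompt("summarize", "v3")
# result = record_eval_run("summarize", "v3", score=0.87)
```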
Building Up With Proof: The Phases
The three principles and two practices come together in a phased development approach. Each phase builds on the last, and you only advance when you have measured evidence that the added complexity is justified.
| Phase | What You're Doing | Decision Gate |
|---|---|---|
| 0: Problem Scoping | Define the problem, constraints, success metrics, and test data | Can you articulate what "good" looks like? |
| 1: Baseline | Single prompt + direct LLM call against your test set | Does it meet your accuracy floor? |
| 2: RAG | Add retrieval when failures are knowledge gaps | Does retrieval measurably improve accuracy? |
| 3: Agents | Add autonomous tool use when the task requires action | Do you actually need agent autonomy? |
| 4: Deployment | API, infrastructure, UI | Is the system accessible and reliable? |
| 5: Monitoring | Production evaluation, drift detection, alerting | Is the system still performing? |
The evaluation thread runs through every phase. Phase 0 establishes what you're measuring. Phase 1 creates the baseline every future change is compared against. Phase 2 proves retrieval helps. Phase 3 proves agents help. Phase 5 continues evaluation in production, because the system that works today might not work next month.
Each phase deserves its own deep treatment, and future articles will provide that. The point here is the structure: you build confidence in each layer before stacking the next one on top.
Where to Start
If you're building an AI application right now, start with 20 test cases. Real inputs from your domain, with criteria that define what a good response looks like for each one. Run your first prompt against them. Score the results. That's your baseline, and every decision you make from here gets measured against it.
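A sketch of what those 20 cases can look like on disk; the fields and the example case are illustrative:

```python
# Test cases as a plain JSON file: a real input plus the criteria a good
# response must meet. Field names and content are illustrative.
import json

test_cases = [
    {
        "id": 1,
        "input": "Customer asks if annual plans are refundable after 30 days.",
        "criteria": ["mentions the 30-day cutoff", "states the prorated refund rule"],
    },
    # ... 19 more real cases from your domain
]

with open("test_set.json", "w") as f:
    json.dump(test_cases, f, indent=2)

# Run your first prompt over these, score each output against its
# criteria, and the aggregate is your baseline.
```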
That single step, creating a test set and measuring against it before you've built anything else, separates intentional AI development from the "we'll optimize later" approach. It's not more work. It's the work that makes all the other work count.
Key Resources
The thinking in this article draws on work from people and teams who have earned their credibility. Full credit where it's due:
- Marina Wyss, "How to Build AI Engineering Projects That Get You Interviews" (YouTube). The 8-component project framework that started me down this path. Her insistence that tutorial-style projects don't demonstrate real engineering skill is what pushed me to look for a deeper development methodology.
- Chip Huyen, AI Engineering: Building Applications with Foundation Models (O'Reilly, 2025). The most thorough treatment of evaluation methodology and production AI development I've found.
- Anthropic, "Building Effective Agents" (2024) and "Effective Context Engineering for AI Agents" (2025). The clearest articulation of the simplicity-first principle and composable agent patterns.
- AWS, Strands Agents SDK and documentation. Model-driven agent architecture with observability built in from the start.
- RAGAS (docs.ragas.io). Automated evaluation metrics for RAG systems: faithfulness, relevance, correctness.
- DeepEval (deepeval.com). Pytest-style LLM evaluation that brings software engineering discipline to AI testing.