Why Your LLM Evals Don't Need Fancy Platforms—Just Unit Testing Discipline
The CTO’s expression shifted from curiosity to concern as the AI demo unraveled—confusing features, compounding errors, and a developer’s dread crystallizing into one question: How do we prevent this from happening again? For Cameron Westland, this moment became a catalyst to rethink LLM evaluation entirely. His realization? LLM outputs aren’t mystical—they’re functions. And functions need tests, not proprietary platforms.
The Breaking Point
Westland’s AI had confidently conflated “deep research” with “thematic analysis scans” during a high-stakes presentation. The failure wasn’t just embarrassing; it exposed a dangerous gap: prompt changes shipped based on intuition, not verification. "I was vibe-checking system prompts," he admits. The aftermath forced a paradigm shift: Why treat LLMs differently than any other code?
The Radical Simplification
Rejecting platforms like LangSmith and Helicone, Westland built an eval system using tools already in his stack:
- Vitest for running assertions on LLM outputs
- GitHub Actions for continuous integration
- JSON artifacts for tracking performance metrics
His first test failed immediately, which was no surprise given LLMs’ non-determinism. The breakthrough came from treating outputs as probabilistic outcomes, asserting on properties of the response rather than exact strings and judging success by pass rates across repeated runs:
// Example Vitest case for the "thematic analysis" feature
// (generateResponse and the toContainThemes matcher are project-specific helpers)
test('tags the expected themes', async () => {
  const response = await generateResponse(prompt);
  expect(response).toContainThemes(['sustainability', 'innovation']);
});
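The article doesn’t show how the custom matcher is built; one plausible sketch registers it through Vitest’s expect.extend, with a plain substring check standing in for whatever scoring logic the real matcher uses:
// Hypothetical sketch of the toContainThemes matcher (not Westland's actual code).
// A production version might delegate to an LLM judge instead of substring checks.
import { expect } from 'vitest';

expect.extend({
  toContainThemes(received: string, themes: string[]) {
    const missing = themes.filter(
      (theme) => !received.toLowerCase().includes(theme.toLowerCase())
    );
    return {
      pass: missing.length === 0,
      message: () => `response is missing themes: ${missing.join(', ')}`,
    };
  },
});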
Scaling Visibility
The initial tests lacked visibility: developers complained about debugging failures by digging through raw CI logs. The solution? Automated PR comments with scorecards:
// Sample JSON output per test run
{
  "feature": "theme_generation",
  "accuracy": "92%",
  "hallucination_rate": "3%",
  "comparison_to_main": "-5%"
}
GitHub Actions transformed these into visual PR reports, while stored artifacts enabled historical trend analysis. Catching a regression like the 5% drop in "Theme Generation Quality" before merge is exactly what would have averted the demo disaster.
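The reporting script itself isn’t shown in the article; as one way it could work, a small step can read the JSON results and print a markdown scorecard for the workflow to post as a PR comment (the file path, field format, and aggregation into an array are assumptions based on the sample above):
// Hypothetical reporting step: turn eval results into a markdown scorecard.
// Assumes the run writes an array of per-feature results to eval-results.json.
import { readFileSync } from 'node:fs';

interface EvalResult {
  feature: string;
  accuracy: string;
  hallucination_rate: string;
  comparison_to_main: string;
}

const results: EvalResult[] = JSON.parse(readFileSync('eval-results.json', 'utf8'));

const header = '| Feature | Accuracy | Hallucination rate | vs. main |\n| --- | --- | --- | --- |';
const rows = results.map(
  (r) => `| ${r.feature} | ${r.accuracy} | ${r.hallucination_rate} | ${r.comparison_to_main} |`
);

// The CI workflow can post this output as a PR comment via the GitHub CLI
// or actions/github-script.
console.log([header, ...rows].join('\n'));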
The Minimalist Architecture
Westland’s entire "eval platform" now fits into familiar workflows:
1. LLM-as-Judge: Custom scoring functions (sketched below)
2. Datasets: Fixtures in the repo
3. Versioning: Database-tagged prompts
4. Tracking: GitHub's 30-day artifact retention
"No new dashboards. No vendor lock-in," he emphasizes. The system took days—not months—to build.
Start Small, Scale as Needed
Westland’s advice defies industry hype: Begin with one test.
1. Identify a critical LLM behavior
2. Write a Vitest/Jest case
3. Run it 10 times and measure the success rate (see the sketch below)
4. Integrate into CI
"The infrastructure evolves with your needs," he notes. Specialized platforms might solve edge cases later—but most teams overestimate their requirements.
The Uncomfortable Truth
LLM development suffers from over-engineering bias. As Westland concludes: "You already have the infrastructure. It’s called your codebase." Testing generative AI isn’t a new discipline—it’s an extension of the rigor we apply to calculateTotal(). When outputs become core product logic, they deserve the same scrutiny.
Source: Cameron Westland