Why Your LLM Evals Don't Need Fancy Platforms—Just Unit Testing Discipline
The CTO’s expression shifted from curiosity to concern as the AI demo unraveled—confusing features, compounding errors, and a developer’s dread crystallizing into one question: How do we prevent this from happening again? For Cameron Westland, this moment became a catalyst to rethink LLM evaluation entirely. His realization? LLM outputs aren’t mystical—they’re functions. And functions need tests, not proprietary platforms.
The Breaking Point
Westland’s AI had confidently conflated “deep research” with “thematic analysis scans” during a high-stakes presentation. The failure wasn’t just embarrassing; it exposed a dangerous gap: prompt changes shipped based on intuition, not verification. "I was vibe-checking system prompts," he admits. The aftermath forced a paradigm shift: Why treat LLMs differently than any other code?
The Radical Simplification
Rejecting platforms like LangSmith and Helicone, Westland built an eval system using tools already in his stack:
- Vitest for running assertions on LLM outputs
- GitHub Actions for continuous integration
- JSON artifacts for tracking performance metrics
His first test failed immediately, which was no surprise given LLMs’ non-determinism. The breakthrough came from treating outputs as probabilistic outcomes, asserting on properties of the response rather than exact strings and judging success by pass rates across repeated runs:
// Example Vitest case for the "thematic analysis" feature
// (generateResponse and the toContainThemes matcher are project-specific helpers)
test('tags the expected themes', async () => {
  const response = await generateResponse(prompt);
  expect(response).toContainThemes(['sustainability', 'innovation']);
});
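The article doesn’t show how the custom matcher is built; one plausible sketch registers it through Vitest’s expect.extend, with a plain substring check standing in for whatever scoring logic the real matcher uses:
// Hypothetical sketch of the toContainThemes matcher (not Westland's actual code).
// A production version might delegate to an LLM judge instead of substring checks.
import { expect } from 'vitest';

expect.extend({
  toContainThemes(received: string, themes: string[]) {
    const missing = themes.filter(
      (theme) => !received.toLowerCase().includes(theme.toLowerCase())
    );
    return {
      pass: missing.length === 0,
      message: () => `response is missing themes: ${missing.join(', ')}`,
    };
  },
});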
Scaling Visibility
The initial tests lacked visibility: developers complained about debugging failures by digging through raw CI logs. The solution? Automated PR comments with scorecards:
// Sample JSON output per test run
{
  "feature": "theme_generation",
  "accuracy": "92%",
  "hallucination_rate": "3%",
  "comparison_to_main": "-5%"
}
GitHub Actions transformed these into visual PR reports, while stored artifacts enabled historical trend analysis. Catching a regression like the 5% drop in "Theme Generation Quality" before merge is exactly what would have averted the demo disaster.
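The reporting script itself isn’t shown in the article; as one way it could work, a small step can read the JSON results and print a markdown scorecard for the workflow to post as a PR comment (the file path, field format, and aggregation into an array are assumptions based on the sample above):
// Hypothetical reporting step: turn eval results into a markdown scorecard.
// Assumes the run writes an array of per-feature results to eval-results.json.
import { readFileSync } from 'node:fs';

interface EvalResult {
  feature: string;
  accuracy: string;
  hallucination_rate: string;
  comparison_to_main: string;
}

const results: EvalResult[] = JSON.parse(readFileSync('eval-results.json', 'utf8'));

const header = '| Feature | Accuracy | Hallucination rate | vs. main |\n| --- | --- | --- | --- |';
const rows = results.map(
  (r) => `| ${r.feature} | ${r.accuracy} | ${r.hallucination_rate} | ${r.comparison_to_main} |`
);

// The CI workflow can post this output as a PR comment via the GitHub CLI
// or actions/github-script.
console.log([header, ...rows].join('\n'));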
The Minimalist Architecture
Westland’s entire "eval platform" now fits into familiar workflows:
1. LLM-as-Judge: Custom scoring functions (sketched below)
2. Datasets: Fixtures in the repo
3. Versioning: Database-tagged prompts
4. Tracking: GitHub's 30-day artifact retention
"No new dashboards. No vendor lock-in," he emphasizes. The system took days—not months—to build.
Start Small, Scale as Needed
Westland’s advice defies industry hype: Begin with one test.
1. Identify a critical LLM behavior
2. Write a Vitest/Jest case
3. Run it 10 times and measure the success rate (see the sketch below)
4. Integrate into CI
"The infrastructure evolves with your needs," he notes. Specialized platforms might solve edge cases later—but most teams overestimate their requirements.
The Uncomfortable Truth
LLM development suffers from over-engineering bias. As Westland concludes: "You already have the infrastructure. It’s called your codebase." Testing generative AI isn’t a new discipline—it’s an extension of the rigor we apply to calculateTotal(). When outputs become core product logic, they deserve the same scrutiny.
Source: Cameron Westland