From PoC to Production

/images/Bayeux-Tapisserie-de-Bayeux.jpg

Between May and June 2025, I wrote a series of posts on LinkedIn about the reality of designing, evaluating, and scaling multi-agent AI systems. This article is a direct regrouping and rephrasing of those posts, with no added or removed content โ€” only organized for clarity.

  1. Evaluation is your main driver
  2. The real battle begins in Production
  3. Regression Testing and Guardrails: The Hidden Heroes
  4. Evaluation inside the Agentic Loop
  5. Building for Scale and Automation

It’s not new. Even back in the classical ML days, I noticed this pattern:
we stuck to abstract, generic metrics (accuracy, F1-score, etc.) instead of designing domain-specific ones that truly reflect business value.

Today, I’ll go even further:
๐Ÿ‘‰ ๐—•๐˜‚๐—ถ๐—น๐—ฑ ๐—ฐ๐—น๐—ถ๐—ฒ๐—ป๐˜-๐˜€๐—ฝ๐—ฒ๐—ฐ๐—ถ๐—ณ๐—ถ๐—ฐ ๐—บ๐—ฒ๐˜๐—ฟ๐—ถ๐—ฐ๐˜€: metrics that align exactly with your client’s goals and values.

Spoiler: ๐˜๐—ต๐—ฒ๐˜€๐—ฒ ๐—ผ๐—ณ๐˜๐—ฒ๐—ป ๐—ฏ๐—ผ๐—ถ๐—น ๐—ฑ๐—ผ๐˜„๐—ป ๐˜๐—ผ ๐—บ๐—ผ๐—ป๐—ฒ๐˜† ๐Ÿ’ฒ ๐—ฅ๐—ข๐—œ.

metrics meme

Concrete example. ๐—”๐˜€๐—ธ ๐—ฏ๐˜‚๐˜€๐—ถ๐—ป๐—ฒ๐˜€๐˜€ ๐—ผ๐˜„๐—ป๐—ฒ๐—ฟ๐˜€ ๐˜„๐—ต๐—ถ๐—ฐ๐—ต ๐˜€๐—ฐ๐—ฒ๐—ป๐—ฎ๐—ฟ๐—ถ๐—ผ ๐˜๐—ต๐—ฒ๐˜†’๐—ฑ ๐—ฐ๐—ต๐—ผ๐—ผ๐˜€๐—ฒ:
1๏ธโƒฃ A sales AI that hallucinates 10% of the time, but closes 20% of deals
2๏ธโƒฃ A sales AI that hallucinates 1%, but only closes 1%

The answer isn’t obvious right ? it’s not that easy since you do not know the real impact in terme of costs and revenues for each scenario…

As a data scientist, my answer is NO. at least, not in the strict sense of the word.

Here’s why:
Evaluation without a gold standard is just another opinion.
Or if you prefer: it’s more of an ๐—ฎ๐—ป๐—ฎ๐—น๐˜†๐˜€๐—ถ๐˜€ ๐˜๐—ต๐—ฎ๐—ป ๐—ฎ๐—ป ๐—ฒ๐˜ƒ๐—ฎ๐—น๐˜‚๐—ฎ๐˜๐—ถ๐—ผ๐—ป.

Don’t get me wrong! I’m not saying LLM-as-a-judge is useless. It can provide useful signals to improve your system, ๐—ฒ๐˜€๐—ฝ๐—ฒ๐—ฐ๐—ถ๐—ฎ๐—น๐—น๐˜† ๐˜„๐—ต๐—ฒ๐—ป ๐—ต๐˜‚๐—บ๐—ฎ๐—ป ๐—ฒ๐˜ƒ๐—ฎ๐—น๐˜‚๐—ฎ๐˜๐—ถ๐—ผ๐—ป๐˜€ ๐—ฎ๐—ฟ๐—ฒ๐—ป’๐˜ ๐˜€๐—ฐ๐—ฎ๐—น๐—ฎ๐—ฏ๐—น๐—ฒ.

llm judge meme

But calling it evaluation is misleading, because you’re ๐—ป๐—ผ๐˜ ๐—ฐ๐—ผ๐—บ๐—ฝ๐—ฎ๐—ฟ๐—ถ๐—ป๐—ด ๐˜๐—ผ ๐—ฎ ๐˜€๐—ผ๐˜‚๐—ฟ๐—ฐ๐—ฒ ๐—ผ๐—ณ ๐˜๐—ฟ๐˜‚๐˜๐—ต.

As a data scientist, I’ve already seen this kind of confusion, especially when discussing unsupervised learning evaluation.

If you’ve worked in data science, you know:
Evaluation in unsupervised learning is tricky. You often have to ๐˜๐—ถ๐—ฒ ๐—ถ๐˜ ๐—ฏ๐—ฎ๐—ฐ๐—ธ ๐˜๐—ผ ๐—ฎ ๐˜€๐˜‚๐—ฝ๐—ฒ๐—ฟ๐˜ƒ๐—ถ๐˜€๐—ฒ๐—ฑ ๐˜๐—ฎ๐˜€๐—ธ ๐—ผ๐—ฟ ๐—ฎ ๐—ฏ๐˜‚๐˜€๐—ถ๐—ป๐—ฒ๐˜€๐˜€ ๐—ผ๐˜‚๐˜๐—ฐ๐—ผ๐—บ๐—ฒ ๐˜๐—ผ ๐—ฎ๐˜€๐˜€๐—ฒ๐˜€๐˜€ ๐—ถ๐˜๐˜€ ๐˜ƒ๐—ฎ๐—น๐˜‚๐—ฒ.

Same goes for evaluating text generation.

๐—ฆ๐—ผ ๐—ต๐—ผ๐˜„ ๐—ฑ๐—ผ ๐˜„๐—ฒ ๐—ต๐—ฎ๐—ป๐—ฑ๐—น๐—ฒ ๐˜๐—ต๐—ถ๐˜€ ๐—ฝ๐—ฟ๐—ผ๐—ฝ๐—ฒ๐—ฟ๐—น๐˜†?
โœ… Define expected outputs at the data point level so you can apply real metrics.
โœ… Don’t ask the LLM to reason, ask it to compare your system’s output against a reference.
โœ… Let them help you understand why an output deviated, not just whether it’s good or bad.
โœ… Instead of relying on end-to-end judgment, look at what each agent or tool is doing in your system.

How do we break it down into multiple steps, and what’s the most critical part?

First, understand that ๐—ฅ๐—”๐—š ๐—ถ๐˜€๐—ป'๐˜ ๐—ท๐˜‚๐˜€๐˜ ๐—ฎ ๐˜๐—ฒ๐˜…๐˜-๐—ฏ๐—ฎ๐˜€๐—ฒ๐—ฑ ๐˜€๐˜†๐˜€๐˜๐—ฒ๐—บ. It can involve various data types and formats integrated into the LLM’s context.
RAG can cover API calls, SQL databases, web searches, and of course vector databases.

To control your agent’s output quality, you must ensure quality at every process step.

๐—œ๐—ป ๐˜๐—ต๐—ฒ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜ ๐˜€๐—ฝ๐—ฎ๐—ฐ๐—ฒ, ๐—ฟ๐—ฒ๐˜๐—ฟ๐—ถ๐—ฒ๐˜ƒ๐—ฎ๐—น ๐—ถ๐˜€ ๐—ฐ๐—ฟ๐—ถ๐˜๐—ถ๐—ฐ๐—ฎ๐—น, because it involves multiple sub-steps:

  • The agent builds the query itself. You need to assess whether it’s doing this correctly. ex: not misusing database tools or accessing unauthorized data.
  • Evaluate the real targets (entities, lines, documents, or chunks) that should be retrieved and used as context. Then compute metrics like false positives and false negatives.
  • Problems can also come from your knowledge base. You need clean, curated data, especially during the building and testing phases.
  • And ensure you’re building efficient context retrieval! If the data you’re feeding the LLM is too big (and isn’t focused on what needs to be shown to the user) you’ll lose your LLM’s attention, even with models that support large contexts.

๐—ง๐—ต๐—ฒ๐—ป, ๐—ฎ๐˜€๐˜€๐—ฒ๐˜€๐˜€ ๐˜๐—ต๐—ฒ ๐—ด๐—ฒ๐—ป๐—ฒ๐—ฟ๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—ฝ๐—ฎ๐—ฟ๐˜. 3 key points to evaluate:

  • ๐—›๐—ฎ๐—น๐—น๐˜‚๐—ฐ๐—ถ๐—ป๐—ฎ๐˜๐—ถ๐—ผ๐—ป and Non-Based Information: Is the output strictly based on your RAG context? Errors here could come from model quality, prompt design, or lack of guidance.
  • ๐—ฅ๐—ฒ๐—น๐—ฒ๐˜ƒ๐—ฎ๐—ป๐—ฐ๐˜†: Does the generation highlight the right parts of the context? Is it structured as expected, per prompt policy or fine-tuning?
  • ๐—˜๐˜…๐—ต๐—ฎ๐˜‚๐˜€๐˜๐—ถ๐˜ƒ๐—ถ๐˜๐˜†: Is the generated text exhaustive? Be cautious, some filters in the prompt might deliberately exclude parts of the retrieved data.
llm judge meme

๐—ฅ๐—”๐—š ๐—ถ๐˜€ ๐˜๐—ต๐—ฒ ๐—บ๐—ผ๐˜€๐˜ ๐—ถ๐—บ๐—ฝ๐—ผ๐—ฟ๐˜๐—ฎ๐—ป๐˜ ๐—ฝ๐—ฎ๐—ฟ๐˜ ๐˜๐—ผ ๐—ฒ๐˜ƒ๐—ฎ๐—น๐˜‚๐—ฎ๐˜๐—ฒ ๐—ถ๐—ป ๐˜†๐—ผ๐˜‚๐—ฟ ๐˜€๐˜†๐˜€๐˜๐—ฒ๐—บ. It’s where most issues originate. Break it down and evaluate each part carefully.
โœ… Study your errors and keep track of them in neat and well-defined automated tests.
โœ… Build a robust test framework with a diverse query set and reference answers.
โœ… Use metrics like precision, recall, and context relevance for retrieval.
โœ… For generation, focus on faithfulness and answer relevance.
โœ… Regularly update your evaluation datasets based on real-world usage.
โœ… Use LLM-based evaluators to scale your assessment efficiently.


metrics meme

Congratulations! ๐ŸŽŠ

You’ve shipped your wonderful AI platform to prod, and it’s working pretty well for a first version.

And now comes the hard part: ๐—ฌ๐—ผ๐˜‚ ๐—ป๐—ฒ๐—ฒ๐—ฑ ๐˜๐—ผ ๐—ฎ๐—ฑ๐—ฑ ๐—ฎ๐—น๐—น ๐˜๐—ต๐—ฒ ๐—ณ๐—ฒ๐—ฎ๐˜๐˜‚๐—ฟ๐—ฒ๐˜€ ๐˜๐—ต๐—ฎ๐˜ ๐˜„๐—ฒ๐—ฟ๐—ฒ๐—ป’๐˜ ๐—ถ๐—ป ๐˜†๐—ผ๐˜‚๐—ฟ ๐— ๐—ฉ๐—ฃ, the ones that are crucial to keeping your AI services ahead of the competition.

And guess what? It’s not going to be a small iteration.

๐—ง๐—ต๐—ฒ๐˜€๐—ฒ ๐—ป๐—ฒ๐˜„ ๐—ณ๐—ฒ๐—ฎ๐˜๐˜‚๐—ฟ๐—ฒ๐˜€ ๐˜„๐—ถ๐—น๐—น ๐—น๐—ถ๐—ธ๐—ฒ๐—น๐˜† ๐—ฟ๐—ฒ๐—พ๐˜‚๐—ถ๐—ฟ๐—ฒ ๐—ฎ ๐˜€๐˜๐—ฟ๐˜‚๐—ฐ๐˜๐˜‚๐—ฟ๐—ฎ๐—น ๐—ฟ๐—ฒ๐˜ƒ๐—ถ๐—ฒ๐˜„ of your agent orchestration, and even a rethink of your entire multi-agent system architecture (Yes, there are multiple architectures with new ones introduced every week.)

๐—ช๐—ฒ ๐—ฐ๐—ฎ๐—ป ๐˜€๐—ฎ๐˜† ๐—ถ๐˜:
๐—ฌ๐—ผ๐˜‚๐—ฟ ๐—ฝ๐—ฟ๐—ผ๐—ฑ๐˜‚๐—ฐ๐˜๐—ถ๐—ผ๐—ป ๐—ถ๐˜€ ๐—น๐—ฒ๐—ด๐—ฎ๐—ฐ๐˜† ๐—ถ๐—ป ๐—น๐—ฒ๐˜€๐˜€ ๐˜๐—ต๐—ฎ๐—ป 3 ๐—บ๐—ผ๐—ป๐˜๐—ต๐˜€. ๐Ÿ˜…

This isn’t entirely newโ€ฆ
But two things are probably different from your usual tech stack:
1- ๐—Ÿ๐—ฒ๐—ด๐—ฎ๐—ฐ๐˜† ๐—ต๐—ฎ๐—ฝ๐—ฝ๐—ฒ๐—ป๐˜€ ๐—ณ๐—ฎ๐˜€๐˜ ๐—ถ๐—ป ๐—”๐—œ.
Some projects become legacy before even hitting production.
We’ve experienced this ourselves. When a framework we were using got a major upgrade, we had no choice but to migrate if we wanted to keep building new features.
2- ๐— ๐—ถ๐—ด๐—ฟ๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—ถ๐˜€ ๐—ฟ๐—ถ๐˜€๐—ธ๐—ถ๐—ฒ๐—ฟ.
You have less control over your AI platform than with a deterministic system.
Which means even small changes can lead to unexpected behaviors.

๐—”๐—ป๐—ฑ ๐—ต๐—ฒ๐—ฟ๐—ฒ ๐—ฐ๐—ผ๐—บ๐—ฒ๐˜€ ๐—ฟ๐—ฒ๐—ด๐—ฟ๐—ฒ๐˜€๐˜€๐—ถ๐—ผ๐—ป ๐˜๐—ฒ๐˜€๐˜๐—ถ๐—ป๐—ด.
Regression tests are your best ally to make sure refactors or architecture changes don’t break critical production features.
โœ… Create and maintain dedicated regression scenarios.
โœ… Define metrics that qualify outputs: since you’re working with AI, pass/fail isn’t enough.
โœ… Automate these tests or every release ๐˜„๐—ถ๐—น๐—น ๐˜๐˜‚๐—ฟ๐—ป ๐—ถ๐—ป๐˜๐—ผ ๐—ฑ๐—ฎ๐˜†๐˜€ ๐—ผ๐—ณ ๐—บ๐—ฎ๐—ป๐˜‚๐—ฎ๐—น ๐—ค๐—” ๐—ฎ๐—ป๐—ฑ ๐—ฟ๐—ฒ๐˜๐—ฒ๐˜€๐˜๐—ถ๐—ป๐—ด.

If you’re a startup moving fast in AI with users already rely on your first product:

๐—ถ๐—ป๐˜ƒ๐—ฒ๐˜€๐˜ ๐—ถ๐—ป ๐—ฟ๐—ฒ๐—ด๐—ฟ๐—ฒ๐˜€๐˜€๐—ถ๐—ผ๐—ป ๐˜๐—ฒ๐˜€๐˜๐—ถ๐—ป๐—ด ๐—ป๐—ผ๐˜„!

Otherwise, your roadmap will slow down before it even picks up speed.

It’s far more correlated with the number of real cases processed in production.

Why?

Because evaluation isn’t meaningful until your system has faced a wide range of inputs… and even then, some edge cases won’t appear until you’re operating at scale.

Here’s what we’ve observed again and again:

โœ… Complex agent tasks can perform flawlessly in dev, but start breaking down under production load or when inputs deviate even slightly from the original distribution.
โœ… Free text input is unpredictable. No matter how creative your dev team is, users will always surprise you.
โœ… It’s not just LLMs : rate limits, data quality issues, and fragile pipelines will all show up once the volume kicks in.
โœ… Even your data pipelines need to be stress tested at scale. You’ll spot performance cliffs and resilience gaps only under pressure.

Scaling to production gives your team real signals:

  • What breaks?
  • Where is the system fragile?
  • What improvements would make the most users happy?

With enough volume, your team gets the feedback it needs to iterate faster and with higher confidence.

Scaling isn’t a nice-to-have.
It’s the prerequisite for building reliable and evolving AI systems.

After months running AI systems 24/7, one thing is clear:
๐—ฃ๐—ฟ๐—ผ๐—ฑ๐˜‚๐—ฐ๐˜๐—ถ๐—ผ๐—ป ๐—ถ๐˜€ ๐˜„๐—ต๐—ฒ๐—ฟ๐—ฒ ๐˜†๐—ผ๐˜‚ ๐—ฏ๐˜‚๐—ถ๐—น๐—ฑ ๐—ฟ๐—ฒ๐—ฎ๐—น ๐—ถ๐—ป๐˜๐˜‚๐—ถ๐˜๐—ถ๐—ผ๐—ป ๐—ณ๐—ผ๐—ฟ ๐˜†๐—ผ๐˜‚๐—ฟ ๐—ป๐—ฒ๐˜…๐˜ ๐˜ƒ๐—ฒ๐—ฟ๐˜€๐—ถ๐—ผ๐—ป.

metrics meme

Evaluation before prod is critical, but watching real user interactions is what sharpens your instincts and helps prioritize what matters.

Here’s how we ๐˜๐˜‚๐—ฟ๐—ป ๐—ฝ๐—ฟ๐—ผ๐—ฑ๐˜‚๐—ฐ๐˜๐—ถ๐—ผ๐—ป ๐—บ๐—ผ๐—ป๐—ถ๐˜๐—ผ๐—ฟ๐—ถ๐—ป๐—ด ๐—ถ๐—ป๐˜๐—ผ ๐—ฎ ๐˜€๐˜‚๐—ฝ๐—ฒ๐—ฟ๐—ฝ๐—ผ๐˜„๐—ฒ๐—ฟ:

๐Ÿ” Make it a team-wide habit: Everyone in the team should get ๐—ฎ ๐˜๐—ฎ๐˜€๐˜๐—ฒ ๐—ผ๐—ณ ๐—ฟ๐—ฒ๐—ฎ๐—น ๐—ฝ๐—ฟ๐—ผ๐—ฑ๐˜‚๐—ฐ๐˜๐—ถ๐—ผ๐—ป ๐—ฑ๐—ฎ๐˜๐—ฎ. It fuels both creativity and clarity.

๐Ÿ“ˆ Use it to tune priorities: Support helps reveal what truly matters for the business and what’s just “nice to have.”
๐—ช๐—ต๐—ฒ๐—ป ๐—ฏ๐˜‚๐—ถ๐—น๐—ฑ๐—ถ๐—ป๐—ด ๐—ณ๐—ฎ๐˜€๐˜, ๐˜†๐—ผ๐˜‚ ๐˜„๐—ฎ๐—ป๐˜ ๐˜๐—ผ ๐—ฑ๐—ฒ๐—น๐—ถ๐˜ƒ๐—ฒ๐—ฟ ๐˜„๐—ต๐—ฎ๐˜’๐˜€ ๐—ฒ๐˜€๐˜€๐—ฒ๐—ป๐˜๐—ถ๐—ฎ๐—น, and hold off on the rest, especially since priorities often shift between project and production phases.

๐Ÿ’ก Help your team develop good trade-offs: Not every issue needs an immediate fix.
But sometimes a ๐—พ๐˜‚๐—ถ๐—ฐ๐—ธ ๐—ฝ๐—ฎ๐˜๐—ฐ๐—ต ๐—ฐ๐—ฎ๐—ป ๐—ฎ๐˜ƒ๐—ผ๐—ถ๐—ฑ ๐˜‚๐˜€๐—ฒ๐—ฟ ๐—ณ๐—ฟ๐˜‚๐˜€๐˜๐—ฟ๐—ฎ๐˜๐—ถ๐—ผ๐—ป, even if it’s not perfect.

๐Ÿ“Š Quantify issues: Log incident frequency and business impact. This data will sharpen priorities and guide improvements, ๐—ฒ๐˜€๐—ฝ๐—ฒ๐—ฐ๐—ถ๐—ฎ๐—น๐—น๐˜† ๐—ณ๐—ผ๐—ฟ ๐˜†๐—ผ๐˜‚๐—ฟ ๐—ด๐˜‚๐—ฎ๐—ฟ๐—ฑ๐—ฟ๐—ฎ๐—ถ๐—น, the most critical component of your system.


๐—™๐—ถ๐—ฟ๐˜€๐˜ ๐—น๐—ถ๐—ป๐—ฒ (keeps the chaos out) when it’s the input guardrail:

  • helps you catch out of scope/unexpected inputs (things your workflows weren’t designed to handle). With these kinds of input, you’re almost guaranteed to get unreliable, unpredictable results.
  • It’s also your security guard: detecting jailbreaks, prompt injections, and malicious inputs.

๐—Ÿ๐—ฎ๐˜€๐˜ ๐—น๐—ถ๐—ป๐—ฒ (keeps the chaos in) when it’s the output guardrail:

  • It helps drastically reduce hallucinations.
  • It can flag out-of-scope topics that slipped past the input guardrail.
  • It should also enforce brand moderation, keeping responses aligned with company image and policies.

Think of building a ๐—ฟ๐—ฒ๐—น๐—ถ๐—ฎ๐—ฏ๐—น๐—ฒ ๐—ด๐˜‚๐—ฎ๐—ฟ๐—ฑ๐—ฟ๐—ฎ๐—ถ๐—น as a hard ๐—ฐ๐—น๐—ฎ๐˜€๐˜€๐—ถ๐—ณ๐—ถ๐—ฐ๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—ฝ๐—ฟ๐—ผ๐—ฏ๐—น๐—ฒ๐—บ. Like any classification task, you’re tuning the threshold between false positives (blocking good inputs) and false negatives (letting bad ones through).

๐—ง๐—ต๐—ฒ ๐—ฐ๐—ต๐—ฎ๐—น๐—น๐—ฒ๐—ป๐—ด๐—ฒ?
Your input and output is human language: open-ended, messy, context-dependent. The distribution is huge, and the ambiguity is real. There’s no perfect line.

๐—›๐—ฒ๐—ฟ๐—ฒ ๐—ฎ๐—ฟ๐—ฒ ๐˜€๐—ผ๐—บ๐—ฒ ๐˜๐—ถ๐—ฝ๐˜€ ๐—ถ๐—ณ ๐˜†๐—ผ๐˜‚’๐—ฟ๐—ฒ ๐˜๐—ฎ๐—ฐ๐—ธ๐—น๐—ถ๐—ป๐—ด ๐˜๐—ต๐—ถ๐˜€ ๐—ฝ๐—ฟ๐—ผ๐—ฏ๐—น๐—ฒ๐—บ:
โœ… Define clear and concise policies: Set strong guidelines for how your guardrail should behave.
โœ… Inject necessary information: Feed the guardrail only with data needed to make the right decision.
โœ… Define good and bad examples: Help it learn what should be let through - and what shouldn’t.
โœ… Test different LLM models: Your guardrail is your most important agent. Pay the necessary price to have best quality.
โœ… Evaluate performance: Use metrics like accuracy, recall, and precision. Better yet, build custom metrics tied to business impact.

metrics meme

๐—ฆ๐—ผ๐—น๐—ถ๐—ฑ ๐—ด๐˜‚๐—ฎ๐—ฟ๐—ฑ๐—ฟ๐—ฎ๐—ถ๐—น๐˜€ ๐—ฎ๐—ฟ๐—ฒ ๐˜„๐—ต๐—ฎ๐˜ ๐˜€๐—ฒ๐—ฝ๐—ฎ๐—ฟ๐—ฎ๐˜๐—ฒ ๐˜ƒ๐—ฎ๐—น๐˜‚๐—ฎ๐—ฏ๐—น๐—ฒ ๐—”๐—œ ๐˜€๐˜†๐˜€๐˜๐—ฒ๐—บ๐˜€ ๐—ณ๐—ฟ๐—ผ๐—บ ๐—ฐ๐—ต๐—ฎ๐—ผ๐˜๐—ถ๐—ฐ ๐—ผ๐—ป๐—ฒ๐˜€.


When building an agentic system, it’s important to have data points ๐˜๐—ต๐—ฎ๐˜ ๐—ฐ๐—ฎ๐—ป ๐—ฏ๐—ฒ ๐—ฐ๐—ต๐—ฒ๐—ฐ๐—ธ๐—ฒ๐—ฑ ๐—ฎ๐˜ ๐˜๐—ต๐—ฒ ๐—ต๐—ฒ๐—ฎ๐—ฟ๐˜ ๐—ผ๐—ณ ๐˜†๐—ผ๐˜‚๐—ฟ ๐—ฝ๐—ฟ๐—ผ๐—ฐ๐—ฒ๐˜€๐˜€ ๐˜๐—ผ ๐—ฒ๐—ป๐˜€๐˜‚๐—ฟ๐—ฒ ๐˜†๐—ผ๐˜‚’๐—ฟ๐—ฒ ๐—ผ๐—ป ๐˜๐—ต๐—ฒ ๐—ฟ๐—ถ๐—ด๐—ต๐˜ ๐—ฝ๐—ฎ๐˜๐—ต.

Think of these as intermediary milestones, deliverables you’re asking your multi-agent system to produce on its way to the ultimate objective.

These checkpoints serve a dual purpose:

  • ๐—ง๐—ฎ๐—ป๐—ด๐—ถ๐—ฏ๐—น๐—ฒ ๐— ๐—ฒ๐—ฎ๐˜€๐˜‚๐—ฟ๐—ฒ๐—บ๐—ฒ๐—ป๐˜๐˜€: They provide clear metrics that simplify system monitoring and facilitate the evaluation of each agent’s performance.
  • ๐—ฃ๐—ฟ๐—ผ๐—ฐ๐—ฒ๐˜€๐˜€ ๐—œ๐—ป๐˜€๐—ถ๐—ด๐—ต๐˜๐˜€: They offer valuable data that can enrich your understanding of the automated process and even serve as a source of structured knowledge about your users.
metrics meme

Moreover, these data points should be used to establish hard controls, dictating actions that agents must or must not take based on the extracted information.

Additionally, this metadata functions as a form of memory. Ensuring that agents get this information at critical steps of the process and that it’s never lost.

๐—ช๐—ฎ๐—ป๐˜ ๐—บ๐—ผ๐—ฟ๐—ฒ ๐—ฟ๐—ฒ๐—น๐—ถ๐—ฎ๐—ฏ๐—น๐—ฒ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜๐˜€?
๐—ฆ๐˜๐—ฎ๐—ฟ๐˜ ๐—ฑ๐—ฒ๐˜€๐—ถ๐—ด๐—ป๐—ถ๐—ป๐—ด ๐˜„๐—ถ๐˜๐—ต ๐˜๐—ต๐—ฒ๐˜€๐—ฒ ๐—ถ๐—ป๐˜๐—ฒ๐—ฟ๐—ป๐—ฎ๐—น ๐—ฐ๐—ต๐—ฒ๐—ฐ๐—ธ๐—ฝ๐—ผ๐—ถ๐—ป๐˜๐˜€ ๐—ถ๐—ป ๐—บ๐—ถ๐—ป๐—ฑ.


If you want to bring cutting-edge AI systems to production, you’ll need to build an R&D mindset into your team and invest in experimentation.

The challenge?
Keep experimentation ๐—ณ๐—ฎ๐˜€๐˜ ๐—ฎ๐—ป๐—ฑ ๐—ฎ๐—ณ๐—ณ๐—ผ๐—ฟ๐—ฑ๐—ฎ๐—ฏ๐—น๐—ฒ.
You want feedback in ๐—ฑ๐—ฎ๐˜†๐˜€, ๐—ป๐—ผ๐˜ ๐˜„๐—ฒ๐—ฒ๐—ธ๐˜€.

And just like any expert team, to move fast you need the right tooling.
As your production matures and your agent tasks become more complex, your tools need to scale with you.

At NORMA, when we started building multi-agent systems, we hit a clear bottleneck: Maintaining constant quality was slowing down our shipping velocity.
So, we built something.

๐—” ๐˜๐—ผ๐—ผ๐—น ๐˜๐—ผ ๐˜๐—ฒ๐˜€๐˜ ๐—”๐—œ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜๐˜€ ๐—ฒ๐—ป๐—ฑ ๐˜๐—ผ ๐—ฒ๐—ป๐—ฑ, ๐—ฎ๐˜ ๐˜€๐—ฐ๐—ฎ๐—น๐—ฒ.
It started as an internal utility.
Today, it’s a full platform for teams who need to control agent quality over time and evaluate changes across versions efficiently.

๐— ๐—ฎ๐—ธ๐—ฒ ๐˜๐—ฒ๐˜€๐˜๐—ถ๐—ป๐—ด ๐—ฒ๐—ฎ๐˜€๐˜† ๐—ณ๐—ผ๐—ฟ ๐˜†๐—ผ๐˜‚๐—ฟ ๐˜๐—ฒ๐—ฎ๐—บ!

evaluation meme

๐—ช๐—ต๐˜† ๐˜๐—ฒ๐˜€๐˜๐—ถ๐—ป๐—ด ๐—ถ๐˜€ ๐—ต๐—ฎ๐—ฟ๐—ฑ๐—ฒ๐—ฟ ๐—ถ๐—ป ๐—”๐—œ ๐—ฝ๐—น๐—ฎ๐˜๐—ณ๐—ผ๐—ฟ๐—บ:

  • Your input is challenging to control and fully cover.
  • LLMs are everywhere in your pipeline, each bringing potential variance to your process.
  • Many steps rely on structured outputs from LLMs, which aren’t always easy to obtain systematically.
  • Agents may access external sources you don’t fully control.
  • Outputs vary widely: generated text, finite decision spaces, data extraction…
  • AI frameworks themselves are still evolving quickly and might introduce instabilities or deprecate parts of your code overnight.

For all these reasons, developers often struggle to build new features while maintaining high production quality and stability.

๐——๐—ผ๐—ป’๐˜ ๐—ผ๐˜ƒ๐—ฒ๐—ฟ-๐—ฝ๐—น๐—ฎ๐—ป, ๐˜€๐˜๐—ฎ๐—ฟ๐˜ ๐—ฎ๐˜‚๐˜๐—ผ๐—บ๐—ฎ๐˜๐—ถ๐—ป๐—ด ๐˜๐—ฒ๐˜€๐˜๐˜€ ๐—ป๐—ผ๐˜„ ๐˜๐—ผ ๐—ฟ๐—ฒ๐—น๐—ถ๐—ฒ๐˜ƒ๐—ฒ ๐˜†๐—ผ๐˜‚๐—ฟ ๐—ฑ๐—ฒ๐˜ƒ ๐˜๐—ฒ๐—ฎ๐—บ.

Anything that reduces developer workload is worth pursuing. For example:

  • Begin with a simple script that automatically runs tests and saves results to an accessible folder.
  • Execute these tests in parallel to accelerate feedback.
  • Provide a clear dashboard or interface for easily inspecting results.
  • Encourage your team to establish metrics to continuously monitor overall quality.

Start by measuring outputs globally, then prioritize evaluating the most critical sub-steps, usually those handling data manipulation.
After several iterations, your developers will become comfortable with evaluation processes, naturally integrating these tests to ensure high-quality features. At NORMA, weโ€™ve even automated evaluation directly into our CI/CD pipeline for every PR ๐Ÿ”ฅ

Check our quik demo here: