21. June 2025
Between May and June 2025, I wrote a series of posts on LinkedIn about the reality of designing, evaluating, and scaling multi-agent AI systems. This article is a direct regrouping and rephrasing of those posts, with no added or removed content โ only organized for clarity.
It’s not new. Even back in the classical ML days, I noticed this pattern:
we stuck to abstract, generic metrics (accuracy, F1-score, etc.) instead of designing domain-specific ones that truly reflect business value.
Today, I’ll go even further:
๐ ๐๐๐ถ๐น๐ฑ ๐ฐ๐น๐ถ๐ฒ๐ป๐-๐๐ฝ๐ฒ๐ฐ๐ถ๐ณ๐ถ๐ฐ ๐บ๐ฒ๐๐ฟ๐ถ๐ฐ๐: metrics that align exactly with your client’s goals and values.
Spoiler: ๐๐ต๐ฒ๐๐ฒ ๐ผ๐ณ๐๐ฒ๐ป ๐ฏ๐ผ๐ถ๐น ๐ฑ๐ผ๐๐ป ๐๐ผ ๐บ๐ผ๐ป๐ฒ๐ ๐ฒ ๐ฅ๐ข๐
.
Concrete example. ๐๐๐ธ ๐ฏ๐๐๐ถ๐ป๐ฒ๐๐ ๐ผ๐๐ป๐ฒ๐ฟ๐ ๐๐ต๐ถ๐ฐ๐ต ๐๐ฐ๐ฒ๐ป๐ฎ๐ฟ๐ถ๐ผ ๐๐ต๐ฒ๐’๐ฑ ๐ฐ๐ต๐ผ๐ผ๐๐ฒ:
1๏ธโฃ A sales AI that hallucinates 10% of the time, but closes 20% of deals
2๏ธโฃ A sales AI that hallucinates 1%, but only closes 1%
The answer isn’t obvious right ? it’s not that easy since you do not know the real impact in terme of costs and revenues for each scenario…
As a data scientist, my answer is NO. at least, not in the strict sense of the word.
Here’s why:
Evaluation without a gold standard is just another opinion.
Or if you prefer: it’s more of an ๐ฎ๐ป๐ฎ๐น๐๐๐ถ๐ ๐๐ต๐ฎ๐ป ๐ฎ๐ป ๐ฒ๐๐ฎ๐น๐๐ฎ๐๐ถ๐ผ๐ป.
Don’t get me wrong! I’m not saying LLM-as-a-judge is useless. It can provide useful signals to improve your system, ๐ฒ๐๐ฝ๐ฒ๐ฐ๐ถ๐ฎ๐น๐น๐ ๐๐ต๐ฒ๐ป ๐ต๐๐บ๐ฎ๐ป ๐ฒ๐๐ฎ๐น๐๐ฎ๐๐ถ๐ผ๐ป๐ ๐ฎ๐ฟ๐ฒ๐ป’๐ ๐๐ฐ๐ฎ๐น๐ฎ๐ฏ๐น๐ฒ.
But calling it evaluation is misleading, because you’re ๐ป๐ผ๐ ๐ฐ๐ผ๐บ๐ฝ๐ฎ๐ฟ๐ถ๐ป๐ด ๐๐ผ ๐ฎ ๐๐ผ๐๐ฟ๐ฐ๐ฒ ๐ผ๐ณ ๐๐ฟ๐๐๐ต.
As a data scientist, I’ve already seen this kind of confusion, especially when discussing unsupervised learning evaluation.
If you’ve worked in data science, you know:
Evaluation in unsupervised learning is tricky. You often have to ๐๐ถ๐ฒ ๐ถ๐ ๐ฏ๐ฎ๐ฐ๐ธ ๐๐ผ ๐ฎ ๐๐๐ฝ๐ฒ๐ฟ๐๐ถ๐๐ฒ๐ฑ ๐๐ฎ๐๐ธ ๐ผ๐ฟ ๐ฎ ๐ฏ๐๐๐ถ๐ป๐ฒ๐๐ ๐ผ๐๐๐ฐ๐ผ๐บ๐ฒ ๐๐ผ ๐ฎ๐๐๐ฒ๐๐ ๐ถ๐๐ ๐๐ฎ๐น๐๐ฒ.
Same goes for evaluating text generation.
๐ฆ๐ผ ๐ต๐ผ๐ ๐ฑ๐ผ ๐๐ฒ ๐ต๐ฎ๐ป๐ฑ๐น๐ฒ ๐๐ต๐ถ๐ ๐ฝ๐ฟ๐ผ๐ฝ๐ฒ๐ฟ๐น๐?
โ
Define expected outputs at the data point level so you can apply real metrics.
โ
Don’t ask the LLM to reason, ask it to compare your system’s output against a reference.
โ
Let them help you understand why an output deviated, not just whether it’s good or bad.
โ
Instead of relying on end-to-end judgment, look at what each agent or tool is doing in your system.
How do we break it down into multiple steps, and what’s the most critical part?
First, understand that ๐ฅ๐๐ ๐ถ๐๐ป'๐ ๐ท๐๐๐ ๐ฎ ๐๐ฒ๐
๐-๐ฏ๐ฎ๐๐ฒ๐ฑ ๐๐๐๐๐ฒ๐บ
. It can involve various data types and formats integrated into the LLM’s context.
RAG can cover API calls, SQL databases, web searches, and of course vector databases.
To control your agent’s output quality, you must ensure quality at every process step.
๐๐ป ๐๐ต๐ฒ ๐ฎ๐ด๐ฒ๐ป๐ ๐๐ฝ๐ฎ๐ฐ๐ฒ, ๐ฟ๐ฒ๐๐ฟ๐ถ๐ฒ๐๐ฎ๐น ๐ถ๐ ๐ฐ๐ฟ๐ถ๐๐ถ๐ฐ๐ฎ๐น, because it involves multiple sub-steps:
๐ง๐ต๐ฒ๐ป, ๐ฎ๐๐๐ฒ๐๐ ๐๐ต๐ฒ ๐ด๐ฒ๐ป๐ฒ๐ฟ๐ฎ๐๐ถ๐ผ๐ป ๐ฝ๐ฎ๐ฟ๐. 3 key points to evaluate:
๐ฅ๐๐ ๐ถ๐ ๐๐ต๐ฒ ๐บ๐ผ๐๐ ๐ถ๐บ๐ฝ๐ผ๐ฟ๐๐ฎ๐ป๐ ๐ฝ๐ฎ๐ฟ๐ ๐๐ผ ๐ฒ๐๐ฎ๐น๐๐ฎ๐๐ฒ ๐ถ๐ป ๐๐ผ๐๐ฟ ๐๐๐๐๐ฒ๐บ. It’s where most issues originate. Break it down and evaluate each part carefully.
โ
Study your errors and keep track of them in neat and well-defined automated tests.
โ
Build a robust test framework with a diverse query set and reference answers.
โ
Use metrics like precision, recall, and context relevance for retrieval.
โ
For generation, focus on faithfulness and answer relevance.
โ
Regularly update your evaluation datasets based on real-world usage.
โ
Use LLM-based evaluators to scale your assessment efficiently.
Congratulations! ๐
You’ve shipped your wonderful AI platform to prod, and it’s working pretty well for a first version.
And now comes the hard part: ๐ฌ๐ผ๐ ๐ป๐ฒ๐ฒ๐ฑ ๐๐ผ ๐ฎ๐ฑ๐ฑ ๐ฎ๐น๐น ๐๐ต๐ฒ ๐ณ๐ฒ๐ฎ๐๐๐ฟ๐ฒ๐ ๐๐ต๐ฎ๐ ๐๐ฒ๐ฟ๐ฒ๐ป’๐ ๐ถ๐ป ๐๐ผ๐๐ฟ ๐ ๐ฉ๐ฃ, the ones that are crucial to keeping your AI services ahead of the competition.
And guess what? It’s not going to be a small iteration.
๐ง๐ต๐ฒ๐๐ฒ ๐ป๐ฒ๐ ๐ณ๐ฒ๐ฎ๐๐๐ฟ๐ฒ๐ ๐๐ถ๐น๐น ๐น๐ถ๐ธ๐ฒ๐น๐ ๐ฟ๐ฒ๐พ๐๐ถ๐ฟ๐ฒ ๐ฎ ๐๐๐ฟ๐๐ฐ๐๐๐ฟ๐ฎ๐น ๐ฟ๐ฒ๐๐ถ๐ฒ๐ of your agent orchestration, and even a rethink of your entire multi-agent system architecture (Yes, there are multiple architectures with new ones introduced every week.)
๐ช๐ฒ ๐ฐ๐ฎ๐ป ๐๐ฎ๐ ๐ถ๐:
๐ฌ๐ผ๐๐ฟ ๐ฝ๐ฟ๐ผ๐ฑ๐๐ฐ๐๐ถ๐ผ๐ป ๐ถ๐ ๐น๐ฒ๐ด๐ฎ๐ฐ๐ ๐ถ๐ป ๐น๐ฒ๐๐ ๐๐ต๐ฎ๐ป 3 ๐บ๐ผ๐ป๐๐ต๐. ๐
This isn’t entirely newโฆ
But two things are probably different from your usual tech stack:
1- ๐๐ฒ๐ด๐ฎ๐ฐ๐ ๐ต๐ฎ๐ฝ๐ฝ๐ฒ๐ป๐ ๐ณ๐ฎ๐๐ ๐ถ๐ป ๐๐.
Some projects become legacy before even hitting production.
We’ve experienced this ourselves. When a framework we were using got a major upgrade, we had no choice but to migrate if we wanted to keep building new features.
2- ๐ ๐ถ๐ด๐ฟ๐ฎ๐๐ถ๐ผ๐ป ๐ถ๐ ๐ฟ๐ถ๐๐ธ๐ถ๐ฒ๐ฟ.
You have less control over your AI platform than with a deterministic system.
Which means even small changes can lead to unexpected behaviors.
๐๐ป๐ฑ ๐ต๐ฒ๐ฟ๐ฒ ๐ฐ๐ผ๐บ๐ฒ๐ ๐ฟ๐ฒ๐ด๐ฟ๐ฒ๐๐๐ถ๐ผ๐ป ๐๐ฒ๐๐๐ถ๐ป๐ด.
Regression tests are your best ally to make sure refactors or architecture changes don’t break critical production features.
โ
Create and maintain dedicated regression scenarios
.
โ
Define metrics that qualify outputs: since you’re working with AI, pass/fail isn’t enough.
โ
Automate these tests or every release ๐๐ถ๐น๐น ๐๐๐ฟ๐ป ๐ถ๐ป๐๐ผ ๐ฑ๐ฎ๐๐ ๐ผ๐ณ ๐บ๐ฎ๐ป๐๐ฎ๐น ๐ค๐ ๐ฎ๐ป๐ฑ ๐ฟ๐ฒ๐๐ฒ๐๐๐ถ๐ป๐ด.
If you’re a startup moving fast in AI with users already rely on your first product:
๐ถ๐ป๐๐ฒ๐๐ ๐ถ๐ป ๐ฟ๐ฒ๐ด๐ฟ๐ฒ๐๐๐ถ๐ผ๐ป ๐๐ฒ๐๐๐ถ๐ป๐ด ๐ป๐ผ๐!
Otherwise, your roadmap will slow down before it even picks up speed.
It’s far more correlated with the number of real cases processed in production.
Why?
Because evaluation isn’t meaningful until your system has faced a wide range of inputs… and even then, some edge cases won’t appear until you’re operating at scale.
Here’s what we’ve observed again and again:
โ
Complex agent tasks can perform flawlessly in dev, but start breaking down under production load or when inputs deviate even slightly from the original distribution.
โ
Free text input is unpredictable. No matter how creative your dev team is, users will always surprise you.
โ
It’s not just LLMs : rate limits, data quality issues, and fragile pipelines will all show up once the volume kicks in.
โ
Even your data pipelines need to be stress tested at scale. You’ll spot performance cliffs and resilience gaps only under pressure.
Scaling to production gives your team real signals:
With enough volume, your team gets the feedback it needs to iterate faster and with higher confidence.
Scaling isn’t a nice-to-have.
It’s the prerequisite for building reliable and evolving AI systems.
After months running AI systems 24/7, one thing is clear:
๐ฃ๐ฟ๐ผ๐ฑ๐๐ฐ๐๐ถ๐ผ๐ป ๐ถ๐ ๐๐ต๐ฒ๐ฟ๐ฒ ๐๐ผ๐ ๐ฏ๐๐ถ๐น๐ฑ ๐ฟ๐ฒ๐ฎ๐น ๐ถ๐ป๐๐๐ถ๐๐ถ๐ผ๐ป ๐ณ๐ผ๐ฟ ๐๐ผ๐๐ฟ ๐ป๐ฒ๐
๐ ๐๐ฒ๐ฟ๐๐ถ๐ผ๐ป.
Evaluation before prod is critical, but watching real user interactions is what sharpens your instincts and helps prioritize what matters.
Here’s how we ๐๐๐ฟ๐ป ๐ฝ๐ฟ๐ผ๐ฑ๐๐ฐ๐๐ถ๐ผ๐ป ๐บ๐ผ๐ป๐ถ๐๐ผ๐ฟ๐ถ๐ป๐ด ๐ถ๐ป๐๐ผ ๐ฎ ๐๐๐ฝ๐ฒ๐ฟ๐ฝ๐ผ๐๐ฒ๐ฟ:
๐ Make it a team-wide habit: Everyone in the team should get ๐ฎ ๐๐ฎ๐๐๐ฒ ๐ผ๐ณ ๐ฟ๐ฒ๐ฎ๐น ๐ฝ๐ฟ๐ผ๐ฑ๐๐ฐ๐๐ถ๐ผ๐ป ๐ฑ๐ฎ๐๐ฎ. It fuels both creativity and clarity.
๐ Use it to tune priorities: Support helps reveal what truly matters for the business and what’s just “nice to have.”
๐ช๐ต๐ฒ๐ป ๐ฏ๐๐ถ๐น๐ฑ๐ถ๐ป๐ด ๐ณ๐ฎ๐๐, ๐๐ผ๐ ๐๐ฎ๐ป๐ ๐๐ผ ๐ฑ๐ฒ๐น๐ถ๐๐ฒ๐ฟ ๐๐ต๐ฎ๐’๐ ๐ฒ๐๐๐ฒ๐ป๐๐ถ๐ฎ๐น, and hold off on the rest, especially since priorities often shift between project and production phases.
๐ก Help your team develop good trade-offs: Not every issue needs an immediate fix.
But sometimes a ๐พ๐๐ถ๐ฐ๐ธ ๐ฝ๐ฎ๐๐ฐ๐ต ๐ฐ๐ฎ๐ป ๐ฎ๐๐ผ๐ถ๐ฑ ๐๐๐ฒ๐ฟ ๐ณ๐ฟ๐๐๐๐ฟ๐ฎ๐๐ถ๐ผ๐ป, even if it’s not perfect.
๐ Quantify issues: Log incident frequency and business impact. This data will sharpen priorities and guide improvements, ๐ฒ๐๐ฝ๐ฒ๐ฐ๐ถ๐ฎ๐น๐น๐ ๐ณ๐ผ๐ฟ ๐๐ผ๐๐ฟ ๐ด๐๐ฎ๐ฟ๐ฑ๐ฟ๐ฎ๐ถ๐น, the most critical component of your system.
๐๐ถ๐ฟ๐๐ ๐น๐ถ๐ป๐ฒ (keeps the chaos out) when it’s the input guardrail:
๐๐ฎ๐๐ ๐น๐ถ๐ป๐ฒ (keeps the chaos in) when it’s the output guardrail:
Think of building a ๐ฟ๐ฒ๐น๐ถ๐ฎ๐ฏ๐น๐ฒ ๐ด๐๐ฎ๐ฟ๐ฑ๐ฟ๐ฎ๐ถ๐น as a hard ๐ฐ๐น๐ฎ๐๐๐ถ๐ณ๐ถ๐ฐ๐ฎ๐๐ถ๐ผ๐ป ๐ฝ๐ฟ๐ผ๐ฏ๐น๐ฒ๐บ. Like any classification task, you’re tuning the threshold between false positives (blocking good inputs) and false negatives (letting bad ones through).
๐ง๐ต๐ฒ ๐ฐ๐ต๐ฎ๐น๐น๐ฒ๐ป๐ด๐ฒ?
Your input and output is human language: open-ended, messy, context-dependent. The distribution is huge, and the ambiguity is real. There’s no perfect line.
๐๐ฒ๐ฟ๐ฒ ๐ฎ๐ฟ๐ฒ ๐๐ผ๐บ๐ฒ ๐๐ถ๐ฝ๐ ๐ถ๐ณ ๐๐ผ๐’๐ฟ๐ฒ ๐๐ฎ๐ฐ๐ธ๐น๐ถ๐ป๐ด ๐๐ต๐ถ๐ ๐ฝ๐ฟ๐ผ๐ฏ๐น๐ฒ๐บ:
โ
Define clear and concise policies: Set strong guidelines for how your guardrail should behave.
โ
Inject necessary information: Feed the guardrail only with data needed to make the right decision.
โ
Define good and bad examples: Help it learn what should be let through - and what shouldn’t.
โ
Test different LLM models: Your guardrail is your most important agent. Pay the necessary price to have best quality.
โ
Evaluate performance: Use metrics like accuracy, recall, and precision. Better yet, build custom metrics tied to business impact.
๐ฆ๐ผ๐น๐ถ๐ฑ ๐ด๐๐ฎ๐ฟ๐ฑ๐ฟ๐ฎ๐ถ๐น๐ ๐ฎ๐ฟ๐ฒ ๐๐ต๐ฎ๐ ๐๐ฒ๐ฝ๐ฎ๐ฟ๐ฎ๐๐ฒ ๐๐ฎ๐น๐๐ฎ๐ฏ๐น๐ฒ ๐๐ ๐๐๐๐๐ฒ๐บ๐ ๐ณ๐ฟ๐ผ๐บ ๐ฐ๐ต๐ฎ๐ผ๐๐ถ๐ฐ ๐ผ๐ป๐ฒ๐.
When building an agentic system, it’s important to have data points ๐๐ต๐ฎ๐ ๐ฐ๐ฎ๐ป ๐ฏ๐ฒ ๐ฐ๐ต๐ฒ๐ฐ๐ธ๐ฒ๐ฑ ๐ฎ๐ ๐๐ต๐ฒ ๐ต๐ฒ๐ฎ๐ฟ๐ ๐ผ๐ณ ๐๐ผ๐๐ฟ ๐ฝ๐ฟ๐ผ๐ฐ๐ฒ๐๐ ๐๐ผ ๐ฒ๐ป๐๐๐ฟ๐ฒ ๐๐ผ๐’๐ฟ๐ฒ ๐ผ๐ป ๐๐ต๐ฒ ๐ฟ๐ถ๐ด๐ต๐ ๐ฝ๐ฎ๐๐ต.
Think of these as intermediary milestones, deliverables you’re asking your multi-agent system to produce on its way to the ultimate objective.
These checkpoints serve a dual purpose:
Moreover, these data points should be used to establish hard controls, dictating actions that agents must or must not take based on the extracted information.
Additionally, this metadata functions as a form of memory. Ensuring that agents get this information at critical steps of the process and that it’s never lost.
๐ช๐ฎ๐ป๐ ๐บ๐ผ๐ฟ๐ฒ ๐ฟ๐ฒ๐น๐ถ๐ฎ๐ฏ๐น๐ฒ ๐ฎ๐ด๐ฒ๐ป๐๐?
๐ฆ๐๐ฎ๐ฟ๐ ๐ฑ๐ฒ๐๐ถ๐ด๐ป๐ถ๐ป๐ด ๐๐ถ๐๐ต ๐๐ต๐ฒ๐๐ฒ ๐ถ๐ป๐๐ฒ๐ฟ๐ป๐ฎ๐น ๐ฐ๐ต๐ฒ๐ฐ๐ธ๐ฝ๐ผ๐ถ๐ป๐๐ ๐ถ๐ป ๐บ๐ถ๐ป๐ฑ.
If you want to bring cutting-edge AI systems to production, you’ll need to build an R&D mindset into your team and invest in experimentation.
The challenge?
Keep experimentation ๐ณ๐ฎ๐๐ ๐ฎ๐ป๐ฑ ๐ฎ๐ณ๐ณ๐ผ๐ฟ๐ฑ๐ฎ๐ฏ๐น๐ฒ.
You want feedback in ๐ฑ๐ฎ๐๐, ๐ป๐ผ๐ ๐๐ฒ๐ฒ๐ธ๐.
And just like any expert team, to move fast you need the right tooling.
As your production matures and your agent tasks become more complex, your tools need to scale with you.
At NORMA, when we started building multi-agent systems, we hit a clear bottleneck: Maintaining constant quality was slowing down our shipping velocity.
So, we built something.
๐ ๐๐ผ๐ผ๐น ๐๐ผ ๐๐ฒ๐๐ ๐๐ ๐ฎ๐ด๐ฒ๐ป๐๐ ๐ฒ๐ป๐ฑ ๐๐ผ ๐ฒ๐ป๐ฑ, ๐ฎ๐ ๐๐ฐ๐ฎ๐น๐ฒ.
It started as an internal utility.
Today, it’s a full platform for teams who need to control agent quality over time and evaluate changes across versions efficiently.
๐ ๐ฎ๐ธ๐ฒ ๐๐ฒ๐๐๐ถ๐ป๐ด ๐ฒ๐ฎ๐๐ ๐ณ๐ผ๐ฟ ๐๐ผ๐๐ฟ ๐๐ฒ๐ฎ๐บ!
๐ช๐ต๐ ๐๐ฒ๐๐๐ถ๐ป๐ด ๐ถ๐ ๐ต๐ฎ๐ฟ๐ฑ๐ฒ๐ฟ ๐ถ๐ป ๐๐ ๐ฝ๐น๐ฎ๐๐ณ๐ผ๐ฟ๐บ:
For all these reasons, developers often struggle to build new features while maintaining high production quality and stability.
๐๐ผ๐ป’๐ ๐ผ๐๐ฒ๐ฟ-๐ฝ๐น๐ฎ๐ป, ๐๐๐ฎ๐ฟ๐ ๐ฎ๐๐๐ผ๐บ๐ฎ๐๐ถ๐ป๐ด ๐๐ฒ๐๐๐ ๐ป๐ผ๐ ๐๐ผ ๐ฟ๐ฒ๐น๐ถ๐ฒ๐๐ฒ ๐๐ผ๐๐ฟ ๐ฑ๐ฒ๐ ๐๐ฒ๐ฎ๐บ.
Anything that reduces developer workload is worth pursuing. For example:
Start by measuring outputs globally, then prioritize evaluating the most critical sub-steps, usually those handling data manipulation.
After several iterations, your developers will become comfortable with evaluation processes, naturally integrating these tests to ensure high-quality features.
At NORMA, weโve even automated evaluation directly into our CI/CD pipeline for every PR ๐ฅ
Check our quik demo here: