21. June 2025

From PoC to Production

Between May and June 2025, I wrote a series of posts on LinkedIn about the reality of designing, evaluating, and scaling multi-agent AI systems. This article is a direct regrouping and rephrasing of those posts, with no added or removed content — only organized for clarity.

📌 Table of Contents

1. Evaluation is your main driver

Most tech teams have been missing a key point when evaluating AI systems

It’s not new. Even back in the classical ML days, I noticed this pattern:
we stuck to abstract, generic metrics (accuracy, F1-score, etc.) instead of designing domain-specific ones that truly reflect business value.

Today, I’ll go even further:
👉 𝗕𝘂𝗶𝗹𝗱 𝗰𝗹𝗶𝗲𝗻𝘁-𝘀𝗽𝗲𝗰𝗶𝗳𝗶𝗰 𝗺𝗲𝘁𝗿𝗶𝗰𝘀: metrics that align exactly with your client’s goals and values.

Spoiler: 𝘁𝗵𝗲𝘀𝗲 𝗼𝗳𝘁𝗲𝗻 𝗯𝗼𝗶𝗹 𝗱𝗼𝘄𝗻 𝘁𝗼 𝗺𝗼𝗻𝗲𝘆 💲 𝗥𝗢𝗜.

Concrete example. 𝗔𝘀𝗸 𝗯𝘂𝘀𝗶𝗻𝗲𝘀𝘀 𝗼𝘄𝗻𝗲𝗿𝘀 𝘄𝗵𝗶𝗰𝗵 𝘀𝗰𝗲𝗻𝗮𝗿𝗶𝗼 𝘁𝗵𝗲𝘆’𝗱 𝗰𝗵𝗼𝗼𝘀𝗲:
1️⃣ A sales AI that hallucinates 10% of the time, but closes 20% of deals
2️⃣ A sales AI that hallucinates 1%, but only closes 1%

The answer isn’t obvious right ? it’s not that easy since you do not know the real impact in terme of costs and revenues for each scenario…

Can LLM-as-a-judge be considered a form of evaluation?

As a data scientist, my answer is NO. at least, not in the strict sense of the word.

Here’s why:
Evaluation without a gold standard is just another opinion.
Or if you prefer: it’s more of an 𝗮𝗻𝗮𝗹𝘆𝘀𝗶𝘀 𝘁𝗵𝗮𝗻 𝗮𝗻 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻.

Don’t get me wrong! I’m not saying LLM-as-a-judge is useless. It can provide useful signals to improve your system, 𝗲𝘀𝗽𝗲𝗰𝗶𝗮𝗹𝗹𝘆 𝘄𝗵𝗲𝗻 𝗵𝘂𝗺𝗮𝗻 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻𝘀 𝗮𝗿𝗲𝗻’𝘁 𝘀𝗰𝗮𝗹𝗮𝗯𝗹𝗲.

But calling it evaluation is misleading, because you’re 𝗻𝗼𝘁 𝗰𝗼𝗺𝗽𝗮𝗿𝗶𝗻𝗴 𝘁𝗼 𝗮 𝘀𝗼𝘂𝗿𝗰𝗲 𝗼𝗳 𝘁𝗿𝘂𝘁𝗵.

As a data scientist, I’ve already seen this kind of confusion, especially when discussing unsupervised learning evaluation.

If you’ve worked in data science, you know:
Evaluation in unsupervised learning is tricky. You often have to 𝘁𝗶𝗲 𝗶𝘁 𝗯𝗮𝗰𝗸 𝘁𝗼 𝗮 𝘀𝘂𝗽𝗲𝗿𝘃𝗶𝘀𝗲𝗱 𝘁𝗮𝘀𝗸 𝗼𝗿 𝗮 𝗯𝘂𝘀𝗶𝗻𝗲𝘀𝘀 𝗼𝘂𝘁𝗰𝗼𝗺𝗲 𝘁𝗼 𝗮𝘀𝘀𝗲𝘀𝘀 𝗶𝘁𝘀 𝘃𝗮𝗹𝘂𝗲.

Same goes for evaluating text generation.

𝗦𝗼 𝗵𝗼𝘄 𝗱𝗼 𝘄𝗲 𝗵𝗮𝗻𝗱𝗹𝗲 𝘁𝗵𝗶𝘀 𝗽𝗿𝗼𝗽𝗲𝗿𝗹𝘆?
✅ Define expected outputs at the data point level so you can apply real metrics.
✅ Don’t ask the LLM to reason, ask it to compare your system’s output against a reference.
✅ Let them help you understand why an output deviated, not just whether it’s good or bad.
✅ Instead of relying on end-to-end judgment, look at what each agent or tool is doing in your system.

What does it mean to evaluate RAG? Is it worth doing?

How do we break it down into multiple steps, and what’s the most critical part?

First, understand that 𝗥𝗔𝗚 𝗶𝘀𝗻'𝘁 𝗷𝘂𝘀𝘁 𝗮 𝘁𝗲𝘅𝘁-𝗯𝗮𝘀𝗲𝗱 𝘀𝘆𝘀𝘁𝗲𝗺. It can involve various data types and formats integrated into the LLM’s context.
RAG can cover API calls, SQL databases, web searches, and of course vector databases.

To control your agent’s output quality, you must ensure quality at every process step.

𝗜𝗻 𝘁𝗵𝗲 𝗮𝗴𝗲𝗻𝘁 𝘀𝗽𝗮𝗰𝗲, 𝗿𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 𝗶𝘀 𝗰𝗿𝗶𝘁𝗶𝗰𝗮𝗹, because it involves multiple sub-steps:

The agent builds the query itself. You need to assess whether it’s doing this correctly. ex: not misusing database tools or accessing unauthorized data.
Evaluate the real targets (entities, lines, documents, or chunks) that should be retrieved and used as context. Then compute metrics like false positives and false negatives.
Problems can also come from your knowledge base. You need clean, curated data, especially during the building and testing phases.
And ensure you’re building efficient context retrieval! If the data you’re feeding the LLM is too big (and isn’t focused on what needs to be shown to the user) you’ll lose your LLM’s attention, even with models that support large contexts.

𝗧𝗵𝗲𝗻, 𝗮𝘀𝘀𝗲𝘀𝘀 𝘁𝗵𝗲 𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻 𝗽𝗮𝗿𝘁. 3 key points to evaluate:

𝗛𝗮𝗹𝗹𝘂𝗰𝗶𝗻𝗮𝘁𝗶𝗼𝗻 and Non-Based Information: Is the output strictly based on your RAG context? Errors here could come from model quality, prompt design, or lack of guidance.
𝗥𝗲𝗹𝗲𝘃𝗮𝗻𝗰𝘆: Does the generation highlight the right parts of the context? Is it structured as expected, per prompt policy or fine-tuning?
𝗘𝘅𝗵𝗮𝘂𝘀𝘁𝗶𝘃𝗶𝘁𝘆: Is the generated text exhaustive? Be cautious, some filters in the prompt might deliberately exclude parts of the retrieved data.

𝗥𝗔𝗚 𝗶𝘀 𝘁𝗵𝗲 𝗺𝗼𝘀𝘁 𝗶𝗺𝗽𝗼𝗿𝘁𝗮𝗻𝘁 𝗽𝗮𝗿𝘁 𝘁𝗼 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗲 𝗶𝗻 𝘆𝗼𝘂𝗿 𝘀𝘆𝘀𝘁𝗲𝗺. It’s where most issues originate. Break it down and evaluate each part carefully.
✅ Study your errors and keep track of them in neat and well-defined automated tests.
✅ Build a robust test framework with a diverse query set and reference answers.
✅ Use metrics like precision, recall, and context relevance for retrieval.
✅ For generation, focus on faithfulness and answer relevance.
✅ Regularly update your evaluation datasets based on real-world usage.
✅ Use LLM-based evaluators to scale your assessment efficiently.

2. The real battle begins in Production

Your AI system just hit production. Now comes a new challenge: 𝗻𝗼𝘁 𝗯𝗿𝗲𝗮𝗸𝗶𝗻𝗴 𝗶𝘁

Congratulations! 🎊

You’ve shipped your wonderful AI platform to prod, and it’s working pretty well for a first version.

And now comes the hard part: 𝗬𝗼𝘂 𝗻𝗲𝗲𝗱 𝘁𝗼 𝗮𝗱𝗱 𝗮𝗹𝗹 𝘁𝗵𝗲 𝗳𝗲𝗮𝘁𝘂𝗿𝗲𝘀 𝘁𝗵𝗮𝘁 𝘄𝗲𝗿𝗲𝗻’𝘁 𝗶𝗻 𝘆𝗼𝘂𝗿 𝗠𝗩𝗣, the ones that are crucial to keeping your AI services ahead of the competition.

And guess what? It’s not going to be a small iteration.

𝗧𝗵𝗲𝘀𝗲 𝗻𝗲𝘄 𝗳𝗲𝗮𝘁𝘂𝗿𝗲𝘀 𝘄𝗶𝗹𝗹 𝗹𝗶𝗸𝗲𝗹𝘆 𝗿𝗲𝗾𝘂𝗶𝗿𝗲 𝗮 𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗮𝗹 𝗿𝗲𝘃𝗶𝗲𝘄 of your agent orchestration, and even a rethink of your entire multi-agent system architecture (Yes, there are multiple architectures with new ones introduced every week.)

𝗪𝗲 𝗰𝗮𝗻 𝘀𝗮𝘆 𝗶𝘁:
𝗬𝗼𝘂𝗿 𝗽𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗶𝘀 𝗹𝗲𝗴𝗮𝗰𝘆 𝗶𝗻 𝗹𝗲𝘀𝘀 𝘁𝗵𝗮𝗻 3 𝗺𝗼𝗻𝘁𝗵𝘀. 😅

This isn’t entirely new…
But two things are probably different from your usual tech stack:
1- 𝗟𝗲𝗴𝗮𝗰𝘆 𝗵𝗮𝗽𝗽𝗲𝗻𝘀 𝗳𝗮𝘀𝘁 𝗶𝗻 𝗔𝗜.
Some projects become legacy before even hitting production.
We’ve experienced this ourselves. When a framework we were using got a major upgrade, we had no choice but to migrate if we wanted to keep building new features.
2- 𝗠𝗶𝗴𝗿𝗮𝘁𝗶𝗼𝗻 𝗶𝘀 𝗿𝗶𝘀𝗸𝗶𝗲𝗿.
You have less control over your AI platform than with a deterministic system.
Which means even small changes can lead to unexpected behaviors.

𝗔𝗻𝗱 𝗵𝗲𝗿𝗲 𝗰𝗼𝗺𝗲𝘀 𝗿𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝗼𝗻 𝘁𝗲𝘀𝘁𝗶𝗻𝗴.
Regression tests are your best ally to make sure refactors or architecture changes don’t break critical production features.
✅ Create and maintain dedicated regression scenarios.
✅ Define metrics that qualify outputs: since you’re working with AI, pass/fail isn’t enough.
✅ Automate these tests or every release 𝘄𝗶𝗹𝗹 𝘁𝘂𝗿𝗻 𝗶𝗻𝘁𝗼 𝗱𝗮𝘆𝘀 𝗼𝗳 𝗺𝗮𝗻𝘂𝗮𝗹 𝗤𝗔 𝗮𝗻𝗱 𝗿𝗲𝘁𝗲𝘀𝘁𝗶𝗻𝗴.

If you’re a startup moving fast in AI with users already rely on your first product:

𝗶𝗻𝘃𝗲𝘀𝘁 𝗶𝗻 𝗿𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝗼𝗻 𝘁𝗲𝘀𝘁𝗶𝗻𝗴 𝗻𝗼𝘄!

Otherwise, your roadmap will slow down before it even picks up speed.

Improving your AI system isn’t just about development time.

It’s far more correlated with the number of real cases processed in production.

Why?

Because evaluation isn’t meaningful until your system has faced a wide range of inputs… and even then, some edge cases won’t appear until you’re operating at scale.

Here’s what we’ve observed again and again:

✅ Complex agent tasks can perform flawlessly in dev, but start breaking down under production load or when inputs deviate even slightly from the original distribution.
✅ Free text input is unpredictable. No matter how creative your dev team is, users will always surprise you.
✅ It’s not just LLMs : rate limits, data quality issues, and fragile pipelines will all show up once the volume kicks in.
✅ Even your data pipelines need to be stress tested at scale. You’ll spot performance cliffs and resilience gaps only under pressure.

Scaling to production gives your team real signals:

What breaks?
Where is the system fragile?
What improvements would make the most users happy?

With enough volume, your team gets the feedback it needs to iterate faster and with higher confidence.

Scaling isn’t a nice-to-have.
It’s the prerequisite for building reliable and evolving AI systems.

Production support and online evaluation are not just ops tasks, they’re strategic levers for your team.

After months running AI systems 24/7, one thing is clear:
𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗶𝘀 𝘄𝗵𝗲𝗿𝗲 𝘆𝗼𝘂 𝗯𝘂𝗶𝗹𝗱 𝗿𝗲𝗮𝗹 𝗶𝗻𝘁𝘂𝗶𝘁𝗶𝗼𝗻 𝗳𝗼𝗿 𝘆𝗼𝘂𝗿 𝗻𝗲𝘅𝘁 𝘃𝗲𝗿𝘀𝗶𝗼𝗻.

Evaluation before prod is critical, but watching real user interactions is what sharpens your instincts and helps prioritize what matters.

Here’s how we 𝘁𝘂𝗿𝗻 𝗽𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗺𝗼𝗻𝗶𝘁𝗼𝗿𝗶𝗻𝗴 𝗶𝗻𝘁𝗼 𝗮 𝘀𝘂𝗽𝗲𝗿𝗽𝗼𝘄𝗲𝗿:

🔁 Make it a team-wide habit: Everyone in the team should get 𝗮 𝘁𝗮𝘀𝘁𝗲 𝗼𝗳 𝗿𝗲𝗮𝗹 𝗽𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗱𝗮𝘁𝗮. It fuels both creativity and clarity.

📈 Use it to tune priorities: Support helps reveal what truly matters for the business and what’s just “nice to have.”
𝗪𝗵𝗲𝗻 𝗯𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗳𝗮𝘀𝘁, 𝘆𝗼𝘂 𝘄𝗮𝗻𝘁 𝘁𝗼 𝗱𝗲𝗹𝗶𝘃𝗲𝗿 𝘄𝗵𝗮𝘁’𝘀 𝗲𝘀𝘀𝗲𝗻𝘁𝗶𝗮𝗹, and hold off on the rest, especially since priorities often shift between project and production phases.

💡 Help your team develop good trade-offs: Not every issue needs an immediate fix.
But sometimes a 𝗾𝘂𝗶𝗰𝗸 𝗽𝗮𝘁𝗰𝗵 𝗰𝗮𝗻 𝗮𝘃𝗼𝗶𝗱 𝘂𝘀𝗲𝗿 𝗳𝗿𝘂𝘀𝘁𝗿𝗮𝘁𝗶𝗼𝗻, even if it’s not perfect.

📊 Quantify issues: Log incident frequency and business impact. This data will sharpen priorities and guide improvements, 𝗲𝘀𝗽𝗲𝗰𝗶𝗮𝗹𝗹𝘆 𝗳𝗼𝗿 𝘆𝗼𝘂𝗿 𝗴𝘂𝗮𝗿𝗱𝗿𝗮𝗶𝗹, the most critical component of your system.

3. Regression Testing and Guardrails: The Hidden Heroes

Your 𝗴𝘂𝗮𝗿𝗱𝗿𝗮𝗶𝗹 is the 𝗺𝗼𝘀𝘁 𝗰𝗿𝗶𝘁𝗶𝗰𝗮𝗹 𝗽𝗮𝗿𝘁 of your multi-agent system. It’s your system’s gateway. 𝗬𝗼𝘂𝗿 𝗳𝗶𝗿𝘀𝘁 𝗮𝗻𝗱 𝗹𝗮𝘀𝘁 𝗹𝗶𝗻𝗲 𝗼𝗳 𝗱𝗲𝗳𝗲𝗻𝘀𝗲.

𝗙𝗶𝗿𝘀𝘁 𝗹𝗶𝗻𝗲 (keeps the chaos out) when it’s the input guardrail:

helps you catch out of scope/unexpected inputs (things your workflows weren’t designed to handle). With these kinds of input, you’re almost guaranteed to get unreliable, unpredictable results.
It’s also your security guard: detecting jailbreaks, prompt injections, and malicious inputs.

𝗟𝗮𝘀𝘁 𝗹𝗶𝗻𝗲 (keeps the chaos in) when it’s the output guardrail:

It helps drastically reduce hallucinations.
It can flag out-of-scope topics that slipped past the input guardrail.
It should also enforce brand moderation, keeping responses aligned with company image and policies.

Think of building a 𝗿𝗲𝗹𝗶𝗮𝗯𝗹𝗲 𝗴𝘂𝗮𝗿𝗱𝗿𝗮𝗶𝗹 as a hard 𝗰𝗹𝗮𝘀𝘀𝗶𝗳𝗶𝗰𝗮𝘁𝗶𝗼𝗻 𝗽𝗿𝗼𝗯𝗹𝗲𝗺. Like any classification task, you’re tuning the threshold between false positives (blocking good inputs) and false negatives (letting bad ones through).

𝗧𝗵𝗲 𝗰𝗵𝗮𝗹𝗹𝗲𝗻𝗴𝗲?
Your input and output is human language: open-ended, messy, context-dependent. The distribution is huge, and the ambiguity is real. There’s no perfect line.

𝗛𝗲𝗿𝗲 𝗮𝗿𝗲 𝘀𝗼𝗺𝗲 𝘁𝗶𝗽𝘀 𝗶𝗳 𝘆𝗼𝘂’𝗿𝗲 𝘁𝗮𝗰𝗸𝗹𝗶𝗻𝗴 𝘁𝗵𝗶𝘀 𝗽𝗿𝗼𝗯𝗹𝗲𝗺:
✅ Define clear and concise policies: Set strong guidelines for how your guardrail should behave.
✅ Inject necessary information: Feed the guardrail only with data needed to make the right decision.
✅ Define good and bad examples: Help it learn what should be let through - and what shouldn’t.
✅ Test different LLM models: Your guardrail is your most important agent. Pay the necessary price to have best quality.
✅ Evaluate performance: Use metrics like accuracy, recall, and precision. Better yet, build custom metrics tied to business impact.

𝗦𝗼𝗹𝗶𝗱 𝗴𝘂𝗮𝗿𝗱𝗿𝗮𝗶𝗹𝘀 𝗮𝗿𝗲 𝘄𝗵𝗮𝘁 𝘀𝗲𝗽𝗮𝗿𝗮𝘁𝗲 𝘃𝗮𝗹𝘂𝗮𝗯𝗹𝗲 𝗔𝗜 𝘀𝘆𝘀𝘁𝗲𝗺𝘀 𝗳𝗿𝗼𝗺 𝗰𝗵𝗮𝗼𝘁𝗶𝗰 𝗼𝗻𝗲𝘀.

4. Evaluation inside the Agentic Loop

Data extraction evaluation matters in agentic systems

When building an agentic system, it’s important to have data points 𝘁𝗵𝗮𝘁 𝗰𝗮𝗻 𝗯𝗲 𝗰𝗵𝗲𝗰𝗸𝗲𝗱 𝗮𝘁 𝘁𝗵𝗲 𝗵𝗲𝗮𝗿𝘁 𝗼𝗳 𝘆𝗼𝘂𝗿 𝗽𝗿𝗼𝗰𝗲𝘀𝘀 𝘁𝗼 𝗲𝗻𝘀𝘂𝗿𝗲 𝘆𝗼𝘂’𝗿𝗲 𝗼𝗻 𝘁𝗵𝗲 𝗿𝗶𝗴𝗵𝘁 𝗽𝗮𝘁𝗵.

Think of these as intermediary milestones, deliverables you’re asking your multi-agent system to produce on its way to the ultimate objective.

These checkpoints serve a dual purpose:

𝗧𝗮𝗻𝗴𝗶𝗯𝗹𝗲 𝗠𝗲𝗮𝘀𝘂𝗿𝗲𝗺𝗲𝗻𝘁𝘀: They provide clear metrics that simplify system monitoring and facilitate the evaluation of each agent’s performance.
𝗣𝗿𝗼𝗰𝗲𝘀𝘀 𝗜𝗻𝘀𝗶𝗴𝗵𝘁𝘀: They offer valuable data that can enrich your understanding of the automated process and even serve as a source of structured knowledge about your users.

Moreover, these data points should be used to establish hard controls, dictating actions that agents must or must not take based on the extracted information.

Additionally, this metadata functions as a form of memory. Ensuring that agents get this information at critical steps of the process and that it’s never lost.

𝗪𝗮𝗻𝘁 𝗺𝗼𝗿𝗲 𝗿𝗲𝗹𝗶𝗮𝗯𝗹𝗲 𝗮𝗴𝗲𝗻𝘁𝘀?
𝗦𝘁𝗮𝗿𝘁 𝗱𝗲𝘀𝗶𝗴𝗻𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝘁𝗵𝗲𝘀𝗲 𝗶𝗻𝘁𝗲𝗿𝗻𝗮𝗹 𝗰𝗵𝗲𝗰𝗸𝗽𝗼𝗶𝗻𝘁𝘀 𝗶𝗻 𝗺𝗶𝗻𝗱.

5. Building for Scale and Automation

R&D is not optional. Be ready to pay for experimentation.

If you want to bring cutting-edge AI systems to production, you’ll need to build an R&D mindset into your team and invest in experimentation.

The challenge?
Keep experimentation 𝗳𝗮𝘀𝘁 𝗮𝗻𝗱 𝗮𝗳𝗳𝗼𝗿𝗱𝗮𝗯𝗹𝗲.
You want feedback in 𝗱𝗮𝘆𝘀, 𝗻𝗼𝘁 𝘄𝗲𝗲𝗸𝘀.

And just like any expert team, to move fast you need the right tooling.
As your production matures and your agent tasks become more complex, your tools need to scale with you.

At NORMA, when we started building multi-agent systems, we hit a clear bottleneck: Maintaining constant quality was slowing down our shipping velocity.
So, we built something.

𝗔 𝘁𝗼𝗼𝗹 𝘁𝗼 𝘁𝗲𝘀𝘁 𝗔𝗜 𝗮𝗴𝗲𝗻𝘁𝘀 𝗲𝗻𝗱 𝘁𝗼 𝗲𝗻𝗱, 𝗮𝘁 𝘀𝗰𝗮𝗹𝗲.
It started as an internal utility.
Today, it’s a full platform for teams who need to control agent quality over time and evaluate changes across versions efficiently.

You want reliable and efficient AI?

𝗠𝗮𝗸𝗲 𝘁𝗲𝘀𝘁𝗶𝗻𝗴 𝗲𝗮𝘀𝘆 𝗳𝗼𝗿 𝘆𝗼𝘂𝗿 𝘁𝗲𝗮𝗺!

𝗪𝗵𝘆 𝘁𝗲𝘀𝘁𝗶𝗻𝗴 𝗶𝘀 𝗵𝗮𝗿𝗱𝗲𝗿 𝗶𝗻 𝗔𝗜 𝗽𝗹𝗮𝘁𝗳𝗼𝗿𝗺:

Your input is challenging to control and fully cover.
LLMs are everywhere in your pipeline, each bringing potential variance to your process.
Many steps rely on structured outputs from LLMs, which aren’t always easy to obtain systematically.
Agents may access external sources you don’t fully control.
Outputs vary widely: generated text, finite decision spaces, data extraction…
AI frameworks themselves are still evolving quickly and might introduce instabilities or deprecate parts of your code overnight.

For all these reasons, developers often struggle to build new features while maintaining high production quality and stability.

𝗗𝗼𝗻’𝘁 𝗼𝘃𝗲𝗿-𝗽𝗹𝗮𝗻, 𝘀𝘁𝗮𝗿𝘁 𝗮𝘂𝘁𝗼𝗺𝗮𝘁𝗶𝗻𝗴 𝘁𝗲𝘀𝘁𝘀 𝗻𝗼𝘄 𝘁𝗼 𝗿𝗲𝗹𝗶𝗲𝘃𝗲 𝘆𝗼𝘂𝗿 𝗱𝗲𝘃 𝘁𝗲𝗮𝗺.

Anything that reduces developer workload is worth pursuing. For example:

Begin with a simple script that automatically runs tests and saves results to an accessible folder.
Execute these tests in parallel to accelerate feedback.
Provide a clear dashboard or interface for easily inspecting results.
Encourage your team to establish metrics to continuously monitor overall quality.

Start by measuring outputs globally, then prioritize evaluating the most critical sub-steps, usually those handling data manipulation.
After several iterations, your developers will become comfortable with evaluation processes, naturally integrating these tests to ensure high-quality features. At NORMA, we’ve even automated evaluation directly into our CI/CD pipeline for every PR 🔥

Check our quik demo here: