August 2025

Global Telecom Leader builds custom Conversation Quality metric to improve AI Agent Conversations

  • 90%+ correlation with human CSAT
  • 100% coverage across all customer conversations
  • 2M+ conversations evaluated monthly for brand alignment

A Fortune 500 telecom company operates one of the world’s largest virtual agent deployments, handling millions of billing, troubleshooting, and service conversations every day. As that agent evolved from IVR menus to LLM-powered dialogue, leadership recognized the need for a scalable way to measure quality, filter risky data, and ensure every interaction reflected brand values.

Challenge

As our customer expanded its AI-powered agents, the traditional approach of relying on surveys and ad hoc evaluations was no longer enough.

  • Lagging and limited metrics: Transactional NPS (tNPS) data arrived late, covered only a small subset of conversations, and often reflected brand perception rather than interaction quality.

  • Blind spots in quality: Without a reliable signal, leadership lacked visibility into whether the agent was resolving issues effectively or drifting off-brand.

  • Innovation bottlenecks: Without reliable measurement, low-quality interactions risked feeding back into training data and eroding customer trust.

The organization needed a scalable way to measure, monitor, and improve conversational quality, ensuring its agent could evolve without sacrificing customer experience.

Solution 

Collinear delivered a custom evaluation and improvement framework purpose-built for large-scale conversational AI.

  • Custom reward models: Trained on pairwise human annotations and judge-labeled data to quantify conversation quality with higher fidelity than tNPS (a simplified training sketch follows this list).

  • LLM-as-a-Judge: Automated evaluation across nine quality dimensions, scaling expert review across millions of interactions.

  • Escalation detection: Early-warning signals identified failing conversations, enabling timely rescue and reducing escalation costs.

  • Data pipeline integration: Risky or off-brand conversations were filtered before entering training sets, preventing model drift.
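
To ground the reward-model bullet above: pairwise preference training typically optimizes a Bradley-Terry style objective that pushes the score of the conversation annotators preferred above the one they ranked lower. The PyTorch sketch below is a minimal illustration of that objective; the model head, embedding size, and names are assumptions for this example, not the customer’s deployed architecture.

```python
import torch
import torch.nn as nn

# Hypothetical reward-model head: maps a pooled conversation embedding to a scalar score.
# Architecture and dimensions are illustrative, not the production system.
class RewardHead(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(pooled_embedding).squeeze(-1)

def pairwise_preference_loss(chosen_scores: torch.Tensor,
                             rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the preferred conversation's score
    above the rejected one's, as in standard reward-model training."""
    return -torch.nn.functional.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with random embeddings standing in for encoded conversations.
model = RewardHead()
chosen = model(torch.randn(4, 768))    # conversations annotators preferred
rejected = model(torch.randn(4, 768))  # conversations annotators ranked lower
loss = pairwise_preference_loss(chosen, rejected)
loss.backward()
```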

By combining expert-driven criteria with scalable automation, the company gained a repeatable mechanism to measure and improve its agent in real time; a simplified sketch of how the judge, escalation, and filtering pieces could fit together appears below.
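
The following sketch shows one way the judge scoring, escalation detection, and training-data filtering described above could fit together. It is an illustration only: the dimension names, thresholds, and the judge_conversation stub are assumptions made for this example, and a real pipeline would replace the stub with an actual LLM-as-a-Judge call.

```python
from dataclasses import dataclass
from statistics import mean

# Nine quality dimensions are mentioned in the case study; the names below are
# illustrative placeholders, not the customer's actual rubric.
QUALITY_DIMENSIONS = [
    "resolution", "accuracy", "brand_tone", "empathy", "clarity",
    "safety", "compliance", "efficiency", "escalation_handling",
]

@dataclass
class JudgeResult:
    scores: dict[str, float]   # per-dimension scores on a 0-1 scale
    overall: float             # aggregate conversation-quality score

def judge_conversation(transcript: str) -> JudgeResult:
    """Stand-in for an LLM-as-a-Judge call. A real pipeline would prompt a
    judge model with the transcript and a rubric per dimension; this stub
    returns neutral scores so the sketch runs end to end."""
    scores = {dim: 0.5 for dim in QUALITY_DIMENSIONS}
    return JudgeResult(scores=scores, overall=mean(scores.values()))

def needs_escalation(result: JudgeResult, threshold: float = 0.3) -> bool:
    # Early-warning signal: any dimension falling below the floor flags the
    # conversation for rescue by a human agent.
    return any(score < threshold for score in result.scores.values())

def keep_for_training(result: JudgeResult, floor: float = 0.6) -> bool:
    # Filter risky or off-brand conversations before they enter training sets.
    return result.overall >= floor and result.scores["brand_tone"] >= floor

result = judge_conversation("Customer: my bill is wrong...\nAgent: let me check that for you.")
print(needs_escalation(result), keep_for_training(result))
```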

Results

The project delivered measurable improvements in both model quality and customer experience:

  • Better than NPS: Custom reward models outperformed transactional NPS in predicting satisfaction.

  • 90%+ alignment with CSAT: AI Judges matched human-graded outcomes across millions of conversations.

  • Reduced escalations: Early-warning signals enabled conversation rescue, lowering handoff rates to human agents.

  • Faster iteration: Real-time scoring unlocked safe A/B testing and gated rollouts of new models.

Together, these outcomes proved that objective, scalable evaluation can drive both stronger AI performance and better customer experiences at enterprise scale.

Build Trust. Improve Faster.

This Fortune 500 telecom provider showed that real-time measurement can transform how enterprises run customer assistants. By scaling expert judgment with custom reward models, they replaced lagging surveys with actionable signals, reduced escalations, and delivered safer rollouts.

If your organization is struggling with blind spots in AI quality, the path forward is clear: measure smarter, rescue faster, and improve at scale.

Let’s explore how this can work for your organization.