August 2025

2M+ Simulated Conversations per Month: How a Fortune 500 Telecom Tests AI Agents Before They Reach Customers

90%+ correlation with human CSAT

2M+ conversations evaluated monthly

9 quality dimensions scored per interaction

A Fortune 500 telecom company runs one of the world's largest virtual agents, handling millions of billing, troubleshooting, and service conversations every day. As its agents evolved from IVR menus to LLM-powered dialogue, the risk profile changed. A bad LLM response can go off-brand, hallucinate policy, or escalate a simple billing question into a customer service incident.

The company needed to test new agent versions against the full diversity of real customer interactions before deployment. Its existing quality signal, transactional NPS collected after launch, arrived too late and covered too few conversations.

Challenge

  • No pre-production quality signal. Transactional NPS arrived after agents were already live, covered a single-digit percentage of conversations, and measured brand sentiment more than resolution quality. By the time a problem surfaced, customers had already experienced it.
  • Scale of the problem. The virtual agent handles millions of conversations per day across billing, troubleshooting, and service, so any pre-production evaluation needed to match that diversity. A small test set wouldn't catch the long-tail failures that matter most.

Solution 

The company used Collinear's Simulation Lab to test every agent version against 2M+ simulated customer conversations per month before anything reached production.

  • 2M+ simulated conversations monthly. The Simulation Lab generated realistic customer interactions matching the diversity of actual traffic: billing disputes, troubleshooting flows, service changes, edge cases, and adversarial inputs. At this scale, long-tail failures that would slip through a small test set get caught.
  • Custom reward models across 9 quality dimensions. Every simulated interaction was scored across nine dimensions including resolution effectiveness, brand alignment, tone, and accuracy, with higher fidelity than transactional NPS.
  • Escalation prediction. Simulated conversations surfaced interaction patterns likely to cause customer escalations, enabling the team to fix failure modes before customers ever encountered them.
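
As a rough illustration of the scoring step described above, here is a minimal sketch of how per-dimension scores might be aggregated and used to flag risky conversations. It includes only the four dimensions this case study names; the 0-to-1 score scale, the simple averaging, the review threshold, and every identifier are assumptions made for illustration, not Collinear's actual API or method.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

# Only the four dimensions named in the case study; the other five are not listed.
NAMED_DIMENSIONS = (
    "resolution_effectiveness",
    "brand_alignment",
    "tone",
    "accuracy",
)


@dataclass
class ScoredConversation:
    """One simulated conversation with hypothetical scores in [0, 1]."""
    conversation_id: str
    scores: Dict[str, float]


def overall_score(conv: ScoredConversation) -> float:
    """Average the per-dimension scores (assumed aggregation rule)."""
    return mean(conv.scores[d] for d in NAMED_DIMENSIONS)


def flag_for_review(convs: List[ScoredConversation], threshold: float = 0.5) -> List[str]:
    """Return ids of conversations where any dimension falls below the threshold."""
    return [
        c.conversation_id
        for c in convs
        if min(c.scores[d] for d in NAMED_DIMENSIONS) < threshold
    ]


if __name__ == "__main__":
    batch = [
        ScoredConversation("sim-001", {
            "resolution_effectiveness": 0.9, "brand_alignment": 0.8,
            "tone": 0.9, "accuracy": 0.8,
        }),
        ScoredConversation("sim-002", {
            "resolution_effectiveness": 0.7, "brand_alignment": 0.9,
            "tone": 0.8, "accuracy": 0.3,  # low accuracy, e.g. a hallucinated policy
        }),
    ]
    print(flag_for_review(batch))
```

Flagging on the weakest dimension rather than the average is a deliberate choice in this sketch: a conversation with good tone but a wrong policy statement would otherwise be hidden by its average.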

Results

  • 90%+ correlation with human CSAT. Custom reward models matched human-graded satisfaction outcomes across millions of simulated conversations, replacing transactional NPS as the primary quality signal.
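  • 2M+ simulated conversations evaluated monthly. Every simulated interaction was scored across all 9 dimensions, giving 100% coverage, compared with the single-digit percentage of conversations that transactional NPS reached.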
  • Escalation risk caught pre-production. Failure modes that would have caused customer escalations were identified and fixed in simulation before reaching production.
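
For readers who want to check this kind of agreement on their own data, the sketch below computes a correlation between automated scores and human CSAT ratings. The case study does not say which correlation measure was used; Pearson correlation is assumed here, and the function name and sample values are hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical values: reward-model scores vs. human CSAT on a 1-5 scale.
    model_scores = [0.91, 0.35, 0.78, 0.60, 0.15]
    human_csat = [5, 2, 4, 4, 1]
    print(f"correlation: {pearson(model_scores, human_csat):.2f}")
```

A high correlation on a human-graded sample is what justifies trusting the automated scores on the much larger set of conversations that no human reviews.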

See what Collinear's Simulation Lab can do for your team.

Book a demo

Company: Major U.S. media & telecom company
Industry: Telecom, Media & Entertainment
Company size: 100,000+ employees
Pain point: No scalable way to test AI agent quality before production; post-deployment NPS covered a fraction of conversations and arrived too late
Collinear SimLab Use Case: Agent Testing
About the company

A diversified global media and technology enterprise that provides broadband internet, video, and phone services, operates business connectivity solutions, and owns major television, film, streaming, and theme park brands, serving tens of millions of residential and commercial customers across the United States and international markets.
