GSMA Open-Telco LLM Benchmarks

Community Article · Published February 25, 2025

The Definitive AI Benchmark for the Telecoms Industry

Why There Is an Urgent Need for Telecom-Specific AI Benchmarks

Generic AI models fail in the telecoms domain.

Telecom operators are investing billions in AI, recognizing its potential to redefine networks and drive efficiency, automation, and innovation. From network optimization to AI-driven customer interactions, AI is becoming a strategic imperative for the industry. However, despite these investments, a critical gap remains: current AI models are not built for telecom.

The Unique Challenges of AI in Telecom

While general-purpose Large Language Models (LLMs) have brought revolutionary changes to vertical industries like healthcare, finance, and retail, they often fail to perform reliably on telecom-specific tasks. This is due to several key challenges:

  • Misinterpretation of Telecom Standards & Policies:

    • LLMs struggle with highly technical documents, such as 3GPP specifications, ETSI reports, and ITU guidelines.
    • This results in non-compliant AI outputs, potentially affecting everything from spectrum management to network security policies.
  • Errors in Network Optimization & Automation:

    • AI-driven network orchestration, RAN slicing, and congestion control require telecom-grade accuracy.
    • Generic LLMs misinterpret optimization constraints, which may lead to inefficient resource allocation or suboptimal Quality of Service (QoS).
  • Ineffective Fault Detection & Incident Resolution:

    • AI-powered network troubleshooting often fails due to a lack of telecom-specific datasets.
    • Some AI models suggest incorrect or even counterproductive fixes, possibly worsening network reliability.
  • Challenges in AI-Powered Customer Experience & Service Management:

    • AI-driven customer support in telecom requires deep contextual understanding of networks, billing structures, and service configurations.
    • Generic chatbots often fail to provide accurate troubleshooting for complex telecom services.

How GSMA Open-Telco LLM Benchmarks Solve This Issue

To address these AI limitations, the GSMA Open-Telco LLM Benchmarks introduce an evaluation methodology (a minimal scoring sketch follows the list) that tests an LLM's:

  • Accuracy in parsing and interpreting 3GPP/ETSI/ITU standards
  • Ability to handle cross-referenced specifications
  • Consistency in compliance with telecom regulations
  • Effectiveness in troubleshooting real-world telecom problems
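
To make the evaluation concrete, here is a minimal multiple-choice scoring loop in Python. The record schema and the `model_fn` callable are illustrative assumptions, not the official benchmark harness.

```python
# Minimal multiple-choice accuracy loop over TeleQnA-style records.
# The record fields and the model_fn signature are illustrative assumptions.

def accuracy(model_fn, records):
    """model_fn(question, options) -> index of the chosen option."""
    correct = 0
    for rec in records:
        pred = model_fn(rec["question"], rec["options"])
        if pred == rec["answer"]:
            correct += 1
    return correct / len(records)

# Tiny smoke test with a trivial baseline that always picks the first option.
sample = [
    {
        "question": "Which 3GPP release first specified 5G NR?",
        "options": ["Release 13", "Release 14", "Release 15", "Release 16"],
        "answer": 2,  # Release 15
    },
]
print(accuracy(lambda q, opts: 0, sample))  # -> 0.0
```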

By benchmarking AI models against real-world telecom documentation and compliance scenarios, GSMA Open-Telco LLM Benchmarks ensure that LLMs are optimized for the complexities of telecom standards, making AI-driven automation more trustworthy, efficient, and industry-ready.

The Need for a Telco-Specific AI Benchmark

Findings from Research

Recent studies reveal that general AI models struggle with telecom-specific tasks, highlighting the need for a dedicated benchmark.

  • TelBench (SK Telecom): Evaluations show that existing LLMs perform poorly in telecom customer service and technical queries, struggling with industry-specific terminology.
  • Telco-RAG: Retrieval-augmented AI models fail to effectively process telecom documentation, particularly complex 3GPP standards, due to technical density and terminology inconsistencies.
  • TelecomGPT: The lack of open, high-quality training data limits AI performance. Telecom data is often proprietary, fragmented, and highly technical, requiring custom pretraining methods.

Key Industry Gap

General AI models are not optimized for telecom due to:

  • Limited telecom-specific language understanding (jargon, standards, and abbreviations).
  • Lack of knowledge of the telecom infrastructure (legacy systems, network optimization).
  • Failure to address real-world telecom challenges, such as accurate network modelling and decision-making.

What Is the GSMA Open-Telco LLM Benchmarks?

The GSMA Open-Telco LLM Benchmarks is an industry-led initiative designed to evaluate AI models for telecom applications, ensuring they meet the sector's unique demands.

Who’s Behind It?

Launched by GSMA Foundry, this initiative brings together leading industry players to establish a standardized AI benchmarking framework tailored for telecom.

Supported by Major Partners

A wide range of telecom operators, AI leaders, and research institutions support the benchmark, including:

  • Tech & Research: Hugging Face, The Linux Foundation, Khalifa University.
  • Telecom & Industry Leaders: Deutsche Telekom, LG Uplus, SK Telecom, Turkcell, Huawei, and others.

What Does It Do?

The benchmark provides an open-source, transparent evaluation framework for AI models in telecom, focusing on:

  • Real-World Performance: Testing AI capabilities in customer support, network automation, and regulatory compliance.
  • Holistic AI Assessment: Evaluating capability, energy efficiency, and safety, ensuring AI aligns with telecom’s operational and sustainability goals.

By setting a unified approach for AI in telecom, the Open-Telco LLM Benchmarks help drive innovation and accelerate AI adoption in next-generation networks.

How the Benchmark Works

At launch, the GSMA Open-Telco LLM Benchmarks will assess AI models using four key datasets, each targeting a critical aspect of telecom AI performance. These datasets ensure that models are rigorously tested for domain expertise, document comprehension, mathematical reasoning, and network troubleshooting.

TeleQnA – Telecom Domain Knowledge & Technical Understanding

This dataset evaluates AI’s ability to answer telecom-specific queries, interpret industry terminology, and understand standards like 3GPP and ITU regulations. It helps measure how well an AI model can grasp the complexities of telecom infrastructure and operations.
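
As an illustration of what such a probe might look like, the snippet below asks a small open model from the leaderboard a telecom knowledge question via the Hugging Face `transformers` pipeline. The prompt and answer format are examples only, not the benchmark's actual protocol.

```python
# Hypothetical domain-knowledge probe; outputs vary by model and decoding.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

prompt = (
    "Answer with the option letter only.\n"
    "In which frequency range does 5G NR band n78 operate?\n"
    "A) 600 MHz  B) 3.3-3.8 GHz  C) 26 GHz  D) 60 GHz\n"
    "Answer:"
)
print(generator(prompt, max_new_tokens=5)[0]["generated_text"])
```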

3GPPTdocs Classification – Standards Comprehension & Documentation Parsing

AI models are tested on 3GPP technical documentation (a prompt-framing sketch follows the list), assessing their ability to:

  • Classify and organize telecom standards documents.
  • Extract key insights from dense regulatory and technical texts.
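
One plausible way to frame the task is zero-shot labeling with an instruction prompt. The working-group labels and the document snippet below are invented for illustration; the real dataset's label set may differ.

```python
# Sketch of framing Tdoc classification as zero-shot labeling.
# Labels and the sample excerpt are invented, not the dataset's schema.
LABELS = ["RAN1", "RAN2", "SA2", "CT1"]

def build_prompt(tdoc_text: str) -> str:
    return (
        "Classify the following 3GPP contribution into one working group.\n"
        f"Working groups: {', '.join(LABELS)}\n\n"
        f"Document:\n{tdoc_text}\n\n"
        "Working group:"
    )

print(build_prompt("Proposal on PDCCH monitoring for reduced-capability UEs ..."))
```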

MATH500 – Mathematical Reasoning & Modeling

A comprehensive mathematics benchmark containing 500 problems across various mathematics topics including algebra, calculus, probability, and more. Tests both computational ability and mathematical reasoning. Higher scores indicate stronger mathematical problem-solving capabilities.
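
MATH-style problems conventionally wrap the final answer in `\boxed{...}`, so a simple grader can compare extracted answers by exact match. The sketch below assumes that convention and skips the normalization (e.g. `1/2` vs `0.5`) a production grader would need.

```python
# Exact-match scoring sketch for MATH500-style boxed answers.
# Does not handle nested braces or equivalent-form normalization.
import re

def extract_boxed(text):
    m = re.search(r"\\boxed\{([^{}]*)\}", text)
    return m.group(1).strip() if m else None

model_output = r"The channel capacity is therefore \boxed{100} Mbps."
print(extract_boxed(model_output) == "100")  # -> True
```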

FOLIO – Logic & Reasoning

FOLIO is an expert-written, open-domain, logically complex and diverse dataset for natural language reasoning with first-order logic, which can be used to test the reasoning capabilities of LLMs.
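
For intuition, here is an invented FOLIO-style record with a telecom flavor: a set of premises, a candidate conclusion, and a True/False/Uncertain label. The field names are illustrative, not the dataset's exact schema.

```python
# Invented FOLIO-style example; field names are illustrative.
example = {
    "premises": [
        "All base stations in region X run software release R17.",
        "Site 42 is a base station in region X.",
    ],
    "conclusion": "Site 42 runs software release R17.",
    "label": "True",  # FOLIO labels: True / False / Uncertain
}
```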

Why This Matters

These datasets provide a comprehensive framework to assess how well AI models perform in real-world telecom environments. The results will guide telcos in selecting the most effective AI solutions for customer service, network management, and operational automation, and eventually in building telecom-specialized AI models.

Initial Findings and Model Performance

[Figure: initial benchmark scores by model across the evaluation datasets]

  1. OpenAI Models Lead, but with Gaps in Telecom Standards
  • GPT-4 ranks the highest with an average score of 56.96, outperforming all other models across most categories. It excels in TELE-QnA (74.91) and MATH-500 (76.6), demonstrating strong telecom knowledge and mathematical reasoning capabilities. However, its 3GPP-TSG score (38.94) indicates difficulty in comprehending structured telecom standards documents.
  • GPT-3.5-Turbo follows with an average score of 51.44, showing strong general performance but with a slight drop in TELE-QnA (67.29) and MATH-500 (74.68) compared to GPT-4. Similar to GPT-4, it struggles with 3GPP-TSG (38.54).

Key takeaway: While OpenAI models lead overall, they are not optimized for telecom-specific technical documentation (3GPP-TSG), which limits their effectiveness in regulatory and standards-heavy tasks.

  2. Open-Source Models Show Potential but Struggle with Standards
  • Llama 3-8B-Instruct from Meta achieves an average score of 40.38, performing well in TELE-QnA (68.03) but significantly underperforming in 3GPP-TSG (13.2). This highlights its lack of exposure to structured telecom standards.
  • Qwen 2.5-7B-Instruct scores 39.78 on average, showing stronger 3GPP-TSG performance (28.45) compared to Llama 3, indicating a better ability to process telecom regulations.
  • Qwen 2.5-1.5B-Instruct achieves a lower average of 32.8, with poor 3GPP-TSG comprehension (8.05) but decent TELE-QnA (66.01), suggesting it performs well in general telecom inquiries but lacks deeper technical understanding.

Key takeaway: Llama 3 and Qwen models are competitive alternatives to OpenAI models, but their performance on 3GPP standards needs improvement to be fully viable for telecom applications.

  3. Mistral and Microsoft Phi-2 Lag in Telecom Tasks
  • Mistral-7B-Instruct (27.82 avg.) performs decently in 3GPP-TSG (27.84) but significantly lags behind in TELE-QnA (47.07) and MATH-500 (32.06). This suggests it has some capability in processing structured telecom documents but struggles with mathematical reasoning and telecom-specific Q&A.
  • Microsoft Phi-2 (26.11 avg.) scores among the lowest across the board, particularly in MATH-500 (10.8) and Spider (8.15), a text-to-SQL benchmark, highlighting significant weaknesses in reasoning and structured database-related tasks.

Key takeaway: Smaller models like Mistral-7B and Microsoft Phi-2 show potential but are not yet optimized for real-world telecom AI applications. Their mathematical reasoning and telecom knowledge need improvement for practical deployment.

  4. Benchmarking Insights & Future Directions

Strengths of Existing Models

  • GPT-4 and GPT-3.5-Turbo remain the strongest performers, particularly in TELE-QnA and MATH-500.
  • Qwen and Llama 3 models show promise as open-source alternatives but need improvements in handling telecom standards and structured data.

Key Areas for Improvement

  • 3GPP Standards Comprehension (3GPP-TSG): Most models, including GPT-4, struggle with parsing and understanding telecom technical documentation.
  • Mathematical Modeling for Telecom (MATH-500): Only GPT-4 and GPT-3.5-Turbo show strong performance, while other models struggle with advanced telecom calculations.
  • Logic & Reasoning (FOLIO): Lower scores in this category point to the need for stronger first-order logical reasoning, which underpins tasks such as fault detection and log interpretation.

Future Roadmap: Telco Use Cases, Energy Efficiency, and Safety

Beyond the initial four datasets, the GSMA Open-Telco LLM Benchmarks is evolving to address real-world telecom challenges, ensuring AI models are evaluated on key industry priorities such as network troubleshooting, energy efficiency, safety, and operator-driven use cases.

Network Troubleshooting & Optimization

The benchmark will expand to assess AI’s role in predicting, diagnosing, and resolving network issues, ensuring seamless connectivity and efficient operations.

  • AI models will be tested on their ability to detect failures, analyze connectivity issues, and recommend real-time fixes (a prompt sketch follows this list).
  • Evaluation will include how AI integrates with telecom network logs, OSS/BSS systems, and real-time operational data.
  • Automated troubleshooting is a key area of future research, aiming to reduce downtime and enhance network resilience.
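
A hypothetical prompt framing for log-based diagnosis is sketched below; the log lines, network element names, and output format are invented for illustration.

```python
# Hypothetical framing of log-based fault diagnosis as a prompt.
# Log lines and network element identifiers are invented.
LOG_EXCERPT = """\
10:01:03 gNB-214 RRC: re-establishment rate 12% (threshold 2%)
10:01:09 gNB-214 X2: handover failure to gNB-215, cause=target-not-reachable
10:01:15 AMF: N2 path switch timeout for gNB-215
"""

PROMPT = (
    "You are a network operations assistant. Given the log excerpt, "
    "state the most likely root cause and one remediation step.\n\n"
    + LOG_EXCERPT
)
print(PROMPT)
```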

Energy Efficiency & Sustainable AI

As telcos prioritize sustainability, the benchmark will introduce AI energy efficiency assessments to guide eco-friendly AI adoption.

  • Measures compute power consumption, carbon footprint, and efficiency of AI models (see the measurement sketch after this list).
  • Provides telcos with a framework to select energy-efficient AI solutions that align with cost and sustainability goals.
  • Supports GSMA’s Responsible AI Maturity Roadmap, ensuring that AI deployment in telecom aligns with environmental best practices.

Safety & Compliance

Ensuring AI safety, trustworthiness, and regulatory compliance is a major focus area. AI models will be tested for:

  • Hallucinations and misinformation, especially in customer interactions and network decision-making.
  • Regulatory compliance automation, ensuring AI-driven telecom policies adhere to local and global telecom regulations.
  • Alignment with telecom industry safety and ethical standards for responsible AI deployment.

Operator Submissions & Industry Collaboration

The GSMA Open-Telco team is actively seeking input from operators deploying GenAI in telecom.

  • Telcos can submit real-world AI use cases where benchmarking and evaluation support are needed.
  • The Open-Telco team will develop custom playbooks and benchmarks tailored to meet operator-specific AI requirements.

Why Open Benchmarking Matters for Telecom AI

The GSMA Open-Telco LLM Benchmarks play a crucial role in shaping the future of AI in telecom by providing an open, standardized, and collaborative evaluation framework. Unlike closed, proprietary AI assessments, open benchmarking ensures fairness, industry-wide adoption, and continuous improvement.

Transparency

  • Unlike proprietary evaluations, GSMA's benchmarks are open-source and publicly hosted on Hugging Face, allowing anyone to access, test, and validate AI models.
  • Open benchmarking fosters trust and accountability, ensuring that AI models are assessed under clear, reproducible conditions rather than black-box evaluations.

Best Practice

  • Establishes a common industry framework for evaluating AI models on telecom-specific tasks, ensuring that comparisons are consistent, fair, and meaningful.
  • Helps telecom operators and vendors identify the best AI models for real-world applications, from customer support to network automation.

Collaboration

  • Encourages participation from mobile network operators, AI vendors, and researchers, enabling collective innovation in telco AI development.
  • Open-source contributions allow continuous refinement of datasets, evaluation metrics, and model improvements, accelerating AI advancements in the telecom sector.

Get Involved & Next Steps

The GSMA Open-Telco LLM Benchmarks thrive on industry collaboration. Whether you're a telecom operator, AI researcher, or technology provider, your contributions can help shape the future of AI in telecom.

How to Participate

Submit Telco AI Use Cases & Datasets: Have a real-world AI use case or dataset that could improve telecom AI benchmarking? Contribute by emailing [email protected].

Join the Open-Telco Benchmarking Community: Be part of the discussion, access the latest benchmarking insights, and collaborate with leading telcos, AI vendors, and researchers by joining the Otellm Hugging Face community.

Next Steps

The Open-Telco initiative will continue expanding benchmarks, integrating new datasets, use cases, and evaluation metrics. By participating, you help drive standardized, transparent, and efficient AI adoption in the telecom industry.
