NVIDIA Developer Blog · 11 min read

Building Token‑Metered AI Services on Telco AI Factories

Mirrored from NVIDIA Developer Blog for archival readability. Support the source by reading on the original site.

AI-Generated Summary

  • Telcos are building sovereign AI factories based on the NVIDIA Cloud Partner reference architecture to provide secure, in-country AI infrastructure with controls and performance suited for enterprise AI services.
  • The economic model is shifting from selling GPU hours to delivering token-metered AI services, where revenue and billing are based on tokens processed rather than infrastructure usage.
  • Token-as-a-Service transforms raw GPU infrastructure into AI applications and APIs measured by tokens, supported by AI developer studios for model fine-tuning and AI marketplaces for service deployment.


Telcos around the world are building sovereign AI factories based on the NVIDIA Cloud Partner (NCP) reference architecture, giving governments, enterprises, and startups access to in‑country AI infrastructure with the right controls, trust, and performance. But infrastructure alone doesn’t get you to high-margin, production-ready enterprise AI services.

Model sizes and reasoning workloads continue to grow, driving up tokens per request, while each new generation of accelerated computing drives down cost per token. Together, these trends make it more valuable to push AI economics higher up the stack—from selling GPU hours to delivering AI services measured and billed in tokens.

At the same time, enterprises don’t want to manage clusters, runtimes, or model weights. They want production‑ready applications and model APIs with predictable performance, metered by token consumption, and backed by service‑level agreements (SLAs) tied to AI‑native metrics such as tokens per second, time‑to‑first‑token (TTFT), and end‑to‑end query latency.
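As a rough illustration of these AI-native SLA metrics, the sketch below derives TTFT, end-to-end latency, and output tokens per second from the timestamps of a single streamed response. The `StreamedResponse` record and its field names are hypothetical and do not come from any NVIDIA API; a real service would collect these timings from its inference gateway.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamedResponse:
    """Timing record for one streamed completion (hypothetical schema)."""
    request_sent_s: float         # when the client sent the request
    token_arrival_s: List[float]  # arrival time of each output token


def ttft_seconds(r: StreamedResponse) -> float:
    """Time-to-first-token: delay between the request and the first output token."""
    return r.token_arrival_s[0] - r.request_sent_s


def end_to_end_latency_seconds(r: StreamedResponse) -> float:
    """Delay between the request and the last output token."""
    return r.token_arrival_s[-1] - r.request_sent_s


def output_tokens_per_second(r: StreamedResponse) -> float:
    """Decode throughput, measured from the first token to the last."""
    decode_time = r.token_arrival_s[-1] - r.token_arrival_s[0]
    if decode_time <= 0:
        return 0.0
    return (len(r.token_arrival_s) - 1) / decode_time


resp = StreamedResponse(
    request_sent_s=0.0,
    token_arrival_s=[0.25, 0.75, 1.25, 1.75, 2.25],
)
print(ttft_seconds(resp))                # 0.25
print(end_to_end_latency_seconds(resp))  # 2.25
print(output_tokens_per_second(resp))    # 2.0
```

An SLA would typically be stated over percentiles of these values across many requests rather than over a single response.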

This post traces the path from GPU‑per‑hour infrastructure to token‑metered AI services and outlines the technical building blocks telcos need to evolve from infrastructure landlords into “token factories” with transparent, token‑based economics that enterprises can easily adopt without operating the underlying infrastructure themselves.

Building the telco AI cloud stack

Diagram of a telco sovereign AI stack with an NVIDIA‑powered AI factory on the left, a central metering and billing column that tracks GPU and token usage, and a telco AI services layer on the right exposing AI applications, APIs, and marketplaces
Figure 1. Telco sovereign AI architecture showing an NVIDIA‑powered AI factory, cross‑stack metering and billing, and telco AI services delivered as token‑metered offerings

AI can be understood as a 5-layer cake—energy, chips, infrastructure, models, and applications. Telco sovereign AI factories sit on top of the energy and chip layers and anchor the infrastructure layer, providing NVIDIA‑accelerated compute, networking, and storage that can securely host models and applications.

Telco AI factories start with NVIDIA‑certified infrastructure and a choice of software partners that define both the platform’s economic and regulatory posture. This foundational layer sets the cost of compute‑as‑a‑service, enforces where data can reside, and controls which tenants can run which workloads in a shared environment. 

In practice, it turns raw GPU capacity into secure, multi‑tenant compute that can be exposed as services, and its cost structure and footprint set the baseline for cost per token as telcos move up the stack—from compute‑as‑a‑service to token‑as‑a‑service, where most of the long-term economic upside sits.

Compute‑as‑a‑Service: Infrastructure and platforms

Compute‑as‑a‑Service (CaaS) is how telcos monetize the energy, chips, and infrastructure layers of the 5‑layer cake, exposing NVIDIA‑certified systems, CPUs, GPUs, NVLink, high‑speed InfiniBand or Ethernet, and storage as GPU/Infrastructure‑as‑a‑Service (IaaS) that customers rent by the hour, similar to traditional cloud instances. 

On top of that, a Kubernetes‑based platform layer turns this raw capacity into a managed environment with multi‑tenant clusters, namespaces, and GPU scheduling, so developers can deploy containers and inference runtimes while being billed primarily on GPU‑hours, node‑hours, and storage. 

This tier is essential for flexibility, control, and sovereignty, but it keeps the business anchored in a GPU‑per‑hour model. The real economic shift happens when telcos add token‑metered models and applications on top of it and start selling AI output rather than just infrastructure time.

Token-as‑a‑Service: Creating and consuming token-metered services

Token‑as‑a‑Service (TaaS) moves telcos up into the model and application layers of the 5‑layer cake, where value is measured in tokens, API calls, and workflows rather than GPU‑hours. In this layer, GPU capacity from the AI factory is packaged into products that are measured, billed, and governed in those same units, and revenue is no longer limited by how many hours a GPU can be rented but by how many tokens the stack can serve at a given price and SLA.

Telcos typically begin with a focused portfolio of token‑metered services built on open models like NVIDIA Nemotron, NVIDIA NIM microservices, and blueprints, such as:

  • Vertical AI applications (for example, customer‑care copilots or knowledge assistants tailored to local languages and regulations)
  • Model and tools APIs for text, vision, speech, and agents
  • Inference‑as‑a‑Service endpoints for fine‑tuned and domain‑specific models

Customers integrate these services through APIs and pay in units that match how their business consumes AI—tokens, requests, or workflows—rather than in opaque infrastructure metrics. SLAs shift accordingly: instead of uptime on specific servers, enterprises care about latency, reliability, and response quality at the model or application level.

To simplify service creation and consumption at this layer, many telcos work with NVIDIA-certified software partners to develop AI developer studios and AI marketplaces.

An AI developer studio is where these token‑metered services are designed, adapted, and operated. Data scientists and developers use NVIDIA NeMo to fine‑tune foundation models, deploy them as secure NIM‑based endpoints, and connect them to retrieval pipelines or agentic workflows. Within an AI studio, they can choose models from a curated catalog, fine-tune them with their own enterprise data to improve accuracy and relevancy, and publish them as AI assets (models, agents, and blueprints) that other developers can reuse without ever touching the underlying infrastructure.

An AI marketplace then becomes the storefront that turns those assets into products. Business and application owners browse a catalog of copilots, retrieval-augmented generation (RAG) applications, model SKUs, and independent software vendor (ISV) solutions, then subscribe and deploy them with a few clicks.

Behind the scenes, the platform provisions inference endpoints and meters usage in input and output tokens, API calls, or workflow executions, automatically enforcing quotas, rate limits, and SLAs.
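To make the quota and rate-limit enforcement concrete, here is a minimal sketch of a per-tenant meter. The `TenantMeter` class, its fields, and the fixed one-minute window are assumptions chosen for illustration, not a description of how any particular platform implements metering. It also assumes the output token count is known when the check runs; a production system would check against an estimate up front and reconcile after generation.

```python
from dataclasses import dataclass


@dataclass
class TenantMeter:
    """Tracks one tenant's token usage against a monthly quota and a per-minute rate limit."""
    monthly_quota_tokens: int
    rate_limit_tokens_per_min: int
    used_this_month: int = 0
    window_start_s: float = 0.0
    used_this_window: int = 0

    def try_consume(self, input_tokens: int, output_tokens: int, now_s: float) -> bool:
        """Record the request and return True, or return False if a limit would be exceeded."""
        total = input_tokens + output_tokens
        # Start a new fixed one-minute window if the current one has expired.
        if now_s - self.window_start_s >= 60.0:
            self.window_start_s = now_s
            self.used_this_window = 0
        if self.used_this_month + total > self.monthly_quota_tokens:
            return False  # monthly quota exhausted
        if self.used_this_window + total > self.rate_limit_tokens_per_min:
            return False  # rate limit hit for this window
        self.used_this_month += total
        self.used_this_window += total
        return True


meter = TenantMeter(monthly_quota_tokens=10_000, rate_limit_tokens_per_min=1_000)
first = meter.try_consume(400, 300, now_s=0.0)    # 700 tokens, accepted
second = meter.try_consume(200, 200, now_s=10.0)  # would reach 1,100 in the window, rejected
third = meter.try_consume(200, 200, now_s=61.0)   # new window, accepted
print(first, second, third, meter.used_this_month)  # True False True 1100
```

Passing the clock in as `now_s` keeps the sketch deterministic and easy to test; rejected requests are not counted against either limit.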

Together, the AI developer studio and AI marketplace enable TaaS, transforming the telco AI factory from a pool of GPUs into a portfolio of sovereign, token‑metered AI products that enterprises can adopt out of the box.

Token-level metering and billing 

To turn those capabilities into products, telcos require a metering and billing layer that treats tokens as a first-class signal and connects them to performance, governance, and infrastructure efficiency.

KPI group | Examples
Token usage | Tokens per tenant, model, endpoint; input vs output; hourly/daily/monthly totals
Performance | QPS, request counts, p50–p99 latency, throughput in tokens per second
Reliability | Error rates tied to token volume
Governance | Per‑tenant quotas, rate limits, access/audits, policy signals
Economics | Tokens per GPU‑hour, per GPU type, tokens per dollar
Table 1. Token‑level usage, performance, reliability, governance, and economic KPIs that telcos track to price, govern, and optimize token‑metered AI services on NVIDIA platforms
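A few of the KPIs in Table 1 can be computed directly from per-request usage records. The sketch below assumes a hypothetical record layout of (tenant, model, input tokens, output tokens, latency in ms) with made-up sample values, and uses a simple nearest-rank percentile; real metering pipelines will differ in schema and aggregation method.

```python
import math
from collections import defaultdict

# Hypothetical usage records: (tenant, model, input_tokens, output_tokens, latency_ms)
RECORDS = [
    ("tenant-a", "model-x", 1000, 500, 100),
    ("tenant-a", "model-x", 2000, 1000, 120),
    ("tenant-b", "model-y", 500, 1500, 150),
    ("tenant-b", "model-y", 800, 200, 200),
    ("tenant-a", "model-y", 300, 700, 900),
]


def tokens_by_tenant(records):
    """Sum input and output tokens per tenant."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for tenant, _model, tokens_in, tokens_out, _latency in records:
        totals[tenant]["input"] += tokens_in
        totals[tenant]["output"] += tokens_out
    return dict(totals)


def percentile(values, p):
    """Nearest-rank percentile (p between 1 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def tokens_per_gpu_hour(records, gpu_hours):
    """Total tokens served divided by the GPU-hours consumed to serve them."""
    total = sum(r[2] + r[3] for r in records)
    return total / gpu_hours


usage = tokens_by_tenant(RECORDS)
latencies = [r[4] for r in RECORDS]
print(usage["tenant-a"])                                    # {'input': 3300, 'output': 2200}
print(percentile(latencies, 50), percentile(latencies, 99))  # 150 900
print(tokens_per_gpu_hour(RECORDS, gpu_hours=0.5))           # 17000.0
```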

Together, these metrics let telcos offer plans priced per million tokens, enforce usage across tenants, and pick the right NVIDIA platform SKUs and service price-points based on real cost-per-token data.
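The sketch below shows one way a per-million-token plan and a cost-per-million-tokens figure could be calculated. The separate input and output rates, the 2 USD GPU-hour cost, and all function names are hypothetical values chosen for illustration; they are not pricing from the source.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)


def invoice_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Bill input and output tokens at separate per-million-token rates."""
    cost = (Decimal(input_tokens) / MILLION) * input_price_per_m
    cost += (Decimal(output_tokens) / MILLION) * output_price_per_m
    return cost.quantize(Decimal("0.01"))


def cost_per_million_tokens(gpu_hour_cost_usd, tokens_per_gpu_hour):
    """Infrastructure cost to serve one million tokens on a given GPU type."""
    return (gpu_hour_cost_usd / (Decimal(tokens_per_gpu_hour) / MILLION)).quantize(
        Decimal("0.0001")
    )


# Hypothetical plan: 0.50 USD per 1M input tokens, 1.50 USD per 1M output tokens.
bill = invoice_usd(12_500_000, 3_000_000, Decimal("0.50"), Decimal("1.50"))
# Hypothetical cost basis: 2 USD per GPU-hour at 30M tokens per GPU-hour.
unit_cost = cost_per_million_tokens(Decimal("2.00"), 30_000_000)
print(bill)       # 10.75
print(unit_cost)  # 0.0667
```

`Decimal` is used instead of floats so that billed amounts round predictably to cents.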

Over time, this token‑level visibility turns the AI factory into a true token factory, where every improvement in the stack is measured in lower cost per token and higher, more predictable gross margin.

Monetizing AI infrastructure as a token factory

Upward‑sloping curve showing revenue models evolving from IaaS compute‑as‑a‑service priced per GPU‑hour, through PaaS tiers, to SaaS model and AI app token‑as‑a‑service priced per tokens, requests, and apps
Figure 2. Moving up the stack from IaaS compute‑as‑a‑service to PaaS and token‑metered AIaaS, turning NVIDIA GPU infrastructure into higher‑value AI applications and APIs

In a GPU‑per‑hour model, revenue is capped by how many hours a GPU can be rented and at what rate. You can tune utilization and pricing, but the unit of value remains “dollars per GPU‑hour,” so improvements in hardware and software mainly show up as pressure to lower hourly prices rather than as higher margins.

In a token‑as‑a‑service model, the same GPU is monetized by how many high‑quality tokens it can produce through an optimized stack, at a given price per million tokens and SLA.

Viewed this way, the AI factory becomes a token factory. Every improvement to the stack—better batching, smarter routing and scheduling, more efficient models, faster networking, and storage that removes I/O bottlenecks—either increases tokens per second or reduces cost‑per‑token.

Revenue scales with token throughput and price per token, while margin improves with each new NVIDIA platform generation and each software optimization, not just with higher hourly rental rates.
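The two revenue models can be written side by side. The symbols below are introduced here for illustration and do not appear in the original post: H is hours per year (8,760), p_h the hourly rental price, u_h the rental utilization, T the billable tokens per hour at full load, p_t the price per million tokens, u_t the token-active utilization, and c_t the cost to serve one million tokens.

```latex
\begin{align}
R_{\text{GPU-hour}} &= p_h \cdot u_h \cdot H \\
R_{\text{token}}    &= \frac{T}{10^{6}} \cdot p_t \cdot u_t \cdot H \\
m_{\text{token}}    &= \frac{p_t - c_t}{p_t}
\end{align}
```

Throughput T does not appear in the first expression, so faster hardware or software changes rental revenue only through the hourly price. In the second, revenue grows linearly with T, and the gross margin m rises as c_t falls at a fixed price p_t.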

A practical example: GPU-per-hour vs. TaaS

The example in Figure 3, below, uses simplified assumptions to show how the economics change when you move from GPU‑per‑hour to TaaS. These numbers are illustrative, not prescriptive pricing.

Bar chart for an H100‑class GPU comparing annual revenue per GPU in two models. The GPU‑per‑hour bar is about 18,400 USD per year, based on a 3 USD hourly rate at 70% utilization. The token‑as‑a‑service bar is much higher, at 157,680 USD per year, based on 30 million billable tokens per hour priced at 1 USD per 1 million tokens with 60% token‑active utilization, illustrating that monetizing tokens generates more revenue than renting GPU time
Figure 3. Illustrative comparison of annual revenue per NVIDIA GPU in a GPU‑per‑hour model versus a token‑as‑a‑service model for an H100‑class GPU, showing higher annual revenue when monetizing tokens instead of raw GPU time

GPU-per-hour model: Assume an H100‑class instance rents for about 3 USD per hour. At 70% average utilization over a year (8,760 hours), that works out to 3 × 0.70 × 8,760 ≈ 18,400 USD in annual revenue per GPU. In this model, you mainly tune utilization and hourly price; you are still selling time on a GPU, not AI output.

TaaS model: Now assume you run a throughput‑optimized, mid‑size model that can sustain 30 million billable tokens per hour on a single H100. If you charge 1 USD per 1 million tokens, that GPU has 30 USD per hour of token revenue potential. At 60% “token‑active” utilization, that yields about 18 USD of realized token revenue per hour, or roughly 157,680 USD per year per GPU.
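The arithmetic behind both figures can be reproduced in a few lines. The sketch uses only the illustrative assumptions stated above (3 USD per hour at 70% utilization, and 30 million tokens per hour at 1 USD per million tokens with 60% token-active utilization); the function names are my own.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def gpu_hour_annual_revenue(rate_usd_per_hour, utilization):
    """Annual revenue per GPU when renting by the hour."""
    return rate_usd_per_hour * utilization * HOURS_PER_YEAR


def taas_annual_revenue(million_tokens_per_hour, price_usd_per_million, token_active_utilization):
    """Annual revenue per GPU when billing per million tokens."""
    hourly = million_tokens_per_hour * price_usd_per_million * token_active_utilization
    return hourly * HOURS_PER_YEAR


rental = gpu_hour_annual_revenue(3.0, 0.70)
taas = taas_annual_revenue(30, 1.0, 0.60)
print(f"GPU-per-hour: {rental:,.0f} USD per year")  # GPU-per-hour: 18,396 USD per year
print(f"TaaS:         {taas:,.0f} USD per year")    # TaaS:         157,680 USD per year
```

The exact rental figure is 18,396 USD, which the post rounds to roughly 18,400 USD.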

New GPU generations amplify this effect. NVIDIA GB200 NVL72 delivers order‑of‑magnitude improvements in tokens‑per‑second and cost‑per‑million‑tokens versus the previous generation, and leading inference providers report up to 10x lower cost‑per‑token on real workloads when they pair Blackwell with optimized stacks.

These savings are easiest to capture when you monetize at the token layer rather than per GPU‑hour, because higher tokens‑per‑second and lower cost‑per‑token translate directly into better unit economics for token‑metered services.

Bar chart comparing illustrative annual revenue per GPU for H100‑class and B200‑class GPUs in GPU‑per‑hour and token‑as‑a‑service models. The H100 and B200 GPU‑per‑hour bars are both about 18,400 USD per year, while the token‑as‑a‑service bars rise from 157,680 USD on H100 to 315,360 USD on B200, highlighting that higher Blackwell‑generation throughput only appears as more revenue when billing per token
Figure 4. Illustrative annual revenue per GPU for H100‑class and B200‑class GPUs in GPU‑per‑hour and token‑as‑a‑service models, showing that Blackwell‑generation throughput only generates additional revenue when GPUs are monetized per token instead of per hour

For example, if a B200‑class GPU doubles effective token throughput from 30 million to 60 million billable tokens per hour at the same price of 1 USD per 1 million tokens and 60% token‑active utilization, annual token‑as‑a‑service revenue per GPU increases from 157,680 USD to approximately 315,360 USD.

In a GPU‑per‑hour model, that extra throughput does not show up as additional revenue, but in a token‑as‑a‑service model it directly translates into higher revenue on the same GPU footprint and better margins as cost per token improves.
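One way to read these numbers is as a break-even token price: the price per million tokens at which token revenue equals rental revenue on the same GPU. This calculation is not in the original post; it is a sketch derived from the same illustrative assumptions, including the assumption implied by Figure 4 that both GPU classes rent for 3 USD per hour at 70% utilization.

```python
def break_even_price_per_million(
    hourly_rate_usd, gpu_utilization, million_tokens_per_hour, token_active_utilization
):
    """Token price (USD per 1M tokens) at which TaaS revenue matches rental revenue."""
    realized_hourly_rental = hourly_rate_usd * gpu_utilization
    realized_million_tokens = million_tokens_per_hour * token_active_utilization
    return realized_hourly_rental / realized_million_tokens


h100 = break_even_price_per_million(3.0, 0.70, 30, 0.60)
b200 = break_even_price_per_million(3.0, 0.70, 60, 0.60)
print(f"H100-class break-even: {h100:.3f} USD per 1M tokens")  # 0.117
print(f"B200-class break-even: {b200:.3f} USD per 1M tokens")  # 0.058
```

Under these assumptions, doubling throughput halves the break-even price, which is the headroom a token-metered provider can split between lower prices and higher margin.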

Where telcos go from here

For telcos that have already invested in NVIDIA‑powered sovereign AI factories, the next step is to move quickly up the stack—from AI infrastructure to AI services—and to align their business models with the AI token economy. 

Practically, this means going beyond GPU clusters and standing up an AI cloud stack with an NVIDIA‑certified software provider that can orchestrate GPUs, enforce multi‑tenant policies, and connect token‑level usage to billing, SLAs, and governance. For example, partners such as Rafay are already helping telcos roll out token‑metered AI services on sovereign infrastructure, offering early evidence that this approach matches real enterprise demand and use cases.

From there, telcos can launch token‑metered AI services: AI studios where teams build and adapt models using NVIDIA NIM and NeMo, marketplaces where those models and applications are offered as SKUs, and APIs that enterprises can consume on a per‑token or per‑workflow basis. 

By treating tokens as the core economic unit—backed by NVIDIA’s advances in tokens‑per‑second, tokens‑per‑watt, and cost‑per‑token—telcos can evolve from connectivity and infrastructure providers into sovereign AI service providers, with revenue and margins that scale as their token factories grow.

Learn how telecom operators are turning sovereign AI infrastructure into real revenue and impact for their nations. 

Tags

Agentic AI / Generative AI | Data Center / Cloud | Developer Tools & Techniques | Telecommunications | Blackwell | GB200 | H100 | Hopper | InfiniBand | NeMo | NIM | NVLink | Intermediate Technical | Tutorial | AI Factory | Cloud Services | Software-Defined Data Center | Sovereign AI

About the Authors

About Waleed Badr
Waleed Badr is a technology leader with deep experience across product go-to-market (GTM), focused on building and scaling AI and security solutions from cloud to edge with hyperscalers and global service providers. He specializes in developing joint AI solutions from ideation and integration through global scale, enabling production-grade AI platforms. He works with NVIDIA cloud partners to build token factories and sovereign AI infrastructure, enabling large-scale AI training, inference, and data pipeline architectures.
About Amogh Dendukuri
Amogh Dendukuri is a product marketing manager for telco AI at NVIDIA, where he drives go-to-market strategies that accelerate telco-led AI infrastructure and AI-powered telco operations. Previously, Amogh worked as a product manager in the networking industry, building solutions for telcos, cloud providers and enterprises at the intersection of AI, cloud and network transformation. Amogh holds a bachelor's degree in computer science and anthropology from the University of Illinois at Urbana-Champaign.
