NVIDIA Developer Blog · 15 min read

Evaluate Clinical ASR Models Faster with Agent Skills and NVIDIA Nemotron Speech

Mirrored from NVIDIA Developer Blog for archival readability. Support the source by reading on the original site.

AI-Generated Summary

  • Clinical speech AI faces unique challenges with rare and complex terminology, making standard speech systems inadequate for tasks like medication name recognition; synthetic data generation can help, but only if pronunciation is carefully validated.
  • The workflow described leverages NVIDIA agent skills, NeMo Data Designer, and Magpie TTS Multilingual to create, review, and benchmark pronunciation-aware synthetic audio, enabling rapid, repeatable creation of clinical ASR evaluation datasets without real patient data.
  • The process uses an iterative improvement loop: define clinical profiles, generate and review synthetic audio for accurate term pronunciation, evaluate ASR performance at the entity level, and adapt or expand the benchmark based on targeted error analysis, with explicit manual review for pronunciation gaps.

AI-generated content may summarize information incompletely. Verify important information.

Training a speech AI model to correctly recognize or synthesize clinical terminology is surprisingly difficult. Drug names like Acetaminophen, Amlodipine, Cefazolin, and Biktarvy are not part of everyday vocabulary. Procedure names, anatomy terms, and specialty-specific diagnoses introduce the same problem in a different form. Off-the-shelf speech systems can sound fluent and still miss the words that matter most to a clinical workflow.

Synthetic data generation (SDG) can help close this gap, but only if the synthesized speech is phonetically accurate. A text-to-speech (TTS) system that mispronounces a medication or procedure name produces training or evaluation data that teaches the wrong pronunciation. Instead of fixing the original problem, it can make the failure more difficult to detect. When correctly implemented, SDG enables a team to stand up a domain benchmark in hours without collecting real clinical audio or waiting on annotation pipelines or IRB approval.

This post presents a clinical automatic speech recognition (ASR) workflow for generating pronunciation-aware synthetic audio, reviewing clinical terms, and evaluating recognition quality. NVIDIA agent skills guide the workflow, while NVIDIA NeMo Data Designer and NVIDIA Nemotron Speech provide the data generation and speech services. 

Why does clinical ASR need a repeatable feedback loop?

Clinical voice AI is becoming part of dictation, ambient documentation, call-center workflows, patient intake, and post-visit follow-up. These systems are expected to understand terms that are rare in general speech but central to the task: medication names, procedure names, anatomy, diagnoses, devices, symptoms, and specialty abbreviations. 

Real-world clinical audio is also difficult to collect and share. It can be expensive, slow to annotate, restricted by privacy requirements, and unevenly distributed across specialties and rare terms. Public datasets may not include the exact terminology a deployment depends on. Real patient recordings are protected health information (PHI) under HIPAA, which means they cannot be freely shared across teams, checked into version control, or used in automated test pipelines without significant compliance overhead. Synthetic audio contains no PHI by design, making it the only form of clinical speech data a team can freely version, share, and test.

The practical challenge is not only to generate more data. Developers need a repeatable way to define the target clinical profile, create a benchmark, review pronunciation risk, measure ASR behavior, improve the model, and decide whether the next cycle should expand terms, improve pronunciations, add noise, or fine-tune.

How are AI agent skills helpful for clinical voice AI?

Agent skills guide a developer through the repeated steps of clinical ASR evaluation: defining a profile, building a term-centered benchmark, reviewing pronunciations, generating synthetic audio, measuring ASR behavior, and choosing the next iteration. 

In this post, the flywheel is the full improvement loop: build the benchmark, evaluate ASR behavior, use the results to decide what to change, and reevaluate after the change. The pipeline is one pass through part of that loop, such as generating sentences, adding pronunciation markup, synthesizing audio, and writing the manifest.

The pipeline begins as a conversation, not a static config file. Developers can run the build skill in any agent harness (Claude Code or Codex, for example) and describe the clinical workflow they want to stress test. The digital health clinical asr build agent skill then asks for the specialty, target setting, known failure terms, and desired contexts before generating a first-pass benchmark configuration.

The skill asks one question at a time, in order: 

  • What specialty or workflow
  • Which ASR failure modes have been observed
  • Which terms are encountered daily and which terms are difficult 

Common terms become the baseline, and difficult terms become the signal that drives benchmark design.

Developer:
/flywheel-build

Agent:
What clinical specialty or workflow is this for?
Examples: orthopedic practice, oncology dictation, ICU handoff, ambulatory cardiology.

Developer:
Orthopedic practice. Focus on post-op instructions and PACU handoff.
The ASR system often misses medication names and procedure names.

Agent:
I'll start a benchmark profile with drug, procedure, anatomy, and condition terms, then run 
pronunciation QA before generating the full audio set.

This profile-driven start enables the same flywheel to be configured for orthopedic surgery, cardiology, oncology, behavioral health, or any other domain where the vocabulary differs. The agent’s job is to keep the workflow on the right rails: collect the clinical profile, propose or ingest terms, generate a small QA set first, route IPA misses to review, and only then build the full benchmark.

Category | Example terms
Drugs | Cefazolin, Ketorolac, Ropivacaine, Enoxaparin, Tranexamic acid
Procedures | Total knee arthroplasty, hemiarthroplasty, ORIF, arthroscopy
Anatomy | Acetabulum, tibial plateau, femoral neck, iliopsoas
Conditions | Hemarthrosis, osteomyelitis, compartment syndrome, femoroacetabular impingement
Table 1. Example clinical term categories for an orthopedic practice profile

How to generate TTS-ready synthetic audio from clinical seed terms

Starting from the profile-specific term list, the pipeline uses NeMo Data Designer to expand seed terms into a richer dataset. NeMo Data Designer generates high-quality synthetic data from scratch or from seed data. Developers define the output columns and the dependencies between them. 

NeMo Data Designer resolves the dependencies while handling batching, parallel execution, validation, and preview or full-run execution. In this flywheel, the output columns produce a complete synthetic speech record: a unique sample ID, a clinical sentence containing the target term, a pronunciation source, a Speech Synthesis Markup Language (SSML) sentence with phoneme markup when available, and the target path for the synthesized audio.     

For this pipeline, five columns transform a clinical term into a phoneme-annotated, TTS-ready sentence (Figure 1).

Diagram showing a clinical terminology list inputted into NeMo Data Designer to generate columns for sample ID, sentence, IPA, SSML, and audio path, then into Magpie TTS to generate an audio sample.
Figure 1. Pipeline for pronunciation-aware synthetic audio generation using NeMo Data Designer and NVIDIA Magpie TTS Multilingual

Column | Purpose | Skill use
sample_id | Unique ID for the generated sample | Keeps audio files, transcripts, and metric rows aligned
sentence | Clinical sentence containing the exact target term | Becomes the ASR reference transcript
ipa_pronunciation | Reviewed or dictionary-derived pronunciation candidate | Drives phoneme injection and flags review gaps
ssml_sentence | Sentence wrapped in SSML with phoneme markup when available | Becomes the TTS input
audio_filepath | Target path for the synthesized audio file | Becomes the manifest audio path
Table 2. Core columns in the generated text dataset
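
As a rough illustration of how the five columns in Table 2 fit together, the sketch below assembles one record in plain Python. It does not use the NeMo Data Designer API. The `SpeechRecord` class, the `build_record` helper, and the hash-based ID scheme are all hypothetical, chosen only so that the path resembles the `audio_<term>_<suffix>.wav` pattern shown elsewhere in this post.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechRecord:
    """One row of the generated text dataset (field names follow Table 2)."""
    sample_id: str
    sentence: str
    ipa_pronunciation: Optional[str]
    ssml_sentence: str
    audio_filepath: str


def build_record(term, sentence, ipa, ssml, audio_dir="data/audio"):
    """Assemble a record with a deterministic ID (hypothetical scheme).

    The 8-character suffix is a SHA-1 prefix of the term and sentence, so
    regenerating the same row yields the same ID and the same audio path.
    """
    digest = hashlib.sha1(f"{term}|{sentence}".encode("utf-8")).hexdigest()[:8]
    safe_term = term.replace(" ", "_")  # keep multi-word terms path-friendly
    sample_id = f"{safe_term}_{digest}"
    return SpeechRecord(
        sample_id=sample_id,
        sentence=sentence,
        ipa_pronunciation=ipa,
        ssml_sentence=ssml,
        audio_filepath=f"{audio_dir}/audio_{sample_id}.wav",
    )
```

A deterministic ID is one possible design choice here: it keeps audio files, transcripts, and metric rows aligned across reruns without a separate lookup table.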

The generated sentence prompt should preserve the exact target term. If the model substitutes a brand name, generic equivalent, abbreviation, or spelling variant, the benchmark no longer tests the intended entity. The agent skill can check for that condition and regenerate or reject rows that do not contain the exact term.
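
The exact-term check described above could look something like the following sketch. The function names and the row layout (a dict with `term` and `sentence` keys) are assumptions for illustration, not the skill's actual implementation. Matching is case-insensitive here, on the assumption that it should agree with the case-insensitive SSML injection step. Pass `ignore_case=False` for a stricter check.

```python
import re


def contains_exact_term(sentence, term, ignore_case=True):
    """Return True if `term` appears in `sentence` as a whole word or phrase.

    The lookarounds reject partial matches such as "ORIFs" for "ORIF".
    """
    flags = re.IGNORECASE if ignore_case else 0
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, sentence, flags) is not None


def split_rows(rows):
    """Separate rows that preserve the target term from rows to regenerate."""
    kept, rejected = [], []
    for row in rows:
        if contains_exact_term(row["sentence"], row["term"]):
            kept.append(row)
        else:
            rejected.append(row)
    return kept, rejected
```

A row whose sentence says "Tylenol" when the target term is "Acetaminophen" lands in `rejected`, which is the brand-name substitution case the paragraph above warns about.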

Column | Example value
Drug | Acetaminophen
Sentence | The nurse administered Acetaminophen to the patient after surgery to manage mild pain.
ipa_pronunciation | əˌsiːtəˈmɪnəfɛn
ssml_sentence | <speak>The nurse administered <phoneme alphabet="ipa" ph="əˌsiːtəˈmɪnəfɛn">Acetaminophen</phoneme> to the patient after surgery to manage mild pain.</speak>
audio_filepath | data/audio/audio_Acetaminophen_3c7a1f02.wav
Table 3. Example enriched row from the text dataset

SSML phoneme tag injection

SSML is an XML-based markup language that provides TTS engines with instructions on how to synthesize speech. It is critical for controlling aspects like pronunciation, pacing, volume, and emphasis. The SSML step wraps the generated sentence in a <speak> element and injects a <phoneme alphabet="ipa"> tag around every occurrence of the target term. The implementation uses a case-insensitive regex so the original casing in the sentence is preserved while the match remains robust.

<speak>A forty-five year old patient was prescribed
<phoneme alphabet="ipa" ph="əˌsiːtəˈmɪnəfɛn">Acetaminophen</phoneme>
once daily to manage mild pain.</speak>
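
A minimal sketch of the injection step, assuming a plain-string sentence and a single target term, is shown below. The `inject_phoneme` name is hypothetical. XML escaping of the surrounding text is an addition the post does not mention, included so that characters such as `&` do not break the SSML.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def inject_phoneme(sentence, term, ipa=None):
    """Wrap `sentence` in <speak> and tag every occurrence of `term`.

    Matching is case-insensitive, but the matched text is re-emitted as it
    appeared, so the original casing in the sentence is preserved. When no
    IPA string is available, the sentence is wrapped without phoneme markup.
    """
    if not ipa:
        return "<speak>" + escape(sentence) + "</speak>"

    pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(sentence):
        # Escape the plain text between matches, then emit the tagged term.
        parts.append(escape(sentence[last:match.start()]))
        parts.append(
            f"<phoneme alphabet=\"ipa\" ph={quoteattr(ipa)}>"
            f"{escape(match.group(0))}</phoneme>"
        )
        last = match.end()
    parts.append(escape(sentence[last:]))
    return "<speak>" + "".join(parts) + "</speak>"
```

Building the output from match spans, not from a single `re.sub` over already-escaped text, keeps the escaping and the tagging from interfering with each other.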

Manual pronunciation review for IPA gaps

Dictionary lookup covers many clinical terms, but not all of them. Newer drug names, trade names, rare procedure terms, and specialty-specific phrases may be missing or may return a pronunciation that requires review. The flywheel handles those gaps with an explicit manual review path.

When a trusted dictionary pronunciation is unavailable, an LLM-backed agent harness can propose candidate IPA strings. The important boundary is that the LLM proposal is not treated as ground truth. It is a candidate that must pass validation and human review.

The manual pronunciation loop is as follows:

  1. Flag rows with missing or low-confidence IPA
  2. Use the agent harness to propose one or more IPA candidates
  3. Validate the candidate against the TTS phoneme inventory
  4. Synthesize a short QA clip for the term in context
  5. Review to accept, edit, or reject the candidate
  6. Write accepted pronunciations to a reviewed override file
  7. Regenerate the affected SSML and audio

This process turns pronunciation gaps into a small review queue instead of a hidden benchmark-quality problem. For example, in the orthopedic practice reference session, terms such as Femoroacetabular impingement, Hemiarthroplasty, Ketorolac, Pertrochanteric, and Ropivacaine needed review or overrides. After review, the full benchmark generated 67 audio samples with no rows relying on unreviewed native TTS pronunciation.
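
The flagging, validation, and override steps of this loop can be sketched as follows. Everything here is illustrative: the `ipa_confidence` field, the 0.8 threshold, and the idea of representing the TTS phoneme inventory as a set of single characters are assumptions, and a real inventory check would need to handle multi-character phonemes and diacritics. The human listening step (steps 4 and 5) is deliberately not automated.

```python
def flag_for_review(rows, min_confidence=0.8):
    """Step 1: return rows with missing or low-confidence IPA."""
    return [
        row for row in rows
        if not row.get("ipa_pronunciation")
        or row.get("ipa_confidence", 0.0) < min_confidence
    ]


def unknown_symbols(candidate, inventory):
    """Step 3: list symbols in an IPA candidate that the TTS does not know.

    `inventory` is a hypothetical set of single-character symbols,
    including stress and length marks.
    """
    unknown = []
    for symbol in candidate:
        if symbol not in inventory and symbol not in unknown:
            unknown.append(symbol)
    return unknown


def apply_overrides(rows, overrides):
    """Steps 6 and 7: apply reviewed pronunciations, keyed by term.

    Returns new row dicts, so the caller can diff them against the
    originals to decide which SSML and audio files to regenerate.
    """
    updated = []
    for row in rows:
        if row["term"] in overrides:
            row = {**row,
                   "ipa_pronunciation": overrides[row["term"]],
                   "ipa_source": "reviewed"}
        updated.append(row)
    return updated
```

An LLM-proposed candidate would pass through `unknown_symbols` before any audio is synthesized. An empty result means only that the string is renderable, not that it is correct, which is why the listening review remains in the loop.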

The loop only works if the agent actually stops and waits for the human at the right moment. The skill itself enforces that pause. The instructions in the skills are written for the agent, not the developer, and they tell the agent in plain language that it cannot move on until the user has listened to the clips.

How to synthesize the audio and produce the manifest

Once each row has an SSML sentence and target audio path, the workflow synthesizes one audio file per generated sample. NVIDIA Magpie TTS Multilingual is a good fit for this stage because it supports SSML phoneme tags with IPA and ARPAbet. This allows the synthesizer to render the clinical term using the reviewed phoneme sequence instead of relying only on its own grapheme-to-phoneme prediction.

The final output is a NeMo-compatible JSONL manifest. Each line links an audio file to its transcript and metadata:

{
  "audio_filepath": "data/audio/audio_Acetaminophen_3c7a1f02.wav",
  "text": "The nurse administered Acetaminophen to the patient after surgery to manage mild pain.",
  "duration": 3.914,
  "term": "Acetaminophen",
  "entity_category": "drug",
  "ipa_source": "reviewed"
}

The manifest is the handoff point between SDG, ASR evaluation, and model adaptation. It is also where the benchmark keeps the metadata needed for slicing results by entity category, pronunciation source, context type, voice, or acoustic condition.
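
To make the handoff concrete, here is a small sketch of writing and slicing such a manifest with the standard library only. The helper names are hypothetical. JSONL stores one JSON object per line, so the pretty-printed entry above would occupy a single line in the actual file.

```python
import json
from collections import defaultdict


def dumps_manifest(entries):
    """Serialize entries as JSONL: one compact JSON object per line."""
    return "".join(json.dumps(e, ensure_ascii=False) + "\n" for e in entries)


def loads_manifest(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def slice_by(entries, key):
    """Group manifest entries by a metadata field such as entity_category."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry.get(key, "unknown")].append(entry)
    return dict(groups)
```

Calling `slice_by(entries, "ipa_source")` would, for example, separate reviewed pronunciations from any other source, which is the kind of slice the evaluation step relies on.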

What is the value of a skill-native clinical ASR quality flywheel?

While generating phonetically controlled audio is useful on its own, the greater value is an AI agent working together with a developer through the improvement loop. The user starts with a clinical profile. The build skill creates a benchmark. The evaluation skill reports where the ASR system struggles. The adaptation skill helps decide whether to fine-tune, expand the term list, improve pronunciation coverage, or add harder acoustic conditions. The reevaluation step then checks whether the change helped.

Circular clinical ASR quality flywheel with three stages: benchmark building, ASR evaluation, and ASR fine-tuning.
Figure 2. Skill-native clinical ASR quality flywheel

The evaluation skill includes one counterintuitive routing rule worth surfacing. If audio built from Merriam-Webster dictionary pronunciations scores well but Magpie fallback audio scores poorly, the skill routes the user back to build, not to fine-tune. That pattern is a pronunciation-coverage gap, not a model gap. Fine-tuning over a TTS pronunciation gap would teach the ASR model to recognize the TTS model's mistakes. ASR transcription itself is served by NVIDIA Nemotron Speech.

Stage | Developer intent | Skill behavior
Setup | Prepare the environment and check access | Verifies dependencies, credentials, and smoke tests
Build | Create a profile-specific benchmark | Collects specialty context, proposes terms, runs pronunciation QA, and generates the manifest
Evaluate | Measure ASR behavior on the benchmark | Runs transcription and reports aggregate and entity-level metrics
Adapt | Improve ASR quality based on failure patterns | Gates fine-tuning behind two thresholds, priority-category KER > 0.3 and manifest ≥ 100 rows, and otherwise routes back to build to grow the manifest. Fine-tuning runs use the stock NeMo Framework
Reevaluate | Check whether the change helped | Compares current and prior runs and recommends the next cycle
Table 4. Skill stages in the ASR quality flywheel
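
The routing logic can be sketched as a small decision function. The two fine-tuning thresholds (KER > 0.3 and at least 100 manifest rows) come from Table 4. The way the pronunciation-coverage gap is detected is an assumption: the post does not quantify "scores poorly," so the `gap_margin` parameter and the comparison of per-source KER values are hypothetical.

```python
def next_step(priority_ker, manifest_rows,
              dictionary_ker=None, fallback_ker=None,
              ker_threshold=0.3, min_rows=100, gap_margin=0.2):
    """Return "build" or "fine_tune" for the next flywheel cycle.

    dictionary_ker: KER on audio with dictionary-derived pronunciations.
    fallback_ker:   KER on audio that fell back to native TTS pronunciation.
    gap_margin:     hypothetical margin for calling the difference a
                    pronunciation-coverage gap.
    """
    # Coverage gap first: fix pronunciations before touching the model.
    if dictionary_ker is not None and fallback_ker is not None:
        if fallback_ker - dictionary_ker > gap_margin:
            return "build"

    # Both thresholds from Table 4 must hold to justify fine-tuning.
    if priority_ker > ker_threshold and manifest_rows >= min_rows:
        return "fine_tune"

    # Otherwise route back to build to grow the manifest.
    return "build"
```

Under these assumptions, a 67-row benchmark routes back to build regardless of KER, because it is below the 100-row threshold.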

How to benchmark ASR performance

The flywheel still reports familiar ASR metrics, but the skill presents them as decision signals. If pronunciation QA is incomplete, the next step may be review rather than model training. If entity errors cluster in one category, the next step may be more targeted data. If errors persist across reviewed terms, adaptation may be justified.

Metric | What it measures | Skill use
WER | Word error rate across the full sentence | General ASR quality signal
CER | Character error rate | Near-miss signal for long clinical terms
KER | Keyword error rate on the target clinical entity | Primary signal for whether workflow-critical terms are recognized
SER | Sentence error rate | Shows whether any error occurred in the sentence
Table 5. Metrics reported by the evaluation skill
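
For readers who want to see how these four metrics relate, the sketch below computes them with a standard edit distance. It is not the evaluation skill's code. The text normalization, the `hyp` field holding the ASR output, and the definition of KER as the fraction of samples whose target term is missing from the hypothesis are all assumptions, since the post does not spell them out.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def contains_phrase(words, phrase):
    """True if `phrase` occurs as a contiguous run inside `words`."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))


def score(samples):
    """Compute corpus-level WER, CER, KER, and SER for a non-empty sample list.

    Each sample is a dict with `text` (reference), `hyp` (ASR output,
    hypothetical field name), and `term` (target clinical entity).
    """
    word_edits = word_total = char_edits = char_total = 0
    keyword_misses = sentence_errors = 0
    for s in samples:
        ref, hyp = normalize(s["text"]), normalize(s["hyp"])
        word_edits += edit_distance(ref, hyp)
        word_total += len(ref)
        ref_chars, hyp_chars = " ".join(ref), " ".join(hyp)
        char_edits += edit_distance(ref_chars, hyp_chars)
        char_total += len(ref_chars)
        keyword_misses += not contains_phrase(hyp, normalize(s["term"]))
        sentence_errors += ref != hyp
    n = len(samples)
    return {
        "wer": word_edits / word_total,
        "cer": char_edits / char_total,
        "ker": keyword_misses / n,
        "ser": sentence_errors / n,
    }
```

A hypothesis such as "start ketone lack tonight" for the reference "Start Ketorolac tonight." shows why KER is reported separately: the sentence is mostly right at the word level, yet the one term the workflow depends on is lost.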

In the orthopedic practice simulation, the entity-level metrics made the next step clear: medication names were the weakest category, and the follow-up cycle focused on pronunciation review, additional drug-name coverage, and model adaptation. The result was not a production benchmark, but it showed how the flywheel can turn a clinical ASR failure pattern into a concrete improvement path.

What are the limitations of the flywheel?

  • Synthetic audio is not a substitute for real clinical audio. It is a controllable way to create targeted stress tests, especially for rare terms, but production validation still requires real-world audio from the intended setting.
  • Pronunciation control still needs human review. Dictionary lookup works well for many medical terms, but not every term appears in a trusted dictionary. Automated pronunciation proposals can accelerate review, but they should not be treated as ground truth without audio inspection.
  • The current benchmark is small. The orthopedic practice simulation demonstrates the flywheel on a small set of generated samples. Stronger claims require held-out terms, more contexts, more speakers, acoustic perturbations, repeated runs, and real audio.
  • Clean-audio performance is not enough. Clinical environments include alarms, overlapping speakers, masks, telehealth microphones, room reverberation, ambulance noise, and dictation artifacts. The next version of the benchmark should include acoustic stress profiles.

Get started with clinical ASR agent skills

Clinical ASR improvement requires more than a one-time dataset or aggregate score. You need a workflow that helps you define the clinical profile, generate pronunciation-aware synthetic audio, measure ASR quality on the terms that matter, adapt the model when appropriate, and reevaluate the result.

The workflow described in this post starts with a simple conversation and ends with a repeatable ASR improvement loop. NVIDIA NeMo Data Designer handles the text-enrichment layer. Magpie TTS Multilingual synthesizes pronunciation-controlled audio. The NeMo-compatible manifest connects generation, evaluation, adaptation, and reporting. AI agent skills make the process repeatable by guiding term curation, IPA review, benchmark generation, scoring, and next-step decisions.

The orthopedic practice simulation shows the workflow pattern: configure a profile-specific term list, generate reviewed synthetic audio, inspect entity-level errors, and decide the next action. The larger contribution is the repeatable loop: profile-driven benchmarks, pronunciation-aware TTS, explicit review gates, and entity-level evaluation. 

Ready to get started? Explore NVIDIA agent skills to use the clinical ASR agent workflow as a guide for building profile-driven benchmarks, reviewing pronunciations, generating synthetic clinical audio, and evaluating ASR output with entity-level metrics.

About the Authors

About John Jahanipour
John Jahanipour is a senior solutions architect on the Worldwide Field Operations team at NVIDIA. His work focuses on enabling enterprise adoption of accelerated computing platforms, particularly in the healthcare and manufacturing industries, with an emphasis on generative AI. John's technical expertise spans AI model optimization, GPU infrastructure sizing, and deployment strategies for both cloud and on-premises environments. Prior to joining NVIDIA, he contributed to AI research and development in both academic and industrial settings. He holds a Ph.D. in Electrical Engineering with a specialization in applied machine learning and AI.
About Ben Randoing
Ben Randoing is an applied AI engineer currently working to support AI adoption in healthcare. He holds a bachelor’s degree in Biomedical Engineering from Duke University and a master’s degree in Computer Science from Stanford, where he conducted research at both the Stanford Center for Artificial Intelligence in Medicine and Imaging (AIMI) and the Neuromuscular Biomechanics Lab. His industry career spans Apple, where he worked on the health technology team developing moonshot projects in consumer health, and NVIDIA, where he contributed to conversational AI, clinical NL2SQL, multimodal retrieval, and fine-tuning pipelines.
About Abood Quraini
Abood Al-Quraini is a technical marketing engineering manager for Healthcare AI at NVIDIA. He leads a team working on product solutions for the digital health space. He focuses on utilizing NVIDIA Blueprints and NVIDIA NIM microservices to create reference workflows, demos, and tutorials, inspiring developers and researchers to solve real-world healthcare challenges with generative AI and Agents. Abood holds a bachelor's degree in Electrical Engineering from Lehigh University, a master’s degree in Electrical Engineering from McGill University and an MBA degree from Santa Clara University.
About Jonny Hancox
Jonny Hancox is a senior solutions architect on the Digital Health team at NVIDIA. He works with the ecosystem of developers and scientists in the clinical domain to extract the most value from NVIDIA hardware and software platforms. With a software engineering background, Jonny has worked in many roles in his career but mostly within the Healthcare and Life Sciences domain.
