NVIDIA Developer Blog · 12 min read

Transform Video Into Instantly Searchable, Actionable Intelligence with AI Agents and Skills


AI-Generated Summary

  • NVIDIA Metropolis Blueprint for video search and summarization (VSS) helps organizations turn large amounts of video into searchable and actionable insights in real time.
  • VSS uses microservices, vision-language models, and large language models to enable fast video monitoring, trend detection, and decision-making.
  • Developers can automate VSS deployment and video analysis using coding agents like Codex and OpenClaw, which simplify setup and interaction through chat interfaces.


In today’s data-driven world, organizations increasingly rely on video to capture critical information, yet extracting meaningful, real-time insights from massive amounts of footage remains a challenge. NVIDIA Metropolis Blueprint for video search and summarization (VSS) overcomes this hurdle by transforming millions of live video streams or hours of recorded video into instantly searchable, actionable intelligence.

VSS provides a reference architecture for building video analytics AI agents that perceive, reason, and act in real time on massive volumes of live video streams and recorded data. It uses accelerated vision-based microservices, vision-language models (VLMs), large language models (LLMs), and retrievers for real-time video intelligence, agentic search, and automated reporting. VSS helps enterprises monitor operations, detect trends, and make informed decisions faster than ever. The latest version brings a new modular design, an advanced fusion search capability, and a set of skills for easy integration with autonomous agents.

In this post, you will learn how to use the new VSS skills with coding agents to automate VSS deployment and integration into custom applications, followed by a deep dive into the technology behind VSS 3.

You can also join us live on Wednesday, May 13, at 9 am PT, to learn how to build a video analytics AI agent with VSS skills. 

Diagram shows the architecture of VSS, including real-time video intelligence, downstream analytics, and agentic and offline processing
Figure 1. VSS architecture is composed of a set of microservices, databases, and agents for analyzing videos

Build a video AI agent with VSS skills and coding agents

In the past, developers had to manually configure, deploy, and integrate the rich set of microservices VSS provides for video management, search, summarization, and more to build video analytics applications. Today, you can use coding agents augmented with VSS skills to automate the deployment, usage, and integration of VSS through a simple agentic chat interface.

VSS skills are hosted in the VSS GitHub repository and follow the agent skills specification, allowing them to be used with a wide variety of agents. The prerequisites are a system set up to run VSS and a skills-compatible agent such as Codex, Claude Code, OpenClaw, or NemoClaw.

First, we will show an example of how to add VSS skills to Codex and use it to deploy the VSS search profile. Then, we will show how to add VSS skills to OpenClaw, which allows us to interact with our VSS deployment through nearly any chat interface to search and analyze large volumes of video.

Setting up the VSS prerequisites

The first step is to prepare a system to run VSS. The easiest way to do this is to use the NVIDIA Brev Launchable for VSS. Go to the VSS launchable documentation page, click the “Launch Blueprint” button, and then “Deploy Launchable.”

Once deployed, click the Open Notebook button and navigate to the /video-search-and-summarization/scripts/deploy_vss_launchable.ipynb notebook. Paste your NGC_CLI_API_KEY from NGC into the first cell and then execute the entire notebook, including the tear-down section. This ensures the system is fully set up for VSS so you can use the deployment skill to manage your VSS deployment from your coding agent.
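
The first cell only needs to place your key in the environment before the rest of the notebook runs. The actual cell contents may differ; as a minimal sketch, it could look like:

import os
os.environ["NGC_CLI_API_KEY"] = "nvapi-..."  # paste your key from NGC here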

Once the notebook has run to completion, install the Brev CLI on your host system, launch VSCode, and remotely connect to your Brev instance following the Using Brev CLI (SSH) section on your Launchable page, as shown in Figure 2, below.

A screenshot showing the NVIDIA Brev Web UI with instructions for setting up the Brev command line interface
Figure 2. NVIDIA Brev Launchable page for using the Brev CLI 

Once you have remote access configured, you can install Codex through the VSCode extension to use as the coding agent.

Deploying VSS with Codex

In VSCode, use the Extensions tab to search for and install Codex. Once installed, you need to add the VSS skills. You can do this by telling Codex to self-install the VSS skills and providing it the location of the VSS GitHub repository, as shown in the following prompt:

Read ~/video-search-and-summarization/skills/README.md and every SKILL.md file under ~/video-search-and-summarization/skills/. For each skill in the catalog, install it for this host so I can invoke it from a shell or chat session. Use the host's standard skills directory:

Claude Code: ~/.claude/skills/<name>/
Codex: ~/.codex/skills/<name>/
Hosts that follow the agentskills.io universal path: ~/.agents/skills/<name>/
Symlink each skill folder rather than copying it so a git pull here keeps every install up to date. Skip skills that are already installed and pointing at this checkout. When you're done, list the skills you registered and which directory you used.
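
Under the hood, the agent is just creating symlinks into the host's skills directory. If you prefer to register the skills by hand, a minimal Python sketch, assuming Codex as the host (so ~/.codex/skills/ is the target directory), looks like:

from pathlib import Path

repo = Path.home() / "video-search-and-summarization" / "skills"
target = Path.home() / ".codex" / "skills"  # Codex's skills directory
target.mkdir(parents=True, exist_ok=True)

for skill in sorted(p for p in repo.iterdir() if p.is_dir()):
    link = target / skill.name
    if not link.exists():
        # Symlink rather than copy, so a git pull keeps every install up to date
        link.symlink_to(skill)
        print(f"registered {skill.name} -> {skill}")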

Figure 3, below, shows how the agent will respond, verifying that it can access the VSS skills. 

A screenshot from a Codex chat listing out the VSS skills it has available
Figure 3. Codex’s response verifying VSS skill availability

Once your agent is loaded with the VSS skills, you can use it to deploy the various VSS components and profiles. For example, you can ask Codex to deploy the new VSS search profile, as shown in Figure 4, below.

A screenshot of Codex chat showing it successfully deployed the VSS search profile
Figure 4. Codex successfully deploys the VSS search profile

Codex will then plan out the deployment, configure the necessary environment variables, and deploy all the containers needed to enable the VSS search capability. From here, you can continue using Codex to interact with VSS for searching videos, or continue to the next section to see how to also use OpenClaw with VSS skills.

Searching videos with VSS and OpenClaw

With the search profile running, you can install and configure OpenClaw to be an autonomous agent for analyzing videos using VSS.

We will show you how to set up OpenClaw on the Brev system to see what a powerful autonomous agent can do. Follow the standard OpenClaw installation instructions from the VSCode terminal connected to the Brev instance, using the recommended installer script.

After running through the initial configuration, you can hatch your agent, as shown in Figure 5, below, and give it some context that it will be an agent for building video analytics applications using VSS.

A screenshot of the OpenClaw terminal user interface during initial setup
Figure 5. Hatching OpenClaw with context about VSS

After the initial setup, you need to provide OpenClaw with the VSS skills. The easiest way to do this is to copy the skills into the OpenClaw workspace manually.

mkdir -p ~/.openclaw/workspace/skills
cp -r ~/video-search-and-summarization/skills/* ~/.openclaw/workspace/skills

Now, open the OpenClaw UI by running the openclaw dashboard command in the terminal, which returns a clickable link to access the OpenClaw UI. Once it opens, you can verify that OpenClaw has access to the VSS skills.

A screenshot of OpenClaw UI with a prompt to verify access to VSS skills
Figure 6. OpenClaw verifying VSS skills

Now you can tell OpenClaw to use the VSS search profile deployed in the previous section to start analyzing large volumes of video data. For this example, you will provide a path to three 10-minute videos captured in a warehouse that need to be analyzed for safe ladder usage. You want OpenClaw to use the search capability to find all instances of ladder usage in the videos and verify that the worker is wearing a hardhat and safety vest. For this, you will use the following prompt:

I have a set of warehouse videos located at ~/warehouse_videos. I need to find any instances of a worker climbing a ladder and verify they are wearing a hardhat and safety vest. Can you do this with the VSS Search profile that is deployed? 

Once prompted, OpenClaw will start working behind the scenes to figure out the necessary skills and associated tool calls it needs to make to complete the task. 

OpenClaw makes use of the VSS skills to upload your video files to VIOS, ingest the videos through the embedding microservices to generate searchable indexes, and then use the fusion search capability in VSS to find the video clips where a worker wearing a hardhat and safety vest is climbing a ladder.
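
The actual endpoints and payloads are defined by the VSS skills and your deployment; the Python sketch below is purely illustrative, with hypothetical paths, ports, and field names standing in for the real API:

import requests

VSS = "http://localhost:8100"  # hypothetical address of the VSS search deployment

# 1. Upload a video for ingestion (hypothetical endpoint and response fields)
with open("warehouse_videos/cam1.mp4", "rb") as f:
    file_id = requests.post(f"{VSS}/files", files={"file": f}).json()["id"]

# 2. Run a fusion search over the ingested indexes (hypothetical endpoint)
hits = requests.post(
    f"{VSS}/search",
    json={
        "files": [file_id],  # hypothetical field: restrict search to this upload
        "query": "worker wearing a hardhat and safety vest climbing a ladder",
    },
).json()

for hit in hits["results"]:
    print(hit["file"], hit["start_time"], hit["end_time"], hit["score"])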

Two screenshots side by side showing the OpenClaw Chat UI with search results for ladder and PPE usage in warehouse videos
Figure 7. OpenClaw results using the VSS search profile to verify safe ladder usage

Once it’s done, OpenClaw returns a concise report of all ladder usage seen across the videos as well as screenshots from the videos. 

This section covered just one simple example of using Codex for deployment and OpenClaw for video analysis with VSS skills. Augmenting agents with VSS skills opens up endless possibilities for extracting valuable insights from video data and building new applications with VSS.

Now you can dive deeper into the technology that powers the rich set of video analysis capabilities in VSS 3. 

Smarter video: From alerts to search

Large-scale video search remains one of the most challenging frontiers in modern information retrieval. User queries are inherently complex and ambiguous—capturing full semantic intent within a single visual embedding is fundamentally insufficient, particularly when objects and events carry multi-layered attributes that resist simple vector representation.

At massive scale, locating a specific moment across millions of hours of footage becomes a true “needle in a haystack” problem, where nearest-neighbor search over a monolithic embedding space quickly degrades in both precision and recall.

Video 1: Agentic video search by attributes, events, and actions using natural language

Addressing these limitations requires a more sophisticated search architecture built on two core capabilities: 

  • Multi-type embedding extraction and retrieval, combined with relevance filtering and semantic deduplication.
  • Search orchestration driven by agentic reasoning: decomposing complex queries into tractable sub-queries, applying reasoning-based retrieval strategies at each step, and running iterative verification and reflection loops to progressively refine results.

The search architecture first uses the RTVI-CV and RTVI-embedding microservices to ingest video and extract features. The VSS agent then uses this feature data and vision-aware tools to perform a deep, iterative search over the video, creating a plan and retrieving results to locate specific objects or events in the video timeline.
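
As a rough sketch of the multi-embedding idea, with toy embedders and in-memory indexes standing in for the real microservices, each query is embedded into several spaces, each index is searched independently, and the per-index hits are fused and deduplicated into one ranked list:

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the embedding microservices: one embedder per embedding type.
# (Random projections here; real embedders would actually encode the query text.)
embedders = {
    "visual": lambda text: rng.standard_normal(64),
    "action": lambda text: rng.standard_normal(64),
}

# One in-memory index per embedding type: (clip_id, vector) pairs.
indexes = {
    name: [(f"clip_{i}", rng.standard_normal(64)) for i in range(100)]
    for name in embedders
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fused_search(query, top_k=5):
    scores = {}
    for name, embed in embedders.items():
        q = embed(query)
        # Search each index independently, then keep the best score per clip:
        # a simple stand-in for relevance filtering and semantic deduplication.
        for clip_id, vec in indexes[name]:
            scores[clip_id] = max(cosine(q, vec), scores.get(clip_id, -1.0))
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

print(fused_search("worker climbing a ladder"))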

Diagram of a multi-embedding search pipeline where a query is converted into multiple embeddings, searched in separate indexes, and combined to return ranked results
Figure 8. Process of multi-embedding search

Modular architecture brings high flexibility and performance 

VSS is designed around a Docker Compose-based modular developer profile system: a base agent deploys in under five minutes, and additional workflows are layered on top as needed.

Workflow             | Profile                | Core Capability
Base / Q&A           | base                   | VLM-based Q&A and report generation on short clips
Alert Verification   | alerts (verification)  | CV pipeline + Behavior Analytics + VLM verification
Real-Time VLM Alerts | alerts (VLM)           | Continuous VLM anomaly detection on live streams
Search               | search                 | Agentic multi-embedding search across video archives
Video Summarization  | lvs                    | Chunked summarization of extended recordings

Table 1. Available VSS developer profiles
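
Because each profile maps to a Docker Compose profile, a deployment skill can bring up a workflow with a single compose invocation. A minimal Python sketch, assuming the repository's compose file names its profiles as in Table 1:

import subprocess

# Bring up the search profile; the profile name is assumed to match Table 1.
subprocess.run(
    ["docker", "compose", "--profile", "search", "up", "-d"],
    check=True,
)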

Each workflow is supported on several types of GPUs in various configurations to meet your hardware and performance needs. 

Let’s look at some benchmarks for the various workflows and configurations. 

The agentic search workflow can be characterized by its maximum concurrent input streams, the time it takes to ingest the incoming streams, and the retrieval latency to receive a search result. Table 2, below, shows these metrics on single-GPU configurations for the H100 and NVIDIA RTX PRO 6000.

GPU             | Max Concurrent Streams | Max Ingestion Latency (s) | Retrieval Latency (s)
1x H100         | 33                     | 0.079                     | 2.24
1x RTX PRO 6000 | 51                     | 0.101                     | 1.87

Table 2. Key performance metrics for the agentic search workflow

For the alert verification workflow, the maximum number of concurrent streams is measured along with the latency for the verification to take place. Table 3, below, shows these metrics measured using RT-DETR as the detector and Cosmos Reason 2 as the VLM verifier, operating on streams with an average of one alert event per minute.

GPU                        | Max Concurrent Streams | Verification Latency (s)
1x DGX Spark / 1x AGX Thor | 14                     | 0.89
1x H100                    | 147                    | 1.01
1x RTX PRO 6000            | 87                     | 0.82

Table 3. Key performance metrics for the alert verification workflow

The long video summarization (LVS) microservice rapidly produces summaries of hours of video footage. Figure 9, below, shows the time it takes for a given GPU configuration to summarize an hour-long video. Scaling the LVS microservice to multiple GPUs can greatly decrease the summarization time.

Bar chart showing tokens and time required for summarization using various hardware setups
Figure 9. Time to summarize a 1-hour video using the long video summarization microservice on a variety of GPU topologies

Get started with VSS skills

VSS skills enable developers to transform video into searchable, meaningful data using natural language—making it easier to uncover insights, generate summaries, and build smarter applications.

To dive deeper into VSS, see the documentation. Explore all VSS skills on GitHub.

For technical questions, visit our forum.


Tags

Agentic AI / Generative AI | Computer Vision / Video Analytics | General | Energy | Manufacturing | Retail / Consumer Packaged Goods | Smart Cities / Spaces | Telecommunications | Blueprint | Metropolis | Intermediate Technical | AI Agent | GTC 2026 | News | VLMs

About the Authors

About Samuel Ochoa
Samuel Ochoa is a technical marketing engineer with the Metropolis team at NVIDIA, focusing on bringing AI to industrial applications. Prior to NVIDIA, he graduated from the University of Texas at Austin with a BS and MS in Computer Engineering with a specialty in machine learning for edge devices.
About Adam Ryason
Adam Ryason is a product manager on the NVIDIA Metropolis team and leads the AI Blueprint for Video Search and Summarization project. An experienced startup founder and researcher, his background consists of vision-based AI, digital twins, and human-computer interaction. He holds a Ph.D. in mechanical engineering from Rensselaer Polytechnic Institute.
About Debraj Sinha
Debraj Sinha is a Product Marketing Manager for Metropolis at NVIDIA, focusing on building smarter spaces around the world with AI-enabled video analytics. Debraj collaborates with partners ranging from startups to Fortune 500 companies to market AI applications that drive safety and efficiency gains. He holds an MBA degree from Haas School of Business, University of California, Berkeley and a Master's degree in Computer Science from Cornell University.
