Building an Open Source Edge Semantic Cache for LLMs in Rust/WASM – Sanity check on the architecture? [D]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
Hey everyone,
I am planning out a new open-source infrastructure project and want to get some brutal feedback on the architecture and use-case validity from people running high volume LLM workloads in production.
The Problem: Python-based proxies/gateways introduce too much latency overhead for real-time streaming agent steps or fast UI completions. Additionally, centralized semantic caching still suffers from cross-region network latency (e.g., London to us-east-1), and enterprise API costs remain a massive bottleneck for repetitive/predictable user queries (like customer support or structured data extraction).
The Proposed Architecture: Instead of a heavy centralized gateway, the goal is to build a lightweight, zero-dependency semantic cache running directly at the CDN Edge using WebAssembly (WASM) compiled from Rust.
The flow looks like this:
- Inbound Prompt: Hits the edge node closest to the user (e.g., Cloudflare Workers / Fastly Compute).
- Edge Embedding: The Rust/WASM module intercepts the raw text prompt and instantly generates a vector using an edge-native lightweight model (e.g.,
bge-small-en-v1.5). - Similarity Index Check: It performs a fast cosine similarity check against an edge vector database (like Cloudflare Vectorize) to find the nearest semantic neighbor.
- Cache Hit: If similarity >= threshold (e.g., 0.88), it pulls the full generated response text from an edge KV store and returns it in ~5ms. The main LLM provider is never billed or touched.
- Cache Miss: It proxies the streaming request to OpenAI/Anthropic/vLLM, streams it back to the client, and asynchronously updates the edge vector index and KV store.
Why Rust/WASM? To achieve sub-millisecond execution overhead on the proxy itself, avoid garbage collection pauses, and maintain a tiny memory footprint suitable for edge runtime constraints where traditional databases or Python scripts cannot run.
My Questions for the Community:
- For those running LLMs in production (especially customer support, internal RAG, or autonomous agents), what is your realistic semantic cache hit rate? Is the power law of repetitive queries high enough in your domains to justify this?
- What are the biggest footguns with semantic caching at the edge? (e.g., Cache invalidation strategies, handling system prompt updates, or drift in embedding models).
- Would you actually use a drop-in open-source template/CLI that lets you spin this up on your own edge account, or do you prefer centralized API gateways?
[link] [comments]
More from r/MachineLearning
-
Derivative-Free Neural Network Optimization: MNIST Case [R]
Jun 13
-
is a preprint from an independent researcher worthy of arxiv endorsement if it got cited by a Peking University lab's paper 1 month after release? [D]
Jun 12
-
AMAZON ML SUMMER SCHOOL 2026 [D]
Jun 12
-
Just thinking, what about conducting a 1 day virtual session on fundamentals of computer vision ??? [D]
Jun 12
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.