r/LocalLLaMA · June 20, 2026 · 3 min read

Giving a local agent web access without paid search/scrape APIs: SearXNG + Scrapling

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I wanted web access for a local-first agent without reaching for Tavily, Serper, Firecrawl, etc.

Three requirements for this agent path: no paid API keys, a search service I control, and page extraction I can run myself.

What I ended up with is two tools: web_search and web_extract. Nothing fancy. Mostly just wiring together good open-source pieces.

1. Search -> SearXNG

SearXNG is a self-hostable metasearch engine. I run it in Docker and point the agent at its JSON endpoint.

The search call is roughly:

    GET {SEARXNG_URL}/search?q=<query>&format=json&pageno=1

Then I cap the results and normalize them to: {title, url, description}

description is just the SearXNG snippet. It is not page content.
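A minimal sketch of the search call and the normalization step, using only the standard library. The function names and the default limit of 5 are my own placeholders, not from the post. It assumes the usual SearXNG JSON shape, where each entry in "results" carries "title", "url", and a "content" snippet.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url: str, query: str, page: int = 1) -> str:
    """Build the SearXNG JSON search URL for a query."""
    params = urlencode({"q": query, "format": "json", "pageno": page})
    return f"{base_url.rstrip('/')}/search?{params}"


def normalize_results(payload: dict, limit: int = 5) -> list:
    """Cap the results and keep only title, url, and the snippet."""
    normalized = []
    for item in payload.get("results", [])[:limit]:
        normalized.append({
            "title": item.get("title") or "",
            "url": item.get("url") or "",
            # SearXNG's snippet lives in "content"; it is not page text.
            "description": item.get("content") or "",
        })
    return normalized


def web_search(base_url: str, query: str, limit: int = 5) -> list:
    """Call the SearXNG instance and return normalized results."""
    with urlopen(build_search_url(base_url, query), timeout=15) as response:
        payload = json.load(response)
    return normalize_results(payload, limit)
```

With the config from this post, a call would look like web_search("http://localhost:8080", "some query").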

Config is basically:

    SEARXNG_URL=http://localhost:8080

Gotchas:

Add json to search.formats in SearXNG settings.yml.
Public SearXNG instances are usually a bad fit for programmatic use.
SearXNG is search-only. Use extraction when the agent needs to read a page.
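The first two gotchas tend to show up as HTTP errors rather than empty results. Here is a small, hypothetical helper that turns the status code into a hint for the agent's logs. The mapping is a heuristic based on how SearXNG commonly behaves (403 when the requested format is not enabled, 429 when a limiter kicks in), not something guaranteed for every instance.

```python
def explain_search_status(status_code: int) -> str:
    """Map an HTTP status from /search to a likely cause (heuristic only)."""
    if status_code == 200:
        return "ok"
    if status_code == 403:
        # Commonly seen when "json" is missing from search.formats.
        return "forbidden: check that json is listed under search.formats in settings.yml"
    if status_code == 429:
        # Typical for public instances that limit programmatic clients.
        return "rate limited: use your own instance or slow down"
    return f"unexpected status {status_code}"
```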

2. Extract -> Scrapling + Trafilatura

Search snippets are not enough. The agent needs to read the actual page.

For web_extract, I use Scrapling with two paths:

Fast path: Fetcher.get(url, impersonate="chrome"). No browser. Good for normal pages.
Stealth path: if the fast path is empty, blocked, or challenge-looking, try a real headless browser:

    from scrapling.fetchers import StealthyFetcher

    page = StealthyFetcher.fetch(
        url,
        headless=True,
        solve_cloudflare=True,
        block_webrtc=True,
        hide_canvas=True,
    )

The stealth path is an attempt, not a guaranteed bypass. If the page still shows a CAPTCHA or Cloudflare wall, I mark the result as blocked/partial.
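The fast-then-stealth flow can be sketched roughly like this. To keep the example independent of any particular Scrapling version, the two fetchers are passed in as plain callables that take a URL and return HTML. FetchResult, looks_blocked, and the marker strings are hypothetical names of mine, and the challenge check is a crude heuristic that can misfire on pages that merely mention CAPTCHAs.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical marker strings; tune them for the pages you actually see.
CHALLENGE_MARKERS = (
    "just a moment",
    "verify you are human",
    "attention required",
    "captcha",
)


@dataclass
class FetchResult:
    url: str
    html: str
    status: str  # "ok" or "blocked"
    path: str    # "fast" or "stealth"


def looks_blocked(html: Optional[str]) -> bool:
    """Heuristic: an empty body or a challenge-looking page."""
    if not html or not html.strip():
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def fetch_html(
    url: str,
    fast_fetch: Callable[[str], Optional[str]],
    stealth_fetch: Callable[[str], Optional[str]],
) -> FetchResult:
    """Try the cheap fetch first; fall back to the browser only if needed."""
    try:
        html = fast_fetch(url)
    except Exception:
        html = None  # treat a failed fast fetch like an empty page
    if not looks_blocked(html):
        return FetchResult(url, html, "ok", "fast")

    try:
        html = stealth_fetch(url)
    except Exception:
        html = None
    # The stealth path is only an attempt, so re-check before trusting it.
    status = "blocked" if looks_blocked(html) else "ok"
    return FetchResult(url, html or "", status, "stealth")
```

In this setup, fast_fetch would wrap the Fetcher.get call and stealth_fetch would wrap the StealthyFetcher.fetch call shown above. How you pull the HTML string out of the response object depends on your Scrapling version.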

Once I have HTML, Trafilatura turns it into Markdown with links and tables. Markdown is much easier for the model than raw HTML. I also keep a visible-text fallback for pages where Trafilatura under-extracts.
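A rough sketch of the conversion step with the visible-text fallback. It assumes a Trafilatura version whose extract() supports output_format="markdown". The 200-character threshold for "under-extracted" and all function names are placeholders of mine, not values from the post.

```python
from html.parser import HTMLParser
from typing import Callable, Optional


class _VisibleText(HTMLParser):
    """Collect text that sits outside script/style/noscript/template tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Plain-text fallback: one line per visible text chunk."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def trafilatura_markdown(html: str) -> Optional[str]:
    """Main path: Trafilatura to Markdown, keeping links and tables."""
    import trafilatura  # third-party; imported lazily so the fallback works alone
    return trafilatura.extract(
        html,
        output_format="markdown",
        include_links=True,
        include_tables=True,
    )


def html_to_markdown(
    html: str,
    extractor: Callable[[str], Optional[str]] = trafilatura_markdown,
    min_chars: int = 200,
) -> str:
    """Use the extractor's output unless it looks under-extracted."""
    markdown = extractor(html) or ""
    if len(markdown) >= min_chars:
        return markdown
    fallback = visible_text(html)
    # Keep whichever version recovered more text.
    return fallback if len(fallback) > len(markdown) else markdown
```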

Other pieces that mattered:

PDFs: PDF URLs go through pypdf.
Challenge detection: CAPTCHA/security pages get flagged instead of treated as real content.
SSRF guard: the requested URL, every redirect hop, and the final URL are checked against private/internal ranges. Caveat: this is not a network-level guard for every browser subrequest.
Optional summarization: large pages can be summarized by a configurable auxiliary model before they go back into context.
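For the SSRF guard, here is a minimal standard-library sketch of the per-URL check. The names is_public_ip, resolve_host, and is_url_allowed are hypothetical. The idea is to resolve the hostname and refuse the URL unless every resolved address is public. You would call it on the requested URL, on each redirect target, and on the final URL. As noted above, it does not cover subrequests made inside the browser.

```python
import ipaddress
import socket
from typing import Callable, List, Optional
from urllib.parse import urlsplit


def is_public_ip(address: str) -> bool:
    """True only for addresses outside private/internal ranges."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False  # unparseable addresses are treated as unsafe
    # Unwrap IPv4-mapped IPv6 such as ::ffff:127.0.0.1 before checking.
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped:
        ip = ip.ipv4_mapped
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def resolve_host(hostname: str) -> List[str]:
    """Resolve a hostname to every address the system resolver returns."""
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def is_url_allowed(
    url: str,
    resolve: Optional[Callable[[str], List[str]]] = None,
) -> bool:
    """Allow only http(s) URLs whose host resolves to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = (resolve or resolve_host)(parts.hostname)
    except OSError:
        return False  # resolution failed, so refuse the URL
    # Every resolved address must be public, not just the first one.
    return bool(addresses) and all(is_public_ip(a) for a in addresses)
```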

Why this combo

No paid search/scrape API keys for this path.
Queries go through my SearXNG instance, not a vendor API tied to my account.
SearXNG still hits upstream engines, so this is not "zero third-party contact."
Most pages use the fast path. The browser only kicks in when needed.
The final output is Markdown, not HTML soup.

Honest tradeoffs

The stealth path is slow. Keep it as a fallback.
SearXNG quality depends on enabled upstream engines and rate limits.
Paid search APIs can still be better. This has been good enough for my use.
Cloudflare/browser scraping is always a moving target.

Not claiming this is the optimal setup. It is just one that has worked for me and stays self-hostable.

Curious what others are using for this. Has anyone found something better than SearXNG for self-hosted search, or a lighter alternative to a full browser for the hard pages?

Happy to share more details if anyone's trying something similar.

submitted by /u/luke_pacman