Giving a local agent web access without paid search/scrape APIs: SearXNG + Scrapling
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I wanted web access for a local-first agent without reaching for Tavily, Serper, Firecrawl, etc.
For this agent path, I wanted no paid API keys, a search service I control, and page extraction I can run myself.
What I ended up with is two tools: web_search and web_extract. Nothing fancy. Mostly just wiring together good open-source pieces.
1. Search -> SearXNG
SearXNG is a self-hostable metasearch engine. I run it in Docker and point the agent at its JSON endpoint.
The search call is roughly:
```text
GET {SEARXNG_URL}/search?q=<query>&format=json&pageno=1
```
Then I cap the results and normalize them to: {title, url, description}
description is just the SearXNG snippet. It is not page content.
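A minimal sketch of that search call and the normalization step, using only the standard library. The function names, the default result cap of 5, and the timeout are illustrative assumptions, not the poster's actual code. The snippet is read from the `content` field of each SearXNG result.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, query, page=1):
    """Build the SearXNG JSON search URL for a query."""
    params = urlencode({"q": query, "format": "json", "pageno": page})
    return f"{base_url.rstrip('/')}/search?{params}"


def normalize_results(payload, limit=5):
    """Reduce a SearXNG JSON payload to capped {title, url, description} dicts."""
    out = []
    for item in payload.get("results", []):
        url = item.get("url")
        if not url:
            continue  # skip entries the agent could not follow up on
        out.append({
            "title": item.get("title") or "",
            "url": url,
            # SearXNG's snippet field; this is not page content
            "description": item.get("content") or "",
        })
        if len(out) >= limit:
            break
    return out


def web_search(query, base_url="http://localhost:8080", limit=5, timeout=10):
    """Hypothetical tool entry point: query SearXNG and return normalized results."""
    with urlopen(build_search_url(base_url, query), timeout=timeout) as resp:
        return normalize_results(json.load(resp), limit)
```

Keeping URL building and normalization separate from the network call makes both easy to test without a running SearXNG instance.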
Config is basically:
```text
SEARXNG_URL=http://localhost:8080
```
Gotchas:
- Add json to search.formats in SearXNG settings.yml.
- Public SearXNG instances are usually a bad fit for programmatic use.
- SearXNG is search-only. Use extraction when the agent needs to read a page.
2. Extract -> Scrapling + Trafilatura
Search snippets are not enough. The agent needs to read the actual page.
For web_extract, I use Scrapling with two paths:
- Fast path: Fetcher.get(url, impersonate="chrome"). No browser. Good for normal pages.
- Stealth path: if the fast path is empty, blocked, or challenge-looking, try a real headless browser:

```python
StealthyFetcher.fetch(
    url,
    headless=True,
    solve_cloudflare=True,
    block_webrtc=True,
    hide_canvas=True,
)
```
The stealth path is an attempt, not a guaranteed bypass. If the page still shows a CAPTCHA or Cloudflare wall, I mark the result as blocked/partial.
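The fast-then-stealth flow can be sketched roughly as follows. To keep the example library-independent, the two fetchers are passed in as plain callables that return `(status, html)`; in practice they would wrap the Scrapling calls above. The marker strings, status codes, and result fields are assumptions for illustration, and the challenge check is a crude heuristic that will produce false positives on some pages.

```python
# Hypothetical markers; real challenge pages vary and change over time.
CHALLENGE_MARKERS = (
    "just a moment",
    "verify you are human",
    "attention required",
    "captcha",
)


def looks_blocked(status, html):
    """Heuristic: empty body, a blocking status, or challenge-looking text."""
    if status in (403, 429, 503):
        return True
    text = (html or "").strip().lower()
    if not text:
        return True
    # Only scan the start of the page to limit false positives.
    return any(marker in text[:5000] for marker in CHALLENGE_MARKERS)


def _safe_fetch(fetch, url):
    """Call a fetcher and treat any exception as an empty response."""
    try:
        return fetch(url)
    except Exception:
        return None, ""


def fetch_with_fallback(url, fast_fetch, stealth_fetch):
    """Try the cheap path first; use the browser path only when it looks blocked."""
    status, html = _safe_fetch(fast_fetch, url)
    if not looks_blocked(status, html):
        return {"url": url, "html": html, "path": "fast", "blocked": False}

    status, html = _safe_fetch(stealth_fetch, url)
    # The stealth path is an attempt: report honestly if it is still blocked.
    return {
        "url": url,
        "html": html,
        "path": "stealth",
        "blocked": looks_blocked(status, html),
    }
```

Returning a `blocked` flag rather than raising lets the agent see that a page was unreadable instead of mistaking a challenge page for real content.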
Once I have HTML, Trafilatura turns it into Markdown with links and tables. Markdown is much easier for the model than raw HTML. I also keep a visible-text fallback for pages where Trafilatura under-extracts.
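A rough sketch of the extraction step with a visible-text fallback. It assumes a recent Trafilatura release that supports `output_format="markdown"`; the `min_chars` threshold for deciding that Trafilatura under-extracted, and both function names, are made up for illustration. The fallback uses only the standard library.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text nodes, skipping script/style-like elements."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    """Fallback: plain visible text, one text node per line."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def html_to_markdown(html, min_chars=200):
    """Prefer Trafilatura Markdown; fall back when it returns little or nothing."""
    try:
        import trafilatura  # third-party; imported lazily so the fallback works alone

        markdown = trafilatura.extract(
            html,
            output_format="markdown",
            include_links=True,
            include_tables=True,
        )
    except ImportError:
        markdown = None
    if markdown and len(markdown) >= min_chars:
        return markdown
    return visible_text(html)
```

A fixed character threshold is a blunt signal. Comparing the Markdown length against the visible-text length is another option for deciding when extraction came up short.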
Other pieces that mattered:
- PDFs: PDF URLs go through pypdf.
- Challenge detection: CAPTCHA/security pages get flagged instead of treated as real content.
- SSRF guard: requested URLs and redirects are checked against private/internal ranges. Final URLs are checked too. Caveat: this is not a network-level guard for every browser subrequest.
- Optional summarization: large pages can be summarized by a configurable auxiliary model before they go back into context.
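For the SSRF guard in the list above, a minimal sketch of a URL check using the standard `ipaddress` module. The function names and the injectable `resolver` parameter are assumptions for illustration. The same check would need to run on the requested URL, on every redirect hop, and on the final URL. As the caveat says, it does not cover subrequests a browser makes on its own, and a resolve-then-fetch check like this does not by itself stop DNS rebinding.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def resolve_host(host):
    """Return every IP address the host resolves to, as strings."""
    infos = socket.getaddrinfo(host, None)
    # Drop any IPv6 scope id ("fe80::1%eth0") before parsing.
    return sorted({info[4][0].split("%", 1)[0] for info in infos})


def is_public_ip(ip_text):
    """True only for globally routable, non-multicast addresses."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False
    return ip.is_global and not ip.is_multicast


def url_is_allowed(url, resolver=resolve_host):
    """Reject non-HTTP(S) URLs and hosts that resolve to any non-public address."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = resolver(parts.hostname)
    except OSError:  # includes socket.gaierror for unresolvable hosts
        return False
    # Every resolved address must be public; one private record is enough to refuse.
    return bool(addresses) and all(is_public_ip(a) for a in addresses)
```

Requiring all resolved addresses to be public, rather than any one of them, errs on the side of refusing hosts with mixed DNS records.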
Why this combo
- No paid search/scrape API keys for this path.
- Queries go through my SearXNG instance, not a vendor API tied to my account.
- SearXNG still hits upstream engines, so this is not "zero third-party contact."
- Most pages use the fast path. The browser only kicks in when needed.
- The final output is Markdown, not HTML soup.
Honest tradeoffs
- The stealth path is slow. Keep it as a fallback.
- SearXNG quality depends on enabled upstream engines and rate limits.
- Paid search APIs can still be better. This has been good enough for my use.
- Cloudflare/browser scraping is always a moving target.
Not claiming this is the optimal setup. It is just one that has worked for me and stays self-hostable.
Curious what others are using for this. Has anyone found something better than SearXNG for self-hosted search, or a lighter alternative to a full browser for the hard pages?
Happy to share more details if anyone's trying something similar.