r/LocalLLaMA

Giving a local agent web access without paid search/scrape APIs: SearXNG + Scrapling


I wanted web access for a local-first agent without reaching for Tavily, Serper, Firecrawl, etc.

For this agent path, I wanted no paid API keys, a search service I control, and page extraction I can run myself.

What I ended up with is two tools: web_search and web_extract. Nothing fancy. Mostly just wiring together good open-source pieces.

1. Search -> SearXNG

SearXNG is a self-hostable metasearch engine. I run it in Docker and point the agent at its JSON endpoint.

The search call is roughly:

```text
GET {SEARXNG_URL}/search?q=<query>&format=json&pageno=1
```

Then I cap the results and normalize them to: {title, url, description}

description is just the SearXNG snippet. It is not page content.

Config is basically:

```text
SEARXNG_URL=http://localhost:8080
```
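A rough sketch of what the search call and normalization above could look like in Python, using only the standard library. The function names, the default result cap of 5, and the timeout are illustrative assumptions, not taken from the actual tool. It relies on SearXNG returning a `results` list whose items carry `title`, `url`, and `content` (the snippet).

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import urlopen

# Same setting as in the config above; the default is an assumption.
SEARXNG_URL = os.environ.get("SEARXNG_URL", "http://localhost:8080")


def build_search_url(query, base_url=SEARXNG_URL, page=1):
    """Build the SearXNG JSON search URL for a query."""
    params = urlencode({"q": query, "format": "json", "pageno": page})
    return f"{base_url.rstrip('/')}/search?{params}"


def normalize_results(payload, limit=5):
    """Cap the results and keep only title, url, and the snippet."""
    results = []
    for item in payload.get("results", [])[:limit]:
        results.append({
            "title": item.get("title") or "",
            "url": item.get("url") or "",
            # SearXNG's "content" field is the snippet, not page content.
            "description": item.get("content") or "",
        })
    return results


def web_search(query, limit=5, timeout=10):
    """Query the local SearXNG instance and return normalized results."""
    with urlopen(build_search_url(query), timeout=timeout) as resp:
        return normalize_results(json.load(resp), limit)
```

Keeping URL building and normalization separate from the HTTP call makes both easy to check without a running SearXNG instance.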

Gotchas:

  • Add json to search.formats in SearXNG settings.yml. Otherwise requests with format=json are rejected with a 403.
  • Public SearXNG instances are usually a bad fit for programmatic use.
  • SearXNG is search-only. Use extraction when the agent needs to read a page.

2. Extract -> Scrapling + Trafilatura

Search snippets are not enough. The agent needs to read the actual page.

For web_extract, I use Scrapling with two paths:

  1. Fast path: Fetcher.get(url, impersonate="chrome"). No browser. Good for normal pages.
  2. Stealth path: if the fast path is empty, blocked, or challenge-looking, try a real headless browser:

```python
from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch(
    url,
    headless=True,
    solve_cloudflare=True,
    block_webrtc=True,
    hide_canvas=True,
)
```

The stealth path is an attempt, not a guaranteed bypass. If the page still shows a CAPTCHA or Cloudflare wall, I mark the result as blocked/partial.
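The two-path logic can be sketched roughly like this. It is a simplified illustration, not the actual tool: `fast_fetch` and `stealth_fetch` are hypothetical callables returning `(status, html)`, which in practice would wrap `Fetcher.get` and `StealthyFetcher.fetch`. The marker strings and status codes in `looks_blocked` are assumptions too and would need tuning.

```python
# Hypothetical wording seen on challenge pages; tune for your own traffic.
CHALLENGE_MARKERS = (
    "just a moment",
    "verify you are human",
    "attention required",
    "captcha",
)


def looks_blocked(status, html):
    """Heuristic: empty body, blocking status code, or challenge wording."""
    if not html or not html.strip():
        return True
    if status in (403, 429, 503):
        return True
    head = html[:5000].lower()
    return any(marker in head for marker in CHALLENGE_MARKERS)


def fetch_with_fallback(url, fast_fetch, stealth_fetch):
    """Try the cheap fetch first; use the browser only if it looks blocked."""
    try:
        status, html = fast_fetch(url)
    except Exception:
        # A failed fast fetch is treated like a blocked one.
        status, html = None, ""
    if not looks_blocked(status, html):
        return {"url": url, "html": html, "path": "fast", "blocked": False}

    status, html = stealth_fetch(url)
    # The stealth path is only an attempt, so re-check and flag the result.
    return {
        "url": url,
        "html": html,
        "path": "stealth",
        "blocked": looks_blocked(status, html),
    }
```

A plain substring check like `"captcha"` will produce false positives on pages that merely mention the word, so a real detector probably wants stricter signals.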

Once I have HTML, Trafilatura turns it into Markdown with links and tables. Markdown is much easier for the model than raw HTML. I also keep a visible-text fallback for pages where Trafilatura under-extracts.
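A minimal sketch of the conversion step with a visible-text fallback. The `min_chars` threshold and the helper names are assumptions for illustration. `trafilatura.extract` with `output_format="markdown"` needs a reasonably recent Trafilatura release, and the import is kept inside the function so the fallback works on its own.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text nodes, skipping script/style-like elements."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    """Crude fallback: every visible text node, one per line."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def html_to_markdown(html, min_chars=200):
    """Prefer Trafilatura's Markdown; fall back when it under-extracts."""
    import trafilatura  # third-party, imported lazily

    markdown = trafilatura.extract(
        html,
        output_format="markdown",
        include_links=True,
        include_tables=True,
    )
    if markdown and len(markdown) >= min_chars:
        return markdown
    fallback = visible_text(html)
    # Keep whichever output recovered more text.
    if markdown and len(markdown) >= len(fallback):
        return markdown
    return fallback
```

Comparing lengths is a blunt way to detect under-extraction, but it is cheap and catches the common case where the extractor returns nothing or a single line.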

Other pieces that mattered:

  • PDFs: PDF URLs go through pypdf.
  • Challenge detection: CAPTCHA/security pages get flagged instead of treated as real content.
  • SSRF guard: requested URLs and redirects are checked against private/internal ranges. Final URLs are checked too. Caveat: this is not a network-level guard for every browser subrequest.
  • Optional summarization: large pages can be summarized by a configurable auxiliary model before they go back into context.
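For the SSRF guard, a minimal sketch of a per-URL check using only the standard library. The function names are hypothetical and this is not the actual implementation. The idea is to run the same check on the requested URL, on every redirect hop, and on the final URL.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_internal_ip(value):
    """True for private, loopback, link-local, and similar addresses."""
    ip = ipaddress.ip_address(value)
    # Unwrap IPv4-mapped IPv6 such as ::ffff:192.168.0.1.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def is_url_allowed(url):
    """Allow only http(s) URLs whose host resolves to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parts.hostname, None)
    except socket.gaierror:
        return False
    # Drop any IPv6 scope id (e.g. "%eth0") before parsing.
    addresses = {info[4][0].split("%")[0] for info in infos}
    return bool(addresses) and not any(is_internal_ip(a) for a in addresses)
```

This resolves the hostname at check time, so a DNS answer that changes between the check and the actual fetch can still slip through, and, as noted above, it does not cover subrequests made inside the browser.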

Why this combo

  • No paid search/scrape API keys for this path.
  • Queries go through my SearXNG instance, not a vendor API tied to my account.
  • SearXNG still hits upstream engines, so this is not "zero third-party contact."
  • Most pages use the fast path. The browser only kicks in when needed.
  • The final output is Markdown, not HTML soup.

Honest tradeoffs

  • The stealth path is slow. Keep it as a fallback.
  • SearXNG quality depends on enabled upstream engines and rate limits.
  • Paid search APIs can still be better. This has been good enough for my use.
  • Cloudflare/browser scraping is always a moving target.

Not claiming this is the optimal setup. It is just one that has worked for me and stays self-hostable.

Curious what others are using for this. Has anyone found something better than SearXNG for self-hosted search, or a lighter alternative to a full browser for the hard pages?

Happy to share more details if anyone's trying something similar.

submitted by /u/luke_pacman