r/LocalLLaMA

Best way to index full Italian Wikipedia for 100% offline RAG in LM Studio?


Hi everyone,

I want to set up a 100% offline RAG system using LM Studio and the entire Italian Wikipedia (text-only, no images). My goal is to index the database once so my local LLMs can query it for up-to-date factual knowledge without internet access.

Here are my PC specs:

  • GPU: RTX 4070 SUPER OC, 12 GB
  • RAM: 32 GB DDR5
  • Storage: NVMe SSD, Samsung 870 EVO, 2 TB

I have two main questions for the community:

  1. Data Source: What is currently the cleanest and most up-to-date source for the Italian Wikipedia dump in plain-text format (.txt, .md, or a clean .jsonl)? I know about Kiwix (.zim) and Hugging Face datasets, but I want to avoid leftover formatting (wikitext or HTML tags) that could degrade the embeddings.
  2. LM Studio Indexing: LM Studio's "Local Docs" feature works great for a few documents, but has anyone successfully indexed a large dump like the full Italian Wikipedia (around 5-7GB of raw text)? Will it crash or freeze during the vector database creation? If so, what is the best alternative pipeline to create the vector database offline?
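
To make the second question concrete, here is a minimal sketch of the preprocessing an alternative pipeline might start with. It assumes a clean .jsonl dump with one article per line and hypothetical `title` and `text` fields (actual field names depend on the dump), strips leftover wikitext links and HTML tags, and splits each article into overlapping word windows ready for an embedding model. The size estimate at the end assumes 768-dimensional float32 vectors, which is only one common configuration and not something LM Studio is known to use.

```python
import json
import re
from typing import Iterator

TAG_RE = re.compile(r"<[^>]+>")  # leftover HTML tags such as <b> or <ref>
WIKI_LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")  # [[target|label]] -> label


def clean_text(raw: str) -> str:
    """Remove wikitext links and HTML tags, then collapse whitespace."""
    text = WIKI_LINK_RE.sub(r"\1", raw)
    text = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> Iterator[str]:
    """Yield windows of `size` words; consecutive windows share `overlap` words."""
    words = text.split()
    step = size - overlap
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reached the end of the text


def iter_chunks(jsonl_path: str) -> Iterator[dict]:
    """Stream chunks one article at a time so the dump never sits in RAM."""
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            article = json.loads(line)  # assumed keys: "title", "text"
            cleaned = clean_text(article["text"])
            for i, chunk in enumerate(chunk_words(cleaned)):
                yield {"title": article["title"], "chunk_id": i, "text": chunk}


def estimate_index_gb(n_chunks: int, dim: int = 768, bytes_per_float: int = 4) -> float:
    """Rough size of the raw vectors alone, ignoring index overhead."""
    return n_chunks * dim * bytes_per_float / 1e9
```

Streaming the file line by line keeps memory use flat regardless of dump size. The estimate is the reason for worrying about scale: under these assumptions, a few million chunks already mean well over 10 GB of raw vectors before any index overhead.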

Any advice, scripts, or links to pre-cleaned, up-to-date Italian dumps would be highly appreciated.

Thanks in advance!

submitted by /u/tombino104