Ask HN: How are you doing RAG locally?

I am curious how people are doing RAG locally with minimal dependencies for internal code or complex documents?

Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?

155 points | by tmaly 19 hours ago

36 comments

__jf__ 49 minutes ago
For vector generation I started using Meta-LLama-3-8B in april 2024 with Python and Transformers for each text chunk on an RTX-A6000. Wow that thing was fast but noisy and also burns 500W. So a year ago I switched to an M1 Ultra and only had to replace Transformers with Apple's MLX python library. Approximately the same speed but less heat and noise. The Llama model has 4k dimensions so at fp16 thats 8 kilobyte per chunk, which I store in a BLOB column in SQLite via numpy.save(). Between running on the RTX and M1 there is a very small difference in vector output but not enough for me to change retrieval results, regenerate the vectors or change to another LLM.
For retrieval I load all the vectors from the SQlite database into a numpy.array and hand it to FAISS. Faiss-gpu was impressively fast on the RTX6000 and faiss-cpu is slower on the M1 Ultra but still fast enough for my purposes (I'm firing a few queries per day, not per minute). For 5 million chunks memory usage is around 40 GB which both fit into the A6000 and easily fits into the 128GB of the M1 Ultra. It works, I'm happy.
podgietaru 14 minutes ago
I made a small RAG database just using Postgres. I outlined it in the blog post below. I use it for RSS Feed organisation, and searching. They are small blobs. I do the labeling using a pseudo-KNN algorithm.
https://aws.amazon.com/blogs/machine-learning/use-language-e...
The code for it is here: https://github.com/aws-samples/rss-aggregator-using-cohere-e...
The example link no longer works, as I no longer work at AWS.
esperent 2 hours ago
I'm lucky enough to have 95% of my docs in small markdown markdown files so I'm just... not (+). I'm using SQLite FTS5 (full text search) to build a normal search index and using that. Well, I already had the index so I just wired it up to my mastra agents. Each file has a short description field, so if a keyword search surfaces the doc they check the description and if it matches, load the whole doc.
This took about one hour to set up and works very well.
(+) At least, I don't think this counts as RAG. I'm honestly a bit hazy on the definition. But there's no vectordb anyway.
[-]
- dmos62 1 hour ago
  Retrieval-augmented generation. What you described is a perfect example of a RAG. An embedding-based search might be more common, but that's a detail.
CuriouslyC 5 hours ago
Don't use a vector database for code, embeddings are slow and bad for code. Code likes bm25+trigram, that gets better results while keeping search responses snappy.
[-]
- jankovicsandras 2 hours ago
  You can do hybrid search in Postgres.
  Shameless plug: https://github.com/jankovicsandras/plpgsql_bm25 BM25 search implemented in PL/pgSQL ( Unlicense / Public domain )
  The repo includes also plpgsql_bm25rrf.sql : PL/pgSQL function for hybrid search ( plpgsql_bm25 + pgvector ) with Reciprocal Rank Fusion; and Jupyter notebook examples.
- postalcoder 4 hours ago
  I agree. Someone here posted a drop-in for grep that added the ability to do hybrid text/vector search but the constant need to re-index files was annoying and a drag. Moreover, vector search can add a ton of noise if the model isn't meant for code search and if you're not using a re-ranker.
  For all intents and purposes, running gpt-oss 20B in a while loop with access to ripgrep works pretty dang well. gpt-oss is a tool calling god compared to everything else i've tried, and fast.
- rao-v 3 hours ago
  Anybody know of a good service / docker that will do BM25 + vector lookup without spinning up half a dozen microservices?
- ehsanu1 3 hours ago
  I've gotten great results applying it to file paths + signatures. Even better if you also fuse those results with BM25.
- itake 4 hours ago
  With AI needing more access to documentation, WDYT about using RAG for documentation retrieval?
- lee1012 5 hours ago
  static embedding models im finding quite fast lee101/gobed https://github.com/lee101/gobed is 1ms on gpu :) would need to be trained for code though the bigger code llm embeddings can be high quality too so its just yea about where is ideal on the pareto fronteir really , often yea though your right it tends to be bm25 or rg even for code but yea more complex solutions are kind of possible too if its really important the search is high quality
bzGoRust 51 minutes ago
In my company, we build the internal chatbot based on RAG through LangChain + Milvus + LLM. Since the documents are well formatted, it is easy to do the overlapping chunking, then all those chunking data are inserted into vector db Milvus. The hybrid search (combine dense search and sparse search) is native supported in the Milvus could help us to do better retrieve. Thus the better quality answers are there.
spqw 2 hours ago
I am surprised to see very few setups leveraging LSP support. (Language Server Protocol) It has been added to Claude Code last month. Most setups rely on naive grep.
[-]
- woggy 1 hour ago
  I've written a few terminal tools on top of Roslyn to assist Claude in code analysis for C# code. Obviously the tools are also written with the help of Claude. Worked quite well.
- aqula 59 minutes ago
  LSP is not great for non-editor use cases. Everything is cursor position oriented.
tebeka 3 hours ago
https://duckdb.org/2024/05/03/vector-similarity-search-vss
[-]
- m00dy 1 hour ago
  does duckdb scale well over large datasets for vector search ?
  [-]
  - lgrebe 53 minutes ago
    What order of magnitude would you define as „large“ in this case?
sinandrei 20 minutes ago
Anyone use these approaches with academic pdfs?
[-]
- urschrei 15 minutes ago
  Another approach is to teach Claude Code how to use your Zotero library's full-text search: https://github.com/urschrei/zotero_search_skill.
lsb 1 hour ago
I'm using Sonnet with 1M Context Window at work, just stuffing everything in a window (it works fine for now), and I'm hoping to investigate Recursive Language Models with DSPy when I'm using local models with Ollama
autogn0me 3 hours ago
https://github.com/ggozad/haiku.rag/ - the embedded lancedb is convenient and has benchmarks; uses docling. qwen3-embedding:4b, 2560 w/ gpt-oss:20b.
init0 4 hours ago
I built a lib for myself https://pypi.org/project/piragi/
[-]
- stingraycharles 4 hours ago
  That looks great! Is there a way to store / cache the embeddings?
Bombthecat 1 hour ago
AnythingLLM for documents, amazing tool!
cbcoutinho 3 hours ago
The Nextcloud MCP Server [0] supports Qdrant as a vectordb to store embeddings and provide semantic search across your personal documents. This enables any LLM & MCP client (e.g. claude code) into a RAG system that you can use to chat with your files.
For local deployments, Qdrant supports storing embeddings in memory as well as in a local directory (similar to sqlite) - for larger deployments Qdrant supports running as a standalone service/sidecar and can be made available over the network.
[0] https://github.com/cbcoutinho/nextcloud-mcp-server
softwaredoug 1 hour ago
I built a Pandas extension SearchArray, I just use that (plus in memory embeddings) for any toy thing
https://github.com/softwaredoug/searcharray
baalimago 2 hours ago
I thought that context building via tooling was shown to be more effective than rag in practically every way?
Question being: WHY would I be doing RAG locally?
[-]
- petesergeant 2 hours ago
  For code, maybe? For documents, no, text embeddings are magical alien technology.
rahimnathwani 17 hours ago
If your data aren't too large, you can use faiss-cpu and pickle
https://pypi.org/project/faiss-cpu/
[-]
- notyourwork 5 hours ago
  For the uneducated, how large is too large? Curious.
  [-]
  - itake 4 hours ago
    FAISS runs in RAM. If your dataset can't fit into ram, FAISS is not the right tool.
- hahahahhaah 3 hours ago
  Shoud it be:
  If the total size of your data isn't loo large...?
  Data being a plural gets me.
  You might have small datums but a lot of kilobytes!
beret4breakfast 2 hours ago
For the purposes of learning, I’ve built a chatbot using ollama, streamlit, chromadb and docling. Mostly playing around with embedding and chunking on a document library.
[-]
- sidrag22 1 hour ago
  i took a similar path, i spun up a discord bot, used ollama, pgvector, docling for random documents, and made some specialized chunking strategies for some clunkier json data. its been a little while since i messed with it, but i really did enjoy it when i was.
  it all moves so fast, i wouldnt be surprised if everything i made is now crazy outdated and it was probably like 2 months ago.
geuis 2 hours ago
I don't. I actually write code.
To answer the question more directly, I've spent the last couple of years with a few different quant models mostly running on llama.cpp and ollama, depending. The results are way slower than the paid token api versions, but they are completely free of external influence and cost.
However the models I've tests generally turn out to be pretty dumb at the quant level I'm running to be relatively fast. And their code generation capabilities are just a mess not to be dealt with.
dvorka 3 hours ago
Any suggestion what to use as embeddings model runtime and semantic search in C++?
ehsanu1 3 hours ago
Embedded usearch vector database. https://github.com/unum-cloud/USearch
lormayna 3 hours ago
I have done some experiments with nomic embedding through Ollama and ChromaDB.
Works well, but I didn't tested on larger scale
eajr 18 hours ago
Local LibreChat which bundles a vector db for docs.
Strift 2 hours ago
I just use a web server and a search engine.
TL;DR: - chunk files, index chunks - vector/hybrid search over the index - node app to handle requests (was the quickest to implement, LLMs understand OpenAPI well)
I wrote about it here: https://laurentcazanove.com/blog/obsidian-rag-api
motakuk 17 hours ago
LightRAG, Archestra as a UI with LightRAG mcp
lee1012 5 hours ago
lee101/gobed https://github.com/lee101/gobed static embedding models so they are embedded in milliseconds and on gpu search with a cagra style on gpu index with a few things for speed like int8 quantization on the embeddings and fused embedding and search in the same kernel as the embedding really is just a trained map of embeddings per token/averaging
whattheheckheck 18 hours ago
Anythingllm is promising
jeanloolz 4 hours ago
Sqlite-vec
[-]
- petesergeant 2 hours ago
  I’ve got it deployed in production for a dataset that changes infrequently and it works really well
nineteen999 13 hours ago
A little BM25 can get you quite a way with an LLM.
jeffchuber 5 hours ago
try out chroma or better yet as opus to!
electroglyph 5 hours ago
simple lil setup with qdrant
pdyc 4 hours ago
sqlite's bm25
ramesh31 14 hours ago
SQLite with FTS5
lee101 5 hours ago
[dead]
undergrowth 4 hours ago
[flagged]
undergrowth 4 hours ago
[flagged]
[-]
- kristopolous 2 hours ago
  A new account, named after the thinking you're linking just looks like spam.
  Also I've got no idea what this product does, this is just a generic page of topical ai buzzwords
  Don't tell me what it is, /show me why/ you built it. Then go back and keep that reasoning in, show me why I should care
  [-]
  - lukan 40 minutes ago
    You are talking with a spam bot.