Breaking up is hard to do: Chunking in RAG applications (stackoverflow.blog)
143 points by meysamazad on June 8, 2024 | 43 comments


The problem with most chunking schemes I’ve seen is that they’re naive: they don’t care about content, only token count. That’s fine, but it would be better to chunk by topic.

I’m currently trying to implement chunking by topic using an LLM. It’s much slower, but I hope it will be a huge win in retrieval accuracy. The first step is to ask the LLM to identify all the topics in the document; then I split the document into sentences and feed each sentence back to the LLM to assign it a topic. The result should be the original text, split by topic, and those splits can be further chunked if needed. Of course, you could just one-shot ask the LLM to summarize each topic in the document, but the more the LLM is relied on to write, the more distortion is introduced. Retaining the original text is the goal; the LLM should only be used for decision making.

Here is the crate I’m working out of. The chunking hasn’t been pushed, but you can see the decision making workflows. https://github.com/ShelbyJenkins/llm_client
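
For concreteness, here is a minimal Python sketch of that two-pass workflow (not the crate's actual code; llm() and split_sentences() are hypothetical helpers you would swap for your own):

    import re

    def llm(prompt: str) -> str:
        # Hypothetical: plug in any LLM client that returns a text completion.
        raise NotImplementedError

    def split_sentences(text: str) -> list[str]:
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def chunk_by_topic(document: str) -> dict[str, list[str]]:
        # Pass 1: ask the LLM to enumerate the topics in the document.
        topics = [t.strip() for t in llm(
            "List the distinct topics in this document, one per line:\n\n" + document
        ).splitlines() if t.strip()]
        # Pass 2: assign each original sentence to a topic; the source text is
        # kept verbatim, the LLM only makes the assignment decision.
        chunks: dict[str, list[str]] = {t: [] for t in topics}
        for sentence in split_sentences(document):
            topic = llm(
                "Topics: " + ", ".join(topics)
                + "\nAnswer with the single best-fitting topic for this sentence:\n"
                + sentence
            ).strip()
            chunks.setdefault(topic, []).append(sentence)
        return chunks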


You’re basically reinventing embeddings. Using an LLM for this is overkill since capturing the meaning of a sentence can be done with a model 1000x smaller.

I have a chunking library that does something similar to this, and there are actually quite a few different libraries that have implemented some variant of “semantic chunking”.

[1] https://github.com/Filimoa/open-parse
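
For example, a minimal sketch of that kind of semantic chunking with a small embedding model, assuming the sentence-transformers package (not open-parse's actual implementation):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    def semantic_chunks(sentences: list[str], threshold: float = 0.5) -> list[str]:
        # A small local embedding model, orders of magnitude smaller than an LLM.
        model = SentenceTransformer("all-MiniLM-L6-v2")
        emb = model.encode(sentences, normalize_embeddings=True)
        chunks, current = [], [sentences[0]]
        for i in range(1, len(sentences)):
            # Start a new chunk when consecutive sentences stop being similar.
            if float(np.dot(emb[i - 1], emb[i])) < threshold:
                chunks.append(" ".join(current))
                current = []
            current.append(sentences[i])
        chunks.append(" ".join(current))
        return chunks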


Excellent, now show me a single one working in production with various docs.

I’ve talked through this problem with dozens of startups working on RAG, and everyone has the same problem.

LLMs are arguably not reinventing embeddings if you’re using them to infer the structure of a broad document. Understanding the “external structure” of the document, the internal structure of each segment, and the semantic structure of each chunk is important.


See my other comment on this thread: larger and larger context windows make perfect chunking irrelevant. We’ve ingested millions of docs and have a product in production; using LLMs for chunking would have 10xed our ingestion costs for a marginal performance improvement.


Good to be validated. I guess I'll publish a blog post on the topic when I finalize what I've been working on.


I used to think the same, then I realized that "topic" is just a concept we draw from the content. It reminded me of when we did feature engineering for everything; now we just feed everything into the algorithm and let it figure out how to handle things. Maybe the problem for us is not how to organize data in the way we think is best, but in a format that is useful for the algorithm. A fixed window size that fits the embedding model perfectly would just work.


> They don’t care about content; only token count. That’s fine, but what would be best is to chunk by topic.

I’ve had a lot of success with chunking documents by subtitle. Works especially well for web published documents because each section tends to be fairly short.
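
A rough sketch of that approach for markdown-ish web content, where each heading starts a new chunk and becomes its title:

    import re

    def chunk_by_heading(markdown: str) -> list[tuple[str, str]]:
        chunks, title, body = [], "Introduction", []
        for line in markdown.splitlines():
            m = re.match(r"^(#{1,6})\s+(.*)", line)
            if m:  # a heading closes the previous chunk and opens a new one
                if body:
                    chunks.append((title, "\n".join(body).strip()))
                title, body = m.group(2).strip(), []
            else:
                body.append(line)
        if body:
            chunks.append((title, "\n".join(body).strip()))
        return chunks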


Why not use langchain-rust and make your own client? If you don't know about LangChain, I think you are missing out. I took a look at other LangChain implementations in JS and Python, and in each one people have done some serious work. Langchain-rust also uses tree-sitter to chunk code; it works very well in some quick tests I tried.

>The problem with most chunking schemes I’ve seen is they’re naive. They don’t care about content; only token count.

I think controlling different inputs depending on context is used in agents. For the moment I haven't seen anything really impressive coming out of agents. Maybe Perplexity-style web search, but nothing more.


Not sure what the situation is like now, but we stopped using LangChain last year because the rate of change in the library was huge. Whenever we needed to upgrade for a new feature or bug fix we’d be ~20 versions behind and need to work through breaking changes. Eventually we decided that it was easier to just write everything ourselves.

This is from the first half of 2023 or so; maybe things are more stable now, but looks like the Python implementation is still pre-v1.


Isn’t this just semantic chunking? There have been white papers, and there are already implementations in LangChain and LlamaIndex you could look through.


We use grid search to figure out what's the best chunking strategy to use. Create a bunch of different strategies such as recursive chunking, semantic, etc, and parameterize them and see which one works best. The "best" chunking strategy depends on the nature of the documents and the questions being asked.
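
Roughly something like this, where build_index and retrieve stand in for whatever indexing/retrieval stack you already have, and the eval set is (question, known-relevant passage) pairs:

    from itertools import product

    strategies = ["recursive", "semantic", "by_heading"]
    chunk_sizes = [256, 512, 1024]
    overlaps = [0, 64]

    def evaluate(strategy, size, overlap, eval_set, corpus):
        # build_index / retrieve are placeholders for your own stack.
        index = build_index(corpus, strategy=strategy, size=size, overlap=overlap)
        hits = sum(
            any(gold in chunk for chunk in retrieve(index, question, k=5))
            for question, gold in eval_set
        )
        return hits / len(eval_set)  # recall@5 for this configuration

    def grid_search(eval_set, corpus):
        scored = {
            (s, sz, ov): evaluate(s, sz, ov, eval_set, corpus)
            for s, sz, ov in product(strategies, chunk_sizes, overlaps)
        }
        best = max(scored, key=scored.get)
        return best, scored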


Just as important as chunking strategy, imo, is recombination. After you've picked the highest-scoring chunks from the search matter, how do you glue them back together?

Simple concatenation is the obvious way, but I've had better results from using the chunks as anchors for expansion. Use something like Wilson score interval to decide how many high-scoring chunks to pick, then grab the surrounding text for each chunk up to some semantic boundary like period or newline (or arbitrary limit if no boundary is detected).
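
A minimal sketch of that expansion step (the Wilson-score selection of how many chunks to keep is omitted): given a chunk's character span in the full document, grow it outward to the nearest sentence boundary, capped by a hard limit.

    def expand_chunk(document: str, start: int, end: int, max_extra: int = 500) -> str:
        """Expand [start:end) left and right to the nearest sentence boundary,
        but never by more than max_extra characters on each side."""
        boundaries = ".!?\n"
        lo = start
        while lo > 0 and document[lo - 1] not in boundaries and start - lo < max_extra:
            lo -= 1
        hi = end
        while hi < len(document) and document[hi - 1] not in boundaries and hi - end < max_extra:
            hi += 1
        return document[lo:hi].strip()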


Has anyone tried skipping the embedding path in favor of combining a proper FTS engine with an LLM yet?

A tool like Lucene seems far more competent at the task of "find most relevant text fragment" compared to what is realized in a typical vector search application today. I'd also argue that you get more inspectability and control this way. You could even manage the preferred size of the fragments on a per-document basis using an entirely separate heuristic at indexing time.

The semantic capabilities of vector search seem nice, but could you not achieve a similar outcome by using the LLM to project synonymous OR clauses into the FTS query based upon static background material or prior search iteration(s)?
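
Something like this, with llm as a hypothetical helper that returns comma-separated synonyms, projected into a Lucene-style boolean query string:

    def expand_query(query: str, llm) -> str:
        clauses = []
        for term in query.split():
            synonyms = [s.strip() for s in llm(
                "Give up to 3 synonyms for '" + term + "', comma separated."
            ).split(",") if s.strip()]
            options = [term] + synonyms
            clauses.append("(" + " OR ".join(options) + ")")
        return " AND ".join(clauses)

    # e.g. "car insurance" -> "(car OR auto OR vehicle) AND (insurance OR coverage OR policy)"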


Generally you do both and rank the combined results

This is also why you generally don't get a vector DB and instead just add a vector index to the DB you are already using.

Re: semantic vs synonym, that's really domain dependent. As the number of hits goes up and queries get more interesting, vectors get more interesting. At the same time, vectors are heavy, so there's also the question of pushing the semantic aspect to the ranker vs the index, but I don't see that discussed much (search & storage vs compute & latency).
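
One common way to combine and rank the two result lists is reciprocal rank fusion (RRF); a sketch, not necessarily what any particular product does:

    def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
        # Each ranking is a list of document IDs ordered best-first
        # (e.g. one from FTS, one from vector search).
        scores: dict[str, float] = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # fused = rrf([fts_results, vector_results])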


> There's also the question of pushing the semantic aspect to the ranker vs the index

Could it make sense to perform dynamic vector lookup over the FTS result set best fragments? This could save a lot of money if you have a massive corpus to index because you'd only be paying to embed things that are being searched for at runtime.

Focusing on just the best fragments could also improve the SnR going into the final vector search phase, especially if the fragment length is managed appropriately for each kind of document. If we are dealing with a method from a codebase, then we might prefer to have an unlimited fragment length. For a 20 megabyte PDF, it could be closer to the size of a tweet.
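
A sketch of that two-stage flow, with fts_search and embed as placeholders for your FTS engine and embedding model (embed assumed to return a 2D numpy array, one row per input text):

    import numpy as np

    def search(query: str, k: int = 50, top_n: int = 5) -> list[str]:
        fragments = fts_search(query, limit=k)   # cheap, corpus-wide lexical pass
        vecs = embed([query] + fragments)        # pay for embeddings only at query time
        q, frags = vecs[0], vecs[1:]
        sims = frags @ q / (np.linalg.norm(frags, axis=1) * np.linalg.norm(q))
        order = np.argsort(-sims)[:top_n]
        return [fragments[i] for i in order]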


Query expansion has been done since forever, long before even word2vec and other semantic embedding models, e.g. using WordNet (not a DNN, despite its name), see https://lucene.apache.org/core/3_3_0/api/contrib-wordnet/org....
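
For reference, the NLTK equivalent of that kind of WordNet expansion looks roughly like this (requires nltk.download("wordnet")):

    from nltk.corpus import wordnet as wn

    def wordnet_synonyms(term: str, limit: int = 5) -> list[str]:
        names = {
            lemma.replace("_", " ")
            for synset in wn.synsets(term)
            for lemma in synset.lemma_names()
        }
        names.discard(term)
        return sorted(names)[:limit]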


With query expansion, do you find it necessary for UX purposes to tell the user what you have done to their original query? Or do you just run with it and let the result interactions speak for themselves?


Seems like the idea of Agentic Rag (https://zzbbyy.substack.com/p/agentic-rag).

I have created a very bare bones implementation as an example at https://github.com/zby/LLMEasyTools/tree/main/examples/agent... (it uses whoosh for indexing) and I am working on a more complete one at https://github.com/zby/answerbot.


You can use both search methods by using hybrid search. Here’s an implementation:

https://learn.microsoft.com/en-us/azure/search/hybrid-search...


I wonder how I could implement something similar using Lucene and a vector DB?


All the vector stores are going hybrid; it's becoming a table stakes feature.


Build a service that talks to both at the same time, or maybe some project like this:

https://github.com/JuniusLuo/VecLucene


What’s FTS?


In this context “Full Text Search”. In the context of rage-quitting, something entirely different.


Thank you!

I really hate it when people throw around acronyms instead of just hitting a few extra keys on their keyboard for clarity


When they recommend using smaller chunks, how small in general are we talking? One sentence? One paragraph? 100 tokens or 1000? While I understand that the context of the data is important, it’s hard for me to ground vague statements like that without a concrete realistic example. I’m curious what chunk sizes people have found the most success with for various tasks in the wild.


Think of it this way: you are going to return the chunk to your user as 'proof'.

The size therefore depends on your content style.

E.g. for an HN discussion, I would go with a paragraph per comment.

In a contract, each clause.

In a non-fiction book, maybe each section...

You can also decide to do some kind of reverse adaptive tree: you chunk at the sentence level, then compare; if two chunks are 'close enough', you merge them into a bigger chunk.
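
A sketch of that merge-up idea, with embed as a placeholder for any sentence-embedding model that returns unit-length numpy vectors:

    import numpy as np

    def adaptive_merge(sentences: list[str], threshold: float = 0.75) -> list[str]:
        chunks = list(sentences)
        merged = True
        while merged and len(chunks) > 1:
            merged = False
            vecs = embed(chunks)  # placeholder: one unit vector per chunk
            out, i = [], 0
            while i < len(chunks):
                if i + 1 < len(chunks) and float(np.dot(vecs[i], vecs[i + 1])) >= threshold:
                    out.append(chunks[i] + " " + chunks[i + 1])  # 'close enough': merge
                    i += 2
                    merged = True
                else:
                    out.append(chunks[i])
                    i += 1
            chunks = out
        return chunks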


Suppose you don't know the shape of the content ahead of time. This would be the case for most apps that allow users to upload their own sources.

It sounds like (at the expense of more computation and time) the reverse adaptive tree approach you described would be ideal for those scenarios.


That's why the article talks about using machine learning to decide on the best strategy: to deal with unknown content 'shape'.


When using vector search, you want chunks in the ~1k tokens range. If you're using full text search then chunks should probably be smaller, say a few paragraphs at most. If you use trigrams, you want chunks to be short, maybe even sentence level.


I’ll share my strategy. I usually keep chunks at a max of 0.5% of a document or 270 tokens. Multiply that by three, and that is the size of the sliding windows that are then used.


Why not overlapping sizes? (1), (1,2), (1,2,3) Sometimes the match is in a single sentence. Sometimes in the full paragraph. Sometimes across two paragraphs. If, in your top n results, some of these items overlap, you use the greater unit. And you slide these windows.

Also, I wouldn’t necessarily use a “sentence” as the lower bound, since that can be something like “Yes.”
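
A sketch of building those overlapping windows over whatever base units you chunk into (paragraphs here, but anything above a bare sentence works):

    def overlapping_windows(units: list[str], max_span: int = 3) -> list[tuple[int, int, str]]:
        # Every single unit, every adjacent pair, every adjacent triple.
        windows = []
        for span in range(1, max_span + 1):
            for start in range(len(units) - span + 1):
                windows.append((start, start + span, " ".join(units[start:start + span])))
        return windows  # embed each window's text; prefer the larger unit when matches overlap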


With ever-increasing context sizes (Claude 3 Sonnet, a 204,800-token context window, a 500-page document), sometimes not opting for chunking, while still optimizing for cost and latency, might be a better solution. Strategies like summary extraction and single-pass extraction work well [1].

[1] - https://docs.unstract.com/editions/cloud_edition#summary-ext...


In theory. In practice, context size degrades performance.


Well-written article, but it's missing some key considerations:

- Titles matter, a lot: if you add the title of the section at the start of each chunk you will get 10x better embeddings and so more accurate results.

- The size doesn't matter on its own: it depends on the combination of the layout and semantics of the content.

- Avoid garbage in / garbage out: increased context windows don't mean you can put trash inside them. The better you are at putting in only relevant information, the more precise the answers you get. Especially for enterprise-grade solutions, this is very important.

There are good emerging API solutions that implement semantic + layout-based chunking, which in my opinion is the best chunking strategy for PDF / Office files (the widest use case for enterprises).
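
The title trick is as simple as prefixing each chunk before embedding; a sketch:

    def contextualize(doc_title: str, section_title: str, chunk: str) -> str:
        # The embedded text carries its surrounding context, even when the
        # chunk body alone is ambiguous.
        return f"{doc_title} > {section_title}\n\n{chunk}"

    # embed(contextualize("Employee Handbook", "Remote Work Policy", chunk_text))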


As embedding models become more performant and context windows increase, “ideal chunking” becomes less relevant.

Cost isn’t as important to us, so we use small chunks and then just pull in the page before and after. If you do this on 20+ matches (since you’re decomposing the query multiple times), you’re very likely finding the content.

Queries can get more expensive but you’re getting a corpus of “great answers” to test against as you refine your approach. Model costs are also plummeting which makes brute forcing it more and more viable.
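
A sketch of that neighbour-page expansion, assuming pages maps page numbers to text and matches are the page numbers of the top-scoring small chunks:

    def with_neighbour_pages(pages: dict[int, str], matches: list[int]) -> str:
        # Return each matched page plus the page before and after it, deduplicated.
        wanted = sorted({p for m in matches for p in (m - 1, m, m + 1) if p in pages})
        return "\n\n".join(pages[p] for p in wanted)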


Bigger context windows don't work as well as advertised. The "needle in a haystack" problem is not solved yet.

Also, bigger context windows mean a lot more time waiting for an answer. Given the quadratic cost of attention in context length, we are stuck using transformers on smaller chunks. Other architectures like Mamba may solve that, but even then, gains in context-window accuracy are not 1000x.


What would be the chunking strategy for q&a pairs? Right now I'm embedding the complete question and answer but the query results are not good as the response contains data not related to the question at all.


Join q and a when vectorizing them. Questions alone are too short to carry a lot of semantic richness.

When you get a query, you then run two semantic search queries: one using the original question and one using a HYDE version of the question. Take those results and run it through cohere’s rerank.
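
A sketch of that flow, with llm, vector_search, and rerank as placeholders for your LLM client, vector store, and reranker (e.g. Cohere's rerank endpoint):

    def hyde_retrieve(question: str, top_n: int = 5) -> list[str]:
        # HyDE: generate a hypothetical answer and search with it too,
        # since answers tend to live closer to answers in embedding space.
        hypothetical = llm("Write a short, plausible answer to: " + question)
        candidates = {
            doc
            for q in (question, hypothetical)
            for doc in vector_search(q, k=20)
        }
        return rerank(query=question, documents=list(candidates), top_n=top_n)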


Yes, right now I'm vectorizing them as a pair. I'm then running a query in Pinecone by embedding the exact question but still the result does not have the actual q&a pair.

I'm not familiar with HYDE. I'll check it out. Thanks for the suggestion.


It depends on the specifics of your format, but we’ve had success embedding the questions and answers separately. If either match, you return the complete question and answer text. Make sure to deduplicate before returning, in case both match.
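
A sketch of indexing the question and answer as separate vectors while returning (and deduplicating) the full pair on a hit:

    def index_pairs(pairs: list[tuple[str, str]]) -> list[dict]:
        # Two records per pair: one for the question text, one for the answer text.
        records = []
        for pair_id, (q, a) in enumerate(pairs):
            records.append({"id": f"{pair_id}-q", "pair_id": pair_id, "text": q})
            records.append({"id": f"{pair_id}-a", "pair_id": pair_id, "text": a})
        return records  # embed each record["text"], store pair_id as metadata

    def dedupe_hits(hits: list[dict], pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
        # hits are sorted by score and each carries its pair_id metadata.
        seen, out = set(), []
        for hit in hits:
            if hit["pair_id"] not in seen:
                seen.add(hit["pair_id"])
                out.append(pairs[hit["pair_id"]])
        return out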


I'll try embedding it separately as well and try to figure out from there. Thanks for the suggestion


tldr: Use adaptive chunking methods, where ML decides whether consecutive texts are related. The article does not mention any actual adaptive chunking method by name.



