Chunking is context engineering for your embedding model. It is a critical, and often under-appreciated, component of your retrieval pipeline. Your chunking strategy, good or bad, will set the ceiling for recall. This guide provides rules of thumb to tune your chunking strategy for the best possible recall.
Even if your embedding model technically accepts inputs of up to 32k tokens, embedding entire documents at that length won't give you the best results, for two reasons:
You need to provide enough context for the embedding model to understand the input as a standalone string.
Most embedding models cannot see the full document when producing an embedding (contextual embedding models, covered later in this guide, are the exception). If your chunks are too small, most of them will refer to concepts, people, places, and things that are described elsewhere in the document. For example, consider these two chunks from a single document:
chunk 1:
Dory recently moved from The Great Barrier Reef to a new home in Sydney Harbor.
Her new address is 42 Wallaby Way.
chunk 2:
She got a good deal from P. Sherman.
He sold her the house for only $200,000.
For the query "How much was Dory's new house?", neither chunk contains enough context on its own to answer fully. The answer is in chunk 2, but chunk 2 lacks the context that "she" means Dory and "the house" means 42 Wallaby Way. Combining the two chunks into a single chunk would keep all the necessary information together (at the expense of compression loss).
Some corpora have obvious chunking boundaries. Chunking code files, for example, should respect function boundaries. If you split a function definition down the middle, each chunk loses the information needed to interpret it correctly, so the embedding tends to represent a syntactically broken fragment rather than the semantics of the function. Regardless of whether you use a traditional or contextual embedding model, you should use tools like tree-sitter for code or markdown splitters to chunk documents with known boundaries.
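As a rough illustration of boundary-aware chunking, the sketch below splits Python source into one chunk per top-level function or class. It uses the standard-library `ast` module as a stand-in for tree-sitter, so it only handles Python; the function name `chunk_python_source` and the sample input are hypothetical. Comments that sit between top-level statements are dropped in this simplified version.

```python
import ast


def chunk_python_source(source: str) -> list[str]:
    """Split Python source into one chunk per top-level definition.

    Statements between definitions (imports, constants) are grouped
    into their own chunks so that no definition is cut in half.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    pending = []  # top-level statements seen since the last definition

    for node in tree.body:
        start = node.lineno
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above the def/class line, so start at the earliest one.
            start = min([start] + [d.lineno for d in node.decorator_list])
            if pending:
                chunks.append("\n".join(pending))
                pending = []
            chunks.append("\n".join(lines[start - 1 : node.end_lineno]))
        else:
            pending.extend(lines[start - 1 : node.end_lineno])

    if pending:
        chunks.append("\n".join(pending))
    return chunks


# Hypothetical sample input, used only to exercise the function.
SAMPLE = '''import math

def area(r):
    return math.pi * r ** 2

def perimeter(r):
    return 2 * math.pi * r
'''

chunks = chunk_python_source(SAMPLE)
for chunk in chunks:
    print(chunk)
    print("-----")
```

Each resulting chunk is a complete, parseable unit, which is the property the paragraph above argues for. A tree-sitter based splitter would apply the same idea across many languages.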
If you are using a traditional embedding model, we recommend starting with chunks of roughly 300 tokens and a two-sentence overlap between adjacent chunks for text documents. This balances the tradeoff between good context and good compression. From that starting point, iteratively adjust your chunk length and overlap, measuring recall on your own queries after each change.
The accompanying Python code sample uses the blingfire sentence splitter to efficiently split large documents (such as books) into chunks of configurable size and overlap.
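A minimal sketch of sentence-based chunking with overlap, under a few loudly stated assumptions: sentences are split with a naive regular expression rather than blingfire (a dedicated sentence splitter can be swapped in for `split_sentences`), and token counts are approximated by whitespace-separated words rather than by your embedding model's tokenizer. All function names here are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive stand-in for a real sentence splitter: break after ., !, or ?
    # followed by whitespace. Abbreviations and initials will be split
    # incorrectly, which is one reason to prefer a dedicated library.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words as a rough proxy for tokens.
    return len(text.split())


def chunk_sentences(sentences, max_tokens=300, overlap_sentences=2):
    """Greedily pack sentences into chunks of about max_tokens tokens.

    Each new chunk starts with the last overlap_sentences sentences of the
    previous chunk. The limit is soft: a single very long sentence, or a
    long overlap, can push a chunk over max_tokens.
    """
    chunks = []
    current = []
    current_tokens = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            current_tokens = sum(count_tokens(s) for s in current)
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Toy input: ten five-word sentences, with a small budget to make the overlap visible.
sentences = [f"Sentence {i} has five words." for i in range(10)]
chunks = chunk_sentences(sentences, max_tokens=15, overlap_sentences=1)
for chunk in chunks:
    print(chunk)
```

With the toy settings above, each chunk repeats the final sentence of the previous chunk. For real documents, the defaults of `max_tokens=300` and `overlap_sentences=2` mirror the suggested starting point, and both are meant to be tuned.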
Contextual embedding models such as voyage-context-3 or pplx-embed-context-v1-{0.6b, 4b} provide a practical means of improving recall without significantly adding cost, latency, or complexity to your embedding pipeline.
With these models, you pass the full document to the model as a list of arbitrary-length text chunks. The model attends to the entire document and produces contextualized embeddings for each chunk in one forward pass.
Consider again the example:
chunk 1:
Dory recently moved from The Great Barrier Reef to a new home in Sydney Harbor.
Her new address is 42 Wallaby Way.
chunk 2:
She got a good deal from P. Sherman.
He sold her the house for only $200,000.
A contextual embedding model can infer from chunk 1 that "she" in chunk 2 refers to Dory and that "the house" in chunk 2 refers to 42 Wallaby Way. Thus, for the query "How much was Dory's new house?", the embedding for chunk 2 would likely be retrieved.
Further, contextual embedding models make it possible to create smaller chunks without losing context. We could split the example into four chunks, one per sentence:
chunk 1:
Dory recently moved from The Great Barrier Reef to a new home in Sydney Harbor.
chunk 2:
Her new address is 42 Wallaby Way.
chunk 3:
She got a good deal from P. Sherman.
chunk 4:
He sold her the house for only $200,000.
We would retrieve chunk 4 to answer the query, as its embedding includes the context of the other chunks.
Contextual embedding models make it possible to create smaller chunks that still retain the semantic context of the entire document, but they are not a silver bullet, and the tradeoffs should be considered: