Text Splitters

Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is you may want to split a long document into smaller chunks that can fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.

At a high level, text splitters work as following:

Split the text up into small, semantically meaningful chunks (often sentences).
Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

That means there are two different axes along which you can customize your text splitter:

How the text is split
How the chunk size is measured

Types of Text Splitters

LangChain offers many different types of text splitters. These all live in the langchain-text-splitters package.

Table columns:

Name: Name of the text splitter
Classes: Classes that implement this text splitter
Splits On: How this text splitter splits text
Adds Metadata: Whether or not this text splitter adds metadata about where each chunk came from.
Description: Description of the splitter, including recommendation on when to use it.

Name	Classes	Splits On	Adds Metadata	Description
Recursive	RecursiveCharacterTextSplitter, RecursiveJsonSplitter	A list of user defined characters		Recursively splits text. This splitting is trying to keep related pieces of text next to each other. This is the `recommended way` to start splitting text.
HTML	HTMLHeaderTextSplitter, HTMLSectionSplitter	HTML specific characters	✅	Splits text based on HTML-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the HTML)
Markdown	MarkdownHeaderTextSplitter	Markdown specific characters	✅	Splits text based on Markdown-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the Markdown)
Code	many languages	Code (Python, JS) specific characters		Splits text based on characters specific to coding languages. 15 different languages are available to choose from.
Token	many classes	Tokens		Splits text on tokens. There exist a few different ways to measure tokens.
Character	CharacterTextSplitter	A user defined character		Splits text based on a user defined character. One of the simpler methods.
[Experimental] Semantic Chunker	SemanticChunker	Sentences		First splits on sentences. Then combines ones next to each other if they are semantically similar enough. Taken from Greg Kamradt
AI21 Semantic Text Splitter	AI21SemanticTextSplitter	✅	Identifies distinct topics that form coherent pieces of text and splits along those.

Evaluate text splitters

You can evaluate text splitters with the Chunkviz utility created by Greg Kamradt. Chunkviz is a great tool for visualizing how your text splitter is working. It will show you how your text is being split up and help in tuning up the splitting parameters.

Text Splitters

Types of Text Splitters

Evaluate text splitters

Other Document Transforms

Help us out by providing feedback on this documentation page:

Text Splitters

Types of Text Splitters​

Evaluate text splitters​

Other Document Transforms​

Help us out by providing feedback on this documentation page:

Types of Text Splitters

Evaluate text splitters

Other Document Transforms