
Optimizing NLP with Chunking: Key to Advanced Retrieval-Augmented Generation

June 4, 2024 | AI

In the context of Retrieval-Augmented Generation (RAG) in AI, chunking refers to the process of dividing a large body of text or a dataset into smaller, manageable segments or “chunks” before feeding them into the system. This is particularly important in scenarios where a model needs to retrieve relevant information from a large corpus to generate accurate and contextually appropriate responses.

What is RAG?

RAG is a hybrid approach in natural language processing (NLP) that combines retrieval-based and generation-based methods, augmenting the generator’s input with retrieved context. It typically involves two main components:

Retriever: This component searches through a large collection of documents or text passages to find the most relevant pieces of information based on the input query.
Generator: This component uses the retrieved information to generate a coherent and contextually appropriate response.

Why Chunking is Important in RAG

Chunking actually “comes before” RAG, in the preparation stage. We take a large corpus of documents and split them into smaller parts. During this document segmentation, the source document or corpus is divided into smaller segments or chunks, which could be based on paragraphs, sentences, or fixed-size text windows. The next step is indexing: each chunk is passed through an embedding model and stored separately, typically in a vector database, so that the retriever can search through the chunks efficiently. When we apply the RAG model, we give it a user query; that query is embedded as well, and the retriever searches the vector database for the chunks whose embeddings are closest to the embedded query. We call this part retrieval: when a query is presented, the retriever searches through the indexed chunks to find the ones that are most relevant to the query.
Later on, in response generation, the generator combines the original query with the information found through retrieval, using the retrieved chunks to produce a response and ensuring that the output is informed by the most pertinent information.
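
To make the pipeline concrete, here is a minimal sketch of the chunk-embed-index-retrieve flow. The embed() function below is a hypothetical stand-in for a real sentence-embedding model, and the in-memory array is only a toy substitute for a vector database.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding function; a real sentence-embedding model would go here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384)

# 1. Chunking: split the source document into smaller segments (here, paragraphs).
document = (
    "Ear infections are often caused by bacteria or viruses.\n\n"
    "Common symptoms include ear pain and trouble hearing.\n\n"
    "Treatment may involve antibiotics or pain relievers."
)
chunks = [p.strip() for p in document.split("\n\n") if p.strip()]

# 2. Indexing: embed every chunk and store the vectors (toy stand-in for a vector database).
index = np.vstack([embed(c) for c in chunks])

# 3. Retrieval: embed the user query and rank chunks by cosine similarity.
query = "What are the main causes of ear infection?"
q = embed(query)
scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

# 4. Generation: the original query plus the retrieved chunks form the generator's input.
prompt = query + "\n\nContext:\n" + "\n---\n".join(top_chunks)
print(prompt)
```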

Let’s have a look at a practical example. Imagine you have a long article about ear infections. Instead of treating the entire article as one unit, the text is chunked into smaller sections, perhaps by paragraphs, or even by causes, symptoms, and treatments (in practice, different chunking methods should be tried to see which is the most appropriate). When a query like “What are the main causes of ear infection?” is posed, the retriever searches through these chunks to find the sections discussing the causes of ear infection. The generator then uses this information to create a detailed and accurate response.

So, why is chunking important in RAG?

Efficiency

By breaking down a large document into smaller chunks, the retriever can more efficiently search through the text and identify the most relevant pieces of information. This reduces the computational load and speeds up the retrieval process.

Relevance

Smaller chunks can be more easily matched to specific parts of a query, improving the precision of the retrieved information. This ensures that the generator has access to highly relevant content to produce better responses.

Context Management

In long documents, relevant information might be scattered throughout the text. Chunking helps in capturing these dispersed pieces of information, ensuring that the retriever can consider all relevant parts of the document.

Methods for chunking text data

In the context of AI, specifically within Retrieval-Augmented Generation (RAG) and other NLP tasks, there are several methods for chunking text data. These methods aim to break down large bodies of text into smaller, manageable units or “chunks” to facilitate more efficient processing, retrieval, and generation. Obviously not all documents are the same and different documents and texts demand different chunking methods. In the following section we will focus on some of the most common chunking methods. Let’s have a look.

1. Fixed-Size Chunking

Fixed-Size Chunking involves dividing the text into chunks of a predetermined size, typically measured in units such as words, characters, or sometimes tokens. With this method, if the fixed size is 300 words, a document is split into chunks of 300 words each, regardless of sentence or paragraph boundaries. The number of units per chunk is fixed (up to a maximum), and there may be an optional overlap: a piece of text of some number of units at the beginning and end of each chunk, taken from the previous and next chunk, which connects the chunks and makes them more meaningful.

Fixed-Size Chunking is suitable for applications where uniform chunk sizes are beneficial, such as in certain machine learning models that require input of a specific length. However, chunks as small as 10 words may not contain enough information to be useful for search. Larger chunks retain more information as they approach the length of a typical paragraph, but if chunks become too long, the corresponding vector embeddings become increasingly general and eventually cease to be useful for searching. So what is the recommended “one size fits all”? There is none. Different documents, different queries, different goals, different chunk sizes.

However, if you opt for fixed-size chunks and don’t have to take any other factors into account, a common rule of thumb is a chunk size of around 100-200 words with a 20% overlap.
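
As an illustration, here is a minimal sketch of word-based fixed-size chunking with overlap; the chunk size and overlap below simply follow the rule of thumb above and are not prescriptive.

```python
def fixed_size_chunks(text: str, chunk_size: int = 150, overlap: int = 30) -> list[str]:
    """Split text into word-based chunks of at most chunk_size words,
    where each chunk repeats the last `overlap` words of the previous one."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the rest of the text is already covered
    return chunks
```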

2. Sentence-Based Chunking

Sentence-Based Chunking is, as the name says, chunking in which the text is divided into chunks based on individual sentences, with each sentence treated as a separate chunk. For example, a paragraph with four sentences will be split into four chunks, each containing one sentence. Sentence-Based Chunking can be useful in scenarios where sentence-level context is crucial, such as in question-answering systems or summarization tasks, or as a preprocessing step for tasks like parsing, sentiment analysis, or information extraction.
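
A minimal sketch of sentence-based chunking follows; it uses a naive punctuation-based split, while a library tokenizer (for example NLTK’s sent_tokenize or spaCy) would handle abbreviations and other edge cases better.

```python
import re

def sentence_chunks(text: str) -> list[str]:
    """Naive sentence splitter: breaks on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

text = ("Ear infections are common in children. They are often caused by bacteria. "
        "Most cases clear up within a few days.")
print(sentence_chunks(text))  # one chunk per sentence
```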

3. Paragraph-Based Chunking

Similar to sentence-based chunking, paragraph-based chunking divides the text into chunks based on paragraph boundaries. Each paragraph becomes a chunk, so a document with five paragraphs will be split into five chunks, each containing one paragraph. Paragraph-based chunking is effective for maintaining the context within paragraphs, which is useful in document retrieval and analysis tasks.

Both sentence- and paragraph-based chunking are useful when dealing with document summarization or content analysis tasks, such as processing news articles or academic papers.
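
A minimal sketch of paragraph-based chunking, assuming paragraphs are separated by blank lines as in most plain-text sources:

```python
def paragraph_chunks(text: str) -> list[str]:
    """Split text at blank lines, producing one chunk per paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

document = (
    "Ear infections are usually caused by bacteria or viruses.\n\n"
    "Typical symptoms include ear pain and fever.\n\n"
    "Most infections clear up with rest or antibiotics."
)
print(paragraph_chunks(document))  # three chunks, one per paragraph
```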

4. Semantic or Topic-Based Chunking

With Semantic or Topic-Based Chunking the text is divided based on semantic meaning or topics. Chunks are created where there is a natural shift in topic or meaning. For example, an article discussing ear infections (e.g., causes, symptoms, treatments) could be chunked into sections based on these topics. One option would also be to split the text into sentences, group those sentences into groups of, let’s say, three sentences, and then merge the groups that are similar in the embedding space. Following this logic, semantic chunking involves taking the embeddings of every sentence in the document, comparing the similarity of all sentences with each other, and then grouping sentences with the most similar embeddings together.

Semantic or Topic-Based Chunking is ideal for tasks requiring coherent topic-based segmentation, such as topic modeling and content summarization.
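
A minimal sketch of this idea, assuming a hypothetical embed() function that maps a sentence to a vector (in practice a real sentence-embedding model would be used): consecutive sentences stay in the same chunk as long as each one is similar enough to the previous sentence.

```python
import numpy as np

def embed(sentence: str) -> np.ndarray:
    """Hypothetical sentence-embedding function; a real model would go here."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.random(384)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(sentences: list[str], threshold: float = 0.7) -> list[str]:
    """Start a new chunk whenever a sentence is not similar enough to the previous one."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) >= threshold:
            current.append(sent)          # same topic: extend the current chunk
        else:
            chunks.append(" ".join(current))
            current = [sent]              # topic shift: begin a new chunk
    chunks.append(" ".join(current))
    return chunks
```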

5. Sliding Window Chunking

This method involves dividing text into fixed-size chunks based on character count. The simplicity of implementation and the inclusion of overlapping segments (it is also possible not to overlap) aim to prevent cutting sentences or thoughts. However, limitations can include imprecise control over context size, the risk of cutting words or sentences, and a lack of semantic consideration. Therefore this chunking method is more suitable for exploratory analysis but not recommended for tasks requiring deep semantic understanding. Let’s take a look at the sliding window chunking method in more detail.

The key concepts in sliding window chunking are the window, which is a fixed-size segment or subset of the data, and the sliding, which is the process of moving the window across the data at regular intervals or steps. First we define the window size, which is the number of data points or items included in each chunk, and then we determine the step size, which decides how far the window moves forward after each chunk. The step size can be smaller than, equal to, or larger than the window size. Let’s say we have a window size of 3 and a step size of 2. Our first chunk will be [1, 2, 3], moving the window by 2 steps creates a second chunk [3, 4, 5], moving it 2 more steps gives a third chunk [5, 6, 7], and so on.
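
A minimal sketch of a sliding window over a sequence, reproducing the window size 3 / step size 2 example above:

```python
def sliding_window(items, window_size: int = 3, step: int = 2):
    """Yield fixed-size windows that advance by `step` items, so consecutive
    windows overlap whenever step < window_size."""
    for start in range(0, max(len(items) - window_size + 1, 1), step):
        yield items[start:start + window_size]

print(list(sliding_window([1, 2, 3, 4, 5, 6, 7])))
# [[1, 2, 3], [3, 4, 5], [5, 6, 7]]
```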


6. Custom Rule-Based Chunking

In Custom Rule-Based Chunking, chunks are created based on custom rules defined for the specific use case; these rules can be based on punctuation, keywords, or other textual patterns.
For example, in legal documents, chunks could be created at section headers like “Article”, “Section”, or “Clause”, and in medical texts, chunks could be created at section headers like “Causes”, “Symptoms”, “Treatments”, or “Drugs”, as sketched below.
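
Here is a minimal sketch of such a rule, splitting a document at section headers; the header names are just the illustrative examples from the paragraph above.

```python
import re

# Start a new chunk whenever a line begins with one of these section headers.
HEADER_RULE = re.compile(r"^(Causes|Symptoms|Treatments|Drugs)\b", re.MULTILINE)

def rule_based_chunks(text: str) -> list[str]:
    """Split the text right before every line that matches HEADER_RULE."""
    starts = [m.start() for m in HEADER_RULE.finditer(text)]
    if not starts or starts[0] != 0:
        starts = [0] + starts            # keep any text before the first header
    bounds = starts + [len(text)]
    return [text[s:e].strip() for s, e in zip(bounds, bounds[1:]) if text[s:e].strip()]

document = (
    "Causes\nBacteria and viruses are the usual culprits.\n"
    "Symptoms\nEar pain and fever are common.\n"
    "Treatments\nAntibiotics may be prescribed."
)
print(rule_based_chunks(document))  # one chunk per section
```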

The Custom Rule-Based Chunking technique is especially useful when working with structured data extraction from text, such as identifying named entities (like person names, locations, organizations), noun phrases, or specialized patterns (like dates, addresses, etc.).

How Does Custom Rule-Based Chunking Work?

The first step in custom rule-based chunking involves defining grammatical or pattern-based rules that describe the chunks you want to extract. These rules are typically based on patterns of part-of-speech tags or specific sequences of words and characters. Before applying the chunking rules, the text is processed through a part-of-speech tagger that labels each word with its corresponding part of speech (noun, verb, adjective, etc.). Then, using a chunk parser, the tagged text is analyzed according to the predefined rules. For example, a simple rule could be to extract a chunk every time an adjective is followed by one or more nouns, capturing basic noun phrases.

One of the tools used for Custom Rule-Based Chunking is NLTK (Natural Language Toolkit), a popular Python library for working with human language data. It includes support for rule-based chunking with its RegexpParser class, which allows custom rules to be defined using regular expressions. SpaCy is another powerful Python NLP library; it supports rule-based matching through its Matcher and PhraseMatcher tools, though it is generally more oriented towards statistical models.
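
A minimal sketch of the adjective-plus-nouns rule mentioned above, using NLTK’s RegexpParser; the part-of-speech tags are supplied by hand here, whereas in practice a tagger such as nltk.pos_tag would produce them.

```python
from nltk import RegexpParser

# Rule: a noun phrase (NP) chunk is zero or more adjectives followed by one or more nouns.
grammar = "NP: {<JJ>*<NN.*>+}"
parser = RegexpParser(grammar)

# Pre-tagged sentence as (word, part-of-speech tag) pairs.
tagged = [("Acute", "JJ"), ("ear", "NN"), ("infections", "NNS"),
          ("cause", "VBP"), ("severe", "JJ"), ("pain", "NN"),
          ("in", "IN"), ("young", "JJ"), ("children", "NNS")]

tree = parser.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
# Acute ear infections / severe pain / young children
```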

Use Cases of Custom Rule-Based Chunking

Custom rule-based chunking is a powerful approach, particularly when you have a clear understanding of the text structure and the specific information you need to extract. However, it requires careful rule design and might not generalize well across texts with varying syntax or vocabulary. For broader applications, combining rule-based methods with machine learning techniques often yields the best results.

7. Machine Learning-Based Chunking

Machine learning-based chunking in natural language processing leverages statistical models to automatically identify and extract meaningful phrases, or “chunks,” from text without relying strictly on predefined rules. Models can be trained to recognize where meaningful divisions occur – these trained models understand and predict the structure and relationships within sentences, making it more adaptable to variations in language use than purely rule-based methods.

Machine Learning-Based Chunking can be used in information retrieval, because it enhances the efficiency and relevance of search results by indexing and retrieving smaller, relevant chunks; in text summarization, because it improves summarization by processing manageable text units while maintaining context; in question answering, because it facilitates accurate answer retrieval by isolating relevant text segments; and in text preprocessing, because it simplifies preprocessing for various NLP tasks by breaking down large documents.

How Does Machine Learning-Based Chunking Work?

Machine Learning-Based Chunking relies on model training. Unlike rule-based chunking, machine learning-based chunking requires a training dataset consisting of texts with annotated chunks; these annotations teach the model what types of word groupings constitute a chunk. During training, various features of the text, such as part-of-speech tags, syntactic dependencies, and word embeddings (vector representations of words), inform the model about the context and semantic relationships between words. Once the model is trained, it can process new, unannotated texts to predict chunks. This is often done using sequence labeling techniques, where each word in a sentence is labeled as being the start of a chunk, inside a chunk, or outside any chunk.
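
This start/inside/outside labeling is commonly written as BIO tags. The sketch below shows how a sequence of predicted BIO labels (hard-coded here, where a trained sequence model would normally supply them) is turned back into chunks:

```python
def bio_to_chunks(tokens: list[str], labels: list[str]) -> list[str]:
    """Group tokens into chunks: 'B' starts a chunk, 'I' continues it, 'O' is outside."""
    chunks, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":                      # beginning of a new chunk
            if current:
                chunks.append(" ".join(current))
            current = [token]
        elif label == "I" and current:        # inside the current chunk
            current.append(token)
        else:                                 # outside any chunk
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ["Acute", "ear", "infections", "cause", "severe", "pain"]
labels = ["B", "I", "I", "O", "B", "I"]       # what a trained model might predict
print(bio_to_chunks(tokens, labels))          # ['Acute ear infections', 'severe pain']
```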

Conditional Random Fields (CRFs) are a popular statistical modeling technique used for sequence prediction tasks like chunking. CRFs consider the context of neighboring labels in their predictions, which is particularly useful for tasks where the context strongly influences the correct label. More recently, transformer-based models, such as BERT (Bidirectional Encoder Representations from Transformers) and its variants, have been used for chunking. These models use a mechanism called attention to weigh the influence of different words within a sentence irrespective of their distance from each other, leading to highly effective chunking.

One of the major advantages of Machine Learning-Based Chunking is robustness to new inputs: this type of chunking can adapt to new patterns or variations in language more effectively than rule-based systems, which makes it suitable for applications with diverse textual inputs. And once the model is trained, it can process large volumes of text quickly, which is crucial for big data applications. In addition, machine learning models that learn to chunk text can also be adapted or used in conjunction with other NLP tasks like named entity recognition, sentiment analysis, and more, providing a versatile tool in the NLP toolkit.

However, there are challenges too. The accuracy of a machine learning model heavily depends on the quantity and quality of the training data, so inadequate or biased training data can lead to poor performance. In addition, training and deploying machine learning models require substantial computational resources and expertise in model tuning and evaluation.

Conclusion

The concept of chunking in Retrieval-Augmented Generation (RAG) significantly enhances the efficiency and relevance of natural language processing systems. By breaking down large text corpora into smaller, more manageable segments, systems can retrieve and utilize specific information more effectively. This process not only speeds up the retrieval phase but also ensures that responses generated by AI are precise and contextually relevant. Various methods of chunking—such as fixed-size, sentence-based, paragraph-based, and semantic chunking—offer flexibility in handling different types of texts and queries. Each method has its unique advantages, catering to specific needs within document analysis, information retrieval, and AI-driven response generation. As AI continues to evolve, these chunking techniques are pivotal in optimizing the balance between computational efficiency and the quality of generated content, ultimately leading to more sophisticated and accurate NLP applications.