Open
24 commits
8bf2818
Delete NoopTextSplitter
natoverse Dec 16, 2025
bdc2485
Delete unused check_token_limit
natoverse Dec 16, 2025
9e8c900
Add base chunking factory and migrate workflow to use it
natoverse Dec 17, 2025
81240ab
Merge v3/main into chunker-factory
natoverse Dec 17, 2025
4612917
Split apart chunker module
natoverse Dec 18, 2025
b63f747
Co-locate chunking/splitting
natoverse Dec 18, 2025
a20dbdb
Collapse token splitting functionality into one class/function
natoverse Dec 18, 2025
e5c1aa7
Restore create_base_text_units parameterization
natoverse Dec 18, 2025
b7c0673
Move Tokenizer base class to common package
natoverse Dec 18, 2025
896a48c
Move pre-pending into chunkers
natoverse Dec 19, 2025
9aa94df
Streamline config
natoverse Dec 19, 2025
eb22d7a
Fix defaults construction
natoverse Dec 19, 2025
780a038
Add prepending tests
natoverse Dec 20, 2025
026474a
Remove chunk_size_includes_metadata config
natoverse Dec 20, 2025
c8dbb02
Revert ChunkingDocument interface
natoverse Dec 22, 2025
247547f
Move metadata prepending to a util
natoverse Dec 22, 2025
90479c0
Move Tokenizer back to GR core
natoverse Dec 22, 2025
b32f403
Fix tokenizer removal from chunker
natoverse Dec 22, 2025
d9ba63f
Set defaults for chunking config
natoverse Dec 22, 2025
bd968f2
Move chunking to monorepo package
natoverse Dec 22, 2025
88af7f8
Format
natoverse Dec 22, 2025
ee20153
Typo
natoverse Dec 22, 2025
a741bfb
Add ChunkResult model
natoverse Dec 22, 2025
7748493
Streamline chunking config
natoverse Dec 23, 2025
3 changes: 1 addition & 2 deletions docs/config/yaml.md
@@ -100,12 +100,11 @@ These settings configure how we parse documents into text chunks. This is necess

#### Fields

- `strategy` **str**[tokens|sentences] - How to chunk the text.
- `size` **int** - The max chunk size in tokens.
- `overlap` **int** - The chunk overlap in tokens.
- `strategy` **str**[tokens|sentences] - How to chunk the text.
- `encoding_model` **str** - The text encoding model to use for splitting on token boundaries.
- `prepend_metadata` **bool** - Determines if metadata values should be added at the beginning of each chunk. Default=`False`.
- `chunk_size_includes_metadata` **bool** - Specifies whether the chunk size calculation should include metadata tokens. Default=`False`.

## Outputs and Storage

52 changes: 1 addition & 51 deletions docs/index/inputs.md
@@ -82,10 +82,9 @@ As described above, when documents are imported you can specify a list of `metad

### Chunking Config

Next, the `chunks` block needs to instruct the chunker how to handle this metadata when creating text units. By default, it is ignored. We have two settings to include it:
Next, the `chunks` block needs to instruct the chunker how to handle this metadata when creating text units. By default, it is ignored. We have the following setting to include it:

- `prepend_metadata`. This instructs the importer to copy the contents of the `metadata` column for each row into the start of every single text chunk. This metadata is copied as key: value pairs on new lines.
- `chunk_size_includes_metadata`: This tells the chunker how to compute the chunk size when metadata is included. By default, we create the text units using your specified `chunk_size` *and then* prepend the metadata. This means that the final text unit lengths may be longer than your configured `chunk_size`, and it will vary based on the length of the metadata for each document. When this setting is `True`, we will compute the raw text using the remainder after measuring the metadata length so that the resulting text units always comply with your configured `chunk_size`.
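
To make the two sizing modes described above concrete, here is a minimal sketch. It treats whitespace-separated words as a stand-in for real tokens, and `chunk_words` and `with_metadata` are hypothetical helpers written for illustration, not GraphRAG functions.

```python
def chunk_words(words: list[str], size: int) -> list[list[str]]:
    """Split a word list into consecutive groups of at most `size` words."""
    return [words[i : i + size] for i in range(0, len(words), size)]


def with_metadata(
    text: str,
    metadata: dict[str, str],
    size: int,
    size_includes_metadata: bool,
) -> list[str]:
    """Prepend metadata to every chunk, optionally budgeting for its length."""
    header = "\n".join(f"{k}: {v}" for k, v in metadata.items()) + "\n"
    header_len = len(header.split())
    # When metadata counts toward the size, only the remainder is left for text.
    budget = size - header_len if size_includes_metadata else size
    if budget <= 0:
        raise ValueError("metadata is longer than the chunk size")
    return [header + " ".join(group) for group in chunk_words(text.split(), budget)]


text = "one two three four five six seven eight"
meta = {"headline": "Short title"}

# Size applies to the text only, so final chunks exceed 6 words.
loose = with_metadata(text, meta, size=6, size_includes_metadata=False)
# Size applies to metadata plus text, so every chunk fits in 6 words.
strict = with_metadata(text, meta, size=6, size_includes_metadata=True)
```

In the first mode the chunk count stays low but chunk length varies with the metadata; in the second, every chunk respects the limit at the cost of producing more chunks.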

### Examples

@@ -124,7 +123,6 @@ chunks:
  size: 100
  overlap: 0
  prepend_metadata: true
  chunk_size_includes_metadata: false
```

Documents DataFrame
@@ -162,54 +160,6 @@ US to lift most federal COVID-19 vaccine mandates,WASHINGTON (AP) The Biden admi

NY lawmakers begin debating budget 1 month after due date,ALBANY, N.Y. (AP) New York lawmakers began voting Monday on a $229 billion state budget due a month ago that would raise the minimum wage, crack down on illicit pot shops and ban gas stoves and furnaces in new buildings. Negotiations among Gov. Kathy Hochul and her fellow Democrats in control of the Legislature dragged on past the April 1 budget deadline, largely because of disagreements over changes to the bail law and other policy proposals included in the spending plan. Floor debates on some budget bills began Monday. State Senate Majority Leader Andrea Stewart-Cousins said she expected voting to be wrapped up Tuesday for a budget she said contains "significant wins" for New Yorkers. "I would have liked to have done this sooner. I think we would all agree to that," Cousins told reporters before voting began. "This has been a very policy-laden budget and a lot of the policies had to parsed through." Hochul was able to push through a change to the bail law that will eliminate the standard that requires judges to prescribe the "least restrictive" means to ensure defendants return to court. Hochul said judges needed the extra discretion. Some liberal lawmakers argued that it would undercut the sweeping bail reforms approved in 2019 and result in more people with low incomes and people of color in pretrial detention. Here are some other policy provisions that will be included in the budget, according to state officials. The minimum wage would be raised to $17 in New York City and some of its suburbs and $16 in the rest of the state by 2026. That's up from $15 in the city and $14.20 upstate.

--

settings.yaml

```yaml
input:
  file_type: csv
  title_column: headline
  text_column: article
  metadata: [headline]

chunks:
  size: 50
  overlap: 5
  prepend_metadata: true
  chunk_size_includes_metadata: true
```

Documents DataFrame

| id | title | text | creation_date | metadata |
| --------------------- | --------------------------------------------------------- | ------------------------ | ----------------------------- | --------------------------------------------------------------------------- |
| (generated from text) | US to lift most federal COVID-19 vaccine mandates | (article column content) | (create date of articles.csv) | { "headline": "US to lift most federal COVID-19 vaccine mandates" } |
| (generated from text) | NY lawmakers begin debating budget 1 month after due date | (article column content) | (create date of articles.csv) | { "headline": "NY lawmakers begin debating budget 1 month after due date" } |

Raw Text Chunks

| content | length |
| ------- | ------: |
| title: US to lift most federal COVID-19 vaccine mandates<br>WASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday. Vaccine requirements for federal workers and federal contractors, | 50 |
| title: US to lift most federal COVID-19 vaccine mandates<br>federal workers and federal contractors as well as foreign air travelers to the U.S., will end May 11. The government is also beginning the process of lifting shot requirements for Head Start educators, healthcare workers, and noncitizens at U.S. land borders. | 50 |
| title: US to lift most federal COVID-19 vaccine mandates<br>noncitizens at U.S. land borders. The requirements are among the last vestiges of some of the more coercive measures taken by the federal government to promote vaccination as the deadly virus raged, and their end marks the latest display of how | 50 |
| title: US to lift most federal COVID-19 vaccine mandates<br>the latest display of how President Joe Biden's administration is moving to treat COVID-19 as a routine, endemic illness. "While I believe that these vaccine mandates had a tremendous beneficial impact, we are now at a point where we think that | 50 |
| title: US to lift most federal COVID-19 vaccine mandates<br>point where we think that it makes a lot of sense to pull these requirements down," White House COVID-19 coordinator Dr. Ashish Jha told The Associated Press on Monday. | 38 |
| title: NY lawmakers begin debating budget 1 month after due date<br>ALBANY, N.Y. (AP) New York lawmakers began voting Monday on a $229 billion state budget due a month ago that would raise the minimum wage, crack down on illicit pot shops and ban gas stoves and furnaces in new | 50 |
| title: NY lawmakers begin debating budget 1 month after due date<br>stoves and furnaces in new buildings. Negotiations among Gov. Kathy Hochul and her fellow Democrats in control of the Legislature dragged on past the April 1 budget deadline, largely because of disagreements over changes to the bail law and | 50 |
| title: NY lawmakers begin debating budget 1 month after due date<br>to the bail law and other policy proposals included in the spending plan. Floor debates on some budget bills began Monday. State Senate Majority Leader Andrea Stewart-Cousins said she expected voting to be wrapped up Tuesday for a budget | 50 |
|title: NY lawmakers begin debating budget 1 month after due date<br>up Tuesday for a budget she said contains "significant wins" for New Yorkers. "I would have liked to have done this sooner. I think we would all agree to that," Cousins told reporters before voting began. "This has been | 50 |
| title: NY lawmakers begin debating budget 1 month after due date<br>voting began. "This has been a very policy-laden budget and a lot of the policies had to parsed through." Hochul was able to push through a change to the bail law that will eliminate the standard that requires judges | 50 |
| title: NY lawmakers begin debating budget 1 month after due date<br>the standard that requires judges to prescribe the "least restrictive" means to ensure defendants return to court. Hochul said judges needed the extra discretion. Some liberal lawmakers argued that it would undercut the sweeping bail reforms approved in 2019 | 50 |
| title: NY lawmakers begin debating budget 1 month after due date<br>bail reforms approved in 2019 and result in more people with low incomes and people of color in pretrial detention. Here are some other policy provisions that will be included in the budget, according to state officials. The minimum | 50 |
| title: NY lawmakers begin debating budget 1 month after due date<br>to state officials. The minimum wage would be raised to $17 in be raised to $17 in New York City and some of its suburbs and $16 in the rest of the state by 2026. That's up from $15 | 50 |
| title: NY lawmakers begin debating budget 1 month after due date<br>2026. That's up from $15 in the city and $14.20 upstate. | 22 |


In this example we can see that the two input documents were parsed into fourteen output text chunks. The title (headline) of each document is prepended and included in the computed chunk size, so each chunk matches the configured chunk size (except the last one for each document). We've also configured some overlap in these text chunks, so the last five tokens are shared. Why would you use overlap in your text chunks? Consider that when you are splitting documents based on tokens, it is highly likely that sentences or even related concepts will be split into separate chunks. Each text chunk is processed separately by the language model, so this may result in incomplete "ideas" at the boundaries of the chunk. Overlap ensures that these split concepts are fully contained in at least one of the chunks.
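
The overlap behaviour described above can be sketched in a few lines. This is an illustration only: it uses single letters in place of real tokens, and `chunk_with_overlap` is a hypothetical name, not the package's token chunker.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


tokens = "a b c d e f g h i j".split()
chunks = chunk_with_overlap(tokens, size=4, overlap=1)
# Each chunk begins with the last token of the previous one.
```

With `overlap=1`, the token that ends one chunk also opens the next, so a concept split at a boundary still appears whole in at least one chunk when the overlap is large enough to cover it.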


#### JSON files

This final example uses a JSON file for each of the same two articles. In this example we'll set the object fields to read, but we will not add metadata to the text chunks.
32 changes: 32 additions & 0 deletions packages/graphrag-chunking/README.md
@@ -0,0 +1,32 @@
# GraphRAG Chunking

This package contains a collection of text chunkers, a core config model, and a factory for acquiring instances.

## Examples

Basic sentence chunking with nltk
```python
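from graphrag_chunking.sentence_chunker import SentenceChunker

chunker = SentenceChunker()
chunks = chunker.chunk("This is a test. Another sentence.")
# chunk() returns ChunkResult objects, so print only their text
print([c.text for c in chunks]) # ["This is a test.", "Another sentence."]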
```

Token chunking
```python
import tiktoken
from graphrag_chunking.token_chunker import TokenChunker

tokenizer = tiktoken.get_encoding("o200k_base")
chunker = TokenChunker(size=3, overlap=0, encode=tokenizer.encode, decode=tokenizer.decode)
chunks = chunker.chunk("This is a random test fragment of some text")
print([c.text for c in chunks]) # ["This is a", " random test fragment", " of some text"]
```

Using the factory via helper util
```python
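import tiktoken
from graphrag_chunking.chunker_factory import create_chunker
from graphrag_chunking.chunking_config import ChunkingConfig

tokenizer = tiktoken.get_encoding("o200k_base")
config = ChunkingConfig(
    type="tokens",  # the field that create_chunker reads to pick a chunker
    size=3,
    overlap=0,
)
chunker = create_chunker(config, tokenizer.encode, tokenizer.decode)
...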
```
@@ -1,2 +1,4 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""System-level chunking package."""
19 changes: 19 additions & 0 deletions packages/graphrag-chunking/graphrag_chunking/add_metadata.py
@@ -0,0 +1,19 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""A module containing the 'add_metadata' function."""


def add_metadata(
    text: str,
    metadata: dict,
    delimiter: str = ": ",
    line_delimiter: str = "\n",
    append: bool = False,
) -> str:
    """Add metadata to the given text, prepending by default. This utility writes the dict as rows of key/value pairs."""
    metadata_str = (
        line_delimiter.join(f"{k}{delimiter}{v}" for k, v in metadata.items())
        + line_delimiter
    )
    return text + metadata_str if append else metadata_str + text
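
As a quick illustration of what this utility produces, the sketch below copies the function body from the diff so it runs standalone. The sample text and metadata values are made up.

```python
def add_metadata(
    text: str,
    metadata: dict,
    delimiter: str = ": ",
    line_delimiter: str = "\n",
    append: bool = False,
) -> str:
    """Local copy of the utility above, so the example is self-contained."""
    metadata_str = (
        line_delimiter.join(f"{k}{delimiter}{v}" for k, v in metadata.items())
        + line_delimiter
    )
    return text + metadata_str if append else metadata_str + text


# Default: one "key: value" row per entry, followed by the text.
prepended = add_metadata("Body text.", {"headline": "A title", "source": "AP"})

# append=True places the rows after the text instead.
appended = add_metadata("Body text.", {"headline": "A title"}, append=True)
```

Note that with `append=True` the function as written inserts nothing between the text and the first metadata row, so the result here is `"Body text.headline: A title\n"`. Callers that append may want the text to end with a line delimiter.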
17 changes: 17 additions & 0 deletions packages/graphrag-chunking/graphrag_chunking/chunk_result.py
@@ -0,0 +1,17 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""The ChunkResult dataclass."""

from dataclasses import dataclass


@dataclass
class ChunkResult:
    """Result of chunking a document."""

    text: str
    index: int
    start_char: int
    end_char: int
    token_count: int | None = None
@@ -0,0 +1,13 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""Chunk strategy type enumeration."""

from enum import StrEnum


class ChunkerType(StrEnum):
    """ChunkerType class definition."""

    Tokens = "tokens"
    Sentence = "sentence"
21 changes: 21 additions & 0 deletions packages/graphrag-chunking/graphrag_chunking/chunker.py
@@ -0,0 +1,21 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""A module containing the 'Chunker' class."""

from abc import ABC, abstractmethod
from typing import Any

from graphrag_chunking.chunk_result import ChunkResult


class Chunker(ABC):
    """Abstract base class for document chunkers."""

    @abstractmethod
    def __init__(self, **kwargs: Any) -> None:
        """Create a chunker instance."""

    @abstractmethod
    def chunk(self, text: str) -> list[ChunkResult]:
        """Chunk method definition."""
77 changes: 77 additions & 0 deletions packages/graphrag-chunking/graphrag_chunking/chunker_factory.py
@@ -0,0 +1,77 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""A module containing 'ChunkerFactory', 'register_chunker', and 'create_chunker'."""

from collections.abc import Callable

from graphrag_common.factory.factory import Factory, ServiceScope

from graphrag_chunking.chunk_strategy_type import ChunkerType
from graphrag_chunking.chunker import Chunker
from graphrag_chunking.chunking_config import ChunkingConfig


class ChunkerFactory(Factory[Chunker]):
    """Factory for creating Chunker instances."""


chunker_factory = ChunkerFactory()


def register_chunker(
    chunker_type: str,
    chunker_initializer: Callable[..., Chunker],
    scope: ServiceScope = "transient",
) -> None:
    """Register a custom chunker implementation.

    Args
    ----
        - chunker_type: str
            The chunker id to register.
        - chunker_initializer: Callable[..., Chunker]
            The chunker initializer to register.
    """
    chunker_factory.register(chunker_type, chunker_initializer, scope)


def create_chunker(
    config: ChunkingConfig,
    encode: Callable[[str], list[int]] | None = None,
    decode: Callable[[list[int]], str] | None = None,
) -> Chunker:
    """Create a chunker implementation based on the given configuration.

    Args
    ----
        - config: ChunkingConfig
            The chunker configuration to use.

    Returns
    -------
        Chunker
            The created chunker implementation.
    """
    config_model = config.model_dump()
    if encode is not None:
        config_model["encode"] = encode
    if decode is not None:
        config_model["decode"] = decode
    chunker_strategy = config.type

    if chunker_strategy not in chunker_factory:
        match chunker_strategy:
            case ChunkerType.Tokens:
                from graphrag_chunking.token_chunker import TokenChunker

                register_chunker(ChunkerType.Tokens, TokenChunker)
            case ChunkerType.Sentence:
                from graphrag_chunking.sentence_chunker import SentenceChunker

                register_chunker(ChunkerType.Sentence, SentenceChunker)
            case _:
                msg = f"ChunkingConfig.strategy '{chunker_strategy}' is not registered in the ChunkerFactory. Registered types: {', '.join(chunker_factory.keys())}."
                raise ValueError(msg)

    return chunker_factory.create(chunker_strategy, init_args=config_model)
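
The lazy-registration pattern above can be sketched without the shared factory. In the sketch below a plain dict stands in for `ChunkerFactory`; the `graphrag_common` Factory API is not reproduced, and `WordChunker`, `ParagraphChunker`, and `create` are hypothetical names. Forwarding the whole config as keyword arguments is an assumption about how `init_args` is consumed.

```python
from collections.abc import Callable
from typing import Any

# Stand-in registry: maps a type name to an initializer.
registry: dict[str, Callable[..., Any]] = {}


class WordChunker:
    """Hypothetical built-in chunker that groups whitespace-separated words."""

    def __init__(self, size: int = 3, **kwargs: Any) -> None:
        # Extra config keys (type, overlap, ...) land in kwargs and are ignored.
        self.size = size

    def chunk(self, text: str) -> list[str]:
        words = text.split()
        return [
            " ".join(words[i : i + self.size])
            for i in range(0, len(words), self.size)
        ]


class ParagraphChunker:
    """Hypothetical custom chunker that splits on blank lines."""

    def __init__(self, **kwargs: Any) -> None:
        pass

    def chunk(self, text: str) -> list[str]:
        return [p for p in text.split("\n\n") if p]


def create(config: dict[str, Any]) -> Any:
    """Create a chunker, registering known built-ins on first use."""
    name = config["type"]
    if name not in registry:
        if name == "words":
            registry["words"] = WordChunker  # lazy built-in registration
        else:
            known = ", ".join(registry) or "none"
            msg = f"Chunker type '{name}' is not registered. Registered types: {known}."
            raise ValueError(msg)
    return registry[name](**config)


# A custom implementation must be registered before it is requested.
registry["paragraphs"] = ParagraphChunker

words = create({"type": "words", "size": 2}).chunk("a b c d e")
paragraphs = create({"type": "paragraphs"}).chunk("First.\n\nSecond.")
```

Deferring the built-in imports until first use keeps optional dependencies (such as nltk for the sentence chunker) from loading unless that chunker is actually requested.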
36 changes: 36 additions & 0 deletions packages/graphrag-chunking/graphrag_chunking/chunking_config.py
@@ -0,0 +1,36 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""Parameterization settings for the default configuration."""

from pydantic import BaseModel, ConfigDict, Field

from graphrag_chunking.chunk_strategy_type import ChunkerType


class ChunkingConfig(BaseModel):
    """Configuration section for chunking."""

    model_config = ConfigDict(extra="allow")
    """Allow extra fields to support custom chunker implementations."""

    type: str = Field(
        description="The chunking type to use.",
        default=ChunkerType.Tokens,
    )
    encoding_model: str | None = Field(
        description="The encoding model to use.",
        default=None,
    )
    size: int = Field(
        description="The chunk size to use.",
        default=1200,
    )
    overlap: int = Field(
        description="The chunk overlap to use.",
        default=100,
    )
    prepend_metadata: bool = Field(
        description="Prepend metadata into each chunk.",
        default=False,
    )
@@ -0,0 +1,30 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""A module containing 'create_chunk_results' function."""

from collections.abc import Callable

from graphrag_chunking.chunk_result import ChunkResult


def create_chunk_results(
    chunks: list[str],
    encode: Callable[[str], list[int]] | None = None,
) -> list[ChunkResult]:
    """Create chunk results from a list of text chunks. The index assignments are 0-based and assume chunks were not stripped relative to the source text."""
    results = []
    start_char = 0
    for index, chunk in enumerate(chunks):
        end_char = start_char + len(chunk) - 1  # 0-based indices
        chunk = ChunkResult(
            text=chunk,
            index=index,
            start_char=start_char,
            end_char=end_char,
        )
        if encode:
            chunk.token_count = len(encode(chunk.text))
        results.append(chunk)
        start_char = end_char + 1
    return results
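
The offset arithmetic is easier to follow with concrete numbers. The sketch below reduces the function to bare `(start, end)` tuples; `char_spans` is a hypothetical name used only for this illustration.

```python
def char_spans(chunks: list[str]) -> list[tuple[int, int]]:
    """Assign inclusive, 0-based (start, end) offsets to consecutive chunks."""
    spans = []
    start = 0
    for chunk in chunks:
        end = start + len(chunk) - 1
        spans.append((start, end))
        start = end + 1  # the next chunk is assumed to begin immediately after
    return spans


# Chunks that concatenate back to the source get exact offsets.
source = "This is a random test"
contiguous = char_spans(["This is a", " random test"])

# Chunks with whitespace trimmed drift: "Bye." really starts at index 4.
trimmed_source = "Hi. Bye."
trimmed = char_spans(["Hi.", "Bye."])
```

The second case shows why the docstring states the no-stripping assumption: once separators are dropped, the computed offsets fall behind the true positions, which is the drift the sentence chunker's adjustment loop compensates for.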
@@ -0,0 +1,44 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""A module containing 'SentenceChunker' class."""

from collections.abc import Callable
from typing import Any

import nltk

from graphrag_chunking.bootstrap_nltk import bootstrap
from graphrag_chunking.chunk_result import ChunkResult
from graphrag_chunking.chunker import Chunker
from graphrag_chunking.create_chunk_results import create_chunk_results


class SentenceChunker(Chunker):
    """A chunker that splits text into sentence-based chunks."""

    def __init__(
        self, encode: Callable[[str], list[int]] | None = None, **kwargs: Any
    ) -> None:
        """Create a sentence chunker instance."""
        self._encode = encode
        bootstrap()

    def chunk(self, text) -> list[ChunkResult]:
        """Chunk the text into sentence-based chunks."""
        sentences = nltk.sent_tokenize(text.strip())
        results = create_chunk_results(sentences, encode=self._encode)
        # nltk sentence tokenizer may trim whitespace, so we need to adjust start/end chars
        for index, result in enumerate(results):
            txt = result.text
            start = result.start_char
            actual_start = text.find(txt, start)
            delta = actual_start - start
            if delta > 0:
                result.start_char += delta
                result.end_char += delta
                # bump the next to keep the start check from falling too far behind
                if index < len(results) - 1:
                    results[index + 1].start_char += delta
                    results[index + 1].end_char += delta
        return results
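
The offset realignment can be illustrated without nltk. The sketch below is not the PR's code: it swaps in a simple regex splitter for `nltk.sent_tokenize` and locates each sentence with a moving cursor rather than adjusting precomputed offsets by a delta. `sentence_spans` is a hypothetical name.

```python
import re


def sentence_spans(text: str) -> list[tuple[str, int, int]]:
    """Return (sentence, start, end) with inclusive offsets into `text`."""
    # Stand-in splitter: break after ., !, or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    spans = []
    cursor = 0
    for sentence in sentences:
        # Search from the cursor so repeated sentences map to later positions.
        start = text.find(sentence, cursor)
        end = start + len(sentence) - 1
        spans.append((sentence, start, end))
        cursor = end + 1
    return spans


text = "  This is a test.   Another sentence. "
spans = sentence_spans(text)
```

Either approach relies on each sentence appearing verbatim in the source text. If a tokenizer normalizes characters inside a sentence, `str.find` returns -1 and the offsets need separate handling.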