microsoft · natoverse · Dec 16, 2025 · Dec 16, 2025 · Dec 17, 2025 · Dec 17, 2025
@@ -100,12 +100,11 @@ These settings configure how we parse documents into text chunks. This is necess
 
 #### Fields
 
+- `strategy` **str**[tokens|sentence] - How to chunk the text.
 - `size` **int** - The max chunk size in tokens.
 - `overlap` **int** - The chunk overlap in tokens.
-- `strategy` **str**[tokens|sentences] - How to chunk the text. 
 - `encoding_model` **str** - The text encoding model to use for splitting on token boundaries.
 - `prepend_metadata` **bool** - Determines if metadata values should be added at the beginning of each chunk. Default=`False`.
-- `chunk_size_includes_metadata` **bool** - Specifies whether the chunk size calculation should include metadata tokens. Default=`False`.
 
 ## Outputs and Storage
 

@@ -82,10 +82,9 @@ As described above, when documents are imported you can specify a list of `metad
 
 ### Chunking Config
 
-Next, the `chunks` block needs to instruct the chunker how to handle this metadata when creating text units. By default, it is ignored. We have two settings to include it:
+Next, the `chunks` block needs to instruct the chunker how to handle this metadata when creating text units. By default, it is ignored. We have the following setting to include it:
 
 - `prepend_metadata`. This instructs the importer to copy the contents of the `metadata` column for each row into the start of every single text chunk. This metadata is copied as key: value pairs on new lines.
-- `chunk_size_includes_metadata`: This tells the chunker how to compute the chunk size when metadata is included. By default, we create the text units using your specified `chunk_size` *and then* prepend the metadata. This means that the final text unit lengths may be longer than your configured `chunk_size`, and it will vary based on the length of the metadata for each document. When this setting is `True`, we will compute the raw text using the remainder after measuring the metadata length so that the resulting text units always comply with your configured `chunk_size`.
 
 ### Examples
 
@@ -124,7 +123,6 @@ chunks:
     size: 100
     overlap: 0
     prepend_metadata: true
-    chunk_size_includes_metadata: false
 ```
 
 Documents DataFrame
@@ -162,54 +160,6 @@ US to lift most federal COVID-19 vaccine mandates,WASHINGTON (AP) The Biden admi
 
 NY lawmakers begin debating budget 1 month after due date,ALBANY, N.Y. (AP) New York lawmakers began voting Monday on a $229 billion state budget due a month ago that would raise the minimum wage, crack down on illicit pot shops and ban gas stoves and furnaces in new buildings. Negotiations among Gov. Kathy Hochul and her fellow Democrats in control of the Legislature dragged on past the April 1 budget deadline, largely because of disagreements over changes to the bail law and other policy proposals included in the spending plan. Floor debates on some budget bills began Monday. State Senate Majority Leader Andrea Stewart-Cousins said she expected voting to be wrapped up Tuesday for a budget she said contains "significant wins" for New Yorkers. "I would have liked to have done this sooner. I think we would all agree to that," Cousins told reporters before voting began. "This has been a very policy-laden budget and a lot of the policies had to parsed through." Hochul was able to push through a change to the bail law that will eliminate the standard that requires judges to prescribe the "least restrictive" means to ensure defendants return to court. Hochul said judges needed the extra discretion. Some liberal lawmakers argued that it would undercut the sweeping bail reforms approved in 2019 and result in more people with low incomes and people of color in pretrial detention. Here are some other policy provisions that will be included in the budget, according to state officials. The minimum wage would be raised to $17 in New York City and some of its suburbs and $16 in the rest of the state by 2026. That's up from $15 in the city and $14.20 upstate.
 
---
-
-settings.yaml
-
-```yaml
-input:
-    file_type: csv
-    title_column: headline
-    text_column: article
-    metadata: [headline]
-
-chunks:
-    size: 50
-    overlap: 5
-    prepend_metadata: true
-    chunk_size_includes_metadata: true
-```
-
-Documents DataFrame
-
-| id                    | title                                                     | text                     | creation_date                 | metadata                                                                    |
-| --------------------- | --------------------------------------------------------- | ------------------------ | ----------------------------- | --------------------------------------------------------------------------- |
-| (generated from text) | US to lift most federal COVID-19 vaccine mandates         | (article column content) | (create date of articles.csv) | { "headline": "US to lift most federal COVID-19 vaccine mandates" }         |
-| (generated from text) | NY lawmakers begin debating budget 1 month after due date | (article column content) | (create date of articles.csv) | { "headline": "NY lawmakers begin debating budget 1 month after due date" } |
-
-Raw Text Chunks
-
-| content | length  |
-| ------- | ------: |
-| title: US to lift most federal COVID-19 vaccine mandates<br>WASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday. Vaccine requirements for federal workers and federal contractors, | 50 |
-| title: US to lift most federal COVID-19 vaccine mandates<br>federal workers and federal contractors as well as foreign air travelers to the U.S., will end May 11. The government is also beginning the process of lifting shot requirements for Head Start educators, healthcare workers, and noncitizens at U.S. land borders. | 50 |
-| title: US to lift most federal COVID-19 vaccine mandates<br>noncitizens at U.S. land borders. The requirements are among the last vestiges of some of the more coercive measures taken by the federal government to promote vaccination as the deadly virus raged, and their end marks the latest display of how | 50 |
-| title: US to lift most federal COVID-19 vaccine mandates<br>the latest display of how  President Joe Biden's administration is moving to treat COVID-19 as a routine, endemic illness. "While I believe that these vaccine mandates had a tremendous beneficial impact, we are now at a point where we think that | 50 |
-| title: US to lift most federal COVID-19 vaccine mandates<br>point where we think that it makes a lot of sense to pull these requirements down," White House COVID-19 coordinator Dr. Ashish Jha told The Associated Press on Monday. | 38 |
-| title: NY lawmakers begin debating budget 1 month after due date<br>ALBANY, N.Y. (AP) New York lawmakers began voting Monday on a $229 billion state budget due a month ago that would raise the minimum wage, crack down on illicit pot shops and ban gas stoves and furnaces in new | 50 |
-| title: NY lawmakers begin debating budget 1 month after due date<br>stoves and furnaces in new buildings. Negotiations among Gov. Kathy Hochul and her fellow Democrats in control of the Legislature dragged on past the April 1 budget deadline, largely because of disagreements over changes to the bail law and | 50 |
-| title: NY lawmakers begin debating budget 1 month after due date<br>to the bail law and other policy proposals included in the spending plan. Floor debates on some budget bills began Monday. State Senate Majority Leader Andrea Stewart-Cousins said she expected voting to be wrapped up Tuesday for a budget | 50 |
-|title: NY lawmakers begin debating budget 1 month after due date<br>up Tuesday for a budget she said contains "significant wins" for New Yorkers. "I would have liked to have done this sooner. I think we would all agree to that," Cousins told reporters before voting began. "This has been | 50 |
-| title: NY lawmakers begin debating budget 1 month after due date<br>voting began. "This has been a very policy-laden budget and a lot of the policies had to parsed through." Hochul was able to push through a change to the bail law that will eliminate the standard that requires judges | 50 |
-| title: NY lawmakers begin debating budget 1 month after due date<br>the standard that requires judges to prescribe the "least restrictive" means to ensure defendants return to court. Hochul said judges needed the extra discretion. Some liberal lawmakers argued that it would undercut the sweeping bail reforms approved in 2019 | 50 |
-| title: NY lawmakers begin debating budget 1 month after due date<br>bail reforms approved in 2019 and result in more people with low incomes and people of color in pretrial detention. Here are some other policy provisions that will be included in the budget, according to state officials. The minimum | 50 |
-| title: NY lawmakers begin debating budget 1 month after due date<br>to state officials. The minimum  wage would be raised to $17 in be raised to $17 in New York City and some of its suburbs and $16 in the rest of the state by 2026. That's up from $15 | 50 |
-| title: NY lawmakers begin debating budget 1 month after due date<br>2026. That's up from $15 in the city and $14.20 upstate. | 22 |
-
-
-In this example we can see that the two input documents were parsed into fourteen output text chunks. The title (headline) of each document is prepended and included in the computed chunk size, so each chunk matches the configured chunk size (except the last one for each document). We've also configured some overlap in these text chunks, so the last five tokens are shared. Why would you use overlap in your text chunks? Consider that when you are splitting documents based on tokens, it is highly likely that sentences or even related concepts will be split into separate chunks. Each text chunk is processed separately by the language model, so this may result in incomplete "ideas" at the boundaries of the chunk. Overlap ensures that these split concepts are fully contained in at least one of the chunks.
-
-
 #### JSON files
 
 This final example uses a JSON file for each of the same two articles. In this example we'll set the object fields to read, but we will not add metadata to the text chunks.

@@ -0,0 +1,32 @@
+# GraphRAG Chunking
+
+This package contains a collection of text chunkers, a core config model, and a factory for acquiring instances.
+
+## Examples
+
+Basic sentence chunking with nltk
+```python
+from graphrag_chunking.sentence_chunker import SentenceChunker
+
+chunker = SentenceChunker()
+chunks = chunker.chunk("This is a test. Another sentence.")
+print([c.text for c in chunks]) # ["This is a test.", "Another sentence."]
+```
+
+Token chunking
+```python
+import tiktoken
+from graphrag_chunking.token_chunker import TokenChunker
+
+tokenizer = tiktoken.get_encoding("o200k_base")
+chunker = TokenChunker(size=3, overlap=0, encode=tokenizer.encode, decode=tokenizer.decode)
+chunks = chunker.chunk("This is a random test fragment of some text")
+print([c.text for c in chunks]) # ["This is a", " random test fragment", " of some text"]
+```
+
+Using the factory via helper util
+```python
+tokenizer = tiktoken.get_encoding("o200k_base")
+config = ChunkingConfig(
+    type="tokens",
+    size=3,
+    overlap=0
+)
+chunker = create_chunker(config, tokenizer.encode, tokenizer.decode)
+...
+```
@@ -1,2 +1,4 @@
 # Copyright (c) 2024 Microsoft Corporation.
 # Licensed under the MIT License
+
+"""System-level chunking package."""
@@ -0,0 +1,19 @@
+# Copyright (c) 2024 Microsoft Corporation.
+# Licensed under the MIT License
+
+"""A module containing 'add_metadata' function."""
+
+
+def add_metadata(
+    text: str,
+    metadata: dict,
+    delimiter: str = ": ",
+    line_delimiter: str = "\n",
+    append: bool = False,
+) -> str:
+    """Add metadata to the given text, prepending by default. This utility writes the dict as rows of key/value pairs."""
+    metadata_str = (
+        line_delimiter.join(f"{k}{delimiter}{v}" for k, v in metadata.items())
+        + line_delimiter
+    )
+    return text + metadata_str if append else metadata_str + text
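
Reviewer note: a minimal usage sketch of `add_metadata` as defined in this hunk. The function body is copied from the diff so the snippet runs on its own; the sample `meta` dict and `chunk` text are made up for illustration and are not taken from the repository's tests.

```python
def add_metadata(
    text: str,
    metadata: dict,
    delimiter: str = ": ",
    line_delimiter: str = "\n",
    append: bool = False,
) -> str:
    """Copy of the helper from this diff, repeated so the sketch is self-contained."""
    metadata_str = (
        line_delimiter.join(f"{k}{delimiter}{v}" for k, v in metadata.items())
        + line_delimiter
    )
    return text + metadata_str if append else metadata_str + text


# Illustrative inputs (assumptions, not repository fixtures).
meta = {"headline": "US to lift most federal COVID-19 vaccine mandates", "source": "AP"}
chunk = "WASHINGTON (AP) The Biden administration will end ..."

prepended = add_metadata(chunk, meta)
appended = add_metadata(chunk, meta, append=True)

# Prepending yields one "key: value" row per entry, then the chunk text.
print(prepended)
# With append=True no line delimiter is inserted between the text and the
# first metadata row, so callers may want the text to end with one.
print(repr(appended))
```

It may be worth deciding whether the `append=True` path should insert a `line_delimiter` before the metadata rows, since the text and the first key currently run together.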
@@ -0,0 +1,17 @@
+# Copyright (c) 2024 Microsoft Corporation.
+# Licensed under the MIT License
+
+"""The ChunkResult dataclass."""
+
+from dataclasses import dataclass
+
+
+@dataclass
+class ChunkResult:
+    """Result of chunking a document."""
+
+    text: str
+    index: int
+    start_char: int
+    end_char: int
+    token_count: int | None = None
@@ -0,0 +1,13 @@
+# Copyright (c) 2024 Microsoft Corporation.
+# Licensed under the MIT License
+
+"""Chunk strategy type enumeration."""
+
+from enum import StrEnum
+
+
+class ChunkerType(StrEnum):
+    """Enumeration of the built-in chunker types."""
+
+    Tokens = "tokens"
+    Sentence = "sentence"
@@ -0,0 +1,21 @@
+# Copyright (c) 2024 Microsoft Corporation.
+# Licensed under the MIT License
+
+"""A module containing the 'Chunker' class."""
+
+from abc import ABC, abstractmethod
+from typing import Any
+
+from graphrag_chunking.chunk_result import ChunkResult
+
+
+class Chunker(ABC):
+    """Abstract base class for document chunkers."""
+
+    @abstractmethod
+    def __init__(self, **kwargs: Any) -> None:
+        """Create a chunker instance."""
+
+    @abstractmethod
+    def chunk(self, text: str) -> list[ChunkResult]:
+        """Split the text into chunks and return one ChunkResult per chunk."""
@@ -0,0 +1,77 @@
+# Copyright (c) 2024 Microsoft Corporation.
+# Licensed under the MIT License
+
+"""A module containing 'ChunkerFactory', 'register_chunker', and 'create_chunker'."""
+
+from collections.abc import Callable
+
+from graphrag_common.factory.factory import Factory, ServiceScope
+
+from graphrag_chunking.chunk_strategy_type import ChunkerType
+from graphrag_chunking.chunker import Chunker
+from graphrag_chunking.chunking_config import ChunkingConfig
+
+
+class ChunkerFactory(Factory[Chunker]):
+    """Factory for creating Chunker instances."""
+
+
+chunker_factory = ChunkerFactory()
+
+
+def register_chunker(
+    chunker_type: str,
+    chunker_initializer: Callable[..., Chunker],
+    scope: ServiceScope = "transient",
+) -> None:
+    """Register a custom chunker implementation.
+
+    Args
+    ----
+        - chunker_type: str
+            The chunker id to register.
+        - chunker_initializer: Callable[..., Chunker]
+            The chunker initializer to register.
+        - scope: ServiceScope
+            The service scope for created instances. Defaults to "transient".
+    """
+    chunker_factory.register(chunker_type, chunker_initializer, scope)
+
+
+def create_chunker(
+    config: ChunkingConfig,
+    encode: Callable[[str], list[int]] | None = None,
+    decode: Callable[[list[int]], str] | None = None,
+) -> Chunker:
+    """Create a chunker implementation based on the given configuration.
+
+    Args
+    ----
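+        - config: ChunkingConfig
+            The chunker configuration to use.
+        - encode: Callable[[str], list[int]] | None
+            Optional tokenizer encode function, passed through to the chunker.
+        - decode: Callable[[list[int]], str] | None
+            Optional tokenizer decode function, passed through to the chunker.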
+
+    Returns
+    -------
+        Chunker
+            The created chunker implementation.
+    """
+    config_model = config.model_dump()
+    if encode is not None:
+        config_model["encode"] = encode
+    if decode is not None:
+        config_model["decode"] = decode
+    chunker_strategy = config.type
+
+    if chunker_strategy not in chunker_factory:
+        match chunker_strategy:
+            case ChunkerType.Tokens:
+                from graphrag_chunking.token_chunker import TokenChunker
+
+                register_chunker(ChunkerType.Tokens, TokenChunker)
+            case ChunkerType.Sentence:
+                from graphrag_chunking.sentence_chunker import SentenceChunker
+
+                register_chunker(ChunkerType.Sentence, SentenceChunker)
+            case _:
+                msg = f"ChunkingConfig.type '{chunker_strategy}' is not registered in the ChunkerFactory. Registered types: {', '.join(chunker_factory.keys())}."
+                raise ValueError(msg)
+
+    return chunker_factory.create(chunker_strategy, init_args=config_model)
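
Reviewer note: a rough sketch of the registration pattern this module relies on, namely that the dumped config is passed to the chunker initializer as keyword arguments. `MiniFactory` is a hypothetical stand-in and does not reproduce the real `graphrag_common` `Factory` API; `UpperChunker` and the `"upper"` id are likewise made up.

```python
from collections.abc import Callable
from typing import Any


class MiniFactory:
    """Hypothetical stand-in for graphrag_common's Factory (not its real API)."""

    def __init__(self) -> None:
        self._initializers: dict[str, Callable[..., Any]] = {}

    def __contains__(self, key: str) -> bool:
        return key in self._initializers

    def register(self, key: str, initializer: Callable[..., Any]) -> None:
        self._initializers[key] = initializer

    def create(self, key: str, init_args: dict[str, Any]) -> Any:
        # The whole config dict is splatted into the initializer.
        return self._initializers[key](**init_args)


class UpperChunker:
    """Made-up custom chunker that accepts and ignores unknown kwargs."""

    def __init__(self, size: int = 10, **kwargs: Any) -> None:
        self.size = size

    def chunk(self, text: str) -> list[str]:
        upper = text.upper()
        return [upper[i : i + self.size] for i in range(0, len(upper), self.size)]


factory = MiniFactory()
factory.register("upper", UpperChunker)

# Mirrors create_chunker: the dumped config becomes keyword arguments.
config_model = {"type": "upper", "size": 4, "overlap": 0}
chunker = factory.create(config_model["type"], init_args=config_model)
print(chunker.chunk("abcdefghij"))
```

Under this assumption, a custom chunker needs a `**kwargs` catch-all (as `SentenceChunker` has) so that config fields it does not use, such as `type` or `overlap`, do not raise a `TypeError`.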
@@ -0,0 +1,36 @@
+# Copyright (c) 2024 Microsoft Corporation.
+# Licensed under the MIT License
+
+"""Parameterization settings for the default configuration."""
+
+from pydantic import BaseModel, ConfigDict, Field
+
+from graphrag_chunking.chunk_strategy_type import ChunkerType
+
+
+class ChunkingConfig(BaseModel):
+    """Configuration section for chunking."""
+
+    model_config = ConfigDict(extra="allow")
+    """Allow extra fields to support custom chunker implementations."""
+
+    type: str = Field(
+        description="The chunking type to use.",
+        default=ChunkerType.Tokens,
+    )
+    encoding_model: str | None = Field(
+        description="The encoding model to use.",
+        default=None,
+    )
+    size: int = Field(
+        description="The chunk size to use.",
+        default=1200,
+    )
+    overlap: int = Field(
+        description="The chunk overlap to use.",
+        default=100,
+    )
+    prepend_metadata: bool = Field(
+        description="Prepend metadata into each chunk.",
+        default=False,
+    )
@@ -0,0 +1,30 @@
+# Copyright (c) 2024 Microsoft Corporation.
+# Licensed under the MIT License
+
+"""A module containing 'create_chunk_results' function."""
+
+from collections.abc import Callable
+
+from graphrag_chunking.chunk_result import ChunkResult
+
+
+def create_chunk_results(
+    chunks: list[str],
+    encode: Callable[[str], list[int]] | None = None,
+) -> list[ChunkResult]:
+    """Create chunk results from a list of text chunks. The index assignments are 0-based and assume chunks were not stripped relative to the source text."""
+    results = []
+    start_char = 0
+    for index, chunk in enumerate(chunks):
+        end_char = start_char + len(chunk) - 1  # 0-based indices
+        result = ChunkResult(
+            text=chunk,
+            index=index,
+            start_char=start_char,
+            end_char=end_char,
+        )
+        if encode:
+            result.token_count = len(encode(result.text))
+        results.append(result)
+        start_char = end_char + 1
+    return results
@@ -0,0 +1,44 @@
+# Copyright (c) 2024 Microsoft Corporation.
+# Licensed under the MIT License
+
+"""A module containing 'SentenceChunker' class."""
+
+from collections.abc import Callable
+from typing import Any
+
+import nltk
+
+from graphrag_chunking.bootstrap_nltk import bootstrap
+from graphrag_chunking.chunk_result import ChunkResult
+from graphrag_chunking.chunker import Chunker
+from graphrag_chunking.create_chunk_results import create_chunk_results
+
+
+class SentenceChunker(Chunker):
+    """A chunker that splits text into sentence-based chunks."""
+
+    def __init__(
+        self, encode: Callable[[str], list[int]] | None = None, **kwargs: Any
+    ) -> None:
+        """Create a sentence chunker instance."""
+        self._encode = encode
+        bootstrap()
+
+    def chunk(self, text: str) -> list[ChunkResult]:
+        """Chunk the text into sentence-based chunks."""
+        sentences = nltk.sent_tokenize(text.strip())
+        results = create_chunk_results(sentences, encode=self._encode)
+        # nltk sentence tokenizer may trim whitespace, so we need to adjust start/end chars
+        for index, result in enumerate(results):
+            txt = result.text
+            start = result.start_char
+            actual_start = text.find(txt, start)
+            delta = actual_start - start
+            if delta > 0:
+                result.start_char += delta
+                result.end_char += delta
+                # bump the next to keep the start check from falling too far behind
+                if index < len(results) - 1:
+                    results[index + 1].start_char += delta
+                    results[index + 1].end_char += delta
+        return results
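
Reviewer note: a sketch of the offset re-alignment idea in `SentenceChunker.chunk`, written as a cursor-based variant rather than the neighbour-patching loop in this diff. It is not the PR's implementation. `realign` is a hypothetical helper, and the `sentences` list stands in for the output of `nltk.sent_tokenize` so the snippet runs without NLTK.

```python
def realign(text: str, sentences: list[str]) -> list[tuple[int, int]]:
    """Return inclusive (start, end) offsets of each sentence within text.

    A running cursor carries the accumulated whitespace drift forward, so
    every search starts right after the previous sentence.
    """
    spans = []
    cursor = 0
    for sentence in sentences:
        start = text.find(sentence, cursor)
        if start == -1:  # sentence text was altered by the tokenizer
            start = cursor
        end = start + len(sentence) - 1
        spans.append((start, end))
        cursor = end + 1
    return spans


# Whitespace between sentences is dropped by the tokenizer, which is what
# makes the naive contiguous offsets drift.
text = "This is a test.  Another sentence.\nAnd a third."
sentences = ["This is a test.", "Another sentence.", "And a third."]
spans = realign(text, sentences)
print(spans)
```

One possible advantage of a single cursor is that drift accumulates across all later sentences, whereas the loop in the diff only bumps the immediately following result and relies on `find` to catch up from an earlier start position.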