
Commit 275dcbf

docs(core): add clarity to base token counting methods (#33958)
Wasn't immediately obvious that `get_num_tokens_from_messages` adds additional prefixes to represent user roles in conversation, which adds to the overall token count.

```python
from langchain_google_genai import GoogleGenerativeAI

llm = GoogleGenerativeAI(model="gemini-2.5-flash")
num_tokens = llm.get_num_tokens("Hello, world!")
print(f"Number of tokens: {num_tokens}")
# Number of tokens: 4
```

```python
from langchain.messages import HumanMessage

messages = [HumanMessage(content="Hello, world!")]
num_tokens = llm.get_num_tokens_from_messages(messages)
print(f"Number of tokens: {num_tokens}")
# Number of tokens: 6
```
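The gap between the two counts comes from the role prefix added when messages are flattened to text before counting. A minimal sketch of that effect using `langchain_core`'s `get_buffer_string` helper (the helper is real; that the base counting path flattens messages this way is the behavior the commit documents):

```python
from langchain_core.messages import HumanMessage, get_buffer_string

# The base token-counting path flattens messages into a single string,
# prefixing each message with its role before tokenizing.
messages = [HumanMessage(content="Hello, world!")]
print(get_buffer_string(messages))
# Human: Hello, world!   <- the "Human: " prefix accounts for the extra tokens
```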
1 parent 9f87b27 commit 275dcbf

File tree

  • libs/core/langchain_core/language_models/base.py

1 file changed: +13 −2 lines changed


libs/core/langchain_core/language_models/base.py

Lines changed: 13 additions & 2 deletions
```diff
@@ -299,6 +299,9 @@ def get_num_tokens(self, text: str) -> int:
 
         Useful for checking if an input fits in a model's context window.
 
+        This should be overridden by model-specific implementations to provide accurate
+        token counts via model-specific tokenizers.
+
         Args:
             text: The string input to tokenize.
 
@@ -317,9 +320,17 @@ def get_num_tokens_from_messages(
 
         Useful for checking if an input fits in a model's context window.
 
+        This should be overridden by model-specific implementations to provide accurate
+        token counts via model-specific tokenizers.
+
         !!! note
-            The base implementation of `get_num_tokens_from_messages` ignores tool
-            schemas.
+
+            * The base implementation of `get_num_tokens_from_messages` ignores tool
+              schemas.
+            * The base implementation of `get_num_tokens_from_messages` adds additional
+              prefixes to messages to represent user roles, which will add to the
+              overall token count. Model-specific implementations may choose to
+              handle this differently.
 
         Args:
             messages: The message inputs to tokenize.
```
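For illustration, a model-specific override of the kind the new docstring text recommends might delegate to a real tokenizer. This is a hypothetical sketch, not code from the commit: the standalone helper and the `cl100k_base` encoding choice are assumptions.

```python
# Hypothetical sketch of a model-specific counter; the helper name and
# the cl100k_base encoding are illustrative assumptions, not part of
# the commit above.
import tiktoken
from langchain_core.messages import BaseMessage, get_buffer_string

def count_tokens_with_tiktoken(messages: list[BaseMessage]) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    # Count the same role-prefixed text the base implementation counts,
    # but with an exact tokenizer instead of the default approximation.
    return len(enc.encode(get_buffer_string(messages)))
```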

0 commit comments