Distributed tokenization module for data sets using any Hugging Face compatible tokenizer.
Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.
The data tokenization transform operates by converting a (non-empty) input table into an output table using a pre-trained tokenizer. The input table is required to have a minimum of two columns, named `document_id` and `contents` by default. However, alternate column names can be specified using `--tkn_doc_id_column` for the document id and `--tkn_doc_content_column` for the document contents. The values in the `document_id` column must be unique across the dataset, while the `contents` column stores the corresponding document content. To run the example demonstrations in this directory, a machine with 64GiB of RAM is recommended.
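As an illustration, the following sketch builds a tiny input table with the default column names and writes it to Parquet; the file name and column values are hypothetical and only show the expected shape of the input.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Minimal input table: one row per document, with unique document ids
# and the raw text in `contents` (the default column names).
table = pa.table(
    {
        "document_id": ["doc-001", "doc-002"],
        "contents": ["First document text.", "Second document text."],
    }
)

# Hypothetical file name; the transform reads Parquet input from wherever
# the launcher's input location points.
pq.write_table(table, "sample_input.parquet")
```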
To specify a pre-trained tokenizer, use the `--tkn_tokenizer` parameter. This parameter accepts the name of a tokenizer ready for download from Hugging Face, such as `hf-internal-testing/llama-tokenizer`, `bigcode/starcoder`, or any other tokenizer compatible with the Hugging Face AutoTokenizer library. Additionally, you can use the `--tkn_tokenizer_args` parameter to pass extra arguments specific to the chosen tokenizer.
For instance, when loading a Hugging Face tokenizer like `bigcode/starcoder`, which requires an access token, you can specify `use_auth_token=<your token>` in `--tkn_tokenizer_args`.
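For reference, here is a minimal sketch of how a tokenizer name and its extra arguments can be combined using the Hugging Face AutoTokenizer API; the exact wiring inside the transform may differ, and the token and cache directory values are placeholders.

```python
from transformers import AutoTokenizer

# Values as they might be supplied via --tkn_tokenizer and --tkn_tokenizer_args.
tokenizer_name = "bigcode/starcoder"
tokenizer_args = {"cache_dir": "/tmp/hf", "use_auth_token": "<your token>"}

# AutoTokenizer downloads (or loads from cache) the pre-trained tokenizer;
# the extra keyword arguments are forwarded to from_pretrained().
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, **tokenizer_args)

token_ids = tokenizer.encode("def hello_world():\n    print('hello')")
print(len(token_ids))
```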
The tokenization transform uses the specified tokenizer to tokenize each row of the input table (assuming each row represents a document) and saves the result to a corresponding row in the output table. The output table generally consists of four columns: `tokens`, `document_id`, `document_length`, and `token_count`. The `tokens` column stores the sequence of token IDs generated by the tokenizer during the document tokenization process. The `document_id` column (or the designated name specified in `--tkn_doc_id_column`) contains the document ID, while `document_length` and `token_count` respectively record the length of the document and the total count of generated tokens.
During tokenization, the tokenizer will disregard empty documents (rows) in the input table, as well as documents that yield no tokens or fail during tokenization. The count of such documents will be stored in the `num_empty_rows` field of the `metadata` file.
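The sketch below illustrates this per-row behaviour under simple assumptions: it tokenizes each non-empty document, builds the four output columns, and counts the skipped rows. The actual transform's column handling and error handling may differ.

```python
import pyarrow as pa
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

def tokenize_table(input_table: pa.Table) -> tuple[pa.Table, dict]:
    """Tokenize each row (document) and build the output columns."""
    doc_ids, token_lists, doc_lengths, token_counts = [], [], [], []
    num_empty_rows = 0
    for doc_id, text in zip(
        input_table["document_id"].to_pylist(),
        input_table["contents"].to_pylist(),
    ):
        try:
            token_ids = tokenizer.encode(text) if text else []
        except Exception:
            token_ids = []
        if not token_ids:
            # Empty documents, documents yielding no tokens, and documents
            # that fail tokenization are skipped and counted.
            num_empty_rows += 1
            continue
        doc_ids.append(doc_id)
        token_lists.append(token_ids)
        doc_lengths.append(len(text))
        token_counts.append(len(token_ids))
    output = pa.table(
        {
            "tokens": token_lists,
            "document_id": doc_ids,
            "document_length": doc_lengths,
            "token_count": token_counts,
        }
    )
    return output, {"num_empty_rows": num_empty_rows}
```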
In certain cases, the tokenization process of some tokenizers may be sluggish, particularly when handling lengthy documents containing millions of characters. To address this, you can use the `--tkn_chunk_size` parameter to define the length of the chunks to tokenize at a given time. For English text (`en`), a chunk size of 20,000 characters is recommended, roughly equivalent to 15 pages of text. The tokenizer will then tokenize each chunk separately and combine the resulting token IDs. By default, the value of `--tkn_chunk_size` is `0`, indicating that each document is tokenized as a whole, regardless of its length.
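Below is a minimal sketch of one plausible chunking strategy consistent with the description above: chunks of roughly `chunk_size` characters, extended to the next whitespace so a word is never split, with the per-chunk token IDs concatenated. The transform's actual splitting and special-token handling may differ.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

def tokenize_in_chunks(text: str, chunk_size: int) -> list[int]:
    """Tokenize text in chunks of ~chunk_size characters, rounded to word boundaries."""
    if chunk_size <= 0:
        # Default behaviour: tokenize the whole document at once.
        return tokenizer.encode(text, add_special_tokens=False)
    token_ids: list[int] = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        # Extend the chunk to the next whitespace so a word is never split.
        while end < len(text) and not text[end].isspace():
            end += 1
        # Special-token handling is simplified for this sketch.
        token_ids.extend(tokenizer.encode(text[start:end], add_special_tokens=False))
        start = end
    return token_ids

# For English text, a chunk size around 20,000 characters is suggested.
ids = tokenize_in_chunks("some very long document ... " * 1000, chunk_size=20_000)
```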
The following command line arguments are available in addition to the options provided by the python launcher.
* `--tkn_tokenizer TKN_TOKENIZER` - Tokenizer used for tokenization. It can also be a path to a pre-trained tokenizer. By default, `hf-internal-testing/llama-tokenizer` from Hugging Face is used.
* `--tkn_tokenizer_args TKN_TOKENIZER_ARGS` - Arguments for the tokenizer. For example, `cache_dir=/tmp/hf,use_auth_token=Your_HF_authentication_token` could be arguments for the tokenizer `bigcode/starcoder` from Hugging Face.
* `--tkn_doc_id_column TKN_DOC_ID_COLUMN` - Column that contains the document id; its values should be unique across the dataset.
* `--tkn_doc_content_column TKN_DOC_CONTENT_COLUMN` - Column that contains the document content.
* `--tkn_text_lang TKN_TEXT_LANG` - Language of the text content, used for better text splitting if needed.
* `--tkn_chunk_size TKN_CHUNK_SIZE` - Specify a value > 0 to tokenize each row/document in chunks of that many characters (rounded to word boundaries).
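As a hypothetical illustration of how these flags fit together, the snippet below invokes the transform's Python launcher script (referenced in the next section) with a tokenizer, the default column names, and a chunk size. The launcher's own input/output options are omitted, and all values shown are placeholders.

```python
import subprocess

# Hypothetical invocation of src/tokenization_transform_python.py;
# the launcher's input/output options are not shown here.
subprocess.run(
    [
        "python",
        "src/tokenization_transform_python.py",
        "--tkn_tokenizer", "hf-internal-testing/llama-tokenizer",
        "--tkn_doc_id_column", "document_id",
        "--tkn_doc_content_column", "contents",
        "--tkn_text_lang", "en",
        "--tkn_chunk_size", "20000",
        # plus the launcher's input/output options
    ],
    check=True,
)
```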
To run the samples, use the following `make` targets:

* `run-cli-sample` - runs src/tokenization_transform_python.py using command line args
* `run-local-sample` - runs src/tokenization_local_python.py
These targets will activate the virtual environment and set up any configuration needed.
Use the `-n` option of `make` to see the detail of what is done to run the sample.
For example,
`make run-cli-sample`
...
Then run `ls output` to see the results of the transform.
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.