tool: add conversion of text/parquet to custom format #14622
base: master
Conversation
(force-pushed from 50beb4f to 92bbf15)
f... Windows moment... okay, maybe I'll try to fix it over the weekend.
@JohannesGaessler, Windows is fixed!
I would suggest changing the design (a rough sketch follows the list):
- Internally store a `gguf_context` in the abstract file object, add subclasses for GGUF/plain text/Parquet files.
- For each subclass, implement a method to populate the GGUF metadata + tensors.
- Implement a method to write the GGUF context to disk.
- Add a method for retrieving a sequence as a `ggml_tensor`.
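A minimal sketch of how those pieces might fit together; the class and method names here are hypothetical, only `gguf_write_to_file` is the existing gguf API:

```cpp
#include "ggml.h"
#include "gguf.h"

// Hypothetical abstract file object; all names are illustrative, not from the PR.
struct llama_dataset_file {
    virtual ~llama_dataset_file() = default;

    // Populate m_ctx with the GGUF metadata + token tensors of the source file.
    virtual bool populate(const char * path) = 0;

    // Retrieve sequence i as a ggml_tensor.
    virtual struct ggml_tensor * sequence(uint64_t i) = 0;

    // Write the populated GGUF context to disk (shared by all subclasses).
    bool write_gguf(const char * path) const {
        return gguf_write_to_file(m_ctx, path, /*only_meta =*/ false);
    }

  protected:
    struct gguf_context * m_ctx = nullptr; // shared GGUF representation
};

struct llama_dataset_file_gguf    : llama_dataset_file { /* load ctx directly   */ };
struct llama_dataset_file_text    : llama_dataset_file { /* tokenize, then fill */ };
struct llama_dataset_file_parquet : llama_dataset_file { /* read columns, fill  */ };
```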
The way I imagine it is that llama.cpp implements a GGUF standard specifically for language model training data. The training code would be able to load this GGUF data directly. We can then also use the GGUF specification as an intermediary format for e.g. plain text and Parquet data. We can use the code both to simply read plain text or Parquet and write it as GGUF (for a conversion to a portable format that does not need external dependencies) or we can use the loader directly in the training code to use plain text and Parquet as-is.
Regarding streaming: for GGUF and Parquet, add a method for retrieving the data for a sequence if the `gguf_context` was initialized with `no_alloc=True`. For plain text I think it's fine if streaming is not available. Streaming does not need to be part of this PR; it's fine if support for it is added later.
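For reference, metadata-only loading already works with the existing gguf API; the sketch below assumes a streaming reader would seek to a tensor's data on demand (the lookup of "which tensor holds sequence 0" is an assumption):

```cpp
#include <cstdio>
#include "gguf.h"

int main() {
    // Read only the metadata; tensor data stays on disk.
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context *   ctx    = gguf_init_from_file("train-data.gguf", params);

    // A streaming reader could compute the absolute file offset of a sequence
    // tensor and fseek()/fread() just that range on demand:
    int64_t id     = 0; // e.g. the tensor holding sequence 0 (illustrative)
    size_t  offset = gguf_get_data_offset(ctx) + gguf_get_tensor_offset(ctx, id);
    (void) offset;      // real code would read the sequence data here

    gguf_free(ctx);
    return 0;
}
```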
tools/dataset-converter/dataset-to-gguf/llama-dataset-reader/llama-parquet-data-reader.h
```cpp
// Destructor
llama_parquet_dataset_reader::~llama_parquet_dataset_reader() {
    close();
    m_file_path.clear(); // Clear the stored path only on destruction
}
```
Why is path cleared here?
Yes, it's unnecessary.
tools/dataset-converter/dataset-to-gguf/llama-dataset-reader/llama-text-data-reader.cpp
```cpp
    return false;
}

// Получаем ggml_type напрямую и сравниваем с GGML_TYPE_I32
// (Get the ggml_type directly and compare it with GGML_TYPE_I32)
```
English, please.
```cpp
// Global variables for tests requiring llama_model
static llama_model * g_llama_model = nullptr;
static std::string g_test_model_path = "../../gte-small.Q2_K.gguf"; // Specify the actual path to your model
```
This file does not exist in the llama.cpp repository by default.
I forgot: I'm getting the impression that the code in this PR is at least partially generated by a language model. This is not a problem in and of itself, but I feel like the code would benefit from being more concise, as this will mean less work for me in terms of reviewing and maintenance.
Actually, a sequence of tokens, as is done right now, would probably be better, since this abstracts away the handling of raw text vs. tokens.
The code indeed went through several iterations: initial prototyping leveraged AI to establish functional patterns and core logic (e.g., dataset readers, GGUF interactions).
Co-authored-by: Johannes Gäßler <[email protected]>
Hmm, your suggestion is to replace the usage of my class `llama_gguf_file` with …?
Does this design risk violating the Single Responsibility Principle by combining data-reading logic (e.g., handling GGUF/plain text/Parquet) and GGUF-specific metadata/tensor management within the same class hierarchy? For example, would a new ParquetFile subclass need to handle both Parquet parsing and GGUF metadata population, merging two distinct concerns into one class?
I noticed a potential tension between two ideas here:
- The earlier suggestion to strictly use GGUF for training (to avoid external dependencies and simplify maintenance).
- The current proposal to directly load plain text/Parquet in training code (which reintroduces dependency risks and complexity).

Would it be better to enforce a strict boundary:
- Allow `llama_dataset_reader` (or a new ParquetFile) to support multiple formats only for data conversion (e.g., Parquet → GGUF).
- Require training to always use a dedicated GGUFReader class, which loads data exclusively from GGUF files.

This would:
- Prevent accidental use of non-GGUF formats in training (enforced at compile/runtime).
- Simplify maintenance by isolating GGUF-specific logic.
- Maintain flexibility via pre-conversion tools.

What do you think? Example:
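A hypothetical sketch of such a boundary; none of these names are from the PR:

```cpp
#include <cstdint>

// Conversion side: many input formats, exactly one output format (GGUF).
bool convert_text_to_gguf   (const char * txt_path,     const char * gguf_path);
bool convert_parquet_to_gguf(const char * parquet_path, const char * gguf_path);

// Training side: consumes GGUF only, so non-GGUF formats cannot reach the
// training loop by construction.
class llama_gguf_training_reader {
  public:
    explicit llama_gguf_training_reader(const char * gguf_path);

    uint64_t        n_sequences() const;
    const int32_t * sequence(uint64_t index, int32_t & length) const;
};
```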
Regarding the use of Parquet in training: my goal is to have a single standard for training data files; that way I only need to assert that the code works correctly for that standard. For e.g. Parquet files I only need to assert that the conversion to the standard works correctly. We already have GGUF as a data format that we use for models, so that is the file format that I want to use for the standard. With the current GGUF code it's straightforward to do.

Regarding implementation: I'm generally flexible regarding how exactly the functionality is mapped to classes; personally I don't value strict adherence to object-oriented design in the first place. If I were to implement it myself I would just do a procedural design for loading the training data with something like:

```cpp
struct llama_dataset;

struct llama_dataset * from_gguf   (const char * path);
struct llama_dataset * from_txt    (const char * path);
struct llama_dataset * from_parquet(const char * path);

void to_gguf(struct llama_dataset *, const char * path);

uint64_t        n_sequences    (const struct llama_dataset * dataset);
int32_t         sequence_length(const struct llama_dataset * dataset, uint64_t index);
const int32_t * sequence       (const struct llama_dataset * dataset, uint64_t index);
```

This is not much of a concern right now, but long-term it would be nice to have a C-compatible interface for the dataset management since it would enable the use with e.g. Python bindings. Of course you could still have C++ code with an object-oriented design underneath. The way I would have done it is with a factory method for …
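A possible consumer of the procedural interface above (a sketch; error handling and dataset cleanup are omitted):

```cpp
struct llama_dataset * dataset = from_parquet("train.parquet");

// Optionally persist the portable standard format:
to_gguf(dataset, "train.gguf");

for (uint64_t i = 0; i < n_sequences(dataset); ++i) {
    const int32_t   n      = sequence_length(dataset, i);
    const int32_t * tokens = sequence(dataset, i);
    // ... feed tokens[0..n) into the training code
}
```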
And regarding the implementation for the interface I laid out: if …
@JohannesGaessler Is this union approach acceptable for handling different dataset types (GGUF, Parquet, etc.) within a single `llama_dataset` struct?
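For concreteness, one way such a union could look; the layout and names are assumptions, with `llama_parquet_dataset_reader` being the reader class from this PR:

```cpp
#include "gguf.h"

class llama_parquet_dataset_reader; // from this PR

enum llama_dataset_kind {
    LLAMA_DATASET_GGUF,
    LLAMA_DATASET_PARQUET,
};

struct llama_dataset {
    enum llama_dataset_kind kind; // discriminates the active union member
    union {
        struct gguf_context          * gguf;    // GGUF: native representation
        llama_parquet_dataset_reader * parquet; // Parquet: streaming reader
    } impl;
};
```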
I would ultimately accept an implementation like that since it is an improvement vs. master. But as I said, this is more complex than it needs to be. For the Parquet -> GGUF file conversion tool you will need to populate a …
I agree that ideally I would like to unify everything around `gguf_context`. However, until `gguf_context` is redesigned to flexibly work with other types of data (for example, streaming loading from Parquet without full materialization), the current approach with a union is the simplest and most pragmatic solution. For Parquet data, especially when working with large volumes and batch processing, direct use of `llama_parquet_dataset_reader` via the union is more memory-efficient than trying to fully load into a `gguf_context`. For GGUF data, where `gguf_context` is already the native representation, this of course fits well.
To be clear, what I mean is to always use the …
convert-to-train-gguf utility

This utility is designed to convert text datasets (or pre-tokenized data) into the GGUF format, optimized for training models in `llama.cpp`.

Features:
- **Two-pass processing:** efficiently handles large datasets that do not fit entirely into RAM, performing a first pass to collect metadata and a second pass to write the actual tensor data.
- **Flexible input:** supports reading both raw text (with subsequent tokenization using a provided model) and pre-tokenized data (in the format of space-separated token IDs).
- **Modular architecture:** the code is divided into several classes (`llama_gguf_file`, `llama_gguf_writer`, `llama_dataset_reader`, `llama_text_dataset_reader`, `llama_gguf_converter`, `llama_gguf_reader`) to improve modularity, extensibility, and testability.
- **Preview functionality:** allows you to view metadata and the first few sequences of the generated GGUF file, including optional detokenization.
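As a rough illustration of the two-pass idea using the existing ggml/gguf API; the key and tensor names are hypothetical, and the in-memory `seqs` buffer is a simplification (the tool described above streams the second pass instead of holding all sequences in RAM):

```cpp
#include <cstring>
#include <vector>
#include "ggml.h"
#include "gguf.h"

// Sketch: write pre-tokenized sequences as a training GGUF file.
static void write_train_gguf(const std::vector<std::vector<int32_t>> & seqs, const char * path) {
    struct ggml_init_params ip = { /*mem_size =*/ 64u*1024*1024, /*mem_buffer =*/ nullptr, /*no_alloc =*/ false };
    struct ggml_context * ggml = ggml_init(ip);
    struct gguf_context * ctx  = gguf_init_empty();

    // Pass 1 result: global metadata such as the sequence count (key name is hypothetical).
    gguf_set_val_u32(ctx, "training.sequence.count", (uint32_t) seqs.size());

    // Pass 2: register each token sequence as an I32 tensor.
    for (size_t i = 0; i < seqs.size(); ++i) {
        struct ggml_tensor * t = ggml_new_tensor_1d(ggml, GGML_TYPE_I32, (int64_t) seqs[i].size());
        memcpy(t->data, seqs[i].data(), seqs[i].size()*sizeof(int32_t));
        ggml_format_name(t, "seq.%zu", i); // tensor naming scheme is an assumption
        gguf_add_tensor(ctx, t);
    }

    gguf_write_to_file(ctx, path, /*only_meta =*/ false);
    gguf_free(ctx);
    ggml_free(ggml);
}
```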
@JohannesGaessler