
tool: add conversion of text/parquet to custom format #14622


Open · wants to merge 5 commits into master

Conversation

lexasub (Contributor) commented Jul 10, 2025

convert-to-train-gguf Utility

This utility is designed to convert text datasets (or pre-tokenized data) into the GGUF format, optimized for training models in llama.cpp.

Features

  • Two-pass processing: Efficiently handles large datasets that do not fit entirely into RAM, performing a first pass to collect metadata and a second pass to write the actual tensor data (see the sketch after this list).

  • Flexible input: Supports reading both raw text (with subsequent tokenization using a provided model) and pre-tokenized data (in the format of space-separated token IDs).

  • Modular architecture: The code is divided into several classes (llama_gguf_file, llama_gguf_writer, llama_dataset_reader, llama_text_dataset_reader, llama_gguf_converter, llama_gguf_reader) to improve modularity, extensibility, and testability.

  • Preview functionality: Allows you to view metadata and the first few sequences of the generated GGUF file, including optional detokenization.
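To make the two-pass idea concrete, here is a minimal sketch. The reader/writer types below are hypothetical stand-ins for the PR's llama_dataset_reader and llama_gguf_writer classes, and all method names are assumptions, not the utility's actual API:

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical reader interface: yields one tokenized sequence per call.
struct dataset_reader {
    virtual ~dataset_reader() = default;
    virtual bool open(const std::string & path) = 0;
    virtual bool next_sequence(std::vector<int32_t> & tokens) = 0; // false at EOF
};

// Hypothetical writer: the real utility would build GGUF KV metadata and
// tensor info here, then append tensor data.
struct gguf_writer {
    void write_metadata(const std::vector<int64_t> & lengths)     { (void) lengths; }
    void write_sequence_data(const std::vector<int32_t> & tokens) { (void) tokens;  }
};

// Two-pass conversion: metadata first, tensor data second, so the whole
// dataset never has to fit in RAM at once.
bool convert_two_pass(dataset_reader & reader, const std::string & path, gguf_writer & out) {
    std::vector<int64_t> lengths;
    std::vector<int32_t> tokens;

    // Pass 1: only record per-sequence lengths, discard the tokens.
    if (!reader.open(path)) { return false; }
    while (reader.next_sequence(tokens)) {
        lengths.push_back((int64_t) tokens.size());
    }
    out.write_metadata(lengths);

    // Pass 2: re-open the input and stream the actual token data to disk.
    if (!reader.open(path)) { return false; }
    while (reader.next_sequence(tokens)) {
        out.write_sequence_data(tokens);
    }
    return true;
}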


lexasub commented Jul 10, 2025

f... Windows moment... okay, maybe on the weekend I will try to fix it.

lexasub commented Jul 11, 2025

@JohannesGaessler, Windows is fixed!

JohannesGaessler (Collaborator) left a comment:

I would suggest changing the design:

  • Internally store a gguf_context in the abstract file object, add subclasses for GGUF/plain text/Parquet files.
  • For each subclass, implement a method to populate the GGUF metadata + tensors.
  • Implement a method to write the GGUF context to disk.
  • Add a method for retrieving a sequence as a ggml_tensor.

The way I imagine it is that llama.cpp implements a GGUF standard specifically for language model training data. The training code would be able to load this GGUF data directly. We can then also use the GGUF specification as an intermediary format for e.g. plain text and Parquet data. We can use the code both to simply read plain text or Parquet and write it as GGUF (for a conversion to a portable format that does not need external dependencies) or we can use the loader directly in the training code to use plain text and Parquet as-is.

Regarding streaming: For GGUF and Parquet, add a method for retrieving the data for a sequence if the gguf_context was initialized with no_alloc=True. For plain text I think it's fine if streaming is not available. Streaming does not need to be part of this PR, it's fine if support for it is added later.
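For illustration, a rough sketch of that layout; the class and method names below are hypothetical (only gguf_write_to_file is the real gguf.h call):

#include "gguf.h"

// Abstract dataset file: always carries a gguf_context for metadata/tensors.
struct llama_dataset_file {
    virtual ~llama_dataset_file() = default;

    // Populate m_ctx with GGUF metadata + sequence tensors from the source file.
    virtual bool populate(const char * path) = 0;

    // Write the populated context to disk; shared by all subclasses.
    bool write_gguf(const char * path) {
        return gguf_write_to_file(m_ctx, path, /*only_meta =*/ false);
    }

protected:
    struct gguf_context * m_ctx = nullptr;
};

// One subclass per source format.
struct llama_gguf_dataset_file    : llama_dataset_file { bool populate(const char * path) override; };
struct llama_text_dataset_file    : llama_dataset_file { bool populate(const char * path) override; };
struct llama_parquet_dataset_file : llama_dataset_file { bool populate(const char * path) override; };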

// Destructor
llama_parquet_dataset_reader::~llama_parquet_dataset_reader() {
    close();
    m_file_path.clear(); // Clear the stored path only on destruction
}
Collaborator:

Why is path cleared here?

Contributor (Author):

Yes, it's unnecessary.

return false;
}

// Get the ggml_type directly and compare it with GGML_TYPE_I32
Collaborator:

English, please.


// Global variables for tests requiring llama_model
static llama_model * g_llama_model = nullptr;
static std::string g_test_model_path = "../../gte-small.Q2_K.gguf"; // Specify the actual path to your model
Collaborator:

This file does not exist in the llama.cpp repository by default.

JohannesGaessler (Collaborator) commented:

I forgot: I'm getting the impression that the code in this PR is at least partially generated by a language model. This is not a problem in and of itself, but I feel like the code would benefit from being more concise as this will mean less work for me in terms of reviewing and maintenance.

JohannesGaessler (Collaborator) commented:

> Add a method for retrieving a sequence as a ggml_tensor.

Actually, returning a sequence of tokens, as is done right now, would probably be better, since this abstracts away the handling of raw text vs. tokens.

lexasub commented Jul 11, 2025

> I forgot: I'm getting the impression that the code in this PR is at least partially generated by a language model. This is not a problem in and of itself, but I feel like the code would benefit from being more concise as this will mean less work for me in terms of reviewing and maintenance.

The code indeed went through several iterations:

1. Initial prototyping leveraged AI to establish functional patterns and core logic (e.g., dataset readers, GGUF interactions).
2. Extensive manual refactoring followed to align with llama.cpp's idioms.
3. Architectural refinements to decouple responsibilities (e.g., separating readers from converters).

I prioritized maintainability by eliminating redundancy and adhering to the project's minimalistic C++ style. That said, I welcome suggestions to further simplify or clarify the implementation, especially around interface boundaries or performance-critical paths. Let me know where deeper refactoring would help!

lexasub commented Jul 12, 2025

> Internally store a gguf_context in the abstract file object, add subclasses for GGUF/plain text/Parquet files.

Hmm, your suggestion is to replace the usage of my class llama_gguf_file with the following?

struct gguf_context {
    uint32_t version = GGUF_VERSION;

    std::vector<struct gguf_kv> kv;
    std::vector<struct gguf_tensor_info> info;

    size_t alignment = GGUF_DEFAULT_ALIGNMENT;
    size_t offset    = 0; // offset of `data` from beginning of file
    size_t size      = 0; // size of `data` in bytes

    void * data = nullptr;
};


lexasub commented Jul 12, 2025

> • Internally store a gguf_context in the abstract file object, add subclasses for GGUF/plain text/Parquet files.
> • For each subclass, implement a method to populate the GGUF metadata + tensors.
> • Implement a method to write the GGUF context to disk.

Does this design risk violating the Single Responsibility Principle by combining data-reading logic (e.g., handling GGUF/plain text/Parquet) and GGUF-specific metadata/tensor management within the same class hierarchy? For example, would a new ParquetFile subclass need to handle both Parquet parsing and GGUF metadata population, merging two distinct concerns into one class?

lexasub commented Jul 12, 2025

> or we can use the loader directly in the training code to use plain text and Parquet as-is.

I noticed a potential tension between two ideas here:

  • The earlier suggestion to strictly use GGUF for training (to avoid external dependencies and simplify maintenance).
  • The current proposal to directly load plain text/Parquet in training code (which reintroduces dependency risks and complexity).

Would it be better to enforce a strict boundary?

  • Allow llama_dataset_reader (or a new ParquetFile) to support multiple formats only for data conversion (e.g., Parquet → GGUF).
  • Require training to always use a dedicated GGUFReader class, which loads data exclusively from GGUF files.

This would:

  • Prevent accidental use of non-GGUF formats in training (enforced at compile/runtime).
  • Simplify maintenance by isolating GGUF-specific logic.
  • Maintain flexibility via pre-conversion tools.

What do you think?

Example:

// Training code  
GGUFReader reader;  
if (!reader.open("data.gguf")) { /* ... */ }  

// Conversion code  
ParquetFile *parquet_reader = new ParquetFile(....);  
GGUFConverter converter(parquet_reader, "output.gguf");  
converter.convert();  

JohannesGaessler (Collaborator) commented:

Regarding the use of Parquet in training:

My goal is to have a single standard for training data files - that way I only need to assert that the code works correctly for that standard. For e.g. Parquet files I only need to assert that the conversion to the standard works correctly. We already have GGUF as a data format that we use for models, so that is the file format that I want to use for the standard. With the current GGUF code it's straightforward to do gguf_context <-> .gguf file conversions. For training, the default pipeline would be .gguf file -> gguf_context -> token sequences. With the data conversion tool the pipeline would be Parquet file -> gguf_context -> .gguf file -> gguf_context -> token sequences. But if you just cut out the cyclic conversion in the middle you get Parquet file -> gguf_context -> token sequences. So without much extra effort we can get support for directly using Parquet files in training, and I can still debug issues with a single standard that I am familiar with.

The way that I would have implemented Parquet support would have been a simple Python script for conversion to GGUF, but if you're going to the trouble of implementing it in C++ I think we should consider both use cases ahead of time (I don't mean that this needs to be implemented in this PR).

Regarding implementation:

I'm generally flexible regarding how exactly the functionality is mapped to classes - personally I don't value strict adherence to object-oriented design in the first place. If I were to implement it myself I would just do a procedural design for loading the training data with something like:

struct llama_dataset;

// Builders: construct a dataset from each supported input format.
struct llama_dataset * from_gguf(const char * path);
struct llama_dataset * from_txt(const char * path);
struct llama_dataset * from_parquet(const char * path);

// Serialize a loaded dataset back to a .gguf file.
void to_gguf(struct llama_dataset * dataset, const char * path);

// Accessors over the loaded token sequences.
uint64_t n_sequences(const struct llama_dataset * dataset);
int32_t sequence_length(const struct llama_dataset * dataset, uint64_t index);
const int32_t * sequence(const struct llama_dataset * dataset, uint64_t index);

This is not much of a concern right now, but long-term it would be nice to have a C-compatible interface for the dataset management, since it would enable use with e.g. Python bindings. Of course, you could still have C++ code with an object-oriented design underneath. The way I would have done it is with a factory method for llama_dataset; it's also fine to define external builder objects for each data format.
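One possible shape for that factory, as a sketch over the interface above (the extension-based dispatch rule is an assumption, not something from the thread):

#include <cstring>
#include <string>

struct llama_dataset;
struct llama_dataset * from_gguf(const char * path);
struct llama_dataset * from_txt(const char * path);
struct llama_dataset * from_parquet(const char * path);

// Hypothetical factory entry point: dispatch on the file extension so callers
// (and future Python bindings) only need one C-compatible function.
struct llama_dataset * llama_dataset_load(const char * path) {
    const std::string p(path);
    auto has_ext = [&](const char * ext) {
        const size_t n = std::strlen(ext);
        return p.size() >= n && p.compare(p.size() - n, n, ext) == 0;
    };
    if (has_ext(".gguf"))    { return from_gguf(path); }
    if (has_ext(".parquet")) { return from_parquet(path); }
    return from_txt(path); // default: treat the input as plain text
}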

JohannesGaessler (Collaborator) commented:

And regarding the implementation for the interface I laid out: if llama_dataset internally stores a gguf_context, then from_gguf and to_gguf are just simple wrappers for the interface in gguf.h and only from_txt and from_parquet need complex implementations. Without streaming the underlying tensors can simply be put into a ggml_context. n_sequences, sequence_length, and sequence can then be determined from the tensors stored in the ggml_context. (Since they're constant and ggml_context stores them as a linked list it may make sense to just cache the pointers to the tensors in a vector.)
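A minimal sketch of that caching idea, assuming the dataset owns a ggml_context of sequence tensors; ggml_get_first_tensor/ggml_get_next_tensor are the real ggml.h iteration calls, everything else is hypothetical:

#include <cstdint>
#include <vector>
#include "ggml.h"

struct llama_dataset {
    struct ggml_context * ctx_data = nullptr;     // owns the sequence tensors
    std::vector<struct ggml_tensor *> sequences;  // cached pointers, built once
};

// Walk the ggml_context's internal tensor list once and cache the pointers,
// so n_sequences/sequence_length/sequence become O(1) lookups.
static void llama_dataset_cache_tensors(struct llama_dataset * ds) {
    for (struct ggml_tensor * t = ggml_get_first_tensor(ds->ctx_data);
         t != nullptr;
         t = ggml_get_next_tensor(ds->ctx_data, t)) {
        ds->sequences.push_back(t);
    }
}

uint64_t n_sequences(const struct llama_dataset * ds) {
    return ds->sequences.size();
}

int32_t sequence_length(const struct llama_dataset * ds, uint64_t index) {
    return (int32_t) ds->sequences[index]->ne[0]; // tokens along dim 0
}

const int32_t * sequence(const struct llama_dataset * ds, uint64_t index) {
    return (const int32_t *) ds->sequences[index]->data; // GGML_TYPE_I32 assumed
}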

lexasub commented Jul 12, 2025

@JohannesGaessler Is this union approach acceptable for handling different dataset types (GGUF, Parquet, etc.) within a single llama_dataset struct?

namespace arrow { namespace io { class ReadableFile; } } // forward declaration

struct llama_dataset {
    llama_dataset_type type = LLAMA_DATASET_TYPE_NONE; // current type of the dataset
    std::uint64_t sequence_count         = 0;
    std::uint64_t current_sequence_index = 0;

    // Union to hold data specific to each dataset type
    union {
        struct {
            gguf_context * ctx;
            FILE         * file_handle; // for GGUF, holds the file handle for streaming reads
        } gguf_data;
#ifdef LLAMA_PARQUET
        struct {
            llama_parquet_dataset_reader * parquet_reader; // pointer to the Parquet reader object
            arrow::io::ReadableFile      * input_file;
        } parquet_data;
#endif
        // Add other reader types here if needed (e.g., llama_text_dataset_reader * text_reader;)
    };
};

struct gguf_context * gguf_init_from_file_impl(FILE * file, struct gguf_init_params params); // we use the impl directly, because gguf_init_from_file closes the file

JohannesGaessler (Collaborator) commented:

I would ultimately accept an implementation like that since it is an improvement vs. master. But as I said, this is more complex than it needs to be. For the Parquet -> GGUF file conversion tool you will need to populate a gguf_context with the Parquet metadata anyways so I think it makes more sense to just also use that code for directly loading Parquet data. Long-term, if this PR gets merged like that I will probably refactor the internals of the code once it needs to be extended to accommodate yet another data format.

lexasub commented Jul 13, 2025

I agree that ideally it would be nice to unify everything around gguf_context. However, until gguf_context is redesigned to flexibly work with other kinds of data (for example, streaming reads from Parquet without full materialization), the current approach with a union is the simplest and most pragmatic solution.

For Parquet data, especially when working with large volumes and batch processing, direct use of llama_parquet_dataset_reader via the union is more memory-efficient than trying to fully load everything into a gguf_context. For GGUF data, where gguf_context is already the native representation, this of course fits well.

JohannesGaessler (Collaborator) commented:

To be clear, what I mean is to always use the gguf_context for the metadata. Without streaming, simply load all sequences into the GGUF context. The sequence function I outlined before would simply return the tensor data. With streaming, the sequence function lazily fetches the data for GGUF and Parquet, the concrete implementation depends on the file type.
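As a hedged illustration of the streaming case for GGUF: the gguf.h offset/size getters below are real, but the surrounding function and how it would be wired into sequence() are assumptions.

#include <cstdint>
#include <cstdio>
#include <vector>
#include "gguf.h"

// With no_alloc == true the tensor data stays on disk, so the sequence
// function has to seek and read the requested tensor on demand instead of
// returning a pointer into memory.
static std::vector<int32_t> fetch_sequence_gguf(FILE * f, const struct gguf_context * ctx, int64_t tensor_id) {
    const size_t data_ofs   = gguf_get_data_offset(ctx);              // start of the tensor data section
    const size_t tensor_ofs = gguf_get_tensor_offset(ctx, tensor_id); // offset of this tensor within it
    const size_t n_bytes    = gguf_get_tensor_size(ctx, tensor_id);

    std::vector<int32_t> tokens(n_bytes / sizeof(int32_t));
    fseek(f, (long) (data_ofs + tensor_ofs), SEEK_SET); // a real tool should use a 64-bit seek
    fread(tokens.data(), 1, n_bytes, f);
    return tokens;
}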

Labels: build, examples