
tool: add conversion of text/parquet to custom format #14622


Open · wants to merge 5 commits into master

Conversation

lexasub (Contributor) commented Jul 10, 2025

convert-to-train-gguf Utility

This utility is designed to convert text datasets (or pre-tokenized data) into the GGUF format, optimized for training models in llama.cpp.

Features

  • Two-pass processing: Efficiently handles large datasets that do not fit entirely into RAM, performing a first pass to collect metadata and a second pass to write the actual tensor data (see the sketch after this list).

  • Flexible input: Supports reading both raw text (with subsequent tokenization using a provided model) and pre-tokenized data (in the format of space-separated token IDs).

  • Modular architecture: The code is divided into several classes (llama_gguf_file, llama_gguf_writer, llama_dataset_reader, llama_text_dataset_reader, llama_gguf_converter, llama_gguf_reader) to improve modularity, extensibility, and testability.

  • Preview functionality: Allows you to view metadata and the first few sequences of the generated GGUF file, including optional detokenization.
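To make the two-pass idea concrete, here is a minimal sketch. The reader/writer types below are hypothetical stand-ins for the PR's llama_dataset_reader and llama_gguf_writer classes, and all method names are assumptions, not the utility's actual API:

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical reader interface: yields one tokenized sequence per call.
struct dataset_reader {
    virtual ~dataset_reader() = default;
    virtual bool open(const std::string & path) = 0;
    virtual bool next_sequence(std::vector<int32_t> & tokens) = 0; // false at EOF
};

// Hypothetical writer: the real utility would build GGUF KV metadata and
// tensor info here, then append tensor data.
struct gguf_writer {
    void write_metadata(const std::vector<int64_t> & lengths)     { (void) lengths; }
    void write_sequence_data(const std::vector<int32_t> & tokens) { (void) tokens;  }
};

// Two-pass conversion: metadata first, tensor data second, so the whole
// dataset never has to fit in RAM at once.
bool convert_two_pass(dataset_reader & reader, const std::string & path, gguf_writer & out) {
    std::vector<int64_t> lengths;
    std::vector<int32_t> tokens;

    // Pass 1: only record per-sequence lengths, discard the tokens.
    if (!reader.open(path)) { return false; }
    while (reader.next_sequence(tokens)) {
        lengths.push_back((int64_t) tokens.size());
    }
    out.write_metadata(lengths);

    // Pass 2: re-open the input and stream the actual token data to disk.
    if (!reader.open(path)) { return false; }
    while (reader.next_sequence(tokens)) {
        out.write_sequence_data(tokens);
    }
    return true;
}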


lexasub commented Jul 10, 2025

f... Windows moment... okay, maybe on the weekend I will try to fix it.

lexasub commented Jul 11, 2025

@JohannesGaessler, Windows is fixed!

JohannesGaessler (Collaborator) left a comment:

I would suggest changing the design:

  • Internally store a gguf_context in the abstract file object, add subclasses for GGUF/plain text/Parquet files.
  • For each subclass, implement a method to populate the GGUF metadata + tensors.
  • Implement a method to write the GGUF context to disk.
  • Add a method for retrieving a sequence as a ggml_tensor.

The way I imagine it is that llama.cpp implements a GGUF standard specifically for language model training data. The training code would be able to load this GGUF data directly. We can then also use the GGUF specification as an intermediary format for e.g. plain text and Parquet data. We can use the code both to simply read plain text or Parquet and write it as GGUF (for a conversion to a portable format that does not need external dependencies) or we can use the loader directly in the training code to use plain text and Parquet as-is.

Regarding streaming: For GGUF and Parquet, add a method for retrieving the data for a sequence if the gguf_context was initialized with no_alloc=True. For plain text I think it's fine if streaming is not available. Streaming does not need to be part of this PR, it's fine if support for it is added later.
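For illustration, a rough sketch of that layout; the class and method names below are hypothetical (only gguf_write_to_file is the real gguf.h call):

#include "gguf.h"

// Abstract dataset file: always carries a gguf_context for metadata/tensors.
struct llama_dataset_file {
    virtual ~llama_dataset_file() = default;

    // Populate m_ctx with GGUF metadata + sequence tensors from the source file.
    virtual bool populate(const char * path) = 0;

    // Write the populated context to disk; shared by all subclasses.
    bool write_gguf(const char * path) {
        return gguf_write_to_file(m_ctx, path, /*only_meta =*/ false);
    }

protected:
    struct gguf_context * m_ctx = nullptr;
};

// One subclass per source format.
struct llama_gguf_dataset_file    : llama_dataset_file { bool populate(const char * path) override; };
struct llama_text_dataset_file    : llama_dataset_file { bool populate(const char * path) override; };
struct llama_parquet_dataset_file : llama_dataset_file { bool populate(const char * path) override; };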

// Destructor
llama_parquet_dataset_reader::~llama_parquet_dataset_reader() {
    close();
    m_file_path.clear(); // Clear the stored path only on destruction
}
Collaborator:

Why is path cleared here?

Contributor (Author):

Yes, it's unnecessary.

return false;
}

// Get the ggml_type directly and compare it with GGML_TYPE_I32
Collaborator:

English, please.


// Global variables for tests requiring llama_model
static llama_model * g_llama_model = nullptr;
static std::string g_test_model_path = "../../gte-small.Q2_K.gguf"; // Specify the actual path to your model
Collaborator:

This file does not exist in the llama.cpp repository by default.

JohannesGaessler (Collaborator) commented:

I forgot: I'm getting the impression that the code in this PR is at least partially generated by a language model. This is not a problem in and of itself, but I feel like the code would benefit from being more concise as this will mean less work for me in terms of reviewing and maintenance.

JohannesGaessler (Collaborator) commented:

> Add a method for retrieving a sequence as a ggml_tensor.

Actually, returning a sequence of tokens, as is done right now, would probably be better, since this abstracts away the handling of raw text vs. tokens.

lexasub commented Jul 11, 2025

> I forgot: I'm getting the impression that the code in this PR is at least partially generated by a language model. This is not a problem in and of itself, but I feel like the code would benefit from being more concise as this will mean less work for me in terms of reviewing and maintenance.

The code indeed went through several iterations:

1. Initial prototyping leveraged AI to establish functional patterns and core logic (e.g., dataset readers, GGUF interactions).
2. Extensive manual refactoring followed to align with llama.cpp's idioms.
3. Architectural refinements to decouple responsibilities (e.g., separating readers from converters).

I prioritized maintainability by eliminating redundancy and adhering to the project's minimalistic C++ style. That said, I welcome suggestions to further simplify or clarify the implementation, especially around interface boundaries or performance-critical paths. Let me know where deeper refactoring would help!

lexasub commented Jul 12, 2025

> Internally store a gguf_context in the abstract file object, add subclasses for GGUF/plain text/Parquet files.

Hmm, your suggestion is to replace the usage of my class llama_gguf_file with the following?

struct gguf_context {
    uint32_t version = GGUF_VERSION;

    std::vector<struct gguf_kv> kv;
    std::vector<struct gguf_tensor_info> info;

    size_t alignment = GGUF_DEFAULT_ALIGNMENT;
    size_t offset    = 0; // offset of `data` from beginning of file
    size_t size      = 0; // size of `data` in bytes

    void * data = nullptr;
};


lexasub commented Jul 12, 2025

> • Internally store a gguf_context in the abstract file object, add subclasses for GGUF/plain text/Parquet files.
> • For each subclass, implement a method to populate the GGUF metadata + tensors.
> • Implement a method to write the GGUF context to disk.

Does this design risk violating the Single Responsibility Principle by combining data-reading logic (e.g., handling GGUF/plain text/Parquet) and GGUF-specific metadata/tensor management within the same class hierarchy? For example, would a new ParquetFile subclass need to handle both Parquet parsing and GGUF metadata population, merging two distinct concerns into one class?

lexasub commented Jul 12, 2025

> or we can use the loader directly in the training code to use plain text and Parquet as-is.

I noticed a potential tension between two ideas here:

  • The earlier suggestion to strictly use GGUF for training (to avoid external dependencies and simplify maintenance).
  • The current proposal to directly load plain text/Parquet in training code (which reintroduces dependency risks and complexity).

Would it be better to enforce a strict boundary?

  • Allow llama_dataset_reader (or a new ParquetFile) to support multiple formats only for data conversion (e.g., Parquet → GGUF).
  • Require training to always use a dedicated GGUFReader class, which loads data exclusively from GGUF files.

This would:

  • Prevent accidental use of non-GGUF formats in training (enforced at compile/runtime).
  • Simplify maintenance by isolating GGUF-specific logic.
  • Maintain flexibility via pre-conversion tools.

What do you think?

Example:

// Training code  
GGUFReader reader;  
if (!reader.open("data.gguf")) { /* ... */ }  

// Conversion code  
ParquetFile *parquet_reader = new ParquetFile(....);  
GGUFConverter converter(parquet_reader, "output.gguf");  
converter.convert();  

JohannesGaessler (Collaborator) commented:

Regarding the use of Parquet in training:

My goal is to have a single standard for training data files - that way I only need to assert that the code works correctly for that standard. For e.g. Parquet files I only need to assert that the conversion to the standard works correctly. We already have GGUF as a data format that we use for models, so that is the file format that I want to use for the standard. With the current GGUF code it's straightforward to do gguf_context <-> .gguf file conversions. For training, the default pipeline would be .gguf file -> gguf_context -> token sequences. With the data conversion tool the pipeline would be Parquet file -> gguf_context -> .gguf file -> gguf_context -> token sequences. But if you just cut out the cyclic conversion in the middle you get Parquet file -> gguf_context -> token sequences. So without much extra effort we can get support for directly using Parquet files in training, and I can still debug issues with a single standard that I am familiar with.

The way that I would have implemented Parquet support would have been a simple Python script for conversion to GGUF, but if you're going to the trouble of implementing it in C++ I think we should consider both use cases ahead of time (I don't mean that this needs to be implemented in this PR).

Regarding implementation:

I'm generally flexible regarding how exactly the functionality is mapped to classes - personally I don't value strict adherence to object-oriented design in the first place. If I were to implement it myself I would just do a procedural design for loading the training data with something like:

struct llama_dataset;

// Builders: construct a dataset from each supported input format.
struct llama_dataset * from_gguf(const char * path);
struct llama_dataset * from_txt(const char * path);
struct llama_dataset * from_parquet(const char * path);

// Serialize a loaded dataset back to a .gguf file.
void to_gguf(struct llama_dataset * dataset, const char * path);

// Accessors over the loaded token sequences.
uint64_t n_sequences(const struct llama_dataset * dataset);
int32_t sequence_length(const struct llama_dataset * dataset, uint64_t index);
const int32_t * sequence(const struct llama_dataset * dataset, uint64_t index);

This is not much of a concern right now, but long-term it would be nice to have a C-compatible interface for the dataset management, since it would enable use with e.g. Python bindings. Of course, you could still have C++ code with an object-oriented design underneath. The way I would have done it is with a factory method for llama_dataset; it's also fine to define external builder objects for each data format.
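One possible shape for that factory, as a sketch over the interface above (the extension-based dispatch rule is an assumption, not something from the thread):

#include <cstring>
#include <string>

struct llama_dataset;
struct llama_dataset * from_gguf(const char * path);
struct llama_dataset * from_txt(const char * path);
struct llama_dataset * from_parquet(const char * path);

// Hypothetical factory entry point: dispatch on the file extension so callers
// (and future Python bindings) only need one C-compatible function.
struct llama_dataset * llama_dataset_load(const char * path) {
    const std::string p(path);
    auto has_ext = [&](const char * ext) {
        const size_t n = std::strlen(ext);
        return p.size() >= n && p.compare(p.size() - n, n, ext) == 0;
    };
    if (has_ext(".gguf"))    { return from_gguf(path); }
    if (has_ext(".parquet")) { return from_parquet(path); }
    return from_txt(path); // default: treat the input as plain text
}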

JohannesGaessler (Collaborator) commented:

And regarding the implementation for the interface I laid out: if llama_dataset internally stores a gguf_context, then from_gguf and to_gguf are just simple wrappers for the interface in gguf.h and only from_txt and from_parquet need complex implementations. Without streaming the underlying tensors can simply be put into a ggml_context. n_sequences, sequence_length, and sequence can then be determined from the tensors stored in the ggml_context. (Since they're constant and ggml_context stores them as a linked list it may make sense to just cache the pointers to the tensors in a vector.)
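A minimal sketch of that caching idea, assuming the dataset owns a ggml_context of sequence tensors; ggml_get_first_tensor/ggml_get_next_tensor are the real ggml.h iteration calls, everything else is hypothetical:

#include <cstdint>
#include <vector>
#include "ggml.h"

struct llama_dataset {
    struct ggml_context * ctx_data = nullptr;     // owns the sequence tensors
    std::vector<struct ggml_tensor *> sequences;  // cached pointers, built once
};

// Walk the ggml_context's internal tensor list once and cache the pointers,
// so n_sequences/sequence_length/sequence become O(1) lookups.
static void llama_dataset_cache_tensors(struct llama_dataset * ds) {
    for (struct ggml_tensor * t = ggml_get_first_tensor(ds->ctx_data);
         t != nullptr;
         t = ggml_get_next_tensor(ds->ctx_data, t)) {
        ds->sequences.push_back(t);
    }
}

uint64_t n_sequences(const struct llama_dataset * ds) {
    return ds->sequences.size();
}

int32_t sequence_length(const struct llama_dataset * ds, uint64_t index) {
    return (int32_t) ds->sequences[index]->ne[0]; // tokens along dim 0
}

const int32_t * sequence(const struct llama_dataset * ds, uint64_t index) {
    return (const int32_t *) ds->sequences[index]->data; // GGML_TYPE_I32 assumed
}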

lexasub commented Jul 12, 2025

@JohannesGaessler Is this union approach acceptable for handling different dataset types (GGUF, Parquet, etc.) within a single llama_dataset struct?

namespace arrow { namespace io { class ReadableFile; } } // forward declaration

struct llama_dataset {
    llama_dataset_type type = LLAMA_DATASET_TYPE_NONE; // current type of the dataset
    std::uint64_t sequence_count         = 0;
    std::uint64_t current_sequence_index = 0;

    // Union to hold data specific to each dataset type
    union {
        struct {
            gguf_context * ctx;
            FILE         * file_handle; // for GGUF, holds the file handle for streaming reads
        } gguf_data;
#ifdef LLAMA_PARQUET
        struct {
            llama_parquet_dataset_reader * parquet_reader; // pointer to the Parquet reader object
            arrow::io::ReadableFile      * input_file;
        } parquet_data;
#endif
        // Add other reader types here if needed (e.g., llama_text_dataset_reader * text_reader;)
    };
};

struct gguf_context * gguf_init_from_file_impl(FILE * file, struct gguf_init_params params); // we use the impl directly, because gguf_init_from_file closes the file

JohannesGaessler (Collaborator) commented:

I would ultimately accept an implementation like that since it is an improvement vs. master. But as I said, this is more complex than it needs to be. For the Parquet -> GGUF file conversion tool you will need to populate a gguf_context with the Parquet metadata anyways so I think it makes more sense to just also use that code for directly loading Parquet data. Long-term, if this PR gets merged like that I will probably refactor the internals of the code once it needs to be extended to accommodate yet another data format.

lexasub commented Jul 13, 2025

I agree that ideally it would be nice to unify everything around gguf_context. However, until gguf_context is redesigned to flexibly work with other kinds of data (for example, streaming reads from Parquet without full materialization), the current approach with a union is the simplest and most pragmatic solution.

For Parquet data, especially when working with large volumes and batch processing, direct use of llama_parquet_dataset_reader via the union is more memory-efficient than trying to fully load everything into a gguf_context. For GGUF data, where gguf_context is already the native representation, this of course fits well.

JohannesGaessler (Collaborator) commented:

To be clear, what I mean is to always use the gguf_context for the metadata. Without streaming, simply load all sequences into the GGUF context. The sequence function I outlined before would simply return the tensor data. With streaming, the sequence function lazily fetches the data for GGUF and Parquet, the concrete implementation depends on the file type.
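As a hedged illustration of the streaming case for GGUF: the gguf.h offset/size getters below are real, but the surrounding function and how it would be wired into sequence() are assumptions.

#include <cstdint>
#include <cstdio>
#include <vector>
#include "gguf.h"

// With no_alloc == true the tensor data stays on disk, so the sequence
// function has to seek and read the requested tensor on demand instead of
// returning a pointer into memory.
static std::vector<int32_t> fetch_sequence_gguf(FILE * f, const struct gguf_context * ctx, int64_t tensor_id) {
    const size_t data_ofs   = gguf_get_data_offset(ctx);              // start of the tensor data section
    const size_t tensor_ofs = gguf_get_tensor_offset(ctx, tensor_id); // offset of this tensor within it
    const size_t n_bytes    = gguf_get_tensor_size(ctx, tensor_id);

    std::vector<int32_t> tokens(n_bytes / sizeof(int32_t));
    fseek(f, (long) (data_ofs + tensor_ofs), SEEK_SET); // a real tool should use a 64-bit seek
    fread(tokens.data(), 1, n_bytes, f);
    return tokens;
}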

Labels: build, examples