- Python 3.10.6
- Nvidia GPU
- Create venv
py -3.10 -m venv cuda
- Activate venv
cuda\Scripts\activate
- Install libs
pip install matplotlib numpy pylzma ipykernel jupyter
pip install torch --index-url https://download.pytorch.org/whl/cu118
- Install a new kernel for Jupyter Notebook
python -m ipykernel install --user --name=cuda --display-name "cuda-gpt"
- Start Jupyter Notebook
jupyter notebook
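Once the notebook server is running, a quick check in the first cell can confirm that the cuda-gpt kernel actually sees the GPU. The helper below is a minimal sketch; the function name `describe_torch_device` is made up for illustration, and it only uses standard `torch.cuda` calls.

```python
import importlib.util


def describe_torch_device() -> str:
    """Return a short description of the device PyTorch would use."""
    # Avoid a hard failure when the kernel points at the wrong environment.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"

    import torch

    if torch.cuda.is_available():
        # Index 0 is the first visible GPU.
        name = torch.cuda.get_device_name(0)
        return f"cuda: {name} (CUDA {torch.version.cuda})"
    return "cpu: CUDA is not available to this PyTorch build"


print(describe_torch_device())
```

If this reports `cpu`, the usual suspects are a CPU-only torch wheel or a notebook running on a kernel other than cuda-gpt.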
- Download & Prepare Dataset
  - Download OpenWebText2 (~27 GB) from https://openwebtext2.readthedocs.io/en/latest/ and unpack all .jsonl.zst files into ./openwebtext2/.
- Extract & Tokenize Data
  - Open and run data-extract-v10.ipynb, which:
    - Streams and decompresses the .zst files.
    - Filters for English-language texts.
    - Tokenizes using the tiktoken tokenizer.
    - Outputs output_v10/encoded_data/encoded_output_v10_accuracy.npy (~107 GB).
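The extraction steps listed for data-extract-v10.ipynb (stream, filter, tokenize) can be sketched roughly as follows. This is not the notebook's code: the function names are hypothetical, the `"text"` field is an assumption about the JSONL layout, `zstandard` is an assumed extra dependency, and `looks_english` is a crude ASCII-ratio stand-in for a real language filter.

```python
import io
import json
from typing import Callable, Iterable, Iterator, List


def iter_jsonl_zst(path: str) -> Iterator[str]:
    """Yield text lines from a .jsonl.zst file without loading it whole."""
    import zstandard  # assumed dependency: pip install zstandard

    with open(path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield line


def looks_english(text: str, min_ascii_ratio: float = 0.9) -> bool:
    """Toy heuristic (assumption): mostly-ASCII text is treated as English."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ch.isascii())
    return ascii_chars / len(text) >= min_ascii_ratio


def encode_documents(
    lines: Iterable[str],
    keep: Callable[[str], bool],
    encode: Callable[[str], List[int]],
    eot_id: int,
) -> List[int]:
    """Tokenize every kept document and separate documents with eot_id."""
    tokens: List[int] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        text = json.loads(line).get("text", "")  # "text" key is assumed
        if text and keep(text):
            tokens.extend(encode(text))
            tokens.append(eot_id)
    return tokens


# Hypothetical wiring with the real libraries (not executed here; the
# encoding name is an assumption, the notebook may use a different one):
#   import tiktoken
#   enc = tiktoken.get_encoding("gpt2")
#   tokens = encode_documents(iter_jsonl_zst(path), looks_english,
#                             enc.encode_ordinary, enc.eot_token)
```

Passing the filter and encoder in as callables keeps the loop testable on a few in-memory lines before it is pointed at the full dataset.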
- Train Base GPT Model
  - Run gpt-v14.ipynb end-to-end to:
    - Configure hyperparameters (depth, heads, learning rate schedule, etc.).
    - Execute the training loop, logging train/validation losses.
    - Save the checkpoint (e.g., output_v14\pre_training\run_<unix_timestamp>/gpt_v14_model.pt).
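The training step mentions a learning rate schedule without spelling it out. A common choice for GPT-style pre-training is linear warmup followed by cosine decay, sketched below. Whether gpt-v14.ipynb uses this exact shape is an assumption, and every default value here is illustrative, not taken from the notebook.

```python
import math


def lr_at_step(
    step: int,
    max_lr: float = 3e-4,       # illustrative values, not the notebook's
    min_lr: float = 3e-5,
    warmup_steps: int = 1_000,
    total_steps: int = 100_000,
) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; step + 1 avoids a learning rate of exactly zero.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # progress runs from 0.0 (end of warmup) to 1.0 (end of training).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)


for s in (0, 999, 50_000, 100_000):
    print(s, lr_at_step(s))
```

In a PyTorch loop the returned value would typically be written into each `param_group["lr"]` of the optimizer before the optimizer step.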
- Fine-Tune for Classification
  - Use finetuning-classification-v1.ipynb to adapt the pre-trained checkpoint for a binary classification task (e.g., spam vs. ham).
- Fine-Tune for Instruction-Following
  - Use finetuning-instruction-answer-v4.ipynb to train on instruction–response pairs and improve the model’s conversational ability.
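Instruction fine-tuning needs each instruction–response pair rendered into a single training string. The sketch below uses an Alpaca-style template; the template and the `format_example` helper are assumptions for illustration, and the notebook may format its pairs differently.

```python
def format_example(instruction: str, response: str = "", context: str = "") -> str:
    """Render one instruction-response pair as a single prompt string.

    Leave response empty at inference time so the model completes it.
    """
    prompt = f"### Instruction:\n{instruction.strip()}\n"
    if context.strip():
        # Optional extra input the instruction refers to.
        prompt += f"\n### Input:\n{context.strip()}\n"
    prompt += "\n### Response:\n"
    return prompt + response.strip()


print(format_example("Translate to Polish.", "Dzień dobry", context="Good morning"))
```

Using the same function for training and generation keeps the prompt format identical in both places, which matters because the model learns to answer after the exact `### Response:` marker.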
- Evaluate Fine-Tuned Models
  - Open evaluate-finetuned-llm.ipynb to:
    - Compute performance metrics (accuracy, loss) on held-out data.
    - Compare your fine-tuned outputs against the Ollama LLaMA 3.2 3B reference baseline.
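The two metrics named for the evaluation step, accuracy and loss, can be illustrated on plain Python lists. This is a minimal sketch of the definitions only, assuming the loss is mean cross-entropy over class logits; it is not the code in evaluate-finetuned-llm.ipynb, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def evaluate(batch_logits: List[Sequence[float]], targets: List[int]) -> Tuple[float, float]:
    """Return (accuracy, mean cross-entropy loss) over held-out examples."""
    if not targets or len(batch_logits) != len(targets):
        raise ValueError("need equally sized, non-empty logits and targets")
    correct = 0
    total_loss = 0.0
    for logits, target in zip(batch_logits, targets):
        predicted = max(range(len(logits)), key=lambda i: logits[i])
        correct += int(predicted == target)
        total_loss += cross_entropy(logits, target)
    n = len(targets)
    return correct / n, total_loss / n


# Three toy examples for a binary task such as ham (0) vs. spam (1).
acc, loss = evaluate([[2.0, 0.0], [0.0, 2.0], [1.0, 0.0]], [0, 1, 1])
print(f"accuracy={acc:.3f} loss={loss:.3f}")
```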
Sample generation from gpt-v14.ipynb after 17 epochs:
Prompt: "I like apple juice - I drink it"
→ "I like apple juice - I drink it for about 30 minutes or even 1/20 minutes. In fact it was so common, so if it would be melted and the calories for me. And for me it was a pretty cool product"
Sample generation from gpt-v17.ipynb after 100k steps:
Note: gpt-v17 was trained on the 100BT subset of the fineweb-edu dataset using infinite streaming from *.parquet files.
Prompt: "I like apple juice - I drink it"
→ "I like apple juice - I drink it every night, especially at weekends and holidays. I have no doubt that it contains the highest levels of polyphenols.
But, as I have learned, polyphenols are actually a natural substance found naturally in fruits, berries and vegetables like berries"
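The "infinite streaming" mentioned in the gpt-v17 note can be pictured as cycling over the shard files forever and yielding records one at a time, so training is bounded by step count instead of dataset length. The sketch below is an assumption about the general idea, not the notebook's loader. `stream_forever` and `read_shard` are hypothetical names; a real reader could, for example, iterate `pyarrow.parquet.ParquetFile(path).iter_batches()`.

```python
import itertools
from typing import Callable, Iterable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def stream_forever(
    shards: Sequence[str],
    read_shard: Callable[[str], Iterable[T]],
) -> Iterator[T]:
    """Yield records shard by shard, restarting from the first shard at the end."""
    # itertools.cycle repeats the shard list endlessly (or stops if it is empty).
    for shard in itertools.cycle(shards):
        yield from read_shard(shard)


# Toy reader standing in for a parquet reader: two records per shard.
def fake_reader(shard: str) -> Iterable[str]:
    return [f"{shard}-0", f"{shard}-1"]


first_seven = list(itertools.islice(stream_forever(["a", "b"], fake_reader), 7))
print(first_seven)
```

Only one shard is read at a time, so memory use stays flat regardless of how many steps the training loop requests.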
🎛️ Wrap notebooks into CLI scripts (train.py, generate.py).
🌐 Build a small Gradio/Streamlit demo for live inference.
📊 Integrate Weights & Biases or TensorBoard for experiment tracking.
🇵🇱 Experiment with Polish-language fine-tuning on local corpora.