LLM in PyTorch
Building and training a GPT-2 style model in pure PyTorch
Tech Stack
Python, PyTorch
Description
Building an LLM (Large Language Model) has always fascinated me, and Sebastian Raschka’s book Build a Large Language Model (From Scratch) was the perfect opportunity to revisit the Transformer architecture in PyTorch. It covers how to convert words into tokens, how to implement the decoder part of a Transformer in pure PyTorch (with the underlying mathematical formulas) and use it to generate new tokens one by one, how to prepare a dataset for pre-training, how to fine-tune the model on task-specific data, and how to implement RLHF (Reinforcement Learning from Human Feedback) so that the LLM follows instructions.
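To make the tokenization step concrete, here is a minimal sketch using the tiktoken GPT-2 BPE encoding (the ~50k-token vocabulary listed in the architecture table below); the project’s own tokenizer setup may differ in detail.

```python
import tiktoken

# GPT-2's byte-pair-encoding vocabulary (~50k entries)
enc = tiktoken.get_encoding("gpt2")

text = "Wherefore art thou Romeo?"
token_ids = enc.encode(text)     # text -> list of integer token ids
decoded = enc.decode(token_ids)  # ids -> text round-trip
assert decoded == text
```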
Features
- Tokenization: Converting text into tokens with a custom vocabulary
- Transformer Architecture: Complete implementation of the decoder (multi-head attention, layer normalization, feed-forward) in pure PyTorch; a sketch of the attention block follows this list
- Pre-training: Dataset preparation and model training on next-token prediction
- Fine-tuning: Adapting the pre-trained model to specific tasks
- RLHF: Implementation of Reinforcement Learning with Human Feedback for model alignment
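As a taste of what the decoder implementation involves, below is a minimal sketch of causal multi-head self-attention in pure PyTorch; class and parameter names are illustrative, not the project’s actual code.

```python
import torch
import torch.nn as nn

class CausalMultiHeadAttention(nn.Module):
    """Minimal causal multi-head self-attention (GPT-style)."""

    def __init__(self, d_model: int, num_heads: int, context_len: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.out_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        # Upper-triangular mask blocks attention to future positions
        mask = torch.triu(torch.ones(context_len, context_len), diagonal=1).bool()
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim)
        q = q.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product: divide by sqrt(head_dim) for numerical stability
        scores = q @ k.transpose(-2, -1) / (self.head_dim ** 0.5)
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))
        weights = self.dropout(torch.softmax(scores, dim=-1))
        out = (weights @ v).transpose(1, 2).contiguous().view(b, t, d)
        return self.out_proj(out)
```

With the GPT-2 Small settings listed below, `CausalMultiHeadAttention(768, 12, 1024)` maps a `(batch, seq_len, 768)` tensor to a tensor of the same shape.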
Challenges
- Understanding the attention mechanism: Visualizing how the Q, K, V matrices interact and why dividing by √d_k is necessary for numerical stability required several readings and experiments
- Training debugging: Identifying why the loss was stagnating or exploding, tuning hyperparameters (learning rate, batch size, warmup steps) empirically
- Hardware constraints: With an RTX 5070 Ti, I had to optimize GPU memory usage and limit the model and batch sizes
- Gradient management: Understanding and implementing gradient accumulation to simulate larger batch sizes despite memory limitations (see the sketch after this list)
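To illustrate the gradient accumulation point, here is a minimal sketch of a training loop with accumulation; the model, dataloader, and `accumulation_steps` value are placeholders rather than the project’s actual training code.

```python
import torch

def train_epoch(model, dataloader, optimizer, accumulation_steps: int = 8, device: str = "cuda"):
    """One epoch with gradient accumulation: the effective batch size is
    micro_batch_size * accumulation_steps, while only one micro-batch
    lives in GPU memory at a time."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        inputs, targets = inputs.to(device), targets.to(device)
        logits = model(inputs)  # (batch, seq_len, vocab_size)
        loss = torch.nn.functional.cross_entropy(
            logits.flatten(0, 1), targets.flatten()
        )
        # Scale so the accumulated gradient matches one large-batch step
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```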
Model Architecture
The implemented model follows OpenAI’s GPT-2 Small architecture:
| Parameter | Value |
|---|---|
| Total parameters | 124M |
| Layers | 12 |
| Embedding dimension | 768 |
| Attention heads | 12 |
| Max context | 1024 tokens |
| Vocabulary | 50,257 tokens (BPE) |
This architecture is essentially the same as GPT-3’s, just at a much smaller scale (175 billion parameters for GPT-3 versus 124 million here), which makes it trainable on consumer hardware.
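The table translates into a configuration dictionary along these lines, in the spirit of the book’s GPT_CONFIG_124M; the dropout and bias values are assumptions, the rest comes from the table.

```python
# GPT-2 Small hyperparameters, matching the table above
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # GPT-2 BPE vocabulary
    "context_length": 1024,  # maximum number of tokens the model attends over
    "emb_dim": 768,          # embedding / hidden dimension
    "n_heads": 12,           # attention heads per layer
    "n_layers": 12,          # transformer blocks
    "drop_rate": 0.1,        # dropout rate (assumed, not in the table)
    "qkv_bias": False,       # whether Q/K/V projections use a bias term (assumed)
}
```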
Results
The model was pre-trained on the Shakespeare corpus (~1 MB of text), giving it a distinctive writing style in Elizabethan English:
- Coherent text generation: The model produces grammatically correct sentences that mimic Shakespearean style, with archaic turns of phrase (“thou”, “hath”, “wherefore”); a sketch of the generation loop follows this list
- Instruction following: After fine-tuning, the model responds to prompts contextually
- Limitations: The small dataset and model size limit the diversity and depth of responses; the model excels in its narrow domain but doesn’t generalize to other styles or topics
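For context on how such text is produced token by token, here is a minimal sketch of a greedy generation loop; `model`, `enc` (a tokenizer with encode/decode), and the context cropping are assumptions standing in for the project’s actual code.

```python
import torch

@torch.no_grad()
def generate(model, enc, prompt: str, max_new_tokens: int = 50, context_length: int = 1024):
    """Greedy next-token generation: feed the context, take the most likely token, repeat."""
    model.eval()
    # Assumes model and ids live on the same device
    ids = torch.tensor(enc.encode(prompt)).unsqueeze(0)  # (1, seq_len)
    for _ in range(max_new_tokens):
        logits = model(ids[:, -context_length:])  # crop to the model's max context
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        ids = torch.cat([ids, next_id], dim=1)
    return enc.decode(ids.squeeze(0).tolist())
```

Replacing the argmax with temperature-scaled sampling typically gives more varied output at the cost of some coherence.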