Fine-Tuning LLaMA 3.1-8B for Mathematical Reasoning Verification

NYU Deep Learning — Kaggle Competition | Fall 2025

Overview

For the NYU DL-Fall-25 Kaggle competition, the task was to fine-tune a large language model to verify mathematical solutions rather than solve them: a binary classification problem (True/False) framed as an instruction-following task. I used LLaMA 3.1-8B-Instruct as the base model with parameter-efficient fine-tuning via LoRA adapters, training under real compute constraints on A100 GPUs through Google Colab Pro+.

Setup

The dataset contained ~1M math question/solution pairs with a 60/40 class imbalance (False/True). I balanced it to 800k samples (50/50), held out a fixed 5k-sample validation set for all experiments, and formatted each example as an instruction-following prompt ending with a True/False label. To keep the 8B model trainable on 40GB VRAM, I loaded it in 4-bit quantized format via bitsandbytes, enabled gradient checkpointing, and applied LoRA adapters across all attention and feed-forward projection layers (q/k/v/o/gate/up/down projections).

Hyperparameter Search

Before committing to full training runs, I ran a 31-configuration random sweep with Weights & Biases over a 10k-sample subset. The search space covered learning rate (log-uniform, 1e-6 to 5e-1), optimizer (adamw_torch, adamw_8bit, paged_adamw_32bit, lion_8bit), LR scheduler (linear, cosine, cosine_with_restarts), gradient clipping, warmup ratio, weight decay, batch size, and gradient accumulation steps.

Key findings from the sweep: learning rates above 3e-4 caused divergence; cosine scheduling significantly outperformed linear in convergence stability; and paged_adamw_32bit paired with a cosine schedule produced the most consistent low-loss runs. This narrowed the search space considerably before scaling to full training.

Training Phases and the Cumulative Retraining Insight

The most important lesson from this project came from how I structured incremental training.
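Before getting into the training phases, here is a minimal sketch of the data preparation described in the Setup section. The prompt template, function names, and the dict schema with a boolean `label` key are illustrative assumptions, not the exact competition format:

```python
import random

def format_example(question, solution, label):
    """Render one question/solution pair as an instruction-following
    prompt ending in a True/False verdict. The template wording here
    is a guess; the real competition prompt may differ."""
    verdict = "True" if label else "False"
    return (
        "Determine whether the proposed solution to the math question "
        "below is correct. Answer True or False.\n\n"
        f"Question: {question}\n"
        f"Solution: {solution}\n"
        f"Answer: {verdict}"
    )

def balance_dataset(examples, per_class, seed=0):
    """Downsample to a 50/50 True/False split (e.g. 400k per class for
    the 800k training set described above). `examples` is a list of
    dicts with a boolean 'label' key (an assumed schema)."""
    rng = random.Random(seed)
    true_set = [e for e in examples if e["label"]]
    false_set = [e for e in examples if not e["label"]]
    picked = rng.sample(true_set, per_class) + rng.sample(false_set, per_class)
    rng.shuffle(picked)
    return picked
```

Placing the True/False label at the very end of the prompt keeps the supervised target in the final tokens, which is what makes a generative instruction-tuned model usable as a binary verifier.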
With 12-hour Colab session limits and checkpoints saved to Drive between runs, I trained in 20k-sample increments. My initial approach was sequential: train on 0–20k, save a checkpoint, train on 20–40k from that checkpoint, and so on. This led to catastrophic forgetting: the model kept losing earlier patterns as it absorbed new data, and accuracy plateaued around 0.83 and wouldn't budge.

The fix was cumulative retraining: instead of moving to the next slice alone, each run trained on all data seen so far plus the new chunk (0–20k → 0–40k → 0–60k → 0–70k). Each session loaded the previous best checkpoint and trained for one epoch on the full accumulated set. This preserved reasoning patterns from earlier data while integrating new distributions.

The catch: I realized this too late. By the time I had identified the issue and verified the fix, there wasn't enough compute budget left to run it through the full dataset before the competition deadline. Post-competition experiments confirmed the hypothesis: cumulative retraining reached 86.7% validation accuracy, versus the sequential approach's plateau at ~83%.

Results

Kaggle public leaderboard: 0.838
Kaggle private leaderboard: 0.830
Best validation accuracy (competition): 84%
Best validation accuracy (post-competition, cumulative): 86.5–86.7%

Stack: Python, Hugging Face Transformers, Unsloth, trl (SFTTrainer), LoRA, bitsandbytes, Weights & Biases, Google Colab A100
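To make the sequential-versus-cumulative comparison concrete, here is a minimal sketch of the two slice schedules described in the training section (the helper names are my own, not from the project code):

```python
def sequential_slices(total, step):
    """Sequential schedule: each session trains only on the newest chunk.
    This is the approach that led to catastrophic forgetting."""
    return [(start, min(start + step, total))
            for start in range(0, total, step)]

def cumulative_slices(total, step):
    """Cumulative retraining: each session trains on all data seen so far
    plus the new chunk, starting from the previous best checkpoint."""
    return [(0, min(end, total))
            for end in range(step, total + step, step)]

# With 20k-sample increments over a 70k subset, as in the writeup:
#   sequential: (0, 20k), (20k, 40k), (40k, 60k), (60k, 70k)
#   cumulative: (0, 20k), (0, 40k),  (0, 60k),  (0, 70k)
```

The trade-off is explicit in the second schedule: each cumulative session re-reads all earlier slices, so per-session cost grows linearly, which is exactly why the remaining compute budget ran out before the full dataset could be covered.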
