MLX Fine-Tune Example Guide
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
A Local MLX Fine-Tuning Experiment

I just finished a local LoRA fine-tune of a 7B instruction model on Apple Silicon, via MLX, teaching it a high-fantasy literary register (Gene Wolfe and Tolkien). This is a more rigorous version, with more data, of something I tried two years ago, and I wanted to share some lessons learned. The goal was a visible change in the output's style, register, and diction, not narrative quality.

Example Output:
My big takeaway is that fine-tuning a small local model is now a project of a couple of hours on a single Mac, offline and at essentially zero marginal cost, with a manageably small fine-tuning data set. To me as an AI researcher, this reflects the empirical literature (LIMA, LIMO): very small data sets, if carefully curated for quality and diversity, can powerfully change outputs.

1. Process: quantize base Mistral-7B-Instruct-v0.3 → generate data (locally via LM Studio) → train → evaluate → fuse base + adapter → export to GGUF.
2. Data: ~1,200 examples, <2 epochs.
3. Only 0.145% of the weights were trained, yet the model shifted measurably from a generic helpful-assistant voice to a specific literary register (perplexity −35%).
4. Most of the work was data curation: cleaning, chunking on sentence boundaries, prompt generation, and register framing.

Caveat: this is a QLoRA on a 4-bit quantized 7B model, so nothing boundary-pushing. However, I think this scales if you have more hardware and want to work on larger open-weight models.

Environment:
- Hardware: Apple M2, 64 GB unified memory.
- OS / Python: macOS (Sonoma), Python 3.12.4, virtual environment managed with `uv`.
- Framework: Apple MLX via `mlx-lm` (`mlx_lm.convert`, `mlx_lm.lora`, `mlx_lm.fuse`, `mlx_lm.generate`).

Quantization:
```
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 --q-bits 4 --quantize \
  --mlx-path ./mistral-7b-instruct-v0.3-4bit
```

Data: 1,181 records, split 95/5 (seed 42) into 1,122 train / 59 validation, extracted from Tolkien’s The Silmarillion and Unfinished Tales of Númenor and Middle-earth, and from Gene Wolfe’s The Book of the New Sun tetralogy.

MLX chat format (one JSON object per line):
```json
{"messages": [{"role": "user", "content": "<prompt>"}, {"role": "assistant", "content": "<target-voice passage>"}]}
```

Training used `--mask-prompt`, so loss is computed only on the assistant completion (the target voice), never on the prompt.
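The post does not show the script that built the data files, so the following is only a minimal sketch of how records in the chat format above could be written and split 95/5 with seed 42. The function names are hypothetical, and rounding the validation count down is an assumption that happens to reproduce the 1,122 / 59 split.

```python
import json
import random


def to_chat_record(prompt: str, passage: str) -> dict:
    """Wrap one prompt/passage pair in the chat format shown above."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": passage},
        ]
    }


def split_records(records: list, val_fraction: float = 0.05, seed: int = 42):
    """Shuffle deterministically, then hold out val_fraction for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Assumption: the validation count is rounded down (1,181 * 0.05 -> 59).
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def write_jsonl(path: str, records: list) -> None:
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

As far as I know, `mlx_lm.lora` looks for `train.jsonl` and `valid.jsonl` inside the `--data` directory, so the two halves would be written under those names.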
Data Cleaning & Prompt Generation

Both steps were done with Python scripts.

Data cleaning: removed numerals and form-feed page breaks, rejoined hyphen-split words, and reflowed PDF line-wraps into continuous prose.
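The cleaning scripts themselves are not included in the post. As a rough sketch of the steps named above, something like the following would work, assuming "numerals" means lines holding only a page number and that blank lines separate paragraphs. The function names and the `max_chars` budget are hypothetical.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Apply the cleaning steps to raw PDF-extracted text."""
    text = raw.replace("\f", "\n")  # form-feed page breaks
    # Assumption: "numerals" are lines containing only a page number.
    text = re.sub(r"(?m)^[ \t]*\d+[ \t]*$", "", text)
    # Rejoin words split by a hyphen at a line end. This also merges real
    # hyphenated compounds that happen to wrap, so spot-check the output.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Reflow: blank lines separate paragraphs; wraps inside one become spaces.
    paragraphs = re.split(r"\n\s*\n", text)
    reflowed = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in reflowed if p)


def chunk_on_sentences(text: str, max_chars: int = 1500) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

The sentence splitter is deliberately naive (it breaks after `.`, `!`, or `?` followed by whitespace), so abbreviations and quoted dialogue would need extra handling on a real corpus.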
Prompt generation: passed cleaned/processed context chunks to Mistral Small 24B Instruct running in LM Studio, with an instruction to write a prompt backwards from the passage, always beginning "Write a story section…".

MLX Training Framework & Arguments

LoRA (QLoRA-style, adapter over the 4-bit base) trained with `mlx_lm.lora`:
```
mlx_lm.lora \
  --model ./mistral-7b-instruct-v0.3-4bit \
  --train --data ./data_v4 --mask-prompt \
  --fine-tune-type lora --num-layers 16 \
  --batch-size 2 --iters 1000 \
  --max-seq-length 2048 --learning-rate 1e-5 \
  --steps-per-report 10 --steps-per-eval 100 --val-batches -1 \
  --save-every 100 --adapter-path ./adapters_v2 \
  --grad-checkpoint --seed 42
```
- Fine-tune type: LoRA over the top 16 transformer layers.
- Trainable parameters: 10.49 M of 7.25 B (0.145%); the 4-bit base stays frozen.
- Optimizer / LR: Adam, learning rate `1e-5` (constant).
- Sequence length: 2048 tokens; gradient checkpointing enabled for memory.
- Validation: full validation set (`--val-batches -1`) every 100 iterations; adapter checkpoints saved every 100 iterations.
- Seed: 42 (reproducible).

After training, the adapter was fused into a single standalone deployable model:
```
mlx_lm.fuse --model ./mistral-7b-instruct-v0.3-4bit \
  --adapter-path ./adapters_v2 --save-path ./mistral-7b-v2-fused
```

Batch Size & Epochs
- Batch size: 2 sequences per step.
- Iterations: 1,000.
- Epochs: 1,000 iters × 2 sequences = 2,000 sample-passes over 1,122 training examples ≈ 1.78 epochs.
- One epoch ≈ 561 iterations (1,122 ÷ 2).

Training Run

The run completed 1,000 iterations in ~3 h 43 m of compute (≈11 s/iter plus full-validation passes every 100 iters), at a peak of 6.4 GB memory. Training loss fell from 2.50 → 1.95 and held-out validation loss from 2.81 → 2.36, with the validation minimum (2.355) at iteration 500 and essentially flat thereafter; i.e. the model kept fitting the training data through the second epoch without *over*fitting on held-out prose.
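The epoch and parameter figures above can be re-derived in a few lines. The perplexity line assumes the reported losses are mean per-token cross-entropy in nats, so perplexity is `exp(loss)`. On that assumption the start-to-end validation losses give a drop of about 36%, in the same range as the −35% quoted earlier; the post does not say exactly how its figure was computed.

```python
import math

# Figures reported in the post.
train_examples = 1122
batch_size = 2
iters = 1000

iters_per_epoch = train_examples / batch_size      # 561.0
epochs = iters * batch_size / train_examples       # ~1.78

trainable, total = 10.49e6, 7.25e9
trainable_pct = 100 * trainable / total            # ~0.145

# Assumption: loss is mean per-token cross-entropy in nats.
val_start, val_end = 2.81, 2.36
ppl_change = math.exp(val_end - val_start) - 1     # relative perplexity change

print(f"iterations per epoch: {iters_per_epoch:.0f}")
print(f"epochs completed:     {epochs:.2f}")
print(f"trainable share:      {trainable_pct:.3f}%")
print(f"perplexity change:    {ppl_change:+.1%}")
```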
The visible step-down in training loss near iteration 560 corresponds to the start of the second epoch (one full pass ≈ 561 iters), where the adapter begins fitting the corpus more tightly.

Evaluation Harness

The same fixed prompt set is run against the base model and the fine-tuned model with identical sampling settings and seed (temperature 0.7, top-p 0.9, max 400 tokens, seed 42).
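The harness code is not part of the post. One way to get the same effect is to drive the `mlx_lm.generate` CLI from Python so both models see identical flags. This is a sketch under assumptions: the helper names are hypothetical, and the flag names (`--temp`, `--top-p`, `--max-tokens`, `--seed`) are what I believe current `mlx-lm` releases accept, so check `mlx_lm.generate --help` for your version.

```python
import subprocess

# Sampling settings from the post, shared by both models.
SAMPLING = {"--temp": "0.7", "--top-p": "0.9", "--max-tokens": "400", "--seed": "42"}


def build_generate_cmd(model_path: str, prompt: str) -> list[str]:
    """Build one mlx_lm.generate invocation with the fixed sampling settings."""
    cmd = ["mlx_lm.generate", "--model", model_path, "--prompt", prompt]
    for flag, value in SAMPLING.items():
        cmd += [flag, value]
    return cmd


def run_side_by_side(prompts: list[str], base_path: str, tuned_path: str) -> list[dict]:
    """Run every prompt against both models and collect the raw CLI output."""
    results = []
    for prompt in prompts:
        row = {"prompt": prompt}
        for label, path in (("base", base_path), ("tuned", tuned_path)):
            completed = subprocess.run(
                build_generate_cmd(path, prompt),
                capture_output=True, text=True, check=True,
            )
            row[label] = completed.stdout
        results.append(row)
    return results
```

With the paths used earlier, a call would look like `run_side_by_side(prompts, "./mistral-7b-instruct-v0.3-4bit", "./mistral-7b-v2-fused")`. Holding the seed and sampling flags fixed means differences between the two outputs come from the adapter rather than from sampling noise.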
GGUF Conversion

MLX's own GGUF conversion failed (a known issue), so I used llama.cpp instead:
```
python convert_hf_to_gguf.py ./<dequantized-hf-model> \
  --outfile gguf/mistral-7b-fantasy-v2-f16.gguf --outtype f16
```
```
llama-quantize gguf/mistral-7b-fantasy-v2-f16.gguf \
  gguf/mistral-7b-fantasy-v2-Q4_K_M.gguf Q4_K_M
```

Result: 13,825 MiB (f16) → 4,170 MiB (Q4_K_M), ~4.83 bits/weight.

Tooling installed for this path: `llama.cpp` (via Homebrew, for `convert_hf_to_gguf.py` and the supporting binaries `llama-quantize` / `llama-cli`) and, in the Python venv, `torch`, `gguf`, and `sentencepiece` (required by the conversion script). The 14 GB dequantized intermediate was deleted after conversion.

Lessons Learned: