
MLX Fine-Tune Example Guide

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

A Local MLX Fine-Tuning Experiment

I just finished a local LoRA fine-tune of a 7B instruction model on Apple Silicon via MLX, teaching it a high-fantasy literary register (Gene Wolfe and Tolkien). It's a more rigorous, larger-data version of something I tried two years ago, and I wanted to share some lessons learned. The goal was a visible change in output style, register, and diction, not narrative quality.

Example Output:

Prompt: Write a scene about a woman standing at the edge of the sea, remembering someone she lost, in Tolkien's high fantasy style.
Base Model: In the twilight of a land beyond the mountains, where the sun set in hues of gold and amber, there stood a solitary figure. Eälin, the last of the Elven-kind in the realm of Men, found herself on the edge of the Great Sea, her eyes cast towards the horizon where the sun now rested. A gentle breeze carried the scent of blooming flowers from the Hidden Wood, a memory that brought a wistful smile to her lips. The sea, vast and eternal, mirrored the heavens, its waves whispering tales of ages past. She remembered the day she first saw the sea, the day she met him. Eolir, a Man of the West, who had sailing songs on his lips and laughter in his eyes. They danced under the silver light of Lithui, the Moon, and he had promised her that they would travel the world together, their hearts bound by the love they shared…
Fantasy FT: Then he spoke to her of Beren, and of Luthien, and of all that they had seen and suffered in the land of Mandos, and of the long years in the halls of Aman. She wept, and for a while he was silent, but at last he said:'Now thou art come to the sea, and the sea shall deliver thee to me; and I will send thee back to Middle-earth over the Sea; for the doom of Men, Weary Returner, is ever to come in the end to the Havens, whence they set forth. But first I will show thee the Hidden Kingdom, and the city of Tirion, where the Teleri dwell; for never again will the land of Valinor be seen by the eyes of Elves and Men. And thou shalt go thither, and dwell there for a time, until it is time for thee to make thy ending.' But when Tar-Miriel stood upon the shore of the sea she was filled with great fear, and she turned to Ulmo, and she said:'I do not wish to go to the Havens. There is a great void within me, Lord, and I am afraid. For I am a woman, and the doom of Women is to forget...
Prompt: Write a scene of external action and description in which a lone soldier returns to find his village burned.
Base Model: Title: Echoes of a Forgotten Home In the desolate expanse of a once verdant landscape, the sun hung low, casting long, eerie shadows that danced upon the remnants of a once bustling village. The air was thick with the scent of ashes and smoke, a haunting reminder of the destruction that had transpired. A lone figure, cloaked in the fading light, trudged through the charred remains of what was once home. His boots echoed off the blackened ruins, each step resonating with a heavy heartbeat. The soldier, named Ethan, was a tall, lean man in his late twenties, his face etched with the lines of a thousand battles and the weight of a thousand sorrows...
Fantasy FT: The sentries were all dead, and I feared the worst. As we drew nearer I saw that the main street of the village was choked with timbers, and the Thatched roofs were smoldering. The stench of burning wood and flesh filled the air. There was no one in the street, and there were no cries or barkings of animals. I saw the corpse of a man, half in a pile of debris and half in the street. He had been crushed by a beam that had fallen from one of the houses. I looked at him but did not stop to examine him. I hurried on, calling the name of the Ascian, who had been a good friend of mine in the village. But I heard no answering call, and there was no other sound than the crackling of the burning timbers. I heard the scream of a woman, and I stopped, and listened. It was the scream of the Ascian's wife, and I went running to her. She was half-mad with fear, and she clung to me as though I would protect her from the Ascian's ghost…

My big takeaway is that fine-tuning a small local model is now a few-hour project on a single Mac, offline and at essentially zero marginal cost, with a manageably small fine-tuning data set. To me as an AI researcher, this matches the empirical literature (LIMA, LIMO): very small data sets, if carefully curated for quality and diversity, can powerfully change outputs.

1. Process: quantize base Mistral-7B-Instruct-v0.3 → generate data (locally via LM Studio) → train → evaluate → fuse base + adapter → export to GGUF.

2. Data: ~1,200 examples, <2 epochs

3. Only 0.145% of the weights were trained, yet the model measurably shifted from a generic helpful-assistant voice to a specific literary register (perplexity −35%).

4. Most of the work was data curation: cleaning, chunking on sentence boundaries, prompt generation, and register framing.

Caveat: this is QLoRA on a 4-bit quantized 7B model, so nothing boundary-pushing. However, I think the approach scales if you have more hardware and want to work on larger open-weight models.

Environment:

- Hardware: Apple M2, 64 GB unified memory.

- OS / Python: macOS (Sonoma), Python 3.12.4, virtual environment managed with `uv`.

- Framework: Apple MLX via `mlx-lm` (`mlx_lm.convert`, `mlx_lm.lora`, `mlx_lm.fuse`, `mlx_lm.generate`).

Quantization:

```
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 --q-bits 4 --quantize \
  --mlx-path ./mistral-7b-instruct-v0.3-4bit
```

Data:

1,181 records, split 95/5 (seed 42) into 1,122 train / 59 validation, extracted from Tolkien’s The Silmarillion and Unfinished Tales of Númenor and Middle-earth, and from Gene Wolfe’s The Book of the New Sun tetralogy.

MLX chat format (one JSON object per line):

```json
{"messages": [{"role": "user", "content": "<prompt>"}, {"role": "assistant", "content": "<target-voice passage>"}]}
```
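A minimal sketch of producing files in this format with the 95/5 split described above. The placeholder pairs, the helper names, and the temp-directory output path are all assumptions for illustration; the only details taken from the post are the record shape, the seed, and the split ratio. The `train.jsonl` / `valid.jsonl` file names are the ones `mlx_lm.lora` looks for inside the `--data` directory.

```python
import json
import random
import tempfile
from pathlib import Path


def make_record(prompt: str, passage: str) -> dict:
    """Build one chat-format record: user prompt, assistant target passage."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": passage},
        ]
    }


def split_records(records: list, train_frac: float = 0.95, seed: int = 42):
    """Shuffle deterministically, then cut into train / validation lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]


def write_jsonl(path: Path, records: list) -> None:
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    with path.open("w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


# Placeholder data standing in for the real (prompt, passage) pairs.
pairs = [(f"prompt {i}", f"passage {i}") for i in range(1181)]
records = [make_record(p, t) for p, t in pairs]
train, valid = split_records(records)

# A temp directory here; point this at ./data_v4 for a real run.
out_dir = Path(tempfile.mkdtemp()) / "data_v4"
out_dir.mkdir(parents=True, exist_ok=True)
write_jsonl(out_dir / "train.jsonl", train)
write_jsonl(out_dir / "valid.jsonl", valid)
print(len(train), len(valid))  # 1122 59
```

Rounding (rather than truncating) the 95% cut is what yields 1,122 / 59 from 1,181 records.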

Training used `--mask-prompt`, so loss is computed only on the assistant completion (the target voice), never on the prompt.
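To make the effect of prompt masking concrete, here is a toy illustration with made-up per-token losses. It is a conceptual sketch only, not `mlx_lm`'s internal implementation: the function name and numbers are hypothetical.

```python
def masked_mean_loss(token_losses, completion_mask):
    """Average per-token loss over completion tokens only (mask value 1)."""
    kept = [loss for loss, keep in zip(token_losses, completion_mask) if keep]
    if not kept:
        raise ValueError("no completion tokens to score")
    return sum(kept) / len(kept)


# Toy sequence: 3 prompt tokens followed by 4 completion tokens.
token_losses = [5.0, 4.0, 6.0, 2.0, 1.0, 3.0, 2.0]
completion_mask = [0, 0, 0, 1, 1, 1, 1]

masked = masked_mean_loss(token_losses, completion_mask)
unmasked = sum(token_losses) / len(token_losses)
print(masked)              # 2.0
print(round(unmasked, 2))  # 3.29
```

The prompt tokens still sit in the context the model reads; they just contribute nothing to the gradient.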

Data cleaning and prompt generation were done with Python scripts:

  • Removed numerals and form-feed page breaks, rejoined hyphen-split words, and reflowed PDF line-wraps into continuous prose.
  • Trimmed non-narrative front/back matter (author bio, ebook proofing notes, the translator's measurement appendix), fixed an OCR error where the pronoun "I" had been read as the digit "1", removed chapter headings, and preserved em-dashes and ellipses (core to Wolfe's style).
  • On an early 659-sample version of the data, trimmed each existing completion to whole-sentence boundaries.
  • Sentence-aware chunking: cleaned text was segmented with `nltk` (Punkt) and accumulated into ~500-word chunks on whole-sentence boundaries.
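The cleaning and chunking steps above could be sketched roughly as follows. This is not the original script: the regexes are heuristic guesses at each step, and `split_sentences` is a naive regex stand-in for the `nltk` Punkt tokenizer so that the example runs without downloading tokenizer data.

```python
import re


def clean_text(raw: str) -> str:
    """Apply a few of the listed cleaning steps to extracted text (heuristics)."""
    text = raw.replace("\f", "\n")                  # drop form-feed page breaks
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)    # rejoin hyphen-split words
    text = re.sub(r"\b1\b(?=\s+[a-z])", "I", text)  # OCR: digit 1 read for pronoun I
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)    # reflow single line-wraps
    return re.sub(r"[ \t]+", " ", text).strip()


def split_sentences(text: str) -> list:
    """Naive stand-in for nltk's Punkt tokenizer (an assumption, see lead-in)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def chunk_sentences(sentences, target_words: int = 500) -> list:
    """Accumulate whole sentences into chunks of roughly target_words words."""
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        # Close the current chunk before it would exceed the target.
        if current and count + n > target_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


raw = "It was a long jour-\nney and 1 was\ntired.\fThe end."
cleaned = clean_text(raw)
print(cleaned)  # It was a long journey and I was tired. The end.
print(chunk_sentences(split_sentences(cleaned), target_words=5))
```

Two known limits of the heuristics: the hyphen rule also merges genuine compounds that happen to break at a line end, and the "1" rule would misfire on a real numeral followed by a lowercase word. With `nltk` installed, `nltk.sent_tokenize` can replace `split_sentences`.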

Prompt generation: each cleaned, chunked passage was passed to Mistral Small 24B Instruct running in LM Studio, with an instruction to write a prompt backwards from the passage, always beginning "Write a story section…".
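A rough sketch of what the prompt-generation call might look like against LM Studio's OpenAI-compatible local server. Everything specific here is an assumption: the port is LM Studio's default, the model identifier is a placeholder, and the instruction wording is invented for illustration (the post only says the prompt is written backwards from the passage and begins "Write a story section…").

```python
import json
import urllib.request

# Assumptions: LM Studio's local server on its default port, and a
# placeholder model identifier; use whatever your server reports.
API_URL = "http://localhost:1234/v1/chat/completions"
MODEL_ID = "mistral-small-24b-instruct"  # hypothetical name

# Hypothetical instruction text, not the wording used in the post.
INSTRUCTION = (
    "Read the passage below and write the single prompt that could have "
    "produced it. The prompt must begin with 'Write a story section'."
)


def build_payload(passage: str) -> dict:
    """Assemble the chat-completions request body for one passage."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": INSTRUCTION},
            {"role": "user", "content": passage},
        ],
    }


def request_prompt(passage: str) -> str:
    """POST one passage to the local server and return the generated prompt."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(passage)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()


def is_valid_prompt(prompt: str) -> bool:
    """Reject generations that ignore the required opening."""
    return prompt.startswith("Write a story section")
```

Checking the opening with `is_valid_prompt` before writing a record is a cheap way to catch generations that drift from the template.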

MLX Training Framework & Arguments

LoRA (QLoRA-style, adapter over the 4-bit base) trained with `mlx_lm.lora`:

```
mlx_lm.lora \
  --model ./mistral-7b-instruct-v0.3-4bit \
  --train --data ./data_v4 --mask-prompt \
  --fine-tune-type lora --num-layers 16 \
  --batch-size 2 --iters 1000 \
  --max-seq-length 2048 --learning-rate 1e-5 \
  --steps-per-report 10 --steps-per-eval 100 --val-batches -1 \
  --save-every 100 --adapter-path ./adapters_v2 \
  --grad-checkpoint --seed 42
```

- Fine-tune type: LoRA over the top 16 transformer layers.

- Trainable parameters: 10.49 M of 7.25 B (0.145%); the 4-bit base stays frozen.

- Optimizer / LR: Adam, learning rate `1e-5` (constant).

- Sequence length: 2048 tokens; gradient checkpointing enabled for memory.

- Validation: full validation set (`--val-batches -1`) every 100 iterations; adapter checkpoints saved every 100 iterations.

- Seed: 42 (reproducible).
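As a sanity check on the trainable-parameter figure, the count can be reproduced from layer shapes. The post does not state the LoRA rank or which modules were adapted, so rank 8 on all seven linear projections per layer is an assumption here; the layer dimensions are those of the published Mistral-7B config. Under those assumptions the arithmetic lands on the reported 10.49 M.

```python
# Mistral-7B shapes (from the published model config).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # 8 key/value heads x head dimension 128

# Assumptions: LoRA rank 8 on every linear projection of each adapted layer.
RANK = 8
NUM_LAYERS = 16  # matches --num-layers 16

# (input dim, output dim) of each projection in one transformer block.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA pair adds a (d_in x rank) and a (rank x d_out) matrix."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
trainable = per_layer * NUM_LAYERS

print(f"{trainable:,}")              # 10,485,760
print(f"{trainable / 7.25e9:.3%}")   # 0.145%
```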

After training, the adapter was fused into a single standalone deployable model:

```
mlx_lm.fuse --model ./mistral-7b-instruct-v0.3-4bit \
  --adapter-path ./adapters_v2 --save-path ./mistral-7b-v2-fused
```

Batch Size & Epochs

- Batch size: 2 sequences per step.

- Iterations: 1,000.

- Epochs: 1,000 iters × 2 sequences = 2,000 sample-passes over 1,122 training examples ≈ 1.78 epochs.

- One epoch ≈ 561 iterations (1,122 ÷ 2).
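The same arithmetic as a few lines of Python, handy for re-deriving `--iters` when the corpus size or batch size changes (variable names are illustrative):

```python
train_examples = 1122
batch_size = 2
iters = 1000

iters_per_epoch = train_examples / batch_size
epochs = iters * batch_size / train_examples

print(iters_per_epoch)   # 561.0
print(round(epochs, 2))  # 1.78
```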

Training Run

The run completed 1,000 iterations in ~3 h 43 m of compute (≈11 s/iter plus full-validation passes every 100 iters), at a peak of 6.4 GB memory. Training loss fell from 2.50 → 1.95 and held-out validation loss from 2.81 → 2.36, with the validation minimum (2.355) at iteration 500 and essentially flat thereafter. In other words, the model kept fitting the training data through the second epoch without *over*fitting on held-out prose. The visible step-down in training loss near iteration 560 corresponds to the start of the second epoch (one full pass ≈ 561 iters), where the adapter begins fitting the corpus more tightly.

https://preview.redd.it/ddaq5ubv34ah1.png?width=1170&format=png&auto=webp&s=34c76c43bf8040e34c152603f57218ffbaf75f5e

Evaluation Harness

The same fixed prompt set is run against the base model and the fine-tuned model with identical sampling settings and seed (temperature 0.7, top-p 0.9, max 400 tokens, seed 42).

  1. Human side-by-side — base vs. fine-tuned outputs on every prompt, for direct reading.

  2. Perplexity change: −34.7%

  3. Stylometry: counts of concrete "voice" markers.
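For reference, perplexity is the exponential of the mean per-token negative log-likelihood, so a relative perplexity change can be computed from two mean losses. The sketch below plugs in the validation losses from the training run (2.81 before, 2.36 after) purely as example inputs. The result is in the same neighborhood as the −34.7% reported above but not identical, since the post does not say which text or checkpoint that figure was measured on.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity from mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


def relative_change(before: float, after: float) -> float:
    """Signed fractional change from before to after."""
    return (after - before) / before


base_loss, tuned_loss = 2.81, 2.36  # example inputs, see lead-in
change = relative_change(perplexity(base_loss), perplexity(tuned_loss))
print(f"{change:+.1%}")  # -36.2%
```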

GGUF Conversion

MLX's GGUF conversion failed (a known issue), so I used llama.cpp instead:

  1. Dequantize the fused model to HF f16 (MLX).

  2. Convert HF → f16 GGUF with llama.cpp's `convert_hf_to_gguf.py`:

```
python convert_hf_to_gguf.py ./<dequantized-hf-model> \
  --outfile gguf/mistral-7b-fantasy-v2-f16.gguf --outtype f16
```

  3. Quantize f16 GGUF → Q4_K_M with llama.cpp's `llama-quantize`:

```
llama-quantize gguf/mistral-7b-fantasy-v2-f16.gguf \
  gguf/mistral-7b-fantasy-v2-Q4_K_M.gguf Q4_K_M
```

Result: 13,825 MiB (f16) → 4,170 MiB (Q4_K_M), ~4.83 bits/weight.
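The bits-per-weight figure follows directly from the two file sizes, treating the f16 file as 16 bits per weight and ignoring metadata overhead (a simplification):

```python
f16_mib = 13825
q4_mib = 4170

bits_per_weight = 16 * q4_mib / f16_mib
size_ratio = f16_mib / q4_mib

print(round(bits_per_weight, 2))  # 4.83
print(round(size_ratio, 2))       # 3.32
```

Q4_K_M comes out above a flat 4 bits because the format stores per-block scales and keeps some tensors at higher precision.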

Tooling installed for this path: `llama.cpp` (via Homebrew, for `convert_hf_to_gguf.py` and the support binaries `llama-quantize` / `llama-cli`) and, in the Python venv, `torch`, `gguf`, and `sentencepiece` (required by the conversion script). The ~14 GB dequantized intermediate was deleted after conversion.

Lessons Learned:

  • MLX works for fine-tuning and fusing, but native GGUF export is unreliable. Use llama.cpp (`convert_hf_to_gguf.py` + `llama-quantize`) instead, bridging via `--dequantize` to HF f16.
  • `--mask-prompt` masks the loss, not the conditioning. The model still sees the prompt, so it still learns associations between prompt wording and output style.
  • Uniform prompts create brittle phrasing-to-output coupling. Vary the prompt format (order, diction, style) to make the fine-tuned behavior more robust.
  • For data creation, chunk on sentence boundaries rather than with naïve methods like a fixed word count (e.g. use nltk sentence-aware chunking).
  • Output quality isn't exactly correlated with loss. My lowest-perplexity checkpoint generated worse prose than a later one.
  • More data reduced overfitting. Roughly 2× the corpus let v2 train past one epoch with flat validation loss; v1 started overfitting after ~1 epoch.
  • QLoRA on a 4-bit base caps quality at roughly 4-bit levels downstream. You could fine-tune from an fp16 or 8-bit base for more quality.
  • Wrap long unattended runs in `caffeinate`. I lost ~5.5 h of compute to sleep on an overnight run.
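On the last point, one way to make the wrapper hard to forget is to build it into whatever launches training. A small sketch, assuming macOS's `caffeinate` with its `-i` flag (prevent idle sleep while the wrapped command runs); the helper names and the abbreviated argument list are illustrative.

```python
import shlex
import subprocess


def caffeinated(cmd: list) -> list:
    """Prefix a command so macOS will not idle-sleep while it runs."""
    return ["caffeinate", "-i", *cmd]


def run(cmd: list) -> int:
    """Run the command, raising if it exits non-zero."""
    return subprocess.run(cmd, check=True).returncode


# Abbreviated training command; add the remaining flags for a real run.
train_cmd = [
    "mlx_lm.lora", "--model", "./mistral-7b-instruct-v0.3-4bit",
    "--train", "--data", "./data_v4", "--iters", "1000",
]
full_cmd = caffeinated(train_cmd)
print(shlex.join(full_cmd))
# To launch: run(full_cmd)
```

From a shell, the equivalent is simply prefixing the training command with `caffeinate -i`.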

submitted by /u/Mbando