losing my mind fine-tuning jina-v5 for a legal corpus
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
For the last month I've been trying to fine-tune jina-v5 (which performed best on my corpus out of the box) on Slovak law chunks. Time and time again, no matter what I do, I can't get the model to learn the nuances of Slovak syntax.
Here's the biggest trap chunk that keeps confusing my models, with my translation of the query:
Query: "krádež cigariet" = theft of cigarettes
Chunk: Podľa § 60 ods. 1 písm. a/ Tr. zák. súd obvinenému ukladá trest prepadnutia vecí a to: 1000 ks cigariet zn. Marlboro gold, 400 ks cigariet zn. Rothmans modré, 1000 ks cigariet zn. Rothmans červené, 400 ks cigariet zn. Bond modré, 200 ks cigariet zn. Parliament modré v celkovom množstve 3000 ks cigariet, všetky o dĺžke tabakového povrazca do 80 mm vrátane, bez platnej slovenskej kontrolnej známky. Podľa § 60 ods. 5 Tr. zák. vlastníkom prepadnutých vecí sa stáva štát. Poučenie:
You can translate it into your own language, but essentially it says "according to paragraph 60, the court is giving a punishment of 'prepadnutie'", which is an ambiguous word that could mean mugging, forfeiture, or confiscation.
This example has been breaking every single model. It looks ambiguous, but after a thorough read you can clearly tell it's not about theft or mugging. Still, all of my fine-tunes consistently rank it high for this query, higher than base jina does.
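One way to put a number on "ranks it higher than base jina" is to compute the rank of the trap chunk for the query under each model and compare. The following is only a minimal sketch: the function names `cosine` and `rank_of`, the chunk ids, and the toy 2-d vectors are all hypothetical placeholders, and it assumes the embeddings were already computed elsewhere and are compared by cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_of(query_vec, chunk_vecs, target_id):
    """1-based rank of target_id when chunks are sorted by similarity to the query.

    chunk_vecs maps a chunk id to its embedding (hypothetical data layout).
    """
    scores = {cid: cosine(query_vec, vec) for cid, vec in chunk_vecs.items()}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(target_id) + 1


# Toy vectors standing in for real embeddings of the query and three chunks.
query = [1.0, 0.0]
chunks = {
    "real_theft_case": [1.0, 0.1],
    "forfeiture_trap": [0.5, 0.5],
    "unrelated": [0.0, 1.0],
}
trap_rank = rank_of(query, chunks, "forfeiture_trap")
```

Running the same check with embeddings from the base model and from each fine-tune, over a handful of known trap chunks, would show whether a given run moved them up or down.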
I know there are a lot of moving parts and context needed to answer this question, so I will just focus on my latest run.
> I used an LLM to generate queries based on source chunks (varied personas, broad short queries and long paraphrased queries [all sorts of combinations at this point])
> I used base jina to grab the top 50 results from my corpus of judicial data and legislation, and I injected the source chunk plus its similar siblings (I also did a run without injecting; it still sucked)
> then I used qwen/qwen3.5-397b-a17b to logit-mine relevance, basically "is chunk relevant, answer only yes/no", and then mined the probability for "yes". Humans and stronger AIs all agreed that qwen's ranking is actually good, except for some rare cases (it clearly identified this chunk as NOT being theft, correctly giving it a low ranking)
> then I ran jina v5 LoRA fine-tuning on the retrieval adapter (at least that's what claude opus told me xd) with these parameters:
| param | value |
|---|---|
| base model | jinaai/jina-embeddings-v5-text-small (1024-dim, last-token pooling) |
| what's trained | built-in retrieval LoRA only — r=32, α=32, dropout=0.1, targets q/k/v/o/gate/up/down_proj |
| trainable params | 20,185,088 / 676,790,272 = 2.98% |
| loss | MarginMSELoss (margin = teacher rel(pos) − rel(neg)); no Matryoshka |
| LR | 5e-6, linear schedule, warmup_ratio 0.05 |
| epochs | 1 |
| batch | per-device 8 × grad-accum 2 = effective 16 |
| precision | bf16, gradient_checkpointing off |
| max_seq_length | 2048 (v4 was 512) |
| optimizer | AdamW (HF default), seed 42, val_frac 0.03 |
| data | 46,001 MarginMSE triples from 2,174 Qwen-distilled queries → 44,621 train / 1,380 val → 2,789 steps |
| pair-mining | top-5 pos × bottom-5 neg per query, min-margin 0.2, ≤40 pairs/query, pos≥0.5 / neg≤0.3 |
| hardware | RTX PRO 6000 Blackwell 96GB, torch 2.11+cu128, ~74 min |
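
The labeling, pair-mining, and loss rows above can be sketched roughly as follows. This is not the code from the training scripts, only one plausible reading of the table: `yes_probability` assumes the "yes" probability is renormalized over just the yes/no token logits, `mine_pairs` assumes "top-5 pos × bottom-5 neg" means a cross product filtered by the listed thresholds, and `margin_mse` is the standard MarginMSE objective written out in plain Python. All function names and the toy scores are hypothetical.

```python
import math
from itertools import product


def yes_probability(logit_yes, logit_no):
    """Softmax restricted to the 'yes' and 'no' token logits (assumed setup)."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def mine_pairs(scored, top_k=5, bottom_k=5, pos_min=0.5, neg_max=0.3,
               min_margin=0.2, max_pairs=40):
    """Build (pos_id, neg_id, teacher_margin) triples for one query.

    scored is a list of (chunk_id, teacher_score) tuples; defaults follow
    the pair-mining row of the table.
    """
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    positives = [c for c in ranked[:top_k] if c[1] >= pos_min]
    negatives = [c for c in ranked[-bottom_k:] if c[1] <= neg_max]
    pairs = []
    for (pos_id, pos_score), (neg_id, neg_score) in product(positives, negatives):
        margin = pos_score - neg_score
        if margin >= min_margin:
            pairs.append((pos_id, neg_id, margin))
    return pairs[:max_pairs]


def margin_mse(student_pos, student_neg, teacher_margin):
    """Mean squared gap between student margins and teacher margins."""
    errors = [((sp - sn) - tm) ** 2
              for sp, sn, tm in zip(student_pos, student_neg, teacher_margin)]
    return sum(errors) / len(errors)


# Toy teacher scores for one query (made-up values).
scored = [("a", 0.9), ("b", 0.6), ("c", 0.4), ("d", 0.2), ("e", 0.05)]
pairs = mine_pairs(scored)
loss = margin_mse([1.0, 0.5], [0.0, 0.5], [0.5, 0.0])
```

Under this reading, a chunk like the "prepadnutie" one only shapes training if it lands among the five lowest-scored candidates for its query; a hard negative that the teacher scores low but that sits mid-list is never paired. Whether the actual scripts behave this way is not confirmed here.
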
If anyone is as invested in this as I am, here are the scripts I used for training:
finetune_jina.py
prepare_pairs.py
All models do get better at Slovak law, but they still fail these simple logical problems. I've also tried fine-tuning the qwen 8b reranker in an effort to distill it into a bi-encoder later, but that failed too: qwen made the same mistakes on the "prepadnutie" case.
I would be really thankful if someone highly skilled in this could eyeball this set-up and let me know whether there's some architectural flaw, or whether I should focus on looking for bugs in the code.
Thank you very much!