Low-level coding dataset
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hi all,
I've recently been thinking about putting together a community-sourced coding dataset for finetuning models, with a heavy focus on C++ and systems programming.
My goal is to eventually have a model (say a finetune of Qwen3.6-27b) that is good at stuff like memory ownership, thread safety, optimization concepts, etc. Right now I feel like the coding knowledge of most locally runnable models is restricted to high-level languages like Python and JavaScript.
Since I'm still learning about finetuning and what does/doesn't work, I figured I'd ask in here for help with the structure of the dataset. Right now I'm thinking a jsonl file with categories like this:
- generation: basic prompt/code output
- optimization: here's slow/bloated code, make it better
- debugging: I'm getting this error, please fix
- organization: code review, interface design, restructuring, tradeoff decisions
- tool_calling: exercises involving tool use and interpreting results
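To make the structure concrete, here is a minimal sketch of what one record per line could look like. Only the category names come from the list above; the field names (`prompt`, `response`, `language`) and the helper functions are assumptions for illustration, not a settled schema.

```python
import json
from dataclasses import dataclass, asdict

# Category names taken from the list above.
CATEGORIES = {"generation", "optimization", "debugging", "organization", "tool_calling"}

@dataclass
class Record:
    category: str          # one of CATEGORIES
    prompt: str            # hypothetical field: the task given to the model
    response: str          # hypothetical field: the target answer
    language: str = "cpp"  # hypothetical field: lets you filter by language later

    def __post_init__(self):
        # Reject typos in the category early, before they end up in the file.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")

def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(asdict(r)) for r in records)

def from_jsonl(text):
    """Parse JSONL text back into validated records."""
    return [Record(**json.loads(line)) for line in text.splitlines() if line.strip()]

def category_counts(records):
    """Count records per category, to keep an eye on the dataset mix."""
    counts = {}
    for r in records:
        counts[r.category] = counts.get(r.category, 0) + 1
    return counts

sample = [
    Record(
        "debugging",
        "Why does this crash?\nstd::vector<int> v; v[0] = 1;",
        "v is empty, so v[0] is out of bounds; use v.push_back(1).",
    ),
    Record(
        "generation",
        "Write a function that swaps two ints by reference.",
        "void swap_ints(int& a, int& b) { int t = a; a = b; b = t; }",
    ),
]

# json.dumps escapes embedded newlines, so each record stays on one line.
text = to_jsonl(sample)
print(text)
print(category_counts(from_jsonl(text)))
```

Keeping the category as an explicit field means the mix can be checked or rebalanced later (for example, dropping or downweighting one category) without restructuring the file.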
Anyone with experience in this sort of thing have any pointers? (For example, I'm not sure if we even need to further tune models on tool calling, since they all seem pretty good at it. Would including it muddy the dataset and limit gains in the other categories?)
Thanks in advance for all the help!