r/LocalLLaMA · 1 min read

Low-level coding dataset


Hi all,

I've recently been thinking about putting together a community-sourced coding dataset for finetuning models, with a heavy focus on C++ and systems programming.

My goal is to eventually have a model (say, a finetune of Qwen3.6-27b) that is good at things like memory ownership, thread safety, and optimization concepts. Right now I feel like the coding knowledge of most locally runnable models is concentrated in high-level languages like Python and JavaScript.

Since I'm still learning about finetuning and what does and doesn't work, I figured I'd ask here for help with the structure of the dataset. Right now I'm thinking of a JSONL file with categories like this:

- generation: basic prompt/code output
- optimization: here's slow/bloated code, make it better
- debugging: I'm getting this error, please fix it
- organization: code review, interface design, restructuring, tradeoff decisions
- tool_calling: exercises involving tool use and interpreting results
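
To make the proposed layout concrete, here is a minimal sketch of what one row and a round trip through JSONL could look like. The category names come from the list above; the field names (`category`, `language`, `prompt`, `response`) and the helper functions are hypothetical, picked only for illustration, and are not an established schema.

```python
import json

# Category names from the list above. Everything else here is an assumption.
CATEGORIES = {"generation", "optimization", "debugging", "organization", "tool_calling"}


def make_record(category, prompt, response, language="cpp"):
    """Build one dataset row (hypothetical fields), rejecting unknown categories."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return {
        "category": category,
        "language": language,
        "prompt": prompt,
        "response": response,
    }


def to_jsonl(records):
    """Serialize rows as JSONL: one JSON object per line."""
    # json.dumps escapes embedded newlines, so multi-line code stays on one line.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def from_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


example = make_record(
    "debugging",
    "Why does this return garbage?\n\n"
    'std::string_view name() { std::string s = "tmp"; return s; }',
    "The function returns a view into a local std::string that is destroyed "
    "when the function returns, so the view dangles. Return std::string by value.",
)
```

Keeping the category as an explicit field, rather than splitting into one file per category, makes it easy to filter or reweight categories later without restructuring the data.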

Does anyone with experience in this sort of thing have any pointers? For example, I'm not sure we even need to further tune models on tool calling, since they all seem pretty good at it already. Would including it muddy the dataset and limit gains in the other categories?
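
One way to approach the tool-calling question empirically is to build two variants of the dataset, with and without that category, and compare finetunes trained on each. The sketch below shows only the data-side step. It assumes each JSONL row carries a `category` field (a hypothetical field name), and the function names are made up for illustration.

```python
import json
from collections import Counter


def category_counts(jsonl_text):
    """Count rows per category so the mix is visible before training."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return Counter(row["category"] for row in rows)


def drop_category(jsonl_text, category):
    """Return JSONL text with every row of the given category removed."""
    kept = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        if json.loads(line)["category"] != category:
            kept.append(line)
    return "\n".join(kept)
```

Training once on the full file and once on the output of `drop_category(text, "tool_calling")` would give a direct comparison, though the result will also depend on the base model and the training setup.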

Thanks in advance for all the help!

submitted by /u/True_Tangerine_4706