Has anyone experimented with stabilizing low quant models with lower temp and top p?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I was thinking about trying some bigger models on my 80GB VRAM setup, but every MoE model is too slow with CPU offload. Beyond that, there aren't many models purpose-built for 80GB of VRAM, so most of the bigger ones require a heavily quantized version. While looking at some benchmarks run at the same top p, I realized there might be something that can be done here, but I haven't seen anyone post about it recently. Playing with some LLM sampling visualization tools suggests it might be possible to reduce some of the wild outputs by lowering temp and top p. I'll be trying it this evening.
Tool example (not mine): https://artefact2.github.io/llm-sampling/index.xhtml
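For anyone who wants to poke at the idea without a model loaded, here is a rough sketch of what temp and top p do to a next-token distribution. It is not taken from any particular inference engine: the function names and the toy logits are made up for illustration, and real backends may apply samplers in a different order or combine them with others (top k, min p, and so on). It assumes temperature is applied before top p.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then convert them to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def candidate_distribution(logits, temperature, top_p):
    """Assumed order: temperature first, then top p."""
    return top_p_filter(softmax_with_temperature(logits, temperature), top_p)


# Hypothetical logits: two plausible tokens followed by a long "wild" tail.
toy_logits = [4.0, 3.5, 2.0, 1.0, 0.5, 0.0]

for temperature, top_p in [(1.0, 1.0), (0.7, 0.9), (0.4, 0.8)]:
    dist = candidate_distribution(toy_logits, temperature, top_p)
    survivors = sum(1 for p in dist if p > 0)
    print(f"temp={temperature} top_p={top_p}: "
          f"{survivors} candidate tokens, top token p={max(dist):.3f}")
```

The point is just that lowering either setting shrinks the set of tokens that can be sampled at all, which is where the wild outputs would come from if quantization noise is nudging tail tokens upward. Whether that actually helps heavily quantized models is the part that still needs testing.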