trained a prompt injection detector using ml-intern and DeepSeek v4 Flash, runs in the browser
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
| Trained a prompt injection classifier using https://huggingface.co/spaces/av-codes/prompt-injection-detector --- I've been interested in prompt injections and agentic security for a while, and wanted to see how a purpose-built ML agent compares to general-purpose coding agents for this kind of task. Here's roughly how it went:
For v1, I went with DistilBERT targeting CPU inference. After a few parameter sweeps, the agent launched a full run and landed at F1 95.87%. I also tried training an HRM-Text model, but the agent didn't figure it out and set up a TRM run instead (different architecture, no positional encoding). When I steered it back to HRM with the correct paper, the training script wasn't optimized for my hardware. I spent $20 on HF remote training with a T4, but it fumbled after epoch 1 because agent didn't follow training routine from the paper and used wrong optimiser/params leading to params blowing up. For v2, I found a larger synthetic dataset from Bordair and re-trained the DistilBERT. That's the model in the Space above. What surprised me:
The obvious gap: the synthetic dataset means the train/test splits might be too similar. Not a proper scientific approach, but it's the most pleasant ML experience I've had with an agentic tool so far. The HRM run is still pending. I'm curious to learn about other people's experiences with these tools. Thank you! [link] [comments] |
More from r/LocalLLaMA
-
How small can the orchestration model in an agent be? (separating it from code-gen — that obviously wants a big model)
May 22
-
BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline.
May 22
-
ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop
May 22
-
Experts first llama.cpp
May 22
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.