r/MachineLearning · May 23, 2026 · 1 min read

Per-pixel bounding-box regression + DBSCAN for handwritten word detection - visual walkthrough of WordDetectorNet [P]

[Figure: overview of the WordDetectorNN architecture.]

Sharing a visual breakdown of WordDetectorNet, Harald Scheidl's handwritten-word detection model. I think the design choice at its core is unusual enough to be worth a closer look - and I haven't seen it written up in detail anywhere else.

The mechanism: Instead of anchor-based detection + NMS, every pixel the network classifies as a "word pixel" also regresses 4 scalar distances (top/right/bottom/left) to the enclosing bounding box. Each word pixel therefore reconstructs one candidate box, producing thousands of overlapping candidates per word. These are then collapsed with DBSCAN using distance = 1 − IoU as the metric, taking the median box per cluster as the final detection.
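
To make the "one candidate box per word pixel" step concrete, here is a minimal NumPy sketch of the decoding. It is not taken from the original repository: the function name, the (top, right, bottom, left) channel order, the pixel units, and the (xmin, ymin, xmax, ymax) box layout are all assumptions for illustration.

```python
import numpy as np

def decode_candidate_boxes(word_mask, distances):
    """Turn per-pixel distance predictions into candidate boxes.

    word_mask: (H, W) bool array, True where a pixel is classified as word.
    distances: (4, H, W) float array of distances to the enclosing box,
               in the ASSUMED channel order top, right, bottom, left.
    Returns an (N, 4) array of boxes as (xmin, ymin, xmax, ymax).
    """
    ys, xs = np.nonzero(word_mask)                    # coordinates of word pixels
    top, right, bottom, left = distances[:, ys, xs]   # each has shape (N,)
    return np.stack([xs - left, ys - top, xs + right, ys + bottom], axis=1)

# Toy example: one word occupying x in [2, 5], y in [3, 4] on an 8x8 map,
# with perfect distance predictions for every pixel.
H, W = 8, 8
grid_y, grid_x = np.mgrid[0:H, 0:W]
xmin, ymin, xmax, ymax = 2, 3, 5, 4
word_mask = (grid_x >= xmin) & (grid_x <= xmax) & (grid_y >= ymin) & (grid_y <= ymax)
distances = np.stack(
    [grid_y - ymin, xmax - grid_x, ymax - grid_y, grid_x - xmin]
).astype(float)

boxes = decode_candidate_boxes(word_mask, distances)
print(boxes.shape)   # one candidate per word pixel
```

With perfect predictions every word pixel reconstructs the same box. A real network's predictions are noisy, which is why the candidates overlap rather than coincide and need to be clustered.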

Architecture: ResNet18 backbone (modified to 1-channel grayscale input, with intermediate features exposed after each residual block) → FPN-style decoder that upscales and concatenates features at all scales → head producing 6 output channels per pixel (2 segmentation logits + 4 distance values). Loss = cross-entropy + IoU, equally weighted. Trained on IAM with 448×448 inputs → 224×224 outputs.
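
As a rough illustration of "cross-entropy + IoU, equally weighted", the sketch below computes such a loss value in plain NumPy (forward pass only, no gradients). Several things here are assumptions rather than details from the post: the IoU term is written as 1 − IoU averaged over word pixels, the distance channels are ordered top, right, bottom, left, and all function names are made up for the example.

```python
import numpy as np

def segmentation_cross_entropy(logits, target):
    """Mean per-pixel cross-entropy. logits: (2, H, W); target: (H, W) ints in {0, 1}."""
    shifted = logits - logits.max(axis=0, keepdims=True)      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    h, w = target.shape
    picked = log_probs[target, np.arange(h)[:, None], np.arange(w)[None, :]]
    return -picked.mean()

def distance_iou(pred, true):
    """IoU of the boxes implied by two (4, N) distance sets (top, right, bottom, left).

    Both boxes contain the same pixel, so the overlap along each axis is the
    sum of the element-wise minimum distances on either side of that pixel.
    """
    pt, pr, pb, pl = pred
    tt, tr, tb, tl = true
    inter = (np.minimum(pl, tl) + np.minimum(pr, tr)) * (np.minimum(pt, tt) + np.minimum(pb, tb))
    union = (pl + pr) * (pt + pb) + (tl + tr) * (tt + tb) - inter
    return inter / np.maximum(union, 1e-9)

def total_loss(logits, pred_dist, target_mask, true_dist):
    """Equally weighted sum; the IoU term only looks at ground-truth word pixels."""
    ce = segmentation_cross_entropy(logits, target_mask)
    word = target_mask.astype(bool)
    iou = distance_iou(pred_dist[:, word], true_dist[:, word])
    return ce + (1.0 - iou).mean()   # ASSUMED form of the IoU loss

# Toy check on a 4x4 map with a 2x2 word region.
target_mask = np.zeros((4, 4), dtype=int)
target_mask[1:3, 1:3] = 1
logits = np.stack([20.0 * (1 - target_mask), 20.0 * target_mask])  # confident and correct
true_dist = np.ones((4, 4, 4))                                      # 2x2 box around each pixel

loss_perfect = total_loss(logits, true_dist, target_mask, true_dist)
loss_shrunk = total_loss(logits, 0.5 * true_dist, target_mask, true_dist)
print(loss_perfect, loss_shrunk)
```

Halving every predicted distance shrinks the implied box to a quarter of the true area, so the IoU term dominates the second value while the first stays near zero.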

What I find interesting about the design:

The per-pixel distance regression removes anchor design and NMS thresholds from the list of things to tune.
The 1 − IoU distance for DBSCAN is conceptually clean: spatially-overlapping candidates cluster together by construction.
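
The clustering idea can be sketched in a few lines of NumPy. To keep the example dependency-free it uses a small hand-written DBSCAN on a precomputed distance matrix; in practice a library implementation such as scikit-learn's `DBSCAN` with `metric="precomputed"` would be the natural choice. The `eps` and `min_samples` values and the synthetic boxes are illustrative assumptions, not settings from the original project.

```python
import numpy as np

def iou_distance_matrix(boxes):
    """Pairwise 1 - IoU for (N, 4) boxes given as (xmin, ymin, xmax, ymax)."""
    x0 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y0 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x1 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y1 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return 1.0 - inter / np.maximum(union, 1e-9)

def dbscan_precomputed(dist, eps, min_samples):
    """Minimal DBSCAN on a precomputed distance matrix; label -1 marks noise."""
    n = len(dist)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]  # includes i itself
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue                      # already assigned, or cannot seed a cluster
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            if not is_core[j]:
                continue                  # border points join but do not expand
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster
                    stack.append(k)
        cluster += 1
    return labels

def cluster_medians(boxes, labels):
    """Coordinate-wise median box of every cluster (noise is ignored)."""
    return np.array([np.median(boxes[labels == c], axis=0) for c in range(labels.max() + 1)])

# Two synthetic "words", each with 50 jittered candidate boxes.
rng = np.random.default_rng(0)
word_a = np.array([10.0, 10.0, 60.0, 30.0]) + rng.normal(0, 0.5, size=(50, 4))
word_b = np.array([80.0, 12.0, 140.0, 32.0]) + rng.normal(0, 0.5, size=(50, 4))
boxes = np.vstack([word_a, word_b])

labels = dbscan_precomputed(iou_distance_matrix(boxes), eps=0.3, min_samples=5)
detections = cluster_medians(boxes, labels)
print(detections.round(1))
```

Candidates for the same word overlap heavily and so sit at small 1 − IoU distances, while candidates for different, non-overlapping words are at distance exactly 1. That is the sense in which the clusters form "by construction".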

What I don't like about the design:

The pairwise IoU distance matrix is O(n²) in the number of candidate boxes, and this is genuinely the runtime bottleneck in practice (not the forward pass).
The clustering step is not differentiable, so it blocks end-to-end training; hyperparameters such as DBSCAN's eps have to be set manually.
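
To put the O(n²) point into numbers, this back-of-the-envelope sketch counts the entries of a dense pairwise distance matrix. The 4-byte (float32) entry size is an assumption; the largest case corresponds to every pixel of the 224×224 output map being a word pixel, which is an upper bound rather than a typical page.

```python
def distance_matrix_cost(n_candidates, bytes_per_entry=4):
    """Entries and bytes of a dense n x n distance matrix (float32 assumed)."""
    entries = n_candidates * n_candidates
    return entries, entries * bytes_per_entry

# 50_176 = 224 * 224, i.e. one candidate box per output pixel.
for n in (1_000, 10_000, 50_176):
    entries, n_bytes = distance_matrix_cost(n)
    print(f"{n:>6} candidates -> {entries:>13,} entries, {n_bytes / 2**20:,.0f} MiB")
```

Going from 1,000 to 10,000 candidates multiplies the work and memory by 100, which is consistent with the clustering step, not the forward pass, dominating runtime.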

Full visual write-up with figures (one per pipeline stage + an architecture diagram): https://lellep.xyz/blog/worddetectornet-visually-explained.html

Credit where credit is due: the original architecture is by Harald Scheidl, see https://github.com/githubharald/WordDetectorNN

submitted by /u/martin_lellep