Training a number-aware embedding model + Text JEPA doesn't work too well + Text auto-encoders have a strange frequency bias [R][P]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
Hi guys!
I spent a year trying to predict company growth from the full text of companies' 10-K filings.
It completely failed.
But I've had a lot of fun playing with encoder transformers and making them good at numbers (bypassing the tokenizer/prediction head for numbers). I've MLM-trained a modified ModernBERT for this and it works really well. The model is available on HF: https://huggingface.co/edereynal/financial_bert
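The post doesn't spell out how numbers bypass the tokenizer, so here is only a rough sketch of the general idea, not the actual financial_bert code. It assumes each number is replaced by a single placeholder token and that its value is passed to the model separately as a small feature vector. The `[NUM]` token, the regex, and the sign/log-magnitude features are all hypothetical choices made for illustration.

```python
import math
import re

# Hypothetical number pattern: optional minus sign, optional thousands
# separators, optional decimal part. A real pipeline would need more care
# (percentages, form names like "10-K", dates, and so on).
NUM_RE = re.compile(r"(?<![\w.])-?\d+(?:,\d{3})*(?:\.\d+)?(?!\w)")


def split_numbers(text):
    """Replace each number with a [NUM] placeholder and return the values separately."""
    values = []

    def _sub(match):
        values.append(float(match.group(0).replace(",", "")))
        return "[NUM]"

    return NUM_RE.sub(_sub, text), values


def number_features(x):
    """Encode a scalar as (sign, log10(1 + |x|)) so magnitudes stay comparable."""
    sign = 0.0 if x == 0 else math.copysign(1.0, x)
    return (sign, math.log10(1.0 + abs(x)))


masked, values = split_numbers("Revenue grew to 1,250.5 million from 980 million.")
features = [number_features(v) for v in values]
```

In a setup like this, the features would be projected into the embedding space at each `[NUM]` position, and a regression head would predict the value at masked positions instead of a vocabulary softmax.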
Then I turned this MLM-trained model into a nice sequence embedder.
I've experimented with JEPA, but it failed.
The auto-encoder setup worked much better. But I encountered a strange frequency bias, where the decoder only cared about high-frequency information, and I had to mitigate it by adding a Contrastive Loss term.
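The post doesn't say which contrastive loss was used or how pairs were built. As a minimal sketch of adding such a term to a reconstruction loss, here is InfoNCE with in-batch negatives, written in NumPy. The pairing scheme (row i of `positives` matches row i of `anchors`), the temperature, and the weight are assumptions, not the author's settings.

```python
import numpy as np


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE with in-batch negatives.

    Row i of `positives` is the match for row i of `anchors`;
    every other row in the batch is treated as a negative.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def total_loss(reconstruction_loss, anchors, positives, weight=0.5):
    """Reconstruction loss plus a weighted contrastive term (weight is hypothetical)."""
    return reconstruction_loss + weight * info_nce(anchors, positives)


aligned = info_nce(np.eye(4), np.eye(4))
shuffled = info_nce(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

The contrastive term rewards embeddings that tell whole sequences apart, which pushes the encoder to keep sequence-level information that a reconstruction objective alone might underweight.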
I also investigated the tendency of transformers to have a low effective-dimensionality output space (compared to their input embedding space).
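The post doesn't state how effective dimensionality was measured. One common proxy is the participation ratio of the covariance eigenvalues, sketched below in NumPy; treating it as the metric used here is an assumption.

```python
import numpy as np


def participation_ratio(embeddings):
    """Effective dimensionality as (sum of eigenvalues)^2 / sum of squared eigenvalues.

    `embeddings` has shape (n_samples, dim). The result lies between 1
    (all variance on one axis) and dim (variance spread evenly).
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(embeddings) - 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # drop tiny negative round-off
    return float(eig.sum() ** 2 / (eig ** 2).sum())


rng = np.random.default_rng(0)
isotropic = rng.normal(size=(2000, 16))                             # spread over all 16 dims
low_rank = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 16))    # confined to a 2-D subspace
pr_iso = participation_ratio(isotropic)
pr_low = participation_ratio(low_rank)
```

Running the same function on a model's input embeddings and on its output embeddings gives a quick way to compare the two spaces.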
So here's the technical blog post, which reads a bit like "how to waste 1,000 hours and $400 trying to solve an unsolvable real-world problem, while having a lot of fun along the way":
https://www.eloidereynal.com/p/i-spent-1-year-trying-to-predict