Building a tiny transformer decoder to extract URL slugs from vector embeddings
May 25th, 2026
Vector embeddings let us compare how similar two things are. Whether for search, RAG, deduplication, pattern analysis, or recommendation, similarity comparison is usually where we stop.
An embedding compresses some input (a document, a file, or perhaps just a string) into a single fixed-size vector. Most text embedding models are trained with a contrastive objective that pushes semantically related documents together in that space. That's why similarity search works: take the cosine distance between two embeddings, and a low score means they're similar in topic. But if embeddings encode enough semantic meaning for similarity, then in principle they should also answer harder questions about the original document than just "is this near that?"
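As a minimal sketch of the similarity comparison described above, here is cosine distance over plain Python lists. The three-dimensional vectors are made-up stand-ins for real 1536-dimensional embeddings, and the document names are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; values near 0 mean the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy 3-dimensional stand-ins for real embeddings (illustrative values only).
doc_a = [0.9, 0.1, 0.0]
doc_b = [0.8, 0.2, 0.1]   # close to doc_a in direction
doc_c = [0.0, 0.1, 0.9]   # points somewhere else entirely

# The near pair has the lower distance.
print(cosine_distance(doc_a, doc_b) < cosine_distance(doc_a, doc_c))
```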
Last weekend we spent some time exploring answers and applications. The first model we trained produced the string of-the-a for every input. The final model produces amelia-earhart-pilot for an article about Amelia Earhart, terminates at the right length on its own, runs in ~89ms on a budget VPS, and is ~14–19× faster and ~85× cheaper than a Haiku-class LLM call for the same task [1].
[Figure: source text → embedding → decoder → slug]
To test whether embeddings actually encode recoverable semantic context, we needed a task harder than classification — ideally something useful enough that, if the model worked, it could go to production.
We landed on URL slug generation. Titles were an option too, but slugs are much easier to mine at scale. Slugs are the hyphenated strings at the end of a URL path (like slug-from-embedding). Producing them requires a human, an LLM, or a verbatim extraction from the title — and the verbatim approach loses most of an article's semantic context.
Slugs sit in a gap that makes them an interesting test case. A classifier can't recover them: a slug is a composed sequence with ordering and specific word choices, not a label from a fixed set. But an LLM is overkill: the task is bounded enough that a small model and a single 1536-dim vector turn out to be sufficient. And the training data is free: it's in the URL of every SEO-optimized web page.
Embeddings are cheap to compute (~$11 for 2.3M documents at OpenAI batch rates) and moderately expensive to store (1536 × f32, so 6 KiB per vector), but neither cost applies here. The idea is to piggyback on embeddings a system already has for search or deduplication. Deriving a slug from an existing vector costs the CPU time of one inference call — far faster and cheaper than the source-text retrieval, API call, and inference cost of shelling out to an LLM for the same task.
The work that convinced us this was worth trying is vec2text (Morris et al., 2023). They showed that iterative correction can reconstruct 92% of 32-token inputs exactly from their embeddings, using a 235M-parameter T5-base model with multiple correction passes. If embeddings preserve enough information for that, recovering a five-word slug should be feasible with something much smaller. Our final model is 25M parameters with a single forward pass: 10× fewer parameters, no iteration, and it achieves comparable topic-recovery for the narrow task of slug generation. If a single-pass network can extract useful auxiliary outputs from the same vector, slug generation is just the first application. For HASH this matters for any human-readable identifier the UI generates: draft titles, suggested filenames, entity slugs.
Before training anything, we needed a dataset. We built an extraction pipeline that takes in arbitrary document sources, funnels them through a series of filters, and outputs a cleaned corpus.
We built two corpora, both targeting diverse educational content (more likely to yield meaningful slugs and meaningful text). The first was a 10,000-document feasibility set mixing arXiv (25%), GitHub issues (25%), and FineWeb-Edu (50%). Slugs are unavailable for both arXiv and GitHub issues, so instead of mechanical extraction we used Haiku at temperature 0 to generate synthetic slugs. The slugs were meaningful and representative, but at $5.25 per 10,000 documents, scaling to our target corpus would've cost ~$1.2k. Not crazy… but perhaps a little expensive for a weekend mini-project. That left local inference (too slow to finish in time) or finding another way to derive slugs.
The actual solution turned out to be obvious. Slugs exist to tell users (and search engines) what a webpage is all about. So instead of deriving our training slugs from documents, we extracted them directly from real URLs. FineWeb-Edu, our primary dataset, annotates each document with its source URL, so meaningful slugs come essentially free. The tradeoff is that we don't control them: some are truncated, SEO-stuffed, editorially inconsistent, or merely look like slugs while actually being IDs in disguise. Good enough for a prototype, we thought, and cleaning this up later should be straightforward: for example, with a logistic-regression re-ranker that scores quality using both the slug's embedding and the URL.
The filtering pipeline is built on the datatrove library. A language filter (fasttext) rejects non-English documents to keep the vocabulary manageable (FineWeb-Edu is primarily English, but other sources like GitHub issues needed filtering). Slug extraction from URLs comes next: take the last meaningful path segment, reject anything that doesn't match a kebab-case pattern, filter by length, numeric density, and stopword ratio. This is the largest single drop: 62% of documents lack an extractable slug. A Gopher repetition filter removes spam and boilerplate. Finally, a token-count filter admits only documents in the 50 to 1,000 token range. The token length filter isn't strictly necessary — the model sees a fixed-size embedding vector regardless — but it serves as a quality proxy and makes embedding-cost estimation easier. We didn't want to wake up to a multi-hundred-dollar OpenAI bill.
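The slug-extraction step can be sketched roughly as follows. This is not the datatrove filter itself: the function name, the stopword list, every threshold, the file-extension stripping, and the choice to walk backwards through path segments until one qualifies are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

KEBAB = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)+$")
STOPWORDS = {"a", "an", "and", "of", "the", "in", "to", "for", "on"}  # assumed list

def extract_slug(url, min_words=2, max_words=8,
                 max_digit_ratio=0.3, max_stopword_ratio=0.5):
    """Return a kebab-case slug from the URL path, or None if no segment qualifies."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    for segment in reversed(segments):              # prefer the last path segment
        candidate = re.sub(r"\.(html?|php|aspx?)$", "", segment.lower())
        if not KEBAB.match(candidate):
            continue                                # not kebab-case
        words = candidate.split("-")
        if not (min_words <= len(words) <= max_words):
            continue                                # length filter
        digits = sum(ch.isdigit() for ch in candidate)
        if digits / len(candidate) > max_digit_ratio:
            continue                                # numeric density: likely an ID or date
        if sum(w in STOPWORDS for w in words) / len(words) > max_stopword_ratio:
            continue                                # mostly stopwords
        return candidate
    return None

print(extract_slug("https://example.com/history/amelia-earhart-pilot.html"))
print(extract_slug("https://example.com/p/2024-01-15-83921"))
```

The first call yields a usable slug; the second is rejected because the only kebab-case segment is almost entirely digits.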
For the smaller corpus, slug extraction happens after the fact (via Haiku), so the URL-based slug filter doesn't apply. For the URL corpus, the pipeline takes FineWeb-Edu's ~9.7M documents down to 2.3M usable training samples (Figure 2).
In a language without a large vocabulary, we might just pool words together, seek to predict the right ones, and assemble a slug. 5,000 output classes for slug-specific nouns might even be enough. But the English language has approximately 500,000 to 600,000 dictionary words. Including scientific terms, technical jargon, regional slang, and old/obsolete words this rises to somewhere north of a million.
The 2.3M slugs in our training corpus contained 316k unique words, with an astonishingly long tail: 62% of these were hapax legomena (words appearing exactly once). URL slugs are particularly tail-heavy because they exist to index distinctive content: proper nouns, niche topics, and specific terminology, as well as various conjugations, and compound words. Predicting over 316k classes isn't practical, so two options remained: compress the vocabulary through clustering, or abandon word-level prediction entirely and switch to a subword tokenization approach like byte-pair encoding (BPE).
But we weren't ready to give up on word-level prediction yet. We hypothesized that a simpler architecture than a transformer, a multilayer perceptron (MLP), would be cheaper to train and run, and we wanted to see if it could recover slugs from embeddings. To compress the vocabulary we tried clustering: embed each of the 316k unique slug words with the same OpenAI model we used for the documents, then group similar embeddings together. We tested four approaches: K-means clustering, hierarchical density-based clustering (HDBSCAN, specifically), cosine similarity thresholding, and Louvain community detection.
Most of these attempts failed:
That they cluster so tightly makes sense. The embedding model employed (OpenAI's text-embedding-3-small) is trained on full text with a contrastive objective; its job is to capture semantic similarity. For a single word, anne and amelia are the same thing: female first names. Without surrounding context, the model has no way to distinguish them. From its perspective, it doesn't matter which female name it is; what matters is that it's a female name at all.
KMeans works differently from the graph-based approaches: it doesn't need pairwise similarity to be meaningful, just distance from centroids. With k=5,000 it compressed the vocabulary from 316k tokens to 5,000 clusters (63×). Of the algorithms we'd scoped, it was the only real option (Figure 3).
What we didn't appreciate at the time was how lossy this compression is. When you replace each slug token with its cluster representative, 47% of tokens map to a different word. amelia becomes anne (female names cluster), earhart becomes harry (surnames cluster). Even a perfect model operating over this vocabulary can only reach 50.2% Tok F1 (a measure of precision and recall over the individual words in a slug). The compression bought tractability but imposed a hard ceiling on quality: if an article is about Amelia Earhart but the best we can reconstruct is anne-harry, the slug is meaningless.
The alternative we'd deferred, BPE, doesn't have this problem. Trained on the slug corpus with - as a special token, BPE learns subword units within slug words, never merging across word boundaries. At the same budget of 5,000 tokens, the average encoded length is 11.7 subwords per slug, and reconstruction is lossless: any slug roundtrips perfectly through encode/decode. The tradeoff: BPE outputs are ordered subword sequences, so generating them requires autoregressive decoding. That rules out simple architectures like MLPs and forces a sequence model.
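To make the "never merging across word boundaries" property concrete, here is a toy BPE in pure Python. It is a sketch, not the tokenizer used in the project: the function names, the three-slug training set, and the merge count are invented for illustration, and the real tokenizer also has the `<pad>`, `<bos>`, `<eos>`, and `<unk>` tokens that this sketch leaves out.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of pair with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_merges(slugs, num_merges):
    """Learn merges inside slug words only; '-' is split off before counting pairs."""
    words = Counter(tuple(w) for slug in slugs for w in slug.split("-"))
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = Counter({tuple(apply_merge(list(s), best)): f for s, f in words.items()})
    return merges

def encode(slug, merges):
    """Encode a slug as subwords, emitting '-' as its own token at each boundary."""
    tokens = []
    for k, word in enumerate(slug.split("-")):
        if k:
            tokens.append("-")
        symbols = list(word)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        tokens.extend(symbols)
    return tokens

def decode(tokens):
    return "".join(tokens)

merges = train_merges(["amelia-earhart-pilot", "amelia-earharts-enduring-image",
                       "buddhism-in-burma"], num_merges=20)
tokens = encode("amelia-earhart-flying", merges)
print(tokens, decode(tokens) == "amelia-earhart-flying")
```

Because single characters always remain valid tokens, any slug decodes back to itself, including words the tokenizer never saw during training.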
Before trying anything complex, we wanted to see if a simple model could do the job. A multi-layer perceptron (MLP) takes the embedding, passes it through two dense layers, and outputs a sigmoid score over each of the 5,000 KMeans cluster tokens. Conceptually simple: pick the top-scoring tokens, assemble a slug. No sequence modeling, no autoregression, just multi-label classification.
Our hypothesis: the MLP would lose word order — it predicts each token independently, so it has no way to model ordering — but should at least recover the right words. We tried three ordering strategies to work around the sequencing problem: sorting by descending sigmoid score, training a separate position head, and learning pairwise ordering from training co-occurrences. None of them mattered, because the failure was more fundamental than ordering.
The hypothesis was partially validated, but the reality was worse: the model collapsed. Across 229k test predictions, the top outputs were of-the-in (27,439 times), of-the-a (19,780 times), of-the-how (10,432 times). Complete semantic collapse. The model didn't just lose word order as predicted; in most cases it couldn't recover any content words. In hindsight this makes sense. The model learns that saying of and the is safe, because even if they aren't the most frequent slug tokens, they're the common denominator: safer than committing to something specific, which gets penalized during training if it's wrong. Our model went from a slug content predictor to one that knows which words inside a slug are the least consequential [2].
We tried to rescue it with three ablations: Focal loss (which downweights easy negatives), a bigger projector (4 layers, 1024 hidden), and the original BCE. All three converged to the same ceiling: ~1.657 val loss, 0.07 to 0.08 Tok F1 (Figure 4). The problem wasn't a tunable hyperparameter; a bag-of-tokens model can't recover meaningful semantic content from a single forward pass.
We first ran this on the 10k feasibility corpus and were gutted by the result. If the simplest reasonable model fails this catastrophically, why should anything else work? But we'd scoped two other architectures before starting, and committed to trying all three until one worked or we'd ruled the idea out. Hope was thin, but we wanted to see them through. The seq2seq worked (more on that in the next section), which gave us confidence to scale to the full 2.3M URL corpus. We re-ran the MLP there too, hoping more data might break the collapse. It didn't. Not surprising, but it ruled out a data problem: a bag-of-tokens model fundamentally can't do this.
After the MLP, we still wanted to keep going. What if, instead of predicting all tokens at once, the model generated them one at a time, each conditioned on the embedding and everything it had produced so far? This is what vec2text does, but their model is a 235M-parameter T5-base encoder-decoder, 20× larger than what we had in mind. Could a tiny version of the same idea work? It was a moonshot, but training on 10,000 samples takes minutes — cheap to try.
The architecture is a prefix-conditioned transformer decoder: the 1536-dim embedding is linearly projected down to the model's internal dimension and inserted as a prefix token at position 0. From there, a standard causal transformer decoder (4 layers, 384-dim, 8 attention heads) autoregressively generates slug tokens, each attending to the prefix and all previous outputs. We initially trained it over the KMeans-compressed vocabulary (not knowing yet that this would lead to interesting problems).
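A back-of-the-envelope parameter count helps size this architecture. The sketch below assumes a feed-forward width of 4× the model dimension, separate (untied) input and output token tables over a 5,000-token vocabulary, and ignores biases, layer norms, and positional embeddings. None of those details are stated in the text, so treat the function as an estimate rather than the actual model definition.

```python
def approx_params(d_model, n_layers, vocab_size=5000, embed_dim=1536, ffn_mult=4):
    """Rough weight count; biases, layer norms, and positional embeddings are ignored."""
    attention = 4 * d_model * d_model                   # Q, K, V, and output projections
    feed_forward = 2 * d_model * (ffn_mult * d_model)   # up- and down-projection
    token_tables = 2 * vocab_size * d_model             # input embedding + untied output head
    prefix_projection = embed_dim * d_model             # 1536-dim embedding -> d_model
    return n_layers * (attention + feed_forward) + token_tables + prefix_projection

for d_model, n_layers in [(256, 4), (384, 4), (512, 6)]:
    print(d_model, n_layers, f"{approx_params(d_model, n_layers) / 1e6:.1f}M")
```

Under these assumptions the three configurations come out at roughly 6.1M, 11.5M, and 24.8M weights, which lines up with the parameter counts reported for the 256-dim, 384-dim, and 512-dim six-layer variants.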
[Figure: interactive decoding walkthrough for the slug the-father-of-mind-ai-and-consciousness. Position 0 holds the projected embedding; positions 1..16 are generated one token at a time.]
Figure 5 illustrates the decoding process; the figure shows a six-layer variant for clarity, but the structure is identical at four layers. Inside the interactive figure, the attention strip to the left of each layer shows the average attention weight for each token. The chart to the right is the logit lens, which shows the model's top predicted token at each layer's residual stream as it processes each step, with a confidence value beside it. The bottom bar chart shows the same distribution unbinned over all 5,000 BPE tokens.
Running it on the 10k corpus was confusing at first. We didn't expect much, but the outputs looked like they were always the same sequence. That's a bust. On closer inspection, they were slight variations — and they were topical. The model was generating different outputs for different inputs, and they made sense. BINGO. This might just work.
So we took the plunge: built the 2.3M URL corpus and retrained with a max sequence length of 10 tokens (foreshadowing), well above the 8-word ceiling we'd set in the training data. Thanks to the modest size of the model (11.5M parameters, d=384, 4 layers), it only took a few minutes to train. We took a sample and it just... worked? It actually worked.
For a text about Buddhism in Burma the reference slug was buddhism-in-burma and our model generated buddhist-in-burma. Close match! This is a KMeans clustering artifact: buddhism and buddhist map to the same cluster, with buddhist as the representative. Not perfect, but close enough for a slug.
Looking at more samples, we spotted something else: an article about Amelia Earhart with the reference amelia-earharts-enduring-image produced anne-essex-fly. What does Anne Essex have to do with Amelia Earhart? Then it dawned on us: this is the fatal flaw of any word-level clustering strategy. The model got the right semantic neighbourhoods: anne is the representative of the female-names cluster (where amelia lives), and fly is the representative of the flying cluster (where flyer and flying live). But for the surname, the model predicted essex, which seems completely wrong until you look at the cluster: the essex cluster contains eardeswick and earnshaw, both "ear-" proper nouns. The model had the right signal (something starting with "ear-"), but the vocabulary compression mapped it to the cluster representative, essex. Meanwhile earhart itself maps to harry in a different cluster entirely.
Minor spelling variations like buddhism/buddhist compress fine. But when the clustering compresses by semantic role, we lose the very words a slug is trying to express. The lesson: semantic similarity is context-dependent, and isolating it per word destroys that context.
There was no way around it: the vocabulary strategy had to change. A subword tokenizer was the next obvious option. We didn't expect it to work, but figured we'd give it a shot. We trained a BPE (byte-pair encoding) tokenizer on the slug corpus with five special tokens: <pad>, <bos> (beginning of sequence), <eos> (end of sequence), <unk> (unknown), and - (the hyphen, so the model explicitly generates word boundaries instead of learning them implicitly).
So we reran the same training process, this time with the BPE tokenizer. The results were surprising: coherent output, but the model clearly needed to learn a lot more. From the same examples we got burma-in-myanmar and emma-j-amelia-ear, which are... kinda close? Burma is Myanmar's former name. And emma-j-amelia-ear is the first time any model got close to spelling out "Amelia Earhart" from an embedding. Figure 6 illustrates the quality difference between the two approaches, with BPE generating tokens that any clustering-based compression could never produce.
The decoding pipeline also needed to grow up. With the MLP we just took the top-k sigmoid scores. Autoregressive generation is a different beast: the model generates one token at a time, and the decoder has to decide when to stop, how to search, and how to handle degenerate outputs. The final pipeline uses beam search (width=4) with a bounded additive length reward, which replaced greedy decoding and eliminated repetition pathologies like turtle-of-turtle. On top of that: a minimum-length floor of 3 subword tokens and 3 slug words before EOS is allowed; hard EOS suppression after stopwords (no more slugs ending in -and or -of); a trailing stopword penalty on completed beams; a repetition filter to catch outputs like december-30-december; and UNK suppression. Each constraint was earned by a specific failure in greedy decoding, and together they make the model's predictions usable.
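A few of these constraints can be expressed as small predicates over the subword tokens generated so far. The sketch below is illustrative only: the function names and the stopword list are hypothetical, the dangling-hyphen rule is our own addition, and the beam search, length reward, and trailing-stopword penalty are left out.

```python
STOPWORDS = {"a", "an", "and", "of", "the", "in", "to", "for", "on"}  # assumed list

def words_so_far(tokens):
    """Rebuild slug words from subword tokens, using '-' as the word boundary."""
    return [w for w in "".join(tokens).split("-") if w]

def eos_allowed(tokens, min_tokens=3, min_words=3):
    """Decide whether a beam may emit <eos> after the given subword tokens."""
    if not tokens or tokens[-1] == "-":
        return False                    # assumption: never stop on a dangling hyphen
    words = words_so_far(tokens)
    if len(tokens) < min_tokens or len(words) < min_words:
        return False                    # minimum-length floor
    if words[-1] in STOPWORDS:
        return False                    # hard EOS suppression after stopwords
    return True

def has_repeated_word(tokens):
    """Flag outputs such as december-30-december or turtle-of-turtle."""
    words = words_so_far(tokens)
    return len(set(words)) < len(words)

print(eos_allowed(["history", "-", "of", "-", "the"]))             # ends in a stopword
print(has_repeated_word(["december", "-", "30", "-", "december"]))
```

In a real decoder the first predicate would mask the `<eos>` logit at each step, while the second would filter or penalize completed beams.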
The results weren't perfect. Going through the outputs, we noticed something strange: the model wouldn't generate slugs longer than about 3 words (the minimum it was allowed to output), even though the training corpus had a mean of 5.1 words. The outputs also felt oddly abrupt. The training data had stopword endings (about 1.5% of the corpus), but the frequency felt much higher because the slugs were so short: a stopword ending on a 3-word slug is a third of the output. A real-world slug that ends in a stopword after 5 words feels fine; a generated 3-word slug ending in -of feels broken. More concerning: the model was truncating words mid-syllable for no apparent reason.
We went through the prediction code: looked fine. Went through the training code: looked fine. We were confused. And then it hit us: the max tokens count was still set to 10 from the KMeans-based seq2seq experiment. The problem: that limit referred to words in the KMeans setting, but BPE operates on subwords. At an average of 11.7 subwords per slug, 56% of training targets were being truncated at the subword level, and most of those lost their <eos> token entirely (Figure 7). The model learned that sequences just... stop, without a termination signal, so it skewed <eos> toward the only place it had seen it: after 2-3 words.
[Figure: the t=10 cap truncated 56.1% of training targets; raising the cap to t=24 covers 99.4% of slugs.]
The fix was straightforward: raise the max-token count. But to what? The BPE length distribution drops off steeply: 85% of slugs fit in 16 subwords, 97% in 20, 99.4% in 24. Going higher than 24 means padding nearly every sequence with <pad> tokens after the <eos>, and since attention scales quadratically with sequence length, those empty positions have a real training cost for negligible coverage gains. So 24 it was: 99.4% coverage, with the remaining 0.6% filtered out. The improvement was immediate: mid-word truncations disappeared. facts-about-blood-don became facts-about-blood-donation. dragonflies-and-mosquito became dragonflies-and-mosquitoes. Val loss improved by 6.7% and Tok F1 gained ~0.02. A config value inherited from a previous experiment had been silently corrupting the majority of training examples.
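Choosing the cap comes down to reading coverage off the length distribution. A minimal sketch, using a made-up list of subword counts; whether the real cap also counts `<bos>` and `<eos>` is not stated, so this version counts slug subwords only.

```python
def coverage(lengths, cap):
    """Fraction of target sequences whose subword length fits within cap."""
    return sum(n <= cap for n in lengths) / len(lengths)

def smallest_cap(lengths, target):
    """Smallest cap whose coverage reaches the target fraction."""
    for cap in range(1, max(lengths) + 1):
        if coverage(lengths, cap) >= target:
            return cap
    return max(lengths)

toy_lengths = [6, 8, 9, 11, 11, 12, 13, 15, 18, 27]  # made-up subword counts

print(coverage(toy_lengths, 10))       # share of targets that survive a cap of 10
print(smallest_cap(toy_lengths, 0.9))  # cap needed to keep 90% of targets intact
```

Running the same two functions over the real tokenized corpus would surface a mismatch like the t=10 cap before any training starts.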
With the truncation fixed, we expected the length problem to be solved. Surprise: it wasn't. The model was still generating slugs that averaged 3.6 words when the training distribution averaged 5.1. The outputs were clearly better (no more mid-word cuts), but the model still preferred to stop early. We could have just raised the minimum word count to match the training distribution, but that felt like a hack. If the model was hesitant to generate longer slugs, we wanted to understand why.
Before retraining, we wanted to see what could be learned from the prediction side alone. We swept beam width (4, 8, 16): identical length distributions. We swept length penalty: no effect on output length. That's weird. Maybe the model just doesn't have more to say? But then we forced the model to emit one extra word: Tok F1 improved by +0.012. Forcing two extra words dropped it back. So the model had useful content available at position 4 and was choosing not to emit it. It was terminating early not because it didn't know what to say, but because it was overconfident about when to stop.
Our hypothesis: under standard cross-entropy loss with teacher forcing, the <eos> token at common positions (around subword position 5 to 7, where most slugs end) receives more gradient signal than <eos> at rare positions (position 15 to 20). The model learns that "stop around word 3 to 4" is usually safe, regardless of the input.
The fix combined two things. First, label smoothing (0.1): instead of training against hard 0/1 targets, the model trains against slightly softened targets (0.9 for the correct token, 0.1 spread across the vocabulary). This prevents the model from becoming pathologically confident about any single prediction. Second, position-aware EOS loss weighting: for each position, weight the EOS loss by min(1.0, median_rate / position_rate). Positions where EOS is over-represented in training get dampened; rare EOS positions stay at weight 1.0. The ceiling at 1.0 is important: we only dampen overconfidence, never amplify. Without the ceiling, we'd nudge the model toward over-representing EOS at rare positions, defeating the point.
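Both pieces of the fix are easy to write down. The sketch below assumes that "rate" means the share of training sequences whose `<eos>` lands at a given position and that the median is taken over positions with at least one `<eos>`; the histogram is invented. For label smoothing it puts 0.9 on the correct token and splits 0.1 over the remaining tokens, which is one common variant (some libraries spread the 0.1 over the whole vocabulary including the correct token).

```python
from statistics import median

def eos_position_weights(eos_counts):
    """Map position -> min(1.0, median_rate / position_rate) for the EOS loss term."""
    total = sum(eos_counts.values())
    rates = {pos: n / total for pos, n in eos_counts.items() if n > 0}
    median_rate = median(rates.values())
    # Over-represented positions are dampened; rare positions are capped at 1.0.
    return {pos: min(1.0, median_rate / rate) for pos, rate in rates.items()}

def smoothed_target(correct, vocab_size, eps=0.1):
    """Soft target: 1 - eps on the correct token, eps shared by all the others."""
    off = eps / (vocab_size - 1)
    return [1.0 - eps if i == correct else off for i in range(vocab_size)]

toy_counts = {5: 400, 6: 300, 7: 200, 15: 60, 20: 40}  # made-up EOS histogram
weights = eos_position_weights(toy_counts)
print(weights)
print(smoothed_target(correct=2, vocab_size=5))
```

In the toy histogram the two most common stopping positions are down-weighted while the rare ones keep weight 1.0, which is the "dampen, never amplify" behavior described above.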
After retraining: mean predicted word count shifted from 3.6 to 4.9 (training mean is 5.1). Tok F1 improved only marginally in greedy decoding (0.286 vs 0.284), but the real win is that the predicted length distribution now matches training data (Figure 8). Small as it looks, this fix was foundational: it unlocked compounding improvements from a larger model and length-aware beam search, which together added +0.022 Tok F1.
With the EOS loss fix in place and greedy decoding averaging 4.9 words, we expected beam-search predictions to match. They didn't. The generated slugs were still shorter than the training distribution. Again.
This time the cause was on the decoding side, not the training side. Standard beam search has a subtle bias toward short sequences: it fills its completion pool with the first k sequences that emit <eos>, then picks the best among them. Shorter sequences hit <eos> sooner, so they fill the pool first. Longer alternatives never get a chance to finish developing, even if they'd score higher.
The naive fix is to remove early stopping entirely and let every beam run to the maximum length. We tried that: inference time tripled. That's not practical for a model that's supposed to run in under 200ms.
The principled fix comes from Huang et al. (2017): score-based stopping. Instead of stopping when k beams have completed, the algorithm continues as long as any active beam could plausibly outscore the best completed beam. The upper bound is tight: future log-probability increments are always ≤ 0, and the length reward is capped, so the algorithm can compute exactly when no active beam can catch up. In practice this runs at about 1.6× the cost of standard early stopping, not 3×, and recovers the full length distribution.
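The stopping rule reduces to a single comparison per step. This is a sketch of the idea, not the project's decoder: the reward slope and cap are hypothetical values, and the surrounding beam search loop is omitted.

```python
def length_reward(num_tokens, per_token=0.5, cap=6.0):
    """Bounded additive length reward: grows with length, then saturates at cap."""
    return min(num_tokens * per_token, cap)

def final_score(logprob, num_tokens, per_token=0.5, cap=6.0):
    """Score of a finished hypothesis: log-probability plus the bounded reward."""
    return logprob + length_reward(num_tokens, per_token, cap)

def can_stop(best_finished_score, active_logprobs, cap=6.0):
    """True once no active beam can still beat the best finished hypothesis.

    An active beam's log-probability can only decrease as tokens are added,
    and its reward can never exceed cap, so logprob + cap bounds its final score.
    """
    return all(lp + cap <= best_finished_score for lp in active_logprobs)

best = final_score(logprob=-4.0, num_tokens=8)   # a finished 8-token hypothesis
print(can_stop(best, [-5.0, -7.5]))   # one active beam could still win: keep going
print(can_stop(best, [-6.5, -9.0]))   # no active beam can catch up: stop
```

Compared with stopping at the first k finished beams, this keeps longer candidates alive only for as long as they remain mathematically able to win.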
With this final fix, the model produces slugs at a mean length of 4.9 words (close to the training mean of 5.1), terminates cleanly on its own, and runs in ~21ms on a laptop CPU (~89ms on a budget VPS). The decoding pipeline is complete.
With the decoding pipeline complete, the remaining question is what computational structure the model learned to extract information from the prefix embedding.
The architecture we chose linearly maps the 1536-dim embedding to the decoder's hidden dimension, placing it as a prefix token at position 0. The decoder attends to this prefix like any other token; attention to position 0 consults the global context (the embedding), attention to other positions consults the local context (the generated sequence).
To introspect the model's attention behavior and interpret its use of global and local context, we recorded attention weights across 500 test samples from the evaluation set, using teacher-forced forward passes conditioned on the model's own predicted output tokens. For each token position in each layer, we measured the fraction of attention allocated to the prefix. The central finding is that hyphens serve as the model's dedicated embedding readers. At layer 1, hyphen tokens allocate 53% of their attention to the prefix embedding (Figure 10); subword tokens allocate 9.3%, a 5.7× ratio. The effect is highly consistent: across 1,742 hyphen positions in our 500 test samples, the interquartile range spans 0.518 to 0.542.
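The measurement itself is simple once attention weights are recorded. A minimal sketch for a single layer and head, with a hand-written toy attention matrix: the function name, the token split, and all numbers are illustrative, and BOS is ignored for brevity.

```python
from collections import defaultdict

def prefix_attention_by_type(tokens, attention):
    """Average attention mass on position 0 (the prefix), grouped by token type.

    tokens[i] is the token at sequence position i + 1; attention[i] is that
    position's causal attention distribution over positions 0..i+1 (sums to 1).
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for token, row in zip(tokens, attention):
        kind = "hyphen" if token == "-" else "subword"
        sums[kind] += row[0]
        counts[kind] += 1
    return {kind: sums[kind] / counts[kind] for kind in sums}

toy_tokens = ["amelia", "-", "ear", "hart"]
toy_attention = [
    [0.10, 0.90],
    [0.55, 0.25, 0.20],
    [0.08, 0.40, 0.32, 0.20],
    [0.12, 0.20, 0.18, 0.30, 0.20],
]
print(prefix_attention_by_type(toy_tokens, toy_attention))
```

Averaging the same quantity over heads, layers, and the 500 samples would give per-layer figures of the kind quoted above.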
[Figure: attention to PREFIX across six layers. Hyphens read the embedding; subwords don’t.]
Figure 9 shows that attention behavior also varies across layers, forming a U-shape. In layer 1, hyphens and BOS both read the embedding heavily (53% and 62%). In layers 2 through 4, prefix attention drops across all token types as the model performs local processing: composing subwords into words, contextualizing within the generated sequence. In layers 5 and 6, hyphens climb back to 34 to 35% and BOS to 53 to 60%, indicating the model re-consults the embedding before final predictions. Subword tokens remain below 12% throughout, consistent with receiving the embedding's information indirectly through previously generated tokens.
Within layer 1, the average of 53% masks a near-binary specialization across heads (Figure 12). Four of eight heads (H0, H3, H4, H6) allocate 96 to 99% of their attention from hyphens to the prefix. Three others (H1, H2, H5) allocate under 2%; H7 sits at 29%. The model benefits from mixing global and local context at hyphens, but achieves this by specializing individual heads rather than distributing attention evenly within each head. At layer 1, focused attention within individual heads outperforms mixed attention; the benefit of mixing global and local context comes downstream. This specialization migrates across depth, though it becomes less pronounced: H2 is the dominant router at layers 2 through 4, and different heads take over at layers 5 and 6. The embedding-reading responsibility passes through specific heads at specific layers.
Figure 11 tests for an effect we expected but did not find: a position effect. The model does not discriminate between hyphen positions within a slug. The first hyphen attends to the prefix at 53.5%; the last at 52.5%. At layer 1, the three positional classes fall within 1 percentage point of each other. The routing is a structural property of the hyphen token, not a function of sequence position.
This routing structure emerged from training, not from architectural design. No special role was assigned to hyphens; the BPE vocabulary preserved - as a discrete token at every word boundary. This gave the model stable structural anchors, and it organized its entire attention pattern around them. We hypothesize that a SentencePiece-style encoding, which merges hyphens into surrounding subwords, would have foreclosed this organization.
With the model's internal structure mapped, we turn to the metrics that score it against the alternatives. The evaluation pipeline scores all model variants on the same 5,000 held-out test samples using seven metrics, each measuring a different axis of slug quality. The 5,000-sample subset is a uniform random draw from the 229k held-out test set collected during corpus generation. At this sample size, the 95% confidence interval on Tok F1 is approximately ±0.008, sufficient to resolve most regime changes. Three of the four regime changes are clearly outside the noise floor (+0.018 or more); the EOS fix (3h to 3i, +0.008) sits at the edge of the confidence interval, though the qualitative win (matching the training-distribution output length) is the more important signal there.
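The ±0.008 figure can be sanity-checked with a normal approximation. How the interval was actually computed is not stated (a bootstrap would be equally plausible), so the sketch below is only one way to arrive at a number of that size; both function names are ours.

```python
import math
from statistics import stdev

def ci_half_width(per_sample_scores, z=1.96):
    """Half-width of a normal-approximation 95% CI for the mean score."""
    n = len(per_sample_scores)
    return z * stdev(per_sample_scores) / math.sqrt(n)

def implied_stdev(half_width, n, z=1.96):
    """Per-sample standard deviation implied by a given half-width and sample size."""
    return half_width * math.sqrt(n) / z

# Under this approximation, +/-0.008 at n=5,000 implies a per-sample
# Tok F1 standard deviation of roughly 0.29.
print(round(implied_stdev(0.008, 5000), 3))
```

Because the half-width shrinks only with the square root of n, going from 5,000 to all 229k samples would tighten the interval by a factor of about 6.8.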
The full test set was not used because BERTScore (roberta-large inference), distinctiveness (nearest-neighbor search), and beam search decoding make per-sample evaluation expensive; running all twelve models on 229k samples was infeasible on the available hardware. Given the confidence interval, full evaluation would not change any conclusion.
Validity verifies the structural correctness of the generated output; anything below 100% constitutes a decoding failure. It does not account for semantic correctness and forms the lower floor of the quality metrics. The check verifies that the output is kebab-case (word-word-word), has 2 to 8 words, and is between 3 and 80 characters in length.
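The three structural checks fit in a few lines. The character class (lowercase ASCII letters and digits) is an assumption, since the text specifies only kebab-case, and the function name is ours.

```python
import re

KEBAB = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")  # assumed character set

def is_valid_slug(slug, min_words=2, max_words=8, min_chars=3, max_chars=80):
    """Structural validity only: kebab-case, 2 to 8 words, 3 to 80 characters."""
    if not KEBAB.fullmatch(slug):
        return False
    if not (min_chars <= len(slug) <= max_chars):
        return False
    return min_words <= len(slug.split("-")) <= max_words

print(is_valid_slug("amelia-earhart-pilot"))  # well-formed
print(is_valid_slug("of-the-"))               # trailing hyphen is not kebab-case
```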
Exact match requires character-for-character identity between prediction and reference. This is the quality ceiling and unlikely to be reached in practice with noisy URL-extracted references. A high exact match score would indicate memorization or test-set contamination, not genuine quality.
Token F1 splits both slugs on hyphens and computes set-overlap precision, recall, and their harmonic mean. Ordering is ignored: react-suspense and suspense-react score identically. We report macro-averaged F1 (per-sample F1, then averaged) rather than micro (all tokens pooled, F1 computed once). We chose macro over micro because micro skews the result in favor of longer slugs. This is the primary quality metric throughout.
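A minimal sketch of the metric as described, using slugs that appear elsewhere in this post. Returning 0 when there is no overlap is our convention for the undefined case; the function names are ours.

```python
def token_f1(predicted, reference):
    """Set-overlap F1 between the hyphen-separated words of two slugs."""
    pred, ref = set(predicted.split("-")), set(reference.split("-"))
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def macro_token_f1(pairs):
    """Per-sample F1 averaged over samples, so every slug counts equally."""
    return sum(token_f1(p, r) for p, r in pairs) / len(pairs)

pairs = [
    ("react-suspense", "suspense-react"),                    # order is ignored
    ("buddhist-in-burma", "buddhism-in-burma"),              # two of three words match
    ("anne-essex-fly", "amelia-earharts-enduring-image"),    # no shared words
]
print([round(token_f1(p, r), 3) for p, r in pairs])
print(round(macro_token_f1(pairs), 3))
```

The second pair shows why the metric is strict: buddhist and buddhism count as a complete miss, which is the gap BERTScore is meant to cover.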
ROUGE-1 measures unigram overlap between predicted and reference slug words. Unlike Token F1, which deduplicates (set overlap), ROUGE-1 counts occurrences (bag-of-words). For slugs where repetition is rare, the two track closely; across our experiments the gap between Token F1 and ROUGE-1 stays below 0.003. ROUGE is reported separately as a standard NLG metric for cross-system comparison.
ROUGE-L extends ROUGE-1 by computing the longest common subsequence, so it penalizes correct words that appear in the wrong position.
BERTScore embeds both predicted and reference words through roberta-large and computes similarity between the resulting contextual embeddings. Unlike Token F1, it treats semantically similar tokens as similar: buddhist and buddhism receive partial credit rather than scoring as completely different. The metric was included because slug generation frequently produces such near-synonyms. The tradeoff is that roberta-large's token embeddings for common English words are not widely separated, so two unrelated four-word slugs already score around 0.82. This high floor compresses the effective dynamic range.
Vocab diversity is the ratio of unique predictions to total predictions. A model producing identical output for every input scores close to 0% (1/N for N predictions); fully distinct outputs score 100%. This metric detected the MLP's mode collapse (Figure 4).
Distinctiveness extends vocab diversity by measuring neighbor differentiation. For each sample, it finds the k nearest neighbors in embedding space (k=5 in our case) and computes similarity-weighted Jaccard distance between their predicted slugs. Where vocab diversity detects global collapse (identical outputs), distinctiveness detects local collapse: two documents about different aspects of climate change should receive different slugs, even though their embeddings are close. High distinctiveness means the model differentiates semantically adjacent documents; low distinctiveness indicates collapse to generic topic labels.
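The two collapse metrics can be sketched together. Vocab diversity follows directly from its definition. For distinctiveness the exact weighting is not spelled out, so this version assumes a similarity-weighted average of word-level Jaccard distances to each sample's neighbors, averaged over samples. The neighbor lists are supplied by hand here rather than found by a k=5 nearest-neighbor search, and all names and values are illustrative.

```python
def vocab_diversity(predictions):
    """Unique predictions divided by total predictions."""
    return len(set(predictions)) / len(predictions)

def jaccard_distance(slug_a, slug_b):
    """1 - |A ∩ B| / |A ∪ B| over the hyphen-separated words of two slugs."""
    a, b = set(slug_a.split("-")), set(slug_b.split("-"))
    return 1.0 - len(a & b) / len(a | b)

def distinctiveness(predictions, neighbors):
    """neighbors[i] lists (j, similarity) pairs for sample i's nearest neighbors."""
    scores = []
    for i, nbrs in enumerate(neighbors):
        weight = sum(sim for _, sim in nbrs)
        weighted = sum(sim * jaccard_distance(predictions[i], predictions[j])
                       for j, sim in nbrs)
        scores.append(weighted / weight)
    return sum(scores) / len(scores)

predictions = ["climate-change-policy", "climate-change-policy", "arctic-sea-ice-loss"]
neighbors = [
    [(1, 0.9), (2, 0.6)],
    [(0, 0.9), (2, 0.6)],
    [(0, 0.6), (1, 0.6)],
]
print(round(vocab_diversity(predictions), 3))
print(round(distinctiveness(predictions, neighbors), 3))
```

The first two documents receive the same slug despite being distinct neighbors, which pulls their individual scores down even though the global diversity figure still looks reasonable.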
The narrative above describes five successive approaches, all scored through the same evaluation pipeline to ensure comparability.
| Experiment | Variant | Vocab | Dim | Layers | Epochs | Tok F1 | Params | Notes |
|---|---|---|---|---|---|---|---|---|
| 1a | MLP (BCE) | KMeans | 768 | 2 | 5 | 0.071 | 5.6M | |
| 1b | MLP (focal) | KMeans | 768 | 2 | 5 | 0.082 | 5.6M | |
| 1c | MLP (big) | KMeans | 1024 | 4 | 5 | 0.068 | 9.8M | |
| 3a | Seq2seq | KMeans | 256 | 4 | 15 | 0.189 | 6.1M | |
| 3b | Seq2seq | KMeans | 384 | 4 | 15 | 0.197 | 11.5M | |
| 3d | Seq2seq | BPE | 512 | 4 | 15 | 0.269 | 18.5M | t=10 |
| 3e | Seq2seq | BPE | 384 | 4 | 50 | 0.267 | 11.5M | t=10 |
| 3f | Seq2seq | BPE | 384 | 6 | 15 | 0.259 | 15.1M | t=10 |
| 3g | Seq2seq | BPE | 512 | 6 | 23 | 0.272 | 24.8M | t=10 |
| 3h | Seq2seq | BPE | 384 | 4 | 50 | 0.290 | 11.5M | t=24 |
| 3i | Seq2seq | BPE | 384 | 4 | 30 | 0.298 | 11.5M | t=24 + EOS; smaller canonical |
| 3j | Seq2seq | BPE | 512 | 6 | 36 | 0.306 | 24.8M | t=24 + EOS; canonical |
Figure 13 plots eval-pipeline Tok F1 across all twelve experiments. No consistent regression is visible except at 3e and 3f, where reduced dimensionality was tested against larger configurations. The dominant pattern is that Tok F1 improvements correlate with regime changes and not with architectural ablations such as varying layer count or hidden dimension, which produce only modest gains. The ghost markers on the KMeans phase (hollow red squares showing the training-time values) expose a measurement artifact: during training we tracked micro F1 and F1 against the compressed vocabulary, both of which inflated the apparent KMeans ceiling to 0.326 to 0.345. The eval pipeline scores all models under macro F1 against raw reference tokens, placing KMeans at 0.189 to 0.197. Under one consistent definition, BPE outperforms KMeans from the first run onward, provided the predictor is sufficiently expressive.
Figure 14 decomposes the headline Tok F1 into its precision and recall components. Precision runs consistently above recall across all five phases, indicating the model is more selective than complete. The two curves climb in lockstep; the canonical model lands at P=0.335, R=0.299. The near-constant gap confirms that the successive interventions recovered additional relevant tokens without trading completeness for selectivity. After the EOS fix, precision drops slightly (0.349 to 0.335) while recall rises (0.260 to 0.286), narrowing the P-R gap from 0.088 to 0.049. The model trades a small amount of selectivity for substantially better completeness.
Figure 15 tracks the order-sensitive analog. ROUGE-L penalizes correct words that appear in the wrong position, producing a gap below Tok F1 that grows with model quality: ~0.006 for MLP, ~0.011 for KMeans seq2seq, and ~0.02 for the BPE phases. The gap is stable within each regime, consistent with autoregressive decoding maintaining token ordering as quality improves. The MLP's smaller absolute gap is a floor effect; with Tok F1 at 0.07, there is little correct content to misorder.
Figure 16 shows contextual-embedding similarity via roberta-large. The metric agrees in direction with Tok F1 at every regime change but operates in a compressed range. The floor sits at 0.81 even for MLP outputs, because roberta-large token embeddings for common English words are not widely separated; two unrelated four-word slugs score high by being short and English. The canonical model reaches 0.872. The appropriate reading is directional: every intervention that improved Tok F1 also improved BERTScore. The compressed range means BERTScore cannot serve as a standalone quality metric for this task; it is useful only for corroborating trends visible in the primary metrics.
Figure 17 measures the ratio of unique predictions to total samples. The MLP failure mode is most pronounced here, with experiment 1c collapsing to 1.6% diversity: 80 unique slugs across 5,000 test samples. All seq2seq variants achieve 95%+ vocab diversity, showing that none of them collapses onto a small set of repeated outputs.
Figure 18 captures what vocab diversity misses. KMeans seq2seq achieves 95% vocab diversity but only 0.84 distinctiveness; the BPE phases reach 0.89. The gap reflects near-duplicate collapse: when two documents about different aspects of the same topic land in similar embedding neighborhoods, KMeans produces slugs that share most tokens with only one or two swapped. BPE's finer vocabulary allows the model to distinguish between semantically adjacent inputs. One subtlety: the MLP achieves higher distinctiveness (0.37 to 0.51) than its vocab diversity (0.02 to 0.47) would suggest. Even with a severely collapsed output space, the model assigns different combinations to different embedding neighborhoods; it has some input sensitivity despite mode collapse. In the other direction, the EOS-calibrated models dip slightly from the BPE t=10 peak (0.895 to 0.877). Longer slugs (4.9 vs 3.6 words) share more tokens with neighbors mechanically, which the Jaccard distance registers as reduced distinctiveness.
Figure 19 plots Tok F1 against parameter count on a log scale, stratified by phase. Within each phase the points form near-horizontal bands: scaling from 11.5M to 24.8M parameters inside the BPE t=10 regime (a 2.2× increase) produces a total spread of 0.013 Tok F1. By comparison, the truncation fix alone (a data correction, not a model change) gains +0.018, and the KMeans-to-BPE vocabulary switch gains +0.072. Every meaningful vertical jump corresponds to a regime change; parameter scaling within a regime does not account for any of them. This is the strongest evidence that the performance bottleneck is upstream of model capacity, and it directly motivates the hypotheses in the following section.
The final pipeline yields two models, experiments 3i and 3j.
| Dimension | Layers | Parameters | Size | Tok F1 | Mean Words | Inference³ |
|---|---|---|---|---|---|---|
| 384 | 4 | 11.5M | 46 MiB | 0.298 | 4.9 | ~89ms |
| 512 | 6 | 24.8M | 99 MiB | 0.306 | 4.9 | ~160ms |
More than doubling the parameter count, from 11.5M to 24.8M, adds only 0.008 Tok F1. Both models produce topically coherent slugs; the difference shows up mainly in edge cases, where the larger model overgeneralizes less often. The common failure mode is not incoherence but overgeneralization: the model captures the topic but misses the specificity. For most applications this is acceptable; where it isn't, top-k generation with human selection turns the model into a recommendation engine rather than a single-shot predictor. We recommend the smaller model for most deployments.
On matched inputs, the smaller model runs ~14–19× faster and ~85× cheaper than a Haiku-class LLM call¹, with 100% structural validity, 97.3% vocab diversity, and 0.298 Tok F1 on held-out data.
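The cost and latency ratios can be re-derived from the figures in the benchmark footnote. The sketch below only repeats that arithmetic; every constant is copied from the footnote, and the variable names are ours.

```python
# Haiku-class LLM call (figures from the benchmark footnote).
HAIKU_INPUT_TOKENS = 1_223
HAIKU_OUTPUT_TOKENS = 13
HAIKU_INPUT_USD_PER_M = 0.80
HAIKU_OUTPUT_USD_PER_M = 4.00
HAIKU_LATENCY_MS = 1_683

# Decoder path (d=384, L=4) with a fresh text-embedding-3-small call.
EMBED_TOKENS = 566
EMBED_USD_PER_M = 0.02
DECODER_USD_PER_SLUG = 0.0000005  # CPU time only
DECODER_LATENCY_MS = 89
EMBED_LATENCY_MS = 11

haiku_cost = (
    HAIKU_INPUT_TOKENS * HAIKU_INPUT_USD_PER_M
    + HAIKU_OUTPUT_TOKENS * HAIKU_OUTPUT_USD_PER_M
) / 1_000_000
embed_cost = EMBED_TOKENS * EMBED_USD_PER_M / 1_000_000
fresh_cost = embed_cost + DECODER_USD_PER_SLUG

cost_ratio_fresh = haiku_cost / fresh_cost             # pay for a new embedding
cost_ratio_cached = haiku_cost / DECODER_USD_PER_SLUG  # embedding already exists
speedup_decoder_only = HAIKU_LATENCY_MS / DECODER_LATENCY_MS
speedup_with_embedding = HAIKU_LATENCY_MS / (DECODER_LATENCY_MS + EMBED_LATENCY_MS)

print(f"Haiku cost per slug:        ${haiku_cost:.5f}")
print(f"Decoder + fresh embedding:  ${fresh_cost:.6f}")
print(f"Cost ratio (fresh):         {cost_ratio_fresh:.0f}x")
print(f"Cost ratio (cached):        {cost_ratio_cached:.0f}x")
print(f"Speedup (decoder only):     {speedup_decoder_only:.1f}x")
print(f"Speedup (with embedding):   {speedup_with_embedding:.1f}x")
```

With unrounded intermediates the fresh-embedding ratio comes out near 87×; the ~85× headline follows from dividing the rounded per-slug costs ($0.00103 by $0.000012).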
The finding that matters beyond slug generation is that a small decoder can extract structured, human-readable information from the embedding space. The model organized specialized attention heads for structural parsing without being instructed to, adapting its internal circuitry to the data. This validates our initial hypothesis: embeddings are a reusable source of auxiliary information for tasks that require semantic understanding, without needing a large language model. The cost is a task-specific training set and an appropriately sized decoder. The autoregressive architecture used here generalizes to any task that requires sequential output; single-label or multi-label classification are trivial subsets of the same formulation.
In HASH, we plan to use this to give everything in the knowledge graph a slug. Entities today are identified by UUIDs, which are opaque: hard for humans to remember, hard for agents to replay, and devoid of semantic content. A slug-based identifier like amelia-earhart-f86ab (a topical slug plus a short hash for uniqueness) carries meaning and is easier to recall for both people and LLMs. The same applies to draft revisions, which also need names: a topical slug makes it immediately clear what changed in a given version, where a UUID tells you nothing. Deriving these slugs from embeddings the system already computes makes the cost negligible.
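A slug-plus-hash identifier of this shape could be assembled as in the sketch below. The hashing scheme is an assumption made for illustration: the text does not say which hash is used or what it is computed over, so here the suffix is the first five hex characters of a SHA-256 digest of the entity's UUID, and `slug_identifier` is a hypothetical helper.

```python
import hashlib
import uuid


def slug_identifier(slug: str, entity_id: uuid.UUID, hash_len: int = 5) -> str:
    """Append a short, deterministic hex suffix to a topical slug.

    ASSUMPTION: the suffix is derived from the entity's UUID via SHA-256;
    the production scheme may differ.
    """
    digest = hashlib.sha256(entity_id.bytes).hexdigest()
    return f"{slug}-{digest[:hash_len]}"


# Hypothetical entity; in practice the slug would come from the decoder.
entity_id = uuid.UUID("12345678-1234-5678-1234-567812345678")
identifier = slug_identifier("amelia-earhart", entity_id)
print(identifier)
```

Five hex characters give 16^5 (about one million) possible suffixes per slug, so the suffix only needs to separate entities that share the same topical slug, not to be globally unique.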
Four hypotheses remain, roughly in priority order.
Data quantity. Certain niches are underrepresented. When the model writes asm for an article about WebAssembly, two failure modes are possible: the training data did not contain enough wasm examples for the model to commit to the rarer token, or the embedding genuinely does not distinguish wasm from asm because they appear in similar contexts. The current experiments do not disambiguate them; scaling to 10 to 100M samples would. We trained on ~9.7M documents from FineWeb-Edu; the full corpus contains ~1.59B. If performance improves substantially at scale, the ceiling is data quantity. If it does not, the bottleneck is elsewhere.
Data quality. URL-extracted slugs are noisy: truncated, SEO-stuffed, editorially inconsistent. A cleaner corpus, either distilled through a self-hosted model or filtered by a reranker scoring slug-content alignment, might matter more than raw scale. The question is whether better references improve the model more than more references.
Architecture. The model uses prefix-conditioning, where the embedding occupies a single token position. The attention analysis confirms this works (four dedicated heads at layer 1), but three questions remain open. First, the input projection compresses 1,536 dimensions to 384 or 512 before any attention head sees the embedding; running the decoder at full embedding dimension would test whether the projection discards recoverable signal. Second, a cross-attention architecture, where the decoder attends to an encoded representation rather than a single prefix token, would test whether the single-position formulation itself is the bottleneck. Third, a frozen pretrained decoder (Qwen 0.6B or SmolLM 360M) with a trained embedding-to-soft-prompt projector would test whether an existing language prior improves slug quality over learning one from scratch. A narrower question is whether the hyphen-routing pattern is specific to this embedding model or universal to BPE vocabularies that preserve - as a token; cross-embedding transfer would answer it.
Training objective. Next-token cross-entropy optimizes per-token accuracy, not slug-level quality. Sequence-level training (REINFORCE on Token F1, InfoNCE on slug-document embedding similarity) might extract more from the embedding by aligning the training signal with the evaluation metric.
If all four hypotheses prove false, the ceiling reflects the embedding's information content, which is itself an informative result: it would characterize the boundary of what single-pooled sentence embeddings preserve.
Surprise! The 11.5M-parameter decoder is running in your browser right now. The full ONNX model (44 MiB) loads on demand into a WebAssembly runtime; beam search runs in JavaScript. No server, no API call, no round trip. The embedding goes in, the slug comes out, and everything between happens on your machine.
Ten curated examples from the held-out test set are loaded below. Each comes with its real 1,536-dimensional OpenAI embedding, the source text (included so you can read what was embedded — per the architecture, the model never sees it), the reference slug from the original URL, and the model's top-5 beam search candidates. To go further: paste your own text-embedding-3-small vector (see below the panel for how to get one), generate a random embedding, perturb bands of an existing one (select an operation and drag inside the embedding strip), or morph between two examples. Every slug is generated on the fly, on your machine, in the browser. Inference is slower than native (WebAssembly, single-threaded), but the point is that a 44 MiB model just works.
The training code, data pipeline, and evaluation scripts are on GitHub. Pretrained model weights (ONNX and PyTorch) and a standalone Python implementation of the inference pipeline are on Hugging Face.
1. Benchmarked on the same 100 held-out test documents (avg 566 tokens). Haiku: Claude Haiku 4.5 with the same setup used to distill slugs in the small corpus, avg 1,223 input tokens and 13 output tokens per call, at $0.80/M input and $4.00/M output = $0.00103/slug, avg latency 1,683ms. Our model (d=384, L=4): inference on a Hetzner CPX22 (2 shared vCPU, 4 GB RAM, €9.51/month), median ~89ms, throughput ~11.2 inferences/sec = $0.000012/slug if the embedding needs to be computed fresh ($0.02/M tokens via OpenAI text-embedding-3-small, avg 566 tokens = $0.000011, plus $0.0000005 for model inference). Embedding throughput was ~89 documents/sec via direct OpenAI API calls, so the embedding step adds ~11ms of wall time per document on top of the model's ~89ms. If the embedding already exists (the intended piggyback use case), the cost is just the CPU time: $0.0000005/slug, roughly 2,000× cheaper than the LLM call. The 85× figure in the text uses the conservative case where you pay for a fresh embedding.
2. This is a gradient effect, not a strategy. In multi-label BCE, predicting of gets a net positive gradient from the ~16% of slugs containing it, while the penalty from the other 84% is spread across 4,999 outputs. Rare tokens like earhart get much weaker net signal. The model settles on common-token prediction as the path of least resistance.
3. Median over 100 samples on a Hetzner CPX22 (2 shared vCPU, 4 GB RAM, €9.51/month). On a laptop CPU (M-series Apple Silicon) the same models run at ~21ms and ~41ms respectively.