Introducing the Ettin Reranker Family
Six new Sentence Transformers CrossEncoder rerankers, state-of-the-art at their respective sizes, built on Ettin ModernBERT encoders with a distillation recipe. The smallest (17M) to largest (1B) all outperform prior models on MTEB and NanoBEIR. Full training recipe, dataset, and models are open-sourced.
Published May 19, 2026
Tom Aarsen
tomaarsen
TL;DR
Today I'm releasing six new Sentence Transformers CrossEncoder rerankers, state-of-the-art at their respective sizes, built on top of the Ettin ModernBERT encoders, together with the data and full training recipe that produced them:
cross-encoder/ettin-reranker-17m-v1
cross-encoder/ettin-reranker-32m-v1
cross-encoder/ettin-reranker-68m-v1
cross-encoder/ettin-reranker-150m-v1
cross-encoder/ettin-reranker-400m-v1
cross-encoder/ettin-reranker-1b-v1
The models were trained with a distillation recipe: pointwise MSE on mixedbread-ai/mxbai-rerank-large-v2 scores over cross-encoder/ettin-reranker-v1-data, which is a subset of lightonai/embeddings-pre-training mixed with a reranked subset of lightonai/embeddings-fine-tuning.
Our six rerankers paired with google/embeddinggemma-300m on MTEB(eng, v2) Retrieval. See Results for five more embedder pairings.
If you're new to rerankers and want the "why" first, jump to What is a reranker, and why pair one with an embedder?. If you just want to plug a model in, jump to Usage. If you want to train your own, jump to Training.
I bootstrapped the training recipe below with the new train-sentence-transformers Agent Skill shipped in Sentence Transformers v5.5.0. Install it with hf skills add train-sentence-transformers [--global] [--claude] and ask your AI coding agent (Claude Code, Codex, Cursor, Gemini CLI, ...) to fine-tune a SentenceTransformer, CrossEncoder, or SparseEncoder model on your data.
Table of contents
What is a reranker, and why pair one with an embedder?
Usage
End-to-end retrieve-then-rerank pipeline
Architecture Details
Results
MTEB(eng, v2) Retrieval
Speed
Training
Distillation recipe
Dataset
Training Arguments
Evaluation
Overall Training Script
Conclusion
Acknowledgements
What is a reranker, and why pair one with an embedder?
A reranker (a.k.a. pointwise cross-encoder) is a neural model that takes a (query, document) pair and outputs a single relevance score. Unlike an embedding model, which encodes the query and document separately and computes their similarity from the two embedding vectors, a reranker lets the two texts attend to each other through every transformer layer. That joint encoding is more accurate but also more expensive: the model has to be run once per (query, document) pair rather than once per text.
Because cross-encoders are too expensive to run over a full corpus, the common production pattern is retrieve-then-rerank: a fast embedding model retrieves the top-K candidates (cheap), then a cross-encoder re-orders just those K with high accuracy. The total cost stays bounded while the final ranking is much closer to what an exhaustive cross-encoder pass would produce.
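To make the two stages concrete, here is a minimal sketch of retrieve-then-rerank in plain Python. The function name `retrieve_then_rerank` and both toy scoring functions are hypothetical stand-ins invented for illustration; in a real pipeline the cheap stage would typically compare embeddings from `SentenceTransformer.encode`, and the expensive stage would be a call to `CrossEncoder.predict`.

```python
from typing import Callable, List, Sequence, Tuple


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Keep the top_k documents by cheap_score, then re-order them by expensive_score."""
    # Stage 1 (retrieve): score the whole corpus with the cheap function, keep top_k candidates.
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:top_k]
    # Stage 2 (rerank): the expensive function runs only top_k times, not len(corpus) times.
    reranked = [(doc, expensive_score(query, doc)) for doc in candidates]
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins (hypothetical): word overlap plays the embedder,
# overlap plus a phrase bonus plays the cross-encoder.
def word_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_with_phrase_bonus(query: str, doc: str) -> float:
    bonus = 5.0 if "was founded" in doc.lower() else 0.0
    return word_overlap(query, doc) + bonus


corpus = [
    "Apple Inc. was founded in Cupertino, California in 1976.",
    "The Fuji apple is an apple cultivar developed in the late 1930s.",
    "Bananas are rich in potassium.",
]
results = retrieve_then_rerank(
    "Where was Apple founded?", corpus, word_overlap, overlap_with_phrase_bonus, top_k=2
)
for doc, score in results:
    print(f"{score:.1f}  {doc}")
```

The point of the sketch is the cost shape rather than the scoring: the expensive function is called `top_k` times per query regardless of corpus size, which is what keeps a cross-encoder affordable in production.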
Throughout this blogpost I'll use "reranker" and "cross-encoder" interchangeably.
Usage
The released models are normal Sentence Transformers CrossEncoder models, so you can use them with just 3 lines of code:
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ettin-reranker-32m-v1")
scores = model.predict([
    ("Where was Apple founded?", "Apple Inc. was founded in Cupertino, California in 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne."),
    ("Where was Apple founded?", "The Fuji apple is an apple cultivar developed in the late 1930s and brought to market in 1962."),
])
print(scores)
# [11.393298 2.968891]

def main() -> None:
    config = CONFIGS[ENCODER_SIZE]
    encoder_id = config["base_model_name"]
    learning_rate = config["learning_rate"]
    global_batch_size = config["global_batch_size"]

    world_size = int(os.environ.get("WORLD_SIZE", 1))
    per_device_batch_size = global_batch_size // world_size
    dataloader_workers = 0 if world_size > 8 else 4
    run_name = f"ettin-reranker-{ENCODER_SIZE}-lr{learning_rate:.0e}"

    # 1. Load a model to finetune with model card data
    # The model mirrors ModernBertForSequenceClassification, but with a 'headless' Transformer that just loads
    # AutoModel. This allows for unpadding with FA2, which isn't possible with AutoModelForSequenceClassification.
    # This speeds up training considerably, while heavily reducing memory usage.
    torch.manual_seed(12)
    transformer = Transformer(encoder_id, model_kwargs={"attn_implementation": "flash_attention_2"})
    transformer.model.config.num_labels = 1
    embedding_dimension = transformer.get_embedding_dimension()
    pooling = Pooling(embedding_dimension=embedding_dimension, pooling_mode="cls")
    dense_inner = Dense(
        in_features=embedding_dimension,
        out_features=embedding_dimension,
        bias=False,
        activation_function=nn.GELU(),
        module_input_name="sentence_embedding",
        module_output_name="sentence_embedding",
    )
    norm = LayerNorm(dimension=embedding_dimension)
    dense_score = Dense(
        in_features=embedding_dimension,
        out_features=1,
        bias=True,
        activation_function=nn.Identity(),
        module_input_name="sentence_embedding",
        module_output_name="scores",
    )
    model = CrossEncoder(
        modules=[transformer, pooling, dense_inner, norm, dense_score],
        num_labels=1,
        activation_fn=nn.Identity(),
        model_card_data=CrossEncoderModelCardData(
            model_name=f"Ettin Reranker {ENCODER_SIZE} distilled from mxbai-rerank-large-v2",
            language="en",
            license="apache-2.0",
        ),
    )
    actual_attn = getattr(model[0].model.config, "_attn_implementation", None)
    if not (actual_attn and "flash" in actual_attn.lower()):
        logging.warning(f"FA2 may not be active (attn_impl={actual_attn!r}); training will be slower.")

    # 2. Load the dataset. Each config is one source subset (32 lighton + 7 rerank retrieval
    # domains). The held-out eval rows live as the 'validation' split of the 'quora' config.
    dataset_repo = "cross-encoder/ettin-reranker-v1-data"
    train_pieces = []
    eval_dataset = None
    for config_name in get_dataset_config_names(dataset_repo):
        dataset = load_dataset(dataset_repo, config_name)
        train_pieces.append(dataset["train"])
        if "validation" in dataset:
            eval_dataset = dataset["validation"]
    train_dataset = concatenate_datasets(train_pieces)
    print(train_dataset)

    # 3. Define a loss function
    loss = MSELoss(model)

    # 4. Specify training arguments
    args = CrossEncoderTrainingArguments(
        output_dir=f"models/{run_name}",
        num_train_epochs=1,
        per_device_train_batch_size=per_device_batch_size,
        per_device_eval_batch_size=per_device_batch_size,
        gradient_accumulation_steps=1,
        learning_rate=learning_rate,
        warmup_ratio=0.03,
        bf16=True,
        eval_strategy="steps",
        eval_steps=0.05,
        save_strategy="steps",
        save_steps=0.05,
        save_total_limit=5,
        logging_steps=0.025,
        logging_first_step=True,
        load_best_model_at_end=True,
        metric_for_best_model="eval_NanoBEIR_R100_mean_ndcg@10",
        dataloader_num_workers=dataloader_workers,
        run_name=run_name,
        seed=12,
    )

    # 5. Create an evaluator
    evaluator = CrossEncoderNanoBEIREvaluator(
        dataset_names=[
            "msmarco", "nfcorpus", "nq", "fiqa2018", "touche2020", "scifact", "hotpotqa",
            "arguana", "fever", "dbpedia", "climatefever", "scidocs", "quoraretrieval",
        ],
        batch_size=per_device_batch_size,
        always_rerank_positives=False,
        show_progress_bar=False,
    )

    # 6. Create a trainer
    trainer = CrossEncoderTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss=loss,
        evaluator=evaluator,
    )

    # 7. Evaluate before training
    if trainer.is_world_process_zero():
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            evaluator(model)

    # 8. Train
    trainer.train()

    # 9. Evaluate the final model
    if trainer.is_world_process_zero():
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            evaluator(model)

    # 10. Save the final model
    final_dir = f"models/{run_name}/final"
    model.save_pretrained(final_dir)


if __name__ == "__main__":
    main()
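Step 3 of the script plugs in `MSELoss`, and the TL;DR describes the objective as pointwise MSE on teacher scores. As a rough illustration of what that objective computes, here is a plain-Python sketch. The function name `pointwise_mse` and the numbers are made up for this example, and the library's own loss may differ in details such as reduction or dtype handling.

```python
from typing import Sequence


def pointwise_mse(student_scores: Sequence[float], teacher_scores: Sequence[float]) -> float:
    """Mean squared error between student and teacher scores, one term per (query, document) pair."""
    if len(student_scores) != len(teacher_scores):
        raise ValueError("Need exactly one teacher score per student score.")
    # "Pointwise": each pair contributes its own squared error; documents for the
    # same query are never compared against each other inside the loss.
    squared_errors = [(s - t) ** 2 for s, t in zip(student_scores, teacher_scores)]
    return sum(squared_errors) / len(squared_errors)


# Made-up raw relevance scores for three (query, document) pairs.
teacher = [10.0, 2.0, -4.0]
student = [9.0, 2.0, -2.0]
loss = pointwise_mse(student, teacher)
print(loss)  # (1 + 0 + 4) / 3
```

Because the target is the teacher's raw score rather than a binary label, the student is pushed to reproduce how strongly the teacher prefers one document over another, not only which one it prefers.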
For multi-node training (anything past 17m/32m), launch the same script with torchrun:
# Single-node (17m, 32m): defaults work
python train.py

# Multi-node 4n setup for 150m, preserves global_batch_size=192:
torchrun --nproc_per_node=8 --nnodes=4 ... train.py
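To see why the 4-node launch preserves the global batch size, it helps to redo the script's `global_batch_size // world_size` arithmetic by hand. The sketch below is illustrative only: the helper name `per_device_settings` is hypothetical, and the divisibility check is an addition that the script itself does not perform. It relies on `torchrun` setting `WORLD_SIZE` to the total number of processes, i.e. processes per node times nodes.

```python
from typing import Tuple


def per_device_settings(global_batch_size: int, nproc_per_node: int, nnodes: int) -> Tuple[int, int]:
    """Return (per_device_batch_size, dataloader_workers) following the script's rules."""
    world_size = nproc_per_node * nnodes  # what torchrun exports as WORLD_SIZE
    # Extra safety check (not in the original script): floor division would
    # otherwise silently shrink the effective global batch size.
    if global_batch_size % world_size != 0:
        raise ValueError(f"{global_batch_size} is not divisible by world_size={world_size}")
    per_device_batch_size = global_batch_size // world_size
    dataloader_workers = 0 if world_size > 8 else 4  # same rule as in the script
    return per_device_batch_size, dataloader_workers


# The 150m launch above: 8 processes per node on 4 nodes, global batch size 192.
batch, workers = per_device_settings(192, nproc_per_node=8, nnodes=4)
print(batch, workers)  # 6 0
```

With 32 processes, each one handles 6 pairs per step, so 32 x 6 recovers the global batch size of 192.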
Conclusion
The ettin-reranker-v1 family, trained with a single simple recipe, is state-of-the-art at every released size up to 1B parameters. Pointwise MSE distillation from a strong teacher onto a broad-domain and retrieval-specific mix scales cleanly from 17M to 1B parameters, with only the learning rate and per-device batch size changing between sizes.
Every ettin-reranker-v1 model beats the ms-marco-MiniLM-L*-v2 family by a comfortable margin on MTEB and NanoBEIR. cross-encoder/ettin-reranker-150m-v1 is the strongest mid-tier reranker I tested in the under-600M range, cross-encoder/ettin-reranker-400m-v1 lands within 0.0024 of the 1.54B teacher's MTEB score, and cross-encoder/ettin-reranker-1b-v1 matches that teacher within 0.0001.
Everything in one place:
Models:
cross-encoder/ettin-reranker-17m-v1
cross-encoder/ettin-reranker-32m-v1
cross-encoder/ettin-reranker-68m-v1
cross-encoder/ettin-reranker-150m-v1
cross-encoder/ettin-reranker-400m-v1
cross-encoder/ettin-reranker-1b-v1
Dataset: cross-encoder/ettin-reranker-v1-data with ~143M (query, document, label) triples, kept as 39 named subsets (one dataset config per source) so the provenance of every row is visible.
Training script: the ~150 lines in Overall Training Script above, which is the same script used for all six models.
If you build something on top of these, please let me know! I'd genuinely love to see what people do with them, and if you manage to train better rerankers using the released data, even better. The recipe is intentionally simple, partly so that there's pl