Introducing the Ettin Reranker Family

Six new Sentence Transformers CrossEncoder rerankers, state-of-the-art at their respective sizes, built on the Ettin ModernBERT encoders with a pointwise MSE distillation recipe. Every model, from the smallest (17M) to the largest (1B), outperforms the ms-marco-MiniLM-L*-v2 family on MTEB and NanoBEIR. The full training recipe, dataset, and models are open-sourced.

Published May 19, 2026

Tom Aarsen

tomaarsen

TL;DR

Today I'm releasing six new Sentence Transformers CrossEncoder rerankers, state-of-the-art at their respective sizes, built on top of the Ettin ModernBERT encoders, together with the data and full training recipe that produced them:

cross-encoder/ettin-reranker-17m-v1

cross-encoder/ettin-reranker-32m-v1

cross-encoder/ettin-reranker-68m-v1

cross-encoder/ettin-reranker-150m-v1

cross-encoder/ettin-reranker-400m-v1

cross-encoder/ettin-reranker-1b-v1

The models were trained with a distillation recipe: pointwise MSE on mixedbread-ai/mxbai-rerank-large-v2 scores over cross-encoder/ettin-reranker-v1-data, which is a subset of lightonai/embeddings-pre-training mixed with a reranked subset of lightonai/embeddings-fine-tuning.

Figure: Our six rerankers paired with google/embeddinggemma-300m on MTEB(eng, v2) Retrieval. See Results for five more embedder pairings.

If you're new to rerankers and want the "why" first, jump to What is a reranker, and why pair one with an embedder?. If you just want to plug a model in, jump to Usage. If you want to train your own, jump to Training.

I bootstrapped the training recipe below with the new train-sentence-transformers Agent Skill shipped in Sentence Transformers v5.5.0. Install it with hf skills add train-sentence-transformers [--global] [--claude] and ask your AI coding agent (Claude Code, Codex, Cursor, Gemini CLI, ...) to fine-tune a SentenceTransformer, CrossEncoder, or SparseEncoder model on your data.

Table of contents

What is a reranker, and why pair one with an embedder?

Usage

End-to-end retrieve-then-rerank pipeline

Architecture Details

Results

MTEB(eng, v2) Retrieval

Speed

Training

Distillation recipe

Dataset

Training Arguments

Evaluation

Overall Training Script

Conclusion

Acknowledgements

What is a reranker, and why pair one with an embedder?

A reranker (a.k.a. pointwise cross-encoder) is a neural model that takes a (query, document) pair and outputs a single relevance score. Unlike an embedding model, which encodes the query and document separately and computes their similarity from the two embedding vectors, a reranker lets the two texts attend to each other through every transformer layer. That joint encoding is more accurate but also more expensive: the model has to be run once per (query, document) pair rather than once per text.

Because cross-encoders are too expensive to run over a full corpus, the common production pattern is retrieve-then-rerank: a fast embedding model retrieves the top-K candidates (cheap), then a cross-encoder re-orders just those K with high accuracy. The total cost stays bounded while the final ranking is much closer to what an exhaustive cross-encoder pass would produce.
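The retrieve-then-rerank pattern described above can be sketched without any model downloads. This is a toy illustration only: `embed`, `cosine`, and `pair_score` are hypothetical stand-ins (a bag-of-words vector and a query-term coverage score) for a real embedding model and a real cross-encoder, chosen so the control flow is runnable as-is.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_score(query, document):
    """Toy stand-in for a cross-encoder: looks at both texts together
    and returns the fraction of distinct query tokens found in the document."""
    query_tokens = set(tokenize(query))
    return len(query_tokens & set(tokenize(document))) / len(query_tokens)


def retrieve_then_rerank(query, corpus, top_k):
    # Stage 1: cheap retrieval. Document vectors can be precomputed once.
    query_vec = embed(query)
    sims = [cosine(query_vec, embed(doc)) for doc in corpus]
    candidates = sorted(range(len(corpus)), key=lambda i: sims[i], reverse=True)[:top_k]
    # Stage 2: the expensive pairwise scorer runs only on the top_k candidates.
    scored = [(i, pair_score(query, corpus[i])) for i in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


corpus = [
    "Apple Inc. was founded in Cupertino, California in 1976.",
    "The Fuji apple is an apple cultivar developed in the late 1930s.",
    "Cupertino is a city in Santa Clara County, California.",
    "Bananas are rich in potassium.",
]
results = retrieve_then_rerank("Where was Apple founded?", corpus, top_k=2)
print(results)
```

The pairwise scorer is called `top_k` times per query regardless of corpus size, which is the cost bound the pattern relies on.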

Throughout this blogpost I'll use "reranker" and "cross-encoder" interchangeably.

Usage

The released models are normal Sentence Transformers CrossEncoder models, so you can use them with just 3 lines of code:

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ettin-reranker-32m-v1")
scores = model.predict([
    ("Where was Apple founded?", "Apple Inc. was founded in Cupertino, California in 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne."),
    ("Where was Apple founded?", "The Fuji apple is an apple cultivar developed in the late 1930s and brought to market in 1962."),
])
print(scores)

[11.393298 2.968891]

[...]

def main() -> None:

    config = CONFIGS[ENCODER_SIZE]
    encoder_id = config["base_model_name"]
    learning_rate = config["learning_rate"]
    global_batch_size = config["global_batch_size"]

    world_size = int(os.environ.get("WORLD_SIZE", 1))
    per_device_batch_size = global_batch_size // world_size
    dataloader_workers = 0 if world_size > 8 else 4
    run_name = f"ettin-reranker-{ENCODER_SIZE}-lr{learning_rate:.0e}"

    # 1. Load a model to finetune with model card data
    # The model mirrors ModernBertForSequenceClassification, but with a 'headless' Transformer that just loads
    # AutoModel. This allows for unpadding with FA2, which isn't possible with AutoModelForSequenceClassification.
    # This speeds up training considerably, while heavily reducing memory usage.
    torch.manual_seed(12)
    transformer = Transformer(encoder_id, model_kwargs={"attn_implementation": "flash_attention_2"})
    transformer.model.config.num_labels = 1
    embedding_dimension = transformer.get_embedding_dimension()
    pooling = Pooling(embedding_dimension=embedding_dimension, pooling_mode="cls")
    dense_inner = Dense(
        in_features=embedding_dimension,
        out_features=embedding_dimension,
        bias=False,
        activation_function=nn.GELU(),
        module_input_name="sentence_embedding",
        module_output_name="sentence_embedding",
    )
    norm = LayerNorm(dimension=embedding_dimension)
    dense_score = Dense(
        in_features=embedding_dimension,
        out_features=1,
        bias=True,
        activation_function=nn.Identity(),
        module_input_name="sentence_embedding",
        module_output_name="scores",
    )
    model = CrossEncoder(
        modules=[transformer, pooling, dense_inner, norm, dense_score],
        num_labels=1,
        activation_fn=nn.Identity(),
        model_card_data=CrossEncoderModelCardData(
            model_name=f"Ettin Reranker {ENCODER_SIZE} distilled from mxbai-rerank-large-v2",
            language="en",
            license="apache-2.0",
        ),
    )
    actual_attn = getattr(model[0].model.config, "_attn_implementation", None)
    if not (actual_attn and "flash" in actual_attn.lower()):
        logging.warning(f"FA2 may not be active (attn_impl={actual_attn!r}); training will be slower.")

    # 2. Load the dataset. Each config is one source subset (32 lighton + 7 rerank retrieval
    # domains). The held-out eval rows live as the 'validation' split of the 'quora' config.
    dataset_repo = "cross-encoder/ettin-reranker-v1-data"
    train_pieces = []
    eval_dataset = None
    for config_name in get_dataset_config_names(dataset_repo):
        dataset = load_dataset(dataset_repo, config_name)
        train_pieces.append(dataset["train"])
        if "validation" in dataset:
            eval_dataset = dataset["validation"]
    train_dataset = concatenate_datasets(train_pieces)
    print(train_dataset)

    # 3. Define a loss function
    loss = MSELoss(model)

    # 4. Specify training arguments
    args = CrossEncoderTrainingArguments(
        output_dir=f"models/{run_name}",
        num_train_epochs=1,
        per_device_train_batch_size=per_device_batch_size,
        per_device_eval_batch_size=per_device_batch_size,
        gradient_accumulation_steps=1,
        learning_rate=learning_rate,
        warmup_ratio=0.03,
        bf16=True,
        eval_strategy="steps",
        eval_steps=0.05,
        save_strategy="steps",
        save_steps=0.05,
        save_total_limit=5,
        logging_steps=0.025,
        logging_first_step=True,
        load_best_model_at_end=True,
        metric_for_best_model="eval_NanoBEIR_R100_mean_ndcg@10",
        dataloader_num_workers=dataloader_workers,
        run_name=run_name,
        seed=12,
    )

    # 5. Create an evaluator
    evaluator = CrossEncoderNanoBEIREvaluator(
        dataset_names=[
            "msmarco", "nfcorpus", "nq", "fiqa2018", "touche2020", "scifact", "hotpotqa",
            "arguana", "fever", "dbpedia", "climatefever", "scidocs", "quoraretrieval",
        ],
        batch_size=per_device_batch_size,
        always_rerank_positives=False,
        show_progress_bar=False,
    )

    # 6. Create a trainer
    trainer = CrossEncoderTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss=loss,
        evaluator=evaluator,
    )

    # 7. Evaluate before training
    if trainer.is_world_process_zero():
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            evaluator(model)

    # 8. Train
    trainer.train()

    # 9. Evaluate the final model
    if trainer.is_world_process_zero():
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            evaluator(model)

    # 10. Save the final model
    final_dir = f"models/{run_name}/final"
    model.save_pretrained(final_dir)


if __name__ == "__main__":
    main()
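Step 3 of the script uses MSELoss, which in this recipe regresses the student's raw score onto the teacher's score for each (query, document) pair. Below is a minimal plain-Python sketch of that objective. It is not the library's implementation, and the score values are made up for illustration.

```python
def pointwise_mse(student_scores, teacher_scores):
    """Mean squared error between student and teacher scores, one term per pair."""
    if len(student_scores) != len(teacher_scores):
        raise ValueError("need exactly one teacher score per student score")
    n = len(student_scores)
    return sum((s - t) ** 2 for s, t in zip(student_scores, teacher_scores)) / n


def pointwise_mse_grad(student_scores, teacher_scores):
    """Gradient of the loss with respect to each individual student score."""
    n = len(student_scores)
    return [2.0 * (s - t) / n for s, t in zip(student_scores, teacher_scores)]


# Hypothetical scores for three (query, document) pairs.
teacher_scores = [11.0, 3.0, -2.0]
student_scores = [10.0, 3.0, 0.0]

loss = pointwise_mse(student_scores, teacher_scores)
grad = pointwise_mse_grad(student_scores, teacher_scores)
print(loss)
print(grad)
```

Each pair contributes its own term, which is what makes the objective pointwise: a row needs only a teacher score, with no explicit negatives or in-batch comparisons.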

For multi-node training (anything past 17m/32m), launch the same script with torchrun:

# Single-node (17m, 32m): defaults work
python train.py

# Multi-node 4n setup for 150m, preserves global_batch_size=192:
torchrun --nproc_per_node=8 --nnodes=4 ... train.py
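The per-GPU batch size for a given launch follows from the script's `global_batch_size // world_size` line. The sketch below works through that arithmetic for the 4-node example, assuming torchrun sets WORLD_SIZE to nproc_per_node × nnodes. The helper name and the divisibility check are illustrative additions, not part of the training script.

```python
def per_device_settings(global_batch_size, nproc_per_node, nnodes):
    """Mirror the script's batch-size and dataloader-worker arithmetic."""
    world_size = nproc_per_node * nnodes
    # Added check (not in the script): floor division would otherwise
    # silently shrink the effective global batch size.
    if global_batch_size % world_size != 0:
        raise ValueError("global_batch_size must be divisible by world_size")
    per_device_batch_size = global_batch_size // world_size
    dataloader_workers = 0 if world_size > 8 else 4
    return world_size, per_device_batch_size, dataloader_workers


# 150m setup from the command above: 4 nodes with 8 processes each.
world_size, per_device_batch_size, dataloader_workers = per_device_settings(
    192, nproc_per_node=8, nnodes=4
)
print(world_size, per_device_batch_size, dataloader_workers)  # 32 6 0
```

With 32 processes, each GPU sees 6 examples per step, so the global batch stays at 192 without gradient accumulation.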

Conclusion

The ettin-reranker-v1 family, trained with a single simple recipe, is state-of-the-art at every released size up to 1B parameters. Pointwise MSE distillation from a strong teacher onto a broad-domain and retrieval-specific mix scales cleanly from 17M to 1B parameters, with only the learning rate and per-device batch size changing between sizes.

Every ettin-reranker-v1 model beats the ms-marco-MiniLM-L*-v2 family by a comfortable margin on MTEB and NanoBEIR. cross-encoder/ettin-reranker-150m-v1 is the strongest mid-tier reranker I tested in the under-600M range, cross-encoder/ettin-reranker-400m-v1 lands within 0.0024 of the 1.54B teacher's MTEB score, and cross-encoder/ettin-reranker-1b-v1 matches that teacher within 0.0001.

Everything in one place:

Models:

cross-encoder/ettin-reranker-17m-v1

cross-encoder/ettin-reranker-32m-v1

cross-encoder/ettin-reranker-68m-v1

cross-encoder/ettin-reranker-150m-v1

cross-encoder/ettin-reranker-400m-v1

cross-encoder/ettin-reranker-1b-v1

Dataset: cross-encoder/ettin-reranker-v1-data with ~143M (query, document, label) triples, kept as 39 named subsets (one config per source) so the provenance of every row is visible.

Training script: the ~150 lines in Overall Training Script above, which is the same script used for all six models.

If you build something on top of these, please let me know! I'd genuinely love to see what people do with them, and if you manage to train better rerankers using the released data, even better. The recipe is intentionally simple, partly so that there's pl