How to benchmark an Arabic LLM you can actually trust

A leaderboard score is the least informative number you will see about an Arabic model. Here is why translated benchmarks mislead, where aggregate scores hide the failures that matter, and the evaluation protocol we recommend before anything ships.

Bayanat Labs Research · April 2026 · 10 min read

Every Arabic model announcement arrives with a benchmark table, and every benchmark table tells the same comfortable story: the model is competitive. Then the model meets production traffic — dialectal, code-switched, culturally loaded — and support tickets tell a different story. The gap is rarely the model's fault alone. It is the evaluation's.

Four ways Arabic benchmarks mislead

1. Translated benchmarks measure translation, not Arabic

Much of the Arabic evaluation stack is machine-translated from English suites. Translation carries over English framing, references and idiom, producing "translationese" that no Arabic speaker would write — so the eval rewards models that are good at translated English rather than good at Arabic. Worse, culturally specific questions arrive intact: a model can ace translated trivia about American institutions while knowing little about the region it will serve.

2. MSA-only evaluation tests the register your users don't speak

Arabic is diglossic: formal writing happens in Modern Standard Arabic, but everyday speech and chat happen in dialect. An evaluation composed of clean MSA prompts leaves the model untested on the inputs it will actually receive. A single "Arabic" score with no dialect breakdown should be treated as an MSA score.

3. Aggregates hide the failures that matter

Averaging across tasks and varieties produces one flattering number. But a model that is strong in Egyptian and weak in Gulf is not "moderately good" — it is unusable for a Saudi deployment. Scores must be reported per variety and per capability, or they conceal exactly the risk you are trying to measure.

4. Orthography and tokenization skew automatic metrics

Arabic writing tolerates surface variation — hamza forms, taa marbuta, optional diacritics — that exact-match metrics punish arbitrarily unless normalization is handled deliberately. And Arabic text typically fragments into more tokens per word than English in common tokenizers, which quietly affects context budgets, latency and cost comparisons between models.
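To make "handled deliberately" concrete, here is a minimal sketch of a normalization pass applied to both prediction and reference before exact-match scoring. The specific folding choices (all hamza-carrying alef forms to bare alef, taa marbuta to haa, alef maqsura to yaa, diacritics and tatweel stripped) are our assumptions for illustration. They are lossy, and the right set depends on the task, so treat the function names and the mapping as hypothetical rather than a standard.

```python
import re
import unicodedata

# Short vowels and other tashkeel (U+064B..U+0652), superscript alef (U+0670),
# and tatweel / kashida (U+0640). Assumption: all are optional for matching.
_OPTIONAL_MARKS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

# Assumption: these letter variants are treated as equivalent for scoring.
_LETTER_FOLDS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0629": "\u0647",  # taa marbuta           -> haa
    "\u0649": "\u064A",  # alef maqsura          -> yaa
})


def normalize_arabic(text: str) -> str:
    """Fold common surface variants so that scoring ignores them."""
    text = unicodedata.normalize("NFKC", text)  # unify presentation forms
    text = _OPTIONAL_MARKS.sub("", text)        # drop diacritics and tatweel
    text = text.translate(_LETTER_FOLDS)        # fold letter variants
    return " ".join(text.split())               # collapse whitespace


def exact_match(prediction: str, reference: str) -> bool:
    """Exact match computed on normalized text rather than raw strings."""
    return normalize_arabic(prediction) == normalize_arabic(reference)


# "Ahmad" written with and without the hamza on the initial alef.
with_hamza = "\u0623\u062D\u0645\u062F"
without_hamza = "\u0627\u062D\u0645\u062F"
print(with_hamza == without_hamza)              # False: raw strings differ
print(exact_match(with_hamza, without_hamza))   # True: same word after folding
```

Whatever folding you choose, apply it identically to every model under comparison and state it alongside the score, since the choice itself moves the number.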

Figure 1. Illustrative pattern we see repeatedly in client evaluations: the same model scores 20–35 points lower when tested on native, dialect-stratified prompts judged by native raters than on translated MSA benchmarks. The leaderboard number is the ceiling of the marketing narrative, not of the product.

What to measure instead: five dimensions

A trustworthy Arabic evaluation reports each of the following separately, per dialect and per domain — never as one blended number:

Dialect comprehension. Does the model correctly interpret inputs in your markets' varieties — including code-switching and Arabizi — or only in MSA?
Generation register. Are the model's answers natural for the audience: right variety, right formality, free of translationese? Native speakers can tell in one sentence; your eval should ask them.
Factuality with regional grounding. Accuracy on the entities, institutions and context of the deployment region — not translated general knowledge.
Cultural and safety alignment. Handling of religious content, social norms and sensitive regional topics, red-teamed in dialect. Safety behavior established in English does not automatically transfer.
Domain task performance. The actual jobs to be done — summarizing a contract, extracting a KYC field, triaging a symptom description — measured on your own task distribution.

Figure 2. Report the profile, not the average. Two models with identical mean scores can have opposite failure modes — and only the profile tells you which one is safe to ship for your market.
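The point about identical means can be shown in a few lines. This is a sketch under assumed conventions: each rated item is a (dialect, capability, score) tuple on a 0 to 100 rubric scale, and the numbers below are invented purely for illustration, not results from any real model. The helper names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def overall_mean(records):
    """The single blended number: mean score over all rated items."""
    return mean(score for _, _, score in records)


def score_profile(records):
    """Mean score per (dialect, capability) cell."""
    cells = defaultdict(list)
    for dialect, capability, score in records:
        cells[(dialect, capability)].append(score)
    return {cell: mean(scores) for cell, scores in cells.items()}


def weakest_cells(profile, threshold):
    """Cells scoring below the threshold, worst first."""
    failing = [cell for cell, score in profile.items() if score < threshold]
    return sorted(failing, key=lambda cell: profile[cell])


# Invented scores for two hypothetical models.
model_a = [
    ("egyptian", "comprehension", 90), ("egyptian", "safety", 80),
    ("gulf", "comprehension", 50), ("gulf", "safety", 60),
]
model_b = [
    ("egyptian", "comprehension", 50), ("egyptian", "safety", 60),
    ("gulf", "comprehension", 90), ("gulf", "safety", 80),
]

for name, records in (("A", model_a), ("B", model_b)):
    profile = score_profile(records)
    print(name, overall_mean(records), weakest_cells(profile, threshold=65))
```

Both models report the same blended score, yet the list of failing cells is disjoint: one fails every Gulf cell, the other every Egyptian cell. The threshold of 65 is arbitrary here; in practice it would come from your own ship criteria.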

A protocol that holds up

The mechanics matter as much as the dimensions. The evaluations we trust share five properties:

Native construction. Prompts are authored natively in each target variety by native speakers — never translated from English — and stratified to match your real traffic mix.
Qualified judges. Raters are native speakers of the variety they judge, and licensed practitioners for domain content. A Gulf response judged by a non-Gulf rater is a coin flip on register.
Calibrated rubrics with agreement tracking. Judges score against written rubrics, calibrated on gold examples, with inter-annotator agreement measured and reported. If agreement is low, the eval is measuring the raters, not the model.
Blind, randomized comparisons. Model identities hidden, output order randomized, contamination checked — the eval set must not overlap the training data, which rules out most public benchmarks for anything decisive.
Living test sets. Refresh items over time and hold out a private split, so scores keep meaning something after the first run.
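The blind, randomized comparison step can be sketched as follows. This is one possible arrangement for a two-model pairwise eval, not a prescribed tool: the function names, the dictionary fields and the "A"/"B" labels are all hypothetical. The idea is that raters receive a sheet with shuffled items and a randomized left/right position, while the answer key that maps positions back to models stays with whoever runs the eval.

```python
import random


def build_blind_pairs(prompts, outputs_a, outputs_b, seed=0):
    """Return (rater_sheet, answer_key) for a blind pairwise comparison.

    The rater sheet never mentions which model wrote which response.
    The answer key maps item_id -> (model behind response_1, model behind response_2).
    """
    if not (len(prompts) == len(outputs_a) == len(outputs_b)):
        raise ValueError("prompts and both output lists must have equal length")

    rng = random.Random(seed)        # fixed seed so the sheet is reproducible
    order = list(range(len(prompts)))
    rng.shuffle(order)               # randomize item order

    rater_sheet, answer_key = [], {}
    for item_id, idx in enumerate(order):
        flipped = rng.random() < 0.5  # randomize which model appears first
        first, second = outputs_a[idx], outputs_b[idx]
        if flipped:
            first, second = second, first
        rater_sheet.append({
            "item_id": item_id,
            "prompt": prompts[idx],
            "response_1": first,
            "response_2": second,
        })
        answer_key[item_id] = ("B", "A") if flipped else ("A", "B")
    return rater_sheet, answer_key


def unblind(votes, answer_key):
    """Convert rater votes {item_id: 1 or 2} into win counts per model."""
    wins = {"A": 0, "B": 0}
    for item_id, choice in votes.items():
        wins[answer_key[item_id][choice - 1]] += 1
    return wins
```

Keeping the answer key out of the rater-facing file is the whole design: unblinding happens only after all votes are collected. A contamination check against training data is a separate step and is not covered by this sketch.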

None of this is exotic — it is the same discipline frontier labs apply to English evaluation, applied with people who actually speak the varieties being measured. The expensive part is the human judgment. That is also the part that cannot be skipped: for a diglossic language, human raters are not a nice-to-have on top of automatic metrics; they are the metric.
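If the raters are the metric, their agreement is worth computing explicitly. The protocol above does not name a statistic, so as an assumption we use Cohen's kappa for two raters assigning categorical labels; for ordinal rubric scores or more than two raters, a weighted kappa or Krippendorff's alpha may be a better fit. A minimal sketch:

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")

    n = len(labels_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each rater's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is
        # perfect, but kappa is undefined. Returning 1.0 is a convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical register judgments from two raters on the same six responses.
rater_1 = ["natural", "natural", "stilted", "natural", "stilted", "stilted"]
rater_2 = ["natural", "natural", "stilted", "stilted", "stilted", "natural"]
print(round(cohen_kappa(rater_1, rater_2), 3))
```

Kappa corrects raw agreement for what two raters would match on by chance, which is why it is more informative than a simple percentage when one label dominates. Report it per variety, since agreement can be high among Egyptian raters and low among Gulf raters on the same rubric.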

Key takeaways

Distrust single numbers. One blended "Arabic score" is an MSA score with better branding — demand per-dialect, per-capability reporting.
Translated benchmarks flatter. Expect materially lower scores on natively authored, dialect-stratified evals; that lower number is the honest one.
The judges are the metric. Native raters, calibrated rubrics and measured agreement are what make an Arabic eval trustworthy.
Evaluate before and after you buy. Use an independent eval to select the model, then re-run it on every fine-tune to catch regressions in register and safety.

Want an evaluation you can defend?

We build independent, dialect-stratified benchmarks with native raters and calibrated rubrics — for model selection, fine-tune regression testing and pre-launch sign-off.
