Arabic Is Twenty Languages Inside One — At Least
I’m from Damascus and I lead a team training AI models for the Gulf Arabic market. There’s no hypocrisy in this — it’s the market. And the market is what decides which dialect the machine learns and which it forgets. The third article in “The Machine’s Language” examines Arabic as twenty languages inside one — and asks: who is responsible for the bias?
A Personal Confession to Start With
I’m from Damascus. My Levantine dialect is the language I think in when I’m not thinking — the one that surfaces automatically when I’m surprised, when I laugh, when I forget I’m monitoring myself. And yet, for some time now, I’ve been leading a team that trains AI models specifically for the Gulf Arabic market. Text datasets, conversational samples, linguistic evaluations — all of it in a dialect that isn’t my mother tongue, for a market that isn’t my home market.
There’s no hypocrisy in this. It’s work, and it’s a market. The demand is there, and the supply goes where the work is. Gulf freelancers who specialize in this kind of AI training work can be counted on one hand. So the market fills with people like me: translators and writers from other dialect backgrounds producing Gulf-inflected content because the projects pay, and because Gulf countries are investing billions in exactly this direction.
This personal confession is the door into the third article — because it embodies, in lived form, the idea we’ll arrive at by the end: the linguistic bias in AI doesn’t come from the AI. It comes from us.
Twenty Languages Under One Roof
When we say “Arabic” we behave as though we’re discussing a single, coherent language. But Nour Al Hassan, founder of Arabic.ai, describes it with more precision: Arabic isn’t one language — it’s a family of dialects layered over a deep classical base, and each dialect may use an entirely different word to express the same thing.[1]
The small word “بس” (bas) is a simple but striking example: in Egypt it means “only”; in the Levant it means “but.”[1] One word, two unrelated meanings, in two countries that share a history, a classical written language, and centuries of common civilization. An AI model that confuses them isn’t making a language error. It’s making an identity error.
And beyond vocabulary lies the harder territory: rhythm, cultural subtext, the unstated assumption. When a Lebanese speaker says “يسلمو إيدك” — “may your hands be safe” — they’re not just expressing thanks; they’re acknowledging the spirit of the craftsperson. When a Yemeni says “والله زين” — “by God, it’s good” — they’re invoking the divine as a partner in appreciation. These aren’t translations. They’re different worlds with cultural keys that only open from the inside. No general-purpose AI model yet handles all of them with the fluency they deserve.
Numbers That Should Embarrass Us
Arabic is the language of 491 million people. It’s the fourth most used language on the internet. And yet Arabic accounts for less than 1% of the text available online for AI training — and most of that tiny fraction is Modern Standard Arabic, the language of news broadcasts and government communiqués, not the living dialects people actually speak every day.
In natural language processing research, the picture isn’t much different. Dr. Kareem Darwish, from the Fanar team at Qatar’s Hamad Bin Khalifa University, described the problem in one unsparing sentence: “Language models are a mirror of our research and innovation capacity.” Arabic’s poverty in AI is, on this reading, a mirror of Arab research poverty before it reflects any bias on the part of foreign companies.
The most visible gap is in spoken language recognition. Amsal Kapetanovic, head of Infobip in Saudi Arabia, puts it plainly: “A Lebanese speaker and a Saudi speaker might use different words and speak at different speeds, making it challenging for a single model to process spoken Arabic with consistent accuracy.” The gap is not confined to speech. A 2025 study in JMIR Medical Informatics found that Arabic models still lag behind English by 10 to 20 percent in complex reasoning tasks, even after years of accelerating investment.[2]
Egyptian or Gulf? The Wrong Question
In the previous article, we suggested that Gulf Arabic is the most digitally documented Arabic variety. The true picture is more layered than that — and more interesting.
Egyptian Arabic dominates culturally, without contest. That dominance was built over half a century of cinema, television, and music — from Umm Kulthum to black-and-white films to serialized dramas watched from Morocco to Kuwait. The effect on AI training data is direct: the overrepresentation of Egyptian Arabic in available datasets has likely inflated model performance in that dialect specifically. In other words, AI partially understands Egyptian because Umm Kulthum and Adel Imam taught it before any tech company did.
Gulf Arabic dominates economically and digitally — a different kind of dominance entirely. Gulf states have the highest internet penetration rates in the Arab world: Kuwait and the UAE approach 100%.[3] This means more written digital content, more active employment platforms, denser commercial web presence, deeper social media engagement — all of which feeds training data continuously and indirectly.
So the question isn’t “which dialect is better represented.” It’s which kind of presence feeds the models: the cultural and media presence led by Egypt, or the economic and digital presence led by the Gulf. The result is that other dialects fall between the two: Levantine, Moroccan, Yemeni, and Sudanese Arabic are nearly absent from training data, despite having tens of millions of speakers between them.
The Law of Supply and Demand Decides the Machine’s Language
Here we arrive at the idea that makes my personal story more than a passing detail. What is happening in the Arabic AI market isn’t a hidden cultural bias — it’s market logic operating with full efficiency and uncomfortable transparency.
Saudi Arabia alone has launched a $1 billion GAIA initiative to accelerate generative AI development and announced “Project Transcendence,” a $100 billion program to become a global AI power. Its sovereign wealth fund, the Public Investment Fund, is committing hundreds of billions to AI infrastructure. The UAE co-finances the Stargate Project alongside OpenAI through its MGX fund.[4] Qatar is developing the Fanar model through the Qatar Computing Research Institute. Bahrain is drafting the region’s first AI-specific regulatory law.[4]
The practical effect of this investment is visible and concrete, but one detail reveals the full paradox: Jais, the UAE’s most advanced Arabic language model, was forced to rely on over 70% English training content because sufficient Arabic data simply didn’t exist, weakening its performance on higher reasoning tasks in the very language it was built to serve.[5] Even the model made to serve Arabic was built on an English foundation.
The massive Gulf investment doesn’t correct AI’s bias toward Gulf Arabic — it deepens it and gives it the legitimacy of the market. But at the same time it does close the gap between Arabic and English overall. A productive contradiction with no tidy resolution.
So Where Is the Bias — In AI or In Us?
I return to myself — the Damascene producing Gulf Arabic training data. Why do I do this? Not because Gulf Arabic is more beautiful or more deserving by any linguistic measure. But because the contract is there, the client is there, the project pays. I’m not deliberately excluding Levantine or Yemeni Arabic from AI’s database — but I am, practically, enriching one dialect and impoverishing others. And I’m one of thousands doing the same thing.
This is the actual bias: no algorithm that dislikes you, no model that decides your dialect is less worthy. Just thousands of small decisions — a freelancer accepting a project, a company responding to market demand, a government investing in technology that serves its own citizens — accumulating to produce a model that resembles whoever spent the most on shaping it.
Noam Chomsky argued that power doesn’t require malicious intent to operate — it only requires structures that make certain choices easier than others.[6] The Arabic AI market is exactly those structures: no hostility toward Levantine or Yemeni, but an economic architecture that makes funding those dialects harder and less attractive. The result is bias without a biased party — which is perhaps the most difficult kind to confront.
Is There a Way Out?
There is genuine light at the end of this tunnel — not manufactured optimism, but documented movement. Robin Voogd, head of Middle East investments at a technology investment firm, frames it as “a huge gap and a major opportunity: whoever builds the best models for Arabic will gain a strategic data advantage in a massive underserved market.”[1]
From the UAE comes Falcon Arabic, designed to handle a range of dialects from Modern Standard Arabic to Levantine. From Saudi Arabia come ALLaM, Mulhem, and Humain Chat. From Qatar comes Fanar. From Egypt comes Intella. Dr. Darwish says clearly: “Building robust Arabic language models is not a technological luxury, but a strategic necessity to ensure that the Arab world has a voice in shaping the future of AI.”[7]
But most of these models will serve first the market that funded them. Which means Yemeni, Sudanese, Moroccan, and Levantine speakers will continue to wait for their turn — not because anyone opposes them, but because the market hasn’t reached them yet.
This is the essential difference between bias and exclusion: the first is a choice, the second is a consequence. The fact that the second isn’t the first doesn’t make it less real for the people it affects.
In the next article, we’ll see how accumulated memory — the feature every user demands and dreams of — is the shift that might flip this entire equation. But it may also deepen the bias in ways nobody anticipated. (See our article: How AI Learns From You — and What It Actually Knows About You)
→ Article 4: Everyone Says: If Only AI Had a Better Memory!
References
[1] Al Hassan, N. (2025). Quoted in: “Why every Arab country is racing to build its own large language model.” The National, September 2025. thenationalnews.com
[2] Kapetanovic, A. (2025). Quoted in: “Teaching machines to speak Arabic.” Arab News, November 2025. arabnews.com. References a JMIR Medical Informatics (2025) study on Arabic model performance.
[3] Internet World Stats (2024). “Internet Usage Statistics for the Middle East.” internetworldstats.com
[4] Digital Bricks (2025). “The State of AI in the Middle East (2025).” digitalbricks.ai
[5] Imam, A. (2025). “Arabic LLMs: The AI Frontier You’re Not Paying Attention To.” Medium / The Geopolitical Economist, October 2025. medium.com
[6] Chomsky, N. (1999). Profit Over People: Neoliberalism and Global Order. Seven Stories Press.
[7] Darwish, K. (2025). Quoted in: “Why We Need Arabic Language Models.” Nature Middle East, August 2025. natureasia.com
