UAE's Jais Arabic large language models teaching AI to understand cultural context

29 June 2026 21:21

BATOOL GHAITH (ABU DHABI)

The UAE's efforts to develop homegrown artificial intelligence are going beyond building powerful language models to teaching AI to understand the Arabic language and the cultural context behind it.

Jais, the family of Arabic-centric large language models developed through a collaboration between Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), G42's Inception and Cerebras Systems, was designed to address gaps left by global AI models that were primarily trained on English-language data.

During a webinar hosted by Abu Dhabi-based AI company Polynome on Monday, Natalia Vassilieva, Vice President and Field CTO at Cerebras Systems, and Preslav Nakov, Professor of Natural Language Processing at MBZUAI, explained the research and engineering behind developing Arabic foundation models at scale.

More Than Translating AI Into Arabic
Vassilieva said the challenge was not simply teaching AI another language. Most leading foundation models were originally designed around English, which makes them less efficient when processing Arabic, she explained.

"Existing vocabularies often split Arabic words into many smaller pieces, meaning the models require more computation while handling less text within the same context window," Vassilieva said.
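The effect Vassilieva describes can be illustrated with a rough, worst-case sketch. It assumes a byte-level tokeniser that has learned no Arabic merges at all, so every UTF-8 byte becomes one token. That assumption, the sample phrases, the function names and the 4,096-token budget are all illustrative choices; none of them describes the Jais tokeniser or any specific production model.

```python
# Illustrative only: a worst-case byte-fallback tokeniser, where each
# UTF-8 byte costs one token. Not the Jais tokeniser or any real vocabulary.

def tokens_per_word(text: str) -> float:
    """Average tokens per whitespace-separated word under byte fallback."""
    words = text.split()
    total_tokens = sum(len(word.encode("utf-8")) for word in words)
    return total_tokens / len(words)

def words_in_context(text: str, context_tokens: int) -> int:
    """Roughly how many words like those in `text` fit in a token budget."""
    return int(context_tokens / tokens_per_word(text))

english = "language model"
# Arabic for "language model", written with escapes so the code points are explicit.
arabic = "\u0646\u0645\u0648\u0630\u062c \u0644\u063a\u0648\u064a"

CONTEXT_TOKENS = 4096  # hypothetical context window size

english_rate = tokens_per_word(english)
arabic_rate = tokens_per_word(arabic)

print(f"English: {english_rate:.1f} tokens per word")
print(f"Arabic:  {arabic_rate:.1f} tokens per word")
print(f"English words in window: {words_in_context(english, CONTEXT_TOKENS)}")
print(f"Arabic words in window:  {words_in_context(arabic, CONTEXT_TOKENS)}")
```

Arabic letters occupy two bytes each in UTF-8, so under this assumption the Arabic phrase costs more tokens per word even though it has fewer letters. A vocabulary with dedicated Arabic entries would narrow that gap.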

Beyond language efficiency, global models also lack understanding of many aspects of Arab culture, traditions, religion and region-specific contexts, she added.

"We wanted all of that to get fixed," Vassilieva said, explaining that the project required rethinking nearly every stage of model development rather than simply adapting existing systems.

According to Vassilieva, the team built a new Arabic tokeniser and vocabulary specifically designed to represent Arabic more efficiently while maintaining balanced performance across Arabic, English and programming languages.

Building Arabic AI from the Ground Up
Creating the models also required building an extensive Arabic data ecosystem, Vassilieva noted.

Researchers collected Arabic-language datasets from multiple public and regional sources, compiling an initial corpus of around 55 billion Arabic tokens before expanding the dataset further through high-quality translated material and multilingual training, she said.

Vassilieva explained that the team relied heavily on scaling laws, empirical relationships fitted to thousands of small-scale experiments, to predict how larger models would perform before committing significant computing resources.
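As a rough illustration of the idea, the sketch below fits a simple power law, loss = a * compute^(-b), to a handful of small runs and extrapolates to a much larger budget. The single-variable form, the function names and the data points are assumptions made for this example. The points are synthetic, generated from a known curve, and are not Jais results. Scaling-law studies in practice often use richer forms that account for model size, dataset size and an irreducible loss term.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**(-b), done in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least-squares slope and intercept of log(loss) against log(compute).
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict_loss(a, b, compute):
    """Extrapolate the fitted curve to a new compute budget."""
    return a * compute ** (-b)

# Synthetic "small runs" drawn from a known curve (illustrative values only).
TRUE_A, TRUE_B = 10.0, 0.05
small_budgets = [1e15, 1e16, 1e17, 1e18]
small_losses = [TRUE_A * c ** (-TRUE_B) for c in small_budgets]

a, b = fit_power_law(small_budgets, small_losses)
large_budget = 1e21
predicted = predict_loss(a, b, large_budget)

print(f"fitted a={a:.3f}, b={b:.3f}")
print(f"predicted loss at {large_budget:.0e}: {predicted:.3f}")
```

Because the synthetic points lie exactly on the curve, the fit recovers the generating parameters. Real measurements are noisy, which is one reason many small experiments are needed before an extrapolation can be trusted.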

The researchers also tested different combinations of Arabic and English data, concluding that carefully balanced multilingual training ultimately produced stronger Arabic models at larger scales.

Vassilieva noted that at the time of release, the original Jais family outperformed comparable open-weight models across multiple Arabic benchmarks while remaining competitive in English.

Nakov said another major challenge was evaluating Arabic models. Rather than relying solely on benchmarks translated from English, researchers developed native Arabic evaluation datasets using educational materials from across the Arab world.

The team also created Arabic instruction datasets to improve question answering and conversational performance, including datasets specifically developed around UAE knowledge, he said.

According to Nakov, these Arabic-native benchmarks have since become reference points used by others evaluating Arabic language models.

Making Arabic AI Safer
Safety was another central focus of the project, Nakov said, noting that safety measures cannot simply be translated from English because Arabic models must also account for regional culture, language and social context.

To that end, the researchers built dedicated Arabic safety datasets covering misinformation, harmful advice, hate speech, discrimination, illegal activities, privacy risks, regional political sensitivity and culturally specific topics.

Safety protections were introduced throughout the model development process, he explained, from data preparation and instruction tuning to deployment, with additional safeguards built into system prompts and user interactions.

"One safety mechanism is never enough," Nakov noted, explaining that multiple layers are needed because each individual safeguard has limitations.