
Abu Dhabi leads AI push as experts build bots that truly understand Arabic

16 Sep 2025 00:59

MAYS IBRAHIM (ABU DHABI)

Abu Dhabi is making strides in Arabic artificial intelligence, with homegrown large language models (LLMs) working to address the region’s linguistic and cultural nuances.

At the Congress of Arabic and Creative Industries, experts outlined both the promise and the unique challenges of building AI that truly understands Arabic.

The global LLM adventure is still in its early stages, according to Dr. Hakim Hacid, Chief Researcher at the Technology Innovation Institute (TII). In fact, the AI “brain” being built today could be roughly equivalent to that of a three- or four-year-old child, he said.

While AI is making progress in reasoning and knowledge synthesis, much work remains to reach the level of nuanced understanding that even young humans naturally acquire.

In May, TII launched Falcon Arabic, a 7-billion-parameter model recognised as the region’s top-performing Arabic AI model.

The model was designed to understand the structure, nuance, and culture of Modern Standard Arabic as well as Gulf, Levantine, and other key dialects.

Dr. Hacid explained TII’s distinctive approach to building Arabic LLMs: rather than translating English content, Falcon Arabic is trained exclusively on native Arabic resources.

This captures nuances and cultural footprints that translation-based models miss, he said.

A recurring theme among experts was the scarcity of high-quality Arabic data.

While English datasets available online can exceed 10 trillion tokens, Arabic remains at roughly 100–200 billion tokens, despite its linguistic richness and variety of dialects, according to Neha Sengupta, Director of Research and Development at Inception, part of G42.

She traced the development of G42’s Jais, an auto-regressive bilingual LLM for Arabic and English launched in 2023 with 14 billion parameters.

To address Arabic’s relative scarcity of digitised content, G42 undertook large-scale efforts to gather and clean native Arabic data from books, journals, and academic theses.

Architecturally, G42 employs cross-lingual transfer, blending Arabic and English in carefully calibrated proportions to teach general reasoning skills while maintaining the integrity of Arabic.

The team found that at moderate scales, a mix of roughly one-third Arabic to two-thirds English (about 33% to 66%) works best, though the optimal ratio shifts for larger models.
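For readers curious how such a ratio plays out in practice, the following is a minimal Python sketch of splitting a training-token budget and sampling documents by language. It is an illustration only, not G42's actual pipeline: the function names, the 300-billion-token total, and the reading of the reported split as exactly one-third Arabic are all assumptions made for the example.

```python
import random


def mix_token_budget(total_tokens: int, arabic_share: float) -> dict[str, int]:
    """Split a total training-token budget between Arabic and English."""
    if not 0.0 <= arabic_share <= 1.0:
        raise ValueError("arabic_share must be between 0 and 1")
    arabic = round(total_tokens * arabic_share)
    # English receives whatever is left, so the two parts always sum to the total.
    return {"arabic": arabic, "english": total_tokens - arabic}


def sample_language(rng: random.Random, arabic_share: float) -> str:
    """Choose the language of the next training document according to the mix."""
    return "arabic" if rng.random() < arabic_share else "english"


# Hypothetical total budget; the article does not give one.
TOTAL_TOKENS = 300_000_000_000
ARABIC_SHARE = 1 / 3  # assumed reading of the reported ~33% Arabic share

budget = mix_token_budget(TOTAL_TOKENS, ARABIC_SHARE)

# Draw a stream of languages and measure the share that actually came out Arabic.
rng = random.Random(0)
draws = [sample_language(rng, ARABIC_SHARE) for _ in range(10_000)]
observed_share = draws.count("arabic") / len(draws)
```

Changing `ARABIC_SHARE` is all it takes to try a different proportion, which matters here because the optimal ratio is reported to shift as models grow.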

Tarjama, a UAE-based AI and translation firm, has leveraged its 17 years of linguistic data to train models while supplementing it with synthesised content and human annotation.

Iyad Ahmad, Chief Technical Officer at Tarjama, explained that the quality and balance of data remain crucial, especially when developing models that can reason, generate content, and perform specialised tasks.

“Even with decades of data, covering dialects is the real challenge,” he told Aletihad.

While the company can currently handle about 20 dialects, Ahmad acknowledged that this is not enough to cover the full spectrum of regional variation.

To address low-resource dialects, Tarjama employs user-generated content and hires annotators to create high-quality datasets, ensuring the LLM can perform effectively across multiple linguistic contexts.

Arabic LLMs are already changing how businesses operate, according to Ahmad. Tarjama reports that translation productivity has tripled, with linguists producing up to 8,000 words a day thanks to AI assistance.

Beyond translation, Arabic models are being deployed as legal assistants, HR advisors, and customer-service agents.

Copyrights reserved to Aletihad News Center © 2026