Mays Ibrahim (ABU DHABI)
Guided by its leadership's vision for technological innovation, the UAE has been a key player in furthering Arabic language research, especially in the field of natural language processing (NLP), according to Nizar Habash, Professor of Computer Science at New York University Abu Dhabi (NYUAD).
"As an Arab, I feel an immense sense of pride in the progress we've made," Habash told Aletihad in a recent interview. "We're no longer just participants; we're becoming major competitors on the global stage."
Local institutions that foster innovation, like NYUAD and the Mohammed bin Zayed University of Artificial Intelligence (MBZUAI), play a critical role in advancing Arabic NLP in the UAE.
MBZUAI and NYUAD co-organised the first Arabic NLP Winter School during the 31st International Conference on Computational Linguistics (COLING 2025), which concluded on Sunday.
This two-day event, featuring expert-led panels, tutorials, and hands-on workshops, aimed to provide a platform for advancing Arabic NLP research in addition to fostering collaboration between students, academics, and industry professionals.
Habash noted that one of the key panels focused on the development of Arabic large language models, highlighting the efforts from the UAE, Qatar, and Saudi Arabia, countries that are leading the charge in creating AI tools specifically for the Arabic language.
The event also featured a mentoring programme pairing younger participants with experienced professionals, in addition to hands-on sessions to address Arabic NLP challenges and a mini hackathon for participants to apply their skills to real-world problems.
The Challenges of Arabic NLP
The main challenges faced by researchers in the field of Arabic NLP include orthographic ambiguity, morphological diversity, dialectal variation, and spelling inconsistencies, according to Habash.
He explained that Arabic is often written without diacritical marks, which can lead to multiple interpretations of a word, posing a challenge for disambiguation in NLP systems.
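The effect Habash describes can be shown in a few lines. The sketch below (illustrative only, not the CAMeL Lab's code) strips the Arabic diacritic characters from two distinct vocalised words, kataba ("he wrote") and kutubun ("books"), and shows that both collapse to the same undiacritized string, which is what an NLP system actually sees in most written text:

```python
# Arabic short vowels and related diacritics occupy U+064B-U+0652.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}

def dediacritize(text: str) -> str:
    """Strip diacritical marks, as most written Arabic omits them."""
    return "".join(ch for ch in text if ch not in DIACRITICS)

kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote"
kutub  = "\u0643\u064F\u062A\u064F\u0628\u064C"  # kutubun, "books"

# Both words reduce to the same three bare letters: k-t-b.
assert dediacritize(kataba) == dediacritize(kutub) == "\u0643\u062A\u0628"
```

Once the diacritics are gone, a disambiguation system must rely on context to decide which reading was intended.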
Arabic's rich morphology means that a single root word can take on thousands of forms depending on conjugation and inflection, Habash added. For instance, the Arabic verb meaning "to say" can appear in over 5,400 different forms, creating complexity for language processing tools.
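The multiplication behind that number can be sketched with a back-of-the-envelope count. The feature inventories below are rough assumptions for illustration, not the actual paradigm of any verb, but they show how inflectional features and attachable clitics combine multiplicatively into thousands of surface forms:

```python
from itertools import product

# Rough, assumed feature inventories (for illustration only):
features = {
    "aspect": ["perfective", "imperfective", "imperative"],
    "subject": [f"s{i}" for i in range(13)],        # person/gender/number combos
    "proclitic": ["none", "wa+", "fa+", "la+"],     # attachable particles
    "enclitic": ["none"] + [f"obj{i}" for i in range(12)],  # object pronouns
}

# Every combination of features is a distinct surface word form.
n_forms = len(list(product(*features.values())))
print(n_forms)  # 3 * 13 * 4 * 13 = 2028 combinations
```

Adding further dimensions (mood, voice, question particles) pushes the count into the thousands, which is the scale Habash cites.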
Moreover, Arabic's many regional dialects add another layer of complexity, Habash noted. For example, the word for "tomato" is pronounced banadora in Egypt, mateya in Tunisia, and gouta in Morocco.
Habash is the Director of the Computational Approaches to Modeling Language (CAMeL) Lab at NYUAD, where he leads research and education in artificial intelligence, with a focus on natural language processing, computational linguistics, and data science. The lab's primary research areas include Arabic NLP, machine translation, text analytics, and dialogue systems.
His team has worked on the Madar project, which studied variations across 25 cities in the Arab world, modelling how the same sentence may be spoken differently across regions.
The lab also developed several open-source tools, such as Camelira, which helps disambiguate Arabic words by offering multiple interpretations.
These resources are part of the larger CAMeL Tools toolkit, freely available to researchers and developers working on Arabic NLP, according to Habash.
Innovative Solutions for Arabic NLP
Habash explained that addressing these challenges requires a blend of rule-based and machine learning approaches.
"Historically, we used rule-based systems, where we created manual rules to process language," he said. "However, in recent years, there's been a shift toward machine learning, where large datasets are used to train algorithms to identify patterns and make predictions."
His own work on Arabic NLP combines both approaches. For instance, Camelira uses a hybrid model that applies a set of rules for word disambiguation and incorporates machine learning to predict the most likely meaning of a word based on context.
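The division of labour in such a hybrid system can be sketched as follows. This is a toy illustration in the spirit of the description above, not Camelira's actual model: the lexicon entries and scores are hypothetical, the rule-based side enumerates every possible reading of a word, and a statistical scorer (here a hand-filled table standing in for a trained model) picks the most likely reading in context:

```python
# Rule-based side: a hand-built lexicon maps an undiacritized word
# to all of its possible analyses (hypothetical entries).
ANALYSES = {
    "ktb": [
        {"lemma": "kataba", "pos": "verb", "gloss": "he wrote"},
        {"lemma": "kutub",  "pos": "noun", "gloss": "books"},
    ],
}

# Statistical side: a stand-in context model scoring each candidate
# part of speech given the previous word's part of speech. A real
# system would use a trained classifier or neural model here.
CONTEXT_SCORES = {
    ("det",  "noun"): 0.9, ("det",  "verb"): 0.1,
    ("pron", "verb"): 0.8, ("pron", "noun"): 0.2,
}

def disambiguate(word: str, prev_pos: str) -> dict:
    candidates = ANALYSES[word]          # rules propose readings
    return max(candidates,               # the model picks one
               key=lambda a: CONTEXT_SCORES.get((prev_pos, a["pos"]), 0.0))

print(disambiguate("ktb", "det")["gloss"])   # books
print(disambiguate("ktb", "pron")["gloss"])  # he wrote
```

The rules guarantee that only linguistically valid analyses are ever considered, while the learned scores handle the contextual choice among them, which is the complementary strength Habash describes.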