What it means to build local AI

Following OpenAI’s public launch of ChatGPT in November 2022, the underpinnings of AI large language models seemed firmly “WIRED”: Western, industrialised, rich, educated, and democratic. Everyone assumed that if LLMs spoke a particular language and reflected a particular worldview, it would be a Western one. OpenAI even acknowledged ChatGPT’s skew toward Western views and the English language.

But even before OpenAI’s US competitors (Google and Anthropic) released their own LLMs the following year, Southeast Asian developers had recognised the need for AI tools that would speak to their own region in its many languages – no small task, given that more than 1,200 languages are spoken here.

Moreover, in a region where distant civilisational memories often collide with contemporary, post-colonial histories, language is profoundly political. Even seemingly monolingual countries contain marked diversity: Cambodians speak nearly 30 languages; Thais, roughly 70; and Vietnamese, over 100. This is also a region where communities mix languages seamlessly, where nonverbal cues speak volumes, and where oral traditions are sometimes more prevalent than written records as a means of capturing the deep cultural and historical nuances encoded in language.

Not surprisingly, those trying to build truly local AI models for a region with so many under-represented languages have faced many obstacles, from a paucity of high-quality annotated data in sufficient quantity to a lack of access to the computing power needed to build and train models from scratch. In some cases, the challenges are even more basic: a shortage of native speakers, the absence of a standardised orthography, or frequent disruptions to the electricity supply.

Given these constraints, many of the region’s AI developers have settled for fine-tuning established models built by foreign incumbents. This involves taking a pre-trained model that has already been fed large quantities of data and training it further on a smaller dataset for a specific skill or task. Between 2020 and 2023, Southeast Asian language models such as PhoBERT (Vietnamese), IndoBERT (Indonesian), and Typhoon (Thai) were derived from much larger models such as Google’s BERT, Meta’s RoBERTa (later LLaMA), and France’s Mistral. Even the early versions of SeaLLM, a suite of models optimised for regional languages and released by Alibaba’s DAMO Academy, were built on architectures from Meta, Mistral, and Google.

But in 2024, Alibaba Cloud’s Qwen disrupted this Western dominance, offering Southeast Asia a wider set of options. A Carnegie Endowment for International Peace study finds that five of the 21 regional models launched that year were built on Qwen.

Still, just as Southeast Asian developers previously had to account for a latent Western bias in the available foundation models, now they must be mindful of the ideologically filtered perspectives embedded in pre-trained Chinese models. Ironically, efforts to localise AI and ensure greater agency for Southeast Asian communities could deepen developers’ dependence on much larger players, at least in the initial stages.

Nonetheless, Southeast Asian developers have begun to address this problem, too. Multiple models, including SEA-LION (a family of models covering 11 official regional languages), PhoGPT (Vietnamese), and MaLLaM (Malay), have been pre-trained from scratch on a large, generic dataset in each particular language. This key step in the machine-learning process will allow these models to be further fine-tuned for specific tasks.

Although SEA-LION continues to rely on Google’s architecture for its pre-training, its use of a regional-language dataset has facilitated the development of homegrown models.

But representing native perspectives also requires a strong base of local knowledge. We cannot faithfully present Southeast Asian perspectives and values without understanding the politics of language, traditional sense-making, and historical dynamics.

For example, time and space are perceived differently in many indigenous communities. Balinese historical writings that defy conventional patterns of chronology might be viewed as myths or legends in Western terms, but they continue to shape how these communities make sense of the world.

Historians of the region have cautioned that applying a Western lens to local texts heightens the risk of misinterpreting indigenous perspectives.
