Singapore has developed SEA-LION, a Southeast Asian language model, as an alternative to ChatGPT that aims to represent the region more accurately. Large models such as Llama 2 and Mistral AI's models, which are trained predominantly in English, often produce gibberish when prompted in Southeast Asian languages. SEA-LION (Southeast Asian Languages in One Network), a Singaporean government initiative, aims to address this imbalance by training the model on Southeast Asian languages and cultural content.
Trained on 11 Southeast Asian languages, including Vietnamese, Thai, and Bahasa Indonesia, the model offers a more economical and efficient option for businesses, governments, and academics in the region, Leslie Teo of AI Singapore told SCMP. Teo emphasizes that the initiative aims to complement, rather than compete with, other efforts to better represent Southeast Asia. He acknowledges that the model is not perfect, but says it is a step towards correcting the biases inherent in US-centric large language models (LLMs).
Although more than 7,000 languages are spoken worldwide, most LLMs, such as OpenAI's GPT-4 and Meta's Llama 2, are designed and trained primarily for English, creating a gap in language representation.
Governments and technology companies are attempting to bridge this gap: India is creating datasets in local languages, the United Arab Emirates is building LLMs for Arabic, and China, Japan, and Vietnam are developing AI models in their own languages.
According to Nuurrianti Jalli, an assistant professor in the School of Communications at Oklahoma State University, these models can help local populations participate more equitably in a global AI economy dominated by large technology companies. Researchers have also found that multilingual language models can effectively infer semantic and grammatical relationships between resource-rich and resource-poor languages.
These models can be used in applications ranging from translation to customer service chatbots, to content moderation on social media platforms struggling to identify hate speech in low-resource languages such as Burmese or Amharic. Southeast Asian languages make up 13% of SEA-LION's training data, more than in any other major LLM; the data also includes over 9% Chinese text and about 63% English, according to Teo.
Teo also notes that multilingual language models are often trained on translated text and low-quality data, which may contain errors. As a result, AI Singapore is meticulous in selecting the data used to train SEA-LION, Teo said, speaking from the initiative's office at the National University of Singapore.
Governments are increasingly contributing data, and companies are testing SEA-LION. With its smaller size, SEA-LION can be deployed faster and at a lower cost for customization and adoption, Teo explained.
Nonetheless, digital-rights experts have raised a notable concern about countries and regions developing their own LLMs: such initiatives might inadvertently reinforce prevailing online narratives, particularly in nations with authoritarian governments, stringent media censorship, or fragile civil societies.
When asked about former Indonesian President Suharto, for example, Llama 2 and GPT-4 address his checkered human rights record, whereas SEA-LION emphasizes his achievements.
Aliya Bhatia of the Center for Democracy & Technology points out that when a model is trained exclusively on favorable content about a government, it tends to adopt a favorable outlook while disregarding dissenting viewpoints. And while regional LLMs can capture linguistic and cultural subtleties, their knowledge of broader global contexts may be more limited.