Preserving Non-English Languages in the Digital Age: Your Role in Empowering Communities with AI

In a world dominated by the English language, artificial intelligence (AI) development has often overlooked the rich diversity of non-English languages. As AI advances, this has real-world consequences for communities whose languages are underrepresented in digital spaces. However, you can be part of a growing movement of researchers and developers who are pushing back against this dominance and working to preserve linguistic diversity through AI.

The English-Centric Problem in AI Development

AI models are predominantly trained on massive amounts of English text, leaving other languages underserved. Did you know that language models are trained on nearly a thousand times more English text than text in other languages? This imbalance leads to tangible problems for racialized and marginalized communities.

For example, there have been instances where inaccurate medical advice in Hindi was provided due to poor AI translations, or where mistranslations in Arabic resulted in wrongful arrests. Worse still, AI models that fail to moderate hate speech in languages like Amharic and Tigrinya have fueled violence in places like Ethiopia.

How Big Tech Falls Short in Multilingual AI

Prominent tech companies have attempted to address this issue, but many of their solutions are flawed. Relying on machine translations, these models often lack the cultural context necessary to serve non-English speakers effectively. Historically, colonial powers contributed to the erosion of indigenous languages, and many AI efforts by companies like Meta and Google continue this trend, reinforcing English as the global default.

You can see how important it is to shift the narrative.

Grassroots AI Research Groups to the Rescue

Thankfully, there is hope in the form of community-driven AI research groups like Masakhane (for African languages), AI4Bharat (for Indian languages), and AmericasNLP (for Indigenous languages of the Americas). These organizations are focused on empowering local communities by developing AI tools that directly address their linguistic and cultural needs. This approach is key to preserving diverse languages in an increasingly digital world.

Community Participation: A Core Component

One of the most powerful aspects of these research groups is their dedication to community involvement. Rather than imposing top-down solutions, these groups work closely with native speakers, language experts, and local communities to build datasets and train AI models that reflect the unique cultural and linguistic nuances of the regions they serve.

For example, AI4Bharat’s IndicVoices dataset captures speech in 22 Indian languages. This dataset is unique because it includes community voices from different regions, genders, and professions, so that slang, idioms, and regional dialects are represented. By engaging community members, AI4Bharat has created tools that directly benefit society, such as subtitling higher education videos and translating judicial documents in India and Bangladesh.

The Risks of Tokenism and How to Avoid It

It’s important to recognize that not all AI efforts have been equally beneficial to the communities they serve. Some projects risk tokenism, where communities are consulted for data collection, but the benefits of AI tools built with their data don’t return to them.

That’s why groups like AmericasNLP and SIGARAB have adopted collaborative approaches that prioritize the needs of the communities they serve. For instance, AmericasNLP is working to revitalize indigenous languages by developing AI tools to create educational materials, ensuring that the next generation can learn endangered languages. SIGARAB’s initiatives focus on Arabic, using AI to detect propaganda and combat media bias in news coverage.

Data Ownership and Ethical AI Development

One of the most critical elements of this movement is ensuring data ownership and inclusive authorship. Many language technology projects have historically followed Western approaches to data sharing, where only certain forms of participation, such as data analysis, are credited, leaving out the original contributors who helped build the datasets.

In contrast, groups like IndoNLP have pioneered new models for ethical data sharing. IndoNLP’s NusaCrowd project collected data for Indonesian languages such as Javanese and Sundanese, and made sure that control of the data remained with the original contributors. This departs from traditional practice, in which companies often retain ownership of the data they collect.

Another great example is Masakhane, a grassroots African language initiative that recognizes contributions from community members in both data creation and lived experiences. Masakhane’s partnership with Lelapa AI has been particularly impactful in building AI products that serve African communities.

Why Your Involvement Matters

As the field of AI continues to evolve, you can play a vital role in ensuring that non-English languages are not only preserved but also empowered in the digital age. Research groups like Masakhane, AI4Bharat, and AmericasNLP are providing a blueprint for how to develop more inclusive AI systems that serve diverse cultures and languages.

But this work requires support. Policymakers must prioritize the inclusion of non-English languages in technology, and companies need to collaborate with these research groups to create AI systems that reflect cultural diversity. You can advocate for this inclusion, ensuring that more communities benefit from AI in their native languages.

The Path Forward

To continue this critical work, more support and funding are necessary from governments and international organizations. AI research groups are not only challenging the dominance of English-centric technology but are also showing the world how AI can be developed with a community-driven approach. The future of AI should be one where all languages, cultures, and communities have a voice in how the technology evolves.

Now is the time to act. By learning from these groups, you can help ensure that language diversity is preserved and that communities across the globe can benefit from AI tools designed for their unique needs.
