Leyu is a platform dedicated to providing high-quality datasets for low-resource languages. In response to the increasing demand for linguistic data in AI and Natural Language Processing (NLP) applications, Leyu employs an impact-driven crowdsourcing model to gather ethical, relevant, and accurate language datasets. Our initial focus is on five primary Ethiopian languages (Amharic, Afaan Oromo, Tigrinya, Af-Somali, and Sidama) along with their diverse dialects. This approach is designed to support businesses and organizations aiming to improve their AI and digital solutions with localized language expertise.
Most AI tools today rely on data from only about ten dominant languages, leaving 60% of speakers of low-resource African languages underserved. This digital divide limits access to AI-driven tools, critical services, and culturally relevant content. The datasets that do exist are often error-prone, ethically compromised, and culturally biased because of inadequate collection and consent practices.
Leyu steps in to bridge this gap as an open-source crowdsourcing platform that empowers African communities to collect, refine, and ethically source data for AI applications.
Named after the Amharic word for “identify” or “label,” Leyu fosters a community-driven approach to language data, improving inclusivity and accuracy to make technology accessible and culturally relevant for Africa and beyond.