About African Voices
African Voices is a large-scale multilingual speech dataset developed by Data Science Nigeria. It contains more than 3,000 hours of transcribed audio across five languages spoken in Nigeria and Mali: Bambara, Hausa, Igbo, Nigerian Pidgin, and Yorùbá.
Purpose
The dataset supports Automatic Speech Recognition (ASR) and other speech technology for low-resource African languages. It combines scripted and spontaneous speech, collected through community-centered, ethical protocols that respect linguistic and cultural diversity.
Key Features
- 3,000+ hours of transcribed audio
- 5 African languages supported
- Community-centered approach
- Ethical and culturally respectful
Supported Languages
- 🇲🇱 Bambara
- 🇳🇬 Hausa
- 🇳🇬 Igbo
- 🇳🇬 Nigerian Pidgin
- 🇳🇬 Yorùbá
Research Paper
A research paper detailing the dataset construction, methodology, and benchmarks is in progress and will be made available on arXiv. Check back soon for updates.
Citation & Use
When citing or benchmarking, please include the following BibTeX reference:
@misc{datasciencenigeria_african_voices_2025,
title = {African Voices: Multilingual Speech Dataset for Low-Resource African Languages},
author = {{Data Science Nigeria}},
year = {2025},
note = {Latest release, November 2025},
howpublished = {\url{https://www.africanvoices.ai}},
institution = {Data Science Nigeria},
keywords = {speech recognition, multilingual datasets, African languages, low-resource ASR}
}
Acknowledgments
The African Voices Dataset was prepared and collected by Data Science Nigeria through the generous support and contributions of The Bill & Melinda Gates Foundation.