About African Voices

African Voices is a large-scale multilingual speech dataset developed by Data Science Nigeria. It contains more than 3,000 hours of transcribed audio across five Nigerian and Malian languages: Bambara, Hausa, Igbo, Nigerian Pidgin, and Yorùbá.

Purpose

The dataset supports the development of Automatic Speech Recognition (ASR) and other speech technology for low-resource African languages. It combines scripted and spontaneous speech, collected through community-centered, ethical protocols that respect linguistic and cultural diversity.

Key Features

• 3,000+ hours of transcribed audio
• 5 African languages supported
• Community-centered approach
• Ethical and culturally respectful

Supported Languages

🇲🇱 Bambara
🇳🇬 Hausa
🇳🇬 Igbo
🇳🇬 Nigerian Pidgin
🇳🇬 Yorùbá

Research Paper

A research paper detailing the dataset's construction, methodology, and benchmarks is in progress and will be made available on arXiv. Check back for updates.

Citation & Use

When citing the dataset or reporting benchmarks on it, please use the following BibTeX entry:

@misc{datasciencenigeria_african_voices_2025,
  title = {African Voices: Multilingual Speech Dataset for Low-Resource African Languages},
  author = {{Data Science Nigeria}},
  year = {2025},
  note = {Latest release, November 2025},
  howpublished = {\url{https://www.africanvoices.ai}},
  institution = {Data Science Nigeria},
  keywords = {speech recognition, multilingual datasets, African languages, low-resource ASR}
}
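
As a usage note, the entry above can be cited from a LaTeX document roughly as follows. This is a minimal sketch: the file name references.bib and the plain bibliography style are assumptions for illustration, not requirements of the dataset. The url package is loaded because the entry's howpublished field uses \url.

```latex
\documentclass{article}
\usepackage{url} % the howpublished field of the entry uses \url

\begin{document}

We evaluate on the African Voices
dataset~\cite{datasciencenigeria_african_voices_2025}.

\bibliographystyle{plain}  % assumed style; any standard style works
\bibliography{references}  % assumes the entry is saved in references.bib

\end{document}
```

Compiling with latex, then bibtex, then latex twice resolves the citation.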

Acknowledgments

The African Voices dataset was collected and prepared by Data Science Nigeria with the generous support and contributions of the Bill & Melinda Gates Foundation.