African Next Voices Transcription Training Workshop for Dholuo and Kalenjin Communities Held at Kisumu Hotel

The African Next Voices Project: Pilot Data Collection in Kenya, funded by the Bill and Melinda Gates Foundation, is making significant strides in collecting high-quality linguistic datasets in five African languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The initiative helps bridge the language gap in AI and speech technology by compiling both scripted and unscripted audio data. Having completed the scripted phase, the project is now conducting the unscripted phase, in which participants respond to text, image, audio, and video prompts with natural speech in their native languages. This phase captures the diverse and representative speech patterns that are essential for developing inclusive voice-enabled technologies.

On March 27, 2024, a transcription workshop was held at Kisumu Hotel to train transcribers to process unscripted audio data accurately and consistently. The workshop focused on two languages, Dholuo and Kalenjin, and was led by linguists and language leads. The transcribers were introduced to the African Next Voices (ANV) transcription methodology, which uses rigorous quality assurance loops to ensure that the final datasets meet high linguistic and technical standards.


The Kalenjin transcription team consisted of native speakers from Kericho, Nandi, Bomet, Uasin Gishu, Nakuru, and Narok. These transcribers underwent intensive training on language-specific guidelines to ensure consistency in transcribing Kalenjin speech data. During the practical sessions, they engaged in real-time transcription exercises, applying the guidelines under the supervision of experienced language leads. Their participation in the workshop highlights the project’s commitment to leveraging regional linguistic expertise to develop high-quality AI training data.