Whose Voice Is the Model Listening For?
ASR for foundational literacy is arriving in African classrooms. The question is whether it has learned from the classrooms where it is meant to work.
The pitch for ASR to help teachers help their students
The case for automatic speech recognition (ASR) in foundational literacy is compelling: a child reads aloud, the system scores in seconds, the teacher knows who needs help, and the feedback loop closes before the lesson ends. No term-end bottlenecks. No assessment burden. Fast, scalable, and in principle affordable. The pitch isn’t wrong. But the gap between demo and deployment remains the challenge, and it is where the most important questions still need answering.
What initial studies show - and what they tell us
Active deployments across Africa are generating early evidence and real lessons.
Ghana: A 2025 study* recorded 130 students reading passages aloud in classrooms on the outskirts of Accra: children from low-income households, reading English as a second or third language, with West African accents the AI had not been trained on. The system correctly transcribed roughly nine out of every ten words, and its reading scores matched those of trained human raters almost exactly. That matters: in most African schools, assessing how fluently a child reads requires a teacher to sit with each child one-on-one, which is slow, expensive, and rarely done. The study was conducted in English; the larger question for FLN is whether the same approach can work in mother-tongue languages such as Twi, Ewe, and Dagbani, in which Ghanaian children are actually supposed to learn to read. That question is still open. As a proof of concept, though, the study is more than enough to justify the next step.
South Africa: An initial pilot** using an off-the-shelf speech recognition model without any adaptation to children's voices produced poor results: the system missed many correct answers and was inconsistent across words. For a follow-up study in isiXhosa, the team changed course. They collected nearly 150,000 recordings of children reading aloud, had each one reviewed by three mother-tongue speakers, and used that data to train a model tuned specifically to how children in these classrooms actually speak. The result: 95% accuracy on items where reviewers agreed, and a near-perfect match with human marking. Background noise from classrooms was flagged as a challenge, though it ultimately had less impact on accuracy than expected. What matters here is that the team documented what failed in the first round and used it to build something better in the second. That cycle — fail, learn, redesign — is how trustworthy tools get built.
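A figure like "accuracy on items where reviewers agreed" is worth unpacking, because the filter changes what is being measured. The sketch below is a minimal illustration, not the project's actual method: it assumes each recording carries three reviewer judgements and one model judgement, and it reads "agreed" as unanimous agreement. The function name and the sample data are hypothetical.

```python
def accuracy_on_agreed_items(items):
    """Score the model only on items where all human reviewers gave the same label.

    items: list of (reviewer_labels, model_label) pairs, where reviewer_labels
    is a list such as [True, True, True] (True = word read correctly).
    Returns (accuracy, number_of_agreed_items); accuracy is None if no item qualifies.
    Assumption: "agreed" means unanimous; the source does not define it.
    """
    agreed = [(labels[0], model) for labels, model in items if len(set(labels)) == 1]
    if not agreed:
        return None, 0
    matches = sum(1 for human, model in agreed if human == model)
    return matches / len(agreed), len(agreed)


# Made-up data for illustration only, not results from any study.
items = [
    ([True, True, True], True),      # unanimous, model matches
    ([False, False, False], False),  # unanimous, model matches
    ([True, True, True], False),     # unanimous, model disagrees
    ([True, False, True], True),     # reviewers split: excluded from the score
    ([False, False, False], False),  # unanimous, model matches
]

accuracy, n_agreed = accuracy_on_agreed_items(items)
print(f"accuracy on agreed items: {accuracy:.2f} over {n_agreed} of {len(items)} items")
```

The number of excluded items is worth reporting next to the accuracy, since recordings that human reviewers split on are likely to be the ones a model also finds hardest.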
Kenya: In early 2026, Microsoft Research published Paza, a community co-created pipeline of ASR models, benchmarks, and playbooks for developers working with low-resource languages***. Paza is not a children's literacy tool, but its approach, which centres on models tested with community members in real conditions (on basic phones, with noisy backgrounds), offers a clear model of what community-first development looks like and a pointer to what the child-speech layer still needs. Whether this community-first model will prove as effective when applied to child speech, with all its additional complexity, remains an open question. It is, however, the right question to be asking.
Three gaps that impede development of ASR tools
Corpus-building efforts for low-resource languages are underway for adult speakers (e.g. African Next Voices, NaijaVoices, AfriSpeech-200, among others). However, the lack of any corpus built for early-grade literacy demands attention.
Children’s voices are missing. A systematic review covering 74 datasets, 111 African languages, and over 11,000 hours of speech found that nearly all existing data comes from adults****. Children speak differently: higher pitch, less consistent articulation, more hesitation, more disfluency. A model trained on adult speech is not just imperfect for children. It is structurally mismatched.
Classroom noise is missing. Fewer than 15% of the studies reviewed attempted to capture real deployment conditions. Models are tested in quiet. They are deployed in chaos — fans, traffic, 99 other children. That gap does not close by itself.
Dialectal and multilingual variation is missing. In languages such as Yoruba and Hausa, differences in tone or pitch patterns change meaning. In multilingual households and classrooms, which describes most of the contexts where FLN gaps are largest, children do not stay in one language. A model that has only seen clean, single-variety adult read speech will misrecognise this constantly and will do so without flagging that it has.
The scenario plays out like this. A Primary 2 teacher in Kano opens her tablet. The ASR tool flags three children as struggling. Two are Hausa-dominant speakers whose dialect the model was never trained on. The third was sitting furthest from the tablet when the recording was captured. None of them receive the right instruction for the next four weeks. The teacher trusts the data. She has no reason not to. This is not a hypothetical edge case. It is the default outcome when tools go live without anyone having asked what the training data actually contained.
The commercial layer
Alongside research pilots, commercial products are beginning to make their way into African classrooms. Tools like the US-based Amira Learning, being piloted in South Africa, are designed to do at scale what a skilled reading teacher does one-on-one: listen as a child reads aloud, respond within milliseconds with targeted feedback, and adjust difficulty in real time. Independent US studies show measurable gains, with students outperforming peers by an effect size of +0.26 in early grades. Engines like the one from Ireland's SoapBox Labs, built specifically for children's speech in noisy environments, power products across the sector and the globe. The SoapBox voice engine is licensed by publishers including Amplify and Scholastic and is used by developers who prefer not to build speech recognition from scratch. Domestic platforms like Nigeria's Afrilearn are exploring how to embed voice capabilities into curriculum-aligned products that teachers already use.
What is beginning to emerge is a split between two paths. Licensed global systems offer polish and reliability, often at lower upfront cost, but they offer limited visibility into how their models work, which matters when a teacher has no way to know whether a child was flagged because of a genuine reading gap or because the model was never trained on that child’s accent. Community-built approaches tend to be linguistically richer and more transparent but require more investment to develop and maintain. Neither path is right for every context. The question of which to choose, and what to ask of any vendor before signing, deserves more attention than it currently receives.
Three questions before any deployment
Knowing which questions to ask before procurement is where the practical work starts.
Does the training data include children? If the answer is adults only, ask what the plan is for child speech. If there isn’t one, that is your signal.
Has it been tested in actual classrooms? Five minutes of ambient audio from a real school will tell you more than any lab benchmark. The gap between controlled and field performance is not small — the South Africa EGRA-AI experience demonstrates that.
What do the error rates look like by speaker group? Overall word-error rate is a headline. The breakdown by dialect and age is the accountability metric, and it should be a required element of any procurement document.
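To make the third question concrete, here is a minimal sketch of how a headline word-error rate can hide a gap between speaker groups. It assumes the vendor can supply, for each test recording, a group label, the reference text, and the ASR output. The group names and sentences are invented for illustration, and the function names are hypothetical rather than taken from any tool named above.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of an extra word
                            prev[j - 1] + cost))  # substitution or exact match
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """Return {group: WER}, pooling errors and reference words within each group."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        ref, hyp = reference.lower().split(), hypothesis.lower().split()
        errors[group] += edit_distance(ref, hyp)
        words[group] += len(ref)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Made-up data for illustration only, not results from any study.
samples = [
    ("dialect_A", "the cat sat on the mat", "the cat sat on the mat"),
    ("dialect_A", "she went to school", "she went to school"),
    ("dialect_B", "the cat sat on the mat", "the cat sat on a mat"),
    ("dialect_B", "she went to school", "he went school"),
]

overall = wer_by_group([("all", ref, hyp) for _, ref, hyp in samples])["all"]
print("overall WER:", overall)
print("WER by group:", wer_by_group(samples))
```

In this toy data the overall rate is 15%, while one group sits at 0% and the other at 30%. The same breakdown can be run by age band or school, provided each recording is labelled accordingly.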
Closing thought
The technology is not the obstacle. What remains unfinished is the data layer beneath it: built for adults, tested in quiet rooms, and not yet shaped around the children it is meant to serve. Getting that right is not a technical afterthought. It is the condition on which everything else depends — the accuracy, the trust, and ultimately the child in Kano or Accra who deserves a system that has actually listened to someone who sounds like her.
* Henkel, O., Horne-Robinson, H., Hills, L., Roberts, B., & McGrane, J. (2025). Supporting literacy assessment in West Africa: Using state-of-the-art speech models to assess oral reading fluency. International Journal of Artificial Intelligence in Education, 35, 282–303.
** LBD-EGRA-AI Project Overview. AI-for-Education.org. https://ai-for-education.org/lbd-egra-ai/
*** Microsoft Research. (2026). Paza: Introducing ASR benchmarks and models for low-resource languages. https://www.microsoft.com/en-us/research/blog/paza-introducing-automatic-speech-recognition-benchmarks-and-models-for-low-resource-languages/
**** Imam, S. H., Sani, B., et al. (2025). Automatic Speech Recognition for African Low-Resource Languages: A Systematic Literature Review. AfricaNLP 2025. https://aclanthology.org/2025.africanlp-1.13.pdf
The opinions expressed are those of the authors alone.
