Why Tone Matters: What Speech and Language AI Gets Wrong in Africa

31 Mar

Speech and language technologies shape how people learn, work, and access information. Voice assistants, reading tutors, and automated assessments are already influencing education systems worldwide. However, in much of Africa, these tools fail in basic and predictable ways. The problem does not stem from a lack of effort or interest. It comes from a mismatch between how these systems work and how African languages encode meaning. In sub-Saharan Africa, approximately 85-90% of languages are tonal, meaning that pitch distinguishes words and grammatical forms (Hyman, 2003; Maddieson, 2013). This represents over 1,500 languages and roughly 85% of the region’s population, including hundreds of millions of children who rely on tone to communicate and learn.

In this piece, I look at why tone matters, why current automatic speech recognition (ASR) and natural language processing (NLP) systems struggle with it, and what this gap reveals about opportunities to improve teaching and learning in African classrooms.

Part 1: The Problem

Africa as a stress test for speech and language AI

Global technology companies and research groups now recognize Africa as a priority region for speech AI. Large projects such as Meta’s No Language Left Behind and Mozilla’s Common Voice identify African languages as one of the most urgent gaps in speech technology. These efforts aim to expand coverage to languages that existing systems transcribe inaccurately, whose key distinctions such as tone they misinterpret, or that they fail to support reliably in real-world settings such as classrooms, markets, and public services.

Recent benchmarking work confirms the seriousness of the problem. The Voice of a Continent study reveals that even robust, modern speech models struggle to capture fundamental aspects of African speech, including tone, rhythm, and emotional cues, despite excelling in European or East Asian languages. At the same time, startups such as Dukawalla point to clear demand by building products that rely on spoken interaction in local languages for everyday business tasks, where accuracy and meaning matter immediately. The market exists. The technology still lags.

What tone is and why it carries meaning

In many languages, pitch expresses emotion or sentence type. In tonal languages, pitch serves a more fundamental purpose. It distinguishes words and grammatical forms.

A simple example from Akan, Ghana's most widely spoken language, illustrates this point. The same spoken word papa can mean “father”, “good”, or “fan”, depending only on tone. In writing, all three appear identical because tone is not marked in Akan orthography. A sentence like Ka wo papa ho asɛm can mean “talk about your father”, “talk about your good behaviour”, or “talk about your fan”. All meanings are plausible in a normal household or classroom setting. Tone, not the surrounding context, determines the intended meaning. When tone is absent, systems guess. Speakers do not.
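
To make the ambiguity concrete, the sketch below indexes the three senses of papa twice: once by spelling alone, as toneless text does, and once by spelling plus tone. The tone labels T1, T2, and T3 are placeholders invented for this illustration, not a transcription of the actual Akan tone patterns, and the lexicon structure is an assumption rather than any existing resource.

```python
from collections import defaultdict

# Hypothetical entries: "T1".."T3" are placeholder tone labels,
# not the real Akan tone patterns for these words.
LEXICON = [
    ("papa", "T1", "father"),
    ("papa", "T2", "good"),
    ("papa", "T3", "fan"),
]

def build_indexes(entries):
    """Index glosses by spelling alone and by (spelling, tone)."""
    by_spelling = defaultdict(list)
    by_spelling_and_tone = {}
    for spelling, tone, gloss in entries:
        by_spelling[spelling].append(gloss)
        by_spelling_and_tone[(spelling, tone)] = gloss
    return dict(by_spelling), by_spelling_and_tone

by_spelling, by_tone = build_indexes(LEXICON)

# Toneless text: one spelling, three candidate meanings.
print(by_spelling["papa"])
# Spelling plus tone: exactly one meaning.
print(by_tone[("papa", "T2")])
```

A lookup by spelling alone returns three candidates and leaves the system to guess. Adding the tone label returns exactly one.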

This is not an edge case. It reflects how meaning works in daily communication across many African languages.

Why current language systems fail to handle tone

Most language technologies rely on text. Text data scales easily, trains models cheaply, and transfers across tasks. Since the writing systems of many African languages do not consistently mark tone, the resulting text collections leave out part of the meaning.

For these reasons, developers should not treat tone as optional. Systems that ignore it build ambiguity into their outputs from the start. In practice, many systems still drop tone because doing so simplifies data collection, reduces annotation costs, and allows models trained on well-resourced non-tonal languages to transfer more easily. These short-term efficiencies shape design decisions, even when they undermine performance in tonal settings.

This pattern also reflects a broader imbalance in research. Most computational work on tone focuses on a small number of well-resourced languages, especially Mandarin, which accounts for a large share of tone-related studies in major NLP venues. African tonal languages appear far less often and usually in one-off projects rather than sustained research programs. As a result, tools, benchmarks, and best practices develop around a narrow set of linguistic assumptions.

Many language systems, therefore, rely on conditions that rarely hold in African contexts. They assume stable orthographies, large tone-annotated datasets, and controlled speech environments. In real classrooms and communities, these conditions are uncommon. Tone still carries meaning, but text rarely records it, leaving systems unable to recover what speakers intend to convey.

Speech recognition research shows that this limitation is not inherent. Work on well-resourced tonal languages such as Mandarin and Thai demonstrates that explicitly modeling tone improves recognition accuracy. These results confirm that tone-aware modeling is feasible. The problem arises downstream. Once speech is converted to text, tone is usually discarded. Text-based systems then operate on flattened representations in which words with distinct meanings collapse into a single written form. Even when speech models capture tone correctly, later stages often drop it, and meaning is lost after the first step.
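
One way to see this flattening is to strip tone diacritics from text, as a normalization step might do. The following minimal sketch uses Python's standard unicodedata module. The tone-marked spellings are illustrative only, since standard Akan orthography does not mark tone, and treating the combining grave and acute accents as the tone marks is an assumption made for this example.

```python
import unicodedata

# Assumption: tone is written with the combining grave and acute accents.
TONE_MARKS = {"\u0300", "\u0301"}

def strip_tone(text: str) -> str:
    """Remove tone diacritics while leaving base letters such as ɛ intact."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)

# Illustrative spellings only; the accents are placeholders, not verified Akan tone.
tone_marked = ["pàpá", "pápá", "pàpà"]
flattened = {strip_tone(word) for word in tone_marked}

# Three distinct tone-marked forms collapse into one spelling.
print(flattened)
```

Any later stage that receives only the flattened spelling has no way to recover which of the three forms was spoken.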

Part 2: The Impact on Teaching and Learning

Why tone errors are of special concern in early literacy and numeracy

Tone errors matter most where language supports learning. In early literacy and numeracy, children rely on clear links between sound, meaning, and print. When systems collapse multiple meanings into one output, they confuse learners.

This impacts the learning process at multiple levels:

Automated reading tutors struggle to assess children accurately when tone distinguishes correct from incorrect pronunciation. A child may pronounce a word with the correct tone pattern, yet a tone-blind system may mark it wrong or, worse, accept an incorrect pronunciation as correct because it matches only the consonants and vowels.
Translation tools misrepresent learning materials, turning instructional text into nonsense or altering meanings in ways that teachers must spend time correcting.
Assessment systems may group students based on faulty signals, misidentifying which children need support and which concepts require reteaching.
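
The reading tutor case above can be sketched with a toy scorer. It assumes, hypothetically, that an upstream recognizer already returns each syllable as a pair of segments and a tone label. The H and L labels and the example target are placeholders, not real Akan data.

```python
from typing import List, Tuple

Syllable = Tuple[str, str]  # (consonants and vowels, tone label)

def tone_blind_match(target: List[Syllable], heard: List[Syllable]) -> bool:
    """Compare consonants and vowels only, ignoring tone."""
    return [seg for seg, _ in target] == [seg for seg, _ in heard]

def tone_aware_match(target: List[Syllable], heard: List[Syllable]) -> bool:
    """Require both the segments and the tone of every syllable to match."""
    return target == heard

# Placeholder data: the child produces the right segments with the wrong tone.
target = [("pa", "L"), ("pa", "H")]
heard = [("pa", "H"), ("pa", "H")]

print(tone_blind_match(target, heard))  # True: the error is accepted
print(tone_aware_match(target, heard))  # False: the error is caught
```

The tone-blind scorer accepts the mispronunciation because the consonants and vowels line up, which is the false signal that then feeds into grouping and reteaching decisions.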

Why teachers cannot compensate for broken tools

Teachers already manage large classrooms with limited resources. Many rely on oral explanation, repetition, and physical materials. When digital tools introduce errors that teachers must correct, they become a burden rather than a support.

The promise of education technology in African classrooms depends on tools that lighten teacher workload and accurately support student learning. Tools that ignore tone do neither. They add cognitive load, reduce trust, and ultimately get abandoned or used minimally, not because teachers resist technology but because the technology does not reflect how their students actually speak and learn.

Part 3: Building Better Learning Tools

What tone-aware systems make possible

Tone-aware language systems aim to restore the meaning that current tools drop. This goal does not require perfect tone marking in every context. It requires systems that treat tone as meaningful information rather than noise.

Research indicates that incorporating tone can improve several tasks. Speech recognition benefits when models treat pitch patterns as meaningful signals (Coto-Solano, 2021; Kaur et al., 2021). Disambiguation improves when systems restore tone in text (Liu et al., 2017; Alqahtani et al., 2019). Downstream tasks such as translation and tagging become more reliable when tone no longer collapses distinct meanings.

What this means for EdTech builders, researchers, and education decision-makers

For EdTech builders, the choice is whether tools assume that language behaves like English or adapt to how meaning works in the target languages. Systems that incorporate tone may require more upfront design work, but they deliver more reliable performance in classrooms.

For researchers, African languages provide a test of whether language technologies truly model meaning or rely on shortcuts that only work in a narrow set of languages.

For decision-makers, funding speech and language tools without considering tone risks wasting resources. Tools may perform well in demonstrations yet fail in real learning environments. Early attention to language structure and classroom realities leads to better outcomes.

Systems that ignore tone will continue to misread speech, mistranslate text, and struggle in classrooms where teachers and students rely on spoken language to teach and learn. Building tools that work in the African educational context requires acknowledging this reality from the start, not treating it as a problem to patch after deployment. The question is what it will take for ASR and AI in EdTech to work effectively for African languages, especially at scale.

The opinions expressed are those of the authors alone.

Godfred Agyapong

Godfred Agyapong is a PhD student in the Department of Linguistics, specializing in Computational Linguistics. His research bridges the gap between endangered language documentation and advanced computational methods. He focuses on developing machine learning technologies tailored to low-resource languages, enhancing NLP systems with insights drawn from linguistic documentation and analysis. His current work involves the development of a multilayered Akan corpus with tone annotations, aiming to improve the accuracy and usability of NLP systems for low-resource languages. Godfred’s work contributes to the preservation and revitalisation of endangered languages through advanced computational tools by combining expertise in language documentation, machine learning, and phonological analysis.

https://www.linkedin.com/in/godfred-agyapong-a18480115/