After converting a sound wave into a stream of numbers, the next step is to figure out what the computer should look for in that data. The raw audio signal is a complex, continuous wave, but language is made of discrete units. To bridge this gap, we need to break speech down into its most fundamental components. These components are not letters, but the distinct sounds that make up a language.

The Smallest Units of Sound

In linguistics, the smallest unit of sound that can change the meaning of a word is called a phoneme. Think of phonemes as the atoms of spoken language. Changing just one phoneme can turn one word into a completely different one.

Consider the word "pat". It is composed of three distinct sounds:

- The "p" sound at the beginning.
- The "a" sound in the middle.
- The "t" sound at the end.

If you replace the first sound, "p", with a "b" sound, you get a new word: "bat". If you change the middle sound from "a" to "e", you get "pet". If you change the final sound from "t" to "d", you get "pad". The sounds /p/, /b/, /t/, /d/, /æ/ (as in pat), and /ɛ/ (as in pet) are all examples of phonemes in English. The English language uses approximately 44 phonemes to construct all of its words, though the exact count varies with dialect.

Why Letters Are Not Enough

A common point of confusion is the difference between letters (called graphemes) and phonemes. The 26 letters of the English alphabet do not have a one-to-one relationship with the roughly 44 sounds we use to speak. This inconsistency is why ASR systems focus on phonemes, not letters.

Here are a few examples of this mismatch:

- One letter, multiple sounds: The letter 'c' makes a /k/ sound in "cat" but an /s/ sound in "city".
- Multiple letters, one sound: The letters 'ph' combine to make a single /f/ sound in "phone". The letters 'sh' make a /ʃ/ sound in "ship".
- Silent letters: The word "know" is pronounced /noʊ/. The 'k' is silent, and the 'w' is not sounded as a consonant; together, 'ow' spells the vowel /oʊ/.

Because of this ambiguity in spelling, an ASR system cannot simply try to match audio to letters.
It must first identify the sequence of phonemes being spoken.

A Standard for Sounds: The IPA

To handle this ambiguity, linguists developed the International Phonetic Alphabet (IPA). The IPA is a standardized system in which each symbol represents exactly one speech sound. This allows for the precise transcription of speech from any language, removing the guesswork of regular spelling.

You don't need to memorize the IPA, but it's useful to see how it clarifies pronunciation.

| Word | Spelling | Phonetic Transcription (IPA) |
| --- | --- | --- |
| Cat | c-a-t | /kæt/ |
| Phone | p-h-o-n-e | /foʊn/ |
| Though | t-h-o-u-g-h | /ðoʊ/ |
| Tough | t-o-u-g-h | /tʌf/ |

Notice how "though" and "tough", despite their similar spelling, have very different phonetic transcriptions. This is the level of detail an ASR system must work with.

The Role of Phonemes in the ASR Pipeline

The core task of an Acoustic Model, a central component of a traditional ASR system, is to solve this exact problem. It takes the processed audio data (which we will cover in the next chapter) and estimates, for each moment in the audio, how likely each phoneme is to be the one being spoken.

The system doesn't hear "cat". It analyzes the audio signal and determines that the most likely sequence of sounds is /k/, followed by /æ/, followed by /t/.
This sequence of phonemes is then passed to the next stages of the pipeline, which use a lexicon (a sort of dictionary) and a language model to determine that the sequence /kæt/ corresponds to the word "cat".

digraph G {
  rankdir=TB;
  graph [fontname="sans-serif", fontsize=10];
  node [shape=box, style="rounded,filled", fillcolor="#e9ecef", fontname="sans-serif", fontsize=10];
  edge [fontname="sans-serif", fontsize=9];
  Audio [label="Audio Signal for 'ship'", fillcolor="#a5d8ff"];
  AcousticModel [label="Acoustic Model", fillcolor="#d0bfff"];
  Phonemes [label="Phoneme Sequence\n/ʃ/ /ɪ/ /p/", shape=note, fillcolor="#b2f2bb"];
  Lexicon [label="Lexicon & Language Model", fillcolor="#ffd8a8"];
  Text [label="Text Output\n\"ship\"", shape=document, fillcolor="#ffc9c9"];
  Audio -> AcousticModel [label="Analyzes audio features"];
  AcousticModel -> Phonemes [label="Outputs most likely sounds"];
  Phonemes -> Lexicon [label="Matches phonemes to words"];
  Lexicon -> Text [label="Selects most probable word"];
}

An ASR system first converts an audio signal into a sequence of phonemes, which are then used to determine the final text.

Understanding phonemes is fundamental. They are the bridge connecting the messy, continuous realm of sound waves to the structured, discrete domain of words and sentences. In the chapters that follow, you will learn how computers extract features from audio to identify these sounds and how models are trained to perform this incredible translation.
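
The lexicon step described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how a production ASR system is built: `TOY_LEXICON`, `pick_phonemes`, and `lookup_word` are hypothetical names, the lexicon holds only words used as examples in this section, and the scores are invented for illustration rather than produced by a real acoustic model. It also assumes one set of scores per phoneme, whereas a real acoustic model scores many short slices of audio.

```python
# Hypothetical toy lexicon: phoneme sequences (tuples of IPA symbols) -> words.
# Only words used as examples in this section are included.
TOY_LEXICON = {
    ("k", "æ", "t"): "cat",
    ("p", "æ", "t"): "pat",
    ("b", "æ", "t"): "bat",
    ("p", "ɛ", "t"): "pet",
    ("p", "æ", "d"): "pad",
    ("ʃ", "ɪ", "p"): "ship",
    ("f", "oʊ", "n"): "phone",
    ("ð", "oʊ"): "though",
    ("t", "ʌ", "f"): "tough",
}


def pick_phonemes(step_scores):
    """Greedily pick the highest-scoring phoneme at each step.

    step_scores is a list of dicts mapping a phoneme to a score.
    """
    return tuple(max(scores, key=scores.get) for scores in step_scores)


def lookup_word(phonemes, lexicon=TOY_LEXICON):
    """Return the word for a phoneme sequence, or None if it is unknown."""
    return lexicon.get(tuple(phonemes))


# Invented scores standing in for acoustic model output for the word "ship".
example_scores = [
    {"ʃ": 0.80, "s": 0.15, "tʃ": 0.05},
    {"ɪ": 0.70, "i": 0.30},
    {"p": 0.60, "b": 0.40},
]

phonemes = pick_phonemes(example_scores)
print(phonemes)               # ('ʃ', 'ɪ', 'p')
print(lookup_word(phonemes))  # ship
```

An exact-match dictionary is the simplest possible stand-in for a lexicon. It returns `None` for any sequence it has not seen, and it cannot choose between words that share a pronunciation, which is the kind of decision the section assigns to the language model.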