I’ve Always Wanted To Know:

How Does A Computer Understand Speech?


Software like Dragon NaturallySpeaking and services like Google Voice Search or Apple’s Siri digital assistant let you speak to a computer or smartphone and have it understand what you are saying. But how does this work?

The most basic description is that a microphone records sound, and a program or server then analyzes that sound and converts it into text, which the program uses for whatever action is desired. The process is complex, beginning with the recording itself. The microphone picks up not only your voice but also background sounds such as spinning fans, a car outside, or your home heating system, all of which the program has to deal with. This can be reduced with noise-canceling microphones or multiple microphones, which use hardware and software to limit the background noise recorded.
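As a rough illustration of the multiple-microphone idea, here is a minimal sketch in Python. It assumes an idealized setup that real devices only approximate: one microphone hears the voice plus the noise, while a second microphone hears only the noise, perfectly aligned in time. The function names and the pure tones standing in for a voice and a fan are invented for this example.

```python
import math

def make_tone(freq_hz, n_samples, sample_rate=8000, amplitude=1.0):
    """Generate a pure sine tone as a list of samples (a stand-in for real audio)."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]

def cancel_noise(primary, reference):
    """Subtract the noise-only reference signal from the primary signal.

    Assumption: both signals are the same length and perfectly time-aligned,
    which real hardware has to work hard to achieve.
    """
    return [p - r for p, r in zip(primary, reference)]

def rms(signal):
    """Root-mean-square level, a simple measure of how loud a signal is."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))

# Hypothetical signals: a 200 Hz "voice" and a quieter 60 Hz "fan hum".
voice = make_tone(200, 800)
fan_hum = make_tone(60, 800, amplitude=0.5)

primary = [v + n for v, n in zip(voice, fan_hum)]  # mic near the mouth
reference = fan_hum                                # mic that hears only the fan

cleaned = cancel_noise(primary, reference)

# How much unwanted sound remains before and after cancellation.
noise_before = rms([p - v for p, v in zip(primary, voice)])
noise_after = rms([c - v for c, v in zip(cleaned, voice)])
```

In practice the second microphone also picks up some of the voice and the noise reaches each microphone at slightly different times, so real systems use adaptive filtering rather than plain subtraction.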

The next step uses a computer program, running either on your device or on a remote server, to convert the audio into text. This involves advanced mathematical algorithms that determine which words were spoken. Because people have different accents and speak at different paces, the software must be designed to recognize a wide variety of voices and speaking styles. The quality of speech recognition is usually determined at this step, since better-designed software picks the correct words more often. The software also considers which words are likely to appear together. If the initial recognition comes out as “I will fall you back later,” the software may change “fall” to “call” because, based on commonly used phrases, that is more likely what was said.
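The “fall” versus “call” correction can be sketched with a toy word-pair model in Python. This is only an illustration of the general idea, not how any particular product works: the phrase counts and the list of sound-alike words below are made up for the example, whereas real systems learn such statistics from enormous amounts of text and audio.

```python
# Hypothetical counts of how often two words appear next to each other.
BIGRAM_COUNTS = {
    ("will", "call"): 50,
    ("call", "you"): 80,
    ("will", "fall"): 5,
    ("fall", "you"): 0,
}

# Hypothetical table of words that sound alike and may be confused.
SOUND_ALIKE = {"fall": ["fall", "call", "ball"]}

def phrase_score(words):
    """Score a phrase by multiplying the counts of each adjacent word pair.

    One is added to every count so that unseen pairs do not zero out the score.
    """
    score = 1
    for pair in zip(words, words[1:]):
        score *= BIGRAM_COUNTS.get(pair, 0) + 1
    return score

def correct(sentence):
    """Replace each confusable word with the sound-alike that scores best in context."""
    words = sentence.lower().split()
    for i, word in enumerate(words):
        candidates = SOUND_ALIKE.get(word, [word])
        words[i] = max(
            candidates,
            key=lambda c: phrase_score(words[:i] + [c] + words[i + 1:]),
        )
    return " ".join(words)

print(correct("I will fall you back later"))  # prints "i will call you back later"
```

Here “will call you” scores far higher than “will fall you” because those word pairs appear much more often in the made-up counts, so “call” wins.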

The final step is putting the recognized words to use. Programs like Dragon Dictation simply output the recognized words as text on the screen, so you can type by speaking. Smartphone applications like Google Search or Siri (among many others) listen for keywords or phrases that trigger specific actions, so the software needs to recognize where the keywords begin and what action is being requested. If you say “Send a text message to Molly” to Siri, it will create a new message addressed to the Molly in your address book. If you instead say “Molly I need to send her a text message,” the result will be typed text on screen that reads “Molly I need to send her a text message.”
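The difference between those two sentences can be sketched with a simple pattern match in Python. This is an assumption-laden toy, not Siri's actual logic: the pattern, the function name, and the action labels are all invented here, and real assistants use far more flexible language understanding than a single fixed phrase.

```python
import re

# Hypothetical command pattern: the phrase must START with the keywords.
COMMAND_PATTERN = re.compile(
    r"^send a text message to (?P<name>\w+)$", re.IGNORECASE
)

def interpret(utterance):
    """Decide whether the recognized text is a command or plain dictation."""
    match = COMMAND_PATTERN.match(utterance.strip())
    if match:
        # Keywords found at the start: create a new message to the named contact.
        return {"action": "new_message", "recipient": match.group("name")}
    # No command recognized: treat the words as text to be typed out.
    return {"action": "dictate", "text": utterance}

print(interpret("Send a text message to Molly"))
print(interpret("Molly I need to send her a text message"))
```

The first call is treated as a command with Molly as the recipient, while the second falls through to dictation because the keywords do not appear where the pattern expects them.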

What can you do to make speech recognition more accurate?


