by Mouser Electronics
Voice and speech user interfaces are becoming more common and important in handsets, tablets, wearables, and smart devices, especially those that do not lend themselves to keyboards or touch screens. To deliver higher accuracy in speech processing, systems need to reliably recognize voices and words despite complex background noise.
Millions of people already rely on automatic speech recognition to convert speech to text for writing documents and generating transcripts. However, the quality of automated speech recognition often depends on near-optimal conditions: each person speaking in a way similar to the voice training data, and speech recorded in a quiet environment free of excessive background noise. Even then, transcriptionists are often employed to correct word, punctuation, and grammar mistakes, and to check for other types of errors in the transcript. Continuing advances in speech enhancement for improving the intelligibility of human speech are essential to the success of today’s speech recognition technologies used on mobile and smart devices, as well as in noisy environments such as automobiles.
Speech enhancement is based on a combination of voice isolation and noise suppression techniques. This article focuses on voice isolation for noise removal and speech enhancement, so noise suppression is only briefly described.
Noise suppression is an approach for removing the different types of background noise that can interfere with recognizing human speech. Noise is characterized in the time and frequency domains. Time-domain noises include continuous, discontinuous, and pulse-like noises. Frequency-domain noises include broadband and narrowband noises. Office and traffic sounds, operating equipment, and hisses are examples of continuous or slow-changing noises. Discontinuous noises are repetitive noises like horns or bells. Pulse-like noises are usually abrupt, like clicks and thumps. Broadband noises, such as background hissing, span many frequencies. Narrowband noises occur at a set of specific frequencies and include sinusoidal waves, hums, and machine noises.
Designers have access to various filtering techniques, each effective to a different degree against each type of noise. However, because the characteristics of a noise can change over time, designers may also use adaptive filtering algorithms that dynamically adjust as the noise changes. Examples of noise suppression techniques include frequency compensation, impulse filtration, adaptive broadband filtration, adaptive inverse filtration, and stereo filtration.
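To make one of these techniques concrete, the following Python sketch implements a normalized least-mean-squares (NLMS) adaptive noise canceller. It assumes a second, noise-dominated reference input, such as a microphone placed away from the talker; the function name, tap count, and step size are illustrative choices, not drawn from any particular product.

```python
import numpy as np

def nlms_noise_canceller(primary, reference, n_taps=32, mu=0.5, eps=1e-8):
    """Normalized LMS adaptive filter: estimates the noise component of
    `primary` (speech + noise) from `reference` (a noise-correlated
    signal, e.g. a second microphone) and subtracts it."""
    w = np.zeros(n_taps)          # adaptive filter weights
    out = np.zeros(len(primary))  # enhanced (noise-reduced) output
    for n in range(n_taps, len(primary)):
        x = reference[n - n_taps:n][::-1]  # most recent reference samples
        y = w @ x                          # noise estimate for this sample
        e = primary[n] - y                 # error = cleaned output sample
        w += mu * e * x / (x @ x + eps)    # normalized weight update
        out[n] = e
    return out
```

Because the weights are re-estimated every sample, the filter tracks noise whose statistics drift over time, which is exactly why adaptive filtration is preferred over fixed filters for changing noise.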
Enter voice isolation
Voice isolation is a newer approach for improving the intelligibility of speech to the human ear. Instead of masking and filtering out each source of noise, voice isolation identifies the specific components of human speech and passes only the speaker’s voice, filtering out any background noise. Voice isolation can significantly improve the clarity and intelligibility of speech, even in noisy environments. To reliably identify the components of human speech, a voice isolation system needs to employ acoustic and language models. This article looks at two modelling approaches used in existing embedded designs: first, a deep neural network approach, and second, cochlea emulation, which reproduces the behavior of the human hearing system from the inner ear to the brain.
With the deep neural network approach, a large database containing hundreds of hours of noise and human speech is used to train the system. Training begins with no assumptions about human speech; the system learns to identify different human speech patterns from the data. The quality of the voice isolation, including the ability to identify where different sounds are coming from, is enhanced when two or more microphones collect the audio input. The network can even be trained to identify where and when someone is speaking.
The information contained within the database is then used to create small, fast algorithms that execute on the target digital signal processor (DSP), which is then capable of detecting and classifying speech. The combination of these adaptive algorithms, developed from the information stored in the database, constitutes the neural network.
The neural network’s algorithms break the incoming sound into small segments and analyze them for different human speech patterns. The network examines characteristics of each segment, including frequencies, harmonics, and attack and decay behavior, to distinguish speech from environmental sounds. There are performance trade-offs based on the audio sampling rate: lower sampling rates need less processing but are less accurate, while higher sampling rates are more accurate but require much more processing.
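The segmentation step can be sketched in a few lines of Python. The 25 ms frame and 10 ms hop are common assumed values, and the feature set (log energy plus a magnitude spectrum) is a simplified stand-in for what a production speech/noise classifier would actually consume.

```python
import numpy as np

def frame_features(audio, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split audio into short overlapping segments and compute simple
    spectral features that a speech/noise classifier could consume."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hanning(frame_len)
    feats = []
    for start in range(0, len(audio) - frame_len + 1, hop_len):
        frame = audio[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))            # magnitude spectrum
        log_energy = np.log(np.sum(frame ** 2) + 1e-10)  # frame energy
        feats.append(np.concatenate(([log_energy], spectrum)))
    return np.array(feats)  # shape: (num_frames, 1 + frame_len // 2 + 1)
```

The sampling-rate trade-off is visible here: doubling `sample_rate` doubles the samples per frame, so every FFT and every downstream classification costs correspondingly more.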
Different filtering algorithms are then used to perform voice-print identification while removing unwanted parts of the audio input. Multi-pass filtering permits more aggressive filtering, because later passes can restore audio lost in earlier ones. By varying the algorithm parameters during the post-processing phase, the sound can be tuned for either a human listener or a speech recognition system. This distinction matters because humans and speech recognition systems do not interpret speech the same way.
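The specific multi-pass algorithms are proprietary, but classic spectral subtraction illustrates how filtering aggressiveness can be tuned by parameters. In this hypothetical sketch, `alpha` controls how strongly the noise estimate is subtracted, while the `beta` floor restores a residue of the original signal so aggressive passes do not leave audible gaps; tuning such parameters differently for a human listener versus a recognizer mirrors the trade-off described above.

```python
import numpy as np

def spectral_subtract(frames_mag, noise_mag, alpha=2.0, beta=0.02):
    """One pass of spectral subtraction over per-frame magnitude spectra.
    alpha > 1 filters more aggressively; beta keeps a spectral floor so
    over-subtracted bins are partially restored rather than zeroed."""
    cleaned = frames_mag - alpha * noise_mag  # subtract scaled noise estimate
    floor = beta * frames_mag                 # residual floor of the original
    return np.maximum(cleaned, floor)
```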
Cochlea emulation
This second voice isolation method uses a DSP running a Computational Auditory Scene Analysis (CASA) algorithm to closely emulate how the human hearing system identifies human speech in a noisy environment. The approach codes the audio information so that it can be grouped and interpreted. Dozens of grouping cues, operating across time and frequency, are used, including pitch, spatial location, and onset/offset time.
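A minimal sketch of a cochlea-style analysis front end appears below, assuming a gammatone filterbank with ERB-spaced center frequencies, a common front end in the CASA literature; real cochlea-emulation processors are considerably more elaborate.

```python
import numpy as np

def gammatone_filterbank(audio, sample_rate=16000, n_channels=32,
                         f_low=80.0, f_high=6000.0, ir_dur=0.05):
    """Cochlea-like analysis: pass audio through a bank of gammatone
    filters with center frequencies spaced on the ERB scale, roughly
    mimicking frequency selectivity along the basilar membrane."""
    # Glasberg & Moore ERB formulas
    def erb(f):
        return 24.7 * (4.37 * f / 1000.0 + 1.0)
    def erb_rate(f):
        return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    def inv_erb_rate(e):
        return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

    centers = inv_erb_rate(
        np.linspace(erb_rate(f_low), erb_rate(f_high), n_channels))
    t = np.arange(int(ir_dur * sample_rate)) / sample_rate
    outputs = []
    for fc in centers:
        b = 1.019 * erb(fc)  # per-channel bandwidth parameter
        # 4th-order gammatone impulse response: t^3 * decay * carrier
        ir = t**3 * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
        ir /= np.sqrt(np.sum(ir**2) + 1e-12)  # normalize channel energy
        outputs.append(np.convolve(audio, ir, mode="same"))
    return centers, np.array(outputs)  # "cochleagram": (channels, samples)
```

The resulting channel-by-time representation is what the grouping cues described next operate on.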
Pitch is a critical grouping cue because its distinct harmonic patterns reveal which sound components belong to the same sound. When two or more microphones are used, spatial location cues enable the voice isolation system to determine the direction and distance of a sound from each microphone. CASA modelling permits the system to accomplish the “cocktail party effect”: focusing on a single sound source, such as a specific person, while blocking out background sounds. Onset/offset time cues refer to sound components that start and/or stop together. When these cues are combined with raw frequency data, groups of sounds or frequency components can be treated as a single sound.
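The pitch cue, for example, can be approximated with a simple autocorrelation estimator over one sound segment; production CASA systems use far more robust trackers. The 60–400 Hz search range and the 0.3 periodicity threshold are illustrative assumptions.

```python
import numpy as np

def pitch_cue(frame, sample_rate=16000, f_min=60.0, f_max=400.0):
    """Estimate a pitch (F0) grouping cue for one segment via
    autocorrelation; returns 0.0 if no clear periodicity is found."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f_max)   # shortest period of interest
    lag_max = int(sample_rate / f_min)   # longest period of interest
    if lag_max >= len(ac):
        return 0.0
    peak = lag_min + np.argmax(ac[lag_min:lag_max])
    # require the peak to be a sizable fraction of the zero-lag energy
    return sample_rate / peak if ac[peak] > 0.3 * ac[0] else 0.0
```

Segments that return the same F0 over successive frames are candidates for grouping into one harmonic source.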
Sound components with similar attributes form a single auditory stream. Likewise, sound components with dissimilar attributes form separate auditory streams. The system can then use these different streams to identify persistent or recurring sound sources. Once enough sound components have been grouped, the actual process of voice isolation consists of matching the identified sound sources with the corresponding components of the speaker’s voice. An inverse transform of the auditory streams reconstructs the data into voice audio that can be transmitted and interpreted by a listener.
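The stream selection and inverse transform can be sketched with a short-time Fourier transform, a binary time-frequency mask (akin to the “ideal binary mask” used in CASA research), and overlap-add resynthesis. The mask below, which simply keeps bins well above the median magnitude, is a hypothetical stand-in for the grouping logic described above.

```python
import numpy as np

def apply_mask_and_reconstruct(audio, mask_fn, frame_len=512, hop=256):
    """Keep the 'speech' time-frequency components selected by mask_fn,
    then reconstruct audio via inverse FFT and overlap-add.
    mask_fn(magnitudes) returns a 0/1 array marking speech-dominated bins."""
    window = np.hanning(frame_len)
    out = np.zeros(len(audio))
    norm = np.zeros(len(audio))
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        spec *= mask_fn(np.abs(spec))         # drop non-speech bins
        rec = np.fft.irfft(spec, n=frame_len) * window
        out[start:start + frame_len] += rec   # overlap-add synthesis
        norm[start:start + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-10)      # compensate window overlap

# Hypothetical mask: keep bins well above the frame's median magnitude.
speech_mask = lambda mag: (mag > 2.0 * np.median(mag)).astype(float)

# Example with synthetic input: a steady 200 Hz tone as "speech" in noise.
t = np.arange(16000) / 16000.0
audio = np.sin(2 * np.pi * 200 * t) + 0.3 * np.random.randn(len(t))
enhanced = apply_mask_and_reconstruct(audio, speech_mask)
```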
Considerations
Voice isolation is not just for providing higher quality speech to speech recognition systems; it has many other significant uses. For example, first responders in an emergency can find themselves in extreme and possibly chaotic environmental conditions where fast, accurate voice communication is life- and safety-critical. Compared to noise suppression alone, voice isolation provides a mechanism to improve the intelligibility of voice communications in spite of uncontrollable environmental conditions.
Specialized DSP voice processors can optimize performance while lowering power consumption, whether the design supports an always-on voice interface or requires the user to activate the interface manually, such as by pressing a button. An always-on voice interface implemented on the system processor would consume power continuously, because the processor would have to remain active all the time. In contrast, an always-on system built around a dedicated voice processor can save battery power by letting the main processor sleep, transitioning from a limited-function, low-power listening mode to a full-function wake-up mode only when needed.
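One way to picture the two power modes is as a gate in software: a cheap, always-running energy check stands in for the limited-function listening mode, and the costly recognition stage stands in for the full-function wake-up mode. The names and threshold below are hypothetical, not a vendor API.

```python
import numpy as np

def voice_likely(frame, threshold=1e-3):
    """Low-power stage: crude energy-based activity check, standing in
    for the limited-function listening mode on a dedicated voice DSP."""
    return np.mean(frame ** 2) > threshold

def run_always_on(mic_frames, recognize):
    """Gate the power-hungry recognize stage behind the cheap detector,
    so full-function processing runs only after likely speech."""
    awake = False
    for frame in mic_frames:
        if not awake and voice_likely(frame):
            awake = True          # transition to full-function wake-up mode
        if awake:
            recognize(frame)      # the expensive "main processor" work

# Example: five silent frames followed by a burst of "speech"
frames = [np.zeros(160)] * 5 + [np.random.randn(160) * 0.1] * 3
run_always_on(frames, lambda f: print("recognizing frame"))
```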
A voice interface is no longer reserved just for handsets and smartphones. Wearables that do not support keyboards or touch screens could greatly benefit from a voice user interface. However, as these interfaces mature, the distance between the user and the device microphone will grow; for example, some smart TVs support voice commands, and these TVs are often positioned across the living room. There are also issues regarding user privacy and security that will eventually need to be worked out. Resolving them will pave the way for expanding the use of voice interfaces in other exciting products used in non-traditional settings.