Speaking to products of the future
The way we interact with technology has evolved hugely over the last 10 years. Touch screens have become the standard for interacting with mobile devices. But what will be the next interface technology to change the way we live our lives?
As with other Artificial Intelligence technologies, voice interfaces have been in development for a long time. They started life as a curiosity within research groups and have grown into a technology that most people carry around in their pockets. Huge investment from technology giants has made them more effective, and as a result users are increasingly accepting and adopting them. As people discover new ways in which the technology can be useful, it continues to spread.
The level of adoption today is greater than you might think. Demand for speech recognition is growing and it is increasingly found in new applications. In order to stay ahead of the curve, it is important to consider whether a voice interface could, and should, be integrated into new products and services.
There are many challenges to overcome when designing a voice interface into a product. The first is ensuring that the system will work reliably in the target environment. Imagine a narrator reading from a script in a recording studio. Easy – this is the perfect environment for sound quality. Now imagine a user talking to a device in a busy public area such as a train station, or at home with a TV on in the background. Here the background noise can be just as loud as, if not louder than, the person attempting to speak to the device. Even if the area happens to be relatively quiet, the acoustics of the room may cause echoes, which appear as reverberations and repetitions in the recorded speech. For some products, the need to prevent water or particle ingress presents a further challenge, as there must be an air path for the sound to travel through to reach the microphone.
One technique for improving signal quality is to use more than one microphone in an array and employ a concept known as beamforming. The microphone closest to the user will receive the signal first and microphones slightly further away will receive the signal slightly later. By employing Digital Signal Processing (DSP) to work out these relative time delays the direction of the incoming voice can be calculated. The system can then focus only on sound coming from that direction, whilst ignoring sound coming from other directions. Beamforming is an effective technique to reject background noise and focus on the target voice. For it to be effective, the designer must carefully consider the optimum spatial arrangement of the microphones and account for this accurately in DSP software. Also, the DSP filtering stage often employs sophisticated noise suppression, acoustic echo cancellation and automatic gain control techniques. As the demand for speech recognition has grown, these technologies have become increasingly accessible in special-purpose off-the-shelf integrated circuits.
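The delay-and-sum idea behind beamforming can be sketched in a few lines. The following is a minimal, hypothetical illustration rather than production DSP: a pure-Python cross-correlation estimates the relative delay between two microphone channels carrying a synthetic test tone, and the channels are then aligned and averaged so that sound from the estimated direction adds constructively. A real system would process streaming audio on dedicated DSP hardware with far more sophisticated filtering.

```python
# Hypothetical delay-and-sum beamforming sketch for a two-microphone array.
import math

def cross_correlate_delay(a, b, max_lag):
    """Return the lag (in samples) at which signal b best lines up with a."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(a[i] * b[i + lag]
                    for i in range(max_lag, len(a) - max_lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def delay_and_sum(mics, delays):
    """Align each channel by its delay and average, reinforcing the target."""
    n = len(mics[0])
    out = []
    for i in range(n):
        acc, count = 0.0, 0
        for sig, d in zip(mics, delays):
            if 0 <= i + d < n:
                acc += sig[i + d]
                count += 1
        out.append(acc / count if count else 0.0)
    return out

# Synthetic test: a 1 kHz tone sampled at 48 kHz; the second mic hears it
# 5 samples later because it is slightly further from the talker.
fs, f, n = 48000, 1000, 480
clean = [math.sin(2 * math.pi * f * i / fs) for i in range(n)]
mic0 = clean
mic1 = [0.0] * 5 + clean[:-5]

lag = cross_correlate_delay(mic0, mic1, max_lag=20)   # expected: 5 samples
# Given mic spacing d (m) and the speed of sound c (~343 m/s), the arrival
# angle would follow from asin(c * lag / (fs * d)).
focused = delay_and_sum([mic0, mic1], [0, lag])
```

In a real array the same cross-correlation is run continuously across every microphone pair, and the steering delays are updated as the talker moves.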
To the Cloud?
Once these challenges have been overcome and the system has acquired a clear speech recording, what exactly do you do with it? Cutting-edge speech recognition engines are available as a cloud service from the likes of Google, Amazon, Microsoft, Apple and IBM. Using such services provides a quality of natural language processing (NLP) which was impossible only a few years ago. However, it does impose two key requirements on the device:
- A permanent internet connection to send the speech recording to the cloud and to receive the resulting data back
- A service level agreement with the supplier
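To make the first requirement concrete, the sketch below packages a speech recording into the kind of JSON request body a cloud recognition service might accept. The field names and encoding scheme here are hypothetical; each provider defines its own request format and authentication, and the actual HTTP call is omitted. The general shape is the same everywhere: audio plus configuration go up, a transcript comes back.

```python
# Hypothetical sketch of preparing a speech recording for a cloud service.
# Field names ("config", "audio", "sampleRateHertz", etc.) are illustrative,
# not the schema of any specific provider.
import base64
import json

def build_recognition_request(pcm_bytes, sample_rate_hz=16000, language="en-GB"):
    """Wrap raw little-endian 16-bit PCM audio in a JSON request body."""
    return json.dumps({
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": sample_rate_hz,
            "languageCode": language,
        },
        # Binary audio is base64-encoded so it can travel inside JSON.
        "audio": {"content": base64.b64encode(pcm_bytes).decode("ascii")},
    })

# 10 ms of dummy audio at 16 kHz (160 two-byte samples).
body = build_recognition_request(b"\x00\x01" * 160)
```

Note that even this small example makes the bandwidth cost visible: base64 encoding inflates the audio by a third, which matters when every utterance must cross a metered 3G/4G link.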
If a Wi-Fi network is available to the product, this can be used for communication with the cloud. Where Wi-Fi isn't likely to be available, or the device lacks the UI typically required for Wi-Fi configuration, 3G/4G connectivity can be added at the expense of increased lifetime costs. Alternatively, the speech recording can be processed locally within the device itself. This avoids the use of a cloud service, but the drastically reduced computing power lowers the quality of recognition and restricts the type of language that can be understood. The ability to understand different languages and regional dialects is also sacrificed.
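One common way to live within a local computing budget is to restrict recognition to a small command grammar rather than open-ended language. The sketch below is a hypothetical illustration: it assumes an on-device acoustic model (not shown) has already decoded the audio into a text string, and simply fuzzy-matches that string against the fixed set of commands the product understands, rejecting anything too far from the grammar.

```python
# Hypothetical command-grammar matcher for on-device recognition.
# The command list and threshold are illustrative choices, not a real API.
import difflib

COMMANDS = ["lights on", "lights off", "volume up", "volume down"]

def match_command(decoded, threshold=0.6):
    """Return the closest known command, or None if nothing is close enough.

    `decoded` is the text produced by an on-device acoustic model (assumed,
    not implemented here). A similarity cutoff rejects out-of-grammar speech.
    """
    matches = difflib.get_close_matches(decoded.lower(), COMMANDS,
                                        n=1, cutoff=threshold)
    return matches[0] if matches else None
```

The trade-off the paragraph above describes is explicit here: a slightly misrecognised "lights onn" still resolves to a command, but anything outside the four fixed phrases is simply rejected rather than understood.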
What does all this mean?
As the world continues to adopt voice interfaces, it is important that these challenges are considered when developing new products and services. The technology behind these interfaces is complex, but solutions to its challenges are increasingly maturing, enabling the introduction of this technology to an ever-increasing range of products, businesses and consumers.