Google Reveals A New Era For Voice Search
Google reveals a new era for voice search powered by AI, promising faster, smarter, and more conversational results for users worldwide.
Google has released a major upgrade to its voice search technology, unveiling a new AI-powered system that promises faster, more accurate results by processing speech directly, without first converting it to text.
Google Moving Beyond Text Conversion
The earlier method, known in the field as cascade ASR, relied on converting spoken queries into text before retrieving and ranking results. This transcription step frequently lost important contextual cues, which led to errors.
The new technology, Speech-to-Retrieval (S2R), eliminates the text-conversion step entirely. It uses neural networks trained on paired audio queries and documents to link spoken queries directly to relevant information.
Dual-Encoder Model: Two Neural Networks
S2R is built on two neural networks that work together:
Audio Encoder: Converts spoken queries into vectors that capture their meaning.
Document Encoder: Converts documents into vectors of the same form.
Both encoders map audio and text into a shared semantic space, ensuring that spoken queries and the documents relevant to them end up close to each other.
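The shared-space idea can be illustrated with a minimal sketch. The embeddings below are hand-picked stand-ins; in a real S2R system they would come from the trained audio and document encoders, which Google has not released.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical pre-computed embeddings standing in for encoder outputs.
audio_query_vec = np.array([0.9, 0.1, 0.3])   # audio encoder("the scream painting")
doc_vecs = {
    "munch_the_scream": np.array([0.85, 0.15, 0.35]),
    "weather_report":   np.array([0.05, 0.90, 0.10]),
}

# Retrieval: rank documents by how close they sit to the spoken query's vector.
ranked = sorted(doc_vecs,
                key=lambda d: cosine_similarity(audio_query_vec, doc_vecs[d]),
                reverse=True)
print(ranked[0])  # the Munch document scores highest
```

Because both encoders target the same space, retrieval reduces to a nearest-neighbor lookup over document vectors.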
How Audio and Document Encoders Work
For instance, the phrase “the scream painting” is converted into a vector close to information about Edvard Munch’s The Scream, including museum details.
Documents such as web pages are vectorized in the same way to represent their content, allowing precise matching to spoken queries. Google calls these vectors “rich” because they encode not just keywords but conceptual intent and context as well.
This makes matching possible even when queries are phrased differently; for example, “show me Munch’s screaming face painting” will still surface content related to The Scream.
According to Google’s announcement:
“The key to this model is how it is trained. Using a large dataset of paired audio queries and relevant documents, the system learns to adjust the parameters of both encoders simultaneously.
The training objective ensures that the vector for an audio query is geometrically close to the vectors of its corresponding documents in the representation space. This architecture allows the model to learn something closer to the essential intent required for retrieval directly from the audio, bypassing the fragile intermediate step of transcribing every word, which is the principal weakness of the cascade design.”
Training and Ranking Process
Google has emphasized the simultaneous training of both encoders on large datasets of paired audio queries and documents. The system learns to pull matching audio and document vectors closer together, increasing retrieval accuracy while avoiding errors introduced by transcription.
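One common way to train a dual encoder toward this objective is an in-batch contrastive loss, where each query's vector is pushed toward its paired document and away from the other documents in the batch. Google has not published the exact loss used for S2R, so the sketch below is illustrative, not the actual method:

```python
import numpy as np

def contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    """In-batch softmax contrastive loss for a dual encoder.

    The correct document for query i is document i; the loss is low when
    each query's vector is geometrically close to its paired document's
    vector and far from the others in the batch.
    """
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T / temperature               # every query vs. every document
    sims -= sims.max(axis=1, keepdims=True)    # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())

# Toy batch of 3 (audio query, document) pairs: each paired document's
# embedding lies near its query's embedding.
rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 8))
docs = queries + 0.1 * rng.normal(size=(3, 8))

loss_paired = contrastive_loss(queries, docs)        # correctly matched pairs
loss_shuffled = contrastive_loss(queries, docs[::-1])  # mismatched pairs score worse
```

Minimizing this loss over many batches is what arranges the shared vector space so that retrieval by similarity works.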
Once candidate documents are identified through vector similarity, a second ranking layer applies hundreds of quality and relevance signals to decide the final order of the search results.
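The two-stage structure can be sketched as a fast vector retrieval pass followed by a richer re-ranking pass. Google's production ranking draws on hundreds of signals; in this hypothetical sketch a single "quality" score stands in for all of them:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, k=2):
    """Stage 1: keep the k candidates with the highest similarity to the query."""
    sims = doc_vecs @ query_vec
    return [int(i) for i in np.argsort(sims)[::-1][:k]]

def rerank(candidates, quality):
    """Stage 2: reorder the surviving candidates by additional signals."""
    return sorted(candidates, key=lambda i: quality[i], reverse=True)

query = np.array([1.0, 0.0])
docs = np.array([[0.9, 0.1],   # doc 0: on-topic
                 [0.8, 0.2],   # doc 1: on-topic
                 [0.0, 1.0]])  # doc 2: off-topic, filtered out in stage 1
quality = {0: 0.3, 1: 0.9, 2: 0.5}  # stand-in for quality/relevance signals

candidates = retrieve(query, docs)   # docs 0 and 1 survive stage 1
final = rerank(candidates, quality)  # doc 1 outranks doc 0 on quality
```

Splitting the work this way keeps the expensive signals off the full index: similarity search narrows millions of documents to a handful, and only those are scored in depth.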
High-Performance and Growth Opportunities
Benchmark tests showed that S2R outperformed the older cascade ASR system and came close to matching an ideal “ground truth” version.
Google acknowledged that there is still room to improve the system, but announced that the update is already live and supports multiple languages, including English.
Google explains:
“Voice Search is now powered by our new Speech-to-Retrieval engine, which gets answers straight from your spoken query without having to convert it to text first, resulting in a faster, more reliable search for everyone.”
Bottom Line
This update marks a new era for voice search: by interpreting spoken words directly with AI, it reduces the errors that transcription introduces. It promises faster, more accurate answers by understanding queries in users’ natural voice.