---
layout: post
title: Voice Input Supported in Rhapsode 5!
author: Adrian Cochrane
date: 2021-06-13T16:10:28+12:00
---

Not only can Rhapsode read pages aloud to you via eSpeak NG and its own CSS engine, but now you can speak aloud to it via Voice2JSON! All without trusting or relying upon any internet services, except of course for bog-standard webservers to download your requested information from. Thereby completing my vision for Rhapsode's reading experience!

This speech recognition can be triggered either by pressing the space key or by calling Rhapsode's name (okay, by saying "Hey Mycroft", because I haven't bothered to train a custom wake word).

## Thank you Voice2JSON!

Voice2JSON is exactly what I want from a speech-to-text engine!

Across its 4 backends (CMU PocketSphinx, Dan Povey's Kaldi, Mozilla DeepSpeech, & Kyoto University's Julius) it supports 18 human languages! I always like to see more language support, but this is impressive.

I can feed it whatever (lightly preprocessed) random phrases I find in link elements, etc. to use as voice commands, even feeding it different commands for every webpage, including unusual words.

It operates entirely on your device, only using the internet initially to download an appropriate profile for your language.

And when I implement webforms, its slots feature will be invaluable.

The only gotcha is that I needed to also add a JSON parser to Rhapsode's dependencies.

## Mechanics

To operate Voice2JSON you rerun `voice2json train-profile` every time you edit `sentences.ini` or any of its referenced files, to update the list of supported voice commands. This prepares a language model to guide the output of `voice2json transcribe-stream` or `transcribe-wav`, whose output you'll probably pipe into `voice2json recognize-intent` to determine which intent from `sentences.ini` it matches.
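
In shell terms that workflow looks roughly like this (a sketch only; the exact flags depend on your profile):

```sh
# Rebuild the language model after editing sentences.ini or any file it references
voice2json train-profile

# Listen on the microphone, transcribe each voice command, & match it against
# sentences.ini; every stage emits JSON Lines on stdout
voice2json transcribe-stream | voice2json recognize-intent
```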

If you want this voice recognition to be triggered by some wake word, run `voice2json wait-wake` to determine when that keyphrase has been said.

## voice2json train-profile

For every page Rhapsode outputs a `sentences.ini` file & runs `voice2json train-profile` to compile this mix of INI & Java Speech Grammar Format syntax into an appropriate n-gram-based language model for the backend chosen by the downloaded profile.
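
For illustration, here's a hypothetical `sentences.ini` (the intents & phrases are made up, not what Rhapsode actually emits) mixing INI sections with JSGF-style alternatives, optional words, & tags:

```sh
# Write a hypothetical grammar & retrain the profile from it
cat > sentences.ini <<'EOF'
[FollowLink]
open (next chapter | table of contents){link}

[Navigate]
go [to the] (top | bottom){position} of the page
EOF

voice2json train-profile
```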

Once it's parsed `sentences.ini`, Voice2JSON optionally normalizes the sentence casing and lowers any numeric ranges, slot references from external files or programs, & numeric digits via num2words, before reformatting it all into a NetworkX graph with weighted edges. The resulting Nondeterministic Finite Automaton (NFA) is gzip'd & saved to the profile, then lowered further to an OpenFST graph which, with a handful of opengrm commands, is converted into an appropriate language model.

Whilst lowering the NFA to a language model Voice2JSON looks up how to pronounce every unique word in that NFA, consulting Phonetisaurus for any words the profile doesn't know about. Phonetisaurus in turn evaluates the word over a Hidden Markov n-gram model.

## voice2json transcribe-stream

`voice2json transcribe-stream` pipes 16-bit 16kHz mono WAV audio from a specified file, or from a profile-configured record command (defaulting to ALSA), into the backend & formats its output sentences with metadata as JSON Lines objects. To determine when a voice command ends it uses some sophisticated voice activity detection code extracted from Google's WebRTC implementation.
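
Transcribing a pre-recorded WAV works much the same way; a quick sketch (the exact output fields vary by profile & version):

```sh
# Transcribe a recorded command instead of streaming from the microphone.
# Prints one JSON object per WAV, roughly: {"text": "open next chapter", ...}
voice2json transcribe-wav command.wav
```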

That 16kHz audio sampling rate is interesting: it's far below the 44.1kHz sampling rate typical for digital audio. Presumably this reduces the computational load whilst preserving the frequencies (at most 8kHz, per Nyquist-Shannon) typical of human speech.

## voice2json recognize-intent

To match this output against the grammar defined in `sentences.ini`, Voice2JSON provides the `voice2json recognize-intent` command. This reads back in the compressed NetworkX NFA & finds, via depth-first search, the best path (fuzzily or not) matching each input sentence. Once it has that path it iterates over it to resolve & capture:

  1. Substitutions
  2. Conversions
  3. Tagged slots

The resulting information from each of these passes is gathered & output as JSON Lines.
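
Continuing the hypothetical grammar from above, feeding in a transcription (as the JSON Lines emitted by `transcribe-stream` or `transcribe-wav`) yields something like the following; again, the exact fields vary by version:

```sh
# Match a transcription against the trained grammar; the output names the
# matched intent & any tagged slots, roughly:
#   {"text": "open next chapter", "intent": {"name": "FollowLink"},
#    "slots": {"link": "next chapter"}, ...}
echo '{"text": "open next chapter"}' | voice2json recognize-intent
```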

In Rhapsode I apply a further fuzzy match via Levenshtein distance, the same one I've always used for keyboard input.

## voice2json wait-wake

To trigger Rhapsode to recognize a voice command you can either press a key (spacebar) or, to stick to pure voice control, say a wakeword (currently "Hey Mycroft"). For this there's the `voice2json wait-wake` command.

`voice2json wait-wake` pipes the same 16-bit 16kHz mono WAV audio as `voice2json transcribe-stream` into (currently) Mycroft Precise & applies some edge detection to the output probabilities. Mycroft Precise, from the Mycroft open-source voice assistant project, is a TensorFlow neural net converting spectrograms (computed via sonopy or legacy speechpy) into probabilities.
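
A rough sketch of how the pieces chain together from a shell (Rhapsode wires them up itself rather than going through a pipeline like this):

```sh
# Wait for the first wake-word detection (one JSON line per detection),
# then start transcribing & recognizing voice commands
voice2json wait-wake | head -n 1 && \
    voice2json transcribe-stream | voice2json recognize-intent
```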

## Voice2JSON Stack

Interpreting audio input into voice commands is a non-trivial task, combining the efforts of many projects. Last I checked Voice2JSON used the following projects to tackle various components of this challenge:

* Python
* Rhasspy
* num2words
* NetworkX
* OpenFST
* Phonetisaurus
* opengrm
* Mycroft Precise
* Sonopy
* SpeechPy

And for the raw speech-to-text logic you can choose between:

* PocketSphinx (matches audio via several measures to a language model of a handful of types)
* Kaldi (supports many more types of language models than PocketSphinx, including several neural-net variants)
* DeepSpeech (TensorFlow neural net, hard to constrain to a grammar)
* Julius (word n-grams & context-dependent Hidden Markov Models via 2-pass tree trellis search)

## Conclusion

Rhapsode's use of Voice2JSON shows two things.

First, the web could be a fantastic auditory experience if only we weren't so reliant on JavaScript.

Second, there is zero reason for Siri, Alexa, Cortana, etc. to offload their computation to the cloud. Voice recognition may not be a trivial task, but even modest consumer hardware is more than capable of doing a good job at it.
