---
layout: post
title: Voice Input Supported in Rhapsode 5!
author: Adrian Cochrane
date: 2021-06-13T16:10:28+12:00
---
Not only can Rhapsode read pages aloud to you via [eSpeak NG](https://github.com/espeak-ng/espeak-ng)
and it's [own CSS engine](/2020/11/12/css.html), but now you can speak aloud to *it* via
[Voice2JSON](https://voice2json.org/)! All without trusting or relying upon any
[internet services](https://www.gnu.org/philosophy/who-does-that-server-really-serve.html),
except ofcourse for [bogstandard](https://datatracker.ietf.org/doc/html/rfc7230)
webservers to download your requested information from. Thereby completing my
[vision](/2020/10/31/why-auditory.html) for Rhapsode's reading experience!
This speech recognition can be triggered either using the space key or by calling Rhapsode's name
(Okay, by saying Hey Mycroft because I haven't bothered to train it).
## Thank you Voice2JSON!
Voice2JSON is **exactly** what I want from a speech-to-text engine!
Accross it's 4 backends (CMU [PocketSphinx](https://github.com/cmusphinx/pocketsphinx),
Dan Povey's [Kaldi](https://kaldi-asr.org/), Mozilla [DeepSpeech](https://github.com/mozilla/DeepSpeech),
& Kyoto University's [Julius](https://github.com/julius-speech/julius)) it supports
*18* human languages! I always like to see more language support, but *this is impressive*.
I can feed it (lightly-preprocessed) whatever random phrases I find in link elements, etc
to use as voice commands. Even feeding it different commands for every webpage, including
unusual words.
It operates entirely on your device, only using the internet initially to download
an appropriate profile for your language.
And when I implement webforms it's slots feature will be **invaluable**.
The only gotcha is that I needed to also add a [JSON parser](https://hackage.haskell.org/package/aeson)
to Rhapsode's dependencies.
## Mechanics
To operate Voice2JSON you rerun [`voice2json train-profile`](http://voice2json.org/commands.html#train-profile)
everytime you edit [`sentences.ini`](http://voice2json.org/sentences.html) or
any of it's referenced files to update the list of supported voice commands.
This prepares a language model to guide the output of
[`voice2json transcribe-stream`](http://voice2json.org/commands.html#transcribe-stream)
or [`transcribe-wav`](http://voice2json.org/commands.html#transcribe-wav),
who's output you'll probably pipe into
[`voice2json recognize-intent`](http://voice2json.org/commands.html#recognize-intent)
to determine which intent from `sentences.ini` it matches.
If you want this voice recognition to be triggered by some wake word
run [`voice2json wait-wake`](http://voice2json.org/commands.html#wait-wake)
to determine when that keyphrase has been said.
### `voice2json train-profile`
For every page Rhapsode outputs a `sentences.ini` file & runs `voice2json train-profile`
to compile this mix of [INI](https://www.techopedia.com/definition/24302/ini-file) &
[Java Speech Grammar Format](https://www.w3.org/TR/jsgf/) syntax into an appropriate
[NGram](https://blog.xrds.acm.org/2017/10/introduction-n-grams-need/)-based
language model for the backend chosen by the
[downloaded profile](https://github.com/synesthesiam/voice2json-profiles).
Once it's parsed `sentences.ini` Voice2JSON optionally normalizes the sentence casing and
lowers any numeric ranges, slot references from external files or programs, & numeric digits
via [num2words](https://pypi.org/project/num2words/) before reformatting it into a
[NetworkX](https://pypi.org/project/networkx/) [graph](https://www.redblobgames.com/pathfinding/grids/graphs.html)
with weighted edges. This resulting
[Nondeterministic Finite Automaton](https://www.geeksforgeeks.org/%E2%88%88-nfa-of-regular-language-l-0100-11-and-l-b-ba/) (NFA)
is [saved](https://docs.python.org/3/library/pickle.html) & [gzip](http://www.gzip.org/)'d
to the profile before lowering it further to an [OpenFST](http://www.openfst.org/twiki/bin/view/FST/WebHome)
graph which, with a handful of [opengrm](http://www.opengrm.org/twiki/bin/view/GRM/WebHome) commands,
is converted into an appropriate language model.
Whilst lowering the NFA to a language model Voice2JSON looks up how to pronounce every unique
word in that NFA, consulting [Phonetisaurus](https://github.com/AdolfVonKleist/Phonetisaurus)
for any words the profile doesn't know about. Phonetisaurus in turn evaluates the word over a
[Hidden Markov](https://www.jigsawacademy.com/blogs/data-science/hidden-markov-model) n-gram model.
### `voice2json transcribe-stream`
`voice2json transcribe-stream` pipes 16bit 16khz mono [WAV](https://datatracker.ietf.org/doc/html/rfc2361)s
from a specified file or profile-configured record command
(defaults to [ALSA](https://alsa-project.org/wiki/Main_Page))
to the backend & formats it's output sentences with metadata inside
[JSON Lines](https://jsonlines.org/) objects. To determine when a voice command
ends it uses some sophisticated code [extracted](https://pypi.org/project/webrtcvad/)
from *the* WebRTC implementation (from Google).
That 16khz audio sampling rate is interesting, it's far below the 44.1khz sampling
rate typical for digital audio. Presumably this reduces the computational load
whilst preserving the frequencies
(max 8khz per [Nyquist-Shannon](https://invidio.us/watch?v=cIQ9IXSUzuM))
typical of human speech.
### `voice2json recognize-intent`
To match this output to the grammar defined in `sentences.ini` Voice2JSON provides
the `voice2json recognize-intent` command. This reads back in the compressed
NetworkX NFA to find the best path, fuzzily or not, via
[depth-first-search](https://www.techiedelight.com/depth-first-search) which matches
each input sentence. Once it has that path it iterates over it to resolve & capture:
1. Substitutions
2. Conversions
3. Tagged slots
The resulting information from each of these passes is gathered & output as JSON Lines.
In Rhapsode I apply a further fuzzy match, the same I've always used for keyboard input,
via [Levenshtein Distance](https://devopedia.org/levenshtein-distance).
### `voice2json wait-wake`
To trigger Rhapsode to recognize a voice command you can either press a key
or, to stick to pure voice control, saying a wakeword