---
layout: post
title: Voice Input Supported in Rhapsode 5!
author: Adrian Cochrane
date: 2021-06-13T16:10:28+12:00
---
Not only can Rhapsode read pages aloud to you via [eSpeak NG](https://github.com/espeak-ng/espeak-ng)
and its [own CSS engine](/2020/11/12/css.html), but now you can speak aloud to *it* via
[Voice2JSON](https://voice2json.org/)! All without trusting or relying upon any
[internet services](https://www.gnu.org/philosophy/who-does-that-server-really-serve.html),
except of course for [bog-standard](https://datatracker.ietf.org/doc/html/rfc7230)
webservers to download your requested information from. This completes my
[vision](/2020/10/31/why-auditory.html) for Rhapsode's reading experience!

This speech recognition can be triggered either using the <kbd>space</kbd> key or by calling Rhapsode's name
<span>(Okay, by saying <q>Hey Mycroft</q> because I haven't bothered to train it)</span>.

## Thank you Voice2JSON!
Voice2JSON is **exactly** what I want from a speech-to-text engine!

Across its 4 backends <span>(CMU [PocketSphinx](https://github.com/cmusphinx/pocketsphinx),
Dan Povey's [Kaldi](https://kaldi-asr.org/), Mozilla [DeepSpeech](https://github.com/mozilla/DeepSpeech),
& Kyoto University's [Julius](https://github.com/julius-speech/julius))</span> it supports
*18* human languages! I always like to see more language support, but *this is impressive*.

I can feed it whatever <span>(lightly-preprocessed)</span> random phrases I find in link elements, etc.
to use as voice commands. It even copes with different commands for every webpage,
including unusual words.

It operates entirely on your device, only using the internet initially to download
an appropriate <q>profile</q> for your language.

And when I implement webforms its <q>slots</q> feature will be **invaluable**.

The only gotcha is that I needed to also add a [JSON parser](https://hackage.haskell.org/package/aeson)
to Rhapsode's dependencies.

## Mechanics
To operate Voice2JSON you rerun [`voice2json train-profile`](http://voice2json.org/commands.html#train-profile)
every time you edit [`sentences.ini`](http://voice2json.org/sentences.html) or
any of its referenced files to update the list of supported voice commands.
This prepares a <q>language model</q> to guide the output of
[`voice2json transcribe-stream`](http://voice2json.org/commands.html#transcribe-stream)
or [`transcribe-wav`](http://voice2json.org/commands.html#transcribe-wav),
whose output you'll probably pipe into
[`voice2json recognize-intent`](http://voice2json.org/commands.html#recognize-intent)
to determine which <q>intent</q> from `sentences.ini` it matches.
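Put together, the pipeline looks something like this shell sketch <span>(assuming a profile has already been downloaded to the default location)</span>:

```shell
# Recompile the language model whenever sentences.ini changes
voice2json train-profile

# Listen to the microphone, transcribe speech, & match it
# against the intents defined in sentences.ini
voice2json transcribe-stream | voice2json recognize-intent
```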

If you want this voice recognition to be triggered by some <q>wake word</q>
run [`voice2json wait-wake`](http://voice2json.org/commands.html#wait-wake)
to determine when that keyphrase has been said.

### `voice2json train-profile`
For every page Rhapsode outputs a `sentences.ini` file & runs `voice2json train-profile`
to compile this mix of [INI](https://www.techopedia.com/definition/24302/ini-file) &
[Java Speech Grammar Format](https://www.w3.org/TR/jsgf/) syntax into an appropriate
[NGram](https://blog.xrds.acm.org/2017/10/introduction-n-grams-need/)-based
<q>language model</q> for the backend chosen by the
[downloaded profile](https://github.com/synesthesiam/voice2json-profiles).
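For illustration, here's the kind of `sentences.ini` Rhapsode might generate for a page with a few links <span>(the intent & slot names here are hypothetical, not Rhapsode's actual output)</span>:

```ini
[FollowLink]
(follow | go to) (home | about us | archive){link}

[Navigate]
go (back | forward){direction}
```

Each `[section]` defines an intent, `(a | b)` alternatives come straight from JSGF, & `{tag}` marks a slot to capture.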

Once it's parsed `sentences.ini` Voice2JSON optionally normalizes the sentence casing and
lowers any numeric ranges, <q>slot references</q> from external files or programs, & numeric digits
via [num2words](https://pypi.org/project/num2words/) before reformatting it into a
[NetworkX](https://pypi.org/project/networkx/) [graph](https://www.redblobgames.com/pathfinding/grids/graphs.html)
with weighted edges. This resulting
[Nondeterministic Finite Automaton](https://www.geeksforgeeks.org/%E2%88%88-nfa-of-regular-language-l-0100-11-and-l-b-ba/) (NFA)
is [saved](https://docs.python.org/3/library/pickle.html) & [gzip](http://www.gzip.org/)'d
to the profile before lowering it further to an [OpenFST](http://www.openfst.org/twiki/bin/view/FST/WebHome)
graph which, with a handful of [opengrm](http://www.opengrm.org/twiki/bin/view/GRM/WebHome) commands,
is converted into an appropriate language model.

Whilst lowering the NFA to a language model Voice2JSON looks up how to pronounce every unique
word in that NFA, consulting [Phonetisaurus](https://github.com/AdolfVonKleist/Phonetisaurus)
for any words the profile doesn't know about. Phonetisaurus in turn evaluates the word over a
[Hidden Markov](https://www.jigsawacademy.com/blogs/data-science/hidden-markov-model) n-gram model.

### `voice2json transcribe-stream`

`voice2json transcribe-stream` pipes 16-bit 16kHz mono [WAV](https://datatracker.ietf.org/doc/html/rfc2361)s
from a specified file or profile-configured record command
<span>(defaults to [ALSA](https://alsa-project.org/wiki/Main_Page))</span>
to the backend & formats its output sentences with metadata inside
[JSON Lines](https://jsonlines.org/) objects. To determine when a voice command
ends it uses some sophisticated code [extracted](https://pypi.org/project/webrtcvad/)
from *the* WebRTC implementation <span>(from Google)</span>.

That 16kHz audio sampling rate is interesting: it's far below the 44.1kHz sampling
rate typical for digital audio. Presumably this reduces the computational load
whilst preserving the frequencies
<span>(max 8kHz per [Nyquist-Shannon](https://invidio.us/watch?v=cIQ9IXSUzuM))</span>
typical of human speech.
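A quick back-of-the-envelope calculation <span>(plain Python, nothing voice2json-specific)</span> shows how much lighter that stream is:

```python
def pcm_bytes_per_second(rate_hz, bits=16, channels=1):
    """Raw PCM data rate: samples/sec * bytes/sample * channels."""
    return rate_hz * bits // 8 * channels

speech = pcm_bytes_per_second(16_000)   # 16kHz 16-bit mono: 32,000 B/s
cd = pcm_bytes_per_second(44_100)       # 44.1kHz 16-bit mono: 88,200 B/s
print(speech, cd)
```

So even before any modelling happens, the backend sees well under half the data a CD-quality stream would carry.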

### `voice2json recognize-intent`

To match this output to the grammar defined in `sentences.ini` Voice2JSON provides
the `voice2json recognize-intent` command. This reads back in the compressed
NetworkX NFA to find the best path, fuzzily or not, via
[depth-first-search](https://www.techiedelight.com/depth-first-search) which matches
each input sentence. Once it has that path it iterates over it to resolve & capture:

1. Substitutions
2. Conversions
3. Tagged slots

The resulting information from each of these passes is gathered & output as JSON Lines.
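As a toy illustration of that matching step <span>(not voice2json's actual code, which runs over the trained NetworkX graph)</span>, a depth-first search over a word-labelled NFA might look like:

```python
def nfa_match(nfa, start, accept, words):
    """Depth-first search for a path through a word-labelled NFA that
    consumes exactly the input sentence. `nfa` maps each state to a
    list of (word, next_state) edges. Returns the matched words, or
    None if no path through the grammar fits the sentence."""
    def dfs(state, i, path):
        if i == len(words) and state == accept:
            return path
        for label, nxt in nfa.get(state, []):
            if i < len(words) and label == words[i]:
                found = dfs(nxt, i + 1, path + [label])
                if found is not None:
                    return found
        return None
    return dfs(start, 0, [])

# Tiny grammar: "open (page | link)"
nfa = {0: [("open", 1)], 1: [("page", 2), ("link", 2)]}
print(nfa_match(nfa, 0, 2, ["open", "link"]))  # → ['open', 'link']
```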

In Rhapsode I apply a further fuzzy match, the same one I've always used for keyboard input,
via [Levenshtein Distance](https://devopedia.org/levenshtein-distance).
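Levenshtein distance is the classic dynamic-programming edit distance; a minimal sketch <span>(not Rhapsode's Haskell implementation)</span>:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, &
    substitutions needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```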

### `voice2json wait-wake`

To trigger Rhapsode to recognize a voice command you can either press a key <aside>(<kbd>spacebar</kbd>)</aside>
or, to stick to pure voice control, say a <q>wake word</q> <aside>(currently <q>Hey Mycroft</q>)</aside>.
For this there's the `voice2json wait-wake` command.

`voice2json wait-wake` pipes the same 16-bit 16kHz mono WAV audio as `voice2json transcribe-stream`
into <span>(currently)</span> [Mycroft Precise](https://mycroft-ai.gitbook.io/docs/mycroft-technologies/precise)
& applies some [edge detection](https://www.scilab.org/signal-edge-detection)
to the output probabilities. Mycroft Precise, from the [Mycroft](https://mycroft.ai/)
opensource voice assistant project, is a [Tensorflow](https://www.tensorflow.org/)
[neuralnet](https://invidious.moomoo.me/watch?v=aircAruvnKk) converting
[spectrograms](https://home.cc.umanitoba.ca/~robh/howto.html) <span>(computed via
[sonopy](https://pypi.org/project/sonopy/) or legacy
[speechpy](https://pypi.org/project/speechpy/))</span> into probabilities.
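A crude sketch of what such edge detection could look like <span>(voice2json's actual implementation differs)</span>: report a detection only when the wake-word probability crosses a threshold *upward*, so a sustained high probability fires once rather than on every audio frame.

```python
def rising_edges(probs, threshold=0.8):
    """Indices where the per-frame wake-word probability first
    crosses the threshold upward, i.e. one detection per utterance
    rather than one per frame."""
    edges = []
    above = False
    for i, p in enumerate(probs):
        if p >= threshold and not above:
            edges.append(i)
        above = p >= threshold
    return edges

# A spike around frames 2-3 triggers exactly one detection:
print(rising_edges([0.1, 0.2, 0.9, 0.95, 0.3, 0.1]))  # → [2]
```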

## Voice2JSON Stack
Interpreting audio input into voice commands is a non-trivial task, combining the
efforts of many projects. Last I checked Voice2JSON used the following projects to
tackle various components of this challenge:

* [Python](https://www.python.org/)
* [Rhasspy](https://community.rhasspy.org/)
* num2words
* NetworkX
* OpenFST
* Phonetisaurus
* opengrm
* Mycroft Precise
* Sonopy
* SpeechPy

And for the raw speech-to-text logic you can choose between:

* PocketSphinx <span>(matches audio via several measures to a language model of a handful of types)</span>
* Kaldi <span>(supports many more types of language models than PocketSphinx, including several neuralnet variants)</span>
* DeepSpeech <span>(Tensorflow neuralnet, hard to constrain to a grammar)</span>
* Julius <span>(word n-grams & context-dependent Hidden Markov Models via 2-pass [tree trellis search](https://dl.acm.org/doi/10.3115/116580.116591))</span>

## Conclusion
Rhapsode's use of Voice2JSON shows two things.

First, the web could be a **fantastic** auditory experience *if only* we weren't so
reliant on [JavaScript](https://rhapsode.adrian.geek.nz/2021/01/23/why-html.html#why-not-javascript).

Second, there is *zero* reason for [Siri](https://www.apple.com/siri/), [Alexa](https://www.developer.amazon.com/en-US/alexa/),
[Cortana](https://www.microsoft.com/en-us/cortana/), etc to offload their computation
to [the cloud](https://grahamcluley.com/cloud-someone-elses-computer/). Voice recognition
may not be a trivial task, but even modest consumer hardware is more than capable
of doing a good job at it.

<style>span {voice-volume: soft;}</style>