Table of Contents

  1. Overview
    1. Quick start
    2. Instructions on how to run examples
  2. DSL
    1. Nonterminals
    2. Lexicons
    3. Attributes
    4. Syntax of common regular expression operations
    5. Within utterance expressions
    6. Attribute arguments and captures
    7. The env argument
    8. Attributed lexicons
    9. The null symbol
    10. Captures within closures
  3. Examples
    1. Sleep mode
    2. Spoken numbers
    3. Key chords
  4. Grammar inspiration and resources
  5. Credits

Overview

Vocoder is a software package for dictation and voice control, in the same category of software as Dragon, Dragonfly, Caster, Talon, and Serenade. You install it using pip, poetry, or a similar tool, and then write a python script defining a grammar that maps patterns of speech to python functions. Those functions run when the corresponding pattern of speech is detected.

At the core of vocoder is a domain-specific language (DSL) for concisely and flexibly defining the speech command grammar, i.e. the recognizable patterns of speech and the way in which they trigger the execution of python functions. The grammar can be thought of as a giant regular expression in which certain subexpressions are associated with python functions that run as soon as those subexpressions are matched. Vocoder takes advantage of f-strings to intersperse python functions with subexpressions of the grammar. By using the DSL to define the grammar and python to define the actions to take, such as executing keystrokes, you can create scripts for dictating prose or controlling desktop applications by voice. The grammar also improves speech recognition accuracy by constraining the set of words that can be recognized. For more information about the grammar DSL, see the DSL section below.

Quick start

To get started, install vocoder using pip install vocoder-dictation or clone the source code and install the dependencies with poetry. Then run the following in a python interpreter or as a script:

from vocoder.app import App
from vocoder.grammar import Grammar

def _action(t):
    print(f"You said '{' '.join(t)}'!")

g = Grammar()
g(f"""
!start = hello world => %{g(_action)}
""")

App(g).run()

When run, vocoder will start listening to input from the microphone. If the words "hello world" are detected, vocoder will print "You said 'hello world'!". You can press ctrl-c to exit.

Note that after you say "hello world" once, vocoder will not recognize any more speech. In vocoder, the grammar is traversed a single time rather than being reset for each utterance.

The run method in the last line can be given the argument text=True in order to start a text prompt where you can enter "hello world" instead of speaking into the microphone. This can be useful to experiment with grammars.

Vocoder may have problems understanding speech with poor or even average quality microphones. For best results, you will need a decent microphone. Vocoder currently uses the wav2vec2 acoustic model published by Facebook on Hugging Face.

Instructions on how to run examples

All of the following examples can be run by first running

from vocoder.app import App
from vocoder.grammar import Grammar

g = Grammar()

then running the code given, and finally running

App(g).run()

or

App(g).run(text=True)

DSL

Nonterminals

The DSL represents the speech grammar using assignment statements like

!start = !number is a number
!number = one | two | three

In this example, !start and !number can be thought of as variables with values defined by the regular expressions on the right-hand side of the equals signs. The DSL uses special characters prefixed to words to denote their type. Regular expression variables or "nonterminals" are prefixed by !. Every configuration must define the nonterminal !start, which is the entrypoint to the grammar. Nonterminals can't be recursively defined.
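
One way to picture nonterminals is as named subexpressions that are substituted into the expressions that reference them. The sketch below is not vocoder's implementation. It uses a hypothetical `expand` helper, handling only words, `|`, and nonterminals, that turns the two rules above into a Python `re` pattern. It also shows why a recursive definition could never be expanded this way.

```python
import re

# The two rules from the example above, keyed by nonterminal name.
RULES = {
    "!start": "!number is a number",
    "!number": "one | two | three",
}

def expand(name, rules, seen=()):
    """Substitute nonterminals until only words and `|` remain (hypothetical helper)."""
    if name in seen:
        # A rule that refers back to itself would expand forever.
        raise ValueError(f"{name} is defined recursively")
    alternatives = [[]]
    for token in rules[name].split():
        if token == "|":
            alternatives.append([])  # start the next alternative
        elif token.startswith("!"):
            alternatives[-1].append(expand(token, rules, seen + (name,)))
        else:
            alternatives[-1].append(re.escape(token))
    return "(?:" + "|".join(" ".join(seq) for seq in alternatives) + ")"

pattern = expand("!start", RULES)
print(bool(re.fullmatch(pattern, "two is a number")))  # True
```

Calling `expand("!loop", {"!loop": "again !loop"})` raises `ValueError`, which mirrors the restriction that nonterminals can't be recursively defined.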

Lexicons

For regular expressions in contexts like IDEs and most other programs, the basic matching unit of a regular expression is a single character or class of characters like [a-z]. In vocoder, the basic unit is better thought of as a word like "one" or "two" in the example above. You can also use a set of words as a basic matching unit:

numbers = ["one", "two", "three"]
g(f"""
!start = :number is a number
:number = :{g(numbers)}  // lexicon assignment statement
""")

Sets of words are denoted by identifiers prefixed with : and called "lexicons." They can also be inlined like

g(f'!start = :{g(["one", "two", "three"])} is a number')

Lexicons can also be provided as python dictionaries. For details, see below.

You can also create lexicons by forming the union of or performing subtraction on existing lexicons and words.

g(f'''
:first_three = :{g(["one", "two", "three"])}
:rest = four + five + six + seven + eight + nine
:some_numbers = :first_three + :rest - three - four
!start = :some_numbers is a number
''')
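
As a rough mental model, lexicon arithmetic behaves like operations on python sets. The sketch below mirrors the example above with plain sets, assuming that `+` is union, `-` is difference, and the operators apply left to right. Vocoder's actual evaluation order is not stated here, so treat that as an assumption.

```python
first_three = {"one", "two", "three"}
rest = {"four", "five", "six", "seven", "eight", "nine"}

# :some_numbers = :first_three + :rest - three - four
some_numbers = (first_three | rest) - {"three"} - {"four"}

print(sorted(some_numbers))
```

Under this model, "three is a number" and "four is a number" would no longer be recognized, while the other seven numbers would be.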

Attributes

The DSL uses the % prefix to represent functions called "attributes" that should run when regular expressions are matched.

_print = lambda x: print(x)
g(f"""
!start = print this -> %_print | print that -> %_print | don't print anything
%_print = %{g(_print)}  // attribute assignment statement
""")

Attributes can also be inlined as shown in the example in the quick start.

The DSL consists of three types of statements: nonterminal assignments, lexicon assignments, and attribute assignments. The above examples have shown each type of statement.

Syntax of common regular expression operations

The DSL ignores extra whitespace and uses C-style comments. The syntax that vocoder uses to represent standard regular expression operations is unique:

| Operation | Vocoder |
| --- | --- |
| Match "hello" and then "world" | `hello world` |
| Match "hello" 0 or 1 times | `[ hello ]` |
| Match "hello" 0 or more times | `<* hello >` |
| Match "hello" 1 or more times | `< hello >` |
| Match "hello" or "world" | `hello \| world` |
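
For readers who know Python's `re` module, the operations listed above can be compared with character-level patterns. The mapping below is an illustrative assumption, not something vocoder produces. It treats an utterance as words joined by single spaces and writes each DSL form as a roughly equivalent `re` pattern.

```python
import re

# Hypothetical word-level translations of each DSL form into `re` syntax.
EQUIVALENTS = {
    "hello world":   r"hello world",
    "[ hello ]":     r"(?:hello)?",
    "<* hello >":    r"(?:hello(?: hello)*)?",
    "< hello >":     r"hello(?: hello)*",
    "hello | world": r"hello|world",
}

def matches(dsl, utterance):
    """Return True if the space-joined utterance fits the translated pattern."""
    return re.fullmatch(EQUIVALENTS[dsl], utterance) is not None

print(matches("< hello >", "hello hello hello"))  # True
print(matches("< hello >", ""))                   # False
print(matches("<* hello >", ""))                  # True
```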

Within utterance expressions

People usually speak in segments divided by silence. The pause between segments, or utterances, carries some information. For instance, it sometimes indicates that a thought has been completed. Speech recognition systems naturally reflect the segmented nature of speech. One component of the system, the voice activity detector, segments audio into utterances, and another component attempts to determine what words were spoken in each utterance.

In vocoder, we can specify that some regular expression must be matched entirely within a single utterance using the operator ~:

g(f"""
!start = < ~< hello > -> %{g(lambda words: print(" ".join(words)))} > end
""")

Without the ~ in this program, the attribute will not run until you say "end." With the ~, every time you pause in your speech, vocoder will print all of the "hello"s you just said.

You can use the syntactic sugar !A ~= regex for !A = ~(regex), which helps reduce the number of parentheses in your regular expressions.

Attribute arguments and captures

As explained above, attributes are python functions. Consider again the example from the quick start:

def _action(t):
    print(f"You said '{' '.join(t)}'!")

g(f"""
!start = hello world => %{g(_action)}
""")

First, note that the single line in the config is actually syntactic sugar for the following:

g(f"""
!start = (hello world)@1 -> %{g(_action)}
""")

The "capture" @1 means that the "value" (defined below) of the regular expression (hello world) will be passed to the first argument of the attribute _action. The form !nonterminal = regex => %attribute always resolves to !nonterminal = (regex) -> %attribute. The capture @1 was added because vocoder detected that the attribute _action had one argument and there was no corresponding capture in the regular expression hello world.

Here is an example with multiple captures:

g(f"""
!start = hello@1 world@2 => %{g(lambda x, y: print(f"Reversed: {y} {x}"))}
""")

Captures of the form @i where i is an integer indicating the position of an argument in an attribute are called "positional captures." There are also "named captures" that map onto attribute arguments by name. For instance:

g(f"""
!start = (one | two)@num is a number => %{g(lambda num: print(f"{num} is a number"))}
""")

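The relationship between captures and attribute arguments can be modelled in a few lines of plain python. In this sketch, which is an assumption about the behaviour and not vocoder's code, captured values are stored in a dictionary. Integer keys stand for positional captures such as @1 and string keys stand for named captures such as @num. The `call_with_captures` helper is hypothetical.

```python
def call_with_captures(attribute, captures):
    """Call attribute with positional (int-keyed) and named (str-keyed) captures."""
    positions = sorted(k for k in captures if isinstance(k, int))
    args = [captures[k] for k in positions]
    kwargs = {k: v for k, v in captures.items() if isinstance(k, str)}
    return attribute(*args, **kwargs)

# hello@1 world@2
print(call_with_captures(lambda x, y: f"Reversed: {y} {x}", {1: "hello", 2: "world"}))

# (one | two)@num
print(call_with_captures(lambda num: f"{num} is a number", {"num": "two"}))
```
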
The "values" of regular expressions passed to attributes are defined as follows

| Regex | Value | Note |
| --- | --- | --- |
| `word` | `word` | The value of a word or (non-attributed) lexicon is the word that was spoken (as a str) |
| `A ... Z` | `[ Value(A), ..., Value(Z) ]` | I.e. python list of component values |
| `[ A ]` | None if A was not matched, otherwise Value(A) | |
| `A \| B` | Value(A) if A was matched, otherwise Value(B) | |
| `< A >` | `[ Value(A), ..., Value(A) ]` | I.e. list of values of all matches of child expression |
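
These value rules determine what an attribute has to be prepared to receive. The sketch below calls three small functions with hand-written values of the shapes described above, without running vocoder. The function names are illustrative only.

```python
def join_words(words):
    """Attribute for a sequence such as `hello world`: receives a list of words."""
    return " ".join(words)

def add_optional(x, y):
    """Attribute for `A@x [ B ]@y`: y is None when the optional part was skipped."""
    return x + (y or 0)

def count_matches(values):
    """Attribute for `< A >`: receives one value per repetition."""
    return len(values)

print(join_words(["hello", "world"]))
print(add_optional(50, None), add_optional(50, 5))
print(count_matches(["hello", "hello", "hello"]))
```

The `y or 0` idiom is the same one used in the spoken numbers example to handle an optional digit.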

The env argument

If an attribute has an argument named env, that argument will be passed a special env object. When the program first starts running, the env object will have a single attribute app that refers to the running vocoder application. The app object has an exit method that can be used to exit vocoder. For instance

g(f"""
!start = exit => %{g(lambda env: env.app.exit())}
""")

Vocoder will simply exit as soon as you say "exit." You can also assign values to attributes of env and use it to store whatever objects you like.
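
Because env accepts arbitrary attributes, it can carry state from one attribute call to the next. The sketch below uses `types.SimpleNamespace` as a stand-in for the env object so that it runs without vocoder. The `count` attribute and the `count_hello` function are names invented for this illustration.

```python
from types import SimpleNamespace

def count_hello(env):
    """Attribute that keeps a running count on env."""
    env.count = getattr(env, "count", 0) + 1
    print(f"hello number {env.count}")

# Stand-in only: in a real program vocoder supplies env, with env.app set.
env = SimpleNamespace(app=None)
count_hello(env)
count_hello(env)
```

In a grammar, a function like this would be attached in the same way as the exit example, by passing it to g inside an attribute.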

Attributed lexicons

Normally, the value of a lexicon is the word that was spoken. For instance, in the following example

g(f'!start = :{g(["one", "two", "three"])}@x => %{g(lambda x: print(x))}')

vocoder will print whatever word you say from the lexicon.

You can create a lexicon with special values by providing a dictionary with string keys instead of a list of strings:

g(f'!start = :{g({"one": 1, "two": 2, "three": 3})}@x => %{g(lambda x: print(x))}')

If you say "one", then vocoder will print the digit 1.
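
The difference between the two kinds of lexicon can be summarized as a lookup rule. The `lexicon_value` helper below is a hypothetical model of that rule, not a vocoder function. A list yields the spoken word itself, and a dictionary yields the value stored under that word.

```python
def lexicon_value(lexicon, word):
    """Return the value a capture would receive when `word` is spoken."""
    if isinstance(lexicon, dict):
        return lexicon[word]      # attributed lexicon: mapped value
    if word not in lexicon:
        raise KeyError(word)      # word is outside the lexicon
    return word                   # plain lexicon: the word itself

print(lexicon_value(["one", "two", "three"], "one"))
print(lexicon_value({"one": 1, "two": 2, "three": 3}, "one"))
```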

The null symbol

The symbol _ matches the empty string. For instance

g(f"!start = _ -> %{g(lambda: print('hello world'))} one two three")

will print "hello world" when you run it (before you say "one").

Captures within closures

If a capture (i.e. an expression of the form R@i) is within a "closure" like < S > or <* T >, then the capture doesn't correspond to an argument of an attribute. Instead, the value of the closure provides a way to refer to the capture: it is an object that inherits from python's list and has an extra method, iter_captures, that lets you iterate over all matched instances of the captures. See the numbers example for details.

Examples

The following subsections show how to implement some common dictation patterns in vocoder.

Sleep mode

We can use regular expressions to create a grammar with a sleep mode.

from vocoder.lexicons import en_frequent

g(f"""
!start = <   ~(vocoder sleep) <* :en - vocoder > ~(vocoder wake)
           | ~< :en - vocoder > -> %{g(lambda words: print(" ".join(words)))}
         >
:en = :{g(en_frequent(30_000))}
""")

If you say "vocoder sleep," vocoder will enter a mode in which it ignores all audio input except the phrase "vocoder wake." When you are in wake mode, any phrase you speak will be written to stdout.
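
The behaviour of this grammar can be pictured as a two-state machine. The sketch below is a simplified model in plain python, not the way vocoder evaluates the grammar. It takes a list of utterances as strings and returns the ones that would be written out, assuming that an utterance containing "vocoder" in wake mode matches nothing unless it is exactly "vocoder sleep".

```python
def transcribe(utterances):
    """Return the utterances that would be printed, honouring sleep and wake."""
    awake = True
    written = []
    for utterance in utterances:
        if awake and utterance == "vocoder sleep":
            awake = False
        elif not awake and utterance == "vocoder wake":
            awake = True
        elif awake and "vocoder" not in utterance.split():
            written.append(utterance)
        # Anything else is ignored: either asleep, or no path in the grammar.
    return written

print(transcribe([
    "hello there",
    "vocoder sleep",
    "this is ignored",
    "vocoder wake",
    "back again",
]))
```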

Spoken numbers

The following grammar will recognize spoken numbers like "ten thousand eight hundred and fifty five" and print them as integers.

from vocoder.lexicons import digit, tens, teen, scale

def construct_number(closure):
    out = 0
    for var, *_ in closure.iter_captures():
        scale = 1
        for s in var.scales:
            scale *= s
        out += var.head * scale
    return out

g(f"""
!start = < !number -> %{g(lambda i: print(i))} >

!number ~= <!nums_0_99@head <*:scale>@scales [and]> => %{g(construct_number)}
!nums_0_99 = :digit | :teen | !nums_20_99
!nums_20_99 = :tens@x [:digit]@y => %{g(lambda x,y: x+(y or 0))}

:digit = :{g(digit)}
:scale = :{g(scale)}
:tens = :{g(tens)}
:teen = :{g(teen)}
""")

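To see what construct_number computes, it can be exercised without a microphone by handing it a stand-in for the closure value. The `FakeClosure` class below is hypothetical. Its shape, a list subclass whose iter_captures yields tuples whose first element exposes head and scales, is inferred from how construct_number uses it and may differ from the real object. The integer values assumed for the lexicons, such as 1000 for "thousand", are also assumptions.

```python
from types import SimpleNamespace

def construct_number(closure):
    out = 0
    for var, *_ in closure.iter_captures():
        scale = 1
        for s in var.scales:
            scale *= s            # e.g. "hundred thousand" -> 100 * 1000
        out += var.head * scale
    return out

class FakeClosure(list):
    """Hypothetical stand-in for the closure value passed to the attribute."""
    def iter_captures(self):
        for head, scales in self:
            yield (SimpleNamespace(head=head, scales=scales),)

# "ten thousand eight hundred and fifty five"
spoken = FakeClosure([(10, [1000]), (8, [100]), (55, [])])
print(construct_number(spoken))  # 10855
```

Each repetition of the closure contributes its head multiplied by the product of its scales, and the contributions are summed.
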
Key chords

You need to install pynput in order to use the following grammar. It allows you to execute keystrokes (for the letters "a", "b", and "c") and key chords (like "ctrl-a" or "ctrl-shift-p"). For instance, try saying "alfa", "bravo", "control alfa", etc.

from pynput.keyboard import Controller, Key

keyboard = Controller()

def execute_chord(mods, term):
    with keyboard.pressed(*mods):
        keyboard.press(term)
        keyboard.release(term)


g(f"""
!start = < !chord >
!chord ~= <*:modifier> @mods :terminal @term => %{g(execute_chord)}

:modifier = :{g({
    "super": Key.cmd,
    "control": Key.ctrl,
    "shift": Key.shift,
    "meta": Key.alt,
})}
:terminal = :{g({
    "alfa": "a",
    "bravo": "b",
    "charlie": "c",
})}
""")

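The order of key events produced by execute_chord can be checked without pynput by substituting a recording object for the keyboard. The `FakeKeyboard` class below is a hypothetical test double that imitates only the three methods used above. Here the keyboard is passed as a parameter instead of being a module-level variable so that the sketch is self-contained.

```python
from contextlib import contextmanager

class FakeKeyboard:
    """Hypothetical stand-in that records events instead of sending them."""
    def __init__(self):
        self.events = []

    def press(self, key):
        self.events.append(("press", key))

    def release(self, key):
        self.events.append(("release", key))

    @contextmanager
    def pressed(self, *keys):
        # Hold the modifiers for the duration of the with-block.
        for key in keys:
            self.press(key)
        try:
            yield
        finally:
            for key in reversed(keys):
                self.release(key)

def execute_chord(mods, term, keyboard):
    with keyboard.pressed(*mods):
        keyboard.press(term)
        keyboard.release(term)

kb = FakeKeyboard()
execute_chord(["ctrl", "shift"], "p", kb)  # "control shift" plus a letter
for event in kb.events:
    print(event)
```

With an empty list of modifiers, which is what the `<*:modifier>` closure yields when none are spoken, only the press and release of the terminal key are recorded.
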
Grammar inspiration and resources

Credits

The way that vocoder represents and works with grammars was inspired by the work leading to kleenexlang. The presentation in Søholm and Tørholm was especially useful.