Adding nuance to "Observe spoken text" #6

Open
WestonThayer opened this issue Nov 10, 2021 · 1 comment

Comments

@WestonThayer

https://github.com/bocoup/aria-at-automation#observe-spoken-text

Scoping speech metadata sent to the TTS

While exploring NVDA's source and the TTS-engine side of the SAPI 5.4 API, I realized that screen readers send much more than plain speech strings to be spoken by the TTS. In the case of SAPI, NVDA sends SSML, which ISpTTSEngine::Speak receives as SPVSTATE metadata attached to each text fragment (SPVTEXTFRAG).

Metadata includes:

  • LangID - the language associated with all or part of an announcement, which the TTS can use to adjust vocalization. We could use this to test that NVDA correctly processes a multi-language web page
  • EmphAdj - Not sure this is used, but it could presumably ensure that <em> semantics are picked up and conveyed by the screen reader
  • PitchAdj - Could test that NVDA correctly increases pitch for capital letters
  • SilenceMSecs - Via the SSML <silence> tag, NVDA inserts this for BreakCommands. Could be used to test appropriate cadence
  • There's also SPVACTIONS, which includes SPVA_Pronounce and SPVA_SpellOut. I think NVDA provides its own spelling functionality, but it does appear to use <pron>. (A sketch of capturing this metadata follows the list.)
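
To make the question concrete, here is a rough, self-contained sketch of what a capturing SAPI engine could record per Speak() call if it kept the SPVSTATE-derived metadata alongside the text. All names here are hypothetical and not part of any existing driver; the point is that a test could then assert on language changes, pitch adjustments, or inserted silence rather than only on the flattened string:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SpokenFragment:
    """One text fragment plus the SPVSTATE-style metadata attached to it."""
    text: str
    lang_id: Optional[int] = None  # LangID: LCID in effect for this fragment
    emph_adj: int = 0              # EmphAdj: emphasis adjustment
    pitch_adj: int = 0             # PitchAdj: e.g. raised for capital letters
    silence_msecs: int = 0         # SilenceMSecs: pause requested for the fragment
    action: str = "speak"          # SPVACTIONS: "speak", "pronounce", "spell_out", ...

@dataclass
class CapturedUtterance:
    """Everything the capturing engine saw for a single Speak() call."""
    fragments: List[SpokenFragment] = field(default_factory=list)

    def plain_text(self) -> str:
        return "".join(f.text for f in self.fragments)

    def languages(self) -> List[int]:
        return [f.lang_id for f in self.fragments if f.lang_id is not None]

# A test could assert on metadata, not just the spoken string:
utterance = CapturedUtterance(fragments=[
    SpokenFragment("hello ", lang_id=0x0409),          # en-US
    SpokenFragment("bonjour", lang_id=0x040C),         # fr-FR
    SpokenFragment("", silence_msecs=150),             # inserted break
])
assert utterance.plain_text() == "hello bonjour"
assert utterance.languages() == [0x0409, 0x040C]
```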

Should "observe spoken text" should include this level of detail?

The technical speech observation solution is scoped to a particular TTS API

Looking through NVDA's source, I realized that it has several synthDrivers, currently for SAPI 4, SAPI 5, OneCore, and eSpeak. Our SAPI 5 driver only exercises NVDA's SAPI 5 code path.

Is it worth documenting this... tradeoff?

Pragmatically, I think the chance of finding a bug in a specific TTS driver is low, and building a comprehensive solution probably isn't worth the effort. That said, the drivers do have some complexity: synthDrivers/oneCore.py maintains its own queue, and all three have different SSML conversion algorithms (looking at the commit history, eSpeak seems to tolerate malformed SSML while OneCore rejects it).
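
To illustrate the kind of per-driver conversion step involved, here is a minimal, self-contained sketch. The command classes are simplified stand-ins for NVDA's speech.commands, and the SAPI XML tags are an approximation for the example, not NVDA's actual output:

```python
from dataclasses import dataclass
from typing import List, Union

# Simplified stand-ins for NVDA speech commands (speech.commands.*).
@dataclass
class LangChangeCommand:
    lcid: int  # real NVDA uses a language code string; an LCID is used here for brevity

@dataclass
class BreakCommand:
    time_ms: int

@dataclass
class PitchCommand:
    offset: int

SpeechSequence = List[Union[str, LangChangeCommand, BreakCommand, PitchCommand]]

def to_sapi_xml(sequence: SpeechSequence) -> str:
    """Serialize a speech sequence into SAPI 5 TTS XML (approximate tags only)."""
    out = []
    for item in sequence:
        if isinstance(item, str):
            out.append(item)
        elif isinstance(item, LangChangeCommand):
            out.append(f'<lang langid="{item.lcid:x}"/>')
        elif isinstance(item, BreakCommand):
            out.append(f'<silence msec="{item.time_ms}"/>')
        elif isinstance(item, PitchCommand):
            out.append(f'<pitch absmiddle="{item.offset}"/>')
    return "".join(out)

print(to_sapi_xml(["Hello ", LangChangeCommand(0x040C), "bonjour",
                   BreakCommand(150), PitchCommand(5), "A"]))
# Hello <lang langid="40c"/>bonjour<silence msec="150"/><pitch absmiddle="5"/>A
```

Each synthDriver effectively reimplements something like this (and its error handling for malformed markup), which is why testing through only one of them leaves the others unexercised.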

@jscholes

@WestonThayer This is great information; thanks for carrying out the research and writing it up.

Keep in mind that a virtual system-level (i.e. SAPI5 on Windows) engine is only one of the paths that will be investigated going forward. It is likely that screen-reader-specific code will also be needed to implement parts of the automation driver protocol, and such in-process facilities may also involve capturing the speech before it even leaves the screen reader's boundaries, e.g. with a "tee"-like synth driver to allow speech to be captured while also speaking it out loud for developers and/or testers. That would make use of similar things to what you've outlined here, albeit SR-specific internal ones, e.g. NVDA's formatting/command fields.
