10 Aug 2005 18:00
Re: speech recognition
Willie Walker <William.Walker <at> Sun.COM>
2005-08-10 16:00:51 GMT
Hi Peter:
Sounds like you are doing some fun work!
Robert Brewer has done some work in this space with his SpeechLion

> I've been thinking about speech recognition engines and where they
> should be integrated. I now think that the window manager is not the
> place for it, it should be at a 'lower' level.

One way to think about a speech recognition engine is to treat it much
like a speech synthesis engine: it's a service that can be used
by assistive technologies.

The speech synthesis problem is a bit simpler because the interface
between the engine and the assistive technology is not so complex.
Speech recognition requires perhaps a bit more complexity because of
the typical need to tell the engine to listen for different grammars
as well as the high degree of two-way communication between the
speech app and the speech engine. It's a solvable problem, however,
and emerging standards such as MRCPv2 are addressing this.

> I'm currently hacking on a little daemon which uses the sphinx2
> recognition engine to convert speech to text after which it sends this
> text to the keyboard driver (using the uinput device driver). This
> means that I'll be able to use my voice to 'type' every keyboard
> character. (My current implementation already does this for a limited
> set of characters)

This is an interesting first step. I've had many conversations with
various folks who run down this path. My personal opinion is that
turning speech into keyboard events is a potentially workable path,
but I believe much more compelling access can be done via higher
level access to the application, such as the AT-SPI.
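
To make the keyboard-event path concrete, here is a minimal sketch of
the translation step only: recognized text in, a list of key
press/release pairs out. The function name and the (keycode, value)
tuple format are assumptions for illustration, not taken from the
daemon under discussion. The key codes are the standard Linux input
event codes; actually writing the events to the uinput device is left
out.

```python
# Linux input event codes (as defined in linux/input-event-codes.h).
KEY_LEFTSHIFT = 42
KEY_SPACE = 57
KEYCODES = {
    "a": 30, "b": 48, "c": 46, "d": 32, "e": 18, "f": 33, "g": 34,
    "h": 35, "i": 23, "j": 36, "k": 37, "l": 38, "m": 50, "n": 49,
    "o": 24, "p": 25, "q": 16, "r": 19, "s": 31, "t": 20, "u": 22,
    "v": 47, "w": 17, "x": 45, "y": 21, "z": 44, " ": KEY_SPACE,
}
PRESS, RELEASE = 1, 0


def text_to_key_events(text):
    """Translate recognized text into (keycode, value) pairs.

    Hypothetical helper: characters without a mapping are skipped,
    which mirrors a daemon that supports only a limited character set.
    """
    events = []
    for ch in text:
        code = KEYCODES.get(ch.lower())
        if code is None:
            continue  # unsupported character, nothing to type
        shifted = ch.isupper()
        if shifted:
            events.append((KEY_LEFTSHIFT, PRESS))
        events.append((code, PRESS))
        events.append((code, RELEASE))
        if shifted:
            events.append((KEY_LEFTSHIFT, RELEASE))
    return events
```

Note that nothing in this translation knows which application will
receive the keys or what state it is in, which is the limitation the
higher-level approach is meant to address.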
As one goes further down the speech input path, one starts to realize
that speech recognition is not perfect. As such, one needs to start
tuning/modifying the speech engine and the grammars it uses to squeeze
the best accuracy/performance out of the engine. Really good tuning
can be done by understanding just what utterances are acceptable input
to the application based on its given state. This understanding can
be better obtained by something such as the AT-SPI.
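
As a rough illustration of state-dependent tuning, the sketch below
restricts the engine's candidate results to the utterances that make
sense in the application's current state. Everything here is assumed
for the example: the state names, the command table, and the n-best
list of (utterance, score) pairs. In practice the acceptable
utterances would be derived from something like the AT-SPI rather
than a hard-coded table.

```python
# Hypothetical table of acceptable utterances per application state.
COMMANDS_BY_STATE = {
    "file_dialog": ["open", "cancel", "go up"],
    "text_editor": ["save", "select all", "bold"],
}


def active_grammar(state, commands_by_state=COMMANDS_BY_STATE):
    """Return the set of utterances that are acceptable in this state."""
    return set(commands_by_state.get(state, ()))


def best_hypothesis(hypotheses, grammar):
    """Pick the highest-scoring hypothesis that the grammar accepts.

    `hypotheses` is a list of (utterance, score) pairs. Returns None
    when no hypothesis is acceptable in the current state.
    """
    accepted = [(text, score) for text, score in hypotheses if text in grammar]
    if not accepted:
        return None
    return max(accepted, key=lambda pair: pair[1])[0]
```

With hypotheses such as [("hold", 0.6), ("bold", 0.5)], the editor
state would select "bold" even though the engine scored "hold" higher,
because "hold" is not something the application can act on.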
Furthermore, once users can start talking to an application, they start
expecting more than just "speech buttons." For example, one might want
to be able to say "change the current selection to 12 point bold
helvetica." This involves several UI operations. While this
could perhaps be done by injecting a sequence of well-known keyboard
events, direct semantic access via something such as the AT-SPI might
be a better way to go.
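
To show what "several UI operations" might look like, here is a small
sketch that splits one spoken formatting request into separate
semantic operations. The operation names, the style and font
vocabularies, and the parsing approach are all assumptions made for
this example; a real assistive technology would then carry out each
operation through an interface such as the AT-SPI.

```python
import re

# Hypothetical vocabularies for the example.
KNOWN_STYLES = {"bold", "italic"}
KNOWN_FONTS = {"helvetica", "courier", "times"}


def parse_format_command(utterance):
    """Split a spoken formatting request into (operation, value) pairs."""
    text = utterance.lower().rstrip(".")
    ops = []
    # A size is spoken as "<number> point".
    size = re.search(r"(\d+) point", text)
    if size:
        ops.append(("set_font_size", int(size.group(1))))
    # Styles and font families are recognized word by word.
    for word in text.split():
        if word in KNOWN_STYLES:
            ops.append(("set_style", word))
        elif word in KNOWN_FONTS:
            ops.append(("set_font_family", word))
    return ops
```

For the utterance above, this yields three operations: a font size of
12, the bold style, and the helvetica font family.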
In any case, it sounds like you are getting pretty interested in this
space, and I'd be excited to hear more about your progress!
Will
_______________________________________________
Gnome-accessibility-devel mailing list
Gnome-accessibility-devel <at> gnome.org
http://mail.gnome.org/mailman/listinfo/gnome-accessibility-devel