January 4, 2005

Early progress in speech recognition

Since my area of research here at the U is natural language processing (NLP), that is the part of the project I am best equipped for, and thus the part where I expect the most early progress. In fact, I have already made some preliminary progress. I'm going to be using the Sphinx system developed at Carnegie Mellon. Its biggest drawback, as far as I can tell, is that it uses an HMM trigram language model as opposed to a structural language model. However, since the speech commands I intend to use are probably not even grammatical sentences (e.g. "Window up", "Volume down"), and are generally fairly short (the longest is probably something like "Rear left window down"), a trigram model probably captures most of the information needed.

The example commands given above are fairly simple, and this is clearly the best way to start. If I try to get the system to understand "Computer, could you be a dear and roll down the rear starboard window halfway, please," I would probably not meet with early success. While the accuracy rate of each word in Sphinx is fairly high, over a complete utterance the accuracy is roughly (word_acc^N), where word_acc is the per-word accuracy and N is the number of words in the utterance (this treats the errors on each word as independent). Therefore, for a per-word accuracy of e.g. 90% and a mere 5 words, accuracy of utterance recognition dips to 59%, and after 10 words it drops to 35%. By limiting the vocabulary, though, I can increase the single-word recognition accuracy, and by keeping the commands as short "stock" phrases I can increase the total utterance recognition accuracy.
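To make the arithmetic concrete, here is a quick Python sketch of that formula. It assumes word errors are independent, which is a simplification, and the function name is just something I made up for illustration.

```python
def utterance_accuracy(word_acc: float, n_words: int) -> float:
    """Chance that every word in an utterance is recognized correctly,
    assuming each word is an independent trial with accuracy word_acc."""
    return word_acc ** n_words


# Per-word accuracy of 90%, as in the example above.
for n in (1, 2, 5, 10):
    print(f"{n:2d} words: {utterance_accuracy(0.90, n):.0%}")
# Prints 90%, 81%, 59%, and 35% respectively.
```

The same function shows why short stock phrases help: every word removed from a command takes one factor of word_acc out of the product.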

I have started in this direction a little bit already. I used this corpus as a starting point for commands to be recognized. Using the CMU online language modeling tool, I uploaded the above-mentioned corpus file to obtain this language model file.
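For anyone who wants to build a similar corpus from scratch, a rough Python sketch follows. It is not my actual corpus file: the window positions are hypothetical placeholders, and it assumes the tool accepts plain text with one command per line.

```python
from itertools import product

# Hypothetical vocabulary, loosely based on the example commands above.
POSITIONS = ["", "front left", "front right", "rear left", "rear right"]
DIRECTIONS = ["up", "down"]


def build_corpus():
    """Return one stock command per entry, e.g. 'rear left window down'."""
    lines = [
        # split/join drops the leading space when the position is empty
        " ".join(f"{pos} window {direction}".split())
        for pos, direction in product(POSITIONS, DIRECTIONS)
    ]
    lines += [f"volume {direction}" for direction in DIRECTIONS]
    return lines


corpus = build_corpus()
print("\n".join(corpus))  # one command per line, ready to save to a text file
```

Generating the phrases from small word lists keeps the vocabulary tight, which is the whole point of the stock-phrase approach.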

If you would like to try this on your own computer, download the files above. Then install sphinx-2 on your computer. There will be a sample language model set up in "/usr/local/share/sphinx2/lm/turtle/". I'm not sure of the exact directory; I'm doing this from memory. The turtle directory contains a sample language model that is suitable for controlling a turtle graphics system. Since my environment is quite different, I created a new directory called "/usr/local/share/sphinx2/lm/car/" and unpacked the above tarball there. Now, there is a program installed with sphinx2 called sphinx2-test, located in "/usr/local/bin". This is a script which loads the turtle language model and calls sphinx-continuous, which is a program that just waits for speech input and outputs the most likely utterance. Now make a copy of sphinx2-test called sphinx2-car and edit the script to point to your car language model instead of the turtle model - it's only like two lines. Whew! That sounds like a lot of work, but if you're familiar with *nix at all it should be fairly trivial.

I did all this on my computer, and ran the sphinx2-car program. So far, it hasn't missed a beat, as long as I "stick to the script." I tried it the morning after I had set it up, and forgot that the language model doesn't know the word "radio" or "stereo," just the words "volume", "up", and "down," so I obviously wasn't met with much success. The next thing I tried was playing loud music near the microphone to see if that bothered it. Obviously if I'm using this system to control my stereo there will be background music, and I hope that isn't a deal-breaker. The first experiment was The Shins - this worked fine - I think the singing is too high-pitched to be interpreted as a voice, since the system was probably mostly trained using men. The next experiment was Johnny Cash. This is a much tougher test, because not only does he have a nice deep voice, but much of what constitutes "singing" for the Man in Black is indistinguishable from talking. The system did interpret some of this music as speech, but it didn't assign it any words that it knew. So, as long as there aren't any Johnny Cash songs where he yells out "Roll the window up!" or "Smash into the car in front of you!" I should be safe.

So, the simplest part of the project is effectively completed. The remaining work to be done in speech recognition is the following:

  • Port it from my desktop into my car
  • There are a couple of different approaches I could take here. The first is just using my laptop with a microphone. This is the cheapest idea, since I already own both. Another idea is to buy one of those $500 laptops running Linux from Walmart. I hate Walmart, but that is a damn good price, and then I could have a dedicated car computer which I could possibly extend with other software. The final option is a custom computer made from something like mini PCI. While this is the cleanest solution, there is a good chance it would cost at least $500 in parts, plus a lot of labor as I attempt to build the system. For now the best choice is to run it on my laptop, and if that works I'll look into getting a dedicated system from Walmart.
  • Integrate environmental data
  • This is the current research focus in the NLP lab, where the work is being done on a mobile robot. For instance, the commands "Start turning right" and "Stop turning right" are easily confused because "start" is phonetically somewhat close to "stop." However, if the robot is already turning right, it's highly unlikely that you would command it to start turning right again, so the command is disambiguated to "stop turning right." This sort of system could be useful in a car as well, especially considering the background noise from the engine and the music. So, environmental data might include current stereo volume, current stereo input source, fade and balance, car temperature, current window status, direction of audio input, etc. These sorts of things are proprioceptive sensors, just like the muscle spindles in your biceps that tell you where your arms are even if your eyes are closed.
  • More complex language model
  • As I mentioned above, the first priority is getting stock commands to work. After all, this is not part of some far-flung research program, but a practical system that I want to actually benefit from ASAP. With that said, some of my current interest is in word-learning systems. A nifty feature would be a system that could understand a general grammar and then extend meaning to novel commands like "Crank it up, Rosie!"
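
The start/stop example from the environmental data item can be sketched in a few lines of Python. This is only an illustration of the idea, not the lab's method and not anything Sphinx provides: the hypothesis scores, the state dictionary, and the 0.05 penalty are all made-up assumptions.

```python
def rerank(hypotheses, state):
    """Pick the most plausible command given the current state.

    hypotheses: list of (command, recognizer_score) pairs (hypothetical scores)
    state: dict of sensor readings, e.g. {"turning_right": True}
    """
    def plausibility(command):
        turning = state.get("turning_right", False)
        # A command that would change nothing is unlikely to be intended.
        if command == "start turning right" and turning:
            return 0.05
        if command == "stop turning right" and not turning:
            return 0.05
        return 1.0

    best = max(hypotheses, key=lambda h: h[1] * plausibility(h[0]))
    return best[0]


# The recognizer slightly prefers "start", but the robot is already turning.
hyps = [("start turning right", 0.55), ("stop turning right", 0.45)]
print(rerank(hyps, {"turning_right": True}))   # stop turning right
print(rerank(hyps, {"turning_right": False}))  # start turning right
```

In the car, the state dictionary would hold things like window position or stereo volume instead, so "window up" could be discounted when the window is already closed.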
Posted by mill1991 at January 4, 2005 1:44 PM
Comments

Hi! Nice work going on there :)

You forgot to add the most important command to the corpus!!! "Smash into the car in front of you!" ;) kidding.

I'm working on a small project to control my old computer to run as a voice controlled media player / jukebox.

Anyway, good luck :) I'll probably bother you with questions if/when I run into a wall :P

Regards,
Harshad.

Posted by: Harshad Sharma at April 25, 2005 9:09 AM

cool... made any progress since the last entry ? just a thought but rudimentary speech recognition systems already exist in some high end models (mercedes, bmw...). have you considered examining these as part of your research ?

Posted by: JC at August 3, 2006 12:43 AM