Tuesday, January 22, 2008

lip reading

Some researchers at the University of East Anglia are hoping to create a 'lip reading' machine -- in essence, a computer which uses only video of a person's face, as she is talking, to decipher what is being said.

First off - lip reading is, at best, unreliable. This is because the vast majority of contrastive sounds are not visible (or are barely visible) at the lips or the front of the oral cavity. For example: to the average speaker of American English, the words "do" and "to/two/too" look almost identical when pronounced, because the major difference between them is whether the vocal folds are vibrating during the first consonant -- and vocal fold vibration happens out of sight, in the larynx. Meanwhile, the way the lips are shaped during these words varies a lot, since it depends on the words before and after them. That is largely because they are short, grammatical function words, whose pronunciation adapts readily to context.
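
To make this concrete, here is a toy Python sketch. The phoneme-to-viseme mapping is invented for illustration (it is not a standard inventory, and it is certainly not what the researchers use); the point is only that once voicing is stripped away, "do" and "to" collapse into the same visual sequence:

# Toy illustration: phonemes that differ only in voicing collapse into
# the same visual class (a "viseme"). This mapping is invented for the
# example, not a standard inventory.
VISEME = {
    "d": "alveolar stop",   # voiced: vocal folds vibrate, invisibly
    "t": "alveolar stop",   # voiceless: looks the same at the lips
    "u": "rounded vowel",   # the vowel of "do"/"to": rounded lips
}

def visible_sequence(phonemes):
    # What a camera could, in principle, recover from the face.
    return [VISEME[p] for p in phonemes]

print(visible_sequence(["d", "u"]))  # ['alveolar stop', 'rounded vowel']
print(visible_sequence(["t", "u"]))  # ['alveolar stop', 'rounded vowel']
print(visible_sequence(["d", "u"]) == visible_sequence(["t", "u"]))  # True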

Furthermore - even the small visible cues you might expect to rely on vary drastically from speaker to speaker. It is surprisingly common, for example, for young speakers of American English in certain regions to pronounce some 'l' sounds with the tongue placed prominently between the teeth. A cue like that could actually help a lip reader, but because it does not apply to all speakers, it is not something a computer can rely on in general.

There are many more reasons why lip reading is difficult, and why no one can lip read as reliably as a hearing individual can hear. Judging by a paper linked at the bottom of this article, the researchers seem to have some understanding of these problems. However, they still think they can overcome them.

Presumably, the researchers are motivated by the fact that a few people have become quite good at lip reading. But here is the real kicker - the way those people get good is by using cues from syntax and semantics to make educated guesses about what the words might be.

For example, the following two sentences (when I pronounce them) look identical at the lips:
I tread the path to success
I dread the path to success

If I'm having a conversation with an expert lip reader, she will use many non-lip cues to figure out which of these sentences I've said: cues like what kind of person I am, or what my previous sentence meant, or what our conversation has been about.

So, for example, if I had said "I want to be successful but I [d/t]read the path to success", then one of those words is going to make a lot more sense than the other. As far as I know, no English parsing software exists which can reliably use such semantic cues to help make sense of a sentence.
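
To sketch what such a system would even need, here is a toy Python example. The scores are invented (hypothetical counts, not real corpus data), and this is nothing like a real parser; it just shows the shape of the problem -- picking between visually identical candidates by asking which one the context makes more likely:

# Toy illustration of context-based disambiguation. The scores are
# invented; a real system would need genuine corpus statistics and far
# richer semantic modeling than this.
CONTEXT_SCORE = {
    # Plausibility after "...but I ___ the path to success".
    "dread": 9,   # fits the contrast set up by "but"
    "tread": 3,   # grammatical, but less expected in this context
}

def disambiguate(candidates, scores):
    # Pick whichever visually identical candidate the context favors.
    return max(candidates, key=lambda w: scores.get(w, 0))

print(disambiguate(["tread", "dread"], CONTEXT_SCORE))  # 'dread'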

Good luck, British researchers...
