Kinect Voice Recognition and Motion Capture Musical Instruments

By Peter Drescher
March 8, 2011

At GDC 2011, I attended a session about speech recognition for Kinect, titled "Xbox, Listen" by Scott Selfon, who (as always) delivered a detailed, focused, and informative presentation on the technical aspects of talking to a computer ... that understands what you say.

Mr. Selfon discussed the algorithms required to match words and phrases with the voice recorded by the device. There's an array of mics, so the device can determine where you are by measuring phase differences between them. Audio produced by the Xbox is subtracted from the recording, like noise-cancelling headphones, leaving only clean vocalizations to analyze.
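To make the "where you are" part concrete, here's a minimal sketch of how a direction can be estimated from the arrival-time difference between just two mics. This is textbook geometry, not Kinect's actual algorithm: the function name, the far-field assumption (the speaker is much farther away than the mic spacing), and the numbers are all mine.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second in room-temperature air (assumed)

def arrival_angle(delay_s, mic_spacing_m):
    """Estimate the direction of a sound source from a two-mic delay.

    delay_s is how much later the sound reaches the second mic than the
    first. Returns degrees off the straight-ahead axis; 0 means the
    source is directly in front of the pair.
    """
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    # Measurement noise can push the ratio slightly past +/-1, so clamp it.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

A delay of zero means you're dead center; the largest possible delay (mic spacing divided by the speed of sound) means you're standing off to one side, in line with the mics.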

The part that caught my attention most was the "probability threshold". The result of the vocal analysis contains an estimate of how well the computer thinks it has correctly identified what the user said. The programmer sets the level below which, instead of processing a command, the system responds with "uh, sorry, say what?"
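In code, the idea might look something like this. It's a sketch only: `RecognitionResult`, `handle_utterance`, and the 0.6 threshold are hypothetical names and values of my own, not the Kinect speech API.

```python
from dataclasses import dataclass

@dataclass
class RecognitionResult:
    phrase: str        # what the recognizer thinks was said
    confidence: float  # its own estimate of being right, 0.0 to 1.0

def handle_utterance(result, threshold=0.6):
    """Act on a recognized phrase only if the recognizer is confident enough."""
    if result.confidence < threshold:
        # Below the programmer-chosen level: ask again instead of guessing.
        return "uh, sorry, say what?"
    return f"OK: {result.phrase}"
```

Set the threshold too high and the system keeps asking you to repeat yourself; too low and it acts on things you never said.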

There's another computer that rates itself like that -- Watson. An important winning strategy for Jeopardy is "don't guess", and obviously, it worked. Better to keep one's mouth shut and be thought a fool, than open it and lose a thousand bucks.

So Watson rates itself on how correct it thinks its answer is, and lets Ken Jennings take one, when the percentage is below a certain threshold. And so I'm thinking, you hook Watson up to a Kinect voice recognition system, and you've got a computer that can recognize and understand ANY set of words, not just the 100 commands used by video games.

That will be a significant step closer towards artificial intelligence as idealized by the Star Trek computer. I am hereby starting a petition to name such a system "Majel" (best. computer voice. ever.)

But that's not what I wanted to talk about.

Delayed Response

What I wanted to talk about is latency: the elapsed time from when you hit the key, to when the sound comes out. Let me put it this way - zero latency is best; the bigger the number, the more it suckz!

Drums are zero latency. The hit and the sound are simultaneous (the hit makes the sound). Drumming is an essential human activity, and our brains are wired for rhythm. One might even define Man as "the animal that keeps time".

But not all instruments are zero latency, starting with the granddaddy of them all, the pipe organ. The first time I played one as a kid, I was astonished by how you hit the key ... and THEN you heard the sound. It takes a while for those towering columns of air to get vibrating, so the delay is a fundamental part of performance.

That's not the way a Steinway works. The complex mechanism that allows me to hit strings with felt hammers gives me complete and instantaneous control over the vibrations produced by the harp. It's as if I can feel the strings, even though I never touch them ... and that's why I can play so fast.

Latency is built into electronic instruments, and always will be. In the early days of MIDI synthesizers, sync problems would sometimes cause horrible delays. This made recording difficult, and live improvised performance almost impossible.

Modern digital instruments have very low latency, below the threshold where you can tell. I wonder if this is because we have our own built-in latency, caused by the time it takes the "press key" impulse to travel down the nerves, added to the time needed by the brain to process the sound coming in through the ears.

Still, there's a point where it's simply "close enough" ... and Kinect video ain't there yet.

Music Applications

The elapsed time between moving your hand, and the Kinect skeleton following, is significantly greater than what organists put up with. It's good enough for "move your body" and "pet the kitty" games, but even pugilistic punching seems to push the limits of usability. In any case, the video latency of the current version is simply too great to make a playable musical instrument.

Ironically, rhythm games work well, because the computer can take the latency into account, scoring your moves based on the music tempo (plus the delay). But it's one thing to dance to music while the computer watches, and another to perform music using the computer as an instrument. Personally, I'm a piano player, I don't dance, don't ask me ...
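Here's roughly what "taking the latency into account" could look like. This is an illustrative sketch, not how any particular game does it; `score_move`, the 0.1 second timing window, and the latency figure in the test values are assumptions.

```python
def score_move(detected_time_s, beat_time_s, latency_s, window_s=0.1):
    """Return True if a move landed on the beat, after latency compensation.

    detected_time_s: when the system reported the move
    beat_time_s:     when the beat occurred in the music
    latency_s:       known delay between the real move and its detection
    window_s:        how far off the beat still counts as a hit
    """
    # The player actually moved earlier than the system noticed.
    actual_time_s = detected_time_s - latency_s
    return abs(actual_time_s - beat_time_s) <= window_s
```

This only works because the game knows in advance when the beat is supposed to fall. A live instrument has no such schedule: the sound has to come out when you move, so there is nothing to subtract the delay from.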

Fortunately, Kinect video latency will only decrease in subsequent versions, and eventually you'll see a new genre of software synthesizer emerge, one that blurs the line between music performance and dance.

Imagine a virtual playfield consisting of two discs floating at raised-hand level. Hit the left one, it plays the low "tumba" note of a conga drum. Hit the right one, you hear the high note. Now play "guaguanco".

Next, watch a video of someone playing this kind of motion capture percussion instrument (or mocaPerc™) ... with the sound turned off. You would think he was doing some new kind of dance, waving his arms rhythmically, creating complex patterns, punctuating the performance with artistic flair. It is dance, by any definition ... but it's also music performance.

Now add more hit points, particularly at kick level, and in three dimensions. Change the triggered sample presets from conga to, oh, I don't know, any sound you can imagine. Add gestural controls for timbre and sequencing. Start recording, and lay down some serious beats.
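The playfield described above can be sketched in a few lines. To be clear, this is a toy under my own assumptions: the `Disc` type, the coordinate convention (y is up, units in metres), and the disc positions are hypothetical, and a real version would read hand positions from the motion capture skeleton every frame.

```python
from dataclasses import dataclass

@dataclass
class Disc:
    name: str      # which sample to trigger
    x: float       # horizontal centre of the disc
    z: float       # distance of the centre from the sensor
    height: float  # y coordinate of the disc's plane
    radius: float

def triggered(discs, prev, curr):
    """Return the names of discs the hand struck between two frames.

    prev and curr are (x, y, z) hand positions from consecutive frames.
    A strike means the hand crossed the disc's plane moving downward
    and ended up within the disc's radius.
    """
    hits = []
    for disc in discs:
        crossed_down = prev[1] > disc.height >= curr[1]
        dx = curr[0] - disc.x
        dz = curr[2] - disc.z
        inside = dx * dx + dz * dz <= disc.radius ** 2
        if crossed_down and inside:
            hits.append(disc.name)
    return hits

# Two congas floating at raised-hand level (positions are made up).
congas = [
    Disc("tumba", x=-0.3, z=2.0, height=1.5, radius=0.15),
    Disc("high", x=0.3, z=2.0, height=1.5, radius=0.15),
]
```

Adding more hit points is just a longer list of discs, and swapping conga samples for any other sound only changes what gets played for each name.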

Of course, start, stop, and other parameters would be controlled by voice command, which seems natural. Music and voice recognition can even be considered related, because they both use sound to communicate (musician to audience, user to computer).

I look forward to the day when low-latency "music-from-dance" composition tools are available for motion capture systems. For now, the best way to interact with Kinect through audio is by using your voice. But I wonder ... does singing to Kinect help or hinder the speech recognition system? :)

   - pdx

