Prognosticating the Future of Mobile Audio

By Peter Drescher
November 27, 2011

The following is a transcript of my Fireside Chat (keynote address) presented at this year's Project BBQ, an Interactive Audio Think Tank held every October in Texas:

Seven years ago, at this very conference, I became the Annoying Audio guy. I spoke about the online catalog of downloadable products available for the T-Mobile Sidekick in a presentation called "Could Ringtones BE More Annoying!?" and for a slam-bang ending, I predicted a "convergent technology" device that would be a phone, a camera, an iPod, and a web browser -- two and a half years before the first iPhone was released.

While I did not have any inside information about Apple product development, I did have a few advantages in foreseeing that particular advancement in mobile technology. A number of Danger's core engineers had been hired away by Apple, so we suspected they had to be working on something cell phone related. Personally, I used to carry around a black custom Sidekick with a black iPod Nano stuck to the back with duct tape, to demonstrate the concept of a music phone. And now, in retrospect, the whole thing seems a bit obvious.

I have published some fairly successful audio industry predictions, including the death of MIDI and looping ringtones, and the rise of roll-your-own game music. I have also forecasted other products, such as ringtones for electric vehicles, and cloud-based game audio, that have yet to be developed. And of course, there are many aspects of mobile technology that I missed altogether, such as the dominance of touch screen interfaces, but hey, that's not exactly an audio issue, is it?
Fig. 12a: nanoHiptop (front); Fig. 12b: nanoHiptop (back)

Of course, one area of mobile technology that does not seem to be developing quite as I had imagined is the evolution of Bluetooth stereo headsets into earpods: wearable computing devices the size of earbuds, one for each ear, containing a speaker, a battery, a microphone, memory, and an internet connection. In 2007, I did two conference presentations and a BBQ session on the topic, and got a fair amount of press and positive feedback on the idea. Ironically, in the Annoying Audio blog post entitled "earpods: can you iHear me now?", based on the BBQ report, I wrote that the technology seemed to me to be so useful, powerful, and inevitable that, quote:

"I suspect only global catastrophic economic collapse can prevent it"

Son of a bitch! That is NOT what I was trying to predict!!

And so, people still wear wired earbuds plugged into smart phones for music, video, and game soundtrack listening, and earpod levels of miniaturization still seem several years away. And yet, some of the features I described four years ago are available today. Jawbone sells Bluetooth headsets that support noise cancellation for more intelligible phone conversations. Interphone headsets, used for communicating over short distances in small groups and usually installed in motorcycle helmets, are close to the local audio sharing network I envisioned. There are Silent Discos, clubs with no PA system, where all the dancers wear WiFi headphones. And the form factors for Bluetooth stereo headsets continue to get smaller and smaller.

So time, and the economy, will tell how close to reality my predictions actually turn out, but before I make some more outrageous audio industry forecasts, I wanted to talk a little about my thought process when developing these ideas.

First, I keep myself grounded by concentrating only on currently achievable technologies, and extrapolating from there. Sure, nanobots in your brain directly accessing your cochlear nerve, or force fields for generating tactile feedback and virtual speaker systems, would solve many audio problems, but let's keep it real. That doesn't mean I can't dream about data networks supporting transfer speeds that make WiFi seem like a dialup modem, or motion capture systems capable of tracking the twitch of an eyebrow, because those kinds of things are merely extensions of products available today, and not dependent on new physics.

The second thing I consider when making predictions is how patterns of development in the past may affect current technology in the future. My favorite pattern is the progression of data types as bandwidth bottlenecks expand. It goes "first text, then graphics, then audio, then video". The first time we saw that was "telegraph, telephone, television", but we've also seen it on the internet (first email, then web pages, then iTunes, then YouTube) and on mobile (first texting, then camera phones, then music phones, now video and gaming devices).

The basic pattern of "bandwidth increases, data transfer speeds get faster, memory becomes smaller and cheaper" may seem obvious, but to really understand it, it helps to be old:

I remember my first hand-held calculator, in high school in 1973: a Texas Instruments SR-10, with basic add, subtract, multiply, and divide functionality, that cost $100 and was the coolest thing I had ever seen.

I remember when the company I worked for in 1982 got a Wang VS 80 computer installed in the raised-floor, temperature-controlled machine room, and the programmers got all excited because it had a whole megabyte of RAM -- that's one Meg of memory.

I remember being in awe of a disk drive the size of a shoebox in 1992, because it housed a gigabyte of storage (prompting me to ask, "What's a gigabyte?").

The Android device in my pocket is a couple of orders of magnitude more powerful than my first ProTools rig, and the terabyte drive I use to back up my studio is the size of two packs of cigarettes, and costs the same as that calculator I had in high school.

Observing computing devices get smaller, faster, and cheaper for almost 40 years really drives home Moore's Law, and generates an expectation that the trend will continue.

The third and final thing I think about when making predictions is related to the design process we used at Danger, which can be summarized by the question "How should it work?" Whenever we had a user interface issue, or a use case we hadn't come across before, we'd always ask "What do we think it should do? What's the right way to solve this problem? Not necessarily the standard way, or a new way, or a different way, but the right way" ...

We never asked focus groups, or did user research, or discussed what might be trendy or cool. Instead, we thought about how we, as the primary users, would want the device to respond, and built it that way. This was remarkably different from the process used at some other companies I've worked for, where, whenever there was a design problem, the program managers would invariably ask, "Well, what does the iPhone do?"

Danger's blueprint for innovation was to make the device as simple and intuitive to use as possible, and we were quite successful doing this with the T-Mobile Sidekick. As the device evolved, control of some features did become buried under layers of hierarchical menus, but in general, use of the device was so easy, even Paris Hilton could do it. More importantly, this concept of "how things should work" is useful when thinking about future technological development.

One last point about Danger: prominently posted on the wall outside the office of one of the founders was the Alan Kay quote, "The best way to predict the future is to invent it" ... and wow, looking back, that's exactly what we did. Sidekicks derived a lot of their power from maintaining the data on the Danger servers, which also formatted information from the Internet for efficient download to "thin client" mobile devices. We called it "the Danger service", and renting it out to carriers is how we made our money.

Today, that kind of system is called "The Cloud", and is the way all data will be stored and accessed in the future. In many ways, it's a technology throwback to the days of mainframe computers being accessed by "dumb terminals" via physical cables and command line interfaces.


The Cloud can be thought of as a network of mainframes, with smart phones acting as wireless terminals. No matter how powerful pocket-sized devices become, they will never be able to match the capabilities of millions of servers in raised-floor, temperature-controlled machine rooms, connected together in a planet-wide network.

In fact, that's kind of the point. The ability to access the vast data resources of every computer on the Internet, at any time, from anywhere, is the utopian dream of a Cloud-covered world. The system of Web-enabled smart phones we have now is the Space Invaders equivalent of what is to come, provided of course that there's still an advanced civilization around to build it. Given the pathetic state of science education and economic policy in the US these days, I wouldn't be surprised if the native language of the Cloud turns out to be Chinese.

But, be that as it may, once you have ultra wideband network connectivity equivalent to today's power grid, basically rock solid anywhere and everywhere you go, four bars 24/7 taken for granted, it profoundly changes the way your data will be stored. The device in your pocket can be lost, stolen, or destroyed, but when all your information lives in the Cloud anyway, that's not really a problem.

This means that there's no reason to carry around gigabytes of music files anymore. Every piece of music ever recorded will be available in the Cloud, and you'll simply access audio streams via music subscription services. There's no downloading, and no piracy. Artist fees are paid automatically based on how many times a song is listened to, and your "record collection" is merely a set of pointers to files on Cloud servers. We are already seeing this kind of technology with services like Rhapsody and Spotify.
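To make the "set of pointers" idea concrete, here is a minimal sketch in Python -- with entirely made-up names like CloudCollection and the example stream URL -- of a record collection that stores references instead of files, and counts plays so that per-listen artist fees could be settled later:

    # Hypothetical sketch: a "record collection" as pointers into the Cloud,
    # with a play count that a per-listen royalty system could settle against.
    from collections import Counter

    class CloudCollection:
        def __init__(self):
            self.tracks = {}              # track_id -> stream location on a Cloud server
            self.play_counts = Counter()  # basis for per-listen artist fees

        def add(self, track_id, stream_url):
            # Nothing is downloaded; we only keep a pointer to the server-side file.
            self.tracks[track_id] = stream_url

        def play(self, track_id):
            self.play_counts[track_id] += 1
            return self.tracks[track_id]  # hand the URL to the streaming player

    collection = CloudCollection()
    collection.add("tjader-soul-sauce", "https://cloud.example.com/streams/12345")
    print(collection.play("tjader-soul-sauce"))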

The Cloud will also have a profound influence on mobile gaming. Already, OnLive turns your PC into a dumb terminal, by doing all the game processing on their servers. They send out a broadband audio/video stream, and receive gameplay commands from your controller. Now imagine that kind of system without the Comcast cable connection.

What kind of games could you play if you had that kind of access on the train, at work, or on the beach? What kind of mobile multiplayer experience could be developed when the players can be literally anywhere on the planet, not just tethered to their home PCs?

Like Cloud-based music services, the audio for ultra wideband mobile games would just be a stream of data. You tap your screen, the missiles fire, you hear the explosion. Your local device does nothing more than send gameplay commands and GPS data up to the server, which does all the audio mixing and graphics processing, and then transmits the stream back down to your device. Simple, given that level of bandwidth ... or is it?
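As a rough sketch of that thin-client split -- every object here (connection, touchscreen, gps, display, speaker) is hypothetical -- the device's entire job reduces to a loop like this:

    # Hypothetical thin-client loop for a Cloud-hosted mobile game: the device
    # only uplinks input and GPS, and plays back the audio/video the server renders.
    import time

    def client_loop(connection, touchscreen, gps, display, speaker):
        while True:
            # 1. Uplink: gameplay commands plus current position.
            connection.send({
                "taps": touchscreen.read_events(),
                "location": gps.current_fix(),
            })
            # 2. Downlink: pre-mixed audio and pre-rendered video from the server.
            frame = connection.receive()
            display.show(frame["video"])
            speaker.play(frame["audio"])
            time.sleep(1 / 60)  # pace the loop at roughly 60 frames per second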

Because now we start running up against the laws of physics, specifically the speed of light. Some amount of lag is going to be inevitable, based on the time it takes for the signal to travel from your device to the server and back over a wireless, possibly satellite, connection. There is a certain threshold below which you cease to notice the lag, somewhere around 20 milliseconds, but will ultra wideband mobile networks be able to approach that speed of transmission and reception? Time will tell ...
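A back-of-the-envelope calculation shows where the trouble is: light covers roughly 300 kilometers per millisecond, so round-trip propagation alone eats into that 20 millisecond budget long before any server processing or radio overhead is counted (the distances below are illustrative):

    # Back-of-envelope: propagation delay only, ignoring routing and processing.
    SPEED_OF_LIGHT_KM_PER_MS = 300.0  # roughly 300,000 km per second

    def round_trip_ms(distance_km):
        return 2 * distance_km / SPEED_OF_LIGHT_KM_PER_MS

    print(round_trip_ms(100))    # nearby data center: ~0.7 ms, well under 20 ms
    print(round_trip_ms(2000))   # cross-country server: ~13 ms, getting close
    print(round_trip_ms(35786))  # geostationary satellite: ~239 ms, hopeless for twitch gameplay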

There's another physics-based stumbling block that needs to be addressed before these kinds of mobile devices can become a reality: battery power. Right now, batteries are big, and clumsy, and don't hold a lot of juice, so they constantly need to be recharged. The technology for even the most advanced lithium ion polymer battery is still based on electro-chemical principles from the last century. I keep hearing about nanowire supercapacitors, portable hydrogen fuel cells, and other futuristic electrical generation and storage concepts, but so far, you're lucky if your cell phone power lasts all day during normal use. We are overdue for a breakthrough in battery technology.

When/if that happens, the new batteries will power devices that do everything your most advanced cell phones do today, only faster, cheaper, and smaller. In fact, if mobile devices follow the same pattern as computers (and I suspect only global catastrophic economic collapse can prevent it ... d'oh!), they will become so small and lightweight that you will no longer carry them around, you'll simply wear them, like glasses or jewelry or Star Trek communicator pins, and control them via voice command.

Already, Google voice transcription is astonishingly good, approaching Star Trek levels of accuracy. It manages this by sending a recording of your voice up to the server, and comparing it to a database of a bazillion other voice recordings. It then sends back its best guess of what you said, and more often than not, it's right on the money ... and the more people use it, the better it gets. (Author's note: Siri does something similar, and does it remarkably well.)
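The round trip itself is simple enough to sketch; everything below (the endpoint URL and the "best_guess" field) is invented for illustration, standing in for whichever service actually does the matching:

    # Hypothetical client side of server-based speech recognition:
    # record locally, send the audio up, get the best-guess text back.
    import json
    import urllib.request

    def transcribe(audio_bytes, endpoint="https://speech.example.com/recognize"):
        request = urllib.request.Request(
            endpoint, data=audio_bytes, headers={"Content-Type": "audio/wav"})
        with urllib.request.urlopen(request) as response:
            result = json.load(response)
        return result["best_guess"]  # the server's match against its corpus of recordings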

Now connect that kind of transcription service to Watson, the IBM computer that recently beat the crap outa the two best Jeopardy players ever, and you've got a system that can understand anything you say to it. Connect that to the Cloud, and you can access basically any information you can describe. You'll be able to say, "Computer, today I feel like listening to some jazz, nothing too frenetic, maybe with a little latin flavor" and the computer will respond "Now playing: Cal Tjader". I am starting a petition to name this kind of system "Majel", after Majel Barrett, the voice of every computer on Star Trek ... and given the wide variety of high-quality voice-over recordings she made over the years, I want the computer's text-to-speech synthesizer to use her phonemes and inflection.

As machine listening and comprehension continues to develop, so will computer vision. Kinect is a first step in this direction: a consumer level camera and motion capture system that lets the game watch you, and respond accordingly. Of course, Wii-mote controllers perform a similar function using accelerometers, but the ability to track the motion of a player's entire body has implications for creating that elusive "immersive environment" game developers are always talking about.

If the user's position can be tracked in 3D space, so that the computer knows where you are, and where you are looking, then the user's physical surroundings can be overlaid with computer-generated images. This is called "augmented reality", and already there are rudimentary apps that use it. My favorite is Google Sky Map: hold your phone up to the night sky and look through the screen to see star names and planetary positions. Point the phone down, and look through the entire planet to see constellations on the other side of the world.
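Under the hood, augmented reality is mostly the math of knowing where you are and which way you're pointed. Here's a toy sketch -- a bare pinhole-camera projection with made-up numbers, ignoring tilt and lens distortion -- of how a known world position (a star, a castle) becomes a pixel on your screen:

    # Toy augmented-reality overlay: given the device's position and compass
    # heading, project a known world point onto the screen.
    import math

    def project(world_point, cam_pos, cam_yaw_deg,
                focal_px=800, screen_w=1280, screen_h=720):
        # Translate into camera-relative coordinates (x right, y up, z forward).
        dx = world_point[0] - cam_pos[0]
        dy = world_point[1] - cam_pos[1]
        dz = world_point[2] - cam_pos[2]
        # Rotate around the vertical axis by the device's heading.
        yaw = math.radians(cam_yaw_deg)
        x = dx * math.cos(yaw) + dz * math.sin(yaw)
        z = -dx * math.sin(yaw) + dz * math.cos(yaw)
        if z <= 0:
            return None  # behind the viewer: draw nothing
        # Pinhole projection into pixel coordinates.
        u = screen_w / 2 + focal_px * x / z
        v = screen_h / 2 - focal_px * dy / z
        return (u, v)

    # A point 100 m ahead, slightly up and to the right, lands near screen center.
    print(project(world_point=(10, 5, 100), cam_pos=(0, 0, 0), cam_yaw_deg=0))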

Now combine augmented reality with wearable computing, Cloud-based gaming, and motion tracking systems to build a World of Warcraft theme park. You put on your AR-goggles, with built-in earpods, and enter a warehouse, or an outdoor reservation. The computer tracks your movements, and displays a castle over there, and a dragon over there. Other players in the field look like their avatars through your goggles, and the monster's roar always emanates from the same direction using head-related transfer functions. When you schwing your sword, it makes ambisonic whooshing noises, like they always do in the movies. If your character is a giant, you hear the booming thud of your footsteps when you run. And of course, magic wands make magical sounds, positioned in 3D space, so you can throw spells at your enemies, or dodge them when they come your way.
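The audio half of that scene hinges on those head-related transfer functions: the engine keeps recomputing where the roar sits relative to your head, then picks the matching left/right HRTF filters. A minimal sketch of just the direction calculation (the filter lookup itself is left as a comment):

    # Sketch: keep a sound source (the dragon) anchored in the world by
    # recomputing its azimuth relative to the listener's head every frame.
    import math

    def azimuth_deg(source_pos, head_pos, head_yaw_deg):
        dx = source_pos[0] - head_pos[0]
        dz = source_pos[1] - head_pos[1]
        world_angle = math.degrees(math.atan2(dx, dz))  # 0 degrees = due "north"
        # Wrap into -180..180 relative to the direction the head is facing.
        return (world_angle - head_yaw_deg + 180) % 360 - 180

    # Turn your head 90 degrees to the right and the roar swings around to your left.
    print(azimuth_deg(source_pos=(0, 10), head_pos=(0, 0), head_yaw_deg=0))   # 0.0, dead ahead
    print(azimuth_deg(source_pos=(0, 10), head_pos=(0, 0), head_yaw_deg=90))  # -90.0, hard left
    # A real engine would use this angle (plus elevation and distance) to pick HRTF filters.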

Now that sounds like a fun gaming experience, but there's something else this kind of technology can do that is even more interesting to audio folks -- motion capture musical instruments.

One of the things musicians are always complaining about is the clumsiness of digital interfaces. You can't feel the detent when adjusting an onscreen fader with a mouse, you can't bend a note with a pitch wheel with the same level of detail and nuance that you can with a cello, and no synthesizer keyboard will ever be as sensitive and expressive as a Steinway action. Sure, there are lots of things synthesizers and plugins can do that no physical instrument can, but the kind of muscle memory and years of practice required for virtuoso live performance is largely lacking in the digital world.

This issue can be addressed in new ways by new technologies. A system that could track every subtle gesture, every foot movement, every facial expression, could be used to generate notes, harmonies, rhythms, and effects, with a degree of control unavailable to other instruments, in fact, currently only available to dancers.

Now, I'm not talking about playing an acoustic instrument in pantomime, but rather using real time motion capture data to control audio synthesis and effect parameters of a sophisticated audio engine, capable of producing musical sequences based on movement. Rather than dancing to music, this would be music generated by dance, and would require new skills, and ten thousand hours of practice, to master at expert levels.
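To be concrete about what "motion capture data controlling synthesis parameters" might mean, here's a sketch of the mapping layer such an instrument would need. The sensor fields and the PrintSynth stand-in are hypothetical; the point is that height, speed, and hand shape land on pitch, loudness, and timbre.

    # Hypothetical mapping layer: one frame of motion-capture data in,
    # one set of synthesis parameters out.
    def map_motion_to_synth(frame, synth):
        # Hand height (0.0 floor .. 1.0 overhead) sweeps pitch across two octaves.
        synth.set_param("pitch_semitones", 24 * frame["right_hand_height"])
        # Speed of movement drives loudness, like bow pressure on a cello.
        synth.set_param("amplitude", min(1.0, frame["right_hand_speed"] / 2.0))
        # How open the left hand is opens and closes a low-pass filter.
        synth.set_param("filter_cutoff_hz", 200 + 8000 * frame["left_hand_openness"])
        # A stomp detected by a foot sensor triggers a percussive hit.
        if frame["left_foot_impact"]:
            synth.trigger("kick")

    class PrintSynth:  # stand-in for a real audio engine
        def set_param(self, name, value):
            print(name, "=", round(value, 2))
        def trigger(self, name):
            print("trigger", name)

    map_motion_to_synth({"right_hand_height": 0.75, "right_hand_speed": 1.0,
                         "left_hand_openness": 0.5, "left_foot_impact": True},
                        PrintSynth())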

Fig.: Motion capture sensors send data to an audio engine running on a Cloud server, which translates the movements into music, and transmits it to an audience wearing augmented reality goggles with built-in earpods.

Let's take that one step further, and use it to put the band back together. The stage is covered by multiple motion capture cameras, or possibly radio frequency locators, and the audience wears AR-goggle/earpod headsets. The dancer-musicians might wear control surfaces on various parts of their bodies for specific functions, such as gloves that play notes or chords when various combinations of fingertips are touched, or shoes with tap-style sensors on the soles for creating complex rhythms.

One dancer-musician might generate beats, while others could create melodies and harmonic accompaniments. Song lyrics could be programmed to produce video images in the audience's goggles when the computer recognizes phrases or keywords. And of course, no speakers are required, because everything is streamed to everybody's headsets.
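The glove idea, for instance, boils down to a lookup table: which fingertips are touching selects which chord gets sent up to the Cloud audio engine. A sketch, with an arbitrary three-chord mapping expressed as MIDI note numbers:

    # Sketch: fingertip-combination gloves as a chord controller.
    CHORD_MAP = {
        frozenset(["thumb", "index"]):           [60, 64, 67],  # C major
        frozenset(["thumb", "middle"]):          [57, 60, 64],  # A minor
        frozenset(["thumb", "index", "middle"]): [65, 69, 72],  # F major
    }

    def glove_event(touching_fingertips):
        chord = CHORD_MAP.get(frozenset(touching_fingertips))
        if chord:
            return {"type": "chord", "notes": chord}  # streamed up to the Cloud engine
        return None

    print(glove_event(["thumb", "index"]))  # -> C major triad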

In fact, the audience need not even be in the same room as the dancer-musicians, who themselves might be situated in multiple venues. While watching a performance, you could place yourself anywhere you wanted, in the front row, on stage, backstage, or even up above floating in the rafters. The appropriate images would be displayed on your goggles, and the sound would be mixed and positioned accordingly.

Here's another idea: point the motion capture cameras at the dance floor and let the crowd's movements generate the music. Or point them at a flock of birds, or the ocean, or a tree on a windy day, and create music based on that kind of motion. Remember, the sounds produced by this instrument are limited only by the composer's imagination.

And so I'd like to present a challenge to the imagination of the BBQ Giant Brain: Design a motion capture musical instrument system, to be played by a group of dancers, that could not only produce a wide range of traditional musical styles, but also new digital audio art forms.

So, will this technology generate a new form of "rock band slash dance troupe", performing concerts like nothing you've ever seen or heard before, with a worldwide audience connected to the Cloud? I tell you what, let's build it and find out -- because the best way to predict the future of mobile audio ... is to invent it!

Thank you.

    - pdx

Special thanks to Linda Law, Theresa Avallone, and all of my BBQ Brothers and Sisters ...

