In Spike Jonze’s Her, before we ever meet OS One (Scarlett Johansson), we are introduced to the unnamed, non-sentient operating system that Joaquin Phoenix’s character uses.
Using the same piece of hardware, Phoenix issues voice commands to a near-future version of Siri. Less annoying and more useful, this interface is much more likely to become a reality in the near future than the more prominently featured OS One. We see Phoenix control his music, check his email, and hear recent news items, all through conversational commands. There’s no holding down the home button like Siri, or saying “Okay, Google.” The process is so smooth and effortless that it melts into the background of the film, like the rest of its technology.
It’s so close to how our voice-based UIs will behave in the next few years that the movie’s technology designer KK Barrett deserves some serious kudos. That said, there are a few interaction missteps that will leave us looking back at the movie and comparing its UI to Geocities or Windows 8.
“Email from, email from, email from…”
Before reading each new email message, we hear “Email from.” Anyone in the audience who regularly receives more than five emails a day is cringing. The context is clear: every new item is understood to be an email message. There is no reason to repeat it.
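The fix is simple: announce the medium once, then drop the redundant prefix for the rest of the batch. A minimal sketch (the function and its phrasing are illustrative, not from the film):

```python
def announce_emails(senders):
    """Build spoken lines for a batch of new emails, saying
    "Email from" only for the first item instead of every one.
    (Hypothetical helper; names and phrasing are illustrative.)"""
    lines = []
    for i, sender in enumerate(senders):
        prefix = "Email from " if i == 0 else "From "
        lines.append(prefix + sender)
    return lines

# announce_emails(["Amy", "Paul", "Charles"])
# → ["Email from Amy", "From Paul", "From Charles"]
```

Once the listener knows they are in an email-reading context, every extra “Email from” is pure overhead.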
There are no cues to communicate the state of the UI
The app never tells you when it’s done talking. Is there more? Is it thinking? Without some audio cue, there is no way of knowing. The same goes for the moments when Phoenix issues a command: sometimes we hear a pause, and sometimes we hear a confirmation tone. Inconsistencies aside, audio-based applications need to communicate state to the user the same way visual ones do today.
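The core of the fix is consistency: each interface state maps to exactly one short audio cue (an earcon), so the listener always knows whether the system is listening, thinking, or finished. A minimal sketch, where the state names and cue names are all assumptions for illustration:

```python
# Hypothetical, consistent mapping from voice-UI states to earcons.
# The specific sounds don't matter; what matters is that the same
# state always produces the same cue.
EARCONS = {
    "listening": "rising_chirp",
    "thinking": "soft_tick_loop",
    "done_speaking": "falling_chirp",
    "error": "double_buzz",
}

def cue_for(state):
    """Return the audio cue for a UI state, with a neutral fallback
    so unknown states never fail silently."""
    return EARCONS.get(state, "neutral_blip")
```

This is the audio equivalent of a spinner or a progress bar: a cheap, always-on signal of what the app is doing.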
Speech Rate
As we become accustomed to interacting with devices using text-to-speech, we are going to speed up the rate at which our apps speak to us. Over time, slower speech interfaces are going to feel like browsing the internet on a dial-up connection. Check out this video of someone who has used text-to-speech interfaces for years. It’s mind-boggling that someone can understand an application speaking that fast (~950 wpm).
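The difference is easy to quantify. Assuming a typical default TTS rate of roughly 150 words per minute (the exact default varies by engine), speaking time scales inversely with rate:

```python
DEFAULT_WPM = 150  # rough default TTS rate; varies by engine

def speaking_time(word_count, wpm=DEFAULT_WPM):
    """Seconds needed to speak word_count words at wpm words/minute."""
    return word_count / wpm * 60

# A 300-word email takes 120 s at ~150 wpm,
# but only about 19 s at the ~950 wpm rate mentioned above.
```

At expert listening speeds, an inbox that takes minutes to hear at the default rate collapses into seconds.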
These are pretty small critiques, but they’re important to think about as we move away from touch screens toward more wearable devices. The sooner we can move past the current interactions with Siri and Glass, the better.