With the advent of Amazon’s Alexa and Siri’s consistent capacity to take on more chores (and get more and more sassy), many are wondering: what’s next for natural language understanding and conversational voice interfaces?

There are several companies neck-and-neck in this race. There’s Wit.ai, the company Facebook acquired; you can toy around with its demo. (Try this command: “I want to watch cats.”) Apple has its HomeKit and, with it, is doing what Apple does best: kicking ass. Amazon’s also out front with its Alexa-equipped Echo and Echo Dot.

One company hot on the trail of natural language understanding is MindMeld. MindMeld provides its natural language understanding capabilities to other companies that are looking to add intelligent voice interfaces to their products, services, or devices. The San Francisco–based company gives partners the infrastructure and customization options such that their devices can have their own, fine-tuned personal assistants. MindMeld recently announced such a partnership with Spotify, but is also working with automotive companies, defense agencies, e-commerce companies, and more. (And, naturally, it’s unable to share many specifics of such partnerships.)

Inverse spoke with MindMeld’s Sam Vasisht about the state of the voice recognition field — but he was quick to point out that “voice recognition,” as an enterprise, is now a “mundane topic.” These days, it’s all about “natural language understanding.” Voice recognition has nearly reached its zenith: after 50-odd years of development, A.I.s can now effectively recognize speech. These systems are almost better than humans at the job, and will certainly surpass mere mortals soon.

The predictable next step, then — much like a child’s development — is to teach these systems to understand the language that they can now recognize. “This human is speaking words; these are the words” is a far cry from, “I comprehend what this human is saying; allow me to assist.”

That further step requires interpreting meaning: imitating the way the human mind processes verbal information. There are two parts to this equation. The first is intent: What is the human’s goal or desire in speaking this sentence? A computer that can extract an intent from a spoken sentence can “understand” that the human wants to affect x or interact with y. Intertwined with this process is the second part of the equation: entity. The A.I. must know how to determine the entity being addressed, the object of the human’s intent.
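To make the intent/entity split concrete, here is a minimal Python sketch. It is not MindMeld's method: the keyword tables, the `Parse` type, and the `parse` function are all hypothetical, and simple substring matching stands in for whatever a production system actually does.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword tables, invented for illustration only.
INTENT_CUES = {
    "play_music": ["play", "listen"],
    "order_drink": ["order", "i'd like", "could i please have"],
}
ENTITY_VALUES = {
    "artist": ["rolling stones"],
    "drink": ["mocha", "coffee", "cup of joe"],
}


@dataclass
class Parse:
    intent: Optional[str]  # what the speaker wants to do
    entities: dict         # what the request is about


def parse(utterance: str) -> Parse:
    # Lowercase and normalize curly apostrophes so the cues can match.
    text = utterance.lower().replace("\u2019", "'")
    # Intent: the first intent whose cue phrase appears in the text.
    intent = next(
        (name for name, cues in INTENT_CUES.items()
         if any(cue in text for cue in cues)),
        None,
    )
    # Entities: the first known value of each kind found in the text.
    entities = {}
    for kind, values in ENTITY_VALUES.items():
        for value in values:
            if value in text:
                entities[kind] = value
                break
    return Parse(intent, entities)


print(parse("Play me the Rolling Stones playlist"))
print(parse("Just a large coffee for me."))
```

The second call finds the entity (coffee) but no intent, because none of the hand-written cues appear in the sentence. Rules like these miss phrasings nobody anticipated.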

To do so, MindMeld is not (as I presumed, or hoped) employing philosophers. It is employing natural language experts, but much of the A.I. “learning” process is itself relatively hands-off. If you’re teaching the system to comprehend coffee orders, you need to show the system all the different ways that people might presumably order coffee.

“I’d like a mocha.”

“Could I please have a cup of joe?”

“Just a large coffee for me.”

And that’s where the natural language experts (linguists) come in. But even that’s no longer necessary, because the data can be crowdsourced. Crowdsourcing tools enable you to ask thousands of people the same question and compile their responses. Then you just feed those responses into the A.I., and voilà: the A.I. can react to the wide range of possible inquiries. “From the thousands of queries, we now can just basically machine-learn how billions of other queries might be generated,” Vasisht says.
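As a rough illustration of learning from collected phrasings, the Python sketch below trains a tiny bag-of-words Naive Bayes classifier on six labeled examples. The article does not say which algorithm MindMeld uses, so the technique, the training set, and the function names here are assumptions; a real system would train on thousands of responses.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for crowdsourced responses, labeled by intent.
TRAINING = [
    ("i'd like a mocha", "order_coffee"),
    ("could i please have a cup of joe", "order_coffee"),
    ("just a large coffee for me", "order_coffee"),
    ("play me the rolling stones playlist", "play_music"),
    ("play me classical music this evening", "play_music"),
    ("put on some music", "play_music"),
]


def tokenize(text):
    # Lowercase, drop basic punctuation, split on whitespace.
    return text.lower().replace("?", "").replace(".", "").split()


def train(examples):
    word_counts = defaultdict(Counter)  # intent -> word frequencies
    intent_counts = Counter()           # intent -> number of examples
    vocab = set()
    for text, intent in examples:
        tokens = tokenize(text)
        word_counts[intent].update(tokens)
        intent_counts[intent] += 1
        vocab.update(tokens)
    return word_counts, intent_counts, vocab


def classify(text, model):
    word_counts, intent_counts, vocab = model
    total = sum(intent_counts.values())
    best, best_score = None, -math.inf
    for intent, n in intent_counts.items():
        # Log prior plus log likelihood with add-one smoothing.
        score = math.log(n / total)
        denom = sum(word_counts[intent].values()) + len(vocab)
        for tok in tokenize(text):
            score += math.log((word_counts[intent][tok] + 1) / denom)
        if score > best_score:
            best, best_score = intent, score
    return best


model = train(TRAINING)
print(classify("a small coffee please", model))
print(classify("play some jazz", model))
```

Neither test sentence appears in the training data. The classifier generalizes from shared words, which is the idea behind learning from example queries, shown here at toy scale.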

Inverse asked Vasisht, who’s long been an insider in the A.I. and natural language understanding realm, to speculate for us.

Can MindMeld participate in extended dialogue? For instance, if I ask a follow-up question, will the A.I. understand and keep responding?

Yes. That is part of the design. If somebody asks a question that is incomplete — so, for example, if I’m ordering coffee, and I don’t specify the size of the coffee that I want, it’s going to come back and say, “What size coffee do you want?”
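The follow-up behavior Vasisht describes is often called slot filling: the system tracks which details a request needs and asks for whichever is missing. Here is a minimal Python sketch of that idea. The slot names, the prompts, and the `respond` function are hypothetical and not drawn from MindMeld's product.

```python
# Hypothetical slot schema and prompts, invented for illustration.
REQUIRED_SLOTS = {"order_coffee": ["drink", "size"]}
PROMPTS = {
    "drink": "What would you like to drink?",
    "size": "What size coffee do you want?",
}


def respond(intent, slots):
    """Ask for the first missing slot, or confirm once all are filled."""
    if intent not in REQUIRED_SLOTS:
        return "Sorry, I did not understand that."
    for slot in REQUIRED_SLOTS[intent]:
        if slot not in slots:
            return PROMPTS[slot]
    return f"Ordering a {slots['size']} {slots['drink']}."


# The dialogue state carries over between turns.
state = {"drink": "coffee"}                      # user: "I'd like a coffee"
first_reply = respond("order_coffee", state)     # asks for the size
state["size"] = "large"                          # user: "Large"
second_reply = respond("order_coffee", state)    # confirms the order
print(first_reply)
print(second_reply)
```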

Do you expect any progress on the Turing test?

I think we’re pretty darn close to it. I mean, IBM Watson did Jeopardy!, and I think that was a really good example. We are at that point: It is getting very close. Just as, now, in terms of speech recognition we’re at the point where machines are as good as human beings, I think we’ll — certainly in the next three to five years — be at a point where most of these conversational voice systems will be considered to be as good as humans.

What sort of home automation things does MindMeld do?

We can apply our technology to any kind of product, any kind of service, any kind of data domain. Home automation is one of those. Within the home, you have lighting control, thermostat, security systems, audio systems, video systems, all those kinds of things. We are able to control any of the systems provided that there’s the appropriate interface.

What do you wish you could hook up to MindMeld within your own home?

I think that more advanced use-cases — such as talking to my Spotify to say “Play me the Rolling Stones playlist,” or “Play me classical music this evening” — those kinds of things would be … awesome.

Anything more unexpected or out-of-the-box that you would like to control with your voice?

The things I described to you are the things I think are imminent. In other words, these will happen very soon. What will not happen right away, I think, would be things such as microwaves, coffee machines, and refrigerators. Having these kinds of appliances be controlled — so I can basically say, “Is my coffee machine ready for making coffee? Turn on the coffee machine” and if it hasn’t been prepped, it should come back and say “I’m sorry, but your coffee machine is not ready” — that kind of intelligence does not yet exist. That will be the holy grail: Where basically every device can talk back to you and tell you what it can and cannot do. But we’re not quite there yet.

What do you think is holding the industry back?

These are extremely low-cost appliances, now. I mean, these are appliances you can buy for almost nothing. Ten years ago, they cost a lot more. So, building in new features is something that adds to the costs of these devices. Ultimately, the [current] value proposition is very strong; most of these manufacturers are not inclined to add new features, unless they are at a very low cost point.

I think that’s one aspect of it. The other aspect of it is, we’re talking about having these devices connected. So, there has to be more than just a voice use-case to connect these devices. There’s gotta be more capabilities that need to ride on that connection before they become viable.

Do you know of any company that’s working on that latter capacity?

A lot of semiconductor companies are working on very low-cost microphone arrays. The kind of thing that you can basically embed — at very low cost, on pretty much any device or application — that would allow there to be a voice input. And you don’t have to be standing next to these devices — you can talk from 10 feet away. Building that capability — I think that’s the starting point. And I think that’ll allow people to start putting microphones on devices, and then the other, advanced capabilities will follow. But as of right now, I don’t know any company that’s building this kind of a smart coffee machine, or smart microwave, or washing machine.

What’s your best estimate for when we have fully smart homes, fully smart apartments?

Today, we actually almost have all the essential subsystems in the house that people want automated, that are capable of being automated. This includes lights, thermostats, security systems, garage doors, front door locks — things like that. All these things can be done. The issue is really around price points. These are still at the price point where it’s mainly early adopters and people who have a really dire need for them. But the price points on these things drop dramatically, very fast. I think we’ll probably get these subsystems to mass-market in the next couple of years.

The other things that I talked about — automating the very low-cost appliances — I think those are probably in the five- to seven-year time frame at the earliest. More like 10 years out, before those become a reality. But, like I said before, those are things that will require a number of other things to come together. And it could happen sooner if those various ingredients mesh together sooner.

What do you think a New York City or San Francisco apartment would look like in, say, 2050?

2050! Wow. I think we’ll be fully there. The kind of things that we see in science-fiction movies — where you can pretty much talk to every system in your house, and control everything with voice — I think those kinds of capabilities will be widespread. Certainly in cities like New York and San Francisco.

Photos via MindMeld

Joe is a writer from Vermont who lives in Brooklyn. He has written for PopSci and McSweeney’s Internet Tendency and spent a year playing with words and other writers’ dreams at Tin House in Portland, Oregon.