Why A.I. Can Struggle to Understand Arabic

In the world of artificial intelligence, data is king. The more data a system has, the more it can “learn” about what to expect, and the better the resulting tools become. But depending on the platform the A.I. draws its data from, some languages may be better represented than others, according to Miriam Redi, a research scientist for Yahoo Labs.

“For example like Flickr, where we take our data from, some languages are very little represented,” said Redi, speaking at London’s Deep Learning Summit on Thursday. “So we have English, millions of images for English, but we have maybe 100,000 for Arabic.”

Redi’s team is working on a tool that can identify non-visible elements of images, like cultural values and emotional connotations. The tool analyzes the text attached to publicly available images on Flickr. Over time, the A.I. starts to understand why someone may tag an image “happy party” or “awkward moment,” and its inferences grow more accurate as it analyzes more images.

“Unfortunately, the accuracy for sentiment detection in images for Arabic languages tend to be lower because we don’t have enough data,” Redi said.

In the languages that had larger amounts of data, Redi’s team noticed a few interesting patterns. Users of Romance languages like French and Spanish tended to describe images in similar ways, while Italian appeared to be the only language in which users tagged images with the term “tax evasion.”

Different languages tended to have different sets of text markers for images.

Language barriers remain something of an issue for A.I. researchers. Anyone who’s used Google Translate will know that switching languages is never quite as simple as it sounds. However, new developments are changing things: Facebook announced this summer that it was moving closer to its dream of a single-language social network, one that automatically translates text for users.

Developments in removing language barriers can help foster international communication, but for projects like Redi’s, there’s no real substitute for human-generated sentiment data.
