For a computer programmer working in English, building a program that uses artificial intelligence to sort email spam is straightforward enough. But doing it in Wolof, a language spoken in Senegal that has some 4.2 million speakers, is like reinventing the wheel, because no such program is available in the language. And the scenario’s the same for programming in nearly all of the 300 most common languages on Earth, spoken by a population of almost 1.9 billion people.
But on Tuesday, Facebook released an updated version of fastText, its open-source machine learning program, which has the potential to solve this problem. Part language library, part machine learning algorithm, fastText was first made open source with 90 languages at the end of August last year. The updated release brings the number of supported languages to 294. Along with the expansion, the researchers at Facebook’s Artificial Intelligence Research team in New York have optimized the method to run on extremely small devices, like a smartphone. This extends the reach of the program to the nearly 1.9 billion people whose native languages aren’t well supported.
“We hope that this will help people to easily be able to learn and play with machine learning,” Armand Joulin, a research scientist at Facebook Artificial Intelligence Research who developed the product, tells Inverse. “This release contributes to our on-going effort at FAIR to democratize machine learning.”
Typically, if you’re writing in a language that isn’t as widely used as English or Mandarin, building a machine learning algorithm to predict what hashtags you want to use, or to automatically sort spam out of your email, takes a lot of work. You have to build a library of words to train on and create a method to identify some of the meaning of those words, so the program can determine which elements matter when sorting the message as a whole. So programs that run in major languages are typically pretty well set, but languages like Wolof are left behind.
Something like fastText makes the initial steps a lot easier by providing a library of words to use to train your algorithm to work in a specific language. On top of that, it uses a method of classifying words that can sort through half a million sentences in under a minute, according to the research blog post.
To do this, the method uses a technique called “bag-of-words,” where it just counts how many times a word appears in a document. It also counts how many times simple phrases appear in a document; both counts can be computed very quickly. The program then learns to use certain words or phrases to sort or predict what you want to do: finding “Viagra” in an email, for example, suggests it’s spam.
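For readers who want to see the counting idea in action, here is a minimal Python sketch. It is not fastText’s own code: the function names, the sample email, and the hand-picked word weights are all assumptions made for illustration, and a real classifier would learn its weights from labeled examples rather than have them typed in.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(text):
    """Count how many times each word appears in the text."""
    return Counter(tokenize(text))


def bag_of_bigrams(text):
    """Count how many times each two-word phrase appears in the text."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical, hand-picked weights used only for illustration.
# A trained model would learn values like these from labeled email.
SPAM_WEIGHTS = {"viagra": 3.0, "free": 1.0, "winner": 1.5}


def spam_score(text):
    """Add up the weight of each suspicious word, once per occurrence."""
    counts = bag_of_words(text)
    return sum(weight * counts[word] for word, weight in SPAM_WEIGHTS.items())


email = "Free Viagra! Claim your free Viagra now."
words = bag_of_words(email)
phrases = bag_of_bigrams(email)

print(words.most_common(3))
print(phrases[("free", "viagra")])
print(spam_score(email))
```

The word order inside the email is thrown away except for the two-word phrases, which is why this kind of counting is cheap enough to run over very large collections of text.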
The key is that the program works so quickly that it can learn enough about a large number of languages to be useful on a very short time scale. “More complex models, like neural networks, are often too slow to be trained at such a scale,” Joulin says.
Both the speed and the number of languages that can be used make fastText accessible to researchers who don’t have the language resources available to programs that run in English. And decreasing the memory the method needs to run, so researchers don’t need a supercomputer, also helps level the playing field. Joulin says, “we release word representations for 294 different languages allowing people around the world to power their applications regardless of the language.”