In a previous deep learning article, we talked about how inference workloads – the use of already-trained neural networks to analyze data – can run on fairly cheap hardware, but the training workload in which the neural network "learns" is orders of magnitude more expensive.
The more potential inputs an algorithm has, the more out of control the scaling problem gets when analyzing its problem space. This is where MACH comes in, a research project by Tharun Medini and Anshumali Shrivastava of Rice University. MACH is an acronym for Merged Average Classifiers via Hashing. Its training is reported to be faster, and its "… memory footprints are 2-4 times smaller" than with previous large-scale deep learning techniques.
When describing the scale of extreme classification problems, Medini pointed to online shopping searches, noting that "slightly more than 100 million products are online". If anything, that is conservative: one data company said Amazon sold 606 million separate products in the US alone, with the entire company offering more than three billion products worldwide. Another company puts the US product count at 353 million. Medini continues: "A neural network that takes search input and predicts from 100 million outputs, or products, will typically end up with about 2,000 parameters per product. So you multiply those, and the final layer of the neural network is now 200 billion parameters … [and] I'm talking about a very, very dead simple neural network model."
At this scale, a supercomputer would likely need terabytes of memory just to store the model. The memory problem gets worse when you bring GPUs into the picture. GPUs can process neural network workloads orders of magnitude faster than general-purpose CPUs, but each GPU has relatively little RAM – even the most expensive Nvidia Tesla GPUs have only 32 GB. Medini: "Training such a model is unaffordable due to the massive communication between graphics processors."
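To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The product count, parameters per product, and 32 GB GPU figure come from the text above; the 4 bytes per parameter (float32 weights) is an assumption, and the result covers the weights of the final layer only. Training also needs gradients and optimizer state, which would multiply the total.

```python
# Back-of-the-envelope size of the final layer described above.
NUM_PRODUCTS = 100_000_000     # from the article
PARAMS_PER_PRODUCT = 2_000     # from the article
BYTES_PER_PARAM = 4            # assumption: float32 weights
GPU_RAM_BYTES = 32 * 10**9     # 32 GB per GPU, decimal gigabytes

last_layer_params = NUM_PRODUCTS * PARAMS_PER_PRODUCT
last_layer_bytes = last_layer_params * BYTES_PER_PARAM

# Ceiling division: how many 32 GB GPUs are needed just to hold the weights
gpus_needed = -(-last_layer_bytes // GPU_RAM_BYTES)

print(f"{last_layer_params:,} parameters in the final layer")
print(f"{last_layer_bytes / 10**12:.1f} TB for the weights alone")
print(f"at least {gpus_needed} GPUs just to store them")
```

Under these assumptions, the final layer alone cannot fit on a single card, which is why the GPUs would have to exchange data constantly during training.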
Instead of training on all 100 million results – product purchases, in this example – MACH divides them into three "buckets," each containing 33.3 million randomly selected results. MACH then creates another "world," in which the 100 million results are again randomly sorted into three buckets. Crucially, the random sorting is separate in World One and World Two: each world holds the same 100 million results, but their random distribution into buckets is different.
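The bucketing step can be sketched on a toy scale. This is a minimal illustration, not the researchers' code: the function name `assign_buckets` is hypothetical, nine products stand in for 100 million, and a seeded shuffle stands in for the hash functions that the "via Hashing" in MACH's name refers to.

```python
import random

def assign_buckets(num_classes, num_buckets, seed):
    """Return a list where index = class id and value = bucket id.

    Buckets come out near-equal in size; a different seed gives a
    different, independent assignment (a different "world").
    """
    rng = random.Random(seed)
    # Repeat 0..num_buckets-1 until every class has a slot, then shuffle
    slots = [i % num_buckets for i in range(num_classes)]
    rng.shuffle(slots)
    return slots

NUM_CLASSES = 9   # toy stand-in for 100 million products
NUM_BUCKETS = 3

# Same nine products in both worlds, sorted into buckets independently
world_one = assign_buckets(NUM_CLASSES, NUM_BUCKETS, seed=1)
world_two = assign_buckets(NUM_CLASSES, NUM_BUCKETS, seed=2)

print("World One:", world_one)
print("World Two:", world_two)
```

Each world then only needs a classifier that picks one of three buckets, rather than one of nine products.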
In each world, a search is fed to both a "World One" classifier and a "World Two" classifier, with only three possible results apiece. "What is this person thinking about?" asks Shrivastava. "The most likely class is something that is common between these two buckets."
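One way to picture the "common between these two buckets" idea is to score every product by the probability each world's classifier gave to the bucket that product landed in. The sketch below averages those probabilities, an assumption suggested by the "Merged Average" in MACH's name; the bucket assignments, the probabilities, and the function name `merge_scores` are all made up for illustration.

```python
# Hypothetical fixed assignments: index = product id, value = bucket id
world_one = [0, 0, 0, 1, 1, 1, 2, 2, 2]
world_two = [0, 1, 2, 0, 1, 2, 0, 1, 2]

def merge_scores(assignments, bucket_probs):
    """Average, for each product, the probability of its bucket in every world."""
    num_classes = len(assignments[0])
    scores = []
    for c in range(num_classes):
        per_world = [probs[assign[c]]
                     for assign, probs in zip(assignments, bucket_probs)]
        scores.append(sum(per_world) / len(per_world))
    return scores

# Made-up classifier outputs for a single search query
probs_one = [0.1, 0.7, 0.2]   # World One favors bucket 1: products 3, 4, 5
probs_two = [0.2, 0.1, 0.7]   # World Two favors bucket 2: products 2, 5, 8

scores = merge_scores([world_one, world_two], [probs_one, probs_two])
best = max(range(len(scores)), key=scores.__getitem__)
print("most likely product:", best)
```

Product 5 is the only one that sits in the favored bucket of both worlds, so it gets the highest merged score.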
At this point there are nine possible results: each of the three buckets in World One combined with each of the three buckets in World Two. But MACH only had to create six classes – the three buckets from World One and the three buckets from World Two – to model this nine-result search space. The advantage grows as more "worlds" are created. A three-world approach yields 27 results from just nine classes, a four-world approach 81 results from 12 classes, and so on. "I pay the costs linearly and get an exponential improvement," says Shrivastava.
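The linear-versus-exponential trade-off above is simple enough to check directly. In this small sketch, the function name `classes_and_results` is just an illustrative label: the number of classes to train is buckets times worlds, while the number of distinguishable results is buckets raised to the number of worlds.

```python
def classes_and_results(num_buckets, num_worlds):
    """Return (classes to train, distinguishable results)."""
    trained = num_buckets * num_worlds           # cost grows linearly
    distinguishable = num_buckets ** num_worlds  # coverage grows exponentially
    return trained, distinguishable

for worlds in range(2, 5):
    trained, results = classes_and_results(3, worlds)
    print(f"{worlds} worlds: {trained} classes, {results} results")
```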
Even better, MACH lends itself well to distributed computing on smaller individual instances. The worlds "don't even have to talk to each other," says Medini. "In principle, you could train each [world] on a single GPU, which you could never do with a non-independent approach." In the real world, the researchers applied MACH to a 49-million-product Amazon training database, randomly sorting it into 10,000 buckets in each of 32 separate worlds. This reduced the required parameters in the model by more than an order of magnitude – and, according to Medini, training the model required less time and less memory than some of the best reported training times for models with comparable parameters.
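For a rough sense of that reduction, the sketch below compares output units only, using the figures quoted above. It assumes every output unit carries a similar number of weights and ignores the rest of the network, which the text does not describe, so the ratio is an illustration rather than the researchers' reported number.

```python
NUM_PRODUCTS = 49_000_000   # from the article
NUM_BUCKETS = 10_000        # from the article
NUM_WORLDS = 32             # from the article

# One output per product versus one output per bucket in every world
direct_outputs = NUM_PRODUCTS
mach_outputs = NUM_BUCKETS * NUM_WORLDS

ratio = direct_outputs / mach_outputs
print(f"direct classifier: {direct_outputs:,} outputs")
print(f"MACH: {mach_outputs:,} outputs across {NUM_WORLDS} worlds")
print(f"about {ratio:.0f}x fewer outputs")
```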
Of course, this would not be an Ars article on deep learning if we did not end it with a cynical reminder of unintended consequences. The unspoken reality is that the neural network isn't really learning to show shoppers what they asked for. Instead, it is learning how to turn queries into purchases. The neural network does not know or care what the person was actually looking for; it only has an idea of what people are most likely to buy – and without adequate oversight, systems trained to increase the likelihood of outcomes this way can end up suggesting baby products to women who have miscarried, or worse.