In the last 2-3 years, a new buzzword has appeared: deep learning. In 2012, Microsoft presented a pretty impressive demo: an application that recognized spoken English, translated the text to Chinese and then spoke the translation in the original speaker's voice. In the same year, Google developed a system that, from 10 million YouTube thumbnails, learned by itself to recognize cats (and 22,000 other categories of objects).
In December 2013, Facebook opened a new research lab, led by Yann LeCun of New York University, one of the foremost researchers in deep learning.
Deep learning is a category of machine learning algorithms that differ from previous ones in that they do not learn the desired function directly: they first learn how to process the input data. Until now, to recognize objects in images, for example, researchers developed various feature extractors, the more specific to an object the better, and after applying them to an image they used a classification algorithm such as an SVM or a random forest to determine what was in each image. For some objects there are really good feature extractors (for faces, for example), but for others there are not (paper shredders are a challenge even for humans to recognize and describe). Deep learning algorithms don't require this feature extraction step, because they learn to do it themselves.
The deep part of the name comes from the fact that instead of having a single layer that receives the input data and outputs the desired result, we have a series of layers that process data received from the previous layer, extracting higher and higher levels of features. Only the last layer is used to obtain the result, after the data has been transformed and compressed.
Most deep learning algorithms are some kind of neural network. The first kind of neural network, the perceptron, was proposed in 1958 by Frank Rosenblatt, building on research by Warren McCulloch and Walter Pitts on how neurons function in the animal brain.
The perceptron is a very simple model: each element of the input is multiplied by a weight and the results are all summed. If the sum is larger than a bias, the output is 1; otherwise it is 0. The weights and the bias have to be adapted for each dataset, using the perceptron learning algorithm.
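To make the model concrete, here is a minimal perceptron sketch in Python with NumPy; the AND dataset, the learning rate and the number of epochs are my own illustrative choices, not something prescribed by the original algorithm.

import numpy as np

def predict(x, weights, bias):
    # Weighted sum of the inputs: output 1 if it exceeds the bias, else 0.
    return 1 if np.dot(x, weights) > bias else 0

def train(X, y, epochs=20, lr=0.1):
    # Perceptron learning rule: nudge the weights (and bias) after each mistake.
    weights = np.zeros(X.shape[1])
    bias = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            error = target - predict(xi, weights, bias)
            weights += lr * error * xi
            bias -= lr * error  # a higher bias makes the neuron fire less often
    return weights, bias

# The logical AND function is linearly separable, so the algorithm converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train(X, y)
print([predict(xi, w, b) for xi in X])  # expected: [0, 0, 0, 1]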
Being such a simple model, the perceptron was quickly shown to have a big problem: it works only for linearly separable data (for example, in a 2D plane, when all the points with a positive label can be separated by a line from all the points with a negative label; the XOR function is the classic counterexample). This was a big limitation, and it led to neural networks falling out of use for a while.
In 1975, Paul Werbos discovered the backpropagation algorithm, which made it possible to train neural networks with multiple layers and to use different activation functions (which make the transition from one class to another smooth instead of abrupt). Hornik also proved mathematically that a neural network with one hidden layer, given enough neurons, can approximate any continuous function to arbitrary precision.
The backpropagation algorithm works in the following way: you initialize the weights randomly and run a normal forward pass to see what result you get. You calculate the difference between the expected and the obtained result and "propagate" it back to the penultimate layer, dividing it proportionally by each neuron's weight, and you continue doing this until you reach the first layer. Then you repeat this whole cycle until the error is low enough for your purposes (or until you run out of time and patience!).
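As a rough illustration of this cycle, here is a tiny two-layer network trained with backpropagation in Python/NumPy on the XOR problem (the layer sizes, learning rate and number of iterations are arbitrary illustrative choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# XOR: the classic dataset that a single perceptron cannot handle.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Random initial weights and biases for one hidden layer and one output layer.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

for step in range(10000):
    # Forward pass: compute the network's current answer.
    h = sigmoid(X @ W1 + b1)      # hidden layer activations
    out = sigmoid(h @ W2 + b2)    # network output

    # Backward pass: the output error, propagated back through the weights.
    out_delta = (out - y) * out * (1 - out)
    h_delta = (out_delta @ W2.T) * h * (1 - h)

    # Gradient-descent updates of all weights and biases.
    W2 -= 0.5 * h.T @ out_delta
    b2 -= 0.5 * out_delta.sum(axis=0)
    W1 -= 0.5 * X.T @ h_delta
    b1 -= 0.5 * h_delta.sum(axis=0)

print(out.round(2))  # for most random initializations this approaches [[0], [1], [1], [0]]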
So, did we solve the problem? Do we have the perfect machine learning algorithm? Not quite. Two other problems were found. The first was that the number of neurons necessary to approximate functions can grow exponentially with the size of the input. The other was that training neural networks was proved to be an NP-complete problem (and NP-complete problems are presumed to be unsolvable in polynomial time).
Because of these two problems, and because around that time several other machine learning algorithms were developed that traded off generality for speed, such as decision trees (1986), SVMs (1996), random forests (1995) and others, neural networks became less and less popular with most researchers. Some still obtained good results with them, such as Yann LeCun with LeNet (1998), a system for recognizing handwritten digits based on convolutional neural networks, which was used by many banks to read checks.
A major paradigm change came in 2006, when Geoffrey Hinton published two papers containing new and revolutionary ideas: "Reducing the dimensionality of data with neural networks" and "A fast learning algorithm for deep belief nets", both based on Restricted Boltzmann Machines (RBMs). These new algorithms start the search for the parameters of the neural network from a place that is closer to the optimal value: instead of choosing the initial weights randomly, the networks were trained layer by layer, in an unsupervised way, to find the structure of the data, and only at the end were small final corrections made using the backpropagation algorithm.
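To give an idea of what this unsupervised, layer-by-layer pretraining looks like, here is a rough sketch of a single contrastive-divergence (CD-1) update for a binary Restricted Boltzmann Machine in Python/NumPy. This is only meant to show the shape of the idea, not Hinton's reference implementation, and the layer sizes, learning rate and random stand-in data are arbitrary assumptions:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b_v, b_h, lr=0.01):
    # One CD-1 step on a mini-batch of binary visible vectors v0 (updates in place).
    h0_prob = sigmoid(v0 @ W + b_h)                         # positive phase
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    v1_prob = sigmoid(h0 @ W.T + b_v)                       # reconstruction
    h1_prob = sigmoid(v1_prob @ W + b_h)                    # negative phase
    # Move the parameters toward the data statistics and away from the model's.
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)
    b_v += lr * (v0 - v1_prob).mean(axis=0)
    b_h += lr * (h0_prob - h1_prob).mean(axis=0)

# One RBM with 784 visible units (e.g. pixels) and 256 hidden units.
W = rng.normal(scale=0.01, size=(784, 256))
b_v, b_h = np.zeros(784), np.zeros(256)
batch = (rng.random((32, 784)) > 0.5).astype(float)         # stand-in for real data
cd1_update(batch, W, b_v, b_h)

A stack of such RBMs is trained one layer at a time: the hidden activations of one trained RBM become the input of the next, and the resulting weights initialize a deep network that is then fine-tuned with backpropagation.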
One of the standard benchmarks for machine learning algorithms is MNIST, a set of 70,000 images, each containing a handwritten digit. The set is split into two parts, 60,000 images being used for training and 10,000 for testing (this is done to verify that the algorithm really did learn what each digit looks like and did not just memorize the input data). On this dataset, a neural network pretrained with RBMs got an error of 1%, which at the time was among the top 5 best results.
In 2007, Yoshua Bengio presented autoencoders as an alternative to RBMs; these were then developed further and several variations appeared: denoising autoencoders, sparse autoencoders and others.
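As a sketch of the idea, here is a tiny denoising autoencoder in Python/NumPy: the network has to reconstruct the clean input from a corrupted copy, which forces the hidden layer to learn useful features. The sizes, noise level and learning rate are illustrative assumptions, and the decoder reuses the encoder's weights (a common but not mandatory choice):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, W, b_h, b_out, lr=0.1):
    # One training step on a mini-batch x of inputs scaled to [0, 1] (updates in place).
    mask = rng.random(x.shape) > 0.3        # corrupt 30% of each input
    x_noisy = x * mask
    h = sigmoid(x_noisy @ W + b_h)          # encode the corrupted input
    x_rec = sigmoid(h @ W.T + b_out)        # decode with the tied weights W.T
    # Backpropagate the squared reconstruction error against the CLEAN input.
    out_delta = (x_rec - x) * x_rec * (1 - x_rec)
    h_delta = (out_delta @ W) * h * (1 - h)
    W -= lr * (x_noisy.T @ h_delta + out_delta.T @ h) / len(x)
    b_out -= lr * out_delta.mean(axis=0)
    b_h -= lr * h_delta.mean(axis=0)
    return np.mean((x_rec - x) ** 2)        # reconstruction error to monitor

# 784 inputs (e.g. pixels), 128 hidden features; the batch is random stand-in data.
W = rng.normal(scale=0.01, size=(784, 128))
b_h, b_out = np.zeros(128), np.zeros(784)
batch = rng.random((32, 784))
print(train_step(batch, W, b_h, b_out))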
There was a lot of progress in the pretraining area until, in 2010, James Martens published a paper presenting a novel algorithm for finding the weights, using second-order derivatives and no pretraining at all, and he got better results than anything published until then.
In the same year, Dan Cireșan showed that not even this algorithm is necessary: using classical backpropagation on a deep and wide neural network, trained on a GPU and on a dataset augmented with elastic deformations, one can get excellent results on MNIST: 0.35% error. In 2012, his group set the current record on MNIST, 0.23%, which is lower than the average human error rate for this task. This time they used convolutional neural networks and max-pooling, but still no pretraining.
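To show what those two building blocks do, here is a plain Python/NumPy sketch of a 'valid' 2-D convolution of one filter over an image, followed by 2x2 max-pooling. Real networks learn many filters and stack several such stages; the image, filter and sizes here are toy values of my own choosing:

import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image and record the response at each position.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    # Keep only the strongest response in each size x size block.
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.default_rng(0).random((28, 28))   # e.g. one MNIST-sized digit
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)      # a simple vertical-edge detector
features = max_pool(conv2d(image, edge_filter))
print(features.shape)  # (13, 13): a smaller map of where vertical edges are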
So the circle has closed and we are back where we "started": the latest papers no longer use unsupervised pretraining, only the backpropagation algorithm. The improvements come from new activation functions (Rectified Linear Units are all the rage now), from regularization methods such as maxout and dropout (which help with generalization and prevent memorization), and from alternating convolutional layers with max-pooling ones.
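As a small sketch of two of these ingredients, here are the Rectified Linear Unit and (inverted) dropout in Python/NumPy; the keep probability and the layer shape are arbitrary illustrative values:

import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # Rectified Linear Unit: pass positive values through, clamp the rest to 0.
    return np.maximum(0.0, z)

def dropout(activations, keep_prob=0.5, training=True):
    # During training, randomly silence units so the network cannot rely on
    # any single one; at test time the layer is used as-is.
    if not training:
        return activations
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob   # rescale so the expected value matches

h = relu(rng.normal(size=(4, 8)))      # activations of a hypothetical layer
h_train = dropout(h, keep_prob=0.5)    # only applied while training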
In the following articles I will go into more detail about each of the techniques mentioned, present their advantages and disadvantages, and show how you can implement them using various machine learning frameworks.