The birth of deep learning can be traced back to 1958. That year Frank Rosenblatt, then a research psychologist and project engineer at the Cornell Aeronautical Laboratory, drew inspiration from the interconnection of neurons in the brain to design the first artificial neural network, which he called a "pattern-recognition device". Once built, the system ran on a huge IBM 704 computer; after 50 trials it could automatically distinguish cards marked on the left from cards marked on the right. This astonished Rosenblatt, who wrote: "The ability to create a machine with human qualities has long been a staple of science fiction, and we are about to witness the birth of such a machine — one that can perceive and recognize its surroundings without any human control."
At the same time, however, Rosenblatt knew that the computers of his day could not meet the computational demands of neural networks. In his pioneering work he lamented: "As the number of connections in the network increases... the burden on a conventional digital computer grows heavier and heavier."
Fortunately, after decades of progress driven by Moore's law and other hardware advances, computing power has made a qualitative leap: the number of operations a computer can execute per second has grown roughly ten-million-fold, giving artificial neural networks room to develop further. With this computing power, neural networks gained more connections and more neurons, and with them a greater capacity to model complex phenomena. Networks also gained additional layers of neurons, and these deeper networks gave the field its modern name: "deep learning".
Today, deep learning is widely used for language translation, protein-folding prediction, analysis of medical scans, and playing Go. Its success in these applications has taken deep learning from a niche technique to a leading force in computer science. Yet today's neural networks seem to be running into the same bottleneck they hit decades ago: the limits of computing power.
Recently, IEEE Spectrum published an article discussing the development and future of deep learning. Why has computing power become the bottleneck for today's deep learning? What are the possible responses? And if the limits of computing resources cannot be overcome, where should deep learning go next?
Computing power: a blessing and a curse
Deep learning is the mainstream of modern artificial intelligence. Early AI systems were rule-based, applying logic and expert knowledge to derive results. Later systems relied on learning to set adjustable parameters, but the number of parameters was usually limited. Today's neural networks also learn parameter values, but at a different scale: given enough parameters, a network becomes a general function approximator that can fit virtually any type of data. This flexibility is what lets deep learning be applied across so many fields.
The flexibility of neural networks comes from feeding the model many inputs and letting the network combine them in a multitude of ways. The output therefore depends on a complex formula rather than a simple one. In other words, neural networks demand an enormous amount of computation, and place correspondingly heavy demands on computer hardware.
For example, when Noisy Student (an image-recognition system) converts an image's pixel values into probabilities for the objects the image contains, it does so through a neural network with 480 million parameters. The training that determines the values of so many parameters is even more remarkable, because it uses only 1.2 million labeled images. From high-school algebra we expect to need more equations than unknowns; in deep learning, solving for unknowns that vastly outnumber the equations is precisely the point.
Deep learning models are over-parameterized: they have more parameters than there are data points available for training. Classically, over-parameterization leads to overfitting, where the model learns not only the general trends but also the random noise in the training data. Deep learning avoids this by initializing the parameters randomly and then iteratively adjusting them with stochastic gradient descent to better fit the data. Experiments show that this procedure yields models that generalize well.
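The recipe described above can be sketched in a few lines. This is a toy illustration only: the problem sizes, learning rate, and linear model are hypothetical stand-ins, and plain full-batch gradient descent replaces the stochastic variant for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy over-parameterized problem (hypothetical sizes): 50 parameters but
# only 20 training points, so unknowns outnumber "equations".
n_samples, n_params = 20, 50
X = rng.normal(size=(n_samples, n_params))
y = X @ rng.normal(size=n_params)  # noiseless targets

# The deep-learning recipe in miniature: randomly initialize the
# parameters, then iteratively adjust them by gradient descent on the
# squared error.
w = 0.01 * rng.normal(size=n_params)
lr = 0.1
for _ in range(2000):
    grad = X.T @ (X @ w - y) / n_samples
    w -= lr * grad

train_loss = float(np.mean((X @ w - y) ** 2))
print(f"training loss: {train_loss:.2e}")  # fits the data almost exactly
```

Despite having more unknowns than data points, the procedure converges to a parameter set that fits the training data; that it also tends to generalize is the empirical observation the article describes.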
The success of deep learning models is evident in machine translation. For decades, people have used software to translate text from language A to language B. Early machine-translation methods used rules designed by linguists. But as more and more text data became available, statistical methods such as maximum entropy, hidden Markov models, and conditional random fields were gradually applied to machine translation.
Initially, how well each method worked for a given language depended on the availability of data and on the language's grammatical characteristics. For example, for translating Urdu, Arabic, and Malay, rule-based methods at first outperformed statistical ones. Today, all of these methods have been surpassed by deep learning: almost every field deep learning has touched has demonstrated the advantages of this machine-learning approach.
On the one hand, deep learning is highly flexible; on the other, this flexibility comes at a huge computational cost. The computing resources and energy required to train such a system are enormous: the carbon dioxide emitted can be roughly as much as New York City produces in a month.
There are two main reasons computing costs rise so fast: 1) to improve performance by a factor of k, at least k² more data points are needed to train the model; 2) over-parameterization means the model itself must grow along with the data. Taken together, the total computational cost of improving the model scales at least with the fourth power of k. That small "4" in the exponent is very expensive: a 10-fold improvement requires at least a 10,000-fold increase in computation.
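As a quick sanity check on that arithmetic, the reasoning above can be written out directly. The function name and the decomposition into a data factor and a model-size factor are mine; both factors follow the argument in the text.

```python
# Hypothetical illustration of the fourth-power scaling rule described above.
def compute_cost_multiplier(k: int) -> int:
    """Relative compute needed for a k-fold performance improvement."""
    data_needed = k ** 2    # at least k^2 more data points
    cost_per_pass = k ** 2  # the over-parameterized model grows with the data
    return data_needed * cost_per_pass  # total: k^4

print(compute_cost_multiplier(10))  # → 10000
```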
To see the trade-off between flexibility and computational demands, consider a scenario in which you try to predict from a patient's X-ray whether the patient has cancer. Assume further that the correct answer can be found only by measuring 100 details (call them "variables" or "features") in the X-ray. The challenge is that we cannot know in advance which variables matter, and must choose from an enormous pool of candidates.
An expert-knowledge system tackles this problem by asking people with backgrounds in radiology and oncology to specify the variables they consider important; the system then examines only those. The flexible deep-learning approach instead tests as many variables as possible and lets the system work out which ones matter, which requires more data and incurs a higher computational cost.
A model whose important variables are identified in advance by experts can quickly learn the values best suited to those variables with only a modest amount of computation, which is why expert methods (symbolic AI) were so popular early on. But if the experts fail to specify every variable that belongs in the model, the model's learning will stall.
In contrast, flexible models such as deep learning are less efficient and need far more computation to match the performance of an expert model, but with sufficient computation (and data) a flexible model can outperform the expert one. Clearly, deep learning's performance can be improved by using more computing power to build larger models and train them on more data. But how expensive will this computational burden become? Will the costs grow high enough to hinder progress? These questions remain open.
The computational cost of deep learning
To answer these questions more concretely, a research team from MIT, Yonsei University in Korea, and the University of Brasília (hereafter "the team") gathered data from more than 1,000 papers on deep learning and examined in detail its application to image classification. Over the past few years, reducing image-classification error has come with a steadily rising computational burden. In 2012, the AlexNet model first demonstrated training a deep-learning system on graphics processing units (GPUs): AlexNet's training alone occupied two GPUs for five to six days. By 2018, NASNet-A had cut AlexNet's error rate in half, but this performance gain cost more than 1,000 times as much computation.
In theory, improving a model's performance should require computation that scales with at least the fourth power of the improvement. In practice, the observed scaling is at least the ninth power. That ninth power means halving the error rate can require more than 500 times the computing resources, a devastating price. Yet the situation may not be hopeless: the gap between the actual and the ideal scaling may mean there are undiscovered algorithmic improvements that could greatly increase the efficiency of deep learning.
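Plugging in a halved error rate (an improvement factor of k = 2) shows where the "more than 500 times" figure comes from:

```python
# Improvement factor for halving the error rate.
k = 2

theoretical = k ** 4  # the ideal fourth-power scaling
observed = k ** 9     # the ninth-power scaling measured in practice

print(theoretical, observed)  # → 16 512
```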
The team noted that Moore's law and other hardware advances have greatly improved chip performance. Does that mean the escalation in computing requirements does not matter? Unfortunately, no. Of the roughly 1,000-fold difference in computation between AlexNet and NASNet-A, only a 6-fold improvement came from better hardware; the rest came from using more processors and running them for longer, which drives computing costs higher.
After fitting the cost-versus-performance curve for image recognition, the team estimated how much computation would be needed to reach better performance benchmarks in the future. Their estimate: achieving a 5 percent error rate would require 10^19 billion floating-point operations.
In 2019, a team at the University of Massachusetts Amherst published "Energy and Policy Considerations for Deep Learning in NLP", the first work to reveal the economic and environmental costs behind this computational burden; it caused a sensation at the time.
DeepMind has disclosed that it spent about $35 million training its Go-playing deep-learning system, and OpenAI spent more than $4 million training GPT-3. Later, when DeepMind designed a system to play StarCraft II, it deliberately avoided trying multiple designs for an important component because the training cost would have been too high.
Beyond technology companies, other institutions have also begun to reckon with the computational cost of deep learning. A large European supermarket chain recently abandoned a deep-learning-based system that could have significantly improved its ability to predict which products would sell, because executives judged the cost of training and running the system too high.
Facing rising economic and environmental costs, deep-learning researchers need to find methods that improve performance without causing computational demand to explode. Otherwise, the development of deep learning may well grind to a halt.