Earlier this week, Musk tweeted that Tesla AI Day would take place on August 19 in North America. According to an earlier tweet of his, the event will cover Tesla’s software and hardware advances in artificial intelligence, particularly in training and inference. The main purpose of the event is to recruit talent in these areas.
As with Autonomy Day in 2019 and Battery Day in 2020, AI Day is expected to go deep into software and hardware technical details as a way for Tesla to flex its muscles. This technological muscle-flexing is Tesla’s distinctive way of recruiting top talent. In some ways, Tesla’s events are aimed more at industry professionals: they are meant to attract people who are excited by ambitious roadmaps and industry-disrupting research and development.
Peter Bannon, head of AI hardware at Tesla, said in an interview: “You know, there are a lot of people who want to work at Tesla simply because they want to work on [FSD] development and related work.” In fact, in recent years Tesla and SpaceX have often traded first place in rankings of the companies engineering students most want to work for, which supports Bannon’s point. As usual, Tesla has not revealed any details about AI Day in advance, but the teaser image alone has excited many people working in the field of AI.
The mysterious Dojo computer chip
The invitation to AI Day featured an exploded-view drawing of a chip. Judging from the drawing, the chip uses an unconventional packaging form. The copper structures in the first and fifth layers are water-cooling modules. The second layer, circled in red, consists of 25 chips in a 5 x 5 array. The third layer is a BGA package substrate for the 25-chip array. The fourth and seventh layers are probably just mechanical support structures with some thermal function. The sixth layer, circled in blue, is probably the power module, and the vertical black bars above it are probably high-speed communication modules that pass through the cooling layer to interconnect with the chips.
The second layer, with its rounded corners and 25 chips, looks a lot like Cerebras’ WSE wafer-scale processor, which suggests Tesla may have used TSMC’s InFO_SoW (Integrated Fan-Out System-on-Wafer) design. In a conventional process, a wafer is diced into many chips, which become different types of CPU or GPU chips (the type is fixed by the design patterned during photolithography). With InFO_SoW, all the chips come from the same wafer and are never diced apart. Instead, the whole wafer is made directly into one large chip, achieving a system-on-wafer design.
The benefits of this approach are threefold: extremely low communication latency, high communication bandwidth, and improved energy efficiency. Put simply, because the physical distance for chip-to-chip (C2C) communication is very short and the communication fabric can be laid out directly on the wafer, all the cores can be interconnected through a unified 2D mesh network, giving C2C communication ultra-low latency and high bandwidth. The structure also lowers the impedance of the power delivery network (PDN), which improves energy efficiency. In addition, because the array is composed of multiple small chips, yield problems can be worked around through redundant design, and the flexibility of processing small chips is retained.
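To make the latency argument concrete, here is a minimal sketch of hop counts in a 5 x 5 array. It assumes a plain 2D mesh with nearest-neighbor links and shortest-path routing. The actual Dojo interconnect topology is not known from the invitation image, and the function names are illustrative only.

```python
from itertools import product

def mesh_hops(a, b):
    """Manhattan distance between two (row, col) positions in a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def mesh_stats(side):
    """Return (worst-case hops, average hops) over all distinct chip pairs."""
    nodes = list(product(range(side), repeat=2))
    dists = [mesh_hops(a, b) for a in nodes for b in nodes if a != b]
    return max(dists), sum(dists) / len(dists)

# Assumption: a 5 x 5 array, matching the 25 chips described in the text.
worst, average = mesh_stats(5)
print(f"worst case: {worst} hops, average: {average:.2f} hops")
```

Under these assumptions, any chip can reach any other in at most 8 short on-wafer hops, with no cables or switches in between.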
For a concrete example, consider the supercomputer Tesla unveiled earlier, which uses a total of 5,760 Nvidia A100 80 GB GPUs. Connecting these chips so that they can communicate requires a huge amount of physical interconnect. This is not only very expensive: the limited bandwidth of the interconnect also becomes the “short stave of the barrel,” that is, the bottleneck that drags down overall efficiency. On top of that, there is the huge problem of cooling hardware that is spread out over so many separate units.
By comparison, the Cerebras WSE-2 has 123 times as many cores as the Nvidia A100, 1,000 times the on-chip memory, 12,733 times the memory bandwidth, and 45,833 times the fabric bandwidth. A performance monster of this class is intended primarily for AI data processing and training. Its first-generation predecessor, the WSE, has been adopted by a number of heavyweight users, including Argonne National Laboratory, Lawrence Livermore National Laboratory, the Pittsburgh Supercomputing Center, the University of Edinburgh’s supercomputing centre, GlaxoSmithKline, and Tokyo Electron.
Kim Branson, senior vice president at global pharmaceutical giant GlaxoSmithKline, praised the WSE’s performance for cutting training time by a factor of 80. At Argonne National Laboratory, the largest science and engineering laboratory in the United States, WSE chips are being used for cancer research, cutting the turnaround time of cancer models to one three-hundredth of what it was.
So it is not hard to guess that the image on the AI Day invitation shows a prototype of what Musk calls the Dojo supercomputer. Interestingly enough, the event is scheduled for August 19, 2021. Exactly one year earlier, on August 19, 2020, Musk tweeted: “Dojo V1.0 is not done yet, probably another year away. Not only are the chips themselves difficult to develop, but so are energy efficiency and cooling.”
Cooling is difficult for the following reason. Given the standard 300 mm wafer size, each chip in Tesla’s Dojo design should be comparable to an RTX 3090: at least about 28 billion to 32 billion transistors per chip, with a single chip drawing about 250 to 300 W. With 25 chips in the array, the total power consumption comes to about 6,250 to 7,500 W. TSMC has also confirmed that InFO_SoW is designed for a maximum power consumption of about 7,000 W.
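The power estimate above is simple multiplication, and the sketch below reproduces it. The per-chip range of 250 to 300 W, the count of 25 chips, and the roughly 7,000 W design limit all come from the text. The function and constant names are illustrative only.

```python
def module_power_range(chips, watts_low, watts_high):
    """Return (min, max) total power in watts for `chips` identical chips."""
    return chips * watts_low, chips * watts_high

# Figures taken from the text: 25 chips at roughly 250-300 W each.
low, high = module_power_range(chips=25, watts_low=250, watts_high=300)
print(f"Estimated total power: {low} W to {high} W")

# The approximate InFO_SoW design limit cited in the text.
INFO_SOW_LIMIT_W = 7000
limit_in_range = low <= INFO_SOW_LIMIT_W <= high
print(f"Design limit falls inside the estimate: {limit_in_range}")
```

The cited 7,000 W limit sits inside the estimated range, which is why the per-chip estimate and the packaging limit are consistent with each other.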
A few months later, Musk added: “Dojo uses our own chips and a computing architecture optimized for neural network training, not GPU clusters. It may not be accurate, but I think Dojo is going to be the best supercomputer in the world.” Musk also said in Q1 2021: “Dojo is a supercomputer optimized for neural network training. We think Dojo will be the most efficient in the world in terms of video data processing speed.”
Musk actually mentioned Dojo as far back as Autonomy Day in 2019, describing it as a supercomputer that could use massive amounts of video data for “unsupervised” labeling and training. If you look closely at that 2019 event, you will see that Tesla’s development of the Dojo supercomputer and its own chips was both inevitable and planned; it is something Tesla has to do. In other words, Tesla did not set out to become an AI giant. It was forced to become one.
Why build Dojo?
Musk has in fact answered this question on Twitter, saying in essence: “Autonomous driving can only be solved if real-world AI is solved… Unless you have very strong AI and super computing power, there is no way… Everyone in the autonomous driving industry is well aware that countless edge scenarios can only be solved with real-world visual AI, because the entire world’s roads are built according to human cognition… Once you have an AI chip that solves these problems, everything else is just icing on the cake.”
If Tesla can build Dojo, it will be able to train on huge amounts of data with remarkable efficiency, solve the various “edge scenarios,” and speed up the maturing and refinement of the Autopilot system. More importantly, Tesla would then have a very high degree of vertical integration across its software and hardware. It would not only be free of dependence on outside suppliers but could also offer deep learning training services to outside customers.