Deep learning workloads demand enormous volumes of tensor arithmetic. To support real-time execution, memory and processor performance must exceed that of standard software-driven architectures by as wide a margin as possible. This demand has led to designs built around dedicated hardware accelerators that perform tensor arithmetic in parallel, deeply pipelined fashion. To keep those pipelines from stalling, data must arrive in the right place, at the right time, and in the right format. Dedicated data orchestration hardware prevents accelerator pipeline stalls and so sustains maximum efficiency.
Data orchestration encompasses the pre-processing and post-processing operations that deliver data to the machine learning engine at the best possible rate and in the format most suitable for efficient processing. These operations range from resource management and scheduling, through I/O adaptation, transcoding, conversion and sensor fusion, to data compression and rearrangement in shared memory arrays. How these functions are deployed depends on the performance and cost requirements of the target application. For most scenarios, however, a programmable logic platform optimized for data ingestion, transformation and transmission provides the best data orchestration strategy for a machine learning accelerator.
Deep learning places enormous pressure on computing hardware. The shift to dedicated accelerators offers chip technology a way to keep pace with advances in artificial intelligence, but these units alone cannot meet the demand for higher performance at lower cost.
It is understandable that integrated circuit (IC) suppliers and systems companies have focused on the raw performance of their matrix and tensor processing arrays. At peak throughput, these architectures can readily achieve performance levels measured in trillions of operations per second (TOPS), even in systems designed for edge computing. Understandable as it is, the focus on peak TOPS risks leaving hardware underutilized whenever processing stalls because data is unavailable or must first be converted to the format each model layer requires.
The system must compensate for network and storage latency, ensure that data elements are in the right format and the right place, and move them in and out of the AI accelerator at a consistent rate. Data orchestration provides a way to guarantee that data format and placement are correct on every clock cycle, maximizing system throughput.
Given the complexity of a typical artificial intelligence deployment, whether it sits in a data center, an edge computing environment or a real-time embedded application such as an advanced driver assistance system (ADAS), the data orchestration engine must handle many tasks, including:
Scheduling and load balancing across multiple vector units
Packet checking for data corruption, such as corruption caused by a failing sensor
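As a concrete illustration of the packet-checking task, the sketch below shows a CRC-based integrity check in Python. The packet format (payload followed by a 4-byte CRC-32 trailer) is a hypothetical example, not a specific standard; on an orchestration device this check would run as fixed-function logic at line rate rather than as software.

```python
import zlib

def frame_packet(payload: bytes) -> bytes:
    """Append a CRC-32 checksum to a sensor payload (hypothetical packet format)."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")

def check_packet(packet: bytes) -> bool:
    """Verify the trailing CRC-32; a mismatch flags corruption (e.g. a failing sensor)."""
    payload, received = packet[:-4], int.from_bytes(packet[-4:], "big")
    return zlib.crc32(payload) == received

good = frame_packet(b"\x01\x02\x03\x04")
bad = bytearray(good)
bad[0] ^= 0xFF                     # simulate a corrupted payload byte
print(check_packet(good), check_packet(bytes(bad)))  # True False
```

Corrupted packets detected this way can be dropped or flagged by the orchestration engine before they ever reach the accelerator pipeline.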
Although these functions could be implemented by adding data-control and exception-handling hardware to the core processing array, the wide variety of operations that may be required, and the growing need for flexibility as artificial intelligence models evolve, mean that hardwiring them into the core accelerator chip risks being an expensive, short-lived option. For example, encryption support is rapidly becoming a requirement in some environments to guarantee data security, yet different encryption strengths may be applied depending on the sensitivity of each layer's data. A fixed-architecture solution risks being unable to adapt to such changing needs.
One possible approach is to use a programmable microprocessor to control the data flow through the accelerator. The problem is that software execution simply cannot keep pace with the accelerator hardware. A more hardware-centric approach to data orchestration lets the accelerator design focus entirely on core pipeline efficiency: external data orchestration handles all memory and I/O management so that operands and weights flow without interruption. Because the orchestration engine must absorb revisions to both the application and the model design, hardwired logic is not appropriate here either. Programmable logic supports modification and avoids the risk of an orchestration engine that cannot be updated.
In principle, a field-programmable gate array (FPGA), which combines distributed memory, arithmetic units and lookup tables into composable functions, is well suited to the real-time reorganization, remapping and memory management that streaming data in AI-driven applications requires. An FPGA supports the creation of customized hardware circuits that sustain the intensive data flows of a deeply pipelined AI accelerator, while letting users change the implementation as needed to accommodate new architectures. However, the performance requirements of data orchestration call for new approaches to FPGA design.
Application scenarios for data orchestration
Many different data orchestration architectures appear across data center, edge computing and embedded system deployments. In a data center environment, for example, multiple accelerators may be deployed on a single model, with their data throughput managed by one or more data orchestration engines.
Inference systems need data orchestration to keep every worker engine fully utilized, avoid bottlenecks, and process incoming data samples as quickly as possible. Distributed training adds the requirement of rapid neuron-weight updates: each update must reach the other worker engines handling the relevant parts of the model as soon as possible to avoid stalls.
Data orchestration logic in an FPGA supports a wide range of weight-distribution and synchronization protocols, enabling efficient operation and relieving the accelerator itself of data organization work. The figure below shows one possible implementation, in which a single FPGA device manages multiple AI engines on the same circuit board. With a suitable low-overhead communication protocol, each machine learning application-specific integrated circuit (ASIC) needs no memory controller of its own. Instead, the data orchestration engine organizes all weights and data elements in local memory and simply streams them to each ASIC it manages in the appropriate order. The result is high performance at lower overall cost, achieved by eliminating duplicated memory and interface logic.
With data orchestration hardware, performance can be improved further without added cost. One option is to compress data crossing the network or system bus, avoiding the need for more expensive interconnect. The gate-level programmability of an FPGA supports compression and decompression at the network interface. Data orchestration hardware can also apply forward error correction protocols to keep data moving reliably at full pipeline speed. In most designs corruption events are rare, but without external error-correction support, recovery is costly for a deeply pipelined accelerator design.
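To make the compression idea concrete, the sketch below shows a lossless round trip over a hypothetical block of quantized weight data, using Python's zlib as a stand-in for whatever compression scheme the link hardware would implement. The weight pattern is invented for illustration; the point is only that repetitive quantized data shrinks substantially on the wire.

```python
import zlib

# Hypothetical block of quantized weights with many repeated values --
# quantized tensors are often highly compressible.
weights = bytes([0, 0, 0, 7, 7, 7, 7, 1] * 512)

compressed = zlib.compress(weights, level=6)   # what would cross the bus
restored = zlib.decompress(compressed)         # expansion on the far side

assert restored == weights                     # lossless round trip
print(f"{len(weights)} bytes -> {len(compressed)} bytes on the link")
```

The same trade applies in hardware: a small amount of (de)compression logic at each end of the link buys back interconnect bandwidth that would otherwise require faster, costlier wiring.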
The format and structure of individual data elements offer another important opportunity for data orchestration, because source data must usually be re-expressed in a form suited to feature extraction by a deep neural network (DNN).
In image recognition and classification applications, pixel data is usually channelized so that each color plane can be processed separately before the results are aggregated through pooling layers that extract shapes and other high-level information. Channelization helps identify edges and other features that are hard to detect in a combined RGB representation. Speech and language processing perform even more extensive conversions, mapping data into forms a DNN can handle more readily. Rather than processing ASCII or Unicode characters directly, the words and subwords a model consumes are converted into vector and one-hot representations. Similarly, speech data may not be presented as raw time-domain samples but converted into a joint time-frequency representation, so that the early DNN layers can recognize important features more easily.
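Two of the conversions above can be sketched in a few lines of Python: splitting interleaved RGB pixels into separate color planes, and mapping tokens to one-hot vectors over a fixed vocabulary. The function names and the tiny vocabulary are illustrative assumptions; in a real pipeline these transforms would run at line rate in orchestration logic.

```python
def channelize(rgb_pixels):
    """Split interleaved (R, G, B) pixels into three separate color planes."""
    r = [p[0] for p in rgb_pixels]
    g = [p[1] for p in rgb_pixels]
    b = [p[2] for p in rgb_pixels]
    return r, g, b

def one_hot(tokens, vocab):
    """Map each token to a one-hot vector over a fixed (hypothetical) vocabulary."""
    index = {word: i for i, word in enumerate(vocab)}
    return [[1 if index[t] == i else 0 for i in range(len(vocab))]
            for t in tokens]

planes = channelize([(255, 0, 0), (0, 255, 0)])        # ([255, 0], [0, 255], [0, 0])
vectors = one_hot(["cat", "sat"], vocab=["cat", "mat", "sat"])
# [[1, 0, 0], [0, 0, 1]]
```

Each transform reshapes data without changing its information content, which is exactly the kind of work best kept off the tensor engine.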
Although these conversions could be performed by the arithmetic kernels inside the AI accelerator, they are a poor fit for a tensor engine. The nature of reformatting work makes it well suited to FPGA-based modules, which can perform the conversions at line rate without the latency of running software on a general-purpose processor.
In real-time and embedded applications involving sensors, preprocessing the data brings further benefits. For example, although a DNN can be trained to tolerate noise and changing environmental conditions, its reliability improves when front-end signal processing denoises or normalizes the data first. In an ADAS implementation, the camera system must cope with changing lighting conditions. Typically, the sensor's high dynamic range can be exploited through brightness and contrast adjustment, and the FPGA can perform the necessary operations to present the DNN with a more uniform pixel stream.
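A minimal software sketch of such a contrast adjustment is shown below: a linear stretch of each frame's intensity values to the full 8-bit range. The per-frame min/max approach is one simple choice among many (histogram equalization is another); in orchestration hardware this would be a fixed-function stage applied to every frame.

```python
def normalize_frame(pixels, lo=0, hi=255):
    """Stretch pixel intensities to the full [lo, hi] range (contrast normalization)."""
    p_min, p_max = min(pixels), max(pixels)
    if p_max == p_min:
        return [lo] * len(pixels)          # flat frame: nothing to stretch
    scale = (hi - lo) / (p_max - p_min)
    return [round(lo + (p - p_min) * scale) for p in pixels]

dim_frame = [40, 50, 60, 70]               # low-contrast night-time capture
print(normalize_frame(dim_frame))          # [0, 85, 170, 255]
```

The DNN then sees frames with a consistent intensity distribution regardless of the original lighting, which is precisely the "less variable pixel stream" described above.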
Sensor fusion is an increasingly important aspect of ADAS design and helps improve the performance of the end system. Because environmental conditions can make any single sensor's data hard to interpret, AI models must draw effectively on inputs from many different sensor types, including cameras, lidar and radar.
Format conversion is crucial here. Lidar, for example, reports the depth of target objects in Cartesian space, while radar operates in a polar coordinate system. Transforming one coordinate space into the other lets many models fuse the sensors more easily. Similarly, image data from multiple cameras must be stitched together and projectively transformed so that the most useful information reaches the AI model.
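The polar-to-Cartesian step is simple enough to show directly. The sketch below converts hypothetical radar detections, each given as (range, azimuth), into (x, y) points that can be overlaid on lidar data in the same frame; the detection tuples and function name are assumptions for illustration.

```python
import math

def radar_to_cartesian(detections):
    """Convert radar (range_m, azimuth_rad) detections to Cartesian (x, y)
    so they share a coordinate frame with lidar points."""
    return [(r * math.cos(theta), r * math.sin(theta)) for r, theta in detections]

points = radar_to_cartesian([(10.0, 0.0), (10.0, math.pi / 2)])
# (10, 0): target straight ahead; (~0, 10): target directly to the side
```

Per-detection trigonometry like this maps naturally onto FPGA arithmetic blocks (e.g. CORDIC or lookup-table implementations), keeping the conversion out of the tensor engine's critical path.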
Lower-level transformations are needed as well. Automotive original equipment manufacturers (OEMs) buy sensor modules from different suppliers, each of which interprets the interconnect communication standards in its own way. Some function is therefore needed to parse the packets these sensors send over the in-vehicle network and convert the data into a standard format the DNN can process. For security, modules must also authenticate themselves to the ADAS unit and, in some cases, send encrypted data. A data orchestration chip can offload the decryption and format-conversion work from the AI accelerator engine.
Further optimization comes from using front-end signal processing in the data orchestration subsystem to discard unnecessary data. For example, the functions handling input from microphones and other one-dimensional sensors can suppress blocks containing only silence or low-level background noise, and the number of video frames forwarded can be reduced while the vehicle is stationary, lowering the load on the AI engine.
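A simple form of that silence suppression can be sketched as an RMS-energy gate. The threshold value and block structure below are illustrative assumptions; a hardware front end would apply the same test per sample block before forwarding anything downstream.

```python
import math

def is_silent(samples, threshold=0.01):
    """Return True when a block's RMS energy falls below a (hypothetical) threshold,
    so the front end can drop it instead of feeding it to the model."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms < threshold

blocks = [[0.0001, -0.0002, 0.0001],   # background hiss -> dropped
          [0.3, -0.5, 0.4]]            # actual signal   -> forwarded
kept = [b for b in blocks if not is_silent(b)]
print(len(kept))  # 1
```

Dropping dead air (or static video frames) this early means the accelerator spends its cycles only on data that can actually change the model's output.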
The rapid development of deep learning has put great pressure on the hardware architectures needed to deploy the technology at scale. While the industry's attention to peak TOPS figures reflects a recognition that performance is an absolute requirement, intelligent data orchestration and management strategies provide a path to cost-effective, energy-efficient systems.
Data orchestration spans many pre-processing and post-processing operations, from resource management and scheduling, through I/O adaptation, transcoding, conversion and sensor fusion, to compression and rearrangement in shared memory arrays. An orchestration engine may implement only a subset of these functions, depending on the core requirements of the target machine learning architecture.
The Achronix Speedster7t FPGA architecture provides a highly flexible platform for these data orchestration strategies. It combines high throughput, low latency and high flexibility, and its data-movement capabilities allow even a highly specialized accelerator to adapt to changing needs. In addition, the extensive logic and arithmetic resources of the Speedster7t FPGA, together with its high-throughput interconnect, allow front-end signal conditioning and back-end machine learning to be designed as a whole, maximizing overall efficiency.