On the afternoon of September 26, the 2021 World Internet Conference was held in Wuzhen. At the data and algorithm forum, Academician Zhang Yaqin, Dean of the Institute for AI Industry Research (AIR) at Tsinghua University, spoke on the theme “AI-Enabled Life Science”. He described how digitization and intelligence are transforming the biological world, and shared AIR's new plans for interdisciplinary work at the intersection of artificial intelligence and life and health.
With advances in gene sequencing, high-throughput biological experiments, sensors and other technologies, life science and biomedicine are entering the Digital 3.0 era, and digitization and automation are accelerating. Health computing, a new model of intelligent scientific computing, follows the fourth research paradigm, with artificial intelligence and data-driven methods at its core. It will greatly help humanity explore and solve problems in life and health.
From the 1950s to today, artificial intelligence has produced many different algorithms, notably deep learning, represented early on by RNNs, LSTMs and CNNs, and more recently by GANs, Transformer-based models such as BERT and GPT-3, and pre-trained models. In perception tasks such as speech recognition, face recognition and object classification, AI can be said to have reached human-level performance. However, large gaps remain in natural language understanding, knowledge reasoning, video semantics and generalization. There are also major challenges in algorithmic transparency, interpretability, causality, security, privacy and ethics.
Recently, there has been a lot of progress in trustworthy AI computing. One example is federated learning, which is also an important research topic at the Institute for AI Industry Research (AIR) at Tsinghua University. There are two main federated learning schemes. The first is horizontal federated learning, which targets settings where data from different sources share the same features and the same model, and which protects privacy between data sources within a single modality. The second is vertical federated learning, which handles sources with different features and different models, and which protects the privacy of multimodal data.
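As a minimal sketch of the horizontal setting, the example below uses federated averaging (FedAvg), a common technique that the talk itself does not name. It assumes three hypothetical clients that each hold private samples of the same one-feature linear relationship. Only model parameters leave a client, never raw data. The function names, learning rate and synthetic data are illustrative assumptions.

```python
import random


def local_update(w, b, data, lr=0.05, epochs=20):
    """Gradient descent for y ~ w*x + b using only one client's private data."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_average(client_datasets, rounds=50):
    """Server loop: broadcast the global model, then average the returned parameters."""
    w, b = 0.0, 0.0
    total = sum(len(d) for d in client_datasets)
    for _ in range(rounds):
        # Each client trains locally; only (w, b) and the sample count are shared.
        updates = [(local_update(w, b, d), len(d)) for d in client_datasets]
        w = sum(wi * n for (wi, _), n in updates) / total
        b = sum(bi * n for (_, bi), n in updates) / total
    return w, b


def make_client_data(n, seed):
    """Synthetic private data (assumption): y = 3x + 1 plus small noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.01)))
    return data


clients = [make_client_data(40, seed=s) for s in (1, 2, 3)]
w, b = federated_average(clients)
print(f"global model: w={w:.3f}, b={b:.3f}")
```

Because every client here has the same feature (x) and trains the same model, this corresponds to the horizontal scheme. A vertical scheme would instead split the features of the same samples across parties and needs a different protocol.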
We have seen AI accelerating progress in life science, health and biomedicine toward faster, more accurate, safer, more economical and more inclusive outcomes. Specifically, AI research in protein structure prediction, CRISPR gene editing, antibody, TCR and personalized vaccine development, precision medicine and AI-assisted drug design has become an international frontier and a strategic research focus.
Given this disciplinary trend and industrial background, the Institute for AI Industry Research (AIR) at Tsinghua University has laid out four research directions under “AI + Life and Health”: “AI-enhanced personal health management and public health”, “AI + medical and life sciences”, “AI-assisted drug research and development” and “AI + gene analysis and editing”.
Working across domains, AIR recognizes that a large knowledge gap separates artificial intelligence from life science and biomedicine, and that biological computing lacks data sets, AI platforms, core algorithms and computing engines. People who can work across both fields are also scarce. In response, AIR proposed the “AI + Life Science Wall-Breaking Plan”, which aims to define the core frontier research tasks in AI + life science, bridge the gap between life and health and artificial intelligence, break down the barriers between them, promote deep integration of AI and life science, and accelerate scientific discovery.
To that end, we need to build AI infrastructure, data platforms and core algorithm engines for life science to support its frontier research tasks. At the same time, we will build flagship public data sets, organize algorithm challenges, build an AI + life science intelligence platform, cultivate interdisciplinary talent and foster an industrial ecosystem.
AlphaFold2 is a typical success story for AI + life science. Its success comes from two factors. The first is the nature of the task. Protein structure prediction can be regarded as a one-to-one mapping from sequence to three-dimensional structure, which makes it a well-defined AI problem. This is exactly the goal of the Wall-Breaking Plan: to find research tasks that matter greatly to life science and can be abstracted into problems suited to AI.
The second is the strength of the model. Long-term research in the life sciences has accumulated large-scale protein structure data, and the AlphaFold2 architecture makes full use of that data through a data-driven, end-to-end deep learning model. This combination of big data and deep models is the hallmark of the fourth paradigm. The lesson of AlphaFold2 is therefore that research in AI + life science should attend both to wall breaking and to the fourth paradigm.
Clearly, AlphaFold2 is just the beginning, and its success is opening up a new model of research. Accurate protein structure prediction not only gives life scientists efficient computational tools, but also makes major AI-based discoveries in life science possible. In the future, antibody-antigen epitope prediction, precision tumor therapy, and the design and optimization of TCRs and personalized vaccines will become important research hotspots, and breakthroughs will come under the new AI-driven computing model. The golden age of AI + macromolecular drug development is about to arrive.
Many new scientific challenges will arise along the way, and they point to new computing paradigms such as a closed-loop framework that integrates dry (computational) and wet (laboratory) experiments. On the one hand, AI models will become more capable through closed-loop validation and additional data from high-throughput, multi-round wet experiments. On the other hand, through active learning or reinforcement learning, AI will actively plan automated wet experiments, forming a dry-wet loop of validation and iteration that accelerates life science discovery and industrial application. We foresee that this dry-wet closed loop will bring a new research paradigm and a new industrial model to life science research and the biomedical industry.
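The dry-wet loop described above can be sketched as a toy active learning cycle. Everything here is an assumption made for illustration: `wet_experiment` is a hypothetical stand-in for a laboratory assay, the "model" is a simple nearest-neighbor lookup, and the planner uses uncertainty sampling (it always picks the untested design farthest from any measured one). A real system would use a learned model and real instruments.

```python
def wet_experiment(design):
    """Hypothetical lab assay: returns a measured response for a design index."""
    return -((design - 40) ** 2) / 100.0


def nearest_measured(design, measured):
    """Closest design that has already been tested in the wet lab."""
    return min(measured, key=lambda m: abs(m - design))


def predict(design, measured):
    """Dry-side model: reuse the response of the nearest measured design."""
    return measured[nearest_measured(design, measured)]


def uncertainty(design, measured):
    """Distance to the nearest measured design, used as a crude uncertainty."""
    return abs(design - nearest_measured(design, measured))


def max_model_error(candidates, measured):
    """Worst-case prediction error; computable only because the assay is simulated."""
    return max(abs(predict(c, measured) - wet_experiment(c)) for c in candidates)


def closed_loop(candidates, seed_designs, rounds):
    """Each round: the model plans one experiment, the lab runs it, the data is fed back."""
    measured = {d: wet_experiment(d) for d in seed_designs}
    for _ in range(rounds):
        untested = [c for c in candidates if c not in measured]
        next_design = max(untested, key=lambda c: uncertainty(c, measured))
        measured[next_design] = wet_experiment(next_design)
    return measured


candidates = list(range(65))
before = max_model_error(candidates, {d: wet_experiment(d) for d in (0, 64)})
measured = closed_loop(candidates, seed_designs=(0, 64), rounds=7)
after = max_model_error(candidates, measured)
print(f"max model error before: {before:.2f}, after: {after:.2f}")
```

The point of the sketch is the feedback structure: each wet result changes what the model is uncertain about, which in turn changes the next experiment it requests.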
AIR has already made preliminary progress in the representation and prediction of gene data. Recently, the GeneBERT team, led by Professor Lan Yanyan of the Institute for AI Industry Research (AIR) at Tsinghua University, designed a novel gene pre-training model. By constructing a two-dimensional matrix between sequences and transcription factors, the team built a multimodal pre-training model that learns effective representations of gene data and, in particular, draws out the value of data from non-coding regions. The model greatly improved performance on downstream tasks such as promoter prediction, transcription factor binding site prediction, and gene screening for Hirschsprung’s disease. We believe that continued, in-depth application of cutting-edge AI techniques such as pre-training to genetic data will further unlock its value, help us crack the human genetic code, and contribute to important problems such as precision cancer treatment.
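To make the idea of a sequence-by-transcription-factor matrix concrete, here is a toy sketch. It is not the GeneBERT team's method: the motifs, the window size and the scoring rule (best fraction of matching bases) are all hypothetical choices, used only to show what a two-dimensional matrix with sequence windows as rows and transcription factors as columns might look like.

```python
def best_match_score(window, motif):
    """Highest fraction of matching bases over all alignments of motif in window."""
    best = 0.0
    for start in range(len(window) - len(motif) + 1):
        segment = window[start:start + len(motif)]
        matches = sum(1 for a, b in zip(segment, motif) if a == b)
        best = max(best, matches / len(motif))
    return best


def sequence_tf_matrix(sequence, motifs, window_size):
    """Rows are non-overlapping sequence windows, columns are transcription factors."""
    windows = [
        sequence[i:i + window_size]
        for i in range(0, len(sequence) - window_size + 1, window_size)
    ]
    tf_names = sorted(motifs)
    matrix = [[best_match_score(w, motifs[tf]) for tf in tf_names] for w in windows]
    return windows, tf_names, matrix


# Hypothetical motifs and sequence, chosen only for illustration.
toy_motifs = {"TF_A": "TATA", "TF_B": "GGCC"}
toy_sequence = "ACTATAAC" + "TTGGCCAA" + "ACACACAC"

windows, tf_names, matrix = sequence_tf_matrix(toy_sequence, toy_motifs, window_size=8)
for window, row in zip(windows, matrix):
    print(window, dict(zip(tf_names, row)))
```

A matrix of this shape gives a model a second view of the same region alongside the raw sequence, which is one plausible reading of the "multimodal" pre-training described above.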
In conclusion, we believe the biological world is undergoing a new transformation toward digitization, automation and intelligent scientific computing. Using computational methods, that is, the fourth research paradigm driven by artificial intelligence and data, to help people explore and solve life and health problems has become an important research direction. Going forward, academia and industry need to work together to move life science, biomedicine, genetic engineering and personal health from isolated, open-loop work to collaborative, closed-loop work, achieving faster, more accurate, safer, more economical and more inclusive innovation in life science and biomedicine. This represents a huge new opportunity for scientific development and industrial innovation over the next decade.
We earnestly call on more people to pay attention to, support or devote themselves to the development of this emerging interdisciplinary field.