
How is AI changing science?

2024-09-26


Text | Bai Ge

Editor: Wang Yisu

"the underlying logic of ai for science is different from the current training logic of large language models." lu jintan, technical director of deepin technology, said frankly that the logic of current artificial intelligence in the field of scientific research is different from the logic of large language models used in other industries.

It is well known in the industry that training large language models depends on being "fed" high-quality data: the more data, the stronger the model. In some areas of scientific research, however, data is scarce. For certain types of protein structures, for example, it can take several years to obtain a few hundred high-quality experimental data points.

This means that applying AI to scientific research requires achieving strong model performance with relatively little data.

So how is AI changing science, and how can a business be built around it? As one of the representative companies in AI for Science, DP Technology has offered its own thinking and solutions.

In traditional scientific computing, predicting the physical properties of molecules and atoms from their structural information usually requires microscopic calculations tailored to the practical problem. The industry still largely lacks the ability to perform such cross-scale computation and relies more on experience, judgment, and experimental verification.

At the same time, as the systems being modeled grow, the computation required by traditional scientific computing increases exponentially, often involving systems of tens of thousands or even hundreds of millions of atoms. If the calculations relied entirely on conventional physical models, the overall computing cycle could be very long.

"deep trends can make the model output calculation results close to the accuracy of the physical model, while greatly improving the computing performance." lu jintan said, "we use ai to fit these physical methods, making things that may have previously required a lot of calculations faster."

Take image recognition as an example: its core lies in analyzing the pixel information of a picture. By introducing convolutional neural networks, local features can be extracted, the original picture decomposed into different feature maps, and an approximate solution obtained from combinations of those features. This is, in effect, a dimensionality-reduction analysis.
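
As a concrete illustration of that analogy (a generic example, unrelated to the company's software), a single convolutional layer decomposes one image into a stack of smaller feature maps:

```python
import torch
import torch.nn as nn

# A toy single-channel "image" (e.g. a grayscale picture), batch of one.
image = torch.randn(1, 1, 28, 28)

# One convolutional layer extracts 8 local feature maps with 3x3 filters,
# and pooling shrinks each map: the picture is decomposed into a smaller
# set of feature maps rather than analyzed pixel by pixel.
features = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

out = features(image)
print(out.shape)   # torch.Size([1, 8, 14, 14]) -- 8 feature maps of 14x14
```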

AI plays a similar dimensionality-reducing role in science. Through its strong modeling capabilities, especially in early-stage work involving cross-scale calculations, it reduces computational complexity. Building on multimodal large-model capabilities, it can also analyze and make predictions across many data types, such as molecular structures, physical properties, and experimental data.

In drug discovery, for example, researchers usually first analyze the protein structure and target, then screen candidate compounds with high affinity for the target from a library of hundreds of thousands or even millions of compounds. The screening typically involves multi-dimensional evaluation, including affinity analysis and the prediction and assessment of drug-like properties such as toxicity, absorption, and metabolism.
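
A hedged sketch of such a multi-dimensional screen (the scoring functions below are random placeholders standing in for trained models): each candidate is scored on affinity and ADMET-style properties, and only compounds that pass every threshold survive.

```python
import random

random.seed(0)

# Placeholder predictors standing in for trained models (affinity, toxicity,
# absorption); a real pipeline would call docking engines and
# property-prediction models instead.
def predict_affinity(compound):   return random.uniform(0, 10)   # higher is better
def predict_toxicity(compound):   return random.uniform(0, 1)    # lower is better
def predict_absorption(compound): return random.uniform(0, 1)    # higher is better

library = [f"compound_{i}" for i in range(100_000)]

hits = [
    c for c in library
    if predict_affinity(c) > 8.0
    and predict_toxicity(c) < 0.2
    and predict_absorption(c) > 0.7
]

print(f"{len(hits)} candidates pass all filters out of {len(library)}")
```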

"in the field of ai for science, the key to large models supporting cross-scale computing lies in their large parameter scale and strong generalization ability. the large number of parameters of the model enables it to capture complex physical, chemical and biological phenomena, while the strong generalization ability helps the model to be flexibly applied across scientific problems in different fields," said lu jintan. "the basic model can predict the relevant properties of a drug based on its microstructure. when the model is transferred and applied to the field of materials, it can also predict the stability and other physical states of the material at different temperatures and pressures by analyzing the microstructure of the material."

DP Technology therefore understands the general base models of AI for Science mainly as a set of basic pre-trained models that can be fine-tuned and applied to solve problems across different industrial fields.
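
A minimal PyTorch sketch of that pattern (the backbone, dimensions, and data are stand-ins, not any specific DP Technology model): freeze a pretrained encoder and train only a small task head on the limited domain data.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained base model (in practice, loaded from a checkpoint).
backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
for p in backbone.parameters():
    p.requires_grad = False          # keep the pretrained knowledge fixed

# Small task-specific head, trained on the scarce domain data.
head = nn.Linear(256, 1)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

# A few hundred labeled domain samples is often all that is available.
x = torch.randn(300, 128)
y = torch.randn(300, 1)

for epoch in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(head(backbone(x)), y)
    loss.backward()
    opt.step()
```

Freezing the whole backbone is just the simplest variant; in practice parts of the pretrained model could also be unfrozen and fine-tuned jointly.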

The base models DP Technology is currently developing can achieve good results even with limited training data and can be further optimized and refined as higher-quality data becomes available. In this model system, AI learns basic scientific principles and reaches good performance with only a small amount of additional domain data, which is somewhat different from large language models.

Lu Jintan told Lightcone Intelligence, "In scientific computing, data sources are not widespread and there are few public datasets, so a large part of our work now is making model training perform better and better on small datasets."

Over the past two years, DP Technology has also launched a series of industry models, including the DPA molecular simulation model, the Uni-Mol 3D molecular conformation model, the Uni-Fold protein folding model, the Uni-RNA nucleic acid structure model, the Uni-Dock high-performance drug molecule docking engine, and the Uni-SMART multimodal large language model for scientific literature.

According to Lu Jintan, DP Technology currently has hundreds of models in the materials and medicine fields, and these models have been integrated into the company's product platforms. The company has also reached strategic partnerships with dozens of leading pharmaceutical companies and achieved a commercial breakthrough in 2023, with revenue exceeding 100 million yuan.

DP Technology's business now covers smart education in universities, biomedical R&D, and new battery materials.

However, by the levels currently used to classify AI for Science algorithms, the field as a whole is still at the L2 stage, where results approach experimental accuracy. It remains human-centered: models assist researchers by taking on calculations and easing the experimental burden.

At the L3 stage, AI can deliver results directly and, in some scenarios, replace human experiments outright.

To move from L2 to L3, "the main difficulty is that the accuracy of every link needs to reach a certain level. At the same time, integrating the algorithms across those links is also a big challenge," Lu Jintan said.

Looking ahead, Lu Jintan believes the market for AI for Science is large enough. Whether in education and research, biomedicine, or battery materials, adding AI, at least at the experimental level, can solve many fundamental problems and offer new ideas and entry points for empowering industries and frontier exploration.

The following is the detailed conversation between Lightcone Intelligence and Lu Jintan, technical director of DP Technology (edited by Lightcone Intelligence):

DP Technology uses AI to improve the quality and efficiency of scientific research

Q: Large models have changed natural language processing and image and video generation. How have they changed science?

A: Large language models have begun to be used in areas such as mining information from literature and patents; we call this the large literature model, and we have achieved some research results in this area. Besides using it to mine more specialized compound information, we also build multimodal applications such as interpreting figures and charts.

In traditional scientific computing, we often have to use different physical models to solve problems from the microscopic to the macroscopic scale, but capability is still relatively lacking in cross-scale computing scenarios. For example, to predict the macroscopic properties of molecules and atoms from their structural information, we need the ability to model across scales.

Artificial intelligence, including large models, can achieve this cross-scale modeling. By learning these physical models and applying them to specific problems, it can solve those problems well.

We usually need to run high-throughput calculations, often on systems with tens of thousands or even hundreds of millions of atoms. If the calculations are based purely on physical models, the time required is long. What DP Technology does here is enable the model to produce results close to the accuracy of the physical model while greatly improving computing performance.

Q: How are calculations on systems of hundreds of millions of atoms converted into calculations in the large-model domain? Roughly how much computation is required?

A: At the microscopic scale, the interaction between two atoms can be analyzed with physical models, for example by calculating the forces and trajectories between them with classical mechanics or the equations of quantum mechanics. The calculation only needs to consider the mutual influence of the two atoms, so the problem is relatively simple. As the number of atoms in the system grows, however, the situation becomes more complicated. When a third atom is introduced, in addition to the pairwise interactions between atoms, the many-body effects among the three must also be analyzed. The interactions and trajectories then depend not just on pairs of atoms but on the state of the whole system, and the amount of computation grows nonlinearly. Scientists usually introduce approximate methods, such as density functional theory or molecular dynamics simulation, to handle calculations at different scales efficiently.
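
To make the scaling concrete (simple combinatorics, not the interviewee's algorithms): the number of two-body terms grows as N(N-1)/2 and the number of three-body terms as N choose 3, which is why brute-force evaluation becomes infeasible long before hundreds of millions of atoms.

```python
from math import comb

# Count pairwise and three-body interaction terms for increasing system sizes.
for n_atoms in (10, 1_000, 100_000, 100_000_000):
    pairs    = comb(n_atoms, 2)   # two-body interactions
    triplets = comb(n_atoms, 3)   # three-body (many-body) terms
    print(f"{n_atoms:>11,} atoms: {pairs:.3e} pairs, {triplets:.3e} triplets")
```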

What we did with AI in the early days was essentially to fit these physical equations with AI to improve computing performance. It can be compared to image recognition, whose core is analyzing individual pixels. Once convolutional neural networks are added, a picture is broken down into feature maps and then solved approximately. That is essentially a dimensionality-reduction analysis. Our early work applying artificial intelligence to scientific computing can likewise be seen as dimensionality reduction: making computations that once required enormous effort run much faster.

Q: How does the calculation approach of traditional AI differ from that of large models?

A: The definition of a large model is fairly loose and is usually based on parameter count; more parameters mean more computation. For us it is more about providing multi-scale computation. Our current pre-trained model, Uni-Mol, predicts relevant physical properties from the three-dimensional structures of molecules and atoms, establishes structure-activity relationships, and solves for them directly. In the past, such predictions often relied on experiments and experience. This approach combines calculations at different scales and provides a new computational method for fields such as materials science.
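
A toy illustration of the structure-to-property idea (this is not Uni-Mol; the descriptor and coordinates are invented for the example): turn a 3D conformation into a rotation- and translation-invariant descriptor that a downstream regressor could map to a property.

```python
import numpy as np

def pairwise_distance_descriptor(coords):
    """Rotation- and translation-invariant summary of a 3D conformation:
    the sorted list of interatomic distances."""
    coords = np.asarray(coords)
    diff = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diff ** 2).sum(-1))
    iu = np.triu_indices(len(coords), k=1)
    return np.sort(dists[iu])

# Toy 4-atom conformation (coordinates are purely illustrative).
conformation = [[0.0, 0.0, 0.0],
                [1.5, 0.0, 0.0],
                [1.5, 1.5, 0.0],
                [0.0, 1.5, 0.5]]

descriptor = pairwise_distance_descriptor(conformation)
print(descriptor)   # this vector can feed a property-prediction regressor
```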

When we talk about large models, we generally emphasize generalization ability, and in AI for Science this is relatively universal. For example, a base model can predict certain drug-related properties from microscopic structure. If the same model is migrated to the materials field, the properties of interest may no longer be pharmaceutical ones but the material's state at different temperatures and pressures. So our understanding of general base models in AI for Science is more that of a set of basic pre-trained models that can be fine-tuned and applied to solve problems across industrial fields.

Q: What is the main role of multimodality?

A: It involves combining different types of data, such as molecular structure, physical properties, and experimental data, for comprehensive analysis. In drug discovery, for example, you usually first analyze the protein structure and target, then screen compounds with high affinity for the target from a library of hundreds of thousands or even millions of compounds. The screening may include affinity analysis, medicinal chemistry property analysis, toxicity, whether a compound is readily absorbed by the human body, and so on: a multi-dimensional analysis. To screen well, you have to analyze from multiple angles and across multiple properties.

The more familiar multimodal problems, such as images and video, are closer to our multimodal applications in literature mining. In a paper, for instance, we need to read not only the text but also the figures, mine the image information in depth, integrate it with the text, and output the result. We apply this kind of general multimodal capability to the literature as well.

Q: How great is the demand for data in the field of AI for Science?

A: It varies by field; of course, the more the better. There is also the problem of data being hard to obtain. For example, the difficulty of obtaining data differs between sub-applications in biomedicine and in batteries. In industries with long R&D and verification cycles, data output is relatively small and the absolute amount of data is limited. For certain types of protein structures, there may be only a few hundred examples after several years, while other fields certainly have more.

However, basic physical models can generate more data. The base models we are currently developing can still achieve good results when trained on limited data, and can be optimized and corrected with the higher-quality data obtained later. In our model system, we let AI learn the underlying scientific principles itself and reach good results with a small amount of domain data for training, which is somewhat different from large language models.

Q: How can we let AI learn basic scientific logic and then solve specific application problems?

A: Generally, we run calculations directly with physical models, use the resulting data for training, and then have the trained model emulate the physical model.
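
A compressed sketch of that loop, under the assumption of a toy physics function (a harmonic energy standing in for DFT or molecular dynamics): sample configurations, label them with the physical model, train a regressor on the labeled data, then use the regressor where the physical model would be too slow.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Step 1: sample configurations and label them with a physical model
# (here a toy harmonic energy; in practice DFT, MD, or similar).
def physical_model(x):
    return 0.5 * (x ** 2).sum(axis=1)

X = rng.uniform(-2, 2, size=(1000, 3))
y = physical_model(X)

# Step 2: train an ML model on the physics-generated data.
surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
surrogate.fit(X, y)

# Step 3: use the trained model in place of the physical model.
X_new = rng.uniform(-2, 2, size=(5, 3))
print(surrogate.predict(X_new))
print(physical_model(X_new))   # reference values for comparison
```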

Q: What is the relationship between the base model and the vertical models? Do you train the base model yourselves or use third-party open-source models?

A: It differs by scenario. If you mean large language models, they are used mostly for literature interpretation, such as reading papers. For the basic task of interpreting a single paper, we will, for cost reasons, use general-purpose large models to help. If we want to interpret many papers, or run searches across our large paper library, including patent search and analysis, we use our own literature model for more detailed interpretation.

So we are still building products for users: we see which model suits the product better and choose based on cost.

Many of our models are what we call pre-trained models. For example, DPA, which we released last year, is a pre-trained model for the potential functions between atoms of different elements. We also recently launched the OpenLAM large atomic model project, hoping to mobilize open-source contributors to share data and make the model's training more mature.

Q: How many models does DP Technology have now?

A: Across materials and medicine combined, we now have hundreds of models.

Revenue over 100 million yuan and partnerships with dozens of pharmaceutical companies: DP Technology's business model

Q: Could you share DP Technology's latest R&D progress in AI for medicine?

A: In medicine we currently focus on preclinical research, covering almost all preclinical computing scenarios, such as early target discovery, protein structure analysis, target analysis, molecular screening, affinity analysis, and property prediction. This chain involves many computational methods, and we now have practical algorithms for them.

Combining these with the medical scenario, we packaged all of the algorithms into a product, our drug design platform Hermite. We now work with essentially the top 50 domestic pharmaceutical companies across different segments, mainly spanning three types, biotech, CRO (contract research organization), and pharma, each with its own representative companies.

Last week we signed a cooperation agreement with East Sunshine, a domestic listed company that makes the influenza drug oseltamivir and has just obtained three first certifications in the United States. We will next cooperate with them on target-related work.

Besides typical biomedical companies like East Sunshine, we also work with many research institutions and universities engaged in drug R&D, such as West China University of Medical Sciences and Xiangya Hospital and its medical college.

Q: Your current products can be used directly in a browser, and overall deployment is very lightweight. Is this true of how all the core products are deployed?

A: Yes, most of the work we perform online is AI inference; training is usually done offline, so the amount of data transmitted is not that large. There are also small-scale training scenarios, mostly fine-tuning of pre-trained models, which can be done with small batches of data, so the data-transfer pressure is also small. Lightweight deployment does not mean we use little computing power: behind the scenes the system uses hybrid cloud and HPC resources, but it is packaged so that users access it through a browser. For certain private scenarios we also deploy the underlying computing system on-premises, which is not necessary for SaaS.

Generally speaking, large enterprises need private deployment because they have very high requirements for data privacy. In scenarios such as teaching or research institutes, data may only be used temporarily for a particular project and does not need to be privately deployed.

Q: How are the actual projects with your current partner companies progressing? What stage are they at?

A: Our cooperation with pharmaceutical companies is mainly on the computational side. We do not make drugs ourselves, so we do not take part in the drug-making side of pharmaceutical companies' work.

We basically cover the whole chain. We are also trying and exploring new areas, such as integrating our software with automated laboratories focused on hardware R&D, so that together we can serve more companies, because pharmaceutical companies' needs are numerous and complex.

As for revenue, ours exceeded 100 million yuan last year.

Our cooperation with pharmaceutical companies follows two business models: selling software and joint R&D.

Many large companies deploy locally; they can afford their own teams and have the budget to buy software. But there are also mid-sized or newly founded innovative pharma companies that lack advanced tools and the talent to use them. They choose to co-develop with us, and we help them do more of the computation. Because this involves data and information security, the two sides work especially closely together.

Many large pharmaceutical companies, by contrast, have ample funds and talent, and even hope that beyond SaaS we can provide them with customized services.

The future of AI for Science

Q: The current algorithm levels look very similar to the five levels of autonomous driving. What can be achieved at the L2 stage? To what extent can it replace the previous experimental mode? Can you give a specific scenario?

A: At L2 we are more concerned with approaching experimental accuracy, and the approach is still human-centered: computation assists people and reduces experimental pressure. Because the systems in drug design differ widely, we have reached accuracy close to experiment in some of them. That does not mean users no longer need experiments at all, but we can help with more of the groundwork, such as molecular screening. There may be a million candidate drug compounds, and AI can help screen out hundreds of thousands of them; the rest may still need to be verified by researchers, but the scale of experiments is greatly reduced.

Q: What are the difficulties in moving from L2 to L3?

A: My understanding of L3 is that AI can deliver results directly, which amounts to replacing human experiments in some scenarios. The difficulty of reaching L3 lies mainly in getting the accuracy of every link up to a certain level. In addition, many algorithms are involved, and integrating them is also a hard problem. Integrated, the algorithms resemble a complete workflow system that can continuously backtrack and optimize itself.

Q: Has the overall technology iterated from the past to now? Are there any bottlenecks in the current models' development?

A: At present we mainly iterate the algorithms based on data, especially for products users rely on heavily, where iteration is faster. For example, our DPA model has been upgraded from generation 1 to generation 2: generation 1 supported pre-training within a single field, while generation 2 can train in parallel on datasets labeled under different annotation conventions.
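
A rough sketch of that second-generation idea (not the actual DPA-2 architecture; the layer sizes and dataset names are invented): a shared backbone with one output head per dataset, so data labeled under different conventions can be trained on in parallel.

```python
import torch
import torch.nn as nn

# Shared backbone learns representations common to all datasets.
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))

# One lightweight head per dataset, absorbing each dataset's labeling convention.
heads = nn.ModuleDict({
    "dataset_a": nn.Linear(64, 1),
    "dataset_b": nn.Linear(64, 1),
})

params = list(backbone.parameters()) + list(heads.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

# Toy batches standing in for two differently annotated datasets.
batches = {
    "dataset_a": (torch.randn(64, 32), torch.randn(64, 1)),
    "dataset_b": (torch.randn(64, 32), torch.randn(64, 1)),
}

for step in range(200):
    opt.zero_grad()
    loss = sum(
        nn.functional.mse_loss(heads[name](backbone(x)), y)
        for name, (x, y) in batches.items()
    )
    loss.backward()
    opt.step()
```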

The bottleneck mainly comes from data. In scientific computing, data sources are not extensive and there are few public datasets, so a large part of our work now is making model training perform better and better on small datasets.

Another issue that needs extra attention is interpretability. Scientific computing is more rigorous and demands higher interpretability, so we are now trying to improve the models' interpretability by exposing parameters, reasoning paths, and so on.

Q: How do you solve the problem of data scarcity?

A: In AI for Science, whether in materials or medicine, the most basic physical principles at the microscopic level are the same, so one advantage is that some data from the materials field can be reused directly in medicine. DPA-2, for example, can help users train a unified model on data labeled under different standards; when that model is applied in industry, it can be fine-tuned with a small amount of data and put to use.

We have not yet entered the consumer market, but our system already covers some teaching scenarios. We have a scientific research platform that integrates teaching, research, and practical use. Our main customers are colleges and universities, along with some individual users. For universities, we have a teaching and training platform that supports everything from the full teaching process to students' classes, hands-on use, and even the translation of research results into practice.

Q: What is the future market space for AI for Science?

A: I think the market is large enough. Whether in scientific research, medicine, or materials, adding AI can genuinely solve problems, at least at the experimental level, help researchers improve experimental outcomes, and lighten the experimental workload.

In terms of overall customer acceptance, part of our cost in many scenarios is educating users. In the pharmaceutical field, for example, we build long-term partnerships with customers because we need to accompany them through the entire verification cycle.

By comparison, the materials field moves much faster. Battery R&D cycles, for example, are quite short: if AI can predict the effect of an electrolyte formulation, preparation and verification can follow very quickly.

At the national level, the Ministry of Science and Technology, together with the National Natural Science Foundation of China, has launched a special deployment program for AI for Science. This further shows that from the socioeconomic level to national macro policy there is optimism and strong support; it is without doubt a direction for the future.

Q: AI for Science is still in its early stages. What stage will it reach in the next three years?

A: I think that, at the very least, customers will reach a shared understanding of it. Everyone has begun to actively embrace AI and understand it more deeply; no industry will find the term unfamiliar or off-putting, and attitudes are fairly positive. The next step is how we establish co-creation partnerships with customers, since this is a data-sensitive industry. At the three-year mark, we also hope to have helped customers land some practical application scenarios.

In fact, I think that if the value proposition is made a bit clearer, customers will be very receptive, because overall, whether pharmaceutical companies or the new-energy players mentioned earlier, everyone is investing more and more in innovation. We also hope to help innovate the whole scientific research paradigm, including research infrastructure and the various application scenarios above it, connecting them through our research platform and then empowering different industries.