

Protein Science Meets Artificial Intelligence: A Systematic Review and a Biochemical Meta-Analysis of an Inter-Field.

Author information

Villalobos-Alva Jalil, Ochoa-Toledo Luis, Villalobos-Alva Mario Javier, Aliseda Atocha, Pérez-Escamirosa Fernando, Altamirano-Bustamante Nelly F, Ochoa-Fernández Francine, Zamora-Solís Ricardo, Villalobos-Alva Sebastián, Revilla-Monsalve Cristina, Kemper-Valverde Nicolás, Altamirano-Bustamante Myriam M

Affiliations

Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico.

Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico.

Publication information

Front Bioeng Biotechnol. 2022 Jul 7;10:788300. doi: 10.3389/fbioe.2022.788300. eCollection 2022.

Abstract

Proteins are some of the most fascinating and challenging molecules in the universe, and they pose a big challenge for artificial intelligence. The implementation of machine learning/AI in protein science gives rise to a world of knowledge adventures in the workhorse of the cell and proteome homeostasis, which are essential for making life possible. This opens up epistemic horizons thanks to a coupling of human tacit-explicit knowledge with machine learning power, the benefits of which are already tangible, such as important advances in protein structure prediction. Moreover, the driving force behind the protein processes of self-organization, adjustment, and fitness requires a space corresponding to gigabytes of life data in its order of magnitude. Many tasks are currently unexplored or unrevealed, such as novel protein design, protein folding pathways, and synthetic metabolic routes, as well as protein-aggregation mechanisms, pathogenesis of protein misfolding and disease, and proteostasis networks. In this systematic review and biochemical meta-analysis, we aim to contribute to bridging the gap between what we call artificial intelligence (AI) and protein science (PS), a growing research enterprise with exciting and promising biotechnological and biomedical applications. We undertake our task by exploring "the state of the art" in AI and machine learning (ML) applications to protein science in the scientific literature to address some critical research questions in this domain, including: What kinds of tasks are already explored by ML approaches to protein science? What are the most common ML algorithms and databases used? What is the situational diagnostic of the AI-PS inter-field? What do ML processing steps have in common? We also formulate novel questions, such as: Is it possible to discover what the rules of protein evolution are with the AI-PS binomial? How do protein folding pathways evolve? What are the rules that dictate the folds? What are the minimal nuclear protein structures? How do protein aggregates form, and why do they exhibit different toxicities? What are the structural properties of amyloid proteins? How can we design an effective proteostasis network to deal with misfolded proteins? We are a cross-functional group of scientists from several academic disciplines, and we have conducted the systematic review using a variant of the PICO and PRISMA approaches. The search was carried out in four databases (PubMed, Bireme, OVID, and EBSCO Web of Science), resulting in 144 research articles. After three rounds of quality screening, 93 articles were finally selected for further analysis. A summary of our findings is as follows. Regarding AI applications, there are mainly four types: (1) genomics, (2) protein structure and function, (3) protein design and evolution, and (4) drug design. In terms of the ML algorithms and databases used, supervised learning was the most common approach (85%). As for the databases used for the ML models, PDB and UniProtKB/Swiss-Prot were the most common ones (21% and 8%, respectively). Moreover, we identified that approximately 63% of the articles organized their results into three steps, which we labeled pre-process, process, and post-process. A few studies combined data from several databases or created their own databases after the pre-process. Our main finding is that, as of today, there are no research road maps serving as guides to address gaps in our knowledge of the AI-PS binomial. All research efforts to collect and integrate multidimensional data features, and then analyze and validate them, are so far uncoordinated and scattered throughout the scientific literature, without a clear epistemic goal or connection between the studies. Therefore, our main contribution to the scientific literature is to offer a road map to help solve problems in drug design, protein structures, design, and function prediction, while also presenting the "state of the art" on research in the AI-PS binomial until February 2021. Thus, we pave the way toward future advances in the synthetic redesign of novel proteins and protein networks and artificial metabolic pathways, learning lessons from nature for the welfare of humankind. Many of the novel proteins and metabolic pathways do not currently exist in nature, nor are they used in the chemical industry or biomedical field.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e4d4/9301016/1cc7d8c927c6/fbioe-10-788300-g001.jpg
