Immuneering Corporation, Cambridge, MA 02142, USA.
Biochim Biophys Acta Rev Cancer. 2021 Aug;1876(1):188548. doi: 10.1016/j.bbcan.2021.188548. Epub 2021 Apr 24.
The concurrent growth of large-scale oncology data and of the computational methods used to analyze and model it has created a promising environment for revolutionizing cancer diagnosis, treatment, prevention, and drug discovery. Computational methods applied to large datasets have accelerated the drug discovery process by reducing bottlenecks and widening the search space beyond what is experimentally tractable. As the research community gains understanding of the myriad genetic underpinnings of cancer from sequencing, imaging, screens, and other data, all of which can be ingested, transformed, and modeled with readily available open-source machine learning and artificial intelligence tools, the next big drug candidate might seem merely an "Enter" key away. Of course, the reality is more convoluted, but still promising.
We present methods to approach the process of building an AI model, with strong emphasis on the aspects of model development we believe to be crucial to success but that are not commonly discussed: diligence in posing questions, identifying suitable datasets and curating them, and collaborating closely with biology and oncology experts while designing and evaluating the model. Digital pathology, Electronic Health Records, and other data types outside of high-throughput molecular data have been reviewed well by others and are outside the scope of this review. This review emphasizes the importance of considering the limitations of the datasets, computational methods, and our own minds when designing AI models. For example, datasets can be biased towards areas of research interest, funding, and particular patient populations. Neural networks may learn representations and correlations within the data that are grounded not in biological phenomena but in statistical anomalies erroneously extracted from the training data. Researchers may misinterpret or over-interpret the output, or design and evaluate the training process such that the resultant model generalizes poorly. Fortunately, awareness of the strengths and limitations of applying data analytics and AI to drug discovery enables us to leverage them carefully and insightfully while maximizing their utility. These applications, when performed in close collaboration with domain experts and combined with continuous critical evaluation, generation of new data to minimize known blind spots as they are found, and rigorous experimental validation, increase the success rate of a study. We discuss applications including AI-assisted target identification, drug repurposing, patient stratification, and gene prioritization.
Data analytics and AI have demonstrated the capability to revolutionize cancer research, prevention, and treatment by maximizing our understanding and use of the expanding panoply of experimental data. However, to separate promise from true utility, computational tools must be carefully designed, critically evaluated, and constantly improved. Once that is achieved, a human-computer hybrid discovery process will outperform one driven by either alone.
This review highlights the challenges and promise of synergizing predictive AI models with human expertise towards greater understanding of cancer.