OncoRTT: Predicting novel oncology-related therapeutic targets using BERT embeddings and omics features.

Author information

Thafar Maha A, Albaradei Somayah, Uludag Mahmut, Alshahrani Mona, Gojobori Takashi, Essack Magbubah, Gao Xin

Affiliations

Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.

College of Computers and Information Technology, Computer Science Department, Taif University, Taif, Saudi Arabia.

Publication information

Front Genet. 2023 Apr 6;14:1139626. doi: 10.3389/fgene.2023.1139626. eCollection 2023.

DOI:10.3389/fgene.2023.1139626

PMID:37091791

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10117673/

Abstract

Late-stage drug development failures are usually a consequence of ineffective targets. Proper target identification is therefore needed, and computational approaches may make it possible. Effective targets have disease-relevant biological functions, and omics data reveal the proteins involved in these functions. In addition, properties that favor binding between a drug and its target can be deduced from the protein's amino acid sequence. In this work, we developed OncoRTT, a deep learning (DL)-based method for predicting novel therapeutic targets. OncoRTT is designed to reduce suboptimal target selection by identifying novel targets based on the features of known effective targets. First, we created the "OncologyTT" datasets, which include genes/proteins associated with ten prevalent cancer types. Then, we generated three sets of features for all genes: omics features, BERT embeddings of the proteins' amino acid sequences, and the integrated features, and used each set separately to train and test the DL classifiers. The models achieved high prediction performance in terms of area under the curve (AUC): AUC was greater than 0.88 for all cancer types, with a maximum of 0.95 for leukemia. OncoRTT also outperformed the state-of-the-art method on that method's own data in five of the seven cancer types assessed by both methods. Furthermore, OncoRTT predicts novel therapeutic targets using new test data related to the seven cancer types. We corroborated these results with additional validation evidence from the Open Targets Platform and a case study focused on the top-10 predicted therapeutic targets for lung cancer.
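As a rough illustration of two ideas the abstract mentions, the sketch below shows one plausible way to form an "integrated" feature vector and how AUC can be computed from classifier scores. It is not the authors' implementation. The abstract does not say how the integrated features are built, so simple concatenation is an assumption here; the function names `integrate_features` and `roc_auc` are hypothetical, and all numbers are made-up toy values rather than data from the paper.

```python
from typing import List, Sequence


def integrate_features(omics: Sequence[float], embedding: Sequence[float]) -> List[float]:
    """Join one gene's omics features with its sequence embedding.

    Assumption: plain concatenation; the abstract does not specify the method.
    """
    return list(omics) + list(embedding)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive example is scored
    higher than a random negative example (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy values for illustration only (not from the OncologyTT datasets).
toy_omics = [0.2, 1.5]
toy_embedding = [0.1, -0.3, 0.7]
integrated = integrate_features(toy_omics, toy_embedding)

toy_labels = [1, 0, 1, 0]             # 1 = known target, 0 = non-target
toy_scores = [0.9, 0.4, 0.35, 0.2]    # hypothetical classifier outputs
toy_auc = roc_auc(toy_labels, toy_scores)
print(integrated, toy_auc)
```

In this toy case three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75. An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives context for the reported values above 0.88.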
