Hybrid model for efficient prediction of poly(A) signals in human genomic DNA.

Affiliations

King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia; Taif University, Electrical Engineering, Taif 21944, Saudi Arabia.

King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia.

Publication information

Methods. 2019 Aug 15;166:31-39. doi: 10.1016/j.ymeth.2019.04.001. Epub 2019 Apr 13.

DOI:10.1016/j.ymeth.2019.04.001

PMID:30991099

Abstract

Polyadenylation signals (PAS) are found in most protein-coding and some non-coding genes in eukaryotes. Their accurate recognition improves the understanding of gene regulation mechanisms and the recognition of the 3'-end of transcribed gene regions, where premature or alternate transcription ends may lead to various diseases. Although different methods and tools for in-silico prediction of genomic signals have been proposed, the correct identification of PAS in genomic DNA remains challenging because a vast number of non-relevant hexamers are identical to PAS hexamers. In this study, we developed a novel method for PAS recognition. The method is implemented in a hybrid PAS recognition model (HybPAS), which is based on deep neural networks (DNNs) and logistic regression models (LRMs). One such model is developed for each of the 12 most frequent human PAS hexamers. DNN models performed best for eight PAS types (including the two most frequent PAS hexamers), while LRMs performed best for the remaining four PAS types. The new models use different combinations of signal processing-based, statistical, and sequence-based features as input. The results obtained on human genomic data show that HybPAS outperforms the well-tuned state-of-the-art Omni-PolyA models, reducing the classification error by up to 57.35% for 10 out of 12 PAS types, with Omni-PolyA models being better for the other two PAS types. For the most frequent PAS types, 'AATAAA' and 'ATTAAA', HybPAS reduced the error rate by 35.14% and 34.48%, respectively. On average, HybPAS reduces the error by 30.29%. HybPAS is implemented partly in Python and partly in MATLAB and is available at https://github.com/EMANG-KAUST/PolyA_Prediction_LRM_DNN.
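
The hybrid idea described in the abstract (one model per PAS hexamer, chosen from a DNN or an LRM) and the reported error reductions can be illustrated with a minimal Python sketch. It is not the HybPAS code. The function names, the "hypothetical_type" label, and all error values are assumptions made up for illustration, and the sketch assumes that "reducing the error by X%" means a relative reduction with respect to the baseline error.

```python
from typing import Dict


def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Percentage by which new_error is lower than baseline_error.

    Assumption: the reported reductions are relative to the baseline error.
    A negative result means the new model is worse than the baseline.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


def select_best_per_hexamer(errors: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each PAS type, pick the model family with the lowest error.

    `errors` maps a PAS type to {model family: validation error}.
    On a tie, the family listed first is kept.
    """
    return {
        pas_type: min(by_model, key=by_model.get)
        for pas_type, by_model in errors.items()
    }


# Hypothetical validation errors, for illustration only (not from the paper).
validation_errors = {
    "AATAAA": {"DNN": 0.20, "LRM": 0.25},
    "hypothetical_type": {"DNN": 0.30, "LRM": 0.24},
}

chosen = select_best_per_hexamer(validation_errors)
for pas_type, family in chosen.items():
    print(pas_type, "->", family)

# Hypothetical baseline error 0.20 against a new error of 0.15.
print(round(relative_error_reduction(0.20, 0.15), 2), "% relative reduction")
```

Selecting the model family separately for each PAS type is what makes the overall predictor a hybrid: no single family has to be the best everywhere.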
