Identifying genetic determinants of complex phenotypes from whole genome sequence data.

Affiliations

Department of Biology, University of Ottawa, Ottawa, Ontario, Canada.

Department of Mathematics and Statistics, University of Ottawa, Ottawa, Ontario, Canada.

Publication information

BMC Genomics. 2019 Jun 10;20(1):470. doi: 10.1186/s12864-019-5820-0.

Abstract

BACKGROUND

A critical goal in biology is to relate phenotype to genotype, that is, to find the genetic determinants of various traits. Simple monofactorial determinants are relatively easy to identify, but the underpinnings of complex phenotypes are harder to predict. Traditional approaches rely on genome-wide association studies based on single nucleotide polymorphism (SNP) data, whereas the ability of machine learning algorithms to find these determinants in whole proteome data is still not well known.

RESULTS

To better understand the applicability of machine learning in this case, we implemented two such algorithms, adaptive boosting (AB) and repeated random forest (RRF), and developed a chunking layer that facilitates the analysis of whole proteome data. We first assessed the performance of these algorithms and tuned them on an influenza data set, for which the determinants of three complex phenotypes (infectivity, transmissibility, and pathogenicity) are known from experimental evidence. This allowed us to show that chunking improves runtimes by an order of magnitude. Based on simulations, we showed that chunking also increases the sensitivity of the predictions: sensitivity reached 100% with as few as 20 sequences in a small proteome such as that of influenza (5k sites), but at least 30 sequences may be required to reach 90% on larger alignments (500k sites). Although RRF was less specific than random forest, its specificity was never below 50%, and RRF sensitivity was significantly higher at smaller chunk sizes. We then used these algorithms to predict the determinants of three types of drug resistance (to Ciprofloxacin, Ceftazidime, and Gentamicin) in a bacterium, Pseudomonas aeruginosa. While both algorithms performed well on the influenza data, results were more nuanced in the bacterial case, with RRF making more sensible predictions, with smaller error rates, than AB.
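
The abstract does not describe how the chunking layer works internally, so the following is only a minimal sketch of one plausible reading: split the alignment columns into fixed-size chunks, score the sites of each chunk independently, and merge the best candidates. The function names (`chunk_columns`, `mutual_information`, `rank_sites`) and the parameters `chunk_size` and `top_per_chunk` are hypothetical, and the per-site score is plain mutual information, used here as a simple stand-in for the feature importances that AB or RRF would provide. It is not the authors' implementation.

```python
from collections import Counter
from math import log2


def chunk_columns(n_sites, chunk_size):
    """Yield (start, end) column ranges that cover an alignment of n_sites columns."""
    for start in range(0, n_sites, chunk_size):
        yield start, min(start + chunk_size, n_sites)


def mutual_information(column, labels):
    """Mutual information (in bits) between one alignment column and the phenotype labels.

    Stand-in scorer (assumption): the paper uses AB and RRF, not mutual information.
    """
    n = len(labels)
    joint = Counter(zip(column, labels))
    residues = Counter(column)
    phenotypes = Counter(labels)
    return sum(
        (count / n) * log2(count * n / (residues[res] * phenotypes[lab]))
        for (res, lab), count in joint.items()
    )


def rank_sites(sequences, labels, chunk_size, top_per_chunk=1):
    """Score sites chunk by chunk, keep the best of each chunk, and merge the results.

    Returns (site_index, score) pairs sorted by decreasing score, ties broken by site index.
    """
    n_sites = len(sequences[0])
    candidates = []
    for start, end in chunk_columns(n_sites, chunk_size):
        scored = []
        for site in range(start, end):
            column = [seq[site] for seq in sequences]
            scored.append((mutual_information(column, labels), site))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        candidates.extend(scored[:top_per_chunk])
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(site, score) for score, site in candidates]


# Toy aligned sequences (made up for illustration); only site 1 tracks the phenotype.
toy_sequences = ["MKTAY", "MKTAY", "MRTAY", "MRTAY"]
toy_labels = [0, 0, 1, 1]
toy_ranking = rank_sites(toy_sequences, toy_labels, chunk_size=2)
```

On this toy alignment, site 1 (K versus R) is the only column that varies with the phenotype, so it comes first in `toy_ranking`. Because each chunk is scored independently, the per-chunk loop could in principle be run in parallel or with a much smaller feature matrix per model, which is one way chunking might reduce runtime.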

CONCLUSIONS

Altogether, we demonstrated that machine learning (ML) algorithms can be used to identify genetic determinants in small proteomes (viruses), even when trained on small numbers of individuals. We further showed that our RRF algorithm may deserve more scrutiny, which should be facilitated by the decreasing costs of both sequencing and phenotyping of large cohorts of individuals.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cf60/6558885/c01db5165f86/12864_2019_5820_Fig1_HTML.jpg
