Identifying genetic determinants of complex phenotypes from whole genome sequence data.

Affiliations

Department of Biology, University of Ottawa, Ottawa, Ontario, Canada.

Department of Mathematics and Statistics, University of Ottawa, Ottawa, Ontario, Canada.

Publication information

BMC Genomics. 2019 Jun 10;20(1):470. doi: 10.1186/s12864-019-5820-0.

DOI:10.1186/s12864-019-5820-0

PMID:31182025

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6558885/

Abstract

BACKGROUND

A critical goal in biology is to relate the phenotype to the genotype, that is, to find the genetic determinants of various traits. Simple monofactorial determinants are relatively easy to identify, but the underpinnings of complex phenotypes are harder to predict. Traditional approaches rely on genome-wide association studies based on single nucleotide polymorphism data, and the ability of machine learning algorithms to find these determinants in whole proteome data is still not well known.

RESULTS

To better understand the applicability of machine learning in this case, we implemented two such algorithms, adaptive boosting (AB) and repeated random forest (RRF), and developed a chunking layer that facilitates the analysis of whole proteome data. We first assessed the performance of these algorithms and tuned them on an influenza data set, for which the determinants of three complex phenotypes (infectivity, transmissibility, and pathogenicity) are known from experimental evidence. This allowed us to show that chunking improves runtimes by an order of magnitude. Based on simulations, we showed that chunking also increases the sensitivity of the predictions, reaching 100% with as few as 20 sequences in a small proteome such as that of influenza (5k sites), but possibly requiring at least 30 sequences to reach 90% on larger alignments (500k sites). Although RRF was less specific than random forest, its specificity never fell below 50%, and RRF sensitivity was significantly higher at smaller chunk sizes. We then used these algorithms to predict the determinants of three types of drug resistance (to ciprofloxacin, ceftazidime, and gentamicin) in the bacterium Pseudomonas aeruginosa. While both algorithms performed well on the influenza data, results were more nuanced in the bacterial case, with RRF making more sensible predictions, with smaller error rates, than AB.
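The idea of a chunking layer, and of scoring predicted sites against experimentally known determinants, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the toy alignment, the `top_per_chunk` parameter, and above all the per-site scorer, which is a simple single-site majority-vote accuracy standing in for a real learner and is not the AB or RRF implementation described in the abstract.

```python
from collections import Counter, defaultdict


def chunk_columns(n_sites, chunk_size):
    """Yield half-open (start, end) column ranges that tile the alignment."""
    for start in range(0, n_sites, chunk_size):
        yield start, min(start + chunk_size, n_sites)


def stump_accuracy(column, labels):
    """Hypothetical stand-in scorer (not AB or RRF).

    Each residue at this site predicts the majority phenotype among the
    sequences carrying it; the score is the fraction predicted correctly.
    """
    by_residue = defaultdict(Counter)
    for residue, label in zip(column, labels):
        by_residue[residue][label] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_residue.values())
    return correct / len(labels)


def rank_sites(alignment, labels, chunk_size, top_per_chunk=1):
    """Score each chunk independently; return global indices of top sites."""
    n_sites = len(alignment[0])
    candidates = []
    for start, end in chunk_columns(n_sites, chunk_size):
        scores = []
        for site in range(start, end):
            column = [seq[site] for seq in alignment]
            if len(set(column)) > 1:  # invariant sites carry no signal
                scores.append((stump_accuracy(column, labels), site))
        # Best score first; ties broken by position for reproducibility.
        scores.sort(key=lambda pair: (-pair[0], pair[1]))
        candidates.extend(site for _, site in scores[:top_per_chunk])
    return sorted(candidates)


def sensitivity_specificity(predicted, known, n_sites):
    """Compare predicted determinant sites with known ones."""
    predicted, known = set(predicted), set(known)
    tp = len(predicted & known)
    fn = len(known - predicted)
    fp = len(predicted - known)
    tn = n_sites - tp - fn - fp
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


# Toy data (made up): four aligned sequences and a binary phenotype.
alignment = ["MKTAYL", "MKTAYL", "MRTAYV", "MRTGYV"]
labels = [0, 0, 1, 1]

if __name__ == "__main__":
    sites = rank_sites(alignment, labels, chunk_size=3)
    print("candidate sites:", sites)
    print("sensitivity, specificity:",
          sensitivity_specificity(sites, known=[1], n_sites=6))
```

Because each chunk is scored on its own, the chunks can in principle be processed in parallel and each model sees far fewer features than the full alignment, which is one plausible reason chunking could reduce runtime; the sketch does not attempt to reproduce the runtime or sensitivity figures reported above.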

CONCLUSIONS

Altogether, we demonstrated that machine learning algorithms can be used to identify genetic determinants in small proteomes (viruses), even when trained on small numbers of individuals. We further showed that our RRF algorithm may deserve more scrutiny, which should be facilitated by the decreasing costs of both sequencing and phenotyping of large cohorts of individuals.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cf60/6558885/c01db5165f86/12864_2019_5820_Fig1_HTML.jpg
