Hall Barry G
Bellingham Research Institute, Bellingham, Washington, United States of America.
PLoS One. 2014 Feb 28;9(2):e90490. doi: 10.1371/journal.pone.0090490. eCollection 2014.
SNP-association studies are a starting point for identifying genes that may be responsible for specific phenotypes, such as disease traits. The vast majority of tools for SNP-association studies are directed toward SNPs in the human genome, and I am unaware of any tools designed specifically for such studies in bacterial or viral genomes. The PPFS (Predict Phenotypes From SNPs) package described here is an add-on to kSNP, a program that can identify SNPs in a data set of hundreds of microbial genomes. PPFS identifies those SNPs that are non-randomly associated with a phenotype based on the χ² probability, then uses those diagnostic SNPs for two distinct but related purposes: (1) to predict the phenotypes of strains whose phenotypes are unknown, and (2) to identify those diagnostic SNPs that are most likely to be causally related to the phenotype. In the first example illustrated here, a set of 68 E. coli genomes, for 67 of which the pathogenicity phenotype was known, contained 418,500 SNPs. Using the phenotypes of 36 of those strains, PPFS identified 207 diagnostic SNPs. The diagnostic SNPs predicted the phenotypes of all of the genomes with 97% accuracy. PPFS then identified 97 SNPs whose probability of being causally related to the pathogenic phenotype was >0.999. In a second example, from a set of 116 E. coli genome sequences, PPFS used the phenotypes of 65 strains to identify 101 SNPs that predicted the source host (human or non-human) with 90% accuracy.
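The χ² screening step described above can be illustrated with a minimal sketch. This is not the PPFS implementation: it assumes biallelic SNPs coded 0/1, a binary phenotype coded 0/1, a plain Pearson χ² test on a 2×2 table with no continuity correction, and an arbitrary significance cutoff. The function names `chi2_2x2` and `diagnostic_snps`, and the `alpha` parameter, are hypothetical and chosen only for illustration.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # A table with an empty row or column carries no association signal.
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def diagnostic_snps(alleles, phenotypes, alpha=0.05):
    """Return IDs of SNPs whose allele state is non-randomly associated
    with the phenotype (p < alpha).

    alleles:    dict mapping SNP ID -> list of 0/1 allele calls, one per strain
    phenotypes: list of 0/1 phenotype labels, in the same strain order
    alpha:      assumed cutoff; not a value taken from the paper
    """
    hits = []
    for snp_id, calls in alleles.items():
        pairs = list(zip(calls, phenotypes))
        a = sum(1 for x, y in pairs if x == 1 and y == 1)
        b = sum(1 for x, y in pairs if x == 1 and y == 0)
        c = sum(1 for x, y in pairs if x == 0 and y == 1)
        d = sum(1 for x, y in pairs if x == 0 and y == 0)
        _, p_value = chi2_2x2(a, b, c, d)
        if p_value < alpha:
            hits.append(snp_id)
    return hits


# Toy data (invented for illustration): 10 strains, 2 SNPs.
toy_phenotypes = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_alleles = {
    "snp_a": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # tracks the phenotype exactly
    "snp_b": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],  # nearly independent of it
}
toy_hits = diagnostic_snps(toy_alleles, toy_phenotypes)
```

With real data sets containing hundreds of thousands of SNPs, a much stricter cutoff or a multiple-testing correction would likely be needed; the sketch leaves that choice to the caller through `alpha`.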