Dissecting trait heterogeneity: a comparison of three clustering methods applied to genotypic data.

Author information

Thornton-Wells Tricia A, Moore Jason H, Haines Jonathan L

Affiliation

Neuroscience Graduate Program, Vanderbilt Brain Institute, Vanderbilt University Medical Center, Nashville, TN, USA.

Publication information

BMC Bioinformatics. 2006 Apr 12;7:204. doi: 10.1186/1471-2105-7-204.

Abstract

BACKGROUND

Trait heterogeneity, which exists when a trait has been defined with insufficient specificity such that it is actually two or more distinct traits, has been implicated as a confounding factor in traditional statistical genetics of complex human disease. In the absence of detailed phenotypic data collected consistently in combination with genetic data, unsupervised computational methodologies offer the potential for discovering underlying trait heterogeneity. The performance of three such methods appropriate for categorical data (Bayesian Classification, Hypergraph-Based Clustering, and Fuzzy k-Modes Clustering) was compared. Also tested was the ability of these methods to detect trait heterogeneity in the presence of locus heterogeneity and/or gene-gene interaction, two other complicating factors in discovering genetic models of complex human disease. To determine the efficacy of applying the Bayesian Classification method to real data, the reliability of its internal clustering metrics at finding good clusterings was evaluated using permutation testing.
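To make the idea of clustering categorical genotype data concrete, the following is a minimal sketch of hard (non-fuzzy) k-modes with a simple matching dissimilarity. It is a simplification for illustration only, not the Fuzzy k-Modes implementation evaluated in the paper. The 0/1/2 genotype coding, the function names, the farthest-first initialisation, and the toy data are all assumptions introduced here.

```python
import numpy as np

def matching_distance(X, modes):
    """Simple matching dissimilarity: number of loci where genotypes differ."""
    return (X[:, None, :] != modes[None, :, :]).sum(axis=2)

def column_modes(X, n_categories):
    """Most frequent category in each column (ties go to the lowest code)."""
    counts = np.stack([(X == c).sum(axis=0) for c in range(n_categories)])
    return counts.argmax(axis=0)

def k_modes(X, k, n_categories=3, max_iter=50):
    # Deterministic farthest-first initialisation (an assumption made here
    # for reproducibility; random restarts are the more common choice).
    modes = [X[0]]
    for _ in range(1, k):
        nearest = matching_distance(X, np.array(modes)).min(axis=1)
        modes.append(X[nearest.argmax()])
    modes = np.array(modes)
    for _ in range(max_iter):
        # Assign each individual to the closest mode.
        labels = matching_distance(X, modes).argmin(axis=1)
        new_modes = modes.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old mode if a cluster empties
                new_modes[j] = column_modes(members, n_categories)
        if np.array_equal(new_modes, modes):
            break
        modes = new_modes
    dist = matching_distance(X, modes)
    return dist.argmin(axis=1), modes, int(dist.min(axis=1).sum())

# Hypothetical toy data: 10 individuals x 4 loci, genotypes coded 0/1/2,
# two underlying groups with one discordant genotype in each.
genotypes = np.array([[0, 0, 0, 0]] * 5 + [[2, 2, 2, 2]] * 5)
genotypes[1, 3] = 1
genotypes[6, 2] = 1
labels, modes, cost = k_modes(genotypes, k=2)
print(labels, modes, cost)
```

The fuzzy variant replaces the hard assignment step with graded memberships; that step is deliberately left out of this sketch.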

RESULTS

Bayesian Classification outperformed the other two methods, with the exception that the Fuzzy k-Modes Clustering performed best on the most complex genetic model. Bayesian Classification achieved excellent recovery for 75% of the datasets simulated under the simplest genetic model, while it achieved moderate recovery for 56% of datasets with a sample size of 500 or more (across all simulated models) and for 86% of datasets with 10 or fewer nonfunctional loci (across all simulated models). Neither Hypergraph Clustering nor Fuzzy k-Modes Clustering achieved good or excellent cluster recovery for a majority of datasets even under a restricted set of conditions. When using the average log of class strength as the internal clustering metric, the false positive rate was controlled very well, at three percent or less for all three significance levels (0.01, 0.05, 0.10), and the false negative rate was acceptably low (18 percent) for the least stringent significance level of 0.10.
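The abstract does not spell out the permutation scheme, so the following is only a sketch of how a permutation test on a structure score could work, under stated assumptions. Each locus is shuffled independently across individuals, which preserves per-locus genotype frequencies while breaking cross-locus structure. The score used here (variance of per-individual genotype sums) is a hypothetical stand-in, not the average log of class strength used in the paper, and all function names are illustrative.

```python
import numpy as np

def permute_loci(X, rng):
    """Shuffle each locus (column) independently across individuals."""
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

def row_sum_variance(X):
    """Stand-in structure score: inflated when loci co-vary across individuals."""
    return float(X.sum(axis=1).var())

def permutation_p_value(X, metric, n_perm=199, seed=0):
    rng = np.random.default_rng(seed)
    observed = metric(X)
    null = np.array([metric(permute_loci(X, rng)) for _ in range(n_perm)])
    # Add-one correction so the estimated p-value is never exactly zero.
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, float(p)

# Hypothetical toy data: two perfectly separated groups over 6 loci.
structured = np.array([[0] * 6] * 20 + [[2] * 6] * 20)
observed, p = permutation_p_value(structured, row_sum_variance)
print(observed, p)
```

Under this scheme, a false positive rate at a given significance level could be estimated as the fraction of datasets simulated without structure whose p-value falls at or below that level, and a false negative rate as the fraction of structured datasets whose p-value exceeds it.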

CONCLUSION

Bayesian Classification shows promise as an unsupervised computational method for dissecting trait heterogeneity in genotypic data. Its control of false positive and false negative rates lends confidence to the validity of its results. Further investigation of how different parameter settings may improve the performance of Bayesian Classification, especially under more complex genetic models, is ongoing.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5a73/1525209/122ffff6d616/1471-2105-7-204-1.jpg
