A modified T-test feature selection method and its application on the HapMap genotype data.

Author information

Zhou Nina, Wang Lipo

Affiliation

School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.

Publication information

Genomics Proteomics Bioinformatics. 2007 Dec;5(3-4):242-9. doi: 10.1016/S1672-0229(08)60011-X.

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5054219/

Abstract

Single nucleotide polymorphisms (SNPs) are genetic variations that determine the differences between any two unrelated individuals. Various population groups can be distinguished from each other using SNPs. For instance, the HapMap dataset has four population groups with about ten million SNPs. For more insights on human evolution, ethnic variation, and population assignment, we propose to find out which SNPs are significant in determining the population groups and then to classify different populations using these relevant SNPs as input features. In this study, we developed a modified t-test ranking measure and applied it to the HapMap genotype data. Firstly, we rank all SNPs in comparison with other feature importance measures including F-statistics and the informativeness for assignment. Secondly, we select different numbers of the most highly ranked SNPs as the input to a classifier, such as the support vector machine, so as to find the best feature subset corresponding to the best classification accuracy. Experimental results showed that the proposed method is very effective in finding SNPs that are significant in determining the population groups, with reduced computational burden and better classification accuracy.
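The ranking-then-selection procedure described in the abstract can be sketched roughly as follows. The abstract does not give the formula of the modified t-test, so this sketch swaps in a generic multi-class t-like score (largest standardized distance between a class mean and the overall mean); it is not the authors' measure. The function names `class_t_scores`, `top_k`, and `select_columns`, the 0/1/2 genotype coding, and the toy data are all assumptions made for illustration. The classification step (for example, a support vector machine trained on the selected columns) is omitted.

```python
import math


def class_t_scores(X, y, eps=1e-12):
    """Score each feature with a generic multi-class t-like statistic.

    X: list of samples, each a list of numeric genotype codes (assumed 0/1/2).
    y: list of class labels, one per sample.
    Assumes at least two classes and more samples than classes.
    NOTE: illustrative stand-in, not the modified t-test from the paper.
    """
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    members = {c: [i for i in range(n) if y[i] == c] for c in classes}
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        overall = sum(col) / n
        means = {c: sum(col[i] for i in members[c]) / len(members[c])
                 for c in classes}
        # Pooled within-class standard deviation (N - K degrees of freedom).
        within_ss = sum((col[i] - means[y[i]]) ** 2 for i in range(n))
        s = math.sqrt(within_ss / (n - len(classes)))
        best = 0.0
        for c in classes:
            # Scale factor for the difference between class and overall mean.
            m = math.sqrt(1.0 / len(members[c]) - 1.0 / n)
            # eps guards against division by zero for constant features.
            best = max(best, abs(means[c] - overall) / (m * s + eps))
        scores.append(best)
    return scores


def top_k(scores, k):
    """Return the indices of the k highest-scoring features."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


def select_columns(X, idx):
    """Keep only the chosen feature columns of every sample."""
    return [[row[j] for j in idx] for row in X]


# Toy data (hypothetical): feature 0 differs between groups, 1 and 2 do not.
X = [[0, 1, 2], [0, 2, 0], [1, 0, 1],
     [2, 1, 2], [2, 2, 0], [1, 0, 1]]
y = ["A", "A", "A", "B", "B", "B"]

scores = class_t_scores(X, y)
keep = top_k(scores, 1)
X_small = select_columns(X, keep)
```

In a fuller experiment, one would presumably repeat the last two steps for several values of `k` and compare the classification accuracy obtained on each reduced feature set, as the abstract describes.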

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f5b/5054219/8c68e82a0522/gr1.jpg