School of Informatics, Aristotle University of Thessaloniki, 54124, Greece; Department of Genetics, Development and Molecular Biology, School of Biology, Aristotle University of Thessaloniki, 54124, Greece.
School of Informatics, Aristotle University of Thessaloniki, 54124, Greece.
Comput Biol Med. 2017 Nov 1;90:146-154. doi: 10.1016/j.compbiomed.2017.09.020. Epub 2017 Sep 28.
Single Nucleotide Polymorphisms (SNPs) are becoming the marker of choice for biological analyses across a wide range of applications of great medical, biological, economic and environmental interest. Classification tasks, i.e. the assignment of individuals to groups of origin based on their (multi-locus) genotypes, are performed in many fields, such as forensic investigations and discrimination between wild and/or farmed populations. These tasks should be performed with a small number of loci, for computational as well as biological reasons. Thus, feature selection should precede classification, especially for SNP datasets, where the number of features can reach hundreds of thousands or millions.
In this paper, we present a novel data mining approach called FIFS (Frequent Item Feature Selection), which uses frequent items to select the most informative markers from population genomic data. It is a modular method consisting of two main components. The first identifies the most frequent and unique genotypes for each sampled population. The second selects the most appropriate among them to create the informative SNP subsets that are returned.
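The two-component structure described above can be illustrated with a minimal sketch. This is not the published FIFS implementation: the genotype encoding (0/1/2 per SNP), the scoring rule (within-population frequency of the most common genotype minus its highest frequency in any other population), the top-k selection, and the function names `frequent_unique_genotypes` and `select_snps` are all assumptions made here for illustration.

```python
from collections import Counter


def frequent_unique_genotypes(genotypes, labels):
    """Component 1 (assumed form): for each population and SNP, find the most
    frequent genotype and score how specific it is to that population."""
    pops = sorted(set(labels))
    n_snps = len(genotypes[0])
    sizes = Counter(labels)
    # counts[pop][snp] is a Counter of genotype codes seen in that population
    counts = {p: [Counter() for _ in range(n_snps)] for p in pops}
    for row, pop in zip(genotypes, labels):
        for j, g in enumerate(row):
            counts[pop][j][g] += 1

    candidates = {p: [] for p in pops}
    for p in pops:
        for j in range(n_snps):
            g, c = counts[p][j].most_common(1)[0]
            support = c / sizes[p]  # frequency inside the population
            # Highest frequency of the same genotype in any other population
            outside = max(
                (counts[q][j][g] / sizes[q] for q in pops if q != p),
                default=0.0,
            )
            # Hypothetical score: frequent inside, rare outside
            candidates[p].append((support - outside, j, g))
    return candidates


def select_snps(candidates, per_population=1):
    """Component 2 (assumed form): keep the top-scoring SNPs of every
    population and return the union as the informative subset."""
    selected = set()
    for items in candidates.values():
        ranked = sorted(items, key=lambda t: (-t[0], t[1]))
        selected.update(j for _, j, _ in ranked[:per_population])
    return sorted(selected)
```

Given a list of genotype rows and a parallel list of population labels, `select_snps(frequent_unique_genotypes(genotypes, labels), per_population=k)` returns the indices of the retained SNPs. Keeping the two steps as separate functions mirrors the modular design mentioned in the text, so either step could be swapped for a different scoring or selection rule.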
The proposed method (FIFS) was tested on a real dataset providing comprehensive coverage of the pig breed types present in Britain. This dataset consisted of 446 individuals divided into 14 sub-populations, genotyped at 59,436 SNPs. Our method outperformed the state-of-the-art and baseline methods in every case. More specifically, it surpassed the 95% assignment accuracy threshold with fewer than half the SNPs required by the other methods (FIFS: 28 SNPs, Delta: 70 SNPs, Pairwise FST: 70 SNPs, In: 100 SNPs).

In conclusion, our approach successfully deals with the problem of informative marker selection in high-dimensional genomic datasets. It offers better results than existing approaches and can help biologists select the most informative markers with maximum discrimination power, in order to optimize cost-effective panels for applications such as species identification, wildlife management, and forensics.