FIFS: A data mining method for informative marker selection in high dimensional population genomic data.

Affiliations

School of Informatics, Aristotle University of Thessaloniki, 54124, Greece; Department of Genetics, Development and Molecular Biology, School of Biology, Aristotle University of Thessaloniki, 54124, Greece.

School of Informatics, Aristotle University of Thessaloniki, 54124, Greece.

Publication information

Comput Biol Med. 2017 Nov 1;90:146-154. doi: 10.1016/j.compbiomed.2017.09.020. Epub 2017 Sep 28.

DOI: 10.1016/j.compbiomed.2017.09.020
PMID: 28992453

Abstract

BACKGROUND AND OBJECTIVE

Single nucleotide polymorphisms (SNPs) are becoming the marker of choice for biological analyses across a wide range of applications of great medical, biological, economic and environmental interest. Classification tasks, i.e., the assignment of individuals to groups of origin based on their (multi-locus) genotypes, are performed in many fields, such as forensic investigations and discrimination between wild and/or farmed populations. These tasks should be performed with a small number of loci, for computational as well as biological reasons. Thus, feature selection should precede classification, especially for SNP datasets, where the number of features can reach hundreds of thousands or millions.

METHODS

In this paper, we present a novel data mining approach called FIFS (Frequent Item Feature Selection), which uses frequent items to select the most informative markers from population genomic data. It is a modular method consisting of two main components. The first identifies the most frequent and unique genotypes for each sampled population. The second selects the most appropriate among them to create the informative SNP subsets that are returned.
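The two components described above can be illustrated with a minimal sketch. The abstract does not specify thresholds, scoring, or the selection rule, so everything below is an assumption made for illustration: the function names, the `min_support` and `max_other` cut-offs, the score (support inside a population minus the highest support in any other population), and the `per_population` limit are hypothetical and are not taken from the paper.

```python
from collections import Counter


def frequent_unique_genotypes(genotypes, labels, min_support=0.8, max_other=0.2):
    """Component 1 (assumed form): for each population, collect
    (SNP index, genotype) items that are frequent inside the population
    and rare in every other population."""
    populations = sorted(set(labels))
    sizes = Counter(labels)

    # counts[pop][(snp, genotype)] = individuals in pop carrying that genotype
    counts = {pop: Counter() for pop in populations}
    for row, pop in zip(genotypes, labels):
        for snp, genotype in enumerate(row):
            counts[pop][(snp, genotype)] += 1

    candidates = {pop: [] for pop in populations}
    for pop in populations:
        for (snp, genotype), count in counts[pop].items():
            support = count / sizes[pop]
            if support < min_support:
                continue  # not frequent enough in this population
            # Highest relative frequency of the same item elsewhere.
            other = max(
                (counts[q][(snp, genotype)] / sizes[q]
                 for q in populations if q != pop),
                default=0.0,
            )
            if other <= max_other:
                # Hypothetical score: larger means more population-specific.
                candidates[pop].append((support - other, snp, genotype))
    return candidates


def select_snps(candidates, per_population=2):
    """Component 2 (assumed form): keep the top-scoring SNPs per
    population and return the union as the informative subset."""
    selected = set()
    for items in candidates.values():
        for _score, snp, _genotype in sorted(items, reverse=True)[:per_population]:
            selected.add(snp)
    return sorted(selected)
```

Here `genotypes` is a list of rows, one per individual, with genotypes coded as, for example, 0, 1 or 2 copies of an allele, and `labels` gives each individual's population. Calling `select_snps(frequent_unique_genotypes(genotypes, labels))` returns the indices of the retained SNPs, which could then be passed to any classifier to estimate assignment accuracy.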

RESULTS

The proposed method (FIFS) was tested on a real dataset providing comprehensive coverage of the pig breed types present in Britain. The dataset consisted of 446 individuals divided into 14 sub-populations and genotyped at 59,436 SNPs. Our method outperforms the state-of-the-art and baseline methods in every case. More specifically, it surpassed the 95% assignment accuracy threshold with fewer than half the SNPs required by the other methods (FIFS: 28 SNPs; Delta: 70 SNPs; pairwise FST: 70 SNPs; In: 100 SNPs).

CONCLUSION

Our approach successfully deals with the problem of informative marker selection in high dimensional genomic datasets. It offers better results than existing approaches and can help biologists select the most informative markers, with maximum discrimination power, to optimize cost-effective panels for applications such as species identification, wildlife management, and forensics.

Similar articles

1
FIFS: A data mining method for informative marker selection in high dimensional population genomic data.
Comput Biol Med. 2017 Nov 1;90:146-154. doi: 10.1016/j.compbiomed.2017.09.020. Epub 2017 Sep 28.
2
TRES: Identification of Discriminatory and Informative SNPs from Population Genomic Data.
J Hered. 2015 Sep-Oct;106(5):672-6. doi: 10.1093/jhered/esv044. Epub 2015 Jul 2.
3
Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests.
BMC Genomics. 2015;16 Suppl 2(Suppl 2):S5. doi: 10.1186/1471-2164-16-S2-S5. Epub 2015 Jan 21.
4
A machine learning approach for the identification of population-informative markers from high-throughput genotyping data: application to several pig breeds.
Animal. 2020 Feb;14(2):223-232. doi: 10.1017/S1751731119002167. Epub 2019 Oct 11.
5
Comparison of three statistical approaches for feature selection for fine-scale genetic population assignment in four pig breeds.
Trop Anim Health Prod. 2021 Jul 10;53(3):395. doi: 10.1007/s11250-021-02824-x.
6
Combined use of principal component analysis and random forests identify population-informative single nucleotide polymorphisms: application in cattle breeds.
J Anim Breed Genet. 2015 Oct;132(5):346-56. doi: 10.1111/jbg.12155. Epub 2015 Mar 17.
7
Comparative analysis of five different methods to design a breed-specific SNP panel for cattle.
Anim Biotechnol. 2021 Feb;32(1):130-136. doi: 10.1080/10495398.2019.1646266. Epub 2019 Jul 31.
8
Increasing accuracy of genomic selection in presence of high density marker panels through the prioritization of relevant polymorphisms.
BMC Genet. 2019 Feb 22;20(1):21. doi: 10.1186/s12863-019-0720-5.
9
Evaluation of approaches for identifying population informative markers from high density SNP chips.
BMC Genet. 2011 May 13;12:45. doi: 10.1186/1471-2156-12-45.
10
Predicting the disease of Alzheimer with SNP biomarkers and clinical data using data mining classification approach: decision tree.
Stud Health Technol Inform. 2014;205:511-5.

Cited by

1
Global and Local Ancestry and its Importance: A Review.
Curr Genomics. 2024;25(4):237-260. doi: 10.2174/0113892029298909240426094055. Epub 2024 May 9.
2
Elucidation of population stratifying markers and selective sweeps in crossbred Landlly pig population using genome-wide SNP data.
Mamm Genome. 2024 Jun;35(2):170-185. doi: 10.1007/s00335-024-10029-4. Epub 2024 Mar 15.
3
Can we rely on selected genetic markers for population identification? Evidence from coastal Atlantic cod.
Ecol Evol. 2018 Dec 1;8(24):12547-12558. doi: 10.1002/ece3.4648. eCollection 2018 Dec.