
A regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data.

Authors

Yao Zizhen, Ruzzo Walter L

Affiliation

Department of Computer Science and Engineering, AC101 Paul G. Allen Center, University of Washington, Seattle WA 98195, USA.

Publication

BMC Bioinformatics. 2006 Mar 20;7 Suppl 1(Suppl 1):S11. doi: 10.1186/1471-2105-7-S1-S11.

DOI:10.1186/1471-2105-7-S1-S11
PMID:16723004
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1810312/
Abstract

BACKGROUND

As a variety of functional genomic and proteomic techniques become available, there is an increasing need for functional analysis methodologies that integrate heterogeneous data sources.

METHODS

In this paper, we address this issue by proposing a general framework for gene function prediction based on the k-nearest-neighbor (KNN) algorithm. The choice of KNN is motivated by its simplicity, flexibility to incorporate different data types and adaptability to irregular feature spaces. A weakness of traditional KNN methods, especially when handling heterogeneous data, is that performance is subject to the often ad hoc choice of similarity metric. To address this weakness, we apply regression methods to infer a similarity metric as a weighted combination of a set of base similarity measures, which helps to locate the neighbors that are most likely to be in the same class as the target gene. We also suggest a novel voting scheme to generate confidence scores that estimate the accuracy of predictions. The method gracefully extends to multi-way classification problems.
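
The idea of learning a similarity metric by regression and then voting among nearest neighbors can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes ordinary least squares as the regression method, a same-class indicator as the regression target, and a plain similarity-weighted vote share as the confidence score (the abstract does not specify the paper's own voting scheme). The function names `fit_similarity_weights` and `predict` and the array layouts are hypothetical.

```python
import numpy as np


def fit_similarity_weights(base_sims, labels):
    """Fit weights for combining base similarity measures.

    base_sims: array of shape (m, n, n), one n-by-n similarity matrix
               per base measure over the n training genes.
    labels:    length-n sequence of class labels.
    Returns coef, where coef[0] is an intercept and coef[1:] are the
    weights of the m base measures.
    Assumption: ordinary least squares on a 0/1 "same class" target.
    """
    base_sims = np.asarray(base_sims, dtype=float)
    labels = np.asarray(labels)
    n = base_sims.shape[1]
    rows, cols = np.triu_indices(n, k=1)  # each unordered gene pair once
    # One row per gene pair, one column per base similarity measure.
    X = np.stack([s[rows, cols] for s in base_sims], axis=1)
    X = np.column_stack([np.ones(len(X)), X])  # intercept column
    y = (labels[rows] == labels[cols]).astype(float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def predict(query_sims, labels, coef, k=3):
    """Predict the class of one query gene.

    query_sims: array of shape (m, n), base similarities between the
                query gene and each of the n training genes.
    Returns (predicted_label, confidence), where confidence is the
    share of the similarity-weighted vote won by the predicted label.
    """
    query_sims = np.asarray(query_sims, dtype=float)
    labels = np.asarray(labels)
    combined = coef[0] + coef[1:] @ query_sims  # learned similarity, shape (n,)
    neighbors = np.argsort(combined)[::-1][:k]  # k most similar training genes
    votes = {}
    for j in neighbors:
        # Negative combined scores contribute no vote.
        votes[labels[j]] = votes.get(labels[j], 0.0) + max(combined[j], 0.0)
    total = sum(votes.values())
    if total == 0.0:
        return None, 0.0
    best = max(votes, key=votes.get)
    return best, votes[best] / total
```

In this sketch an uninformative base measure receives a weight near zero, so it has little influence on which training genes count as neighbors. That is the effect the abstract attributes to the learned metric.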

RESULTS

We apply this technique to gene function prediction according to three well-known Escherichia coli classification schemes suggested by biologists, using information derived from microarray and genome sequencing data. We demonstrate that our algorithm dramatically outperforms the naive KNN methods and is competitive with support vector machine (SVM) algorithms for integrating heterogeneous data. We also show that by combining different data sources, prediction accuracy can improve significantly.

CONCLUSION

Our extension of KNN with automatic feature weighting, multi-class prediction, and probabilistic inference enhances prediction accuracy significantly while remaining efficient, intuitive and flexible. This general framework can also be applied to similar classification problems involving heterogeneous datasets.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2531/1810312/1b55a5efa3c5/1471-2105-7-S1-S11-4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2531/1810312/a56a0c0a81a9/1471-2105-7-S1-S11-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2531/1810312/38ceba6bcb94/1471-2105-7-S1-S11-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2531/1810312/d03bb8aeace6/1471-2105-7-S1-S11-3.jpg
