Classification and knowledge discovery in protein databases.

Authors

Radivojac Predrag, Chawla Nitesh V, Dunker A Keith, Obradovic Zoran

Affiliation

Center for Information Science and Technology, Temple University, USA.

Publication information

J Biomed Inform. 2004 Aug;37(4):224-39. doi: 10.1016/j.jbi.2004.07.008.

DOI: 10.1016/j.jbi.2004.07.008
PMID: 15465476

Abstract

We consider the problem of classification in noisy, high-dimensional, and class-imbalanced protein datasets. In order to design a complete classification system, we use a three-stage machine learning framework consisting of a feature selection stage, a method addressing noise and class-imbalance, and a method for combining biologically related tasks through a prior-knowledge based clustering. In the first stage, we employ Fisher's permutation test as a feature selection filter. Comparisons with the alternative criteria show that it may be favorable for typical protein datasets. In the second stage, noise and class imbalance are addressed by using minority class over-sampling, majority class under-sampling, and ensemble learning. The performance of logistic regression models, decision trees, and neural networks is systematically evaluated. The experimental results show that in many cases ensembles of logistic regression classifiers may outperform more expressive models due to their robustness to noise and low sample density in a high-dimensional feature space. However, ensembles of neural networks may be the best solution for large datasets. In the third stage, we use prior knowledge to partition unlabeled data such that the class distributions among non-overlapping clusters significantly differ. In our experiments, training classifiers specialized to the class distributions of each cluster resulted in a further decrease in classification error.
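
The first two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the test statistic (absolute difference of class means), the significance threshold `alpha`, the gradient-descent logistic regression, the resampling scheme, and every function name below are assumptions chosen for illustration. The third stage (cluster-specific classifiers built from prior knowledge) is not covered.

```python
import numpy as np


def permutation_pvalue(x, y, n_perm=500, rng=None):
    """Permutation-test p-value for one feature x against binary labels y.

    Assumed statistic: absolute difference between the two class means.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y).astype(bool)
    observed = abs(x[y].mean() - x[~y].mean())
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(y)  # shuffle labels, class sizes stay fixed
        hits += abs(x[p].mean() - x[~p].mean()) >= observed
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def select_features(X, y, alpha=0.05, n_perm=500, seed=0):
    """Filter step: indices of columns with permutation p-value below alpha."""
    rng = np.random.default_rng(seed)
    pvals = np.array([permutation_pvalue(X[:, j], y, n_perm, rng)
                      for j in range(X.shape[1])])
    return np.flatnonzero(pvals < alpha)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, lr=0.1, n_iter=300):
    """Plain gradient-descent logistic regression; returns (weights, bias)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


def fit_balanced_ensemble(X, y, n_models=10, seed=0):
    """Train each ensemble member on a class-balanced resample."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    n = min(len(pos), len(neg))
    models = []
    for _ in range(n_models):
        # The larger class is under-sampled without replacement;
        # the smaller class is bootstrapped (sampled with replacement).
        idx = np.concatenate([
            rng.choice(pos, n, replace=len(pos) == n),
            rng.choice(neg, n, replace=len(neg) == n),
        ])
        models.append(fit_logistic(X[idx], y[idx]))
    return models


def predict_proba(models, X):
    """Average the members' predicted probabilities for the positive class."""
    return np.mean([sigmoid(X @ w + b) for w, b in models], axis=0)


def make_toy_data(n=400, d=20, pos_rate=0.15, seed=0):
    """Synthetic imbalanced data: only features 0 and 1 carry signal."""
    rng = np.random.default_rng(seed)
    y = (rng.random(n) < pos_rate).astype(int)
    X = rng.normal(size=(n, d))
    X[:, 0] += 2.0 * y
    X[:, 1] -= 2.0 * y
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    keep = select_features(X, y, n_perm=200)
    models = fit_balanced_ensemble(X[:, keep], y)
    print("selected features:", keep)
    print("mean predicted probability:", predict_proba(models, X[:, keep]).mean())
```

Resampling every ensemble member to equal class sizes is one simple way to combine under-sampling, over-sampling, and ensembling; the paper evaluates these techniques in more detail and with additional model families.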

Similar articles

1. Classification and knowledge discovery in protein databases.
   J Biomed Inform. 2004 Aug;37(4):224-39. doi: 10.1016/j.jbi.2004.07.008.
2. Hum-PLoc: a novel ensemble classifier for predicting human protein subcellular localization.
   Biochem Biophys Res Commun. 2006 Aug 18;347(1):150-7. doi: 10.1016/j.bbrc.2006.06.059. Epub 2006 Jun 21.
3. Simultaneous feature selection and clustering using mixture models.
   IEEE Trans Pattern Anal Mach Intell. 2004 Sep;26(9):1154-66. doi: 10.1109/TPAMI.2004.71.
4. Sparse multinomial logistic regression: fast algorithms and generalization bounds.
   IEEE Trans Pattern Anal Mach Intell. 2005 Jun;27(6):957-68. doi: 10.1109/TPAMI.2005.127.
5. Enhancing instance-based classification with local density: a new algorithm for classifying unbalanced biomedical data.
   Bioinformatics. 2006 Apr 15;22(8):981-8. doi: 10.1093/bioinformatics/btl027. Epub 2006 Jan 27.
6. Mining sequential patterns for protein fold recognition.
   J Biomed Inform. 2008 Feb;41(1):165-79. doi: 10.1016/j.jbi.2007.05.004. Epub 2007 May 17.
7. A nearest neighbor approach for automated transporter prediction and categorization from protein sequences.
   Bioinformatics. 2008 May 1;24(9):1129-36. doi: 10.1093/bioinformatics/btn099. Epub 2008 Mar 12.
8. Semi-supervised protein classification using cluster kernels.
   Bioinformatics. 2005 Aug 1;21(15):3241-7. doi: 10.1093/bioinformatics/bti497. Epub 2005 May 19.
9. A novel kernel method for clustering.
   IEEE Trans Pattern Anal Mach Intell. 2005 May;27(5):801-5. doi: 10.1109/TPAMI.2005.88.
10. Capitalize on dimensionality increasing techniques for improving Face Recognition Grand Challenge performance.
    IEEE Trans Pattern Anal Mach Intell. 2006 May;28(5):725-37. doi: 10.1109/TPAMI.2006.90.

Cited by

1. Cognitive Outcome Prediction in Infants With Neonatal Hypoxic-Ischemic Encephalopathy Based on Functional Connectivity and Complexity of the Electroencephalography Signal.
   Front Hum Neurosci. 2022 Jan 27;15:795006. doi: 10.3389/fnhum.2021.795006. eCollection 2021.
2. A Class-Imbalanced Deep Learning Fall Detection Algorithm Using Wearable Sensors.
   Sensors (Basel). 2021 Sep 29;21(19):6511. doi: 10.3390/s21196511.
3. ML-AdVInfect: A Machine-Learning Based Adenoviral Infection Predictor.
   Front Mol Biosci. 2021 May 7;8:647424. doi: 10.3389/fmolb.2021.647424. eCollection 2021.
4. Global Phosphoproteomic Analysis Reveals the Involvement of Phosphorylation in Aflatoxins Biosynthesis in the Pathogenic Fungus Aspergillus flavus.
   Sci Rep. 2016 Sep 26;6:34078. doi: 10.1038/srep34078.
5. Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models.
   BMC Bioinformatics. 2015 Nov 4;16:363. doi: 10.1186/s12859-015-0784-9.
6. Imbalanced class learning in epigenetics.
   J Comput Biol. 2014 Jul;21(7):492-507. doi: 10.1089/cmb.2014.0008. Epub 2014 May 5.
7. Iterative nearest neighborhood oversampling in semisupervised learning from imbalanced data.
   ScientificWorldJournal. 2013 Jul 10;2013:875450. doi: 10.1155/2013/875450. Print 2013.
8. SMOTE for high-dimensional class-imbalanced data.
   BMC Bioinformatics. 2013 Mar 22;14:106. doi: 10.1186/1471-2105-14-106.
9. Analysis of structured and intrinsically disordered regions of transmembrane proteins.
   Mol Biosyst. 2009 Dec;5(12):1688-1702. doi: 10.1039/B905913J.
10. Predicting protein disorder by analyzing amino acid sequence.
    BMC Genomics. 2008 Sep 16;9 Suppl 2(Suppl 2):S8. doi: 10.1186/1471-2164-9-S2-S8.