Feature selection in validating mass spectrometry database search results.

Authors

Fang Jianwen, Dong Yinghua, Williams Todd D, Lushington Gerald H

Affiliation

Bioinformatics Core Facility & Information and Telecommunication Technology Center, University of Kansas, 2099 Constant Dr., Lawrence, Kansas 66047, USA.

Publication information

J Bioinform Comput Biol. 2008 Feb;6(1):223-40. doi: 10.1142/s0219720008003345.

DOI: 10.1142/s0219720008003345
PMID: 18324754

Abstract

Tandem mass spectrometry (MS/MS) combined with protein database searching has been widely used in protein identification. A validation procedure is generally required to reduce the number of false positives. Advanced tools using statistical and machine learning approaches may provide faster and more accurate validation than manual inspection and empirical filtering criteria. In this study, we use two feature selection algorithms based on random forest and support vector machine to identify peptide properties that can be used to improve validation models. We demonstrate that an improved model based on an optimized set of features reduces the number of false positives by 58% relative to the model which used only search engine scores, at the same sensitivity score of 0.8. In addition, we develop classification models based on the physicochemical properties and protein sequence environment of these peptides without using search engine scores. The performance of the best model based on the support vector machine algorithm is at 0.8 AUC, 0.78 accuracy, and 0.7 specificity, suggesting a reasonably accurate classification. The identified properties important to fragmentation and ionization can be either used in independent validation tools or incorporated into peptide sequencing and database search algorithms to improve existing software programs.
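
The comparison quoted in the abstract (fewer false positives at the same sensitivity of 0.8) can be illustrated with a small sketch. This is not the authors' code or data: the function names are hypothetical, and the scores and labels are invented toy values chosen only to show how such a figure is computed. For each model, the cutoff is set to the highest score that still accepts at least 80% of the correct identifications, and the incorrect identifications passing that cutoff are counted.

```python
import math


def threshold_at_sensitivity(scores, labels, target=0.8):
    """Return the highest score cutoff whose sensitivity is at least `target`.

    `labels` uses 1 for a correct peptide identification and 0 for an incorrect one.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("at least one positive example is required")
    # Number of true positives that must be accepted to reach the target sensitivity.
    # round() guards against floating-point noise such as 0.8 * n = 8.000000000000002.
    k = max(1, math.ceil(round(target * len(positives), 9)))
    return positives[k - 1]


def false_positives(scores, labels, cutoff):
    """Count incorrect identifications whose score passes the cutoff."""
    return sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)


def fp_reduction(baseline_scores, improved_scores, labels, target=0.8):
    """Compare two scoring models at the same sensitivity.

    Returns (baseline FP count, improved FP count, relative reduction).
    """
    fp_base = false_positives(
        baseline_scores, labels, threshold_at_sensitivity(baseline_scores, labels, target)
    )
    fp_new = false_positives(
        improved_scores, labels, threshold_at_sensitivity(improved_scores, labels, target)
    )
    reduction = 0.0 if fp_base == 0 else 1 - fp_new / fp_base
    return fp_base, fp_new, reduction


# Toy data (invented): five correct and five incorrect identifications.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
baseline = [0.9, 0.8, 0.7, 0.4, 0.3, 0.85, 0.6, 0.5, 0.45, 0.1]
improved = [0.95, 0.9, 0.8, 0.7, 0.2, 0.75, 0.5, 0.3, 0.2, 0.1]

print(fp_reduction(baseline, improved, labels))  # (4, 1, 0.75)
```

In this toy case the baseline needs a cutoff of 0.4 to reach a sensitivity of 0.8 and lets four incorrect matches through, while the improved scores reach the same sensitivity at a cutoff of 0.7 with one incorrect match, a 75% reduction. The 58% figure in the abstract is the same kind of ratio computed on the authors' own data.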

Similar articles

1. Feature selection in validating mass spectrometry database search results.
   J Bioinform Comput Biol. 2008 Feb;6(1):223-40. doi: 10.1142/s0219720008003345.
2. PepSplice: cache-efficient search algorithms for comprehensive identification of tandem mass spectra.
   Bioinformatics. 2007 Nov 15;23(22):3016-23. doi: 10.1093/bioinformatics/btm417. Epub 2007 Sep 3.
3. MSDash: mass spectrometry database and search.
   Comput Syst Bioinformatics Conf. 2008;7:63-71.
4. A predictive model for identifying proteins by a single peptide match.
   Bioinformatics. 2007 Feb 1;23(3):277-80. doi: 10.1093/bioinformatics/btl595. Epub 2006 Nov 22.
5. Optimization of filtering criterion for SEQUEST database searching to improve proteome coverage in shotgun proteomics.
   BMC Bioinformatics. 2007 Aug 31;8:323. doi: 10.1186/1471-2105-8-323.
6. Identification of post-translational modifications via blind search of mass-spectra.
   Proc IEEE Comput Syst Bioinform Conf. 2005:157-66. doi: 10.1109/csb.2005.34.
7. Large-scale database searching using tandem mass spectra: looking up the answer in the back of the book.
   Nat Methods. 2004 Dec;1(3):195-202. doi: 10.1038/nmeth725.
8. Validation of tandem mass spectrometry database search results using DTASelect.
   Curr Protoc Bioinformatics. 2007 Jan;Chapter 13:Unit 13.4. doi: 10.1002/0471250953.bi1304s16.
9. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry.
   Nat Methods. 2007 Mar;4(3):207-14. doi: 10.1038/nmeth1019.
10. Robust accurate identification of peptides (RAId): deciphering MS2 data using a structured library search with de novo based statistics.
    Bioinformatics. 2005 Oct 1;21(19):3726-32. doi: 10.1093/bioinformatics/bti620. Epub 2005 Aug 16.

Cited by

1. Random forest-based protein model quality assessment (RFMQA) using structural features and potential energy terms.
   PLoS One. 2014 Sep 15;9(9):e106542. doi: 10.1371/journal.pone.0106542. eCollection 2014.
2. Improving the chances of successful protein structure determination with a random forest classifier.
   Acta Crystallogr D Biol Crystallogr. 2014 Mar;70(Pt 3):627-35. doi: 10.1107/S1399004713032070. Epub 2014 Feb 15.
3. PROTS-RF: a robust model for predicting mutation-induced protein stability changes.
   PLoS One. 2012;7(10):e47247. doi: 10.1371/journal.pone.0047247. Epub 2012 Oct 15.
4. An improved machine learning protocol for the identification of correct Sequest search results.
   BMC Bioinformatics. 2010 Dec 7;11:591. doi: 10.1186/1471-2105-11-591.
5. Detection and identification of potential biomarkers of breast cancer.
   J Cancer Res Clin Oncol. 2010 Aug;136(8):1243-54. doi: 10.1007/s00432-010-0775-1. Epub 2010 Mar 17.
6. Discovery and identification of potential biomarkers of papillary thyroid carcinoma.
   Mol Cancer. 2009 Sep 28;8:79. doi: 10.1186/1476-4598-8-79.
7. Bioinformatic analysis of xenobiotic reactive metabolite target proteins and their interacting partners.
   BMC Chem Biol. 2009 Jun 12;9:5. doi: 10.1186/1472-6769-9-5.