
Biological sequence classification utilizing positive and unlabeled data.

Author information

Xiao Yuanyuan, Segal Mark R

Affiliation

Department of Epidemiology and Biostatistics, Center for Bioinformatics and Molecular Biostatistics, University of California, San Francisco, CA 94107, USA.

Publication information

Bioinformatics. 2008 May 1;24(9):1198-205. doi: 10.1093/bioinformatics/btn089. Epub 2008 Mar 14.

Abstract

MOTIVATION

In the genomics setting, an increasingly common data configuration consists of a small set of sequences possessing a targeted property (positive instances) amongst a large set of sequences for which class membership is unknown (unlabeled instances). Traditional two-class classification methods do not effectively handle such data.

RESULTS

Here, we develop a novel method, likely positive-iterative classification (LP-IC), for this problem and contrast its performance with the few existing methods, most of which were devised for and used in text classification. LP-IC employs an iterative classification scheme and introduces a class dispersion measure, adopted from unsupervised clustering approaches, to monitor the model selection process. Using two case studies (prediction of HLA binding, and alternative splicing conservation between human and mouse), we show that LP-IC provides superior performance to existing methodologies in terms of: (i) combined accuracy and precision in positive identification from the unlabeled set; and (ii) predictive performance of the resultant classifiers on independent test data.
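The general idea of iterating a classifier over positive and unlabeled data while tracking a dispersion measure can be illustrated with a minimal Python sketch. This is not the LP-IC algorithm: the abstract does not specify the classifier, the relabeling rule, or the exact dispersion measure, so the nearest-centroid classifier, the within-class sum of squares, the toy data, and all function names below are assumptions chosen only for illustration.

```python
from statistics import fmean


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(fmean(coord) for coord in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dispersion(pos, neg):
    """Assumed dispersion measure: within-class sum of squared
    distances to each class centroid (as in k-means clustering)."""
    total = 0.0
    for group in (pos, neg):
        if group:
            c = centroid(group)
            total += sum(sq_dist(p, c) for p in group)
    return total


def iterative_pu(positives, unlabeled, max_iter=20):
    """Hypothetical iterative positive/unlabeled loop.

    Starts by treating every unlabeled point as a provisional negative,
    then repeatedly relabels unlabeled points by nearest class centroid.
    Returns the final likely-positive flags and the dispersion recorded
    after each round.
    """
    labels = [False] * len(unlabeled)
    history = []
    for _ in range(max_iter):
        likely_pos = [u for u, flag in zip(unlabeled, labels) if flag]
        likely_neg = [u for u, flag in zip(unlabeled, labels) if not flag]
        if not likely_neg:
            break  # nothing left to act as the negative class
        pos_c = centroid(positives + likely_pos)
        neg_c = centroid(likely_neg)
        new_labels = [sq_dist(u, pos_c) < sq_dist(u, neg_c) for u in unlabeled]
        new_pos = [u for u, flag in zip(unlabeled, new_labels) if flag]
        new_neg = [u for u, flag in zip(unlabeled, new_labels) if not flag]
        history.append(dispersion(positives + new_pos, new_neg))
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels, history


# Toy data (made up): three known positives, six unlabeled points.
positives = [(5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
unlabeled = [(4.9, 5.1), (5.1, 4.9), (0.0, 0.0),
             (0.2, -0.1), (-0.1, 0.3), (0.1, 0.1)]
labels, history = iterative_pu(positives, unlabeled)
```

Here `labels` flags the unlabeled points judged likely positive, and `history` holds the dispersion after each round. How LP-IC actually uses its dispersion measure for model selection is not described in the abstract, so the sketch only records the values.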
