A Study of Domain Adaptation Classifiers Derived From Logistic Regression for the Task of Splice Site Prediction

Authors

Herndon Nic, Caragea Doina

Publication information

IEEE Trans Nanobioscience. 2016 Mar;15(2):75-83. doi: 10.1109/TNB.2016.2522400. Epub 2016 Jan 28.

DOI:10.1109/TNB.2016.2522400

PMID:26849871

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4894847/

Abstract

Supervised classifiers are highly dependent on abundant labeled training data. Alternatives for addressing the lack of labeled data include: labeling data (but this is costly and time consuming); training classifiers with abundant data from another domain (however, the classification accuracy usually decreases as the distance between domains increases); or complementing the limited labeled data with abundant unlabeled data from the same domain and learning semi-supervised classifiers (but the unlabeled data can mislead the classifier). A better alternative is to train classifiers in a domain adaptation setting, using the abundant labeled data from a source domain together with the limited labeled data and, optionally, the unlabeled data from the target domain. We propose two such classifiers, based on logistic regression, and evaluate them for the task of splice site prediction, a difficult and essential step in gene prediction. Our classifiers achieved high accuracy, with highest areas under the precision-recall curve between 50.83% and 82.61%.
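The abstract does not spell out the two proposed classifiers, so the following is only a minimal sketch of the general setting it describes, not the authors' algorithms. It assumes one simple way of combining the domains: a single logistic regression trained on the pooled data, with per-example weights that give the few target-domain examples a fixed share `delta` of the total loss. The synthetic data, the `delta` parameter, and all function names are hypothetical and chosen only for illustration. The score is the area under the precision-recall curve, computed here as step-wise average precision.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_weighted_logreg(X, y, sample_weight, lr=0.5, n_iter=500, l2=1e-3):
    """Gradient descent on the weighted, L2-regularized logistic loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    total = sample_weight.sum()
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        err = sample_weight * (p - y)          # weighted residuals
        w -= lr * (X.T @ err / total + l2 * w)
        b -= lr * err.sum() / total
    return w, b

def average_precision(y_true, scores):
    """Area under the precision-recall curve (step-wise average precision)."""
    order = np.argsort(-scores)                # rank by descending score
    y_sorted = y_true[order]
    tp = np.cumsum(y_sorted)
    precision = tp / np.arange(1, len(y_sorted) + 1)
    return float((precision * y_sorted).sum() / y_sorted.sum())

def make_domain(rng, n, shift):
    """Synthetic two-class data; `shift` moves both class means (domain distance)."""
    y = (rng.random(n) < 0.3).astype(float)
    means = np.where(y[:, None] == 1, 1.0 + shift, -1.0 + shift)
    X = means + rng.normal(size=(n, 2))
    return X, y

rng = np.random.default_rng(0)
X_src, y_src = make_domain(rng, 2000, shift=0.0)   # abundant labeled source data
X_tgt, y_tgt = make_domain(rng, 40, shift=0.8)     # limited labeled target data
X_test, y_test = make_domain(rng, 1000, shift=0.8) # held-out target data

delta = 0.7  # hypothetical share of the loss assigned to the target domain
X_all = np.vstack([X_src, X_tgt])
y_all = np.concatenate([y_src, y_tgt])
weights = np.concatenate([
    np.full(len(y_src), (1 - delta) / len(y_src)),
    np.full(len(y_tgt), delta / len(y_tgt)),
])

w, b = fit_weighted_logreg(X_all, y_all, weights)
auprc = average_precision(y_test, sigmoid(X_test @ w + b))
print(f"target-domain auPRC: {auprc:.3f}")
```

Setting `delta` to 0 reduces this sketch to training on the source domain only, and setting it to 1 uses only the limited target data, which are two of the alternatives the abstract contrasts with domain adaptation.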

Similar articles

A statistical approach for 5' splice site prediction using short sequence motifs and without encoding sequence data.

BMC Bioinformatics. 2014 Nov 25;15:362. doi: 10.1186/s12859-014-0362-6.

Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition.

Gene. 2019 Jul 15;705:113-126. doi: 10.1016/j.gene.2019.04.047. Epub 2019 Apr 19.

An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets.

BMC Syst Biol. 2015;9 Suppl 5(Suppl 5):S1. doi: 10.1186/1752-0509-9-S5-S1. Epub 2015 Sep 1.

AucPR: an AUC-based approach using penalized regression for disease prediction with high-dimensional omics data.

BMC Genomics. 2014;15 Suppl 10(Suppl 10):S1. doi: 10.1186/1471-2164-15-S10-S1. Epub 2014 Dec 12.

An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction.

IEEE/ACM Trans Comput Biol Bioinform. 2012 Sep-Oct;9(5):1387-98. doi: 10.1109/TCBB.2012.53.

In vivo and In vitro methods to identify DNA sequence variants that alter RNA Splicing.

Curr Protoc Hum Genet. 2018 Apr;97(1):e60. doi: 10.1002/cphg.60. Epub 2018 Apr 26.

Statistical geometry based prediction of nonsynonymous SNP functional effects using random forest and neuro-fuzzy classifiers.

Proteins. 2008 Jun;71(4):1930-9. doi: 10.1002/prot.21838.

Estimating classification accuracy in positive-unlabeled learning: characterization and correction strategies.

Pac Symp Biocomput. 2019;24:124-135.

A transfer learning approach via procrustes analysis and mean shift for cancer drug sensitivity prediction.

J Bioinform Comput Biol. 2018 Jun;16(3):1840014. doi: 10.1142/S0219720018400140.
