Herndon Nic, Caragea Doina
IEEE Trans Nanobioscience. 2016 Mar;15(2):75-83. doi: 10.1109/TNB.2016.2522400. Epub 2016 Jan 28.
Supervised classifiers depend heavily on abundant labeled training data. When labeled data are scarce, the alternatives include: labeling more data, which is costly and time-consuming; training classifiers with abundant data from another domain, although classification accuracy usually decreases as the distance between domains increases; or complementing the limited labeled data with abundant unlabeled data from the same domain and learning semi-supervised classifiers, although the unlabeled data can mislead the classifier. A better alternative is to use the abundant labeled data from a source domain together with the limited labeled data, and optionally the unlabeled data, from the target domain to train classifiers in a domain adaptation setting. We propose two such classifiers, based on logistic regression, and evaluate them on the task of splice site prediction, a difficult and essential step in gene prediction. Our classifiers achieved high accuracy, with highest areas under the precision-recall curve between 50.83% and 82.61%.
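To make the domain adaptation setting concrete, the following is a minimal sketch of one simple way to combine abundant source-domain labels with scarce target-domain labels in a logistic regression model. It is not the authors' classifiers: the instance-weighting scheme, the `target_weight` parameter, and all function names are assumptions introduced here for illustration, and the unlabeled target data are not used.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_weighted_logreg(X, y, sample_weight, lr=0.1, n_iter=2000, l2=1e-3):
    """Fit logistic regression by gradient descent on a weighted log-loss."""
    n_features = X.shape[1]
    w = np.zeros(n_features)
    b = 0.0
    total = sample_weight.sum()
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        # Per-instance gradient of the log-loss, scaled by the instance weight.
        err = sample_weight * (p - y)
        w -= lr * (X.T @ err / total + l2 * w)
        b -= lr * err.sum() / total
    return w, b


def fit_domain_adapted(X_src, y_src, X_tgt, y_tgt, target_weight=0.8):
    """Hypothetical scheme (an assumption, not the paper's method):
    the target domain receives total weight `target_weight` and the
    source domain receives `1 - target_weight`, regardless of how many
    instances each domain contributes."""
    X = np.vstack([X_src, X_tgt])
    y = np.concatenate([y_src, y_tgt]).astype(float)
    weights = np.concatenate([
        np.full(len(y_src), (1.0 - target_weight) / len(y_src)),
        np.full(len(y_tgt), target_weight / len(y_tgt)),
    ])
    return fit_weighted_logreg(X, y, weights)


def predict_proba(X, w, b):
    """Probability of the positive class for each row of X."""
    return sigmoid(X @ w + b)
```

Under these assumptions, `fit_domain_adapted(X_src, y_src, X_tgt, y_tgt)` returns a weight vector and bias that can be passed to `predict_proba`. Raising `target_weight` makes the few target examples count for more, which would matter as the distance between the source and target domains grows.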