Semi-supervised oblique predictive clustering trees.

Authors

Stepišnik Tomaž, Kocev Dragi

Affiliations

Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia.

Jožef Stefan International Postgraduate School, Ljubljana, Slovenia.

Publication information

PeerJ Comput Sci. 2021 May 3;7:e506. doi: 10.7717/peerj-cs.506. eCollection 2021.

DOI:10.7717/peerj-cs.506

PMID:33987461

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8101547/

Abstract

Semi-supervised learning combines supervised and unsupervised learning approaches to learn predictive models from both labeled and unlabeled data. It is most appropriate for problems where labeled examples are difficult to obtain but unlabeled examples are readily available (e.g., drug repurposing). Semi-supervised predictive clustering trees (SSL-PCTs) are a prominent method for semi-supervised learning that achieves good performance on various predictive modeling tasks, including structured output prediction tasks. Their main drawback, however, is that the learning time scales quadratically with the number of features. In contrast to axis-parallel trees, which split the data on individual features only, oblique predictive clustering trees (SPYCTs) split on linear combinations of features. This makes the splits more flexible and expressive and often leads to better predictive performance. With a carefully designed criterion function, efficient optimization techniques can be used to learn oblique splits. In this paper, we propose semi-supervised oblique predictive clustering trees (SSL-SPYCTs). We adjust the split learning to take unlabeled examples into account while remaining efficient. The main advantage over SSL-PCTs is that the learning time of the proposed method scales linearly with the number of features. The experimental evaluation confirms the theoretical computational advantage and shows that SSL-SPYCTs often outperform SSL-PCTs and supervised PCTs in both single-tree and ensemble settings. We also show that SSL-SPYCTs are better at producing meaningful feature importance scores than supervised SPYCTs when the amount of labeled data is limited.
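
To make the distinction between axis-parallel and oblique splits concrete, the following minimal Python sketch compares the two kinds of test on toy data. It is only an illustration and not the split-learning procedure of the paper: the data set, the function names, the brute-force threshold search, and the hand-chosen weights are all assumptions made for this example, and no unlabeled data or criterion optimization is involved.

```python
import random

def axis_parallel_split(x, feature, threshold):
    """Axis-parallel test: compare a single feature against a threshold."""
    return x[feature] <= threshold

def oblique_split(x, weights, bias):
    """Oblique test: compare a linear combination of all features against zero."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias <= 0.0

def accuracy(predict, points, labels):
    """Fraction of points for which the split test agrees with the label."""
    hits = sum(predict(x) == y for x, y in zip(points, labels))
    return hits / len(points)

# Toy data (hypothetical): the label is True when x0 + x1 <= 1,
# so the class boundary is a diagonal line in the unit square.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(500)]
labels = [x0 + x1 <= 1.0 for x0, x1 in points]

# Best single axis-parallel split, found by brute force over both
# features and a coarse grid of thresholds.
best_axis = max(
    accuracy(lambda x, f=f, t=t: axis_parallel_split(x, f, t), points, labels)
    for f in range(2)
    for t in [i / 20 for i in range(21)]
)

# One oblique split with hand-chosen weights: x0 + x1 - 1 <= 0.
oblique_acc = accuracy(
    lambda x: oblique_split(x, (1.0, 1.0), -1.0), points, labels
)

print(f"best axis-parallel accuracy: {best_axis:.3f}")
print(f"oblique accuracy:            {oblique_acc:.3f}")
```

On this data a single oblique split reproduces the diagonal boundary exactly, whereas any single axis-parallel split misclassifies a substantial share of the points (roughly a quarter for uniformly distributed data). An axis-parallel tree would need many stacked splits to approximate the same boundary.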
