Brouard Céline, Mariette Jérôme, Flamary Rémi, Vialaneix Nathalie
Université de Toulouse, INRAE, UR MIAT, F-31320, Castanet-Tolosan, France.
École Polytechnique, CMAP, F-91120, Palaiseau, France.
NAR Genom Bioinform. 2022 Mar 7;4(1):lqac014. doi: 10.1093/nargab/lqac014. eCollection 2022 Mar.
The rapid development of high-throughput biotechnologies has made large-scale multi-omics datasets increasingly available. Processing and integrating this large volume of information, often obtained from widely heterogeneous sources, raises new challenges. Kernel methods have proven successful in analyzing different types of datasets obtained on the same individuals. However, they usually lack interpretability, because the original description of the individuals is lost in the kernel embedding. We propose novel feature selection methods that are adapted to the kernel framework and go beyond the well-established work in supervised learning by addressing the more difficult tasks of unsupervised learning and kernel output learning. The method is expressed as a non-convex optimization problem with an ℓ1 penalty, which is solved with a proximal gradient descent approach. It is tested on several systems biology datasets and performs well in selecting relevant and less redundant features compared to existing alternatives. It also proved relevant for identifying the governmental measures that best explain the time series of the Covid-19 reproduction number during the first months of 2020. The proposed feature selection method is embedded in the R package mixKernel version 0.8, published on CRAN. Installation instructions are available at http://mixkernel.clementine.wf/.
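To make the optimization idea concrete, the following is a minimal sketch of proximal gradient descent with an ℓ1 penalty, written in plain Python. It is an illustration only: it minimizes a convex least-squares loss rather than the non-convex kernel objective described above, and the function names (`soft_threshold`, `proximal_gradient_l1`), the toy data, and the parameters `lam` and `step` are assumptions made for this example, not part of mixKernel. The point it shows is how the proximal step (soft-thresholding) sets small weights exactly to zero, which is what turns the penalty into a feature selector.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def proximal_gradient_l1(X, y, lam, step, n_iter=100):
    """Minimize 0.5 * ||X w - y||^2 + lam * ||w||_1 (illustrative convex case).

    X is a list of rows, y a list of targets; step should be at most
    1 / L, where L is the largest eigenvalue of X^T X.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        # Gradient of the smooth part: X^T (X w - y)
        residual = [sum(X[i][j] * w[j] for j in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        # Gradient step followed by the l1 proximal step
        w = [soft_threshold(w[j] - step * grad[j], step * lam) for j in range(p)]
    return w


# Toy data (hypothetical): identity design, so L = 1 and step = 1 is valid.
X = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
y = [3.0, -2.0, 0.3, -0.1]

w_hat = proximal_gradient_l1(X, y, lam=0.5, step=1.0)
selected = [j for j, w in enumerate(w_hat) if w != 0.0]
print(w_hat, selected)
```

In this toy setting the two features with weak signal receive a weight of exactly zero and are therefore discarded, while the two strong ones are kept with shrunken weights. A non-convex objective such as the one in the abstract would use the same two-step update, but only convergence to a local solution could be expected.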