Computational Bioscience Research Center, Division of Chemical and Life Sciences and Engineering, King Abdullah University for Science and Technology (KAUST), Jeddah, Kingdom of Saudi Arabia.
Bioinformatics. 2010 Sep 15;26(18):i531-9. doi: 10.1093/bioinformatics/btq376.
Nonlinear small datasets, characterized by few samples and a very large number of measured features, occur frequently in computational biology and are difficult to analyze. Unsupervised hybrid-two-phase (H2P) procedures, specifically dimension reduction (DR) coupled with clustering, provide valuable assistance not only for unsupervised data classification but also for visualization of the patterns hidden in high-dimensional feature space.
'Minimum Curvilinearity' (MC) is a principle that, for small datasets, suggests approximating curvilinear sample distances in the feature space by pair-wise distances over their minimum spanning tree (MST), thereby avoiding the introduction of any tuning parameter. MC is used to design two novel forms of nonlinear machine learning (NML): Minimum Curvilinear embedding (MCE) for DR, and Minimum Curvilinear affinity propagation (MCAP) for clustering.
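The MC idea of measuring distances over the MST can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes Euclidean distances between samples in the feature space and takes the MC distance between two samples to be the sum of edge weights along the unique path that connects them in the MST. The function names (`euclidean`, `minimum_spanning_tree`, `mc_distance_matrix`) are hypothetical and chosen only for this illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two samples (assumed base metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete graph of samples.

    Returns an adjacency list {i: [(j, weight), ...]} describing the MST.
    """
    n = len(points)
    adjacency = {i: [] for i in range(n)}
    if n == 0:
        return adjacency
    in_tree = [False] * n
    best_cost = [math.inf] * n   # cheapest known edge linking each node to the tree
    best_parent = [-1] * n       # tree endpoint of that cheapest edge
    best_cost[0] = 0.0
    for _ in range(n):
        # Pick the node outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_cost[i])
        in_tree[u] = True
        if best_parent[u] >= 0:
            w = best_cost[u]
            adjacency[u].append((best_parent[u], w))
            adjacency[best_parent[u]].append((u, w))
        # Relax the attachment cost of the remaining nodes.
        for v in range(n):
            if not in_tree[v]:
                d = euclidean(points[u], points[v])
                if d < best_cost[v]:
                    best_cost[v] = d
                    best_parent[v] = u
    return adjacency


def mc_distance_matrix(points):
    """Pair-wise distances measured along MST paths (no tuning parameter)."""
    tree = minimum_spanning_tree(points)
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for source in range(n):
        # Depth-first walk of the tree, accumulating path length from source.
        stack = [(source, -1, 0.0)]
        while stack:
            node, parent, acc = stack.pop()
            dist[source][node] = acc
            for neighbour, w in tree[node]:
                if neighbour != parent:
                    stack.append((neighbour, node, acc + w))
    return dist


if __name__ == "__main__":
    # Four samples arranged along a horseshoe-like curve (toy data).
    samples = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.2)]
    for row in mc_distance_matrix(samples):
        print([round(value, 3) for value in row])
```

In the toy data, the first and last samples are close in straight-line terms but far apart along the curve, and the tree-path distance reflects the latter. A matrix of this kind could then be passed to an embedding or clustering step; how MCE and MCAP do so is not specified in this abstract and is not shown here.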
Compared with several other unsupervised and supervised algorithms, MCE and MCAP, whether individually or combined in H2P, overcome the limits of classical approaches. High performance was attained in the visualization and classification of: (i) pain patients (proteomic measurements) in peripheral neuropathy; (ii) human organ tissues (genomic transcription factor measurements) on the basis of their embryological origin.
MC provides a valuable framework to estimate nonlinear distances in small datasets. Its extension to large datasets is prefigured for novel NMLs. Classification of neuropathic pain by proteomic profiles offers new insights for future molecular and systems biology characterization of pain. Improvements in tissue embryological classification refine results obtained in an earlier study, and suggest a possible reinterpretation of skin attribution as mesodermal.
https://sites.google.com/site/carlovittoriocannistraci/home.