Quinn Thomas P, Erb Ionas
Applied Artificial Intelligence Institute, Deakin University, 75 Pigdons Rd, Waurn Ponds VIC 3216, Geelong, Australia.
Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Carrer del Dr. Aiguader, 88, 08003, Barcelona, Spain.
NAR Genom Bioinform. 2020 Oct 2;2(4):lqaa076. doi: 10.1093/nargab/lqaa076. eCollection 2020 Dec.
Many next-generation sequencing datasets contain only relative information because biological and technical factors limit the total number of transcripts observed for a given sample. It is therefore not possible to interpret any one component in isolation. The field of compositional data analysis has emerged with alternative methods for relative data based on log-ratio transforms. However, these data often contain many more features than samples, and thus require creative new ways to reduce their dimensionality. The summation of parts, called amalgamation, is a practical way of reducing dimensionality, but can introduce a non-linear distortion to the data. We exploit this non-linearity to propose a powerful yet interpretable dimension reduction method called data-driven amalgamation. Our new method, implemented in the user-friendly R package amalgam, can reduce the dimensionality of compositional data by finding amalgamations that optimally (i) preserve the distance between samples, or (ii) classify samples as diseased or not. Our benchmark on 13 real datasets confirms that these amalgamations compete with state-of-the-art methods in terms of performance, but result in new features that are easily understood: they are groups of parts added together.
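As a minimal illustration of what an amalgamation is, the Python sketch below sums parts of a small composition into groups and compares log-ratio distances between samples before and after. It is not the API of the amalgam R package: the function names (`closure`, `amalgamate`, `clr`, `aitchison_distance`), the toy counts, and the fixed `groups` assignment are all assumptions made for illustration. The distance used here is the Euclidean distance between centred log-ratio vectors, which is one common choice in compositional data analysis and may differ from the one used in the paper.

```python
import numpy as np

def closure(x):
    """Rescale each row (sample) so that its parts sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=1, keepdims=True)

def amalgamate(x, groups, n_groups):
    """Sum the columns (parts) of x that share a group label."""
    x = np.asarray(x, dtype=float)
    out = np.zeros((x.shape[0], n_groups))
    for part, group in enumerate(groups):
        out[:, group] += x[:, part]
    return out

def clr(x):
    """Centred log-ratio transform: log of each part minus the row mean of logs."""
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

def aitchison_distance(x):
    """Pairwise Euclidean distance between clr-transformed samples."""
    z = clr(x)
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Hypothetical toy data: 3 samples, 4 parts (no zeros, so logs are defined).
counts = np.array([
    [10, 20, 30, 40],
    [5, 5, 45, 45],
    [25, 25, 25, 25],
])
comp = closure(counts)

# Hypothetical assignment: parts 0 and 1 form group 0, parts 2 and 3 form group 1.
groups = [0, 0, 1, 1]
amalg = amalgamate(comp, groups, n_groups=2)

dist_full = aitchison_distance(comp)    # distances on the 4-part data
dist_amalg = aitchison_distance(amalg)  # distances on the 2-part amalgamation

print(amalg)
print(dist_full.round(3))
print(dist_amalg.round(3))
```

Because the logarithm of a sum is not the sum of the logarithms, the two distance matrices generally differ, which is the non-linear distortion mentioned above. A data-driven approach would search over possible `groups` assignments for one that optimises an objective, such as agreement between the two distance matrices or classification accuracy; that search is not shown here.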