Ultsch Alfred, Lötsch Jörn
DataBionics Research Group, University of Marburg, Hans-Meerwein-Straße, 35032 Marburg, Germany.
Institute of Clinical Pharmacology, Goethe-University, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae640.
Small sample sizes in biomedical research often lead to poor reproducibility and make it difficult to translate findings into clinical applications. The problem stems from limited study resources, rare diseases, ethical considerations in animal studies, and costly expert diagnosis, among other causes. To address it, we propose a novel generative algorithm based on self-organizing maps (SOMs) that computationally increases sample sizes. The proposed unsupervised generative algorithm uses neural networks to detect inherent structure even in small multivariate datasets, distinguishing between sparse "void" and dense "cloud" regions. Using emergent SOMs (ESOMs), the algorithm adapts to high-dimensional data structures and generates k new points for each original data point by randomly selecting positions within an adapted hypersphere, with distances based on valid neighborhood probabilities. Experiments on artificial and biomedical (omics) datasets show that the generated data preserve the original structure without introducing artifacts. Random forests and support vector machines cannot distinguish between generated and original data, and the variables of the original and generated datasets are not statistically different. The method successfully augments small group sizes, such as transcriptomics data from a rare form of leukemia and lipidomics data from arthritis research. The novel ESOM-based generative algorithm is a promising way to enlarge small or rare-case datasets, even when limited training data are available. It can address challenges associated with small sample sizes in biomedical research and offers a tool for improving the reliability and robustness of scientific findings in this field. Availability: R library "Umatrix" (https://cran.r-project.org/package=Umatrix).
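The sampling step described in the abstract (k new points per original point, drawn at random inside a hypersphere) can be illustrated with a minimal Python sketch. This is not the published algorithm and does not use the Umatrix package: the abstract derives the hypersphere distances from ESOM neighborhood probabilities, whereas the sketch below substitutes a simple stand-in rule, a fixed fraction of each point's nearest-neighbor distance. The names `augment`, `sample_in_hypersphere`, `nearest_neighbor_distance`, and the parameter `radius_fraction` are hypothetical and chosen only for illustration.

```python
import math
import random


def nearest_neighbor_distance(points, i):
    """Euclidean distance from points[i] to its closest other point."""
    return min(math.dist(points[i], q) for j, q in enumerate(points) if j != i)


def sample_in_hypersphere(center, radius, rng):
    """Draw one point uniformly from the ball of the given radius around center."""
    d = len(center)
    # A Gaussian vector normalized to unit length gives a uniform direction.
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in direction))
    # Scaling by u**(1/d) makes the draw uniform in volume, not in radius.
    r = radius * rng.random() ** (1.0 / d)
    return [c + r * x / norm for c, x in zip(center, direction)]


def augment(points, k, radius_fraction=0.5, seed=0):
    """Generate k new points around each original point.

    ASSUMPTION: the radius is a fixed fraction of the nearest-neighbor
    distance. The ESOM-based method adapts this distance from the learned
    map instead; that part is not reproduced here.
    """
    rng = random.Random(seed)
    generated = []
    for i, p in enumerate(points):
        radius = radius_fraction * nearest_neighbor_distance(points, i)
        generated.extend(sample_in_hypersphere(p, radius, rng) for _ in range(k))
    return generated
```

Calling `augment(points, k=4)` on a list of at least two numeric vectors returns four generated vectors per original point, in the same order as the input. Keeping the radius below the nearest-neighbor distance is one crude way to avoid placing generated points in the sparse regions between clusters; the ESOM-based approach in the abstract addresses that problem through the learned data structure.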