Faculty of Computer Science, Białystok University of Technology, Białystok, Poland.
Institute of Biocybernetics and Biomedical Engineering, Polish Academy of Sciences, Warsaw, Poland.
J Appl Genet. 2022 May;63(2):361-368. doi: 10.1007/s13353-022-00691-2. Epub 2022 Mar 24.
Rare disease datasets typically consist of a small number of patients (cases), each represented by a multidimensional feature vector. In this report, we considered a rare disease, mucopolysaccharidosis (MPS). This disease is divided into 11 types and subtypes, depending on the genetic defect, the type of deficient enzyme, and the nature of the accumulated glycosaminoglycan(s). Of these, 7 are known to be possibly neuronopathic and 4 are non-neuronopathic; for the former group, predicting the course of the disease is crucial for patient treatment and management. Here, we used transcriptomic data available for one patient from each MPS type/subtype. Our approach to gene grouping was based on minimizing the perceptron criterion, which takes the form of a convex and piecewise linear (CPL) function. This approach allows complexes of linear classifiers to be designed from small samples of multivariate vectors. As a result, neuronopathic and non-neuronopathic forms of MPS could be distinguished by bioinformatic analysis of gene expression patterns, even though each MPS type was represented by only one patient. This approach could potentially also be used to assess other features of patients suffering from rare diseases for which a large body of data (such as transcriptomic data) is available from only one or a few representatives.
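As a rough illustration of what minimizing a perceptron criterion of the CPL type can look like, the Python sketch below builds a single linear classifier from a few short vectors. It rests on several assumptions that are not stated in the abstract: the penalties take the common hinge-like form with a unit margin, the minimization uses plain subgradient descent (the abstract does not say which procedure was used), and the four three-dimensional vectors are invented toy values, not transcriptomic data. The names cpl_criterion, minimize_cpl, and decision are hypothetical.

```python
# Toy sketch of a perceptron criterion built from convex and piecewise
# linear (CPL) penalties. All names and data here are illustrative.


def decision(w, theta, x):
    """Signed value w.x - theta of a linear classifier for vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) - theta


def cpl_criterion(w, theta, positives, negatives):
    """Sum of hinge-like CPL penalties over both learning sets.

    A positive vector is penalized when w.x - theta < 1,
    a negative vector when w.x - theta > -1 (assumed unit margin).
    """
    pos_penalty = sum(max(0.0, 1.0 - decision(w, theta, x)) for x in positives)
    neg_penalty = sum(max(0.0, 1.0 + decision(w, theta, x)) for x in negatives)
    return pos_penalty + neg_penalty


def minimize_cpl(positives, negatives, steps=500, lr=0.05):
    """Minimize the criterion by batch subgradient descent (an assumption)."""
    dim = len(positives[0])
    w, theta = [0.0] * dim, 0.0
    labelled = [(x, 1.0) for x in positives] + [(x, -1.0) for x in negatives]
    for _ in range(steps):
        grad_w, grad_theta = [0.0] * dim, 0.0
        for x, sign in labelled:
            if sign * decision(w, theta, x) < 1.0:  # penalty is active
                for i in range(dim):
                    grad_w[i] -= sign * x[i]
                grad_theta += sign
        if grad_theta == 0.0 and all(g == 0.0 for g in grad_w):
            break  # no active penalties: the criterion has reached zero
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        theta -= lr * grad_theta
    return w, theta


# Hypothetical toy vectors standing in for two groups of cases.
positives = [[2.0, 0.5, 1.0], [1.5, 0.2, 0.8]]
negatives = [[0.2, 1.5, 1.1], [0.4, 1.8, 0.9]]

w, theta = minimize_cpl(positives, negatives)
print("criterion before:", cpl_criterion([0.0] * 3, 0.0, positives, negatives))
print("criterion after: ", cpl_criterion(w, theta, positives, negatives))
```

A criterion value of zero means that every toy vector lies on the correct side of the hyperplane with the assumed margin. With far more features than cases, as in transcriptomic data, many such separating hyperplanes exist, which is why the choice of criterion and of feature subsets matters.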