Shi Xiuchao, Yue Chunxiao, Quan Meiping, Li Yalin, Nashwan Sam Hiba
College of Environment and Life Sciences, Weinan Normal University, Weinan, 714099, Shaanxi, China.
Weinan Junior Middle School, Weinan, 714000, Shaanxi, China.
J Cancer Res Clin Oncol. 2024 Jan 2;150(1):3. doi: 10.1007/s00432-023-05559-4.
In recent decades, many theories have been proposed about the causes of hereditary diseases such as cancer. Most studies, however, identify genetic and environmental factors as the most important contributors. Gene expression data carry valuable information about hereditary diseases, and analyzing them can reveal relationships between these diseases.
Damaged genes in various diseases can be identified by discovering cell-to-cell biological communications. Extracting intercellular communications can also reveal relationships between different diseases; for example, gene disorders may damage the same cells in both breast and blood cancers. Hence, the purpose of this study is to discover cell-to-cell biological communications in gene expression data.
Clustering algorithms have been widely used to identify cell-to-cell biological communications in various cancers. However, the field remains open because of the abundance of unprocessed gene expression data. Accordingly, this paper develops a semi-supervised ensemble clustering algorithm that can discover relationships between different diseases by extracting cell-to-cell biological communications. The proposed clustering framework includes a stratified feature sampling mechanism and a novel similarity metric to handle high-dimensional data and improve the diversity of the primary partitions.
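As a rough illustration of two ingredients named above, the following sketch shows one way stratified feature sampling and a consensus over several base partitions could look. It is not the authors' implementation: the function names, the strata, and the sampling fraction are hypothetical, and a standard co-association matrix stands in for the paper's novel similarity metric, which the abstract does not specify. The semi-supervised component is omitted.

```python
import random


def stratified_feature_sample(strata, fraction, rng):
    """Pick the same fraction of feature indices from every stratum.

    `strata` maps a stratum name to a list of feature indices (hypothetical
    grouping; the abstract does not say how features are stratified).
    """
    chosen = []
    for features in strata.values():
        k = max(1, round(fraction * len(features)))
        chosen.extend(rng.sample(features, k))
    return sorted(chosen)


def co_association(partitions, n_items):
    """Fraction of base partitions in which each pair of items shares a cluster.

    This is a standard ensemble-clustering device, used here only as a
    stand-in for the paper's similarity metric.
    """
    counts = [[0] * n_items for _ in range(n_items)]
    for labels in partitions:
        for i in range(n_items):
            for j in range(n_items):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


# Toy usage: 10 features in two made-up strata, 4 items, 3 base partitions.
strata = {"low_variance": [0, 1, 2, 3], "high_variance": [4, 5, 6, 7, 8, 9]}
subset = stratified_feature_sample(strata, 0.5, random.Random(0))

partitions = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]]
co = co_association(partitions, 4)
```

In a full ensemble, each base partition would be produced by clustering the data restricted to one sampled feature subset; drawing from every stratum keeps each subset representative while the random choice within strata makes the base partitions differ from one another.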
The performance of the proposed clustering algorithm is verified on several datasets from the UCI Machine Learning Repository, and the algorithm is then applied to the FANTOM5 dataset to extract cell-to-cell biological communications. The version of this dataset used here contains 108 cells and 86,427 promoters from 702 samples. The strength of communication between two similar cells from different diseases indicates the relationship between those diseases. Here, the strength of communication is measured by the number of promoters; the highest cell-to-cell biological communication was found between "basophils" and "ciliary.epithelial.cells", with 62,809 promoters.
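The idea of ranking cell pairs by promoter count can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: it assumes each cell has a set of "active" promoters and that communication strength is the number of promoters two cells share. The function name, the cell names, and the promoter IDs are toy placeholders, not FANTOM5 data.

```python
from itertools import combinations


def strongest_pair(active_promoters):
    """Return the cell pair sharing the most promoters, and that count.

    `active_promoters` maps a cell name to a set of promoter IDs
    (assumed representation; how a promoter counts as active is not
    stated in the abstract).
    """
    best_pair, best_count = None, -1
    for a, b in combinations(sorted(active_promoters), 2):
        shared = len(active_promoters[a] & active_promoters[b])
        if shared > best_count:
            best_pair, best_count = (a, b), shared
    return best_pair, best_count


# Toy data for illustration only.
toy = {
    "cell_A": {"p1", "p2", "p3", "p4"},
    "cell_B": {"p2", "p3", "p4", "p5"},
    "cell_C": {"p1", "p5"},
}
pair, count = strongest_pair(toy)
```

For this toy input, `cell_A` and `cell_B` share three promoters, more than any other pair, so they are returned as the strongest pair.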
The maximum cell-to-cell biological similarity in each cluster can be used to detect the relationship between different diseases such as cancer.