IEEE/ACM Trans Comput Biol Bioinform. 2020 Jan-Feb;17(1):207-219. doi: 10.1109/TCBB.2018.2849362. Epub 2018 Jun 21.
Gene Ontology (GO) is a dynamic vocabulary for describing the cellular functions of proteins and genes. It comprises three sub-ontologies: Biological Process, Cellular Component, and Molecular Function. GO has several applications in bioinformatics, such as annotating and measuring gene-gene or protein-protein semantic similarity and identifying genes or proteins by their GO annotations for disease-gene and target discovery. Several measures of semantic similarity between genes have been proposed in the literature; they rely on the information content of GO terms, the GO tree structure, or a combination of both. However, most existing semantic similarity measures do not consider the different topological and information-theoretic aspects of GO terms collectively. Motivated by this gap, in this article we first propose three novel semantic similarity/distance measures for genes that cover different aspects of the GO tree. These measures are then embedded in the frameworks of well-known multi-objective and single-objective clustering algorithms to identify functionally similar genes. For comparative analysis, 10 popular existing GO-based semantic similarity/distance measures and tools are also considered. Experimental results on Mouse genome, Yeast, and Human genome datasets demonstrate the superiority of multi-objective clustering algorithms used together with the proposed multi-factored similarity/distance measures. The clustering outcomes are further validated with biological and statistical significance tests. Supplementary information is available at https://www.iitp.ac.in/sriparna/journals.html.
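For readers unfamiliar with information-content-based measures, the following is a minimal sketch of one well-known existing measure of this kind (Resnik similarity), not one of the three measures proposed in the article. The toy ontology, term names, gene names, and the choice of the maximum over term pairs as the gene-level score are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms (not real GO terms).
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
    "AB": {"A", "B"},
}

# Hypothetical direct annotations: gene -> set of terms.
ANNOTATIONS = {
    "g1": {"A1"},
    "g2": {"A2"},
    "g3": {"AB"},
    "g4": {"B"},
}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = log(N / n_t), where n_t counts genes annotated to t or a descendant."""
    counts = {t: 0 for t in PARENTS}
    for terms in ANNOTATIONS.values():
        covered = set().union(*(ancestors(t) for t in terms))
        for t in covered:
            counts[t] += 1
    n_genes = len(ANNOTATIONS)
    return {t: math.log(n_genes / c) for t, c in counts.items() if c > 0}


def resnik(t1, t2, ic):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(ic[t] for t in common)


def gene_similarity(g1, g2, ic):
    """Gene-level score taken as the maximum over all annotated term pairs."""
    return max(resnik(a, b, ic) for a in ANNOTATIONS[g1] for b in ANNOTATIONS[g2])


if __name__ == "__main__":
    ic = information_content()
    for pair in [("g1", "g2"), ("g1", "g4"), ("g3", "g4")]:
        print(pair, round(gene_similarity(*pair, ic), 4))
```

In this sketch the information content captures how rarely a term is used, while the ancestor sets capture the graph structure; these are the two kinds of signal that the abstract says existing measures draw on separately or in combination.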