Discovering interesting molecular substructures for molecular classification.

Affiliation

Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong.

Publication information

IEEE Trans Nanobioscience. 2010 Jun;9(2):77-89. doi: 10.1109/TNB.2010.2042609.

DOI:10.1109/TNB.2010.2042609

PMID:20650702

Abstract

Given a set of molecular structure data preclassified into a number of classes, the molecular classification problem is concerned with discovering interesting structural patterns in the data so that "unseen" molecules not originally in the dataset can be accurately classified. To tackle the problem, interesting molecular substructures have to be discovered. This is typically done by first representing molecular structures as molecular graphs and then using graph-mining algorithms to discover frequently occurring subgraphs in them. These subgraphs are then used to characterize the different classes for molecular classification. While such an approach can be very effective, a substructure that occurs frequently in one class may also occur frequently in another. Discovering frequent subgraphs alone may, therefore, not always be the most effective basis for molecular classification. In this paper, we propose a novel technique called mining interesting substructures in molecular data for classification (MISMOC) that can discover interesting frequent subgraphs not just for characterizing a molecular class but also for distinguishing it from the others. Using a test statistic, MISMOC screens each frequent subgraph to determine whether it is interesting. For those that are interesting, their degrees of interestingness are determined using an information-theoretic measure. When an unseen molecule is classified, its structure is matched against the interesting subgraphs in each class, and a total interestingness measure for assigning the molecule to a particular class is computed from the interestingness of each matched subgraph. The performance of MISMOC is evaluated using both artificial and real datasets, and the results show that it can be an effective approach for molecular classification.
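The screening and scoring steps described in the abstract can be illustrated with a small sketch. The abstract does not name the test statistic or the information-theoretic measure, so the code below makes its own assumptions: an adjusted residual compared against 1.96 stands in for the test statistic, and a smoothed log-ratio weight stands in for the degree of interestingness. Molecules are simplified to sets of substructure labels that are assumed to have been found already by a graph miner; the function names `screen_and_score` and `classify`, the thresholds, and the toy data are all hypothetical and not taken from the paper.

```python
import math
from collections import defaultdict


def screen_and_score(molecules, labels, min_support=0.5, z_crit=1.96):
    """Return {class: {substructure: weight}} for interesting substructures.

    molecules: list of sets of substructure labels (assumed pre-mined).
    labels:    list of class labels, one per molecule.
    The statistic and the weight are illustrative assumptions.
    """
    n = len(molecules)
    class_count = defaultdict(int)
    sub_count = defaultdict(int)
    joint = defaultdict(int)
    for subs, y in zip(molecules, labels):
        class_count[y] += 1
        for s in subs:
            sub_count[s] += 1
            joint[(s, y)] += 1

    weights = defaultdict(dict)
    for (s, c), observed in joint.items():
        n_c, n_s = class_count[c], sub_count[s]
        if observed / n_c < min_support:
            continue  # not frequent within class c
        expected = n_s * n_c / n
        variance = (1 - n_s / n) * (1 - n_c / n)
        if variance == 0:
            continue  # occurs in every molecule (or only one class): no contrast
        # Assumed test statistic: adjusted residual of the (substructure, class) cell.
        z = (observed - expected) / math.sqrt(expected * variance)
        if z <= z_crit:
            continue  # frequent, but not significantly tied to class c
        # Assumed interestingness: log ratio of smoothed occurrence rates
        # inside class c versus outside it.
        p_in = (observed + 1) / (n_c + 2)
        p_out = (n_s - observed + 1) / (n - n_c + 2)
        weights[c][s] = math.log2(p_in / p_out)
    return dict(weights)


def classify(substructures, weights):
    """Sum the weights of matched substructures per class; pick the largest."""
    totals = {
        c: sum(w for s, w in ws.items() if s in substructures)
        for c, ws in weights.items()
    }
    return max(totals, key=totals.get), totals


# Toy data: "ring" is frequent in both classes, so it should be screened out.
train = [{"ring", "nitro"}] * 4 + [{"ring", "hydroxyl"}] * 4
train_labels = ["A"] * 4 + ["B"] * 4
weights = screen_and_score(train, train_labels)
predicted, totals = classify({"ring", "nitro"}, weights)
print(predicted, totals)
```

In this toy example the shared "ring" substructure is frequent in both classes and is therefore discarded, which mirrors the abstract's point that frequency alone does not make a subgraph useful for separating classes.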
