Pospisil Pavel, Iyer Lakshmanan K, Adelstein S James, Kassis Amin I
Harvard Medical School, Department of Radiology, 200 Longwood Avenue, Boston, Massachusetts, USA.
BMC Bioinformatics. 2006 Jul 20;7:354. doi: 10.1186/1471-2105-7-354.
We present an effective, rapid, systematic data mining approach for identifying genes or proteins relevant to a particular topic of interest. A selected combination of programs that explore PubMed abstracts, universal gene/protein databases (UniProt, InterPro, NCBI Entrez), and state-of-the-art pathway knowledge bases (LSGraph and Ingenuity Pathway Analysis) was assembled to identify enzymes with hydrolytic activities that are expressed in the extracellular space of cancer cells. Proteins were identified for six types of cancer, occurring in the prostate, breast, lung, colon, ovary, and pancreas.
The data mining method identified previously undetected targets. Our combined strategy applied to each cancer type identified a minimum of 375 proteins expressed within the extracellular space and/or attached to the plasma membrane. The method led to the recognition of human cancer-related hydrolases (on average, approximately 35 per cancer type), among which were prostatic acid phosphatase, prostate-specific antigen, and sulfatase 1.
The combined data mining of several databases overcame many of the limitations of querying a single database and enabled the facile identification of gene products. In the case of cancer-related targets, it produced a list of putative extracellular, hydrolytic enzymes that merit additional study as candidates for cancer radioimaging and radiotherapy. The proposed data mining strategy is of a general nature and can be applied to other biological databases for understanding biological functions and diseases.
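The idea of combining sources and then filtering for extracellular hydrolases can be illustrated with a minimal sketch. It is not the authors' pipeline and does not call PubMed, UniProt, LSGraph, or Ingenuity; the three input dictionaries, the gene symbols (GENE_A, GENE_B, GENE_C), and all function names are hypothetical placeholders. The only domain fact assumed is that hydrolases form Enzyme Commission class 3, so an EC number beginning with "3." marks hydrolytic activity.

```python
from collections import defaultdict

# Localization terms treated as "extracellular" in this sketch (an assumption).
EXTRACELLULAR = {"extracellular space", "plasma membrane"}
FIELDS = ("cancers", "locations", "ec")


def merge_sources(*sources):
    """Union the annotations for each gene symbol across all sources."""
    merged = defaultdict(lambda: {field: set() for field in FIELDS})
    for source in sources:
        for symbol, annotation in source.items():
            for field in FIELDS:
                merged[symbol][field] |= set(annotation.get(field, ()))
    return dict(merged)


def is_hydrolase(ec_numbers):
    """True if any EC number belongs to class 3 (hydrolases)."""
    return any(ec.split(".")[0] == "3" for ec in ec_numbers)


def candidates(merged, cancer):
    """Gene symbols linked to the cancer, extracellular, and hydrolytic."""
    return sorted(
        symbol
        for symbol, annotation in merged.items()
        if cancer in annotation["cancers"]
        and annotation["locations"] & EXTRACELLULAR
        and is_hydrolase(annotation["ec"])
    )


# Hypothetical toy records standing in for three kinds of sources.
literature = {
    "GENE_A": {"cancers": ["prostate"]},
    "GENE_B": {"cancers": ["prostate", "breast"]},
    "GENE_C": {"cancers": ["lung"]},
}
protein_db = {
    "GENE_A": {"locations": ["extracellular space"], "ec": ["3.1.3.2"]},
    "GENE_B": {"locations": ["nucleus"], "ec": ["3.4.21.1"]},
    "GENE_C": {"locations": ["plasma membrane"], "ec": ["2.7.10.1"]},
}
pathway_db = {
    "GENE_B": {"locations": ["plasma membrane"]},
}

merged = merge_sources(literature, protein_db, pathway_db)
prostate_hits = candidates(merged, "prostate")
print(prostate_hits)
```

In this toy data, GENE_B passes the localization filter only because the third source adds a plasma membrane annotation, which mirrors the point that querying a single database can miss a candidate. GENE_C is excluded because its EC number is not in class 3.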