Comparison of two schemes for automatic keyword extraction from MEDLINE for functional gene clustering.

Author information

Liu Ying, Ciliax Brian J, Borges Karin, Dasigi Venu, Ram Ashwin, Navathe Shamkant B, Dingledine Ray

Affiliation

College of Computing, Georgia Institute of Technology, Atlanta, 30322, USA.

Publication information

Proc IEEE Comput Syst Bioinform Conf. 2004:394-404. doi: 10.1109/csb.2004.1332452.

Abstract

One of the key challenges of microarray studies is to derive biological insights from the unprecedented quantities of data on gene-expression patterns. Clustering genes by functional keyword association can provide direct information about the nature of the functional links among genes within the derived clusters. However, the quality of the keyword lists extracted from biomedical literature for each gene significantly affects the clustering results. We extracted keywords from MEDLINE that describe the most prominent functions of the genes, and used the resulting weights of the keywords as feature vectors for gene clustering. By analyzing the resulting cluster quality, we compared two keyword weighting schemes: normalized z-score and term frequency-inverse document frequency (TFIDF). The best combination of background comparison set, stop list and stemming algorithm was selected based on precision and recall metrics. In a test set of four known gene groups, a hierarchical algorithm correctly assigned 25 of 26 genes to the appropriate clusters based on keywords extracted by the TFIDF weighting scheme, but only 23 of 26 with the z-score method. To evaluate the effectiveness of the weighting schemes for keyword extraction for gene clusters from microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle were used as a second test set. Using established measures of cluster quality, the results produced from TFIDF-weighted keywords had higher purity, lower entropy, and higher mutual information than those produced from normalized z-score weighted keywords. The optimized algorithms should be useful for sorting genes from microarray lists into functionally discrete clusters.
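
As a rough illustration of the ideas named in the abstract, the sketch below computes textbook TFIDF keyword weights per gene and two of the cluster-quality measures mentioned (purity and entropy). The abstract does not give the authors' exact formulas, normalization, or data layout, so everything here is an assumption: the function names (`tfidf_weights`, `purity`, `entropy`), the raw-count term frequency, the natural-log inverse document frequency, and the toy gene and keyword names are all hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Textbook TF-IDF (assumed variant, not necessarily the paper's).

    docs: dict mapping a gene name to the list of keyword tokens
          collected from its abstracts.
    Returns a dict: gene -> {keyword: weight}.
    """
    n = len(docs)
    # Document frequency: in how many genes' token lists each keyword occurs.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for gene, tokens in docs.items():
        tf = Counter(tokens)  # raw term counts
        weights[gene] = {t: c * math.log(n / df[t]) for t, c in tf.items()}
    return weights


def purity(clusters, labels):
    """Fraction of genes that fall in the majority true group of their cluster."""
    total = sum(len(c) for c in clusters)
    majority = sum(
        max(Counter(labels[g] for g in c).values()) for c in clusters if c
    )
    return majority / total


def entropy(clusters, labels):
    """Size-weighted average of each cluster's label entropy (in bits)."""
    total = sum(len(c) for c in clusters)
    h = 0.0
    for c in clusters:
        if not c:
            continue
        counts = Counter(labels[g] for g in c)
        h_c = -sum((k / len(c)) * math.log2(k / len(c)) for k in counts.values())
        h += (len(c) / total) * h_c
    return h


# Toy data (hypothetical gene names and keywords).
docs = {
    "geneA": ["kinase", "kinase", "cell"],
    "geneB": ["cell", "receptor"],
}
w = tfidf_weights(docs)

labels = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
clusters = [["a", "b", "c"], ["d"]]
p = purity(clusters, labels)
h = entropy(clusters, labels)
```

In this toy setup a keyword that appears for every gene (`cell`) receives zero weight, which is the behavior that makes TFIDF favor keywords specific to a gene. Higher purity and lower entropy both indicate clusters that agree better with the known groups.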

