Bales Michael E, Lussier Yves A, Johnson Stephen B
Department of Biomedical Informatics, Columbia University, Vanderbilt Clinic, 622 West 168th Street, New York, NY 10032, USA.
J Am Med Inform Assoc. 2007 Nov-Dec;14(6):788-97. doi: 10.1197/jamia.M2080. Epub 2007 Aug 21.
The objective was to characterize global structural features of large-scale biomedical terminologies using emerging statistical approaches.
Given the rapid growth of terminologies, this research was designed to address scalability. From the UMLS Metathesaurus, a collection of terminological systems, we selected 16 terminologies covering a variety of domains. Each was modeled as a network in which nodes were atomic concepts and links were relationships asserted by the source vocabulary. For comparison against each terminology, we created three random networks of equivalent size and density.
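The modeling step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy relationship pairs, and the choice of a uniform random graph with the same node and edge counts are all assumptions made here to show what "equivalent size and density" could mean in practice.

```python
import random
from itertools import combinations


def build_network(relationships):
    """Build an undirected adjacency map from (concept_a, concept_b) pairs."""
    adj = {}
    for a, b in relationships:
        if a == b:
            continue  # skip self-loops
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def edge_count(adj):
    """Number of undirected links (each link is stored twice in the map)."""
    return sum(len(neighbors) for neighbors in adj.values()) // 2


def random_equivalent(adj, seed=None):
    """Random network with the same number of nodes and links as `adj`.

    Assumption: links are drawn uniformly from all possible node pairs.
    Enumerating every pair is quadratic in the node count, so this is only
    suitable for small examples.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    chosen = rng.sample(list(combinations(nodes, 2)), edge_count(adj))
    rand = {node: set() for node in nodes}
    for a, b in chosen:
        rand[a].add(b)
        rand[b].add(a)
    return rand


# Hypothetical relationships, invented for illustration only.
toy_relationships = [
    ("heart", "organ"),
    ("lung", "organ"),
    ("organ", "body part"),
    ("heart", "cardiac valve"),
]
terminology = build_network(toy_relationships)
baselines = [random_equivalent(terminology, seed=s) for s in range(3)]
```

Generating three baselines per terminology mirrors the count stated in the abstract; the seeds are arbitrary and serve only to make the sketch reproducible.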
Measurements were average node degree, node degree distribution, clustering coefficient, and average path length.
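The four measurements can be computed on an adjacency map as in the sketch below. The definitions used here are assumptions, since the abstract does not specify them: clustering is taken as the mean of local clustering coefficients (nodes with fewer than two neighbors count as 0), and average path length is taken over connected node pairs only. The `toy` graph is hypothetical.

```python
from collections import deque


def average_degree(adj):
    """Mean number of links per node."""
    return sum(len(neighbors) for neighbors in adj.values()) / len(adj)


def degree_distribution(adj):
    """Map each degree k to the number of nodes having that degree."""
    counts = {}
    for neighbors in adj.values():
        k = len(neighbors)
        counts[k] = counts.get(k, 0) + 1
    return counts


def clustering_coefficient(adj):
    """Mean local clustering coefficient over all nodes."""
    total = 0.0
    for neighbors in adj.values():
        k = len(neighbors)
        if k < 2:
            continue  # contributes 0 by the convention assumed here
        # Each link between two neighbors is seen once from each endpoint.
        links = sum(1 for u in neighbors for v in adj[u] if v in neighbors) / 2
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over connected pairs, via BFS from each node."""
    total = 0
    pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical graph: a triangle (a, b, c) with one pendant node d on c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
```

Running a breadth-first search from every node is quadratic in network size, so a terminology with many thousands of concepts would likely call for sampling or a dedicated graph library.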
Eight of the 16 terminologies exhibited the small-world characteristics of a short average path length and strong local clustering. An overlapping subset of nine exhibited a power law distribution in node degrees, indicative of a scale-free architecture. We attribute these features to specific design constraints. Constraints on node connectivity, common in more synthetic classification systems, localize the effects of changes and deletions. In contrast, small-world and scale-free features, common in comprehensive medical terminologies, promote flexible navigation and less restrictive, organic-like growth.
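A power law in node degrees means the number of nodes with degree k falls off roughly as k to the power of minus gamma, which appears as a straight line on log-log axes. The sketch below estimates gamma with an ordinary least-squares fit on the logged degree counts. This is one rough, commonly used check and is an assumption here; the abstract does not say how the authors tested for a power law, and the function name and sample counts are hypothetical.

```python
import math


def power_law_exponent(degree_counts):
    """Fit log(count) = c - gamma * log(degree) by least squares.

    `degree_counts` maps degree k to the number of nodes with that degree.
    Returns (gamma, r_squared); degree 0 and empty bins are ignored.
    """
    points = [
        (math.log(k), math.log(c))
        for k, c in degree_counts.items()
        if k > 0 and c > 0
    ]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two distinct positive degrees")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return -slope, r_squared


# Hypothetical counts that follow count = 1024 * k**-3 exactly.
sample_counts = {1: 1024, 2: 128, 4: 16, 8: 2}
gamma, r_squared = power_law_exponent(sample_counts)
```

A high r-squared on log-log axes is suggestive rather than conclusive, because other heavy-tailed distributions can also look nearly linear over a limited degree range.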
While thought of as synthetic, grid-like structures, some controlled terminologies are structurally indistinguishable from natural language networks. This paradoxical result suggests that terminology structure is shaped not only by formal logic-based semantics, but by rules analogous to those that govern social networks and biological systems. Graph theoretic modeling shows early promise as a framework for describing terminology structure. Deeper understanding of these techniques may inform the development of scalable terminologies and ontologies.