TW2Informatics, Göteborg 42166, Sweden.
J Cheminform. 2013 Feb 11;5(1):10. doi: 10.1186/1758-2946-5-10.
While chemical databases can be queried using either the InChI string or the InChIKey (IK), the latter was designed for open-web searching. It is becoming increasingly effective for this purpose as more sources enhance crawling of their websites by the Googlebot, with consequent IK indexing. Searchers who use Google as an adjunct to database access may be less familiar with the advantages of using the IK, as explored in this review. As an example, a Google search with the IK for atorvastatin retrieves ~200 low-redundancy links in 0.3 seconds. These include most major databases, with a very low false-positive rate. Results encompass less familiar but potentially useful sources and can be extended to isomer capture by using just the skeleton layer of the IK. Google Advanced Search can be used to filter large result sets. Image searching with the IK is also effective and complementary to open-web queries. Results can be particularly useful for less-common structures, as exemplified by a major metabolite of atorvastatin giving only three hits. Testing also demonstrated document-to-document and document-to-database joins via structure matching. The necessary generation of an IK from chemical names can be accomplished using open tools and resources for patents, papers, abstracts or other text sources. Active global sharing of local IK-linked information can be accomplished via surfacing in open laboratory notebooks, blogs, Twitter, figshare and other routes. While information-rich chemistry (e.g. approved drugs) can exhibit swamping and redundancy effects, the much smaller IK result sets for link-poor structures become a transformative first-pass option. The IK indexing has therefore turned Google into a de facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records. The simplicity, specificity and speed of matching make it a useful option for biologists or others less familiar with chemical searching.
However, compared to rigorously maintained major databases, users need to be circumspect about the consistency of Google results and provenance of retrieved links. In addition, community engagement may be necessary to ameliorate possible future degradation of utility.