Dufresne Yoann, Noé Laurent, Leclère Valérie, Pupin Maude
Univ. Lille, CNRS, Centrale Lille, UMR 9189-CRIStAL-Centre de Recherche en Informatique Signal et Automatique de Lille, 59000 Lille, France ; Inria Lille Nord Europe, Bonsai team, Parc scientifique de la Haute Borne, 40 avenue Halley, 59650 Villeneuve d'Ascq, France.
Univ. Lille, CNRS, Centrale Lille, UMR 9189-CRIStAL-Centre de Recherche en Informatique Signal et Automatique de Lille, 59000 Lille, France ; Inria Lille Nord Europe, Bonsai team, Parc scientifique de la Haute Borne, 40 avenue Halley, 59650 Villeneuve d'Ascq, France ; Univ. Lille, INRA, ISA, Univ. Artois, Univ. Littoral Côte d'Opale, EA 7394 - ICV - Institut Charles Viollette, 59000 Lille, France.
J Cheminform. 2015 Dec 29;7:62. doi: 10.1186/s13321-015-0111-5. eCollection 2015.
Knowing the monomeric composition of polymers is valuable for structure comparison and synthetic biology, among other applications. Many databases give access to the atomic structure of compounds, but the monomeric structure of polymers is often lacking. We have designed an algorithm, implemented in the tool Smiles2Monomers (s2m), to efficiently and accurately infer the monomeric structure of a polymer from its chemical structure.
Our strategy is divided into two steps: first, monomers are mapped onto the atomic structure by an efficient subgraph-isomorphism algorithm; second, the best tiling is computed so that non-overlapping monomers cover the whole structure of the target polymer. The mapping is based on a Markovian index built by a dynamic programming algorithm. The index enables s2m to quickly search for all the given monomers in a target polymer. Next, a greedy algorithm combines the mapped monomers into a consistent monomeric structure. Finally, a local branch and cut algorithm refines the structure. We tested this method on two manually annotated databases of polymers and reconstructed the structures de novo with a sensitivity over 90%. The average computation time per polymer is 2 s.
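To make the tiling step concrete, here is a minimal sketch of a greedy pass over monomer mappings that have already been found. It is an illustration under stated assumptions, not the s2m implementation: the representation of a mapping as a (monomer name, set of atom indices) pair, the function name `greedy_tiling`, and the largest-first ordering are all hypothetical choices made for this example.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical representation: (monomer name, indices of the polymer atoms it covers).
Mapping = Tuple[str, FrozenSet[int]]


def greedy_tiling(mappings: List[Mapping], n_atoms: int) -> Tuple[List[Mapping], float]:
    """Select non-overlapping mappings, largest first.

    Returns the selected tiling and the fraction of polymer atoms it covers.
    The largest-first heuristic is an assumption of this sketch.
    """
    covered = set()
    tiling: List[Mapping] = []
    # Sort by decreasing size; break ties by name and atoms so the result is deterministic.
    ordered = sorted(mappings, key=lambda m: (-len(m[1]), m[0], sorted(m[1])))
    for name, atoms in ordered:
        # Keep a mapping only if it shares no atom with those already selected.
        if covered.isdisjoint(atoms):
            tiling.append((name, atoms))
            covered |= atoms
    coverage = len(covered) / n_atoms if n_atoms else 0.0
    return tiling, coverage


# Toy polymer with 10 atoms and five candidate mappings (made-up data).
candidates = [
    ("A", frozenset({0, 1, 2, 3})),
    ("B", frozenset({3, 4, 5})),
    ("C", frozenset({4, 5, 6})),
    ("D", frozenset({7, 8, 9})),
    ("E", frozenset({6, 7})),
]
tiling, coverage = greedy_tiling(candidates, n_atoms=10)
print([name for name, _ in tiling], coverage)  # ['A', 'C', 'D'] 1.0
```

A greedy pass like this can leave atoms uncovered when an early choice blocks a better combination, which is the kind of situation a subsequent local refinement step is meant to repair.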
s2m automatically creates de novo monomeric annotations for polymers, efficiently in terms of both computation time and sensitivity. s2m allowed us to detect annotation errors in the tested databases and to easily find the accurate structures. s2m could therefore be integrated into the curation process of databases of small compounds to verify the current entries and accelerate the annotation of new polymers. The full method can be downloaded or accessed via a website for peptide-like polymers at http://bioinfo.lifl.fr/norine/smiles2monomers.jsp.