Efficient gene orthology inference via large-scale rearrangements.

Authors

Rubert Diego P, Braga Marília D V

Affiliations

Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, Brazil.

Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany.

Publication information

Algorithms Mol Biol. 2023 Sep 28;18(1):14. doi: 10.1186/s13015-023-00238-y.

DOI: 10.1186/s13015-023-00238-y
PMID: 37770945
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10540461/
Abstract

BACKGROUND

Recently we developed a gene orthology inference tool based on genome rearrangements (Journal of Bioinformatics and Computational Biology 19:6, 2021). Given a set of genomes, our method first computes all pairwise gene similarities. It then runs pairwise integer linear programming (ILP) comparisons to compute optimal gene matchings, which, by taking the similarities into account, minimize the weighted rearrangement distance between the analyzed genomes (an NP-hard problem). In the final step, the gene matchings are integrated into gene families. The ILP includes an optimal capping that connects each end of a linear segment of one genome to an end of a linear segment in the other genome, which increases the search space exponentially.
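
The final integration step can be pictured with a small sketch. The Python below assumes that gene families are the connected components of the graph whose edges are the matched gene pairs; this is one simple reading of "integrated into gene families", not necessarily the rule the tool applies. The function name `integrate_matchings` and the gene labels are hypothetical.

```python
from collections import defaultdict


def integrate_matchings(pairwise_matchings):
    """Merge pairwise gene matchings into families (connected components).

    pairwise_matchings: iterable of matchings, one per genome pair; each
    matching is a list of (gene_in_first_genome, gene_in_second_genome).
    Assumption: a family is a connected component of the matched pairs.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for matching in pairwise_matchings:
        for gene_a, gene_b in matching:
            root_a, root_b = find(gene_a), find(gene_b)
            if root_a != root_b:
                parent[root_b] = root_a

    families = defaultdict(set)
    for gene in list(parent):
        families[find(gene)].add(gene)
    # Sort for a deterministic result.
    return sorted(sorted(family) for family in families.values())


# Hypothetical matchings for three genomes A, B and C.
matchings = [
    [("A:g1", "B:g1"), ("A:g2", "B:g2")],  # A vs B
    [("B:g1", "C:g1")],                    # B vs C
    [("A:g2", "C:g2")],                    # A vs C
]
families = integrate_matchings(matchings)
print(families)
```

In this toy input every family ends up with exactly one gene per genome; a family holding two or more genes of the same genome would need further refinement.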

RESULTS

In this work, we design and implement a heuristic capping algorithm that replaces the optimal capping by clustering the linear segments, based on their gene content intersections, into m ≥ 1 subsets whose ends are capped independently. Furthermore, in each subset, instead of allowing all possible connections, we let only the ends of content-related segments be connected. Although there is no guarantee that m is much larger than one, and although the heuristic may yield sub-optimal rather than optimal gene matchings, it works very well in practice in terms of both speed and the quality of the computed solutions. Our experiments on primate and fruit fly genomes show two positive results. First, for complete assemblies of five primates, the version with heuristic capping reports orthologies that are very similar to those computed by the version of our tool with optimal capping. Second, we were able to efficiently analyze fruit fly genomes with incomplete assemblies distributed in hundreds or even thousands of contigs, obtaining gene families that are very similar to FlyBase families. Indeed, compared to gene families computed by other inference tools, our tool inferred a higher number of complete cliques, with a higher intersection with FlyBase. We added a post-processing step that refines our ambiguous families (those with more than one gene per genome) with the aid of the MCL algorithm, further improving the accuracy of our results. Our approach is implemented as a pipeline that incorporates the pre-computation of gene similarities and the post-processing refinement of ambiguous families with MCL.
Both the original version with optimal capping and the new modified version with heuristic capping can be downloaded, together with their detailed documentation, at https://gitlab.ub.uni-bielefeld.de/gi/FFGC or as a Conda package at https://anaconda.org/bioconda/ffgc .
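
The clustering idea behind the heuristic capping can be illustrated with a minimal sketch. It assumes that two linear segments, one from each genome, are "content-related" when at least one gene of the first is similar to a gene of the second, and that the subsets are the connected components of this relation. The function `cluster_segments`, the input layout and the toy data are hypothetical; the criterion actually used by the tool (for example, any similarity threshold) may differ.

```python
def cluster_segments(segments_a, segments_b, similar_pairs):
    """Cluster linear segments of genomes A and B by gene content.

    segments_a, segments_b: dict mapping segment name -> set of gene names.
    similar_pairs: iterable of (gene_in_A, gene_in_B) deemed similar.
    Returns (clusters, related): the clusters as sorted lists of
    (genome, segment) nodes, and the sorted content-related segment pairs.
    """
    seg_of_a = {g: s for s, genes in segments_a.items() for g in genes}
    seg_of_b = {g: s for s, genes in segments_b.items() for g in genes}

    nodes = [("A", s) for s in segments_a] + [("B", s) for s in segments_b]
    adjacent = {node: set() for node in nodes}
    related = set()
    for gene_a, gene_b in similar_pairs:
        if gene_a in seg_of_a and gene_b in seg_of_b:
            u, v = ("A", seg_of_a[gene_a]), ("B", seg_of_b[gene_b])
            adjacent[u].add(v)
            adjacent[v].add(u)
            related.add((u[1], v[1]))

    # Connected components by iterative depth-first search.
    clusters, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacent[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters, sorted(related)


# Toy data: three segments (e.g. contigs) per genome.
segments_a = {"a1": {"x1", "x2"}, "a2": {"x3"}, "a3": {"x4"}}
segments_b = {"b1": {"y1"}, "b2": {"y2", "y3"}, "b3": {"y4"}}
similar = [("x1", "y1"), ("x2", "y2"), ("x3", "y3"), ("x4", "y4")]

clusters, related = cluster_segments(segments_a, segments_b, similar)
print(len(clusters), clusters)  # number of subsets m and their members
print(related)                  # segment pairs whose ends may be connected
```

In this toy instance the segments fall into m = 2 subsets, and only 4 of the 9 possible A-to-B segment pairs are content-related, so far fewer end-to-end connections need to be considered than in an unrestricted capping. A segment with no similar gene in the other genome forms its own subset here; how such segments are treated in practice is not stated in the abstract.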

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/94f6/10540461/5783030c24df/13015_2023_238_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/94f6/10540461/252d8b770ce6/13015_2023_238_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/94f6/10540461/7f700125e43d/13015_2023_238_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/94f6/10540461/d6534f9eae13/13015_2023_238_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/94f6/10540461/74a67d75c1f3/13015_2023_238_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/94f6/10540461/4c8452a1ca58/13015_2023_238_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/94f6/10540461/48d69c53e160/13015_2023_238_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/94f6/10540461/9776d952f8ce/13015_2023_238_Fig8_HTML.jpg
Cited by

1
Reconstructing rearrangement phylogenies of natural genomes.
Algorithms Mol Biol. 2025 Jun 7;20(1):10. doi: 10.1186/s13015-025-00279-5.

References

1
The potential of family-free rearrangements towards gene orthology inference.
J Bioinform Comput Biol. 2021 Dec;19(6):2140014. doi: 10.1142/S021972002140014X. Epub 2021 Nov 13.
2
Natural family-free genomic distance.
Algorithms Mol Biol. 2021 May 10;16(1):4. doi: 10.1186/s13015-021-00183-8.
3
Computing the Rearrangement Distance of Natural Genomes.
J Comput Biol. 2021 Apr;28(4):410-431. doi: 10.1089/cmb.2020.0434. Epub 2020 Dec 30.
4
OMA standalone: orthology inference among public and custom genomes and transcriptomes.
Genome Res. 2019 Jul;29(7):1152-1163. doi: 10.1101/gr.243212.118. Epub 2019 Jun 24.
5
Family-Free Genome Comparison.
Methods Mol Biol. 2018;1704:331-342. doi: 10.1007/978-1-4939-7463-4_12.
6
On the family-free DCJ distance and similarity.
Algorithms Mol Biol. 2015 Apr 1;10:13. doi: 10.1186/s13015-015-0041-9. eCollection 2015.
7
An Exact Algorithm to Compute the Double-Cut-and-Join Distance for Genomes with Duplicate Genes.
J Comput Biol. 2015 May;22(5):425-35. doi: 10.1089/cmb.2014.0096. Epub 2014 Dec 17.
8
Fast and sensitive protein alignment using DIAMOND.
Nat Methods. 2015 Jan;12(1):59-60. doi: 10.1038/nmeth.3176. Epub 2014 Nov 17.
9
Orthology detection combining clustering and synteny for very large datasets.
PLoS One. 2014 Aug 19;9(8):e105015. doi: 10.1371/journal.pone.0105015. eCollection 2014.
10
Double cut and join with insertions and deletions.
J Comput Biol. 2011 Sep;18(9):1167-84. doi: 10.1089/cmb.2011.0118.