School of Biological Sciences, University of Western Australia, Perth, WA, 6009, Australia.
Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, 6009, Australia.
Genetica. 2023 Dec;151(6):325-338. doi: 10.1007/s10709-023-00196-8. Epub 2023 Oct 10.
Identifying homologs is an important step in analyzing the genetic patterns that underlie traits and the evolutionary relationships among species. Gene family analysis is often used to form and support hypotheses about genetic patterns, such as gene presence, absence, or functional divergence, that underlie traits examined in functional studies. These analyses often require precise identification of all members of a targeted gene family. Manual pipelines, in which homology search and orthology assignment tools are run separately, are the most common approach for identifying small gene families where accurate identification of every member is important. The ability to curate sequences between steps in a manual pipeline allows simple and precise identification of all possible gene family members. However, the validity of such analyses is often reduced by inappropriate approaches to homology searches, including overly relaxed or overly stringent statistical thresholds, inappropriate query sequences, homology classification based on sequence similarity alone, and low-quality proteome or genome sequences. In this article, we propose several approaches that mitigate these issues, allowing precise identification of gene family members and supporting hypotheses that link genetic patterns to functional traits.
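As an illustration of the curation step that sits between the homology search and orthology assignment, the following minimal sketch filters tabular BLAST+ hits (`-outfmt 6`, default column order) by E-value and query coverage. The function name `filter_hits`, the thresholds (`1e-5`, `0.5`), and the example hits are assumptions chosen for illustration only; they are not values recommended by the article, and suitable cut-offs depend on the gene family and the quality of the target proteome.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(handle, query_lengths, max_evalue=1e-5, min_query_cov=0.5):
    """Keep hits that pass both an E-value and a query-coverage threshold.

    handle        -- file-like object with tab-separated BLAST hits
    query_lengths -- dict mapping query ID to query sequence length
    max_evalue    -- hypothetical cut-off; tune per gene family
    min_query_cov -- hypothetical minimum fraction of the query aligned
    """
    kept = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST6_COLUMNS, row))
        evalue = float(hit["evalue"])
        # Aligned span on the query, inclusive of both end coordinates.
        aligned = abs(int(hit["qend"]) - int(hit["qstart"])) + 1
        coverage = aligned / query_lengths[hit["qseqid"]]
        if evalue <= max_evalue and coverage >= min_query_cov:
            kept.append((hit["qseqid"], hit["sseqid"], evalue, round(coverage, 2)))
    return kept


# Made-up hits for a single 200-residue query (illustrative data only).
EXAMPLE = (
    "Q1\tS1\t62.5\t200\t75\t0\t1\t200\t5\t204\t1e-80\t250\n"
    "Q1\tS2\t35.0\t40\t26\t0\t10\t49\t3\t42\t2e-06\t45.1\n"
    "Q1\tS3\t28.0\t180\t130\t2\t5\t184\t1\t178\t0.5\t30.2\n"
)

candidates = filter_hits(io.StringIO(EXAMPLE), {"Q1": 200})
for qseqid, sseqid, evalue, coverage in candidates:
    print(qseqid, sseqid, evalue, coverage)
```

In this made-up example only `S1` is retained: `S2` passes the E-value cut-off but aligns to just a short stretch of the query, and `S3` covers most of the query but with a weak E-value. Hits that survive such a filter are still only candidates; as the abstract notes, sequence similarity alone does not establish homology class, so a subsequent orthology assignment step is still required.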