Latent Dirichlet allocation mixture models for nucleotide sequence analysis.

Authors

Wang Bixuan, Mount Stephen M

Affiliation

Dept. of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD 20742, USA.

Publication information

NAR Genom Bioinform. 2024 Aug 9;6(3):lqae099. doi: 10.1093/nargab/lqae099. eCollection 2024 Sep.

Abstract

Strings of nucleotides carrying biological information are typically described as sequence motifs represented by weight matrices or consensus sequences. However, many signals in DNA or RNA are recognized by multiple factors in temporal sequence, consist of distinct alternative motifs, or are best described by base composition. Here we apply the latent Dirichlet allocation (LDA) mixture model to nucleotide sequences. Using positions in an alignment of human or Drosophila splice sites as samples, we show that LDA readily identifies motifs, including such elusive cases as the intron branch site. Using whole sequences with positional k-mers as features, LDA can identify sequence subtypes enriched in long vs. short introns. LDA with bulk k-mers can reliably distinguish reading frame and species of origin in coding sequences from humans and Drosophila. We find that LDA is a useful model for describing heterogeneous signals, for assigning individual sequences to subtypes, and for identifying and characterizing sequences that do not fit recognized subtypes. Because LDA topic models are interpretable, they also aid the discovery of new motifs, even those present in a small fraction of samples. In summary, LDA can identify and characterize signals in nucleotide sequences, including candidate regulatory factors involved in biological processes.
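The abstract contrasts two feature types, positional k-mers and bulk k-mers. The sketch below is one plausible way to build such count features, not the authors' code: the function names, the `position:kmer` label format, and the toy sequences are all assumptions made for illustration. The resulting count matrix is the kind of document-by-feature input that a typical LDA implementation expects.

```python
from collections import Counter

def bulk_kmers(seq, k):
    """Count every k-mer in seq, ignoring where it occurs."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def positional_kmers(seq, k):
    """Count k-mers tagged with their 0-based start position.

    Assumes the sequences are aligned, so that position i means the
    same thing in every sequence. The "i:KMER" label is hypothetical.
    """
    seq = seq.upper()
    return Counter(f"{i}:{seq[i:i + k]}" for i in range(len(seq) - k + 1))

def count_matrix(sequences, featurize, k):
    """Return (vocabulary, rows), one row of feature counts per sequence."""
    counts = [featurize(s, k) for s in sequences]
    vocab = sorted(set().union(*counts))
    rows = [[c[f] for f in vocab] for c in counts]
    return vocab, rows

# Toy aligned sequences, invented for illustration only.
toy_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA"]
vocab, rows = count_matrix(toy_sites, positional_kmers, 2)
```

With positional features, two sequences share a feature only if they carry the same k-mer at the same aligned position. With `bulk_kmers`, position is discarded and only composition remains, which matches the abstract's use of bulk k-mers for coding sequences.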


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/da01/11310860/100e5425c1bd/lqae099figgra1.jpg
