Computer and Information Sciences, University of Delaware, 101 Smith Hall, Newark, DE, 19716, USA.
Plant and Soil Sciences, University of Delaware, 15 Innovation Way, Newark, DE, 19716, USA.
BMC Bioinformatics. 2021 Mar 26;22(1):162. doi: 10.1186/s12859-021-04080-0.
Hidden Markov models (HMMs) are a powerful tool for analyzing biological sequences in a wide variety of applications, from profiling functional protein families to identifying functional domains. HMMs are usually trained either by maximum likelihood estimation from counts, when the sequences are labelled, or by expectation maximization, such as the Baum-Welch algorithm, when the sequences are unlabelled. Increasingly, however, sequences are only partially labelled. In this paper, we design a new training method based on the Baum-Welch algorithm to train HMMs when only partial labelling is available, as is the case for certain biological problems.
Compared with a previously reported similar method designed for active learning in text mining, our method achieves significant improvements in model training, as demonstrated by higher decoding accuracy when the trained models are tested on both synthetic and real data.
A novel training method is developed to improve the training of hidden Markov models by utilizing partially labelled data. The method will impact the detection of de novo motifs and signals in biological sequence data. In particular, it will be deployed in active learning mode in ongoing research on detecting plasmodesmata targeting signals, with its performance assessed through validation by wet-lab experiments.
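To make the partially labelled setting concrete, the sketch below shows one common way partial labels can constrain the forward pass that Baum-Welch builds on: at a labelled position, every state other than the given label is assigned zero probability. This is an illustrative assumption, not necessarily the training method proposed in the paper, and the function name `constrained_forward` and the toy parameters are hypothetical.

```python
def constrained_forward(obs, labels, start, trans, emit):
    """Probability of `obs` summed over all state paths that agree with `labels`.

    labels[t] is a state index where position t is labelled, or None where it
    is unlabelled. start[s], trans[r][s] and emit[s][symbol] are the usual HMM
    parameters. This is a sketch of label masking, not the paper's algorithm.
    """
    n_states = len(start)

    def allowed(t, s):
        # An unlabelled position admits every state; a labelled one admits only its label.
        return labels[t] is None or labels[t] == s

    # Initialization: start probability times emission, masked by the label at t = 0.
    alpha = [
        start[s] * emit[s][obs[0]] if allowed(0, s) else 0.0
        for s in range(n_states)
    ]

    # Recursion: the standard forward update, with disallowed states zeroed out.
    for t in range(1, len(obs)):
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs[t]]
            if allowed(t, s) else 0.0
            for s in range(n_states)
        ]

    return sum(alpha)


# Toy two-state, two-symbol model (hypothetical values for illustration only).
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]

unlabelled = constrained_forward(obs, [None, None, None], start, trans, emit)
partially_labelled = constrained_forward(obs, [None, 1, None], start, trans, emit)
fully_labelled = constrained_forward(obs, [0, 1, 0], start, trans, emit)
```

With no labels the function reduces to the ordinary forward algorithm, and with every position labelled it returns the joint probability of a single path, so the two standard training regimes appear as the limiting cases of the partially labelled one.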