Department of Molecular Genetics, The Forsyth Institute, Boston, MA 02115, USA.
Bioinformatics. 2010 Jun 1;26(11):1423-30. doi: 10.1093/bioinformatics/btq162. Epub 2010 Apr 15.
RNA expression signals detected by high-density genomic tiling microarrays contain comprehensive transcriptomic information about the target organism. Current methods for determining RNA transcription units are still computationally intensive and lack discriminative power. This article describes an efficient and accurate methodology for revealing complicated transcriptional architecture, including small regulatory RNAs, in microbial transcriptome profiles.
Normalized microarray data were first subjected to support vector regression, which reduces noise and estimates the underlying trend of the profile. A hybrid supervised machine learning algorithm, hidden Markov support vector machines, was then used to classify the underlying state of each probe as 'expression' or 'silence', under the assumption that the sequence of consecutive states forms a heterogeneous Markov chain. For model construction, we introduced a profile geometry learning method to construct the feature vectors, which considers both the intensity profiles and the changes in intensity over the probe spacing. In addition, a robust strategy was used to dynamically evaluate and select the training set based only on prior computational gene annotation. On simulated data, the algorithm was more accurate than other methods, especially for small expressed regions with a low signal-to-noise ratio (SNR < 1), and is hence more sensitive for detecting small RNAs.
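The classification step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the intensities have already been smoothed, builds a two-component feature vector per probe (intensity, and intensity change over probe spacing), and decodes the state sequence with Viterbi over linear scores, which is the usual decoding scheme for sequence models of this kind. The function names, the weights, the bias and the transition penalties are all hypothetical and set by hand rather than learned from annotated genes.

```python
from typing import List, Sequence, Tuple

STATES = ("silence", "expression")


def geometry_features(positions: Sequence[float],
                      intensities: Sequence[float]) -> List[Tuple[float, float]]:
    """Per-probe feature vector: (intensity, slope over probe spacing).

    The slope is a central difference divided by the genomic distance between
    the neighbouring probes; the end probes fall back to a one-sided difference.
    """
    n = len(intensities)
    feats = []
    for i in range(n):
        left, right = max(i - 1, 0), min(i + 1, n - 1)
        slope = (intensities[right] - intensities[left]) / (positions[right] - positions[left])
        feats.append((intensities[i], slope))
    return feats


def viterbi_decode(features, weights, bias, transition) -> List[str]:
    """Return the highest-scoring state path under linear per-state scores."""
    k = len(STATES)

    def emit(f, s):
        # Linear score of feature vector f for state s.
        return sum(w * x for w, x in zip(weights[s], f)) + bias[s]

    score = [emit(features[0], s) for s in range(k)]
    back = []  # back[t][s] = best previous state for state s at step t + 1
    for f in features[1:]:
        new_score, ptr = [], []
        for s in range(k):
            best_prev = max(range(k), key=lambda p: score[p] + transition[p][s])
            new_score.append(score[best_prev] + transition[best_prev][s] + emit(f, s))
            ptr.append(best_prev)
        score = new_score
        back.append(ptr)

    # Trace the best path backwards from the best final state.
    state = max(range(k), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]


# Toy profile (hypothetical values): eight probes spaced 25 bp apart.
positions = [0, 25, 50, 75, 100, 125, 150, 175]
intensities = [0.1, 0.2, 2.0, 2.2, 0.4, 2.1, 0.2, 0.1]

# Hand-set parameters, assumptions for illustration only. The slope weight is
# zero here so the scores are easy to trace by hand; a trained model would
# assign weights to both features.
weights = [(0.0, 0.0), (1.0, 0.0)]   # one weight vector per state
bias = [0.0, -1.0]                   # 'expression' needs intensity above 1.0
transition = [[0.0, -1.0],           # penalty of 1.0 for switching state
              [-1.0, 0.0]]

features = geometry_features(positions, intensities)
labels = viterbi_decode(features, weights, bias, transition)
print(labels)
```

In this toy profile the fifth probe has a low intensity, but the switching penalty makes it cheaper to keep it inside the surrounding expressed segment than to split that segment in two. This is the kind of behaviour that the Markov-chain assumption on consecutive probe states is meant to provide.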
Detailed implementation steps of the algorithm and the complete results of the transcriptome analysis for the microbial genome Porphyromonas gingivalis W83 can be viewed at http://bioinformatics.forsyth.org/mtd.