Yang Jiaoyun, Wang Haipeng, Ding Huitong, An Ning, Alterovitz Gil
School of Computer and Information, Hefei University of Technology, Tunxi Road, Hefei, 230009, China.
Harvard Medical School, Boston Children's Hospital, Boston, 02115, MA, USA.
BMC Bioinformatics. 2017 Jan 19;18(1):47. doi: 10.1186/s12859-017-1484-4.
Visualizing data through dimensionality reduction is an important strategy in bioinformatics: it helps to reveal hidden properties of the data and to detect data quality issues such as noise and inappropriately labeled entries. Crowdsourcing-based synthetic biology databases face similar data quality issues, so we propose visualizing biobricks to tackle them. Existing dimensionality reduction methods, however, cannot be applied directly to biobrick datasets. We therefore use normalized edit distance to enhance two dimensionality reduction methods, Isomap and Laplacian Eigenmaps.
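To make the distance step concrete, the following is a minimal sketch of a pairwise normalized edit distance matrix of the kind that a distance-based embedding such as Isomap or Laplacian Eigenmaps could take as input. The abstract does not say which normalization is used, so dividing the Levenshtein distance by the length of the longer sequence is an assumption, and the function names and example strings are hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1].

    Assumption: normalize by the longer sequence length. Other
    normalizations exist and the source may use a different one.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return edit_distance(a, b) / longest


def distance_matrix(seqs: Sequence[str]) -> List[List[float]]:
    """Symmetric matrix of pairwise normalized edit distances."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = normalized_edit_distance(seqs[i], seqs[j])
    return d


if __name__ == "__main__":
    # Hypothetical short sequences, not taken from the Registry.
    parts = ["ACGTACGT", "ACGTTCGT", "TTTTGGGG"]
    for row in distance_matrix(parts):
        print(["%.3f" % x for x in row])
```

Because the normalized values lie in [0, 1] regardless of sequence length, sequences of different lengths become comparable, which is what a precomputed-distance embedding needs.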
Biobricks were extracted from the synthetic biology database Registry of Standard Biological Parts, and six combinations of biobrick types were tested. The resulting visualizations show both well-separated biobricks and inappropriately labeled ones. The K-means clustering algorithm was used to quantify the reduction results. The average clustering accuracies for Isomap and Laplacian Eigenmaps are 0.857 and 0.844, respectively. In addition, Laplacian Eigenmaps is 5 times faster than Isomap, and its visualization is more concentrated, which makes biobricks easier to discriminate.
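As an illustration of how a clustering result can be scored against known part types, here is a minimal sketch of a clustering accuracy measure. The abstract does not state how accuracy is computed, so the best one-to-one mapping between cluster ids and type labels is an assumption. The function name, labels, and cluster assignments below are hypothetical and are not results from the study.

```python
from itertools import permutations
from typing import Hashable, Sequence


def clustering_accuracy(true_labels: Sequence[Hashable],
                        cluster_ids: Sequence[Hashable]) -> float:
    """Fraction of items matched under the best cluster-to-label mapping.

    Cluster ids from K-means are arbitrary, so every one-to-one
    assignment of clusters to labels is tried and the best is kept.
    Brute force is fine for the handful of types compared at a time.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("inputs must have the same length")
    if not true_labels:
        raise ValueError("inputs must not be empty")

    classes = list(dict.fromkeys(true_labels))   # unique, order preserved
    clusters = list(dict.fromkeys(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than label classes")

    best_hits = 0
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(1 for t, c in zip(true_labels, cluster_ids)
                   if mapping[c] == t)
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


if __name__ == "__main__":
    # Hypothetical labels and cluster assignments for six parts.
    labels = ["promoter"] * 3 + ["terminator"] * 3
    assigned = [1, 1, 0, 0, 0, 0]
    print(round(clustering_accuracy(labels, assigned), 3))
```

In this toy case one promoter falls into the terminator cluster, so five of six items match. Items that end up in the wrong cluster under the best mapping are natural candidates for inappropriately labeled entries.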
By combining normalized edit distance with Isomap and Laplacian Eigenmaps, synthetic biology biobricks are successfully visualized in two-dimensional space. Different types of biobricks can be discriminated and inappropriately labeled biobricks can be identified, which helps to assess the quality of crowdsourcing-based synthetic biology databases and to select biobricks.