Kaminuma Eli, Baba Yukino, Mochizuki Masahiro, Matsumoto Hirotaka, Ozaki Haruka, Okayama Toshitsugu, Kato Takuya, Oki Shinya, Fujisawa Takatomo, Nakamura Yasukazu, Arita Masanori, Ogasawara Osamu, Kashima Hisashi, Takagi Toshihisa
Center for Information Biology, National Institute of Genetics.
Graduate School of Informatics, Kyoto University.
Genes Genet Syst. 2020 Apr 22;95(1):43-50. doi: 10.1266/ggs.19-00034. Epub 2020 Mar 26.
Applying machine learning tools to automate the annotation of large-scale sequences from next-generation sequencers has recently attracted the interest of researchers. However, many experimental life scientists find it difficult to identify research collaborators with knowledge of machine learning techniques. One solution to this problem is to utilise the power of crowdsourcing. In this report, we describe how we investigated the potential of crowdsourced modelling for a life science task by conducting a machine learning competition, the DNA Data Bank of Japan (DDBJ) Data Analysis Challenge. In the challenge, participants competed to build models that predict chromatin feature annotations from DNA sequences. The challenge engaged 38 participants, who made a cumulative total of 360 model submissions. The top model achieved an area under the curve (AUC) score of 0.95. Over the course of the competition, the performance of the submitted models improved by 0.30 in AUC score relative to the first submitted model. Furthermore, the first- and second-ranked models utilised external data, such as genomic location and gene annotation information, together with specific domain knowledge. Incorporating this domain knowledge improved the AUC scores by approximately 5%-9%. This report suggests that machine learning competitions can lead to the development of highly accurate machine learning models for use by experimental scientists unfamiliar with the complexities of data science.
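Because every result in the abstract is reported as an AUC score, a brief illustration of that metric may help readers less familiar with it. The sketch below is not the challenge's evaluation code, which the abstract does not describe. It is a minimal pure-Python example that assumes binary labels (1 for a sequence carrying the chromatin feature, 0 otherwise) and computes the AUC as the fraction of positive-negative pairs in which the positive example receives the higher score, counting ties as half. The function name `roc_auc` and the toy data are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example is
    scored above a randomly chosen negative one (ties count as 0.5).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not from the challenge).
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
print(round(roc_auc(labels, scores), 3))  # 7 of 9 pairs are ordered correctly
```

On this scale, 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives context for the reported top score of 0.95. The quadratic pairwise loop is chosen here for readability; a rank-based computation would be preferable for large datasets.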