School of Computing, Clemson University, Clemson, SC, United States of America.
Department of Computer and Information Sciences, University of Delaware, Newark, DE, United States of America.
PLoS One. 2021 Jul 6;16(7):e0253905. doi: 10.1371/journal.pone.0253905. eCollection 2021.
Biomedical research papers often combine disjoint concepts in novel ways, such as when describing a newly discovered relationship between an understudied gene and an important disease. These concepts are often explicitly encoded as metadata keywords, such as the author-provided terms included with many documents in the MEDLINE database. While substantial recent work has addressed the problem of text generation in a more general context, applications such as scientific writing assistants or hypothesis generation systems could benefit from the capacity to select the specific set of concepts that underpin a generated biomedical text. We propose a conditional language model following the transformer architecture. This model uses the "encoder stack" to encode the concepts that a user wishes to discuss in the generated text. The "decoder stack" then follows the masked self-attention pattern to perform text generation, attending both to prior tokens and to the encoded condition. We demonstrate that this approach provides significant control while still producing reasonable biomedical text.
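The core mechanism the abstract describes, a decoder that combines causally masked self-attention over prior tokens with cross-attention over encoded condition vectors, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; all dimensions, variable names, and the single-head, weight-free attention are simplifying assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # scaled dot-product attention; mask entries that are False are blocked
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 8                                  # toy embedding dimension (assumed)
n_cond, n_tok = 3, 5                   # e.g. 3 keyword concepts, 5 generated tokens
cond = rng.normal(size=(n_cond, d))    # stand-in for "encoder stack" output: one vector per concept
toks = rng.normal(size=(n_tok, d))     # stand-in for decoder token representations

# Masked self-attention: token i may attend only to tokens 0..i,
# so generation cannot peek at future tokens.
causal = np.tril(np.ones((n_tok, n_tok), dtype=bool))
self_out = attention(toks, toks, toks, mask=causal)

# Cross-attention: every token may attend to all encoded condition
# vectors, which is how the chosen concepts steer the generated text.
cross_out = attention(self_out, cond, cond)
```

Under the causal mask the first token can attend only to itself, so its self-attention output equals its own representation; every later token mixes in the tokens before it, while the cross-attention step lets all positions draw on the concept encodings.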