European Commission, Joint Research Centre (JRC), Seville, Spain.
Institute for Complex Systems, CNR, Rome, Italy.
PLoS One. 2020 Apr 30;15(4):e0230107. doi: 10.1371/journal.pone.0230107. eCollection 2020.
Predicting innovation is a peculiar problem in data science. By definition, an innovation is a never-seen-before event, which leaves no room for traditional supervised learning approaches. Here we propose a strategy to address the problem in the context of innovative patents, by defining innovations as never-seen-before associations of technologies and exploiting self-supervised learning techniques. We think of the technological codes present in patents as a vocabulary, and of the whole technological corpus as written in a specific, evolving language. We leverage this structure with techniques borrowed from Natural Language Processing, embedding technologies in a high-dimensional Euclidean space where relative positions are representative of learned semantics. Proximity in this space is an effective predictor of specific innovation events and outperforms a wide range of standard link-prediction metrics. The success of patented innovations follows complex dynamics characterized by different patterns, which we analyze in detail with specific examples. The methods proposed in this paper provide a new way of understanding and forecasting innovation, opening interesting scenarios for a number of applications and further analytic approaches.
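The core idea of the abstract (represent each technology code as a vector, then score code pairs that have never appeared together by their proximity) can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not specify the embedding model, so plain co-occurrence count vectors stand in for learned embeddings here, and cosine similarity stands in for the proximity measure. The toy corpus, the code labels, and the function names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical toy corpus: each patent is treated as a "sentence"
# whose "words" are technology codes (labels are made up).
patents = [
    ["A01", "B02", "C03"],
    ["A01", "B02"],
    ["B02", "C03", "D04"],
    ["C03", "D04"],
    ["E05", "F06"],
    ["E05", "F06", "A01"],
]


def cooccurrence_vectors(corpus):
    """Map each code to a sparse vector of co-occurrence counts.

    Assumption: these count vectors are a crude stand-in for the
    learned embeddings described in the abstract.
    """
    vectors = defaultdict(lambda: defaultdict(int))
    for codes in corpus:
        for a, b in combinations(sorted(set(codes)), 2):
            vectors[a][b] += 1
            vectors[b][a] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_unseen_pairs(corpus):
    """Rank code pairs that never co-occur, most similar first."""
    vectors = cooccurrence_vectors(corpus)
    scored = []
    for a, b in combinations(sorted(vectors), 2):
        if b not in vectors[a]:  # candidate "innovation": never seen together
            scored.append((cosine(vectors[a], vectors[b]), a, b))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    for score, a, b in rank_unseen_pairs(patents):
        print(f"{a}-{b}: {score:.3f}")
```

In this toy corpus the top-ranked unseen pair is A01 and D04, because both codes share the neighbours B02 and C03. A learned embedding would replace `cooccurrence_vectors`, while the ranking step over unseen pairs would stay the same in spirit.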