Hao Shuyan, Xia Ting, Zhang Ruizhi, Guo Meng
Key Laboratory of Computing Power Network and Information Security, Shandong Computer Science Center (National Supercomputing Center in Jinan), Ministry of Education, Qilu University of Technology (Shandong Academy of Sciences), Jinan, 250013, Shandong, P. R. China.
Jinan Key Laboratory of High-Performance Industrial Software, Jinan Institute of Supercomputing Technology, Jinan, 250103, Shandong, P. R. China.
Sci Rep. 2024 Dec 30;14(1):31602. doi: 10.1038/s41598-024-79126-3.
Crystal structure similarity is useful for the chemical analysis of today's large materials databases and for data mining of new materials. Here we propose using the two-dimensional Wasserstein distance (earth mover's distance) to measure the compositional similarity between compounds, based on the periodic table representation of their compositions. To demonstrate the effectiveness of our approach, 1586 Cu-S based compounds are taken from the Inorganic Crystal Structure Database (ICSD) to form a validation dataset. Using local structure order parameters as a geometrical similarity metric, we calculate a similarity matrix that includes both compositional and geometrical similarities. All the Cu-S compounds are then clustered into 86 groups using the similarity matrix and the density-based spatial clustering of applications with noise (DBSCAN) algorithm. Selected groups are analyzed through crystal structure visualization of hundreds of compounds, which provides chemical insight into the similarity metrics and shows the effectiveness of the clustering. A group of rare-earth-containing layered Cu-S compounds is proposed for further experimental investigation as potential thermoelectric materials, based on the structure-property consideration that similar structures tend to have similar properties. The unsupervised clustering approach in this work can be readily applied to other datasets, which will aid the chemical understanding of materials datasets and the discovery of new materials with similar properties based on the similarity metrics.
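As a rough illustration of the compositional metric described in the abstract, the sketch below computes an earth mover's distance between two compositions placed on a (period, group) grid. It is a minimal sketch under stated assumptions, not the authors' implementation: the `POSITION` table, the Euclidean ground distance, and the helper names `to_distribution`, `ground_distance`, `emd` and `distance_matrix` are all hypothetical, and the abstract does not say how the grid (for example the placement of rare earths) or the ground distance is actually defined. The transport problem is solved exactly for small inputs by successive shortest paths.

```python
from itertools import product

# Hypothetical, partial lookup of (period, group) grid cells for illustration.
POSITION = {"Cu": (4, 11), "Ag": (5, 11), "Fe": (4, 8), "S": (3, 16), "Se": (4, 16)}


def to_distribution(composition):
    """Map element amounts to atomic fractions located on grid cells."""
    total = sum(composition.values())
    return {POSITION[el]: amount / total for el, amount in composition.items()}


def ground_distance(p, q):
    """Euclidean distance between two grid cells (an assumed choice)."""
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5


def emd(source, target, eps=1e-12):
    """Earth mover's distance between two small discrete distributions,
    computed as a min-cost flow via successive shortest paths."""
    src_pts, dst_pts = list(source), list(target)
    m, n = len(src_pts), len(dst_pts)
    supply = [source[p] for p in src_pts]
    demand = [target[q] for q in dst_pts]
    cost = [[ground_distance(p, q) for q in dst_pts] for p in src_pts]
    flow = [[0.0] * n for _ in range(m)]
    while max(supply) > eps and max(demand) > eps:
        # Bellman-Ford on the residual graph; nodes 0..m-1 are sources,
        # nodes m..m+n-1 are targets. Sources with supply left start at 0.
        dist = [0.0 if s > eps else float("inf") for s in supply]
        dist += [float("inf")] * n
        prev = [None] * (m + n)
        for _ in range(m + n):
            changed = False
            for i, j in product(range(m), range(n)):
                if dist[i] + cost[i][j] < dist[m + j] - eps:  # forward edge
                    dist[m + j], prev[m + j] = dist[i] + cost[i][j], i
                    changed = True
                if flow[i][j] > eps and dist[m + j] - cost[i][j] < dist[i] - eps:
                    dist[i], prev[i] = dist[m + j] - cost[i][j], m + j  # undo flow
                    changed = True
            if not changed:
                break
        # Cheapest target that still needs mass, then trace the path back.
        end = min((m + j for j in range(n) if demand[j] > eps), key=lambda v: dist[v])
        path, node = [], end
        while prev[node] is not None:
            path.append((prev[node], node))
            node = prev[node]
        start = node
        amount = min(supply[start], demand[end - m])
        for u, v in path:
            if u >= m:  # backward edge: limited by the flow it cancels
                amount = min(amount, flow[v][u - m])
        for u, v in path:
            if u < m:
                flow[u][v - m] += amount
            else:
                flow[v][u - m] -= amount
        supply[start] -= amount
        demand[end - m] -= amount
    return sum(flow[i][j] * cost[i][j] for i in range(m) for j in range(n))


def distance_matrix(compositions):
    """Pairwise compositional distances for a list of composition dicts."""
    dists = [to_distribution(c) for c in compositions]
    return [[emd(a, b) for b in dists] for a in dists]


if __name__ == "__main__":
    compounds = [{"Cu": 2, "S": 1}, {"Cu": 2, "Se": 1}, {"Cu": 1, "S": 1}]
    for row in distance_matrix(compounds):
        print(["%.3f" % d for d in row])
```

A matrix of this kind could then be combined with a geometrical distance and passed to a DBSCAN implementation that accepts precomputed distances, for example scikit-learn's `DBSCAN` with `metric="precomputed"`. How the two similarities are weighted and which DBSCAN parameters are used are not stated in the abstract.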