Materials Informatics. Methods, Tools and Applications 978-3-527-34121-4

English Pages 298 Year 2019

Materials Informatics

Materials Informatics: Methods, Tools and Applications

Edited by Olexandr Isayev, Alexander Tropsha, and Stefano Curtarolo

Editors

Prof. Olexandr Isayev
University of North Carolina at Chapel Hill, UNC Eshelman School of Pharmacy, Campus Box 7568, Chapel Hill, NC, United States

Prof. Alexander Tropsha
University of North Carolina at Chapel Hill, UNC Eshelman School of Pharmacy, Campus Box 7568, Chapel Hill, NC, United States

Prof. Stefano Curtarolo
Duke University, Mechanical Engineering & Mat. Science, 144 Hudson Hall, Durham, NC, United States

Cover Image: © Floriana/Getty Images

All books published by Wiley-VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.

Library of Congress Card No.: applied for

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at .

© 2019 Wiley-VCH Verlag GmbH & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany

All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form – by photoprinting, microfilm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.

Print ISBN: 978-3-527-34121-4
ePDF ISBN: 978-3-527-80225-8
ePub ISBN: 978-3-527-80227-2
oBook ISBN: 978-3-527-80226-5

Typesetting SPi Global, Chennai, India
Printing and Binding

Printed on acid-free paper
10 9 8 7 6 5 4 3 2 1

Contents

1 Crystallography Open Database: History, Development, and Perspectives 1
Saulius Gražulis, Andrius Merkys, Antanas Vaitkus, Daniel Chateigner, Luca Lutterotti, Peter Moeck, Miguel Quiros, Robert T. Downs, Werner Kaminsky, and Armel Le Bail

1.1 Introduction 1
1.2 Open Databases for Science 3
1.3 Building COD 6
1.3.1 Scope and Contents 7
1.3.2 Data Sources 7
1.3.3 Data Maintenance 8
1.3.3.1 Version Control 11
1.3.3.2 Data Curation Policies 12
1.3.3.3 Quarterly Releases 13
1.3.4 Sister Databases (PCOD, TCOD) 14
1.4 Use of COD 14
1.4.1 Data Search and Retrieval 14
1.4.1.1 Data Identification 15
1.4.1.2 Web Search Interface 15
1.4.1.3 RESTful Interfaces 15
1.4.1.4 Output Formats 17
1.4.1.5 Accessing COD Records 17
1.4.1.6 MySQL Interface 18
1.4.1.7 Alternative Implementations of COD Search on the Web 20
1.4.1.8 Installing a Local Copy of the COD 21
1.4.1.9 File System-Based Queries 23
1.4.1.10 Programmatic Use of COD CIFs 24
1.4.2 Data Deposition 26
1.5 Applications 27
1.5.1 Material Identification 27
1.5.2 Applications for the Mining Industry 27
1.5.3 Extracting Chemical Information 28
1.5.4 Property Search 30
1.5.5 Geometry Statistics 30
1.5.6 High-Throughput Computations 31
1.5.7 Applications in College Education and Complementing Outreach Activities 31
1.6 Perspectives 32
1.6.1 Historic Structures 32
1.6.2 Theoretical Data in (T)COD 32
1.6.3 Conclusion 32
Acknowledgments 33
References 33

2 The Inorganic Crystal Structure Database (ICSD): A Tool for Materials Sciences 41
Stephan Rühl

2.1 Introduction 41
2.2 Content of ICSD 42
2.3 Interfaces 46
2.4 Applications of ICSD 46
2.4.1 Prediction of Ferroelectricity 47
2.4.2 Using the Concept of Structure Types 47
2.4.3 Two Examples of Training Machine Learning Algorithms with ICSD Data 48
2.4.4 High-Throughput Calculation 50
2.5 Outlook 51
References 51

3 Pauling File: Toward a Holistic View 55
Pierre Villars, Karin Cenzual, Roman Gladyshevskii, and Shuichi Iwata

3.1 Introduction 55
3.1.1 Creation and Development of the PAULING FILE 57
3.2 PAULING FILE: Crystal Structures 57
3.2.1 Data Selection 58
3.2.2 Categories of Crystal Structure Entries 58
3.2.3 Database Fields 59
3.2.4 Structure Prototypes 62
3.2.5 Standardized Crystallographic Data 63
3.2.5.1 Checking of Symmetry 63
3.2.5.2 Standardization 65
3.2.5.3 Comparison with the Type-Defining Data Set 67
3.2.6 Assigned Atom Coordinates 67
3.2.7 Atomic Environment Types (AETs) 68
3.2.8 Cell Parameters from Plots 72
3.3 PAULING FILE: Phase Diagrams 72
3.4 PAULING FILE: Physical Properties 75
3.4.1 Data Selection 75
3.4.2 Database Fields 76
3.4.3 Physical Properties Considered in the PAULING FILE 76
3.5 Data Quality 80
3.5.1 Computer-Aided Checking 80
3.6 Distinct Phases 81
3.6.1 Chemical Formulas and Phase Names 83
3.6.2 Phase Classifications 84
3.7 Toward a Megadatabase 84
3.8 Applications 89
3.8.1 Products Containing PAULING FILE Data 89
3.8.2 Holistic Overviews Based on the PAULING FILE 91
3.8.3 Principles Defining Ordering of Chemical Elements 92
3.9 Lessons to Learn from Experience 99
3.10 Conclusion 103
References 104

4 From Topological Descriptors to Expert Systems: A Route to Predictable Materials 107
Alexander P. Shevchenko, Eugeny V. Alexandrov, Olga A. Blatova, Denis E. Yablokov, and Vladislav A. Blatov

4.1 Introduction 107
4.2 Topological Tools for Developing Knowledge Databases 108
4.2.1 Why Topological? 108
4.2.2 Topological vs. Other Descriptors of Crystal Structures 110
4.2.3 Topological vs. Crystallographic Databases 111
4.2.4 Deriving Topological Knowledge from Crystallographic Data 116
4.2.4.1 Algorithms for Topological Analysis 116
4.2.4.2 Building Distributions of Descriptors 118
4.2.4.3 Finding Correlations Between Descriptors 123
4.2.5 Universal Data Storage 126
4.3 Applications of Topological Tools in Crystal Chemistry and Materials Science 131
4.3.1 Network Topology Prediction 131
4.3.2 Prediction of Properties 137
4.4 Conclusions 137
References 138

5 A High-Throughput Computational Study Driven by the AiiDA Materials Informatics Framework and the PAULING FILE as Reference Database 149
Martin Uhrin, Giovanni Pizzi, Nicolas Mounet, Nicola Marzari, and Pierre Villars

5.1 Introduction 149
5.1.1 Three Key Developments Opened Up Unprecedented Opportunities 150
5.1.2 Relative Few Inorganic Solids Have Been Experimentally Investigated 151
5.2 Nature Defines Cornerstones Providing a Marvelously Rich but Still Very Rigid Systematic Framework of Restraint Conditions 151
5.3 The First, Second, and Third Paradigms 153
5.4 The Realization of the Fourth and Fifth Paradigms Requires Three Preconditions 153
5.4.1 Introduction of the Prototype Classification to Link Crystallographic Databases Created by Different Groups 153
5.4.2 Introduction of the Distinct Phases Concept to Link Different Kinds of Inorganic Solids Data 154
5.4.3 The Existence of a Comprehensive, Critically Evaluated Inorganic Solids Database Concept (DBMS) of Experimentally Determined Single-Phase Inorganic Solids Data to Be Used as Reference 154
5.5 The Core Idea of the Fifth Paradigm 154
5.6 Restraint Conditions Revealed by “Inorganic Solids Overview–Governing Factor Spaces (Maps)” Discovered by Data-Mining Techniques 156
5.6.1 Compound Formation Maps 157
5.6.2 Atomic Environment Type Stability Maps for AB Inorganic Solids 158
5.6.3 Twelve Principles in Materials Science Supporting Three Cornerstones Given by Nature 159
5.7 Quantum Simulation Strategy 161
5.8 Workflows Engine in AiiDA to Carry Out High-Throughput Calculation for the Creation of the Materials Cloud, Binaries Edition 164
5.8.1 AiiDA 164
5.8.2 SSSP (Standard Solid State Pseudopotentials) Library 165
5.8.3 Workflows 166
5.8.4 Workfunctions 166
5.8.5 Workchains 166
5.8.6 Workflows Used in This Project 168
5.9 Conclusions 169
Acknowledgment 169
References 169

6 Modeling Materials Quantum Properties with Machine Learning 171
Felix A. Faber and O. Anatole von Lilienfeld

6.1 Introduction 171
6.2 Kernel Ridge Regression 171
6.3 Model Assessment 173
6.3.1 Learning Curve 173
6.3.2 Speedup 174
6.4 Representations 176
6.5 Recent Developments 177
References 178

7 Automated Computation of Materials Properties 181
Cormac Toher, Corey Oses, and Stefano Curtarolo

7.1 Introduction 181
7.2 Automated Computational Materials Design Frameworks 182
7.2.1 Generating and Using Databases for Materials Discovery 182
7.2.2 Standardized Protocols for Automated Data Generation 185
7.3 Integrated Calculation of Materials Properties 187
7.3.1 Autonomous Symmetry Analysis 189
7.3.2 Elastic Constants 191
7.3.3 Quasi-harmonic Debye–Grüneisen Model 193
7.3.4 Harmonic Phonons 195
7.3.5 Quasi-harmonic Phonons 197
7.3.6 Anharmonic Phonons 198
7.4 Online Data Repositories 198
7.4.1 Computational Materials Data Web Portals 198
7.4.2 Programmatically Accessible Online Repositories of Computed Materials Properties 200
7.5 Materials Applications 202
7.5.1 Disordered Materials 202
7.5.1.1 High Entropy Materials 203
7.5.1.2 Metallic Glasses 203
7.5.1.3 Modeling Off-Stoichiometry Materials 204
7.5.2 Superalloys 205
7.5.3 Thermoelectrics 205
7.5.4 Magnetic Materials 208
7.6 Conclusion 209
Acknowledgments 209
References 209

8 Cognitive Chemistry: The Marriage of Machine Learning and Chemistry to Accelerate Materials Discovery 223
Edward O. Pyzer-Knapp

8.1 Introduction 223
8.2 Describing Molecules for Machine Learning Algorithms 224
8.3 Building Fast and Accurate Models with Machine Learning 234
8.3.1 Squared Exponential Kernel 239
8.3.2 Rational Quadratic Kernel 240
8.4 Searching Through Chemical Libraries 244
8.5 Conclusion 248
References 249

9 Machine Learning Interatomic Potentials for Global Optimization and Molecular Dynamics Simulation 253
Ivan A. Kruglov, Pavel E. Dolgirev, Artem R. Oganov, Arslan B. Mazitov, Sergey N. Pozdnyakov, Efim A. Mazhnik, and Alexey V. Yanilkin

9.1 Introduction 253
9.2 Machine Learning Potential for Global Optimization 258
9.2.1 Lattice Sums Method 258
9.2.2 Feature Vector 261
9.2.3 Feature Vector Analysis 262
9.2.4 Examples of Machine Learning Interatomic Potentials 265
9.2.4.1 Aluminum 265
9.2.4.2 Carbon 267
9.2.4.3 Helium and Xenon 271
9.2.5 Discussion 272
9.3 Interatomic Potential for Molecular Dynamics 273
9.3.1 General Form of the Potential 273
9.3.2 Parameters Selection 274
9.3.3 Thermodynamic Quantities and Phase Transitions 277
9.3.4 Interatomic Potential for System of Two (or More) Atomic Types 281
9.4 Statistical Approach for Constructing ML Potentials 284
9.4.1 Two-Body Potential 284
9.4.2 Three-Body Potential 286
Acknowledgements 286
References 286

Index 289

1 Crystallography Open Database: History, Development, and Perspectives

Saulius Gražulis¹, Andrius Merkys¹, Antanas Vaitkus¹, Daniel Chateigner², Luca Lutterotti³, Peter Moeck⁴, Miguel Quiros⁵, Robert T. Downs⁶, Werner Kaminsky⁷, and Armel Le Bail⁸

¹ Vilnius University, Institute of Biotechnology, Department of Protein-DNA Interactions, Saulėtekio al. 7, 10257 Vilnius, Lithuania
² Normandie Université, Université de Caen Normandie, CRISMAT-CNRS, ENSICAEN, IUT-Caen, boulevard du Maréchal Juin, 6, 14050, Caen Cedex, France
³ University of Trento, Department of Industrial Engineering, Via Sommarive 9, 38123, Trento, Italy
⁴ Portland State University, Department of Physics, 1719 SW 10th Avenue, Portland, OR 97201, USA
⁵ Universidad de Granada, Departamento de Química Inorgánica, Facultad de Ciencias, Avenida de Fuentenueva, 18071, Granada, Spain
⁶ University of Arizona, Department of Geosciences, 1040 E 4th Street, Tucson, AZ 85721, USA
⁷ University of Washington at Seattle, Department of Chemistry, 4000 15th Avenue NE 36 Bagley Hall, Seattle, WA 98195-1700, USA
⁸ Université du Maine, Institut des Molécules et des Matériaux du Mans, Département des Oxydes et Fluorures, CNRS UMR 6283, 72085 Le Mans, France

Materials Informatics: Methods, Tools and Applications, First Edition. Edited by Olexandr Isayev, Alexander Tropsha, and Stefano Curtarolo. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

1.1 Introduction

Science is crucially based on observational data. As an example of an ancient data-driven discovery, the observation of equinox precession by Hipparchus around 130 BCE comes to mind [1]: Hipparchus compared the longitudes of Spica, Regulus, and other bright stars with the measurements of his predecessors, Timocharis and Aristillus, who lived about 100 years earlier, and concluded from the differences that the equinox points drift with time. Needless to say, this discovery could only be made because the old observations of Timocharis' school were meticulously recorded, accurate enough, and preserved for future generations.

Today, the amount of data that scientists collect each year has grown by roughly 10 orders of magnitude, with fields such as astronomy or particle physics currently accumulating from several terabytes (TB) [2] to as much as 15 petabytes (PB) of data per year [3, 4]. In crystallography, the need for long-term data preservation was recognized very early. Currently, the International Union of Crystallography (IUCr) and the crystallographic community take great care with respect to data archiving and data reuse. The IUCr has rigorously described the mathematical definitions necessary for crystal structure and experiment description in the International Tables for Crystallography [5] and created the crystallographic information file/framework (CIF) standard for crystallographic data exchange [6, 7], which is constantly maintained to address new challenges in data management [8].

Crystal diffraction data have been accumulated systematically since as early as 1941 [9] and are archived in various crystallographic databases (Table 1.1), the largest ones being the Crystallography Open Database (COD) [21], the Cambridge Structural Database (CSD) [24], the Inorganic Crystal Structure Database (ICSD) [25], the Pauling File [28], the Protein Data Bank (PDB) [30], the Powder Diffraction File (PDF) from the International Centre for Diffraction Data (ICDD) [9], and CRYSTMET [26]. Several other databases that focus on specific aspects of crystallographic data exist; the structures they mention are usually included in one or several of the above-mentioned databases. References to these specialized databases will be given in the following text.

Table 1.1 Material property and structure databases available online.

No. | Database | Approx. no. of records | License | Current URL | Est. | References
1 | MPOD | 300 | Public domain | http://mpod.cimav.edu.mx | 2010 | [10]
2 | RRUFF | 47 000 | Open access | http://rruff.info/ | 2015 | [11]
3 | AMCSD | 20 000 | Open access | http://rruff.geo.arizona.edu/AMS/amcsd.php | 2003 | [12, 13]
4 | IZA Zeolite database | 176 a) | Open access | http://www.iza-structure.org/databases/ | 1996 | [14]
5 | Bilbao server | – | | http://www.cryst.ehu.es | 1997 | [15]
6 | B-IncStrDB (Bilbao Incommensurate Structures Database) | 140 | Open access | http://webbdcrista1.ehu.es/incstrdb/ | 2010 | [16]
7 | MAGNDATA (Bilbao Magnetic Structure Database) | 428 | Open access | http://webbdcrista1.ehu.es/magndata/ | 2015 | [17]
8 | NDB | 8 600 | Open access | http://ndbserver.rutgers.edu/ | 1992 | [18, 19]
9 | COD | 400 000 | Public domain | http://www.crystallography.net/cod | 2003 | [20, 21]
10 | PCOD | 1 000 000 | Public domain | http://www.crystallography.net/pcod | 2003 | [22]
11 | TCOD | 2 000 | Public domain | http://www.crystallography.net/tcod | 2013 | [23]
12 | CSD | 800 000 | Subscription based | http://www.ccdc.cam.ac.uk/solutions/csdsystem/components/csd/ | 1965 | [24]
13 | ICSD | 200 000 | Subscription based | https://icsd.fiz-karlsruhe.de/ | 1987 | [25]
14 | PDF | 380 000 | Subscription based | http://www.icdd.com/products/pdf4.htm | 1941 | [9]
15 | CRYSTMET | 170 000 | Subscription based | http://www.TothCanada.com, b) https://cds.dl.ac.uk/cgi-bin/news/disp?crystmet | 1996 | [26]
16 | Linus Pauling file | 290 000 | Subscription based; free of charge queries accepted c) | http://paulingfile.com, http://crystdb.nims.go.jp/index_en.html | 1995 | [27, 28]
17 | PDB | 124 000 | Open access | http://www.rcsb.org/pdb | 1971 | [29, 30]
18 | BMCD | 43 000 | Open access | http://xpdb.nist.gov:8060/BMCD4 | 1995 | [31]

a) The number of unique zeolite framework types that had been approved and assigned a 3-letter code by the Structure Commission of the IZA.
b) The page at http://www.TothCanada.com advertised in [26] seems no longer operational, but access for subscribers is advertised at https://cds.dl.ac.uk/cgi-bin/news/disp?crystmet.
c) Free of charge queries are offered at http://crystdb.nims.go.jp/index_en.html, but “no reproduction, republication or distribution to third parties of any content is permitted without written permission of NIMS.”

Before 2003, of the above-mentioned crystal structure archives, only the PDB offered full open access to the crystallographic data it contained; all other databases followed a subscription-based model, offering little or no data on the Web for the general public or nonsubscribers, requiring purchase of a license for systematic data searches, and, occasionally, restricting publication of derived data [32, 33]. The advent of the Web, ubiquitous computing, and the advantages of open linked data prompted a group of crystallographers to initiate the COD, offering crystal structures for chemical crystallography on similar grounds as the PDB provides them for macromolecular crystallography. Currently, the COD and the PDB remain the two largest databases that offer open access to crystallographic data, and together they cover the largest domain of crystal structures available in an open way. While other databases contain larger collections of crystal structures and claim a higher level of data curation than the COD [34], they still require the acquisition of licenses for systematic data searches.

In this chapter, we will review the COD contents, data collection, and data curation policies. We will then describe the various ways in which COD data can be accessed and used. Finally, we will give examples of COD applications in the fields of crystallography, chemistry, material identification, and teaching.

1.2 Open Databases for Science

Over the years, various researchers have found that open access to articles consistently increases citations of these publications [35–39]. Similar trends are observed for data in the field of bioinformatics [40], and one would expect crystallography to follow them as well. Thus there is a purely pragmatic reason for researchers to deposit data openly so that they are findable, reusable, and citable. For the user of data, the absence of paywalls and use restrictions provides the convenience of one-click access to data. Finally, there are ethical considerations: most published research was funded by public money, and the members of society whose taxes were used to produce scientific results can reasonably expect that these results will be available to them without demands for extra payment and without restrictions. Understandably, then, many funding agencies require that researchers whom they have supported publish their results under open-access licenses for both publications and data.

Figure 1.1 COD record number growth. [Plot not reproduced; axes: year (2003–2019) versus COD records (thousands).]

To answer the above-mentioned concerns, many open databases have been established by researchers. In the following, we describe topic-specific databases, in addition to the more general databases outlined previously. A list of scientific databases in the field of biosciences can be found in Nucleic Acids Research [41], and crystallographic databases are listed by the IUCr (http://www.iucr.org/resources/data). The COD incorporates a continuously increasing number of determined crystal structures, reaching >367 000 entries at the time of writing this chapter (Figure 1.1). The equivalent of the COD for structures obtained from first-principles calculation and/or optimization is the Theoretical Crystallography Open Database (TCOD), started in 2013 and consequently holding a more modest number of entries, around 2000 (Figure 1.2). However, such entries require long calculation times, and one can expect larger increases in the years to come.

Figure 1.2 TCOD record number growth. [Plot not reproduced; axes: year (2013–2019) versus TCOD records (thousands).]

The COD was founded in February 2003 as a grassroots initiative: its establishment was proposed in a letter published on the Structure Determination by Powder Diffractometry (SDPD) mailing list by Michael Berndt:

  What if crystallographers work together to establish a public domain database with all relevant crystallographic data? This would not only overcome the current situation with ‘fragmented’ databases, it would also prevent for becoming dependent from monopolists. What would be needed?
  1. A small team of engaged scientists with some experience in database and software design to coordinate the project.
  2. The authors (i.e. the scientific community = you) who provide the project with database entries (note, that if you haven’t sold your experimental results exclusively, you are free to distribute the data to such a database, even if they have already been part of a publication and a lot of good data have never been published).
  3. Free software
     a) for maintaining the database,
     b) for data evaluation and calculation of derived data (e.g. calculated powder pattern from crystal structures for search-match purposes),
     c) for browsing and retrieval.
  We are not in the same situation as decades before when the well-known databases (ICSD, CSD, PDF) started. Today we have the Internet, fast computers, and a big pool of free available software. The question is: Do we have enough scientists who are willing to cooperate?

Several laboratories contributed a lot to the COD at its very beginning. Bob Downs offered his collection of mineralogical data, including the whole American Mineralogist Crystal Structure Database (AMCSD) [13] data set (all the crystal structures previously published in the American Mineralogist that were made freely accessible from the websites of the Mineralogical Society of America). The necessary MySQL/PHP scripts were written by Hareesh Rajan. In the meantime, Daniel Chateigner joined, and less than three weeks after the letter from Michael Berndt, the COD project was announced in various Internet media (Newsgroups, various mailing lists, and What’s New pages) by the following letter:

  Dear Crystallographers, a project of Crystallography Open Database (COD), accommodating crystal structure atomic coordinates prior to their publication, is under development. It is intended to give faster access to the latest structure determinations, openly. Its development and success depends completely on your contributions, either by data download or/and by giving help in software improvements. Visit the COD project Web pages (www.crystallography.net) for more details and a crystallography database(s) quiz. Thanks for your future help, the COD is yours, it is the right time to do something for an open database controlled by crystallographers, now or never!
  The advisory board (wishing to enlarge): Michael Berndt, Daniel Chateigner, Robert T. Downs, Lachlan M.D. Cranswick, Armel Le Bail, Luca Lutterotti, Hareesh Rajan

This letter produced a lot of positive and negative comments. Some researchers who responded positively joined the COD team, and the number of entries in the COD increased, exceeding 5000 by the end of March 2003 (3725 CIFs from the AMCSD, 450 CIFs from the Laboratoire des Oxydes et Fluorures, Université du Maine (LdOF), and 850 CIFs from the CRISMAT). The CIF2COD computer program (FORTRAN) was built on the basis of CIF2SX with the permission of Louis Farrugia. CIF2COD reads several CIFs (from n.cif to n+m.cif), performs several quality tests, and produces a .txt file of m+1 lines, one per CIF, each holding the fields of the single table (data) of the MySQL database (cod): a, b, c, alpha, beta, gamma, volume, number of elements, space group, chemical formula, reference, and additional text. The first minimal COD search page was coded in the PHP language.

Donations continued in April 2003 (1200 CIFs from IPMC), and the IUCr was contacted with a request for permission to systematically download the CIFs freely available at the IUCr website. The decision had to wait for the next IUCr Executive Committee meeting in August 2003. After four months, the number of entries in the COD reached 12 000, essentially through donations from individuals, laboratories, and the AMCSD.

Then came the sad news. Michael Berndt died on 30 June 2003, after a long, serious illness, at the age of 39. Lachlan Cranswick went missing on 18 January 2010 at the age of 41, and his body was later found in the Ottawa River, near Deep River. Despite the losses, the COD team continued to implement Michael Berndt's plan and to work on the database.
Five years after its founding, the COD passed a major milestone in 2008, by archiving the 50 000th entry. To attain completion, the COD should add much more than 40 000 new entries per year and also digitize older data that were published in print form. The required growth rate of the COD was attained in 2011 (Figure 1.1), when automated procedures for crystallographic data collection were implemented. Nevertheless, a lot of work remains to be done and the COD welcomes contributions from all crystallographers in order to accelerate its completion. During the past 10 years, the COD Advisory Board underwent some variations, departures, and new admissions, and the list of coauthors of this chapter reflects the current situation, presenting the main actors of the COD development until now.

1.3 Building COD

The COD collects all published crystal structures with small- to medium-sized unit cells. To facilitate this process, the CIF framework is employed. Currently, the COD uses version 1.1 of the CIF framework [7]. The framework files (CIFs) are used to input data into the COD, as an intermediate versioned archive for storage, and for providing data to the users.

The main founding principle of the COD is open access: all data are readily available on the Internet. COD data records are identified by stable Uniform Resource Identifiers (URIs) and are accessible via the REpresentational State Transfer (REST) interface. The COD main page on the Web (http://www.crystallography.net/) states, “All data on this site have been placed in the public domain by the contributors,” which we regard as binding for the COD Advisory Board, data maintainers, and contributors. All deposited data, unless embargoed by depositors for a fixed amount of time as a “prepublication deposition,” are available on the Internet immediately after deposition and are accessible via the automatically generated stable identifiers. Such an arrangement enables immediate and permanent linking of COD structures into the World Wide Web fabric.

Each data item that is committed to the COD repository is first of all checked for the syntactic correctness of the incoming CIF. Since not all submitted files can be guaranteed to conform to the formal CIF definition [42], an error-correcting CIF parser [43] is employed. This ensures that all COD CIFs can be automatically parsed and supports unassisted COD data processing.
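As a rough illustration of how such stable identifiers can be used programmatically, the following sketch builds a record URI from a numeric COD identifier. It assumes that identifiers are seven-digit numbers and that a record is addressed as `<base>/<id>.cif` under the COD base URL listed in Table 1.1; neither detail is spelled out in the text above, and the function name `cod_record_uri` is hypothetical rather than part of any COD software.

```python
import re

# Base URL of the COD as listed in Table 1.1.
COD_BASE_URL = "http://www.crystallography.net/cod"


def cod_record_uri(cod_id):
    """Return the URI of one COD record.

    Assumptions (not stated in the chapter text): identifiers are
    seven-digit numbers, and a record lives at <base>/<id>.cif.
    """
    text = str(cod_id)
    if not re.fullmatch(r"[1-9][0-9]{6}", text):
        raise ValueError(f"not a seven-digit COD identifier: {text!r}")
    return f"{COD_BASE_URL}/{text}.cif"


if __name__ == "__main__":
    print(cod_record_uri(1000000))
```

Because the URI is derived from the identifier alone, a link built this way can be stored in a paper or a database without contacting the server first.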

1.3.1 Scope and Contents

The COD aims at collecting all experimentally determined small-molecule crystal structures into an open-access resource. The “small-molecule” category encompasses all inorganic, metal–organic, and organic compounds, with the exception of macromolecules (organic polymers). The latter are being collected into dedicated well-known open-access databases such as the PDB [44] and the Nucleic Acid Database (NDB) [18, 19]. As an experimental database, the COD collects structures determined by any experimental method. However, there are sister databases, the PCOD and the TCOD, which aim to collect predicted and theoretically determined structures, respectively (see Section 1.3.4 for a more comprehensive description). COD structures may be refined using just X-ray data and first physical principles (using full-matrix least-squares methods), but they may also be refined using restraints (especially when determined using powder diffraction methods) or, more recently, hybrid methods (from experimental powder data using Rietveld and Le Bail methods combined with first principles using density functional theory (DFT)).

1.3.2 Data Sources

The COD acquires most of its structures (over 90%) from peer-reviewed scientific publications. The rest are deposited by authors either as personal communications or as prepublication depositions. Data published in papers are subjected to checks for conformance with CIF syntax, CIF dictionary definitions, and the completeness of bibliographic and other provenance information. Personal communications and prepublication depositions are in addition checked for conformance to the IUCr data criteria (see ftp://ftp.iucr.org/pub/dvntests and http://journals.iucr.org/services/cif/checking/autolist.html). The COD permits both manual deposition by crystallographers using a Web interface (http://www.crystallography.net/cod/deposit) and automated deposition using various Web-inspecting engines. Automated Web searches are conducted on journals that publish openly accessible crystallographic supplementary data. Data are also automatically extracted from open-access publications. Data from other crawlers, such as CrystalEye [45], and other open databases (e.g. AMCSD [13]) are incorporated into the COD on a regular basis, using either automated or semi-manual procedures. Such a strategy permits broad coverage of published structures with few resources; it leverages the power of Internet automation while at the same time permitting humans to intervene at critical points when necessary.

It must be noted with regret that some journals still do not provide the supporting data for their papers openly. Data are either located behind paywalls or available only in subscription-based databases with explicit restrictions on their reuse. Unfortunately, this makes the technically simple task of collecting all currently published crystal structures into open databases virtually impossible, for purely organizational reasons. The barriers are not even related to intellectual property, since published data and facts of nature are not copyrightable. We thus urge everyone who sees virtue in open scientific data exchange and has benefited from open-access databases to approach every publisher and ask them to provide the underlying publication data for deposition in open-access databases, or to deposit their own crystallographic data directly in the COD.

1.3.3 Data Maintenance

Scientific databases are an indispensable resource in modern-day research, and as such they must adhere to the criteria of all properly designed experiments: reproducibility and traceability. Obtained results are of little value if repetition of the same procedures under the same conditions yields a different outcome. The same holds true if the experiments are purely computational in origin, such as simulations [46] or the compilation of statistical data. In addition, any conclusions drawn from claims of untraceable origin become unverifiable and run the risk of polluting every subsequent experiment they are used in. As a result, the Write Once Read Many (WORM) principle, which ensures that once data are written they are never changed irreversibly, becomes a necessity for scientific databases.

Collecting and preserving scientific data is an important endeavor, but maintaining them is a task of no less importance. Reasons behind the need to modify the data are numerous, from a simple human error to new insights about the data or even the introduction of a novel way of describing a certain phenomenon. The means for updating scientific articles by issuing addenda and errata are well established; however, the same mechanism is usually not applied to the supplementary material. A more common approach is to silently replace the outdated version with a new one, leaving the returning reader with a very unexpected sense of jamais vu. The situation is only worsened by the fact that supplementary material is rarely well reviewed before publishing, resulting in an even greater need for a proper data maintenance strategy.
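The WORM idea can be pictured with a small append-only store: a correction never overwrites a record but adds a new revision, so every earlier state stays retrievable. The sketch below is only an illustration of that principle under simple assumptions (in-memory storage, integer revision numbers); the class name `AppendOnlyStore` is hypothetical and does not describe how the COD itself stores revisions.

```python
class AppendOnlyStore:
    """Minimal in-memory illustration of Write Once Read Many storage."""

    def __init__(self):
        # Maps a record identifier to the list of all its revisions,
        # oldest first. Entries are only ever appended, never replaced.
        self._revisions = {}

    def write(self, record_id, content):
        """Store a new revision and return its 1-based revision number."""
        history = self._revisions.setdefault(record_id, [])
        history.append(content)
        return len(history)

    def read(self, record_id, revision=None):
        """Return the latest revision, or a specific earlier one."""
        history = self._revisions[record_id]
        if revision is None:
            return history[-1]
        if not 1 <= revision <= len(history):
            raise KeyError(f"no revision {revision} for {record_id!r}")
        return history[revision - 1]


if __name__ == "__main__":
    store = AppendOnlyStore()
    store.write("example", "_cell_length_a 5.43")
    store.write("example", "_cell_length_a 5.431")  # a later correction
    print(store.read("example"))      # latest revision
    print(store.read("example", 1))   # original text is still available
```

A result computed against revision 1 can therefore still be reproduced after the record has been corrected, which is the traceability property the text asks for.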

1.3 Building COD

Table 1.2 Error classes routinely addressed by the COD maintainers.

| Error class | Ease of detection | Ease of correction | Effect on data usability |
|---|---|---|---|
| Syntax | Detected automatically by the parser | Mostly automatic | Unreadable file |
| Semantic | Detected mostly automatically by specialized software, requires occasional manual analysis | Automatic and manual | Incorrect supporting information |
| Crystal structure | Detected by specialized software and manual analysis | Mostly manual | Incorrect crystal structure |

Data discrepancies addressed by the COD maintainers can be grouped into three main classes: syntax errors, semantic errors, and errors relating to the crystal structure. Each of these classes requires different detection and correction strategies and affects the data usability to a varying degree (see Table 1.2).

The initial step of data management in the COD is the detection and correction of syntactic errors. This kind of discrepancy is especially important since it renders the files unreadable and limits the possibility of any further data maintenance. Crystallographic structures in the COD are stored as CIFs, a format that has been adopted by the crystallographic community. However, even with the widespread use of the CIF format, none of the parsers available at the time were capable enough to satisfy the specific needs arising from the curation of large data sets. As a result, maintainers of the COD have developed an open-source error-fixing CIF parser, which is able to correct some of the most prominent syntax errors [43]. Initial file parsing upon deposition as well as routine database-wide checks guarantee that at any given moment all files in the COD can be read correctly according to the CIF format rules.

Syntactical correctness ensures that the files are readable, but does not guarantee the validity of the data stored inside the files; this is the task of semantic validation. Due to the great variety of semantic errors and the fact that they usually affect only a portion of the data in the file, the COD has adopted a very flexible policy regarding discrepancies of this kind. During the initial deposition, semantic errors are recognized, automatically corrected, and reported to the depositor; if an automatic correction is not possible, these errors are recorded in an internal database for further analysis. Once a significant number of similar errors accumulates, heuristics-based programs are developed to fix the errors in question automatically. Since it is unreasonable to expect perfect detection of all possible semantic error cases in advance, the file validation strategy also addresses the handling of new kinds of semantic discrepancies that were missed during the initial deposition. In this case, heuristics-based programs are developed for the detection of these new errors, and the whole database is revalidated based on the new criterion. In the end, both the new error-correction programs and the new error-detection programs are integrated into the deposition step. The described workflow ensures that the overall semantic validity of the COD data set will only increase.
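The detect, correct, and record cycle described above can be pictured with a minimal Python sketch. It is not the cod-tools implementation: the `Check` and `Report` classes, the `validate` function, and the two example checks are hypothetical names introduced here for illustration, and the only real identifier is the CIF data item `_cell_measurement_temperature`. A check without a `fix` stands for an error class for which no heuristic has been written yet.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Entry = Dict[str, str]  # data item name -> raw value as read from the CIF


@dataclass
class Check:
    """One semantic check; `fix` stays None until a heuristic exists."""
    name: str
    detect: Callable[[Entry], bool]
    fix: Optional[Callable[[Entry], Entry]] = None


@dataclass
class Report:
    corrected: List[str] = field(default_factory=list)   # reported to the depositor
    unresolved: List[str] = field(default_factory=list)  # recorded for later analysis


def validate(entry: Entry, checks: List[Check]) -> Tuple[Entry, Report]:
    """Apply every check in order, fixing what can be fixed automatically."""
    current, report = dict(entry), Report()
    for check in checks:
        if not check.detect(current):
            continue
        if check.fix is None:
            report.unresolved.append(check.name)
        else:
            current = check.fix(current)
            report.corrected.append(check.name)
    return current, report


# Hypothetical checks on a real CIF data item, for illustration only.
TEMP = "_cell_measurement_temperature"
_KELVIN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*K\s*$")


def has_kelvin_designator(entry: Entry) -> bool:
    return _KELVIN.match(entry.get(TEMP, "")) is not None


def strip_kelvin_designator(entry: Entry) -> Entry:
    fixed = dict(entry)
    fixed[TEMP] = _KELVIN.match(entry[TEMP]).group(1)
    return fixed


def looks_like_celsius(entry: Entry) -> bool:
    return entry.get(TEMP, "").strip().upper().endswith("C")


CHECKS = [
    Check("unit designator after temperature",
          has_kelvin_designator, strip_kelvin_designator),
    Check("temperature apparently in Celsius", looks_like_celsius),  # no fix yet
]
```

In this sketch, adding a newly developed heuristic amounts to appending another `Check` to `CHECKS` and rerunning `validate` over all stored entries, which mirrors the revalidation step described in the text.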


1 Crystallography Open Database: History, Development, and Perspectives

The set of computer programs developed by the COD maintainers for the detection and correction of syntactic errors is collectively called cod-tools. These tools are capable of recognizing most of the problems listed in the IUCr validation criteria (http://journals.iucr.org/services/cif/checking/autolist.html), such as misspellings of data item names or their enumerated values, as well as some other common issues identified by scanning the COD. Examples of such discrepancies include data items designated to specify temperature containing values in units other than Kelvin, or data items used to describe the density of a crystal containing values in kg/m³ instead of g/cm³. Instances of errors like these might not seem significant when handling individual files, but they do complicate the workflow and skew the results of database-wide analyses. Fortunately, some of the errors can be corrected automatically by using heuristics (for example, unit designators after the temperature values); others, however, require manual curation.

One type of manually curated error is an incorrect number of implicit hydrogen atoms. This number, provided using the “_atom_site_attached_hydrogens” data item, specifies the number of hydrogen atoms attached to the atom site, excluding the hydrogen atoms for which coordinates are given explicitly. Such discrepancies are easily spotted even by a novice chemist, but they are much harder to detect automatically. Incorrectly marked hydrogen atoms result in erroneous calculated atom charges, a mismatch between the declared and the calculated formulas, and skewed distributions of geometric parameters.

Errors in the coordinates, cell constants, and symmetry are especially difficult to locate and correct. Nevertheless, the structures in the COD are routinely scanned for “bumps” (suspiciously small interatomic distances) and voids.
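A scan for “bumps” can be illustrated with a deliberately simplified Python sketch. It assumes that atom positions have already been converted to Cartesian coordinates in Å and it ignores symmetry-equivalent and periodic images, which a real scan of a crystal structure would have to generate first. The function name `find_bumps` and the 0.7 Å threshold are assumptions made for this example and are not taken from the COD software.

```python
from itertools import combinations
from math import dist
from typing import List, Sequence, Tuple

Atom = Tuple[str, Sequence[float]]  # (label, Cartesian x, y, z in angstroms)


def find_bumps(atoms: List[Atom],
               min_distance: float = 0.7) -> List[Tuple[str, str, float]]:
    """Return every pair of atoms closer than `min_distance` angstroms.

    The threshold is an illustrative assumption; it sits just below the
    shortest common covalent bond (H-H, about 0.74 angstroms).
    """
    bumps = []
    for (label_a, pos_a), (label_b, pos_b) in combinations(atoms, 2):
        distance = dist(pos_a, pos_b)
        if distance < min_distance:
            bumps.append((label_a, label_b, distance))
    return bumps
```

The pairwise loop is quadratic in the number of atoms, which is acceptable for a single asymmetric unit; a database-wide scan would more plausibly rely on a cell-list or similar neighbor search.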
Examination of “bumps” usually reveals modeling errors, unmarked disordered sites, or redundant atoms; several non-P1 structures that had all symmetric atoms listed have been spotted and corrected while scanning the COD. Voids, on the other hand, are a sign of missing atoms or groups of atoms, wrong cell constants, or incorrectly low symmetry. Currently, new means of detecting other geometric anomalies in deposited structures, based on statistical distributions of geometric parameters, are being developed. Such checks will make it possible to identify unfinished refinement, missing atoms, and typographical errors in coordinates and cell constants.

Not all structures, however, can be successfully corrected. To inform the user and enable the recognition of such entries in automated analyses, a warning or an error flag is added to the CIF manually. Currently there are around 20 such entries in the COD.

Another type of structure in the COD unfit for normal use is the retracted one. The retraction rate, as reported by RetractionWatch, is around 500–600 retractions/year (http://retractionwatch.com/help-us-heres-some-of-what-were-working-on/), and the field of crystallography is not immune to incorrect conclusions and scientific fraud. Since, at least to the knowledge of the COD maintainers, there is no open database listing all retracted publications, the process of retraction in the COD is completely manual. Each entry coming from a retracted publication is blanked and excluded from the search so as not to bias automated analyses. However, since the history of all structures is preserved in the COD, retracted structures can be accessed if necessary.

Alongside retractions, there are a few more types of entries that are not desired in the COD but are often identified as such only after the deposition. One of them is duplicates: in order not to overcrowd the COD with repeated entries and thus bias statistical results, deposited structures are compared with the rest of the structures in the database in an attempt to locate duplicates. Currently, two structures are assumed to be duplicates if they originate from the same publication, have the same lattice cell constants and contents, are measured at the same temperature and pressure, and are not enantiomers of one another or deliberately suboptimal versions of some properly refined structure (suboptimal structures are sometimes published to support the space group or refinement parameter choices). We must note, however, that not all duplicates are marked in the COD at the moment. Therefore, new methods to locate duplicates are being devised and employed in the COD, almost always requiring the supervision of a data curator. As entries are not removed from the COD, duplicates are marked with a special flag indicating the original entry.

In 2013, depositions of the results of theoretical calculations to the COD were spotted. As a result, the policy of accepting only experimentally determined structures was reiterated, and a sister database, the TCOD, was opened to house all kinds of theoretically defined structures. Since then, more than 400 theoretical structures have been identified and marked as such in the COD. The difficulty of identifying theoretical structures from the data given in CIFs hinders automatic detection of such depositions. However, properties such as high numeric precision of cell constants and coordinates, missing standard uncertainties, and missing experimental details may be used to guide this otherwise manual task.
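The duplicate criteria listed earlier in this section translate fairly directly into code. The Python sketch below is one possible reading of them, not the COD procedure: the `Structure` record, its field names, the use of a DOI to stand for "the same publication", and both numeric tolerances are assumptions introduced for illustration. As the text notes, such a test can only propose candidates for a data curator to confirm.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Structure:
    """Hypothetical summary of one deposited entry."""
    cod_id: int
    doi: str                                  # stands for "the same publication"
    cell: Tuple[float, float, float, float, float, float]  # a, b, c, alpha, beta, gamma
    formula: str                              # cell contents
    temperature: Optional[float]              # None when not reported
    pressure: Optional[float]                 # None when not reported
    enantiomer_of: Optional[int] = None       # COD ID of the opposite enantiomer
    suboptimal: bool = False                  # deliberately suboptimal refinement


def _close(a: Optional[float], b: Optional[float], tol: float) -> bool:
    """Compare two optional measurements; two missing values count as equal."""
    if a is None or b is None:
        return a is b
    return abs(a - b) <= tol


def is_duplicate(new: Structure, old: Structure,
                 cell_tol: float = 1e-3, cond_tol: float = 0.5) -> bool:
    """Apply the duplicate criteria from the text; tolerances are assumptions."""
    same_publication = new.doi == old.doi
    same_cell = all(_close(x, y, cell_tol) for x, y in zip(new.cell, old.cell))
    same_contents = new.formula == old.formula
    same_conditions = (_close(new.temperature, old.temperature, cond_tol)
                       and _close(new.pressure, old.pressure, cond_tol))
    enantiomers = (new.enantiomer_of == old.cod_id
                   or old.enantiomer_of == new.cod_id)
    return (same_publication and same_cell and same_contents
            and same_conditions and not enantiomers
            and not new.suboptimal and not old.suboptimal)
```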
As with any other structure not fitting the scope or criteria of the COD, theoretical structures are also marked as such instead of being removed.

1.3.3.1 Version Control

Scientific data, when used, must be properly cited and available for verification of the conclusions drawn from them. The availability must be ensured both during the research, for the benefit of the scientist conducting it, and at later stages, for peer review and for replication of the conclusions reached. Curated databases, however, change over time, and databases like the COD that follow an immediate release policy can change at any time and at a high rate, comparable with the rate at which data are queried for computations. To make sure that computations done with the COD are repeatable, and that inferences drawn from them are reproducible, it is crucial that any previous state of the database can be restored. We implement this requirement by using version control on the COD data.

Currently, a Subversion server [47, 48] is used to register versions of the COD data in CIFs. Subversion is a powerful, off-the-shelf open-source software system that tracks changes in a tree of files, assigns each state of the file tree an unchangeable sequential revision number, and allows any previous revision to be restored from the repository. Although originally designed as a tool for software development, Subversion offers precisely those functions that are needed for a scientific database of medium size, such as the COD. The textual nature of CIFs makes them particularly well suited for tracking with revision control systems. Since the introduction of Subversion, all COD data curation history is available, and any state of the database can be restored. As an additional advantage, Subversion also records movement of files in the file tree and rename operations, thus providing full data provenance of each COD CIF in the version control system since its insertion into the repository. When the COD ID of a structure and a revision number are known, a unique string of bits (a digital object) describing that structure at the given revision can be retrieved.

The COD MySQL data tables are automatically produced from each current COD revision. These tables themselves are not currently versioned, i.e. the MySQL tables contain only data from the most recent revision of the COD (although a nightly dump of the COD MySQL database is inserted into the COD Subversion repository). Such an implementation was deemed satisfactory, since the primary COD data are CIFs, and MySQL tables for any revision can in principle be reconstructed from the CIFs of that particular revision. As the database grows, however, and more queries are executed on the MySQL database rather than on the CIF tree, the need arises to perform historic SQL queries quickly, without reconstructing MySQL tables for each revision. This capability is explicitly recommended in the Research Data Alliance (RDA) Recommendations for Data Citation [49, 50]. Therefore, the COD will implement the possibility to query every revision of the COD database online (historic states of MySQL tables will be restored from COD CIFs and marked with the corresponding time stamps and revision numbers) and to cite COD queries in a durable and reproducible way, making it possible to rerun each historic query both on the original data and on newer database revisions.

1.3.3.2 Data Curation Policies

Since COD record contents can change during data curation, the question arises of what rules the COD curation policy follows and what a researcher can rely upon. The current COD data policy is as follows.

A COD entry record is essentially a claim made by a data depositor that the specified authors have published certain findings about the structure described in the COD entry. To this extent, the COD data curation team makes reasonable efforts to make each COD entry represent the publication authors’ intent. To that end, data in COD entries can be enhanced during the data curation; additional data from the original publication may be added. Data values in CIFs may be corrected if a correct value is clearly specified by the authors in the original publication, and it is clear that the authors meant that value to be published (usually, such corrections also make good physical sense, making it obvious that the curated structure better describes physical reality). In cases where the intent of the authors is not so clear, or where essential data items such as coordinates of atoms or atomic symbols have to be changed, authors are first contacted to approve the changes. In all cases it must be clear that the original finding of the authors meant exactly the curated value and that it is not a new interpretation of the experiment. Data curation never involves a new structure solution from the same data, re-refinement, guessing values from common chemical knowledge, or similar investigative steps. Such processes are possible, but in that case a new COD ID must be assigned to the new structure solution, which will be treated by the COD as a new publication.

The data curation process has data uniformity and accuracy of claims as its main aims. All COD structures must use the same conventions to describe analogous situations. In most cases, the IUCr CIF standard provides adequate means for uniform description, and we curate the data records to adhere to these standards. For example, atomic coordinates must be provided either as fractions of cell vectors along the crystal axes or as Cartesian coordinates in an orthogonal frame (in which case orthogonalization matrices relating the Cartesian frame used and the crystal axes must be given). Another instance is the melting point of a crystalline material, which must be given in Kelvin. If an original publication contains these data items recorded in different ways (different coordinate systems, different units), COD data curators convert them to the common mandated format, leaving the original values in specific COD data items for reference.

Sometimes, however, there is no standard way to express certain circumstances; for example, authors are sometimes not sure of the chemical nature of the atom occupying a certain site in a crystal unit cell, and they mark such sites using different codes (such as “I1,” “M2,” and so on). The COD introduces a uniform notation: “X” for a completely unknown atom at a site and “M” for an unknown metal. In that case the original authors’ designators might be changed; the curated version (atom site “X”), however, expresses the authors’ message “unknown atom” better than the original “I” designator, since the latter can be confused with iodine in the COD context.

1.3.3.3 Quarterly Releases

The COD follows a continuous release policy: each commit to the COD database is immediately available on the Web and in the public Subversion repository. Each such commit introduces a new COD revision. The COD content is mostly updated on a daily basis, and several revisions can be generated each day. It is therefore important that COD users keep track of which revision they are using for their calculations and data searches. Since such tracking might introduce an extra burden, we provide, by popular request, quarterly releases of COD data snapshots. Four times a year the latest COD revision is exported, both CIFs and MySQL table dumps, and packed in several of the most popular data formats. The revision and time stamp of the most recent release are available at http://www.crystallography.net/cod/archives/LAST_RELEASE.txt. Each current release is available for download in the COD archive area:

• Current release:
  – http://www.crystallography.net/cod/archives/cod-cifs-mysql.tgz
  – http://www.crystallography.net/cod/archives/cod-cifs-mysql.txz
  – http://www.crystallography.net/cod/archives/cod-cifs-mysql.zip
  (The contents of all three files are identical, so only one is needed to obtain a release.)
• Historic releases: can be found in each year’s “data” directory, following URIs of the type http://www.crystallography.net/cod/archives//data/; for example, all four releases of 2015 are in http://www.crystallography.net/cod/archives/2015/data/.
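For scripted downloads, the release locations listed above can be assembled programmatically. The short Python sketch below only builds the URLs named in the text; the function names are hypothetical, and the pattern for historic releases is inferred from the 2015 example given above.

```python
ARCHIVE_ROOT = "http://www.crystallography.net/cod/archives"
LAST_RELEASE_URL = ARCHIVE_ROOT + "/LAST_RELEASE.txt"
RELEASE_EXTENSIONS = ("tgz", "txz", "zip")  # identical contents, different packing


def current_release_url(extension: str = "tgz") -> str:
    """URL of the current quarterly release in the requested archive format."""
    if extension not in RELEASE_EXTENSIONS:
        raise ValueError(f"unsupported archive format: {extension!r}")
    return f"{ARCHIVE_ROOT}/cod-cifs-mysql.{extension}"


def historic_release_dir(year: int) -> str:
    """Directory holding the releases of one year (pattern inferred from 2015)."""
    return f"{ARCHIVE_ROOT}/{year}/data/"
```

Storing the contents of `LAST_RELEASE_URL` next to any computed results is one simple way of recording which COD revision those results were derived from.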


While the use of COD releases is conceptually simple and does not require version control software or revision tracking, it must be noted that the releases become outdated quickly. Also, downloading a new release downloads all previous data anew, wasting bandwidth and time. Thus, frequent COD users should consider incremental means of updating their COD collection, such as Subversion (“svn”) or Rsync.

1.3.4 Sister Databases (PCOD, TCOD)

The growing need for COD-like databases for structures other than experimental ones has sparked the creation of two sister databases: the Predicted Crystallography Open Database (PCOD) for predicted structures and the Theoretical Crystallography Open Database (TCOD) for theoretically constructed structures.

The PCOD (http://www.crystallography.net/pcod/) was launched in December 2003 with the goal of collecting computationally predicted structures. It was expected that the number of such entries could easily exceed the number of experimentally determined ones. In January 2004, the PCOD offered 200 entries. In February 2007, the number of entries was boosted to more than 60 000 by the deposition of crystal structure predictions made using the Geometrically Restrained INorganic Structure Prediction (GRINSP) software [22]. As the COD passed a major milestone by archiving its 50 000th entry in 2008, the PCOD climbed over the 100 000 structure mark in the same year. A year later the PCOD reached one million entries, most of them generated by Zeolite Framework Solution (ZEFSA II) [51]. As a fork of the COD, the PCOD has inherited most of its features, such as stable unique data identifiers, data versioning, and Web and MySQL interfaces for searching. An automatic deposition service remains to be implemented in the PCOD.

The TCOD (http://www.crystallography.net/tcod/) was launched in May 2013, addressing the need for an open repository of theoretically computed crystal structures. As methods of computational chemistry enjoy unprecedented growth and computer power increases, a large number of atomistic simulations can be carried out, producing theoretical material structures and calculating their properties using DFT, post-HF, QM/MM, and other methods. By the end of that year, the TCOD offered around 200 entries. To ensure high quality of deposited data, the development of ontologies in the format of CIF dictionaries was initiated. In addition, a COD-like pipeline that checks each deposited structure against a set of community-specified criteria for convergence, computation quality, and reproducibility was developed and installed in the TCOD. As of the time of writing, the TCOD contains more than 2000 entries.

1.4 Use of COD

1.4.1 Data Search and Retrieval

Open-access Web resources pave the way for unprecedented applications that interconnect and reuse data hosted by many different organizations without the need for coordination between them. Key elements for such cooperation are the interfaces for data access. A commonly used architectural style for both human- and machine-usable Web interfaces is REST, according to which RESTful interfaces are built [52]; these use common HTTP requests to stable URLs for data retrieval.

1.4.1.1 Data Identification

Each entry in the COD consists of a CIF data block, listing the atomic positions of the crystal of interest, and an optional data block for diffraction data (Fobs, powder diffractograms). If an experiment results in more than one CIF data block (N data blocks), they are split across N COD entries. To provide permanent descriptors, unique identifiers (integers in the range 1 000 000 to 9 999 999) are assigned to each deposited entry upon deposition into the COD. The COD identifiers are promised to be permanent: both retracted and duplicate entries, which are detected after their deposition, are marked as such instead of being removed. COD identifiers are straightforwardly transformed into stable URIs by prefixing them with http://www.crystallography.net/cod/ and suffixing them with the file type (.html for a general overview of an entry, .cif for the CIF with atomic positions, and .hkl for the diffraction data file). For example, the files of entry 2002916 can be accessed via http://www.crystallography.net/cod/2002916.html, http://www.crystallography.net/cod/2002916.cif, and http://www.crystallography.net/cod/2002916.hkl.

1.4.1.2 Web Search Interface

Data can be searched on the Web using simple Web forms that use the COD MySQL database as a fast search index (Figure 1.3). The COD server returns the results found as a paginated HTML table (Figure 1.4). From this page, results can be downloaded in bulk as an archive. The COD currently supports ZIP archives for downloaded data. The result table can be downloaded as a comma-separated value (CSV) file, and the list of selected structures can be obtained as a text file, with either one COD number or one COD URI per line.

1.4.1.3 RESTful Interfaces

The same search interface can also be accessed programmatically using the COD RESTful API. The base URL for carrying out searches is http://www.crystallography.net/cod/result, while search terms have to be defined as HTTP GET or POST parameters. An example of such a query using the “curl” command line tool is given in Figure 1.5. The supported search terms are listed below:

• text: textual search; for example, text=caffeine
• id: search by COD identifier; for example, id=3000000
• el1, el2, … , el8: search for elements in composition; for example, el1=Ba&el2=O4
• nel1, nel2, … , nel8: exclude entries with given elements; for example, nel1=Os
• vmin, vmax: minimum and maximum volume of the cell, in Å³; for example, vmin=10&vmax=20
• minZ, maxZ: minimum and maximum Z value
• minZprime, maxZprime: minimum and maximum value of Z′
• spacegroup: search by space group
• journal, year, volume, issue, doi: search by terms in bibliography

Figure 1.3 COD search Web interface form.

Figure 1.4 COD search result page, obtained as of 05 November 2016 from the query shown in Figure 1.3.

Figure 1.5 Example of the COD programmatic search interface.
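As a complement to the “curl” example of Figure 1.5, the search terms listed above can be combined into a GET request URL with the Python standard library. The sketch below only builds the URL and performs no network access; the function name `search_url` and its `fmt` argument are hypothetical, while the base URL, the term names, and the “format” parameter (described under Output Formats) come from the text.

```python
from typing import Optional
from urllib.parse import urlencode

COD_RESULT_URL = "http://www.crystallography.net/cod/result"

# Search terms documented in the list above.
ALLOWED_TERMS = (
    {"text", "id", "vmin", "vmax", "minZ", "maxZ", "minZprime", "maxZprime",
     "spacegroup", "journal", "year", "volume", "issue", "doi"}
    | {f"el{i}" for i in range(1, 9)}
    | {f"nel{i}" for i in range(1, 9)}
)


def search_url(fmt: Optional[str] = None, **terms: object) -> str:
    """Build a COD search URL from documented search terms."""
    unknown = set(terms) - ALLOWED_TERMS
    if unknown:
        raise ValueError(f"unsupported search terms: {sorted(unknown)}")
    params = {name: str(value) for name, value in terms.items()}
    if fmt is not None:
        params["format"] = fmt  # e.g. "csv", "json", "lst", "urls", "count"
    return COD_RESULT_URL + "?" + urlencode(params)
```

For example, `search_url(el1="Ba", el2="O4", fmt="count")` yields a URL that asks for the number of matching entries; the resulting string can be passed to `urllib.request.urlopen` or to any other HTTP client.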

By default, the result of the structure request is returned in the CIF format; however, additional output formats can be requested.

1.4.1.4 Output Formats

A combination of search parameters results in logical conjunction (AND operation). The output format can also be controlled using the HTTP GET or POST parameter “format,” with one of the following values: “html,” “csv,” “zip,” and “json.” In addition, the “lst” value can be used to get the list of COD identifiers, “urls” to get the list of COD URLs, and “count” to get the number of entries matching the search query. The default format currently used for the “result” query is “html,” which returns a paginated HTML table. Since a request for the search result with no search terms selects all COD entries, this URI can also be used for browsing the COD database by COD ID. Other browsing pages (currently by journal or by publication date; the full list is available at http://www.crystallography.net/cod/browse.html) are also implemented using the “result” requests.

1.4.1.5 Accessing COD Records

As presented in Section 1.4.1.1, each entry in the COD is identified by a unique seven-digit number. The COD presents the following URLs for access to the entry-related data:

• Coordinates: http://www.crystallography.net/cod/XXXXXXX.cif
• Diffraction data: http://www.crystallography.net/cod/XXXXXXX.hkl
• Metadata in RDF: http://www.crystallography.net/cod/XXXXXXX.rdf

Here, the XXXXXXX placeholder should be replaced by a single COD identifier. An example of a query made using these identifiers from the Unix-style command line is shown in Figure 1.6.

Figure 1.6 Retrieving a specific COD structure using the stable COD URI identifier.

Depositions to the database in the form of CIFs are also available using the RESTful interface. Currently, registration of a depositor account at the COD is required beforehand. The URL of the RESTful deposition interface is http://www.crystallography.net/cod/cgi-bin/cif-deposit.pl. All parameters, along with a CIF, must be provided via HTTP POST:

• username: depositor’s username
• password: depositor’s password
• user_email: depositor’s e-mail address
• cif: contents of the to-be-deposited CIF
• hkl: contents of the to-be-deposited diffraction data file (optional)
• deposition_type: type of deposition, either “published,” “prepublication,” or “personal”

1.4.1.6 MySQL Interface

The Web-based interfaces are readily available, can be accessed using standard software such as a Web browser or a URL downloader, and do not require any sophisticated programming. Their capabilities are naturally limited, since we cannot expose a full data query language such as SQL at the moment. To alleviate this limitation, the COD exposes a read-only version of the COD MySQL database for queries. When accessed as the “cod_reader” user, this database grants the SELECT privilege to that user without asking for a password, to enable full use of the SQL query language. A special “sql.crystallography.net” host is dedicated to such queries. An example of such a query using the Linux “mysql” command line client is illustrated in Figure 1.7.

The structure of the “data” view can be queried using standard SQL commands (Figure 1.8). A human-readable and machine-verifiable description of the semantics of each “data” column is currently provided as an XML file (http://www.crystallography.net/cod/xml/documents/database-description/database-description.xml).

Figure 1.7 Querying the COD MySQL database.

Figure 1.8 Finding column definitions of the COD “data” view.
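A script can assemble the “mysql” invocation described above and then turn the returned COD identifiers into the stable URIs of Section 1.4.1.1. The Python sketch below is a hedged illustration, not a reproduction of Figures 1.7 and 1.8: the host and user names come from the text, whereas the database name “cod” and all function names are assumptions. The options `-N` (no column names), `-B` (batch, tab-separated output), and `-e` (execute a statement) are standard options of the “mysql” client.

```python
from typing import List

COD_SQL_HOST = "sql.crystallography.net"
COD_SQL_USER = "cod_reader"
COD_SQL_DATABASE = "cod"  # assumption: the database name is not given in the text


def mysql_command(sql: str, database: str = COD_SQL_DATABASE) -> List[str]:
    """Argument list for a read-only query that prints tab-separated rows."""
    return ["mysql", "-h", COD_SQL_HOST, "-u", COD_SQL_USER,
            "-N", "-B", "-e", sql, database]


def ids_from_tsv(output: str) -> List[int]:
    """Read COD IDs from the first column of tab-separated query output."""
    return [int(line.split("\t")[0])
            for line in output.splitlines() if line.strip()]


def cod_uri(cod_id: int, extension: str = "cif") -> str:
    """Stable URI of a COD entry, as defined in Section 1.4.1.1."""
    if not 1_000_000 <= cod_id <= 9_999_999:
        raise ValueError(f"not a COD identifier: {cod_id}")
    if extension not in ("html", "cif", "hkl"):
        raise ValueError(f"unsupported file type: {extension!r}")
    return f"http://www.crystallography.net/cod/{cod_id}.{extension}"
```

The list returned by `mysql_command("describe data")` can be handed to `subprocess.run(..., capture_output=True, text=True)`; when many CIFs are then downloaded through `cod_uri`, pausing between requests keeps the load on the public servers low.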

When querying data using SQL, the user has access to the raw SQL tables and is therefore responsible for filtering the data to get the desired results. In particular, the COD “data” table may contain structures that are flagged as retracted (“status = ‘retracted’” in SQL “where” clauses) or as containing errors. These structures are most probably not desired, unless we investigate the sociology of structural science rather than the structures themselves. In addition, the COD “data” table contains a small number of marked duplicates and some structures that were computed by theoretical methods and thus do not represent experimental results (such structures are systematically collected in the TCOD). These records should most probably also be excluded from searches when investigations of crystal structures are carried out. This can be done using the SQL query provided in Figure 1.9. This query method is recommended for most material structure searches in the COD and in its sister databases. Queries performed via the REST interface already apply such filtering, as indicated by the result count in both examples of Figure 1.9.

Currently, the COD MySQL tables do not contain atomic coordinate data. A common strategy to get coordinates from SQL queries is to get the list of COD IDs and then convert them either to COD CIF URIs or to local file names that can be retrieved. An example of both strategies (assuming that the local COD CIF tree is checked out in the directory ~/struct/cod/cif) is presented in Figures 1.10 and 1.11. Fetching coordinates from a copy on a local file system is of course much faster but requires preparation and maintenance of an up-to-date COD copy. In Section 1.4.1.8, we describe how to build such a COD copy.

Figure 1.9 Filtering out structures from the COD MySQL queries.

Figure 1.10 A COD CIF data retrieval after a MySQL query using COD URIs. The requested structures are experimental structures of silicon solved after the year 2000. The “-NB” option provides a plain tab-separated value (TSV) list, which is suitable for Unix pipe processing. Please note the “sleep 1” command inserted after each download, which delays the queries and spares the public COD servers from overload.

Figure 1.11 Preparing coordinates for an SQL query using a locally installed COD copy.

1.4.1.7 Alternative Implementations of COD Search on the Web

Since the COD is openly accessible on the Web and all data are free for download, anyone can implement an alternative Web-based search engine for the COD, and indeed such sites have been implemented already. The oldest is probably the http://nanocrystallography.research.pdx.edu/ Web page, which uses a subset of the COD for teaching purposes. COD database access is provided by the STFC Chemical Database Service at Sci-Tech Daresbury (https://cds.dl.ac.uk/) on their page (https://cds.dl.ac.uk/cgi-bin/news/disp?COD). Other chemist-oriented search tools existing at the moment are the MolView online molecular viewer by Herman Bergwerf (http://molview.org/) and the DataWarrior stand-alone Java program by Thomas Sander (http://www.openmolecules.org/), to mention just two mature open-source projects. Other similar endeavors exist on the Web as well. In addition, Web-based abstractors of chemical information such as PubChem [53] and ChemSpider [54] now provide links to some of the COD structures, and we expect the number of such links to grow in the future. In this way, various types of information resources can be seamlessly integrated on the Web, providing instant access to multiple facets of object description.

When implementing an alternative COD interface, all implementers are encouraged to use the latest revision of the COD, either by regularly updating their local copies using one of the methods described in this chapter or by querying the online COD servers. If a subset of the COD data is deliberately selected, this should be indicated so that the users of the resource are not confused. If such preconditions are met, additional independent services will provide more possibilities for end users of scientific data and thus allow them to use the full potential of open databases, something that is completely impossible with closed archives of data.

1.4.1.8 Installing a Local Copy of the COD

Since the COD is an open-access database, every user may install a local copy of it, a practice that is in fact encouraged. The first method to obtain a full copy of the COD is to use a Subversion client to check out a working copy of the COD files. The COD Subversion repository is world-readable and can be accessed using the Subversion protocol at svn://www.crystallography.net/cod/, with the CIF collection alone available as a subtree at svn://www.crystallography.net/cod/cif. A command to check out the COD working copy on a Linux operating system is provided in Figure 1.12; for other platforms, alternative SVN clients can be used (for example, TortoiseSVN (www.tortoisesvn.net/) for Windows). Alternatively, GIT with the GIT SVN plug-in (readily available in the software repositories of most popular Linux distributions) can currently be used to fetch data from the COD Subversion repository. The corresponding cloning commands are provided in Figure 1.13. Compared with other methods of obtaining COD data, access via Subversion has the advantage of easier retrieval of recent changes. Once cloned, a local copy (called a “working copy” in Subversion parlance) can be updated on a regular basis, fetching only the changes: modifications, additions, and deletions. In addition, the “svn log” (or “git log” if the GIT client is used) command provides the full history of data additions and changes, with all metadata (dates, committers, changed files) and with human-readable log messages. Thus maintaining a Subversion working copy is arguably the best method to have the most up-to-date local mirror of COD data.

Figure 1.12 Obtaining (checking-out) a working copy of the COD data using the command line “svn” Subversion client.

Figure 1.13 Cloning COD data directory with GIT and GIT SVN.

If the full history of the COD changes is not needed and the use of Subversion clients is undesired, an incremental update of the local COD copy can be performed using the “rsync” tool [55]. The COD file collection is presented to “rsync” users as the rsync modules “hkl,” “cif,” or “cod-cif” (for the COD data), “pcod-cif” (for PCOD data), and “tcod-cif” (for TCOD data). The commands to synchronize a local tree with the COD database are provided in Figure 1.14. The provided “rsync” commands ensure that the local COD file tree becomes exactly the same as the one on the COD server, including the deletion (option “--delete”) of removed files. Users may want to use additional options, such as “--backup” and “--backup-dir,” to preserve copies of the removed files if such references are needed. The “rsync” method provides a lean and fast way to synchronize two directories. However, the COD file change history is not available when using this method. Moreover, while “svn” updates are atomic, i.e. they always transfer a complete latest revision even if new commits are taking place simultaneously, the “rsync” protocol has no knowledge of the Subversion repository transactions and cannot ensure that a complete revision is transferred. If an update of the COD happens during the “rsync” process, some transferred files may come from the newer revision, while the others will be from the older one.
To guard against this, running two or more “rsync” commands in a row is recommended, so that the last command does not fetch any new updates.

To install the COD MySQL database, one has to obtain dumps of the COD MySQL tables and load them into a database on a MySQL server. Dumps can be either checked out from the Subversion repository (commands shown in Figure 1.15) or downloaded and extracted from the COD quarterly releases (commands shown in Figure 1.16). One should note, however, that the first method is the more efficient, since the latter requires downloading the whole archive of a quarterly release (3–4 GB as of 2016, depending on compression).

Figure 1.14 Using the “rsync” program to download and update the COD file collection.
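The command figures are not reproduced in this text, so the following is only a minimal sketch of the three mirroring routes described above, written as Python helpers that assemble the command lines without running them. The Subversion URL is the one quoted in this section and the rsync module name is one of those listed there; the helper names, the local directory names, and the rsync host (assumed here to be the same host that serves Subversion) are illustrative assumptions and not the contents of Figures 1.12 to 1.14.

```python
"""Assemble (but do not run) commands for mirroring the COD CIF tree."""
import shlex

# URL quoted in the text for the CIF-only subtree of the repository.
COD_SVN_CIF = "svn://www.crystallography.net/cod/cif"
# Assumption: rsync is served from the same host as Subversion.
COD_RSYNC_HOST = "www.crystallography.net"


def svn_checkout_cmd(dest="cod-cif"):
    """First-time checkout of a Subversion working copy."""
    return ["svn", "checkout", COD_SVN_CIF, dest]


def svn_update_cmd(dest="cod-cif"):
    """Fetch only the changes made since the last checkout or update."""
    return ["svn", "update", dest]


def git_svn_clone_cmd(dest="cod-cif"):
    """Clone the same subtree with GIT and the GIT SVN plug-in."""
    return ["git", "svn", "clone", COD_SVN_CIF, dest]


def rsync_cmd(module="cif", dest="cif/", backup_dir=None):
    """Mirror an rsync module; --delete removes files dropped upstream."""
    cmd = ["rsync", "-av", "--delete"]
    if backup_dir is not None:
        # Keep copies of files that would otherwise be deleted or replaced.
        cmd += ["--backup", f"--backup-dir={backup_dir}"]
    cmd += [f"rsync://{COD_RSYNC_HOST}/{module}/", dest]
    return cmd


if __name__ == "__main__":
    for cmd in (svn_checkout_cmd(), svn_update_cmd(),
                git_svn_clone_cmd(), rsync_cmd(backup_dir="removed")):
        print(shlex.join(cmd))
```

Passing one of these lists to `subprocess.run(cmd, check=True)` would execute it. Following the recommendation above, the rsync command would be run at least twice in a row, so that the final pass transfers nothing new.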

Figure 1.15 Checking out the COD MySQL dumps from the Subversion repository.


Figure 1.16 Extracting the COD MySQL dumps from a ZIP archive of quarterly release.

Downloaded table schemata (*.sql) and tab-separated value lists (*.txt) then have to be loaded into an empty MySQL database. The script “cod-load-mysqldump.sh,” which is included in the MySQL dumps, creates the COD “data” table, provided that the user has root access to an empty database “cod” on a local machine. The same script can be used to update an already existing MySQL database. However, the script should be used with care, as it empties the “data” table before loading the data, so all local changes made to the table between subsequent updates will be lost.

1.4.1.9 File System-Based Queries

When all COD files are available on a local disk, another kind of COD query becomes possible, namely, querying the COD CIFs directly using the standard Unix file processing utilities. While such queries are as a rule slower than database queries (although fast disks and large RAM caches can speed them up considerably), they are more flexible and do not require building a local SQL database or connecting online to an existing one. Being ASCII-encoded text files, CIFs can be searched using the Unix “grep” and other tools. The query in Figure 1.17 will find all CIFs that contain the word “diamond,” regardless of case. The first command will print out all lines that contain this word, and the second command will list the names of all files that contain it (note that “diamond” in this case can be the name of a mineral, the name of a program, or something else).

Another powerful method to query and possibly process COD files is the use of the “find” and “xargs” Unix tools or of the “make” tool to organize computations. The use of these methods is beyond the scope of the present chapter, but it should be noted that all of them permit running arbitrary programs, written in any programming language, on any subset of COD CIFs. When using home-written programs for CIF processing, one must take into account that CIF is a structured, free-text format described by a formal syntax [42] and thus requires a correct parser to extract data properly (simple tools like “awk” or Perl’s “split()” function are not sufficient). Fortunately, numerous libraries for proper CIF parsing exist: the COD employs an error-correcting parser from the “cod-tools” package [43] that has C, Perl, and Python bindings; other parsers have been proposed by various authors [56–58]. For quick composition of different processing tools, however, one can employ simple command line utilities to extract values.
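As an illustration of the case-insensitive text search described above, here is a minimal Python sketch that plays the role of the two “grep” invocations: one function lists the files that mention a word, and the other lists the matching lines of a single file. It is not a reproduction of the commands in Figure 1.17; the function names and the assumption that the local copy consists of “*.cif” files under one directory tree are illustrative. Like “grep,” it searches raw text and does not parse CIF syntax.

```python
"""Case-insensitive text search over a local tree of CIF files."""
import os


def find_cifs_containing(root, word):
    """Return sorted paths of *.cif files under root that mention word."""
    needle = word.lower()
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".cif"):
                continue
            path = os.path.join(dirpath, name)
            # CIFs are expected to be ASCII; replace stray bytes, do not fail.
            with open(path, encoding="ascii", errors="replace") as handle:
                if any(needle in line.lower() for line in handle):
                    hits.append(path)
    return sorted(hits)


def matching_lines(path, word):
    """Return (line_number, line) pairs of one file that mention word."""
    needle = word.lower()
    with open(path, encoding="ascii", errors="replace") as handle:
        return [(number, line.rstrip("\n"))
                for number, line in enumerate(handle, start=1)
                if needle in line.lower()]
```

A call such as `find_cifs_containing("cif", "diamond")` would then list every file in a hypothetical local “cif” directory that mentions the word, whether as a mineral name, a program name, or anything else.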
The “cod-tools” package [43] contains the utility “cifvalues,” which is written entirely in C and permits fast extraction of requested CIF values and their printout in a space-separated-value form that is then easily processed by “awk,” “perl,” and “R,” by most spreadsheet programs, and by a multitude of other tools. The example in Figure 1.18 shows how to use “cifvalues,” in conjunction with the aforementioned “find” and “xargs” programs, to extract molecular weight, unit cell volume, and melting point data from the COD collection.

Figure 1.17 Search COD CIFs using “grep.” Options of this command are supported by the GNU “grep” utility on the Ubuntu 12.04 operating system or higher.

Figure 1.18 Use of “find,” “xargs” and “cifvalues” from the “cod-tools” package to extract requested data from CIFs.

1.4.1.10 Programmatic Use of COD CIFs

The proper usage of any resource requires mutual understanding between the resource provider and the resource consumer. Since the COD is a completely open database, there are no legal restrictions on the use of data; however, one should be aware of certain COD policies to ensure the optimal utilization of the COD and the validity of the desired results. The COD promises to retain stable structure identifiers, to document any changes introduced by the COD maintainers, and to provide the means of recognizing structures unfit for conventional use. Reciprocally, the user of the COD is expected to make use of these provisions and to apply critical thinking when examining the results; the data set is neither perfect nor complete yet, but voluntary collaboration is the driving force behind projects rooted in openness. Consequently, reporting any observed errors and depositing new structures to the COD are strongly encouraged. Finally, whether one is planning to use the COD for viewing individual structures, to process the whole data set using intricate programs, or to get more involved in the project, knowledge of the basic COD conventions is paramount.

Since definitions of structure classes, such as organic compounds and minerals, are often under debate, there is no programmatic classification of structures in the COD. Nevertheless, the user can narrow the search by selecting structures by chemical composition or symmetry and then remove the false positives according to his or her needs. CIFs describing natural minerals can be detected by checking for the presence of the “_chemical_name_mineral” CIF data item. However, the COD relies on the depositor to add this data item; thus it cannot guarantee that all mineral structures in the database are marked as such.

As described in Section 1.3.3, CIFs of entries with issues are marked with special data items so that both human users and programs can recognize them. The main data item to look at is “_cod_error_flag,” which indicates entries with warnings (enumeration value “warnings”) and errors (enumeration value “errors”). Furthermore, the same data item with the value “retracted” indicates structures retracted by the authors.
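The screening rules above can be sketched in a few lines of Python. The sketch assumes that each entry has already been read by a proper CIF parser into a dictionary mapping data names to values; that dictionary layout, the function names, and the choice to treat entries with warnings as usable by default are illustrative assumptions, not COD policy.

```python
"""Screen COD entries by the marker data items described in the text."""

# Enumeration values of _cod_error_flag named in the text.
PROBLEM_FLAGS = ("retracted", "errors", "warnings")


def classify_entry(items):
    """Return "retracted", "errors", "warnings", or "ok" for one entry.

    `items` is assumed to map CIF data names to already parsed values.
    """
    flag = items.get("_cod_error_flag")
    return flag if flag in PROBLEM_FLAGS else "ok"


def is_marked_mineral(items):
    """True if the depositor supplied a mineral name for the entry."""
    return "_chemical_name_mineral" in items


def select_entries(entries, allow_warnings=True, minerals_only=False):
    """Return sorted identifiers of entries that pass the chosen screens."""
    accepted = {"ok", "warnings"} if allow_warnings else {"ok"}
    return sorted(
        entry_id
        for entry_id, items in entries.items()
        if classify_entry(items) in accepted
        and (is_marked_mineral(items) or not minerals_only)
    )
```

Because the mineral marker is supplied by depositors, `minerals_only=True` can miss mineral structures that were deposited without it; the filter is reliable only in the positive direction.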


At the time of writing there are around 1100 entries without coordinates in the COD (excluding retracted structures). Most of these entries were created as references to otherwise inaccessible published crystal structures, such as those from pre-CIF or paywalled publications. Although the practice of creating such entries is not common and their number is small, all of them can be filtered out according to the following principle: such entries have an “_atom_site_” CIF loop with a mock atom site, all of whose parameters (label and coordinates) are equal to the value “unknown,” denoted as a lone question mark (“?”, ASCII character 63 decimal).

Automatically identifying the chemical types of the atoms in a CIF is a more complex task than it may seem at first glance. Even though the core CIF dictionary describes a way of specifying the chemical species of the observed atoms, it is often ignored or misused. The recommended practice is to use the “_atom_site_type_symbol” data item, which is designated for just this purpose. Alternatively, the chemical type symbols can be prepended to the “_atom_site_label” data item values; for example, following this naming scheme, the “C11,” “Au,” and “Pb*” labels would be used to specify carbon, gold, and lead (Pb) atoms, respectively. The latter approach seems to be preferred in practice; however, it introduces a lot of ambiguity. First of all, it is not clear whether the user meant to use the labels for this purpose or simply forgot to include the “_atom_site_type_symbol” data item. In addition, some ambiguity also arises when trying to extract the chemical symbol from the label. Usually, it is sufficient to take the first one or two letters of the atom label as its chemical symbol (“/^([A-Za-z]{1,2})/” in regular expression form); however, this approach fails when labels are constructed following some additional arbitrary rules.
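The label heuristic and the placeholder-site rule can be made concrete with a short Python sketch. The regular expression is the one quoted above; the function names, the capitalisation step, and the treatment of “?” and “.” as “no type symbol given” are illustrative assumptions, and the values are assumed to come from a proper CIF parser as plain strings.

```python
"""Naive element guessing from atom labels, plus a placeholder-site check."""
import re

# The pattern quoted in the text: the first one or two letters of the label.
LABEL_PREFIX = re.compile(r"^([A-Za-z]{1,2})")


def naive_symbol(label):
    """Return the leading letters of a label, capitalised like an element."""
    match = LABEL_PREFIX.match(label)
    if match is None:
        return None
    return match.group(1).capitalize()


def chemical_type(type_symbol, label):
    """Prefer an explicit _atom_site_type_symbol; fall back to the label."""
    if type_symbol not in (None, "", "?", "."):
        return type_symbol
    return naive_symbol(label)


def is_placeholder_site(label, x, y, z):
    """True for the mock atom site whose label and coordinates are all "?"."""
    return all(value == "?" for value in (label, x, y, z))
```

With these definitions, `naive_symbol("C11")` gives `"C"` and `naive_symbol("Pb*")` gives `"Pb"`, but `naive_symbol("HOH")` gives `"Ho"` and `naive_symbol("Wat")` gives `"Wa"`, which is not an element symbol at all. This is why an explicit type symbol is preferred whenever one is present.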
For example, “HO” and “HOH,” often used to indicate hydroxide and water molecules, respectively, would be recognized as holmium (Ho); other labels often used for water molecules (“Wat,” “W,” and “Ow”) demonstrate the flaws of this simplistic approach even further. The maintainers of the COD have adopted a practice of manually putting chemical types into the “_atom_site_type_symbol” data item values, if previously empty, thus removing any ambiguity. This, however, is not yet done automatically, as it often requires manual double-checking.

The currently widespread approach of splitting same-site atoms into separate “_atom_site_” loop entries often results in the misinterpretation of sites that are mixtures of two or more different chemical types. For example, the grunerite structure in the COD entry 9000000 contains four iron–magnesium sites, which can only be identified as such by comparing their coordinates. We have adopted a practice of marking atoms in such sites as alternative using CIF’s “_atom_site_disorder_…” data items in order to present downstream applications with semantically connected “_atom_site_…” entries. However, instead of transforming all COD CIFs, we apply this practice on the fly, as implemented in the command line tool “cif_mark_disorder” from the “cod-tools” package [43].

It is a well-known fact in crystallography that low-resolution experiments extract very little to no information on the positions of hydrogen atoms in the structures. There is a wide spectrum of methods for hydrogen position treatment, from restraints to geometric prediction. Of course, sometimes hydrogen atoms are completely excluded from crystal structures, especially if their positions are of little interest in the research. It is important, though, to detect such cases for computational analyses in order to avoid misinterpretations. For a known number of hydrogen atoms attached to a known site, the CIF standard defines the data item “_atom_site_attached_hydrogens.” However, there is no recommended notation for a known number of hydrogen atoms whose sites of attachment are unknown. We have made a decision to “attach” them to a “fake” atom with unknown coordinates (all equal to the special CIF value “unknown,” denoted as a lone question mark [“?,” ASCII character 63 decimal]).

1.4.2 Data Deposition

An automatic deposition interface was opened in 2010, allowing the scientific community to participate directly in the expansion of the COD data collection. The whole process of inserting new data, which had been described in detail earlier [20], was automated and embedded into a set of Web pages (accessible at http://www.crystallography.net/cod/deposit) to guide all interested researchers through the deposition of their data in CIF format. Acknowledging the concern about the preservation of original research data, the COD accepts diffraction data files (in CIF format) as well as atomic coordinates, in line with the publication standards of the IUCr (http://www.iucr.org/home/leading-article/2011/201106-02#letter). The COD accepts three types of depositions:

• Data that were published before the deposition and have a full bibliographic record. Such depositions are accepted from anyone registered at the COD Web site and are immediately put into the public domain.
• Prepublication structures are accepted from the authors of future publications. In contrast to published material, such structures are not released until the corresponding publication is issued or the hold period expires, although details such as lattice constants, symmetry, summary chemical formula, substance name, and the list of authors are made public under persistent COD identifiers that are retained after the release. Coordinates and diffraction data are thus kept confidential within the COD, and we assume that such depositions preserve the originality of the submitted work and that publications of such structures are eligible as original research. Depositors may extend the hold period up to 18 months, after which they are contacted via e-mail and asked either to indicate the publication, to make the records public as personal communications (in case the publication does not happen), or, as a last resort, to withdraw them from the COD.
• Structures are also accepted as personal communications to the COD. Such structures are assumed to be published at the COD by their authors personally and are immediately put into the public domain.

Prior to the automatic deposition interface, all data were collected, corrected, and placed in the COD by its maintainers. Since 2010 all depositions have been directed to the new interface, thus saving many man-hours of effort.

1.5 Applications

1.5.1 Material Identification

The most obvious application arises once a crystallographer has determined the cell parameters of a supposedly new phase. These cell parameters and the corresponding cell volume can then be used in a simple search of the COD, so as to avoid wasting time if the crystal structure has already been published. Full confidence in the result of such a search will have to wait until the COD attains completeness.

Crystal structure databases have long been used to identify phases in polycrystalline materials. Subsets of databases designed for specific user applications (e.g. inorganics, organics, metals) have been developed and sold separately. Databases containing only diffraction peak positions have also been constructed from structure databases. In both cases (crystal structures or peak lists), the usual commercial search–match software works only by comparing peak positions from the database with those of the sample to be identified. Consequently, only the structures stored in the database at hand, e.g. organics, can be identified, while the other phases (inorganics, metal–organics, etc.) are ignored unless the user can afford all databases and the corresponding software.

The COD offers an approach that resolves these drawbacks of classical databases. Since the COD records all structures independently of their “classification” as inorganics or other classes, the search–match results extend to a wider range of materials (selection by elements, bonds, or even phase class can obviously be introduced if necessary). This allows a more ab initio phase identification, whatever the material concerned. Additionally, the open character of the COD allows any user to benefit from this aspect using his or her own software. One such application, called Full-Pattern Search–Match (FPSM), has recently been developed; it allows COD-based identification, quantification, and microstructural characterization in an automated way through the Internet [90].
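The simple cell-based search mentioned at the start of this section can be illustrated with a minimal Python sketch. The volume expression is the standard one for a general (triclinic) cell; the function names, the tolerances, and the idea of holding candidate cells in a dictionary are illustrative assumptions and do not describe the COD search interface. The sketch compares cells exactly as given and makes no attempt to handle alternative settings of the same lattice.

```python
"""Compare a measured unit cell with stored cells (sketch only)."""
import math


def cell_volume(a, b, c, alpha, beta, gamma):
    """Volume of a general cell; lengths in angstroms, angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(t)) for t in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)


def similar_cells(query, candidates, length_tol=0.01, angle_tol=0.5):
    """Return sorted ids of candidate cells close to the query cell.

    query and each candidate are (a, b, c, alpha, beta, gamma) tuples;
    length_tol is relative, angle_tol is absolute (degrees).
    """
    query_volume = cell_volume(*query)
    hits = []
    for entry_id, cell in candidates.items():
        lengths_ok = all(abs(q - x) <= length_tol * q
                         for q, x in zip(query[:3], cell[:3]))
        angles_ok = all(abs(q - x) <= angle_tol
                        for q, x in zip(query[3:], cell[3:]))
        # Volume acts as a cheap extra check on the six parameters together.
        volume_ok = math.isclose(cell_volume(*cell), query_volume,
                                 rel_tol=3 * length_tol)
        if lengths_ok and angles_ok and volume_ok:
            hits.append(entry_id)
    return sorted(hits)
```

An empty result from such a comparison only suggests that the phase is new, since, as noted above, the collection is not yet complete.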
The COD and its sister databases are free for everybody, even companies, to download and use. This transfer of value from the academic to the industrial and technological worlds was rapidly noticed by the companies that build X-ray diffractometers. Crystal Impact was the first company to incorporate the COD into its search–match software, in the 2000s, rapidly followed by Panalytical (Highscore+ software), Bruker (Eva), and Rigaku (PDXL). More recently an employee at the 3D Systems Corp. used it for the 3D printing of crystallographic models, and Kagaku Benran incorporated the COD in his Crystallography Handbook.

1.5.2 Applications for the Mining Industry

The COD has proved very useful for mineral identification in practical mining applications. In the SOLSA (Sonic Drilling coupled with Automated Mineralogy and chemistry On-Line-On-Mine-Real-Time) project (http://www.solsa-mining.eu/; https://ec.europa.eu/easme/en/printpdf/7079), which started in 2016, the COD is used as an essential data provider for identifying minerals during the characterization of drill cores. The COD is also planned as a vehicle for the subsequent data dissemination, storing the results of crystallographic investigations of drill cores. All properties of the COD are essential here: the open-access regime permits efficient distribution of and fast access to data, the well-established CIF framework provides a sound foundation for describing measurement results, and the RESTful interface enables easy integration. The COD codebase has also been reused to launch the Raman Open Database (ROD) in order to properly store Raman spectroscopy measurements as well as to interlink them with the crystal structures in the COD [59]. It is anticipated that other results of the SOLSA project will be made openly available to the community after the project is completed.

1.5.3 Extracting Chemical Information

Many of the potential users of the COD are chemists, so they will be more interested in the chemical features of a crystallized compound than in the purely crystallographic facts. For organic and metal–organic chemists, the chemical features of a compound are mostly defined by the statement of which atoms are directly bonded to each other: this is the so-called “chemical connectivity” or “molecular structure.” Hence, a chemist is more likely to be interested in particular associations of atoms (functional groups, coordination environments) than in unit cell parameters or space groups. However, the molecular structure is not usually stated explicitly in the CIFs uploaded to the COD, and it needs to be deduced from the atom coordinates and/or the bond list (if present). This chemical connectivity should be written in a format suitable for chemically defining the compound and for performing searches. Among the many available possibilities, we have chosen the SMILES format for this purpose (there are two specifications for this format, the original one elaborated by Daylight Chemical Information Systems [60] (http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html) and an open specification established afterward (http://opensmiles.org); both are essentially identical). This format represents a chemical species by a single string of ASCII characters. It has the advantages of storing only the molecular structure and nothing else, which makes it very compact, and of being both human and machine writable/readable, which is convenient for both automatic and manual editing. With some practice, it is possible to directly “see” the molecular structure (in simple cases) or at least important features of it (in more complicated ones) by just reading the SMILES, and there are several software tools able to depict the molecular structure for a given SMILES (for example, indigo-depict: http://lifescience.opensource.epam.com/indigo/).

The SMILES format, however, also has important drawbacks: it was designed with valence bond theory in mind (as the very concept of “chemical connectivity” somehow implies valence bond theory), and hence it has problems representing species that are not well explained by this theory, such as those with delocalized bonds (other than aromatic rings) or polycentric bonds (metallocenes, boranes, etc.). Another drawback is that it can only represent discrete species and not polymeric ones, for which only a fragment may be represented.


Deriving the molecular connectivity as a SMILES string from the corresponding CIF is, however, far from a trivial task. We are using the Open Babel toolbox [61] (http://openbabel.org), which, in principle, is able to perform the CIF to SMILES conversion, but the result is not optimal in many cases. To begin with, Open Babel reads the atoms as they are in the input file; it neither performs any symmetry generation nor considers the occupancy factors, and hence it does not properly handle chemical species placed on symmetry elements of the crystal, nor does it consider possible disorder. To circumvent these problems, algorithms and corresponding software have been developed by the COD maintainers [62]. But even if we have a set of atoms chemically representing our compound, there are still important problems to face regarding the choice of the best representation for any particular chemical species, since Open Babel, in many cases, does not yield a SMILES displaying the schematic image that most chemists will have of it (an image that, after all, is just a convention). Most problems arise from the fact that Open Babel has been designed from the point of view of an organic chemist and in the realm of valence bond theory, trying to force every atom to have its usual valence. The number of bonds that an atom can form is also limited, which very often makes it necessary to supplement the bonds found by Open Babel with those provided by the authors in the _geom_bond_distance_ loop. For these reasons, the crude SMILES obtained usually represent organic compounds accurately (these are easily recognizable by the absence of square brackets) and may be accepted without further treatment, but this is not the case for metal–organic compounds, for which one very frequently finds missing bonds, spurious or missing H atoms, wrong bond representations, etc. The list of compound families showing these kinds of problems is quite large.
At present, the curation of such SMILES is done mostly by human intervention with the aid of a number of helper scripts that identify and, in some cases, automatically solve the problems associated with some of the more frequently found families of compounds. It is noteworthy that human intervention in this task has not been eliminated even by the proprietary or unreleased software and the not well-disclosed algorithms that are used by commercial databases [63]. For these reasons, the number of entries with SMILES that have been considered acceptable is, at present, just about one-third of the total number of COD entries. The procedure needs to be improved in order to accelerate the conversion and diminish the need for human intervention.

Establishing the chemical identity of COD entries is quite useful for cross-linking the COD with other chemical databases. In this sense, the available SMILES have already been used to set around 35 000 links between the COD and the open chemical database ChemSpider (http://www.chemspider.com) [54], and it is expected that the same can be done for other important open databases such as PubChem.

The SMILES built in this way are also used to perform substructure searches, in which the user of the database tries to find all compounds containing a given molecular fragment. This is surely the main kind of search that an organic or metal–organic chemist is interested in, since such molecular fragments are the main way of defining families of compounds. The COD website implements such searches by allowing the user to enter the fragment, also in SMILES format, and then using the Open Babel fast search utility to get the hits. For the benefit of users who are not familiar with the SMILES format, the query may also be built on the COD website using graphical interfaces written in the JavaScript language [64] (http://www.molinspiration.com/jme/). The whole SMILES collection is also downloadable as a single file (http://www.crystallography.net/cod/smi/allcod.smi) so that users can perform the search locally with any software of their own choice. An interesting possibility is to use the Open Babel package without a fast search index: this procedure is much slower than the above-mentioned fast search (it takes several minutes, which makes it difficult to implement in the Web interface), but it yields more accurate results, and the query can be written in the SMARTS language (https://github.com/timvdm/OpenSMARTS; http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html), which allows for more versatile and sophisticated searches than SMILES.

1.5.4 Property Search

Modern methods of computational chemistry can greatly reduce the effort required in fields such as materials science. In silico experiments can predict various properties of materials quite accurately without the need for time- and cost-intensive synthesis and experimentation. For example, knowledge of crystal contents and densities is sufficient to carry out a search for possible hydrogen storage materials, as demonstrated by Breternitz and Gregory in their research using the COD [65]. Another group of researchers has screened crystal structures in both the COD and the ICSD for periodic layered compounds in order to identify novel graphene-like compounds [66].

1.5.5 Geometry Statistics

In order to simplify and encourage similar research on the basis of the COD, we are developing a database of the geometry of the COD structures. Our main goals are to collect bond lengths, valence angles, and dihedral angles and to describe them in the form of statistical models. To achieve that, we have devised a novel descriptor of the chemical environment, that is, a “name” that allows grouping geometric parameters measured in similar compounds [67, 68]. We have chosen a “fuzzy” descriptor as a balance between overly strict matching, which would yield huge numbers of classes with small numbers of observations, and short-sightedness. However, there are cases when geometry parameters from chemically different environments fall into the same class, thus yielding multimodal or skewed distributions. In order to accommodate such irregularities, we have chosen mixture models of Gaussian and Cauchy distributions. We have developed fully automatic software capable of extracting the aforementioned geometry parameters from crystal structure descriptions without the need for human supervision. With this software we have extracted geometric parameters from more than 300 000 small-molecule entries from the COD to date. To ease browsing of the collected geometry data set and the models describing it, we have launched a Web interface. Currently, browsing is implemented using atom descriptors as mentioned earlier.

One of the possible uses of our geometry database is the detection of common geometric features. The semiautomated search for artifacts and outliers in crystal structures is another possible use. Furthermore, statistical distributions derived from our database could be used to generate force fields for modeling, as well as constraints or restraints for the refinement of crystal structures. This particular approach is being used to compile a dictionary of constraints for macromolecular structure refinement using the REFMAC5 refinement program [69, 70].

1.5.6 High-Throughput Computations

Successful use of the results of high-throughput in silico research is somewhat hindered by the problem of reproducibility. A key to this problem is to preserve provenance for all steps leading from the inputs to the results [71]. To aid the field of atomistic simulations, Pizzi et al. have developed the AiiDA framework [72], based on the Open Provenance Model [73, 74]. AiiDA can automate the execution of computations and automatically store inputs and results in a tailored database, while keeping track of data provenance and helping to share the results. In order to ease importing and exporting data to and from AiiDA, it was interfaced with the COD and the TCOD. The current pipeline allows seamless importing of experimental data from the COD into AiiDA for further atomistic simulations, while preserving all metadata required for unambiguous identification of the inputs, and exporting of the results to the TCOD, bundled together with all metadata required for reproducibility.

1.5.7 Applications in College Education and Complementing Outreach Activities

Crystallographic open-access databases have been built from 2004 onward for educational purposes at Portland State University. The focus of these activities has always been interactive visualizations of crystal structures with educational relevance. The well-known Java-based Jmol plug-in for Web browsers (now replaced by the more secure JavaScript version known as JSmol) by Bob Hanson and his team at St. Olaf College in Minnesota, United States, has been adopted for this purpose [75]. In recent years, we augmented our educational activities with 3D-printed crystallographic models [76, 77]. The key to these activities was a Windows executable program by Werner Kaminsky [78] that converts *.cif files directly into *.stl or *.wrl files, as required for the 3D printing process. Note that there are also Windows executable programs by Werner Kaminsky that create 3D print files for crystal morphology models [78] and longitudinal representation surfaces for anisotropic crystal physical properties [78]. While the CIF dictionaries contain provisions to encode crystal morphologies in *.cif files directly so that they can be read into Werner Kaminsky’s program [79], the developers of the Material Properties Open Database [10] needed to write their own modified CIF extension dictionary. 3D print files can also be created directly at the website of the Material Properties Open Database [80]. Selected 3D print files and CIF-encoded crystal morphologies are available for download at the above-mentioned educational project of Portland State University.

1.6 Perspectives

1.6.1 Historic Structures

As of August 2016, most of the structures in the COD were published in the “CIF era” (1990s onward); older structures contribute only 8% (27 000 entries). However, the number of published pre-CIF structures is assumed to be much larger, and much effort is needed to digitize them and deposit them as CIFs. We have produced a few dozen such entries manually, but the laborious nature of this task keeps the conversion slow. The collection of historic structures could be sped up by harnessing crowdsourcing for the detection of coordinate tables in scanned publications, optical character recognition, and evaluation of geometry as a means of error detection.

1.6.2 Theoretical Data in (T)COD

Over the last 25 years, the CIF format has become the standard for reporting and archiving the results of experimental crystal structure determinations. It was adopted by the crystallographic journals as well as the structural databases. New CIF dictionaries are being developed to define ontologies in such fields as macromolecular crystallography [6], powder diffraction [81], and electron density studies [82]. However, much effort is still needed to consolidate the knowledge in the field of theoretical materials science, which is currently expanding rapidly. So far there are only a few disjoint attempts, namely, those of the European Theoretical Spectroscopy Facility (ETSF) [83, 84] and NoMaD [85]. Addressing this issue, the TCOD has been launched, adopting the practice of using the CIF format and approach-specific dictionaries (for example, the cif_dft dictionary for DFT) and defining data validation criteria for automated checks. In addition, the TCOD puts emphasis on the provenance of the results and on reproducibility by devising a special dictionary for related metadata, cif_tcod [86]. The TCOD, accompanied by the large collection of experimental structures in the COD [21], opens an immediate potential for the cross-validation of experimental and theoretical data.

1.6.3 Conclusion

The 16 years of COD development demonstrate that it is possible to build a fully open-access, high-quality database in a well-defined area of scientific inquiry, namely, crystallography. Throughout its history the COD has been online most of the time, except for a very few short technical glitches. Its volume has grown constantly over time, and it enjoys an increasing number of citations as well. Although it does not yet cover every published structure, the COD is suitable for many applications and impossible to substitute when openness is an essential requirement. We see a large potential for open data in the new, connected world, with many self-evident as well as unanticipated uses of scientific results for the benefit of everyone, and we will continue to develop and support the COD into the future [87–89].

Acknowledgments

We acknowledge financial support from the Research Council of Lithuania (grant numbers MIP-124/2010 and MIP-025/2013), the European Community (SOLSA, 2016–2020, grant agreement no. 689868), and the Conseil Régional de Normandie (COMBIX project, 2013–2014, Chair of Excellence of LL). We thank Dr. Peter Murray-Rust for providing information about CrystalEye.

References

1 Authors of Wikipedia (2016). Hipparchus. https://en.wikipedia.org/wiki/Hipparchus (accessed 16 October 2016).
2 Annis, J., Bakken, J., Holmgren, D., et al. (1999). The Sloan Digital Sky Survey data acquisition system, and early results. Real Time Conference, 1999. Santa Fe 1999. 11th IEEE NPSS 14–18 June 1999. IEEE. DOI: https://doi.org/10.1109/RTCON.1999.842551.
3 Hewett J. (2006). LHC factoids. http://blogs.discovermagazine.com/cosmicvariance/2006/09/27/lhc-factoids/ (accessed 16 October 2016).
4 PPARC (2006). ‘Maiden Flight’ for LHC computing grid breaks gigabyte-per-second barrier. http://phys.org/news/2006-02-maiden-flightlhc-grid-gigabyte-per-second.html (accessed 16 October 2016).
5 Hahn, T. (ed.) (2006). International Tables for Crystallography. Vol. A: Space-group Symmetry. Dordrecht, The Netherlands: Published for the International Union of Crystallography by Springer https://doi.org/10.1107/97809553602060000100.
6 Fitzgerald, P.M.D., Westbrook, J.D., Bourne, P.E. et al. (2006). Macromolecular dictionary (mmCIF). In: International Tables for Crystallography (ed. S.R. Hall and B. McMahon). International Union of Crystallography https://doi.org/10.1107/97809553602060000745.
7 Hall, S.R., Allen, F.H., and Brown, I.D. (1991). The crystallographic information file (CIF): a new standard archive file for crystallography. Acta Crystallogr. Sect. A 47: 655–685. https://doi.org/10.1107/S010876739101067X.
8 Bernstein, H.J., Bollinger, J.C., Brown, I.D. et al. (2016). Specification of the Crystallographic Information File format, version 2.0. J. Appl. Crystallogr. 49 (1): 277–284. https://doi.org/10.1107/s1600576715021871.


9 Faber, J. and Fawcett, T. (2002). The Powder Diffraction File: present and future. Acta Crystallogr. Sect. B 58 (3 Part 1): 325–332. https://doi.org/10.1107/S0108768102003312.
10 Pepponi, G., Gražulis, S., and Chateigner, D. (2012). MPOD: A Material Property Open Database linked to structural information. Nucl. Instrum. Methods Phys. Res., Sect. B 284: 10–14. https://doi.org/10.1016/j.nimb.2011.08.070.
11 Lafuente, B., Downs, R.T., Yang, H., and Stone, N. (2015). The power of databases: the RRUFF project. In: Highlights in Mineralogical Crystallography (ed. T. Armbruster and R.M. Danisi), 1–29. W. De Gruyter.
12 Downs, R.T. and Hall-Wallace, M. (2003). The American Mineralogist crystal structure database. Am. Mineral. 88: 247–250.
13 Rajan, H., Uchida, H., Bryan, D. et al. (2006). Building the American Mineralogist crystal structure database: a recipe for construction of a small Internet database. In: Geoinformatics: Data to Knowledge (ed. A. Sinha). Geological Society of America https://doi.org/10.1130/2006.2397(06).
14 Baerlocher, C., McCusker, L., and Olson, D. (2007). Atlas of Zeolite Framework Types, 6th revised edition. Amsterdam – London – New York – Oxford – Paris – Shannon – Tokyo: Elsevier.
15 Aroyo, M.I., Perez-Mato, J.M., Orobengoa, D. et al. (2011). Crystallography online: Bilbao Crystallographic Server. Bulg. Chem. Commun. 43 (2): 183–197.
16 Aroyo, M.I., Perez-Mato, J.M., Capillas, C. et al. (2006). Bilbao Crystallographic Server: I. Databases and crystallographic computing programs. Z. Kristallogr. – Cryst. Mater. 221 (1): 15–27. https://doi.org/10.1524/zkri.2006.221.1.15.
17 Perez-Mato, J., Gallego, S., Tasci, E. et al. (2015). Symmetry-based computational tools for magnetic crystallography. Ann. Rev. Mater. Res. 45 (1): 217–248. https://doi.org/10.1146/annurev-matsci-070214-021008.
18 Berman, H.M., Olson, W.K., Beveridge, D.L. et al. (1992). The nucleic acid database: a comprehensive relational database of three-dimensional structures of nucleic acids. Biophys. J. 63: 751–759. https://doi.org/10.1016/S0006-3495(92)81649-1.
19 Coimbatore Narayanan, B., Westbrook, J., Ghosh, S. et al. (2014). The nucleic acid database: new features and capabilities. Nucleic Acids Res. 42: D114–D122. https://doi.org/10.1093/nar/gkt980.
20 Gražulis, S., Chateigner, D., Downs, R.T. et al. (2009). Crystallography Open Database – an open-access collection of crystal structures. J. Appl. Crystallogr. 42: 726–729. https://doi.org/10.1107/S0021889809016690.
21 Gražulis, S., Daškevič, A., Merkys, A. et al. (2012). Crystallography Open Database (COD): an open-access collection of crystal structures and platform for world-wide collaboration. Nucleic Acids Res. 40: D420–D427. https://doi.org/10.1093/nar/gkr900.
22 Le Bail, A. (2005). Inorganic structure prediction with GRINSP. J. Appl. Crystallogr. 38: 389–395. https://doi.org/10.1107/S0021889805002384.
23 Chateigner, D., Gražulis, S., Pérez, O., et al. (2015). COD, PCOD, TCOD, MPOD … open structure and property databases. http://www.ecole.ensicaen.fr/~chateign/danielc/abstracts/Chateigner_abstract_JNCO2013.pdf (accessed 19 April 2019).


24 Groom, C.R., Bruno, I.J., Lightfoot, M.P., and Ward, S.C. (2016). The Cambridge Structural Database. Acta Crystallogr. Sect. B 72 (2): 171–179. https://doi.org/10.1107/S2052520616003954.
25 Belsky, A., Hellenbrandt, M., Karen, V.L., and Luksch, P. (2002). New developments in the Inorganic Crystal Structure Database (ICSD): accessibility in support of materials research and design. Acta Crystallogr. Sect. B 58: 364–369. https://doi.org/10.1107/S0108768102006948.
26 White, P.S., Rodgers, J.R., and Le Page, Y. (2002). CRYSTMET: a database of the structures and powder patterns of metals and intermetallics. Acta Crystallogr. Sect. B 58: 343–348. https://doi.org/10.1107/S0108768102002902.
27 Villars, P., Onodera, N., and Iwata, S. (1998). The Linus Pauling file (LPF) and its application to materials design. J. Alloys Compd. 279: 1–7. https://doi.org/10.1016/S0925-8388(98)00605-7.
28 Villars, P., Berndt, M., Brandenburg, K. et al. (2004). The Pauling File, Binaries Edition. J. Alloys Compd. 367 (1–2): 293–297. https://doi.org/10.1016/j.jallcom.2003.08.058.
29 Protein Data Bank (1971). Protein Data Bank. Nat. New Biol. 233: 22–23. https://doi.org/10.1038/newbio233223b0.
30 Berman, H., Kleywegt, G., Nakamura, H., and Markley, J. (2012). The Protein Data Bank at 40: reflecting on the past to prepare for the future. Structure 20: 391–396. https://doi.org/10.1016/j.str.2012.01.010.
31 Gilliland, G.L., Tung, M., and Ladner, J.E. (2002). The Biological Macromolecule Crystallization Database: crystallization procedures and strategies. Acta Crystallogr. Sect. D 58 (6 Part 1): 916–920. https://doi.org/10.1107/S0907444902006686.
32 Baldi, P. (2011). Data-driven high-throughput prediction of the 3-D structure of small molecules: review and progress. A response to the letter by the Cambridge Crystallographic Data Centre. J. Chem. Inf. Model. 51: 3029. https://doi.org/10.1021/ci200460z.
33 Sadowski, P. and Baldi, P. (2013). Small-molecule 3D structure prediction using open crystallography data. J. Chem. Inf. Model. 53: 3127–3130. https://doi.org/10.1021/ci4005282.
34 Bruno, I. and Groom, C. (2014). A crystallographic perspective on sharing data and knowledge. J. Comput. Aided Mol. Des. 28 (10): 1015–1022. https://doi.org/10.1007/s10822-014-9780-9.
35 Eger, T., Scheufen, M. and Meierrieks D. (2013). The determinants of Open Access Publishing: survey evidence from Germany. http://ssrn.com/abstract=2232675 (accessed 19 April 2019).
36 Eysenbach, G. (2006). Citation advantage of open access articles. PLoS Biol. 4 (5): e157. https://doi.org/10.1371/journal.pbio.0040157.
37 Harnad, S. and Brody, T. (2004). Comparing the impact of Open Access (OA) vs. Non-OA articles in the same journals. D-Lib Magaz. 10 (6): https://doi.org/10.1045/june2004-harnad.
38 Harnad, S., Brody, T., Vallières, F. et al. (2008). The access/impact problem and the green and gold roads to open access: an update. Serials Rev. 34 (1): 36–40. https://doi.org/10.1080/00987913.2008.10765150.


39 Zucker, L.G., Darby, M.R., Furner, J., et al. (2006). Minerva unbound: knowledge stocks, knowledge flows and new knowledge production. NBER Working Paper Series. http://www.nber.org/papers/w12669 (accessed 19 April 2019).
40 Piwowar, H.A. and Vision, T.J. (2013). Data reuse and the open data citation advantage. PeerJ 1: e175. https://doi.org/10.7717/peerj.175.
41 Galperin, M.Y. and Cochrane, G.R. (2010). The 2011 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Res. 39 (Database): D1–D6. https://doi.org/10.1093/nar/gkq1243.
42 IUCr (2016). CIF version 1.1 working specification. http://www.iucr.org/resources/cif/spec/version1.1 (accessed 06 November 2016, 14:55 EET).
43 Merkys, A., Vaitkus, A., Butkus, J. et al. (2016). COD::CIF::Parser: an error-correcting CIF parser for the Perl language. J. Appl. Crystallogr. 49 (1): 292–301. https://doi.org/10.1107/S1600576715022396.
44 Berman, H.M., Westbrook, J., Feng, Z. et al. (2000). The Protein Data Bank. Nucleic Acids Res. 28: 235–242. https://doi.org/10.1093/nar/28.1.235.
45 Day, N., Downing, J., Adams, S. et al. (2012). CrystalEye: automated aggregation, semantification and dissemination of the world’s open crystallographic data. J. Appl. Crystallogr. 45: 316–323. https://doi.org/10.1107/S0021889812006462.
46 Dalle O. (2012). On reproducibility and traceability of simulations. Proceedings of the Winter Simulation Conference. Winter Simulation Conference, 9–12 December 2012. IEEE.
47 Collins-Sussman, B., Fitzpatrick, B.W., and Pilato, C.M. (2004). Version Control with Subversion: Next Generation Open Source Version Control. O’Reilly Media.
48 Collins-Sussman B., Fitzpatrick B.W. and Pilato C.M. (2011). Version control with subversion. http://svnbook.red-bean.com/ (accessed 19 April 2019).
49 Rauber, A., Asmi, A., van Uytvanck, D., and Pröll, S. (2015). Data Citation of Evolving Data: Recommendations of the Working Group on Data Citation (WGDC). RDA https://rdalliance.org/system/files/documents/RDA-DCRecommendations_151020.pdf (accessed 19 April 2019).
50 Rauber, A., Asmi, A., van Uytvanck, D. and Pröll, S. (2016). Identification of reproducible subsets for data citation, sharing and re-use. https://www.ieeetcdl.org/Bulletin/v12n1/papers/IEEE-TCDL-DC-2016_paper_1.pdf (accessed 19 April 2019).
51 Falcioni, M. and Deem, M.W. (1999). A biased Monte Carlo scheme for zeolite structure solution. J. Chem. Phys. 110 (3): 1754–1766. https://doi.org/10.1063/1.477812.
52 Fielding, R.T. (2000). Architectural styles and the design of network-based software architectures. University of California, Irvine. https://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm (accessed 19 April 2019).
53 Bolton, E.E., Wang, Y., Thiessen, P.A., and Bryant, S.H. (2008). Chapter 12 PubChem: integrated platform of small molecules and biological activities. In: Annual Reports in Computational Chemistry (ed. R.A. Wheeler and D.C. Spellmeyer). Oxford, UK: Elsevier https://doi.org/10.1016/S1574-1400(08)00012-1.


54 Pence, H.E. and Williams, A. (2010). ChemSpider: an online chemical information resource. Chem. Educ. Today 87: 1123–1124. https://doi.org/10.1021/ed100697w.
55 Davison W. (2015). Rsync. http://samba.anu.edu.au/rsync/ (accessed 06 November 2016, 13:42 EET).
56 Gildea, R.J., Bourhis, L.J., Dolomanov, O.V. et al. (2011). iotbx.cif: a comprehensive CIF toolbox. J. Appl. Crystallogr. 44: 1259–1263. https://doi.org/10.1107/S0021889811041161.
57 Hester, J.R. (2006). A validating CIF parser: PyCIFRW. J. Appl. Crystallogr. 39: 621–625. https://doi.org/10.1107/S0021889806015627.
58 Todorov, G. and Bernstein, H.J. (2008). VCIF2: extended CIF validation software. J. Appl. Crystallogr. 41: 808–810. https://doi.org/10.1107/S002188980801385X.
59 El Mendili, Y., Vaitkus, A., Merkys, A. et al. (2019). Raman Open Database: first interconnected Raman-XRD open-access resource for material identification. J. Appl. Crystallogr. 52: 618–625. https://doi.org/10.1107/S1600576719004229.
60 Funatsu, K., Miyabayashi, N., and Sasaki, S. (1988). Further development of structure generation in the automated structure elucidation system CHEMICS. J. Chem. Inf. Model. 28 (1): 18–28. https://doi.org/10.1021/ci00057a003.
61 O’Boyle, N.M., Banck, M., James, C.A. et al. (2011). Open Babel: an open chemical toolbox. J. Cheminf. 3: 33. https://doi.org/10.1186/1758-2946-3-33.
62 Gražulis, S., Merkys, A., Vaitkus, A., and Okulič-Kazarinas, M. (2015). Computing stoichiometric molecular composition from crystal structures. J. Appl. Crystallogr. 48: 85–91. https://doi.org/10.1107/S1600576714025904.
63 Bruno, I.J., Shields, G.P., and Taylor, R. (2011). Deducing chemical structure from crystallographically determined atomic coordinates. Acta Crystallogr. Sect. B Struct. Sci. 67 (4): 333–349. https://doi.org/10.1107/s0108768111024608.
64 Bienfait, B. and Ertl, P. (2013). JSME: a free molecule editor in JavaScript. J. Cheminf. 5: 24. https://doi.org/10.1186/1758-2946-5-24.
65 Breternitz, J. and Gregory, D. (2015). The search for hydrogen stores on a large scale; a straightforward and automated open database analysis as a first sweep for candidate materials. Crystals 5: 617–633. https://doi.org/10.3390/cryst5040617.
66 Mounet, N., Gibertini, M., Schwaller, P., et al. (2016). High-throughput prediction of two-dimensional materials. https://doi.org/10.1038/s41565-017-0035-5.
67 Long, F., Nicholls, R.A., Emsley, P., et al. (2016). ACEDRG: a stereo-chemical description generator for ligands. https://doi.org/10.1107/s2059798317000067.
68 Long, F., Nicholls, R.A., Emsley, P., et al. (2016). Validation and extraction of stereochemical information from small molecular databases. https://doi.org/10.1107/s2059798317000079.
69 Long, F., Gražulis, S., Merkys, A., and Murshudov, G.N. (2014). A new generation of CCP4 monomer library based on Crystallography Open Database. Acta Crystallogr. Sect. A 70: C338.


70 Vagin, A.A., Steiner, R.A., Lebedev, A.A. et al. (2004). REFMAC5 dictionary: organization of prior chemical knowledge and guidelines for its use. Acta Crystallogr. Sect. D 60 (12): 2184–2195. https://doi.org/10.1107/S0907444904023510.
71 Mesirov, J.P. (2010). Computer science. Accessible reproducible research. Science (New York, NY) 327: 415–416. https://doi.org/10.1126/science.1179653.
72 Pizzi, G., Cepellotti, A., Sabatini, R. et al. (2016). AiiDA: automated interactive infrastructure and database for computational science. Comput. Mater. Sci. 111: 218–230. https://doi.org/10.1016/j.commatsci.2015.09.013.
73 Moreau, L., Freire, J., Futrelle, J. et al. (2008). The open provenance model: an overview. In: Provenance and Annotation of Data and Processes (ed. J. Freire, D. Koop and L. Moreau). Berlin, Heidelberg: Springer https://doi.org/10.1007/978-3-540-89965-5_31.
74 Moreau, L., Freire, J., Futrelle, J. et al. (2007). The Open Provenance Model. University of Southampton http://eprints.soton.ac.uk/264979/ (accessed 19 April 2019).
75 Moeck, P., Čertík, O., Upreti, G. et al. (2005). Crystal structure visualizations in three dimensions with database support. MRS Proc. 909E: 3.5.1–3.5.6. https://doi.org/10.1557/PROC-0909-PP03-05.
76 Moeck, P., Stone-Sundberg, J., Snyder, T.J., and Kaminsky, W. (2014). Enlivening a 300 level general education class on nanoscience and nanotechnology with 3D printed crystallographic models. J. Mater. Educ. 36: 77–96.
77 Stone-Sundberg, J., Kaminsky, W., Snyder, T., and Moeck, P. (2015). 3D printed models of small and large molecules, structures and morphologies of crystals, as well as of their anisotropic physical properties. Cryst. Res. Technol. 1–11. https://doi.org/10.1002/crat.201400469.
78 Kaminsky, W., Snyder, T., Stone-Sundberg, J., and Moeck, P. (2015). 3D printing of representation surfaces from tensor data of KH2PO4 and low-quartz utilizing the WinTensor software. Z. Kristallogr. 230: 651–656. https://doi.org/10.1515/zkri-2014-1826.
79 Kaminsky, W. (2007). From CIF to virtual morphology: new aspects of predicting crystal shapes as part of the WinXMorph program. J. Appl. Crystallogr. 40: 382–385. https://doi.org/10.1107/S0021889807003986.
80 Fuentes-Cobas, L., Chateigner, D., Pepponi, G. et al. (2014). Implementing graphic outputs for the Material Properties Open Database (MPOD). Acta Cryst. 70: C1039. https://doi.org/10.1107/S2053273314089608.
81 Toby, B.H., Von Dreele, R.B., and Larson, A.C. (2003). CIF applications. XIV. Reporting of Rietveld results using pdCIF: GSAS2CIF. J. Appl. Crystallogr. 36: 1290–1294.
82 Mallinson, P.R. and Brown, I.D. (2006). Classification and use of electron density data. In: International Tables for Crystallography, vol. G (ed. S.R. Hall and B. McMahon). International Union of Crystallography. https://doi.org/10.1107/97809553602060000107.
83 Caliste, D., Pouillon, Y., Verstraete, M. et al. (2008). Sharing electronic structure and crystallographic data with ETSFIO. Comput. Phys. Commun. 179: 748–758. https://doi.org/10.1016/j.cpc.2008.05.007.


84 Gonze X., Almbladh C.-O., Cucca A., et al. (2008). Specification of file formats for ETSF Specification version 3.3. Second revision for this version (SpecFF ETSF3.3). European Theoretical Spectroscopy Facility. http://www.etsf.eu/system/files/SpecFFETSF_v3.3.pdf (accessed 19 April 2019).
85 Mohamed F.R. (2016). Nomad meta info. https://gitlab.rzg.mpg.de/nomad-lab/nomad-meta-info/wikis/home (accessed 18 February 2016).
86 Gražulis S. (2016). TCOD mailing list. http://lists.crystallography.net/cgi-bin/mailman/listinfo/tcod (accessed 13 April 2016).
87 Gražulis, S., Sarjeant, A.A., Moeck, P. et al. (2015). Crystallographic education in the 21st century. J. Appl. Crystallogr. 48 (6): 1964–1975. https://doi.org/10.1107/S1600576715016830.
88 Kaminsky, W., Snyder, T., Stone-Sundberg, J., and Moeck, P. (2014). One-click preparation of 3D print files (*.stl, *.wrl) from *.cif (crystallographic information framework) data using Cif2VRML. Powder Diffr. 29: S42–S47. https://doi.org/10.1017/S0885715614001092.
89 Moeck, P., Kaminsky, W., Fuentes-Cobas, L. et al. (2016). 3D printed models of materials tensor representations and the crystal morphology of alpha quartz. Symmetry: Cult. Sci. 27: 319–330.
90 Lutterotti, L., Pillière, H., Fontugne, C., Boullay, P., and Chateigner, D. (2019). Full-profile search–match by the Rietveld method. J. Appl. Crystallogr. 52: 587–598. https://doi.org/10.1107/S160057671900342X.


2 The Inorganic Crystal Structure Database (ICSD): A Tool for Materials Sciences

Stephan Rühl

FIZ Karlsruhe – Leibniz Institute for Information Infrastructure, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen, Germany

Materials Informatics: Methods, Tools and Applications, First Edition. Edited by Olexandr Isayev, Alexander Tropsha, and Stefano Curtarolo. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

2.1 Introduction

The Inorganic Crystal Structure Database (ICSD) [1] is a comprehensive numerical database containing fully determined crystal structures of inorganic and metallic compounds. Reliable, high-quality crystal structure data play an important part in optimizing the development of new materials that foster innovation in various areas. Especially in materials sciences, crystallographic data can be used to explain and predict material properties. The traditional approach in materials research of first synthesizing new compounds and then checking their properties is time consuming and expensive. Nowadays the reverse approach of computer-aided materials design is increasingly applied. This is possible due to many advances in fields related to modern materials sciences [2]. In particular, the combination of increasing computing power and new developments in numerical techniques allows for calculations that a decade ago were at best hard to achieve. The now common practice of predicting crystal structures is an important technique in computational materials design [3]. Depending on the specific approach used to predict an unknown crystal structure, the information from known crystal structures can be used at different stages, from data-mining parameters for optimization methods to the verification of calculated structures by comparing theoretical with experimental results. Factual databases, especially crystallographic databases, deliver the information needed for this step. Most crystal structure databases contain much more valuable information than the obvious unit cells and atomic coordinates or the bond lengths and angles that are derived from them. For example, the concept of structure types implemented in ICSD [4] can be used to find similar structures by comparing certain basic features, like the space group or the ANX formula. For a database to be of any use in materials research, it has to cover several essential aspects [5].

The first aspect is the comparability of data. For crystallographic data this is partly inherited from the principles of crystallography itself and further enforced by the application of standardization tools to the published crystal structure. Even for the exchange of crystallographic information, a generally accepted format is defined (Crystallographic Information File [CIF]) [6]. The second essential aspect is the completeness of the information provided. A statistical interpretation based only on a small subset will likely produce unreliable results. The third and most decisive factor for a database is the quality of the data. Hence, carefully checking and evaluating new information is fundamental. Section 2.2 will focus on a more detailed explanation of the most important data stored in ICSD. In addition, we will show some examples of how to use ICSD data in various fields of research.
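To illustrate why a common exchange format supports comparability, the following minimal sketch reads a few single-line data items from CIF text and converts a cell length to a number. It is a deliberately limited illustration rather than a CIF parser: loops, multi-line text fields, and comments are ignored, the function names are hypothetical, and the example values are typed by hand for illustration and are not taken from ICSD. For real work a full parser such as PyCIFRW should be used.

```python
def read_simple_items(cif_text):
    """Collect single-line '_tag value' pairs; loop headers and data rows are skipped."""
    items = {}
    for raw in cif_text.splitlines():
        line = raw.strip()
        if not line.startswith("_"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:  # a tag followed by its value on the same line
            tag, value = parts
            # CIF tags are case-insensitive, so store them in lower case.
            items[tag.lower()] = value.strip().strip("'\"")
    return items


def numeric(value):
    """Convert a CIF number such as '5.6402(2)' to float, dropping the uncertainty."""
    return float(value.split("(", 1)[0])


# Illustrative content only (hand-typed, rock-salt-like cell).
EXAMPLE_CIF = """\
data_example
_chemical_formula_sum 'Cl Na'
_cell_length_a 5.6402(2)
_cell_length_b 5.6402(2)
_cell_length_c 5.6402(2)
_cell_angle_alpha 90
_cell_angle_beta 90
_cell_angle_gamma 90
_symmetry_space_group_name_H-M 'F m -3 m'
loop_
_atom_site_label
_atom_site_fract_x
Na1 0
Cl1 0.5
"""

if __name__ == "__main__":
    items = read_simple_items(EXAMPLE_CIF)
    a = numeric(items["_cell_length_a"])
    print(items["_symmetry_space_group_name_h-m"], a)
```

Because every file names its items with the same tags, two records from different sources can be compared item by item without any knowledge of the program that wrote them.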

2.2 Content of ICSD

The ICSD, produced by FIZ Karlsruhe – Leibniz Institute for Information Infrastructure, contains records of inorganic crystal structures dating back to 1913. It currently contains information on structures

• that have no C–C and C–H bonds at the same time (unless the “organic” part is just a small moiety)
• whose atomic coordinates have been fully determined or were derived from a corresponding structure type.

Over the years the scope of the ICSD has gradually changed to the current definition, and even this definition is no longer strict. Structures with small organic residues may be found in both databases – ICSD and Cambridge Structural Database (CSD) – as long as the inorganic part is significant. In general, ICSD covers elements, metals, alloys, minerals, and other classic inorganic compounds (Figure 2.1).

Figure 2.1 Overview of ICSD content by composition of the compounds. (Bar chart of the number of records in ICSD for elements and for binary, ternary, quaternary, quintenary, and >5 types of compounds, subdivided into metals/alloys, oxides, and other.)


Each record in ICSD corresponds to a structure determination reported in the literature or sent to us as a private communication. For common compounds there may be several records that differ only slightly. The oldest structure in ICSD is sodium chloride, published by W.L. Bragg in 1913 [7]. The structure with the largest volume, rivaling even small protein structures, is currently Al12827.56Cu1244.05Ta9063 with a volume of more than 360 000 Å³ [8]. With 22 different elements in one structure, the mineral Johnsenite is the most complex structure in ICSD [9].

In addition to the full description of the published structure, the database contains additional information, such as the Wyckoff sequence, Pearson symbol, ANX formula, mineral name and group, structure type, and many more items that can be useful in data mining. The concept of assigning structure types was introduced into ICSD in 2005 and has been expanded in recent years. More than 80% of the records are assigned to one of currently 9141 structure types. A new structure type is only included in ICSD if at least two compounds can be assigned to it. The two defining properties for crystal structures to be considered as belonging to one structure type are that the structures are isopointal and isoconfigurational. These rather unwieldy properties are broken down into easily checkable properties like the ANX formula, Pearson symbol, Wyckoff sequence, c/a ratio, etc. A complete description of this procedure is beyond the scope of this review and can be found in the article by Allmann and Hinek [4].

Of special importance are remarks or comments that are generated when a new structure is entered. These remarks or comments can explain or at least highlight possible inconsistencies in the structure or describe actions taken during input to solve observed problems.

Various visualization options for the 3D crystal structure (wireframe, ball-and-stick, space-filling, etc.) can be easily selected in the interactive JavaScript framework JSmol [10] that is integrated into ICSD (Figure 2.2). Some frequently used features can be selected directly from the ICSD interface, while the internal JSmol menu offers many more options. The powder pattern of each structure can be calculated according to a wide spectrum of user-defined settings. This feature cannot be used for automatic qualitative or quantitative analysis, but its potential to compare structures and recognize similar structures at a glance makes it a helpful addition. In particular, viewing the powder patterns (or the crystal structures) of up to six compounds at a time with the same settings in a synoptic view helps to understand the structural similarity of seemingly unrelated structure types (Figure 2.3).

The number of records in ICSD has roughly doubled in the last 10 years (Figure 2.4). As of April 2016, the ICSD contains more than 180 000 records. Apart from adding currently more than 6000 new records per year, FIZ Karlsruhe is working continually on filling in gaps in older data. Innovations like the classification of structures to structure types, calculation of standardized data, or the inclusion of author abstracts provide new search options. ICSD records provide the full structural information published, the bibliographic reference, and additional information included by its editors:

• Published information
– Chemical formula
– Mineral name and origin
– Unit cell parameters and volume
– Space group and symmetry
– Atomic coordinates with oxidation states and occupancies
– Thermal parameters
– Reliability index
– Density
– Method of measurement and experimental conditions
• Bibliographic information
– Authors
– Journal, volume, issue, page, and year of publication
– Title of paper
• Additional information
– Chemical name
– Structure type
– ANX/AB formula
– Pearson symbol
– Wyckoff sequence
– Remarks.

Figure 2.2 Visualization using JSmol in ICSD.

Figure 2.3 The synoptic view helps to reveal structural similarity.

Figure 2.4 Growth of the number of records in ICSD since 1980. (Plot of the number of records, from 0 to 200 000, versus year, from 1980 to 2015.)
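The descriptors listed under additional information lend themselves to a simple first-pass screening for records that might share a structure type. The sketch below groups records by space group number, Pearson symbol, Wyckoff sequence, and ANX formula, splits each group by c/a ratio, and keeps only groups with at least two members, echoing the rule that a structure type needs at least two compounds. It is only an illustration under stated assumptions and not the procedure used in ICSD, which is described by Allmann and Hinek [4]: the `Record` class, the tolerance value, and the hand-typed example records are hypothetical, and the isoconfigurational criterion is not captured by such a key comparison.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str
    space_group: int   # space group number
    pearson: str       # Pearson symbol
    wyckoff: str       # Wyckoff sequence
    anx: str           # ANX formula
    c_over_a: float    # axial ratio


def candidate_groups(records, ca_tolerance=0.05, min_members=2):
    """Group records with identical descriptors and similar c/a ratios."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[(rec.space_group, rec.pearson, rec.wyckoff, rec.anx)].append(rec)

    groups = []
    for key, members in buckets.items():
        members = sorted(members, key=lambda r: r.c_over_a)
        current = [members[0]]
        for rec in members[1:]:
            # Start a new group when the c/a ratio jumps by more than the tolerance.
            if rec.c_over_a - current[-1].c_over_a <= ca_tolerance:
                current.append(rec)
            else:
                groups.append((key, current))
                current = [rec]
        groups.append((key, current))
    return [(key, grp) for key, grp in groups if len(grp) >= min_members]


# Hand-typed illustrative records, not retrieved from ICSD.
RECORDS = [
    Record("NaCl", 225, "cF8", "b a", "AX", 1.0),
    Record("MgO", 225, "cF8", "b a", "AX", 1.0),
    Record("CsCl", 221, "cP2", "b a", "AX", 1.0),
]

if __name__ == "__main__":
    for key, group in candidate_groups(RECORDS):
        print(key, [rec.name for rec in group])
```

Grouping on exact descriptor keys is cheap and can be done for a whole database in one pass, so more expensive geometric comparisons can be restricted to the members of each candidate group.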

2.3 Interfaces

ICSD is available as a stand-alone version for local PC installation (ICSD Desktop), a local intranet version for small user groups, and a Web version hosted by FIZ Karlsruhe (ICSD Web). At present, ICSD Desktop and ICSD Web share the same interface (Figure 2.5). This offers many synergies in the development and inclusion of new features for all versions based on this interface. It is also planned to provide an interface to the intranet version based on this framework. The advantage for users will be that they can easily switch from one version to another without having to learn a new interface. The ICSD Web version may be used on mobile devices with a fully functional interface even without a special ICSD app.

2.4 Applications of ICSD

Crystallographic data in general are used routinely in a wide spectrum of applications, ranging from education in basic materials sciences to the prediction of unknown structures in advanced research. Applications of data from ICSD can be as simple as comparing cell parameters from a measurement with those of a known crystal structure from the database, or as sophisticated as the prediction of unknown structures. Although new compounds are nowadays routinely predicted, using factual databases like ICSD still offers many more options to improve the prediction. Schön [11] proposes some possible prediction-related applications that benefit from using ICSD. In the following, a few selected examples of published applications of ICSD data in materials sciences are highlighted.

Figure 2.5 Basic search screen of ICSD Desktop/Web interface.

2.4.1 Prediction of Ferroelectricity

An example of using crystallographic data to predict material properties is the systematic examination of ferroelectricity in relevant point groups, started by Abrahams [12] in 1988. The general idea was to find crystallographic conditions that a crystal structure must fulfill in order to exhibit a certain property, in this case ferroelectricity. Ferroelectricity denotes a spontaneous electric polarization that can be reversed by an appropriate electric field and that shows hysteresis. It was named in analogy to the previously known ferromagnetism, which is based on spontaneous magnetization and the presence of hysteresis. Abrahams found that the principal structural requirement for a polar crystal to be considered as potentially ferroelectric is the presence in the unit cell of a maximum atomic displacement of about 1 angstrom along the polar direction from the corresponding position in which the resulting spontaneous polarization is zero. In addition, the largest atomic displacement from such a position must be significantly greater than about 0.1 angstrom or the r.m.s. amplitude of thermal displacement of that atom. Furthermore, the thermodynamic barrier to be overcome by each atom in reaching its location corresponding to zero spontaneous polarization must be less than the equivalent of an applied d.c. field that is sufficient to reverse the polarization sense but does not exceed the dielectric strength of the material … [13]. In several articles, Abrahams predicted potential ferroelectricity for hundreds of structures taken from the ICSD that had not previously been known to exhibit ferroelectricity.
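The two displacement criteria quoted above lend themselves to a simple screening function. The following is a minimal sketch, not Abrahams' actual procedure: the function name, the default thresholds of exactly 1.0 and 0.1 angstrom (the text says "about"), and the reading of "0.1 angstrom or the r.m.s. amplitude" as the larger of the two values are assumptions made for illustration. The third criterion (the thermodynamic barrier) is not covered.

```python
def is_candidate_ferroelectric(displacements, rms_amplitudes, upper=1.0, lower=0.1):
    """Screen a polar structure using the two displacement criteria.

    displacements  -- displacement of each atom along the polar axis (angstrom),
                      measured from its zero-polarization position
    rms_amplitudes -- r.m.s. thermal displacement amplitude of each atom (angstrom)
    upper, lower   -- assumed thresholds ("about 1" and "about 0.1" angstrom)
    """
    magnitudes = [abs(d) for d in displacements]
    # index of the atom with the largest displacement
    i_max = max(range(len(magnitudes)), key=magnitudes.__getitem__)
    largest = magnitudes[i_max]
    # assumption: the largest displacement must exceed both 0.1 angstrom and
    # the thermal amplitude of that same atom
    threshold = max(lower, rms_amplitudes[i_max])
    return threshold < largest <= upper


# Hypothetical displacements for three atoms, for illustration only
print(is_candidate_ferroelectric([0.02, 0.35, -0.60], [0.08, 0.09, 0.10]))
```

Such a filter can only flag candidates; whether a flagged structure is actually ferroelectric still has to be confirmed experimentally.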

2.4.2 Using the Concept of Structure Types

In his paper, Kaduk [14] highlighted some practical examples of how the information stored in databases like ICSD can be used to solve challenges in identifying unknown materials. In one example, black tar recovered from a pump seal at an alkylation unit was examined. One of the concerns was that sulfuric acid might have leaked into the pump, which was made from aluminum. The powder pattern revealed that the black tar contained the following compounds: Al4H2(SO4)7(H2O)24 [15], FeSO4(H2O), and Al2(SO4)3(H2O)17. However, a Rietveld refinement could not be performed, because the crystal structure of the first compound was not known at that time.


At that point the concept of structure types was used to find potential structures that would be structurally very similar to the unknown compound. Structure types themselves had not yet been introduced in ICSD, but the descriptors for the structure types, especially the ANX formula, were already available. The ANX formula in ICSD is generated according to the following simplified rules:
• H+ is not taken into account. Hydride anions are counted normally.
• The coordinates for all sites of all other atoms must be determined.
• All sites occupied by the same atom type are combined unless the oxidation number is different.
• For each atom type the multiplicities are multiplied by the site occupation factors (SOFs) and the products are added up. The sums are rounded and divided by the greatest common divisor.
• Cations are assigned the symbols A-M; neutral atoms, N-R; and anions, X, Y, Z, S-W.
• Symbols are sorted alphabetically and characters are reordered according to ascending indices.
• All ANX formulae with more than four cation symbols, three neutral symbols, or three anion symbols are deleted.

From the composition of the unknown substance, the ANX formula could easily be derived to be A4B7X52. A search in ICSD revealed only one compound with this ANX formula: Cr4H2(SO4)7(H2O)24 [16]. The pattern of the chromium-containing compound did not match that of the aluminum-containing compound from the black tar very well, but replacing chromium with aluminum on the chromium position allowed for a Rietveld refinement of the mixture; the two compounds are therefore in fact isostructural (Figure 2.6). With this information it could be explained that sulfuric acid must have leaked into the pump and reacted with the materials in the pump to yield the black tar.

2.4.3 Two Examples of Training Machine Learning Algorithms with ICSD Data

Another approach uses the data by feeding them to a machine learning algorithm in order to train or guide the algorithm. This approach leads to some kind of structure–classification relationship and has been pursued in various ways; in this chapter we highlight two applications using ICSD data. The first example tries to classify zeolite structures. A zeolite structure, or more precisely a framework type, has a specific signature of channels and building units, and this signature can be used to define the framework type. Traditionally, zeolites are aluminosilicates, and there are several different framework types for the known zeolite structures. The term “zeolite” is quite often used as a generalization for microporous structures. The cavities and channels in the structures can be of major interest in materials science, because they can be used as storage media for liquid or more volatile substances. Framework types are given as three-letter codes and describe the connectivity, in a topological network, of atoms that are tetrahedrally bonded to four oxygen atoms. Yang et al. introduced the zeolite-structure-predictor (ZSP) [17], which was trained with ICSD data.

Figure 2.6 Simulated powder patterns of Al4H2(SO4)7(H2O)24 (a) and Cr4H2(SO4)7(H2O)24 (b); intensity is plotted against 2θ.
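The ANX search of Section 2.4.2 that led to the chromium compound shown in Figure 2.6 can be reproduced in a few lines. The sketch below follows the simplified rules listed in that section; it is not the ICSD implementation. The function name, the input format (element, oxidation state, number of atoms, site occupation factor), and the reading of the size limits as "more than four cation, more than three neutral, or more than three anion symbols" are assumptions made for this example.

```python
from collections import defaultdict
from functools import reduce
from math import gcd

CATION_SYMBOLS = "ABCDEFGHIJKLM"  # cations: A-M
NEUTRAL_SYMBOLS = "NOPQR"         # neutral atoms: N-R
ANION_SYMBOLS = "XYZSTUVW"        # anions: X, Y, Z, S-W


def anx_formula(sites):
    """Return the ANX formula for (element, oxidation, multiplicity, sof) sites.

    Returns None if the formula would be deleted under the size limits.
    """
    totals = defaultdict(float)
    for element, oxidation, multiplicity, sof in sites:
        if element == "H" and oxidation > 0:
            continue  # H+ is ignored; hydride anions are counted normally
        # sites with the same element and oxidation number are combined
        totals[(element, oxidation)] += multiplicity * sof

    counts = {key: round(total) for key, total in totals.items() if round(total) > 0}
    if not counts:
        return None
    divisor = reduce(gcd, counts.values())

    def indices(selector):
        # reduced indices of one class of atoms, in ascending order
        return sorted(n // divisor for (_, ox), n in counts.items() if selector(ox))

    cations = indices(lambda ox: ox > 0)
    neutrals = indices(lambda ox: ox == 0)
    anions = indices(lambda ox: ox < 0)
    if len(cations) > 4 or len(neutrals) > 3 or len(anions) > 3:
        return None  # assumed reading of the deletion rule

    parts = []
    for symbols, values in ((CATION_SYMBOLS, cations),
                            (NEUTRAL_SYMBOLS, neutrals),
                            (ANION_SYMBOLS, anions)):
        for symbol, index in zip(symbols, values):
            parts.append(symbol if index == 1 else f"{symbol}{index}")
    return "".join(parts)


# Al4H2(SO4)7(H2O)24 per formula unit: 4 Al(3+), 50 H(+), 7 S(6+), 52 O(2-)
tar_phase = [("Al", 3, 4, 1.0), ("H", 1, 50, 1.0), ("S", 6, 7, 1.0), ("O", -2, 52, 1.0)]
print(anx_formula(tar_phase))
```

Because the counts are divided by their greatest common divisor, atoms per formula unit can be passed in place of Wyckoff multiplicities, as done here. Replacing Al by Cr in the input gives the same formula, which is why the search retrieves the chromium compound.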

The 1436 zeolite structures taken from the database correspond to 96 different framework type codes, of which 40 types occurred at least five times in ICSD. It was shown that the ZSP achieves a prediction accuracy of more than 95% if at least five structures are available for training in each framework type class. Using the ZSP, new compounds can therefore be assigned to a framework type very fast and reliably without user intervention.

In the second example, experimental diffraction patterns are classified by means of a fast and efficient algorithm. Traditionally, the synthesis of new materials is rather time consuming, while the investigation of the new materials can be done very fast. With the advent of combinatorial chemistry in combination with fast collection of experimental data, a new bottleneck arises in high-throughput materials science. When experimental data can be acquired faster than they can be evaluated, the procedures for data analysis need to be adapted, in this case by using a machine learning algorithm. Kusne et al. [18] applied an algorithm based on a mean shift procedure that was first presented by Fukunaga and Hostetler [19]. Mean shift is an analysis technique that is used here for clustering. It performs very well in the presence of peak shifting, which is often a problem for clustering algorithms. Other advantages of mean shift are that it does not assume a shape of the data clusters and that it relies on a single parameter, the bandwidth. On the other hand, the bandwidth can also prove to be a problem, because selecting a good value is not trivial and an inappropriate bandwidth can lead to merged or additional clusters. To increase the reliability and efficiency of the clustering, Kusne et al. developed a procedure that combines experimental diffraction patterns with simulated diffraction patterns from ICSD. In this way the verified data from ICSD help to guide the iterative clustering of the experimental data in a fast and efficient manner and therefore improve the stability of the clustering even with a suboptimal choice of the bandwidth. Kusne et al. then applied this procedure to Fe–Co–X compositions, with X being a 3d, 4d, or 5d element, in search of novel rare-earth-free permanent magnets. The material for these permanent magnets should exhibit a sufficient magnetization and a large magnetic anisotropy without any rare-earth elements present. Kusne et al. could show that the insertion of a small percentage of X into the cell of cubic Fe–Co can cause a tetragonal distortion of the cell, inducing magnetocrystalline anisotropy. Even though the observed effect was small, this shows that the procedure has the potential to be of use in identifying rare-earth-free magnetic materials.
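As a rough illustration of the clustering step, the following minimal sketch implements a plain flat-kernel mean shift in Python. It is not the procedure of Kusne et al., which additionally guides the clustering with simulated ICSD patterns; the function name, the merge radius of half the bandwidth, and the one-dimensional toy data are assumptions made for this example.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift. Returns (modes, labels), one label per point."""

    def shift(p):
        # repeatedly move p to the mean of all points within the bandwidth
        for _ in range(max_iter):
            neighbors = [q for q in points if math.dist(p, q) <= bandwidth]
            new_p = tuple(sum(c) / len(neighbors) for c in zip(*neighbors))
            if math.dist(p, new_p) < tol:
                return new_p
            p = new_p
        return p

    modes, labels = [], []
    for p in points:
        mode = shift(tuple(p))
        for k, known in enumerate(modes):
            # assumption: modes closer than half a bandwidth are the same cluster
            if math.dist(mode, known) < bandwidth / 2:
                labels.append(k)
                break
        else:
            modes.append(mode)
            labels.append(len(modes) - 1)
    return modes, labels


# Toy data: two well-separated groups of one-dimensional points
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
print(mean_shift(points, bandwidth=1.0)[1])        # labels with a small bandwidth
print(len(mean_shift(points, bandwidth=10.0)[0]))  # number of modes with a large one
```

With a bandwidth of 1.0 the two groups are kept apart, whereas a bandwidth of 10.0 merges them into one cluster, which illustrates why the choice of bandwidth is not trivial.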

2.4.4 High-Throughput Calculation

The introduction of density functional theory (DFT) provided a powerful and efficient method to investigate the electronic structure of compounds of interest. In particular, its versatility in including different functionals to model specific interactions made this method very popular in computational chemistry and materials science. With modern computational resources the electronic structures and properties of thousands of compounds can be scrutinized in high-throughput calculations. Many projects perform high-throughput calculations on large numbers of compounds, such as the Materials Project [20], the Computational Materials Repository [21], and AFLOWLIB [22], to name a few. The Open Quantum Materials Database (OQMD) [23] is rather new and is freely available to the scientific community without any conditions or limitations. It currently contains almost 300 000 entries, with about 32 000 entries representing relaxed structures from ICSD (corresponding to all calculable structures with up to 34 atoms in the cell) and about 260 000 entries of commonly occurring hypothetical prototype structures. Kirklin et al. [24] described the database in detail and discussed the accuracy of the calculated formation energies. Accurate formation energies are very important, as they are required to assess the stability of a compound or to determine many material properties. Kirklin et al. compared 1670 experimental formation energies for oxides, nitrides, hydrides, halides, and intermetallics with those of the corresponding lowest-energy structure, which does not necessarily correspond to a structure from ICSD but may be one of the hypothetical prototype structures. Since DFT-calculated formation energies are only valid for 0 K, using elemental DFT total energies as chemical potentials for the formation energy calculation results in some error, because the experimental formation energies are usually determined at standard temperature and pressure.
In order to compensate for this discrepancy, Kirklin et al. evaluated three different schemes for the chemical potentials. Scheme 1 uses the DFT-calculated elemental total energies, and scheme 2 fits all chemical potentials to experiment. Scheme 3 fits to experimental data only those elements whose standard state arguably does not correspond to the 0 K DFT-calculated energies. Comparing the results for the three schemes, Kirklin et al. found that scheme 3 significantly improves the calculation of formation energies compared to just using the DFT-calculated elemental energies. Scheme 2 yields only a small gain over scheme 3 and introduces the risk of overfitting. Therefore the chemical potentials of scheme 3 were chosen as optimal.
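The role of the chemical potentials can be made concrete with a short sketch. It uses the usual definition of the formation energy per atom as the total energy of the compound minus the sum of the elemental chemical potentials weighted by composition. The function names and all numerical values are hypothetical and serve only to show how scheme 1 and scheme 3 differ; they are not taken from the OQMD.

```python
def formation_energy(e_total, composition, mu):
    """Formation energy per atom: (E_total - sum_i n_i * mu_i) / sum_i n_i.

    e_total     -- total energy of the compound per formula unit (eV)
    composition -- mapping element -> number of atoms per formula unit
    mu          -- mapping element -> chemical potential (eV/atom)
    """
    n_atoms = sum(composition.values())
    reference = sum(n * mu[element] for element, n in composition.items())
    return (e_total - reference) / n_atoms


def partially_fitted_potentials(mu_dft, mu_fit, fitted_elements):
    """Scheme 3 as a sketch: fitted values for selected elements, DFT values otherwise."""
    return {element: (mu_fit[element] if element in fitted_elements else value)
            for element, value in mu_dft.items()}


# Hypothetical numbers for a binary compound AB2, for illustration only
mu_dft = {"A": -3.0, "B": -4.0}  # scheme 1: elemental DFT total energies
mu_fit = {"A": -3.1, "B": -4.5}  # values fitted to experiment
mu_mixed = partially_fitted_potentials(mu_dft, mu_fit, fitted_elements={"B"})

e_total = -14.0
composition = {"A": 1, "B": 2}
print(formation_energy(e_total, composition, mu_dft))
print(formation_energy(e_total, composition, mu_mixed))
```

Fitting only the elements whose reference state is problematic keeps the number of adjustable parameters small, which is the reason given above for preferring scheme 3 over fitting every element.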

2.5 Outlook

ICSD is already extensively used in data mining and in computational chemistry. One can already observe a strong tendency to shift materials research from the traditional synthesis-oriented approach to a more theory-oriented approach. Crystal structure predictions in particular are becoming more and more reliable. Therefore FIZ Karlsruhe has started a collaboration to include in the ICSD theoretically calculated structures, which are usually not experimentally determined. This allows calculated structures to be compared either with each other or directly with experimental data. This feature was made available to ICSD users in 2017.

Experimental crystal structures are of great value in many ways. Currently ICSD supports exporting the crystal structure information only in the CIF format, which is very common in crystallography. Unfortunately, the CIF format is not directly supported by many computational chemistry software packages. In order to simplify the import of experimental crystal structure information into these packages, we plan to provide many more export formats in ICSD for common software (e.g. ABINIT [25], CPMD [26], MOPAC [27], Siesta [28], VASP [29]).

We currently offer ICSD Desktop only for Windows operating systems, but in principle the interfaces can also be used on Unix systems or derivatives; at present, this is limited only by some external tools needed by the interfaces. ICSD interfaces will therefore also become available on other operating systems, which opens up many opportunities for interactions with other programs that are mainly run on Unix-based systems.

References

1 (a) Bergerhoff, G. and Brown, I.D. (1987). Inorganic crystal structure database. In: Crystallographic Databases, 77–95. Chester: International Union of Crystallography. (b) Belsky, A., Hellenbrandt, M., Karen, V.L., and Luksch, P. (2002). New developments in the Inorganic Crystal Structure Database (ICSD): accessibility in support of materials research and design. Acta Cryst. B58: 364–369. https://doi.org/10.1107/S0108768102006948.
2 Butler, K.T., Frost, J.M., Skelton, J.M. et al. (2016). Computational materials design of crystalline solids. Chem. Soc. Rev. https://doi.org/10.1039/C5CS00841G.

3 Curtarolo, S., Hart, G.L.W., Nardelli, M.B. et al. (2013). The high-throughput highway to computational materials design. Nat. Mater. 12: 191–201. https://doi.org/10.1038/nmat3568.
4 Allmann, R. and Hinek, R. (2007). The introduction of structure types into the Inorganic Crystal Structure Database ICSD. Acta Cryst. A63: 412–417. https://doi.org/10.1107/S0108767307038081.
5 Buchsbaum, C., Höhler-Schlimm, S., and Rehme, S. (2010). Data bases, the base for data mining. In: Data Mining in Crystallography (ed. D.W.M. Hofmann and L.N. Kuleshova), 135–167. Heidelberg: Springer Verlag. https://doi.org/10.1007/978-3-642-04759-6_5.
6 (a) Hall, S.R. (1991). The STAR file: a new format for electronic data transfer and archiving. J. Chem. Inf. Comp. Sci. 31: 326–333. https://doi.org/10.1021/ci00002a020. (b) Hall, S.R. and Spadaccini, N. (1994). The STAR file: detailed specifications. J. Chem. Inf. Comp. Sci. 34: 505–508. https://doi.org/10.1021/ci00019a005.
7 Bragg, W.L. (1913). The structure of some crystals as indicated by their diffraction of X-rays. Proc. R. Soc. London A 89: 248–277. https://doi.org/10.1098/rspa.1913.0083.
8 Weber, T., Dshemuchadse, J., Kobas, M. et al. (2009). Large, larger, largest – a family of cluster-based tantalum copper aluminides with giant unit cells. I. Structure solution and refinement. Acta Cryst. B65: 308–317. https://doi.org/10.1107/S0108768109014001.
9 Grice, J.D. and Gault, R.A. (2006). Johnsenite-(Ce), a new member of the eudialyte group from Mount Saint Hilaire, Quebec, Canada. Can. Mineral. 44: 105–116. https://doi.org/10.2113/gscanmin.44.1.105.
10 Hanson, R.M., Prilusky, J., Renjian, Z. et al. (2013). JSmol and the next-generation web-based representation of 3D molecular structure as applied to Proteopedia. Isr. J. Chem. 53: 207–216. https://doi.org/10.1002/ijch.201300024.
11 Schön, J.C. (2014). How can databases assist with the prediction of chemical compounds? Z. Anorg. Allg. Chem. 640: 2717–2726. https://doi.org/10.1002/zaac.201400374.
12 (a) Abrahams, S.C. (1988). Structurally based prediction of ferroelectricity in inorganic materials with point group 6mm. Acta Cryst. B44: 585–595. https://doi.org/10.1107/S0108768188010110. (b) Abrahams, S.C. (1989). Structurally based predictions of ferroelectricity in seven inorganic materials with space group Pba2 and two experimental confirmations. Acta Cryst. B45: 228–232. https://doi.org/10.1107/S0108768189001072. (c) Abrahams, S.C., Mirsky, K., and Nielson, R.M. (1996). Prediction of ferroelectricity in recent inorganic crystal structure database entries under space group Pba2. Acta Cryst. B52: 806–809. https://doi.org/10.1107/S0108768196004582. (d) Abrahams, S.C. (1996). New ferroelectric inorganic materials predicted in point group 4mm. Acta Cryst. B52: 790–805. https://doi.org/10.1107/S0108768196004594. (e) Abrahams, S.C. (1999). Systematic prediction of new inorganic ferroelectrics in point group 4. Acta Cryst. B55: 494–506. https://doi.org/10.1107/S0108768199003730. (f) Abrahams, S.C. (2000). Systematic prediction of new ferroelectrics in space group P3. Acta Cryst. B56: 793–804. https://doi.org/10.1107/S0108768100007849. (g) Abrahams, S.C. (2003). Systematic prediction of new ferroelectrics in space groups P3₁ and P3₂. Acta Cryst. B59: 541–556. https://doi.org/10.1107/S0108768103013284. (h) Abrahams, S.C. (2006). Systematic prediction of new ferroelectrics in space group R3. I. Acta Cryst. B62: 26–41. https://doi.org/10.1107/S0108768105040577. (i) Abrahams, S.C. (2007). Systematic prediction of new ferroelectrics in space group R3. II. Acta Cryst. B63: 257–269. https://doi.org/10.1107/S0108768107005290. (j) Abrahams, S.C. (2008). Inorganic structures in space group P3m1; coordinate analysis and systematic prediction of new ferroelectrics. Acta Cryst. B64: 426–437. https://doi.org/10.1107/S0108768108018144. (k) Abrahams, S.C. (2010). Inorganic structures in space group P31m; coordinate analysis and systematic prediction of new ferroelectrics. Acta Cryst. B66: 173–183. https://doi.org/10.1107/S0108768110003290.
13 Abrahams, S.C. (1990). Systematic prediction of new ferroelectric inorganic materials in point group 6. Acta Cryst. B46: 311–324. https://doi.org/10.1107/S0108768190000532.
14 Kaduk, J.A. (2002). Use of the Inorganic Crystal Structure Database as a problem solving tool. Acta Cryst. B58: 370–379. https://doi.org/10.1107/S0108768102003476.
15 Fischer, T., Kniep, R., and Wunderlich, H. (1996). Crystal structure of dioxonium bis(tris(sulfato)dialuminate) sulfate docosahydrate, (H3O)2⋅[Al2(SO4)3]2⋅(SO4)⋅22(H2O). Z. Kristallogr. 211: 469–470. https://doi.org/10.1524/zkri.1996.211.7.469.
16 Gustafsson, T., Lundgren, J.O., and Olovsson, I. (1980). Hydrogen bond studies. CXXXIX. The structure of Cr4H2(SO4)7⋅24H2O. Acta Cryst. B36: 1323–1326. https://doi.org/10.1107/S0567740880006012.
17 Yang, S., Lach-hab, M., Vaisman, I.I., and Blaisten-Barojas, E. (2008). Machine learning approach for classification of zeolite crystals. In: Proceedings of the 2008 International Conference on Data Mining (ed. R. Stahlbock, S.F. Crone and S. Lessmann), 702–706. Las Vegas: CSREA.
18 Kusne, A.G., Gao, T., Mehta, A. et al. (2014). On-the-fly machine-learning for high-throughput experiments: search for rare-earth-free permanent magnets. Sci. Rep. 4: 6367. https://doi.org/10.1038/srep06367.
19 Fukunaga, K. and Hostetler, L. (1975). The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Inf. Theory 21: 32–40. https://doi.org/10.1109/TIT.1975.1055330.
20 Jain, A., Ong, S.P., Hautier, G. et al. (2013). The Materials Project: a materials genome approach to accelerating materials innovation. APL Mater. 1: 011002. https://doi.org/10.1063/1.4812323.
21 Landis, D.D., Hummelshøj, J.S., Nestorov, S. et al. (2012). The computational materials repository. Comput. Sci. Eng. 14: 51–57. https://doi.org/10.1109/MCSE.2012.16.
22 Curtarolo, S., Setyawan, W., Wang, S. et al. (2012). AFLOWLIB.ORG: a distributed materials properties repository from high-throughput ab initio calculations. Comp. Mat. Sci. 58: 227–235. https://doi.org/10.1016/j.commatsci.2012.02.002.

23 Saal, J.E., Kirklin, S., Aykol, M. et al. (2013). Materials design and discovery with high-throughput density functional theory: the open quantum materials database (OQMD). JOM 65: 1501–1509. https://doi.org/10.1007/s11837-013-0755-4.
24 Kirklin, S., Saal, J.E., Meredig, B. et al. (2015). The open quantum materials database (OQMD): assessing the accuracy of DFT formation energies. npj Comput. Mater. 1: 15010. https://doi.org/10.1038/npjcompumats.2015.10.
25 Gonze, X., Amadon, B., Anglade, P.M. et al. (2009). ABINIT: first-principles approach to material and nanosystem properties. Comput. Phys. Commun. 180: 2582–2615. https://doi.org/10.1016/j.cpc.2009.07.007.
26 CPMD. Copyright IBM Corp 1990–2015, Copyright MPI für Festkörperforschung Stuttgart 1997–2001. http://www.cpmd.org/ (accessed 24 August 2016).
27 Stewart, J.J.P. (2016). Stewart Computational Chemistry, Colorado Springs, CO, USA. http://OpenMOPAC.net (accessed 24 August 2016).
28 Soler, J.M., Artacho, E., Gale, J.D. et al. (2002). The SIESTA method for ab initio order-N materials simulation. J. Phys.: Condens. Matter 14: 2745–2779.
29 Hafner, J. (2008). Ab-initio simulations of materials using VASP: Density-functional theory and beyond. J. Comput. Chem. 29: 2044–2078. https://doi.org/10.1002/jcc.21057.

3 Pauling File: Toward a Holistic View

Pierre Villars 1, Karin Cenzual 2, Roman Gladyshevskii 3, and Shuichi Iwata 4

1 Material Phases Data System (MPDS), Unterschwanden 6, CH-6354 Vitznau, Switzerland
2 Geneva University, Laboratory of Crystallography, Quai E. Ansermet 24, CH-1211 Genève, Switzerland
3 Ivan Franko National University of Lviv, Department of Inorganic Chemistry, Kyryla i Mefodiya Street 6, 79005 Lviv, Ukraine
4 The Graduate School of Project Design, 3-13-16, Minami-aoyama, Minato-ku, Tokyo, 107-8411, Japan

Materials Informatics: Methods, Tools and Applications, First Edition. Edited by Olexandr Isayev, Alexander Tropsha, and Stefano Curtarolo. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

3.1 Introduction

To get a holistic view on materials is the ultimate goal of not only materials scientists, but also materials users. Deep insight into materials has been enhanced through continuous interaction between observation, experiment, data compilation, theoretical modeling, and calculation, as illustrated through history by prominent scientists such as Dmitri Mendeleev and Linus Pauling. Now, when moving into the “Data Era,” such interactions can be carried out on a large scale and relatively easily by building up procedures using digitized logics and data. More than 20 years ago, two of us (P.V., I.S.) undertook the task to do so, following the spirit of Linus Pauling, facing many challenges and learning from mistakes made in developing materials data systems since the 1960s. This was the beginning of the PAULING FILE project [1] in the 1990s.

There are two general approaches/concepts to obtain a holistic view on materials. The first one is the bottom-up approach (BUA), based on materials data. This data-driven approach is the key guideline of the PAULING FILE project. The second one is the top-down approach (TDA), for which guidelines and logics are taken from outside, from models (mathematics, physics, chemistry, and biology) and surrounding environments (nature and artifacts). TDA can be developed into a combination of powerful sets of scientific disciplines and/or market needs, based on logics, which requires networking of multifaceted knowledge models, bridging gaps, and tuning mismatches. Logics described by explicitly digitized scientific knowledge (i.e. not tacit knowledge) have been used in simulation programs with embedded algorithms, rules in artificial intelligence (AI) systems, and inspiring interfaces. In different fields, there have been pioneering projects:
• structure-property correlations for organic materials (E.J. Corey, 1960s);
• phase diagrams based on thermodynamics (NIST & ASM, since 1970s);
• structure maps based on crystallography and/or quantum mechanics (D.G. Pettifor, 1980s);
• deformation/fracture mechanism maps based on defect theory (M.F. Ashby, 1970s).

Integrating digitized models has become attractive and popular since the 1980s, thanks to the vertiginous development of high-performance computing, consequently named the “third science.” Thus many projects have been proposed under the names of multiscale modeling or multi-physics simulation, connecting first-principles calculations, molecular dynamics (MD), rate equations, the finite element method (FEM), and other methods that were developed in the 1980s and 1990s. The data-driven approach to obtain a holistic view requires a digitized system, a prototype of which was developed as a computer-aided design (CAD) system of databases, simulation, and AI in the middle of the 1970s. By the middle of the 1980s, almost all ideas, such as metadata, meta-knowledge, distributed data systems (DDS), intelligent systems of data classification and data mining, knowledge management, and learning and inference logics, had been discussed at the dawn of the new networked information environment and the increasing availability of high-performance computers.

Holistic views are obtained through converging interplay between BUAs and TDAs. This was successfully proved “manually” in 1985 [2] and can in a strategic way be transcribed into computational algorithms. The PAULING FILE project proposed the possibility of such a converging interplay via a digital platform. Concepts and prototypes for digital systems supporting similar interplay had also been proposed by E.J. Corey in 1969 and further developed in 1975, but the essential difference of the PAULING FILE project is that it aims to discover knowledge directly from data, rather than to reuse a set of predefined knowledge. In principle, new knowledge emerges from data, even if we reuse a set of predefined knowledge for convenience.
Triggered by a workshop held in Como in 1993 [3], important contributions toward a holistic view on materials were conceived in the form of networked intelligent systems of key data and models with powerful PCs and high-performance computers, and several projects to build material data systems, targeting practical needs of materials users in industry, were kicked off in the middle of the 1990s. The virtual experiment for materials design (VEMD) project [4] and the PAULING FILE project, focusing on TDA and BUA, respectively, were typical projects launched at that time. However, as shown in the following sections, it takes time to implement a digital system of practical impact. The editorial team of the PAULING FILE has devoted several decades to developing the PAULING FILE to practical competitiveness. Without a comprehensive compilation of highly quality-controlled data, followed by strategic mixing of inductive and deductive inferences, holistic views have not yet “emerged” digitally. As for TDA, it is still in an incubation period, especially from the viewpoint of materials users. The implications of the PAULING FILE experience for obtaining a set of holistic views were not properly learned in the VEMD cases, where the focus was on offering holistic views to materials users as an accumulation of comparative studies. In order to reach the first milestone, the following directions are at present under development by allocating logics for each fact, improving coherence of the collected facts and logics, and balancing combined uncertainties through comparative studies of complexities:
• Internet of Things (IoT) (huge monitoring data, data mining, etc.) of engineering products (recent General Electric engines as an implementation of J.H. Westbrook’s ideas covering from scientific basics to commercial services);
• Open and closed approaches to data, necessary to overcome the weakness of BUA and TDA by AI and collective knowledge in the cloud environment;
• Market-in and market-out synergetic collaborations between materials producers and materials users.

In order to become innovative and adapt to the market in materials data, as well as to the recent cloud environment, lessons should be learned and/or unlearned from the PAULING FILE and VEMD projects. A few remarks in this sense are summarized at the end of this chapter, as a reference for the future.

3.1.1 Creation and Development of the PAULING FILE

The shortcomings of the empirical (BUA) approach provided the basic motivation for the initiation of the PAULING FILE project, which was launched in 1995 as a joint venture of the Japan Science and Technology Corporation (JST), Material Phases Data System (MPDS), and The University of Tokyo, RACE. The PAULING FILE project [1, 5] planned three steps. The first goal was to create and maintain a comprehensive database for inorganic crystalline substances, covering crystallographic data, diffraction patterns, intrinsic physical properties, and phase diagrams. The data were to be checked with extreme care, and the term “inorganic substances” was defined as compounds containing no C–H bonds. In parallel to the database creation, appropriate retrieval software was to be developed to make the different groups of data mentioned above accessible via a single user interface. In the longer term, new tools for materials design were to be created, which would more or less automatically search the database for correlations, for the intelligent design of new inorganic materials with predefined intrinsic physical properties. To test the concept, the prototype PAULING FILE – Binaries Edition was published online and offline in 2002 [6, 7]. Since about the same time, the PAULING FILE has been under the leadership of MPDS alone. Selected PAULING FILE data are included in several printed, offline, and online products, most of them updated on a yearly basis, and three multinary online editions of the PAULING FILE will be available in 2017 [6, 8, 9].

3.2 PAULING FILE: Crystal Structures

The minimal requirement for a crystal structure database entry in the PAULING FILE is a complete set of published cell parameters, assigned to a compound of well-defined composition. Whenever published data are available, the crystallographic data also include atom coordinates, (an)isotropic displacement parameters, and experimental diffraction lines and are accompanied by information concerning preparation, experimental conditions, characteristics of the sample, phase transitions, and the dependence of the cell parameters on temperature, pressure, and composition. In order to give an approximate idea of the actual structure, a complete set of atom coordinates and site occupancies is proposed for database entries where a prototype could be assigned (by the authors or by the editors) but atom coordinates were not refined. The crystallographic data are stored as published but have also been standardized according to the method proposed by Parthé and Gelato [10, 11], using the program STRUCTURE TIDY [12], and, when relevant, further adjusted so that the data for isotypic entries can be directly compared [13]. Derived data include Atomic Environments (AEs) of individual atom sites, based on the maximum gap method [14–16], and the reduced Niggli cell.

The database entries are checked for inconsistencies within the database entry (e.g. chemical elements, charge balance, interatomic distances, space group, symmetry constraints) and by comparing different database entries (e.g. cell parameters and atom coordinates of isotypic compounds) with a program package including more than 30 modules [17]. For 5% of the database entries, one or more misprints in the published crystallographic data are detected and corrected. Warnings concerning remaining short interatomic distances, deviations from the nominal composition, etc. are added as remarks. SI units are used everywhere, and the crystallographic terms follow the recommendations of the International Union of Crystallography [18, 19].
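Two of the internal checks named above, charge balance and interatomic distances, can be sketched in a few lines of Python. This is a simplified illustration and not the PAULING FILE program package [17]: the function names, the input formats, and the distance threshold are assumptions, and the distance check uses Cartesian coordinates without periodic images.

```python
from itertools import combinations
from math import dist


def net_charge(sites):
    """Sum of oxidation state x multiplicity x occupancy over all sites.

    sites -- iterable of (element, oxidation_state, multiplicity, occupancy)
    A charge-balanced entry gives a value close to zero.
    """
    return sum(ox * mult * occ for _, ox, mult, occ in sites)


def short_contacts(atoms, min_distance):
    """Return (element_a, element_b, distance) for pairs closer than min_distance.

    atoms -- list of (element, (x, y, z)) with Cartesian coordinates in angstrom;
             periodic images are deliberately ignored in this sketch
    """
    return [(a, b, round(dist(pa, pb), 3))
            for (a, pa), (b, pb) in combinations(atoms, 2)
            if dist(pa, pb) < min_distance]


# Hypothetical rock-salt-like entry with one misprinted Cl coordinate
sites = [("Na", 1, 4, 1.0), ("Cl", -1, 4, 1.0)]
atoms = [("Na", (0.0, 0.0, 0.0)), ("Cl", (2.82, 0.0, 0.0)), ("Cl", (0.5, 0.0, 0.0))]

print(net_charge(sites))
print(short_contacts(atoms, min_distance=1.0))
```

A nonzero net charge or an implausibly short contact does not by itself identify the misprint; it only flags the entry for the kind of editorial inspection described above.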

Data Selection

The data are extracted from the primary literature. Theses are not considered, and conference abstracts are processed only in exceptional cases. When available, supplementary material deposited as CIF or in other formats is used as data source. Approximately 10% of the processed documents exist in an original and a translated version; duplicates are carefully avoided and both references are stored. Crystallographic data simulated by ab initio calculations or optimized by differential pair distribution function (d-PDF) or other methods are only considered when confirmed by experimental observations.

Distinct database entries are created for all complete refinements reported in a particular paper. For cell parameters without published atom coordinates, a database entry is prepared for each chemical system and crystal structure (a distinct phase; see definition in the following text). For example, for a continuous solid solution between two ternary compounds, there will be three database entries: one for each ternary boundary composition and one for the quaternary system, the latter possibly containing a remark describing the composition dependence of the cell parameters. For the choice of retrievable cell parameters, preference is given to values determined under ambient conditions.

3.2.2 Categories of Crystal Structure Entries

As stated earlier, the minimal requirement for a database entry in the crystal structure part of the PAULING FILE is a complete set of published cell parameters. The database entries are subdivided into different categories, according to the level of investigation, of which the most common are

• complete structure determined;
• coordinates of non-H atoms determined;
• cell parameters determined and prototype with fixed coordinates assigned;
• cell parameters determined and prototype assigned;
• cell parameters determined.

Atom coordinates are included in the PAULING FILE for the first four categories. Less frequent categories are average structure, commensurate approximant, part of atom coordinates determined, cell parameters determined and parent structure assigned (for filled-up derivatives such as carbides and hydrides), and subcell determined. The brief summary defining the level of investigation may be followed by information about additional studies, such as:

• absolute structure determined;
• composition dependence studied;
• electron density studied;
• magnetic structure studied;
• pressure dependence studied;
• refinement in superspace;
• temperature dependence studied.
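As a rough illustration of how the level-of-investigation categories and the additional-study remarks listed above might be represented in client code, the sketch below uses an enumeration and a small record type. The names `Level`, `EntrySummary`, and `has_atom_coordinates` are hypothetical and do not reflect the actual PAULING FILE schema; only the category labels and the rule that the first four categories carry atom coordinates are taken from the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Level(Enum):
    """Most common levels of investigation, in the order given in the text."""
    COMPLETE_STRUCTURE = "complete structure determined"
    NON_H_COORDINATES = "coordinates of non-H atoms determined"
    CELL_AND_FIXED_PROTOTYPE = (
        "cell parameters determined and prototype with fixed coordinates assigned"
    )
    CELL_AND_PROTOTYPE = "cell parameters determined and prototype assigned"
    CELL_ONLY = "cell parameters determined"


# Per the text, atom coordinates are included for the first four categories.
_WITH_COORDINATES = frozenset(
    {
        Level.COMPLETE_STRUCTURE,
        Level.NON_H_COORDINATES,
        Level.CELL_AND_FIXED_PROTOTYPE,
        Level.CELL_AND_PROTOTYPE,
    }
)


@dataclass
class EntrySummary:
    """Hypothetical record: level of investigation plus additional studies."""
    level: Level
    additional_studies: List[str] = field(default_factory=list)

    def has_atom_coordinates(self) -> bool:
        return self.level in _WITH_COORDINATES

    def summary(self) -> str:
        # Level first, followed by any additional-study remarks.
        return "; ".join([self.level.value, *self.additional_studies])
```

A record such as `EntrySummary(Level.CELL_AND_PROTOTYPE, ["temperature dependence studied"])` then reports that (assigned) atom coordinates are available, whereas an entry of level `Level.CELL_ONLY` does not.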

3.2.3 Database Fields

In addition to the crystallographic data, large amounts of information concerning the sample preparation and experimental investigation are included in the PAULING FILE. Basic data are stored as published (for rapid comparison with the original paper) and standardized (for efficient data checking and retrieval and for a homogeneous presentation). The following database fields may be present in a crystal structure database entry:

• Classification: chemical system, chemical formula (as published, standardized), modification, colloquial name, structure prototype, Pearson symbol, space group number, Wyckoff sequence, mass per formula unit, computed density, level of structural investigation, and additional studies.
• Bibliographic data: data source, authors (affiliation), language, and title.
• Published crystallographic data: space group, cell parameters, number of formula units per cell, and atom coordinates (site label, element(s), site multiplicity, Wyckoff letter, site symmetry, x, y, z, partial site occupancy).
• Standardized crystallographic data: space group, cell parameters, number of formula units per cell, atom coordinates (site label, element(s), site multiplicity, Wyckoff letter, site symmetry, x, y, z, partial site occupancy), and transformation from published to standardized data.
• Niggli-reduced cell: cell parameters and transformation from published to Niggli-reduced cell.
• Displacement parameters: isotropic, anisotropic, and computed equivalent isotropic.
• Published diffraction lines: Bragg angle or equivalent parameter, interplanar spacing, intensity, Miller indices, radiation, and remarks.
• Preparation: starting materials (purity, form), method of synthesis (crucible, atmosphere, solvent), and annealing or crystal growth.
• Mineral: mineral name and locality.
• Compound description: chemical analysis (method, composition from analysis); stability with respect to temperature, pressure, and composition; color; optical characteristics; sample form (crystal habit, grain size); chemical reactivity; and measured density.
• Determination of cell parameters: sample, experimental method, radiation, temperature, pressure, theta range, and software used.
• Structure determination: sample, experimental method, diffractometer/reactor, radiation, temperature, pressure, scan mode, theta range, number of reflections, linear absorption coefficient and absorption correction, starting model, refinement, number of refined parameters, numbers of reflections, condition for observed reflections, R factors, and software used.
• Remarks: general remarks, errata, editor remarks (modifications of published data, warnings), remarks on/from related references, and dependence of cell parameters on temperature, pressure, and composition.
• Figure descriptions: figure number in the original publication, title, parameters, and ranges.

The data extracted and stored for a ternary aluminide are shown in Table 3.1.

Table 3.1 Example of data stored for a PAULING FILE crystal structure entry.

Summary

Standardized formula: YNiAl4; Alphabetic formula: Al4NiY; Published formula: YNiAl4; Formula from refinement: Al4NiY
Structure prototype: YNiAl4,oS24,63; Space group: Cmcm (63); Wyckoff sequence: 63,fc3a
Computed density: 4.07 Mg/m³
Molar mass: 255.5
Level of investigation: complete structure determined

Bibliographic data

Reference: Sov. Phys. Crystallogr. (1972) 17, 453–455; Kristallografiya (1972) 17, 521–524
Language: Russian/English
Title: Crystal structure of the compounds YNiAl4 and YNiAl2

Author            Department                         Organization                          City  Country
Rykhal’, R.M.     Department of Inorganic Chemistry  Lviv Ivan Franko National University  Lviv  Ukraine
Zarechnyuk, O.S.  Department of Inorganic Chemistry  Lviv Ivan Franko National University  Lviv  Ukraine
Yarmolyuk, Y.P.   Department of Inorganic Chemistry  Lviv Ivan Franko National University  Lviv  Ukraine

Published crystallographic data

Space group: Cmcm (63)
Cell parameters: a = 0.408 nm, b = 1.544 nm, c = 0.662 nm, α = 90°, β = 90°, γ = 90°, V = 0.417 nm³, a/b = 0.264, b/c = 2.332, c/a = 1.623, Z = 4

Atom coordinates

Site  Elements  Wyckoff position  Site symmetry  x  y      z      Partial occupancy
Y     Y         4c                m2m            0  0.121  1/4
Ni    Ni        4c                m2m            0  0.771  1/4
Al1   Al        8f                m..            0  0.314  0.054
Al2   Al        4c                m2m            0  0.943  1/4
Al3   Al        4b                2/m..          0  1/2    0

Standardized crystallographic data

Space group: Cmcm (63)
Cell parameters: a = 0.408 nm, b = 1.544 nm, c = 0.662 nm, α = 90°, β = 90°, γ = 90°, V = 0.4170 nm³, a/b = 0.264, b/c = 2.332, c/a = 1.623, Z = 4

Atom coordinates

Site  Elements  Wyckoff position  Site symmetry  x  y      z      Partial occupancy
Al1   Al        8f                m..            0  0.186  0.054
Y     Y         4c                m2m            0  0.379  1/4
Al2   Al        4c                m2m            0  0.557  1/4
Ni    Ni        4c                m2m            0  0.729  1/4
Al3   Al        4a                2/m..          0  0      0

Transformation: origin shift 0 1/2 1/2

Niggli-reduced cell

a = 0.408 nm, b = 0.662 nm, c = 0.7985 nm, α = 90°, β = 104.802°, γ = 90°, V = 0.2085 nm³, a/b = 0.616, b/c = 0.829, c/a = 1.957

Atomic Environments

Site  Coordination number  Atomic environment type             Composition
Al1   12                   Cuboctahedron                       Ni3Al6Y3
Y     19                   Distorted pseudo Frank–Kasper (19)  Al13Ni4Y2
Al2   12                   Cuboctahedron                       NiY3Al8
Ni    9                    Tricapped trigonal prism            Al7Y2
Al3   12                   Cuboctahedron                       Al8Y4

Preparation

Starting material  Purity                   Form
Y                  99.9 wt%
Ni                 electrolytic, 99.98 wt%
Al                 99.98 wt%

Synthesis: arc-melted; Atmosphere: purified argon; Composition of sample: Al70Ni15Y15

Description of the sample

Measured density: 4.27 Mg/m³

Determination of the cell parameters

Sample: single crystal; Method: rotation photographs; Radiation: X-rays, Cu Kα

Structure determination

Sample: single crystal; Method: Weissenberg photographs; Radiation: X-rays, Cu Kα; Data collection: 0 k l
Model: crystal chemical considerations; Refinement: least-squares refinement, 69 reflections; R factors: R = 0.150

Processing information

Document: 102868; S-entry: 1407077; Processing: 12 May 2003; Checking: 23 April 2004; Last update: 12 May 2003
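As a quick plausibility check of Table 3.1, the computed density can be reproduced from the cell volume, the number of formula units per cell, and the molar mass via ρ = Z·M/(N_A·V), and the Niggli-reduced cell can be rebuilt from the C-centered orthorhombic cell. The sketch below is illustrative only: the function names are hypothetical and not part of any PAULING FILE tooling, and the second function assumes that a, c, and (a + b)/2 are the three shortest non-coplanar lattice vectors, which is the case for this example but not in general.

```python
import math

AVOGADRO = 6.02214076e23  # 1/mol (exact SI value)


def computed_density(z, molar_mass_g_mol, volume_nm3):
    """Density in Mg/m^3 (numerically equal to g/cm^3)."""
    volume_cm3 = volume_nm3 * 1e-21  # 1 nm^3 = 1e-21 cm^3
    return z * molar_mass_g_mol / (AVOGADRO * volume_cm3)


def reduced_cell_from_c_centered(a, b, c):
    """Primitive cell of a C-centered orthorhombic cell, spanned by
    a, c and a half face diagonal of the ab face.

    Returns (a', b', c', beta_deg, volume). Assumes (not checked) that
    these are the three shortest non-coplanar lattice vectors.
    """
    half_diagonal = 0.5 * math.hypot(a, b)
    # Obtuse angle between a and the half diagonal (-a + b)/2.
    beta = math.degrees(math.acos(-0.5 * a / half_diagonal))
    # The primitive cell has half the volume of the centered cell.
    return a, c, half_diagonal, beta, 0.5 * a * b * c


# Values from Table 3.1 (YNiAl4): Z = 4, M = 255.5, V = 0.417 nm^3.
rho = computed_density(4, 255.5, 0.417)
cell = reduced_cell_from_c_centered(0.408, 1.544, 0.662)
```

With the tabulated values, `rho` rounds to the listed 4.07 Mg/m³, and `cell` reproduces the listed reduced-cell edge c = 0.7985 nm and angle β = 104.802° to the precision given in the table.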

3.2.4 Structure Prototypes

The structure type is a well-known concept in inorganic chemistry, where a large number of compounds often crystallize with very similar atom arrangements. The compilation Strukturbericht [20] started in the early twentieth century to classify crystal structures into types, named by codes such as A1, B1, or A15. Though these notations are still in use, structure types are nowadays generally referred to by the name of the compound for which the particular atom arrangement was first identified, i.e. for the types enumerated earlier: Cu, NaCl, and Cr3Si. The PAULING FILE uses a longer notation, which also includes the Pearson symbol (a lowercase letter for the crystal system, an uppercase letter for the Bravais lattice, and the sum of the multiplicities of all fully or partially occupied atom sites) [21] and the number of the space group in the International Tables for Crystallography [18]: Cu,cF4,225; NaCl,cF8,225; and Cr3Si,cP8,223.

All data sets with published atom coordinates are classified in the PAULING FILE into structure prototypes, following the criteria defined in TYPIX [22]. According to this definition, isotypic compounds must crystallize in the same space group and have similar cell parameter ratios, and the same Wyckoff positions should be occupied in the standardized description (see the following text), with the same or similar values of the atom coordinates. If all these criteria are fulfilled, the AEs should be similar. Different ordering variants (substitution derivatives) are distinguished but, in the general case, no distinction is made between structures with fully and partly occupied atom sites. Because of the difficulty of locating protonic hydrogen atoms by X-ray diffraction, the positions of H atoms in structures containing more than two chemical elements (with the exception of hydrides) are ignored in the classification.

Each structure prototype is defined on a database entry in the crystal structure part of the PAULING FILE. These database entries are grouped in the so-called structure type pool (STP) and may later be replaced. To date, more than 36 000 different prototypes have been identified and added to the STP. When possible, a structure type has also been assigned to data sets without atom coordinates. The structure type is often stated in the original publication; in other cases it is assigned directly by the editors. The assigned prototype may in some cases be an approximation of the real structure, ignoring for instance a certain disorder. When it is not published, the editor also assigns the space group setting to which the published cell parameters refer.

3.2.5 Standardized Crystallographic Data

There exist an infinite number of ways to select the crystallographic data (cell parameters, space group setting, representative atom coordinate triplets) that define a crystal structure. The number remains high even when the basic rules recommended by the International Tables for Crystallography [18] are respected, due to space group-allowed operations such as permutations, origin shifts, etc. It follows that even identical or very similar atom arrangements may not be recognized as such (see Figure 3.1). The classification of crystal structures into structure prototypes is largely facilitated by the use of standardized crystallographic data (several examples are given in [23]). The crystallographic data in the PAULING FILE are not only stored as published but also standardized. This second representation of the same data is such that compounds crystallizing with the same prototype (isotypic compounds) can be directly compared. It is prepared in a three-step procedure:

(1) The published data are checked for the presence of overlooked symmetry elements [24] and, if relevant, converted into a space group of higher symmetry.
(2) The resulting data are standardized with the program STRUCTURE TIDY [12].
(3) The resulting data are compared with the standardized data of the type-defining database entry, and, if relevant, additional space group-permitted operations are performed [13].

3.2.5.1 Checking of Symmetry

A crystal structure can always be refined and described in a subgroup of the actual space group. In the extreme case, any structure can be described in the triclinic space group P1, which has no symmetry elements other than identity and translation. Knowing the correct space group is important not only for the recognition of isotypic structures but also in connection with intrinsic physical properties. Particular properties are effectively restricted to certain symmetries, e.g. ferroelectricity can only be observed for polar space groups, whereas pyroelectricity is excluded for crystal structures possessing an inversion center. Therefore, the crystallographic data in the PAULING FILE are checked for the presence of overlooked symmetry elements [24]. Whenever it is possible to describe the structure in a space group of higher symmetry, or with a smaller unit cell, without any approximations, this is done. Figure 3.2 shows how the structure of WAl5, reported in space group P63 (173), can be described in space group P6322 (182), after applying an origin shift of 0 0 3/4 to the published data [25].

Figure 3.1 Data sets for RbO and CsS, as published and after standardization, revealing their isotypism. Data as shown in the PAULING FILE – Binaries Edition [7].

Figure 3.2 The structure of WAl5, reported in space group P63 (173), can be described in space group P6322 (182), after applying an origin shift of 0 0 3/4 to the published data. Data set from the PAULING FILE – Binaries Edition [7].

3.2.5.2 Standardization

At the next step, the crystallographic data are standardized following the method proposed by Parthé and Gelato [10, 11], using the program STRUCTURE TIDY [12]. The standardization procedure applies criteria to select the space group setting, the cell parameters, the origin of the coordinate system, the representative atom coordinates, and the order of the atom sites. The main criteria are summarized as follows.

The coordinate system must be right-handed and refer to a space group setting defined in the International Tables for Crystallography [18], with the following additional constraints:

• triclinic space groups: Niggli-reduced cell;
• monoclinic space groups: b-axis unique, “best” cell;
• orthorhombic space groups: a ≤ b ≤ c, when not fixed by the space group setting;
• trigonal space groups with R-lattice: triple hexagonal cell;
• space groups with two origin choices: origin choice 2 (origin at inversion center);
• enantiomorphic space groups: smallest index of the relevant screw axis.

For the 148 nonpolar space groups, there exist between 1 and 24 possibilities to rotate, invert, or shift the coordinate system while respecting the conditions listed above. For each possibility the standardization program prepares a complete description where the representative triplet of each atom site must obey a series of eliminative conditions:

• first triplet in the International Tables for Crystallography [18];
• $0 \le x, y, z < 1$;
• minimum value of $x^2 + y^2 + z^2$;
• minimum value of $x$, then $y$, then $z$.

For polar space groups similar data sets are prepared where one atom site after the other, belonging to the “lowest Wyckoff set” (set of Wyckoff sites containing the first letters in the alphabet) represented in the structure, fixes the origin on the polar axis.

One of the data sets, prepared as described above, is selected based on the following eliminative conditions:

• minimum value of $\sum_j (x_j^2 + y_j^2 + z_j^2)^{1/2}$, summing over all atom sites;
• minimum value of $\sum_j x_j$, summing over all atom sites, then $\sum_j y_j$, then $\sum_j z_j$;
• minimum value of $x_n^2 + y_n^2 + z_n^2$ for the $n$th atom site.

Finally, the atom sites are reordered according to the following eliminative criteria:

• inverse alphabetic order of Wyckoff letters;
• increasing x, then y, then z.

In order to obtain similar standardized data sets for refinements with and without hydrogen positions, the positions of H (D, T) atoms in structures containing more than two chemical elements (with the exception of hydrides) are not taken into consideration for the choice of the standardized data set. The coordinates of the hydrogen atoms, when determined, are transformed according to the same operations as the remaining coordinates, and the atom sites are listed at the end of the standardized data set. Protonic hydrogen atoms are also ignored in parameters used for structural classification, such as the Pearson symbol or the Wyckoff sequence.

3.2.5.3 Comparison with the Type-Defining Data Set

In the general case the standardization procedure directly produces comparable data sets for isotypic compounds. This is, however, not always true, since particular situations may occur, e.g.:

• Two refinable cell parameters have similar values. Which of them is the larger may differ among isotypic compounds, and the constraint a ≤ b ≤ c will then lead to different standardized descriptions.
• The condition that all the angles of the Niggli-reduced cell must be either ≤90° or ≥90° may cause flipping of triclinic unit cells when the value of one of the angles switches from slightly larger than 90° to slightly smaller than 90°.
• The constraint that refinable atom coordinates must be ≥0 is responsible for a certain number of diverging standardizations observed for isotypic structures with refinable atom coordinates close to 0.
• The order of the atom sites may differ for isotypic compounds where several atom sites in the same Wyckoff position have similar refinable x-coordinates (or y-, z-coordinates).

To remedy these problems, each standardized data set is compared with the standardized database entry that defines the prototype in the PAULING FILE. The program COMPARE [13] generates the different space group-permitted crystallographic representations. Each representation is compared with the standardized description of the type-defining entry, based on the sum of the “minimum distances” between corresponding atom sites, expressed in fractional coordinates and multiplied by the site multiplicity, the B(est) S(etting) C(riterion):

$$\mathrm{BSC} = \sum_i m_i \left[ (\Delta x_i)^2 + (\Delta y_i)^2 + (\Delta z_i)^2 \right]^{1/2},$$

summing over all atom sites. The standardized data set is replaced by the data set having the smallest BSC value, and the isotypism is then checked by detecting atom coordinates differing by more than 0.1 from those of the type-defining entry.
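A minimal sketch of the BSC comparison is given below. It is not the COMPARE program: the helper names are hypothetical, the candidate representations are assumed to be already generated and to list corresponding sites in the same order, and the "minimum distance" is interpreted here as the shortest coordinate difference modulo lattice translations, which is an assumption.

```python
import math


def _min_delta(u, v):
    """Smallest difference between two fractional coordinates, modulo 1."""
    d = (u - v) % 1.0
    return min(d, 1.0 - d)


def bsc(candidate, reference):
    """Best Setting Criterion: sum over sites of m_i * sqrt(dx^2 + dy^2 + dz^2).

    Both arguments are lists of (multiplicity, (x, y, z)) in matching order.
    """
    total = 0.0
    for (m, p), (_, q) in zip(candidate, reference):
        total += m * math.sqrt(sum(_min_delta(a, b) ** 2 for a, b in zip(p, q)))
    return total


def best_setting(candidates, reference):
    """Candidate representation with the smallest BSC value."""
    return min(candidates, key=lambda c: bsc(c, reference))


def deviating_sites(candidate, reference, tol=0.1):
    """Indices of sites with any coordinate differing by more than tol."""
    return [
        i
        for i, ((_, p), (_, q)) in enumerate(zip(candidate, reference))
        if any(_min_delta(a, b) > tol for a, b in zip(p, q))
    ]
```

In use, `best_setting` would be called with all space group-permitted representations of a data set and the type-defining entry as reference; a non-empty result from `deviating_sites` for the selected representation would then flag the assumed isotypism for closer inspection.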
For data sets with no published coordinates, the cell parameters are standardized following the criteria defined for the unit cell and space group setting. For data sets with unknown space group, the cell parameters are standardized assuming the space group of lowest symmetry in agreement with the Pearson symbol, e.g. P222 when nothing more is known than orthorhombic (o**) or orthorhombic primitive (oP*). For triclinic structures, the cell is adjusted by comparing with the cell of the type-defining database entry.

3.2.6 Assigned Atom Coordinates

In order to give an approximate idea of the actual structure, a complete set of atom coordinates and site occupancies is proposed for database entries where a structure prototype could be assigned (by the authors and/or by the editors), but atom coordinates were not determined. Two different cases occur:

(1) A structure type where all atom coordinates are fixed by symmetry is assigned. In this case the editor, based on the chemical formula, also assigns a probable atom distribution. For off-stoichiometric compositions, different situations are proposed as a first approximation, depending on the structure type. In the general case, fully occupied atom sites with mixed occupation are assumed, whereas for structure types such as NaCl, ZnS, CaF2, NiAs, or Ni2In, vacancies are assumed on one atom site.
(2) A structure type with refinable atom coordinates is assigned. The atom coordinates of the type-defining entry are proposed as a first approximation. The atom distribution is inserted by a program that compares the chemical formula of the type-defining entry with a chemical formula modified by the editor so that the substitution element by element is emphasized [17].

The structure types having both entries with and without refined atom coordinates in the PAULING FILE have been analyzed, and information concerning their behavior with respect to off-stoichiometry is stored if vacancies or mixed occupation are expected to occur selectively on particular atom sites. The positions of protonic H atoms are not included among the assigned coordinates, but sites occupied by, e.g. O, OH, or OH2 are distinguished. No attempt has been made to propose a data set closer to the real structure, e.g. by copying refinements for the same compound, since assigned atom coordinates and site occupancies cannot in any case replace a structure refinement.

3.2.7 Atomic Environment Types (AETs)

For the approach used here [15, 16], the AE, also called coordination polyhedron, is defined using the method of Brunner and Schwarzenbach [14]. According to this method, the interatomic distances between an atom and its neighbors are plotted in a next-neighbor histogram, as shown on the left-hand side of Figure 3.3 for the Ti atom in BaTiO3 rt. In most cases a clear maximum gap is revealed, and the atoms situated at distances to the left of the maximum gap are considered to belong to the AE of the central atom. This rule is called the “maximum gap rule,” and the coordination polyhedron, the atomic environment type (AET), is constructed with the atoms to the left of the maximum gap. The polyhedron around the Ti atom in Figure 3.3 is a (distorted) octahedron.

Figure 3.3 Next-neighbor histogram (NNH) (top left) and the corresponding coordination polyhedron (AET) for an entry for BaTiO3 rt in Pearson’s Crystal Data [26]. Histogram of the number of neighbors n vs. d/dmin for the Ti atom, with dmin = 0.172 nm, dgap = 0.232 nm, and CN = 6; distance statistics (class width = 0.001 nm) for Ti-O, Ti-Ti, and Ti-Ba.

In those cases where the maximum gap rule leads to AETs with not only the selected central atom but also additional atoms enclosed in the polyhedron, or to AETs with atoms located on one or more of the faces or edges of the coordination polyhedron, the so-called “maximum-convex-volume rule” is applied. This rule is defined as the maximum volume around the central atom delimited by convex faces, with all the atoms of the AE lying at the intersection of at least three faces. This rule is also used in those cases where no clear maximum gap is detected.

All the structure entries with refined or fixed atom coordinates in the PAULING FILE are analyzed, applying the rules given previously. One hundred different AETs have been identified, of which the 50 most frequent ones are listed in Table 3.2. Each AET is identified by a code and the name of the coordination polyhedron; the count in the second column gives the number of times this AET is present in Pearson’s Crystal Data [26], release 2016/2017. In most structures the coordination numbers (CNs) vary from CN = 1 to CN = 22.

Table 3.2 The 50 most frequently occurring atomic environment types (AETs) with their counts (number of point sets) in PCD 2016/2017.

No.  Count    AET-code  Name of the coordination polyhedron
1    295 885  1#a       Single atom
2    234 712  2#a       Non-collinear
3    177 943  6-a       Octahedron
4    168 982  4-a       Tetrahedron
5    107 137  3#a       Non-coplanar triangle
6    40 263   12-b      Cuboctahedron
7    29 728   9-a       Tricapped trigonal prism
8    26 893   2#b       Collinear
9    24 678   12-a      Icosahedron
10   17 863   3#b       Coplanar triangle
11   16 693   14-b      Rhombic dodecahedron
12   16 536   8-a       Square prism (cube)
13   16 075   8-b       Square antiprism
14   15 285   5-a       Square pyramid
15   14 824   5-c       Trigonal bipyramid
16   10 840   7-g       Monocapped trigonal prism
17   10 151   14-a      14-Vertex Frank–Kasper
18   9 424    10-a      Fourcapped trigonal prism
19   8 921    6-b       Trigonal prism
20   8 806    16-a      16-Vertex Frank–Kasper
21   8 504    4#c       Coplanar square
22   7 730    7-h       Pentagonal bipyramid
23   7 499    20-a      Pseudo Frank–Kasper (20)
24   7 155    13-a      Pseudo Frank–Kasper (13)
25   6 484    4#d       Non-coplanar square
26   6 067    12-d      Anticuboctahedron
27   5 074    8-d       Distorted square antiprism-a
28   4 405    8-g       Double anti-trigonal prism
29   4 329    4#b       Tetrahedron, central atom outside
30   4 160    15-a      15-Vertex Frank–Kasper
31   4 088    10-b      Bicapped square prism
32   4 021    17-d      7-Capped pentagonal prism
33   3 989    11-a      Pentacapped trigonal prism
34   3 860    6-d       Pentagonal pyramid
35   3 360    8-i       Side-bicapped trigonal prism
36   3 339    11-b      Pseudo Frank–Kasper (11)
37   3 203    8-c       Hexagonal bipyramid
38   3 042    10-c      Bicapped square antiprism
39   2 941    18-a      Eight-equatorial-capped pentagonal prism
40   2 816    22-a      Polarity, eight-equatorial-capped hexagonal prism
41   2 363    10-e      Distorted equatorial fourcapped trigonal prism
42   2 023    5#d       Square pyramid, central atom outside
43   1 998    8-j       Distorted square antiprism-b
44   1 987    12-f      Hexagonal prism
45   1 343    14-d      Bicapped hexagonal prism
46   1 243    20-h      Twelve-pentagonal-faced polyhedron
47   1 117    7-a       Monocapped octahedron
48   961      18-d      Sixcapped hexagonal prism
49   945      6-h       Distorted trigonal prism
50   908      12-c      Bicapped pentagonal prism

It may be noted that this purely geometrical approach, which was developed for intermetallic compounds, does not distinguish types of bonding. As a consequence, the selected AE may include both cations and anions, both atoms forming covalent bonds with the central atom and counterions, or large atoms at contact distances and small atoms with little interaction. The procedure further considers all atom sites as being fully occupied and, consequently, a tetrahedron (e.g. a sulfate ion) in statistical disorder between two orientations will be classified as a cube. However, the method is simple to apply and of great use in the majority of cases. The AE approach offers an additional possibility to check the crystal structure data for geometrical correctness. Coordination polyhedra also constitute a tool for the classification of crystal structures into geometrically similar types [27]; types based on the AET definition used here are called “coordination types” [16].

Figure 3.4 Examples of cell parameter plots from Pearson’s Crystal Data [26], release 2016/2017: (a) thermal expansion for La0.5FeBi0.5O3, (b) parameter change through the phase transition for Sr0.6La0.4MnO3, (c) pressure dependence of the cell volume for Cr2O3.
Panel (a): cell parameter [nm] vs. T [K], data from CMATEX (2012) 24, 4563-4571; curves a/SQR2, b/2, c/SQR2 (fit to points), prototype GdFeO3,oP20,62 (SQR2 = √2).
Panel (b): cell parameter [nm] vs. T [K], data from PRBMDO (2003) 67(094431), 1-13; curves a/SQR2, b/SQR2, c/SQR2 (points), prototype (Ba0.3Sr0.2Pr0.5)MnO3,oF40,69, and a, c/SQR2 (points), prototype SrZrO3,tI20,140.
Panel (c): cell volume [nm³] vs. p [GPa], data from JSSCBI (2011) 184, 3040-3049; curve V (fit to points), prototype Al2O3,hR30,167.

3.2.8 Cell Parameters from Plots

Since 2009, values have been extracted from plots of cell parameters (or functions of these) vs. temperature, pressure, or composition [28] and stored in the database. Three cases are distinguished: experimental points, fit to experimental points, and linear dependence. The extracted values are converted to SI units and used to produce new figures, illustrating thermal expansion, phase transitions, or compression under pressure (see examples in Figure 3.4). Values extracted from the same publication for the same phase and temperature/pressure/composition are identified and linked, converted to basic cell parameters a, b, c, 𝛼, 𝛽, 𝛾, and standardized. After checking, these can be used for retrieval and it is possible to assign approximate atom coordinates.
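As an illustration of the conversion to basic cell parameters, plotted quantities such as a/SQR2 (i.e. a/√2) or b/2, labels of the kind used in the legends of the plots in Figure 3.4, can be mapped back by undoing the normalization. The sketch below is an assumption about how such labels could be parsed; the function name `to_basic_parameter` and the accepted label grammar are hypothetical, and the numerical values in the usage note are made up for illustration.

```python
import math
import re

# Accepts labels like "a", "b/2", "c/SQR2" (SQRn meaning the square root of n).
_LABEL = re.compile(r"^\s*([abcV])\s*(?:/\s*(SQR(\d+)|\d+(?:\.\d+)?))?\s*$")


def to_basic_parameter(label, value):
    """Undo a normalization such as 'a/SQR2' or 'b/2'.

    Returns (parameter_name, value_of_basic_parameter), in the same unit
    as the input value.
    """
    m = _LABEL.match(label)
    if m is None:
        raise ValueError(f"unsupported label: {label!r}")
    name, divisor, sqr_arg = m.groups()
    if divisor is None:
        factor = 1.0  # plain parameter, nothing to undo
    elif sqr_arg is not None:
        factor = math.sqrt(float(sqr_arg))  # e.g. SQR2 -> sqrt(2)
    else:
        factor = float(divisor)
    return name, value * factor
```

For example, a point read as 0.395 nm from a curve labeled `a/SQR2` corresponds to a = 0.395·√2 nm, and the same reading from a curve labeled `b/2` corresponds to b = 0.79 nm.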

3.3 PAULING FILE: Phase Diagrams

The phase diagram section of the PAULING FILE contains temperature–composition phase diagrams for binary systems, as well as horizontal and vertical sections and liquidus/solidus projections for ternary and multinary systems. Both experimentally determined and calculated diagrams are processed. Primary literature is given first priority, but diagrams from a few well-known compilations, among them the compendium of binary phase diagrams edited by Massalski et al. [29] and the series of books on ternary phase diagrams edited by Petzow and Effenberg [30], have been included. All the diagrams have been converted to at.% and °C and redrawn in a standardized version so that different reports for the same chemical system can easily be compared. Single-phase fields are colored in blue and three-phase fields in yellow. The phases identified on the diagrams are named according to PAULING FILE conventions, and the original names are also stored in the database. Examples of phase diagrams redrawn for the PAULING FILE are shown in Figure 3.5.

Figure 3.5 Examples of phase diagrams as redrawn for the PAULING FILE: (a) phase diagram of a binary system, (b) vertical section and (c) isothermal section of the phase diagram of a ternary system, and (d) liquidus projection of the phase diagram of a quaternary system. Systems shown: (a) Ag–Pt, temperature [°C] vs. at.% Pt; (b) Mg–O–Ti, temperature [°C] vs. at.% Mg; (c) Mg–Pr–Y at 500 °C (773 K), in at.%; (d) AgCl–KCl–NaCl, in mol.%.

Each phase diagram is linked to a database entry, which contains the following database fields:

• Classification: chemical system and type of diagram
• Investigation: experimental/calculated, calculation technique, APDIC/non-APDIC, and remark
• Bibliographic data: data source, authors (affiliation), language, and title
• Original diagram: figure number in the original publication, borders, scales, and original size
• Redrawn diagram: concentration range, temperature (range), and conversion of concentration
• List of phases present on the diagram: standardized phase name, name used in the original publication; structure prototype assigned by the editor; structural information given in the original publication; and link to a representative PAULING FILE crystal structure entry. For binary systems, the temperature and reaction type for the upper and/or lower limit of existence of the phase are also stored.

3.4 PAULING FILE: Physical Properties

The physical properties section of the PAULING FILE stores experimental and (to a limited extent) simulated data for a broad range of intrinsic physical properties of inorganic compounds in the solid, crystalline state. Processing is literature oriented, and each database entry groups selected data extracted for a particular phase in a particular publication. Focus is on the characterization of inorganic substances (single-phase samples), rather than on the optimization of materials. When published, the entries also contain information about synthesis and sample preparation, as well as information that helps to establish the links to phase diagram and crystal structure entries, such as colloquial names, crystallographic data, limits of stability of the phase with respect to temperature, pressure, or composition. The physical properties are stored in four different ways:
• numerical values,
• figure descriptions (Y vs. X),
• property classes such as superconductor, ferroelectric, etc.,
• keywords indicating the existence of particular data, e.g. different spectra.

The symbols for the most common physical properties have been standardized, mainly based on the CRC Handbook of Chemistry and Physics [31]. Numerical values are stored in as-published units and also converted to standard units. Most standard units are SI units; however, for certain properties at the atomic level, more suitable units such as eV or μB are used. Properties expressed with respect to a defined quantity of substance (per kilogram, per mole) are converted to per gram-atom. Each numerical property value is accompanied by information about the experimental conditions for that particular measurement. Great flexibility is provided through the links to reference tables, thanks to which new properties may be selected and their symbols, units, and ranges of magnitude can be controlled.

3.4.1 Data Selection

Data are taken from primary literature. Each database entry corresponds to a particular combination of data source and inorganic crystalline phase, but can contain several numerical values, figure descriptions, and keywords. For an investigation of a compound through a temperature- or pressure-induced structural phase transition, there will be two database entries, for instance, one for the room-temperature modification and one for the low-temperature modification. By default, ferroelectric transitions are assumed to be accompanied by structural changes and will justify the creation of two database entries, whereas magnetic, electric, or superconducting transitions are not. Data for phases with a certain homogeneity range are grouped under a representative chemical formula. The actual composition for a particular measurement, when differing from the composition representing the database entry, is specified among the parameters. As for the crystal structure part, there will be three database entries for a continuous solid solution between two ternary compounds: one for each ternary boundary composition and a third one grouping samples containing four chemical elements. Some simulated data from ab initio calculations are also included, in particular energy band structures, but focus is on experimentally measured data and values directly derived from measurements.

3.4.2 Database Fields

In addition to the physical properties (in the form of numerical values, figure descriptions, or keywords), and compulsory items such as the chemical formula, large amounts of information concerning the sample preparation and experimental conditions are stored in the PAULING FILE. The following database fields may be present in a physical properties database entry:
• Compound: chemical system, published chemical formula (samples investigated), representative standardized chemical formula, and modification
• Bibliographic data: data source, authors (affiliation), language, and title
• Preparation: starting materials (purity, description), method of synthesis (crucible, atmosphere, solvent), and annealing or crystal growth
• Sample description: sample form; chemical analysis; stability with respect to temperature, pressure, and composition; elastic behavior; density; color; chemical reactivity
• Crystallographic data: structure prototype, space group, cell parameters, and remark

For each physical property:
• Numerical values: symbol; value in published unit; value in standard unit; temperature; other experimental conditions (pressure, magnetic field, wavelength, etc.); direction; composition or chemical element; remark
• Figures: number in the original publication; parameters; ranges; remark
• Keywords: code for additional topic treated in the publication
• Property class: one or several property classes.

Figure 3.6 shows the part concerning the properties of a data sheet taken from the PAULING FILE – Binaries Edition [7].

3.4.3 Physical Properties Considered in the PAULING FILE

The physical properties considered in the PAULING FILE belong to one of the following categories: electronic and electrical properties, ferroelectricity, magnetic properties, mechanical properties, optical properties, phase transitions, superconductivity, and thermal and thermodynamic properties. Table 3.3 lists the main properties considered in the PAULING FILE for the first two categories. Items in square brackets are keywords, for which numerical values are in principle not extracted. Primary properties, to which particular attention is paid for the extraction of numerical values, are emphasized with bold characters. Thanks to the flexible construction of the relational database, new properties can easily be added.
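To make the storage scheme more concrete, the following minimal Python sketch shows how a physical properties entry with its four kinds of content (numerical values, figure descriptions, keywords, and property classes) could be represented. It is an illustration only and not the actual PAULING FILE schema: the class names, the field names, and the small conversion table `TO_STANDARD` are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical reference table: (property, published unit) -> (standard unit, factor).
# Only two energy-gap rows are filled in for illustration.
TO_STANDARD = {
    ("energy gap", "eV"): ("eV", 1.0),
    ("energy gap", "meV"): ("eV", 1.0e-3),
}


@dataclass
class NumericalValue:
    """One numerical value, kept in the unit in which it was published."""
    prop: str                              # standardized property name
    value: float
    unit: str                              # as-published unit
    temperature_K: Optional[float] = None  # experimental condition, if given
    remark: str = ""

    def in_standard_unit(self):
        """Return (value, unit) converted through the reference table."""
        std_unit, factor = TO_STANDARD[(self.prop, self.unit)]
        return self.value * factor, std_unit


@dataclass
class PropertyEntry:
    """One entry: a particular phase in a particular publication."""
    phase_name: str
    data_source: str
    numerical_values: list = field(default_factory=list)
    figure_descriptions: list = field(default_factory=list)  # e.g. "Y vs. X"
    keywords: list = field(default_factory=list)             # bracketed items
    property_classes: list = field(default_factory=list)


# Usage with made-up numbers (not taken from the database):
entry = PropertyEntry(phase_name="XY rt", data_source="hypothetical reference")
entry.numerical_values.append(NumericalValue("energy gap", 350.0, "meV", 300.0))
print(entry.numerical_values[0].in_standard_unit())
```

Keeping the published value and deriving the standard-unit value through a lookup table mirrors the idea that new properties and units can be added through reference tables without changing the record layout.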

Figure 3.6 Part of the data sheet of a physical properties entry for Li2O in the PAULING FILE – Binaries Edition [7].

Table 3.3 Electronic and electrical and ferroelectric properties considered in the PAULING FILE.

Electronic and electrical properties
• Metal/nonmetal character: Temperature for metal–nonmetal transition; Pressure derivative; Pressure for metal–nonmetal transition
• Electron energy band structure: [Electron energy band structure]; [Brillouin zone]; [Fermi energy]; [Fermi surface]
• Electron density of states: Electron density of states at Fermi level; [Electron density of states diagram]; [Electron density maps]
• Energy gap: Energy gap; Pressure derivative; Temperature derivative; Composition derivative; Energy gap for direct transition; Pressure derivative; Temperature derivative; Energy gap for indirect transition; Pressure derivative; Temperature derivative; Thermal energy gap; Exciton energy; Pressure derivative; Temperature derivative; Activation energy
• Electrical conductivity/resistivity: Electrical resistivity; Temperature derivative; Concentration derivative; Electrical resistivity anisotropy; Phonon resistivity; Temperature derivative; Magnetic resistivity; Temperature derivative; Ionic conductivity; Electron conductivity; Hole conductivity
• Residual resistivity: Residual resistivity; Residual resistivity ratio (RRR)
• Spin-disorder resistivity: [Spin-disorder resistivity data]
• Spin-fluctuation resistivity: [Spin-fluctuation resistivity data]
• Piezoresistivity: Piezoresistivity; Pressure derivative; Temperature derivative; Magnetic contribution to piezoresistivity
• Magnetoresistivity: Magnetoresistivity; Temperature derivative
• Hall coefficients: Hall coefficient; Pressure derivative of Hall coefficient; Temperature derivative of Hall coefficient; Ordinary Hall coefficient; Extraordinary Hall coefficient
• Effective mass: Effective mass of electrons in conduction band; Effective mass of electrons anisotropy; Effective mass of holes in valence band; Pressure derivative; Effective mass of electrons/holes ratio; Effective mass of polarons
• Charge carrier concentration: Electron concentration; Hole concentration; Electron/hole concentration ratio; Charge carrier concentration; Donor concentration; Acceptor concentration; Donor/acceptor concentration ratio
• Charge carrier mobility: Electron mobility; Pressure derivative; Hole mobility; Pressure derivative; Electron/hole mobility ratio; Hall mobility; Pressure derivative; Ion mobility
• Charge density wave: [Charge density wave energy gap]
• Charge transfer: Effective charge; Mean valence
• Quadrupole splitting: [Electric-field gradient]

Ferroelectricity
• Ferroelectric transitions: Ferroelectric Curie temperature; Pressure derivative; Antiferroelectric Néel temperature; Pressure derivative; Temperature for transition between different ferroelectric states
• Permittivity (dielectric constant): Permittivity; Pressure derivative; Temperature derivative; Real part of permittivity; Imaginary part of permittivity; Permittivity change at phase transition; Static permittivity; Pressure derivative; Temperature derivative; High-frequency permittivity; Pressure derivative; Temperature derivative
• Dielectric loss tangent
• Electric polarization: Electric polarization; Spontaneous electric polarization; Pressure derivative; Electric dipole moment
• Paraelectric state: Paraelectric Curie temperature; Pressure derivative; Paraelectric Curie coefficient
• Ferroelectric hysteresis: Coercive electrical field; Remanent polarization
• Ferroelectric phase diagram: [Electrical field – composition diagram]; [Electrical field – temperature diagram]
• Piezoelectricity: Piezoelectric coefficients
• Pyroelectricity: Pyroelectric coefficients

Bold characters emphasize primary parameters, while square brackets indicate keywords.


3.5 Data Quality

Only reliable data can be used for sensible data mining, and great importance is given to the quality of the data in the PAULING FILE. The articles selected for processing are analyzed by scientists specialized in crystallography, phase diagrams, or solid-state physics, most of them holding a doctoral degree and having their own research experience in solid-state chemistry or physics [1]. A minimum of 50% editing activity is required in order to achieve efficiency and homogeneity in data processing, and some of the editors have already processed more than 5000 scientific papers.

3.5.1 Computer-Aided Checking

The PAULING FILE data are checked for consistency with the help of an original software package, ESDD (evaluation, standardization, derived data), containing more than 100 different modules [17]. The checking is carried out progressively, level by level.

Checks on Individual Database Fields
• formatting of numerical values,
• units and symbols for physical properties,
• Hermann–Mauguin symbols and Pearson symbols,
• consistency of journal code–year–volume, first–last page for literature references,
• formatting of chemical formulas,
• usual order of magnitude,
• spelling.

Consistency Within Individual Data Sets
• consistency of atom coordinates – Wyckoff letters – site multiplicity,
• comparison of chemical elements in chemical system, chemical formula, refinement, and preparation,
• comparison of computed and published values: cell volume, density, absorption coefficient, and interplanar spacings,
• consistency of Pearson symbol – space group – cell parameters,
• consistency of refined composition – chemical formula,
• consistency of units – symbols for physical properties,
• consistency of Bravais lattice – diffraction conditions,
• consistency of site symmetry – anisotropic displacement parameters.

Special Checking of Crystallographic Data
• comparison of interatomic distances with the sum of atomic radii,
• comparison of interatomic distances within chemical units (carbonates, phosphates, etc.),
• check on charge balance for oxides and halides,
• search for overlooked symmetry elements,
• comparison with the type-defining entry (cell parameter ratios, atom coordinates).

Consistency Within the Database
• comparison of densities,
• comparison of cell parameter ratios for isotypic compounds,
• check for compulsory data,
• check of database links.
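As an illustration of one of the checks listed above, the comparison of computed and published values for cell volume and density, the sketch below uses the standard crystallographic relations V = abc·sqrt(1 − cos²α − cos²β − cos²γ + 2 cos α cos β cos γ) and ρ = Z·M/(N_A·V). It is not the ESDD code; the function names and the 5% tolerance are assumptions chosen for this example, and the NaCl input values are rounded textbook figures.

```python
import math

AVOGADRO = 6.02214076e23  # 1/mol


def cell_volume(a, b, c, alpha=90.0, beta=90.0, gamma=90.0):
    """Unit-cell volume in cubic angstroms from cell parameters (angstrom, degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg)


def computed_density(molar_mass, z, volume_A3):
    """Density in g/cm^3 from molar mass (g/mol), formula units per cell, cell volume."""
    return z * molar_mass / (AVOGADRO * volume_A3 * 1.0e-24)  # 1 A^3 = 1e-24 cm^3


def is_consistent(published, computed, rel_tol=0.05):
    """Flag agreement within a relative tolerance (the 5% default is an assumption)."""
    return abs(published - computed) <= rel_tol * abs(computed)


# Rounded textbook values for NaCl: a = 5.64 angstrom, Z = 4, M = 58.44 g/mol.
volume = cell_volume(5.64, 5.64, 5.64)
rho = computed_density(58.44, 4, volume)
print(round(volume, 2), round(rho, 2), is_consistent(2.17, rho))
```

A published density that falls outside the tolerance would not necessarily be wrong; in the workflow described here it would be a trigger for the editor to look for a misprint in the cell parameters, Z, or the chemical formula.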

Wherever possible, misprints detected in the original paper are corrected, based on arguments explained in remarks. Five percent of the crystal structure entries contain errata referring to misprints in the published crystallographic data. Since editing mistakes can never be completely avoided, all modifications of the originally published data and interpretations of ambiguous data are stored in remarks. The ESDD software further computes the following parameters: at.% of the different elements, molar mass, refined composition/formula, computed density, interplanar spacings (from functions of Bragg angle), equivalent isotropic displacement parameters, linear absorption coefficient, and Miller indices referring to the published space group setting. It converts compositions expressed in wt% to at.% and values expressed in various published units to standard units (including units per mole or wt% to units per gram-atom), respecting the number of significant digits. The modular construction facilitates incorporation of new checking procedures.
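The two composition-related conversions mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ESDD implementation: the function names are invented here, the atomic masses are rounded, and the handling of significant digits described in the text is deliberately left out.

```python
# Rounded atomic masses in g/mol; only the elements needed for the example.
ATOMIC_MASS = {"Al": 26.982, "Ta": 180.948}


def wt_to_at_percent(wt_percent):
    """Convert a composition in wt% to at.%: divide by atomic mass, then normalize."""
    moles = {el: w / ATOMIC_MASS[el] for el, w in wt_percent.items()}
    total = sum(moles.values())
    return {el: 100.0 * n / total for el, n in moles.items()}


def per_mole_to_per_gram_atom(value_per_mole, atoms_in_formula):
    """Convert a quantity given per mole of formula units to per gram-atom."""
    return value_per_mole / sum(atoms_in_formula.values())


# TaAl3 expressed in wt% converts back to 25 at.% Ta, the value listed in Table 3.4.
formula = {"Ta": 1, "Al": 3}
mass = sum(n * ATOMIC_MASS[el] for el, n in formula.items())
wt = {el: 100.0 * n * ATOMIC_MASS[el] / mass for el, n in formula.items()}
print({el: round(x, 2) for el, x in wt_to_at_percent(wt).items()})

# A value of 100 (arbitrary units) per mole of TaAl3 is 25 per gram-atom.
print(per_mole_to_per_gram_atom(100.0, formula))
```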

3.6 Distinct Phases

The first part of the challenge in building up a comprehensive database consisted in compiling large amounts of data. However, to provide a global (holistic) overview of the database content and allow combined retrieval, it was also necessary to link the different database entries from the three parts of the PAULING FILE in a more efficient way than provided through the bibliographic information and the chemical system. The concept of Distinct Phases was introduced for this purpose.

The linkage of the three different groups of data is achieved via a Distinct Phases table, to which each individual crystal structure, phase diagram, and physical properties entry is linked via a coded phase identifier (chemical system and an arbitrary number). To prepare this table, each chemical system has been evaluated and the distinct phases identified based on information available in the PAULING FILE. As an example, the eight phases reported in the Al–Ta system are listed in Table 3.4.

Table 3.4 Distinct phases in the Al–Ta system.

System   at.% Ta   Phase          Prototype                   Space group
Al–Ta    25        TaAl3          TiAl3,tI8,139               I4/mmm
Al–Ta    36.11     Ta39Al69 ht    Ta39Al69,cF444,216          F-43m
Al–Ta    40        Ta2Al3 rt      *,aP*,*
Al–Ta    41.67     TaAl1.4 rt     *,hP*,*
Al–Ta    51.16     Ta22Al21       Ta22Al21,mP86,14            P121/c1
Al–Ta    58.62     Ta17Al12       Mg17Al12,cI58,217           I-43m
Al–Ta    62.5      Ta5Al3         Mn5Si3,hP16,193             P63/mcm
Al–Ta    67        Ta0.67Al0.33   (Cr0.49Fe0.51),tP30,136     P42/mnm

*, Not known.

In the PAULING FILE, a phase is defined by the chemical system, the crystal structure (when known), and/or the domain of existence with respect to temperature, pressure, or composition. Each distinct phase has been given a unique name containing a representative chemical formula, when necessary followed by a specification such as “ht,” “rt,” “3R,” “hex,” etc. The crystal structure is defined referring to the structure prototype, if known. For not yet (fully) investigated structures, partial structural information is given, if available, e.g. the complete Pearson symbol may be replaced by t** (tetragonal) or cI* (cubic body centered). Information about colloquial names and stability with respect to temperature, pressure, or composition, collected in the three parts of the database, is used to assign a phase identifier to physical properties and phase diagram entries with no structural data. Special cases are as follows:
• Phases that crystallize with the same structure type, but are separated by a two-phase region in phase diagrams, are distinguished. The same is true for temperature- or pressure-induced isostructural phase transitions where a discontinuity in the cell parameters is reported.
• Structures with different degrees of ordering have in some cases been considered separately, in others not, depending on the possibility to assign unambiguously one or the other modification to the database entries. Structure refinements considering, for instance, split atom positions are often grouped under the parent type.
• Structure proposals stated to be incorrect in later literature have been grouped under a phase identifier in agreement with more recent reports. A crystal structure entry reporting a hexagonal cell may in such a case, for instance, be grouped under an orthorhombic phase.
• With the definition of a structure type applied here, a continuous solid solution may shift smoothly from one type to another. A typical case is the progressive transition of a phase AxB from a NiAs-type to a Ni2In-type structure by filling first one A site and then a second one. Refinements considering one or the other type have been grouped together.
• Physical properties reported ignoring the crystal structure, and in principle referring to ambient conditions, are assigned to the rt modification, or, if the temperature dependence is not known, to the most commonly observed modification.
• By default a paraelectric–ferroelectric phase transition is assumed to be accompanied by a structural transition, and different phases are considered above and below the transition temperature. On the contrary, magnetic ordering is assumed not to modify the nuclear structure to a significant extent.
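The linkage through the Distinct Phases table described in this section can be sketched as a small in-memory index keyed by the phase identifier (chemical system plus an arbitrary number). The code below is only an illustration: the class and function names and the entry identifiers are hypothetical, and the parser assumes the prototype notation used in Table 3.4, where “*” marks unknown information.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Prototype(NamedTuple):
    formula: Optional[str]           # type-defining formula, None if unknown
    pearson: Optional[str]           # Pearson symbol, possibly partial ("aP*")
    space_group_number: Optional[int]


def parse_prototype(text):
    """Split e.g. 'TiAl3,tI8,139' into its three parts; '*' means not known."""
    formula, pearson, number = (part.strip() for part in text.split(","))
    return Prototype(
        None if formula == "*" else formula,
        None if pearson == "*" else pearson,
        None if number == "*" else int(number),
    )


class DistinctPhases:
    """Maps a phase identifier to the entries of the three parts of the database."""

    PARTS = ("structure", "diagram", "property")

    def __init__(self):
        self._entries = defaultdict(lambda: {part: [] for part in self.PARTS})

    def link(self, phase_id, part, entry_id):
        self._entries[phase_id][part].append(entry_id)

    def phases_with(self, *parts):
        """Phase identifiers having at least one entry in every requested part."""
        return sorted(
            pid for pid, e in self._entries.items() if all(e[p] for p in parts)
        )


# Usage with hypothetical entry identifiers:
table = DistinctPhases()
table.link(("Al-Ta", 1), "structure", "S-0001")
table.link(("Al-Ta", 1), "property", "P-0042")
table.link(("Al-Ta", 2), "diagram", "C-0007")
print(table.phases_with("structure", "property"))
print(parse_prototype("*,aP*,*"))
```

Because every entry carries the same identifier regardless of which part of the database it comes from, a combined query reduces to intersecting per-part results on that key.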

Parts of some chemical systems are of course still little explored, and reports in the literature are sometimes contradictory. The phase assignment then becomes difficult, and the list of distinct phases will sometimes contain more phases than exist in reality. It follows that there is a certain amount of subjectivity when assigning a phase identifier; we believe, however, that this approach represents a substantial advantage for the user.

3.6.1 Chemical Formulas and Phase Names

The chemical formulas have been standardized so that the chemical elements are always written in the same order, an order that roughly corresponds to the order of the groups in the periodic system. Chemical units, such as water molecules or sulfate ions, are distinguished and written within square brackets. Deuterium and tritium are considered as distinct chemical elements. In the crystal structure part of the PAULING FILE, whenever a structure type has been assigned to the published data, the chemical formula is written so that the number of formula units per cell is the same as for the type-defining compound. A phase containing 50 at.% A and 50 at.% B, for example, will be called A0.50B0.50 if the structure type is Cu,cF4,225 (Z = 4), but AB if it is CuAu,tP2,123 (Z = 1) and A2B2 if it is Cu3Au,cP4,221 (Z = 1). A two-phase sample of the same composition would be written A50B50. Such conventions imply a certain hypothesis on the atom distribution in the case of off-stoichiometric formulas. In particular it is necessary to choose between a formula assuming a structure with vacancies and one with mixed occupation, e.g. between A0.9B and A0.95B1.05. Given, in addition, the uncertainty on the chemical composition itself, especially when the authors did not recognize the crystal structure, the formula must be taken as a formal way of writing, and no claims are made on its correctness.

Each phase is assigned a name, which, in the general case, is a representative chemical formula, written as described earlier. Whenever several phases are known for the same chemical composition, a short code specifying the modification is added. Preference is given to terms such as “rt” (room temperature), “ht” (high temperature), “lt” (low temperature), or “hp” (high pressure), possibly followed by a digit when a series of temperature- or pressure-induced phase transitions are known. If only one modification, stable at room temperature, is known, the field modification is left blank. The specification “ht” is in principle added for phases that are only stable above room temperature (298.15 K), and by analogy, the specification “lt” for phases that are only stable below room temperature. In cases where no or contradictory information about phase stability is found in the literature, a specification such as “cub” (cubic), “rhom” (rhombohedral), “orth” (orthorhombic), etc. may be preferred. Ramsdell notations are used for polytypic compounds such as CdI2. Mineral names can also be used as specifications and are then abbreviated to the first three letters.

Special notations are used in the phase diagram part, where a chemical element in parentheses indicates a terminal solid solution based on this element. For complete solid solutions, two or more chemical elements or chemical formulas (if relevant, with specifications) are written within parentheses, separated by commas, e.g. (LiBr,AgBr) or (Ag2La,Ag2Ce rt).
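The convention that the formula is scaled to the number of formula units per cell of the type-defining compound can be illustrated with a short sketch. The helper below is an assumption made for this example, not PAULING FILE code: it takes atomic fractions, the number of atoms per cell (the trailing number of the Pearson symbol), and Z, and it formats non-integer coefficients with two decimals, which is only one possible choice. The order of the elements is taken as given by the caller.

```python
def atoms_per_cell_from_pearson(symbol):
    """Return the atom count from a complete Pearson symbol, e.g. 'cF4' -> 4."""
    return int(symbol[2:])


def standardized_formula(fractions, atoms_per_cell, z):
    """Write a formula with atoms_per_cell / z atoms per formula unit.

    fractions maps element symbols to atomic fractions that sum to 1.
    """
    atoms_per_formula_unit = atoms_per_cell / z
    parts = []
    for element, fraction in fractions.items():
        n = fraction * atoms_per_formula_unit
        if abs(n - round(n)) < 1e-9:
            count = round(n)
            parts.append(element if count == 1 else f"{element}{count}")
        else:
            parts.append(f"{element}{n:.2f}")
    return "".join(parts)


half = {"A": 0.5, "B": 0.5}
# The three cases discussed in the text for a 50:50 composition:
print(standardized_formula(half, atoms_per_cell_from_pearson("cF4"), 4))  # Cu type
print(standardized_formula(half, atoms_per_cell_from_pearson("tP2"), 1))  # CuAu type
print(standardized_formula(half, atoms_per_cell_from_pearson("cP4"), 1))  # Cu3Au type
```

The three calls reproduce the formulas A0.50B0.50, AB, and A2B2 quoted above for the Cu, CuAu, and Cu3Au structure types.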

3.6.2 Phase Classifications

A certain number of characteristics, attributed to the phases, are stored in the table Distinct Phases.
• Compound classes: The classification into compound classes is to a first extent based on the existence of complex anions such as sulfate, nitrate, carbonate, fulleride, etc. Simple compound classes, such as intermetallics (both elements situated on the left-hand side of the Zintl line of the periodic system), oxides, sulfides, and hydrates are also distinguished.
• Structure classes: Certain structure prototypes have been grouped into families, initially based on crystal chemical tables in TYPIX [22]. The family close-packed structures, for instance, groups structures built up of close-packed layers in any kind of stacking, without interstitial atoms. The structure classes perovskites, AlB2 family, close-packed structures, bcc atom arrangement, rocksalt family, and high-Tc cuprates have at present the largest numbers of representatives. The nomenclature of zeolites, using three-letter codes to characterize different frameworks, is taken from the Database of Zeolite Structures [32]. It may be noted that since the classification is applied to the prototypes, phases that have not been assigned a structure prototype will also not be assigned a structure class.
• Property classes: Property classes such as antiferromagnet, ferroelectric, metal, semiconductor, ionic conductor, superconductor, etc. are distinguished based on data available in the physical properties part of the PAULING FILE. It follows that a phase, which from the chemical formula is expected to have metallic character, will not be assigned this class if properties leading to this conclusion have not (yet) been processed for that particular phase. Conversely, phases with a significant range of existence in composition, temperature, or pressure may exhibit very different properties, depending on the doping level, temperature, or pressure, and some of the property classes assigned to the phase (e.g. antiferromagnet, ferromagnet, and spin glass for the same phase) may not apply to the representative chemical formula, or may apply only to a particular temperature or pressure range.
• Mineral names: The names reported in the original publications have been checked by consulting Strunz Mineralogical Tables [33] and the list of minerals approved by the International Mineralogical Association [34] and updated accordingly. The mineral names are stored in the table Distinct Phases so that all database entries for this phase will be linked to this information. When a continuous solid solution has been confirmed experimentally, several mineral names have sometimes been assigned to the same phase, e.g. enstatite/ferrosilite or annite/phlogopite 1M.
• Color: Color has tentatively been assigned also at the phase level [35], but is in some cases strongly composition dependent, or due to small amounts of impurities.
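Two of the rules above lend themselves to a brief sketch: a structure class is derived from the prototype, so a phase without a prototype receives none, and property classes are collected only from physical properties entries that have actually been processed. The mapping table and the function names below are hypothetical and contain just two illustrative rows.

```python
# Hypothetical excerpt of a prototype -> structure class table.
STRUCTURE_CLASS_OF_PROTOTYPE = {
    "NaCl,cF8,225": "rocksalt family",
    "AlB2,hP3,191": "AlB2 family",
}


def structure_class(prototype):
    """Return the structure class, or None if no prototype or no class is assigned."""
    if prototype is None:
        return None
    return STRUCTURE_CLASS_OF_PROTOTYPE.get(prototype)


def property_classes(processed_entries):
    """Union of the property classes found in the processed entries of a phase."""
    classes = set()
    for entry in processed_entries:
        classes.update(entry.get("property_classes", []))
    return sorted(classes)


print(structure_class("NaCl,cF8,225"))
print(structure_class(None))
print(property_classes([
    {"property_classes": ["antiferromagnet"]},
    {"property_classes": ["spin glass", "antiferromagnet"]},
]))
print(property_classes([]))  # nothing processed yet, so no class is assigned
```

The last call reflects the caveat in the text: the absence of a class such as “metal” only means that no supporting data have been processed for the phase, not that the property is absent.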

3.7 Toward a Megadatabase

After almost 25 years of existence, the PAULING FILE has reached a respectable size in the fields of crystal structures and phase diagrams of inorganic substances. Focus is here on the yearly update, and old, not yet processed publications represent a few percent. In contrast, in spite of the relatively high number of database entries, the coverage of physical properties is still at a low level, considering the huge amount of data published in this field. As of 2016, the PAULING FILE contains over 310 000 structural data sets (including atom coordinates and displacement parameters, when relevant) for some 140 000 different phases, more than 44 000 phase diagrams (with updated phase assignment) for about 9800 chemical systems, and 120 000 physical properties entries (about 420 000 numerical values and 130 000 figure descriptions) for some 50 000 phases. To reach this result, over 140 000 scientific publications have been processed, from more than 1500 different journals. Some 250 scientific journals are browsed from cover to cover for the yearly updates. Figure 3.7 shows the distribution of the database entries according to the top journals in each part of the database, where in some cases related titles have been grouped (e.g. Journal of the Less Common Metals is included under its successor Journal of Alloys and Compounds). In principle, only primary literature is considered in the PAULING FILE, but a few handbooks have also been processed for the phase diagram part. It may be noticed that the number of database entries processed from “others” is particularly high for crystal structures.

Figure 3.7 Distribution of the database entries in the PAULING FILE (June 2016) according to the data source. The journals are listed in decreasing order of the number of database entries, in clockwise order from the top on the diagrams.
Crystal structure database entries: Journal of Alloys and Compounds; Journal of Solid State Chemistry; Acta Crystallographica; Zeitschrift fuer Anorganische und Allgemeine Chemie; Physical Review B; Inorganic Chemistry; Inorganic Materials; Russian Journal of Inorganic Chemistry; Materials Research Bulletin; Zeitschrift fuer Kristallographie; American Mineralogist; Journal of Magnetism and Magnetic Materials; C.R. des Seances de l'Academie des Sciences; others.
Phase diagram database entries: Russian Journal of Inorganic Chemistry; Journal of Alloys and Compounds; Journal of Phase Equilibria; Petzow G., Effenberg G.: Ternary Alloys; Zeitschrift fuer Metallkunde; Massalski T.B.: Binary Alloy Phase Diagrams; Inorganic Materials; CALPHAD; Russian Metallurgy; Metallurgical Transactions; Zeitschrift fuer Anorganische und Allgemeine Chemie; Archiv fuer das Eisenhuettenwesen; Journal of the Institute of Metals; others.
Physical properties database entries: Physical Review B; Journal of Alloys and Compounds; Solid State Communications; Physica B+C; Journal of Magnetism and Magnetic Materials; Journal of Solid State Chemistry; Journal of Physics: Condensed Matter; Journal of the Physical Society of Japan; Physica Status Solidi; Journal of Applied Physics; Physical Review Letters; Materials Research Bulletin; others.

Figure 3.8 Distribution of the database entries in the PAULING FILE (June 2016) according to the publication year: (a) crystal structure, (b) phase diagram, and (c) physical properties database entries. Phase diagram data from handbooks are not included. Each panel plots the number of database entries against the publication year; panel (a) distinguishes refined and not refined entries, and panel (b) distinguishes calculated and experimental diagrams.

Figure 3.8 shows the distribution of the database entries per publication year. The regular shape of the diagrams for crystal structure and phase diagram entries confirms the good coverage of the world literature for these two sections of the PAULING FILE. The diagram of crystal structure entries (Figure 3.8a) also shows the proportion of database entries with refined (or fixed) coordinates, which, thanks to the development of the experimental methods and software for structure refinement, has increased significantly over the last 20 years. Figure 3.8b confirms that the number of experimental investigations of phase diagrams per year is decreasing, whereas the number of thermodynamic assessments is increasing. The third overview, shown in Figure 3.9, shows that, contrary to common ideas based on earlier works by the same authors (e.g. [36]), the PAULING FILE is not limited to intermetallics. On the contrary, except for the phase diagram part, oxides dominate. Figure 3.10 shows the number of numerical values, figure descriptions, and keywords, processed in June 2016, distributed over the eight property categories considered in the PAULING FILE (electronic and electrical properties, ferroelectricity, magnetic properties, mechanical properties, optical properties, phase transitions, superconductivity, thermal and thermodynamic properties). The most common physical properties extracted from the publications are magnetic susceptibility, electrical resistivity, heat capacity, and different transition temperatures. Tables 3.5 and 3.6 give some numbers from the main product for the crystallographic data, Pearson’s Crystal Data [26], release 2016/2017. The first table shows the distribution according to the number of chemical elements, and the second one the distribution according to the level of structural investigation. It can be seen from the latter that the entries have been classified into 36 080 different structure prototypes. Each year some 15 000 new entries are added to Pearson’s Crystal Data, most of them based on recent literature.

Figure 3.9 Distribution of database entries in the PAULING FILE (June 2016) according to the chemical class. The order in the legend corresponds to clockwise order, starting from the top, on the diagrams. Legend: metals; Si/P/As; B/C/N/H; S/Se/Te; oxides; halides. Numbers of entries shown on the diagrams for 1 or 2, 3, and 4 or more chemical elements, respectively: crystal structure database entries 59 643, 120 470, and 131 218; phase diagram database entries 11 346, 30 070, and 2 700; physical properties database entries 44 501, 40 825, and 36 036.

Figure 3.10 Number of items in the physical properties part of the PAULING FILE (June 2016) according to the property category and the data category; from bottom to top for each column: numerical values, figure descriptions, keywords for additional data. Property categories on the horizontal axis: electrical, ferroelectric, magnetic, mechanical, optical, phase transitions, superconductor, thermal.

Table 3.5 Numbers of distinct chemical systems, phases, and entries in PCD-2016/2017, subdivided into one, two, three, and more than three chemical elements, and the total numbers.

Number of chemical elements   Number of chemical systems   Number of phases   Number of entries
1                             97                           434                3 017
2                             2 570                        18 496             52 839
3                             18 096                       61 358             112 309
>3                            40 142                       85 063             120 656
Any                           60 905                       165 351            288 847


Table 3.6 Number of entries in PCD-2016/2017 according to the level of structural investigation.

Level of structural investigation a)                                            Number of entries
All atom coordinates refined or fixed, data set defining a prototype            36 080
All atom coordinates refined or fixed, not type-defining                        148 298
Part of atom coordinates determined                                             640
Cell determined, prototype, and atom coordinates assigned by the editor         85 455
Cell determined for filled-up derivative, parent type assigned by the editor    1 297
Cell determined                                                                 17 077
Total                                                                           288 847

a) Positions of protonic hydrogen atoms are ignored in the classification.

3.8 Applications

Thanks to the large amount of information stored in hundreds of distinct database fields, the PAULING FILE offers almost unlimited possibilities for retrieval. It can of course be used for all kinds of trivial searches, based on the chemical system or literature data. The conversion to standard units facilitates the search for properties within a particular numerical range, and the assignment of distinct phases plays an essential role, making it possible to combine searches on data stored in the three parts of the database: crystal structures, phase diagrams, and physical properties. It is, for example, possible to search for inorganic substances having low density (2700 K). Several distinct phases fulfill these requirements, among them AlP and BN cub with adamantane structures, tetragonal BeO ht, and CaS, which crystallizes with the structure type NaCl,cF8,225. Other examples could be quaternary ferro(ferri)magnets with ordering temperatures above 600 K, such as parts of the spinel (solid solution) phases CrFeNiO4, TiFeCoO4, and Zn0.5Mn0.5Fe2O4, or carbides ordering antiferromagnetically above 30 K (e.g. several RC2 and R2Fe14C compounds, where R is a rare-earth metal), or ionic conductors containing Ag and crystallizing with a cubic structure (halides, chalcogenides, including phases adopting the structure type RbAg4I5,cP80,213).

3.8.1 Products Containing PAULING FILE Data

Hundreds of interconnected database fields can be used as LEGO pieces to create different products. PAULING FILE data are included in several online, offline, and printed products, some of which are listed below. Some of these products contain only structure data, others phase diagrams and crystallographic data, and others all three groups of data. Following the preference of the producers, some products contain only the published cell parameters, others only the standardized cell parameters, and yet others both published and standardized crystallographic data. Some of the products listed below are limited to PAULING FILE data, whereas others also contain data from other sources.

• ASM Phase Diagram Database, ASM International (online) [37]
The Phase Diagram Database offers easy viewing of phase diagram details, crystallographic and reaction data. The content is updated on an annual basis, and the 2016 update brings the database to more than 40’000 online phase diagrams for binary and ternary systems.

• Inorganic Material Database (AtomWork), NIMS (online) [6]
The data part of AtomWork is the result of a collaboration between the Japan Science and Technology Corporation (JST), the National Institute for Materials Science (NIMS), and Material Phases Data System (MPDS). The Inorganic Material Database aims to cover all basic crystal structure, X-ray diffraction, physical property, and phase diagram data of inorganic and metallic solids from the main literature sources. As of 2016 (last update of the data part in 2000), AtomWork contains 82’000 crystal structures, 55’000 physical properties, and 15’000 phase diagrams. A new release, which will also contain recent data, is under development.

• PAULING FILE – Binaries Edition, ASM International (offline) [7]
The Binaries Edition of the PAULING FILE, which is limited to binary compounds, was published in 2002. It contains 8’000 phase diagrams covering 2’300 binary systems, 28’300 structural data sets for more than 10’000 different phases, roughly 3’000 experimental and 27’000 calculated diffraction patterns, and around 17’300 physical-property entries (with about 43’100 numerical values and 10’000 figure descriptions) for some 5’000 phases. To reach this result, 21’000 original publications were processed. Even though it is restricted to binary compounds, the data contained on the CD-ROM equals over 30’000 printed pages, i.e. a 20-volume handbook! The user-friendly retrieval program offers numerous possibilities, including some data-mining options.

• Pearson’s Crystal Data, ASM International (offline and online) [26]
The tenth release of Pearson’s Crystal Data (2016/17) contains 288’846 structural data sets for 130’000 different phases. Unlike the similarly named Pearson’s Handbook, the electronic product contains data for all classes of inorganic substances (approx. 50% oxides). All data sets with published coordinates, and 80% of the data sets where only cell parameters were published, have been assigned a structure type: 185’000 data sets with published atom coordinates, 85’000 data sets with assigned atom coordinates, 19’000 data sets with only cell parameters. Atomic environments have been defined for the first category. The crystallographic data are presented as published and standardized, and are accompanied by experimental details and remarks. In addition, the product contains: 18’300 experimental and 271’000 calculated diffraction patterns; 40’000 descriptions of cell parameters as a function of temperature, pressure, or composition, 13’000 plots; 100’000 unit cells extracted from plots vs. T or p; links to the publication, PDF-4+, ASM Phase Diagram Database, SpringerMaterials. The software offers numerous possibilities to retrieve and process the data.

• Powder Diffraction File PDF-4+, ICDD (offline) [38]
Since 1940 ICDD has provided tools in the form of experimental and calculated powder patterns for phase analysis based on diffraction methods. PDF-4+ (inorganic solids) and PDF-4 Minerals also include atom coordinates, which can be used to perform Rietveld refinements. Over two thirds of the structures in the current edition of PDF-4+ originate from the PAULING FILE. PAULING FILE entries, containing more data, replace duplicate reference patterns and citations of other origin.

• SpringerMaterials, Springer (online) [8]
Based on the well-known series of Landolt–Börnstein Handbooks, SpringerMaterials allows materials scientists to identify materials and their properties by offering access to physical and chemical data in materials science on an online platform. The PAULING FILE provides crystal structure, phase diagram, and physical properties entries on a yearly basis to the section Inorganic Solid Phases.

• Materials Platform for Data Science, MPDS (online, released 2017) [9]
Materials Platform for Data Science is a web platform presenting online the three parts of the PAULING FILE data. In its first release in 2017, it contains 45’300 phase diagrams, over 400’000 crystal structures, and over 500’000 physical properties entries. About 80% of the data can be requested remotely in a developer-friendly format, ready for external data-mining applications. The remaining 20% can be obtained as references to the original publications. Altogether 265’000 scientific publications in materials science, chemistry, physics, etc. will serve as the starting point for the platform, and this number will steadily increase. The main focus is on the verbatim representation of the original scientific data, in a form convenient for quick re-use and re-purposing. A web browser (without any plugins) and an internet connection will allow comfortable work with the scientific data, be it for a literature overview, evaluation of hypotheses, or design of new materials.

The Landolt–Börnstein handbook series Inorganic Crystal Structures [39] and the Handbook of Inorganic Substances [35] also contain PAULING FILE crystal structure data. The former describes structure prototypes in space groups 123–230, whereas the most recent edition of the latter lists crystallographic data for 157 000 inorganic phases. The software proposed by Materials Design Inc. to perform ab initio calculations [40] also contains crystallographic data from the PAULING FILE. The electronic book Inorganic Substances Bibliography [41] lists publications selected for processing in the PAULING FILE, ordered according to the chemical systems considered in the papers.

3.8.2 Holistic Overviews Based on the PAULING FILE

For any data mining or statistical approach to inorganic crystalline substances, the prototype classification of their crystal structures represents a key point, since it offers a “window” to view the electronic interactions of the atoms. By 2016, more than 36 000 different prototypes, as defined in the PAULING FILE, had been experimentally established for inorganic compounds. Several strong patterns have been revealed in maps using as coordinates elemental-property parameters (or expressions of these), based on thousands of data sets for different chemical systems/compounds [42–45]. This proves that the underlying quantum mechanical laws can be parameterized using elemental-property parameters of the constituent chemical elements. An appropriate choice of parameters leads to relatively simple maps with well-defined stability domains, offering excellent overviews of experimentally known inorganic substances. The maps provide, as a direct consequence, some possibilities to predict features of not yet known compounds.

Particularly nice overviews of the phase diagrams of binary systems can be obtained using the Constitution Browser in the PAULING FILE – Binaries Edition [7]. Figure 3.11 shows all available binary Mo–X phase diagrams in a periodic table representation. It can be seen that the systems exhibit certain regularities, e.g. all Mo–s1 (s1 = H, Li, Na, K, Rb, Cs, Fr) systems are non-formers, which means that no true binary compounds form under ambient conditions.

Figure 3.12 shows an “Inorganic Solids Overview–Elemental-Property Parameter Map” in the form of a generalized AET matrix using as coordinates PN_A vs. PN_B, where PN_A and PN_B are the periodic numbers of the chemical elements A and B, respectively [44]. To a first approximation, the periodic number runs from top to bottom and from left to right, column by column, through the periodic system: Li, 1; Na, 2; K, 3; and so on. The map on the left-hand side is based on experimental data, whereas the equivalent map on the right-hand side shows simulated (or extrapolated) data, making it possible to estimate at a glance the agreement or disagreement between experimental and simulated data.

Figure 3.11 Example of the Constitution Browser in the PAULING FILE – Binaries Edition [7], showing phase diagrams of binary systems containing Mo.

[Figure 3.12: two maps, (a) and (b), titled Holistic View; both axes labeled Elemental Property Parameter (Expressions).]

Figure 3.12 A generalized atomic environment type (AET) matrix PN_A vs. PN_B, which is independent of the stoichiometry and the number of chemical elements in the inorganic solid. The element occupying the center of the AET is given on the y-axis and the coordinating element on the x-axis. Different colors represent different AETs; gray fields correspond to non-former systems. Results are shown for (a) experimentally determined data and (b) simulated or extrapolated data [44].

3.8.3 Principles Defining Ordering of Chemical Elements

Before initiating the PAULING FILE project, in 1994 one of us reviewed the world literature on the topic Factors Governing Crystal Structures [46], focusing on intermetallics and alloys, and came up with nine quantitative principles. The conclusions were based on the second edition of Pearson’s Handbook of Crystallographic Data for Intermetallic Phases [36], which covers about 28 000 intermetallics and alloys (including a few oxides). Twenty years later, now having access to Pearson’s Crystal Data [26], release 2013/2014, with structural information for over 165 000 distinct phases (not only intermetallics but also oxides, halides, etc.), i.e. almost six times more experimental data than in Pearson’s Handbook, a new study was undertaken [45]. Most of the examples given below are based on the content of Pearson’s Crystal Data [26], release 2016/2017, hereafter referred to as PCD-2016/2017.

When chemical elements combine to form solid compounds, their crystal structures are beautifully rich, yet systematic patterns underlie this process. The most striking manifestation of this fact is the existence of crystal structure prototypes, which can be understood as geometrical templates adopted by large groups of compounds, e.g. the prototype NaCl,cF8,225 (rocksalt) is adopted by 1392 phases in PCD-2016/2017. Different compounds crystallizing in the same prototype are either geometrically identical or very similar to each other. The work from 1994 focused on the 1000 most populous prototypes and their representatives. The roughly 1000 most frequent prototypes in PCD-2016/2017 (987 prototypes adopted by at least 28 phases) cover about 70% of all the entries. The four statistical plots shown below quantitatively illustrate the core principle that defines ordering of chemical elements within a prototype.

(1) Simplicity principle
Figure 3.13 shows that the majority of the phases crystallizing with one of the roughly 1000 most frequent prototypes have fewer than 40 atoms per unit cell, with the maximum at 12. This principle was formulated in 1994 [46] as follows: The vast majority of the intermetallic compounds have less than 24 atoms per unit cell. Considering all inorganic substances as defined in the PAULING FILE (no C–H bonds), the 24 atoms per cell have become 40; nevertheless the maximum remains about 10 atoms per unit cell. In addition the majority of the crystal structures have three or less Atomic Environment Types (single-, two-, and three-environment types). This statement is still supported, and an analogous observation can be made focusing on the number of point sets (atom sites) per prototype. Figure 3.14 shows that the majority of the 1000 most common prototypes (and therefore also their representatives) have six or fewer different AETs, with a maximum at three different AETs. The number of point sets per prototype (see the same figure) is for the majority of the phases also less than six, with a maximum at three point sets per prototype.
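The prototype labels used throughout this section (for example NaCl,cF8,225) pack three pieces of information: a type-defining formula, the Pearson symbol, and the space group number. As an illustration of the simplicity principle, the sketch below splits such labels and reads the number of atoms per unit cell from the digits of the Pearson symbol. The class and function names are hypothetical, the six labels are ones quoted in this chapter, and the 40-atom threshold is the one named in the text; this is not the retrieval software of the PAULING FILE.

```python
import re
from typing import NamedTuple


class Prototype(NamedTuple):
    formula: str         # type-defining compound, e.g. "NaCl"
    pearson: str         # Pearson symbol, e.g. "cF8"
    atoms_per_cell: int  # trailing number of the Pearson symbol
    space_group: int     # space group number, 1 to 230


# Crystal family letter, centering letter, then the atom count.
_PEARSON = re.compile(r"([amothc])([PSABCIFR])(\d+)")


def parse_prototype(label: str) -> Prototype:
    """Split a 'formula,Pearson symbol,space group number' label."""
    formula, pearson, number = (part.strip() for part in label.split(","))
    match = _PEARSON.fullmatch(pearson)
    if match is None:
        raise ValueError(f"unrecognized Pearson symbol: {pearson!r}")
    space_group = int(number)
    if not 1 <= space_group <= 230:
        raise ValueError(f"space group number out of range: {space_group}")
    return Prototype(formula, pearson, int(match.group(3)), space_group)


# Labels quoted in the chapter text and figures.
LABELS = [
    "NaCl,cF8,225", "CaTiO3,cP5,221", "GdFeO3,oP20,62",
    "MgCu2,cF24,227", "MgAl2O4,cF56,227", "RbAg4I5,cP80,213",
]
prototypes = [parse_prototype(label) for label in LABELS]

# Prototypes below the 40-atom threshold of the simplicity principle.
small = [p.formula for p in prototypes if p.atoms_per_cell < 40]

if __name__ == "__main__":
    print(small)
```

Applied to a full list of prototypes with their numbers of representatives, the same parsing step would be enough to rebuild a histogram of the kind shown in Figure 3.13.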

[Figure 3.13: histogram. x-axis: Number of atoms per unit cell (0 to 300); y-axis: Number of phases (0 to 5000).]

Figure 3.13 Number of phases vs. the number of atoms per unit cell, considering the representatives of the about 1000 most common prototypes in PCD-2016/2017.

[Figure 3.14: bar chart with two series, Atomic Environment Types (AETs) and point sets. x-axis: Number of point sets / AETs per phase (2 to 34); y-axis: Number of phases (0 to 20 000).]

Figure 3.14 Number of phases according to the number of different AETs (right column), respectively number of point sets (left column), in the structure, considering the representatives of the about 1000 most common prototypes in PCD-2016/2017.

(2) Symmetry principle
Figure 3.15 shows the distribution of the c. 165 000 phases in PCD-2016/2017 according to the space group number. The symmetry principle was initially formulated as: The vast majority of all intermetallic compounds and alloys crystallize in one of the following 11 space groups: 12, 62, 63, 139, 166, 191, 194, 216, 221, 225, and 227. Extending the statistics to all classes of inorganic compounds considered in the PAULING FILE, a few more space groups have to be added: 2, 14, 15, 123, 129, 136, 140, 148, 164, 167, 176, and 229. It appears from the figure that, within each crystal system, certain space groups of high symmetry are preferred. As a consequence, 10% of the 230 space groups account for 67% of the entries in PCD-2016/2017.

[Figure 3.15: histogram. x-axis: Space group number (0 to 220), divided into crystal systems (triclinic, monoclinic, orthorhombic, tetragonal, trigonal, hexagonal, cubic); y-axis: Number of phases (0 to 10 000). Labeled peaks: space groups 2, 12, 14, 15, 62, 63, 139, 166, 194, 221, 225, and 227.]

Figure 3.15 Number of phases according to the crystal system and space group number in PCD-2016/2017.

(3) Atomic Environment principle
Figure 3.16 shows a frequency plot for the 18 most often observed AETs, considering the representatives of the roughly 1000 most frequent prototypes in PCD-2016/2017. In 1994, the AE principle was stated as follows: The vast majority of all atoms (point sets) in intermetallic compounds have as Atomic Environment one or several of the following 14 AETs: tetrahedron, octahedron, cube, tricapped prism, fourcapped trigonal prism, icosahedron, cuboctahedron, bicapped pentagonal pyramid, anticuboctahedron, pseudo Frank-Kasper (CN13), 14-vertex Frank-Kasper, rhombic dodecahedron, 15-vertex Frank-Kasper, and 16-vertex Frank-Kasper. The statement, made on the basis of intermetallics, that certain AETs are highly preferred is still correct, but when other classes of inorganic compounds are also considered, the order changes and the following low-coordination AETs appear among the most popular ones: single atom (CN = 1), collinear (CN = 2), non-collinear (CN = 2), coplanar triangle (CN = 3), non-coplanar triangle (CN = 3), and square antiprism (CN = 8). These 18, out of the 100 AETs distinguished in the PAULING FILE, are found for 90% of the point sets in the representatives of the most common prototypes. One may conclude that Nature strongly prefers certain AETs, most of them highly symmetrical.
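The space group lists quoted under the symmetry principle can be grouped by crystal system with a few lines of code. The sketch below relies on the standard ranges of space group numbers for the seven crystal systems; the function name and the two list names are illustrative. It also shows that the 23 groups named in the text amount to 10% of the 230 space groups, which is presumably the 10% referred to there.

```python
from bisect import bisect_left
from collections import Counter

# Highest space group number in each crystal system (standard numbering).
_UPPER_BOUNDS = [2, 15, 74, 142, 167, 194, 230]
_SYSTEMS = ["triclinic", "monoclinic", "orthorhombic", "tetragonal",
            "trigonal", "hexagonal", "cubic"]


def crystal_system(space_group: int) -> str:
    """Return the crystal system for a space group number (1 to 230)."""
    if not 1 <= space_group <= 230:
        raise ValueError(f"space group number out of range: {space_group}")
    return _SYSTEMS[bisect_left(_UPPER_BOUNDS, space_group)]


# Space groups listed in the text.
INTERMETALLICS_1994 = [12, 62, 63, 139, 166, 191, 194, 216, 221, 225, 227]
ADDED_FOR_ALL_INORGANICS = [2, 14, 15, 123, 129, 136, 140, 148, 164, 167,
                            176, 229]

preferred = INTERMETALLICS_1994 + ADDED_FOR_ALL_INORGANICS
by_system = Counter(crystal_system(sg) for sg in preferred)
share_of_groups = len(preferred) / 230

if __name__ == "__main__":
    print(f"{len(preferred)} preferred groups = {share_of_groups:.0%} of 230")
    for system in _SYSTEMS:
        print(f"{system:>12}: {by_system[system]}")
```

Every crystal system contributes at least one preferred space group, with the tetragonal and cubic systems contributing five each.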

[Figure 3.16: bar chart. x-axis: Atomic Environment Type (AET); y-axis: Number of point sets (0 to 100 000). AETs shown: square pyramid, coplanar triangle, pseudo Frank-Kasper (20), 14-vertex Frank-Kasper, fourcapped trigonal prism, 16-vertex Frank-Kasper, square prism (cube), square antiprism, non-coplanar triangle, rhombic dodecahedron, tricapped trigonal prism, icosahedron, single atom, collinear, non-collinear, cuboctahedron, tetrahedron, octahedron.]

Figure 3.16 Total number of point sets observed for the 18 most frequently occurring AETs, considering the representatives of the about 1000 most common prototypes in PCD-2016/2017.

(4) Ordering tendency principle
Table 3.7 gives the number of phases crystallizing in one of the roughly 1000 most popular prototypes in PCD-2016/2017 in a representation showing the number of chemical elements in the type-defining entry (rows) vs. the number of chemical elements in all isotypic phases (columns). High numbers are found along the diagonal, where the numbers of chemical elements are identical. Relatively high numbers are also observed for phases containing more chemical elements than the type-defining entry. However, cases where the number of elements is lower than in the type-defining entry are rare, in part due to the definition of a structure type used here, where different ordering variants (substitution derivatives) are distinguished.

Table 3.7 Number of distinct phases in PCD-2016/2017 comparing the number of chemical elements in the type-defining database entry (rows) and the number of chemical elements in all representatives (columns), considering the about 1000 most common prototypes.

Number of elements/   Number of phases according to the number of chemical elements         Total number   Total number
prototype             1         2         3         4         5         6         >6        of entries     of phases
1                     [242]     1 312     543       55        30        15        5         6 797          2 202
2                     3         [6 454]   9 877     1 798     298       65        54        62 282         18 549
3                     0         65        [17 410]  7 980     2 667     454       130       83 143         28 706
4                     0         0         433       [5 958]   2 721     508       147       23 197         9 767
5                     0         0         16        286       [1 038]   385       104       4 272          1 829
6                     0         0         0         47        78        [225]     118       1 492          468
7                     0         0         1         14        12        35        [205]     843            267

The diagonal corresponding to the same number of elements as in the type-defining entry is emphasized (values in square brackets).

It is interesting to note that the roughly 1000 most popular prototypes are represented by 182 026 entries in PCD-2016/2017, but only by 61 788 distinct phases. Among the entries, only about one half have no sites with mixed occupation. In general, these are structures where the number of chemical elements is the same as for the prototype. The structures of the remaining database entries do contain mixed sites. Such database entries are likely to be part of solid solutions and are in this case not true distinct phases in the commonly accepted sense. For example, partial replacement of the chemical element A in a compound ABC by a few at.% of a closely related chemical element A′ may lead to a quaternary representative (A,A′)BC, which in the PAULING FILE will be considered as a distinct phase.

The systematic patterns described above lead to restraint conditions expressed in the four principles listed below, summarizing the preference of Nature for:

• simplicity;
• particular overall symmetries;
• high local symmetry (symmetrical AETs);
• ordering of the chemical elements (distinct chemical elements occupy distinct atom sites).
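The ordering tendency can be quantified directly from Table 3.7. The sketch below transcribes the table as a matrix and counts how many of the distinct phases have the same number of chemical elements as the type-defining entry, how many have more, and how many have fewer. Treating the last row (7) and the last column (>6) as the same bin is an assumption made here for simplicity, and the variable names are illustrative.

```python
# Table 3.7, transcribed by hand.
# Rows: number of elements in the type-defining entry (1 to 7).
# Columns: number of elements in the representative phases (1 to 6, >6).
TABLE_3_7 = [
    [242, 1312, 543, 55, 30, 15, 5],
    [3, 6454, 9877, 1798, 298, 65, 54],
    [0, 65, 17410, 7980, 2667, 454, 130],
    [0, 0, 433, 5958, 2721, 508, 147],
    [0, 0, 16, 286, 1038, 385, 104],
    [0, 0, 0, 47, 78, 225, 118],
    [0, 0, 1, 14, 12, 35, 205],
]

# Diagonal: same number of elements as the prototype.
same = sum(row[i] for i, row in enumerate(TABLE_3_7))
# Right of the diagonal: more elements than the prototype.
more = sum(sum(row[i + 1:]) for i, row in enumerate(TABLE_3_7))
# Left of the diagonal: fewer elements than the prototype.
fewer = sum(sum(row[:i]) for i, row in enumerate(TABLE_3_7))
total = same + more + fewer

if __name__ == "__main__":
    for label, count in (("same", same), ("more", more), ("fewer", fewer)):
        print(f"{label:>5}: {count:6d} phases ({100 * count / total:.1f}%)")
```

With these numbers, about 51% of the 61 788 phases have the same number of elements as the prototype, about 47% have more, and fewer than 2% have fewer, in line with the statement that the last case is rare.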

The combination of the experimental observations given above reduces the number of potential prototypes for an unknown inorganic compound to a few hundred of the most common prototypes, i.e. approximately 1–2% of the experimentally known prototypes. Figure 3.17 shows a frequency plot for the representatives of the 100 most frequent prototypes in PCD-2016/2017. Seen from the opposite point of view, the large majority of the 36 080 prototypes in PCD-2016/2017 (nearly 80%) have fewer than four representatives. One of the reasons for the high number of prototypes is the increasing number of refinements revealing a high degree of disorder and sites with low occupancy. For example, most structure refinements of ionic conductors, complex minerals, zeolites, or hydrides represent distinct prototypes.

[Figure 3.17: bar chart. x-axis: 100 most frequent prototypes (10 to 100); y-axis: Number of phases (0 to 2000). Labeled prototypes: CaTiO3,cP5,221; GdFeO3,oP20,62; NaCl,cF8,225; MgCu2,cF24,227; MgAl2O4,cF56,227; CaF2,cF12,225; CeAl2Ga2,tI10,139; Ba2CaWO6,cF40,225; Ca2Nb2O7,cF88,227; TiNiSi,oP12,62; Cu,cF4,225; CsCl,cP2,221; Cu3Au,cP4,221; ZrNiAl,hP9,189; MgZn2,hP12,194; W,cI2,229.]

Figure 3.17 Number of representatives (phases) of the 100 most common prototypes in PCD-2016/2017.

3.9 Lessons to Learn from Experience

Even if it takes half a century to reach the critical size for database sustainability, it is worthwhile to re-examine the startup thoughts to get a holistic view on materials. As comparative cases, the PAULING FILE and VEMD projects were selected as BUA and TDA, respectively. For the case of developing the PAULING FILE, some key points on how to overcome the complexity and diversity of materials to transform them into values are listed below:

(1) Define core scientific principles for target materials explicitly. At least one holistic view as a set of digital knowledge is required, which enables continuous quality refinement of newly added data through logical deductive confirmation. Geometric group theory (crystallography) is the principle, as described in detail in the preceding sections. Other data that are difficult to refine deductively can be refined inductively through ad hoc holistic views generated from the compiled digital data.
(2) Implement systematic and graded procedures for highest data quality: digital data compilation, escaping from the conventional way of “manual” industry. Digital data have been produced as products of a digital manufacturing industry.
(3) Follow business-to-business (B-to-B) models with copyright contracts, as well as open strategies.
(4) Keep human resources with sufficient knowledge for (1) and (2).
(5) Most importantly, link the data via a closed space defined by crystallography. To reduce uncertainties due to inductive evaluations, all the data are linked to deductively evaluated digital data. IDs for phases are linked for such data as phase diagrams and intrinsic properties.

These are the prerequisites for a holistic view in which the whole should be greater than the sum of its parts. Consequently, the whole becomes a window on the diversities of materials decorated by impurities, alloying elements, and complicated defects.
However, for engineering applications, additional work is required to bridge other holistic views, namely, design windows in terms of materials properties, performances, and functions, and the corresponding holistic views created by materials scientists and materials producers. The former design window is used as a starting point for backcasting from the requirements, and the latter is ideally created as a summary by materials scientists and producers. Basic design windows about engineering materials were compiled and systematized into a digital system, Cambridge Materials Selectors (CMS), for all engineering materials by M.F. Ashby (1980s), and such design windows are shared as a set of “common sense” by materials engineers, even if not explicitly, in the first phase of materials development. Ad hoc articulations for materials selection problems have been carried out by resolving each problem into a set of sub-problems, where the above guidelines can be applied.

The VEMD project [4], carried out during the period 1995–1999, aimed to show exemplars of such bridging and converging procedures as a new approach to materials design. The VEMD project is thus one of the prototyping projects integrating elementary technologies, such as numerical simulation, knowledge information processing, database, and human interface technology. The main objectives of the VEMD project are as follows:

VEMD-1: Make a digital copy of the engineering materials world in terms of data and models.
VEMD-2: Acquire practical knowledge from the copy.
VEMD-3: Create design scenarios to answer requests from materials users.
VEMD-4: Add necessary data by simulation and/or experiment to follow the design scenario.
VEMD-5: Evaluate the designed materials from the viewpoint of materials users.

Figure 3.18 shows a schematic overview of the project [47]. As design targets, two groups were selected, namely, high-temperature superalloys and electronic materials. In the former case, it appeared too complex and too complicated to design materials, due to their time-, space-, and temperature-dependent features developed in open space, and the design strategy was reduced to the following:

Design Tactic 1: Identify the most promising exemplar in the past.
Design Tactic 2: Analyze the selected exemplar and resolve the problem into a set of sub-problems in terms of intrinsic properties and extrinsic structure-sensitive properties, following the resolution principle by Robinson [48].
Design Tactic 3: Carry out comparative studies in accordance with intrinsic properties to discover candidate materials, following the way of dealing with complexities as summarized by Masahiko [49].
Design Tactic 4: Evaluate the candidate materials by experiments.

For each Design Tactic we need to assume a holistic view so as not to miss potential solutions, which is combined into a design scenario balanced by another holistic view on the designed material for each engineering system. For the electronic materials, only results of first principles calculations were used to select the candidate materials for electronic devices, so that density of states (DOS) and electron mobility data were mainly used and defects and processing data were not taken into account. For the structural materials, design scenarios were made as a hybrid of two parts, namely, one scenario consisting of intrinsic properties derived by calculation as well as experiment and another scenario about structure-sensitive properties rewritten as a combination of qualitative causality and/or statistical correlation calculated from experimental data. As the latter structure-sensitive part has extremely rich semantics, due to the strongly correlated dynamics of defects under stress with severe thermal and chemical environments, each design scenario cannot be derived deductively and/or inductively from available data. The data were not enough to make a design scenario by deduction and/or induction.

So design scenarios of the VEMD were derived abductively, in other words, predetermined, and experts were expected to explain and refine the predetermined design scenario partly according to their own scientific domains. The VEMD project was thus conceived as a conventional research-oriented project on materials, mainly explaining what happened and why it happened. It was not a mission-driven strategic project “designing new materials,” but the aim was to rewrite the predetermined design scenario in terms of available data and models. Consequently, the priority of the project was shifted to writing original scientific papers on calculated data and models, rather than to developing materials by narrowing gaps between experimentally obtained data and the requirements of materials users.

[Figure 3.18: schematic in two panels. (a) Labels: Needs; Selection of target material; Prediction of structure; Prediction of characteristic and property; Evaluation of function and property; Design of process; Process simulation; Requirements; Evaluation; Verification by experiment; Database; Periodic Table; Numerical Simulation; Statistical Processing; Knowledge Processing. (b) Labels: Database (Property data, Test data, Simulation result, Experiment result); Experiment; Knowledge Base; Numerical Simulation Subsystem; Knowledge Information Processing Subsystem; Database Management Subsystem; Documentation Subsystem; Gate Way; Human Interface Subsystem; Creativity Support Subsystem.]

Figure 3.18 Outline of the VEMD project. (a) General overview, (b) database structure. Source: Nishikawa et al. 1997 [47]. Reproduced with permission of ASTM International.

Iterative dialogue/communication between materials experts and system designers as users of materials is ideally required to reach a common design scenario, and the calculated data need to be used to converge into a set of design solutions. They should not be just a set of output digits from selected programs. Connections revealed by first principles calculations and MD simulation via interatomic potentials were expected to explain atomic microscopic scale dynamics of materials, but the real dynamics of the microstructural evolutions was different. The results of the calculations could only be used to explain a particular observation in terms of models. The situation was the same for microstructural mesoscopic simulations, and also for macroscopic simulations, even if taking advantage of well-developed models such as the FEM. They were not holistic but fragmented, or just a collection of parts. Each time, careful tuning of computing parameters was required to bridge the gaps among models and also between the results of the calculations and real-world data.

In practice virtual experiment (VE) followed real experiments and gave some explanations about the experimental data, but not the inverse. In short, one of the authors (S.I.) now concludes that the tasks of VEMD were not to design new materials, but to explain each fact in terms of established models and substantial calculated results, although it was a challenging feasibility study for dealing with complexities of materials and describing their dynamics as an interplay of data and models. Due to the complexity of materials, the above statements are more or less true for all materials projects.
Mismatches between the objectives and the obtained results are usually explained in terms of weaknesses of the theory, models, algorithms, and/or computational power, and also by the shortage or inaccuracy of experimental data. Historically, all the projects were carried out properly and the number of original papers was in general sufficient to obtain good evaluation scores, thanks to the very complexity of materials. New fascinating keywords have been proposed, currently “Big Data” and IoT coupled with the third AI wave, “deep learning from data,” toward “Industry 4.0” [50] in the cloud environment of “Collective Knowledge.” A single flight generates 500 GB of data for a jet engine with several hundred components made of different materials, and a self-driving vehicle may produce the same or larger amounts of data on materials. Not only devices like X-ray diffractometers, but almost all engineering products and parts are monitored by numerous sensors. This is a characteristic of the IoT era, and the total amount of data easily reaches exabytes (1 EB = 10^18 bytes). Data on what is happening are flooding in. Now, even if everyone feels that huge obstacles still exist, data-centered science and engineering are expected, and everyone is convinced that something will change. How can the gap between data producers and data users be bridged? This is an old question repeated many times, but data producers and users are changing. Digital devices and engineering systems are joining human professionals, such as members of manufacturing companies, scientists/engineers, and data editors, as users. In such a digital data era, it seems more realistic to quickly put a digital prototype product into the hands of potential customers than to spend a lot of time on hypothetical business forecasting and planning of attractive products. Here the system preparedness for further digital processing becomes crucial. The state of collective knowledge needs to be switched on, taking advantage of minimum viable products. This should be the first step in a build–measure–learn feedback loop, supported by a system such as the PAULING FILE, which offers high-quality data with traceability. Organizing groups of serious users is important to increase the quality of collective knowledge, even more than studies of mission-driven data projects (risk, climate change). Collaboration frameworks to share data can be established step by step through activities like the Research Data Alliance (RDA). Data projects focusing on particular methods (neutron cross section, NMR, X-ray diffraction, spectroscopy, beam technology, and so on) will be organized as exemplars of IoT, the outputs of which are expected to be used to link associated data, as in the case of the PAULING FILE project. Business models around data will emerge by harmonizing various initiatives, inspired by innovative projects, so-called Complex Design, or by initiatives of researchers exploring new dimensions, such as David Baker [51]. The key point is to hybridize databases, ab initio calculations, and evolving algorithms, as in the case of Artificial Life. A crystallographic embryo is created in the hybridized cloud environment, taking advantage of PAULING FILE data, and evolves there, driven by ab initio calculations based on its boundary conditions and so on, in accordance with multiscale models finally associated with materials requirements. It is a big challenge to fill the gaps between thermodynamics and quantum mechanics, but this encourages PAULING FILE customers to use phase diagrams to deal with time-, temperature-, and pressure-dependent dynamics with intelligent formulations of each dynamic phenomenon.
Self-organization of microstructural evolution can become realistic, thanks to the recent development of image processing. A first descriptive analysis of the recorded data is followed by a diagnostic analysis of why it happened. The next steps consist of producing predictive analytics on what will happen next and prescriptive analytics on what should be done. There is no shortage of data, but a lack of intelligence and knowledge on how to link key data and carry out clustering of data for these analytics. Powerful tools are needed to do so, and the structural stability of component materials may be evaluated by taking advantage, in the beginning, of various learning methods, which may be obtained by refining and reorganizing important tools and data developed through the PAULING FILE project. A digital ecosystem is then created as a kind of rice nursery that nurtures the emergence of new materials, where all models and data are categorized and encapsulated into a set of active agents. Ad hoc tactics as used in the VEMD project are regarded as one of many knowledge chunks, and many lessons from the PAULING FILE project can be developed as core guiding principles for the digital ecosystem.

3.10 Conclusion

Several factors must be taken into consideration in the development of new materials, and it is essential to build up a holistic view of inorganic substances by giving rapid access to different kinds of critically evaluated experimental data published in the world literature over the last 100 years. The PAULING FILE project was launched in 1993, and 23 years later it is the world’s largest materials database, containing over 500 000 entries for inorganic crystalline solids and summarizing over 160 000 scientific publications. The three groups of data (crystal structures, phase diagrams, physical properties) are linked by assigning each database entry to one of the distinct phases, which are defined on the basis of the chemical system and the crystal structure. With the help of several examples, we have shown that it is possible to gain a better view of inorganic substances, whether actual or potential materials, by looking at large amounts of different data in an appropriate way. Data mining applied to the PAULING FILE provides good examples of holistic views, showing that the whole is greater than the sum of its parts.
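The linkage through distinct phases can be pictured as a small data model. The following Python sketch is only an illustration under stated assumptions: the names `PhaseKey`, `Entry`, and `link_entries`, the field layout, and the sample records are hypothetical and are not taken from the PAULING FILE software. A distinct phase is identified here simply by a chemical system and a structure prototype label.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseKey:
    """Hypothetical identifier of a distinct phase."""
    chemical_system: str  # e.g. "Cl-Na" (elements sorted alphabetically)
    prototype: str        # structure prototype label, e.g. "NaCl"


@dataclass(frozen=True)
class Entry:
    """One database entry from any of the three data groups."""
    group: str            # "structure", "phase_diagram" or "property"
    reference: str        # placeholder for the literature source
    phase: PhaseKey


def link_entries(entries):
    """Group entries of all three data groups by their distinct phase."""
    linked = defaultdict(lambda: defaultdict(list))
    for entry in entries:
        linked[entry.phase][entry.group].append(entry)
    return linked


# Illustrative records only; the references are placeholders.
rock_salt = PhaseKey("Cl-Na", "NaCl")
other = PhaseKey("Cl-Cs", "CsCl")
entries = [
    Entry("structure", "ref-A", rock_salt),
    Entry("phase_diagram", "ref-B", rock_salt),
    Entry("property", "ref-C", rock_salt),
    Entry("structure", "ref-D", other),
]
linked = link_entries(entries)
print(sorted(linked[rock_salt]))
```

In this sketch, a query for one phase returns the structure, phase diagram, and property entries together, which is the kind of cross-group view the text refers to.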

References

1 Villars, P. (Editor-in-Chief) (2000). PAULING FILE. http://www.paulingfile.com/ (accessed 2018).
2 Villars, P. (1985). J. Less-Common Met. 110: 11–25.
3 Rodgers, J. and Villars, P. (eds.) (1993). Proceeding of the workshop on regularities, classification and prediction of advanced materials, Como, April 13–15, 1992. J. Alloys Compd. 197: 127–307.
4 Nishikawa, N., Nihei, M., and Iwata, S. (2003). Lect. Notes Comput. Sci. 2858: 320–329.
5 Villars, P., Berndt, M., Brandenburg, K. et al. (2004). J. Alloys Compd. 367: 293–297.
6 Inorganic Material Database (AtomWork) (2002), National Institute for Materials Science (NIMS), Japan. http://www.crystdbnims.go.jp/index.
7 Villars, P., Cenzual, K., Hulliger, F. et al. (2002). PAULING FILE – Binaries Edition, on CD-ROM. Materials Park, OH: ASM International.
8 Villars, P. (Editor-in-Chief), Hulliger, F., Okamoto, H., and Cenzual, K. (Section Editors). SpringerMaterials, Inorganic Solid Phases. Heidelberg: Springer http://www.Springerlink/SpringerMaterials.
9 Material Platform for Data Science (2017). MPDS, Switzerland. https://www.mpds.io.
10 Parthé, E. and Gelato, L.M. (1984). Acta Crystallogr., Sect. A 40: 169–183.
11 Parthé, E. and Gelato, L.M. (1985). Acta Crystallogr., Sect. A 41: 142–151.
12 Gelato, L.M. and Parthé, E. (1987). J. Appl. Crystallogr. 20: 139–143.
13 Berndt, M. (1994). Development of the software COMPARE – directly comparable Crystal Data. Thesis. University of Bonn; updates by O. Shcherban, Scientific Consulting Company “Structure-Properties”, Lviv.
14 Brunner, G.O. and Schwarzenbach, D. (1971). Z. Kristallogr. 133: 127–133.
15 Daams, J.L.C., van Vucht, I.H.N., and Villars, P. (1992). J. Alloys Compd. 182: 1–33.
16 Daams, J.L.C. (1994). Atomic environments in some related intermetallic structure types. In: Intermetallic Compounds, Vol. 1: Principles (ed. J.H. Westbrook and R.L. Fleischer), 363–383. New York, NY: Wiley.
17 Cenzual, K., Berndt, M., Brandenburg, K. et al. (2000). ESDD Software Package, copyright: Japan Science and Technology Corporation; updates by O. Shcherban, Scientific Consulting Company “Structure-Properties”, Lviv.
18 Hahn, T. (ed.) (1983 and more recent editions). International Tables for Crystallography, vol. A. Dordrecht: D. Reidel.
19 De Wolff, P.P., Belov, N.V., Bertaut, E.F. et al. (1985). Acta Crystallogr., Sect. A 21: 278–280.
20 Ewald, P.P. and Hermann, C. (eds.) (1931). Strukturbericht. Leipzig: Akad. Verlagsgesellschaft M.B.H.
21 Pearson, W.B. (1967). Handbook of Lattice Spacings and Structure of Metals. New York, NY: Pergamon.
22 Parthé, E., Gelato, L., Chabot, B. et al. (1993/1994). Gmelin handbook of inorganic and organometallic chemistry. In: TYPIX – Standardized Data and Crystal Chemical Characterization of Inorganic Structure Types, vols. 4, 8e. Heidelberg: Springer.
23 Parthé, E., Cenzual, K., and Gladyshevskii, R. (1993). J. Alloys Compd. 197: 291–301.
24 LePage, Y. (1988). J. Appl. Crystallogr. 21: 983–984.
25 Cenzual, K., Gelato, L.M., Penzo, M., and Parthé, E. (1991). Acta Crystallogr., Sect. B 47: 433–439.
26 Villars, P. and Cenzual, K. (eds.) (2016). Pearson’s Crystal Data: Crystal Structure Database for Inorganic Compounds. Materials Park, OH: ASM International, on DVD and on-line; http://www.asminternational.org/AsmEnterprise/PCD.
27 Kripyakevich, P.I. (1977). Structure Types of Intermetallic Compounds. Moscow: Nauka (in Russian).
28 GetData Graph Digitizer (2004). getdata-graph-digitizer.com (accessed 2016).
29 Massalski, T.B. (Editor-in-Chief), Okamoto, H., Subramanian, P.R., and Kacprzak, L. (eds.) (1990). Binary Alloy Phase Diagrams, 2e. Materials Park, OH: ASM International.
30 Petzow, G. and Effenberg, G. (Eds. of vols. 1–8) (1988–1995). Ternary Alloys: A Comprehensive Compendium of Evaluated Constitutional Data and Phase Diagrams, vols. 15. Weinheim: Wiley-VCH.
31 Lide, D.R. (Editor-in-Chief) (1997–1998 and more recent editions). CRC Handbook of Chemistry and Physics. Boca Raton, FL: CRC Press Inc.
32 Database of Zeolite Structures (2015), IZA Structure Commission. http://www.iza-structure.org/databases/ (accessed 2016).
33 Strunz, H. and Nickel, E.H. (2001). Strunz Mineralogical Tables, 9e. Stuttgart: E. Schweizerbart’sche Verlagsbuchhandlung (Nägele u. Obermiller).
34 IMA Database of Mineral Properties (2006). http://www.Rruff.info/ima/ (accessed 2016).
35 Villars, P., Cenzual, K., and Gladyshevskii, R. (2016). Handbook of Inorganic Substances. Berlin: De Gruyter.
36 Villars, P. and Calvert, L.D. (1991). Pearson’s Handbook of Crystallographic Data for Intermetallic Phases, 2e, vol. 1–4. Materials Park, OH: ASM International.
37 Villars, P. (Editor-in-Chief), Okamoto, H., and Cenzual, K. (Section Editors) (2016). ASM Alloy Phase Diagram Database. Materials Park, OH: ASM International http://www.asminternational.org/AsmEnterprise/APD.
38 ICDD (2016). PDF-4+. Newtown Square, PA: International Centre for Diffraction Data (ICDD).
39 Villars, P. and Cenzual, K. (eds.) (2004–2012). Landolt–Börnstein, III-43. In: Crystal Structures of Inorganic Compounds, 11 vols., Daams, J., Gladyshevskii, R., Shcherban, O., Dubenskyy, V., Kuprysyuk, V., Savysyuk, I., and Zaremba, R. (Contributors to vol. 11). Heidelberg: Springer.
40 MedeA (2008). Materials Design Inc. http://www.materialsdesign.com/ (accessed 2016).
41 Villars, P., Cenzual, K., and Penzo, M. (2016). Inorganic Substances Bibliography. Berlin: De Gruyter (e-book).
42 Villars, P., Cenzual, K., Daams, J. et al. (2004). J. Alloys Compd. 317–318: 167–175.
43 Villars, P., Daams, J., Shikata, Y. et al. (2008). Chem. Met. Alloys 1: 1–23. http://www.chemetal-journal.org.
44 Villars, P., Daams, J., Shikata, Y. et al. (2008). Chem. Met. Alloys 1: 210–226. http://www.chemetal-journal.org.
45 Villars, P. and Iwata, S. (2013). Chem. Met. Alloys 6: 81–108. http://www.chemetal-journal.org.
46 Villars, P. (1994). Factors governing crystal structures. In: Intermetallic Compounds, Principles and Practice, Vol. 1: Principles (ed. J.H. Westbrook and R.L. Fleischer), 227–275. New York, NY: Wiley.
47 Nishikawa, N., Nagano, C., and Koike, H. (1997). Integration of virtual experiment technology for materials design. In: Computerization and Networking of Material Databases, ASTM STP 1311 (ed. S. Nishijima and S. Iwata), 21–27. West Conshohocken, PA: ASTM.
48 Robinson, J.A. (1965). J. Assoc. Comput. Mach. 12 (1): 23–41.
49 Masahiko, A. (2001). Towards a Comparative Institutional Analysis. Cambridge, MA: MIT Press.
50 Plattform Industrie4.0 (2013), Bundesministerium für Wirtschaft und Energie & Bundesministerium für Bildung und Forschung; http://www.plattform-i40.de (accessed 2016).
51 R.F. Service (2016). Science 353: 338–341.


4 From Topological Descriptors to Expert Systems: A Route to Predictable Materials

Alexander P. Shevchenko¹, Eugeny V. Alexandrov¹, Olga A. Blatova¹, Denis E. Yablokov¹, and Vladislav A. Blatov¹,²

¹ Samara University, Samara Center for Theoretical Materials Science (SCTMS), Ac. Pavlov St. 1, 443011 Samara, Russia
² Northwestern Polytechnical University, School of Materials Science and Engineering, Youyi West Rd. 127, 710072 Xi’an, People’s Republic of China

4.1 Introduction

Making the development of new advanced materials easier and cheaper is a major problem of modern society, science, technology, and industry. A significant role in solving it belongs to computer tools, software, and new theoretical methods for exploring materials, as well as to databases and data exchange systems, which will make the elaboration of new materials faster, cheaper, and more predictable [1–3]. The traditional approach in materials science is based on carrying out series of experiments in order to obtain new materials and to measure their physical properties and structural characteristics. The next stage includes a search for correlations between the parameters that characterize the composition, structure, and properties of the substance. An alternative way is to compute the properties from mathematical models built with density functional theory (DFT), molecular dynamics, or Monte Carlo methods [2, 4–8]. Both ways are time consuming and do not use all available experimental information. A further alternative is to create knowledge databases and expert systems that rest upon experimentally determined and/or simulated parameters and correlations between them [9–11]. A novel trend in this direction is the development of crystallochemical methods, especially periodic-graph approaches [7, 8, 12], which are intended to fill the gap between experimental and mathematical modeling approaches. These methods are fast and can be easily adapted to specific tasks, though their results are solely qualitative or semiquantitative. The interest in these methods was caused by their predictive power, in particular the possibility of constructing coordination networks of predetermined topology from secondary building units (SBUs) [13–18]. As a result, the topological methods have found fertile ground in the design of coordination polymers, whose number has grown exponentially during the last 15 years.
The most promising properties, such as gas adsorption and sieving, catalysis, and sensing, were found for a subclass of coordination polymers, the so-called metal–organic frameworks (MOFs), which possess potentially porous structures. It is the topological methods that allowed many “composition–structure–property” correlations to be revealed for MOFs [15, 19]. The subsequent successful targeted synthesis of large families of isoreticular MOFs [15] opened the era of practical crystal design. As a result, the modern development of microporous compounds tends to replace expensive experimental screening by modeling and large-scale computational screening of desired properties such as gas uptake, enthalpy of adsorption, and gas selectivity [20–25]. To provide an effective exchange of information between experimentalists and theoreticians, databases of structures and calculated properties of known microporous materials need to be created. MOFs are the most exciting, but not the only, example of the application of topological methods in materials science. In the last few years, these methods were successfully used to predict ionic conductivity in inorganic solid electrolytes [26] and cleavage in molecular crystals [27], to elucidate complicated intermetallic structures [28], and to describe the assembly of zeolite frameworks [29]. Various computer tools were developed for topological analysis of crystal structures, their classification, and the design of new periodic motifs (ToposPro [30], Gavrog [31], RCSR [32], Zeo++ [33]). Nevertheless, crystal design remains a difficult task for some complicated compounds [34] consisting of SBUs with a high degree of freedom and many connection possibilities, which leads to a huge number of possible topologies. In this review, we consider the current state of topological approaches to processing large amounts of experimental information, the prospects of developing combined topological–DFT methods, and the ways of applying them in crystal chemistry and materials science.

4.2 Topological Tools for Developing Knowledge Databases

4.2.1 Why Topological?

The microscopic detailing of the material structure is based on crystallographic data, which are obtained from experiment (X-ray or neutron diffraction) or from theoretical modeling by quantum mechanics, molecular dynamics, or Monte Carlo simulations. Initially, these data contain information about the positions of atoms, but not about their connectivity. Although sufficient for the computation of many physical properties, they are unable to reflect the entire diversity of crystallochemical features. In particular, any “chemical composition–crystal structure” relation should account for coordination numbers of atoms, groups of connected atoms (structural building units), their dimensionalities, connection modes, and other topological properties. It is also important that topological methods allow researchers to formalize their vision of the crystal structure and those properties of the substance that follow from connections between atoms. This opens a route to a much deeper and wider computer analysis of crystal structures than traditional visualization tools can give.

There are at least two main cases where topological approaches can play a crucial role: (i) elucidation of very complicated structures that can hardly be rationalized by visual analysis and (ii) screening of large crystallographic databases for new correlations, regularities, or even general laws. To illustrate the first case, one can mention the so-called Samson monsters, intermetallic compounds with more than 1000 atoms in the unit cell (Figure 4.1a). After this structure type was discovered [35], at least four quite different models of its architecture were proposed [36–39], none of which included all atoms of the structure. The topological “nanoclustering” algorithm (see Section 4.2.4.1) allowed us not only to represent this structure as an assembly of just two types of two-shell nanoclusters (Figure 4.1b,c) but also to find its relation to simpler intermetallic compounds [40].

Figure 4.1 The crystal structure of one of Samson’s “monsters,” NaCd2. (a) The unit cell content (Na and Cd atoms). (b) The nanocluster representation. (c) Two types of two-shell nanoclusters (structural units) consisting of 61 and 63 atoms.

Irrespective of the approach, one should generally pass through the following steps to formalize data processing: (i) task definition; (ii) data sampling, or selection of relevant data in accordance with the task; (iii) data slicing and/or merging from different sources; (iv) data rationalization and finding significant descriptors; (v) transformation of data into a computer-readable format and data rectification (removing anomalies, duplicated records, contradictions, missing values, and errors); (vi) deriving new knowledge using methods of data mining; (vii) definition and refinement of predictive models; and (viii) application of new knowledge and models.

The primary information sources usually contain many inaccuracies, inconsistencies, and even errors or misprints; thus the data need to be checked before use. The data sampling at step (ii) needs objective criteria of correctness of the information, which are known only to experts in a particular area. For example, crystallographic data can be discarded in the following cases: (i) the structure is incomplete (not all atoms are allocated) or its composition is wrong (it does not fit the chemical formula of the compound) and (ii) atomic parameters (volume, coordination number, coordination figure, shape of atomic domain, interatomic distances, bond angles) are improper, i.e. they do not fall into the allowable ranges. These are general criteria for the data sampling, but additional ones can be introduced depending on the task defined at step (i). The topological information is crucial here: it is the data on wrong structure connectivity (improper coordination numbers or interatomic contacts) that are generally used to reveal inconsistencies. Merging data from different sources at step (iii) can lead to duplicates, which must be eliminated from consideration at step (v). The search for duplicates requires structure descriptors that can be used for comparing crystal structures. Developing a set of proper descriptors at step (iv) is crucial for all subsequent steps, because the descriptors will form the knowledge database for the system under consideration. Each descriptor should adequately (i.e. acceptably at a given precision level) reflect some structural property and should be adjusted for computer analysis. The initial data are usually represented in a way that is convenient for a human. At step (v) they should be transformed into a form suitable for machine processing.
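Steps (ii) to (v) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the record fields, the expected compositions in `EXPECTED`, the allowable coordination-number range `CN_RANGE`, and the duplicate key (formula plus sorted coordination numbers) are all assumptions chosen for the example.

```python
# Hypothetical records merged from two sources (step iii).
records = [
    {"id": "A1", "formula": "NaCl", "atoms": {"Na": 1, "Cl": 1}, "cn": (6, 6)},
    {"id": "B7", "formula": "NaCl", "atoms": {"Na": 1, "Cl": 1}, "cn": (6, 6)},  # duplicate of A1
    {"id": "A2", "formula": "CsCl", "atoms": {"Cs": 1}, "cn": (8,)},            # incomplete structure
    {"id": "A3", "formula": "ZnS", "atoms": {"Zn": 1, "S": 1}, "cn": (4, 40)},  # improper CN
]

# Assumed reference data: composition implied by each formula and
# an allowable range for coordination numbers.
EXPECTED = {
    "NaCl": {"Na": 1, "Cl": 1},
    "CsCl": {"Cs": 1, "Cl": 1},
    "ZnS": {"Zn": 1, "S": 1},
}
CN_RANGE = (1, 24)


def is_consistent(rec):
    """Step (ii): reject incomplete structures and improper atomic parameters."""
    complete = rec["atoms"] == EXPECTED[rec["formula"]]
    proper = all(CN_RANGE[0] <= cn <= CN_RANGE[1] for cn in rec["cn"])
    return complete and proper


def descriptor(rec):
    """Step (iv): a crude structure descriptor used to compare records."""
    return (rec["formula"], tuple(sorted(rec["cn"])))


def rectify(recs):
    """Step (v): keep the first consistent record for each descriptor value."""
    seen, kept = set(), []
    for rec in filter(is_consistent, recs):
        key = descriptor(rec)
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


clean = rectify(records)
print([rec["id"] for rec in clean])
```

With these sample records only `A1` survives: `B7` is a duplicate, `A2` lacks an atom required by its formula, and `A3` has a coordination number outside the assumed range. A real descriptor would have to be far more discriminating than the one used here.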
As an example of such a transformation, porous structures can be rationalized if one uses the concept of tiling and describes the system of cavities and channels as a system of tiles and their faces (windows; see Section 4.2.4). Step (vi) will be discussed in Section 4.2.4. The last two steps use the results of the analysis obtained at step (vi) and depend significantly on the subject area; therefore, we do not consider them here.

4.2.2 Topological vs. Other Descriptors of Crystal Structures

For deriving knowledge, a set of descriptors is needed for two main reasons: (i) descriptors help one to construct a general classification scheme and a hierarchical system of correlations for a particular sample and (ii) chemical and physical properties can be treated as relations between the descriptor values. For example, there are at least seven approaches with different sets of descriptors for evaluating the geometry of pores in crystal structures [8, 41–51] and predicting the possibility of adsorption of molecules of a specified shape and size. The descriptors of crystal structures can be divided into two groups, which characterize structural and physical properties. The first group describes chemical composition, structure symmetry, geometry, and topology; all structural descriptors are essentially interrelated. For example, to name a compound in accordance with the International Union of Pure and Applied Chemistry (IUPAC) nomenclature, the elemental composition should be supplemented with geometrical and topological information to distinguish isoskeletal, geometrical, or optical isomers. Physical properties offer a large variety of descriptors, such as gas uptake, bandgap, magnetic susceptibility, emission wavelength, Gibbs free energy, extinction coefficient, polarizability, etc., which intrinsically correlate with structural properties. For example, circular dichroism is predetermined by the chiral symmetry of the material (absence of rotoinversion axes). There are correlations between structural transformations of MOFs, their sorption behavior, and the stability of the crystal structure [52]. Further, the information on the bandgap width is important for chemical sensors. In turn, this value can correlate with such parameters of the structure as the chemical type of the central atoms [53, 54], ligand composition [55, 56], and structure topology [57, 58]. In addition, the electron density distribution and the electrostatic potential map influence the adsorption parameters [59]. In what follows, we mainly consider structural descriptors because correlations between them are the most explored; their correlations with physical properties are only beginning to be established and have been reported much more rarely. Geometrical structural descriptors (atomic positions and occupancies, unit cell dimensions, space group, etc.) have been used since the beginning of X-ray analysis because they are obtained as a result of Fourier mapping of experimentally measured magnitudes (intensities of X-ray or neutron beam reflections). Topological descriptors form the next level of knowledge, as they are derived from the geometrical parameters.
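How a topological descriptor follows from geometrical ones can be shown with a short Python sketch. It derives coordination numbers from a unit cell and fractional atomic coordinates using a simple distance cutoff. The sketch assumes a cubic cell, an approximate CsCl-type geometry, and a cutoff value (`CUTOFF`) picked for this example; established tools use more elaborate criteria for deciding which contacts count as bonds.

```python
import itertools
import math

# CsCl-type structure: cubic cell with fractional coordinates (illustrative values).
a = 4.12  # cell edge in angstroms (approximate)
atoms = [("Cs", (0.0, 0.0, 0.0)), ("Cl", (0.5, 0.5, 0.5))]
CUTOFF = 3.8  # assumed maximum bond length in angstroms


def neighbours(index, atoms, a, cutoff):
    """Return (atom index, cell translation) pairs bonded to atom `index`."""
    _, origin = atoms[index]
    found = []
    # Search the home cell and its 26 adjacent cells (enough for this cutoff).
    for shift in itertools.product((-1, 0, 1), repeat=3):
        for j, (_, frac) in enumerate(atoms):
            image = tuple(f + s for f, s in zip(frac, shift))
            # For a cubic cell the Cartesian distance is a times the fractional one.
            d = a * math.dist(origin, image)
            if 1e-6 < d <= cutoff:
                found.append((j, shift))
    return found


# Geometrical data in, topological descriptors out: the neighbour list is the
# edge list of the net, and its length is the coordination number of the node.
for i, (element, _) in enumerate(atoms):
    print(element, "CN =", len(neighbours(i, atoms, a, CUTOFF)))
```

For this example both atoms obtain a coordination number of 8, and the list of (atom, translation) pairs is exactly the kind of connectivity information that the atomic coordinates alone do not state explicitly.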
Usually, the topological information on crystal structures is presented at four levels of their organization [60–62]: (i) coordination properties of atoms, (ii) topology of structural units and their coordination, (iii) overall structural topological motif, and (iv) entanglements of atomic networks. In Table 4.1, we enumerate the topological descriptors that have already been used for rationalization of structural features and phenomena. We give only short explanations of the descriptors in terms of crystal structure; the reader can find more detailed and rigorous definitions in [63–71]. Compared with geometrical descriptors, the topological ones have significant advantages: (i) they reflect structural relations independently of symmetry distortions and chemical composition (isoreticular series, decorated–undecorated net relations) [67, 68]; (ii) they can describe entangled motifs (interpenetration, polycatenation, polythreading) [72]; (iii) structural units (molecules, ligands, clusters, tiles, low-periodic substructures) can be selected according to strict topological criteria [69, 73]; and (iv) they can be combined with other descriptors, for example, with geometrical ones to characterize systems of channels in solid electrolytes [26, 74] or with electronic ones in the Bader analysis of electron density [75–77].

4.2.3 Topological vs. Crystallographic Databases

The values of structural descriptors need to be deposited in electronic databases for effective retrieval and search for correlations. Usually, such databases have either a unique binary format (crystallographic databases like CSD or ICSD, ToposPro Topological Collections, etc.) or a textual format (CIF files, RCSR, or EPINET databases).

Table 4.1 Structural topological descriptors.

| Descriptor | Structural analog | Application to crystal structure |
|---|---|---|
| Node or vertex | Atom, center of structural group | Description in terms of nets |
| Edge | Interatomic interaction, link between structural groups | Description in terms of nets |
| Net | System of structural groups and links between them | Formalization of overall connectivity |
| Ring | Ring of connected structural groups | Representation of channels, description of entanglements |
| Tile | Empty polyhedron with atoms in vertices | Representation of cages and cavities |
| Coordination number or degree | Number of structural groups connected to the given one | Formalization of local connectivity |
| Coordination sequence | Composition of far coordination shells | Formalization of local connectivity |
| Coordination mode | Coordination type of ligands or other structural groups | Formalization of local connectivity |
| Shell graph | Method of connectivity of structural groups in a limited region | Formalization of local and overall connectivity |
| Hopf and extended ring net | Method of entangling of structural groups | Formalization of entanglements |

The Cambridge Structural Database (CSD), the first crystallographic database, was started in 1965 as “a computer-based file containing both bibliographic information and numerical data abstracted from the literature and relevant to molecular crystal structures, as obtained by diffraction methods” [78]. It contains information about all published crystal structures of organic, metal–organic, and organometallic compounds; in 2015, the number of entries exceeded 800 000. Since the 1990s, the Cambridge Crystallographic Data Centre has been working on transforming the CSD into a knowledge database for analysis and prediction of molecular geometry and packing in the crystalline state [79–83]. The CSD analytical tools use a set of molecular descriptors. First, these are standard crystallochemical parameters, such as interatomic distances; bond, plane, and torsion angles; bond type; etc. For prediction of intermolecular interactions between neighboring molecules, atomic descriptors (van der Waals radius, donor–acceptor properties of atoms) are applied. The statistical treatment of the molecular descriptors and the correlations between them is implemented in a powerful data analysis module. Such an analysis over the whole CSD led to the development of Mogul, the knowledge database of molecular geometries. Mogul provides rapid access to information about preferred values of bond lengths, valence angles, acyclic torsion angles, and the geometry of isolated ring systems. One more set of descriptors, invented to predict polymorphs of molecular crystals, rests upon information about the number and type of functional groups involved in hydrogen bonding. This set includes qualitative parameters, such as the type of donor or acceptor functional group, and quantitative ones, such as the competition function and the steric density function. The probability of formation of the hydrogen bonds is calculated in the program IsoStar by statistical analysis of relevant structures in the CSD.

The Inorganic Crystal Structure Database (ICSD) [84, 85], created by FIZ Karlsruhe, provides access to crystallographic data on about 185 000 inorganic crystal structures.

Pearson’s Crystal Data (PCD) is a crystallographic database for inorganic compounds published by ASM International [86] and edited by Pierre Villars and Karin Cenzual [87]. It originated from the PAULING FILE project [88] and stores information about more than 274 000 crystal structures of inorganic materials. Each entry is linked to external sources like ASM International Alloy Phase Diagrams Centre Online, SpringerMaterials, and PAULING FILE Multinaries, as well as to original publications.

The Powder Diffraction File (PDF) is supported by the International Centre for Diffraction Data and traditionally focuses on collecting powder patterns [89]. However, the latest PDF editions (PDF-4) include information on atomic coordinates in more than 300 000 inorganic crystal structures and minerals as well as in almost 10 000 organic compounds.
Crystallography Open Database (COD) is an open-access collection of crystal structures of organic, inorganic, and metal–organic compounds and minerals, excluding biopolymers [90]. The database has more than 360 000 entries. Protein Data Bank (PDB) contains crystallographic information for 12 1654 biological macromolecular structures (nucleic acids and proteins) [91–93]. All the crystallographic storages except the CSD do not contain topological or other special structure descriptors, correlations between descriptors; currently they are mainly used as electronic manuals, not as tools for crystal structure prediction of materials design. The next group of databases can be called topological: they collect the topological descriptors derived from the crystallographic databases. Reticular Chemistry Structure Resource (RCSR) contains highest-symmetry embeddings of two- and three-periodic networks as well as some polyhedra to be most useful in crystal chemistry [32, 94]. The 3D nets are of interest to crystal design or to the theory of periodic graphs and tilings. Besides crystallographic data, each record in RCSR bears topological indices like point and vertex symbols, coordination sequences [68], or tiling face symbols [95]. Database of Zeolite Structures contains structural information on the zeolite framework types that have been approved by the Structure Commission of the


International Zeolite Association [96]. It includes many geometrical and topological descriptors for each framework type, as well as its building models. ToposPro Topological Collections (TTC) are intended for topological analysis and classification of both periodic networks and various types of finite structural units [30, 71, 97]. The TTC can help in solving the following tasks of crystal design: (i) determine the overall topology of an atomic network; (ii) determine the topological types of nanoclusters, polynuclear complex groups, molecules, ligands, or polyhedral zeolite cages; (iii) find all examples of crystal structures in which a given topological type was observed; (iv) build distributions of topological motifs by their occurrence; (v) find relations between different topological types, i.e. which topologies can be transformed into each other; and (vi) obtain information about connection types of ligands in coordination compounds, molecules in molecular crystals, or nanoclusters in intermetallic compounds. The following collections currently compose the TTC:

TTA (topological types of atoms) collection stores the values of 35 Voronoi-polyhedron-based descriptors [98] of more than one million independent atoms from more than half a million crystal structures.

TTD (TOPOS topological database) collection is the largest database of topological types for hypothetical or observed nets as well as finite graphs. It currently contains 11 annually updated databases, which store more than 140 000 graphs and networks. The TTD collection is used for automatic assignment of a crystal structure to a topological type [71, 99].

TTO (topological types observed) collection matches topological types of abstract nets and graphs collected in the TTD with examples of observed crystal structures. The TTO collection contains such descriptors as net periodicity, degree of interpenetration, topological type, and type of the structure representation. It consists of 14 databases storing information for more than 1.7 million representations of inorganic, organic, and metal–organic compounds.

TTR (topological types relations) collection is based on the TTO collection and lists all ways of transformation from one net to another that are realized in crystal structures [15, 73].

TTL (topological types of ligands) collection stores more than 160 000 ligands and their coordination modes in mononuclear, polynuclear, and polymeric coordination compounds.

TTM (topological types of molecules) collection contains data on more than 307 000 molecules, their sizes, forms, and types of connection in crystals [100, 101].

TTN (topological types of nanoclusters) collection stores the data on chemical composition, topological structure, and methods of connection of more than 2000 polyshell nanoclusters in intermetallic compounds [28].

The development of topological methods has led to databases that can be considered intermediate between crystallographic and topological ones. They contain information on both the positions of atoms, not just abstract nodes, and their connections within the whole structure. We consider here two groups of such databases.
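The assignment of a net to a topological type by comparing topological indices can be illustrated with a small sketch. It computes the coordination sequence of a periodic net from a labeled quotient graph (a breadth-first search over pairs of node and unit-cell offset) and looks the result up in a tiny reference table. The edge-list format, the function names, and the three-entry table are assumptions made for this illustration; they are not the ToposPro or TTD formats, and real assignments rely on further indices (point symbols, vertex symbols), because a coordination sequence alone need not be unique.

```python
from collections import defaultdict

def coordination_sequence(edges, start, shells):
    """Count nodes in successive topological shells around `start`.

    `edges` is a list of (u, v, offset): node u in the reference cell is
    linked to node v in the cell shifted by `offset` (hypothetical format).
    """
    adjacency = defaultdict(list)
    for u, v, offset in edges:
        adjacency[u].append((v, offset))
        adjacency[v].append((u, tuple(-x for x in offset)))  # reverse direction
    dim = len(edges[0][2])
    origin = (start, (0,) * dim)
    seen = {origin}
    frontier = [origin]
    sequence = []
    for _ in range(shells):
        next_frontier = []
        for node, cell in frontier:
            for neighbor, offset in adjacency[node]:
                key = (neighbor, tuple(c + o for c, o in zip(cell, offset)))
                if key not in seen:
                    seen.add(key)
                    next_frontier.append(key)
        sequence.append(len(next_frontier))
        frontier = next_frontier
    return sequence

# Quotient graphs of three well-known nets (illustrative encodings).
SQL = [("A", "A", (1, 0)), ("A", "A", (0, 1))]
HCB = [("A", "B", (0, 0)), ("A", "B", (1, 0)), ("A", "B", (0, 1))]
PCU = [("A", "A", (1, 0, 0)), ("A", "A", (0, 1, 0)), ("A", "A", (0, 0, 1))]

# Toy reference table keyed by the first three shells.
REFERENCE = {(4, 8, 12): "sql", (3, 6, 9): "hcb", (6, 18, 38): "pcu"}

def assign_type(edges, start="A"):
    """Return the reference name matching the first three shells, if any."""
    return REFERENCE.get(tuple(coordination_sequence(edges, start, 3)))

sql_cs = coordination_sequence(SQL, "A", 4)
pcu_cs = coordination_sequence(PCU, "A", 3)
```

The search treats every translated copy of a node as a distinct vertex, so the same routine works for two- and three-periodic nets without constructing a supercell.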


Databases of Computation-Ready Structures. These databases appeared quite recently in the field of microporous structures [8]. They are called computation-ready because they collect frameworks taken from real structures, but are cleaned of extraframework solvate and clathrate molecules. The frameworks can then be used in the design of new adsorption materials as objects for mathematical modeling: (i) Goldsmith’s database of real MOFs extracted from the CSD [20]. It contains structural information and data on porosity and surface area for about 22 700 non-disordered MOFs. The authors used it for screening the materials for hydrogen storage. (ii) Chung’s database of ready-for-calculation real MOFs [21]. The database contains about 4700 MOF structures and was built similarly to Goldsmith’s database, though with stronger selection criteria. The structures are systematized according to the maximum size of pores and minimum size of channels. Using it, the capacity of methane sorption of MOFs was studied. (iii) First and co-authors [22, 23] have developed two open-access databases of nanoporous materials ZEOMICS [24] and MOFOMICS [25]. They approximated channels and cages of porous structures by simple geometric shapes, namely, spheres and cylinders. The MOFOMICS database contains the parameters of these shapes in 251 real MOFs, while the ZEOMICS database stores the information for 248 known zeolites. Databases of Hypothetical Structures. Besides the databases on experimental crystal structures, the development of collections of hypothetical crystalline architectures is no less important. If the former databases show us what structures Nature prefers, the latter ones include also those that have never been obtained. Such structures can be either targets for synthesis or challenges to understand why they are unrealistic, and this question is no less important than the question why other structures are stable. 
Most of these databases have been created quite recently, and they are devoted to microporous materials [8]: (i) The databases of hypothetical zeolites contain information on about 2 million structures [102–106]. (ii) The first database of hypothetical MOFs, developed by Wilmer and coauthors, stores 137 953 entries [107, 108]. Their structures are constructed from a set of 120 building blocks. The enumeration is limited to one type of node and two types of linkers. (iii) The database of zeolite-like zinc-imidazolate frameworks [109] is obtained from 300 thousand structures of hypothetical zeolites by replacing silicon atoms with zinc and oxygen atoms with imidazolate fragments. (iv) The MOFOMICS database contains information about the sizes of pores in 1424 hypothetical MOFs [25]. (v) A database of 324 500 hypothetical non-interpenetrated geometrically optimized MOF structures was created [110] using an algorithm similar to that of Wilmer [107, 108]. Initially the authors generated ∼1800 unfunctionalized base structures, and then the 324 500 MOF structures were obtained by replacing H atoms with one or two functional groups. In total,


66 building units were combined in the unfunctionalized MOFs, and 19 functional groups were utilized in their functionalization. (vi) The database of hypothetical carbon allotropes, SAmara Carbon Allotrope DAtabase (SACADA) [111], appeared as a handbook on topological and physical properties of carbon polymorphs and related materials. It contains information on more than 600 hypothetical carbon allotropes, out of which 280 have unique topologies. (vii) Special attention should be paid to the EPINET project, which explores 2D hyperbolic tilings as a source of crystal networks in 3D Euclidean space [112]. It is a striking example of networks generated with an abstract method that guarantees their complete independence of the structures found in Nature. As we see, MOFs essentially stimulated the development of topology-related databases. We can expect that this influence will persist in the near future. Most studies of microporous frameworks were devoted to new materials for sorption, separation, and storage of practically important gases: H2 [113, 114], CH4 [107, 108, 115, 116], CO2 [109, 110], and noble gases [113, 114]. In these studies, the characteristics of the materials were evaluated by the geometrical parameters of the porous structure (volume, size, area), by energetic parameters of host–guest interactions, or by grand canonical Monte Carlo simulations of the adsorption process. However, the existing parameters and methods are insufficient for comprehensive analysis and effective prediction of the sorption of molecules. There is a need for new descriptors (geometrical, topological, energetic, and combined) and for new methods of searching for correlations between the sorption characteristics and the parameters of host and guest molecules to explain the features of the sorption process.
As a result, we can expect that more and more data on physical properties will be included into the existing databases, and new databases focused on combined structural–physical descriptors will be created. For example, screening microporous materials for sensors requires information on the band structure to be included in the database.

4.2.4 Deriving Topological Knowledge from Crystallographic Data

4.2.4.1 Algorithms for Topological Analysis

Topological approaches are useful not only because they allow one to formalize crystal structure representations in terms of strictly defined descriptors but also because they provide rigorous algorithms to derive new knowledge from the crystallographic data. For example, all main ways of representation of the crystal structure of a coordination polymer or MOF can be formalized in terms of so-called underlying net, i.e. the net of structural groups. The corresponding algorithms were implemented in the program package ToposPro [30]; they start with representation of the structure as a periodic graph, whose nodes and edges correspond to atoms and interatomic interactions [73] (Figure 4.2). If the user chooses the standard representation algorithm, at the next step the program specifies metals and ligands as structural units and squeezes them to their centers of mass. After removing 0-, 1-, and 2-coordinated nodes, the user obtains the underlying net, which reflects the overall motif of connection of the structural units.
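The sequence of steps described above (contract each structural unit to a single node, then remove 0- and 1-coordinated nodes and replace 2-coordinated nodes by edges) can be sketched on a finite toy graph. This is only an illustration under stated assumptions: the atom labels, the `unit_of` mapping, and both function names are hypothetical, the graph is finite rather than periodic, and parallel edges are merged. It is not the ToposPro implementation.

```python
from itertools import combinations

def collapse_units(bonds, unit_of):
    """Contract every structural unit to one node of a simplified net."""
    net = {}
    for a, b in bonds:
        ua, ub = unit_of[a], unit_of[b]
        net.setdefault(ua, set())
        net.setdefault(ub, set())
        if ua != ub:                      # bonds inside a unit disappear
            net[ua].add(ub)
            net[ub].add(ua)
    return net

def underlying_net(net):
    """Remove 0-/1-coordinated nodes and turn 2-coordinated nodes into edges."""
    net = {node: set(nbrs) for node, nbrs in net.items()}
    changed = True
    while changed:
        changed = False
        for node in list(net):
            degree = len(net[node])
            if degree <= 1:               # isolated or terminal unit
                for other in net[node]:
                    net[other].discard(node)
                del net[node]
                changed = True
            elif degree == 2:             # bridge: link its two neighbors
                a, b = net[node]
                net[a].discard(node)
                net[b].discard(node)
                net[a].add(b)
                net[b].add(a)
                del net[node]
                changed = True
    return net

# Toy complex: four metal atoms, every pair bridged by a two-atom ligand,
# plus one terminal ligand atom on each metal (all labels are made up).
metals = ["M1", "M2", "M3", "M4"]
unit_of = {m: m for m in metals}
bonds = []
for i, j in combinations(metals, 2):
    donor_a, donor_b = f"N_{i}{j}_a", f"N_{i}{j}_b"
    unit_of[donor_a] = unit_of[donor_b] = f"L_{i}{j}"
    bonds += [(i, donor_a), (donor_a, donor_b), (donor_b, j)]
for m in metals:
    unit_of[f"O_{m}"] = f"T_{m}"
    bonds.append((m, f"O_{m}"))

result = underlying_net(collapse_units(bonds, unit_of))
```

In this toy case only the four metal nodes survive, each 3-coordinated, because the terminal ligands are pruned and the bridging ligands become edges. In a periodic structure the same pruning is applied to the infinite net, where it cannot consume the whole framework.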

Figure 4.2 Steps of the underlying net construction and classification in the standard representation. (a) Fragment of the structure of [Zn2(μ4-bdc)2(μ2-dabco)] (bdc, 1,4-benzenedicarboxylate anion; dabco, 1,4-diazabicyclo(2.2.2)octane) with determined interatomic bonds. Source: Kim et al. 2009 [117]. Reproduced with permission of John Wiley and Sons. (b) Ligands are selected as structural units (highlighted in green and magenta). (c) Selected structural units (ligands) are simplified into their centers (green and magenta balls). (d) 2-Coordinated nodes are replaced by edges. (e) The resulting underlying net belongs to the topological type xah (see the three-letter nomenclature in [32]) according to the computed topological indices: TS, total point symbol; CS, coordination sequence; ES, extended point symbol; VS, vertex symbol; sqc320, the second name of the net from the EPINET database; ID, identification key in the database. Source: O’Keeffe et al. 2008 [32]. Reproduced with permission of American Chemical Society.
Panel (e), ToposPro output (http://www.topospro.com): xah; sqc320; ID: 1049; TS: {4^2.6^2.8^2}{4^6.6^4}. Node 1: CS 5 13 26 49 73 98 145 185 218 293; ES [4.4.4.4.4.4.6.6.6.6]; VS [4.4.4.4.4.4.6.6.6.6]. Node 2: CS 4 10 26 44 68 110 132 172 250 268; ES [4(3).4(3).6.6.8(14).8(14)]; VS [4(3).4(3).6.6.8(8).8(8)].

Another choice of the structural units for the coordination polymer is provided with the cluster representation algorithm, which uses the length of the shortest ring passing through the edges of the initial structure net as the criterion for division of all interatomic interactions to intracluster and intercluster (Figure 4.3). In this case, the program chooses structural units irrespective of their chemical composition resting upon only topological criteria. As a result, the topological motifs of different structures can be grouped into distinct topological types, which are unambiguously characterized by sets of topological indices (Figures 4.2 and 4.3). In this way, various types of crystal structures have been analyzed: inorganic [118], organic [119], metal–organic [73], and intermetallic [120]. Different nature of crystal structures and different structural properties to be accounted for invoked a number of other topological algorithms that were also implemented into ToposPro: (i) The algorithm for determining interatomic interactions of different types and building the adjacency matrix of the crystal network [121, 122]. It represents an example of expert system, because it mimics the way of thinking of the crystal chemist when analyzing bonds between atoms. When drawing a conclusion it rests upon a complex of chemical, geometrical, and topological parameters. (ii) The algorithm for building migration paths for mobile cations in solid electrolytes and cathode materials. The algorithm is based on the so-called Voronoi network of vertices and edges of Voronoi polyhedra of all framework (not mobile) atoms: the vertices and edges correspond to centers of voids and lines of channels between voids [74]. Applying additional descriptors, both geometrical and physical, like interatomic distances, ionic radii, or migration energies, one can find prospective fast-ion materials in a good agreement with experiment [26, 123]. 
(iii) The “Nanoclustering” algorithm, which represents the crystal structure as an assembly of onion-like polyshell nanoclusters [40, 64, 120]. The structural units are again chosen without intervention of the researcher, and the whole procedure, from construction of the initial network to its representation as an underlying net of nanoclusters, is fully automated. (iv) The algorithm for building the natural tiling of a network. The natural tiling is composed of minimal cages, which are built from minimal rings (windows) in accordance with strict topological criteria [69]. Any larger cage can be assembled from the natural tiles like a LEGO figure. (v) The algorithm for automated separation of polynuclear coordination clusters, which is important for screening crystallographic databases for prospective molecular magnets [124–126].

4.2.4.2 Building Distributions of Descriptors

To find correlations between structural descriptors, one should first build their distributions. Even a simple analysis of occurrences gives important information that can be used in the design of new compounds. For example, when planning the synthesis of new coordination compounds, including polymers and MOFs, one should first choose proper structural units (metals and ligands). Using the TTA collection one can find that the most popular metals in coordination compounds are Cu and Fe (Figure 4.4). To illustrate more complicated correlations, we consider below only the copper compounds that do not contain other metal atoms. Taking into account the nearest environment of the metal atom, one can find that the Cu complexing atom prefers coordination numbers 4, 5, or 6 (Figure 4.5) and mostly connects to O and/or N donor atoms (Figure 4.6). More detailed correlations can be obtained from the analysis of interatomic distances. Thus, the Cu coordination centers form three main groups of valence contacts in the ranges 1.8–2.2, 2.2–2.5, and 2.5–3.0 Å (Figure 4.7), depending on the oxidation state and the sort of surrounding atoms. The Jahn–Teller distortion is evident in the range 2.3–3.0 Å, while distances of more than 3.3 Å can be assigned to van der Waals interactions. This analysis, routine for the crystal chemist, includes all types of structural descriptors related to chemical composition and to geometrical and topological properties. Further, the structural units can be selected and analyzed. For example, the distribution of the copper coordination compounds over the ligand composition (Figure 4.8) shows the most frequent ligands. Note that the oxo-ligand in most cases corresponds to water or hydroxyl. This is the result of incomplete crystallographic data: in many cases the hydrogen atoms are not located in the X-ray experiment.

Figure 4.3 Steps of the underlying net construction and classification in the cluster representation. (a) A fragment of the structure of [Zn2(μ4-bdc)2(μ2-dabco)] (bdc, 1,4-benzenedicarboxylate anion; dabco, 1,4-diazabicyclo(2.2.2)octane) with determined interatomic bonds. Source: Kim et al. 2009 [117]. Reproduced with permission of John Wiley and Sons. (b) dabco, benzene rings, and paddle wheels are selected as structural units (highlighted in magenta, green, and yellow, respectively). (c) Selected structural units (clusters) are simplified into their centers (magenta, green, and yellow balls). (d) 2-Coordinated nodes are replaced by edges. (e) The resulting underlying net belongs to the topological type pcu.

Figure 4.4 Distribution of 722 012 metal coordination centers. Axes: occurrence (%) versus element. Labeled values: Cu 8.43, Fe 7.39, Ru 4.41, Mo 3.99, W 2.55, Pt 2.45, Sn 2.23, K 2.14, U 0.97.

Figure 4.5 Distribution of coordination numbers of 73 530 copper atoms in 44 278 copper-containing structures. Axes: occurrence (%) versus coordination number.

Figure 4.6 Distribution of chemical composition of the coordination polyhedra of copper atoms. Axes: occurrence (%) versus chemical composition of the coordination polyhedron (CP).
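Building such occurrence distributions amounts to counting descriptor values and normalizing the counts. The sketch below shows this for coordination numbers and for Cu–X distances sorted into the ranges quoted above. The input lists are made-up toy values, not data from the TTA collection, and the half-open treatment of the range boundaries as well as the "unassigned" label for 3.0–3.3 Å are assumptions of this example.

```python
from collections import Counter

def occurrence_percent(values):
    """Return {value: occurrence in %}, most frequent values first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: 100.0 * n / total for value, n in counts.most_common()}

def contact_group(r):
    """Sort a Cu-X distance r (in angstroms) into the ranges named in the text."""
    if 1.8 <= r < 2.2:
        return "valence, 1.8-2.2"
    if 2.2 <= r < 2.5:
        return "valence, 2.2-2.5"
    if 2.5 <= r <= 3.0:
        return "valence, 2.5-3.0"
    if r > 3.3:
        return "van der Waals"
    return "unassigned"          # boundary region not classified in the text

# Hypothetical toy samples, for illustration only.
coordination_numbers = [4, 4, 5, 6, 6, 6, 4, 5]
distances = [1.95, 2.01, 2.30, 2.75, 3.45]

cn_distribution = occurrence_percent(coordination_numbers)
contact_distribution = occurrence_percent(contact_group(r) for r in distances)
```

The same counting function applies to any categorical descriptor, for example ligand composition or coordination mode, once it has been extracted for every structure in the sample.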

Figure 4.7 Distribution of distances between copper atoms and nonmetal atoms r(Cu–Nm), which form a face in the Voronoi polyhedron of copper atoms (blue line) or only for valence contacts Cu–Nm (orange line). Both curves coincide with each other at short distances. Axes: occurrence (%) versus r(Cu–Nm).

Figure 4.8 Distribution of the most abundant ligands in the copper coordination compounds. Oxo-, hydroxo-, and aqua-ligands could not be distinguished in the structures where the H atoms are not determined. Abbreviations: phen, 1,10-phenanthroline; 2,2′-bipy, 2,2′-bipyridine; ac, CH3COO−; Ph3P, (C6H5)3P; 4,4′-bipy, 4,4′-bipyridine; py, pyridine; F6acac, hexafluoroacetylacetonate. Axes: occurrence (%) versus ligand.

The most abundant coordination modes of ligands (CML) are terminal M1 (monodentate) and B01 (bidentate chelate) (Figure 4.9; see Ref. [127] for the nomenclature of coordination modes). This is dictated by the denticity (number of donor atoms) and geometry of the ligands (for example, the B01 mode for 2,2′-bipyridyl) and by the metal/ligand ratio. These factors lead to a high occurrence of molecular coordination compounds. Thus, for polydentate ligands the occurrence of terminal-like modes (T001 or K0001) is also high, which is characteristic of molecular complexes. From the distribution of overall topologies, we can extract information about structure periodicity and topological type. The distribution of the copper-containing coordination groups over their periodicity (Figure 4.10) shows 0D molecules to be the most abundant; chain (1D) polymers occupy the second place, while 2D and 3D polymers are less widespread. The distribution over topological types shows a well-known regularity, which was found before for coordination networks [19, 60, 73]: only a few topological types are abundant (Table 4.2). The data on the degree of interpenetration of coordination networks can also be summarized in a distribution (Figure 4.11), where the major part of the 3D complexes consists of a single framework (1917 structures), while interpenetrating structures are less common (446). The most widespread degree of interpenetration (number of interpenetrating motifs, Z) is 2 (273 structures), and the number of structures sharply decreases with increasing Z.

Figure 4.9 Distribution of ligands over most widespread coordination modes. Axes: occurrence (%) versus coordination modes.

Figure 4.10 Distributions of copper-containing coordination groups over their periodicity (0D, 1D, 2D, 3D).

Table 4.2 Abundant topological types in standard representation and their occurrences (ω) for 42 206 copper compounds.

0D          ω (%)    1D                    ω (%)    2D         ω (%)    3D        ω (%)
1,2M3-1     27.57    2C1                   54.88    sql        30.02    dia        6.43
1,3M4-1     15.27    2,4C4                 12.33    hcb        23.93    4,6T4      3.83
1,4M5-1     10.14    2,2,3C6                5.38    fes         6.87    cds        2.65
1M2-1        7.35    2,2,5C3                4.14    bey         4.32    srs        2.55
1,2,3M6-1    5.61    SP1-p.n.(4,4)(0,2)     3.77    4,4L1       4.23    xah        2.19
2M4-1        3.82    2,2,4C1                2.20    bex         2.68    4,8T24     2.14
1,2,4M8-3    3.52    2,2,3C4                2.10    3,4L124     1.90    ths        2.04
1,2,2M5-1    2.96    2,3,5C3                1.05    4,4L47      1.34    qzd        1.89
1,2,3M7-2    2.65    SP1-p.n.(4,4)(2,2)     0.60    3,5L2       1.12    pts        1.79
1,2,5M8-2    2.47    2,3,4C2                0.51    kgm         0.86    4,12T1     1.68
Other 996   18.64    Other 285             13.05    Other 299  22.72    Other 715 72.79

Figure 4.11 Distribution of 3D coordination networks over degree of interpenetration (Z) for the copper-containing coordination compounds. Axes: number of structures versus degree of interpenetration (Z).

4.2.4.3 Finding Correlations Between Descriptors

A more efficient way is to use the databases for finding correlations between parameters of metal/ligand coordination and network topology. For this reason, the data from the selected sample should be combined in one database, and the whole list of correlations should be automatically extracted using special tools [61]. The correlations will be stored in a knowledge database as a set of rules supplied with values for assessing their robustness. The number of correlations to be analyzed (N) depends only on the number of descriptors (n): N = n! (for example, N = 6 for n = 3). However, usually there is no need to consider all correlations, but only the most significant ones that make sense for the task under consideration. For example, for Cu-MOFs it could be important to consider the group of three properties: CML, coordination formula (CF), and underlying net in the standard representation (UNSR). Each sequence of the descriptors relates to a particular chemical task (Table 4.3).

Table 4.3 All possible correlations between the ligand coordination mode (CML), coordination formula (CF), and underlying net topology in the standard representation (UNSR) as well as the corresponding chemical tasks for Cu-MOFs.

Combination     Chemical task
CML–CF–UNSR     What overall topology corresponds to the selected structural units with the specified type of coordination
UNSR–CF–CML     What structural units correspond to a particular overall topology built for a given type of their coordination
CF–CML–UNSR     What overall topology corresponds to a particular type of coordination of specified structural units
UNSR–CML–CF     What type of ligand–metal coordination corresponds to a particular overall topology with specified structural units
CML–UNSR–CF     What types of ligand–metal coordination correspond to the specified structural units, which form the coordination group with a specified overall topology
CF–UNSR–CML     How the structural units are coordinated to result in a certain type of the overall network topology
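The count N = n! is simply the number of orderings of the n descriptors. As a small check for the three descriptors considered here, the following sketch enumerates all orderings; the hyphen-joined labels are a plain-text convention chosen for this example.

```python
from itertools import permutations
from math import factorial

descriptors = ["CML", "CF", "UNSR"]

# Every ordering of the descriptors is one descriptor sequence.
sequences = ["-".join(order) for order in permutations(descriptors)]

# The number of sequences equals n! for n descriptors.
assert len(sequences) == factorial(len(descriptors))

for sequence in sequences:
    print(sequence)
```

Because n! grows quickly (24 sequences for four descriptors, 120 for five), restricting the analysis to the chemically meaningful sequences becomes necessary as soon as the descriptor set is enlarged.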

Table 4.4 Relationships CML–UNSR–CF for 4196 two- or three-periodic copper-containing coordination compounds. In the coordination formulae, the digits written directly after a ligand letter denote its coordination mode (B2, M1, K4, and so on), and an underscore precedes a stoichiometric subscript. Entries marked † were reassembled from indices displaced in the source layout.

CML          N     ω (%)    UNSR       N     ω (%)    CF                N     ω (%)
B2           909   21.7     sql        469   51.6     AB2_2             192   40.9
                                                      AB2_2M1_2         144   30.7
                                                      AB2_2M1            93   19.8
                            hcb         97   10.7     A_2B2_3            30   30.9
                                                      AB2_2 †            21   21.6
                            dia         90    9.9     AB2_2              71   78.9
                            qzd         36    4.0     AB2_2              35   97.2
                            cds         36    4.0     AB2_2              25   69.4
                            pcu         25    2.8     AB2_3 †            21   84.0
B2 and M2    208    5.0     hcb        129   62.0     AB2M2              97   75.2
                            sql         37   17.8     AB2M2              25   67.6
K4           263    6.3     4,4L1       87   33.1     AK4M1              80   92.0
                            pts         25    9.5     AK4 †              21   84.0
B2 and M3    141    3.4     bey         63   44.7     A_2M3_2B2          63   100
                            3,4L124     36   25.5     A_2M3_2B2          36   100
B2 and K4    176    4.2     xah         39   22.2     A_2K4_2B2          39   100
                            4,5T4       24   13.6     A_2K4_2B2 †        24   100
G6           103    2.5     4,6T4       62   60.2     A_3G6_2            39   62.9
                                                      A_3G6_2*M1_3       22   35.5
                            4,6T119     22   21.4     A_3G6_2*M1_3       21   95.5
O8            96    2.3     4,8T24      36   37.5     A_2O8M1_2          33   91.7
K21           87    2.1     hcb         34   39.1     AK21M1 †           24   70.6

For example, the CML–UNSR–CF correlations for two- and three-periodic copper-containing coordination groups are shown in Table 4.4. Using these data one can determine a spectrum of the most probable overall topological motifs for a particular CML. Thus, if we use only the bridge ligands with CML = B2 for the synthesis of copper-containing compounds, we get the sql underlying net with a probability of more than 50%. The overall topologies hcb, dia, qzd, cds, and pcu can also be realized, with much lower probabilities from 2.8% to 10.7%. The number of terminal ligands is unlimited as they do not directly influence the net connectivity, so the coordination formulae AB2_2, AB2_2M1, and AB2_2M1_2 (an underscore precedes a stoichiometric subscript) are possible for the sql underlying topology. If we use ligands with the B2 and M2 bridge coordination modes, the topological type hcb will be the most probable. As a result the following rules can be stored in a knowledge database:

Rule I: IF (Me='Cu') AND (Number_of_sorts(Me)=1) AND (CML='B2') THEN UNSR IN (sql WITH P = 51.6%, hcb WITH P = 10.7%, dia WITH P = 9.9%, qzd WITH P = 4.0%, cds WITH P = 4.0%, pcu WITH P = 2.8%)

Rule II: IF (Me='Cu') AND (Number_of_sorts(Me)=1) AND (CML='B2&M2') THEN UNSR = hcb WITH P = 62.0%

An expert system can then use such rules to help the chemist in the design of new copper coordination compounds. To make the prediction stricter and more detailed, the set of descriptors can be enlarged, for example, with the type of the coordination figure of the Cu atom or the chemical composition of the ligands. The format of the knowledge database should be flexible enough to accept new descriptors and rules. Thus, the problem of developing a universal data storage becomes crucial.
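The probabilities in Rules I and II follow directly from the occurrence counts in Table 4.4: each probability is the number of structures with a given underlying net divided by the number of structures with the given CML. The sketch below recomputes them and shows how a program could query the resulting rules. The dictionary layout and the function names are assumptions made for this illustration and do not reproduce the format of an actual knowledge database.

```python
def build_rules(counts):
    """Turn occurrence counts into ranked (net, probability in %) lists."""
    rules = {}
    for cml, (total, nets) in counts.items():
        ranked = sorted(nets.items(), key=lambda item: -item[1])
        rules[cml] = [(net, round(100.0 * n / total, 1)) for net, n in ranked]
    return rules

def most_probable_net(rules, cml):
    """Return the (net, probability) pair ranked first for a given CML."""
    return rules[cml][0]

# Counts taken from Table 4.4: total structures per CML, then per net.
counts = {
    "B2": (909, {"sql": 469, "hcb": 97, "dia": 90,
                 "qzd": 36, "cds": 36, "pcu": 25}),
    "B2&M2": (208, {"hcb": 129, "sql": 37}),
}

rules = build_rules(counts)
best_for_b2 = most_probable_net(rules, "B2")
best_for_b2_m2 = most_probable_net(rules, "B2&M2")
```

Storing the counts rather than the rounded percentages keeps the rules easy to update when new structures are added to the sample, since the probabilities can be recomputed at any time.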

4.2.5 Universal Data Storage

Once structural databases like CSD, Inorganic Crystal Structure Database (ISCD), and PDB are used in a local computer system for a limited number of users, these formats look reasonable, but further development of the databases, their transformation to knowledge-based systems, will require other advanced methods of data storage that will be considered in this section. Such systems should contain quite diverse information from structural databases, topological collections, and storages of physical descriptors and, furthermore, information about correlations between all the descriptors. Moreover, the format of the data storage systems should be flexible enough to provide easily their extension [128] with new types of descriptors and correlations. Since this problem is crucial for the development of knowledge-based predicting systems, we consider it in this section in detail. We start with brief description of main methods of representing and storing information in databases, the so-called data patterns [129]: Declarative pattern: This pattern [130] describes the data “as is,” i.e. each notion in the subject area reflects to an abstraction used in modeling. For example, in the description of atomic network (Figure 4.12), each instance of the network contains network atoms, the general properties of which (mass, number of electrons, etc.) are described in the “atom” entity. The network atoms are united by the network bonds, and the corresponding entities can have attributes, which depend on

Figure 4.12 Declarative pattern.

4.2 Topological Tools for Developing Knowledge Databases

a particular crystal structure, where the network is realized, for example, oxidation states of the atoms or distances of the bonds. This kind of pattern is easy to understand, implement, and support. However, it has important disadvantages, which hinder development of the corresponding databases. First, the storage structure is initially fixed and has to be changed before adding any new entity or attribute. For this purpose, a refactoring procedure like “Introduce New Table” or “Introduce New Column” [131] has to be performed that often requires changing the information scheme of the database. Second, this pattern is non-universal and has a low level of abstraction. With developing the subject area, emerging new concepts and knowledge, the novel entities and attributes are difficult to be introduced into the existing database architecture. For example, the discovering of incommensurate structures or quasicrystals required an essential extension of the semantics of crystallographic databases, and even now, they cannot directly store the information on these objects. Advanced declarative pattern: This pattern has an additional indirection level compared with declarative pattern, where the properties of the objects are described in the additional structural units “*_attribute,” while the entities themselves contain only the attribute values (Figure 4.13). As a result, one can easily add new attributes without rebuilding the logical structure. At the same time, similar attributes can repeat for different entities that could lead to a data redundancy and even to an inconsistence in the data since identical attributes can have different descriptions. This pattern is effective if the number of entities

Figure 4.13 Advanced declarative pattern.

127

128

4 From Topological Descriptors to Expert Systems: A Route to Predictable Materials

remains the same, but only the list of their properties changes; otherwise the disadvantages of the declarative pattern retain. Contextual pattern: This pattern [130] is independent on the information context and is intended for description of the objects that can be differently treated depending on current situation even in the runtime mode. It uses associative entities, which formalize the relationships between objects and attributes as well as between attributes and relationships of entities. As a result one can assign any number of attributes to any object or relationship of objects. Any subject area can be described in terms of objects, their attributes, object relationships, and the relationship attributes (Figure 4.14). The fields “name” and “description” specify the meaning of the object, while “name,” “type,” and “size” formalize the attribute without declaration of its actual value, which can be given in the field “value” of the “object_attribute” or “relationship_attribute” entities. Thus, unlike the advanced declarative pattern, the contextual pattern separates itself from the context of the subject area. The flexibility of this pattern requires additional tools to provide nonstandard data access to the storage that can result in decreasing the system productivity. Typed contextual pattern: This pattern introduces a set of user types for the objects and their relationships (Figure 4.15) and allows one to classify the corresponding data by the criteria that have not been necessarily predetermined. The information on types of the attributes is stored in a separate structure that solves the problem of the data redundancy. At the same time, converting the attribute values in accordance to their types requires implementation of the corresponding requests or representations in the database applications. The crystallographic and Object attributes

Figure 4.14 Contextual pattern: (a) conceptual diagram and (b) ER model. [Diagram labels: Objects, Object attributes, Attributes, Relationships, Relationship attributes.]

4.2 Topological Tools for Developing Knowledge Databases

Figure 4.15 Typed contextual pattern.

topological databases mentioned in Section 4.2.3 use one of the declarative patterns, while the contextual ones have, to the best of our knowledge, never been used in materials science. To build a universal data storage [132], we should apply a contextual pattern whose structure (Figure 4.14a) is extended with an additional level of metadata and with references between object attributes and relationship attributes (Figure 4.16) to avoid the disadvantages mentioned above. Below we consider an example of such an advanced contextual pattern for crystallographic and topological data on calcium titanate (the mineral perovskite). In accordance with the typed contextual pattern concept, we specify abstract objects and their types but add a metatype level that defines the context of the type meaning in a particular subject area (Figure 4.17a). The “parent_row_id” attribute in the “object_type” entity allows one to organize a treelike data structure, which can be considered as

Figure 4.16 Advanced contextual pattern: conceptual diagram. [Diagram labels: Objects, Object attributes, Attributes, Relationships, and Relationship attributes, each with a metadata level, plus References.]

Figure 4.17 Advanced contextual pattern: (a) the objects part and (b) example of realization for the perovskite crystal structure.

multiple inheritance for a given object type [133]. A part of the universal storage for the CaTiO₃ structure, which contains three nonequivalent atoms (Ca, Ti, and O), is shown in Figure 4.17b. The attributes are described in the same abstract way. They can be primitive or composite, i.e. consisting of several primitive or composite attributes. Data types contain the information on the signature of the primitive attribute as well as the field “type_exid” for access to the table with the attribute values (Figure 4.18). After determining abstract attributes and their types, one should define object attributes with respect to the subject area. The objects then gain real properties; in our case, they can be considered as solid substances or materials. The scheme of assigning attributes to objects from the perovskite structure is shown in Figure 4.19. Each object type corresponds to a relationship instance and specifies the meaning of the instance within a given class of entities. Thus, the relationships between Ca and O atoms as well as between Ti and O atoms have the type “Bond” and the metatype “Edge” (Figure 4.20), which allows them to be treated in both chemical and topological terms. As in the case of objects, any relationship between them can be described with a number of attributes (Figure 4.21a). The difference is that the set of relationship attributes is specified for an instance, not for a type. Such typification is dynamic,


Figure 4.18 Advanced contextual pattern: (a) the attributes part and (b) example of realization for the perovskite crystal structure.

i.e. one can define the type of the relationship instance even at runtime (Figure 4.21b). The attribute values are stored in separate tables, whose names include the values of the “type_exid” fields from the description of the data types (Figure 4.18). For example, “object_attribute_value_C5EFD93443BA48E182ACA5FAB9E2E6DC” contains only string values. This approach eliminates the need to convert the attribute values to the associated data types (Figure 4.22). Thus, the universal storage organized in accordance with the advanced contextual pattern can include information of any kind and complexity. This is especially important for creating data exchange systems in materials science [129, 132], where the data can come from different fields of science and from researchers who use different terminology systems and methods of material characterization.
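The storage layout described in this section can be illustrated with a minimal sketch. It is not the schema of Figures 4.17 to 4.22: the SQLite backend, the column names other than “parent_row_id” and “type_exid,” the root type “Crystal object,” and the helper functions are assumptions made only for illustration. The sketch keeps the two ideas stressed above: object types form a tree, and attribute values live in one table per data type, named after the “type_exid” value.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE object_type (
    row_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    parent_row_id INTEGER REFERENCES object_type(row_id)  -- builds the type tree
);
CREATE TABLE object (
    row_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    type_id INTEGER NOT NULL REFERENCES object_type(row_id)
);
CREATE TABLE data_type (
    row_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    type_exid TEXT NOT NULL UNIQUE  -- suffix of the value table name
);
CREATE TABLE attribute (
    row_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    data_type_id INTEGER NOT NULL REFERENCES data_type(row_id)
);
""")


def insert(sql, params=()):
    """Run an INSERT statement and return the new row id."""
    return conn.execute(sql, params).lastrowid


def add_data_type(name, sql_type):
    """Register a data type and create its dedicated value table."""
    exid = uuid.uuid4().hex.upper()  # hypothetical way to generate type_exid
    conn.execute(
        f"CREATE TABLE object_attribute_value_{exid} ("
        f"object_id INTEGER, attribute_id INTEGER, value {sql_type})"
    )
    return insert("INSERT INTO data_type (name, type_exid) VALUES (?, ?)",
                  (name, exid))


def add_attribute(name, data_type_id):
    return insert("INSERT INTO attribute (name, data_type_id) VALUES (?, ?)",
                  (name, data_type_id))


def set_value(object_id, attribute_id, value):
    """Store a value in the table that belongs to the attribute's data type."""
    (exid,) = conn.execute(
        "SELECT d.type_exid FROM attribute a "
        "JOIN data_type d ON d.row_id = a.data_type_id WHERE a.row_id = ?",
        (attribute_id,),
    ).fetchone()
    conn.execute(f"INSERT INTO object_attribute_value_{exid} VALUES (?, ?, ?)",
                 (object_id, attribute_id, value))


def get_values(object_id):
    """Collect all attribute values of an object from the per-type tables."""
    values = {}
    for (exid,) in conn.execute("SELECT type_exid FROM data_type").fetchall():
        rows = conn.execute(
            f"SELECT a.name, v.value FROM object_attribute_value_{exid} v "
            "JOIN attribute a ON a.row_id = v.attribute_id "
            "WHERE v.object_id = ?",
            (object_id,),
        )
        values.update(rows.fetchall())
    return values


# Illustrative content for the perovskite (CaTiO3) example.
string_type = add_data_type("string", "TEXT")
integer_type = add_data_type("integer", "INTEGER")
symbol = add_attribute("symbol", string_type)
oxidation_state = add_attribute("oxidation_state", integer_type)

root = insert("INSERT INTO object_type (name, parent_row_id) VALUES (?, ?)",
              ("Crystal object", None))
atom_type = insert("INSERT INTO object_type (name, parent_row_id) VALUES (?, ?)",
                   ("Atom", root))

atoms = {}
for element, state in [("Ca", 2), ("Ti", 4), ("O", -2)]:
    atom = insert("INSERT INTO object (name, type_id) VALUES (?, ?)",
                  (element, atom_type))
    set_value(atom, symbol, element)
    set_value(atom, oxidation_state, state)
    atoms[element] = atom

print(get_values(atoms["Ti"]))
```

Because every value table holds a single data type, reading a value needs no type conversion, at the price of one lookup per data type when all attributes of an object are collected.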

4.3 Applications of Topological Tools in Crystal Chemistry and Materials Science

4.3.1 Network Topology Prediction

The topological methods, software, and databases have helped to find many correlations that are useful for the design of coordination networks [134]. Thus, strong

Figure 4.19 Advanced contextual pattern: (a) the object attributes part and (b) example of realization for the perovskite crystal structure.

Figure 4.20 Advanced contextual pattern: (a) the relationships part and (b) example of realization for the perovskite crystal structure.

correlations were found between local topological characteristics (coordination numbers of atoms or complex groups, coordination polyhedra, coordination figures, CML, and coordination formula) and the overall topology (topological type of the underlying net) [19, 60, 62, 73, 118, 135, 136]. In many cases, these correlations allow one to predict possible topological motifs with high probability from data on the chemical composition. The knowledge database can include statements like “the hxl topology underlies the coordination networks with coordination formula AB²₃” with probability p [60]:

$$p = \frac{\text{number of structures obeying the statement}}{\text{total number of structures}}$$
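As a minimal sketch of how such a probability could be evaluated, the following example counts structures in a small invented sample. The record layout, the function name, and all data are hypothetical; the coordination formula is written as the plain-text string 'AB^2_3'. We also assume that the denominator counts only the structures that satisfy the condition part of the statement, i.e. those with the given coordination formula.

```python
from typing import Iterable, NamedTuple


class Structure(NamedTuple):
    refcode: str                # hypothetical identifier of the structure
    coordination_formula: str   # e.g. 'AB^2_3'
    underlying_topology: str    # e.g. 'hxl'


def statement_probability(structures: Iterable[Structure],
                          formula: str, topology: str) -> float:
    """Fraction of structures with `formula` whose underlying net is `topology`."""
    matching = [s for s in structures if s.coordination_formula == formula]
    if not matching:
        raise ValueError(f"no structures with coordination formula {formula!r}")
    obeying = sum(s.underlying_topology == topology for s in matching)
    return obeying / len(matching)


# Invented toy sample, not taken from any database.
sample = [
    Structure("S1", "AB^2_3", "hxl"),
    Structure("S2", "AB^2_3", "hxl"),
    Structure("S3", "AB^2_3", "hxl"),
    Structure("S4", "AB^2_3", "sql"),
    Structure("S5", "AB^2_2", "sql"),
]

p = statement_probability(sample, "AB^2_3", "hxl")
print(f"p = {p:.2f}")  # p = 0.75
```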

Figure 4.21 Advanced contextual pattern: (a) the relationship attributes part and (b) example of realization for the perovskite crystal structure.

The general steps for developing an expert system that is able to work with complexes of any periodicity were illustrated by the analysis of 811 cyanide complexes [61]. Six sets of structural descriptors were proposed, calculated with ToposPro, and stored in the TTA, TTL, TTD, TTO, and TTR collections. Correlations such as “oxidation state–coordination environment of metal atoms,” “coordination environment–coordination formula,” “coordination formula–coordination figure,” and “coordination figure–underlying topology” were derived from the subsequent analysis. A special form for writing the correlations as conditional rules was proposed; for example, the necessary condition for realizing the Prussian blue topology (primitive cubic net) is the connection of metal centers by bridging bidentate CN⁻ ligands in the ratio Me:CN⁻ = 1:3:

IF CF = ‘AB²₃’ THEN UT = ‘pcu’ WITH P = 97.7%.
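The conditional-rule form can be reproduced from tabulated descriptors with a short sketch. The functions `derive_rules` and `format_rule`, the probability threshold, and the input pairs are hypothetical and serve only to show how IF/THEN/WITH statements might be generated; they do not reproduce the ToposPro collections or the published statistics.

```python
from collections import Counter, defaultdict


def derive_rules(pairs, min_probability=0.0):
    """Turn (coordination formula, underlying topology) pairs into rules.

    Returns (CF, UT, P) tuples sorted by decreasing probability, where P is
    the share of structures with formula CF that have topology UT.
    """
    by_formula = defaultdict(Counter)
    for formula, topology in pairs:
        by_formula[formula][topology] += 1

    rules = []
    for formula, counts in by_formula.items():
        total = sum(counts.values())
        for topology, n in counts.items():
            probability = n / total
            if probability >= min_probability:
                rules.append((formula, topology, probability))
    return sorted(rules, key=lambda rule: rule[2], reverse=True)


def format_rule(formula, topology, probability):
    """Write a rule in the IF ... THEN ... WITH P form used in the text."""
    return f"IF CF = '{formula}' THEN UT = '{topology}' WITH P = {probability:.1%}"


# Invented toy data: nine structures with the pcu net and one with dia.
pairs = [("AB^2_3", "pcu")] * 9 + [("AB^2_3", "dia")]

for rule in derive_rules(pairs, min_probability=0.5):
    print(format_rule(*rule))
# IF CF = 'AB^2_3' THEN UT = 'pcu' WITH P = 90.0%
```

The threshold keeps only statements that are reliable enough to be stored in a knowledge database; its value here is arbitrary.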

Figure 4.22 Advanced contextual pattern: (a) the object attribute values part and (b) example of realization for the object attribute string values of the types “Atom,” “Structure,” and “Equivalent positions” list in the perovskite crystal structure.

All such rules were identified for the six sets of descriptors, and their values were used to construct decision trees. Using the data of a decision tree, one can estimate the total probability (P) of obtaining a particular overall topology, as well as other parameters of the cyanide complexes, by multiplying the corresponding conditional probabilities (p_i).

Prototypes of knowledge-based expert systems were also proposed for designing topological motifs in organic crystals [101, 137], predicting H-bonds in some classes of organic crystals [138], and controlling the crystallinity of molecular materials [139]. To describe the local mutual arrangement of molecules, the molecular connection type symbol (MCTS) was proposed [100]. It was shown that the sequence of correlations “chemical structure of molecule–number of active centers–molecular connection type–coordination figure–overall topology” can serve as an effective scheme for predicting possible types of molecular packings and connection motifs. For example, one can state, “if a molecule has four active centers, the most probable connection type will be K4 (each H donor or acceptor connects to one H donor or acceptor of another molecule), which in 96.5% of cases results in an sql (square plane net) overall topology.” Such statements can form a knowledge database that can be used to produce expert conclusions about the possibility of obtaining a particular topological supramolecular motif. To predict some local topological properties, such as the number of active centers, complex physical descriptors, such as the electrostatic potential (ESP) map [140] or the full interaction map (FIM) [141, 142], can be applied [137].

Topological properties of inorganic materials have been explored only occasionally. We should mention an attempt to explain with the tiling approach why there are so many hypothetical zeolites and so few observed ones [29]. The model takes into account the polycondensation of the T⁴⁺(OH)₄ or [T³⁺(OH)₄]⁻ complex groups, which results in oligomeric TₙOₘ(OH)ₖ units that are represented by tiles (minimal cages). Application of the model to hypothetical zeolites showed that only a small set of them can be explained as an assembly of tiles, while almost all natural zeolites fit the model. As a result, a topological ranking of zeolite framework feasibility was suggested. Another example is the classification of intermetallic compounds with the “Nanoclustering” algorithm (see Section 4.2.4.1). It was successfully used to elucidate very complicated structures [40, 64], to arrange large samples of experimental data, and to create a database of structural units in intermetallic compounds [28].

Summarizing the current experience in applying topological methods to crystal design, we can identify the following main tasks in which they can be used effectively:

– Searching for general regularities between properties of structural units and the structure as a whole.
– Enumerating possible local connectivity maps for structural units.
– Generating possible structure architectures for a given set of structural units.

All these tasks should be solved with tailored knowledge databases and expert systems.
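The estimate of the total probability along one branch of a decision tree, mentioned at the beginning of this section, amounts to a product of conditional probabilities. The sketch below assumes that each step of the chain is conditioned on the previous one; the step names follow the correlations listed above, while the numerical values are invented for illustration.

```python
import math


def total_probability(conditional_probabilities):
    """Multiply the conditional probabilities along one decision-tree path."""
    probabilities = list(conditional_probabilities)
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return math.prod(probabilities)


# Hypothetical path through the tree; the values are not from the source.
path = {
    "oxidation state -> coordination environment": 0.9,
    "coordination environment -> coordination formula": 0.8,
    "coordination formula -> underlying topology": 0.5,
}

P = total_probability(path.values())
print(f"P = {P:.3f}")  # P = 0.360
```

Since every factor is at most 1, long chains of weak correlations quickly lead to low total probabilities, which is one reason to keep only strong rules in the knowledge database.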


4.3.2 Prediction of Properties

In this section, we show that geometrical and topological methods find ever more applications in searching for correlations between structural and physical descriptors and in the subsequent prediction of material properties.

Porous metal-organic frameworks: To evaluate the adsorption properties of MOF structures, geometrical parameters of pores, such as pore limiting diameter, largest cavity diameter, global cavity diameter, and pore distribution function [143–146], were combined with topological network analysis and various modeling methods, such as grand canonical Monte Carlo simulations and molecular dynamics coupled with ab initio, DFT, or force field calculations. Adsorption characteristics (Henry’s constant, enthalpy and entropy of adsorption, gas uptake, diffusion coefficients, and distribution of molecules over adsorption sites) of many known and hypothetical MOFs were estimated with high precision [20, 147–151].

Cleavable molecular crystals: To find prospective crystalline substrates that provide a clean and smooth surface for organic molecular beam epitaxy (OMBE) or hot wall epitaxy (HWE), the authors of [27] proposed a new combined approach. In the topological part of the approach, the propensity of molecular crystals to cleave along distinct crystallographic planes is estimated by a new descriptor of the anisotropy of the intermolecular interaction energy, the so-called X parameter. It represents the portion of the cohesive energy of the molecule within a given layer (two-periodic molecular network) relative to the total cohesive energy in the whole environment of the molecule. The cohesive energy is computed by quantum-chemical methods. High values of the X parameter indicate a high probability that the layer is robust and thus defines a cleavage plane. The method was successfully used to find cleavable crystals of amino acids, and the prediction was confirmed experimentally.

Fast-ion conductors: Structural descriptors were used for fast screening of crystallographic databases (ICSD or PCD) for prospective materials that support the migration of cations. Complete lists of potential lithium [123] and sodium [26] solid electrolytes were published; they contain both substances with already confirmed ion conductivity and targets for synthesis. In the last few years, these approaches have increasingly been combined with DFT modeling [152], which promises a quantitative prediction of such properties of solid electrolytes and cathode materials as migration energy, conductivity, and anisotropic effects.
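The X parameter introduced above for cleavable molecular crystals can be sketched as a ratio of energies. We assume here that the cohesive energy of a molecule can be approximated by a sum of pairwise interaction energies with its neighbors; this decomposition, the function name, the neighbor labels, and the energy values (in kJ/mol) are illustrative assumptions and are not taken from [27].

```python
def x_parameter(pair_energies, layer_neighbors):
    """Share of the cohesive energy of a molecule that falls within one layer.

    pair_energies   -- mapping neighbor label -> interaction energy with the
                       central molecule (same units and sign convention)
    layer_neighbors -- labels of the neighbors lying in the candidate layer
    """
    total = sum(pair_energies.values())
    if total == 0:
        raise ValueError("total cohesive energy is zero")
    in_layer = sum(energy for neighbor, energy in pair_energies.items()
                   if neighbor in layer_neighbors)
    return in_layer / total


# Invented interaction energies of a central molecule with five neighbors.
energies = {"n1": -30.0, "n2": -30.0, "n3": -20.0, "n4": -10.0, "n5": -10.0}

# Two hypothetical candidate layers (two-periodic molecular networks).
candidate_layers = {
    "layer A": {"n1", "n2", "n3"},
    "layer B": {"n4", "n5"},
}

ranking = sorted(
    ((name, x_parameter(energies, members))
     for name, members in candidate_layers.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, x in ranking:
    print(f"{name}: X = {x:.2f}")
# layer A: X = 0.80
# layer B: X = 0.20
```

In this toy case layer A concentrates most of the cohesive energy, so the plane parallel to it would be the more probable cleavage plane.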

4.4 Conclusions

The progress in topological approaches achieved in recent years leads us to believe that they can become a bridge between experimentalists and theoreticians in developing new substances with specified useful properties. These approaches are formal and rigorous enough to provide algorithmic strictness in the methods for processing databases of experimental information and, on the other hand, are naturally understandable to practical chemists. They have already given rise to new descriptors of crystal structures and to important tools for revealing correlations between structural and physical properties of solids. Combined with DFT and other modeling methods, they can lead to the development of the first generation of knowledge databases and expert systems, which will substantially accelerate the deployment of new materials.

References

1 Kalil, T. and Wadia, C. Materials genome initiative: a renaissance of American manufacturing. https://www.whitehouse.gov/blog/2011/06/24/materials-genome-initiative-renaissance-american-manufacturing (accessed 12 October 2016).
2 Jain, A., Ong, S.P., Hautier, G. et al. (2013). A materials genome approach to accelerating materials innovation. The materials project. APL Mater. 1: 011002.
3 Hill, J., Mulholland, G., Persson, K. et al. (2016). Materials science with large-scale data and informatics: Unlocking new opportunities. MRS Bull. 41: 399–409.
4 Hohenberg, P. and Kohn, W. (1964). Inhomogeneous electron gas. Phys. Rev. 136: 864–871.
5 Koch, W. and Holthausen, M.C. (2001). A Chemist’s Guide to Density Functional Theory. Wiley.
6 Odoh, S.O., Cramer, C.J., Truhlar, D.G., and Gagliardi, L. (2015). Quantum-chemical characterization of the properties and reactivities of metal-organic framework. Chem. Rev. 115: 6051–6111.
7 Oganov, A.R. (ed.) (2011). Modern Methods of Crystal Structure Prediction. Chichester: Wiley.
8 Coudert, F.-X. and Fuchs, A.H. (2016). Computational characterization and prediction of metal-organic framework properties. Coord. Chem. Rev. 307 (2): 211–236.
9 Kiselyova, N.N. (2002). Computer design of materials with artificial intelligence. In: Intermetallic Compounds – Principles and Practice, V. 3 (ed. J.H. Westbrook and R.L. Fleischer). Wiley.
10 Isayev, O., Fourches, D., Muratov, E.N. et al. (2015). Materials cartography: representing and mining materials space using structural and electronic fingerprints. Chem. Mater. 27 (3): 735–743.
11 Isayev, O., Oses, C., Toher, C. et al. (2017). Universal fragment descriptors for predicting properties of inorganic crystals. Nature Commun. 8: 15679.
12 Li, Y. and Yu, J. (2014). New stories of zeolite structures: their descriptions, determinations, predictions, and evaluations. Chem. Rev. 114 (14): 7268–7316.
13 Hoskins, B.F. and Robson, R. (1990). Design and construction of a new class of scaffolding-like materials comprising infinite polymeric frameworks of 3D-linked molecular rods. A reappraisal of the Zn(CN)₂ and Cd(CN)₂ structures and the synthesis and structure of the diamond-related frameworks [N(CH₃)₄][CuIZnII(CN)₄] and CuI[4,4′,4′′,4′′′-tetracyanotetraphenylmethane]-BF₄⋅xC₆H₅NO₂. J. Am. Chem. Soc. 112 (4): 1546–1554.
14 Batten, S.R., Neville, S.M., and Turner, D.R. (2009). Coordination Polymers: Design, Analysis and Application. Cambridge: Royal Society of Chemistry.
15 O’Keeffe, M., Eddaoudi, M., Li, H. et al. (2000). Frameworks for extended solids: geometrical design principles. J. Solid State Chem. 152: 3–20.
16 Yaghi, O.M., O’Keeffe, M., Ockwig, N.W. et al. (2003). Reticular synthesis and the design of new materials. Nature 423: 705–714.
17 Eddaoudi, M., Kim, J., Rosi, N. et al. (2002). Systematic design of pore size and functionality in isoreticular MOFs and their application in methane storage. Science 295: 469–472.
18 Nouar, F., Eubank, J.F., Bousquet, T. et al. (2008). Supermolecular building blocks (SBBs) for the design and synthesis of highly porous metal-organic frameworks. J. Am. Chem. Soc. 130: 1833–1835.
19 Ockwig, N.W., Delgado-Friedrichs, O., O’Keeffe, M., and Yaghi, O.M. (2005). Reticular chemistry: occurrence and taxonomy of nets and grammar for the design of frameworks. Acc. Chem. Res. 38 (3): 176–182.
20 Goldsmith, J., Wong-Foy, A.G., Cafarella, M.J., and Siegel, D.J. (2013). Theoretical limits of hydrogen storage in metal–organic frameworks: opportunities and trade-offs. Chem. Mater. 25: 3373–3382.
21 Chung, Y.G., Camp, J., Haranczyk, M. et al. (2014). Computation-ready, experimental (CoRE) metal-organic frameworks: a tool to enable high-throughput screening of nanoporous crystals. Chem. Mater. 26: 6185–6192.
22 First, E.L., Gounaris, C.E., Wei, J., and Floudas, C.A. (2011). Computational characterization of zeolite porous networks: an automated approach. Phys. Chem. Chem. Phys. 13 (38): 17339–17358.
23 First, E.L. and Floudas, C.A. (2013). MOFomics: computational pore characterization of metal-organic frameworks. Microporous Mesoporous Mater. 165: 32–39.
24 ZEOMICS. Zeolites and microporous structures characterization, an automated computational method for characterizing the three-dimensional porous networks of microporous materials, such as zeolites. http://helios.princeton.edu/zeomics (accessed 12 October 2016).
25 MOFomics. Metal-organic frameworks characterization, an automated computational method for characterizing the three-dimensional porous networks of metal-organic frameworks. http://helios.princeton.edu/mofomics (accessed 12 October 2016).
26 Meutzner, F., Münchgesang, W., Kabanova, N.A. et al. (2015). On the way to new possible Na-ion conductors: the Voronoi-Dirichlet approach, data mining and symmetry considerations in ternary Na oxides. Chem. Eur. J. 21: 16601–16608.
27 Zolotarev, P.N., Moret, M., Rizzato, S., and Proserpio, D.M. (2016). Searching new crystalline substrates for OMBE: topological and energetic aspects of cleavable organic crystals. Cryst. Growth Des. 16: 1572–1582.
28 Pankova, A.A., Akhmetshina, T.G., Blatov, V.A., and Proserpio, D.M. (2015). A collection of topological types of nanoclusters and its application to icosahedron-based intermetallics. Inorg. Chem. 54 (13): 6616–6630.


29 Blatov, V.A., Ilyushin, G.D., and Proserpio, D.M. (2013). The zeolite conundrum: why are there so many hypothetical zeolites and so few observed? a possible answer from the zeolite-type frameworks perceived as packings of tiles. Chem. Mater. 25: 412–424.
30 Blatov, V.A., Shevchenko, A.P., and Proserpio, D.M. (2014). Applied topological analysis of crystal structures with the program package ToposPro. Cryst. Growth Des. 14: 3576–3586.
31 Delgado-Friedrichs, O. and O’Keeffe, M. (2003). Identification and symmetry computation for crystal nets. Acta Crystallogr. A59: 351–360.
32 O’Keeffe, M., Peskov, M.A., Ramsden, S.J., and Yaghi, O.M. (2008). Acc. Chem. Res. 41 (12): 1782–1789. Reticular Chemistry Structure Resource. http://rcsr.anu.edu.au (accessed 12 October 2016).
33 Martin, R.L., Smit, B., and Haranczyk, M. (2012). Zeo++, an open source (but registration-required) software package for analysis of crystalline porous materials. J. Chem. Inf. Model. 52: 308–318.
34 Goesten, M.G., Kapteijn, F., and Gascon, J. (2013). Fascinating chemistry or frustrating unpredictability: observations in crystal engineering of metal-organic frameworks. CrystEngComm. 15: 9249–9257.
35 Samson, S. (1962). Crystal structure of NaCd₂. Nature 195: 259–262.
36 Yang, Q.-B., Andersson, S., and Stenberg, L. (1987). An alternative description of the structure of NaCd₂. Acta Crystallogr. B43: 14–16.
37 Bergman, G. (1996). Structure of NaCd₂: an alternative path to a trial structure. Acta Crystallogr. B52: 54–58.
38 Fredrickson, D.C., Lee, S., and Hoffmann, R. (2007). Interpenetrating polar and nonpolar sublattices in intermetallics: the NaCd₂ structure. Angew. Chem. Int. Ed. 46: 1958–1976.
39 Feuerbacher, M., Thomas, C., Makongo, J.P. et al. (2007). The Samson phase, β-Mg₂Al₃, revisited. Z. Kristallogr. 222: 259–288.
40 Shevchenko, V.Y., Blatov, V.A., and Ilyushin, G.D. (2009). Intermetallic compounds of the NaCd₂ family perceived as assemblies of nanoclusters. Struct. Chem. 20: 975–982.
41 Düren, T., Millange, F., Férey, G. et al. (2007). Calculation geometric surface area as characterization tool for metal-organic framework. J. Phys. Chem. 111: 15350–15356.
42 Accelrys, San Diego, CA. http://accelrys.com/products/collaborative-science/biovia-materials-studio (accessed 12 October 2016).
43 Shrake, A. and Rupley, J. (1973). Environment and exposure to solvent of protein atoms. Lysozyme and insulin. J. Mol. Biol. 79: 351–371.
44 Phillips, M., Georgiev, I., Dehof, A.K. et al. (2010). Measuring properties of molecular surfaces using ray casting. In: International Symposium on Parallel and Distributed Processing, Workshops and Phd Forum (IPDPSW IEEE), 1–7. IEEE.
45 Weiser, J., Shenkin, P.S., and Still, W.C. (1999). Approximate atomic surfaces from linear combinations of pairwise overlaps (LCPO). J. Comput. Chem. 20: 217–230.


46 Klenin, K.V., Tristram, F., Strunk, T., and Wenzel, W. (2011). Derivatives of molecular surface area and volume: simple and exact analytical formulas. J. Comput. Chem. 32: 2647–2653.
47 Gelb, L.D. and Gubbins, K.E. (1999). Pore size distributions in porous glasses: a computer simulation study. Langmuir 15: 305–308.
48 Sarkisov, L. and Harrison, A. (2011). Computational structure characterisation tools in application to ordered and disordered porous materials. Mol. Simul. 37: 1248–1257.
49 Haldoupis, E., Nair, S., and Sholl, D.S. (2011). Pore size analysis of >250000 hypothetical zeolites. Phys. Chem. Chem. Phys. 13: 5053–5060.
50 Blatov, V.A. and Shevchenko, A.P. (2003). Analysis of voids in crystal structures: the methods of “dual” crystal chemistry. Acta Crystallogr. A59: 34–44.
51 Blatov, V.A. (2004). Voronoi-Dirichlet polyhedra in crystal chemistry: theory and applications. Crystallogr. Rev. 10: 249–318.
52 Bouëssel du Bourg, L., Ortiz, A.U., Boutin, A., and Coudert, F.-X. (2014). Thermal and mechanical stability of zeolitic imidazolate frameworks polymorphs. APL Mater. 2: 124110.
53 Fuentes-Cabrera, M., Nicholson, D.M., Sumpter, B.G., and Widom, M. (2005). Electronic structure and properties of isoreticular metal-organic frameworks: the case of M-IRMOF1 (M = Zn, Cd, Be, Mg, and Ca). J. Chem. Phys. 123: 124713.
54 Yang, L.-M., Fang, G.-Y., Ma, J. et al. (2014). Band gap engineering of paradigm MOF-5. Cryst. Growth Des. 14: 2532–2541.
55 Flage-Larsen, E., Røyset, A., Cavka, J.H., and Thorshaug, K. (2013). Band gap modulations in UiO metal–organic frameworks. J. Phys. Chem. C 117: 20610–20616.
56 Civalleri, B., Napoli, F., Noël, Y. et al. (2006). Ab-initio prediction of materials properties with CRYSTAL: MOF-5 as a case study. CrystEngComm. 8: 364–371.
57 Hendon, C.H., Wittering, K.E., Chen, T.-H. et al. (2015). Absorbate-induced piezochromism in a porous molecular crystal. Nano Lett. 15: 2149–2154.
58 Dan-Hardi, M., Serre, C., Frot, T. et al. (2009). A new photoactive crystalline highly porous titanium(IV) dicarboxylate. J. Am. Chem. Soc. 131: 10857–10859.
59 Butler, K.T., Hendon, C.H., and Walsh, A. (2014). Electronic chemical potentials of porous metal–organic frameworks. J. Am. Chem. Soc. 136: 2703–2706.
60 Mitina, T.G. and Blatov, V.A. (2013). Topology of 2-periodic coordination networks: toward expert systems in crystal design. Cryst. Growth Des. 13: 1655–1664.
61 Alexandrov, E.V., Shevchenko, A.P., Asiri, A.A., and Blatov, V.A. (2015). New knowledge and tools for crystal design: local coordination versus overall network topology and much more. CrystEngComm 17: 2913–2924.
62 Alexandrov, E.V., Virovets, A.V., Blatov, V.A., and Peresypkina, E.V. (2015). Topological motifs in cyanometallates: from building units to three-periodic frameworks. Chem. Rev. 115 (22): 12286–12319.


63 Delgado-Friedrichs, O., Foster, M.D., O’Keeffe, M. et al. (2005). What do we know about three-periodic nets? J. Solid State Chem. 178 (8): 2533–2554.
64 Ilyushin, G.D. and Blatov, V.A. (2009). Structures of the ZrZn₂₂ family: suprapolyhedral nanoclusters, methods of self-assembly and superstructural ordering. Acta Crystallogr. B65: 300–307.
65 Klee, W.E. (2004). Crystallographic nets and their quotient graphs. Cryst. Res. Technol. 39 (11): 959–968.
66 Blatov, V.A., O’Keeffe, M., and Proserpio, D.M. (2010). Vertex-, face-, point-, Schläfli-, and Delaney-symbols in nets, polyhedra and tilings: recommended terminology. CrystEngComm 12 (1): 44–48.
67 Delgado-Friedrichs, O. and O’Keeffe, M. (2005). Crystal nets as graphs: terminology and definitions. J. Solid State Chem. 178 (8): 2480–2485.
68 O’Keeffe, M. and Hyde, B.G. (1996). Crystal Structures. I. Patterns and Symmetry. Washington, DC: Mineralogical Society of America.
69 Blatov, V.A., Delgado-Friedrichs, O., O’Keeffe, M., and Proserpio, D.M. (2007). Three-periodic nets and tilings: natural tilings for nets. Acta Crystallogr. A63: 418–425.
70 Delgado-Friedrichs, O., O’Keeffe, M., and Yaghi, O.M. (2007). Taxonomy of periodic nets and the design of materials. Phys. Chem. Chem. Phys. 9 (9): 1035–1043.
71 Blatov, V.A. and Proserpio, D.M. (2011). Periodic-graph approaches in crystal structure prediction. In: Modern Methods of Crystal Structure Prediction (ed. A.R. Oganov), 1–28. Weinheim: Wiley.
72 Alexandrov, E.V., Blatov, V.A., and Proserpio, D.M. (2012). A topological method for classification of entanglements in crystal networks. Acta Crystallogr. A68 (4): 484–493.
73 Alexandrov, E.V., Blatov, V.A., Kochetkov, A.V., and Proserpio, D.M. (2011). Underlying nets in three-periodic coordination polymers: topology, taxonomy and prediction from a computer-aided analysis of the Cambridge structural database. CrystEngComm 13 (12): 3947–3958.
74 Anurova, N.A. and Blatov, V.A. (2009). Analysis of ion-migration paths in inorganic frameworks by means of tilings and Voronoi–Dirichlet partition: a comparison. Acta Crystallogr. B65: 426–434.
75 Bader, R. (1994). Atoms in Molecules. A Quantum Theory. Oxford: Clarendon Press.
76 Bader, R. (1991). A quantum theory of molecular structure and its applications. Chem. Rev. 91: 893–928.
77 Bader, R.F.W. (2005). The quantum mechanical basis for conceptual chemistry. Monatsh. Chem. 136: 819–854.
78 Cambridge Structural Database (2011). Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge, UK. http://ccdc.cam.ac.uk (accessed 12 October 2016).
79 Allen, F.H. (2002). The Cambridge structural database: a quarter of a million crystal structures and rising. Acta Crystallogr. B58: 380–388.
80 Galek, P.T.A., Fabian, L., Allen, F.H. et al. (2007). Knowledge-based model of hydrogen-bonding propensity in organic crystals. Acta Crystallogr. B63: 768–782.


81 Galek, P.T.A., Pidcock, E., Wood, P.A. et al. (2012). One in half a million: a solid form informatics study of a pharmaceutical crystal structure. CrystEngComm 14: 2391–2403.
82 Cole, J.C., Groom, C.R., Korb, O. et al. (2016). Knowledge-based optimization of molecular geometries using crystal structures. J. Chem. Inf. Model. 56: 652–661.
83 Groom, C.R., Bruno, I.J., Lightfoot, M.P., and Ward, S.C. (2016). The Cambridge structural database. Acta Crystallogr. B72: 171–179.
84 Bergerhoff, G., Hundt, R., Sievers, R., and Brown, I.D. (1983). The inorganic crystal structure database. J. Chem. Inf. Comput. Sci. 23: 66–69.
85 Allmann, R. and Hinek, R. (2007). The introduction of structure types into the Inorganic Crystal Structure Database ICSD. Acta Crystallogr. A63: 412–417.
86 ASM World Headquarters. 9639 Kinsman Road, Materials Park, OH 44073-0002, USA. http://www.asminternational.org (accessed 12 October 2016).
87 CRYSTAL IMPACT. Dr. H. Putz & Dr. K. Brandenburg GbR, Kreuzherrenstr. 102, D-53227 Bonn, Germany. http://www.crystalimpact.com/pcd (accessed 12 October 2016).
88 Villars, P., Cenzual, K., Daams, J.L.C. et al. (eds.) (2002). PAULING FILE, Binaries Edition. Materials Park: ASM International, http://paulingfile.com (accessed 12 October 2016).
89 International Centre for Diffraction Data, 12 Campus Blvd., Newtown Square, PA 19073-3273 USA. http://www.icdd.com http://www.rcsb.org/pdb (accessed 12 October 2016).
90 Gražulis, S., Daškevič, A., Merkys, A. et al. (2012). Crystallography Open Database (COD): an open-access collection of crystal structures and platform for world-wide collaboration. Nucl. Acids Res. 40: D420–D427, http://www.crystallography.net/cod (accessed 12 October 2016).
91 Rutgers. The State University of New Jersey, Center for Integrative Proteomics Research. 174 Frelinghuysen Road, Piscataway, NJ 08854-8076. http://www.rcsb.org/pdb (accessed 12 October 2016).
92 Berman, H.M., Westbrook, J., Feng, Z. et al. (2000). The Protein Data Bank. Nucleic Acids Res. 28: 235–242. http://www.rcsb.org/pdb/.
93 Rose, P.W., Prlic, A., Bi, C. et al. (2015). The RCSB Protein Data Bank: views of structural biology for basic and applied research and education. Nucleic Acids Res. 43: 345–356.
94 O’Keeffe, M. (2010). Aspects of crystal structure prediction: some successes and some difficulties. Phys. Chem. Chem. Phys. 12: 8580–8583.
95 Delgado-Friedrichs, O., O’Keeffe, M., and Yaghi, O.M. (2003). Three-periodic nets and tilings: regular nets. Acta Crystallogr. A59: 22–27.
96 Baerlocher, C., McCusker, L.B., and Olson, D.H. (2007). Atlas of Zeolite Framework Types, 6e. London: Elsevier.
97 ToposPro. The program package for multipurpose geometrical and topological analysis of crystal structures. http://topospro.com (accessed 12 October 2016).


98 Blatov, V.A., Shevchenko, A.P., and Serezhkin, V.N. (1995). Crystal space analysis by means of Voronoi-Dirichlet polyhedra. Acta Crystallogr. A51: 909–916.
99 Blatov, V.A. (2000). Search for isotypism in crystal structures by means of the graph theory. Acta Crystallogr. A56 (2): 178–188.
100 Aman, F., Asiri, A.M., Siddiqui, W.A. et al. (2014). Multilevel topological description of molecular packings in 1,2-benzothiazines. CrystEngComm. 16: 1963–1970.
101 Zolotarev, P.N., Arshad, M.N., Asiri, A.M. et al. (2014). A possible route toward expert systems in supramolecular chemistry: 2-periodic H-bond patterns in molecular crystals. Cryst. Growth Des. 14: 1938–1949.
102 Simperler, A., Foster, M.D., Delgado Friedrichs, O. et al. (2005). Hypothetical binodal zeolitic frameworks. Acta Crystallogr. B61: 263–279.
103 Treacy, M.M.J., Rivin, I., Balkovsky, E. et al. (2004). Enumeration of periodic tetrahedral frameworks. II. Polynodal graphs. Microporous Mesoporous Mater. 74: 121–132. http://www.hypotheticalzeolites.net.
104 Rivin, I. (2006). Geometric simulations: a lesson from virtual zeolites. Nat. Mater. 5: 931–932.
105 Earl, D.J. and Deem, M.W. (2006). Toward a database of hypothetical zeolite structures. Ind. Eng. Chem. Res. 45: 5449–5454.
106 Deem, M.W., Pophale, R., Cheeseman, P.A., and Earl, D.J. (2009). A database of new zeolite-like materials. J. Phys. Chem. 113: 21353–21360.
107 Wilmer, C.E., Kim, K.C., and Snurr, R.Q. (2012). An extended charge equilibration method. J. Phys. Chem. Lett. 3: 2506–2511.
108 Wilmer, C.E., Leaf, M., Lee, C.Y. et al. (2012). Large-scale screening of hypothetical metal-organic frameworks. Nat. Chem. 4: 83–89.
109 Lin, L.-C., Berger, A.H., Martin, R.L. et al. (2012). In silico screening of carbon-capture materials. Nat. Mater. 11: 633–641.
110 Fernandez, M., Boyd, P.G., Daff, T.D. et al. (2014). Rapid and accurate machine learning recognition of high performing metal organic frameworks for CO₂ capture. J. Phys. Chem. Lett. 5: 3056–3060.
111 Hoffmann, R., Kabanov, A.A., Golov, A.A., and Proserpio, D.M. (2016). Homo citans and carbon allotropes: for an ethics of citation. Angew. Chem. Int. Ed. 55: 2–17.
112 Ramsden, S.J., Robins, V., and Hyde, S.T. (2009). Three-dimensional Euclidean nets from two-dimensional hyperbolic tilings: kaleidoscopic examples. Acta Crystallogr. A65: 81–108.
113 Sikora, B.J., Wilmer, C.E., Greenfield, M.L., and Snurr, R.Q. (2012). Thermodynamic analysis of Xe/Kr selectivity in over 137000 hypothetical metal-organic frameworks. Chem. Sci. 3: 2217–2223.
114 Simon, C.M., Mercado, R., Schnell, S.K. et al. (2015). Metal-organic framework with optimally selective xenon adsorption and separation. Chem. Mater. 27: 4459–4475.
115 Simon, C.M., Kim, J., Gomez-Gualdron, D.A. et al. (2015). The materials genome in action: identifying the performance limits for methane storage. Energy Environ. Sci. 8: 1190–1199.

References

116 Martin, R.L., Simon, C.M., Smit, B., and Haranczyk, M. (2014). In silico

117

118

119

120

121 122 123

124

125

126

127

128 129 130 131 132

design of porous polymer networks: high-throughput screening for methane storage materials. J. Am. Chem. Soc. 136: 5006–5022. Kim, H., Samsonenko, D.G., Das, S. et al. (2009). Methane sorption and structural characterization of the sorption sites in Zn-2(bdc)(2)(dabco) by single crystal X-ray crystallography chemistry. Chem. Asian J. Chemistry 4: 886–891. Baburin, I.A., Blatov, V.A., Carlucci, L. et al. (2005). Interpenetrating metal-organic and inorganic 3D networks: a computer-aided systematic investigation. Part II. Analysis of the Inorganic Crystal Structure Database (ICSD). J. Solid State Chem. 178: 2452–2474. Baburin, I.A. and Blatov, V.A. (2007). Three-dimensional hydrogen-bonded frameworks in organic crystals: a topological study. Acta Crystallogr. B63: 791–802. Pankova, A.A., Blatov, V.A., Ilyushin, G.D., and Proserpio, D.M. (2013). 𝛾-Brass polyhedral core in intermetallics: The nanocluster model. Inorg. Chem. 52 (22): 13094–13107. Blatov, V.A. (2006). A method for hierarchical comparative analysis of crystal structures. Acta Crystallogr. A62: 356–364. Blatov, V.A. (2016). A method for topological analysis of rod packings. Struct. Chem. https://doi.org/10.1007/s11224-016-0774-1. Anurova, N.A., Blatov, V.A., Ilyushin, G.D. et al. (2008). Migration maps of Li+ cations in oxygen-containing compounds. Solid State Ionics 179: 2248–2254. Kostakis, G.E., Blatov, V.A., and Proserpio, D.M. (2012). A method for topological analysis of high nuclearity coordination clusters and its application to Mn coordination compounds. Dalton Trans. 41: 4634–4640. Kostakis, G.E., Perlepes, S.P., Blatov, V.A. et al. (2012). High-nuclearity cobalt coordination clusters: synthetic, topological and magnetic aspects. Coord. Chem. Rev. 256: 1246–1278. Wix, P., Kostakis, G.E., Blatov, V.A. et al. (2013). A database of topological representations of polynuclear nickel compounds. Eur. J. Inorg. Chem. (4): 520–526. Serezhkin, V.N., Vologzhanina, A.V., Serezhkina, L.B. et al. (2009). 
Crystallochemical formula as a tool for describing metal-ligand complexes – a pyridine-2,6-dicarboxylate example. Acta Crystallogr. B65: 45–53. Simsion, G.C. and Witt, G.C. (2005). Data Modeling Essentials, 3e, 560. Morgan Kaufmann. Hey, D.C. and Barker, R. (1996). Data Model Patterns: Conventions of Thought. Dorset House Publishing. Silverstone, L. and Agnew, P. (2009). The Data Model Resource Book, Universal Patterns for Data Modeling, vol. 3. Wiley. Ambler, S. and Sadalage, P. (2006). Refactoring Databases: Evolutionary Database Design, vol. 384. Addison-Wesley. Silverstone, L. (2001). The Data Model Resource Book, A Library of Universal Data Models for All Enterprises, vol. 1. Wiley.




5 A High-Throughput Computational Study Driven by the AiiDA Materials Informatics Framework and the PAULING FILE as Reference Database

Martin Uhrin¹, Giovanni Pizzi¹, Nicolas Mounet¹, Nicola Marzari¹, and Pierre Villars²

¹ École Polytechnique Fédérale de Lausanne, Theory and Simulation of Materials (THEOS), Station 3, and National Centre for Computational Design and Discovery of Novel Materials (MARVEL), CH-1015 Lausanne, Switzerland
² Material Phases Data System (MPDS), Unterschwanden 6, CH-6354 Vitznau, Switzerland

5.1 Introduction

It is essential to note that realizing the plan laid out in the Abstract requires three conditions, which enable the abovementioned processes:

1. “Prototype classification.”
2. “Distinct phases concept.”
3. “Existence of a fully standardized inorganic solids database to be used as a reference.”

These three conditions have been carefully enforced in the PAULING FILE (http://www.paulingfile.com). In this work, we plan to simulate the crystal structures and a broad variety of physical properties for an extensive set of binary systems (inorganic solids), using the PAULING FILE – Binaries Edition as the starting and reference database. The simulated results (open access) and the PAULING FILE – Binaries Edition (proprietary), which contains experimentally determined inorganic solids data from the world literature, will be linked to create a complementary data set of materials properties. The goal is not only to create a reference database of density functional theory (DFT) calculations verified against reliable experimental data but also to go beyond the known structures and explore compositions for which no experimental data or intrinsic physical properties have been reported.

The data will be presented as a series of “overview maps” that highlight particular properties, e.g. crystal structure, physical property, coordination polyhedra, etc. An example is shown in Figure 5.1 [1]. For each type of inorganic solids overview map, one map will display experimental data, while an equivalent map will show data from simulations, so that the user sees at a glance where simulation confirms or disagrees with experiment, as well as the extension to binary systems for which no experimental data are known so far.

Materials Informatics: Methods, Tools and Applications, First Edition. Edited by Olexandr Isayev, Alexander Tropsha, and Stefano Curtarolo. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.


Figure 5.1 Generalized atomic environment type (AET) (coordination polyhedron) matrix PN_A vs. PN_B (PN, periodic number), which is independent of the stoichiometry and of the number of chemical elements in the inorganic solid. The element A occupying the center of the AET is given on the y-axis and the coordinating element on the x-axis. CN stands for coordination number. The results shown are derived from experimentally determined data [1].

5.1.1 Three Key Developments Opened Up Unprecedented Opportunities

Three key developments during the past decades have opened up unprecedented opportunities: (i) The high-throughput capacity of hardware architectures has been on a 14-month doubling cycle for the past 30 years. (ii) In computational materials science, theoretical and algorithmic developments and sophisticated simulation software have combined to accelerate throughput capacity even further. In particular, electronic structure simulations based on DFT are now mature and can predict the crystal structures and the intrinsic physical properties of inorganic solids from first principles (i.e. with no experimental input) and with


an accuracy that in the best cases is comparable to, and at times better than, experimental data (failures also abound, which underlines the importance of validation against experiments). (iii) High-bandwidth, low-cost communications have enabled decentralized calculations and the exchange of data and knowledge.

5.1.2 Relatively Few Inorganic Solids Have Been Experimentally Investigated

Despite the enormous fundamental importance of inorganic solids for our industrialized society in areas such as housing, energy, transportation, civil engineering, communication, food, and health, our knowledge of inorganic solids is astoundingly sparse. For example, of all possible ternary chemical systems, fewer than 16% have been partly characterized across the full composition and structure type range. For inorganic solids containing four or more chemical elements, this fraction drops to 0.6% or less. In fact, the remarkable accomplishments in the development of advanced inorganic solids used in aircraft engines, computer processors, magnetic recording devices, or chemical catalysts rest on the optimization of their intrinsic physical properties. The only “robust” knowledge of inorganic solids exists for binary systems: for about 72% of all binary systems, at least some information is experimentally known. This subset is therefore optimally suited to developing a quantum simulation strategy, as it gives many possibilities to cross-check the trustworthiness of the simulated data against experimentally determined data and vice versa. In this work, we focus on single-phase inorganic solids; nevertheless, the proposal is of general character.

It is worth noting that, on the one hand, the number of experimentally investigated inorganic solids is still very low but, on the other, the ability to obtain the crystal structure of inorganic solids and their intrinsic physical properties from quantum simulations is growing rapidly. These two facts motivate us to explore new ways to build up a database consisting of two major interlinked parts: the published experimentally determined inorganic solids data, to be used as a reference, and a database of high-throughput quantum simulations generated using AiiDA, a high-throughput simulation tool. It is worth highlighting that engineering materials are typically multiphase inorganic solids that are, of course, affected by defects, interfaces, and microstructure. Nevertheless, the fundamental bases of all these engineering materials are ordered single-phase inorganic solids. It is also worth mentioning that binaries are always the starting point for any consideration of multinaries.
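To give a feeling for the sizes behind the percentages quoted above, the following minimal sketch counts the possible chemical systems. It assumes a round figure of 100 chemical elements (the approximate number quoted in Section 5.2); the function name `count_systems` and the dictionary `explored_fraction` are hypothetical names introduced for illustration only. The 0.6% figure in the text refers to systems with four or more elements, so applying it to quaternary systems alone is an additional simplification.

```python
from math import comb

# Assumption: a round figure of 100 chemical elements.
N_ELEMENTS = 100

# Approximate fractions quoted in the text (treated here as rough bounds).
# The 0.6% value is stated for "four or more" elements; it is applied to
# quaternary systems only for illustration.
explored_fraction = {2: 0.72, 3: 0.16, 4: 0.006}


def count_systems(n_components: int, n_elements: int = N_ELEMENTS) -> int:
    """Number of distinct chemical systems built from n_components elements."""
    return comb(n_elements, n_components)


for n_components, fraction in explored_fraction.items():
    total = count_systems(n_components)
    explored = round(total * fraction)
    print(f"{n_components}-component systems: {total:>9,d} possible, "
          f"roughly {explored:,d} with some experimental data")
```

The rapid growth of the count with the number of components is the reason why binary systems are the natural place to start cross-checking simulations against experiment.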

5.2 Nature Defines Cornerstones Providing a Marvelously Rich but Still Very Rigid Systematic Framework of Restraint Conditions

Below we outline four fundamental cornerstones given by nature, which on the one hand are responsible for the fact that we are confronted with infinitely many


chemical element combinations (equal to potential inorganic solids) but on the other hand provide a very rigid systematic framework of restraint conditions. As these cornerstones quantitatively reflect the underlying natural laws, they are generally valid and therefore have a practical impact; this makes them ideal for developing an optimal high-throughput quantum simulation strategy.

(1) Infinitely many chemical element combinations representing potential novel inorganic solids: Nature provides about 100 chemical elements as well as their combinations. A direct consequence of this fact is that, in general, there exists an infinite number of chemical element combinations. In addition, nature realizes a very large variety of possible compositions for single-phase inorganic solids within a specific chemical element combination (over 300 stoichiometries occur). Furthermore, nature also realizes a huge number of three-dimensional ways to order chemical elements (atoms) within such a single-phase inorganic solid (over 36 000 different prototypes have been experimentally found). Finally, the magnetic moments of the chemical elements can be ordered in an even larger number of four-dimensional ways.

(2) Laws that control which chemical elements do not form inorganic solids: In contrast to the infinitely many chemical element combinations, nature imposes very strict restrictions on the formation of inorganic solids. The enthalpy of formation has to be negative; otherwise the chemical element combination will not lead to a novel inorganic solid. In addition, at constant pressure, Gibbs’ phase rule P = C − F + 1 defines the relation between the number of phases (P, here potential advanced materials), the number of components (C, here chemical elements), and the number of degrees of freedom (F) of intensive properties, here temperature and composition.

(3) Laws that control the ordering of the chemical elements within crystal structures of a considered novel inorganic solid: When chemical elements combine to form inorganic solids, their crystal structures are beautifully rich, yet very systematic patterns underlie this process. The most striking manifestation of this fact is the existence of so-called prototypes of crystalline inorganic solids, which can be understood as geometrical templates for large groups of inorganic solids. As an example, the NaCl, cF8, 225 (prototype, Pearson symbol, space group number) prototype presently has over 1300 different representatives (inorganic solids). In other words, different inorganic solids crystallizing in the same prototype are geometrically either identical or very similar to each other. A closer inspection of the known inorganic solids reveals that for each of the 1000 most populous prototypes, only a relatively small subset is actually known. The majority of the combinations remain unexplored. There exist at least two possibilities to classify crystal structures: one is based on their overall symmetry (this leads to the classical prototype classification), and the other on the atomic environments of each site, also called coordination polyhedra (this leads to the atomic environment type [AET] classification) [2, 3]. The first classification requires that the published crystallographic data are fully standardized, and this is only thoroughly done in the PAULING FILE (http://www.paulingfile.com) [4, 5], as well as in products derived from it, e.g. Pearson’s Crystal Data [6]. The second classification involves, for some


cases, a certain ambiguity in deciding where to place the cutoff of the atomic environment, especially for prototypes with partially occupied point sets.

(4) Laws that control the link between the position of a chemical element in the periodic table and the sites it occupies in the considered prototypes of a novel inorganic solid: There exists a direct link, beyond the one given by the chemical formula, between the position of the constituent chemical elements in the periodic table (in other words, governing factors or elemental-property parameters, e.g. atomic number [AN], periodic number [PN; see footnote 1], etc.) and their crystallographic positions within a specific crystal structure (as well as its intrinsic physical properties). It is exactly this fact that gives us, in principle, the ability to make predictions and to develop a quantum simulation strategy to design and discover novel inorganic solids.
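The two restraints named under cornerstone (2) are simple enough to write down directly. The sketch below is only an illustration of the formulas as stated in the text (negative enthalpy of formation, and Gibbs' phase rule at constant pressure, P = C − F + 1); the function names are hypothetical and not part of any existing package.

```python
def may_form_compound(formation_enthalpy: float) -> bool:
    """Necessary condition from the text: the enthalpy of formation is negative."""
    return formation_enthalpy < 0.0


def max_coexisting_phases(n_components: int, degrees_of_freedom: int = 0) -> int:
    """Gibbs' phase rule at constant pressure: P = C - F + 1."""
    if n_components < 1:
        raise ValueError("at least one component is required")
    # With at least one phase present, F can range from 0 to C.
    if not 0 <= degrees_of_freedom <= n_components:
        raise ValueError("degrees of freedom must lie between 0 and C")
    return n_components - degrees_of_freedom + 1


# Binary system (C = 2) at constant pressure, for each allowed value of F.
for f in range(3):
    print(f"C = 2, F = {f}: P = {max_coexisting_phases(2, f)}")
```

For a binary system, the case F = 0 gives three coexisting phases (an invariant point such as a eutectic), while F = 2 corresponds to a single-phase region in which temperature and composition can both be varied.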

5.3 The First, Second, and Third Paradigms

Whether novel inorganic solids are prepared experimentally or obtained by quantum simulation, one starts with variables such as the chemical elements, their combinations, their concentrations, the temperature, and the pressure. Typically, only two different approaches can be adopted:

1. The first approach is pragmatic. Most of our current knowledge in materials science has been collected empirically by searching for patterns, rules, or laws within experimental data on inorganic solids published in the world literature. Gray called this the observational and empirical branch of science, representing the first paradigm of science.
2. The second approach is to use quantum simulations to model the motion of the atoms in inorganic solids, as well as their electronic interactions, remaining as close to reality as necessary. With these simulations, the crystal structure and the intrinsic physical properties can be understood from first principles. Based on this understanding, it is possible to generate novel inorganic solids by computer simulation alone, resulting in simulated inorganic solids data. Gray called these the theoretical and computational branches of science, representing, respectively, the second and third paradigms of science.

Footnote 1: The PN is a different enumeration of the chemical elements that groups them emphasizing the role of the valence electrons.

5.4 The Realization of the Fourth and Fifth Paradigms Requires Three Preconditions

5.4.1 Introduction of the Prototype Classification to Link Crystallographic Databases Created by Different Groups

The first requirement is the introduction of the prototype classification. This has to be done in a fully standardized way so that the cell parameters and the atomic coordinates can be directly compared; otherwise not even crystallographic


databases generated by different groups can be linked and compared. It is also a precondition for the second requirement.

5.4.2 Introduction of the Distinct Phases Concept to Link Different Kinds of Inorganic Solids Data

This concept was introduced in the PAULING FILE inorganic solids database (http://www.paulingfile.com) [4, 5], as well as its derived products [6–9]. A phase is defined by the chemical system, the chemical formula, and the crystal structure (using the prototype classification) and is given a unique name by combining its chemical formula and its modification. In devising the PAULING FILE, the ability to link different groups of data was considered paramount; it was therefore designed as a phase-oriented inorganic solids database using a relational database system. This was achieved by creating a distinct phases table along with all the required internal links. In practice this means that each chemical system has been evaluated and its distinct phases have been identified based on all the information available. Finally, every entry has been linked to such a distinct phase.

5.4.3 The Existence of a Comprehensive, Critically Evaluated Inorganic Solids Database Concept (DBMS) of Experimentally Determined Single-Phase Inorganic Solids Data to Be Used as Reference

Searching for patterns and correlations among experimentally determined inorganic solids data published in the world literature relies strongly on the availability of a sufficiently large body of inorganic solids data of appropriate quality. In addition, investigations are generally made long and tedious by the fact that the relevant inorganic solids data are stored in separate, isolated databases. Against this background, the creation of a comprehensive database, the PAULING FILE, was initiated in 1994; it covers all inorganic solids and consists of three interconnected parts: crystal structure/diffraction data, phase diagrams, and a large variety of intrinsic physical properties. PAULING FILE data have been carefully checked and fully standardized [10, 11]. This has been done with extreme care, since unrecognized errors will at the very least confuse any correlation tools, if not result in wrong rules being deduced. Launched 23 years ago, the PAULING FILE (http://www.paulingfile.com) [4, 5] remains the sole database of its kind, with over 500 000 inorganic solids data sets summarizing over 150 000 selected scientific publications covering the last 100 years. It is now becoming feasible to use it as the starting point for an inorganic solids reference database.
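The distinct phases concept of Section 5.4.2 can be illustrated with a small sketch. It is not the PAULING FILE schema: the class `DistinctPhase`, the helper `link`, the record layout, and the entry identifiers are all hypothetical, and the two ZnS modifications (sphalerite, cF8, space group 216; wurtzite, hP4, space group 186) are used only as a familiar example of polymorphism. The point is that a key built from chemical system, chemical formula, and prototype links records from two tables unambiguously, whereas the chemical formula alone does not.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DistinctPhase:
    """Hypothetical linking key: chemical system + formula + prototype."""
    chemical_system: str
    formula: str
    prototype: str  # "prototype,Pearson symbol,space group number"


def phase_key(record: dict) -> DistinctPhase:
    return DistinctPhase(record["system"], record["formula"], record["prototype"])


def link(entries_a: list, entries_b: list, key) -> dict:
    """Map each entry id in entries_a to the matching entry ids in entries_b."""
    index = defaultdict(list)
    for rec in entries_b:
        index[key(rec)].append(rec["id"])
    return {rec["id"]: index.get(key(rec), []) for rec in entries_a}


# Illustrative records only (identifiers are made up).
structure_entries = [
    {"id": "S-1", "system": "S-Zn", "formula": "ZnS", "prototype": "ZnS,cF8,216"},
    {"id": "S-2", "system": "S-Zn", "formula": "ZnS", "prototype": "ZnS,hP4,186"},
]
property_entries = [
    {"id": "P-1", "system": "S-Zn", "formula": "ZnS", "prototype": "ZnS,hP4,186"},
]

# Linking by formula alone attaches P-1 to both polymorphs (ambiguous) ...
by_formula = link(structure_entries, property_entries, key=lambda r: r["formula"])
# ... while the distinct-phase key attaches it to the hP4 modification only.
by_phase = link(structure_entries, property_entries, key=phase_key)

print(by_formula)
print(by_phase)
```

Using a frozen dataclass makes the key hashable, so it can serve directly as a dictionary key, much as a distinct phases table serves as the common foreign key in a relational design.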

5.5 The Core Idea of the Fifth Paradigm

For the first approach described previously (pragmatic level), the core idea is to use the links between the tabulated elemental-property parameters (or combinations of them) such as AN, PN, etc. For the second approach (simulation level),


the link is embodied by the laws of quantum mechanics (implying AN). For both approaches, the most central ingredient is the crystal structure of the inorganic solid, which is our “window” onto the electronic interactions of the atoms within a specific single-phase inorganic solid. The most striking manifestation of ordering in inorganic solids is the existence of the so-called prototypes of crystalline inorganic solids, which offer a fundamental systematic framework. Without this experimentally established fact, there would be little hope that either of the two approaches could succeed at all.

Presently there are three types of uncoordinated database producers:

1. Publishers who publish authors’ original work on experiments and/or simulations in peer-reviewed journals or books. This is done in a way that is optimally understood by humans but is very poor for machine interrogation; indeed, such complex scientific facts are difficult for machines to extract.
2. Professional data experts producing data sets of the highest quality (experimental and simulated) from available peer-reviewed data sources.
3. Computational scientists producing data sets by simulation with explicit definitions.

As the databases generated by producers 2 and 3 are relational, scientific data can be presented in many different ways that are optimal for humans. In principle, they are also optimal for machine interrogation, but unfortunately neither the database field “chemical system” nor the field “chemical formula” is able to link different kinds of inorganic solids data (these two database fields exist in all single-phase inorganic solids databases). The major reason is the existence of polymorphic inorganic solids, minerals, and berthollides (inorganic solids stable over a broad concentration range), which represent over 50% of the more than 165 000 experimentally known distinct inorganic solids. This leads to problems not only when trying to link different groups of inorganic solids databases, such as a crystal structure database with a phase diagram database, but also when trying to link inorganic solids databases of the same class produced by different expert groups. As a direct consequence, it is in practice not possible to realize the fourth paradigm (see Figure 5.2): the present situation lacks the prototype classification and the distinct phases concept, and therefore the key database fields “prototype” and “distinct phase”, which are necessary to link different databases, are missing. Initiatives following only the fourth paradigm therefore have no chance of success.

In addition, a major problem for carrying out large-scale computational studies of inorganic solids is that there exist too many chemical element combinations to be simulated blindly and, most importantly, there are no publicly available, comprehensive, and critically evaluated reference inorganic solids databases that are standardized and grant access to the raw data (database management system, DBMS). The preconditions for the proposed fifth paradigm are the availability of an inorganic solids database that covers experimentally determined data, implements the prototype classification and the distinct phases concept, and provides access to the


[Figure 5.2: schematic, “The fourth paradigm from the present database situation.” Top: fourth paradigm (eScience: data-intensive discovery through data exploration), resting on the first (experimental science), second (theoretical science), and third (computational science) paradigms. Below: database producer 1 (authors who publish original papers on experiments and/or computation in e-journals/e-books; peer reviewed), database producer 2 (professional data experts producing data sets of the highest quality from available data sources; editor reviewed; PAULING FILE), and database producer 3 (professional computational science experts who can calculate data with explicit definitions; editor reviewed; MARVEL). Annotation: “Only possible if linked.”]

Figure 5.2 Outline of the fourth paradigm showing in addition the three major uncoordinated database producers 1–3. As there exists in practice no link between these three databases, they cannot be linked by a computer. Therefore data-intensive discovery through data exploration is, in practice, not possible.

raw data from the DBMS. We will use the PAULING FILE as the most comprehensive inorganic solids database fulfilling these preconditions and serving as a suitable starting point (see Figure 5.3). It is now straightforward to develop, in parallel to the reference database system, a high-throughput simulated inorganic solids database using AiiDA, with the two being dynamically linked. Together, the two would provide a complementary “inorganic solids data system” containing the crystal structures and fundamental intrinsic physical properties of a steadily growing number of inorganic solids, with steadily improving precision and reliability.

In this context it is most essential to develop approaches that enable the following goal: strategic data exploration by searching for governing factors with the aim of formulating restraint conditions. This framework simultaneously opens up the possibility of linking inorganic solids databases produced by different groups. The use of governing factors and restraint conditions based on existing data should make it possible to reduce the infinite number of possible chemical element combinations to a realistic number of simulations (estimated to be still well over 10^10). An additional and most important aim is that such governing factors should be of general validity.

5.6 Restraint Conditions Revealed by “Inorganic Solids Overview–Governing Factor Spaces (Maps)” Discovered by Data-Mining Techniques

Through a balanced combination of selection, discovery, and design, we have demonstrated that it is possible to derive meaningful knowledge from a large


[Figure 5.3: schematic. Top: fourth paradigm (eScience: data-intensive discovery through data exploration), resting on the first (experimental science), second (theoretical science), and third (computational science) paradigms. Middle: database producer 1 (authors who publish original papers on experiments and/or computation in e-journals/e-books; peer reviewed), database producer 2 (professional data experts producing data sets from original papers; editor reviewed; PAULING FILE; reference database(s) DP21 to DP2n), and database producer 3 (professional computational science experts; editor reviewed; MARVEL; DP31 to DP3n), connected by the prototype classification, the distinct phases concept, and the database system concept (DBMS), with interoperability through data conversions. Bottom: the fifth paradigm, strategic data exploration by searching for governing factors, taking advantage of governing factors given by nature and using different data-mining techniques.]

Figure 5.3 Outline of the fifth paradigm showing in addition the three major coordinated database producers 1–3, linked with the help of the prototype classification, the distinct phases concept, and the database system concept (DBMS). Two blocks are added at the bottom: “Taking advantage from the marvelous systematic frameworks given by nature” and “Strategic data exploration by searching for governing factors.”

collection of data spanning thousands to tens of thousands of different inorganic solids [1, 12–14]. Relatively simple overview maps showing well-defined domains provide condensed overviews of experimentally determined data and reveal restraint conditions; they can therefore have predictive ability. The resulting restraint conditions (governing factors) show that atomic property parameters (e.g. AN, PN, RE_a(AN, PN), and SZ_a(AN, PN)) of the constituent chemical elements can be used effectively to parameterize the intrinsic physical properties of their inorganic solids. For visualization purposes, it is convenient, wherever possible, to reduce the parameter space to two-dimensional overviews or governing factor maps.

5.6.1 Compound Formation Maps

Figure 5.4 gives an example of such an overview map. Here, we investigated the presence or absence of inorganic solids in binary, ternary, and quaternary systems, using experimental data for about 15 500 chemical systems, extracted from over 35 000 publications [1]. Several efficient compound formation maps were found that clearly separate the chemical systems into two distinct domains, compound formers and non-formers. This separation supports the assertion that the crystal structure of inorganic solids can be quantitatively described by elemental-property parameters. This generalization is an important step when

157

158

5 A High-Throughput Computational Study Driven by the AiiDA Materials Informatics

y 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 x

Figure 5.4 Separation of 2330 binary systems into compound formers (blue) and non-formers (yellow) in one compound formation map showing max[PNA /PNmax , PNB /PNmax ] (y-axis) vs. [PNA /PNmax × PNB /PNmax ] (x-axis), where PN is the periodic number (a distinct integer assigned to each chemical element based on its position in Mendeleev’s periodic system).

strategically exploring structure-sensitive intrinsic physical properties of inorganic solids. By using elemental-property parameter expressions of PN as axes in two- or three-dimensional inorganic solids maps, we have reached an accuracy of 98% in separating formers and non-formers into distinct domains. Most importantly, these compound formation maps make it possible to predict the existence of inorganic solids in chemical systems that have not yet been experimentally investigated. 5.6.2

Atomic Environment Type Stability Maps for AB Inorganic Solids

The AETs, also called coordination polyhedra, realized by each chemical element in binary inorganic solids at the equi-atomic composition, were analyzed on a comprehensive set of literature data (about 2800 binary systems from over 8000 publications) [12, 13]. PN was used to classify the chemical systems. An AET map, using as coordinates the maximum PN vs. the ratio between the minimum and maximum PN, proved to effectively subdivide chemical systems into distinct stability domains of different atomic environments (Figure 5.5). The same AET stability map also showed a clear separation between chemical systems where AB inorganic solids form (with crystal structures different from those of their chemical elements) and those where no inorganic solids form. The AET stability map makes it possible to predict the existence of AB inorganic solids with a particular atomic environment. Analogous results can be obtained by focusing on crystal symmetry or prototype classification.

Figure 5.5 Atomic environment type (AET) stability map showing the periodic number PN_max (y-axis) vs. PN_min/PN_max (x-axis) for equi-atomic AB inorganic solids. The AET of the element with the highest periodic number is given on the left-hand side of x = 1; the AET of the element with the lowest periodic number in the same inorganic solid is given on the right-hand side in the same row. [Legend entries: no compounds, single atom, linear, non-linear, (co-planar) triangle, (co-planar) square, tetrahedron, square pyramid, trigonal prism, octahedron, 7-vertex polyhedron, cube, tricapped trigonal prism, 9–11 vertex polyhedra, (anti)cuboctahedron, icosahedron, rhombic dodecahedron, 13–16 vertex polyhedra, heptacapped pentagonal prism.]

5.6.3 Twelve Principles in Materials Science Supporting Three Cornerstones Given by Nature

In Ref. [14], we identified 12 principles elucidated from an inorganic solids database, which comprehensively covers the world literature. The principles are found to have general validity, giving us the ability to develop efficient experimental and computational exploration strategies. By way of example, we discuss one of these principles, the “stoichiometric ratio condition principle,” below. To date, 72% of all binaries have been investigated, from which we find that 95% of all binary daltonide inorganic solids crystallize in one of the following 10 stoichiometric ratios, AB_x: 1:6, 1:5, 1:4, 1:3, 1:2, 1:1.67, 1:1.5, 1:1.33, 1:1.25, and 1:1. For ternaries, 23 years ago, it was only possible to list seven of the most often occurring stoichiometric ratios. Meanwhile, 99% of the 17 083 daltonide basic ternary inorganic solids obey the following ternary stoichiometric ratio condition (see Figure 5.6): AB_xC_y, where x is one of the 10 stoichiometric ratios listed above, and y is equal to x, or y is equal to an integer larger than x, or y is equal to an integer divided by 2, 3, or 4 with y larger than x. Together with the active composition range principle, this leads to 1 stoichiometric ratio for ABC (x = y = 1), to 9 possible stoichiometric ratios for AB_xC_x (y = x), and to 819 possible stoichiometric ratios for AB_xC_y (x different from y).

Figure 5.6 Occurrences of daltonide inorganic solids vs. available concentration range for ternary systems (A_xB_yC_z, x < y < z, x + y + z = 1). Gray area showing where no ternary inorganic solid(s) occur. [Diagram labels: the 10 highest-frequency binary ratios 1:1, 1:1.25 (4:5), 1:1.33 (3:4), 1:1.5 (2:3), 1:1.67 (3:5), 1:2, 1:3, 1:4, 1:5, and 1:6; the ternary ratio condition AB_xC_y with x = 1, 5/4, 4/3, 3/2, 5/3, 2, 3, 4, 5, 6 and y > 1 an integer or an integer divided by 2, 3, or 4; the ternary ratios 1:1:1, 1:1.25:1.25 (4:5:5), 1:1.33:1.33 (3:4:4), 1:1.5:1.5 (2:3:3), 1:1.67:1.67 (3:5:5), 1:2:2, 1:3:3, 1:4:4, 1:5:5, and 1:6:6; and the series 1:1:n (n > 1) and 1:6:n (n > 6), with n an integer or an integer divided by 2, 3, or 4.]
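The ternary stoichiometric ratio condition stated above can be written as a short membership test. The following is a minimal Python sketch: the names BINARY_X and satisfies_ternary_ratio_condition are illustrative, and the decimal ratios are interpreted as the exact fractions listed in Figure 5.6 (1.25 = 5/4, 1.33 = 4/3, 1.67 = 5/3). It does not attempt to reproduce the count of 819 ratios, which also depends on the active composition range principle.

```python
from fractions import Fraction

# The 10 most frequent binary ratios AB_x, written as exact fractions
# (assumption: 1.25 = 5/4, 1.33 = 4/3, 1.5 = 3/2, 1.67 = 5/3).
BINARY_X = tuple(
    Fraction(s) for s in ("1", "5/4", "4/3", "3/2", "5/3", "2", "3", "4", "5", "6")
)


def satisfies_ternary_ratio_condition(x, y):
    """Return True if AB_xC_y obeys the ternary stoichiometric ratio condition."""
    x, y = Fraction(x), Fraction(y)
    if x not in BINARY_X:
        return False
    if y == x:
        return True
    # y must exceed x and be an integer or an integer divided by 2, 3, or 4.
    # In lowest terms such a number has denominator 1, 2, 3, or 4.
    return y > x and y.denominator in (1, 2, 3, 4)


print(satisfies_ternary_ratio_condition(1, 1))           # ABC
print(satisfies_ternary_ratio_condition("3/2", "7/3"))   # y = 7/3 > x = 3/2
print(satisfies_ternary_ratio_condition(2, "11/5"))      # denominator 5 is not allowed
```

Exact fractions are used instead of floating-point numbers so that ratios such as 5/3 compare equal without rounding issues.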

5.7 Quantum Simulation Strategy

The success of such a project depends on the validity and generality of the following basic assumptions:

(i) The first-principles methods employed are free of adjustable parameters and are code independent (e.g. Quantum ESPRESSO, VASP, CASTEP, etc.).
(ii) The selected PAULING FILE data sets are correct, especially the PAULING FILE structure/diffraction data, as these are used as the starting point of our first-principles simulations. (For each inorganic solid the “best” entry will be selected in cases where multiple entries of the same inorganic solid [from different publications] exist.)
(iii) The prototype classification introduced by Bill Pearson and its standardization extension introduced by Erwin Parthé, which is based on the well-established symmetry-based space group theory, are correct. The LPF distinct phases concept is entirely dependent on this assumption.
(iv) LPF-S/D and LPF-C provide a statistically large enough and comprehensive starting point, as the following data-centric observations fully rely on it.
    (a) Former system vs. non-former system plots (PN vs. PN) for binaries, ternaries, and quaternaries. A former system has at least one stable inorganic solid. PN stands for periodic number [data-centric fact, see Figure 5.4 [1]].
    (b) Iso-stoichiometric structure maps and atomic environment maps (PN vs. PN) [data-centric fact, see Figure 5.5 [1, 12]].
    (c) Over 2/3 of all experimentally investigated inorganic solids crystallize in one of the 1000 most populous prototypes. Presently there exist over 36 000 prototypes [data-centric statistical fact, unpublished].

Binary chemical systems

Number of potential systems: 4 950
Number of investigated systems LPF-S/D (structure): 2 569
Number of investigated systems LPF-C (constitution = phase diagram): 2 367
Number of investigated systems LPF-P (physical properties): 2 332
Number of LPF-S/D entries: 52 841
Number of phases LPF-S/D: 13 331
Number of entries per phase LPF-S/D: 3.9

The three data-centric facts listed above have been generally inferred from the data in the PAULING FILE. A key requirement for success is a careful choice of strategy for selecting which calculations to perform, as it is not practical to simulate all compositions (taking into account all 36 000 known prototypes per inorganic solid) even when restricting to binary systems. In general there are infinitely many inorganic solids to investigate. However, we use the following questions as a guide to focus the scope of the study and reduce the number of simulations to be performed:

(I) Does a specific chemical system contain any inorganic solids?
(II) If yes, how many inorganic solids exist per chemical system?
(III) What is the stoichiometric composition of each inorganic solid?
(IV) Which is its stable prototype under a set of given conditions?
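Question (I) is the one addressed by the compound formation map of Figure 5.4, and the AET stability map of Figure 5.5 uses closely related coordinates. As a rough sketch, the two coordinate transformations can be written as follows. The function names are illustrative, the PN values in the usage lines are placeholders rather than real periodic numbers, and treating PN_max in Figure 5.4 as a single normalizing constant supplied by the caller is an assumption, since its value is not given here.

```python
def compound_formation_coords(pn_a, pn_b, pn_scale):
    """Return (x, y) on the compound formation map of Figure 5.4.

    pn_scale plays the role of PN_max in the figure caption
    (assumed to be one constant used to normalize every PN).
    """
    a, b = pn_a / pn_scale, pn_b / pn_scale
    return a * b, max(a, b)


def aet_map_coords(pn_a, pn_b):
    """Return (x, y) = (PN_min/PN_max, PN_max) for an AB solid (Figure 5.5)."""
    pn_min, pn_max = sorted((pn_a, pn_b))
    return pn_min / pn_max, pn_max


# Placeholder periodic numbers, for illustration only.
print(compound_formation_coords(20, 80, pn_scale=100))
print(aet_map_coords(80, 20))
```

Each binary system becomes one point on the map, so classifying a not yet investigated system reduces to checking which domain its point falls into.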

For (I) we have found a strong data-centric correlation using only one integer parameter, the PN (see item (a) above). As we have no clear answers for questions (II)–(IV), we have to develop a “work-around” approach, taking advantage of the data-centric observations (b) and (c) above. We know that for binary systems 95% of all experimentally known inorganic solids belong to 1 of 10 stoichiometric ratios. In addition, for 55% of all binary systems, there exists a fully determined phase diagram; therefore we know at which stoichiometric ratios inorganic solids occur and at which they do not occur for those chemical systems. Without a carefully chosen simulation strategy, we risk wasting resources, e.g. by simulating potential inorganic solids belonging to non-compound-forming systems, by simulating potential stoichiometric inorganic solids in known compound-forming systems that have no inorganic solid at that specific stoichiometric ratio, or by simulating (and comparing) potential inorganic solids with improbable prototypes (for each chemical system under consideration). Given these considerations we propose the following strategy:

Step 1: Exclude from all simulations chemical systems belonging clearly to non-formers, and focus on binary systems (in a potential second step of the project, extend to ternary and quaternary systems).

Step 2: For the 10 most populous iso-stoichiometric ratios (e.g. AB, AB2), exclude all chemical systems for which we know from their compound-forming binary phase diagrams that this specific iso-stoichiometric inorganic solid does not exist.

Step 3: Focus on inorganic solids belonging to the 250 most populous binary prototypes (part of the 1000 most populous prototypes) representing 13 331 binary distinct inorganic solids. For inorganic solids with several independent experimental crystal structure determinations, focus first on those rated by the LPF editors as the “best” qualified data set (refined and assigned). In addition, fully refined data sets are given preference over assigned data sets.

Step 4: For each of the 250 most populous binary prototypes, create point-set s-, p-, d-, and f-element correlations (based on its known phases) and select, with the help of structure maps (based on its known prototypes), the 5–10 competing prototypes. This results in about 250 × 200 × (5 to 10) = 250 000 to 500 000 simulations, finally leading to 18 000 − 13 331 = 4669 predicted simulated inorganic solids (including the non-forming systems) belonging to one of the selected competing prototypes, where presently no information at all is known. Here we have taken advantage of several already published inorganic solids overview and governing factor spaces [1, 12–14], the existence of the correlation between the position of the chemical elements in the periodic table and their sites in a considered crystal structure, and the existence of the prototype classifications. This makes it possible, for each of the prototypes, to populate the atomic sites with all elements that are chemically meaningful.

Step 5: Evaluate, for each compound group mentioned in Step 3, a confidence level by comparing experimental data with our simulated data, as well as consistency with generally valid inorganic solids overview maps. Use the inorganic solids where only the cell parameters are determined as additional consistency checks against the most stable simulated structures.

Step 6: For inorganic solids having more than one modification, the simulation will also be done as a function of temperature (in the quasi-harmonic approximation) to determine the phase transition temperature.

Step 7: Simulate a broad range of intrinsic physical properties for already known inorganic solids and for predicted inorganic solids where there is a high confidence that the solid exists. Determine a confidence level for inorganic solids where published data exist.

Step 8: Store the simulated data in a relational database consistent with the PAULING FILE database structure, and create dynamic links so that in case of any changes or corrections, the simulated data will be recalculated automatically.

The trust in the correctness of the simulated data depends on three major factors:

(1) Persistent verification against experiment, especially against LPF-S/D, which represents the starting point of the simulations. Any change in the experimental LPF-S/D data set will automatically trigger a new simulation procedure. This interplay generates a “trust factor” for simulated structures.
(2) After the structure has been confirmed, its intrinsic physical properties will be simulated and again permanently compared with the existing LPF-P(roperty) and LPF-C(onstitution) data. This interplay generates a “trust factor” for the simulated property data. In addition, we take advantage of the existence of over 11 000 experimentally determined binary phase diagrams, especially in the context of the stoichiometric ratios of inorganic solids within such chemical systems.
(3) Each simulated structure will have a permanent link to the corresponding LPF-S(tructure) data set used as a starting value, so that it can always be compared with the corresponding experimental values.
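Steps 1 to 3 amount to filtering a list of candidate (system, ratio, prototype) combinations before any simulation is launched. The following is a minimal sketch of that filtering, assuming the three exclusion criteria are available as simple sets. The Candidate class, the prune function, and all labels such as "A", "B", and "P1" are hypothetical placeholders, not part of the PAULING FILE or of AiiDA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    system: frozenset   # the two elements, e.g. frozenset({"A", "B"})
    ratio: str          # stoichiometric ratio, e.g. "1:2"
    prototype: str      # prototype label


def prune(candidates, non_formers, absent_ratios, top_prototypes):
    """Keep only candidates that survive Steps 1 to 3 of the strategy."""
    kept = []
    for c in candidates:
        if c.system in non_formers:                 # Step 1: non-former system
            continue
        if (c.system, c.ratio) in absent_ratios:    # Step 2: ratio known to be absent
            continue
        if c.prototype not in top_prototypes:       # Step 3: improbable prototype
            continue
        kept.append(c)
    return kept


ab, ac = frozenset({"A", "B"}), frozenset({"A", "C"})
candidates = [
    Candidate(ab, "1:1", "P1"),
    Candidate(ab, "1:2", "P1"),
    Candidate(ab, "1:1", "P9"),
    Candidate(ac, "1:1", "P1"),
]
survivors = prune(
    candidates,
    non_formers={ac},
    absent_ratios={(ab, "1:2")},
    top_prototypes={"P1", "P2"},
)
print(len(survivors))

# Range of simulation counts quoted in Step 4.
low, high = 250 * 200 * 5, 250 * 200 * 10
print(low, high)
```

Filtering before submission keeps the number of expensive DFT runs proportional to the plausible candidates rather than to the full combinatorial space.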


5.8 Workflows Engine in AiiDA to Carry Out High-Throughput Calculation for the Creation of the Materials Cloud, Binaries Edition

The advance of quantum mechanical methods, particularly DFT, has enabled the determination of crystal structures and their properties in a way that requires less and less human intervention while achieving predictive accuracy for many (but by no means all) systems and properties. This ease of use, coupled with the increasing availability of large computational resources, has led to a huge increase in the number of calculations that can be performed. Gone are the days when only a handful of carefully cherry-picked calculations were carried out. Instead, nowadays, many calculations can be performed with relative ease in the hope of finding trends that may lead to further calculations or be used to focus experimental studies. Carrying out high-throughput calculations presents its own problems. Perhaps the two most significant are the validation of the accuracy of huge numbers of simulations and the management of the resulting data so as to allow for further scientific enquiry. Both of these problems are addressed by the AiiDA materials informatics platform [15]. In the following we first describe the AiiDA platform itself, and then we discuss an important aspect of carrying out reliable DFT calculations, namely, verifying pseudopotential calculations. Finally, we discuss workflows in AiiDA, an essential ingredient for automating high-throughput calculations.

5.8.1 AiiDA

AiiDA is a software platform written in Python that automates significant aspects of carrying out calculations by abstracting many of the commonly performed tasks, including interacting with remote computers and schedulers, running codes, and, crucially, organizing and storing the sequence of calculations that led to any result. A comprehensive overview of AiiDA can be found in Ref. [15]; here we summarize some of its most important aspects. A fundamental focus of the AiiDA platform is the preservation of the provenance that led to any result, based on concepts from the Open Provenance Model. In practice, this means that AiiDA maintains a database, structured as a directed acyclic graph, that consists of nodes, representing data and calculations, and edges that encode the relationships between them. For example, all input data nodes are connected to the calculations they were used in. The calculations, in turn, have links connecting them to any output data they produced. Figure 5.7 shows an example of a portion of a provenance graph where a crystal structure was relaxed and the relaxed structure was used in a subsequent DFT self-consistent field (SCF) calculation. With the large amounts of data produced by high-throughput calculations, retaining information about where a result came from becomes essential to maintain reproducibility and to ensure its validity.


Figure 5.7 An example of a portion of a provenance graph generated by AiiDA. Calculations are shown as squares while data is shown as circles.
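To make the graph structure concrete, the following is a minimal sketch in plain Python. It is not AiiDA's actual API: the ProvenanceGraph class, its methods, and the node and link names are all hypothetical and only mimic the relax-then-SCF example described in the text.

```python
from collections import defaultdict


class ProvenanceGraph:
    """A toy directed acyclic graph of data and calculation nodes."""

    def __init__(self):
        # Maps each node to the set of (parent node, link label) pairs.
        self._parents = defaultdict(set)

    def add_link(self, source, target, label):
        self._parents[target].add((source, label))

    def ancestors(self, node):
        """Return every node the given node was (directly or indirectly) derived from."""
        seen, stack = set(), [node]
        while stack:
            for parent, _label in self._parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


g = ProvenanceGraph()
g.add_link("structure", "relax", "in_structure")
g.add_link("parameters", "relax", "in_parameters")
g.add_link("relax", "relaxed_structure", "out_structure")
g.add_link("relaxed_structure", "scf", "in_structure")
g.add_link("scf", "energy", "out_energy")

print(sorted(g.ancestors("energy")))
# ['parameters', 'relax', 'relaxed_structure', 'scf', 'structure']
```

Walking the parent links backward from any result recovers the full chain of data and calculations that produced it, which is the property that makes results reproducible.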


5.8.2 SSSP (Standard Solid State Pseudopotentials) Library

High-throughput computation necessarily induces a shift away from analyzing each calculation and its results individually toward an approach that emphasizes the search for trends and common features. For such an approach to be successful, the underlying calculations must be trusted to be correct within the chosen theoretical framework. This does not necessarily mean that the calculations fully capture the physical nature of the system under study, for there may be approximations that either neglect or incorrectly capture physical phenomena. Rather, the calculations should be correct to the extent that they are fully converged and give the right answer for the given set of approximations. Thus another scientist using the same approximations implemented in another code should arrive at the same answer. An important aspect of this has been recently addressed in a large comparative study within the electronic structure community [16, 17].


For the project at hand, plane-wave DFT will be the main workhorse. A common approximation made when carrying out such calculations is to treat the core electronic states by a pseudopotential that should correctly reproduce certain properties of the wavefunctions of nonbonding electrons. There are many schemes for arriving at a pseudopotential, and different pseudopotentials can have significantly different convergence behaviors. In an attempt to quantify these for a range of pseudopotential families, I. Castelli and coworkers have recently performed a study [18] examining convergence, as a function of plane-wave cutoff, of properties including the maximum phonon frequency, formation energy of the solid, pressure, and other quantities for almost all of the relevant elements from the periodic table. For this work, we will use the “efficient” set of pseudopotentials and the corresponding cutoffs as provided by Castelli et al.

5.8.3 Workflows

Another important aspect of carrying out high-throughput calculations is the ability to encode scientific knowledge and decision making into workflows that automate the execution of a set of calculations, making decisions along the way based on the outcome of previous calculation steps. AiiDA has a comprehensive workflow engine that has been designed to provide maximum ease of use and flexibility while making workflows easy to debug when things go wrong. AiiDA provides two means of writing a workflow, workfunctions and workchains, both described below.

5.8.4 Workfunctions

Workfunctions are essentially plain Python functions with the decorator @wf, which tells AiiDA to keep track of the provenance of all inputs and outputs whenever the function is called. Figure 5.8 gives an example of a simple workflow that sums two numbers and multiplies the result by a third number. On the right we see the resulting provenance graph that shows not only the flow of data through the workflow but also call links to indicate when a workfunction called another. The advantage of workfunctions is their ease of use and debugging, particularly for those who already have a Python background. Their major drawback is that if a workflow is interrupted, either intentionally or because of a fault or crash, it cannot be restarted from where it left off, and the user must restart the entire process. For the toy example of Figure 5.8 this is not a problem; however, if a long-running calculation such as a DFT run crashes and the workflow is restarted, it will have no knowledge of where it left off and will effectively rerun all calculations, including those that had already finished successfully. To address these shortcomings AiiDA supports workchains, as described below.

    # `wf` (the workfunction decorator) and `Int` are provided by AiiDA.

    @wf
    def sum(a, b):
        return a + b

    @wf
    def prod(a, b):
        return a * b

    @wf
    def add_multiply_wf(a, b, c):
        return prod(sum(a, b), c)

    print(add_multiply_wf(Int(3), Int(4), Int(5)))
    > 35

[Right panel: provenance graph with the inputs a = 3, b = 4, and c = 5, the workfunctions add_multiply_wf, sum, and prod connected by call links, the intermediate result 7, and the final result 35.]

Figure 5.8 A simple workflow based on workfunctions showing the definition of the workflow on the left and the corresponding provenance graph that is produced after execution on the right.

5.8.5 Workchains

In Figure 5.9, we show the workflow from Figure 5.8 split up into individual steps. The sequence of steps that make up the workflow is defined in the outline within the _define function. While this approach requires more lines of code, it has the key advantage that the workflow can be continued from where it left off if it is terminated because of user intervention or a fault. For this toy example this functionality is clearly not necessary; however, in a real workflow the steps would execute long-running DFT calculations, in which case the ability to shut down the local computer and recover from errors is invaluable. As an added benefit, workchains allow the user to (optionally) specify, in an explicit way, the expected inputs (and outputs) and their data types. This serves as a machine-readable description of the workflow. Moreover, the AiiDA workflow engine will automatically take care of validating inputs and outputs.

    class AddMultiplyWf(Workchain):
        @classmethod
        def _define(cls, spec):
            super(AddMultiplyWf, cls)._define(spec)
            spec.input("a", valid_type=NumericType)
            spec.input("b", valid_type=NumericType)
            spec.input("c", valid_type=NumericType)
            spec.outline(cls.sum, cls.prod)

        def sum(self, ctx):
            ctx.sum = self.inputs.a + self.inputs.b

        def prod(self, ctx):
            self.out(ctx.sum * self.inputs.c)

Figure 5.9 A toy example of a workchain showing the workflow from Figure 5.8 split up into separate steps.

5.8.6 Workflows Used in This Project

For this project we use the so-called energy workflow, which takes an input structure along with basic DFT parameters and attempts to relax the atomic positions and cell parameters to find the local minimum. This workflow, shown schematically in Figure 5.10, takes care of submitting a calculation on a remote cluster and deals with the limited time allocation by restarting the computation if it was stopped before completion. The workflow will also recover from common failures, e.g. when the convergence is too slow, when the diagonalization algorithm fails, or when a relaxation process ends up in a structure very different from the initial one. Also, temporary failures to submit calculations onto the cluster are taken care of by automatic resubmission. In addition, the energy workflow will attempt to converge the volume to within a threshold by rerunning the calculation several times. This is especially important if there is a large volume change in the process, as a new grid of reciprocal-space G vectors must be generated for the new cell. Finally, the workflow can also prepare and launch the computation of the Kohn–Sham band structure along high-symmetry lines of the reciprocal lattice of the structure. Without something like the energy workflow, it would simply be impossible to carry out such a large-scale study, as the amount of human intervention required would be prohibitive.

Figure 5.10 A schematic of the structure relaxation workflow used throughout this study. [Diagram labels: inputs are the structure and the input parameters; the relaxation loops over an energy calculation until the volume converges; an energy “restart” sub-workflow restarts after a clean stop (max CPU time reached) and loops on itself, with changed parameters, if the energy calculation fails; if requested, an energy bands calculation follows; outputs are the relaxed structure, the electronic bands, and the output parameters.]
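The control flow of such a workflow can be sketched in a few lines. The following is plain Python, not the actual AiiDA energy workflow: run_energy_workflow, the calculate and adjust callables, the status strings, and the dictionary keys are all hypothetical names chosen for illustration, and a real calculation would be submitted to a remote cluster rather than called directly.

```python
def run_energy_workflow(structure, params, calculate, adjust,
                        volume_tol=0.01, max_iterations=20):
    """Relax until the cell volume changes by less than volume_tol (relative).

    calculate(structure, params) returns a dict with the keys
        "status":    "ok", "timeout" (clean stop), or "failed"
        "structure": the last structure reached
        "volume":    its cell volume
    adjust(params) returns modified parameters to retry after a failure.
    """
    previous_volume = None
    for _ in range(max_iterations):
        result = calculate(structure, params)
        if result["status"] == "failed":
            params = adjust(params)        # retry with changed parameters
            continue
        structure = result["structure"]    # continue from the latest structure
        if result["status"] == "timeout":
            continue                       # clean stop: simply resubmit
        volume = result["volume"]
        if (previous_volume is not None
                and abs(volume - previous_volume) <= volume_tol * previous_volume):
            return structure               # volume converged
        previous_volume = volume
    raise RuntimeError("volume did not converge")


# Toy stand-in for a DFT code: the volume approaches 100 geometrically.
def fake_calculate(structure, params):
    volume = 100 + (structure["volume"] - 100) * 0.1
    return {"status": "ok", "structure": {"volume": volume}, "volume": volume}


relaxed = run_energy_workflow({"volume": 150.0}, {}, fake_calculate, lambda p: p)
print(relaxed)
```

Separating the retry policy (adjust) from the calculation itself keeps the loop readable and makes it easy to add further recovery rules.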

5.9 Conclusions

Using the fifth paradigm of science, we expect to be able to discover novel inorganic solids (e.g. with predefined intrinsic physical properties) more efficiently than in the past, thus shortening the time between the discovery of novel inorganic solids and their industrial application. This happens at a time when solutions are sought for very fundamental problems of our modern society, such as global warming and energy shortage (development of alternative energy sources, as well as nuclear energy developments). The solutions to such problems are strongly connected to the ability of inorganic solids research to predict new materials with optimized properties of interest, and especially to the art of focusing first on the most promising novel inorganic solids to accelerate discovery: this is the major target of the fifth paradigm. To conclude, it is worth mentioning that all intrinsic physical properties of single-phase inorganic solids (e.g. electronic and magnetic properties, superconductivity, ferroelectricity, etc.) are strongly linked to their crystal structure, stressing the key importance of any crystal structure classification. Nevertheless, one has to realize that certain prototypes are necessary but not sufficient for the realization of predefined intrinsic physical properties of single-phase inorganic solids. As a direct consequence of this “non-one-way” condition, the relations are going to be complex, and new discoveries will always rely on the scientists’ ingenuity.

Acknowledgment

We would like to thank Dr. Erich Wimmer and Dr. Paul Saxe of the company MDI Inc. (France/New Mexico), Ying Chen of Tohoku University (Sendai, Japan), and Shuichi Iwata of the University of Tokyo (Tokyo, Japan) for interesting and stimulating discussions.

References

1 Villars, P., Daams, J., Shikata, Y. et al. (2008). Chem. Met. Alloys 1: 1–23. http://www.chemetal-journal.org.
2 Daams, J.L.C., van Vucht, I.H.N., and Villars, P. (1992). J. Alloys Compd. 182: 1–33.
3 Brunner, G.O. and Schwarzenbach, D. (1971). Z. Kristallogr. 133: 127–133.
4 Villars, P., Cenzual, K., Hulliger, F. et al. (2002). PAULING FILE (LPF), Binaries Edition on CD-ROM. Materials Park, OH: ASM International.
5 Villars, P., Onodera, N., and Iwata, S. (1998). J. Alloys Compd. 279: 1–7.
6 Villars, P. and Cenzual, K. (2012/2013). Pearson’s Crystal Data: Crystal Structure Database for Inorganic Compounds on CD-ROM, Release. Materials Park, OH: ASM International.
7 Villars, P. (Editor-in-Chief), Okamoto, H. and Cenzual, K. (Section Editors) (2013). ASM Alloy Phase Diagrams Center. Materials Park, OH: ASM International. http://www.asminternational.org/AsmEnterprise/APD.
8 (2013/2014). PDF-4+ on CD-ROM, Release. Newtown Square, PA: International Centre for Diffraction Data (ICDD).
9 Villars, P. (Editor-in-Chief), Hulliger, F., Okamoto, H., and Cenzual, K. (Section Editors) (2013). SpringerMaterials Online, PAULING FILE (LPF) part. Germany, Heidelberg: Springer Verlag. http://www.springerlink/SpringerMaterials.
10 Cenzual, K., Berndt, M., Brandenburg, K. et al. (2000). ESDD software package, copyright Japan Science and Technology Corporation. JST, Tokyo, Japan.
11 Parthé, E., Gelato, L., Chabot, B. et al. (1993/1994). Gmelin handbook of inorganic and organometallic chemistry, 8th Ed. In: TYPIX – Standardized Data and Crystal Chemical Characterization of Inorganic Structure Types, vol. 4. Heidelberg: Springer.
12 Villars, P., Cenzual, K., Daams, J. et al. (2004). J. Alloys Compd. 317–318: 167–175.
13 Villars, P., Daams, J., Shikata, Y. et al. (2008). Chem. Met. Alloys 1: 210–226. http://www.chemetal-journal.org.
14 Villars, P. and Iwata, S. (2013). Chem. Met. Alloys 6: 81–108. http://www.chemetal-journal.org.
15 Pizzi, G., Cepellotti, A., Sabatini, R. et al. (2016). Comput. Mater. Sci. 111: 218–230. https://doi.org/10.1016/j.commatsci.2015.09.013.
16 Lejaeghere, K., Bihlmayer, G., Björkman, T. et al. (2016). Science 351 (6280): https://doi.org/10.1126/science.aad3000.
17 Lejaeghere, K., Van Speybroeck, V., Van Oost, G., and Cottenier, S. (2014). Crit. Rev. Solid State Mater. Sci. 39 (1): 1–24. https://doi.org/10.1080/10408436.2013.772503.
18 Mounet, N., Gibertini, M., Schwaller, P. et al. (2018). Nature Nanotechnology 13: 246.


6 Modeling Materials Quantum Properties with Machine Learning

Felix A. Faber and O. Anatole von Lilienfeld

University of Basel, Department of Chemistry, Institute of Physical Chemistry and National Center for Computational Design and Discovery of Novel Materials (MARVEL), Klingelbergstr. 80, 4056 Basel, Switzerland

6.1 Introduction

Machine learning (ML) models of quantum properties of materials have been introduced as a new tool for high-throughput computations and materials informatics. The ability of ML to estimate within milliseconds a property that would otherwise take hours or even days to compute using first-principles methods is appealing for combinatorial screening, iterative exploration approaches, materials design, and the sampling of potential energy hypersurfaces. This chapter will summarize and review recently introduced models, with a focus on material applications and on kernel ridge regression (KRR). To begin, we provide a brief introduction on how the KRR ML model works. We will then discuss the differences between using KRR and traditional fitting models and give some general guidelines for the application of KRR to the prediction of materials as well as molecular properties. The next section deals with criteria of what constitutes good representations of compounds for quantifying chemical similarity. Finally, a brief overview is given of the latest research and advances made within the field.

6.2 Kernel Ridge Regression

KRR [1–4] is one of the most commonly used ML models within molecular and materials science. KRR is a type of regression with regularization [5] and has shown remarkable promise in predicting properties [6, 7] for relaxed molecular structures as well as for combinatorial screening for materials [8]. KRR expands the property in a kernel-function basis set where each basis function (often Gaussians, exponential functions, or polynomials) is centered on a training instance. KRR neither requires the prior definition of a fixed functional form of the fit, nor does it suffer from lack of flexibility: in the limit of zero noise, it reduces to the interpolating reproducing kernel method [9].

Materials Informatics: Methods, Tools and Applications, First Edition. Edited by Olexandr Isayev, Alexander Tropsha, and Stefano Curtarolo. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.


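A property $p$ of a query material $m$ is predicted by a sum of weighted kernels $K(m, m_i^{\mathrm{train}})$ between $m$ and all $N$ materials $m_i^{\mathrm{train}}$ in the training set:

$$p(m) = \sum_{i}^{N} \alpha_i K(m, m_i^{\mathrm{train}}) \tag{6.1}$$

The regression coefficients $\alpha_i$ are obtained by minimizing the Euclidean distance between the reference property $\mathbf{p}^{\mathrm{train}}_{\mathrm{ref}}$ of all molecules in the training set and the predicting function, with first-order Tikhonov regularization [10, 11]:

$$\min_{\boldsymbol{\alpha}} \left[ \lVert \mathbf{p}^{\mathrm{train}}_{\mathrm{est}} - \mathbf{p}^{\mathrm{train}}_{\mathrm{ref}} \rVert_2^2 + \lambda \boldsymbol{\alpha}^t \mathbf{K} \boldsymbol{\alpha} \right] = \min_{\boldsymbol{\alpha}} \left[ (\mathbf{p} - \mathbf{K}\boldsymbol{\alpha})^t (\mathbf{p} - \mathbf{K}\boldsymbol{\alpha}) + \lambda \boldsymbol{\alpha}^t \mathbf{K} \boldsymbol{\alpha} \right] \tag{6.2}$$

Here $\mathbf{p} = \mathbf{p}^{\mathrm{train}}_{\mathrm{ref}}$ and $\mathbf{p}^{\mathrm{train}}_{\mathrm{est}} = \mathbf{K}\boldsymbol{\alpha}$; $\mathbf{K}$ is the kernel matrix $K_{i,j} = K(m_i^{\mathrm{train}}, m_j^{\mathrm{train}})$, and $(\cdot)^t$ denotes the matrix transpose. $\lambda \boldsymbol{\alpha}^t \mathbf{K} \boldsymbol{\alpha}$ is the Tikhonov regularization, which prevents overfitting by penalizing large values of $\boldsymbol{\alpha}$. The amount of regularization is adjusted with the parameter $\lambda$. Equation (6.2) poses a convex optimization problem, and the solution is obtained by finding the $\boldsymbol{\alpha}$ for which the expression is stationary:

$$\frac{\partial}{\partial \boldsymbol{\alpha}} \left( (\mathbf{p} - \mathbf{K}\boldsymbol{\alpha})^t (\mathbf{p} - \mathbf{K}\boldsymbol{\alpha}) + \lambda \boldsymbol{\alpha}^t \mathbf{K} \boldsymbol{\alpha} \right) = \mathbf{0} \;\Leftrightarrow\; -2\mathbf{p}^t \mathbf{K} + 2\boldsymbol{\alpha}^t \mathbf{K}^2 + 2\lambda \boldsymbol{\alpha}^t \mathbf{K} = \mathbf{0} \;\Leftrightarrow\; \mathbf{p}^t \mathbf{K} = \boldsymbol{\alpha}^t (\mathbf{K} + \lambda \mathbf{I}) \mathbf{K} \tag{6.3}$$

$\mathbf{K}$ is symmetric and positive definite, so that the $\boldsymbol{\alpha}$ that minimizes Eq. (6.2) is obtained via

$$\boldsymbol{\alpha}^t = \mathbf{p}^t (\mathbf{K} + \lambda \mathbf{I})^{-1}, \quad \text{i.e.} \quad \boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{p} \tag{6.4}$$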

Figure 6.1 shows a schematic representation of how a KRR model can be trained and used for the prediction of properties of out-of-sample compounds.

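[Figure 6.1: flowchart. Learning (vertical flow): Data (m, p; $K_{i,j} = K(m_i, m_j)$) → Training ($\boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1}\mathbf{p}$) → Model ($p(m) = \sum_i^N \alpha_i K(m, m_i)$). Prediction (horizontal flow): Out of sample m → Model → Prediction p.]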

Figure 6.1 Flowchart depicting the training and prediction of properties p of materials m, using the KRR model. Horizontal flow shows how properties of new, out-of-sample, materials are predicted and vertical flow shows how the kernel model is trained on existing data using KRR. The training coefficients are obtained by inverting the regularized kernel matrix (K + 𝜆I).
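The training and prediction steps of Figure 6.1 can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the implementation used in the works reviewed here: the helper names, the toy data (a one-dimensional descriptor with p = sin m), the choice of a Gaussian kernel, and the values of σ and λ are all assumptions made for the example.

```python
import math


def gaussian_kernel(a, b, sigma):
    """Gaussian kernel between two descriptor vectors (lists of floats)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def train_krr(descriptors, properties, sigma, lam):
    """Return the coefficients alpha = (K + lam * I)^-1 p, cf. Eq. (6.4)."""
    n = len(descriptors)
    kernel = [[gaussian_kernel(descriptors[i], descriptors[j], sigma)
               for j in range(n)] for i in range(n)]
    for i in range(n):
        kernel[i][i] += lam  # regularized kernel matrix K + lam * I
    return solve_linear_system(kernel, properties)


def predict_krr(query, descriptors, alpha, sigma):
    """Return p(m) = sum_i alpha_i K(m, m_i), cf. Eq. (6.1)."""
    return sum(a * gaussian_kernel(query, m, sigma)
               for a, m in zip(alpha, descriptors))


# Toy training set (assumed): descriptors m = 0.0, 0.5, ..., 6.0 and p = sin(m).
train_m = [[0.5 * i] for i in range(13)]
train_p = [math.sin(m[0]) for m in train_m]

alpha = train_krr(train_m, train_p, sigma=1.0, lam=1e-6)
prediction = predict_krr([3.25], train_m, alpha, sigma=1.0)  # out-of-sample query
```

For realistic training set sizes one would replace the hand-written solver with a Cholesky factorization from a linear algebra library; the explicit elimination is used here only to keep the sketch dependency-free.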


The best choice of kernel function $K(m_1, m_2)$ depends on the descriptor and the problem. Commonly used kernels are the linear, Gaussian, and Laplacian kernels, and their respective forms are

$$\text{Linear:} \quad K(m_1, m_2) = m_1^t m_2$$

$$\text{Gaussian:} \quad K(m_1, m_2) = \exp\left( -\frac{\|m_1 - m_2\|_2^2}{2\sigma^2} \right)$$

$$\text{Laplacian:} \quad K(m_1, m_2) = \exp\left( -\frac{\|m_1 - m_2\|_1}{\sigma} \right)$$

where $\sigma$ is a hyperparameter, adjusting the width of the kernel. Systematic procedures for optimizing hyperparameters, such as $\sigma$ or $\lambda$, are discussed in [12].
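For concreteness, the three kernels can be written down directly. The sketch below assumes descriptors stored as plain lists of floats; the two example vectors and the value of σ are arbitrary illustrative choices.

```python
import math


def linear_kernel(m1, m2):
    """Linear kernel: inner product of the two descriptors."""
    return sum(x * y for x, y in zip(m1, m2))


def gaussian_kernel(m1, m2, sigma):
    """Gaussian kernel: depends on the squared Euclidean (L2) distance."""
    sq_l2 = sum((x - y) ** 2 for x, y in zip(m1, m2))
    return math.exp(-sq_l2 / (2.0 * sigma ** 2))


def laplacian_kernel(m1, m2, sigma):
    """Laplacian kernel: depends on the Manhattan (L1) distance."""
    l1 = sum(abs(x - y) for x, y in zip(m1, m2))
    return math.exp(-l1 / sigma)


# Two arbitrary example descriptors (assumed for illustration).
m1 = [1.0, 2.0, 0.0]
m2 = [1.0, 0.0, 2.0]

k_linear = linear_kernel(m1, m2)                   # 1*1 + 2*0 + 0*2
k_gaussian = gaussian_kernel(m1, m2, sigma=2.0)    # exp(-8 / 8)
k_laplacian = laplacian_kernel(m1, m2, sigma=2.0)  # exp(-4 / 2)
```

Increasing σ widens both the Gaussian and the Laplacian kernel, so that more distant training compounds contribute to a prediction.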

6.3 Model Assessment In any data fitting effort, performance on the training data is typically at least as good as performance on out-of-sample test data. One should therefore never assess the quality of a model solely using the training set: it does not reflect the actual accuracy of the model. Instead, data sets should be split into training and test sets. The model is then fitted on the training set, and errors are reported for the model evaluated on the test set only. Well-developed methods, collectively called cross-validation, exist for evaluating the performance of a model when overall data access is limited. They can also be employed to reduce statistical noise, and they are described in detail in Ref. [12].

6.3.1 Learning Curve

Traditional fitting of properties, e.g. when parameterizing force fields, is inherently limited by the fixed functional form assumed. As such, the traditional model merely finds the “best” fit, with the resulting behavior, also depicted in Figure 6.2, that the error on the training data increases with increasing training set size while the out-of-sample error comes down. Both errors converge toward the same finite residual, and the model is considered fully trained when the error on the training set is approximately equal to the error on the test set; see Figure 6.3. By contrast, the training set error of a KRR model remains very small, provided that the input m to the machine fulfills certain criteria discussed in Section 6.4. It does not increase with increasing training set size, and overfitting artifacts are avoided at the same time. Consequently, the error for out-of-sample predictions can be improved systematically through the addition of more training data, and models can reach arbitrarily high accuracy for out-of-sample predictions. Plotting the out-of-sample error vs. training set size with logarithmic scales on both the horizontal and vertical axes provides a convenient way to assess the quality of a model. The error of a good model should asymptotically approach a power-law decrease for a sufficiently large number of training samples [4, 13]. The initial value of the learning curve and its learning rate (the slope of the curve) can be used to characterize how well the model works. The learning curve can also indicate how much data is needed to reach a target accuracy; see Figure 6.4.

[Figure 6.2: schematic plot of the property p (vertical axis) against the descriptor m (horizontal axis), with one curve labeled "Traditional fitting" and one labeled "Kernel ridge regression".]

Figure 6.2 Schematic representation of the differences between a traditional fitting model and KRR. The horizontal axis depicts the descriptor m, the vertical axis depicts the property p, and the dots constitute low-noise training set data. Traditional fitting is not flexible enough to accurately fit the data. KRR is highly flexible and will pass very close to all the data points, mimicking the behavior of an interpolation.

[Figure 6.3: schematic plot of the error (vertical axis) against the training set size N (horizontal axis), showing "Test" and "Train" curves for "Traditional fitting" and for "Kernel ridge regression".]

Figure 6.3 Schematic learning curves for training and test set, depicting the difference between KRR and a traditional fitting model. The vertical axis shows the error of the model. The horizontal axis shows the number of compounds on which the model is trained. The error on the test set usually converges to the error on the training set when utilizing a traditional fitting model, while the training set error in a KRR model remains small.

6.3.2 Speedup

[Figure 6.4: schematic plot of log Error (vertical axis) against log N (horizontal axis), with curves labeled "Good model" and "Bad model", a "Target accuracy" level, and markers "Data obtained" and "Data needed".]

Figure 6.4 Schematic representation of the learning curves of a good and a bad KRR model. The vertical and horizontal axes represent the error and the training set size on a logarithmic scale. The learning curve of a good model should asymptotically decrease linearly for large N when plotted on a logarithmic scale. The learning rate of a bad KRR model will, on the other hand, decrease.

A KRR model can assess the property of a compound several orders of magnitude faster than density functional theory or other mean-field first-principles models. For a fair comparison, however, one needs to consider the time it takes to generate the training set as well as how many compounds and geometries can be predicted by the model. A quantitative measure that captures this is the speedup $\eta$:

$$\eta = \frac{t_{\mathrm{tot}}}{t_{\mathrm{train}} + t_{\mathrm{ML}}} \tag{6.5}$$

where $t_{\mathrm{tot}}$ is the time necessary to obtain the properties of all compounds of interest using the reference method, $t_{\mathrm{train}}$ is the time required to calculate the property of all training compounds (representative of the compounds of interest) using the reference method, and $t_{\mathrm{ML}}$ is the time it takes to train the ML model and apply it to predict all compounds of interest. Obviously, the use of ML models only makes sense when $\eta \geq 1$. In practice, $t_{\mathrm{ML}}$ is several orders of magnitude smaller than $t_{\mathrm{train}}$, resulting in

$$\eta \simeq \frac{t_{\mathrm{tot}}}{t_{\mathrm{train}}} \tag{6.6}$$

Note that $t_{\mathrm{tot}}$ is problem dependent and can be difficult to calculate. However, the speedup measure can be simplified further when there is a finite number of interesting compounds and under the assumption that all reference calculations require the same CPU time investment, $\tau$. $t_{\mathrm{tot}}$ and $t_{\mathrm{train}}$ can then be rewritten as $\tau N_{\mathrm{tot}}$ and $\tau N_{\mathrm{train}}$, respectively, yielding the straightforward formula

$$\eta \simeq \frac{N_{\mathrm{tot}}}{N_{\mathrm{train}}} \tag{6.7}$$

For example, in the case of predicting formation energies of 2 M (two million) elpasolite crystals using ML [8], a training set of 10k crystals was necessary for the ML model to reach DFT accuracy. As a result, the overall speedup was $\eta \simeq 200$.
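The speedup estimates of Eqs. (6.5) and (6.7) amount to simple arithmetic, sketched below for the elpasolite example. The function names are illustrative, and the per-calculation cost and the ML time in the second part are hypothetical values chosen only to show that a small ML cost barely changes the result.

```python
def speedup(t_tot, t_train, t_ml=0.0):
    """Eq. (6.5): eta = t_tot / (t_train + t_ml); all times in the same unit."""
    return t_tot / (t_train + t_ml)


def speedup_from_counts(n_tot, n_train):
    """Eq. (6.7): eta ~ N_tot / N_train, assuming equal cost per reference calculation."""
    return n_tot / n_train


# Elpasolite example from the text: 2 M candidate crystals, 10k training crystals.
eta_elpasolite = speedup_from_counts(2_000_000, 10_000)

# Hypothetical timings: tau = 1 CPU hour per reference calculation,
# plus 10 CPU hours to train and apply the ML model.
tau = 1.0
eta_with_ml_cost = speedup(tau * 2_000_000, tau * 10_000, t_ml=10.0)
```

With these assumed numbers the full expression gives a speedup of about 199.8, which illustrates why the ML time is usually neglected in Eq. (6.6).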


6.4 Representations A representation, or descriptor, m is a vector that encodes the structural and chemical information of a compound and can be used as input for the ML model. KRR and most other ML models are inductive and work under the assumption that related compounds should possess similar properties. The descriptor should therefore reflect the gradual change over similar compounds. The choice of descriptor plays an important role in how well an ML model works. This section discusses attributes that a good descriptor should possess. There should be an injective map between the compound and the descriptor, i.e. different compounds are represented by different descriptors. This is vital because the ML model will otherwise fail to distinguish between compounds and produce an identical property estimate. A descriptor consisting only of atom–atom pairwise features, for example, is a prime example of a representation with a non-injective mapping between compound and descriptor and will fail to distinguish between any pair of homometric molecules; see Figure 6.5 for an example. Another example of a non-injective descriptor is a molecular graph consisting of covalent bonds only. By construction, such descriptors will not distinguish conformational isomers, and ML models that rely on such descriptors cannot be used for certain properties, such as folding. It is also desirable that each compound is represented by exactly one descriptor, so that compounds and descriptors are in one-to-one correspondence (a bijection). This is not necessarily the case. When representing a molecule by its complete graph, for example, the ordering of the vertices can be altered, resulting in a different graph, without changing the compound. Imposing the bijection condition ensures that much superfluous information is removed.
This is beneficial in accordance with the epistemological rule of thumb commonly referred to as Occam’s razor.

Figure 6.5 Two homometric molecules (a) and (b), possessing the same internal distances. A descriptor containing only pairwise information will therefore fail to distinguish between (a) and (b).

A descriptor should also account for known constraints and invariances of the target property function, such as molecular translations or rotations, for which the molecular potential energy is invariant. If these invariances are accounted for through trivial expansion of the training set, rather than through an appropriate descriptor, the corresponding increase in the offset of the learning curve is roughly $\sim b \log(k)$, where k is the multiplicative factor by which the original training set N has to be multiplied in order to obtain an extended training set $N'$, which covers the relevant invariant dimensions. For example, accounting for the rotational degrees of freedom by rotating all molecules by $\pi$ around their x, y, and z axes would result in k = 3:

$$\log(\mathrm{Error}) = \log(a) - b \log(N') = \log(a) + b \log(k) - b \log(N) \tag{6.8}$$
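The power-law form of the learning curve lends itself to a simple fit in log-log space. The sketch below fits log(Error) = log(a) - b log(N) by least squares, estimates the training set size needed for a target error, and evaluates the offset b log(k). The learning-curve data are synthetic (generated from a = 2.0 and b = 0.5), and the reading that a k-fold expanded training set needs k times as many entries for the same error is an assumption made for this illustration.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(E) = log(a) - b * log(N); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


def samples_needed(a, b, target_error):
    """Invert E = a * N**(-b) for the training set size N."""
    return (a / target_error) ** (1.0 / b)


# Synthetic learning curve (assumed): E = 2.0 * N**(-0.5).
sizes = [100, 200, 400, 800, 1600]
errors = [2.0 * n ** -0.5 for n in sizes]

a, b = fit_power_law(sizes, errors)
n_needed = samples_needed(a, b, target_error=0.01)

# Covering an invariance by k-fold expansion of the training set shifts the
# curve up by b * log(k): the same error then requires k times as many entries.
k = 3
offset = b * math.log(k)
n_needed_expanded = k * n_needed
error_at_expanded = a * k ** b * n_needed_expanded ** (-b)
```

On real data the fit should be restricted to the large-N tail, since the power law is only expected asymptotically.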

As such, ignoring invariances necessarily results in underperforming learning curves, even if the descriptor is unique. The modeling of specific properties, such as smooth potential energy surfaces, can be exploited to impose additional constraints on the design of the representation. The mapping from a compound to the descriptor and its inverse could therefore also be smooth. Such a mapping is called a diffeomorphism, and a descriptor-based metric that fulfills this criterion is necessary to model physical quantities involving differentiation, such as forces. Note that the criterion involving diffeomorphism can be relaxed when the ML model is used for predicting discrete information, e.g. the property of a crystal structure in its ground state, as the property itself is discrete. Finding a descriptor that fulfills most, or all, of these criteria is not a trivial problem and remains a subject of ongoing research. However, it is possible to simplify the problem considerably by restricting the descriptor to a subspace of the chemical space (all possible atomic configurations and chemical compositions). This is exemplified in the next section, where we discuss advances and developments in this field. There are also so-called fingerprint descriptors, which can be seen as coarse-grained representations that do not contain all information about the system. Fingerprints can be useful for measuring similarities between compounds and can be used for ML models. However, note that due to their lack of uniqueness, fingerprint-based models do not systematically improve with more data and will have a high training set error, as fingerprints do not fully represent the system.
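The uniqueness requirement discussed in this section can be illustrated with a small homometric example. The sketch below uses two sets of points on a line as stand-ins for atomic positions (an assumption made to keep the example one-dimensional; these are not the molecules of Figure 6.5). Both sets have the same multiset of pairwise distances, yet they are not related by translation or reflection, so a descriptor built only from pairwise distances cannot tell them apart.

```python
from itertools import combinations


def pairwise_distance_descriptor(points):
    """Sorted list of all pairwise distances (invariant to atom ordering)."""
    return sorted(abs(a - b) for a, b in combinations(points, 2))


def canonical_form(points):
    """Coordinates up to translation and reflection, for comparing structures."""
    lo, hi = min(points), max(points)
    shifted = sorted(p - lo for p in points)
    mirrored = sorted(hi - p for p in points)
    return min(shifted, mirrored)


# Two homometric point sets on a line (illustrative stand-ins for molecules).
set_a = [0, 1, 4, 10, 12, 17]
set_b = [0, 1, 8, 11, 13, 17]

same_descriptor = (pairwise_distance_descriptor(set_a)
                   == pairwise_distance_descriptor(set_b))
same_structure = canonical_form(set_a) == canonical_form(set_b)
```

Here `same_descriptor` is true while `same_structure` is false, so any ML model fed with this descriptor would necessarily assign both structures the same property.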

6.5 Recent Developments This section briefly discusses selected works from the past few years that include some form of descriptor development and ML, with a focus on solid materials. Schütt et al. [14] propose the use of the partial radial distribution function (PRDF) to represent a compound. The PRDF considers the pairwise radial distance distributions between atom species. They apply the model to predict the density of states of structures collected from the Inorganic Crystal Structure Database [15, 16]. The model systematically improves with more training data. In Ref. [17] an attempt was made to generalize the Coulomb matrix descriptor [18, 19], which has been shown to work very well for molecules, to periodic systems. The “periodized” Coulomb matrices were used to predict formation energies of crystals collected from the Materials Project database [20]. These models also improve systematically with added data. However, the learning rate is relatively slow, and a large training set would be needed in order to reach desirable accuracy.
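As an illustration of the molecular Coulomb matrix of Refs. [18, 19] mentioned above, the sketch below builds the matrix with diagonal elements 0.5 Z^2.4 and off-diagonal elements Z_i Z_j / |R_i - R_j| (atomic units) and sorts rows and columns by row norm so that the result does not depend on the atom ordering. The water geometry is approximate and used only as an example; the periodic generalization of Ref. [17] is not shown.

```python
import math


def coulomb_matrix(charges, coords):
    """Coulomb matrix: 0.5 * Z_i**2.4 on the diagonal, Z_i * Z_j / R_ij elsewhere."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


def sort_by_row_norm(matrix):
    """Reorder rows and columns by descending row norm (ordering-independent form)."""
    n = len(matrix)
    norms = [math.sqrt(sum(v * v for v in row)) for row in matrix]
    order = sorted(range(n), key=lambda i: -norms[i])
    return [[matrix[i][j] for j in order] for i in order]


# Approximate water geometry in bohr (illustrative values only).
oxygen = (0.0, 0.0, 0.0)
hydrogen_1 = (1.4304, 1.1072, 0.0)
hydrogen_2 = (-1.4304, 1.1072, 0.0)

cm_ohh = coulomb_matrix([8, 1, 1], [oxygen, hydrogen_1, hydrogen_2])
cm_hoh = coulomb_matrix([1, 8, 1], [hydrogen_1, oxygen, hydrogen_2])

sorted_ohh = sort_by_row_norm(cm_ohh)
sorted_hoh = sort_by_row_norm(cm_hoh)
```

The raw matrices `cm_ohh` and `cm_hoh` differ because the atoms are listed in a different order, whereas the sorted forms coincide, which is the kind of invariance a descriptor is expected to provide.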


Another descriptor, introduced in Ref. [8], represents which elements are placed at each position of a given crystal structure without using explicit coordinates. A structure with n positions is represented by a tuple of length n where each position in the tuple corresponds to a position in the crystal structure. Each element is in turn represented by its row and column in the periodic table. The ML model is trained on the property of the fully relaxed structure within the symmetry; thus the descriptor always refers to the relaxed crystal without knowing exact coordinates. The model is trained and tested on a data set consisting of permutations of main-group elements up to Bi in the elpasolite crystal structure (ABC2D6 prototype). The model systematically improves with training data and reaches a mean absolute error below 0.1 eV/atom with 10k crystals in the training set. When considering the accuracy and learning rate of this model, it should be noted that the problem of learning across different compositions is more difficult than learning across different configurations. Many fingerprint descriptors have been developed for materials. In Ref. [21], several fingerprints are suggested, based on both the density of states and the structure of the material. Another fingerprint is suggested in Ref. [22], where the eigenvalues of an overlap matrix with an atom-centered Gaussian basis set are used to represent a structure. There have also been advances in developing descriptors for polymers. In Ref. [23] a fingerprint descriptor is developed and used to predict the electronic dielectric constant, the ionic dielectric constant, and the bandgap of polymers. The fingerprint incorporates information about which building blocks are used, how pairs of blocks are related, and how triples of blocks are related. This fingerprint demonstrates promising results even though it is not unique.
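A coordinate-free descriptor of the kind described for Ref. [8] can be sketched as follows. This is an illustration of the idea only, not the code of Ref. [8]: the small lookup table, the use of IUPAC group numbers (1 to 18) as the column index, the site ordering A, B, C, D, and the assignment of K2NaAlF6 to those sites are assumptions made for the example.

```python
# (row, column) of a few main-group elements in the periodic table.
# Columns are IUPAC group numbers 1-18 (an assumption of this sketch).
PERIODIC_TABLE_POSITION = {
    "Li": (2, 1),
    "F": (2, 17),
    "Na": (3, 1),
    "Al": (3, 13),
    "Cl": (3, 17),
    "K": (4, 1),
}


def elpasolite_descriptor(a, b, c, d):
    """Tuple of (row, column) pairs, one per site of the ABC2D6 prototype."""
    return tuple(PERIODIC_TABLE_POSITION[element] for element in (a, b, c, d))


def flatten(descriptor):
    """Flatten the per-site pairs into one vector usable as ML input."""
    return tuple(value for pair in descriptor for value in pair)


# Illustrative site assignment for K2NaAlF6: A = Al, B = Na, C = K, D = F.
descriptor = elpasolite_descriptor("Al", "Na", "K", "F")
feature_vector = flatten(descriptor)
```

Because no coordinates enter, the descriptor is identical before and after structural relaxation, which is why the model can be trained directly on properties of the relaxed crystals.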

References

1 Müller, K.R., Mika, S., Rätsch, G. et al. (2001). An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw. 12 (2): 181–201.
2 Schölkopf, B. and Smola, A.J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
3 Vovk, V. (2013). Kernel Ridge Regression, 105–116. Berlin, Heidelberg: Springer-Verlag. https://doi.org/10.1007/978-3-642-41136-6_11.
4 Hastie, T., Tibshirani, R., and Friedman, J. (2011). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2e. New York: Springer.
5 Hoerl, A.E. and Kennard, R.W. (2000). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 1: 80.
6 Montavon, G., Rupp, M., Gobre, V. et al. (2013). Machine learning of molecular electronic properties in chemical compound space. New J. Phys. 15 (9): 095003. http://stacks.iop.org/1367-2630/15/i=9/a=095003.
7 De, S., Bartók, A.P., Csányi, G., and Ceriotti, M. (2015). Comparing molecules and solids across structural and alchemical space. arXiv:1601.04077.
8 Faber, F., Lindmaa, A., von Lilienfeld, O.A., and Armiento, R. (2016). Machine learning energies of 2 M elpasolite (ABC2D6) crystals. Submitted. arXiv:1508.05315v2.
9 Hollebeek, T., Ho, T.S., and Rabitz, H. (2016). A fingerprint based metric for measuring similarities of crystalline structures. J. Chem. Phys. 144 (3): 034203. https://doi.org/10.1063/1.4940026.
10 Tikhonov, A.N. and Arsenin, V.Y. (1977). Solutions of Ill-Posed Problems. Washington, DC: V.H. Winston & Sons.
11 Hansen, P.C. (2010). Discrete Inverse Problems: Insight and Algorithms, vol. 7. Philadelphia, PA: SIAM.
12 Rupp, M. (2015). Machine learning for quantum mechanics in a nutshell. Int. J. Quantum Chem. 115 (16): 1058–1073. https://doi.org/10.1002/qua.24954.
13 Müller, K.R., Finke, M., Murata, N. et al. (1996). A numerical study on learning curves in stochastic multilayer feedforward networks. Neural Comput. 8 (5): 1085–1106.
14 Schütt, K.T., Glawe, H., Brockherde, F. et al. (2014). How to represent crystal structures for machine learning: towards fast prediction of electronic properties. Phys. Rev. B 89: 205118. https://doi.org/10.1103/PhysRevB.89.205118.
15 Belsky, A., Hellenbrandt, M., Karen, V.L., and Luksch, P. (2002). New developments in the inorganic crystal structure database (ICSD): accessibility in support of materials research and design. Acta Crystallogr., Sect. B: Struct. Sci. 58 (3): 364–369. https://doi.org/10.1107/S0108768102006948.
16 Bergerhoff, G., Hundt, R., Sievers, R., and Brown, I.D. (1983). The inorganic crystal structure data base. J. Chem. Inf. Comput. Sci. 23 (2): 66–69. https://doi.org/10.1021/ci00038a003.
17 Faber, F., Lindmaa, A., von Lilienfeld, O.A., and Armiento, R. (2015). Crystal structure representations for machine learning models of formation energies. Int. J. Quantum Chem. 115 (16): 1094–1101. https://doi.org/10.1002/qua.24917.
18 Rupp, M., Tkatchenko, A., Müller, K.R., and von Lilienfeld, O.A. (2012). Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 108: 058301. https://doi.org/10.1103/PhysRevLett.108.058301.
19 Hansen, K., Scheffler, M., Tkatchenko, A. et al. (2013). Assessment and validation of machine learning methods for predicting molecular atomization energies. J. Chem. Theory Comput. 9 (8): 3404–3419. https://doi.org/10.1021/ct400195d.
20 Jain, A., Ong, S.P., Hautier, G. et al. (2013). Commentary: The materials project: a materials genome approach to accelerating materials innovation. APL Mater. 1 (1): 011002. https://doi.org/10.1063/1.4812323.
21 Isayev, O., Fourches, D., Muratov, E.N. et al. (2015). Materials cartography: representing and mining materials space using structural and electronic fingerprints. Chem. Mater. 27 (3): 735–743. https://doi.org/10.1021/cm503507h.
22 Zhu, L., Amsler, M., Fuhrer, T. et al. (2016). A fingerprint based metric for measuring similarities of crystalline structures. J. Chem. Phys. 144 (3): 034203. https://doi.org/10.1063/1.4940026.
23 Mannodi-Kanakkithodi, A., Pilania, G., Huan, T.D. et al. (2016). Machine learning strategy for accelerated design of polymer dielectrics. Sci. Rep. 6: 20952. https://doi.org/10.1038/srep20952.


7 Automated Computation of Materials Properties

Cormac Toher¹, Corey Oses¹, and Stefano Curtarolo²

¹ Duke University, Department of Mechanical Engineering and Materials Science, 144 Hudson Hall, Durham, NC 27708, USA
² Duke University, Materials Science, Electrical Engineering, Physics and Chemistry, 144 Hudson Hall, Durham, NC 27708, USA

7.1 Introduction Materials informatics requires large repositories of materials data to identify trends in and correlations between materials properties, as well as for training machine learning models. Such patterns lead to the formulation of descriptors that guide rational materials design. Generating large databases of computational materials properties requires robust, integrated, automated frameworks [1]. Built-in error correction and standardized parameter sets enable the production and analysis of data without direct intervention from human researchers. Current examples of such frameworks include Automatic FLOW (AFLOW) [2–10], Materials Project [11–14], Open Quantum Materials Database (OQMD) [15–17], the Computational Materials Repository [18] and its associated scripting interface Atomic Simulation Environment (ASE) [19], Automated Interactive Infrastructure and Database for Computational Science (AiiDA) [20–22], and the Open Materials Database at httk.openmaterialsdb.se with its associated High-Throughput Toolkit (HTTK). Other computational materials science resources include the aggregated repository maintained by the Novel Materials Discovery (NoMaD) Laboratory [23], and the Theoretical Crystallography Open Database (TCOD) [24]. For this data to be consumable by automated machine learning algorithms, it must be organized in programmatically accessible repositories [4, 5, 7, 11, 12, 15, 23]. These frameworks also contain modules that combine and analyze data from various calculations to predict complex thermomechanical phenomena, such as lattice thermal conductivity and mechanical stability. 
Computational strategies have already had success in predicting materials for applications including photovoltaics [25], water splitters [26], carbon capture and gas storage [27, 28], nuclear detection and scintillators [29–32], topological insulators [33, 34], piezoelectrics [35, 36], thermoelectric materials [37–40], catalysis [41], and battery cathode materials [42–44]. More recently, computational materials data has been combined with machine learning approaches to predict electronic and thermomechanical properties [45, 46] and to identify superconducting materials [47]. Descriptors are also being constructed to describe the formation of disordered materials and have recently been used to predict the glass-forming ability (GFA) of binary alloy systems [48]. These successes demonstrate that accelerated materials design can be achieved by combining structured data sets generated using autonomous computational methods with intelligently formulated descriptors and machine learning.

7.2 Automated Computational Materials Design Frameworks Rapid generation of materials data relies on automated frameworks such as AFLOW [2–6], Materials Project’s pymatgen [13] and atomate [14], OQMD [15–17], ASE [19], and AiiDA [21]. The general automated workflow is illustrated in Figure 7.1. These frameworks begin by creating the input files required by the electronic structure codes that perform the quantum mechanical level calculations, where the initial geometry is generated by decorating structural prototypes (Figure 7.1a,b). They execute and monitor these calculations, reading any error messages written to the output files and diagnosing calculation failures. Depending on the nature of the errors, these frameworks are equipped with a catalog of prescribed solutions, enabling them to adjust the appropriate parameters and restart the calculations (Figure 7.1c). At the end of a successful calculation, the frameworks parse the output files to extract the relevant materials data such as total energy, electronic bandgap, and relaxed cell volume. Finally, the calculated properties are organized and formatted for entry into machine-accessible, searchable, and sortable databases. In addition to running and managing the quantum mechanical level calculations, the frameworks also maintain a broad selection of post-processing libraries for extracting additional properties, such as calculating X-ray diffraction (XRD) spectra from relaxed atomic coordinates, and the formation enthalpies for the convex hull analysis to identify stable compounds (Figure 7.1d). Results from calculations of distorted structures can be combined to calculate thermal and elastic properties [2, 49–51], and results from different compositions and structural phases can be amalgamated to generate thermodynamic phase diagrams.

7.2.1 Generating and Using Databases for Materials Discovery

A major aim of high-throughput computational materials science is to identify new, thermodynamically stable compounds. This requires the generation of new materials structures, which have not been previously reported in the literature, to populate the databases. The accuracy of analyses involving sets of structures, such as that used to determine thermodynamic stability, is contingent on sufficient exploration of the full range of possibilities. Therefore, autonomous materials design frameworks such as AFLOW use crystallographic prototypes to generate new materials entries consistently and reproducibly.
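The idea of generating candidate entries from a prototype can be sketched as a plain enumeration of element assignments. The prototype with three sites labeled A, B, and C and the element list below are hypothetical, and the sketch deliberately omits what a production framework would add, such as removing symmetry-equivalent decorations and attaching the actual lattice geometry.

```python
from itertools import permutations


def decorate_prototype(site_labels, elements):
    """Yield one {site: element} mapping per ordered choice of distinct elements."""
    for combo in permutations(elements, len(site_labels)):
        yield dict(zip(site_labels, combo))


# Hypothetical prototype with three inequivalent sites and four candidate elements.
prototype_sites = ("A", "B", "C")
candidate_elements = ["Na", "K", "Cl", "Br"]

candidates = list(decorate_prototype(prototype_sites, candidate_elements))
```

Because the enumeration is deterministic, rerunning it with the same prototype and element list reproduces the same candidate list in the same order, which reflects the consistency and reproducibility emphasized above.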

[Figure 7.1: four-panel workflow schematic. (a) Structural prototypes. (b) Decorate prototypes to create candidate materials. (c) Automated QM calculations: input file generation (aflow.in), calculation management (DFT code, error check), property calculation (electronic, thermal, elastic), and database storage. (d) Plot convex hull to identify stable materials: formation enthalpy (meV/atom) versus composition from Al to Re, with Al12Re, Al6Re, Al2Re, AlRe, and AlRe2 labeled.]

Figure 7.1 Computational materials data generation workflow. (a) Crystallographic prototypes are extracted from databases such as the ICSD or the NRL crystal structure library, or are generated by enumeration algorithms. The illustrated examples are for the rocksalt, zincblende, wurtzite, Heusler, inverse Heusler, and half-Heusler structures. (b) New candidate materials are generated by decorating the atomic sites with different elements. (c) Automated DFT calculations are used to optimize the geometric structure and calculate energetic, electronic, thermal, and elastic properties. Calculations are monitored to detect errors. The input parameters are adjusted to compensate for the problem, and the calculation is rerun. Results are formatted and added to an online data repository to facilitate programmatic access. (d) Calculated data is used to plot the convex hull phase diagrams for each alloy system to identify stable compounds.
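The monitor, diagnose, adjust, and rerun cycle of panel (c) can be sketched as a small retry loop around a calculation. Everything named below is hypothetical: the error label, the parameter `max_scf_steps`, the fix, and the stand-in `fake_dft` function are invented for illustration and do not correspond to the interface of AFLOW or of any other framework or DFT code.

```python
def run_with_error_correction(run_calculation, params, fixes, max_attempts=5):
    """Rerun a calculation, applying the prescribed fix for each diagnosed error."""
    params = dict(params)  # do not modify the caller's parameter set
    history = []
    for _ in range(max_attempts):
        result = run_calculation(params)
        if result["error"] is None:
            return result, history
        fix = fixes.get(result["error"])
        if fix is None:
            raise RuntimeError(f"no prescribed fix for error {result['error']!r}")
        history.append(result["error"])
        params = fix(params)
    raise RuntimeError("calculation did not succeed within max_attempts")


def fake_dft(params):
    """Stand-in for an electronic structure code (hypothetical behavior)."""
    if params["max_scf_steps"] < 100:
        return {"error": "scf_not_converged", "energy": None}
    return {"error": None, "energy": -1.0}


# Hypothetical catalog of prescribed solutions, keyed by error label.
FIXES = {
    "scf_not_converged": lambda p: {**p, "max_scf_steps": p["max_scf_steps"] * 2},
}

result, history = run_with_error_correction(fake_dft, {"max_scf_steps": 30}, FIXES)
```

Keeping the catalog as data rather than code paths makes it straightforward to extend with new error types, and capping the number of attempts prevents a calculation with an unfixable problem from looping indefinitely.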

Crystallographic prototypes are the basic building blocks used to generate the wide range of materials entries involved in computational materials discovery. These prototypes are based on (i) structures commonly observed in nature [52, 53], such as the rocksalt, zincblende, wurtzite, or Heusler structures illustrated in Figure 7.1a, as well as (ii) hypothetical structures, such as those enumerated by the methods described in Refs. [54, 55]. The AFLOW Library of Crystallographic Prototypes [53] is also available online at aflow.org/CrystalDatabase/, where users can choose from hundreds of crystal prototypes with adjustable parameters, which can be decorated to generate new input structures for materials science calculations. New materials are then generated by decorating the various atomic sites in the crystallographic prototype with different elements. These decorated prototypes serve as the structural input for ab initio calculations. A full relaxation of the geometries and energy determination follows, from which phase diagrams for stability analyses can be constructed. The resulting materials data are then stored in an online data repository for future consideration.

The phase diagram of a given alloy system can be approximated by considering the low-temperature limit in which the behavior of the system is dictated by the ground state [56, 57]. In compositional space, the lower-half convex hull defines the minimum energy surface and the ground-state configurations of the system. All non-ground-state stoichiometries are unstable, with the decomposition described by the hull facet directly below them. In the case of a binary system, the facet is a tie line as illustrated in Figure 7.2a. The energy gained from this decomposition is geometrically represented by the (vertical) distance of the compound from the facet and quantifies the excitation energy involved in forming this compound. While the minimum energy surface changes at finite temperature (favoring disordered structures), the T = 0 K excitation energy serves as a reasonable descriptor for relative thermodynamic stability [58]. This analysis generates valuable information such as ground-state structures, excitation energies, and phase coexistence for storage in the online data repository. This stability data can be visualized and displayed by online modules, such as those developed by AFLOW [58], the Materials Project [59], and the OQMD [16, 60]. An example visualization from AFLOW is shown in Figure 7.2b.

[Figure 7.2: (a) schematic binary convex hull with labels "Entries not on convex hull are unstable", "Decomposition energy", "Entries on convex hull", and "Tie lines forming convex hull". (b) Ternary convex hulls with element labels Co, Al, Pt, Fe, Pd, Zn, Ru, V, Tl, Zr, Te, and Y.]

Figure 7.2 Convex hull phase diagrams for multicomponent alloy systems. (a) Schematic illustrating construction of the convex hull for a general binary alloy system $A_xB_{1-x}$. Ground-state structures are depicted as red points, with the minimum energy surface outlined with blue lines. The minimum energy surface is formed by connecting the lowest-energy structures with tie lines that form a convex hull. Unstable structures are shown in green, with the decomposition reaction indicated by orange arrows and the decomposition energy indicated in purple. (b) Example ternary convex hulls as generated by AFLOW.

Convex hull phase diagrams have been used to discover new thermodynamically stable compounds in a wide range of alloy systems, including hafnium [61, 62], rhodium [63], rhenium [64], ruthenium [65], and technetium [66] with various transition metals, as well as the Co–Pt system [67]. Magnesium alloy systems such as the lightweight Li–Mg system [68] and 34 other Mg-based systems [69] have also been investigated. This approach has also been used to calculate the solubility of elements in titanium alloys [70], to study the effect of hydrogen on phase separation in iron–vanadium [71], and to find new superhard tungsten nitride compounds [72]. The data has been employed to generate structure maps for hcp metals [73], as well as to search for new stable compounds with the Pt8Ti phase [74] and with the L1₁ and L1₃ crystal structures [75]. Note that even if a structure does not lie on the ground-state convex hull, this does not rule out its existence. It may be synthesizable under specific temperature and pressure conditions and then be metastable under ambient conditions.
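For a binary system, the convex-hull construction and the decomposition energy described in this section can be sketched with a standard monotone-chain lower hull. The compositions and formation energies below are invented for illustration (in eV/atom, with the pure elements as the zero of energy), and the function names are not taken from AFLOW or any other framework.

```python
def lower_convex_hull(entries):
    """Return the vertices of the lower convex hull of (x, energy) entries."""
    lowest = {}
    for x, energy in entries:
        # Only the lowest-energy entry at each composition can be on the hull.
        lowest[x] = min(energy, lowest.get(x, energy))
    hull = []
    for point in sorted(lowest.items()):
        while len(hull) >= 2:
            (x1, e1), (x2, e2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (point[1] - e1) - (e2 - e1) * (point[0] - x1)
            if cross > 0:  # the last vertex lies below the new tie line: keep it
                break
            hull.pop()
        hull.append(point)
    return hull


def energy_above_hull(x, energy, hull):
    """Vertical distance of an entry from the tie line directly below it."""
    for (x1, e1), (x2, e2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            e_hull = e1 + (e2 - e1) * (x - x1) / (x2 - x1)
            return energy - e_hull
    raise ValueError("composition outside the range covered by the hull")


# Hypothetical entries (x in A_x B_(1-x), formation energy in eV/atom).
entries = [(0.0, 0.0), (0.25, -0.10), (0.5, -0.30),
           (0.5, -0.20), (0.75, -0.05), (1.0, 0.0)]

hull = lower_convex_hull(entries)
decomposition_energies = {entry: energy_above_hull(*entry, hull) for entry in entries}
```

In this made-up data set only the entry at x = 0.5 with -0.30 eV/atom is a ground state besides the pure elements; the entry at x = 0.25, for instance, lies 0.05 eV/atom above its tie line and would decompose into the two hull members bracketing it. Ternary and higher-order systems require a general convex hull algorithm rather than this two-dimensional sweep.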

7.2.2 Standardized Protocols for Automated Data Generation

Standard calculation protocols and parameter sets [6] are essential to the identification of trends and correlations among materials properties. The workhorse method for calculating quantum mechanically resolved materials properties is density functional theory (DFT). DFT is based on the Hohenberg–Kohn theorem [76], which proves that for a ground-state system, the potential energy is a unique functional of the density: $V(\vec{r}) = V(\rho(\vec{r}))$. This allows the charge density $\rho(\vec{r})$ to be used as the central variable for the calculations rather than the many-body wavefunction $\Psi(\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_N)$, dramatically reducing the number of degrees of freedom in the calculation. The Kohn–Sham equations [77] map the n coupled equations for the system of n interacting particles onto a system of n independent equations for n noninteracting particles:

$$\left[-\frac{\hbar^2}{2m}\nabla^2 + V_s(\vec{r})\right]\phi_i(\vec{r}) = \varepsilon_i\,\phi_i(\vec{r}) \qquad (7.1)$$

where $\phi_i(\vec{r})$ are the noninteracting Kohn–Sham eigenfunctions and $\varepsilon_i$ are their eigenenergies. $V_s(\vec{r})$ is the Kohn–Sham potential:

$$V_s(\vec{r}) = V(\vec{r}) + \int \frac{e^2 \rho_s(\vec{r}')}{|\vec{r} - \vec{r}'|}\, d^3\vec{r}' + V_{\mathrm{XC}}[\rho_s(\vec{r})] \qquad (7.2)$$

where $V(\vec{r})$ is the external potential (which includes influences of the nuclei, applied fields, and the core electrons when pseudopotentials are used),
the second term is the direct Coulomb potential, and VXC [𝜌s (⃗r)] is the exchange–correlation term. The mapping onto a system of n noninteracting particles comes at the cost of introducing the exchange–correlation potential VXC [𝜌s (⃗r)], the exact form of which is unknown and must be approximated. The simplest approximation is the local density approximation (LDA) [78], in which the magnitude of the exchange–correlation energy at a particular point in space is assumed to be proportional to the magnitude of the density at that point in space. Despite its simplicity, LDA produces realistic results for atomic structure, elastic, and vibrational properties for a wide range of systems. However, it tends to overestimate the binding energies of materials, even putting crystal bulk phases in the wrong energetic order [79]. Beyond LDA is the generalized gradient approximation (GGA), in which the exchange–correlation term is a functional of the charge density and its gradient at each point in space. There are several forms of GGA including those developed by Perdew, Burke, and Ernzerhof (PBE, [80]) or by Lee, Yang, and Parr (LYP, [81]). A more recent development is the meta-GGA strongly constrained and appropriately normed (SCAN) functional [82], which satisfies all 17 known exact constraints on exchange–correlation functionals. The major limitations of LDA and GGA include their inability to adequately describe systems with strongly correlated or localized electrons, due to the local and semilocal nature of the functionals. Treatments include the Hubbard U corrections [83, 84], self-interaction corrections [78], and hybrid functionals such as Becke’s three-parameter modification of LYP (B3LYP, [85]) and that of Heyd, Scuseria, and Ernzerhof (HSE, [86]). Within the context of ab initio structure prediction calculations, GGA-PBE is the usual standard since it tends to produce accurate geometries and lattice constants [56]. 
To account for strong correlation effects, the DFT+U method [83, 84] is often favored in large-scale automated database generation due to its low computational overhead. However, the traditional DFT+U procedure requires the addition of an empirical factor to the potential [83, 84]. Recently, methods have been implemented to calculate the U parameter self-consistently from first principles, such as the ACBN0 functional [87]. DFT also suffers from an inadequate description of excited/unoccupied states, as the theory is fundamentally based on the ground state. Extensions for describing excited states include time-dependent density functional theory (TDDFT) [88] and the GW correction [89]. However, these methods are typically much more expensive than standard DFT and are not generally considered for large-scale database generation.

At the technical implementation level, there are many DFT software packages available, including VASP [90–93], QuantumESPRESSO [94, 95], ABINIT [96, 97], FHI–AIMS [98], SIESTA [99], and GAUSSIAN [100]. These codes are generally distinguished by the choice of basis set. There are two principal types of basis sets: plane waves, which take the form $\psi(\vec{r}) = \sum e^{i\vec{k}\cdot\vec{r}}$, and local orbitals, formed by a sum over functions $\phi_a(\vec{r})$ localized at particular points in space, such as Gaussians or numerical atomic orbitals [101]. Plane-wave-based packages include VASP, QuantumESPRESSO, and ABINIT, and are generally better suited to periodic systems such as bulk inorganic materials. Local orbital-based packages include FHI–AIMS, SIESTA, and GAUSSIAN, and are generally better suited to nonperiodic systems such as organic molecules. In the field of automated computational materials science, plane-wave codes such as VASP are generally preferred: it is straightforward to automatically and systematically generate well-converged basis sets since there is only a single parameter to adjust, namely, the cutoff energy determining the number of plane waves in the basis set. Local orbital basis sets tend to have far more independently adjustable degrees of freedom, such as the number of basis orbitals per atomic orbital as well as their respective cutoff radii, making the automated generation of reliable basis sets more difficult. Therefore, a typical standardized protocol for automated materials science calculations [6] relies on the VASP software package with a basis set cutoff energy higher than that recommended by the VASP potential files, in combination with the PBE formulation of GGA.

Finally, it is necessary to automate the generation of the reciprocal-lattice k-point grid and pathways in reciprocal space used for the calculation of forces, energies, and the electronic band structure. In general, DFT codes use standardized methods such as the Monkhorst–Pack scheme [102] to generate k-point grids, although optimized grids have been calculated for different lattice types and are available online [103]. Optimizing k-point grid density is a computationally expensive process that is difficult to automate, so instead standardized grid densities based on the concept of “k-points per reciprocal atom” (KPPRA) are used. The KPPRA value is chosen to be sufficiently large to ensure convergence for all systems. Typical recommended values used for KPPRA range from 6000 to 10 000 [6], so that a material with two atoms in the calculation cell will have a k-point mesh of at least 3000–5000 points.
Standardized directions in reciprocal space have also been defined for the calculation of the band structure as illustrated in Figure 7.3 [3]. These paths are optimized to include all of the high-symmetry points of the lattice.
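The KPPRA bookkeeping described above can be illustrated with a minimal sketch. It assumes a simple rule in which the number of divisions along each reciprocal lattice vector is proportional to its length; the function name `kppra_mesh` and this proportionality rule are illustrative assumptions and are not taken from AFLOW or VASP.

```python
import math


def kppra_mesh(recip_lengths, n_atoms, kppra):
    """Smallest mesh (N1, N2, N3), with N_i roughly proportional to |b_i|,
    such that N1 * N2 * N3 * n_atoms >= kppra (hypothetical rule)."""
    b_max = max(recip_lengths)
    scale = 1
    while True:
        # Longer reciprocal vectors receive proportionally more divisions.
        mesh = tuple(max(1, math.ceil(scale * b / b_max)) for b in recip_lengths)
        if mesh[0] * mesh[1] * mesh[2] * n_atoms >= kppra:
            return mesh
        scale += 1


# Hypothetical cubic cell with two atoms and KPPRA = 6000.
mesh = kppra_mesh((1.0, 1.0, 1.0), n_atoms=2, kppra=6000)
n_kpoints = mesh[0] * mesh[1] * mesh[2]
print(mesh, n_kpoints)  # (15, 15, 15) 3375
```

For this two-atom example the mesh holds 3375 points, consistent with the lower bound of 3000 points quoted above for KPPRA = 6000.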

7.3 Integrated Calculation of Materials Properties

Automated frameworks such as AFLOW combine the computational analysis of properties including symmetry, electronic structure, elasticity, and thermal behavior into integrated workflows. Crystal symmetry information is used to find the primitive cell to reduce the size of DFT calculations, to determine the appropriate paths in reciprocal space for electronic band structure calculations (see Figure 7.3, [3]), and to determine the set of inequivalent distortions for phonon and elasticity calculations.

Thermal and elastic properties of materials are important for predicting the thermodynamic and mechanical stability of structural phases [104–107] and assessing their importance for a variety of applications. Elastic properties such as the shear and bulk moduli are important for predicting the hardness of materials [108, 109] and thus their resistance to wear and distortion. Elasticity tensors can be used to predict the properties of composite materials [110, 111]. They are also important in geophysics for modeling the propagation of seismic waves in order to investigate the mineral composition of geological formations [105, 112, 113].

The lattice thermal conductivity ($\kappa_L$) is a crucial design parameter in a wide range of important technologies, such as the development of new thermoelectric materials [39, 114, 115], heat sink materials for thermal management in electronic devices [116], and rewritable phase-change memories [117]. High thermal conductivity materials, which typically have a zincblende or diamond-like structure, are essential in microelectronic and nanoelectronic devices for achieving efficient heat removal [118] and have been intensively studied for the past few decades [119]. Low thermal conductivity materials constitute the basis of a new generation of thermoelectric materials and thermal barrier coatings [120].

The calculation of thermal and elastic properties offers an excellent example of the power of integrated computational materials design frameworks. With a single input file, these frameworks can automatically set up and run calculations of different distorted cells, and combine the resulting energies and forces to calculate thermal and mechanical properties.

Figure 7.3 Standardized paths in reciprocal space for calculation of the electronic band structures for the 25 different lattice types [3]. Source: Setyawan and Curtarolo 2010 [3]. Reproduced with permission of Elsevier.

7.3.1 Autonomous Symmetry Analysis

Critical to any analysis of crystals is the accurate determination of the symmetry profile. For example, symmetry serves to (i) validate the forms of the elastic constants and compliance tensors, where the crystal symmetry dictates equivalence or absence of specific tensor elements [50, 106, 121], and (ii) reduce the number of ab initio calculations needed for phonon calculations, where, in the case of the finite displacement method, equivalent atoms and distortion directions are identified through factor group and site symmetry analyses [122]. Autonomous workflows for elasticity and vibrational characterizations therefore require a correspondingly robust symmetry analysis. Unfortunately, standard symmetry packages [123–126], catering to different objectives, depend on tolerance-tuning to overcome numerical instabilities and atypical data – emanating from finite temperature measurements and uncertainty in experimentally reported observations. These tolerances are responsible for validating mappings and identifying isometries, such as the n-fold operator depicted in Figure 7.4a. Some standard packages define separate tolerances for space, angle [126], and even operation type [123–125] (e.g. rotation vs. inversion). Each parameter introduces a factorial expansion of unique inputs, which can result in distinct symmetry profiles as illustrated in Figure 7.4b. By varying the spatial tolerance 𝜖, four different space groups can be observed for AgBr (ICSD #56551 [www.aflow.org/material.php?id=56551]), if one is found at all. Gaps in the range, where no consistent symmetry profile can be resolved, are particularly problematic in automated frameworks, triggering critical failures in subsequent analyses. Cell shape can also complicate mapping determinations. Anisotropies in the cell, such as skewness of lattice vectors, translate to distortions of fractional and reciprocal spaces. 
A uniform tolerance sphere in Cartesian space, inside which points are considered mapped, generally warps to a sheared spheroid, as depicted in Figure 7.4c. Hence, distances in these spaces are direction dependent, compromising the integrity of rapid minimum-image determinations [127] and generally warranting prohibitively expensive algorithms [128]. Such failures can result in incommensurate symmetry profiles, where the real space lattice profile (e.g. bcc) does not match that of the reciprocal space (fcc).

Figure 7.4 Challenges in autonomous symmetry analysis. (a) An illustration of a general n-fold symmetry operation. (b) Possible space group determinations with mapping tolerance $\epsilon$ for AgBr (ICSD #56551). (c) Warping of mapping tolerance sphere with a transformation from Cartesian to fractional basis.
[Panel labels: (b) mcl SG #11, orc SG #59, rhl SG #166, fcc SG #225, for $\epsilon$ between 0 and 1.0; (c) cubic, orthorhombic, and triclinic cells in the Cartesian and fractional bases.]

The new AFLOW-SYM module [128] within AFLOW offers careful treatment of tolerances, with extensive validation schemes, to mitigate the aforementioned challenges. Although a user-defined tolerance input is still available, AFLOW defaults to one of two predefined tolerances, namely, tight (standard) and loose. Should any discrepancies occur, these defaults are the starting values of a large tolerance scan, as shown in Figure 7.4b. A number of validation schemes have been incorporated to catch such discrepancies. These checks are consistent with crystallographic group theory principles, validating operation types and cardinalities [129]. From considerations of different extreme cell shapes, a heuristic threshold based on skewness and mapping tolerance has been defined to classify scenarios where mapping failures are likely to occur. When benchmarked against standard packages for over 54 000 structures in the Inorganic Crystal Structure Database, AFLOW-SYM consistently resolves the symmetry characterization most compatible with experimental observations [128].
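The direction dependence of distances in fractional space, which underlies the warped tolerance sphere of Figure 7.4c, can be made concrete with a small sketch. It is not the AFLOW-SYM algorithm; it only evaluates the Cartesian length of a fractional displacement through the metric tensor. The two cells and the function names are hypothetical.

```python
import math


def metric_tensor(lattice):
    """G_ij = a_i . a_j for lattice vectors given as rows (Cartesian)."""
    return [[sum(x * y for x, y in zip(a, b)) for b in lattice] for a in lattice]


def cartesian_length(frac, metric):
    """Cartesian length of a fractional displacement f: sqrt(f^T G f)."""
    return math.sqrt(sum(frac[i] * metric[i][j] * frac[j]
                         for i in range(3) for j in range(3)))


def maps_within(frac, metric, eps):
    """True if the displacement lies inside the Cartesian tolerance sphere."""
    return cartesian_length(frac, metric) <= eps


# Hypothetical cells (lengths in angstrom): a cubic cell and a skewed one.
cubic = metric_tensor([[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]])
skewed = metric_tensor([[4.0, 0.0, 0.0], [3.6, 1.0, 0.0], [0.0, 0.0, 4.0]])

# Two fractional displacements of identical fractional magnitude.
d_plus, d_minus = (0.01, 0.01, 0.0), (0.01, -0.01, 0.0)
ratio = cartesian_length(d_plus, skewed) / cartesian_length(d_minus, skewed)
print(round(ratio, 2))
```

In the cubic cell both displacements have the same Cartesian length, whereas in the skewed cell they differ by roughly a factor of seven, so a single tolerance applied in fractional coordinates would accept one mapping and reject the other.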

Along with accuracy, AFLOW-SYM delivers a wealth of symmetry properties and representations suitable for use in any analysis or workflow. The full set of operators (including those of the point, factor, crystallographic point, space group, and site symmetries) is provided in matrix, axis–angle, matrix generator, and quaternion representations in both Cartesian and fractional coordinates. A span of characterizations, organized by degree of symmetry-breaking, is available, including those of the lattice, superlattice, crystal, and crystal spin. Space group and Wyckoff positions are also resolved. The full data set is made available in both plain text and JSON formats.

7.3.2 Elastic Constants

There are two main methods for calculating the elastic constants, based on the response of either the stress tensor or the total energy to a set of applied strains [50, 51, 130–132]. Automated implementations of these methods are included in the AFLOW (referred to as the Automatic Elasticity Library, AEL [51]) and Materials Project frameworks [50]. To calculate the elastic tensor, several different normal and shear strains should be applied to the calculation cell in each independent direction [50, 51], as illustrated in Figure 7.5a. The resulting stress tensor elements $\sigma_{ij}$, obtained from the directional forces on the cell calculated with DFT, can then be fitted to the applied strains $\epsilon_{ij}$ to obtain the corresponding elastic constants $c_{ij}$ in the form of the stiffness tensor:

$$\begin{pmatrix} \sigma_{11} \\ \sigma_{22} \\ \sigma_{33} \\ \sigma_{23} \\ \sigma_{13} \\ \sigma_{12} \end{pmatrix} = \begin{pmatrix} c_{11} & c_{12} & c_{13} & c_{14} & c_{15} & c_{16} \\ c_{12} & c_{22} & c_{23} & c_{24} & c_{25} & c_{26} \\ c_{13} & c_{23} & c_{33} & c_{34} & c_{35} & c_{36} \\ c_{14} & c_{24} & c_{34} & c_{44} & c_{45} & c_{46} \\ c_{15} & c_{25} & c_{35} & c_{45} & c_{55} & c_{56} \\ c_{16} & c_{26} & c_{36} & c_{46} & c_{56} & c_{66} \end{pmatrix} \begin{pmatrix} \epsilon_{11} \\ \epsilon_{22} \\ \epsilon_{33} \\ 2\epsilon_{23} \\ 2\epsilon_{13} \\ 2\epsilon_{12} \end{pmatrix} \qquad (7.3)$$

written in the 6 × 6 Voigt notation using the mapping [105] 11 → 1, 22 → 2, 33 → 3, 23 → 4, 13 → 5, and 12 → 6. Symmetry analysis such as that provided by AFLOW-SYM can be used to reduce the number of required calculations by up to a factor of 3 in the case of cubic systems, as well as for verification of the computed tensors [106]. The elastic constants can then be used in the Voigt or Reuss approximations, which for polycrystalline materials correspond to assuming uniform strain and uniform stress, respectively, and give the upper and lower bounds on the elastic moduli. In the Voigt approximation, the bulk modulus is given by

$$B_{\mathrm{Voigt}} = \frac{1}{9}\left[(c_{11} + c_{22} + c_{33}) + 2(c_{12} + c_{23} + c_{13})\right] \qquad (7.4)$$

and the shear modulus is given by

$$G_{\mathrm{Voigt}} = \frac{1}{15}\left[(c_{11} + c_{22} + c_{33}) - (c_{12} + c_{23} + c_{13})\right] + \frac{1}{5}(c_{44} + c_{55} + c_{66}) \qquad (7.5)$$
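The stress–strain fit behind Eq. (7.3) can be sketched for the simplest case, in which a single Voigt strain component is applied at several magnitudes and each stress component is fitted linearly against it. The slopes then give one column of the stiffness matrix. This is a minimal illustration and not the AEL implementation; the stress data are synthetic values generated from assumed constants, and the function names are hypothetical.

```python
def fit_slope(strains, stresses):
    """Least-squares slope of stress vs. strain (straight line with intercept)."""
    n = len(strains)
    mean_e = sum(strains) / n
    mean_s = sum(stresses) / n
    cov = sum((e - mean_e) * (s - mean_s) for e, s in zip(strains, stresses))
    var = sum((e - mean_e) ** 2 for e in strains)
    return cov / var


def stiffness_column(strains, stress_vectors):
    """Column j of c_ij: slopes of the six Voigt stress components against
    the single applied Voigt strain component j.
    For shear columns the strain values must be the engineering strains
    2*eps_ij, as in Eq. (7.3)."""
    return [fit_slope(strains, [sv[i] for sv in stress_vectors])
            for i in range(6)]


# Synthetic example: normal strain eps_11 applied to a cubic material with
# assumed c11 = 250 GPa and c12 = 150 GPa (illustrative numbers only).
C11, C12 = 250.0, 150.0
strains = [-0.010, -0.005, 0.005, 0.010]
stress_vectors = [[C11 * e, C12 * e, C12 * e, 0.0, 0.0, 0.0] for e in strains]

column_1 = stiffness_column(strains, stress_vectors)
print([round(c, 6) for c in column_1])
```

Using both positive and negative strains of several magnitudes, rather than a single distortion, averages out numerical noise in the computed stresses.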

[Figure 7.5 panel labels: (a) normal strain, shear strain; (b) different volume cells, E(V) data from DFT with polynomial fit (energy in eV vs. volume in Å³), B_S(V), θ_D(V), F_vib(θ_D, T) (Debye model), G(V; p, T) with polynomial fit (eV vs. volume in Å³), V_eq(p, T); (c) APL: harmonic IFCs, AAPL: anharmonic IFCs.]

Figure 7.5 (a) AEL applies a set of independent normal and shear strains to the crystal structure to obtain the elastic constants. (b) AGL applies a set of isotropic strains to the unit cell to obtain energy vs. volume data, which is fitted by a polynomial in order to calculate the bulk modulus as a function of volume, B_S(V). B_S(V) is then used to calculate the Debye temperature as a function of volume and thus the vibrational free energy as a function of temperature. The Gibbs free energy as a function of volume is then minimized for each pressure and temperature point to obtain the equilibrium volume and other thermomechanical properties. (c) APL obtains the harmonic interatomic force constants (IFCs) from supercell calculations where inequivalent atoms are displaced in inequivalent directions, and then the changes in the forces on the other atoms are calculated. The IFCs are then used to construct the dynamical matrix, which is diagonalized to obtain the phonon eigenmodes. AAPL calculates three-phonon scattering effects by performing supercell calculations where pairs of inequivalent atoms are displaced in inequivalent directions, and the changes in the forces on the other atoms in the supercell are calculated to obtain the third-order anharmonic IFCs.

The Reuss approximation uses the elements of the compliance tensor $s_{ij}$ (the inverse of the stiffness tensor) to calculate the bulk modulus

$$\frac{1}{B_{\mathrm{Reuss}}} = (s_{11} + s_{22} + s_{33}) + 2(s_{12} + s_{23} + s_{13}) \qquad (7.6)$$

while the shear modulus is given by

$$\frac{15}{G_{\mathrm{Reuss}}} = 4(s_{11} + s_{22} + s_{33}) - 4(s_{12} + s_{23} + s_{13}) + 3(s_{44} + s_{55} + s_{66}) \qquad (7.7)$$

The two approximations are combined to obtain the Voigt–Reuss–Hill (VRH) averages [133] for the bulk modulus

$$B_{\mathrm{VRH}} = \frac{B_{\mathrm{Voigt}} + B_{\mathrm{Reuss}}}{2} \qquad (7.8)$$

and the shear modulus

$$G_{\mathrm{VRH}} = \frac{G_{\mathrm{Voigt}} + G_{\mathrm{Reuss}}}{2} \qquad (7.9)$$

The Poisson ratio $\nu$ is then given by

$$\nu = \frac{3B_{\mathrm{VRH}} - 2G_{\mathrm{VRH}}}{6B_{\mathrm{VRH}} + 2G_{\mathrm{VRH}}} \qquad (7.10)$$
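Equations (7.4)–(7.10) translate directly into a few lines of code. The following sketch evaluates the Voigt, Reuss, and VRH moduli and the Poisson ratio from a 6 × 6 stiffness matrix, using a small pure-Python Gauss–Jordan inversion to obtain the compliance tensor. The helper names and the cubic example constants are illustrative assumptions, not AEL output.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (pure Python)."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular stiffness matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def cubic_stiffness(c11, c12, c44):
    """6x6 Voigt stiffness matrix of a cubic crystal."""
    c = [[0.0] * 6 for _ in range(6)]
    for i in range(3):
        for j in range(3):
            c[i][j] = c11 if i == j else c12
        c[i + 3][i + 3] = c44
    return c


def vrh_moduli(c):
    """Voigt, Reuss, and VRH moduli and Poisson ratio, Eqs. (7.4)-(7.10)."""
    s = invert(c)  # compliance tensor
    diag_c, off_c = c[0][0] + c[1][1] + c[2][2], c[0][1] + c[1][2] + c[0][2]
    diag_s, off_s = s[0][0] + s[1][1] + s[2][2], s[0][1] + s[1][2] + s[0][2]
    b_voigt = (diag_c + 2.0 * off_c) / 9.0
    g_voigt = (diag_c - off_c) / 15.0 + (c[3][3] + c[4][4] + c[5][5]) / 5.0
    b_reuss = 1.0 / (diag_s + 2.0 * off_s)
    g_reuss = 15.0 / (4.0 * diag_s - 4.0 * off_s
                      + 3.0 * (s[3][3] + s[4][4] + s[5][5]))
    b_vrh = 0.5 * (b_voigt + b_reuss)
    g_vrh = 0.5 * (g_voigt + g_reuss)
    nu = (3.0 * b_vrh - 2.0 * g_vrh) / (6.0 * b_vrh + 2.0 * g_vrh)
    return {"B_Voigt": b_voigt, "B_Reuss": b_reuss, "B_VRH": b_vrh,
            "G_Voigt": g_voigt, "G_Reuss": g_reuss, "G_VRH": g_vrh, "nu": nu}


# Hypothetical cubic constants in GPa (illustrative numbers only).
moduli = vrh_moduli(cubic_stiffness(250.0, 150.0, 100.0))
print({k: round(v, 3) for k, v in moduli.items()})
```

For an elastically isotropic input, where $c_{44} = (c_{11} - c_{12})/2$, the Voigt and Reuss bounds coincide, which is a convenient sanity check on any implementation.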

7.3.3 Quasi-harmonic Debye–Grüneisen Model

Thermal properties can be predicted by several different methods, such as the quasi-harmonic Debye–Grüneisen model, which uses volume as a proxy for temperature [49], and by calculating the phonon dispersion from the dynamical matrix of IFCs [122]. The energy vs. volume data from a set of simple static primitive cell calculations can be fitted to a quasi-harmonic Debye–Grüneisen model such as the “GIBBS” method [49, 51, 134] to obtain thermal properties, as demonstrated in Figure 7.5b. This method has been implemented in the AFLOW framework in the form of the Automatic GIBBS Library (AGL). First, the adiabatic bulk modulus $B_S$ as a function of cell volume $V$ is obtained either (i) by fitting the $E_{\mathrm{DFT}}(V)$ data to an equation of state (EOS) or (ii) by taking the numerical second derivative of a polynomial fit of $E_{\mathrm{DFT}}(V)$, which gives the static bulk modulus $B_{\mathrm{static}}$:

$$B_S(V) \approx B_{\mathrm{static}}(\vec{x}) \approx B_{\mathrm{static}}(\vec{x}(V)) = V\left(\frac{\partial^2 E(\vec{x}(V))}{\partial V^2}\right) = V\left(\frac{\partial^2 E(V)}{\partial V^2}\right) \qquad (7.11)$$
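Option (ii), the numerical second derivative of a polynomial fit, can be sketched as follows. The example fits a polynomial to E(V) data by least squares (normal equations solved by Gaussian elimination) and evaluates Eq. (7.11). It is a minimal illustration under assumed inputs, not the AGL code: the E(V) values are synthetic, generated from a parabola with a known curvature, and all function names are hypothetical.

```python
EV_PER_A3_TO_GPA = 160.21766208  # 1 eV/angstrom^3 expressed in GPa


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def polyfit(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1] x + ... (normal equations)."""
    a = [[sum(x ** (i + j) for x in xs) for j in range(degree + 1)]
         for i in range(degree + 1)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(degree + 1)]
    return solve(a, b)


def static_bulk_modulus(volumes, energies, volume, degree=2):
    """B_static(V) = V d2E/dV2 from a polynomial fit, Eq. (7.11), in GPa.
    Volumes in angstrom^3, energies in eV."""
    v_ref = sum(volumes) / len(volumes)
    xs = [v - v_ref for v in volumes]  # centre volumes for better conditioning
    coeffs = polyfit(xs, energies, degree)
    x = volume - v_ref
    d2e = sum(i * (i - 1) * c * x ** (i - 2)
              for i, c in enumerate(coeffs) if i >= 2)
    return volume * d2e * EV_PER_A3_TO_GPA


# Synthetic E(V) data: parabola with curvature 0.03 eV/angstrom^6 about 40 angstrom^3.
volumes = [36.0, 38.0, 40.0, 42.0, 44.0]
energies = [-10.0 + 0.5 * 0.03 * (v - 40.0) ** 2 for v in volumes]
b_static = static_bulk_modulus(volumes, energies, 40.0)
print(round(b_static, 2))
```

In practice a higher polynomial degree is needed to capture the asymmetry of real E(V) curves; the quadratic default here is chosen only to keep the synthetic check exact.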

Three different empirical EOS have been implemented within AGL: the Birch–Murnaghan EOS [105, 134, 135], the Vinet EOS [134, 136], and the Baonza–Cáceres–Núñez spinodal EOS [134, 137]. However, these EOS often introduce an additional source of error into the results since they are calibrated for specific sets of systems and pressure–temperature regimes. Recent studies have found the numerical calculation of B to be at least as reliable as the empirical EOS [51]. Therefore, the numerical method is the default for the automated generation of thermomechanical properties for the AFLOW database. The bulk modulus can then be used to calculate the Debye temperature as a function of volume:

$$\theta_D(V) = \frac{\hbar}{k_B}\left[6\pi^2 V^{1/2} n\right]^{1/3} f(\nu) \sqrt{\frac{B_S}{M}} \qquad (7.12)$$

where $n$ is the number of atoms in the unit cell, $M$ is the mass of the unit cell, and $f(\nu)$ is a function of the Poisson ratio $\nu$:

$$f(\nu) = \left\{ 3\left[ 2\left(\frac{2}{3}\cdot\frac{1+\nu}{1-2\nu}\right)^{3/2} + \left(\frac{1}{3}\cdot\frac{1+\nu}{1-\nu}\right)^{3/2} \right]^{-1} \right\}^{1/3} \qquad (7.13)$$

The integration offered by the AFLOW framework allows the value of $\nu$ required by this expression to be obtained directly and automatically from the AEL calculation (Eq. (7.10)). To obtain the equilibrium volume at a particular (p, T) point, the Gibbs free energy is minimized with respect to volume. In the quasi-harmonic approximation (QHA), the vibrational component of the free energy, $F_{\mathrm{vib}}(\vec{x}; T)$, is given by

$$F_{\mathrm{vib}}(\vec{x}; T) = \int_0^{\infty} \left[\frac{\hbar\omega}{2} + \frac{1}{\beta}\log\left(1 - e^{-\beta\hbar\omega}\right)\right] g(\vec{x}; \omega)\, d\omega \qquad (7.14)$$

where $\beta = (k_B T)^{-1}$ and $g(\vec{x}; \omega)$ is the phonon density of states, which depends on the system geometry $\vec{x}$. In the Debye–Grüneisen model, $F_{\mathrm{vib}}$ can be written as

$$F_{\mathrm{vib}}(\theta_D; T) = \frac{n}{\beta}\left[\frac{9}{8}\frac{\theta_D}{T} + 3\log\left(1 - e^{-\theta_D/T}\right) - D\left(\frac{\theta_D}{T}\right)\right] \qquad (7.15)$$

where $D(\theta_D/T)$ is the Debye integral

$$D(\theta_D/T) = 3\left(\frac{T}{\theta_D}\right)^3 \int_0^{\theta_D/T} \frac{x^3}{e^x - 1}\, dx \qquad (7.16)$$

Next, the full Gibbs free energy as a function of temperature and pressure is calculated by

$$\mathsf{G}(V; p, T) = E_{\mathrm{DFT}}(V) + F_{\mathrm{vib}}(\theta_D(V); T) + pV \qquad (7.17)$$

and fitted by a polynomial in $V$, the minimum of which gives the equilibrium volume, $V_{\mathrm{eq}}$. Note that the symbol $G$ is used for the shear modulus, while $\mathsf{G}$ is used for the Gibbs free energy. $\theta_D$ is then determined from its value at $V_{\mathrm{eq}}$, while other thermal properties such as the Grüneisen parameter can be calculated using the expression

$$\gamma = -\frac{V}{\theta_D}\frac{\partial \theta_D(V)}{\partial V} \qquad (7.18)$$
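The Debye-model quantities of Eqs. (7.12), (7.13), (7.15), and (7.16) can be evaluated with a short script. The sketch below works in SI units for the Debye temperature and in eV for the free energy, and integrates the Debye function with Simpson's rule. It is an illustration under assumed inputs rather than the AGL implementation; the silicon-like numbers (two atoms, 40 Å³ cell, 98 GPa, ν = 0.22) are rough illustrative values, and the function names are hypothetical.

```python
import math

HBAR = 1.054571817e-34      # J s
K_B_SI = 1.380649e-23       # J/K
K_B_EV = 8.617333262e-5     # eV/K
AMU = 1.66053907e-27        # kg


def poisson_function(nu):
    """f(nu) of Eq. (7.13)."""
    a = (2.0 / 3.0 * (1.0 + nu) / (1.0 - 2.0 * nu)) ** 1.5
    b = (1.0 / 3.0 * (1.0 + nu) / (1.0 - nu)) ** 1.5
    return (3.0 / (2.0 * a + b)) ** (1.0 / 3.0)


def debye_temperature(volume_m3, n_atoms, bulk_modulus_pa, cell_mass_kg, nu):
    """theta_D(V) of Eq. (7.12), in kelvin (all inputs in SI units)."""
    return (HBAR / K_B_SI
            * (6.0 * math.pi ** 2 * math.sqrt(volume_m3) * n_atoms) ** (1.0 / 3.0)
            * poisson_function(nu)
            * math.sqrt(bulk_modulus_pa / cell_mass_kg))


def debye_integral(y, steps=2000):
    """D(y) = 3/y^3 * int_0^y x^3/(e^x - 1) dx, Eq. (7.16), by Simpson's rule."""
    def integrand(x):
        return 0.0 if x == 0.0 else x ** 3 / math.expm1(x)
    h = y / steps
    total = integrand(0.0) + integrand(y)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * integrand(k * h)
    return 3.0 / y ** 3 * total * h / 3.0


def f_vib(theta_d, temperature, n_atoms):
    """Vibrational free energy of Eq. (7.15), in eV per cell."""
    y = theta_d / temperature
    return n_atoms * K_B_EV * temperature * (
        9.0 / 8.0 * y + 3.0 * math.log(1.0 - math.exp(-y)) - debye_integral(y))


# Rough silicon-like inputs (illustrative assumptions only).
theta = debye_temperature(40.0e-30, 2, 98.0e9, 2 * 28.0855 * AMU, 0.22)
print(round(theta), round(f_vib(theta, 300.0, 2), 4))
```

Repeating the evaluation of `f_vib` over the set of strained volumes, adding $E_{\mathrm{DFT}}(V)$ and $pV$, and locating the minimum of the fitted curve gives the equilibrium volume of Eq. (7.17).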

The specific heat capacity at constant volume can be obtained using the expression

$$C_{V,\mathrm{vib}} = 3 n k_B \left[4 D\left(\frac{\theta_D}{T}\right) - \frac{3\theta_D/T}{e^{\theta_D/T} - 1}\right] \qquad (7.19)$$

while the specific heat capacity at constant pressure is given by

$$C_{p,\mathrm{vib}} = C_{V,\mathrm{vib}}(1 + \alpha\gamma T) \qquad (7.20)$$

where $\alpha$ is the coefficient of thermal expansion

$$\alpha = \frac{\gamma\, C_{V,\mathrm{vib}}}{B_T V} \qquad (7.21)$$

The lattice thermal conductivity can be calculated using the Leibfried–Schlömann equation [138–140] using the Debye temperature and the Grüneisen parameter:

$$\kappa_L(\theta_a) = \frac{0.849 \times 3\sqrt[3]{4}}{20\pi^3\left(1 - 0.514\gamma_a^{-1} + 0.228\gamma_a^{-2}\right)} \times \left(\frac{k_B\theta_a}{\hbar}\right)^2 \frac{k_B m V^{1/3}}{\hbar\gamma_a^2} \qquad (7.22)$$

where $V$ is the volume of the unit cell and $m$ is the average atomic mass, while $\theta_a$ and $\gamma_a$ are the acoustic Debye temperature and Grüneisen parameter obtained by only considering the acoustic modes, based on the assumption that the optical phonon modes in crystals do not contribute to heat transport [139, 140]. $\theta_a$ and $\gamma_a$ can be derived directly from the phonon DOS by only considering the acoustic modes [139, 141]. $\theta_a$ can also be estimated from the traditional Debye temperature $\theta_D$ using the expression $\theta_a = \theta_D n^{-1/3}$ [139, 140]. There is no simple way to extract $\gamma_a$ from the traditional Grüneisen parameter, so the approximation $\gamma_a = \gamma$ is used in the AEL–AGL approach to calculating the thermal conductivity. The thermal conductivity at temperatures other than $\theta_a$ is estimated using the expression [139, 140, 142] $\kappa_L(T) = \kappa_L(\theta_a)\theta_a/T$.
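As a worked illustration of Eq. (7.22), the following sketch evaluates the Leibfried–Schlömann expression in SI units, using $\theta_a = \theta_D n^{-1/3}$, $\gamma_a = \gamma$, and the $\theta_a/T$ scaling described above. The input values are rough silicon-like assumptions chosen only to exercise the formula, and the function names are hypothetical.

```python
import math

HBAR = 1.054571817e-34   # J s
K_B = 1.380649e-23       # J/K
AMU = 1.66053907e-27     # kg


def kappa_at_theta_a(theta_a, gamma_a, avg_mass_kg, volume_m3):
    """Lattice thermal conductivity at T = theta_a, Eq. (7.22), in W/(m K)."""
    prefactor = (0.849 * 3.0 * 4.0 ** (1.0 / 3.0)
                 / (20.0 * math.pi ** 3
                    * (1.0 - 0.514 / gamma_a + 0.228 / gamma_a ** 2)))
    return (prefactor
            * (K_B * theta_a / HBAR) ** 2
            * K_B * avg_mass_kg * volume_m3 ** (1.0 / 3.0)
            / (HBAR * gamma_a ** 2))


def kappa_lattice(temperature, theta_d, gamma, n_atoms, avg_mass_kg, volume_m3):
    """kappa_L(T) = kappa_L(theta_a) * theta_a / T with theta_a = theta_D n^(-1/3)
    and the approximation gamma_a = gamma."""
    theta_a = theta_d * n_atoms ** (-1.0 / 3.0)
    return (kappa_at_theta_a(theta_a, gamma, avg_mass_kg, volume_m3)
            * theta_a / temperature)


# Rough silicon-like inputs (illustrative assumptions only):
# theta_D = 640 K, gamma = 1, two atoms in a 40 angstrom^3 cell.
kappa_300 = kappa_lattice(300.0, 640.0, 1.0, 2, 28.0855 * AMU, 40.0e-30)
print(round(kappa_300, 1))
```

Because $\kappa_L$ scales as $\gamma_a^{-2}$ and $\theta_a^3$ overall, small errors in the Grüneisen parameter or Debye temperature are strongly amplified, which is one reason the model is better suited to ranking materials than to quantitative prediction.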

7.3.4 Harmonic Phonons

Thermal properties can also be obtained by directly calculating the phonon dispersion from the dynamical matrix of IFCs. The approach is implemented within the AFLOW Phonon Library (APL) [2]. The IFCs are determined from a set of supercell calculations in which the atoms are displaced from their equilibrium positions [122] as shown in Figure 7.5c. The IFCs derive from a Taylor expansion of the potential energy, $V$, of the crystal about the atoms’ equilibrium positions:

$$V = V\big|_{\vec{r}(i,t)=0,\forall i} + \sum_{i,\alpha} \frac{\partial V}{\partial r(i,t)^{\alpha}}\bigg|_{\vec{r}(i,t)=0,\forall i} r(i,t)^{\alpha} + \frac{1}{2}\sum_{\substack{i,\alpha, \\ j,\beta}} \frac{\partial^2 V}{\partial r(i,t)^{\alpha}\,\partial r(j,t)^{\beta}}\bigg|_{\vec{r}(i,t)=0,\forall i} r(i,t)^{\alpha}\, r(j,t)^{\beta} + \cdots \qquad (7.23)$$

where $r(i,t)^{\alpha}$ is the $\alpha$-Cartesian component ($\alpha = x, y, z$) of the time-dependent atomic displacement $\vec{r}(t)$ of the ith atom about its equilibrium position, $V|_{\vec{r}(i,t)=0,\forall i}$ is the potential energy of the crystal in its equilibrium configuration, $\partial V/\partial r(i,t)^{\alpha}|_{\vec{r}(i,t)=0,\forall i}$ is the negative of the force acting in the $\alpha$ direction on atom i in the equilibrium configuration (zero by definition), and $\partial^2 V/\partial r(i,t)^{\alpha}\partial r(j,t)^{\beta}|_{\vec{r}(i,t)=0,\forall i}$ constitute the IFC $\phi(i,j)_{\alpha,\beta}$. To first approximation, $\phi(i,j)_{\alpha,\beta}$ is the negative of the force exerted in the $\alpha$ direction on atom i when atom j is displaced in the $\beta$ direction with all other atoms maintaining their equilibrium positions, as shown in Figure 7.5c. All higher-order terms are neglected in the harmonic approximation. Correspondingly, the equations of motion of the lattice are

$$M(i)\,\ddot{r}(i,t)^{\alpha} = -\sum_{j,\beta} \phi(i,j)_{\alpha,\beta}\, r(j,t)^{\beta} \quad \forall i, \alpha \qquad (7.24)$$

and can be solved by a plane-wave solution of the form

$$r(i,t)^{\alpha} = \frac{v(i)^{\alpha}}{\sqrt{M(i)}}\, e^{i(\vec{q}\cdot\vec{R}_l - \omega t)} \qquad (7.25)$$

where $v(i)^{\alpha}$ form the phonon eigenvectors (polarization vector), $M(i)$ is the mass of the ith atom, $\vec{q}$ is the wave vector, $\vec{R}_l$ is the position of lattice point l, and $\omega$ form the phonon eigenvalues (frequencies). The approach is nearly identical to that taken for electrons in a periodic potential (Bloch waves) [143]. Plugging this solution into the equations of motion (Eq. (7.24)) yields the following set of linear equations:

$$\omega^2 v(i)^{\alpha} = \sum_{j,\beta} D^{\alpha,\beta}_{i,j}(\vec{q})\, v(j)^{\beta} \quad \forall i, \alpha \qquad (7.26)$$

where the dynamical matrix $D^{\alpha,\beta}_{i,j}(\vec{q})$ is defined as

$$D^{\alpha,\beta}_{i,j}(\vec{q}) = \sum_{l} \frac{\phi(i,j)_{\alpha,\beta}}{\sqrt{M(i)M(j)}}\, e^{-i\vec{q}\cdot(\vec{R}_l - \vec{R}_0)} \qquad (7.27)$$
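The construction of Eq. (7.27) can be illustrated on a textbook model that is not part of the source: a one-dimensional diatomic chain with nearest-neighbor springs. The sketch builds the 2 × 2 dynamical matrix from a table of IFCs indexed by atom pair and lattice translation, and obtains the squared frequencies of the acoustic and optical branches from the analytic eigenvalues of a 2 × 2 Hermitian matrix. The spring constant, masses, and data layout are illustrative assumptions.

```python
import cmath
import math


def dynamical_matrix(q, ifcs, masses, a):
    """D_ij(q) = sum_l phi(i, j; l) exp(-i q l a) / sqrt(M_i M_j), cf. Eq. (7.27).
    ifcs maps (i, j, l) -> force constant between atom i in cell 0 and
    atom j in cell l of a 1D chain with lattice constant a."""
    n = len(masses)
    d = [[0j] * n for _ in range(n)]
    for (i, j, l), phi in ifcs.items():
        d[i][j] += phi * cmath.exp(-1j * q * l * a) / math.sqrt(masses[i] * masses[j])
    return d


def squared_frequencies(d):
    """Eigenvalues (omega^2) of a 2x2 Hermitian matrix, in ascending order."""
    trace = d[0][0].real + d[1][1].real
    det = d[0][0].real * d[1][1].real - abs(d[0][1]) ** 2
    disc = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    return (trace - disc) / 2.0, (trace + disc) / 2.0


# Hypothetical chain: spring constant k between neighbors, masses m1 and m2.
k, m1, m2, a = 1.0, 1.0, 2.0, 1.0
ifcs = {
    (0, 0, 0): 2.0 * k, (1, 1, 0): 2.0 * k,   # self terms
    (0, 1, 0): -k, (0, 1, -1): -k,            # atom 0 to its two neighbors
    (1, 0, 0): -k, (1, 0, 1): -k,             # atom 1 to its two neighbors
}

for q in (0.0, math.pi / (2.0 * a), math.pi / a):
    acoustic, optical = squared_frequencies(dynamical_matrix(q, ifcs, (m1, m2), a))
    print(round(q, 3), round(abs(acoustic), 6), round(optical, 6))
```

At the zone center the acoustic branch vanishes and the optical branch equals $2k(1/m_1 + 1/m_2)$, while at the zone boundary the two branches take the values $2k/m_2$ and $2k/m_1$, which reproduces the standard result for this model.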

The problem can be equivalently represented by a standard eigenvalue equation:

$$\omega^2 \left[\vec{v}\right] = \left[D(\vec{q})\right]\left[\vec{v}\right] \qquad (7.28)$$

where the dynamical matrix and phonon eigenvectors have dimensions $(3n_a \times 3n_a)$ and $(3n_a \times 1)$, respectively, and $n_a$ is the number of atoms in the cell. Hence, Eq. (7.28) has $3n_a$ solutions/modes referred to as branches indexed by $\lambda$. In practice, Eq. (7.28) is solved for discrete sets of $\vec{q}$-points to compute the phonon density of states (grid over all possible $\vec{q}$) and dispersion (along the high-symmetry paths of the lattice [3]). Thus, the phonon eigenvalues and eigenvectors are appropriately denoted $\omega_{\lambda}(\vec{q})$ and $\vec{v}_{\lambda}(\vec{q})$, respectively. Similar to the electronic Hamiltonian, the dynamical matrix is Hermitian, i.e. $D(\vec{q}) = D^{\dagger}(\vec{q})$. Thus $\omega^2_{\lambda}(\vec{q})$ must be real, so $\omega_{\lambda}(\vec{q})$ can either be real or purely imaginary. However, a purely imaginary frequency corresponds to vibrational motion of the lattice that increases exponentially in time. Therefore, imaginary frequencies, or those corresponding to soft modes, indicate the structure is dynamically unstable. In the case of a symmetric, high-temperature phase, soft modes suggest there exists a lower symmetry structure stable at T = 0 K. Temperature effects on phonon frequencies can be modeled with

$$\tilde{\omega}^2_{\lambda}(\vec{q}, T) = \omega^2_{\lambda}(\vec{q}, T = 0) + \eta T \qquad (7.29)$$

where $\eta$ is positive in general. The two structures, the symmetric and the stable, differ by the distortion corresponding to this “frozen” (non-vibrating) mode. Upon heating, the temperature term increases until the frequency reaches zero, and a phase transition occurs from the stable structure to the symmetric [144]. In practice, soft modes [145] may indicate the following: (i) the structure is dynamically unstable at T; (ii) the symmetry of the structure is lower than that considered, perhaps due to magnetism; (iii) strong electronic correlations are present; or (iv) long-range interactions play a significant role, and a larger supercell should be considered.

With the phonon density of states computed, the following thermal properties can be calculated: the internal vibrational energy

$$U_{\mathrm{vib}}(\vec{x}, T) = \int_0^{\infty} \left(\frac{1}{2} + \frac{1}{e^{\beta\hbar\omega} - 1}\right) \hbar\omega\, g(\vec{x}; \omega)\, d\omega \qquad (7.30)$$

the vibrational component of the free energy $F_{\mathrm{vib}}(\vec{x}; T)$ (Eq. (7.14)), the vibrational entropy

$$S_{\mathrm{vib}}(\vec{x}, T) = \frac{U_{\mathrm{vib}}(\vec{x}, T) - F_{\mathrm{vib}}(\vec{x}; T)}{T} \qquad (7.31)$$

and the isochoric specific heat

$$C_{V,\mathrm{vib}}(\vec{x}, T) = \int_0^{\infty} \frac{k_B (\beta\hbar\omega)^2\, g(\vec{x}; \omega)}{\left(1 - e^{-\beta\hbar\omega}\right)\left(e^{\beta\hbar\omega} - 1\right)}\, d\omega \qquad (7.32)$$
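When the density of states is represented by a discrete set of sampled mode energies, the integrals of Eqs. (7.14) and (7.30)–(7.32) reduce to sums over modes. The following sketch evaluates them for such a list. It is an illustrative assumption that each sampled mode carries equal weight, and the mode energies used are arbitrary example values rather than computed phonon data.

```python
import math

K_B = 8.617333262e-5  # eV/K


def thermal_properties(mode_energies, temperature):
    """Sum Eqs. (7.14) and (7.30)-(7.32) over equally weighted modes.
    mode_energies: hbar*omega in eV (positive); returns eV and eV/K."""
    u = f = c_v = 0.0
    for e in mode_energies:
        x = e / (K_B * temperature)                 # beta * hbar * omega
        u += e * (0.5 + 1.0 / math.expm1(x))        # Eq. (7.30)
        # log(1 - exp(-x)) written with expm1 for accuracy at small x
        f += 0.5 * e + K_B * temperature * math.log(-math.expm1(-x))  # Eq. (7.14)
        c_v += K_B * x * x / (-math.expm1(-x) * math.expm1(x))        # Eq. (7.32)
    s = (u - f) / temperature                       # Eq. (7.31)
    return {"U": u, "F": f, "S": s, "C_V": c_v}


# Arbitrary example mode energies in eV (illustrative only).
modes = [0.010, 0.020, 0.035, 0.050, 0.060, 0.062]
for t in (100.0, 300.0, 1000.0):
    props = thermal_properties(modes, t)
    print(t, round(props["C_V"] / K_B, 3), round(props["S"] / K_B, 3))
```

At high temperature each mode contributes $k_B$ to $C_V$, recovering the Dulong–Petit limit, while at low temperature the internal energy tends to the zero-point value $\sum \hbar\omega/2$.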

7.3.5 Quasi-harmonic Phonons

The harmonic approximation does not describe phonon–phonon scattering and so cannot be used to calculate properties such as thermal conductivity or thermal expansion. To obtain these properties, either the Quasi-Harmonic Approximation (QHA) can be used, or a full calculation of the higher-order anharmonic IFCs can be performed. QHA is the less computationally demanding of these two methods and compares harmonic calculations of phonon properties at different volumes to predict anharmonic properties. The different volume calculations can be in the form of harmonic phonon calculations as described above [146, 147] or simple static primitive cell calculations [49, 134]. QHA is implemented within APL and referred to as QHA-APL [49]. In the case of the quasi-harmonic phonon calculations, the anharmonicity of the system is described by the mode-resolved Grüneisen parameters, which are given by the change in the phonon frequencies as a function of volume:

$$\gamma_{\lambda}(\vec{q}) = -\frac{V}{\omega_{\lambda}(\vec{q})}\frac{\partial \omega_{\lambda}(\vec{q})}{\partial V} \qquad (7.33)$$

where $\gamma_{\lambda}(\vec{q})$ is the parameter for the wave vector $\vec{q}$ and the $\lambda$th mode of the phonon dispersion. The average of the $\gamma_{\lambda}(\vec{q})$ values, weighted by the specific heat capacity of each mode $C_{V,\lambda}(\vec{q})$, gives the average Grüneisen parameter:

$$\gamma = \frac{\sum_{\lambda,\vec{q}} \gamma_{\lambda}(\vec{q})\, C_{V,\lambda}(\vec{q})}{C_V} \qquad (7.34)$$

The specific heat capacity, Debye temperature, and Grüneisen parameter can then be combined to calculate other properties such as the specific heat capacity at constant pressure $C_p$, the thermal coefficient of expansion $\alpha$, and the lattice thermal conductivity $\kappa_L$ [147], using similar expressions to those described in Section 7.3.3.
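In practice the volume derivative in Eq. (7.33) is evaluated numerically from harmonic calculations at neighboring volumes. The sketch below uses a central finite difference between a slightly compressed and a slightly expanded cell and then forms the heat-capacity-weighted average of Eq. (7.34). The central-difference choice, the function names, and the synthetic frequencies (generated from an assumed power law with γ = 1.5) are illustrative assumptions rather than the QHA-APL implementation.

```python
def mode_gruneisen(v0, dv, omega_minus, omega_0, omega_plus):
    """gamma = -(V/omega) d(omega)/dV, Eq. (7.33), by central differences.
    omega_minus and omega_plus are the mode frequencies at v0 - dv and v0 + dv."""
    return -(v0 / omega_0) * (omega_plus - omega_minus) / (2.0 * dv)


def average_gruneisen(gammas, mode_heat_capacities):
    """Heat-capacity-weighted average over modes, Eq. (7.34)."""
    total_cv = sum(mode_heat_capacities)
    return sum(g * c for g, c in zip(gammas, mode_heat_capacities)) / total_cv


# Synthetic mode whose frequency follows omega ~ V^(-1.5) (assumed gamma = 1.5).
v0, dv, omega_ref = 40.0, 0.4, 10.0


def omega_of(v):
    return omega_ref * (v / v0) ** (-1.5)


gamma_fd = mode_gruneisen(v0, dv, omega_of(v0 - dv), omega_of(v0), omega_of(v0 + dv))
gamma_avg = average_gruneisen([1.0, 2.0], [1.0, 3.0])
print(round(gamma_fd, 4), gamma_avg)
```

The finite-difference estimate recovers the assumed exponent to within the truncation error of the central difference, which scales as $(\Delta V/V)^2$.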

7.3.6 Anharmonic Phonons

The full calculation of the anharmonic IFCs requires performing supercell calculations in which pairs of inequivalent atoms are displaced in all pairs of inequivalent directions [148–157] as illustrated in Figure 7.5c. The third-order anharmonic IFCs can then be obtained by calculating the change in the forces on all of the other atoms due to these displacements. This method has been implemented in the form of a fully automated integrated workflow in the AFLOW framework, where it is referred to as the AFLOW Anharmonic Phonon Library (AAPL) [157]. This approach can provide very accurate results for the lattice thermal conductivity when combined with accurate electronic structure methods [157] but quickly becomes very expensive for systems with multiple inequivalent atoms or low symmetry. Therefore, simpler methods such as the quasi-harmonic Debye model tend to be used for initial rapid screening [49, 51], while the more accurate and expensive methods are used for characterizing systems that are promising candidates for specific engineering applications.

7.4 Online Data Repositories

Rendering the massive quantities of data generated using automated ab initio frameworks available for other researchers requires going beyond the conventional methods for the dissemination of scientific results in the form of journal articles. Instead, this data is typically made available in online data repositories, which can usually be accessed both manually via interactive Web portals and programmatically via an application programming interface (API).

7.4.1 Computational Materials Data Web Portals

Most computational data repositories include an interactive Web portal front end that enables manual data access. These Web portals usually include online applications to facilitate data retrieval and analysis. The front page of the AFLOW data repository is displayed in Figure 7.6a. The main features include a search bar where information such as ICSD reference number, AFLOW unique identifier (AUID) or the chemical formula, can be entered in order to retrieve specific materials entries. Below are buttons linking to several different online applications

Figure 7.6 (a) Front page of the AFLOW online data repository, highlighting the link to (b) the AFLOW advanced search application, which facilitates complex search queries including filtering by chemical composition and materials properties, and (c) the AFLOW interactive convex hull generator, showing the 3D hull for the Pt–Sc–Zn ternary alloy system. Panel (b) is annotated with the element search filters and the property search filters.

such as the advanced search functionality, convex hull phase diagram generators, machine learning applications [45, 158, 159], and AFLOW-online data analysis tools. The link to the advanced search application is highlighted by the orange square, and the application page is shown in Figure 7.6b. The advanced search application allows users to search for materials that contain (or exclude) specific elements or groups of elements, and also to filter and sort the results by properties such as electronic band structure energy gap (under the “Electronics” properties filter group) and bulk modulus (under the “Mechanical” properties filter group). This allows users to identify candidate materials with suitable properties for specific applications. Another example online application available on the AFLOW Web portal is the convex hull phase diagram generator. This application can be accessed by clicking on the button highlighted by the orange square in Figure 7.6a, which will bring up a periodic table allowing users to select two or three elements for which they want to generate a convex hull. The application will then access the formation enthalpies and stoichiometries of the materials entries in the relevant alloy systems and use this data to generate a two- or three-dimensional convex hull phase diagram as depicted in Figure 7.6c. This application is fully interactive, allowing users to adjust the energy axis scale, rotate the diagram to view from different directions, and select specific points to obtain more information on the corresponding entries.
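The hull construction described above can be sketched for a binary system as follows. This is a minimal illustration and not the AFLOW implementation: the function names `lower_hull` and `distance_above_hull` and the (composition, formation enthalpy) pairs are hypothetical, and only the lower half of the hull is built, since that is where the stable phases lie.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_hull(entries):
    """Lower convex hull of (x, H_f) points for a binary system A(1-x)B(x).

    Uses the lower half of Andrew's monotone chain algorithm.
    """
    # Keep only the lowest formation enthalpy at each composition.
    lowest = {}
    for x, h in entries:
        if x not in lowest or h < lowest[x]:
            lowest[x] = h
    hull = []
    for p in sorted(lowest.items()):
        # Drop the previous vertex while it lies on or above the line hull[-2] -> p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def distance_above_hull(point, hull):
    """Vertical distance of a point from the tie line beneath it."""
    x, h = point
    for (x0, h0), (x1, h1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            tie_line = h0 + (h1 - h0) * (x - x0) / (x1 - x0)
            return h - tie_line
    raise ValueError("composition outside the range spanned by the hull")


# Hypothetical entries: (fraction of element B, formation enthalpy per atom in eV).
entries = [(0.0, 0.0), (0.25, -0.10), (0.5, -0.30), (0.5, -0.22), (0.75, -0.12), (1.0, 0.0)]
hull = lower_hull(entries)
unstable = distance_above_hull((0.25, -0.10), hull)
```

Entries that lie on the hull are predicted to be stable at zero temperature, while the distance above the hull indicates how strongly an entry is driven to decompose into the neighboring stable phases.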


7.4.2 Programmatically Accessible Online Repositories of Computed Materials Properties

In order to use materials data in machine learning algorithms, it should be stored in a structured online database and made programmatically accessible via a representational state transfer API (REST API). Examples of online repositories of materials data include AFLOW [4, 5], Materials Project [11], and OQMD [15]. There are also repositories that aggregate results from multiple sources such as NoMaD [23] and Citrine [160]. REST APIs facilitate programmatic access to data repositories. Typical databases such as AFLOW are organized in layers, with the top layer corresponding to a project or catalog (e.g. binary alloys), the next layer corresponding to data sets (e.g. all of the entries for a particular alloy system), and then the bottom layer corresponding to specific materials entries, as illustrated in Figure 7.7a. In the case of the AFLOW database, there are currently four different “projects,” namely, the “ICSD,” “LIB1,” “LIB2,” and “LIB3” projects, along with three more under construction: “LIB4,” “LIB5,” and “LIB6.” The “ICSD” project contains calculated data for previously observed compounds [52], whereas the other three projects contain calculated data for single elements, binary alloys,

Figure 7.7 (a) The AFLOW database is organized as a multilayered system: a project layer (ICSD, LIB1, LIB2, LIB3), a set layer (e.g. CuNb, CuV, CuNbV), and a calculation layer containing the individual entries and their materials data. (b) Example of an AURL that enables direct programmatic access to specific materials entry properties in the AFLOW database: $aurl=aflowlib.duke.edu:AFLOWDATA/LIB2_RAW/Cu_pvV_sv/15/?energy_atom, whose components are, in order, the server, the project layer, the set layer, the entry, and the keyword.

and ternary alloys, respectively, and are constructed by decorating prototype structures with combinations of different elements. Within “LIB2” and “LIB3,” there are many different data sets, each corresponding to a specific binary or ternary alloy system. Each entry in the set corresponds to a specific prototype structure and stoichiometry. The materials properties values for each of these entries are encoded via keywords, and the data can be accessed via URLs constructed from the different layer names and the appropriate keywords.

In the case of the AFLOW database, the location of each layer and entry is identified by an AFLOW uniform resource locator (AURL) [5], which can be converted to a URL providing the absolute path to a particular layer, entry, or property. The AURL takes the form server:AFLOWDATA/project/set/entry/?keywords, for example, aflowlib.duke.edu:AFLOWDATA/LIB2_RAW/Cu_pvV_sv/15/?energy_atom, where aflowlib.duke.edu is the Web address of the physical server where the data is located, LIB2_RAW is the binary alloy project layer, Cu_pvV_sv is the set containing the binary alloy system Cu–V, 15 is a specific entry with the composition Cu$_3$V in a tetragonal lattice, and energy_atom is the keyword corresponding to the property of energy per atom in units of eV, as shown in Figure 7.7b. Each AURL can be converted to a Web URL by changing the “:” after the server name to a “/,” so that the AURL in Figure 7.7b would become the URL aflowlib.duke.edu/AFLOWDATA/LIB2_RAW/Cu_pvV_sv/15/?energy_atom. This URL, if queried via a Web browser or using a UNIX utility such as wget, returns the energy per atom in eV for entry 15 of the Cu–V binary alloy system.

In addition to the AURL, each entry in the AFLOW database is also associated with an AUID [5], which is a unique hexadecimal (base 16) number constructed from a checksum of the AFLOW output file for that entry. Since the AUID for a particular entry can always be reconstructed by applying the checksum procedure to the output file, it serves as a permanent, unique specifier for each calculation, irrespective of the current physical location of where the data are stored. This enables the retrieval of the results for a particular calculation from different servers, allowing for the construction of a truly distributed database that is robust against the failure or relocation of the physical hardware. Actual database versions can be identified from the version of AFLOW used to parse the calculation output files and post-process the results to generate the database entry. This information can be retrieved using the keyword aflowlib_version.

The search and sort functions of the front-end portals can be combined with the programmatic data access functionality of the REST API through the implementation of a Search-API. The AFLUX Search-API uses the LUX language to enable the embedding of logical operators within URL query strings [161]. For example, the energy per atom of every entry in the AFLOW repository containing the element Cu or V, but not the element Ti, with an electronic bandgap between 2 and 5 eV, can be retrieved using the command aflowlib.duke.edu/search/API/?species((Cu:V),(!Ti)),Egap(2*,*5),energy_atom. In this AFLUX search query, the comma “,” represents the logical AND operation, the colon “:” the logical OR operation, the exclamation mark “!” the logical NOT operation, and the asterisk “*” the “loose” operation that defines a range of values to search within. Note that by default AFLUX returns only the first 64 entries matching the search query. The number and set of entries can be controlled by appending the paging directive to the end of the search query as follows: aflowlib.duke.edu/search/API/?species((Cu:V),(!Ti)),Egap(2*,*5),energy_atom,paging(0), where calling the paging directive with the argument “0” instructs AFLUX to return all of the matching entries (note that this could potentially be a large amount of data, depending on the search query). The AFLUX Search-API allows users to construct and retrieve customized data sets, which they can feed into materials informatics machine learning packages to identify trends and correlations for use in rational materials design.

The use of APIs to provide programmatic access is being extended beyond materials data retrieval, to enable the remote use of pretrained machine learning algorithms. The AFLOW-ML API [159] facilitates access to the two machine learning models that are also available online at aflow.org/aflow-ml [45, 158]. The API allows users to submit structural data for the material of interest using a utility such as cURL and then returns the results of the model’s predictions in JSON format. The programmatic access to machine learning predictions enables the incorporation of machine learning into materials design workflows, allowing for rapid prescreening to automatically select promising candidates for further investigation.
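The AURL conversion and the AFLUX query syntax described in this section can be sketched with plain string handling. The snippet below is illustrative only: the helper names `aurl_to_url`, `split_aurl`, and `aflux_query` are hypothetical and not part of any AFLOW client library, the server addresses are the ones quoted in the text, and no network request is made.

```python
AFLUX_BASE = "aflowlib.duke.edu/search/API/"  # Search-API address as quoted in the text


def aurl_to_url(aurl):
    """Convert an AURL to a Web URL by changing the ':' after the server name to '/'."""
    server, sep, path = aurl.partition(":")
    if not sep:
        raise ValueError("not an AURL: no ':' after the server name")
    return f"{server}/{path}"


def split_aurl(aurl):
    """Split server:AFLOWDATA/project/set/entry/?keywords into its layers."""
    server, _, path = aurl.partition(":")
    location, _, keywords = path.partition("?")
    root, project, dataset, entry = location.strip("/").split("/")
    return {
        "server": server,
        "root": root,
        "project": project,
        "set": dataset,
        "entry": entry,
        "keywords": keywords.split(",") if keywords else [],
    }


def aflux_query(*terms, paging=None):
    """Join search terms with ',' (logical AND) and optionally append a paging directive."""
    parts = list(terms)
    if paging is not None:
        parts.append(f"paging({paging})")
    return AFLUX_BASE + "?" + ",".join(parts)


aurl = "aflowlib.duke.edu:AFLOWDATA/LIB2_RAW/Cu_pvV_sv/15/?energy_atom"
url = aurl_to_url(aurl)
layers = split_aurl(aurl)

# Cu OR V, NOT Ti, band gap between 2 and 5 eV, return energy per atom, all matches.
query = aflux_query("species((Cu:V),(!Ti))", "Egap(2*,*5)", "energy_atom", paging=0)
```

The resulting strings can be passed to a browser, to wget, or to an HTTP client of choice after prepending the appropriate scheme. Keeping the query construction separate from the request makes it easy to inspect a query before retrieving what may be a large result set.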

7.5 Materials Applications

The automated approach to computational materials science has been used to accelerate the design of materials for structural applications such as metallic glasses and superalloys, and for functional applications including thermoelectrics, magnets, catalysts, batteries, photovoltaics, and superconductors.

7.5.1 Disordered Materials

Section 7.2 describes how the thermodynamic stability of ordered compounds at zero temperature can be predicted from the convex hull phase diagrams generated using the formation enthalpies available in computational materials data repositories such as AFLOW [2, 4–6]. At finite temperature, however, entropic contributions due to thermally driven disorder play an important role and lead to the formation of disordered materials such as metallic glasses and solid solutions. The thermodynamically favored phase at a given temperature and pressure is the phase with the lowest Gibbs free energy. Since the entropy term in the Gibbs free energy is multiplied by the temperature T, the entropic contribution to the Gibbs free energy becomes increasingly important at higher temperatures. The entropy of materials has two main components: the vibrational entropy, $S_{\mathrm{vib}}$, which can be calculated from the phonon dispersion or the Debye model as described in Section 7.3, and the configurational entropy, $S_{\mathrm{config}}$, due to the disorder in the atomic positions or site occupations. Configurational entropy originates either from chemical disorder, as in the case of high entropy alloys in which all of the atoms are arranged on a regular lattice (but the specific lattice sites are randomly occupied by different chemical species), or from structural disorder, as in the case of metallic glasses, where the atoms no longer occupy regular lattice sites, resulting in an amorphous material.

7.5.1.1 High Entropy Materials

High entropy materials display structural order (i.e. all of the atoms are arranged on a periodic crystal lattice) but chemical disorder (i.e. the actual occupation of these lattice sites is random) [162]. In the ideal entropy limit in which the occupation of the atomic sites is completely random, the configurational entropy per atom is given by [162]

$$S_{\mathrm{config}} = -k_B \sum_i x_i \ln x_i$$

where $x_i$ is the fractional composition of species component $i$ and $k_B$ is the Boltzmann constant. Note that this expression increases with increasing numbers of species and is also maximized when all of the values of $x_i$ are equal, i.e. for equimolar compositions. The expression for the ideal entropy can be combined with calculations for special quasirandom structures (SQS) [163], which are special structural configurations where the radial correlation functions mimic those of a perfectly random structure, to estimate the Gibbs free energy for high entropy alloys. This can then be used in conjunction with the energies of the ordered phases obtained from computational materials data repositories such as AFLOW [2, 4–6] to generate structural phase diagrams as a function of temperature and composition, predicting the phase transition boundaries between ordered compounds, phase separation regions, and single-phase solid solutions [107, 164, 165]. The calculated ordered structure energies in AFLOW can also be used to train cluster expansion models [166] to predict the energies of large ensembles of configurations, which can be combined with thermodynamic descriptors to estimate the transition temperature and miscibility gaps for solid solutions and high entropy alloys [167]. The concept of entropy stabilization has recently been extended beyond metallic alloys to include multicomponent ceramics, such as high entropy oxides [168, 169].
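As a quick numerical check of the ideal configurational entropy, the sketch below evaluates it for an equimolar five-component mixture (such as a sublattice shared randomly by five cations) and for a non-equimolar one. The function names are hypothetical, the compositions are illustrative, and the expression is evaluated with the sign convention that makes the entropy positive.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def ideal_config_entropy(fractions):
    """Ideal configurational entropy per atom in units of k_B: -sum_i x_i ln x_i."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractional compositions must sum to 1")
    # Species with x_i = 0 contribute nothing (x ln x -> 0 as x -> 0).
    return -sum(x * math.log(x) for x in fractions if x > 0)


def entropic_free_energy(fractions, temperature):
    """Contribution -T * S_config to the Gibbs free energy, in eV per atom."""
    return -temperature * K_B_EV * ideal_config_entropy(fractions)


equimolar = ideal_config_entropy([0.2] * 5)            # equals ln 5
skewed = ideal_config_entropy([0.4, 0.3, 0.1, 0.1, 0.1])
stabilization_1000K = entropic_free_energy([0.2] * 5, 1000.0)
```

For N equimolar species the sum reduces to ln N, so the entropy grows only logarithmically with the number of components, while the -T S term grows linearly with temperature. This is why entropy stabilization is a high-temperature effect.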
High entropy oxides consist of an ordered anion sublattice occupied by oxygen ions, with a disordered cation sublattice randomly occupied by five different metal ions, such as Co, Cu, Mg, Ni, and Zn [168, 169]. The oxygen ions screen the metal ions from each other, reducing the energy cost associated with forming a random configuration of the metal ions, enabling the formation of a single-phase, entropy-stabilized ceramic.

7.5.1.2 Metallic Glasses

Metallic glasses are alloys in which the atoms do not occupy the sites of a regular periodic lattice, but instead form a structurally disordered amorphous phase. These materials are of great commercial and industrial interest due to their unique combination of superb mechanical properties [170] and plastic-like processability [171–173] for several potential applications [174–178]. Several different attempts have been made to understand the formation of metallic glasses and predict the glass-forming ability (GFA) of different alloy compositions. Most of these efforts center around maximizing the packing density of the different atoms [179], which requires elements with a range of different atomic radii [180–184]. Other efforts have been made to use phase diagram data on liquidus temperatures to predict GFA [185–187]. Work is also underway to use machine learning techniques to predict potential glass formers [188].

Much of the theoretical work described above relies on the use of experimental rather than ab initio computational data to predict new materials, due to the difficulty of modeling amorphous structures using first principles techniques. However, Perim et al. [48] recently demonstrated that the energies of different structural phases can be combined into a descriptor to predict the formation of metallic glasses. If there are many different structural phases with similar formation enthalpy, this will frustrate crystallization during solidification and thus promote glass formation [48]. This frustration can be quantified to formulate a spectral descriptor for GFA using the structural and energetic information available in computational materials data repositories such as AFLOW [2, 4–6]. The differences in the geometry between two structures are quantified by describing each structure in terms of its atomic environments [189–191], while the formation enthalpy differences between the respective structures are expressed in the form of Boltzmann factors. The energetic and structural descriptors can then be combined with appropriate normalization factors to formulate a spectral descriptor for GFA as a function of composition x: GFA({x}). Comparisons with known glass-forming compositions available in the literature can then be used to define a threshold, such that if GFA({x}) exceeds this threshold, then the composition x would be expected to be glass forming.

The GFA({x}) descriptor has been used to perform an automated analysis of the GFA of over 1400 binary alloy systems from the AFLOW data repository [48]. While over half of all binary alloy systems are predicted to have a GFA below that of the threshold, nevertheless some 17% of alloy systems display a maximum value of GFA({x}) greater than the maximum value for the Cu–Zr system, a well-known good glass former. These included several alloy systems for which glass formation had never previously been observed or sometimes even investigated, suggesting that there are many possible glass-forming compositions that remain to be discovered. This success demonstrates the power of combining descriptors based on the easily calculated properties of periodic crystalline phases with large pre-calculated databases for predicting the synthesizability of complex disordered materials.

7.5.1.3 Modeling Off-Stoichiometry Materials

Incorporating the effects of disorder is a necessary, albeit difficult, step in materials modeling. Not only is disorder intrinsic to all materials, but it also offers a route to enhanced and even otherwise inaccessible functionality, as demonstrated by its ubiquity in technological applications. Prominent examples include fuel cells [192], high-temperature superconductors [193, 194], and low thermal conductivity thermoelectrics [195]. Specifically, chemical disorder can arise in the form of doping, vacancies, and even random occupation of the lattice sites themselves, none of which can inherently be modeled using periodic systems. One approach for modeling such effects is SQS [163]. These quasirandom approximants are computationally very efficient but only offer a single representation of the disordered states, i.e. that with the lowest site correlations. Instead of reducing down to a single representation, AFLOW treats such systems as an ensemble of ordered supercells [196]. Properties are resolved through ensemble averages of the representative states, with opportunities to optimize computation (via supercell size/site error) and tune the level of disorder explored (via parameter T). The AFLOW partial occupation module (AFLOW-POCC) has already resolved significant stoichiometric trends in wide-gap semiconductors and magnetic systems while offering additional insight into underlying physical mechanisms. Ultimately, the screening criteria and property predictions generated by these bona fide thermodynamic models and descriptors are accelerating the design of new, technologically significant materials, including advanced ceramics [197] and metallic glasses [48].

7.5.2 Superalloys

Superalloys are characterized by their extraordinary mechanical properties, particularly at temperatures near their melting point. Such traits make them the ideal candidates for applications in the aerospace and power generation industries. Among the more common examples, many have a face-centered cubic structure with base elements nickel, cobalt, and iron, though nickel-based superalloys dominate the market. A novel cobalt-based superalloy, Co$_3$(Al,W), discovered in 2006, exhibits mechanical properties better than those of many nickel-based superalloys. This inspired a thorough computational investigation with AFLOW of alloys containing 40 different elements, yielding over 2224 relevant ternary systems [58]. The search offered 102 systems shown to (i) be more stable than Co$_3$[Al$_{0.5}$,W$_{0.5}$], the L1$_2$-like random structure previously characterized thermodynamically [198] and very close to the compositions reported by experiments [199], (ii) have a relevant concentration (X$_3$[A$_x$B$_{1-x}$]) that is in two-phase equilibrium with the host matrix, and (iii) exhibit only small deviations from the host matrix lattice (within 5% relative mismatch). For these 102 candidates, additional pertinent properties were extracted, including the density and bulk modulus (as a proxy for hardness). Low density materials are preferred to mitigate the stress on turbine components. Significant trends for the bulk modulus are elucidated when plotted with respect to component B on a Pettifor scale: Ni-based materials show a peak at or before Ni, whereas Co-based materials monotonically increase. Additionally, Co-based materials are generally more resistant to compression compared with Ni-based materials. Of the 102 candidates, 37 materials have no reported phase diagrams in standard databases and are thus expected to be unexplored or new. Additional screening based on the toxicity and (low) melting temperature of components uncovered six priority candidates for experimental validation.

7.5.3 Thermoelectrics

Thermoelectric materials generate an electric voltage when subjected to a temperature gradient and can also generate a temperature gradient when a voltage is applied [200, 201]. Their lack of moving parts and resulting scalability means that they have potential applications in power generation for spacecraft, energy recovery from waste heat in automotive and industrial facilities [202, 203] and in spot cooling for nanoelectronics using the Peltier cooling effect [202, 203]. However, most of the available thermoelectric materials have low efficiency, only converting a few percent of the available thermal energy into electricity. Therefore, a major goal of thermoelectrics research is to develop new materials that have higher thermoelectric efficiency. The thermoelectric efficiency of a material is determined by the figure of merit zT, which is obtained from [200, 201]

$$zT = \frac{\sigma S^2 T}{\kappa_L + \kappa_e} \tag{7.35}$$

where $S$ is the Seebeck coefficient, $\sigma$ is the electrical conductivity, $\kappa_L$ is the lattice thermal conductivity, and $\kappa_e$ is the electronic thermal conductivity. The lattice thermal conductivity $\kappa_L$ can be calculated using the methods described in Section 7.3. Most of the electronic thermal conductivity $\kappa_e$ will depend directly on the electrical conductivity $\sigma$ through the Wiedemann–Franz law [200]

$$\kappa_e = L \sigma T \tag{7.36}$$

where $L$ is the Lorenz factor, which has a value of $2.4 \times 10^{-8}\ \mathrm{J^2/(K^2 \cdot C^2)}$ for free electrons. The Seebeck coefficient $S$ is given by [200]

$$S = \frac{8 \pi^2 k_B^2}{3 e h^2}\, m^* T \left( \frac{\pi}{3n} \right)^{2/3} \tag{7.37}$$

where $n$ is the charge carrier concentration, $e$ is the electronic charge, and $m^*$ is the density of states effective mass of the charge carriers in the material. The effective mass tensor $m_{ij}$ can be calculated from the curvature of the electronic band structure dispersion $E(\vec{k})$:

$$m^{-1}_{ij} = \frac{1}{\hbar^2} \frac{d^2 E}{dk_i\, dk_j} \tag{7.38}$$

where $k_i$ and $k_j$ are components of the wave vector $\vec{k}$. Larger curvature of the band structure implies a lower effective mass, while flat narrow bands tend to result in a large effective mass. Charge carrier mobility and thus electrical conductivity tend to decrease with increasing effective mass. However, as can be seen from Eq. (7.37), the Seebeck coefficient increases with effective mass, and $\kappa_e$ also increases with $\sigma$. Therefore, a compromise should be found between high effective mass to maximize $S$ and high charge carrier mobility to give high $\sigma$ in order to optimize the thermoelectric efficiency of the device.

Several computational high-throughput searches have been performed for thermoelectric materials [37–40, 115, 204–206]. Many of the efforts toward developing more efficient thermoelectric materials have focused on either lowering the lattice thermal conductivity $\kappa_L$ or finding materials in which the electronic properties are highly directional, allowing for a narrow energy band distribution while simultaneously having a low effective mass, thus increasing the power factor $\sigma S^2$. High-throughput searches for materials with low lattice thermal conductivity have focused on materials such as half-Heusler structures [39, 40, 207], which have lower densities and thus lower thermal conductivities than the full Heusler structures. Other promising materials include structures such as


clathrates [208–212] and skutterudites [206, 213–215], which contain hollow voids that can be filled with “rattler” atoms to reduce the lattice thermal conductivity. Filled skutterudites in particular, such as R$_x$Co$_4$Sb$_{12}$, are excellent thermoelectric materials because of their combination of a high effective mass with high carrier mobility due to the existence of a secondary conduction band with 12 conducting charge carrier pockets [206]. Searches of large databases of inorganic materials to find new thermoelectrics include the study of 48 000 materials from the Materials Project database [204], where the power factor was calculated using the BoltzTraP code [216] and the thermal conductivity was estimated using the Clarke [217] and Cahill–Pohl [218] models. Almost 600 oxides, nitrides, and sulfides from the ICSD were investigated by Garrity [115], where the lattice thermal conductivity was calculated at the quasi-harmonic phonon level of approximation, with particular attention being paid to degeneracies in the conduction band minimum, or materials with strongly anisotropic conduction bands, that result in an effective low-dimensional conductor with a corresponding increase in the power factor. The thermoelectric material LiZnSb was proposed by an automated search of the calculated band structures of 1640 compounds in the ICSD containing Sb [38], although later experimental measurements did not find a high thermoelectric efficiency for this compound [219]. Other strategies to increase the power factor include engineering the band structure [220] through volume changes by alloying different materials to create solid solutions, such as antifluorite Mg$_2$Si and Mg$_2$Ge with Mg$_2$Sn or orthorhombic Ca$_2$Si and Ca$_2$Ge with Ca$_2$Sn [221]. Tuning the composition of alloys can also be used to converge the valence and conduction bands, enabling high valley degeneracy to be achieved in materials such as PbTe$_{1-x}$Se$_x$ alloys [222].
Solid solutions can also produce local anisotropic structural disorder, increasing phonon scattering and thus improving the thermoelectric efficiency [223, 224]. The exploitation of thermodynamic phenomena such as spinodal decomposition to self-assemble heterostructures with increased phonon scattering [225] has also been proposed to enhance the efficiency of thermoelectric devices. In this approach, materials such as PbSe and PbTe, which are miscible at high temperatures, undergo phase separation when the mixture is cooled slowly, creating a layered heterostructure with a network of boundaries between the different components, which scatter phonons and thus suppress the thermal conductivity. This concept has also been extended to other nanotechnology applications, e.g. as a means to embed a network of electrically conducting nanowires, in the form of topologically protected interface states, within an insulating matrix [226]. The combination of different competing materials properties that must be optimized to maximize the thermoelectric efficiency highlights the importance of integrated frameworks such as AFLOW, which can automatically calculate different types of materials properties such as thermal conductivity and electronic band structures. Having all of these electronic and thermal properties calculated and available in an integrated, searchable, sortable data repository such as AFLOW.org accelerates the design of new, high-efficiency thermoelectric materials.
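The trade-off between the Seebeck coefficient and the conductivities in Eqs. (7.35)–(7.38) can be explored numerically. The sketch below is a direct transcription of those equations in SI units. The function names are hypothetical, the input values (carrier concentration, conductivity, lattice thermal conductivity) are illustrative and not taken from any database, and the effective mass helper treats only a single band direction rather than the full tensor.

```python
import math

K_B = 1.380649e-23       # Boltzmann constant, J/K
H = 6.62607015e-34       # Planck constant, J s
HBAR = H / (2 * math.pi)
E_CHARGE = 1.602176634e-19  # elementary charge, C
M_E = 9.1093837015e-31   # electron rest mass, kg
LORENZ = 2.4e-8          # free-electron Lorenz factor quoted in the text, J^2/(K^2 C^2)


def seebeck(m_eff, n, temperature):
    """Eq. (7.37): Seebeck coefficient in V/K for effective mass m_eff (kg) and density n (m^-3)."""
    prefactor = 8 * math.pi**2 * K_B**2 / (3 * E_CHARGE * H**2)
    return prefactor * m_eff * temperature * (math.pi / (3 * n)) ** (2 / 3)


def kappa_electronic(sigma, temperature, lorenz=LORENZ):
    """Eq. (7.36): Wiedemann-Franz estimate of the electronic thermal conductivity, W/(m K)."""
    return lorenz * sigma * temperature


def figure_of_merit(sigma, s, temperature, kappa_lattice, kappa_el):
    """Eq. (7.35): dimensionless thermoelectric figure of merit zT."""
    return sigma * s**2 * temperature / (kappa_lattice + kappa_el)


def effective_mass_1d(e_minus, e_zero, e_plus, dk):
    """One-dimensional form of Eq. (7.38), using a central difference for d^2E/dk^2."""
    curvature = (e_plus - 2.0 * e_zero + e_minus) / dk**2
    return HBAR**2 / curvature


# Illustrative inputs (assumptions, not data from the text).
T = 300.0        # K
n = 1.0e26       # m^-3, i.e. 1e20 cm^-3
sigma = 1.0e5    # S/m
kappa_l = 1.0    # W/(m K)

s_example = seebeck(M_E, n, T)
kappa_e_example = kappa_electronic(sigma, T)
zt_example = figure_of_merit(sigma, s_example, T, kappa_l, kappa_e_example)
```

Because $S$ enters zT quadratically while $\kappa_e$ grows linearly with $\sigma$, varying the effective mass or the conductivity in this sketch makes the compromise discussed above explicit.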

7.5.4 Magnetic Materials

The search for new magnetic systems remains a long-standing challenge despite their ubiquity in modern technology [227]. Magnetism demonstrates remarkable sensitivity to a number of properties, including electronic configuration, bond length/angle, and magnetic ion valence, and thus its presence is rather uncommon and difficult to predict. In fact, only two percent of the known inorganic compounds [52] exhibit magnetic order of any kind. Consumer applications place additional practical restrictions on magnets, with the current global market effectively populated by only two dozen compounds. These obstacles motivated a large-scale computational search with AFLOW for new magnets among the Heusler structure family. Heusler structures are of particular interest for a number of reasons: (i) several are known high-performance magnets, (ii) the breadth of distinct compounds offers an excellent chance for discovery, (iii) the full set of materials will likely offer other types of interesting materials (aside from magnets), and (iv) they are metallic and thus well described by DFT. There are three types of Heusler structures, i.e. the regular-Heuslers X$_2$YZ (Cu$_2$MnAl-type), inverse-Heuslers (XY)XZ (Hg$_2$CuTi-type), and the half-Heuslers XYZ (MgCuSb-type). By decorating these prototypes with ternary combinations of 55 elements, a total of 236 115 compounds were generated and added to the AFLOW.org repository. As a first attempt, the analysis is limited to Heuslers containing elements of the 3d, 4d, and 5d periods, i.e. a subset of 36 540 compounds. Of this set, 248 are determined to be thermodynamically stable, and 22 have a magnetic ground state compatible with the unit cells considered. Among these 22 magnetic ground-state compounds, a few prominent classes can be identified, including Co$_2$YZ and Mn$_2$YZ. Upon further analysis of these classes, four materials were of particular interest.

In the first class, Co$_2$YZ, there already exist 25 known compounds, all lying on the Slater–Pauling curve (magnetic moment per formula unit vs. number of valence electrons) [228]. A regression over these known compounds predicts Co$_2$MnTi to have the notably high Curie transition temperature $T_C$ of 940 K, a feature shared by only two dozen known magnets. The second class, Mn$_2$YZ, is of interest because of its members' high $T_C$ and potentially large magnetocrystalline anisotropy [229]. Two known examples from this class, Mn$_2$VAl and Mn$_2$VGa, show ferrimagnetic ordering, matching two candidates from the list of 22, Mn$_2$PtCo and Mn$_2$PtV. One more compound was highlighted for satisfying a stringent thermodynamic constraint. Mn$_2$PtPd is robustly stable by at least 30 meV, where the criterion derives from the distance of the stable phase from the pseudo-convex hull that neglects it. This criterion quantifies the impact of the structure on the minimum energy surface. Synthesis was attempted for all four candidates: two were made successfully (Co$_2$MnTi and Mn$_2$PtPd), and the other two decomposed into binary compounds. In fact, Co$_2$MnTi shows a $T_C$ of 938 K, almost exactly as predicted by the Slater–Pauling curve. Surprisingly, Mn$_2$PtPd shows antiferromagnetic ordering and tetragonal distortion ($c/a \sim 1.8$), a result corroborated by calculation upon further analysis. Beyond the synthesis of these two systems, this investigation offers a new, accelerated pathway to materials discovery over traditional trial-and-error approaches.
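The size of the Heusler search space and the steepness of the screening funnel can be checked with simple combinatorics. In the sketch below, the factor of nine decorations per element triple is not stated in the text; it is inferred here because it reproduces both quoted totals. The count of 30 elements for the 3d, 4d, and 5d periods (10 per period) is likewise an assumption, and the function name is hypothetical.

```python
from math import comb

# Assumption inferred from the quoted totals: each unordered triple of elements
# yields nine distinct decorations across the three Heusler prototypes.
DECORATIONS_PER_TRIPLE = 9


def heusler_count(n_elements):
    """Number of Heusler decorations built from ternary combinations of n_elements."""
    return comb(n_elements, 3) * DECORATIONS_PER_TRIPLE


full_library = heusler_count(55)       # all 55 elements
transition_subset = heusler_count(30)  # assumed 10 elements in each of the 3d, 4d, 5d periods

# Screening funnel quoted in the text.
stable = 248
magnetic = 22
stable_fraction = stable / transition_subset
magnetic_fraction = magnetic / transition_subset
```

Under these assumptions the counts match the 236 115 and 36 540 compounds quoted above. Fewer than 1% of the subset is thermodynamically stable and fewer than 0.1% is both stable and magnetic, which illustrates why an automated search is needed to find such candidates.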

7.6 Conclusion

Automated computational materials design frameworks have the capability to rapidly generate materials data without the need for laborious human intervention. They are being used to construct large repositories of programmatically accessible materials properties, calculated in a standardized, consistent fashion so as to facilitate the identification of trends and the training of machine learning models to predict electronic, thermal, and mechanical behavior. When combined with physical models and intelligently formulated descriptors, the data becomes a powerful tool to accelerate the discovery of new materials for applications ranging from high-temperature superalloys to thermoelectrics and magnets.

Acknowledgments

We thank Drs. S. Barzilai, Y. Lederer, O. Levy, F. Rose, P. Nath, D. Usanmaz, D. Hicks, E. Gossett, D. Ford, R. Friedrich, M. Esters, P. Colinet, E. Perim, C. Calderon, K. Yang, M. Mehl, M. Buongiorno Nardelli, M. Fornari, G. Hart, I. Takeuchi, E. Zurek, P. Avery, R. Hanson, A. Kolmogorov, A. Natan, N. Mingo, J. Carrete, S. Sanvito, D. Brenner, K. Vecchio, M. Scheffler, L. Ghiringhelli, O. Isayev, A. Tropsha, J. Schroers, and J. J. Vlassak for insightful discussions. This work is supported by DOD-ONR (N00014-16-1-2326, N00014-16-1-2583, N00014-17-1-2090, N00014-17-1-2876), by NSF (DMR-1436151), and by Duke University – Center for Materials Genomics. SC acknowledges financial support from the Alexander von Humboldt Foundation. CO acknowledges support from the NSF Graduate Research Fellowship #DGF1106401.

References

1 Curtarolo, S., Hart, G.L.W., Buongiorno Nardelli, M. et al. (2013). The high-throughput highway to computational materials design. Nat. Mater. 12: 191–201.
2 Curtarolo, S., Setyawan, W., Hart, G.L.W. et al. (2012). AFLOW: an automatic framework for high-throughput materials discovery. Comput. Mater. Sci. 58: 218–226.
3 Setyawan, W. and Curtarolo, S. (2010). High-throughput electronic band structure calculations: challenges and tools. Comput. Mater. Sci. 49: 299–312.
4 Curtarolo, S., Setyawan, W., Wang, S. et al. (2012). AFLOWLIB.ORG: a distributed materials properties repository from high-throughput ab initio calculations. Comput. Mater. Sci. 58: 227–235.
5 Taylor, R.H., Rose, F., Toher, C. et al. (2014). A RESTful API for exchanging materials data in the AFLOWLIB.org consortium. Comput. Mater. Sci. 93: 178–192.
6 Calderon, C.E., Plata, J.J., Toher, C. et al. (2015). The AFLOW standard for high-throughput materials science calculations. Comput. Mater. Sci. 108 Pt. A: 233–238.


7 Setyawan, W. and Curtarolo, S. (2011). AflowLib: Ab-initio electronic structure library database. http://www.aflow.org (accessed 17 April 2019).
8 Toher, C., Oses, C., Hicks, D. et al. (2018). The AFLOW Fleet for Materials Discovery. In: Handbook of Materials Modeling (ed. W. Andreoni and S. Yip), 1–28. Cham, Switzerland: Springer International Publishing. doi: 10.1007/978-3-319-42913-7_63-1.
9 Supka, A.R., Lyons, T.E., Liyanage, L.S.I. et al. (2017). AFLOWπ: a minimalist approach to high-throughput ab initio calculations including the generation of tight-binding hamiltonians. Comput. Mater. Sci. 136: 76–84.
10 Buongiorno Nardelli, M., Cerasoli, F.T., Costa, M. et al. (2017). PAOFLOW: a utility to construct and operate on ab initio Hamiltonians from the projections of electronic wavefunctions on atomic orbital bases, including characterization of topological materials. Comput. Mater. Sci. 143: 462–472.
11 Jain, A., Hautier, G., Moore, C.J. et al. (2011). A high-throughput infrastructure for density functional theory calculations. Comput. Mater. Sci. 50: 2295–2310.
12 Jain, A., Ong, S.P., Hautier, G. et al. (2013). Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 1: 011002.
13 Ong, S.P., Richards, W.D., Jain, A. et al. (2013). Python Materials Genomics (pymatgen): a robust, open-source python library for materials analysis. Comput. Mater. Sci. 68: 314–319.
14 Mathew, K., Montoya, J.H., Faghaninia, A. et al. (2017). Atomate: a high-level interface to generate, execute, and analyze computational materials science workflows. Comput. Mater. Sci. 139: 140–152.
15 Saal, J.E., Kirklin, S., Aykol, M. et al. (2013). Materials design and discovery with high-throughput density functional theory: the Open Quantum Materials Database (OQMD). JOM 65: 1501–1509.
16 Kirklin, S., Meredig, B., and Wolverton, C. (2013). High-throughput computational screening of new Li-Ion battery anode materials. Adv. Energy Mater. 3: 252–262.
17 Kirklin, S., Saal, J.E., Hegde, V.I., and Wolverton, C. (2016). High-throughput computational search for strengthening precipitates in alloys. Acta Mater. 102: 125–135.
18 Landis, D.D., Hummelshøj, J.S., Nestorov, S. et al. (2012). The computational materials repository. Comput. Sci. Eng. 14: 51–57.
19 Bahn, S.R. and Jacobsen, K.W. (2002). An object-oriented scripting interface to a legacy electronic structure code. Comput. Sci. Eng. 4: 56–66.
20 Pizzi, G., Cepellotti, A., Sabatini, R. et al. (2016). AiiDA. http://www.aiida.net (accessed 17 April 2019).
21 Pizzi, G., Cepellotti, A., Sabatini, R. et al. (2016). AiiDA: automated interactive infrastructure and database for computational science. Comput. Mater. Sci. 111: 218–230.
22 Mounet, N., Gibertini, M., Schwaller, P. et al. (2018). Two-dimensional materials from high-throughput computational exfoliation of experimentally known compounds. Nat. Nanotechnol. 13: 246–252.


23 Scheffler, M. and Draxl, C. (2014). Computer Center of the Max-Planck Society. Garching: The NoMaD Repository. http://nomad-repository.eu.
24 Merkys, A., Mounet, N., Cepellotti, A. et al. (2017). A posteriori metadata from automated provenance tracking: integration of AiiDA and TCOD. J. Cheminform. 9: 56.
25 Yu, L. and Zunger, A. (2012). Identification of potential photovoltaic absorbers based on first-principles spectroscopic screening of materials. Phys. Rev. Lett. 108: 068701.
26 Castelli, I.E., Olsen, T., Datta, S. et al. (2012). Computational screening of perovskite metal oxides for optimal solar light capture. Energy Environ. Sci. 5: 5814–5819.
27 Lin, L.-C., Berger, A.H., Martin, R.L. et al. (2012). In silico screening of carbon-capture materials. Nat. Mater. 11: 633–641.
28 Alapati, S.V., Johnson, J.K., and Sholl, D.S. (2008). Large-scale screening of metal hydride mixtures for high-capacity hydrogen storage from first-principles calculations. J. Phys. Chem. C 112: 5258–5262.
29 Derenzo, S., Bizarri, G., Borade, R. et al. (2011). New scintillators discovered by high-throughput screening. Nucl. Inst. Methods Phys. Res. A 652: 247–250.
30 Ortiz, C., Eriksson, O., and Klintenberg, M. (2009). Data mining and accelerated electronic structure theory as a tool in the search for new functional materials. Comput. Mater. Sci. 44: 1042–1049.
31 Setyawan, W., Gaumé, R.M., Lam, S. et al. (2011). High-throughput combinatorial database of electronic band structures for inorganic scintillator materials. ACS Comb. Sci. 13: 382–390.
32 Setyawan, W., Gaumé, R.M., Feigelson, R.S., and Curtarolo, S. (2009). Comparative study of nonproportionality and electronic band structures features in scintillator materials. IEEE Trans. Nucl. Sci. 56: 2989–2996.
33 Yang, K., Setyawan, W., Wang, S. et al. (2012). A search model for topological insulators with high-throughput robustness descriptors. Nat. Mater. 11: 614–619.
34 Lin, H., Wray, L.A., Xia, Y. et al. (2010). Half-Heusler ternary compounds as new multifunctional experimental platforms for topological quantum phenomena. Nat. Mater. 9: 546–549.
35 Armiento, R., Kozinsky, B., Fornari, M., and Ceder, G. (2011). Screening for high-performance piezoelectrics using high-throughput density functional theory. Phys. Rev. B 84: 014103.
36 Roy, A., Bennett, J.W., Rabe, K.M., and Vanderbilt, D. (2012). Half-Heusler semiconductors as piezoelectrics. Phys. Rev. Lett. 109: 037602.
37 Wang, S., Wang, Z., Setyawan, W. et al. (2011). Assessing the thermoelectric properties of sintered compounds via high-throughput Ab-Initio calculations. Phys. Rev. X 1: 021012.
38 Madsen, G.K.H. (2006). Automated search for new thermoelectric materials: the case of LiZnSb. J. Am. Chem. Soc. 128: 12140–12146.
39 Carrete, J., Li, W., Mingo, N. et al. (2014). Finding unprecedentedly low-thermal-conductivity half-Heusler semiconductors via high-throughput materials modeling. Phys. Rev. X 4: 011019.


40 Carrete, J., Mingo, N., Wang, S., and Curtarolo, S. (2014). Nanograined half-Heusler semiconductors as advanced thermoelectrics: An Ab initio high-throughput statistical study. Adv. Funct. Mater. 24: 7427–7432.
41 Nørskov, J.K., Bligaard, T., Rossmeisl, J., and Christensen, C.H. (2009). Towards the computational design of solid catalysts. Nat. Chem. 1: 37–46.
42 Hautier, G., Jain, A., Chen, H. et al. (2011). Novel mixed polyanions lithium-ion battery cathode materials predicted by high-throughput ab initio computations. J. Mater. Chem. 21: 17147–17153.
43 Hautier, G., Jain, A., Ong, S.P. et al. (2011). Phosphates as lithium-ion battery cathodes: an evaluation based on high-throughput ab initio calculations. Chem. Mater. 23: 3495–3508.
44 Mueller, T., Hautier, G., Jain, A., and Ceder, G. (2011). Evaluation of tavorite-structured cathode materials for lithium-ion batteries using high-throughput computing. Chem. Mater. 23: 3854–3862.
45 Isayev, O., Oses, C., Toher, C. et al. (2017). Universal fragment descriptors for predicting properties of inorganic crystals. Nat. Commun. 8: 15679.
46 de Jong, M., Chen, W., Notestine, R. et al. (2016). A statistical learning framework for materials science: application to elastic moduli of k-nary inorganic polycrystalline compounds. Sci. Rep. 6: 34256.
47 Isayev, O., Fourches, D., Muratov, E.N. et al. (2015). Materials cartography: representing and mining materials space using structural and electronic fingerprints. Chem. Mater. 27: 735–743.
48 Perim, E., Lee, D., Liu, Y. et al. (2016). Spectral descriptors for bulk metallic glasses based on the thermodynamics of competing crystalline phases. Nat. Commun. 7: 12315.
49 Toher, C., Plata, J.J., Levy, O. et al. (2014). High-throughput computational screening of thermal conductivity, Debye temperature, and Grüneisen parameter using a quasiharmonic Debye model. Phys. Rev. B 90: 174107.
50 de Jong, M., Chen, W., Angsten, T. et al. (2015). Charting the complete elastic properties of inorganic crystalline compounds. Sci. Data 2: 150009.
51 Toher, C., Oses, C., Plata, J.J. et al. (2017). Combining the AFLOW GIBBS and elastic libraries to efficiently and robustly screen thermomechanical properties of solids. Phys. Rev. Mater. 1: 015401.
52 Bergerhoff, G., Hundt, R., Sievers, R., and Brown, I.D. (1983). The inorganic crystal structure data base. J. Chem. Inf. Comput. Sci. 23: 66–69.
53 Mehl, M.J., Hicks, D., Toher, C. et al. (2017). The AFLOW Library of Crystallographic Prototypes: Part 1. Comput. Mater. Sci. 136: S1–S828.
54 Hart, G.L.W. and Forcade, R.W. (2008). Algorithm for generating derivative structures. Phys. Rev. B 77: 224115.
55 Hart, G.L.W. and Forcade, R.W. (2009). Generating derivative structures from multilattices: algorithm and application to HCP alloys. Phys. Rev. B 80: 014120.
56 Curtarolo, S., Morgan, D., and Ceder, G. (2005). Accuracy of ab initio methods in predicting the crystal structures of metals: a review of 80 binary alloys. Calphad 29: 163–211.
57 Hart, G.L.W., Curtarolo, S., Massalski, T.B., and Levy, O. (2013). Comprehensive search for new phases and compounds in binary alloy systems based on Platinum-Group metals, using a computational first-principles approach. Phys. Rev. X 3: 041035.
58 Nyshadham, C., Oses, C., Hansen, J.E. et al. (2017). A computational high-throughput search for new ternary superalloys. Acta Mater. 122: 438–447.
59 Ong, S.P., Wang, L., Kang, B., and Ceder, G. (2008). Li-Fe-P-O2 phase diagram from first principles calculations. Chem. Mater. 20: 1798–1807.
60 Akbarzadeh, A.R., Ozoliņš, V., and Wolverton, C. (2007). First-principles determination of multicomponent hydride phase diagrams: application to the Li-Mg-N-H system. Adv. Mater. 19: 3233–3239.
61 Levy, O., Hart, G.L.W., and Curtarolo, S. (2010). Uncovering compounds by synergy of cluster expansion and high-throughput methods. J. Am. Chem. Soc. 132: 4830–4833.
62 Levy, O., Hart, G.L.W., and Curtarolo, S. (2010). Hafnium binary alloys from experiments and first principles. Acta Mater. 58: 2887–2897.
63 Levy, O., Chepulskii, R.V., Hart, G.L.W., and Curtarolo, S. (2010). The new face of rhodium alloys: revealing ordered structures from first principles. J. Am. Chem. Soc. 132: 833–837.
64 Levy, O., Jahnátek, M., Chepulskii, R.V. et al. (2011). Ordered structures in rhenium binary alloys from first-principles calculations. J. Am. Chem. Soc. 133: 158–163.
65 Jahnátek, M., Levy, O., Hart, G.L.W. et al. (2011). Ordered phases in ruthenium binary alloys from high-throughput first-principles calculations. Phys. Rev. B 84: 214110.
66 Levy, O., Xue, J., Wang, S. et al. (2012). Stable ordered structures of binary technetium alloys from first principles. Phys. Rev. B 85: 012201.
67 Chepulskii, R.V. and Curtarolo, S. (2011). Revealing low-temperature atomic ordering in bulk Co-Pt with the high-throughput ab-initio method. Appl. Phys. Lett. 99: 261902.
68 Taylor, R.H., Curtarolo, S., and Hart, G.L.W. (2010). Ordered magnesium-lithium alloys: first-principles predictions. Phys. Rev. B 81: 024112.
69 Taylor, R.H., Curtarolo, S., and Hart, G.L.W. (2011). Guiding the experimental discovery of magnesium alloys. Phys. Rev. B 84: 084101.
70 Chepulskii, R.V. and Curtarolo, S. (2009). Calculation of solubility in titanium alloys from first principles. Acta Mater. 57: 5314–5323.
71 Bloch, J., Levy, O., Pejova, B. et al. (2012). Prediction and hydrogen acceleration of ordering in iron–vanadium alloys. Phys. Rev. Lett. 108: 215503.
72 Mehl, M.J., Finkenstadt, D., Dane, C. et al. (2015). Finding the stable structures of N1−xWx with an ab initio high-throughput approach. Phys. Rev. B 91: 184110.
73 Levy, O., Hart, G.L.W., and Curtarolo, S. (2010). Structure maps for hcp metals from first-principles calculations. Phys. Rev. B 81: 174106.
74 Taylor, R.H., Curtarolo, S., and Hart, G.L.W. (2010). Predictions of the Pt8Ti phase in unexpected systems. J. Am. Chem. Soc. 132: 6851–6854.
75 Nelson, L.J., Hart, G.L.W., and Curtarolo, S. (2012). Ground-state characterizations of systems predicted to exhibit L11 or L13 crystal structures. Phys. Rev. B 85: 054203.


76 Hohenberg, P. and Kohn, W. (1964). Inhomogeneous electron gas. Phys. Rev. 136: B864–B871.
77 Kohn, W. and Sham, L.J. (1965). Self-consistent equations including exchange and correlation effects. Phys. Rev. 140: A1133.
78 Perdew, J.P. and Zunger, A. (1981). Self-interaction correction to density-functional approximations for many-electron systems. Phys. Rev. B 23: 5048–5079.
79 Zupan, A., Blaha, P., Schwarz, K., and Perdew, J.P. (1998). Pressure-induced phase transitions in solid Si, SiO2, and Fe: performance of local-spin-density and generalized-gradient-approximation density functionals. Phys. Rev. B 58: 11266.
80 Perdew, J.P., Burke, K., and Ernzerhof, M. (1996). Generalized gradient approximation made simple. Phys. Rev. Lett. 77: 3865–3868.
81 Lee, C., Yang, W., and Parr, R.G. (1988). Development of the Colle-Salvetti correlation-energy formula into a functional of the electron density. Phys. Rev. B 37: 785.
82 Sun, J., Ruzsinszky, A., and Perdew, J.P. (2015). Strongly constrained and appropriately normed semilocal density functional. Phys. Rev. Lett. 115: 036402.
83 Liechtenstein, A.I., Anisimov, V.I., and Zaanen, J. (1995). Density-functional theory and strong interactions: orbital ordering in Mott-Hubbard insulators. Phys. Rev. B 52: R5467.
84 Dudarev, S.L., Botton, G.A., Savrasov, S.Y. et al. (1998). Electron-energy-loss spectra and the structural stability of nickel oxide: an LSDA+U study. Phys. Rev. B 57: 1505–1509.
85 Becke, A.D. (1993). Density-functional thermochemistry. III. The role of exact exchange. J. Chem. Phys. 98: 5648.
86 Heyd, J., Scuseria, G.E., and Ernzerhof, M. (2003). Hybrid functionals based on a screened Coulomb potential. J. Chem. Phys. 118: 8207–8215.
87 Agapito, L.A., Curtarolo, S., and Buongiorno Nardelli, M. (2015). Reformulation of DFT + U as a pseudohybrid Hubbard density functional for accelerated materials discovery. Phys. Rev. X 5: 011006.
88 Hedin, L. (1965). New method for calculating the one-particle Green’s function with application to the electron-gas problem. Phys. Rev. 139: A796–A823.
89 Aryasetiawan, F. and Gunnarsson, O. (1998). The GW method. Rep. Prog. Phys. 61: 237.
90 Kresse, G. and Hafner, J. (1993). Ab initio molecular dynamics for liquid metals. Phys. Rev. B 47: 558–561.
91 Kresse, G. and Furthmüller, J. (1996). Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set. Phys. Rev. B 54: 11169–11186.
92 Kresse, G. and Furthmüller, J. (1996). Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set. Comput. Mater. Sci. 6: 15–50.
93 Kresse, G. and Joubert, D. (1999). From ultrasoft pseudopotentials to the projector augmented-wave method. Phys. Rev. B 59: 1758–1775.


94 Giannozzi, P., Baroni, S., Bonini, N. et al. (2009). QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials. J. Phys. Condens. Matter 21: 395502.
95 Giannozzi, P., Andreussi, O., Brumme, T. et al. (2017). Advanced capabilities for materials modelling with QUANTUM ESPRESSO. J. Phys. Condens. Matter 29: 465901.
96 Gonze, X., Beuken, J.-M., Caracas, R. et al. (2002). First-principles computation of material properties: the ABINIT software project. Comput. Mater. Sci. 25: 478–492.
97 Gonze, X., Amadon, B., Anglade, P.-M. et al. (2009). ABINIT: first-principles approach to materials and nanosystem properties. Comput. Phys. Commun. 180: 2582–2615.
98 Blum, V., Gehrke, R., Hanke, F. et al. (2009). Ab initio molecular simulations with numeric atom-centered orbitals. Comput. Phys. Commun. 180: 2175–2196.
99 Soler, J.M., Artacho, E., Gale, J.D. et al. (2002). The SIESTA method for ab initio order-N materials simulation. J. Phys. Condens. Matter 14: 2745.
100 Frisch, M.J., Trucks, G.W., Schlegel, H.B. et al. (2009). Gaussian09 Revision D.01. Wallingford, CT: Gaussian, Inc.
101 Hehre, W.J., Stewart, R.F., and Pople, J.A. (1969). Self-consistent molecular-orbital methods. I. Use of Gaussian expansions of Slater-type atomic orbitals. J. Chem. Phys. 51: 2657–2664.
102 Monkhorst, H.J. and Pack, J.D. (1976). Special points for Brillouin-zone integrations. Phys. Rev. B 13: 5188.
103 Wisesa, P., McGill, K.A., and Mueller, T. (2016). Efficient generation of generalized Monkhorst–Pack grids through the use of informatics. Phys. Rev. B 93: 155109.
104 Greaves, G.N., Greer, A.L., Lakes, R.S., and Rouxel, T. (2011). Poisson’s ratio and modern materials. Nat. Mater. 10: 823–837.
105 Poirier, J.-P. (2000). Introduction to the Physics of the Earth’s Interior, 2e. Cambridge University Press.
106 Mouhat, F. and Coudert, F.-X. (2014). Necessary and sufficient elastic stability conditions in various crystal systems. Phys. Rev. B 90: 224104.
107 Barzilai, S., Toher, C., Curtarolo, S., and Levy, O. (2016). Evaluation of the tantalum-titanium phase diagram from ab-initio calculations. Acta Mater. 120: 255–263.
108 Chen, X.-Q., Niu, H., Li, D., and Li, Y. (2011). Modeling hardness of polycrystalline materials and bulk metallic glasses. Intermetallics 19: 1275–1281.
109 Teter, D.M. (1998). Computational alchemy: the search for new superhard materials. MRS Bull. 23: 22–27.
110 Hashin, Z. and Shtrikman, S. (1963). A variational approach to the theory of the elastic behaviour of multiphase materials. J. Mech. Phys. Solids 11: 127–140.
111 Zohdi, T.I. and Wriggers, P. (2001). Aspects of the computational testing of the mechanical properties of microheterogeneous material samples. Int. J. Numer. Methods Eng. 50: 2573–2599.


112 Anderson, O.L., Schreiber, E., Liebermann, R.C., and Soga, N. (1968). Some elastic constant data on minerals relevant to geophysics. Rev. Geophys. 6: 491–524.
113 Karki, B.B., Stixrude, L., and Wentzcovitch, R.M. (2001). High-pressure elastic properties of major materials of Earth’s mantle from first principles. Rev. Geophys. 39: 507–534.
114 Zebarjadi, M., Esfarjani, K., Dresselhaus, M.S. et al. (2012). Perspectives on thermoelectrics: from fundamentals to device applications. Energy Environ. Sci. 5: 5147–5162.
115 Garrity, K.F. (2016). First principles search for n-type oxide, nitride and sulfide thermoelectrics. Phys. Rev. B 94: 045122.
116 Yeh, L.-T. and Chu, R.C. (2002). Thermal Management of Microelectronic Equipment: Heat Transfer Theory, Analysis Methods, and Design Practices. ASME Press.
117 Wright, C.D., Wang, L., Shah, P. et al. (2011). The design of rewritable ultrahigh density scanning-probe phase-change memories. IEEE Trans. Nanotechnol. 10: 900–912.
118 Watari, K. and Shinde, S.L. (2001). High thermal conductivity materials. MRS Bull. 26: 440–444.
119 Slack, G.A., Tanzilli, R.A., Pohl, R.O., and Vandersande, J.W. (1987). The intrinsic thermal conductivity of AlN. J. Phys. Chem. Solids 48: 641–647.
120 Toberer, E.S., Zevalkink, A., and Snyder, G.J. (2011). Phonon engineering through crystal chemistry. J. Mater. Chem. 21: 15843–15852.
121 Nye, J.F. (1985). Physical Properties of Crystals: Their Representation by Tensors and Matrices. Oxford Science Publications (Clarendon Press).
122 Maradudin, A.A., Montroll, E.W., Weiss, G.H., and Ipatova, I.P. (1971). Theory of Lattice Dynamics in the Harmonic Approximation. New York: Academic Press.
123 Stokes, H.T. and Hatch, D.M. (2005). FINDSYM: program for identifying the space group symmetry of a crystal. J. Appl. Crystallogr. 38: 237–238.
124 Stokes, H.T. (1995). Using symmetry in frozen phonon calculations. Ferroelectrics 164: 183–188.
125 Spek, A.L. (2003). Single-crystal structure validation with the program PLATON. J. Appl. Crystallogr. 36: 7–13.
126 Togo, A. and Tanaka, I. (2017). Spglib: a software library for crystal symmetry search. https://atztogo.github.io/spglib/ (accessed 17 April 2019).
127 Hloucha, M. and Deiters, U.K. (1998). Fast coding of the minimum image convention. Mol. Simul. 20: 239–244.
128 Hicks, D., Oses, C., Gossett, E. et al. (2018). AFLOW-SYM: platform for the complete, automatic and self-consistent symmetry analysis of crystals. Acta Crystallogr., Sect. A: Found. Adv. 74: 184–203.
129 Hahn, T. (ed.) (2002). International Tables for Crystallography. Volume A: Space-Group Symmetry. Chester, England: Kluwer Academic publishers, International Union of Crystallography.
130 Golesorkhtabar, R., Pavone, P., Spitaler, J. et al. (2013). ElaStic: a tool for calculating second-order elastic constants from first principles. Comput. Phys. Commun. 184: 1861–1873.


131 da Silveira, P.R.C., da Silva, C.R.S., and Wentzcovitch, R.M. (2008). Metadata management for distributed first principles calculations in VLab: a collaborative cyberinfrastructure for materials computation. Comput. Phys. Commun. 178: 186–198.
132 da Silva, C.R.S., da Silveira, P.R.C., Karki, B. et al. (2007). Virtual laboratory for planetary materials: system service architecture overview. Phys. Earth Planet. Inter. 163: 321–332.
133 Hill, R. (1952). The elastic behaviour of a crystalline aggregate. Proc. Phys. Soc., Sect. A 65: 349.
134 Blanco, M.A., Francisco, E., and Luaña, V. (2004). GIBBS: isothermal-isobaric thermodynamics of solids from energy curves using a quasi-harmonic Debye model. Comput. Phys. Commun. 158: 57–72.
135 Birch, F. (1938). The effect of pressure upon the elastic parameters of isotropic solids, according to Murnaghan’s theory of finite strain. J. Appl. Phys. 9: 279.
136 Vinet, P., Rose, J.H., Ferrante, J., and Smith, J.R. (1989). Universal features of the equation of state of solids. J. Phys. Condens. Matter 1: 1941–1963.
137 Baonza, V.G., Cáceres, M., and Núñez, J. (1995). Universal compressibility behavior of dense phases. Phys. Rev. B 51: 28–37.
138 Leibfried, G. and Schlömann, E. (1954). Wärmeleitung in elektrisch isolierenden Kristallen, Nachrichten d. Akad. d. Wiss. in Göttingen. Math.-physik. Kl. 2a. Math.-physik.-chem. Abt. Vandenhoeck & Ruprecht.
139 Slack, G.A. (1979). The thermal conductivity of nonmetallic crystals. In: Solid State Physics, vol. 34 (ed. H. Ehrenreich, F. Seitz, and D. Turnbull), 1–71. New York: Academic Press.
140 Morelli, D.T. and Slack, G.A. (2006). High lattice thermal conductivity solids. In: High Thermal Conductivity Materials (ed. S.L. Shinde and J.S. Goela), 37–68. New York, NY: Springer.
141 Wee, D., Kozinsky, B., Pavan, B., and Fornari, M. (2012). Quasiharmonic vibrational properties of TiNiSn from ab-initio phonons. J. Electron. Mater. 41: 977–983.
142 Bjerg, L., Iversen, B.B., and Madsen, G.K.H. (2014). Modeling the thermal conductivities of the zinc antimonides ZnSb and Zn4Sb3. Phys. Rev. B 89: 024304.
143 Ashcroft, N.W. and Mermin, N.D. (1976). Solid State Physics. Philadelphia, PA: Holt-Saunders.
144 Dove, M.T. (1993). Introduction to Lattice Dynamics, Cambridge Topics in Mineral Physics and Chemistry. Cambridge University Press.
145 Parlinski, K. (2010). Computing for materials: phonon software. http://www.computingformaterials.com/phoncfm/3faq/100softmode1.html.
146 Nath, P., Plata, J.J., Usanmaz, D. et al. (2016). High-throughput prediction of finite-temperature properties using the quasi-harmonic approximation. Comput. Mater. Sci. 125: 82–91.
147 Nath, P., Plata, J.J., Usanmaz, D. et al. (2017). High throughput combinatorial method for fast and robust prediction of lattice thermal conductivity. Scr. Mater. 129: 88–93.


148 Broido, D.A., Malorny, M., Birner, G. et al. (2007). Intrinsic lattice thermal conductivity of semiconductors from first principles. Appl. Phys. Lett. 91: 231922.
149 Li, W., Mingo, N., Lindsay, L. et al. (2012). Thermal conductivity of diamond nanowires from first principles. Phys. Rev. B 85: 195436.
150 Ward, A., Broido, D.A., Stewart, D.A., and Deinzer, G. (2009). Ab initio theory of the lattice thermal conductivity in diamond. Phys. Rev. B 80: 125203.
151 Ward, A. and Broido, D.A. (2010). Intrinsic phonon relaxation times from first-principles studies of the thermal conductivities of Si and Ge. Phys. Rev. B 81: 085205.
152 Zhang, Q., Cao, F., Lukas, K. et al. (2012). Study of the thermoelectric properties of lead selenide doped with boron, gallium, indium, or thallium. J. Am. Chem. Soc. 134: 17731–17738.
153 Li, W., Lindsay, L., Broido, D.A. et al. (2012). Thermal conductivity of bulk and nanowire Mg2SixSn1−x alloys from first principles. Phys. Rev. B 86: 174307.
154 Lindsay, L., Broido, D.A., and Reinecke, T.L. (2013). First-principles determination of ultrahigh thermal conductivity of boron arsenide: a competitor for diamond? Phys. Rev. Lett. 111: 025901.
155 Lindsay, L., Broido, D.A., and Reinecke, T.L. (2013). Ab initio thermal transport in compound semiconductors. Phys. Rev. B 87: 165201.
156 Li, W., Carrete, J., Katcho, N.A., and Mingo, N. (2014). ShengBTE: a solver of the Boltzmann transport equation for phonons. Comput. Phys. Commun. 185: 1747–1758.
157 Plata, J.J., Nath, P., Usanmaz, D. et al. (2017). An efficient and accurate framework for calculating lattice thermal conductivity of solids: AFLOW-AAPL Automatic Anharmonic Phonon Library. NPJ Comput. Mater. 3: 45.
158 Legrain, F., Carrete, J., van Roekeghem, A. et al. (2017). How chemical composition alone can predict vibrational free energies and entropies of solids. Chem. Mater. 29: 6220–6227.
159 Gossett, E., Toher, C., Oses, C. et al. (2018). AFLOW-ML: a RESTful API for machine-learning predictions of materials properties. Comput. Mater. Sci. 152: 134–145.
160 Meredig, B. and Mulholland, G. (2015). Citrine informatics. http://www.citrine.io (accessed 17 April 2019).
161 Rose, F., Toher, C., Gossett, E. et al. (2017). AFLUX: the LUX materials search API for the AFLOW data repositories. Comput. Mater. Sci. 137: 362–370.
162 Widom, M. (2016). Prediction of structure and phase transformations. In: High-Entropy Alloys: Fundamentals and Applications, Chapter 8 (ed. M.C. Gao, J.-W. Yeh, P.K. Liaw, and Y. Zhang), 267–298. Cham: Springer.
163 Zunger, A., Wei, S.-H., Ferreira, L.G., and Bernard, J.E. (1990). Special quasirandom structures. Phys. Rev. Lett. 65: 353–356.
164 Barzilai, S., Toher, C., Curtarolo, S., and Levy, O. (2017). The effect of lattice stability determination on the computational phase diagrams of intermetallic alloys. J. Alloys Compd. 728: 314–321.


165 Barzilai, S., Toher, C., Curtarolo, S., and Levy, O. (2017). Molybdenum-titanium phase diagram evaluated from ab initio calculations. Phys. Rev. Mater. 1: 023604.
166 van de Walle, A., Asta, M.D., and Ceder, G. (2002). The alloy theoretic automated toolkit: a user guide. Calphad 26: 539–553.
167 Lederer, Y., Toher, C., Vecchio, K.S., and Curtarolo, S. (2018). The search for high entropy alloys: a high-throughput ab-initio approach. Acta Mater. 159: 364–383.
168 Rost, C.M., Sachet, E., Borman, T. et al. (2015). Entropy-stabilized oxides. Nat. Commun. 6: 8485.
169 Rak, Z., Rost, C.M., Lim, M. et al. (2016). Charge compensation and electrostatic transferability in three entropy-stabilized oxides: results from density functional theory calculations. J. Appl. Phys. 120: 095105.
170 Chen, W., Ketkaew, J., Liu, Z. et al. (2015). Does the fracture toughness of bulk metallic glasses scatter? Scr. Mater. 107: 1–4.
171 Schroers, J. and Paton, N. (2006). Amorphous metal alloys form like plastics. Adv. Mater. Processes 164: 61.
172 Schroers, J., Hodges, T.M., Kumar, G. et al. (2011). Thermoplastic blow molding of metals. Mater. Today 14: 14–19.
173 Kaltenboeck, G., Demetriou, M.D., Roberts, S., and Johnson, W.L. (2016). Shaping metallic glasses by electromagnetic pulsing. Nat. Commun. 7: 10576.
174 Johnson, W.L. (1999). Bulk glass-forming metallic alloys: science and technology. MRS Bull. 24: 42–56.
175 Greer, A.L. (2009). Metallic glasses…on the threshold. Mater. Today 12: 14–22.
176 Schroers, J. (2010). Processing of bulk metallic glass. Adv. Mater. 22: 1566–1597.
177 Johnson, W.L., Na, J.H., and Demetriou, M.D. (2016). Quantifying the origin of metallic glass formation. Nat. Commun. 7: 10313.
178 Ashby, M.F. and Greer, A.L. (2006). Metallic glasses as structural materials. Scr. Mater. 54: 321–326.
179 Miracle, D.B. (2004). A structural model for metallic glasses. Nat. Mater. 3: 697–702.
180 Egami, T. and Waseda, Y. (1984). Atomic size effect on the formability of metallic glasses. J. Non-Cryst. Solids 64: 113–134.
181 Greer, A.L. (1993). Confusion by design. Nature 366: 303–304.
182 Egami, T. (2003). Atomistic mechanism of bulk metallic glass formation. J. Non-Cryst. Solids 317: 30–33.
183 Lee, H.-J., Cagin, T., Johnson, W.L., and Goddard, W.A. III (2003). Criteria for formation of metallic glasses: the role of atomic size ratio. J. Chem. Phys. 119: 9858–9870.
184 Zhang, K., Dice, B., Liu, Y. et al. (2015). On the origin of multi-component bulk metallic glasses: atomic size mismatches and de-mixing. J. Chem. Phys. 143: 054501.
185 Cheney, J. and Vecchio, K. (2009). Evaluation of glass-forming ability in metals using multi-model techniques. J. Alloys Compd. 471: 222–240.


186 Cheney, J. and Vecchio, K. (2007). Prediction of glass-forming compositions using liquidus temperature calculations. Mater. Sci. Eng., A 471: 135–143.
187 Lu, Z.P. and Liu, C.T. (2002). A new glass-forming ability criterion for bulk metallic glasses. Acta Mater. 50: 3501–3512.
188 Ward, L., Agrawal, A., Choudhary, A., and Wolverton, C. (2016). A general-purpose machine learning framework for predicting properties of inorganic materials. NPJ Comput. Mater. 2: 16028.
189 Villars, P. (2000). Factors governing crystal structures. In: Crystal Structures of Intermetallic Compounds (ed. J.H. Westbrook and R.L. Fleischer), 1–49. New York: Wiley.
190 Daams, J.L.C. (2000). Atomic environments in some related intermetallic structure types. In: Crystal Structures of Intermetallic Compounds (ed. J.H. Westbrook and R.L. Fleischer), 139–159. New York: Wiley.
191 Daams, J.L.C. and Villars, P. (2000). Atomic environments in relation to compound prediction. Eng. Appl. Artif. Intell. 13: 507–511.
192 Xie, L., Brault, P., Coutanceau, C. et al. (2015). Efficient amorphous platinum catalyst cluster growth on porous carbon: a combined molecular dynamics and experimental study. Appl. Catal. B 162: 21–26.
193 Bednorz, J.G. and Müller, K.A. (1986). Possible high Tc superconductivity in the Ba-La-Cu-O system. Z. Phys. B: Condens. Matter 64: 189–193.
194 Maeno, Y., Hashimoto, H., Yoshida, K. et al. (1994). Superconductivity in a layered perovskite without copper. Nature 372: 532–534.
195 Winter, M.R. and Clarke, D.R. (2007). Oxide materials with low thermal conductivity. J. Am. Ceram. Soc. 90: 533–540.
196 Yang, K., Oses, C., and Curtarolo, S. (2016). Modeling off-stoichiometry materials with a high-throughput Ab-Initio approach. Chem. Mater. 28: 6484–6492.
197 Rohrer, G.S., Affatigato, M., Backhaus, M. et al. (2012). Challenges in ceramic science: a report from the workshop on emerging research areas in ceramic science. J. Am. Ceram. Soc. 95: 3699–3712.
198 Saal, J.E. and Wolverton, C. (2013). Thermodynamic stability of Co-Al-W L12 γ′. Acta Mater. 61: 2330–2338.
199 Sato, J., Omori, T., Oikawa, K. et al. (2006). Cobalt-Base High-Temperature Alloys. Science 312: 90–91.
200 Snyder, G.J. and Toberer, E.S. (2008). Complex thermoelectric materials. Nat. Mater. 7: 105–114.
201 Nolas, G.S., Sharp, J., and Goldsmid, H.J. (2001). Thermoelectrics: Basic Principles and New Materials Developments. Springer-Verlag.
202 Bell, L.E. (2008). Cooling, heating, generating power, and recovering waste heat with thermoelectric systems. Science 321: 1457–1461.
203 DiSalvo, F.J. (1999). Thermoelectric cooling and power generation. Science 285: 703–706.
204 Chen, W., Pöhls, J.-H., Hautier, G. et al. (2016). Understanding thermoelectric properties from high-throughput calculations: trends, insights, and comparisons with experiment. J. Mater. Chem. C 4: 4414–4426.
205 Zhu, H., Hautier, G., Aydemir, U. et al. (2015). Computational and experimental investigation of TmAgTe2 and XYZ2 compounds, a new group of thermoelectric materials identified by first-principles high-throughput screening. J. Mater. Chem. C 3: 10554–10565.
206 Tang, Y., Gibbs, Z.M., Agapito, L.A. et al. (2015). Convergence of multi-valley bands as the electronic origin of high thermoelectric performance in CoSb3 skutterudites. Nat. Mater. 14: 1223–1228.
207 Zeier, W.G., Schmitt, J., Hautier, G. et al. (2016). Engineering half-Heusler thermoelectric materials using Zintl chemistry. Nat. Rev. Mater. 1: 16032.
208 Shi, X., Yang, J., Bai, S. et al. (2010). On the design of high-efficiency thermoelectric clathrates through a systematic cross-substitution of framework elements. Adv. Funct. Mater. 20: 755–763.
209 Zhang, H., Borrmann, H., Oeschler, N. et al. (2011). Atomic interactions in the p-type clathrate I Ba8Au5.3Ge40.7. Inorg. Chem. 50: 1250–1257.
210 Saiga, Y., Du, B., Deng, S.K. et al. (2012). Thermoelectric properties of type-VIII clathrate Ba8Ga16Sn30 doped with Cu. J. Alloys Compd. 537: 303–307.
211 Christensen, M., Johnsen, S., and Iversen, B.B. (2010). Thermoelectric clathrates of type I. Dalton Trans. 39: 978–992.
212 Madsen, G.K.H., Katre, A., and Bera, C. (2016). Calculating the thermal conductivity of the silicon clathrates using the quasi-harmonic approximation. Phys. Status Solidi A 213: 802–807.
213 Sales, B.C., Mandrus, D., and Williams, R.K. (1996). Filled skutterudite antimonides: a new class of thermoelectric materials. Science 272: 1325–1328.
214 Bai, S.Q., Pei, Y.Z., Chen, L.D. et al. (2009). Enhanced thermoelectric performance of dual-element-filled skutterudites BaxCeyCo4Sb12. Acta Mater. 57: 3135–3139.
215 Yang, J., Qiu, P., Liu, R. et al. (2011). Trends in electrical transport of p-type skutterudites RFe4Sb12 (R=Na, K, Ca, Sr, Ba, La, Ce, Pr, Yb) from first-principles calculations and Boltzmann transport theory. Phys. Rev. B 84: 235205.
216 Madsen, G.K.H. and Singh, D.J. (2006). BoltzTraP. A code for calculating band-structure dependent quantities. Comput. Phys. Commun. 175: 67–71.
217 Clarke, D.R. (2003).
Materials selection guidelines for low thermal conductivity thermal barrier coatings. Surf. Coat. Technol. 163–164: 67–74. Cahill, D.G., Braun, P.V., Chen, G. et al. (2014). Nanoscale thermal transport. II. 2003-2012. Appl. Phys. Rev. 1: 011305. Toberer, E.S., May, A.F., Scanlon, C.J., and Snyder, G.J. (2009). Thermoelectric properties of p-type LiZnSb: Assessment of ab initio calculations. J. Appl. Phys. 105: 063701. Pei, Y., Wang, H., and Snyder, G.J. (2012). Band engineering of thermoelectric materials. Adv. Mater. 24: 6125–6135. Bhattacharya, S. and Madsen, G.K.H. (2015). High-throughput exploration of alloying as design strategy for thermoelectrics. Phys. Rev. B 92: 085205. Pei, Y., Shi, X., LaLonde, A. et al. (2011). Convergence of electronic bands for high performance bulk thermoelectrics. Nature 473: 66–69. Zeier, W.G., LaLonde, A., Gibbs, Z.M. et al. (2012). Influence of a nano phase segregation on the thermoelectric properties of the p-type doped stannite compound Cu2+x Zn1−x GeSe4 . J. Am. Chem. Soc. 134: 7147–7154.

221

222

7 Automated Computation of Materials Properties

224 Zeier, W.G., Pei, Y., Pomrehn, G. et al. (2012). Phonon scattering through

225

226 227 228 229

a local anisotropic structural disorder in the thermoelectric solid solution Cu2 Zn1−x Fex GeSe4 . J. Am. Chem. Soc. 135: 726–732. Usanmaz, D., Nath, P., Plata, J.J. et al. (2016). First principles thermodynamical modeling of the binodal and spinodal curves in lead chalcogenides. Phys. Chem. Chem. Phys. 18: 5005–5011. Usanmaz, D., Nath, P., Toher, C. et al. (2018). Spinodal superlattices of topological insulators. Chem. Mater. 30: 2331–2340. Sanvito, S., Oses, C., Xue, J. et al. (2017). Accelerated discovery of new magnets in the Heusler alloy family. Sci. Adv. 3: e1602241. Graf, T., Felser, C., and Parkin, S.S.P. (2011). Simple rules for the understanding of Heusler compounds. Prog. Solid State Chem. 39: 1–50. Kreiner, G., Kalache, A., Hausdorf, S. et al. (2014). New Mn2 -based Heusler compounds. Z. Anorg. Allg. Chem. 640: 738–752.

223

8 Cognitive Chemistry: The Marriage of Machine Learning and Chemistry to Accelerate Materials Discovery

Edward O. Pyzer-Knapp
IBM Research, Hartree Centre, Daresbury, WA4 4AD, UK

8.1 Introduction

The chemical realm is one of incredibly rich and deep data, and this data is growing exponentially. Indeed, it is estimated that the number of stable small molecules (sometimes termed “chemical space” [1]) is greater than 10^60 [2]. In the recent past, the rate of chemical discovery was limited both by the human and financial cost of performing experiments and by the relatively slow dissemination of the results through traditional print media; this is no longer the case. Advances in computational techniques have enabled a new, virtual laboratory for running simulated experiments, and the rapid rise of the Internet has made the dissemination of chemical knowledge almost instant. Together these have brought us to a situation where gathering large amounts of data is no longer the challenge it once was; it is what we do with it that matters.

Since we are now at the point at which it is not only impractical but impossible for a sole researcher to have complete domain knowledge, it seems inevitable that the next challenge is to build computers that can approach this capacity. Given that discovery in the chemical sciences has not ground to a halt (in fact, it is accelerating), it can also be argued that complete domain knowledge is not necessary; instead, the “trick” is being able to take what knowledge you have and use it to make informed decisions. Machine learning has been used to aid decision making both by building models that are specific, fast, and accurate and by building algorithms for searching molecular space. Balancing the exploitation of current knowledge with the exploration of new regions of chemical space is key to efficiently locating the materials of the future.

This chapter is split into three major sections. Firstly, we will discuss how to represent chemical knowledge in a form that machine learning algorithms can understand, often called molecular fingerprints. We will chart their development from handcrafted sets of fragments to fingerprints derived from so-called representation learning, where the representation of the molecule is learned directly from the data. Secondly, we will discuss the use of machine learning to build fast and accurate models from data, both experimental and simulated, and also to robustly convert between these different types of data.

Materials Informatics: Methods, Tools and Applications, First Edition. Edited by Olexandr Isayev, Alexander Tropsha, and Stefano Curtarolo. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.


Finally, we will consider how machine learning can be used for making decisions in the discovery process both by building enriched libraries of candidate molecules and determining priorities for molecules within a screening library. Throughout this chapter, detailed discussions on relevant techniques will be undertaken as they are introduced.

8.2 Describing Molecules for Machine Learning Algorithms

When utilizing machine learning techniques for materials discovery, the first problem encountered is how to portray the information in a way that the algorithm can understand. For some entities, for example, molecules, it is not immediately apparent how to do this. In general, there are two schools of thought as to how to approach this problem:

(1) Introduce as much of your current knowledge as possible into this “fingerprinting” technique by building representations that include the features you think are important to the task. This is often known as “handcrafting” a feature set.
(2) Assume that all important relationships are contained within the data, and use an unsupervised technique to build a representation directly from the data. This method is closer to how the brain actually learns but can be inefficient to train.

There exists a set of methods that lie in the gray zone between these two approaches, in which a superset of features is presented to an algorithm, which then uses the data itself to decide which of these are relevant to the problem at hand. In this section we will examine all three of these techniques and discuss their relative pros and cons.

Molecules are hard to compare, and so there have been many attempts to convert them into forms for which we have preexisting mathematical constructs for making these comparisons [3–7]. As previously stated, handcrafting a feature representation is one possible way to do this and commonly consists of transforming the molecule into a vector based upon the existence, or absence, of noteworthy chemical fragments [8–11]. If you already know, or believe that you know, which features are important for your model and which are not, this method can provide a powerful means of ensuring that this knowledge is represented within your model.
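The idea of turning a molecule into a vector of fragment presence/absence flags can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the molecule is a toy heavy-atom graph rather than a real cheminformatics object, and the four queries in `KEYS` are hypothetical choices, not a published key set.

```python
from typing import Callable, Dict, List, Tuple

# Toy molecule: element symbols plus (atom_i, atom_j, bond_order) tuples.
Molecule = Dict[str, list]
Query = Callable[[Molecule], bool]


def has_element(symbol: str) -> Query:
    """Query: does the molecule contain at least one atom of this element?"""
    return lambda mol: symbol in mol["atoms"]


def has_bond(a: str, b: str, order: int) -> Query:
    """Query: is there a bond of the given order between elements a and b?"""
    def query(mol: Molecule) -> bool:
        for i, j, bond_order in mol["bonds"]:
            pair = {mol["atoms"][i], mol["atoms"][j]}
            if bond_order == order and pair == {a, b}:
                return True
        return False
    return query


# Hypothetical handcrafted key set: the feature creator decides what matters.
KEYS: List[Tuple[str, Query]] = [
    ("contains N", has_element("N")),
    ("contains S-S", has_bond("S", "S", 1)),
    ("contains C=O", has_bond("C", "O", 2)),
    ("contains C=N", has_bond("C", "N", 2)),
]


def fingerprint(mol: Molecule) -> List[int]:
    """Binary vector: 1 if the query matches the molecular graph, else 0."""
    return [int(query(mol)) for _, query in KEYS]


# Heavy-atom graph of formamide, O=C-N.
formamide = {"atoms": ["O", "C", "N"], "bonds": [(0, 1, 2), (1, 2, 1)]}
print(fingerprint(formamide))  # [1, 0, 1, 0]
```

The choice of queries is exactly where the feature creator's knowledge, and bias, enters: a molecule whose important chemistry is not covered by any key is invisible to the model.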
This strength of handcrafted features is also a contributor to their main shortcoming, however, since it is easy to accidentally (or otherwise!) introduce the biases of the feature creator into the feature, which can have far-reaching – and often subtle – implications.

The most common conceit employed when crafting features is to ignore any three-dimensional information that you may have and treat the molecule as a graph, with the atoms as nodes and the bonds as edges [12]. Since different classes of bond are indistinguishable within this conceit, it is common to embellish the atomic description to include some of this information. One example of this might be to separate aromatic and aliphatic carbons into separate classes.

One common way to build chemical intuition into a 2D fingerprint is to identify sets of fragments that are believed to be correlated with desirable properties. Perhaps the most commonly used fingerprints are the Molecular ACCess System (MACCS) keys [11]. This fingerprinting technique reports the set of binary (Boolean) responses to a set of queries, where a 1 represents the fragment being present within the structure and a 0 encodes its absence. This is demonstrated in Figure 8.1 for the molecule diazepam, a common pharmaceutical material.

MACCS key fingerprint calculation for diazepam (selected keys shown)

Key position    Key description    Key code
11              4M RING            0
14              S–S                0
19              7M RING            1
45              C=CN               0
78              C=N                1
92              OC(N)C             1
163             6M RING            1

Fingerprint (key positions set to 1): 19, 78, 92, 163

Figure 8.1 An example of the process of MACCS encoding the molecule diazepam. Key queries are run against the molecular graph, and the positive responses are encoded into a sparse representation. Source: Vilar et al. 2014 [13]. Reproduced with permission of Springer Nature.

While the MACCS keys have been widely used and have demonstrated some successes, their major limitation is the inflexibility of their queries. There are 166 MACCS keys [11], and the queries that they represent have been tuned toward pharmaceutical materials. This should be kept very much in mind when using this fingerprinting technique to describe molecules, especially outside of the pharmaceutical arena.

A generalization of the MACCS approach is the extended connectivity fingerprinting approach [10]. Here, instead of using a set of fixed queries, the fingerprint is generated by systematically interrogating the atomic environments contained within the molecule, building up a description based upon the information contained within a certain number of edges (bonds) of each node (atom). This is performed using the following process:

(1) Initial assignment of atomic types: Each atom is assigned a descriptor that sufficiently describes it. This could contain information about the atomic number, connectivity, or some other desirable property, in much the same way that a force field atom type is constructed.
(2) Iterative generation of environment descriptions: In this stage, additional descriptions are generated by considering atoms connected through one, then two, then three, etc.
bonds and adding a hash of each


representation to a list. If this iterative updating process is performed using the Morgan algorithm [14], then the fingerprint is generally known as the Morgan circular fingerprint. Using this procedure, each environment can be described in one of two ways: by either using a binary description (i.e. a hashed representation either appears in the molecule or it does not) or by storing a count of how many times each fragment occurs. If a binary description is used, then the generated list of environments is reduced to a set by removing duplicates; else the number of duplicates for each environment is counted and stored against the description. The resulting set can be thought of as the sparse representation of an extremely long vector describing a large number of potential environments. In order to make this more useful for machine learning, it is common practice to transform this description into a more reasonable, fixed length vector. This is commonly achieved by the use of a folding algorithm. It is important to bear in mind that these methods are one-way functions; since there is the possibility of bit collisions (where multiple features switch on the same bit), it is impossible to regenerate the molecule from its fingerprint (Figure 8.2). For some tasks, it is desirable to extend the information presented to the algorithm to include, for example, atomic positions. This introduces additional complexity not only to the description itself but to the algorithm that is used to produce this fingerprint. A key property which these 3D feature representations must exhibit is translational and rotational invariance, as rotating the molecule or translating it in space should not affect its properties. 
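The circular fingerprint procedure and the folding step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the published extended connectivity algorithm: the initial atom type is just element symbol plus degree, `zlib.crc32` stands in for the hash function, folding is a plain modulo, and the duplicate-substructure removal used in practice is omitted. The function names are hypothetical.

```python
import zlib
from typing import Dict, List, Tuple


def stable_hash(text: str) -> int:
    """Deterministic 32-bit hash (Python's built-in hash() of str is salted per process)."""
    return zlib.crc32(text.encode("utf-8"))


def circular_identifiers(atoms: List[str], bonds: List[Tuple[int, int]],
                         radius: int) -> List[int]:
    """Hashed identifiers of every atom environment for radii 0..radius."""
    neighbours: Dict[int, List[int]] = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    # Step 1: initial atom types (assumed here: element symbol and degree).
    current = [stable_hash(f"{sym}|{len(neighbours[i])}")
               for i, sym in enumerate(atoms)]
    identifiers = list(current)
    # Step 2: grow each environment by one bond per iteration.
    for _ in range(radius):
        updated = []
        for i in range(len(atoms)):
            # Sorting makes the result independent of neighbour ordering.
            env = sorted(current[j] for j in neighbours[i])
            updated.append(stable_hash(f"{current[i]}|{env}"))
        current = updated
        identifiers.extend(current)
    return identifiers


def fold(identifiers: List[int], n_bits: int) -> List[int]:
    """Binary variant: fold the identifier set into a fixed-length bit vector."""
    bits = [0] * n_bits
    for ident in set(identifiers):
        bits[ident % n_bits] = 1  # two identifiers may hit the same bit
    return bits


# Heavy-atom graph of ethanol, C-C-O.
ids = circular_identifiers(["C", "C", "O"], [(0, 1), (1, 2)], radius=2)
bits = fold(ids, n_bits=16)
print(len(set(ids)), "distinct environments folded into", sum(bits), "set bits")
```

If the number of set bits is smaller than the number of distinct environments, a bit collision has occurred, which is why the folded vector cannot be mapped back to the molecule. A count-based variant would store `collections.Counter(identifiers)` instead of a set.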
Additionally, the requirement of being able to generate a fixed-length representation still holds and carries some very subtle implications: while it is easy to restrict the length of a fragment-based fingerprint by restricting the number of fragments searched for, such an enumeration over, say, Cartesian space is less achievable. Moreover, the implication of fixing the length of a representation through the addition of null vectors (“padding with zeros”) must be carefully considered.

The Coulomb matrix, developed by the group of von Lilienfeld, is a method for representing molecules derived from their atoms’ nuclear charges and corresponding Cartesian coordinates [15]. The Coulomb matrix can be described as

$$
C_{ij} =
\begin{cases}
0.5\, Z_i^{2.4}, & \forall\, i = j \\[4pt]
\dfrac{Z_i Z_j}{\lvert \mathbf{R}_i - \mathbf{R}_j \rvert}, & \forall\, i \neq j
\end{cases}
\tag{8.1}
$$
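Equation (8.1) can be evaluated directly from a list of nuclear charges and a list of Cartesian coordinates. The sketch below is a minimal illustration rather than a reference implementation: the function name is hypothetical, the water geometry is approximate and purely illustrative, and the coordinates are used in whatever length unit they are supplied in.

```python
import math
from typing import List, Sequence


def coulomb_matrix(charges: Sequence[float],
                   coords: Sequence[Sequence[float]]) -> List[List[float]]:
    """Coulomb matrix of Eq. (8.1) from nuclear charges Z_i and positions R_i."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: 0.5 * Z_i ** 2.4
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: Z_i * Z_j / |R_i - R_j|
                matrix[i][j] = (charges[i] * charges[j]
                                / math.dist(coords[i], coords[j]))
    return matrix


# Approximate, illustrative water geometry (O, H, H).
water_z = [8, 1, 1]
water_xyz = [(0.0, 0.0, 0.0), (0.757, 0.586, 0.0), (-0.757, 0.586, 0.0)]
cm = coulomb_matrix(water_z, water_xyz)

# Translating the whole molecule leaves every interatomic distance,
# and therefore the matrix, unchanged.
shifted = [(x + 5.0, y - 3.0, z + 2.0) for x, y, z in water_xyz]
cm_shifted = coulomb_matrix(water_z, shifted)
print(all(abs(a - b) < 1e-9
          for row_a, row_b in zip(cm, cm_shifted)
          for a, b in zip(row_a, row_b)))
```

Because only interatomic distances enter the off-diagonal terms, the matrix does not change under translation or rotation of the molecule; it does, however, change when the atoms are listed in a different order.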

In Eq. (8.1), Z represents the nuclear charge of each atom and R its coordinates in 3D space. The Coulomb matrix is then made invariant to the ordering of the atoms by employing one of a range of sorting algorithms [15]. This method has been extended to periodic structures [16] to allow for the calculation of fingerprints of crystalline materials. While the Coulomb matrix has been extensively employed, with some notable successes [17–20], it does have some disadvantages. For example, it does not allow the global chirality of a molecule to be distinguished. Additionally, as with all 3D fingerprinting techniques, it requires the geometry of the molecule in question. Since this is itself often expensive to produce, techniques such as the Coulomb matrix are more suited to bootstrapping from one level of theory to another (sometimes known as Δ-Machine-Learning [21]) than to accelerating the high-throughput screening techniques used in materials discovery. It should also be borne in mind that the success of techniques such as these is also dependent on the quality of (i) the molecular geometry itself and (ii) the technique used to generate the geometry (i.e. the conformational generator).

[Figure 8.2, panel (a): atom environments of diameter 0, 2, and 4 with their integer identifiers, collected into an identifier list representation. Panel (b): the identifier list passed through a hash function to give a fixed-length binary representation, with bit collisions marked.]

Figure 8.2 A description of the generation of circular (extended connectivity) fingerprints from fragmentation of the molecular graph. First fragments up to the desired diameter size are generated, and their corresponding bit is identified (a). This sparse representation is then hashed into a fixed length vector (b). Since bit collisions are possible, this is a strictly one-directional process. Source: https://docs.chemaxon.com/pages/viewpage.action?pageId=41129785.

Recently, a technique based upon the geometric principle of the space (or sphere) filling curve has been used as an alternative method for generating fingerprints of both molecular and periodic structures [22]. The space filling curve is a commonly used technique for indexing N-dimensional space in which a curve is placed over the space in question, and the intersection between objects within the space and the curve is determined. The simplest, and perhaps most commonly used, space filling curve is called the Morton curve, or sometimes the Z-curve, due to its shape when represented in two dimensions (Figure 8.3). In order to construct a Morton index, the binary representations of the coordinates are interleaved. Performing this operation for each point (for example, each atom in a molecule) yields the set of indices representing the intersections between the points and a Z-shaped curve that fills the space enclosed by a bounding box. In higher dimensions, this can be thought of as a list of which N-dimensional cells are not empty when an N-dimensional grid is cast over the object in question.

Morton codes for a 4 × 4 grid (each code interleaves the bits of the row index, first, and the column index, second)

              column 0 (00)   column 1 (01)   column 2 (10)   column 3 (11)
row 0 (00)    0000            0001            0100            0101
row 1 (01)    0010            0011            0110            0111
row 2 (10)    1000            1001            1100            1101
row 3 (11)    1010            1011            1110            1111

Figure 8.3 A graphical representation of the flattening of 3D Cartesian space through the use of a Morton space filling curve. To construct the curve, the binary representations of each cell's coordinates are interleaved, with the resulting representation having the property of retained locality from the original inputs.

An advantage of the N-dimensional nature of this family of techniques is that it is possible to encode additional chemical information into the “spare” dimension.
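The bit interleaving that defines a Morton index is short enough to write out. The sketch below is an illustration under assumptions: the function names, the uniform `cell_size`, and the fixed number of bits per axis are hypothetical choices, and the points are binned onto an integer grid before their indices are interleaved.

```python
from typing import List, Sequence


def interleave_bits(cell: Sequence[int], bits_per_axis: int) -> int:
    """Morton (Z-order) index: interleave the bits of integer cell coordinates."""
    index = 0
    for bit in range(bits_per_axis - 1, -1, -1):  # most significant bit first
        for coordinate in cell:
            if not 0 <= coordinate < (1 << bits_per_axis):
                raise ValueError("cell coordinate outside the bounding box")
            index = (index << 1) | ((coordinate >> bit) & 1)
    return index


def morton_cells(points: Sequence[Sequence[float]], box_min: Sequence[float],
                 cell_size: float, bits_per_axis: int) -> List[int]:
    """Sorted list of the non-empty grid cells, identified by Morton index."""
    occupied = set()
    for point in points:
        cell = tuple(int((x - lo) // cell_size) for x, lo in zip(point, box_min))
        occupied.add(interleave_bits(cell, bits_per_axis))
    return sorted(occupied)


# Reproduces two entries of the 4 x 4 grid: row 1, column 2 and row 3, column 3.
print(format(interleave_bits((1, 2), 2), "04b"))  # 0110
print(format(interleave_bits((3, 3), 2), "04b"))  # 1111

# Two points in a 2D box with unit cells fall into cells (0, 0) and (2, 1).
print(morton_cells([(0.1, 0.1), (2.6, 1.2)], (0.0, 0.0), 1.0, 2))  # [0, 9]
```

Because `interleave_bits` accepts any number of coordinates, an extra integer (for instance a binned atomic number) can be appended to each cell tuple to occupy a further dimension.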
Recent work on applying these techniques to describe crystalline materials has used a range of potential values for this new dimension, including atomic number and statistics that encode the local bonding environment of the atom. One potential disadvantage of this method is that the size of cell required to give atomic resolution results in representations that are very sparse. While there are ways around this, it does limit the direct use of these methods without some form of dimensionality reduction.

An alternative approach to handcrafting descriptors is to let the data itself determine the descriptors. This approach has been shown to be effective in many areas of image recognition, including the MNIST handwritten digit problem, with the learned descriptors frequently outperforming their handcrafted alternatives. Restricted Boltzmann machines (RBMs) have been used to perform this task.


A convenient way to conceptualize an RBM is as a two-layer artificial neural network, with one visible layer and one hidden layer. The “restricted” part of the name derives from the fact that the neurons in an RBM are required to form a bipartite graph – that is, there are no connections between neurons within a layer. RBMs model probability distributions over a set of inputs through an energy function:

$$
E(v, h) = -a^{T} v - b^{T} h - v^{T} W h
\tag{8.2}
$$

which is related to the probability distribution as follows:

$$
P(v, h) = \frac{1}{Z}\, e^{-E(v, h)}
\tag{8.3}
$$

where v and h are the states of the visible and hidden units, a and b are the corresponding biases, W is the matrix of weights connecting the two layers, and Z is a normalizing coefficient that ensures that P(v, h) sums to 1.

If funneling sets of RBMs are stacked together, it is possible to create an entity known as a deep autoencoder [23]. In this paradigm the RBMs are used to pretrain the weights: a stack of RBMs of decreasing size is trained, with the inputs of layer (n + 1) being supplied by the outputs of layer n. Unraveling these RBMs results in an initial weight matrix for a funnel-shaped deterministic neural network, which can then be fine-tuned using one of the myriad adaptations of back propagation [23]. By splitting this neural network into its encoder/decoder components, we can use this model to effectively reduce the dimensionality. Deep autoencoders have been used to great success for the recognition of handwritten digits directly from the inputs (in this case images), and it is not hard to imagine the same being used for materials discovery. Pyzer-Knapp, Hernandez-Lobato et al. have used these techniques to compress circular fingerprints to two dimensions for the construction of information landscapes [24] that allow easier interpretation of the diversity of chemical libraries, which can be used to rationalize screening techniques. This will be covered in more detail in Section 8.4.

In addition to deep autoencoders, convolutional neural networks and recurrent neural networks have been used to generate models working directly from raw inputs, in this case the molecular graphs. Convolutional neural networks are very similar to standard (deterministic, feed-forward) neural networks but have additional layer types (in addition to the fully connected layers of a traditional neural network) and are laid out to take into account the spatial structure of the data. In order to achieve this, a convolutional network consists of small collections of neurons, each of which processes a portion of the input (for example, part of an image), with these collections tiled so that the regions they cover overlap. This setup gives the network some tolerance to input translation.

One of the problems with traditional network architectures is the so-called curse of dimensionality, in which the predictive power of a model often reduces as the dimensionality increases [25], an effect that results from the fully connected nature of the model. In some tasks, such as image recognition, there are clear cases where a fully connected approach is wasteful – pixels at one extreme of an image are unlikely to be correlated to those at another. This can be extended into chemistry, where the existence, for example, of a particular functional group at one part of the molecule does not necessarily affect the chemistry that can occur at some other, distant point. This is borne out by the

weakening of three- and four-bond contributions to, for instance, the nuclear magnetic resonance (NMR) shift of a particular chemical group. This is encoded into a convolutional network through the paradigm of receptive fields. In this paradigm, convolutional networks exploit local structure by only allowing connections between neurons and a small partition of the previous layer (Figure 8.4).

Figure 8.4 A demonstration of the concept of the locality of neurons in a convolutional network. Here, each neuron in the hidden layer is only connected with a small subset of the inputs. The size and amount of overlap of each of the receptive fields are set with hyperparameters, which can themselves be optimized.

In addition to receptive fields, convolutional networks differ from traditional neural networks in a few key ways. Firstly, the neurons have a depth associated with the weight matrix. This can be thought of as the number of filters that each layer is trying to discover. Neurons that are in the same depth channel will also share weights. This is a tactic employed to reduce the number of degrees of freedom of the network and make it more resilient to overfitting. Weight sharing is based upon the assumption that features that are captured in one receptive field are likely to be relevant in another. For instance, this means that we do not need to have a filter for edges for every possible place they may occur in an image. Another difference is the potential to add padding to images to control the spatial size of the output volumes; although this is often used for images, the physical implications of padding molecules may mean that this is less often implemented in chemical applications. Finally, convolutional networks have a pooling layer, which can be thought of as a mechanism for simplifying the output of the convolutional layer. Simply put, the pooling layer will take the outputs of each filter (layer of depth) and select the most strongly responding result. In the pooling process, some positional information is lost, with the rationale being that the most important thing is not exactly where a feature has been detected but rather its existence and which (if any) features are present in the approximate vicinity.
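Receptive fields, weight sharing, and pooling can be seen together in a one-dimensional toy example. This is a minimal sketch under assumptions, not a usable network layer: there is a single filter with a fixed, hand-picked kernel (a real network would learn it), no padding, and non-overlapping max pooling; the function names are hypothetical.

```python
from typing import List, Sequence


def conv1d(signal: Sequence[float], kernel: Sequence[float],
           stride: int = 1) -> List[float]:
    """Apply one shared kernel to every local receptive field of the signal."""
    width = len(kernel)
    return [sum(w * x for w, x in zip(kernel, signal[start:start + width]))
            for start in range(0, len(signal) - width + 1, stride)]


def max_pool(values: Sequence[float], size: int) -> List[float]:
    """Keep only the strongest response in each non-overlapping window."""
    return [max(values[start:start + size])
            for start in range(0, len(values) - size + 1, size)]


edge_filter = [-1, 1]  # responds where the signal steps upward

# The same rising edge, placed one position apart.
signal_a = [0, 1, 1, 1, 0, 0, 0, 0]
signal_b = [0, 0, 1, 1, 1, 0, 0, 0]

response_a = conv1d(signal_a, edge_filter)
response_b = conv1d(signal_b, edge_filter)
print(response_a)  # [1, 0, 0, -1, 0, 0, 0]
print(response_b)  # [0, 1, 0, 0, -1, 0, 0]

# After pooling, the small shift is no longer visible.
print(max_pool(response_a, 2))  # [1, 0, 0]
print(max_pool(response_b, 2))  # [1, 0, 0]
```

The two-weight kernel is reused at every position, so the number of parameters does not grow with the length of the input, and the pooled outputs agree even though the edge sits at different positions in the two signals.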


Figure 8.5 (a) Graphical depiction of the construction of a circular fingerprint through the hashing function(s), indexing, and finally bit-setting and (b) how the information flow differs in a neural fingerprint, where the hashing function is replaced by neural network layer(s), the indexing by a softmax operation, and the bit-setting operation by a summation. Source: Duvenaud et al. 2015 [26].

Duvenaud et al. have used convolutional networks to generate so-called neural fingerprints of molecules as a data-driven alternative to classical fingerprinting techniques [26]. The authors show that through the use of a convolutional network, it is possible to generate a data-driven, generalizable fingerprint that has competitive predictive performance (Figure 8.5).

The neural fingerprint has some key advantages over traditional fingerprinting techniques. Firstly, since it is a data-driven technique, it is trivial to generate a custom-built fingerprint for a specific task. While previously a great deal of importance was placed on a method being transferable, when tackling problems in a data-rich environment, a significant amount of transferability can be sacrificed on the altar of performance (so long as significant overfitting is avoided). This is justified by the logic of making specific tools for specific tasks and not reusing them for tasks for which they are ill-suited (often a criticism of parameterized methods such as molecular mechanics force fields [27]). Secondly, unlike the traditional hashing techniques used in, for example, circular fingerprints, the generation of a neural fingerprint can be reverse-engineered to locate parts of the graph that are strongly related to the expression of certain desired features. This is clearly of great use to the materials discovery community, who have long searched for techniques to implement the “inverse design problem.” Finally, initial studies on neural fingerprints suggest that they offer stronger predictive performance than other fingerprinting techniques, when exposed to the same predictive model [26].
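The substitution summarized in Figure 8.5 (hash to neural layer, index to softmax, bit-setting to summation) can be written as a short forward pass. The sketch below is a simplified illustration, not the authors' implementation: it uses one hidden weight matrix per layer regardless of atom degree, one-hot element features, and random untrained weights; all names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Sequence[float]) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(values: Sequence[float]) -> Vector:
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def neural_fingerprint(atom_features: Sequence[Sequence[float]],
                       bonds: Sequence[Tuple[int, int]],
                       hidden_weights: Sequence[Matrix],
                       output_weights: Sequence[Matrix]) -> Vector:
    """Forward pass of a simplified neural fingerprint (one layer per radius)."""
    n_atoms = len(atom_features)
    neighbours: List[List[int]] = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    fingerprint = [0.0] * len(output_weights[0])
    state = [list(f) for f in atom_features]
    for hidden, output in zip(hidden_weights, output_weights):
        new_state = []
        for atom in range(n_atoms):
            # Pool the atom's own features with those of its neighbours.
            pooled = list(state[atom])
            for nb in neighbours[atom]:
                pooled = [p + q for p, q in zip(pooled, state[nb])]
            # Neural layer in place of the hash function.
            activated = [math.tanh(x) for x in matvec(hidden, pooled)]
            new_state.append(activated)
            # Softmax in place of the index, summation in place of bit-setting.
            soft_index = softmax(matvec(output, activated))
            fingerprint = [f + s for f, s in zip(fingerprint, soft_index)]
        state = new_state
    return fingerprint


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_features, fp_length, n_layers = 2, 8, 2
hidden_w = [random_matrix(n_features, n_features, rng) for _ in range(n_layers)]
output_w = [random_matrix(fp_length, n_features, rng) for _ in range(n_layers)]

# Ethanol heavy atoms C-C-O with one-hot element features [is_C, is_O].
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
fp = neural_fingerprint(features, [(0, 1), (1, 2)], hidden_w, output_w)
print([round(x, 3) for x in fp])
```

Every operation here is differentiable, so in a trained model the weights could be fitted by back propagation against a target property, which is what makes the fingerprint task-specific.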


An alternative approach to using neural networks to generate fingerprints is to use recurrent neural networks (RNNs). These networks have had significant successes in the analysis and prediction of sequences of text, and so it can be easily seen how they could be applied to molecules, which are in effect complex sequences of atoms. Much as a convolutional network derives some of its benefits through the utilization of spatial information, recurrent neural networks offer benefits over traditional networks when there is a significant degree of sequential information present. A key advantage of this approach is that, unlike other neural network methods, RNNs are not dependent upon a fixed-size input. Since molecules are typically represented in a manner that is of variable length, this is a key advantage here. It should be noted that RNNs can still be used to great effect when the input is of fixed length [28, 29].

One of the key features of a recurrent neural network is its so-called neural memory. This allows it to apply updates in the context of the sequence seen thus far. In the chemical case, this could be thought of as seeing an atom in the context of its neighbors – a core concept of the handcrafted fingerprint. In an RNN, the hidden layer at a time step t is a combination of the input at t and the value of the previous hidden layer. Unfortunately, neural memory in “vanilla” RNNs does a poor job when the length of memory required for correct context increases, which may be problematic for the chemical problem. This can be seen in Figure 8.6, which demonstrates how the neural memory is limited by the size of the hidden layer. In this example, the network will begin to forget once the fifth member of the sequence is ingested. At this point, the network starts to learn what it is important to remember and what it is OK to forget. Another problem with RNN learning occurs during the propagation of errors back through the network [30].
Since the recurrent behavior of the network means that the weight matrix connecting the input to the hidden layer is multiplied a large number of times (the number of time steps used, in fact), if the leading eigenvalue of the weight matrix is smaller than one, then the gradients will quickly vanish to zero. Conversely, if the leading eigenvalue of the weight matrix is greater than one, the magnitude of the elements in the weight matrix will quickly explode. Both of these behaviors have a significant and detrimental effect upon learning and make it very hard to learn long-term dependencies within the data set. A variant on the vanilla RNN, the long short-term memory

1

2

3

Time

Figure 8.6 An example of how sequential information is stored in a recurrent neural network. It can be seen that the hidden layer contains information from each of the previous time steps. Since there are only four neurons in the hidden layer, when the fifth time step is reached, the neural memory will be full and it will have to decide what to forget.

8.2 Describing Molecules for Machine Learning Algorithms

(LSTM) network, includes a more sophisticated update mechanism to counter this effect and is widely used in place of vanilla RNNs [31]. In a similar spirit to Duvenaud et al., Lusci et al. have used RNNs to generate features for machine learning directly from the molecular graph [32]. Traditionally, graphical inputs to RNNs have been directed acyclic graphs, in which the direction associated with the edges implies some causality or temporal relationship between nodes and in which there are no cyclic connections. While these have been applied with some success to tasks such as protein structure prediction [33], it is not clear that these are appropriate for use on small molecules, for which there is a high probability of a cyclic connection and for which there is no obvious directionality in the linkages between nodes. Lusci et al. therefore adapted the framework to turn these undirected graphs of molecules into directed acyclic graphs for processing with RNNs. This was achieved by taking all possible acyclic orientations – a task possible due to the relatively small size and low connectivity of small molecule graphs – and reporting them as an ensemble of directed acyclic graphs. An example of this is shown in Figure 8.7. Since within each directed acyclic graph there exists a path between each atom and the graphical root and this represents a “view” of the molecule from the graphical root, this ensemble can be thought of as representing the molecule in the context of all of the constituent atoms. As previously mentioned, training RNNs can be impeded by either vanishing or exploding gradients, a problem which is exacerbated when deep architectures are used due to the large number of matrix transform operations. In order to control the magnitude of the dimensionality in this approach, Lusci contracted some rings present in molecular undirected graphs. 
This was achieved by selecting the set of smallest rings in the graph and contracting each of these rings to a single node. Through experimentation on a range of problems related to solubility measurements and calculations, Lusci et al. found that the descriptions derived through this RNN-based approach were at least as good as traditional fingerprinting techniques, and sometimes outperformed them (Figure 8.8).
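The orientation step described above can be illustrated with a short sketch. It is not the implementation of Lusci et al.; it assumes one simple scheme in which each atom in turn is taken as the root and every bond is directed toward the root according to breadth-first distance. Bonds between atoms that are equally far from the root (which occur in odd-membered rings) are oriented by atom index, an arbitrary tie-break chosen here for illustration. The function names and the toy graph are hypothetical.

```python
from collections import deque

def bfs_distances(adjacency, root):
    """Shortest-path distance (in bonds) from every atom to the root."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        atom = queue.popleft()
        for nbr in adjacency[atom]:
            if nbr not in dist:
                dist[nbr] = dist[atom] + 1
                queue.append(nbr)
    return dist

def orient_toward_root(adjacency, root):
    """Directed edges (src, dst) of a DAG in which every atom has a path to root."""
    dist = bfs_distances(adjacency, root)
    edges = set()
    for a in adjacency:
        for b in adjacency[a]:
            if a < b:  # visit each undirected bond once
                # Direct the bond from the atom farther from the root to the
                # nearer one. Assumption: ties are broken by atom index.
                if (dist[a], a) > (dist[b], b):
                    edges.add((a, b))
                else:
                    edges.add((b, a))
    return edges

def dag_ensemble(adjacency):
    """One directed acyclic graph per choice of root atom."""
    return {root: orient_toward_root(adjacency, root) for root in adjacency}

# Toy graph: a three-membered ring (atoms 0, 1, 2) with a substituent (atom 3) on atom 2.
molecule = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
ensemble = dag_ensemble(molecule)
```

Because every edge points from a larger (distance, index) key to a smaller one, no directed cycle can form, and following the breadth-first parents gives each atom a path to the root.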

Figure 8.7 An example of the generation of an ensemble of directed acyclic graphs from an input of an undirected graph. Although this is a potentially expensive task, the size and connectivity of typical small molecules make this approach computationally tractable. Source: Adapted from Ref. [32]. Panel labels: undirected graph; directed acyclic graphs.

8 The Marriage of Machine Learning and Chemistry to Accelerate Materials Discovery

[Figure 8.8 diagram labels: directed acyclic graphs D1, D2, ..., Dk; summation Σ; MO; solubility.]

Figure 8.8 The deep fingerprints based upon the recurrent neural network method by Lusci et al. The ensemble of directed acyclic graphs is summed together to produce an input for the recurrent neural network architecture, which is then trained using back propagation to predict some useful target, in this case solubility. Source: Lusci et al. 2013 [32]. Reproduced with permission of American Chemical Society.

The techniques described here have focused upon the graphical structure of a molecule, through either handcrafted fragmentation or direct interpretation; however, there has also been significant work on the use of other descriptors such as the electronic structure [34, 35] or other molecular descriptors relating to chemical functionality [36–38]. Since the fundamentals discussed here apply broadly, the reader is referred to relevant reviews on the subject if further information is desired [39, 40].

8.3 Building Fast and Accurate Models with Machine Learning

As the need for new materials becomes increasingly great, the (often imposed) timeline for discovery grows ever shorter. Many researchers are turning to the


paradigm of high-throughput virtual screening (HTVS) [41] to reduce their time to discovery and increase their efficiency [42–44]. A popular conceit in HTVS is the computational funnel. In this paradigm, large numbers of molecules are screened with a cheap method, with unsuitable molecules being removed from the library. As the number of molecules decreases, more expensive methods are used to discriminate between molecules and filter out more candidates. Eventually a small enough number of molecules remains that they can be screened experimentally, with these results being fed back into the library generation process for a new set of molecules to be screened. Clearly the efficiency of a filter in such a computational funnel depends on two main criteria: its speed (which limits how many molecules constitute the initial set) and its accuracy (which determines how many candidates can be winnowed away).

Machine learning has the potential to play an important role in this area, as it can deliver an accurate model that, once trained, can be executed orders of magnitude faster than current methods. A particularly attractive method for deploying machine learning in this setting is the deep neural network. While the basics of neural networks have been well understood for some time, they slipped in and out of favor with the materials design community, due to concerns over overfitting (i.e. making a model that memorizes the training data rather than learning its patterns) and the slow convergence of large networks [45]. Recent developments in both training methods and computational hardware, however, have addressed these concerns and allowed neural networks to be deployed with great success.

The basic neural network algorithm is very simple. In general, a network is made up of input nodes, hidden nodes, and output nodes.
Let X be the inputs to a network, a matrix of size v1 by v2, where v1 is the number of data points in the training set and v2 is the number of descriptors describing each data point. The inputs of the network are connected to each of the hidden neurons (this is known as a fully connected system) through a weight matrix w1 that has dimensions v2 by h1, where h1 is the number of hidden neurons in the first hidden layer. For simplicity we will assume that there is only one hidden layer. The neurons in the hidden layer are connected to the neurons in the output layer through a weight matrix w2, which has dimensions h1 by o1, where o1 is the number of neurons in the output layer.

Having established how the network is connected, the next task is to propagate a signal through this network. Initial values for w1 and w2 are randomly generated, and the signal arriving at each hidden neuron is formed as the dot product of the input vector with the corresponding column of w1. The activation of each hidden neuron is then calculated using a sigmoidal function that squashes this signal into a bounded range. Typical choices are the tanh function, which maps onto (−1, 1), or the logistic function $1/(1+\exp(-x))$, which maps onto (0, 1). The outputs of the hidden layer are passed through to the output layer, where they are combined into a value: the prediction of the target. This prediction is compared to the true value, and an error on the prediction is calculated. The most common form of training sees this error propagated backward through the nodes in a process called back propagation. This allows the network to update the weight matrices to minimize the error between the target and the prediction.
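The forward pass and the back-propagation update described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: it uses a logistic hidden layer, a linear output layer, a squared-error loss, and a single training example, and the names `forward`, `train_step`, and the learning rate `lr` are chosen here for illustration only.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w1, w2):
    """x: list of v2 descriptors; w1: v2 x h1 weights; w2: h1 x o1 weights."""
    hidden = [logistic(sum(x[i] * w1[i][j] for i in range(len(x))))
              for j in range(len(w1[0]))]
    output = [sum(hidden[j] * w2[j][k] for j in range(len(hidden)))
              for k in range(len(w2[0]))]
    return hidden, output

def train_step(x, target, w1, w2, lr=0.3):
    """One back-propagation update; returns the squared-error loss before the update."""
    hidden, output = forward(x, w1, w2)
    err = [output[k] - target[k] for k in range(len(output))]
    # Error signal at each hidden neuron, computed with w2 before it is updated.
    delta_h = [hidden[j] * (1.0 - hidden[j]) *
               sum(err[k] * w2[j][k] for k in range(len(err)))
               for j in range(len(hidden))]
    for j in range(len(hidden)):
        for k in range(len(err)):
            w2[j][k] -= lr * err[k] * hidden[j]
    for i in range(len(x)):
        for j in range(len(hidden)):
            w1[i][j] -= lr * delta_h[j] * x[i]
    return 0.5 * sum(e * e for e in err)

random.seed(0)
v2, h1, o1 = 2, 3, 1                      # descriptors, hidden neurons, outputs
w1 = [[random.uniform(-1, 1) for _ in range(h1)] for _ in range(v2)]
w2 = [[random.uniform(-1, 1) for _ in range(o1)] for _ in range(h1)]
x, target = [0.5, -1.0], [0.8]
losses = [train_step(x, target, w1, w2) for _ in range(200)]
```

Fitting a single example is of course trivial, but it makes the mechanics visible: the loss recorded in `losses` shrinks as the error is propagated back through w2 and then w1.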


Overfitting is a key issue when dealing with neural networks. The sheer flexibility of the model (neural networks are in fact universal approximators [46]), coupled with the fact that the number of degrees of freedom in a fully connected network grows rapidly with the number of neurons, means that steps have to be taken during the training of such a model to ensure that overfitting does not occur. A simple, yet often unreasonably effective, technique is known as early stopping [47]. In this paradigm, a small set is stochastically selected from the training set and held back during each pass of the data through the network. The parameters determined by that pass are then validated against this small set, and if the error on this set increases (often a thresholding criterion is applied here), then the training is stopped.

Another, more sophisticated, technique for avoiding overfitting is known as dropout [48]. This technique avoids the adoption of complex codependencies between neurons during training by randomly turning off (dropping out) neurons and their connections for one pass through the network so that their signal does not contribute to the model. In this way, neurons cannot guarantee connections to any other neurons, which limits their capability to form interdependencies. An alternative way to think of dropout is that it provides a method for training a large number of models simultaneously and for combining their predictions. This is described in the original paper as follows:

In networks with a single hidden layer of N units and a “softmax” output layer for computing the probabilities of the class labels, using the mean network is exactly equivalent to taking the geometric mean of the probability distributions over labels predicted by all $2^N$ possible networks. [48]

When propagating a signal through a dropped out network (i.e. during training, since prediction is always performed using the full network), it is common to scale the signal according to the number of neurons that are active. This means that the strength of the signal remains constant, regardless of the number of neurons that have been dropped out. Dropout is a very popular method for regularizing neural networks, which can in part be put down to the fact that it is very simple to implement and requires little adaptation of existing code. A minimal implementation of this operation is shown in Algorithm 1.

Algorithm 1 A minimal implementation of dropout for a layer of a neural network. It is common to scale the output to the amount of expected dropout to ensure that the signal strength is independent of dropout, but this is not shown here for clarity.

    import random

    def dropout_layer(X, probability, activation):
        # X holds the incoming signals of one layer; activation is any callable.
        # Each neuron is dropped independently with the given probability,
        # in which case it contributes 0 for this pass.
        return [activation(x) if random.random() > probability else 0.0
                for x in X]

In order to accelerate the training of the large data sets that can be found when using neural networks for materials discovery, it is important to go beyond the


paradigm of full-batch training, where all of the data must be processed through the network before the weights are updated. One popular method for achieving this is known as stochastic gradient descent with mini-batching. Here, smaller sets of data are chosen and propagated through the network to perform a partial, noisy update to the weight matrices, with the number of such partial updates per epoch being defined as

$$\frac{|Y|}{N_{\mathrm{batch}}} \tag{8.4}$$

i.e. the total number of data points divided by the size of the batch. The argument for using mini-batching is that for large data sets a reasonable amount of data redundancy is expected, and thus training on, say, half the data should return approximately the same weight matrix as training on the full set. Thus, for a reduced cost you can make an educated guess about what values are reasonable for the weight matrix.

Dahl et al. used an additional, subtle form of regularization in the form of a multi-target fit [49]. In this approach, the network was trained to simultaneously learn to regress the same input to multiple properties. This builds an implicit regularization into the training, since it is very hard to overfit to multiple targets simultaneously. In their study, Dahl et al. argue that aggressive feature selection, which is often touted as a necessary step in reducing model complexity (and hence the danger of overfitting), is not necessary when proper regularization techniques are used. They demonstrate this point by showing a strong downward trend in the performance of their regularized neural network against a validation set as the number of features is reduced from 3764 down to the 2500, 2000, 1500, 1000, 500, or 100 most informative input features, as shown in Figure 8.9.

Tuning the learning rates is key to the efficient convergence of neural networks. Since the learning rate is analogous to a step size in a traditional geometric optimization, it is clear that a constant value for the learning rate, which is commonly implemented, is not an optimal solution. An intuitive solution is to vary the learning rate as the training progresses. This can be achieved using a “cooling regime” where the rate is scheduled to reduce over time, although a more robust method is to use information on the gradients of each weight to adapt the learning rate.
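Mini-batching and a cooling regime of the kind described above can be sketched on a toy problem. The example below is a sketch under stated assumptions rather than a recipe from the cited studies: it fits a straight line instead of a neural network, and the schedule `lr0 / (1 + decay * epoch)`, the batch size, and all function names are illustrative choices.

```python
import random

def minibatches(data, batch_size, rng):
    """Shuffle once per epoch, then yield consecutive batches."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

def sgd_fit(data, batch_size=10, epochs=200, lr0=0.3, decay=0.01, seed=0):
    """Fit y = w*x + b by mini-batch SGD with a simple cooling schedule."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    updates_per_epoch = 0
    for epoch in range(epochs):
        lr = lr0 / (1.0 + decay * epoch)   # "cooling": step size shrinks over time
        updates_per_epoch = 0
        for batch in minibatches(data, batch_size, rng):
            # Gradient of the mean squared error over this batch only,
            # i.e. a noisy estimate of the full-data gradient.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
            updates_per_epoch += 1
    return w, b, updates_per_epoch

# Synthetic, noise-free data: y = 3x - 1 for 100 points in [0, 1).
points = [(i / 100.0, 3.0 * (i / 100.0) - 1.0) for i in range(100)]
w, b, n_updates = sgd_fit(points)
```

With 100 data points and a batch size of 10, each epoch consists of 100/10 = 10 partial updates, as given by Eq. (8.4).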
Gradient-based adaptation of the learning rate means that the algorithm will take larger steps in directions in which it is moving fast (large gradient) and smaller steps in directions in which the weights are moving slowly. This results in a more efficient descent route through the multidimensional training surface and hence a faster convergence for the network.

Pyzer-Knapp et al. used neural networks to regress HOMO, LUMO, and power conversion efficiency values from a circular fingerprint to accelerate the discovery of organic photovoltaic molecules [50]. In this study, the network was trained using 200 000 1024-bit Morgan circular fingerprints [10] to learn all properties simultaneously (i.e. a multi-target pseudo-regularization such as in [49]). Overfitting was dealt with by means of an early-stopping strategy [47], in which the average error over all properties was tracked, and weight convergence was accelerated using RMSProp [51]. Due to the size of the training set, the Hogwild! algorithm [52] was used to parallelize the stochastic gradient descent. Hogwild! is an asynchronous stochastic gradient descent solver, in which a central weight

[Figure 8.9 plot: test set AUC (vertical axis, 0.65 to 1.00) for assays 488917 and 1851_1a2, with one series each for the 100, 500, 1000, 1500, 2000, and 2500 most informative features and for all features.]

Figure 8.9 The accuracy (area under the receiver operating characteristic curve) with respect to the number of features used for two key assays studied in the work of Dahl et al. ROC is a common measure of the performance of a binary classifier and is strongly negatively impacted when the number of features is reduced.

parameter store is updated in a lock-free manner during training. Since weight updates are commutative, the order in which these updates are processed does not matter. Additionally, the noise introduced by occasional race-condition overwrites empirically helps the convergence of the algorithm, potentially by acting as a smoother, although this is unproven [50]. In this study, all properties were found under validation to be predicted at state-of-the-art levels, with HOMO and LUMO values being predicted to the order of 1e-4 au and power conversion efficiencies to 0.27%. For the 50 000 molecule validation library, which did not contain molecules on which the network had been trained, even a conservative filter of […]

[…] 40×) over a random sampling approach for the organic photovoltaic problem, reducing the search time from c. four years to a month of HTVS effort. The reasoning for this can be demonstrated through the use of information landscapes, a technique developed in the same paper. In these plots, the feature space is compressed to two dimensions using a deep autoencoder (as described in Section 8.2). This 2D landscape is then divided into segments that are colored to represent the average value of the targets contained within them. The distribution of data points within this space is conveyed to the viewer using a kernel density estimate (contour lines), with the locations of the top 100 data points being explicitly marked. An example of information landscapes for the organic photovoltaic data set and the tumor suppressor data set is shown in Figure 8.13.

Figure 8.12 (LHS, left hand side) shows the photovoltaic search, in which the authors observed that initially a greedy search was the most successful strategy. This can be rationalized by the fact that there are a large number of promising molecules enclosed in a small area of feature space, a situation that is optimal for purely exploitative strategies. When this supply of strong candidates is exhausted, however, the performance of the greedy strategy degrades relative to the Thompson strategy, because the search has gathered too little information to correctly identify other promising areas to which it should be directed. Another situation is shown in Figure 8.13 (RHS, right hand side), in which there are clearly two areas of promising candidates. This is a situation in which exploratory search strategies are optimal, as demonstrated by the improved performance seen by the authors in this study.
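The difference between the greedy and Thompson strategies discussed above can be made concrete with a toy selection step. This is a sketch under simplifying assumptions and not the method of the cited work: each candidate's predicted property is given an independent Gaussian posterior with a hypothetical mean and standard deviation, whereas in practice these would come from a probabilistic model such as a Bayesian neural network. The names `greedy_pick` and `thompson_pick` are illustrative.

```python
import random

def greedy_pick(mean, std, rng):
    """Pure exploitation: choose the candidate with the best predicted mean."""
    return max(range(len(mean)), key=lambda i: mean[i])

def thompson_pick(mean, std, rng):
    """Thompson sampling: draw one plausible value per candidate from its
    posterior and choose the best draw, so uncertain candidates get explored."""
    draws = [rng.gauss(mean[i], std[i]) for i in range(len(mean))]
    return max(range(len(draws)), key=lambda i: draws[i])

rng = random.Random(0)
# Hypothetical posterior over four candidates: candidate 0 looks best on average,
# candidate 3 is slightly worse on average but far more uncertain.
mean = [1.0, 0.2, 0.1, 0.8]
std = [0.05, 0.05, 0.05, 1.0]

greedy_choices = {greedy_pick(mean, std, rng) for _ in range(1000)}
thompson_choices = [thompson_pick(mean, std, rng) for _ in range(1000)]
share_uncertain = thompson_choices.count(3) / 1000.0
```

The greedy rule always returns candidate 0, while Thompson sampling spends a substantial share of its selections on the uncertain candidate 3. That extra exploration is what allows a search to find a second promising region once the first has been exhausted.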

[Figure 8.13 panels: (a) and (b), each a two-dimensional landscape with both axes running from 0.2 to 0.8; color scale from −2.4 to 4.0 in (a) and from 88.5 to 99.0 in (b).]

Figure 8.13 Information landscapes for the photovoltaic data set (a, a maximization problem) and the tumor suppressor data set (b, minimization problem). Source: Adapted from Pyzer-Knapp et al. [24].

8.5 Conclusion

In this work, we have taken a journey through the process of machine learning, through representing the data, applying the representation to build faster and more accurate models, and using sampling techniques to intelligently search the data to build those models more efficiently and make intelligent decisions to swiftly locate desirable new materials. We have seen how many different techniques, including cutting edge methods from the exploding field of deep learning, can be harnessed and applied to the problem of swiftly and intelligently moving through complex chemical spaces. At the current point in time, and perhaps at no other time in history, we are presented with a perfect storm of computational capability, accessible data, and sophisticated learning techniques, the marriage of which can allow us to bring chemistry well and truly into the cognitive era.

References

1 Reymond, J.-L. (2015). Acc. Chem. Res. 48: 722–730.
2 Kirkpatrick, P. and Ellis, C. (2004). Nature 432: 823–823.
3 Maldonado, A.G., Doucet, J.P., Petitjean, M., and Fan, B.-T. (2006). Mol. Divers 10: 39–79.
4 Nikolova, N. and Jaworska, J. (2003). QSAR Comb. Sci. 22: 1006–1026.
5 Sheridan, R.P. and Kearsley, S.K. (2002). Drug Discov. Today 7: 903–911.
6 Nasr, R., Hirschberg, D.S., and Baldi, P. (2010). J. Chem. Inf. Model. 50: 1358–1368.
7 Wang, Z., Liang, L., Yin, Z., and Lin, J. (2016). J. Cheminformatics 8: 1–10.
8 Xue, L. and Bajorath, J. (2000). Comb. Chem. High Throughput Screen 3: 363–372.
9 Sutherland, J.J., Higgs, R.E., Watson, I., and Vieth, M. (2008). J. Med. Chem. 51: 2689–2700.
10 Rogers, D. and Hahn, M. (2010). J. Chem. Inf. Model. 50: 742–754.
11 Durant, J.L., Leland, B.A., Henry, D.R., and Nourse, J.G. (2002). J. Chem. Inf. Comput. Sci. 42: 1273–1280.
12 Gutman, I. and Estrada, E. (1996). J. Chem. Inf. Comput. Sci. 36: 541–543.
13 Vilar, S., Uriarte, E., Santana, L. et al. (2014). Nat. Protoc. 9: 2147–2163.
14 Morgan, H.L. (1965). J. Chem. Doc. 5: 107–113.
15 Montavon, G., Hansen, K., Fazli, S. et al. (2012). Advances in Neural Information Processing Systems, vol. 25 (eds. F. Pereira, C.J.C. Burges, L. Bottou and K.Q. Weinberger), 440–448. Curran Associates, Inc.
16 Faber, F., Lindmaa, A., von Lilienfeld, O.A., and Armiento, R. (2015). Int. J. Quantum Chem. 115: 1094–1101.
17 Hansen, K., Montavon, G., Biegler, F. et al. (2013). J. Chem. Theory Comput. 9: 3404–3419.
18 Montavon, G., Rupp, M., Gobre, V. et al. (2013). New J. Phys. 15: 095003.
19 Lopez-Bezanilla, A. and von Lilienfeld, O.A. (2014). Phys. Rev. B 89: 235411.
20 Häse, F., Valleau, S., Pyzer-Knapp, E., and Aspuru-Guzik, A. (2016). Chem. Sci. 7: 5139–5147.
21 Ramakrishnan, R., Dral, P.O., Rupp, M., and von Lilienfeld, O.A. (2015). J. Chem. Theory Comput. 11: 2087–2096.
22 Jasrasaria, D., Pyzer-Knapp, E.O., Rappoport, D., and Aspuru-Guzik, A. (2016). Phys. Stat., arXiv preprint arXiv:1608.05747.
23 Hinton, G.E. and Salakhutdinov, R.R. (2006). Science 313: 504–507.
24 Hernández-Lobato, J.M., Requeima, J., Pyzer-Knapp, E.O., and Aspuru-Guzik, A. (2017). Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. In: Proceedings of the 34th International Conference on Machine Learning, PMLR 70: 1470–1479.
25 Hughes, G. (1968). IEEE Trans. Inf. Theory 14: 55–63.
26 Zaborowski, B., Jagieła, D., Czaplewski, C. et al. (2015). A maximum-likelihood approach to force-field calibration. J. Chem. Inf. Model. 55 (9): 2050–2207.
27 Zaborowski, B., Jagieła, D., Czaplewski, C. et al. (2015). J. Chem. Inf. Model. 55: 2050–2070.
28 Gregor, K., Danihelka, I., Graves, A. et al. (2015). DRAW: a recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.
29 Ba, J., Mnih, V., and Kavukcuoglu, K. (2014). arXiv preprint arXiv:1412.7755.
30 Pineda, F.J. (1987). Phys. Rev. Lett. 59: 2229–2232.
31 Zaremba, W., Sutskever, I., and Vinyals, O. (2014). Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
32 Lusci, A., Pollastri, G., and Baldi, P. (2013). J. Chem. Inf. Model. 53: 1563–1575.
33 Baldi, P., Brunak, S., Frasconi, P. et al. (1999). Bioinformatics 15: 937–946.
34 Isayev, O., Fourches, D., Muratov, E.N. et al. (2015). Chem. Mater. 27: 735–743.
35 Bultinck, P., Gironés, X., and Carbó-Dorca, R. (2005). Reviews in Computational Chemistry (eds. K.B. Lipkowitz, R. Larter and T.R. Cundari), 127–207. Wiley.
36 Carhart, R.E., Smith, D.H., and Venkataraghavan, R. (1985). J. Chem. Inf. Comput. Sci. 25: 64–73.
37 Nilakantan, R., Bauman, N., Dixon, J.S., and Venkataraghavan, R. (1987). J. Chem. Inf. Comput. Sci. 27: 82–85.
38 Labute, P. (2000). J. Mol. Graph. Model. 18: 464–477.
39 Todeschini, R. and Consonni, V. (2009). Molecular Descriptors for Chemoinformatics, vol. 41 (2 Volume Set). Wiley.
40 Todeschini, R. and Consonni, V. (2000). Handbook of Molecular Descriptors. Wiley Online Library.
41 Pyzer-Knapp, E.O., Suh, C., Gómez-Bombarelli, R. et al. (2015). Annu. Rev. Mater. Res. 45: 195–216.
42 Wilmer, C.E., Leaf, M., Lee, C.Y. et al. (2012). Nat. Chem. 4: 83–89.
43 Curtarolo, S., Hart, G.L.W., Nardelli, M.B. et al. (2013). Nat. Mater. 12: 191–201.
44 Hachmann, J., Olivares-Amaya, R., Jinich, A. et al. (2014). Energy Env. Sci. 7: 698.
45 Zupan, J. and Gasteiger, J. (1991). Anal. Chim. Acta 248: 1–30.
46 Hornik, K., Stinchcombe, M., and White, H. (1989). Neural Netw. 2: 359–366.
47 Prechelt, L. (1998). Neural Netw. 11: 761–767.
48 Srivastava, N., Hinton, G., Krizhevsky, A. et al. (2014). J. Mach. Learn. Res. 15: 1929–1958.
49 Dahl, G.E., Jaitly, N., and Salakhutdinov, R. (2014). Multi-task neural networks for QSAR predictions. arXiv preprint arXiv:1406.1231.
50 Pyzer-Knapp, E.O., Li, K., and Aspuru-Guzik, A. (2015). Adv. Funct. Mater. n/a–n/a.
51 Tieleman, T. and Hinton, G. (2012). “Lecture 6.5 – RMSProp,” COURSERA: Neural Networks for Machine Learning.
52 Recht, B., Re, C., Wright, S., and Niu, F. (2011). Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems, 693–701.
53 Li, H., Liang, Y., and Xu, Q. (2009). Chemom. Intell. Lab. Syst. 95: 188–198.
54 Lu, W.-C., Ji, X.-B., Li, M.-J. et al. (2013). Adv. Manuf. 1: 151–159.
55 Schwaighofer, A., Schroeter, T., Mika, S. et al. (2007). J. Chem. Inf. Model. 47: 407–424.
56 Mauri, A., Consonni, V., Pavan, M., and Todeschini, R. (2006). MATCH Commun. Math. Comput. Chem. 56: 237–248.
57 Huuskonen, J. (2001). Comb. Chem. High Throughput Screen. 4: 311–316.
58 Ramakrishnan, R., Dral, P.O., Rupp, M., and von Lilienfeld, O.A. (2014). Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 1: 140022.
59 Pyzer-Knapp, E.O., Simm, G.N., and Guzik, A.A. (2016). Mater. Horiz. 3: 226–233.
60 Balachandran, P.V., Xue, D., Theiler, J. et al. (2016). Sci. Rep. 6: 19660.
61 Jones, D.R., Schonlau, M., and Welch, W.J. (1998). J. Glob. Optim. 13: 455–492.
62 Reker, D. and Schneider, G. (2015). Drug Discov. Today 20: 458–465.
63 Hernández-Lobato, J.M. and Adams, R.P. (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. In: International Conference on Machine Learning, 1861–1869.
64 Ito, K. and Xiong, K. (2000). IEEE Trans. Autom. Control 45: 910–927.
65 Minka, T.P. (2001). UAI’01: Proc. of the 17th Conf., Uncertainty in Artificial Intelligence, vol. 17, 362–369.
66 Spangenberg, T., Burrows, J., Kowalczyk, P. et al. (2013). PLoS One 8: e62906.
67 Thompson, W.R. (1933). Biometrika 25: 285–294.


9 Machine Learning Interatomic Potentials for Global Optimization and Molecular Dynamics Simulation

Ivan A. Kruglov (1,2), Pavel E. Dolgirev (1,3), Artem R. Oganov (3,2,1), Arslan B. Mazitov (1,2), Sergey N. Pozdnyakov (1,3), Efim A. Mazhnik (1,3), and Alexey V. Yanilkin (1,2)

1 Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region 141700, Russian Federation
2 Dukhov Research Institute of Automatics (VNIIA), Moscow 127055, Russian Federation
3 Skolkovo Institute of Science and Technology, Skolkovo Innovation Center, Moscow 143026, Russian Federation

Materials Informatics: Methods, Tools and Applications, First Edition. Edited by Olexandr Isayev, Alexander Tropsha, and Stefano Curtarolo. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

9.1 Introduction

The problem of fast and accurate reconstruction of the potential energy surface (PES) of a crystal is of central importance in computational chemistry, since the PES contains essential information about the system and all transition states. Indeed, given the PES, one can find the stable and metastable structures and the paths of transitions between them; its first derivatives provide forces, and its second derivatives at local minima contain information about lattice vibrations and mechanical properties. One of the most accurate ways to build the PES is to use first-principles calculations, e.g. based on density functional theory (DFT). The main disadvantage of this approach is its computational cost: usually, DFT calculations can be performed only for systems of no more than a few hundred atoms.

Finding the global minimum is a daunting task, which until recently was considered unsolvable due to the astronomically large number of local minima. One of the possible solutions (and the most powerful) was provided in the USPEX code [1–3]. It is an evolutionary algorithm, which in the first generation builds random structures, then estimates their energy with any appropriate method (like DFT) and chooses the most energetically stable individuals. After that the same operations are carried out with the best structures from the previous generation, new random structures, and structures made by variation operators (mutation, heredity, and so on). For more details, see [4]. Global optimization using DFT is reliable but expensive; using force fields instead of DFT leads to faster calculations, but with a loss of accuracy and reliability.

While USPEX solves the task of finding the most stable structures at given conditions, molecular dynamics (MD) allows simulating physical processes within a given crystalline modification or molecular conformation, giving information about the dynamical evolution of the system. Trajectories of atoms are determined by Newton’s (or similar) equations, where the forces acting on atoms and the potential energy are calculated using interatomic potentials (also called force fields). The weak point, therefore, is that the result of an MD run depends on the choice of the potential. Force fields are not available for all elements and materials, and sometimes there are several potentials for one material that work differently under different conditions (temperature, pressure). An empirical potential depends strongly on the structures on which it has been trained.

Following a simple combinatorial argument [1], the number of possible distinct structures can be evaluated as

$$C = \frac{1}{(V/\delta^3)} \, \frac{(V/\delta^3)!}{[(V/\delta^3) - N]!\,N!} \tag{9.1}$$

where N is the total number of atoms in the unit cell of volume V, δ is a relevant discretization parameter (for instance, 1 Å), and n_i is the number of atoms of the ith type in the unit cell. For small systems (N ∼ 10–20), C is astronomically large (roughly $10^N$ if one uses δ = 1 Å and a typical atomic volume of 10 Å³). It is useful to consider the dimensionality of the energy landscape:

$$d = 3N + 3 \tag{9.2}$$

where 3N − 3 degrees of freedom are the atomic positions and the remaining six dimensions are lattice parameters. For example, a system with 20 atoms/cell poses a 63-dimensional problem! We can rewrite Eq. (9.1) as C ∼ exp(αd), where α is some system-specific constant. With such high-dimensional problems, exhaustive searches are clearly not feasible. The problem can be greatly simplified if global optimization is combined with local optimization (structure relaxation), which implies certain correlations between atomic positions (interatomic distances adjust to reasonable values, and unfavorable interactions are avoided). The intrinsic dimensionality of this reduced energy landscape, which consists only of local minima (Figure 9.1), is reduced to

$$d^* = 3N + 3 - \kappa \tag{9.3}$$

where κ is the (non-integer) number of correlated dimensions. d* depends both on the system size and on its chemical properties. We found [6] d* = 10.9 (d = 39) for Au8Pd4, d* = 11.6 (d = 99) for Mg16O16, and d* = 32.5 (d = 39) for Mg4N4H4. The number of local minima is then

$$C^* \sim \exp(\beta d^*) \tag{9.4}$$

with β < α, d* < d, and C* ≪ C, which means that including structure relaxation simplifies the problem greatly. The scaling of the problem with the number of degrees of freedom (or the number of particles) is still exponential, which means that the crystal structure prediction problem is NP-hard and will be intractable for sufficiently large systems (with existing methods the limit is ∼300–500 degrees of freedom). If one had a unique and compact representation of a crystal structure, it would serve many purposes: such representations are desperately needed for machine learning, and within the context of evolutionary crystal structure prediction they

[Figure 9.1 panels: (a) free energy vs. order parameter (s); (b); (c) energy difference (eV), 1.0 to 4.0, vs. distance, 0.1 to 0.55.]

Figure 9.1 Energy landscape. (a) 1D scheme showing the full landscape (solid line) and reduced landscape (dashed line joining local minima). (b) 2D projection of the reduced landscape of Au8 Pd4 , showing clustering of low-energy structures in one region. (c) Energy–distance correlation (here, shown for GaAs with 8 atoms/cell). Each point is a locally optimized (i.e. relaxed) structure. The correlation proves that the energy landscape has a simple one-funneled topology. Source: Oganov and Valle 2009 [5]. Reprinted with permission from AIP Publishing.

allow one to detect (and remove from the population) duplicate structures and to exploit correlations between energies, properties, and structure similarity/dissimilarity. The traditional representation of crystal structures by a set of lattice vectors and atomic coordinates is, unfortunately, not unique: the same crystal structure can be described by an infinite number of such sets related to each other by linear transformations of lattice vectors and shifts of the origin. We have started with a fingerprint function [5], related to the pair correlation function and diffraction spectra; for each pair of atomic types A and B it is defined as

$$F_{AB}(R) = \sum_{A_i,\,\mathrm{cell}} \sum_{B_j} \frac{\delta(R - R_{ij})}{4\pi R_{ij}^2 \, \frac{N_A N_B}{V} \, \Delta} - 1 = g_{AB}(R) - 1 \tag{9.5}$$

where the double sum runs over all ith atoms of type A within the unit cell and all jth atoms of type B within the distance R_max. In Eq. (9.5), N_A and N_B are the numbers of atoms A and B in the unit cell, V is the unit cell volume, R_ij is the distance between atoms i and j, and Δ is a discretization parameter. g_AB(R) is the pair correlation function; subtracting 1 from it makes it short-ranged and oscillating around zero in the long-distance limit: F_AB(0) = −1 and F_AB(∞) = 0. Another interesting property of the fingerprint function is that it is exactly zero


for the ideal gas, and consequently, all deviations from zero are a consequence of order. This fingerprint function is invariant to all linear transformations, and is numerically very robust. Despite all these useful properties, this function does not quite uniquely define the structure: two different structures can, in principle, have identical fingerprints. While searching for perfect representations of crystal structures, we use this fingerprint as a simple and robust pragmatic solution. We discretize the fingerprint function, representing it as a vector, each kth component of which is obtained as

$$F(k) = \frac{1}{D} \int_{kD}^{(k+1)D} F(R)\,\mathrm{d}R \tag{9.6}$$
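Equation (9.6) amounts to averaging the fingerprint function over consecutive bins of width D. A minimal sketch of that bin-averaging is given below; it assumes a simple midpoint rule for the integral, and it uses a hypothetical step function in place of a real fingerprint computed from Eq. (9.5). The names `discretize` and `toy_fingerprint` are illustrative only.

```python
def discretize(fingerprint, bin_width, r_max, samples_per_bin=50):
    """Average fingerprint(R) over bins [k*D, (k+1)*D) with the midpoint rule."""
    n_bins = int(round(r_max / bin_width))
    h = bin_width / samples_per_bin          # width of one integration slice
    vector = []
    for k in range(n_bins):
        total = sum(fingerprint(k * bin_width + (s + 0.5) * h)
                    for s in range(samples_per_bin))
        vector.append(total * h / bin_width)  # (1/D) * integral over the bin
    return vector

# Illustrative stand-in for F(R): -1 inside a hypothetical exclusion radius
# of 1.0 (no neighbors at short distances) and 0 beyond it.
def toy_fingerprint(r):
    return -1.0 if r < 1.0 else 0.0

components = discretize(toy_fingerprint, bin_width=0.5, r_max=2.0)
```

Each entry of `components` is one F(k); vectors of this kind are what a distance measure between two structures would then compare.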

Then, the similarity between structures i and j can be defined as the distance between their fingerprint vectors, e.g. the Cartesian distance, the Minkowski norm, or the cosine distance, the latter being defined as

$$D_{ij} = 0.5\left(1 - \frac{\mathbf{F}_i \cdot \mathbf{F}_j}{\|\mathbf{F}_i\|\,\|\mathbf{F}_j\|}\right) \tag{9.7}$$

Fingerprint analysis allows one to visualize energy landscapes (e.g. Figure 9.1b was obtained in this way) and make sure that in real chemical systems energy landscapes indeed have a small number of energy funnels, where low-energy structures are clustered relatively close to each other. It is this overall organization of energy landscapes that makes global optimization possible. Such overall structure is also expected for landscapes of many physical properties. Cosine distances have a useful mathematical property: their values can only be in the range [0; 1], and this allows a very convenient entropy-like measure of the diversity of a set of structures. This measure is called quasi-entropy [5]:

$$S_{\mathrm{coll}} = -\left\langle (1 - D_{ij}) \ln(1 - D_{ij}) \right\rangle \tag{9.8}$$

where $D_{ij}$ are abstract cosine distances between all pairs of structures. In the same style as above, one can define [5] fingerprints for each atomic site and the quasientropy of a given crystal structure, which is then a measure of disparity of the fingerprints (e.g. local atomic environments) of different atomic sites within the same structure. It was shown [5] how this definition of structural quasi-entropy can be used to analyze and justify Pauling’s fifth rule (the rule of parsimony, saying that the number of essential structural elements in a stable crystal tends to be small); for SiO2 modifications this is illustrated in Figure 9.2. One can also define, for each structure, a degree of order Π:

$$\Pi^2 = \frac{1}{(V/N)^{1/3}} \int_0^{R_{max}} F^2(R)\,dR = \frac{\Delta\,|F|^2}{(V/N)^{1/3}} \qquad (9.9)$$

which measures deviations of the fingerprint function from zero (it is strictly zero for the ideal gas); the cubic root of the atomic volume is introduced to make it dimensionless and scale invariant. Degree of order often correlates with the energy, and this correlation is exploited in USPEX [3].

Figure 9.2 Energy vs. structural quasi-entropy correlation. (a) MgO (32 atoms/cell) and (b) MgNH (12 atoms/cell); both panels plot energy (eV) against quasi-entropy $S_{str}$. Note a clear energetic preference for simple structures (those with lowest quasi-entropy), just as prescribed by Pauling’s fifth rule. Source: Oganov and Valle 2009 [5]. Reprinted with permission from AIP Publishing.

Machine learning interatomic potentials trained on DFT data can solve the problem of making the calculation of the energy both accurate and fast [7, 8]. The construction of any machine learning algorithm consists of three main steps: feature vector selection (how to describe a structure), algorithm selection, and evaluation and testing of the algorithm. This method assumes that the energy of the system can be approximated as a sum of energies of the atomic environments of the individual atoms. In practical calculations, these environments are defined within a cut-off radius $r_{cut}$, which is typically about 10 Å. This approximation holds for most systems with short-range interactions. The energy of an individual environment is represented by a function of the positions of the neighboring atoms, typically with hundreds of parameters or more. This function must satisfy several constraints; the most important is its invariance with respect to permutation (of the chemically equivalent units), rotation, and translation.

Over the last two decades, many approaches based on different feature vectors and different algorithms have been developed. Among them are Behler–Parrinello neural networks (NNs) [9, 10], Gaussian Approximation Potential (GAP) [11], Spectral Neighbor Analysis Potential (SNAP) [12], MTP [13], and many others. An analysis of some crystal structure descriptors used in these methods can be found in [14]. All these methods were designed to work with structures close to local minima, and moreover they can hardly be interpreted from the physical point of view.

It follows from the above that there are two strategies for building a machine learning (ML) interatomic potential: it can be trained on structures from a wide region of the PES (for global optimization) or on structures from an MD run, which are very similar to each other. In this chapter we will show methods that work for both problems. The procedure of feature vector and training set selection will be covered in detail. Applications of the developed approaches will be shown.
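Before moving on, the fingerprint-based measures introduced earlier in this section (Eqs. (9.7)–(9.9)) can be illustrated with a short sketch. It assumes the fingerprints are already available as discretized NumPy vectors; the function names are hypothetical, and averaging over pairs i < j in the quasi-entropy is an assumption about how the angle brackets in Eq. (9.8) are evaluated.

```python
import numpy as np

def cosine_distance(Fi, Fj):
    """Eq. (9.7): D_ij = 0.5 * (1 - Fi.Fj / (|Fi| |Fj|))."""
    return 0.5 * (1.0 - np.dot(Fi, Fj) / (np.linalg.norm(Fi) * np.linalg.norm(Fj)))

def quasi_entropy(fingerprints):
    """Eq. (9.8): S = -<(1 - D_ij) ln(1 - D_ij)>, averaged over pairs i < j."""
    n = len(fingerprints)
    terms = []
    for i in range(n):
        for j in range(i + 1, n):
            s = 1.0 - cosine_distance(fingerprints[i], fingerprints[j])
            # x ln x tends to 0 as x tends to 0, so guard the logarithm.
            terms.append(s * np.log(s) if s > 0.0 else 0.0)
    return -float(np.mean(terms))

def degree_of_order_sq(F_vec, bin_width, volume, n_atoms):
    """Discretized Eq. (9.9): Pi^2 = bin_width * |F|^2 / (V/N)^(1/3)."""
    return bin_width * float(np.dot(F_vec, F_vec)) / (volume / n_atoms) ** (1.0 / 3.0)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(cosine_distance(a, b))   # orthogonal fingerprints
print(quasi_entropy([a, a, a]))  # identical fingerprints carry no diversity
```

A set of identical fingerprints gives zero quasi-entropy, while a set of mutually dissimilar fingerprints gives a positive value, which is the sense in which the measure quantifies diversity.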


9.2 Machine Learning Potential for Global Optimization

9.2.1 Lattice Sums Method

In this chapter we consider a method which includes the high-energy part in training. We propose a natural separation of the total energy into pairwise and many-body parts, providing us with some insight into the interactions in the crystal. Moreover, the pair potential can be visualized, helping us to physically interpret our model. Our method can be used for both global optimization (e.g. using USPEX) and MD. We express the total energy in the following way (similar ideas are reviewed in [9]):

$$E = E_0 + E_{2body} + E_{manybody} \qquad (9.10)$$

i.e. as a sum of a constant, a pairwise term, and a many-body term. In our algorithm the 2-body term is fitted by linear regression and the many-body term is fitted by an artificial NN. Linear regression gives a simple analytical form of $E_{2body}$, so it can be easily visualized. In order to describe the two-body part, we employed lattice sums. If there are K different types of atoms in the crystal, the number of pair potentials equals $\frac{1}{2}K(K+1)$. Each potential can be represented in a very general form:

$$\varphi(r) = \sum_{k=1}^{k_{max}} \frac{A_k}{r^k} \qquad (9.11)$$
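As a quick illustration of how general this form is (an example added here for orientation, not part of the original derivation), the 12–6 Lennard-Jones potential used for the tests later in this section is recovered from Eq. (9.11) with only two non-zero coefficients, provided $k_{max} \ge 12$:

$$\varphi_{LJ}(r) = \varepsilon\left[\left(\frac{r_m}{r}\right)^{12} - 2\left(\frac{r_m}{r}\right)^{6}\right] \quad\Longrightarrow\quad A_{12} = \varepsilon r_m^{12}, \quad A_{6} = -2\varepsilon r_m^{6}, \quad A_k = 0 \ \text{for all other } k$$

Here $\varepsilon$ is the depth of the potential well and $r_m$ is the position of its minimum.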

The total pairwise energy per unit cell then equals

$$E_{2body} = \sum_{l=0}^{\infty}\sum_{j=0}^{N}\sum_{k=1}^{k_{max}} \frac{A^{k}_{i,j}}{r^{k}_{i,j}(l)} \qquad (9.12)$$

where the summation is done over the whole crystal; each unit cell is numbered by index l, with l equal to 0 for the current cell; i refers to an atom in the cell with l = 0 and j refers to an atom in the cell with number l; and $A^{k}_{i,j}$ is the kth coefficient in the potential of interaction between atom i and atom j, which depends only on the types of those atoms. Such sums

$$\sum_{l=0}^{\infty}\sum_{j=0}^{N}\sum_{k=1}^{k_{max}} \frac{1}{r^{k}_{i,j}(l)} \qquad (9.13)$$

are called lattice sums. Terms with k = 1, 2, 3 correspond to long-range interactions, because the integrals $\int_1^{\infty} \frac{r^2\,dr}{r^k}$ diverge. The main problem with these terms is that one has to calculate them through the whole crystal rather than inside a finite sphere. For the Coulomb part, the Ewald method [15, 16] was developed, where summation is taken both in real and reciprocal spaces. We generalize this approach to terms with k = 2, 3, 4, 5, 6. The derivation of the formulas (see below) is very cumbersome and is almost the same as described in [16]. For the term k = 6, the formula was taken from [17]. The components with k ≥ 7 we calculate within a sphere of sufficiently large radius (typically, 10 Å). Terms k = 4, 5, 6 can be calculated by summation within a sphere of finite radius, but in order to reach the same accuracy as for terms k ≥ 7, the radius of that sphere would have to be too large, making the calculations too expensive. Ewald-like summation of these terms in both real and reciprocal space makes the calculations cheaper and more accurate. For these terms, the real-space and reciprocal-space radii $R_{max}$ and $G_{max}$ and the parameter g (this parameter appears in formulas below) should be chosen in such a way that the number of iterations is minimal and the accuracy is maximal. The optimal parameters for the Coulomb interaction were borrowed from the GULP code [17, 18]:

$$g_{opt} = \left(\frac{n\omega\pi^3}{V^2}\right)^{1/2} \qquad (9.14)$$

$$G_{max} = 2fg, \qquad R_{max} = \frac{f}{g} \qquad (9.15)$$

where ω = 1, n is the number of atoms in the unit cell, and V is its volume; $f = (-\ln A)^{1/2} = 3$; and A corresponds to the desired accuracy. Our analysis showed that these parameters work equally well for k = 2, 3, 4, 5, 6, so we used them.

It is obvious from physical considerations (and also from our formulas, since otherwise the sums diverge) that the overall charge of the unit cell must equal zero. It turns out that similar sum rules apply also to the components with k = 2, 3 (see formulas below): $\sum_{i,j=1}^{N} A^{k}_{i,j} = 0$. From this we draw an important conclusion: crystals with only one type of atom have no long-range interactions. That is why they are fundamentally easier to work with than crystals made of many atomic types. We assume here that all the atoms of one species have the same charge. This is true only for systems that have only pair interactions. In real systems, even with only one species, there can be significant charge transfer (e.g. in γ-boron [19]) because of strong many-body interactions.
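The remark above that direct summation within a finite sphere becomes expensive for small k can be checked with a toy calculation. The sketch below is an illustration under simplifying assumptions, not the Ewald-like scheme of this chapter: it uses a simple cubic lattice with unit spacing and unit coefficients, and the helper name `direct_lattice_sum` is hypothetical. It compares the sum of $1/r^k$ truncated at two different radii.

```python
import numpy as np

def direct_lattice_sum(k, radius, a=1.0):
    """Sum 1/r^k over simple-cubic lattice points with 0 < r <= radius."""
    n = int(np.ceil(radius / a))
    idx = np.arange(-n, n + 1)
    x, y, z = np.meshgrid(idx, idx, idx, indexing="ij")
    r = a * np.sqrt(x**2 + y**2 + z**2).ravel()
    r = r[(r > 0.0) & (r <= radius)]
    return float(np.sum(r ** (-float(k))))

# Relative change of the truncated sum when the sphere radius grows from 5a to 20a.
gaps = {}
for k in (4, 8):
    s_small = direct_lattice_sum(k, radius=5.0)
    s_large = direct_lattice_sum(k, radius=20.0)
    gaps[k] = (s_large - s_small) / s_large
    print(k, s_small, s_large, gaps[k])
```

For k = 8 the truncated sum barely changes between the two radii, whereas for k = 4 it is still far from converged at the smaller radius, which is the motivation for treating the low-k terms with an Ewald-like summation.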
Usually, one considers the nonlinear dependence of E on r, but it is striking that the dependence of E on the $A_{ij}$ coefficients is linear, and therefore, knowing the energies of some structures, we can reconstruct the coefficients using linear regression. We also note that since in real systems pair interactions are affected by the environment (which leads to many-body interactions), we can expect pair potentials to depend on the density of the system, especially for metals. For metals, this can be described as density-dependent screening of pair interactions by the electron gas.

To illustrate our method, we carried out two tests of our algorithm. In the first test we used the Lennard-Jones potential:

$$\varphi(r) = \varepsilon\left[\left(\frac{r_m}{r}\right)^{12} - 2\left(\frac{r_m}{r}\right)^{6}\right] \qquad (9.16)$$

where ε is the depth of the potential well and $r_m$ is the position of its minimum. In our tests we considered an $A_xB_y$ Lennard-Jones system. Using the USPEX code, we generated 10 000 random unrelaxed structures, with arbitrary numbers of formula units in the unit cell and any space group symmetry, and then computed their energies using GULP [17]. Having these energies, we reconstructed the potentials (see Figure 9.3). Energies (per atom) of the structures were in the range [30; 2] in ε units. The reconstructed model yielded a root mean square error (RMSE) of less than $10^{-7}$ ε; the Pearson coefficient was equal to 1, i.e. our reconstruction was perfect.

Figure 9.3 Reconstructed and real potentials in the $A_xB_y$ Lennard-Jones system for (a) A–A, (b) A–B, (c) B–B interactions. Each panel plots E (ε) against R (σ) for the real and the reconstructed potential.

The second test was more complicated: we added Coulomb interactions to this Lennard-Jones potential. We took the same A–B system but with fixed composition $A_nB_n$; this allowed us to put fixed charges on the atoms: +1 on atom A and −1 on atom B. After generating these structures with USPEX and calculating their energies with GULP, we could reconstruct the potentials. The results were even more encouraging; we were also able to calculate the charges and obtained a charge of 1e on each A atom (to be more precise, we are able to calculate only squared charges).

Thus, we showed that our method allows one to reconstruct pair interaction potentials. Even in systems with many-body interactions, such potentials carry important chemical information, because many-body contributions are usually much smaller. Yet, for quantitatively accurate results, many-body interactions must be taken into account. The next section describes our main model, which is a combination of our lattice sums method for 2-body interactions and machine learning for many-body terms. With this model we can fit the energies of real systems very accurately.
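The reconstruction of pair-potential coefficients by linear regression can be sketched as follows. This is a simplified stand-in for the tests described above, under loudly stated assumptions: it uses small finite clusters of a single species instead of periodic structures from USPEX, plain inverse-power sums over pairs instead of lattice sums, and energies computed directly from Eq. (9.16) instead of GULP. The names `KS`, `pair_distances`, and `lj_energy` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
EPS, RM = 1.0, 1.0          # Lennard-Jones well depth and position of the minimum
KS = (4, 6, 8, 10, 12)      # trial powers in phi(r) = sum_k A_k / r^k

def pair_distances(pos):
    """All distinct interatomic distances in a finite cluster."""
    diff = pos[:, None, :] - pos[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    return d[np.triu_indices(len(pos), k=1)]

def lj_energy(r):
    """Reference energy: sum of Eq. (9.16) over all pairs."""
    return float(np.sum(EPS * ((RM / r) ** 12 - 2.0 * (RM / r) ** 6)))

# Eight atoms on a perturbed 2 x 2 x 2 grid (an assumption made to avoid overlaps).
grid = 1.1 * np.array([[i, j, k] for i in range(2) for j in range(2) for k in range(2)],
                      dtype=float)

X, y = [], []
for _ in range(200):
    pos = grid + rng.uniform(-0.15, 0.15, size=grid.shape)
    r = pair_distances(pos)
    X.append([np.sum(r ** (-float(k))) for k in KS])  # one inverse-power sum per k
    y.append(lj_energy(r))

# The energy is linear in the coefficients A_k, so least squares recovers them.
coeffs, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
recovered = dict(zip(KS, coeffs))
print(recovered)
```

With noise-free energies the fit should return coefficients close to $A_6 = -2\varepsilon r_m^6$ and $A_{12} = \varepsilon r_m^{12}$ and near-zero values for the other trial powers, mirroring the perfect reconstruction reported for the first test.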

9.2.2 Feature Vector

Since lattice sums describe only the two-body part, feature vectors that describe the many-body part have to be introduced. In general, any feature vector has to be unambiguous (i.e. two feature vectors should be different for two different structures) and invariant with respect to the order of atoms, choice of the unit cell, and translation and rotation of the system. In addition, the feature vector should be compact, i.e. contain as few components as possible. Finally, since ML builds a correlation between the input (feature vector) and the output (in this case, energies), we want to find geometric features that are correlated with energies. The 2-body correlation function in systems where only one type of atom is present is defined as follows:

$$g(r) = \sum_{i,j} \delta(r - r_{i,j}) \qquad (9.17)$$

where the summation is taken over the whole crystal and $r_{i,j}$ refers to interatomic distances; δ is the Dirac delta. Our feature vector consists of integrals of 2-body correlation function terms:

1. Lattice sums with 4 ≤ k ≤ 15 (lattice sums with k = 4, 5, 6 catch far neighbors)
2. For each atom i in the unit cell, we can calculate the following value (sums are taken in a sphere of a large radius around atom i):

$$f_i = \sum_{l=0}^{\infty}\sum_{j}^{N} \frac{1}{r_{i,j}(l)}\,\mathrm{erfc}(g \times r_{i,j}(l)) \qquad (9.18)$$

or

$$f_i = \sum_{l=0}^{\infty}\sum_{j}^{N} \frac{\exp(-r_{i,j}^{2}(l) \times g^{2})}{r_{i,j}^{2}(l)} \qquad (9.19)$$

for some fixed values g (the list of used parameters is described below). Then the feature for our NN is the average of these values:

$$f = \frac{1}{n}\sum_{i}^{n} f_i \qquad (9.20)$$
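The atom-averaged features of Eqs. (9.18)–(9.20) can be sketched for a periodic cell as follows. This is a minimal illustration, not the authors' code: the function name `two_body_features`, the brute-force loop over periodic images, and the fcc example cell (a = 4.05 Å) are assumptions; the truncation radius f/g with f = 3 follows the convention of Section 9.2.1.

```python
import math
import numpy as np

def two_body_features(cell, frac_pos, g, f=3.0):
    """Atom-averaged Eqs. (9.18) and (9.19), summed within the radius f/g."""
    cell = np.asarray(cell, dtype=float)              # rows are lattice vectors
    pos = np.asarray(frac_pos, dtype=float) @ cell    # Cartesian positions
    r_cut = f / g
    # Spacing between lattice planes along each cell vector decides how many
    # periodic images are needed to cover the sphere of radius r_cut.
    heights = 1.0 / np.linalg.norm(np.linalg.inv(cell), axis=0)
    reps = [int(math.ceil(r_cut / h)) for h in heights]
    shifts = np.array([[a, b, c]
                       for a in range(-reps[0], reps[0] + 1)
                       for b in range(-reps[1], reps[1] + 1)
                       for c in range(-reps[2], reps[2] + 1)], dtype=float) @ cell
    f_erfc, f_gauss = 0.0, 0.0
    for ri in pos:
        for rj in pos:
            d = np.linalg.norm(rj + shifts - ri, axis=1)
            d = d[(d > 1e-8) & (d <= r_cut)]          # drop the atom itself
            f_erfc += sum(math.erfc(g * x) / x for x in d)               # Eq. (9.18)
            f_gauss += float(np.sum(np.exp(-(g * d) ** 2) / d ** 2))     # Eq. (9.19)
    n = len(pos)
    return f_erfc / n, f_gauss / n                    # Eq. (9.20)

# Hypothetical example: conventional fcc cell with a = 4.05 Angstrom.
fcc_cell = 4.05 * np.eye(3)
fcc_frac = [[0, 0, 0], [0.5, 0.5, 0], [0.5, 0, 0.5], [0, 0.5, 0.5]]
print(two_body_features(fcc_cell, fcc_frac, g=0.2))
```

Because the values are averaged over atoms and summed over periodic images, they do not depend on the choice of unit cell, which is one of the invariance requirements stated above.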

Those terms appeared as part of lattice sums with k = 1, 2. Hence, we calculate them in a sphere of a radius where both sums converge (the radius for a given parameter g is equal to f/g; see Section 9.2.1). We also include the volume per atom and the average degree of order described in [6, 19]; the latter has an important correlation for optimized structures: the higher the degree of order, the lower the energy of the structure [6]. Motivated by the success of [10], we also include a 3-body part, which is the average sum of symmetry functions over the whole crystal:

$$\sum_{i,j,k\neq i} \left(\frac{1 + \lambda\cos\theta_{ijk}}{2}\right)^{\xi} \exp\left(-\eta(R_{ij}^2 + R_{ik}^2 + R_{kj}^2)\right) \times f_c(R_{ij}) \times f_c(R_{ik}) \times f_c(R_{kj}) \qquad (9.21)$$

$$f_c(x) = \begin{cases} 0.5 \times \left[\cos\left(\dfrac{\pi x}{R_c}\right) + 1\right], & x \le R_c \\ 0, & x > R_c \end{cases}$$

where the ith atom is inside the unit cell and the jth and kth atoms are inside a sphere of relatively small radius (6 Å); λ = ±1; ξ and η are fixed parameters for each feature. We used the following parameters for 2-body terms in all the examples presented in the following sections:

g = [0.17; 0.18; 0.19; 0.20; 0.21; 0.22; 0.23; 0.24; 0.25; 0.26]

Table 9.1 shows the parameters used for 3-body terms.

Table 9.1 Parameters for 3-body terms used in all the examples in the current section.

λ      η        ξ
±1     0.001    1
±1     0.003    2
±1     0.005    3
±1     0.007    4
±1     0.009    5
±1     0.011    6
±1     0.013    7
±1     0.015    8

So, the length of our feature vector is equal to 50 (12 lattice sums, 10 values of g for each of Eqs. (9.18) and (9.19), the volume per atom, the average degree of order, and 16 three-body terms from Table 9.1). But this is only the initial value, which will be optimized using feature analysis.

9.2.3 Feature Vector Analysis

In all machine learning applications, feature vectors are selected heuristically: at the outset it is always unknown how many features are necessary and which of them are the most important. In computational chemistry their choice is based on intuition about interactions between the atoms. We propose here a method to analyze which crystal structure descriptors are most correlated in a nonlinear way with, e.g. the energy (or any other property) of the crystal. This is important for applications of machine learning techniques in computational chemistry because, on the one hand, one can find the really important descriptors, and on the other hand, it may significantly speed up the algorithm, since calculations of such descriptors are often time-consuming. So, one encounters many questions at this step:

1. Is it possible to find features that are based on the geometry of a crystal and that are highly correlated with the energy (or other properties) based on physical reality rather than intuition?
2. Is it possible to systematically improve the accuracy of a model by adding new features?
3. Is there a way to build universal unambiguous descriptors of a chemical system?


These questions are very important because answering them will lead to understanding interactions in crystals and to creating a very powerful algorithm. A lot has already been done by many groups [14, 20, 21], but here we present an approach which is novel to the field. What is interesting (and motivating) is that in [21] a feature vector of length 3 was enough to describe Al crystals well, while in [9] the authors use a feature vector of length 40, but the system is much more complicated. As long as we are using NNs, a straightforward extension of the so-called optimal brain damage (OBD) algorithm can be used for feature analysis. This method was first proposed in the computer vision field [22]. In addition, we want to point out that with our algorithm we can make a very reasonable analysis, but the choice of features depends on the data that are used for training. This means that the final choice of features may not be universal: features that are important for one problem may be less important for a different problem.

Now we will prove that descriptors based only on interatomic distances in a crystal are ambiguous by presenting two infinite structures with the same such feature vector (Figure 9.4). Generally, any feature that is based only on the 2-body correlation function can be presented in the following way:

$$f = \sum_{i,j} F(R_{ij}) \qquad (9.22)$$

where summation is taken over the whole crystal, F is some fixed (smooth) function of one variable, and $R_{ij}$ corresponds to the interatomic distance. This implies that if two crystal structures have the same distribution of interatomic distances $R_{ij}$, their feature vectors will be the same. The final step in our proof is to show that the two quasi-1D structures in Figure 9.4 have the same distribution of pairwise distances. First, these two structures are built from two different quadrangles (one is a trapezoid and the other is not) presented in Figure 9.4a. Due to geometrical simplicity, one can easily check that the list of six distances in one quadrangle coincides with that of the other quadrangle (the property we want for an infinite structure). Next, we can translate each quadrangle in the orthogonal direction (with the same spacing), making two different structures with eight edges in the unit cell. Again, one easily proves that the list of $\binom{8}{2}$ distances of one structure coincides with that of the other structure. From this it follows that repeating these quadrangles in the orthogonal direction with the same spacing will result in two quasi-1D structures with the same distribution of pairwise distances.

Figure 9.4 Two quasi-1D structures ((b) and (c)) that have the same 2-body correlation function (a).

For a given feature vector (input of the ML algorithm), we want to analyze which features are the most correlated with energy (output of the ML algorithm) in a nonlinear manner. In particular, we use here ideas that came from computer vision. There, deep feed-forward NNs with many connections are used, and building a good architecture with only relevant connections results in a fast algorithm. So, in [22] the authors proposed the so-called OBD algorithm, which deletes irrelevant connections in the NN architecture. Suppose that we can calculate the importance of every connection inside a NN (below we will call this quantity “silence”) and then delete the connections with the lowest silences. In our case we extend this idea by calculating the importance of input nodes: we can calculate the silence of a feature by summing up the silences of the connections that flow from this feature.

Let us introduce some notation. The NN function is denoted as $F(\omega, x_i)$, where ω corresponds to the connections (weights) of the architecture and $x_i$ is the feature vector of the ith sample. By training of the NN we mean the process of minimizing the error function with respect to the weights ω. We use the following cost function:

$$J = \frac{1}{2}\sum_{i=1}^{m} (E_i - F(\omega, x_i))^2 \qquad (9.23)$$

where m is the total number of training structures, while $E_i$ denotes the energy of the ith sample. After the NN is trained we can calculate the silence of each weight $\omega_{ij}$, which is defined as

$$s_{ij} = \frac{\partial^2 J}{\partial \omega_{ij}^2}\,\frac{\omega_{ij}^2}{2} \qquad (9.24)$$
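The silence of Eq. (9.24), summed over the connections leaving each input node, can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the second derivatives are estimated by central finite differences, the network is a small untrained one-hidden-layer tanh model (in the actual procedure the silences are evaluated after training), and the names `diag_hessian`, `feature_silences`, and `cost` are hypothetical.

```python
import numpy as np

def diag_hessian(J, w, h=1e-4):
    """Central finite-difference estimate of d^2 J / d w_k^2 for every weight."""
    w = np.asarray(w, dtype=float)
    J0 = J(w)
    out = np.empty_like(w)
    for k in range(w.size):
        step = np.zeros_like(w)
        step[k] = h
        out[k] = (J(w + step) - 2.0 * J0 + J(w - step)) / h**2
    return out

def feature_silences(J, w, n_in, n_hidden):
    """Sum s_ij = 0.5 * (d^2 J / d w_ij^2) * w_ij^2 over weights leaving each input."""
    s = 0.5 * diag_hessian(J, w) * np.asarray(w) ** 2
    first_layer = s[: n_in * n_hidden].reshape(n_in, n_hidden)
    return first_layer.sum(axis=1)

# Toy data and network: n_in inputs -> n_hidden tanh units -> 1 linear output.
rng = np.random.default_rng(1)
n_in, n_hidden, m = 3, 4, 50
X = rng.normal(size=(m, n_in))
E = np.sin(X[:, 0]) + 0.5 * X[:, 2]

def cost(w):
    """Eq. (9.23) for the toy network; w holds all weights as a flat vector."""
    W1 = w[: n_in * n_hidden].reshape(n_in, n_hidden)
    w2 = w[n_in * n_hidden:]
    pred = np.tanh(X @ W1) @ w2
    return 0.5 * np.sum((E - pred) ** 2)

w = rng.normal(scale=0.5, size=n_in * n_hidden + n_hidden)
w[n_hidden: 2 * n_hidden] = 0.0   # disconnect input feature 1 on purpose
silences = feature_silences(cost, w, n_in, n_hidden)
print(silences)
```

A feature whose outgoing weights are all zero has zero silence and would be the first candidate for deletion in the ranking step described next.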

Now, let us summarize the algorithm of feature analysis:

1. We start with a fixed architecture of the NN (the initial architecture for Al is (50–35–50–1)) and with a fixed feature vector (the initial length is 50 for Al and 59 for C), and train our NN on the training data. The process of training includes two steps: (i) random initialization of the weights of the NN and (ii) gradient descent for minimizing the objective function. Often, the resulting NN is in a local (not global) minimum of J. To reduce this risk, we actually start with several identical NNs (we use 5) simultaneously, with different initial weights in each NN. After training them, we keep only the one that showed the best accuracy.
2. For the best NN we calculate the silence of every feature as described earlier and rank the features. Then we delete the lowest-ranked feature. This means we cut every connection in our NN from the node that corresponds to the worst feature (after the first deletion in Al we will have the architecture (49–35–50–1)). Afterward, we retrain the NN in the manner described earlier. If the performance of the new best NN is acceptable, we repeat the procedure. Otherwise, the process is complete.

We fixed the same activation function for all layers: tanh(x) + γx. Such an activation function prevents the so-called paralysis of a NN; the linear term is also important in the output layer, because energies may be arbitrary. We used γ = 0.1; this particular value was chosen by running several tests on the examples below. The weights and biases were trained by the standard back-propagation algorithm. We use batch gradient descent with conjugate gradients, but we modified the objective function. Here we are minimizing the following function:

$$J = \frac{1}{2m}\sum_{i=1}^{m} (E_i - F(\omega, x_i))^2 \times \exp(-\beta(E_i - E_{min})) \qquad (9.25)$$

where m is the total number of training samples; $x_i$ and $E_i$ correspond to the ith feature vector and its energy; F is the NN function (i.e. output); and ω are the weights that we are optimizing. Such a choice of cost function makes the NN more sensitive to low-energy structures. Indeed, the higher the β, the more accurately we predict the low-energy part of the PES, but at the cost of a slight worsening of the description of the high-energy part.

Since we use gradient descent for optimization of the weights, it converges much faster if the feature vector is normalized. In particular, each feature x is replaced by $x' = \frac{x - \bar{x}}{\sigma}$, where $\bar{x}$ denotes the average over the training data and σ is the standard deviation. In our scheme, the subtraction of the 2-body term may be considered as a physically motivated normalization of energies (i.e. normalization of the output of the NN).

9.2.4 Examples of Machine Learning Interatomic Potentials

9.2.4.1 Aluminum

We collected 30 000 training structures of Al (12 000 of them correspond to randomly generated structures and the rest are intermediate structures collected during relaxation of many structures), and 8000 structures were used for testing. We summarize our results in Table 9.2. Energies were in the range [3.8; 0] eV/atom. One can notice in Table 9.2 that the results at low energies do not depend much on β here. This is due to the very large (∼90%) contribution of the 2-body part to the total energy. This is quite different from the case of carbon, where interactions are much more complex. The reconstructed potential is plotted in Figure 9.5. Since the data include structures with different densities, this potential is averaged over different densities.

Table 9.2 Results of our scheme for aluminum: $r_{low}$ and $RMSE_{low}$ denote the Pearson correlation coefficient and RMSE on test structures with energies not more than 0.5 eV/atom above the minimal energy; r and RMSE correspond to all test structures.

β      r (%)    RMSE (eV/atom)    r_low (%)    RMSE_low (eV/atom)
0      99.9     0.049             98.7         0.020
0.4    99.9     0.052             98.6         0.021
0.8    99.8     0.055             98.5         0.021
1.2    99.8     0.058             98.5         0.021
1.6    99.8     0.064             98.5         0.022
2      99.8     0.069             98.6         0.021

Figure 9.5 Reconstructed pair potential of Al–Al interaction (β = 1). The plot shows energy (eV) against R (Å), with the neighbor distance in fcc Al marked.

Since the interaction potential generally depends on the density, we performed the following calculations: about 20 000 random structures with fixed densities were generated, and afterward for each density we reconstructed the potentials using our scheme with architecture (50–50–1) (we fixed β = 0, Figure 9.6). These potentials have similar shapes: they have pronounced wiggles that correspond to Friedel oscillations, i.e. reflect the effect of screening of the pair interaction by the electron gas, and they have minima at the same Al–Al distance (also a confirmation of Friedel’s theory). This distance is very close to 2.86 Å, the neighbor distance between Al atoms in the fcc structure of Al.

Before proceeding further, we would like to note that the 2-body contribution is higher in structures with higher energies (structures far from local energy minima). This means that a given nonoptimal structure can be relaxed in the first steps using only the pair potential reconstruction method. Interestingly, the higher the density, the higher the potential (see Figure 9.6). At first it may be expected that the higher the density, the greater the relative contribution of many-body terms to the total energy. But for Al, calculations show the opposite. We think that this is because a large part of the many-body interaction energy is subsumed into the (now density-dependent) pair potential.

Figure 9.6 Al–Al interaction potentials at different densities (8.62 × 10⁻², 7.09 × 10⁻², 6.02 × 10⁻², 5.23 × 10⁻², and 4.63 × 10⁻² Å⁻³). The plot shows energy (eV) against R (Å).

Figure 9.7 Deletion of features in Al. The plot shows train and test RMSE (10⁻² eV/atom) against the number of deletions.

The result of the feature analysis scheme is presented in Figure 9.7. Clearly, we can take a feature vector of length 20 without losing performance (we started with 50 features). Importantly, all the remaining features belong to the 2-body part of the feature vector; moreover, one of them is the lattice sum with k = 5. The results are consistent with our previous work, where we pointed out that in metallic aluminum the contribution of pairwise interactions to the total energy is more than 90%.

9.2.4.2 Carbon

We applied our scheme to carbon, where we collected 37 000 training structures (12 000 of them are random and the rest are intermediate structures collected during relaxations of a large number of structures), and 9000 structures were used for testing. We summarize our results in Table 9.3. Energies of structures covered a wide range [−9.3; 0] eV/atom. We calculated the average 2-body contribution and it turned out to be unexpectedly small, only about 50%. This means that this system is very complex and cannot be treated by any pair model, many-body interactions being essential.

Table 9.3 Results of our scheme for carbon: $r_{low}$ and $RMSE_{low}$ denote the Pearson correlation coefficient and RMSE on test structures with energies not more than 1 eV/atom above the minimal energy; r and RMSE correspond to all test structures.

β      r (%)    RMSE (eV/atom)    r_low (%)    RMSE_low (eV/atom)
0.4    96       0.49              78           0.29
0.8    95       0.56              84           0.22
1.2    94       0.62              88           0.19
1.6    92       0.73              88           0.18
2.0    87       0.88              91           0.16

We decided to test the accuracy of our model by comparing it with the well-known ReaxFF model [23] for test structures (i.e. those that were not used for training of our model). One important question arises: how should different models be compared? We want to emphasize that although we want to include the high-energy part of the PES in the fit, it is more important to describe the low-energy part very accurately. We propose to plot RMSE as a function of energy: RMSE(E) gives the RMSE for structures with energies below E. Such a plot reflects the performance of the model better. We plot RMSE(E) curves for ReaxFF and for our models in Figure 9.8. Clearly, the higher the β, the better the low-energy part of the PES is described and the worse the high-energy part. Analyzing the curves, we can naturally separate models with different β and use them in different energy ranges, obtaining a very good fit of both low- and high-energy regions. We compare our best model with the performance of ReaxFF in Figures 9.8 and 9.9, and see a clear advantage of our scheme.

Carbon is much more complex and many-body interactions are essential. That is why in addition to the initial 50 features we include the following:

1. We introduce here a feature that we call bond-valence imbalance, which is defined as

$$f = \frac{1}{N}\sum_i f_i, \qquad f_i = \left(\sum_j e^{\frac{r_0 - r_{ij}}{\rho}} - V_i\right)^2 \qquad (9.26)$$

This feature is chemically motivated: here $V_i = 4$ corresponds to the valence of carbon; $r_0$ and ρ are parameters describing the strength of the bond that corresponds to $r_{ij}$. We choose $r_0$ and ρ to be equal to 1.5 and 0.35 Å, respectively. Indeed, we took 100 relaxed carbon structures, and the chosen parameters $r_0$ and ρ minimize the total bond valence imbalance in these structures.

Figure 9.8 RMSE(E) curves for our models of carbon with different β (β = 0.4, 0.8, 1.2, 1.6, 2) (a) and comparison of performances of our best mixed model with known ReaxFF method (b). Both panels plot RMSE (eV/atom) against energy (eV/atom).

2. In order to describe many-body interactions in more depth, we extend the symmetry functions proposed in [10]:

$$f = \frac{1}{N}\sum_i f_i, \qquad f_i = \sum_{j,k,l}\left(\frac{1 + \lambda\cos\theta_{jkl}}{2}\,\frac{1 + \lambda\cos\theta_{jil}}{2}\,\frac{1 + \lambda\cos\theta_{lik}}{2}\right)^{\xi} \times e^{-\eta(r_{ij}^2 + r_{ij}^2 + \dots)}\, f_c(r_{ij})\, f_c(r_{ik}) \qquad (9.27)$$

Here $f_i$ is a value that is calculated in the sphere around the ith atom; then we average this value over atoms (N corresponds to the total number of atoms in the unit cell). We point out that such features are based on the 4-body correlation function.

Figure 9.9 Comparison of performances of our best mixed model (a) with known ReaxFF method (b). Panel titles: “Performance of our best model on carbon” and “Performance of ReaxFF on carbon”; both panels plot NN energies (eV/atom) against VASP energies (eV/atom).

The final length of the feature vector is 59, and the initial architecture of the NN is (59–35–50–1). The result of the feature analysis is plotted in Figure 9.10. As in the case of aluminum, the plot shows that we can reasonably keep only 20 features, which include 13 from the 2-body feature vector and 7 from the many-body part (2 of them correspond to the 4-body correlation function in Eq. (9.27), while the other 5 features refer to the Behler–Parrinello symmetry function [10], related to the 3-body correlation function). It was indeed expected that many-body terms are very important, but a surprising result is that the long-range lattice sums with k = 4, 5, 6 survived, which signals that long-range interactions in carbon are important. Moreover, as one can see from Figure 9.10, we were deleting features until we were left with only one. It turned out that these lattice sums are so important that all of them are present even when the length of the feature vector is equal to 5.

Figure 9.10 Deletion of features in C. The plot shows train and test RMSE (10⁻¹ eV/atom) against the number of deletions.

9.2.4.3 Helium and Xenon

It is well known that interactions in noble gases can be described by a simple Lennard-Jones potential. Our scheme recovers this shape without using any assumptions (see Figure 9.11). Here we used a simple architecture (50–50–1) (we fixed β = 0) and trained on unrelaxed structures (20 000 structures for each system; energies were calculated using van der Waals functions). The minima of these potentials are very close to the sum of van der Waals radii: $R_{He–He}$ = 2.80 Å and $R_{Xe–Xe}$ = 4.32 Å. The He–He potential has a very shallow minimum, as expected. Moreover, the 2-body contribution to the total energy in He is 97%; during training, the RMSE on the training set was only 0.003 eV/atom, while the energy range was [0.015; 0.87] eV/atom. In Xe the 2-body contribution decreased to 92% and many-body terms turned out to be quite important. Many-body effects in Xe are discussed comprehensively in [24].

Figure 9.11 Reconstructed He–He (a) and Xe–Xe (b) pair potentials. Panel (a) plots energy (10⁻⁴ eV/atom) and panel (b) energy (eV/atom) against R (Å).

Feature analysis for He and Xe is plotted in Figure 9.12. Due to the simplicity of the data used for training and of these systems, two features are enough to describe each case. In both cases these features all refer to the 2-body correlation function: in He these are the lattice sums with k = 4, 6 and in Xe the lattice sums with k = 5, 7 (Figure 9.12).

9.2.5 Discussion

Figure 9.12 Deletion of features in He (a) and Xe (b). [Both panels: train and test RMSE ($10^{-2}$ eV/atom) versus number of deletions.]

In this section we discussed a new method [25] for the reconstruction of the PES for different classes of materials. The algorithm combines the proposed lattice sums method with an artificial NN. Our method is suited both to energy prediction for high-energy structures and to PES visualization. Our reconstructed Al–Al potential showed, as expected for metals, a strong density dependence and an oscillatory form originating from Friedel oscillations. Lattice sums with k = 4, 5, 6 are very important for metals. We showed that, unlike metals, covalent crystals (e.g. carbon) cannot be adequately described by any pair potential, and many-body interactions are essential. Our scheme, which includes many-body interactions, performs much better than the frequently used ReaxFF many-body force field. Reconstructed pair potentials in noble gases (He–He and Xe–Xe) demonstrate the Lennard-Jones-type shape with minima corresponding to the sums of van der Waals radii.

We also adopted the OBD algorithm for feature analysis. It allows us to keep only the physically important features. Aluminum turned out to be quite simple: for the data analyzed here we have shown that, among 50 features that include 2-body and many-body distribution functions, only features based on the 2-body correlation function influence the energy. We believe that such behavior is common for simple metals. Only k = 5 survived in our analysis of aluminum, although it was expected that all three lattice sums (with k = 4, 5, 6) could have a large effect because electrons in metals can move very far from the initial atom. In contrast, carbon turned out to be much more complex. In our analysis we systematically introduced features that correspond to 2-body, 3-body, and even 4-body distribution functions. Intuitively, we expected a large influence of many-body terms, and this is the case: among the 20 final features, 7 correspond to the many-body part. The unexpected result is that lattice sums with k = 4, 5, 6 survived, corresponding to longer-ranged interactions than anticipated.

For noble gases one expects the terms with k = 6, 12 to be the most important. We argue that, with the surviving features, the NN still captures a term such as the lattice sum with k = 12. This indicates that our data mostly contain structures where k = 6 dominates, and this might be the reason why the short-range k = 12 term can be reconstructed by our NN.
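For reference, the Lennard-Jones shape mentioned above combines an attractive $r^{-6}$ term with a repulsive $r^{-12}$ term (the k = 6 and k = 12 terms). The following minimal sketch only evaluates that textbook 12-6 form and locates its minimum numerically; it is not the reconstructed NN potential. The function name `lennard_jones` and the well depth are illustrative assumptions, and only the He–He minimum position of 2.80 Å is taken from the text.

```python
import numpy as np

def lennard_jones(r, epsilon, r_min):
    """12-6 pair energy with well depth epsilon and minimum at r = r_min."""
    x = r_min / np.asarray(r, dtype=float)
    # Repulsive k = 12 term minus attractive k = 6 term.
    return epsilon * (x**12 - 2.0 * x**6)

# r_min is the He-He value quoted in the text; epsilon is an arbitrary
# illustrative well depth (eV), not a fitted number.
r_grid = np.linspace(2.0, 9.0, 7001)
energy = lennard_jones(r_grid, epsilon=1.0e-3, r_min=2.80)
r_at_minimum = r_grid[np.argmin(energy)]
```

Writing the potential in terms of `r_min` rather than the usual σ makes the position of the minimum explicit, which is the quantity compared with the sums of van der Waals radii.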

9.3 Interatomic Potential for Molecular Dynamics

9.3.1 General Form of the Potential

The method described in Section 9.2 is suitable for global optimization and chemical interpretation of interatomic interactions in crystals. Its accuracy is high, it works in a wide range of energies, and it accounts for many-body interactions. However, the lattice sums method is rather computationally expensive, and NN training is complicated and time consuming. This is why we tested a simple, fast, and reasonably accurate method for describing interatomic interactions that is suitable for MD simulation. The main idea is the same: during a quantum molecular dynamics (QMD) run, many structures with accurately calculated energies and forces are generated. These structures are used as a training set for machine learning. Here we use a simple linear regression method. The idea for the structural descriptors is taken from [26], where an energy-free force field predicting only forces was developed. A simplified way to predict energies as well is tested here.


We suggest that the energy and the forces acting on atoms can be described with linear regression. In other words, the energy is the product of a vector of coefficients and a feature vector that uniquely describes the structure. Forces are minus the derivatives of the energy with respect to the coordinates. We use the following expression as the feature vector:

$$X_E = \sum_{i=1}^{N_{\mathrm{at}}} \sum_{j=1}^{N_{\mathrm{neigh},i}} \exp\left[-\left(\frac{|\vec{r}_{ij}|}{r_{\mathrm{cut}}(k)}\right)^{p(k)}\right] \qquad (9.28)$$

where $N_{\mathrm{at}}$ is the number of atoms in the unit cell, $N_{\mathrm{neigh},i}$ is the number of neighbors of atom $i$ within the radius $R_{\mathrm{neigh}}$ (normally 5 Å), $|\vec{r}_{ij}|$ is the distance between atoms $i$ and $j$, and $r_{\mathrm{cut}}(k)$ and $p(k)$ are constants of the potential. The length of the feature vector is defined by the number of constants ($k = 1, \ldots, N$). In the case of linear regression, the energy of a structure is calculated as

$$E = \Theta X_E + \Theta_0 \qquad (9.29)$$

where $\Theta$ are the linear regression coefficients. The force acting on atom $i$ in the $x$ direction (and likewise for $y$ and $z$) is then calculated as

$$F_{x,i} = -\frac{\delta E}{\delta x_i} = -\Theta \frac{\delta X_E}{\delta x_i} = \Theta X_F \qquad (9.30)$$

Since $X_E$ and $X_F$ are descriptors of the structure, the main task of the machine learning algorithm is to build a mapping from $X_E$ into $E$ and from $X_F$ into $F$. To find the values of the linear regression coefficients, the equations $E = \Theta X_E + \Theta_0$ and $F = \Theta X_F$ should be solved. With one row of $X_E$ (or $X_F$) per training sample, the least-squares solution follows from the matrix transformation

$$\Theta = (X_E^T X_E)^{-1} X_E^T E, \qquad \Theta = (X_F^T X_F)^{-1} X_F^T F \qquad (9.31)$$
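The linear model of Eqs. (9.28)–(9.31) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' LAMMPS implementation: it treats finite clusters without periodic images, uses a hard neighbor cutoff, counts every ordered pair (i, j), and fits synthetic data generated from known coefficients. All function names (`pair_geometry`, `energy_features`, `force_features`, `fit_theta`) and all numerical values are hypothetical.

```python
import numpy as np

def pair_geometry(pos, r_neigh):
    """Displacements r_i - r_j, distances, and a neighbor mask for a finite cluster."""
    d = pos[:, None, :] - pos[None, :, :]
    r = np.linalg.norm(d, axis=-1)
    mask = (r > 0.0) & (r < r_neigh)
    return d, r, mask

def energy_features(pos, rcut, p, r_neigh=5.0):
    """X_E[k] = sum over ordered neighbor pairs of exp(-(r_ij / rcut[k])**p[k])."""
    rcut, p = np.asarray(rcut, float), np.asarray(p, float)
    _, r, mask = pair_geometry(pos, r_neigh)
    return np.exp(-(r[mask][:, None] / rcut) ** p).sum(axis=0)

def force_features(pos, rcut, p, r_neigh=5.0):
    """X_F[i, c, k] = -dX_E[k]/dx_{i,c}, so the force on atom i is X_F[i] @ theta."""
    rcut, p = np.asarray(rcut, float), np.asarray(p, float)
    d, r, mask = pair_geometry(pos, r_neigh)
    r_safe = np.where(mask, r, 1.0)                 # avoid 0/0 on the diagonal
    unit = d / r_safe[..., None]                    # unit vector pointing from j to i
    s = r_safe[..., None] / rcut                    # shape (N, N, K)
    g = np.exp(-s ** p) * p * s ** (p - 1.0) / rcut
    g = np.where(mask[..., None], g, 0.0)
    # Factor 2: atom i enters both ordered pairs (i, j) and (j, i).
    return 2.0 * np.einsum("ijk,ijc->ick", g, unit)

def fit_theta(structures, energies, forces, rcut, p):
    """Least-squares fit of (theta, theta_0) to energies and forces together."""
    rows, targets = [], []
    for pos, e, f in zip(structures, energies, forces):
        rows.append(np.append(energy_features(pos, rcut, p), 1.0))
        targets.append(e)
        xf = force_features(pos, rcut, p).reshape(-1, len(rcut))
        # Force rows get a zero in the bias column: theta_0 exerts no force.
        rows.extend(np.hstack([xf, np.zeros((xf.shape[0], 1))]))
        targets.extend(f.reshape(-1))
    coef, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return coef[:-1], coef[-1]

# Synthetic check: generate data from known coefficients and recover them.
rng = np.random.default_rng(0)
rcut = np.array([1.0, 1.5, 2.0])
p = np.array([1.0, 1.0, 2.0])
theta_true, theta0_true = np.array([0.5, -1.0, 0.2]), -3.0
structures = [rng.uniform(0.0, 4.0, size=(8, 3)) for _ in range(6)]
energies = [energy_features(s, rcut, p) @ theta_true + theta0_true for s in structures]
forces = [force_features(s, rcut, p) @ theta_true for s in structures]
theta_fit, theta0_fit = fit_theta(structures, energies, forces, rcut, p)
```

Stacking energy rows and force rows into one design matrix is one simple way to train on both quantities at once; `numpy.linalg.lstsq` is used instead of forming the inverse in Eq. (9.31) explicitly because it is numerically more stable.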

As follows from Eq. (9.31), energies and forces are now indistinguishable from the point of view of the regression, so the algorithm can be trained on, and can predict, both energies and forces.

9.3.2 Parameters Selection

We implemented this potential in the LAMMPS [27] code. The implementation is parallelized using LAMMPS domain decomposition. We developed several parameterizations of our ML potential for aluminum and uranium. Trajectories for the training sets were taken from first-principles MD calculations made with VASP at different densities and temperatures. Each trajectory was calculated with a time step of 2 fs for about 5 ps. As mentioned previously, the dynamics of the system are mostly defined by the forces acting on the atoms and by the initial conditions. The difference between ab initio and predicted forces (RMSE) was therefore taken as the main quality criterion for the constructed potentials.

To parameterize a potential, the pairs of values $(r_{\mathrm{cut}}, p)$ were selected manually. First we fixed $p = 1$ and plotted the dependence of the RMSE on the value of $r_{\mathrm{cut}}$. The starting pair of parameters was defined by the minimum of the RMSE on this plot. The subsequent values of the constants were taken with a step of 0.3. For example, we found that for aluminum at zero pressure and 300 K the optimal value of $r_{\mathrm{cut}}$ was 0.22 Å at $p = 1$, with RMSE = 0.043 eV/Å (Figure 9.13a). For uranium this minimum is very broad. For this case $p$ was taken in the range from 1 to 3, and $r_{\mathrm{cut}}$ from $R_{\mathrm{neigh}}$ (usually equal to 5 Å) down to 1 Å. We note that $r_{\mathrm{cut}}$ = 0.22 Å is similar to the exponent with b = 0.25 Å in the Morse potential (which is just a sum of two exponentials). The model used here can be thought of as a generalized Morse potential with many-body effects.

Figure 9.13 Optimal choice of the number of parameter pairs and the training set size. (a) Dependence of the RMSE on the value of $r_{\mathrm{cut}}$ at $p = 1$; (b) relation between the RMSE on the test set for α-U and the number of parameter pairs; (c) learning curves (test and train) when randomly selected structures are in the training set. [Axes as labeled: (a) RMSE (eV/atom) versus $r_{\mathrm{cut}}$ (Å); (b) RMSE (eV/atom) versus number of $(r_{\mathrm{cut}}, p)$ pairs; (c) RMSE (eV/atom) versus training set size (%).]

The main parameters to optimize for ML potentials are not only the values of the $(r_{\mathrm{cut}}, p)$ pairs but also the number of such pairs and the training set size. Since Al can be described relatively well even with one optimally selected pair of parameters (see Figure 9.13a), all the main features of the ML potential are considered below with reference to the α-phase of uranium (at zero pressure and 1000 K). First, we established the optimal number of $(r_{\mathrm{cut}}, p)$ pairs (Figure 9.13b). To do this, the training set was chosen to be 20% of a 5-picosecond QMD run. The figure shows that the minimum error is reached with 15 pairs. For MD runs, however, the feature vector calculation time (which increases linearly with the number of parameters) plays a crucial role, so for further calculations the number of $(r_{\mathrm{cut}}, p)$ pairs was chosen as a compromise between calculation time and RMSE. Figure 9.13b shows that the optimal number of pairs is 11, and this holds for almost all ML potentials considered here.
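The first step of the manual selection described above (fix $p = 1$, scan $r_{\mathrm{cut}}$, take the RMSE minimum) can be illustrated with a toy one-dimensional scan. This is a sketch under strong assumptions: the "reference forces" are synthetic pair forces rather than ab initio data, a single feature is fitted, and the names `dimer_force_feature` and `rmse_for_rcut`, the grid, and the decay length 0.6 are all hypothetical.

```python
import numpy as np

def dimer_force_feature(r, rcut, p=1.0):
    """Magnitude of -d/dr exp(-(r/rcut)**p), the force feature for one pair."""
    s = r / rcut
    return np.exp(-s ** p) * p * s ** (p - 1.0) / rcut

def rmse_for_rcut(r, f_ref, rcut, p=1.0):
    """Fit one coefficient by least squares and return the force RMSE."""
    x = dimer_force_feature(r, rcut, p)
    theta = (x @ f_ref) / (x @ x)
    return np.sqrt(np.mean((theta * x - f_ref) ** 2))

# Synthetic reference forces (a stand-in for ab initio data); the "true" decay
# length 0.6 is an arbitrary illustrative choice.
r = np.linspace(1.0, 5.0, 200)
f_ref = 3.0 * dimer_force_feature(r, rcut=0.6)

# Scan rcut at fixed p = 1 and keep the value with the lowest RMSE.
rcut_grid = np.arange(0.1, 3.0001, 0.05)
errors = np.array([rmse_for_rcut(r, f_ref, rc) for rc in rcut_grid])
best_rcut = rcut_grid[np.argmin(errors)]
```

In a real parameterization the feature and reference forces would come from full structures, and further $(r_{\mathrm{cut}}, p)$ pairs would then be added around this starting point.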


Second, once the optimal number of $(r_{\mathrm{cut}}, p)$ pairs was defined, we studied the dependence of the RMSE on the training set size. We randomly chose structures from the first 50% of the steps of the QMD trajectory and put them in the training set (Figure 9.13c); for the test set we always kept the last 50%. More advanced strategies exist, such as active learning [28] and evaluating the distance between structures to decide whether a structure should be added to the training set as a new point [26]. With our approach, however, the error converged even with 10% of all structures in the training set. Normally, for confidence, we took 20% of the structures for training. Since the entire dataset comes from just a 5-picosecond QMD run, we cannot guarantee that the constructed potentials are not overfitted. This is why we always used L2 regularization.

We compared the accuracy of our ML potentials for Al and U with that of different published embedded atom method (EAM) potentials. We also compared our potentials with an EAM potential that we constructed using the force matching technique on the same training set; this potential was included for a fairer comparison. For Al we studied the fcc phase at 300 K and the liquid phase at 2000 K (Figure 9.14). At 300 K our potential with 11 pairs of parameters gave the same accuracy as the EAM potential made using force matching. These errors were lower than those given by the potentials from [29, 30]. Even our potential trained with one pair of parameters was more accurate than the potentials in Refs. [29, 30]. Moreover, the potential parameterized at 2000 K could accurately predict forces for structures at 300 K. The lowest RMSE for the test MD trajectory corresponding to 2000 K was reached using our ML potential with 11 pairs of parameters.

Figure 9.14 Comparison of different potentials for Al at 300 K (a) and 2000 K (b). "300 K": our potential trained at 300 K with 11 pairs of parameters; "300 K 1p": our potential trained at 300 K with 1 pair of parameters; "EAM": EAM potential trained on the same training set; "2000 K": our potential trained at 2000 K with 11 pairs of parameters; "LEA": from [29]; "Gupta": from [30]. [Bar charts of RMSE (eV/atom) for $T_{\mathrm{test}}$ = 300 K (a) and $T_{\mathrm{test}}$ = 2000 K (b).]

For uranium, we tested different potentials for the α-phase at 0 GPa and 1000 K (stable solid phase) and for the liquid phase at 300 GPa and 5000 K (Figure 9.15). For both the α-phase and the liquid phase, our ML potential trained with 11 pairs of parameters gave the highest accuracy among all considered potentials. In our opinion, it can be used to build the phase diagram of uranium.

Figure 9.15 Comparison of different potentials for α-U at 1000 K (a) and liquid U at 5000 K (b). "1000 K" and "5000 K" are our potentials trained at 1000 and 5000 K, respectively, with 11 pairs of parameters; "EAM" is the EAM potential trained on the same training set; 1 is from [31], 2 is from [32], 3 is from [4, 33], and 4 is from [34]. [Bar charts of RMSE (eV/atom) for $T_{\mathrm{test}}$ = 1000 K (a) and $T_{\mathrm{test}}$ = 5000 K (b).]
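The training-set-size study described above (random subsets drawn from the first half of the trajectory, the last half always held out, L2 regularization) can be sketched as follows. The data here are synthetic stand-ins for feature vectors and force components, and the regularization strength, the fractions, and the function names `ridge_fit`, `rmse`, and `learning_curve` are illustrative assumptions rather than values from the text.

```python
import numpy as np

def ridge_fit(x, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y (L2-regularized least squares)."""
    k = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(k), x.T @ y)

def rmse(x, y, theta):
    """Root-mean-square error of the linear prediction x @ theta."""
    return float(np.sqrt(np.mean((x @ theta - y) ** 2)))

def learning_curve(x, y, fractions, lam=1e-6, seed=0):
    """Train on random subsets of the first half; always test on the last half."""
    rng = np.random.default_rng(seed)
    half = len(y) // 2
    x_pool, y_pool, x_test, y_test = x[:half], y[:half], x[half:], y[half:]
    curve = []
    for frac in fractions:
        # frac is the share of ALL samples used for training.
        n_train = max(x.shape[1], int(round(frac * len(y))))
        idx = rng.choice(half, size=min(n_train, half), replace=False)
        theta = ridge_fit(x_pool[idx], y_pool[idx], lam)
        curve.append((frac, rmse(x_pool[idx], y_pool[idx], theta),
                      rmse(x_test, y_test, theta)))
    return curve

# Synthetic stand-in for a trajectory: 11 features per sample, noisy linear target.
rng = np.random.default_rng(1)
x_all = rng.normal(size=(1000, 11))
theta_true = rng.normal(size=11)
y_all = x_all @ theta_true + 0.05 * rng.normal(size=1000)
curve = learning_curve(x_all, y_all, fractions=[0.05, 0.1, 0.2, 0.5])
```

Each entry of `curve` is a tuple (fraction, train RMSE, test RMSE). Keeping the test set fixed to the last half of the trajectory, as in the text, means the test structures are separated in time from every training structure.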

9.3.3 Thermodynamic Quantities and Phase Transitions

MD simulations were performed in a 20 × 20 × 20 supercell with periodic boundary conditions in all directions. The interaction between atoms was described with the developed ML potential. The system was equilibrated using MD in the NVT ensemble for 4 ps. After that we calculated the velocity autocorrelation function (VACF) in the NVE ensemble for another 4 ps. The characteristic time of VACF attenuation in the considered systems is about 1 ps. The phonon density of states (PDOS) was calculated using the formula

$$g(\nu) = 4 \int_0^{\infty} \cos(2\pi\nu t)\, \frac{\langle v(0)\, v(t) \rangle}{\langle v(0)^2 \rangle}\, dt \qquad (9.32)$$

where $\nu$ is the vibrational frequency, $v(t)$ is the atomic velocity, and the average is taken over all atoms. The system must be large if an accurate $g(\nu)$ is needed (e.g. 32 000 atoms in our calculations), so QMD cannot be used even though the necessary physical simulation time is rather short. Figure 9.16 shows two examples of the calculated PDOS. Positions, widths, and heights of the peaks are in good agreement with the experimental data from inelastic neutron scattering [35]. The results differ substantially from calculations made with the frozen phonon method. The frozen phonon method yields a purely harmonic PDOS, neglecting anharmonicity and the finite lifetime of phonons. The finite displacements can be accounted for by using the self-consistent phonon method suggested in [36], and the broadening due to finite lifetimes can be calculated from the phonon–phonon interaction (taken from perturbation theory) [37]. In the approach used here, these two effects appear naturally from the movement and interaction of atoms at finite temperature.
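Eq. (9.32) can be evaluated numerically as a cosine transform of the normalized VACF. The sketch below is a minimal illustration, not the authors' workflow: it uses synthetic velocities in which every atom oscillates at a single frequency, truncates the integral at a finite maximum lag, and integrates with the trapezoidal rule. The function names and all numerical settings are assumptions chosen for the demonstration.

```python
import numpy as np

def normalized_vacf(vel, n_lags):
    """<v(0) v(t)> / <v(0)^2>, averaged over atoms and time origins.

    vel has shape (n_steps, n_atoms, 3).
    """
    n_steps = vel.shape[0]
    c = np.empty(n_lags)
    for lag in range(n_lags):
        c[lag] = np.mean(np.sum(vel[: n_steps - lag] * vel[lag:], axis=-1))
    return c / c[0]

def pdos_from_vacf(c, dt, freqs):
    """g(nu) = 4 * integral_0^t_max cos(2 pi nu t) C(t) dt (trapezoidal rule)."""
    t = np.arange(len(c)) * dt
    integrand = np.cos(2.0 * np.pi * freqs[:, None] * t[None, :]) * c[None, :]
    return 4.0 * dt * (integrand.sum(axis=1)
                       - 0.5 * (integrand[:, 0] + integrand[:, -1]))

# Synthetic velocities: every atom oscillates at nu0 with a random phase.
dt, n_steps, n_atoms, nu0 = 0.002, 2000, 64, 5.0   # ps, steps, atoms, THz
rng = np.random.default_rng(0)
t = np.arange(n_steps) * dt
phases = rng.uniform(0.0, 2.0 * np.pi, size=(n_atoms, 3))
vel = np.cos(2.0 * np.pi * nu0 * t[:, None, None] + phases[None, :, :])

c = normalized_vacf(vel, n_lags=500)               # lags up to about 1 ps
freqs = np.linspace(0.0, 12.0, 241)                # THz
g = pdos_from_vacf(c, dt, freqs)
peak = freqs[np.argmax(g)]
```

With time in ps and frequency in THz the product $\nu t$ is dimensionless, so no unit conversion is needed. For a real trajectory the VACF decays on the time scale quoted in the text, and the finite upper limit of the integral then matters much less than it does for this undamped toy signal.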

Figure 9.16 Phonon density of states at 300 K (a) and 775 K (b): 1, experimental data from [35]; 2, calculation via MD with our ML potential; 3, calculation via the frozen phonon method (using DFT). Source: Data from Kresch et al. 2008 [35]. [Panel (a): T = 300 K, a = 4.056 Å; panel (b): T = 775 K, a = 4.11 Å. Axes: phonon density of states versus frequency (THz).]

Entropy was computed using the harmonic formula

$$S = 3k_B \int_0^{\infty} g(\nu)\,[(n + 1)\ln(n + 1) - n \ln n]\, d\nu \qquad (9.33)$$

where $k_B$ is the Boltzmann constant, $g = g(\nu)$ is the PDOS, and

$$n = n(\nu) = \frac{1}{\exp\left(\frac{h\nu}{k_B T}\right) - 1}$$

is the average occupation number of bosons. Here, however, $g(\nu)$ includes all anharmonic effects. It is known [38] that the use of this equation in conjunction with an anharmonically renormalized $g(\nu)$ yields correct entropies to the leading order of perturbation theory. The computed entropies are shown in Figure 9.17 and in Table 9.4. The obtained values are in good agreement with experimental data. The discrepancy is within 0.1 $k_B$ per atom, which enables the use of this approach for the analysis of phase stability. Similar calculations were made with several ML potentials built on the same database. The deviation of the entropy at room temperature is within 0.03 $k_B$ per atom.

Figure 9.17 Entropy as a function of temperature: 1, thermodynamic data from the NIST-JANAF database; 2, calculated from the experimental PDOS [35]; 3, calculated from MD with the ML potential. Source: Data from Kresch et al. 2008 [35]. [Plot: S ($k_B$/atom) versus temperature (K).]

Table 9.4 Experimental and calculated entropies of crystalline Al at different temperatures.

T (K)                   300      525      775
a (Å)                   4.056    4.079    4.11
S_calc (k_B/atom)       3.49     5.17     6.42
S_exp (k_B/atom)        3.462    5.146    6.332

Our tests show that the constructed ML potentials can be used to reproduce forces acting on atoms in the liquid state. For the liquid state one cannot define the PDOS, but the potential can be verified on the basis of the radial distribution function (RDF). The RDF was averaged for 10 ps after equilibration (see Figure 9.18a). We considered a 4000-atom supercell of Al ($V_{\mathrm{at}}$ = 19.1 Å³) at temperature T = 1023 K. The developed potential reproduces QMD results at the same conditions and gives results that are in good agreement with experimental data. It is worth noting that almost identical results were obtained for different parameterizations made with different sets of $(r_{\mathrm{cut}}, p)$. We also noted that the potential parameterized on liquid configurations describes the forces in crystalline configurations well. Even though there are no explicitly calculated energies, a sufficiently accurate representation of the forces can enable the use of such potentials for modeling two-phase systems and for direct determination of the melting temperature.

To verify this, we calculated the melting temperature using the modified Z method [39]. The system was simulated at a fixed density in the NVE ensemble. It contained 4 × 4 × fcc unit cells with lattice parameter a = 4.16 Å. Initially the temperature was set to T = 2000 K, and shortly after the start of the MD run it relaxed to an average temperature T = 1000 K. After spontaneous melting, a decrease in temperature to an average value of 925 K was observed (Figure 9.18b). The density of the liquid is calculated from the density profile and corresponds to the atomic volume $V_{\mathrm{liq}}$ = 18.6 ± 0.1 Å³. The obtained atomic volume for the crystalline part, $V_{\mathrm{cryst}}$, is 17.3 ± 0.2 Å³. The obtained values are in excellent agreement with the experimental melting temperature of 933 K and the equilibrium atomic volume for the liquid of 18.9 Å³ [40]. Our results, T = 925 K and $\Delta V = V_{\mathrm{liq}} - V_{\mathrm{cryst}}$ = 1.3 Å³, are also close to thermodynamic calculations based on DFT [41], which give the melting temperature T = 912 K and $\Delta V$ = 1.35 Å³. It is worth noting that the pressure calculated in our QMD run was 2 ± 0.5 GPa. It is known that DFT with a GGA functional overestimates pressure, and in [41] the pressure correction for the melting curve under normal conditions was estimated as 1.6 GPa. Therefore, the calculated pressure with this correction is close to normal conditions.

Figure 9.18 Radial distribution function and melting temperature of Al. (a) Radial distribution function at T = 1023 K and $V_{\mathrm{at}}$ = 19.1 Å³: 1, experimental data; 2, QMD results; 3, MD calculation with our ML potential. (b) The dependence of temperature on time in the calculation of the melting temperature with the modified Z method. Atomic configurations at the beginning and at the end of the simulation are also shown. [Axes: (a) RDF versus r (Å); (b) temperature (K) versus time (ns), with solid and liquid regions labeled.]

9.3.4 Interatomic Potential for System of Two (or More) Atomic Types

We explored the idea of using exponential feature vectors to describe interatomic interactions in a system consisting of two different atom types. Here it is essential to separate the pair interactions between specific atom types into independent groups. Thus, considering atom types A and B in the system, for each atom $j$ we define two sets of $N$ feature vectors:

$$E_{ij}^{(A)} = \sum_{k=1}^{N_A^{\mathrm{neigh}}} \exp\left[-\left(\frac{r_{kj}}{r_{\mathrm{cut},i}}\right)^{p_i}\right], \qquad E_{ij}^{(B)} = \sum_{k=1}^{N_B^{\mathrm{neigh}}} \exp\left[-\left(\frac{r_{kj}}{r_{\mathrm{cut},i}}\right)^{p_i}\right], \qquad i = 1, \ldots, N, \quad j = 1, \ldots, N_{\mathrm{atoms}} \qquad (9.34)$$

where $N_A^{\mathrm{neigh}}$ and $N_B^{\mathrm{neigh}}$ are the numbers of neighboring atoms of each type in the sphere of radius $r_{\mathrm{cut}}^{\mathrm{global}}$, $r_{\mathrm{cut}}$ and $p$ are the external parameters of the method, and $N$ is the number of $(r_{\mathrm{cut},i}, p_i)$ pairs. Summing over all atoms of a specific type, we obtain a set of basis vectors $E_i^{(A-B)}$ for the interaction between atoms of types A and B. The actual energy $E$ of the system can be evaluated as a linear combination of the constructed features and a bias feature $E_0$:

$$E = \theta_0 E_0 + (\theta_1 \cdot E^{(A-A)}) + (\theta_2 \cdot E^{(A-B)}) + (\theta_3 \cdot E^{(B-B)}) \qquad (9.35)$$

where $\Theta = (\theta_0, \theta_1, \theta_2, \theta_3)^T$ is the vector of free parameters of the method. The mathematical simplicity of the chosen functional form allows us to construct the force feature vectors by a similar principle, differentiating the energy features according to the definition $F = -\nabla E$:

$$F_{ij}^{(A)} = \sum_{k=1}^{N_A^{\mathrm{neigh}}} p_i\, \frac{\mathbf{n}_{kj}}{r_{\mathrm{cut},i}} \left(\frac{r_{kj}}{r_{\mathrm{cut},i}}\right)^{p_i - 1} \exp\left[-\left(\frac{r_{kj}}{r_{\mathrm{cut},i}}\right)^{p_i}\right] \qquad (9.36)$$

Thereby, the force acting on atom $j$ is a linear combination of the force feature vectors with the same coefficient vector $\Theta$:

$$F_{A,j} = \theta_1 F^{(A)} + \theta_2 F^{(B)}, \qquad F_{B,j} = \theta_2 F^{(A)} + \theta_3 F^{(B)} \qquad (9.37)$$

where $\mathbf{n}_{kj} = \mathbf{r}_{kj} / |\mathbf{r}_{kj}|$ and $F^{(A)} \equiv (F_{ij}^{(A)})$ for an atom of type A.

The proposed approach allows us to use linear regression as an effective tool for optimizing the parameters $\Theta$. Since the force acting on an atom is the exact derivative (with a minus sign) of the potential energy, it is possible to train the model using both energy and force features simultaneously.
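The separation of pair interactions by atom type in Eqs. (9.34) and (9.35) can be sketched as follows. This is a minimal illustration under assumptions: a finite cluster without periodic images, a bias feature $E_0 = 1$, every ordered pair counted once per direction, and random coordinates rather than a real structure. The names `typed_energy_features` and `total_energy` and all numerical values are hypothetical.

```python
import numpy as np

def typed_energy_features(pos, types, rcut, p, r_global=5.0):
    """Return {"A-A": ..., "A-B": ..., "B-B": ...}, each a vector over (rcut, p) pairs.

    types holds 0 for species A and 1 for species B. Each block sums
    exp(-(r_kj / rcut_i)**p_i) over ordered neighbor pairs of the given types.
    """
    rcut, p = np.asarray(rcut, float), np.asarray(p, float)
    types = np.asarray(types)
    r = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    near = (r > 0.0) & (r < r_global)
    n_b = types[:, None] + types[None, :]           # 0: A-A, 1: A-B, 2: B-B
    blocks = {}
    for name, code in (("A-A", 0), ("A-B", 1), ("B-B", 2)):
        rij = r[near & (n_b == code)]
        blocks[name] = np.exp(-(rij[:, None] / rcut) ** p).sum(axis=0)
    return blocks

def total_energy(blocks, theta0, theta):
    """E = theta0 + theta1 . E(A-A) + theta2 . E(A-B) + theta3 . E(B-B), with E0 = 1."""
    return theta0 + sum(theta[name] @ blocks[name] for name in ("A-A", "A-B", "B-B"))

# Illustrative cluster with 4 atoms of type A and 7 of type B; coordinates are
# random and the (rcut, p) values are placeholders, not fitted parameters.
rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 4.0, size=(11, 3))
types = np.array([0] * 4 + [1] * 7)
rcut, p = [1.0, 1.0, 1.0], [1.25, 2.0, 2.5]
blocks = typed_energy_features(pos, types, rcut, p)
```

Because the three blocks partition the set of neighbor pairs, their sum equals the type-blind feature vector of Section 9.3.1, which gives a simple consistency check when extending the scheme to more atom types.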


As an example we investigated the Ti4H7 system. We used $p$ in the range from 0.5 to 4.75 and $r_{\mathrm{cut}}$ from 0.5 to $r_{\mathrm{cut}}^{\mathrm{global}}$ and searched for the best pairs in terms of the lowest RMSE. The minimum RMSE was achieved with $r_{\mathrm{cut}}$ = 1.0 and $p$ = 2 (Figure 9.19a). Since these values can be considered the most informative, we established a sufficient number of $(r_{\mathrm{cut}}, p)$ pairs by varying $p$ from 1.25 to 2.5 (Figure 9.19b). The minimum RMSE on both the training and the test set could be reached with four pairs. We also investigated the dependence of the RMSE on the training set size in order to avoid overfitting. For this, we took structures from the first 90% of the whole database as a training set, leaving the remaining 10% for testing (Figure 9.20). The plots show that using more than 20% of the structures from the QMD run allows us to train an acceptable model for the energy and forces.

Figure 9.19 Determination of the optimal $(r_{\mathrm{cut}}, p)$ set. (a) The dependence of the RMSE on the value of $p$ for a constant $r_{\mathrm{cut}}$ = 1.0. (b) The relation between the RMSE (test and train) and the number of $(r_{\mathrm{cut}}, p)$ pairs. [Vertical axis in both panels: RMSE (eV/Å).]

Figure 9.20 Learning curves for Ti4H7 on (a) energies and (b) forces. [Both panels: test and train RMSE versus train size (%).]

We compared the predicted values of the interatomic forces with those obtained in ab initio calculations (Figure 9.21). Trained on a successive 70% of the structures with 5 $(r_{\mathrm{cut}}, p)$ pairs, the model showed acceptable results: the RMSE was about 0.27 eV/Å on forces and 3 meV/atom on energies. The RMSE on forces is two times lower than with the method described in Section 9.3.1 (which treats Ti and H atoms as the same). This approach can be trivially extended to systems with many different atom types, so systems such as high-entropy alloys can be studied with it.

Figure 9.21 Comparison of ab initio (x) and model (y) projections of forces acting on titanium atoms (a) and hydrogen atoms (b) for Ti4H7. [Axes: force from MD (eV/Å) versus force from DFT (eV/Å); panel annotations give RMSE values of 0.776 eV/Å and 0.548 eV/Å.]

9.4 Statistical Approach for Constructing ML Potentials

9.4.1 Two-Body Potential

The energy of the system can be written as $E = \sum_{k=1}^{N_{\mathrm{tot}}} V(Dx_k)$, where $Dx_k$ is the atomic environment of the atoms and usually is considered to be a tuple (xi − xk )1≤i≤Ntot ,0