590c A Data Model Supporting Intelligent Search for Materials Research

Stephen D. Stamatis¹, Balachandra B. Krishnamurthy¹, Amr Shehab², Tanu Malik³, Leif Delgass¹, Steven R. Dunlop⁴, and James M. Caruthers¹. (1) School of Chemical Engineering, Purdue University, Forney Hall of Chemical Engineering, 480 Stadium Mall Drive, West Lafayette, IN 47907-2100, (2) Department of Computer Science, Purdue University, (3) Cyber Center, Purdue University, (4) ITaP, Purdue University

High-Throughput Experimentation (HTE) gives researchers the ability to generate massive amounts of data. HTE is typically used to identify ‘hot’ material candidates, while the information from the ‘cold’ candidates is discarded. However, if experiments are properly designed, all data are useful in developing an understanding of the material or process under investigation. To gain maximal value from HTE data, the data must be managed in a manner that facilitates knowledge extraction by integrating data from multiple sources, including both lab experiments and computer simulations. Traditional techniques that manage data in nested file folders on local computer network resources are capable of storing the data, but searching data stored in this manner is nearly impossible: the data are typically incomplete, have insufficient provenance, are not cataloged correctly, and cannot be queried intelligently using domain-specific language with complex syntax.

We have developed a general data model as part of the SciAether™ (www.SciAether.org) software system that is capable of serving the data needs of multiple chemistry domains in a flexible and searchable manner. The data model for experimental information distinguishes two classes of materials: (1) an ideal material, which could be a mixture of ideal materials; and (2) the actual material in a given batch, including all impurities, which is characterized by both its composition and the sequence of ‘unit operations’ used in its synthesis. Materials are involved in synthesis, characterization and performance experiments. In silico experiments are also included in this data model, because they involve ideal materials.

Intelligent searching is enabled by a project-specific ontology and similarity searching tools. The ontology enables the researcher to use terms that were not originally defined when the data were entered into the database. For example, a query can ask for all α-olefins even though the term α-olefin was never entered with the data; only specific instances of α-olefins, such as propylene and 1-hexene, were entered. A user-defined, rule-based engine is employed to classify ideal materials, and this classification is stored in the database to enhance query speed. The implementation of similarity search uses an information retrieval approach to give a rank-ordered list of structures that are similar to a target molecule. The method consists of the following steps: (1) the molecules of interest are broken into a series of individual fragments, which provide a ‘fingerprint’ of each molecule; (2) an inverse document list (an inverted index) is prepared, which indicates all molecules that contain a given fragment; and (3) the frequency at which specific fragments occur in the molecular population of interest is computed and stored in the database. Using this frequency table, the similarity between related molecules can be determined using heuristics informed by the user's preferences in previous searches.
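
The three similarity-search steps can be illustrated with a minimal Python sketch. It is not the SciAether implementation: the fragment sets are hand-written placeholders rather than fragments derived from real structures, the function names are hypothetical, and the scoring rule (a frequency-weighted overlap in which rare fragments count more) is an assumed stand-in for the user-informed heuristics, which the abstract does not specify.

```python
import math
from collections import defaultdict

# Step 1 (assumed input): each molecule's 'fingerprint' as a set of fragments.
# These fragment labels are illustrative placeholders, not real fragmentation output.
FINGERPRINTS = {
    "propylene": {"C=C", "CH3", "terminal-vinyl"},
    "1-hexene": {"C=C", "CH3", "CH2", "terminal-vinyl"},
    "2-hexene": {"C=C", "CH3", "CH2", "internal-alkene"},
    "hexane": {"CH3", "CH2"},
}


def build_inverted_index(fingerprints):
    """Step 2: map each fragment to the set of molecules that contain it."""
    index = defaultdict(set)
    for molecule, fragments in fingerprints.items():
        for fragment in fragments:
            index[fragment].add(molecule)
    return dict(index)


def fragment_weights(index, n_molecules):
    """Step 3: turn fragment frequencies into weights (rarer fragments weigh more)."""
    return {
        fragment: math.log(1 + n_molecules / len(molecules))
        for fragment, molecules in index.items()
    }


def rank_similar(target, fingerprints, index, weights):
    """Return (molecule, score) pairs ordered from most to least similar."""
    target_fragments = fingerprints[target]
    # Only molecules sharing at least one fragment with the target are candidates.
    candidates = set()
    for fragment in target_fragments:
        candidates |= index[fragment]
    candidates.discard(target)

    scored = []
    for molecule in candidates:
        shared = target_fragments & fingerprints[molecule]
        union = target_fragments | fingerprints[molecule]
        # Weighted overlap: shared fragment weight divided by total fragment weight.
        score = sum(weights[f] for f in shared) / sum(weights[f] for f in union)
        scored.append((molecule, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


index = build_inverted_index(FINGERPRINTS)
weights = fragment_weights(index, len(FINGERPRINTS))
ranked = rank_similar("1-hexene", FINGERPRINTS, index, weights)
for molecule, score in ranked:
    print(f"{molecule}: {score:.3f}")
```

Because candidates are gathered through the inverted index, molecules that share no fragment with the target are never scored, which is the usual reason for building such an index in the first place.
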
This organization of data, along with domain-specific ontologies and similarity searching capabilities, allows for high-level querying of the database. For example: find all data for batch polymerization (a specific unit operation) of α-olefins (a class of ideal materials) with metallocene catalysts (a class of ideal materials) that contain an aryloxide ligand (a class of ideal materials) similar to 2,6-diisopropylphenoxy (an ideal material). Such a query returns all actual materials that meet these criteria. The data model described above will be illustrated with case studies in the area of single-site olefin polymerization, which is a homogeneous catalytic process, and the water-gas shift reaction, which is a heterogeneous catalytic process.
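
As a rough illustration of how the ideal/actual material distinction and the stored rule-based classification could support such a query, the following Python sketch may help. Every class name, field, rule, descriptor string and batch record here is a hypothetical assumption made for the example; the abstract does not describe the actual schema, and the similarity criterion on the ligand is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdealMaterial:
    name: str
    features: frozenset  # hypothetical structural descriptors


@dataclass
class ActualMaterial:
    batch_id: str
    composition: dict      # ideal material name -> fraction, impurities included
    unit_operations: list  # ordered sequence of synthesis steps


# User-defined rules (assumed form): class name -> predicate on an ideal material.
RULES = {
    "alpha-olefin": lambda m: "terminal C=C" in m.features,
    "metallocene catalyst": lambda m: "cyclopentadienyl-metal" in m.features,
}


def classify(materials, rules):
    """Apply every rule once and keep the result, so queries need not re-run rules."""
    classes = {class_name: set() for class_name in rules}
    for material in materials:
        for class_name, predicate in rules.items():
            if predicate(material):
                classes[class_name].add(material.name)
    return classes


def find_batches(batches, classes, unit_operation, required_classes):
    """Return batch ids that used the unit operation and contain every required class."""
    hits = []
    for batch in batches:
        if unit_operation not in batch.unit_operations:
            continue
        if all(classes[c].intersection(batch.composition) for c in required_classes):
            hits.append(batch.batch_id)
    return hits


# Made-up records for illustration only.
ideal_materials = [
    IdealMaterial("propylene", frozenset({"terminal C=C"})),
    IdealMaterial("1-hexene", frozenset({"terminal C=C"})),
    IdealMaterial("hexane", frozenset()),
    IdealMaterial("Cp2ZrCl2", frozenset({"cyclopentadienyl-metal"})),
]
batches = [
    ActualMaterial("B-001", {"1-hexene": 0.98, "hexane": 0.019, "Cp2ZrCl2": 0.001},
                   ["batch polymerization", "quench"]),
    ActualMaterial("B-002", {"hexane": 1.0}, ["batch polymerization"]),
    ActualMaterial("B-003", {"propylene": 0.999, "Cp2ZrCl2": 0.001},
                   ["continuous polymerization"]),
]

stored_classes = classify(ideal_materials, RULES)
result = find_batches(batches, stored_classes, "batch polymerization",
                      ["alpha-olefin", "metallocene catalyst"])
print(result)
```

Note that the term "alpha-olefin" never appears in any batch record; the query resolves it through the stored classification of propylene and 1-hexene, which mirrors the role the ontology plays in the text.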

Web Page: www.SciAether.org