Opticalmaterials.org demonstrates an auto-generated database of optical properties and provides flexible ways to query and analyse it.

The database is preloaded and ready to use on the website. Please refer to the documentation page for detailed guidance on how to use it.

Autogeneration of optical property databases

This project uses ChemDataExtractor, a state-of-the-art natural language processing package, to auto-generate an optical properties database.

Machine-learning methods such as conditional random fields are used in entity recognition and document processing. Rule-based text parsers are combined with the semi-supervised Snowball algorithm and table data parsers to extract structured information from unstructured text.
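To give a feel for what a rule-based text parser does, here is a minimal sketch using a single regular expression. It is not ChemDataExtractor's API or grammar. The pattern, the function name extract_properties and the output layout are assumptions made only for illustration, and real parsers handle far more sentence forms.

```python
import re

# Hypothetical rule: "<property> of <compound> is <number>".
# This single pattern is an illustrative assumption, not a ChemDataExtractor rule.
PROPERTY_RULE = re.compile(
    r"(?P<specifier>refractive index|dielectric constant)\s+of\s+"
    r"(?P<name>[A-Za-z0-9()\-]+)\s+(?:is|was)\s+"
    r"(?P<value>\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def extract_properties(sentence):
    """Return one dict per match of the rule in the sentence."""
    records = []
    for match in PROPERTY_RULE.finditer(sentence):
        records.append(
            {
                "specifier": match.group("specifier").lower(),
                "name": match.group("name"),
                "value": float(match.group("value")),
            }
        )
    return records


# Made-up example sentence, used only to exercise the rule.
example = extract_properties("The refractive index of SiO2 is 1.46 at 589 nm.")
```

A sentence that does not fit the rule yields an empty list, which is one reason rule-based parsers are combined with other methods in practice.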

Data cleaning

To increase the precision of the database, a series of filters was applied to clean the data after mining.

Records were first cleaned by compound name: any record whose name contained unrecognised symbols (e.g. '$') was removed. Next, records were cleaned by specifier: any record with an unrecognised specifier was removed. Finally, records with extreme values, for example a refractive index >= 10, were removed because they were unlikely to be correct.
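The three cleaning steps can be sketched as a chain of filters. This is a minimal illustration, not the project's actual code: the record layout, the allowed-character pattern and the specifier list are assumptions. Only the refractive index threshold of 10 comes from the description above.

```python
import re

# Assumed set of characters allowed in a compound name; '$' is not among them.
ALLOWED_NAME = re.compile(r"^[A-Za-z0-9\s\-()\[\],.+:']+$")

# Assumed list of recognised specifiers (hypothetical, for illustration only).
KNOWN_SPECIFIERS = {"refractive index", "n", "dielectric constant"}

# Upper limits per property; only the refractive index limit is stated in the text.
EXTREME_LIMITS = {"refractive index": 10.0}


def clean_records(records):
    """Apply the name, specifier and extreme-value filters in that order."""
    kept = []
    for record in records:
        if not ALLOWED_NAME.match(record["name"]):
            continue  # name contains an unrecognised symbol
        if record["specifier"].lower() not in KNOWN_SPECIFIERS:
            continue  # specifier is not recognised
        limit = EXTREME_LIMITS.get(record["property"])
        if limit is not None and record["value"] >= limit:
            continue  # value is too extreme to be plausible
        kept.append(record)
    return kept


# Made-up records, used only to exercise each filter once.
sample = [
    {"name": "SiO2", "property": "refractive index", "specifier": "refractive index", "value": 1.46},
    {"name": "Ti$O2", "property": "refractive index", "specifier": "n", "value": 2.6},
    {"name": "GaAs", "property": "refractive index", "specifier": "index-like", "value": 3.9},
    {"name": "ZnO", "property": "refractive index", "specifier": "n", "value": 12.0},
]
cleaned = clean_records(sample)
```

Running the filters in this order means the cheapest check, the name pattern, discards malformed records before the other checks are evaluated.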

Technical validation of database

After data cleaning, 500 records were randomly selected for each optical property to evaluate the precision and recall of the database. A detailed discussion of the technical validation can be found on the citing page.

Method                 Text     Table
Refractive Index       90.1%    71.9%
Dielectric Constant    78.6%    79.0%

Method                 Text     Table
Refractive Index       71.6%    78.4%
Dielectric Constant    59.1%    72.7%

Overall, our database exhibits a precision of 77.1%.
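For readers unfamiliar with the two metrics, precision is the fraction of extracted records that are correct, and recall is the fraction of correct records in the source that were actually extracted. The sketch below uses the standard definitions. The function name and the counts in the example are illustrative assumptions, not figures from the validation.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from manually labelled counts.

    precision = TP / (TP + FP): share of extracted records that are correct.
    recall    = TP / (TP + FN): share of correct records that were extracted.
    """
    extracted = true_positives + false_positives
    relevant = true_positives + false_negatives
    if extracted == 0 or relevant == 0:
        raise ValueError("precision and recall are undefined for empty counts")
    return true_positives / extracted, true_positives / relevant


# Illustrative counts only; these are not taken from the database validation.
example_precision, example_recall = precision_recall(8, 2, 2)
```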

