Opticalmaterials.org demonstrates an auto-generated database of optical properties and provides flexible ways to query and analyse it.
The database is preloaded and ready to use on the website. Please refer to the documentation page for detailed guidance on how to use it.
This project uses ChemDataExtractor, a state-of-the-art natural language processing package, to auto-generate the optical properties database.
Machine-learning methods such as conditional random fields are used in the entity recognition and document processing stages. Rule-based text parsers are combined with the semi-supervised Snowball algorithm and table parsers to extract structured information from unstructured text.
To increase the precision of the database, a series of filters was applied to clean the data after mining.
Records were first cleaned by compound name: any record whose name contained unrecognised symbols (e.g. '$') was removed. Next, records were cleaned by specifier: any record with an unrecognised specifier was removed. Finally, records with extreme values (for example, a refractive index >= 10) were removed because they were unlikely to be correct.
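The three cleaning filters can be illustrated with a minimal Python sketch. This is not the project's actual code. The record fields (`name`, `specifier`, `property`, `value`), the character whitelist and the specifier set are hypothetical and chosen only for illustration. Only the '$' example and the refractive index >= 10 cut-off come from the description above.

```python
import re

# Hypothetical whitelist of characters allowed in a compound name;
# anything else (e.g. '$') marks the record as unrecognised.
VALID_NAME = re.compile(r"[A-Za-z0-9\s\-+()\[\],.:'/]+")

# Hypothetical set of recognised specifiers (assumption, not from the source).
KNOWN_SPECIFIERS = {"refractive index", "n", "dielectric constant", "permittivity"}

# Upper bounds per property; values at or above the bound are treated as extreme.
# The refractive index bound of 10 is the example given in the text.
EXTREME_VALUE = {"refractive index": 10.0}


def clean(records):
    """Apply the name, specifier and extreme-value filters in that order."""
    kept = []
    for rec in records:
        # 1. Compound-name filter: drop names with unrecognised symbols.
        if not VALID_NAME.fullmatch(rec["name"]):
            continue
        # 2. Specifier filter: drop records with unrecognised specifiers.
        if rec["specifier"].lower() not in KNOWN_SPECIFIERS:
            continue
        # 3. Extreme-value filter: drop values at or above the bound, if one is set.
        bound = EXTREME_VALUE.get(rec["property"])
        if bound is not None and rec["value"] >= bound:
            continue
        kept.append(rec)
    return kept
```

Applying the filters in sequence means a record is kept only if it passes all three, so each filter can only remove records and never restores one rejected earlier.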
After data cleaning, 500 records were randomly selected for each optical property to evaluate the precision and recall of the database. A detailed discussion of the technical validation can be found on the citing page.
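For readers unfamiliar with these metrics, the sketch below shows the standard definitions of precision and recall together with a simple random draw of validation records. It is only an illustration under stated assumptions. The function names, the fixed seed and the idea of counting true positives, false positives and false negatives by manual inspection are assumptions here, not a description of the project's validation protocol.

```python
import random


def sample_for_validation(records, k=500, seed=0):
    """Draw up to k records at random, without replacement, for manual checking."""
    rng = random.Random(seed)  # fixed seed (assumption) so the draw is repeatable
    return rng.sample(records, min(k, len(records)))


def precision(true_pos, false_pos):
    """Fraction of extracted records that are correct: TP / (TP + FP)."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


def recall(true_pos, false_neg):
    """Fraction of records present in the source that were extracted: TP / (TP + FN)."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0
```

Precision measures how many extracted records are correct, while recall measures how many of the records actually present in the source documents were found, so the two are counted against different reference sets.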
Validation results: precision by property and extraction method
Property | Text | Table |
---|---|---|
Refractive Index | 90.1% | 71.9% |
Dielectric Constant | 78.6% | 79.0% |
Validation results: recall by property and extraction method
Property | Text | Table |
---|---|---|
Refractive Index | 71.6% | 78.4% |
Dielectric Constant | 59.1% | 72.7% |
Overall, our database exhibits a precision of 77.1%.