ChemDataExtractor Documentation




Opticalmaterials.org is a web-based user interface that helps users query an optical property database auto-generated by mining the scientific literature.

The database was generated with the ChemDataExtractor toolkit, which automatically extracts the refractive indices and dielectric constants of materials from scientific articles.

For a guide on using the ChemDataExtractor toolkit, please refer to ChemDataExtractor.org.



If you use ChemDataExtractor as a resource in your research, please cite the following work:

Swain, M. C., & Cole, J. M. "ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature", J. Chem. Inf. Model. 2016, 56 (10), pp 1894–1904 10.1021/acs.jcim.6b00207

Generate your own database

This page gives a brief demonstration of how this database was constructed. It assumes you already have ChemDataExtractor installed.

The simplest way to load a document into ChemDataExtractor is to call the 'from_file' method of the 'Document' class. A scientific article from the Royal Society of Chemistry with DOI 10.1039/C0CP02270E is used here as an example:

>>> from chemdataextractor import Document

>>> document = Document.from_file(r'.\10.1039/C0CP02270E.xml')

>>> document
<Document: 247 elements>

The processed document is a 'Document' object that brings together the different parts of the article, such as the title, paragraphs, and sentences. Once it is loaded, users must specify one or more property models for this object. Here, the 'RefractiveIndex' model is loaded to mine relationships involving the refractive index.

>>> from chemdataextractor.model.model import RefractiveIndex

>>> document.models = [RefractiveIndex]

Relationships found in the document will be collected as a 'records' object:

>>> document.records
[<RefractiveIndex>, <RefractiveIndex>, <RefractiveIndex> ...]

Each relationship found in the paper is represented as a 'RefractiveIndex' object. Users can call records.serialize() to access all of these relationships, or records[index].serialize() to access a single relationship:

>>> document.records.serialize()
[{'RefractiveIndex': {'raw_value': '1.372', 'value': [1.372], 'specifier': 'R.I', 'compound': {'Compound': {'names': ['Acetic acid']}}}}, {'RefractiveIndex': {'raw_value': '1.419', 'value': [1.419], 'specifier': 'R.I', 'compound': {'Compound': {'names': ['Glutaric acid']}}}}, {'RefractiveIndex': {'raw_value': '1.428', 'value': [1.428], 'specifier': 'R.I', 'compound': {'Compound': {'names': ['Pyruvic acid']}}}} ...]

>>> document.records[0].serialize()
{'RefractiveIndex': {'raw_value': '1.372', 'value': [1.372], 'specifier': 'R.I', 'compound': {'Compound': {'names': ['Acetic acid']}}}}

Users may instead be interested only in relationships found in tables:

>>> for table in document.tables:
...     for record in table.records:
...         record.serialize()
{'RefractiveIndex': {'raw_value': '1.372', 'value': [1.372], 'specifier': 'R.I', 'compound': {'Compound': {'names': ['Acetic acid']}}}}
{'RefractiveIndex': {'raw_value': '1.419', 'value': [1.419], 'specifier': 'R.I', 'compound': {'Compound': {'names': ['Glutaric acid']}}}}
{'RefractiveIndex': {'raw_value': '1.428', 'value': [1.428], 'specifier': 'R.I', 'compound': {'Compound': {'names': ['Pyruvic acid']}}}}

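To extract relationships only from text, remove the table parser from the model before accessing the records:

>>> from chemdataextractor.parse.auto import AutoTableParser  # module path as in ChemDataExtractor 2.x
>>> # Rebuild the list rather than removing items while iterating over it
>>> RefractiveIndex.parsers = [p for p in RefractiveIndex.parsers if not isinstance(p, AutoTableParser)]
>>> document.records.serialize()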
[{'RefractiveIndex': {'raw_value': '1.49-1.50', 'value': [1.49, 1.5], 'specifier': 'R.I.', 'compound': {'Compound': {'names': ['toluene']}}}} ...]

Once users have obtained the relationships, they can write them to a JSON or CSV file:

>>> import json
>>> with open(r"yourfilepath", 'a', encoding='utf-8') as json_file:
...     json.dump(document.records.serialize(), json_file, ensure_ascii=False)
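The same serialized records can also be flattened into CSV. The following is a minimal sketch, assuming each record has the nested structure shown in the serialized output above. The helper name records_to_rows, the column names, and the sample data (copied from that output) are illustrative and not part of ChemDataExtractor.

```python
import csv
import io

# Sample data copied from the serialized output shown earlier; in practice
# this would be the result of document.records.serialize().
serialized = [
    {'RefractiveIndex': {'raw_value': '1.372', 'value': [1.372], 'specifier': 'R.I',
                         'compound': {'Compound': {'names': ['Acetic acid']}}}},
    {'RefractiveIndex': {'raw_value': '1.419', 'value': [1.419], 'specifier': 'R.I',
                         'compound': {'Compound': {'names': ['Glutaric acid']}}}},
]


def records_to_rows(serialized_records):
    """Flatten nested record dicts into flat rows (hypothetical helper)."""
    rows = []
    for record in serialized_records:
        for model_name, fields in record.items():
            names = fields.get('compound', {}).get('Compound', {}).get('names', [])
            rows.append({
                'model': model_name,
                'compound': '; '.join(names),
                'raw_value': fields.get('raw_value', ''),
                'specifier': fields.get('specifier', ''),
            })
    return rows


rows = records_to_rows(serialized)

# Written to an in-memory buffer here; to write to disk, pass a file opened
# with open(path, 'w', newline='', encoding='utf-8') instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['model', 'compound', 'raw_value', 'specifier'])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Keeping raw_value as a string preserves ranges such as '1.49-1.50' exactly as they were extracted.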

Database overview

Before querying the databases, go to the "Access databases" page and select the database (refractive index or dielectric constant) that you would like to query.

Search records in the database

The website offers three ways to query the two optical property databases.

Option 1: Query by a compound name

The simplest way to query the database is by chemical compound name, for example, BiFeO3. A basic form of spelling auto-correction is implemented, so searching for biFeO3 or bifeo3 will yield the same result as BiFeO3.


Organic compounds can be searched either by compound name or by SMILES notation. For example, to search the records of acetone, entering acetone, Acetone or CC(=O)C will yield the same result.
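How the website normalises compound names is not described here. For records you have generated yourself, one simple way to get comparable case-insensitive matching is to compare casefolded strings. This sketch assumes the serialized record structure shown in the "Generate your own database" section. The helper find_by_name is hypothetical, and SMILES lookup is not covered.

```python
def find_by_name(serialized_records, query):
    """Return records whose compound names match `query`, ignoring case."""
    wanted = query.strip().casefold()
    matches = []
    for record in serialized_records:
        for fields in record.values():
            names = fields.get('compound', {}).get('Compound', {}).get('names', [])
            if any(name.casefold() == wanted for name in names):
                matches.append(record)
                break  # one match per record is enough
    return matches


# Sample records in the same shape as document.records.serialize() output.
records = [
    {'RefractiveIndex': {'raw_value': '1.372', 'value': [1.372],
                         'compound': {'Compound': {'names': ['Acetic acid']}}}},
    {'RefractiveIndex': {'raw_value': '1.428', 'value': [1.428],
                         'compound': {'Compound': {'names': ['Pyruvic acid']}}}},
]

hits = find_by_name(records, 'acetic ACID')
```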


Option 2: Query by a Digital Object Identifier (DOI)

As all records were mined and processed from published scientific articles, they were tagged with DOIs of their original papers.


The database can be searched by a single publication DOI, in which case it displays all records mined from that publication. This may be helpful if you have questions about one search result or would like to explore other results from a paper of interest.

DOI searches only accept inputs longer than 10 characters.

Option 3: Query by value range

Alternatively, the database can be searched by the range of values you are interested in.


By entering a minimum value and/or a maximum value, corresponding records with values within that range will be displayed.

Searching by value range only allows pure numbers as inputs.
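The same kind of range filter can be applied to records you have generated yourself. The sketch below assumes the serialized structure shown earlier, in which 'value' is a list (a reported range such as 1.49-1.50 becomes two numbers). Treating a record as a match only when all of its values fall inside the bounds is an assumption made here, not necessarily the rule used by the website. The helper in_range is hypothetical.

```python
def in_range(fields, minimum=None, maximum=None):
    """True if every extracted value lies within [minimum, maximum].

    A bound left as None is treated as open, mirroring the option of
    entering only a minimum or only a maximum.
    """
    values = fields.get('value', [])
    if not values:
        return False
    if minimum is not None and min(values) < minimum:
        return False
    if maximum is not None and max(values) > maximum:
        return False
    return True


# Sample records copied from the example outputs earlier in this guide.
records = [
    {'RefractiveIndex': {'raw_value': '1.372', 'value': [1.372],
                         'compound': {'Compound': {'names': ['Acetic acid']}}}},
    {'RefractiveIndex': {'raw_value': '1.419', 'value': [1.419],
                         'compound': {'Compound': {'names': ['Glutaric acid']}}}},
    {'RefractiveIndex': {'raw_value': '1.49-1.50', 'value': [1.49, 1.5],
                         'compound': {'Compound': {'names': ['toluene']}}}},
]

selected = [r for r in records
            if in_range(r['RefractiveIndex'], minimum=1.4, maximum=1.5)]
below = [r for r in records if in_range(r['RefractiveIndex'], maximum=1.4)]
```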

Search result page

The search results are shown once you submit your search criteria. The main results are displayed in a table that lets you sort by any column, search through the table content, and customise the number of entries shown per page.

Apart from the search results, users can also obtain summary statistics. Clicking "Statistics of search result" displays a histogram of the results along with their mean, median, skewness, and kurtosis.

Please be aware that if your search criteria yield a large number of records, the page may take longer to load. If the histogram is not displayed, please refresh the page.
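For reference, the four summary statistics can be reproduced offline with the Python standard library. This is a sketch under stated assumptions: skewness and kurtosis are computed from population moments, and kurtosis is reported as excess kurtosis (zero for a normal distribution). The website may use a different convention. The sample values are taken from the example outputs earlier in this guide.

```python
import statistics


def describe(values):
    """Mean, median, skewness and excess kurtosis from population moments."""
    n = len(values)
    mean = statistics.fmean(values)
    deviations = [v - mean for v in values]
    m2 = sum(d ** 2 for d in deviations) / n  # variance
    m3 = sum(d ** 3 for d in deviations) / n
    m4 = sum(d ** 4 for d in deviations) / n
    if m2 == 0:
        raise ValueError('skewness and kurtosis are undefined for constant data')
    return {
        'mean': mean,
        'median': statistics.median(values),
        'skewness': m3 / m2 ** 1.5,
        'kurtosis': m4 / m2 ** 2 - 3.0,  # excess kurtosis
    }


refractive_indices = [1.372, 1.419, 1.428, 1.49, 1.5]
summary = describe(refractive_indices)
```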

Contact us

If you encounter any problems or have any feedback about the search results or the website, please use the Contact us tab on the navigation bar to get in touch.

Modelling refractive index of inorganic compounds

The online refractive index modelling page can be accessed by clicking the Analysis tab on the navigation bar.

This page provides the functionality to build a customised machine learning model to predict the refractive index of any inorganic compound.

Training set

The website uses a training set containing more than 400 refractive index records of different compounds, with ready-constructed features as columns. The training set was obtained from the large database mined from scientific articles after a series of filtering and cleaning steps.

The features are purely atomic features. For example, the average covalent radius of $Fe_3O_4$ is calculated as [3r(Fe) + 4r(O)] / 7, where r(Fe) and r(O) are the covalent radii of iron and oxygen.
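The composition-weighted average described above can be sketched in a few lines. The radii below are illustrative placeholders in picometres, since the atomic property table used by the website is not given in this guide. The helpers parse_formula and average_feature are hypothetical, and the parser handles only simple formulas without parentheses or fractional counts.

```python
import re

# Illustrative covalent radii in picometres; placeholders, not necessarily
# the values used to build the website's training set.
COVALENT_RADIUS_PM = {'Fe': 132.0, 'O': 66.0}


def parse_formula(formula):
    """Parse a simple formula such as 'Fe3O4' into {'Fe': 3, 'O': 4}."""
    counts = {}
    for symbol, number in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def average_feature(formula, atomic_values):
    """Average an atomic property over all atoms in the formula unit."""
    counts = parse_formula(formula)
    total_atoms = sum(counts.values())
    weighted_sum = sum(atomic_values[symbol] * n for symbol, n in counts.items())
    return weighted_sum / total_atoms


# [3 r(Fe) + 4 r(O)] / 7
avg_radius = average_feature('Fe3O4', COVALENT_RADIUS_PM)
```

The same function works for any other atomic property by passing a different lookup table.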

Construct the model

Choose the features

Users can choose up to 84 features under the "Click to select the features" button to train the model. Any combination of features is allowed. You can also search for a particular feature, select all features, or deselect all features by clicking the corresponding buttons.

If no feature is chosen, the system will use all 84 features for the linear-regression-based models. For the other models, features selected with GeneticSelectionCV (from the sklearn-genetic Python package) will be used.

Ridge regression:

Linear regression:

Lasso regression:

RandomForest regression:

GaussianProcess regression:

Support vector regression:

Choose the model

We use the Python library scikit-learn to implement our analysis. All models listed above are imported from scikit-learn. The hyperparameters of the models were optimised according to their 10-fold cross-validation scores.

If no feature is selected, the prediction result will be generated by pre-trained models trained on the features above.

If any features are selected, a new model will be trained on the spot using the selected features and model type.

Select elements from the periodic table

To help users construct a chemical formula, an interactive periodic table is built into the webpage. Clicking on an element, filling in the "Number of element" field, and clicking "Add element" adds the element to the formula input box.

Prediction result page

Once you submit your chemical formula together with the selected features and model, our algorithm will predict the refractive index based on your model.

Apart from the prediction result, you also have the chance to review the features and the model you selected. By clicking "Click to see the features", you can see all the features and the values calculated for them. The mean squared error and mean absolute error from the 10-fold cross-validation of your model are also shown.
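To make the two reported error measures concrete, the sketch below runs a 10-fold cross-validation in pure Python. It is not the website's implementation: a one-feature least-squares line stands in for the scikit-learn models, the data are synthetic, and the helper names fit_line and cross_validate are hypothetical. Errors are pooled over all held-out samples, which equals the per-fold average when the folds have equal size.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validate(xs, ys, k=10):
    """Return (mean squared error, mean absolute error) over k folds."""
    n = len(xs)
    squared, absolute = [], []
    for fold in range(k):
        # Every k-th sample is held out, so each sample is tested exactly once.
        test_idx = set(range(fold, n, k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        slope, intercept = fit_line(train_x, train_y)
        for i in test_idx:
            error = ys[i] - (slope * xs[i] + intercept)
            squared.append(error ** 2)
            absolute.append(abs(error))
    return sum(squared) / n, sum(absolute) / n


# Synthetic data: a straight line with a small alternating perturbation.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(xs)]
mse, mae = cross_validate(xs, ys, k=10)
```

Because the model is refitted without the held-out samples in each fold, both numbers estimate the error on unseen compounds rather than the fit to the training data.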