If you need a really large data set to test your methods, our data set of 133,885 species is one of the best choices. You can download it from figshare.

Distribution of species according to the number of electron pairs. Figure from the data descriptor.

This data set contains many properties calculated with electronic structure methods for all species. You can find all the details about data formats, properties and methods in the data descriptor “Quantum chemistry structures and properties of 134 kilo molecules” (DOI: 10.1038/sdata.2014.22), which was published in the very first volume of Scientific Data, a new venture of Nature Publishing Group. The work was done in collaboration with Prof. Dr. O. Anatole von Lilienfeld and his group members Dr. Raghunathan Ramakrishnan and Dr. Matthias Rupp.

The data set is also called QM9 on the “Quantum Machine” website and augments the previously used, smaller QM7 and QM7b data sets.

The molecules constituting the QM9 set are small organic molecules from the “library” of molecules called the “GDB-17 database”. All 134k molecules were optimized at the B3LYP/6-31G(2df,p) level of theory. Their harmonic frequencies, dipole moments, polarizabilities, and energies, enthalpies and free energies of atomization were calculated at the same level of theory.

If you need more accurate data, two subsets of the QM9 set are provided. One subset consists of 100 randomly picked molecules, whose atomization enthalpies were calculated at the G4MP2, G4 and CBS-QB3 levels. The other subset consists of 6095 isomeric C7H10O2 molecules, whose properties were calculated at G4MP2.
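Since isomers share the same molecular formula, picking out C7H10O2 molecules from a larger collection comes down to comparing element counts. Here is a minimal Python sketch of that idea. The function names and the example inputs are hypothetical, and it assumes you have already read the element symbols of each molecule from the files (the actual file layout is described in the data descriptor).

```python
from collections import Counter

# Target composition of the isomer subset named in the text: C7H10O2.
TARGET = Counter({"C": 7, "H": 10, "O": 2})


def composition(symbols):
    """Count how many atoms of each element a molecule contains."""
    return Counter(symbols)


def is_target_isomer(symbols, target=TARGET):
    """Return True if the element counts match the target formula exactly."""
    return composition(symbols) == target


# Hypothetical inputs: element symbols as they might be read from a structure file.
candidate = ["C"] * 7 + ["O"] * 2 + ["H"] * 10
other = ["C"] * 6 + ["O"] * 2 + ["H"] * 8

print(is_target_isomer(candidate))
print(is_target_isomer(other))
```

The order of the atoms does not matter here, only the counts, so the same check works regardless of how the atoms are listed in a file.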

Our work has already attracted attention. For instance, Henry S. Rzepa, in his post “Data galore! 134 kilomolecules”, recognizes the need for such data sets but also raises the important issue of data formats, which must be more unified and suitable for “machine reading”.

In addition, Ralph Koitz provides a web interface to access this data set. Within this web interface you can browse molecules, query data (for instance, find all epoxides) and plot data (for instance, HOMO vs LUMO levels).
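If you prefer to do this kind of querying offline, the same idea can be sketched in a few lines of Python. This is only an illustration and not how the web interface works: the record layout, the function names and the numbers are all made up, and the orbital energies are assumed to be in hartree.

```python
# Placeholder records: the labels and orbital energies (in hartree) below are
# made up for illustration and are NOT taken from the data set.
records = [
    {"label": "mol_a", "homo": -0.25, "lumo": 0.05},
    {"label": "mol_b", "homo": -0.30, "lumo": -0.02},
    {"label": "mol_c", "homo": -0.22, "lumo": 0.10},
]


def homo_lumo_points(rows):
    """Return (HOMO, LUMO) pairs, ready to be passed to any plotting tool."""
    return [(r["homo"], r["lumo"]) for r in rows]


def gap(row):
    """HOMO-LUMO gap of one record, in the same units as the inputs."""
    return row["lumo"] - row["homo"]


def select_by_gap(rows, max_gap):
    """Return the labels of all records whose gap does not exceed max_gap."""
    return [r["label"] for r in rows if gap(r) <= max_gap]


print(homo_lumo_points(records))
print(select_by_gap(records, max_gap=0.29))
```

The list returned by homo_lumo_points can be handed to whatever plotting tool you like to get a HOMO vs LUMO scatter plot. Structural queries such as finding all epoxides need a cheminformatics toolkit and are not covered by this sketch.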

P.S. Thanks to Raghunathan Ramakrishnan for pointing out some of the above links.


