WS22 database: Wigner sampling and geometry interpolation for configurationally diverse molecular datasets
Our work published in Scientific Data presents the WS22 database, which contains 10 flexible organic molecules of increasing complexity in chemical composition and accessible conformations. The WS22 database provides 1.18 million equilibrium and non-equilibrium molecular geometries, each with a set of quantum mechanical properties. The diversity and chemical complexity of the database make it a more demanding test for machine learning models.
Machine learning (ML) has become a powerful method for constructing full-dimensional potential energy surfaces (PESs). In recent years, considerable effort has gone into developing increasingly sophisticated ML potentials that fit the nonlinear PESs of organic molecules and thereby reduce the high computational cost of quantum chemical calculations. This progress, however, depends on extensive, high-quality quantum chemistry data.
As a complement to many existing databases, including our own recent VIB5 database, the new WS22 database provides broad and statistically robust hypersurfaces of quantum chemical properties of molecules with increasing complexity.
The dataset contains 10 flexible organic molecules of increasing complexity in chemical composition, most of which contain flexible functional groups capable of adopting different conformations. To ensure a broader sampling of the PESs in configurational space, we adopted a combined strategy for generating molecular geometries. As the workflow below shows, we first optimize the geometries and calculate the harmonic frequencies of different conformers of each molecule; these are then used to sample from the Wigner probability distribution function and generate non-equilibrium structures. For a more robust PES representation, we augment the data by geometry interpolation between different configurations, which yields tens of thousands of additional geometries. This interpolation allows us to cover regions close to the transition state structures lying between stable conformers.
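The two sampling steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workflow used to build WS22: it treats a single normal mode as a harmonic oscillator in its vibrational ground state, for which the Wigner distribution is a product of two Gaussians in position and momentum, and it interpolates linearly in Cartesian coordinates, which may differ from the interpolation scheme used in the paper. The function names, the default units (hbar = 1) and the argument layout are hypothetical.

```python
import math
import random


def sample_wigner_ground_state(omega, mass=1.0, hbar=1.0, rng=random):
    """Draw one (q, p) pair for a single harmonic mode.

    Assumption: vibrational ground state, where the Wigner function is
    W(q, p) ~ exp(-m*omega*q**2/hbar - p**2/(m*omega*hbar)),
    i.e. independent Gaussians in q and p.
    """
    sigma_q = math.sqrt(hbar / (2.0 * mass * omega))
    sigma_p = math.sqrt(hbar * mass * omega / 2.0)
    return rng.gauss(0.0, sigma_q), rng.gauss(0.0, sigma_p)


def interpolate_geometries(start, end, n_points):
    """Linearly interpolate between two geometries, endpoints included.

    Assumption: both geometries are lists of (x, y, z) tuples with the
    same atom ordering and a common orientation.
    """
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    if len(start) != len(end):
        raise ValueError("geometries must have the same number of atoms")
    frames = []
    for k in range(n_points):
        t = k / (n_points - 1)  # runs from 0.0 to 1.0
        frame = [
            tuple(a + t * (b - a) for a, b in zip(atom_start, atom_end))
            for atom_start, atom_end in zip(start, end)
        ]
        frames.append(frame)
    return frames
```

In a complete workflow, the sampled displacements of all modes would be transformed back to Cartesian coordinates through the normal-mode vectors; that step is omitted here.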
We calculated quantum chemical properties by performing single-point electronic structure calculations on all generated molecular configurations at the PBE0/6-311G* level of theory. These properties are:
- Potential energies
- Mulliken charges
- Dipole moments
- Quadrupole moments
- HOMO and LUMO energies
- Electronic spatial extent
We assessed the conformational diversity in the WS22 database qualitatively and quantitatively using principal component analysis (PCA) and the distribution of the root-mean-square deviation (RMSD) between each sampled geometry and the minimum energy structure; both show good coverage of the PESs. Importantly, the WS22 database has a much broader distribution of potential energies and atomic forces than the popular MD17 database obtained from classical molecular dynamics. This has implications for benchmarking machine learning potentials: as we have shown recently, methods rank differently depending on whether they are benchmarked on the MD17 or the WS22 database.
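As a rough illustration of the RMSD diagnostic mentioned above, the following sketch computes the RMSD between a sampled geometry and a reference structure. It makes simplifying assumptions that are not taken from the paper: both geometries share the same atom ordering, only the translation is removed (by centering each structure on its centroid), and no rotational alignment such as the Kabsch algorithm is applied. The function names are hypothetical.

```python
import math


def centroid(coords):
    """Return the unweighted geometric center of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(atom[i] for atom in coords) / n for i in range(3))


def rmsd(geometry, reference):
    """RMSD between two geometries after removing the translation.

    Assumption: same atom ordering in both structures and no rotational
    alignment (a Kabsch fit would be needed for arbitrary orientations).
    """
    if len(geometry) != len(reference):
        raise ValueError("geometries must have the same number of atoms")
    cg = centroid(geometry)
    cr = centroid(reference)
    squared_sum = 0.0
    for atom_g, atom_r in zip(geometry, reference):
        for i in range(3):
            diff = (atom_g[i] - cg[i]) - (atom_r[i] - cr[i])
            squared_sum += diff * diff
    return math.sqrt(squared_sum / len(geometry))
```

Applying `rmsd` to every sampled geometry against the minimum energy structure and histogramming the results gives a distribution of the kind described above.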
The lead author of the work, Max, set up an interactive dashboard, where users can visualize molecular structures and download geometries in standard XYZ format.
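For readers who want to process downloaded geometries, a single-frame XYZ file can be read with a few lines of Python. The sketch below assumes the conventional XYZ layout (atom count on the first line, a free-form comment on the second, then one `symbol x y z` line per atom); the exact content of the comment line in the WS22 files is not assumed. The function name `parse_xyz` and the water-like example coordinates are purely illustrative.

```python
def parse_xyz(text):
    """Parse a single-frame XYZ string into (comment, atoms).

    atoms is a list of (symbol, x, y, z) tuples with float coordinates.
    """
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])
    comment = lines[1].strip() if len(lines) > 1 else ""
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError("atom count does not match the header line")
    return comment, atoms


# Illustrative input only; these coordinates are not taken from WS22.
EXAMPLE_XYZ = """3
example water-like geometry
O 0.000 0.000 0.000
H 0.960 0.000 0.000
H -0.240 0.930 0.000
"""

comment, atoms = parse_xyz(EXAMPLE_XYZ)
```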
- Max Pinheiro Jr*†, Shuang Zhang†, Pavlo O. Dral, Mario Barbatti*. WS22 database: combining Wigner Sampling and geometry interpolation towards configurationally diverse molecular datasets. Sci. Data 2023, 10, 95. DOI: 10.1038/s41597-023-01998-3.