Toward Chemical Accuracy in Predicting Enthalpies of Formation with General-Purpose Data-Driven Methods

In our work published in the Journal of Physical Chemistry Letters, we investigate how well the general-purpose data-driven methods ANI-1ccx and AIQM1 calculate enthalpies of formation. Extensive benchmark tests show that both methods can come close to chemical accuracy (1 kcal/mol) at very low computational cost. Importantly, we propose a scheme for quantifying prediction uncertainty, which can be used to evaluate the confidence of a prediction, detect outliers, and find and correct mistakes in experimental data.

The enthalpy of formation is one of the crucial thermochemical properties chemists need for studying chemical reactions. Besides direct experimental measurements, which are time- and resource-consuming, enthalpies of formation can also be predicted accurately by modern quantum mechanical (QM) calculations. However, traditional QM predictions share a common problem: the more accurate the result we want, the higher the computational cost usually is. The rapidly growing applications of machine learning (ML) in quantum chemistry offer an alternative way around this accuracy-versus-cost conundrum.

In this study, we adopted two kinds of general-purpose ML strategies to calculate enthalpies of formation. One is a pure ML method, ANI-1ccx, which is an ANI-type neural network (NN) potential. The other is an ML-enhanced semiempirical QM (SQM) method, AIQM1, which has broader applicability but is a little slower. Both methods were trained to approach the gold-standard CCSD(T)/CBS level. The performance of ANI-1ccx and AIQM1 was systematically assessed on 14 typical benchmark data sets of molecules composed of C, H, N, and O atoms. The results show that ANI-1ccx and AIQM1 perform very well on most data sets and are comparable to the high-level G4 and G4MP2 composite methods at a much lower computational cost. For example, for the CHNO data set with 137 molecules, the mean absolute errors (MAEs) of G4MP2, G4, ANI-1ccx, and AIQM1 are 0.9, 0.75, 1.76, and 0.84 kcal/mol, respectively. However, the ANI-1ccx and AIQM1 calculations on this data set each take no more than 15 CPU-minutes, while the G4MP2 and G4 calculations require 5 and 11 CPU-days, respectively.

Importantly, unlike traditional QM methods, ML approaches are data-driven, and this can be exploited to quantify their uncertainty and systematically improve their accuracy. We proposed a scheme that quantifies the prediction uncertainty from the standard deviation of NN predictions, which allows us to estimate the confidence of our predictions and detect outliers. After all outliers are removed from the data sets, AIQM1 and ANI-1ccx reach chemical accuracy for most of them. For example, for the CHNO data set mentioned above, removing the outliers reduces the MAEs of ANI-1ccx and AIQM1 to 0.92 and 0.60 kcal/mol, respectively.
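The idea of using the spread of NN predictions as an uncertainty measure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the paper or in MLatom: it assumes the predictions come from an ensemble of networks, and the function names, the example numbers, and the threshold value are all hypothetical.

```python
import numpy as np

def ensemble_uncertainty(predictions):
    """Return per-molecule mean and sample standard deviation.

    predictions: array of shape (n_models, n_molecules), in kcal/mol.
    """
    predictions = np.asarray(predictions, dtype=float)
    return predictions.mean(axis=0), predictions.std(axis=0, ddof=1)

def mae(predicted, reference):
    """Mean absolute error between two arrays of equal length."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.mean(np.abs(diff)))

# Made-up enthalpies of formation (kcal/mol) from three hypothetical networks
# for four molecules; the last molecule has a large spread between networks.
ensemble = np.array([
    [-17.9, 12.4, -20.1, 30.0],
    [-17.7, 12.6, -19.8, 35.0],
    [-18.0, 12.5, -20.0, 25.0],
])
reference = np.array([-17.8, 12.5, -20.5, 22.0])  # made-up reference values

threshold = 1.0  # hypothetical uncertainty threshold, kcal/mol

mean, std = ensemble_uncertainty(ensemble)
confident = std <= threshold  # predictions above the threshold count as outliers

mae_all = mae(mean, reference)
mae_confident = mae(mean[confident], reference[confident])

print("uncertainties:", np.round(std, 2))
print("MAE, all molecules:      ", round(mae_all, 2))
print("MAE, confident only:     ", round(mae_confident, 2))
```

In this toy example, the molecule on which the networks disagree is also the one with the largest error, so excluding it lowers the MAE. How the threshold is chosen in practice is a separate question that this sketch does not address.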

Using our uncertainty quantification scheme, we analyzed the cases where AIQM1 and ANI-1ccx make confident predictions that nevertheless deviate strongly from the experimental reference data. This allowed us to detect potential errors in the reference values, and we suggested revised values supported by independent G4 and G4MP2 calculations. Finally, we hope that our scheme will lead to a positive feedback loop between experimental and computational thermochemistry.
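The selection logic described above (low uncertainty combined with a large deviation from experiment) can be written down as a simple filter. The sketch below is hypothetical: the `Prediction` class, the function name, both cutoffs, and all numbers are made up for illustration and are not taken from the paper or from MLatom.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    name: str            # molecule label (hypothetical)
    predicted: float     # predicted enthalpy of formation, kcal/mol
    uncertainty: float   # standard deviation of NN predictions, kcal/mol
    experimental: float  # experimental reference value, kcal/mol

def suspicious_references(predictions, max_uncertainty, min_deviation):
    """Names of molecules with a confident prediction far from experiment."""
    return [
        p.name
        for p in predictions
        if p.uncertainty <= max_uncertainty
        and abs(p.predicted - p.experimental) >= min_deviation
    ]

# Made-up data for illustration only.
data = [
    Prediction("mol_A", -17.8, 0.2, -17.9),  # confident and close to experiment
    Prediction("mol_B", 25.3, 0.3, 31.0),    # confident but far from experiment
    Prediction("mol_C", 40.1, 4.5, 33.0),    # far off, but the model is uncertain
]

flagged = suspicious_references(data, max_uncertainty=1.0, min_deviation=3.0)
print(flagged)
```

Only the second molecule is flagged here: the third also deviates strongly, but the high uncertainty suggests the model rather than the experiment is the more likely source of the discrepancy. A flagged value is a candidate for re-examination, for instance with an independent high-level calculation, and not proof of an experimental error.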

As usual, our implementations are available as open source and free of charge in MLatom, our package for atomistic machine learning simulations. A detailed tutorial on calculating enthalpies of formation with AIQM1 and ANI-1ccx is also available, and we hope it will be useful in your research.
