Machine Learning for Absorption Cross Sections
Bao-Xin Xue, Mario Barbatti*, Pavlo O. Dral*, Machine Learning for Absorption Cross Sections, J. Phys. Chem. A 2020, 124, 7199–7210. DOI: 10.1021/acs.jpca.0c05310.
Preprint on ChemRxiv, DOI: 10.26434/chemrxiv.12594191.
Short overview of the method:
- ML-NEA can speed up the calculation and increase the precision of absorption cross sections
- ML-NEA makes the approach less empirical by removing the arbitrary broadening parameter and giving a criterion for determining the required number of points in the ensemble
- ML-NEA converges quickly, even with only several hundred QC calculations
Simulating Absorption Spectra
Using the Beer-Lambert law, an absorption spectrum (UV-vis spectrum) is obtained by calculating the molar attenuation coefficient ε from experimental data. Because ε is directly proportional to the attenuation cross section σ, we can calculate σ to simulate the absorption spectrum (see the Wikipedia article for more details).
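The proportionality between ε and σ can be made concrete with a short sketch. It assumes ε is the decadic molar attenuation coefficient in L mol⁻¹ cm⁻¹, in which case σ = ln(10) · 1000 · ε / N_A in cm² per molecule. The function name is illustrative and not taken from any package.

```python
import math

AVOGADRO = 6.02214076e23  # mol^-1 (exact in the 2019 SI)

def epsilon_to_sigma(epsilon):
    """Convert a decadic molar attenuation coefficient (L mol^-1 cm^-1)
    to an attenuation cross section (cm^2 per molecule)."""
    # ln(10) converts base-10 absorbance to base-e attenuation;
    # the factor 1000 converts litres to cm^3.
    return math.log(10) * 1000.0 * epsilon / AVOGADRO

# Example: a strong band with epsilon = 1e4 L mol^-1 cm^-1
print(epsilon_to_sigma(1.0e4))
```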
A commonly used approach is single-point convolution (SPC), which performs quantum chemical calculations only at the ground-state geometry and then broadens the oscillator strengths with a Gaussian function.
Introduction to Nuclear Ensemble Approach
A much more accurate method is the nuclear ensemble approach (NEA). It calculates the cross section by averaging over multiple normalized broadening functions at different conformations. You can find more detail at this link. The following figure gives a concise schematic:
Different colors denote the excitation energies and oscillator strengths at different conformations.
Unlike traditional single-point convolution (SPC), NEA successfully predicts the absorption intensity of transitions that are forbidden (have zero oscillator strength) at the ground-state conformation, whereas SPC often fails to provide the correct peak shape and peak position.
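The difference between the two approaches can be illustrated with a minimal sketch. It assumes the commonly cited NEA expression, in which each geometry contributes an area-normalized line weighted by its excitation energy and oscillator strength, and it treats SPC as the special case of a one-geometry ensemble. The Gaussian convention (width taken as a standard deviation), the function names, and the synthetic ensemble data are assumptions made for illustration, not values from the paper.

```python
import numpy as np

# Physical constants (SI, CODATA 2018)
E_CHARGE = 1.602176634e-19   # C
HBAR = 1.054571817e-34       # J s
M_E = 9.1093837015e-31       # kg
C_LIGHT = 2.99792458e8       # m / s
EPS0 = 8.8541878128e-12      # F / m

# pi e^2 hbar / (2 m c eps0), converted from J m^2 to cm^2 eV
PREFACTOR_CM2_EV = (np.pi * E_CHARGE**2 * HBAR
                    / (2.0 * M_E * C_LIGHT * EPS0)) / E_CHARGE * 1.0e4

def gaussian(x, width):
    """Area-normalized Gaussian; `width` is used as the standard deviation (eV)."""
    return np.exp(-0.5 * (x / width) ** 2) / (width * np.sqrt(2.0 * np.pi))

def nea_cross_section(grid, energies, osc, width):
    """Cross section (cm^2) on an energy grid (eV).

    energies, osc: shape (n_points, n_states), the excitation energies (eV)
    and oscillator strengths at each geometry of the ensemble.
    """
    grid = np.asarray(grid, dtype=float)
    energies = np.atleast_2d(np.asarray(energies, dtype=float))
    osc = np.atleast_2d(np.asarray(osc, dtype=float))
    n_points = energies.shape[0]
    # Broadcast to (n_grid, n_points, n_states) and sum all lines
    diff = grid[:, None, None] - energies[None, :, :]
    lines = energies * osc * gaussian(diff, width)
    return PREFACTOR_CM2_EV / grid * lines.sum(axis=(1, 2)) / n_points

grid = np.linspace(3.0, 7.0, 401)
rng = np.random.default_rng(0)

# Synthetic ensemble: one state that is dark at the ground-state geometry
# but gains a little intensity at distorted geometries.
ens_energies = 5.0 + 0.2 * rng.standard_normal((200, 1))
ens_osc = 0.01 * rng.random((200, 1))

spc = nea_cross_section(grid, [5.0], [0.0], width=0.3)            # one geometry
nea = nea_cross_section(grid, ens_energies, ens_osc, width=0.05)  # ensemble

print(spc.max(), nea.max())
```

In this toy case the SPC spectrum is identically zero because the only geometry it sees has zero oscillator strength, while the ensemble average recovers a band.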
Problems in QC-NEA
From here on, one point refers to one conformation.
Although NEA simulated the absorption spectrum correctly, it required a whopping 1000 QC calculations, which consume considerable computational resources.
QC-NEA also has other problems. First, the broadening parameter is rather arbitrary, and adjusting it changes the spectrum substantially (left figure). Second, the number of points in the ensemble is also arbitrary, as it is unclear how many points are necessary to balance higher precision against lower computational cost (right figure).
Machine Learning Comes to the Rescue
Because many conformations are quite similar, ML provides a powerful way to interpolate between them. We therefore split the total ensemble into two parts: the first contains only a few points, which are calculated with the QC method (the orange sticks), and the remaining majority is predicted by an ML model trained on the first part (the
As for the arbitrary parameters, we can now get rid of them by setting them to fixed values: 50k points in the nuclear ensemble, which is large enough to obtain a precise spectrum and in turn allows the broadening parameter to be set to a very small value of 0.01 eV.
Because only the small first part requires QC calculations, this approach should be faster than QC-NEA at the same level of theory.
ML Method Introduction
We use the KREG model (kernel ridge regression, KRR, with the RE descriptor and the Gaussian kernel function) and the MLatom software for all ML tasks.
The RE descriptor consists of the ratios of the distance between two atoms at the equilibrium geometry to the distance between the same two atoms at the current point. For a molecule with N atoms, it has N(N-1)/2 elements.
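The descriptor described above can be sketched in a few lines. This is an illustration of the definition only, not MLatom's implementation; the function name and the toy three-atom geometry are hypothetical.

```python
from itertools import combinations
import math

def re_descriptor(eq_coords, coords):
    """Ratios r_eq / r for every atom pair.

    eq_coords, coords: sequences of (x, y, z) tuples in the same atom order.
    Returns a list of length N(N-1)/2.
    """
    pairs = combinations(range(len(coords)), 2)
    return [math.dist(eq_coords[i], eq_coords[j]) / math.dist(coords[i], coords[j])
            for i, j in pairs]

# Toy geometry with three atoms (arbitrary units)
equilibrium = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
distorted = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 0.9, 0.0)]

print(re_descriptor(equilibrium, distorted))
```

At the equilibrium geometry every element equals 1, so the descriptor measures how far each interatomic distance has moved from its equilibrium value.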
We train a separate ML model for the excitation energy and for the oscillator strength of each state, and then make predictions for the 50k points to calculate the cross section.
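As a rough illustration of this step, the following is a generic kernel ridge regression with a Gaussian kernel written in NumPy, with one model fitted per state and property. It is not MLatom's code: the class name, the hyperparameter values `sigma` and `lam`, and the random stand-in data are assumptions, and in practice the hyperparameters would be optimized.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

class KRR:
    """Plain kernel ridge regression (illustrative, not the MLatom API)."""

    def __init__(self, sigma=1.0, lam=1e-8):
        self.sigma, self.lam = sigma, lam

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        K = gaussian_kernel(self.X, self.X, self.sigma)
        # Solve (K + lam * I) alpha = y for the regression coefficients
        self.alpha = np.linalg.solve(K + self.lam * np.eye(len(K)),
                                     np.asarray(y, dtype=float))
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return gaussian_kernel(X, self.X, self.sigma) @ self.alpha

rng = np.random.default_rng(0)
n_train, n_ensemble, n_features, n_states = 50, 1000, 3, 2

# Stand-in data: descriptors near 1 and smooth fake excitation energies (eV)
X_train = rng.normal(1.0, 0.05, size=(n_train, n_features))
X_ensemble = rng.normal(1.0, 0.05, size=(n_ensemble, n_features))
weights = rng.normal(size=(n_features, n_states))
E_train = 5.0 + X_train @ weights

# One model per state; oscillator strengths would be handled the same way
models = [KRR().fit(X_train, E_train[:, s]) for s in range(n_states)]
E_pred = np.column_stack([m.predict(X_ensemble) for m in models])
print(E_pred.shape)
```

The predicted arrays have one row per ensemble point and one column per state, which is the shape needed to sum the broadened lines into a cross section.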
This figure shows how the spectrum of benzene changes with an increasing number of points (left: from 1 excitation; middle: from 3; right: from 10). To evaluate which spectrum is good enough, it is better to have an objective criterion.
We therefore suggest the relative integral change (RIC) as the criterion: the ratio of the green area to the area between the x axis and the reference spectrum curve.
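A minimal sketch of this criterion follows. It assumes the green area is the area enclosed between the tested spectrum and the reference, i.e. the integral of their absolute difference, and it uses a simple trapezoidal rule; the function names are illustrative.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal integral of y over x."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def relative_integral_change(sigma, sigma_ref, grid):
    """RIC = integral |sigma - sigma_ref| dE / integral sigma_ref dE."""
    sigma = np.asarray(sigma, dtype=float)
    sigma_ref = np.asarray(sigma_ref, dtype=float)
    return trapezoid(np.abs(sigma - sigma_ref), grid) / trapezoid(sigma_ref, grid)

# Toy example: a reference band and the same band shifted by 0.05 eV
grid = np.linspace(3.0, 7.0, 401)
reference = np.exp(-0.5 * ((grid - 5.0) / 0.3) ** 2)
shifted = np.exp(-0.5 * ((grid - 5.05) / 0.3) ** 2)

print(relative_integral_change(shifted, reference, grid))
```

With this definition a spectrum identical to the reference gives 0, and larger values mean a larger deviation relative to the total absorption.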
To verify that ML-NEA is good enough, we use the cross section from 50k QC calculations as the reference spectrum and repeatedly sample random sets of training points from the dataset to obtain dozens of ML cross sections for the same number of points. The average RIC values, with standard deviations as error bars, are plotted in the figure above. ML-NEA has a consistently lower RIC than QC-NEA, i.e. ML improves the accuracy of cross sections for the same number of QC calculations.
We also found that QC-NEA spectra built from different raw data differ from each other, whereas ML models trained on these data yield much more accurate spectra, with ML cross sections closer to each other.
In a real application, however, there is no reference to compare with, so we suggest using rRMSE (the relative change in the geometric mean of the validation root-mean-squared errors) as the criterion for judging convergence when the number of training points is increased by 50.
From this figure, we can judge the convergence using rRMSE. We recommend a convergence criterion of rRMSE = 0.1, which means that the ML-NEA method can be run iteratively to determine how many points are enough for convergence.
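The iterative procedure can be sketched as follows. The exact formula for rRMSE is not given here, so this sketch assumes it is the absolute change in the geometric mean of the validation RMSEs between two successive training-set sizes, divided by the earlier value. The callback `train_and_validate`, the starting size, the upper limit, and the mock error curve are all hypothetical.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def rrmse(prev_rmses, curr_rmses):
    """Relative change in the geometric mean of the validation RMSEs (assumed form)."""
    prev = geometric_mean(prev_rmses)
    curr = geometric_mean(curr_rmses)
    return abs(curr - prev) / prev

def converged_size(train_and_validate, start=50, step=50,
                   threshold=0.1, max_points=1000):
    """Grow the training set by `step` points until rRMSE drops below `threshold`.

    train_and_validate(n) must return the validation RMSEs of all ML models
    (one per state and property) trained on n points.
    Returns the converged number of points, or None if max_points is reached.
    """
    n = start
    prev = train_and_validate(n)
    while n + step <= max_points:
        n += step
        curr = train_and_validate(n)
        if rrmse(prev, curr) < threshold:
            return n
        prev = curr
    return None

def mock_validation(n):
    """Stand-in for real training: errors that decay as 1 / sqrt(n)."""
    return [0.3 / math.sqrt(n), 0.01 / math.sqrt(n)]

print(converged_size(mock_validation))  # prints 300
```

In a real run, `train_and_validate` would add 50 new QC points, retrain the models, and return their validation errors, so the loop stops as soon as extra points no longer change the errors appreciably.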
Application to a larger molecule
To show that ML-NEA is also applicable to medium-sized and large molecules, we applied the method to a larger molecule with 30 excitations.
For this molecule, it converged even faster than for benzene: even with 100 points, we already obtained a very good result.
How to run ML-NEA simulations?
We provide a user-friendly suite of programs to perform ML-NEA simulations based on MLatom and Newton-X. For this purpose, we prepared a special release of MLatom that automates most of the computational steps. You can find the tutorial and manual at the following link.