Wednesday, January 25, 2012

Accuracy and Performance of the VABC Descriptor


Last week I added an implementation of the VABC molecular descriptor to the chemkit library.

To test the accuracy of the descriptor I used the molecule data from the MMFF94 validation suite, which contains 753 drug-like organic compounds and their 3D coordinates.

I wrote a small application to generate the volume data. The code below will output a pair of comma-separated values containing the analytical and predicted volumes for each molecule in the file.

#include <iostream>

#include <chemkit/foreach.h>
#include <chemkit/molecule.h>
#include <chemkit/moleculefile.h>

using namespace chemkit;

int main()
{
    // load the structures from the MMFF94 validation suite
    MoleculeFile file("MMFF94_hypervalent.mol2");
    bool ok = file.read();
    if(!ok){
        std::cerr << file.errorString() << std::endl;
        return -1;
    }

    // print one "analytical,predicted" line per molecule
    foreach(const Molecule *molecule, file.molecules()){
        std::cout << molecule->descriptor("vdw-volume").toDouble();
        std::cout << ",";
        std::cout << molecule->descriptor("vabc").toDouble();
        std::cout << std::endl;
    }

    return 0;
}


The plot below shows the VABC predicted volumes against the corresponding analytically calculated volumes.


As you can see, there is a very strong correlation (R^2 = 0.968) between the analytically calculated volume and the predicted volume. This indicates that the VABC descriptor does a fairly good job of predicting a compound's van der Waals volume from just its structure.
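For anyone who wants to recompute the correlation from the comma-separated output, here is a minimal sketch. It assumes R^2 is taken as the squared Pearson correlation coefficient between the two columns (which matches the R^2 of a simple linear fit); the function name rSquared is just illustrative and is not part of chemkit.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Squared Pearson correlation between two samples of equal size.
// Returns NaN if the sizes differ, the input is empty, or either
// sample has zero variance.
double rSquared(const std::vector<double> &x, const std::vector<double> &y)
{
    if(x.size() != y.size() || x.empty()){
        return std::nan("");
    }

    // compute the mean of each sample
    const double n = static_cast<double>(x.size());
    double sumX = 0.0;
    double sumY = 0.0;
    for(std::size_t i = 0; i < x.size(); i++){
        sumX += x[i];
        sumY += y[i];
    }
    const double meanX = sumX / n;
    const double meanY = sumY / n;

    // accumulate the sums of squared deviations and cross products
    double sxx = 0.0;
    double syy = 0.0;
    double sxy = 0.0;
    for(std::size_t i = 0; i < x.size(); i++){
        const double dx = x[i] - meanX;
        const double dy = y[i] - meanY;
        sxx += dx * dx;
        syy += dy * dy;
        sxy += dx * dy;
    }

    if(sxx == 0.0 || syy == 0.0){
        return std::nan("");
    }

    return (sxy * sxy) / (sxx * syy);
}
```

Feeding it the analytical volumes as x and the VABC volumes as y gives a single number that can be compared with the value quoted above.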

As far as performance goes, calculating the VABC volumes for this data set was about five times faster than calculating the volumes analytically: approximately 220 msec for the VABC calculations versus 1100 msec for the analytical calculations.
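A rough sketch of how timings like these can be collected is shown here. It is only one possible approach, using std::chrono from the standard library; the helper name elapsedMilliseconds is made up for this example and is not a chemkit function.

```cpp
#include <chrono>

// Runs the given callable once and returns the wall-clock time it took,
// in milliseconds. steady_clock is used because it never jumps backwards.
template<typename Function>
double elapsedMilliseconds(Function function)
{
    const auto start = std::chrono::steady_clock::now();
    function();
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping the loop over file.molecules() in a lambda, once with descriptor("vabc") and once with descriptor("vdw-volume"), and passing each to this helper would give two numbers to compare. Reading the file should stay outside the timed region so that only the descriptor calculations are measured.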

I would also be happy to send the raw data in a csv file to anyone who wants to perform further analysis.

UPDATE (1/28/2012): Analysis of Outliers

A number of people have asked for additional information concerning the outliers in the plot above (especially the data points in the lower left).

Looking at the structures for the molecules with the largest differences between predicted and analytical volumes gave me a few insights as to why the VABC descriptor underestimated the volume in some cases.

First off, the VABC descriptor is only parameterized for 15 different elements (mostly from the upper-right of the periodic table). For molecules containing other elements (such as potassium) the atom contribution is zero, which leads to large underestimations. The analytical method uses the van der Waals radii from the Blue Obelisk Data Repository and thus can handle every known element. To correct for this I removed the molecules from the data set which contained elements that VABC was not parameterized for.

Secondly, VABC is designed for completely bonded molecules, not systems containing multiple separate molecules. For example, a few structures in the data set contain multiple water molecules surrounding another central molecule. To correct for this I removed the structures which contained multiple disconnected fragments.
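Detecting such structures comes down to counting connected components in the bond graph. The sketch below does this with a small union-find over atom indices. It is a stand-alone illustration with made-up function names and does not use chemkit, which may well offer this directly.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Finds the representative atom of the fragment containing atom i,
// shortening the path as it goes (path halving).
std::size_t findRoot(std::vector<std::size_t> &parent, std::size_t i)
{
    while(parent[i] != i){
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

// Returns the number of disconnected fragments in a structure with
// atomCount atoms, where each bond is a pair of atom indices.
std::size_t fragmentCount(std::size_t atomCount,
                          const std::vector<std::pair<std::size_t, std::size_t> > &bonds)
{
    // every atom starts out as its own fragment
    std::vector<std::size_t> parent(atomCount);
    std::iota(parent.begin(), parent.end(), std::size_t(0));

    std::size_t fragments = atomCount;
    for(const auto &bond : bonds){
        const std::size_t a = findRoot(parent, bond.first);
        const std::size_t b = findRoot(parent, bond.second);
        if(a != b){
            // the bond joins two fragments into one
            parent[a] = b;
            fragments--;
        }
    }

    return fragments;
}
```

A single water molecule (three atoms, two bonds) gives a count of one, while two separate waters in the same structure give two. Any structure with a count above one would be dropped by the filter described above.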

After applying those two filters, 16 structures were removed, leaving 737 molecules in the data set.

I reran the analysis and produced the following plot, which gave a new R^2 value of 0.979.



The major outliers from the lower left have been completely removed. Now, as you can see, most of the larger discrepancies come from VABC overestimating the volume. The largest difference was ~50 Å^3 and came from the following, rather exotic, structure:



The VABC descriptor does not take bond order into account, so the above compound, with its numerous double and triple bonds, gets assigned a much larger volume than the analytical calculation gives.

As you can see from the new plot, the VABC descriptor does a very good job of assigning volumes to compounds made up of elements it has parameters for. Extending the parameters to a wider range of elements should definitely be possible.

Hope this has been helpful!

2 comments:

  1. Kyle, can you identify a few of the major outlier in the bottom left? What would you say are features of those molecules that are not well captured by the VABC model?

  2. Great update on those details, thanx!
