Wednesday, February 23, 2011

Optimizing the UFF force field by unrolling the pow() function

Today I added a new performance benchmark to the chemkit library. The new benchmark, called uridine-minimization, measures the time to perform an energy minimization until convergence on a uridine molecule using the UFF force field.

Running the benchmark gives:
2,985 msecs per iteration
After running the benchmark through a profiler (I used callgrind from the valgrind package) I found some pretty interesting results and a potential easy performance optimization.

Total time spent in the program

Time spent in functions called by UffVanDerWaalsCalculation::energy()
We see that 53% of the total time the program is in the UffVanDerWaalsCalculation::energy() method and, of that, 85% of the time is spent executing the pow()function. A potential optimization is to remove the pow()function call and replace it with a manual calculation of the two exponent terms.

The current code is:
chemkit::Float UffVanDerWaalsCalculation::energy() const
{
    const chemkit::ForceFieldAtom *a = atom(0);
    const chemkit::ForceFieldAtom *b = atom(1);

    chemkit::Float d = parameter(0);
    chemkit::Float x = parameter(1);
    chemkit::Float r = distance(a, b);

    return d * (-2 * pow(x/r, 6) + pow(x/r, 12));
}

And the new code becomes:
chemkit::Float UffVanDerWaalsCalculation::energy() const
{
    const chemkit::ForceFieldAtom *a = atom(0);
    const chemkit::ForceFieldAtom *b = atom(1);

    chemkit::Float d = parameter(0);
    chemkit::Float x = parameter(1);
    chemkit::Float r = distance(a, b);

    chemkit::Float xr = x / r;
    chemkit::Float xr2 = xr * xr;
    chemkit::Float xr4 = xr2 * xr2;
    chemkit::Float xr6 = xr4 * xr2;
    chemkit::Float xr12 = xr6 * xr6;

    return d * (-2 * xr6 + xr12);
}
After making this change the benchmark time fell dramatically:
1,803 msecs per iteration
The new code is 60% faster. A good improvement. But, now we had to make sure that the new method is just as accurate as the old method. To test the difference in the normal and unrolled functions I wrote the following code which calculates the relative error introduced and the speed of each version:
#include <cmath>
#include <ctime>
#include <limits>
#include <iostream>

// calculates the van der waals energy for the given distance 'r'
double energy(double r)
{
    // parameters for Helium
    double d = 0.056;
    double x = 2.362;
    
    // [Rappe 1992] equation 20
    return d * (-2 * pow(x/r, 6) + pow(x/r, 12));
}

// calculates the van der waals energy for the given distance 'r'
//
// unrolls the pow() function for increased performance
double energy_unrolled(double r)
{
    // parameters for Helium
    double d = 0.056;
    double x = 2.362;

    double xr = x / r;
    double xr2 = xr * xr;
    double xr4 = xr2 * xr2;
    double xr6 = xr4 * xr2;
    double xr12 = xr6 * xr6;
    
    // [Rappe 1992] equation 20
    return d * (-2 * xr6 + xr12);
}

// returns the time (in msecs) per iteration of 'function' called 
// with r from 0 to 100 in picometer steps
double benchmark(double (*function)(double), int iterations)
{
    time_t start = time(0);

    for(int i = 0; i < iterations; i++){
        for(double r = 1e-2; r < 100; r += 1e-2){
            volatile double e = function(r);
        }
    }

    return difftime(time(0), start) / iterations * 1000;
}

// measures the speed of the energy() and energy_unrolled() functions
void measure_speed()
{
    std::cout << "energy time: " << benchmark(energy, 1e5) << " msec" << std::endl;
    std::cout << "energy_unrolled time: " << benchmark(energy_unrolled, 1e5) << " msec" << std::endl;
}

// measures the difference of the results of the energy() and
// energy_unrolled() functions
void measure_difference()
{
    int iterations = 0;
    double relative_error_sum = 0;
    double relative_error_max = 0;

    for(double r = 1e-2; r < 100; r += 1e-2){
        double e = energy(r);
        double e0 = energy_unrolled(r);

        double difference = std::abs(e - e0);
        double relative_error = std::abs(difference / e);

        relative_error_sum += relative_error;

        if(relative_error > relative_error_max){
            relative_error_max = relative_error;
        }

        iterations++;
    }

    double relative_error_mean = relative_error_sum / iterations;

    std::cout << "max relative error: " << relative_error_max << std::endl;
    std::cout << "mean relative error: " << relative_error_mean << std::endl;
}

int main(void)
{
    measure_speed();
    measure_difference();

    return 0;
}
The results are:
$ g++ energy.cpp -o energy -O2
$ ./energy
energy time: 2.28 msec
energy_unrolled time: 0.16 msec
max relative error: 5.6035e-15
mean relative error: 1.28544e-16
We see that the unrolled version is significantly faster but introduces a mean relative error of ~1.3e-16.

Instead of manually optimizing the code, it would be great if we could get the compiler to optimize it for us. Looking through the g++ documentation turned up the -ffast-math compiler flag. It claims to increase the performance of floating-point math at the expense of breaking IEEE floating-point compliance and potentially reducing accuracy.

Compiling and running the test with the -ffast-math flag gives:
$ g++ energy.cpp -o energy -O2 -ffast-math
$ ./energy
energy time: 0.13 msec
energy_unrolled time: 0.14 msec
max relative error: 1.81079e-14
mean relative error: 9.84612e-17
So we can see that using the -ffast-math flag enables the same optimization that we performed manually.

Going back to the UFF force field I compiled the UFF plugin with the -ffast-math flag and re-ran the benchmark. It gave:
1,590 msecs per iteration
Using the -ffast-math option increased the speed by 88%, an even bigger improvement than just manually unrolling the pow() function.

For now I will leave the unoptimized energy function and let the user choose if they want to trade speed for accuracy by passing the -ffast-math compiler flag when compiling.

No comments:

Post a Comment