Running the benchmark gives:

2,985 msecs per iterationAfter running the benchmark through a profiler (I used callgrind from the valgrind package) I found some pretty interesting results and a potential easy performance optimization.

Total time spent in the program |

Time spent in functions called by UffVanDerWaalsCalculation::energy() |

The current code is:

chemkit::Float UffVanDerWaalsCalculation::energy() const { const chemkit::ForceFieldAtom *a = atom(0); const chemkit::ForceFieldAtom *b = atom(1); chemkit::Float d = parameter(0); chemkit::Float x = parameter(1); chemkit::Float r = distance(a, b); return d * (-2 * pow(x/r, 6) + pow(x/r, 12)); }

And the new code becomes:

chemkit::Float UffVanDerWaalsCalculation::energy() const { const chemkit::ForceFieldAtom *a = atom(0); const chemkit::ForceFieldAtom *b = atom(1); chemkit::Float d = parameter(0); chemkit::Float x = parameter(1); chemkit::Float r = distance(a, b); chemkit::Float xr = x / r; chemkit::Float xr2 = xr * xr; chemkit::Float xr4 = xr2 * xr2; chemkit::Float xr6 = xr4 * xr2; chemkit::Float xr12 = xr6 * xr6; return d * (-2 * xr6 + xr12); }After making this change the benchmark time fell dramatically:

1,803 msecs per iterationThe new code is 60% faster. A good improvement. But, now we had to make sure that the new method is just as accurate as the old method. To test the difference in the normal and unrolled functions I wrote the following code which calculates the relative error introduced and the speed of each version:

#include <cmath> #include <ctime> #include <limits> #include <iostream> // calculates the van der waals energy for the given distance 'r' double energy(double r) { // parameters for Helium double d = 0.056; double x = 2.362; // [Rappe 1992] equation 20 return d * (-2 * pow(x/r, 6) + pow(x/r, 12)); } // calculates the van der waals energy for the given distance 'r' // // unrolls the pow() function for increased performance double energy_unrolled(double r) { // parameters for Helium double d = 0.056; double x = 2.362; double xr = x / r; double xr2 = xr * xr; double xr4 = xr2 * xr2; double xr6 = xr4 * xr2; double xr12 = xr6 * xr6; // [Rappe 1992] equation 20 return d * (-2 * xr6 + xr12); } // returns the time (in msecs) per iteration of 'function' called // with r from 0 to 100 in picometer steps double benchmark(double (*function)(double), int iterations) { time_t start = time(0); for(int i = 0; i < iterations; i++){ for(double r = 1e-2; r < 100; r += 1e-2){ volatile double e = function(r); } } return difftime(time(0), start) / iterations * 1000; } // measures the speed of the energy() and energy_unrolled() functions void measure_speed() { std::cout << "energy time: " << benchmark(energy, 1e5) << " msec" << std::endl; std::cout << "energy_unrolled time: " << benchmark(energy_unrolled, 1e5) << " msec" << std::endl; } // measures the difference of the results of the energy() and // energy_unrolled() functions void measure_difference() { int iterations = 0; double relative_error_sum = 0; double relative_error_max = 0; for(double r = 1e-2; r < 100; r += 1e-2){ double e = energy(r); double e0 = energy_unrolled(r); double difference = std::abs(e - e0); double relative_error = std::abs(difference / e); relative_error_sum += relative_error; if(relative_error > relative_error_max){ relative_error_max = relative_error; } iterations++; } double relative_error_mean = relative_error_sum / iterations; std::cout << "max relative error: " << relative_error_max << std::endl; std::cout << "mean relative error: " << relative_error_mean << std::endl; } int main(void) { measure_speed(); measure_difference(); return 0; }The results are:

$ g++ energy.cpp -o energy -O2We see that the unrolled version is significantly faster but introduces a mean relative error of ~1.3e-16.

$ ./energy

energy time: 2.28 msec

energy_unrolled time: 0.16 msec

max relative error: 5.6035e-15

mean relative error: 1.28544e-16

Instead of manually optimizing the code, it would be great if we could get the compiler to optimize it for us. Looking through the g++ documentation turned up the -ffast-math compiler flag. It claims to increase the performance of floating-point math at the expense of breaking IEEE floating-point compliance and potentially reducing accuracy.

Compiling and running the test with the -ffast-math flag gives:

$ g++ energy.cpp -o energy -O2 -ffast-mathSo we can see that using the -ffast-math flag enables the same optimization that we performed manually.

$ ./energy

energy time: 0.13 msec

energy_unrolled time: 0.14 msec

max relative error: 1.81079e-14

mean relative error: 9.84612e-17

Going back to the UFF force field I compiled the UFF plugin with the -ffast-math flag and re-ran the benchmark. It gave:

1,590 msecs per iterationUsing the -ffast-math option increased the speed by 88%, an even bigger improvement than just manually unrolling the pow() function.

For now I will leave the unoptimized energy function and let the user choose if they want to trade speed for accuracy by passing the -ffast-math compiler flag when compiling.

## No comments:

## Post a Comment