Optimizing a mathematical computation (multiplication and summation)

Suppose you want to calculate the sum of the squared differences of consecutive items:

$\sum_{i=1}^{N-1} (x_i - x_{i+1})^2$

The simplest code (input std::vector<double> xs, output sum2):

double sum2 = 0.;
double prev = xs[0];
for (std::vector<double>::const_iterator i = xs.begin() + 1;
     i != xs.end(); ++i)
{
    sum2 += (prev - *i) * (prev - *i); // only one subtraction, with compiler optimization
    prev = *i;
}


I hope the compiler performs the optimization noted in the comment above. If N is the length of xs, this takes N-1 multiplications and 2N-3 additions (where "additions" means either + or -).

Now, suppose you already know this quantity:

$x_1^2 + x_N^2 + 2 \sum_{i=2}^{N-1} x_i^2$

and call it sum. By the binomial square expansion:

$\sum_{i=1}^{N-1} (x_i - x_{i+1})^2 = sum - 2 \sum_{i=1}^{N-1} x_i x_{i+1}$

so the code will look like this:

double sum2 = 0.;
double prev = xs[0];
for (std::vector<double>::const_iterator i = xs.begin() + 1;
     i != xs.end(); ++i)
{
    sum2 += (*i) * prev;
    prev = *i;
}
sum2 = -sum2 * 2. + sum;


Here I have N multiplications and N-1 additions. In my case, N is around 100.

Compiling with g++ -O2, though, I was unable to measure any speedup (I am calling the inlined function 2M times). Why?

+2




2 answers


Multiplications are much more expensive than additions. Also, depending on the processor, additions and multiplications can be done in parallel, i.e. the next multiplication can start while the addition is still finishing (see http://en.wikipedia.org/wiki/Out-of-order_execution ).

Thus, reducing the number of additions will not impact performance much.

What you can do is make it easier for the compiler to vectorize your code, or vectorize it yourself. To help the compiler vectorize, I would use a plain array of doubles and use indices, not pointers.

EDIT: N = 100 might also be too small a number to see a difference in runtime. Try a larger N.



The code below is dirty, but it shows the improvement. Output:

1e+06
59031558
1e+06
18710703


You get a ~3x speedup.

#include <vector>
#include <iostream>

using namespace std;

unsigned long long int rdtsc(void)
{
  unsigned long long int x;
  unsigned a, d;

  __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));

  return ((unsigned long long)a) | (((unsigned long long)d) << 32);
}



double f(std::vector<double>& xs)
{
  double sum2 = 0.;
  double prev = xs[0];

  vector<double>::const_iterator iend = xs.end();
  for (vector<double>::const_iterator i = xs.begin() + 1;
       i != iend; ++i)
    {
      sum2 += (prev - (*i)) * (prev - (*i)); // only 1 - with compiler optimization
      prev = (*i);
    }

  return sum2;
}

double f2(double *xs, int N)
{
  double sum2 = 0;

  for(int i = 0; i < N - 1; i+=1) {
    sum2 += (xs[i+1] - xs[i])*(xs[i+1] - xs[i]);

  }

  return sum2;
}

int main(int argc, char* argv[])
{
  int N = 1000001;
  std::vector<double> xs;
  for(int i=0; i<N; i++) {
    xs.push_back(i);
  }

  unsigned long long int a, b;
  a = rdtsc();
  std::cout << f(xs) << endl;
  b = rdtsc();
  cout << b - a << endl;

  a = rdtsc();
  std::cout << f2(&xs[0], N) << endl;
  b = rdtsc();
  cout << b - a << endl;
}


+2




The addition can be free when it is executed as a fused x += a * b (a fused multiply-add). The compiler should be able to figure this out in the first version if the architecture supports it.

The remaining cost is probably the *i dereference that goes along with the arithmetic, which can be slower.



Do not call xs.end() on every loop iteration unless you expect the return value to change. If the compiler cannot optimize the call away, it will dominate the cost of the rest of the loop.

+1








