Optimizing code performance when odd / even threads do different things in CUDA

I have two large vectors and I am trying to perform a kind of element-wise multiplication, where each even-numbered element of the first vector is multiplied by the next odd-numbered element of the second vector, and each odd-numbered element of the first vector is multiplied by the previous even-numbered element of the second vector.

For instance:

vector 1 is V1(1) V1(2) V1(3) V1(4)
vector 2 is V2(1) V2(2) V2(3) V2(4)
V1(1) * V2(2)
V1(3) * V2(4)
V1(2) * V2(1)
V1(4) * V2(3)

I wrote CUDA code for this (Pds holds the elements of the first vector in shared memory, Nds the second):

// instead of % 2, checking the lowest bit to decide whether a number
// is odd or even is faster

if ((tx & 0x0001) == 0x0000)
    Nds[tx+1] = Pds[tx] * Nds[tx+1];
else
    Nds[tx-1] = Pds[tx] * Nds[tx-1];
__syncthreads();

      

Is there a way to speed up this code or avoid the branch divergence?

+2




2 answers


You should be able to eliminate the branch like this:



int tx_index = tx ^ 1; // equivalent to: tx_index = (tx & 1) ? tx - 1 : tx + 1
Nds[tx_index] = Pds[tx] * Nds[tx_index];


+6




This is an old post, but maybe someone will find my answer helpful. If your code branches on threadIdx (tx here), then you get warp divergence. You should avoid branch divergence within a warp because it serializes execution: the even-indexed threads execute first, and then the odd-indexed threads. If tx is threadIdx, try restructuring the algorithm so that the branching depends on blockIdx instead, which is uniform across all threads of a block.



0








