Also, floating-point exponentiation is very expensive (as are transcendental functions). If you compute the same exponentiation multiple times, compute it once, store the result in a scalar, and reuse it to save on computation. Also, if you're raising to an integer power, make sure the exponent is written as an integer! "a**2." takes far more time than "a**2", which the compiler will likely recognize and change to "a*a".

FMA - Fused Multiply-Add (also called fused multiply-accumulate) computes a * x + b in a single instruction for the cost of one multiplication (on x86 with AVX2-era FMA support and on CUDA), so structuring your code to take advantage of this can up to double the effective number of FLOPS you can get.

Accelerator Issues

Pushing Loops Down the Callstack

...
