Original post is here: eklausmeier.goip.de
Loop unrolling is not only good for sequential programs; it has similarly dramatic effects in highly parallel code as well. See Unrolling parallel loops (local copy), and also #pragma unroll in the NVIDIA CUDA programming guide.
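As a minimal sketch of what `#pragma unroll` does (the kernel name, block size, and problem size below are my own choices, not taken from the guide): when the trip count is a compile-time constant, the compiler can replace the loop with straight-line code, eliminating the loop counter and branch.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example kernel: each thread scales four consecutive elements.
// Because the trip count is a compile-time constant, #pragma unroll lets nvcc
// emit four straight-line updates instead of a loop.
__global__ void scale4(float *x, float a, int n)
{
    int base = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    #pragma unroll
    for (int k = 0; k < 4; ++k)
        if (base + k < n)
            x[base + k] *= a;
}

int main()
{
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    const int threads = 256;
    const int blocks  = (n / 4 + threads - 1) / threads;
    scale4<<<blocks, threads>>>(x, 2.0f, n);
    cudaDeviceSynchronize();

    printf("x[0] = %g, x[n-1] = %g\n", x[0], x[n - 1]);   // expect 2 and 2
    cudaFree(x);
    return 0;
}
```

Dropping the pragma and comparing the generated PTX (`nvcc -ptx`) is an easy way to see the difference between the looped and the unrolled straight-line code.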
Some bullet points from the presentation:
- More resources consumed per thread
- Note: each load costs 2 arithmetic instructions
  - 32 banks vs 32 streaming processors
  - but run at half clock rate
- These 3 loads are 6x more expensive than 1 FMA (the arithmetic is spelled out below, after the conclusion)
Conclusion:
- Simple optimization technique
- Resembles loop unrolling
- Often results in 2x speedup
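The arithmetic behind the load-cost bullets: on the hardware the slides assume, shared memory has 32 banks against 32 streaming processors but runs at half the clock rate, so one shared-memory load ties up the pipeline for roughly 2 arithmetic instructions; the 3 loads in the inner loop therefore cost about 3 × 2 = 6 instruction slots for every single FMA of useful work. Unrolling the parallel loop attacks exactly this ratio: each thread computes several outputs, so a value loaded once from shared memory is reused from a register across several FMAs. Below is a minimal sketch of that idea on a made-up operation (names, sizes, and the operation y[j] = Σₖ s[k]·B[k][j] are mine, not Volkov's code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

constexpr int K = 128;   // length of the shared vector s (compile-time for simplicity)

// Baseline: one output per thread. Every FMA pays one shared-memory load of
// sh[k] plus one global load of B.
__global__ void colsum1(const float *B, const float *s, float *y, int N)
{
    __shared__ float sh[K];
    for (int k = threadIdx.x; k < K; k += blockDim.x) sh[k] = s[k];
    __syncthreads();

    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= N) return;
    float c = 0.0f;
    for (int k = 0; k < K; ++k)
        c += sh[k] * B[k * N + j];
    y[j] = c;
}

// "Unrolled parallel loop": two outputs per thread. The value sh[k] is loaded
// from shared memory once into a register and reused for two FMAs, so shared
// traffic per useful arithmetic instruction is halved -- at the price of more
// registers per thread (the "more resources consumed" bullet above).
__global__ void colsum2(const float *B, const float *s, float *y, int N)
{
    __shared__ float sh[K];
    for (int k = threadIdx.x; k < K; k += blockDim.x) sh[k] = s[k];
    __syncthreads();

    int j = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (j + 1 >= N) return;                 // assumes N is even
    float c0 = 0.0f, c1 = 0.0f;
    for (int k = 0; k < K; ++k) {
        float a = sh[k];                    // one shared load ...
        c0 += a * B[k * N + j];             // ... feeds two FMAs
        c1 += a * B[k * N + j + 1];
    }
    y[j]     = c0;
    y[j + 1] = c1;
}

int main()
{
    const int N = 1 << 15, threads = 256;
    float *B, *s, *y;
    cudaMallocManaged(&B, (size_t)K * N * sizeof(float));
    cudaMallocManaged(&s, K * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    for (int k = 0; k < K; ++k) s[k] = 1.0f;
    for (size_t i = 0; i < (size_t)K * N; ++i) B[i] = 1.0f;

    colsum1<<<(N + threads - 1) / threads, threads>>>(B, s, y, N);
    colsum2<<<(N / 2 + threads - 1) / threads, threads>>>(B, s, y, N);
    cudaDeviceSynchronize();
    printf("y[0] = %g (expect %d)\n", y[0], K);

    cudaFree(B); cudaFree(s); cudaFree(y);
    return 0;
}
```

Note that colsum2 needs only half as many threads as colsum1 for the same N, which is exactly the "resembles loop unrolling" point: fewer, fatter threads with more work and more registers each.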
See Vasily Volkov.
Cédric Augonnet, Samuel Thibault and Raymond Namyst call Vasily Volkov a "CUDA-hero" in How to get portable performance on accelerator-based platforms without the agonizing pain.
In a similar vein, Dr. Mark Harris describes the beneficial effect of unrolling in parallel reduction.
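There, the last warp of the shared-memory tree reduction is unrolled into straight-line code with no __syncthreads(). The following is my own condensed variant, not Harris's code: the block size is fixed at 256, and warp shuffles replace the original volatile-shared-memory steps, which need extra care on current GPUs.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Condensed sketch of an unrolled reduction. Each block sums 2*blockDim.x
// inputs and writes one partial result per block.
__global__ void reduceSum(const float *in, float *out, int n)
{
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x * 2 + tid;

    // first add during the load: each thread combines two elements
    float v = (i < n ? in[i] : 0.0f);
    if (i + blockDim.x < n) v += in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    // tree reduction in shared memory, down to the last 64 values
    for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // unrolled last warp: straight-line code, no loop, no __syncthreads()
    if (tid < 32) {
        v = sdata[tid] + sdata[tid + 32];
        v += __shfl_down_sync(0xffffffffu, v, 16);
        v += __shfl_down_sync(0xffffffffu, v, 8);
        v += __shfl_down_sync(0xffffffffu, v, 4);
        v += __shfl_down_sync(0xffffffffu, v, 2);
        v += __shfl_down_sync(0xffffffffu, v, 1);
    }
    if (tid == 0) out[blockIdx.x] = v;
}

int main()
{
    const int n = 1 << 20, threads = 256;
    const int blocks = (n + 2 * threads - 1) / (2 * threads);

    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    reduceSum<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    double total = 0.0;
    for (int b = 0; b < blocks; ++b) total += out[b];  // finish the sum on the host
    printf("sum = %.0f (expect %d)\n", total, n);

    cudaFree(in); cudaFree(out);
    return 0;
}
```

Harris's treatment goes further and templates the block size so that the whole reduction loop can unroll at compile time; the sketch above only unrolls the final warp.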