The GNU Compiler Collection, gcc, offers multiple ways to perform SIMD calculations. There has always been the possibility of hardcoding assembler instructions within your source, of course. Furthermore, gcc offers so-called 'builtin' functions, which translate directly into assembler instructions but provide some 'glue' to make coding easier. These are described in the X86 Built-in functions and PowerPC Altivec Built-in functions chapters of the gcc manual.
And lastly, gcc has recently gained intrinsic support for some SIMD operations whereby the coder requests a vector of specified dimension and content, and then performs operations on that vector. Depending on compiler flags, these operations translate into either SIMD instructions or regular opcodes. This is described in the Vector Extensions chapter of the gcc manual.
We'll start out with this last variant as it is easiest on the eyes, and portable too:
    #include <stdio.h>

    typedef int v4sf __attribute__ ((mode(V4SF))); // vector of four single floats

    union f4vector
    {
      v4sf v;
      float f[4];
    };
This in itself does nothing; it only defines a union which is suitable for SIMD operations. The typedef creates a more legible name for a vector of four single precision floats, and the union enables us to access the individual elements of the vector. Behind the scenes, the cryptic 'mode' attribute also takes care of alignment, about which more later.
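Newer gcc releases warn that declaring vector types through 'mode' is deprecated and point to the vector_size attribute instead. The following is a minimal sketch of the same union under that assumption; the names v4sf_vs, f4vector_vs and f4_add are made up for this illustration and do not appear in example1.c.

```c
/* Assumption: gcc (or a compatible compiler) with vector_size support.
   16 bytes hold four 4-byte single precision floats. */
typedef float v4sf_vs __attribute__ ((vector_size (16)));

union f4vector_vs
{
  v4sf_vs v;    /* the whole vector, used for SIMD arithmetic */
  float f[4];   /* the same 16 bytes, viewed as four floats   */
};

/* Hypothetical helper: element-wise sum of two vectors. */
union f4vector_vs f4_add(union f4vector_vs a, union f4vector_vs b)
{
  union f4vector_vs c;
  c.v = a.v + b.v;   /* one vector addition instead of four scalar ones */
  return c;
}
```

Filling a.f with 1, 2, 3, 4 and b.f with 5, 6, 7, 8 and calling f4_add(a, b) should give the same four sums as the program discussed in this section.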
The next bit actually does a calculation:
    int main()
    {
      union f4vector a, b, c;

      a.f[0] = 1; a.f[1] = 2; a.f[2] = 3; a.f[3] = 4;
      b.f[0] = 5; b.f[1] = 6; b.f[2] = 7; b.f[3] = 8;

      c.v = a.v + b.v;

      printf("%f, %f, %f, %f\n", c.f[0], c.f[1], c.f[2], c.f[3]);
    }
This can be compiled 'as is' by any recent gcc version (3.3 works, 3.4 does too):
    $ gcc -ggdb -c example1.c
    $ gcc example1.o -o example1
When run, it delivers the expected output:
    $ ./example1
    6.000000, 8.000000, 10.000000, 12.000000
However, we did not tell gcc about our processor and it will probably have assumed the most basic variant available on your platform (80386, or a G3 for example). To verify, run:
    $ objdump -dS ./example1.o | grep -22 c.v | tail -25
      c.v = a.v + b.v;
      8b:   d9 45 e8                flds   0xffffffe8(%ebp)
      8e:   d8 45 d8                fadds  0xffffffd8(%ebp)
      91:   d9 5d b8                fstps  0xffffffb8(%ebp)
      94:   d9 45 ec                flds   0xffffffec(%ebp)
      97:   d8 45 dc                fadds  0xffffffdc(%ebp)
      9a:   d9 5d bc                fstps  0xffffffbc(%ebp)
      9d:   d9 45 f0                flds   0xfffffff0(%ebp)
      a0:   d8 45 e0                fadds  0xffffffe0(%ebp)
      a3:   d9 5d c0                fstps  0xffffffc0(%ebp)
      a6:   d9 45 f4                flds   0xfffffff4(%ebp)
      a9:   d8 45 e4                fadds  0xffffffe4(%ebp)
      ac:   d9 5d c4                fstps  0xffffffc4(%ebp)
      af:   8b 45 b8                mov    0xffffffb8(%ebp),%eax
      b2:   89 45 c8                mov    %eax,0xffffffc8(%ebp)
      b5:   8b 45 bc                mov    0xffffffbc(%ebp),%eax
      b8:   89 45 cc                mov    %eax,0xffffffcc(%ebp)
      bb:   8b 45 c0                mov    0xffffffc0(%ebp),%eax
      be:   89 45 d0                mov    %eax,0xffffffd0(%ebp)
      c1:   8b 45 c4                mov    0xffffffc4(%ebp),%eax
      c4:   89 45 d4                mov    %eax,0xffffffd4(%ebp)
      printf("%f, %f, %f, %f\n", c.f[0], c.f[1], c.f[2], c.f[3]);
We see a lot of repetitive instructions, indicating that gcc has spelled out the four additions one by one, using ordinary floating point instructions (flds, fadds, fstps). Now let's recompile, informing gcc of our CPU, and take another look. Note that this example is Intel specific; substitute your own CPU name. Results will look different on a G3, but are similar in nature.
    $ gcc -ggdb -march=pentium3 -mcpu=pentium3 -c -o example1.o example1.c
    $ gcc -lm example1.o -o example1
    $ objdump -dS ./example1.o | grep -4 c.v | tail -5
      c.v = a.v + b.v;
      8b:   0f 28 45 e8             movaps 0xffffffe8(%ebp),%xmm0
      8f:   0f 58 45 d8             addps  0xffffffd8(%ebp),%xmm0
      93:   0f 29 45 c8             movaps %xmm0,0xffffffc8(%ebp)
      printf("%f, %f, %f, %f\n", c.f[0], c.f[1], c.f[2], c.f[3]);
Here we see our first SSE instructions:
movaps: 'MOVe four Aligned Packed Single precision'. Copies four single precision floats from an aligned memory location to the register XMM0. This memory location is 'a'.
addps: 'ADD four Packed Single precision'. Adds the four floats at the specified memory location to the contents of the SSE register XMM0. This memory location is 'b'.
movaps: 'MOVe four Aligned Packed Single precision'. Copies four single precision floats from the register XMM0 to an aligned memory location. This location is 'c' in our program.
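For comparison, the same load, add and store steps can be requested explicitly through the SSE intrinsics declared in xmmintrin.h. This is only a sketch: the helper name add4_aligned is hypothetical, the scalar fallback exists solely so the code also builds on targets without SSE, and whether the compiler emits exactly movaps and addps depends on its version and flags.

```c
#if defined(__SSE__)
#include <xmmintrin.h>   /* _mm_load_ps, _mm_add_ps, _mm_store_ps */
#endif

/* Hypothetical helper: c = a + b for four floats.
   On the SSE path all three pointers must be 16-byte aligned. */
void add4_aligned(const float *a, const float *b, float *c)
{
#if defined(__SSE__)
  __m128 x = _mm_load_ps(a);   /* aligned load of a into a register   */
  __m128 y = _mm_load_ps(b);   /* aligned load of b                   */
  x = _mm_add_ps(x, y);        /* four single precision adds at once  */
  _mm_store_ps(c, x);          /* aligned store of the result into c  */
#else
  /* Fallback for targets without SSE: four scalar additions. */
  int i;
  for (i = 0; i < 4; ++i)
    c[i] = a[i] + b[i];
#endif
}
```

The arrays passed in can be aligned with gcc's aligned attribute, for example float a[4] __attribute__ ((aligned (16))). Handing an unaligned pointer to the aligned load or store is likely to crash the program.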
It is probably a good idea to play around a bit with this program, which is called example1.c on disk.
Suggested experiments are inducing division-by-zero errors and timing the calculation. Very simple benchmarking can be done by adding for(n=0; n < 1000000000; ++n) before our calculation (declare n first, for example as an int). For reliable results, do not turn on optimization: gcc may discover that the calculation yields the same result on every pass, and perform it only once.
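As a starting point for the division-by-zero experiment, here is a small sketch. It assumes the default floating point environment on IEEE 754 hardware, where exceptions are masked and dividing a non-zero number by zero quietly yields infinity instead of stopping the program; verify this on your own platform. The type and function names are invented for this illustration, and vector_size is used instead of 'mode' so that the sketch builds on newer compilers.

```c
#include <math.h>   /* isinf */

/* Assumption: gcc (or compatible) with vector_size support. */
typedef float v4sf_z __attribute__ ((vector_size (16)));

union f4vector_z
{
  v4sf_z v;
  float f[4];
};

/* Hypothetical helper: divide a by b element-wise and count how many
   of the four quotients came out infinite. */
int count_infinite_quotients(union f4vector_z a, union f4vector_z b)
{
  union f4vector_z c;
  int i, count = 0;

  c.v = a.v / b.v;            /* one vector division */

  for (i = 0; i < 4; ++i)
    if (isinf(c.f[i]))        /* true for +infinity and -infinity */
      ++count;

  return count;
}
```

Printing the individual elements of the quotient with printf is a quick way to see what your platform actually produces.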
Of special note are the speed differences between multiplication and division:
    $ time ./example1
    5.000000, 12.000000, 21.000000, 32.000000

    real    0m0.562s
    user    0m0.542s
    sys     0m0.001s

    $ emacs example1.c ; make ; time ./example1
    0.200000, 0.333333, 0.428571, 0.500000

    real    0m2.634s
    user    0m2.611s
    sys     0m0.002s
When studying the assembler output, the sole difference between the two builds turns out to be mulps in the first and divps in the second, with mulps being a lot faster.
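The timing experiment can also be done from inside the program instead of with time. The sketch below is one way to do that, using the standard clock() function; time_vector_op and the other names are hypothetical, and vector_size is used in place of 'mode' on the assumption of a newer gcc. As with the loop suggested earlier, compile it without optimization, or the compiler may perform the calculation only once.

```c
#include <time.h>   /* clock, clock_t, CLOCKS_PER_SEC */

/* Assumption: gcc (or compatible) with vector_size support. */
typedef float v4sf_b __attribute__ ((vector_size (16)));

union f4vector_b
{
  v4sf_b v;
  float f[4];
};

/* Hypothetical helper: repeat c = a * b (or c = a / b when divide is
   non-zero) `iterations` times, store the last result in *out and
   return the CPU time used, in seconds. */
double time_vector_op(long iterations, int divide, union f4vector_b *out)
{
  union f4vector_b a, b, c;
  clock_t start, stop;
  long n;
  int i;

  for (i = 0; i < 4; ++i) {
    a.f[i] = (float)(i + 1);   /* 1, 2, 3, 4 */
    b.f[i] = (float)(i + 5);   /* 5, 6, 7, 8 */
    c.f[i] = 0.0f;
  }

  start = clock();
  for (n = 0; n < iterations; ++n) {
    if (divide)
      c.v = a.v / b.v;
    else
      c.v = a.v * b.v;
  }
  stop = clock();

  *out = c;
  return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_vector_op(1000000000, 0, &result) and then time_vector_op(1000000000, 1, &result) gives two figures to compare. The absolute numbers will differ from the ones shown above, depending on the machine and compiler.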