I am currently working on a new project where I need to unpack raw data quickly and efficiently. I also need to run several mathematical operations (FFTs, etc.) on batches of that data. Although the plan is to move much of this to the GPU in the future, for now I am trying to write functions that get the maximum out of the CPU. That makes SSE unavoidable, and SSE4.1 adds a particularly useful instruction for this kind of work: the dot product. With the `_mm_dp_ps` intrinsic you can multiply two vectors of four floats element by element and sum whichever products you choose, depending on how you set the mask parameter. The upper four bits of the mask select which products enter the sum, and the lower four bits select which elements of the result receive it (the remaining elements are set to zero). Here’s a simple example:
#include <stdio.h>
#include <smmintrin.h> // SSE4.1 intrinsics; with GCC or Clang compile with -msse4.1
int main(void)
{
    float x[4] = {  1.0f,  2.0f,  3.0f,  4.0f };
    float y[4] = { -1.0f, -2.0f, -3.0f, -4.0f };

    // copy the data into SSE registers; the unaligned load is used because
    // plain stack arrays are not guaranteed to be 16-byte aligned, which
    // _mm_load_ps would require
    __m128 a = _mm_loadu_ps(x);
    __m128 b = _mm_loadu_ps(y);

    // upper four bits (1111): multiply and sum all 4 element pairs
    // lower four bits (0001): store the sum in element 0, zero the rest
    // 11110001 = 0xf1
    const int mask = 0xf1;
    __m128 res = _mm_dp_ps(a, b, mask);

    union { __m128 v; float f[4]; } uf; // a trick to access the 4 floats

    uf.v = a;
    printf("Original a: %f\t%f\t%f\t%f\n", uf.f[0], uf.f[1], uf.f[2], uf.f[3]);
    uf.v = b;
    printf("Original b: %f\t%f\t%f\t%f\n", uf.f[0], uf.f[1], uf.f[2], uf.f[3]);
    uf.v = res;
    printf("Result    : %f\t%f\t%f\t%f\n", uf.f[0], uf.f[1], uf.f[2], uf.f[3]);

    return 0;
}
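
To make the mask semantics easier to see, here is a minimal scalar sketch of what `_mm_dp_ps` computes. The helper name `dp_ps_reference` is my own invention for illustration and is not part of any library. It only models the mask logic; the hardware may add the products in a different order, so the last bits of the result can differ for values that are not exactly representable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a library function): scalar model of the
 * _mm_dp_ps mask. Bits 4..7 choose which products a[i]*b[i] are summed,
 * bits 0..3 choose which output elements receive the sum. */
static void dp_ps_reference(const float a[4], const float b[4],
                            unsigned mask, float out[4])
{
    assert(a != NULL && b != NULL && out != NULL);

    /* Sum only the products whose bit is set in the upper four bits. */
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        if (mask & (0x10u << i))
            sum += a[i] * b[i];
    }

    /* Write the sum to the elements selected by the lower four bits;
     * every other element is set to zero. */
    for (int i = 0; i < 4; ++i)
        out[i] = (mask & (1u << i)) ? sum : 0.0f;
}
```

Called with the x and y values from the example and the mask 0xf1, this puts -30 in element 0 and zero in the other three. With 0x33 instead, only the first two products are summed (-1 and -4), and the result -5 lands in elements 0 and 1.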