Optimizing using SSE

If you need optimize a cpu intensive application, then you need to know SSE, a SIMD instruction set.

There are four types that you need to familiarize:

__m64 MM register
__m128 packed single precision (XMM register)
__m128d packed double precision (XMM register)
__m128i packed integer (XMM register)

The intrinsics have the next format _mm_op_suffix() where op is the operation that performs the intrinsic and suffix indicate the data type that will operate (the four types mentioned above).

Ok then, no wait! Please do not rewrite your software using SSE intrinsics. Let the dirty job to gcc :).

First of all, the intrinsics function are not friendly. If you write some code using them, I promise you that you'll need rethink what do you wanted to do when you take a look at your own code written just a week ago.

Second your code will not be portable to machines that not supports SSE instructions.

Let me show you a small program that performs a dot product between two vectors.

#include <limits>
#include <cstdlib>
#include <stdint.h>

#define VECTOR_SIZE 1000000

int32_t dotProduct(size_t ndim, const int16_t *a, const int16_t *b) {
  int32_t acc = 0;
  for (size_t i = 0; i < ndim; ++i) {
    acc += a[i] * b[i];
  return acc;

void populateVector(size_t dim, int16_t *v) {
  for(size_t i = 0; i < dim; ++i) {
    v[i] = static_cast<int16_t>((std::rand() * RAND_MAX) % std::numeric_limits<int16_t>::max());

int main(int argc, char *argv[]) {
  int16_t *a = NULL;
  int16_t *b = NULL;
  posix_memalign(reinterpret_cast<void **>(&a), 16, VECTOR_SIZE * sizeof(int16_t));
  posix_memalign(reinterpret_cast<void **>(&b), 16, VECTOR_SIZE * sizeof(int16_t));
  populateVector(VECTOR_SIZE, a);
  populateVector(VECTOR_SIZE, b);
  return dotProduct(VECTOR_SIZE, a, b);

If you build this program passing -msse2 flag to gcc, then your code is written using SSE instructions automagically. For better performance, SSE needs to allocate memory aligned to 16 bytes, in posix systems we can use the posix_memalign function. Don't forget use the const keyword!

$ gcc dot_product.cc -o dot_product -O3 -g -msse2
$ objdump -dS dot_product > dot_product.s

Disassembling the code

int32_t dotProduct(size_t ndim, const int16_t *a, const int16_t *b) {
8048561: 89 44 24 0c           mov    %eax,0xc(%esp)
8048565: 8b 04 24              mov    (%esp),%eax
8048568: 01 ed                 add    %ebp,%ebp
804856a: 8d 0c 2e              lea    (%esi,%ebp,1),%ecx
804856d: 31 db                 xor    %ebx,%ebx
804856f: 66 0f ef c9           pxor   %xmm1,%xmm1
8048573: 01 fd                 add    %edi,%ebp
8048575: 8d 76 00              lea    0x0(%esi),%esi
8048578: f3 0f 6f 45 00        movdqu 0x0(%ebp),%xmm0
804857d: 66 0f f5 01           pmaddwd (%ecx),%xmm0
8048581: 83 c3 01              add    $0x1,%ebx
8048584: 83 c1 10              add    $0x10,%ecx
8048587: 83 c5 10              add    $0x10,%ebp
804858a: 39 c3                 cmp    %eax,%ebx
804858c: 66 0f fe c8           paddd  %xmm0,%xmm1
8048590: 72 e6                 jb     8048578 <_Z10dotProductjPKsS0_+0x98>
8048592: 66 0f 6f c1           movdqa %xmm1,%xmm0
8048596: 66 0f 73 d8 08        psrldq $0x8,%xmm0
804859b: 8b 44 24 0c           mov    0xc(%esp),%eax
804859f: 66 0f fe c8           paddd  %xmm0,%xmm1
80485a3: 8b 4c 24 04           mov    0x4(%esp),%ecx
80485a7: 66 0f 6f c1           movdqa %xmm1,%xmm0
80485ab: 66 0f 73 d8 04        psrldq $0x4,%xmm0
80485b0: 66 0f fe c8           paddd  %xmm0,%xmm1
80485b4: 66 0f 7e 0c 24        movd   %xmm1,(%esp)
80485b9: 03 54 24 04           add    0x4(%esp),%edx
80485bd: 03 04 24              add    (%esp),%eax
80485c0: 39 4c 24 08           cmp    %ecx,0x8(%esp)
80485c4: 74 1e                 je     80485e4 <_Z10dotProductjPKsS0_+0x104>
80485c6: 8b 6c 24 30           mov    0x30(%esp),%ebp
80485ca: 8d b6 00 00 00 00     lea    0x0(%esi),%esi
80485d0: 0f bf 0c 56           movswl (%esi,%edx,2),%ecx
80485d4: 0f bf 1c 57           movswl (%edi,%edx,2),%ebx

As you can see, the above code performs the algorithm using SSE instructions. You can help to gcc for generate more optimized code. For example if you are sure that you memory will be aligned, then you can indicate this as follows:

const int16_t *a_ = static_cast<const int16_t *>(__builtin_assume_aligned(a, 16));
const int16_t *b_ = static_cast<const int16_t *>(__builtin_assume_aligned(b, 16));

Then gcc will generate assembly code using the operations assuming that the memory is aligned. You can see here and compare the cost of the instructions with memory aligned and memory unaligned.

Also if you want to vectorize the algorithm, generally first rewrite your code without using SSE intrinsics. If you do that, then you are helping gcc to vectorize the algorithm. You need disassemble the code and take a look if gcc is doing right it job. I recomend read this presentation that gives good tips for helping to gcc.

I can conclude that just when it actually necessary, you need to write code using intrinsics. But first, you need to do profiling and find your bottle neck. After that, disassemble the code and try to understand how gcc is resolving the algorithm. Then, try to point to gcc as much as possible. If you don't make that gcc do the work for you, then sadly you have to write code using the intrinsics functions. But I warn you, this just happend in weird and complex cases.

If you need perform a dot product, use std::inner_product. The code above is only for get a simple example.


Comments powered by Disqus