In a previous article about SIMD wrappers, I suggested designing a dedicated memory allocator to handle SIMD memory alignment constraints, but I didn’t give any details on how to do it. That’s the purpose of this article. The C++ standard describes a set of requirements our allocator must respect to work with standard containers. After a survey of these standard requirements, we’ll see how to implement an aligned memory allocator that meets them.
The standard requires the allocator to define the following types:
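These member types are the same ones that appear in the aligned_allocator interface shown later in this article:

typedef T value_type;
typedef T& reference;
typedef const T& const_reference;
typedef T* pointer;
typedef const T* const_pointer;
typedef size_t size_type;
typedef ptrdiff_t difference_type;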
The standard then requires a rebind class template member, which is a template typedef: Allocator<T>::rebind<U>::other is the same type as Allocator<U>. This member is used by containers that allocate memory for internal structures holding T elements rather than for the T elements themselves. For instance, std::list<T> allocates memory for Node<T> instead of T, but you don’t want to use Allocator<Node<T> > as the template parameter since this would expose implementation details in the interface. Thus you use Allocator<T>, and the internal allocation is done through an instance of Allocator<T>::rebind<Node<T> >::other.
Then we have to provide the address functions, which return the address of a given object. Two overloads are provided, one for references and one for constant references.
The two following functions are the essential part of the allocator: allocate and deallocate, which allocate and deallocate memory for n objects of type T. These are low-level memory management functions; they are not responsible for constructing or destroying objects, which is the job of two dedicated functions: construct and destroy.
The last specific function is max_size, a function that returns the maximum value that can be passed to allocate.
Finally, the allocator must provide default and copy constructors, and equality check operators.
Since we must handle different memory alignment bounds, our aligned memory allocator will take two template parameters: T, the type of allocated objects, and N, the alignment bound. Given the requirements of the previous section, the allocator interface looks like:
template <class T, int N>
class aligned_allocator
{
public:

    typedef T value_type;
    typedef T& reference;
    typedef const T& const_reference;
    typedef T* pointer;
    typedef const T* const_pointer;
    typedef size_t size_type;
    typedef ptrdiff_t difference_type;

    template <class U>
    struct rebind
    {
        typedef aligned_allocator<U,N> other;
    };

    inline aligned_allocator() throw() {}
    inline aligned_allocator(const aligned_allocator&) throw() {}

    template <class U>
    inline aligned_allocator(const aligned_allocator<U,N>&) throw() {}

    inline ~aligned_allocator() throw() {}

    inline pointer address(reference r) { return &r; }
    inline const_pointer address(const_reference r) const { return &r; }

    pointer allocate(size_type n, typename std::allocator<void>::const_pointer hint = 0);
    inline void deallocate(pointer p, size_type);

    inline void construct(pointer p, const_reference value) { new (p) value_type(value); }
    inline void destroy(pointer p) { p->~value_type(); }

    inline size_type max_size() const throw() { return size_type(-1) / sizeof(T); }

    inline bool operator==(const aligned_allocator&) { return true; }
    inline bool operator!=(const aligned_allocator& rhs) { return !operator==(rhs); }
};
Nothing special to say here. The construct function calls the copy constructor of T through the placement new operator, but it does not allocate memory for the element; that is the responsibility of the allocate function. The same goes for the destroy function: it calls the destructor of T but doesn’t deallocate memory, which has to be done afterwards with a call to the deallocate function.
We can now focus on the allocate and deallocate implementation. Depending on our platform, an aligned memory allocation function may already be available: malloc may already return suitably aligned memory, or the platform may provide functions such as _mm_malloc, posix_memalign or the MSVC _aligned_malloc, all used in the selection function below.
Except for the platforms where malloc is already 16-byte aligned, every function takes an alignment parameter that must be a power of 2; thus the N template parameter of our allocator should be a power of 2 so it can work with these aligned memory allocation functions. Note that several of these functions can be available on the same platform.
Assume we can detect at compile time whether such functions are available (we’ll come back to this later); we can then provide a function that selects the aligned memory allocation function to use if one is available, and otherwise forwards to our own implementation:
namespace detail
{
    inline void* _aligned_malloc(size_t size, size_t alignment)
    {
        void* res = 0;
        void* ptr = malloc(size+alignment);
        if(ptr != 0)
        {
            res = reinterpret_cast<void*>((reinterpret_cast<size_t>(ptr) & ~(size_t(alignment-1))) + alignment);
            *(reinterpret_cast<void**>(res) - 1) = ptr;
        }
        return res;
    }
}

inline void* aligned_malloc(size_t size, size_t alignment)
{
#if MALLOC_ALREADY_ALIGNED
    return malloc(size);
#elif HAS_MM_MALLOC
    return _mm_malloc(size,alignment);
#elif HAS_POSIX_MEMALIGN
    void* res;
    // Note: posix_memalign takes the alignment before the size
    const int failed = posix_memalign(&res,alignment,size);
    if(failed) res = 0;
    return res;
#elif (defined _MSC_VER)
    return _aligned_malloc(size, alignment);
#else
    return detail::_aligned_malloc(size,alignment);
#endif
}
The idea in the _aligned_malloc function is to search for the first aligned memory address (res) after the one returned by the classic malloc function (ptr), and to use it as the return value. But since we must ensure size bytes are available after res, we must allocate more than size bytes; the minimum amount to allocate to prevent a buffer overflow is size+alignment. Then we store the ptr value just before res so the _aligned_free function can easily retrieve it and pass it to the classic free function:
namespace detail
{
    inline void _aligned_free(void* ptr)
    {
        if(ptr != 0)
            free(*(reinterpret_cast<void**>(ptr)-1));
    }
}

inline void aligned_free(void* ptr)
{
#if MALLOC_ALREADY_ALIGNED
    free(ptr);
#elif HAS_MM_MALLOC
    _mm_free(ptr);
#elif HAS_POSIX_MEMALIGN
    free(ptr);
#elif defined(_MSC_VER)
    _aligned_free(ptr);
#else
    detail::_aligned_free(ptr);
#endif
}
The aligned_free function is the counterpart of aligned_malloc: it selects the available aligned memory deallocation function, or forwards to the _aligned_free implementation.
We can now write the allocate and deallocate functions of the allocator:
template <class T, int N>
typename aligned_allocator<T,N>::pointer
aligned_allocator<T,N>::allocate(size_type n, typename std::allocator<void>::const_pointer hint)
{
    pointer res = reinterpret_cast<pointer>(aligned_malloc(sizeof(T)*n,N));
    if(res == 0)
        throw std::bad_alloc();
    return res;
}

template <class T, int N>
void
aligned_allocator<T,N>::deallocate(pointer p, size_type)
{
    aligned_free(p);
}
Here we see the advantage of having encapsulated the aligned memory allocation selection in a dedicated function: the allocate function of the allocator simply forwards to it and then handles a possible bad allocation. The result is simple and easy-to-read code. Another advantage is that you can use the aligned_malloc and aligned_free functions outside the aligned_allocator class if you need to.
Note: the call to malloc behind the MALLOC_ALREADY_ALIGNED preprocessor token is valid for 16-byte alignment only (the same applies to the call to free). Thus we should provide two versions of aligned_malloc and aligned_free and a specialization of the allocator for N = 16.
Now that we have implemented the allocation and deallocation methods, we can come back to the preprocessor tokens. Defining these tokens is not simple because you have to refer to the documentation of a lot of various systems and architectures. Thus there’s a chance that we may not be comprehensive, but at least we can cover the most common platforms.
Let’s start with the GNU world; according to this documentation, “The address of a block returned by malloc or realloc in GNU systems is always a multiple of eight (or sixteen on 64-bit systems)”. According to this one, page 114, “[The] LP64 model […] is used by all 64-bit UNIX ports”, therefore we should use this predefined macro instead of __x86_64__ (this last one won’t work on PowerPC or SPARC). Thus we can define the following macro:
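A minimal sketch of such a macro, assuming the LP64 check is done through the __LP64__ predefined token (the name GLIBC_MALLOC_ALREADY_ALIGNED is of my choosing):

#if defined(__GLIBC__) && defined(__LP64__)
    #define GLIBC_MALLOC_ALREADY_ALIGNED 1
#else
    #define GLIBC_MALLOC_ALREADY_ALIGNED 0
#endif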
FreeBSD has 16-byte aligned malloc, except on ARM and MIPS architectures (see this documentation):
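Again a sketch, with a macro name of my choosing:

#if defined(__FreeBSD__) && !defined(__arm__) && !defined(__mips__)
    #define FREEBSD_MALLOC_ALREADY_ALIGNED 1
#else
    #define FREEBSD_MALLOC_ALREADY_ALIGNED 0
#endif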
On 64-bit Windows and on Apple systems, the malloc function is also already aligned, so we can define the MALLOC_ALREADY_ALIGNED macro based on this information and the macros previously defined:
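Something along these lines, assuming _WIN64 and __APPLE__ are used to detect 64-bit Windows and Apple platforms:

#if (defined(__APPLE__) || defined(_WIN64) || GLIBC_MALLOC_ALREADY_ALIGNED || FREEBSD_MALLOC_ALREADY_ALIGNED)
    #define MALLOC_ALREADY_ALIGNED 1
#else
    #define MALLOC_ALREADY_ALIGNED 0
#endif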
To handle systems implementing POSIX:
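A plausible sketch, assuming the POSIX.1-2001 feature-test macro is a good enough indicator of posix_memalign availability:

#if defined(_POSIX_C_SOURCE) && (_POSIX_C_SOURCE >= 200112L)
    #define HAS_POSIX_MEMALIGN 1
#else
    #define HAS_POSIX_MEMALIGN 0
#endif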
The last macro to define is HAS_MM_MALLOC; the _mm_malloc function is provided with SSE intrinsics, thus we can rely on the macros defined in this article:
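A minimal sketch, reusing the SSE_INSTR_SET token defined later in this series (any non-zero value means SSE intrinsics, and thus _mm_malloc, are available):

#if SSE_INSTR_SET > 0
    #define HAS_MM_MALLOC 1
#else
    #define HAS_MM_MALLOC 0
#endif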
That’s it, some architectures may be missing but it shouldn’t be too complicated to handle them with appropriate documentation.
The aligned memory allocator designed in this article meets the standard requirements and can therefore be used with any STL container. If you work with intrinsics wrappers and std::vector, this allocator will allow you to load data from memory with the load_a function, which is faster than load_u (the same applies for storing data to memory):
typedef std::vector<double,aligned_allocator<double,16> > vector_type;
vector_type v1,v2,v3;
// code filling v1 and v2

for(size_t i = 0; i < v1.size(); i += simd_traits<double>::size)
{
    vector2d v1d = load_a(&v1[i]);
    vector2d v2d = load_a(&v2[i]);
    vector2d v3d = v1d + v2d;
    store_a(&v3[i],v3d);
}
But as we will see in a forthcoming article, std::vector may not be the most appropriate container for efficient numerical analysis.
I’ve always thought the compiler was smart enough to handle registers and optimizations whatever the type of the function arguments (const references or values), and I don’t see why operator overloads shouldn’t be treated like ordinary functions by the compiler. But well, maybe I am too optimistic about the capabilities of the compiler? I was suggested a solution based on pure functions that should be simpler and faster, but I was not given any evidence. Let’s take a closer look at both implementations and the assembly code they generate so we can determine whether or not the wrappers introduce performance hits.
Before we go further, here are some technical details: the compiler used in this article is gcc 4.7.3, results may be different with another compiler (and I am interested in seeing these results). The SIMD wrappers used are those of the article series mentioned above, and the implementation based on stateless pure functions looks like:
typedef __m128 vector4f2;

inline vector4f2 add(vector4f2 lhs, vector4f2 rhs)
{
    return _mm_add_ps(lhs,rhs);
}

inline vector4f2 mul(vector4f2 lhs, vector4f2 rhs)
{
    return _mm_mul_ps(lhs,rhs);
}

inline vector4f2 load_a(const float* src)
{
    return _mm_load_ps(src);
}

inline void store_a(float* dst, vector4f2 src)
{
    _mm_store_ps(dst,src);
}
Let’s see the assembly code generated by the following functions:
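The two functions under test are essentially the following (a sketch consistent with the assembly below: test_sse_a uses the wrappers with constant reference arguments, test_sse_a2 the stateless functions with value arguments):

vector4f test_sse_a(const vector4f& lhs, const vector4f& rhs)
{
    return lhs + rhs;
}

vector4f2 test_sse_a2(vector4f2 lhs, vector4f2 rhs)
{
    return add(lhs,rhs);
}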
The generated assembly code is:
// test_sse_a
 0: c5 f8 28 06            vmovaps (%rsi),%xmm0
 4: 48 89 f8               mov    %rdi,%rax
 7: c5 f8 58 02            vaddps (%rdx),%xmm0,%xmm0
 b: c5 f8 29 07            vmovaps %xmm0,(%rdi)
 f: c3                     retq

// test_sse_a2
 0: c5 f8 58 c1            vaddps %xmm1,%xmm0,%xmm0
 4: c3                     retq
 5: 66 66 2e 0f 1f 84 00   data32 nopw %cs:0x0(%rax,%rax,1)
 c: 00 00 00 00
If you’re not familiar with assembly, vaddps is the instruction generated for _mm_add_ps (strictly speaking for _mm256_add_ps, but this doesn’t make a big difference), vmovaps is a transfer instruction from memory to a SIMD register (load) or from a SIMD register to memory (store) depending on its arguments, and the %xmmX are the SIMD registers. Don’t worry about the last lines of the test_sse_a2 function: they are a “do-nothing” operation used for padding, and do not concern us here.
So what can we tell at first sight? Well, it seems the SIMD wrappers introduce an overhead, using transfer instructions, while the implementation based on stateless functions directly uses registers. Now the question is why. Is this due to the constant reference arguments?
If we change the code of the SIMD wrapper operator overloads to take their arguments by value rather than by constant reference, the generated assembly code doesn’t change:
Moreover, if we change the functional implementation so it takes arguments by constant reference instead of value, the generated assembly code for test_sse_a2 is exactly the same as in the previous section:
As I supposed, the compiler (at least gcc) is smart enough to keep arguments passed by constant reference in registers (if they fit into registers, of course). So it seems the overhead comes from the indirection of the wrapping, but this is really hard to believe.
To confirm this hypothesis, let’s simplify the code of the wrapper so we only test the indirection. Inheritance from the simd_vector base class is removed:
class vector4f
{
public:

    inline vector4f() {}
    inline vector4f(__m128 rhs) : m_value(rhs) {}

    inline vector4f& operator=(__m128 rhs)
    {
        m_value = rhs;
        return *this;
    }

    inline operator __m128() const { return m_value; }

private:

    __m128 m_value;
};

inline vector4f operator+(vector4f lhs, vector4f rhs)
{
    return _mm_add_ps(lhs,rhs);
}
Now if we dump the assembly code of the test_sse_a function we defined in the beginning, here is what we get:
That’s exactly the same code as the one generated by pure stateless functions. So the indirection of the wrapper doesn’t introduce any overhead. Since the only change we’ve made from the previous wrapper is to remove the CRTP layer, we have the culprit for the overhead we noticed in the beginning: the CRTP layer.
I first thought of an Empty Base Optimization problem, but printing the size of both implementations of the wrapper proved me wrong: in both cases, the size of the wrapper is 16 bytes, so it fits in the XMM registers. So I must admit, I still have no explanation for this problem.
In the next section, I will consider the wrapper implementation that doesn’t use CRTP. Now that we’ve fixed this issue, let’s see if operator overloads prevent the compiler from properly reordering instructions during optimization.
For this test, I used the following functions:
vector4f2 test_sse_b2(vector4f2 a, vector4f2 b, vector4f2 c, vector4f2 d)
{
    return add(mul(a,b),mul(c,d));
}

vector4f2 test_sse_c2(vector4f2 a, vector4f2 b, vector4f2 c, vector4f2 d)
{
    return add(add(mul(a,b),div(c,d)),sub(div(c,b),mul(a,d)));
}

vector4f2 test_sse_d2(vector4f2 a, vector4f2 b, vector4f2 c, vector4f2 d)
{
    return mul(test_sse_c2(a,b,c,d),test_sse_b2(a,b,c,d));
}
And the equivalent functions for wrappers:
vector4f test_sse_b(vector4f a, vector4f b, vector4f c, vector4f d)
{
    return a*b + c*d;
}

vector4f test_sse_c(vector4f a, vector4f b, vector4f c, vector4f d)
{
    return (a*b + c/d) + (c/b - a*d);
}

vector4f test_sse_d(vector4f a, vector4f b, vector4f c, vector4f d)
{
    return test_sse_c(a,b,c,d) * test_sse_b(a,b,c,d);
}
Here the parentheses in test_sse_c ensure the compiler generates the same syntax tree for both implementations; indeed, if we had omitted them, the code would have been almost equivalent to:
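That is, with left-to-right association of + and -, the functional version would have to be nested like this (a sketch):

return sub(add(add(mul(a,b),div(c,d)),div(c,b)),mul(a,d));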
Here is the generated assembly code with explanations in comments:
// test_sse_d
40: c5 f8 59 e1            vmulps %xmm1,%xmm0,%xmm4   // a*b in xmm4
44: c5 e8 5e c9            vdivps %xmm1,%xmm2,%xmm1   // c/b in xmm1
48: c5 f8 59 c3            vmulps %xmm3,%xmm0,%xmm0   // a*d in xmm0
4c: c5 e8 59 eb            vmulps %xmm3,%xmm2,%xmm5   // c*d in xmm5
50: c5 d8 58 ed            vaddps %xmm5,%xmm4,%xmm5   // a*b + c*d in xmm5
54: c5 f0 5c c8            vsubps %xmm0,%xmm1,%xmm1   // c/b - a*d in xmm1
58: c5 e8 5e c3            vdivps %xmm3,%xmm2,%xmm0   // c/d in xmm0
5c: c5 d8 58 c0            vaddps %xmm0,%xmm4,%xmm0   // a*b + c/d in xmm0
60: c5 f8 58 c1            vaddps %xmm1,%xmm0,%xmm0   // a*b + c/d + c/b - a*d in xmm0
64: c5 f8 59 c5            vmulps %xmm5,%xmm0,%xmm0   // (a*b + c*d) * xmm0 in xmm0
68: c3                     retq
69: 0f 1f 80 00 00 00 00   nopl 0x0(%rax)

// test_sse_d2
40: c5 f8 59 e1            vmulps %xmm1,%xmm0,%xmm4
44: c5 e8 5e c9            vdivps %xmm1,%xmm2,%xmm1
48: c5 f8 59 c3            vmulps %xmm3,%xmm0,%xmm0
4c: c5 e8 59 eb            vmulps %xmm3,%xmm2,%xmm5
50: c5 d8 58 ed            vaddps %xmm5,%xmm4,%xmm5
54: c5 f0 5c c8            vsubps %xmm0,%xmm1,%xmm1
58: c5 e8 5e c3            vdivps %xmm3,%xmm2,%xmm0
5c: c5 d8 58 c0            vaddps %xmm0,%xmm4,%xmm0
60: c5 f8 58 c1            vaddps %xmm1,%xmm0,%xmm0
64: c5 f8 59 c5            vmulps %xmm5,%xmm0,%xmm0
68: c3                     retq
69: 0f 1f 80 00 00 00 00   nopl 0x0(%rax)
The generated assembly code for test_sse_d and test_sse_d2 is exactly the same. Operator overloads and equivalent stateless functions generally produce the same assembly code, provided the syntax tree is the same in both implementations. Indeed, the evaluation order of operator arguments and function arguments may differ, making it impossible to get the same syntax tree in both implementations when using non-commutative operators.
Now what if we mix computation instructions with loops, loads and stores? Consider the following piece of code:
void test_sse_e(const std::vector<float>& a,
                const std::vector<float>& b,
                const std::vector<float>& c,
                const std::vector<float>& d,
                std::vector<float>& e)
{
    // typedef vector4f2 for test_sse_e2 implementation
    typedef vector4f vec_type;

    size_t bound = a.size()/4;
    for(size_t i = 0; i < bound; i += 4)
    {
        vec_type av = load_a2(&a[i]);
        vec_type bv = load_a2(&b[i]);
        vec_type cv = load_a2(&c[i]);
        vec_type dv = load_a2(&d[i]);
        // vec_type ev = test_sse_d2(av,bv,cv,dv); for test_sse_e2 implementation
        vec_type ev = test_sse_d(av,bv,cv,dv);
        store_a(&e[i],ev);
    }
}
Again, the generated assembly code is the same for both implementations:
// test_sse_e:
 70: 4c 8b 0f              mov    (%rdi),%r9
 73: 48 8b 7f 08           mov    0x8(%rdi),%rdi
 77: 4c 29 cf              sub    %r9,%rdi
 7a: 48 c1 ff 02           sar    $0x2,%rdi
 7e: 48 c1 ef 02           shr    $0x2,%rdi
 82: 48 85 ff              test   %rdi,%rdi
 85: 74 5d                 je     e4 <_ZN4simd11test_sse_eERKSt6vectorIfSaIfEES4_S4_S4_RS2_+0x74>
 87: 4c 8b 16              mov    (%rsi),%r10
 8a: 31 c0                 xor    %eax,%eax
 8c: 48 8b 32              mov    (%rdx),%rsi
 8f: 48 8b 09              mov    (%rcx),%rcx
 92: 49 8b 10              mov    (%r8),%rdx
 95: 0f 1f 00              nopl   (%rax)
 98: c5 f8 28 0c 86        vmovaps (%rsi,%rax,4),%xmm1
 9d: c5 f8 28 04 81        vmovaps (%rcx,%rax,4),%xmm0
 a2: c4 c1 78 28 24 81     vmovaps (%r9,%rax,4),%xmm4
 a8: c4 c1 78 28 1c 82     vmovaps (%r10,%rax,4),%xmm3
 ae: c5 f0 59 e8           vmulps %xmm0,%xmm1,%xmm5
 b2: c5 d8 59 d3           vmulps %xmm3,%xmm4,%xmm2
 b6: c5 d8 59 e0           vmulps %xmm0,%xmm4,%xmm4
 ba: c5 f0 5e db           vdivps %xmm3,%xmm1,%xmm3
 be: c5 e8 58 ed           vaddps %xmm5,%xmm2,%xmm5
 c2: c5 f0 5e c0           vdivps %xmm0,%xmm1,%xmm0
 c6: c5 e0 5c dc           vsubps %xmm4,%xmm3,%xmm3
 ca: c5 e8 58 d0           vaddps %xmm0,%xmm2,%xmm2
 ce: c5 e8 58 d3           vaddps %xmm3,%xmm2,%xmm2
 d2: c5 e8 59 d5           vmulps %xmm5,%xmm2,%xmm2
 d6: c5 f8 29 14 82        vmovaps %xmm2,(%rdx,%rax,4)
 db: 48 83 c0 04           add    $0x4,%rax
 df: 48 39 c7              cmp    %rax,%rdi
 e2: 77 b4                 ja     98 <_ZN4simd11test_sse_eERKSt6vectorIfSaIfEES4_S4_S4_RS2_+0x28>
 e4: f3 c3                 repz retq

// test_sse_e2
 70: 4c 8b 0f              mov    (%rdi),%r9
 73: 48 8b 7f 08           mov    0x8(%rdi),%rdi
 77: 4c 29 cf              sub    %r9,%rdi
 7a: 48 c1 ff 02           sar    $0x2,%rdi
 7e: 48 c1 ef 02           shr    $0x2,%rdi
 82: 48 85 ff              test   %rdi,%rdi
 85: 74 5d                 je     e4 <_ZN4simd11test_sse_e2ERKSt6vectorIfSaIfEES4_S4_S4_RS2_+0x74>
 87: 4c 8b 16              mov    (%rsi),%r10
 8a: 31 c0                 xor    %eax,%eax
 8c: 48 8b 32              mov    (%rdx),%rsi
 8f: 48 8b 09              mov    (%rcx),%rcx
 92: 49 8b 10              mov    (%r8),%rdx
 95: 0f 1f 00              nopl   (%rax)
 98: c5 f8 28 0c 86        vmovaps (%rsi,%rax,4),%xmm1
 9d: c5 f8 28 04 81        vmovaps (%rcx,%rax,4),%xmm0
 a2: c4 c1 78 28 24 81     vmovaps (%r9,%rax,4),%xmm4
 a8: c4 c1 78 28 1c 82     vmovaps (%r10,%rax,4),%xmm3
 ae: c5 f0 59 e8           vmulps %xmm0,%xmm1,%xmm5
 b2: c5 d8 59 d3           vmulps %xmm3,%xmm4,%xmm2
 b6: c5 d8 59 e0           vmulps %xmm0,%xmm4,%xmm4
 ba: c5 f0 5e db           vdivps %xmm3,%xmm1,%xmm3
 be: c5 e8 58 ed           vaddps %xmm5,%xmm2,%xmm5
 c2: c5 f0 5e c0           vdivps %xmm0,%xmm1,%xmm0
 c6: c5 e0 5c dc           vsubps %xmm4,%xmm3,%xmm3
 ca: c5 e8 58 d0           vaddps %xmm0,%xmm2,%xmm2
 ce: c5 e8 58 d3           vaddps %xmm3,%xmm2,%xmm2
 d2: c5 e8 59 d5           vmulps %xmm5,%xmm2,%xmm2
 d6: c5 f8 29 14 82        vmovaps %xmm2,(%rdx,%rax,4)
 db: 48 83 c0 04           add    $0x4,%rax
 df: 48 39 c7              cmp    %rax,%rdi
 e2: 77 b4                 ja     98 <_ZN4simd11test_sse_e2ERKSt6vectorIfSaIfEES4_S4_S4_RS2_+0x28>
 e4: f3 c3                 repz retq
To conclude, operator overloads don’t prevent the compiler from reordering instructions during optimization, and thus they don’t introduce any performance issue. Since they allow you to write more readable and maintainable code, it would be a shame not to use them.
Before we consider refactoring the wrappers, let’s see the overhead of the CRTP layer in more realistic code. Using the test_sse_d and test_sse_e functions of the previous section with the first version of the wrappers (the one with CRTP), here is the result of objdump:
// test_sse_d
 70: c4 c1 78 28 00        vmovaps (%r8),%xmm0
 75: 48 89 f8              mov    %rdi,%rax
 78: c5 f8 28 09           vmovaps (%rcx),%xmm1
 7c: c5 f8 28 1a           vmovaps (%rdx),%xmm3
 80: c5 f8 28 26           vmovaps (%rsi),%xmm4
 84: c5 f0 59 e8           vmulps %xmm0,%xmm1,%xmm5
 88: c5 d8 59 d3           vmulps %xmm3,%xmm4,%xmm2
 8c: c5 d8 59 e0           vmulps %xmm0,%xmm4,%xmm4
 90: c5 f0 5e c0           vdivps %xmm0,%xmm1,%xmm0
 94: c5 e8 58 ed           vaddps %xmm5,%xmm2,%xmm5
 98: c5 f0 5e db           vdivps %xmm3,%xmm1,%xmm3
 9c: c5 e8 58 d0           vaddps %xmm0,%xmm2,%xmm2
 a0: c5 e8 58 d3           vaddps %xmm3,%xmm2,%xmm2
 a4: c5 e8 5c e4           vsubps %xmm4,%xmm2,%xmm4
 a8: c5 d8 59 e5           vmulps %xmm5,%xmm4,%xmm4
 ac: c5 f8 29 27           vmovaps %xmm4,(%rdi)
 b0: c3                    retq
 b1: 66 66 66 66 66 66 2e  data32 data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
 b8: 0f 1f 84 00 00 00 00
 bf: 00

// test_sse_e
 c0: 4c 8b 0f              mov    (%rdi),%r9
 c3: 48 8b 7f 08           mov    0x8(%rdi),%rdi
 c7: 4c 29 cf              sub    %r9,%rdi
 ca: 48 c1 ff 02           sar    $0x2,%rdi
 ce: 48 c1 ef 02           shr    $0x2,%rdi
 d2: 48 85 ff              test   %rdi,%rdi
 d5: 74 5d                 je     134 <_ZN4simd10test_sse_eERKSt6vectorIfSaIfEES4_S4_S4_RS2_+0x74>
 d7: 4c 8b 16              mov    (%rsi),%r10
 da: 31 c0                 xor    %eax,%eax
 dc: 48 8b 32              mov    (%rdx),%rsi
 df: 48 8b 09              mov    (%rcx),%rcx
 e2: 49 8b 10              mov    (%r8),%rdx
 e5: 0f 1f 00              nopl   (%rax)
 e8: c5 f8 28 0c 86        vmovaps (%rsi,%rax,4),%xmm1
 ed: c5 f8 28 04 81        vmovaps (%rcx,%rax,4),%xmm0
 f2: c4 c1 78 28 24 81     vmovaps (%r9,%rax,4),%xmm4
 f8: c4 c1 78 28 1c 82     vmovaps (%r10,%rax,4),%xmm3
 fe: c5 f0 59 e8           vmulps %xmm0,%xmm1,%xmm5
102: c5 d8 59 d3           vmulps %xmm3,%xmm4,%xmm2
106: c5 d8 59 e0           vmulps %xmm0,%xmm4,%xmm4
10a: c5 f0 5e c0           vdivps %xmm0,%xmm1,%xmm0
10e: c5 e8 58 ed           vaddps %xmm5,%xmm2,%xmm5
112: c5 f0 5e db           vdivps %xmm3,%xmm1,%xmm3
116: c5 e8 58 d0           vaddps %xmm0,%xmm2,%xmm2
11a: c5 e8 58 d3           vaddps %xmm3,%xmm2,%xmm2
11e: c5 e8 5c e4           vsubps %xmm4,%xmm2,%xmm4
122: c5 d8 59 e5           vmulps %xmm5,%xmm4,%xmm4
126: c5 f8 29 24 82        vmovaps %xmm4,(%rdx,%rax,4)
12b: 48 83 c0 04           add    $0x4,%rax
12f: 48 39 c7              cmp    %rax,%rdi
132: 77 b4                 ja     e8 <_ZN4simd10test_sse_eERKSt6vectorIfSaIfEES4_S4_S4_RS2_+0x28>
134: f3 c3                 repz retq
In test_sse_d, we have six more instructions than in the previous version: data transfers to the SIMD registers at the beginning of the function, and a data transfer from a SIMD register at the end. Now if we look at test_sse_e, we get exactly the same code as in the previous section. The call to test_sse_d is inlined, and since the data transfers from and to SIMD registers are required by the load_a and store_a functions anyway, there is no need to keep the vmovaps instructions of test_sse_d. So if the functions working with wrappers are small enough to be inlined, and if computation instructions are used between the load and store functions, using the wrappers with CRTP should not introduce any overhead, since the compiler will remove the useless vmovaps instructions.
However, if you still want to refactor the wrappers but don’t want to repeat the boilerplate implementation of the operator overloads, the alternative is to use preprocessor macros:
// Note: a macro name cannot contain '+=', so the operator is encoded in the name
#define DEFINE_OPERATOR_PLUS_EQUAL(RET_TYPE,ARG_TYPE)\
    inline RET_TYPE& operator+=(const ARG_TYPE& rhs)\
    {\
        *this = *this + rhs;\
        return *this;\
    }

// ... etc for other computed assignment operators

#define DEFINE_ASSIGNMENT_OPERATORS(TYPE,SCALAR_TYPE)\
    DEFINE_OPERATOR_PLUS_EQUAL(TYPE,TYPE)\
    DEFINE_OPERATOR_PLUS_EQUAL(TYPE,SCALAR_TYPE)\
    DEFINE_OPERATOR_MINUS_EQUAL(TYPE,TYPE)\
    DEFINE_OPERATOR_MINUS_EQUAL(TYPE,SCALAR_TYPE)
    // etc
This is much less elegant, but it comes with the guarantee that there won’t be any performance issue.
Performance is not an intuitive domain; we have to check every assumption we make, because these assumptions can be a legacy of times when compilers were inefficient or buggy, or a bias due to our misunderstanding of some mechanisms of the language. Here we’ve seen that neither operator overloads nor passing arguments by constant reference instead of by value introduces any performance issue with GCC, but this might be different with another compiler.
In the previous section we saw how to plug the wrappers into existing code and ended up with the following loop:
std::vector<float> a, b, c, d, e;
// Somewhere in the code the vectors are resized
// so they hold n elements

for(size_t i = 0; i < n/4; i += 4)
{
    vector4f av; av.load_a(&a[i]);
    vector4f bv; bv.load_a(&b[i]);
    vector4f cv; cv.load_a(&c[i]);
    vector4f dv; dv.load_a(&d[i]);
    vector4f ev = av*bv + cv*dv;
    ev.store_a(&e[i]);
}

// Remaining part of the loop
// ...
As said in the previous section, the first problem of this code is its lack of genericity: we are highly coupled to the wrapped SIMD instruction set, and replacing it with another one requires code changes we should avoid. If we want to make the code independent from the SIMD instruction set and the related wrapper, we need to hide the specifics of this instruction set, that is, the vector type and its size (the number of scalars it holds).
We want to be able to select the right wrapper depending on the scalar type and the instruction set used. When talking about selecting a type depending on another one, the first thing that comes to mind is type traits. Here our traits must contain the wrapper type and its size associated with the scalar type used:
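A minimal sketch of the general (non-specialized) definition, where a scalar type is considered its own vector type with a size of 1 (consistent with the simd_functions_invoker specialization shown further down):

template <class T>
struct simd_traits
{
    typedef T type;
    static const size_t size = 1;
};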
The general definition of the traits class allows us to write code that works even for types that don’t have related wrappers (numerical types defined by another user for instance). Then we need to specialize these definitions for float and double, depending on the considered instruction set. Assume we can detect the instruction set available on our system and save this information in a macro (we’ll see how to do that in a later section). The specialization of the traits class will look like:
#ifdef USE_SSE

template <>
struct simd_traits<float>
{
    typedef vector4f type;
    static const size_t size = 4;
};

template <>
struct simd_traits<double>
{
    typedef vector2d type;
    static const size_t size = 2;
};

#elif defined(USE_AVX)

template <>
struct simd_traits<float>
{
    typedef vector8f type;
    static const size_t size = 8;
};

template <>
struct simd_traits<double>
{
    typedef vector4d type;
    static const size_t size = 4;
};

#endif
Now we can adapt the loop so it doesn’t explicitly refer to the vector4f type:
std::vector<float> a,b,c,d,e;
// ... resize a, b, c, d, and e so they hold n elements

typedef simd_traits<float>::type vec_type;
size_t vec_size = simd_traits<float>::size;

for(size_t i = 0; i < n/vec_size; i += vec_size)
{
    vec_type av; av.load_a(&a[i]);
    vec_type bv; bv.load_a(&b[i]);
    vec_type cv; cv.load_a(&c[i]);
    vec_type dv; dv.load_a(&d[i]);
    vec_type ev = av*bv + cv*dv;
    ev.store_a(&e[i]);
}

// Remaining part of the loop
// ...
That’s it! If we need to compile this code on a system where AVX is available, we have nothing to do. The macro USE_AVX will be defined, the specialization of simd_traits with vector8f as inner type will be instantiated, and the loop will use the vector8f wrapper and the AVX intrinsics. However, there’s still a problem: we can migrate to any SIMD instruction set for which a wrapper is available, but we can’t use types that don’t have related wrappers. The simd_traits works fine even for user defined types, but the load and store functions are available for wrappers only. We need to provide generic versions of these functions that work with any type.
Actually, all we have to do is to provide two versions of these functions: one for types that don’t have related wrappers, and one that works with wrappers. Template specialization can be of help here, but since partial specialization is not possible for functions, let’s wrap them into a simd_functions_invoker class:
// Common implementation for types that support vectorization
template <class T, class V>
struct simd_functions_invoker
{
    inline static V set1(const T& a) { return V(a); }

    inline static V load_a(const T* src)
    {
        V res;
        res.load_a(src);
        return res;
    }

    inline static V load_u(const T* src)
    {
        V res;
        res.load_u(src);
        return res;
    }

    inline static void store_a(T* dst, const V& src) { src.store_a(dst); }
    inline static void store_u(T* dst, const V& src) { src.store_u(dst); }
};

// Specialization for types that don't support vectorization
template <class T>
struct simd_functions_invoker<T,T>
{
    inline static T set1(const T& a) { return T(a); }
    inline static T load_a(const T* src) { return *src; }
    inline static T load_u(const T* src) { return *src; }
    inline static void store_a(T* dst, const T& src) { *dst = src; }
    inline static void store_u(T* dst, const T& src) { *dst = src; }
};
We’ve added the set1 function so we can initialize wrappers and scalar types from a single value in a uniform way. Calling the generic functions would look like:
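Something like this at the call site (a sketch):

vec_type av = simd_functions_invoker<float,simd_traits<float>::type>::load_a(&a[i]);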
That’s far too verbose. Let’s add façade functions that deduce the template parameters for us:
template <class T>
inline typename simd_traits<T>::type set1(const T& a)
{
    return simd_functions_invoker<T,typename simd_traits<T>::type>::set1(a);
}

template <class T>
inline typename simd_traits<T>::type load_a(const T* src)
{
    return simd_functions_invoker<T,typename simd_traits<T>::type>::load_a(src);
}

template <class T>
inline typename simd_traits<T>::type load_u(const T* src)
{
    return simd_functions_invoker<T,typename simd_traits<T>::type>::load_u(src);
}

template <class T>
inline void store_a(T* dst, const typename simd_traits<T>::type& src)
{
    simd_functions_invoker<T,typename simd_traits<T>::type>::store_a(dst,src);
}

template <class T>
inline void store_u(T* dst, const typename simd_traits<T>::type& src)
{
    simd_functions_invoker<T,typename simd_traits<T>::type>::store_u(dst,src);
}
Now we can use these generic functions in the previous loop so it works with any type, even those that don’t support vectorization:
std::vector<float> a,b,c,d,e;
// ... resize a, b, c, d, and e so they hold n elements

typedef simd_traits<float>::type vec_type;
size_t vec_size = simd_traits<float>::size;

for(size_t i = 0; i < n/vec_size; i += vec_size)
{
    vec_type av = load_a(&a[i]);
    vec_type bv = load_a(&b[i]);
    vec_type cv = load_a(&c[i]);
    vec_type dv = load_a(&d[i]);
    vec_type ev = av*bv + cv*dv;
    store_a(&e[i],ev);
}

// Remaining part of the loop
// ...
Or, if you want to be more concise:
std::vector<float> a,b,c,d,e;
// ... resize a, b, c, d, and e so they hold n elements

typedef simd_traits<float>::type vec_type;
size_t vec_size = simd_traits<float>::size;

for(size_t i = 0; i < n/vec_size; i += vec_size)
{
    vec_type ev = load_a(&a[i])*load_a(&b[i]) + load_a(&c[i])*load_a(&d[i]);
    store_a(&e[i], ev);
}

// Remaining part of the loop
// ...
We’ve reached our goal: we can use intrinsics almost like floats. In real application code, it is likely that you initialize the wrappers through the load functions, then perform the computations, and finally store the results (as in the less concise version of the generic loop). Thus the only difference between classical code and code with SIMD wrappers is the initialization and storing of the wrappers (and possibly the function signatures, if you want to pass wrappers instead of scalars); the other parts should be exactly the same, and the code remains easy to read and to maintain.
Until now, we’ve assumed we were able to detect the available instruction sets at compile time. Let’s see now how to achieve this. Compilers often provide preprocessor tokens depending on the available instruction sets, but these tokens may vary from one compiler to another, so we have to standardize that. On most compilers, the tokens look like __SSE__ or __SSE3__; on 32-bit systems, Microsoft compilers set the preprocessor token _M_IX86_FP to 1 for SSE (vectorization of float) and to 2 for SSE2 (vectorization of double and integers).
Here is how we can standardize that:
#if (defined(_M_AMD64) || defined(_M_X64) || defined(__amd64)) && ! defined(__x86_64__)
    #define __x86_64__ 1
#endif

// Find sse instruction set from compiler macros if SSE_INSTR_SET not defined
// Note: Not all compilers define these macros automatically
#ifndef SSE_INSTR_SET
    #if defined ( __AVX2__ )
        #define SSE_INSTR_SET 8
    #elif defined ( __AVX__ )
        #define SSE_INSTR_SET 7
    #elif defined ( __SSE4_2__ )
        #define SSE_INSTR_SET 6
    #elif defined ( __SSE4_1__ )
        #define SSE_INSTR_SET 5
    #elif defined ( __SSSE3__ )
        #define SSE_INSTR_SET 4
    #elif defined ( __SSE3__ )
        #define SSE_INSTR_SET 3
    #elif defined ( __SSE2__ ) || defined ( __x86_64__ )
        #define SSE_INSTR_SET 2
    #elif defined ( __SSE__ )
        #define SSE_INSTR_SET 1
    #elif defined ( _M_IX86_FP )  // Defined in MS compiler on 32bits system. 1: SSE, 2: SSE2
        #define SSE_INSTR_SET _M_IX86_FP
    #else
        #define SSE_INSTR_SET 0
    #endif // instruction set defines
#endif // SSE_INSTR_SET
Now we can use the SSE_INSTR_SET token to include the right file:
// Include the appropriate header file for intrinsic functions
#if SSE_INSTR_SET > 7           // AVX2 and later
    #ifdef __GNUC__
        #include <x86intrin.h>  // x86intrin.h includes header files for whatever instruction
                                // sets are specified on the compiler command line, such as:
                                // xopintrin.h, fma4intrin.h
    #else
        #include <immintrin.h>  // MS version of immintrin.h covers AVX, AVX2 and FMA3
    #endif // __GNUC__
#elif SSE_INSTR_SET == 7
    #include <immintrin.h>      // AVX
#elif SSE_INSTR_SET == 6
    #include <nmmintrin.h>      // SSE4.2
#elif SSE_INSTR_SET == 5
    #include <smmintrin.h>      // SSE4.1
#elif SSE_INSTR_SET == 4
    #include <tmmintrin.h>      // SSSE3
#elif SSE_INSTR_SET == 3
    #include <pmmintrin.h>      // SSE3
#elif SSE_INSTR_SET == 2
    #include <emmintrin.h>      // SSE2
#elif SSE_INSTR_SET == 1
    #include <xmmintrin.h>      // SSE
#endif // SSE_INSTR_SET includes
Note that if you split the implementation of SSE wrappers and AVX wrappers into different files, you can also use the SSE_INSTR_SET token to include the implementation file in the simd.hpp file:
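For instance, assuming hypothetical file names simd_sse.hpp and simd_avx.hpp for the two implementation files, the dispatch in simd.hpp could look like:

#if SSE_INSTR_SET > 6
    #include "simd_avx.hpp"
#elif SSE_INSTR_SET > 0
    #include "simd_sse.hpp"
#endif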
Now from the client code, the only file to include is simd.hpp, and everything will be available.
Now that we have nice wrappers providing basic functionality, what could be the next step? Well, first we could add a method to retrieve an element of the vector:
template <class X>
class simd_vector
{
public:

    typedef typename simd_traits<X>::value_type value_type;

    // ...

    value_type operator[](size_t index) const
    {
        // the size is a compile-time constant, so a plain array can be used
        value_type v[simd_traits<X>::size];
        (*this)().store_u(v);
        return v[index];
    }
};
We can also add a horizontal add function, useful for linear algebra products:
inline float hadd(const vector4f& rhs)
{
#if SSE_INSTR_SET >= 3  // SSE3
    __m128 tmp0 = _mm_hadd_ps(rhs,rhs);
    __m128 tmp1 = _mm_hadd_ps(tmp0,tmp0);
#else
    __m128 tmp0 = _mm_add_ps(rhs,_mm_movehl_ps(rhs,rhs));
    __m128 tmp1 = _mm_add_ss(tmp0,_mm_shuffle_ps(tmp0,tmp0,1));
#endif
    return _mm_cvtss_f32(tmp1);
}
Another useful project would be to write overloads of standard mathematical functions (exp, log, etc) that work with the wrappers.
As you can see, writing the wrappers is just the beginning, you can then enrich them with whatever functionality you need but this goes beyond the topic of this first series of articles.
Now that we have nice wrappers, let’s see how we can use them in real code. Consider the following loop:
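The loop in question is, presumably, the classic element-by-element computation that the vectorized versions below reproduce:

std::vector<float> a, b, c, d, e;
// somewhere in the code, a, b, c, d and e are resized so they hold n elements

for(size_t i = 0; i < n; ++i)
{
    e[i] = a[i]*b[i] + c[i]*d[i];
}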
A first solution could be to store vector of vector4f instead of vector of float:
Not so bad: thanks to the operator overloads, the code is exactly the same as the one for float, but the operations are performed on four floats at once. If n is not a multiple of four, we allocate an additional vector4f in each vector and initialize the unused elements with 0.
The problem is you could need to work with the scalar instead of the vector4f, for instance if you search for a specific element in the vector or if you fill your vector pushing back elements one by one. In this case, you would have to recode any piece of algorithm that works on single elements (and that includes a lot of STL algorithms) and then add special code for working on scalars within a vector4f. Working on scalars within vector4f is possible (we will see later how to modify our wrappers so that we can do it), but is slower than working directly on scalars, thus you could lose the benefits of using vectorization.
Another solution could be to initialize the wrapper from values stored in a vector:
std::vector<float> a, b, c, d, e;
// somewhere in the code, a, b, c, d and e are
// resized so they hold n elements
// ...

for(size_t i = 0; i < n/4; i += 4)
{
    vector4f av(a[i],a[i+1],a[i+2],a[i+3]);
    vector4f bv(b[i],b[i+1],b[i+2],b[i+3]);
    vector4f cv(c[i],c[i+1],c[i+2],c[i+3]);
    vector4f dv(d[i],d[i+1],d[i+2],d[i+3]);
    vector4f ev = av*bv + cv*dv;
    // how do we store ev in e[i],e[i+1],e[i+2],e[i+3] ?
}
for(size_t i = n/4; i < n; ++i)
{
    e[i] = a[i]*b[i] + c[i]*d[i];
}
The first problem is that we need a way to store a vector4f into 4 floats; as said in the previous paragraph, we can add to our wrappers a method that returns a scalar within the vector4f and invoke it that way:
The second problem is that this code is not generic; if you migrate from SSE to AVX, you’ll have to update the initialization of your wrapper so it takes 8 floats; the same for storing your vector4f in scalar results.
What we need here is a way to load float into vector4f and to store vector4f into floats that doesn’t depend on the size of vector4f (that is, 4). That’s the aim of the load and store intrinsics.
If you take a look at the xmmintrin.h file, you’ll notice the compiler provides two kinds of load and store intrinsics:
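Namely, the aligned and unaligned variants; for single-precision SSE they are:

__m128 _mm_load_ps(const float* p);      // p must be 16-byte aligned
__m128 _mm_loadu_ps(const float* p);     // no alignment requirement
void _mm_store_ps(float* p, __m128 a);   // p must be 16-byte aligned
void _mm_storeu_ps(float* p, __m128 a);  // no alignment requirement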
Intrinsics with alignment constraints are faster, and should be used by default; however, even if memory allocations are aligned, you can’t guarantee that the memory buffer you pass to the load / store function is aligned. Indeed, consider the matrix product C=AxB, where A is a 15x15 matrix of floats with linear row storage and B a vector that holds 15 float elements. The computation of C[1] starts with:
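A sketch of that first step, where load and loadu stand for the aligned and unaligned load intrinsics (row 1 of A starts at a[15]):

vector4f tmp = loadu(&a[15]) * load(&b[0])
             + loadu(&a[19]) * load(&b[4])
             + loadu(&a[23]) * load(&b[8]);
// plus the remaining scalar elements and a final horizontal add
// ...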
Here, if A is 16-byte aligned, since the size of a float is 4 bytes, a[15], a[19] and a[23] aren’t 16-byte aligned, and you have to use the unaligned overload of the intrinsics (designated by the generic loadu function in the sample code).
Here’s how we need to update our wrappers to handle load and store functions:
class vector4f : public simd_vector<vector4f>
{
public:

    // ...

    inline vector4f& load_a(const float* src)
    {
        m_value = _mm_load_ps(src);
        return *this;
    }

    inline vector4f& load_u(const float* src)
    {
        m_value = _mm_loadu_ps(src);
        return *this;
    }

    inline void store_a(float* dst) const
    {
        _mm_store_ps(dst,m_value);
    }

    inline void store_u(float* dst) const
    {
        _mm_storeu_ps(dst,m_value);
    }
};
Assuming the memory buffer of std::vector is 16-byte aligned, the sample code becomes:
std::vector<float> a, b, c, d, e;
// somewhere in the code, a, b, c, d and e are
// resized so they hold n elements
// ...

for(size_t i = 0; i < n/4; i += 4)
{
    vector4f av; av.load_a(&a[i]);
    vector4f bv; bv.load_a(&b[i]);
    vector4f cv; cv.load_a(&c[i]);
    vector4f dv; dv.load_a(&d[i]);
    vector4f ev = av*bv + cv*dv;
    ev.store_a(&e[i]);
}
for(size_t i = n/4; i < n; ++i)
{
    e[i] = a[i]*b[i] + c[i]*d[i];
}
Now, if we migrate our code from SSE to AVX, all we have to do is replace vector4f with vector8f! (OK, we also have to deal with memory alignment issues; I’ll come back to this in a moment.) We’ll see in a future section how we can avoid the explicit usage of vector4f so we get full genericity. But for now, we have to face a last problem: in the sample code, we assumed the memory buffer wrapped by std::vector was 16-byte aligned. How do we know a memory allocation is aligned, and how do we know the alignment boundary?
The answer is that it depends on your system and your compiler. On 64-bit Windows, dynamic memory allocation is 16-byte aligned; on GNU systems, the address of a block returned by malloc or realloc is always a multiple of 8 (on 32-bit systems) or 16 (on 64-bit systems). So if we want to write code generic enough to handle many SIMD instruction sets, it is clear that we must provide a way to ensure memory allocation is always aligned, and aligned on a given boundary.
The solution is to design an aligned memory allocator and to use it in std::vector:
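At the call site, this looks like (the aligned_allocator class itself is covered in a dedicated article):

std::vector<float,aligned_allocator<float,16> > a;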
Now, we can handle any alignment boundary requirement through a typedef:
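For instance (the alias name is of my choosing):

typedef std::vector<float,aligned_allocator<float,16> > aligned_vector;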
Another issue we have to deal with when we plug in our wrappers is conditional branching: the if-else statement evaluates a branch depending on a scalar condition, but the if statement works only for scalar conditions, and we cannot directly overload it to work with our wrappers. Consider the following code:
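Presumably something like the following, where the branch taken for each element depends on the sign of a[i]:

for(size_t i = 0; i < n; ++i)
{
    if(a[i] > 0)
        e[i] = a[i]*b[i] + c[i]*d[i];
    else
        e[i] = b[i] + c[i]*d[i];
}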
What we do here is selecting a value for e[i] depending on the sign of a[i]; the code could be written in a sub-optimal way:
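That is, evaluating both branches and then picking one with a scalar select helper (a sketch):

inline float select(bool cond, float a, float b)
{
    return cond ? a : b;
}

for(size_t i = 0; i < n; ++i)
{
    e[i] = select(a[i] > 0, a[i]*b[i] + c[i]*d[i], b[i] + c[i]*d[i]);
}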
Even though the “select” function is a bit overkill in the scalar case, it is exactly what we need for handling conditional branching with the SIMD wrappers. This means the two values (or “branches”) of the conditional statement will be evaluated before we choose the one to assign, but we can’t do better. And since the conditional statement is executed on 4 floats at once, it is still faster than the scalar version, even if suboptimal. The only case where the vectorized code could lose performance compared to the scalar code is if one of the conditional branches takes much more time to compute than the other and its result is seldom used.
Knowing this, let’s see how we can implement a select function taking SIMD wrapper parameters. Depending on the SSE version, the compiler may provide an intrinsic we can directly use as a ternary operator. If not, we have to handle it with plain bitwise logic:
vector4f select(const vector4fb& cond, const vector4f& a, const vector4f& b)
{
    // Don't bother with the SSE_INSTR_SET preprocessor token, we'll come back to it later
#if SSE_INSTR_SET >= 5  // SSE 4.1
    return _mm_blendv_ps(b,a,cond);
#else
    return _mm_or_ps(_mm_and_ps(cond,a),_mm_andnot_ps(cond,b));
#endif
}
That’s it! We can now write the previous loop using full vectorization:
for(size_t i = 0; i < n/4; i += 4)
{
    vector4f av; av.load_a(&a[i]);
    vector4f bv; bv.load_a(&b[i]);
    vector4f cv; cv.load_a(&c[i]);
    vector4f dv; dv.load_a(&d[i]);
    vector4f e_tmp1 = av*bv + cv*dv;
    vector4f e_tmp2 = bv + cv*dv;
    vector4f ev = select(av > 0, e_tmp1, e_tmp2);
    ev.store_a(&e[i]);
}

// scalar version for the last elements of the vectors
// ...
Although this code is far better than using intrinsics directly, it is still very verbose and, worse, not generic. If you want to update your code to take advantage of AVX instead of SSE, you need to replace every occurrence of vector4f with vector8f, and to change the loop condition and increment to take into account the size of vector8f. Doing this in real code will quickly become painful.
What we need here is full genericity, so that replacing one instruction set with another requires almost no code change. That’s the point of the next section.
Now that we know a little more about SSE and AVX, let’s write some code; the wrappers will have a data vector member and provide arithmetic, comparison and logical operator overloads. Throughout this section, I will mainly focus on vector4f, the wrapper around __m128, but translating the code for the other data vectors should not be difficult thanks to the previous section. Since the wrappers will be used as numerical types, they must have value semantics, that is, they must define a copy constructor, an assignment operator and a non-virtual destructor.
SSE and AVX data vectors can be initialized from different inputs: a single value for all elements, a value per element, or another data vector.
class vector4f
{
public:

    inline vector4f() {}
    inline vector4f(float f) : m_value(_mm_set1_ps(f)) {}
    inline vector4f(float f0, float f1, float f2, float f3) : m_value(_mm_setr_ps(f0,f1,f2,f3)) {}
    inline vector4f(const __m128& rhs) : m_value(rhs) {}

    inline vector4f& operator=(const __m128& rhs)
    {
        m_value = rhs;
        return *this;
    }

    inline vector4f(const vector4f& rhs) : m_value(rhs.m_value) {}

    inline vector4f& operator=(const vector4f& rhs)
    {
        m_value = rhs.m_value;
        return *this;
    }

private:

    __m128 m_value;
};
The operator overloads have to access the m_value member of the wrapper so they can pass it as an argument to the intrinsic functions.
We could declare the operator overloads as friend functions of the wrapper class, or provide a get method returning the internal m_value. Both of these solutions work, but neither is elegant: the first requires a huge amount of friend declarations, the second produces verbose code that is unpleasant to read.
A more elegant solution is to provide a conversion operator from vector4f to __m128; since vector4f can already be implicitly constructed from __m128, we can now use vector4f and __m128 interchangeably. Moreover, we can drop the vector4f copy constructor and assignment operator:
class vector4f
{
public:

    inline vector4f() {}
    inline vector4f(float f) : m_value(_mm_set1_ps(f)) {}
    inline vector4f(float f0, float f1, float f2, float f3) : m_value(_mm_setr_ps(f0,f1,f2,f3)) {}
    inline vector4f(const __m128& rhs) : m_value(rhs) {}

    inline vector4f& operator=(const __m128& rhs)
    {
        m_value = rhs;
        return *this;
    }

    inline operator __m128() const { return m_value; }

    // vector4f(const vector4f&) and operator=(const vector4f&) are not required anymore:
    // the conversion operator will be called before calling vector4f(const __m128&)
    // or operator=(const __m128&)

private:

    __m128 m_value;
};
The next step is to write the arithmetic operator overloads. The classic way to do this is to write computed assignment operators and to use them in the operator overloads, so they don’t have to access private members of vector4f; but since vector4f can be implicitly converted to __m128, we can do the opposite and avoid using a temporary (this won’t have any impact on performance since the compiler can optimize it away, but it produces shorter and more pleasant code to read):
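The resulting operator is essentially the operator+ that appears in the full listing below; the implicit conversions in both directions do all the work:

inline vector4f operator+(const vector4f& lhs, const vector4f& rhs)
{
    return _mm_add_ps(lhs,rhs);
}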
We could go ahead and write the remaining arithmetic operators overloads, just as we did before:
vector4f operator+(const vector4f&, const vector4f&);

// Adds the same float value to each data vector member
vector4f operator+(const vector4f&, const float&);
vector4f operator+(const float&, const vector4f&);

// Similar for operator-, operator* and operator/
// ...

vector4f operator-(const vector4f&);

vector4f& operator++();
vector4f operator++(int);

// Similar for operator--
// ...
But wait! Whenever you add a new wrapper, you’ll have to write these operator overloads again. Besides the fact that you will need to type a lot of boilerplate code, the computed assignment operators will be the same as those of vector4f (that is, invoke the corresponding operator overload and return the object), and even some operator overloads will have the same code as the vector4f ones. Code duplication is never good, and we should look for ways to avoid it.
If we had encountered this problem for classes with entity semantics, we would have captured the common code into a base class, and delegate the specific behavior to virtual methods, a typical use of classical dynamic polymorphism. What we need here is an equivalent architecture for classes with value semantics and no virtual methods (since virtual assignment operators are nonsense). This equivalent architecture is the CRTP (Curiously Recurring Template Pattern). A lot has been written about CRTP and I will not dwell on it. If you don’t know about this pattern, the most important thing to know is CRTP allows you to invoke methods of inheriting classes from the base class just as you would do through virtual methods, except the target methods are resolved at compile time.
Let’s call our base class simd_vector, it will be used as base class for every wrapper; here is what it should look like:
template <class X>
struct simd_vector_traits;

template <class X>
class simd_vector
{
public:

    typedef typename simd_vector_traits<X>::value_type value_type;

    // downcast operators so we can call methods in the inheriting classes
    inline X& operator()() { return *static_cast<X*>(this); }
    inline const X& operator()() const { return *static_cast<const X*>(this); }

    // Additional assignment operators
    inline X& operator+=(const X& rhs)
    {
        (*this)() = (*this)() + rhs;
        return (*this)();
    }

    inline X& operator+=(const value_type& rhs)
    {
        (*this)() = (*this)() + X(rhs);
        return (*this)();
    }

    // Same for operator-=, operator*=, operator/= ...
    // ...

    // Increment operators
    inline X operator++(int)
    {
        X tmp = (*this)();
        (*this) += value_type(1);
        return tmp;
    }

    inline X& operator++()
    {
        (*this)() += value_type(1);
        return (*this)();
    }

    // Similar decrement operators
    // ...

protected:

    // Ensure only inheriting classes can instantiate / copy / assign simd_vector.
    // Avoids incomplete copy / assignment from client code.
    inline simd_vector() {}
    inline ~simd_vector() {}
    inline simd_vector(const simd_vector&) {}
    inline simd_vector& operator=(const simd_vector&) { return *this; }
};

template <class X>
inline X operator+(const simd_vector<X>& lhs,
                   const typename simd_vector_traits<X>::value_type& rhs)
{
    return lhs() + X(rhs);
}

template <class X>
inline X operator+(const typename simd_vector_traits<X>::value_type& lhs,
                   const simd_vector<X>& rhs)
{
    return X(lhs) + rhs();
}

// Same for operator-, operator*, operator/
// ...
Now, all vector4f needs to do is inherit from simd_vector and implement the traditional operator+, and it will get the += and ++ operator overloads for free (and the same for the other arithmetic operators):
class vector4f : public simd_vector<vector4f>
{
public:

    inline vector4f() {}
    inline vector4f(float f) : m_value(_mm_set1_ps(f)) {}
    inline vector4f(float f0, float f1, float f2, float f3) : m_value(_mm_setr_ps(f0,f1,f2,f3)) {}
    inline vector4f(const __m128& rhs) : m_value(rhs) {}

    inline vector4f& operator=(const __m128& rhs)
    {
        m_value = rhs;
        return *this;
    }

    inline operator __m128() const { return m_value; }

    // No more operator+= since it is implemented in the base class

private:

    __m128 m_value;
};

// Based on this operator implementation, simd_vector<vector4f> will generate
// the following methods and overloads:
// vector4f& operator+=(const vector4f&)
// vector4f operator++(int)
// vector4f& operator++()
// vector4f operator+(const vector4f&, const float&)
// vector4f operator+(const float&, const vector4f&)
inline vector4f operator+(const vector4f& lhs, const vector4f& rhs)
{
    return _mm_add_ps(lhs,rhs);
}

inline vector4f operator-(const vector4f& lhs, const vector4f& rhs)
{
    return _mm_sub_ps(lhs,rhs);
}

inline vector4f operator*(const vector4f& lhs, const vector4f& rhs)
{
    return _mm_mul_ps(lhs,rhs);
}

inline vector4f operator/(const vector4f& lhs, const vector4f& rhs)
{
    return _mm_div_ps(lhs,rhs);
}
Looks good, doesn’t it? Every time we want to implement a new wrapper, we only have to code four operators and make our class inherit from simd_vector, and all the overloads will be generated for free!
Just one remark before we continue with the comparison operators. As you may have noticed, the base class simd_vector defines a type named value_type, depending on the nature of the inheriting class (float for vector4f, double for vector2d, ...). However, this type is not defined by the inheriting class, but by a traits class instead. This is a constraint of the CRTP pattern: you can refer to the inheriting class as long as the compiler doesn’t instantiate the code; if you call a method defined in the inheriting class, the compiler will assume it exists until it has to instantiate the code. But type resolution is different, and you have to define the type outside the inheriting class. This is one reason for the existence of the simd_vector_traits class. Other reasons will be discussed in a later section. Note that the class containing the type definition doesn’t have to be fully defined at this point: a simple forward declaration is sufficient.
EDIT 20/11/2014: it seems the CRTP layer introduces a slight overhead (at least with GCC), see this article for more details and an alternative solution.
Since ordinary comparison operators return boolean values, we need to implement SIMD wrappers for booleans. The number of boolean elements in these wrappers is directly related to the number of floating point values wrapped by our arithmetic wrappers.
In order not to duplicate code, we’ll use the same architecture as for the arithmetic wrappers: a CRTP base class for the common code, and inheriting classes for the specific implementation. Here is the implementation of the simd_vector_bool class, the base used to generate bitwise assignment operators and logical operator overloads in the inheriting classes:
template <class X>
class simd_vector_bool
{
public:

    inline X& operator()() { return *static_cast<X*>(this); }
    inline const X& operator()() const { return *static_cast<const X*>(this); }

    inline X& operator&=(const X& rhs)
    {
        (*this)() = (*this)() && rhs;
        return (*this)();
    }

    inline X& operator|=(const X& rhs)
    {
        (*this)() = (*this)() || rhs;
        return (*this)();
    }

    inline X& operator^=(const X& rhs)
    {
        (*this)() = (*this)() ^ rhs;
        return (*this)();
    }

protected:

    inline simd_vector_bool() {}
    inline ~simd_vector_bool() {}
    inline simd_vector_bool(const simd_vector_bool&) {}
    inline simd_vector_bool& operator=(const simd_vector_bool&) { return *this; }
};

template <class X>
inline X operator&&(const simd_vector_bool<X>& lhs, const simd_vector_bool<X>& rhs)
{
    return lhs() & rhs();
}

template <class X>
inline X operator&&(const simd_vector_bool<X>& lhs, bool rhs)
{
    return lhs() & rhs;
}

template <class X>
inline X operator&&(bool lhs, const simd_vector_bool<X>& rhs)
{
    return lhs & rhs();
}

// Similar for operator|| overloads
// ...

template <class X>
inline X operator!(const simd_vector_bool<X>& rhs)
{
    return rhs() == 0;
}
The inheriting class vector4fb only has to provide bitwise operators and equality/inequality operators:
class vector4fb : public simd_vector_bool<vector4fb>
{
public:

    inline vector4fb() {}
    inline vector4fb(bool b) : m_value(_mm_castsi128_ps(_mm_set1_epi32(-(int)b))) {}
    inline vector4fb(bool b0, bool b1, bool b2, bool b3)
        : m_value(_mm_castsi128_ps(_mm_setr_epi32(-(int)b0,-(int)b1,-(int)b2,-(int)b3))) {}

    inline vector4fb(const __m128& rhs) : m_value(rhs) {}

    inline vector4fb& operator=(const __m128& rhs)
    {
        m_value = rhs;
        return *this;
    }

    inline operator __m128() const { return m_value; }

private:

    __m128 m_value;
};

inline vector4fb operator&(const vector4fb& lhs, const vector4fb& rhs)
{
    return _mm_and_ps(lhs,rhs);
}

inline vector4fb operator|(const vector4fb& lhs, const vector4fb& rhs)
{
    return _mm_or_ps(lhs,rhs);
}

inline vector4fb operator^(const vector4fb& lhs, const vector4fb& rhs)
{
    return _mm_xor_ps(lhs,rhs);
}

inline vector4fb operator~(const vector4fb& rhs)
{
    return _mm_xor_ps(rhs,_mm_castsi128_ps(_mm_set1_epi32(-1)));
}

inline vector4fb operator==(const vector4fb& lhs, const vector4fb& rhs)
{
    return _mm_cmpeq_ps(lhs,rhs);
}

inline vector4fb operator!=(const vector4fb& lhs, const vector4fb& rhs)
{
    return _mm_cmpneq_ps(lhs,rhs);
}
Now that we have wrappers for booleans, we can add the comparison operators to the vector4f class; again, to avoid code duplication, some operators will be implemented in the base class, on top of specific operators implemented in the inheriting class. Let’s start with the vector4f comparison operators:
// Definition of vector4f and arithmetic overloads
// ...

inline vector4fb operator==(const vector4f& lhs, const vector4f& rhs)
{
    return _mm_cmpeq_ps(lhs,rhs);
}

inline vector4fb operator!=(const vector4f& lhs, const vector4f& rhs)
{
    return _mm_cmpneq_ps(lhs,rhs);
}

inline vector4fb operator<(const vector4f& lhs, const vector4f& rhs)
{
    return _mm_cmplt_ps(lhs,rhs);
}

inline vector4fb operator<=(const vector4f& lhs, const vector4f& rhs)
{
    return _mm_cmple_ps(lhs,rhs);
}
Before we implement operator> and operator>= in the base class, we have to focus on their return type. If these operators were implemented for vector4f, they would simply return vector4fb; but since they are implemented for the base class, they need to return the boolean wrapper related to the arithmetic wrapper, i.e. to the inheriting class. What we need here is a mapping from the arithmetic wrapper type to the boolean wrapper type, defined somewhere. Remember the simd_vector_traits structure we declared to define our value_type? It is the perfect place for that mapping:
// simd_vector_traits<vector4f> must be defined before vector4f so simd_vector can compile
// (remember we use simd_vector_traits<X>::value_type in the definition of simd_vector).
class vector4f;

// Full specialization of the template simd_vector_traits declared in simd_base.hpp
template <>
struct simd_vector_traits<vector4f>
{
    typedef float value_type;
    typedef vector4fb vector_bool;
};

class vector4f
{
    // ...
One last remark before we add the missing comparison operators: since the simd_vector_traits template is only declared, never defined, and fully specialized for each wrapper, there is no risk of forgetting to specialize it when we add a new wrapper; we would get a compilation error.
Finally, we can add the missing operators for the base class:
// Declaration of simd_vector and operators
// ...

template <class X>
inline typename simd_vector_traits<X>::vector_bool
operator>(const simd_vector<X>& lhs, const simd_vector<X>& rhs)
{
    return rhs() < lhs();
}

template <class X>
inline typename simd_vector_traits<X>::vector_bool
operator>=(const simd_vector<X>& lhs, const simd_vector<X>& rhs)
{
    return rhs() <= lhs();
}
Since float provides logical operators, our wrapper should do so too. The implementation follows the same pattern as for the simd_vector_bool class, that is, logical assignment operators in the simd_vector base class, and operator overloads for the inheriting classes. The implementation of operator|, operator&, operator^ and operator~ is the same as the one for vector4fb, so I don’t repeat it here.
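A minimal sketch of what this looks like, following the same pattern; only operator&= and operator& are shown, the other operators being analogous:

// In the simd_vector base class: assignment operators expressed
// in terms of the overloads provided by the inheriting class
inline X& operator&=(const X& rhs)
{
    (*this)() = (*this)() & rhs;
    return (*this)();
}
// operator|= and operator^= follow the same pattern

// For vector4f, the overloads map to the same intrinsics as for vector4fb
inline vector4f operator&(const vector4f& lhs, const vector4f& rhs)
{
    return _mm_and_ps(lhs, rhs);
}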
The next step is to implement the wrapper for 2 doubles, and then the wrappers for 8 floats and 4 doubles if you want to support AVX. You can also implement wrappers for int if you intend to do integer computation. The implementation is similar to what has been done in this section.
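For instance, the double-precision wrapper relies on the __m128d type and the _pd intrinsics. A rough sketch, where vector2db is a hypothetical name for the associated boolean wrapper, mirroring vector4fb:

// Sketch of the traits specialization and one comparison operator for
// the double-precision wrapper (vector2db is assumed to mirror vector4fb)
class vector2d;
class vector2db;

template <>
struct simd_vector_traits<vector2d>
{
    typedef double value_type;
    typedef vector2db vector_bool;
};

inline vector2db operator<(const vector2d& lhs, const vector2d& rhs)
{
    return _mm_cmplt_pd(lhs, rhs);
}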
Now that we have nice wrappers, we’ll see in the next section how to plug them into existing code.
Before we start writing any code, we need to take a look at the intrinsics provided with the compiler. From now on, I assume we use an Intel processor recent enough to provide the SSE 4 and AVX instruction sets; the compiler can be gcc or MSVC, as the intrinsics they provide are almost the same.
If you already know about SSE / AVX intrinsics you may skip this section.
SSE uses eight 128-bit registers, xmm0 to xmm7; the Intel and AMD 64-bit extensions add eight more, xmm8 to xmm15. Since each register is 128 bits wide, SSE intrinsics can operate on 4 packed floats, 2 packed doubles, 4 32-bit integers, and so on.
With AVX, the width of the SIMD registers is increased from 128 to 256 bits; the registers are renamed from xmm0-xmm7 to ymm0-ymm7 (and from xmm8-xmm15 to ymm8-ymm15). However, legacy SSE instructions can still be used, and the xmm registers can still be addressed since they are the lower halves of the ymm registers.
AVX512 will increase the width of the SIMD registers from 256 to 512 bits.
Intrinsic functions are made available in different header files, based on the version of the SIMD instruction set they belong to:
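For reference, the usual mapping on gcc and MSVC looks like this; the exact set of headers may vary with your compiler version:

#include <mmintrin.h>   // MMX
#include <xmmintrin.h>  // SSE
#include <emmintrin.h>  // SSE2
#include <pmmintrin.h>  // SSE3
#include <tmmintrin.h>  // SSSE3
#include <smmintrin.h>  // SSE4.1
#include <nmmintrin.h>  // SSE4.2
#include <immintrin.h>  // AVX and later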
Each of these files includes the previous one, so you only have to include the one matching the highest version of the SIMD instruction set available on your processor. Later we will see how to detect at compile time which version of the SIMD instruction set is available, and thus which file to include. For now, just assume we’re able to include the right file each time we need it.
Now if you take a look at these files, you will notice that the provided data types and functions follow some naming rules:
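A few representative examples, to give the idea (illustrative, not exhaustive):

#include <immintrin.h>

// Data types encode the register width and the element type
__m128  f4 = _mm_set1_ps(1.0f);  // 4 packed floats
__m128d d2 = _mm_set1_pd(1.0);   // 2 packed doubles
__m128i i4 = _mm_set1_epi32(1);  // packed integers

// Function names follow the pattern _mm_<operation>_<suffix>, where the
// suffix encodes the element type: ps = packed single, pd = packed double,
// epi32 = packed 32-bit integers, ss = scalar single, and so on
__m128  fsum = _mm_add_ps(f4, f4);
__m128d dsum = _mm_add_pd(d2, d2);
__m128i isum = _mm_add_epi32(i4, i4);

// AVX intrinsics use the __m256* types and the _mm256_ prefix,
// e.g. _mm256_add_ps for 8 packed floats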
Intrinsics encompass a wide set of features; we can distinguish the following categories (not exhaustive): load and store operations, arithmetic operations, comparison operations, conversions between types, and bitwise/logical operations.
I will not provide wrappers for all intrinsics, and some of them will be used only to build higher level functions in the wrappers.
Now that you know a little more about SSE and AVX intrinsics, you may reconsider the need for wrapping them; indeed, if you don’t need to handle other instruction sets, you could think of using the SSE/AVX intrinsics directly. I hope this sample code will make you change your mind:
// computes e = a*b + c*d using SSE, where a, b, c, d and e are vectors of floats
for(size_t i = 0; i < e.size(); i += 4)
{
    __m128 val = _mm_add_ps(_mm_mul_ps(_mm_load_ps(&a[i]),_mm_load_ps(&b[i])),
                            _mm_mul_ps(_mm_load_ps(&c[i]),_mm_load_ps(&d[i])));
    _mm_store_ps(&e[i],val);
}
Quite hard to read, right? And this is just two multiplications and one addition; imagine using intrinsics across a large amount of code, and you will get code that is really hard to understand and maintain. What we need is a way to use __m128 with the traditional arithmetic operators, as we do with float:
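Something along these lines, where vector4f is the wrapper type we are about to write (load_a and store_a are placeholder names for aligned load and store helpers, used here only for illustration):

// the same computation expressed with a wrapper type
for(size_t i = 0; i < e.size(); i += 4)
{
    vector4f av = load_a(&a[i]);
    vector4f bv = load_a(&b[i]);
    vector4f cv = load_a(&c[i]);
    vector4f dv = load_a(&d[i]);
    store_a(&e[i], av*bv + cv*dv);
}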
That’s the aim of the wrappers we start to write in the next section.
SIMD (and more generally vectorization) is a longstanding topic and a lot has been written about it. But when I had to use it in my own applications, it appeared that most of the articles were theoretical, explaining the principles of vectorization but lacking practical examples; some of them, however, linked to libraries using vectorization, but extending these libraries for my personal needs was difficult, if not painful. For this reason, I decided to implement my own library. This series of articles is the result of my work on the matter. I share it here in case someone faces the same problem.
SIMD stands for Single Instruction, Multiple Data, a class of parallel computers that can perform the same operation on multiple data points simultaneously. Let’s consider a summation we want to perform on two sets of four floating point numbers. The difference between scalar and SIMD operations is illustrated below:
Using scalar operations, four add instructions must be executed one after the other to obtain the sums, whereas SIMD uses a single instruction to achieve the same result. Thus SIMD operations achieve higher efficiency than scalar operations.
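In code, the difference looks roughly like this, where a, b and r are arrays of four floats (a minimal sketch using SSE intrinsics, which are presented in more detail in another article of this series):

// scalar version: four separate additions, executed one after the other
for(int i = 0; i < 4; ++i)
{
    r[i] = a[i] + b[i];
}

// SIMD version: a single SSE instruction adds the four pairs at once
__m128 va = _mm_loadu_ps(a);   // unaligned loads, to keep the example simple
__m128 vb = _mm_loadu_ps(b);
_mm_storeu_ps(r, _mm_add_ps(va, vb));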
SIMD instructions were first used in the early 1970s, but only became available in consumer-grade chips in the 90s, to allow real-time video processing and advanced computer graphics for video games. Each processor manufacturer has implemented its own SIMD instruction set: MMX, SSE and AVX for Intel, 3DNow! for AMD, AltiVec for PowerPC and Neon for ARM, among others.
Many of these instruction sets still coexist nowadays, and you have to deal with all of them if you want to write portable software. This is a first argument for writing wrappers: capture the abstraction of SIMD operations with nice interfaces, and forget about the implementation you rely on.
Although you can write assembly code to use SIMD instructions, compilers usually provide built-in functions so you can use SIMD instructions in C or C++ without writing a single line of assembly. These functions (and more generally functions whose implementation is handled specially by the compiler) are called intrinsic functions, often shortened to intrinsics. Of course, SIMD intrinsics depend on the underlying architecture, and may differ from one compiler to another even for the same SIMD instruction set. Fortunately, compilers tend to standardize the intrinsics’ prototypes for a given SIMD instruction set, so we only have to handle the differences between the various SIMD instruction sets.
In this series of articles, the focus will be on wrapping Intel’s SIMD instruction sets, although the wrappers will be generic enough that plugging in other instruction sets is easy.
Since SIMD instructions are longstanding, you might wonder if writing your own wrapper is relevant; maybe you could reuse an existing library wrapping these intrinsics. Well, it depends on your needs.
Agner Fog has written some very useful classes that handle the Intel SIMD instruction sets (different versions of SSE and AVX), but he doesn’t make heavy use of metaprogramming in his implementation. Hence adding a new wrapper (for a new instruction set, a new version of an existing one, or even for your own numerical types) requires typing a lot of code that could otherwise have been factored out. Moreover, some essential tools are missing, such as an aligned memory allocator (we will see later why you need such a tool). However, his library is a good starting point.
Another library you might want to consider is the Numerical Template Toolbox. Although it has a very comprehensive set of mathematical functions, its major drawback is that it really slows down compilation. Moreover, its development and documentation aren’t finished yet, and it might be difficult to extend.
And last but not least, writing your own wrapper will make you confront issues specific to SIMD instructions and understand how they work; thus you will be able to use SIMD intrinsics in a really efficient way, regardless of the implementation you choose (your own wrappers or an existing library).