The C++ scientist

Scientific computing, numerical methods and optimization in C++

Aligned Memory Allocator

Introduction

In a previous article about SIMD wrappers, I suggested designing a dedicated memory allocator to handle SIMD memory alignment constraints, but I didn’t give any details on how to do it. That’s the purpose of this article. The C++ standard describes a set of requirements an allocator must satisfy to work with standard containers. After a survey of these requirements, we’ll see how to implement an aligned memory allocator that meets them.
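
As a teaser, here is a minimal sketch of the interface such an allocator has to expose to be usable with standard containers; the class name aligned_allocator, the Alignment parameter and the use of std::aligned_alloc are my own illustrative choices, not necessarily what the article ends up with (on MSVC, _aligned_malloc / _aligned_free would be used instead).

#include <cstddef>
#include <cstdlib>
#include <new>

// Illustrative sketch: the typedefs and member functions below are those the
// standard requires from an allocator used with std::vector and friends;
// std::allocator_traits fills in the remaining defaults (construct, destroy, ...).
template <class T, std::size_t Alignment>
class aligned_allocator
{
public:
    using value_type      = T;
    using pointer         = T*;
    using const_pointer   = const T*;
    using reference       = T&;
    using const_reference = const T&;
    using size_type       = std::size_t;
    using difference_type = std::ptrdiff_t;

    template <class U>
    struct rebind { using other = aligned_allocator<U, Alignment>; };

    pointer allocate(size_type n)
    {
        // Round the byte count up to a multiple of Alignment, as aligned_alloc requires.
        size_type bytes = ((n * sizeof(T) + Alignment - 1) / Alignment) * Alignment;
        void* p = std::aligned_alloc(Alignment, bytes);
        if (p == nullptr)
            throw std::bad_alloc();
        return static_cast<pointer>(p);
    }

    void deallocate(pointer p, size_type) { std::free(p); }
};

template <class T, class U, std::size_t A>
bool operator==(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) { return true; }

template <class T, class U, std::size_t A>
bool operator!=(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) { return false; }

// Usage: std::vector<float, aligned_allocator<float, 16>> v(1024);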

Performance Considerations About SIMD Wrappers

When I posted a link to this blog on reddit, I got comments from people who were skeptical about the performance of the SIMD wrappers. They pointed out several possible performance hits in the implementation:

  • Arguments passed by const reference instead of by value, introducing a useless indirection and preventing the compiler from keeping the variables in registers (a sketch comparing both signatures follows this list)
  • Indirection due to the wrapping of __mXXX types in objects
  • Operator overloads preventing the compiler from properly reordering instructions during optimization
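
To make the first point concrete, here is a minimal sketch of the two signatures being compared; the stripped-down vector4f below is only a stand-in for the wrapper from the SIMD series, and whether the by-value version is actually faster depends on the compiler and the calling convention.

#include <xmmintrin.h>

// Stripped-down stand-in for the wrapper discussed in the SIMD series.
struct vector4f { __m128 m_value; };

// Criticized version: arguments taken by const reference.
inline vector4f add_by_ref(const vector4f& lhs, const vector4f& rhs)
{
    return vector4f{ _mm_add_ps(lhs.m_value, rhs.m_value) };
}

// Alternative: arguments taken by value, which may let the compiler keep the
// operands in XMM registers instead of reading them through a reference.
inline vector4f add_by_val(vector4f lhs, vector4f rhs)
{
    return vector4f{ _mm_add_ps(lhs.m_value, rhs.m_value) };
}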

Writing C++ Wrappers for SIMD Intrinsics (5)

4. Making the code more generic

In the previous section we saw how to plug the wrappers into existing code and ended up with the following loop:

sample.cpp
std::vector<float> a, b, c, d, e;
// Somewhere in the code the vectors are resized
// so they hold n elements
for(size_t i = 0; i < 4*(n/4); i += 4)
{
    vector4f av; av.load_a(&a[i]);
    vector4f bv; bv.load_a(&b[i]);
    vector4f cv; cv.load_a(&c[i]);
    vector4f dv; dv.load_a(&d[i]);

    vector4f ev = av*bv + cv*dv;
    ev.store_a(&e[i]);
}
// Remaining part of the loop
// ...
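
The excerpt stops before the generic version; as a rough sketch of the direction, and purely as my own guess rather than the article's actual code, the kernel could be templated on the batch type so the same body works for any wrapper exposing load_a, store_a and the arithmetic operators:

#include <cstddef>
#include <vector>

// Hypothetical sketch: B is any batch type providing load_a/store_a, the
// arithmetic operators, and a static member B::size giving the number of
// floats per batch (an assumption made for this illustration).
template <class B>
void compute(const std::vector<float>& a, const std::vector<float>& b,
             const std::vector<float>& c, const std::vector<float>& d,
             std::vector<float>& e)
{
    std::size_t n = e.size();
    std::size_t aligned_n = B::size * (n / B::size); // largest multiple of B::size not exceeding n
    for (std::size_t i = 0; i < aligned_n; i += B::size)
    {
        B av; av.load_a(&a[i]);
        B bv; bv.load_a(&b[i]);
        B cv; cv.load_a(&c[i]);
        B dv; dv.load_a(&d[i]);
        B ev = av * bv + cv * dv;
        ev.store_a(&e[i]);
    }
    for (std::size_t i = aligned_n; i < n; ++i) // scalar loop for the remaining elements
        e[i] = a[i] * b[i] + c[i] * d[i];
}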

Writing C++ Wrappers for SIMD Intrinsics (4)

3. Plugging the wrappers into existing code

3.1 Storing vector4f instead of float

Now that we have nice wrappers, let’s see how we can use them in real code. Consider the following loop:

sample.cpp
std::vector<float> a, b, c, d, e;
// somewhere in the code, a, b, c, d and e are
// resized so they hold n elements
// ...
for(size_t i = 0; i < n; ++i)
{
    e[i] = a[i]*b[i] + c[i]*d[i];
}
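
The section title suggests a first approach: make the containers hold vector4f batches instead of individual floats. As a hedged illustration of what that looks like (the details are mine, not necessarily the article's), the loop becomes:

// Illustrative sketch: the containers now hold batches of four floats, so a
// container of n/4 vector4f elements replaces each container of n floats.
// vector4f is assumed to provide the arithmetic operators from the wrapper posts.
std::vector<vector4f> a, b, c, d, e;
// somewhere in the code, a, b, c, d and e are resized so they hold n/4 batches
// ...
for(size_t i = 0; i < e.size(); ++i)
{
    e[i] = a[i]*b[i] + c[i]*d[i];
}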

Writing C++ Wrappers for SIMD Intrinsics (3)

2. First version of wrappers

Now that we know a little more about SSE and AVX, let’s write some code; the wrappers will have a data vector member and provide arithmetic, comparison and logical operator overloads. Throughout this section, I will mainly focus on vector4f, the wrapper around __m128, but translating the code to the other data vectors should not be difficult thanks to the previous section. Since the wrappers will be used as numerical types, they must have value semantics, that is, they must define a copy constructor, an assignment operator and a non-virtual destructor.
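
As a minimal sketch of such a wrapper (member names and the exact set of operators are my own guesses, not necessarily those used in the article), a value-semantic vector4f could look like this:

#include <xmmintrin.h>

// Minimal value-semantic wrapper around __m128 (illustrative sketch).
class vector4f
{
public:
    vector4f() : m_value(_mm_setzero_ps()) {}
    vector4f(__m128 rhs) : m_value(rhs) {}
    vector4f(const vector4f& rhs) : m_value(rhs.m_value) {}  // copy constructor
    vector4f& operator=(const vector4f& rhs)                 // assignment operator
    {
        m_value = rhs.m_value;
        return *this;
    }
    ~vector4f() {}                                           // non-virtual destructor

    operator __m128() const { return m_value; }

    vector4f& operator+=(const vector4f& rhs)
    {
        m_value = _mm_add_ps(m_value, rhs.m_value);
        return *this;
    }

private:
    __m128 m_value;
};

// Non-member operator built on top of the compound assignment.
inline vector4f operator+(vector4f lhs, const vector4f& rhs)
{
    return lhs += rhs;
}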

Writing C++ Wrappers for SIMD Intrinsics (2)

1. SSE/AVX intrinsics

Before we start writing any code, we need to take a look at the intrinsics provided with the compiler. From here on, I assume we use an Intel processor recent enough to provide the SSE 4 and AVX instruction sets; the compiler can be gcc or MSVC, since the intrinsics they provide are almost the same.

If you already know about SSE / AVX intrinsics you may skip this section.
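
To give a flavor of what these intrinsics look like before wrapping them, here is a small example using a few SSE intrinsics available in both gcc and MSVC (the 16-byte alignment of the arrays is an assumption of the example):

#include <xmmintrin.h>

// Adds two 16-byte aligned arrays of four floats with raw SSE intrinsics.
void add4(const float* a, const float* b, float* res)
{
    __m128 va = _mm_load_ps(a);      // load 4 floats from aligned memory
    __m128 vb = _mm_load_ps(b);
    __m128 vr = _mm_add_ps(va, vb);  // add the 4 lanes vertically
    _mm_store_ps(res, vr);           // store the result back to aligned memory
}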

Writing C++ Wrappers for SIMD Intrinsics (1)

Introduction

SIMD (and more generally vectorization) is a longstanding topic and a lot has been written about it. But when I had to use it in my own applications, it turned out that most of the articles were theoretical, explaining the principles of vectorization while lacking practical examples; some of them linked to libraries using vectorization, but extending these libraries for my personal needs was difficult, if not painful. For this reason, I decided to implement my own library. This series of articles is the result of my work on the matter. I share it here in case someone faces the same problem.