Guaranteed Copy Elision Does Not Elide Copies

This post is also available at the Microsoft Visual C++ Team Blog

C++17 merged in a paper called Guaranteed copy elision through simplified value categories. The changes mandate that no copies or moves take place in some situations where they were previously allowed, e.g.:

struct non_moveable {
    non_moveable() = default;
    non_moveable(non_moveable&&) = delete;
};
non_moveable make() { return {}; }
non_moveable x = make(); //compiles in C++17, error in C++11/14

You can see this behavior in compiler versions Visual Studio 2017 15.6, Clang 4, GCC 7, and above.

Despite the name of the paper and what you might read on the Internet, the new rules do not guarantee copy elision. Instead, the new value category rules are defined such that no copy exists in the first place. Understanding this nuance gives a deeper understanding of the current C++ object model, so I will explain the pre-C++17 rules, what changes were made, and how they solve real-world problems.

Value Categories

To understand the before-and-after, we first need to understand what value categories are (I’ll explain copy elision in the next section). Continuing the theme of C++ misnomers, value categories are not categories of values. They are characteristics of expressions. Every expression in C++ has one of three value categories: lvalue, prvalue (pure rvalue), or xvalue (eXpiring value). There are then two parent categories: all lvalues and xvalues are glvalues, and all prvalues and xvalues are rvalues.

diagram expressing the taxonomy described above

For an explanation of what these are, we can look at the standard (C++17 [basic.lval]/1):

  • A glvalue ([generalised lvalue]) is an expression whose evaluation determines the identity of an object, bit-field, or function.
  • A prvalue is an expression whose evaluation initializes an object or a bit-field, or computes the value of an operand of an operator, as specified by the context in which it appears.
  • An xvalue is a glvalue that denotes an object or bit-field whose resources can be reused (usually because it is near the end of its lifetime).
  • An lvalue is a glvalue that is not an xvalue.
  • An rvalue is a prvalue or an xvalue.

Some examples:

std::string s; 
s //lvalue: identity of an object 
s + " cake" //prvalue: could perform initialization/compute a value 

std::string f(); 
std::string& g(); 
std::string&& h(); 

f() //prvalue: could perform initialization/compute a value 
g() //lvalue: identity of an object 
h() //xvalue: denotes an object whose resources can be reused 

struct foo { 
    std::string s; 
}; 

foo{}.s //xvalue: denotes an object whose resources can be reused
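
These aren’t claims you have to take on faith: decltype applied to an expression encodes its value category (T& for lvalues, T for prvalues, T&& for xvalues), so you can check them yourself. A minimal sketch (the extra parentheses around s force decltype to treat it as an expression rather than a name):

#include <string>
#include <type_traits>
#include <utility>

std::string s;
static_assert(std::is_same_v<decltype((s)), std::string&>);           //lvalue
static_assert(std::is_same_v<decltype(s + " cake"), std::string>);    //prvalue
static_assert(std::is_same_v<decltype(std::move(s)), std::string&&>); //xvalue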

C++11

What are the properties of the expression std::string{"a pony"}?

It’s a prvalue. Its type is std::string. It has the value "a pony". It names a temporary.

That last one is the key point I want to talk about, and it’s the real difference between the C++11 rules and C++17. In C++11, std::string{"a pony"} does indeed name a temporary. From C++11 [class.temporary]/1:

Temporaries of class type are created in various contexts: binding a reference to a prvalue, returning a prvalue, a conversion that creates a prvalue, throwing an exception, entering a handler, and in some initializations.

Let’s look at how this interacts with this code:

struct copyable {
    copyable() = default;
    copyable(copyable const&) { /*...*/ }
};
copyable make() { return {}; }
copyable x = make();

make() results in a temporary. This temporary will be moved into x. Since copyable has no move constructor, this calls the copy constructor. However, this copy is unnecessary since the object constructed on the way out of make will never be used for anything else. The standard allows this copy to be elided by constructing the return value at the call-site rather than in make (C++11 [class.copy]). This is called copy elision.

The unfortunate part is this: even if all copies of the type are elided, the constructor still must exist.

This means that if we instead have:

struct non_moveable {
    non_moveable() = default;
    non_moveable(non_moveable&&) = delete;
};
non_moveable make() { return {}; }
auto x = make();

then we get a compiler error:

<source>(7): error C2280: 'non_moveable::non_moveable(non_moveable &&)': attempting to reference a deleted function
<source>(3): note: see declaration of 'non_moveable::non_moveable'
<source>(3): note: 'non_moveable::non_moveable(non_moveable &&)': function was explicitly deleted

Aside from returning non-moveable types by value, this presents other issues:

auto x = non_moveable{}; //compiler error

  • The language makes no guarantees that the constructors won’t be called (in practice this isn’t too much of a worry, but guarantees are more convincing than optional optimizations).
  • If we want to support some of these use-cases, we need to write copy/move constructors for types for which they don’t make sense (and what would they do? Throw? Abort? Cause a linker error?).
  • You can’t pass non-moveable types to functions by value, should you have a use-case where that would help.

What’s the solution? Should the standard just say “oh, if you elide all copies, you don’t need those constructors”? Maybe, but then all this language about constructing temporaries is really a lie and building an intuition about the object model becomes even harder.

C++17

C++17 takes a different approach. Instead of guaranteeing that copies will be elided in these cases, it changes the rules such that the copies were never there in the first place. This is achieved through redefining when temporaries are created.

As noted in the value category descriptions earlier, prvalues exist for purposes of initialization. C++11 creates temporaries eagerly, eventually using them in an initialization and eliding unnecessary copies after the fact. In C++17, the materialization of temporaries is deferred until the initialization is performed.

That’s a better name for this feature. Not guaranteed copy elision. Deferred temporary materialization.

Temporary materialization creates a temporary object from a prvalue, resulting in an xvalue. The most common places it occurs are when binding a reference to or performing member access on a prvalue. If a reference is bound to the prvalue, the materialized temporary’s lifetime is extended to that of the reference (this is unchanged from C++11, but worth repeating). If a prvalue initializes a class type of the same type as the prvalue, then the destination object is initialized directly; no temporary required.

Some examples:

struct foo {
    int i;
};

foo make();
auto const& a = make(); //temporary materialized and lifetime-extended
auto&& b = make();      //ditto

foo{}.i //temporary materialized

auto c = make(); //no temporary materialized

That covers the most important points of the new rules. Now on to why this is actually useful past terminology bikeshedding and trivia to impress your friends.

Who cares?

I said at the start that understanding the new rules would grant a deeper understanding of the C++17 object model. I’d like to expand on that a bit.

The key point is that in C++11, prvalues are not “pure” in a sense. That is, the expression std::string{"a pony"} names some temporary std::string object with the contents "a pony". It’s not the pure notion of the list of characters “a pony”. It’s not the Platonic ideal of “a pony”.

In C++17, however, std::string{"a pony"} is the Platonic ideal of “a pony”. It’s not a real object in C++’s object model, it’s some elusive, amorphous idea which can be passed around your program, only being given form when initializing some result object, or materializing a temporary. C++17’s prvalues are purer prvalues.

If this all sounds a bit abstract, that’s okay, but internalising this idea will make it easier to reason about aspects of your program. Consider a simple example:

struct foo {};
auto x = foo{};

In the C++11 model, the prvalue foo{} creates a temporary which is used to move-construct x, but the move is likely elided by the compiler.

In the C++17 model, the prvalue foo{} initializes x.

A more complex example:

std::string a() {
    return "a pony";
}

std::string b() {
    return a();
}

int main() {
    auto x = b();
}

In the C++11 model, return "a pony"; initializes the temporary return object of a(), which move-constructs the temporary return object of b(), which move-constructs x. All the moves are likely elided by the compiler.

In the C++17 model, return "a pony"; initializes the result object of a(), which is the result object of b(), which is x.

In essence, rather than an initializer creating a series of temporaries which in theory move-construct a chain of return objects, the initializer is teleported to the eventual result object. In C++17, the code:

T a() { return /* expression */ ; }
auto x = a();

is identical to auto x = /* expression */;. For any T.
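
As a sketch of how far this goes, the non_moveable type from the start of the post can now be returned through a whole chain of calls, because no temporaries — and therefore no copies or moves — exist anywhere along the way:

struct non_moveable {
    non_moveable() = default;
    non_moveable(non_moveable&&) = delete;
};

non_moveable a() { return {}; }
non_moveable b() { return a(); }

int main() {
    auto x = b(); //compiles in C++17: the initializer reaches x directly
}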

Closing

The “guaranteed copy elision” rules do not guarantee copy elision; instead they purify prvalues such that the copy doesn’t exist in the first place. Next time you hear or read about “guaranteed copy elision”, think instead about deferred temporary materialization.

Spaceship Operator

You write a class. It has a bunch of member data. At some point, you realise that you need to be able to compare objects of this type. You sigh and resign yourself to writing six operator overloads for every type of comparison you need to make. Afterwards your fingers ache and your previously clean code is lost in a sea of functions which do essentially the same thing. If this sounds familiar, then C++20’s spaceship operator is for you. This post will look at how the spaceship operator allows you to describe the strength of relations, write your own overloads, have them be automatically generated, and how correct, efficient two-way comparisons are automatically rewritten to use them.

Relation strength

The spaceship operator looks like <=> and its official C++ name is the “three-way comparison operator”. It is so called because it works by comparing two objects, then comparing that result to 0, like so:

(a <=> b) < 0  //true if a < b
(a <=> b) > 0  //true if a > b
(a <=> b) == 0 //true if a is equal/equivalent to b

One might think that <=> will simply return -1, 0, or 1, similar to strcmp. This being C++, the reality is a fair bit more complex, but it’s also substantially more powerful.

Not all equality relations were created equal. C++’s spaceship operator doesn’t just let you express orderings and equality between objects, but also the characteristics of those relations. Let’s look at two examples.

Example 1: Ordering rectangles

We have a rectangle class and want to define comparison operators for it based on its size. But what does it mean for one rectangle to be smaller than another? Clearly if one rectangle is 1cm by 1cm, it is smaller than one which is 10cm by 10cm. But what about a rectangle which is 1cm by 5cm and another which is 5cm by 1cm? These are clearly not equal, but they’re also not less than or greater than each other. Practically speaking, having all of <, <=, ==, >, and >= return false for this case is not particularly useful, and it breaks common assumptions about these operators, such as (a == b || a < b || a > b) == true. Instead, we can say that == in this case models equivalence rather than true equality. This is known as a weak ordering.

Example 2: Ordering squares

Similar to above, we have a square type which we want to define comparison operators for with respect to size. In this case, we don’t have the issue of two objects being equivalent, but not equal: if two squares have the same area, then they are equal. This means that == models equality rather than equivalence. This is known as a strong ordering.
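
As a sketch of how these two relations might be encoded (the rectangle and square types here, along with their members, are invented for illustration):

#include <compare>

struct rectangle {
    int width;
    int height;

    //compare by area: a 1cm-by-5cm and a 5cm-by-1cm rectangle compare
    //equivalent but are not equal, so this models a weak ordering
    std::weak_ordering operator<=>(rectangle const& rhs) const {
        return width * height <=> rhs.width * rhs.height;
    }
};

struct square {
    int side;

    //equal areas imply equal squares, so this models a strong ordering
    std::strong_ordering operator<=>(square const& rhs) const {
        return side <=> rhs.side;
    }
};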

Describing relations

Three-way comparisons allow you to express the strength of your relation, and whether it allows just equality or also ordering. This is achieved through the return type of operator<=>. Five types are provided, and stronger relations can implicitly convert to weaker ones, like so:

diagram: the five comparison category types, with stronger relations implicitly converting to weaker ones

Strong and weak orderings are as described in the above examples. Strong and weak equality means that only == and != are valid. Partial ordering is weaker than weak_ordering in that it also allows unordered values, like NaN for floats.

These types have a number of uses. Firstly they indicate to users of the types what kind of relation is modelled by the comparisons, so that their behaviour is more clear. Secondly, algorithms could potentially have more efficient specializations for when the orderings are stronger, e.g. if two strongly-ordered objects are equal, calling a function on one of them will give the same result as the other, so it would only need to be carried out once. Finally, they can be used in defining language features, such as class type non-type template parameters, which require the type to have strong equality so that equal values always name the same template instantiation.

Writing your own three-way comparison

You can provide a three-way comparison operator for your type in the usual way, by writing an operator overload:

struct foo {
  int i;

  std::strong_ordering operator<=> (foo const& rhs) const {
    return i <=> rhs.i;
  }
};

Note that whereas two-way comparisons should be non-member functions so that implicit conversions are done on both sides of the operator, this is not necessary for operator<=>; we can make it a member and it’ll do the right thing.
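
For example, give foo an implicit constructor from int and comparisons against int work with the operands either way around, because the rewriting rules described later also consider the reversed order (a sketch):

struct foo {
    int i;
    foo(int i) : i(i) {}

    std::strong_ordering operator<=>(foo const& rhs) const {
        return i <=> rhs.i;
    }
};

bool b1 = foo{10} < 42; //effectively (foo{10} <=> foo{42}) < 0
bool b2 = 42 < foo{10}; //also works, with the operands reversed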

Sometimes you may find that you want to do <=> on types which don’t have an operator<=> defined. The compiler won’t try to work out a definition for <=> based on the other operators, but you can use std::compare_3way instead, which will fall back on two-way comparisons if there is no three-way version available. For example, we could write a three-way comparison operator for a pair type like so:

template<class T, class U>
struct pair {
  T t;
  U u;

  auto operator<=> (pair const& rhs) const
    -> std::common_comparison_category_t<
         decltype(std::compare_3way(t, rhs.t)),
         decltype(std::compare_3way(u, rhs.u))> {
    if (auto cmp = std::compare_3way(t, rhs.t); cmp != 0) return cmp;
    return std::compare_3way(u, rhs.u);
  }
};

std::common_comparison_category_t there determines the weakest relation given a number of them. E.g. std::common_comparison_category_t<std::strong_ordering, std::partial_ordering> is std::partial_ordering.

If you found the previous example a bit too complex, you might be glad to know that C++20 will support automatic generation of comparison operators. All we need to do is =default our operator<=>:

auto operator<=>(x const&) = default;

Simple! This will carry out a lexicographic comparison for each base, then member of the type, in order of declaration.
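
For instance, a sketch (assuming the member types themselves support three-way comparison, directly or via the fallback described above):

#include <compare>
#include <string>

struct person {
    std::string name;
    int age;

    //compares name, then age, in declaration order
    auto operator<=>(person const&) const = default;
};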

Automatically-rewritten two-way comparisons

Although <=> is very powerful, most of the time we just want to know if some object is less than another, or equal to another. To facilitate this, two-way comparisons like < can be rewritten by the compiler to use <=> if a better match is not found.

The basic idea is that for some operator @, an expression a @ b can be rewritten as (a <=> b) @ 0. It can even be rewritten as 0 @ (b <=> a) if that is a better match, which means we get symmetry for free.

In some cases, this can actually provide a performance benefit. Quite often comparison operators are implemented by writing == and <, then writing the other operators in terms of those rather than duplicating the code. This can lead to situations where we want to check <= and end up doing an expensive comparison twice. This automatic rewriting can avoid that cost, since it will only call the one operator<=> rather than both operator< and operator==.
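
A sketch of the difference, using an invented big_type whose comparison walks a vector: implementations which build <= out of two-way operators (say, a < b || a == b) can traverse the data twice, while the rewritten form calls operator<=> exactly once:

#include <compare>
#include <vector>

struct big_type {
    std::vector<int> data;
    auto operator<=>(big_type const&) const = default;
};

bool le(big_type const& a, big_type const& b) {
    return a <= b; //rewritten as (a <=> b) <= 0: one pass over the data
}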

Conclusion

The spaceship operator is a very welcome addition to C++. It gives us more expressiveness in how we define our relations, lets us write less code to define them (sometimes even just a defaulted declaration), and avoids some of the performance pitfalls of manually implementing some comparison operators in terms of others.

For more details, have a look at the cppreference articles for comparison operators and the compare header, and Herb Sutter’s standards proposal. For a good example on how to write a spaceship operator, see Barry Revzin’s post on implementing it for std::optional. For more information on the mathematics of ordering with a C++ slant, see Jonathan Müller’s blog post series on the mathematics behind comparison.

Super Simple Named Boolean Parameters

Quite often we come across interfaces with multiple boolean parameters, like this:

cake make_cake (bool with_dairy, bool chocolate_sauce, bool poison);

A call to this function might look like:

auto c = make_cake(true, true, false);

Unfortunately, this is not very descriptive. We need to look at the declaration of the function to know what each of those bools mean, and it would be way too easy to accidentally flip the wrong argument during maintenance and end up poisoning ourselves rather than having a chocolatey treat.

There are many solutions which people use, from just adding comments to each argument, to creating tagged types. I recently came across a very simple trick which I had not seen before:

#define K(name) true

Yep, it’s a macro which takes an argument and gets replaced by true. This may look strange, but it enables us to write this:

auto c = make_cake(K(with_dairy), !K(chocolate_sauce), !K(poison));

// or using `not` if you prefer
auto c = make_cake(K(with_dairy), not K(chocolate_sauce), not K(poison));

Of course, it’s up to you whether you think using a macro for this kind of thing is worth it, or if you think that macro should have a name which is less likely to collide. There’s also the issue that this looks a bit like Python named parameters, where you can pass them in any order, but the order matters here. There are much better solutions when you own the interface, but for the times when you don’t, I think this is a neat trick to solve the issue in a low-overhead manner.
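
As a sketch of what you might do when you do own the interface, small enums make the call sites self-describing and let the type system catch swapped arguments (the names here are mine):

enum class dairy { no, yes };
enum class chocolate_sauce { no, yes };
enum class poison { no, yes };

cake make_cake (dairy, chocolate_sauce, poison);

auto c = make_cake(dairy::yes, chocolate_sauce::yes, poison::no);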

I guess I need to come up with a marketing name for this. Since we already have X Macros, I hereby dub this trick “K Macros”.

Function Template Partial Ordering: Worked Examples

C++ function overloading rules are complex. C++ template rules are complex. Put the two together, and you unfortunately do not get something simple; you get a hideous monster of standardese which requires great patience and knowledge to overcome. However, since C++ is mostly corner-cases, it can pay to understand how the rules apply for those times where you just can’t work out why your code won’t compile. This post will present a few step-by-step examples of how partial ordering of function templates works in order to arm you for these times of need.

Partial ordering of function templates is a step of overload resolution. It occurs when you call a function template which is overloaded and the compiler needs to decide which one is more specialized than the other. Consider this code:

template<class T> void f(T);        //(1)
template<class T> void f(T const*); //(2)

int const* p = nullptr;
f(p);

We expect f(p) to call (2), because p is an int const*. In order to decide that (2) is more specialized than (1), the compiler needs to follow the function template partial ordering rules. Let’s see what the standard has to say.

Partial ordering selects which of two function templates is more specialized than the other by transforming each template in turn (see next paragraph) and performing template argument deduction using the function type.

This is not so complicated, even if the terms may be unfamiliar. Unfortunately, the “next paragraph” and the sections which it references are almost impossible to parse without a lot of background knowledge and re-readings, so I shall step through the algorithm rather than pasting the rules for you to cry over.

There are four steps which we have to work through:

  1. Transforming (1).
  2. Performing deduction on (2) with the transformed template from step 1.
  3. Transforming (2).
  4. Performing deduction on (1) with the transformed template from step 3.

If one and only one of the deductions succeeds, then the template with which the deduction was performed is more specialized than the other.

Step 1: The rules state that for each template parameter, we create some unique type to use in its stead1. Let’s call this unique type type_0. You can pretend that this was defined somewhere like class type_0{};. Now we take our function template template <class T> void f(T) and substitute in type_0 for T. This gives us void f(type_0). The transformation is complete.

Step 2: Now that we have transformed template <class T> void f(T) into void f(type_0), we will perform deduction on (2) using the transformed function type. To do this, we imagine a call to (2) where the arguments have the type of the parameters for (1). Concretely, it would look like this:

template <class T> void func_2(T const*);
func_2(type_0{}); //derived from void f(type_0)

Would this call succeed? We can put it into our compiler to find out. GCC 8.1 says:

<source>: In function 'int main()':
<source>:4:18: error: no matching function for call to 'func_2(type_0)'
   func_2(type_0{});
                  ^
<source>:1:25: note: candidate: 'template<class T> void func_2(const T*)'
 template <class T> void func_2(T const*);
                         ^~~~~~
<source>:1:25: note:   template argument deduction/substitution failed:
<source>:4:18: note:   mismatched types 'const T*' and 'type_0'
   func_2(type_0{});
                  ^   

So deduction from (1) to (2) fails, because the invented type type_0 cannot be used to deduce const T*.

Step 3: Let’s try from (2) to (1). Again, we’ll transform (2) from template <class T> void f(T const*) to void f(type_0 const*).

Step 4: Now we attempt deduction:

template <class T> void func_1(T);
type_0 const* arg = nullptr;
func_1(arg);

This succeeds because a type_0 const* can be used to deduce T. Since deduction from (1) to (2) fails, but deduction from (2) to (1) succeeds, (2) is more specialized than (1) and will be chosen by overload resolution.


Let’s try a different example. How about:

template<class T> void g(T);  //(1)
template<class T> void g(T&); //(2)
int i = 0;
g(i);

(1) transforms to void g(type_0). Before we try deduction, we need to apply one of the numerous additional rules from the standard, which says we need to replace references with the type being referred to. So template <class T> void g(T&) becomes template <class T> void g(T). Deduction time:

template<class T> void func_2(T);
func_2(type_0{});

This succeeds.

Now the other direction. template<class T> void g(T&) transforms to void g(type_0&), then we remove the reference to get void g(type_0). Our second deduction:

template<class T> void func_1(T);
func_1(type_0{});

This is effectively identical to the previous one, so of course it succeeds.

Since deduction succeeded in both directions, the call is ambiguous. Sure enough, GCC diagnoses:

<source>: In function 'int main()':
<source>:5:8: error: call of overloaded 'g(int&)' is ambiguous
     g(i);
        ^
<source>:1:24: note: candidate: 'void g(T) [with T = int]'
 template<class T> void g(T);  //(1)
                        ^
<source>:2:24: note: candidate: 'void g(T&) [with T = int]'
 template<class T> void g(T&); //(2)
                        ^ 

This is why the algorithm is a partial ordering: sometimes two function templates are not ordered.


I’ll give one more example. This one has multiple parameters and is a bit more subtle.

template<class T>struct identity { using type = T; };
template<class T>struct A{};

template<class T, class U> void h(typename identity<T>::type, U); //(1)
template<class T, class U> void h(T, A<U>);                       //(2)
h<int>(0,A<void>{});

identity here just evaluates to its template argument, but the important thing to note is that typename identity<T>::type is a non-deduced context, so T cannot be deduced from the argument for that parameter.

(1) transforms to void h(typename identity<type_0>::type, type_1) — each template parameter gets its own unique type — which is void h(type_0, type_1). Attempt deduction on (2):

template<class T, class U> void func_2(T, A<U>);
func_2(type_0{}, type_1{});

This fails because we can’t match type_1 against A<U>.

(2) transforms to void h(type_0, A<type_1>). Try deduction against (1):

template<class T, class U> void func_1(typename identity<T>::type, U);
func_1(type_0{}, A<type_1>{});

This fails because typename identity<T>::type is a non-deduced context, so we can’t deduce T.

In the example from the last section deduction succeeded both ways so the call was ambiguous. In this example, deduction fails both ways, which is also an ambiguous call.


That’s the last of the examples. Of course, there are a bunch of rules which I didn’t cover here, like Concepts, parameter packs, non-type/template template parameters, and cases where both the argument and parameter types are references. Hopefully you now have enough of an intuition that you can understand what the standard says when you inevitably hit those corner cases. If you have any partial ordering conundrums, drop them down in the comments below or send them to me on Twitter.


  1. I’ll ignore non-type template parameters and template template parameters for simplicity, but the rules are essentially the same. 

std::accumulate vs. std::reduce

std::accumulate has been a part of the standard library since C++98. It provides a way to fold a binary operation (such as addition) over an iterator range, resulting in a single value. std::reduce was added in C++17 and looks remarkably similar. This post will explain the difference between the two and when to use one or the other.

Let’s start by looking at their interfaces, beginning with std::accumulate.

template< class InputIt, class T >
T accumulate( InputIt first, InputIt last, T init );

template< class InputIt, class T, class BinaryOperation >
T accumulate( InputIt first, InputIt last, T init,
              BinaryOperation op );

std::accumulate takes an iterator range and an initial value for the accumulation. You can optionally give it a binary operation to do the reduction, which will default to addition. It will call this operation on the initial value and the first element of the range, then on the result and the second element of the range, etc. Here are two equivalent calls:

auto sum1 = std::accumulate(begin(vec), end(vec), 0);
auto sum2 = std::accumulate(begin(vec), end(vec), 0, std::plus<>{});
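
Conceptually, with a three-element vec, that second call expands to a left fold (a sketch of the evaluation, not code you’d actually write):

//std::accumulate(begin(vec), end(vec), 0, std::plus<>{}) computes:
auto sum = std::plus<>{}(std::plus<>{}(std::plus<>{}(0, vec[0]), vec[1]), vec[2]);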

std::reduce has a fair few more overloads to get your head round, but has a very similar interface once you understand them:

template<class InputIt>
typename std::iterator_traits<InputIt>::value_type reduce(
    InputIt first, InputIt last);

template<class ExecutionPolicy, class ForwardIt>
typename std::iterator_traits<ForwardIt>::value_type reduce(
    ExecutionPolicy&& policy,
    ForwardIt first, ForwardIt last);

template<class InputIt, class T>
T reduce(InputIt first, InputIt last, T init);

template<class ExecutionPolicy, class ForwardIt, class T>
T reduce(ExecutionPolicy&& policy,
         ForwardIt first, ForwardIt last, T init);

template<class InputIt, class T, class BinaryOp>
T reduce(InputIt first, InputIt last, T init, BinaryOp binary_op);

template<class ExecutionPolicy, class ForwardIt, class T, class BinaryOp>
T reduce(ExecutionPolicy&& policy,
         ForwardIt first, ForwardIt last, T init, BinaryOp binary_op);

The differences here are:

  • You can optionally pass an execution policy.
  • The execution policy overloads take forward iterators, whereas the others take input iterators.
  • The initial value is optional, defaulting to a default-constructed object of the iterator’s value type.

I’ll talk about these points in turn.

Execution policies

Execution policies are a C++17 feature which allows programmers to ask for algorithms to be parallelised. There are three execution policies in C++17:

  • std::execution::seq – do not parallelise
  • std::execution::par – parallelise
  • std::execution::par_unseq – parallelise and vectorise (requires that the operation can be interleaved, so no acquiring mutexes and such)

The idea behind execution policies is that you can change a serial algorithm to a parallel algorithm simply by passing an additional argument to the function:

auto sum1 = std::reduce(begin(vec), end(vec));                      //sequential
auto sum2 = std::reduce(std::execution::seq, begin(vec), end(vec)); //sequential
auto sum3 = std::reduce(std::execution::par, begin(vec), end(vec)); //parallel

Allowing parallelisation is the main reason for the addition of std::reduce. Let’s look at an example where we want to sum up all the elements in an array. With std::accumulate it looks like this:

diagram: std::accumulate with std::plus — each operation uses the result of the previous one, so the computation is serial

Note that each step of the computation relies on the previous computation, i.e. this algorithm will execute serially and we make no use of hardware parallel processing capabilities. If we use std::reduce with the std::execution::par policy then it could look like this:

diagram: std::reduce with std::plus — independent pairs are combined in parallel, then the intermediate results are combined

This is a trivial amount of data for processing in parallel, but the benefit gained when the data size is scaled up should be clear: some of the operations can be executed independently of others, so they can be done in parallel.

A common question is: why do we need an entirely new algorithm for this? Why can’t we just overload std::accumulate? For an example of why we can’t do this, let’s use std::minus<> instead of std::plus<> as our reduction operation. With std::accumulate we get this:

diagram: std::accumulate with std::minus — operands folded serially, left to right

However, if we try to use std::reduce, we could get something like:

diagram: std::reduce with std::minus — operations reordered, producing a different result

Uh oh. We just broke our code.

We got the wrong answer because of the mathematical properties of subtraction. You can’t arbitrarily reorder the operands, or compute the operations out of order when doing subtraction. This is formalised in the properties of commutativity and associativity.

A binary operation ∗ on a set S is associative if the following equation holds for all x, y, and z in S:

(x ∗ y) ∗ z = x ∗ (y ∗ z)

An operation is commutative if:

x ∗ y = y ∗ x
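
A sketch of the subtraction problem in code — std::minus is neither associative nor commutative, so the parallel result is unspecified and may differ from the serial one:

#include <execution>
#include <functional>
#include <numeric>
#include <vector>

std::vector<int> v{1, 2, 3};

//always ((0 - 1) - 2) - 3 == -6
auto serial = std::accumulate(begin(v), end(v), 0, std::minus<>{});

//free to compute e.g. (0 - 1) - (2 - 3) == 0 instead
auto parallel = std::reduce(std::execution::par, begin(v), end(v), 0, std::minus<>{});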

Associativity lets the algorithm compute reduction steps on arbitrary adjacent pairs of elements. Commutativity allows carrying out the operation on intermediate results in whatever order they are produced in, rather than having to preserve the original ordering. Some people (e.g. here and here) find that commutativity is too strong a requirement on std::reduce, because it denies use of common fold operations like string concatenation, which is associative but not commutative. I agree that it’s a shame we don’t have a step between std::accumulate and std::reduce which only requires associativity, but maybe in the future!

We can understand how associativity and commutativity affect our algorithm as much as we want, but there’s no way for the compiler to reliably check this. As such, we’re stuck with having std::reduce as a separate algorithm. In concepts speak, these are called axioms1: requirements imposed on semantics which cannot generally be statically verified.

Input vs. Forward Iterators

The forward iterator overloads allow the implementation to chunk up the data and dispatch these subranges to different threads. The idea is that input iterators are single-pass, whereas forward iterators can be iterated through multiple times. An algorithm couldn’t chunk up data indexed by input iterators, because by the time it had gone through the range to work out the sub-range boundaries, the iterators would have been invalidated and couldn’t be passed on.

For more information on iterator types and parallel algorithms, see p0467r2.

Optional Initial Element

std::reduce lets you not bother passing an initial element, in which case it will default-construct one using typename std::iterator_traits<InputIt>::value_type{}. I think this was a mistake. A default-constructed value is not always the identity element (such as in multiplication), and it’s very easy for a programmer to miss out the initial element. The code will still compile, it will just give the wrong answer. I suspect that this choice will result in some hard-to-find bugs when this interface comes into heavier use.
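
A sketch of the pitfall: the default works out for a sum, because int{} is 0, the additive identity, but you must remember to spell out the right identity yourself as soon as you pass an operation:

#include <functional>
#include <numeric>
#include <vector>

std::vector<int> v{1, 2, 3, 4};

auto sum = std::reduce(begin(v), end(v)); //fine: int{} == 0 is the identity for +

//the overloads taking an operation also require an explicit initial value;
//passing 0 here out of habit would make every product 0
auto product = std::reduce(begin(v), end(v), 1, std::multiplies<>{});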

Finishing Up

That covers the differences between std::reduce and std::accumulate. My three point guide to std::reduce is:

  • Use std::reduce when you want your accumulation to run in parallel
  • Ensure that the operation you want to use is both associative and commutative
  • Remember that the default initial value is produced by default construction, and that this may not be correct for your operation

Now you know how and when to use std::reduce over std::accumulate. More generally, the differences show some of the technical aspects you need to consider when parallelising any kind of algorithm. Keep in mind how your operations act with respect to common mathematical properties and you might save yourself some debugging down the line.

Acknowledgements

Thanks to Christopher Di Bella for reviewing this post and linking me to p0467r2. Thanks to Ben Steffan, Ben Deane, and TemplateRex for discussion about commutativity.


  1. For more about axioms and algorithms, see this post by Christopher Di Bella

Custom Alias Analysis in LLVM

At Codeplay I currently work on a compiler backend for an embedded accelerator. The backend is the part of the compiler which takes some representation of the source code and translates it to machine code. In my case I’m working with LLVM, so this representation is LLVM IR.

It’s very common for accelerators like GPUs to have multiple regions of addressable memory – each with distinct properties. One important optimisation I’ve implemented recently is extending LLVM’s alias analysis functionality to handle the different address spaces for our target architecture.

This article will give an overview of the aim of alias analysis – along with an example based around address spaces – and show how custom alias analyses can be expressed in LLVM. You should be able to understand this article even if you don’t work with compilers or LLVM; hopefully it gives some insight into what compiler developers do to make the tools you use generate better code. LLVM developers may find the implementation section helpful, but may want to read the documentation or examples linked at the bottom for more details.

What is alias analysis

Alias analysis is the process of determining whether two pointers can point to the same object (alias) or not. This is very important for some valuable optimisations.

Take this C function as an example:

int foo (int __attribute__((address_space(0)))* a,
         int __attribute__((address_space(1)))* b) {
    *a = 42;
    *b = 20;
    return *a;
}

Those __attribute__s specify that a points to an int in address space 0, b points to an int in address space 1. An important detail of the target architecture for this code is that address spaces 0 and 1 are completely distinct: modifying memory in address space 0 can never affect memory in address space 1. Here’s some LLVM IR which could be generated from this function:

define i32 @foo(i32 addrspace(0)* %a, i32 addrspace(1)* %b) #0 {
entry:
  store i32 42, i32 addrspace(0)* %a, align 4
  store i32 20, i32 addrspace(1)* %b, align 4
  %0 = load i32, i32* %a, align 4
  ret i32 %0
}

For those unfamiliar with LLVM IR, the first store is storing 42 into *a, the second storing 20 into *b. The %0 = ... line is like loading *a into a temporary variable, which is then returned in the final line.

Optimising foo

Now we want foo to be optimised. Can you see an optimisation which could be made?

What we really want is for that load from a (the line beginning %0 = ...) to be removed and for the final statement to instead return 42. We want the optimised code to look like this:

define i32 @foo(i32 addrspace(0)* %a, i32 addrspace(1)* %b) #0 {
entry:
  store i32 42, i32 addrspace(0)* %a, align 4
  store i32 20, i32 addrspace(1)* %b, align 4
  ret i32 42
}

However, we have to be very careful, because this optimisation is only valid if a and b do not alias, i.e. they must not point at the same object. Forgetting about the address spaces for a second, consider this call to foo where we pass pointers which do alias:

int i = 0;
int result = foo(&i, &i);

Inside the unoptimised version of foo, i will be set to 42, then to 20, then 20 will be returned. However, if we carry out desired optimisation then the two stores will occur, but 42 will be returned instead of 20. We’ve just broken the behaviour of our function.

The only way that a compiler can reasonably carry out the above optimisation is if it can prove that the two pointers cannot possibly alias. This reasoning is carried out through alias analysis.

Custom alias analysis in LLVM

As I mentioned above, address spaces 0 and 1 for our target architecture are distinct. However, this may not hold for some systems, so LLVM cannot assume that it holds in general: we need to make it explicit.

One way to achieve this is to write a custom alias analysis by deriving from llvm::AAResultBase. If our target is called TAR then we can create a class called TARAAResult which inherits from AAResultBase<TARAAResult>1:

class TARAAResult : public AAResultBase<TARAAResult> {
public:
  explicit TARAAResult() : AAResultBase() {}
  TARAAResult(TARAAResult &&Arg) : AAResultBase(std::move(Arg)) {}

  AliasResult alias(const MemoryLocation &LocA, const MemoryLocation &LocB);
};

The magic happens in the alias member function, which takes two MemoryLocations and returns an AliasResult. The result indicates whether the locations can never alias, may alias, partially alias, or precisely alias. We want our analysis to say “If the address spaces for the two memory locations are different, then they can never alias”. The resulting code is surprisingly close to this English description:

AliasResult TARAAResult::alias(const MemoryLocation &LocA,
                               const MemoryLocation &LocB) {
  auto AsA = LocA.Ptr->getType()->getPointerAddressSpace();
  auto AsB = LocB.Ptr->getType()->getPointerAddressSpace();

  if (AsA != AsB) {
    return NoAlias;
  }

  // Forward the query to the next analysis.
  return AAResultBase::alias(LocA, LocB);
}

Alongside this you need a bunch of boilerplate for creating a pass out of this analysis (I’ll link to a full example at the end), but after that’s done you just register the pass and ensure that the results of it are tracked:

void TARPassConfig::addIRPasses() {
  addPass(createTARAAWrapperPass());
  auto AnalysisCallback = [](Pass &P, Function &, AAResults &AAR) {
    if (auto *WrapperPass = P.getAnalysisIfAvailable<TARAAWrapper>()) {
      AAR.addAAResult(WrapperPass->getResult());
    }
  }; 
  addPass(createExternalAAWrapperPass(AnalysisCallback));
  TargetPassConfig::addIRPasses();
}

We also want to ensure that there is an optimisation pass which will remove unnecessary loads and stores which our new analysis will find. One example is Global Value Numbering.

  addPass(createNewGVNPass());

After all this is done, running the optimiser on the LLVM IR from the start of the post will eliminate the unnecessary load, and the generated code will be faster as a result.

You can see an example with all of the necessary boilerplate in LLVM’s test suite. The AMDGPU target has a full implementation which is essentially what I presented above extended for more address spaces.

Hopefully you have got a taste of what alias analysis does and the kinds of work involved in writing compilers, even if I have not gone into a whole lot of detail. I may write some more short posts on other areas of compiler development if readers find this interesting. Let me know on Twitter or in the comments!


  1. This pattern of inheriting from a class and passing the derived class as a template argument is known as the Curiously Recurring Template Pattern

No more pointers

One of the major changes at the most recent C++ standards meeting in Jacksonville was the decision to deprecate raw pointers in C++20, moving to remove them completely in C++23. This came as a surprise to many, with a lot of discussion as to how we’ll get by without this fundamental utility available any more. In this post I’ll look at how we can replace some of the main use-cases of raw pointers in C++20.

Three of the main reasons people use raw pointers are:

  • Dynamic allocation & runtime polymorphism
  • Nullable references
  • Avoiding copies

I’ll deal with these points in turn, but first, an answer to the main question people ask about this change.

The elephant in the room

What about legacy code? Don’t worry, the committee have come up with a way to boldly move the language forward without breaking all the millions of lines of C++ which people have written over the years: opt-in extensions.

If you want to opt-in to C++20’s no-pointers feature, you use #feature.

#feature <no_pointers> //opt-in to no pointers
#feature <cpp20>       //opt-in to all C++20 features

This is a really cool new direction for the language. Hopefully with this we can slowly remove features like std::initializer_list so that new code isn’t bogged down with legacy as much as it is today.

Dynamic allocation & runtime polymorphism

I’m sure most of you already know the answer to this one: smart pointers. If you need to dynamically allocate some resource, that resource’s lifetime should be managed by a smart pointer, such as std::unique_ptr or std::shared_ptr. These types are now special compiler-built-in types rather than normal standard library types. In fact, std::is_fundamental<std::unique_ptr<int>>::value now evaluates to true!

Nullable references

Since references cannot be rebound and cannot be null, pointers are often used to fulfil this purpose. However, with C++20, we have a new type to fulfil this purpose: std::optional<T&>. std::optional was first introduced in C++17, but was plagued with no support for references and no monadic interface. C++20 has fixed both of these, so now we have a much more usable std::optional type which can fill the gap that raw pointers have left behind.

Avoiding copies

Some people like to use raw pointers to avoid copies at interface boundaries, such as returning some resource from a function. Fortunately, we have much better options, such as (Named) Return Value Optimization. C++17 made some forms of copy elision mandatory, which gives us even more guarantees for the performance of our code.

Wrapping up

Of course there are more use-cases for raw pointers, but this covers three of the most common ones. Personally, I think this is a great direction to see the language going in, and I look forward to seeing other ways we can slowly make C++ into a simpler, better language.

emBO++ 2018 Trip Report

emBO++ is a conference focused on C++ on embedded systems in Bochum, Germany. This was its second year of operation, but the first that I’ve been along to. It was a great conference, so I’m writing a short report to hopefully convince more of you to attend next year!

Format

The conference took place over four days: an evening of lightning talks and burgers, a workshop day, a day of talks (what you might call the conference proper), and finally an unofficial standards meeting for those interested in SG14. This made for a lot of variety, and each day was valuable.

Venue

One thing I really enjoyed about emBO++ was that the different tech and social events were dotted around the city. This meant that I actually got to see some of Bochum, get lost navigating the train system, walk around town at night, etc., which made a nice change from being cooped up in a hotel for a few days.

The main conference venue was at the Zentrum für IT-Sicherheit (Centre for IT Security). It was a spacious building with a lot of light and large social areas, so it suited the conference environment well. The only problem was that it used to be a military building and was lined with copper, making the thing into one huge Faraday cage. This meant that WiFi was an issue for the first few hours of the day, but it got sorted eventually.

photo: the Zentrum für IT-Sicherheit

Food and Drink

The catering at the main conference location was really excellent: a variety of tasty food with healthy options and large quantities. Even better were the selection of drinks available, which mostly consisted of interesting soft drinks which I’d never seen before, such as bottled Matcha with lime and a number of varieties of Mate. All the locations we went to for food and drinks were great – especially the speakers dinner. A lot of thought was obviously put into this area of the conference, and it showed.

Workshops

There were four workshops on the first day of the conference with two running in parallel. The two which I attended were very interesting and instructive, but I wish that they had been more hands-on.

Jörn Seger – Getting Started with Yocto

I was in two minds about attending this workshop. We need to use Yocto a little bit in my current project, so I could attend the workshop in order to gain more knowledge about it. On the other hand, I’d then be the most experienced member of my team in Yocto and would be forced to fix all the issues!

In the end I decided to go along, and it was definitely worthwhile. Previously I’d mostly muddled along without an understanding of the fundamentals of the system; this workshop provided those.

Kris Jusiak – Embedding a Compile-Time-State-Machine

Kris gave a workshop on Boost.SML, which is an embedded domain specific language (EDSL) for encoding expressive, high-performance state machines in C++. The library is very impressive, and it was valuable to see all the different use-cases it supports and how it supports switching out the frontend and backend of the system. I was particularly interested in this session as my talk the next day was on EDSLs, so it was an opportunity to steal some things to mention in my talk.

You can find Boost.SML here.

Talks

There were two tracks for most of the day, with the first and final ones being plenary sessions. There was a strong variety of talks, and I felt that my time was well-spent at all of them.

Simon Brand – Embedded DSLs for Embedded Programming

My talk! I think it went down well. I got some good engagement and questions from the audience, although not much feedback from the attendees later on in the day. I guess I’ll need to wait for it to get torn apart on YouTube.

photo: me giving my talk

Klemens Morgenstern – Developing high-performance Coroutines for ARMs

Klemens gave an excellent talk about an ARM coroutine library which he implemented. This talk has nothing to do with the C++ Coroutines TS, instead focusing on how coroutines can be implemented in a very low-overhead manner. In Klemens’ library, the user provides some memory to be used as the stack for the coroutine, then there are small pieces of ARM assembly which perform the context switch when you suspend or resume that coroutine. The talk went into the performance considerations, implementation, and use in just the right amount of detail, so I would definitely recommend watching if you want an overview of the ideas.

The library and presentation can be found here.

Emil Fresk – The compile-time, reactive scheduler: CRECT

CRECT is a task scheduler which carries out its work at compile time, therefore almost entirely disappearing from the generated assembly. Emil’s lofty goal for the talk was to present all of the necessary concepts such that those viewing the talk would feel like they could go off and implement something similar afterwards. I think he mostly succeeded in this – although a fair amount of metaprogramming skills would be required! He showed how to use the library to specify the jobs which need to be executed, the resources which they require, and when they should be scheduled to run. After we understood the fundamentals of the library, we learned how this actually gets executed at compile-time in order to produce the final scheduled output. Highly recommended for those who work with embedded systems and want a better way of scheduling their work.

You can find CRECT here.

Ben Craig – Standardizing an OS-less subset of C++

If you watch one talk from the conference it should be this one. C++ has had a “freestanding” variant for a long time, and it’s been neglected for the same amount of time. Ben talked about all the things which should not be available in freestanding mode but are, and those which should be but are not. He presented his vision for what should be standards-mandated facilities available in freestanding C++ implementations, and a tentative path to making this a reality. Particularly of interest were the odd edge cases which I hadn’t considered. For example, it turns out that std::array has to #include <string> somewhere down the line, because my_array.at(n) can throw an exception (std::out_of_range), and that exception has a constructor which takes std::string as an argument. These tiny issues will make getting a solid standard for freestanding difficult to pin down and agree on, but I think it’s a worthy cause.

Ben’s ISO C++ paper on a freestanding standard library can be found here.

Jacek Galowicz — Scalable test infrastructure for advanced bare-metal software development

In Jacek’s team, they have many different hardware versions to support. This creates a problem of creating regressions in some versions and not others when changes are made. This talk showed how they developed the testing infrastructure to enable them to test all hardware versions they needed on each merge request to ensure that bad commits aren’t merged in to the master branch. They wrote a simple testing framework in Haskell which was fine-tuned to their use case rather than using an existing solution like Jenkins (that’s what we use at Codeplay for solving the same problem). Jacek spoke about issues they faced and the creative solutions they put in place, such as putting a light detector over the CAPS LOCK button of a keyboard and making it blink in Morse code in order to communicate out from machines with no usable ports.

Odin Holmes – Bare-Metal-Roadmap

Odin’s talk summed up some current major problems that are facing the embedded community, and roped together all of the talks which had come before. It was cool to see the overlap in all the talks in areas of abstraction, EDSLs, making choices at compile time, etc.

Closing

I had a great time at emBO++ and would whole-heartedly recommend attending next year. The talks should be online in the next few months, so I look forward to watching those which I didn’t attend. The conference is mostly directed at embedded C++ developers, but I think it would be valuable to anyone doing low-latency programming on non-embedded systems, or those writing C/Rust/whatever for embedded platforms.

Thank you to Marie, Odin, Paul, Stephan, and Tabea for inviting me to talk and organising these great few days!

photo: emBO++

Do compilers take inline as a hint?

If you’ve spent any time in C or C++ communities online, you’ve probably seen someone say this:

inline used to be a hint for compilers to inline the definition, but no compilers actually take that into account any more.

You shouldn’t believe everything you see on the internet.

Of course, inline has mandatory effects on linkage and whatnot, but this post is only interested in whether or not compilers might change the decision to inline a function or not based on whether you write inline in the declaration.
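
(For reference, the mandatory effect is that an inline function may be defined, identically, in multiple translation units without violating the one-definition rule — a minimal sketch:)

//twice.h -- safe to #include into many .cpp files without linker errors
inline int twice(int i) { return i * 2; }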

However, the point of this article isn’t to give you good rules for when to use inline, because as much as I like to debate technical minutiae, your compiler will likely be a better judge than you anyway. Instead I want to show that if you want to know how compilers or standard libraries implement things, you can just go look! It might be daunting the first time, but getting a knack for finding things in large code bases and understanding your tools can pay off. I’ll be diving into the codebases for Clang and GCC to see how they handle inlining hints and hopefully convince you that you can do the same for questions which you have about your tools.


Clang

Let’s do this bottom-up. Clang is a compiler frontend for LLVM, which means that it takes C-family languages, translates them to LLVM Intermediate Representation, and LLVM handles generating assembly/binaries from that. So we can dig around in LLVM to see if we can find code which deals with inlining. I armed myself with a text editor and grep and found the following code in llvm/lib/Analysis/InlineCost.cpp:

// Adjust the threshold based on inlinehint attribute and profile based
// hotness information if the caller does not have MinSize attribute.
if (!Caller->optForMinSize()) {
  if (Callee.hasFnAttribute(Attribute::InlineHint))
    Threshold = MaxIfValid(Threshold, Params.HintThreshold);
  if (PSI) {
    uint64_t TotalWeight;
    if (CS.getInstruction()->extractProfTotalWeight(TotalWeight) &&
        PSI->isHotCount(TotalWeight)) {
      Threshold = MaxIfValid(Threshold, Params.HotCallSiteThreshold);
    } else if (PSI->isFunctionEntryHot(&Callee)) {
      // If callsite hotness can not be determined, we may still know
      // that the callee is hot and treat it as a weaker hint for threshold
      // increase.
      Threshold = MaxIfValid(Threshold, Params.HintThreshold);
    } else if (PSI->isFunctionEntryCold(&Callee)) {
      Threshold = MinIfValid(Threshold, Params.ColdThreshold);
    }
  }
}

By just looking at this code without knowledge of all the other functions, we won’t be able to totally understand what it’s doing. But there’s certainly some information we can glean here:

if (Callee.hasFnAttribute(Attribute::InlineHint))
  Threshold = MaxIfValid(Threshold, Params.HintThreshold);

So now we know that LLVM takes the presence of an inlinehint attribute into account in its inlining cost model. But does Clang produce that attribute? Let’s look for that attribute in the Clang sources:

// Otherwise, propagate the inline hint attribute and potentially use its
// absence to mark things as noinline.
if (auto *FD = dyn_cast<FunctionDecl>(D)) {
  if (any_of(FD->redecls(), [&](const FunctionDecl *Redecl) {
        return Redecl->isInlineSpecified();
      })) {
    B.addAttribute(llvm::Attribute::InlineHint);
  } else if (CodeGenOpts.getInlining() ==
                 CodeGenOptions::OnlyHintInlining &&
             !FD->isInlined() &&
             !F->hasFnAttribute(llvm::Attribute::AlwaysInline)) {
    B.addAttribute(llvm::Attribute::NoInline);
  }
}

The above code from clang/lib/CodeGen/CodeGenModule.cpp shows Clang setting the inlinehint attribute. It will also possibly mark declarations as noinline if inline was not supplied (depending on compiler flags). Now we can look for code which affects isInlineSpecified:

// clang/include/clang/Sema/DeclSpec.h
bool isInlineSpecified() const {
  return FS_inline_specified | FS_forceinline_specified;
}

// clang/lib/Sema/DeclSpec.cpp
bool DeclSpec::setFunctionSpecInline(SourceLocation Loc, const char *&PrevSpec,
                                     unsigned &DiagID) {
  // 'inline inline' is ok. However, since this is likely not what the user
  // intended, we will always warn, similar to duplicates of type qualifiers.
  if (FS_inline_specified) {
    DiagID = diag::warn_duplicate_declspec;
    PrevSpec = "inline";
    return true;
  }
  FS_inline_specified = true;
  FS_inlineLoc = Loc;
  return false;
}

// clang/lib/Parse/ParseDecl.cpp:ParseDeclarationSpecifiers
case tok::kw_inline:
  isInvalid = DS.setFunctionSpecInline(Loc, PrevSpec, DiagID);
  break;

The above snippets show the functions which set and retrieve whether something was specified as inline, and the code in the parser which calls the setter.

So now we know that Clang propagates the presence of the inline keyword through to LLVM using the inlinehint attribute, and that LLVM takes this into account in its cost model for inlining. Our detective work has been successful!


GCC

How about GCC? The source for GCC is a fair bit less accessible than that of LLVM, but we can still make an attempt. After a bit of searching, I found that the DECL_DECLARED_INLINE_P macro seemed to be used in lots of places which were relevant. So we can look for where that’s set:

// c/c-decl.c:grokdeclarator

/* Record presence of `inline' and `_Noreturn', if it is
   reasonable.  */

if (flag_hosted && MAIN_NAME_P (declarator->u.id))
  {
    if (declspecs->inline_p)
      pedwarn (loc, 0, "cannot inline function %<main%>");
    if (declspecs->noreturn_p)
      pedwarn (loc, 0, "%<main%> declared %<_Noreturn%>");
  }
else
  {
    if (declspecs->inline_p)
      /* Record that the function is declared `inline'.  */
      DECL_DECLARED_INLINE_P (decl) = 1;
    if (declspecs->noreturn_p)
      {
        if (flag_isoc99)
          pedwarn_c99 (loc, OPT_Wpedantic,
                       "ISO C99 does not support %<_Noreturn%>");
        else
          pedwarn_c99 (loc, OPT_Wpedantic,
                       "ISO C90 does not support %<_Noreturn%>");
        TREE_THIS_VOLATILE (decl) = 1;
      }
  }

We can then look for uses of this macro which seem to affect the behaviour of the inliner. Here are a few:

// ipa-inline.c:want_inline_small_function_p

/* Do fast and conservative check if the function can be good
   inline candidate.  At the moment we allow inline hints to
   promote non-inline functions to inline and we increase
   MAX_INLINE_INSNS_SINGLE 16-fold for inline functions.  */

else if ((!DECL_DECLARED_INLINE_P (callee->decl)
          && (!e->count.ipa ().initialized_p () || !e->maybe_hot_p ()))
         && ipa_fn_summaries->get (callee)->min_size
              - ipa_call_summaries->get (e)->call_stmt_size
            > MAX (MAX_INLINE_INSNS_SINGLE, MAX_INLINE_INSNS_AUTO))
  {
    e->inline_failed = CIF_MAX_INLINE_INSNS_AUTO_LIMIT;
    want_inline = false;
  }


// ipa-inline.c:edge_badness

/* ... and do not overwrite user specified hints.  */
&& (!DECL_DECLARED_INLINE_P (edge->callee->decl)
    || DECL_DECLARED_INLINE_P (caller->decl)))))


// ipa-inline.c:early_inline_small_functions

/* Do not consider functions not declared inline.  */
if (!DECL_DECLARED_INLINE_P (callee->decl)
    && !opt_for_fn (node->decl, flag_inline_small_functions)
    && !opt_for_fn (node->decl, flag_inline_functions))
  continue;

That’s some evidence that the keyword is being taken into account. If we want to get a better idea of all of the effects, then we now have a starting point from which to conduct our investigations.


Conclusion

I hope I’ve managed to convince you of two things:

  1. Some modern compilers still take inline hints into account.
  2. If you want to understand how your compiler works and it’s open source, you can just go and look.

If you’re actually going to try to optimize your code using inline then Do The Right Thing And Measure It1. See what your compiler actually generates. Profile your code to make sure you’re optimizing something that needs optimizing. Don’t guess.


  1. DTRTAMI (dee-tee-arr-tamee) kinda has a ring to it. 

Passing overload sets to functions

Passing functions to functions is becoming increasingly prevalent in C++. With common advice being to prefer algorithms to loops, new library features like std::visit, lambdas being incrementally beefed up12 and C++ functional programming talks consistently being given at conferences, it’s something that almost all C++ programmers will need to do at some point. Unfortunately, passing overload sets or function templates to functions is not very well supported by the language. In this post I’ll discuss a few solutions and show how C++ still has a way to go in supporting this well.

An example

We have some generic operation called foo. We want a way of specifying this function which fulfils two key usability requirements.

1. It should be callable directly, without having to manually specify template arguments:

auto a = foo(42);           //good
auto b = foo("hello");      //good
auto c = foo<double>(42.0); //bad
auto d = foo{}(42.0);       //bad

2. Passing it to a higher-order function should not require manually specifying template arguments:

std::transform(first, last, target, foo);      //good
std::transform(first, last, target, foo<int>); //bad
std::transform(first, last, target, foo{});    //okay I guess

A simple first choice would be to make it a function template:

template <class T>
T foo(T t) { /*...*/ }

This fulfils the first requirement, but not the second:

//compiles, but not what we want
std::transform(first, last, target, foo<int>);

//uh oh
std::transform(first, last, target, foo);

7 : <source>:7:5: error: no matching function for call to 'transform'
std::transform(first, last, target, foo);
^~~~~~~~~~~~~~
/opt/compiler-explorer/gcc-7.2.0/lib/gcc/x86_64-linux-gnu/7.2.0/../../../../include/c++/7.2.0/bits/stl_algo.h:4295:5: note: candidate template ignored: couldn't infer template argument '_UnaryOperation'
transform(_InputIterator __first, _InputIterator __last,
^
/opt/compiler-explorer/gcc-7.2.0/lib/gcc/x86_64-linux-gnu/7.2.0/../../../../include/c++/7.2.0/bits/stl_algo.h:4332:5: note: candidate function template not viable: requires 5 arguments, but 4 were provided
transform(_InputIterator1 __first1, _InputIterator1 __last1,
^
1 error generated.

That’s no good.

A second option is to write foo as a function object with a call operator template:

struct foo {
    template <class T>
    T operator()(T t) { /*...*/ }
};

We are now required to create an instance of this type whenever we want to use the function, which is okay for passing to other functions, but not great if we want to call it directly:

//this looks okay
std::transform(first, last, target, foo{});

//this looks strange
auto x = foo{}(42.0);
auto y = foo()(42.0);

We have similar problems when we have multiple overloads, even when we’re not using templates:

int foo (int);
float foo (float);

std::transform(first, last, target, foo); //doesn't compile
// ew ew ew ew ew ew ew
std::transform(first, last, target, static_cast<int(*)(int)>(foo));

We’re going to need a different solution.


Lambdas and LIFT

As an intermediate step, we could use the normal function template approach, but wrap it in a lambda whenever we want to pass it to another function:

std::transform(first, last, target,
               [](const auto&... xs) { return foo(xs...); });

That’s not great. It’ll work in some contexts where we don’t know what template arguments to supply, but it’s not yet suitable for all cases. One improvement would be to add perfect forwarding:

[](auto&&... xs) { return foo(std::forward<decltype(xs)>(xs)...); }

But wait, we want to be SFINAE friendly, so we’ll add a trailing return type:

[](auto&&... xs) -> decltype(foo(std::forward<decltype(xs)>(xs)...)) {
    return foo(std::forward<decltype(xs)>(xs)...);
}

Okay, it’s getting pretty crazy and expert-only at this point. And we’re not even done! Some contexts will care about noexcept:

[](auto&&... xs)
    noexcept(noexcept(foo(std::forward<decltype(xs)>(xs)...)))
    -> decltype(foo(std::forward<decltype(xs)>(xs)...)) {
    return foo(std::forward<decltype(xs)>(xs)...);
}

So the solution is to write this every time we want to pass an overloaded function to another function. That’s probably a good way to make your code reviewer cry.

What would be nice is if P0573: Abbreviated Lambdas for Fun and Profit and P0644: Forward without forward were accepted into the language. That’d let us write this:

[](xs...) => foo(>>xs...)

The above is functionally equivalent to the triplicated monstrosity in the example before. Even better, if P0834: Lifting overload sets into objects was accepted, we could write:

[]foo

That lifts the overload set into a single function object which we can pass around. Unfortunately, all of those proposals have been rejected. Maybe they can be renewed at some point, but for now we need to make do with other solutions. One such solution is to approximate []foo with a macro (I know, I know).

#define FWD(...) std::forward<decltype(__VA_ARGS__)>(__VA_ARGS__)

#define LIFT(X) [](auto &&... args) \
    noexcept(noexcept(X(FWD(args)...))) \
    -> decltype(X(FWD(args)...)) \
    { \
        return X(FWD(args)...); \
    }

Now our higher-order function call becomes:

std::transform(first, last, target, LIFT(foo));

Okay, so there’s a macro in there, but it’s not too bad (you know we’re in trouble when I start trying to justify the use of macros for this kind of thing). So LIFT is at least some solution.


Making function objects work for us

You might recall from a number of examples ago that the problem with using function object types was the need to construct an instance whenever we needed to call the function. What if we make a global instance of the function object?

struct foo_impl {
    //template
    template <class T>
    T operator()(T t) { /*...*/ }

    //overloads
    int operator()(int) { /*...*/ }
    float operator()(float) { /*...*/ }
};

extern const foo_impl foo;

// in some .cpp file
const foo_impl foo{};

This works if you’re able to have a single translation unit with the definition of the global object. If you’re writing a header-only library then you don’t have that luxury, so you need to do something different.

struct foo_impl {
    template <class T>
    T operator()(T t) { /*...*/ }
};

static constexpr foo_impl foo{};

This might look innocent, but it can lead to One-Definition Rule (ODR) violations3:

//test.h header
struct foo_impl {
    template <class T>
    T operator()(T t) const { return t; }
};

static constexpr foo_impl foo{};

template <class T>
int oh_no(T t) {
    auto* foop = &foo;
    return (*foop)(t);
}

//cpp1
#include "test.h"
int sad() {
    return oh_no(42);
}

//cpp2
#include "test.h"
int also_sad() {
    return oh_no(24);
}

Since foo is declared static, each Translation Unit (TU) will get its own definition of the variable. However, sad and also_sad will both instantiate oh_no, and &foo will refer to a different object in each TU, so the two instantiations differ. This is undefined behaviour by [basic.def.odr]/12.2.

In C++17 the solution is simple:

inline constexpr foo_impl foo{};

The inline allows the variable to be multiply-defined, and the linker will throw away all but one of the definitions.

If you can’t use C++17, there are a few solutions given in N4424: Inline Variables. The Ranges V3 library uses a reference to a static member of a template class:

template <class T>
struct static_const {
    static constexpr T value{};
};

template <class T>
constexpr T static_const<T>::value;

constexpr auto& foo = static_const<foo_impl>::value;

An advantage of the function object approach is that function objects designed carefully make for much better customisation points than the traditional techniques used in the standard library. See Eric Niebler’s blog post and standards paper for more information.

A disadvantage is that now we need to write all of the functions we want to use this way as function objects, which is not great at the best of times, and even worse if we want to use external libraries. One possible solution would be to combine the two techniques we’ve already seen:

// This could be in an external library
namespace lib {
template <class T>
T foo(T t) { /*...*/ }
}

namespace lift {
inline constexpr auto foo = LIFT(lib::foo);
}

Now we can use lift::foo instead of lib::foo and it’ll fit the requirements I laid out at the start of the post. Unfortunately, I think it’s still possible to hit ODR violations with this, because the closure type produced by LIFT may differ across TUs. I’m not sure what the best workaround for this is, so input is appreciated.


Conclusion

I’ve given you a few solutions to the problem I showed at the start, so what’s my conclusion? C++ still has a way to go to support this paradigm of programming, and teaching these ideas is a nightmare. If a beginner or even intermediate programmer asks how to pass overloaded functions around – something which sounds like it should be fairly easy – it’s a real shame that the best answers I can come up with are “Copy this macro which you have no chance of understanding”, or “Make function objects, but make sure you do it this way for reasons which I can’t explain unless you understand the subtleties of ODR4”. I feel like the language could be doing more to support these use cases.

Maybe for some people “Do it this way and don’t ask why” is an okay answer, but that’s not very satisfactory to me. Maybe I lack imagination and there’s a better way to do this with what’s already available in the language. Send me your suggestions or heckles on Twitter @TartanLlama.


Thanks to Michael Maier for the motivation to write this post; Jayesh Badwaik, Ben Craig, Michał Dominiak and Kévin Boissonneault for discussion on ODR violations; and Eric Niebler, Barry Revzin, Louis Dionne, and Michał Dominiak (again) for their work on the libraries and standards papers I referenced.


  1. P0315: Lambdas in unevaluated contexts 

  2. P0624: Default constructible and assignable stateless lambdas 

  3. Example lovingly stolen from n4381

  4. Disclaimer: I don’t understand all the subtleties of ODR. 

A call for data on exceptions

Last week I released an article entitled “Functional exceptionless error-handling with optional and expected”. The idea was to talk about some good ways to handle errors for those cases where you don’t want to use exceptions, but without starting a huge argument about whether exceptions are good or bad. That post has since spawned arguments about exceptions on Hacker News, two other blog posts, and three Reddit threads. Ten points for effort, zero for execution. Since the argument has now started, I feel inclined to give my own view on the matter.

I know people who think exceptions are dangerous. I don’t personally know anyone who thinks ADT-based error handling is dangerous, but I’m sure the Internet is full of them. What I think is dangerous is cargo cult software engineering. Many people take a stance on either side of this divide for good reasons based on the constraints of their platform, but an unfortunately large number just hear a respected games programmer say “exceptions are evil!”, believe that this applies to all cases and assert it unconditionally. How do we fight this? With data.

I believe that as a community we need to be able to answer these questions:

  1. What do the different kinds of error handling cost when errors occur?
  2. What do they cost when no errors occur?
  3. How consistent are the costs, and what causes deviations?
  4. What is the cognitive overhead of the different techniques?
  5. How do they affect the design of our interfaces?

I have an intuition about all of these, backed by some understanding of the actual code which is generated and how it performs, as I’m sure many of you do. But we hit a problem when it comes to teaching. I don’t believe that teaching an intuition can build an intuition. Such a thing can only be built by experimentation or the reduction of hard data into a broad understanding of what that data points towards. If we answer all of the questions above with hard data, then we can come up with a set of strong, measurable guidelines for when to use different kinds of error handling, and we can have something concrete to point people towards to help build that intuition. The C++ Core Guidelines provides the former, but not the latter. Furthermore, when I talk about “hard data”, I don’t just mean microbenchmarks. Microbenchmarks can help, sure, but writing them such that they have realistic error patterns, cache behaviour, branch prediction, etc. is hard.

Of course, many people have tried to answer these questions already, so I’ve teamed up with Matt Dziubinski to put together a set of existing resources on error handling. Use it to educate yourselves and others, and please submit issues/pull requests to make it better. But I believe that there’s still work to be done, so if you’re a prospective PhD or Masters student and you’re looking for a topic, consider helping out the industry by researching the real costs of different error handling techniques in realistic code bases with modern compilers and modern hardware.

throw end_of_blog_post;

Functional exceptionless error-handling with optional and expected

In software things can go wrong. Sometimes we might expect them to go wrong. Sometimes it’s a surprise. In most cases we want to build in some way of handling these misfortunes. Let’s call them disappointments. I’m going to exhibit how to use std::optional and the proposed std::expected to handle disappointments, and show how the types can be extended with concepts from functional programming to make the handling concise and expressive.

One way to express and handle disappointments is exceptions:

void foo() {
    try {
        do_thing();
    }
    catch (...) {
        //oh no
        handle_error();
    }
}

There are a myriad of discussions, resources, rants, tirades, debates about the value of exceptions123456, and I will not repeat them here. Suffice to say that there are cases in which exceptions are not the best tool for the job. For the sake of being uncontroversial, I’ll take the example of disappointments which are expected within reasonable use of an API.

The internet loves cats. The hypothetical you and I are involved in the business of producing the cutest images of cats the world has ever seen. We have produced a high-quality C++ library geared towards this sole aim, and we want it to be at the bleeding edge of modern C++.

A common operation in feline cutification programs is to locate cats in a given image. How should we express this in our API? One option is exceptions:

// Throws no_cat_found if a cat is not found.
image_view find_cat (image_view img);

This function takes a view of an image and returns a smaller view which contains the first cat it finds. If it does not find a cat, then it throws an exception. If we’re going to be giving this function a million images, half of which do not contain cats, then that’s a lot of exceptions being thrown. In fact, we’re pretty much using exceptions for control flow at that point, which is A Bad Thing™.

What we really want to express is a function which either returns a cat if it finds one, or it returns nothing. Enter std::optional.

std::optional<image_view> find_cat (image_view img);

std::optional was introduced in C++17 for representing a value which may or may not be present. It is intended to be a vocabulary type – i.e. the canonical choice for expressing some concept in your code. The difference between this signature and the last is powerful; we’ve moved the description of what happens on an error from the documentation into the type system. Now it’s impossible for the user to forget to read the docs, because the compiler is reading them for us, and you can be sure that it’ll shout at you if you use the type incorrectly.

The most common operations on a std::optional are:

std::optional<image_view> full_view = my_view;
std::optional<image_view> empty_view;
std::optional<image_view> another_empty_view = std::nullopt;

full_view.has_value(); //true
empty_view.has_value(); //false

if (full_view) { this_works(); }

my_view = full_view.value();
my_view = *full_view;

my_view = empty_view.value(); //throws bad_optional_access
my_view = *empty_view; //undefined behaviour
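
Another operation worth knowing about is value_or, which gives you the stored value if one is present, or a fallback which you supply if not (backup_view here is just some other image_view for illustration):

my_view = full_view.value_or(backup_view);  //contents of full_view
my_view = empty_view.value_or(backup_view); //backup_view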

Now we’re ready to use our find_cat function along with some other friends from our library to make embarrassingly adorable pictures of cats:

std::optional<image_view> get_cute_cat (image_view img) {
    auto cropped = find_cat(img);
    if (!cropped) {
        return std::nullopt;
    }

    auto with_tie = add_bow_tie(*cropped);
    if (!with_tie) {
        return std::nullopt;
    }

    auto with_sparkles = make_eyes_sparkle(*with_tie);
    if (!with_sparkles) {
        return std::nullopt;
    }

    return add_rainbow(make_smaller(*with_sparkles));
}

Well this is… okay. The user is made to explicitly handle what happens in case of an error, so they can’t forget about it, which is good. But there are two issues with this:

  1. There’s no information about why the operations failed.
  2. There’s too much noise; error handling dominates the logic of the code.

I’ll address these two points in turn.

Why did something fail?

std::optional is great for expressing that some operation produced no value, but it gives us no information to help us understand why this occurred; we’re left to use whatever context we have available, or (God forbid) output parameters. What we want is a type which either contains a value, or contains some information about why the value isn’t there. This is called std::expected.

Now don’t go rushing off to cppreference to find out about std::expected; you won’t find it there yet, because it’s currently a standards proposal rather than a part of C++ proper. However, its interface follows std::optional pretty closely, so you already understand most of it. Again, here are the most common operations:

std::expected<image_view,error_code> full_view = my_view;
std::expected<image_view,error_code> empty_view = std::unexpected(that_is_a_dog);

full_view.has_value(); //true
empty_view.has_value(); //false

if (full_view) { this_works(); }

my_view = full_view.value();
my_view = *full_view;

my_view = empty_view.value(); //throws bad_expected_access
my_view = *empty_view; //undefined behaviour

auto code = empty_view.error();
auto oh_no = full_view.error(); //undefined behaviour

With std::expected our code might look like this:

std::expected<image_view, error_code> get_cute_cat (image_view img) {
    auto cropped = find_cat(img);
    if (!cropped) {
        return std::unexpected(no_cat_found);
    }

    auto with_tie = add_bow_tie(*cropped);
    if (!with_tie) {
        return std::unexpected(cannot_see_neck);
    }

    auto with_sparkles = make_eyes_sparkle(*with_tie);
    if (!with_sparkles) {
        return std::unexpected(cat_has_eyes_shut);
    }

    return add_rainbow(make_smaller(*with_sparkles));
}

Now when we call get_cute_cat and don’t get a lovely image back, we have some useful information to report to the user as to why we got into this situation.

Noisy error handling

Unfortunately, with both the std::optional and std::expected versions, there’s still a lot of noise. This is a disappointing solution to handling disappointments. It’s also the limit of what C++17’s std::optional and the most recent proposed std::expected give us.

What we really want is a way to express the operations we want to carry out while pushing the disappointment handling off to the side. As is becoming increasingly trendy in the world of C++, we’ll look to the world of functional programming for help. In this case, the help comes in the form of what I’ll call map and and_then.

If we have some std::optional and we want to carry out some operation on it if and only if there’s a value stored, then we can use map:

widget do_thing (const widget&);

std::optional<widget> result = maybe_get_widget().map(do_thing);
auto result = maybe_get_widget().map(do_thing); //or with auto

This code is roughly equivalent to:

widget do_thing (const widget&);

auto opt_widget = maybe_get_widget();
if (opt_widget) {
    widget result = do_thing(*opt_widget);
}

If we want to carry out some operation which could itself fail then we can use and_then:

std::optional<widget> maybe_do_thing (const widget&);

std::optional<widget> result = maybe_get_widget().and_then(maybe_do_thing);
auto result = maybe_get_widget().and_then(maybe_do_thing); //or with auto

This code is roughly equivalent to:

std::optional<widget> maybe_do_thing (const widget&);

auto opt_widget = maybe_get_widget();
if (opt_widget) {
    std::optional<widget> result = maybe_do_thing(*opt_widget);
}
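
If you’re wondering what map and and_then might look like under the hood, here’s a minimal free-function sketch over std::optional (tl::optional and the proposals provide them as member functions; this is only to show the idea):

#include <optional>

template <class T, class F>
auto map(const std::optional<T>& o, F f) -> std::optional<decltype(f(*o))> {
    if (o) return f(*o);
    return std::nullopt;
}

template <class T, class F>
auto and_then(const std::optional<T>& o, F f) -> decltype(f(*o)) {
    if (o) return f(*o); //f itself returns an optional
    return std::nullopt;
}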

The real power of this comes when we begin to chain operations together. Let’s look at that original get_cute_cat implementation again:

std::optional<image_view> get_cute_cat (image_view img) {
    auto cropped = find_cat(img);
    if (!cropped) {
        return std::nullopt;
    }

    auto with_tie = add_bow_tie(*cropped);
    if (!with_tie) {
        return std::nullopt;
    }

    auto with_sparkles = make_eyes_sparkle(*with_tie);
    if (!with_sparkles) {
        return std::nullopt;
    }

    return add_rainbow(make_smaller(*with_sparkles));
}

With map and and_then, our code transforms into this:

tl::optional<image_view> get_cute_cat (image_view img) {
    return find_cat(img)
        .and_then(add_bow_tie)
        .and_then(make_eyes_sparkle)
        .map(make_smaller)
        .map(add_rainbow);
}

With these two functions we’ve successfully pushed the error handling off to the side, allowing us to express a series of operations which may fail without interrupting the flow of logic to test an optional.

A theoretical aside

I didn’t make up map and and_then off the top of my head; other languages have had equivalent features for a long time, and the theoretical concepts are common subjects in Category Theory.

I won’t attempt to explain all the relevant concepts in this post, as others have done it far better than I could. The basic idea is that map comes from the concept of a functor, and and_then comes from monads. These two functions are called fmap and >>= (bind) in Haskell. The best description of these concepts which I have read is Functors, Applicatives, And Monads In Pictures by Aditya Bhargava. Give it a read if you’d like to learn more about these ideas.

A note on overload sets

One use-case which is annoyingly verbose is passing overloaded functions to map or and_then. For example:

int foo (int);

tl::optional<int> o;
o.map(foo);

The above code works fine. But as soon as we add another overload to foo:

int foo (int);
int foo (double);

tl::optional<int> o;
o.map(foo);

then it fails to compile with a rather unhelpful error message:

test.cpp:7:3: error: no matching member function for call to 'map'
o.map(foo);
~~^~~
/home/simon/projects/optional/optional.hpp:759:52: note: candidate template ignored: couldn't infer template argument 'F'
  template <class F> TL_OPTIONAL_11_CONSTEXPR auto map(F &&f) & {
                                                   ^
/home/simon/projects/optional/optional.hpp:765:52: note: candidate template ignored: couldn't infer template argument 'F'
  template <class F> TL_OPTIONAL_11_CONSTEXPR auto map(F &&f) && {
                                                   ^
/home/simon/projects/optional/optional.hpp:771:37: note: candidate template ignored: couldn't infer template argument 'F'
  template <class F> constexpr auto map(F &&f) const & {
                                    ^
/home/simon/projects/optional/optional.hpp:777:37: note: candidate template ignored: couldn't infer template argument 'F'
  template <class F> constexpr auto map(F &&f) const && {
                                    ^
1 error generated.

One solution for this is to use a generic lambda:

tl::optional<int> o;
o.map([](auto x){return foo(x);});

Another is a LIFT macro:

#define FWD(...) std::forward<decltype(__VA_ARGS__)>(__VA_ARGS__)
#define LIFT(f) \
    [](auto&&... xs) noexcept(noexcept(f(FWD(xs)...))) -> decltype(f(FWD(xs)...)) \
    { return f(FWD(xs)...); }

tl::optional<int> o;
o.map(LIFT(foo));

Personally I hope to see overload set lifting get into the standard so that we don’t need to bother with the above solutions.

Current status

Maybe I’ve persuaded you that these extensions to std::optional and std::expected are useful and you would like to use them in your code. Fortunately I have written implementations of both with the extensions shown in this post, among others. tl::optional and tl::expected are on GitHub as single-header libraries under the CC0 license, so they should be easy to integrate with projects new and old.

As far as the standard goes, there are a few avenues being entertained for adding this functionality. I have a proposal to extend std::optional with new member functions. Vicente Escribá has a proposal for a generalised monadic interface for C++. Niall Douglas’ operator try() paper suggests an analogue to Rust’s try! macro for removing some of the boilerplate associated with this style of programming. It turns out that you can use coroutines for doing this stuff, although my gut feeling puts this more to the “abuse” end of the spectrum. I’d also be interested in evaluating how Ranges could be leveraged for these goals.

Ultimately I don’t care how we achieve this as a community so long as we have some standardised solution available. As C++ programmers we’re constantly finding new ways to leverage the power of the language to make expressive libraries, thus improving the quality of the code we write day to day. Let’s apply this to std::optional and std::expected. They deserve it.


Meeting C++17 Trip Report

This year was my first time at Meeting C++. It was also the first time I gave a full-length talk at a conference. But most of all it was a wonderful experience filled with smart, friendly people and excellent talks. This is my report on the experience. I hope it gives you an idea of some talks to watch when they’re up on YouTube, and maybe convince you to go along and submit talks next year! I’ll be filling out links to the talks as they go online.

The Venue

The conference was held at the Andel’s Hotel in Berlin, which is a very large venue on the East side of town, about a 15-minute tram ride from Alexanderplatz. All the conference areas were very spacious and there were always places to escape to when you need a break from the noise, or armchairs to sit down on when you need a break from sitting down on conference chairs.

The People

Some people say they go to conferences for the talks, and for the opportunity to interact with the speakers in the Q&As afterwards. The talks are great, but I went for the people. I had a wonderful time meeting those whom I knew from Twitter or the CppLang Slack, but had never encountered in real life; catching up with those who I’ve met at other events; and talking to those I’d never interacted with before. There were around 600 people at the conference, primarily from Europe, but many from further afield. I didn’t have any negative encounters with anyone at the conference, and Jens was very intentional about publicising the Code of Conduct and having a diverse team to enforce it, so it was a very positive atmosphere.

Day 1

Keynote: Sean Parent – Better Code: Human interface

After a short introduction from Jens, the conference kicked off with a keynote from Sean Parent. Although he didn’t say it in these words, the conclusion I took out of it was that all technology requires some thought about user experience (UX if we’re being trendy). Many of us will think of UX as something centred around user interfaces, but it’s more than that. If you work on compilers (like I do), then your UX work involves consistent compiler flags, friendly error messages and useful tooling. If you work on libraries, then your API is your experience. Do your functions do the most obvious thing? Are they well-documented? Do they come together to create terse, correct, expressive, efficient code? Sean’s plea was for us to consider all of these things, and to embed them into how we think about coding. Of course, it being a talk about good code from Sean Parent, it had some great content about composing standard algorithms. I would definitely recommend checking out the talk if what I’ve just written sounds thought-provoking to you.

Peter Goldsborough – Deep Learning with C++

Peter gave a very well-structured talk, which went down the stack of technology involved in deep learning – from high-level libraries to hardware – showing how it all works in theory. He then went back up the stack showing the practical side, including a few impressive live demos. It covers a lot of content, giving a pretty comprehensive introduction. It does move at a very quick pace though, so you probably don’t want to watch this one on 1.5x on YouTube!

Jonathan Boccara – Strong types for strong interfaces

This talk was quite a struggle against technical difficulties; starting 15 minutes late after laptop/projector troubles, then having Keynote crash as soon as the introduction was over. Fortunately, Jonathan handled it incredibly well, seeming completely unfazed as he gave an excellent presentation and even finished on time.

The talk itself was about using strong types (a.k.a. opaque typedefs) to strengthen your code’s resilience to programmer errors and increase its readability. This was a distillation of some of the concepts which Jonathan has written about in a series on his blog. He had obviously spent a lot of time rehearsing this talk and put a lot of thought into the visual design of the slides, which I really appreciated.

Simon Brand (me!) – How C++ Debuggers Work

This was my first full length talk at a conference, so I was pretty nervous about it, but I think it went really well: there were 150-200 people there, I got some good questions, and a bunch of people came to discuss the talk with me in the subsequent days. I screwed up a few things – like referring to the audience as “guys” near the start and finishing a bit earlier than I would have liked – but all in all it was a great experience.

My talk was an overview of almost all of the commonly-used parts of systems-level debuggers. I covered breakpoints, stepping, debug information, object files, operating system interaction, expression evaluation, stack unwinding and a whole lot more. It was a lot of content, but I think I managed to give a good presentation of it. I’d greatly appreciate any feedback on how the talk could be improved in case I end up giving it at some other conference in the future!

Joel Falcou – The Three Little Dots and the Big Bad Lambdas

I’m a huge metaprogramming nerd, so this talk on using lambdas as a kind of code injection was very interesting for me. Joel started the talk with an introduction to how OCaml allows code injection, and then went on to show how you could emulate this feature in C++ using templates, lambdas and inlining. It turned out that he was going to mostly talk about the indices trick for unpacking tuple-like things, fold expressions (including how to emulate them in C++11/14), and template-driven loop unrolling – all of which are techniques I’m familiar with. However, I hadn’t thought about these tools in the context of compile-time code injection, and Joel presented them very well, so it was still a very enjoyable talk. I particularly enjoyed hearing that he uses these in production in his company in order to fine-tune performance. He showed a number of benchmarks of the linear algebra library he has worked on against another popular implementation which hand-tunes the code for different architectures, and his performance was very competitive. A great talk for demonstrating how templates can be used to optimise your code!

Diego Rodriguez-Losada – Conan C++ Quiz

This was one of the highlights of the whole conference. The questions were challenging, thought-provoking, and hilarious, and Diego was a fantastic host. He delivered all the content with charisma and wit, and was obviously enjoying himself: maybe my favourite part of the quiz was the huge grin he wore as everyone groaned in incredulity when the questions were revealed. I don’t know if the questions will go online at some point, but I’d definitely recommend giving them a shot if you get the chance.

Day 2

Keynote: Kate Gregory – It’s complicated!

Kate gave a really excellent talk which examined the complexities of writing good code, and those of C++. The slide which I saw the most people get their phones out for said “Is it important to your ego that you’re really good at a complicated language?”, which I think really hit home for many of us. C++ is like an incredibly intricate puzzle, and when you solve a piece of it, you feel good about it and want to share your success with others. However, sometimes, we can make that success, that understanding, part of our identity. Neither Kate nor I am saying that we shouldn’t be proud of our domination over the obscure parts of the language, but a bit of introspection about how we reflect this understanding and share it with others will go a long way.

Another very valuable part of her talk was her discussion on the importance of good names, and how being able to describe some idiom or pattern helps us to talk about it and instruct others. Of course, this is what design patterns were originally all about, but Kate frames it in a way which helps build an intuition around how this common language works to make us better programmers. A few days after the conference, this is the talk which has been going around my head the most, so if you watch one talk from this year’s conference, you could do a lot worse than this one.

Felix Petriconi – There Is A New Future

Felix presented the std::future-a-like library which he and Sean Parent have been working on. It was a very convincing talk, showing the inadequacies of std::future, and how their implementation fixes these problems. In particular, Felix discussed continuations, cancellations, splits and forks, and demonstrated their utility. You can find the library here.

Lukas Bergdoll – Is std::function really the best we can do?

If you’ve read a lot of code written by others, I’m sure you’ll have seen people misusing std::function. This talk showed why so many uses are wrong, including live benchmarks, and gave a number of alternatives which can be used in different situations. I particularly liked a slide which listed the conditions for std::function to be a good choice, which were:

  • It has to be stored inside a container
  • You really can’t know the layout at compile time
  • All your callable types are movable and copyable
  • You can constrain the user to exactly one signature

#include meeting

Guy Davidson led an open discussion about #include, which is a diversity group for C++ programmers started by Guy and Kate Gregory (I’ve also been helping out a bit). It was a very positive discussion in which we shared stories, tried to grapple with some of the finer points of diversity and inclusion in the tech space, and came up with some aims to consider moving forward. Some of the key outcomes were:

  • We should be doing more outreach into schools and trying to encourage those in groups who tend to be discouraged from joining the tech industry.
  • Some people would greatly value a resource for finding companies which really value diversity rather than just listing it as important on their website.
  • Those of us who care about these issues can sometimes turn people away from the cause by not being sensitive to cultural, psychological or sociological issues; perhaps we need education on how to educate.

I thought this was a great session and look forward to helping develop this initiative and seeing where we can go with it. I’d encourage anyone who is interested to get on to the #include channel on the CppLang Slack, or to send pull requests to the information repository.

Juan Bolívar – The most valuable values

Juan gave a very energetic, engaging talk about value semantics and his Redux-/Elm-like C++ library, lager. He showed how these techniques can be used for writing clear, correct code without shared mutable state. I particularly like the time-travelling debugger which he presented, which allows you to step around the changes to your data model. I don’t want to spoil anything, but there’s a nice surprise in the talk which gets very meta.

Day 3

Lightning talks

I spent most of day three watching lightning talks. There were too many to discuss individually, so I’ll just mention my favourites.

Arvid Gerstmann – A very quick view into a compiler

Arvid gave a really clear, concise description of the different stages of compilation. Recommended if you want a byte-sized introduction to compilers.

Réka Nikolett Kovács – std::launder

It’s become a bit of a meme on the CppLang Slack that we don’t talk about std::launder, as it’s only needed if you’re writing standard-library-grade generic libraries. Still, I was impressed by this talk, which explains the problem well and shows how std::launder (mostly) solves it. I would have liked more of a disclaimer about who the feature is for, and that there are still unsolved problems, but it was still a good talk.

Andreas Weis – Type Punning Done Right

Strict aliasing violations are a common source of bugs if you’re writing low level code and aren’t aware of the rules. Andreas gave a good description of the problem, and showed how memcpy is the preferred way to solve it. I do wish he’d included an example of std::bit_cast, which was just voted in to C++20, but it’s a good lightning talk to send to your colleagues the next time they go type punning.

Miro Knejp – Is This Available?

Miro presented a method of avoiding awful #ifdef blocks by providing a template-based interface to everything you would otherwise preprocess. This is somewhat similar to what I talked about in my if constexpr blog post, and I do something similar in production for abstracting the differences in hardware features in a clean, composable manner. As such, it’s great to have a lightning talk to send to people when I want to explain the concept, and I’d recommend watching it if the description piques your interest. Now we just need a good name for it. Maybe the Template Abstracted Preprocessor (TAP) idiom? Send your thoughts to me and Miro!

Simon Brand (me!) – std::optional and the M word

Someone who was scheduled to do a lightning talk didn’t make the session, so I ended up giving an improvised talk about std::optional and monads. I didn’t have any slides, so described std::optional as a glass which could either be empty, or have a value (in this case represented by a pen). I got someone from the audience to act as a function which would give me a doily when I gave it a pen, and acted out functors and monads using that. It turned out very well, even if I forgot how to write Haskell’s bind operator (>>=) halfway through the talk!

Overall

I really enjoyed the selection of lightning talks. There was a very diverse range of topics, and everything flowed very smoothly without any technical problems. It would have been great if there were more speakers, since some people gave two or even three talks, but that means that more people need to submit! If you’re worried about talking, a lightning talk is a great way to practice; if things go terribly wrong, then the worst that can happen is that they move on to the next speaker. It would have also been more fun if it was on the main track, or if there wasn’t anything scheduled against them.

Secret Lightning Talks

Before the final keynote there was another selection of lightning talks given on the keynote stage. Again, I’ll just discuss my favourites.

Guy Davidson – Diversity and Inclusion

Guy gave a heartfelt talk about diversity in the C++ community, giving reasons why we should care, and what we can do to try and encourage involvement for those in under-represented groups in our communities. This is a topic which is also very important to me, so it was great to see it being given an important place in the conference.

Kate Gregory – Five things I learned when I should have been dying

An even more personal talk was given by Kate about important things she learned when she was given her cancer diagnosis. It was a very inspiring call to not care about the barriers which we may perceive to be in the way of achieving something – like, say, going to talk at a conference – and instead just trying your best and “doing the work”. Her points really resonated with me; I, like many of us, have suffered from a lot of impostor syndrome in my short time in the industry, and talks like this help me to battle through it.

Phil Nash – A Composable Command Line Parser

Phil’s talk was on Clara, which is a simple, composable command line parser for C++. It was split off from his Catch test framework into its own library, and he’s been maintaining it independently of its parent. The talk gave a strong introduction to the library and why you might want to use it instead of one of the other hundreds of command line parsers which are out there. I’m a big fan of the design of Clara, so this was a pleasure to watch.

Keynote: Wouter van Ooijen – What can C++ offer embedded, what can embedded offer C++?

To close out the conference, Wouter gave a talk about the interaction between the worlds of C++ and embedded programming. The two have been getting more friendly in recent years thanks to libraries like Kvasir and talks by people like Dan Saks and Odin Holmes. Wouter started off talking about his work on space rockets, then motivated using C++ templates for embedded programming, showed us examples of how to do it, then finished off talking about what the C++ community should learn from embedded programming. If I’m honest, I didn’t enjoy this keynote as much as the others, as I already had an understanding of the techniques he showed, and I was unconvinced by the motivating examples, which seemed like they hadn’t been optimised properly (I had to leave early to catch a plane, so please someone tell me if he ended up clarifying them!). If you’re new to using C++ for embedded programming, this may be a good introduction.

Closing

That’s it for my report. I want to thank Jens for organising such a great conference, as well as his staff and volunteers for making sure everything ran smoothly. Also thanks to the other speakers for being so welcoming to new blood, and all the other attendees for their discussions and questions. I would whole-heartedly recommend attending the conference and submitting talks, especially if you haven’t done so before. Last but not least, a big thanks to my employer, Codeplay, for sending me (we’re hiring, you get to work on compilers and brand new hardware and go to conferences, it’s pretty great).

Detection Idiom – A Stopgap for Concepts

Concepts is a proposed C++ feature which allows succinct, expressive, and powerful constraining of templates. They have been threatening to get into C++ for a long time now, with the first proposal being rejected for C++11. They were finally merged into C++20 a few months ago, which means we need to hold on for another few years before they’re in the official standard rather than a Technical Specification. In the meantime, there have been various attempts to implement parts of concepts as a library so that we can have access to some of the power of Concepts without waiting for the new standard. The detection idiom, which is part of the Library Fundamentals 2 Technical Specification, is one such solution. This post will outline the utility of it, and show the techniques which underlie its implementation.


A problem

We are developing a generic library. At some point in our library we need to calculate the foo factor for whatever types the user passes in.

template <class T>
int calculate_foo_factor (const T& t);

Some types will have specialized implementations of this function, but for types which don’t we’ll need some generic calculation.

struct has_special_support {
  int get_foo() const;
};

struct will_need_generic_calculation {
  // no get_foo member function
};

Using concepts we could write calculate_foo_factor like so:

template <class T>
concept bool SupportsFoo = requires (T t) {
  { t.get_foo() } -> int;
};

template <SupportsFoo T>
int calculate_foo_factor (const T& t) {
    return t.get_foo();
}

template <class T>
int calculate_foo_factor (const T& t) {
    // insert generic calculation here
    return 42;
}

This is quite succinct and clear: SupportsFoo is a concept which checks that we can call get_foo on t with no arguments, and that the type of that expression is int. The first calculate_foo_factor will be selected by overload resolution for types which satisfy the SupportsFoo concept, and the second will be chosen for those which don’t.

Unfortunately, our library has to support C++14. We’ll need to try something different. I’ll demonstrate a bunch of possible solutions to this problem in the next section. Some of them may seem complex if you aren’t familiar with the metaprogramming techniques used, but for now, just note the differences in complexity and abstraction between them. The metaprogramming tricks will all be explained in the following section.

Solutions

Here’s a possible solution using expression SFINAE:

namespace detail {
  template <class T>
  auto calculate_foo_factor (const T& t, int)
    -> decltype(std::declval<T>().get_foo()) {
    return t.get_foo();
  }

  template <class T>
  int calculate_foo_factor (const T& t, ...) {
    // insert generic calculation here
    return 42;
  }
}

template <class T>
int calculate_foo_factor (const T& t) {
  return detail::calculate_foo_factor(t, 0);
}

Well, it works, but it’s not exactly clear. What’s the int and ... there for? What’s std::declval? Why do we need an extra overload? The answers are not the important part here; what is important is that unless you’ve got a reasonable amount of metaprogramming experience, it’s unlikely you’ll be able to write this code offhand, or even copy-paste it without error.

The code could be improved by abstracting out the check for the presence of the member function into its own metafunction:

template <class... Ts>
using void_t = void;

template <class T, class=void>
struct supports_foo : std::false_type{};

template <class T>
struct supports_foo<T, void_t<decltype(std::declval<T>().get_foo())>>
: std::true_type{};

Again, some more expert-only template trickery which I’ll explain later. Using this trait, we can use std::enable_if to enable and disable the overloads as required:

template <class T, std::enable_if_t<supports_foo<T>::value>* = nullptr>
auto calculate_foo_factor (const T& t) {
  return t.get_foo();
}

template <class T, std::enable_if_t<!supports_foo<T>::value>* = nullptr>
int calculate_foo_factor (const T& t) {
  // insert generic calculation here
  return 42;
}

This works, but you’d need to understand how to implement supports_foo, and you’d need to repeat all the boilerplate again if you needed to write a supports_bar trait. It would be better if we could completely abstract away the mechanism for detecting the member function and just focus on what we want to detect. This is what the detection idiom provides.

template <class T>
using foo_t = decltype(std::declval<T>().get_foo());

template <class T>
using supports_foo = is_detected<foo_t,T>;

is_detected here is part of the detection idiom. Read it as “is it valid to instantiate foo_t with T?” std::declval<T>() essentially means “pretend that I have a value of type T” (more on this later). This still requires some metaprogramming magic, but it’s a whole lot simpler than the previous examples. With this technique we can also easily check for the validity of arbitrary expressions:

template <class T, class U>
using equality_t = decltype(std::declval<T>() == std::declval<U>());

template <class T, class U>
using supports_equality = is_detected<equality_t, T, U>;

struct foo{};
struct bar{};
struct baz{};

bool operator== (foo, bar);

static_assert( supports_equality<foo,bar>::value, "wat");
static_assert(!supports_equality<foo,baz>::value, "wat");

We can also compose traits using std::conjunction:

template <class T>
using is_regular = std::conjunction<std::is_default_constructible<T>,
                                    std::is_copy_constructible<T>,
                                    supports_equality<T,T>,
                                    supports_inequality<T,T>, //assume impl
                                    supports_less_than<T,T>>; //ditto

If you want to use is_detected today, then you can check if your standard library supports std::experimental::is_detected. If not, you can use the implementation from cppreference or the one which we will go on to write in the next section. If you aren’t interested in how this is written, then turn back, for here be metaprogramming dragons.


Metaprogramming demystified

I’ll now work backwards through the metaprogramming techniques used, leading up to implementing is_detected.

Type traits and _v and _t suffixes

A type trait is some template which can be used to get information about characteristics of a type. For example, you can find out if some type is an arithmetic type using std::is_arithmetic:

template <class T>
void foo(T t) {
     static_assert(std::is_arithmetic<T>::value, "Argument must be of an arithmetic type");
}

Type traits either “return” types with a ::type member alias, or values with a ::value alias. _t and _v suffixes are shorthand for these respectively. So std::is_arithmetic_v<T> is the same as std::is_arithmetic<T>::value, std::add_pointer_t<T> is the same as typename std::add_pointer<T>::type1.
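
For example (these are all real standard library traits; the _t aliases arrived in C++14 and the _v variable templates in C++17):

static_assert(std::is_arithmetic<int>::value, "");
static_assert(std::is_arithmetic_v<int>, ""); //same as the above

using p1 = typename std::add_pointer<int>::type; //int*
using p2 = std::add_pointer_t<int>;              //same as the above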

decltype specifiers

decltype gives you access to the type of an entity or expression. For example, with int i;, decltype(i) is int.
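
A couple more cases show the shape of it (note the extra parentheses in the last line; they are a classic gotcha):

int i = 0;
decltype(i) a = i;       //int
decltype(i + 1.0) b = 0; //double: the type of the whole expression
decltype((i)) c = i;     //int&: (i) is an lvalue expression, so we get a reference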

std::declval trickery

std::declval is a template function which helps create values inside decltype specifiers. std::declval<foo>() essentially means “pretend I have some value of type foo”. This is needed because the types you want to inspect inside decltype specifiers may not be default-constructible. For example:

struct foo {
    foo() = delete;
    foo(int);
    void do_something();
};

decltype(foo{}.do_something())               // invalid
decltype(std::declval<foo>().do_something()) // fine

SFINAE and std::enable_if

SFINAE stands for Substitution Failure Is Not An Error. Due to this rule, some constructs which are usually hard compiler errors just cause a function template to be ignored by overload resolution. std::enable_if is a way to gain easy access to this. It takes a boolean template argument and either contains or does not contain a ::type member alias depending on the value of that boolean. This allows you to do things like this:

template <class T>
std::enable_if_t<std::is_integral_v<T>> foo(T t);

template <class T>
std::enable_if_t<std::is_floating_point_v<T>> foo(T t);

If T is integral, then the second overload will be SFINAEd out, so the first is the only candidate. If T is floating point, then the reverse is true.
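
So calling foo picks whichever overload survives:

foo(42);   //first overload: int is integral
foo(3.14); //second overload: double is floating point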

Expression SFINAE

Expression SFINAE is a special form of SFINAE which applies to arbitrary expressions. This is what allowed this example from above to work:

namespace detail {
  template <class T>
  auto calculate_foo_factor (const T& t, int)
    -> decltype(std::declval<T>().get_foo()) {
    return t.get_foo();
  }

  template <class T>
  int calculate_foo_factor (const T& t, ...) {
    // insert generic calculation here
    return 42;
  }
}

template <class T>
int calculate_foo_factor (const T& t) {
  return detail::calculate_foo_factor(t, 0);
}

The first overload will be SFINAEd out if calling get_foo on an instance of T is invalid. The difficulty here is that both overloads will be valid if the get_foo call is valid. For this reason, we add some dummy parameters to the overloads (int and ...) to specify an order for overload resolution to follow2.

void_t magic

void_t is a C++17 feature (although it’s implementable in C++11) which makes writing traits and using expression SFINAE a bit easier. The implementation is deceptively simple3:

template <class... Ts>
using void_t = void;
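
As footnote 3 alludes to, some compilers mishandled unused parameters in alias templates (CWG1558), so the portable workaround – the form you’ll commonly see in standard library implementations – adds a level of indirection so that the pack is genuinely used:

template <class... Ts>
struct make_void { using type = void; };

template <class... Ts>
using void_t = typename make_void<Ts...>::type;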

You can see void_t in action in the supports_foo example from above:

template <class T, class=void>
struct supports_foo : std::false_type{};

template <class T>
struct supports_foo<T, void_t<decltype(std::declval<T>().get_foo())>>
: std::true_type{};

The relevant parts are the class=void and void_t<...>. If the expression inside void_t is invalid, then that specialization of supports_foo will be SFINAEd out, so the primary template will be used. Otherwise, the whole expression will be evaluated to void due to the void_t and this will match the =void default argument to the template. This gives a pretty succinct way to check arbitrary expressions.
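
You can convince yourself that it works with a couple of quick test types (made up for illustration):

struct with_foo    { int get_foo() const; };
struct without_foo {};

static_assert( supports_foo<with_foo>::value, "wat");
static_assert(!supports_foo<without_foo>::value, "wat");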

That covers all of the ingredients we need to implement is_detected.

The implementation

For the sake of simplicity I’ll just implement is_detected rather than all of the entities which the Lib Fundamentals 2 TS provides.

namespace detail {
  template <template <class...> class Trait, class Enabler, class... Args>
  struct is_detected : std::false_type{};

  template <template <class...> class Trait, class... Args>
  struct is_detected<Trait, void_t<Trait<Args...>>, Args...> : std::true_type{};
}

template <template <class...> class Trait, class... Args>
using is_detected = typename detail::is_detected<Trait, void, Args...>::type;

Let’s start with that last alias template. We take a variadic template template parameter (yes really) called Trait and any number of other type template arguments. We forward these to our detail::is_detected helper with an extra void argument which will act as the class=void construct from the previous section on void_t4. We then have a primary template which will “return” false as the result of the trait. The magic then happens in the following partial specialization. Trait<Args...> is evaluated inside void_t. If instantiating Trait with Args... is invalid, then the partial specialization will be SFINAEd out, and if it’s valid, it’ll evaluate to void due to the void_t. This successfully abstracts away the implementation of the detection and allows us to focus on what we want to detect.
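
Putting it together with the foo_t alias from earlier (and the same made-up test types as before):

template <class T>
using foo_t = decltype(std::declval<T>().get_foo());

struct with_foo    { int get_foo() const; };
struct without_foo {};

static_assert( is_detected<foo_t, with_foo>::value, "wat");
static_assert(!is_detected<foo_t, without_foo>::value, "wat");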


That covers it for the detection idiom. This is a handy utility for clearing up some hairy metaprogramming, and is even more useful since it can be implemented in old versions of the standard. Let me know if there are any other parts of the code you’d like explained down in the comments or on Twitter.


  1. See here for information on why the typename keyword is needed in some places. 

  2. Have a look here for a better way to do this. 

  3. Due to standards defect CWG1558, implementations needed an extra level of abstraction to work on some compilers. 

  4. The reason we can’t use the same trick is that parameter packs must appear at the end of the template parameter list, so we need to put the void in the middle. 

Learning C++

C++ is one of the most powerful programming languages in the world. However, it can also be one of the most daunting for beginners. There is so much to learn, so many corner cases, so many little languages within the language itself, that mastering it can seem an insurmountable task. Fortunately, it’s not quite as impossible as it looks. This post outlines some resources and advice for learning C++ today. Some of these also will be useful for those who have used C++ in the past and want to get familiar with modern C++, which is like an entirely different language.

If you want to learn C++ from scratch, don’t rely on random online tutorials. Get a good book. One of these. I read the entirety of Accelerated C++ the day before the interview for my current job, and it was enough to get stuck in to real C++ development. It’s a bit outdated now (it doesn’t cover C++11/14/17), but it is easily the most important book in my career development. The Effective C++ series by Scott Meyers should be mandatory reading for C++ developers, particularly Effective Modern C++, which is the go-to introduction to C++11 and 14. If you prefer online resources, use reputable ones like Kate Gregory’s Pluralsight courses rather than whatever Google throws out. There are a lot of terrible C++ tutorials out there which will only harm you. A good rule of thumb: if you see #include <iostream.h>, run1.

However, no matter how much you read, you need to practice in order to improve. I find it helpful to have a mix of small, self-contained problems to try out new techniques, and larger applications to make sure you can use those ideas in a real-world context. Kind of like unit tests and integration tests for your programming skills: can you use these ideas in isolation, and can you fit them together into something comprehensible and maintainable?

Stack Overflow is a great source of skill unit tests. I learned most of my initial metaprogramming knowledge by answering Stack Overflow questions incorrectly and having people tell me why I was wrong. Unfortunately it can be an unforgiving community, and isn’t friendly to very simple questions. Be sure to read through the rules thoroughly before contributing, and feel free to send me an answer you were going to post to get feedback on it if you’re worried!

For skill integration tests, I’d recommend coming up with an application in a domain you’re interested in and trying to develop it from start to finish. If you haven’t done much of this kind of thing before, it can be a daunting task. The main piece of advice I have on this is to ensure your application has a strict scope so that you always know what you’re aiming towards and when you’re finished. Set a number of checkpoints or milestones so that you have explicit tasks to work on and don’t get lost. Some ideas for fun projects:

  • A compiler for a simple C subset
  • A mini text editor
  • Implement some simplified standard library classes, like std::vector
  • A clone of a simple video game like Snake
  • An ELF file manipulation tool

One of the most valuable knowledge sources is a good C++ community to be part of. Find out if your city or a nearby one has a User Group and go along to meet other developers. They might even have a mentorship program or advice channels for people learning the language. Definitely get on to the C++ Slack Channel and lurk, ask questions, join in discussions. Many of the foremost C++ experts in the world hang out there, and it’s also a very friendly and welcoming environment to those learning. There are even dedicated subchannels for those learning or having problems debugging C++.

Don’t be discouraged when you find things difficult, and, trust me, you will. C++ is hard, and even the experts get things wrong, or can’t even agree on what’s right.

If you do get good, pass on the knowledge. Start a blog, give advice, mentor those who need it, talk at conferences. You do not need to be an expert to do these things, and we need more speakers and writers from a diverse set of experience levels, as well as backgrounds, races, genders, and sexualities. The major C++ conferences all have codes of conduct and organisers who care about these issues. Many of them also have student or accessibility tickets available for those who need further assistance to attend. If in doubt, just ask the organisers, or drop me a message and I can get in touch with the relevant people.

Finally, find some good, regular blogs to learn from. Here are some I would recommend:

I hope that this post helps you in your journey to learn C++. It’s a long, but very rewarding path, with a lot of wonderful people to meet, work with, and debate metaprogramming techniques with along the way.


  1. iostream.h rather than just iostream was the name of the header before C++ was standardised. People are still writing tutorials which teach this form and use ancient compilers. 

Writing a Good Tech Cover Letter

I review a lot of job applications at Codeplay and most of them make me sad.

For every application I look at, I make a list of plus points and minus points as I read through the candidate’s documents and code. Invariably the first thing I look at is the cover letter, and I would estimate that 90% of them get an instant minus point in my review feedback. As an alternative to ranting about this on the internet, I’ve written some advice on writing a cover letter for a tech job. All advice is from the viewpoint of a small-medium sized tech company and all, well, you know, just like, my opinion, man.

A good job application answers two questions: why the candidate is right for the company and why the company is right for the candidate. The best place to answer these questions is in your cover letter. In fact, with the best cover letters, I even forget to read the CV. If you can’t answer both of these questions on a single side of A4, you probably don’t know the answers.

Step 1 to writing a good cover letter is to actually write a cover letter. If the job advertisement asks for a cover letter, write a cover letter. If it doesn’t ask for a cover letter, write one anyway! Without a concise, targeted description of why you should be considered for a job, your CV needs to work five times as hard for you, because the reviewer has to extract the necessary information from a list of competencies and experience rather than an explicit argument for why you fit the position.

Do not copy-paste a generic cover letter which you send everywhere. Anyone who has reviewed a bunch of applications can spot one instantly, and it makes it look like you don’t care about the job enough to actually write something targeted to the company and position.

If the job advert lists criteria for applicants, reference it. Show why you fulfil (or, even better, exceed) the expectations and provide evidence. The best cover letters I’ve seen have gone through all of the required and desired skills and provided short descriptions of experience in the area with links to proof of this experience. And, yes, you can do this on one sheet of paper and still have space for a how-do-you-do.

Don’t spout vague buzzwords. Every time I read about a “team player who is eager to learn” my eyes roll back in my head so hard I lose them. If things like being a team player or eager to learn are in the criteria, fine, but (again) provide evidence.

Let your character and enthusiasm show. These don’t come across in a CV, so show me who you are and why I should want to work with you. Of course, this is much more apparent when it actually comes to interviews, but being enthusiastic about the position and giving some colour to your writing can make the difference between actually getting that interview or not.

If the advert asks for links to your work, provide good ones. Of course, if most of your work is confidential and you don’t work on hobby projects then you may not have any; that’s fine, just justify it. But many applicants link to a Stack Overflow profile with two closed questions on std::vector and send a code sample they wrote ten years ago with using namespace std; in all the headers. Don’t.

Spell-check your letter. English might not be your first language, and whoever is reviewing your application should be sensitive to that, but maeking shoor ur letter doesnt look lyk this can be done by your word processor. This advice should be obvious, but you’d be surprised how many applications I review are riddled with typos. Bad spelling looks lazy. You should put effort into your cover letter, and it should show.

Finally, and most importantly, don’t assume you’re not good enough. An understanding of the gaps in your knowledge, an honest desire to learn the field and evidence that you’re trying goes a really long way.

void C::C::C::A::A::A::foo() – valid syntax monstrosity

Here’s an odd bit of C++ syntax for you. Say we have the following class hierarchy:

class A {
public:
    virtual void foo() = 0;
};

class B {
public:
    virtual void foo() = 0;
};

class C : public A, public B {
public:
    virtual void foo();
};

The following definitions are all well-formed:

void C::foo(){
  std::cout << "C";
}
void C::A::foo(){
  std::cout << "A";
}
void C::B::foo(){
  std::cout << "B";
}

The first one defines C::foo, the second defines A::foo and the third defines B::foo. This is valid because of an entity known as the injected-class-name:

A class-name is inserted into the scope in which it is declared immediately after the class-name is seen. The class-name is also inserted into the scope of the class itself; this is known as the injected-class-name. For purposes of access checking, the injected-class-name is treated as if it were a public member name. […]

Since A is a base class of C and the name A is injected into the scope of A, A is also visible from C.

The injected-class-name exists to ensure that the class is found during name lookup instead of entities in an enclosing scope. It also makes referring to the class name inside a class template easier, since the bare name can be used without repeating the template arguments. A quick sketch of that convenience (hypothetical class, not part of the example above):
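
template <typename T>
class node {
    T value;
    node* next; //injected-class-name: equivalent to node<T>*
};

But since we’re awful people, we can use this perfectly reasonable feature to do horribly perverse things like this: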

void C::C::C::A::A::A::foo(){
    std::cout << "A";
}

Yeah, don’t do that.

Writing a Linux Debugger Part 10: Advanced topics

We’re finally here at the last post of the series! This time I’ll be giving a high-level overview of some more advanced concepts in debugging: remote debugging, shared library support, expression evaluation, and multi-threaded support. These ideas are more complex to implement, so I won’t walk through how to do so in detail, but I’m happy to answer questions about these concepts if you have any.


Series index

  1. Setup
  2. Breakpoints
  3. Registers and memory
  4. Elves and dwarves
  5. Source and signals
  6. Source-level stepping
  7. Source-level breakpoints
  8. Stack unwinding
  9. Handling variables
  10. Advanced topics

Remote debugging

Remote debugging is very useful for embedded systems or debugging the effects of environment differences. It also sets a nice divide between the high-level debugger operations and the interaction with the operating system and hardware. In fact, debuggers like GDB and LLDB operate as remote debuggers even when debugging local programs. The general architecture is this:

diagram of the remote debugging architecture described below

The debugger is the component which we interact with through the command line. Maybe if you’re using an IDE there’ll be another layer on top which communicates with the debugger through the machine interface. On the target machine (which may be the same as the host) there will be a debug stub, which in theory is a very small wrapper around the OS debug library which carries out all of your low-level debugging tasks like setting breakpoints on addresses. I say “in theory” because stubs are getting larger and larger these days. The LLDB debug stub on my machine is 7.6MB, for example. The debug stub communicates with the debuggee process using some OS-specific features (in our case, ptrace), and with the debugger through some remote protocol.

The most common remote protocol for debugging is the GDB remote protocol. This is a text-based packet format for communicating commands and information between the debugger and debug stub. I won’t go into detail about it, but you can read all you could want to know about it here. If you launch LLDB and execute the command log enable gdb-remote packets then you’ll get a trace of all packets sent through the remote protocol. On GDB you can write set remotelogfile <file> to do the same.

As a simple example, here’s the packet to set a breakpoint:

$Z0,400570,1#43

$ marks the start of the packet. Z0 is the command to insert a memory breakpoint. 400570 and 1 are the arguments, where the former is the address to set a breakpoint on and the latter is a target-specific breakpoint kind specifier. Finally, the #43 is a checksum to ensure that there was no data corruption.
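
You can compute the checksum yourself: it’s the modulo-256 sum of the payload characters between the $ and the #, printed as two hex digits. A little sketch (make_packet is a hypothetical helper, not part of any tool mentioned here):

#include <cstdint>
#include <cstdio>
#include <string>

//wrap a payload in GDB remote protocol framing: $<payload>#<checksum>
std::string make_packet(const std::string& payload) {
    std::uint8_t sum = 0;
    for (char c : payload) {
        sum += static_cast<std::uint8_t>(c); //mod-256 sum of payload bytes
    }
    char checksum[4];
    std::snprintf(checksum, sizeof(checksum), "#%02x", sum);
    return '$' + payload + checksum;
}

//make_packet("Z0,400570,1") == "$Z0,400570,1#43"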

The GDB remote protocol is very easy to extend for custom packets, which is very useful for implementing platform- or language-specific functionality.


Shared library and dynamic loading support

The debugger needs to know what shared libraries have been loaded by the debuggee so that it can set breakpoints, get source-level information and symbols, etc. As well as finding libraries which have been dynamically linked against, the debugger must track libraries which are loaded at runtime through dlopen. To facilitate this, the dynamic linker maintains a rendezvous structure. This structure maintains a linked list of shared library descriptors, along with a pointer to a function which is called whenever the linked list is updated. This structure is stored where the .dynamic section of the ELF file is loaded, and is initialized before program execution.

A simple tracing algorithm is this:

  • The tracer looks up the entry point of the program in the ELF header (or it could use the auxiliary vector stored in /proc/<pid>/auxv)
  • The tracer places a breakpoint on the entry point of the program and begins execution.
  • When the breakpoint is hit, the address of the rendezvous structure is found by looking up the load address of .dynamic in the ELF file.
  • The rendezvous structure is examined to get the list of currently loaded libraries (see the sketch after this list).
  • A breakpoint is set on the linker update function.
  • Whenever the breakpoint is hit, the list is updated.
  • The tracer infinitely loops, continuing the program and waiting for a signal until the tracee signals that it has exited.
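
To give a flavour of what the rendezvous structure looks like, here’s a minimal in-process sketch (assuming glibc, which declares the structure and the _r_debug symbol in <link.h>). A real tracer would instead locate the structure through the debuggee’s .dynamic section and read each node with PTRACE_PEEKDATA:

#include <iostream>
#include <link.h>

int main() {
    //r_map is the head of a linked list of link_map nodes,
    //one per loaded object
    for (link_map* lm = _r_debug.r_map; lm != nullptr; lm = lm->l_next) {
        const char* name = (lm->l_name && lm->l_name[0])
                               ? lm->l_name : "<main executable>";
        std::cout << name << " loaded at 0x" << std::hex << lm->l_addr << '\n';
    }
}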

I’ve written a small demonstration of these concepts, which you can find here. I can do a more detailed write up of this in the future if anyone is interested.


Expression evaluation

Expression evaluation is a feature which lets users evaluate expressions in the original source language while debugging their application. For example, in LLDB or GDB you could execute print foo() to call the foo function and print the result.

Depending on how complex the expression is, there are a few different ways of evaluating it. If the expression is a simple identifier, then the debugger can look at the debug information, locate the variable and print out the value, just like we did in the last part of this series. If the expression is a bit more complex, then it may be possible to compile the code to an intermediate representation (IR) and interpret that to get the result. For example, for some expressions LLDB will use Clang to compile the expression to LLVM IR and interpret that. If the expression is even more complex, or requires calling some function, then the code might need to be JITted to the target and executed in the address space of the debuggee. This involves calling mmap to allocate some executable memory, then the compiled code is copied to this block and is executed. LLDB does this by using LLVM’s JIT functionality.
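
To make the mmap step concrete, here’s a toy sketch of JITted execution, hard-coding a few bytes of x86-64 machine code and running them in the current process (a real debugger would instead write the compiled code into the debuggee’s address space via ptrace and point the instruction pointer at it):

#include <cstring>
#include <iostream>
#include <sys/mman.h>

int main() {
    //machine code for: mov eax, 42 ; ret
    unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

    //allocate a writable page, copy the code in, then flip it to executable
    void* mem = mmap(nullptr, sizeof(code), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 1;
    std::memcpy(mem, code, sizeof(code));
    mprotect(mem, sizeof(code), PROT_READ | PROT_EXEC);

    auto jitted = reinterpret_cast<int (*)()>(mem);
    std::cout << jitted() << '\n'; //prints 42
    munmap(mem, sizeof(code));
}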

If you want to know more about JIT compilation, I’d highly recommend Eli Bendersky’s posts on the subject.


Multi-threaded debugging support

The debugger shown in this series only supports single threaded applications, but to debug most real-world applications, multi-threaded support is highly desirable. The simplest way to support this is to trace thread creation and parse the procfs to get the information you want.

The Linux threading library is called pthreads. When pthread_create is called, the library creates a new thread using the clone syscall, and we can trace this syscall with ptrace (assuming your kernel is 2.5.46 or newer, which is when PTRACE_O_TRACECLONE was introduced). To do this, you’ll need to set some ptrace options after attaching to the debuggee:

ptrace(PTRACE_SETOPTIONS, m_pid, nullptr, PTRACE_O_TRACECLONE);

Now when clone is called, the process will be signaled with our old friend SIGTRAP. For the debugger in this series, you can add a case to handle_sigtrap which can handle the creation of the new thread:

case (SIGTRAP | (PTRACE_EVENT_CLONE << 8)): {
    //get the new thread ID
    unsigned long event_message = 0;
    ptrace(PTRACE_GETEVENTMSG, m_pid, nullptr, &event_message);

    //handle creation
    //...
    break;
}

Once you’ve got that, you can look in /proc/<pid>/task/ and read the memory maps and suchlike to get all the information you need.
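
As a sketch, enumerating the threads of a process is just a directory walk (each subdirectory of /proc/<pid>/task is named after a thread ID):

#include <filesystem>
#include <iostream>
#include <string>
#include <sys/types.h>

//print the IDs of all threads in the given process; a debugger would
//go on to attach to each of these with ptrace
void list_threads(pid_t pid) {
    const std::string task_dir = "/proc/" + std::to_string(pid) + "/task";
    for (const auto& entry : std::filesystem::directory_iterator{task_dir}) {
        std::cout << "thread " << entry.path().filename().string() << '\n';
    }
}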

GDB uses libthread_db, which provides a bunch of helper functions so that you don’t need to do all the parsing and processing yourself. Setting up this library is pretty weird and I won’t show how it works here, but you can go and read this tutorial if you’d like to use it.

The most complex part of multithreaded support is modelling the thread state in the debugger, particularly if you want to support non-stop mode or some kind of heterogeneous debugging where you have more than just a CPU involved in your computation.


The end!

Whew! This series took a long time to write, but I learned a lot in the process and I hope it was helpful. Get in touch on Twitter @TartanLlama or in the comments section if you want to chat about debugging or have any questions about the series. If there are any other debugging topics you’d like to see covered then let me know and I might do a bonus post.

Writing a Linux Debugger Part 9: Handling variables

Variables are sneaky. At one moment they’ll be happily sitting in registers, but as soon as you turn your head they’re spilled to the stack. Maybe the compiler completely throws them out of the window for the sake of optimization. Regardless of how often variables move around in memory, we need some way to track and manipulate them in our debugger. This post will teach you more about handling variables in your debugger and demonstrate a simple implementation using libelfin.


Series index

  1. Setup
  2. Breakpoints
  3. Registers and memory
  4. Elves and dwarves
  5. Source and signals
  6. Source-level stepping
  7. Source-level breakpoints
  8. Stack unwinding
  9. Handling variables
  10. Advanced topics

Before you get started, make sure that the version of libelfin you are using is the fbreg branch of my fork. This contains some hacks to support getting the base of the current stack frame and evaluating location lists, neither of which is supported by vanilla libelfin. You might need to pass -gdwarf-2 to GCC to get it to generate compatible DWARF information. Before we get into the implementation, I’ll give a more detailed description of how locations are encoded in DWARF 5, which is the most recent specification. If you want more information than what I write here, you can grab the standard from here.

DWARF locations

The location of a variable in memory at a given moment is encoded in the DWARF information using the DW_AT_location attribute. Location descriptions can be either simple location descriptions, composite location descriptions, or location lists.

  • Simple location descriptions describe the location of one contiguous piece (usually all) of an object. A simple location description may describe a location in addressable memory, or in a register, or the lack of a location (with or without a known value).
    • Example:
      • DW_OP_fbreg -32
      • A variable which is entirely stored -32 bytes from the stack frame base
  • Composite location descriptions describe an object in terms of pieces, each of which may be contained in part of a register or stored in a memory location unrelated to other pieces.
    • Example:
      • DW_OP_reg3 DW_OP_piece 4 DW_OP_reg10 DW_OP_piece 2
      • A variable whose first four bytes reside in register 3 and whose next two bytes reside in register 10.
  • Location lists describe objects which have a limited lifetime or change location during their lifetime.
    • Example:
      • <loclist with 3 entries follows>
        • [ 0]<lowpc=0x2e00><highpc=0x2e19>DW_OP_reg0
        • [ 1]<lowpc=0x2e19><highpc=0x2e3f>DW_OP_reg3
        • [ 2]<lowpc=0x2ec4><highpc=0x2ec7>DW_OP_reg2
      • A variable whose location moves between registers depending on the current value of the program counter

The DW_AT_location is encoded in one of three different ways, depending on the kind of location description. exprlocs encode simple and composite location descriptions. They consist of a byte length followed by a DWARF expression or location description. loclists and loclistptrs encode location lists. They give indexes or offsets into the .debug_loclists section, which describes the actual location lists.

DWARF Expressions

The actual location of the variables is computed using DWARF expressions. These consist of a series of operations which operate on a stack of values. There are an impressive number of DWARF operations available, so I won’t explain them all in detail. Instead I’ll give a few examples from each class of expression to give you a taste of what is available. Also, don’t get scared off by these; libelfin will take care of all of this complexity for us.

  • Literal encodings
    • DW_OP_lit0, DW_OP_lit1, …, DW_OP_lit31
      • Push the literal value on to the stack
    • DW_OP_addr <addr>
      • Pushes the address operand on to the stack
    • DW_OP_constu <unsigned>
      • Pushes the unsigned value on to the stack
  • Register values
    • DW_OP_fbreg <offset>
      • Pushes the address of the base of the stack frame plus the given offset (i.e. where the variable lives on the stack)
    • DW_OP_breg0, DW_OP_breg1, …, DW_OP_breg31 <offset>
      • Pushes the contents of the given register plus the given offset to the stack
  • Stack operations
    • DW_OP_dup
      • Duplicate the value at the top of the stack
    • DW_OP_deref
      • Treats the top of the stack as a memory address, and replaces it with the contents of that address
  • Arithmetic and logical operations
    • DW_OP_and
      • Pops the top two values from the stack and pushes back the logical AND of them
    • DW_OP_plus
      • Same as DW_OP_and, but adds the values
  • Control flow operations
    • DW_OP_le, DW_OP_eq, DW_OP_gt, etc.
      • Pops the top two values, compares them, and pushes 1 if the condition is true and 0 otherwise
    • DW_OP_bra <offset>
      • Conditional branch: if the top of the stack is not 0, skips back or forward in the expression by offset
  • Type conversions
    • DW_OP_convert <DIE offset>
      • Converts value on the top of the stack to a different type, which is described by the DWARF information entry at the given offset
  • Special operations
    • DW_OP_nop
      • Do nothing!
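
To get a feel for the stack machine, here’s a toy evaluator for a handful of these operations (the opcode values are the real DWARF encodings, but this is purely for intuition; libelfin has the full implementation):

#include <cstdint>
#include <stdexcept>
#include <vector>

enum class op : std::uint8_t {
    lit1 = 0x31, //DW_OP_lit1
    lit2 = 0x32, //DW_OP_lit2
    dup  = 0x12, //DW_OP_dup
    plus = 0x22  //DW_OP_plus
};

std::uint64_t evaluate(const std::vector<op>& expr) {
    std::vector<std::uint64_t> stack;
    for (auto o : expr) {
        switch (o) {
        case op::lit1: stack.push_back(1); break;
        case op::lit2: stack.push_back(2); break;
        case op::dup:  stack.push_back(stack.back()); break;
        case op::plus: {
            auto rhs = stack.back();
            stack.pop_back();
            stack.back() += rhs; //pop two values, push their sum
            break;
        }
        default: throw std::runtime_error{"unhandled opcode"};
        }
    }
    return stack.back(); //the result is whatever ends up on top
}

//evaluate({op::lit2, op::dup, op::plus}) == 4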

DWARF types

DWARF’s representation of types needs to be strong enough to give debugger users useful variable representations. Users most often want to be able to debug at the level of their application rather than at the level of their machine, and they need a good idea of what their variables are doing to achieve that.

DWARF types are encoded in DIEs along with the majority of the other debug information. They can have attributes to indicate their name, encoding, size, endianness, etc. A myriad of type tags are available to express pointers, arrays, structures, typedefs, anything else you could see in a C or C++ program.

Take this simple structure as an example:

struct test{
    int i;
    float j;
    int k[42];
    test* next;
};

The parent DIE for this struct is this:

< 1><0x0000002a>    DW_TAG_structure_type
                      DW_AT_name                  "test"
                      DW_AT_byte_size             0x000000b8
                      DW_AT_decl_file             0x00000001 test.cpp
                      DW_AT_decl_line             0x00000001

The above says that we have a structure called test of size 0xb8, declared at line 1 of test.cpp. There are then many child DIEs which describe the members.

< 2><0x00000032>      DW_TAG_member
                        DW_AT_name                  "i"
                        DW_AT_type                  <0x00000063>
                        DW_AT_decl_file             0x00000001 test.cpp
                        DW_AT_decl_line             0x00000002
                        DW_AT_data_member_location  0
< 2><0x0000003e>      DW_TAG_member
                        DW_AT_name                  "j"
                        DW_AT_type                  <0x0000006a>
                        DW_AT_decl_file             0x00000001 test.cpp
                        DW_AT_decl_line             0x00000003
                        DW_AT_data_member_location  4
< 2><0x0000004a>      DW_TAG_member
                        DW_AT_name                  "k"
                        DW_AT_type                  <0x00000071>
                        DW_AT_decl_file             0x00000001 test.cpp
                        DW_AT_decl_line             0x00000004
                        DW_AT_data_member_location  8
< 2><0x00000056>      DW_TAG_member
                        DW_AT_name                  "next"
                        DW_AT_type                  <0x00000084>
                        DW_AT_decl_file             0x00000001 test.cpp
                        DW_AT_decl_line             0x00000005
                        DW_AT_data_member_location  176(as signed = -80)

Each member has a name, a type (which is a DIE offset), a declaration file and line, and a byte offset into the structure where the member is located. The types which are pointed to come next.

< 1><0x00000063>    DW_TAG_base_type
                      DW_AT_name                  "int"
                      DW_AT_encoding              DW_ATE_signed
                      DW_AT_byte_size             0x00000004
< 1><0x0000006a>    DW_TAG_base_type
                      DW_AT_name                  "float"
                      DW_AT_encoding              DW_ATE_float
                      DW_AT_byte_size             0x00000004
< 1><0x00000071>    DW_TAG_array_type
                      DW_AT_type                  <0x00000063>
< 2><0x00000076>      DW_TAG_subrange_type
                        DW_AT_type                  <0x0000007d>
                        DW_AT_count                 0x0000002a
< 1><0x0000007d>    DW_TAG_base_type
                      DW_AT_name                  "sizetype"
                      DW_AT_byte_size             0x00000008
                      DW_AT_encoding              DW_ATE_unsigned
< 1><0x00000084>    DW_TAG_pointer_type
                      DW_AT_type                  <0x0000002a>

As you can see, int on my laptop is a 4-byte signed integer type, and float is a 4-byte float. The integer array type is defined by pointing to the int type as its element type, a sizetype (think size_t) as the index type, with 0x2a (42) elements. The test* type is a DW_TAG_pointer_type which references the test DIE.
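
(If you want to explore dumps like these yourself, tools such as dwarfdump or readelf --debug-dump=info will produce them from any binary compiled with -g.)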


Implementing a simple variable reader

As mentioned, libelfin will deal with most of the complexity for us. However, it doesn’t implement all of the different methods for representing variable locations, and handling a lot of them in our code would get pretty complex. As such, I’ve chosen to only support exprlocs for now. Feel free to add support for more types of expression. If you’re really feeling brave, submit some patches to libelfin to help complete the necessary support!

Handling variables is mostly a matter of locating the different parts of an object in memory or registers; reading and writing them is then much the same as you’ve seen before. I’ll only show you how to implement reading for the sake of simplicity.

First we need to tell libelfin how to read registers from our process. We do this by creating a class which inherits from expr_context and uses ptrace to handle everything:

class ptrace_expr_context : public dwarf::expr_context {
public:
    ptrace_expr_context (pid_t pid) : m_pid{pid} {}

    dwarf::taddr reg (unsigned regnum) override {
        return get_register_value_from_dwarf_register(m_pid, regnum);
    }

    dwarf::taddr pc() override {
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, m_pid, nullptr, &regs);
        return regs.rip;
    }

    dwarf::taddr deref_size (dwarf::taddr address, unsigned size) override {
        //TODO take into account size
        return ptrace(PTRACE_PEEKDATA, m_pid, address, nullptr);
    }

private:
    pid_t m_pid;
};

The reading will be handled by a read_variables function in our debugger class:

void debugger::read_variables() {
    using namespace dwarf;

    auto func = get_function_from_pc(get_pc());

    //...
}

The first thing we do above is find the function which we’re currently in. Then we need to loop through the entries in that function, looking for variables:

    for (const auto& die : func) {
        if (die.tag == DW_TAG::variable) {
            //...
        }
    }

We get the location information by looking up the DW_AT_location entry in the DIE:

            auto loc_val = die[DW_AT::location];

Then we ensure that it’s an exprloc and ask libelfin to evaluate the expression for us:

            if (loc_val.get_type() == value::type::exprloc) {
                ptrace_expr_context context {m_pid};
                auto result = loc_val.as_exprloc().evaluate(&context);

Now that we’ve evaluated the expression, we need to read the contents of the variable. It could be in memory or a register, so we’ll handle both cases:

                switch (result.location_type) {
                case expr_result::type::address:
                {
                    auto value = read_memory(result.value);
                    std::cout << at_name(die) << " (0x" << std::hex << result.value << ") = "
                              << value << std::endl;
                    break;
                }

                case expr_result::type::reg:
                {
                    auto value = get_register_value_from_dwarf_register(m_pid, result.value);
                    std::cout << at_name(die) << " (reg " << result.value << ") = "
                              << value << std::endl;
                    break;
                }

                default:
                    throw std::runtime_error{"Unhandled variable location"};
                }

As you can see I’ve simply printed out the value without interpreting it based on the type of the variable. Hopefully from this code you can see how you could support writing variables, or searching for variables with a given name.

Finally we can add this to our command parser:

    else if(is_prefix(command, "variables")) {
        read_variables();
    }

Testing it out

Write a few small functions which have some variables, compile them without optimizations and with debug info, then see if you can read the values of your variables. Try writing to the memory address where a variable is stored and see the behaviour of the program change.
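
For example, a minimal test program might look like this (hypothetical file variables.cpp, compiled with something like g++ -O0 -g -gdwarf-2 variables.cpp):

int main() {
    long a = 3;
    long b = 2;
    long c = a + b;
    a = 4; //break around here and run the 'variables' command
    return static_cast<int>(c);
}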


Nine posts down, one to go! Next time I’ll be talking about some more advanced concepts which might interest you. For now you can find the code for this post here.

C++17 attributes – maybe_unused, fallthrough and nodiscard

C++17 adds three new attributes for programmers to better express their intent to the compiler and readers of the code: maybe_unused, fallthrough, and nodiscard. This is a quick post to outline what they do and why they are useful.

In case you aren’t familiar with the concept, attributes in C++ allow you to mark functions, variables, and other entities with compiler-specific or standard properties. The standard attributes prior to C++17 are noreturn (function doesn’t return), deprecated (entity is going to be removed in a later version) and carries_dependency (used for optimizing atomic operations). You mark an entity with an attribute like this:

[[deprecated]]
void do_thing_the_old_way();

void do_thing_the_new_way();

do_thing_the_old_way(); // Compiler warning: deprecated
do_thing_the_new_way(); // No warning

With that out of the way, on to the new attributes!


maybe_unused

[[maybe_unused]] suppresses compiler warnings about unused entities. Usually unused functions or variables indicate a programmer error – if you never use it, why is it there? – but sometimes they can be intentional, such as variables which are only used in debug builds, or functions which are only called when logging is enabled.

Variables

void emits_warning(bool b) {
     bool result = get_result();
     // Warning emitted in release mode
     assert (b == result);
     //...
}

void warning_suppressed(bool b [[maybe_unused]]) {
     [[maybe_unused]] bool result = get_result();
     // Warning suppressed by [[maybe_unused]]
     assert (b == result);
     //...
}

Functions

static void log_with_warning(){}
[[maybe_unused]] static void log_without_warning() {}

#ifdef LOGGING_ENABLED
log_with_warning();    // Warning emitted if LOGGING_ENABLED is not defined
log_without_warning(); // Warning suppressed by [[maybe_unused]]
#endif

I feel like this attribute is more useful for suppressing compiler warnings than for documenting your code for others, but at least we now have a standard way to do so.


fallthrough

[[fallthrough]] indicates that a fallthrough in a switch statement is intentional. Missing a break or return in a switch case is a very common programmer error, so compilers usually warn about it, but sometimes a fallthrough can result in some very terse code.

Say we want to process an alert message. If it’s green, we do nothing; if it’s yellow, we record the alert; if it’s orange we record and trigger the alarm; if it’s red, we record, trigger the alarm and evacuate.

void process_alert (Alert alert) {
     switch (alert) {
     case Alert::Red:
         evacuate();
         // Warning: this statement may fall through

     case Alert::Orange:
         trigger_alarm();
         [[fallthrough]]; //yes, you do need the semicolon
         // Warning suppressed by [[fallthrough]]

     case Alert::Yellow:
         record_alert();
         return;

     case Alert::Green:
         return;
     }
}

The most important function of fallthrough is as documentation for maintainers. Its presence in the code above shows anyone looking at the code that an orange alert is absolutely supposed to be recorded. Without the fallthrough annotation in the Alert::Red case, it is not obvious whether a red alert is supposed to trigger the alarm and be recorded, or just evacuate everyone.


nodiscard

Functions declared with [[nodiscard]] should not have their return values ignored by the caller. This can be useful if you want to ensure that callers check a return value, or that some scope guard object has a reasonable lifetime. Types can also be marked [[nodiscard]], which implicitly applies the attribute to every function which returns that type.

Functions

[[nodiscard]] error do_something (thing&);
error do_something_else (thing&);

do_something(my_thing); // Warning: ignored return value
do_something_else(my_thing);

Classes

[[nodiscard]]
struct lock_guard;

lock_guard lock (mutex& m);

{
    lock(my_mutex); // Warning: ignored return value
    // critical section
}

Those warnings will help the user notice that my_thing may be left in a bad state when it is later passed to do_something_else (the error from do_something was silently dropped), or that the critical section won’t actually be locked (the returned lock_guard is destroyed immediately).


Compilers have shipped non-standard extensions to express these concepts for years, but it’s great that we now have standard methods to do the same. Leave a comment if you have ideas for other attributes you would like to be added to the language!

For some more examples of using these attributes, see posts by Kenneth Benzie, Bartłomiej Filipek, and Arne Mertz.

Metaclasses for embedded domain specific languages

Metaclasses are a proposed C++ feature which would allow the creation of abstractions over classes and the extension of the language’s type definition system. As an example, consider this definition of IShape:

class IShape {
public:
 virtual int area() const =0;
 virtual void scale_by(double factor) =0;
 // ... etc.
 virtual ~IShape() noexcept { };
};

This class is obviously defining an interface of some kind: all of the member functions are public, pure virtual, and it has a virtual destructor. Other classes will need to be added which inherit from this interface for it to be useful. But those properties I listed are really just boilerplate; it’s just noise that we need to type to get the compiler to do The Right Thing™. It would be a lot clearer and less error-prone if we could push the boilerplate aside and just focus on the details which are important to us. This is the essence of metaclasses. If there were a metaclass called interface, you could write something like this:

interface IShape {
 int area() const;
 void scale_by(double factor);
};

I’m sure you’d agree this is a huge improvement over the last sample. The magic happens in the definition for the interface metaclass, which could look like this:

$class interface {
  ~interface() noexcept { }
  constexpr {
    compiler.require($interface.variables().empty(),
                     "interfaces may not contain data");
    for (auto f : $interface.functions()) {
      compiler.require(!f.is_copy() && !f.is_move(),
                       "interfaces may not copy or move; consider a"
                       " virtual clone() instead");
      if (!f.has_access()) f.make_public();
      compiler.require(f.is_public(),
                       "interface functions must be public");
      f.make_pure_virtual();
    }
  }
};

The ~interface() noexcept { } defines a destructor for instances of the metaclass; this is essentially a default implementation so that you don’t need to type one yourself. Inside the constexpr block we have code to ensure that there are no data members and no copy or move constructors, and to change each function to be virtual and public. Be sure to read over this code a few times to really understand what it’s doing. The first time I saw this code my mind was well and truly blown.

But enough of stealing examples from the paper. I’ve been interested in metaclasses since seeing Herb talk about them at ACCU 2017, and since then I’ve always been looking for places where they could make C++ programming safer, more elegant, more maintainable. In this post I’ll talk about how they can aid creation of embedded domain specific languages.


Embedded what?

Before I get into the meat of what I want to discuss, I have some terms to define.

A domain specific language (DSL) is a language which is specifically designed for some problem area. C++ is a general-purpose language because it can be used for solving any kind of programming problem. SQL is a domain-specific language because it is designed only for reading and manipulating databases.

An embedded DSL (EDSL) is a domain specific language hosted inside a more general language. For example, Boost.Spirit is a parser library which embeds a language for parsing code into C++ by using expression templates and operator overloading.

My case study

The example I’m going to be using is from my ETKF keyboard firmware library. A keyboard definition for ETKF looks like this:

struct my_keyboard {
    using rows = pin_set<b4, b6, f1, f0>;
    using columns = pin_set<f6, f7, c7, c6, d3, d2, d1, d0, b7, b3, b2, b1, b0>;

    using key_positions = typelist<
        row<1,1,1,1,1,1,1,1,1,1,1,1,1>,
        row<1,1,1,1,1,1,1,1,1,1,1,0,1>,
        row<1,0,1,1,1,1,1,1,1,1,1,1,1>,
        row<1,1,1,0,1,1,0,1,1,1,1,1,1>
    >;

    using layouts = typelist<
        row<tab, quot,comm,dot, p,   y,   f,   g,   c,   r,   l,   bspc, null>,
        row<lctl,a,   o,   e,   u,   i,   d,   h,   t,   n,   s,   ent>,
        row<lsft,scln,q,   j,   k,   x,   b,   m,   w,   v,   z,   del>,
        row<caps,lgui,esc,lalt, spc,      null,null,left,down,up,  righ>
    >;
};

This is an EDSL which is hosted inside the C++ template system. It is very declarative in nature; the class declares how the switches are wired up, which parts of the switch matrix have holes in them, and how the keymap should look, but it doesn’t say how the firmware should carry out any computations. We are declaring our problem at a high level of abstraction and the implementation of the library is in charge of mapping this abstraction onto real operations. There are no metaclasses in this example; the implementation is all in C++17. But what could metaclasses add to this?

The first improvement is as embarrassingly simple as it is expressive: changing struct my_keyboard to keyboard my_keyboard. Metaclasses allow us to express not only the name of our class, but also what kind of entity it is. If we had a bunch of declarations like this kicking around – maybe some define processors, some define hardware pins – having a strong indication of the classes’ nature right in the declaration is powerful. my_keyboard is no longer a struct, it is no longer a class, it is a keyboard.

The second is in compiler diagnostics. When I was writing ETKF I wanted to walk the line between carrying out crazy compile-time optimisations and having a clean interface which even non-programmers could use. “This’ll be great”, I thought naively, “I’ll just have a ton of static_asserts which run over the templates before generating the firmware code”. This is indeed possible, but have a look at the resulting compiler errors:

In file included from /home/simon/etkf/src/main.cpp:17:0:
/home/simon/etkf/include/validations.hpp: In instantiation of 'void validate_layout(etkf::typelist<Rows ...>) [with Kbd = test_keyboard; Rows = {etkf::row<(etkf::keys::key)43, (etkf::keys::key)52, (etkf::keys::key)54, (etkf::keys::key)55, (etkf::keys::key)19, (etkf::keys::key)28, (etkf::keys::key)9, (etkf::keys::key)10, (etkf::keys::key)6, (etkf::keys::key)21, (etkf::keys::key)15, (etkf::keys::key)42, (etkf::keys::key)111>, etkf::row<(etkf::keys::key)100, (etkf::keys::key)4, (etkf::keys::key)18, (etkf::keys::key)8, (etkf::keys::key)24, (etkf::keys::key)12, (etkf::keys::key)7, (etkf::keys::key)11, (etkf::keys::key)23, (etkf::keys::key)17, (etkf::keys::key)22, (etkf::keys::key)40>, etkf::row<(etkf::keys::key)57, (etkf::keys::key)103, (etkf::keys::key)41, (etkf::keys::key)102, (etkf::keys::key)44, (etkf::keys::key)109, (etkf::keys::key)111, (etkf::keys::key)80, (etkf::keys::key)81, (etkf::keys::key)82, (etkf::keys::key)79>}]':
/home/simon/etkf/include/validations.hpp:14:26:   required from 'void validate_layouts(etkf::typelist<Rows ...>) [with Kbd = test_keyboard; Layouts = {etkf::typelist<etkf::row<(etkf::keys::key)43, (etkf::keys::key)52, (etkf::keys::key)54, (etkf::keys::key)55, (etkf::keys::key)19, (etkf::keys::key)28, (etkf::keys::key)9, (etkf::keys::key)10, (etkf::keys::key)6, (etkf::keys::key)21, (etkf::keys::key)15, (etkf::keys::key)42, (etkf::keys::key)111>, etkf::row<(etkf::keys::key)100, (etkf::keys::key)4, (etkf::keys::key)18, (etkf::keys::key)8, (etkf::keys::key)24, (etkf::keys::key)12, (etkf::keys::key)7, (etkf::keys::key)11, (etkf::keys::key)23, (etkf::keys::key)17, (etkf::keys::key)22, (etkf::keys::key)40>, etkf::row<(etkf::keys::key)57, (etkf::keys::key)103, (etkf::keys::key)41, (etkf::keys::key)102, (etkf::keys::key)44, (etkf::keys::key)109, (etkf::keys::key)111, (etkf::keys::key)80, (etkf::keys::key)81, (etkf::keys::key)82, (etkf::keys::key)79> >, etkf::typelist<etkf::row<(etkf::keys::key)43, (etkf::keys::key)30, (etkf::keys::key)31, (etkf::keys::key)32, (etkf::keys::key)33, (etkf::keys::key)34, (etkf::keys::key)35, (etkf::keys::key)36, (etkf::keys::key)37, (etkf::keys::key)38, (etkf::keys::key)39, (etkf::keys::key)42, (etkf::keys::key)111>, etkf::row<(etkf::keys::key)100, (etkf::keys::key)47, (etkf::keys::key)48, (etkf::keys::key)56, (etkf::keys::key)46, (etkf::keys::key)45, (etkf::keys::key)50, (etkf::keys::key)49, (etkf::keys::key)53, (etkf::keys::key)17, (etkf::keys::key)22, (etkf::keys::key)40>, etkf::row<(etkf::keys::key)101, (etkf::keys::key)58, (etkf::keys::key)59, (etkf::keys::key)60, (etkf::keys::key)61, (etkf::keys::key)62, (etkf::keys::key)63, (etkf::keys::key)64, (etkf::keys::key)65, (etkf::keys::key)66, (etkf::keys::key)67, (etkf::keys::key)76>, etkf::row<(etkf::keys::key)57, (etkf::keys::key)103, (etkf::keys::key)41, (etkf::keys::key)102, (etkf::keys::key)44, (etkf::keys::key)109, (etkf::keys::key)111, (etkf::keys::key)80, (etkf::keys::key)81, (etkf::keys::key)82, (etkf::keys::key)79> >, etkf::typelist<etkf::row<(etkf::keys::key)43, (etkf::keys::key)20, (etkf::keys::key)26, (etkf::keys::key)8, (etkf::keys::key)21, (etkf::keys::key)23, (etkf::keys::key)28, (etkf::keys::key)24, (etkf::keys::key)12, (etkf::keys::key)18, (etkf::keys::key)19, (etkf::keys::key)42, (etkf::keys::key)111>, etkf::row<(etkf::keys::key)100, (etkf::keys::key)4, (etkf::keys::key)22, (etkf::keys::key)7, (etkf::keys::key)9, (etkf::keys::key)10, (etkf::keys::key)11, (etkf::keys::key)13, (etkf::keys::key)14, (etkf::keys::key)15, (etkf::keys::key)51, (etkf::keys::key)40>, etkf::row<(etkf::keys::key)101, (etkf::keys::key)29, (etkf::keys::key)27, (etkf::keys::key)6, (etkf::keys::key)25, (etkf::keys::key)5, (etkf::keys::key)17, (etkf::keys::key)16, (etkf::keys::key)54, (etkf::keys::key)55, (etkf::keys::key)56, (etkf::keys::key)76>, etkf::row<(etkf::keys::key)57, (etkf::keys::key)103, (etkf::keys::key)41, (etkf::keys::key)102, (etkf::keys::key)44, (etkf::keys::key)109, (etkf::keys::key)111, (etkf::keys::key)80, (etkf::keys::key)81, (etkf::keys::key)82, (etkf::keys::key)79> >}]'
/home/simon/etkf/include/validations.hpp:19:26:   required from 'void validate_keyboard() [with Kbd = test_keyboard]'
/home/simon/etkf/src/main.cpp:267:44:   required from here
/home/simon/etkf/include/validations.hpp:8:5: error: static assertion failed: A layout has the wrong number of rows
     static_assert(sizeof...(Rows) == variadic_size<typename Kbd::rows>::value,

Yeah.

I got my wish of a nice little assert message at the bottom which tells you exactly what you did wrong, but if you’re not a seasoned C++ developer you don’t get that far. You run. Fast.

However, with the help of metaclasses and requires clauses from concepts, we could write something like this1:

$class keyboard {
    constexpr {
        compiler.require(requires(keyboard kbd) { keyboard::layouts{}; },
                         "must have type 'layouts'");
        compiler.require(requires(keyboard kbd) { keyboard::rows{}; },
                         "must have type 'rows'");
        compiler.require(requires(keyboard kbd) { keyboard::columns{}; },
                         "must have type 'columns'");
        compiler.require(requires(keyboard kbd) { keyboard::key_positions{}; },
                         "must have type 'key_positions'");

        if (!all_types<keyboard::key_positions> ([](auto type){
                return variadic_size<decltype(type)::type> == variadic_size<keyboard::columns>;
            })) {
            compiler.error("Each row of 'key_positions' must be the same size as 'columns'");
            compiler.error("Check the definition of 'key_positions' here",
                           $(keyboard::key_positions).source_location());
            compiler.error("Check the definition of 'columns' here",
                           $(keyboard::columns).source_location());
        }
    }
}

Now the validations are run as part of generating a class from the metaclass, and the diagnostics should be placed depending on the source location which is given to the compiler.error calls. With sufficiently fine-grained control over the placement of diagnostics, all error messages can be emitted at the abstraction level of the EDSL rather than having C++ template guff injecting itself into the party.

Code to generate firmware from the high-level descriptions can now also be placed in the implementation of the keyboard metaclass, so that executing the firmware is carried out by calling my_keyboard::run_firmware():

$class keyboard {
    //validations as above
    
    template <class Layouts>
    static pressed_keys scan_matrix() {
        //...
    }
    
    static void run_firmware() {
        while (true) {
            auto pressed_keys = scan_matrix<keyboard::layouts>();
            //...
        }
    }
}

The above also somewhat addresses the problem of cohesion and code locality. In my C++17 ETKF implementation, the validations which are run over the keyboard descriptions are quite separate from the code which generates the firmware from the template declarations. But really, these are both part of the abstraction which I’m trying to express in the interface. Metaclasses provide a means to tie together the constraints on the declaration as well as the code which lowers the EDSL into normal C++ land.


That’s it for my contribution to the metaclass hype train. Maybe I’ll write some more posts as I come up with more ideas, but I’m particularly interested in exploring the design space for declarative EDSLs in C++. Templates are a powerful host for other languages, and metaclasses only make them more so.


  1. There’s not an implementation fully-featured enough to compile this yet, but it seems to be in line with the paper’s definition. 

Writing a Linux Debugger Part 8: Stack unwinding

Sometimes the most important thing to know about your current program state is how it got there. This is typically provided by a backtrace command, which gives you the chain of function calls which led to where the program is right now. This post will show you how to implement stack unwinding on x86_64 to generate such a backtrace.


Series index

  1. Setup
  2. Breakpoints
  3. Registers and memory
  4. Elves and dwarves
  5. Source and signals
  6. Source-level stepping
  7. Source-level breakpoints
  8. Stack unwinding
  9. Handling variables
  10. Advanced topics

Take the following program as an example:

void a() {
    //stopped here
}

void b() {
     a();
}

void c() {
     a();
}

int main() {
    b();
    c();
}

If the debugger is stopped at the //stopped here line, there are two ways it could have got there: main->b->a or main->c->a. If we set a breakpoint there with LLDB, continue, and ask for a backtrace, then we get the following:

* frame #0: 0x00000000004004da a.out`a() + 4 at bt.cpp:3
  frame #1: 0x00000000004004e6 a.out`b() + 9 at bt.cpp:6
  frame #2: 0x00000000004004fe a.out`main + 9 at bt.cpp:14
  frame #3: 0x00007ffff7a2e830 libc.so.6`__libc_start_main + 240 at libc-start.c:291
  frame #4: 0x0000000000400409 a.out`_start + 41

This says that we are currently in function a, which we got to from function b, which we got to from main and so on. Those final two frames are just how the runtime has bootstrapped the main function.

The question now is how we implement this on x86_64. The most robust way to do this is to parse the .eh_frame section of the ELF file and work out how to unwind the stack from there, but this is a pain. You could use libunwind or something similar to do it for you, but that’s boring. Instead, we’ll assume that the compiler has laid out the stack in a certain way and we’ll just walk it manually. In order to do this, we first need to understand how the stack is laid out.

            High
        |   ...   |
        +---------+
     +24|  Arg 1  |
        +---------+
     +16|  Arg 2  |
        +---------+
     + 8| Return  |
        +---------+
RBP+--> |Saved RBP|
        +---------+
     - 8|  Var 1  |
        +---------+
RSP+--> |  Var 2  |
        +---------+
        |   ...   |
            Low

As you can see, the frame pointer for the previous stack frame is stored at the start of the current stack frame, creating a linked list of frame pointers. The stack is unwound by following this linked list. We can find out which function the next frame in the list belongs to by looking up the return address in the DWARF info. Some compilers will omit tracking the frame base with RBP, since this can be represented as an offset from RSP and it frees up an extra register. Passing -fno-omit-frame-pointer to GCC or Clang should force it to follow the convention we’re relying on, even when optimisations are enabled.

We’ll do all our work in a print_backtrace function:

void debugger::print_backtrace() {

Something to decide early is what format to print out the frame information in. I used a little lambda to push this out of the way:

    auto output_frame = [frame_number = 0] (auto&& func) mutable {
        std::cout << "frame #" << frame_number++ << ": 0x" << dwarf::at_low_pc(func)
                  << ' ' << dwarf::at_name(func) << std::endl;
    };

The first frame to print out will be the one which is currently being executed. We can get the information for this frame by looking up the current program counter in the DWARF:

    auto current_func = get_function_from_pc(get_pc());
    output_frame(current_func);

Next we need to get the frame pointer and return address for the current function. The frame pointer is stored in the rbp register, and the return address is 8 bytes up the stack from the frame pointer.

    auto frame_pointer = get_register_value(m_pid, reg::rbp);
    auto return_address = read_memory(frame_pointer+8);

Now we have all the information we need to unwind the stack. I’m just going to keep unwinding until the debugger hits main, but you could also choose to stop when the frame pointer is 0x0, which will get you the functions which your implementation called before main as well. We’ll grab the frame pointer and return address from each frame and print out the information as we go.

    while (dwarf::at_name(current_func) != "main") {
        current_func = get_function_from_pc(return_address);
        output_frame(current_func);
        frame_pointer = read_memory(frame_pointer);
        return_address = read_memory(frame_pointer+8);
    }
}

That’s it! The whole function is here for your convenience:

void debugger::print_backtrace() {
    auto output_frame = [frame_number = 0] (auto&& func) mutable {
        std::cout << "frame #" << frame_number++ << ": 0x" << dwarf::at_low_pc(func)
                  << ' ' << dwarf::at_name(func) << std::endl;
    };

    auto current_func = get_function_from_pc(get_pc());
    output_frame(current_func);

    auto frame_pointer = get_register_value(m_pid, reg::rbp);
    auto return_address = read_memory(frame_pointer+8);

    while (dwarf::at_name(current_func) != "main") {
        current_func = get_function_from_pc(return_address);
        output_frame(current_func);
        frame_pointer = read_memory(frame_pointer);
        return_address = read_memory(frame_pointer+8);
    }
}

Adding commands

Of course, we have to expose this command to the user.

    else if(is_prefix(command, "backtrace")) {
        print_backtrace();
    }

Testing it out

A good way to test this functionality is by writing a test program with a bunch of small functions which call each other. Set a few breakpoints, jump around the code a bit, and make sure that your backtrace is accurate.


We’ve come a long way from a program which can merely spawn and attach to other programs. The penultimate post in this series will finish up the implementation of the debugger by supporting the reading and writing of variables. Until then you can find the code for this post here.

Writing a Linux Debugger Part 7: Source-level breakpoints

Setting breakpoints on memory addresses is all well and good, but it doesn’t provide the most user-friendly tool. We’d like to be able to set breakpoints on source lines and function entry addresses as well, so that we can debug at the same abstraction level as our code.

This post will add source-level breakpoints to our debugger. With all of the support we already have available to us, this is a lot easier than it may first sound. We’ll also add a command to get the type and address of a symbol, which can be useful for locating code or data and understanding linking concepts.


Series index

  1. Setup
  2. Breakpoints
  3. Registers and memory
  4. Elves and dwarves
  5. Source and signals
  6. Source-level stepping
  7. Source-level breakpoints
  8. Stack unwinding
  9. Handling variables
  10. Advanced topics

Breakpoints

DWARF

The Elves and dwarves post described how DWARF debug information works and how it can be used to map the machine code back to the high-level source. Recall that DWARF contains the address ranges of functions and a line table which lets you translate code positions between abstraction levels. We’ll be using these capabilities to implement our breakpoints.

Function entry

Setting breakpoints on function names can be complex if you want to take overloading, member functions and such into account, but to keep things simple we’re just going to iterate through all of the compilation units and search for functions with names which match what we’re looking for. The DWARF information will look something like this:

< 0><0x0000000b>  DW_TAG_compile_unit
                    DW_AT_producer              clang version 3.9.1 (tags/RELEASE_391/final)
                    DW_AT_language              DW_LANG_C_plus_plus
                    DW_AT_name                  /super/secret/path/MiniDbg/examples/variable.cpp
                    DW_AT_stmt_list             0x00000000
                    DW_AT_comp_dir              /super/secret/path/MiniDbg/build
                    DW_AT_low_pc                0x00400670
                    DW_AT_high_pc               0x0040069c

LOCAL_SYMBOLS:
< 1><0x0000002e>    DW_TAG_subprogram
                      DW_AT_low_pc                0x00400670
                      DW_AT_high_pc               0x0040069c
                      DW_AT_name                  foo
                      ...
...
<14><0x000000b0>    DW_TAG_subprogram
                      DW_AT_low_pc                0x00400700
                      DW_AT_high_pc               0x004007a0
                      DW_AT_name                  bar
                      ...

We want to match against DW_AT_name and use DW_AT_low_pc (the start address of the function) to set our breakpoint.

void debugger::set_breakpoint_at_function(const std::string& name) {
    for (const auto& cu : m_dwarf.compilation_units()) {
        for (const auto& die : cu.root()) {
            if (die.has(dwarf::DW_AT::name) && at_name(die) == name) {
                auto low_pc = at_low_pc(die);
                auto entry = get_line_entry_from_pc(low_pc);
                ++entry; //skip prologue
                set_breakpoint_at_address(entry->address);
            }
        }
    }
}

The only bit of that code which looks a bit weird is the ++entry. The problem is that the DW_AT_low_pc for a function doesn’t point at the start of the user code for that function; it points at the start of the prologue. The compiler will usually output a prologue and epilogue for a function which handle saving and restoring registers, manipulating the stack pointer, and the like. This isn’t very useful for us, so we increment the line entry by one to get the first line of the user code instead of the prologue. The DWARF line table actually has some functionality to mark an entry as the first line after the function prologue, but not all compilers output this, so I’ve taken the naive approach.
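If you wanted to handle compilers which do emit the marker, the body of the matching branch could look like this sketch. It assumes your version of libelfin exposes the DWARF prologue_end and end_sequence flags on line table entries, which is worth verifying first:

auto entry = get_line_entry_from_pc(at_low_pc(die));
auto target = entry;
++target; // naive fallback: the first entry after low_pc

// Walk the sequence looking for a compiler-emitted end-of-prologue marker.
for (auto it = entry; !it->end_sequence; ++it) {
    if (it->prologue_end) {
        target = it;
        break;
    }
}
set_breakpoint_at_address(target->address);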

Source line

To set a breakpoint on a high-level source line, we translate this line number into an address by looking it up in the DWARF. We’ll iterate through the compilation units looking for one whose name matches the given file, then look for the entry which corresponds to the given line.

The DWARF will look something like this:

.debug_line: line number info for a single cu
Source lines (from CU-DIE at .debug_info offset 0x0000000b):

NS new statement, BB new basic block, ET end of text sequence
PE prologue end, EB epilogue begin
IS=val ISA number, DI=val discriminator value
<pc>        [lno,col] NS BB ET PE EB IS= DI= uri: "filepath"
0x004004a7  [   1, 0] NS uri: "/super/secret/path/a.hpp"
0x004004ab  [   2, 0] NS
0x004004b2  [   3, 0] NS
0x004004b9  [   4, 0] NS
0x004004c1  [   5, 0] NS
0x004004c3  [   1, 0] NS uri: "/super/secret/path/b.hpp"
0x004004c7  [   2, 0] NS
0x004004ce  [   3, 0] NS
0x004004d5  [   4, 0] NS
0x004004dd  [   5, 0] NS
0x004004df  [   4, 0] NS uri: "/super/secret/path/ab.cpp"
0x004004e3  [   5, 0] NS
0x004004e8  [   6, 0] NS
0x004004ed  [   7, 0] NS
0x004004f4  [   7, 0] NS ET

So if we want to set a breakpoint on line 5 of ab.cpp, we look up the entry which corresponds to that line (0x004004e3) and set a breakpoint there.

void debugger::set_breakpoint_at_source_line(const std::string& file, unsigned line) {
    for (const auto& cu : m_dwarf.compilation_units()) {
        if (is_suffix(file, at_name(cu.root()))) {
            const auto& lt = cu.get_line_table();

            for (const auto& entry : lt) {
                if (entry.is_stmt && entry.line == line) {
                    set_breakpoint_at_address(entry.address);
                    return;
                }
            }
        }
    }
}

My is_suffix hack is there so you can type c.cpp for a/b/c.cpp. Of course you should really use a sensible path handling library; I’m lazy. The entry.is_stmt check verifies that the line table entry is marked as the beginning of a statement, which the compiler sets on the address it thinks is the best target for a breakpoint.
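For reference, a minimal is_suffix might look like this, mirroring the is_prefix helper from earlier in the series (treat it as a sketch; a real path handling library would be more robust):

bool is_suffix(const std::string& s, const std::string& of) {
    if (s.size() > of.size()) return false;
    auto diff = of.size() - s.size();
    // compare s against the tail of of (std::equal from <algorithm>)
    return std::equal(s.begin(), s.end(), of.begin() + diff);
}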


Symbol lookup

When we get down to the level of object files, symbols are king. Functions are named with symbols, global variables are named with symbols, you get a symbol, we get a symbol, everyone gets a symbol. In a given object file, some symbols might reference other object files or shared libraries, where the linker will patch things up to create an executable program from the symbol reference spaghetti.

Symbols can be looked up in the aptly-named symbol table, which is stored in ELF sections in the binary. Fortunately, libelfin has a fairly nice interface for doing this, so we don’t need to deal with all of the ELF nonsense ourselves. To give you an idea of what we’re dealing with, here is a dump of the .symtab section of a binary, produced with readelf:

Num:    Value          Size Type    Bind   Vis      Ndx Name
 0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
 1: 0000000000400238     0 SECTION LOCAL  DEFAULT    1
 2: 0000000000400254     0 SECTION LOCAL  DEFAULT    2
 3: 0000000000400278     0 SECTION LOCAL  DEFAULT    3
 4: 00000000004002c8     0 SECTION LOCAL  DEFAULT    4
 5: 0000000000400430     0 SECTION LOCAL  DEFAULT    5
 6: 00000000004004e4     0 SECTION LOCAL  DEFAULT    6
 7: 0000000000400508     0 SECTION LOCAL  DEFAULT    7
 8: 0000000000400528     0 SECTION LOCAL  DEFAULT    8
 9: 0000000000400558     0 SECTION LOCAL  DEFAULT    9
10: 0000000000400570     0 SECTION LOCAL  DEFAULT   10
11: 0000000000400714     0 SECTION LOCAL  DEFAULT   11
12: 0000000000400720     0 SECTION LOCAL  DEFAULT   12
13: 0000000000400724     0 SECTION LOCAL  DEFAULT   13
14: 0000000000400750     0 SECTION LOCAL  DEFAULT   14
15: 0000000000600e18     0 SECTION LOCAL  DEFAULT   15
16: 0000000000600e20     0 SECTION LOCAL  DEFAULT   16
17: 0000000000600e28     0 SECTION LOCAL  DEFAULT   17
18: 0000000000600e30     0 SECTION LOCAL  DEFAULT   18
19: 0000000000600ff0     0 SECTION LOCAL  DEFAULT   19
20: 0000000000601000     0 SECTION LOCAL  DEFAULT   20
21: 0000000000601018     0 SECTION LOCAL  DEFAULT   21
22: 0000000000601028     0 SECTION LOCAL  DEFAULT   22
23: 0000000000000000     0 SECTION LOCAL  DEFAULT   23
24: 0000000000000000     0 SECTION LOCAL  DEFAULT   24
25: 0000000000000000     0 SECTION LOCAL  DEFAULT   25
26: 0000000000000000     0 SECTION LOCAL  DEFAULT   26
27: 0000000000000000     0 SECTION LOCAL  DEFAULT   27
28: 0000000000000000     0 SECTION LOCAL  DEFAULT   28
29: 0000000000000000     0 SECTION LOCAL  DEFAULT   29
30: 0000000000000000     0 SECTION LOCAL  DEFAULT   30
31: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS init.c
32: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS crtstuff.c
33: 0000000000600e28     0 OBJECT  LOCAL  DEFAULT   17 __JCR_LIST__
34: 00000000004005a0     0 FUNC    LOCAL  DEFAULT   10 deregister_tm_clones
35: 00000000004005e0     0 FUNC    LOCAL  DEFAULT   10 register_tm_clones
36: 0000000000400620     0 FUNC    LOCAL  DEFAULT   10 __do_global_dtors_aux
37: 0000000000601028     1 OBJECT  LOCAL  DEFAULT   22 completed.6917
38: 0000000000600e20     0 OBJECT  LOCAL  DEFAULT   16 __do_global_dtors_aux_fin
39: 0000000000400640     0 FUNC    LOCAL  DEFAULT   10 frame_dummy
40: 0000000000600e18     0 OBJECT  LOCAL  DEFAULT   15 __frame_dummy_init_array_
41: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS /super/secret/path/MiniDbg/
42: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS crtstuff.c
43: 0000000000400818     0 OBJECT  LOCAL  DEFAULT   14 __FRAME_END__
44: 0000000000600e28     0 OBJECT  LOCAL  DEFAULT   17 __JCR_END__
45: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS
46: 0000000000400724     0 NOTYPE  LOCAL  DEFAULT   13 __GNU_EH_FRAME_HDR
47: 0000000000601000     0 OBJECT  LOCAL  DEFAULT   20 _GLOBAL_OFFSET_TABLE_
48: 0000000000601028     0 OBJECT  LOCAL  DEFAULT   21 __TMC_END__
49: 0000000000601020     0 OBJECT  LOCAL  DEFAULT   21 __dso_handle
50: 0000000000600e20     0 NOTYPE  LOCAL  DEFAULT   15 __init_array_end
51: 0000000000600e18     0 NOTYPE  LOCAL  DEFAULT   15 __init_array_start
52: 0000000000600e30     0 OBJECT  LOCAL  DEFAULT   18 _DYNAMIC
53: 0000000000601018     0 NOTYPE  WEAK   DEFAULT   21 data_start
54: 0000000000400710     2 FUNC    GLOBAL DEFAULT   10 __libc_csu_fini
55: 0000000000400570    43 FUNC    GLOBAL DEFAULT   10 _start
56: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __gmon_start__
57: 0000000000400714     0 FUNC    GLOBAL DEFAULT   11 _fini
58: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __libc_start_main@@GLIBC_
59: 0000000000400720     4 OBJECT  GLOBAL DEFAULT   12 _IO_stdin_used
60: 0000000000601018     0 NOTYPE  GLOBAL DEFAULT   21 __data_start
61: 00000000004006a0   101 FUNC    GLOBAL DEFAULT   10 __libc_csu_init
62: 0000000000601028     0 NOTYPE  GLOBAL DEFAULT   22 __bss_start
63: 0000000000601030     0 NOTYPE  GLOBAL DEFAULT   22 _end
64: 0000000000601028     0 NOTYPE  GLOBAL DEFAULT   21 _edata
65: 0000000000400670    44 FUNC    GLOBAL DEFAULT   10 main
66: 0000000000400558     0 FUNC    GLOBAL DEFAULT    9 _init

You can see lots of symbols for sections in the object file, symbols which are used by the implementation for setting up the environment, and at the end you can see the symbol for main.

We’re interested in the type, name, and value (address) of the symbol. We’ll have a symbol_type enum for the type and use a std::string for the name and std::uintptr_t for the address:

enum class symbol_type {
    notype,            // No type (e.g., absolute symbol)
    object,            // Data object
    func,              // Function entry point
    section,           // Symbol is associated with a section
    file,              // Source file associated with the object file
};

std::string to_string (symbol_type st) {
    switch (st) {
    case symbol_type::notype: return "notype";
    case symbol_type::object: return "object";
    case symbol_type::func: return "func";
    case symbol_type::section: return "section";
    case symbol_type::file: return "file";
    }
}

struct symbol {
    symbol_type type;
    std::string name;
    std::uintptr_t addr;
};

We’ll need to map between the symbol type we get from libelfin and our enum since we don’t want the dependency poisoning this interface. Fortunately I picked the same names for everything, so this is dead easy:

symbol_type to_symbol_type(elf::stt sym) {
    switch (sym) {
    case elf::stt::notype: return symbol_type::notype;
    case elf::stt::object: return symbol_type::object;
    case elf::stt::func: return symbol_type::func;
    case elf::stt::section: return symbol_type::section;
    case elf::stt::file: return symbol_type::file;
    default: return symbol_type::notype;
    }
}

Lastly, we want to look up the symbol. For illustrative purposes I loop through the sections of the ELF looking for symbol tables, then collect any symbols I find in them into a std::vector. A smarter implementation would build up a map from names to symbols so that you only have to look at all the data once; a sketch of that approach follows the code below.

std::vector<symbol> debugger::lookup_symbol(const std::string& name) {
    std::vector<symbol> syms;

    for (auto &sec : m_elf.sections()) {
        if (sec.get_hdr().type != elf::sht::symtab && sec.get_hdr().type != elf::sht::dynsym)
            continue;

        for (auto sym : sec.as_symtab()) {
            if (sym.get_name() == name) {
                auto &d = sym.get_data();
                syms.push_back(symbol{to_symbol_type(d.type()), sym.get_name(), d.value});
            }
        }
    }

    return syms;
}
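Here’s what that smarter approach could look like; the m_symbol_cache member and build_symbol_cache function are hypothetical names, not part of the debugger we’ve built so far, and you’d need <unordered_map>:

// Hypothetical cache, built once and queried on every lookup thereafter.
std::unordered_multimap<std::string, symbol> m_symbol_cache;

void debugger::build_symbol_cache() {
    for (auto& sec : m_elf.sections()) {
        if (sec.get_hdr().type != elf::sht::symtab && sec.get_hdr().type != elf::sht::dynsym)
            continue;

        for (auto sym : sec.as_symtab()) {
            auto& d = sym.get_data();
            m_symbol_cache.emplace(sym.get_name(),
                symbol{to_symbol_type(d.type()), sym.get_name(), d.value});
        }
    }
}

lookup_symbol would then reduce to an equal_range query over the cache.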

Adding commands

As always, we need to add some more commands to expose the functionality to users. For breakpoints I’ve gone for a GDB-style interface, where the kind of breakpoint is inferred from the argument you pass rather than requiring explicit switches:

  • 0x<hexadecimal> -> address breakpoint
  • <filename>:<line> -> line number breakpoint
  • <anything else> -> function name breakpoint

    else if(is_prefix(command, "break")) {
        if (args[1][0] == '0' && args[1][1] == 'x') {
            std::string addr {args[1], 2};
            set_breakpoint_at_address(std::stol(addr, 0, 16));
        }
        else if (args[1].find(':') != std::string::npos) {
            auto file_and_line = split(args[1], ':');
            set_breakpoint_at_source_line(file_and_line[0], std::stoi(file_and_line[1]));
        }
        else {
            set_breakpoint_at_function(args[1]);
        }
    }

For symbols, we’ll look up the given name and print out any matches we find:

    else if(is_prefix(command, "symbol")) {
        auto syms = lookup_symbol(args[1]);
        for (auto&& s : syms) {
            std::cout << s.name << ' ' << to_string(s.type) << " 0x" << std::hex << s.addr << std::endl;
        }
    }

Testing it out

Fire up your debugger on a simple binary and play around with setting source-level breakpoints. Setting a breakpoint on some function foo and seeing my debugger stop on it was one of the most rewarding moments of this project for me.

Symbol lookup can be tested by adding some functions or global variables to your program and looking up their names. Note that if you’re compiling C++ code you’ll need to take name mangling into account as well.
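For example, looking up main on the binary whose symbol table is dumped above should print something like this (assuming the minidbg> prompt from earlier in the series):

minidbg> symbol main
main func 0x400670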

That’s all for this post. Next time I’ll show how to add stack unwinding support to the debugger.

You can find the code for this post here.

Accelerating your C++ on GPU with SYCL

Leveraging the power of graphics cards for compute applications is all the rage right now in fields such as machine learning, computer vision and high-performance computing. Technologies like OpenCL expose this power through a hardware-independent programming model, allowing you to write code which abstracts over different architecture capabilities. The dream of this is “write once, run anywhere”, be it an Intel CPU, AMD discrete GPU, DSP, etc. Unfortunately, for everyday programmers, OpenCL has something of a steep learning curve; a simple Hello World program can be a hundred or so lines of pretty ugly-looking code. However, to ease this pain, the Khronos group have developed a new standard called SYCL, which is a C++ abstraction layer on top of OpenCL. Using SYCL, you can develop these general-purpose GPU (GPGPU) applications in clean, modern C++ without most of the faff associated with OpenCL. Here’s a simple vector multiplication example written in SYCL using the parallel STL implementation:

#include <array>
#include <numeric>

#include <sycl/execution_policy>
#include <experimental/algorithm>
#include <sycl/helpers/sycl_buffers.hpp>

using namespace std::experimental::parallel;
using namespace sycl::helpers;

int main() {
  constexpr size_t array_size = 1024*512;
  std::array<cl::sycl::cl_int, array_size> a;
  std::iota(begin(a),end(a),0);

  {
    cl::sycl::buffer<int> b(a.data(), cl::sycl::range<1>(a.size()));
    cl::sycl::queue q;
    sycl::sycl_execution_policy<class Mul> sycl_policy(q);
    transform(sycl_policy, begin(b), end(b), begin(b),
              [](int x) { return x*2; });
  }
}

For comparison, here’s a mostly equivalent version written in OpenCL using the C++ API (don’t spend much time reading this, just note that it looks ugly and is really long):

#include <iostream>
#include <array>
#include <numeric>
#include <CL/cl.hpp>

int main(){
    std::vector<cl::Platform> all_platforms;
    cl::Platform::get(&all_platforms);
    if(all_platforms.size()==0){
        std::cout<<" No platforms found. Check OpenCL installation!\n";
        exit(1);
    }
    cl::Platform default_platform=all_platforms[0];

    std::vector<cl::Device> all_devices;
    default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);
    if(all_devices.size()==0){
        std::cout<<" No devices found. Check OpenCL installation!\n";
        exit(1);
    }

    cl::Device default_device=all_devices[0];
    cl::Context context({default_device});

    cl::Program::Sources sources;
    std::string kernel_code=
        "   void kernel mul2(global int* A){"
        "       A[get_global_id(0)]=A[get_global_id(0)]*2;"
        "   }";
    sources.push_back({kernel_code.c_str(),kernel_code.length()});

    cl::Program program(context,sources);
    if(program.build({default_device})!=CL_SUCCESS){
        std::cout<<" Error building: "<<program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device)<<"\n";
        exit(1);
    }

    constexpr size_t array_size = 1024*512;
    std::array<cl_int, array_size> a;
    std::iota(begin(a),end(a),0);

    cl::Buffer buffer_A(context,CL_MEM_READ_WRITE,sizeof(int)*a.size());
    cl::CommandQueue queue(context,default_device);

    if (queue.enqueueWriteBuffer(buffer_A,CL_TRUE,0,sizeof(int)*a.size(),a.data()) != CL_SUCCESS) {
        std::cout << "Failed to write memory;n";
        exit(1);
    }

    cl::Kernel kernel_add = cl::Kernel(program,"mul2");
    kernel_add.setArg(0,buffer_A);

    if (queue.enqueueNDRangeKernel(kernel_add,cl::NullRange,cl::NDRange(a.size()),cl::NullRange) != CL_SUCCESS) {
        std::cout << "Failed to enqueue kernel\n";
        exit(1);
    }

    if (queue.finish() != CL_SUCCESS) {
        std::cout << "Failed to finish kernel\n";
        exit(1);
    }

    if (queue.enqueueReadBuffer(buffer_A,CL_TRUE,0,sizeof(int)*a.size(),a.data()) != CL_SUCCESS) {
        std::cout << "Failed to read result\n";
        exit(1);
    }
}

In this post I’ll give an introduction on using SYCL to accelerate your C++ code on the GPU.


Lightning intro to GPGPU

Before I get started on how to use SYCL, I’ll give a brief outline of why you might want to run compute jobs on the GPU for those who are unfamiliar. If you’ve already used OpenCL, CUDA or similar, feel free to skip ahead.

The key difference between a GPU and a CPU is that, rather than having a small number of complex, powerful cores (1-8 for common consumer desktop hardware), a GPU has a huge number of small, simple processing elements.

CPU architecture

Above is a comically simplified diagram of a CPU with four cores. Each core has a set of registers and is attached to various levels of cache (some might be shared, some not), and then main memory.

GPU architecture

In the GPU, tiny processing elements are grouped into execution units. Each processing element has a bit of memory attached to it, and each execution unit has some memory shared between its processing elements. After that, there’s some GPU-wide memory, then the same main memory which the CPU uses. The elements within an execution unit execute in lockstep, where each element executes the same instruction on a different piece of data.

The benefit which GPUs bring is the amount of data which can be processed at the same time. If you’re on a CPU, maybe you can process a large handful of pixels at a given time if you use multithreading and vector instructions, but GPUs can process orders of magnitude more than this. The sheer size of data which can be processed at once in GPUs makes them very well-suited to applications like graphics (duh), mathematical processing, neural networks, etc.

There are many aspects of GPGPU programming which make it an entirely different beast to everyday CPU programming. For example, transferring data from main memory to the GPU is slow. Really slow. Like, kill all your performance and get you fired slow. Therefore, the tradeoff with GPU programming is to use as much of the accelerator’s ridiculously high throughput as you can to hide the latency of shipping the data to and from it.

There are other issues which might not be immediately apparent, like the cost of branching. Since the processing elements in an execution unit work in lockstep, nested branches which cause them to take different paths (divergent control flow) are a real problem. This is often solved by executing all branches for all elements and masking out the unneeded results. That’s an exponential blowup in work based on the level of nesting, which is A Bad Thing ™. Of course, there are optimizations which can aid this, but the idea stands: simple assumptions and knowledge you bring from the CPU world might cause you big problems in the GPU world.

Before we get back to SYCL, some short pieces of terminology. The host is the main CPU running on your machine which executes your normal C or C++ code, and the device is what will be running your OpenCL code. A device could be the same as the host, or it could be some accelerator sitting in your machine, a simulator, whatever. A kernel is a special function which is the entry point to the code which will run on your device. It will often be supplied with buffers for input and output data which have been set up by the host.


Back to SYCL

There are currently two implementations of SYCL available: triSYCL, an experimental open source version by Xilinx (mostly used as a testbed for the standard), and ComputeCpp, an industry-strength implementation by Codeplay1 (currently in open beta). Only ComputeCpp supports execution of kernels on the GPU, so we’ll be using that in this post.

Step 1 is to get ComputeCpp up and running on your machine. The main components are a runtime library which implements the SYCL API, and a Clang-based compiler which compiles both your host code and your device code. At the time of writing, Intel CPUs and some AMD GPUs are officially supported on Ubuntu and CentOS. It should be pretty easy to get it working on other Linux distributions (I got it running on my Arch system, for instance). Support for more hardware and operating systems is being worked on, so check the supported platforms document for an up-to-date list. The dependencies and components are listed here. You might also want to download the SDK, which contains samples, documentation, build system integration files, and more. I’ll be using the SYCL Parallel STL in this post, so get that if you want to play along at home.

Once you’re all set up, we can get GPGPUing! As noted in the introduction, my first sample used the SYCL parallel STL implementation. We’ll now take a look at how to write that code with bare SYCL.

#include <CL/sycl.hpp>

#include <array>
#include <numeric>
#include <iostream>

int main() {
    const size_t array_size = 1024*512;
    std::array<cl::sycl::cl_int, array_size> in, out;
    std::iota(begin(in), end(in), 0);

    {
        cl::sycl::queue device_queue;
        cl::sycl::range<1> n_items{array_size};
        cl::sycl::buffer<cl::sycl::cl_int, 1> in_buffer(in.data(), n_items);
        cl::sycl::buffer<cl::sycl::cl_int, 1> out_buffer(out.data(), n_items);

        device_queue.submit([&](cl::sycl::handler &cgh) {
            constexpr auto sycl_read = cl::sycl::access::mode::read;
            constexpr auto sycl_write = cl::sycl::access::mode::write;

            auto in_accessor = in_buffer.get_access<sycl_read>(cgh);
            auto out_accessor = out_buffer.get_access<sycl_write>(cgh);

            cgh.parallel_for<class VecScalMul>(n_items,
                [=](cl::sycl::id<1> wiID) {
                    out_accessor[wiID] = in_accessor[wiID]*2;
                });
        });
    }
}

I’ll break this down piece-by-piece.

#include <CL/sycl.hpp>

The first thing we do is include the SYCL header file, which will put the SYCL runtime library at our command.

const size_t array_size = 1024*512;
std::array<cl::sycl::cl_int, array_size> in,out;
std::iota(begin(in),end(in),0);

Here we construct a large array of integers and initialize it with the numbers from 0 to array_size-1 (this is what std::iota does). Note that we use cl::sycl::cl_int to ensure compatibility.

{
    //...
}

Next we open up a new scope. This achieves two things:

  1. device_queue will be destructed at the end of the scope, which will block until the kernel has completed.
  2. in_buffer and out_buffer will also be destructed, which will force data transfer back to the host and allow us to access the data from in and out.

cl::sycl::queue device_queue;

Now we create our command queue. The command queue is where all work (kernels) will be enqueued before being dispatched to the device. There are many ways to customise the queue, such as providing a device to enqueue on or setting up asynchronous error handlers, but the default constructor will do for this example; it looks for a compatible GPU and falls back on the host CPU if it fails.
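For instance, here’s a sketch of a queue which explicitly asks for a GPU and installs an asynchronous error handler (treat the exact selector and handler types as assumptions about your SYCL implementation; std::cerr needs <iostream>):

cl::sycl::gpu_selector selector;
cl::sycl::queue device_queue(selector, [](cl::sycl::exception_list errors) {
    // Asynchronous errors are delivered here as std::exception_ptrs.
    for (auto& e : errors) {
        try {
            std::rethrow_exception(e);
        } catch (const cl::sycl::exception& ex) {
            std::cerr << "SYCL async exception: " << ex.what() << '\n';
        }
    }
});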

cl::sycl::range<1> n_items{array_size};

Next we create a range, which describes the shape of the data which the kernel will be executing on. In our simple example, it’s a one-dimensional array, so we use cl::sycl::range<1>. If the data were two-dimensional we would use cl::sycl::range<2> and so on. Alongside cl::sycl::range, there is cl::sycl::nd_range, which allows you to specify work group sizes as well as an overall range, but we don’t need that for our example.
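For example, a hypothetical 640x480 image would be described with a two-dimensional range:

cl::sycl::range<2> image_range{640, 480};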

cl::sycl::buffer<cl::sycl::cl_int, 1> in_buffer(in.data(), n_items);
cl::sycl::buffer<cl::sycl::cl_int, 1> out_buffer(out.data(), n_items);

In order to control data sharing and transfer between the host and devices, SYCL provides a buffer class. We create two SYCL buffers to manage our input and output arrays.

device_queue.submit([&](cl::sycl::handler &cgh) {/*...*/});

After setting up all of our data, we can enqueue our actual work. There are a few ways to do this, but a simple method for setting up a parallel execution is to call the .submit function on our queue. To this function we pass a command group functor2 which will be executed when the runtime schedules that task. A command group handler sets up any last resources needed by the kernel and dispatches it.

constexpr auto sycl_read = cl::sycl::access::mode::read;
constexpr auto sycl_write = cl::sycl::access::mode::write;

auto in_accessor = in_buffer.get_access<sycl_read>(cgh);
auto out_accessor = out_buffer.get_access<sycl_write>(cgh);

In order to control access to our buffers and to tell the runtime how we will be using the data, we need to create accessors. It should be clear that we are creating one accessor for reading from in_buffer, and one accessor for writing to out_buffer.

cgh.parallel_for<class VecScalMul>(n_items,
    [=](cl::sycl::id<1> wiID) {
         out_accessor[wiID] = in_accessor[wiID]*2;
    });

Now that we’ve done all the setup, we can actually do some computation on our device. Here we dispatch a kernel on the command group handler cgh over our range n_items. The actual kernel itself is a lambda which takes a work-item identifier and carries out our computation. In this case, we are reading from in_accessor at the index of our work-item identifier, multiplying it by 2, then storing the result in the relevant place in out_accessor. That <class VecScalMul> is an unfortunate byproduct of how SYCL needs to work within the confines of standard C++, so we need to give a unique class name to the kernel for the compiler to be able to do its job.

}

After this point, our kernel will have completed and we could access out and expect to see the correct results.
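A quick sanity check back on the host (assuming <cassert> is included) might look like this; each element of out should be double its counterpart in in:

for (size_t i = 0; i < array_size; ++i) {
    assert(out[i] == static_cast<cl::sycl::cl_int>(i) * 2);
}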

There are quite a few new concepts at play here, but hopefully you can see the power and expressiveness we get using these techniques. However, if you just want to toss some code at your GPU and not worry about the customisation, then you can use the SYCL Parallel STL implementation.


SYCL Parallel STL

The SYCL Parallel STL is an implementation of the Parallelism TS which dispatches your algorithm function objects as SYCL kernels. We already saw an example of this at the top of the page, so let’s run through it quickly.

#include <array>
#include <numeric>

#include <sycl/execution_policy>
#include <experimental/algorithm>
#include <sycl/helpers/sycl_buffers.hpp>

using namespace std::experimental::parallel;
using namespace sycl::helpers;

int main() {
  constexpr size_t array_size = 1024*512;
  std::array<cl::sycl::cl_int, array_size> in,out;
  std::iota(begin(in),end(in),0);

  {
    cl::sycl::buffer<int> in_buffer(in.data(), cl::sycl::range<1>(in.size()));
    cl::sycl::buffer<int> out_buffer(out.data(), cl::sycl::range<1>(out.size()));
    cl::sycl::queue q;
    sycl::sycl_execution_policy<class Mul> sycl_policy(q);
    transform(sycl_policy, begin(in_buffer), end(in_buffer), begin(out_buffer),
              [](int x) { return x*2; });
  }
}

constexpr size_t array_size = 1024*512;
std::array<cl::sycl::cl_int, array_size> in, out;
std::iota(begin(in),end(in),0);

So far, so similar. Again we’re creating a couple of arrays to hold our input and output data.

cl::sycl::buffer<int> in_buffer(in.data(), cl::sycl::range<1>(in.size()));
cl::sycl::buffer<int> out_buffer(out.data(), cl::sycl::range<1>(out.size()));
cl::sycl::queue q;

Here we are creating our buffers and our queue like in the last example.

sycl::sycl_execution_policy<class Mul> sycl_policy(q);

Here’s where things get interesting. We create a sycl_execution_policy from our queue and give it a name to use for the kernel. This execution policy can then be used like std::execution::par or std::execution::seq.

transform(sycl_policy, begin(in_buffer), end(in_buffer), begin(out_buffer),
          [](int x) { return x*2; });

Now our kernel dispatch looks like a call to std::transform with an execution policy provided. That closure we pass in will be compiled for and executed on the device without us having to do any more complex set up.

Of course, you can do more than just transform. At the time of writing, the SYCL Parallel STL supports these algorithms:

  • sort
  • transform
  • for_each
  • for_each_n
  • count_if
  • reduce
  • inner_product
  • transform_reduce
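
For instance, a reduction over the same input buffer might look like this sketch (a fresh kernel name is needed since each kernel requires a unique one, and treat the exact reduce overload as an assumption about the library version):

sycl::sycl_execution_policy<class Sum> sum_policy(q);
auto sum = reduce(sum_policy, begin(in_buffer), end(in_buffer), 0,
                  [](int a, int b) { return a + b; });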

That covers things for this short introduction. If you want to keep up to date with developments in SYCL, be sure to check out sycl.tech. Notable recent developments have been porting Eigen and Tensorflow to SYCL to bring expressive artificial intelligence programming to OpenCL devices. Personally, I’m excited to see how the high-level programming models can be exploited for automatic optimization of heterogeneous programs, and how they can support even higher-level technologies like HPX or SkelCL.


  1. I work for Codeplay, but this post was written in my own time with no suggestion from my employer. 

  2. Hey, “functor” is in the spec, don’t @ me.