C++ value categories

2021-04-07

This is an attempt at an explanation of C++ value categories. They are mentioned in many documents that a programmer will read, so it's useful to gain an understanding of them. Intuition tells us when and how objects can be moved, but the notion of value categories isn't at all clear to everyone, as far as I can tell. In fact it's quite a difficult thing to understand and explain, judging by the number of articles on the internet about it.

I am fairly sure there will be some errors here too, which reinforces what I have just said. If there are errors, do tell me. blog@bitanvil.com

Recap of the old categories (pre C++11)

Why even have value categories?

C, C++, CPL and D are languages that have the concept of value categories. As far as I can tell, other languages don't mention such things. In my small experience of writing a Basic transpiler, it became clear that the left hand side of the equals operator in Basic could never be an expression according to the parsing grammar; as I recall, it was what I called an 'identifier'. Languages like C++ that allow expressions (not just "variables") on the left hand side of assignment operators may therefore also need value categories to distinguish between "assignable" expressions and non-assignable ones.

lvalues and rvalues

In 1998, when the first C++ standard was published, expressions fell into two groups: the lvalue and the rvalue. Back then people may not have called them value categories, just lvalue and rvalue, but nonetheless that is what they were.

Expressions have a type, and a value category. The value categories are for expressions only, and they are not part of the type. An expression is something that is evaluated, and the categories are separate from any object that may take part in the evaluation. In other words int a; doesn't have a value category; it defines some storage. The category only appears when a is evaluated (used) in an expression.

For example, these statements declare (and define) objects; there are no value categories here:

int a;
char k[10];

In terms of definitions:

a is an object of type int
k is an object of type array of ten chars

All the following statements compile (but don't all do useful things). They are statements because they end with a semicolon, but each one evaluates an expression. Once again: value categories result from expression evaluation, i.e. evaluating an expression yields a value, and this value is in some category. You can also see that an expression involving "a" is sometimes an lvalue and sometimes an rvalue, so the category does indeed depend on how a is used in the expression itself.

a;
a = 1;
++a;
a++;
*k = '\0';
k[1] = 'a';
1;
"hello";
*new int;

What is and is not an lvalue or an rvalue expression has changed in various standards. So the above list might well be wrong for C++98. I am not going to dig into this now, that is a history lesson.
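
The list above can be checked with a modern compiler. This is only a sketch, assuming C++11 or later: it relies on decltype((e)) naming an lvalue reference type exactly when the expression e is an lvalue. IS_LVALUE is a hypothetical helper macro, not something from the standard library, and the answers are the current ones, which may differ from C++98.

```cpp
#include <type_traits>

// Hypothetical helper: decltype((e)) is an lvalue reference type exactly
// when the expression e is an lvalue. The operand is never evaluated.
#define IS_LVALUE(e) (std::is_lvalue_reference<decltype((e))>::value)

int a;
char k[10];

static_assert(IS_LVALUE(a), "a names an object");
static_assert(IS_LVALUE(a = 1), "assignment yields its left operand");
static_assert(IS_LVALUE(++a), "pre-increment yields a itself");
static_assert(!IS_LVALUE(a++), "post-increment yields a copy of the old value");
static_assert(IS_LVALUE(*k), "dereferencing yields an lvalue");
static_assert(IS_LVALUE(k[1]), "subscripting yields an lvalue");
static_assert(!IS_LVALUE(1), "an integer literal is an rvalue");
static_assert(IS_LVALUE("hello"), "a string literal is an lvalue");
static_assert(IS_LVALUE(*new int), "no name, but still an lvalue");
```

If the file compiles, every claim in the static_assert lines holds for that compiler.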

Observe these interesting snippets, too:

$ c++ test.cpp
int main()
{
	int a = 1;
	a++ = 5; // `a++` is an rvalue 
}
test.cpp:7:3: error: lvalue required as left operand of assignment
    7 |  a++ = 5;
      |  ~^~

That's a compile error: of course you can't assign a number to a++.

So how about this:

$ c++ test.cpp
int main()
{
	int a = 1;
	++a = 5; // `++a` is an lvalue remember
}
$ (no error)

Note: the above compiles as C++ but not as C, since in C ++a is an rvalue.

If something is an lvalue, it can be assigned to - unless of course it is also const. The above compiler error message should now be clear, assignment requires an expression that is in the lvalue category.

Why did I mention all this? Because the = (assignment) operator like any other operator is defined to "work" against specific types, with specific value categories. It does not assign something to a variable. It assigns something to an lvalue expression. This concept is quite helpful going forward. Day to day writing C++ code you don't think in these terms, but it should help explain the new value categories.

The concept of Identity

Lvalues are said to have an identity. If something has an identity then it can be distinguished from another thing. In C++, this means anything that you can get the address of with the & operator, and additionally bit fields. A bit field, e.g. int a : 3;, has an identity, but taking its address is not permitted, because it often does not start on a byte boundary or occupy a whole number of bytes (the fundamental char type).

Therefore, what identity is, and what an lvalue actually is (in abstract terms), is a descriptor of storage. I.e. identity describes a place, not necessarily an address. If you have an lvalue, you can do lvalue type things with it. Chiefly, assign something to the lvalue (again, if non-const).
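
A bit field makes the "place, not necessarily an address" point concrete. The following is a minimal sketch, assuming C++14 or later for the constexpr function; Flags and store_in_bit_field are hypothetical names used only for illustration.

```cpp
struct Flags
{
    unsigned ready : 3; // a bit field: three bits wide, no address of its own
    unsigned count : 5;
};

// Hypothetical helper: assigns through the bit field and reads it back.
constexpr unsigned store_in_bit_field(unsigned value)
{
    Flags f = {0, 0};
    f.ready = value; // fine: f.ready is an lvalue, it designates storage
    // unsigned *p = &f.ready; // would not compile: a bit field has no address
    return f.ready;
}

static_assert(store_in_bit_field(5) == 5, "the bit field kept what was assigned");
```

The assignment works even though the commented-out address-of line would be rejected, which is the distinction between having an identity and having an address.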

Expressions such as 3 + 4 don't have addresses, at least none that are visible in the model the compiler presents to the user. The compiler could store that result somewhere in memory if it wanted to. Typically though it will be stored in a CPU register, or it could even be re-computed; these are all valid possibilities. From the programmer's perspective, however, they are ethereal things with no identity. These things (in C++98) we call rvalues.

Identity is needed when an object is to be changed, for example when using the assignment operator. Identity doesn't mean something has to have an associated variable name, consider this lvalue: *new int = 5.

C++(98) in short:

lvalue - People are now calling these "locator values". Previously the name was due to these being found on the left hand side of an assignment operator. They always have an identity.

rvalue - Computation result - has no identity. You can't use one to refer to an object, but you can know the value of its evaluation.


Value categories in C++11 and beyond

Before talking about the new value categories in C++11, let's first say why the new categories were needed. Essentially, they are required to support move semantics, that is, the ability to transfer data from one object to another without making a copy, i.e. without preserving the original.

This is important; consider a std::vector. If we can somehow know that a vector needs to be transferred to a new place (e.g. out of a function) and we know that the place where it is now isn't going to be used again, then its internal data can be re-assigned, that is moved, without any expensive copy. In the case of std::vector, the internal pointer that refers to the vector's data is given away to a new recipient and the data itself is untouched: a significant time (and memory) saving.
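
A minimal sketch of that pointer hand-over, assuming (as mainstream standard libraries do) that move construction of a std::vector simply adopts the source's buffer. move_keeps_buffer is a hypothetical helper name.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: reports whether move construction reused the buffer.
bool move_keeps_buffer()
{
    std::vector<int> source(1000, 42);
    const int *buffer = source.data();          // where the elements live now

    std::vector<int> target(std::move(source)); // move construction, no element is copied

    assert(target.size() == 1000);              // all elements arrived
    return target.data() == buffer;             // same buffer, now owned by target
}
```

The thousand ints never move in memory; only the vector's bookkeeping changes hands.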

To do this we need to determine what can be moved, and when. In some ways a move is like any other operator, such as + or -, except that it is implicit (i.e. compiler activated). [Note that programmers can use std::move() to make something movable; however, that construct doesn't move anything, it only activates the possibility of a move by changing the value category of the argument.] In fact std::move() simply performs an rvalue reference cast; it's instructive to look at the standard library source for std::move to see that this is so.

This, then, means that a new value category was defined, and expressions in this category can be moved. During expression evaluation the compiler will put an expression in this category when appropriate, and of course when a programmer says std::move(). It's important to note, though, that you can use std::move on something where it isn't needed, in particular when the expression can already be moved, e.g. std::move(5) compiles, but is redundant because 5 is already an rvalue.
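
To make the "it is only a cast" point concrete, here is a sketch of what such a function can look like. my_move is a hypothetical stand-in written for this article; real library implementations differ in detail, so treat it as an illustration rather than the actual source.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for std::move: nothing but a cast to an rvalue reference.
template <typename T>
constexpr typename std::remove_reference<T>::type &&my_move(T &&t) noexcept
{
    return static_cast<typename std::remove_reference<T>::type &&>(t);
}

std::string s;

// decltype((e)) is T& for an lvalue and T&& for the result of the cast.
static_assert(std::is_same<decltype((s)), std::string &>::value,
              "s on its own is an lvalue");
static_assert(std::is_same<decltype((my_move(s))), std::string &&>::value,
              "the cast changes the category, nothing is moved");
static_assert(std::is_same<decltype((std::move(s))), decltype((my_move(s)))>::value,
              "std::move yields the same kind of expression");
static_assert(std::is_same<decltype((std::move(5))), int &&>::value,
              "compiles, but 5 was already an rvalue");
```

No string is touched by any of these lines; decltype does not evaluate its operand, and the cast by itself would not change s even if it were evaluated.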

With all this in mind, the value categories in C++11 and above are these:

    value category (from an expression)
       /      \
   glvalue  rvalue          (mixed categories)
    /   \   /   \
lvalue  xvalue  prvalue     (primary categories)

Above we see the familiar lvalue and rvalue categories, as well as some new ones. The bottom row holds the primary categories, and every expression is in exactly one of them. The categories above those, glvalue and rvalue, are called mixed categories. They are simply groupings of the primary categories, but note that both contain the xvalue, hence the name. An expression is also in at least one of these, but that is less specific. The less specific categories are needed because some rules apply to the more general mixed case. In other words the compiler (and the C++ standard) will apply a rule to a mixed category, say rvalue, because it applies to both xvalues and prvalues.

What each category means

lvalue: It has an identity, can be located (somehow), it doesn't have to have an obvious name in the code. Remember *new int is an lvalue but has no identifier associated with it.

xvalue: An "expiring value": something that is, or can be treated as, expired. It also has an identity. Unlike a prvalue, an xvalue always refers to an object that already exists.

prvalue: a so-called "pure rvalue". These are things without any identity, often literals such as 5 or 'd', but also expressions of class type, like say std::string, that are not references, e.g. a string returned by value from a function is a prvalue, and so is 8. These are all temporary values or objects that the compiler generates, or materialises. A materialisation is when a prvalue becomes an xvalue, e.g.

int &&a = 6;

// The prvalue 6 has materialised into a temporary, and a refers to it.
// The expression a is an lvalue; why this is so will be discussed later.
a = 7;

The mixed categories are as follows:

glvalue: A category that has an identity. This doesn't say anything about whether it can be moved or not.

rvalue: This category can always be moved, but doesn't say whether it has an identity or not.
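
The whole taxonomy can be spelled out as a handful of compile-time checks. This is a sketch, not a library facility: the is_... traits, name, make_name and r are hypothetical names, and the traits are meant to be fed with decltype((expression)), which gives T& for an lvalue, T&& for an xvalue and a plain T for a prvalue.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical helper traits for the three primary categories.
template <typename T> struct is_lvalue : std::is_lvalue_reference<T> {};
template <typename T> struct is_xvalue : std::is_rvalue_reference<T> {};
template <typename T> struct is_prvalue
    : std::integral_constant<bool, !std::is_reference<T>::value> {};

// The mixed categories are just unions of the primary ones.
template <typename T> struct is_glvalue
    : std::integral_constant<bool, is_lvalue<T>::value || is_xvalue<T>::value> {};
template <typename T> struct is_rvalue
    : std::integral_constant<bool, is_xvalue<T>::value || is_prvalue<T>::value> {};

std::string name;
std::string make_name(); // only declared: decltype never calls it
int &&r = 6;

static_assert(is_lvalue<decltype((name))>::value, "a named object is an lvalue");
static_assert(is_xvalue<decltype((std::move(name)))>::value, "std::move yields an xvalue");
static_assert(is_prvalue<decltype((make_name()))>::value, "a by-value result is a prvalue");
static_assert(is_lvalue<decltype((r))>::value, "a named rvalue reference is an lvalue");

// The xvalue is the one primary category that sits in both mixed categories.
static_assert(is_glvalue<decltype((std::move(name)))>::value &&
              is_rvalue<decltype((std::move(name)))>::value, "xvalue: glvalue and rvalue");
static_assert(!is_rvalue<decltype((name))>::value, "a plain lvalue is not an rvalue");
```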

The rules (roughly)

  1. If an expression is an rvalue, it can be moved.
  2. That is in fact the only rule. A glvalue can be moved only if it is an xvalue:

Notice that an xvalue is in both mixed categories glvalue and rvalue. This then means that a glvalue, that is an xvalue, is also an rvalue, since an xvalue is in the rvalue mixed category too. This "trick of the light" is how it becomes possible to move an lvalue - this can only be done if it is coerced into an xvalue explicitly, which of course then also makes it an rvalue, and all rvalues can be moved. This is exactly what std::move does, it converts an expression into an rvalue reference - this conversion actually yields an xvalue (has identity).

The last piece of the puzzle: rvalue references

There are a few things to know about rvalue references because there are some special rules involved to make the whole machinery of "move semantics" work correctly, and not be subject to accidental moves.

rvalue references are denoted thus: T && name, similar to the original lvalue references with a single ampersand. rvalue references have these important properties:

  1. They can be used anywhere other references are used. Function arguments, local scope variables, globals.
  2. They are specifically important in implementing move constructors and move assignment operators. This is the crux of movability. The functions implemented on a class can run code that will do the actual moving of data to a new home.
  3. If the name of an rvalue reference is used in an expression at any point, for anything, that expression is an lvalue, e.g.:
void ref_param(SomeClass && c)
{
	c.some_function(); // the expression c is an lvalue here
}
  4. If you intend to forward on an rvalue reference to another function then you must use std::forward (possibly std::move).

It is unusual to "pass on" an rvalue reference in this way. Normally std::forward is used in a templated function. But importantly, the example below clearly shows that the expression c, which names an rvalue reference, is indeed an lvalue; see the error message in the comment:

#include <utility>

class SomeClass {};

void other_function(SomeClass && c)
{
}

void ref_param(SomeClass && c)
{
	// Uncommenting the following line gives an error
	// other_function(c);
	// error: cannot bind rvalue reference of type ‘SomeClass&&’ to lvalue of type ‘SomeClass’

	// Must forward, or move, because mentioning a reference by name always
	// makes the expression an lvalue, no matter what.

	// This
	other_function(std::forward<SomeClass>(c));
	
	// or this
	other_function(std::move(c));
}
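
The same behaviour can be observed without provoking an error by letting overload resolution report what it was given. This is a sketch assuming C++14 or later (so that std::move and std::forward are usable in constant expressions); received, by_name, by_move, relay and relay_lvalue are hypothetical names.

```cpp
#include <utility>

class SomeClass {};

// Two overloads that report which kind of expression they were given.
constexpr int received(const SomeClass &) { return 1; } // lvalues land here
constexpr int received(SomeClass &&) { return 2; }      // rvalues land here

constexpr int by_name(SomeClass &&c) { return received(c); }            // c is an lvalue
constexpr int by_move(SomeClass &&c) { return received(std::move(c)); } // cast back to an rvalue

// In a template, T && can bind to anything, and std::forward restores
// whatever category the caller's argument had.
template <typename T>
constexpr int relay(T &&value) { return received(std::forward<T>(value)); }

constexpr int relay_lvalue()
{
    SomeClass named{};
    return relay(named); // an lvalue goes in, so an lvalue is forwarded
}

static_assert(by_name(SomeClass{}) == 1, "a named rvalue reference is passed on as an lvalue");
static_assert(by_move(SomeClass{}) == 2, "std::move makes it an rvalue again");
static_assert(relay(SomeClass{}) == 2, "std::forward keeps an rvalue an rvalue");
static_assert(relay_lvalue() == 1, "std::forward keeps an lvalue an lvalue");
```

This also shows why std::forward belongs in templates: only there can the same parameter receive both lvalues and rvalues.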

Using rvalue references

The important new notation of the double ampersand, T &&, allows you to write functions that can "bind" to rvalues. Binding means that when the compiler considers whether a function can match a call, it checks that the arguments passed can be bound to the parameters the function requires, before it treats the function as a candidate.
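
The binding rules can be spot-checked at compile time. This is a minimal sketch, assuming that std::is_convertible from <type_traits> follows the same reference binding rules as passing an argument to a parameter; Box is a hypothetical empty type.

```cpp
#include <type_traits>

struct Box {};

// std::is_convertible<From, To> asks: can a To be initialised from an
// expression of type From? A From of Box& stands for an lvalue argument,
// a plain Box for an rvalue argument.
static_assert(std::is_convertible<Box &, Box &>::value, "Box& binds to an lvalue");
static_assert(!std::is_convertible<Box, Box &>::value, "Box& does not bind to an rvalue");
static_assert(std::is_convertible<Box &, const Box &>::value, "const Box& binds to an lvalue");
static_assert(std::is_convertible<Box, const Box &>::value, "const Box& binds to an rvalue too");
static_assert(!std::is_convertible<Box &, Box &&>::value, "Box&& does not bind to an lvalue");
static_assert(std::is_convertible<Box, Box &&>::value, "Box&& binds to an rvalue");
```

The last two lines are the new part: a parameter of type Box && accepts rvalues and nothing else.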

What this means is that you can write functions that bind only to rvalues, something not possible before C++11. Thus, you can write a move constructor or move assignment operator to do the moving of object resources to the new location, like so:

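class HeavyBox
{
public:
	// Needed so that "HeavyBox x;" still works once a move constructor exists
	HeavyBox() = default;

	HeavyBox(HeavyBox && box)
	{
		this->largeData = box.largeData;
		box.largeData = nullptr;
	}
	
	// And most usually, one of these too
	HeavyBox & operator=(HeavyBox && right); 
private:
	char * largeData = nullptr;
};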

Above, the programmer writes a move constructor to move data into a newly created object. It is usually also the case that the programmer will write a move assignment operator, so that moves can also work in the case of

a = <some rvalue of HeavyBox>

e.g.

a = some_function(); // An rvalue of specific category prvalue will be moved
// or
HeavyBox x;
a = std::move(x); // An rvalue of specific category xvalue will be moved

You can see here why there are mixed categories: the move assignment operator, like the move constructor, expects an rvalue. In the examples above, this works for both prvalues and xvalues.
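
For completeness, here is one way the HeavyBox above could be fleshed out so that it actually runs. It is a sketch under assumptions of my own: the sized constructor, the destructor, data(), make_box and moves_transfer_ownership are hypothetical additions, and the buffer is a plain new[] allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

class HeavyBox
{
public:
    explicit HeavyBox(std::size_t size) : largeData(new char[size]()) {}
    ~HeavyBox() { delete[] largeData; }

    HeavyBox(HeavyBox &&box) noexcept : largeData(box.largeData)
    {
        box.largeData = nullptr; // the source no longer owns the buffer
    }

    HeavyBox &operator=(HeavyBox &&right) noexcept
    {
        if (this != &right) {
            delete[] largeData; // release what we held before
            largeData = right.largeData;
            right.largeData = nullptr;
        }
        return *this;
    }

    const char *data() const { return largeData; }

private:
    char *largeData;
};

HeavyBox make_box() { return HeavyBox(64); }

// Hypothetical helper: exercises both kinds of rvalue.
bool moves_transfer_ownership()
{
    HeavyBox x(16);
    const char *buffer = x.data();

    HeavyBox a(8);
    a = std::move(x);            // xvalue: picks the move assignment operator
    assert(x.data() == nullptr); // x gave its buffer away
    bool took_buffer = (a.data() == buffer);

    a = make_box();              // prvalue: also picks the move assignment operator
    HeavyBox b(std::move(a));    // xvalue: picks the move constructor
    return took_buffer && a.data() == nullptr && b.data() != nullptr;
}
```

Declaring the move operations means the compiler does not generate copy operations, so a HeavyBox written this way can only be moved, never copied.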

There is more to be said about writing move constructors and move assignment operators, but that is a bit beyond the topic at hand. Now, hopefully value categories, and what they are should be a little more apparent.

Practical summary

  1. Expressions have value categories.
  2. rvalues can always be moved.
  3. rvalue references that have a name, when evaluated as an expression, become lvalue expressions. A name used as an expression is called an id-expression: a is an id-expression, a + 1 isn't. Therefore you have to std::move or std::forward a named rvalue reference if you intend to pass it on to another function as an rvalue reference.
  4. category conversion rules:
    1. an lvalue can be converted to an rvalue (but not to an rvalue reference), e.g. a + 1. Here, a is converted to an rvalue (its value is taken) so that it can be used in the expression <rvalue> + 1. The result is an rvalue too, but that doesn't stop it being used to form an lvalue:
int main()
{
    int x[2];
    int * a = &x[0];

    // a is converted to an rvalue, (a + 1) is also an rvalue
    // *(a + 1) is an lvalue though;
    *(a + 1) = 5;
}
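
The comments in that example can be turned into compile-time checks. A small sketch, assuming C++11 or later, using the same x and a and relying on decltype((e)) giving T& for an lvalue and a plain T for a prvalue:

```cpp
#include <type_traits>

int x[2];
int *a = &x[0];

static_assert(std::is_same<decltype((a)), int *&>::value,
              "a on its own is an lvalue");
static_assert(std::is_same<decltype((a + 1)), int *>::value,
              "a + 1 is an rvalue: a's value was read and a new value computed");
static_assert(std::is_same<decltype((*(a + 1))), int &>::value,
              "dereferencing that rvalue gives an lvalue again");
```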

Interesting refs:

[1] Lifetimes are not the same as categories: https://quuxplusone.github.io/blog/2020/03/04/rvalue-lifetime-disaster/