Sunday, May 21, 2017

Seeding the std::mt19937 random number engine

A comment on Hacker News complained that the code in my previous blog post does not seed the std::mt19937 random number engine properly. The code was taken directly from a CppCon presentation, so I don’t want to take the blame, but the comment is right — the initialization code can be improved.

State size and seeding

The initialization in the blog post was done as
std::random_device rd;
std::mt19937 gen(rd());
which seeds the std::mt19937 random number engine with a random 32-bit value. The problem with this is that that the Mersenne twister has 19968 bits of internal state so it can generate \(2^{19968}\) streams of random values, but we can only reach \(2^{32}\) of those states when initializing with a 32-bit value.

This is not necessarily a problem. Let’s say the random numbers are used for generating input data in unit tests. The test suite is probably not run more than a few thousand times, so it does not matter that it only can create \(2^{32}\) different test runs. But there are use-cases where this is a problem.

The random number engine can be seeded with more data by using std::seed_seq, and the code below seeds the std::mt19937 with the same number of bits as are in the state
std::random_device rd;
std::array<int, std::mt19937::state_size> seed_data;
std::generate_n(seed_data.data(), seed_data.size(), std::ref(rd));
std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
std::mt19937 gen(seq);

std::random_device

One other potential problem is the quality of the seed values. The idea behind std::random_device is that it returns non-deterministic random numbers, but it is allowed to return deterministic values (e.g. if a non-deterministic source is not available to the implementation). I’m not a big fan of this functionality — it either does exactly what you want (generates non-deterministic values) or it does the opposite (generates deterministic values), and there is no way you can determine which.1

This is probably not a problem when developing for the big platforms, but there may be surprises when running the code in other environments — at least old versions of libstdc++ on MinGW always return the same sequence of values...


1. The std::random_device can return an estimate of the entropy, and it is required to return 0 if the values are generated deterministically. But it is not required to return non-zero for the non-deterministic case, and e.g. libstdc++ is conservative and always estimates the entropy as 0, even when /dev/urandom or the x86 RDRND instruction are used.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.