tag:blogger.com,1999:blog-2420869991974935734.post4862459450842043992..comments2023-07-02T17:04:56.386+02:00Comments on Krister Walfridsson’s old blog: Integer division is slowKrister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.comBlogger6125tag:blogger.com,1999:blog-2420869991974935734.post-20808973210101519042017-06-05T23:08:56.971+02:002017-06-05T23:08:56.971+02:00An alternative is to just use bit masking:
uint64...An alternative is to just use bit masking:<br /><br />uint64_t m = 0xffffffff >> __builtin_clz(range);<br />do {<br /> ret = randGenerator() & m;<br />} while (ret > range);<br /><br />The downside is that this calls the RNG twice in expectation for bad choices of range, and probably yields worse results with bad RNGs since bits are not mixed any further, but in a quick experiment I did it was never slower than division for mt19937, and a lot faster with a fast RNG (xoroshiro128+).Falkhttps://www.blogger.com/profile/05618060968564744404noreply@blogger.comtag:blogger.com,1999:blog-2420869991974935734.post-48494133508945281012017-05-17T19:59:20.956+02:002017-05-17T19:59:20.956+02:00If the hardware took a constant 103 cycles that co...If the hardware took a constant 103 cycles that could be 100% overlapped with other instructions that didn't need the result, while a software solution took 80 cycles but couldn't overlap anything, there would be many cases where the hardware solution would be "better". Otherwise, I'm curious how much time would be required to e.g. convert the divisor to "float" or "double", compute the approximate or actual reciprocal, separate out the significant and exponent, and then use a successive approximation. Given a 53-bit reciprocal, two iterations should yield a result within +/1 of being correct, and one more compare/subtract should be able to clean that up.supercathttps://www.blogger.com/profile/12531904492602532373noreply@blogger.comtag:blogger.com,1999:blog-2420869991974935734.post-41708103237335991022017-05-17T13:23:41.856+02:002017-05-17T13:23:41.856+02:00I believe the compiler would be able to do the opt...I believe the compiler would be able to do the optimization if the distribution were declared const. But that gives syntax error that there is no matching function for call to object of type \(\verb!const std::uniform_int_distribution<int>!\)Krister Walfridssonhttps://www.blogger.com/profile/02591279630933941271noreply@blogger.comtag:blogger.com,1999:blog-2420869991974935734.post-74745417005545956772017-05-17T12:00:43.942+02:002017-05-17T12:00:43.942+02:00My thought is when you say the compiler can't ...My thought is when you say the compiler can't tell that the distribution hasn't changed - can it work with a distribution which is declared const? Could a compiler, in theory, use such information to lead it to the same realisation? Or would the ability to cast away const qualifiers mean the compiler could just never make that assumption?Matt Waltonhttps://www.blogger.com/profile/04683084278575333885noreply@blogger.comtag:blogger.com,1999:blog-2420869991974935734.post-8335453383161154762017-05-17T01:10:58.055+02:002017-05-17T01:10:58.055+02:00You should not be able to beat the hardware instru...You should not be able to beat the hardware instruction in general — if so, the the CPU vendor could have saved some silicon area by not implementing the instruction. Or a good compiler would call an asm implementation instead of generating this slow instruction. 😀<br /><br />But it <i>might</i> be possible if you have special knowledge of the possible input values. But note that the 103 cycles is what the document list as worst case. The division in this example seem to execute in about 40 cycles, which is much harder to beat with a software implementation...Krister Walfridssonhttps://www.blogger.com/profile/02591279630933941271noreply@blogger.comtag:blogger.com,1999:blog-2420869991974935734.post-45697232138852365162017-05-16T22:11:03.562+02:002017-05-16T22:11:03.562+02:00If a compiler wouldn't be able to do anything ...If a compiler wouldn't be able to do anything else useful while waiting for integer division results, could a user-written routine outperform the hardware-divide instruction? A 64-bit division would seem long enough that even if a little setup would be needed to allow eight bits of quotient to be computed per successive-approximation step, reducing the number of iterative steps from 64 to 8 could pay off.supercathttps://www.blogger.com/profile/12531904492602532373noreply@blogger.com