Krister Walfridsson’s old blog: The cost of clearing memory

The technical report "The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V" looks at how the RISC-V ISA compares to ARM and Intel by analyzing the results from running SPEC CINT2006. One thing that surprised me was that 30% of the instructions executed when running the 403.gcc benchmark (which compiles a few files using a modified GCC 3.2) come from memset!

The RISC-V memset loop writes 16 bytes in 4 instructions:

// RV64G, 4 instructions to move 16 bytes
4a3814:   sd    a1, 0(a4)
4a3818:   sd    a1, 8(a4)
4a381c:   addi  a4, a4, 16
4a3820:   bltu  a4, a3, 4a3814

which is somewhat less efficient than ARM and Intel, which write more data per instruction:

// armv8, 6 instructions to move 64 bytes
6f0928:   stp   x7, x7, [x8,#16]
6f092c:   stp   x7, x7, [x8,#32]
6f0930:   stp   x7, x7, [x8,#48]
6f0934:   stp   x7, x7, [x8,#64]!
6f0938:   subs  x2, x2, #0x40
6f093c:   b.ge  6f0928

so the ARM loop needs roughly 0.09 instructions per byte against 0.25 for RISC-V, which should translate to about 10% of the executed ARM instructions coming from memset. But that is still much more than I would have guessed.
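As a rough check of that estimate, here is a minimal sketch of the arithmetic using the 30% figure and the two loops above. The function name and the assumption that the non-memset instruction count is identical on both ISAs are mine, not from the report. Scaling the memset count alone gives about 11%; renormalizing against the smaller total gives about 14%, so "about 10%" is the right ballpark.

```python
def memset_share(riscv_share, riscv_insns_per_byte, other_insns_per_byte):
    """Estimate memset's share of executed instructions on another ISA."""
    # Scale the memset instruction count by the ratio of instructions per byte.
    scaled = riscv_share * other_insns_per_byte / riscv_insns_per_byte
    # Assumption: the non-memset instruction count is the same on both ISAs.
    rest = 1.0 - riscv_share
    return scaled / (rest + scaled)


riscv = 4 / 16  # RV64G loop: 4 instructions per 16 bytes
armv8 = 6 / 64  # armv8 loop: 6 instructions per 64 bytes

share = memset_share(0.30, riscv, armv8)
print(f"{share:.1%}")  # prints 13.8%
```

This ignores everything outside the inner loop (setup, tail handling, alignment), so it is only a sanity check of the order of magnitude.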

I do not have access to SPEC, and I have been too lazy to try to replicate this with other data, but a quick literature search indicates that this is not as insane as I thought. The papers I have found look at the cost of clearing data in garbage collection implementations, and they seem to get similar results for the cost. For example, "Why Nothing Matters: The Impact of Zeroing" says:

We show that existing approaches of zero initialization are surprisingly expensive. On three modern IA32 architectures, the direct cost is around 2.7-4.5% on average and as much as 12.7% of all cycles, in a high-performance Java Virtual Machine (JVM), without accounting for indirect costs due to cache displacement and memory bandwidth consumption.

Sunday, September 25, 2016
