Optimizing RandomX: Loop Invariant Extraction

As seen in my last post, most of the RandomX execution time is spent in the execution of the randomly generated program. Each instruction in a loop executes an average of 620 times within a hash round, or 1.2 million times vs. the 2048 executions per hash.

Roughly half, or 128, of the instructions end up in loops.

Invariant Candidates

RandomX instructions are designed to mix inputs and outputs. This means the output of most instructions won’t be invariant, but there are a couple instructions where the inputs and outputs are independent:

  • CFROUND
  • ISTORE

CFROUND sets the floating point rounding mode based on an input register and the immediate instruction bits. If the input register is unchanged between calls to the same instruction, then the same rounding mode is set each time.

ISTORE takes 2 input registers, src and dst, and writes src to a memory location dependent on dst and other bits within the instruction. If src and dst are unchanged between calls to the same ISTORE instruction, the same value is written to the same location.

Invariant Rules

The rules for when an instruction, inst, can leave the loop are roughly as follows:

  • If the output of inst is never used in the loop AND the inputs never change in the loop, then inst can be moved off the back of the loop — this is obviously invariant.
  • If the output of inst is never used in the loop AND the inputs change, but entirely before the instruction AND the outputs are written completely, then inst can be moved off the back of the loop. If all input changes happen before the instruction, then the output will be set multiple times, but only the last one matters because the output doesn’t get used within the loop. This only works, though, if each write completely overwrites the previous. This is important for ISTORE because changing memory addresses means that each write is contributing to the state of the hash rather than just the last one.
  • If the inputs of inst never change AND inst is the only instruction that sets its output AND the output is either never used in the loop OR it’s used entirely after inst, then inst can be moved off the front of the loop.

These rules don’t catch every case in which one of these instructions could move, but they get most of the candidates that I manually confirmed. If you know of something big that I missed, please leave a comment.

Results of Instruction Moving

When applied, these rules move an average 2.5 instructions per program out of RandomX loops. There’s a lot of variability in these programs, but using the above stated averages, this should reduce hash runtime from: (128 x 620 + 128) x 2048 to: (125.5 x 620 + 130.5) x 2048. That’s an expected improvement of ~2% of hash program execution time.

However, when measured, no noticeable gain is seen. In fact, the runtimes between two versions — one that moves instructions and one that doesn’t — are virtually identical. The variance between runs is less than 0.5%, so a 1-2% change should have been apparent.

This surprising result led me to reconsider what the system hardware might be doing. It’s possible that the hash is bound by memory bandwidth rather than instruction throughput, meaning that between my standard and “optimized” runs the execution time might be the same while the processor might be doing less. This might show up in power consumption measurements, but that measurement is not something I’m set up to do.

The Cost

Unfortunately, the analysis and reordering of the instructions takes some work, and while I probably don’t have a particularly fast implementation, the work required to do this increases program compilation cost by 4-5x. That makes this first optimization, even if the expected results were seen, an overall loser. For an improvement of 2% in program execution time, compilation time would need to be within 2x of original. However, once the analysis is being done, it can potentially be applied to other things.