
Optimizing RandomX: Loop Invariant Extraction

As seen in my last post, most of the RandomX execution time is spent executing the randomly generated program. Each instruction inside a loop executes an average of 620 times per program iteration, which works out to roughly 1.2 million executions over a program’s 2048 iterations, versus the 2048 executions of an instruction outside a loop.

Roughly half, or 128, of the instructions end up in loops.

Invariant Candidates

RandomX instructions are designed to mix inputs and outputs. This means the output of most instructions won’t be invariant, but there are a couple of instructions whose inputs and outputs are independent:

  • CFROUND
  • ISTORE

CFROUND sets the floating point rounding mode based on an input register and the immediate instruction bits. If the input register is unchanged between calls to the same instruction, then the same rounding mode is set each time.

ISTORE takes 2 input registers, src and dst, and writes src to a memory location dependent on dst and other bits within the instruction. If src and dst are unchanged between calls to the same ISTORE instruction, the same value is written to the same location.
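
For context, here’s a minimal sketch of what these two instructions do, paraphrased from my reading of the RandomX specification. The function names, rounding-mode ordering, and address mask below are simplified stand-ins of mine rather than the actual implementation, but they show why the result depends only on src, dst, and the immediate bits.

    #include <cfenv>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Rotate right; CFROUND uses this to derive a rounding mode from src.
    static uint64_t rotr64(uint64_t v, unsigned c) {
        c &= 63;
        return c ? (v >> c) | (v << (64 - c)) : v;
    }

    // CFROUND: the rounding mode depends only on src and the immediate bits,
    // so if src never changes inside the loop, the same mode is set every pass.
    void cfround(uint64_t src, uint32_t imm) {
        static const int modes[4] = { FE_TONEAREST, FE_DOWNWARD, FE_UPWARD, FE_TOWARDZERO };
        std::fesetround(modes[rotr64(src, imm) & 3]);
    }

    // ISTORE: src is written to a scratchpad address derived from dst and the
    // immediate. The mask is a simplified stand-in for RandomX's L1/L2/L3
    // address selection and is assumed to keep the address inside the scratchpad.
    // If src and dst never change inside the loop, the same value lands at the
    // same address every pass, so only the last store matters.
    void istore(std::vector<uint8_t>& scratchpad, uint64_t dst, int32_t imm,
                uint64_t src, uint64_t mask) {
        uint64_t addr = (dst + static_cast<uint64_t>(static_cast<int64_t>(imm))) & mask;
        std::memcpy(&scratchpad[addr], &src, sizeof(src));
    }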

Invariant Rules

The rules for when an instruction, inst, can leave the loop are roughly as follows:

  • If the output of inst is never used in the loop AND its inputs never change in the loop, then inst can be moved off the back of the loop; it is obviously invariant.
  • If the output of inst is never used in the loop AND its inputs do change, but only before inst executes, AND each execution completely overwrites the output, then inst can also be moved off the back of the loop. The output gets written multiple times, but only the last write matters because nothing in the loop reads it, and that only holds if each write fully replaces the previous one. This caveat matters for ISTORE: if the memory address changes between executions, every write contributes to the state of the hash rather than just the last one.
  • If the inputs of inst never change AND inst is the only instruction that sets its output AND the output is either never used in the loop OR used only after inst, then inst can be moved off the front of the loop.

These rules don’t catch every case in which one of these instructions could move, but they get most of the candidates that I manually confirmed. If you know of something big that I missed, please leave a comment.
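
To make these rules concrete, here’s a rough sketch of the kind of check involved. The Inst record and the classify helper are hypothetical names of mine, not anything from the RandomX or miner source, and real code would also have to treat memory reads and partially-written outputs conservatively.

    #include <cstddef>
    #include <vector>

    // Hypothetical decoded-instruction record; not the real RandomX layout.
    struct Inst {
        int dst;                 // register (or other state) written by the instruction
        std::vector<int> srcs;   // registers read by the instruction
        bool fullOverwrite;      // true if each execution completely rewrites the output
    };

    enum class Hoist { None, Back, Front };

    // Decide whether loop[i] can be hoisted out of the loop, per the rules above.
    Hoist classify(const std::vector<Inst>& loop, size_t i) {
        const Inst& inst = loop[i];

        bool outputUsed = false, usedOnlyAfter = true, onlyWriter = true;
        bool inputsChange = false, changeOnlyBefore = true;

        for (size_t j = 0; j < loop.size(); ++j) {
            if (j == i) continue;
            const Inst& other = loop[j];
            for (int s : other.srcs)            // does anything else read inst's output?
                if (s == inst.dst) { outputUsed = true; if (j < i) usedOnlyAfter = false; }
            if (other.dst == inst.dst)          // does anything else write inst's output?
                onlyWriter = false;
            for (int s : inst.srcs)             // does anything change inst's inputs?
                if (other.dst == s) { inputsChange = true; if (j > i) changeOnlyBefore = false; }
        }

        if (!outputUsed && !inputsChange)
            return Hoist::Back;     // rule 1: fully invariant, move off the back
        if (!outputUsed && changeOnlyBefore && inst.fullOverwrite)
            return Hoist::Back;     // rule 2: only the final write matters
        if (!inputsChange && onlyWriter && (!outputUsed || usedOnlyAfter))
            return Hoist::Front;    // rule 3: compute once before the loop
        return Hoist::None;
    }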

Results of Instruction Moving

When applied, these rules move an average of 2.5 instructions per program out of RandomX loops. There’s a lot of variability between programs, but using the averages stated above, this should reduce the work per program from (128 x 620 + 128) x 2048 instruction executions to (125.5 x 620 + 130.5) x 2048. That’s an expected improvement of ~2% in hash program execution time.
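
Expanding those expressions (my arithmetic, using the same averages):

    (128 x 620 + 128) x 2048     = 79,488 x 2048   ≈ 162.8 million executions per program
    (125.5 x 620 + 130.5) x 2048 = 77,940.5 x 2048 ≈ 159.6 million executions per program

    159.6 / 162.8 ≈ 0.981, i.e. just under 2% fewer instruction executions per program.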

However, when measured, no noticeable gain is seen. In fact, the runtimes between two versions — one that moves instructions and one that doesn’t — are virtually identical. The variance between runs is less than 0.5%, so a 1-2% change should have been apparent.

This surprising result led me to reconsider what the system hardware might be doing. It’s possible that the hash is bound by memory bandwidth rather than instruction throughput, meaning that between my standard and “optimized” runs the execution time might be the same while the processor might be doing less. This might show up in power consumption measurements, but that measurement is not something I’m set up to do.

The Cost

Unfortunately, the analysis and reordering of the instructions takes some work, and while I probably don’t have a particularly fast implementation, the work required increases program compilation cost by 4-5x. That makes this first optimization an overall loser, even if the expected results had been seen: for an improvement of 2% in program execution time to pay off, compilation time would need to stay within about 2x of the original. However, once the analysis is being done, it can potentially be applied to other things.
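
As a rough sanity check of that break-even point, here’s my arithmetic using the Ryzen 2700 phase breakdown from my last post, where compilation is about 2% of hash time and execution about 94%:

    execution-time savings   ≈ 0.94 x 0.02      ≈ 1.9% of total hash time
    compilation at 4-5x      ≈ 0.02 x (4 to 5)  ≈ 8-10% of total (an added 6-8%)
    break-even multiplier k:   0.02 x (k - 1) ≈ 0.019  →  k ≈ 2x

So compilation can roughly double before the added cost consumes the entire expected execution-time gain.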

A Look At RandomX

I recently dug into the new proof-of-work algorithm powering mining for the Monero crypto-currency, RandomX. It was rolled out to the Monero blockchain early in December of 2019, replacing the CryptoNight algorithm that had been used before it. Several other crypto-currencies are prepping to follow Monero down the RandomX path, so I thought it might be worth investigating. For more background on proof-of-work algorithms, see here.

One of the problems with proof-of-work algorithms and crypto-currencies is that the money reward for faster processing creates a positive feedback loop for the first ones to optimize. The first optimization turns into an increase in income which then makes it easier to make the next optimization, repeat. This usually results in most of the mining rewards going to a few large miners while being prohibitively expensive for a newcomer to get started.

Several crypto-currencies have made it a point to ensure that consumer hardware is competitive when mining to keep the mining workload distributed, which theoretically makes the crypto-currency more secure and definitely makes it more approachable. Ethereum and Monero have been particularly good at this with Ethash and CryptoNight, respectively, working well via consumer GPU computing and resisting FPGA or ASIC optimizations.

RandomX is the latest attempt, and it initially appears to be resistant even to GPU optimization. It does this by making maximum use of the typical CPU architecture, so that effective hardware optimization virtually requires designing a full processor. That is a costly process which standard CPU manufacturers offset with volume sales.

Performance

Here’s a quick look at the performance of this algorithm on three different systems:

  • AMD Ryzen 2700 w/ 2-channel DDR4 3200 CAS 14
  • AMD Ryzen TR 3970X w/ 4-channel DDR4 3200 CAS 14
  • AMD EPYC 7502P w/ 8-channel DDR4 3200 CAS 24 (Buffered ECC)

I built the xmr-stak-rx 1.0.4 miner from source, which can be found here. Make sure to apply the large memory page optimizations. You’ll need at least 4GB of RAM and 2MB of L3 cache.

System            Performance (kH/s)    System Power (W)
Ryzen 2700        4.8                   105
Ryzen TR 3970X    24                    355
EPYC 7502P        28                    235

Structure of a Hash

The phases of the hash proceed as:

  • Initialization: AES-based scratchpad (2MB) fill
  • Program generation – AES-based hash (typically hardware accelerated)
  • Program compilation (for JIT-enabled hasher, much slower without)
  • Program execution
  • Results mixing

Here’s a breakdown of where the per-hash time is spent on the Ryzen 2700:

Phase                  Percent of Execution
Program Generation     0.24
Program Compilation    1.96
Program Execution      93.84
Results Mixing         3.95

As we can see, the majority of time is spent in program execution. If we’re going to make significant improvements, it’s likely going to be there.

The amount of work per hash breaks down roughly like this:

  • 8 programs generated and executed
    • 2048 iterations of
      • 256 randomly generated instructions including an average of 25 loop instructions
        • Average of 620 iterations per loop instruction (measured)

Each instruction inside an inner loop runs an average of roughly 1.2 million times; every other generated instruction runs 2048 times. If we can find a clever way to move these instructions out of their loops, or remove them entirely, there may be noticeable savings.
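
As a quick sanity check on those counts, here’s a small self-contained snippet (my own arithmetic on the figures above, not anything from the miner) that totals up the implied instruction executions:

    #include <cstdio>

    int main() {
        // Counts taken from the breakdown above; 620 is the measured average.
        const double programs     = 8;     // programs generated and executed per hash
        const double iterations   = 2048;  // iterations per program
        const double instructions = 256;   // randomly generated instructions per program
        const double inLoop       = 128;   // roughly half end up inside loops
        const double loopRepeats  = 620;   // average repeats for an in-loop instruction

        const double perProgram = (inLoop * loopRepeats + (instructions - inLoop)) * iterations;
        std::printf("executions per program: %.0f\n", perProgram);
        std::printf("executions per hash:    %.0f\n", perProgram * programs);
        return 0;
    }

On this model, a single in-loop instruction accounts for about 620 x 2048 ≈ 1.27 million executions per program, which is where the 1.2 million figure above comes from.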

In the next couple entries, I’ll discuss the results of attempts at improving hash execution time.