
Optimizing RandomX: Loop Invariant Extraction

As seen in my last post, most of the RandomX execution time is spent executing the randomly generated program. Each instruction inside a loop executes an average of 620 times per program iteration, which works out to roughly 1.2 million executions over a program’s 2048 iterations, versus the 2048 executions of an instruction outside a loop.

Roughly half, or 128, of the instructions end up in loops.

Invariant Candidates

RandomX instructions are designed to mix inputs and outputs. This means the output of most instructions won’t be invariant, but there are a couple of instructions whose inputs and outputs are independent:

  • CFROUND
  • ISTORE

CFROUND sets the floating point rounding mode based on an input register and the immediate instruction bits. If the input register is unchanged between calls to the same instruction, then the same rounding mode is set each time.

ISTORE takes 2 input registers, src and dst, and writes src to a memory location dependent on dst and other bits within the instruction. If src and dst are unchanged between calls to the same ISTORE instruction, the same value is written to the same location.
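
For context, here’s a minimal sketch of what these two instructions do, paraphrased from my reading of the RandomX specification. The function names, rounding-mode ordering, and address mask below are simplified stand-ins of mine rather than the actual implementation, but they show why the result depends only on src, dst, and the immediate bits.

    #include <cfenv>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Rotate right; CFROUND uses this to derive a rounding mode from src.
    static uint64_t rotr64(uint64_t v, unsigned c) {
        c &= 63;
        return c ? (v >> c) | (v << (64 - c)) : v;
    }

    // CFROUND: the rounding mode depends only on src and the immediate bits,
    // so if src never changes inside the loop, the same mode is set every pass.
    void cfround(uint64_t src, uint32_t imm) {
        static const int modes[4] = { FE_TONEAREST, FE_DOWNWARD, FE_UPWARD, FE_TOWARDZERO };
        std::fesetround(modes[rotr64(src, imm) & 3]);
    }

    // ISTORE: src is written to a scratchpad address derived from dst and the
    // immediate. The mask is a simplified stand-in for RandomX's L1/L2/L3
    // address selection and is assumed to keep the address inside the scratchpad.
    // If src and dst never change inside the loop, the same value lands at the
    // same address every pass, so only the last store matters.
    void istore(std::vector<uint8_t>& scratchpad, uint64_t dst, int32_t imm,
                uint64_t src, uint64_t mask) {
        uint64_t addr = (dst + static_cast<uint64_t>(static_cast<int64_t>(imm))) & mask;
        std::memcpy(&scratchpad[addr], &src, sizeof(src));
    }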

Invariant Rules

The rules for when an instruction, inst, can leave the loop are roughly as follows:

  • If the output of inst is never used in the loop AND its inputs never change in the loop, then inst can be moved off the back of the loop; it is obviously invariant.
  • If the output of inst is never used in the loop AND its inputs do change, but only before inst executes, AND each execution completely overwrites the output, then inst can also be moved off the back of the loop. The output gets written multiple times, but only the last write matters because nothing in the loop reads it, and that only holds if each write fully replaces the previous one. This caveat matters for ISTORE: if the memory address changes between executions, every write contributes to the state of the hash rather than just the last one.
  • If the inputs of inst never change AND inst is the only instruction that sets its output AND the output is either never used in the loop OR used only after inst, then inst can be moved off the front of the loop.

These rules don’t catch every case in which one of these instructions could move, but they get most of the candidates that I manually confirmed. If you know of something big that I missed, please leave a comment.
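
To make these rules concrete, here’s a rough sketch of the kind of check involved. The Inst record and the classify helper are hypothetical names of mine, not anything from the RandomX or miner source, and real code would also have to treat memory reads and partially-written outputs conservatively.

    #include <cstddef>
    #include <vector>

    // Hypothetical decoded-instruction record; not the real RandomX layout.
    struct Inst {
        int dst;                 // register (or other state) written by the instruction
        std::vector<int> srcs;   // registers read by the instruction
        bool fullOverwrite;      // true if each execution completely rewrites the output
    };

    enum class Hoist { None, Back, Front };

    // Decide whether loop[i] can be hoisted out of the loop, per the rules above.
    Hoist classify(const std::vector<Inst>& loop, size_t i) {
        const Inst& inst = loop[i];

        bool outputUsed = false, usedOnlyAfter = true, onlyWriter = true;
        bool inputsChange = false, changeOnlyBefore = true;

        for (size_t j = 0; j < loop.size(); ++j) {
            if (j == i) continue;
            const Inst& other = loop[j];
            for (int s : other.srcs)            // does anything else read inst's output?
                if (s == inst.dst) { outputUsed = true; if (j < i) usedOnlyAfter = false; }
            if (other.dst == inst.dst)          // does anything else write inst's output?
                onlyWriter = false;
            for (int s : inst.srcs)             // does anything change inst's inputs?
                if (other.dst == s) { inputsChange = true; if (j > i) changeOnlyBefore = false; }
        }

        if (!outputUsed && !inputsChange)
            return Hoist::Back;     // rule 1: fully invariant, move off the back
        if (!outputUsed && changeOnlyBefore && inst.fullOverwrite)
            return Hoist::Back;     // rule 2: only the final write matters
        if (!inputsChange && onlyWriter && (!outputUsed || usedOnlyAfter))
            return Hoist::Front;    // rule 3: compute once before the loop
        return Hoist::None;
    }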

Results of Instruction Moving

When applied, these rules move an average of 2.5 instructions per program out of RandomX loops. There’s a lot of variability between programs, but using the averages stated above, this should reduce the work per program from (128 x 620 + 128) x 2048 instruction executions to (125.5 x 620 + 130.5) x 2048. That’s an expected improvement of ~2% in hash program execution time.
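
Expanding those expressions (my arithmetic, using the same averages):

    (128 x 620 + 128) x 2048     = 79,488 x 2048   ≈ 162.8 million executions per program
    (125.5 x 620 + 130.5) x 2048 = 77,940.5 x 2048 ≈ 159.6 million executions per program

    159.6 / 162.8 ≈ 0.981, i.e. just under 2% fewer instruction executions per program.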

However, when measured, no noticeable gain is seen. In fact, the runtimes between two versions — one that moves instructions and one that doesn’t — are virtually identical. The variance between runs is less than 0.5%, so a 1-2% change should have been apparent.

This surprising result led me to reconsider what the system hardware might be doing. It’s possible that the hash is bound by memory bandwidth rather than instruction throughput, meaning that between my standard and “optimized” runs the execution time might be the same while the processor might be doing less. This might show up in power consumption measurements, but that measurement is not something I’m set up to do.

The Cost

Unfortunately, the analysis and reordering of the instructions takes some work, and while I probably don’t have a particularly fast implementation, the work required increases program compilation cost by 4-5x. That makes this first optimization an overall loser, even if the expected results had been seen: for an improvement of 2% in program execution time to pay off, compilation time would need to stay within about 2x of the original. However, once the analysis is being done, it can potentially be applied to other things.
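
As a rough sanity check of that break-even point, here’s my arithmetic using the Ryzen 2700 phase breakdown from my last post, where compilation is about 2% of hash time and execution about 94%:

    execution-time savings   ≈ 0.94 x 0.02      ≈ 1.9% of total hash time
    compilation at 4-5x      ≈ 0.02 x (4 to 5)  ≈ 8-10% of total (an added 6-8%)
    break-even multiplier k:   0.02 x (k - 1) ≈ 0.019  →  k ≈ 2x

So compilation can roughly double before the added cost consumes the entire expected execution-time gain.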

A Look At RandomX

I recently dug into the new proof-of-work algorithm powering mining for the Monero crypto-currency, RandomX. It was rolled out to the Monero blockchain early in December of 2019, replacing the CryptoNight algorithm that had been used before it. Several other crypto-currencies are prepping to follow Monero down the RandomX path, so I thought it might be worth investigating. For more background on proof-of-work algorithms, see here.

One of the problems with proof-of-work algorithms and crypto-currencies is that the money reward for faster processing creates a positive feedback loop for the first ones to optimize. The first optimization turns into an increase in income which then makes it easier to make the next optimization, repeat. This usually results in most of the mining rewards going to a few large miners while being prohibitively expensive for a newcomer to get started.

Several crypto-currencies have made it a point to ensure that consumer hardware is competitive when mining to keep the mining workload distributed, which theoretically makes the crypto-currency more secure and definitely makes it more approachable. Ethereum and Monero have been particularly good at this with Ethash and CryptoNight, respectively, working well via consumer GPU computing and resisting FPGA or ASIC optimizations.

RandomX is the latest attempt, and it initially appears to be resistant even to GPU optimization. It does this by making maximum use of the typical CPU architecture, so that effective hardware optimization virtually requires designing a full processor. That is a costly process which standard CPU manufacturers offset with volume sales.

Performance

Here’s a quick look at the performance of this algorithm on three different systems:

  • AMD Ryzen 2700 w/ 2-channel DDR4 3200 CAS 14
  • AMD Ryzen TR 3970X w/ 4-channel DDR4 3200 CAS 14
  • AMD EPYC 7502P w/ 8-channel DDR4 3200 CAS 24 (Buffered ECC)

I built the xmr-stak-rx 1.0.4 miner from source, which can be found here. Make sure to apply the large memory page optimizations. You’ll need at least 4GB of RAM and 2MB of L3 cache.

System            Performance (kH/s)    System Power (W)
Ryzen 2700        4.8                   105
Ryzen TR 3970X    24                    355
EPYC 7502P        28                    235

Structure of a Hash

The phases of the hash proceed as:

  • Initialization: AES-based scratchpad (2MB) fill
  • Program generation – AES-based hash (typically hardware accelerated)
  • Program compilation (for JIT-enabled hasher, much slower without)
  • Program execution
  • Results mixing

Here’s a breakdown of where the per-hash time is spent on the Ryzen 2700:

Phase                  Percent of Execution
Program Generation     0.24
Program Compilation    1.96
Program Execution      93.84
Results Mixing         3.95

As we can see, the majority of time is spent in program execution. If we’re going to make significant improvements, it’s likely going to be there.

The amount of work per hash breaks down roughly like this:

  • 8 programs generated and executed
    • 2048 iterations of
      • 256 randomly generated instructions including an average of 25 loop instructions
        • Average of 620 iterations per loop instruction (measured)

Each instruction inside an inner loop runs an average of roughly 1.2 million times; every other generated instruction runs 2048 times. If we can find a clever way to move these instructions out of their loops, or remove them entirely, there may be noticeable savings.
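
As a quick sanity check on those counts, here’s a small self-contained snippet (my own arithmetic on the figures above, not anything from the miner) that totals up the implied instruction executions:

    #include <cstdio>

    int main() {
        // Counts taken from the breakdown above; 620 is the measured average.
        const double programs     = 8;     // programs generated and executed per hash
        const double iterations   = 2048;  // iterations per program
        const double instructions = 256;   // randomly generated instructions per program
        const double inLoop       = 128;   // roughly half end up inside loops
        const double loopRepeats  = 620;   // average repeats for an in-loop instruction

        const double perProgram = (inLoop * loopRepeats + (instructions - inLoop)) * iterations;
        std::printf("executions per program: %.0f\n", perProgram);
        std::printf("executions per hash:    %.0f\n", perProgram * programs);
        return 0;
    }

On this model, a single in-loop instruction accounts for about 620 x 2048 ≈ 1.27 million executions per program, which is where the 1.2 million figure above comes from.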

In the next couple entries, I’ll discuss the results of attempts at improving hash execution time.