benchmark | joshfuhs.com

I recently dug into the new proof-of-work algorithm powering mining for the Monero crypto-currency, RandomX. It was rolled out to the Monero blockchain early in December of 2019, replacing the CryptoNight algorithm that had been used before it. Several other crypto-currencies are prepping to follow Monero down the RandomX path, so I thought it might be worth investigating. For more background on proof-of-work algorithms, see here.

One of the problems with proof-of-work algorithms and crypto-currencies is that the money reward for faster processing creates a positive feedback loop for the first ones to optimize. The first optimization turns into an increase in income which then makes it easier to make the next optimization, repeat. This usually results in most of the mining rewards going to a few large miners while being prohibitively expensive for a newcomer to get started.

Several crypto-currencies have made it a point to ensure that consumer hardware is competitive when mining to keep the mining workload distributed, which theoretically makes the crypto-currency more secure and definitely makes it more approachable. Ethereum and Monero have been particularly good at this with Ethash and CryptoNight, respectively, working well via consumer GPU computing and resisting FPGA or ASIC optimizations.

RandomX is the latest attempt which initially appears to be resistant even to GPU optimization. Is does this by making maximum use of the typical CPU architecture so that effective hardware optimization virtually requires designing a full processor. This is a costly process which standard CPU manufacturing offsets with volume sales.

Performance

Here’s a quick look at the performance of this algorithm on three different systems:

AMD Ryzen 2700 w/ 2-channel DDR4 3200 CAS 14
AMD Ryzen TR 3970X w/ 4-channel DDR4 3200 CAS 14
AMD EPYC 7502P w/ 8-channel DDR4 3200 CAS 24 (Buffered ECC)

I built the xmr-stak-rx 1.0.4 miner from source that can be found here. Make sure to apply the large memory page optimizations. You’ll need at least 4GB RAM and 2MB L3 cache.

System	Performance (kH/s)	System Power (w)
Ryzen 2700	4.8	105
Ryzen TR 3970X	24	355
EPYC 7502P	28	235

Structure of a Hash

The phases of the hash proceed as:

Initialization: AES-based scratchpad (2MB) fill
Program generation – AES-based hash (typically hardware accelerated)
Program compilation (for JIT-enabled hasher, much slower without)
Program execution
Results mixing

Here’s a breakdown of where the per-hash time is spent on the Ryzen 2700:

Phase	Percent of Execution
Program Generation	0.24
Program Compilation	1.96
Program Execution	93.84
Results Mixing	3.95

As we can see, the majority of time is spent in program execution. If we’re going to make significant improvements, it’s likely going to be there.

The number of things happening looks roughly like this:

8 programs generated and executed
- 2048 iterations of
  - 256 randomly generated instructions including an average of 25 loop instructions
    - Average of 620 iterations per loop instruction (measured)

Each instruction run in an inner loop instruction runs an average of 1.2 million times. Every other generated instruction runs 2048 times. If we find a clever way to move these instructions out of their loops or remove them entirely, there may be noticeable savings.

In the next couple entries, I’ll discuss the results of attempts at improving hash execution time.

joshfuhs.com

Tag Archives: benchmark

A Look At RandomX

Performance

Structure of a Hash

A personal blog