Category Archives: Computing

Optimizing RandomX: Loop Invariant Extraction

As seen in my last post, most of the RandomX execution time is spent executing the randomly generated program. Each instruction inside a loop executes an average of 620 times per program iteration, or roughly 1.2 million times per program, versus the 2048 executions for each instruction outside a loop.

Roughly half, or 128, of the instructions end up in loops.

Invariant Candidates

RandomX instructions are designed to mix inputs and outputs, so the output of most instructions won't be invariant. There are, however, a couple of instructions whose outputs aren't written back to the register file:

  • CFROUND
  • ISTORE

CFROUND sets the floating point rounding mode based on an input register and the immediate instruction bits. If the input register is unchanged between calls to the same instruction, then the same rounding mode is set each time.

ISTORE takes 2 input registers, src and dst, and writes src to a memory location dependent on dst and other bits within the instruction. If src and dst are unchanged between calls to the same ISTORE instruction, the same value is written to the same location.
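
To make the data flow concrete, here's a rough sketch of the two instructions' semantics in C++, simplified from the RandomX specification. The helper names, the rounding-mode table order, and the masking details are from memory and should be treated as illustrative rather than authoritative:

    #include <cfenv>
    #include <cstdint>
    #include <cstring>

    // Rotate right; CFROUND uses it to derive a rounding mode from src.
    static uint64_t rotr64(uint64_t x, unsigned r) {
        r &= 63;
        return r ? (x >> r) | (x << (64 - r)) : x;
    }

    // CFROUND: the new rounding mode depends only on the src register and the
    // instruction's immediate bits, so if src is unchanged the mode is unchanged.
    void cfround(uint64_t src, uint32_t imm) {
        static const int modes[4] = {FE_TONEAREST, FE_DOWNWARD, FE_UPWARD, FE_TOWARDZERO};
        std::fesetround(modes[rotr64(src, imm) & 3]);
    }

    // ISTORE: the scratchpad address depends on dst and the immediate, and the
    // stored value is src, so unchanged src and dst mean the same value is
    // written to the same place. (addr_mask selects the L1/L2/L3 scratchpad
    // range and is assumed to be 8-byte aligned here.)
    void istore(uint8_t* scratchpad, uint64_t src, uint64_t dst, int32_t imm,
                uint64_t addr_mask) {
        uint64_t addr = (dst + static_cast<uint64_t>(static_cast<int64_t>(imm))) & addr_mask;
        std::memcpy(scratchpad + addr, &src, sizeof src);
    }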

Invariant Rules

The rules for when an instruction, inst, can leave the loop are roughly as follows:

  • If the output of inst is never used in the loop AND the inputs never change in the loop, then inst can be moved off the back of the loop — this is obviously invariant.
  • If the output of inst is never used in the loop AND the inputs do change, but only before inst within the loop body, AND each write completely overwrites the previous one, then inst can also be moved off the back of the loop. Because all input changes happen before inst, the output gets set multiple times, but only the last write matters, since the output isn't used within the loop. The complete-overwrite condition is important for ISTORE: if the memory address changes between iterations, every write contributes to the hash state rather than just the last one.
  • If the inputs of inst never change AND inst is the only instruction that sets its output AND the output is either never used in the loop OR it’s used entirely after inst, then inst can be moved off the front of the loop.

These rules don’t catch every case in which one of these instructions could move, but they get most of the candidates that I manually confirmed. If you know of something big that I missed, please leave a comment.
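
Here's a sketch of what those checks might look like in code. The Instr struct below is a hypothetical stand-in for the real RandomX JIT structures, tracking only the register dependencies the rules care about:

    #include <cstddef>
    #include <vector>

    // Hypothetical instruction record -- just the register dependencies the
    // rules above need, not the actual RandomX JIT representation.
    struct Instr {
        std::vector<int> reads;   // registers read (src, dst-as-address, ...)
        std::vector<int> writes;  // registers written (empty for CFROUND/ISTORE)
        bool full_overwrite;      // each execution completely overwrites its output
    };

    static bool overlaps(const std::vector<int>& a, const std::vector<int>& b) {
        for (int x : a)
            for (int y : b)
                if (x == y) return true;
        return false;
    }

    // Rules 1 and 2: can loop[i] (whose output is never used in the loop, as is
    // the case for CFROUND and ISTORE) be moved off the back of the loop?
    bool can_move_to_back(const std::vector<Instr>& loop, std::size_t i) {
        const Instr& inst = loop[i];
        // Rule 1: nothing in the loop writes any of inst's inputs.
        bool inputs_invariant = true;
        for (const Instr& other : loop)
            if (overlaps(other.writes, inst.reads)) { inputs_invariant = false; break; }
        if (inputs_invariant) return true;
        // Rule 2: inputs change only before inst in the loop body, and each
        // execution of inst completely overwrites its output.
        if (!inst.full_overwrite) return false;
        for (std::size_t j = i; j < loop.size(); ++j)
            if (overlaps(loop[j].writes, inst.reads)) return false;
        return true;
    }

    // Rule 3: can loop[i] be moved off the front of the loop?
    bool can_move_to_front(const std::vector<Instr>& loop, std::size_t i) {
        const Instr& inst = loop[i];
        for (std::size_t j = 0; j < loop.size(); ++j) {
            if (overlaps(loop[j].writes, inst.reads)) return false;            // inputs never change
            if (j != i && overlaps(loop[j].writes, inst.writes)) return false; // sole writer of its output
            if (j < i && overlaps(loop[j].reads, inst.writes)) return false;   // uses only after inst
        }
        return true;
    }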

Results of Instruction Moving

When applied, these rules move an average of 2.5 instructions per program out of RandomX loops. There's a lot of variability between programs, but using the averages stated above, this should reduce per-hash instruction executions from (128 × 620 + 128) × 2048 to (125.5 × 620 + 130.5) × 2048. That's an expected improvement of ~2% in hash program execution time.
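
For reference, the arithmetic behind that estimate works out as:

    before:  (128 × 620 + 128) × 2048   = 79,488 × 2048   ≈ 162.8 million instruction executions
    after:   (125.5 × 620 + 130.5) × 2048 = 77,940.5 × 2048 ≈ 159.6 million instruction executions
    savings: 1 − 159.6 / 162.8 ≈ 2%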

However, when measured, no noticeable gain is seen. In fact, the runtimes between two versions — one that moves instructions and one that doesn’t — are virtually identical. The variance between runs is less than 0.5%, so a 1-2% change should have been apparent.

This surprising result led me to reconsider what the system hardware might be doing. It’s possible that the hash is bound by memory bandwidth rather than instruction throughput, meaning that between my standard and “optimized” runs the execution time might be the same while the processor might be doing less. This might show up in power consumption measurements, but that measurement is not something I’m set up to do.

The Cost

Unfortunately, the analysis and reordering of the instructions takes some work, and while I probably don't have a particularly fast implementation, it increases program compilation cost by 4-5x. Even if the expected gains had materialized, that makes this first optimization an overall loser: with compilation at roughly 2% of hash time, a 2% improvement in program execution time only pays off if compilation stays within about 2x of its original cost. However, once the analysis is being done, it can potentially be applied to other things.

A Look At RandomX

I recently dug into the new proof-of-work algorithm powering mining for the Monero crypto-currency, RandomX. It was rolled out to the Monero blockchain early in December of 2019, replacing the CryptoNight algorithm that had been used before it. Several other crypto-currencies are prepping to follow Monero down the RandomX path, so I thought it might be worth investigating. For more background on proof-of-work algorithms, see here.

One of the problems with proof-of-work algorithms and crypto-currencies is that the money reward for faster processing creates a positive feedback loop for the first ones to optimize: the first optimization turns into an increase in income, which then makes the next optimization easier to afford, and so on. This usually results in most of the mining rewards going to a few large miners while making it prohibitively expensive for a newcomer to get started.

Several crypto-currencies have made it a point to ensure that consumer hardware is competitive when mining to keep the mining workload distributed, which theoretically makes the crypto-currency more secure and definitely makes it more approachable. Ethereum and Monero have been particularly good at this with Ethash and CryptoNight, respectively, working well via consumer GPU computing and resisting FPGA or ASIC optimizations.

RandomX is the latest attempt, and it initially appears to be resistant even to GPU optimization. It does this by making maximum use of the typical CPU architecture, so that effective hardware optimization virtually requires designing a full processor, a costly process that standard CPU manufacturers offset with volume sales.

Performance

Here’s a quick look at the performance of this algorithm on three different systems:

  • AMD Ryzen 2700 w/ 2-channel DDR4 3200 CAS 14
  • AMD Ryzen TR 3970X w/ 4-channel DDR4 3200 CAS 14
  • AMD EPYC 7502P w/ 8-channel DDR4 3200 CAS 24 (Buffered ECC)

I built the xmr-stak-rx 1.0.4 miner from source, which can be found here. Make sure to apply the large memory page optimizations. You'll need at least 4GB RAM and 2MB L3 cache.

System            Performance (kH/s)   System Power (W)
Ryzen 2700        4.8                  105
Ryzen TR 3970X    24                   355
EPYC 7502P        28                   235

Structure of a Hash

The phases of the hash proceed as:

  • Initialization: AES-based scratchpad (2MB) fill
  • Program generation – AES-based hash (typically hardware accelerated)
  • Program compilation (for JIT-enabled hasher, much slower without)
  • Program execution
  • Results mixing

Here’s a breakdown of where the per-hash time is spent on the Ryzen 2700:

Phase                  Percent of Execution
Program Generation     0.24%
Program Compilation    1.96%
Program Execution      93.84%
Results Mixing         3.95%

As we can see, the majority of time is spent in program execution. If we’re going to make significant improvements, it’s likely going to be there.

The amount of work involved looks roughly like this:

  • 8 programs generated and executed
    • 2048 iterations of
      • 256 randomly generated instructions including an average of 25 loop instructions
        • Average of 620 iterations per loop instruction (measured)

Each instruction inside a loop runs an average of 1.2 million times; every other generated instruction runs 2048 times. If we find a clever way to move these instructions out of their loops or remove them entirely, there may be noticeable savings.

In the next couple entries, I’ll discuss the results of attempts at improving hash execution time.

Universal Computing Interface

It’s a good exercise to ask, “What’s the smallest thing that would work?” Understanding the limits is a key step in the creation of anything.

In that spirit, what is the minimum description of software? This question has been the domain of computer language designers for decades. What’s the minimum syntax? I think Lisp can be declared the winner. What’s the minimum required description of a language? That’s still being worked out.

What is the minimum required computing interface? That one we might be able to answer. The public clouds are touching on the answer in the form of AWS Lambda or Azure Service Fabric — inspired (I assume) by the general trend toward micro-services. There is a notion of an event that can be accepted, and a pre-defined computation that should be executed with the event. In the case where no event is required, you just have a computation that needs to be run to completion.

Any computational job could be described as an event to respond to and a job to run. Event-driven programming is a powerful paradigm that has been around for a while. It requires an effective way to express kinds of events and some kind of program to accept them.

This is a very light definition, so it’s an attractive interface. However, there are some aspects of computing that it doesn’t take into consideration. The biggest one is the run time of the job. For example, it’s often the case that if you can process one event in 1 second, you can process 10 events in 5. This is often the case in database or distributed applications where your computation involves the creation of lots of intermediate sets. There are caching techniques to try to mitigate the problem, but often it’s much more effective to do lots of the same things together. This suggests that the interface needs to be expanded.
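
As a sketch, the minimal interface described above fits in a few lines, and the batch variant reflects the expansion just argued for. All names here are illustrative, not any cloud provider's API:

    #include <functional>
    #include <string>
    #include <vector>

    // An event is just an opaque payload; in practice it would also carry a
    // type tag, timestamps, provenance, and so on.
    struct Event {
        std::string payload;
    };

    // The minimal interface: a pre-defined computation invoked per event.
    using Handler = std::function<std::string(const Event&)>;

    // The expansion suggested above: if ten events in one call cost far less
    // than ten separate calls, the interface should admit batches.
    using BatchHandler = std::function<std::vector<std::string>(const std::vector<Event>&)>;

    // A computation with no triggering event is just a job run to completion.
    using Job = std::function<void()>;

    // Example handler: report the size of the event payload.
    const Handler count_bytes = [](const Event& e) {
        return std::to_string(e.payload.size());
    };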

General Sorting is Θ(k × n)

This article makes extensive use of Big-O notation.

  • n is the number of elements to be sorted
  • k is the key size in bits; the smallest k for n uniquely identified elements is log(n)

I was taught in school (back in the day) that general sorting, usually assumed to be comparison sorting, is O(n × log(n)).

An alternative and well-known form of sorting, radix sort, which places elements relative to the value of the key rather than swapping after comparisons, is O(k × n). Since the smallest-case k for a set of uniquely identified elements is log(n) bits, O(n × log(n)) for radix sorting is still accurate for a smallest-case key size.

However, in comparison sorting analysis the key size is often ignored and the comparison time is assumed constant. Thus the fastest comparison sorts are O(n × log(n)) comparisons. If the same strategy were applied to the analysis of radix sort, its time complexity would be O(n) — k would be a constant factor.

Given that most general-purpose comparison sort algorithms accept arbitrary objects with arbitrary comparators, comparison time is certainly not constant. If arbitrary key size is considered, sort time for comparison sorts would be O(k × n × log(n)) or, for the smallest-case k, O(n × log²(n)).

Radix sort doesn't readily handle arbitrary data types; it's typically used for sorting fixed-size integers. The argument is that since radix sort isn't general, its performance doesn't translate to general sort performance.

What is general sorting? Any data structure that can be represented in computer memory can also be represented as a string, or a loosely-structured, arbitrarily-long sequence of bytes. So string sorting is general sorting.

There is a string sorting algorithm, Burstsort, that efficiently sorts strings and is O(k × n), where k is the average key size. As mentioned above, strings can represent arbitrary data structures, so Burstsort is general for sorting. A general implementation of Burstsort would take a set of objects and a serializing function as input rather than a comparator. The serialization of the objects to be sorted may not be particularly fast, but it is O(k × n).
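
To make the idea concrete, here's a minimal sketch of such a sort in C++: objects plus a serializing function whose byte-wise order matches the desired order, fed into a plain MSD radix sort. Note this is ordinary MSD radix sort rather than Burstsort itself (Burstsort adds a burst-trie of buckets for cache efficiency), and the names are illustrative:

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    // MSD (most-significant-digit-first) radix sort of the keys referenced by
    // idx[lo..hi), starting at byte position `depth`. Work is proportional to
    // the key bytes examined, i.e. O(k * n) for average key size k.
    static void msd_radix_sort(const std::vector<std::string>& keys,
                               std::vector<size_t>& idx,
                               size_t lo, size_t hi, size_t depth) {
        if (hi - lo < 2) return;
        // 256 buckets for byte values, plus bucket 0 for "key ends here".
        std::vector<std::vector<size_t>> buckets(257);
        for (size_t i = lo; i < hi; ++i) {
            const std::string& k = keys[idx[i]];
            size_t b = depth < k.size() ? size_t(uint8_t(k[depth])) + 1 : 0;
            buckets[b].push_back(idx[i]);
        }
        size_t out = lo;
        for (size_t b = 0; b < buckets.size(); ++b) {
            size_t start = out;
            for (size_t id : buckets[b]) idx[out++] = id;
            if (b > 0) msd_radix_sort(keys, idx, start, out, depth + 1);
        }
    }

    // "General" sorting in the sense of this post: instead of a comparator,
    // the caller provides a serializer whose byte-wise order matches the
    // desired order.
    template <typename T>
    std::vector<T> serialize_and_sort(const std::vector<T>& items,
                                      const std::function<std::string(const T&)>& serialize) {
        std::vector<std::string> keys;
        keys.reserve(items.size());
        for (const T& item : items) keys.push_back(serialize(item));  // O(k * n)

        std::vector<size_t> idx(items.size());
        for (size_t i = 0; i < idx.size(); ++i) idx[i] = i;
        msd_radix_sort(keys, idx, 0, idx.size(), 0);                  // O(k * n)

        std::vector<T> result;
        result.reserve(items.size());
        for (size_t i : idx) result.push_back(items[i]);
        return result;
    }

For example, fixed-width big-endian encodings of unsigned integers sort correctly under byte-wise order, and more complex records can be serialized field by field in priority order.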

Therefore, general sorting is O(k × n). Since that’s also the minimum time required to at least look at the keys to be sorted, general sorting is also Θ(k × n).

I’m not sure if this is old news to everyone, but it only just hit me.

Language Musing

If you’ve been in the software development world, you may have noticed some larger themes while going about your day-to-day. As you create, you (or I) tend to start with relatively large, unruly blobs of code. Then in several passes, those large, unruly blobs of code are broken down into smaller, more structured, and more generalized bits that are used by more specialized bits. This is the essence of functional decomposition, but it can happen along interesting axes. Most languages allow you to break down main routines into subroutines, but can you break down types? Some languages let you break down types, but can you parameterize them? Some languages let you parameterize your types, but can you classify and manipulate your language elements based on their properties?

We can watch computer language development go from simple to abstract, and this mirrors the path of mathematics. The power of the abstractions cannot be overstated. As each new abstraction layer appears, our ability to compactly express ideas grows by some exponent. The cumulative effect will have a total power difficult to predict.

Mathematics is a language of “what is true.” One of its peculiarities is that it can be completely removed from the physical world and still work. Mathematics of some form acts as the core of most software languages, but that’s not their totality. We can call the myriad ways of expressing truth Truth Languages.

Programming languages take some way of describing “what is true” (i.e. mathematics and logic) and layer on top the ability to talk with a computer’s inputs and outputs. This makes software languages capable of interacting with the world.

Thus, programming languages are languages of sense and action. Every piece of software in its most generalized form is Input -> Computation -> Output [ -> Repeat ]. The inputs are some form of sensory data — key presses, camera data, mouse movement, network packets, etc. — that inform the rest of the process. The Computation is a series of transformations based on known truths, and the Outputs manifest the results of this computation in the world. These processes can range from simply putting characters on a screen to the full complexity of running a manufacturing plant and beyond. Whatever the goal, it’s manifested in the real world. We can call these Action Languages.

Not only do we use programming languages to direct computers, but we use them to communicate with each other about what computers will do or have done. The emphasis on human readable code is strong in the software engineering discipline(s). One might also notice that the core of general human language is Action Language — we use our language largely to describe things to do.

However, general language includes more layers atop Action Language: a semantics of history and future, past actions and intent. Computer systems generally express the past as a log of some form, while intent is expressed as an event-handler configuration or schedule. These things tend to be described within software languages by how they are configured and/or formatted rather than as first-class elements of the languages themselves, but there may be a slow shift toward more formality.

By advancing the state of Action Language, computer science is essentially advancing the state of general language. Though we don’t yet completely use software languages this way, it’s clear that we’re educating millions at a time in their use to communicate both with our machines and each other. The long-term effect of this can only be more power and precision in our ability to express ideas and intent.

More than communicating amongst ourselves, software language is rapidly allowing us to express intent to, and command, the universe itself. Computers were an impressive step, giving each individual a capacity for logical and mathematical expression never before possible by even the whole of humanity. As technology advances toward increasingly capable creation machines like CNC machines and 3D printers, our words, expressed as software, achieve increasing creative potency.

To date, software represents the pinnacle of the power of the written word, and we still seem to have a lot of improvement ahead of us. If you don’t yet know how, pick up a programming book or tutorial and get started. Incomputency in the 21st century will be looked on as illiteracy was in the 20th.

Disadvantages of the Cloud

As the world moves toward AWS, Azure, and Google Cloud, I would like to take a moment to reflect on areas where the cloud services maybe aren’t so strong.

Connectivity/Availability

While the popular cloud services are the best in the world at availability and reliability, if your application requires those qualities at a particular location, your service will only be as reliable as your Internet connection. For businesses, high-quality, redundant connections are readily available. For IoT-in-the-home, you’re relying on the typical home broadband connection and hardware. I won’t be connecting my door locks to that.

When disaster strikes, every layer of your application is a possible crippling fault line. This is especially the case when your application relies on services spread across the Internet. We saw an example of this recently when AWS experienced a hiccup that was felt around the world. The more tightly we integrate with services the world over, the more we rely on everything in the world maintaining stability.

Latency and Throughput Performance

If latency is important, locality is key. Physics can be difficult that way. Many attempts at cloud gaming have fallen apart because of this simple fact. Other applications can be similarly sensitive.

Likewise with bandwidth, even the best commercial connections are unlikely to achieve 10Gbps to the Internet. If your requirements stress the limitations of modern technology, the typical cloud services may not work for you.

Security

While the cloud providers themselves are at the cutting edge in security practices, the innumerable cloud services and IoT vendors built atop them often aren’t. Because of the ubiquity that the cloud model presents, it’s getting more difficult to find vendors that are willing to provide on-premise alternatives — alternatives that could be more secure by virtue of never shuffling your data through the Internet.

There’s also the simple fact that by using cloud services, you’re involving another party in the storage and manipulation of your data. The DoD types will certainly give this due consideration, but events like CelebGate (1 and 2) are indications that maybe more of us should. Every new online account we create is another door into our lives, and the user/password protected doors aren’t the most solid.

Another concern along these lines is legal access to your data. If you’ve shared data with a cloud provider, can it be legally disclosed without your knowledge via warrant? Can foreign governments request access to any data stored within their borders? These issues are still being worked out with somewhat varying results around the globe. This might be an even smaller concern to the typical user, especially for those of us in the US. However, I feel I would be remiss if I didn’t note that politics of late have gotten…interesting.

Precision Metrics

I haven’t heard about this problem recently, but it has been noted that it’s difficult to get good concrete metrics out of at least some of the large scale cloud providers. Software failures can occur for reasons that are invisible to the virtualization layer: poor node performance due to throttling of nodes in particular locations, odd behaviors manifesting between two VMs using resources in unexpected ways, etc. Good luck getting the hardware metrics needed to debug these from the cloud providers.

Cost

This is fast becoming a non-issue as prices race toward zero. However, at the time of writing, those AWS bills could still add up to a significant chunk of change. Delivering the kind of reliability that the cloud titans provide isn’t easy, and the cost can often be worth it.

Custom Hardware

This is a fairly extreme case, but customization with the big services is rough unless you’ve got significant cash to throw at the problem.

Scale

If you’re already the size of Amazon or larger (Facebook), you’re probably better off doing something that works for you. I don’t imagine many will be in this position.

 

There you have it. The best set of reasons I have for avoiding the cloud.

If none of those reasons apply to you, and high-availability, high-reliability, available-everywhere is what you need, nothing beats the cloud computing providers. They’re ubiquitous, generally easy to use and automate, and are getting cheaper every day. Sometimes you can even get better deals with specialized services like Bluehost or SquareSpace, but beware the drawbacks.

However, if you have a concern along any of the lines above, it’s at least worthwhile to take a second look.

Benefits of Functional

In a couple of previous posts I’ve provided what I believe to be a good definition of a functional language (Functional Programming Definition) and why a functional model is ultimately required due to the constraints of reality (Everything is Functional).

However, I don’t know that I’ve been clear about exactly why functional programming is so beneficial. I think the benefits can be summed up in one word: predictability.

When I’m working with functional code, I know that I can lift any part from anywhere and get identical behavior given the same inputs, wherever I may need it. When I know the language enforces these rules, I can be supremely confident in this quality of the code, even unfamiliar code.

When working in languages like C++, Java, and Python, I go to some lengths to try to maintain a functional quality to my code. It can be done with most languages. However, when veering into unknown code, my confidence in my ability to modify without odd unexpected side-effects drops dramatically, and it makes those changes much more of a slog. As bad a practice as it is generally recognized to be, I still bump into the occasional global mutable variable or static class member. It doesn’t take many of those to wreck your day.
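
As a contrived illustration of the difference (hypothetical code, not from any real project):

    #include <iostream>

    int scale = 2;  // global mutable state: the hidden input

    // Looks like a function of x, but its result also depends on `scale`,
    // which any other code may have changed in the meantime.
    int scaled(int x) { return x * scale; }

    // The functional version: every input is a parameter, so the same
    // arguments always give the same result, wherever the call is lifted to.
    int scaled_pure(int x, int s) { return x * s; }

    int main() {
        std::cout << scaled(10) << "\n";          // 20 ... for now
        scale = 3;                                // some distant code "wrecks your day"
        std::cout << scaled(10) << "\n";          // 30: same call, different answer
        std::cout << scaled_pure(10, 2) << "\n";  // always 20
    }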

I think a second closely-related benefit of the functional style is composability. When you have supreme confidence in your components, putting them together can be a snap. This benefit has a little more dependence on good architecture and design practices, but outright avoidance of the features that make code-reasoning difficult will certainly have an impact.

In reference to my earlier post, everything is functional if you’re willing to zoom out far enough. You could start up another process, container, VM, or cluster to restore your faith in the predictability of your code. Note that this approach has costs that could be fairly easily avoided by starting functionally from the start.

Disclaimer: it’s always possible to write bad code in any language or style. However, it can be argued that it’s much more difficult to reach the spaghetti-horizon where the functional style is enforced.

Functional Programming Definition

A dozen years ago, when I was first investigating functional languages, the distinctions between functional and not were much more apparent. Functional languages had functions as values and parameters; closures, meaning inline-defined functions that capture (close over) context for use in future invocations; and, if not outright prohibiting variable updates, most of them at least discouraged programming styles that relied on them.
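
For anyone who hasn't run into the term, here's a minimal closure in C++ lambda form (an illustrative snippet, not tied to any particular codebase):

    #include <functional>
    #include <iostream>

    int main() {
        int offset = 42;  // context captured ("closed over") by the lambda below
        std::function<int(int)> add_offset = [offset](int x) { return x + offset; };
        // Functions as values/parameters: add_offset can be passed around like
        // any other value and carries its own copy of offset with it.
        std::cout << add_offset(8) << "\n";  // prints 50
    }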

Today, most new languages are a thorough mix of imperative, object-oriented, and functional styles. Functions-as-parameters and closures (or something roughly similar) are as common now as the procedural paradigm was in the ’90s. However, the one key point that has been largely missed is the discouragement of modifiable variables. The former two styles — imperative and object-oriented — basically require them.

As a result, we’re left with programmers today making a lot of the same non-functional mistakes of years past, and the software is little better for it.

Therefore, my favorite definition of “functional language” is what is more commonly referred to as “pure functional” — languages that do not permit side-effects (state changes). This idea breaks down around the edges of software design, and that’s where a lot of diversity among pure-functional languages can be found, but it is the property that gives the functional style the majority of its benefits.

When we imagine the intricacies around implicitly changing state as part of a computation in systems that are large and separated, I think we should begin to understand why Everything is Functional.

Everything is Functional

At a high enough level, all computing is functional. If your language isn’t functional (state-change and side-effect free), your process is isolating it. If your process is modifying global state, then your hypervisor or individual machine is isolating it. If your machine is working with a cluster of machines to maintain and modify global state, then all other machines not in the cluster work in isolation.

The model of global state is a difficult one to maintain as scale increases. The time to synchronize data grows and the number of complications due to stale data gets worse. At the largest scales, the distinction between database and communication layer (think pub-sub) breaks down, and they effectively become one. This is the model of tools like RethinkDB where the query becomes an asynchronous request for updates to a particular subset of the data model.

The latest software modeling paradigms make a point of restricting state to a locale via microservices. Each microservice is allowed its own state, but communicates with all other parts of the system via a common message bus. Tools like Storm and Spark specialize in connecting these services together into a larger dataflow model.

It makes sense. Physical locality is guaranteed to eventually constrain software, regardless of the programming models we’ve gotten away with in the past. I think we would do well to recognize that, when stretched to the largest scales, software is relatively small units of computation and the data flowing between them (just like hardware). Aligning our heads, tools, and techniques around this model is likely to yield the best results long-term.

Pick up and learn a functional language today!

  • Haskell
  • OCaml
  • Scala