What to avoid when building an EVM decompiler
As the Ethereum ecosystem matures, the growing utilization of the Ethereum Virtual Machine (EVM) begs a simple question: When will we see the launch of an effective EVM decompiler?
Unwilling to wait, our research team set out to build one ourselves. Our efforts ultimately fell short, but we can offer you the next best thing: a guide on what not to do when building an EVM decompiler with Machine Learning.
The following is a product of several months of research and development. If you seek to develop an EVM decompiler yourself, or are simply interested in decompilation - this is for you.
Background
Wait a minute. Don’t EVM decompilers exist already?
There have been notable previous attempts at creating EVM decompilers - Dedaub, Panoramix, Heimdall, and Ethervm, for example. However, they produce code without variable names, require a deep understanding of the EVM to interpret, and their output may not be recompilable.
We wanted to build something different: A decompiler accessible to everyone regardless of their technical background that would produce recompilable code.
This would require leveraging recent advances in Machine Learning (ML). Finding a model and then feeding it with the correct data formed the basis of our work.
Lesson #1: Pick the right model(s)
When researching existing Machine Learning models that were used for decompilation tasks, we found that most of the research was done to decompile C programs. No one had attempted to make one for EVM and Solidity.
We took the N-Bref model from Meta as its code was open-source under Creative Commons. Our aim was to repurpose the model to decompile EVM bytecode rather than write our own model from scratch, saving us time. Taking this ‘shortcut’ was a mistake, which would end up costing us much more time, as we’ll explain.
Our plan consisted of the following steps:
- Collect the dataset of smart contracts
- Compile smart contracts to get EVM bytecode and Solidity AST
- Generate DGL graphs for CFG and AST
- Train the N-Bref model to transform bytecode CFG to Solidity AST
- Reconstruct Solidity source code from AST
- Modify the source code to be readable and recompilable using an LLM
The N-Bref model uses Graph Neural Networks to transform the assembly graph into the AST graph. In the original research, an artificial dataset of C programs was generated, and their graphs had around 500 nodes. In our dataset, however, the graphs of large contracts contained hundreds of thousands of nodes. Because of that, we had to load the dataset examples from disk one by one and train with a batch size of 1, which is not ideal because it can lead to overfitting.
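As a rough illustration, here is a minimal sketch of the kind of one-graph-at-a-time loading we were forced into. The file layout and the ContractGraphDataset class are hypothetical; only the batch-size-1 constraint mirrors what we actually did.

```python
# Hedged sketch: stream one DGL graph pair from disk per training step instead
# of batching. File layout and dataset class are illustrative assumptions.
import os
import torch
import dgl

class ContractGraphDataset(torch.utils.data.Dataset):
    def __init__(self, graph_dir: str):
        self.paths = sorted(
            os.path.join(graph_dir, f)
            for f in os.listdir(graph_dir)
            if f.endswith(".bin")
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # dgl.load_graphs returns (list_of_graphs, label_dict); in this sketch
        # each file stores one CFG graph and one AST graph for a contract.
        graphs, _ = dgl.load_graphs(self.paths[idx])
        cfg, ast = graphs
        return cfg, ast

loader = torch.utils.data.DataLoader(
    ContractGraphDataset("graphs/"),
    batch_size=1,                        # graphs are too large to batch
    shuffle=True,
    collate_fn=lambda batch: batch[0],   # pass the single (cfg, ast) pair through
)

for cfg, ast in loader:
    pass  # one forward/backward pass per contract
```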
The big graphs also caused CUDA (Compute Unified Device Architecture) to run out of memory, so ultimately we removed the largest contracts from the dataset.
In short: If large contracts are creating bottlenecks, examine whether the model you are using is correct rather than forcing the dataset to fit the model. If possible, experiment with multiple models at the same time rather than exclusively using one.
Iterating and experimenting with multiple models is as complex as it sounds, however - we recommend having a full-time ML researcher on your team to make this more manageable.
Lesson #2: Parse the right data
The obvious place to train an ML model on smart contract source code is the smart-contract-sanctuary repository. It hasn’t been updated with fresh contracts in a while, but it was the fastest way to test our theories.
However, using this dataset was not as straightforward as we initially thought. Out of the box, many contracts wouldn't compile due to import directives left over from the contract-flattening process, or were broken in other ways.
After fixing the problem for most of the contracts, we faced a different issue. The dataset contained an excessive number of near-identical contracts, such as ERC20 tokens under different names, raising concerns about overfitting during model training. We used MinHash Near Deduplication from this repository to trim the smart contract sanctuary dataset. This technique uses multiple hash functions to convert each contract's source code into a set of numeric values, producing a MinHash signature (a compact representation consisting of the minimum hash value for each hash function), which we used to estimate similarity between contracts. We tried different similarity threshold parameters until we were left with 60k contracts.
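For illustration, here is a minimal sketch of MinHash-based near-deduplication using the datasketch library. The whitespace tokenizer, the 0.85 threshold and num_perm=128 below are assumptions, not the exact settings from the repository we used.

```python
# Hedged sketch of MinHash near-deduplication with the `datasketch` library.
from pathlib import Path
from datasketch import MinHash, MinHashLSH

def signature(source: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(source.split()):        # crude whitespace tokenization
        m.update(token.encode("utf8"))
    return m

contracts = {p.stem: p.read_text() for p in Path("contracts/").glob("*.sol")}

lsh = MinHashLSH(threshold=0.85, num_perm=128)   # estimated Jaccard similarity cut-off
kept = {}
for name, source in contracts.items():
    sig = signature(source)
    if lsh.query(sig):                           # a near-duplicate is already kept
        continue
    lsh.insert(name, sig)
    kept[name] = source

print(f"kept {len(kept)} of {len(contracts)} contracts")
```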
Compiling lots of contracts
Having dealt with the dataset, we began working on a setup that would allow us to compile the contracts in parallel and tweak the compiler flags that affect bytecode generation, such as optimize-runs. For the initial training, we decided to freeze the Solidity compiler version and disable the optimizer. Compilation artifacts were saved to an S3 bucket and a Postgres database.
From the compilation process, only two artifacts are useful: the AST and the deployed bytecode, which we later transform into a control flow graph (CFG).
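A minimal sketch of that setup, assuming a pinned solc binary on the PATH and local JSON output instead of S3/Postgres; the paths and worker count are illustrative.

```python
# Hedged sketch: compile contracts in parallel with a pinned solc and the
# optimizer left disabled, keeping only the AST and runtime bytecode.
import subprocess
from multiprocessing import Pool
from pathlib import Path

SOLC = "solc"  # pinned to a single compiler version in our setup

def compile_one(path: Path):
    # No --optimize flag, so the optimizer stays disabled; --combined-json
    # limits the output to the artifacts we actually need.
    result = subprocess.run(
        [SOLC, "--combined-json", "ast,bin-runtime", str(path)],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return path.name, None                     # record failures for later triage
    Path("artifacts").mkdir(exist_ok=True)
    Path(f"artifacts/{path.stem}.json").write_text(result.stdout)
    return path.name, "ok"

if __name__ == "__main__":
    sources = list(Path("contracts/").glob("*.sol"))
    with Pool(processes=8) as pool:                # parallel compilation
        statuses = pool.map(compile_one, sources)
    print(sum(1 for _, s in statuses if s == "ok"), "contracts compiled")
```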
The full list of AST node types can be found in the Solidity compiler's source code. There were some nodes that didn't affect code generation (such as StructuredDocumentation, which represents NatSpec comments), so we pruned them.
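A sketch of that pruning step over solc's JSON AST; the set of pruned node types below is illustrative (in practice we removed a handful of types that had no effect on code generation).

```python
# Hedged sketch: recursively drop AST nodes that don't affect code generation
# from solc's JSON AST. The set of pruned node types is illustrative.
PRUNED_NODE_TYPES = {"StructuredDocumentation"}

def prune(node):
    """Return a copy of the JSON AST with pruned node types removed."""
    if isinstance(node, dict):
        if node.get("nodeType") in PRUNED_NODE_TYPES:
            return None
        out = {}
        for key, value in node.items():
            pruned = prune(value)
            if pruned is not None or value is None:  # keep keys that were None originally
                out[key] = pruned
        return out
    if isinstance(node, list):
        return [p for p in (prune(item) for item in node) if p is not None]
    return node
```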
For the control flow graph (CFG) generation as the base of our project, we initially utilized Crytic’s Rattle, an EVM binary static analysis framework that builds CFGs.
However, the original Rattle CFG generation wasn't compatible with our database requirements, because it generated logic blocks as nodes whereas we needed instruction-level nodes. A logic block can contain anywhere from one opcode to hundreds of opcodes, and we needed each node to be uniform so that we could encode it in 256 bits. We therefore used opcodes as nodes instead of logic blocks.
To address this, we modified the CFG building code within Rattle to change how the nodes are constructed. Instead of creating nodes representing logic blocks (sequences of instructions without control flow constructs), our modified code generates a separate node for each individual instruction in the EVM bytecode. We achieved this by iterating through the bytecode instructions one-by-one using a depth-first search (DFS) algorithm, creating a new node for each instruction, and establishing control flow edges between these nodes based on the control flow constructs encountered, such as conditional jumps and unconditional jumps.
This instruction-level CFG suited our requirements for precise analysis better than Rattle's original logic-block nodes.
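The sketch below illustrates the instruction-level CFG idea with an iterative DFS. It is a simplification: Rattle resolves jump targets through stack analysis, whereas this toy version only handles the common pattern of a PUSH immediately followed by a JUMP or JUMPI, and the node attributes are illustrative.

```python
# Hedged sketch of an instruction-level CFG: one node per opcode, edges for
# fall-through and statically resolvable jumps, explored with an iterative DFS.
import networkx as nx

TERMINATORS = {0x00, 0xF3, 0xFD, 0xFE, 0xFF}  # STOP, RETURN, REVERT, INVALID, SELFDESTRUCT
JUMP, JUMPI = 0x56, 0x57

def disassemble(code: bytes) -> dict:
    """Map byte offset -> (opcode, push_argument_or_None)."""
    insns, pc = {}, 0
    while pc < len(code):
        op = code[pc]
        if 0x60 <= op <= 0x7F:                     # PUSH1..PUSH32
            width = op - 0x5F
            arg = int.from_bytes(code[pc + 1 : pc + 1 + width], "big")
            insns[pc] = (op, arg)
            pc += 1 + width
        else:
            insns[pc] = (op, None)
            pc += 1
    return insns

def instruction_cfg(code: bytes) -> nx.DiGraph:
    insns = disassemble(code)
    offsets = sorted(insns)
    nxt = {o: offsets[i + 1] for i, o in enumerate(offsets[:-1])}
    prev = {offsets[i + 1]: o for i, o in enumerate(offsets[:-1])}
    cfg = nx.DiGraph()
    stack, seen = [0], set()
    while stack:                                   # iterative depth-first search
        pc = stack.pop()
        if pc in seen or pc not in insns:
            continue
        seen.add(pc)
        op, _ = insns[pc]
        cfg.add_node(pc, opcode=op)
        succs = []
        if op in (JUMP, JUMPI):
            # Jump target taken from the immediately preceding PUSH, if any.
            if pc in prev and insns[prev[pc]][1] is not None:
                succs.append(insns[prev[pc]][1])
            if op == JUMPI and pc in nxt:          # fall-through branch
                succs.append(nxt[pc])
        elif op not in TERMINATORS and pc in nxt:
            succs.append(nxt[pc])
        for s in succs:
            if s in insns:
                cfg.add_edge(pc, s)
                stack.append(s)
    return cfg
```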
Synthetic dataset
The model struggled to perform well, and it was unclear whether the issue lay in the model's architecture or whether we simply needed more data. We realized that if more data was the answer, it was crucial to be able to create it artificially, i.e. to synthesize contract code.
We briefly tried using Certora’s gambit to mutate existing contracts, but ultimately found that it only resulted in slight modifications.
During the project we worked closely with the Solidity compiler's source code, and a solution presented itself: we could reuse its existing fuzzing code to generate smart contracts for our synthetic dataset. This was by no means an easy task, so we decided to first try it with Yul-only contracts. Yul contains only a few AST nodes and correlates closely with the generated bytecode, yet it still has some non-trivial structures such as loops and functions, which was enough to test our theories.
We modified Solidity's existing Yul fuzzer (yulProtoFuzzer.cpp) to produce synthetic Yul contracts, filter them based on contract size, and save the results to disk for later compilation. Our patch is available here. This approach allowed us to iterate more quickly and fix some issues that would have been hard to spot when working with Solidity.
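The filtering step after fuzzing is simple; here is a sketch of it, assuming solc is on the PATH and using made-up size bounds.

```python
# Hedged sketch of the post-fuzzing filter: keep generated Yul programs within
# a size window and drop anything the pinned solc rejects.
import subprocess
from pathlib import Path

MIN_SIZE, MAX_SIZE = 200, 20_000        # source size bounds in bytes (hypothetical)

def usable(path: Path) -> bool:
    if not MIN_SIZE <= path.stat().st_size <= MAX_SIZE:
        return False
    # --strict-assembly makes solc treat the input as a standalone Yul object.
    result = subprocess.run(
        ["solc", "--strict-assembly", str(path)],
        capture_output=True, text=True,
    )
    return result.returncode == 0

kept = [p for p in Path("fuzzed_yul/").glob("*.yul") if usable(p)]
print(f"kept {len(kept)} synthetic Yul contracts")
```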
In short: Parsing data in bulk is not easily achieved. Existing training data should be selected carefully, and synthesized if it is scarce.
Future
After exploring EVM decompilation for several months, we have shelved our work in this field for now. Decompilation is no mean feat, and requires significant resources to be achieved effectively.
Our attempts in this field focused on the use of Transformer models. However, more progress may be made by applying the now-famous Large Language Models (LLMs) to decompilation. We've recently taken an interest in this paper on LLM4Decompile, which uses an LLM to decompile binaries back into C code. If you're interested in collaborating with us on adapting this for EVM, reach out to us on X.