# The “proof overflow” problem

At Scroll, we are building in the open, and want to be transparent about all aspects of the protocol we are building.

This post describes a problem that we term “proof overflow” - a problem that arises from the separation of execution and proof generation in ZK rollups.

### Background

Scroll’s rollup flow can be roughly understood as follows:

- Users submit *transactions* to Scroll’s mempool.
- A sequencer node packs some *transactions* into a *block*.
- A batcher packs some *blocks* into a *batch*.
- The batch’s data (including all its transaction data) is posted, or “committed,” to Ethereum L1.
- A prover takes a batch and generates a *proof*, which proves that all transactions in the batch were executed correctly.
- This proof is submitted to Ethereum L1 to be validated. The corresponding batch is then considered “finalized.”

One problem we encountered on our Alpha testnet was that some batches could not be proven: they were “too large” to fit in our zkEVM circuit.

One can think of the zkEVM circuit as consisting of many subcircuits, say *n* subcircuits, that are interconnected by lookup arguments. Each subcircuit specializes in a particular operation - for example, the Keccak circuit computes Keccak hashes and the Exponentiation circuit computes exponentiations. A current limitation of our zkEVM circuit design is that every subcircuit must have the same number of rows, say *m* rows.

Each incoming transaction will have a unique profile in terms of how many rows it consumes in each of the subcircuits. For example, there may be a transaction that results in many Keccak operations, and therefore takes up many rows in the Keccak circuit and no rows in the Exponentiation circuit. Conversely, there may be a transaction that takes up very few rows in the Keccak circuit and many rows in the Exponentiation circuit.

Since a batch consists of blocks, and blocks consist of transactions, a batch’s row consumption profile is determined by the transactions that compose it. If a batch’s row consumption exceeds the maximum number of rows *m*, then the batch is not provable (the proof “overflows”). When a batch is not provable, it cannot be finalized on L1, and neither can any subsequent batches (which depend on the unprovable batch’s resulting state).
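To make this concrete, the overflow condition can be sketched in a few lines of Python. The subcircuit names and per-transaction row profiles below are purely illustrative, not Scroll’s actual figures:

```python
M = 2**20  # fixed number of rows per subcircuit (illustrative)

# Hypothetical per-transaction row consumption, per subcircuit.
tx_profiles = [
    {"keccak": 40_000, "exp": 0, "evm": 12_000},
    {"keccak": 500, "exp": 30_000, "evm": 8_000},
]

def batch_row_consumption(txs):
    """Sum row usage per subcircuit across all transactions in a batch."""
    totals = {}
    for profile in txs:
        for subcircuit, rows in profile.items():
            totals[subcircuit] = totals.get(subcircuit, 0) + rows
    return totals

def is_provable(txs, m=M):
    """A batch overflows if any single subcircuit exceeds its m-row capacity."""
    return all(rows <= m for rows in batch_row_consumption(txs).values())
```

Note that the check is per subcircuit: a batch overflows as soon as any one dimension exceeds *m*, even if the others are nearly empty.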

Note that even a batch containing just a single transaction may be capable of overflowing the circuit.

The “proof overflow” problem concerns the following question: how do we prevent the creation of batches that exceed circuit capacity?

### The long term solution

The problem arises from a limitation of our circuit architecture: all subcircuits must have some predetermined, fixed number of rows. We are looking into redesigning our architecture such that subcircuits can independently be dynamically sized - each subcircuit’s size can scale up or down based on the requirements of a batch proof. For example, if one batch needs 2^20 rows in the Keccak circuit but only 2^14 rows in the Exponentiation circuit, the circuits can be independently scaled.

There are challenges that come from this type of dynamic design, and we are working through them. However, in the meantime, we need to solve the issue for fixed-size circuits.

### Solutions

**1. Set the block gas limit based on worst-case opcodes**

The idea here is to set a block’s gas limit based on the worst-case (most expensive in terms of circuit row consumption) opcode. In other words, set the block gas limit such that even if the block were filled entirely with the most expensive opcode, it would still fit in our circuit. This guarantees that no block within the gas limit can overflow the circuit.

- Pros:
    - Simplicity.
- Cons:
    - This is *extremely* inefficient:
        - Analysis shows that the ratio between proof rows and EVM gas for the worst opcode (SHA) is ~11x.
        - Each marginal Keccak byte takes up ~2.2 rows, while costing only ~6/32 EVM gas. With *m* = 2^20 (approximately 1 million rows), we can fit about 2^20 / 2.2 Keccak bytes. This corresponds to a maximum block gas limit of (2^20 / 2.2) * (6/32) ~= 89,000 gas. Too small!

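The back-of-the-envelope arithmetic above can be checked directly (the ~2.2 rows and ~6/32 gas per Keccak byte figures are taken from the analysis above):

```python
M = 2**20                      # rows per subcircuit (~1 million)
ROWS_PER_KECCAK_BYTE = 2.2     # circuit rows per marginal hashed byte
GAS_PER_KECCAK_BYTE = 6 / 32   # marginal EVM gas per hashed byte

# If the whole circuit were spent on Keccak bytes, how much gas fits?
max_keccak_bytes = M / ROWS_PER_KECCAK_BYTE
worst_case_gas_limit = max_keccak_bytes * GAS_PER_KECCAK_BYTE
print(round(worst_case_gas_limit))  # ~89,000 gas
```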
**2. Gas re-pricing**

We could modify the opcode-to-gas schedule to reflect proving cost, rather than execution cost. This would involve a static mapping from each opcode to the maximum number of rows it takes up across all subcircuits, and then modifying our Geth fork (“L2Geth”) to use this new gas schedule.
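As a sketch, a re-priced gas schedule might look like the following, where the per-opcode row counts and the gas-per-row scale factor are hypothetical placeholders rather than values derived from the zkEVM circuit repo:

```python
# Hypothetical worst-case row counts per opcode, taken across all subcircuits.
MAX_ROWS_PER_OPCODE = {
    "ADD": 1,
    "EXP": 8,
    "SHA3": 25,
}

GAS_PER_ROW = 10  # illustrative scale factor keeping repriced costs integral

def repriced_gas(opcode: str) -> int:
    """Gas cost proportional to the opcode's worst-case row consumption."""
    return MAX_ROWS_PER_OPCODE[opcode] * GAS_PER_ROW
```

The difficulty noted below is not this lookup itself, but producing and maintaining the table it depends on.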

- Pros:
    - Proof overflow issues are handled as “out of gas” errors at the execution layer.
- Cons:
    - Potentially breaks contracts that depend on the normal EVM gas schedule.
    - It’s difficult to programmatically map opcodes to row consumption. This mapping should be programmatic, since the zkEVM circuit repo will change over time, and because by-hand analysis is prone to error; even a slight error here will lead to overflow vulnerabilities.
    - Requires maintaining parity between the L2Geth gas schedule and the zkEVM circuit repo; falling out of sync will lead to vulnerabilities.

**3. Introduce additional “proof gas” metering**

In addition to the normal EVM gas, we can have a separate notion of “proof gas.” This proof gas will quantify the amount of space that a transaction consumes in our circuits. Note that this “proof gas” should be multidimensional - as different opcodes take up rows in different circuits.

Once this notion of “proof gas” metering is introduced, there’s a question of at which level to enforce it.

**3a. Enforce proof gas metering in execution layer**

This solution is similar to solution 2, except it maintains both a notion of EVM gas and proof gas. This again would involve a static mapping of each opcode to the number of rows it takes up in each circuit, and then modifying L2Geth to add this notion of proof gas. If the proof gas limit is exceeded for a particular transaction, then the transaction reverts with some custom “out of proof gas” error. This ensures that no block can exceed the row limits, as execution will stop before then.
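A minimal sketch of such multidimensional metering, with hypothetical per-opcode row costs and subcircuit names, might look like this:

```python
# Hypothetical rows consumed by each opcode, per subcircuit.
PROOF_GAS_PER_OPCODE = {
    "ADD":  {"evm": 1},
    "SHA3": {"evm": 1, "keccak": 25},
    "EXP":  {"evm": 1, "exp": 8},
}

class OutOfProofGas(Exception):
    """Raised when any subcircuit's row budget is exhausted mid-execution."""

def execute(opcodes, limits):
    """Meter each opcode against per-subcircuit limits, reverting on overflow."""
    used = {name: 0 for name in limits}
    for op in opcodes:
        for subcircuit, rows in PROOF_GAS_PER_OPCODE[op].items():
            used[subcircuit] += rows
            if used[subcircuit] > limits[subcircuit]:
                raise OutOfProofGas(f"{subcircuit} limit exceeded at {op}")
    return used
```

Because proof gas is multidimensional, there is one counter per subcircuit, and execution reverts as soon as any single counter exceeds its limit.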

- Pros:
    - Proof overflow issues are handled as “out of proof gas” errors at the execution layer.
- Cons:
    - Difficult to produce a static map from opcodes to row consumption.
    - Requires maintaining parity between L2Geth and the zkEVM circuit repo.
    - Requires non-trivial changes to logic in L2Geth and the zkEVM circuit repo to support the additional notion of proof gas.

**3b. Enforce proof gas metering outside of execution layer**

Rather than generating a static mapping of opcodes to rows, we can expose an API from the zkEVM circuit repo to report the number of rows required for a given execution trace. L2Geth can generate the traces for blocks, and then query for the circuit row consumption - if the maximum number of rows is exceeded, then the block is not created.
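The block-building loop described above can be sketched as follows; `estimate_rows` here is a hypothetical stand-in for the API the zkEVM circuit repo would expose, and the traces are simplified to plain row counts:

```python
M = 2**20  # row capacity of each subcircuit (illustrative)

def estimate_rows(traces):
    """Hypothetical stand-in for the circuit repo's row-estimation API:
    returns per-subcircuit row totals for a list of execution traces."""
    totals = {}
    for trace in traces:
        for subcircuit, rows in trace.items():
            totals[subcircuit] = totals.get(subcircuit, 0) + rows
    return totals

def build_block(pending_txs, m=M):
    """Greedily add transactions while estimated row usage fits the circuit."""
    block, traces = [], []
    for tx, trace in pending_txs:
        candidate = traces + [trace]
        if all(rows <= m for rows in estimate_rows(candidate).values()):
            block.append(tx)
            traces = candidate
        else:
            break  # stop before the block becomes unprovable
    return block
```

In practice the sequencer would query the real estimation API rather than recompute totals itself, but the control flow is the same: estimate, compare against the row limit, and cut the block off before it overflows.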

- Pros:
    - Doesn’t require the complexity of programmatically mapping opcodes to row consumption.
- Cons:
    - Adds some computational overhead to L2Geth when constructing a block, as it needs to do additional work to estimate circuit row consumption.
    - Complicates forced inclusion.
        - Forced inclusion is a mechanism where users can submit L2 transactions directly through L1. These transactions are “forced” to be included in the L2 chain, as a censorship-resistance mechanism.
        - We can’t map a transaction to the number of circuit rows it consumes on L1, and therefore cannot tell whether it is provable.

### Concluding thoughts

Solution 3b appears to be the simplest and least risky of the feasible options.

The main challenge that accompanies this choice is how to handle forced transactions, as there may exist forced transactions that are too large to fit in the circuit. One idea is to apply the worst-case gas bound from solution 1 to forced transactions, so that even in the worst case, a forced transaction could not overflow the circuit. However, this naturally limits the utility of forced transactions.

In the longer term, we aim to develop a more flexible proof system that supports dynamically-sized subcircuits, avoiding this problem altogether.