The "proof overflow" problem

At Scroll, we are building in the open, and want to be transparent about all aspects of the protocol we are building.

This post describes a problem that we term “proof overflow” - a problem that arises from the separation of execution and proof generation in ZK rollups.


Scroll’s rollup flow can be roughly understood as follows:

  1. Users submit transactions to Scroll’s mempool.
  2. A sequencer node packs some transactions into a block.
  3. A batcher packs some blocks into a batch.
    1. The batch’s data (including all its transaction data) is posted, or “committed,” to Ethereum L1.
  4. A prover takes a batch and generates a proof.
    1. This proof proves that all transactions in the batch were executed correctly.
    2. This proof is submitted to Ethereum L1 to be validated. The corresponding batch is considered “finalized.”

One problem that we encountered in our Alpha testnet was that some batches could not be proven. These batches could not be proven because they were “too large” to fit in our zkEVM circuit.

One can think of the zkEVM circuit as consisting of many subcircuits, say n subcircuits, that are interconnected by lookup arguments. Each subcircuit specializes in a particular operation - for example, the Keccak circuit computes Keccak hashes, and the Exponentiation circuit computes exponentiations. A current limitation of our zkEVM circuit design is that each subcircuit has to have the same number of rows, say m rows.

Each incoming transaction will have a unique profile in terms of how many rows it consumes in each of the subcircuits. For example, there may be a transaction that results in many Keccak operations, and therefore takes up many rows in the Keccak circuit and no rows in the Exponentiation circuit. Conversely there may be a transaction that takes up very few rows in the Keccak circuit and many rows in the Exponentiation circuit.

Since a batch consists of blocks, and blocks consist of transactions, a batch’s row consumption profile is determined by the transactions that compose it. If a batch’s row consumption exceeds the maximum number of rows m, then the batch is not provable (the proof “overflows”). When a batch is not provable, it cannot be finalized on L1, and neither can any subsequent batches (which depend on the unprovable batch’s resulting state).
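To make the overflow condition concrete, here is a minimal sketch of checking whether a batch’s row consumption profile fits in a fixed-size circuit. The subcircuit names and per-transaction row counts are illustrative, not Scroll’s actual values:

```python
MAX_ROWS = 2**20  # m: every subcircuit has this many rows

def batch_is_provable(txs):
    """txs: list of dicts mapping subcircuit name -> rows consumed."""
    totals = {}
    for tx in txs:
        for subcircuit, rows in tx.items():
            totals[subcircuit] = totals.get(subcircuit, 0) + rows
    # The batch overflows if ANY single subcircuit exceeds m rows.
    return all(rows <= MAX_ROWS for rows in totals.values())

# A Keccak-heavy tx and an exponentiation-heavy tx have very different profiles.
tx_a = {"keccak": 500_000, "exp": 0}
tx_b = {"keccak": 1_000, "exp": 900_000}
print(batch_is_provable([tx_a, tx_b]))        # True: each circuit stays under 2^20
print(batch_is_provable([tx_a, tx_a, tx_a]))  # False: keccak needs 1,500,000 > 2^20 rows
```

Note that the limit is per subcircuit, not on total rows: two transactions with complementary profiles can coexist in a batch even though their combined row count is large.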

Note that even a batch containing just a single transaction may be capable of overflowing the circuit.

The “proof overflow” problem concerns the following question: how do we prevent the creation of batches that exceed circuit capacity?

The long term solution

The problem arises from a limitation of our circuit architecture: all subcircuits must have some predetermined, fixed number of rows. We are looking into redesigning our architecture such that subcircuits can independently be dynamically sized - each subcircuit’s size can scale up or down based on the requirements of a batch proof. For example, if one batch needs 2^20 rows in the Keccak circuit but only 2^14 rows in the Exponentiation circuit, the circuits can be independently scaled.

There are challenges that come from this type of dynamic design, and we are working through them. However, in the meantime, we need to solve the issue for fixed-size circuits.


Short-term solutions

1. Set the block gas limit based on worst-case opcodes :x:

The idea here is to set a block’s gas limit based on the worst-case (most expensive in terms of circuit row consumption) opcode. In other words, set the block gas limit such that even if the block were filled entirely with the most expensive opcode, it would still fit in our circuit. This guarantees that any block with at most this much gas cannot overflow the circuit.

  • Pros:
    • Simplicity.
  • Cons:
    • This is extremely inefficient.
      • Analysis shows that the ratio between proof rows and EVM gas for the worst opcode (SHA) is ~11x.
        • Each marginal Keccak byte takes up ~2.2 rows, while costing ~6/32 EVM gas. With m = 2^20 (approximately 1 million rows), we can fit about 2^20 / 2.2 Keccak bytes. This corresponds to a maximum block limit of (2^20 / 2.2) * (6/32) ~= 89,000 gas. Too small!!
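The arithmetic behind that bound can be reproduced directly:

```python
# Reproducing the worst-case (Keccak) bound from the analysis above.
ROWS_PER_KECCAK_BYTE = 2.2    # circuit rows per marginal Keccak byte
GAS_PER_KECCAK_BYTE = 6 / 32  # EVM gas per marginal Keccak byte
M = 2**20                     # rows available per subcircuit

max_keccak_bytes = M / ROWS_PER_KECCAK_BYTE
worst_case_gas_limit = max_keccak_bytes * GAS_PER_KECCAK_BYTE
print(f"{worst_case_gas_limit:,.0f} gas")  # ~89,000 gas: far too small for a block
```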

2. Gas re-pricing :warning:

We could modify the opcode-to-gas schedule to reflect proving cost, rather than execution cost. This would involve a static mapping from each opcode to the maximum number of rows it takes up across all subcircuits, and then modifying our Geth fork (“L2Geth”) to use this new gas schedule.

  • Pros:
    • Proof overflow issues are handled as “out of gas” errors at the execution layer.
  • Cons:
    • Potentially breaks contracts that depend on the normal EVM gas schedule.
    • It’s difficult to programmatically map opcodes to row consumption.
      • This mapping should be programmatic, since the zkEVM circuit repo will change over time, and also because by-hand analysis is prone to error, and even a slight error here will lead to overflow vulnerabilities.
    • Requires maintaining parity between the L2Geth gas schedule and the zkEVM circuit repo - falling out of sync will lead to vulnerabilities.

3. Introduce additional “proof gas” metering

In addition to the normal EVM gas, we can have a separate notion of “proof gas.” This proof gas will quantify the amount of space that a transaction consumes in our circuits. Note that this “proof gas” should be multidimensional - as different opcodes take up rows in different circuits.

Once this notion of “proof gas” metering is introduced, there’s a question of at which level to enforce it.

3a. Enforce proof gas metering in execution layer

This solution is similar to solution 2, except it maintains both a notion of EVM gas and proof gas. This again would involve a static mapping of each opcode to the number of rows it takes up in each circuit, and then modifying L2Geth to add this notion of proof gas. If the proof gas limit is exceeded for a particular transaction, then the transaction reverts with some custom “out of proof gas” error. This ensures that no block can exceed the row limits, as execution will stop before then.
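A rough sketch of what this per-opcode metering could look like. The opcode-to-rows table and limits here are entirely hypothetical - producing such a table correctly is precisely the hard part noted in the cons below:

```python
# Hypothetical multidimensional "proof gas" metering inside the interpreter loop.
PROOF_GAS_TABLE = {
    "SHA3": {"keccak": 2048},  # illustrative row counts, NOT real values
    "EXP":  {"exp": 1024},
    "ADD":  {"evm": 2},
}
PROOF_GAS_LIMITS = {"keccak": 2**20, "exp": 2**20, "evm": 2**20}

class OutOfProofGas(Exception):
    """Raised when a transaction exceeds a circuit's row budget."""

def charge(used, opcode):
    # Charge this opcode's row cost against every circuit it touches.
    for circuit, rows in PROOF_GAS_TABLE[opcode].items():
        used[circuit] = used.get(circuit, 0) + rows
        if used[circuit] > PROOF_GAS_LIMITS[circuit]:
            raise OutOfProofGas(f"{circuit} rows exhausted at {opcode}")
```

The key property is that `charge` is called during execution, so a transaction halts at the exact point its row budget runs out, analogous to an ordinary out-of-gas error.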

  • Pros:
    • Proof overflow issues are handled as “out of proof gas” errors at the execution layer.
  • Cons:
    • Difficult to produce static map from opcodes to row consumption.
    • Requires maintaining parity between L2Geth and zkEVM circuit repo.
    • Requires non-trivial changes to logic in L2Geth and zkEVM circuit repo to support additional notion of proof gas.

3b. Enforce proof gas metering outside of execution layer

Rather than generating a static mapping of opcodes to rows, we can expose an API from the zkEVM circuit repo to report the number of rows required for a given execution trace. L2Geth can generate the traces for blocks, and then query for the circuit row consumption - if the maximum number of rows is exceeded, then the block is not created.
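A minimal sketch of this block-building loop, where `execute_tx` and `row_consumption` are hypothetical stand-ins for L2Geth’s tracer and the zkEVM circuit repo’s row-estimation API:

```python
MAX_ROWS = 2**20  # per-subcircuit row budget

def build_block(pending_txs, execute_tx, row_consumption):
    block, used = [], {}
    for tx in pending_txs:
        trace = execute_tx(tx)            # run the tx, produce its trace
        rows = row_consumption(trace)     # dict: subcircuit -> rows needed
        if any(r > MAX_ROWS for r in rows.values()):
            continue  # unprovable even alone: drop the tx permanently
        if any(used.get(c, 0) + r > MAX_ROWS for c, r in rows.items()):
            break     # doesn't fit this block: seal it, retry tx later
        for c, r in rows.items():
            used[c] = used.get(c, 0) + r
        block.append(tx)
    return block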

  • Pros:
    • Doesn’t require complexity of programmatically mapping opcodes to row consumption.
  • Cons:
    • Constructing a block adds some computational overhead to L2Geth, since it must do additional work to estimate circuit row consumption.
    • Complicates forced inclusion.
      • Forced inclusion is a mechanism where users can submit L2 transactions directly though L1. These transactions are “forced” to be included in the L2 chain, as a censorship resistance mechanism.
      • On L1, we can’t map a transaction to the number of circuit rows it consumes, and therefore cannot tell whether it is provable.

Concluding thoughts

It seems that solution 3b is the simplest and least risky solution that is also feasible.

The main challenge that accompanies this choice is how to handle forced transactions, as there may exist forced transactions that are too large to fit in a circuit. One idea here is to use the idea from solution 1 to bound the gas limit for forced transactions, so that even in the worst case, a forced transaction couldn’t overflow the circuit. However, this naturally limits the utility of forced transactions.

In the longer term, we aim to develop a more flexible proof system that supports dynamically-sized subcircuits, avoiding this problem altogether.


The simplest way may be the most efficient way. Enforcing proof gas metering outside of the execution layer seems better.


Let me explain how we are solving this in Polygon zkEVM.

Our proof mainly consists of an assembly program (we call it the ROM) that runs on top of a zkProcessor. This program processes transactions the same way any Ethereum client would. Some opcodes can be resolved in very few clocks, while others contain loops and may take longer. For expensive operations like Keccaks, 256-bit arithmetic, or binary operations, we rely on external processors that we link later on with plookups.

To solve this “proof overflow,” we keep track of the resources we are using, so at any moment we know how many resources remain unused.

The trick here is that before doing any operation, we check that we have enough resources to complete it. If we don’t, we jump to a part of the code where the TX is discarded, but the proof is still valid. In a sense, we prove that something is not provable.

The counters we have are clocks, poseidon, arith, bin, keccak, memalign, and paddingpg.
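As a rough illustration of this counter mechanism (the limits and costs below are made up, and the real checks are implemented in the ROM itself, not in Python):

```python
LIMITS = {"clocks": 2**23, "keccak": 2**20, "arith": 2**20}  # illustrative

def try_apply(tx_cost, remaining):
    """Before running a TX, check every counter; if any would go negative,
    discard the TX -- the proof of the discard itself is still valid."""
    if any(remaining[c] < tx_cost.get(c, 0) for c in remaining):
        return "discarded"  # jump to the discard path
    for c, cost in tx_cost.items():
        remaining[c] -= cost
    return "applied"

remaining = dict(LIMITS)
print(try_apply({"clocks": 1000, "keccak": 50}, remaining))  # applied
print(try_apply({"keccak": 2**21}, remaining))               # discarded
```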

For future versions, we are working on the concept of VADCOPs (Variable Degree Composite Proofs). This will allow verifying a proof with state machines of different sizes. With this technique, we expect to remove most of the counters.


Thanks for sharing, Jordi.

Your approach makes sense - just rephrasing to check my understanding:

  • Polygon zkEVM proves execution of its custom ROM, which is a program that implements EVM execution and is written in zkASM.
  • This ROM program, in addition to implementing the normal EVM execution logic, tracks circuit resource consumption via counters, and discards/reverts transactions that would cause some resource to exceed its limit.
    • Since this resource tracking behavior is itself encoded in the ROM, and the proof proves the valid ROM execution, the proof effectively proves that indeed the resource tracking was done correctly and the discarded transaction was discarded for the correct reason (would have exceeded some resource limit otherwise).

This is a clean solution for Polygon zkEVM, I think essentially due to the coupling of execution and proving - the execution program is written in terms of the zkProcessor, and so it’s able to cleanly reason about the zkProcessor’s (and thereby the prover’s) resources.

This solution would not be so clean for Scroll. Our execution and proving are more decoupled - we are trying to reuse existing Ethereum execution clients like Geth (which are not implemented in terms of any particular zkProcessor), and to prove their execution trace outputs. Our analogous solution would be that of 3a, where we could go in and decorate Geth to be able to map its operations to various resources in our zk circuit so that it can track circuit resource consumption, and then add logic during execution to check if a transaction causes resource limits to be exceeded and revert. But as I mentioned in the original post this is quite complex and delicate.

I’m not familiar with this term “Variable Degree Composite Proofs,” but it sounds like we are both aiming towards dynamically-sized subcircuits as the ultimate solution to this type of problem.


I wonder if 3b introduces a potential attack vector. Say some malicious actor constantly spams the network with transactions that saturate row capacity/resources. The block producer could unknowingly include these transactions in a block and would not know the block is invalid until it is too late and the block has to be dropped. This could happen consistently for many blocks in a row and, in extreme cases, for every block - ultimately stalling block production.

Could you add some context around why it is “difficult to produce a static map from opcodes to row consumption,” please? Could a worst-case approximation be calculated? I think @Jordi_Baylina’s solution is elegant and practical, and it is similar to 3a. I don’t think the proof resource metering logic needs to be embedded in the zkEVM (though it would be nice, as you could prove infeasibly resource-heavy txs); I think it would suffice to have the metering applied exclusively in L2Geth. As a block is created, the proof resource consumption is tracked, and once it is saturated the block is sealed. If a transaction would exhaust proof resources, it is dropped. To minimise the impact of resource-heavy transactions, the goal should be to drop them as soon as possible: as soon as a node receives a transaction, it should compute the proof resource consumption, and if it is too high, the transaction should be dropped (not added to the mempool) and not gossiped to the network.


Under 3b, upon receiving a transaction, the block producer will execute it in L2Geth, generate the execution trace, and then compute the required row resources. If it finds that the transaction exceeds the max row resources, it will revert and drop the transaction (and not retry including it). Otherwise, if it finds that the transaction can’t fit inside the current block, it can revert the transaction, “seal” the current block, and try to fit the transaction into a subsequent block.

It’s a fair point that someone could try and spam the network with transactions that exceed row capacity, to force the block producer to execute transactions that eventually get reverted and discarded. We’ll need to think about this attack vector carefully.

Ideally, we would be able to just look at a transaction and tell whether or not it exhausts proof capacity. But in reality, we need to execute it, since its behavior depends on the chain’s state. If we did something like 3a, we would still be executing the transaction, but it would halt and revert at the point of resource exhaustion, and ideally would still charge gas fees for the partially-completed execution.

To elaborate on why 3a is difficult to implement: we require a static mapping from each opcode to its row consumption - we need to encode a way for L2Geth to translate an EVM computation into the number of rows it consumes in our zkEVM circuits. We currently don’t have a good systematic way of generating such a mapping - the formulas to express row cost for some opcodes are complex and involve many variables. As changes to the zkEVM circuits are made (optimizations, updates, etc), this mapping will also need to be kept in sync.