Dark Mode Toggle

Possible futures of the Ethereum protocol, part 4: The Verge

2024 Oct 23 See all posts

Possible futures of the Ethereum protocol, part 4: The Verge

Special thanks to Justin Drake, Hsiao-wei Wang, Guillaume Ballet, Ignacio, Josh Rudolf, Lev Soukhanov, Ryan Sean Adams and Uma Roy for feedback and review.

One of the most powerful things about a blockchain is the fact that anyone can run a node on their computer and verify that the chain is correct. Even if 95% of the nodes running the chain consensus (PoW, PoS...) all immediately agreed to change the rules, and started producing blocks according to the new rules, everyone running a fully-verifying node would refuse to accept the chain. The stakers who are not part of such a cabal would automatically converge on, and continue building, a chain that continues to follow the old rules, and fully-verifying users would follow that chain.

This is a key difference between blockchains and centralized systems. However, for this property to hold, running a fully-verifying node needs to be actually feasible for a critical mass of people. This applies both to stakers (as if stakers are not verifying the chain, they are not actually contributing to enforcing the protocol rules), and to regular users. Today, running a node is possible on a consumer laptop (including the one being used to write this post), but doing so is difficult. The Verge is about changing this, and making fully-verifying the chain so computationally affordable that every mobile wallet, browser wallet, and even smart watch is doing it by default.

The Verge, 2023 roadmap.

Originally, the "Verge" referred to the idea of moving Ethereum state storage to Verkle trees - a tree structure that allows for much more compact proofs, enabling stateless validation of Ethereum blocks. A node could verify an Ethereum block without having any of the Ethereum state (account balances, contract code, storage...) on its hard drive, at a cost of spending a few hundred kilobytes of proof data and a few hundred extra milliseconds verifying a proof. Today, the Verge represents a much larger vision focused on enabling maximally resource-efficient verification of the Ethereum chain, which includes not just stateless validation technology, but also verifying all Ethereum execution with SNARKs.

In addition to the added long-term focus on SNARK-verifying the whole chain, another new question has to do with whether or not Verkle trees are even the best technology in the first place. Verkle trees are vulnerable to quantum computers, and so if we replace the current KECCAK Merkle Patricia tree with Verkle trees, we will later have to replace the trees again. The natural alternative to Merkle trees is skipping straight to using a STARK of Merkle branches in a binary tree. Historically, this has been considered non-viable due to overhead and technical complexity. More recently, however, we have seen Polygon prove 1.7 million Poseidon hashes per second on a laptop with circle STARKs, and proving times for more "conventional" hashes are also rapidly improving thanks to techniques like GKR.

As a result, over the past year the Verge has become much more open-ended, and there are several possibilities.

The Verge: key goals

Stateless clients: fully-verifying clients, and staking nodes, should not need more than a few GB of storage
(Longer term) fully verify the chain (consensus and execution) on a smart watch. Download some data, verify a SNARK, done.

Stateless verification: Verkle or STARKs

What problem are we trying to solve?

Today, an Ethereum client needs to store hundreds of gigabytes of state data in order to verify blocks, and this amount continues to increase with each passing year. The raw state data increases by ~30 GB per year, and individual clients have to store some extra data on top to be able to update the trie efficiently.

This reduces the number of users who can run fully-verifying Ethereum nodes: even though hard drives large enough to store all Ethereum state and even history for many years are readily available, the computers that people buy by default tend to only have a few hundred gigabytes of storage. The state size also introduces a great friction into the process of setting up a node for the first time: the node needs to download the entire state, which can take hours or days. This has all kinds of knockon effects. For example, it makes it significantly harder for a staker to upgrade their staking setup. Technically, it's possible to do this with no downtime - start a new client, wait for it to sync, then shut down the old client and transfer the key - but in practice it's technically complicated.

What is it and how does it work?

Stateless verification is a technology that allows nodes to verify blocks without having the entire state. Instead, each block comes with a witness, which includes (i) the values (eg. code, balances, storage) at the specific locations in the state that the block will access, and (ii) a cryptographic proof that those values are correct.

Actually implementing stateless verification requires changing the Ethereum state tree structure. This is because the current Merkle Patricia tree is extremely unfriendly to implementing any cryptographic proof scheme, especially in the worst case. This is true for both "raw" Merkle branches, and the possibility of "wrapping" the Merkle branches in a STARK. The key difficulties stem from two weaknesses of the MPT:

It's a hexary tree (ie. each node has 16 children). This means that on average, a proof in a size-N tree has 32 * (16 - 1) * log16(N) = 120 * log2(N) bytes, or about 3840 bytes in a 2³²-item tree. With a binary tree, you only need 32 * (2 - 1) * log2(N) = 32 * log2(N) bytes, or about 1024 bytes.
The code is not Merkelized. This means that proving any access of account code requires providing the entire code, which is a maximum of 24000 bytes.

We can compute a worst-case scenario as follows:

30,000,000 gas / 2,400 ("cold" account read cost) * (5 * 480 + 24,000) = 330,000,000 bytes

The branch cost is slightly decreased ( 5 * 480 instead of 8 * 480 ) because the top parts of branches are repeated when there are many of them. But even still, this works out to a completely unrealistic amount of data to download within one slot. If we try to wrap it in a STARK, we get two problems: (i) KECCAK is relatively STARK-unfriendly, and (ii) 330 MB of data means we have to prove 5 million calls to the KECCAK round function, which is way too much to prove on all but the most powerful consumer hardware, even if we could make STARK-proving KECCAK much more efficient.

If we just replace the hexary tree with a binary tree, and we additionally Merkelize code, then the worst case becomes roughly 30,000,000 / 2,400 * 32 * (32 - 14 + 8) = 10,400,000 bytes (the 14 is a subtraction for redundant bits of ~2¹⁴ branches, and the 8 is the length of a proof going into a leaf in a chunk). Note that this requires a change in gas costs, to charge for accessing each individual chunk of code; EIP-4762 does this. 10.4 MB is much better, but it's still too much data for many nodes to download within one slot. And so we need to introduce some more powerful technology. For this, there are two leading solutions: Verkle trees, and STARKed binary hash trees.

Verkle trees

Verkle trees use elliptic curve-based vector commitments to make much shorter proofs. The key unlock is that the piece of the proof corresponding to each parent-child relationship is only 32 bytes, regardless of the width of the tree. The only limit to the width of the tree is that if the tree gets too wide, proofs become computationally inefficient. The implementation proposed for Ethereum has a width of 256.

The size of a single branch in a proof thus becomes 32 * log256(N) = 4 * log2(N) bytes. The theoretical max proof size thus becomes roughly 30,000,000 / 2,400 * 32 * (32 - 14 + 8) / 8 = 1,300,000 bytes (the math works out slightly differently in practice because of uneven distribution of state chunks, but this is fine as a first approximation).

As an additional caveat, note that in all of the above examples, this "worst case" is not quite a worst case: an even worse case is when an attacker intentionally "mines" two addresses to have a long common prefix in the tree and reads from one of them, which can extend the worst-case branch length by another ~2x. But even with this caveat, Verkle trees get us to ~2.6 MB worst-case proofs, which roughly matches today's worst-case calldata.

We also take advantage of this caveat to do another thing: we make it very cheap to access "adjacent" storage: either many chunks of code of the same contract, or adjacent storage slots. EIP-4762 provides a definition of adjacency, and charges only 200 gas for adjacent access. With adjacent accesses, the worst-case proof size becomes 30,000,000 / 200 * 32 = 4,800,800 bytes, which is still roughly within tolerances. If we want this value decreased for safety, we can increase the adjacent access costs slightly.

STARKed binary hash trees

The technology here is very self-explanatory: you do a binary tree, take the max-10.4-MB proof that you need to prove the values in a block, and replace the proof with a STARK of the proof. This gets us to the point where the proof itself consists of only the data being proven, plus ~100-300 kB fixed overhead from the actual STARK.

The main challenge here is prover time. We can make basically the same calculations as above, except instead of counting bytes, we count hashes. A 10.4 MB block means 330,000 hashes. If we add in the possibility of an attacker "mining" addresses with a long common prefix in the tree, the real worst case becomes around 660,000 hashes. So if we can prove ~200,000 hashes per second, we're fine.

These numbers have already been reached on a consumer laptop with the Poseidon hash function, which has been explicitly designed for STARK-friendliness. However, Poseidon is relatively immature, and so many do not yet trust it for security. There are thus two realistic paths forward:

Do lots and lots of security analysis on Poseidon quickly, and get comfortable enough with it to deploy it at L1
Use a more "conservative" hash function, such as SHA256 or BLAKE

Starkware's circle STARK provers at the time of this writing can only prove ~10-30k hashes per second on a consumer laptop if they are proving conservative hash functions. However, STARK technology is improving quickly. Even today, GKR-based techniques show promise in potentially increasing this to the ~100-200k range.

Use cases of witnesses other than verifying blocks

In addition to verifying blocks, there are three other key use cases for more efficient stateless validation:

Mempools: when a transaction gets broadcasted, nodes in the p2p network need to verify that the transaction is valid before re-broadcasting it. Today, validation involves verifying the signature, but also checking that the balance is sufficient and the nonce is correct. In the future (eg. with native account abstraction, such as EIP-7701), this may involve running some EVM code, which does some state accesses. If nodes are stateless, the transaction will need to come with a proof proving the state objects.
Inclusion lists: this is a proposed feature which allows (potentially small and unsophisticated) proof-of-stake validators to force the next block to contain a transaction, regardless of the (potentially large and sophisticated) block builder's wishes. This would reduce the ability of powerful actors to manipulate the blockchain by delaying transactions. However, this requires validators to have a way to verify the validity of transactions in the inclusion list.
Light clients: if we want users accessing the chain through wallets (eg. Metamask, Rainbow, Rabby...) to do so without trusting centralized actors, they need to run light clients (eg. Helios). The core Helios module gives the user verified state roots. However, for a fully trustless experience, the user needs proofs for each individual RPC call they make (eg. for an eth_call request, the user would need a proof of all the state accessed during the call)

One thing that all of these use cases have in common is that they require a fairly large number of proofs, but each proof is small. For this reason, STARK proofs do not actually make sense for them; instead, it's most realistic to just use Merkle branches directly. Another advantage of Merkle branches is that they are updateable: given a proof of a state object X, rooted in block B, if you receive a child block B2 with its witness, you can update the proof to make it rooted in block B2. Verkle proofs are also natively updateable.

What are some links to existing research?

Verkle trees: https://vitalik.eth.limo/general/2021/06/18/verkle.html
Original Verkle tree paper by John Kuszmaul: https://math.mit.edu/research/highschool/primes/materials/2018/Kuszmaul.pdf
Starkware proving data: https://x.com/StarkWareLtd/status/1807776563188162562
Polygon proving data: https://x.com/dlubarov/status/1845862467315920940
Poseidon2 paper: https://eprint.iacr.org/2023/323
Ajtai (alternative fast hash function based on lattice hardness): https://www.wisdom.weizmann.ac.il/~oded/COL/cfh.pdf
Verkle.info: https://verkle.info/

What is left to do, and what are the tradeoffs?

The main remaining work to do is:

More analysis on the consequences of EIP-4762 (statelessness gas cost changes)
More work finalizing and testing the transition procedure, which is a large part of the complexity of any statelessness EIP
More security analysis of Poseidon, Ajtai and other "STARK-friendly" hash functions
More development of ultra-efficient STARK protocols for "conservative" (or "traditional") hash functions, eg. based on ideas from Binius or GKR.

We also will soon have a decision point of which of three options to take: (i) Verkle trees, (ii) STARK-friendly hash functions, and (iii) conservative hash functions. Their properties can be approximately summarized in this table:

Algorithm	Proof size	Security assumptions	Worst-case prover time (today)
Verkle	Data plus ~100-2,000 kB	Elliptic curve (not quantum-resistant)	< 1s
STARK over conservative hash functions (eg. SHA256, BLAKE)	Data plus ~100-300 kB	Conservative hash functions	> 10 s
STARK over aggressive hash functions (Poseidon, Ajtai)	Data plus ~100-300 kB	Relatively new and less-tested hash functions	1-2s

In addition to these "headline numbers", there are a few other important considerations:

Today, the Verkle tree code is quite mature. Using anything but Verkle would realistically delay deployment, likely by one hard fork. This can be okay, especially if we need the extra time anyway to work on hash function analysis or prover implementations, and if we have other important features we want to get included in Ethereum earlier.
Updating the state root is faster with hashes than with Verkle trees. This means that hash-based approaches can lead to lower sync time for full nodes.
Verkle trees have interesting witness update properties - Verkle tree witnesses are updateable. This property is useful for mempools, inclusion lists, and other use cases, and it also can potentially help with making implementations more efficient: if a state object is updated, you can update the witness on the second last level without even reading the last level.
Verkle trees are more difficult to SNARK-prove. If we want to reduce the proof size all the way to a few kilobytes, Verkle proofs introduce some difficulty. This is because the verification of a Verkle proof introduces a large number of 256-bit operations, which requires the proof system to either have a lot of overhead, or itself have a custom internal construction with a 256-bit part for the Verkle proof. This is not a problem for statelessness itself, but introduces more difficulties later.

If we want the Verkle witness updateability properties in a way that's quantum-safe, and reasonably efficient, one other possible path is lattice-based Merkle trees.

If proof systems end up being not efficient enough in the worst case, one other "rabbit out of a hat" that we could use to compensate for such an inadequacy is multidimensional gas: have separate gas limits for (i) calldata, (ii) computation, (iii) state accesses, and possibly other distinct resources. Multidimensional gas adds complexity, but in exchange it much more tightly bounds the ratio between average case and worst case. With multidimensional gas, the theoretical max number of branches to prove could plausibly decrease from 30,000,000 / 2400 = 12,500 to eg. 3000. This would make BLAKE3 (just barely) sufficient even today, with no further prover improvements.

Multidimensional gas allows the resource limits of a block to much more closely replicate the resource limits of the underlying hardware.

Another "rabbit out of a hat" is this proposal to delay state root computation until the slot after a block. This would give us a full 12 seconds to compute the state root, meaning that only ~60,000 hashes/sec proving time is sufficient even in the most extreme cases, again putting us in the range of BLAKE3 being just barely sufficient.

This approach has the downside that it would increase light client latency by a slot, though there are more clever versions of the technique that reduce this delay to just the proof generation latency. For example, the proof could be broadcasted across the network as soon as any node generates it, instead of waiting for the next block.

How does it interact with other parts of the roadmap?

Solving statelessness greatly increases the ease of solo staking. This becomes much more valuable if technologies that reduce the minimum balance of solo staking, such as Orbit SSF or application-layer strategies like squad staking, become available.

Multidimensional gas becomes easier if EOF is also introduced. This is because a key complexity of multidimensional gas for execution is handling sub-calls which don't pass along the parent call's full gas, and EOF makes this problem trivial by simply making such sub-calls illegal (and native account abstraction would provide an in-protocol alternative for the current primary use case of partial-gas sub-calls).

One other important synergy is between stateless validation and history expiry. Today, clients have to store nearly a terabyte of history data; this data is several times larger than the state. Even if clients are stateless, the dream of nearly-storage-free clients will not be realized unless we can relieve clients of the responsibility to store history as well. The first step in this regard is EIP-4444, which also implies storing historical data in torrents or the Portal network.

Validity proofs of EVM execution

What problem are we trying to solve?

The long-term goal for Ethereum block verification is clear: you should be able to verify an Ethereum block by (i) downloading the block, or perhaps even only small parts of the block with data availability sampling, and (ii) verifying a small proof that the block is valid. This would be an extremely low-resource operation, and could be done on a mobile client, inside a browser wallet, or even (without the data availability part) in another chain.

Getting to this point requires having SNARK or STARK proofs of (i) the consensus layer (ie. the proof of stake), and (ii) the execution layer (ie. the EVM). The former is itself a challenge, and it should be addressed during the process of making further ongoing improvements to the consensus layer (eg. for single slot finality). The latter requires proofs of EVM execution.

What is it and how does it work?

Formally, in the Ethereum specifications, the EVM is defined as a state transition function: you have some pre-state S, a block B, and you are computing a post-state S' = STF(S, B). If a user is using a light client, they do not have S and S' or even B in their entirety; instead, they have a pre-state root R, and a post-state root R' and a block hash H. The full statement that needs to be proven is approximately:

Public inputs: pre-state root R, post-state root R' , block hash H
Private inputs: block body B , the objects in the state accessed by the block Q, the same objects after executing the block Q' , the state proof (eg. Merkle branches) P
Claim 1: P is a valid proof that Q contains some portion of the state represented by R
Claim 2: if you run STF on Q , (i) the execution only accesses objects inside of Q, (ii) the block is valid, and (iii) the outcome is Q'
Claim 3: If you recompute the new state root, using information from Q' and P , you get R'

If this exists, you can have a light client that fully verifies the Ethereum EVM execution. This allows clients to be quite low-resource already. To get to true fully-verifying Ethereum clients, you also need to do the same for the consensus side.

Implementations of validity proofs for EVM computation already exist, and are being used heavily by layer 2s. However, there is still a lot to do to make EVM validity proofs viable for L1.

What are some links to existing research?

EF PSE ZK-EVM (now sunsetted because better options exist): https://github.com/privacy-scaling-explorations/zkevm-circuits
Zeth, which works by compiling the EVM into the RISC-0 ZK-VM: https://github.com/risc0/zeth
ZK-EVM formal verification project: https://verified-zkevm.org/

What is left to do, and what are the tradeoffs?

Today, validity proofs for the EVM are inadequate in two dimensions: security and prover time.

A secure validity proof involves having an assurance that the SNARK actually verifies the EVM computation, and does not have a bug in it. The two leading techniques for increasing security are multi-provers and formal verification. Multi-provers means having multiple independently-written validity proof implementations, much like there are multiple clients, and having clients accept a block if it's proven by a sufficiently large subset of these implementations. Formal verification involves using tools often used to prove mathematical theorems, such as Lean4, to prove that a validity proof only accepts inputs that are correct executions of the underlying EVM specification (eg. the EVM K Semantics or the Ethereum execution layer specifications (EELS) written in python).

Sufficiently fast prover time means that any Ethereum block can be proven in less than ~4 seconds. Today, we are still far from this, though we are much closer than was imagined possible even two years ago. To get to this goal, we need to advance in three directions:

Parallelization - the fastest EVM prover today can prove an average Ethereum block in ~15 seconds. It does this by parallelizing between several hundred GPUs, and then aggregating their work together at the end. Theoretically, we know exactly how to make an EVM prover that can prove computation in O(log(N)) time: have one GPU do each step, and then do an "aggregation tree":

There are challenges in implementing this. To work even in worst-case situations, where a very large transaction takes up an entire block, the splitting of the computation cannot be per-tx; it must be per-opcode (of the EVM or of an underlying VM like RISC-V). A key implementation challenge that makes this not completely trivial is the need to make sure that the "memory" of the VM is consistent between different parts of the proof. However, if we can make this kind of recursive proof, then we know that at least the prover latency problem is solved, even without any improvements on any of the other axes.
Proof system optimization - new proof systems like Orion, Binius, GKR, and many more, are likely to lead to yet another large reduction in prover time for general-purpose computation.
EVM gas cost other changes - many things in the EVM could be optimized to be much more prover-friendly, especially in the worst case. Being able to prove an average Ethereum block in 4 seconds is not enough, if an attacker can construct a block that will clog up provers' time for ten minutes. The EVM changes required can largely be broken down into two categories:
- Gas cost changes - if an operation takes a long time to prove, it should have a high gas cost, even if it is relatively fast to compute. EIP-7667 is one proposed EIP to handle the worst offenders in this regard: it significantly increases the gas costs of (conventional) hash functions exposed as relatively cheap opcodes and precompiles. To compensate for these gas cost increases, we can reduce the gas cost of EVM opcodes that are relatively cheap to prove, so as to keep average throughput the same.
- Data structure replacements - in addition to replacing the state tree with a more STARK-friendly alternative, we need to replace the transaction list, receipt tree, and other structures that are expensive to prove. Etan Kissling's EIPs that move transaction and receipt structures to SSZ ([1] [2] [3]) are one step in that direction.

In addition to this, the two "rabbits out of a hat" mentioned in the previous section (multidimensional gas, and delayed state root) can also help here. However, it's worth noting that unlike stateless validation, where using either rabbit out of a hat means that we have sufficient technology to do what we need today, even with these techniques full ZK-EVM validation will take more work - it will just take less work.

One thing that was not mentioned above is prover hardware: using GPUs, FPGAs and ASICs to generate proofs faster. Fabric Cryptography, Cysic and Accseal are three companies pushing ahead on this. This will be extremely valuable for layer 2s, but it's unlikely to become a decisive consideration for layer 1, because there is a strong desire to keep layer 1 highly decentralized, which implies that proof generation must be within the capabilities of a reasonably large subset of Ethereum users, and should not be bottlenecked by a single company's hardware. Layer 2s can make more aggressive tradeoffs.

More work remains to be done in each of these areas:

Parallelized proving requires proof systems where different parts of the proof can "share a memory" (eg. lookup tables). We know techniques for doing this, but they need to be implemented.
We need more analysis to figure out the ideal set of gas cost changes to minimize worst-case prover time.
We need more work on proving systems

Likely tradeoffs here include:

Security vs prover time: it may be possible to cut down prover time using aggressive choices for hash functions, proof systems with more complexity or more aggressive security assumptions, or other design choices.
Decentralization vs prover time: the community needs to agree on what is the "spec" for prover hardware that it is targeting. Is it okay to require provers to be large-scale entities? Do we want a high-end consumer laptop to be able to prove an Ethereum block in 4 seconds? Something in between?
Degree of breaking backwards compatibility: insufficiencies in other areas can be compensated for by making much more aggressive gas cost changes, but this is more likely to disproportionately increase the cost of some applications over others, and force developers to rewrite and redeploy code in order to remain economically viable. Similarly, the "rabbits out of a hat" have their own complexity and downsides.

How does it interact with other parts of the roadmap?

The core tech needed to make EVM validity proofs at layer 1 happen is heavily shared with two other areas:

Validity proofs at layer 2 (ie. "ZK rollups")
The "STARK a binary hash proof" approach to statelessness

A successful implementation of validity proofs at layer 1 allows for the ultimate in easy solo staking: even the weakest computer (including phone or smart watch) would be able to stake. This further increases the value of addressing the other limitations to solo staking (eg. the 32 ETH minimum).

Additionally, EVM validity proofs at L1 can enable considerable L1 gas limit increases.

Validity proofs of consensus

What problem are we trying to solve?

If we want it to be possible to fully verify an Ethereum block with a SNARK, then the EVM execution is not the only part we need to prove. We also need to prove the consensus: the part of the system that handles deposits, withdrawals, signatures, validator balance updates, and other elements of the proof-of-stake part of Ethereum.

The consensus is considerably simpler than the EVM, but it has the challenge that we don't have layer 2 EVM rollups as a reason why most of the work is going to be done anyway. Hence, any implementation of proving Ethereum consensus would need to be done "from scratch", although the proof systems themselves, are shared work that can be built on top of.

What is it and how does it work?

The beacon chain is defined as a state transition function, just like the EVM. The state transition function is dominated by three things:

ECADDs (for verifying BLS signatures)
Pairings (for verifying BLS signatures)
SHA256 hashes (for reading and updating the state)

In each block, we need to prove 1-16 BLS12-381 ECADDs per validator (potentially more than one, because signatures can get included in multiple aggregates). This can be compensated for by subset precomputation techniques, so altogether we can say that it's one BLS12-381 ECADD per validator. Today, there are ~30,000 validators signing in each slot. In the future, with single slot finality, this could change in either direction (see the exposition here): if we take the "brute force" route, this could increase to 1 million validators per slot. Meanwhile, with Orbit SSF, it would stay at 32,768, or even decrease to 8,192.

How BLS aggregation works. Verifying the aggregate signature only requires an ECADD per participant, instead of an ECMUL. But 30,000 ECADDs is still a lot to prove.

For pairings, there is a current maximum of 128 attestations per slot, implying the need to verify 128 pairings. With EIP-7549 and further changes, this can plausibly decrease to 16 per slot. Pairings are few in number, but they are extremely expensive: each one takes thousands of times longer to run (or prove) than an ECADD.

A major challenge with proving BLS12-381 operations is that there is no convenient curve whose curve order equals the BLS12-381 field size, which adds considerable overhead to any proving system. The Verkle trees proposed for Ethereum, on the other hand, were built with the Bandersnatch curve, which makes BLS12-381 itself the natural curve to use in a SNARK system to prove a Verkle branch. A fairly naive implementation can prove ~100 G1 additions per second; clever techniques like GKR would almost certainly be required to make proving fast enough.

For SHA256 hashes, today the worst case is the epoch transition block, where the entire validator short-balance tree, and a significant number of validator balances, gets updated. The validator short-balance tree has one byte per validator, so ~1 MB of data gets re-hashed. This corresponds to 32,768 SHA256 calls. If a thousand validators' balances fall above or below a threshold that requires the effective balance, in the validator record, to be updated, that corresponds to a thousand Merkle branches, so perhaps another ten thousand hashes. The shuffling mechanism requires 90 bits per validator (so, 11 MB of data), but this can be computed at any time over the course of an epoch. With single-slot finality, these numbers may again increase or decrease depending on the details. Shuffling becomes unnecessary, though Orbit may bring back the need for some degree of it.

Another challenge is the need to read all validator state, including public keys, in order to verify a block. Reading the public keys alone takes 48 million bytes for 1 million validators, together with Merkle branches. This requires millions of hashes per epoch. If we had to prove the proof-of-stake validation today, a realistic approach would be some form of incrementally verifiable computation: store a separate data structure inside the proof system that is optimized for efficient lookups, and prove updates to this structure.

To summarize, there are a lot of challenges.

Fixing these challenges most efficiently may well require a deep redesign of the beacon chain, which could happen at the same time as a switch to single-slot finality. Features of this redesign could include:

Hash function change: today, the "full" SHA256 hash function gets used, so due to padding each call corresponds to two underlying compression function calls. At the very least, we can get a 2x gain by switching to the SHA256 compression function. If we switch to Poseidon, we can get a potentially ~100x gain, which could solve all of our problems (at least for hashes) completely: at 1.7 million hashes (54 MB) per second, even a million validator records can be "read" into a proof in a few seconds.
If Orbit, store shuffled validator records directly: if you choose some number of validators (eg. 8,192 or 32,768) to be the committee for a given slot, put them directly into the state beside each other, so that the minimum amount of hashing is needed to read all validator pubkeys into a proof. This would also allow all balance updates to be done efficiently.
Signature aggregation: any high-performance signature aggregation scheme will realistically involve some kind of recursive-proving, where intermediate proofs of subsets of signatures will get made by various nodes in the network. This naturally splits the load of proving across many nodes in the network, making the work of the "final prover" much smaller.
Other signature schemes: For a Lamport+Merkle signature, we need 256 + 32 hashes to verify a signature; multiplying by 32,768 signers gives 9,437,184 hashes. Optimizations to the signature scheme can improve this further by a small constant factor. If we use Poseidon, this is within range to prove within a single slot. Realistically, though, this would be made much faster with recursive aggregation schemes.

What are some links to existing research?

Succinct, proof of Ethereum consensus (sync committee only): https://github.com/succinctlabs/eth-proof-of-consensus
Succinct, Helios inside SP1: https://github.com/succinctlabs/sp1-helios
Succinct BLS12-381 precompiles: https://blog.succinct.xyz/succinctshipsprecompiles/
Halo2-based verification of BLS aggregate signatures: https://ethresear.ch/t/zkpos-with-halo2-pairing-for-verifying-aggregate-bls-signatures/14671

What is left to do, and what are the tradeoffs?

Realistically, it will take years before we have a validity proof of the Ethereum consensus. This is roughly the same timeline that we need to implement single slot finality, Orbit, changes to the signature algorithm, and potentially security analysis needed to become sufficiently confident to use "aggressive" hash functions like Poseidon. Hence, it makes the most sense to work on these other problems, and while doing that work keep STARK-friendliness in mind.

The main tradeoff may well be in order of operations, between a more incremental approach to reforming the Ethereum consensus layer and a more radical "many changes at once" approach. For the EVM, an incremental approach makes sense, because it minimizes disruption to backwards compatibility. For the consensus layer, backwards compatibility concerns are smaller, and there are benefits to a more "holistic" re-think in various details of how the beacon chain is constructed, to best optimize for SNARK-friendliness.

How does it interact with other parts of the roadmap?

STARK-friendliness needs to be a primary concern in long-term redesigns of the Ethereum proof of stake consensus, most notably single-slot finality, Orbit, changes to the signature scheme, and signature aggregation.