@zsfelfoldi
Last active February 4, 2026 09:06
eip-7745b.md

EIP-7745: Trustless log and transaction index (alternative version "B")

Author: Zsolt Felföldi (zsfelfoldi@ethereum.org)

Notes for the pre-release version

This document is a preview of an alternative to the original EIP-7745 design, which aimed to achieve three key goals:

  • efficient proofs for long history search
  • search structure always available up to the current head
  • hard capped minimal log index state size for easy initialization

With practically any value lookup structure there is a general tradeoff: the bigger the structure, the more efficient the search, but also the riskier it is to add to the protocol, since every builder/proposer/validator has to maintain the structure once it is part of the consensus. An alternative I considered was not adding anything to the protocol at all and instead creating ZK-proofs of the index state tied to block hashes. With this approach it is possible to create huge search tables for long historical block ranges, since there is no consensus constraint on when to finish the job, and consensus-critical nodes do not have to process the index data at all. Unfortunately the downside is quite big: an index updated with ZK-proofs will usually not be available up to the latest head, and just an hour of non-indexed search costs around 20 Mb of data since all receipts need to be downloaded.

The original design does realize an acceptable balance between log index state size and search performance, at the cost of high complexity. The approach presented in this document is different: the search structure is very simple and small, needs no extra initialization protocol, and provides efficient search in recent history, at the cost of being less efficient for long term history search. That efficiency is only limited by index table size though, and the index tables generated by consensus can be merged off-chain into much bigger tables with ZK-proofs. This is intended to be a best-of-both-worlds solution: it has low complexity and places a minimal burden on the core infrastructure, yet achieves close to optimal performance for recent history proofs, which can be a key improvement for scalability through cross-chain interactions and off-chain data processing. On long history search it still achieves a roughly 1000-5000x improvement in data efficiency compared to no indexing at all. If combined with ZK-proven big search tables, the overall historic search efficiency can easily surpass the original EIP-7745 design, with index proofs for the entire chain history under 100 kb.

Abstract

Replace the log event bloom filters with new lookup index data structures called index tables, in order to make the results of log search queries and of transaction and block hash lookups efficiently and trustlessly provable.

Index tables are ordered lists of the indexed searchable values (log addresses and topics, transaction and block hashes) in a certain block range, stored together with the position of each occurrence, sorted by event type and value first, then by position. Index tables are generated for 1, 4, 16 and 64 block ranges, each one being hashed in SSZ List format. The occurrences of any search value in any continuous block range covered by the index table occupy a continuous index range in this list, and the complete set of matches for any {type, value, block range} query can be proven with an efficient Merkle multi-proof of a continuous tree leaf range covering the actual matches, plus the last index entry before and the first index entry after the actual matches. The absence of any matches can also be proven with the two adjacent tree leaves between which any matches would be located according to the defined entry ordering.

The proposed design allows block builders/validators to generate the index-related header fields based solely on a limited number of recent blocks and receipts. No extra initialization mechanism or wire protocol extension is needed.

Motivation

Adding logs has a significantly lower gas cost and should accordingly be less resource consuming than writing to the state. The original design of bloom filters in each block achieves this goal as there is no complex data structure to update, the set of logs emitted in each block is all contained within the header and receipts belonging to that block. Logs mostly just have long term storage costs. On the other hand, searching logs in a long range of blocks is very expensive and the current bloom filters do not really help.

Bloom filters are only useful as long as they are sufficiently sparse. The false positive ratio rises rapidly with the number of events per filter and the density of 1 bits in the filter bit vector. In the currently existing bloom filter, each log address and topic sets 3 out of a fixed length of 2048 bits, which resulted in sufficiently sparse filters in the beginning, but with the increase of block gas limits the false positive ratio soon made the filter practically useless. Mainnet blocks currently add over 1000 log addresses and topics on average, so the bloom filter size would need to increase about tenfold in order to achieve acceptable false positive rates again. This would raise the block header size to about 3 kilobytes. Even if the size of the per-block bloom filters were raised to a sufficient level, log search would still require accessing the entire header chain. Searching just the most recent one year of history would cost over 6 gigabytes of data, not counting the access of actual logs where the bloom filters have a match. The current situation is even worse, requiring a large portion of the full block receipt sets to be accessed due to the high false positive rate of the bloom filters. Transaction inclusion lookup is also very expensive, requiring all block bodies in the searched range.
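The uselessness claim can be sanity-checked with the standard Bloom filter false positive estimate; a sketch, using the current 2048-bit / 3-hash header filter parameters and the ~1000 entries per block cited above:

```python
import math

def bloom_false_positive(m_bits: int, k_hashes: int, n_items: int) -> float:
    # probability that a non-member's k bit positions are all set
    # after n items have each set k (approximately random) bits
    fill_ratio = 1 - math.exp(-k_hashes * n_items / m_bits)
    return fill_ratio ** k_hashes

rate = bloom_false_positive(2048, 3, 1000)  # current filter, ~1000 entries/block
```

With ~1000 addresses/topics per block this evaluates to a false positive rate of around 45%, i.e. almost every block matches any query.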

While full nodes and RPC servers can and usually do build additional index structures to aid certain lookups, the responses based on these cannot be verified remotely. While it is important to improve the transaction processing capabilities of Ethereum, trustless provability is required throughout the entire stack, from end user applications signing transactions, to the same or other end user applications getting the results they need. Scaling transaction processing only makes sense as long as there is also a trustless and scalable way to access the relevant results of those transactions. Relying on trusted servers and indexers defeats the whole purpose of Ethereum.

In addition to making truly trustless user applications possible, cheap-to-add and efficiently provable logs may also find applications in improving scalability through cross-chain interactions and moving expensive computation off-chain with ZK-proofs.

Specification

Terms and definitions

  • index entry: a searchable event; can either be a block, transaction, address or topic entry. It is described by entry type, index value and chain inclusion position (block number and transaction/log index where applicable).
  • index value: a searchable value; depending on index entry type, it can be a block hash, transaction hash, log address or log topic.
  • index table: a sorted list of index entries belonging to a certain block range; ordered by index entry type first, then index value, then inclusion position.
  • table chain: a chain of index tables created from consecutive block ranges of identical length.
  • reference block: the block where the table chain roots used by a log index proof are referenced in the header. This is typically the head block or a very recent block.

Consensus data format

Block headers

Beginning at the execution timestamp FORK_TIMESTAMP, execution clients MUST replace the logs_bloom field of the header schema with log_index which is the RLP encoding of the LogIndex structure (four ChainedTable root hashes).

The resulting RLP encoding of the header is therefore:

rlp([
    parent_hash,
    0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347, # ommers hash
    coinbase,
    state_root,
    txs_root,
    receipts_root,
    log_index,
    0, # difficulty
    number,
    gas_limit,
    gas_used,
    timestamp,
    extradata,
    prev_randao,
    0x0000000000000000, # nonce
    base_fee_per_gas,
    withdrawals_root,
    blob_gas_used,
    excess_blob_gas,
    parent_beacon_block_root,
])

Container types

class LogIndex(Container):
    table_roots: Vector[Root, 4] # ChainedTable roots

class ChainedTable(Container):
    table: Root # root of List[Bytes64] (variable list capacity)
    parent: Root

Index tables are encoded as variable capacity lists; their final length is exactly known at the time of initialization and their capacity is initialized to the actual number of entries. Index entries are encoded in 64 bytes, in such a way that the defined sorting order equals the lexicographical ordering of the encoding that gets hashed into the tree:

  • 8 bytes entry type (big endian)
  • 32 bytes index value
  • 8 bytes block number (big endian)
  • 8 bytes transaction index (big endian)
  • 8 bytes log index (big endian)

Note that the database/network encoding can be more tightly packed (typically <48 bytes). The 64 byte encoding makes sense because a regular binary Merkle tree uses 32 byte chunks and the wide type/position fields leave room for future extensions while maintaining the nice property that the table is just an ordered list of 64 byte entries (easy to create ZK-proofs of merged index tables).
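A sketch of this encoding; since all numeric fields are big-endian, sorting the encoded entries lexicographically as byte strings yields exactly the defined type/value/position ordering:

```python
import struct

def encode_entry(entry_type: int, value: bytes,
                 block: int, tx_index: int = 0, log_index: int = 0) -> bytes:
    # 8-byte big-endian type id, 32-byte index value, then 8-byte big-endian
    # block number, transaction index and log index
    # (non-applicable position fields are encoded as zero)
    assert len(value) == 32
    return (struct.pack(">Q", entry_type) + value
            + struct.pack(">QQQ", block, tx_index, log_index))
```

Big-endian fields guarantee that byte-wise comparison of two encoded entries equals the numeric comparison of their fields, which is what makes the SSZ leaf ordering directly searchable.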

Index tables are hashed into table chains, with the root of every newly generated index table put into the table field of a new ChainedTable container, while the root of the previous ChainedTable being stored in the parent field. The top level LogIndex container tracks four chains of index tables, with table_roots[i] tracking a chain of tables with a 4^i long block range (blocks 4^i * j to 4^i * (j+1) - 1). The first chain (made of single-block tables) is always updated instantly, with the table built of the events of the actually processed block. The other (multi-block) table chains are updated with a delay, in order to leave the builders/validators a safe time window to process:

| table_roots index | block range length | updated |
|---|---|---|
| 0 | 1 | every block (no processing delay) |
| 1 | 4 | when block_number % 4 == 0 (1 block delay) |
| 2 | 16 | when block_number % 16 == 3 (4 blocks delay) |
| 3 | 64 | when block_number % 64 == 15 (16 blocks delay) |
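The update schedule can be expressed as a small predicate; a sketch returning the chain indices updated in a given block:

```python
def updated_chains(block_number: int) -> list[int]:
    # chain 0 (single-block tables) is updated in every block;
    # chain i > 0 is updated when block_number % 4**i == 4**(i-1) - 1,
    # i.e. 4**(i-1) blocks after the end of the covered range
    chains = [0]
    for i in range(1, 4):
        if block_number % 4**i == 4**(i-1) - 1:
            chains.append(i)
    return chains
```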

Index tables and entries

Index tables are always associated with a certain range of blocks and are made up of index entries representing the events in that block range. Each index entry has a type which can be one of the following:

| entry type | type id | index value meaning | position info |
|---|---|---|---|
| block | 0 | block hash | block number |
| transaction | 1 | transaction hash | block number, tx index |
| log.address | 2 | log address | block number, tx index, log index |
| log.topics[0] | 3 | log topic | block number, tx index, log index |
| log.topics[1] | 4 | log topic | block number, tx index, log index |
| log.topics[2] | 5 | log topic | block number, tx index, log index |
| log.topics[3] | 6 | log topic | block number, tx index, log index |

Index tables are sorted by index entry type id first, then index value, then position (block number, then transaction index if applicable, then log index if applicable). Non-applicable position info fields are represented as a 0 in the IndexEntry.meta vector.

Single block index tables do not have a block entry because they are generated from the actually processed block and they are hashed into the block, therefore they cannot contain the block hash.

Note that multi-block tables representing consecutive ranges can always be merged into larger tables representing a longer block range, thereby implementing a kind of merge sort algorithm, which is an efficient method for very big tables stored on disk. Though this proposal does not use tables with longer than 64 block ranges, off-chain indexers can merge these tables further in order to make the search more efficient, where this property can be very useful. When merging single block tables, the corresponding block entries should also be added.
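Since every table is already sorted, merging is a plain k-way merge of sorted sequences; a minimal sketch over encoded entries (block entries for merged single-block tables would have to be inserted separately):

```python
from heapq import merge

def merge_tables(tables: list[list[bytes]]) -> list[bytes]:
    # each input table is a sorted list of encoded index entries covering
    # consecutive block ranges; the output is the sorted table covering
    # the concatenated block range
    return list(merge(*tables))
```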

Index table example

For an example let's take a scenario with the following block range:

  • Block #40: empty
  • Block #41: empty
  • Block #42:
    • Tx #0: WETH Transfer from Alice to Bob, USDT Transfer from Bob to Alice
    • Tx #1: WETH Withdrawal by Bob
  • Block #43:
    • Tx #0: USDT Transfer from Alice to Carol

The unsorted index entries for the block range 40..43 are (in chronological order):

| Block # | Tx # | Log # | Type | Value |
|--------:|-----:|------:|------|-------|
| 40 | | | 0 (Block) | 0xbf98e6cb26f6ff... (block hash) |
| 41 | | | 0 (Block) | 0x42f66a2e9f9c68... (block hash) |
| 42 | 0 | | 1 (Tx) | 0xca2d12d1b8132d... (tx hash) |
| 42 | 0 | 0 | 2 (Address) | 0x00..00c02aaa39... (WETH) |
| 42 | 0 | 0 | 3 (Topic0) | 0xddf252ad1be2c8... (Transfer) |
| 42 | 0 | 0 | 4 (Topic1) | 0x00..004a5e9a3b... (Alice) |
| 42 | 0 | 0 | 5 (Topic2) | 0x00..00f2a5fd4d... (Bob) |
| 42 | 0 | 1 | 2 (Address) | 0x00..00dac17f95... (USDT) |
| 42 | 0 | 1 | 3 (Topic0) | 0xddf252ad1be2c8... (Transfer) |
| 42 | 0 | 1 | 4 (Topic1) | 0x00..00f2a5fd4d... (Bob) |
| 42 | 0 | 1 | 5 (Topic2) | 0x00..004a5e9a3b... (Alice) |
| 42 | 1 | | 1 (Tx) | 0xa75590c9ced728... (tx hash) |
| 42 | 1 | 0 | 2 (Address) | 0x00..00c02aaa39... (WETH) |
| 42 | 1 | 0 | 3 (Topic0) | 0x7fcf532c15f0a6... (Withdrawal) |
| 42 | 1 | 0 | 4 (Topic1) | 0x00..00f2a5fd4d... (Bob) |
| 42 | | | 0 (Block) | 0x66f42ef12b140e... (block hash) |
| 43 | 0 | | 1 (Tx) | 0x7046035b326ab2... (tx hash) |
| 43 | 0 | 0 | 2 (Address) | 0x00..00dac17f95... (USDT) |
| 43 | 0 | 0 | 3 (Topic0) | 0xddf252ad1be2c8... (Transfer) |
| 43 | 0 | 0 | 4 (Topic1) | 0x00..004a5e9a3b... (Alice) |
| 43 | 0 | 0 | 5 (Topic2) | 0x00..00ac8cbe20... (Carol) |
| 43 | | | 0 (Block) | 0x978ce0036b6d1c... (block hash) |

The sorted index table (columns reordered according to sorting priority):

| Type | Value | Block # | Tx # | Log # |
|------|-------|--------:|-----:|------:|
| 0 (Block) | 0x42f66a2e9f9c68... (block hash) | 41 | | |
| 0 (Block) | 0x66f42ef12b140e... (block hash) | 42 | | |
| 0 (Block) | 0x978ce0036b6d1c... (block hash) | 43 | | |
| 0 (Block) | 0xbf98e6cb26f6ff... (block hash) | 40 | | |
| 1 (Tx) | 0x7046035b326ab2... (tx hash) | 43 | 0 | |
| 1 (Tx) | 0xa75590c9ced728... (tx hash) | 42 | 1 | |
| 1 (Tx) | 0xca2d12d1b8132d... (tx hash) | 42 | 0 | |
| 2 (Address) | 0x00..00c02aaa39... (WETH) | 42 | 0 | 0 |
| 2 (Address) | 0x00..00c02aaa39... (WETH) | 42 | 1 | 0 |
| 2 (Address) | 0x00..00dac17f95... (USDT) | 42 | 0 | 1 |
| 2 (Address) | 0x00..00dac17f95... (USDT) | 43 | 0 | 0 |
| 3 (Topic0) | 0x7fcf532c15f0a6... (Withdrawal) | 42 | 1 | 0 |
| 3 (Topic0) | 0xddf252ad1be2c8... (Transfer) | 42 | 0 | 0 |
| 3 (Topic0) | 0xddf252ad1be2c8... (Transfer) | 42 | 0 | 1 |
| 3 (Topic0) | 0xddf252ad1be2c8... (Transfer) | 43 | 0 | 0 |
| 4 (Topic1) | 0x00..004a5e9a3b... (Alice) | 42 | 0 | 0 |
| 4 (Topic1) | 0x00..004a5e9a3b... (Alice) | 43 | 0 | 0 |
| 4 (Topic1) | 0x00..00f2a5fd4d... (Bob) | 42 | 0 | 1 |
| 4 (Topic1) | 0x00..00f2a5fd4d... (Bob) | 42 | 1 | 0 |
| 5 (Topic2) | 0x00..004a5e9a3b... (Alice) | 42 | 0 | 1 |
| 5 (Topic2) | 0x00..00ac8cbe20... (Carol) | 43 | 0 | 0 |
| 5 (Topic2) | 0x00..00f2a5fd4d... (Bob) | 42 | 0 | 0 |

Since the index table above is 22 entries long, its root hash is calculated as the root of a List[IndexEntry, 22] SSZ list. Note that the proof format always includes the size of the accessed index tables (it is required anyway as part of the Merkle proof), which tells the verifier the expected number of tree levels.
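Assuming standard SSZ Merkleization of the entry list (64-byte entries split into two 32-byte chunks, the chunk tree padded to the next power of two, plus one level for the length mix-in), the verifier can derive the expected number of tree levels from the proven entry count; a sketch:

```python
def tree_levels(entry_count: int) -> int:
    # each 64-byte entry occupies two 32-byte chunks; the chunk tree is
    # padded to the next power of two, and one extra level mixes in the length
    chunks = entry_count * 2
    return (chunks - 1).bit_length() + 1
```

For the 22-entry example table this gives 44 chunks, padded to 64, i.e. 6 tree levels plus the length mix-in.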

Index query proof examples

The following examples show the minimum set of index entries required to be proven from the index table. Note that these examples all show a single index table that covers a limited block range; for any query covering multiple index tables, a similar multi-proof on each index table is required. Also note that in case of log pattern queries it is assumed that the logs themselves are also proven by including the corresponding block header, a Merkle proof through the receipts tree and the receipt containing the logs. For this reason, a full inclusion proof of the matching entries is not required and the index proof can be interpreted as an exclusion proof of all the non-matching logs.

Canonical block lookup

A block number to hash or hash to number lookup of an existing canonical block can be proven by just the matching block entry:

| Type | Value | Block # | Tx # | Log # |
|------|-------|--------:|-----:|------:|
| ... | ... | ... | ... | ... |
| 0 (Block) | 0x66f42ef12b140e... (block hash) | 42 | | |
| ... | ... | ... | ... | ... |

In case of a hash to number lookup where the given block hash is not in the index table's block range, this fact can be proven with the last entry before the searched hash and the first entry after it. For example, a search for block hash 0xe816a111dc40d8... in the index table example above would require the following proven entry set:

| Type | Value | Block # | Tx # | Log # |
|------|-------|--------:|-----:|------:|
| ... | ... | ... | ... | ... |
| 0 (Block) | 0xbf98e6cb26f6ff... (block hash) | 40 | | |
| 1 (Tx) | 0x7046035b326ab2... (tx hash) | 43 | 0 | |
| ... | ... | ... | ... | ... |

According to the ordering rules, the searched (Block, 0xe816a111dc40d8...) entry would be located between the two entries above, so proving these two adjacent entries proves that the searched one does not exist in this table.
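The exclusion check itself reduces to a lexicographic comparison: given the two proven adjacent leaves, the verifier only has to confirm that the searched key would sort strictly between them. A sketch (`sort_key` is a hypothetical helper building the type-and-value prefix of the 64-byte encoding):

```python
def sort_key(entry_type: int, value: bytes) -> bytes:
    # type-and-value prefix of the 64-byte entry encoding
    return entry_type.to_bytes(8, "big") + value

def proves_absence(before: bytes, after: bytes, searched: bytes) -> bool:
    # 'before' and 'after' are adjacent proven tree leaves; if the searched
    # prefix sorts strictly between them, no matching entry can exist
    return before < searched < after
```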

Note that if single-block index tables are used, the canonical blocks are not proven through the tables since they are not present there. In the worst case, single-block index tables are used for the reference block and three of its ancestors. Before that, a four-block table should always be available. We can always assume that the receiver of the proof knows the block number and hash of the reference block, while its required ancestor block hashes can be proven with at most two extra block headers.

Transaction inclusion lookup

Though the inclusion of a certain transaction can technically be proven with just the transaction entry, a transaction inclusion proof should typically also include the corresponding block entry, which proves the canonical block hash, so that the returned inclusion position can refer to the inclusion block by hash and the client can also trustlessly look up the resulting receipt and other circumstances of the transaction execution.

| Type | Value | Block # | Tx # | Log # |
|------|-------|--------:|-----:|------:|
| 0 (Block) | 0x66f42ef12b140e... (block hash) | 42 | | |
| ... | ... | ... | ... | ... |
| 1 (Tx) | 0xa75590c9ced728... (tx hash) | 42 | 1 | |
| ... | ... | ... | ... | ... |

If the given transaction hash is not included in the index table's block range, the proof is similar to the block hash exclusion proof. For example, a search for transaction hash 0x7f6419ca474f39... would require the following proven entry set:

| Type | Value | Block # | Tx # | Log # |
|------|-------|--------:|-----:|------:|
| ... | ... | ... | ... | ... |
| 1 (Tx) | 0x7046035b326ab2... (tx hash) | 43 | 0 | |
| 1 (Tx) | 0xa75590c9ced728... (tx hash) | 42 | 1 | |
| ... | ... | ... | ... | ... |

Log pattern match query

This example shows a proof of a log query for USDT Transfer from Bob to anyone:

  • Address: 0x00..00dac17f95... (USDT)
  • Topic0: 0xddf252ad1be2c8... (Transfer)
  • Topic1: 0x00..00f2a5fd4d... (Bob)
  • Topic2: anything

In this case the proof also includes the receipt containing the only actual match (block #42 tx #0 log #1). The block entry for block #42 is included in the proven index entry set so that the canonicalness of the receipt can be proven through the block header and the receipts tree.

| Type | Value | Block # | Tx # | Log # |
|------|-------|--------:|-----:|------:|
| ... | ... | ... | ... | ... |
| 0 (Block) | 0x66f42ef12b140e... (block hash) | 42 | | |
| ... | ... | ... | ... | ... |
| 3 (Topic0) | 0x7fcf532c15f0a6... (Withdrawal) | 42 | 1 | 0 |
| ... | ... | ... | ... | ... |
| 4 (Topic1) | 0x00..004a5e9a3b... (Alice) | 43 | 0 | 0 |
| 4 (Topic1) | 0x00..00f2a5fd4d... (Bob) | 42 | 0 | 1 |
| 4 (Topic1) | 0x00..00f2a5fd4d... (Bob) | 42 | 1 | 0 |
| 5 (Topic2) | 0x00..004a5e9a3b... (Alice) | 42 | 0 | 1 |
| ... | ... | ... | ... | ... |

Note that the set of proven index entries does not need to include the USDT address or the Transfer topic because the actual matching log is proven through the receipt, the index table only has to prove that none of the other logs are potential matches. The last four proven entries prove that there are only two potential matches for Topic1 (Bob) and the second proven entry proves that at position (block #42 tx #1 log #0) Topic0 is Withdrawal, therefore it cannot be a match.

Maintaining the log index

We define the function table_root(parent_root, first_block, last_block) that calculates the root of a new ChainedTable created from block range first_block .. last_block, with the parent set to parent_root. The actual implementation can either collect all events from transactions and receipts of that range and sort them, or (in case of multi-block ranges) take the four previously calculated lower level sorted tables covering the same range and merge them. It is also optional to permanently store the actual tables, which is only necessary if the client actually intends to use the index.

Since multi-block ranges are calculated asynchronously, these processes and their results are stored in an off-protocol temporary storage. Each index table root that is calculated asynchronously is referred to here as scheduled_table_roots[i][j] which denotes the root of the table of the block range 4^i * j to 4^i * (j+1) - 1. These hashes are either scheduled for future update of header.table_roots or kept around for a while after that, in case a reorg reverts the header.table_roots update but does not invalidate the actual block range it was based on (so it has to be updated again in a later block of the new fork).

Adding a new block

After processing all transactions, perform the following steps:

  • calculate table_root(parent.table_roots[0], header.number, header.number) and store it in header.table_roots[0]
  • for every 0 < i < 4:
    • if header.number % 4^i == 4^i - 1 then start calculating scheduled_table_roots[i][header.number // 4^i] = table_root(parent.table_roots[i], header.number + 1 - 4^i, header.number)
    • if header.number % 4^i == 4^(i-1) - 1 then set header.table_roots[i] to scheduled_table_roots[i][header.number // 4^i - 1], otherwise set header.table_roots[i] to parent.table_roots[i]
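The steps above can be sketched as follows (a simplified sketch: `table_root` is replaced by a placeholder hash, the scheduled roots are kept in a plain dictionary, and the asynchronous calculation is done inline):

```python
import hashlib
from dataclasses import dataclass, field

ZERO_ROOT = b"\x00" * 32

def table_root(parent_root: bytes, first_block: int, last_block: int) -> bytes:
    # placeholder for the real ChainedTable root calculation
    return hashlib.sha256(parent_root
                          + first_block.to_bytes(8, "big")
                          + last_block.to_bytes(8, "big")).digest()

@dataclass
class Header:
    number: int
    table_roots: list = field(default_factory=lambda: [ZERO_ROOT] * 4)

def update_table_roots(header: Header, parent: Header, scheduled: dict) -> None:
    header.table_roots = list(parent.table_roots)
    # chain 0: single-block table of the processed block, updated instantly
    header.table_roots[0] = table_root(parent.table_roots[0],
                                       header.number, header.number)
    for i in range(1, 4):
        if header.number % 4**i == 4**i - 1:
            # last block of a 4**i range: schedule the new table root
            j = header.number // 4**i
            scheduled[(i, j)] = table_root(parent.table_roots[i],
                                           header.number + 1 - 4**i, header.number)
        if header.number % 4**i == 4**(i-1) - 1:
            # 4**(i-1) blocks later: reference the scheduled root in the header
            header.table_roots[i] = scheduled.get((i, header.number // 4**i - 1),
                                                  ZERO_ROOT)
```

The zero default mirrors the initialization rule below: a chain that has not produced its first table yet is represented by a zero root.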

Reverting blocks

When reverting to revert_number, perform the following steps:

  • for every 0 < i < 4:
    • discard every invalidated scheduled_table_roots[i][j] where j >= (revert_number + 1) // 4^i
    • cancel any asynchronous process calculating any such roots

Note that whenever the finalized block is advanced, every scheduled_table_roots[i][j] where j < (finalized_number + 1 - 4^(i-1)) // 4^i can also be discarded because the header in which it has been updated cannot be rolled back any more.
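A sketch of the discard rule, assuming scheduled roots are kept in a dictionary keyed by (chain index i, range index j):

```python
def revert_scheduled(scheduled: dict, revert_number: int) -> None:
    # discard every scheduled root whose block range
    # 4**i * j .. 4**i * (j+1) - 1 extends beyond the new head
    for (i, j) in list(scheduled):
        if j >= (revert_number + 1) // 4**i:
            del scheduled[(i, j)]
```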

Initialization

At the fork block where the table_roots are first added to the block header, the roots are first initialized to zero, then normal update rules are applied; on table chain 0 (single-block tables) a new ChainedTable is instantly created, while other chains might be updated too in the fork block if the block number has the specified modulus. The first ChainedTable added to any of the four chains will have a zero parent.

Since longer block range index tables might be added right at the time of the fork, the preparation of these tables should begin before the fork block. The same logic can be applied here as above; for every 0 < i < 4 when header.number % 4^i == 4^i - 1, after transaction processing the table creation can start asynchronously if header.number + 4^(i-1) (the actual update block number) is at or after the expected fork block.

Note that pre-fork history index is not added to the consensus for three reasons: first, it would be a heavy requirement for every client to obtain and process the entire chain history. Second, larger tables for long term history are intended to be created off-chain with ZK-proofs; covering ancient history with 64 block tables would not be efficient enough anyway. Third, after activating EIP-7708 it would make a lot of sense to reprocess the whole chain history and create one big index table with the new ETH transfer system logs included. Since this is a one-time thing, the root hash of this table could simply be hardcoded (creating a ZK-proof of reprocessing the whole chain would be very expensive).

Log index proof format

The LogIndexProof can be serialized either as RLP or SSZ. It proves subsets of index entries from a consecutive range of index tables, ending with the most recent one, from each of the four table chains referenced in the block header. It can also include some ancestor headers of the reference block, so that block hashes of the most recent blocks (covered only by single-block index tables that do not have a block entry) can also be proven.

It can also prove individual receipts where matching logs are found. This serves as an inclusion proof for actual matches of a log pattern match query, while the proven index entries serve as an exclusion proof for any other potential matches. Note that with EIP-6466 (SSZ receipts), proving matching logs could be more efficient since individual logs could be proven with a Merkle proof; with the current RLP encoding, entire receipts need to be proven.

class LogIndexProof(Container):
    table_chain_proofs: Vector[TableChainProof, 4]
    headers: List[Header]    # ancestors of the reference block
    receipt_proofs: List[ReceiptProof]

class TableChainProof(Container):
    parent: Root
    tables: List[IndexTableProof]

class IndexTableProof(Container):
    entry_count: uint64
    entry_indices: List[uint64]
    entries: List[IndexEntry]
    proof_nodes: List[Bytes32]

class ReceiptProof(Container):
    header: Header
    receipt_proof: List[ByteList] # MPT proof of the receipts tree, including the rlp encoded receipts as value

Backwards Compatibility

The existing log filter API (eth_getLogs, eth_newFilter, eth_getFilterLogs, eth_getFilterChanges) can be implemented on top of the new index data structure. Applications relying on this API can operate without any change, benefiting from higher search performance.

Repricing the LOG opcode might be considered after performing benchmarks but the initial estimates suggest that it might not be necessary (see below). Other than that the EVM is not affected in any way.

Security Considerations

Cost of adding entries

A key advantage of the proposed design is that adding new index entries is consistently cheap. The CPU costs are mostly determined by tree hashing; eventually all index entries will be added to all four table chains. On each chain two new leaves are added to a binary hash tree; the cost can be estimated as 8 binary hash operations per index entry, 6 of these can happen asynchronously, 2 should happen at the end of block processing (but even these can be parallelized). If the 375 gas per address/topic cost of LOG operations is maintained, a 100M gas block can have 266k index entries in the worst case, meaning 533k binary hash operations during block processing, which should have a less than 100ms (parallelizable) CPU cost. Note that the typical number of index entries per block on mainnet is currently around 1-2k so the typical processing cost is much lower.
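The worst-case numbers above follow directly from the LOG pricing; a sanity-check calculation (not part of the specification):

```python
GAS_PER_ADDRESS_OR_TOPIC = 375
BLOCK_GAS_LIMIT = 100_000_000

max_entries = BLOCK_GAS_LIMIT // GAS_PER_ADDRESS_OR_TOPIC  # worst-case entries
# two leaves per entry are appended on each of the four table chains,
# i.e. 8 binary hashes per entry; only chain 0 must be hashed synchronously
sync_hashes = max_entries * 2   # during block processing
async_hashes = max_entries * 6  # spread over the delayed multi-block chains
```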

Sorting the largest (64 block) tables in memory might be possible but has a significant memory requirement assuming the worst case; 266k*64 entries, 48 bytes each (assuming a compact in-memory representation of the meta info) is around 800 Mb. This worst case is very unlikely to happen and processing it in memory is still possible, but for a future-proof implementation it might be better to not load all this data into memory, instead create the 64 block table by merging the corresponding four 16 block tables (which can be performed on disk). Reading 4 * 200 Mb and writing 800 Mb sequentially during a 16 block time window should not be a problem even with a slow disk.
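The memory estimate can likewise be sanity-checked (assuming the 48 byte compact in-memory encoding mentioned above):

```python
max_entries_per_block = 100_000_000 // 375      # worst-case entries per block
table_bytes = max_entries_per_block * 64 * 48   # worst-case 64-block table size
```

This evaluates to roughly 819 million bytes, matching the ~800 Mb figure above.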

In conclusion, while measurements are still required, at this point it looks like the 375 gas per address/topic cost of LOG operations is easily maintainable.

Copyright

Copyright and related rights waived via CC0.
