The Data Availability Problem and Innovative Solutions on the Horizon

TL;DR

  • The data availability problem, critical for blockchain scalability, is about ensuring that the data needed to validate a block is accessible to all network participants, especially in modular blockchains, Layer 2 rollups, and light clients.
  • Various strategies to address the data availability problem include Full Node Data Downloading, Data Availability Committees, Randomly Sampled Committees, Part-wise Attestation, and Data Availability Sampling (DAS).
  • Each method has its own strengths and weaknesses, with trade-offs in trust assumptions, susceptibility to manipulation, implementation complexity, resource requirements, and potential inefficiencies.
  • Data Availability Sampling (DAS), an approach based on erasure coding and random sampling, is preferred by major blockchain networks like Polkadot, NEAR, Tezos, and Ethereum due to its scalability, security, ability to reconstruct data incrementally, and role in maintaining decentralization.

What is the Data Availability Problem?

Data availability, in its most basic form, is the assurance that the necessary data to validate a block is accessible to all participants in a network. For full nodes operating on Layer 1 blockchains, data availability is relatively straightforward. These nodes download all data in each block, and the successful download itself confirms the data’s availability. This process, known as “on-chain data availability,” is typical for monolithic blockchains such as Bitcoin, Cardano, and Ethereum.

However, the situation becomes more complex with modular blockchains, Layer 2 rollups, and light clients, where data availability requires more advanced verification methods. The push for greater blockchain scalability has led to strategies that allow users to post their transaction data off-chain, thereby extending the effective bandwidth of Layer 1 (L1). This approach, however, raises a critical question:

How can consensus be reached about the availability of off-chain data without requiring all L1 nodes to download this data?

This question encapsulates the data availability problem.

Addressing the Data Availability Problem

The challenge here is to offer proof to the entire network that the summarized form of transaction data being added to the blockchain indeed represents a set of valid transactions, without mandating that all nodes download all the data. The tension is that full transaction data is essential for independently verifying blocks, yet requiring every node to download it is precisely what limits scalability.

Consequently, solutions targeting this issue aim to assure participants who do not download and store the data themselves that the full transaction data was indeed made available for verification. This is especially critical for light nodes and Layer 2 rollups, which need robust data availability guarantees but cannot download and process the transaction data on their own.

Different Scenarios Where Data Availability is Relevant

The data availability problem is relevant in various scenarios within the blockchain ecosystem, particularly where scalability solutions are implemented. Here are a few key scenarios:

Sharding: Sharding is a scalability technique that partitions a blockchain into multiple smaller chains, known as shards. Each shard operates independently, processing its own transactions and smart contracts. This allows the network to process many transactions in parallel, significantly increasing its capacity. However, this approach introduces a new challenge: ensuring data availability.

In a sharded system, not all nodes process all transactions. Instead, different nodes process different shards. This means that if a shard’s block producers were to become malicious and start accepting invalid transactions, nodes on other shards would not necessarily be aware of it. To detect such fraudulent activity, it’s crucial that all data within each shard is available to the entire network. This allows any invalid transactions to be detected and rejected, maintaining the integrity of the blockchain.

Sidechains: Sidechains are separate blockchains that run in parallel to the main chain. They allow transactions and smart contracts to be offloaded from the main chain, thereby increasing its capacity. However, as with sharding, data availability is a critical concern.

If the block producers of a sidechain were to act maliciously and include invalid transactions in a block, it could potentially go unnoticed by the main chain. To prevent this, it’s crucial that all data from the sidechain is made available to the main chain. This allows the main chain to verify the validity of the sidechain’s transactions and maintain the security of the overall system.

Rollups: Rollups are Layer 2 scaling solutions that perform transaction execution off-chain and post the transaction data on-chain. There are two main types of rollups: Optimistic Rollups and Zero-Knowledge (ZK) Rollups/Validity Rollups. Both types rely on data availability to function securely.

Solutions Addressing the Data Availability Problem

1. The Basic Approach: Full Node Data Downloading

The most basic approach to addressing the data availability problem in blockchain networks is the full node data downloading method. This strategy is employed by existing Layer 1 blockchains such as Bitcoin, Ethereum, and Cardano.

In this approach, every full node in the network is required to download all data associated with each block. This ensures that all data is available to every participant in the network, thereby maintaining the integrity of the blockchain. If a node proposes a block whose data is unavailable, honest nodes in the network won't accept it. The reason is simple: they cannot download and execute its transactions, which makes it impossible to build on top of the block. This mechanism ensures that all blocks added to the blockchain are fully available and can be verified by any participant in the network.
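
In code, the rule is almost trivial. The sketch below is illustrative only; fetch_block_data and verify are hypothetical helpers standing in for a node's networking and validation layers, not any specific client's API:

```python
# Minimal sketch of the full-download rule: build on a block only if
# its complete data can be downloaded and validated locally.

def accept_block(header, fetch_block_data, verify) -> bool:
    try:
        data = fetch_block_data(header)   # fails if the data is withheld
    except IOError:
        return False                      # unavailable block: refuse to build on it
    return verify(header, data)           # full data in hand: validate it directly
```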

Scalability Challenges with Full Node Data Downloading

However, while this approach is straightforward and effective, it has significant limitations when it comes to scalability. As the volume of transactions on the network increases, the amount of data that each full node is required to download also increases. This can lead to significant storage and bandwidth requirements for full nodes, which can be prohibitive for many participants in the network.

This scalability issue becomes even more pronounced as we move towards a Layer 2-centric world. Layer 2 solutions, such as rollups and sidechains, are designed to increase the scalability of blockchain networks by moving some of the computational load off the main chain. However, these solutions often require ensuring the availability of large amounts of data, which can exacerbate the scalability issues associated with the full node data downloading approach.

Therefore, while full node data downloading is a fundamental approach to the data availability problem, it is not sufficient on its own to address the scalability needs of modern blockchain networks.

2. Data Availability Committees: Trusted Guardians of Data

Data Availability Committees (DACs) represent a more sophisticated approach to the data availability problem. These committees are composed of trusted parties whose primary role is to ensure the availability of data within the network.

The Role of DACs in Data Verification

When a user submits a data blob, or a set of data, to the network, the members of the DAC are tasked with downloading the full data. Once they have downloaded and verified the data, they post signatures to the Layer 1 blockchain. These signatures serve as attestations to the availability of the data. The data is only considered available if a certain threshold of DAC members sign this attestation. This threshold is set to ensure that a sufficient number of independent verifications have been made, thereby increasing the reliability of the attestation.
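
As a rough illustration, the threshold check might look like the following sketch. The committee, the threshold, and the verify_signature stub are illustrative assumptions, not any production DAC's API:

```python
# Toy DAC threshold attestation. Everything here is illustrative:
# a real DAC posts actual cryptographic signatures to Layer 1.

DAC_MEMBERS = {"alice", "bob", "carol", "dave", "erin"}
THRESHOLD = 3  # minimum number of committee signatures required

def verify_signature(member: str, message: str, signature: str) -> bool:
    # Stand-in for real cryptographic verification (e.g. Ed25519).
    return signature == f"signed:{member}:{message}"

def is_available(blob_hash: str, signatures: dict[str, str]) -> bool:
    """A blob counts as available once enough DAC members signed its hash."""
    valid = {m for m, sig in signatures.items()
             if m in DAC_MEMBERS and verify_signature(m, blob_hash, sig)}
    return len(valid) >= THRESHOLD

sigs = {m: f"signed:{m}:0xabc" for m in ("alice", "bob", "carol")}
print(is_available("0xabc", sigs))  # True: the 3-of-5 threshold is met
```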

The Trust Factor in DACs

However, this approach does come with its own set of challenges. The most significant of these is the reliance on the trustworthiness of the DAC. Since the DAC members are responsible for verifying the availability of data, the system inherently requires users to trust these members. This trust is crucial for the system to function effectively, as any malicious activity or dishonesty within the DAC could compromise the integrity of the data.

The Limitations of DACs

Another limitation of this approach is that it is not backed by Layer 1 security. Layer 1 security refers to the security provided by the underlying blockchain protocol itself, which is typically robust and highly resistant to attacks. In contrast, the security of a DAC is dependent on the honesty and reliability of its members, which can potentially be compromised. Therefore, while DACs can provide a valuable service in ensuring data availability, they are not a foolproof solution and their use must be complemented by other mechanisms to ensure the overall security and integrity of the blockchain network.

3. Randomly Sampled Committees: A Randomized Approach to Data Availability

Randomly sampled committees represent another approach to tackling the data availability problem. This method introduces an element of randomness into the process of data verification, which can help to increase the security and robustness of the system.

The Role of Randomly Sampled Committees

In this approach, a subset of validators is randomly selected by the Layer 1 blockchain to attest to the availability of a specific blob of data. These validators, which form the committee, are tasked with downloading the data blob and verifying its availability. Once they have verified the data, they post attestations to the Layer 1 blockchain. These attestations serve as proof that the data is available and can be accessed by other participants in the network.

The network as a whole only accepts the data if it sees signatures from the majority of the committee. This majority rule helps to ensure that the data is indeed available and that the attestation is not the result of a few rogue or malicious validators. This mechanism provides an additional layer of security and helps to maintain the integrity of the data.
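
The selection and the majority rule can be sketched as follows. Seeding the selection with the blob hash is a toy simplification standing in for a proper on-chain randomness beacon such as a VRF:

```python
import random

def sample_committee(validators: list[str], blob_hash: str, size: int) -> set[str]:
    # Deterministic pseudo-random selection so every node derives the
    # same committee. The blob-hash seed is a stand-in for on-chain
    # randomness (e.g. a VRF output), not a secure construction.
    rng = random.Random(blob_hash)
    return set(rng.sample(sorted(validators), size))

def blob_accepted(attesters: set[str], committee: set[str]) -> bool:
    # Majority rule: accept only if more than half the committee attested.
    return len(attesters & committee) > len(committee) / 2

validators = [f"v{i}" for i in range(100)]
committee = sample_committee(validators, "0xabc", size=11)
print(blob_accepted(committee, committee))  # True: everyone attested
print(blob_accepted(set(), committee))      # False: nobody attested
```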

The Trust Factor in Randomly Sampled Committees

However, this approach also has its limitations. The most significant of these is the reliance on the trustworthiness of the randomly sampled committee. Since the committee is responsible for verifying the availability of data, the system inherently requires users to trust these validators. This trust is crucial for the system to function effectively, as any dishonesty or malicious activity within the committee could compromise the integrity of the data.

The Size of the Committee and Its Implications

Another limitation of this approach is that the size of the committee is generally not large. This means that the number of validators verifying the data is relatively small, which could potentially make the system more vulnerable to attacks or manipulation. Therefore, while randomly sampled committees can provide a valuable mechanism for ensuring data availability, their use must be complemented by other security measures to ensure the overall integrity and security of the blockchain network.

4. Part-wise Attestation: A Fragmented Approach to Data Availability

Part-wise attestation represents a unique approach to the data availability problem. This method involves breaking down the data into smaller parts and assigning each part to a different validator for verification.

When a user submits data to the network, they first erasure-code the data blob. Erasure coding is a method of data protection in which data is broken into fragments, expanded and encoded with redundant data pieces, and stored across a set of different locations or storage media. The magic of erasure coding is that it allows the original data to be reconstructed from a subset of the fragments, even if some of the fragments are missing.
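
The simplest possible erasure code is a single XOR parity fragment, which tolerates the loss of any one fragment. Production systems use far stronger codes such as Reed-Solomon, but the toy sketch below captures the core idea:

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(fragments: list[bytes]) -> list[bytes]:
    """Append one XOR parity fragment: the simplest erasure code.
    Any single lost fragment can then be rebuilt from the rest."""
    return fragments + [reduce(xor_bytes, fragments)]

def reconstruct(coded: list[bytes], missing: int) -> bytes:
    """Recover the fragment at index `missing` by XOR-ing all the others."""
    return reduce(xor_bytes, [f for i, f in enumerate(coded) if i != missing])

coded = encode([b"ABCD", b"EFGH", b"IJKL"])  # equal-sized fragments
print(reconstruct(coded, 1))  # b'EFGH', rebuilt without the lost fragment
```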

The Process of Part-wise Attestation

After the data blob has been erasure-coded and split into parts, each validator in the network is assigned to download and attest to the availability of one or more parts. The validators then download their assigned parts and post attestations to the Layer 1 blockchain, indicating that the data is available.
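
A minimal sketch of the assignment and the availability check, assuming round-robin assignment (real protocols derive it from on-chain randomness) and an erasure code that can recover the blob from any k of its parts:

```python
def assign_parts(validators: list[str], num_parts: int) -> dict[int, str]:
    # Round-robin: part i goes to validator i mod n. A real protocol
    # would derive the assignment from on-chain randomness instead.
    return {i: validators[i % len(validators)] for i in range(num_parts)}

def recoverable(attested_parts: set[int], k: int) -> bool:
    # With an erasure code that rebuilds the blob from any k parts,
    # the blob is safe once at least k parts have been attested.
    return len(attested_parts) >= k

assignment = assign_parts(["v0", "v1", "v2", "v3"], num_parts=8)
print(recoverable({0, 1, 2, 3, 4, 5}, k=6))  # True even if parts 6-7 are withheld
```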

The Resilience of Part-wise Attestation

One of the key advantages of part-wise attestation is its resilience to data-withholding attacks. Since only a subset of the parts is required to reconstruct the whole data, the network can tolerate a certain level of malicious activity. Even if some validators act maliciously and withhold their assigned parts, the network can still reconstruct the original data from the remaining parts, as long as a sufficient number of parts are available.

Part-wise attestation, while an effective method for ensuring data availability, does have a few potential drawbacks:

Increased Complexity: The process of erasure coding, splitting data into parts, and assigning these parts to different validators adds a layer of complexity to the data verification process. This can make the system more difficult to implement and maintain, and could potentially introduce new points of failure.

Increased Resource Requirements: Each validator must download, store, and verify their assigned parts of data. This can increase the storage and bandwidth requirements for validators, which could be a barrier for smaller or less well-resourced participants.

Potential for Inefficiency: Depending on how the data is split and assigned, some validators may end up with a much larger workload than others. This could lead to inefficiencies and bottlenecks in the data verification process.

5. Data Availability Sampling: A Probabilistic Approach to Data Availability

Data Availability Sampling (DAS) is an innovative technique designed to ensure the availability of data in a blockchain network. It does this while only requiring the download of a small portion of the entire data set, making it a highly scalable solution.

Like part-wise attestation, DAS also relies on erasure coding. The data is first erasure-coded, which involves breaking it down into smaller parts and adding redundant data pieces. This process allows the original data to be reconstructed from a subset of the parts, even if some parts are missing.

The Process of Data Availability Sampling

Once the data has been erasure-coded and split into parts, the DAS process can begin. A client in the network, which could be a full node or a light client, initiates the process when it wants to verify the availability of a specific piece of data.

Instead of downloading the entire data set, the client randomly samples parts of the erasure-coded data. It sends requests to the network to download these sampled parts. The data is considered available if all sampling requests succeed, i.e., if all the sampled parts can be downloaded successfully.
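
A light client's sampling loop could be sketched like this; fetch_part is a hypothetical network call that returns None when a request fails or times out:

```python
import random

def data_available(num_parts: int, fetch_part, samples: int = 30) -> bool:
    """Sample random part indices; declare the blob available only if
    every sampled part can actually be downloaded."""
    indices = random.sample(range(num_parts), min(samples, num_parts))
    return all(fetch_part(i) is not None for i in indices)

parts = {i: f"part-{i}" for i in range(64)}        # toy "network": all parts present
print(data_available(64, lambda i: parts.get(i)))  # True: every sample succeeded
```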

The Probabilistic Nature of Data Availability Sampling

The key to DAS is its probabilistic nature. The client doesn’t need to download all the data to ensure its availability. Instead, it relies on the principle that if a sufficient number of randomly sampled parts are available, it’s highly likely that the entire data set is also available.

This approach allows DAS to ensure data availability with a high level of confidence, while significantly reducing the amount of data that needs to be downloaded. This makes it a highly efficient and scalable solution for the data availability problem in blockchain networks.

However, like all probabilistic methods, DAS is not 100% foolproof. There’s always a small chance that the sampled parts are available, but other parts of the data are not. Despite this, the probability of such a scenario occurring can be made arbitrarily small by increasing the number of parts sampled.
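
To make "arbitrarily small" concrete, assume a code that can reconstruct the blob from any 50% of its coded parts. An attacker must then withhold more than half the parts to make the blob unrecoverable, so each uniform random sample fails with probability greater than 1/2, and the chance that k samples all succeed anyway is below (1/2)^k:

```python
for k in (10, 20, 30):
    # Upper bound on the chance that k successful samples hide an
    # unrecoverable blob, given a 50%-reconstruction-threshold code.
    print(f"{k} samples: false-availability probability < {0.5 ** k:.2e}")

# 10 samples: false-availability probability < 9.77e-04
# 20 samples: false-availability probability < 9.54e-07
# 30 samples: false-availability probability < 9.31e-10
```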

State of the Art: Data Availability Mechanisms in Major Blockchain Networks

Blockchain networks such as Polkadot, NEAR, Tezos, and Ethereum are gravitating towards Data Availability Sampling (DAS) as a solution for their data availability needs. This is driven by several factors:

Scalability and Efficiency: DAS helps improve the scalability and efficiency of the network. By breaking down the data into manageable parts and verifying the availability of these parts, DAS reduces the amount of data that needs to be stored and processed, making the network faster and more efficient. This is particularly important in a sharded network, where data is distributed across multiple shards.

Security and Trust: DAS helps protect the integrity of the blockchain by preventing data withholding attacks. With DAS, a malicious actor must withhold a significant percentage of the erasure-coded data to disrupt the network. This makes the system more robust and reliable.

Incremental Reconstruction: Especially in the case of Danksharding, a proposal for scaling Ethereum, the introduction of a 2D erasure coding scheme allows the data to be reconstructed incrementally. This makes it feasible for lower-resource computers to perform the reconstruction, hence maintaining a more accessible and decentralized network (a toy sketch of the 2D idea follows this list).

Minimizing Centralization: Even though networks like Polkadot and NEAR have accepted aspects of centralization in block production, DAS helps them maintain a decentralized network by ensuring that validation remains decentralized. Ethereum, too, acknowledges the need for high-resource block builders but aims to keep the requirements for validators low, so DAS becomes crucial for maintaining decentralization while allowing for efficient sharding of data.
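
To illustrate the 2D idea behind incremental reconstruction mentioned above: arrange the blob as a square grid and extend every row and every column with parity, so a missing cell can be rebuilt from either its row or its column without ever holding the whole blob. The toy below uses XOR parity; Danksharding's actual scheme uses Reed-Solomon extensions:

```python
from functools import reduce
from operator import xor

def extend_2d(grid: list[list[int]]) -> list[list[int]]:
    """Toy 2D extension: add an XOR parity cell to every row, then a
    parity row over the columns, turning k x k into (k+1) x (k+1)."""
    rows = [row + [reduce(xor, row)] for row in grid]
    return rows + [[reduce(xor, col) for col in zip(*rows)]]

ext = extend_2d([[1, 2], [3, 4]])
# Suppose cell (0, 1), value 2, goes missing. Rebuild it from its row...
assert reduce(xor, [ext[0][0], ext[0][2]]) == 2
# ...or, independently, from its column: two separate repair paths.
assert reduce(xor, [ext[1][1], ext[2][1]]) == 2
```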

In summary, DAS appears to be the preferred solution due to its contributions towards ensuring data availability, improving scalability and efficiency, protecting the integrity of the network, supporting incremental reconstruction, and aiding in maintaining decentralization.

Conclusion

Blockchain ecosystems are attempting to solve the data availability problem to improve scalability. While relatively simple for full nodes operating on Layer 1 blockchains like Bitcoin, Ethereum, and Cardano, it becomes more complex with modular blockchains, Layer 2 rollups, and light clients. Ensuring that off-chain data is available to all L1 nodes without requiring them to download it is a critical challenge.

To address it, different strategies have emerged: Full Node Data Downloading, Data Availability Committees, Randomly Sampled Committees, Part-wise Attestation, and Data Availability Sampling (DAS).

Each has its own advantages and limitations, such as trust issues, the possibility of manipulation, complexity, resource requirements, or potential inefficiencies.

DAS, an innovative and scalable solution based on erasure coding and random sampling of data parts, is increasingly being favored by major blockchain networks due to its scalability, security, incremental reconstruction capabilities, and assistance in maintaining decentralization.
