COMP3139 Parallel & Distributed Computing, Topic 10 of 33

Software Architectures: Distributed Shared Data (DSD)

Distributed Shared Data (DSD) is an architecture in which data is shared across multiple nodes of a distributed system so that the nodes can read and modify it in a coordinated way. Whereas Distributed Shared Memory (DSM) creates a global address space for memory, DSD shares and manages data at a higher level (files, objects, records) across processes or nodes. The goal is that every node sees a consistent view of the data, regardless of where the data is physically stored or how it is managed.

1. What is Distributed Shared Data (DSD)?

Distributed Shared Data refers to a design pattern in distributed systems where data is shared between multiple nodes (machines, processes, or services) across a network. These systems allow data to be accessed, modified, and synchronized from different locations without requiring every node to hold a full local copy of the data.

In this architecture, data may be stored on several nodes, yet it is accessible in a shared manner across the whole system. The key challenges in DSD systems are ensuring data consistency, providing synchronization mechanisms, and handling network latency well enough to maintain the illusion of shared access while still using resources efficiently.

2. Key Concepts of Distributed Shared Data (DSD)

To better understand DSD, it is important to consider the core concepts involved in a distributed shared data architecture:

1. Global Data Space

Global data space refers to the logical view of the data that is accessible from any node in the distributed system. This can be thought of as a unified data store where data is distributed across multiple nodes, but the system provides an interface for accessing the data in a consistent manner.

2. Data Partitioning

Distributed data is often partitioned to improve performance and scalability. Partitioning involves breaking data into smaller, manageable chunks and distributing those chunks across different nodes. Each node may own and manage a subset of the global data, and queries to the data are routed to the correct node based on partitioning.
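
The routing step described above can be sketched with hash partitioning. This is a minimal illustration, not a production design: `partition_for` and `PartitionedStore` are hypothetical names, and each "node" is simply an in-memory dictionary.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class PartitionedStore:
    """Toy store: each partition is a dict standing in for one node."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions = [{} for _ in range(num_partitions)]

    def put(self, key: str, value) -> None:
        # Route the write to the partition that owns this key.
        self.partitions[partition_for(key, len(self.partitions))][key] = value

    def get(self, key: str):
        # Route the read to the same partition; None if the key is absent.
        return self.partitions[partition_for(key, len(self.partitions))].get(key)
```

A stable hash such as SHA-256 is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, so different nodes would disagree on where a key lives.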

3. Replication

Replication is the process of maintaining copies of data across multiple nodes. Replication helps with availability, fault tolerance, and load balancing. However, it also raises challenges related to ensuring consistency between replicas (the replication consistency problem).

4. Consistency Models

Ensuring that the data accessed across distributed nodes is consistent is a critical part of DSD. There are various models of consistency, such as strong consistency, eventual consistency, and causal consistency, each with trade-offs in terms of performance, fault tolerance, and user experience.
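
One common way to track the ordering that causal consistency depends on is a vector clock. The sketch below is illustrative only: it assumes each write carries a dictionary mapping node names to counters, and the helper names `tick` and `compare` are hypothetical.

```python
def tick(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a: dict, b: dict) -> str:
    """Classify two vector clocks as equal, before, after, or concurrent."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a causally precedes b
    if b_le_a:
        return "after"       # b causally precedes a
    return "concurrent"      # neither saw the other: a potential conflict
```

For example, if node A writes with clock `{"A": 1}` and node B then writes after seeing it, B's clock `{"A": 1, "B": 1}` compares as "after". A second write by A that never saw B's update, `{"A": 2}`, is "concurrent" with B's write and needs conflict resolution.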

5. Concurrency Control

In a distributed shared data system, multiple nodes or processes may attempt to read or write the same data at the same time. Concurrency control mechanisms such as locks, optimistic concurrency, or transactional processing are needed to manage simultaneous access and prevent race conditions or data corruption.
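
Optimistic concurrency can be sketched as a versioned compare-and-set: a writer states which version it read, and the write is rejected if someone else updated the item in the meantime. `VersionedStore` and `VersionConflict` are hypothetical names, and the store is a single in-memory dictionary rather than a real distributed service.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    def __init__(self) -> None:
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); version 0 means the key does not exist yet."""
        return self._data.get(key, (0, None))

    def write(self, key, expected_version: int, value) -> int:
        """Apply the write only if the caller read the current version."""
        current_version, _ = self._data.get(key, (0, None))
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._data[key] = (current_version + 1, value)
        return current_version + 1
```

A writer that receives `VersionConflict` would normally re-read the item, reapply its change to the fresh value, and retry.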

6. Fault Tolerance

A distributed system must be designed to handle node failures without losing data or compromising availability. Fault tolerance mechanisms include data replication, distributed transaction logs, and failure detection and recovery protocols.
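
Failure detection is often built on heartbeats: a node that has not been heard from within a timeout is suspected to have failed. The sketch below is a simplified illustration with a hypothetical `HeartbeatMonitor` class; timestamps are passed in explicitly so the behavior is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per node and reports suspected failures."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of last heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        """Record that a heartbeat from this node arrived at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now: float) -> set:
        """Return the nodes whose last heartbeat is older than the timeout."""
        return {
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        }
```

A timeout can only raise suspicion: a slow network looks the same as a crashed node, so recovery protocols have to tolerate false positives.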

3. Distributed Shared Data vs. Distributed Shared Memory (DSM)

While both Distributed Shared Data and Distributed Shared Memory (DSM) abstract the idea of sharing data across nodes, there are key differences:

| Feature | Distributed Shared Data (DSD) | Distributed Shared Memory (DSM) |
| --- | --- | --- |
| Granularity | Typically at the data level (structured data, objects, tables, etc.) | Typically at the memory or page level (raw memory access) |
| Access Model | Data is shared via higher-level abstractions (e.g., files, objects, tables, databases) | Memory is shared through a low-level memory access interface (i.e., virtual memory) |
| Consistency | Focuses on data consistency (e.g., transactions, eventual consistency) | Focuses on memory consistency (coherency and synchronization of memory reads/writes) |
| Common Use Cases | Databases, file systems, key-value stores | Parallel computing, scientific computing, and high-performance applications |
| Synchronization | Data synchronization is done at the application or middleware level | Memory synchronization is managed by the underlying DSM protocol |
| Communication | Uses message-passing or distributed databases to sync data across nodes | Uses low-level memory communication protocols (often via a network or interconnect) |

4. Architectural Models for Distributed Shared Data (DSD)

Several architectural approaches are used for implementing Distributed Shared Data systems, and these models differ in how they manage data distribution, consistency, and communication:

1. Client-Server Architecture

In a client-server architecture, one or more server nodes manage the data and share it with multiple client nodes. The clients request data from the server, which processes the request and provides the data.
The server typically handles data consistency, updates, and synchronization.
Example: Traditional relational databases (like MySQL or PostgreSQL) operate on this model, where data resides on a central server, and clients query the data over a network.

2. Peer-to-Peer (P2P) Architecture

In a P2P architecture, each node can act as both a client and a server, sharing and accessing data from any other node. There is no central server, and nodes communicate directly with each other to share data and maintain consistency.
Example: Peer-to-peer systems such as Freenet (a decentralized data store) or BitTorrent (a file-sharing protocol), where each peer shares pieces of data with other peers.

3. Master-Slave Architecture

A master-slave architecture (also called primary-replica) is a variant of the client-server model in which the master node accepts all writes and manages replication, while slave nodes maintain copies of the data. The slaves serve reads but do not modify the data except by applying changes received from the master.
Because every write goes through a single node, write ordering is simple to reason about, which suits systems that need strict consistency on writes. Reads served by slaves can be stale when replication is asynchronous.
Example: MySQL replication, where the master node (called the source in current MySQL documentation) handles writes and the slave nodes (replicas) handle reads.
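
The write path in this model can be sketched as log shipping: the master appends each write to a log, and slaves catch up by applying the entries they have not seen yet. `Primary` and `Replica` are hypothetical classes for illustration and do not reflect how MySQL implements replication.

```python
class Replica:
    """Read-only copy that applies changes received from the primary."""

    def __init__(self) -> None:
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def apply_from(self, log: list) -> None:
        # Apply only the entries this replica has not seen yet, in order.
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)


class Primary:
    """Accepts all writes and ships its log to the replicas."""

    def __init__(self, replicas: list) -> None:
        self.data = {}
        self.log = []  # ordered list of (key, value) writes
        self.replicas = replicas

    def write(self, key, value) -> None:
        self.data[key] = value
        self.log.append((key, value))

    def replicate(self) -> None:
        # In this sketch replication is a separate, explicit step,
        # which models asynchronous replication and its lag.
        for replica in self.replicas:
            replica.apply_from(self.log)
```

Between `write` and `replicate`, a read from a replica returns the old state. That window is the replication lag a client may observe when replication is asynchronous.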

4. Distributed Database Systems

Distributed databases provide an abstraction for sharing and managing data across multiple nodes. These systems typically include mechanisms for partitioning and replicating data across nodes. Their transactional guarantees vary: some provide ACID (Atomicity, Consistency, Isolation, Durability) transactions, while others relax these properties in favor of availability and offer weaker or tunable consistency.
Examples: Cassandra, HBase, and MongoDB are distributed database systems that implement DSD by allowing distributed nodes to share data.

5. Distributed Caching Systems

Distributed caching systems keep frequently accessed data in memory so that it can be served faster than from the primary data store. The cache is typically spread across multiple nodes and reduces load on the database or server behind it.
Example: Redis and Memcached are widely used caching systems that can be deployed across multiple nodes for faster data retrieval.
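
A common way to use such a cache is the cache-aside pattern: read from the cache first, fall back to the backing store on a miss, and invalidate the cached entry on writes. In this sketch plain dictionaries stand in for both the cache and the database; `CacheAside` is a hypothetical class and does not use the Redis or Memcached client APIs.

```python
class CacheAside:
    """Cache-aside reads and writes over a slower backing store."""

    def __init__(self, backing_store: dict) -> None:
        self.backing_store = backing_store  # stands in for a database
        self.cache = {}                     # stands in for a cache node
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            return self.cache[key]          # cache hit: no database access
        self.misses += 1
        value = self.backing_store.get(key)
        if value is not None:
            self.cache[key] = value         # populate the cache for next time
        return value

    def put(self, key, value) -> None:
        self.backing_store[key] = value
        self.cache.pop(key, None)           # invalidate so readers do not see stale data
```

Invalidating on write, rather than updating the cache in place, keeps the sketch simple at the cost of one extra miss after each write.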

5. Key Challenges in Distributed Shared Data (DSD)

While DSD provides an efficient and scalable way to manage data in distributed systems, there are several challenges to address:

1. Data Consistency

Maintaining consistency across distributed data copies is challenging. Different nodes might hold copies of the same data, and changes made to one copy must be propagated to all other copies to ensure that they remain consistent.
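Consistency Models: Strong consistency (every read reflects the most recent completed write), eventual consistency (replicas converge to the same value once updates stop), and causal consistency (causally related operations are seen in the same order by all nodes) are different approaches to managing data consistency in DSD systems.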
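
Eventual convergence can be illustrated with a last-write-wins register, one simple conflict-resolution rule among several. The sketch assumes each write carries a timestamp and a node identifier to break ties; `LWWRegister` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    """A replicated value tagged with the time and node of its last write."""

    value: object
    timestamp: int
    node_id: str

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        # Keep the write with the larger (timestamp, node_id) pair.
        # The rule is symmetric, so replicas converge regardless of merge order.
        return max(self, other, key=lambda r: (r.timestamp, r.node_id))
```

Because `merge` gives the same result whichever replica calls it, two replicas that exchange state in any order end up holding the same value. The trade-off is that the "losing" concurrent write is silently discarded.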

2. Fault Tolerance and Availability

Distributed systems are vulnerable to node failures, network partitions, and other disruptions. A robust DSD system must be able to recover from failures without losing data or compromising availability.
Replication combined with coordination techniques (such as quorum-based reads and writes or leader election) is used to keep data available and consistent even in the event of partial system failures.
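
A quorum-based approach can be sketched as follows: with N replicas, a write must reach W of them and a read must consult R of them, and choosing W + R > N guarantees that every read set overlaps the latest successful write set. `QuorumStore` is a hypothetical class; replicas are in-memory dictionaries, callers supply version numbers, and the `reachable` argument simulates which replicas respond.

```python
class QuorumStore:
    def __init__(self, n: int, w: int, r: int) -> None:
        # This sketch insists on overlapping quorums.
        if w + r <= n:
            raise ValueError("need W + R > N for read and write sets to overlap")
        self.replicas = [{} for _ in range(n)]  # each maps key -> (version, value)
        self.w = w
        self.r = r

    def write(self, key, value, version: int, reachable: list) -> bool:
        """Store the value on reachable replicas; fail if fewer than W respond."""
        if len(reachable) < self.w:
            return False
        for index in reachable:
            self.replicas[index][key] = (version, value)
        return True

    def read(self, key, reachable: list):
        """Ask the reachable replicas and return the value with the highest version."""
        if len(reachable) < self.r:
            raise RuntimeError("read quorum not available")
        responses = [self.replicas[i].get(key, (0, None)) for i in reachable]
        return max(responses, key=lambda versioned: versioned[0])[1]
```

With N = 3 and W = R = 2, a write that reaches replicas 0 and 1 is still visible to a read that contacts replicas 1 and 2, because replica 1 is in both sets.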

3. Latency

Since DSD systems often involve communication over a network, latency can become a significant factor. When data is distributed across multiple nodes, accessing the data from a distant node may introduce delays, especially in high-latency networks.
Caching and data locality techniques are used to reduce the impact of latency by storing frequently accessed data closer to where it is needed.

4. Scalability

As the number of nodes in a distributed system grows, managing the distribution, replication, and synchronization of data becomes more complex. Scalability requires careful partitioning of data, efficient synchronization algorithms, and mechanisms to balance the load across nodes.
Sharding and load balancing are common techniques used to manage data and ensure that the system can scale efficiently.
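
One widely used sharding technique is consistent hashing, which limits how much data moves when a node is added. The sketch below is illustrative: `HashRing` is a hypothetical class, and the choice of 100 virtual nodes per physical node is an arbitrary assumption.

```python
import bisect
import hashlib


def _hash(text: str) -> int:
    """Stable 64-bit hash used to place nodes and keys on the ring."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: list, vnodes: int = 100) -> None:
        # Each node is placed on the ring at several points (virtual nodes)
        # so that load is spread more evenly.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # A key belongs to the first ring point after its hash, wrapping around.
        index = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]
```

When a fourth node joins a three-node ring, the only keys that change owner are those that now belong to the new node. With simple `hash(key) % N` placement, most keys would be reassigned when N changes.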

5. Concurrency Control

With multiple processes or nodes potentially modifying the same data simultaneously, concurrency control mechanisms such as locking, optimistic concurrency, and transactional consistency are needed to prevent race conditions and data corruption.
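
Locking across nodes is often built on leases: a lock that expires automatically, so a crashed holder cannot block everyone forever. The following is a simplified sketch with a hypothetical `LeaseLock` class and explicit timestamps. Real lock services typically need additional safeguards, such as fencing tokens, which are omitted here.

```python
class LeaseLock:
    """A lock that is held by one node at a time and expires after `ttl` seconds."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.owner = None
        self.expires_at = 0.0

    def acquire(self, node: str, now: float) -> bool:
        """Grant the lock if it is free, expired, or already held by this node."""
        if self.owner is None or now >= self.expires_at or self.owner == node:
            self.owner = node
            self.expires_at = now + self.ttl
            return True
        return False

    def release(self, node: str) -> None:
        """Release the lock; requests from nodes that do not hold it are ignored."""
        if self.owner == node:
            self.owner = None
```

If node "a" acquires the lock at time 0 with a 5-second lease and then crashes, node "b" is refused at time 1 but succeeds at time 6, once the lease has expired.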

6. Advantages and Disadvantages of DSD

Advantages:

Scalability: Distributed shared data architectures can scale to accommodate large volumes of data and high levels
