COMP3147 Computer Architecture, Topic 24 of 24

⭐ Data-Level Parallelism (DLP) and GPU Programming

1. Definition of Data-Level Parallelism (DLP)

Data-Level Parallelism (DLP) is the parallelism available when the same operation is applied to many data elements independently, so that a processor can perform those operations simultaneously.

DLP is exploited by vector processors, SIMD (Single Instruction, Multiple Data) instruction sets, and GPU architectures, where one instruction operates on many data elements at once.

Key Idea

If a program performs the same computation independently on many data elements, DLP allows these computations to run in parallel.
Common in image processing, scientific computing, machine learning, and graphics.
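
The independence condition is easiest to see by comparing two loops. The sketch below is host-side code (plain C++ that also compiles inside a CUDA source file); the function names scaleAll and runningSum are hypothetical and chosen only for illustration.

```cuda
#include <cstddef>

// Data-parallel: iteration i reads and writes only element i,
// so all n iterations could run at the same time.
void scaleAll(float *x, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i)
        x[i] = factor * x[i];
}

// Not data-parallel as written: iteration i needs the value produced
// by iteration i - 1 (a loop-carried dependence).
void runningSum(float *x, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i)
        x[i] = x[i] + x[i - 1];
}
```

A loop like scaleAll maps directly onto SIMD lanes or GPU threads. A loop like runningSum has to be restructured first, because its iterations cannot simply be run side by side.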

2. GPU Programming and DLP

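A GPU (Graphics Processing Unit) is specialized for massive data-level parallelism:

Thousands of cores: Many simple arithmetic units, optimized for executing the same instruction across many threads simultaneously.
SIMT architecture: Single Instruction, Multiple Threads. Threads are grouped (into warps of 32 on NVIDIA GPUs), and each group executes one instruction at a time in SIMD fashion, with every thread working on its own data.
High memory bandwidth: Needed to keep that many cores supplied with data.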

A) GPU vs CPU for DLP

Feature	CPU	GPU
Number of cores	Few (2–64 typically)	Thousands of simple cores
Execution model	Few threads, complex control logic per core	Many threads, simple control logic per core
Latency vs throughput	Low latency for each thread	High throughput across all threads
DLP support	Limited (SIMD units)	Extensive (SIMT execution)

3. GPU Programming Model

GPU programming relies on writing kernels that operate on many data elements in parallel.

Key Concepts:

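Thread: Smallest unit of execution; each thread handles one or a few data elements.
Block / Workgroup: Group of threads that run on the same multiprocessor; they can synchronize with each other and share fast on-chip memory (called shared memory in CUDA and local memory in OpenCL).
Grid / NDRange: Collection of all blocks/workgroups launched for one kernel, sized to cover the full dataset.
Kernel: Function launched from the host and executed on the GPU by every thread in the grid, each thread on its own data.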
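
To show how the thread, block, and grid levels combine into an index, here is a minimal sketch of a kernel in which a thread may handle more than one element. The names scaleKernel and blocksFor are hypothetical, and the code assumes a one-dimensional grid.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: scales x in place. Each thread starts at its
// global index and then steps by the total number of threads in the
// grid, so one thread may process several elements.
__global__ void scaleKernel(float *x, int n, float factor) {
    int globalId   = blockIdx.x * blockDim.x + threadIdx.x; // position within the grid
    int gridStride = gridDim.x * blockDim.x;                // total threads launched
    for (int i = globalId; i < n; i += gridStride)
        x[i] = factor * x[i];
}

// Host-side helper: number of blocks such that
// blocks * threadsPerBlock >= n (ceiling division).
inline int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

A launch might look like scaleKernel<<<blocksFor(n, 256), 256>>>(d_x, n, 2.0f), where d_x is assumed to be a device pointer and 256 threads per block is an illustrative choice. Launching fewer blocks also works here, since each thread then simply loops over more elements.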

Example: Vector Addition

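// CUDA kernel: each thread adds one pair of elements.
__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    // Global index of this thread across all blocks of a 1D grid.
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // Skip threads whose index falls past the end of the arrays.
    if (i < N)
        C[i] = A[i] + B[i];
}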

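Each thread computes one element, C[i], of the result array.
The same instruction (+) is executed on multiple data elements simultaneously.
Because a grid is launched as a whole number of blocks, it can contain more threads than N; the check i < N makes the extra threads do nothing.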
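
The kernel alone does not show how the data reaches the GPU or how the grid size is chosen. The following is a minimal host-side sketch under stated assumptions: 256 threads per block, constant test inputs, and two hypothetical helpers named check and runVectorAdd. It uses only standard CUDA runtime calls (cudaMalloc, cudaMemcpy, cudaGetLastError, cudaFree).

```cuda
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// The vector-addition kernel from the example above (inputs marked
// const here), repeated so this file compiles on its own.
__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Hypothetical helper: abort with a message if a CUDA call failed.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical helper: adds two length-N vectors on the GPU and
// returns true if every element of the result equals 1 + 2 = 3.
bool runVectorAdd(int N) {
    const std::size_t bytes = static_cast<std::size_t>(N) * sizeof(float);
    std::vector<float> hA(N, 1.0f), hB(N, 2.0f), hC(N, 0.0f);

    // Allocate device memory and copy the inputs to the GPU.
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    check(cudaMalloc(&dA, bytes), "cudaMalloc dA");
    check(cudaMalloc(&dB, bytes), "cudaMalloc dB");
    check(cudaMalloc(&dC, bytes), "cudaMalloc dC");
    check(cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice), "copy A to device");
    check(cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice), "copy B to device");

    // Enough blocks to cover all N elements (ceiling division).
    const int threadsPerBlock = 256;  // assumed, not taken from the notes
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, N);
    check(cudaGetLastError(), "kernel launch");

    // This blocking copy also waits for the kernel to finish.
    check(cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost), "copy C to host");

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    for (int i = 0; i < N; ++i)
        if (hC[i] != 3.0f) return false;
    return true;
}

int main() {
    const bool ok = runVectorAdd(1 << 20);  // illustrative problem size
    std::printf("%s\n", ok ? "PASS" : "FAIL");
    return ok ? 0 : 1;
}
```

The two host-to-device copies and the copy back are part of the cost of using the GPU, which is one reason small problems often do not benefit from offloading.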

4. Benefits of DLP via GPU Programming

High throughput: Thousands of operations executed in parallel.
Efficient for repetitive operations on large datasets.
Reduces execution time for data-parallel tasks like:
- Image processing
- Matrix multiplication
- Neural network training

5. Challenges

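Memory bandwidth bottleneck: Many data-parallel kernels are limited by how fast data reaches the cores rather than by arithmetic, so access patterns matter (for example, neighboring threads reading neighboring addresses).
Thread divergence: When threads in the same warp follow different execution paths, the paths are executed one after another and performance drops.
Limited per-thread resources: Each thread has a limited number of registers, and shared memory is a limited per-block resource divided among the block's threads; kernels that use a lot of either can keep fewer threads resident at once.
Synchronization overhead: Threads may need to coordinate within blocks (for example with a barrier), and threads waiting at a barrier do no useful work.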
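
Several of these challenges show up together in a block-level sum, sketched below under stated assumptions: a fixed block size of 256 (a power of two), a hypothetical kernel named blockSum, and a hypothetical host wrapper named gpuSum that omits error checking for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // assumed block size; must be a power of two here

// Each block sums up to kBlockSize inputs and writes one partial sum.
// Must be launched with blockDim.x == kBlockSize.
__global__ void blockSum(const float *in, float *partial, int n) {
    __shared__ float tile[kBlockSize];      // limited per-block on-chip memory
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;     // out-of-range threads contribute 0
    __syncthreads();                        // all loads finished before reducing

    // Tree reduction: half as many threads are active at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();                    // outside the if, so every thread reaches it
    }
    if (tid == 0)
        partial[blockIdx.x] = tile[0];
}

// Hypothetical host wrapper: sums h on the GPU, then adds the
// per-block partial sums on the CPU.
float gpuSum(const std::vector<float> &h) {
    if (h.empty()) return 0.0f;
    const int n = static_cast<int>(h.size());
    const int blocks = (n + kBlockSize - 1) / kBlockSize;

    float *dIn = nullptr, *dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kBlockSize>>>(dIn, dPartial, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

The active threads at each step are the contiguous range tid < stride, so while stride is at least a warp wide, whole warps are either fully active or fully idle and divergence is limited to the final few steps. Every step still pays for a barrier, and the shared array counts against the block's shared-memory budget.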

6. Relation to Other Concepts

Concept	Relation to DLP / GPU
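SIMD / Vector Processing	SIMD and vector units in CPUs, and SIMT execution in GPUs, are the hardware mechanisms that exploit DLP
Multithreading / TLP	TLP runs independent threads that may execute different code; DLP applies the same instruction to multiple data elements
Speculative Execution	Less relevant in DLP; GPU cores rely on throughput from many threads rather than on low-latency execution of one thread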
Cache / Memory Hierarchy	Efficient memory access is critical for high DLP performance

7. Exam-Friendly Summary

Data-Level Parallelism (DLP): Same operation executed on many data elements simultaneously.
GPU Programming: Uses thousands of threads executing kernels in SIMT style for DLP.
Benefits: High throughput, efficient for large datasets.
Challenges: Memory bandwidth, thread divergence, limited per-thread resources.
