⭐ VLIW Pipelines: Local Scheduling
⭐ 1. What is VLIW?
VLIW (Very Long Instruction Word) is a CPU architecture that allows multiple independent operations to be encoded in a single long instruction word.
- Each VLIW instruction contains multiple operations (like ALU ops, load/store ops, or branches).
- All operations in the same VLIW instruction execute in parallel.
- Example: A VLIW instruction may contain 4 independent operations:
ADD R1, R2, R3 | MUL R4, R5, R6 | LOAD R7, 0(R8) | STORE R9, 0(R10)
Key point: VLIW relies on compiler scheduling, not hardware, to extract parallelism.
⭐ 2. What is Local Scheduling in VLIW?
Definition:
Local scheduling is the compiler’s technique of arranging instructions within a single basic block to maximize parallel execution in a VLIW pipeline while avoiding hazards.
- Focuses on instruction-level parallelism (ILP) within a block of straight-line code.
- Ensures that dependent instructions are ordered correctly, and independent instructions are grouped into the same VLIW instruction.
Local scheduling is “local” because it only considers one basic block at a time, not the entire program.
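Because local scheduling works on one basic block at a time, the code first has to be split into basic blocks. The following is a minimal sketch of the usual leader-based split, not a definitive implementation. The instruction format (dicts with `text`, optional `label`, optional `target`) and the function name `split_basic_blocks` are assumptions made for illustration only.

```python
def split_basic_blocks(instrs):
    """Split a list of instructions into basic blocks.

    Hypothetical format: each instruction is a dict with 'text', plus an
    optional 'label' (a jump destination) and an optional 'target'
    (the label a branch jumps to).
    """
    label_index = {ins["label"]: i for i, ins in enumerate(instrs) if ins.get("label")}
    leaders = {0} if instrs else set()       # the first instruction starts a block
    for i, ins in enumerate(instrs):
        target = ins.get("target")
        if target is not None:
            leaders.add(label_index[target])  # a branch target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)            # so does the instruction after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]
    return [instrs[s:e] for s, e in zip(starts, ends)]


program = [
    {"text": "R1 = R2 + R3"},
    {"text": "if R1 == 0 goto done", "target": "done"},
    {"text": "R4 = R5 * R6"},
    {"text": "R7 = R4 - R8"},
    {"text": "R9 = R10 + R11", "label": "done"},
]
blocks = split_basic_blocks(program)
for n, block in enumerate(blocks):
    print(n, [ins["text"] for ins in block])
```

A local scheduler would then be run on each element of `blocks` independently, never moving an instruction from one block into another.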
⭐ 3. Why Local Scheduling is Needed
VLIW pipelines execute multiple operations per instruction. To fully utilize the functional units:
- Independent instructions must be packed together.
- Dependent instructions must be placed in separate VLIW instructions, with the consumer scheduled only after the producer's result is available.
- Pipeline hazards (RAW, structural, control) must be handled at compile-time.
Without local scheduling:
- Many functional units remain idle
- Performance is wasted
⭐ 4. Steps in Local Scheduling
- Analyze instruction dependencies
  - RAW (Read After Write)
  - WAW (Write After Write)
  - WAR (Write After Read)
- Identify functional units available in the VLIW machine
  - Example: ALU, FP-MUL, Load/Store, Branch
- Pack independent instructions into one VLIW instruction
  - Assign each instruction to a free functional unit
- Insert NOPs if no independent instruction is available
  - NOPs fill the otherwise empty slots, so dependent instructions wait until their operands are ready
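The dependency-analysis step can be sketched as a pairwise register comparison. This is an illustrative sketch under simplifying assumptions: each instruction writes exactly one register, memory dependencies are ignored, and the names `Instr`, `classify`, and `dependence_edges` are hypothetical.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    name: str               # label such as "A"
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def classify(earlier, later):
    """Return the dependence kinds that force `later` to stay after `earlier`."""
    kinds = set()
    if earlier.dest in later.srcs:
        kinds.add("RAW")    # later reads what earlier writes
    if earlier.dest == later.dest:
        kinds.add("WAW")    # both write the same register
    if later.dest in earlier.srcs:
        kinds.add("WAR")    # later overwrites what earlier reads
    return kinds


def dependence_edges(block):
    """Map (earlier_name, later_name) -> set of dependence kinds."""
    edges = {}
    for j, later in enumerate(block):
        for earlier in block[:j]:
            kinds = classify(earlier, later)
            if kinds:
                edges[(earlier.name, later.name)] = kinds
    return edges


block = [
    Instr("A", "R1", ("R2", "R3")),   # R1 = R2 + R3
    Instr("B", "R4", ("R1", "R5")),   # R4 = R1 * R5
    Instr("C", "R2", ("R6", "R7")),   # R2 = R6 + R7
    Instr("D", "R1", ("R8", "R9")),   # R1 = R8 - R9
]
edges = dependence_edges(block)
for pair, kinds in edges.items():
    print(pair, sorted(kinds))
```

Any pair that does not appear in `edges` is independent and is a candidate for sharing a VLIW instruction, subject to functional-unit availability.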
⭐ 5. Example of Local Scheduling
Original code (basic block)
I1: R1 = R2 + R3
I2: R4 = R5 * R6
I3: R7 = R1 - R8
I4: R9 = R10 + R11
Dependencies
- I3 depends on I1 (RAW on R1) → cannot execute in the same VLIW word.
- I1, I2, and I4 are independent → can execute together.
Scheduled VLIW Instructions
| VLIW Instruction | Slot 1 | Slot 2 | Slot 3 | Slot 4 |
|---|---|---|---|---|
| VLIW1 | I1 (ALU) | I2 (MUL) | I4 (ALU) | NOP |
| VLIW2 | I3 (ALU) | NOP | NOP | NOP |
- I1, I2, and I4 execute in parallel in VLIW1; placing I1 and I4 together assumes the machine has two ALU slots
- I3 executes in VLIW2, after I1 has produced R1
Local scheduling ensures parallelism is exploited without violating dependencies.
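The example can be reproduced with a simple greedy scheduler that walks the block in program order and fills one VLIW instruction per cycle. This is a sketch, not the algorithm of any particular compiler. It assumes a hypothetical 4-slot machine (ALU, MUL, ALU, load/store), a one-cycle latency for every operation, register-only dependencies, and uniquely named instructions. `Instr`, `SLOTS`, `depends_on`, and `schedule` are illustrative names.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    name: str               # label such as "I1"
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read
    unit: str               # functional-unit type required


# Assumed slot layout of the hypothetical machine.
SLOTS = ("ALU", "MUL", "ALU", "MEM")


def depends_on(later, earlier):
    return (earlier.dest in later.srcs       # RAW
            or earlier.dest == later.dest    # WAW
            or later.dest in earlier.srcs)   # WAR (treated conservatively)


def schedule(block, slots=SLOTS):
    """Greedily pack a basic block into VLIW instructions (lists of slot names)."""
    for ins in block:
        if ins.unit not in slots:
            raise ValueError(f"no slot for unit {ins.unit!r}")
    # Predecessors: earlier instructions that must issue in an earlier bundle.
    preds = {ins.name: {e.name for e in block[:k] if depends_on(ins, e)}
             for k, ins in enumerate(block)}
    done = set()            # names issued in earlier bundles
    pending = list(block)
    bundles = []
    while pending:
        bundle = ["NOP"] * len(slots)
        issued = set()
        for ins in pending:
            if not preds[ins.name] <= done:
                continue    # an operand is not ready yet
            for s, unit in enumerate(slots):
                if unit == ins.unit and bundle[s] == "NOP":
                    bundle[s] = ins.name
                    issued.add(ins.name)
                    break
        done |= issued
        pending = [ins for ins in pending if ins.name not in done]
        bundles.append(bundle)
    return bundles


block = [
    Instr("I1", "R1", ("R2", "R3"), "ALU"),    # R1 = R2 + R3
    Instr("I2", "R4", ("R5", "R6"), "MUL"),    # R4 = R5 * R6
    Instr("I3", "R7", ("R1", "R8"), "ALU"),    # R7 = R1 - R8
    Instr("I4", "R9", ("R10", "R11"), "ALU"),  # R9 = R10 + R11
]
for bundle in schedule(block):
    print(" | ".join(bundle))
# I1 | I2 | I4 | NOP
# I3 | NOP | NOP | NOP
```

Production list schedulers usually rank ready instructions with a priority heuristic such as critical-path length and model multi-cycle latencies; this sketch omits both to stay close to the example.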
⭐ 6. Characteristics of Local Scheduling in VLIW
- Compiler-driven: Hardware does not perform dynamic scheduling.
- Intra-block only: Only instructions in the same basic block are considered.
- Hazard-free: Compiler avoids RAW/WAR/WAW hazards.
- Slot utilization: Maximizes the use of functional units per VLIW instruction.
- May insert NOPs: When insufficient independent instructions exist.
⭐ 7. Local Scheduling vs Global Scheduling
| Feature | Local Scheduling | Global Scheduling |
|---|---|---|
| Scope | Single basic block | Across multiple blocks |
| Complexity | Low | High |
| Performance gain | Moderate | High (more ILP) |
| Hazard handling | Easy (within block) | Hard (requires analysis across blocks) |
| NOP insertion | Common | Less common |
⭐ 8. Advantages of Local Scheduling in VLIW
- Exploits instruction-level parallelism
- Reduces idle functional units
- Ensures hazard-free execution
- Simplifies hardware: no dynamic scheduling needed
⭐ 9. Limitations
- Cannot exploit inter-block parallelism (needs global scheduling)
- Depends on program structure: a block with long chains of dependent instructions offers little parallelism
- May introduce NOPs, which waste issue slots and reduce efficiency
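The cost of NOPs can be put into numbers as slot utilization: useful operations divided by total slots. The sketch below applies it to the two-instruction schedule from section 5 and, for comparison, to an assumed naive baseline that issues one operation per VLIW instruction. The function name `slot_utilization` and the baseline are illustrative assumptions.

```python
def slot_utilization(bundles):
    """Fraction of slots holding a real operation rather than a NOP."""
    total = sum(len(b) for b in bundles)
    useful = sum(1 for b in bundles for op in b if op != "NOP")
    return useful / total if total else 0.0


# Schedule from the worked example: 4 useful operations in 8 slots.
scheduled = [
    ["I1", "I2", "I4", "NOP"],
    ["I3", "NOP", "NOP", "NOP"],
]

# Assumed naive baseline: one operation per VLIW instruction.
naive = [[name, "NOP", "NOP", "NOP"] for name in ("I1", "I2", "I3", "I4")]

print(slot_utilization(scheduled))  # 0.5
print(slot_utilization(naive))      # 0.25
```

Even the scheduled version leaves half of the slots empty, which illustrates why a single basic block often does not contain enough independent work to fill a wide machine.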
⭐ 10. Exam-Focused Summary
- VLIW (Very Long Instruction Word): Multiple operations per instruction, executed in parallel.
- Local scheduling: Compiler reorders instructions within a basic block to maximize parallel execution and avoid hazards.
- Goal: Fill all functional units in a VLIW instruction with independent instructions.
- Steps: Analyze dependencies → assign instructions to functional units → insert NOPs if needed.
- Advantage: High performance with simple hardware.
- Limitation: Only exploits parallelism within a basic block.