

# HetroOMP: OpenMP for Hybrid Load Balancing Across Heterogeneous Processors

Vivek Kumar<sup>1</sup>, Abhiprayah Tiwari<sup>1</sup>, Gaurav Mitra<sup>2</sup>

<sup>1</sup> IIIT Delhi, New Delhi, India

<sup>2</sup> Texas Instruments, Sugarland, Texas, USA



# Outline

- Introduction
- Contributions
- Motivating analysis
- Insights and approach
- Implementation
- Experimental Evaluation
- Summary



# Introduction

# **Accelerator Programming**

- Directive-based
- Language-based










# Hybrid Parallelism




# Hybrid Parallelism in OpenMP (Attempt #1)



HetroOMP: OpenMP for Hybrid Load Balancing | Kumar et al. | IWOMP 2019


# Hybrid Parallelism in OpenMP (Attempt #2)

```
int THRESHOLD = /* some value */;

void mergesort(int left, int right) {
    if (right - left > THRESHOLD) {
        /* Midpoint of [left, right]. */
        int mid = left + (right - left) / 2;
        #pragma omp task untied \
            firstprivate(left, mid)
        mergesort(left, mid);
        mergesort(mid + 1, right);
        #pragma omp taskwait
        merge(left, mid, right);
    } else {
        sequentialSort(left, right);
    }
}

int main() {
    /* The target region offloads the whole computation to the accelerator. */
    #pragma omp target map(to:N) \
        map(tofrom:A[0:N])
    #pragma omp parallel \
        firstprivate(A, N)
    #pragma omp single
    mergesort(0, N - 1);
}
```

Figure: a multicore (cache, system memory) next to an accelerator (device memory).

- Question: "Mergesort: can I use both multicore and accelerator together?"
- Answer: "No... This will run only on my accelerator."


# Hybrid Parallelism in OpenMP (How?)


 Manually partitioning the workload between multicore and accelerator?



- Two different kernels, one each for host and device
  - No serial elision: the program behaves differently if the directives are disabled
- What should be the optimal partition size?
  - Host and accelerator have different performance
  - Communication latency between host and device
  - There could be several layers of parallelism (finding the optimal partition is NP-hard)
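
To make the partitioning problem concrete, here is a minimal sketch of a manual static split using only standard OpenMP directives. The names `scale_host`, `scale_device`, `split_point` and the tunable `host_fraction` are hypothetical and do not come from the slides. The sketch needs two kernels, one per side, and the best value of `host_fraction` depends on the relative speed of the cores and on transfer cost, which is the difficulty the bullets above describe.

```c
#include <assert.h>
#include <stddef.h>

/* Host kernel: doubles a[lo..hi) on the multicore CPU. */
static void scale_host(float *a, size_t lo, size_t hi) {
    #pragma omp parallel for
    for (size_t i = lo; i < hi; i++)
        a[i] *= 2.0f;
}

/* Device kernel: the same computation, offloaded. It falls back to the
 * host when no accelerator (or no OpenMP support) is available. */
static void scale_device(float *a, size_t lo, size_t hi) {
    #pragma omp target teams distribute parallel for map(tofrom: a[lo:hi - lo])
    for (size_t i = lo; i < hi; i++)
        a[i] *= 2.0f;
}

/* Index where the host part ends, for a hand-picked fraction in [0, 1]. */
static size_t split_point(size_t n, double host_fraction) {
    assert(host_fraction >= 0.0 && host_fraction <= 1.0);
    return (size_t)((double)n * host_fraction);
}

/* Static split: the host takes a[0..cut), the device takes a[cut..n).
 * The two calls run one after the other in this sketch; overlapping
 * them would additionally need an asynchronous target region. */
static void scale_hybrid_static(float *a, size_t n, double host_fraction) {
    size_t cut = split_point(n, host_fraction);
    scale_host(a, 0, cut);
    scale_device(a, cut, n);
}
```

A call such as `scale_hybrid_static(a, n, 0.5)` gives half of the array to each side; nothing in the code says whether 0.5 is a good choice for a given machine.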





#### **Research Questions**

1. Can we extend the OpenMP accelerator model to support hybrid parallelism without hurting programmer productivity?
2. Can we design a high-performance hybrid runtime for such an extension?


# Contributions

#### HetroOMP programming model

Extension to OpenMP accelerator model for enabling hybrid parallelism across host and device

#### Lightweight runtime implementation

That uses a hybrid work-stealing runtime for dynamic load balancing over an ARM+DSP based MPSoC

#### Detailed performance study

Using several data and task parallel benchmarks

#### Results

That demonstrate HetroOMP achieves a geometric mean speedup of 3.6x over the default OpenMP accelerator model



# MPSoC used in this Study: TI Keystone II

#### Architecture

- 4 ARM + 8 DSP cores

- Cache coherency among ARM cores
- No cache coherency among DSPs / between DSP and ARM
- Shared memory w/ different address spaces
  - Pointer conversion needed between ARM and DSP
- L1 cache line sizes differ: 64 bytes on ARM, 128 bytes on DSP
- C library for DSPs doesn't support concurrency
  - Use concurrent hardware queues and hardware semaphores instead
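
The different cache line sizes matter for task granularity: if two cores without coherence write to the same line, one writeback can overwrite the other's data. A minimal sketch of a granularity rule follows, assuming tasks own contiguous element ranges. The helper names `cache_grain_elems` and `align_chunk` are hypothetical; only the 64-byte and 128-byte line sizes come from the slide.

```c
#include <assert.h>
#include <stddef.h>

enum { ARM_L1_LINE = 64, DSP_L1_LINE = 128 };  /* bytes, from the slide */

/* Smallest number of elements a task should own so that no two tasks
 * write into the same cache line on either core type. */
static size_t cache_grain_elems(size_t elem_size) {
    assert(elem_size > 0);
    size_t line = ARM_L1_LINE > DSP_L1_LINE ? ARM_L1_LINE : DSP_L1_LINE;
    return (line + elem_size - 1) / elem_size;  /* round up */
}

/* Round a chunk length up to a whole number of grains. */
static size_t align_chunk(size_t len, size_t elem_size) {
    size_t g = cache_grain_elems(elem_size);
    return ((len + g - 1) / g) * g;
}
```

For 4-byte integers the grain is 32 elements, one full DSP line. The rule only helps if the array itself starts on a line boundary, which this sketch does not enforce.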





# MPSoC used in this Study: TI Keystone II

#### Existing programming models

- OpenMP accelerator model [1]
- HC-K2H (Habanero C) [2]
  - No serial elision


Hybrid ARM/DSP work-stealing scheduler

Hybrid work-stealing performance worse than ARM\_Only



# MPSoC used in this Study: TI Keystone II

Drawbacks in HC-K2H's hybrid work-stealing




# Insights


- OpenMP should support hybrid execution across host and accelerator
- Hybrid work-stealing runtime at DSP
  - Should improve locality by supporting FIFO steals
  - Should not perform costly cache writebacks at all task synchronization points
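
The locality argument for FIFO steals can be illustrated with a small non-concurrent deque: the owner pushes and pops at one end (newest task, likely still in cache), while tasks handed to thieves come from the other end (oldest task). This is a sketch under assumptions, not the HetroOMP data structure: the type `private_deque`, the fixed capacity, and the use of integer task ids are all illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEQUE_CAP 64  /* illustrative capacity */

typedef struct {
    int    tasks[DEQUE_CAP]; /* task ids stand in for real task descriptors */
    size_t head;             /* index of the oldest task (steal end) */
    size_t count;            /* number of tasks currently stored */
} private_deque;

static void deque_init(private_deque *d) {
    assert(d != NULL);
    d->head = 0;
    d->count = 0;
}

/* Owner: push the newest task at the tail. Returns false when full. */
static bool deque_push(private_deque *d, int task) {
    if (d->count == DEQUE_CAP) return false;
    d->tasks[(d->head + d->count) % DEQUE_CAP] = task;
    d->count++;
    return true;
}

/* Owner: pop the newest task (LIFO order). */
static bool deque_pop(private_deque *d, int *task) {
    if (d->count == 0) return false;
    d->count--;
    *task = d->tasks[(d->head + d->count) % DEQUE_CAP];
    return true;
}

/* Owner, on behalf of a thief: hand out the oldest task (FIFO order). */
static bool deque_steal(private_deque *d, int *task) {
    if (d->count == 0) return false;
    *task = d->tasks[d->head];
    d->head = (d->head + 1) % DEQUE_CAP;
    d->count--;
    return true;
}
```

Because only the owner touches the structure, no atomic operations are needed, which fits a platform whose C library lacks concurrency support.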



# Approach

- Hybrid programming
  - HetroOMP programming model
    - ✓ Simple extension to the OpenMP accelerator model
    - `hetro` clause to define the scope of hybrid execution
- Hybrid execution using work-stealing
  - ✓ Non-concurrent private deque [1] in the L2 cache of the DSP
    - LIFO push and pop, FIFO steals (improved locality)
    - Sender-initiated steal operations at the DSP can track whether the thief is cache coherent
      - ✓ Cache writeback only if a task was sent to a cache-incoherent core

[1] Acar et al., Scheduling Parallel Programs by Work-Stealing with Private Deques, PPoPP 2013
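
The conditional writeback can be sketched as a flag that the sending worker sets when it hands a task to a core that cannot see its cache, and checks at the next synchronization point. The names `worker`, `send_task` and `sync_point` are hypothetical, and the counter stands in for a real cache writeback call; the coherence rule (only ARM cores are coherent with each other) is taken from the architecture slide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { CORE_ARM, CORE_DSP } core_kind;

typedef struct {
    core_kind kind;
    bool sent_to_incoherent;  /* a task left for a core that cannot see our cache */
    int  writebacks;          /* counts simulated cache writebacks */
} worker;

/* Only ARM cores are cache coherent with each other. */
static bool cache_coherent(core_kind a, core_kind b) {
    return a == CORE_ARM && b == CORE_ARM;
}

/* The sender hands a task to a thief and remembers whether a writeback is owed. */
static void send_task(worker *sender, core_kind thief) {
    assert(sender != NULL);
    if (!cache_coherent(sender->kind, thief))
        sender->sent_to_incoherent = true;
}

/* Task synchronization point: write back only if some task was sent
 * to a cache-incoherent core since the last synchronization. */
static void sync_point(worker *w) {
    assert(w != NULL);
    if (w->sent_to_incoherent) {
        w->writebacks++;               /* stands in for the real writeback */
        w->sent_to_incoherent = false;
    }
}
```

A worker that never gives work away therefore pays nothing at synchronization points, which is the saving the approach aims for.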



# Implementation


# HetroOMP Programming Model

Usage of the clause "hetro"


1. `#pragma omp parallel hetro`
   - Indicates the scope of hybrid execution
2. `#pragma omp task hetro(Var1:Count1, ...)`
   - Name and count of every writable shared variable
     - `Var` must be a pointer type
     - `Count` is the number of elements (e.g., array size)
3. `#pragma omp for schedule(hetro, chunks)`
   - Hybrid execution of loop iterations
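
One plausible reading of `schedule(hetro, chunks)` is that the iteration space is cut into chunks of `chunks` iterations, each of which becomes a unit of work that either an ARM or a DSP worker can pick up. The sketch below shows only that decomposition; it is an assumption about the runtime, and `chunk_range`, `chunk_count` and `chunk_at` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t begin, end; } chunk_range;  /* half-open: [begin, end) */

/* Number of chunks needed to cover n iterations. */
static size_t chunk_count(size_t n, size_t chunk) {
    assert(chunk > 0);
    return (n + chunk - 1) / chunk;
}

/* Iteration range of the k-th chunk; the last chunk may be shorter. */
static chunk_range chunk_at(size_t n, size_t chunk, size_t k) {
    assert(k < chunk_count(n, chunk));
    chunk_range r;
    r.begin = k * chunk;
    r.end = (r.begin + chunk < n) ? r.begin + chunk : n;
    return r;
}
```

With 10 iterations and a chunk size of 4, the chunks are [0, 4), [4, 8) and [8, 10); a work-stealing runtime could then treat each range as one task.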


# HetroOMP Programming Model


Parallel mergesort that can perform hybrid execution over both host and device:

<pre>
int THRESHOLD = omp_cache_grain() / INT_SIZE;

void mergesort(int left, int right) {
    if (right - left &gt; THRESHOLD) {
        int mid = left + (right - left) / 2;
        #pragma omp task untied \
            firstprivate(left, mid) hetro(A:N)
        mergesort(left, mid);
        mergesort(mid + 1, right);
        #pragma omp taskwait
        merge(left, mid, right);
    } else {
        sequentialSort(left, right);
    }
}

int main() {
    #pragma omp target map(to:N) \
        map(tofrom:A[0:N])
    #pragma omp parallel \
        firstprivate(A, N) hetro
    #pragma omp single
    mergesort(0, N - 1);
}
</pre>

- `THRESHOLD`: task granularity chosen to avoid false sharing between host and device
- `hetro(A:N)` on the task: lists the shared writable variables
- `hetro` on the parallel region: defines the scope of hybrid execution
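
For readers who want to run the example, the following is a self-contained host-only version using standard OpenMP tasks. The `hetro` clauses and `omp_cache_grain()` are left out because they need the HetroOMP translator, so this version does not perform hybrid execution. The bodies of `merge` and `sequentialSort`, the fixed `THRESHOLD` of 32, the wrapper `sort_array`, and the name `mergesort_rec` (chosen to avoid clashing with a `mergesort` that some C libraries declare) are assumptions added for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { THRESHOLD = 32 };  /* illustrative cutoff, not derived from the cache grain */

static int *A;            /* array being sorted, shared by all tasks */

/* Insertion sort on A[left..right], inclusive. */
static void sequentialSort(int left, int right) {
    for (int i = left + 1; i <= right; i++) {
        int key = A[i], j = i - 1;
        while (j >= left && A[j] > key) { A[j + 1] = A[j]; j--; }
        A[j + 1] = key;
    }
}

/* Merge the sorted halves A[left..mid] and A[mid+1..right]. */
static void merge(int left, int mid, int right) {
    int n = right - left + 1, i = left, j = mid + 1, k = 0;
    int *tmp = malloc((size_t)n * sizeof *tmp);
    assert(tmp != NULL);
    while (i <= mid && j <= right) tmp[k++] = (A[i] <= A[j]) ? A[i++] : A[j++];
    while (i <= mid)   tmp[k++] = A[i++];
    while (j <= right) tmp[k++] = A[j++];
    memcpy(A + left, tmp, (size_t)n * sizeof *tmp);
    free(tmp);
}

static void mergesort_rec(int left, int right) {
    if (right - left > THRESHOLD) {
        int mid = left + (right - left) / 2;
        /* Left half becomes a task; the right half runs in the current task. */
        #pragma omp task untied firstprivate(left, mid)
        mergesort_rec(left, mid);
        mergesort_rec(mid + 1, right);
        #pragma omp taskwait
        merge(left, mid, right);
    } else {
        sequentialSort(left, right);
    }
}

/* Sort data[0..n-1] on the host; one thread starts the recursion. */
static void sort_array(int *data, int n) {
    A = data;
    #pragma omp parallel
    #pragma omp single
    mergesort_rec(0, n - 1);
}
```

Compiled without OpenMP support, the pragmas are ignored and the same code runs as a sequential mergesort, which is the serial-elision property mentioned earlier.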





# HetroOMP Runtime


- OMP-to-X [1] translator modified to generate runtime code
- Hybrid work-stealing runtime
  - ARM work-stealing runtime same as HC-K2H
  - Private deque (L2 cache) based DSP work-stealing runtime



[1] Grossman et al., OpenMP as a High-Level Specification Language for Parallelism, IWOMP 2016



# Methodology


- Benchmarks
  - Nested task and taskwait
    - Fibonacci
    - Matmul
    - Knapsack
    - MergeSort
    - Heat
  - Parallel for
    - Rodinia suite
      - BFS
      - Hotspot
      - Srad
      - LUD
      - B+Tree

# Runtime & Configurations

| Runtime  | ARM_Only (4 cores) | DSP_Only (8 cores) | Hybrid (12 cores) |
|----------|--------------------|--------------------|-------------------|
| OpenMP   |                    |                    | X                 |
| HC-K2H   |                    |                    |                   |
| HetroOMP |                    |                    |                   |



# Speedup (MergeSort)





# Geomean Speedup (All Tasking Types)



# Geomean Speedup (All Parallel for)






# Summary

- The OpenMP accelerator model doesn't support hybrid execution across host and device
  - Wastes CPU resources
- HetroOMP
  - Simple extension to OpenMP accelerator model for supporting hybrid execution
  - Uses hybrid work-stealing runtime
    - ARM work-stealing runtime built on traditional design (Cilk)
    - DSP work-stealing runtime uses private deques allocated on L2 cache instead of inbuilt hardware queues
      - Better locality
      - Fewer cache writebacks for task synchronization
  - Results
    - HetroOMP achieves a geometric mean speedup of 3.6x over the default OpenMP accelerator model
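
For reference, a geometric mean speedup is conventionally computed as shown below. The symbols are notation introduced here, not taken from the slides: $s_i$ is the speedup on benchmark $i$, $n$ is the number of benchmarks, and $T_i$ is an execution time under the named configuration.

```latex
\[
\bar{s}_{\mathrm{geo}}
  = \left( \prod_{i=1}^{n} s_i \right)^{1/n}
  = \exp\!\left( \frac{1}{n} \sum_{i=1}^{n} \ln s_i \right),
\qquad
s_i = \frac{T_i^{\mathrm{OpenMP}}}{T_i^{\mathrm{HetroOMP}}}
\]
```

Unlike an arithmetic mean, this average is not dominated by a single benchmark with an unusually large speedup.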



## **Backup Slides**

# HetroOMP Runtime

- `int THRESHOLD = DSP_CACHE_LINE / INT_SIZE;`
- `finish = new_scope(A, INT_SIZE * N);`

