# Dynafuse: Dynamic Dependence Analysis for FPGA Pipeline Fusion and Locality Optimizations

Jeremy Fowers, Greg Stitt University of Florida Department of Electrical and Computer Engineering

# **Dynafuse Optimization**

- High-level synthesis (HLS) useful for FPGA productivity
- Generally unaware of data dependencies

   Unnecessary PCIe transfers and missed parallelism

```
Application code:
void f(int *a, int *b, int *c) {
...
*a = f1();
*b = f2();
*c = f3();
f4(a);
}
```

# **Dynafuse Optimization**

#### Result: pessimistic data management



# **Dynafuse Optimization**

#### Result: pessimistic data management





**FPGA** 

F1()

F2()

# **Locality Exploitation**

- PCIe bus is inefficient
  - Up to 65% of total execution time
- FPGA boards often have multiple DDR RAMs



# **Locality Exploitation**

- PCIe bus is inefficient
  - Up to 65% of total execution time
- FPGA boards often have multiple DDR RAMs



• Saves 2N-2 transfers for N functions, 2 in this case

# **Pipeline Fusion**

- Many FPGA functions are pipelined
- Fuse dependent functions into single pipeline



Fusing N pipelines creates Nx speedup

#### **Dynafuse Overview**

- The Dynafuse optimizations:
  - Locality Exploitation and Pipeline Fusion
  - Automatic & transparent
- SW: Dynamic Dependence Analysis
  - Coherent Arrays
  - Function Queue
- HW: Fusion Network

# **Dynafuse Challenges**

• Information for locality exploitation:



Which device memory has the most recent copy of the data?

# **Coherent Arrays**

- Std arrays have locality, alias issues
- Resolved by Coherent Arrays



# **Coherent Arrays**

- Std arrays have locality, alias issues
- Resolved by Coherent Arrays



All data starts on the CPU



Coherent array does *not* immediately return output

Reads/writes ensure locality before returning



All data starts on the CPU



Coherent array does *not* immediately return output

Reads/writes ensure locality before returning

Application code:

1 CoherentArrav A

B

- 2 CoherentA
- 3 CoherentA U
- 4 FPGA\_F1(A, B)
- 5 Other\_Functions()
- 6 FPGA\_F2(B, C)
- 7 File.write(C)



All data starts on the CPU



3

Coherent array does *not* immediately return output

Reads/writes ensure locality before returning

- 1 CoherentArray A
- 2 CoherentArray B
- 3 CoherentArray C
- 4 FPGA\_F1(A, B)
- 5 Other\_Functions()
- 6 FPGA\_F2(B, C)
- 7 File.write(C)



All data starts on the CPU



3

Coherent array does *not* immediately return output

Reads/writes ensure locality before returning

- 1 CoherentArray A
- 2 CoherentArray B
- 3 CoherentArray C
- 4 FPGA\_F1(A, B)
- 5 Other\_Functions()
- 6 FPGA\_F2(B, C)
- 7 File.write(C)

| СР | U  |   |   |  |
|----|----|---|---|--|
|    |    |   | С |  |
|    |    |   |   |  |
| FP | GA |   |   |  |
|    | А  | В |   |  |
|    |    |   |   |  |

All data starts on the CPU



3

Coherent array does *not* immediately return output

Reads/writes ensure locality before returning

- 1 CoherentArray A
- 2 CoherentArray B
- 3 CoherentArray C
- 4 FPGA\_F1(A, B)
- 5 Other\_Functions()
- 6 FPGA\_F2(B, C)
- 7 File.write(C)

| С  | PU  |   |   |
|----|-----|---|---|
|    |     |   | С |
|    |     |   |   |
| FF | PGA |   | _ |
|    | А   | В |   |
|    |     |   |   |

All data starts on the CPU



Coherent array does *not* immediately return output

Reads/writes ensure locality before returning

- 1 CoherentArray A
- 2 CoherentArray B
- 3 CoherentArray C
- 4 FPGA\_F1(A, B)
- 5 Other\_Functions()
- 6 FPGA\_F2(B, C)
- 7 File.write(C)

| С  | PU  |   |   |      |
|----|-----|---|---|------|
|    |     |   |   | File |
|    |     |   |   |      |
| FF | PGA |   |   |      |
|    | А   | В | С |      |
|    |     |   |   | 17   |

All data starts on the CPU



3

Coherent array does *not* immediately return output

Reads/writes ensure locality before returning

Application code:

- 1 CoherentArray A
- 2 CoherentArray B
- 3 CoherentArray C
- 4 FPGA\_F1(A, B)
- 5 Other\_Functions()
- 6 FPGA\_F2(B, C)

7 File.write(C)

| C                      | PU                   |   |   |  |  |  |
|------------------------|----------------------|---|---|--|--|--|
|                        |                      |   | С |  |  |  |
|                        | ssary Pole transfers |   |   |  |  |  |
| e data <sub>FPGA</sub> |                      |   |   |  |  |  |
|                        | А                    | В |   |  |  |  |
|                        |                      |   |   |  |  |  |

# **Dynafuse Challenges**

- Information for locality exploitation:
  - (1)
    - Which device memory has the most recent copy of the data?
- For pipeline fusion:
  - 2

Which future functions need the output of the current function?

- FPGA functions are placed on the queue when called—*not executed*
- Coherent arrays used to establish
- dependence chains
  - CPU functions trigger the entire chain that produces the requested value



- FPGA functions are placed on the queue when called—*not executed*
- Coherent arrays used to establish
- dependence chains
  - CPU functions trigger the entire chain that produces the requested value



FPGA functions are placed on the queue when called—*not executed* Coherent arrays used to establish

dependence chains

3

CPU functions trigger the entire chain that produces the requested value

Application code:

- 1 FPGA\_F1(A, B)
- 2 FPGA\_F2(B, C)
- 3 FPGA\_F3(X, Y)
- 4 File.write(C)
- 5 File.write(Y)

Function Queue FPGA\_F1(A, B) FPGA\_F2(B, C) F

FPGA functions are placed on the queue when called—not executed Coherent arrays used to establish

- dependence chains
- CPU functions trigger the entire chain that produces the requested value

Application code: FPGA\_F1(A, B) FPGA\_F2(B, C) P1 FPGA F1(A, B) 1 2 FPGA F2(B, C)FPGA F3(X, Y) 3 4 File.write(C) File.write(Y) 5

**Function Queue** 

FPGA functions are placed on the queue when called—*not executed* 

- Coherent arrays used to establish
- dependence chains



Application code:

- 1 FPGA\_F1(A, B)
- 2 FPGA\_F2(B, C)
- 3 FPGA\_F3(X, Y)
- 4 File.write(C)
- 5 File.write(Y)

Function Queue

FPGA\_F1(A, B) FPGA\_F2(B, C) FPGA\_F3(X, Y) P2

- FPGA functions are placed on the queue when called—*not executed* Coherent arrays used to establish dependence chains
  - CPU functions trigger the entire chain that produces the requested value

Application code:

- 1 FPGA\_F1(A, B)
- 2 FPGA\_F2(B, C)
- 3 FPGA\_F3(X, Y)
- 4 File.write(C
- 5 File.write(Y)

Function Queue

FPGA\_F1(A, B) FPGA\_F2(B, C) P1 FPGA\_F3(X, Y) P2

- FPGA functions are placed on the queue when called—*not executed* Coherent arrays used to establish dependence chains
- CPU functions trigger the entire chain that produces the requested value



# **Dynafuse Challenges**

- Information for locality exploitation:
  - (1)
    - Which device memory has the most recent copy of the data?
- For pipeline fusion:



Which future functions need the output of the current function?



Can the desired pipeline be created at runtime?

# **Fusion Network**

- All-to-all network
- Muxes configured by function queue



#### Experiments

- Created 8 pipelined functions in VHDL
- Wrote 2 SW applications in C++

Image Segmentation 1080p:

- 1 RGB to HSV Conversion
- 2 HSV Threshold Filter
- 3 Morphological Erosion
- 4 Morphological Dilation
- 5 Sobel Edge Detection

FFT Averaging Filter:

- 1 Fast Fourier Transform (FFT)
- 2 Frequency-domain Average
- 3 Inverse FFT
- Altera Stratix III E260 on Gidel ProcStar III with PCIe x8

# **Dynafuse Results**

#### Image segmentation:

| Activity                 | Execution Time (s) | Speedup |  |
|--------------------------|--------------------|---------|--|
| No optimizations         | 0.2                | -       |  |
| w/ Locality Exploitation | 0.12               | 1.6x    |  |
| w/ Pipeline Fusion       | 0.04               | 5x      |  |

#### FFT Averaging Filter:

| Activity                 | Execution Time (s) | Speedup |  |
|--------------------------|--------------------|---------|--|
| No optimizations         | 0.0042             | -       |  |
| w/ Locality Exploitation | 0.0024             | 1.75x   |  |
| w/ Pipeline Fusion       | 0.0014             | 3x      |  |

Fusion network overhead < 1% total area</li>

#### Conclusions

- Optimizations would benefit HLS

   Require dependence information
- Transparent, dynamic approach
- Pipeline fusion: Nx speedup for N functions
- Locality exploitation: save 2N-2 transactions
- Future: dynamic fission study

Thanks!

# **Any Questions?**

# **Sample Application**

Goal: Apply Locality Exploitation and Pipeline Fusion Transparently

Regular code:

int\* A = (int\*)malloc(500);
 int\* B = (int\*)malloc(500);
 int\* C = (int\*)malloc(500);
 for(i = 0; i < 500; i++) A[i] = i;</li>
 FPGA\_F1(A, B);
 Other\_Functions();
 FPGA\_F2(B, C);
 File write (O);

8 File.write(C);

# **Sample Application**

# Goal: Apply Locality Exploitation and Pipeline Fusion Transparently

| Regular code: |                                    | Dynafuse code: |                   |                                             |
|---------------|------------------------------------|----------------|-------------------|---------------------------------------------|
| J             |                                    | 1              | Dynafuse context; |                                             |
| 1             | int* A = (int*)malloc(500);        |                | 2                 | CoherentArray <int> A (500, context);</int> |
| 2             | int* B = (int*)malloc(500);        |                | 3                 | CoherentArray <int> B (500, context);</int> |
| 3             | int* C = (int*)malloc(500);        |                | 4                 | CoherentArray <int> C (500, context);</int> |
| 4             | for(i = 0; i < 500; i++) A[i] = i; |                | 5                 | for(i = 0; i < 500; i++) A[i] = i;          |
| 5             | FPGA_F1(A, B);                     |                | 6                 | FPGA_F1(A, B, context);                     |
| 6             | Other_Functions();                 |                | 7                 | Other Functions();                          |
| 7             | FPGA_F2(B, C);                     |                | 8                 | FPGA_F2(B, C, context);                     |
| 8             | File.write(C);                     |                | 9                 | File.write(C);                              |
|               |                                    |                |                   |                                             |

## **Dynamic Fusion**

 Execute an arbitrary number of arbitrarily configured functions



## **Dynamic Fusion**

 Execute an arbitrary number of arbitrarily configured functions

