# Using Round-Robin Tracepoints to Debug Multithreaded HLS Circuits on FPGAs


<u>Jeffrey Goeders</u> Steve Wilton





# What this talk is about...

How to allow designers to debug HLS circuits?

In past work we developed a debugging tool:

- Software-like debugger
- Interfaces with an HLS circuit
- Only worked for single-threaded C code

What needs to change to support multithreaded code?



## **High-level** synthesis



1. Faster development
2. <u>Software designers target FPGAs</u>

Software designers need a full ecosystem of tools

- Testing
- Debugging
- Optimization
- ....

In this talk, I am going to focus on debugging

#### **Bugs in HLS systems**

- **Kernel-level bugs:** self-contained and easy to reproduce. Debug the C code in software on a workstation (gdb).
- **RTL bugs & RTL verification:** incorrect use of HLS tools. Run C/RTL co-simulation on a workstation.
- **System-level bugs:** bugs in interfaces (I/O devices, other hardware). They depend on interaction timings, are hard to reproduce, and require long run-times. Debug on the FPGA (requires observing the internals of the FPGA). These are the bugs we are targeting.

# **General Hardware Debug**

Commercial embedded logic analyzer (ELA) tools (SignalTap II / Chipscope Pro):



Can we use these tools to debug HLS circuits? No

1. Software designers? Beyond their expertise.
2. Hardware designers? Forces them to give up the higher-level abstraction.
3. It's very hard: HLS transformations mean the RTL doesn't look like the C code.

### **Previous Work**



- Software-like debugging experience
- Circuit runs at full speed (record, then retrieve)


Only works with single-threaded C code

#### **Recording Execution: Trace Buffers**

To record live circuit execution we use trace buffers:



Trace buffers require a lot of memory, and on-chip memory is scarce

• Can't record entire circuit execution

Width x Depth tradeoff:

• The more signals we record, the shorter the execution trace
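As a rough illustration of this tradeoff, the sketch below assumes a fixed on-chip memory budget and divides it by the entry width. The 1 Mbit budget and the function name `trace_depth` are illustrative assumptions, not figures from the talk.

```python
def trace_depth(memory_bits: int, width_bits: int) -> int:
    """Number of trace entries that fit when each entry is width_bits wide."""
    if width_bits <= 0:
        raise ValueError("width_bits must be positive")
    return memory_bits // width_bits


MEMORY_BITS = 1 << 20  # hypothetical 1 Mbit trace-buffer budget

# Wider entries (more signals per cycle) leave room for fewer entries.
for width in (32, 128, 512):
    print(f"{width:4d} bits/entry -> {trace_depth(MEMORY_BITS, width)} entries")
```

Quadrupling the recorded width cuts the number of entries, and so the recorded history, by the same factor.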

# What's New? Multithreaded HLS Circuits

User provides multithreaded source code (OpenCL, pthreads)

• Hardware modules execute in parallel

This is the future of HLS!

- Exploits parallelism of FPGA
- Altera OpenCL SDK, Xilinx SDAccel

Leads to more complicated in-system bugs:

- Thread communication/synchronization
- Race conditions
- Deadlocks
- Performance issues (thread imbalance, starvation, etc.)

# **Debugging Multithreaded HLS Circuits**

Need to record long run-times to expose these bugs

- Must record fewer signals
- Can't do GDB/Eclipse-like debug
- Need a debug framework for partial observability

#### Solution: Tracepoint-based debugging

- Like a breakpoint, but instead of stopping it just adds to a log.
- One or more tracepoints per thread
- Optionally record variables at the tracepoint.







### **Tracepoint Log**

Tracepoint data is provided to the user as a timeline:

| Thread ID | Cycle # | Tracepoint  | Variables Logged |
|-----------|---------|-------------|------------------|
| 1         | 26452   | main(): 113 | mutex = 0        |
| 2         | 28591   | foo(): 41   | x = 7, y = 3     |
| 4         | 28037   | bar(): 3    | none             |

We add debug circuitry to record tracepoints into trace buffers:

- Which tracepoint ID
- Time
- Variable values
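One way to picture a recorded entry and the timeline built from it is sketched below. The `TraceEntry` layout and the helper names are assumptions made for illustration; the talk only states that the tracepoint ID, the time, and the variable values are recorded. The sample values are the rows of the table above.

```python
from dataclasses import dataclass, field


@dataclass
class TraceEntry:
    thread_id: int
    cycle: int
    tracepoint: str                      # e.g. "foo(): 41"
    variables: dict = field(default_factory=dict)


def timeline(entries):
    """Merge entries from all threads into one list ordered by cycle."""
    return sorted(entries, key=lambda e: (e.cycle, e.thread_id))


def format_entry(e):
    """Render one entry as a single timeline line."""
    vars_text = ", ".join(f"{k} = {v}" for k, v in e.variables.items()) or "none"
    return f"thread {e.thread_id} @ cycle {e.cycle}: {e.tracepoint} [{vars_text}]"


entries = [
    TraceEntry(1, 26452, "main(): 113", {"mutex": 0}),
    TraceEntry(2, 28591, "foo(): 41", {"x": 7, "y": 3}),
    TraceEntry(4, 28037, "bar(): 3"),
]

for e in timeline(entries):
    print(format_entry(e))
```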

### **Basic Tracepoint Architecture**



This is a direct extension of previous work, except only a subset of variables are recorded

Much more efficient than an ELA (Chipscope Pro, SignalTap II)

- Variable values are multiplexed
- Add buffer entries only when needed

The architecture is duplicated to every thread in the system:



Giving each thread its own trace buffer is not ideal:

- Each thread will fill its buffer at a different rate
- Hard to predict this beforehand
- Leads to wasted memory space

# **Buffer Sharing**



# **Effects of Multiple Trace Buffers**



- Case 1: 8-entry buffer per thread
- Case 2: Two 16-entry buffers
- Case 3: One 32-entry buffer

For multiple trace buffers, execution trace length is the **minimum** captured in any buffer

**Key:** More buffer sharing = longer execution traces
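The effect can be approximated with a small model. The sketch below assumes four threads that hit tracepoints at fixed, made-up intervals, and estimates how many cycles of history each buffer holds. Both the periodic-hit model and the interval values are assumptions for illustration only.

```python
def cycles_covered(capacity, intervals):
    """Approximate cycles of history one buffer holds for the threads sharing it."""
    entries_per_cycle = sum(1.0 / i for i in intervals)
    return capacity / entries_per_cycle


def trace_length(buffers):
    """buffers: list of (capacity, [intervals of the threads sharing that buffer]).

    The usable trace is limited by the buffer covering the fewest cycles.
    """
    return min(cycles_covered(cap, ivs) for cap, ivs in buffers)


intervals = [10, 20, 40, 80]  # hypothetical cycles between hits, one per thread

case1 = [(8, [i]) for i in intervals]                 # 8-entry buffer per thread
case2 = [(16, intervals[:2]), (16, intervals[2:])]    # two 16-entry buffers
case3 = [(32, intervals)]                             # one 32-entry buffer

for name, case in (("case 1", case1), ("case 2", case2), ("case 3", case3)):
    print(f"{name}: about {trace_length(case):.0f} cycles")
```

The total memory is the same 32 entries in every case; only the sharing changes. In this model the busiest thread limits case 1, while in case 3 the slower threads leave their unused space to the faster ones.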

# **Buffer Sharing**



To support buffer sharing, we need to handle:

1. Grouping threads
2. Arbitration

#### **Grouping Threads**



Want to group as aggressively as possible

• If a group encounters too many tracepoints, the memory will be overwhelmed

Need to know how often a thread will encounter tracepoints...

#### **Tracepoint Interval**

Use Dijkstra's algorithm on the CDFG

• Tracepoint Interval: minimum # of cycles between tracepoint hits
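A minimal sketch of the idea follows. It assumes the CDFG is reduced to a directed graph whose edges carry a positive cycle count; this representation, the node names, and the function name are assumptions, not the tool's actual data structures. For each tracepoint node, Dijkstra's algorithm finds the nearest tracepoint reachable from it (possibly itself, around a loop), and the interval is the smallest such distance.

```python
import heapq


def tracepoint_interval(graph, tracepoints):
    """graph: {node: [(successor, cycles), ...]} with cycles >= 1.

    Returns the minimum number of cycles between two consecutive
    tracepoint hits, or infinity if no tracepoint can follow another.
    """
    best = float("inf")
    for src in tracepoints:
        dist = {}
        heap = [(w, nxt) for nxt, w in graph.get(src, [])]
        heapq.heapify(heap)
        while heap:
            d, node = heapq.heappop(heap)
            if node in dist:
                continue
            dist[node] = d
            if node in tracepoints:
                # Nodes are popped in order of distance, so this is the nearest.
                best = min(best, d)
                break
            for nxt, w in graph.get(node, []):
                if nxt not in dist:
                    heapq.heappush(heap, (d + w, nxt))
    return best


# Hypothetical control flow: a loop with two alternative bodies.
graph = {
    "entry": [("loop", 1)],
    "loop": [("body_a", 2), ("body_b", 5)],
    "body_a": [("loop", 3)],
    "body_b": [("loop", 1), ("exit", 1)],
    "exit": [],
}

print(tracepoint_interval(graph, {"body_a", "body_b"}))
print(tracepoint_interval(graph, {"body_a"}))
```

With tracepoints in both bodies the shortest gap is `body_b` to `body_a`. With a single tracepoint in `body_a` the gap is one trip around the loop.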




#### **Grouping Threads**



Group such that: *Interval* ≥ n

Allows for simple round robin arbitration
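The grouping constraint can be sketched with a simple greedy pass. This is an illustrative heuristic, not necessarily the grouping algorithm used in the tool. It assumes *n* is the number of threads sharing one buffer, and the thread names and intervals are made up.

```python
def group_threads(intervals):
    """intervals: {thread_name: tracepoint interval in cycles}.

    Greedy grouping: visit threads from largest to smallest interval and
    add a thread to the current group only if the group size stays at or
    below that thread's interval (the smallest in the group so far).
    """
    groups = []
    current = []
    for name in sorted(intervals, key=lambda t: -intervals[t]):
        if current and len(current) + 1 > intervals[name]:
            groups.append(current)
            current = []
        current.append(name)
    if current:
        groups.append(current)
    return groups


# Hypothetical tracepoint intervals for six threads.
intervals = {"t0": 1, "t1": 2, "t2": 3, "t3": 8, "t4": 8, "t5": 8}
groups = group_threads(intervals)
print(groups)
```

Every group produced this way satisfies *Interval* ≥ n for each of its members, so each thread is guaranteed a write slot before its next tracepoint hit.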

### **Round-Robin Arbitration Circuit**



### Benefit of Buffer Sharing?

*Metric:* How much does buffer sharing increase the length of the recorded execution trace?

$$\frac{\text{\# cycles recorded with buffer sharing}}{\text{\# cycles recorded without buffer sharing}}$$

Longer trace length  $\rightarrow$  Easier for designers to find multithreaded bugs



#### 1000 trials per configuration:

- Randomly create a synthetic benchmark from CHStone
- Synthesize with LegUp
- Group threads and measure the improvement

#### 8 threads, 1 tracepoint each: 4.1X increase to trace length



#### **Issues and Limitations**

Works best with few tracepoints per thread

• More tracepoints = smaller tracepoint interval, less buffer sharing.

Buffer sharing only benefits heterogeneous threads (i.e. task-parallel)

• In a data-parallel system (e.g. OpenMP), all threads encounter tracepoints at the same rate



Task-parallel programs are more likely to need in-system debug

• Thread balancing, starvation, synchronization, etc.

#### Area Overhead

#### Tracepointing Logic

| Data width (bits) | 1 tracepoint | 2 tracepoints | 4 tracepoints | 8 tracepoints | 16 tracepoints |
|-------------------|--------------|---------------|---------------|---------------|----------------|
| 16                | 0            | 17            | 19            | 53            | 90             |
| 32                | 0            | 33            | 35            | 101           | 170            |
| 64                | 0            | 64            | 67            | 197           | 330            |

Stratix IV ALUTs, per thread

#### Round-Robin Logic

| Data width (bits) | 2 threads | 4 threads | 8 threads | 16 threads | 32 threads |
|-------------------|-----------|-----------|-----------|------------|------------|
| 16                | 29        | 20        | 22        | 21         | 23         |
| 32                | 37        | 24        | 30        | 29         | 32         |
| 64                | 53        | 32        | 46        | 45         | 49         |

Stratix IV ALUTs, per thread

#### 8 threads, with two 32-bit tracepoints each: 63 ALUTs per thread.
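The 63-ALUT figure can be reproduced from the two tables, assuming the per-thread overhead is the sum of the tracepointing logic and the round-robin logic. The dictionary layout and function name below are illustrative; the numbers are copied from the tables.

```python
# Stratix IV ALUTs per thread: {data width in bits: {# tracepoints: ALUTs}}
TRACEPOINT_ALUTS = {
    16: {1: 0, 2: 17, 4: 19, 8: 53, 16: 90},
    32: {1: 0, 2: 33, 4: 35, 8: 101, 16: 170},
    64: {1: 0, 2: 64, 4: 67, 8: 197, 16: 330},
}

# Stratix IV ALUTs per thread: {data width in bits: {# threads: ALUTs}}
ROUND_ROBIN_ALUTS = {
    16: {2: 29, 4: 20, 8: 22, 16: 21, 32: 23},
    32: {2: 37, 4: 24, 8: 30, 16: 29, 32: 32},
    64: {2: 53, 4: 32, 8: 46, 16: 45, 32: 49},
}


def overhead_per_thread(width, n_tracepoints, n_threads):
    """Per-thread ALUT overhead, assuming the two costs simply add."""
    return (TRACEPOINT_ALUTS[width][n_tracepoints]
            + ROUND_ROBIN_ALUTS[width][n_threads])


print(overhead_per_thread(32, 2, 8))  # 33 + 30
```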

#### Summary

- HLS may change the face of hardware design for FPGAs
  - But only if we have a suitable ecosystem
  - Need to find elusive bugs in multithreaded systems
- Tracepoint-based debugging architecture for multithreaded HLS circuits
  - Run in-system, at-speed
  - Familiar to software designers
- Round-robin architecture
  - Record a longer execution trace: Find bugs faster

# **Timing Considerations**

Grouped threads may not be physically adjacent

• Long routing lengths will affect timing.

Pipeline the data signals

• Subtract the pipeline depth from *time* during offline re-ordering
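A sketch of this offline correction is shown below. It assumes the timestamp is applied after the pipeline registers, so each recorded time is late by that thread's pipeline depth. The per-thread depth table, the entry layout, and the function name are assumptions for illustration.

```python
def reorder(entries, pipeline_depth):
    """entries: list of (recorded_time, thread_id, tracepoint_id).
    pipeline_depth: {thread_id: number of pipeline stages on its data path}.

    Returns entries with corrected times, sorted into a single timeline.
    """
    corrected = [
        (time - pipeline_depth[tid], tid, tp) for time, tid, tp in entries
    ]
    return sorted(corrected)


# Hypothetical log: thread 0 sits far from the buffer (3 pipeline stages).
entries = [(105, 0, "A"), (104, 1, "B"), (110, 0, "C")]
depth = {0: 3, 1: 0}

print(reorder(entries, depth))
```

In this example entry `A` was recorded after `B`, but it actually occurred first once the three pipeline stages are accounted for.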



## **Pipeline Model**

