

# GROK-LAB: Generating Real On-chip Knowledge for Intra-cluster Delays Using Timing Extraction

Benjamin Gojman, Sirisha Nalmela, Nikil Mehta, Nicholas Howarth, André DeHon Talk: bgojman@seas.upenn.edu







February 12, 2013



Can we identify random variation in today's FPGAs?



Can we identify random variation in today's FPGAs?

 Measure 1,000 nearly identical paths through 7 LUTs in one LAB (Altera Cyclone III 65nm FPGA)





Can we identify random variation in today's FPGAs?

- Measure 1,000 nearly identical paths through 7 LUTs in one LAB (Altera Cyclone III 65nm FPGA)
- No correlation between CAD and measurements





Can we identify random variation in today's FPGAs?

- Measure 1,000 nearly identical paths through 7 LUTs in one LAB (Altera Cyclone III 65nm FPGA)
- No correlation between CAD and measurements



- · CAD tools unaware of actual distribution
- DVS, etc. cannot take advantage of large distribution





#### Benefit of knowing variation in FPGA



Benefit of knowing variation in FPGA  $\rightarrow$  Perform component specific mapping

Mehta's FPGA'12 limit study

- Eliminate delay margins induced by variation
- Lower energy per operation (1.42-1.98×)
- Benefits increase with technology scaling





Component specific mapping

- Leverages variation
- Requires fine-grain delay information

- Extracts fine-grain delays (LUT-level magnitude)
- Uses only resources available in the FPGA
- Provides detailed process variation characterization

#### Outline



#### Motivation

#### **Timing Extraction**

What can we measure? How do we measure it? How do we process the measurements?

#### Results

Measurement Confidence Inter-LUT Delays

#### **Future Work**

#### Conclusion



- · Measure delay of multiple paths
- Decompose delays into individual components



- · Measure delay of multiple paths
- Decompose delays into individual components





- · Measure delay of multiple paths
- Decompose delays into individual components





- · Measure delay of multiple paths
- Decompose delays into individual components





- · Measure delay of multiple paths
- Decompose delays into individual components



$$A + B = 5$$
$$B + C = 4$$
$$C + A = 3$$



- Measure delay of multiple paths
- Decompose delays into individual components



- A = 2A + B = 5 $B + C = 4 \Rightarrow$ B = 3
- C + A = 3

C = 1



- · Measure delay of multiple paths
- Decompose delays into individual components



· Must carefully define what forms a component

# Grounding

- Timing Extraction generally applicable to FPGAs
  - Registers
  - Configurable clocks

Grounding on Altera Cyclone III Cluster: Logic Array Block (LAB)

- 16 Logic Elements (LE) per LAB
- 16 Internal interconnect wires
  - Local LE to LE connections
- External input wires
  - Not measured







#### FPGA'13 GROK-LAB







Logic Element

- 4-LUT (A, B, C, D)
- Optional Output Register







Internal Interconnect

- Connects output of LE to input of other LEs
- 50% depopulated connections



#### Logic Element

- 4-LUT (A, B, C, D)
- Optional Output Register





#### Measurable Components



· Cannot measure individual components



#### Measurable Components

- · Cannot measure individual components
- E.g., Measure crosspoint



- · Cannot measure individual components
- E.g., Measure crosspoint
- Cannot differentiate from other components



Measure groups of components



- · Can measure path delays between two registers
- Test for signal propagation between registers
- · Sweep frequency to test for delay between registers
  - Delay includes registers





- · Can measure path delays between two registers
- Test for signal propagation between registers
- · Sweep frequency to test for delay between registers
  - Delay includes registers





- · Can measure path delays between two registers
- Test for signal propagation between registers
- · Sweep frequency to test for delay between registers
  - Delay includes registers





- · Can measure path delays between two registers
- Test for signal propagation between registers
- · Sweep frequency to test for delay between registers
  - Delay includes registers





- · Can measure path delays between two registers
- Test for signal propagation between registers
- · Sweep frequency to test for delay between registers
  - Delay includes registers



- · Measured paths must start and end at registers
- Components must be grouped into measurable units

- · Measured paths must start and end at registers
- Components must be grouped into measurable units
- 3 Types of Logical Components



- · Measured paths must start and end at registers
- Components must be grouped into measurable units
- 3 Types of Logical Components



- · Measured paths must start and end at registers
- Components must be grouped into measurable units
- 3 Types of Logical Components



- · Measured paths must start and end at registers
- Components must be grouped into measurable units
- 3 Types of Logical Components



 Paths begin at Start Node, go through some Mid Nodes and stop at End Node





 $\begin{array}{c} 16 \times 15 \times 2 \\ 480 \text{ Nodes} \end{array}$ 





 $16 \times 15 \times 2$ 480 Nodes

 $\begin{array}{c} 16 \times 15 \times 2 \\ 480 \text{ Nodes} \end{array}$ 





| 16 	imes 15 	imes 2 | 16 	imes 15 	imes 2 | 16 Nodes |
|---------------------|---------------------|----------|
| 480 Nodes           | 480 Nodes           |          |

480 + 480 + 16 = 976 LC Nodes


- 976 Nodes in one LAB  $\rightarrow$  measure at least 976 paths
- Represent as path-node matrix





- 976 Nodes in one LAB  $\rightarrow$  measure at least 976 paths
- · Represent as path-node matrix
- Augment with path delay and solve





- 976 Nodes in one LAB  $\rightarrow$  measure at least 976 paths
- · Represent as path-node matrix
- Augment with path delay and solve



- $\approx 10^{18}$  paths in one LAB.
- Complete path-node Matrix  $976 \times 10^{18}$



- 976 Nodes in one LAB  $\rightarrow$  measure at least 976 paths
- Represent as path-node matrix
- · Augment with path delay and solve



- $\approx 10^{18}$  paths in one LAB.
- Complete path-node Matrix  $976 \times 10^{18}$
- Complete path-node Matrix has rank of only 960
  - Cannot solve for all LC Nodes  $\rightarrow$  Short by 16
  - Generally true when using Launch-Capture technique

# **Reconsider Measurable Component**

- Cannot solve for LC Nodes
- Redefine measurable component

#### **Discrete Unit of Knowledge (DUK)** Small linear combination of LC Nodes





# **Reconsider Measurable Component**

- Cannot solve for LC Nodes
- Redefine measurable component

### **Discrete Unit of Knowledge (DUK)** Small linear combination of LC Nodes









# **Reconsider Measurable Component**

- Cannot solve for LC Nodes
- Redefine measurable component

#### **Discrete Unit of Knowledge (DUK)** Small linear combination of LC Nodes























































































| Start            | Mid                       | Mid       |
|------------------|---------------------------|-----------|
| Node             | Node                      | Node      |
| $\mathbf{S}_1$ - | $\vdash$ M <sub>1</sub> - | $+$ $M_2$ |





| Start            | Mid                       | Mid                | Mid                     |
|------------------|---------------------------|--------------------|-------------------------|
| Node             | Node                      | Node               | Node                    |
| $\mathbf{S}_1$ - | $\vdash$ M <sub>1</sub> – | ⊢ M <sub>2</sub> - | $\vdash$ M <sub>3</sub> |





#### bgojman@seas.upenn.edu





$$(S_1 + E_1)$$













• Paths begin at M-DUK, go through zero or more C-DUKs

## Matrix Representation – DUKs

- 960 DUKs  $\rightarrow$  measure at least 960 paths.
- Represent as path-DUK matrix



## Matrix Representation – DUKs

- 960 DUKs  $\rightarrow$  measure at least 960 paths.
- Represent as path-DUK matrix
- Augment with path delay and solve



# Matrix Representation – DUKs

- 960 DUKs  $\rightarrow$  measure at least 960 paths.
- Represent as path-DUK matrix
- · Augment with path delay and solve



- $\approx 10^{18}$  paths in one LAB.
- Complete path-DUK Matrix  $960 \times 10^{18}$
- Complete path-DUK Matrix has rank of 960
  - Can solve for all DUKs!

- DUK delays computed from linear combinations of path delays
- Fewer linear combinations  $\Rightarrow$  smaller rounding error

- DUK delays computed from linear combinations of path delays
- Fewer linear combinations  $\Rightarrow$  smaller rounding error

#### C-DUKs



Difference of two paths, both have same prefix.

**1** Ends with Nodes  $_{i}M_{j} + _{j}E$ 

2 Ends with Node  $_i E$ 

$$(\mathbf{A} + {}_{i}\mathbf{M}_{j} + {}_{j}\mathbf{E}) - (\mathbf{A} + {}_{i}\mathbf{E}) = {}_{i}\mathbf{C}\mathbf{D}_{j}$$

- DUK delays computed from linear combinations of path delays
- Fewer linear combinations  $\Rightarrow$  smaller rounding error

#### M-DUKs



Cannot measure directly (PLL Limit) Sum of two paths, minus a third.

1 
$$A + {}_{I}M_{j} + {}_{j}E$$
  
2  ${}_{i}S_{j} + {}_{j}M_{k} + B$   
3  $A + {}_{I}M_{j} + {}_{i}M_{k} + B$ 

 $(A + {}_{I}M_{j} + {}_{j}E) + ({}_{i}S_{j} + {}_{j}M_{k} + B) - (A + {}_{I}M_{j} + {}_{j}M_{k} + B) = {}_{i}MD_{j}$ 

- DUK delays computed from linear combinations of path delays
- Fewer linear combinations  $\Rightarrow$  smaller rounding error
- 2 paths per C-DUK
- 3 paths per M-DUK
- Total:  $2 \times 480 + 3 \times 480 = 2,400$  paths





Performed Timing Extraction on LABs

- 18 Altera Cyclone III FPGAs, 65nm
- One Altera Cyclone IV FPGA with V<sub>dd</sub> control, 60nm
- Measured 2,400 paths per LAB, out of  $\approx 10^{18}$



## Structured and Systematic Measurements



Controlled placement of Block Under Test



### Structured and Systematic Measurements

- · Controlled placement of Block Under Test
- Isolation of control logic and measurement logic


### Structured and Systematic Measurements

- Controlled placement of Block Under Test
- Isolation of control logic and measurement logic
- Individual LAB Measurements



### Structured and Systematic Measurements

- Controlled placement of Block Under Test
- Isolation of control logic and measurement logic
- Individual LAB Measurements
- Consistent control logic between tests



## Structured and Systematic Measurements

- Controlled placement of Block Under Test
- Isolation of control logic and measurement logic
- Individual LAB Measurements
- Consistent control logic between tests
- Temperature controlled environment



### Measurement Confidence





1.6 ps clock granularity

### **Results: Path Delays**





## **DUK Delays**





### LUT to LUT Delay Map





C-DUK delay (ps) A & B LUT Inputs LAB (27,22)

#### bgojman@seas.upenn.edu

#### FPGA'13 GROK-LAB

# LUT to LUT Delay Map



|        |          |     | Start LE |     |     |     |     |     |     |     |     |     |     |           |     |     |     |
|--------|----------|-----|----------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----------|-----|-----|-----|
|        |          | 1   | 2        | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  | 12  | <b>13</b> | 14  | 15  | 16  |
| End LE | 1        |     | 409      | 384 | 381 | 377 | 389 | 390 | 393 | 393 | 379 | 392 | 395 | 379       | 403 | 382 | 398 |
|        | <b>2</b> | 400 |          | 433 | 434 | 403 | 414 | 412 | 420 | 422 | 403 | 417 | 418 | 404       | 422 | 409 | 401 |
|        | 3        | 378 | 414      |     | 409 | 376 | 392 | 395 | 393 | 398 | 382 | 396 | 395 | 379       | 400 | 387 | 376 |
|        | 4        | 418 | 417      | 417 |     | 426 | 434 | 423 | 417 | 420 | 404 | 414 | 431 | 408       | 434 | 409 | 416 |
|        | 5        | 385 | 396      | 395 | 419 |     | 422 | 397 | 401 | 404 | 387 | 399 | 404 | 388       | 407 | 393 | 382 |
|        | 6        | 415 | 422      | 423 | 422 | 404 |     | 453 | 442 | 423 | 406 | 415 | 430 | 411       | 434 | 409 | 415 |
|        | 7        | 392 | 388      | 393 | 395 | 373 | 412 |     | 409 | 390 | 373 | 385 | 398 | 377       | 403 | 377 | 384 |
|        | 8        | 396 | 407      | 400 | 401 | 396 | 406 | 409 |     | 434 | 422 | 409 | 412 | 396       | 417 | 400 | 390 |
|        | 9        | 376 | 387      | 387 | 382 | 376 | 387 | 390 | 409 |     | 392 | 384 | 389 | 370       | 396 | 376 | 368 |
|        | 10       | 407 | 419      | 417 | 412 | 411 | 425 | 419 | 423 | 422 |     | 449 | 445 | 414       | 423 | 409 | 403 |
|        | 11       | 379 | 390      | 390 | 385 | 380 | 390 | 393 | 396 | 393 | 414 |     | 422 | 382       | 396 | 382 | 372 |
|        | 12       | 422 | 420      | 422 | 422 | 401 | 418 | 426 | 415 | 417 | 399 | 414 |     | 426       | 445 | 398 | 409 |
|        | 13       | 381 | 388      | 393 | 384 | 379 | 389 | 396 | 398 | 390 | 376 | 396 | 416 |           | 416 | 377 | 371 |
|        | 14       | 417 | 418      | 422 | 412 | 401 | 415 | 428 | 414 | 417 | 398 | 409 | 420 | 399       |     | 423 | 430 |
|        | 15       | 385 | 396      | 401 | 393 | 385 | 398 | 403 | 403 | 403 | 392 | 396 | 401 | 384       | 423 |     | 398 |
|        | 16       | 457 | 460      | 438 | 434 | 414 | 433 | 441 | 430 | 428 | 411 | 425 | 434 | 420       | 438 | 412 |     |

LAB (37,14)

LAB (27,22)

End LE

# Lowering $V_{DD}$ on Cyclone IV





Lower  $V_{DD}$  magnifies the impact of random variation

## **Future Work**



- Apply TE to global interconnect
  - Determine necessary DUKs
  - Compute which paths to measure
- Understand necessary level of measurement control
- Comprehensive variation characterization
  - Spatial
  - Systematic
  - Random





### **Timing Extraction**

- Extracts near LUT-level delays
- Uses only resources available in FPGA
- Allows for fine-grain variation characterization
- Generally applicable to modern FPGAs



