



# An Adaptive Cross-Layer Fault Recovery Solution for Reconfigurable SoCs

Jifang Jin, Jian Yan, Xuegong Zhou\* and Lingli Wang State Key Laboratory of ASIC and System, Fudan University, Shanghai, China \*zhouxg@fudan.edu.cn

#### Abstract

Due to technology scaling, reconfigurable SoCs built on SRAM-based FPGAs are becoming more susceptible to radiation and aging effects. This paper proposes an adaptive cross-layer fault recovery solution for reconfigurable SoCs based on hardware/software co-design. Through a pyramidal structure and cross-layer adaptivity, the solution addresses both hardware circuit integrity at the hardware level and normal application operation at the software level, at reduced correction cost. The experimental result shows that the proposed solution can efficiently increase the sys-

## **1. Introduction**

Due to technology scaling, SRAM-based FPGAs are becoming more susceptible to radiation and aging effects, especially single event upsets (SEUs). Achieving high reliability for reconfigurable SoCs at runtime still poses several challenges:

1) Ensuring both hardware circuit integrity at the hardware level and normal application operation at the software level.

2) Combining suitable fault recovery techniques to optimize correction cost, including latency and resource overhead.

3) Providing noninvasive fault recovery methods and shielding the recovery implementation details.



Based on the fault distribution and the features of reconfigurable SoCs, we propose a cross-layer fault recovery solution with three layers:

1) Configuration Layer: The Fault Detection and Correction Controller is implemented as a soft IP, which periodically scans the configuration memory to detect, locate and fix SEUs.

2) Management Layer: The Fault Recovery Engine is a background thread within the operating system, custom-built to deal with SEU-caused effects and other exceptions in CPU/RPU applications.

3) External Layer: The External CPU Watchdog monitors CPU operation and restores the system to the latest checkpoint from an external radiation-hardened memory.
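The periodic scan performed at the Configuration Layer can be illustrated with a small software model. This is only a sketch under stated assumptions, not the controller described here: the per-frame CRC32 check, the golden copy used for repair, and the names `make_golden`, `scrub` and `flip_bit` are all hypothetical, and the actual soft IP may detect and correct upsets by a different mechanism.

```python
import zlib


def make_golden(frames):
    """Record a CRC32 per frame of the known-good configuration (assumed check)."""
    return [zlib.crc32(frame) for frame in frames]


def scrub(config_memory, golden_frames, golden_crcs):
    """Run one scan pass over the modeled configuration memory.

    Each frame is checked against its stored CRC (detect), its index is
    noted (locate), and it is rewritten from the golden copy (fix).
    Returns the indices of the frames that were repaired.
    """
    repaired = []
    for idx, frame in enumerate(config_memory):
        if zlib.crc32(frame) != golden_crcs[idx]:
            config_memory[idx] = golden_frames[idx]
            repaired.append(idx)
    return repaired


def flip_bit(frame, bit):
    """Emulate a single-bit upset by flipping one bit of a frame."""
    data = bytearray(frame)
    data[bit // 8] ^= 1 << (bit % 8)
    return bytes(data)


# Hypothetical memory: four 16-byte frames, with one upset injected in frame 2.
golden_frames = [bytes([i] * 16) for i in range(4)]
golden_crcs = make_golden(golden_frames)
config_memory = list(golden_frames)
config_memory[2] = flip_bit(config_memory[2], 5)

repaired = scrub(config_memory, golden_frames, golden_crcs)
print("repaired frames:", repaired)
```

In this model a second call to `scrub` finds nothing to repair, which mirrors the idea that the scan runs periodically in the background and only acts when a frame has been corrupted.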

## **3. The Fault Recovery Solution Implementation**



| Recovery API              | API Description                                                |
|---------------------------|----------------------------------------------------------------|
| SL_FixMBU                 | Recover MBU by scrubbing the frame data                        |
| SL_PartialReconfiguration | Reconfigure the RPU via the High Speed Configuration Controller |
| SL_HardwareTaskReset      | Restore hardware task to latest checkpoint                     |
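One way a recovery engine might combine the three APIs in the table is to try them in order and stop as soon as the fault clears. The sketch below is illustrative only: the API names come from the table, but their signatures, the `rpu` dictionary, the fault labels (`"mbu"`, `"logic"`, `"state"`) and the escalation order are assumptions and are not taken from the implementation described here.

```python
# Stub recovery actions. Real versions would drive the hardware controllers;
# here each one only clears the fault kinds it is assumed to be able to fix.

def SL_FixMBU(rpu):
    """Scrub the frame data (assumed to fix only 'mbu' faults)."""
    if rpu["fault"] == "mbu":
        rpu["fault"] = None


def SL_PartialReconfiguration(rpu):
    """Reconfigure the RPU (assumed to fix 'mbu' and 'logic' faults)."""
    if rpu["fault"] in ("mbu", "logic"):
        rpu["fault"] = None


def SL_HardwareTaskReset(rpu):
    """Restore the hardware task to its latest checkpoint (assumed to fix any fault)."""
    rpu["state"] = rpu["checkpoint"]
    rpu["fault"] = None


# Assumed escalation order: earlier actions are treated as cheaper.
ESCALATION = [SL_FixMBU, SL_PartialReconfiguration, SL_HardwareTaskReset]


def recover(rpu):
    """Apply recovery actions in order until the RPU reports no fault.

    Returns the names of the actions that were tried.
    """
    tried = []
    for action in ESCALATION:
        if rpu["fault"] is None:
            break
        action(rpu)
        tried.append(action.__name__)
    return tried


# Hypothetical RPU with a fault that scrubbing alone cannot clear.
rpu = {"fault": "logic", "state": 7, "checkpoint": 3}
tried = recover(rpu)
print(tried)
```

Escalating in this way keeps the correction cost low for faults that an early action can fix, and falls back to the checkpoint restore only when the earlier actions leave the fault in place.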

## **4. Conclusion**

1) We present an adaptive cross-layer fault recovery solution for reconfigurable SoCs, comprising the Configuration Layer, the Management Layer and the External Layer.

2) The Fault Detection and Correction Controller and the High Speed Configuration Controller are implemented as hardware IPs, and the Fault Recovery Engine combines scrubbing, configuration data encoding, dynamic task transfer and checkpoint techniques.

3) The solution is evaluated on the Xilinx ML605 board through an SBU emulation experiment.