### Pipeline Frequency Boosting: Hiding Dual-Ported Block RAM Latency using Intentional Clock Skew



Alexander Brant alexb@ece.ubc.ca

Ameer Abdelhadi ameer@ece.ubc.ca

Aaron Severance aaronsev@ece.ubc.ca

Guy G.F. Lemieux lemieux@ece.ubc.ca

Department of Electrical and Computer Engineering

University of British Columbia

Vancouver, Canada

## **Objectives**

Hide dual-ported block RAM access latencies without additional pipeline stages or architectural changes.

## **Method**

Clock skewing is employed to effectively eliminate the read and write latency of memories, while preserving functionality and using fewer resources than conventional bypass designs.

## **Pipeline Bypassing/Forwarding**

Written data is passed forward through a bypass register, skipping any additional cycles incurred by the RAM write process.

**Fully Pipelined Bypass**

Earlier read-after-write is also bypassed.
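The forwarding idea can be illustrated with a small cycle-level model. This is only a sketch under stated assumptions: the class `BypassedRam`, its `cycle` method, and the one-cycle write latency are hypothetical choices made for illustration and are not taken from the poster's design.

```python
class BypassedRam:
    """Toy model (hypothetical): a write issued in one cycle only reaches the
    RAM array at the end of the following cycle, so a read of the same address
    in that following cycle needs forwarding to see the new value."""

    def __init__(self, depth, bypass=True):
        self.mem = [0] * depth
        self.bypass = bypass
        self.pending = None  # bypass register: (addr, data) of the in-flight write

    def cycle(self, read_addr=None, write=None):
        """Advance one clock cycle. `write` is an (addr, data) tuple or None."""
        result = None
        if read_addr is not None:
            hit = self.pending is not None and self.pending[0] == read_addr
            if self.bypass and hit:
                result = self.pending[1]      # forward from the bypass register
            else:
                result = self.mem[read_addr]  # read the (possibly stale) array
        # Commit the in-flight write to the array, then latch the new write.
        if self.pending is not None:
            addr, data = self.pending
            self.mem[addr] = data
        self.pending = write
        return result


def read_after_write(bypass):
    """Write 42 to address 3, then read address 3 in the next two cycles."""
    ram = BypassedRam(depth=8, bypass=bypass)
    ram.cycle(write=(3, 42))
    first = ram.cycle(read_addr=3)
    second = ram.cycle(read_addr=3)
    return first, second
```

With forwarding enabled, the read in the cycle right after the write already returns the new value; without it, that read returns the stale array content and only the later read is correct.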

## **Clock Skew Scheduling**

Useful (intentional) skewing deliberately offsets the clock arrival times at clocked elements, giving critical paths more time to complete.



The write port is clocked late and the read port is clocked early to give the processing logic a wider clock period. The write lags by $\Delta_{wr}$ and the read leads by $\Delta_{rd}$.

## **Timing Paths Analysis**

The clock period $T$ must satisfy a setup constraint for each of the four paths (register to register, register to RAM write port, RAM read port to register, and RAM read port to RAM write port):

$$T \ge \Delta_{ff} + t_{c \rightarrow o}(ff) + t_d(ff \rightarrow mux) + t_d(mux) + t_d(mul) + t_d(mul \rightarrow ff) + t_{su}(ff) - \Delta_{ff}$$

$$T \ge \Delta_{ff} + t_{c \rightarrow o}(ff) + t_d(ff \rightarrow mux) + t_d(mux) + t_d(mul) + t_d(mul \rightarrow ram_{wr}) + t_{su}(ram_{wr}) - \Delta_{wr}$$

$$T \ge \Delta_{rd} + t_{c \rightarrow o}(ram_{rd}) + t_d(ram_{rd} \rightarrow mux) + t_d(mux) + t_d(mul) + t_d(mul \rightarrow ff) + t_{su}(ff) - \Delta_{ff}$$

$$T \ge \Delta_{rd} + t_{c \rightarrow o}(ram_{rd}) + t_d(ram_{rd} \rightarrow mux) + t_d(mux) + t_d(mul) + t_d(mul \rightarrow ram_{wr}) + t_{su}(ram_{wr}) - \Delta_{wr}$$

For minimal clock period, $\Delta_{wr}$ and $\Delta_{rd}$ are:

$$\Delta_{wr} = t_d(mul \rightarrow ram_{wr}) + t_{su}(ram_{wr}) - t_d(mul \rightarrow ff) - t_{su}(ff) + \Delta_{ff}$$

$$\Delta_{rd} = t_{c \rightarrow o}(ff) + t_d(ff \rightarrow mux) - t_{c \rightarrow o}(ram_{rd}) - t_d(ram_{rd} \rightarrow mux) + \Delta_{ff}$$
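As a rough numerical check of the setup constraints and skew expressions, the following sketch evaluates them for one set of delays. Every delay value (in picoseconds), the dictionary keys, and the function names are hypothetical and chosen only for illustration; hold-time constraints are not modeled.

```python
# Hypothetical delays in picoseconds; not measurements from any device.
DELAYS_PS = {
    "co_ff": 100,           # t_c->o(ff)
    "ff_to_mux": 200,       # t_d(ff -> mux)
    "co_ram_rd": 1800,      # t_c->o(ram_rd)
    "ram_rd_to_mux": 300,   # t_d(ram_rd -> mux)
    "mux": 300,             # t_d(mux)
    "mul": 3000,            # t_d(mul)
    "mul_to_ff": 200,       # t_d(mul -> ff)
    "su_ff": 100,           # t_su(ff)
    "mul_to_ram_wr": 400,   # t_d(mul -> ram_wr)
    "su_ram_wr": 500,       # t_su(ram_wr)
}


def optimal_skews(d, skew_ff=0):
    """Return (skew_wr, skew_rd) from the two closed-form expressions."""
    skew_wr = (d["mul_to_ram_wr"] + d["su_ram_wr"]
               - d["mul_to_ff"] - d["su_ff"] + skew_ff)
    skew_rd = (d["co_ff"] + d["ff_to_mux"]
               - d["co_ram_rd"] - d["ram_rd_to_mux"] + skew_ff)
    return skew_wr, skew_rd


def min_period(d, skew_ff, skew_wr, skew_rd):
    """Smallest T satisfying all four setup constraints for the given skews."""
    shared = d["mux"] + d["mul"]
    launch = {
        "ff": skew_ff + d["co_ff"] + d["ff_to_mux"],
        "ram_rd": skew_rd + d["co_ram_rd"] + d["ram_rd_to_mux"],
    }
    capture = {
        "ff": d["mul_to_ff"] + d["su_ff"] - skew_ff,
        "ram_wr": d["mul_to_ram_wr"] + d["su_ram_wr"] - skew_wr,
    }
    return max(l + shared + c for l in launch.values() for c in capture.values())


skew_wr, skew_rd = optimal_skews(DELAYS_PS)
print("skews (ps):", skew_wr, skew_rd)
print("period without skew (ps):", min_period(DELAYS_PS, 0, 0, 0))
print("period with skew (ps):", min_period(DELAYS_PS, 0, skew_wr, skew_rd))
```

For these assumed numbers, the skews equalize all four constraints to the register-to-register bound, which does not depend on any skew because $\Delta_{ff}$ cancels in it. The negative read skew reflects the read port being clocked earlier than the registers.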




