

# CAD Algorithms and Performance of Malibu: An FPGA with Time-Multiplexed Coarse-Grained Elements

# **David Grant**

Supervisor: Dr. Guy Lemieux

April 8, 2011



- Growing Industry Trend: Large FPGA Circuits
  - Software to hardware to FPGA
  - Often from C-to-Hardware or system generators
    - ex. molecular dynamics, rendering, nuclear simulation
  - → Circuits are:
    - Word-oriented
    - Very large, millions of gates



- Problems with FPGAs
  - → Synthesis time
    - CAD runtime can take hours or days
  - Density
    - FPGAs have a fixed capacity
    - A large circuit may not fit
- Not A Problem
  - → Circuit speed
    - FPGAs are fast
    - Preserve as much as possible



#### • Benchmark: chem

#### Inputs and register outputs





#### Benchmark: chem

#### Inputs and register outputs





- Solution
  - Preserve the coarse-grained features of the circuit
    - Fast CAD tools
  - Divide up the circuit and run it on an array of processors
    - Improved density because of time multiplexing
  - Create coarse-grained-aware CAD tools



*"To investigate mapping coarse-grained circuits onto an FPGA-like architecture comprised of fine-grained and time-multiplexed, coarse-grained resources"* 

- Objective: Find a better way to map coarse-grained circuits onto reconfigurable devices vs. existing commercial tools and approaches
  - → Faster synthesis
  - Increased density
  - Limited decrease in circuit performance
  - → Focus on CAD, not architecture or synthesis



### 1. Malibu: A hybrid coarse-grained / fine-grained architecture

- Time-multiplexed coarse-grained resources
  - Improves density
  - Enables faster CAD tools
- → Fine-grained resources
  - Improves circuit speed for some circuits

### 2. M-CAD: An FPGA-CAD based tool flow

- Traditional FPGA CAD + support for coarse-grained resources
- → For coarse-grained circuits vs. QuartusII
  - 77x faster synthesis vs. QuartusII (commercial CAD tool)
  - 10x density at 0.01x circuit speed
  - 0.99x density at 1.45x circuit speed



### 3. M-HOT: A height-oriented tool flow

- Integrated placement, routing and scheduling
- → For coarse-grained circuits vs. Quartus
  - 31x faster synthesis
  - 2.74x density at 2.71x circuit speed





# **Related Work**

| Project    | Resources   | Time-Mux  | Host<br>CPU | Global<br>Mem | Source Language |
|------------|-------------|-----------|-------------|---------------|-----------------|
| ADRES      | coarse      | alus only | vliw        | yes           | netlist         |
| Ambric     | coarse      | alus only | -           | -             | Java            |
| RaPiD      | coarse      | -         | mips        | -             | unknown         |
| MorphoSys  | coarse      | alus only | risc        | yes           | netlist         |
| PipeRench  | coarse      | alus only | sparc       | yes           | С               |
| RAW        | coarse      | coarse    | -           | -             | С               |
| SPR        | coarse+fine | coarse    | -           | -             | C-like (Macah)  |
| WaveScalar | coarse      | -         | -           | -             | С               |
| FPGA       | fine        | -         | -           | -             | VHDL/Verilog    |
| Tabula     | fine        | fine      | -           | -             | VHDL/Verilog    |
| TSFPGA     | fine        | fine      | -           | -             | netlist         |
| VEGA       | fine        | fine      | -           | -             | netlist         |
| DP-FPGA    | coarse+fine | -         | -           | -             | netlist         |
| MALIBU     | coarse+fine | coarse    | -           | -             | Verilog         |



# Overview

- Motivation
- Malibu Architecture
- Front-End Synthesis
- M-CAD tool flow
- M-HOT tool flow
- Conclusions





# Overview

- Motivation
- Malibu Architecture
- Front-End Synthesis
- M-CAD tool flow
- M-HOT tool flow
- Conclusions









- Add Coarse-Grained inputs and outputs
  - → CGOs
    - Into FG
  - → CGIs
    - Out of FG





• Add an ALU and register file





- Add coarse-grained communication
- User clock
  - → Target 20-200 MHz
- System Clock
  - → 1 GHz
- Time Mux Instructions
  - → SL instructions
  - → User clock ≤ 1GHz / SL





| Parame        | eter Va                      | alues   | additional ouptuts switch<br>input connection block block          |
|---------------|------------------------------|---------|--------------------------------------------------------------------|
|               | Speed                        | Density |                                                                    |
| CW            | 120                          | 120     |                                                                    |
| N             | 10                           | 10 —    | 4x1 additional inputs                                              |
| N_cgi         | 16                           | 16 📉    |                                                                    |
| N_cgo         | 4                            | 4       |                                                                    |
| NSEW_len      | 16                           | 32      |                                                                    |
| R_len         | 128                          | 256     | 1-bit wires<br>FG                                                  |
| Instr_len     | 256                          | 1024    |                                                                    |
| mini<br>→ Den | iit spee<br>mum a<br>sity: M | ed at   | 32 binbuses<br>width<br>CGO/Wf=3<br>To N,S,E,W<br>W<br>R<br>K<br>K |









()

2

NOP

ADD NO, WO -> EO



XOR NO,EO -> RO

MUX W0, R0, CGI0->E0

LUT







# Overview

- Motivation
- Malibu Architecture
- Front-End Synthesis
- M-CAD tool flow
- M-HOT tool flow
- Conclusions





### **Synthesis**

• Example





# **Front-End Synthesis**

#### • Parse

- Modified Verilator to build CDFG
  - OdinII incomplete
  - Verilator uses words
- Apply word-level optimizations
- Coarse-Grained Synthesis
  - → Map to Malibu architecture
  - Fine-Grained Synthesis
    - Synthesize FG logic to LUTs
    - → Use OdinII and ABC





# Front-End Synthesis

|                                                     |         | Circuit    | ALMs | Nodes |
|-----------------------------------------------------|---------|------------|------|-------|
| <ul> <li>Results for Front-End Synthesis</li> </ul> | CG only | me         | 5148 | 5954  |
| · Results for Front-Life Synthesis                  |         | fft16      | 6412 | 2120  |
| <ul> <li>Works well in most cases</li> </ul>        |         | chem       | 3526 | 568   |
|                                                     |         | fft8       | 2075 | 800   |
| <ul> <li>Very well for CG circuits</li> </ul>       |         | honda      | 1216 | 249   |
| <ul> <li>Some circuits have far too many</li> </ul> |         | mcm        | 1057 | 232   |
| nodes, needs optimizations                          |         | wang       | 797  | 134   |
|                                                     |         | pr         | 646  | 176   |
|                                                     | Good    | ac97_ctrl  | 1254 | 4911  |
| → CG only                                           |         | aes core   | 1154 | 3380  |
|                                                     |         | dir        | 1150 | 884   |
| <ul> <li>No fine-grained signals</li> </ul>         |         | spi        | 488  | 664   |
| → Good                                              |         | pci_master | 137  | 957   |
|                                                     | Bad     | ethernet   | 6868 | 9693  |
| <ul> <li>FG only used for control</li> </ul>        |         | wb_conmax  | 5349 | 17917 |
| → Bad                                               |         | dma        | 1714 | 18514 |
|                                                     |         | tv80       | 850  | 12186 |
| <ul> <li>Uses constructs that map</li> </ul>        |         | jpeg enc   | 791  | 4486  |
| poorly to Malibu                                    |         | systemcaes | 716  | 3043  |
|                                                     |         | des        | 298  | 4114  |
|                                                     |         | systemcdes | 237  | 1688  |



# Overview

- Motivation
- Malibu Architecture
- Front-End Synthesis
- M-CAD tool flow
- M-HOT tool flow
- Conclusions



25



- Approach based on FPGA-CAD
  - Separate clustering, placement, routing, scheduling
- Advantages
  - → Fast synthesis time
  - Can target a specific number of CLBs
  - Uses traditional FPGA-CAD algorithms

### Disadvantages

→ Lack of future information is a more of a problem in placement



- M-CAD Cluster
  - Reduce problem size for placement
  - Objectives
    - Minimize number of communications
    - Balance operations
  - → Create 2 clusters per CLB
  - → Use hMETIS (recursive bisection)
    - Connect some nodes with high weights
      - → CGO+source
      - → CGI+sink
      - → LOAD/STORE





- M-CAD Place
  - → Assign clusters to CLBs
  - Minimize critical path
  - Use VPR's timing driven placement algorithm (annealing)
    - Approx model for placement
    - Timing delay changed to an integer
      - allow mixing of routing networks
      - → CG = mh
      - → FG = ceil(mh/10)
      - → 10 = number of hops a FG signal can travel in a 1 GHz cycle





- Fine-Grained Routing
  - → FG nets passed to VPR for routing
    - Netlist must include CG objects too for criticality
    - all cg nets marked DNR
  - → Architecture is based on iFAR (n10k04l04) 65nm arch
    - Length-four wires changed to length-one, without changing timing to over-compensate for CG
  - Elmore delays and FG routing solution merged back into CDFG
- Coarse-Grained Routing
  - Horizontal-then-vertical routing strategy (O(n))
  - → Ignore congestion and conflicts



- Schedule
  - Input is netlist and criticality from placement
  - Assign operations to timeslots in each CLB
  - Based on list scheduling
    - Nodes ranked by criticality
  - Coarse-grained routing conflicts
    - Delay less-critical signal using a "hold slot"







# M-CAD Results -- Fmax





# M-CAD Results – Synthesis Time



Benchmark Circuit



# M-CAD Results -- Area





• Results (compared to Quartus II / Stratix III)

| → CG Only               | Wf=0              |  |
|-------------------------|-------------------|--|
| Synthesis Time Speedup: | 77.0x             |  |
| User Clock Speed:       | 1.45x             |  |
| Density:                | 0.69x (0.99x<br>) |  |
| → Good Only             | Wf=0              |  |
| Synthesis Time Speedup: | 36.1x             |  |
| User Clock Speed:       | 0.14x             |  |
| Density:                | 0.22x<br>(0.31x)  |  |
| → Bad Only              | VVf=0             |  |
| Synthesis Time Speedup: | 5.07x             |  |
| User Clock Speed:       | 0.05x             |  |
| Density:                | 0.095x            |  |

10x = 10 times better than the $Quartus result}$ 1x = same as Quartus0.1x = 1/10<sup>th</sup> the Quartusresult (10 times worse)



• Results (compared to Quartus II / Stratix III)

| <ul> <li>→ CG Only</li> <li>Synthesis Time Speedup:</li> <li>User Clock Speed:</li> <li>Density:</li> </ul>   | Wf=0<br>77.0x<br>1.45x<br>0.69x (0.99x<br>) |                                  | 10x = 10 times better than the<br>Quartus result<br>1x = same as Quartus<br>0.1x = 1/10 <sup>th</sup> the Quartus<br>result (10 times worse) |
|---------------------------------------------------------------------------------------------------------------|---------------------------------------------|----------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|
| <ul> <li>→ Good Only</li> <li>Synthesis Time Speedup:</li> <li>User Clock Speed:</li> <li>Density:</li> </ul> | Wf=0<br>36.1x<br>0.14x<br>0.22x<br>(0.31x)  | Wf=1<br>15.5x<br>0.18x<br>0.36x  |                                                                                                                                              |
| → Bad Only<br>Synthesis Time Speedup:<br>User Clock Speed:<br>Density:                                        | 0.05x                                       | Wf=1<br>0.56x<br>0.06x<br>0.093x | 36                                                                                                                                           |



### Overview

- Introduction
- Malibu Architecture
- Front-End Synthesis
- M-CAD tool flow
- M-HOT tool flow
- Conclusions





- Based on a modulo graph scheduler from Park et al.
  - Integrated placement, routing, scheduling
  - Divides problem into heights and anneals each height
- Advantages
  - Better quality than separate flows
  - Can target any number of CLBs

#### Disadvantages

→ Slower, especially when ALAP tree is small or imbalanced



## M-HOT





# M-HOT

• M-HOT Place, Route, Schedule each height



40



 $\rightarrow$  CG Only

10x = 10 times better than

• Results (compared to Quartus II / Stratix III)

| Synthesis Time Speedup:<br>User Clock Speed:<br>Density: | CAD Wf=0<br>77.0x<br>1.45x<br>0.69x<br>(0.99x) |          | Quartus result<br>1x = same as Quartus<br>$0.1x = 1/10^{th}$ the Quartus<br>result (10 times worse) |
|----------------------------------------------------------|------------------------------------------------|----------|-----------------------------------------------------------------------------------------------------|
| → Good Only                                              | CAD Wf=0                                       | CAD Wf=1 |                                                                                                     |
| Synthesis Time Speedup:                                  | 36.1x                                          | 15.5x    |                                                                                                     |
| User Clock Speed:                                        | 0.14x                                          | 0.18x    |                                                                                                     |
| Density:                                                 | 0.22x<br>(0.31x)                               | 0.36x    |                                                                                                     |
| → Bad Only                                               | CAD Wf=0                                       | CAD Wf=1 |                                                                                                     |
| Synthesis Time Speedup:                                  | 5.07x                                          | 0.56x    |                                                                                                     |
| User Clock Speed:                                        | 0.05x                                          | 0.06x    |                                                                                                     |
|                                                          |                                                |          |                                                                                                     |
| Density:                                                 | 0.095x                                         | 0.093x   | 41                                                                                                  |



Results (compared to Quartus II / Stratix III) •

| → CG Only                                                | CAD Wf=0                           | HOT Wf=0                | 10x = 10 times better than                                                                                 |
|----------------------------------------------------------|------------------------------------|-------------------------|------------------------------------------------------------------------------------------------------------|
| Synthesis Time Speedup:<br>User Clock Speed:<br>Density: | 77.0x<br>1.45x<br>0.69x<br>(0.99x) | 30.9x<br>2.71x<br>2.74x | Quartus result<br>1x = same as Quartus<br>0.1x = 1/10 <sup>th</sup> the Quartus<br>result (10 times worse) |
| → Good Only                                              | CAD Wf=0                           | CAD Wf=1                |                                                                                                            |
| Synthesis Time Speedup:                                  | 36.1x                              | 15.5x                   |                                                                                                            |
| User Clock Speed:                                        | 0.14x                              | 0.18x                   |                                                                                                            |
| Density:                                                 | 0.22x<br>(0.31x)                   | 0.36x                   |                                                                                                            |
|                                                          |                                    |                         |                                                                                                            |
| → Bad Only                                               | CAD Wf=0                           | CAD Wf=1                |                                                                                                            |
| Synthesis Time Speedup:                                  | 5.07x                              | 0.56x                   |                                                                                                            |
| User Clock Speed:                                        | 0.05x                              | 0.06x                   |                                                                                                            |
| Density:                                                 | 0.095x                             | 0.093x                  | 42                                                                                                         |



• Results (compared to Quartus II / Stratix III)

| → CG Only               | CAD Wf=0 | HOT Wf=0 |          | times better than artus result |
|-------------------------|----------|----------|----------|--------------------------------|
| Synthesis Time Speedup: | 77.0x    | 30.9x    |          | ne as Quartus                  |
| User Clock Speed:       | 1.45x    | 2.71x    |          | O <sup>th</sup> the Quartus    |
| Density:                | 0.69x    | 2.74x    | resi     | ult (10 times worse)           |
|                         | (0.99x)  |          |          |                                |
| → Good Only             | CAD Wf=0 | CAD Wf=1 | HOT Wf=0 |                                |
| Synthesis Time Speedup: | 36.1x    | 15.5x    | 7.62x    |                                |
| User Clock Speed:       | 0.14x    | 0.18x    | 0.19x    |                                |
| Density:                | 0.22x    | 0.36x    | 0.40x    |                                |
|                         | (0.31x)  |          | (0.65x)  |                                |
|                         |          |          |          |                                |
| → Bad Only              | CAD Wf=0 | CAD Wf=1 | HOT Wf=0 |                                |
| Synthesis Time Speedup: | 5.07x    | 0.56x    | 0.27x    |                                |
| User Clock Speed:       | 0.05x    | 0.06x    | 0.10x    |                                |
| Density:                | 0.095x   | 0.093x   | 0.51x    | 43-                            |
|                         |          |          |          |                                |



• Results (compared to Quartus II / Stratix III)

| → CG Only               | CAD Wf=0 | HOT Wf=0 |          | 10x = 10 times better than                                      |    |  |
|-------------------------|----------|----------|----------|-----------------------------------------------------------------|----|--|
| Synthesis Time Speedup: | 77.0x    | 30.9x    |          | Quartus result<br>1x = same as Quartus                          |    |  |
| User Clock Speed:       | 1.45x    | 2.71x    |          | 0.1x = 1/10 <sup>th</sup> the Quartus<br>result (10 times worse |    |  |
| Density:                | 0.69x    | 2.74x    | resi     |                                                                 |    |  |
|                         | (0.99x)  |          |          |                                                                 |    |  |
| → Good Only             | CAD Wf=0 | CAD Wf=1 | HOT Wf=0 | HOT Wf=1                                                        |    |  |
| Synthesis Time Speedup: | 36.1x    | 15.5x    | 7.62x    | 20.9x                                                           |    |  |
| User Clock Speed:       | 0.14x    | 0.18x    | 0.19x    | 0.27x                                                           |    |  |
| Density:                | 0.22x    | 0.36x    | 0.40x    | 0.41x                                                           |    |  |
|                         | (0.31x)  |          | (0.65x)  |                                                                 |    |  |
|                         |          |          |          |                                                                 |    |  |
| → Bad Only              | CAD Wf=0 | CAD Wf=1 | HOT Wf=0 | HOT Wf=1                                                        |    |  |
| Synthesis Time Speedup: | 5.07x    | 0.56x    | 0.27x    | 1.06x                                                           |    |  |
| User Clock Speed:       | 0.05x    | 0.06x    | 0.10x    | 0.14x                                                           |    |  |
| Density:                | 0.095x   | 0.093x   | 0.51x    | 0.22x                                                           | 44 |  |
|                         |          |          |          |                                                                 |    |  |



#### Overview

- Motivation
- Malibu Architecture
- Front-End Synthesis
- M-CAD tool flow
- M-HOT tool flow
- Conclusions





- Achieved Thesis Objective
  - Fast synthesis for coarse-grained circuits
  - Improved density (by trading circuit speed)
  - Improved speed (by trading density)
- New Architecture: Malibu
  - Add time-multiplexed coarse-grained resources to an FPGA
- M-CAD
  - → Fast
  - Placement not ideal with time-multiplexing due to no information sharing
- M-HOT
  - → Better circuit speed, better density
  - Slower synthesis



- Improve M-CAD and M-HOT (focus of thesis)
  - At best 2x-3x performance improvement might exist
    - Assumes routing delays = 0, perfect placement, infinite resources
  - → Focus on placement
- Improvements to Architecture and Front-End Synthesis
  - → Architecture: FG/CG interface can better support signal mixing
  - Architecture: Multiple ALUs, minimize memory ports, better memory technology
  - → Synthesis: Parallel cases, better logic optimizations
  - → Synthesis: More architecture aware



## **Publications**

- D. Grant and G. G. Lemieux. A spatial computing architecture for implementing computational circuits. In *Proc. MNRC*, pages 41-44, Oct. 2008.
  - Malibu Architecture (coarse grained only)
- D. Grant, G. Smecher, G. G. Lemieux, and R. Francis. Rapid synthesis and simulation of computational circuits in an MPPA. In *Proc. FPT*, pages 151-158, Dec. 2009.
  - M-CAD CAD tool flow (coarse grained only)
- D. Grant, G. Smecher, G. G. Lemieux, and R. Francis. Rapid synthesis and simulation of computational circuits in an MPPA. To Appear In *Journal of Signal Process Systems*, 15 pages, 2010.
  - → M-CAD experimentation, max. theoretical clock speed, CLB area calculation
  - D. Grant, C. Wang, G. G. Lemieux. A CAD Framework for MALIBU: An FPGA with Timemultiplexed Coarse-Grained Elements. To Appear In *Proc. Field-Programmable Gate Arrays (FPGA)*, 10 pages, 2011.
    - → Arch and M-CAD fine-grained
    - → М-НОТ





# **Bad Constructs**

- A bus to store individual bits
  - → wire [3:0] flags;
  - Then using each flag as though it was a separate wire
  - Generates shift/mask for both reads and writes





- Creating multi-bit operations manually instead of using a high-level operator
  - → Use +,-,\*,<<,>>, ...
  - Don't build an adder manually





- A "case" LUT
  - → case (in)
    - 8'b0000000: out <= 4'b0000;
    - 8'b000001: out <= 4'b1010;
    - ...
  - Should be:
    - wire [3:0] lut [7:0];
    - Initial values in an initial block



- Series of case statements in an FSM
  - Need support for synopsys full parallel case
  - → ex. tv80 ~190 states in a sequence





<mark>54</mark> –