

# A CAD Framework for MALIBU: An FPGA with Time-multiplexed Coarse-Grained Elements

# **David Grant**

Supervisor: Dr. Guy Lemieux

FPGA 2011 -- Feb 28, 2011



- Growing Industry Trend: Large FPGA Circuits
  - Often from C-to-Hardware or system generators
    - ex. molecular dynamics, rendering, nuclear simulation
  - Word-oriented
  - Millions of gates





- Problems with FPGAs
  - → CAD runtime can take hours or days
  - Fixed capacity
    - A large circuit may not fit
  - Inefficient use of resources
    - Resources sit idle most of the time



### **Motivation**

• Benchmark: chem










### **Motivation**

- Solution
  - Divide up the circuit and run it on an array of processors
    - Preserve the coarse-grained features of the circuit
  - Create coarse-grained-aware CAD tools





- Time-Multiplexed FPGAs
  - → Look-up Table (LUT)





- Time-Multiplexed FPGAs
  - → Time-multiplexed LUT
  - Multiplexer is shared
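
A minimal sketch of the time-multiplexed LUT idea above, assuming one configuration (truth table) per time slot and a single shared multiplexer that is reused every cycle. The class name `TimeMultiplexedLUT` and the two-slot example are illustrative assumptions, not taken from the slides.

```python
class TimeMultiplexedLUT:
    """One physical multiplexer shared by several LUT configurations."""

    def __init__(self, configs):
        # configs[t] is the truth table used in time slot t; a k-input LUT
        # needs 2**k entries per table.
        self.configs = [list(c) for c in configs]
        self.slot = 0

    def evaluate(self, inputs):
        # Pack the input bits into a table index, first input as the LSB.
        index = sum(bit << i for i, bit in enumerate(inputs))
        out = self.configs[self.slot][index]
        # Advance to the next configuration for the following cycle.
        self.slot = (self.slot + 1) % len(self.configs)
        return out


AND2 = [0, 0, 0, 1]  # truth table for a 2-input AND
XOR2 = [0, 1, 1, 0]  # truth table for a 2-input XOR

lut = TimeMultiplexedLUT([AND2, XOR2])
results = [lut.evaluate([1, 1]), lut.evaluate([1, 1]), lut.evaluate([1, 0])]
```

The same hardware acts as an AND gate in slot 0 and an XOR gate in slot 1, which is why only the configuration storage grows with the number of slots.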





- Datapath FPGAs
  - → Config bit sharing
  - → 1.1x density, same performance





Datapath Routing Mux



- Where we're going
  - Coarse-grained (datapath) time-multiplexed resources
  - ALU is shared





### Overview

- Motivation
- Malibu Architecture
- Synthesis
- Results



Traditional Island-Style FPGA






Malibu Architecture







• Add coarse-grained inputs and outputs







• Add an ALU and register file









- Example instructions (from the slide figures)
  - ADD N0, W0 -> E0
  - MUX W0, R0, CGI0 -> E0
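
To make instructions of the form `ADD N0, W0 -> E0` concrete, here is a small interpreter sketch. Everything beyond the instruction text itself is an assumption: that `N0`, `W0`, and `E0` name north, west, and east ports, that `R0` is a register and `CGI0` a coarse-grained input, that words are 32 bits wide, and that `MUX` takes its operands in the order select, value-if-nonzero, value-if-zero. The slides do not specify these details.

```python
WORD_MASK = 0xFFFFFFFF  # assumed 32-bit word width (not stated in the slides)


def execute(instr, ports):
    """Run one 'OP src, ... -> dst' instruction against a dict of port values."""
    lhs, dst = (part.strip() for part in instr.split("->"))
    op, operand_text = lhs.split(None, 1)
    args = [ports[name.strip()] for name in operand_text.split(",")]

    if op == "ADD":
        result = (args[0] + args[1]) & WORD_MASK
    elif op == "MUX":
        # Assumed operand order: select, value-if-nonzero, value-if-zero.
        sel, if_nonzero, if_zero = args
        result = if_nonzero if sel else if_zero
    else:
        raise ValueError(f"unsupported operation: {op}")

    # Return a new port map so the caller's state is left untouched.
    updated = dict(ports)
    updated[dst] = result
    return updated


ports = {"N0": 5, "W0": 7, "R0": 1, "CGI0": 2}
after_add = execute("ADD N0, W0 -> E0", ports)
after_mux = execute("MUX W0, R0, CGI0 -> E0", ports)
```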












#### Bitstream









# **Front-End Synthesis**

- Parse and Elaborate
  - → Use Verilator
  - Construct a CDFG
  - → Optimize
- Coarse-Grained Synthesis
  - Map CDFG to Malibu instructions
  - Various CDFG transformations
- Fine-Grained Synthesis
  - → Extract signals $\leq W_f$
  - → Use OdinII and ABC to synthesize to LUTs
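
The fine-grained extraction step can be pictured as a simple partition of signals. The sketch below assumes that $W_f$ is a bit-width threshold, so that signals no wider than $W_f$ go to the fine-grained (LUT) flow and the rest stay coarse-grained. The function name, the signal names, and the threshold value are hypothetical.

```python
def partition_by_width(signal_widths, w_f):
    """Split signals into fine-grained (width <= w_f) and coarse-grained sets."""
    fine = {name: w for name, w in signal_widths.items() if w <= w_f}
    coarse = {name: w for name, w in signal_widths.items() if w > w_f}
    return fine, coarse


# Hypothetical signal widths in bits, with an assumed threshold of 4 bits.
signals = {"valid": 1, "state": 3, "addr": 16, "data": 32}
fine, coarse = partition_by_width(signals, w_f=4)
```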





# **Back-End Synthesis**

- M-CAD
  - Traditional FPGA-CAD flow
  - → Separate Placement, Routing, Scheduling
- M-HOT
  - Integrated placement, routing, scheduling
  - Divides problem into levels, place+route each level
- Both Approaches
  - → Can target any-sized architecture
  - → Can trade area for performance
  - → Fast
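
One plausible reading of "divides problem into levels" is a levelization of the dataflow graph, after which each level is placed, routed, and scheduled in turn. The sketch below assumes a node's level is its longest distance from any input (ASAP depth); the slides do not define the level metric, and the graph and function names are illustrative.

```python
def node_heights(preds):
    """Map each node of a DAG to its longest distance from any input node.

    preds maps node -> list of predecessor nodes.
    """
    memo = {}

    def height(node):
        if node not in memo:
            # Inputs have no predecessors, so they land at height 0.
            memo[node] = 1 + max((height(p) for p in preds[node]), default=-1)
        return memo[node]

    for node in preds:
        height(node)
    return memo


def group_by_level(preds):
    """Return nodes grouped by height, lowest level first."""
    by_level = {}
    for node, h in node_heights(preds).items():
        by_level.setdefault(h, []).append(node)
    return [sorted(by_level[h]) for h in sorted(by_level)]


# Hypothetical five-node dataflow graph.
graph = {"a": [], "b": [], "add": ["a", "b"], "mul": ["add", "b"], "out": ["mul"]}
levels = group_by_level(graph)
```

Each inner list could then be handed to a place-and-route step for that level, with earlier levels already fixed.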





• Example





#### **Synthesis**




• M-HOT: place, route, and schedule each height

![](_page_25_Figure_3.jpeg)


![](_page_26_Figure_6.jpeg)


• Frequency (MHz) for each benchmark

![](_page_27_Figure_3.jpeg)


• Frequency (MHz) for each benchmark

![](_page_28_Figure_3.jpeg)


• Area vs. Performance tradeoff

![](_page_29_Figure_3.jpeg)


• Results (compared to Quartus II / Stratix III)

| Metric (relative to Quartus II) | M-HOT | M-CAD |
|---------------------------------|-------|-------|
| Synthesis time improvement      | 30.9x | 77.0x |
| User clock speed                | 0.12x | 0.07x |
| Density                         | 1.48x | 0.67x |

- 10x = 10 times better than the Quartus result
- 1x = same as Quartus
- 0.1x = 1/10th the Quartus result (10 times worse)
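
As a worked reading of these factors, the sketch below applies them to made-up baselines. The 60-minute Quartus runtime and the 200 MHz clock are hypothetical numbers chosen only to illustrate the arithmetic; they are not measurements from this work.

```python
def scaled_runtime(baseline_minutes, improvement):
    """A synthesis-time improvement of Nx divides the baseline runtime by N."""
    return baseline_minutes / improvement


def scaled_clock(baseline_mhz, factor):
    """A clock-speed factor of Nx multiplies the baseline frequency by N."""
    return baseline_mhz * factor


# Hypothetical baselines: 60 minutes of Quartus runtime, 200 MHz user clock.
m_hot_minutes = scaled_runtime(60.0, 30.9)  # just under 2 minutes
m_hot_mhz = scaled_clock(200.0, 0.12)       # about 24 MHz
```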


- Improve Front-End Synthesis
  - Needs to be both coarse-grain and fine-grain aware
  - Coarse-grained optimizations
- Improve M-CAD and M-HOT
  - Possibly 2x-3x performance improvement

![](_page_31_Figure_7.jpeg)


### Thanks

- Purpose
  - Implement a circuit on a coarse-grained/fine-grained architecture
- Malibu Architecture
  - FPGA with time-multiplexed coarse-grained resources
  - Can trade density for performance
- Synthesis (M-CAD and M-HOT)
  - → Fast, up to 250x faster than Quartus II
  - → Fmax about 1/10th that of an FPGA