# **NOEMA:** A Massive-Scale Brain Activity Decoding Chip

Ameer Abdelhadi, Eugene Sha, Andreas Moshovos

University of Toronto

August, 2022





### Brain Machine Interfaces (BMIs)












### BMIs at the edge

What if we can detect patterns of neuron activity in real-time?

Detect, in real-time, memories, decisions, emotions, and experiences

#### **Applications**

#### **Repair brain function**

Interface brain regions which no longer connect, e.g. Alzheimer's



Replacement of damaged hippocampus with a chip [1]

#### **Drive effectors**

Greater accuracy and dexterity, e.g. robotic limbs



Woman controls robotic arm with 100-channel Utah array [2]

#### Anticipate and prevent harmful neural activity

e.g. epilepsy



Responsive neurostimulator system for epilepsy [3]

[1] https://www.newscientist.com/article/dn3488-worlds-first-brain-prosthesis-revealed/ (Hippocampus repair)

[2] https://continuum.utah.edu/web-exclusives/the-bionics-man/ (Utah Array)

[3] Critical review of the responsive neurostimulator system for epilepsy (Thomas and Jobst, 2015)

#### The Challenge and Opportunity

Capture capability growing exponentially

Constraints for a *portable implanted device*

1. Fast (real-time, <5 ms detection latency)
2. Low-power & low-area
3. Scalable

The Edward S. Rogers Sr. Department of Electrical & Computer Engineering



## Data quickly outpacing analysis techniques

### Existing solutions can't cope




- Limited number of neurons
- Not real-time
- High power
- Physically large

[Figure: number of simultaneously recorded neurons by year, 1960 to 2020. Data from https://stevenson.lab.uconn.edu/scaling/]


#### Brain activity decoding is memory intensive & computationally expensive






### **Roadmap to NOEMA**

- Input to the system
- Template matching
- Baseline design & NOEMA
- Results























### **Processing Pipeline**












#### groups of 3 (example bin size)
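The binning step can be illustrated with a short sketch. It assumes, as the deck's own note on the resolution window puts it, that activities within a bin are averaged; the function name `bin_activity` and the choice to drop a trailing partial bin are hypothetical and only serve the illustration.

```python
from typing import List, Sequence


def bin_activity(samples: Sequence[float], bin_size: int = 3) -> List[float]:
    """Average consecutive samples in groups of `bin_size`.

    Trailing samples that do not fill a whole bin are dropped
    (an assumption made for this sketch, not taken from the slides).
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    usable = len(samples) - len(samples) % bin_size
    return [
        sum(samples[i:i + bin_size]) / bin_size
        for i in range(0, usable, bin_size)
    ]


# One neuron's activity over nine time steps, binned in groups of 3.
activity = [0, 1, 2, 3, 3, 3, 1, 0, 5]
binned = bin_activity(activity)  # three bins, one average per bin
```

Applying the same function to every neuron's row would give the binned input matrix that later stages compare against the templates.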










### **Template Matching**







Which template does the input most closely resemble?








#### How do neuroscientists determine this?



### Pearson Correlation Coefficient (PCC)

Widely used metric to measure the "closeness" of two matrices

$$r(X,Y) = \frac{\sum_{i=1}^{L} (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_{i=1}^{L} (x_i - \overline{x})^2} \sqrt{\sum_{i=1}^{L} (y_i - \overline{y})^2}}$$
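As a minimal sketch of the formula above, the following computes the PCC directly from its definition and picks the template with the highest coefficient. It assumes each matrix has been flattened into a vector of length L; the names `pearson` and `best_template` and the sample data are hypothetical.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(sum((a - mean_x) ** 2 for a in x)) * math.sqrt(
        sum((b - mean_y) ** 2 for b in y)
    )
    if denominator == 0:
        raise ValueError("PCC is undefined for a constant vector")
    return numerator / denominator


def best_template(inp: Sequence[float], templates: Dict[str, Sequence[float]]) -> str:
    """Return the name of the template with the highest PCC against `inp`."""
    return max(templates, key=lambda name: pearson(inp, templates[name]))


flat_input = [1, 2, 3, 4, 5, 6]
templates = {"A": [2, 4, 6, 8, 10, 12], "B": [6, 5, 4, 3, 2, 1]}
match = best_template(flat_input, templates)
```

Here template `A` is a scaled copy of the input and template `B` is its reversal, so the two coefficients sit at opposite ends of the PCC range.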




















### **Template Matching Overview**









Entire input buffer fills before compute begins → High latency

Most difficult requirement: 5 ms for real-time










Storage of input + templates → Large memory cost e.g. +1.24 Gb each











→ Large area cost

### How can we do better?





#### **NOEMA** [MICRO'21, Patented]: Brain Interfaces at the Edge

A multidisciplinary collaboration to analyze and develop a custom hardware platform for deciphering neural activity in the brain






Enabling truly portable systems for processing high-resolution brain activity signals for treatment, augmentation, and repair of brain functions






#### **NOEMA**'s Prototype Chip

- Fabricated with TSMC 65nm GP technology
- Only 24 µs latency!
- 5 sec experience, 1K neurons @ 0.73 mW
- Scales to **30K** neurons, **10**× more than have ever been recorded
- Scales to meet *future* demand!



#### Input Serialization & PCC Reformulation





### **NOEMA's innovations**

$$r[t]^{2} = \frac{(C_{1}S_{1}[t] - C_{2}S_{2}[t])^{2}}{C_{3} (C_{1}S_{3}[t] - S_{2}[t]^{2})}$$
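The slide does not define the symbols in this expression. One reading that makes it equal the square of the PCC defined earlier is sketched below: the constants depend only on the stored template (C1 = number of elements L, C2 = sum of template values, C3 = L·Σy² − (Σy)²), and the S terms are running sums updated as each input value arrives (S1 = Σxy, S2 = Σx, S3 = Σx²). These assignments are an assumption, and the names `StreamingPCC` and `batch_r_squared` are hypothetical.

```python
from typing import Sequence


class StreamingPCC:
    """Squared PCC against one template, updated one input value at a time."""

    def __init__(self, template: Sequence[float]) -> None:
        self.template = list(template)
        n = len(self.template)
        sum_t = sum(self.template)
        sum_t2 = sum(v * v for v in self.template)
        # Template-only constants, computable before any input arrives.
        self.c1 = n
        self.c2 = sum_t
        self.c3 = n * sum_t2 - sum_t * sum_t
        # Running sums over the input received so far.
        self.s1 = 0.0  # sum of input * template
        self.s2 = 0.0  # sum of input
        self.s3 = 0.0  # sum of input squared
        self.index = 0

    def update(self, x: float) -> None:
        if self.index >= len(self.template):
            raise ValueError("more input values than template elements")
        t = self.template[self.index]
        self.s1 += x * t
        self.s2 += x
        self.s3 += x * x
        self.index += 1

    def r_squared(self) -> float:
        numerator = (self.c1 * self.s1 - self.c2 * self.s2) ** 2
        denominator = self.c3 * (self.c1 * self.s3 - self.s2 ** 2)
        if denominator == 0:
            raise ValueError("r^2 is undefined for constant data")
        return numerator / denominator


def batch_r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Reference: squared PCC from the mean-centred definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y)) ** 2
    den = sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    return num / den


template = [1, 3, 2, 5, 4]
stream = [2, 1, 4, 3, 6]
pcc = StreamingPCC(template)
for value in stream:
    pcc.update(value)  # no input buffer: each value is consumed on arrival
streamed = pcc.r_squared()
reference = batch_r_squared(stream, template)
```

Under this reading no mean has to be known in advance and no square root is needed, at the cost of losing the sign of r. Before the last value arrives, the running result equals the squared PCC of the template against the input padded with zeros.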

#### Bit-serial input

- No buffering overhead
- Compute immediately when received
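To make "compute immediately when received" concrete, here is a generic bit-serial dot product, not the chip's actual datapath. It assumes unsigned inputs of a fixed bit width arriving one bit plane at a time, most significant bit first; the deck does not state the bit order, and the function names are hypothetical.

```python
from typing import List, Sequence


def bit_planes_msb_first(values: Sequence[int], width: int) -> List[List[int]]:
    """Split unsigned integers into bit planes, most significant bit first."""
    for v in values:
        if not 0 <= v < (1 << width):
            raise ValueError(f"{v} does not fit in {width} unsigned bits")
    return [[(v >> b) & 1 for v in values] for b in range(width - 1, -1, -1)]


def bit_serial_dot(planes: Sequence[Sequence[int]], template: Sequence[int]) -> int:
    """Accumulate sum(input_i * template_i) one bit plane at a time.

    Each step doubles the accumulator and adds the template values whose
    input bit is 1, so only additions and shifts are needed.
    """
    acc = 0
    for plane in planes:
        partial = sum(t for t, bit in zip(template, plane) if bit)
        acc = (acc << 1) + partial
    return acc


inputs = [5, 3, 6]
template = [2, 7, 1]
planes = bit_planes_msb_first(inputs, width=3)
dot = bit_serial_dot(planes, template)
```

The accumulator is updated as soon as each bit plane arrives, so nothing beyond the running sum has to be stored on the input side.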



Simple memory compression (~2.8x)



#### Near-memory bit-serial PEs

- Based on reformulated PCC
- Tiny, easy to scale

#### Fits well with existing probe interfaces (time-multiplexed ADC out)



### **Baseline to NOEMA Overview**

#### Baseline










## **Performance Results**



\* For the most demanding configuration tested (9 sec experience, 30K neurons)






## **Power & Area Results**



\* For the most demanding configuration tested (9 sec experience, 30K neurons)





| Device | F<sub>max</sub> (MHz) | Neurons (thousands) | Templates | Duration<sup>1</sup> (seconds) | Resolution<sup>2</sup> (milliseconds) | Compute<sup>3</sup> (GOPs) | Memory<sup>3</sup> (Mb) | FPGA<sup>4</sup> | ASIC<sup>5</sup> |
|------------------|-----|----|---|---|-----|--------|--------|-----------------|--------------|
| NOEMA01K1T05S250 | 30  | 1  | 1 | 5 | 250 | 0.6    | 0.3    | $\checkmark$    | $\checkmark$ |
| NOEMA10K2T05S005 | 300 | 10 | 2 | 5 | 5   | 628.0  | 114.4  | $\checkmark$    | Planned      |
| NOEMA20K3T09S250 | 600 | 20 | 3 | 9 | 250 | 64.8   | 33.0   | ×<sup>6</sup>   | Planned      |
| NOEMA30K4T09S005 | 900 | 30 | 4 | 9 | 5   | 6786.4 | 1236.0 | ×<sup>6</sup>   | Planned      |

1. Duration of the decoded experience.
2. Resolution window of the incoming activities. Activities within this window are binned (averaged).
3. If executed on commodity hardware.
4. Intel's Stratix 10 FPGA.
5. TSMC 65nm GP.
6. Not applicable; device can't meet target frequency.
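The device names in the table appear to encode the configuration: neurons in thousands before `K`, template count before `T`, duration in seconds before `S`, then resolution in milliseconds. That naming pattern is inferred from the table rows, not stated in the deck. Under that assumption, a small hypothetical parser recovers each configuration and the number of bins per template (duration divided by resolution).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class NoemaConfig:
    neurons: int
    templates: int
    duration_s: int
    resolution_ms: int

    @property
    def bins(self) -> int:
        """Number of time bins: duration divided by the resolution window."""
        return self.duration_s * 1000 // self.resolution_ms

    @property
    def values_per_template(self) -> int:
        """Entries in one neurons-by-bins template matrix."""
        return self.neurons * self.bins


# Pattern inferred from the table's device names (an assumption).
_NAME = re.compile(r"NOEMA(\d+)K(\d+)T(\d+)S(\d+)")


def parse_device_name(name: str) -> NoemaConfig:
    match = _NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised device name: {name!r}")
    thousands, templates, seconds, millis = (int(g) for g in match.groups())
    return NoemaConfig(
        neurons=thousands * 1000,
        templates=templates,
        duration_s=seconds,
        resolution_ms=millis,
    )


smallest = parse_device_name("NOEMA01K1T05S250")
largest = parse_device_name("NOEMA30K4T09S005")
```

For the smallest configuration this gives 20 bins of 1,000 neurons each; for the largest, 1,800 bins of 30,000 neurons, which shows how quickly the matrix grows as the resolution window shrinks.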





| Device | Area: Memory (mm<sup>2</sup>) | Area: Logic (mm<sup>2</sup>) | Area: Total (mm<sup>2</sup>) | Power: Memory (mW) | Power: Logic (mW) | Power: Total (mW) | Latency (µs) | Chip Status |
|------------------|--------|------|--------|--------|--------|---------|------|--------------------------|
| NOEMA01K1T05S250 | 0.36   | 0.07 | 0.43*  | 0.30   | 0.43   | 0.73    | 23.9 | In lab<sup>+#</sup>      |
| NOEMA10K2T05S005 | 28.46  | 1.35 | 29.81  | 89.78  | 84.28  | 174.06  | 2.8  | Simulated<sup>#</sup>    |
| NOEMA20K3T09S250 | 6.26   | 0.09 | 6.25   | 18.55  | 9.68   | 28.23   | 1.5  | Simulated                |
| NOEMA30K4T09S005 | 202.00 | 3.42 | 205.42 | 682.70 | 522.76 | 1205.46 | 1.0  | Simulated                |

\* Core only; 2.1 mm<sup>2</sup> total silicon area.

<sup>+</sup> Fabricated with TSMC 65nm GP

<sup>#</sup> Also tested on Intel's Stratix 10 FPGA





- TSMC 65nm GP
- 24 µs latency
- 1K neurons (scales to 30K)
- 5 sec experience
- Consumes 0.73 mW
- Equivalent to 600 MOPs of 32-bit floating point





By Comparison:

- Nvidia Jetson Nano
  - Consumes 10W
  - Barely meets **5ms** real-time latency
- Intel i5-7000
  - 63ms latency
  - Fails to meet real-time latency
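Taking the slide's figures at face value, the gaps can be put into ratios with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the deck does not say whether the Jetson Nano and i5 measurements use the same configuration as the prototype chip, and the constant names are hypothetical.

```python
# Figures quoted on the slide, converted to base units.
NOEMA_POWER_W = 0.73e-3      # 0.73 mW
NOEMA_LATENCY_S = 24e-6      # 24 µs
JETSON_POWER_W = 10.0        # 10 W
REAL_TIME_BUDGET_S = 5e-3    # 5 ms real-time requirement
I5_LATENCY_S = 63e-3         # 63 ms

# How many times more power the Jetson Nano draws than the prototype chip.
power_ratio = JETSON_POWER_W / NOEMA_POWER_W

# How many times faster the chip is than the 5 ms real-time budget.
budget_headroom = REAL_TIME_BUDGET_S / NOEMA_LATENCY_S

# How many times longer the i5 takes than the chip.
i5_slowdown = I5_LATENCY_S / NOEMA_LATENCY_S

print(f"power ratio: {power_ratio:.0f}x")
print(f"headroom vs 5 ms budget: {budget_headroom:.0f}x")
print(f"i5 latency ratio: {i5_slowdown:.0f}x")
```

On these numbers the power gap is roughly four orders of magnitude, and the chip's latency sits about two hundred times below the real-time budget.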






Brain machine interfaces:

- Exponential growth in data
- Current solutions are not sufficient

NOEMA's key innovation:

- Uses simple, low-cost, area- and energy-efficient bit-serial and integer arithmetic units
- Enables computations to proceed progressively as data is received
- Scales to meet *future* demand
  - 14× less power, 2.6× smaller, latency on the order of microseconds


# Thank you!