## Deploying Deep Neural Networks in the Embedded Space

Dr. Christos-Savvas Bouganis

2<sup>nd</sup> Workshop on Reconfigurable Computing for Machine Learning (**RCML**)

30 August 2018

Intelligent Digital Systems Lab

**Dept. of Electrical and Electronic Engineering** 

www.imperial.ac.uk/idsl

#### The team




- **Stylianos I. Venieris**: Machine Learning
- **Manolis Vasileiadis**: Computer Vision
- **Alexandros Kouris**: Machine Learning, Robotics
- **Konstantinos Boikos**: Computer Vision, SLAM
- **Mudhar Bin Rabieah**: Machine Learning
- **Nur Ahmadi**: Brain-Machine Interface




The iDSL is part of the Department of Electrical and Electronic Engineering at Imperial College London.






**Christos-Savvas Bouganis**, iDSL Lab Director, Imperial College London





## DNNs in the Embedded Space – Variability in Performance Requirements





**High-Throughput Applications** 



Low-Latency Applications




Our approach: Couple the design of the ML algorithm with the design of the computational platform to improve performance and enable the deployment of AI systems

#### Power constraints

- Absolute power consumption
- Performance-per-Watt


## **Conventional Embedded Platforms for Neural Networks**

**GPUs** – Tegra K1, X1 and X2

**DSPs** – Qualcomm Hexagon, Apple Neural Engine, ...

- High throughput
- Low latency
- ✗ Low power
- ✓ Tools

#### FPGAs

- Custom datapath
- Custom memory subsystem
- Programmable interconnections
- Reconfigurability



- ✓ High throughput
- ✓ Low latency
- ✓ Low power
- ✗ Tools

*Challenge:* Huge design space

*Our Approach:* Automated toolflows

#### **Research Areas / Challenges**




### **Challenge #1: Mapping Automation**


### **Challenge #1: Automated CNN-to-FPGA Toolflow**



## fpgaConvNet – Design Space Exploration and Optimisation

- Synchronous Dataflow Modelling
  - Capture hardware mappings as matrices
  - Transformations as *algebraic operations*
  - Analytical *performance model*
  - Cast design space exploration as a mathematical optimisation problem



$$t_{total}(B, N_P, \mathbf{\Gamma}) = \sum_{i=1}^{N_P} t_i(B, \mathbf{\Gamma}_i) + (N_P - 1) \cdot t_{reconfig.}$$
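The performance model above can be evaluated directly once the per-partition times are known. The following is a minimal sketch, assuming each $t_i(B, \mathbf{\Gamma}_i)$ has already been evaluated for a fixed batch size and folding; the function names `total_time` and `best_design` and the candidate designs are hypothetical and are not part of fpgaConvNet.

```python
from typing import Dict, Sequence


def total_time(partition_times: Sequence[float], t_reconfig: float) -> float:
    """Evaluate t_total = sum_i t_i + (N_P - 1) * t_reconfig.

    partition_times holds the already-evaluated t_i(B, Gamma_i) values,
    one per subgraph partition (an assumption of this sketch).
    """
    n_p = len(partition_times)
    if n_p == 0:
        raise ValueError("at least one partition is required")
    return sum(partition_times) + (n_p - 1) * t_reconfig


def best_design(candidates: Dict[str, Sequence[float]], t_reconfig: float) -> str:
    """Exhaustively pick the candidate design with the lowest total time."""
    return min(candidates, key=lambda name: total_time(candidates[name], t_reconfig))


if __name__ == "__main__":
    # Hypothetical candidates: one slow partition vs. three faster ones.
    designs = {"single": [10.0], "three_way": [2.0, 3.0, 1.0]}
    for t_reconfig in (0.5, 2.5):
        choice = best_design(designs, t_reconfig)
        print(t_reconfig, choice, total_time(designs[choice], t_reconfig))
```

With a small reconfiguration time the partitioned design wins; once reconfiguration dominates, the single-partition design becomes preferable. This is the trade-off the $(N_P - 1) \cdot t_{reconfig.}$ term captures.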


#### **Meeting the performance requirements**





## **Comparison with Embedded GPUs: Same absolute power constraints (5W)**



- Latency-driven scenario → batch size of 1
- Up to 6.65× speedup with an average of 3.95× (3.43× geo. mean)

fpgaConvNet vs Embedded GPU (GOp/s) for the same absolute power constraints (5W)




### **Comparison with Embedded GPUs: Performance-per-Watt**







- Average of 1.17× (1.12× geo. mean) in GOp/s/W
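The comparisons report both an arithmetic average and a geometric mean of per-network speedups. As a minimal sketch of how these summary figures relate, the snippet below computes both from a list of speedups, along with a performance-per-Watt figure. The input values are made up for illustration and are not the measurements behind the slides; the helper names are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def summarise_speedups(speedups: Sequence[float]) -> Dict[str, float]:
    """Return the maximum, arithmetic mean and geometric mean of speedups."""
    return {
        "max": max(speedups),
        "mean": statistics.fmean(speedups),
        "geo_mean": statistics.geometric_mean(speedups),
    }


def performance_per_watt(gops: float, watts: float) -> float:
    """Throughput normalised by power, in GOp/s/W."""
    return gops / watts


if __name__ == "__main__":
    # Illustrative speedups only (not taken from the talk).
    print(summarise_speedups([1.0, 2.0, 4.0]))
    print(performance_per_watt(50.0, 5.0))
```

The geometric mean is less sensitive to a single large outlier than the arithmetic mean, which is why the two figures can differ noticeably across a benchmark suite.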

### **Other approaches**




Stylianos I. Venieris, Alexandros Kouris and Christos-Savvas Bouganis, "Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions", ACM Computing Surveys, 2018

# Challenge #2: Multi-CNN Systems






#### **Challenge #2: Multi-CNN Systems – Autonomous Drones**



#### Challenge #2: Multi-DNN System



### **Proposed Design Space Exploration Method**





#### **FPGA Architecture**



## **Comparison with Embedded GPUs**



- Latency-driven scenario → batch size of 1
- Up to 19.09× speedup with an average of 6.85× (geo. mean)



Performance-per-Watt: f-CNN<sup>x</sup> vs. TX1

- Latency-driven scenario → batch size of 1
- Up to 9.61× speedup with an average of 2.76× (geo. mean)

# Challenge #3: Time-constrained Inference








Current approaches

#### Challenge #3: Time-constrained Inference

- Approximate LSTMs
  - Iterative refinement using SVD + Pruning.
  - Parametrized with respect to:
    - Number of iterations
    - Level of pruning
- Parametrized hardware architecture, tailored for approximate LSTMs
- Co-optimise given a user-defined time budget
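The iterative refinement idea can be illustrated on a single weight matrix. The sketch below is an assumption about how "SVD + pruning" might be combined, not the authors' exact algorithm: each iteration takes the dominant rank-1 SVD component of the current residual, zeroes the smallest-magnitude entries of its two vectors, and adds the result to the running approximation. The function names and the magnitude-based pruning rule are hypothetical.

```python
import numpy as np


def prune(vec: np.ndarray, prune_fraction: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries of vec (assumed pruning rule)."""
    k = int(round(prune_fraction * vec.size))
    pruned = vec.copy()
    if k > 0:
        pruned[np.argsort(np.abs(vec))[:k]] = 0.0
    return pruned


def approximate_weights(W: np.ndarray, num_iterations: int,
                        prune_fraction: float) -> np.ndarray:
    """Approximate W by a sum of pruned rank-1 terms, refined iteratively."""
    approx = np.zeros_like(W, dtype=float)
    residual = W.astype(float).copy()
    for _ in range(num_iterations):
        # Dominant singular triplet of what is still unexplained.
        U, s, Vt = np.linalg.svd(residual, full_matrices=False)
        u = prune(U[:, 0] * s[0], prune_fraction)
        v = prune(Vt[0, :], prune_fraction)
        approx += np.outer(u, v)
        residual = W - approx
    return approx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((6, 4))
    for iterations in (1, 2, 4):
        err = np.linalg.norm(W - approximate_weights(W, iterations, 0.25))
        print(iterations, err)
```

The two knobs, `num_iterations` and `prune_fraction`, correspond to the two parameters listed above. Fewer iterations and heavier pruning reduce the work per inference at the cost of approximation error, which is the trade-off a time budget would be used to resolve.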



# Impact on LSTM-based Image Captioning




# Challenge #4: Privacy-aware Deep Learning



#### Challenge #4: Privacy-restricted Optimisation

Aim: Design an optimised HW system (performance and accuracy)

Given:

- A high-level CNN description (e.g. Caffe)
- A target FPGA platform
- Training data: may be unavailable due to privacy or availability restrictions
- Testing data
- Target metric (top-1/top-5 accuracy, ...)

Without training data, quantisation with a retraining step is not applicable, which leaves limited quantisation opportunities.
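To make the notion of quantisation without retraining concrete, here is a minimal sketch of symmetric uniform quantisation applied directly to a set of weights. The scheme, the function name `quantise_uniform` and the example values are assumptions for illustration; the slides do not specify which quantisation scheme is used.

```python
import numpy as np


def quantise_uniform(weights: np.ndarray, num_bits: int) -> np.ndarray:
    """Symmetric uniform quantisation to num_bits (sign included), no retraining.

    Values are snapped to a grid whose step is set by the largest magnitude.
    """
    max_abs = float(np.max(np.abs(weights)))
    if max_abs == 0.0:
        return weights.astype(float).copy()
    levels = 2 ** (num_bits - 1) - 1
    scale = max_abs / levels
    grid_index = np.clip(np.round(weights / scale), -levels, levels)
    return grid_index * scale


if __name__ == "__main__":
    w = np.linspace(-1.0, 1.0, 11)  # illustrative weights
    for bits in (8, 4):
        err = np.max(np.abs(w - quantise_uniform(w, bits)))
        print(bits, err)
```

The error grows as the wordlength shrinks. Without a retraining step to recover accuracy, this growth is what limits how aggressively a network can be quantised.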










### CascadeCNN: High-Level System Architecture

- Pushing quantisation below the limits of acceptable accuracy to gain performance (high throughput)
- Evaluating the quality of each prediction to identify and correct errors introduced by quantisation
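The two points above describe a cascade: classify everything at low precision, estimate how trustworthy each prediction is, and re-run only the doubtful samples at high precision. A minimal sketch follows, assuming the two models are callables that return logits and that confidence is measured as the gap between the top two softmax probabilities. The confidence metric, the threshold and all names are assumptions for illustration rather than the CascadeCNN implementation.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def confidence_margin(probs: np.ndarray) -> np.ndarray:
    """Gap between the best and second-best class probability per sample."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]


def cascade_predict(inputs, low_precision_model, high_precision_model, threshold):
    """Classify at low precision, then correct low-confidence samples."""
    low_probs = softmax(low_precision_model(inputs))
    preds = low_probs.argmax(axis=1)
    uncertain = confidence_margin(low_probs) < threshold
    if uncertain.any():
        # Only the flagged samples pay the cost of the high-precision pass.
        high_logits = high_precision_model(inputs[uncertain])
        preds[uncertain] = high_logits.argmax(axis=1)
    return preds, uncertain


if __name__ == "__main__":
    # Toy stand-ins: the "models" treat inputs as logits; rounding mimics
    # a crude low-precision datapath.
    low = lambda x: np.round(x)
    high = lambda x: x
    x = np.array([[2.0, 0.1, 0.0], [0.3, 0.45, 0.0]])
    print(cascade_predict(x, low, high, threshold=0.2))
```

In this toy run the second sample collapses to all-zero logits at low precision, is flagged as uncertain, and is corrected by the high-precision pass. The throughput gain depends on how small the flagged fraction is.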



- **Low-Precision Unit:** Degraded-accuracy classification with high performance
- **Confidence Evaluation Unit:** Identifies misclassified cases
- **High-Precision Unit:** Corrects detected misclassified samples to restore accuracy

### CascadeCNN: Results





#### **Summary**


**Research topics** 





A. Kouris and C-S Bouganis, "Learning to Fly by MySelf: A Self-Supervised CNN-based Approach for Autonomous Navigation", IROS, 2018

#### **Publications**

www.imperial.ac.uk/idsl

- Alexandros Kouris, Stylianos I. Venieris, and Christos-Savvas Bouganis. 2018. CascadeCNN: Pushing the performance limits of quantisation. In SysML.
- Alexandros Kouris, Stylianos I. Venieris, and Christos-Savvas Bouganis. 2018. CascadeCNN: Pushing the Performance Limits of Quantisation in Convolutional Neural Networks. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL).
- C. Kyrkou, G. Plastiras, T. Theocharides, S. I. Venieris, and C. S. Bouganis. 2018. DroNet: Efficient Convolutional Neural Network Detector for Real-Time UAV Applications. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). 967–972.
- Michalis Rizakis, Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas Bouganis. 2018. Approximate FPGA-based LSTMs under Computation Time Constraints. In Applied Reconfigurable Computing: 14th International Symposium, ARC 2018, Santorini, Greece, May 2–4, 2018, 3–15.
- Stylianos I. Venieris and Christos-Savvas Bouganis. 2016. *fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs.* In 2016 IEEE 24<sup>th</sup> Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 40–47.
- Stylianos I. Venieris and Christos-Savvas Bouganis. 2017. *fpgaConvNet: A Toolflow for Mapping Diverse Convolutional Neural Networks on Embedded FPGAs.* In NIPS 2017 Workshop on Machine Learning on the Phone and other Consumer Devices.
- Stylianos I. Venieris and Christos-Savvas Bouganis. 2017. *fpgaConvNet: Automated Mapping of Convolutional Neural Networks on FPGAs* (Abstract Only). In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 291–292.
- S. I. Venieris and C. S. Bouganis. 2017. *Latency-Driven Design for FPGA-based Convolutional Neural Networks*. In 2017 27th International Conference on Field Programmable Logic and Applications (FPL).
- S. I. Venieris and C. S. Bouganis. 2018. *f-CNNx: A Toolflow for Mapping Multiple Convolutional Neural Networks on FPGAs.* In 2018 28th International Conference on Field Programmable Logic and Applications (FPL).
- Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas Bouganis. 2018. Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions. In ACM Computing Surveys 51, 3, Article 56 (June 2018), 39 pages.