Guy is a Professor in Computer Engineering at the University of British Columbia, where he teaches advanced digital design and computer systems/architecture-related courses. His research focuses on improving FPGA devices and CAD tools, in particular making them easier to use and more efficient for computing tasks. His research has shown how to design FPGA interconnect to be more efficient, that CAD tool performance can be enhanced through parallelism, and that overlays are a much easier way to program FPGAs. His latest research is attempting to make machine learning and artificial intelligence applications more efficient on FPGAs through low-precision arithmetic, such as TinBiNN for binary neural networks, and using custom accelerator interfaces. Prof. Lemieux graduated from the University of Toronto, where he was part of a team that designed NUMAchine, a cache-coherent multiprocessor built from scratch using MIPS R4400 CPUs, custom PCBs and FPGAs. Throughout his career, he has designed or co-designed various soft processors and accelerators (MIPS and NIOS clones, VIPERS, VEGAS, VENICE, ORCA RISC-V, and Saturn-V) and co-founded VectorBlox Computing, which developed the MXP (Matrix Processor), a tensor accelerator. Within RISC-V International, he is an elected member of the Technical Steering Committee and chair of the SoftCPU Special Interest Group. Throughout the COVID-19 pandemic, he worked with a small team from the SIG to develop new technology for managing custom instructions without namespace collisions in the ISA, resulting in Google's CFU-Playground as well as CFU interfaces in Lattice's RISC-V RX IP core and Efinix's Titanium FPGAs. After being extended further to provide virtualization and protection (like virtual memory), this technology is being offered as a basis specification for the new Composable Extensions (CX) Task Group, which he helped launch.
He was also a member of the RISC-V Vector and the early Cache Management Operations committees. Finally, he serves as a Voting Member and Editor within the Working Group for IEEE P3109 Standard for Arithmetic Formats for Machine Learning, which has been releasing updates through its Interim Report.
FPGAs, short for Field-Programmable Gate Arrays, are universal logic chips, capable of emulating any other digital chip. Of course, this emulation capability comes with some overhead in the form of cost and performance -- a key goal of Prof. Lemieux's research is to drive down the cost as well as improve the speed and power dissipation of these chips. He has done this through optimization at various levels, including the transistor-level design, the architecture (internal organization) of the device, and the CAD tools that map circuits into the device.
Prof. Lemieux is especially interested in the acceleration of compute-oriented applications using custom hardware. For this purpose, FPGAs have shown much promise, but they also have a reputation for being very difficult to "program", particularly among software-oriented designers, the key users in such applications. To help, he is a strong advocate for the use of overlay architectures, which are digital circuits built on top of FPGAs that make them easier to program. Overlays are like a new type of FPGA, in that they themselves are programmable, but they are more application-specific and have fewer users, making them unlikely to be built as custom chips. Nevertheless, his research has shown that regular C programs can be easily mapped to processor-like overlays and offer levels of performance that are competitive with GPUs.
His startup, VectorBlox Computing, was acquired by Microchip in September 2019, and its technology is now called the VectorBlox Accelerator SDK. VectorBlox designed a vector accelerator system known as the VectorBlox MXP (MatriX Processor) that operates directly on 1D, 2D and 3D tensors.
MXP has four key architectural features that provide gains in efficiency: scratchpad, hardware DMA, sub-word SIMD, and custom instructions. Instead of traditional named vector registers, MXP uses an addressable scratchpad. Being addressable, vectors are simply pointers in C; this allows any number of vectors of arbitrary length to be formed without any internal fragmentation, reduces the need for data duplication and data movement, and allows use of a stack-based ABI for nesting accelerated vector functions. Hardware DMA makes efficient use of a wide, dedicated path to external memory for transferring 1D and 2D tensors and operates concurrently with computation. Sub-word SIMD provides increased parallelism for operations on byte and halfword elements. Custom instructions make it easy to attach highly pipelined hardware into a C-programmed environment, where the hardware design effort focuses on data operations, not on data storage/staging/movement (which is left within C). Furthermore, the MXP is fully portable, allowing the use of almost any host processor and almost any C compiler without the need for any compiler modifications. VectorBlox measured a speedup of nearly 10,000 times on an N-body physics problem. Reaching this level of acceleration demands careful planning of code structure and data layout, which is best done manually using compiler intrinsics, so compiler autovectorization is almost useless here.
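The scratchpad and sub-word SIMD ideas can be illustrated with a minimal software sketch in plain C. This is not the MXP API: every name here (scratchpad, sp_alloc, sp_mark, sp_release, vec_add_bytes, add4_bytes) and the 1024-byte size are hypothetical, chosen only to show how a vector can be "just a pointer" into an addressable memory, how stack-style allocation supports nested vector functions, and how four byte lanes can share one 32-bit operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of an addressable scratchpad (not the MXP API). */
#define SP_BYTES 1024u

static uint8_t scratchpad[SP_BYTES];
static size_t sp_top = 0; /* bump pointer: the scratchpad is used like a stack */

/* Reserve n bytes; a "vector" is simply the returned pointer. */
uint8_t *sp_alloc(size_t n) {
    assert(n <= SP_BYTES - sp_top);
    uint8_t *v = &scratchpad[sp_top];
    sp_top += n;
    return v;
}

/* Save the current allocation point so a nested function can restore it. */
size_t sp_mark(void) { return sp_top; }

/* Release everything allocated after a saved mark (stack-style nesting). */
void sp_release(size_t mark) {
    assert(mark <= sp_top);
    sp_top = mark;
}

/* Element-wise byte add, wrapping modulo 256. This loop handles one element
   per iteration; it only models the result, not any hardware parallelism. */
void vec_add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t len) {
    for (size_t i = 0; i < len; i++) {
        dst[i] = (uint8_t)(a[i] + b[i]);
    }
}

/* Sub-word SIMD in plain C: add four packed bytes with one 32-bit add.
   The top bit of each byte is masked off so carries cannot cross lanes,
   then restored with an XOR. Each lane wraps modulo 256. */
uint32_t add4_bytes(uint32_t x, uint32_t y) {
    uint32_t low7 = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu);
    return low7 ^ ((x ^ y) & 0x80808080u);
}
```

A caller would take a mark with sp_mark(), allocate its temporaries with sp_alloc(), and call sp_release() on return, which is the sense in which a stack-based ABI allows accelerated vector functions to nest.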
His work on interconnect design for FPGAs resulted in a book, published in November 2003.
He received a Best Paper Award at the 2004 IEEE International Conference on Field-Programmable Technology.
His 2001 paper, "Using Sparse Crossbars within LUT Clusters", is included as part of FPGA20, the Top 25 contributions in the First 20 Years of the International Symposium on FPGAs between 1992 and 2011.
His 2017 paper, "Real-time Object Detection in Software with Custom Vector Instructions and Algorithm Changes", was nominated for best paper. The accompanying face detection demo clearly demonstrates its overall accuracy and speed. This work shows how success can be achieved using a software-first overlay approach; the entire application was coded in C++ and used fewer than 300 lines of custom VHDL, but it outperformed all prior art, including custom-built chips dedicated to that purpose.
Degree | Name | Current Position | Completion | Thesis / Project |
M.Sc. | May Young | Vancouver | April 2020 | Dynamic Race Detection for Non-Coherent Accelerators; pdf; primary supervisor was Alan Hu |
Ph.D. | Hossein Omidian | Xilinx, San Jose | October 2018 | Automated Space/Time Scaling of Streaming Task Graphs on Field Programmable Gate Arrays; pdf |
M.A.Sc. | Maximilian Golub | Mercedes-Benz, Seattle | August 2018 | DropBack: Continuous Pruning During Deep Neural Network Training; pdf |
M.A.Sc. | Joseph Edwards | VectorBlox Computing, Vancouver | July 2018 | Real-time Computer Vision in Software using Custom Vector Overlays; pdf |
M.Eng. | Nathan van Woudenberg | Programming + Machine Learning Support in ECE Robotics Control Lab, UBC | May 2016 | n/a |
M.Eng. | Gene Lai | unknown | May 2016 | n/a |
Ph.D. | Ameer Abdelhadi | ameer | June 2016 | Architecture of Block-RAM-Based Massively Parallel Memory Structures: Multi-Ported Memories and Content-Addressable Memories; pdf |
M.A.Sc. | Keith Lee | Gumstix Inc., Vancouver | January 2016 | The DEVBOX development environment: an environment for introducing Verilog to young students; pdf; video demo |
M.Eng. | Danting Li | unknown | December 2015 | n/a |
Ph.D. | Aaron Severance | VectorBlox Computing Inc. | March 2015 | Broadening the Applicability of FPGA-based Soft Vector Processors; pdf |
M.A.Sc. | Michael (Xi) Yue | unknown | October 2014 | Rapid Overlay Building for FPGAs; pdf |
M.Eng. | Douglas (Hak Hian) Sim | Recon Instruments | May 2014 | n/a |
M.A.Sc. | Alex Brant | Altera Toronto | November 2012 | Coarse and Fine Grain Programmable Overlay Architectures for FPGAs; pdf; please check out the open source repository for the ZUMA FPGA Overlay |
M.A.Sc. | Zhiduo Liu | upon graduation: Altera San Jose; currently: Google, CA | September 2012 | Accelerator Compiler for the VENICE Vector Processor; pdf |
M.A.Sc. | Chris Wang | upon graduation: Xilinx, CA; currently: Google, CA | October 2011 | Scalable and Deterministic Timing-driven Parallel Placement for FPGAs; pdf |
Ph.D. | David Grant | Altera Toronto | August 2011 | CAD Algorithms and Performance of Malibu: An FPGA with Time-Multiplexed Coarse-Grained Elements; pdf |
Ph.D. | Usman Ahmed | Altera Toronto | April 2011 | Impact of custom interconnect masks on cost and performance of structured ASICs; pdf; co-supervised with Steve Wilton |
M.A.Sc. | Chris Chou | PMC-Sierra | April 2010 | VIPERS II: A Soft-core Vector Processor with Single-copy Scratchpad Memory; pdf |
M.A.Sc. | Darius Chiu | Independent | Sept 2009 | Congestion-driven Re-clustering CAD Flow for Low-cost FPGAs; pdf |
M.A.Sc. | Johnny Ho | upon graduation: Ixia, CA; next: Quantlab, CA; currently: Microsoft, WA | Sept 2009 | PERG-Rx: An FPGA-based pattern-matching engine with limited regular expression support for large pattern databases; pdf |
M.A.Sc. | Patrick Dong | Xilinx San Jose | Sept 2009 | Period and Glitch Reduction via Clock Skew Scheduling, Delay Padding and GlitchLess; pdf |
M.A.Sc. | Paul Teehan | upon graduation: Ph.D. student, UBC; next: EnerNOC; currently: Travel Audience, Germany | October 2008 | Reliable High-throughput FPGA Interconnect using Source-synchronous Surfing and Wave Pipelining; pdf |
Ph.D. | Mehdi Alimadadi | Linear Technology | July 2008 | Recycling Clock Network Energy in High-performance Digital Designs using On-chip DC-DC Converters; pdf; 90nm chip layout (4MB bitmap); co-supervised with Patrick Palmer |
M.A.Sc. | Jason Yu | Intel Canada | May 2008 | Vector Processing as a Soft-CPU Accelerator; pdf |
M.Eng. | Eric Lai | Amazon.com | April 2008 | n/a |
M.A.Sc. | Mark Yamashita | upon graduation: IBM Canada; next: Oxford MBA; currently: at large | November 2007 | A Combined Clustering and Placement Algorithm for FPGAs; pdf |
M.Eng. | Shirley Ma | McKesson Canada | December 2007 | n/a |
M.Eng. | David Yeager | upon graduation: IBM Canada; currently: Dynimize | December 2006 | Interconnect Estimation for FPGAs; pdf |
M.A.Sc. | David Leong | Nokia Canada | December 2006 | Incremental Placement for FPGAs; pdf |
M.Eng. | Wilson Lo | unknown | November 2006 | Power Model for Small Custom Embedded Memories; supervised by André Ivanov |
M.A.Sc. | Edmund Lee | Altera Toronto | Summer 2006 | Interconnect Driver Design for Long Wires in Field-Programmable Gate Arrays; pdf; co-supervised with Shahriar Mirabbasi |
M.A.Sc. | Marvin Tom | Major Tech Firm in USA | Spring 2006 | Channel Width Reduction Techniques for System-on-Chip Circuits in Field-Programmable Gate Arrays; pdf |
M.A.Sc. | Anthony Yu | Intel Canada | Fall 2005 | Defect Tolerance for Yield Enhancement of FPGA Interconnect Using Fine-grain and Coarse-grain Redundancy; pdf |
M.A.Sc. | Victor Aken'Ova | PMC-Sierra | Spring 2005 | Bridging the Gap between Soft and Hard eFPGA Design; pdf; supervised by Resve Saleh |
You might also enjoy my old home pages as a graduate student at the University of Toronto.
Try searching Library and Archives Canada. They have Canadian theses and other publications.
If you are looking for information about my book, try here.
According to data in this list, my Erdős number is 4.
Lemieux → Sevcik → Klawe → Erdős