Blog

My productivity system

I cannot live without a good system for information organization, time management, task tracking and collaboration. Here is what I use currently:

Why we should teach verification and testing to undergrads

For a significant portion of my time at NVIDIA, I did what is called design verification. It is a very interesting role that involves a lot of innovation, execution, and cool problem solving. However, verification somehow doesn't have the "reputation" that design has. People like to design and think that verification is not interesting. I'm here to say the opposite: verification is a much more challenging and interesting problem to solve than design.

One reason this happens is that academia emphasizes design way more than verification. There are classes on design (digital design) in every university's curriculum, but only a few universities have a class on verification, and even that class is typically a graduate class. Note that when I say design, I am talking about a relatively advanced digital design class that teaches VHDL or Verilog and covers designing a simple processor or complex blocks involving datapaths and control logic. I am not talking about the basic digital logic design class where students learn about logic gates and state machines.

I'm here to say that every university should have a class on verification (and testing and validation; more on that later) for undergrads. It is true that you can't understand verification if you don't know design, so this class would follow a design class.

Hopefully, I've convinced you that it is valuable to teach verification to undergrads: it makes them more employable, and they end up doing something cool and challenging.

Testing and validation are two other aspects of the game. You can see how these terms differ by going through my lecture on verification. The skills for these two are valuable and useful in industry as well. An undergrad class that gives a good overview of all three would make a good introductory class (still an upper-division, i.e., 3rd/4th-year, class). Graduate students can take separate classes to study each in detail.

The elusive Best Paper Award :)

After a long long hiatus (because of COVID), I was able to attend a conference in May this year (2022). I was super excited to get to network with folks from the FPGA community. This was my second FPGA conference. The last one was ISFPGA in 2020, right before the pandemic started.

I applied for several travel grants (they make you write up a lot of stuff; so much effort :P), but eventually I got them. So, I was glad to have my travel fully paid for. The conference was in NYC, which I was not particularly excited about, because I had seen NYC before. Hawaii would have been better. I kid, I kid.

I had submitted my paper titled "CoMeFa: Compute-in-Memory Blocks for FPGAs". It was an evolution of our Compute RAMs idea, where we utilized the dual-port nature of FPGA BRAMs (which I wrote a blog post about earlier).

I was organizing a workshop on OpenFPGA at FCCM, but it was a virtual workshop. I did that from home and then flew to NYC later in the evening. I reached the hotel late at night, around 3 am (my flight was delayed by over 2.5 hours), and went straight to bed. The next morning I was at the main event center (which, btw, was wonderful; I loved the conference center at Cornell Tech), saw the schedule on a piece of paper, and noticed my paper's name with an asterisk. The asterisk meant the paper was a best paper candidate. I was pleasantly surprised.

My presentation was right after lunch. So, I made a joke about everybody feeling sleepy as my icebreaker and off I went. I think the presentation went well and the Q/A session went well too. 

That evening we had the demo night and the reception, which was a great time to mingle and talk with others. I had a great time manning the OpenFPGA booth with some colleagues from Prof. Pierre-Emmanuel's lab (at Utah).

The next day we had the rest of the presentations, and then came time for the award ceremony. Here we go... 1, 2, 3... and our paper won the best paper award. I was so happy. In my excitement, I exited the room so fast that I didn't even interact much with others after getting the award (lesson for next time: don't do that). So, there it was: a set of certificates in my hand, one for each author. When I came back to Austin, my advisor sent the news to the department and it was published on the department webpage :)

Anyway, the highlight of the trip was that I made so many new friends. If I start listing names, I will probably miss some. I loved it! I wish I had gotten more chances to travel and attend conferences, but Mr. COVID decided to stay for 2 long years! Hoping to get to travel more!

Pasting below two of the main slides from my presentation that highlight the main features of the CoMeFa work.

eFPGAs - What the heck is that?

I remember looking at an SoC architecture diagram when I worked at NVIDIA (sometime in 2012-13). There was a section of the diagram that was full of tiny blocks like I2C, SPI, and QSPI. It reminded me of similar peripherals I had seen earlier, like UART, CAN, etc. I was like, who needs all these blocks? And most of these (I believe) are older protocols that don't need to run at super fast clocks.

What came to mind at the time was that if we had an FPGA, we wouldn't need so many separate blocks. We could configure the FPGA with whatever we needed to use (I should mention here that I was naively assuming only one of these peripherals is required at a time, which I now believe is not true). I didn't know enough to understand that one could integrate an FPGA fabric into an SoC. I thought they were just different kinds of chips and that was that.

Fast forward to 2019: I submitted a paper to one of the top FPGA conferences (ISFPGA'20). I won't go into the details/context of the paper because that's not relevant here. The relevant point is that in the paper I had talked about Xilinx and Altera (Intel) FPGAs in the Related Work section. One of the reviewers mentioned that I should also mention other FPGA vendors like Achronix and Flexlogix (because they had done work relevant to what I was proposing in the paper). Anyway, that was the first time I came across these FPGA companies. And when I checked their websites, I saw that both of them have product lines called eFPGAs.

eFPGAs are Embedded FPGAs: an FPGA fabric that can be integrated as a block into another chip. Think of it as an SoC peripheral, except that it is not a hard component (like an SPI block or an encoder block), but a soft-logic FPGA fabric. So, technically, an eFPGA is an IP that you typically buy from a vendor (like Achronix or Flexlogix) and place as a peripheral in your SoC. The eFPGA IP block interfaces with the rest of the SoC just like other peripherals do (through an AXI interface, for example). This gives the SoC a nifty feature: it makes it future proof (to some extent). You could configure this eFPGA peripheral to behave like anything you may need (even something that's not known today). Pretty cool huh!

Now, if you think about it, modern FPGAs from the last, say, 10-15 years are not really "pure FPGAs". They are heterogeneous: they have many things on them other than the soft logic (logic blocks), for example, hard memory controllers, I/O blocks, transceivers, etc. Most importantly, look at an SoC FPGA (like the Zynq series from Xilinx). These have a processor system on them that is connected to the FPGA fabric. Technically, one could say that the FPGA part of such a chip is an eFPGA; it's just that the area consumed by the FPGA fabric is much larger than the area consumed by the processor. If you reverse that (large processor area and small FPGA fabric), then that's what we commonly refer to as a chip with an eFPGA on it.

How I started research

It all started in late 2018 when I went to UT Austin for a guest lecture. I used to go there almost every year, to the class I TA'ed back in my Masters days (Digital System Design using HDLs), to deliver a lecture on design verification and SystemVerilog.

I told my "would-be" advisor (Prof. Lizy John) that I was moving to the Deep Learning (DL) architecture team at NVIDIA. We got to talking about research in the area of DL (I had expressed interest in research to her, off and on, over the years). She discussed with me the idea of making an FPGA specifically for neural networks.

Neural networks contain 'neurons' of various shapes and forms, connected together in various ways. They can perform an immense variety of tasks (after being trained, of course). The hardware used to run these networks has evolved dramatically over the last decade - from CPUs, to GPUs, to TPUs, to custom FPGA-based solutions.

FPGAs consist of configurable computational and storage blocks connected via a highly flexible interconnect. A programmer can program specific behaviors into the computational blocks and the storage blocks, and connect them to their liking using the interconnect. 

What if we were to design a hardware substrate that was like an FPGA but made for neural networks? Instead of the logic blocks, DSP slices, and RAMs typically found on an FPGA, we would have "neurons" floating in a sea of wires, and a programmer could connect the neurons to their liking to realize a neural network. It would be an "FPNA" (Field Programmable Neural Array).

This is how my PhD research started. Very soon, I found that FPNA was a term that already existed and some researchers had tried this a long time back. As I looked more into neural networks, I realized that a neuron is essentially a MAC (multiply-accumulate) unit, and DSP slices are pretty much fancy MAC units on an FPGA. So, there you go. It looked like we already had what we wanted to build.

But then we realized that the "unit of computation" in neural networks is not really the MAC any more; it is matrix-matrix multiplication. And connecting MACs (or DSP slices) on an FPGA to build a matrix-matrix multiplier is pretty inefficient. So, we saw an opportunity to add matrix-multiplier blocks to an FPGA. Exploring the benefits of adding such blocks was the first project of my PhD.
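
To make the "it's all MACs" point concrete, here is a tiny Python sketch (my own illustration, not code from any of our papers): a neuron boils down to a multiply-accumulate loop, and a matrix-matrix multiply is just many such MACs.

```python
# A minimal sketch (illustration only): a "neuron" is a multiply-accumulate
# (MAC) over its inputs, and a matrix-matrix multiply is just many MACs.

def neuron(inputs, weights, bias=0.0):
    """Dot product of inputs and weights plus a bias, i.e., a chain of MACs."""
    acc = bias
    for x, w in zip(inputs, weights):
        acc += x * w  # one MAC operation
    return acc

def matmul(a, b):
    """C = A x B built entirely out of MACs; a is MxK, b is KxN (lists of lists)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                c[i][j] += a[i][k] * b[k][j]  # again, just MACs
    return c

if __name__ == "__main__":
    print(neuron([1.0, 2.0, 3.0], [0.5, -1.0, 2.0]))   # 4.5
    print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```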

Interestingly, unbeknownst to me, what I was doing was effectively adding "Tensor Cores" to FPGAs. I was working at NVIDIA at the time, in the Deep Learning architecture team, where Tensor Cores had recently been added to the GV100 chip. The resemblance didn't strike me for a long time. In fact, the name "Tensor Slice" was born when I did realize the resemblance.

Are FPGAs computers?

The first time I learned about FPGAs, I was kinda blown away. I think it was around the same time I was learning VHDL. I was like, we can write C-like code to design circuits.. whaaaaa....t !!!

After getting over their apparent coolness, I tried to understand them more. Over time, my understanding of FPGAs has evolved.

I used to think of them as just things for hobbyists or for lab classes. You could try out a simple hardware circuit and see how it works. "Real hardware doesn't use FPGAs" is what I thought.

Then when I joined the industry, I saw FPGAs being used for prototyping chips. Large chips could be put on multiple FPGAs and simulated/emulated. That's a pretty neat way to verify your design, and much faster than software-based simulation.

Then when I came to UT and took the SoC Design class, I was introduced to the concept of "accelerators". I got to know what "hardware acceleration" meant, and that FPGAs are commonly used for accelerating parts of software that are too slow when run on a CPU (because they are compute-intensive).

I found that there are applications that deploy FPGAs directly instead of designing ASICs, because designing an ASIC is too expensive and standards keep changing, making fixed hardware obsolete very quickly. Examples include chips used in communications.

Some time later, the idea of parallelism was thrown at me: FPGAs are essentially "parallel computers". I hadn't thought of FPGAs as computers until then. (Side note: Speaking of not having thought about something in a certain way: when I was discussing Google's TPU with Prof. John, she casually mentioned, "Look at the area spent on the TPU for doing actual compute. Compare that to the area spent on a modern CPU for doing actual compute." I was like, wow. That's an interesting way to look at it. I hadn't thought about it that way until then.)

So, I guess FPGAs are a lot of things: learning tools for hardware engineers, chips for prototyping ASICs, accelerators for compute-intensive applications, and chips deployed directly in some applications. But at the very basic level, they are computers. It's just that this "computer" is not your typical "personal computer", but a machine that computes and does things intelligently.

Programming FPGAs

When using FPGAs, it can get confusing to understand what "programming an FPGA" actually means. That's because there is at least one more layer involved in programming FPGAs than in programming typical computers (CPUs or GPUs), and also because FPGA programmers have very different skill sets compared to software programmers. HLS has tried to narrow this gap, but the gap nevertheless exists.

For a typical computer (CPU or GPU), a software engineer (well, actually anyone) writes a program, compiles it for that computer and runs it. That's it. The compilation process is very fast.

Historically, an FPGA programmer is a hardware engineer, someone who understands how to write digital designs in languages like VHDL and Verilog. After writing the design, they "compile" the design to convert it into a bitstream, which is "configured" onto the FPGA. The compilation process is many times slower than compiling for a regular computer.

So, simplistically, a circuit description is the "program" for the FPGA.

Typically, the circuit configured onto the FPGA does one specific thing. There is no software involved after that. If you want to do something else using the same FPGA, you write a new circuit, compile it, and put the new bitstream onto the FPGA. This takes a long time.

What if we create a circuit that has a high degree of programmability? Such circuits are called overlays. An overlay is a circuit that is configured onto an FPGA but can be controlled/programmed by a piece of software. In this case, a programmer designs the overlay circuit, puts it on the FPGA, and then another piece of software can be written to control it. To get the FPGA to do something else, you now just need to write a new piece of software that controls the overlay, instead of writing a new circuit. You save time because you don't need to compile another circuit for the FPGA; the FPGA's bitstream remains the same. Many domain-specific overlays have been developed for various applications. A recent one is the Brainwave design from Microsoft for DL applications.
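
To illustrate the idea (a made-up sketch; this is not how Brainwave or any specific overlay is actually programmed), imagine an overlay that exposes a few memory-mapped registers. Changing what the FPGA does is then just a matter of issuing different register writes from software; the register offsets and opcodes below are hypothetical.

```python
# A made-up sketch of driving an overlay from software. The register map and
# opcodes are hypothetical; a real overlay defines its own interface.

# Hypothetical memory-mapped register offsets exposed by the overlay
REG_OPCODE = 0x00   # which operation the overlay should perform
REG_SRC    = 0x04   # address of the input data in on-board memory
REG_DST    = 0x08   # address where results should be written
REG_START  = 0x0C   # write 1 to kick off the operation
REG_STATUS = 0x10   # reads non-zero when the operation is done

OP_VECTOR_ADD = 1   # hypothetical opcodes understood by the overlay
OP_VECTOR_MUL = 2

class OverlayDriver:
    """The software "program" for the FPGA: the bitstream never changes,
    only the register writes issued by this code do."""

    def __init__(self, write_reg, read_reg):
        # write_reg/read_reg would come from the platform (e.g., an mmap of
        # the device's address space); here they are just injected callables.
        self.write_reg = write_reg
        self.read_reg = read_reg

    def run(self, opcode, src_addr, dst_addr):
        self.write_reg(REG_OPCODE, opcode)
        self.write_reg(REG_SRC, src_addr)
        self.write_reg(REG_DST, dst_addr)
        self.write_reg(REG_START, 1)              # go!
        while self.read_reg(REG_STATUS) == 0:
            pass                                  # poll until the overlay is done

# Switching from vector-add to vector-multiply is a software-only change:
#   driver.run(OP_VECTOR_ADD, 0x1000, 0x2000)
#   driver.run(OP_VECTOR_MUL, 0x1000, 0x2000)
```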

Now, I lied above when I said "a circuit description is the program for the FPGA". With the introduction of HLS (High-Level Synthesis), this is not entirely true anymore. A C, C++, or SystemC program can be compiled directly into a bitstream and loaded onto the FPGA. This makes programming FPGAs look similar to programming typical computers. However, things aren't that rosy. There are a lot of things the programmer has to do in their program to make it work well. In other words, they still need to know some aspects of the underlying FPGA architecture and the hardware their code should/will generate.

Of course, there is the question of why it is so hard to program FPGAs. Programming a device goes hand in hand with the abstraction it exposes. FPGAs expose much more hardware-level detail than a typical computer does. That's why they are hard to program. Simple. The fewer degrees of freedom you provide to the programmer (i.e., the less hardware-level detail you expose), the easier it gets to program. FPGAs expose so much hardware that, typically, a hardware engineer is needed to program an FPGA efficiently. [Side note: This is related to why compilation for an FPGA is slow. There are so many degrees of freedom for the compiler; its design space is huge. That's not the case for a normal computer.]

So, programming FPGAs can be made easier if we can create efficient abstractions that cleverly expose only what the programmer needs to see. Overlays help with this significantly. HLS takes a slightly different approach to achieve the same goal of easy programmability.

A cartoon representation of an FPGA with Tensor Slices added to it

Block diagram of a Tensor Slice

Interface of the Tensor Slice

Tensor Slices

Tensor Slices are to Machine Learning what DSP Slices are to Digital Signal Processing.

When FPGAs came into existence, they were very homogeneous. They contained logic blocks connected via a routing fabric. However, at some point, it was realized that many common operations (like multiplication and MACs) appear in the applications FPGAs were used for. This was especially true for DSP applications, which were (and are) a very prevalent use case for FPGAs. So, multiplier and MAC blocks were added to the FPGA fabric. These were called "hard" blocks because they only do a few specific things and do them well (as opposed to logic blocks, which are called "soft" blocks because they can implement pretty much any digital logic). RAM blocks were added for storage in the FPGA fabric as well (although I think that happened before DSP slices were added). FPGAs slowly became more and more heterogeneous through the addition of more "hard" blocks in their fabric.

With the prevalence of DL/ML, FPGAs are at an interesting juncture again. FPGAs are used for many DL applications, in the cloud and at the edge, and tensor/matrix operations are at the heart of DL. So, it seems logical that adding hard blocks specializing in tensor/matrix operations to an FPGA can help make FPGAs better at accelerating DL applications.

We proposed adding blocks called Tensor Slices to FPGAs. These blocks support matrix-matrix multiplication and elementwise matrix addition, subtraction, and multiplication for various sizes and precisions (specifically, 4x4 fp16 and 8x8 int8). The slices have a systolic processing element array at their core, as seen in the image on the left. Adding Tensor Slices increases the compute density of FPGAs (more TMACs/second per unit area).
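
For intuition about what the systolic PE array at the heart of such a block computes, here is a small functional model in Python. It is only a sketch under my own assumptions (an output-stationary dataflow, N = 4), not the actual Tensor Slice hardware.

```python
# Functional model of an output-stationary systolic PE array (a sketch, not
# the Tensor Slice RTL). Each PE holds one accumulator, multiplies the
# operands streaming past it, and forwards A rightwards and B downwards.

N = 4  # N x N PEs computing C = A x B for N x N matrices

def systolic_matmul(a, b):
    acc = [[0.0] * N for _ in range(N)]    # one accumulator per PE
    a_reg = [[0.0] * N for _ in range(N)]  # A operand register inside each PE
    b_reg = [[0.0] * N for _ in range(N)]  # B operand register inside each PE
    for t in range(3 * N - 2):             # enough cycles to drain the array
        # Update PEs from the bottom-right so each PE reads its neighbors'
        # previous-cycle values (this models the register boundaries).
        for i in reversed(range(N)):
            for j in reversed(range(N)):
                # A enters row i from the left (skewed by i cycles);
                # B enters column j from the top (skewed by j cycles).
                a_in = a_reg[i][j - 1] if j > 0 else (a[i][t - i] if 0 <= t - i < N else 0.0)
                b_in = b_reg[i - 1][j] if i > 0 else (b[t - j][j] if 0 <= t - j < N else 0.0)
                acc[i][j] += a_in * b_in   # the MAC performed by PE(i, j)
                a_reg[i][j] = a_in         # forward A to the PE on the right
                b_reg[i][j] = b_in         # forward B to the PE below
    return acc

if __name__ == "__main__":
    import random
    A = [[random.randint(0, 9) for _ in range(N)] for _ in range(N)]
    B = [[random.randint(0, 9) for _ in range(N)] for _ in range(N)]
    ref = [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)] for i in range(N)]
    assert systolic_matmul(A, B) == ref   # matches a plain matrix multiply
```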

We observed an average frequency increase of 2.45x, an average area reduction to 0.4x, and an average routing wirelength reduction to 0.4x across several ML benchmarks. We didn't see any noticeable degradation in the performance of non-ML benchmarks (for the cases where we spent up to 10% of the FPGA's area on Tensor Slices).

An interesting part of this research was how to perform FPGA architecture evaluation (special thanks to Dr. Vaughn Betz from the University of Toronto for his guidance). We've tried to explain it pretty well in the paper, and it uses mostly open-source tools! Read it if you're interested in this aspect (link in the next para).

Our initial research in this area started back in 2019, when we first proposed adding matrix multiplier blocks to an FPGA. This work was selected as a poster at ISFPGA'20 and went on to be published as a paper called Hamamu in ASAP'20. We made a lot of improvements (new features in the Tensor Slice and a revamp of the evaluation strategy), and that work was published as a paper called Tensor Slices in ISFPGA'21.

Intel recently (in July 2020) introduced a new FPGA called Stratix NX. These FPGAs have a block called the Tensor Block. These blocks are very similar to the Tensor Slice, although they have less compute throughput (30 int8 MACs instead of 64 in the Tensor Slice). A big advantage of these blocks is that they are a drop-in replacement for DSP slices, so they don't disturb the routing fabric of the base Stratix device. Intel coming out with these blocks around the same time we were proposing the Tensor Slice encourages us that our research is on the right track. I hope that users take advantage of Tensor Blocks for accelerating DL applications.

Benchmarks for FPGA research

Typically, when I used to think of benchmarks, I would think of pieces of software run on a personal computer or a mobile phone to compare the processors they run on. Geekbench comes to mind. Benchmark suites contain programs of various sizes and characteristics. Running them on different processors helps differentiate the processors from each other. Not only that, benchmarks are also used to drive new processor architectures.

When it comes to FPGAs though, what is a benchmark? A benchmark for an FPGA is a circuit (typically a design coded in Verilog or VHDL). A benchmark suite consists of multiple designs. 

Why do we need benchmarks? For two reasons:

One is for FPGA architecture research. A researcher comes up with a new FPGA architecture and runs benchmarks on this FPGA and on a baseline FPGA (VTR is the perfect tool for this). Comparing the results (resource usage, area, frequency, routing wirelength, etc.) between the new FPGA architecture and the baseline can help ascertain whether the new FPGA is better for these benchmarks or not.

The second is for FPGA CAD research. Benchmark circuits can be run through CAD tools (like Xilinx Vivado, Intel Quartus, or VTR) and the results can be used to improve the CAD algorithms.
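
As a tiny illustration of what the "compare the results" step looks like (my own made-up sketch with made-up numbers, not data from any paper), once you have per-benchmark metrics for a baseline and a proposed architecture, you typically compute per-benchmark ratios and summarize them with a geometric mean:

```python
# Made-up sketch of comparing two FPGA architectures across benchmarks.
# The numbers are invented; in practice they would be parsed from the
# reports produced by the flow (e.g., VTR).
import math

baseline = {
    "matmul": {"area": 1.00e6, "fmax_mhz": 180, "wirelength": 2.1e6},
    "conv2d": {"area": 1.40e6, "fmax_mhz": 165, "wirelength": 3.0e6},
}
proposed = {
    "matmul": {"area": 0.45e6, "fmax_mhz": 410, "wirelength": 0.9e6},
    "conv2d": {"area": 0.60e6, "fmax_mhz": 400, "wirelength": 1.2e6},
}

def geomean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

for metric in ("area", "fmax_mhz", "wirelength"):
    ratios = [proposed[b][metric] / baseline[b][metric] for b in baseline]
    print(f"{metric}: proposed/baseline geomean = {geomean(ratios):.2f}x")
```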

In the HLS world (that we live in today), however, the definition of a benchmark changes. Now, a C/C++/SystemC program (intended for an FPGA) is a benchmark for an FPGA. Note that if such benchmarks are used, an HLS tool is necessary for any research (either FPGA arch or CAD research).

VTR has several benchmark suites that come with it. See here!

One-size-fits-all doesn't work when it comes to benchmark suites, just like one FPGA doesn't work for all types of applications out there. Benchmarks need to be representative of the kind of workloads you're trying to optimize the FPGA architecture or CAD for. So, there are many benchmark suites, each targeted at a specific set of applications. For example, we recently created a DL-specific benchmark suite for FPGAs called Koios.

You can check out the Koios benchmark suite and how to use it here! We welcome contributions from the FPGA community to this suite. Please contact me by clicking the link on the top-right if you'd like to contribute.

Tools for FPGA research

The University of Toronto is the Mecca of FPGA research :)

They have developed tools that are fundamental for FPGA researchers, especially those doing FPGA architecture and CAD research. Best of all, these tools are open-source and actively being developed.

The main ones are:

Check them out! And contribute if you can!

How I came across VTR


This is an interesting story. Early in my research, I was exploring creating a new kind of FPGA (an FPGA that would have lots of neurons immersed in lots of wires). I needed to model this to see if it would actually benefit DL applications, and to explore ways in which the neurons could be placed/connected on this chip and what kind of interconnect should be provided.

Until then, I had only worked on FPGAs using tools from Xilinx (Vivado) and Intel (Quartus). In these tools, you "choose" an FPGA from a list of FPGAs that the manufacturer makes, and then you compile a design onto that FPGA. You can't play with the architecture of this FPGA itself. What I wanted to do was to be able to change the architectural parameters of this FPGA and see the results of the compilation (the resource usage, area, frequency, etc).

I was on the verge of writing a simple model of an FPGA in Python, but I figured I should look around first. I probably used Google, but I think I was too naive to know the right keywords for the search, so I couldn't find anything. Then I thought I should ask someone, so I posted on StackOverflow. Here's my post: Model of an FPGA - Stack Overflow

A few days later, someone with the username "Fosfor" replied and pointed me to VTR. That was exactly what I was looking for.

Thanks, Mr. Fosfor. You were a godsend. I will acknowledge you in my dissertation :)

Can RAMs compute?

Traditionally RAMs have been used for storage. But the "Processing-In-Memory" (PIM) paradigm has been gaining a lot of traction. Over the last decade, the computer architecture community has seen many ideas relating to PIM. Samsung recently released a DRAM chip that has embedded compute units.

FPGAs have a lot of RAM blocks integrated in their routing fabric. They are called Block RAMs or BRAMs. We took the idea of PIM and thought of applying it to FPGA BRAMs. This could help increase the compute density of FPGAs, while potentially reducing the energy consumption (because of reduced dependence on the FPGA interconnect/routing).

Supreet Jeloka et al. demonstrated the principle of logic-in-memory by fabricating a chip: multiple word lines are activated simultaneously, and the shared bit-lines are sensed, effectively performing logical AND and NOR operations on the data stored in the activated rows. Compute Caches and Neural Cache extended this work by adding bit-serial compute, which leads to precision-agnostic hardware. We take this technology and apply it to BRAMs.
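
Here is a purely functional illustration (in Python, not a circuit model) of what sensing two simultaneously activated rows gives you, and how other logic operations can be composed from the AND and NOR results:

```python
# Functional illustration of dual-wordline activation: sensing the shared
# bitlines of two activated rows yields bitwise AND and NOR of the stored
# words; other operations can be derived from these.

def read_two_rows(row_a, row_b):
    """row_a, row_b: lists of bits, one bit per bitline (column)."""
    bit_and = [a & b for a, b in zip(row_a, row_b)]        # 1 only if both cells store 1
    bit_nor = [1 - (a | b) for a, b in zip(row_a, row_b)]  # 1 only if both cells store 0
    return bit_and, bit_nor

row_a = [1, 0, 1, 1, 0, 0, 1, 0]
row_b = [1, 1, 0, 1, 0, 1, 0, 0]
a_and_b, a_nor_b = read_two_rows(row_a, row_b)
a_or_b  = [1 - n for n in a_nor_b]                          # OR  = NOT(NOR)
a_xor_b = [o & (1 - d) for o, d in zip(a_or_b, a_and_b)]    # XOR = OR AND NOT(AND)
print(a_and_b, a_nor_b, a_xor_b)
```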

We add some additional logic (an instruction memory and a controller) to the BRAM on an FPGA, as shown on the left. In compute mode, the controller reads instructions and performs computations on the data stored in the array in a bit-serial manner. Each operation takes a long time because of its serial nature, but many operations happen in parallel. We saw 80% energy savings and speedups ranging from 20% to 80%, depending on the application.
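
To give a feel for bit-serial compute, here is a simplified functional sketch in Python. It assumes the usual transposed data layout from the compute-cache style of work (bit b of every element stored in row b, so each column holds one element); the real operations happen on the BRAM array itself, not in software like this. An N-bit add takes roughly N serial steps, but every column is processed in parallel, which is where the throughput comes from.

```python
# Simplified functional sketch of bit-serial (but column-parallel) addition.
# Operands are stored "transposed": bit b of every element lives in row b,
# so each column holds one element.

NBITS = 8  # operand precision; bit-serial hardware is precision-agnostic

def to_bitplanes(values):
    """values -> NBITS rows, where row b holds bit b of every element."""
    return [[(v >> b) & 1 for v in values] for b in range(NBITS)]

def bitserial_add(a_planes, b_planes):
    cols = len(a_planes[0])
    carry = [0] * cols                  # one carry bit per column
    sum_planes = []
    for b in range(NBITS):              # the serial loop over bit positions
        a, c = a_planes[b], b_planes[b]
        s = [a[i] ^ c[i] ^ carry[i] for i in range(cols)]                        # sum bit
        carry = [(a[i] & c[i]) | (carry[i] & (a[i] ^ c[i])) for i in range(cols)]
        sum_planes.append(s)
    return sum_planes

def from_bitplanes(planes):
    cols = len(planes[0])
    return [sum(planes[b][i] << b for b in range(NBITS)) for i in range(cols)]

x = [3, 100, 27, 200]
y = [5, 28, 100, 55]
print(from_bitplanes(bitserial_add(to_bitplanes(x), to_bitplanes(y))))  # [8, 128, 127, 255]
```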

Our paper discussing this research will be published at the ASILOMAR conference in a few months. You can access it here!

A very similar work in this area was published by researchers at the University of Michigan and Intel in FCCM'21 ("Compute Capable Block RAMs for Efficient Deep Learning Acceleration on FPGAs"). They use the same technology with a slightly different BRAM implementation. They designed an accelerator architecture using these RAMs and showed significant benefits for real-life RNN/LSTM/GRU workloads. This strengthens the observation that bit-serial compute and processing-in-memory can help make FPGAs better DL accelerators.