Quantized DNN MACs

Quantization of deep neural network (DNN) multiply-accumulate units (MACs)

Machine learning is extremely popular today: we have all likely interacted with it at some point through our phones, computers, or even coffee machines. Deep neural networks (DNNs), a core building block of modern machine learning, perform large numbers of multiplications followed by accumulations (additions), and the results of each layer feed the next. At a very high level, the end result can look like accurately identifying an image of a cat or performing a language translation task. The many multiply-accumulate operations (MACs) these computations require are costly. Quantization, which reduces the precision of the values used in those computations in exchange for speed, is a common way to lower that cost in deep neural network architectures. In CPRE 482X (Machine Learning Hardware Design), our team proposed a project to weigh the accuracy of a quantized deep neural network against its effects on hardware design area, power, and timing.
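To make the idea concrete, here is a minimal sketch of a single MAC computed once in floating point and once with 8-bit quantized integers. It assumes symmetric linear quantization (one scale factor per vector), which is only one of several possible schemes and is not necessarily the one our project used. The function names `quantize` and `mac` and the input and weight values are made up for illustration.

```python
def quantize(values, num_bits=8):
    """Symmetric linear quantization (an assumed scheme): floats -> signed ints."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def mac(xs, ws):
    """Multiply-accumulate: sum of element-wise products."""
    acc = 0
    for x, w in zip(xs, ws):
        acc += x * w
    return acc


# Illustrative values only, not data from the project.
inputs = [0.5, -1.0, 0.25, 0.75]
weights = [0.2, 0.4, -0.6, 0.1]

# Reference result using ordinary floating-point arithmetic.
float_result = mac(inputs, weights)

# Quantized path: the multiplications and additions use integers only.
q_in, s_in = quantize(inputs)
q_w, s_w = quantize(weights)
int_acc = mac(q_in, q_w)

# Rescale the integer accumulator back to a real value for comparison.
quant_result = int_acc * s_in * s_w

print("float MAC:    ", float_result)
print("quantized MAC:", quant_result)
```

The integer accumulator is the part that maps onto cheaper hardware: narrow integer multipliers and adders take less area and power than 32-bit floating-point units, at the cost of a small rounding error in the final result.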

Earlier in the semester, through labs, our team gained insight into implementing an entire DNN (specifically focused on image recognition) in C++. That implementation was built around 32-bit floating-point numbers. We wanted to see what would change if we used quantization to run it with 8-bit or 4-bit values instead. After looking through the data we collected, we concluded that the 8-bit implementation would be the best choice for a general-purpose DNN: it is still fairly accurate while greatly improving area, timing, and power in the hardware implementation. The other implementations still have merit. The 32-bit implementation would be a good fit if you needed extremely accurate results and were not concerned with hardware metrics, and the 4-bit implementation would be a good fit if you had very strict hardware requirements and could tolerate lower accuracy. We also gained valuable experience with tools like Genus, TensorFlow, and ModelSim, and with programming in Verilog (a hardware description language). My work was largely focused on initial design and debugging/testing.
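The accuracy side of this trade-off can be sketched in a few lines. The example below rounds a set of made-up weights to 8-bit and 4-bit levels and measures how far they move from their original values. It again assumes symmetric linear quantization, and the helper names `quantize_dequantize` and `mean_abs_error` are hypothetical. It is not our project's measurement flow, and it says nothing about area, power, or timing, which we evaluated in hardware tools.

```python
def quantize_dequantize(values, num_bits):
    """Round values to num_bits signed levels, then map back to floats."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original, approx):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


# Made-up weights evenly spaced between -1 and 1 (illustrative only).
weights = [i / 50.0 - 1.0 for i in range(101)]

errors = {}
for bits in (8, 4):
    errors[bits] = mean_abs_error(weights, quantize_dequantize(weights, bits))
    print(f"{bits}-bit mean absolute rounding error: {errors[bits]:.5f}")
```

Fewer bits means fewer representable levels, so each value is rounded further from where it started. That is the mechanism behind the accuracy loss when moving from 8-bit to 4-bit values.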

This team experience allowed me to further develop my teamwork skills and helped me understand how to delegate tasks within a group based on each member's talents and skill level. Building on my experience in CPRE 381 (see MIPS Processor page), I found that working on a project team let me grow alongside my teammates and gave me a better sense of what real computer engineering work looks like. There was a lot of uncertainty in this project because this work had not been done before, but ultimately, that is what the engineering process looks like.