Neural inferencing engine tops 100 TOPS with low DRAM bandwidth use

Flex Logix Technologies, Inc. announced that it has leveraged its core patent-protected interconnect technology from its embedded FPGA (eFPGA) line of business to launch a completely new product line focused on neural inferencing. Unveiled today in a presentation at the Linley Processor Conference in Santa Clara, the Flex Logix NMAX Neural Inferencing Engine delivers 1 to >100 TOPS of neural inferencing capacity in a modular, scalable architecture that requires a fraction of the DRAM bandwidth of existing neural inferencing solutions.

In neural inferencing, the computation consists primarily of trillions of multiply-accumulate (MAC) operations, typically using 8-bit integer inputs and weights, and sometimes 16-bit integers. The technology Flex Logix has developed for eFPGA is also ideally suited for inferencing for three reasons: eFPGA allows for reconfigurable, fast control logic for each network stage; SRAM in eFPGA can be reconfigured as needed, since each layer of a neural network can require different data sizes; and Flex Logix interconnects allow reconfigurable connections between SRAM input banks, MAC clusters, and activation to SRAM output banks at each stage.
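To make the arithmetic concrete, here is a minimal sketch of an 8-bit integer multiply-accumulate loop, the operation that dominates inferencing workloads. The function names `int8_dot` and `op_count` are hypothetical, and the choice to count one MAC as two operations (one multiply plus one add) is a common convention assumed here, not something the announcement spells out. It says nothing about how NMAX itself implements its MAC clusters.

```python
def int8_dot(inputs, weights):
    """Multiply-accumulate two equal-length vectors of signed 8-bit integers."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Python integers never overflow. Real hardware would keep the running sum
    # in a register wider than 8 bits (the exact width is an assumption here).
    acc = 0
    for x, w in zip(inputs, weights):
        if not (-128 <= x <= 127 and -128 <= w <= 127):
            raise ValueError("values must fit in a signed 8-bit integer")
        acc += x * w  # one MAC: one multiply followed by one accumulate
    return acc


def op_count(num_macs):
    """Operations performed, assuming each MAC counts as 2 ops (multiply + add)."""
    return 2 * num_macs


activations = [12, -7, 100, 3]
weights = [5, 9, -2, 127]
print("dot product:", int8_dot(activations, weights))
print("operations:", op_count(len(activations)))
```

A single neural network layer repeats this inner loop once per output value, which is how the operation count quickly reaches the trillions.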

The result is an NMAX tile of 512 MACs with local SRAM, which in 16nm has ~1 TOPS peak performance. NMAX tiles can be arrayed, without any GDS change, in configurations of whatever TOPS is required, with varying amounts of SRAM as needed to optimize for the target neural network model, up to >100 TOPS peak performance.
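The tile figures can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes each MAC counts as two operations and assumes a clock of about 1 GHz; the announcement does not state a clock frequency, so `ASSUMED_CLOCK_HZ` is purely illustrative, chosen because it lands near the quoted ~1 TOPS. The helper names are hypothetical.

```python
import math

MACS_PER_TILE = 512        # from the announcement
OPS_PER_MAC = 2            # assumption: multiply + accumulate counted separately
ASSUMED_CLOCK_HZ = 1.0e9   # assumption: not stated in the announcement


def peak_tops(macs, clock_hz, ops_per_mac=OPS_PER_MAC):
    """Peak throughput in tera-operations per second if every MAC fires each cycle."""
    return macs * ops_per_mac * clock_hz / 1e12


def tiles_for(target_tops, tile_tops):
    """Smallest whole number of tiles whose combined peak meets the target."""
    return math.ceil(target_tops / tile_tops)


tile_tops = peak_tops(MACS_PER_TILE, ASSUMED_CLOCK_HZ)
print(f"peak per tile: {tile_tops:.3f} TOPS")
print("tiles for 100 TOPS peak:", tiles_for(100, tile_tops))
```

Under these assumptions a 100 TOPS array would need on the order of a hundred tiles. Peak numbers assume every MAC is busy on every cycle, which is why the utilization figures quoted for real models matter.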

For example, for YOLOv3 real-time object recognition, NMAX arrays can be generated in increasing size to process 1, 2 or 4 cameras with 2-megapixel inputs at 30 frames per second with batch size = 1. This is done with just ~10 GB/sec of DRAM bandwidth, compared to the hundreds of GB/sec of existing solutions. In this example, MAC utilization is in the 60-80% range, which is much better than existing solutions.

Another example is ResNet-50 for image classification. The three NMAX arrays mentioned above classify 4600, 9500 and 19,000 images/second respectively, all with batch size = 1. All of these throughputs are achieved with 1 DRAM and about 90% MAC utilization. By comparison, the Nvidia Tesla T4 needs a batch size of 28 to achieve 3920 images/second, achieving <25% MAC utilization while using 8 DRAMs. Lower batch sizes are very important for all edge applications and many data center applications in order to minimize latency, since long latency means slower response time.
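The ResNet-50 figures above can be normalized with some simple arithmetic. The sketch below uses only the numbers quoted in this article (the smallest NMAX array versus the Tesla T4); the function names are hypothetical. The batch-time estimate assumes a whole batch is processed as a unit at the quoted steady-state throughput, which makes it a rough proxy for latency rather than a measured value.

```python
def images_per_sec_per_dram(images_per_sec, dram_count):
    """Throughput normalized by the number of DRAM devices used."""
    return images_per_sec / dram_count


def batch_time_ms(batch_size, images_per_sec):
    """Time to work through one batch at the quoted throughput, in milliseconds.

    Assumption: results are only available once the whole batch completes.
    """
    return 1000.0 * batch_size / images_per_sec


# Figures as quoted in the article.
nmax = {"images_per_sec": 4600, "batch": 1, "drams": 1}   # smallest NMAX array
t4 = {"images_per_sec": 3920, "batch": 28, "drams": 8}    # Nvidia Tesla T4

for name, cfg in (("NMAX", nmax), ("T4", t4)):
    per_dram = images_per_sec_per_dram(cfg["images_per_sec"], cfg["drams"])
    wait = batch_time_ms(cfg["batch"], cfg["images_per_sec"])
    print(f"{name}: {per_dram:.0f} images/s per DRAM, ~{wait:.2f} ms per batch")
```

On these numbers the batch of 28 takes roughly 7 ms to complete, while a batch of 1 at 4600 images/second takes a fraction of a millisecond, which illustrates why small batch sizes matter for latency-sensitive applications.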

High MAC utilization means less silicon area/cost. Low DRAM bandwidth means fewer DRAMs, less system cost and less power.

NMAX is a general-purpose neural inferencing engine that can run any type of neural network (NN), from simple fully connected DNNs to RNNs and CNNs, and can run multiple NNs at a time. NMAX is programmed using TensorFlow and in the future will support other model description languages as well.

NMAX is in development now and will be available in the second half of 2019.
