An FPGA Based Accelerator for VLSI Artificial Neural Network Emulation

Analog \ac{VLSI} circuits are being used successfully to implement \acp{ANN}. These analog circuits exhibit nonlinear transfer function characteristics and suffer from device mismatches, degrading network performance. Because of the high cost involved with analog \ac{VLSI} production, it is beneficial to predict implementation performance during design. We present an \ac{FPGA}-based accelerator for the emulation of large (500+ synapses, 10k+ test samples) single-neuron \acp{ANN} implemented in analog \ac{VLSI}. We used hardware time-multiplexing to scale network size and maximize hardware usage. An on-chip \ac{CPU} controls the data flow through various memory systems to allow for large test sequences. We show that \ac{BRAM} availability is the main implementation bottleneck and that a trade-off arises between emulation speed and hardware resources. However, we can emulate large amounts of synapses on an \ac{FPGA} with limited resources. We have obtained a speedup of 30.5 times with respect to an optimized software implementation on a desktop computer.

Our work emulates single-neuron \acp{ANN} implemented in analog \ac{VLSI}. Furthermore, we focus on the implementation of the \ac{LMS} algorithm as a proof of concept. Before exploring the implementation of the emulator, we present the mathematical concepts for a single neuron. We also introduce the transfer functions of the analog \ac{VLSI} circuits used to implement artificial synapses, as well as techniques for emulating these transfer functions.
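For concreteness, the single-neuron computation and the \ac{LMS} update can be summarized as follows (our notation, standard for the \ac{LMS} algorithm rather than taken from the analog circuit derivation):
\begin{align}
  y[n]     &= \sum_{i=1}^{N} w_i[n]\,x_i[n],\\
  e[n]     &= d[n] - y[n],\\
  w_i[n+1] &= w_i[n] + \mu\,e[n]\,x_i[n],
\end{align}
where $x_i[n]$ are the synapse inputs, $w_i[n]$ the synaptic weights, $d[n]$ the desired output, $e[n]$ the error, and $\mu$ the learning rate. In the emulator, the ideal product and weight storage are replaced by the nonlinear transfer functions of the analog multiplier and memory cells.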

The primary purpose of this project is to overcome resource constraints through the re-use of hardware blocks, enabling the emulation of large networks. We refer to this technique as \textit{temporal synapse slicing}. A \textit{temporally sliced synapse} consists of an emulated multiplier cell, an emulated memory cell, and control hardware, which together mimic the function of one artificial synapse as implemented in analog \ac{VLSI}. We re-use this single temporally sliced synapse, physically implemented on the \ac{FPGA}, to emulate multiple artificial synapses over time. A temporal slice refers to all temporally sliced synapses in the system at a single point in time. For each sample, all sliced synapses are operated sequentially.
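A minimal software sketch of temporal synapse slicing is given below; the slice count, synapses-per-slice count, and the ideal multiplier-cell stand-in are illustrative assumptions, not the actual hardware interface:
\begin{verbatim}
#include <stdio.h>
#include <stddef.h>

#define SYNAPSES_PER_SLICE 5    /* physical synapse emulators on the FPGA (illustrative) */
#define NUM_SLICES         100  /* temporal slices -> 500 emulated synapses              */
#define NUM_SYNAPSES (SYNAPSES_PER_SLICE * NUM_SLICES)

static double weights[NUM_SYNAPSES];   /* stand-in for the per-synapse weight storage */

/* Stand-in for the multiplier-cell emulator; the real block applies the
 * nonlinear transfer function of the analog VLSI multiplier instead of
 * an ideal product. */
static double multiplier_cell(double weight, double input)
{
    return weight * input;
}

/* One sample: the few physical synapse emulators are reused sequentially
 * across all temporal slices, so 500 artificial synapses are emulated
 * with only SYNAPSES_PER_SLICE hardware instances. */
static double emulate_sample(const double inputs[NUM_SYNAPSES])
{
    double sum = 0.0;
    for (size_t slice = 0; slice < NUM_SLICES; ++slice)
        for (size_t s = 0; s < SYNAPSES_PER_SLICE; ++s) {
            size_t idx = slice * SYNAPSES_PER_SLICE + s;
            sum += multiplier_cell(weights[idx], inputs[idx]);
        }
    return sum;   /* neuron output for this sample */
}

int main(void)
{
    double inputs[NUM_SYNAPSES];
    for (size_t i = 0; i < NUM_SYNAPSES; ++i) { inputs[i] = 1.0; weights[i] = 0.01; }
    printf("neuron output: %f\n", emulate_sample(inputs));
    return 0;
}
\end{verbatim}
In the hardware, the slice \ac{BRAM} presumably holds the per-slice synapse state so the single physical synapse emulator can be reloaded for each slice.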

To allow for extensive tests (≥ 10k samples) to be performed with the emulator, sufficiently large memory systems are required. We used a test board with a Xilinx \ac{V2P} \ac{FPGA} for the implementation of the emulator system. It is equipped with a flash memory socket, a \ac{SDRAM} socket, and on-chip \ac{BRAM} for data storage. \ac{BRAM} alone is too small to store large data sets, while \ac{SDRAM} access times are higher than \ac{BRAM} access times. To maximize the amount of input test data while keeping access fast, we divide the data into smaller blocks before copying them to the memories with lower capacity. An embedded PowerPC (PPC) on-chip \ac{CPU} controls this data flow, dividing the test samples into data blocks. We used the Xilinx \ac{EDK} hardware/software development platform to implement the system.
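The block-wise data flow can be sketched as follows; the block size and the plain arrays standing in for \ac{SDRAM} and \ac{BRAM} are hypothetical placeholders, not the actual PowerPC firmware:
\begin{verbatim}
#include <stdio.h>
#include <stddef.h>
#include <string.h>

#define TOTAL_SAMPLES 10240   /* full test sequence held in SDRAM (illustrative) */
#define BLOCK_SAMPLES 256     /* block size chosen to fit in BRAM (illustrative) */

static double sdram_samples[TOTAL_SAMPLES];   /* stand-in for external SDRAM */
static double bram_block[BLOCK_SAMPLES];      /* stand-in for on-chip BRAM   */
static double checksum;

/* Placeholder for handing a block to the FPGA emulator hardware. */
static void process_block(const double *block, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        checksum += block[i];
}

/* The on-chip CPU walks through the large SDRAM test set in BRAM-sized
 * blocks, copying each block to fast on-chip memory before the emulator
 * processes it. */
static void run_test_sequence(size_t total_samples)
{
    for (size_t offset = 0; offset < total_samples; offset += BLOCK_SAMPLES) {
        size_t n = total_samples - offset;
        if (n > BLOCK_SAMPLES)
            n = BLOCK_SAMPLES;
        memcpy(bram_block, &sdram_samples[offset], n * sizeof(double));
        process_block(bram_block, n);
    }
}

int main(void)
{
    for (size_t i = 0; i < TOTAL_SAMPLES; ++i)
        sdram_samples[i] = 1.0;
    run_test_sequence(TOTAL_SAMPLES);
    printf("processed %d samples, checksum %f\n", TOTAL_SAMPLES, checksum);
    return 0;
}
\end{verbatim}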

The accelerator hardware design is organized into four main sections. The process controller is a \ac{FSM} that starts and monitors all sub-\acp{FSM} within the other hardware blocks and implements the slicing system. The \textit{Synapses Block} contains all temporally sliced synapses and consists of a multiplier cell emulator, a memory cell emulator, and a slice \ac{BRAM}. The \textit{Algorithm Block} consists of pipelined hardware multipliers that calculate all weight update values. The \textit{Addition Block} is an adder for the synapse outputs.
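To illustrate how these blocks cooperate on a single sample, the following sketch (an idealized software analogue, not the RTL) performs the Synapses/Addition Block accumulation followed by the Algorithm Block's \ac{LMS} weight update; the learning rate and ideal weight storage are illustrative assumptions:
\begin{verbatim}
#include <stdio.h>
#include <stddef.h>

#define NUM_SYNAPSES 500

static double weights[NUM_SYNAPSES];   /* slice BRAM contents (stand-in) */
static const double MU = 1.0;          /* LMS learning rate (illustrative) */

static double process_sample(const double x[NUM_SYNAPSES], double desired)
{
    double y = 0.0;

    /* Synapses Block + Addition Block: per-synapse products summed into the output. */
    for (size_t i = 0; i < NUM_SYNAPSES; ++i)
        y += weights[i] * x[i];

    /* Algorithm Block: pipelined multipliers compute the LMS weight updates.
     * In the emulator the updated weights would pass through the memory-cell
     * emulator rather than being stored ideally as they are here. */
    double error = desired - y;
    for (size_t i = 0; i < NUM_SYNAPSES; ++i)
        weights[i] += MU * error * x[i];

    return y;
}

int main(void)
{
    double x[NUM_SYNAPSES];
    for (size_t i = 0; i < NUM_SYNAPSES; ++i)
        x[i] = 0.01;
    for (int n = 0; n < 100; ++n)
        process_sample(x, 1.0);
    printf("output after training: %f\n", process_sample(x, 1.0));
    return 0;
}
\end{verbatim}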

We compared the emulator with the one presented in Section \ref{2008:emulator} in terms of performance. The slicing system was configured to use one slice and (the same selection of) five hardware synapses to mirror the basic system. The results were identical, validating that the slicing system proposed in this work preserves the behaviour of the original design. In addition, post-implementation resource results show that, despite the increase in resources over the initial design, the system remains a feasible choice for implementation on small \acp{FPGA}. An analysis for a Virtex 6 \ac{FPGA} shows that very large networks can be implemented in a single slice (140 simultaneous synapses), allowing considerably larger networks than the one tested on the \ac{V2P} board. Furthermore, after the \ac{PAR} process, we analyzed the system timing details for a single-neuron analog \ac{VLSI} emulator with two slices. All post-implementation system delays confirm that the network can operate on the \ac{V2P} board at its maximum clock of 100 MHz. In comparison to our previous \ac{CPU} implementation, a significant speed-up of 30.5 times has been achieved. We expect a Virtex 6 \ac{FPGA} to outperform the \ac{CPU} version by two to three orders of magnitude, owing to its higher clock speed and the larger number of artificial synapses that can be implemented simultaneously.

The contributions of this work are twofold. We implemented an \ac{FPGA}-based accelerator for realistic emulation of analog \ac{VLSI} neural networks and investigated the limits that the availability of \ac{FPGA} resources imposes on the number of synapses that we can emulate. First, we conclude that emulation of large analog \ac{VLSI} neural networks is feasible on an \ac{FPGA} platform. Second, we find that the availability of on-chip memory limits the number of test samples, but that external memory systems overcome this limitation. Our emulator captures the non-linearities of analog \ac{VLSI} implementations of artificial neural networks and enables convergence and performance analysis of large single-neuron \acp{ANN}. We show that it is possible to implement 500+ synapses even on an entry-level \ac{FPGA} with limited resources. We use hardware efficiently through temporal slicing of synapse emulator blocks and show that there is a trade-off between resources and emulation speed. Furthermore, we show that external memory systems, together with a \ac{CPU} for data flow control, overcome the limitation that on-chip memory places on the number of input samples, allowing for test sequences of more than 10k samples. Finally, our \ac{V2P} accelerator obtains a speedup of more than an order of magnitude compared to an optimized software implementation, while a similar implementation on a Virtex 6 is expected to achieve a speedup of two to three orders of magnitude.
