This work presents a reconfigurable array that performs blind source separation on a range of \ac{FPGA} devices. Our array uses \ac{ICA} with the InfoMax algorithm to separate a mixture of signals without an external reference. We describe two configurations of the array, representing distinct points in the design space. Our experimental results show a performance improvement of more than one order of magnitude over an optimized software implementation of the algorithm on a desktop computer, with a power consumption of 100[mW]. Our array successfully separates a fetal \ac{ECG} mixture into the source signals of mother and fetus, enabling medical analysis of the resulting independent components.
\ac{ICA} is a signal processing technique used to recover the sources from unknown mixtures captured by spatially distributed sensors. Because \ac{ICA} imposes only weak assumptions on the nature of the signals, the technique is widely used to perform blind source separation in medical applications such as \ac{ECG} and \ac{EEG} analysis, as well as in speech recognition, face classification, and data communications. Despite these advantages, most \ac{ICA} algorithms require high computational throughput to operate in real time. Typical software solutions on general-purpose computers are large and power hungry, and even digital signal processors do not meet the power, performance, and cost requirements of embedded and portable applications.
One of the most widely used \ac{ICA} algorithms is InfoMax, which separates the sources by minimizing the mutual information between the outputs. InfoMax computes the output vector as the product of the $n$-element input vector and a weight matrix; after each block of samples, the algorithm applies a nonlinear learning rule that updates the coefficients of the weight matrix by maximizing the entropy of the output. The algorithm executes in three main stages. First, an array of hardware multipliers performs a vector-matrix multiplication to compute the outputs, with the weights stored in on-chip RAM blocks. Second, for every block of samples, the output values are used to compute the InfoMax weight update, which involves evaluating the tanh function and a matrix multiplication. Third, the array applies the updates to the stored weight matrix and then normalizes and orthogonalizes the weight vectors, which requires multiplications, divisions, and square roots.
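As a point of reference, the following floating-point Python sketch outlines one block iteration of this procedure; the natural-gradient form of the update, the learning rate, and the Gram-Schmidt orthogonalization are illustrative assumptions and do not reproduce the fixed-point datapath of the array.
\begin{verbatim}
import numpy as np

def orthonormalize_rows(W):
    """Gram-Schmidt orthonormalization of the weight vectors (rows of W);
    this step needs only multiplications, divisions, and square roots."""
    Q = np.zeros_like(W)
    for i, w in enumerate(W):
        v = w - Q[:i].T @ (Q[:i] @ w)   # subtract projections onto previous rows
        Q[i] = v / np.sqrt(v @ v)       # normalize to unit length
    return Q

def infomax_block(W, X, lr=0.01):
    """One block iteration of InfoMax: separation, weight update, and
    orthonormalization of the weight vectors.

    W  : (n, n) unmixing weight matrix
    X  : (n, B) block of B mixed samples from n sensors
    lr : learning rate (illustrative value)
    """
    n, B = X.shape
    Y = W @ X                                   # separation: vector-matrix product per sample
    G = np.tanh(Y)                              # tanh nonlinearity of the learning rule
    dW = lr * (np.eye(n) - (G @ Y.T) / B) @ W   # natural-gradient InfoMax update (assumed form)
    return orthonormalize_rows(W + dW)          # apply update, then normalize/orthogonalize
\end{verbatim}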
We exploit the available data parallelism with two different strategies: one within each stage (the fully-parallel configuration) and another between stages, using pipelining (the folded configuration). The scheduling of the computations differs between these two configurations of the array; however, both rely on the same set of base functions, such as tanh, square root, and division, implemented as dedicated processing blocks developed for efficient execution in the \ac{FPGA}. The fully-parallel implementation targets large \ac{FPGA} devices to maximize performance at the expense of die area. This approach breaks the algorithm into four stages (separation, update, orthogonalization, normalization), each further decomposed into sub-stages that execute concurrently in a pipelined fashion. The number of hardware multipliers available on the chip ultimately constrains how many operations of the algorithm we can perform in parallel.
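To make the within-stage parallelism concrete, the sketch below is a simplified software model of the fully-unrolled separation stage; assigning one hardware multiplier to each product and one adder tree to each output is an assumption for illustration, not the exact mapping used by the array.
\begin{verbatim}
def separation_stage_fully_parallel(W, x):
    """Software model of the fully-unrolled separation stage: every product
    W[i][j]*x[j] is assigned to its own hardware multiplier, so an n-input
    array uses n*n multipliers per sample; each output is then summed by an
    adder tree (a simplified view of the mapping)."""
    n = len(x)
    products = [[W[i][j] * x[j] for j in range(n)] for i in range(n)]  # n*n parallel multipliers
    return [sum(row) for row in products]                              # n adder trees
\end{verbatim}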
The folded configuration, on the other hand, aims to fit the array into smaller \ac{FPGA} devices: we fold each stage of the architecture onto itself so that it processes fewer input elements simultaneously. Instead of performing every operation of each sub-stage in parallel, we use time multiplexing to share hardware resources among the elements of an input vector. A software tool developed in Matlab\texttrademark\ assists the designer in folding the architecture according to the resources available in the target device and generates the corresponding \ac{HDL} code. As a result, the folded configuration uses approximately 25\% of the resources of the fully-parallel version, at the cost of a longer execution time and a slightly more complex control structure.
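The following sketch models the same separation stage after folding, with a bank of shared multipliers time-multiplexed over the products; the bank size and the row-major schedule are illustrative assumptions, since the actual schedule is derived by the folding tool from the resources of the target device.
\begin{verbatim}
def separation_stage_folded(W, x, n_mults=4):
    """Software model of the folded separation stage: a bank of n_mults shared
    multipliers is time-multiplexed over the n*n products, trading clock cycles
    for area (the real schedule is generated by the folding tool)."""
    n = len(x)
    y = [0.0] * n
    jobs = [(i, j) for i in range(n) for j in range(n)]   # all n*n products to compute
    for start in range(0, len(jobs), n_mults):            # one batch of products per pass
        for i, j in jobs[start:start + n_mults]:
            y[i] += W[i][j] * x[j]                        # shared multiplier + accumulator
    return y
\end{verbatim}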
We mapped the fully-parallel configuration of the array to a Xilinx \ac{V2P} platform \ac{FPGA} and tested it on a 4-input blind source separation experiment with InfoMax; the algorithm converges in approximately 7.5[ms]. The folded version, mapped onto a Xilinx Spartan3 XC3S1000 \ac{FPGA}, executes in less than 110[ms] for the same experiment. For comparison, we implemented the InfoMax algorithm in software on a laptop PC using floating-point arithmetic, which completes the experiment in 85[ms]. Thus, the fully-parallel configuration reduces the execution time of the software version by a factor of 11.3, while the software version runs 1.3 times faster than the folded array. The power consumption of both hardware implementations is more than two orders of magnitude lower than that of the computer.
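For reference, the quoted factors follow directly from these execution times, taking the 110[ms] upper bound for the folded array:
\[
\frac{t_{\mathrm{software}}}{t_{\mathrm{parallel}}} = \frac{85~\mathrm{ms}}{7.5~\mathrm{ms}} \approx 11.3,
\qquad
\frac{t_{\mathrm{folded}}}{t_{\mathrm{software}}} \approx \frac{110~\mathrm{ms}}{85~\mathrm{ms}} \approx 1.3.
\]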
We described an array architecture for \ac{ICA} using the InfoMax algorithm. The array can be reconfigured to target a wide range of \ac{FPGA} devices, representing different price\slash performance trade-offs. We developed two configurations: a fully-parallel version mapped to a Xilinx \ac{V2P} XC2VP30, and a folded implementation mapped to a Xilinx Spartan3 XC3S1000. The fully-parallel array outperforms the folded configuration by a factor of 14.5 and the PC-based software implementation by a factor of 11.3. Both hardware arrays consume on the order of 100[mW], use 18-bit fixed-point arithmetic, and achieve results within 0.08\% of a floating-point software implementation of the algorithm.
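As a rough illustration of the precision that an 18-bit fixed-point word provides, the sketch below quantizes a single tanh value and measures its relative deviation; the Q2.16 integer\slash fraction split is an assumption, and the 0.08\% figure above refers to the separated outputs of the whole algorithm rather than to a single quantization step.
\begin{verbatim}
import numpy as np

def quantize_18bit(x, frac_bits=16):
    """Round to a signed 18-bit fixed-point word with `frac_bits` fractional
    bits (the Q2.16 split is an assumption; the text only specifies 18-bit
    words), saturating at the representable range."""
    scale = 1 << frac_bits
    lo, hi = -(1 << 17), (1 << 17) - 1            # signed 18-bit integer range
    return np.clip(np.round(np.asarray(x) * scale), lo, hi) / scale

# Single-value illustration: quantization error of one tanh evaluation.
# This is not the end-to-end 0.08% figure, which compares separated outputs.
exact = np.tanh(0.731)
print(abs(quantize_18bit(exact) - exact) / abs(exact))
\end{verbatim}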