2017 SCNN: An Accelerator for Compressed-Sparse Convolutional Neural Networks

From GM-RKB

Subject Headings: Sparse Neural Network, CNN Inference, CNN Training.

Notes

Cited By

Quotes

Abstract

Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning. High performance and extreme energy efficiency are critical for deployments of CNNs, especially in mobile platforms such as autonomous vehicles, cameras, and electronic personal assistants. This paper introduces the Sparse CNN (SCNN) accelerator architecture, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and zero-valued activations that arise from the common ReLU operator. Specifically, SCNN employs a novel dataflow that enables maintaining the sparse weights and activations in a compressed encoding, which eliminates unnecessary data transfers and reduces storage requirements. Furthermore, the SCNN dataflow facilitates efficient delivery of those weights and activations to a multiplier array, where they are extensively reused; product accumulation is performed in a novel accumulator array. On contemporary neural networks, SCNN can improve both performance and energy by a factor of 2.7x and 2.3x, respectively, over a comparably provisioned dense CNN accelerator.

I. INTRODUCTION

Driven by the availability of massive data and the computational capability to process it, deep learning has recently emerged as a critical tool for solving complex problems across a wide range of domains, including image recognition [20], speech processing [12], [16], [2], natural language processing [8], language translation [10], and autonomous vehicles [21]. Convolutional neural networks (CNNs) have become the most popular algorithmic approach for deep learning for many of these domains. Employing CNNs can be decomposed into two tasks: (1) training — in which the parameters of a neural network are learned by observing massive numbers of training examples, and (2) inference — in which a trained neural network is deployed in the field and classifies the observed data. Today, training is often done on GPUs [24] or farms of GPUs, while inference depends on the application and can employ CPUs, GPUs, FPGAs, or specially-built ASICs.

During the training process, a deep learning expert will typically architect the network, establishing the number of layers, the operation performed by each layer, and the connectivity between layers. Many layers have parameters, typically filter weights, which determine their exact computation. The objective of the training process is to learn these weights, usually via a stochastic gradient descent-based excursion through the space of weights. This process typically employs a forward-propagation calculation for each training example, a measurement of the error between the computed and desired output, and then back-propagation through the network to update the weights. Inference has similarities, but only includes the forward-propagation calculation. Nonetheless, the computation requirements for inference can be prohibitively large, particularly with the emergence of deeper networks (hundreds of layers [17], [18], [26], [29]) and larger input sets, such as high-definition video. Furthermore, the energy efficiency of this computation is important, especially for mobile platforms, such as autonomous vehicles, cameras, and electronic personal assistants.
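To make the two tasks concrete, the following minimal sketch (not from the paper; the layer size, learning rate, and data are hypothetical) contrasts one SGD training step, which runs forward propagation, measures the error, and back-propagates a weight update, with an inference pass, which runs forward propagation only.

```python
# Minimal sketch (not from the paper): one SGD training step vs. an inference
# pass for a single hypothetical fully-connected layer with a ReLU.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 784)) * 0.01    # hypothetical weight matrix
x = rng.standard_normal(784)                 # one training example
target = np.zeros(10)
target[3] = 1.0                              # hypothetical one-hot label

def forward(W, x):
    return np.maximum(W @ x, 0.0)            # forward propagation with ReLU

# Training: forward pass, error measurement, back-propagated weight update.
y = forward(W, x)
error = y - target                           # measured error at the output
grad_W = np.outer(error * (y > 0), x)        # gradient through the ReLU
W -= 0.01 * grad_W                           # stochastic gradient descent step

# Inference: forward propagation only.
prediction = int(np.argmax(forward(W, x)))
print(prediction)
```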

Recent published works have shown that common networks have significant redundancy and can be pruned dramatically during training without substantively affecting accuracy [15]. Our experience shows that the number of weights that can be eliminated varies widely across the layers but typically ranges from 20% to 80% [15], [14]. Eliminating weights results in a network with a substantial number of zero values, which can potentially reduce the computational requirements of inference.

The inference computation also offers a further optimization opportunity. Specifically, many networks employ as their nonlinear operator the ReLU (rectified linear unit) function, which clamps all negative activation values to zero. The activations are the output values of an individual layer that are passed as inputs to the next layer. Our experience shows that for typical data sets, 50–70% of the activations are clamped to zero. Since the multiplication of weights and activations is the key computation for inference, the combination of these two factors can reduce the amount of computation required by over an order of magnitude. Additional benefits can be achieved by a compressed encoding for zero weights and activations, thus allowing more to fit in on-chip RAM and eliminating energy-costly DRAM accesses.
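As a rough illustration of how such activation sparsity can be measured, the snippet below applies a ReLU to a hypothetical pre-activation tensor and reports its density; the tensor shape and random data are assumptions, not values from the paper.

```python
# Illustrative only: apply a ReLU to a hypothetical pre-activation tensor and
# measure the resulting activation density (fraction of non-zero values).
import numpy as np

pre_activation = np.random.default_rng(1).standard_normal((64, 56, 56))  # C x H x W
activation = np.maximum(pre_activation, 0.0)   # ReLU clamps negatives to zero

density = np.count_nonzero(activation) / activation.size
print(f"activation density after ReLU: {density:.2f}")  # ~0.5 for zero-mean data
```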

In this paper, we introduce the Sparse CNN (SCNN) accelerator architecture, a new CNN inference architecture that exploits both weight and activation sparsity to improve both performance and power. Previous works have employed techniques for exploiting sparsity, including saving computation energy for zero-valued activations and compressing weights and activations stored in DRAM [7], [6]. Other works have used a compressed encoding of activations [1] or weights [30] in parts of their dataflow to reduce data transfer bandwidth and save time for computations of some multiplications with a zero operand. While these prior architectures have largely focused on eliminating computations and exploiting some data compression, SCNN couples an algorithmic dataflow that eliminates all multiplications with a zero operand while employing a compressed representation of both weights and activations through almost the entire computation.

At the heart of the SCNN design is a processing element (PE) with a multiplier array that accepts a vector of weights and a vector of activations. Unlike previous convolutional dataflows [5], [11], [7], [25], the SCNN dataflow only delivers weights and activations to the multiplier array that can all be multiplied by one another in the manner of a Cartesian product. Furthermore, the activation vectors are reused in an input stationary [6] fashion against a number of weight vectors to reduce data accesses. Finally, only non-zero weights and activations are fetched from the input storage arrays and delivered to the multiplier array. As with any CNN accelerator, SCNN must accumulate the partial products generated by the multipliers. However, since the products generated by the multiplier array cannot be directly summed together, SCNN tracks the output coordinates associated with each multiplication and sends the coordinate and product to a scatter accumulator array for summing.
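The following sketch is a much-simplified 1-D analogue of that dataflow, not the paper's PE microarchitecture: compressed (value, coordinate) pairs for non-zero weights and activations are multiplied in an all-to-all fashion, and each product is scattered into an accumulator indexed by its computed output coordinate.

```python
# Much-simplified 1-D analogue of the Cartesian-product dataflow (not the
# paper's exact PE design). Operands are compressed (value, coordinate) pairs.
from collections import defaultdict

weights = [(0.5, (0, 0)), (-1.0, (0, 2)), (2.0, (1, 1))]  # (value, (k, r)): output channel k, filter offset r
activations = [(3.0, 4), (1.5, 7)]                        # (value, x): input position x

accumulator = defaultdict(float)  # stand-in for the scatter accumulator array
for w, (k, r) in weights:         # Cartesian product: every fetched weight is
    for a, x in activations:      # multiplied with every fetched activation
        out_x = x - r             # output coordinate of this partial product
        if out_x >= 0:            # (boundary handling kept minimal here)
            accumulator[(k, out_x)] += w * a  # scatter by output coordinate

print(dict(accumulator))
```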

To increase parallelism beyond a single PE, multiple PEs can be run in parallel, with each working on a disjoint 3D tile of input activations. Because of the end-to-end compression of activations, SCNN can keep both the input and output activations of each tile local to its PE, further reducing energy-hungry data transmission. Overall, this results in a design with efficient compressed storage and delivery of input operands, high reuse of the input operands in the multiplier array, and no time spent on multiplications with zero operands. To evaluate SCNN, we developed a cycle-level performance model and a validated analytical model that allows us to quickly explore the design space of different types of accelerators. We also implemented an SCNN PE in synthesizable SystemC and compiled the design into gates using a combination of commercial high-level synthesis (HLS) tools and a traditional Verilog compiler. Our results show that a 64-PE SCNN implementation with 16 multipliers per PE (1,024 multipliers in total) can be implemented in approximately 7.9 mm², which is somewhat larger than an equivalently provisioned dense accelerator architecture due to the overheads of managing the sparse dataflow. On a range of networks, SCNN provides a factor of 2.7× speedup and a 2.3× energy reduction relative to the dense architecture.
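As a rough sketch of the tiling idea (the grid size, tensor shape, and split strategy below are assumptions, not the paper's exact partitioning), the input activation volume can be divided into disjoint spatial tiles, one per PE:

```python
# Rough sketch of PE tiling (assumed shapes and grid, not the paper's exact
# scheme): split a C x H x W activation volume into disjoint spatial tiles.
import numpy as np

C, H, W = 64, 56, 56
activations = np.random.default_rng(2).standard_normal((C, H, W))

pe_rows, pe_cols = 8, 8           # 64 PEs arranged as a hypothetical 8 x 8 grid
tiles = [np.array_split(row_block, pe_cols, axis=2)
         for row_block in np.array_split(activations, pe_rows, axis=1)]

print(tiles[0][0].shape)          # e.g. (64, 7, 7): the tile held by PE (0, 0)
```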

TABLE I: NETWORK CHARACTERISTICS. WEIGHTS AND ACTIVATIONS ASSUME A DATA-TYPE SIZE OF TWO BYTES.

Network          # Conv. Layers   Max. Layer Weights   Max. Layer Activations   Total # Multiplies
AlexNet [20]            5               1.73 MB               0.31 MB                 0.69 B
GoogLeNet [28]         54               1.32 MB               1.52 MB                 1.1 B
VGGNet [27]            13               4.49 MB               6.12 MB                15.3 B

II. MOTIVATION

Convolutional Neural Network algorithms (CNNs) are essentially a cascaded set of pattern recognition filters trained with supervision [21]. A CNN consists of a series of layers, which include convolutional layers, non-linear scalar operator layers, and layers that downsample the intermediate data, for example by pooling. The convolutional layers represent the core of the CNN computation and are characterized by a set of filters that are usually 1×1 or 3×3, and occasionally 5×5 or larger. The values of these filters are the weights that are trained using a training set for the network. Some deep neural networks (DNNs) also include fully-connected layers, typically toward the end of the DNN. During inference, a new image (in the case of image recognition) is presented to the network, which classifies it into the training categories by computing in succession each of the layers in the network. The intermediate data between the layers are called activations, and the output activation of one layer becomes the input activation of the next layer. In this paper, we focus on accelerating the convolutional layers as they constitute the majority of the computation [9].
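For reference, a dense convolutional layer can be written as the loop nest below (unit stride, no padding, toy dimensions chosen for illustration); every innermost iteration is one of the multiplies that Table I counts and that weight or activation sparsity can eliminate.

```python
# Dense reference loop nest for one convolutional layer (illustrative toy
# sizes, unit stride, no padding). Each innermost iteration is one multiply.
import numpy as np

C, H, W = 3, 8, 8        # input channels and input spatial size
K, R, S = 4, 3, 3        # output channels and filter height/width
inputs  = np.random.default_rng(3).standard_normal((C, H, W))
filters = np.random.default_rng(4).standard_normal((K, C, R, S))

P, Q = H - R + 1, W - S + 1          # output spatial size
outputs = np.zeros((K, P, Q))
for k in range(K):                   # output channels
    for c in range(C):               # input channels
        for p in range(P):           # output rows
            for q in range(Q):       # output columns
                for r in range(R):   # filter rows
                    for s in range(S):   # filter columns
                        outputs[k, p, q] += filters[k, c, r, s] * inputs[c, p + r, q + s]
```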

Table I lists the attributes of three commonly used networks in image processing: AlexNet [20], GoogLeNet [28], and VGGNet [27], whose specifications come from the Caffe BVLC Model Zoo [4]. The increasing layer depth across the networks represents the successively more accurate networks in the ImageNet [19] competition. The Maximum Weights and Activations columns indicate the size of the largest weight and activation matrices across the layers of the network. The last column lists the total number of multiplies required to compute a single inference pass through all of the convolutional layers of the network. These data and computational requirements are derived from the standard ImageNet input images of 224×224 pixels. Processing larger, higher resolution images will result in greater computational and data requirements.
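The per-layer multiply counts behind that last column follow the standard K·C·R·S·P·Q formula (output channels × input channels × filter height × filter width × output height × output width); the snippet below evaluates it for a hypothetical layer, not one taken from Table I.

```python
# Back-of-the-envelope multiply count for one convolutional layer, using the
# standard K*C*R*S*P*Q formula. The dimensions below are hypothetical and are
# not taken from Table I.
K, C, R, S = 128, 64, 3, 3   # output channels, input channels, filter H and W
P, Q = 56, 56                # output feature-map height and width

multiplies = K * C * R * S * P * Q
print(f"{multiplies / 1e9:.2f} billion multiplies")  # about 0.23 B for this layer
```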

Sparsity in CNNs. Sparsity in a layer of a CNN is defined as the fraction of zeros in the layer’s weight and input activation matrices. The primary technique for creating weight sparsity is to prune the network during training. Han et al. developed a pruning algorithm that operates in two phases [15]. First, any weight with an absolute value that is close to zero (e.g., below a defined threshold) is set to zero. This process has the effect of removing weights from the filters, and sometimes even forcing an output activation to always be zero. Second, the remaining network is retrained, to regain the accuracy lost through naïve pruning. The result is a smaller network with accuracy extremely close to the original network. The process can be iteratively repeated to reduce network size while maintaining accuracy.
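A minimal sketch of the first, thresholding phase is shown below; the weight matrix and threshold value are hypothetical, and the retraining phase of Han et al. [15] is only indicated in a comment.

```python
# Sketch of the thresholding phase of magnitude pruning; the weight matrix and
# threshold are hypothetical, and retraining (phase two) is not shown.
import numpy as np

def prune(weights: np.ndarray, threshold: float) -> np.ndarray:
    """Phase 1: set every weight whose magnitude is below threshold to zero."""
    pruned = weights.copy()
    pruned[np.abs(pruned) < threshold] = 0.0
    return pruned

weights = np.random.default_rng(5).standard_normal((256, 128)) * 0.1
pruned = prune(weights, threshold=0.05)
print(f"weight density after pruning: "
      f"{np.count_nonzero(pruned) / pruned.size:.2f}")

# Phase 2 (not shown): retrain the surviving non-zero weights to recover the
# accuracy lost to pruning; the two phases may be repeated iteratively.
```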

Activation sparsity occurs dynamically during inference and is highly dependent on the data being processed. Specifically, the rectified linear unit (ReLU) function that is commonly used as the non-linear operator in CNNs forces all negatively valued activations to be clamped to zero. After completing computation of a convolutional layer, a ReLU function is applied point-wise to each element in the output activation matrices before the data is passed to the next layer. To measure the weight and activation sparsity, we used the Caffe framework [3] to prune and train the three networks listed in Table I, using the pruning algorithm of [15]. We then instrumented the Caffe framework to inspect the activations between the convolutional layers. Figure 1 shows the weight and activation density (fraction of non-zeros, or the complement of sparsity) of the layers of the networks, referenced to the left-hand y-axes. As GoogLeNet has 54 convolutional layers, we only show a subset of representative layers. The data shows that weight density varies across both layers and networks, reaching a minimum of 30% for some of the GoogLeNet layers. Activation density also varies, with density typically being higher in early layers. Activation density can be as low as 30% as well. The triangles show the ideal amount of work (measured in multiplies of non-zero values) that could be achieved through maximum exploitation of sparsity by taking the product of the weight and activation densities on a per-layer basis. Typical layers can reduce work by a factor of 4, and some can reach as high as a factor of ten.
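The "ideal work" triangles are computed per layer as the product of the weight and activation densities; the short worked example below uses hypothetical densities chosen within the ranges quoted above.

```python
# Worked example of the per-layer "ideal work" calculation: the fraction of
# multiplies remaining is the product of the weight and activation densities.
# The density values below are hypothetical, chosen within the quoted ranges.
weight_density     = 0.40   # 40% of weights non-zero after pruning
activation_density = 0.35   # 35% of input activations non-zero after ReLU

work_fraction = weight_density * activation_density
print(f"remaining work: {work_fraction:.2f} "
      f"(about a {1 / work_fraction:.1f}x reduction)")  # ~0.14, roughly 7x
```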

Exploiting sparsity. Since multiplication by zero just results in a zero, it should require no work. In addition, that zero will contribute nothing to the partial sum it is part of, so the addition is unnecessary as well. Furthermore, data with many zeros can be represented in a compressed form. Together these characteristics provide a number of opportunities for optimization:

  • Compressing data: Encoding the sparse weights and/or activations provides an architecture with an opportunity to reduce the amount of data that must be moved throughout the memory hierarchy. It also reduces the data footprint, which allows larger matrices to be held in a storage structure of a given size.
  • Eliminating computation: For multiplications that have a zero weight and/or activation operand, the operation can be data gated or the operands might never be sent to the multiplier. This can save energy consumption or both time and energy consumption, respectively.

Our SCNN architecture exploits both these opportunities. First, it employs a dense encoding of sparse weights and activations. Second, it uses a novel dataflow that delivers only those densely encoded weights and activations to the multipliers.
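One simple way to realize such a dense encoding of a sparse vector, shown purely as an illustration and not as SCNN's actual on-chip format, is to store only the non-zero values together with the number of zeros preceding each one:

```python
# Illustration of one possible dense encoding of a sparse vector (not SCNN's
# actual on-chip format): store only the non-zero values plus the number of
# zeros that precede each one.
from typing import List, Tuple

def compress(vec: List[float]) -> Tuple[List[float], List[int]]:
    values, zero_runs = [], []
    run = 0
    for v in vec:
        if v == 0.0:
            run += 1
        else:
            values.append(v)
            zero_runs.append(run)   # zeros seen since the previous non-zero
            run = 0
    return values, zero_runs

def decompress(values: List[float], zero_runs: List[int]) -> List[float]:
    out: List[float] = []
    for v, run in zip(values, zero_runs):
        out.extend([0.0] * run)     # re-expand the run of zeros
        out.append(v)
    return out                      # trailing zeros are not restored here

vec = [0.0, 0.0, 1.5, 0.0, -2.0, 0.0, 0.0, 0.0, 3.0]
values, zero_runs = compress(vec)
assert decompress(values, zero_runs) == vec   # round-trips (no trailing zeros)
```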

[Figure 1 plots, per layer, the input activation density (IA), weight density (W), and work measured in # of multiplies, for (a) AlexNet, (b) GoogLeNet (representative inception_3a and inception_5b modules), and (c) VGGNet.]

Fig. 1. Input activation and weight density and the reduction in the amount of work achievable by maximally exploiting sparsity.

References

Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally (2017). "SCNN: An Accelerator for Compressed-Sparse Convolutional Neural Networks."