Deep Reinforcement Learning Processor Design for Mobile Applications


Table of Contents
Contents
Deep Reinforcement Learning Processor Design for Mobile Applications
1 Introduction
2 Background of Deep Reinforcement Learning
2.1 Reinforcement Learning
Problem Setup
Bellman Equation
Exploration and Exploitation
Dynamic Programming
Monte Carlo
Temporal Difference Learning
Model-Based RL
Function Approximation
Policy Optimization
2.2 Deep Reinforcement Learning
Deep Q-Learning
Policy Gradient
Actor-Critic
3 Group-Sparse Training Algorithm for Accelerating Deep Reinforcement Learning
3.1 Introduction
3.2 Related Works
3.3 Methodology
The Architecture of the Group-Sparse Training
Reward-Aware Pruning
Block Size Conversion Methods
Estimation of Compression Ratio
3.4 Experiments
Gradual Pruning vs. Reward-Aware Pruning
GST Performance According to B and Sshift
Performance of Block Size Conversion Methods
Verification of GST on Various DRL Benchmarks
Verification of GST on Classification Networks
3.5 Conclusion and Future Work
4 An Energy-Efficient Deep Reinforcement Learning Processor Design
4.1 Introduction
4.2 Overall Architecture
4.3 Key Features for Energy-Efficient DRL Training Processor
Group-Sparse Training
Group-Sparse Training Core Architecture
Exponent-Mean-Delta Encoding
Sparse Weight Transposer
4.4 Implementation Results
4.5 Discussion
Analysis on GST Performance According to Group Size
Scalability of the Proposed GSTC
PE Array Size Decision of OmniDRL
4.6 Conclusion
5 Low-Power Autonomous Adaptation System with Deep Reinforcement Learning
5.1 Introduction
5.2 Processor Design for DRL Acceleration
Overall Architecture
Methods for Compressing Massive Data of DRL
Compressing Network Interface
5.3 DRL System Design for Autonomous Adaptation
DRL Adaptation Scenario of Humanoid
Detailed Configuration and Operation of DRL System
Implementation Results
5.4 Conclusion
6 Exponent-Computing-in-Memory for DNN Training Processor with Energy-Efficient Heterogeneous Floating-Point Computing Architecture
6.1 Introduction
6.2 Architecture for Computing-in-Memory of Floating Point
Motivation
Heterogeneous Floating-Point Computing Architecture
Overall Processor Architecture
6.3 Mantissa-Free-Exponent Calculation
6.4 Exponent-Computing-in-Memory
CNN Workload Mapping of HEMTC
6.5 Zero-Skip Architecture of HEMTC
6.6 Implementation Results
6.7 Conclusion
Reference
Index


Juhyoung Lee • Hoi-Jun Yoo

Deep Reinforcement Learning Processor Design for Mobile Applications

Juhyoung Lee
QCT Graphics Hardware Korea, Qualcomm Korea YH, Seoul, Korea (Republic of)

Hoi-Jun Yoo
Department of Electrical Engineering, KAIST, Daejeon, Korea (Republic of)

ISBN 978-3-031-36792-2    ISBN 978-3-031-36793-9 (eBook)
https://doi.org/10.1007/978-3-031-36793-9

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland


Deep Reinforcement Learning Processor Design for Mobile Applications

1 Introduction

Nowadays, deep neural networks (DNNs) are ubiquitous. Since the landmark study of AlexNet [1], DNNs have outperformed not only previous hand-crafted algorithms but, on some tasks, even human-level performance, achieving state-of-the-art results in computer vision, natural language processing, and audio signal processing. However, the superior performance of DNNs comes at a high computational cost. The majority of DNN operations come from fully connected layers (FCLs) and convolutional layers (CLs). In an FCL, matrix multiplications between weight parameters and input feature maps generate the output feature maps; in a CL, convolutions between weight parameters and input feature maps generate the output feature maps. Therefore, DNN computation requires load operations for the input feature maps and weights, multiply-and-accumulate (MAC) operations for the calculation, and store operations for the output feature maps. As DNNs have come to use many layers with large dimensions, it is easy to find DNNs that require hundreds to thousands of MB of data and on the order of one giga-MAC operation for a single inference. These massive computational costs are the main requirements for DNN accelerator design.

Recently, the portion of the cost coming from memory has become dominant in DNN accelerator design. This is because of the "memory wall," the performance gap between processor and memory. While processor logic has taken advantage of rapid advances in technology, the improvement of memory could not keep pace. Figure 1 shows the memory wall in DNN accelerator design [2]. The rapid development of DNN models requires 750× more computation every 2 years. The computational capability of hardware has increased 3.1× every 2 years. However, the memory system capability has increased only 1.4× every 2 years. Moreover, we can scale out computation capability by integrating more hardware, but it is hard to scale out memory capability due to physical limitations.
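To make the compute-versus-memory imbalance concrete, the short sketch below estimates the MAC count, data volume, and resulting operational intensity (Ops/byte) of a single fully connected layer at batch size 1. The layer dimensions and 16-bit word size are illustrative assumptions, not figures taken from this book.

def fc_layer_stats(in_features: int, out_features: int, bytes_per_word: int = 2):
    """Estimate MACs, memory traffic, and operational intensity of one FC layer."""
    macs = in_features * out_features                           # one MAC per weight
    weight_bytes = in_features * out_features * bytes_per_word
    act_bytes = (in_features + out_features) * bytes_per_word   # input + output feature maps
    total_bytes = weight_bytes + act_bytes
    ops = 2 * macs                                               # count multiply and add separately
    return macs, total_bytes, ops / total_bytes

# Example: a 1024x1024 FC layer with 16-bit words (illustrative numbers only).
macs, traffic, intensity = fc_layer_stats(1024, 1024)
print(f"MACs: {macs:,}, bytes moved: {traffic:,}, Ops/byte: {intensity:.2f}")

With these assumed sizes, the operational intensity comes out to roughly 1 Op/byte, which is why FC-heavy workloads are bound by memory bandwidth rather than by arithmetic.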

Fig. 1 Memory wall in DNN accelerator design (figure from ISSCC 2022 [2]): AI model computation requirements grow 750× every 2 years, hardware computation capability grows 3.1× every 2 years, but memory system capability grows only 1.4× every 2 years; computation can be scaled out with more hardware, while the memory wall is limited by physical constraints

Fig. 2 (a) Normalized energy cost of components in a DNN accelerator (figure from VLSI 2018 [3]); (b) energy consumption breakdown between compute and memory, measured with Intel performance counter monitors on two 8-core CPUs with 128 GB DRAM, for genomics classification (5% compute, 95% memory) and natural language processing (18% compute, 82% memory) (figure from IEDM 2017 [4])

Figure 2a shows the normalized energy cost of the components in a DNN accelerator [3]. The DNN accelerator integrates processing engines (PEs) for computation, flip-flop-based register files as the L1 cache, a scratch-pad-memory-based global buffer as the L2 cache, and DRAM as main memory. Compared with the computation energy cost in the PEs, the energy costs of buffer access and DRAM access are much higher. Therefore, DNNs that require frequent memory access suffer from large memory energy consumption. Figure 2b shows the energy breakdown between computation and memory measured on two Intel 8-core CPUs with 128 GB of DRAM [4]. For genomics classification, memory consumes 95% of the entire energy; for natural language processing, memory consumes 82%.

In this chapter, we propose optimization techniques spanning the software level to the hardware level that reduce the memory footprint and memory power consumption of an energy-efficient deep reinforcement learning (DRL) accelerator for mobile devices. DRL inherently suffers from a memory bottleneck because it simultaneously utilizes multiple DNNs composed of fully connected layers. We propose two chip designs that handle memory bandwidth optimization and memory power consumption optimization, respectively. The rest of this chapter is organized as follows:

– In Sect. 2, we provide background on reinforcement learning (RL) and deep reinforcement learning (DRL), which is required to design a DRL accelerator. We explain the definition of the RL problem, the components of RL, classical methods for RL (such as dynamic programming, temporal difference learning, and Monte Carlo), and methods for deep RL (deep Q-learning, actor-critic, and policy gradient). This information is essential for understanding the operational characteristics of DRL.
– In Sect. 3, we propose a novel weight compression method for DRL training acceleration, named group-sparse training (GST). DRL has shown remarkable success in sequential decision-making problems but suffers from a long training time to obtain such good performance. Many parallel and distributed DRL training approaches have been proposed to solve this problem, but they are difficult to utilize on resource-limited devices. In order to accelerate DRL on real-world edge devices, the memory bandwidth bottleneck caused by large weight transactions has to be resolved. However, previous iterative pruning not only shows a low compression ratio at the beginning of training but also makes DRL training unstable. GST selectively utilizes block-circulant compression to maintain a high weight compression ratio during all iterations of DRL training and dynamically adapts the target sparsity through reward-aware pruning for stable training. Thanks to these features, GST achieves a 25%p to 41.5%p higher average compression ratio than the iterative pruning method without reward drop in the Mujoco HalfCheetah-v2 and Mujoco Humanoid-v2 environments with TD3 training.
– In Sect. 4, we propose an energy-efficient DRL processor, OmniDRL, for DRL training on edge devices. Despite the growing need for DRL training on edge devices, due to distinct characteristics that can be adapted to each user, the massive amount of external and internal memory access limits the implementation of DRL training on resource-constrained platforms. OmniDRL proposes four key features that reduce external memory access by compressing as much data as possible and reduce internal memory access by directly processing compressed data. Group-sparse training enables a high weight compression ratio for every DRL iteration by selective utilization of weight grouping and weight pruning. A group-sparse training core is proposed to take full advantage of the compressed weights from GST by skipping redundant operations and reusing duplicated data. Exponent-mean-delta encoding applies additional bit-level compression to data while achieving a higher compression ratio and lower memory power consumption than the previous exponent compression method. A world-first on-chip sparse weight transposer enables the DRL training process with compressed weights without a software-based off-chip transposer. As a result, OmniDRL is fabricated in 28 nm CMOS technology and occupies a 3.6 × 3.6 mm² die area. It shows a state-of-the-art peak performance of 4.18 TFLOPS and a peak energy efficiency of 29.3 TFLOPS/W. It achieves 7.16 TFLOPS/W energy efficiency for training a robot agent (Mujoco HalfCheetah, TD3), which is 2.4× higher than the previous state-of-the-art.
– In Sect. 5, we propose a low-power and high-performance DRL system with an energy-efficient DRL chip. The proposed DRL chip can seamlessly compress both weights and feature maps to reduce the amount of memory access. The proposed system demonstrates the adaptation of a humanoid to sudden environmental changes in Mujoco Humanoid-v2. The proposed system shows a training energy efficiency of 10.3 iteration/J, which is 3.9× higher than the NVIDIA TX2.
– In Sect. 6, we propose a heterogeneous floating-point (FP) computing architecture that maximizes energy efficiency by separately optimizing exponent processing and mantissa processing. The proposed exponent-computing-in-memory (ECIM) architecture and mantissa-free-exponent-computing (MFEC) algorithm reduce the power consumption of both memory and FP MACs while resolving the limitations of previous FP computing-in-memory processors. A bfloat16 DNN training processor with the proposed features and sparsity exploitation support is implemented and fabricated in 28 nm CMOS technology. It achieves 13.7 TFLOPS/W energy efficiency while supporting FP operations with a CIM architecture.

All of the chapters except Sect. 2 are based on previous research undertaken by this author [5–8]. More specifically, Sect. 3 is based on "GST: Group-Sparse Training for Accelerating Deep Reinforcement Learning" [5]. Section 4 is based on "OmniDRL: A 29.3 TFLOPS/W Deep Reinforcement Learning Processor with Dual-mode Weight Compression and On-chip Sparse Weight Transposer" [6]. Section 5 is based on "Low-power Autonomous Adaptation System with Deep Reinforcement Learning" [7]. Section 6 is based on "ECIM: Exponent Computing in Memory for an Energy-Efficient Heterogeneous Floating-Point DNN Training Processor" [8]. Section 2 is based on this author's previous publication [9], several DRL review papers, books, and courses such as [10, 11], the RL course by David Silver, and the Deep RL course by Sergey Levine.

2 Background of Deep Reinforcement Learning

2.1 Reinforcement Learning

In this subsection, we give a brief introduction to reinforcement learning (RL). Because deep reinforcement learning (DRL) is one kind of RL, understanding basic RL theory is an essential first step.

Problem Setup

Generally, an RL system consists of an environment and an agent that interact with each other. Let the state space that the agent can reach be S and the action space that the agent can take be A. At each time t, the agent takes an action a_t ∈ A at the current system state s_t ∈ S, following a policy π(a_t|s_t), which decides the action a_t from the state s_t. As a result, the agent gets a reward r_t from the environment, and a state transition to the next system state s_{t+1} occurs with state transition probability P(s_{t+1}|s_t, a_t). In an episodic problem, the agent goes through this process until it reaches a terminal state and then restarts the entire procedure. The goal of RL is to find a policy π(a_t|s_t) that maximizes the expected accumulated reward over the entire time horizon, which is described in Eq. (1).

R_t = Σ_{k=0}^{∞} γ^k r_{t+k}    (1)

γ ∈ (0, 1] is a discount factor. Even though the definition is set up for discrete spaces, the definition for continuous spaces is similar. If the agent cannot fully observe the states of the environment, we use "observation" instead of "state." If an RL problem satisfies the Markov property, we can formulate it as a Markov decision process (MDP) with the 5-tuple (S, A, P, R, γ). The Markov property means that the decision for the future relies only on the current state and action (not on the past). If we know the entire system model dynamics, we can use dynamic programming methods for RL problems; when there is no model, we should use RL methods. We will address the details of dynamic programming and RL methods in the following subsections.
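As a concrete illustration of the agent-environment loop and the discounted return in Eq. (1), the following self-contained sketch runs a random policy in a toy two-state episodic environment. The environment, its reward rule, and the episode length are assumptions made up purely for illustration.

import random

GAMMA = 0.9  # discount factor

class ToyEnv:
    """Tiny episodic environment: state 0 or 1, episode ends after 10 steps."""
    def reset(self):
        self.state, self.t = 0, 0
        return self.state
    def step(self, action):
        # Reward +1 when the action matches the current state, else 0 (arbitrary rule).
        reward = 1.0 if action == self.state else 0.0
        self.state = random.randint(0, 1)   # random state transition
        self.t += 1
        done = self.t >= 10                 # terminal state after 10 steps
        return self.state, reward, done

def run_episode(env, policy):
    """Interact until the terminal state and return the discounted return R_0."""
    state, done, ret, discount = env.reset(), False, 0.0, 1.0
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        ret += discount * reward            # accumulate gamma^k * r_{t+k}
        discount *= GAMMA
    return ret

random_policy = lambda s: random.randint(0, 1)
print(run_episode(ToyEnv(), random_policy))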

Bellman Equation

The Bellman equation is an essential tool for solving RL problems. To explain the Bellman equation, we first need to define the value function. A value function is a prediction of the expected accumulated reward for each state or state-action pair; simply speaking, it describes how good each state or state-action pair is. The state value is the expected accumulated reward obtained by following the policy π from the state, defined as Eq. (2).

v_π(s) = E[R_t | s_t = s], where R_t = Σ_{k=0}^{∞} γ^k r_{t+k}    (2)

The state value function v_π(s) can be written in Bellman-equation form as Eq. (3).

v_π(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γ v_π(s′)]    (3)

The optimal state value is the maximum state value achievable for state s, defined as Eq. (4).

v_*(s) = max_π v_π(s) = max_a Σ_{s′,r} p(s′, r | s, a) [r + γ v_*(s′)]    (4)

The action value is the expected accumulated reward for selecting action a in state s and then following the policy π, defined as Eq. (5).

q_π(s, a) = E[R_t | s_t = s, a_t = a], where R_t = Σ_{k=0}^{∞} γ^k r_{t+k}    (5)

The action value function q_π(s, a) can be written in Bellman-equation form as Eq. (6).

q_π(s, a) = Σ_{s′,r} p(s′, r | s, a) [r + γ Σ_{a′} π(a′|s′) q_π(s′, a′)]    (6)

The optimal action value is the maximum action value achievable for state s and action a, defined as Eq. (7).

q_*(s, a) = max_π q_π(s, a) = Σ_{s′,r} p(s′, r | s, a) [r + γ max_{a′} q_*(s′, a′)]    (7)

The Bellman equation is a key ingredient for solving the RL problem. Because the goal of the RL agent is to find the action that maximizes the total reward, the agent can use the Bellman equation at each state to decide on a proper action.

Exploration and Exploitation

From the Bellman equation, we can greedily choose the action that maximizes the value function for maximal reward. However, at the beginning of the RL sequence, we do not yet have a trained optimal value function; the exploration-vs.-exploitation issue comes from this point. In the exploration process, the RL agent explores the environment with random actions to find better actions. In the exploitation process, the RL agent exploits the current best action to maximize the trained value function. It is a fundamental dilemma of RL that an agent must choose between exploring uncertain policies and using the current best policy. In this subsubsection, we introduce the ε-greedy method, a simple and basic approach for deciding the action. In the ε-greedy method, for the current state s, the RL agent selects the exploitation action a = argmax_{a∈A} Q(s, a) with probability 1−ε; with probability ε, the agent selects a random action. To sum up, the agent uses the current estimated value function with probability 1−ε and explores the environment with probability ε. By doing so, the agent can gather enough data to update the value function while training the value function itself. ε is a value between 0 and 1, and its amount can be modified during training.
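A minimal sketch of ε-greedy action selection over a tabular Q-function is shown below; the dictionary-based Q-table and the action list are illustrative assumptions, not part of any particular library.

import random

def epsilon_greedy(q_table, state, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)                                     # explore
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))       # exploit

# Example: two actions, a small epsilon as it might look late in training (made-up values).
q = {("s0", "left"): 0.2, ("s0", "right"): 0.5}
print(epsilon_greedy(q, "s0", ["left", "right"], epsilon=0.1))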


Dynamic Programming

As mentioned above, we represented the RL problem with the Bellman equation, which can be formed with either the state-value function or the state-action value function. If we have an optimal value function that satisfies the Bellman equation, we can get an optimal policy for the RL agent that maximizes the optimal value function at each state. From this subsubsection on, we address the classical methods for obtaining the optimal value function from the Bellman equation.

Dynamic programming (DP) is a classical method for solving RL problems. If we know all of the transition and reward models of the environment, and if the environment can be modeled as an MDP, we can apply DP to the problem. However, it is hard to solve the Bellman equation directly because we have to find the optimal value function and policy together from scratch; DP is proposed to handle this problem. The basic concept of DP is the iterative process of policy evaluation and policy improvement. Policy evaluation is called prediction, which evaluates the value function for a given policy. Policy improvement is called control, which updates the policy with the evaluated value function. Equation (8) shows how we find the value function for a given policy π.

v_{k+1}(s) = Σ_{a∈A} π(a|s) [R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v_k(s′)]    (8)

Figure 3 shows the improvement process of dynamic programming. In iterative policy evaluation, we update the current value function v_{k+1}(s) from the value function of its previous states v_k(s′). After a sufficient number of iterations, the value function converges to v_π(s), which estimates the value of each state under the policy π(a|s). After obtaining the value function v_π(s), it is time for policy improvement to update the policy. The method for the policy update is as follows: (1) find the values of all actions at all states, and (2) update the policy to choose the best action at each state. In other words, we obtain a better policy by greedily selecting the action that has a higher value according to the policy evaluation results. Policy iteration can solve the Bellman equation, but it requires many iterations because of the large number of iterations needed for each evaluation. Value iteration is proposed to handle this issue: in value iteration, we directly update the optimal value function without considering the policy. Equation (9) shows how we find the optimal value function.

v_*(s) = max_a [R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v_k(s′)]    (9)

Fig. 3 Policy evaluation (V → v_π) and policy improvement (π → greedy(V)) are iterated from (v_0, π_0) until convergence to (v_*, π_*) in dynamic programming
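A compact sketch of value iteration following Eq. (9) on a small tabular MDP is shown below; the transition table P, reward table R, and convergence threshold are illustrative assumptions.

import numpy as np

# Toy MDP (illustrative): 3 states, 2 actions.
# P[a][s][s'] = transition probability, R[s][a] = expected reward.
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.1], [0.0, 0.2], [1.0, 0.5]])
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-6):
    """Iterate v(s) <- max_a [R(s,a) + gamma * sum_s' P(s'|s,a) v(s')] until convergence."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) * v(s')
        q = R + gamma * np.einsum("asn,n->sa", P, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)      # optimal values and greedy policy
        v = v_new

v_star, policy = value_iteration(P, R, gamma)
print(v_star, policy)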

Monte Carlo

We can use DP to solve the Bellman equation if we know all the states and the dynamics of the environment. However, how can we get the optimal value function and policy if we do not have full knowledge about the environment? The Monte Carlo (MC) method is proposed to handle this problem. MC stores all the rewards obtained at each step up to the terminal state and then obtains the value function based on that information. This process of obtaining a true value function with MC is called Monte Carlo prediction. Monte Carlo prediction approximates the value function with the average of the returns obtained after completing each episode and repeatedly refines the estimate toward the true value. We can apply MC only to episodic tasks. There are 2 types of MC according to how the return value is handled for duplicate visits. When one episode is finished, a return value exists for each state visited during the episode. First-visit MC takes only the return value of the first visit to a state and does not consider later ones; every-visit MC, on the other hand, averages the return values from every visit to the state. Both methods converge to the true value function. Here is a brief scenario of first-visit MC: through the rewards obtained every time the state changes, we can compute the return value G(s) for each state. MC proceeds through several episodes and estimates the arithmetic mean of all the return values from episodes that pass through the state. These return values approximate the true value function if enough values are gathered through repetition.
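The following sketch implements first-visit Monte Carlo prediction for a fixed policy, averaging the first-visit returns G(s) per state; the episode format (a list of (state, reward) pairs) and the sample data are assumptions made for illustration.

from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Estimate V(s) as the average of first-visit returns over many episodes."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:                 # episode = [(state, reward), ...]
        g = 0.0
        returns = []
        for state, reward in reversed(episode):
            g = reward + gamma * g           # backward accumulation of the return
            returns.append((state, g))
        returns.reverse()
        seen = set()
        for state, g in returns:             # keep only the first visit of each state
            if state not in seen:
                seen.add(state)
                returns_sum[state] += g
                returns_cnt[state] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}

# Two tiny hand-made episodes (illustrative data only).
eps = [[("s0", 0.0), ("s1", 1.0)], [("s0", 0.0), ("s0", 0.0), ("s1", 1.0)]]
print(first_visit_mc(eps))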

Temporal Difference Learning

We cannot apply MC to non-episodic RL problems. MC calculates the expected return from the beginning state and utilizes the average value of the expected returns. However, if an episode is too long and requires a policy update before it ends, high variance in the expected return occurs; therefore, MC cannot update its policy online. Temporal difference (TD) learning was proposed to handle this problem. Instead of using sampled discounted expected returns as MC does, TD utilizes the temporal difference error. TD updates the current value function with the update rule in Eq. (10), where α is a learning rate and r + γV(s′) − V(s) is the TD error. Algorithm 1 shows the pseudocode for TD learning, and Fig. 4 shows a comparison between the various classical RL methods (DP, MC, and TD).

V(s) ← V(s) + α[r + γ V(s′) − V(s)]    (10)

Algorithm 1 TD learning, adapted from [10]
Input: the policy π to be evaluated
Output: value function V
1: initialize V arbitrarily, e.g., to 0 for all states
2: for each episode do
3:   initialize state s
4:   for each step of episode, state s is not terminal do
5:     a ← action given by π for s
6:     take action a, observe r, s′
7:     V(s) ← V(s) + α[r + γ V(s′) − V(s)]
8:     s ← s′
9:   end for
10: end for

Fig. 4 Comparison between dynamic programming (DP), Monte Carlo (MC), and temporal difference (TD): DP takes an expectation over all available next states and requires full knowledge of the model; MC updates after the end of one episode and requires an episodic application; TD updates after a single step and is available for continuous environments
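A minimal tabular TD(0) sketch corresponding to Eq. (10) and Algorithm 1 is given below; the stream of (s, r, s′) samples is assumed to come from any interaction loop under the evaluated policy, and the numbers are made up.

from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

V = defaultdict(float)
for s, r, s_next in [("s0", 0.0, "s1"), ("s1", 1.0, "s0"), ("s0", 0.0, "s1")]:
    td0_update(V, s, r, s_next)
print(dict(V))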

SARSA is a different kind of temporal difference learning. The name "SARSA" comes from the fact that the method utilizes the tuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}). The only difference is that the target of the update is not the state-value function V(s) but the state-action value function Q(s, a). Therefore, SARSA considers the action a_{t+1} at the next state s_{t+1}. Equation (11) shows the update rule of SARSA, and Algorithm 2 shows the pseudocode for SARSA.

Q(s, a) ← Q(s, a) + α[r + γ Q(s′, a′) − Q(s, a)]    (11)

However, because SARSA utilizes on-policy temporal difference control, errors can occur during the value function update. Let us assume that we use an ε-greedy policy with SARSA. Even if Q(s, a) returns the highest value among the available actions, Q(s, a) can be degraded when Q(s′, a′) returns negative values due to exploration. When this situation happens, the RL agent cannot proceed further and may be trapped in a local minimum.

Algorithm 2 SARSA, adapted from [10]
Output: action value function Q
1: initialize Q arbitrarily, e.g., to 0 for all states; set the action value for terminal states to 0
2: for each episode do
3:   initialize state s
4:   for each step of episode, state s is not terminal do
5:     a ← action for s derived by Q, e.g., ε-greedy
6:     take action a, observe r, s′
7:     a′ ← action for s′ derived by Q, e.g., ε-greedy
8:     Q(s, a) ← Q(s, a) + α[r + γ Q(s′, a′) − Q(s, a)]
9:     s ← s′, a ← a′
10:   end for
11: end for

Q-learning is proposed to overcome this limitation of on-policy temporal difference control. Q-learning adopts off-policy temporal difference control, which decouples the policy used for action selection from the policy used for the update. Each iteration of Q-learning consists of 2 steps: (1) at the state s, select action a according to ε-greedy and receive reward r; (2) at the next state s′, select the maximum Q(s′, a′) for updating Q(s, a). Equation (12) shows the update rule of Q-learning, and Algorithm 3 shows the pseudocode for Q-learning.

Q(s, a) ← Q(s, a) + α[r + γ max_{a′} Q(s′, a′) − Q(s, a)]    (12)

Algorithm 3 Q-learning, adapted from [10]
Output: action value function Q
1: initialize Q arbitrarily, e.g., to 0 for all states; set the action value for terminal states to 0
2: for each episode do
3:   initialize state s
4:   for each step of episode, state s is not terminal do
5:     a ← action for s derived by Q, e.g., ε-greedy
6:     take action a, observe r, s′
7:     Q(s, a) ← Q(s, a) + α[r + γ max_{a′} Q(s′, a′) − Q(s, a)]
8:     s ← s′
9:   end for
10: end for
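To contrast the on-policy SARSA update of Eq. (11) with the off-policy Q-learning update of Eq. (12), the two update rules are written side by side below for a dictionary-based Q-table; the action set and sample transitions are illustrative assumptions.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy target uses the action a' actually selected at s'."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy target uses the greedy action max_a' Q(s', a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

Q = {}
actions = ["left", "right"]
sarsa_update(Q, "s0", "left", 1.0, "s1", "right")
q_learning_update(Q, "s1", "right", 0.0, "s0", actions)
print(Q)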

Model-Based RL

So far, we have explained classical methods for RL problems. If we fully know the environment, we can use dynamic programming (DP); if not, we can use model-free methods such as Monte Carlo (MC) or temporal difference (TD) learning. However, these model-free methods require a huge amount of interaction between the environment and the RL agent, which leads to low sample efficiency. Model-based RL was proposed to increase sample efficiency. Model-based RL includes a "model" that has knowledge about the environment's dynamics. By utilizing a model, model-based RL can obtain simulated experience without interaction with the real environment. For example, a model can estimate the next state and reward from the current state and action based on past experience. The simulated experience from the model can be utilized for updating the value function, and the updated value function can be used for improving the current policy. We call this entire process of model-based RL "planning." Dyna-Q is one of the classical model-based RL methods. Figure 5 shows the overall process of Dyna-Q, and Algorithm 4 shows its pseudocode. Dyna-Q utilizes both planning and direct RL, updating the value function with simulated experience from the model and with real experience from the environment, respectively. As in the previous Q-learning, it updates the value function after getting experiences from the environment (direct RL update). The difference is that Dyna-Q also utilizes experiences to update the model (model learning). The updated model is utilized to generate simulated experiences, and the simulated experiences are used for updating the value function. Because Dyna-Q uses a tabular-type model, it can be adopted in a deterministic environment to quickly find the optimal policy.

Fig. 5 Overall flow of Dyna-Q learning: real experience from the environment updates the policy/value function with direct RL and also trains the model; simulated experience ("imagination") from the model updates the policy/value function with planning

Algorithm 4 Dyna-Q, adapted from [10]
1: // Model(s, a) denotes the predicted reward and next state for the state-action pair (s, a)
2: initialize Q(s, a) and Model(s, a) for all s ∈ S and a ∈ A
3: for true do
4:   s ← current nonterminal state
5:   a ← action for s derived by Q, e.g., ε-greedy
6:   take action a, and observe reward r and next state s′
7:   Q(s, a) ← Q(s, a) + α[r + γ max_{a′} Q(s′, a′) − Q(s, a)]
8:   Model(s, a) ← r, s′
9:   for N iterations do
10:    s ← random state previously observed
11:    a ← random action previously taken
12:    r, s′ ← Model(s, a)
13:    Q(s, a) ← Q(s, a) + α[r + γ max_{a′} Q(s′, a′) − Q(s, a)]
14:   end for
15: end for

Function Approximation

The methods explained so far are all tabular methods that require a table form of the value function. These methods make sense only when the dimension of the state or action space is small. If the dimension is too large, these methods require huge memory and time for updating. For example, Go requires 10^170 states, and there are many applications that require continuous state spaces, such as cars and robots. Although the theoretical convergence of tabular RL training is guaranteed, the limited capacity of the mapping functions makes it hard to train complex RL agents, including autonomous robots.

Function approximation is proposed to generalize classical RL to various real-world problems. Function approximation approximates the state-value function or action-value function with a parameter w. For example, the state-value function v_π(s) is approximated by v̂(s, w), and the action-value function q_π(s, a) is approximated by q̂(s, a, w). There are 3 advantages of function approximation: (1) by using the function, we can estimate values for states that were never gathered from the environment; (2) it is more robust to noise in the data; (3) thanks to the approximation, high-dimensional data can be stored efficiently. Equation (13) shows the update rule of approximated TD learning, where v̂(s, w) is the approximated value function and w is the value function parameter.

w ← w + α[r + γ v̂(s′, w) − v̂(s, w)] ∇_w v̂(s, w)    (13)
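For Eq. (13), the sketch below performs one semi-gradient TD(0) update of a linear value approximator v̂(s, w) = w·x(s); the feature vectors and step sizes are illustrative assumptions.

import numpy as np

def linear_td_update(w, x, r, x_next, alpha=0.05, gamma=0.9):
    """w <- w + alpha * [r + gamma * w.x' - w.x] * grad_w(w.x); for a linear v-hat the gradient is x."""
    v, v_next = w @ x, w @ x_next
    td_error = r + gamma * v_next - v
    return w + alpha * td_error * x

w = np.zeros(4)
x_s = np.array([1.0, 0.0, 0.5, 0.0])         # feature vector of state s (made up)
x_s_next = np.array([0.0, 1.0, 0.0, 0.5])    # feature vector of next state s'
w = linear_td_update(w, x_s, 1.0, x_s_next)
print(w)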

Policy Optimization

So far, we have addressed value-based methods such as Q-learning. A value-based method estimates the action-value function (like the Q-function) and uses it to derive a policy. In contrast, policy-based methods approximate the policy itself; in policy-based RL, the policy is directly parameterized. Compared to the value-based method, the policy-based method shows better convergence properties and is effective in high-dimensional or continuous action spaces. For example, with the greedy updates of a value-based method, small changes in value can cause large changes in policy, which makes updates unstable; in the policy-based method, smooth and stable updates of the policy can be maintained. However, it typically converges to a local minimum and shows high variance. For a differentiable policy π(a|s; θ), we can compute the gradient of the policy analytically as in Eq. (14).

∇_θ π(a|s; θ) = π(a|s; θ) · (∇_θ π(a|s; θ) / π(a|s; θ)) = π(a|s; θ) ∇_θ log π(a|s; θ)    (14)

∇_θ log π(a|s; θ) is called the score function. We can derive the gradient of the cost function, ∇_θ J(θ), from the score function as in Eq. (15).

∇_θ J(θ) = E_{π_θ}[∇_θ log π(s, a; θ) · r]    (15)

∇_θ log π(s, a; θ) is the gradient of the policy, and the reward r is multiplied to determine the direction and magnitude of the policy update. Instead of using the immediate reward, we can use the action-value function Q_{π_θ}(s, a) as a long-term reward. Moreover, a baseline b_t(s_t) can be subtracted from the return to reduce the variance of the estimated gradient; we can use the value function V(s_t) as the baseline b_t(s_t). Algorithm 5 shows the pseudocode of the REINFORCE algorithm, which updates the parameter θ in the direction of ∇_θ log π(a_t|s_t; θ)R_t. Algorithm 6 shows the pseudocode of the actor-critic algorithm: the critic updates the value function parameters, and the actor updates the policy parameters according to the critic's value.
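The sketch below estimates the policy gradient of Eq. (15) for a softmax policy over discrete actions, optionally subtracting a baseline to reduce variance. The features, episode data, and baseline value are illustrative assumptions, not part of any specific algorithm in this chapter.

import numpy as np

def softmax_policy(theta, s_feat):
    """pi(a|s; theta): softmax over action preferences theta[a] . x(s)."""
    logits = theta @ s_feat
    p = np.exp(logits - logits.max())
    return p / p.sum()

def policy_gradient_estimate(theta, episode, baseline=0.0):
    """Monte Carlo estimate of E[ grad log pi(a|s; theta) * (R - b) ]."""
    grad = np.zeros_like(theta)
    for s_feat, a, ret in episode:                  # (features, chosen action, return R_t)
        p = softmax_policy(theta, s_feat)
        # Gradient of log softmax: +x(s) on the chosen action's row, minus p_b * x(s) on every row.
        g = -np.outer(p, s_feat)
        g[a] += s_feat
        grad += g * (ret - baseline)
    return grad / len(episode)

theta = np.zeros((2, 3))                            # 2 actions, 3 features (made up)
episode = [(np.array([1.0, 0.0, 0.5]), 0, 2.0), (np.array([0.0, 1.0, 0.2]), 1, 1.0)]
theta += 0.01 * policy_gradient_estimate(theta, episode, baseline=1.5)
print(theta)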

Algorithm 5 REINFORCE with baseline (episodic), adapted from [10]
Input: policy π(a|s, θ), v̂(s, w)
Parameters: step sizes α > 0, β > 0
Output: policy π(a|s, θ)
1: initialize policy parameter θ and state-value weights w
2: for true do
3:   generate an episode s_0, a_0, r_1, …, s_{T−1}, a_{T−1}, r_T, following π(·|·, θ)
4:   for each step t of episode 0, …, T − 1 do
5:     G_t ← return from step t
6:     δ ← G_t − v̂(s_t, w)
7:     w ← w + βδ∇_w v̂(s_t, w)
8:     θ ← θ + αγ^t δ∇_θ log π(a_t|s_t, θ)
9:   end for
10: end for

Algorithm 6 Actor-critic (episodic), adapted from [10]
Input: policy π(a|s, θ), v̂(s, w)
Parameters: step sizes α > 0, β > 0
Output: policy π(a|s, θ)
1: initialize policy parameter θ and state-value weights w
2: for true do
3:   initialize s, the first state of the episode
4:   I ← 1
5:   for s is not terminal do
6:     a ∼ π(·|s, θ)
7:     take action a, observe s′, r
8:     δ ← r + γ v̂(s′, w) − v̂(s, w)   (if s′ is terminal, v̂(s′, w) = 0)
9:     w ← w + βδ∇_w v̂(s, w)
10:    θ ← θ + αI δ∇_θ log π(a_t|s_t, θ)
11:    I ← γI
12:    s ← s′
13:   end for
14: end for

2.2 Deep Reinforcement Learning

Classical reinforcement learning used tabular methods, which cause huge memory and exploration time for high-dimensional problems. Function approximation methods were proposed to handle the issue, but the limited capacity of the function (e.g., linear approximation) was not enough. Deep reinforcement learning applies deep learning to approximate the value function and policy. Figure 6 shows an example of an RL system composed of a pet robot (agent) and its owner (environment): when the pet robot takes a certain action, its owner can return a positive reward by complimenting it, to induce the robot to repeat that action, or a negative reward by complaining about it, to correct a wrong action. In this subsection, we explain the famous DRL methods: Deep Q-Network (DQN) and its variants, policy gradient and its variants, actor-critic and its variants, and some model-based DRL networks.

Fig. 6 Basic principle of deep reinforcement learning: the agent (robot) takes an action, the owner (environment) returns a compliment or a complaint as the reward, and a DNN policy π_θ(s, a) trained with DRL maps the state to a modified action

Deep Q-Learning

Fig. 7 Comparison between classical Q-learning and deep Q-learning: previous Q-learning maps (state, action) to a Q-value through linear approximation and shows low performance, whereas deep Q-learning maps the state through a DNN to multiple Q-values (Q-value1, Q-value2, …) and shows high performance

There has been a lot of work trying to approximate the Q-function with deep learning, but studies before DQN showed unstable performance. There are 3 main reasons for the instability: (1) sample correlation, (2) rapid changes in data distribution, and (3) a moving target value. First, there are dependencies between consecutive samples in RL. General deep learning assumes that training samples are independent of each other; this property is essential for stable learning because training with correlated samples disturbs convergence to the true function. However, RL generates the next sample from the current sample, policy, and state transition probability, so correlation between consecutive samples is an intrinsic property of RL. Second, when we update the Q-function in an on-policy manner, sudden changes in the training data occur if the behavior policy changes due to a Q-update. These rapid changes in data distribution cause oscillation of the function parameters, which makes training unstable. Third, the target value of Q-learning is moving: previous Q-learning sets the target value with the next state's Q-value function, which is the same function as the current state's Q-value function. Therefore, parameter updates change both the Q-value and the target value, which makes training unstable. Figure 7 shows a comparison between classical Q-learning and deep Q-learning, and Fig. 8 shows the overall flow of DQN.

DQN proposed 3 key components to handle those problems: (1) a convolutional neural network (CNN), (2) experience replay, and (3) a target network. First, DQN adopts a CNN for high performance. CNNs have shown state-of-the-art performance in computer vision applications and perform better than previous linear function approximators. In DQN, the CNN receives only the state as input and generates multiple Q-values, one per action, as output; this architecture makes it easy to extract the max_{a′} Q(s_{t+1}, a′; θ) value for the Q-value update. Second, DQN proposes experience replay, which consists of two stages: (1) at every step, store the extracted sample e_t = (s_t, a_t, r_t, s_{t+1}) in the replay memory D; (2) extract stored samples from the replay memory D in a uniformly random manner and use them for the Q-update. In experience replay, we obtain samples with the current action but use them in a delayed manner. There are 3 advantages of experience replay. First, sample efficiency is increased because one sample can be reused for multiple value function updates. Second, random extraction removes the sample correlation, which decreases the update variance. Finally, training becomes stable because the training data distribution is smoothed. The last component of DQN is the target network. DQN has a main Q-network and a target Q-network. The main Q-network estimates the action value Q from the state and action, and its parameters are updated at every step. The target Q-network is utilized for generating the target value y = r + γ max Q̂; it is not updated at every step but is instead synchronized with the main network every C steps. The moving target problem is handled with this target network concept. Equation (16) shows the loss function of DQN, where θ_i are the parameters of the Q-network at iteration i and θ_i^− are the parameters of the target network at iteration i. Algorithm 7 shows the pseudocode of DQN.

(r + γ max_{a′} Q(s′, a′; θ_i^−) − Q(s, a; θ_i))²    (16)

Fig. 8 Overall flow of DQN: experiences are stored in the replay memory, extracted batches are used to compute the DQN loss between the Q-network's Q-value and the target value from the target Q-network, the Q-network parameters are updated, and the target Q-network is synchronized periodically
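A minimal sketch of the experience replay buffer and of the target computation y = r + γ max_{a′} Q̂(s′, a′; θ⁻) used in Eq. (16) is shown below, with a plain function standing in for the target Q-network; the buffer capacity, batch size, and dummy transitions are illustrative assumptions.

import random
from collections import deque

class ReplayBuffer:
    """Store transitions and sample uniform random minibatches."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)
    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def dqn_targets(batch, target_q, actions, gamma=0.99):
    """y = r if terminal, else r + gamma * max_a' target_q(s', a')."""
    targets = []
    for s, a, r, s_next, done in batch:
        if done:
            targets.append(r)
        else:
            targets.append(r + gamma * max(target_q(s_next, a2) for a2 in actions))
    return targets

# Illustrative usage with a dummy target network that always returns 0.
buf = ReplayBuffer()
for i in range(8):
    buf.push(i, 0, 1.0, i + 1, i == 7)
print(dqn_targets(buf.sample(4), target_q=lambda s, a: 0.0, actions=[0, 1]))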

After DQN, several works have improved on it. Double DQN (DDQN) uses the target network to estimate its values to prevent the value over-estimation problem. Prioritized experience replay extracts important experiences more frequently to learn more efficiently.

Algorithm 7 Deep Q-network (DQN), adapted from [10]
Input: the pixels and the game score
Output: Q action value function (from which we obtain a policy and select actions)
1: initialize replay memory D
2: initialize action-value function Q with random weights θ
3: initialize target action-value function Q̂ with weights θ^− = θ
4: for episode = 1 to M do
5:   initialize sequence s_1 = x_1 and pre-processed sequence φ_1 = φ(s_1)
6:   for t = 1 to T do
7:     following the ε-greedy policy, select a_t (with probability ε, select a random action; otherwise, argmax_a Q(φ(s_t), a; θ))
8:     execute action a_t in the emulator and observe reward r_t and image x_{t+1}
9:     set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
10:    store transition (φ_t, a_t, r_t, φ_{t+1}) in D
11:    // experience replay
12:    sample a random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
13:    if the episode terminates at step j + 1, set y_j = r_j; otherwise, set y_j = r_j + γ max_{a′} Q̂(φ_{j+1}, a′; θ^−)
14:    perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))² w.r.t. the network parameter θ
15:    // periodic update of target network
16:    every C steps, reset Q̂ = Q, i.e., set θ^− = θ
17:   end for
18: end for

Policy Gradient

In this subsubsection, we explain two methods for improving previous policy-based RL. When adopting a DNN in policy-based RL, the biggest problem is instability. Trust region policy optimization (TRPO) and proximal policy optimization (PPO) stabilize policy optimization by constraining gradient updates.

The key concept of TRPO is to update the policy function only within a trustable region. If the gradient is too large, it causes unpredictable, rapid changes; if the gradient is too small, the update is not performed properly. TRPO optimizes a surrogate objective function and makes several approximations. However, TRPO requires complex computations, such as the Hessian of the KL divergence, for calculating the constraint. PPO was proposed to approximate trust regions in a simpler way: to avoid solving the complex constrained surrogate objective, PPO replaces the constraint with simple clipping. Because PPO updates only within the region that can be trusted through clipping, it can reuse gathered data multiple times without instability. In terms of network architecture, PPO shares neural network parameters between the policy and value function. PPO achieves good performance on several continuous tasks, from robot control to Atari games.
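The clipped surrogate objective that PPO uses in place of TRPO's constrained update can be written in a few lines; the probability ratios and advantages below are illustrative numbers, not outputs of a trained policy.

import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """L = mean(min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)), maximized during training."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

# ratio = pi_new(a|s) / pi_old(a|s) for a batch of samples (made-up values).
ratio = np.array([0.9, 1.3, 1.05, 0.7])
advantage = np.array([1.0, 2.0, -0.5, -1.0])
print(ppo_clip_objective(ratio, advantage))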

Actor-Critic

The actor-critic algorithm, which learns the policy and state-value functions with an actor network and a critic network, has also been adapted to deep learning. Figure 9 shows the overall flow of the actor-critic. In this subsubsection, we explain asynchronous advantage actor-critic (A3C), one of the famous actor-critic deep reinforcement learning algorithms. Figure 10 shows the overall flow of A3C. A3C utilizes the advantage function instead of the total return; the advantage is defined as r + γ V(s_{t+1}) − V(s_t). The biggest improvement of A3C is the asynchronous operation of the multi-thread actor-learners. Because the actor-critic is an on-policy method, it suffers from the correlation between consecutive samples and low sample efficiency, and it cannot use a replay memory like DQN. To handle this, A3C creates an independent environment for each thread and gathers experiences with multiple actors. Because the experiences between threads are independent, A3C not only resolves the correlation issue but also increases training speed. The updated global neural network's parameters are synchronized to each actor-learner. Algorithm 8 shows the pseudocode for A3C.

Fig. 9 Overall flow of the general actor-critic: the critic receives the state and reward from the environment, the actor selects the action, and the critic's evaluation drives the policy update of the actor

Fig. 10 Overall flow of A3C: each worker (0 … n) holds its own copy of the network (policy π(s) and value V(s)) and its own environment, and all workers synchronize with a global network

Algorithm 8 A3C, each actor-learner thread, adapted from [10]
1: global shared parameter vectors θ and θ_v; thread-specific parameter vectors θ′ and θ′_v
2: global shared counter T = 0, maximum T_max
3: initialize step counter t ← 1
4: for T ≤ T_max do
5:   reset gradients, dθ ← 0 and dθ_v ← 0
6:   synchronize thread-specific parameters θ′ = θ and θ′_v = θ_v
7:   set t_start = t, get state s_t
8:   for s_t not terminal and t − t_start ≤ t_max do
9:     take a_t according to policy π(a_t|s_t; θ′)
10:    receive reward r_t and new state s_{t+1}
11:    t ← t + 1, T ← T + 1
12:   end for
13:   R ← 0 for terminal s_t; otherwise, R ← V(s_t, θ′_v)
14:   for i ∈ {t − 1, …, t_start} do
15:     R ← r_i + γR
16:     accumulate gradients w.r.t. θ′: dθ ← dθ + ∇_{θ′} log π(a_i|s_i; θ′)(R − V(s_i; θ′_v))
17:     accumulate gradients w.r.t. θ′_v: dθ_v ← dθ_v + ∇_{θ′_v}(R − V(s_i; θ′_v))²
18:   end for
19:   update asynchronously θ using dθ, and θ_v using dθ_v
20: end for

3 Group-Sparse Training Algorithm for Accelerating Deep Reinforcement Learning

3.1 Introduction

Deep reinforcement learning (DRL) algorithms have achieved remarkable success in sequential decision-making problems such as gaming agents, autonomous robots, and human-computer interaction. By adopting deep neural networks (DNNs) to approximate the policy or value estimator of RL, DRL algorithms have overcome the performance limitations of classical RL algorithms and have achieved human-level or even better performance in various large-scale models and environments. According to recent research from DeepMind [12–14], trained DRL agents have overwhelmed skilled human players, from simple video games such as Atari to complex games such as Go and StarCraft II.


However, it is very difficult to obtain such high-performance DRL agents because a huge amount of data is required to train the neural networks utilized for DRL. Unlike supervised learning, in which labels exist, the DNN in DRL inevitably suffers from unstable training because it must be trained with experience data obtained through interaction between the untrained DNN and the environment. The unstable training causes frequent communication with the environment and many model parameter updates, which leads to a long training time. Vinyals et al. [14] trained a StarCraft II DRL agent using 32 tensor processing units (TPUs) over 44 days.

There are several studies on reducing the overall training time required for DRL [15–18]. The biggest problem of standard DRL implementations, including A2C and PPO [19], is a significant underutilization of computing resources caused by the serial execution of experience sampling and neural network computation. Most of the previous work tried to handle this limitation through parallel and distributed DRL frameworks. A3C [15] utilized independent actors with independent policies that perform environment simulation, action generation, and gradient calculation in parallel. GA3C [16] outperformed CPU-only A3C by adopting a GPU. IMPALA [17] improved the GA3C architecture through efficient GPU batching, and Seed RL [18] achieved higher throughput than [17] by reducing communication time. By accelerating experience sampling in parallel through distributed agents, the above frameworks have achieved state-of-the-art performance in DRL, such as shortening the training time of Atari games to several hours.

However, it is hard to utilize parallel and distributed DRL algorithms on resource-constrained devices such as mobile or edge devices due to their large computational workload. Those algorithms require a vast amount of computing resources, including hundreds of CPU cores and DNN accelerators, for experience sampling from parallel environments and for fast DNN training. Seed RL [18], the state-of-the-art DRL framework, utilized dozens to hundreds of CPUs and 8 TPU v3 cores to accelerate the entire training of DRL. As more and more recent research highlights real-world DRL [20, 21] and adaptation to sudden environment changes on edge devices through DRL [22], efficient methods for DRL training acceleration on resource-constrained devices are essential.

The biggest problem of DRL on mobile devices is the memory bandwidth bottleneck. Unlike previous DRL frameworks, mobile DRL frameworks utilize only a few parallel environments or do not use a simulation environment at all. Therefore, the training speed on mobile DRL platforms is limited by the memory bandwidth bottleneck of DNN training rather than by the parallelism of the simulation environment. Figure 11 shows the two main reasons for the large memory access of DRL training: (1) the DNNs utilized for DRL are mainly composed of fully connected (FC) or recurrent layers, which require a lot of memory access for high throughput due to their small operational intensity. Compared to ResNet-34, which shows more than 100 Ops/byte of operational intensity, the operational intensity of several famous DRL algorithms (STEVE [23], TD3 [24], SAC [25]) for the Mujoco environment is limited to 50~60 Ops/byte. (2) The complex and sequential execution of multiple DNNs in DRL requires frequent access to model parameters and experiences, resulting in large memory access. In particular, the bandwidth required by the model parameters is much larger than the bandwidth required by the observations (98.6% vs. 1.4% in Google Research Football [26]).

Fig. 11 Large memory bandwidth requirement of DRL: the operational intensity (Ops/byte) of DRL algorithms (STEVE, TD3, SAC) is far lower than that of a CNN such as ResNet-34, and model parameters account for 98.6% (146 GB/s) of the bandwidth requirement for Google Research Football, while observations account for only 1.4%

The traditional method to solve the memory bottleneck caused by a large number of model parameters is model compression. In particular, model pruning shows state-of-the-art performance among model compression methods by removing unnecessary weight connections. However, it is difficult to utilize conventional pruning methods for the acceleration of DRL training for two reasons: (1) pruning should be performed iteratively during training to prevent a severe performance drop, and the sparsity of the model parameters starts at 0 and increases little by little; therefore, it shows a low weight compression ratio at the beginning of training, which limits the average compression ratio over training. (2) The fixed sparsity scheduling of previous iterative pruning [27] makes DRL training unstable.

In this chapter, we propose group-sparse training (GST), a training method that can overcome the limited compression ratio of previous pruning methods in early iterations. Compared with previous compression methods that utilize only pruning for the entire training, GST selectively utilizes block-circulant grouping along with pruning. Specifically, in the early iterations of training, both block-circulant grouping and pruning are applied simultaneously to increase the compression ratio, and once sparsity is sufficiently high in the later iterations of training, the block-circulant grouping is selectively released to prevent reward degradation. Moreover, we propose reward-aware pruning (RWP), which dynamically schedules the target sparsity according to the reward history to achieve not only stable training but also a high compression ratio.
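The exact RWP schedule is described later in this chapter. As a rough illustration of the general idea of driving the target sparsity from the reward history, the sketch below raises the pruning target only while the recent reward is not degrading; the window size, step size, and cap are invented for the example and are not the authors' parameters.

def reward_aware_target_sparsity(reward_history, current_target,
                                 window=10, step=0.02, max_sparsity=0.9):
    """Increase the pruning target only when the recent reward trend is not degrading.

    Illustrative schedule only; not the exact RWP rule used by GST.
    """
    if len(reward_history) < 2 * window:
        return current_target                      # not enough history yet
    recent = sum(reward_history[-window:]) / window
    previous = sum(reward_history[-2 * window:-window]) / window
    if recent >= previous:                         # reward stable or improving
        return min(current_target + step, max_sparsity)
    return current_target                          # hold sparsity while reward drops

target = 0.0
rewards = []
for it in range(100):
    rewards.append(float(it))                      # dummy, monotonically improving reward
    target = reward_aware_target_sparsity(rewards, target)
print(f"target sparsity after 100 iterations: {target:.2f}")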

3.2 Related Works

It is widely accepted that a more efficient and faster DNN model can be generated by removing the redundant weight from the model. In this subsection, we introduce


previous works on model pruning, structured matrices, and pruning for DRL or fast training.

Model Pruning The key concept of model pruning is eliminating redundant connections or neurons of the target neural network. Most pruning algorithms achieve a high compression ratio without loss of accuracy by repetitively removing small weights from a pre-trained network and fine-tuning it [28]. Gradual pruning [27] shows that iteratively removing weight connections during network training can achieve a higher compression ratio. Unlike the above methods, in which unstructured sparsity is induced, recent research has tried to create structured sparsity at the filter or channel level [29]. Most pruning studies have focused on obtaining high sparsity for efficient DNN inference and cannot accelerate the DNN training process. Indeed, their training processes are even slowed down due to the utilization of a pre-trained model or knowledge distillation.

Structured Matrix The structured matrix methods compress DNNs by expanding shared weights into a pre-determined matrix format. Sindhwani et al. [30], Cheng et al. [31], and Moczulski et al. [32] first compressed fully connected (FC) layers using a structured matrix format. CirCNN [33] not only achieved a higher compression ratio by using a circulant-matrix format at the block level but also significantly reduced the amount of computation by utilizing the FFT. CircConv [34] showed that the block-circulant method can be utilized for convolutional layers. Compared to the model pruning method, the structured matrix has a negligible cost for encoding but suffers from a lower compression ratio at the same accuracy.

Pruning for DRL or Fast Training There are only a few studies that utilized model compression to accelerate DRL or training. PruneTrain [35] first tried to accelerate training by using repeated regularization-based pruning, but it suffered from a low compression ratio due to the low initial and final sparsity of the regularization-based pruning method. PoPS [36] applied pruning to DRL for the first time. Liao and Yuan [36] were able to achieve a high compression ratio by repeatedly training a small student policy network using knowledge distillation, but their method cannot accelerate the training process because it requires a pre-trained teacher network for pruning. Zhang et al. [37] accelerated the training of DRL by using the knowledge distillation method, but there was also a limitation in DRL training speed-up because of a large teacher network.

3.3 Methodology

The objective of the proposed group-sparse training (GST) is to solve the memory bandwidth bottleneck due to large parameters by compressing the DNN models during all DRL training iterations. The proposed GST is designed to be included in the training loop of any DRL algorithm, such as deep Q-learning (DQN) [12], TD3 [24], PPO [19], etc. Moreover, since GST only affects inference and back-propagation of the DNN, DRL algorithms for better sample efficiency [38] can be used together with the GST.

The Architecture of the Group-Sparse Training

The basic concept of GST is the selective utilization of block-circulant weight compression to compensate for the insufficient compression ratio of iterative pruning in the early iterations of training. Figure 12 shows the overall flow of the proposed GST and a comparison between GST and the previous training methods. Conventional DNN training consists of a feed-forward process with the uncompressed weight W and a back-propagation process with the uncompressed transposed weight W^T. In the iterative pruning algorithm, sparsity is created by gradually removing connections with small absolute values from W and W^T as training progresses. However, the iterative pruning algorithm can remove only a small amount of weights at the beginning of training because it cannot determine the saliency of a specific connection before sufficient training is performed. Low sparsity at the early iterations causes two problems: (1) it lowers the average compression ratio over the entire training, and (2) the large compression ratio difference between the early and the later iterations makes it hard to optimize the memory bandwidth design for weight transactions.

To mitigate the limitation of iterative pruning, the proposed GST selectively utilizes block-circulant weight compression. Compared with iterative pruning, the block-circulant weight compression method not only shows a negligible cost for index encoding but also can be applied to all training iterations. However, its compression ratio should be limited to prevent a severe training performance drop because it cannot identify redundant weights the way iterative pruning does. During the early iterations, GST groups the weights in a block-circulant form while pruning the grouped weights simultaneously. We call the weight at this stage a group-sparse weight. Only after the sparsity of the weights gets sufficiently high does GST selectively unpack the grouped weights and iteratively prune them until completion. By transitioning between group-sparse and sparse weight compression according to the progress of training, GST can achieve a higher weight compression ratio without sacrificing training performance.

The detailed description of the proposed GST is in Algorithm 9. Note that the comparison between the pre-defined phase shift sparsity and the current sparsity determines whether to apply the block-circulant grouping. The phase shift algorithm is motivated by the fact that the reward limitation due to the block-circulant structure begins only after a certain amount of sparsity is obtained. By appropriately disabling the block-circulant grouping, it is possible to achieve a high compression ratio without training performance loss. Results of the phase shift algorithm are described in the experiment subsection.

Fig. 12 Overall flow of the proposed group-sparse training (GST), compared with conventional DNN training and DNN training with iterative pruning

Algorithm 9 Algorithm description of GST
Input: environment Env, pruning threshold step P_step, pruning start step P_start, pruning frequency P_fre, sparsity upper bound S_ub, block size B, phase shift sparsity S_shift, model parameter W
Output: the compressed and trained model parameter W*
Initialize: initial model parameters W, current pruning threshold P_th = 0, previous highest reward R_prev = 0, current reward R_new = 0, timestep T = 0, current sparsity S_now = 0
1: if (B > 1) and (S_shift ≠ 0) then
2:     Reinitialize W in block-circulant form (block size B)
3: end if
4: for T = 0; T < T_max; T++ do
5:     next_state, R_new = Exp_Gen(Env, W, state)
6:     Calculate the gradient based on the generated experience
7:     Measure the current model parameters' sparsity S_now
8:     if S_shift > S_now then
9:         Generate the gradient in block-circulant form (block size B)
10:    end if
11:    Update the model parameter W based on the gradient
12:    if (mod(T, P_fre) = 0) and (P_start < T) then
13:        if S_now < S_ub then
14:            Apply reward-aware pruning (Algorithm 10)
15:        end if
16:    end if
17:    if R_new > R_prev then
18:        R_prev = R_new
19:    end if
20: end for
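
To make the control flow of Algorithm 9 concrete, the following is a minimal NumPy sketch of a single GST update applied to one FC weight matrix. The helper names (block_circulant_project, gst_update), the plain SGD update, and the hyperparameter defaults are illustrative assumptions rather than the authors' implementation, which plugs into full DRL training loops such as TD3 or PPO.

import numpy as np

def block_circulant_project(grad, B):
    # Average the gradient over the circulant diagonals of each B x B block so that
    # all weights sharing a circulant position receive the same update.
    out = grad.copy()
    rows, cols = grad.shape
    for r0 in range(0, rows - rows % B, B):
        for c0 in range(0, cols - cols % B, B):
            block = out[r0:r0 + B, c0:c0 + B]          # view: writes go through
            for d in range(B):
                diag = [(i, (i + d) % B) for i in range(B)]
                avg = np.mean([block[i, j] for i, j in diag])
                for i, j in diag:
                    block[i, j] = avg
    return out

def gst_update(W, grad, mask, state, lr=1e-3, B=4, S_shift=0.5, P_step=0.05, S_ub=0.95):
    # One GST step on a single weight matrix: grouped gradient while the measured
    # sparsity is below S_shift, plain sparse update afterwards, followed by
    # reward-aware pruning (Algorithm 10).
    sparsity = 1.0 - mask.mean()
    if sparsity < S_shift:                             # early phase: keep block-circulant grouping
        grad = block_circulant_project(grad, B)
    W = (W - lr * grad) * mask                         # pruned connections stay zero
    if state["reward"] > state["best_reward"] and state["p_th"] < S_ub:
        state["best_reward"] = state["reward"]
        state["p_th"] = min(state["p_th"] + P_step, S_ub)
        cutoff = np.quantile(np.abs(W), state["p_th"])  # bottom p_th fraction by magnitude
        mask = (np.abs(W) > cutoff).astype(W.dtype)
        W = W * mask
    return W, mask, state

# Example state: {"reward": 120.0, "best_reward": 100.0, "p_th": 0.0}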

Reward-Aware Pruning

In addition to the GST, we developed reward-aware pruning (RWP), an iterative pruning method that can minimize the reduction of reward in the DRL domain. The key concept of RWP is to enable stable training by dynamically adjusting the target sparsity of pruning according to the history of reward values. The previous iterative pruning algorithm [27], which targets classification networks, utilized the gradual pruning function (17) to determine the target sparsity according to the training iteration:

s_t = s_f + (s_i − s_f) · (1 − (t − t_0) / (nΔ))^3        (17)

where s_i is the initial sparsity, s_f is the final sparsity, t_0 is the pruning start point, Δ is the pruning frequency, and n is the number of pruning steps. However, as pointed out in [36], applying (17) to DRL training can make the training procedure highly unstable. This is because Eq. (17) manages the pruning ratio in a fixed manner regardless of the status of training. Compared with classification DNN training, DRL training shows an unstable training curve and has

Fig. 13 Comparison between gradual pruning [27] (fixed sparsity schedule regardless of reward) and the proposed reward-aware pruning (RWP) (reward-aware sparsity scheduling)

Algorithm 10 Algorithm description of RWP
Input: pruning threshold step P_step, previous highest reward R_prev, current reward R_new, model parameters W, current pruning threshold P_th
Output: the pruned model parameter W_pruned, updated pruning threshold P_th_new
1: if R_new > R_prev then
2:     P_th_new = P_th + P_step
3:     Zeroize the bottom P_th_new % of each layer's weights
4: end if

various convergence speeds depending on the task. Therefore, pruning with a fixed sparsity schedule cannot sufficiently consider the saliency of the weight parameters in DRL, which leads to a training performance drop. Previous pruning papers in the DRL domain [36, 37] suggested network compression based on knowledge distillation to overcome this problem, but they can hardly accelerate the DRL training procedure because they rely on a pre-trained or large teacher network. Figure 13 compares the proposed RWP and the previous gradual pruning [27]. The proposed RWP dynamically changes the target sparsity according to the reward. Specifically, RWP increases the target sparsity only when the obtained reward is higher than the previous highest reward. By using RWP, it is possible to dynamically find a pruning pattern suitable for the various convergence patterns of DRL. The RWP is summarized in Algorithm 10, and the two schedules are contrasted in the sketch below.
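
For comparison, the two sparsity schedules can be sketched as follows. The parameter values are placeholders, and the reward sequence is whatever the DRL run produces; only the scheduling logic reflects Eq. (17) and Algorithm 10.

def gradual_sparsity(t, s_i=0.0, s_f=0.9, t0=0, n=100, delta=1):
    # Eq. (17): a fixed cubic schedule that ignores the training status.
    if t < t0:
        return s_i
    frac = min((t - t0) / (n * delta), 1.0)
    return s_f + (s_i - s_f) * (1.0 - frac) ** 3

def rwp_schedule(rewards, p_step=0.01, s_ub=0.9):
    # RWP: the target sparsity grows only when a new best reward is observed.
    target, best, schedule = 0.0, float("-inf"), []
    for r in rewards:
        if r > best:
            best = r
            target = min(target + p_step, s_ub)
        schedule.append(target)
    return schedule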


Fig. 14 Block size conversion methods for supporting multiple block sizes during training

Block Size Conversion Methods

Instead of controlling only whether block-circulant grouping is applied during training, GST can also control the block size B during training to achieve a higher compression ratio. Compared with utilizing only a small B for the entire training, the compression ratio can be further increased without training performance loss by utilizing a large B for the early iterations of training and a small B for the later iterations. However, it is impossible to change the block size B directly during training because there is no compatibility between block-circulant matrices of different sizes. Therefore, methods for converting a weight trained with a large B into a weight that is trainable with a small B are needed. Figure 14 shows three ways to convert a B = 4 weight into a B = 2 weight: (1) projection, (2) block4 friendly block2, and (3) block2 friendly block4. The simplest conversion method is projecting the weight onto the target-B circulant tensor [34] when a block size transition occurs. However, the projection method causes a huge training performance drop due to the sudden parameter value change after the block size transition and makes subsequent training unstable. The block4 friendly block2 method composes the block2 structure by dividing the block4 structure in half instead of changing the parameter values. The block2 friendly block4 method composes the block4 structure as a collection of four block2 structures. By utilizing the friendly grouped matrix, the block size transition can be performed without changing the parameter values, preventing the training performance drop that occurs in the transition. The results of the three methods are described in the experiment subsection, and a sketch of the circulant block structure and the projection-based conversion follows below.
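
The following sketch illustrates the block-circulant structure and the projection-based conversion. The averaging used in project_to_block is an assumed interpretation of projecting a trained block-4 weight onto the target block-2 circulant structure; the friendly-grouping methods avoid this value change entirely by reusing the existing parameters.

import numpy as np

def circulant_block(gen):
    # B x B circulant block generated from a length-B vector, e.g. [a, b, c, d].
    B = len(gen)
    return np.array([[gen[(j - i) % B] for j in range(B)] for i in range(B)])

def project_to_block(W_block, B_new):
    # Project one trained B_old x B_old block onto a B_new-circulant structure by
    # averaging the entries that share a circulant diagonal in each B_new x B_new tile.
    B_old = W_block.shape[0]
    out = W_block.astype(float).copy()
    for r0 in range(0, B_old, B_new):
        for c0 in range(0, B_old, B_new):
            tile = out[r0:r0 + B_new, c0:c0 + B_new]
            for d in range(B_new):
                diag = [(i, (i + d) % B_new) for i in range(B_new)]
                avg = np.mean([tile[i, j] for i, j in diag])
                for i, j in diag:
                    tile[i, j] = avg
    return out

# circulant_block([1, 2, 3, 4]) reproduces the 4 x 4 circulant pattern shown in Fig. 14.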

Estimation of Compression Ratio

This chapter proposes a weight compression algorithm for a system that can support block-circulant grouping and unstructured sparsity through an ASIC implementation. Ideally, the compression ratio of a DNN with sparsity S and block size B is expressed by the following Eq. (18):

CR = ((B + S − 1) / B) × (P_comp / P_total)        (18)

where P_total is the total number of parameters and P_comp is the number of parameters to which GST is applied. However, even if an ASIC implementation is considered, the overhead for encoding the index information of unstructured sparsity should be taken into account. In this chapter, bitmap sparsity encoding is adopted to estimate the overhead. The bitmap method encodes sparsity with a bitmap whose value is a 1-bit 0 at each zero weight position and a 1-bit 1 at each non-zero weight position. Other encoding methods such as CSR or Viterbi [39] showed a higher compression ratio than the bitmap for high sparsity, but their encoding overhead is greater than the bitmap for low sparsity. The compression ratio of a DNN with 16-bit weights and bitmap encoding is expressed by the following Eq. (19):

CR = ((B + S − 1) / B − 1/16) × (P_comp / P_total)        (19)
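
As a worked example of Eq. (19), the helper below plugs in illustrative numbers (block size B = 4, sparsity S = 0.9, and 91.7% of the parameters covered by GST, as in the Halfcheetah-v2 configuration of the next subsection); the printed value is an illustration, not a reported measurement.

def compression_ratio(B, S, p_comp_frac, bitmap_bits=1, weight_bits=16):
    # Eq. (19): compression ratio with block size B, sparsity S, and bitmap overhead.
    return ((B + S - 1.0) / B - bitmap_bits / weight_bits) * p_comp_frac

print(compression_ratio(4, 0.9, 0.917))   # ~0.84, i.e., about 84% of the weight traffic removed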

3.4 Experiments

In this subsection, we compare the reward of the proposed training algorithm and the conventional training algorithm while measuring the obtained compression ratio (CR). In order to show that the proposed GST can compress weight parameters while maintaining training performance without additional training iterations, the entire training curves were measured and reported. The CRs were measured with the method described in Eq. (19). The proposed GST can be applied to all existing DRL training algorithms. In this chapter, we mainly evaluate the proposed GST with the TD3 [24] algorithm, one of the state-of-the-art DRL training algorithms, in Mujoco environments. Halfcheetah-v2 and Humanoid-v2 are adopted among the Mujoco environments for the GST evaluation. We used the same network configuration as the one utilized in [24]. The networks used in Halfcheetah-v2 and Humanoid-v2 consist of 3 FC layers. The numbers of neurons in the first and last FC layers are determined by the state and action dimensions of each environment. In the case of Halfcheetah-v2, the grouping and pruning of the proposed GST were applied only to the second FC layer, which accounts for 91.7% of the total parameters. In the case of Humanoid-v2, the grouping and pruning were applied to the first and second FC layers, which account for 97.4% of the total parameters. Network parameter initialization and training hyperparameters were the same as those introduced in [24] for a fair comparison. All results on Mujoco environments


are reported after averaging the results of 5 random seeds for the Gym simulator and the network initialization. Moreover, we verify the proposed GST not only on various RL algorithms (A2C and PPO [19]) with various test environments (Atari Breakout and Google Research Football [26]) but also on classification datasets (CIFAR-10 [40] and ILSVRC-2012 [41]) with famous convolutional neural networks (ResNet-32 [42] and AlexNet [1]).
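
A minimal PyTorch sketch of the 3-FC-layer actor used in the Mujoco experiments is shown below. The hidden width (256) and the activation functions are assumptions for illustration; only the input and output sizes follow the state and action dimensions described above, and GST targets the large middle layer(s).

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)    # size set by the state dimension
        self.fc2 = nn.Linear(hidden, hidden)       # dominant layer: the main GST target
        self.fc3 = nn.Linear(hidden, action_dim)   # size set by the action dimension

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))             # bounded continuous actions

# Halfcheetah-v2: state_dim = 17, action_dim = 6; Humanoid-v2: state_dim = 376, action_dim = 17.
actor = Actor(state_dim=17, action_dim=6)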

Gradual Pruning vs. Reward-Aware Pruning

We first verify the effectiveness of the proposed reward-aware pruning (RWP) in order to determine the pruning methodology. To compare the performance of gradual pruning and RWP, we first apply RWP for different pruning start points (0.0M and 0.2M steps). After the reward and compression ratio of RWP are measured, we apply gradual pruning configured to reach the same compression ratio as RWP and check the reward loss. Figure 15 shows the comparison results of gradual pruning and RWP. In Halfcheetah-v2, RWP achieved a 49.1% average compression ratio for pruning start point 0.0M. In Humanoid-v2, RWP achieved a 25.0% average compression ratio for pruning start point 0.0M. Interestingly, when gradual pruning is applied to achieve the same compression ratio as RWP, a small reward loss appears in Halfcheetah-v2, whereas a very large reward loss appears in Humanoid-v2. This is due to the training stability difference between the two environments. The reward graph of Halfcheetah-v2 converges early and training is stable, whereas the reward graph of Humanoid-v2 fluctuates and training is unstable. In the case of gradual pruning, the target sparsity is increased even at points where training is unstable

Fig. 15 Reward and compression ratio measurement results of gradual pruning [27] and reward-aware pruning. Black lines are the baseline result (no pruning), blue lines are the gradual pruning result, and red lines are the reward-aware pruning results. (a) Result on Mujoco Halfcheetah-v2 with TD3 [24]. (b) Result on Mujoco Humanoid-v2 with TD3 [24]

and the saliency of the parameters cannot be determined, resulting in a large reward loss on Humanoid. On the other hand, in the case of RWP, the reward loss could be prevented by dynamically stopping pruning at the corresponding points.

GST Performance According to B and S_shift

To verify the performance of the proposed GST, we measured the reward and compression ratio on the Mujoco environments according to the block size B and the phase shift sparsity S_shift. The pruning start point was fixed at 0.0M, and the reward and compression ratio were measured for B = 2, 4 and phase shift sparsity S_shift = 0.25, 0.5, 0.75, 1.0. Figure 16 shows the proposed GST measurement results on the Halfcheetah-v2 environment. For all the B values, there is no training reward drop of the proposed GST when S_shift values of 0.5 or less were used, and only a negligible reward drop occurred even at higher S_shift values. The maximum compression ratio of 74.9% was obtained under the condition of S_shift = 1.0 and B = 4, which is 25 %p higher than when only RWP was applied. Figure 17 shows the proposed GST measurement results on the Humanoid-v2 environment. In the case of Humanoid, there is no result for S_shift = 1.0 because a sparsity of 75% or more was not achieved. When B = 2 was used, a similar level of reward to the baseline could be achieved for all S_shift values. When B = 4 was used, a large reward drop occurred for S_shift = 0.5, 0.75. The maximum compression ratio of 61.9% was obtained without reward drop under the condition of S_shift = 0.75 and B = 2, which is 36.9 %p higher than when only RWP was applied. In both environments, we can observe two tendencies: (1) Even in the case of GST with a large B, a similar level of reward compared with the baseline is achieved

Fig. 16 Reward and compression ratio measurement results according to different block sizes and phase shift sparsity. The measurement environment is Mujoco Halfcheetah-v2 with TD3 [24]. Black lines are the baseline result, blue lines are block size 4 results, and red lines are block size 2 results

Fig. 17 Reward and compression ratio measurement results according to different block sizes and phase shift sparsity. The measurement environment is Mujoco Humanoid-v2 with TD3 [24]. The line color configuration is the same as in Fig. 16

for the early iterations. (2) After the phase transition occurs, the reward of GST gradually recovers to the baseline reward level. Therefore, we can obtain both a high compression ratio and a high reward by releasing the block-circulant grouping at an appropriate time.


Performance of Block Size Conversion Methods

Figure 18 shows the GST measurement results with the three different block size conversion methods on Halfcheetah-v2 and Humanoid-v2. For all S_shift values in both environments, the projection method showed the worst training reward, and the block4 friendly block2 method showed the highest training reward. In the projection method, a large fluctuation in reward occurred due to the sudden change in parameter values at each phase transition, which made training unstable. On the other hand, the methods using the friendly matrix enabled stable training, and the same level

Fig. 18 Reward and compression ratio measurement results according to the three block size conversion methods. Black lines are the baseline result, red lines are the results of the block4 friendly block2 method, blue lines are the results of the block2 friendly block4 method, and green lines are the results of the projection method. (a) Result on Mujoco Halfcheetah-v2 with TD3 [24]. (b) Result on Mujoco Humanoid-v2 with TD3 [24]

of training reward as the block-circulant method was obtained. All three methods have similar average compression ratios. In the case of Halfcheetah-v2, there is no advantage from the block conversion methods because it is possible to utilize block4 for the entire training. In the case of Humanoid-v2, the maximum compression ratio of 66.5% was obtained without reward drop under the condition of S_shift = 0.25 and the block4 friendly block2 method, which is 4.6 %p higher than the result of GST without a block conversion method.

Verification GST on Various DRL Benchmarks


We tested the proposed GST on a more diverse set of DRL benchmarks to verify that it can operate regardless of the layer types (convolutional layer, FC layer), RL algorithms, and task types. Figure 19 shows the training results of Atari Breakout with A2C and the training results of Google Research Football with PPO. In the Atari Breakout training, the network configuration and training hyperparameters were the same as those introduced in [43], and we ran 5 random seeds and averaged the results. The network consists of 3 convolutional layers and 2 FC layers, and GST was applied to

Fig. 19 Reward and compression ratio measurement results of GST on various DRL benchmarks. Black lines are the baseline result, and red lines are the GST result. (a) Result on Atari Breakout with A2C. (b) Result on Google Research Football with PPO

99% of the weights, excluding the first and last layers. In this case, the highest average compression ratio of 71.9% was achieved without a reward drop when B = 2 and S_shift = 1.0. In the Google Research Football training, we utilized the same PPO network configuration and training parameters as [26], and we ran 3 random seeds and averaged the results. We applied GST to 94.3% of the parameters, excluding the first and last layers. We achieved an average compression ratio of 73.6% without reward loss when B = 4 and S_shift = 1.0.

Verification GST on Classification Networks

To verify the generality of the proposed GST, we applied the GST to famous classification benchmarks. Figure 20 shows the results of training CIFAR-10 with ResNet-32 and training ILSVRC-2012 with AlexNet. We utilized the same network configuration and the same hyperparameters as the PyTorch official implementation for both experiments. In the CIFAR-10 training, GST was applied to 94.8% of the parameters, covering layer 12 to layer 31 of ResNet-32. The proposed GST achieved an average compression ratio of 68.2% when B = 4 and S_shift = 0.5. The accuracy was 91.4%, which was 0.8% lower than the baseline accuracy of

Fig. 20 Accuracy and compression ratio measurement results of GST on classification network training. Black lines are baseline results, and red lines are GST results. (a) CIFAR-10 training result with ResNet-32. (b) ImageNet training result with AlexNet

92.2%. In the case of ILSVRC-2012 training, GST was applied to 87.4% of the parameters, including FC5 and FC6 of AlexNet. The proposed GST achieved an average compression ratio of 62.9% when the block4 friendly block2 method and S_shift = 0.25 were utilized. The accuracy was 55.8%, which was 0.4% lower than the baseline accuracy of 56.2%.

3.5 Conclusion and Future Work

In this chapter, we proposed a novel weight compression method for DRL training acceleration, group-sparse training (GST). To overcome the low compression ratio at early iterations and the unstable training caused by a fixed sparsity schedule, GST selectively utilizes block-circulant compression for a high compression ratio and dynamically adapts the target sparsity through reward-aware pruning for stable training. Thanks to these features, GST achieves a 25 %p to 41.5 %p higher average compression ratio than the previous iterative pruning method without reward degradation. To the best of our knowledge, this chapter is the first research to compress a DRL network during the training procedure through pruning without a pre-trained teacher network. In the future, we plan to work on an automatic hyperparameter search algorithm for finding the optimized pruning threshold and phase shift sparsity online.

4 An Energy-Efficient Deep Reinforcement Learning Processor Design

4.1 Introduction

Deep reinforcement learning (DRL) has been considered the next step in the burst success of artificial intelligence (AI). The recent huge interest in DRL is due to its groundbreaking performance in sequential decision-making problems, which is comparable to or even beyond human level. Thanks to its unique characteristic of learning the optimal actions to be taken by agents through trial and error, DRL is suitable for adaptive and continuous control tasks such as gaming agents [12–14], autonomous driving [44, 45], autonomous drones [46, 47], autonomous robots [48, 49], and human–computer interaction [50]. DRL gaming agents from Google DeepMind overwhelmed human experts from simple video games such as Atari [12] to sophisticated games such as Go [13] and StarCraft [14]. A four-legged walking robot trained with DRL can adapt to sudden environmental changes such as new obstacles and slope variations [48], and a DRL robotic curling agent team beat 3 of 4 human expert teams in an actual curling match [49]. The main reason for the impressive success of DRL is the adoption of a deep neural network (DNN) for approximating complicated environments. Figure 21a shows the basic concept of classical reinforcement learning (RL) and DRL. A classical RL

Fig. 21 (a) Basic concept of reinforcement learning (RL) and deep reinforcement learning (DRL). (b) Overall flow of DRL training

system consists of the environment and the RL agent. At each time t, the RL agent gets an observation O_t from the environment and takes an action a_t; as a result, the agent receives a reward r_t, and the state of the environment changes to O_{t+1}. The goal of RL is to find a policy π_θ(O_t, a_t) that maximizes the expected accumulated


reward for the entire timestep. The classical RL algorithms [51] utilized table-based mapping functions that have a limited representation capacity and thus suffered a severe performance drop in complicated real-world environments. In the case of DRL, DNNs trained by back-propagation with a loss L_t from the RL algorithm are utilized for approximating policy functions. Moreover, recent state-of-the-art DRL algorithms [19, 24, 25] utilize multiple (>3) deep neural networks (DNNs) for fast convergence while minimizing communication between DRL agents and environments. Figure 21b shows an example flow of DRL with multiple DNNs. It integrates an actor DNN (to generate actions), a critic DNN (to evaluate training status), a target DNN (to stabilize training), and a model DNN (for the digital twin). The overall training procedure consists of 2 stages: experience gathering (EG) and policy update (PU). EG generates experiences for training DRL agents by repetitive inference (feed-forward, FF) of the actor DNN and the model DNN while storing the generated experiences in the experience memory. PU trains the DRL policy with back-propagation (BP) sequences of the actor DNN, the critic DNN, and the target DNN. PU also trains the environment model by back-propagation of the model DNN.

Until now, it has been considered that DRL training should be implemented in high-performance servers with environment simulators. Indeed, [14] utilized 32 tensor processing units (TPUs) for DRL agent training, and Seed RL [18], which is a state-of-the-art DRL framework, also used hundreds of CPU cores and 8 TPU v3 cores. However, as more and more studies argue the need for real-world DRL, DRL training on edge devices is essential [20, 21]. DRL can enable edge devices to adapt to new environments that are private or restricted to the public [22]. Moreover, each edge device can be updated to provide an optimized service to the user. Although the number of edge devices that require DRL is increasing, it is impossible to create simulators that accurately represent the changes in the environments caused by the actions of each edge device [21]. Therefore, enabling DRL training on resource-limited edge devices is required for direct communication with environments.

Unlike previous DRL frameworks, mobile DRL frameworks use only a small number of environment simulators or do not use environment simulators at all. Therefore, the training speed of a mobile DRL accelerator is mainly determined by the time required for training the DNNs utilized in the DRL. However, the large memory bandwidth requirement of DRL limits the implementation of DRL training on edge devices. There are two main reasons for the frequent memory access. First, DNNs utilized for DRL are mainly composed of fully connected (FC) or recurrent layers that require a lot of memory access for high throughput. Figure 22a shows the computational intensity of several famous CNNs and DRL algorithms [23–25]. Compared with CNNs, which show more than 100 Ops/byte computational intensity, the computational intensity of the DRL algorithms for the Mujoco environment is limited to 50-60 Ops/byte. Second, the complex and sequential execution of multiple DNNs in DRL requires frequent access to model parameters and experiences, resulting in large memory access. Indeed, a previous DRL training framework for Google Research Football suffered from a 146 GB/s memory bandwidth requirement [18]. This frequent access problem occurs in both

Fig. 22 (a) Limited computational intensity and memory bandwidth requirement of DRL. (b) Analysis of external memory access (EMA) and power consumption of the DRL processor (measured at TD3 training on Mujoco Humanoid-v2 with 64 batch)

external and internal memory. Figure 22b shows the analysis of external memory access (EMA) and the DRL processor's power breakdown. The sequential and complex executions of DNNs increase weight transactions, which occupy 53.5% of the entire EMA. The massive internal memory access makes the on-chip SRAM consume 53.6% of the entire processor's power consumption.

For energy-efficient DRL training on edge devices, some ASIC-implemented DRL accelerators have been studied [52–55]. Kim et al. [52] proposed the first DRL training accelerator that can compress the feature map's exponent, but it could not reduce memory access for weights. Kang et al. [53] achieved impressive energy efficiency by exploiting the feature map sparsity generated by ReLU activation, but it also could not reduce memory access for weights. Furthermore, the ReLU activation function is deprecated in DRL training for higher accuracy and stable training [56]. Amaravati et al. [54] and Cao et al. [55] proposed DRL processors with low-power analog multiply-and-accumulate (MAC) designs, but they could not reduce the dominant memory power. Moreover, they utilized low-bit (.

Fig. 37 Breakdown of OmniDRL performance benchmark results (data movement time vs. calculation time for TD3 on Humanoid-v2 and PPO on Google Research Football)

Fig. 38 (a) Demo system of OmniDRL. (b) Block diagram of the system board

of the OmniDRL chip, an FPGA, a USB controller, DDR3 SDRAM, and flash memory. The FPGA is the host of the system board and manages data movement between components. The external interfaces of OmniDRL are connected with the OmniDRL external bridges in the FPGA, which are connected to other IPs, including the DDR3 controller and the embedded processor, via a system bus. OmniDRL communicates with the laptop through the USB peripheral controller chip and the FPGA.


Table 2 Comparison table with the previous DRL training processors

Metric | TPU-V2 | JSSC'19 | JSSC'20 | ISSCC'19 | ISSCC'20 | This Work
Target DNN | GP-NPU | DRL | DRL | DRL | GAN, DRL | DRL
Data Compression | - | X | X | F.Map | F.Map | F.Map & Weight
Process [nm] | 20 | 55 | 65 | 65 | 65 | 28
Die Area [mm2] | - | 3.4 | 2 | 16 | 32.4 | 12.96
Weight Precision | BFloat16 | FXP6 | FXP5~FXP8 | FXP4/8/16 | FP8/16 | BFloat16
Activation Precision | BFloat16 | FXP7 | FXP5~FXP8 | BFloat16 | FP8/16 | BFloat16
SRAM [KB] | - | 0.2 | 16 | 448 | 676 | 1552
Supply Voltage [V] | - | 0.4-1.0 | 0.4-1.0 | 0.67-1.1 | 0.70-1.1 | 0.68-1.1
Operating Frequency | - | 67.5 MHz | 1 kHz-1.5 MHz | ~200 MHz | ~200 MHz | ~250 MHz
Peak Performance [GFLOPS or GOPS] | 45000 | 204 | 0.004 | 204 | 540 to 14030 3) | 768 1) to 4178 2)
Power Consumption [mW] | 225000 | 0.69 (67.5 MHz, 1.2 V) | 0.0034 (1.5 MHz, 1.2 V) | 2.4 (10 MHz, 0.67 V); 196 (200 MHz, 1.1 V) | 58 (25 MHz, 0.70 V); 647 (200 MHz, 1.1 V) | 3.1 2) to 6.7 1) (5 MHz, 0.68 V); 231.6 2) to 481.9 1) (250 MHz, 1.1 V)
Peak Energy Efficiency [TFLOPS/W or TOPS/W] | 0.2 | 3.12 (1 kHz, 0.4 V) | 9.1 (FXP3, 1 kHz, 0.4 V) | 2.2 (FP16, 50 MHz, 0.73 V) | 1.8 to 75.7 3) (50 MHz, 0.75 V) | 2.5 1) to 29.3 2) (10 MHz, 0.68 V)
DRL Training Energy Efficiency 4) [TFLOPS/W] | - | - | - | 1.04 (FP16, 200 MHz, 1.1 V) | 2.93 (FP16, 200 MHz, 1.1 V) | 7.42 5) (Bfloat16, 200 MHz, 1.1 V)

1) Weight sparsity 0%, no weight group, exponent CR 0%. 2) Weight sparsity 90%, weight group 4, exponent CR 90%. 3) FP16, 90% in/out activation sparsity. 4) Measured @ Mujoco Halfcheetah-v2, TD3, 256 batch, excluding environment simulation running. 5) Average efficiency over 1M iterations.

Table 2 shows the comparison with previous DRL processors. Most of the processors, including OmniDRL, support at least 16-bit floating point such as FP16 or Bfloat16 for accurate DRL training, except [54, 55], which were optimized only for simple tasks such as obstacle avoidance. OmniDRL is the only DRL processor that can support data compression for both weights and feature maps. By utilizing direct computation on compressed data, OmniDRL achieves 7.42 TFLOPS/W DRL training efficiency in TD3 Halfcheetah training, which is 2.4× higher than the previous state-of-the-art processor [53].

4.5 Discussion

Analysis on GST Performance According to Group Size

In the proposed GST, the group size is one of the most important parameters. A larger group size increases the weight compression ratio while reducing on-chip memory power consumption. However, it also damages the DRL training reward due to the limited capacity of the DNN model. Figure 39 shows the DRL training reward measurement results according to different group sizes. The results are averaged over 5 seeds after 1M iterations of group-sparse training on the Mujoco Humanoid environment with 256 batches. The mode switch sparsity of GST is set to 50%. As shown in the results, group sizes 2 and 3 achieve a reward comparable to the baseline (G = 1), and group size 4 shows a slightly smaller reward than the baseline (a difference of about 144). However, group sizes larger than 4 suffer huge average reward degradation and larger instability. The proposed OmniDRL supports group size 2, which can guarantee stable DRL training performance in various DRL applications, and group size 4, which can offer a higher compression ratio in simple applications.

Fig. 39 GST results on Mujoco Humanoid according to various group sizes (measured with GST, TD3 on Mujoco Humanoid-v2 with 256 batch; averaged results of 5 seeds over 1M DRL training iterations):

Group Size | Reward        Group Size | Reward
G=1 | 5086 ± 296           G=5 | 4324 ± 1704
G=2 | 5057 ± 390           G=6 | 3369 ± 1354
G=3 | 5063 ± 402           G=7 | 4164 ± 835
G=4 | 4942 ± 535           G=8 | 2020 ± 1335

Scalability of the Proposed GSTC

In this subsubsection, we explore the scalability of the proposed GSTC from two points of view. First, we show the scalability of the weight router design, which is a key block of the GSTC. Second, we explore the adjustments required for data representations other than bfloat16.

The weight router in OmniDRL supports 2 group sizes (2 and 4) for an 8 × 8 PE array. Two parameters affect the weight router's complexity: (1) the magnitude of the group size and (2) the target PE array size. Basically, the weight router is composed of multiple multiplexers that select the data broadcast to each PE row among the values stored in the weight register. The important point is that it does not fully connect all PE rows and all weight registers. If the group size is 2, each PE row needs to be connected with only two weight registers. Therefore, the complexity of each multiplexer is proportional to the supported group sizes, and the total number of multiplexers is proportional to the target PE array size. As a result, the total complexity of the weight router is proportional to the product of the supported weight group size and the PE row size; a behavioral sketch is given at the end of this subsubsection.

From the data representation point of view, OmniDRL utilizes bfloat16, which has an 8-bit exponent, a 7-bit mantissa, and a 1-bit sign. In this paragraph, we analyze the effect of changing the target data representation, regardless of DRL training accuracy. If we utilize fixed-point instead of bfloat16, all the integrated floating-point arithmetic units should be changed into fixed-point units. Moreover, the exponent-mean-delta encoding (EMDE) cannot be utilized anymore. If we vary the bit-width of the mantissa or exponent while maintaining a floating-point representation, we should update the arithmetic units and the on-chip memory bandwidth. In this case, we can still utilize EMDE, but a small exponent bit-width or a large mantissa bit-width reduces the overall compression ratio of EMDE. The group-sparse training (GST) and the group-sparse training core (GSTC) can be utilized irrespective of the data representation.
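
A behavioral sketch of the weight router's selection logic is shown below. The wiring pattern (rows within the same group sharing a set of weight registers) is an assumption used only to illustrate why the multiplexer width scales with the group size and the multiplexer count with the number of PE rows.

def route_weights(weight_regs, group_size, select):
    # weight_regs: one register value per PE row; select[r]: mux choice for row r.
    # Each PE row is wired only to the registers of its own group of `group_size` rows,
    # so every multiplexer has `group_size` inputs.
    routed = []
    for r in range(len(weight_regs)):
        base = (r // group_size) * group_size
        candidates = weight_regs[base:base + group_size]
        routed.append(candidates[select[r] % group_size])
    return routed

# Example with an 8-row array and group size 2 (as in the GSTC configuration):
print(route_weights(list(range(8)), 2, [0, 0, 1, 0, 1, 1, 0, 1]))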


PE Array Size Decision of OmniDRL

The many processing engines (PEs) in DNN hardware are usually distributed across several cores. The PEs allocated to a core form an array structure, and DNN hardware shares the fetched data inside the PE array for higher energy efficiency. There is a tradeoff between different PE array sizes. We can integrate a small number of cores that contain a large PE array (Large Array Few Core, LAFC) or integrate a lot of cores that contain a small PE array (Small Array Many Core, SAMC). In the case of OmniDRL, the LAFC can integrate more PEs by reducing the area overhead of the core controller but limits the advantage of sparsity exploitation. This is because each PE row's weight sparsity pattern is different from the others, and a large number of rows makes it hard to support the zero-skip operation with a limited input feature map bandwidth. The SAMC can increase the speed-up from sparsity exploitation, but it limits the number of integrated PEs due to the controller overhead. Therefore, selecting the appropriate size of the PE array is important. Figure 40 shows the throughput measurement results according to the PE row size. All results are normalized by the throughput of PE row size 4 at 0% weight sparsity. Row16 shows the highest throughput at 0% weight sparsity due to its small controller overhead, and row4 shows the highest throughput at 90% weight sparsity due to the higher speed-up of zero-skipping. In the DRL training scenario with the proposed GST, row8 achieves the best performance on the Halfcheetah environment, and row16 achieves the best performance on the Humanoid environment. We select the 8 × 8 PE size to prevent underutilization in various DRL scenarios while achieving high performance.

Fig. 40 Normalized GSTC throughput measurement results according to the PE row size (normalized throughput on Halfcheetah and Humanoid for PE row sizes 4, 8, 12, and 16 as a function of weight sparsity)

4.6 Conclusion

OmniDRL, an energy-efficient deep reinforcement learning (DRL) processor, is proposed to realize DRL training on resource-limited edge devices. To mitigate the massive external and internal memory access requirement of DRL, OmniDRL presents 4 key features that compress data as much as possible and support the direct utilization of the compressed data. A novel DRL training algorithm, group-sparse training, increases the weight compression ratio by 41.5 %p by appropriately utilizing both block-circulant-based weight grouping and weight pruning. To fully take advantage of the weight compressed by GST, the group-sparse training core is proposed to directly utilize the grouped and sparse compressed weight for high energy efficiency. The proposed exponent-mean-delta encoding shows a 16.2 %p higher exponent compression ratio than the previous top-3 exponent compression method, and directly utilizing the compressed exponent, with decoding after fetching, decreases SRAM power consumption by 23.3%. The irregularly compressed weight can be transposed with a world-first on-chip sparse weight transposer. OmniDRL is fabricated in 28 nm CMOS technology and occupies a 3.6 × 3.6 mm2 die area. It shows a state-of-the-art peak performance of 4.18 TFLOPS and a peak energy efficiency of 29.3 TFLOPS/W. It achieves 7.42 TFLOPS/W energy efficiency for training a robot agent (Mujoco Halfcheetah, TD3), which is 2.4× higher than the previous state-of-the-art.

5 Low-Power Autonomous Adaptation System with Deep Reinforcement Learning

5.1 Introduction

For the last decade, interest in autonomous systems such as drones, autonomous robots, and autonomous vehicles has been growing rapidly. The goal of these systems is to perform risky, repetitive, and time-consuming tasks on behalf of humans. For this goal, the autonomous systems need to change their behavior in response to unanticipated events during operation [63]. Figure 41 shows the typical process of an autonomous system. It consists of 3 stages: (1) perception, which interprets input data from sensors; (2) planning, which decides and plans actions based on data analysis; and (3) control, which makes actual actions through the controller and actuators. Recently, each stage of the autonomous system has adopted a deep neural network (DNN) to improve accuracy. If the DNN has been trained sufficiently with a large amount of data, it can show impressive control performance. Indeed, self-driving technology from Tesla and Google or the auto-pilots from the DARPA challenge show beyond human-level performance. However, the high performance of the DNN is only maintained in the pretrained environment, and the performance can be severely degraded in an unknown


Fig. 41 Typical process of the autonomous system

environment or under sudden environmental changes [64]. We can mitigate the performance drop by increasing the size of the dataset so that it covers all kinds of environments. However, this makes the training procedure difficult due to the slow convergence speed and the costly data gathering process [65]. Self-adaptation with deep reinforcement learning (DRL) has been highlighted as an alternative to costly pre-training [22, 64, 65]. Unlike supervised learning, which requires pre-labeled data, DRL gathers experiences by itself. Thanks to this distinct characteristic, DRL can be utilized for autonomous adaptation systems such as robot arms [64] and navigating robots [22, 65]. Figure 42a shows the basic components of the DRL system. The goal of DRL is training the policy DNN of the DRL agent that determines the action based on the state and reward from the environment. Figure 42b shows the basic operation of DRL training. The DRL training is composed of iterative operations of 2 steps. The first step is experience gathering, which gathers experiences by repetitive inference of the policy DNN and stores them in the experience memory. After a sufficient amount of experience is generated, the second step, policy update, updates the policy DNN through back-propagation of sampled experiences. Figure 42c shows the overall process of an autonomous system with DRL. The recent state-of-the-art DRL algorithms [24] utilize multiple DNNs to efficiently process the 3 stages of the autonomous system. In the perception stage, a feature extractor DNN converts sensor inputs into a state. In the control stage, an actor DNN generates actions from the state. In the planning stage, a critic DNN evaluates the action and generates values. The values from the critic DNN and the target values from the target DNN are utilized for calculating the loss required for updating the DNNs. Please note that the planning stage of the DRL system is utilized for training the DNNs instead of directly generating actions.

However, it is hard to accelerate DRL in the autonomous system due to its massive number of external/internal memory accesses. There are two reasons for the large number of memory accesses: (1) redundant memory access due to the complex and sequential execution of multiple DNNs and (2) the small computational intensity of the DNNs (usually fully connected layers). Indeed, most DRL algorithms' computational intensity is limited to 60 Ops/byte, and 53.6% of the DRL processor's power is dissipated in on-chip SRAM. Previous DRL systems [52, 53] tried to


Fig. 42 (a) Basic components of DRL system. (b) Basic operation of DRL training process. (c) Overall process of autonomous DRL system

mitigate the memory bandwidth requirement by compressing only the feature map, but they could not compress the weights. In this chapter, we propose a low-power and high-performance DRL system implementation with an energy-efficient DRL chip [6]. The chip [6] supports compression for both weights and feature maps. Moreover, it proposes a network design for seamless data transactions even with data compression. With the proposed chip,


we demonstrate an autonomous adaptation system. The system enables a humanoid to adapt to sudden environmental changes, such as a size variation of the head or arm, and it is verified on the Mujoco Humanoid environment. A detailed explanation of the proposed system, including the overall architecture, system board configuration, and workload allocation, will be introduced. The proposed system achieves 3.9× higher energy efficiency than an NVIDIA TX2 in the humanoid adaptation scenario.

5.2 Processor Design for DRL Acceleration

Overall Architecture

Figure 43 shows the overall architecture of the energy-efficient DRL chip [6]. It consists of 24 group-sparse training cores (GSTCs), a DRL task scheduler, a pseudo RNG, 64 KB of global memory (GMEM), and a top RISC controller. All of these components are connected with a 32-bit 2-D mesh-type network-on-chip (NoC). The GSTC calculates the DNN operations for DRL and is designed to minimize data transactions by supporting compression for both weights and feature maps. It integrates a 9.6 KB input/output buffer, 52 KB of weight memory (WMEM), and an 8 × 8 bfloat16 PE array for DNN operations. The integrated weight router and weight/input prefetcher support the direct utilization of compressed weights and feature maps for high energy efficiency. Moreover, the compressing network interface (CNI) supports seamless compression during core-to-core feature map transactions. All

Fig. 43 Overall architecture of the proposed DRL chip [6]


GSTCs are connected with a reconfigurable accumulation and activation network to support inter-core partial sum accumulation. The detailed operation of the GSTC and the CNI will be discussed in the following subsections.

Methods for Compressing Massive Data of DRL

Figure 44 shows the methods for compressing the data of DRL. Most of the data transactions are caused by the calculation of multiple DNNs, and each DNN operation is composed of multiple layers' operations. The proposed chip supports 2 methods for compressing each layer's weights and feature maps: (1) group-sparse training (GST) for weight compression and (2) exponent-mean-delta encoding (EMDE) for both weight and feature map compression.

Fig. 44 The proposed method for compressing massive data of DRL


The main goal of the GST is to achieve a high weight compression ratio even during DRL training. It is widely accepted that we can compress weights by removing redundant weight connections (pruning). However, it is hard to prune weights during DRL training because the values of the weights are continuously updated. In the previous method (iterative pruning), the sparsity of the weights must be increased slowly because it takes time to judge the importance of a weight parameter before removing the connection. To handle this problem, the GST adopts both block-circulant-based weight grouping and pruning simultaneously. Both the reward and the weight sparsity increase over the iterations, and only after the sparsity of the weights gets sufficiently high does GST unpack the grouped weights and iteratively prune the weights until completion. The GST achieves a 29.5 %p higher weight compression ratio (63.9%) than iterative pruning on Mujoco Humanoid training.

The goal of EMDE is to add extra bit-level compression for both feature maps and weights. The DRL chip [6] adopts bfloat16 precision (8-bit exponent, 7-bit mantissa, and 1-bit sign) to maintain DRL training accuracy. Although an 8-bit-wide exponent range is required, some exponent values occur much more frequently than the others. The previous DRL accelerator [52] proposed top-3 exponent compression, which expressed the top-3 exponents with only 2-bit indexes and expressed the other exponents with the 8-bit original value plus a 2-bit index. However, it not only showed a low compression ratio for wide exponent distributions but also suffered from the overhead of tracking and sorting all exponent distributions. The EMDE utilizes 3-bit exponent-mean-delta encoding instead of top-3 encoding. In the EMDE, an exponent whose difference from the mean value is less than or equal to 3 is expressed with a 3-bit index. The EMDE achieves a 16.2 %p higher exponent compression ratio (52.9%) than the top-3 compression method [52] on Mujoco Humanoid training.
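
The exponent encoding can be sketched as follows. The 3-bit code assignment (seven codes for deltas in [-3, +3] plus one escape code, with out-of-range exponents stored at full width) is an assumed realization consistent with the description above, not the exact hardware format.

def emde_encode(exponents, mean_exp):
    # Exponents within +/-3 of the mean get a 3-bit delta code; others use an escape
    # code plus their original 8-bit value.
    codes, escapes = [], []
    for e in exponents:
        delta = e - mean_exp
        if -3 <= delta <= 3:
            codes.append(delta + 3)      # 3-bit index in [0, 6]
        else:
            codes.append(7)              # escape marker
            escapes.append(e)            # stored at full 8-bit width
    return codes, escapes

def emde_decode(codes, escapes, mean_exp):
    out, k = [], 0
    for c in codes:
        if c == 7:
            out.append(escapes[k]); k += 1
        else:
            out.append(mean_exp + (c - 3))
    return out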

Compressing Network Interface

To fully exploit the advantages of GST and EMDE, the overhead of compression itself must be minimized. Unlike GST, which updates the weight encoding only once per iteration, EMDE needs to update the feature map encoding more frequently because each batch requires a different encoding. Moreover, EMDE decreases core utilization due to the data dependency on the mean value, which is determined only after the complete output feature maps are calculated. Figure 45 shows the overall architecture and operation of the compressing network interface (CNI), which supports the encoding process of EMDE seamlessly. The CNI is composed of a mean monitor, an EMDE compressor, an RX FIFO, a TX FIFO, and a core interface. By taking advantage of the similarity of exponent mean values between different batches in DRL, EMDE is applied to the current batch with the exponent mean value of the previous batch. The mean monitor tracks the mean value of the exponent data while receiving them from the previous GSTC, and the EMDE compressor compresses the exponent data seamlessly while transmitting them to the next GSTC.


Fig. 45 Detailed operation of EMDE with the compressing network interface

By exploiting this batch-level delayed exponent compression, the CNI increases GSTC utilization by 1.3×.
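A rough sketch of that batch-level delay is shown below: the reference mean used for encoding always lags the data by one batch, so the compressor can stream without waiting for the current batch's statistics. The control flow here is an assumption for illustration; the CNI's actual FIFO and handshake logic are not detailed in the text.

def delayed_mean_stream(batches, init_mean=127):
    """Yield (reference_mean, batch) pairs where the mean lags by one batch."""
    ref_mean = init_mean                           # used to encode the CURRENT batch
    for batch in batches:
        yield ref_mean, batch                      # encode this batch with the previous mean
        ref_mean = round(sum(batch) / len(batch))  # becomes the NEXT batch's reference

exponent_batches = [[126, 127, 128, 125], [129, 130, 128, 131], [130, 131, 129, 132]]
for ref, batch in delayed_mean_stream(exponent_batches):
    print(f"encode {batch} against reference mean {ref}")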

5.3 DRL System Design for Autonomous Adaptation

DRL Adaptation Scenario of Humanoid

From this subsection onward, the system built around the proposed DRL chip is introduced. To demonstrate the autonomous adaptation characteristics of DRL, we design a humanoid adaptation scenario for sudden environment changes. Figure 46a shows the target scenario. At the beginning of the scenario, a humanoid with a pretrained DRL policy walks well. After a sudden modification, such as a bigger head or smaller arms, the modified humanoid struggles to walk with the pretrained policy. Figure 46b shows the overview of the humanoid adaptation process with the DRL system. First, the DRL system gathers experience from the modified environment. More specifically, the DRL network iteratively observes the state s(t), generates the action a(t), and stores the experience (s(t), a(t), r(t), s(t+1)) in the replay buffer. Second, the DRL system samples batches from the replay buffer to train the actor DNN, critic DNN, and target DNN.
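The sketch below summarizes this two-step loop (experience gathering followed by off-policy updates) in Python. The env, actor, and train_step objects are placeholders standing in for the Mujoco environment and the TD3-style learner; they are assumptions for illustration rather than the system's actual software interface.

import collections
import random

def adapt(env, actor, train_step, iterations=12_000, batch_size=256):
    """Gather experience from the modified environment and train from replay."""
    replay = collections.deque(maxlen=1_000_000)   # stores (s, a, r, s_next, done)
    s = env.reset()
    for _ in range(iterations):
        a = actor(s)                               # step 1: actor DNN generates the action
        s_next, r, done, _ = env.step(a)
        replay.append((s, a, r, s_next, done))     # store the experience
        s = env.reset() if done else s_next
        if len(replay) >= batch_size:              # step 2: sample a batch and train
            batch = random.sample(replay, batch_size)
            train_step(batch)                      # actor/critic/target updates (e.g., TD3)
    return actor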


Fig. 46 (a) Target humanoid adaptation scenario (b) Overview of the humanoid adaptation process with DRL system

Detailed Configuration and Operation of DRL System

Figure 47 shows the hardware setup of the proposed system. It is composed of a laptop and a DRL system board. The laptop runs a humanoid environment (Mujoco Humanoid-v2) and displays the adaptation results. We utilize the TD3 algorithm [24] for pre-training and fine-tuning the humanoid. Figure 48a shows the block diagram of the DRL system board. It is composed of the DRL chip, a Cyclone V FPGA (5CEFA7), a Cypress USB controller, 4 Gb of DDR3 SDRAM, a 128 Mb flash memory, and a JTAG interface. The FPGA serves as the interface between the DRL chip and the other components. The FPGA is programmed to have a USB controller for the Cypress USB controller, a DDR3 controller for the SDRAM, an EPCQ controller for the flash memory, and a custom NoC external bridge for the DRL chip. Moreover, it contains a NIOS II processor, a BMEM, and a DMA for top-level control. All of these components are connected with an Avalon bus interface. The NoC external bridge is connected with the two external interfaces of the DRL chip.


Fig. 47 Low-power DRL system setup with laptop and DRL system board

Figure 48b shows the workload allocation method of the proposed system. In the proposed system, the laptop runs the host Python program and the Mujoco Humanoid-v2 environment. The laptop transmits the state and the pre-trained weight parameters to the FPGA and receives the action from the FPGA. The FPGA manages the data transactions and controls the system components by generating instructions. The SDRAM stores the experiences of the replay buffer, the weight parameters of the DRL networks, and the intermediate feature maps. The DRL chip performs inference of the actor DNN and the target DNN and performs training of the actor DNN and critic DNN.

Implementation Results

Figure 49 shows the implementation results of the proposed chip and system. The proposed chip is implemented in 28 nm CMOS technology and occupies 3.6 mm × 3.6 mm. By exploiting data compression, it achieves 18.0 TFLOPS/W energy efficiency (measured at 250 MHz, 1.1 V) at 90% weight sparsity and a 90% exponent compression ratio. The proposed system demonstrates adaptation of a big-head humanoid, whose head size is doubled, and a small-head humanoid, whose head size is halved. As shown in Fig. 49, the proposed system adapts the big-head humanoid within 12,000 iterations of training and the small-head humanoid within 4000 iterations of training. The average power consumption of the proposed system is 2.6 W. The training efficiency of the proposed system is 10.3 iterations/J, which is 3.9× higher than the NVIDIA TX2.


Fig. 48 (a) Block diagram of the DRL system board. (b) Overview and workload allocation of the proposed DRL system

5.4 Conclusion

In this chapter, we propose a low-power autonomous adaptation system with an energy-efficient DRL chip. To reduce the massive memory bandwidth requirements, the DRL chip supports GST and EMDE to compress both weights and feature maps. Moreover, it integrates the CNI to mitigate the processor utilization drop caused by the compression. The proposed DRL system with the DRL chip demonstrates autonomous humanoid adaptation to sudden environmental changes such as size variations of the head or arms. The proposed system adapts the humanoid to walk within 12,000 training iterations while achieving 3.9× higher training energy efficiency than the NVIDIA TX2.


Fig. 49 Implementation results of the DRL chip and DRL system

6 Exponent-Computing-in-Memory for DNN Training Processor with Energy-Efficient Heterogeneous Floating-Point Computing Architecture

6.1 Introduction

Over the years, reducing the power dissipated in memory has been considered a key requirement of energy-efficient processor design. This tendency was triggered by the fact that memory technology lags behind logic technology in terms of latency and energy consumption, and it has been accelerated by the recent emergence of data-intensive applications. In particular, the deep neural network (DNN) achieves state-of-the-art performance in various signal-processing tasks but at the same time causes large memory power consumption due to iterative access to massive amounts of weights and feature maps. To mitigate the problem, many DNN processors have tried to reduce on-chip memory access by maximally reusing fetched data [66, 67] or by omitting access for redundant data [68, 69].

Recently, computing-in-memory (CIM) design has been spotlighted as a promising method to further decrease the on-chip memory power consumption of DNN processors. By moving computation into or near memory, CIM can reduce the number of memory accesses itself or enable memory designs that allow energy-efficient accesses. With the groundbreaking performance of CIM, many DNN CIM processors [70–74] have achieved state-of-the-art energy efficiency. More specifically, they reduce memory access power consumption by enabling multiple (≥2) word-lines (WLs) simultaneously and reading the calculated result through the bitline (BL) [70, 72–74], or by reusing the operation result on the bitline over multiple (≥2) cycles [71]. However, most of the previous CIM processors supported only fixed-point (FXP) data representation and could not support floating-point (FP) data representation.


Fig. 50 Difference between floating-point (FP) representation and fixed-point (FXP) representation

The main reason for the lack of FP CIM processors is the higher operation complexity of FP compared to FXP. Figure 50 shows the comparison of FP and FXP. The FXP format is composed of a specific number of bits and a fraction length. It represents a given range of numbers at equal intervals. On the other hand, FP is composed of a sign bit, exponent bits, and mantissa bits. It represents a given range of numbers with dynamic intervals determined by the exponent value. In general, FP covers a wider expression range than FXP but requires more computational logic due to the complicated interaction between the exponent and the mantissa. It was regarded as impossible to efficiently support the intricate FP operations with resource- and area-limited CIM designs. Indeed, the previous CIM processor [70] required hundreds to thousands of cycles of latency for a single FP multiply-and-accumulate (MAC) operation.

However, CIM processors that only support FXP operation cannot be adopted for applications that require a wide expression range, such as DNN training. Since DNN training requires a wide distribution of data, from very small data such as gradients or errors for updating weight parameters to relatively large data such as feature maps, the limited expression range of FXP is inappropriate. There has been some research on DNN training with FXP [75, 76], but most of it shows accuracy degradation in large models with large datasets [75] or requires complex statistical operations that are hard to accelerate in hardware [76]. Thus, most DNN training processors support FP instead of FXP to maintain high training accuracy. In particular, the 16-bit brain floating-point format (BFP16) is considered an efficient FP representation that enables stable training [77]. BFP16 consists of a 1-bit sign, an 8-bit exponent, and a 7-bit mantissa. Despite its small bit-width (16 bits), it has the same expression range as 32-bit single-precision floating point (FP32), making it suitable for both DNN training and inference. Indeed, commercial DNN accelerators such as Google's TPUv2, ARM's Armv8-A, and Intel's Nervana adopt BFP16 multiplication and FP32 accumulation for stable DNN training performance [78].
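Because BFP16 shares FP32's sign and exponent fields and simply truncates the mantissa, it can be derived from an FP32 word by keeping the upper 16 bits. The small Python sketch below demonstrates this relationship (round-to-nearest is omitted for brevity; it illustrates the format itself, not any particular accelerator's conversion hardware).

import struct

def to_bfloat16_bits(x: float) -> int:
    """Keep FP32's sign, 8-bit exponent, and top 7 mantissa bits (truncation)."""
    fp32_bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return fp32_bits >> 16

def from_bfloat16_bits(b16: int) -> float:
    """Re-expand by zero-padding the 16 discarded mantissa bits."""
    return struct.unpack(">f", struct.pack(">I", b16 << 16))[0]

x = 3.14159
b16 = to_bfloat16_bits(x)
print(hex(b16), from_bfloat16_bits(b16))   # 0x4049, ~3.140625: same range as FP32, coarser steps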


Therefore, it is essential to extend the scope of CIM design to FP operation in order to implement an energy-efficient DNN training processor for edge devices. In this chapter, we propose a processor that realizes FP CIM with high energy efficiency and low latency. The main contributions of this chapter are as follows:

1. We propose a novel heterogeneous FP computing architecture (HFCA), which separately optimizes the exponent with CIM and the mantissa with digital logic. By considering the heterogeneous characteristics of FP systems, HFCA can maximize the advantages of CIM in FP computation without latency overhead.
2. We propose a novel FP computation algorithm, mantissa-free-exponent calculation (MFEC). By minimizing the unnecessary normalization process, MFEC not only lowers the communication cost between the exponent and the mantissa but also reduces the overall MAC power.
3. We propose a novel exponent-computing-in-memory (ECIM) architecture that can greatly reduce the memory access power consumption of exponent computation through in-memory computing and temporal charge reusing.

The rest of this chapter is organized as follows. Section 6.2 describes the architecture for FP CIM with the proposed HFCA and the overall processor architecture. The detailed architecture and operation of MFEC and ECIM, including how DNN workloads are mapped onto the ECIM and mantissa PEs, are explained in Sects. 6.3 and 6.4. Section 6.5 describes the zero-skip architecture of HEMTC. Section 6.6 provides the chip implementation results, and Sect. 6.7 concludes the chapter.

6.2 Architecture for Computing-in-Memory of Floating Point

Motivation

Before designing the FP CIM, we analyze the components of the FP computing system. Figure 51a shows the detailed building blocks of the FP computing system. Basically, the system repeatedly performs 3 processes: (1) fetching data from memory, (2) calculating data in the FP MAC, and (3) storing the result in memory. Each process is divided into exponent operations and mantissa operations. From the memory's point of view, the same operations, fetch and store, are required for both the exponent and the mantissa. Therefore, the ratio of exponent and mantissa in-memory power consumption is simply proportional to their bit-width ratio, which is almost the same for BFP16. However, the FP MAC requires different operations for the exponent and the mantissa. The exponent MAC operation consists of simple addition and subtraction, but the mantissa MAC operation additionally requires complex operations including multiplication, programmable bit shifting, and leading-one detection. Therefore, the mantissa operation occupies most of the power consumption and area of the FP MAC. The difference in characteristics between the memory operation and the MAC operation gives the FP computing system a heterogeneous character.


Fig. 51 (a) Analysis of the heterogeneous characteristics of the floating-point computing system. (b) Floating-point processor's power breakdown (28 nm, BFP16 multiplication, FP32 accumulation, and SRAM at 1.1 V and 250 MHz, estimated for an 8×512 input and a 512×256 weight with 16 PEs)

Figure 51b shows the power breakdown of the FP computing system with BFP16 multiplication and FP32 accumulation. The mantissa operations occupy 59.8% of the total power consumption, while the exponent operations occupy 38.6%, but the memory access power of both is similar. Thus, the mantissa operation is computation-intensive, with 39.0% of its power consumption occurring in the PE, whereas the exponent operation is memory-intensive, with 92.5% of its power consumption occurring in memory. Because a CIM architecture can only integrate a limited amount of computation logic due to area constraints, it is very difficult to optimize both the mantissa and the exponent at once with a single CIM architecture.
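To make the asymmetry concrete, the sketch below splits one BFP16 multiplication into its exponent and mantissa work (sign handling, rounding, and subnormals are simplified; this mirrors the analysis above rather than the chip's exact datapath).

def bfp16_fields(bits):
    sign = bits >> 15
    exp = (bits >> 7) & 0xFF                # 8-bit biased exponent
    man = (bits & 0x7F) | 0x80              # 7-bit mantissa plus implicit leading 1 (128 == 1.0)
    return sign, exp, man

def bfp16_mul(a_bits, b_bits):
    sa, ea, ma = bfp16_fields(a_bits)
    sb, eb, mb = bfp16_fields(b_bits)
    sign = sa ^ sb                          # sign path: a single XOR
    exp = ea + eb - 127                     # exponent path: one biased addition
    man = ma * mb                           # mantissa path: 8b x 8b multiply
    if man >= 1 << 15:                      # mantissa path: leading-one detect and normalize
        man >>= 1
        exp += 1
    return sign, exp, man >> 7              # 8-bit mantissa incl. implicit 1 (value = man / 128)

# 1.5 (0x3FC0) * 2.5 (0x4020) = 3.75  ->  (0, 128, 240): 240/128 * 2**(128-127) == 3.75
print(bfp16_mul(0x3FC0, 0x4020))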


Heterogeneous Floating-Point Computing Architecture

Figure 52a shows the flowchart and overall flow of the previous FP CIM architecture [70]. The previous CIM architecture adopts a unified FP CIM design, which processes both the mantissa and the exponent in memory. A single MAC operation in the unified FP CIM architecture is composed of a repetition of 2 stages: (1) for timesteps 0 ∼ Te, repeatedly fetch the exponent operands, calculate them in the computing-in-memory logic (CIML), and store the partial result.


Fig. 52 (a) Conventional floating-point computing architecture. (b) Proposed heterogeneous floating-point computing architecture


(2) For timesteps Te ∼ Te + Tm, the same operations are repeated for the mantissa operands. Throughput degradation inevitably occurs because the exponent and mantissa operations must be decomposed into repetitive simple operations due to the limited computation logic of the CIM architecture. The degradation is especially severe for the mantissa operations, which require complex operations including multiplication and shifting. Indeed, the previous CIM processor [70] required >100 cycles for a single BFP16 multiplication and >4900 cycles for a single FP32 accumulation. To overcome this problem, we propose the heterogeneous FP computing architecture (HFCA). Figure 52b shows the flowchart and overall flow of the proposed HFCA. In the HFCA, the exponent and the mantissa are decoupled and stored in different memories. Only the exponent operations are calculated with the CIM, while the mantissa operations are calculated in the digital mantissa processing engine (MPE). Most exponent and mantissa operations are independent of each other, but the MPE receives the mantissa shift amount for mantissa addition from the CIM, and the CIM receives the mantissa normalization result for exponent normalization from the MPE. By selectively utilizing CIM for exponent operations in consideration of the heterogeneous characteristics of FP, HFCA has 2 advantages. First, it can realize a single floating-point MAC within 2 cycles by replacing the inefficient repetitive mantissa CIM operation with a dedicated PE operation. Second, it provides an opportunity to increase energy efficiency by enabling separate optimization of the conflicting characteristics of the exponent and the mantissa.
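A toy cycle-by-cycle view of the first advantage is sketched below: once the exponent CIM stage of one MAC finishes, the mantissa PE works on it while the CIM already starts the next MAC's exponent. The schedule shown is an assumption for illustration only.

macs = ["MAC0", "MAC1", "MAC2", "MAC3"]
for cycle in range(len(macs) + 1):
    exp_job = macs[cycle] if cycle < len(macs) else "-"     # exponent stage in the CIM
    man_job = macs[cycle - 1] if cycle > 0 else "-"         # mantissa stage in the digital MPE
    print(f"cycle {cycle}: ECIM -> {exp_job:>5} | mantissa PE -> {man_job:>5}")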

Overall Processor Architecture

Figure 53 shows the overall architecture of the proposed processor that adopts HFCA. The proposed processor consists of 2 heterogeneous-exponent-mantissa-training-core (HEMTC) clusters, an aggregation and activation core (AAC), a 1-D SIMD core, and a top RISC controller. The processor integrates a 32-bit external interface operating at a maximum frequency of 250 MHz. All components in the top architecture and the external interface are connected with a 2-D mesh-type network-on-chip (NoC) for communicating the exponents and mantissas of weights, feature maps, and partial sums. Furthermore, each HEMTC cluster is separately connected with the AAC for aggregating partial sums. An HEMTC cluster consists of 8 HEMTCs that are connected with a shared bus interface. Each HEMTC can support the accumulation process, but inter-HEMTC accumulation caused by a large input channel count is performed in the AAC. The AAC consists of a 4 KB AMEM, accumulation registers, and a 16-way activation unit. The aggregation of partial sums from 2, 4, 8, or 16 HEMTCs can be performed in the AAC at a time. The 1-D SIMD core integrates a vector processing unit for simple element-wise multiplication, and the top RISC controller manages the operation of each component and the data movement between components. The HEMTC is the basic unit of DNN operations. All integrated memories of the HEMTC are divided to separately store the exponent and the mantissa in order to support HFCA.

Fig. 53 Overall architecture of the proposed processor


Specifically, each HEMTC integrates a total of 16 KB of weight memory (including the 8 KB weight ECIM), 4 KB of output memory (including the 2 KB output ECIM), and 4.25 KB of input memory. In addition, the HEMTC integrates decoders for feeding input feature maps to both ECIMs, a shared exponent peripheral circuit, a mantissa PE line composed of 16 mantissa PEs, a 16-way activation unit, and a feature map zero-skip controller. The integrated components and the detailed operation of the HEMTC are explained in the following sections.

6.3 Mantissa-Free-Exponent Calculation

The proposed HFCA greatly reduces the required latency by selectively applying CIM to the exponent, but 2 issues arise as trade-offs. First, since the CIM architecture is applied only to the exponent operation, the power consumption of the mantissa is not reduced. Considering the large portion of the mantissa operation in the total power consumption, optimization of the computation-intensive mantissa operation is also essential. Second, since the exponent operation and the mantissa operation cannot be completely separated, communication costs, including throughput degradation due to the area-constrained CIM design, inevitably occur. Figure 54a describes the throughput degradation problem. It shows the operational flow of the conventional FP MAC. There are 2 types of communication that occur for every MAC between the mantissa part and the exponent part: (1) the exponent subtractor transmits the mantissa shift amount to the mantissa shifter, and (2) the normalization and round unit transmits the normalization result to the exponent updater. The ECIM shares the logic for the exponent subtractor and the exponent updater to limit area overhead. Therefore, a pipeline stall occurs because a new multiplication can start only after the normalization process, which is required for every MAC.

To alleviate these issues of HFCA, we propose a new FP computing algorithm, MFEC. MFEC exploits 2 facts to reduce the communication between the exponent and the mantissa: (1) the normalization process, which aligns the FP operation results to a predefined FP format, needs to be performed only before the results are stored back in memory, and (2) in DNN operation, there are many partial sums that can be accumulated with each other before being stored in memory. Figure 54b shows the overall flow of the FP MAC with MFEC. The key concept of MFEC is removing the redundant normalization processes that can be simplified in the accumulation sequence. MFEC replaces the complex normalization process with a simple pre-normalization using an overflow counter and an accumulation register. The final normalization is performed only once after all accumulations have been performed. Therefore, MFEC enables not only pipelining between the ECIM and the mantissa PE but also an efficient FP MAC design by reducing the critical path of the mantissa datapath. The detailed MAC operation with MFEC is described in Algorithm 11. Note that the basic concept of each component is as follows: the ECIM tracks the maximum value of the multiplication results' exponents, and the mantissa PE pre-normalizes the mantissa according to the tracked exponent.



Fig. 54 (a) Conventional FP MAC, (b) proposed FP MAC with mantissa-free-exponent computation (MFEC)

However, calculation errors can occur in MFEC because the pre-normalization capability of the mantissa PE depends on the width of the accumulation register and the overflow counter. If the overflow counter width is not sufficient, the mantissa PE cannot compensate for the maximum exponent value tracked by the ECIM when an overflow occurs, which leads to a large calculation error. If the accumulation register width is not sufficient, underflow frequently occurs during the mantissa shifting, which also increases the calculation error. Figure 55a shows the calculation error according to the overflow counter width and the accumulation register width. We measure the calculation error for various accumulation lengths on a (1024 × 1024) × (1024 × 1024) matrix multiplication initialized with the ResNet-18 training distribution. Regardless of the accumulation length, we found that MFEC can perform the FP MAC operation without calculation error by utilizing a ≥21-bit accumulation register and a ≥3-bit overflow counter.


Algorithm 11 Pseudocode of FP accumulation in MFEC

Input: floating-point multiplication result P, previous overflow counter value O_old, previous accumulation register value A_old, previous maximum exponent Emax_old
Output: floating-point accumulation result R, new overflow counter value O_new, new accumulation register value A_new, new maximum exponent Emax_new
[Note: E_P/E_R: exponent, M_P/M_R: mantissa]
Initialize: if this is the first operation of a MAC sequence, initialize O_old = 0, A_old = 0, Emax_old = E_P

1:  mantissa shift amount S = Emax_old − E_P
2:  Emax_new = max(Emax_old, E_P)
3:  final shift amount S_final = S + O_old
4:  if (S_final > 0) then
5:      right shift M_P by S_final
6:      O_temp = O_old
7:  else
8:      right shift A_old by S_final
9:      O_temp = 0
10: end if
11: temp = A_old + M_P
12: if (temp ≥ 2) then
13:     O_new = O_temp + 1
14:     A_new = right shift temp by 1
15: else
16:     O_new = O_temp
17:     A_new = temp
18: end if
19: if (final of MAC sequence == True) then
20:     E_R, M_R = Normalize(Emax_new, O_new, A_new)
21: end if

As the accumulation length becomes larger, the advantage of MFEC increases because the number of replaceable normalization processes also increases, but this can lead to a larger calculation error. Figure 55b shows the result of the proposed MFEC. Compared with an FP MAC based on a BFP16 multiplier and FP32 accumulation, MFEC reduces the total MAC power by 14.4% and the total MAC area by 11.7% by removing the redundant normalization process of the FP32 accumulator.
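The following Python sketch is a simplified numerical model of the MFEC idea, not a bit-exact replica of Algorithm 11: it folds the tracked maximum exponent and the overflow counter into a single scale variable, pre-shifts each incoming mantissa to that scale, and normalizes only once at the end.

def mfec_accumulate(products):
    """products: iterable of (exponent, mantissa) pairs with mantissa in [1, 2)."""
    scale, acc = None, 0.0                    # running value == acc * 2**scale
    for e_p, m_p in products:
        if scale is None:
            scale = e_p
        shift = scale - e_p                   # mantissa shift amount from the exponent side
        if shift >= 0:
            acc += m_p / (1 << shift)         # pre-shift the incoming mantissa down
        else:
            acc = acc / (1 << -shift) + m_p   # larger term: shift the accumulator instead
            scale = e_p
        while acc >= 2.0:                     # overflow handling: halve mantissa, bump scale
            acc /= 2.0
            scale += 1
    # a single final normalization would pack (scale, acc) back into BFP16/FP32 here
    return acc * 2.0 ** scale

products = [(3, 1.5), (1, 1.25), (4, 1.0), (0, 1.75)]
exact = sum(m * 2.0 ** e for e, m in products)
assert abs(mfec_accumulate(products) - exact) < 1e-9   # 32.25, with no per-MAC normalization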

6.4 Exponent-Computing-in-Memory

Figure 56a shows the overall architecture of the proposed ECIM. The main components are the 512 × 128-bit weight ECIM and the 128 × 128-bit output ECIM. The weight ECIM is utilized for the feed-forward and back-propagation stages of DNN training, and the output ECIM is utilized for the weight-gradient stage. Each ECIM adopts a hierarchical bitline structure that is composed of CIM local arrays (CLAs) with local bitlines (LBLs) and global bitlines (GBLs).



Fig. 55 (a) MFEC error according to accumulation register width and overflow counter width. (b) Comparison result

Each CLA integrates 32 6T bitcells, a VDD pre-charger that can pre-charge the local bitline with the exponent bits of the input feature map (IF) or error, and 2 GBL drivers. The operation of the ECIM is controlled by the CLA decoder, the word-line (WL) driver, and a normal I/O interface. The CLA decoder receives IF or error values for the VDD pre-charger operation, and the WL driver enables the WLs corresponding to the current weight index. In addition, there are exponent peripheral circuits that are shared between the weight ECIM and the output ECIM. The peripheral circuits integrate a bias-combined adder, an exception handler, a subtractor, and a register. They finalize the exponent operations (exponent add, exponent compare) after receiving the GBL and GBLB results from the ECIM to generate the mantissa shift size and the final exponent value. The ECIM has 2 features for reducing the memory power consumption of exponent operations: (1) in-memory AND/NOR operation in the CLA and (2) bitline charge reusing with the hierarchical bitline structure. In this chapter, we describe the operation of the ECIM using the weight ECIM for simplicity, because the weight ECIM and the output ECIM operate identically from a computational point of view. Figure 56b shows the intra-CLA-level operation of the ECIM, which performs the in-memory AND/NOR operation. Before proceeding with the DNN operation, the ECIM stores the weight parameters in the 6T bitcells. The 8-bit exponent is divided into 8 bitcells in different columns of the same row. The proposed in-memory operation is pipelined in 3 stages.


Fig. 56 (a) Overall architecture of exponent-computing-in-memory (ECIM). (b) Intra-CLA operation of ECIM. (c) Inter-CLA operation of ECIM


First, the VDD pre-charger pre-charges the LBL with the exponent bit of the feature map or error and the LBLB with its complementary value. Second, the corresponding WL is enabled to read out the cell values. At the same time, the in-memory operation is performed on the LBL and LBLB: the AND operation is performed on the LBL, and the NOR operation is performed on the LBLB. Finally, the calculated result is transferred to the GBL and GBLB through the GBL drivers. Since the ECIM utilizes 6T SRAM, we prevent corruption of the operation data by the cell data by lowering the word-line voltage (VWL). More specifically, VWL must be larger than the threshold voltage Vth to read out the cell values and must be smaller than VNML + Vth, where VNML is the GBL driver's noise margin low, to guarantee proper operation of the global bitline driver. As shown in the operational waveform in Fig. 56b, the cell data cannot be read out if VWL is too low, and the operation result is corrupted by the cell data if VWL is too high.

Figure 56c shows the inter-CLA-level operation of the ECIM, which can reuse the bitline (BL) charge with the hierarchical BL (HBL) structure. The motivation for BL charge reusing is the locality of the DNN's data. DNN training requires a wide data distribution to represent multiple layers' weight parameters, feature maps, errors, and gradients. Although an 8-bit-wide exponent range is required to maintain accuracy for the entire training, the scales of two adjacent values are similar. Therefore, the bit-level differences between the in-memory computing values are small, which creates a large opportunity for reuse. The ECIM exploits temporal charge reusing with the HBL to take advantage of this scale similarity. In the HBL, the GBL is not pre-charged every cycle, and the enabled GBL driver of each CLA determines the value of the GBL. To eliminate the throughput degradation caused by the CLA's pipeline latency of 3 cycles, the ECIM accesses bitcells in a different CLA every cycle. If the previous CLA's AND/NOR result is the same as the current CLA's AND/NOR result, the GBL charge can be reused.

The proposed ECIM reduces the dynamic power of the bitlines, which occupies most of the memory access power, with two methods. (1) The ECIM halves the LBL pre-charge power with the in-memory operation. A conventional memory needs to pre-charge both the LBL and the LBLB for a read operation, whereas the ECIM pre-charges only one of them because the exponent bit of the input feature map is charged on the LBL and its complement bit is charged on the LBLB. (2) In addition, the ECIM reduces the GBL switching power by reusing the GBL charge. The ECIM adopts a hierarchical bitline structure with LBLs and GBLs. Unlike the LBL, which requires a pre-charge every cycle, the GBL can maintain its charge because the GBL drivers determine its value. Therefore, the GBL charge can be reused if the previous CLA's AND/NOR result is the same as the current CLA's AND/NOR result.

Figure 57 shows the peripheral circuit design of the ECIM and its optimization. The GBL charge reusing of the ECIM is motivated by the locality of the DNN data: the 8-bit exponent is essential to support a wide expression range for accurate DNN training, but the scales of two adjacent values are similar, so the difference of exponent bits between neighboring values in the computing sequence is small. Figure 58a shows the layer-by-layer GBL charge reuse ratio measured in ResNet-18 training. By utilizing the spatial-wise and channel-wise scale similarity in DNN training, the GBL charge can be reused for >84% of the ResNet-18 training scenario.


Fig. 57 (a) Conventional peripheral design, (b) ECIM’s peripheral circuit design, (c) Extra optimization for ECIM’s peripheral circuit design, (d) exception handler design


Fig. 58 (a) Global bitline charge reuse ratio of ECIM. (b) Memory power reduction results of ECIM (measured at 512 input channels, 16 output channels, no sparsity, 512 × 128-bit ECIM)

By reducing the pre-charge power with in-memory computing and by reusing the GBL charge, the proposed ECIM reduces the total memory access power by 46.4%, as shown in Fig. 58b.
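The logic-level sketch below (no analog behavior modeled) shows why the per-bit AND and NOR results read out on the bitlines are sufficient for the shared peripheral circuit to finish an exponent addition with a simple carry chain; the bias subtraction of the real bias-combined adder is omitted.

def ecim_bitline_read(weight_exp, input_exp, width=8):
    """In-memory part: LBL carries AND, LBLB carries NOR of each bit pair."""
    ands, nors = [], []
    for i in range(width):
        a = (weight_exp >> i) & 1          # bit stored in the 6T cell
        b = (input_exp >> i) & 1           # bit pre-charged from the feature map exponent
        ands.append(a & b)
        nors.append(1 - (a | b))
    return ands, nors

def peripheral_add(ands, nors, carry_in=0):
    """Digital peripheral: ripple-carry addition using only the AND/NOR signals."""
    result, carry = 0, carry_in
    for i, (and_i, nor_i) in enumerate(zip(ands, nors)):
        p = (1 - and_i) & (1 - nor_i)      # propagate = a XOR b, built from AND and NOR
        result |= (p ^ carry) << i         # sum bit
        carry = and_i | (p & carry)        # carry out
    return result, carry

ands, nors = ecim_bitline_read(0x83, 0x45)
total, carry_out = peripheral_add(ands, nors)
assert total == (0x83 + 0x45) & 0xFF and carry_out == 0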


CNN Workload Mapping of HEMTC

Figure 59a shows the operation of a basic convolutional layer. To generate one output feature map channel (o0), a convolution between the input feature map with M channels and M weight kernels is calculated. There are two kinds of accumulation in the convolution process: (1) the product results of the M input feature maps and M weight kernels are accumulated along the input channel direction, and (2) this input channel accumulation is repeated k × k times along the image pixel direction. This process is repeated for N different groups of weight kernels to generate a total of N output channels. In order to maximize the advantage of MFEC, the proposed processor maximally utilizes partial sum reuse. The key concept of MFEC is to omit the redundant normalization during the accumulation process, so it is hard to exploit MFEC's advantage if the processor consecutively calculates partial sums that cannot be accumulated. Therefore, the proposed processor performs as many accumulations as possible and stores only the final output feature map values in memory. Figure 59b shows the logical dataflow of the proposed processor. The processor utilizes broadcasting for input feature maps and unicasting for weights. Thus, multiple output channels are generated in parallel, and the partial sums of the corresponding output channels are calculated and accumulated at each timestep. In order to increase scalability for various types of layers, the proposed processor first performs accumulation along the input channel direction and then along the image direction. After accumulation is completed, the calculations for other output channels or output pixels are started. Figure 59c shows the CNN workload mapping of the HEMTC. The exponent of the input feature map is broadcast to pre-charge 16 CLAs, and the mantissa of the input feature map is broadcast to the 16 MPEs. The ECIM tracks the maximum value of the partial sums' exponents at every timestep while simultaneously transferring the mantissa shift amount to the MPEs, which perform the mantissa accumulations. If input channel 0 is fetched at timestep T = 0, the next input channel (input channel 1) is fetched at T = 1. To hide the pipeline latency of the CLA operation, the CLAs utilized for calculation in each timestep are all different. For instance, if CLAs 0∼15 are utilized at timestep T = 0, CLAs 16∼31 are utilized at T = 1.
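The plain loop nest below mirrors that accumulation order (input channels innermost, then the kernel window, with the output channels of one HEMTC handled in parallel by its 16 MPEs); it is an illustration of the dataflow, not of the hardware's actual address generation.

import numpy as np

def conv_hemtc_order(ifmap, weights):
    """ifmap: [M, H, W]; weights: [N, M, k, k]; valid (no-padding) convolution."""
    N, M, k, _ = weights.shape
    H, W = ifmap.shape[1] - k + 1, ifmap.shape[2] - k + 1
    out = np.zeros((N, H, W), dtype=np.float32)
    for y in range(H):
        for x in range(W):
            for n in range(N):                    # output channels: parallel across MPEs
                acc = 0.0                         # partial sum stays in the accumulator (MFEC)
                for ky in range(k):               # step 2: image (kernel window) direction
                    for kx in range(k):
                        for m in range(M):        # step 1: input-channel accumulation
                            # the ifmap value is broadcast; the weight is unicast per MPE
                            acc += ifmap[m, y + ky, x + kx] * weights[n, m, ky, kx]
                out[n, y, x] = acc                # normalize/store only the final value
    return out

out = conv_hemtc_order(np.random.rand(4, 6, 6), np.random.rand(2, 4, 3, 3))
print(out.shape)   # (2, 4, 4)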

6.5 Zero-Skip Architecture of HEMTC

Recent DNN models contain many zero values in the feature maps or weights due to the use of the ReLU activation function or pruning. Many previous processors have exploited sparsity for high energy efficiency [71, 74]. Sparsity exploitation not only reduces memory bandwidth and on-chip storage requirements through non-zero compression but also increases throughput by omitting calculations for zero values. The proposed DNN training processor is designed to perform the zero-skip operation with the ECIM to exploit feature map sparsity.


Fig. 59 (a) Operation of basic convolutional layer, (b) Logical dataflow of the proposed processor, (c) CNN workload mapping of HEMTC


Fig. 60 (a) Non-zero encoding of input feature map. (b) Zero-skip operation of HEMTC

Figure 60a shows the non-zero encoding method for the HEMTC. Since the HEMTC accesses the input feature map data along the input channel direction, the non-zero encoding is also performed along the input channel direction. After encoding finishes, the original feature map is converted into the non-zero feature map values and the indexes of those non-zero values. Figure 60b shows the overall flow of the zero-skip operation. The basic concept of the proposed zero-skipping is to control the addresses of the weight ECIM and the weight mantissa memory according to the indexes of the non-zero feature map values. The detailed operation is as follows: (1) the feature map zero-skip controller fetches the exponents, mantissas, and the encoded bitmap of the input feature maps; (2) the non-zero exponent and mantissa values are pushed into their queues, and the bitmap decoder generates the non-zero indexes and pushes them into the non-zero index queue; (3) the address generator calculates the appropriate addresses for the current input feature map values. More specifically, it generates the CLA indexes to pre-charge and the WL indexes for ECIM operations, and it generates the weight mantissa memory addresses for MPE operations. Therefore, the HEMTC can skip the calculations for zero values by simply not generating the memory addresses that would be paired with zero-valued input feature maps.
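A small sketch of the channel-direction encoding and the resulting skip behavior is shown below; the printed "addresses" are just channel indexes, since the real CLA/WL and weight-memory address arithmetic is not spelled out in the text.

def encode_nonzero(channel_values):
    """Split one pixel's input-channel vector into a bitmap and its non-zero values."""
    bitmap = [1 if v != 0 else 0 for v in channel_values]
    nonzero = [v for v in channel_values if v != 0]
    return bitmap, nonzero

def nonzero_indexes(bitmap):
    """Bitmap decoder: only non-zero channels produce weight-memory accesses."""
    return [ch for ch, bit in enumerate(bitmap) if bit]

ifmap_channels = [0.7, 0.0, 1.2, 0.0, 0.0, 0.3, 0.9, 0.0]   # zeros come from ReLU/pruning
bitmap, values = encode_nonzero(ifmap_channels)
for ch, v in zip(nonzero_indexes(bitmap), values):
    print(f"fetch weight row for input channel {ch} and multiply by {v}")  # channels 1, 3, 4, 7 skipped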


6.6 Implementation Results

Figure 61a shows the chip photograph and the performance summary of the proposed DNN training processor implemented in 28 nm CMOS technology. The processor occupies 5.832 mm² with a total of 396 KB of on-chip SRAM, including 160 KB of ECIM. The processor operates from a 0.76–1.1 V supply voltage with a maximum frequency of 250 MHz. Its peak performance is 119.4 GFLOPS at 0% sparsity and 662 GFLOPS at 90% sparsity.


Fig. 61 (a) Chip photograph and performance summary. (b) Energy efficiency according to input sparsity


Fig. 62 Voltage-frequency scaling results of the proposed ECIM


Table 3 Performance comparison with previous DNN processors with CIM architecture

                      ISSCC'19 [2]           ISSCC'21 [3]              JSSC'21 [4]               This Work
Tech [nm]             28                     65                        65                        28
Clock Freq. [MHz]     475                    25∼100                    200                       ∼250
Supply Voltage [V]    1.1                    0.62∼1.0                  1.0                       0.76∼1.1
Precision (Weight)    FXP1-16, FP32          FXP1-8                    FXP1-16                   BFP16
Precision (Act.)      FXP1-16, FP32          FXP2/4/6/8                FXP16                     BFP16
Zero Skip             X                      O (Weight)                O (Weight)                O (F.Map)
MEM Size [KB]         128                    8                         4.75                      396 (ECIM: 160)
Peak Performance      1.13 GFLOPS 1)         0.1 2) ∼ 3.16 3) TOPS     0.37 4) ∼ 4.0 5) GOPS     119.4 6) ∼ 662.0 7) GFLOPS
Energy Efficiency     0.05 TFLOPS/W 1)       2.75 2) ∼ 75.9 3) TOPS/W  0.31 4) ∼ 3.07 5) TOPS/W  1.43 6) ∼ 13.7 7) TFLOPS/W
                      (114 MHz, 0.6 V)       (37.5 MHz, 1.0 V)         (200 MHz, 1.0 V)          (40 MHz, 0.76 V)

1) Normalized to BFP16. 2) 1b/2b precision, 0% sparsity. 3) 1b/2b precision, 75% sparsity. 4) 16b weight, 0% weight sparsity. 5) 16b weight, 90% weight sparsity. 6) BFloat16, 0% activation sparsity. 7) BFloat16, 90% activation sparsity.

The energy efficiency, measured on a convolutional layer (512 channels) with consideration of PE utilization, is 1.43 TFLOPS/W at 0% sparsity and 13.7 TFLOPS/W at 90% sparsity when operating at 40 MHz and 0.76 V. Figure 61b shows the energy efficiency measurement results according to feature map sparsity. The proposed ECIM and MFEC increase the energy efficiency by 1.42×∼1.6×. Figure 62 shows the voltage-frequency scaling and energy efficiency measurement results. The processor operates at 40 MHz and 0.76 V, but it does not operate at a supply voltage lower than 0.76 V even at a lower operating frequency. This is because the WL voltage is bounded at 0.58 V to guarantee charge transfer between the bitcell and the LBL. Table 3 compares the proposed processor with previous DNN processors with CIM architectures. Only the proposed processor and [70] support floating-point computing with a CIM architecture, but the proposed processor achieves much higher energy efficiency than [70] thanks to the heterogeneous FP computing architecture with ECIM and MFEC. Table 4 shows the measured layer-by-layer energy efficiency breakdown of ResNet-18 network training.


Table 4 Energy efficiency analysis on ResNet-18 training (measured at 0.76 V, 40 MHz; sparsity in %, efficiency in TFLOPS/W)

Layer  Kernel  Input  In/Out Ch.  FF Sparsity  FF Eff.  EP Sparsity  EP Eff.  WG Sparsity  WG Eff.
1      7×7     224    3/64        0.0          1.44     -            -        0.0          1.25
2      3×3     56     64/64       25.0         1.87     0.0          1.43     25.0         1.64
3      3×3     56     64/64       33.9         2.12     0.0          1.43     33.9         1.86
4      3×3     56     64/64       17.8         1.71     0.0          1.43     17.8         1.50
5      3×3     56     64/64       46.0         2.60     0.0          1.43     46.0         2.28
6      3×3     56     64/128      22.2         1.80     0.0          1.43     22.2         1.58
7      3×3     28     128/128     59.0         3.39     0.0          1.44     59.0         2.98
8      1×1     56     64/128      22.2         1.80     0.0          1.44     22.2         1.58
9      3×3     28     128/128     48.3         2.71     0.0          1.44     48.3         2.37
10     3×3     28     128/128     74.6         5.59     0.0          1.44     74.6         4.92
11     3×3     28     128/256     49.6         2.78     0.0          1.43     49.6         2.43
12     3×3     14     256/256     64.8         3.97     0.0          1.44     64.8         3.50
13     1×1     28     128/256     49.6         2.78     0.0          1.44     49.6         2.44
14     3×3     14     256/256     53.6         3.01     0.0          1.44     53.6         2.65
15     3×3     14     256/256     78.6         6.63     0.0          1.44     78.6         5.86
16     3×3     14     256/512     53.6         3.01     0.0          1.43     53.6         2.65
17     3×3     7      512/512     79.7         6.99     0.0          1.44     79.7         6.20
18     1×1     14     256/512     53.6         3.01     0.0          1.44     53.6         2.65
19     3×3     7      512/512     78.9         6.73     0.0          1.44     78.9         5.95
20     3×3     7      512/512     79.2         6.83     0.0          1.44     79.2         6.04

FF feed-forward, EP error-propagation, WG weight-gradient

6.7 Conclusion

We propose a heterogeneous floating-point computing architecture with ECIM to realize energy-efficient DNN training on mobile devices. Based on the analysis of the heterogeneous characteristics of floating-point computing, dedicated architectures and algorithms are proposed for exponent processing and mantissa processing. The proposed ECIM reduces the large memory power consumption with in-memory computing and bitline charge reusing, and the proposed MFEC enables pipelining between the mantissa operation and the exponent operation while reducing the power consumption of the floating-point MAC. Furthermore, a BFP16 DNN training processor with ECIM and MFEC is implemented and fabricated in 28 nm CMOS technology. It supports feature map sparsity exploitation for DNN training acceleration. It occupies a 1.62 × 3.6 mm² die area and achieves 13.7 TFLOPS/W energy efficiency while supporting BFP16 operations with a CIM architecture.

Reference

1. A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet Classification with deep convolutional neural networks. Adv. Neural Informat. Process. Syst. 60, 1097–1105 (2012) 2. D. Niu et al., 184QPS/W 64Mb/mm23D Logic-to-DRAM hybrid bonding with process-nearmemory engine for recommendation system, in IEEE International Solid- State Circuits Conference (ISSCC), 2022 (2022), pp. 1–3 3. V. Sze, Hardware for machine learning: Design considerations, in 2018 VLSI Symposium Forum, 2018 (2018) 4. S. Mitra, Memory centric abundant data computing, in 2017 IEDM Short Course (2017) 5. J. Lee, S. Kim, S. Kim, W. Jo, H. Yoo, GST: Group-sparse training for accelerating deep reinforcement learning (2021). arXiv:2101.09650 6. J. Lee et al., OmniDRL: A 29.3 TFLOPS/W deep reinforcement learning processor with dualmode weight compression and on-chip sparse weight transposer, in 2021 Symposium on VLSI Circuits (2021), pp. 1–2 7. J. Lee, W. Jo, S.-W. Park, H.-J. Yoo, Low-power autonomous adaptation system with deep reinforcement learning, in 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS) (2022), pp. 300–303 8. J. Lee, J. Kim, W. Jo, S. Kim, S. Kim, H. -J. Yoo, ECIM: exponent computing in memory for an energy-efficient heterogeneous floating-point DNN training processor. IEEE Micro 42(1), 99–107 (2022) 9. J. Lee, C. Kim, D. Han, S. Kim, S. Kim, H.-J. Yoo, Energy-efficient deep reinforcement learning accelerator designs for mobile autonomous systems, in 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS) (2021), pp. 1–4 10. Y. Li, Deep Reinforcement Learning: An Overview (2017). Preprint arxiv1701.07274 11. R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, 2nd edn. (The MIT Press, Cambridge, 2018) 12. V. Mnih, K. Kavukcuoglu, D. Silver et al., Humanlevel control through deep reinforcement learning. Nature 518, 529–533 (2015) 13. D. Silver, J. Schrittwieser, K. Simonyan et al., Mastering the game of go without human knowledge. Nature 550, 354–359 (2017) 14. O. Vinyals, I. Babuschkin, W.M. Czarnecki et al., Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019) 15. V. Mnih, A.P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in Proceedings of The 33rd International Conference on Machine Learning (2016), pp. 1928–1937



16. M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, J. Kautz, Reinforcement Learning Through Asynchronous Advantage Actor-Critic on a GPU (2017). Preprint arXiv:1611.06256 17. L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning et al., IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures, in Proceedings of the 35th International Conference on Machine Learning (2018), pp. 1407–1416 18. L. Espeholt, R. Marinier, P. Stanczyk, K. Wang, M. Michalski, SEED RL: Scalable and efficient DEEP-RL with accelerated central inference, in International Conference on Learning Representations (2020) 19. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms (2017). Preprint arXiv:1707.06347 20. H. Zhu, J. Yu, A. Gupta, D. Shah, K. Hartikainen, A. Singh, V. Kumar, S. Levine, The ingredients of real world robotic reinforcement learning, in International Conference on Learning Representations (2020) 21. G. Dulac-Arnold, D. Mankowitz, T. Hester, Challenges of real-world reinforcement learning (2019). Preprint arXiv:1904.12901 22. A. Nagabandi, I. Clavera, S. Liu, R.S. Fearing, P. Abbeel, S. Levine, C. Finn, Learning to adapt in dynamic, real-world environments through meta-reinforcement learning, in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, May 6–9, 2019, OpenReview.net (2019) 23. J. Buckman, D. Hafner, G. Tucker, E. Brevdo, H. Lee, Sample-efficient reinforcement learning with stochastic ensemble value expansion, in Advances in Neural Information Processing Systems (2018), pp. 8224–8234 24. S. Fujimoto, H. van Hoof, D. Meger, Addressing function approximation error in actor-critic methods, in Proceedings of the 35th International Conference on Machine Learning (2018), pp. 1587–1596 25. T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in Proceedings of the 35th International Conference on Machine Learning (2018), pp. 1861–1870 26. K. Kurach, A. Raichuk, P . St a´nczyk, M. Zajaç, O. Bachem, L. Espeholt, C. Riquelme, D. Vincent, M. Michalski, O. Bousquet, S. Gelly, Google Research Football: A novel reinforcement learning environment (2019). Preprint arXiv:1904.12901 27. M. Zhu, S. Gupta, To prune, or not to prune: exploring the efficacy of pruning for model compression (2017). Preprint arXiv:1710.01878 28. S. Han, J. Pool, J. Tran, W. Dally, Learning both weights and connections for efficient neural network. Adv. Neural Informat. Process. Syst. 28, 1135–1143 (2015) 29. Y. He, P. Liu, Z. Wang, Z. Hu, Y. Yang, Filter pruning via geometric median for deep convolutional neural networks acceleration, in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019), pp. 4335–4344 30. V. Sindhwani, T. Sainath, S. Kumar, Structured transforms for small-footprint deep learning. Adv. Neural Informat. Process. Syst. 28, 3088–3096 (2015) 31. Y. Cheng, F.X. Yu, R.S. Feris, S. Kumar, A. Choudhary, S. Chang, An Exploration of parameter redundancy in deep networks with circulant projections, in 2015 IEEE International Conference on Computer Vision (ICCV) (2015), pp. 2857–2865 32. M. Moczulski, M. Denil, J. Appleyard, N. de Freitas, ACDC: A Structured Efficient Linear Layer (2016). Preprint arXiv:1511.05946 33. C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan, X. Ma, Y. Zhang, J. Tang, Q. Qiu, X. Lin, B. 
Yuan, CirCNN: Accelerating and compressing deep neural networks using block-circulant weight matrices, in 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (2017), pp. 395–408 34. S. Liao, B. Yuan, CircConv: A structured convolution with low complexity, in Proceedings of the AAAI Conference on Artificial Intelligence (2019), pp. 4287–4294 35. D. Livne, K. Cohen, PoPS: policy pruning and shrinking for deep reinforcement learning. IEEE J. Select. Topics Signal Process. SC ’19 14(4), 789–801 (2019)


36. S. Liao, B. Yuan, CircConv: A structured convolution with low complexity, in Proceedings of the AAAI Conference on Artificial Intelligence (2020), pp. 789–801
37. H. Zhang, Z. He, J. Li, Accelerating the deep reinforcement learning with neural network compression, in 2019 International Joint Conference on Neural Networks (IJCNN) (2019), pp. 1–8
38. D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. van Hasselt, D. Silver, Distributed prioritized experience replay, in International Conference on Learning Representations (2018)
39. D. Lee, D. Ahn, T. Kim, P.I. Chuang, J. Kim, Viterbi-based pruning for sparse matrix with fixed and high index compression ratio, in International Conference on Learning Representations (2018)
40. A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images (2009)
41. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2015)
42. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV (2016), pp. 770–778. https://doi.org/10.1109/CVPR.2016.90
43. I. Kostrikov, PyTorch Implementations of Reinforcement Learning Algorithms. GitHub Repository (2018)
44. Z. Huang, X. Xu, H. He, J. Tan, Z. Sun, Parameterized batch reinforcement learning for longitudinal control of autonomous land vehicles. IEEE Trans. Syst. Man Cyber. Syst. 49(4), 730–741 (2019)
45. A. Kendall et al., Learning to drive in a day, in 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC (2019)
46. D. Gandhi, L. Pinto, A. Gupta, Learning to fly by crashing, in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC (2017), pp. 3948–3955
47. E. Çetin, C. Barrado, G. Muñoz, M. Macias, E. Pastor, Drone navigation and avoidance of obstacles through deep reinforcement learning, in 2019 IEEE/AIAA 38th Digital Avionics Systems Conference (DASC), San Diego, CA (2019)
48. T. Haarnoja, S. Ha, A. Zhou, J. Tan, G. Tucker, S. Levine, Learning to Walk via Deep Reinforcement Learning (2019). arXiv:1812.11103
49. D. Won, K. Muller, S. Lee, An adaptive deep reinforcement learning framework enables curling robots with human-like performance in real-world conditions. Sci. Rob. 5(46), eabb9764 (2020)
50. E. Lakomkin, M.A. Zamani, C. Weber, S. Magg, S. Wermter, EmoRL: Continuous acoustic emotion classification using deep reinforcement learning, in Proceedings ICRA (IEEE, Piscataway, 2018), pp. 1–6
51. C. Watkins, P. Dayan, Q-learning. Mach. Learn. 8, 279–292 (1992)
52. C. Kim, S. Kang, D. Shin, S. Choi, Y. Kim, H. Yoo, A 2.1TFLOPS/W mobile deep RL accelerator with transposable PE array and experience compression, in 2019 IEEE International Solid-State Circuits Conference (ISSCC) (2019), pp. 136–138
53. S. Kang et al., 7.4 GANPU: A 135TFLOPS/W multi-DNN training processor for GANs with speculative dual-sparsity exploitation, in 2020 IEEE International Solid-State Circuits Conference (ISSCC) (2020), pp. 140–142
54. A. Amaravati, S.B. Nasir, J. Ting, I. Yoon, A. Raychowdhury, A 55-nm, 1.0-0.4V, 1.25pJ/MAC time-domain mixed-signal neuromorphic accelerator with stochastic synapses for reinforcement learning in autonomous mobile robots. IEEE J. Solid-State Circuits 54(1), 75–87 (2019)
55. N. Cao, M. Chang, A. Raychowdhury, A 65-nm 8-to-3-b 1.0-0.36-V 9.1-1.1-TOPS/W hybrid digital-mixed-signal computing platform for accelerating swarm robotics. IEEE J. Solid-State Circuits 55(1), 49–59 (2020)
56. S. Elfwing, E. Uchibe, K. Doya, Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw. 107, 3–11 (2018)

57. G. Marsaglia, Xorshift RNGs. J. Statist. Softw. 8(14), 1–6 (2003)
58. D. Blackman, S. Vigna, Scrambled linear pseudorandom number generators (2018). arXiv:1805.01407
59. S. Kim, J. Lee, S. Kang, J. Lee, W. Jo, H.-J. Yoo, PNPU: An energy-efficient deep-neural-network learning processor with stochastic coarse-fine level weight pruning and adaptive input/output/weight zero skipping. IEEE Solid-State Circuits Lett. 4, 22–25 (2021)
60. J. Yue et al., 7.5 A 65nm 0.39-to-140.3TOPS/W 1-to-12b unified neural network processor using block-circulant-enabled transpose-domain acceleration with 8.1× higher TOPS/mm2 and 6T HBST-TRAM-based 2D data-reuse architecture, in 2019 IEEE International Solid-State Circuits Conference (ISSCC) (2019)
61. S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M.A. Horowitz, W.J. Dally, EIE: Efficient inference engine on compressed deep neural network, in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) (2016), pp. 243–254
62. K. Bong, S. Choi, C. Kim, D. Han, H. Yoo, A low-power convolutional neural network face recognition processor and a CIS integrated with always-on face detector. IEEE J. Solid-State Circuits 53(1), 115–123 (2018)
63. D. Watson, D. Scheidt, Autonomous systems. Johns Hopkins APL Technical Digest (Appl. Phys. Laboratory) 26, 368–376 (2005)
64. R. Julian et al., Never stop learning: The effectiveness of fine-tuning in robotic reinforcement learning, in Proceedings of the Conference on Robot Learning (2020)
65. L. Smith et al., Legged Robots that Keep on Learning: Fine-Tuning Locomotion Policies in the Real World (2021). Preprint arXiv:2110.05457
66. Y. Chen, T. Krishna, J.S. Emer, V. Sze, Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-State Circuits 52(1), 127–138 (2017)
67. Y. Chen, T. Yang, J. Emer, V. Sze, Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE J. Emerg. Select. Topics Circuits Syst. 9(2), 292–308 (2019)
68. J. Song et al., 7.1 An 11.5TOPS/W 1024-MAC butterfly structure dual-core sparsity-aware neural processing unit in 8nm flagship mobile SoC, in 2019 IEEE International Solid-State Circuits Conference (ISSCC) (2019), pp. 130–132
69. J. Lee, J. Lee, D. Han, J. Lee, G. Park, H. Yoo, 7.7 LNPU: A 25.3TFLOPS/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16, in 2019 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA (2019), pp. 142–144. https://doi.org/10.1109/ISSCC.2019.8662302
70. J. Wang et al., A 28-nm compute SRAM with bit-serial logic/arithmetic operations for programmable in-memory vector computing. IEEE J. Solid-State Circuits 55(1), 76–86 (2020)
71. J.-H. Kim, J. Lee, J. Lee, J. Heo, J.-Y. Kim, Z-PIM: A sparsity-aware processing-in-memory architecture with fully variable weight bit-precision for energy-efficient deep neural networks. IEEE J. Solid-State Circuits 56(4), 1093–1104 (2021)
72. J. Su et al., 15.2 A 28nm 64Kb inference-training two-way transpose multibit 6T SRAM compute-in-memory macro for AI edge chips, in 2020 IEEE International Solid-State Circuits Conference (ISSCC) (2020), pp. 240–242
73. H. Jia et al., 15.1 A programmable neural-network inference accelerator based on scalable in-memory computing, in 2021 IEEE International Solid-State Circuits Conference (ISSCC) (2021), pp. 236–238
74. J. Yue et al., 15.2 A 2.75-to-75.9TOPS/W computing-in-memory NN processor supporting set-associate block-wise zero skipping and ping-pong CIM with simultaneous computation and weight updating, in 2021 IEEE International Solid-State Circuits Conference (ISSCC) (2021), pp. 238–240

75. I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Quantized neural networks: Training neural networks with low precision weights and activations (2016). Preprint arXiv:1609.07061
76. U. Köster et al., Flexpoint: An adaptive numerical format for efficient training of deep neural networks, in Advances in Neural Information Processing Systems (NIPS) (2017)
77. D. Kalamkar et al., A Study of BFLOAT16 for Deep Learning Training (2019). Preprint arXiv:1905.12322
78. N. Burgess, J. Milanovic, N. Stephens, K. Monachopoulos, D. Mansell, Bfloat16 processing for neural networks, in 2019 IEEE 26th Symposium on Computer Arithmetic (ARITH) (2019), pp. 88–91

Index

D
Deep neural network (DNN) accelerator, 1, 2, 20, 75
Deep reinforcement learning (DRL), 1–93

F
Floating-point (FP) in-memory computing, 76

M
Memory power consumption optimization, 3
Mobile device, 2, 20, 93
