VLSI-SoC: Design and Engineering of Electronics Systems Based on New Computing Paradigms: 26th IFIP WG 10.5/IEEE International Conference on Very Large Scale Integration, VLSI-SoC 2018, Verona, Italy, October 8–10, 2018, Revised and Extended Selected Papers [1st ed.] 978-3-030-23424-9;978-3-030-23425-6

This book contains extended and revised versions of the best papers presented at the 26th IFIP WG 10.5/IEEE Internationa


English Pages XIV, 281 [296] Year 2019


Table of contents :
Front Matter ....Pages i-xiv
A 65 nm CMOS Synthesizable Digital Low-Dropout Regulator Based on Voltage-to-Time Conversion with 99.6% Current Efficiency at 10-mA Load (Naoki Ojima, Toru Nakura, Tetsuya Iizuka, Kunihiro Asada)....Pages 1-13
An Instruction Set Architecture for Secure, Low-Power, Dynamic IoT Communication (Shahzad Muzaffar, Ibrahim (Abe) M. Elfadel)....Pages 14-31
The Connection Layout in a Lattice of Four-Terminal Switches (Anna Bernasconi, Antonio Boffa, Fabrizio Luccio, Linda Pagli)....Pages 32-52
Building High-Performance, Easy-to-Use Polymorphic Parallel Memories with HLS (L. Stornaiuolo, M. Rabozzi, M. D. Santambrogio, D. Sciuto, C. B. Ciobanu, G. Stramondo et al.)....Pages 53-78
Rectification of Arithmetic Circuits with Craig Interpolants in Finite Fields (Utkarsh Gupta, Irina Ilioaea, Vikas Rao, Arpitha Srinath, Priyank Kalla, Florian Enescu)....Pages 79-106
Energy-Accuracy Scalable Deep Convolutional Neural Networks: A Pareto Analysis (Valentino Peluso, Andrea Calimera)....Pages 107-127
ReRAM Based In-Memory Computation of Single Bit Error Correcting BCH Code (Swagata Mandal, Yaswanth Tavva, Debjyoti Bhattacharjee, Anupam Chattopadhyay)....Pages 128-146
Optimizing Performance and Energy Overheads Due to Fanout in In-Memory Computing Systems (Md Adnan Zaman, Rajeev Joshi, Srinivas Katkoori)....Pages 147-166
Mapping Spiking Neural Networks on Multi-core Neuromorphic Platforms: Problem Formulation and Performance Analysis (Francesco Barchi, Gianvito Urgese, Enrico Macii, Andrea Acquaviva)....Pages 167-186
Improved Test Solutions for COTS-Based Systems in Space Applications (Riccardo Cantoro, Sara Carbonara, Andrea Floridia, Ernesto Sanchez, Matteo Sonza Reorda, Jan-Gerd Mess)....Pages 187-206
Analysis of Bridge Defects in STT-MRAM Cells Under Process Variations and a Robust DFT Technique for Their Detection (Victor Champac, Andres Gomez, Freddy Forero, Kaushik Roy)....Pages 207-231
Assessment of Low-Budget Targeted Cyberattacks Against Power Systems (XiaoRui Liu, Anastasis Keliris, Charalambos Konstantinou, Marios Sazos, Michail Maniatakos)....Pages 232-256
Efficient Hardware/Software Co-design for NTRU (Tim Fritzmann, Thomas Schamberger, Christoph Frisch, Konstantin Braun, Georg Maringer, Johanna Sepúlveda)....Pages 257-280
Correction to: Improved Test Solutions for COTS-Based Systems in Space Applications (Riccardo Cantoro, Sara Carbonara, Andrea Floridia, Ernesto Sanchez, Matteo Sonza Reorda, Jan-Gerd Mess)....Pages C1-C1
Back Matter ....Pages 281-281


IFIP AICT 561

Nicola Bombieri Graziano Pravadelli Masahiro Fujita Todd Austin Ricardo Reis (Eds.)

VLSI-SoC: Design and Engineering of Electronics Systems Based on New Computing Paradigms

26th IFIP WG 10.5/IEEE International Conference on Very Large Scale Integration, VLSI-SoC 2018 Verona, Italy, October 8–10, 2018 Revised and Extended Selected Papers

IFIP Advances in Information and Communication Technology Editor-in-Chief Kai Rannenberg, Goethe University Frankfurt, Germany

Editorial Board Members
TC 1 – Foundations of Computer Science: Jacques Sakarovitch, Télécom ParisTech, France
TC 2 – Software: Theory and Practice: Michael Goedicke, University of Duisburg-Essen, Germany
TC 3 – Education: Arthur Tatnall, Victoria University, Melbourne, Australia
TC 5 – Information Technology Applications: Erich J. Neuhold, University of Vienna, Austria
TC 6 – Communication Systems: Aiko Pras, University of Twente, Enschede, The Netherlands
TC 7 – System Modeling and Optimization: Fredi Tröltzsch, TU Berlin, Germany
TC 8 – Information Systems: Jan Pries-Heje, Roskilde University, Denmark
TC 9 – ICT and Society: David Kreps, University of Salford, Greater Manchester, UK
TC 10 – Computer Systems Technology: Ricardo Reis, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
TC 11 – Security and Privacy Protection in Information Processing Systems: Steven Furnell, Plymouth University, UK
TC 12 – Artificial Intelligence: Ulrich Furbach, University of Koblenz-Landau, Germany
TC 13 – Human-Computer Interaction: Marco Winckler, University of Nice Sophia Antipolis, France
TC 14 – Entertainment Computing: Rainer Malaka, University of Bremen, Germany


IFIP – The International Federation for Information Processing IFIP was founded in 1960 under the auspices of UNESCO, following the first World Computer Congress held in Paris the previous year. A federation for societies working in information processing, IFIP’s aim is two-fold: to support information processing in the countries of its members and to encourage technology transfer to developing nations. As its mission statement clearly states: IFIP is the global non-profit federation of societies of ICT professionals that aims at achieving a worldwide professional and socially responsible development and application of information and communication technologies. IFIP is a non-profit-making organization, run almost solely by 2500 volunteers. It operates through a number of technical committees and working groups, which organize events and publications. IFIP’s events range from large international open conferences to working conferences and local seminars. The flagship event is the IFIP World Computer Congress, at which both invited and contributed papers are presented. Contributed papers are rigorously refereed and the rejection rate is high. As with the Congress, participation in the open conferences is open to all and papers may be invited or submitted. Again, submitted papers are stringently refereed. The working conferences are structured differently. They are usually run by a working group and attendance is generally smaller and occasionally by invitation only. Their purpose is to create an atmosphere conducive to innovation and development. Refereeing is also rigorous and papers are subjected to extensive group discussion. Publications arising from IFIP events vary. The papers presented at the IFIP World Computer Congress and at open conferences are published as conference proceedings, while the results of the working conferences are often published as collections of selected and edited papers. 
IFIP distinguishes three types of institutional membership: Country Representative Members, Members at Large, and Associate Members. The type of organization that can apply for membership is a wide variety and includes national or international societies of individual computer scientists/ICT professionals, associations or federations of such societies, government institutions/government related organizations, national or international research institutes or consortia, universities, academies of sciences, companies, national or international associations or federations of companies. More information about this series at http://www.springer.com/series/6102


Editors Nicola Bombieri University of Verona Verona, Italy

Graziano Pravadelli University of Verona Verona, Italy

Masahiro Fujita University of Tokyo Tokyo, Japan

Todd Austin University of Michigan Ann Arbor, MI, USA

Ricardo Reis Universidade Federal do Rio Grande do Sul Porto Alegre, Brazil

ISSN 1868-4238 ISSN 1868-422X (electronic) IFIP Advances in Information and Communication Technology ISBN 978-3-030-23424-9 ISBN 978-3-030-23425-6 (eBook) https://doi.org/10.1007/978-3-030-23425-6 © IFIP International Federation for Information Processing 2019 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

This book contains extended and revised versions of the highest-quality papers presented during the 26th edition of the IFIP/IEEE WG10.5 International Conference on Very Large Scale Integration (VLSI-SoC), a global System-on-Chip Design and CAD conference. The 26th edition of the conference was held during October 8–10, 2018, at the Hotel Leon d’Oro, Verona, Italy. Previous conferences have taken place in Edinburgh, Scotland (1981); Trondheim, Norway (1983); Tokyo, Japan (1985); Vancouver, Canada (1987); Munich, Germany (1989); Edinburgh, Scotland (1991); Grenoble, France (1993); Chiba, Japan (1995); Gramado, Brazil (1997); Lisbon, Portugal (1999); Montpellier, France (2001); Darmstadt, Germany (2003); Perth, Australia (2005); Nice, France (2006); Atlanta, GA, USA (2007); Rhodes Island, Greece (2008); Florianopolis, Brazil (2009); Madrid, Spain (2010); Kowloon, Hong Kong, SAR China (2011); Santa Cruz, CA, USA (2012); Istanbul, Turkey (2013); Playa del Carmen, Mexico (2014); Daejeon, South Korea (2015); Tallinn, Estonia (2016); and Abu Dhabi, United Arab Emirates (2017).
The purpose of this conference, sponsored by IFIP TC 10 Working Group 10.5, the IEEE Council on Electronic Design Automation (CEDA), and the IEEE Circuits and Systems Society, in cooperation with ACM SIGDA, is to provide a forum for the presentation and discussion of the latest academic and industrial results and developments as well as the future trends in the field of system-on-chip (SoC) design, considering the challenges of nano-scale, state-of-the-art and emerging manufacturing technologies. In particular, VLSI-SoC 2018 was held under the theme “Design and Engineering of Electronics Systems Based on New Computing Paradigms,” addressing cutting-edge research fields such as heterogeneous, neuromorphic, brain-inspired, biologically inspired, and approximate computing systems.
The chapters of this new book in the VLSI-SoC series continue its tradition of providing an internationally acknowledged platform for scientific contributions and industrial progress in this field. For VLSI-SoC 2018, 27 papers out of 106 submissions were selected for presentation, and out of these 27 full papers presented at the conference, 13 papers were chosen by a special selection committee to have an extended and revised version included in this book. The selection process of these papers considered the evaluation scores during the review process as well as the review forms provided by members of the Technical Program Committee and the Session Chairs as a result of the presentations. The chapters of this book have authors from Germany, India, Italy, Japan, Mexico, Singapore, The Netherlands, UAE, and USA. The Technical Program Committee for the regular tracks comprised 98 members from 25 countries. VLSI-SoC 2018 was the culmination of the work of many dedicated volunteers: paper authors, reviewers, session chairs, invited speakers, and various committee chairs. We thank them all for their contributions.


This book is intended for the VLSI community at large, and in particular the many colleagues who did not have the chance to attend the conference. We hope you will enjoy reading this book and that you will find it useful in your professional life and for the development of the VLSI community as a whole.

June 2019

Nicola Bombieri
Graziano Pravadelli
Masahiro Fujita
Todd Austin
Ricardo Reis

Organization

The IFIP/IEEE International Conference on Very Large Scale Integration System-on-Chip (VLSI-SoC) 2018 took place during October 8–10, 2018, at the Hotel Leon d’Oro, Verona, Italy. VLSI-SoC 2018 was the 26th in a series of international conferences, sponsored by IFIP TC 10 Working Group 10.5 (VLSI), IEEE CEDA and ACM SIGDA.

General Chairs
Graziano Pravadelli, University of Verona, Italy
Todd Austin, University of Michigan, USA

Technical Program Chairs
Nicola Bombieri, University of Verona, Italy
Masahiro Fujita, University of Tokyo, Japan

Special Sessions Chairs
Srinivas Katkoori, University of South Florida, USA
Katell Morin-Allory, TIMA Laboratory, France

PhD Forum Chairs
Kiyoung Choi, Seoul National University, South Korea
Sara Vinco, Politecnico di Torino, Italy

Local Chair
Franco Fummi, University of Verona, Italy

Industry Chair
Yervant Zorian, Synopsys, USA (TBC)

Publicity Chairs
Ricardo Reis, UFRGS, Brazil
Matteo Sonza Reorda, Politecnico di Torino, Italy


VLSI-SoC Steering Committee
Manfred Glesner, TU Darmstadt, Germany
Matthew Guthaus, UC Santa Cruz, USA
Luis Miguel Silveira, INESC ID, Portugal
Fatih Ugurdag, Ozyegin University, Turkey
Salvador Mir, TIMA, France
Ricardo Reis, UFRGS, Brazil
Chi-Ying Tsui, HKUST, Hong Kong, SAR China
Ian O’Connor, INL, France
Masahiro Fujita, The University of Tokyo, Japan

Publication Chairs
Davide Bertozzi, University of Ferrara, Italy
Mahdi Tala, University of Ferrara, Italy

Registration Chair
Michele Lora, Singapore University of Technology and Design, Singapore

Web Chair
Florenc Demrozi, University of Verona, Italy

Technical Program Committee

Analog, Mixed-Signal, and Sensor Architectures Track Chairs
Piero Malcovati, University of Pavia, Italy
Tetsuya Iizuka, University of Tokyo, Japan

Digital Architectures: NoC, Multi- and Many-Core, Hybrid, and Reconfigurable Track Chairs
Ian O’Connor, Lyon Institute of Nanotechnology, France
Michael Huebner, Ruhr-Universität Bochum, Germany


CAD, Synthesis, and Analysis Track Chairs
Srinivas Katkoori, University of South Florida, USA
Ibrahim Elfadel, Masdar Institute, UAE

Prototyping, Verification, Modeling, and Simulation Track Chairs
Tiziana Margaria, Lero, Ireland
Katell Morin-Allory, Grenoble Institute of Technology, France

Circuits and Systems for Signal Processing and Communications Track Chairs
Fatih Ugurdag, Ozyegin University, Turkey
Luc Claesen, Hasselt University, Belgium

IoT, Embedded and Cyberphysical Systems: Architecture, Design, and Software Track Chairs
Zebo Peng, Linkoping University, Sweden
Donatella Sciuto, Politecnico di Milano, Italy

Low-Power and Thermal-Aware IC Design Track Chairs
Dimitrios Soudris, National Technical University of Athens NTUA, Greece
Alberto Macii, Politecnico di Torino, Italy

Emerging Technologies and Computing Paradigms Track Chairs
Andrea Calimera, Politecnico di Torino, Italy
Ricardo Reis, UFRGS, Brazil

Variability, Reliability, and Test Track Chairs
Salvador Mir, University of Grenoble Alpes, France
Matteo Sonza Reorda, Politecnico di Torino, Italy


Hardware Security Track Chairs
Mihalis Maniatakos, New York University Abu Dhabi, UAE
Lilian Bossuet, University of St. Etienne, France

Machine Learning for SoC Design and for Electronic Design Automation Track Chairs
Mehdi Tahoori, Karlsruhe Institute of Technology, Germany
Manuel Barragan, TIMA, France

Technical Program Committee
Abdulkadir Akin, ETHZ, Switzerland
Aida Todri-Sanial, LIRMM, France
Alberto Bosio, LIRMM, France
Alberto Gola, AMS, Italy
Andrea Acquaviva, Politecnico di Torino, Italy
Anupam Chattopadhyay, Nanyang Technological University, Singapore
Arun Kanuparthi, Intel, USA
Bei Yu, University of Texas at Austin, USA
Brice Colombier, CEA, France
Carlos Silva Cardenas, Pontificia Universidad Catolica del Peru, Peru
Cecile Braunstein, PMC/LIP6, France
Chengmo Yang, University of Delaware, USA
Chun-Jen Tsai, National Chiao Tung University, Taiwan
Diana Goehringer, TU Dresden, Germany
Diego Barrettino, Ecole Polytechnique Federale de Lausanne, France
Donghwa Shin, Yeungnam University, South Korea
Edoardo Bonizzoni, University of Pavia, Italy
Elena Ioana Vatajelu, IMAG, France
Federico Tramarin, CNR-IEIIT, Italy
Franck Courbon, University of Cambridge, UK
Fynn Schwiegelshohn, Ruhr University Bochum, Germany
Georg Sigl, TU Munich, Germany
Gildas Leger, Inst. de Microelect. de Sevilla IMSE-CNM-CSIC, Spain
Giorgio Di Natale, LIRMM, France
Haluk Konuk, Broadcom, USA
Haris Javaid, Xilinx, Australia
Houman Homayoun, George Mason University, USA
Ippei Akita, Toyohashi University of Technology, Japan
Iraklis Anagnostopoulos, National Technical University of Athens, Greece


Jaan Raik, Tallinn University, Estonia
Jones Yudi Mori, University of Brasilia, Brazil
Jinmyoung Kim, Samsung Advanced Institute of Technology, South Korea
Johanna Sepulveda, Technical University of Munich, Germany
Jose Monteiro, INESC-ID, IST University of Lisbon, Portugal
Ke Huang, San Diego State University, USA
Kostas Siozios, Aristotle University of Thessaloniki, Greece
Lars Bauer, Karlsruhe Institute of Technology, Germany
Leandro Indrusiak, University of York, UK
Lionel Torres, LIRMM, France
Luciano Ost, University of Leicester, UK
Maksim Jenihhin, Tallinn University of Technology, Estonia
Maria Michael, University of Cyprus, Cyprus
Massimo Poncino, Politecnico di Torino, Italy
Matthias Sauer, University Freiburg, Germany
Mirko Loghi, Università di Udine, Italy
Nadine Azemard, LIRMM/CNRS, France
Nele Mentens, Katholieke Universiteit Leuven, Belgium
Nektarios Georgios Tsoutsos, New York University, USA
Ozgur Tasdizen, ARM, UK
Paolo Amato, Micron, Italy
Patri Sreehari, National Institute of Technology, Warangal, India
Peng Liu, Zhejiang University, China
Per Larsson-Edefors, Chalmers University, Sweden
Philippe Coussy, Université de Bretagne, France
Pierre-Emmanuel Gaillardon, University of Utah, USA
Po-Hung Chen, National Chiao Tung University, Taiwan
Raik Brinkmann, OneSpin Solutions, Germany
Rani S. Ghaida, Global Foundries, USA
Robert Wille, Johannes Kepler University Linz, Austria
Rouwaida Kanj, American University of Beirut, Lebanon
Said Hamdioui, Delft Technical University, The Netherlands
Salvatore Pennisi, University of Catania, Italy
Sezer Goren, Yeditepe University, Turkey
Shahar Kvatinsky, Technion - Israel Institute of Technology, Israel
Sicheng Li, HP, USA
Soheil Samii, General Motors, USA
Sri Parameswaran, University of New South Wales, Australia
Tetsuya Hirose, Kobe University, Japan
Theocharis Theocharides, University of Cyprus, Cyprus
Tolga Yalcin, NXP, UK
Valerio Tenace, Politecnico di Torino, Italy


Victor Champac, National Institute of Astrophysics, Optics and Electronics, Mexico
Victor Kravets, IBM, USA
Virendra Singh, Indian Institute of Technology Bombay, India
Vladimir Zolotov, IBM, USA
Wenjing Rao, University of Illinois at Chicago, USA
Yier Jin, University of Florida, USA


A 65 nm CMOS Synthesizable Digital Low-Dropout Regulator Based on Voltage-to-Time Conversion with 99.6% Current Efficiency at 10-mA Load

Naoki Ojima (1), Toru Nakura (2), Tetsuya Iizuka (1, 3), and Kunihiro Asada (1, 3)

(1) Department of Electrical Engineering and Information Systems, The University of Tokyo, Tokyo, Japan [email protected]
(2) Department of Electronics Engineering and Computer Science, Fukuoka University, Fukuoka, Japan
(3) VLSI Design and Education Center, The University of Tokyo, Tokyo, Japan

Abstract. A synthesizable digital LDO implemented with a standard-cell-based digital design flow is proposed. The difference between the output and reference voltages is converted into a delay difference using inverter chains as voltage-controlled delay lines, and then compared in the time domain. Since the time-domain difference is straightforwardly captured by a simple DFF-based phase detector, the proposed LDO does not need an analog voltage comparator, which requires careful manual design. All the components in the LDO can be described with Verilog code based on their specifications, and placed and routed with a commercial EDA tool. This automated layout design reduces the burden and time of implementation, and enhances process portability. The proposed LDO, implemented in a 65 nm standard CMOS technology, occupies an area of 0.015 mm². With a 10.4 MHz internal clock, the tracking response of the LDO to 200 mV switching in the reference voltage is ∼4.5 µs and the transient response to a 5 mA change in the load current is ∼6.6 µs. At 10 mA load current, the quiescent current consumed by the LDO core is as low as 35.2 µA, which leads to 99.6% current efficiency.
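As a quick check of the figures quoted in the abstract, the short sketch below recomputes the current efficiency. It assumes the usual definition, load current divided by the sum of load current and quiescent current, which the abstract does not spell out; the function name is illustrative only.

```python
def current_efficiency(i_load, i_quiescent):
    """Current efficiency under the assumed definition I_load / (I_load + I_Q)."""
    return i_load / (i_load + i_quiescent)


# Values quoted in the abstract: 10 mA load, 35.2 uA quiescent current.
eta = current_efficiency(10e-3, 35.2e-6)
print(f"{eta:.1%}")  # prints 99.6%
```

Under this definition the quoted 35.2 µA and 10 mA reproduce the 99.6% figure.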

© IFIP International Federation for Information Processing 2019. Published by Springer Nature Switzerland AG 2019. N. Bombieri et al. (Eds.): VLSI-SoC 2018, IFIP AICT 561, pp. 1–13, 2019. https://doi.org/10.1007/978-3-030-23425-6_1

1 Introduction

Along with the exponential advancement of process technologies, the performance of LSI circuits has rapidly improved, and many functional building blocks such as analog, logic, RF, and memory blocks can be integrated on a chip, which has brought about the system-on-a-chip (SoC) era. Meanwhile, in order to reduce power consumption as indicated by the scaling law, power supply voltages have been lowered. In addition, it is desirable that the power supply of each functional block be independently tuned according to the changing operating condition so as to achieve the optimal power efficiency. On-chip voltage regulation is essential for this purpose, because off-chip voltage regulators require a large PCB area, which increases cost. For these reasons, efficient, tunable, fast-transient, on-chip power sources are in great demand for SoCs, and low-dropout (LDO) regulators are now widely used.

Fig. 1. LDO architectures. (a) Conventional analog LDO has a simple architecture, and includes an error amplifier, a driver amplifier and an analog pass transistor. (b) Digital LDO includes a comparator, a digital controller made of logic gates, and parallel pass transistors.

As shown in Fig. 1(a), conventional LDOs have been designed with analog circuits and employ an error amplifier, a driver amplifier, and an analog pass transistor to provide voltage regulation with negative feedback. When the supply voltage is high enough, they exhibit high current efficiency, fast transient response, high power supply rejection, and small output ripple [1–4]. In addition, their area can be smaller than that of other power management circuits such as switching regulators, because they do not require large inductors. However, they have difficulty operating at low supply voltages, since amplifiers cannot sustain their dynamic range and high gain under such conditions.

To solve this issue, digital implementations of LDOs, shown in Fig. 1(b), have been proposed [5–15]. A typical digital LDO has a digital controller made of logic gates that controls the number of turned-on PMOS switches at the output stage, and employs an analog voltage comparator to detect the difference between the reference and feedback voltages. Thus digital LDOs can eliminate amplifiers and operate under a low supply voltage. Moreover, since they are constructed mainly from digital logic gates, their performance can be easily improved by process downscaling and clock boosting.

A voltage comparator, however, often requires careful manual design so as to minimize the voltage offset between its two inputs. Thus even digital LDOs require sophisticated analog design flows, which are often time-consuming. A digital design flow, on the other hand, requires much less design effort, because its layout design is automated. Since circuit implementation in digital design flows is based on RTL source code, circuits with similar specifications can be made easily even when the process technology is updated. Thus many analog circuits such as analog-to-digital converters (ADC) [16] and phase-locked loops (PLL) [17] have recently been designed through automated digital flows, in order to take advantage of the relaxed design burden and process portability. Hence we have been motivated to implement LDOs, one of the indispensable blocks for SoCs, in digital design flows.

In order to relax the burden of manual analog design, this paper proposes a synthesizable digital LDO, whose preliminary results have been presented in [18]. By utilizing voltage-to-time conversion, the proposed LDO has an architecture suitable for standard-cell-based automatic place and route (P&R).

2 Proposed Synthesizable LDO

2.1 Architecture

One of the design issues in constructing an LDO with standard cells is the implementation of a voltage comparison unit. Reference [16] reported that an analog voltage comparator can be implemented with 3-input NAND gates. Such comparators can be easily designed, but they suffer from a systematic offset that varies randomly with the result of the automatic P&R. Thus a single comparator made of NAND gates is not suitable for precise voltage comparison. The PLL-like LDO in [6] employs voltage-controlled ring oscillators in order to convert the voltage difference into a phase difference. However, this architecture is not preferable because a voltage-controlled ring oscillator has an integral characteristic that adds a pole to the system, which deteriorates the stability of the loop. Moreover, voltage-controlled ring oscillators may increase the current consumption of the voltage comparison unit. Some digital LDOs utilize voltage-to-time converters (VTC) and time-to-digital converters (TDC) [7,8], so that they exclude analog voltage comparators. Although a TDC can be composed of digital logic cells, its layout implementation actually requires manual design because its linearity is very sensitive to the parasitic capacitance of its layout pattern. Hence we propose to use a simple bang-bang detector, in order to relax the complexity of the layout and eliminate the systematic offset even when the layout is automatically placed and routed. Our proposed LDO, shown in Fig. 2, employs voltage-controlled delay lines (VCDL), and the difference between the reference and the output voltages is converted into the time domain. The proposed LDO consists of two inverter chains, a bang-bang phase detector, a digital controller, a PMOS switch array, and an output capacitor. Although a dedicated ring oscillator is used as the internal clock source and pulse generator for this prototype, these clock and pulse signals can be replaced by a clock used for other blocks on the SoC.
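The voltage-to-time comparison described above can be illustrated with a small behavioral sketch. It is not the authors' circuit model: the alpha-power-law delay expression and the values of `vth`, `alpha`, and `k` are assumptions chosen only so that a higher supply voltage gives a shorter delay, and the output polarity (1 when the Vout-powered chain is faster) is likewise assumed.

```python
def inverter_chain_delay(vdd, stages=128, vth=0.35, alpha=1.3, k=1e-11):
    """Delay of an inverter chain used as a voltage-controlled delay line.

    Assumed alpha-power-law model: per-stage delay = k * vdd / (vdd - vth)**alpha.
    vth, alpha and k are hypothetical values, not taken from the paper.
    """
    if vdd <= vth:
        raise ValueError("supply must exceed the threshold voltage in this model")
    return stages * k * vdd / (vdd - vth) ** alpha


def bang_bang_detect(vout, vref):
    """1-bit time-domain comparison of the two delay lines.

    Returns 1 when the edge from the Vout-powered chain arrives first
    (Vout above Vref in this model), otherwise 0. Polarity is an assumption.
    """
    return 1 if inverter_chain_delay(vout) < inverter_chain_delay(vref) else 0


q_high = bang_bang_detect(vout=0.82, vref=0.80)  # Vout slightly above Vref
q_low = bang_bang_detect(vout=0.78, vref=0.80)   # Vout slightly below Vref
```

Because only the sign of the delay difference is captured, the detector needs no linear time measurement, which is the property the text relies on when preferring it over a TDC.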
The digital controller generates 128-bit-width thermometer code from 1-bit output from the bang-bang phase detector to control the switches. The PMOS switch array has parallelly-aligned 128 PMOS transistors, all of which have the same size. Each gate of the switch is connected to each bit of the thermometer code from the digital controller. The two inverter chains have the same structure, which has a series connection of 128 inverters. As shown in Fig. 3, the bang-bang phase detector is simply composed of a D-FF and a buffer. The buffer is connected to the clock input of the D-FF to compensate the setup time of the D-FF. As shown

Fig. 2. Block diagram of the proposed LDO.

Fig. 3. Phase detector composed of standard cells.

Fig. 4. Internal clock and pulse generator composed of a ring oscillator, a divider, and a multiplexer.

in Fig. 4, the internal clock and pulse generator is composed of inverters, D-FFs, and multiplexers for frequency tuning. The output capacitor is assembled off-chip. Once the switch PMOS transistor cell is added to a standard-cell library, all cells needed to compose the proposed LDO are included in the library, and the LDO can be generated from Verilog gate-level netlists and synthesized with a P&R tool. The layout of the PMOS switch has to follow the design rules for standard cells so that it can be placed and routed by the P&R tool. Figure 5 shows an outline of the PMOS switch cell layout. It is designed just by removing the

Fig. 5. (a) An inverter cell and (b) an additional PMOS transistor cell for the switch array (the NMOS transistor is removed).

Fig. 6. Signal flow graph of the digital controller that includes proportional and integral paths.

NMOS transistor from the inverter cell for ease of the additional cell layout. Thus, the size of the PMOS switch cell is equal to that of the inverter cell.

The operation of the proposed LDO is described as follows. As shown in Fig. 2, the two inverter chains are powered by Vref and Vout, respectively. An identical pulse train from the internal pulse generator enters into them at the same time. The inverter chains work as VCDLs. In other words, the inverter chains convert the voltage difference between Vref and Vout into the delay difference that is compared by the phase detector. Based on the phase detector output, the digital controller changes the number of turned-on PMOS switches. Figure 6 shows the signal flow graph of the digital controller. The operation of the digital controller is expressed by the following discrete-time difference equations:

$$\mathrm{Intg}[n] = \mathrm{Intg}[n-1] + K_i \times \mathrm{Input}[n] \tag{1}$$

$$\mathrm{Output}[n] = \mathrm{Intg}[n] + K_p \times \mathrm{Input}[n] \tag{2}$$
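As a rough illustration of Eqs. (1) and (2), the following Python sketch models the proportional-integral update and the binary-to-thermometer decoding described in this section. It is a behavioural sketch only, not the authors' RTL. The mapping of the phase detector output to Input[n] = -1 (HIGH) or +1 (LOW), the clamping to the 7-bit range, and the gain values used in the usage lines are assumptions made for illustration.

```python
def thermometer(code: int, width: int = 128) -> list[int]:
    """Return a thermometer code: `code` ones followed by zeros."""
    return [1] * code + [0] * (width - code)


class PIController:
    """Behavioural model of Eqs. (1) and (2); gains are free parameters."""

    def __init__(self, kp: int, ki: int, max_code: int = 127):
        self.kp = kp
        self.ki = ki
        self.max_code = max_code  # 7-bit binary output, 0..127
        self.intg = 0             # Intg[n-1]

    def _clamp(self, value: int) -> int:
        # Saturation is an assumption; the text does not describe overflow handling.
        return min(max(value, 0), self.max_code)

    def step(self, detector_high: bool) -> int:
        # Assumed mapping: HIGH (Vout > Vref) -> -1, LOW (Vout < Vref) -> +1.
        u = -1 if detector_high else 1
        self.intg = self._clamp(self.intg + self.ki * u)  # Eq. (1)
        return self._clamp(self.intg + self.kp * u)       # Eq. (2)


# Usage: three clock cycles with hypothetical gains Kp = Ki = 1.
ctrl = PIController(kp=1, ki=1)
for high in (False, False, True):
    code = ctrl.step(high)
    switches_on = sum(thermometer(code))  # number of PMOS switches turned on
```

The integral term accumulates the 1-bit decisions, while the proportional term adds an immediate step in the same direction, which is the role the stability analysis later assigns to Kp.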

The digital controller includes proportional and integral paths. Output[n] is 7-bit binary. Then the binary output is decoded into thermometer code so that the number of turned-on PMOS switches can be controlled one by one. When Vout is higher than Vref , the phase detector output becomes HIGH and the digital controller decreases the number of turned-on switches. On the contrary,


when Vout is lower than Vref, the phase detector output becomes LOW and the digital controller increases the number of turned-on switches. In this way Vout approaches Vref. As the divider and the multiplexer are attached to the internal clock and pulse generator, the clock frequency in this prototype can be easily tuned by the multiplexer for test purposes. The components other than the inverter chains are powered by Vin. The HIGH-level voltage of the pulse trains that travel through the inverter chains is equal to their power source voltage, Vref or Vout. Therefore, if Vref is lower than the logic threshold voltage of the phase detector powered by Vin, the phase detector cannot be driven by the pulse from the inverter chain powered by Vref. Thus, the lower limit of Vref is determined by the logic threshold voltage of the standard cells powered by Vin.

2.2 Transfer Function of the Control Loop

Figure 7 shows the signal flow graph of the proposed LDO. The comparison unit composed of inverter chains and a D-FF generates an error sample. Since the comparison is done based on the pulse train which is the same signal as the clock, one clock delay occurs here at every sample. As previously described, the digital controller has proportional and integral paths. In order to investigate the loop stability in continuous-time domain, the approximation below is applied:

$$z \approx 1 + sT_s, \tag{3}$$

where Ts represents the sampling period. The transfer function of the digital controller is thus approximated as follows:

$$H_{ctrl} = K_p + K_i \frac{1}{1 - z^{-1}} \tag{4}$$

$$H_{ctrl} \approx K_p + K_i \frac{1 + sT_s}{sT_s}. \tag{5}$$

The output stage is composed of the switch array, the output capacitor, and the effective resistance. Ipmos is the current through a single PMOS switch. According to [9], the effective resistance Rl can be approximated as Vout/Iload. Using (3) and (5), the continuous-time open-loop transfer function G(s) is given by

$$G(s) = \left( \frac{K_p}{1 + sT_s} + \frac{K_i}{sT_s} \right) \cdot \frac{I_{pmos}}{R_l^{-1} + sC_{out}}. \tag{6}$$

When Iload is small, Rl becomes big so that $R_l^{-1} \ll sC_{out}$. Then, G(s) approximates

$$G(s) \approx \left( \frac{K_p}{1 + sT_s} + \frac{K_i}{sT_s} \right) \cdot \frac{I_{pmos}}{sC_{out}}. \tag{7}$$

If Kp = 0, the poles of the closed-loop transfer function are close to the imaginary axis, and thus the system tends to be unstable. To avoid oscillation, we add the proportional gain Kp to the digital controller. Figure 8 shows the Bode plots

Fig. 7. Signal flow graph of the proposed LDO.

Fig. 8. Bode plots of the open loop system with the small Iload of 100 µA for Kp = 0 and Kp = 1.

of the open loop transfer function with small Iload of 100 µA for Kp = 0 and Kp = 1, respectively. When Kp = 0, the phase margin is 17°, whereas it is 27° when Kp = 1, which suppresses the abrupt phase change around 1 MHz.

2.3 Design Procedure of the Proposed LDO

This section explains the design procedure of the proposed LDO. Figure 9 shows the design flow diagram. The explanation follows the step numbers shown in Fig. 9. (0) The PMOS switch cell is designed in advance and added to the standard-cell library. (1) The specification of the circuit, such as maximum load current or reference voltage, is set. Based on this specification, the number of the PMOS switches in the switch array and the output bit width of the digital controller are determined. (2) The RTL Verilog code of the digital controller is prepared, then logically synthesized to have the gate-level Verilog netlist. (3) The gate-level Verilog netlists of other components, such as the inverter chain, the phase detector, the switch array, and the internal clock and pulse generator are generated by a dedicated script. Examples of the gate-level Verilog

Fig. 9. Design flow of the proposed circuit.

Fig. 10. Examples of the gate-level Verilog description: (a) inverter chain, (b) phase detector, (c) switch array, and (d) internal clock and pulse generator.

description are shown in Fig. 10. Since each building block has a simple standard-cell-based structure, the gate-level netlist generation is straightforward to implement. For example, the switch array is constructed only from parallel PMOS switch cells, and thus its gate-level netlist is easily generated by the script according to the specification. (4) The layout of each building block is individually placed and routed by a P&R tool. This is because they have different power supplies; the two inverter chains are powered by Vout and Vref, respectively, and the other blocks are powered by Vin. We use the same layout for both inverter chains, so that there is little systematic offset in the voltage comparison unit. (5) All the layouts are connected together. It takes a few hours to generate the whole layout of the proposed LDO from scratch, which is much less time than is needed for a conventional LDO design with analog flows.
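The netlist generation in step (3) can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' dedicated script. The cell names `PMOS_SW` and `INV_X1`, the module and port names, and the pin names `A` and `Y` are hypothetical placeholders (the `IN` and `OUT` pins follow the labels in Fig. 5), and power connections are left to the P&R step.

```python
def switch_array_netlist(n_switches: int = 128, cell: str = "PMOS_SW") -> str:
    """Gate-level Verilog for n parallel PMOS switch cells sharing one output."""
    lines = [
        "module switch_array (ctrl, vout);",
        f"  input  [{n_switches - 1}:0] ctrl;  // thermometer code from the controller",
        "  inout  vout;",
    ]
    for i in range(n_switches):
        lines.append(f"  {cell} sw{i} (.IN(ctrl[{i}]), .OUT(vout));")
    lines.append("endmodule")
    return "\n".join(lines)


def inverter_chain_netlist(n_stages: int = 128, cell: str = "INV_X1") -> str:
    """Gate-level Verilog for a series connection of n inverters (the VCDL)."""
    lines = [
        "module inverter_chain (in, out);",
        "  input  in;",
        "  output out;",
        f"  wire [{n_stages}:0] n;",
        "  assign n[0] = in;",
        f"  assign out = n[{n_stages}];",
    ]
    for i in range(n_stages):
        lines.append(f"  {cell} inv{i} (.A(n[{i}]), .Y(n[{i + 1}]));")
    lines.append("endmodule")
    return "\n".join(lines)


if __name__ == "__main__":
    # Changing the specification only changes the arguments.
    print(switch_array_netlist(n_switches=128))
    print(inverter_chain_netlist(n_stages=128))
```

Because every block is a regular array of identical cells, a change in the specification, such as a larger maximum load current, only changes the loop bound.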

3 Prototype Implementation and Measurement Results

Based on the architecture described in the previous section, the prototype of the proposed LDO is fabricated in a 65 nm standard CMOS technology. Figure 11 shows the chip photo. The active area of the proposed LDO is 0.015 mm² (115 µm × 130 µm). Figure 12 shows the measured tracking response of Vout with a 10.4 MHz clock when Vref, which is externally supplied in this measurement, switches between 600 mV and 800 mV. Here, Iload and Cout are 10 mA and 220 pF, respectively. When Vref changes from 600 mV to 800 mV, the settling time is 4.5 µs, whereas it is 4.4 µs when Vref changes from 800 mV to 600 mV. Figure 13 shows the measured transient response of Vout with a 10.4 MHz clock when Iload changes between 5 mA and 10 mA. Vref of 800 mV and Cout of 220 pF are used in this experiment. When Iload changes from 5 mA to 10 mA, the settling time is 6.6 µs and the undershoot is 303 mV. When Iload changes from 10 mA to 5 mA, the settling time is 6.0 µs and the overshoot is 126 mV. Since the PMOS switches temporarily enter the saturation region when the undershoot occurs, whereas they do not in the case of overshoot, the Vout transient waveforms differ in these two cases.

Fig. 11. Chip photo of the proposed LDO that occupies 115 µm × 130 µm in 65 nm standard CMOS technology.

Fig. 12. Measured tracking response of Vout when Vref switches between 600 mV and 800 mV with 10.4 MHz-clock, Iload of 10 mA and Cout of 220 pF.

Fig. 13. Measured transient response of Vout when Iload changes between 5 mA and 10 mA with 10.4 MHz-clock, Vref of 800 mV and Cout of 220 pF.

Fig. 14. Breakdown of the current consumption with Iload of 10 mA (LDO core: 35.2 µA, 12.5%; internal clock and pulse generator: 246.8 µA, 87.5%). Ratio of the current is calculated with circuit simulation.

The overall current consumption with 10.4 MHz clock and Iload of 10 mA is 282 µA including the current consumed at the internal clock and pulse generator, which is not essential in the actual use because it can be substituted by an internal clock for other functional blocks on the SoC. Based on the circuit simulation result, the LDO core consumes 12.5% of the total current as shown in Fig. 14. Thus, the quiescent current of the LDO core is assumed to be 35.2 µA, which leads to 99.6% current efficiency. In the proposed architecture, Vref is used as a power source of a VCDL. Hence, Vref is required to supply current to drive the VCDL for voltage comparison. Figure 15 shows the current consumption from Vref versus the frequency of the pulse train, which is equal to CLK, with Vref of 800 mV and Iload of 10 mA. The pulse is sent from the internal oscillator, and its frequency is tuned by the multiplexer. Typically, when the pulse frequency is 10.4 MHz, the current consumption from Vref is 10.6 µA. According to Fig. 15, the current consumption of Vref is proportional to the pulse frequency. When the pulse frequency is set high in order to have the fast transient response, Vref is required to supply more current.
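The reported figures can be cross-checked with a short calculation. The definition of current efficiency used below, load current divided by the sum of load and quiescent currents, is the customary one and is assumed here, since the text does not state it explicitly.

```latex
I_q \approx 0.125 \times 282\,\mu\mathrm{A} \approx 35.2\,\mu\mathrm{A},
\qquad
\eta = \frac{I_{load}}{I_{load} + I_q}
     = \frac{10\,\mathrm{mA}}{10\,\mathrm{mA} + 35.2\,\mu\mathrm{A}}
     \approx 99.65\,\%
```

Under this assumption the result is consistent with the 99.6% current efficiency quoted above.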

Fig. 15. Current consumption from Vref versus the frequency of the pulse train.

Table 1 shows performance comparison to prior digital LDOs. This work realizes a competitive current efficiency and FOMT with synthesizable architecture, while others cannot be fully synthesizable and need manual designs. Owing to the automated design flow of this work, if needed the maximum load current Iload,max can be easily increased by adding more PMOS switches, at the expense of the increase in the area and the quiescent current.

Table 1. Comparison to prior arts

|                        | This work | [5]      | [6]      | [10]      | [11]      | [12]      | [13]      | [14]      |
|------------------------|-----------|----------|----------|-----------|-----------|-----------|-----------|-----------|
| Venue                  |           | CICC     | JSSC     | JSSC      | JSSC      | SSC-L     | TPE       | SSC-L     |
| Year                   |           | 2010     | 2014     | 2017      | 2018      | 2018      | 2018      | 2018      |
| Process                | 65 nm     | 65 nm    | 32 nm    | 65 nm     | 65 nm     | 65 nm     | 65 nm     | 28 nm     |
| Active area [mm²]      | 0.015     | 0.042    | 0.008    | 0.029     | 0.0023    | 0.012     | 0.014     | 0.019     |
| Vin [V]                | 1.0       | 0.5      | 0.7–1.0  | 0.5–1.0   | 0.5–1.0   | 0.5–1.0   | 0.7–1.2   | 0.6–0.65  |
| Vout [V]               | 0.8       | 0.45     | 0.5–0.9  | 0.45–0.95 | 0.3–0.45  | 0.35–0.95 | 0.6–1.1   | 0.55–0.6  |
| Iload,max [mA]         | 10        | 0.2      | 5        | 3.5       | 2         | 2.8       | 25        | 25        |
| Cout [pF]              | 220       | 100000   | 100      | 400       | 400       | 100       | 1000      | 150       |
| Quiescent Iq [µA]      | 35.2      | 2.7      | 92       | 12.5      | 14        | 45.2      | 6         | 28        |
| Current efficiency [%] | 99.6      | 98.7     | 98.2     | 96.3      | 99.8      | 98.4      | 99.97     | 99.96     |
| Transient ΔVout        | 300 mV    | 40 mV    | 150 mV   | 40 mV     | 40 mV     | 46 mV     | 200 mV    | 56 mV     |
| @ load step ΔIload     | @ 5 mA    | @ 0.2 mA | @ 0.8 mA | @ 0.4 mA  | @ 1.06 mA | @ 1.76 mA | @ 23.5 mA | @ 20 mA   |
| FOMT [ps] (a)          | 93        | 270000   | 1150     | 1250      | 199       | 67.1      | 2.17      | 0.59      |

(a) FOMT = (Cout × ΔVout × Iq)/ΔIload² [1]
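The figure of merit in the table footnote is easy to evaluate numerically. The short Python sketch below applies the footnote formula to the values listed for this work; the function name and the choice of SI units for the arguments are illustrative assumptions.

```python
def fom_t_ps(c_out_f: float, delta_v_out_v: float, i_q_a: float,
             delta_i_load_a: float) -> float:
    """FOMT = Cout * dVout * Iq / dIload^2, returned in picoseconds."""
    seconds = c_out_f * delta_v_out_v * i_q_a / delta_i_load_a ** 2
    return seconds * 1e12


# This work: Cout = 220 pF, dVout = 300 mV, Iq = 35.2 uA, dIload = 5 mA.
this_work = fom_t_ps(220e-12, 0.300, 35.2e-6, 5e-3)
print(f"FOMT (this work) = {this_work:.1f} ps")
```

A smaller FOMT is better: it rewards a small output capacitor, a small voltage deviation, and a low quiescent current for a given load step.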

4 Conclusion

This paper proposes a synthesizable digital LDO that is designed by a P&R tool. In the proposed LDO, by using inverter chains as VCDLs, the difference between the output and the reference voltages is converted into the delay difference that can be compared by a phase detector. The voltage control loop is all composed of standard cells and synthesizable, which drastically relaxes the


design burden. The prototype is fabricated in a 65 nm standard CMOS technology with 0.015 mm² area occupation. According to the measurement results of the prototype, with 10.4 MHz clock and Cout of 220 pF the tracking response time when Vref switches between 600 mV and 800 mV is ∼4.5 µs with Iload of 10 mA, and the transient response time when Iload changes between 5 mA and 10 mA is ∼6.6 µs with Vref of 800 mV. The quiescent current consumed by the LDO core is as low as 35.2 µA at 10 mA load current, which leads to 99.6% current efficiency. In our prototype, Vref needs to supply 10.6 µA current when the pulse frequency is 10.4 MHz. In this paper, we used a PMOS switch cell made from an inverter cell. However, this customized PMOS switch cell can be substituted by a tri-state inverter cell [15] or a tri-state buffer cell. If the inputs of these cells are tied to LOW (in the case of tri-state inverters) or HIGH (in the case of tri-state buffers), the output PMOS transistors can be controlled by the tri-state control inputs. Thus, if these cells are included in the standard cell library, a fully standard-cell based synthesizable LDO can be realized and the design burden would be more relaxed.

Acknowledgment. This work is partly supported by JSPS KAKENHI Grant Number 17H03244, and is supported by VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with Synopsys, Inc., Cadence Design Systems, Inc., and Mentor Graphics, Inc.

References

1. Hazucha, P., Karnik, T., Bloechel, B.A., Parsons, C., Finan, D., Borkar, S.: Area-efficient linear regulator with ultra-fast load regulation. IEEE J. Solid State Circuits 40(4), 933–940 (2005)
2. Lam, Y.H., Ki, W.H.: A 0.9 V 0.35 µm adaptively biased CMOS LDO regulator with fast transient response. In: Proceedings of IEEE International Solid-State Circuits Conference Digest of Technical Papers, pp. 442–626, February 2008
3. Milliken, R.J., Silva-Martinez, J., Sanchez-Sinencio, E.: Full on-chip CMOS low-dropout voltage regulator. IEEE Trans. Circuits Syst. I Regul. Pap. 54(9), 1879–1890 (2007)
4. El-Nozahi, M., Amer, A., Torres, J., Entesari, K., Sanchez-Sinencio, E.: High PSR low drop-out regulator with feed-forward ripple cancellation technique. IEEE J. Solid State Circuits 45(3), 565–577 (2010)
5. Okuma, Y., et al.: 0.5-V input digital LDO with 98.7% current efficiency and 2.7-µA quiescent current in 65 nm CMOS. In: Proceedings of IEEE Custom Integrated Circuits Conference, pp. 1–4, September 2010
6. Gangopadhyay, S., Somasekhar, D., Tschanz, J.W., Raychowdhury, A.: A 32 nm embedded, fully-digital, phase-locked low dropout regulator for fine grained power management in digital circuits. IEEE J. Solid State Circuits 49(11), 2684–2693 (2014)
7. Otsuga, K., et al.: An on-chip 250 mA 40 nm CMOS digital LDO using dynamic sampling clock frequency scaling with offset-free TDC-based voltage sensor. In: Proceedings of IEEE International SOC Conference, pp. 11–14, September 2012
8. Oh, T., Hwang, I.: A 110-nm CMOS 0.7-V input transient-enhanced digital low-dropout regulator with 99.98% current efficiency at 80-mA load. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 23(7), 1281–1286 (2015)
9. Nasir, S.B., Gangopadhyay, S., Raychowdhury, A.: All-digital low-dropout regulator with adaptive control and reduced dynamic stability for digital load circuits. IEEE Trans. Power Electron. 31(12), 8293–8302 (2016)
10. Kim, D., Seok, M.: A fully integrated digital low-dropout regulator based on event-driven explicit time-coding architecture. IEEE J. Solid State Circuits 52(11), 3071–3080 (2017)
11. Salem, L.G., Warchall, J., Mercier, P.P.: A successive approximation recursive digital low-dropout voltage regulator with PD compensation and sub-LSB duty control. IEEE J. Solid State Circuits 53(1), 35–49 (2018)
12. Kim, S.J., Kim, D., Ham, H., Kim, J., Seok, M.: A 67.1-ps FOM, 0.5-V-hybrid digital LDO with asynchronous feedforward control via slope detection and synchronous PI with state-based hysteresis clock switching. IEEE Solid State Circuits Lett. 1(5), 130–133 (2018)
13. Akram, M.A., Hong, W., Hwang, I.: Fast transient fully standard-cell-based all digital low-dropout regulator with 99.97% current efficiency. IEEE Trans. Power Electron. 33(9), 8011–8019 (2018)
14. Zhao, L., Lu, Y., Martins, R.P.: A digital LDO with Co-SA logics and TSPC dynamic latches for fast transient response. IEEE Solid State Circuits Lett. 1(6), 154–157 (2018)
15. Liu, J., Maghari, N.: A fully-synthesizable 0.6 V digital LDO with dual-loop control using digital standard cells. In: Proceedings of IEEE International New Circuits and Systems Conference (NEWCAS), pp. 1–4, June 2016
16. Weaver, S., Hershberg, B., Moon, U.: Digitally synthesized stochastic flash ADC using only standard digital cells. IEEE Trans. Circuits Syst. I Reg. Pap. 61(1), 84–91 (2014)
17. Deng, W., et al.: A fully synthesizable all-digital PLL with interpolative phase coupled oscillator, current-output DAC, and fine-resolution digital varactor using gated edge injection technique. IEEE J. Solid State Circuits 50(1), 68–80 (2015)
18. Ojima, N., Nakura, T., Iizuka, T., Asada, K.: A synthesizable digital low-dropout regulator based on voltage-to-time conversion. In: 26th IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SOC), October 2018

An Instruction Set Architecture for Secure, Low-Power, Dynamic IoT Communication

Shahzad Muzaffar and Ibrahim (Abe) M. Elfadel

Department of Electrical and Computer Engineering, Khalifa University, P.O. Box 54224, Masdar City, Abu Dhabi, UAE
{shahzad.muzaffar,ibrahim.elfadel}@ku.ac.ae

Abstract. This chapter presents an instruction set architecture (ISA) dedicated to the rapid and efficient implementation of single-channel IoT communication interfaces. The architecture is meant to provide a programming interface for the implementation of signaling protocols based on the recently introduced pulsed-index schemes. In addition to the traditional aspects of ISA design such as addressing modes, instruction types, instruction formats, registers, interrupts, and external I/O, the ISA includes special-purpose instructions that facilitate bit stream encoding and decoding based on the pulsed-index techniques. Verilog HDL is used to synthesize a fully functional processor based on this ISA and provide both an FPGA implementation and a synthesized ASIC design in GLOBALFOUNDRIES 65 nm. The ASIC design confirms the low-power features of this ISA with consumed power around 31 µW and energy efficiency of less than 10 pJ/bit. Finally, this chapter shows how the basic ISA can be extended to include cryptographic features in support of secure IoT communication.

Keywords: Dynamic signaling · Single-channel · Low-power communication · Clock and data recovery · Internet of things · Domain specific architecture · Pulsed-Index Communication · Instruction set architecture · Secure communication

1 Introduction

IoT nodes need to meet two conflicting requirements: high data-rate communication to support bursts of activity in sensing and communication, and low power to improve energy autonomy. Unfortunately, existing protocols fail to meet these requirements simultaneously. Protocols providing high data rates, such as WiFi, WLAN, TCP/IP, USB, etc. [1–3], are power-hungry and involve complex controllers to handle two-way communications. On the other hand, low-power protocols such as 1-Wire [4] and UART [5] have low data rates.

© IFIP International Federation for Information Processing 2019. Published by Springer Nature Switzerland AG 2019. N. Bombieri et al. (Eds.): VLSI-SoC 2018, IFIP AICT 561, pp. 14–31, 2019. https://doi.org/10.1007/978-3-030-23425-6_2


To fill the gap and address these two requirements at once, a novel family of pulsed signaling techniques for single-channel, high-data-rate, low-power dynamic communication has recently been proposed under the name of Pulsed-Index Communication (PIC) [6,7]. The most important feature of this family of protocols is that they do not require any clock and data recovery (CDR). They are also highly tolerant of clocking differences between transmitter and receiver, and are fully adapted to the simple, low-power, area-efficient, and robust communication needs of IoT devices and sensors. These techniques are reviewed in Sect. 2 with their advantages and disadvantages clarified.

The main issue that this chapter addresses is to provide a flexible framework that enables the implementation of the most suitable PIC technique for a given application. The issue of selecting and implementing a communication interface in a constrained IoT node is a prevalent one, and its solution should contribute to the streamlining of communication subsystem design in IoT devices. One candidate solution is to program all the protocols on a microprocessor and control their selection and parameters through registers. This is a standard practice that is followed for data transfer protocols such as I2C, I2S, SPI, UART, and CAN. Another possible solution is to design an ASIC for the newest generation of the protocol and make it backward compatible with older versions, as in the case of USB 1.0 through 3.0 [8]. Such methods increase silicon area and power consumption, and do not provide any customization features. Yet another approach is to adopt the principles of hardware-software codesign and provide special-purpose hardware supporting a tuned or extended Instruction Set Architecture that can be used to configure and implement the various communication protocols of a given family without changing or re-designing the on-chip hardware modules. An example of such an approach can be found in Cisco's routers, where the main CPU (e.g., the MPC860 PowerQUICC processor from Motorola/NXP) includes an on-chip Communication Processor Module (CPM) [9]. The CPM is a RISC microcontroller dedicated to several special-purpose tasks such as signal processing, communication interfaces, baud-rate generation, and direct memory access (DMA).

The work described in this chapter is inspired by such a solution in that it proposes a flexible, fully programmable communication interface for the PIC family based on a full RISC-like ISA tailored for the efficient and seamless implementation of the PIC protocols. Specifically, a set of special-purpose instructions and registers along with a compact assembly language is proposed to help perform the specific tasks needed for the generation of pulsed signals and to give access to all the hardware resources. The proposed ISA is called Pulsed-Index Communication Interface Architecture (PICIA) and is meant to help reduce the number of instructions required to implement a PIC family member without impacting the advantageous data rates or low-power operation of the PIC family. Verilog HDL is used to synthesize and verify a fully functional processor based on this ISA over the Spartan-6 FPGA platform. Furthermore, an ASIC design in the GLOBALFOUNDRIES 65 nm process confirms the low-power operation with 31.4 µW and energy efficiency of less than 10 pJ/bit.


This chapter is an expanded version of an earlier publication of ours [10] and includes an entirely new section, Sect. 6, on secure IoT transmission using an extended ISA with cryptographic instructions. Other changes include improved figures and additional explanations that are spread throughout this chapter.

2 Pulsed-Signaling Techniques

Pulsed-signaling techniques are based on the basic concept of transmitting binary word attributes rather than modulated bits. The attributes are quantified, coded as pulse counts, and transmitted as streams of pulses. The key to the success of these techniques is the encoding step, whose goal is to minimize the pulse count. At the receiver, the decoding is based on pulse counting by detecting the rising edge of each pulse. These techniques have the distinguishing feature that they do not require any clock and data recovery (CDR), which significantly contributes to their low-power and small-footprint hardware implementations. Recently, three techniques based on this concept have been introduced, namely, Pulsed-Index Communication (PIC) [6], Pulsed-Decimal Communication (PDC) [7], and Pulsed-Index Communication Plus (PICplus). With slight differences, these techniques apply an encoding scheme to a data word of B bits to minimize the number of ON bits, and move them to the Least-Significant-Bit (LSB) end of the packet with the goal of lowering the number of pulses required to transmit the data bits. The encoding process includes a segmentation step where the data is broken into N independent segments of size l bits each (i.e., N = B/l). To maximize the data rate, these techniques use, on each segment, an encoding combination of bit inversion and/or segment reversion/flipping. For PIC and PICplus, this combination is meant to reduce the number of ON bits and decrease their index values. For PDC, the same combination is meant to reduce the number of ON bits and decrease the decimal number represented by each segment. To facilitate decoding, flag pulses representing the type of encoding performed are added to each segment. Unlike PIC, the PDC segment flags of two consecutive segments and the PICplus segment flags of four consecutive segments are combined in one data word flag and placed in the header. The PDC further applies a third segmentation step post-encoding whose goal is to further reduce the number of pulses per segment and, therefore, further increase the data rate. All the pieces of information, including flags, the number of indices, and the indices themselves in the case of PIC and PICplus, or the decimal numbers of each segment in the case of PDC, are transmitted in the form of pulse streams. Within a given packet, segment pulse streams are separated by an inter-symbol delay (α). The receiver counts the number of pulses for each pulse stream and applies the decoding according to the flags received.
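To make the idea of index-based pulse encoding concrete, the following Python sketch encodes each segment as a list of pulse counts and decodes it again. It is a simplified illustration in the spirit of PIC, not the published PIC, PDC, or PICplus algorithm. The rule for choosing inversion and flipping, the layout of the flags, and the convention of sending every value plus one pulse (so that zero is still countable) are all assumptions made here.

```python
def on_indices(value: int, width: int) -> list[int]:
    """Indices (0 = LSB) of the ON bits in a width-bit value."""
    return [i for i in range(width) if (value >> i) & 1]


def reverse_bits(value: int, width: int) -> int:
    """Mirror a width-bit value so that bit i moves to position width-1-i."""
    return int(format(value, f"0{width}b")[::-1], 2)


def encode_segment(seg: int, width: int) -> list[int]:
    """Encode one segment as pulse counts, one entry per pulse stream."""
    mask = (1 << width) - 1
    invert = bin(seg).count("1") > width // 2   # inversion leaves fewer ON bits
    if invert:
        seg = ~seg & mask
    mirrored = reverse_bits(seg, width)
    flip = sum(on_indices(mirrored, width)) < sum(on_indices(seg, width))
    if flip:
        seg = mirrored                          # ON bits move toward the LSB end
    flags = (int(invert) << 1) | int(flip)
    idx = on_indices(seg, width)
    # Streams: flags, number of indices, then each index (all sent as value + 1).
    return [flags + 1, len(idx) + 1] + [i + 1 for i in idx]


def decode_segment(counts: list[int], width: int) -> int:
    """Rebuild a segment from the received pulse counts."""
    flags, n = counts[0] - 1, counts[1] - 1
    seg = 0
    for c in counts[2:2 + n]:
        seg |= 1 << (c - 1)
    if flags & 1:                               # undo flipping first
        seg = reverse_bits(seg, width)
    if flags & 2:                               # then undo inversion
        seg = ~seg & ((1 << width) - 1)
    return seg


def encode_word(word: int, bits: int = 16, width: int = 4) -> list[list[int]]:
    """Split a word into N = bits/width segments and encode each one."""
    mask = (1 << width) - 1
    return [encode_segment((word >> s) & mask, width) for s in range(0, bits, width)]


def decode_word(streams: list[list[int]], width: int = 4) -> int:
    word = 0
    for k, counts in enumerate(streams):
        word |= decode_segment(counts, width) << (k * width)
    return word
```

In this sketch each inner list entry stands for one pulse stream, and the separation between entries plays the role of the inter-symbol delay α. The receiver needs only a pulse counter and the flags to recover the data.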

3 Pulsed-Index Communication Interface Architecture (PICIA)

As described in Sect. 2, the PIC family members share many ideas, some of which are used in exactly the same way and others with few changes. Their packet formats are also quite similar. A number of variations could be introduced in these techniques according to need and choice. The proposed PICIA can be used not only to generate these standard protocols with tunable communication parameters (i.e., segment size, inter-symbol delay, pulse width, etc.) but also to develop other customized communication techniques that use the same underlying idea of transmitting information in the form of pulses. The PICIA is described in detail in the next subsections.

Table 1. PICIA register set

|   | Register | Type          | Organization                            |
|---|----------|---------------|-----------------------------------------|
| 1 | R0–R7    | 8 bit GP (a)  | 8-bit value                             |
| 2 | Ctrl0    | 8 bit SP (b)  | [0, Mode, 3-bit SegNum, 3-bit SegSize]  |
| 3 | Ctrl1    | 8 bit SP      | 8-bit pulse width                       |
| 4 | LoadReg  | 16 bit SP     | 16-bit value                            |

(a) General Purpose
(b) Special Purpose

3.1 Register Set

The PICIA uses three types of registers. The first type includes a set of eight 8-bit registers, R0 through R7, which are programmer-accessible general-purpose registers. The second type is that of Control Registers Ctrl0 and Ctrl1 which are 8-bit registers used to store protocol configuration parameters such as mode of transaction (transmitter or receiver), segment number, segment size, and pulse width in terms of a number of clock cycles. These control registers are initially set by the programmer through specific instructions but, once set, they become accessible only to the system. The third type is the LoadReg register, which is a 16-bit, I/O-dedicated register used to read the I/O port, set the I/O port, and to store the updated results after an instruction is executed. Like the Control Registers, LoadReg is a privileged register accessible only to the system. These register types are summarized in Table 1. In the remainder of the text, the word register will always refer to a general-purpose register.

3.2 Instruction Formats

The PICIA instructions are all 16 bits long and are of three different types. The first type, I-Type 1, handles one operand at a time and is used in operations such as reading/writing the I/O port, setting/clearing the LoadReg, setting various communication protocol parameters, and sending/receiving pulse streams. I-Type 1 is divided into five fragments, as shown in Fig. 1. The 5-bit Opcode represents the type of operation. Type (R/C) sets the type of operand (register or constant) in an instruction. Halt PC/WE is used either to halt the PC during the transmission of pulse streams or to enable the storing of the received pulse count to a specified register. The bit E sets whether an extra pulse should be added to the transmitted pulse stream and/or an extra pulse should be removed from the received pulse stream. The last fragment of I-Type 1, 8 bits long, indicates a register number or an immediate constant value.

The second type of instruction, I-Type 2, needs two operands and is used in operations such as updating a register with a given constant value, and jumping to a specified label in the code depending on the validity of a condition in a register. I-Type 2 is divided into three fragments, as shown in Fig. 1. The 5-bit Opcode represents the type of operation. The 3-bit Register field indicates one of the general-purpose registers, and the 8-bit Constant field provides a constant value or a label that is present in the code.

The third type of instruction, I-Type 3, handles two or three operands simultaneously. I-Type 3 is used in operations such as encoding (inversion and reversion with or without condition), combining and splitting encoding flags, and conditionally copying register contents or some other information to a specified register. I-Type 3 is divided into six fragments, as shown in Fig. 1. The 5-bit Opcode represents the type of operation. The 3-bit Register fields each indicate one of the general-purpose registers. The combinations of the 1-bit I and Co fields select the source of information to be copied.

Fig. 1. PICIA instructions format

3.3 Addressing Modes

The PICIA employs three addressing modes: immediate, register, and auto-decrement. In the immediate mode, the source is either a constant or a label while the destination is one of the general-purpose, special-purpose, or program counter registers. In the register mode, the register contains the value of the operand. The auto-decrement mode is used only for the jump operation: if the specified register contains a non-zero number, the branch to the label is taken and the register is decremented by one.

3.4 Interrupts

There are three interrupts in the PICIA supported processor. First, the I/O interrupt is generated when the data at the I/O port is available. The system

An Instruction Set Architecture for IoT Communication

19

remains in a halt state until the I/O interrupt is reached and the system starts the execution of instructions from the very start. Second, the transmitter interrupt is used to indicate the completion of the transmission of one pulse stream. The PICIA processor remains in a halt state, if activated, until transmitter interrupt is received and the execution continues from where it paused. Third, the receiver interrupt is generated when the reception of one pulse stream completes. The PICIA processor remains in a halt state until the receiver interrupt is received at which time, program execution is continued. 3.5

External I/O

The PICIA processor supports three external I/O ports. The first is the 16-bit data I/O port, which is used to read from and write back to the external environment. The second is a 1-bit signal I/O port, which is used to transmit and receive packets in the form of pulse streams. The third is a 1-bit data-ready port, which triggers the generation of I/O interrupts and starts the execution of instructions.

4 PICIA Assembly Language

Before diving into the PICIA assembly language in detail, it is necessary to understand a few conventions used to interpret the instructions and the assembly language. These interpretations are shown in Table 2. The left part of the table gives the instruction interpretations, listing the values of the control bits along with the corresponding effect or representation. The right part of the table does the same for the PICIA assembly language. The PICIA instructions are listed in Table 3 along with a brief description and an example for each. The instruction categories and types are given in Table 4. More details about the PICIA instructions are given in the next subsections.

Table 2. PICIA interpretation (left: instruction interpretation; right: assembly interpretation)

| Control bit | Value : Effect | Symbol | Meaning |
|---|---|---|---|
| Type (R/C) | 0 : Register, 1 : Constant | R | Register only |
| Halt PC | 0 : No halt, 1 : Halt | C | Constant only |
| WE | 0 : Register write disabled, 1 : Register write enabled | RC | Register or constant |
| E | 0 : Extra pulse disabled, 1 : Extra pulse enabled | R, Rx, Ry, Rs | Register number |
| I | 0 : No indexing, 1 : Indexing | h | 0 : No halt, 1 : Halt |
| Co | 0 : Copy segment disabled, 1 : Copy segment enabled | | |

4.1 Type 1 Instructions (I-Type 1)

These instructions are concerned with configuration and transmission control operations and use at most one operand. The first of these is RP, read from port, which collects the data from the I/O port and stores it in the LoadReg. WP, write to port, reads data from the LoadReg and updates the I/O port. These two instructions take no operand, as the system accesses the special-purpose register internally.

SSS and SSN set the segment size and the segment number, respectively, in the Ctrl0 register. The operand of both instructions is an immediate constant. The operand of SSS can be 0, 1, or 2, representing a segment size of 4, 8, or 16 bits, respectively. The segment size tells the system how to break the data word into smaller independent segments. The operand of SSN can be 0, 1, 2, or 3. SSN selects the segment that is processed by all following instructions in the program until the segment number is changed again.

SM, set the mode, also accesses the special-purpose control register Ctrl0 and sets or clears the bit representing the mode of operation. The operand of SM is either 0 or 1, representing the transmitter or receiver mode, respectively. In transmitter mode, the signal port is used to send pulses out; in receiver mode, the same port is used to receive pulses from the external world. If the receiver mode is selected, the LoadReg is automatically cleared by the system to make it ready for reception. If the transmitter mode is selected, the LoadReg is updated automatically with the data present on the I/O port.

SW, set pulse width, sets the number of system-clock cycles for which the pulse remains high. The operand of SW is an 8-bit integer. SP, send pulses, sends a pulse stream consisting of a number of consecutive pulses equal to the count specified by the operand, which can be either a register or an immediate constant.
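The effect of the segment size (SSS) and segment number (SSN) settings described above can be illustrated with a minimal Python sketch. The function name `select_segment` is hypothetical, and the sketch assumes that segment 0 is the least-significant segment of the 16-bit data word and that only segment numbers that fit the chosen size are valid; the text does not state either point.

```python
# Minimal sketch of SSS/SSN segment selection (not the authors' hardware).
# ASSUMPTION: segment 0 is the least-significant segment of the data word.
WORD_BITS = 16
SEGMENT_SIZES = {0: 4, 1: 8, 2: 16}  # SSS operand -> segment size in bits


def select_segment(word, sss, ssn):
    """Return the segment of a 16-bit word chosen by SSS (size) and SSN (number)."""
    size = SEGMENT_SIZES[sss]
    count = WORD_BITS // size  # number of segments of this size in the word
    if not 0 <= ssn < count:
        raise ValueError(f"segment {ssn} does not exist for {size}-bit segments")
    return (word >> (ssn * size)) & ((1 << size) - 1)


word = 0xA5C3
# Four 4-bit segments, then two 8-bit segments, of the same data word.
print([hex(select_segment(word, 0, n)) for n in range(4)])
print([hex(select_segment(word, 1, n)) for n in range(2)])
```

With 4-bit segments the word splits into four independent nibbles, which is the unit that the encoding instructions then operate on.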
The argument h decides whether the system should halt during the transmission of a pulse stream. If h is 1, the system halts until the pulse stream transmission is complete; if h is 0, it continues with the next instruction. The argument E of the SP instruction tells the system whether the pulse stream should include an additional pulse at the end of the stream. This is helpful in representing the no-pulse or zero-index condition with only one pulse, as is the case in PIC and PICplus transmission, unlike PDC, where all pulse streams are transmitted with an additional pulse. If E is 1, an extra pulse is included; if E is 0, the exact number of pulses is sent. SD, send the delay, is a similar instruction with minor differences: it transmits an inter-symbol delay equal in length to the specified number of system-clock cycles. All arguments and operands work in the same way as those of SP, except that there is no extra-pulse option.

To set the expected number of clock cycles per inter-symbol delay during reception, the instruction SRD is used; it takes either a register number or a constant as an operand representing the number of clock cycles. During reception, the system needs to wait for the incoming pulse stream so that the pulses can be counted to infer the sent information. The instruction WRI, wait for receiver interrupt, is used for this purpose. The system goes into the halt state when this instruction is executed and returns to the normal state on reception of the receiver interrupt, which is generated when a pulse stream has been received completely. The incoming pulses are counted, and the count is decremented by one if the argument E of WRI is set. The count is stored in the specified register R if the argument WE is set. Among the different types of information chunks in a received packet, a pulse stream related to data could represent either the index number of an ON bit (as in PIC or PICplus) or the decimal number for a segment (as in PDC or other custom techniques). The instruction SDB, set data bits, removes this ambiguity by telling the system whether the received pulse count needs to be stored directly in the LoadReg as a segment's content (if C = 1) or whether a bit in the LoadReg needs to be set at the index number represented by the count (if C = 0).

The last instruction in the I-Type 1 category is NOP, no operation, which is used when there is a need to wait for some operation to complete, as in the case of the instructions SP and SD, without halting the system. In this case, there should be enough NOPs (PulsCount + 2) to wait for the completion of a pulse stream transmission. Some or all of these NOPs can also be replaced by other instructions in order to perform useful tasks instead of waiting for the transmission.

Table 3. PICIA assembly language

| No. | Instruction | Description | Example |
|---|---|---|---|
| | Configuration instructions | | |
| 1 | RP | Load data from input pins to data register | RP |
| 2 | WP | Output the received data from data register to the pins | WP |
| 3 | SSS C | Set segment size (C = 0, 1, 2 for 4 bit, 8 bit, 16 bit) | SSS 1 |
| 4 | SSN C | Select segment number (C = 0, 1, 2, 3) | SSN 2 |
| 5 | SM C | Set mode (C = 0, 1 for Transmitter, Receiver). Setting RX mode clears LoadReg, setting TX loads input into LoadReg | SM 0 |
| 6 | SW C | Set width of pulse (C = integer specifying cycle count) | SW 2 |
| 7 | SRD RC | Set receiver inter-symbol delay equal to RC number of clock cycles | SRD R0 |
| 8 | NOP | No operation | NOP |
| | Encoding/Decoding instructions | | |
| 9 | IV Rx, Ry | Invert the selected segment. Rx = NOI & Ry = Flags (Rx/Ry = R0, R1, ..., R7) | IV R0, R1 |
| 10 | IVC Rx, Ry | Invert the selected segment conditionally, if the encoding condition is satisfied (ON bits > Seg. Size/2). Rx = NOI & Ry = Flags (Rx/Ry = R0, R1, ..., R7) | IVC R0, R1 |
| 11 | FL Rx, Ry | Flip selected segment bits. Rx = NOI & Ry = Flags (Rx/Ry = R0, R1, ..., R7) | FL R0, R1 |
| 12 | FLC Rx, Ry | Flip the selected segment bits conditionally, if the encoding condition is satisfied (Seg. > Flip(Seg.)). Rx = NOI & Ry = Flags (Rx/Ry = R0, R1, ..., R7) | FLC R0, R1 |
| 13 | IVFL Rx, Ry | Invert and flip selected segment bits. Rx = NOI & Ry = Flags (Rx/Ry = R0, R1, ..., R7) | IVFL R0, R1 |
| 14 | CRC R, Rs, I, Co | Copy register conditionally. R = Rs if I = 0. R = Rs if I = 1 and LoadReg[Rs] = 1 and Co = 0. R = 0 otherwise. R = Selected Segment if Co = 1; Rs is ignored. (R/Rs = R0, R1, ..., R7). Can be used to clear the register | CRC R1, R2, 1, 1 |
| 15 | CF R, Rx, Ry | Combine flags. R = {Rx[1:0], Ry[1:0]} | CF R0, R1, R2 |
| 16 | SF Rx, Ry, R | Split flags. Rx = R[3:2], Ry = R[1:0] | SF R1, R2, R0 |
| | Transmission control instructions | | |
| 17 | SP h, E, RC | Send RC number of pulses (RC = register number or constant value; RC is a constant if Type = 1). Halt PC if h = 1 (h = 0, 1). Send one extra pulse if E = 1 (E = 0, 1) | SP 1, 1, 4 |
| 18 | SD h, RC | Inter-symbol delay of RC number of clock cycles. Halt PC if h = 1 (h = 0, 1) | SD 1, 4 |
| 19 | WRI WE, E, R | Wait for receiver pulse stream interrupt. PC halts until the interrupt arrives. Remove one extra pulse count if E = 1 (E = 0, 1). Enable received pulse count write to register R (R = R0, R1, ..., R7) if WE = 1 (WE = 0, 1) | WRI 1, 1, R0 |
| 20 | SDB C | Set the index bits or the data bits in the LoadReg as per the received pulse stream (C = 0, 1 for indexing and data, respectively) | SDB 1 |
| | Register/Branch update instructions | | |
| 21 | WR R, C | Write constant value to a register R (R = R0, R1, ..., R7) | WR R0, 8 |
| 22 | BNZD R, label | Branch to label and decrement R by 1 if the specified register R contains a non-zero number (R = R0, R1, ..., R7) | BNZD R0, loop |

Table 4. I-Types

| Instruction category | I-Type |
|---|---|
| Configuration | 1 |
| Transmission control | 1 |
| Register/Branch update | 2 |
| Encoding/Decoding | 3 |

4.2 Type 2 Instructions (I-Type 2)

I-Type 2 is the smallest set of instructions. As mentioned earlier, these instructions handle two operands at a time and are concerned with register and/or branch update operations. One of the operands is a register and the other is an immediate constant. The first of these instructions is WR, write register, which stores an immediate constant value in a specified general-purpose register. The second is the jump instruction BNZD, branch and decrement if not zero. It takes two arguments: a register whose content is checked and a label to jump to. If the content of the specified register is non-zero, the program counter jumps to the label and the register value is decremented by one. BNZD is helpful in writing conditional loops.

4.3 Type 3 Instructions (I-Type 3)

These instructions are concerned with encoding and decoding and use either two or three operands, all of which must be registers. The five instructions described next are used in encoding the selected segment. IV, invert, complements the bits of the selected segment unconditionally, and the resulting new segment replaces the corresponding segment in the LoadReg. The operand register Rx stores the new number of ON bits (NOI) in the resulting segment, and register Ry stores the corresponding flags that represent the encoding type, as per the encoding description in the PIC and PDC overview. IVC, invert conditionally, works the same way as IV, but only if the encoding condition is true. The condition, as mentioned earlier in the overview section, is that the number of ON bits in the selected segment is greater than half the segment size. Rx and Ry are updated with the new NOI and flags, respectively. FL, flip, and FLC, flip conditionally, work exactly the same way as IV and IVC, respectively, except that the base operation is bit-wise reversal (flipping) instead of inversion. The condition for FLC is that the value of the selected segment is greater than the value of the flipped segment. If the condition is true, the ON bits sit at the higher indices and hence represent a large decimal number; both the indices and the number can be reduced by relocating the ON bits to the lower indices. The fifth instruction that takes part in encoding is IVFL, invert and flip. IVFL works in the same way as the other four instructions, except that it applies both inversion and flipping together, unconditionally.
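The five encoding operations can be sketched in Python as pure functions on a single segment. This is an illustrative model rather than the processor's implementation: the function names are hypothetical, each function returns the new segment, its number of ON bits (the value the text stores in Rx), and a boolean `applied` that merely stands in for the flags written to Ry, whose actual encoding is defined in the PIC and PDC overview and is not reproduced here.

```python
# Sketch of IV, IVC, FL, FLC and IVFL on one segment of `size` bits.
# ASSUMPTION: `applied` is a stand-in for the flags register Ry.

def popcount(x):
    return bin(x).count("1")


def invert(seg, size):
    """Complement every bit of the segment."""
    return ~seg & ((1 << size) - 1)


def flip(seg, size):
    """Reverse the bit order of the segment."""
    out = 0
    for i in range(size):
        if seg >> i & 1:
            out |= 1 << (size - 1 - i)
    return out


def iv(seg, size):
    new = invert(seg, size)
    return new, popcount(new), True


def ivc(seg, size):
    # Invert only if more than half of the bits are ON.
    if 2 * popcount(seg) > size:
        return iv(seg, size)
    return seg, popcount(seg), False


def fl(seg, size):
    new = flip(seg, size)
    return new, popcount(new), True


def flc(seg, size):
    # Flip only if the segment value exceeds its flipped value.
    flipped = flip(seg, size)
    if seg > flipped:
        return flipped, popcount(flipped), True
    return seg, popcount(seg), False


def ivfl(seg, size):
    new = flip(invert(seg, size), size)
    return new, popcount(new), True


print(ivc(0b0111, 4))  # three ON bits out of four, so the inversion is applied
print(flc(0b1000, 4))  # ON bit at the highest index, so the flip is applied
```

Both conditional variants reduce the work of the transmitter: IVC lowers the number of ON bits to send, and FLC moves ON bits toward lower indices.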


The instructions CF, combine the flags, and SF, split the flags, are used for PDC but can be used for any customized technique through PICIA. CF takes two operands, Rx and Ry, representing the two flags to be combined, and stores the result in the third operand register R. The two LSBs of Rx and the two LSBs of Ry, in that order, are combined to generate the four LSBs of R. Similarly, SF splits the combined flags in a specified register R into two separate flags and stores these in registers Rx and Ry. Ry takes the two LSBs of R, and Rx takes the next two bits of R.

The last and most complex instruction of PICIA is CRC, copy register conditionally. Based on the settings of I and Co, the instruction performs four different copy operations, as shown in Table 5, where X denotes a don't-care and LoadReg[Rs] denotes the LoadReg bit at the index given by Rs. CRC can be used for a simple register-to-register copy, because the instruction copies register Rs to R if both Co and I are cleared. If Co is cleared and I is set, the value to be copied is decided by the bit of the LoadReg located at the index given by the contents of register Rs. If that LoadReg bit is cleared, 0 is copied to register R; otherwise, Rs is copied to R. This operation is helpful in generating PIC pulse streams. Recall that PIC selects only the ON bits in the data and transmits their index numbers in the form of pulse streams. CRC with this configuration therefore helps in finding whether the target bit is ON. If the bit is ON, its index number, which is held in register Rs, needs to be transmitted, and that is why it is copied to R. If the bit is OFF, there is nothing to transmit, and that is why 0 is copied to register R. Hence, the index numbers of the ON bits can be transmitted in a loop.

If Co is set, I becomes a don't-care and the contents of the selected segment are copied to register R. This is helpful in generating PDC pulse streams since, unlike PIC, PDC transmits the contents of the sub-segments in the form of pulse streams. Using this configuration of CRC, all segments of the data word can be selected and transmitted one by one in a loop. The configurations of the CRC instruction can also be used to build any other customized transmission technique based on the idea of transmitting information in the form of pulse streams.

Table 5. CRC instruction functionality

| Co | I | LoadReg[Rs] | Description |
|---|---|---|---|
| 0 | 0 | X | R = Rs |
| 0 | 1 | 0 | R = 0 |
| 0 | 1 | 1 | R = Rs |
| 1 | X | X | R = Selected Segment |
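The behaviour of CF, SF, and CRC described above can be modelled in a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, register values are passed in as plain integers, and LoadReg bit 0 is assumed to be the least-significant bit. The final loop mimics the PIC-style scan for ON bits mentioned in the text.

```python
# Sketch of CF, SF and CRC (register contents modelled as integers).
# ASSUMPTION: LoadReg index 0 is the least-significant bit.

def cf(rx, ry):
    """CF: R = {Rx[1:0], Ry[1:0]}."""
    return ((rx & 0b11) << 2) | (ry & 0b11)


def sf(r):
    """SF: Rx = R[3:2], Ry = R[1:0]; returns (Rx, Ry)."""
    return (r >> 2) & 0b11, r & 0b11


def crc(rs, i, co, load_reg, selected_segment):
    """CRC following Table 5; returns the value written to R."""
    if co:
        return selected_segment      # Co = 1: I and Rs are ignored
    if not i:
        return rs                    # plain register-to-register copy
    return rs if (load_reg >> rs) & 1 else 0


# PIC-style scan: collect the indices of the ON bits of the data word.
# Index 0 is left out of this sketch because CRC returns 0 for it
# whether the bit is ON or OFF.
load_reg = 0b0000_0100_0001_0000     # bits 10 and 4 are ON
indices = [crc(idx, 1, 0, load_reg, 0) for idx in range(15, 0, -1)]
print([v for v in indices if v])
```

In the assembly program, the same scan would be expressed with CRC inside a BNZD loop, with SP sending each non-zero result as a pulse stream.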

5 Experimental Verification and Results

Verilog HDL is used to describe a fully functional processor based on the proposed ISA, and a full experimental setup is implemented on the Xilinx Spartan-6 FPGA platform. The prototype platform is used to verify the functionality and performance of the proposed PICIA. Extensive simulations and real-time hardware verification are performed to verify the results. A clock rate of 25 MHz is used for the PICIA testing system. In the experimental flow, the PICIA processor's transmitter sends 16-bit data, starting at 0 and incrementing by 1 at each transmission. The PICIA processor's receiver sends the same data back. The returned and original data words are compared to verify the complete round-trip chain.

In another experiment, the software aspects of two implementations are compared. In the first implementation, the PIC family member techniques are developed for TI's MSP432X processor family. The MSP432X is chosen because it is an ultra-low-power RISC processor and therefore provides an appropriate off-the-shelf choice for comparing PIC assembly programs on our PICIA processor with those on the MSP432X. The second implementation uses the PICIA assembly language to develop the same techniques for the implemented processor. Both implementations use a 25 MHz clock. The number of instructions required to implement these techniques on the MSP432X is approximately 1300 to 1400 on average, whereas PICIA needs only 50 to 100 instructions, a reduction by a factor of approximately 13 to 28. The data rates offered by the MSP432X implementation are also reduced significantly, by a factor of approximately 100, whereas the data rates are preserved when the communication techniques are implemented using PICIA. The software implementation comparison is shown in Table 6 and Fig. 2.

Fig. 2. PIC family implementation: PICIA vs. MSP432x

An example showing how PICIA reduces the number of instructions is illustrated in Fig. 3. The left side of the figure presents an encoding example for PDC implemented in C. If the encoding is implemented using a RISC ISA, around 150 instructions are required, whereas the same encoding implemented using PICIA requires only 15 instructions. The sample pseudocode in Fig. 3 highlights the flow of the program and the PICIA instructions involved.

Fig. 3. PICIA code reduction example

We have also synthesized the PICIA processor system using GLOBALFOUNDRIES 65 nm technology and estimate that the PICIA hardware consumes around 31.14 µW with a gate count of about 4700 gates. The power consumption results are promising, as they remain well within the power budget of a full-hardware implementation of stand-alone pulsed-signaling techniques. Additionally, the consumption of hardware resources is comparable, the data rates are preserved, and the required number of instructions is reduced. Moreover, PICIA offers a customizable solution: it provides a fully programmable communication interface that is specifically geared to the realization of pulsed-transmission techniques.

Table 6. Results

| Implementation | PICIA | Stand-alone |
|---|---|---|
| Software implementation comparison | | |
| Avg. no. of instructions | 50–100 | 1300–1400 |
| Avg. data rate (Mbps) | ≈4.1–7.1 | ≈0.041–0.071 |
| Hardware synthesis comparison | | |
| Power (µW) | ≈31.14 | ≈19–26.6 |
| Avg. E_b (pJ/bit) | ≈4.2–7.6 | ≈2.7–6.5 |
| Area (gate count) | ≈4700 | ≈2100–2400 |
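The ratios quoted in this section can be reproduced from the Table 6 figures with simple arithmetic. The Python sketch below is only a rough consistency check: it assumes that the energy per bit is power divided by data rate, which the text does not state explicitly, and the helper name `energy_per_bit_pj` is hypothetical.

```python
# Rough consistency check of the reported ratios, using Table 6 values.
# ASSUMPTION: energy per bit = power / data rate (not stated in the text).

picia_instr = (50, 100)      # PICIA instruction count range
msp_instr = (1300, 1400)     # MSP432X instruction count range

reduction_low = msp_instr[0] / picia_instr[1]    # most conservative pairing
reduction_high = msp_instr[1] / picia_instr[0]   # most favourable pairing

rate_ratio = 4.1 / 0.041     # PICIA vs. MSP432X data rate (Mbps)


def energy_per_bit_pj(power_uw, rate_mbps):
    """µW divided by Mbps gives 1e-12 J/bit, i.e. pJ/bit."""
    return power_uw / rate_mbps


print(reduction_low, reduction_high)
print(round(rate_ratio))
print(round(energy_per_bit_pj(31.14, 4.1), 1))   # PICIA at the lowest data rate
print(round(energy_per_bit_pj(19.0, 7.1), 1))    # stand-alone, best case
print(round(energy_per_bit_pj(26.6, 4.1), 1))    # stand-alone, worst case
```

Under this assumption the stand-alone range and the upper PICIA bound agree with the tabulated E_b values; the lower PICIA bound computed this way (31.14 µW at 7.1 Mbps) comes out somewhat above the tabulated 4.2 pJ/bit, so the sketch should not be read as the authors' exact derivation.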

6 Securing PICIA

This section presents a possible extension of the PICIA to support secure PIC communication [11]. An advantage of the proposed extension is that it does not require any modification of the PICIA instruction format, as it employs the very same instruction types of Sect. 4 to add instructions dedicated to cryptographic functions. The security layer extension of PICIA offers a programmable environment not only to select a suitable encryption algorithm but also to choose among various execution options of the selected algorithm, with the goal of trading off transmission security against data rate. Specifically, the PICIA security layer has the following features:

1. Support of multiple encryption algorithms such as simple XOR, MA5/1 [11], and AES.
2. Encryption gating in case the crypto function is not needed.
3. Configurable encryption hardware to tune the number of clock cycles used in data encryption. A tradeoff between the number of crypto clock cycles and the required crypto hardware resources is implemented through the iterative use of a smaller crypto unit. In such a case, the unused crypto hardware units are gated.

In the following subsections, the security features of the extended PICIA architecture are highlighted.

Table 7. Security layer registers in addition to the regular registers of Table 1

| No. | Register | Size | Type | Description |
|---|---|---|---|---|
| 5 | Ctrl2 | 8 bit | SP (a) | [Enable SL (b), 3-bit Enc. (c) Algorithm, 4-bit Enc. Speed] |
| 6 | EncIniKey | 16×16-bit | SP | 256-bit Initial Key, array of sixteen 16-bit registers |

(a) Special Purpose; (b) Security Layer; (c) Encryption
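The Ctrl2 layout listed in Table 7 can be sketched as an 8-bit field packing in Python. The function names are hypothetical, and the bit order is an assumption: the sketch places the enable bit in the most-significant position, followed by the 3-bit algorithm field and the 4-bit speed field, simply because that is the order in which the table lists them. The cycle count uses the relation between the ES field and the number of encryption clock cycles, n_C = 2^ES, given in Sect. 6.2.

```python
# Sketch of the 8-bit Ctrl2 register: [Enable SL | 3-bit Alg | 4-bit Speed].
# ASSUMPTION: fields are packed MSB-first in the order listed in Table 7.

def pack_ctrl2(enable, alg, es):
    if enable not in (0, 1) or not 0 <= alg <= 7 or not 0 <= es <= 15:
        raise ValueError("field out of range")
    return (enable << 7) | (alg << 4) | es


def unpack_ctrl2(value):
    """Return (enable, alg, es) from an 8-bit Ctrl2 value."""
    return (value >> 7) & 1, (value >> 4) & 0b111, value & 0b1111


def encryption_cycles(es):
    """Number of encryption clock cycles for speed setting ES (2 ** ES)."""
    return 1 << es


ctrl2 = pack_ctrl2(enable=1, alg=1, es=2)   # security on, Alg = 1, ES = 2
print(hex(ctrl2), unpack_ctrl2(ctrl2), encryption_cycles(2))
```

The three fields add up to exactly eight bits, so no bit of the register is left unused in this layout.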

6.1 Extended Register Set

Two new registers are added to the PICIA register set in support of the security layer extension, as shown in Table 7. The first is the control register Ctrl2, an 8-bit register that stores the configuration parameters of the security layer: whether the security layer is enabled, which encryption algorithm is selected, and the speed of encryption in terms of the number of clock cycles. The programmer initially sets the control register through a specific instruction but, once set, it becomes accessible only to the system. The second is the 256-bit EncIniKey register, organized as an array of sixteen 16-bit registers and used to store the initial encryption key. Like the control register, EncIniKey is a privileged register accessible only to the system.

6.2 Extended Instruction Set

Three new instructions are added to the PICIA assembly language in support of the security layer, as shown in Table 8. These instructions deal with the configuration and control of the security layer. The first instruction is ESL, enable security layer, which activates the security layer and updates the Ctrl2 register. The operand En is a one-bit modifier: a value of 0 signifies normal PIC transmission without encryption, and a value of 1 enables encryption ahead of transmission. The second ESL operand, Alg, is a 3-bit operand that selects the encryption algorithm to be used. There can be a maximum of eight hardware blocks in the PICIA processor system, each representing a particular encryption algorithm. In our current implementation, an Alg of 0 selects a simple XOR operation, while a value of 1 selects MA5/1, a modified, PIC-compatible version of the symmetric A5/1 encryption algorithm [11]. The third ESL operand, ES, sets the speed of the encryption process in terms of the number of clock cycles. This instruction assumes that the encryption techniques implemented within the PICIA processor support changing the number of clock cycles used to generate a full encrypted data word. For example, if MA5/1 is configured to use one clock cycle, the full encryption hardware is utilized. If the same algorithm is configured to use four clock cycles, one-fourth of the hardware is used, and the rest is gated to save power. The ES operand takes an unsigned integer value in the range of 0 to 15. The number of encryption clock cycles is calculated as $n_C = 2^{ES}$. Through this operand, a trade-off between crypto latency and power can be easily programmed into the configuration of the security layer.

As described earlier, the length of the key register EncIniKey is 256 bits. The same register can also be used for initializing shorter keys, e.g., the 128-bit initial key of MA5/1. There is therefore a need for instructions for key-length setting and EncIniKey register initialization. The instructions LPI, lock previous instruction, and UPI, unlock previous instruction, are introduced for that very purpose. LPI locks the previously executed instruction in the control unit and keeps all the generated control signals active until it is unlocked using UPI. In other words, these two instructions define the start and end of the user's key section in the assembly program and must follow the ESL instruction. All the 16-bit binary numbers between these two instructions are considered segments of the full initial key. These segments are stored in the EncIniKey register using an internal 4-bit offset register. The offset register defines the row index of a 16 × 16 array version of the EncIniKey register. The offset register is cleared when the LPI instruction is executed and is incremented when a 16-bit segment is stored successfully. An example of EncIniKey initialization is shown in Table 9, where the current offset is the EncIniKey offset value before the execution of a given instruction and the updated offset is the EncIniKey offset value after its execution.

Table 8. Security layer instructions in addition to those of Table 3

| No. | Instruction | Description | Example |
|---|---|---|---|
| | Security layer instructions | | |
| 23 | ESL En, Alg, ES | Enable security layer. Enable if En = 1, disable if En = 0. Alg selects the encryption algorithm (0: XOR, 1: MA5/1, ..., 7: OtherAlgo7). ES sets the encryption speed in terms of the number of clock cycles per encryption iteration (number of clock cycles $n_C = 2^{ES}$) | ESL 1, 0, 1 |
| 24 | LPI | Lock previously executed instruction. Unless unlocked, all the next instructions are considered as 16-bit constant values for the locked instruction | LPI |
| 25 | UPI | Unlock the locked instruction | UPI |
| | Mapping | Security layer instructions are mapped to I-Type 1 | |

6.3 Instruction Format

The addition of the crypto instructions does not change the PICIA instruction format given in Fig. 1. All the new assembly language instructions described in the previous subsections use the I-Type 1 instruction format. As shown in Fig. 4, the only change we need to account for is in terms of operand values. In particular, the [Alg, ES] operands are added to the "Register/Constant" field, and the En operand, which controls the enabling of the security layer, is added to the "E" field. The instruction opcode directs the instruction decoder to activate the control signals as per the issued assembly language command.

Fig. 4. Additional operand values in the PICIA I-Type 1 instructions

Table 9. Key initialization examples

| 16-bit Key | 64-bit Key | 256-bit Key | Current Offset | Updated Offset |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
| ESL 1, 1, 0 | ESL 1, 1, 0 | ESL 1, 1, 0 | 8 | 9 |
| LPI | LPI | LPI | 0 | 0 |
| 0xF192 | 0xF192 | 0xF192 | 0 | 1 |
| UPI | 0x11AB | 0x11AB | 1 | 2 |
| ... | 0xA9F6 | 0xA9F6 | 2 | 3 |
| | 0x3313 | 0x3313 | 3 | 4 |
| | UPI | ... | ... | ... |
| | ... | ... | ... | ... |
| | | 0x46F4 | 15 | 0 |
| | | UPI | 15 | 0 |
| | | ... | 15 | 0 |
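The key section delimited by LPI and UPI (Sect. 6.2, Table 9) can be modelled with a small Python class. This is a behavioural sketch, not the control unit itself: the class and method names are hypothetical, and the wrap-around of the offset after the sixteenth word is assumed from the offset register being 4 bits wide.

```python
# Behavioural sketch of EncIniKey initialization between LPI and UPI.
# ASSUMPTION: the 4-bit offset register wraps from 15 back to 0.

class KeyLoader:
    def __init__(self):
        self.enc_ini_key = [0] * 16   # sixteen 16-bit rows (256 bits in total)
        self.offset = 0               # internal 4-bit offset register
        self.locked = False

    def lpi(self):
        """Start of the key section: lock and clear the offset."""
        self.locked = True
        self.offset = 0

    def upi(self):
        """End of the key section."""
        self.locked = False

    def word(self, value):
        """Store one 16-bit key segment and advance the offset."""
        if not self.locked:
            raise RuntimeError("key words are only accepted between LPI and UPI")
        self.enc_ini_key[self.offset] = value & 0xFFFF
        self.offset = (self.offset + 1) % 16


# 64-bit key example using the four words listed in Table 9.
loader = KeyLoader()
loader.lpi()
for segment in (0xF192, 0x11AB, 0xA9F6, 0x3313):
    loader.word(segment)
loader.upi()
print([hex(w) for w in loader.enc_ini_key[:4]], loader.offset)
```

Shorter keys simply leave the remaining rows of EncIniKey untouched, which matches the idea of reusing the same 256-bit register for 16-bit, 64-bit, or 128-bit keys.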

7 Conclusions

The Pulsed-Index Communication Interface Architecture (PICIA) is a RISC-style special-purpose ISA for single-channel, low-power, high-data-rate, dynamic, and robust communication based on pulsed-signaling protocols. It is designed to facilitate the efficient generation of compact assembly code that is specific to such communication interfaces. This hardware/software co-design capability can be used to embed not only an existing PIC family member but also any custom, nonstandard PIC protocol without changing the underlying hardware, while greatly reducing the number of required instructions. Furthermore, such a communication interface implementation results in minimal to no impact on the data rates, power consumption, or reliability of the protocols. The PICIA processor has been synthesized in GLOBALFOUNDRIES 65 nm technology and has been found to consume only 31.14 µW, which translates into an energy efficiency of less than 10 pJ per transmitted bit. To support secure communication, the basic PICIA has been extended to provide a programmable environment for selecting a suitable encryption algorithm and controlling its latency at execution. PICIA's micro-architecture and the optimized hardware blocks that compactly implement its RISC-style ISA are the subject of a separate publication.

Acknowledgments. This work has been supported by the Semiconductor Research Corporation (SRC) under the Abu Dhabi SRC Center of Excellence on Energy-Efficient Electronic Systems (ACE4S), Contract 2013 HJ2440, with customized funding from the Mubadala Development Company, Abu Dhabi, UAE.

References

1. Dayu, S., Huaiyu, X., Ruidan, S., Zhiqiang, Y.: A Geo-related IoT applications platform based on Google map. In: 7th International Conference on e-Business Engineering (ICEBE), Shanghai, China, pp. 380–384, November 2010
2. Byun, J., Kim, S.H., Kim, D.: Lilliput: ontology-based platform for IoT social networks. In: IEEE International Conference on Services Computing, Anchorage, AK, USA, pp. 139–146, June–July 2014
3. Hsu, J.M., Chen, C.Y.: A sensor information gateway based on thing interaction in IoT-IMS communication platform. In: 10th International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP), Kitakyushu, Japan, pp. 835–838, August 2014
4. MAXIM: OneWireViewer User's Guide, Version 1.4 (2009)
5. dos Reis Filho, C., da Silva, E., de Azevedo, E., Seminario, J., Dibb, L.: Monolithic data circuit-terminating unit (DCU) for a one-wire vehicle network. In: Proceedings of the 24th European Solid-State Circuits Conference (ESSCIRC 1998), pp. 228–231, Hague, Netherlands, September 1998
6. Muzaffar, S., Shabra, A., Yoo, J., Elfadel, I.M.: A pulsed-index technique for single-channel, low power, dynamic signaling. In: Design, Automation and Test In Europe (DATE 2015), Grenoble, France, pp. 1485–1490, March 2015
7. Muzaffar, S., Elfadel, I.M.: A pulsed decimal technique for single-channel, dynamic signaling for IoT applications. In: 25th IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC 2017), Abu Dhabi, UAE, pp. 1–6, October 2017
8. Teja, R., Jammu, B.R., Adimulam, M., Ayi, M.: VLSI implementation of LTSSM. In: International conference of Electronics, Communication and Aerospace Technology (ICECA 2017), Coimbatore, India, pp. 129–134, April 2017
9. linux-mips.org: Cisco Systems Routers (2012). https://www.linux-mips.org/wiki/Cisco
10. Muzaffar, S., Elfadel, I.M.: An instruction set architecture for low-power, dynamic IoT communication. In: 26th IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC 2018), Verona, Italy, October 2018 (to appear)
11. Muzaffar, S., Waheed, O.T., Aung, Z., Elfadel, I.M.: Single-clock-cycle, multilayer encryption algorithm for single-channel IoT communications. In: IEEE Conference on Dependable and Secure Computing (DSC 2017), Taipei, Taiwan, pp. 153–158, August 2017

The Connection Layout in a Lattice of Four-Terminal Switches

Anna Bernasconi, Antonio Boffa, Fabrizio Luccio, and Linda Pagli

Dipartimento di Informatica, Università di Pisa, Pisa, Italy
{anna.bernasconi,fabrizio.luccio,linda.pagli}@unipi.it

Abstract. A non-classical approach to the logic synthesis of Boolean functions based on switching lattices is considered, for which deriving a feasible layout has not been previously studied. All switches controlled by the same literal must be connected together and to an input lead of the chip, and the layout of such connections must be realized in superimposed layers. Inter-layer connections are realized with vias, with the overall goal of minimizing the number of layers needed. The problem shows new interesting combinatorial and algorithmic aspects. Since the specific lattice cell where each switch is placed can be decided with a certain amount of freedom, and one literal among several may be assigned for controlling a switch, we first study a lattice rearrangement (Problem 1) and a literal assignment (Problem 2), to place in adjacent cells as many switches controlled by the same literal as possible. Then we study how to build a feasible layout of connections onto different layers using a minimum number of such layers (Problem 3). We prove that Problem 2 is NP-hard, and Problems 1 and 3 also appear to be intractable. Therefore we propose heuristic algorithms for the three phases that show an encouraging performance on a set of standard benchmarks.

Keywords: Circuit layout · Switching lattices · Logic synthesis · Hard problems · Heuristics

1 Introduction

The logic synthesis of a Boolean function is the procedure for implementing the function into an electronic circuit. The literature on this subject is extremely vast and a large part of it is devoted to two-level logic synthesis, where the function is implemented in a NAND or NOR circuit of maximal depth 2 [1]. In this paper, we focus on a different synthesis method based on a switching lattice, that is, a two-dimensional array of four-terminal switches implemented in its cells. Each switch is linked to its four neighbors and is connected with them when the switch is ON, or is disconnected when the switch is OFF. The idea of using regular two-dimensional arrays of switches to implement Boolean functions dates back to a seminal paper by Akers in 1972 [2]. Recently, with the advent of a variety of emerging nanoscale technologies based on regular arrays of switches, synthesis methods targeting lattices of multi-terminal switches have found a renewed interest [3–5].

© IFIP International Federation for Information Processing 2019. Published by Springer Nature Switzerland AG 2019. N. Bombieri et al. (Eds.): VLSI-SoC 2018, IFIP AICT 561, pp. 32–52, 2019. https://doi.org/10.1007/978-3-030-23425-6_3

A Boolean function can be implemented in a lattice with the following rules:

– each switch is controlled by a Boolean literal, i.e. by one of the input variables or by its complement;
– if a literal takes the value 1 all corresponding switches are connected to their four neighbors, else they are not connected;
– the function evaluates to 1 for any input assignment that produces a connected path between two opposing edges of the lattice, e.g., the top and the bottom edges; the function evaluates to 0 for any input assignment that does not produce such a path.

For instance, the 3 × 3 lattice of switches and corresponding literals in Fig. 1a implements the function f = x1 x2 x3 + x1 x3 + x2 x3. If we assign the values 1, 0, 0 to the variables x1, x2, x3, respectively, we obtain paths of 1’s connecting the top and the bottom edges of the lattice (Fig. 1b), and f evaluates to 1. On the contrary, the assignment x1 = 1, x2 = 0, x3 = 1, on which f evaluates to 0, does not produce any path from the top to the bottom edge (Fig. 1c). All the other input assignments can be similarly checked.

Fig. 1. A network of four terminal switches implementing the function f = x1 x2 x3 + x1 x3 + x2 x3 (a); the lattice evaluated on the assignments 1, 0, 0 (b) and 1, 0, 1 (c), with 1’s and 0’s representing ON and OFF switches, respectively.
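The evaluation rules listed above can be illustrated with a minimal Python sketch. It is not taken from the paper: the encoding of a literal as a pair `(variable, positive)`, the function names, and the 2 × 2 lattice `EXAMPLE` (which implements x1 x2 + x3 and is not the lattice of Fig. 1) are all assumptions made for illustration.

```python
from collections import deque
from itertools import product


def lattice_value(lattice, assignment):
    """Return 1 if the ON switches contain a top-to-bottom path, else 0.

    lattice: list of rows; each cell is a pair (variable, positive), where
    positive=False denotes the complemented variable (hypothetical encoding).
    assignment: dict mapping each variable name to 0 or 1.
    """
    rows, cols = len(lattice), len(lattice[0])

    def is_on(i, j):
        variable, positive = lattice[i][j]
        return bool(assignment[variable]) == positive

    # Breadth-first search started from every ON switch of the top row.
    frontier = deque((0, j) for j in range(cols) if is_on(0, j))
    seen = set(frontier)
    while frontier:
        i, j = frontier.popleft()
        if i == rows - 1:
            return 1
        for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= a < rows and 0 <= b < cols and (a, b) not in seen and is_on(a, b):
                seen.add((a, b))
                frontier.append((a, b))
    return 0


def truth_table(lattice, variables):
    """Evaluate the lattice on every assignment of the given variables."""
    return {
        values: lattice_value(lattice, dict(zip(variables, values)))
        for values in product((0, 1), repeat=len(variables))
    }


# Hypothetical 2 x 2 lattice: column 1 carries x1, x2 and column 2 carries x3, x3.
EXAMPLE = [[("x1", True), ("x3", True)],
           [("x2", True), ("x3", True)]]
```

Calling `truth_table(EXAMPLE, ("x1", "x2", "x3"))` enumerates all eight assignments, so the function realized by a small lattice can be compared directly with its intended expression.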

The synthesis of a function f on a lattice consists of finding an assignment of input literals to the switches, such that the top-bottom paths in the lattice implement f and the number of switches in the lattice is reduced as much as possible. Recalling that the dual $f^D$ of a function f is such that $f^D(x_1, \dots, x_n) = \overline{f(\overline{x}_1, \dots, \overline{x}_n)}$, in [3,4] Altun and Riedel developed a synthesis method where the implicants (products) of the minimal irredundant SOP forms of the function f and of its dual $f^D$ are respectively associated, in any order, to the columns and to the rows of the lattice. The literal assigned to a switch is chosen from the necessarily non-void intersection of the two subsets of literals appearing in the implicants corresponding to the intersecting column and row of the lattice. If several literals appear in the intersection, any one of them can be chosen. As an elementary example consider the function f = x1 x3 x4 + x1 x2 + x1 x3 x4 and its dual $f^D$ = x1 x3 + x1 x4 + x2 x3 + x1 x2 x4. Figure 2a shows the lattice for f where just one multiple assignment x1, x3 occurs.

Fig. 2. A lattice implementing the function x1 x3 x4 + x1 x2 + x1 x3 x4 (a), where a choice between x1 and x3 must be performed in the first cell. The lattice after column permutation (b), and then row permutation (c), obtained with our method.
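The column/row construction of the Altun-Riedel method can be sketched in a few lines of Python. This is only an illustration under assumptions: products are represented as sets of literal strings (a complemented literal could be written, for example, as "~x1"), the function name is hypothetical, and the example uses the three-input majority function, which is self-dual, rather than the function of Fig. 2.

```python
def candidate_literals(f_products, dual_products):
    """Literal subsets of an Altun-Riedel style lattice.

    Columns correspond to the products of f and rows to the products of its
    dual; cell (i, j) receives the literals common to row product i and
    column product j.
    """
    return [[set(row) & set(col) for col in f_products] for row in dual_products]


# Majority of three variables: f = x1 x2 + x1 x3 + x2 x3, and f equals its dual.
MAJORITY = [{"x1", "x2"}, {"x1", "x3"}, {"x2", "x3"}]
LATTICE = candidate_literals(MAJORITY, MAJORITY)
```

In this example the diagonal cells hold two literals each (a multiple assignment), while every off-diagonal cell holds exactly one, and no cell is empty.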

Starting from the lattice obtained by the Altun-Riedel method we consider three problems related to the physical implementation of the circuit, which must be solved obeying the following assumptions.

1. Equal literals must be connected together, and to an external terminal on one side (e.g. the top edge) of the lattice. This may require using different layers, and vias to connect cells of adjacent layers.
2. Connections can be laid out horizontally or vertically (but not diagonally) between adjacent cells.
3. Each cell can be occupied by a switch, or by a portion of a connecting wire, or by a via. No two such elements can share a cell on the same layer. In particular the connections cannot cross on the same layer.
4. The overall target is designing a layout with the minimum number of layers. Since the problem is computationally hard, it will be relaxed to finding a reasonable layout by heuristic techniques.

The circuit will be built starting from the original N × M lattice (level 0), and superimposing on it a certain number H of layers (levels 1 to H), to give rise to a three-dimensional grid of size N × M × (H + 1). Note that the switches associated with the same literal cannot generally be connected all together on the same layer, so one or more subsets of these switches will be connected on a layer and then made available through vias on the next layers to be connected to other subsets.

Two degrees of freedom remain after the function has been synthesized, to be used to reduce the number of layers in the layout. One is the possibility of permuting lattice columns and rows arbitrarily, as exploited in the following Problem 1. The other is the possibility of selecting any literal from a multiple choice for each switch, as exploited in the following Problem 2. Problems 1 and 2 apply at level 0, preparing the lattice for the actual layout design which takes place in the next layers 1 to H, where the inputs of the switches associated with the same literal are connected together and to the corresponding external lead, as treated in the following Problem 3. Problems 1, 2 and 3 are solved one after the other, each producing the input for the next one. Since all of them are computationally hard, we solve them heuristically, then show experimentally that our solutions are efficient for standard benchmarks where the size of the lattice and the number of variables are reasonable. A preliminary version of this study limited to Problem 2 (without a proof of NP-hardness) and to Problem 3 has been presented in [6].

2 Rearranging the Lattice

In principle a lattice can be seen as an array A[N × M], or as an undirected graph G = (V, E) whose vertices correspond to the lattice cells (then |V| = NM) and whose edges correspond to the horizontal and vertical connections between adjacent cells (then |E| = 2NM − N − M). We shall refer interchangeably to a lattice cell A[i, j], 1 ≤ i ≤ N, 1 ≤ j ≤ M, or to a graph vertex $v_k$, 1 ≤ k ≤ NM. Obviously the vertices have degree 2, 3, or 4 if they respectively correspond to corner, border, or internal cells of the lattice. Let $x_1, x_2, \dots, x_n$ be the variables of the function and L be the set of all literals, |L| = 2n. After the Altun-Riedel synthesis is completed, each cell A[i, j] is associated with a non-void subset $L_{i,j} \subseteq L$, from which one literal has to be eventually assigned to the corresponding graph vertex. We pose:

Definition 1. If a single literal is associated to each vertex, an mc-area is a maximal connected subgraph S of G in which all vertices hold the same literal. Equivalently an mc-area is the portion of the lattice corresponding to S.

Note that if two mc-areas $A_1$, $A_2$ hold the same literal, no two cells $c_1 \in A_1$, $c_2 \in A_2$ may be adjacent since $A_1$ and $A_2$ are maximal. A basic task is minimizing the number of mc-areas, or equivalently making them as large as possible, since this will imply reducing the number of layers. As shown below the core of this problem is NP-hard, so we study how to solve it heuristically. As already said we proceed in two consecutive phases. The first phase is aimed at permuting columns and rows in order to increase the number of adjacent cells holding common literals, even though subsets of two or more literals may still be associated to each cell. The second phase consists of the selection of one literal in each of the multiple assignments at the cells, so the mc-areas can be built according to Definition 1. To implement the first phase we pose:

Definition 2. The weight of a pair of cells A[r, s], A[u, v] is given by: $w_{r,s}^{u,v} = |L_{r,s} \cap L_{u,v}| / (|L_{r,s}| \cdot |L_{u,v}|)$.
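The weight of Definition 2 can be computed exactly with rational arithmetic. The following Python sketch is illustrative only; the function names are hypothetical, and the second function simply enumerates all pairs of independent, uniform choices to show that the weight coincides with the probability that the two cells pick the same literal.

```python
from fractions import Fraction


def pair_weight(literals_a, literals_b):
    """Weight of two cells: |A ∩ B| / (|A| * |B|), as an exact fraction."""
    return Fraction(len(literals_a & literals_b), len(literals_a) * len(literals_b))


def agreement_probability(literals_a, literals_b):
    """Probability that one literal drawn uniformly from each set is the same."""
    matches = sum(p == q for p in literals_a for q in literals_b)
    return Fraction(matches, len(literals_a) * len(literals_b))


# Two illustrative literal subsets sharing the literals b and c.
SET_A = {"a", "b", "c"}
SET_B = {"b", "c", "d", "e"}
```

For these two sets both functions return 1/6, since 2 of the 12 equally likely pairs of choices agree.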


For example, for $L_{r,s} = \{a, b, c\}$, $L_{u,v} = \{b, c, d, e\}$ we have $L_{r,s} \cap L_{u,v} = \{b, c\}$, then: $w_{r,s}^{u,v} = 2/(3 \cdot 4) = 1/6$. Note that $w_{r,s}^{u,v}$ is the probability for A[r, s], A[u, v] to share the same literal if one literal is chosen independently and uniformly at random from each of $L_{r,s}$ and $L_{u,v}$. The weight is relevant in our case for pairs of adjacent cells. As we are interested in building large mc-areas, we pose:

Problem 1. Find a column permutation and a row permutation of A, to maximize the sum of weights of the pairs of adjacent cells.

Once Problem 1 has been solved, one literal must be selected in each subset $L_{i,j}$ as stated in the following:

Problem 2. Find a literal assignment for each cell A[i, j] that minimizes the number of mc-areas.

2.1 Solving Problem 1

To ease the task of Problem 1, an elementary observation is in order:

Observation 1. Two cells can be made adjacent by column or row permutation if and only if they lie in the same row or in the same column, respectively.

By Observation 1, column and row permutations can be independently performed since the result of one does not affect the result of the other. In fact the problem will be solved by a column permutation followed by a row permutation. We pose:

Definition 3. The weight $C_{s,v}$ of a pair of columns s, v of the lattice is the sum of weights of all the pairs of cells lying in the columns s, v and in the same row, that is: $C_{s,v} = \sum_{i=1}^{N} w_{i,s}^{i,v}$.

And, symmetrically:

Definition 4. The weight $R_{r,u}$ of a pair of rows r, u of the lattice is the sum of weights of all the pairs of cells lying in the rows r, u and in the same column, that is: $R_{r,u} = \sum_{j=1}^{M} w_{r,j}^{u,j}$.

For building large mc-areas we are interested in making pairs of columns and rows with higher weights adjacent, via permutations. To this end we adopt the following heuristic on the columns, then on the rows.

1. compute the weights $C_{s,v}$ of all pairs of columns;
2. build a stack S whose elements contain the pairs (s, v) with non-zero weights, ordered by decreasing value of such weights;
3. start with M subsets of adjacent columns, each containing exactly one of them;
4. pop one by one the pairs (s, v) from S: for the pair currently extracted decide a column re-arrangement (without actually performing it), merging two subsets of adjacent columns, one containing column s and the other containing column v, if the two columns can be made adjacent without breaking the subsets already built. The columns in each subset are kept ordered for increasing value of their index. This step 4 is performed according to the scheme COLUMN-PERMUTE shown below;
5. once the stack S is empty, permute the groups of columns as indicated in the corresponding subsets previously determined. These groups can then be arranged in the lattice in any order.

COLUMN-PERMUTE
1. define a vector P of M elements, whose values are:
   P[j] = 1 if column j forms a subset of one column;
   P[j] = 2 if column j is the first one in a subset of more than one column;
   P[j] = 3 if column j is the last one in a subset of more than one column;
   P[j] = 4 if column j is neither the first one nor the last one in a subset of more than two columns;
2. initialize P as P[j] = 1 for 1 ≤ j ≤ M;
   // all the initial subsets of columns contain one element;
   // recall that S is a stack of pairs of columns;
3. while S is not empty
     pop a pair (s, v) from S;
     if (P[s] = 4 OR P[v] = 4) discard (s, v)
     else according to the values P[s], P[v] the subsets of s and v are merged bringing s adjacent to v, and the values of P[s], P[v] are updated accordingly;
     // nine combinations of values P[s], P[v] are possible:
     // for P[s] = P[v] = 2 and P[s] = P[v] = 3 the order
     // of the columns in one of the subsets must be inverted;
4. the process ends with one or more subsets each corresponding to the permutation of a group of columns.
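The column heuristic above can be sketched in Python as follows. This is one possible reading, not the authors' implementation: the vector P is replaced by explicit lists of columns (a column is mergeable exactly when it sits at an end of its list), pairs whose columns already belong to the same subset are discarded (an assumption added here to avoid closing a cycle), ties between equal weights are broken by column index, and columns are numbered from 0. All function names are hypothetical.

```python
from fractions import Fraction


def column_weights(cells):
    """Non-zero weights C[s, v]: sum over the rows of the weights of cells (i, s), (i, v)."""
    n_cols = len(cells[0])
    weights = {}
    for s in range(n_cols):
        for v in range(s + 1, n_cols):
            total = sum(
                Fraction(len(row[s] & row[v]), len(row[s]) * len(row[v]))
                for row in cells
            )
            if total > 0:
                weights[(s, v)] = total
    return weights


def column_order(cells):
    """Greedy column order: heaviest pairs are made adjacent first."""
    n_cols = len(cells[0])
    weights = column_weights(cells)
    pairs = sorted(weights, key=lambda pair: (-weights[pair], pair))
    group_of = {j: [j] for j in range(n_cols)}  # column -> list shared by its subset
    for s, v in pairs:
        left, right = group_of[s], group_of[v]
        if left is right:
            continue  # same subset already: merging would close a cycle
        if s not in (left[0], left[-1]) or v not in (right[0], right[-1]):
            continue  # an interior column cannot receive a new neighbor
        if left[-1] != s:
            left.reverse()  # bring s to the right end of its subset
        if right[0] != v:
            right.reverse()  # bring v to the left end of its subset
        left.extend(right)
        for j in right:
            group_of[j] = left
    order, emitted = [], set()
    for j in range(n_cols):  # subsets may be concatenated in any order
        group = group_of[j]
        if id(group) not in emitted:
            emitted.add(id(group))
            order.extend(group)
    return order


# Hypothetical 2 x 3 array of literal subsets.
CELLS = [[{"a"}, {"b"}, {"a"}],
         [{"c"}, {"d"}, {"c", "d"}]]
```

On `CELLS` the non-zero weights are C[0,2] = 3/2 and C[1,2] = 1/2, and the sketch returns the order [1, 2, 0], in which both weighted pairs are adjacent. A subset and its mirror image are equivalent for this purpose, so the orientation of a result may differ from the one drawn in a figure. The permuted array is obtained with `[[row[j] for j in order] for row in cells]`, and the rows can be handled by applying the same functions to the transposed array.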

After column permutation, the rows are permuted with a procedure perfectly symmetrical to the one given for columns, using row weights $R_{r,u}$. In the lattice of Fig. 2(a), the pairs of columns in S with non-zero weights, ordered for decreasing values of the corresponding weights, are:

(1,2) $C_{1,2} = 3/2$, (1,3) $C_{1,3} = 3/2$

producing the subsets of adjacent columns {1, 2}, {3} and then {3, 1, 2}, from which the column permutation of Fig. 2(b) is built. The ordered pairs of rows in S with non-zero weights are:

(1,2) $R_{1,2} = 3/2$, (1,3) $R_{1,3} = 3/2$, (3,4) $R_{3,4} = 1$

producing the subsets of adjacent rows {1, 2}, {3}, {4}, then {3, 1, 2}, {4}, and finally {4, 3, 1, 2}, from which the final row permutation of Fig. 2(c) is built.

A global measure W of the amount of adjacent equal variables in the whole lattice, called the lattice weight, is naturally given by the sum of weights of all pairs of adjacent cells, or equivalently by the sum of weights of all the pairs of adjacent columns and rows, namely: $W = \sum_{j=1}^{M-1} C_{j,j+1} + \sum_{i=1}^{N-1} R_{i,i+1}$. In fact an increase of the value of W can be seen as an indicator of the advantage deriving from a column-row permutation. In the example of Fig. 2 the lattice weight increases from W = 4 (part (a)) to W = 7 (part (c), after column and row permutation). In the simulations discussed in the last section such value roughly doubles on the average after the heuristic for Problem 1 is applied.

Implemented with standard data structures, the time required by the above heuristic is $O(NM^2 n + MN^2 n)$, versus an input size of $O(NMn)$ (recall that n is the number of variables).

2.2 Hardness of Problem 1

Although our heuristic for Problem 1 shows an encouraging experimental value, a weakness derives from the decision of not restructuring a subset of columns/rows after it is built, aside from possibly inverting the order of its elements. Merging two subsets in a strictly optimal way would lead to an exponential explosion of the time needed with a possibly minor improvement of the result. Up to now we have not been able to decide the time complexity of the problem, although we believe that it is NP-hard. We simply pose the question of its precise hardness as a challenging open problem.

2.3 Solving Problem 2

Problem 2 is of high computational interest for two reasons. The first is that the number of prime implicants in f and $f^D$ strongly increases with the number of variables of the function, so that the elementary examples that we have given so far do not show the real extent of the phenomenon. In particular the possibility of having multiple assignments of many literals in the lattice cells is substantially high, making Problem 2 a crucial part of lattice rearrangement. The second reason is that Problem 2 is NP-hard as we will prove in the next subsection, so a clever heuristic must be devised for its solution. To this end a preliminary examination of the lattice is performed by applying the following Rule 1, in an attempt to reduce the number of literals contained in the subsets associated to the vertices.

Rule 1. Let $v_j$ be a vertex; $v_1, v_2, v_3, v_4$ be the four vertices adjacent to $v_j$ (if any); $L_j, L_1, L_2, L_3, L_4$ be the relative subsets of literals. Apply in sequence the following steps:

Step 1. Let $|L_j| > 1$. If a literal $x \in L_j$ does not appear in any of the sets $L_i$, for 1 ≤ i ≤ 4, cancel x from $L_j$ and repeat the step, provided that at least one element remains in $L_j$.

Step 2. Let $|L_j| > 1$, and let $L_k \subset L_j$ with k ∈ {1, 2, 3, 4}. If a literal $x \in L_j$ appears in exactly one set $L_h$ with h ∈ {1, 2, 3, 4} and h ≠ k, then cancel x from $L_j$ and repeat the step, provided that at least the literals of $L_k$ remain in $L_j$.

We have:

Proposition 1. The application of Rule 1 does not prevent finding a literal assignment that minimizes the number of mc-areas.

Proof. Assume that a literal x canceled from $v_j$ by Step 1 or Step 2 of the rule would instead be assigned to $v_j$ in the final assignment.

Step 1. An mc-area containing only $v_j$ with literal x would result. If, in a minimal solution, a different literal y is assigned to $v_j$ and is not assigned to $v_1$, $v_2$, $v_3$, or $v_4$, the number of mc-areas remains the same. If instead y is assigned to one or more of these vertices the number of mc-areas decreases.

Step 2. If x is not assigned to $v_h$ an mc-area containing only $v_j$ with literal x would result, otherwise an mc-area containing $v_j$, $v_h$, and possibly other vertices with literal x would result. Since one of the literals $y \in L_k$ must be assigned to $v_k$ in any minimal assignment, and $L_k \subset L_j$, then y could be assigned to $v_j$ instead of x and the number of mc-areas would remain the same, or would decrease if y is assigned also to other neighbors of $v_j$. □

An example of application of step 2 of Rule 1 is shown in Fig. 3. A literal cancellation from $L_j$ may induce a further cancellation in an adjacent cell. In the example of Fig. 3, if all the cells adjacent to $v_h$ except for $v_j$ do not contain the literal c, the cancellation of c from $L_j$ induces the cancellation of c from $L_h$ if step 1 of Rule 1 is subsequently applied to $v_h$.

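[Figure 3: cell vj holds the literals {a, b, c, d}; its neighbor vh holds {c, e}, its neighbor vk holds {a, b}, and its two other neighbors hold {d, g} and {d, f}.]

Fig. 3. Canceling a literal from a multiple choice using step 2 of Rule 1. Literals are denoted by a, b, c, d, e, f, g. Literal c in cells vj, vh is canceled from vj.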
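As an illustration, Rule 1 can be applied to a single cell with the following Python sketch. It reflects one reading of the rule (in Step 2, a literal outside $L_k$ is canceled when it occurs in exactly one neighboring set); the function names, the grid encoding as a list of rows of Python sets, and the placeholder literal "z" in the corner cells are assumptions. The test configuration reproduces the literal subsets of Fig. 3 as read from the figure.

```python
def neighbor_sets(cells, i, j):
    """Literal subsets of the (up to four) cells adjacent to cell (i, j)."""
    rows, cols = len(cells), len(cells[0])
    around = ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
    return [cells[a][b] for a, b in around if 0 <= a < rows and 0 <= b < cols]


def apply_rule1(cells, i, j):
    """Return the reduced literal subset of cell (i, j); `cells` is left unchanged."""
    current = set(cells[i][j])
    around = neighbor_sets(cells, i, j)
    # Step 1: cancel literals that occur in no neighbor, keeping at least one literal.
    for x in sorted(current):
        if len(current) > 1 and not any(x in other for other in around):
            current.discard(x)
    # Step 2: for a neighbor subset strictly contained in the current one, cancel
    # the literals outside it that occur in exactly one neighbor.
    for subset in around:
        if len(current) > 1 and subset < current:
            for x in sorted(current - subset):
                if sum(x in other for other in around) == 1:
                    current.discard(x)
    return current


# Configuration of Fig. 3 embedded in a 3 x 3 grid; corners hold a placeholder.
FIG3 = [[{"z"}, {"c", "e"}, {"z"}],
        [{"a", "b"}, {"a", "b", "c", "d"}, {"d", "g"}],
        [{"z"}, {"d", "f"}, {"z"}]]
```

Here `apply_rule1(FIG3, 1, 1)` cancels c, which occurs only in vh, and keeps d, which occurs in two neighbors. A full scan would call the function on every cell and repeat until no subset changes.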

Before running an algorithm for solving Problem 2, the sets $L_i$ may be reduced using Rule 1 through a scanning of the lattice. This is a simple task, although a clever data structure must be devised for reducing the time required. Moreover several successive lattice scans may be applied for further reduction until no change occurs in a whole scan, although this is likely to produce far fewer cancellations than the first scan. These operations constitute the first phase of any algorithm. Then, as the problem is computationally intractable, a heuristic must be applied. The following algorithm MC-AREAS proposed here, and already experimented with in [7], builds each mc-area as a BFS tree, looking for a subset of adjacent vertices that share a same literal and assigning that literal to them. The lattice is treated as a tree.

MC-AREAS
1. start from a vertex $v_i$ and reduce the associated set of literals $L_i$ to just one of its elements $l_i$ chosen at random; $v_i$ will be the root of a tree $T_i$ under construction for the current mc-area;
2. traverse the lattice from $v_i$ in BFS form:
   for any vertex $v_j$ encountered such that $v_j$ does not belong to a BFS tree already built:
     if ($l_i \in L_j$) assign $l_i$ to $v_j$ and insert $v_j$ into $T_i$
     else insert $v_j$ into a queue Q;
   continue the traversal for $T_i$;
3. if (Q is not empty) extract a new vertex $v_i$ from Q and repeat steps 1 and 2 to build a new tree;
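The scheme MC-AREAS can be sketched in Python as follows. This is an interpretation rather than the authors' code: the traversal starts from the top-left cell, a tree grows only through cells that receive its literal, cells met without that literal are queued as candidate roots, and the random choice is made by a seeded generator so that runs are repeatable. The names `mc_areas`, `FIG5` and `FIG5_CELLS` are hypothetical; `FIG5` is the 5 × 6 single-literal lattice of Fig. 5 (Sect. 3) as read from the figure.

```python
import random
from collections import deque


def mc_areas(cells, rng=None):
    """Assign one literal per cell by growing BFS trees; return (literals, number of trees)."""
    rng = rng or random.Random(0)
    rows, cols = len(cells), len(cells[0])
    literal = [[None] * cols for _ in range(rows)]
    pending = deque([(0, 0)])  # queue Q of candidate roots
    queued = {(0, 0)}
    areas = 0
    while pending:
        ri, rj = pending.popleft()
        if literal[ri][rj] is not None:
            continue  # already absorbed by an earlier tree
        chosen = rng.choice(sorted(cells[ri][rj]))  # step 1: random literal of the root
        literal[ri][rj] = chosen
        areas += 1
        frontier = deque([(ri, rj)])
        while frontier:  # step 2: BFS restricted to cells containing `chosen`
            i, j = frontier.popleft()
            for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if not (0 <= a < rows and 0 <= b < cols) or literal[a][b] is not None:
                    continue
                if chosen in cells[a][b]:
                    literal[a][b] = chosen
                    frontier.append((a, b))
                elif (a, b) not in queued:
                    queued.add((a, b))
                    pending.append((a, b))
    return literal, areas


FIG5 = [[1, 1, 5, 5, 2, 2],
        [1, 4, 2, 1, 5, 5],
        [3, 3, 2, 1, 5, 7],
        [3, 3, 2, 4, 6, 6],
        [5, 6, 6, 3, 3, 7]]
FIG5_CELLS = [[{value} for value in row] for row in FIG5]
```

When every cell holds a single literal the trees coincide with the mc-areas, so `mc_areas(FIG5_CELLS)` returns 15 areas, in agreement with the value α = 15 quoted for this example in Sect. 3.2. With multiple choices the number of trees depends on the random draws.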

The experimental results discussed in the last section are derived following this approach. Better results could possibly be obtained with more sophisticated heuristics at the cost of a greater running time.

2.4 Hardness of Problem 2

To prove that Problem 2 is NP-hard we formulate it in graph form as done in [8] where the proof appeared. Let G = (V, E) be an undirected graph and let C be a set of k colors such that each vertex $v_i \in V$ is associated to (or “contains”) a non-void subset $C_i$ of C. The graph problem equivalent to ours, indicated as MPA for minimal partition (color) assignment, is the one of assigning to each vertex $v_i$ of G a single color from among the ones in $C_i$ such that the number σ of maximal connected subgraphs $G_1, \dots, G_\sigma$ of G whose vertices have the same color is minimal. Dealing with a lattice, as required in our Problem 2, it is sufficient to start with PMPA, that is the MPA problem where G is planar. We have:

Proposition 2. The PMPA Problem is NP-hard.

Proof. Reduction from planar graph 3-coloring. Let H be an arbitrary planar graph to be 3-colored with colors 1, 2, 3, and let G be a corresponding planar graph whose MPA must be built. The set of colors of G is {1, 2, 3, 4, 5, 6, 7, 8, 9}. For each edge e = (u, v) of H there are nine vertices $u, e, v, x_1, x_2, x_3, y_1, y_2, y_3$ in G connected as shown in Fig. 4, with subsets of colors: $C_u = \{1, 2, 3\}$, $C_v = \{1, 2, 3\}$, $C_e = \{4, 5, 6, 7, 8, 9\}$, $C_{x_1} = \{1, 6, 7, 8, 9\}$, $C_{x_2} = \{2, 4, 5, 8, 9\}$, $C_{x_3} = \{3, 4, 5, 6, 7\}$, $C_{y_1} = \{1, 4, 5, 6, 9\}$, $C_{y_2} = \{2, 5, 6, 7, 8\}$, $C_{y_3} = \{3, 4, 7, 8, 9\}$. Consider a minimal collection of monochromatic connected subgraphs $G_1, \dots, G_\sigma$ of G. We have: (1) the vertices u, v must belong to two distinct subgraphs $G_i$, $G_j$ and the vertex e cannot belong to $G_i$ or to $G_j$ because $C_e \cap C_u = \emptyset$ and $C_e \cap C_v = \emptyset$; (2) at most one of the vertices $x_1, x_2, x_3$ may belong to $G_i$ and at most one of the vertices $y_1, y_2, y_3$ may belong to $G_j$ due to their colors; (3) at most two of the vertices $x_1, x_2, x_3$ and at most two of the vertices $y_1, y_2, y_3$ may belong to the same subgraph of e, implying that the colors assigned to u and v must be different due to the colors of all the vertices involved. Letting H = (V, E), from the points 1, 2, and 3 we have σ ≥ |V| + |E| and equality is met if and only if different colors can be assigned to u and v, depending on the color constraint imposed by the other vertices to which u and v are adjacent in H. That is, H can be 3-colored if and only if MPA can be solved on G with σ = |V| + |E|. □

[Figure 4: the nine vertices u, x1, x2, x3, e, y1, y2, y3, v of the subgraph associated with one edge of H.]

Fig. 4. Portion of graph G corresponding to the edge e = (u, v) of graph H.

Starting from Proposition 2 we now prove that the result holds true even for planar grid-graphs (GMPA), as in the case of our lattice.

Proposition 3. The GMPA Problem is NP-hard.

Proof. We proceed in two steps. First we prove that the result holds true for planar graphs of bounded vertex degree d ≥ 3 by reduction from PMPA. Then we pass from bounded degree planar graphs to GMPA.

1. Reduction from PMPA, by insertion of new vertices to reduce all vertex degrees to at most 3. If edges (a, b), (a, c), (a, d), (a, e) and possibly other edges (a, x) exist, i.e. deg(a) > 3, insert a new vertex z with $C_z = C_a \cup C_b \cup C_c$, delete edges (a, b), (a, c), and insert new edges (a, z), (z, b), (z, c). Note that the degree of a decreases by 1, the degrees of b, c are unchanged, and z has degree 3. Continue until each vertex has degree ≤ 3. The solution for the new graph, i.e. the connected mono-colored subgraphs, coincides with a solution for PMPA if the new vertices z are deleted and the original edges are restored. Note that the graph resulting after the transformation is planar.
2. A result of Leslie G. Valiant (Theorem 2 of [9]) states that a planar graph G of n vertices with degree at most four admits a planar embedding in an O(n × n) grid Γ. Of the $O(n^2)$ cells of Γ, obviously only n are used in the embedding for the vertices of G, while many of the others are used for embedding the edges of G as non-intersecting sequences of cells in i, j directions. In [10] it was then shown that one such embedding can be built where all edges are just straight line segments.
3. Build the embedding on Γ, and extend the grid to a new grid Γ′ as follows. If two horizontal sequences of cells representing two edges of G lie in two rows i, i + 1 and part of these sequences share the same columns (i.e. the two sequences are partly adjacent), insert a new empty row between i and i + 1, using its cells where needed to fix vertical sequences possibly interrupted by the new row. Repeat the operation for any pair of partly adjacent sequences. Repeat the process on the columns, inserting new columns until no vertical sequences are partly adjacent. Note that the construction of Γ′ has been done in time and space polynomial in n.
4. If two adjacent vertices a, b of G are embedded in two non-adjacent cells of Γ′, assign the set of colors $C_a \cap C_b$ to the cells of the sequence representing the edge (a, b). Repeat for all pairs of adjacent vertices. Assign a new color c ∉ C to all the grid cells not corresponding to the vertices and to the edges of G.
5. Solve GMPA on Γ′ considering all the cells as vertices of a new larger graph. Discard the subsets of cells with color c, and in any other subset take only the cells corresponding to original vertices of G. These subsets constitute a solution for a bounded degree PMPA. □
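The role of the color subsets used in the proof of Proposition 2 can be checked mechanically. The Python sketch below only inspects the subsets listed in that proof; it assumes, since the figure is not reproduced here, that in Fig. 4 each $x_i$ is adjacent to u and e and each $y_i$ is adjacent to e and v. The function names are hypothetical. For every color of e it finds the single $x_i$ and the single $y_i$ that cannot take that color, and the colors they share with u and v.

```python
COLOR_SETS = {
    "u": {1, 2, 3}, "v": {1, 2, 3}, "e": {4, 5, 6, 7, 8, 9},
    "x1": {1, 6, 7, 8, 9}, "x2": {2, 4, 5, 8, 9}, "x3": {3, 4, 5, 6, 7},
    "y1": {1, 4, 5, 6, 9}, "y2": {2, 5, 6, 7, 8}, "y3": {3, 4, 7, 8, 9},
}


def left_out(color, side):
    """Indices i such that vertex side+i cannot take `color`, hence cannot join e."""
    return [i for i in (1, 2, 3) if color not in COLOR_SETS[f"{side}{i}"]]


def forced_endpoint_colors():
    """Map each color of e to the colors that the left-out x_i, y_i share with u, v."""
    forced = {}
    for color in sorted(COLOR_SETS["e"]):
        x_out, y_out = left_out(color, "x"), left_out(color, "y")
        if len(x_out) != 1 or len(y_out) != 1:
            raise ValueError("expected exactly one left-out vertex on each side")
        shared_u = COLOR_SETS["u"] & COLOR_SETS[f"x{x_out[0]}"]
        shared_v = COLOR_SETS["v"] & COLOR_SETS[f"y{y_out[0]}"]
        forced[color] = (min(shared_u), min(shared_v))
    return forced
```

Under the stated assumption, the six colors of e correspond one-to-one to the six ordered pairs of distinct colors for (u, v): color 4 gives (1, 2), color 5 gives (1, 3), and so on. This is consistent with point (3) of the proof, where covering the gadget with three subgraphs requires u and v to receive different colors.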

3 Solving Problem 3

After Problem 2 is solved, we have to choose how to connect the different mc-areas associated with the same literal and then connect them to the external input leads. To this end different layers are needed to attain all non-crossing connections. Formally we pose the following problem:

Problem 3. Find a minimum number of layers allowing to connect together all the mc-areas with the same literal, and to connect them to the input leads, obeying the assumptions 1 to 4 of Sect. 1. The solution has to be constructive, that is, the actual layout must be shown.

In order to better understand the problem let us discuss an example. We start from a lattice of N × M cells each associated to one of the 2n input literals. Figure 5 shows a 5 × 6 lattice with 7 literals indicated for simplicity with the numbers 1 to 7. In practical applications we usually have 2n < N × M, hence there are cells assigned to the same literal that must be connected together to be reached in parallel from outside. Recall that these literals may be grouped in the mc-areas deriving from the previous lattice rearrangement.

1 1 5 5 2 2
1 4 2 1 5 5
3 3 2 1 5 7
3 3 2 4 6 6
5 6 6 3 3 7

Fig. 5. Example of starting lattice.

The starting lattice constitutes layer 0 of the layout, above which the connections are to be built in successive layers. Suppose that the input leads to the circuit are in the top edge of the lattice. In layer 1 the only connections we can lay out are those of the mc-areas, and the mc-areas with cells in the top row may also be connected outside, see Fig. 6(a). Note that in layer 1 there is no room for other connections, hence a new layer 2 must be added. For each of the mc-areas connected in layer 1 and holding a literal that appears also in other mc-areas, or not yet connected to the outside, a single cell is connected through a via to the next layer. The final number of layers may be affected by the specific selection of these cells; however, in our heuristic we do not consider this point and choose the positions of the vias arbitrarily. In the layers of Fig. 6 the underlined literals indicate where the corresponding vias start.

[Figure 6: four panels (a) to (d) showing the connections laid out in layers 1 to 4 of the lattice of Fig. 5.]

Fig. 6. (a) Layer 1 with the first connections. (b) Layer 2. (c) Layer 3. (d) Layer 4. Arrows indicate the connections to the input leads on the top edge. Underlined literals indicate the starting cells of the vias.

A possible implementation of the second layer is shown in Fig. 6(b). Each surviving mc-area is now represented by the arrival of a via. Recall that connections cannot cross, hence in general not all areas can be connected. Note that areas already connected to the top edge do not need to be connected to this edge again, even though they are not completely connected among each other. This is the case of the area associated with literal 1, while areas associated with literals 3, 6, and 7 have still to be connected to the top edge. Note also that all literals with label 4 can be connected in a single area and outside, therefore there is no need to connect this area to the next layer. The next layer 3 is depicted in Fig. 6(c) where one of the two areas of literal 3 can be connected to the top edge, but these two areas are still to be connected to one another. In this layer the connections for literals 2 and 5 are completed, while vias for literals 3, 6, and 7 are arbitrarily chosen. In the last layer 4 the layout is completed as shown in Fig. 6(d).

3.1 Impossible Instances

1 2 3
2 3 1
3 1 2

Fig. 7. A non solvable instance.

To address Problem 3 formally we must start with a crucial observation, namely that not all instances of it are solvable, no matter how many layers are used. As an introduction consider the lattice of Fig. 7 where each row contains a cyclic shift of the literals in the previous row and all these shifts are different. It is easy to see that no cells with the same literal can be connected together, regardless of any column/row permutation. We now show that the vast majority of problem instances are theoretically solvable, although some may require an exceedingly high number of layers to be practically solved. We have:

Proposition 4. A problem instance cannot be solved if and only if in the initial literal assignment no two adjacent cells share a same literal after any column/row permutation (Problem 1) and, no matter how the multiple assignments are resolved (Problem 2), each cell in row zero contains a literal that occurs also in another cell.


Proof. If part. Assume that no two adjacent cells share a same literal after any solution of Problems 1 and 2. Although the cells of row zero can be connected to the output, there will be no way to connect them to the other cells with the same literal since all cells will be occupied by a via in all layers.

Only-if part. If at least one of the conditions stated in the proposition does not hold, at least one cell in layer 2 is made available (or “free”) for routing due to an area built in layer 1, or to a connection to the output in that layer. Once a free cell arises, it can be “moved” to any cell of the array by consecutive movements of adjacent literals as in the well known 15-slide game, and any literal adjacent to the free cell can similarly be moved around to be brought adjacent to a cell with the same literal. Proceeding with this strategy all cells with a same literal can be linked together and brought to the output. □

Note that the strategy indicated in the only-if part of the above proof may require a very large number of layers if only a small number ν of free cells exists in a layer, as only ν movements can be done in that layer. In particular, if only a few cells are made free by the solution of Problems 1 and 2, i.e. if layer 1 contains a large number of small mc-areas, the routing mechanism could require so many layers as not to be applicable in practice. A decision on whether or not to build such a layout must be taken after simulation on significant examples.

3.2 Hardness of Problem 3

In the solution of Problem 3, the cells containing the same literal in any layer are connected, in the best case, as trees (and not as general subgraphs) to minimize the occupation of free cells, see next Sect. 3.3. The problem of minimizing the number of layers is related to the one of building the maximum number of such trees in any layer whose edges do not intersect. If a 15-slide movement of free cells is required the problem is NP-hard [11]. If such movements are not required the problem has strong similarities with other known NP-hard problems dealing with grid embedding of graphs, as for example determining the Steiner tree among k vertices on a grid [12], or determining the rectilinear crossing number of a graph [13], etc. We have not been able to prove that Problem 3 is NP-hard, and leave it as a challenging open problem. For its solution we rely on a heuristic algorithm that produces satisfying results on a large class of benchmark instances as shown in the last section. If no tree can be directly built in a layer, as discussed in the previous subsection, the heuristic stops declaring that routing is impossible. Otherwise we have:

Proposition 5. Let α be the number of mc-areas generated by Problems 1 and 2, h be the number of literals involved in the lattice, and k be the number of literals appearing in the cells of the top edge. An upper and a lower bound to the number H of layers are given by $\alpha + \lceil (h - k)/M \rceil$ and $\lceil h/M \rceil$, respectively.

Proof. Upper bound. First note that in layer 1 all the cells of any mc-area are connected together, and in each of the successive layers at least one pair of cells having a via from the previous layer, and holding the same literal, are connected. Then at most α layers are needed to connect all the cells holding the same literal. In addition all the h literals must be connected to the corresponding input leads on the top edge of the lattice. For the k literals already in this edge the connections can take place in layer 1. The remaining h − k literals, in the worst case, may be brought to a further layer α + 1 by vias and connected to the input leads in $\lceil (h - k)/M \rceil$ layers.

Lower bound. Observe that h external leads must be displayed on the top edge of the lattice, possibly in different layers, and this edge contains M cells. □

In the example of Fig. 6 we have α = 15 (there are 15 mc-areas in layer 0), h = 7, k = 3, and M = 6. The proposed layout with H = 4 layers is far from approaching the upper bound $15 + \lceil 4/6 \rceil = 16$, while it is closer to the lower bound $\lceil 7/6 \rceil = 2$.

3.3 Heuristics for Problem 3

Let us now discuss possible greedy heuristics to solve Problem 3 in a reasonable amount of time. Apart from the lattices that cannot be solved according to the conditions stated in Proposition 4, other cases may require an exceedingly large number of layers if the “15-slide game” moves are required, as indicated in the proof of Proposition 4. We do not accept such moves, and treat a lattice requiring them as unsolvable. This limitation is of minor importance in practice, since our algorithms very rarely failed to find a layout for a theoretically solvable lattice over a very large number of cases; see Sect. 4. The general structure of our heuristics consists of the following two main steps:

1. Starting with layer 0 resulting from the re-arrangement of the lattice done in Problems 1 and 2, build the connecting trees for all mc-areas in layer 1, and connect the literals of the top side to the external leads.
2. While there are trees with the same literal still to be connected to each other and/or to the outside:
(a) place a via on a cell, chosen at random, of each such tree;
(b) add a new layer to receive the vias;
(c) try to connect together as many vias as possible associated with the same literal;
(d) try to connect each group of cells containing a literal to the corresponding external lead, if not already done.

Step 1 can be implemented in a standard way, in a lattice traversal. Note that this initial step is optimal, i.e., no algorithm for the minimization of the number of layers can do better on the first layer. To implement the second and main step of the heuristics we introduce the concept of free area, where a connection between cells with the same literal can be laid out, namely:

Definition 5. In the layers from 2 on, a free cell is one not containing a via; a free area is a maximal connected subset of free cells; the boundary cells of a free area are the ones surrounding it.

For example, layer 2 in Fig. 6(b) contains seven free areas, as shown in Fig. 8. Using proper coding and data structures, free areas and their boundary cells can be easily computed through a scan of the lattice in optimal time O(N × M).

Fig. 8. Layer 2 of Fig. 6(b) with the seven free areas shown in grey.
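
A minimal sketch of how free areas and their boundary cells (Definition 5) might be computed in a single scan is given below. It is not the authors' C implementation: the function name `free_areas`, the boolean grid `has_via`, and the choice of 4-neighbour adjacency are assumptions made only for illustration. Each cell is enqueued at most once and inspects four neighbours, which is consistent with the O(N × M) bound stated above.

```python
from collections import deque

def free_areas(has_via):
    """Label maximal 4-connected groups of free cells (cells without a via).

    has_via is an N x M list of booleans (hypothetical input format).
    Returns a list of (cells, boundary) pairs, one per free area, where
    boundary is the set of via cells adjacent to at least one cell of the area.
    """
    n, m = len(has_via), len(has_via[0])
    seen = [[False] * m for _ in range(n)]
    areas = []
    for r in range(n):
        for c in range(m):
            if has_via[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited free cell.
            cells, boundary = [], set()
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < n and 0 <= nx < m):
                        continue
                    if has_via[ny][nx]:
                        boundary.add((ny, nx))   # a via facing this area
                    elif not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            areas.append((cells, boundary))
    return areas
```

Grouping the boundary cells of each area by literal then gives the candidate sets of cells that could be connected inside that area.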

Proposition 6. In any given layer from 2 on, cells holding the same literal can be connected together through free cells if and only if they are boundary cells of the same free area.

Proof. If part. Any subset of cells holding the same literal and bounding the same free area can be connected by a tree of connections laid out inside the area. Only-if part. Cells not bounding any free area are completely surrounded by vias holding different literals and cannot be connected to any other cell in that layer. For cells holding the same literal but not bounding the same area, any connecting path would inevitably meet another via. □

Some considerations are in order. If a set of cells holding the same literal and facing the same free area are connected using free cells inside the area, other boundary cells of the same area can be connected only if the required connections do not cross those already laid out in that area. If two cells holding the same literal bound different areas, they can still be connected in the layer through a path that meets only vias with their same literal, if any. If these conditions are not met, the cells must be connected in a subsequent layer. In the example of Fig. 8 four free areas can be used to connect the two 1’s in cells (1,2), (2,4); the two 5’s in cells (1,4), (2,5); the two 7’s in cells (3,6), (5,6); and the two 4’s in cells (2,2), (4,4), the latter also connected to the external lead, as already shown in Fig. 6(b). The two 5’s in cells (1,4) and (5,1) cannot be
connected in this layer since they do not bound the same free area. However, if literal 2 in cell (3,3) were a 5, that connection would have been possible by passing through cell (3,3), thus merging connections laid out in two different free areas. The still missing connections among different 5’s, 2’s, 6’s, and 3’s are made in the free areas of layers 3 and 4, as shown in Fig. 6. On these grounds we execute the crucial points (c) and (d) in step 2 of our heuristic by first computing all the free areas in the layer, and then trying to connect all boundary cells assigned to the same literal and facing the same free area. A relevant feature of this process is that free areas are mutually disjoint, so the searches for connections can be performed in parallel by creating a thread for each free area. The only portions of the layer shared by multiple threads are the boundary cells facing different free areas, which can be managed through lock variables that force the threads to access those cells in mutual exclusion. The cells to be connected are treated pairwise; however, a subset of more than two cells holding the same literal may bound a free area, and as many of them as possible must be connected in tree form. Therefore, as soon as two of them are connected, the pair is treated as a single cell, identified as the central cell of the connecting path and holding the literal of the two, and the process continues looking for other cells to be connected to them. Clearly a connecting path cuts the free area in two parts, and other cells facing this same area may become unreachable from one another. We could solve this issue by recomputing the free areas after each new connection, but this approach is computationally very heavy. Therefore, we compute free areas only once in each layer, and then apply a proper non-exhaustive search algorithm to limit the search for non-existing connections, still guaranteeing that mutually reachable cells will be connected with high probability.
Let us now briefly discuss the possible implementations of this search within each free area. The main point is considering a boundary cell c1 that must be connected to a target cell c2 through a path of distinct free cells in a free area. This can be formalized as a state space search, where the state space is of size O(4^(N×M)), as the number of cells in the area is O(N × M) and there are at most four possible moves from each cell. As a search in this space would be prohibitively expensive, we use heuristics to find solutions of high quality as quickly as possible. We have tested several heuristics proposed in the literature; for space reasons the results reported in Sect. 4 are limited to Best-first and Greedy-beam (see [14,15]), which select the next cell to visit according to an estimate of the Manhattan distance from the target cell. The first heuristic provides better results, but its time complexity O(4^(N×M)) is very high and it can be applied only to small lattices. The time complexity of Greedy-beam is instead linear in the lattice size, but it produces lower-quality results. In fact it may fail to connect some mutually reachable cells on a given layer, so the final layout may contain a high number of layers. Depending on the lattice size and on the specific application, we can therefore select one of the two heuristics (or, for that matter, other known ones), trading quality of results against scalability.
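
To make the role of the Manhattan-distance estimate concrete, the following is a minimal sketch of a greedy best-first search between two boundary cells through free cells. It is an illustration under stated assumptions, not the implementation evaluated in Sect. 4: the function name `best_first_path` and the boolean grid `free` are hypothetical, and this simplified variant keeps a set of already discovered cells, so it runs in polynomial time and does not reproduce the state-space formulation described above.

```python
import heapq

def best_first_path(free, start, target):
    """Greedy best-first search from start to target through free cells.

    free is an N x M list of booleans (True = cell usable for routing);
    start and target are (row, col) boundary cells and need not be free.
    Cells are expanded in order of Manhattan distance to target.
    Returns the path as a list of cells, or None if target is unreachable.
    """
    n, m = len(free), len(free[0])

    def h(cell):
        # Manhattan distance from cell to the target.
        return abs(cell[0] - target[0]) + abs(cell[1] - target[1])

    parent = {start: None}            # also serves as the "discovered" set
    frontier = [(h(start), start)]
    while frontier:
        _, cell = heapq.heappop(frontier)
        if cell == target:
            path = []
            while cell is not None:   # walk back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        y, x = cell
        for nxt in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            ny, nx = nxt
            if not (0 <= ny < n and 0 <= nx < m) or nxt in parent:
                continue
            if free[ny][nx] or nxt == target:
                parent[nxt] = cell
                heapq.heappush(frontier, (h(nxt), nxt))
    return None
```

A beam variant would simply truncate the frontier to a fixed number of best candidates after each expansion, which bounds the work per step at the cost of possibly missing an existing path.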

4 Experimental Results

In this section we report the experimental results related to the physical implementation of switching lattices N × M according to assumptions 1 to 4 reported in Sect. 1. The physical implementation of a lattice is a 3-dimensional grid N × M × (H + 1), where H is the number of layers needed to route all the connections among switches controlled by the same input literal. The aim of our experimentation is to determine whether the proposed implementation can be considered technologically feasible. To this end we have considered the lattices obtained by applying the Altun-Riedel method to the benchmark functions taken from LGSynth93 [16], where each output has been treated as a separate Boolean function. For space reasons, in Table 1 we report only a significant subset of these functions as representative indicators of our experiments.1 The experiments have been run on an Intel Core i7-4710HQ 2.50 GHz CPU with 8 GB of main memory, running Linux Ubuntu 17.10. The algorithms have been implemented in C according to the lines indicated in the paper for the solution of Problems 1, 2 and 3.

1 Experimental results on a much larger set of benchmark functions may be requested from the authors.

Table 1 is organized as follows. The first column reports the name and the number of the separate output function of the benchmark circuit. The following two columns report the number of different literals occurring in the lattice and its dimension N × M. The last four columns report the number H of layers computed with the Best-first and the Greedy-beam heuristics, together with the corresponding running time. The last row reports the sum of the values of the corresponding column. The cases where the algorithm failed to find a layout (see Sect. 3.3) are marked with a hyphen.

Table 1. Number of layers for the lattice layout of a subset of standard benchmark circuits, built along the lines indicated in Problems 1, 2 and 3.

                                Best-first         Greedy-beam
Bench        lit   N×M          H     Time(s)      H     Time(s)
add6(5)      24    156×156      7     733.21       8     0.43
adr4(1)      16    36×36        6     0.19         8     0.02
alu2(2)      16    10×11        4     0.01         4     0.01
alu2(5)      20    13×14        4     0.01         4     0.01
alu3(0)      8     4×5          3     0.01         3     0.01
alu3(1)      12    7×8          4     0.01         4     0.01
b12(0)       7     6×4          4     0.01         4     0.01
b12(1)       9     5×7          4     0.01         4     0.01
b12(2)       10    6×7          3     0.01         3     0.01
bcc(5)       28    27×9         10    0.02         9     0.01
bcc(7)       29    31×11        10    0.03         11    0.01
bcc(8)       29    31×12        10    0.04         9     0.01
bcc(27)      28    39×19        10    0.13         12    0.02
bcc(43)      28    20×10        6     0.02         6     0.01
bench1(2)    18    45×24        10    0.26         11    0.04
bench1(3)    18    31×16        8     0.03         9     0.01
bench1(5)    18    50×27        9     0.31         9     0.04
bench1(6)    18    35×21        9     0.09         12    0.03
bench1(7)    18    43×21        9     0.12         12    0.02
bench1(8)    18    44×24        9     0.19         10    0.04
bench(6)     10    8×4          5     0.01         5     0.01
br2(4)       18    18×8         6     0.01         6     0.01
br2(5)       19    14×4         6     0.01         6     0.01
br2(6)       19    16×5         6     0.01         6     0.01
clpl(3)      11    6×6          3     0.01         3     0.01
clpl(4)      9     5×5          3     0.01         3     0.01
co14(0)      28    92×14        11    0.29         12    0.04
dc1(4)       7     5×4          -     0.01         -     0.01
dc2(4)       11    10×9         5     0.01         5     0.01
dc2(5)       9     6×6          4     0.01         4     0.01
dk17(1)      10    8×2          6     0.01         6     0.01
dk17(3)      11    11×3         -     0.01         -     0.01
dk17(4)      12    9×3          5     0.01         5     0.01
ex1010(0)    20    91×46        11    8.42         13    0.24
ex4(4)       13    17×6         7     0.01         7     0.01
ex4(5)       27    35×45        7     0.51         8     0.02
ex5(32)      14    4×10         3     0.01         3     0.01
ex5(36)      11    2×8          2     0.01         2     0.01
ex5(38)      13    4×9          3     0.01         3     0.01
ex5(40)      15    6×12         5     0.01         5     0.01
ex5(43)      15    8×14         6     0.01         6     0.01
exam(5)      13    11×6         4     0.01         4     0.01
exam(9)      20    59×30        9     0.75         12    0.04
max128(5)    14    14×17        6     0.01         6     0.01
max128(8)    13    5×10         5     0.01         5     0.01
max128(17)   14    26×25        7     0.06         8     0.01
max1024(5)   20    117×122      10    191.33       14    0.89
mp2d(6)      14    10×6         5     0.01         5     0.01
mp2d(9)      14    6×8          3     0.01         3     0.01
mp2d(10)     10    6×3          4     0.01         4     0.01
sym10(0)     20    130×210      9     1571.93      10    2.14
tial(5)      28    181×181      10    1491.45      12    1.19
z4(0)        7     15×15        5     0.01         6     0.01
z4(1)        14    28×28        7     0.08         7     0.01
Z5xp1(2)     14    12×11        4     0.01         4     0.01
Z5xp1(3)     14    18×18        5     0.01         7     0.01
Total                           336   3999.8       367   5.61

As expected, we have obtained layouts with a smaller number of layers using the Best-first heuristic, at the expense of a higher computation time. However, we note that the increase in the number of layers computed with the faster Greedy-beam heuristic appears quite limited. Finally, these simulations have shown the effectiveness of the heuristic for Problem 1: indeed, the number of layers computed by applying only the heuristics for Problems 2 and 3, without first permuting rows and columns, increases on average by about 35% using Best-first, and by about 43% using Greedy-beam. Moreover, by running both the heuristic for Problem 1 and the algorithm MCAREAS for Problem 2, we have obtained a considerable reduction of the number of layers compared with the results published in [6].
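
As a rough check of the claim that the layer increase with Greedy-beam is limited, the column totals reported in the last row of Table 1 can be compared directly. The short script below is only a back-of-the-envelope calculation on those four totals; the variable names are ours, and the time totals are dominated by the few largest lattices, so the ratio should be read as indicative only.

```python
# Column totals from the last row of Table 1.
bf_layers, bf_time = 336, 3999.8   # Best-first: total layers, total seconds
gb_layers, gb_time = 367, 5.61     # Greedy-beam: total layers, total seconds

# Relative increase in total layers when using Greedy-beam.
extra_layers = (gb_layers - bf_layers) / bf_layers
# Ratio between the total running times of the two heuristics.
speedup = bf_time / gb_time

print(f"Greedy-beam uses {extra_layers:.1%} more layers in total")
print(f"Greedy-beam is about {speedup:.0f}x faster in total")
```

On these totals, Greedy-beam needs roughly 9% more layers overall while running about 700 times faster.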

5 Concluding Remarks

We have presented the first study on connection layout for two-dimensional switching lattices referring to the network implementation proposed by Altun and Riedel [4]. We have shown how to build a stack of consecutive layers where the connections between switches driven by the same variable can be laid without crossings, with the aim of minimizing the number H of layers. Since the problem is computationally intractable, we have designed a family of heuristics for finding satisfactory solutions, and applied them to a very large set of standard Boolean functions to validate our approach. For space reasons we have presented only the results obtained with the fastest and the slowest heuristics, and only for a subset of the functions analyzed, taken as representative of the work done. The overall design consists of three main phases, studied as Problems 1, 2 and 3. The first two are aimed at rearranging the switch positions and the literal assignment of the starting lattice, in order to place in adjacent cells as many switches controlled by the same literal as possible. The third phase then builds the actual connections on the different layers of the chip. Many improvements are possible. While the NP-hardness of Problem 2 has been proved, for theoretical completeness the NP-hardness of Problems 1 and 3 also has to be proved to fully justify the use of heuristics. Better algorithms could be studied and tested on larger data samples. The layout for other switching lattices should be considered. The layout rules could also be changed, in particular by allowing more than one wire to traverse a switch area in the higher layers. We are presently working on all these issues.

References

1. Micheli, G.D.: Synthesis and Optimization of Switching Theory. McGraw-Hill, New York (1994)
2. Akers, S.B.: A rectangular logic array. IEEE Trans. Comput. 21(8), 848–857 (1972)
3. Altun, M., Riedel, M.D.: Lattice-based computation of Boolean functions. In: Proceedings of the 47th Design Automation Conference, DAC 2010, pp. 609–612, Anaheim, California, USA, 13–18 July 2010
4. Altun, M., Riedel, M.D.: Logic synthesis for switching lattices. IEEE Trans. Comput. 61(11), 1588–1600 (2012)
5. Gange, G., Søndergaard, H., Stuckey, P.J.: Synthesizing optimal switching lattices. ACM Trans. Des. Autom. Electron. Syst. 20(1), 6:1–6:14 (2014)
6. Bernasconi, A., Boffa, A., Luccio, F., Pagli, L.: Two combinatorial problems on the layout of switching lattices. In: IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC) (2018)
7. Bernasconi, A., Luccio, F., Pagli, L., Rucci, D.: Literal selection in switching lattice design. In: Proceedings of the 13th International Workshop on Boolean Problems (IWSBP 2018), pp. 205–220 (2018)
8. Luccio, F., Xia, M.: The MPA graph problem: definition and basic properties. Technical report, University of Pisa, Department of Informatics (2018)
9. Valiant, L.G.: Universality considerations in VLSI circuits. IEEE Trans. Comput. 30(2), 135–140 (1981)
10. de Fraysseix, H., Pach, J., Pollack, R.: Small sets supporting Fáry embeddings of planar graphs. In: Proceedings of the 20th Annual ACM Symposium on Theory of Computing, pp. 426–433, Chicago, Illinois, USA, 2–4 May 1988
11. Ratner, D., Warmuth, M.K.: Finding a shortest solution for the N × N extension of the 15-puzzle is intractable. In: Proceedings of the 5th National Conference on Artificial Intelligence, Volume 1: Science, pp. 168–172, Philadelphia, PA, 11–15 August 1986
12. Chu, C.C.N., Wong, Y.: FLUTE: fast lookup table based rectilinear Steiner minimal tree algorithm for VLSI design. IEEE Trans. CAD Integr. Circ. Syst. 27(1), 70–83 (2008)
13. Fox, J., Pach, J., Suk, A.: Approximating the rectilinear crossing number. In: Hu, Y., Nöllenburg, M. (eds.) GD 2016. LNCS, vol. 9801, pp. 413–426. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-50106-2_32
14. Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 4(2), 100–107 (1968)
15. Russell, S.J., Norvig, P.: Artificial Intelligence - A Modern Approach. Prentice Hall Series in Artificial Intelligence, 2nd edn. Prentice Hall, Upper Saddle River (2003)
16. Yang, S.: Logic synthesis and optimization benchmarks user guide version 3.0. User guide, Microelectronic Center (1991)

Building High-Performance, Easy-to-Use Polymorphic Parallel Memories with HLS

L. Stornaiuolo1, M. Rabozzi1, M. D. Santambrogio1, D. Sciuto1, C. B. Ciobanu2,3, G. Stramondo3, and A. L. Varbanescu3

1 Politecnico di Milano, Milan, Italy
{luca.stornaiuolo,marco.rabozzi,marco.santambrogio,donatella.sciuto}@polimi.it
2 Technische Universiteit Delft, Delft, The Netherlands
[email protected]
3 Universiteit van Amsterdam, Amsterdam, The Netherlands
{c.b.ciobanu,g.stramondo,a.l.varbanescu}@uva.nl

Abstract. With the increased interest in energy efficiency, many application domains experiment with Field Programmable Gate Arrays (FPGAs), which promise customized hardware accelerators with high performance and low power consumption. These experiments are possible due to the development of High-Level Languages (HLLs) for FPGAs, which permit non-experts in hardware design languages (HDLs) to program reconfigurable hardware for general purpose computing. However, some of the expert knowledge remains difficult to integrate in HLLs, eventually leading to performance loss for HLL-based applications. One example of such a missing feature is the efficient exploitation of the local memories on FPGAs. A solution to address this challenge is PolyMem, an easy-to-use polymorphic parallel memory that uses BRAMs. In this work, we present HLS-PolyMem, the first complete implementation and in-depth evaluation of PolyMem optimized for the Xilinx Design Suite. Our evaluation demonstrates that HLS-PolyMem is a viable alternative to HLS memory partitioning, the current approach for memory parallelism in Vivado HLS. Specifically, we show that PolyMem offers the same performance as HLS partitioning for simple access patterns, and outperforms partitioning by as much as 13x when combining multiple access patterns for the same data structure. We further demonstrate the use of PolyMem for two different case studies, highlighting the superior capabilities of HLS-PolyMem in terms of performance, resource utilization, flexibility, and usability. Based on all the evidence provided in this work, we conclude that HLS-PolyMem enables the efficient use of BRAMs as parallel memories, without compromising the HLS level or the achievable performance.

Keywords: Polymorphic Parallel Memory · High-Level Synthesis · FPGA

© IFIP International Federation for Information Processing 2019. Published by Springer Nature Switzerland AG 2019. N. Bombieri et al. (Eds.): VLSI-SoC 2018, IFIP AICT 561, pp. 53–78, 2019. https://doi.org/10.1007/978-3-030-23425-6_4

1 Introduction

The success of High-Level Languages (HLLs) for non-traditional computing systems, like Field Programmable Gate Arrays (FPGAs), has accelerated the adoption of these platforms for general purpose computing. In particular, the main hardware vendors released tools and frameworks to support their products by allowing the design of optimized kernels using HLLs. This is the case, for example, for Xilinx, which allows using C++ or OpenCL within the Vivado Design Suite [1] to target FPGAs. Moreover, FPGAs are increasingly used for data-intensive applications, because they enable users to create custom hardware accelerators, and achieve high-performance implementations with low power consumption. Combining this trend with the fast-paced development of HLLs, more and more users and applications aim to experiment with FPGA accelerators. In the effort of providing HLL tools for FPGA design, some of the features used by hardware design experts are difficult to integrate transparently. One such feature is the efficient use of BRAMs, the FPGA distributed, high-bandwidth, on-chip memories [2]. BRAMs can provide memory-system parallelism, but their use remains challenging due to the many different ways in which data can be partitioned in order to achieve efficient parallel data accesses. Typical HLL solutions offer easy-to-use mechanisms for basic data partitioning. These mechanisms work well for simple data access patterns, but can significantly limit the patterns for which parallelism (and thus, increased performance) can be achieved. Changing data access patterns on the application side is the current state-of-the-art approach: by matching the application patterns with the simplistic partitioning models of the HLL, one can achieve parallel operations and reduce the kernel execution time.
However, even when it is possible, this transformation requires extensive modification of the application code, which is cumbersome and error-prone to the point of canceling the productivity benefits of HLLs. To address the challenges related to the design and practical use of parallel memory systems for FPGA-based applications, PolyMem, a Polymorphic Parallel Memory, was proposed [3]. PolyMem is envisioned as a high-bandwidth, two-dimensional (2D) memory used to cache performance-critical data on the FPGA chip, making use of the distributed memory banks (the BRAMs). PolyMem is inspired by the Polymorphic Register File (PRF) [4], a runtime customizable register file for Single Instruction, Multiple Data (SIMD) co-processors. PolyMem is suitable for FPGA accelerators requiring high bandwidth, even if they do not implement full-blown SIMD co-processors on the reconfigurable fabric. The first hardware implementation of the Polymorphic Register File was designed in System Verilog [5]. MAX-PolyMem is the first prototype of PolyMem written entirely in MaxJ, and targeted at Maxeler Data Flow Engines (DFEs) [3,6]. Our new HLS-PolyMem is an alternative HLL solution, proven to be easily integrated with the Xilinx toolchains. The current work is an extension of our previous implementation presented in [7]. Figure 1 depicts the architecture of a system using (HLS-)PolyMem. The FPGA board (with a high-capacity DRAM memory) is connected to the host CPU through a PCI Express link. PolyMem acts as a high-bandwidth, 2D parallel software cache, able to feed an on-chip application kernel with multiple data elements every clock cycle. The focus of this work is to provide an efficient implementation of PolyMem in Vivado HLS, and employ it to maximize memory-access parallelism by exploiting BRAMs; we empirically demonstrate the gains we get from PolyMem by comparison against the partitioning of BRAMs, as provided by Xilinx tools, for three case-studies.

Fig. 1. System organization using PolyMem as a parallel cache.

In this work, we provide empirical evidence that HLS-PolyMem provides significant improvements in terms of both performance and usability when compared with the current memory partitioning approach present in Vivado HLS. To this end, we highlight the following novel aspects of this work:

• We provide a new, complete, open-source implementation [45] of PolyMem for Vivado HLS. This new implementation contains all the memory access schemes supported by the original PRF, as well as its multiview feature. Our implementation can be easily integrated within the Xilinx Hardware-Software Co-Design Workflow;
• We present a basic, high-level PolyMem interface (i.e., a rudimentary API for using PolyMem). The API includes basic parallel read and write operations. Furthermore, we extended the API to support masked writes, which avoid unwanted overwrites and further reduce latency. This is useful, for example, when PolyMem supports wide parallel access (e.g., 8 elements), but the user needs to store less data (e.g., 5 elements) and wants to avoid overwriting existing data (e.g., the remaining 3 elements). We demonstrate the use of the API in all the applications discussed in this paper (synthetic and real-life examples alike);
• We design and prototype a synthetic, parameterized microbenchmarking framework to thoroughly evaluate the performance of HLS-PolyMem. Our microbenchmarking strategy is based on chains of operations using one or several parallel access patterns, thus stressing both the performance and flexibility of the proposed parallel memory. The framework is extended to enable the comparison against existing HLS memory partitioning schemes. Finally, we show how to use these microbenchmarks to provide an extensive analysis of HLS-PolyMem’s performance;
• We design, implement, and analyze in detail two case-study applications which demonstrate the ability of our HLS-PolyMem to cope with real applications and data, multiple memory access patterns, and tiling. Our experiments for these case-studies focus on performance, resource utilization, and productivity, and contrast our HLS-PolyMem with standard memory partitioning techniques.

Our results, collected for both synthetic and real-life case-studies, thoroughly demonstrate that HLS-PolyMem outperforms traditional HLS partitioning schemes in performance and usability. We therefore conclude that our HLS-PolyMem is the first approach that enables HLS programmers to use BRAMs to construct flexible, multiview parallel memories, which can still be easily embedded in the traditional HLS modus operandi.

The remainder of this paper is organized as follows. Section 2 provides an introduction to parallel memories, and discusses the two alternative implementations presented in this work: the PRF-inspired PolyMem and the HLS partitioning schemes. Section 3 presents the HLS-PolyMem class for Vivado, together with the proposed optimizations. In Sect. 4 we present our microbenchmarking framework, as well as our in-depth evaluation using this synthetic workload. Section 5 describes our experience with designing, implementing, and evaluating the two case studies. Section 6 highlights relevant related work and, finally, our conclusion and future work directions are discussed in Sect. 7.

2 Parallel Memories: Challenges and Solutions

2.1 Parallel Memories

Definition 1 (Parallel Memory). A Parallel Memory (PM) is a memory that enables the access to multiple data elements in parallel.

A parallel memory can be realized by combining a set of independent memories, referred to as banks or lanes. The width of the parallel memory, i.e., the number of banks used in the implementation, represents the maximum number of elements that can be read in parallel. The capacity of the parallel memory refers to the amount of data that it can store. A specific element contained in a PM is identified by its location, a combination of a memory module identifier (to specify which one of the sequential memories hosts the data) and an in-memory address (to specify where within that memory the element is stored). Depending on how the information is stored and/or retrieved from the memory, we distinguish three types of parallel memories: redundant, non-redundant, and hybrid.

Redundant PMs. The simplest implementation of a PM is a fully redundant one, where all M sequential memory blocks contain fully replicated information. The benefit of such a memory is that it allows an application to access any combination of M data elements in parallel. However, such a solution has two
major drawbacks: first, the total capacity of a redundant PM is M times lower than the combined capacities of all its banks, and, second, parallel writes are very expensive, because information consistency must be maintained across all copies. To use such a memory, the application requires minimal changes, and the architecture is relatively simple to manage.

Non-redundant PMs. Non-redundant PMs completely avoid data duplication: each data item is stored in only one of the M banks. The one-to-one mapping between the coordinate of an element in the application space and a memory location is part of the memory configuration. These memories can use the full capacity of all the memory resources available, and data consistency is guaranteed by avoiding data replication, making parallel writes feasible as well. The main drawback of non-redundant parallel memories is that they require additional logic, compared to redundant memories, to perform the mapping, and they restrict the possible parallel accesses: if two elements are stored in the same bank, they cannot be accessed in parallel. There are two major approaches used to implement non-redundant PMs: (1) use a set of predefined mapping functions that enable parallel accesses in a set of predefined shapes [4,8–10], or (2) derive an application-specific mapping function [11,12]. For the first approach, the application requires additional analysis and potential changes, while the architecture is relatively fixed. For the second approach, however, a new memory architecture needs to be implemented for every application, potentially a more challenging task when the parallel memory is to be implemented in hardware.

Hybrid PMs. Besides the two extremes discussed above, there are also hybrid implementations of parallel memories, which combine the advantages of the two previous approaches by using partial data redundancy [13]. Of course, in this case, the challenge is to determine which data should be replicated and where. In turn, this solution requires both application and architecture customization.

2.2 The Polymorphic Register File and PolyMem

A PRF is a parameterizable register file, which can be logically reorganized by the programmer or a runtime system to support multiple register dimensions and sizes simultaneously [4]. The simultaneous support for multiple conflict-free access patterns, called multiview, is crucial, providing flexibility and improved performance for target applications. The polymorphism aspect refers to the support for adjusting the sizes and shapes of the registers at runtime. Table 1 presents the PRF multiview schemes (ReRo, ReCo, RoCo and ReTr), each supporting a combination of at least two conflict-free access patterns. A scheme is used to store data within the memory banks of the PRF, such that it allows different parallel access types. The different access types refer to the actual data elements that can be accessed in parallel. PolyMem reuses the PRF conflict-free parallel storage techniques and patterns, as well as the polymorphism idea. Figure 2(a) illustrates the access patterns supported by the PRF and PolyMem.


Fig. 2. PRF [4] design. The inputs are the matrix indexes (i, j) pointing to the first cell of the block of data the user wants to read/write in parallel, and the AccessType to select the shape of the parallel access.

Table 1. The PRF memory access schemes

PRF scheme   Available access types
ReO          Rectangle
ReRo         Rectangle, row, main/secondary diagonals
ReCo         Rectangle, column, main/secondary diagonals
RoCo         Row, column, rectangle
ReTr         Rectangle, transposed rectangle

In this example, a 2D logical address space of 8 × 9 elements contains 10 memory Regions (R), each with different size and location: matrix, transposed matrix, row, column, main and secondary diagonals. In a hardware implementation with eight memory banks, each of these regions can be read using one (R1–R9) or several (R0) parallel accesses. By design, the PRF optimizes the memory throughput for a set of predefined memory access patterns. For PolyMem, we consider p × q memory modules and the five parallel access schemes presented in Table 1. Each scheme supports dense, conflict-free access to p · q elements. When implemented in reconfigurable technology, PolyMem allows application-driven customization: its capacity, number of read/write ports, and the number of lanes can be configured to best support the application needs.

The block diagram in Fig. 2(b) shows, at a high level, the PRF architecture. The multi-bank memory is composed of a bi-dimensional matrix containing p × q memory modules. This enables parallel access to p · q elements in one memory operation. The inputs of the PRF are shown at the top of the diagram. AccessType represents the parallel access pattern. (i, j) are the top-left coordinates of the parallel access. The list of elements to access is generated by the AGU module and is sent to the A and m modules: the A module generates one in-memory address for each memory bank in the PRF, while the m module applies the mapping function of the selected scheme and computes, for each accessed element, the memory bank where it is stored. The Data Shuffle block reorders the Data In/Out, ensuring the PRF user obtains the accessed data in their original order.

2.3 Matrix Storage in a Parallel Memory

Figure 3 compares two ways for a 6 × 6 matrix to be mapped in BRAMs to enable parallel accesses. Specifically, the default Vivado HLS partitioning techniques with a factor of 3 are compared against a PolyMem with 3 memory banks, organized using the PolyMem RoCo scheme. The memory banks, in this case, are organized in a 1 × 3 structure, allowing parallel access to rows and columns of three, possibly unaligned, elements. The left side of Fig. 3 shows an example of a matrix to be stored in the partitioned BRAMs, aiming to achieve read/write parallelism. The right side illustrates three techniques used to partition the matrix, using two unaligned, parallel accesses of 3 elements (gray and black in the figure), starting respectively from the cells containing elements 8 and 23. The HLS Array Partitioning techniques enable either the black or the gray access to be performed in parallel (for Block and Cyclic, respectively). Using PolyMem with a RoCo scheme, each element of each access is mapped on a different memory bank; in turn, this organization enables both the gray and the black access to happen in a single (parallel) operation (see footnote 1).

Fig. 3. Comparison between different partitioning techniques offered by Vivado HLS (factor = 3) and the RoCo scheme of PolyMem, with 3 memory banks, for data stored in a 6 × 6 matrix. PolyMem allows 3 parallel data reads/writes, from the rows and the columns of the original matrix. Unaligned blocks are also supported.
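To make the bank-conflict argument concrete, the following is a minimal software sketch, not the PolyMem hardware logic. It assumes two hypothetical mapping functions: `cyclic_bank`, a simple model of cyclic partitioning along the column dimension, and `skewed_bank`, a skewed mapping used only as an illustrative stand-in for a row/column scheme (the exact RoCo mapping function is defined in [4] and is not reproduced here). The helper `conflict_free` checks whether the elements of one access fall in distinct banks.

```cpp
#include <cassert>
#include <set>

// Assumed model of cyclic partitioning on the column dimension:
// the bank depends only on the column index j.
int cyclic_bank(int /*i*/, int j, int banks) { return j % banks; }

// Illustrative skewed mapping (hypothetical stand-in, not the exact RoCo
// function): the bank depends on both the row and the column index.
int skewed_bank(int i, int j, int banks) { return (i + j) % banks; }

using BankFn = int (*)(int, int, int);

// Returns true if the `banks` elements starting at (i, j) and stepping by
// (di, dj) are all mapped to different banks. Use (0, 1) for a row access
// and (1, 0) for a column access. Indices are assumed non-negative.
bool conflict_free(BankFn bank, int i, int j, int di, int dj, int banks) {
    assert(banks > 0);
    std::set<int> used;
    for (int t = 0; t < banks; ++t)
        used.insert(bank(i + t * di, j + t * dj, banks));
    return static_cast<int>(used.size()) == banks;
}
```

With 3 banks, the column-cyclic model serves an unaligned row access without conflicts but maps a whole column access to a single bank, while the skewed mapping keeps both access shapes conflict-free. Note that this sketch only models bank conflicts; it does not model how the HLS compiler handles unaligned accesses.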

3 Implementation Details

This section describes the main components of our PolyMem implementation for Vivado HLS. The goal of integrating PolyMem in the Xilinx workflow is to provide users with an easy-to-use solution to exploit parallelism when accessing data stored on the on-chip memory with different access patterns. Our Vivado HLS PolyMem implementation supports all five schemes presented (ReO, ReRo, ReCo, RoCo, ReTr) to store on the FPGA BRAMs the data required to perform the application operations. Compared to the default Vivado memory partitioning techniques, which allow hardware parallelism with a single access pattern, a PolyMem employing a multiview scheme allows multiple types of access simultaneously for unaligned data with modest hardware resource usage.

We implemented a template-based class polymem that exploits loop unrolling to parallelize memory accesses. When HLS PolyMem is instantiated within the user application code, it is possible to specify DATA_T, i.e., the type of data to be stored, the (p × q) number of internal banks of memory (i.e., the level of parallelism), the (N × M) dimension of the matrix to be stored (also used to compute the depth of each bank of data), and the scheme to organize data within the different banks of memory.

Listing 1.1 presents the interfaces of the methods that allow access to data stored within PolyMem. The simple read and write methods use the m and A modules (described in Sect. 2.2) to compute, respectively, the memory bank and the in-bank address (depth) at which the required data is stored or needs to be saved. On the other hand, read_block and write_block exploit optimized versions of m and A to read/write (q · p) elements in parallel, while limiting the hardware resources used to reorder data. Finally, we optimized the memory access operations by implementing a write_block_masked method to specify which data in the block has to be overwritten within PolyMem. As an example, this method is useful when PolyMem supports a wide parallel access (e.g., 8 elements), but the user requires less data to be stored (e.g., 5 elements), and wants to avoid overwriting existing data (e.g., the remaining 3 elements).

Footnote 1: This small-scale example is included for visualization purposes only. Real applications are likely to use more memory banks, allowing parallel accesses to larger data blocks.


Listing 1.1. Interfaces of the methods that let the user read/write data using sequential or parallel accesses

    DATA_T read(int i, int j);
    void write(DATA_T data, int i, int j);
    void read_block(int i, int j, DATA_T out[p * q], int PRF_ACCESS_TYPE);
    void write_block(DATA_T in[p * q], int i, int j, int PRF_ACCESS_TYPE);
    void write_block_masked(DATA_T in[p * q], ap_uint<p * q> mask, int i, int j, int PRF_ACCESS_TYPE);
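The masked write described above (8 lanes available, 5 elements to store, 3 elements to preserve) can be illustrated with a small software model. This is a sketch under stated assumptions, not the library's implementation: `write_block_masked_model` and `first_lanes` are hypothetical helpers, `std::bitset` stands in for `ap_uint`, and the convention that bit t of the mask controls lane t is an assumption.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical helper: builds a mask whose first `count` lanes are enabled.
template <std::size_t LANES>
std::bitset<LANES> first_lanes(std::size_t count) {
    assert(count <= LANES);
    std::bitset<LANES> mask;
    for (std::size_t t = 0; t < count; ++t) mask.set(t);
    return mask;
}

// Software model of a masked block write: lane t of `block` is overwritten
// with in[t] only when bit t of `mask` is set (assumed bit-to-lane order).
template <typename T, std::size_t LANES>
void write_block_masked_model(std::array<T, LANES>& block,
                              const std::array<T, LANES>& in,
                              const std::bitset<LANES>& mask) {
    for (std::size_t t = 0; t < LANES; ++t)
        if (mask[t]) block[t] = in[t];
}
```

Calling `write_block_masked_model(stored, incoming, first_lanes<8>(5))` updates the first five lanes of `stored` and leaves the last three untouched.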

4 Evaluation and Results

In this section, we focus on the evaluation of HLS PolyMem. The evaluation is based on a synthetic benchmark, where we demonstrate that PolyMem offers a high-performance, high-productivity alternative to partitioned memories in HLS.

4.1 Experimental Setup

We present the design and implementation of our microbenchmarking suite, and detail the way all our measurements are performed. All the experiments in this section are validated and executed on a Xilinx Virtex-7 VC707 board (part xc7vx485tffg1761-2), with the following hardware resources: 303600 LUTs, 607200 FFs, 1030 BRAMs, and 2800 DSPs. We instantiate a Microblaze processor on the FPGA to control the DMA that transfers data between the FPGA board DRAM memory and the on-chip BRAMs where the computational kernel performs memory accesses. The Microblaze also starts and stops an AXI Timer to measure the execution time of each experiment. The data transfers to and from the computational kernel employ the AXI Stream technology.

Microbenchmark Design. To provide an in-depth evaluation of our polymorphic memory, we designed a specific microbenchmark which tests the performance of PolyMem together with its flexibility - i.e., its ability to cope with applications that require different parallel access types to the same data structure. Moreover, we compare the results of the PolyMem-augmented design with the ones achievable by partitioning the memory with the default techniques available in Vivado HLS. To ensure a fair comparison, we utilize a Vivado HLS Cyclic array partition with a factor equal to the number of PolyMem lanes (both designs can access at most p · q data elements in parallel from the BRAMs). The requirements we state for such a microbenchmark are:


1. Focus on the reading performance of the memory, in terms of bandwidth;
2. Support all access types presented in Sect. 2.2;
3. Test a combination of multiple access types, to further demonstrate the flexibility of polymorphism;
4. Measure the overall bandwidth achieved by these memory transfers.

To achieve these requirements, we designed different computational kernels (IP Cores) that perform a (configurable) number of parallel memory reads, from various locations inside the memory, using different parallel access patterns. Each combination of parallel reads is performed in two different scenarios. The accessed on-chip FPGA memory (BRAMs) M, where the input data are stored, is partitioned by using (1) the default techniques available in Vivado HLS, and (2) the HLS PolyMem technology.

A high-level description of the operations executed by the computational kernels and measured by the timer is presented in Listing 1.2. Memory M is used to store the input data and it is physically implemented in partitioned BRAMs. The kernel receives the data to fill the memory M and N_READS matrix coordinates to perform parallel accesses with different access types - i.e., given an access type and the matrix coordinates (i, j), the computational kernel reads a block of data starting from (i, j) and following the access type. When the memory reads are done, the kernel sends sampled results on the output stream.

Listing 1.2. The structure of the proposed microbenchmark

    stream in data to fill memory M
    stream in N_READS read_coordinates
    synchronize  // wait for streaming to complete
    // process reads
    foreach ACCESS_TYPE in POLYMEM_SCHEME_SUPPORTED_ACCESS_TYPES:
        chunk_size = N_READS / N_SUPPORTED_ACCESS_TYPES
        foreach (i, j) in chunk_of_read_coordinates:
            current_results_block = M.read_block(i, j, ACCESS_TYPE)
    // done processing reads
    foreach k in range(N_RESULTS_BLOCKS):
        stream out the k^th results_block
    synchronize  // wait for streaming to complete

By comparing the performance results of HLS-partitioning and PolyMem, we are able to assess which scheme provides both performance and flexibility, and, moreover, provide a quantitative analysis of the performance gap between the two. We provide more details on how the measurements are performed in the following paragraphs. The complete code used for all the experiments described in this section is available in our code repository [45].


Measurement Setup. In order to measure the performance of the two different parallel memory microbenchmarks, we propose a setup as presented in Fig. 4. Specifically, in this diagram, “Memory” is either an HLS-partitioned memory, or an instance of PolyMem (as described in the previous paragraph).

[Figure: block diagram with the labels Microblaze, Computational Kernel, Memory and Result Buffer, and the phase markers 1, 2 and 3]

Fig. 4. The measurement setup used for microbenchmarks. The measured bandwidth corresponds to phase 2, where reading the data from the parallel memory happens.

In order to measure the performance of the two memories, we propose an approach based on the following steps.

1. Measure, on the Microblaze processor, the overhead of phases 1 and 3. We note that phases 1 and 3 are implemented using the basic streaming constructs provided by the AXI Stream technology. We achieve this by a one-time measurement where no memory reads or only one memory read are performed.
2. Measure, on the Microblaze processor, the execution time of the complete setup, from the beginning of phase 1 to the end of phase 3. Due to the explicit synchronization between phases 1, 2, and 3, we can guarantee that no overlap happens between them.
3. Determine, by subtraction, the absolute execution time of phase 2 alone, which is a measure of the parallel memory’s reading performance.
4. Present the absolute performance of the two memories in terms of achieved bandwidth. For the case of PolyMem, we can also assess the efficiency of the implementation by comparing the achieved bandwidth with that predicted by the theoretical performance model [14].
5. Present the relative performance of the two parallel memories as speedup. We calculate speedup as the ratio of the execution time of the HLS-based partitioning solution over the execution time of the PolyMem-based solution. We chose to use the entire execution time, including the copy overhead, as an estimate of a realistic benchmark when the same architecture is used for real-life applications. We note that this choice is pessimistic, as the overhead of phases 1 and 3 can be quite large.

4.2 Results

All the results presented in this section are from experiments performed using the settings in Table 2.

Table 2. Microbenchmark settings

Setting                                          Value
Clock frequency (ClkFr)                          200 MHz
Data type (DType)                                64-bit double
Input matrix size (DIM × DIM)                    96 × 96
HLS partitioning factor (FACTOR)                 16
PolyMem lanes (p × q)                            16 (2 × 8)
Number of passed coordinates (i, j) (N_READS)    3072
Size of each read block (BLOCK_SIZE)             16
Number of output blocks (N_RESULTS_BLOCKS)       50

The input data stream employs double precision (64-bit) numbers, and the computational kernel receives an amount of data (equal for all the experiments) that includes the input matrix and the list of coordinates (i, j):

N_IN_DATA = (DIM · DIM) + (N_READS · 2) = 15360 64-bit elements

The number of data elements that the computational kernel reads from the memory is computed as follows:

N_READ_DATA = (N_READS · BLOCK_SIZE) = 49152 64-bit elements

The output data stream employs double precision 64-bit numbers, and the computational kernel sends back to the Microblaze a sample of the results (data read), equal for all the experiments, amounting to:

N_OUT_DATA = (N_RESULTS_BLOCKS · BLOCK_SIZE) = 800 64-bit elements

To measure the overheads introduced by the data transfers in terms of hardware resources utilization and execution time, we implemented two computational kernels: the first one does not perform any memory accesses (the BRAMs are not even partitioned) and the second one performs only one memory access (the added execution time of this one access is negligible). The second kernel was executed for both memory configurations (HLS Cyclic and PolyMem). The results are shown in Table 3. The consistent execution time indicates that the overhead is systematic and constant.

Table 3. Hardware resources utilization and execution time spent in phases 1 and 3 of the proposed architecture

Memory      Access  LUT    FF     BRAM  DSP  Runtime [µs]
-           -       41400  34064  21    0    265
HLS Cyclic  Row     43194  35302  172   0    265
PolyMem     Row     46375  36444  172   0    265


Table 4. Hardware resources utilizations, execution times and bandwidths for microbenchmark experiments with different memory configurations and access schemes

Memory      Scheme  LUT    FF     BRAM  DSP  Runtime [µs]  BW [GB/s]
HLS Cyclic  ReO     45800  36121  172   0    503           1.54
PolyMem     ReO     45590  36364  172   0    283           20.35
HLS Cyclic  ReRo    90197  65993  174   224  503           1.54
PolyMem     ReRo    59082  40661  172   0    283           20.35
HLS Cyclic  ReCo    85055  64679  174   164  503           1.54
PolyMem     ReCo    62549  40434  172   0    283           20.35
HLS Cyclic  RoCo    67066  54217  174   100  503           1.54
PolyMem     RoCo    55025  38944  172   0    283           20.35
HLS Cyclic  ReTr    62259  54244  174   40   503           1.54
PolyMem     ReTr    51282  37744  172   0    283           20.35

Given the data transfer execution time overhead of 265 µs (Table 3), the bandwidth (BW) of each new experiment, reported in GB/s, is obtained with the following formula:

$$\mathrm{BW}\,[\mathrm{B/s}] = \frac{\mathrm{N\_READS} \cdot \mathrm{BLOCK\_SIZE} \cdot 8}{\text{Exec. time} - \text{overhead}}$$
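As a worked check of this formula, the sketch below plugs in the settings of Table 2 and the runtimes of Tables 3 and 4. The function names are hypothetical. One assumption is made explicit: the reported bandwidth figures are reproduced when 1 GB is taken as 2^30 bytes, which is an interpretation of the tables rather than something the text states.

```cpp
#include <cassert>

// Settings from Table 2.
constexpr long long N_READS = 3072;
constexpr long long BLOCK_SIZE = 16;
constexpr long long BYTES_PER_ELEMENT = 8;  // 64-bit double

// Phase-2 bandwidth in bytes per second; both times are in microseconds.
double bandwidth_bytes_per_s(double exec_time_us, double overhead_us) {
    assert(exec_time_us > overhead_us);
    const double bytes =
        static_cast<double>(N_READS * BLOCK_SIZE * BYTES_PER_ELEMENT);
    return bytes / ((exec_time_us - overhead_us) * 1e-6);
}

// Converts bytes per second to units of 2^30 bytes per second
// (assumed meaning of "GB/s" in this sketch).
double to_gb_per_s(double bytes_per_s) {
    return bytes_per_s / (1024.0 * 1024.0 * 1024.0);
}

// Speedup of an improved runtime over a baseline runtime.
double speedup(double baseline_us, double improved_us) {
    assert(improved_us > 0.0);
    return baseline_us / improved_us;
}
```

With an overhead of 265 µs, a runtime of 283 µs gives about 20.35 and a runtime of 503 µs gives about 1.54. The same numbers yield an end-to-end speedup of 503/283 ≈ 1.78 and a phase-2 speedup of 238/18 ≈ 13.22.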

In Table 4, we report the detailed results of our microbenchmarking experiments, in terms of hardware resource utilization, execution time, and bandwidth. We provide results for the two different memory configurations and all PolyMem access schemes. As shown in Listing 1.2, the memory accesses are equally divided among the access patterns supported by the selected scheme. We further note that, for all the schemes, the speedup of the end-to-end computation (i.e., phases 1, 2 and 3 from Fig. 4) is 1.78x. For the actual computation, using the parallel memory (i.e., without the data transfer overhead), PolyMem outperforms HLS partitioning by as much as 13.22x. Moreover, in terms of hardware resources, (1) the BRAM utilization is similar for both parallel memories, which indicates no overhead for PolyMem, (2) PolyMem is more economical in terms of “consumed” LUT and FF (up to 20% less), and (3) HLS partitioning makes use of DSPs, while PolyMem does not. The following paragraph contains an evaluation of these results.

Unaligned Accesses and Final Evaluation. The results suggest that the Vivado HLS default partitioning techniques are not able to exploit parallel reads for the described access patterns. This is due to the fact that, even if the data are correctly distributed among the BRAMs to perform at least one access type, parallel accesses unaligned with respect to the partitioning factor are not supported. To prove that, we perform experiments where the memory reads are forced to be aligned with respect to the partitioning factor, for one of the access types - e.g., with a cyclic partitioning factor of 4 on the Ro access, it is possible to read 4 data elements in parallel at the coordinates {(i, j), (i, j+1), (i, j+2), (i, j+3)} only if j is a multiple of 4. This can be enforced, in a way that is visible at compile time, by using integer division on the reading coordinates (i, j) as follows:

$$\mathrm{aligned\_j} = \left\lfloor \frac{j}{\mathrm{BLOCK\_SIZE}} \right\rfloor \cdot \mathrm{BLOCK\_SIZE}$$

This ensures that aligned_j is a multiple of the number of memory banks, i.e., BLOCK_SIZE. Using aligned_j for the data access allows the HLS compiler to perform more aggressive optimizations, parallelizing the access to the partitioned memory. Table 5 shows the results for the RoCo scheme with different combinations of access types, where forced aligned accesses are performed or not. The cases where the memory reads are aligned with respect to the partitioning factor are the only ones where the default Vivado HLS partitioning is able to achieve the same performance as PolyMem, while using fewer hardware resources. However, even in these cases, the default Vivado HLS partitioning is not able to perform all the memory accesses with the right amount of parallelism if the application requires multiple access patterns. Practical examples showing the advantages of using PolyMem are provided in the following section.
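The forced-alignment computation described above amounts to rounding the column index down to a multiple of the block size. A minimal sketch follows; the helper name `align_down` is hypothetical and the inputs are assumed to be non-negative with a positive block size.

```cpp
#include <cassert>

// Rounds j down to the nearest multiple of block_size using integer
// division, so the result is always aligned to the partitioning factor.
int align_down(int j, int block_size) {
    assert(j >= 0 && block_size > 0);
    return (j / block_size) * block_size;
}
```

For example, with a block size of 16, a requested column index of 23 is mapped to 16, while an index that is already a multiple of 16 is left unchanged.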

5 Application Case-Studies

In this section, we analyze two case-study applications, i.e., matrix multiplication and Markov chain, that exploit our HLS PolyMem to parallelize accesses to matrix data. Each application demonstrates different HLS PolyMem features. In the matrix multiplication case-study, we show how our approach outperforms implementations that use the default partitioning of Vivado HLS. For the Markov Chain application, we show how HLS PolyMem enables performance gains with minimal changes to the original software code.

Table 5. Hardware resources utilizations, execution times and bandwidths for the RoCo scheme with different combinations of access types with and without forced aligned accesses (FA)

Memory      Access types   LUT    FF     BRAM  DSP  Runtime [µs]  BW [GB/s]
HLS Cyclic  Ro             60127  52552  174   64   503           1.54
PolyMem     Ro             45641  36432  172   0    283           20.35
HLS Cyclic  FA Ro          43048  35316  173   0    283           20.35
PolyMem     FA Ro          45175  36391  173   0    283           20.35
HLS Cyclic  Ro, Co, Re     67066  54217  174   100  503           1.54
PolyMem     Ro, Co, Re     55025  38944  172   0    283           20.35
HLS Cyclic  Ro, Co, FA Re  47812  38003  173   0    429           2.23
PolyMem     Ro, Co, FA Re  55328  38975  173   0    283           20.35

5.1 Matrix Multiplication (MM)

With this case study, we aim to demonstrate the usefulness of the multiview property of HLS-PolyMem. Specifically, we investigate, in the context of a real application, two aspects: (1) whether there is any performance loss or overhead between the two parallel memories for a single matrix multiplication, and (2) what the performance gap is between the two types of parallel memories in the case where multiple parallel access shapes are needed, on the same data structure, in the same application.

Single Matrix Multiplication. For our first experiment, the application performs one multiplication of two square matrices, B and C, of size DIM × DIM, that are stored by using either the default HLS array partitioning techniques or the HLS PolyMem implementation. Since the multiplication B × C is performed by accessing the rows of B and multiply-accumulating the data with the columns of C, it is convenient, when using HLS default partitioning, to partition B on the second dimension and C on the first one. Indeed, this enables parallel accesses to the rows of B and columns of C in the innermost loop of the computation. On the other hand, for the HLS PolyMem implementation, we store both B and C in the HLS PolyMem, configured with a RoCo scheme, because it allows parallel accesses to both rows and columns. Listings 1.3 and 1.4 show the declaration of the matrices and their partitioning using the HLS default partitioning and the HLS PolyMem, respectively. Both parallel memories use 16 lanes (i.e., data is partitioned onto 16 memory banks): the HLS partitioned scheme uses a parallel factor of 16, while the B and C HLS PolyMem instances are initialized with p = 4 and q = 4.

Listing 1.3. Declaration and partitioning of matrices to parallelize accesses to rows (dim=2) of B and to columns (dim=1) of C with a parallel factor of 16.

    float B[DIM][DIM];
    #pragma HLS array_partition variable=B block factor=16 dim=2
    float C[DIM][DIM];
    #pragma HLS array_partition variable=C block factor=16 dim=1

Listing 1.4. Declaration of the matrices stored by using the HLS PolyMem with the RoCo scheme with a parallel factor of 4 · 4 = 16.

    #include "hls_prf.h"
    hls::prf<float, 4, 4, DIM, DIM, SCHEME_RoCo> B;
    hls::prf<float, 4, 4, DIM, DIM, SCHEME_RoCo> C;

Listings 1.5 and 1.6 show the matrix multiplication code when using the HLS default partitioning and the HLS PolyMem, respectively.


Listing 1.5. Matrix multiplication code that leverages default HLS partitioning to perform parallel accesses.

    // B * C matrix multiplication
    for (int i = 0; i < DIM; ++i)
        for (int j = 0; j < DIM; ++j) {
            #pragma HLS PIPELINE II=1
            float sum = 0;
            for (int k = 0; k < DIM; ++k)
                sum += B[i][k] * C[k][j];
            OUT[i][j] = sum;
        }

Listing 1.6. Matrix multiplication code that exploits the HLS PolyMem with RoCo scheme to perform parallel accesses.

    // B * C matrix multiplication
    for (int i = 0; i < DIM; ++i)
        for (int j = 0; j < DIM; ++j) {
            #pragma HLS PIPELINE II=1
            float sum = 0;
            for (int k = 0; k < DIM; k += 16) {
                B.read_block(i, k, temp_row, ACCESS_Ro);
                C.read_block(k, j, temp_col, ACCESS_Co);
                for (int t = 0; t < 16; t++)
                    sum += temp_row[t] * temp_col[t];
            }
            OUT[i][j] = sum;
        }

Double (Mirrored) Matrix Multiplication. Even though both approaches achieve the goal of computing the matrix multiplication by accessing 16 matrix elements in parallel, the HLS PolyMem solution provides more flexibility when additional data access patterns are required, which is often the case for larger kernels. In order to highlight this aspect, we also consider a second kernel function, in which both the B × C and the C × B products need to be computed. This effectively means that the new kernel can only enable 16 parallel accesses for both multiplications if the matrices allow parallel reads using both row and column patterns.

Results and Analysis. Table 6 reports the latency and resource utilization estimated by Vivado HLS when computing the single matrix multiplication kernel (1MM), B × C (rows 1, 2), and when computing the double multiplication (2MM’s), B × C followed by C × B (rows 3, 4 and 5, 6) for the two parallel memories under consideration. As expected, when using the default Vivado HLS partitioning techniques, the second multiplication (C × B) cannot be computed efficiently due to the way in which the matrix data is partitioned into the memory banks, as described in Sect. 2. Indeed, with this partitioning B can only be accessed in parallel by rows and C by columns, whereas C × B requires the rows of C and the columns of B.


Table 6. Latency and hardware resources for matrix multiplication with different memory configurations and matrix dimensions

Memory   Matrix size  Parallel factor  Latency 1 MM  Latency 2 MM's  BRAM  DSP  FF     LUT
HLS      32           4                4227          n.a.            18    40   6162   6485
PolyMem  32           4 (2 × 2)        4227          n.a.            18    40   6153   6018
HLS      32           4                4227          16503           18    40   7444   9197
PolyMem  32           4 (2 × 2)        4227          4227            18    40   7367   7364
HLS      96           16               28033         442722          96    164  28554  40474
PolyMem  96           16 (4 × 4)       28033         28033           96    160  30969  43636

On the other hand, the implementation based on HLS PolyMem is perfectly capable of performing both matrix products (B × C and C × B) efficiently. The performance data reflects this very well: the estimated latency reported in Table 6 is the same for both products in the PolyMem case, and drastically different in the case of HLS partitioning. It is also worth noting that for a matrix size of 32 × 32, the two approaches have similar resource consumption, while for matrices with larger dimensions and a parallel factor of 16, the HLS PolyMem has a resource consumption overhead in terms of FF and LUT of at most 8.5% compared to the HLS default partitioning schemes.

Finally, in order to empirically validate the designs, we implemented the kernel module performing both B × C and C × B with matrix size of 96 and a parallel factor of 16 on a Xilinx Virtex-7 VC707 with a target frequency of 100 MHz. The benchmarking system is similar to that presented in Sect. 4: a soft Microblaze core is used to manage the experiment, the input/output data (matrices B and C, and the result) are streamed into parallel memory, and the actual multiplication operations are performed using the parallel memory. For the kernel with a single multiplication, the performance of the two solutions is the same. However, for the kernel with the double multiplication, the HLS PolyMem version achieves an overall speedup of 5x compared to the implementation based on HLS memory partitioning.

5.2 Markov Chain and the Matrix Power Operation

With this case study, which has at its core the matrix power operation, we aim to reinforce the need for multiview accesses to the same data structure, and further demonstrate how tiling can be easily achieved and used in conjunction with HLS-PolyMem, to further alleviate its resource overhead. A Markov Chain is a stochastic model used to describe real-world processes. Some of its most relevant applications are found in queuing theory, the study of population growths [15], and in stochastic simulation methods such as Gibbs sampling [16] and Markov Chain Monte Carlo [17]. Moreover, PageRank [18], an algorithm used to rank websites by search engines, leverages a time-continuous variant of this model. A Markov Chain can also describe a system composed of multiple discrete states, where the probability of being in a state depends only on the previous state of the system.

A Markov Transition Matrix A, which is a variant of an adjacency matrix, can be used to represent a Markov Chain. In this matrix, each row contains the probability to move from the current state to any other state of the system. More specifically, given two states i and j, the probability to transition from i to j is $a_{i,j}$, where $a_{i,j}$ is the element at row i and column j of the transition matrix A. Computing the h-th power of the Markov Transition Matrix is a way to determine the probability to transition from an initial state to a final state in h steps. Furthermore, when the number of steps h tends to infinity, the result of $A^h$ can be used to recover the stationary distribution of the Markov Chain, if it exists. From a computational perspective, an approximate value for the result of $\lim_{x \to \infty} A^x$ is obtained for large enough values of x.

In our implementation, matrix A is stored in a HLS PolyMem, so that both rows and columns can be accessed in parallel. We then compute $A^2$ and save the result into a support matrix A_temp, partitioned on the second dimension. After $A^2$ is computed, we can easily compute $A^{2^h}$ by copying back results to the HLS PolyMem and iterating the overall computation h times. Listing 1.7 shows an HLS PolyMem-based algorithm that can be used to compute $A^{2^h}$. The implementation consists of an outermost loop repeated h times in which we compute the product A × A, whose result is stored in A_temp and copied back to the PolyMem for A before the next iteration. Implementing the same algorithm by using the HLS partitioning techniques, as presented in the previous case study, results in poor exploitation of the available parallelism, or in duplicated data, since A needs to be accessed both by rows and columns.

Listing 1.7. HLS PolyMem implementation of $A^{2^h}$

    hls::prf<float, p, q, DIM, DIM, SCHEME_RoCo> A;
    for (int iter = 0; iter < h; iter++) {
        // A * A matrix multiplication
        for (int i = 0; i < DIM; ++i) {
            for (int j = 0; j < DIM; ++j) {
                #pragma HLS PIPELINE II=1
                float sum = 0;
                for (int k = 0; k < DIM; k += p * q) {
                    A.read_block(i, k, temp_row, ACCESS_Ro);
                    A.read_block(k, j, temp_col, ACCESS_Co);
                    for (int t = 0; t < p * q; t++)
                        sum += temp_row[t] * temp_col[t];
                }
                A_temp[i][j] = sum;
            }
        }
        // Copy back results to PolyMem
        for (int i = 0; i < DIM; ++i) {
            for (int t = 0; t < DIM; t += p * q) {
                #pragma HLS PIPELINE II=1
                A.write_block(&A_temp[i][t], i, t, ACCESS_Ro);
            }
        }
    }
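The repeated-squaring idea behind this listing can be checked in plain software, independently of PolyMem. The sketch below is a reference model under stated assumptions: it uses `std::array` instead of the HLS memory, a fixed 2 × 2 size, and hypothetical function names. Squaring the matrix h times yields A raised to the power 2^h, and for a chain with a stationary distribution every row of the result approaches that distribution.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t DIM = 2;  // illustrative size, not the paper's 384
using Matrix = std::array<std::array<double, DIM>, DIM>;

// Plain row-by-column matrix product.
Matrix multiply(const Matrix& x, const Matrix& y) {
    Matrix out{};
    for (std::size_t i = 0; i < DIM; ++i)
        for (std::size_t j = 0; j < DIM; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < DIM; ++k) sum += x[i][k] * y[k][j];
            out[i][j] = sum;
        }
    return out;
}

// Squares the matrix h times, which computes a raised to the power 2^h.
Matrix power_two_to_h(Matrix a, int h) {
    assert(h >= 0);
    for (int iter = 0; iter < h; ++iter) a = multiply(a, a);
    return a;
}
```

As a usage example, for the transition matrix with rows (0.9, 0.1) and (0.5, 0.5) the stationary distribution is (5/6, 1/6); after six squarings (the 64th power) both rows of the result agree with it to well within 1e-9.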

The HLS PolyMem enables parallel accesses to matrix A for both rows and columns, but adds some overhead in terms of hardware resources and complexity of the logic to shuffle data within the right memory banks. The resources overhead has a quadratic growth with respect to the number p · q of parallel memories used to store data [4]. A possible solution to this problem is a simple form of tiling, where we reduce the dimension of PolyMem by dividing the input matrix A and storing its values in a grid of multiple PolyMems. If A has DIM × DIM elements, it is possible to organize the on-chip memory to store data in a grid of b × b square blocks, each having size (DIM/b) × (DIM/b). In order to preserve the same level of parallelism, we can re-engineer the original computation to work in parallel on the data stored in each memory within the grid. Instead of computing a single vectorized row-column product, it is possible to perform the computation on multiple row-column products in parallel and reduce the final results. Figure 5 shows how the input matrix is divided in multiple memories according to the choice of the parameters p, q and b. Moreover, the figure also shows which data is accessed concurrently at each step of the computation. As an example, for the case p = q = b = 2 there are 4 row-column products performed in parallel ($b^2$) and for each of them 4 values are processed in parallel (p · q). It is important to notice that when p = q = 1 the PolyMem reduces to memories in which a single element is accessed at a time. In this case, each PolyMem can be removed and substituted by a single memory bank.
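The tiling described above implies a translation from a global matrix coordinate to a block of the grid plus a local coordinate inside that block. The sketch below shows one way to express it. It assumes that blocks are contiguous square tiles and that DIM is divisible by b; `TileCoord` and `to_tile` are hypothetical names, and the actual layout used in the paper is the one shown in Fig. 5.

```cpp
#include <cassert>

// Position of an element inside a b x b grid of tiles (assumed layout).
struct TileCoord {
    int block_row;  // which tile row of the grid holds the element
    int block_col;  // which tile column of the grid holds the element
    int local_i;    // row index inside the tile
    int local_j;    // column index inside the tile
};

// Maps a global coordinate (i, j) of a dim x dim matrix to its tile and
// local coordinate, for tiles of size (dim / b) x (dim / b).
TileCoord to_tile(int i, int j, int dim, int b) {
    assert(b > 0 && dim % b == 0);
    assert(i >= 0 && i < dim && j >= 0 && j < dim);
    const int tile = dim / b;
    return {i / tile, j / tile, i % tile, j % tile};
}
```

For a 384 × 384 matrix with b = 2, each tile is 192 × 192, so element (200, 10) falls in tile (1, 0) at local position (8, 10).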

Fig. 5. Comparison between different partitionings of the input matrix into a grid of $b^2$ components, each implemented by a PolyMem with a level of parallelism of p × q. When both p and q are set to 1, it is possible to remove the HLS PolyMem logic.


In Table 7 we report the latency and the resource utilization estimated by Vivado HLS, together with the number of lines of code (LOC), for different configurations of the parameters p, q and b on 8 iterations of the power operation for a 384 × 384 matrix. The numbers demonstrate that by re-engineering the code and the access patterns (b > 1), it is possible to achieve a smaller overall latency. However, this comes at the cost of more convoluted code, which is approximately twice as long, in terms of lines of code, as the original version. In contrast, by using a single PolyMem (b = 1) we can still obtain higher performance than with the default HLS array partitioning techniques, with a much smaller and simpler code base. Indeed, PolyMem reduces the time needed to develop an optimized FPGA-based implementation of the algorithm, requiring only minor modifications to the original software code. Thanks to HLS PolyMem we raise the level of abstraction of parallel memory accesses, thus enhancing the overall design experience and productivity. Finally, to validate the flexibility of the HLS PolyMem library, we implemented and tested the application using the Xilinx SDx tool, which enables OpenCL integration and automatically generates the PCIe drivers for communication. In this case, the benchmarking follows a method similar to the one presented in Sect. 4.1 and Fig. 4, with two amendments: (1) instead of using the MicroBlaze softcore, we manage the experiment directly from the CPU of the host system, where the FPGA board acts as an accelerator, and (2) the transfers from stages (1) and (3) are performed in blocks over the PCIe bus. We synthesized a design for a matrix size of 256 and parameters p = q = b = 2 at 200 MHz, and we benchmarked its performance on the Xilinx Kintex UltraScale ADM-PCIE-KU3 platform. The obtained throughput is 1.6 GB/s.
We note that this number is significantly lower than the expected performance of the HLS-PolyMem itself because it also includes the PCIe overhead. Without this overhead, the performance of the computation using the parallel memory alone is expected to be similar to the performance of a single PolyMem block with p × q lanes, running at 200 MHz, which should be in the same order as that presented in Table 1 (i.e., 21 GB/s for a 16-lane HLS-PolyMem).

Table 7. Latency, hardware resources (BRAM, DSP, FF, LUT) and lines of code (LOC), for 8 iterations of the matrix power operation with different memory configurations and a matrix size of 384

| Memory        | p | q | b | Latency    | BRAM | DSP | FF    | LUT   | LOC |
|---------------|---|---|---|------------|------|-----|-------|-------|-----|
| PolyMem       | 2 | 2 | 1 | 1557835871 | 1036 | 14  | 9936  | 11071 | 98  |
| PolyMem       | 2 | 4 | 1 | 840333407  | 1044 | 17  | 19678 | 28855 | 98  |
| PolyMem       | 4 | 4 | 1 | 488632423  | 1060 | 31  | 36138 | 53621 | 98  |
| multi PolyMem | 1 | 1 | 2 | 758085955  | 1036 | 14  | 6967  | 5572  | 188 |
| multi PolyMem | 1 | 2 | 2 | 394149976  | 1044 | 28  | 14709 | 12934 | 188 |
| multi PolyMem | 2 | 2 | 2 | 214032480  | 1060 | 45  | 24845 | 22418 | 188 |
| NO PolyMem    | 1 | 1 | 4 | 101848419  | 1124 | 76  | 32852 | 13706 | 188 |

6 Related Work

The concept of parallel memory is fairly old and has been widely discussed in the scientific literature. As early as 1971, Kuck et al. discussed the advantages and disadvantages of using memory systems with a power-of-two number of memory banks [19], based on results collected from a study performed on the Illiac IV machine. One of the earliest design methodologies and general designs of a parallel memory system dedicated to image processing applications is presented in [20]. The memory space is already organized as a 2D structure, while p and q are the parameters of the parallel region to be accessed; the authors discuss three different mapping functions, and ultimately demonstrate the benefits parallel accesses bring to image processing. In the 1990s, more work was devoted to investigating various addressing schemes and their implementation. For example, [9] investigates schemes based on linear address transformations (i.e., XOR schemes), and the use of these schemes for accessing memory in a conflict-free manner using multiple strides, blocks, and FFT access patterns. In [21], another memory system design, enabling different parallel accesses to a 2D parallel memory, is presented; this design is different in that it focuses on 2D memories to be accessed by arrays of 2D processing units, and thus its mapping and addressing functions are specialized. SIMD processors have fueled more research in building and using parallel memories efficiently. For example, the use of memory systems that leverage a prime number of memory modules to provide parallel accesses for rectangles, rows, columns, and diagonals is investigated in [22]; the authors prove the advantages of building fast mapping/addressing functions for such particular memories, an idea already envisioned and analyzed in [23].
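To make the role of the number of memory modules more concrete, the following C++ sketch checks a generic linear skewing function for bank conflicts. It is an illustration under assumptions, not the scheme of any cited work: the mapping `bank = (a*i + b*j) mod n_banks`, the coefficients, and the function names are chosen here for the example only. With 5 banks (a prime) and coefficients (1, 2), rows, columns and both diagonals are conflict-free, whereas with 4 banks and coefficients (1, 1) the diagonals collide.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

enum class Pattern { Row, Column, MainDiagonal, AntiDiagonal };

// Generic linear skewing (assumed for illustration): element (i, j) is stored
// in bank (a*i + b*j) mod n_banks. Negative remainders are folded back.
int bank_of(int i, int j, int a, int b, int n_banks) {
    assert(n_banks > 0);
    const int m = (a * i + b * j) % n_banks;
    return m < 0 ? m + n_banks : m;
}

// True if the n_banks elements of the pattern anchored at (i0, j0)
// all fall into distinct banks, i.e. can be accessed in one parallel step.
bool conflict_free(Pattern p, int i0, int j0, int a, int b, int n_banks) {
    std::set<int> used;
    for (int t = 0; t < n_banks; ++t) {
        int i = i0, j = j0;
        switch (p) {
            case Pattern::Row:          j += t; break;
            case Pattern::Column:       i += t; break;
            case Pattern::MainDiagonal: i += t; j += t; break;
            case Pattern::AntiDiagonal: i += t; j -= t; break;
        }
        used.insert(bank_of(i, j, a, b, n_banks));
    }
    return used.size() == static_cast<std::size_t>(n_banks);
}

// Check the pattern for every anchor position inside one n_banks x n_banks window.
bool always_conflict_free(Pattern p, int a, int b, int n_banks) {
    for (int i0 = 0; i0 < n_banks; ++i0)
        for (int j0 = 0; j0 < n_banks; ++j0)
            if (!conflict_free(p, i0, j0, a, b, n_banks)) return false;
    return true;
}
```

For example, `always_conflict_free(Pattern::MainDiagonal, 1, 2, 5)` holds, while `always_conflict_free(Pattern::MainDiagonal, 1, 1, 4)` does not, because a + b = 2 shares a factor with the 4 banks.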
In the same work [22], Park also introduces a Multi Access Memory System, which provides access to multiple sub-array types, although it uses memory modules in a redundant manner. An addressing function for 2D rectangular accesses, suitable for multimedia applications, is presented in [10]; the aim of this work is to minimize the number of memory modules required for efficient (i.e., full-utilization) parallel accesses. The work in [24] also aims at the full utilization of the memory modules, introducing a memory system based on linear skewing (the same idea from 1971 [19]) that supports conflict-free block and diagonal accesses in a 2D space. [25] proposes a memory system with a power-of-two number of memory modules that is able to perform strided accesses with a power-of-two interval in the horizontal and vertical directions. The analysis of parallel memories has also been refined: for example, the effect of using a parallel memory on the dynamic instruction count of an application is explored in [8]. The PRF multiview access schemes, which are fundamental for this work, are explained in detail in [4], together with the hardware design and implementation requirements. This work introduces an efficient HLS implementation of


the PRF addressing schemes, greatly simplifying the deployment of PolyMem on FPGAs. Alternative schemes also exist. For example, the Linear-Transformation-Based (LTB) algorithm for automatic generation of memory partitions of multidimensional arrays, which is suitable for use during FPGA HLS loop pipelining, is described in [11]. The Local Binary Pattern (LBP) algorithm from [12] considers the case of multi-pattern and multi-array memory partitioning. [26] discusses the advantages of hierarchical memory structures generated on a tree-based network, as well as different methods for their automatic generation. Building a memory hierarchy for FPGA kernels is recognized as a difficult, error-prone task [27,28]. For example, [28–32] focus on the design of generic, traditional caches. Moreover, the recently released High-Level Synthesis (HLS) tools for FPGAs [33] provide a simple set of parallel access patterns to on-chip memory starting from high-level language implementations. More recently, work has been done on using the Polyhedral Model to automatically determine the module assignment and addressing functions [34]. By comparison, our work proposes a parallel, polymorphic memory which can be exploited from HLS tools and acts as a caching mechanism between the DRAM and the processing logic; instead of supporting placement and replacement policies, our memory is configured for the application at hand, and it is directly accessible for reading and writing. Moreover, PolyMem includes a multiview feature, enabling multiple conflict-free access types, a capability not present in other approaches [34]. Application-specific caches have also been investigated for FPGAs [26,29,35], though none of these are polymorphic or parallel. For example, in [36], the authors demonstrate why and how different caches can be instantiated for specific data structures with different access patterns.
PolyMem starts from a similar idea but, benefiting from its multiview, polymorphic design, improves on it by using a single large memory for all these data structures. Many of PolyMem's advantages arise from its PRF-based design [4], which is more flexible and performs better than alternative memory systems [37–40]; its high performance in scientific applications has also been shown in practice [41–43]. As stated before, the first hardware implementation of the Polymorphic Register File was designed in SystemVerilog [5]. MAX-PolyMem was the first prototype of PolyMem written entirely in MaxJ, and targeted at Maxeler DFEs [3,6]. Our new HLS PolyMem is an alternative HLL solution, proven to be easily integrated with the Xilinx toolchains. In summary, compared to previous work on enabling easy-to-use memory hierarchies and/or caching mechanisms for FPGAs, PolyMem proposes a PRF-based design that supports polymorphic parallel accesses through a single, multiview, application-specific software cache. The previous HLS implementation [3] has demonstrated good performance, but was specifically designed to be used on Maxeler-based systems. Our current HLS-PolyMem is the most generic implementation to date; it preserves the advantages of the previous incarnations of the system in terms of performance and flexibility, and adds the ease of use of an HLS library that can be easily integrated in the design flow of modern tools like Vivado HLx and Vivado SDx.

7 Conclusion and Future Work

In this paper, we presented a C++ implementation of PolyMem optimized for Vivado HLS, ready to use as a library for applications requiring parallel memories. Compared to the naive optimizations using HLS array partitioning techniques, the HLS PolyMem implementation is better in terms of performance, provides high flexibility in terms of supported parallel access patterns, and requires virtually zero implementation effort in terms of code re-engineering. Our design exposes an easy-to-use interface to enhance design productivity for FPGA-based applications. This interface provides methods for the basic parallel read/write operations, and it is extended to support masked on-chip parallel accesses. Furthermore, we provide a full, open-source implementation of HLS-PolyMem, supporting all the original PolyMem schemes [45]. Our evaluation, based on comprehensive microbenchmarking, demonstrates sustained high performance for all these schemes. Our results demonstrate that HLS-PolyMem achieves the same level of performance as HLS partitioning for simple access patterns (i.e., rows and columns), and significant performance benefits compared with HLS partitioning for more complex access patterns. We observe bandwidth improvements as high as 13x for combinations of complex access patterns, which HLS partitioning simply cannot support. We also demonstrated the flexibility of the library across the Xilinx Design Tools, by implementing the kernels for both the Vivado workflow with a Virtex-7 VC707 and the SDx workflow with a Kintex UltraScale ADM-PCIE-KU3. Our empirical analysis of the library on two case studies (matrix multiplication and Markov chains) demonstrated competitive results in terms of latency and low code complexity, but also a small overhead in terms of hardware resource utilization. Our future work focuses on three different directions. First, we aim to demonstrate the usability of HLS-PolyMem on more case studies, and to further develop the API to better support end users.
Second, we aim to further improve the implementation of the HLS-PolyMem backend. For example, we consider improving the HLS PolyMem shuffle module by exploiting a Butterfly Network [44] for the memory bank connections, and enhancing our HLS implementation to support both standard and customized addressing. Third, we envision a wizard-like framework to automatically analyze the user application code, estimate the potential benefits of using HLS-PolyMem, and suggest how to actually embed the parallel memory in the code to reach the best possible performance.

References

1. White Paper: Vivado Design Suite: “Vivado Design Suite” (2012). https://www.xilinx.com/support/documentation/white_papers/wp416-Vivado-Design-Suite.pdf
2. Weinhardt, M., Luk, W.: Memory access optimisation for reconfigurable systems. IEE Proc. Comput. Digit. Tech. 148(3), 105–112 (2001)
3. Ciobanu, C.B., Stramondo, G., de Laat, C., Varbanescu, A.L.: MAX-PolyMem: high-bandwidth polymorphic parallel memories for DFEs. In: IEEE IPDPSW RAW 2018, pp. 107–114, May 2018
4. Ciobanu, C.: Customizable register files for multidimensional SIMD architectures. Ph.D. thesis, TU Delft, The Netherlands (2013)
5. Ciobanu, C., Kuzmanov, G.K., Gaydadjiev, G.N.: Scalability study of polymorphic register files. In: Proceedings of DSD, pp. 803–808 (2012)
6. Ciobanu, C.B., et al.: EXTRA: an open platform for reconfigurable architectures. In: SAMOS XVIII, pp. 220–229 (2018)
7. Stornaiuolo, L., et al.: HLS support for polymorphic parallel memories. In: 2018 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), pp. 143–148. IEEE (2018)
8. Gou, C., Kuzmanov, G., Gaydadjiev, G.N.: SAMS multi-layout memory: providing multiple views of data to boost SIMD performance. In: ICS, pp. 179–188. ACM (2010)
9. Harper, D.T.: Block, multistride vector, and FFT accesses in parallel memory systems. IEEE Trans. Parallel Distrib. Syst. 2(1), 43–51 (1991)
10. Kuzmanov, G., Gaydadjiev, G., Vassiliadis, S.: Multimedia rectangularly addressable memory. IEEE Trans. Multimedia 8, 315–322 (2006)
11. Wang, Y., Li, P., Zhang, P., Zhang, C., Cong, J.: Memory partitioning for multidimensional arrays in high-level synthesis. In: DAC, p. 12. ACM (2013)
12. Yin, S., Xie, Z., Meng, C., Liu, L., Wei, S.: Multibank memory optimization for parallel data access in multiple data arrays. In: Proceedings of ICCAD, pp. 1–8. IEEE (2016)
13. auf der Heide, F.M., Scheideler, C., Stemann, V.: Exploiting storage redundancy to speed up randomized shared memory simulations. Theor. Comput. Sci. 162(2), 245–281 (1996)
14. Stramondo, G., Ciobanu, C.B., Varbanescu, A.L., de Laat, C.: Towards application-centric parallel memories. In: Mencagli, G., et al. (eds.) Euro-Par 2018. LNCS, vol. 11339, pp. 481–493. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-10549-5_38
15. Arsanjani, J.J., Helbich, M., Kainz, W., Boloorani, A.D.: Integration of logistic regression, Markov chain and cellular automata models to simulate urban expansion. Int. J. Appl. Earth Obs. Geoinformation 21, 265–275 (2013)
16. Smith, A.F., Roberts, G.O.: Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. R. Stat. Society. Ser. B (Methodol.) 55, 3–23 (1993)
17. Gilks, W.R., Richardson, S., Spiegelhalter, D.: Markov Chain Monte Carlo in Practice. CRC Press, Boca Raton (1995)
18. Kamvar, S.D., Haveliwala, T.H., Manning, C.D., Golub, G.H.: Extrapolation methods for accelerating PageRank computations. In: Proceedings of the 12th International Conference on World Wide Web, pp. 261–270. ACM (2003)
19. Budnik, P., Kuck, D.: The organization and use of parallel memories. IEEE Trans. Comput. C–20(12), 1566–1569 (1971)
20. Van Voorhis, D.C., Morrin, T.: Memory systems for image processing. IEEE Trans. Comput. C–27(2), 113–125 (1978)
21. Kumagai, T., Sugai, N., Takakuwa, M.: Access methods of a two-dimensional access memory by two-dimensional inverse omega network. Syst. Comput. Jpn. 22(7), 22–31 (1991)
22. Park, J.W.: Multiaccess memory system for attached SIMD computer. IEEE Trans. Comput. 53(4), 439–452 (2004)
23. Lawrie, D.H., Vora, C.R.: The prime memory system for array access. IEEE Trans. Comput. 31(5), 435–442 (1982)
24. Liu, C., Yan, X., Qin, X.: An optimized linear skewing interleave scheme for on-chip multi-access memory systems. In: Proceedings of the 17th ACM Great Lakes Symposium on VLSI, GLSVLSI 2007, pp. 8–13 (2007)
25. Peng, J.y., Yan, X.l., Li, D.x., Chen, L.z.: A parallel memory architecture for video coding. J. Zhejiang Univ. Sci. A 9, 1644–1655 (2008). https://doi.org/10.1631/jzus.A0820052
26. Yang, H.J., Fleming, K., Winterstein, F., Chen, A.I., Adler, M., Emer, J.: Automatic construction of program-optimized FPGA memory networks. In: FPGA 2017, pp. 125–134 (2017)
27. Putnam, A., et al.: Performance and power of cache-based reconfigurable computing. In: ISCA 2009, pp. 395–405 (2009)
28. Adler, M., Fleming, K.E., Parashar, A., Pellauer, M., Emer, J.: Leap scratchpads: automatic memory and cache management for reconfigurable logic. In: FPGA 2011, pp. 25–28 (2011)
29. Chung, E.S., Hoe, J.C., Mai, K.: CoRAM: an in-fabric memory architecture for FPGA-based computing. In: FPGA 2011, pp. 97–106 (2011)
30. Yiannacouras, P., Rose, J.: A parameterized automatic cache generator for FPGAs. In: FPT 2003 (2003)
31. Gil, A.S., Benitez, J.B., Calvino, M.H., Gomez, E.H.: Reconfigurable cache implemented on an FPGA. In: ReConFig 2010 (2010)
32. Mirian, V., Chow, P.: FCache: a system for cache coherent processing on FPGAs. In: FPGA 2012, pp. 233–236 (2012)
33. Cong, J., Liu, B., Neuendorffer, S., Noguera, J., Vissers, K., Zhang, Z.: High-level synthesis for FPGAs: from prototyping to deployment. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 30(4), 473–491 (2011)
34. Wang, Y., Li, P., Cong, J.: Theory and algorithm for generalized memory partitioning in high-level synthesis. In: Proceedings of the 2014 ACM/SIGDA International Symposium on Field-programmable Gate Arrays, FPGA 2014, pp. 199–208. ACM, New York (2014)
35. Putnam, A.R., Bennett, D., Dellinger, E., Mason, J., Sundararajan, P.: CHiMPS: a high-level compilation flow for hybrid CPU-FPGA architectures. In: FPGA 2008, p. 261 (2008)
36. Nalabalapu, P., Sass, R.: Bandwidth management with a reconfigurable data cache. In: IPDPS 2005. IEEE (2005)
37. Kuck, D., Stokes, R.: The Burroughs scientific processor (BSP). IEEE Trans. Comput. C–31(5), 363–376 (1982)
38. Panda, D., Hwang, K.: Reconfigurable vector register windows for fast matrix computation on the orthogonal multiprocessor. In: Proceedings of ASAP, pp. 202–213, May–July 1990
39. Corbal, J., Espasa, R., Valero, M.: MOM: a matrix SIMD instruction set architecture for multimedia applications. In: Proceedings of the SC 1999 Conference, pp. 1–12 (1999)
40. Park, J., Park, S.B., Balfour, J.D., Black-Schaffer, D., Kozyrakis, C., Dally, W.J.: Register pointer architecture for efficient embedded processors. In: Proceedings of DATE, pp. 600–605 (2007)
41. Ramirez, A., et al.: The SARC architecture. IEEE Micro 30(5), 16–29 (2010)
42. Ciobanu, C., Martorell, X., Kuzmanov, G.K., Ramirez, A., Gaydadjiev, G.N.: Scalability evaluation of a polymorphic register file: a CG case study. In: Proceedings of ARCS, pp. 13–25 (2011)
43. Ciobanu, C., Gaydadjiev, G., Pilato, C., Sciuto, D.: The case for polymorphic registers in dataflow computing. Int. J. Parallel Program. 46, 1185–1219 (2018)
44. Avior, A., Calamoneri, T., Even, S., Litman, A., Rosenberg, A.L.: A tight layout of the butterfly network. Theory Comput. Syst. 31(4), 475–488 (1998)
45. https://github.com/storna/hls_polymem

Rectification of Arithmetic Circuits with Craig Interpolants in Finite Fields

Utkarsh Gupta¹, Irina Ilioaea², Vikas Rao¹, Arpitha Srinath¹, Priyank Kalla¹, and Florian Enescu²

¹ Electrical and Computer Engineering, University of Utah, Salt Lake City, UT, USA
{utkarsh.gupta,vikas.k.rao,arpitha.srinath}@utah.edu
² Mathematics and Statistics, Georgia State University, Atlanta, GA, USA

Abstract. When formal verification of arithmetic circuits identifies the presence of a bug in the design, the task of rectification needs to be performed to correct the function implemented by the circuit so that it matches the given specification. In our recent work [26], we addressed the problem of rectification of buggy finite field arithmetic circuits. The problems are formulated by means of a set of polynomials (ideals) and solutions are proposed using concepts from computational algebraic geometry. Single-fix rectification is addressed, i.e. the case where any set of bugs can be rectified at a single net (gate output). We determine if single-fix rectification is possible at a particular net, formulated as the Weak Nullstellensatz test and solved using Gröbner bases. Subsequently, we introduce the concept of Craig interpolants in polynomial algebra over finite fields and show that the rectification function can be computed using algebraic interpolants. This article serves as an extension to our previous work: it provides a formal definition of Craig interpolants in finite fields using algebraic geometry and proves their existence. We also describe the computation of interpolants using elimination ideals with Gröbner bases and prove that our procedure computes the smallest interpolant. As the Gröbner basis algorithm exhibits high computational complexity, we further propose an efficient approach to compute interpolants. Experiments are conducted over a variety of finite field arithmetic circuits which demonstrate the superiority of our approach against SAT-based approaches.

Keywords: Rectification · Arithmetic circuits · Gröbner bases · Craig interpolants

1 Introduction

This research is funded in part by the US National Science Foundation grants CCF-1619370 and CCF-1320385.
© IFIP International Federation for Information Processing 2019. Published by Springer Nature Switzerland AG 2019. N. Bombieri et al. (Eds.): VLSI-SoC 2018, IFIP AICT 561, pp. 79–106, 2019. https://doi.org/10.1007/978-3-030-23425-6_5

The past decade has witnessed extensive investigations into formal verification of arithmetic circuits. Circuits that implement polynomial computations over large bit-vector operands are hard to verify automatically using methods such as SAT/SMT-solvers, decision diagrams, etc. Recent techniques have investigated the use of polynomial algebra and algebraic geometry techniques for their verification. These include verification of integer arithmetic circuits [1–3], integer modulo-arithmetic circuits [4], word-level RTL models of polynomial datapaths [5,6], finite field combinational circuits [7–9], and also sequential designs [10]. A common theme among the above approaches is that designs are modeled as sets of polynomials in rings with coefficients from integers $\mathbb{Z}$, finite integer rings $\mathbb{Z}_{2^k}$, finite fields $\mathbb{F}_{2^k}$, and more recently also from the field of fractions $\mathbb{Q}$. Subsequently, the verification checks are formulated using algebraic geometry [11] (e.g., the Nullstellensatz), and Gröbner basis (GB) theory and technology [12] are used as decision procedures (ideal membership test) for formal verification.

While these techniques are successful in proving correctness or detecting the presence of bugs, the task of post-verification debugging, error diagnosis and rectification of arithmetic circuits has not been satisfactorily addressed. Debugging and rectification of arithmetic circuits is of utmost importance. Arithmetic circuits are mostly custom designed; this raises the potential for errors in the implementation, which have to be eventually rectified. Instead of redesigning the whole circuit, it is desirable to synthesize rectification sub-functions with minimal topological changes to the existing design, a problem often termed partial synthesis. Moreover, the debug, rectification and partial synthesis problem is analogous to that of synthesis for Engineering Change Orders (ECO), where the current circuit implementation should be minimally modified (rectified) to match the ECO-modified specification. The partial synthesis approach also applies here to generate ECO-patches for rectification.
The problem of debug, rectification and ECO synthesis has been addressed for control-dominated applications and random-logic circuits, where the early developments of [13–15] were extended by [16] by formulating the problem as CNF-SAT and computing rectification functions using Craig Interpolants [17] in propositional logic. Craig Interpolation (CI) is a method in automated reasoning to construct and refine abstractions of functions. It is a logical tool to extract concise explanations for the infeasibility of a set of mutually inconsistent statements. As an alternative to quantifier elimination, CI finds application in verification as well as in partial synthesis, and therefore in rectification. In propositional logic, interpolants are defined as follows.

Definition 1.1 (Craig Interpolants). Let $(A, B)$ be a pair of CNF formulas (sets of clauses) such that $A \wedge B$ is unsatisfiable. Then there exists a formula $I$ such that: (i) $A \implies I$; (ii) $I \wedge B$ is unsatisfiable; and (iii) $I$ refers only to the common variables of $A$ and $B$, i.e. $Var(I) \subseteq Var(A) \cap Var(B)$. The formula $I$ is called the interpolant of $(A, B)$.

Despite these advancements in automated debugging and rectification of control and random logic circuits, the aforementioned SAT and CI-based approaches are infeasible for rectification of arithmetic circuits.
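The three conditions of Definition 1.1 can be checked by brute force on a toy instance. The C++ sketch below is only an illustration: the formulas `example_A` and `example_B`, the candidate interpolants, and all function names are assumptions made for this example, and condition (iii) is tested semantically (the value of I must not change when the non-shared variables a and b change) rather than by inspecting the syntax of I. Here A = a ∧ (¬a ∨ c) uses variables {a, c}, B = (¬c ∨ b) ∧ ¬b uses {b, c}, and the only common variable is c.

```cpp
#include <cassert>
#include <functional>

// A propositional formula over the three variables a, b, c.
using Formula = std::function<bool(bool a, bool b, bool c)>;

// True if no assignment to (a, b, c) satisfies f.
bool is_unsat(const Formula& f) {
    for (int m = 0; m < 8; ++m)
        if (f(m & 1, (m >> 1) & 1, (m >> 2) & 1)) return false;
    return true;
}

// f => g holds iff f AND NOT g is unsatisfiable.
bool implies(const Formula& f, const Formula& g) {
    return is_unsat([&](bool a, bool b, bool c) { return f(a, b, c) && !g(a, b, c); });
}

// Semantic stand-in for condition (iii): the value of f depends on c only.
bool depends_only_on_c(const Formula& f) {
    for (int c = 0; c < 2; ++c)
        for (int m = 1; m < 4; ++m)
            if (f(m & 1, (m >> 1) & 1, c) != f(false, false, c)) return false;
    return true;
}

// Checks conditions (i)-(iii) of Definition 1.1 for the pair (A, B).
bool is_interpolant(const Formula& A, const Formula& B, const Formula& I) {
    // Precondition of the definition: A AND B must be unsatisfiable.
    assert(is_unsat([&](bool a, bool b, bool c) { return A(a, b, c) && B(a, b, c); }));
    return implies(A, I)
        && is_unsat([&](bool a, bool b, bool c) { return I(a, b, c) && B(a, b, c); })
        && depends_only_on_c(I);
}

// Toy instance (assumed for illustration).
bool example_A(bool a, bool /*b*/, bool c) { return a && (!a || c); }
bool example_B(bool /*a*/, bool b, bool c) { return (!c || b) && !b; }
bool candidate_c(bool /*a*/, bool /*b*/, bool c) { return c; }  // I = c
bool candidate_a(bool a, bool /*b*/, bool /*c*/) { return a; }  // I = a
```

With these definitions, `is_interpolant(example_A, example_B, candidate_c)` holds, whereas `candidate_a` is rejected: A implies a, but a ∧ B is satisfiable and a is not a common variable.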

1.1 Problem Description, Objectives, and Contributions

We address the problem of rectification of buggy finite field arithmetic circuits. Our problem setup is as follows:

– A specification model (Spec) is given either as a polynomial description $f_{spec}$ over a finite field, or as a golden model of a finite field arithmetic circuit. The finite field considered is the field of $2^k$ elements (denoted by $\mathbb{F}_{2^k}$), where $k$ is the operand-width (bit-vector word length). An implementation (Impl) circuit $C$ is also given.
– Equivalence checking is performed between the Spec and the Impl circuit $C$, and the presence of a bug is detected. No restrictions on the number, type, or locations of the bugs are assumed.
– We assume that error-diagnosis has been performed, and a subset $X$ of the nets of the circuit is identified as potential rectification locations, called target nets.

Given the Spec, the buggy Impl circuit $C$, and the set $X$ of potential rectifiable locations, our objective is to determine whether or not the buggy circuit can be rectified at one particular net (location) $x_i \in X$. This is called single-fix rectification in literature [16]. If a single-fix rectification does exist at net $x_i$ in the buggy circuit, then our subsequent objective is to derive a polynomial function $U(X_{PI})$ in terms of the set of primary input variables $X_{PI}$. This polynomial needs to be further translated (synthesized) into a logic sub-circuit such that $x_i = U(X_{PI})$ acts as the rectification function for the buggy Impl circuit $C$ so that this modified $C$ matches the specification.

Given the above objective, this article makes the following specific contributions to solve the debug and rectification problem.

1. We formulate the test for single-fix rectifiability at a net $x_i$ using concepts and techniques from algebraic geometry [12].
   – The problem is modeled in polynomial rings of the form $\mathbb{F}_{2^k}[x_1, \dots, x_n]$, where $k$ corresponds to the operand-width and the variables $x_1, \dots, x_n$ are the nets of the circuit.
   – The rectification test is formulated using elimination ideals and the Weak Nullstellensatz, and solved using Gröbner basis as a decision procedure.
2. If rectification is feasible at $x_i$, then we compute a rectification function $x_i = U(X_{PI})$.
   – We show that the rectification function $U(X_{PI})$ can be determined based on the concept of Craig interpolants in algebraic geometry. While Craig interpolation is a well-studied concept in propositional and first-order logic theories, to the best of our knowledge, it has not been investigated in algebraic geometry.
   – We define Craig interpolants in polynomial algebra in finite fields and prove their existence. We also show how to compute such an interpolant using Gröbner bases.
3. The rectification function $U(X_{PI})$ obtained using Craig interpolants is a polynomial in $\mathbb{F}_{2^k}[x_1, \dots, x_n]$. We subsequently show how a logic circuit can be obtained from this polynomial.
4. We use Gröbner basis not only as a decision procedure for the rectification test, but also as a quantification procedure for computing the rectification function. Computation of Gröbner bases exhibits very high complexity. To make our approach scalable, we further show how to exploit the topological structure of the given circuit to improve this computation.

We demonstrate the application of our techniques to rectify finite field arithmetic circuits with large operand sizes, where conventional SAT-solver based rectification approaches are infeasible.

The paper is organized as follows. The following section reviews previous work in automated diagnosis and rectification, and recent applications of Craig interpolants. Section 3 describes concepts from computer algebra and algebraic geometry. Section 4 describes an equivalence checking framework using the Weak Nullstellensatz over finite fields. Section 5 presents results that ascertain the single-fix rectifiability of the circuit. Section 6 introduces Craig interpolants in finite fields using Gröbner basis methods, and gives a procedure for obtaining the rectification function through algebraic interpolants. Section 7 addresses improvements to the Gröbner basis computation. Section 8 presents our experimental results and Sect. 9 concludes the paper.
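As a small illustration of the single-fix rectification question posed in the problem setup above, the following C++ sketch decides it by exhaustive enumeration on a bit-level toy circuit. This is deliberately not the algebraic procedure of the article: it enumerates all primary input assignments, which is only feasible for a handful of inputs. The circuit, the type aliases and the function names are hypothetical. The implementation is modeled with the target net cut open, so that its value can be forced, and the function reports whether, for every input, some value of the target net makes the output agree with the specification. When it does, the collected truth table plays the role of $U(X_{PI})$.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Specification: output as a function of the packed primary inputs.
using Spec = std::function<bool(unsigned inputs)>;
// Implementation with the target net cut: its value is supplied from outside.
using CutImpl = std::function<bool(unsigned inputs, bool target)>;

// Returns true if a single fix at the target net exists. On success, u_table[x]
// holds a value of the target net that makes the circuit match the Spec on
// input x (0 is chosen when both values work, i.e. on don't-care inputs).
bool single_fix(const Spec& spec, const CutImpl& impl, unsigned n_inputs,
                std::vector<bool>& u_table) {
    assert(n_inputs < 20);  // brute force only; keep the enumeration small
    const unsigned rows = 1u << n_inputs;
    u_table.assign(rows, false);
    for (unsigned x = 0; x < rows; ++x) {
        if (impl(x, false) == spec(x)) {
            u_table[x] = false;
        } else if (impl(x, true) == spec(x)) {
            u_table[x] = true;
        } else {
            return false;  // no value at the target net repairs input x
        }
    }
    return true;
}

// Toy example (assumed): inputs a, b, c are bits 0, 1, 2 of x.
inline bool bit(unsigned x, unsigned k) { return ((x >> k) & 1u) != 0; }

// Spec: three-input parity a XOR b XOR c.
bool spec_parity(unsigned x) { return (bit(x, 0) != bit(x, 1)) != bit(x, 2); }

// Impl with z = n1 XOR c; the buggy gate driving n1 is cut away.
bool impl_cut_at_n1(unsigned x, bool n1) { return n1 != bit(x, 2); }

// A different Impl with z = n1 AND c: n1 cannot influence z when c = 0.
bool impl_masked(unsigned x, bool n1) { return n1 && bit(x, 2); }
```

For the first implementation, `single_fix` succeeds and the returned table equals a XOR b, the function that the buggy gate should have computed. For the second, it fails, since no value at n1 can produce z = 1 when c = 0.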

2 Review of Previous Work

Automated diagnosis and rectification of digital circuits has been addressed in [13,18]. The paper [14] presents algorithms for synthesizing Engineering Change Order (ECO) patches. The partial equivalence checking problem, addressed in [15,19], checks whether a partial implementation can be extended to a complete design so that it becomes equivalent to a given specification. The partial implementation comprises black-boxes for which some functions $f_i$ need to be computed. The problem is formulated as Quantified Boolean Formula (QBF) solving: does there exist a function $f_i$ such that, for all primary input assignments, the Impl circuit is equivalent to the Spec circuit? An incremental SAT-solving based approach has been presented in [20] in lieu of solving the QBF problem. This approach has been extended in [21,22] to generate rectification functions when the Impl circuit topology is fixed. The use of Craig interpolation as an alternative to quantifier elimination has been presented in [16,23,24] for ECO applications. The single-fix rectification function approach in [23] has been extended in [16] to generate multiple partial-fix functions. Recently, an efficient approach to resource-aware ECO patch generation has been presented in [25]. As these approaches are SAT based, they work well for random logic circuits but are not efficient for arithmetic circuits. In contrast, this article presents a word-level formulation for single-fix rectification using algebraic geometry techniques. Computer algebra has been utilized for circuit debugging and rectification in [27–29]. These approaches rely heavily on the structure of the circuit


for debugging, and in general, are incomplete. If the arithmetic circuit contains redundancies, the approach may not identify the buggy gate, nor compute the rectification function. On the other hand, our approach is complete, as it can always compute a single-fix rectification function, if one exists. Although our polynomial algebra based approach is applicable to any circuit in general, it is more efficient and practical for finite field arithmetic circuits. The concept of Craig interpolants has been extensively investigated in many first order theories for various applications in synthesis and verification. Given the pair (A, B) of two mutually inconsistent formulas (cf. Definition 1.1) and a proof of their unsatisfiability, a procedure called the interpolation system constructs the interpolant in linear time and space in the size of the proof [30]. As the abilities of SAT solvers for proof refutation have improved, interpolants have been exploited as abstractions in various problems that can be formulated as unsatisfiable instances, e.g. model checking [30], logic synthesis [31], etc. Their use as abstractions has also been replicated in other (combinations of) theories [32–35]. However, the problem has been insufficiently investigated over polynomial ideals in finite fields from an algebraic geometry perspective. In that regard, the works that come closest to ours are by Gao et al. [36] and [37]. While they do not address the interpolation problem per se, they do describe important results on the Nullstellensatz, projections of varieties and quantifier elimination over finite fields that we utilize to develop the theory and algorithms for our approach. Moreover, prior to debugging, our approach requires that verification be performed to detect the presence of a bug. For this purpose, we make use of techniques presented in [7,8,38]. We have described the notion of Craig interpolants in finite fields in our work [26].
This article is an extended version of that work where we formally define Craig interpolants in finite fields and prove their existence. Moreover, we describe a procedure for computing an interpolant and prove that the computed interpolant is the smallest. The computation of interpolants uses Gröbner basis based algorithms which have high computational complexity. In contrast to [26], we further propose an efficient approach to compute interpolants based on the given circuit topology.

3 Preliminaries: Notation and Background Results

Let $\mathbb{F}_q$ denote the finite field of $q$ elements where $q = 2^k$, $\overline{\mathbb{F}_q}$ be its algebraic closure, and $k$ be the operand width. The field $\mathbb{F}_{2^k}$ is constructed as $\mathbb{F}_{2^k} \equiv \mathbb{F}_2[x] \pmod{P(x)}$, where $\mathbb{F}_2 = \{0, 1\}$, and $P(x)$ is a primitive polynomial of degree $k$. Let $\alpha$ be a primitive element of $\mathbb{F}_{2^k}$, so that $P(\alpha) = 0$. Let $R = \mathbb{F}_q[x_1, \dots, x_n]$ be the polynomial ring in $n$ variables $x_1, \dots, x_n$, with coefficients from $\mathbb{F}_q$. A monomial is a power product of variables $x_1^{e_1} \cdot x_2^{e_2} \cdots x_n^{e_n}$, where $e_i \in \mathbb{Z}_{\ge 0}$, $i \in \{1, \dots, n\}$. A polynomial $f \in R$ is written as a finite sum of terms $f = c_1 X_1 + c_2 X_2 + \cdots + c_t X_t$, where $c_1, \dots, c_t$ are coefficients and $X_1, \dots, X_t$ are monomials. A monomial order $>$ (or a term order) is imposed on the ring, i.e. a total order and a well-order on all the monomials of $R$ s.t. multiplication with


U. Gupta et al.

another monomial preserves the order. Then the monomials of all polynomials $f = c_1 X_1 + c_2 X_2 + \cdots + c_t X_t$ are ordered w.r.t. $>$, such that $X_1 > X_2 > \cdots > X_t$, where $\mathrm{lm}(f) = X_1$, $\mathrm{lt}(f) = c_1 X_1$, and $\mathrm{lc}(f) = c_1$ are called the leading monomial, leading term, and leading coefficient of $f$, respectively. In this work, we employ lexicographic (lex) term orders (see Definition 1.4.3 in [12]).

Polynomial Reduction via Division: Let $f, g$ be polynomials. If $\mathrm{lm}(f)$ is divisible by $\mathrm{lm}(g)$, then we say that $f$ is reducible to $r$ modulo $g$, denoted $f \xrightarrow{g} r$, where $r = f - \frac{\mathrm{lt}(f)}{\mathrm{lt}(g)} \cdot g$. This operation forms the core operation of polynomial division algorithms and it has the effect of canceling the leading term of $f$. Similarly, $f$ can be reduced w.r.t. a set of polynomials $F = \{f_1, \dots, f_s\}$ to obtain a remainder $r$. This reduction is denoted as $f \xrightarrow{F}_+ r$, and the remainder $r$ has the property that no term in $r$ is divisible (i.e. cannot be canceled) by the leading term of any polynomial $f_i$ in $F$.

We model the given circuit $C$ by a set of multivariate polynomials $f_1, \dots, f_s \in \mathbb{F}_{2^k}[x_1, \dots, x_n]$; here $x_1, \dots, x_n$ denote the nets (signals) of the circuit. Every Boolean logic gate of $C$ is represented by a polynomial in $\mathbb{F}_2$, as $\mathbb{F}_2 \subset \mathbb{F}_{2^k}$. This is shown below. Note that in $\mathbb{F}_{2^k}$, $-1 = +1$.

$$
\begin{aligned}
z = \neg a &\;\rightarrow\; z + a + 1 \pmod 2 \\
z = a \wedge b &\;\rightarrow\; z + a \cdot b \pmod 2 \\
z = a \vee b &\;\rightarrow\; z + a + b + a \cdot b \pmod 2 \\
z = a \oplus b &\;\rightarrow\; z + a + b \pmod 2
\end{aligned}
\qquad (1)
$$
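As a quick sanity check of Eq. (1), the following minimal sketch enumerates all Boolean assignments and confirms that each gate polynomial evaluates to 0 (mod 2) exactly when the output z agrees with the gate function. The names `GATES` and `polynomial_matches_gate` are illustrative and not part of the original work.

```python
from itertools import product

# Each entry pairs a gate function with its polynomial from Eq. (1).
# The polynomial must evaluate to 0 (mod 2) exactly on consistent (z, a, b).
GATES = {
    "NOT": (lambda a, b: 1 - a, lambda z, a, b: (z + a + 1) % 2),
    "AND": (lambda a, b: a & b, lambda z, a, b: (z + a * b) % 2),
    "OR":  (lambda a, b: a | b, lambda z, a, b: (z + a + b + a * b) % 2),
    "XOR": (lambda a, b: a ^ b, lambda z, a, b: (z + a + b) % 2),
}

def polynomial_matches_gate(name):
    """True if the gate polynomial vanishes exactly where z equals the gate output."""
    gate, poly = GATES[name]
    return all(
        (poly(z, a, b) == 0) == (z == gate(a, b))
        for z, a, b in product((0, 1), repeat=3)
    )

results = {name: polynomial_matches_gate(name) for name in GATES}
print(results)
```

The NOT entry simply ignores its second argument, so that all four gates share one calling convention.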

Definition 3.1 (Ideal of polynomials). Given a set of polynomials $F = \{f_1, \dots, f_s\}$ in $\mathbb{F}_q[x_1, \dots, x_n]$, the ideal $J \subseteq R$ generated by $F$ is

$$J = \langle f_1, \dots, f_s \rangle = \Big\{ \sum_{i=1}^{s} h_i \cdot f_i : h_i \in \mathbb{F}_q[x_1, \dots, x_n] \Big\}.$$

The polynomials $f_1, \dots, f_s$ form the basis or the generators of $J$. Let $a = (a_1, \dots, a_n) \in \mathbb{F}_q^n$ be a point in the affine space, and $f$ a polynomial in $R$. If $f(a) = 0$, we say that $f$ vanishes on $a$. In verification, we have to analyze the set of all common zeros of the polynomials of $F$ that lie within the field $\mathbb{F}_q$. In other words, we need to analyze solutions to the system of polynomial equations $f_1 = f_2 = \cdots = f_s = 0$. This zero set is called the variety. It depends not just on the given set of polynomials but rather on the ideal generated by them. We denote it by $V(J) = V(f_1, \dots, f_s)$, and it is defined as follows:

Definition 3.2 (Variety of an ideal). Given a set of polynomials $F = \{f_1, \dots, f_s\}$ in $\mathbb{F}_q[x_1, \dots, x_n]$, their variety is

$$V(J) = V(\langle f_1, \dots, f_s \rangle) = \{a \in \mathbb{F}_q^n : \forall f \in J,\ f(a) = 0\}.$$


We denote the complement of a variety, $\mathbb{F}_q^n \setminus V(J)$, by $\overline{V(J)}$.

The Weak Nullstellensatz: To ascertain whether $V(J) = \emptyset$, we employ the Weak Nullstellensatz over $\mathbb{F}_q$, for which we use the following notations.

Definition 3.3 (Sum and Product of Ideals). Given two ideals $J_1 = \langle f_1, \dots, f_s \rangle$, $J_2 = \langle h_1, \dots, h_r \rangle$, their sum and product are

$$J_1 + J_2 = \langle f_1, \dots, f_s, h_1, \dots, h_r \rangle, \qquad J_1 \cdot J_2 = \langle f_i \cdot h_j : 1 \le i \le s,\ 1 \le j \le r \rangle.$$

Ideals and varieties are dual concepts: $V(J_1 + J_2) = V(J_1) \cap V(J_2)$, and $V(J_1 \cdot J_2) = V(J_1) \cup V(J_2)$. Moreover, if $J_1 \subseteq J_2$ then $V(J_1) \supseteq V(J_2)$. For all elements $\alpha \in \mathbb{F}_q$, $\alpha^q = \alpha$. Therefore, the polynomial $x^q - x$ vanishes everywhere in $\mathbb{F}_q$, and is called the vanishing polynomial of the field. Let $J_0 = \langle x_1^q - x_1, \dots, x_n^q - x_n \rangle$ be the ideal of all vanishing polynomials in $R$. Then the variety of ideal $J_0$ is the entire affine space, i.e. $V(J_0) = \mathbb{F}_q^n$. Moreover, by extending any ideal $J \subseteq R = \mathbb{F}_q[x_1, \dots, x_n]$ by the ideal of all vanishing polynomials in $R$, the variety is restricted to points within $\mathbb{F}_q^n$, i.e. $V(J + J_0) \subset \mathbb{F}_q^n$.

Theorem 3.1 (The Weak Nullstellensatz over finite fields (from Theorem 3.3 in [37])). For a finite field $\mathbb{F}_q$ and the ring $R = \mathbb{F}_q[x_1, \dots, x_n]$, let $J = \langle f_1, \dots, f_s \rangle \subseteq R$, and let $J_0 = \langle x_1^q - x_1, \dots, x_n^q - x_n \rangle$ be the ideal of vanishing polynomials. Then $V(J) = \emptyset \iff 1 \in J + J_0$.

To determine whether $V(J) = \emptyset$, we need to test whether or not the unit element 1 is a member of the ideal $J + J_0$. For this ideal membership test, we need to compute a Gröbner basis of $J + J_0$.

Gröbner Basis of Ideals: An ideal may have many different sets of generators: $J = \langle f_1, \dots, f_s \rangle = \cdots = \langle g_1, \dots, g_t \rangle$. Given a non-zero ideal $J$, a Gröbner basis (GB) for $J$ is one such set $G = \{g_1, \dots, g_t\}$ that possesses important properties that allow many polynomial decision problems to be solved.

Definition 3.4 (Gröbner basis [12]). For a monomial ordering $>$, a set of non-zero polynomials $G = \{g_1, g_2, \dots, g_t\}$ contained in an ideal $J$ is called a Gröbner basis of $J$ iff $\forall f \in J$, $f \neq 0$, there exists $g_i \in \{g_1, \dots, g_t\}$ such that $\mathrm{lm}(g_i)$ divides $\mathrm{lm}(f)$; i.e., $G = GB(J) \Leftrightarrow \forall f \in J : f \neq 0,\ \exists g_i \in G : \mathrm{lm}(g_i) \mid \mathrm{lm}(f)$.

Then $J = \langle G \rangle$ holds and so $G = GB(J)$ forms a basis for $J$. Buchberger's algorithm [39] is used to compute a Gröbner basis.
The algorithm, shown in Algorithm 1, takes as input the set of polynomials $F = \{f_1, \dots, f_s\}$ and computes their Gröbner basis $G = \{g_1, \dots, g_t\}$ such that $J = \langle F \rangle = \langle G \rangle$, where the variety $V(\langle F \rangle) = V(\langle G \rangle) = V(J)$. In the algorithm,

$$\mathrm{Spoly}(f_i, f_j) = \frac{L}{\mathrm{lt}(f_i)} \cdot f_i - \frac{L}{\mathrm{lt}(f_j)} \cdot f_j \qquad (2)$$

where $L = \mathrm{LCM}(\mathrm{lm}(f_i), \mathrm{lm}(f_j))$.


Algorithm 1. Buchberger's Algorithm
Require: $F = \{f_1, \dots, f_s\}$
Ensure: $G = \{g_1, \dots, g_t\}$
1: $G := F$;
2: while $G' \neq G$ do
3:     $G' := G$
4:     for each pair $\{f_i, f_j\}$, $i \neq j$ in $G'$ do
5:         $\mathrm{Spoly}(f_i, f_j) \xrightarrow{G'}_+ h$
6:         if $h \neq 0$ then
7:             $G := G \cup \{h\}$

A GB may contain redundant polynomials, and it can be reduced to eliminate these redundant polynomials from the basis. A reduced GB is a canonical representation of the ideal. Moreover, when $1 \in J$, then $G = \text{reduced } GB(J) = \{1\}$. Therefore, to check if $V(J) = \emptyset$, from Theorem 3.1 we compute a reduced GB $G$ of $J + J_0$ and see if $G = GB(J + J_0) = \{1\}$. If so, the generators of ideal $J$ do not have any common zeros in $\mathbb{F}_q^n$.

Craig Interpolation: The Weak Nullstellensatz is the polynomial analog of SAT checking. For UNSAT problems, the formal logic and verification communities have explored the notion of abstraction of functions by means of Craig interpolants, which have been applied to circuit rectification [16]. Given the pair (A, B) and their refutation proof, a procedure called the interpolation system constructs the interpolant in linear time and space in the size of the proof [30]. We introduce the notion of Craig interpolants in polynomial algebra over finite fields, based on the results of the Nullstellensatz. We make use of the following definitions and theorems for describing the results on Craig interpolants in finite fields.

Definition 3.5. Given an ideal $J \subset R$ and $V(J) \subseteq \mathbb{F}_q^n$, the ideal of polynomials that vanish on $V(J)$ is $I(V(J)) = \{f \in R : \forall a \in V(J),\ f(a) = 0\}$.

If $I_1 \subset I_2$ are ideals then $V(I_1) \supset V(I_2)$, and similarly if $V_1 \subset V_2$ are varieties, then $I(V_1) \supset I(V_2)$.

Definition 3.6. For any ideal $J \subset R$, the radical of $J$ is defined as $\sqrt{J} = \{f \in R : \exists m \in \mathbb{N} \text{ s.t. } f^m \in J\}$.

When $J = \sqrt{J}$, then $J$ is called a radical ideal. Over algebraically closed fields, the Strong Nullstellensatz establishes the correspondence between radical ideals and varieties. Over finite fields, it has a special form.

Lemma 3.1 (From [36]). For an arbitrary ideal $J \subset \mathbb{F}_q[x_1, \dots, x_n]$, and $J_0 = \langle x_1^q - x_1, \dots, x_n^q - x_n \rangle$, the ideal $J + J_0$ is radical; i.e. $\sqrt{J + J_0} = J + J_0$.

Theorem 3.2 (The Strong Nullstellensatz over finite fields (Theorem 3.2 in [36])). For any ideal $J \subset \mathbb{F}_q[x_1, \dots, x_n]$, $I(V(J)) = J + J_0$.


Definition 3.7. Given an ideal $J \subset \mathbb{F}_q[x_1, \dots, x_n]$, the $l$-th elimination ideal $J_l$ is an ideal in $R$ defined as $J_l = J \cap \mathbb{F}_q[x_{l+1}, \dots, x_n]$.

Theorem 3.3 (Elimination Theorem (from Theorem 2.3.4 [12])). Given an ideal $J \subset R$ and its GB $G$ w.r.t. the lexicographical (lex) order on the variables where $x_1 > x_2 > \cdots > x_n$, then for every $0 \le l \le n$ we denote by $G_l$ the GB of the $l$-th elimination ideal of $J$ and compute it as $G_l = G \cap \mathbb{F}_q[x_{l+1}, \dots, x_n]$.

$J_l$ is called the $l$-th elimination ideal as it eliminates the first $l$ variables from $J$.

Example 3.1 (from [11]). Consider polynomials $f_1 : x^2 - y - z - 1$, $f_2 : x - y^2 - z - 1$, and $f_3 : x - y - z^2 - 1$ and the ideal $J = \langle f_1, f_2, f_3 \rangle \subset \mathbb{C}[x, y, z]$. $GB(J)$ with lex term order $x > y > z$ equals $\{g_1 : x - y - z^2 - 1,\ g_2 : y^2 - y - z^2 - z,\ g_3 : 2yz^2 - z^4 - z^2,\ g_4 : z^6 - 4z^4 - 4z^3 - z^2\}$. Then, the GB of the 2nd elimination ideal of $J$ is $G_2 = GB(J) \cap \mathbb{C}[z] = \{g_4\}$ and the GB of the 1st elimination ideal is $G_1 = GB(J) \cap \mathbb{C}[y, z] = \{g_2, g_3, g_4\}$.

Definition 3.8. Given an ideal $J = \langle f_1, \dots, f_s \rangle \subset R$ and its variety $V(J) \subset \mathbb{F}_q^n$, the $l$-th projection of $V(J)$, denoted $Pr_l(V(J))$, is the mapping $Pr_l(V(J)) : \mathbb{F}_q^n \to \mathbb{F}_q^{n-l}$, $Pr_l(a_1, \dots, a_n) = (a_{l+1}, \dots, a_n)$, for every $a = (a_1, \dots, a_n) \in V(J)$.

In a general setting, the projection of a variety is a subset of the variety of an elimination ideal: $Pr_l(V(J)) \subseteq V(J_l)$. However, operating over finite fields, when the ideals contain the vanishing polynomials, the above set inclusion turns into an equality.

Lemma 3.2 (Lemma 3.4 in [36]). Given an ideal $J \subset R$ that contains the vanishing polynomials of the field, then $Pr_l(V(J)) = V(J_l)$, i.e. the $l$-th projection of the variety of ideal $J$ is equal to the variety of its $l$-th elimination ideal.

4 Algebraic Miter for Equivalence Checking

Given $f_{spec}$ as the Spec polynomial and an Impl circuit $C$, we need to construct an algebraic miter between $f_{spec}$ and $C$. For equivalence checking, we need to prove that the miter is infeasible. Figure 1 depicts how a word-level algebraic miter is set up. Suppose that $A = \{a_0, \dots, a_{k-1}\}$ and $Z = \{z_0, \dots, z_{k-1}\}$ denote the $k$-bit primary inputs and outputs of the finite field circuit, respectively. Then $A = \sum_{i=0}^{k-1} a_i \alpha^i$, $Z = \sum_{i=0}^{k-1} z_i \alpha^i$ correspond to polynomials that relate the word-level and bit-level inputs and outputs of $C$. Here $\alpha$ is the primitive element of $\mathbb{F}_{2^k}$. Let $Z_S$ be the word-level output for $f_{spec}$, which computes some polynomial function $\mathcal{F}(A)$ of $A$, so that $f_{spec} : Z_S + \mathcal{F}(A)$. The word-level outputs $Z, Z_S$ are mitered to check if for all inputs, $Z \neq Z_S$ is infeasible.
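The relation between bit-level nets and word-level variables can be made concrete with a small sketch of arithmetic in $\mathbb{F}_{2^k}$. It assumes a particular encoding that the paper does not prescribe: a field element is a k-bit integer whose bit i is the coefficient of $\alpha^i$, and the helper names `gf_mul`, `gf_pow` and `word` are hypothetical.

```python
def gf_mul(x, y, k, poly):
    """Multiply x and y in F_{2^k}.
    Elements are k-bit ints; bit i is the coefficient of alpha^i.
    `poly` encodes P(x) including its x^k term, e.g. 0b111 for x^2 + x + 1."""
    result = 0
    while y:
        if y & 1:
            result ^= x          # addition in characteristic 2 is XOR
        y >>= 1
        x <<= 1
        if x >> k:               # degree reached k: reduce modulo P(x)
            x ^= poly
    return result

def gf_pow(x, e, k, poly):
    """Raise x to the non-negative integer power e in F_{2^k}."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, x, k, poly)
    return result

def word(bits):
    """Pack bits (a_0, ..., a_{k-1}) into the word A = sum a_i * alpha^i."""
    return sum(b << i for i, b in enumerate(bits))

K, P = 2, 0b111                  # F_4 with P(x) = x^2 + x + 1
ALPHA = word((0, 1))             # the element alpha itself
alpha_squared = gf_mul(ALPHA, ALPHA, K, P)
print(alpha_squared)             # alpha^2 = alpha + 1
```

With this encoding, `gf_pow(x, 2**K, K, P) == x` holds for every element x, which is the statement that $x^q - x$ vanishes on the whole field.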

Fig. 1. Word-level miter (blocks shown: Specification Polynomial, Circuit Implementation C, Word-Level Miter)

The logic gates of $C$ are modeled as the set of polynomials $F = \{f_1, \dots, f_s\}$ according to Eq. (1). In finite fields, the disequality $Z \neq Z_S$ can be modeled as a single polynomial $f_m$, called the miter polynomial, where $f_m = t \cdot (Z - Z_S) - 1$, and $t$ is introduced as a free variable. If $Z = Z_S$, then $Z - Z_S = 0$. So $f_m : t \cdot 0 + 1 = 0$ has no solutions (miter is infeasible). Whereas if for some input $A$, $Z \neq Z_S$, then $Z - Z_S \neq 0$. Let $t^{-1} = (Z - Z_S) \neq 0$. Then $f_m : t \cdot t^{-1} - 1 = 0$ has a solution as $(t, t^{-1})$ are multiplicative inverses of each other. Thus the miter becomes feasible.

Corresponding to the miter, we construct the ideal $J = \langle f_{spec}, f_1, \dots, f_s, f_m \rangle$. In our formulation, we need to also include the ideal $J_0$ corresponding to the vanishing polynomials in variables $Z, Z_S, A, t$, and $x_i$; here $Z, Z_S, A, t$ are the word-level variables that take values in $\mathbb{F}_{2^k}$, and $x_i$ corresponds to the bit-level (Boolean) variables in the miter. In fact, it was shown in [7] that in $J_0$ it is sufficient to include vanishing polynomials for only the primary input bits ($x_i \in X_{PI}$). Therefore, $J_0 = \langle x_i^2 - x_i : x_i \in X_{PI} \rangle$.

In this way, equivalence checking using the algebraic model is solved as follows: Construct an ideal $J = \langle f_{spec}, f_1, \dots, f_s, f_m \rangle$, as described above. Add to it the ideal $J_0 = \langle x_i^2 - x_i : x_i \in X_{PI} \rangle$. Determine if the variety $V(J + J_0) = \emptyset$, i.e. if reduced $GB(J + J_0) = \{1\}$. If $V(J + J_0) = \emptyset$, the miter is infeasible, and $C$ implements $f_{spec}$. If $V(J + J_0) \neq \emptyset$, the miter is feasible, and there exists a bug in the design.

Fig. 2. Correct (a) and buggy (b) 2-bit modulo multiplier circuit implementations

Example 4.1. Consider a modulo multiplier with output $Z$ and inputs $A, B$. The Spec polynomial is given as $f_{spec} : Z + A \cdot B \pmod{P(X)}$, where $P(X)$ is a primitive polynomial of the field. An implementation of such a multiplier with operand ($Z, A, B$) bit-width = 2 is shown in Fig. 2(a). Now suppose that the designer has introduced a bug, and the XOR gate with output net $r_0$ has been replaced with an AND gate in the actual implementation, as in the circuit of Fig. 2(b). The polynomials for the gates of the correct circuit implementation are

$$
\begin{aligned}
f_1 &: c_0 + a_0 \cdot b_0, & f_2 &: c_1 + a_0 \cdot b_1, & f_3 &: c_2 + a_1 \cdot b_0, & f_4 &: c_3 + a_1 \cdot b_1, \\
f_5 &: r_0 + c_1 + c_2, & f_6 &: z_0 + c_0 + c_3, & f_7 &: z_1 + r_0 + c_3,
\end{aligned}
$$

whereas for the buggy implementation, the polynomial $f_5$ is replaced by $f_5' : r_0 + c_1 c_2$. The problem is modeled over $\mathbb{F}_4$ and let $\alpha$ be a primitive element of $\mathbb{F}_4$. The word-level polynomials are $f_8 : Z + z_0 + z_1 \alpha$, $f_9 : A + a_0 + a_1 \alpha$, and $f_{10} : B + b_0 + b_1 \alpha$. The specification polynomial is $f_{spec} : Z_S + AB$. We create a miter polynomial against this specification as $f_m : t(Z - Z_S) - 1$.

To perform equivalence checking of the correct implementation and the specification polynomial, we construct the ideal $J = \langle f_{spec}, f_1, \dots, f_5, \dots, f_{10}, f_m \rangle$. Computing the GB of $J + J_0$ ($J_0$ is the ideal of vanishing polynomials) results in $\{1\}$, implying that the circuit in Fig. 2(a) is equivalent to the specification. However, computing the GB of the ideal $J' + J_0$, where $J' = \langle f_{spec}, f_1, \dots, f_5', \dots, f_{10}, f_m \rangle$, results in a set of polynomials $G = \{g_1, \dots, g_t\} \neq \{1\}$, implying the presence of a bug(s) in the design.
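For a circuit this small, the outcome of Example 4.1 can be cross-checked without any Gröbner basis computation. The sketch below is a brute-force stand-in, not the algebraic procedure of the paper: it simulates both circuits of Fig. 2 on all 16 input assignments, compares them against $A \cdot B$ in $\mathbb{F}_4$, and for every mismatch exhibits the value of $t$ that solves the miter polynomial $f_m$. The bit encoding (bit i is the coefficient of $\alpha^i$) and all function names are assumptions made for illustration.

```python
from itertools import product

def gf4_mul(x, y):
    """Multiply in F_4 = F_2[x]/(x^2 + x + 1); bit i is the coefficient of alpha^i."""
    r = 0
    for i in range(2):
        if (y >> i) & 1:
            r ^= x << i
    if r & 0b100:                # reduce with alpha^2 = alpha + 1
        r ^= 0b111
    return r

def multiplier(a0, a1, b0, b1, buggy=False):
    """Gate-level model following f_1..f_7; the bug swaps the XOR at r0 for an AND."""
    c0, c1, c2, c3 = a0 & b0, a0 & b1, a1 & b0, a1 & b1
    r0 = (c1 & c2) if buggy else (c1 ^ c2)
    z0, z1 = c0 ^ c3, r0 ^ c3
    return z0 | (z1 << 1)        # word-level Z = z0 + z1 * alpha

def miter_witnesses(buggy):
    """Inputs with Z != Z_S, paired with t = (Z - Z_S)^(-1), a solution of f_m."""
    found = []
    for a0, a1, b0, b1 in product((0, 1), repeat=4):
        z = multiplier(a0, a1, b0, b1, buggy)
        z_spec = gf4_mul(a0 | (a1 << 1), b0 | (b1 << 1))
        d = z ^ z_spec           # subtraction is XOR in characteristic 2
        if d:
            t = next(t for t in range(1, 4) if gf4_mul(t, d) == 1)
            found.append(((a0, a1, b0, b1), t))
    return found

print(miter_witnesses(buggy=False))   # empty: the miter is infeasible
print(len(miter_witnesses(buggy=True)))
```

An empty witness list corresponds to $V(J + J_0) = \emptyset$, while a non-empty list corresponds to a feasible miter, i.e. a reduced Gröbner basis different from $\{1\}$.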

5 Formulating the Rectification Check

Equivalence checking is performed between the Spec and Impl circuit $C$, and it reveals the presence of a bug in the design. Post-verification, we assume that error diagnosis has been performed, and a set of nets $X$ has been identified as potential single-fix rectifiable locations. While the nets in $X$ might be target nets for single-fix, the circuit may or may not be rectifiable at any $x_i \in X$. We have to first ascertain that the circuit is indeed single-fix rectifiable at some $x_i \in X$, and subsequently compute a rectification function $U(X_{PI})$, so that $x_i = U(X_{PI})$ rectifies the circuit at that net.

5.1 Single Fix Rectification

Using the Weak Nullstellensatz (Theorem 3.1), we formulate the test for rectifiability of C at a net xi in the circuit. For this purpose, we state and prove the following result, which is utilized later.


Proposition 5.1. Given two ideals $J_1$ and $J_2$ over some finite field such that $V(J_1) \cap V(J_2) = \emptyset$, there exists a polynomial $U$ which satisfies $V(J_1) \subseteq V(U) \subseteq \overline{V(J_2)}$.

Proof. Over finite fields $\mathbb{F}_q$, $V(J_1)$ and $V(J_2)$ are finite sets of points. Every finite set of points is a variety of some ideal. Therefore, given $V(J_1) \cap V(J_2) = \emptyset$, there exists a set of points (a variety) which contains $V(J_1)$, and which does not intersect with $V(J_2)$. Let this variety be denoted by $V(J_I)$, where $J_I$ is the corresponding ideal. Then $V(J_1) \subseteq V(J_I) \subseteq \overline{V(J_2)}$. In addition, we can construct a polynomial $U$ that vanishes exactly on the points in $V(J_I)$ by means of Lagrange's interpolation formula. □

We now present the result that ascertains the circuit's rectifiability at a target net. Let the net $x_i \in X$ (i.e. the $i$th gate) be the rectification target, and a possible rectification function be $x_i = U(X_{PI})$. Then the $i$th gate is represented by a polynomial $f_i : x_i + U(X_{PI})$. Consider the ideal $J$ corresponding to the algebraic miter, comprising the polynomials $f_1, \dots, f_i, \dots, f_s$ representing the gates of the circuit, the specification polynomial $f_{spec}$, and the miter polynomial $f_m$:

$$J = \langle f_{spec}, f_1, \dots, f_i : x_i + U(X_{PI}), \dots, f_s, f_m \rangle.$$

The following theorem checks whether the circuit is indeed single-fix rectifiable at the net $x_i$.

Theorem 5.1. Construct two ideals:
– $J_L = \langle f_{spec}, f_1, \dots, f_i : x_i + 1, \dots, f_s, f_m \rangle$, where $f_i : x_i + U(X_{PI})$ in $J$ is replaced with $f_i : x_i + 1$.
– $J_H = \langle f_{spec}, f_1, \dots, f_i : x_i, \dots, f_s, f_m \rangle$, where $f_i : x_i + U(X_{PI})$ in $J$ is replaced with $f_i : x_i$.
Compute $E_L = (J_L + J_0) \cap \mathbb{F}_{2^k}[X_{PI}]$ and $E_H = (J_H + J_0) \cap \mathbb{F}_{2^k}[X_{PI}]$ to be the respective elimination ideals, where all the non-primary input variables have been eliminated. Then the circuit can be single-fix rectified at net $x_i$ with the polynomial function $f_i : x_i + U(X_{PI})$ to implement the specification iff $1 \in E_L + E_H$.

Proof. We will first prove the if case of the theorem. Assume $1 \in E_L + E_H$, or equivalently $V_{X_{PI}}(E_L) \cap V_{X_{PI}}(E_H) = \emptyset$. The subscript $X_{PI}$ in $V_{X_{PI}}$ denotes that the variety is being considered over $X_{PI}$ variables, as the non-primary inputs have been eliminated from $E_L$ and $E_H$. Using Proposition 5.1, we can find a polynomial $U(X_{PI})$ such that

$$V_{X_{PI}}(E_L) \subseteq V_{X_{PI}}(U(X_{PI})) \subseteq \overline{V_{X_{PI}}(E_H)}. \qquad (3)$$


Note, however, that since $V_{X_{PI}}(E_L)$, $V_{X_{PI}}(E_H)$ are considered over only primary input bits, they contain points from $\mathbb{F}_2^{|X_{PI}|}$. Therefore, there exists a polynomial $U(X_{PI})$ as in Eq. (3) with coefficients only in $\mathbb{F}_2$. Let us consider a point $p$ in $V(J)$. Point $p$ is an assignment to every variable in $J$ such that all the generators of $J$ are satisfied. We denote by $a$ the projection of $p$ on the primary inputs (i.e. the primary input assignments under $p$). There are only two possibilities for $U(X_{PI})$:

1. $U(a) = 1$, or in other words $a \notin V_{X_{PI}}(U(X_{PI}))$. It also implies that the value of $x_i$ under $p$ must be 1 because $x_i + U(X_{PI}) = 0$ needs to be satisfied. Since the generator $f_i$ of $J_L$ also forces $x_i$ to be 1 and all other generators are exactly the same as those of $J$, $p$ is also a point in $V(J_L)$. Moreover, $E_L$ is the elimination ideal of $J_L$, and therefore, $a \in V_{X_{PI}}(E_L)$. But this is a contradiction to our assumption that $V_{X_{PI}}(E_L) \subseteq V_{X_{PI}}(U(X_{PI}))$, so such a point $a$ (and $p$) does not exist.
2. $U(a) = 0$, or in other words $a \in V_{X_{PI}}(U(X_{PI}))$. Using a similar argument as in the previous case, we can show that $a \in V_{X_{PI}}(E_H)$. This is again a contradiction to our assumption $V_{X_{PI}}(U(X_{PI})) \subseteq \overline{V_{X_{PI}}(E_H)}$.

In conclusion, there exists no point in $V(J)$ (i.e. the miter is infeasible) when $U(X_{PI})$ satisfies Eq. (3), and therefore, the circuit can be rectified at $x_i$.

Now we will prove the only if direction. We show that if $1 \notin E_L + E_H$, then there exists no polynomial $U(X_{PI})$ that can rectify the circuit. If $1 \notin E_L + E_H$, then $E_L$ and $E_H$ have a common zero. Let $a$ be a point in both $V_{X_{PI}}(E_L)$ and $V_{X_{PI}}(E_H)$. By Lemma 3.2, this point can be extended to some points $p'$ and $p''$ in $V(J_L)$ and $V(J_H)$, respectively. Notice that in point $p'$ the value of $x_i$ will be 1, and in $p''$ $x_i$ will be 0. Any polynomial $U(X_{PI})$ will either evaluate to 0 or 1 for the assignment $a$ to the primary inputs. If it evaluates to 1, then we can say that $p'$ is in $V(J)$, as $f_i$ in $J$ forces $x_i = 1$ and all other generators of $J$ and $J_L$ are the same. This implies that $f_m(p') = 0$ (the miter is feasible) and this choice of $U(X_{PI})$ will not rectify the circuit. If $U(X_{PI})$ evaluates to 0, then $p''$ is a point in $V(J)$. Therefore, no choice of $U(X_{PI})$ can rectify the circuit if $1 \notin E_L + E_H$. □

Example 5.1. Consider the buggy modulo multiplier circuit of Fig. 2(b) (reproduced in Fig. 3), where the gate output $r_0$ should have been the output of an XOR gate, but an AND gate is incorrectly implemented. We apply Theorem 5.1 to check for single-fix rectifiability at $r_0$. The polynomials for the gates of the correct circuit implementation are

$$
\begin{aligned}
f_1 &: c_0 + a_0 \cdot b_0, & f_2 &: c_1 + a_0 \cdot b_1, & f_3 &: c_2 + a_1 \cdot b_0, & f_4 &: c_3 + a_1 \cdot b_1, \\
f_5 &: r_0 + c_1 + c_2, & f_6 &: z_0 + c_0 + c_3, & f_7 &: z_1 + r_0 + c_3.
\end{aligned}
$$

Fig. 3. A buggy 2-bit modulo multiplier circuit

The problem is modeled over $\mathbb{F}_4$ and let $\alpha$ be a primitive element of $\mathbb{F}_4$. The word-level polynomials are $f_8 : Z + z_0 + z_1 \alpha$, $f_9 : A + a_0 + a_1 \alpha$, and $f_{10} : B + b_0 + b_1 \alpha$. The specification polynomial is $f_{spec} : Z_S + AB$. We create a miter polynomial against this specification as $f_m : t(Z - Z_S) - 1$. The ideals $J_L$ and $J_H$ are constructed as:

$$
\begin{aligned}
J_L &= \langle f_{spec}, f_1, \dots, f_4, r_0 + 1, f_6, \dots, f_{10}, f_m \rangle \\
J_H &= \langle f_{spec}, f_1, \dots, f_4, r_0, f_6, \dots, f_{10}, f_m \rangle
\end{aligned}
$$

The ideal $J_0$ is $J_0 = \langle b_1^2 - b_1,\ b_0^2 - b_0,\ a_1^2 - a_1,\ a_0^2 - a_0 \rangle$, and the corresponding ideals $E_L$ and $E_H$ are computed to be:

$$
\begin{aligned}
E_L &= \langle a_0 b_1 + a_1 b_0,\ a_1 b_0 b_1 + a_1 b_0,\ a_0 a_1 b_0 + a_1 b_0 \rangle \\
E_H &= \langle b_0 b_1 + b_0 + b_1 + 1,\ a_1 b_1 + a_1 + b_1 + 1,\ a_0 b_1 + a_1 b_0 + 1,\ a_0 b_0 + a_0 + b_0 + 1,\ a_0 a_1 + a_0 + a_1 + 1 \rangle
\end{aligned}
$$

Computing a Gröbner basis $G$ of $E_L + E_H$ results in $G = \{1\}$. Therefore, we can rectify this circuit at $r_0$. On the other hand, if we apply the rectification theorem at net $c_2$, the respective ideals $E_L$ and $E_H$ are as follows:

$$
\begin{aligned}
E_L &= \langle a_0^2 + a_0,\ a_1^2 + a_1,\ b_0^2 + b_0,\ b_1^2 + b_1,\ a_1 b_0 + b_0,\ a_0 b_1 b_0 + a_0 b_1 + a_0 b_0 + a_0,\ a_0 a_1 + a_0 \rangle \\
E_H &= \langle a_0^2 + a_0,\ b_0^2 + b_0,\ b_1^2 + b_1,\ b_1 b_0 + b_1 + b_0 + 1,\ a_1 + 1,\ a_0 b_0 + a_0 + b_0 + 1 \rangle
\end{aligned}
$$

When we compute $G = GB(E_L + E_H)$, we obtain $G \neq \{1\}$, indicating that single-fix rectification is not possible at net $c_2$ for the given bug.
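The rectifiability test of Theorem 5.1 can be mimicked by exhaustive simulation on this example. The sketch below is only a brute-force analogue under the gate equations $f_1, \dots, f_7$ with the AND bug at $r_0$; it does not compute the elimination ideals and does not attempt to reproduce their generators. By Lemma 3.2, the set of primary-input points at which the miter remains feasible with the target net forced to 1 (respectively 0) plays the role of $V(E_L)$ (respectively $V(E_H)$), and the circuit is single-fix rectifiable at the target iff the two sets are disjoint. All function names are hypothetical.

```python
from itertools import product

def spec_bits(a0, a1, b0, b1):
    """Bits (z0, z1) of A * B in F_4, using alpha^2 = alpha + 1."""
    return (a0 & b0) ^ (a1 & b1), (a0 & b1) ^ (a1 & b0) ^ (a1 & b1)

def buggy_outputs(a0, a1, b0, b1, forced):
    """Evaluate the buggy circuit with one net overridden by a constant.
    `forced` maps a net name to 0 or 1, mimicking f_i : x_i or f_i : x_i + 1."""
    v = {}
    def net(name, value):
        v[name] = forced.get(name, value)
    net("c0", a0 & b0)
    net("c1", a0 & b1)
    net("c2", a1 & b0)
    net("c3", a1 & b1)
    net("r0", v["c1"] & v["c2"])      # the bug: AND instead of XOR
    net("z0", v["c0"] ^ v["c3"])
    net("z1", v["r0"] ^ v["c3"])
    return v["z0"], v["z1"]

def failing_inputs(target, value):
    """Primary-input points (a0, a1, b0, b1) where the miter stays feasible."""
    return {p for p in product((0, 1), repeat=4)
            if buggy_outputs(*p, {target: value}) != spec_bits(*p)}

def single_fix_rectifiable(target):
    """Analogue of 1 in E_L + E_H: the two failure sets have no common point."""
    return not (failing_inputs(target, 1) & failing_inputs(target, 0))

print(single_fix_rectifiable("r0"), single_fix_rectifiable("c2"))
```

Under these assumptions the check agrees with the conclusions of Example 5.1: a fix exists at $r_0$ but not at $c_2$.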

6 Craig Interpolants in Finite Fields

Once it is ascertained that a net $x_i$ admits single-fix rectification, the subsequent task is to compute a rectification polynomial function $x_i = U(X_{PI})$ in terms of the primary inputs of the circuit. In this section, we describe how such a rectification polynomial function can be computed. For this purpose, we introduce the concept of Craig interpolants using algebraic geometry in finite fields.

We describe the setup for Craig interpolation in the ring $R = \mathbb{F}_q[x_1, \dots, x_n]$. Partition the variables $\{x_1, \dots, x_n\}$ into disjoint subsets $A, B, C$. We are given two ideals $J_A \subset \mathbb{F}_q[A, C]$, $J_B \subset \mathbb{F}_q[B, C]$ such that the $C$-variables are common to the generators of both $J_A, J_B$. From here on, we will assume that all ideals include the corresponding vanishing polynomials. For example, generators of $J_A$ include $A^q - A$, $C^q - C$, where $A^q - A = \{x_i^q - x_i : x_i \in A\}$, and so on. Then these ideals become radicals and we can apply Lemmas 3.1 and 3.2. We use $V_{A,C}(J_A)$ to denote the variety of $J_A$ over the $\mathbb{F}_q$-space spanned by $A$ and $C$ variables, i.e. $V_{A,C}(J_A) \subset \mathbb{F}_q^A \times \mathbb{F}_q^C$. Similarly, $V_{B,C}(J_B) \subset \mathbb{F}_q^B \times \mathbb{F}_q^C$.

Now let $J = J_A + J_B \subseteq \mathbb{F}_q[A, B, C]$, and suppose that it is found by application of the Weak Nullstellensatz (Theorem 3.1) that $V_{A,B,C}(J) = \emptyset$. When we compare the varieties of $J_A$ and $J_B$, we can consider the varieties in $\mathbb{F}_q^A \times \mathbb{F}_q^B \times \mathbb{F}_q^C$, as $V_{A,B,C}(J_A) = V_{A,C}(J_A) \times \mathbb{F}_q^B \subset \mathbb{F}_q^A \times \mathbb{F}_q^B \times \mathbb{F}_q^C$. With this setup, we define the interpolants as follows.

Definition 6.1 (Interpolants in finite fields). Given two ideals $J_A \subset \mathbb{F}_q[A, C]$ and $J_B \subset \mathbb{F}_q[B, C]$ where $A, B, C$ denote the three disjoint sets of variables such that $V_{A,B,C}(J_A) \cap V_{A,B,C}(J_B) = \emptyset$. Then there exists an ideal $J_I$ satisfying the following properties:
1. $V_{A,B,C}(J_I) \supseteq V_{A,B,C}(J_A)$
2. $V_{A,B,C}(J_I) \cap V_{A,B,C}(J_B) = \emptyset$
3. Generators of $J_I$ contain only the $C$-variables; or $J_I \subseteq \mathbb{F}_q[C]$.
We call $V_{A,B,C}(J_I)$ the interpolant in finite fields of the pair $(V_{A,B,C}(J_A), V_{A,B,C}(J_B))$, and the corresponding ideal $J_I$ the ideal-interpolant.

As the generators of $J_I$ contain only the $C$-variables, the interpolant $V_{A,B,C}(J_I)$ is of the form $V_{A,B,C}(J_I) = \mathbb{F}_q^A \times \mathbb{F}_q^B \times V_C(J_I)$. Therefore, the subscripts $A, B$ for the interpolant $V_{A,B,C}(J_I)$ may be dropped for ease of readability.

Example 6.1. Consider the ring $R = \mathbb{F}_2[a, b, c, d, e]$, and partition the variables as $A = \{a\}$, $B = \{e\}$, $C = \{b, c, d\}$. Let ideals

$$
\begin{aligned}
J_A &= \langle ab,\ bd,\ bc + c,\ cd,\ bd + b + d + 1 \rangle + J_{0,A,C} \\
J_B &= \langle b,\ d,\ ec + e + c + 1,\ ec \rangle + J_{0,B,C}
\end{aligned}
$$


where $J_{0,A,C}$ and $J_{0,B,C}$ are the corresponding ideals of vanishing polynomials. Then,

$V_{A,B,C}(J_A) = \mathbb{F}_q^B \times V_{A,C}(J_A) = (abcde) : \{01000, 00010, 01100, 10010, 01001, 00011, 01101, 10011\}$
$V_{A,B,C}(J_B) = \mathbb{F}_q^A \times V_{B,C}(J_B) = (abcde) : \{00001, 00100, 10001, 10100\}$

Ideals $J_A, J_B$ have no common zeros as $V_{A,B,C}(J_A) \cap V_{A,B,C}(J_B) = \emptyset$. The pair $(J_A, J_B)$ admits a total of 8 interpolants:
1. $V_C(J_S) = (bcd) : \{001, 100, 110\}$, $J_S = \langle cd,\ b + d + 1 \rangle$
2. $V_C(J_1) = (bcd) : \{001, 100, 110, 101\}$, $J_1 = \langle cd,\ bd + b + d + 1,\ bc + cd + c \rangle$
3. $V_C(J_2) = (bcd) : \{001, 100, 110, 011\}$, $J_2 = \langle b + d + 1 \rangle$
4. $V_C(J_3) = (bcd) : \{001, 100, 110, 111\}$, $J_3 = \langle b + cd + d + 1 \rangle$
5. $V_C(J_4) = (bcd) : \{001, 100, 110, 011, 111\}$, $J_4 = \langle bd + b + d + 1,\ bc + b + cd + c + d + 1 \rangle$
6. $V_C(J_5) = (bcd) : \{001, 100, 110, 101, 111\}$, $J_5 = \langle bc + c,\ bd + b + d + 1 \rangle$
7. $V_C(J_6) = (bcd) : \{001, 100, 110, 101, 011\}$, $J_6 = \langle bd + b + d + 1,\ bc + cd + c \rangle$
8. $V_C(J_L) = (bcd) : \{001, 011, 100, 101, 110, 111\}$, $J_L = \langle bd + b + d + 1 \rangle$.

Fig. 4. The Interpolant lattice for Example 6.1
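The count of 8 interpolants in Example 6.1 can be reproduced by enumeration. The sketch below assumes that an interpolant is determined by a set $S$ of points over the $C$-variables $(b, c, d)$: condition 1 of Definition 6.1 forces $S$ to contain the projection of $V(J_A)$ onto $C$, and condition 2 forces $S$ to avoid the projection of $V(J_B)$ onto $C$. Polynomials are written as plain Python callables evaluated mod 2; the helper names are illustrative.

```python
from itertools import product, combinations

def variety(polys, nvars):
    """Points of F_2^nvars on which every polynomial vanishes (mod 2)."""
    return {p for p in product((0, 1), repeat=nvars)
            if all(f(*p) % 2 == 0 for f in polys)}

# Generators of J_A in (a, b, c, d) and of J_B in (b, c, d, e), from Example 6.1.
JA = [lambda a, b, c, d: a * b,
      lambda a, b, c, d: b * d,
      lambda a, b, c, d: b * c + c,
      lambda a, b, c, d: c * d,
      lambda a, b, c, d: b * d + b + d + 1]
JB = [lambda b, c, d, e: b,
      lambda b, c, d, e: d,
      lambda b, c, d, e: e * c + e + c + 1,
      lambda b, c, d, e: e * c]

proj_A = {(b, c, d) for (a, b, c, d) in variety(JA, 4)}   # must be covered
proj_B = {(b, c, d) for (b, c, d, e) in variety(JB, 4)}   # must be avoided
free = sorted(set(product((0, 1), repeat=3)) - proj_A - proj_B)

# Every interpolant is proj_A plus any subset of the unconstrained points.
interpolants = [proj_A | set(extra)
                for r in range(len(free) + 1)
                for extra in combinations(free, r)]
print(len(interpolants), sorted(proj_A))
```

The first entry of `interpolants` is the smallest interpolant $V_C(J_S)$ and the last one is the largest, $V_C(J_L)$, which matches the bottom and top of the lattice in Fig. 4.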

It is easy to check that all $V(J_I)$ satisfy the 3 conditions of Definition 6.1. Note also that $V(J_S)$ is the smallest interpolant, contained in every other interpolant. Likewise, $V(J_L)$ contains all other interpolants and it is the largest. The other containment relationships are shown in the corresponding interpolant lattice in Fig. 4; $V_C(J_1) \subset V_C(J_5)$, $V_C(J_1) \subset V_C(J_6)$, etc.

Theorem 6.1 (Existence of Craig Interpolants). An ideal-interpolant $J_I$, and correspondingly the interpolant $V_{A,B,C}(J_I)$, as given in Definition 6.1, always exists.

Proof. Consider the elimination ideal $J_I = J_A \cap \mathbb{F}_q[C]$. We show $J_I$ satisfies the three conditions for the interpolant.

Condition 1: $V_{A,B,C}(J_I) \supseteq V_{A,B,C}(J_A)$. This condition is trivially satisfied due to construction of elimination ideals. As $J_I \subseteq J_A$, $V_{A,B,C}(J_I) \supseteq V_{A,B,C}(J_A)$.

Condition 2: $V_{A,B,C}(J_I) \cap V_{A,B,C}(J_B) = \emptyset$. This condition can be equivalently stated as $V_{B,C}(J_I) \cap V_{B,C}(J_B) = \emptyset$, as neither $J_I$ nor $J_B$ contains any variables from the set $A$. We prove this condition by contradiction. Let us assume that


there exists a common point $(b, c)$ in $V_{B,C}(J_I)$ and $V_{B,C}(J_B)$. We know that the projection of the variety $Pr_A(V_{A,C}(J_A))$ is equal to the variety of the elimination ideal $V_C(J_I)$, where $J_I = J_A \cap \mathbb{F}_q[C]$, due to Lemma 3.2. Therefore, the point $(c)$ in the variety of $J_I$ can be extended to a point $(a, c)$ in the variety of $J_A$. This implies that the ideals $J_A$ and $J_B$ vanish at $(a, b, c)$. This is a contradiction to our initial assumption that the intersection of the varieties of $J_A$ and $J_B$ is empty. Thus $J_I, J_B$ have no common zeros.

Condition 3: The generators of $J_I$ contain only the $C$-variables. This condition is trivially satisfied as $J_I$ is the elimination ideal obtained by eliminating the $A$-variables in $J_A$. □

The above theorem not only proves the existence of an interpolant, but also gives a procedure to construct its ideal: $J_I = J_A \cap \mathbb{F}_q[C]$. In other words, compute a reduced Gröbner basis $G$ of $J_A$ w.r.t. the elimination order $A > B > C$ and take $G_I = G \cap \mathbb{F}_q[C]$. Then $G_I$ gives the generators for the ideal-interpolant $J_I$.

Example 6.2. The elimination ideal $J_I$ computed for $J_A$ from Example 6.1 is $J_I = J_S = \langle cd,\ b + d + 1 \rangle$ with variety $V_C(J_I) = (bcd) : \{001, 100, 110\}$. This variety over the variable set $A$ and $C$ is $V_{A,C}(J_I) = (abcd) : \{0001, 0100, 0110, 1001, 1100, 1110\}$, and it contains $V_{A,C}(J_A)$. Moreover, $V_{A,B,C}(J_I)$ also has an empty intersection with $V_{A,B,C}(J_B)$.

Theorem 6.2 (Smallest interpolant). The interpolant $V_{A,B,C}(J_S)$ corresponding to the ideal $J_S = J_A \cap \mathbb{F}_q[C]$ is the smallest interpolant.

Proof. Let $J_I \subseteq \mathbb{F}_q[C]$ be any other ideal-interpolant $\neq J_S$. We show that $V_C(J_S) \subseteq V_C(J_I)$. For $V_C(J_I)$ to be an interpolant it must satisfy $V_{A,B,C}(J_A) \subseteq V_{A,B,C}(J_I)$ which, due to Theorem 3.2, is equivalent to

$$I(V_{A,B,C}(J_A)) \supseteq I(V_{A,B,C}(J_I)) \implies J_A \supseteq J_I.$$

As the generators of $J_I$ only contain polynomials in $C$-variables, this relation also holds for the following:

$$J_A \cap \mathbb{F}_q[C] \supseteq J_I \implies J_S \supseteq J_I \implies V_C(J_S) \subseteq V_C(J_I). \qquad \square$$

6.1 Computing a Rectification Function from Craig Interpolants

Back to our formulation of single-fix rectification, from Theorem 5.1 we have $1 \in E_L + E_H$ or $V(E_L) \cap V(E_H) = \emptyset$. Therefore, we can consider the pair $(E_L, E_H)$ for Craig interpolation. In other words, based on the notation from Definition 6.1, $J_A = E_L$ and $J_B = E_H$. Moreover, $E_L$ and $E_H$ are elimination ideals containing only $X_{PI}$ variables. As a result, the partitioned set of variables for Craig interpolation $A$, $B$, and $C$ all correspond to primary inputs. Furthermore, we want to compute an ideal $J_I$ in $X_{PI}$ such that


$V_{X_{PI}}(E_L) \subseteq V_{X_{PI}}(J_I)$ and $V_{X_{PI}}(J_I) \cap V_{X_{PI}}(E_H) = \emptyset$. The smallest ideal-interpolant is $J_I = E_L \cap \mathbb{F}_{2^k}[X_{PI}] = E_L$ itself. Therefore, we use $E_L$ to compute the correction function $U(X_{PI})$.

Obtaining $U(X_{PI})$ from $E_L$: In finite fields, given an ideal $J$, it is always possible to find a polynomial $U$ such that $V(U) = V(J)$. The reason is that every ideal in a finite field has a finite variety, and a polynomial with those points as its roots can always be constructed using the Lagrange interpolation formula. We construct the rectification polynomial $U$ from the ideal-interpolant $E_L$ as shown below, such that $V(E_L) = V(U)$. Let the generators of $E_L$ be denoted by $g_1, \dots, g_t$. We can compute $U$ as

$$U = (1 + g_1)(1 + g_2) \cdots (1 + g_t) + 1 \qquad (4)$$

It is easy to assert that $V(U) = V(E_L)$. Consider a point $a$ in $V(E_L)$. As all of $g_1, \dots, g_t$ vanish ($= 0$) at $a$,

$$U(a) = (1 + g_1(a))(1 + g_2(a)) \cdots (1 + g_t(a)) + 1 = (1 + 0)(1 + 0) \cdots (1 + 0) + 1 = 0.$$

Conversely, for a point $a' \notin V(E_L)$, at least one of $g_1, \dots, g_t$ will evaluate to 1. Without loss of generality, if $g_1$ evaluates to 1 at $a'$, then the factor $(1 + g_1(a')) = (1 + 1) = 0$ makes the product vanish, so $U(a') = 0 + 1 = 1 \neq 0$.

Using Eq. (4), a recursive procedure is derived to compute $U$, and it is depicted in Algorithm 2. At every recursive step, we also reduce the intermediate results by (mod $J_0$) (line 7) so as to avoid terms of high degree. In this fashion, from the ideal-interpolant $E_L$, we compute the single-fix rectification polynomial function $U(X_{PI})$, and synthesize a sub-circuit at net $x_i$ such that $x_i = U(X_{PI})$ rectifies the circuit.

Algorithm 2. Compute $U$ from $J$ such that $V(U) = V(J)$
1: $U$ = compute_U($J$, $J_0$) + 1
2: procedure compute_U($J$, $J_0$)    /* $J = \langle g_1, \dots, g_t \rangle$ */
3:     if size($J$) = 1 then
4:         return (1 + $J[1]$)
5:     subsetJ = {$J[1]$, $J[2]$, ..., $J[\text{size}(J) - 1]$}
6:     poly $S_1$ = compute_U(subsetJ, $J_0$)
7:     Perform $S_1 \cdot J[\text{size}(J)] \xrightarrow{J_0}_+ S_2$
8:     return $S_1 + S_2$

Example 6.3. Example 5.1 showed that the buggy circuit of Fig. 3 can be rectified at net $r_0$. This rectification check required the computation of the (Gröbner basis of) ideal $E_L$. Using Algorithm 2, we compute $U(X_{PI})$ from $E_L$ to be $a_0 b_1 + a_1 b_0$, and the rectification polynomial as $r_0 + a_0 b_1 + a_1 b_0$. This can be synthesized into a sub-circuit as $r_0 = (a_0 \wedge b_1) \oplus (a_1 \wedge b_0)$, by replacing the modulo 2 product and sum in the polynomial with the Boolean AND and XOR operators, respectively.
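The computation in Example 6.3 can be replayed with a small sketch of Eq. (4) over $\mathbb{F}_2$. It assumes a representation that is not taken from the paper: a polynomial reduced modulo the vanishing polynomials $x^2 - x$ is stored as a set of monomials, each monomial being a frozenset of variable names, so that addition mod 2 is symmetric difference. The loop is an iterative variant of the recursion in Algorithm 2, and the helper names `poly`, `mul` and `compute_U` are hypothetical.

```python
ONE = {frozenset()}              # the constant polynomial 1 (empty monomial)

def poly(*monomials):
    """Build a polynomial from strings such as 'a0 b1' (product of variables)."""
    return {frozenset(m.split()) for m in monomials}

def mul(p, q):
    """Product over F_2 reduced modulo x^2 = x for every variable."""
    out = set()
    for m1 in p:
        for m2 in q:
            out ^= {m1 | m2}     # union merges repeated variables, ^ adds mod 2
    return out

def compute_U(gens):
    """U = (1 + g_1)(1 + g_2)...(1 + g_t) + 1, as in Eq. (4)."""
    prod = set(ONE)
    for g in gens:
        prod = mul(prod, ONE ^ g)
    return prod ^ ONE

# Generators of E_L at net r0, copied from Example 5.1.
E_L = [poly("a0 b1", "a1 b0"),
       poly("a1 b0 b1", "a1 b0"),
       poly("a0 a1 b0", "a1 b0")]

U = compute_U(E_L)
print(sorted(sorted(m) for m in U))
```

Because a function on Boolean points has a unique multilinear representation over $\mathbb{F}_2$, the result can be compared directly with `poly("a0 b1", "a1 b0")`, i.e. with $a_0 b_1 + a_1 b_0$.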

7 Efficient Gröbner Basis Computations for $E_L$ and $E_H$

The proposed rectification approach requires the computation of (generators of) the elimination ideals $E_L$ and $E_H$. This is achieved by computing a Gröbner basis each for $GB(J_L + J_0) \cap \mathbb{F}_{2^k}[X_{PI}]$ and $GB(J_H + J_0) \cap \mathbb{F}_{2^k}[X_{PI}]$, respectively. The rectification polynomial function $x_i = U(X_{PI})$ is subsequently derived from the generators of $E_L$. As the generators of $J_L$ and $J_H$ comprise polynomials derived from the entire circuit, these GB-computations become infeasible for larger circuits due to their high complexity. In [37], it was shown that the time and space complexity of computing $GB(J + J_0)$ over $\mathbb{F}_q[x_1, \dots, x_n]$ is bounded by $q^{O(n)}$. In the context of our work, $q = 2^k$ where $k$ is the operand-width, and $n$ is the number of variables (nets) in the miter, so we have to overcome this complexity to make our approach practical for large circuits.

Prior work [8] has shown that the GB-computation can be significantly improved when the polynomials are derived from circuits. By analyzing the topology of the given circuit, a specialized term order can be derived that can significantly reduce the number of Spoly computations in the GB-algorithm. We present a similar approach to improve the GB-computation for the ideals $E_L$, $E_H$.

Lemma 7.1 (Product Criterion [40]). For two polynomials $f_i, f_j$ in any polynomial ring $R$, if the equality $lm(f_i) \cdot lm(f_j) = LCM(lm(f_i), lm(f_j))$ holds, i.e. if $lm(f_i)$ and $lm(f_j)$ are relatively prime, then $Spoly(f_i, f_j) \xrightarrow{G}_+ 0$.

Buchberger's algorithm therefore does not pair those polynomials $f_i, f_j$ (Algorithm 1, line 4) whose leading monomials are relatively prime, as they do not produce any new information in the basis. Moreover, based on the above criterion, when the leading monomials of all polynomials in the basis $F = \{f_1, \dots, f_s\}$ are relatively prime, then all $Spoly(f_i, f_j) \xrightarrow{G}_+ 0$. As no new polynomials are generated in Buchberger's algorithm, $F$ already constitutes a Gröbner basis ($F = GB(J)$). For a combinational circuit $C$, a specialized term order $>$ that ensures such a property can always be derived by analyzing the circuit topology [4,7]:

Proposition 7.1 (From [7]). Let $C$ be an arbitrary combinational circuit. Let $\{x_1, \dots, x_n\}$ denote the set of all variables (signals) in $C$. Starting from the primary outputs, perform a reverse topological traversal of the circuit and order the variables such that $x_i > x_j$ if $x_i$ appears earlier in the reverse topological order. Impose a lex term order $>$ to represent each gate as a polynomial $f_i$, s.t. $f_i = x_i + tail(f_i)$. Then the set of all polynomials $\{f_1, \dots, f_s\}$ forms a Gröbner basis $G$, as $lt(f_i) = x_i$ and $lt(f_j) = x_j$ for $i \neq j$ are relatively prime. This term order $>$ is called the Reverse Topological Term Order (RTTO).

RTTO ensures that the set of all polynomials $\{f_1, \dots, f_s\}$ of the given circuit $C$ have relatively prime leading terms. However, the model of the algebraic miter (Fig. 1, with the Spec and the miter polynomial, in addition to the given circuit) is such that under RTTO $>$, not all polynomials have relatively prime leading


terms. Even so, we show that imposing RTTO on the miter still significantly reduces the amount of computation required for Gröbner bases. We demonstrate the technique on the GB computation for the ideal $J_L + J_0$ (and analogously for $J_H + J_0$), corresponding to the miter, as per Theorem 5.1. Given the word-level miter of Fig. 1, impose a lexicographic (lex) monomial order on the ring $R$, with the following variable order:

$$t > Z > Z_S > A > \text{nets of } C \text{ in RTTO order} > \text{primary input variables} \tag{5}$$

Here $t$ is the free variable used in the miter polynomial, $Z$ and $Z_S$ are the word-level outputs of Impl and Spec, respectively, and $A$ is the word-level input. Corresponding to the circuit in Fig. 3 (Example 5.1), we use a lex term order with variable order:

$$t > Z > Z_S > A > B > z_1 > z_0 > r_0 > c_0 > c_1 > c_2 > c_3 > b_1 > b_0 > a_1 > a_0 \tag{6}$$

The polynomials $\{f_1, \dots, f_{10}, f_{spec}, f_m\}$ in Example 5.1 are already written according to the term order of Eq. (6). Note also that the leading terms of the generators of the ideal $J_L$ are the same as the leading terms of the polynomials in $\{f_1, \dots, f_{10}, f_{spec}, f_m\}$. Among these, the only pair of polynomials that do not have relatively prime leading terms is $f_8$ and $f_m$. This condition also holds when considering the ideal $J_L + J_0$ (instead of only $J_L$), as $J_0$ is composed of only bit-level primary input variables.

In general, modeling an algebraic miter with RTTO $>$ will ensure that we have exactly one pair of polynomials with leading monomials that are not relatively prime. This pair includes: (i) the miter polynomial $f_m: tZ - tZ_S - 1$, with $lm(f_m) = tZ$; and (ii) the polynomial (hereafter denoted by $f_o$) that relates the word-level and bit-level variables of the circuit, $f_o: Z + z_0 + z_1\alpha + \cdots + z_{k-1}\alpha^{k-1}$, with $lm(f_o) = Z$. Therefore, in the first iteration of Algorithm 1 for computing $GB(J_L + J_0)$, the only critical pair to compute is $Spoly(f_m, f_o)$, as all other pairs reduce to 0 due to Lemma 7.1. Moreover, computing $Spoly(f_m, f_o)$ results in $Spoly(f_m, f_o) = t(Z_S + z_0 + \cdots + z_{k-1}\alpha^{k-1}) + 1$. Once again, RTTO $>$ ensures the following:

Lemma 7.2. $Spoly(f_m, f_o) \xrightarrow{J_L + J_0}_+ h = t \cdot r + 1$, where $r$ is a polynomial in bit-level primary input variables.

Proof. Consider the polynomial reduction $Spoly(f_m, f_o) \xrightarrow{J_L + J_0}_+ h$. The first division is by $f_{spec} = Z_S + \mathcal{F}(A)$:

$$t(Z_S + z_0 + \cdots + z_{k-1}\alpha^{k-1}) + 1 \xrightarrow{f_{spec}}_+ t(\mathcal{F}(A) + z_0 + \cdots + z_{k-1}\alpha^{k-1}) + 1,$$


where $\mathcal{F}(A)$ is the polynomial specification in word-level input variable(s). This remainder is then reduced by the polynomial relating the word-level and bit-level primary input variables, i.e. by $A + a_0 + \cdots + a_{k-1}\alpha^{k-1}$. The subsequent remainder is

$$t(\mathcal{F}(A) + z_0 + \cdots + z_{k-1}\alpha^{k-1}) + 1 \xrightarrow{A + a_0 + \cdots + a_{k-1}\alpha^{k-1}}_+ t(z_0 + \cdots + z_{k-1}\alpha^{k-1} + \mathcal{G}(a_0, \dots, a_{k-1})) + 1, \tag{7}$$

where the word-level specification polynomial $\mathcal{F}(A)$ gets reduced to a polynomial expression $\mathcal{G}(a_0, \dots, a_{k-1})$ in primary input bits. Due to RTTO $>$, subsequent divisions of the above remainder in Eq. (7) by $\{f_1, \dots, f_s\}$ will successively cancel the terms in variables $z_i$, $i = 0, \dots, k-1$, and express them in terms of the primary input bits. Since primary input bits are last in RTTO $>$, they never appear as leading terms in any of the polynomials in $J_L$, so the terms in primary input bits cannot be canceled. As a result, after complete reduction of $Spoly(f_m, f_o)$ by $J_L + J_0$, the remainder will be a polynomial expression of the form $Spoly(f_m, f_o) \xrightarrow{J_L + J_0}_+ h = t \cdot r + 1$, where $r$ is a polynomial only in bit-level primary input variables. □
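The variable order of Eqs. (5) and (6) can be derived mechanically from the netlist. The following Python sketch shows one way to do so, under stated assumptions: the circuit is given as a dictionary from each gate output to its fan-in nets, nets without an entry are primary inputs, and the three-gate circuit is a made-up example rather than the circuit of Fig. 3. The function name `rtto` is hypothetical.

```python
def rtto(gates, outputs):
    """Return the nets of a combinational circuit in reverse topological
    order (every gate output before its fan-ins), primary inputs last."""
    post, seen = [], set()

    def visit(net):
        if net in seen:
            return
        seen.add(net)
        for fanin in gates.get(net, ()):
            visit(fanin)
        post.append(net)           # fan-ins are recorded before the net

    for out in outputs:
        visit(out)

    rev = post[::-1]               # now every net precedes its fan-ins
    internal = [n for n in rev if n in gates]
    inputs = [n for n in rev if n not in gates]
    return internal + inputs

# Hypothetical circuit: z0 = c0 XOR c1, c0 = a0 AND b0, c1 = a1 AND b1.
gates = {"z0": ("c0", "c1"), "c0": ("a0", "b0"), "c1": ("a1", "b1")}
order = rtto(gates, ["z0"])

# Word-level variables of the miter are placed first, as in Eq. (5).
variable_order = ["t", "Z", "Z_S", "A"] + order

# Under lex with this order each gate polynomial x_i + tail has leading
# term x_i, so distinct gates have relatively prime leading terms.
leading_terms = list(gates)
```

Because each gate contributes exactly one polynomial whose leading term is its own output variable, the leading terms are pairwise distinct variables, which is the situation covered by Lemma 7.1.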

Coming back to the computation of $GB(J_L + J_0)$, the polynomial $h$ is now added to the current basis, i.e. $G = \{J_L + J_0\} \cup \{h\}$ in Buchberger's algorithm (line 7 in Algorithm 1). This polynomial $h$ now needs to be paired with other polynomials in the basis. There are only two sets of possibilities for subsequent critical pairings: (i) the pair $Spoly(f_m, h)$; and (ii) pairing $h$ with the corresponding vanishing polynomials from the ideal $J_0$. For all other polynomials $f_i \in \{f_1, \dots, f_s\}$, $lm(h)$ and $lm(f_i)$ are relatively prime, so $Spoly(h, f_i) \xrightarrow{J_L + J_0}_+ 0$ for $i = 1, \dots, s$, and the pairs $(h, f_i)$ need not be considered in $GB(J_L + J_0)$. We now show that $Spoly(f_m, h) \xrightarrow{G = \{J_L + J_0\} \cup \{h\}}_+ 0$, so the pair $(f_m, h)$ also need not be considered.

From Lemma 7.2 and its proof, we have that $h = t \cdot r + 1$ and $Z + Z_S \xrightarrow{G = \{J_L + J_0\}}_+ r$, with $r$ composed of primary input bits. Let $r = e + r'$, where $e = lt(r)$ is the leading term and $r' = r - e$ is $tail(r)$, both expressed in primary input bits. With this notation, $h = te + tr' + 1$ and $lt(h) = te$. The LCM $L$ of the leading monomials of $f_m$ and $h$ is $L = LCM(lm(f_m), lm(h)) = LCM(tZ, te) = tZe$. Consider the computation of $Spoly(f_m, h)$:

$$\begin{aligned} Spoly(f_m, h) &= \frac{L}{lt(f_m)} \cdot f_m - \frac{L}{lt(h)} \cdot h \\ &= e f_m + Z h \\ &= e(tZ + tZ_S + 1) + Z(te + tr' + 1) \\ &= tr'Z + teZ_S + Z + e \end{aligned} \tag{8}$$


Next consider the reduction of $Spoly(f_m, h)$ by $\{J_L + J_0\} \cup \{h\}$, where $h$ itself is used in the division. The reduction $Spoly(f_m, h) \xrightarrow{h}_+$ is computed as

$$\begin{aligned} tr'Z + teZ_S + Z + e \xrightarrow{h}_+ \; & tr'Z + (tr' + 1)Z_S + Z + e \\ &= tr'(Z + Z_S) + Z + Z_S + e \\ &= (tr' + 1)(Z + Z_S) + e \end{aligned} \tag{9}$$

Reducing the intermediate remainder of Eq. (9) by the polynomials in $J_L + J_0$ results in $(tr' + 1)(r) + e$. This reduction process is similar to the one in the proof of Lemma 7.2. Now consider the polynomial $(tr' + 1)(r) + e$:

$$\begin{aligned} (tr' + 1)(r) + e &= (tr' + 1)(e + r') + e \\ &= ter' + tr'^2 + e + r' + e \\ &= ter' + tr'^2 + r' \end{aligned} \tag{10}$$

The polynomial in Eq. (10) can be further reduced by $h$, which results in 0, implying that $Spoly(f_m, h) \xrightarrow{\{J_L + J_0\} \cup \{h\}}_+ 0$:

$$ter' + tr'^2 + r' \xrightarrow{h}_+ (tr' + 1)r' + tr'^2 + r' = tr'^2 + r' + tr'^2 + r' = 0$$

In summary, we have shown that to compute $E_L$ as $GB(J_L + J_0) \cap \mathbb{F}_{2^k}[X_{PI}]$, we only need to compute $Spoly(f_m, f_o) \xrightarrow{J_L + J_0}_+ h$, and pair $h$ with the polynomials of $J_0$, as all other $Spoly(h, f_i)$ reduce to 0. This gives us the following procedure to compute the Gröbner basis of $E_L$ (respectively $E_H$):

1. Compute $Spoly(f_o, f_m) \xrightarrow{J_L + J_0}_+ h$, where $(f_m, f_o)$ is the only pair of polynomials in $J_L + J_0$ that do not have relatively prime leading monomials.
2. Use Buchberger's algorithm to compute the GB of the set of vanishing polynomials and $h$, i.e. compute $G = GB(J_0 = \{x_i^2 - x_i : x_i \in X_{PI}\}, h)$.
3. From $G$, collect the polynomials not containing $t$, i.e. $E_L = G \cap \mathbb{F}_{2^k}[X_{PI}]$. These polynomials generate the ideal $E_L$.

The same technique is also used to compute $E_H$ by replacing $J_L$ with $J_H$ in the above procedure. In our approach, we use the above procedures to compute $E_L$, $E_H$ for Theorem 5.1 and then compute $U(X_{PI})$ from $E_L$ using Algorithm 2.
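The algebraic identities behind Eqs. (8) to (10) can be checked symbolically with a few lines of Python. The sketch below is only a sanity check under simplifying assumptions: coefficients are restricted to $\mathbb{F}_2$, the choice $e = a_1 b_0$ and $r' = a_0$ is a made-up instance of $r = e + r'$, and the intermediate step that reduces $Z + Z_S$ to $r$ modulo $J_L + J_0$ is taken from the text rather than recomputed. All helper names are hypothetical.

```python
# A polynomial over F_2 is a frozenset of monomials; a monomial is a sorted
# tuple of variable names, repeated once per power (() is the constant 1).
ONE = frozenset({()})

def var(name):
    return frozenset({(name,)})

def add(*polys):
    total = frozenset()
    for p in polys:
        total = total ^ p              # coefficients add modulo 2
    return total

def mul(*polys):
    result = ONE
    for p in polys:
        terms = set()
        for m in result:
            for n in p:
                terms ^= {tuple(sorted(m + n))}
        result = frozenset(terms)
    return result

t, Z, ZS = var("t"), var("Z"), var("ZS")
a0, a1, b0 = var("a0"), var("a1"), var("b0")

e = mul(a1, b0)                        # hypothetical leading term of r
r_tail = a0                            # hypothetical tail r' = r - e
r = add(e, r_tail)

f_m = add(mul(t, Z), mul(t, ZS), ONE)  # miter polynomial tZ + tZ_S + 1
h = add(mul(t, e), mul(t, r_tail), ONE)

# Eq. (8): Spoly(f_m, h) = e*f_m + Z*h
spoly = add(mul(e, f_m), mul(Z, h))
eq8 = add(mul(t, r_tail, Z), mul(t, e, ZS), Z, e)

# Eq. (9): one division step by h, which adds Z_S * h
eq9 = add(spoly, mul(ZS, h))
eq9_expected = add(mul(add(mul(t, r_tail), ONE), add(Z, ZS)), e)

# Eq. (10): after Z + Z_S has been replaced by r (step taken from the text)
eq10 = add(mul(add(mul(t, r_tail), ONE), r), e)
eq10_expected = add(mul(t, e, r_tail), mul(t, r_tail, r_tail), r_tail)

# Final division step by h, which adds r' * h
remainder = add(eq10, mul(r_tail, h))
```

The final remainder is the empty polynomial, matching the conclusion that $Spoly(f_m, h)$ reduces to 0.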

8 Experimental Results

We have performed rectification experiments on finite field arithmetic circuits that are used in cryptography, where the implementation differs from the specification in exactly one gate. This ensures that single-fix rectification is feasible for such bugs, so that a rectification function can be computed. We have implemented the procedures described in the previous sections, i.e. the concepts of Theorem 5.1, Sect. 7 and Algorithm 2, using the SINGULAR symbolic algebra computation system [ver. 4-1-0] [41]. Given a Spec, a buggy Impl circuit $C$, and the set $X$ of rectification targets, our approach checks for each net $x_i \in X$ if single-fix rectification is feasible, and if so, computes a rectification function $x_i = U(X_{PI})$. The experiments were conducted on a desktop computer with a 3.5 GHz Intel Core™ i7-4770K quad-core CPU and 16 GB RAM, running 64-bit Linux.

Experiments are performed with three different types of finite field circuit benchmarks. Two of these are the Mastrovito and the Montgomery multiplier circuit architectures used for modular multiplication. Mastrovito multipliers compute $Z = A \times B \pmod{P(x)}$ where $P(x)$ is a given primitive polynomial for the datapath size $k$. Montgomery multipliers are instead preferred for exponentiation operations (often required in cryptosystems). The last set of benchmarks are circuits implementing point addition over elliptic curves, used for encryption, decryption and authentication in elliptic curve cryptography (ECC).

Table 1. Mastrovito multiplier rectification against Montgomery multiplier specification. Time in seconds; Time-out (TO) = 5400 s; k: Operand width

| k | # of Gates (Mas) | # of Gates (Mont) | SAT | Theorem 5.1 | Algorithm 2 | Mem |
|---|---|---|---|---|---|---|
| 4 | 48 | 96 | 0.09 | 0.03 | 0.001 | 8.16 MB |
| 8 | 292 | 319 | 158.34 | 0.41 | 0.006 | 20.36 MB |
| 9 | 237 | 396 | 4,507 | 0.47 | 0.001 | 18.95 MB |
| 10 | 285 | 480 | TO | 0.84 | 0.001 | 28.2 MB |
| 16 | 1,836 | 1,152 | TO | 73.63 | 0.024 | 0.32 GB |
| 32 | 5,482 | 4,352 | TO | 3,621 | 0.043 | 2.4 GB |

First we present the results for the case where the reference Spec is given as a Montgomery multiplier, and the buggy implementation is given as a Mastrovito multiplier, which is to be rectified. Theorem 5.1, along with the efficient GB-computation of the ideals $E_L$, $E_H$, is applied at a net $x_i \in X$ such that the circuit is rectifiable at $x_i$. Table 1 compares the execution time of the SAT-based approach of [16] against ours (Theorem 5.1) for checking whether a buggy Mastrovito multiplier can be rectified at a certain location in the circuit against a Montgomery multiplier specification. The SAT procedure is implemented using the abc tool [42]. We execute the command inter on the ON set and OFF set as described in [16]. The SAT-based procedure is unable to perform the necessary unsatisfiability check for circuits beyond 9-bit operand word-lengths, whereas our approach scales to 32-bit circuits. Using our approach, the polynomial $U(X_{PI})$ needed for rectification is computed from $E_L$, and the time is reported in Table 1 in the Algorithm 2 column. The last column in the table shows the memory usage of our approach.


We also perform the rectification when the Spec is given as a polynomial expression instead of a circuit. Table 2 shows the results for checking whether the incorrect Mastrovito implementation can be single-fix rectified against the word-level specification polynomial $f_{spec}: Z_S + A \cdot B$.

Table 2. Mastrovito multiplier rectification against polynomial specification $Z_S = AB$. Time in seconds; Time-out = 5400 s; k: Operand width

| k | # of Gates | Theorem 5.1 | Algorithm 2 | Mem |
|---|---|---|---|---|
| 4 | 48 | 0.01 | 0.001 | 7.24 MB |
| 8 | 292 | 0.08 | 0.006 | 14.95 MB |
| 16 | 1,836 | 4.83 | 0.038 | 0.2 GB |
| 32 | 5,482 | 100.52 | 0.015 | 1.42 GB |
| 64 | 21,813 | 4,989 | 0.117 | 12.25 GB |

Point addition is an important operation required for the task of encryption, decryption and authentication in ECC. Modern approaches represent the points in projective coordinate systems, e.g., the López-Dahab (LD) projective coordinate [43], due to which the operations can be implemented as polynomials in the field.

Table 3. Point Addition circuit rectification against polynomial specification $D = B^2 \cdot (C + aZ_1^2)$. Time in seconds; Time-out = 5400 s; k: Operand width

| k | # of Gates | Theorem 5.1 | Algorithm 2 | Mem |
|---|---|---|---|---|
| 8 | 243 | 0.05 | 0.022 | 9.73 MB |
| 16 | 1,277 | 3.48 | 0.019 | 88.78 MB |
| 32 | 3,918 | 86.75 | 0.028 | 0.47 GB |
| 64 | 15,305 | 4,923 | 0.053 | 7.13 GB |

Example 8.1. Consider point addition in López-Dahab (LD) projective coordinates. We are given an elliptic curve $Y^2 + XYZ = X^3 Z + aX^2 Z^2 + bZ^4$ over $\mathbb{F}_{2^k}$, where $X, Y, Z$ are $k$-bit vectors that are elements in $\mathbb{F}_{2^k}$ and, similarly, $a, b$ are constants from the field. We represent point addition over the elliptic curve as $(X_3, Y_3, Z_3) = (X_1, Y_1, Z_1) + (X_2, Y_2, 1)$. Then $X_3, Y_3, Z_3$ can be computed as follows:

$$\begin{aligned} A &= Y_2 \cdot Z_1^2 + Y_1 & B &= X_2 \cdot Z_1 + X_1 \\ C &= Z_1 \cdot B & D &= B^2 \cdot (C + aZ_1^2) \\ Z_3 &= C^2 & E &= A \cdot C \\ X_3 &= A^2 + D + E & F &= X_3 + X_2 \cdot Z_3 \\ G &= X_3 + Y_2 \cdot Z_3 & Y_3 &= E \cdot F + Z_3 \cdot G \end{aligned}$$


Each of the polynomials in the above design is implemented as a (gate-level) logic block, and the blocks are interconnected to obtain the final outputs $X_3$, $Y_3$ and $Z_3$. Table 3 shows the results for the block that computes $D = B^2 \cdot (C + aZ_1^2)$. Our approach can rectify up to 64-bit circuits.

Limitations of Our Approach: We also performed experiments where we apply Theorem 5.1 at a gate output which cannot rectify the circuit. We used the Montgomery multiplier as the specification and a Mastrovito multiplier as the implementation. For 4- and 8-bit word-lengths, the execution time of our approach was comparable to that of the SAT-based approach, and was ∼0.1 s. For the 16-bit multipliers, the SAT-based approach completed in 0.11 s. On the other hand, application of Theorem 5.1 resulted in a memory explosion and consumed ∼30 GB of memory within 5–6 min. This is due to the fact that when $1 \notin E_L + E_H$, $GB(E_L + E_H)$ is not equal to $\{1\}$ and the Gröbner basis algorithm produces a very large output. To improve our approach we are working on term ordering heuristics so that it performs efficiently in both cases. We also wish to employ other data structures better suited to circuits, as SINGULAR's data structure is not very memory efficient. SINGULAR also has an upper limit on the number of variables (32,768) that can be accommodated in the system, limiting application to larger circuits.

9 Conclusion

This paper considers single-fix rectification of arithmetic circuits. The approach is applied after formal verification detects the presence of a bug in the design. We assume that post-verification debugging has been performed and that a set $X$ of nets is provided as rectification targets. The paper presents necessary and sufficient conditions that ascertain whether a buggy circuit can be single-fix rectified at a net $x_i \in X$. When single-fix rectification is feasible, we compute a rectification polynomial function $x_i = U(X_{PI})$, which can be synthesized into a circuit. For this purpose, the paper introduces the notion of Craig interpolants in algebraic geometry in finite fields, proves their existence, and gives an effective procedure for their computation. Furthermore, we show how the rectification polynomial can be computed from algebraic interpolants. Experiments performed over various finite field arithmetic circuits show the efficiency of our approach compared with SAT-based approaches. Limitations of our approach are also analyzed. We are currently investigating the extension of our approach to multi-fix rectification.

References

1. Ritirc, D., Biere, A., Kauers, M.: Column-wise verification of multipliers using computer algebra. In: Formal Methods in Computer-Aided Design (FMCAD), pp. 23–30 (2017)
2. Ciesielski, M., Yu, C., Brown, W., Liu, D., Rossi, A.: Verification of gate-level arithmetic circuits by function extraction. In: 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6 (2015)


3. Sayed-Ahmed, A., Große, D., Kühne, U., Soeken, M., Drechsler, R.: Formal verification of integer multipliers by combining Gröbner basis with logic reduction. In: Design, Automation Test in Europe Conference Exhibition (DATE), pp. 1048–1053 (2016)
4. Wienand, O., Wedler, M., Stoffel, D., Kunz, W., Greuel, G.-M.: An algebraic approach for proving data correctness in arithmetic data paths. In: Gupta, A., Malik, S. (eds.) CAV 2008. LNCS, vol. 5123, pp. 473–486. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-70545-1_45
5. Shekhar, N., Kalla, P., Enescu, F.: Equivalence verification of polynomial datapaths using ideal membership testing. IEEE Trans. CAD 26(7), 1320–1330 (2007)
6. Tew, N., Kalla, P., Shekhar, N., Gopalakrishnan, S.: Verification of arithmetic datapaths using polynomial function models and congruence solving. In: Proceedings of International Conference on Computer-Aided Design (ICCAD), pp. 122–128 (2008)
7. Lv, J., Kalla, P., Enescu, F.: Efficient Gröbner basis reductions for formal verification of Galois field arithmetic circuits. IEEE Trans. CAD 32(9), 1409–1420 (2013)
8. Pruss, T., Kalla, P., Enescu, F.: Efficient symbolic computation for word-level abstraction from combinational circuits for verification over finite fields. IEEE Trans. CAD 35(7), 1206–1218 (2016)
9. Lvov, A., Lastras-Montano, L., Trager, B., Paruthi, V., Shadowen, R., El-Zein, A.: Verification of Galois field based circuits by formal reasoning based on computational algebraic geometry. Form. Methods Syst. Des. 45(2), 189–212 (2014)
10. Sun, X., Kalla, P., Pruss, T., Enescu, F.: Formal verification of sequential Galois field arithmetic circuits using algebraic geometry. In: Proceedings of Design, Automation and Test in Europe (2015)
11. Cox, D., Little, J., O'Shea, D.: Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra. Springer, New York (2007). https://doi.org/10.1007/978-0-387-35651-8
12. Adams, W.W., Loustaunau, P.: An Introduction to Gröbner Bases. American Mathematical Society, Providence (1994)
13. Madre, J.C., Coudert, O., Billon, J.P.: Automating the diagnosis and the rectification of design errors with PRIAM. In: Kuehlmann, A. (ed.) The Best of ICCAD. Springer, Boston (2003). https://doi.org/10.1007/978-1-4615-0292-0_2
14. Lin, C.C., Chen, K.C., Chang, S.C., Marek-Sadowska, M.: Logic synthesis for engineering change. In: Proceedings of Design Automation Conference (DAC), pp. 647–652 (1995)
15. Scholl, C., Becker, B.: Checking equivalence for partial implementations. In: Equivalence Checking of Digital Circuits. Springer, Boston (2004)
16. Tang, K.F., Wu, C.A., Huang, P.K., Huang, C.Y.: Interpolation-based incremental ECO synthesis for multi-error logic rectification. In: Proceedings of Design Automation Conference (DAC), pp. 146–151 (2011)
17. Craig, W.: Linear reasoning: a new form of the Herbrand-Gentzen theorem. J. Symb. Log. 22(3), 250–268 (1957)
18. Liaw, H.T., Tsaih, J.H., Lin, C.S.: Efficient automatic diagnosis of digital circuits. In: Proceedings of ICCAD, pp. 464–467 (1990)
19. Gitina, K., Reimer, S., Sauer, M., Wimmer, R., Scholl, C., Becker, B.: Equivalence checking of partial designs using dependency quantified Boolean formulae. In: IEEE International Conference on Computer Design (ICCD) (2013)


20. Jo, S., Matsumoto, T., Fujita, M.: SAT-based automatic rectification and debugging of combinational circuits with LUT insertions. In: IEEE 21st Asian Test Symposium (2012)
21. Fujita, M., Mishchenko, A.: Logic synthesis and verification on fixed topology. In: 22nd International Conference on Very Large Scale Integration (VLSI-SoC) (2014)
22. Fujita, M.: Toward unification of synthesis and verification in topologically constrained logic design. Proc. IEEE 103, 2052–2060 (2015)
23. Wu, B.H., Yang, C.J., Huang, C.Y., Jiang, J.H.R.: A robust functional ECO engine by SAT proof minimization and interpolation techniques. In: International Conference on Computer Aided Design, pp. 729–734 (2010)
24. Ling, A.C., Brown, S.D., Safarpour, S., Zhu, J.: Toward automated ECOs in FPGAs. IEEE Trans. CAD 30(1), 18–30 (2011)
25. Dao, A.Q., et al.: Efficient computation of ECO patch functions. In: 55th Design Automation Conference (DAC), pp. 51:1–51:6, June 2018
26. Gupta, U., Ilioaea, I., Rao, V., Srinath, A., Kalla, P., Enescu, F.: On the rectifiability of arithmetic circuits using Craig interpolants in finite fields. In: International Conference on Very Large Scale Integration (VLSI-SoC), pp. 49–54 (2018)
27. Ghandali, S., Yu, C., Liu, D., Brown, W., Ciesielski, M.: Logic debugging of arithmetic circuits. In: IEEE Computer Society Annual Symposium on VLSI (2015)
28. Farahmandi, F., Mishra, P.: Automated debugging of arithmetic circuits using incremental Gröbner basis reduction. In: IEEE International Conference on Computer Design (ICCD) (2017)
29. Farahmandi, F., Mishra, P.: Automated test generation for debugging arithmetic circuits. In: Proceedings of the 2016 Conference on Design, Automation & Test in Europe, DATE (2016)
30. McMillan, K.L.: Interpolation and SAT-based model checking. In: Hunt, W.A., Somenzi, F. (eds.) CAV 2003. LNCS, vol. 2725, pp. 1–13. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45069-6_1
31. Lee, R.-R., Jiang, J.-H.R., Hung, W.-L.: Bi-decomposing large Boolean functions via interpolation and satisfiability solving. In: Proceedings of Design Automation Conference (DAC), pp. 636–641 (2008)
32. McMillan, K.: An interpolating theorem prover, theoretical computer science. In: Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2004), vol. 345, no. 1, pp. 101–121 (2005)
33. Kapur, D., Majumdar, R., Zarba, G.: Interpolation for data-structures. In: Proceedings of ACM SIGSOFT International Symposium on Foundation of Software Engineering, pp. 105–116 (2006)
34. Cimatti, A., Griggio, A., Sebastiani, R.: Efficient interpolant generation in satisfiability modulo theories. In: Ramakrishnan, C.R., Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 397–412. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78800-3_30
35. Griggio, A.: Effective word-level interpolation for software verification. In: Formal Methods in Computer-Aided Design (FMCAD), pp. 28–36 (2011)
36. Gao, S., Platzer, A., Clarke, E.: Quantifier elimination over finite fields with Gröbner bases. In: Algebraic Informatics: 4th International Conference, CAI, pp. 140–157 (2011)
37. Gao, S.: Counting zeros over finite fields with Gröbner bases. Master's thesis, Carnegie Mellon University (2009)
38. Lv, J.: Scalable formal verification of finite field arithmetic circuits using computer algebra techniques. Ph.D. dissertation, Univ. of Utah, August 2012


39. Buchberger, B.: Ein Algorithmus zum Auffinden der Basiselemente des Restklassenringes nach einem nulldimensionalen Polynomideal. Ph.D. dissertation, University of Innsbruck (1965)
40. Buchberger, B.: A criterion for detecting unnecessary reductions in the construction of Gröbner-bases. In: Ng, E.W. (ed.) Symbolic and Algebraic Computation. LNCS, vol. 72, pp. 3–21. Springer, Heidelberg (1979). https://doi.org/10.1007/3-540-09519-5_52
41. Decker, W., Greuel, G.-M., Pfister, G., Schönemann, H.: Singular 4-1-0 – a computer algebra system for polynomial computations (2016). http://www.singular.uni-kl.de
42. Brayton, R., Mishchenko, A.: ABC: an academic industrial-strength verification tool. In: Touili, T., Cook, B., Jackson, P. (eds.) CAV 2010. LNCS, vol. 6174, pp. 24–40. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14295-6_5
43. López, J., Dahab, R.: Improved algorithms for elliptic curve arithmetic in GF(2^n). In: Tavares, S., Meijer, H. (eds.) SAC 1998. LNCS, vol. 1556, pp. 201–212. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48892-8_16

Energy-Accuracy Scalable Deep Convolutional Neural Networks: A Pareto Analysis

Valentino Peluso and Andrea Calimera

Department of Control and Computer Engineering, Politecnico di Torino, 10129 Turin, Italy
{valentino.peluso,andrea.calimera}@polito.it

Abstract. This work deals with the optimization of Deep Convolutional Neural Networks (ConvNets). It elaborates on the concept of Adaptive Energy-Accuracy Scaling through multi-precision arithmetic, a solution that allows ConvNets to be adapted at run-time to meet different energy budgets and accuracy constraints. The strategy is particularly suited for embedded applications that run at the “edge” on resource-constrained platforms. After introducing the basics that distinguish the proposed adaptive strategy, the paper recalls the software-to-hardware vertical implementation of precision-scalable arithmetic for ConvNets. It then focuses on the energy-driven per-layer precision assignment problem, describing a meta-heuristic that searches for the most suitable representation of both weights and activations of the neural network. The same heuristic is then used to explore the optimal trade-off, providing the Pareto points in the energy-accuracy space. Experiments conducted on three different ConvNets deployed in real-life applications, i.e. Image Classification, Keyword Spotting, and Facial Expression Recognition, show that adaptive ConvNets reach a better energy-accuracy trade-off than conventional static fixed-point quantization methods.

1 Introduction

Deep Neural Networks (DNNs) are computational models that emulate the activity of the human brain during pattern recognition. They consist of deep chains of neural layers that apply non-linear transformations to the input data [1]. The projection onto the new feature space enables a more efficient classification, achieving accuracies that are close to, and in some cases even above, those scored by the human brain. Convolutional Neural Networks [2] (ConvNets hereafter) are the first example of DNNs applied to problems of human-level complexity. They have brought about breakthroughs in computer vision [3] and voice recognition [4], improving the state of the art in many application domains. From a practical viewpoint, the forward pass through a ConvNet is nothing more than matrix multiplications between pre-trained parameters (the synaptic weights of the hidden neurons) and the input data.

© IFIP International Federation for Information Processing 2019. Published by Springer Nature Switzerland AG 2019. N. Bombieri et al. (Eds.): VLSI-SoC 2018, IFIP AICT 561, pp. 107–127, 2019. https://doi.org/10.1007/978-3-030-23425-6_6


The most common use-case for ConvNets is image classification, where a multi-channel image (e.g. RGB) is processed to produce as output the probability that the subject depicted in the picture belongs to a specific class of objects or concepts (e.g. car, dog, airplane, etc.). One can see this end-to-end inference process as a kind of data compression: high-volume raw data (the pixels of the image) are compressed into a highly informative tag (the resulting class). In this regard, adoption in the Internet-of-Things (IoT) is disruptive: distributed smart objects with embedded ConvNets may implement data analytics at the edge, near the source of data [5], with advantages in terms of predictability of the service response time, energy efficiency, privacy and, in general, scalability of the IoT infrastructure.

The design of embedded ConvNets encompasses a training stage during which the synaptic weights of the hidden neurons are learned using a back-propagation algorithm (e.g. the Stochastic Gradient Descent [6]). The learning is supervised and accuracy-driven, namely, it adjusts the weights such that an accuracy loss function evaluated over a set of labeled samples is minimized. Once trained, the ConvNet can be flashed onto the smart object and deployed at the edge, where it runs inference on previously unseen samples. Note that ConvNets presented in the literature differ in depth (number of layers) and size (number of neurons per layer); the topology may also change due to optional layers used to reduce the cardinality of the intermediate activations, e.g. local pooling layers, or their sparsity, e.g. Rectified Linear Units (ReLU). Regardless of the internal structure, ConvNets share a common characteristic: complexity. Even the simplest models, e.g. AlexNet [2] or the more compact MobileNets [7], have millions of synaptic weights to be stored and tens of thousands of matrix convolutions to be run [5]. This prevents their use on low-power embedded platforms, which offer low storage capacity, low compute power, and a limited energy budget. How to design ConvNets that fit these stringent resource constraints while preserving classification accuracy is the new challenge.

Recent works introduced several optimization strategies, both at the software level and the hardware level [8]. They mainly exploit the intrinsic redundancy of ConvNets in order to reduce (i) the number of weights/neurons (the so-called pruning methods), (ii) the arithmetic precision (quantization methods), or (iii) both [9]. Precision scaling is of practical interest due to its simplicity and the solid theories developed in the past for DSP applications. It concurrently reduces the memory footprint (the lower the bit-width, the lower the memory footprint) and the execution latency (the lower the bit-width, the faster the execution). The use of fixed-point arithmetic with 16 and 8 bits [10] instead of 32-bit floating-point, or even fewer bits, e.g. 6 and 4 bits [11], has shown remarkable savings with no, or very marginal, accuracy drop. Aggressive binarization [12] is an alternative approach provided that a large accuracy loss is acceptable. Obviously, the implementation of quantized ConvNets requires integer units that can process data with reduced representations; recent hardware designs, both from industry and academia, follow this trend [13–15].
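To make the effect of precision scaling concrete, the following Python sketch quantizes a handful of weights to signed fixed-point integers at two bit-widths and compares the reconstruction error. It is a generic symmetric uniform quantizer chosen for illustration, not the quantization scheme of the cited works; the function names and the weight values are made up, and the sketch assumes at least one non-zero weight.

```python
def quantize(values, bits):
    """Symmetric uniform quantization to signed `bits`-bit integers.
    Assumes at least one non-zero value (otherwise the scale is zero)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def max_error(values, bits):
    """Largest absolute reconstruction error at the given bit-width."""
    codes, scale = quantize(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(v - r) for v, r in zip(values, restored))

# Made-up weights, only to exercise the functions.
weights = [0.5, -0.25, 0.125, -1.0]
err8 = max_error(weights, 8)   # 8-bit representation
err4 = max_error(weights, 4)   # 4-bit representation
```

Each weight stored with 8 bits takes a quarter of the memory of a 32-bit value, and the rounding error is bounded by half of the quantization step; halving the bit-width again to 4 bits enlarges that step, which is the accuracy side of the trade-off discussed above.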


Most of the existing optimizations, both pruning and quantization, were originally conceived as static methods. Let us consider quantization. For a given ConvNet, the numeric precision of the weights is defined at design time and then kept constant at run-time. Therefore, the design effort is that of finding the proper bit-width such that accuracy losses are minimized [16]. Although effective, this approach is very conservative, as inference always operates at full speed and hence under maximum resource usage. Adaptive strategies that speculate on the quality of results to reach higher energy efficiency are a more interesting option for portable devices deployed on non-critical missions [17]. There exist applications or use cases for which the classification accuracy can be relaxed without much affecting the user perception or, alternatively, conditions under which other extra-functional properties of the system, e.g. energy budget or latency, get higher priority. For such cases, one may use the arithmetic precision as a control knob to manage the resources. This concept of energy-accuracy scaling is a well-established technique for VLSI designs [18], while it represents a less explored option for ConvNets (and DNNs in general).

The idea of energy-accuracy scalable ConvNets through dynamic precision scaling was first introduced in [19] and then elaborated in [20] with the introduction of an energy-driven optimization framework. The method applies to software-programmable arithmetic accelerators where precision scaling is achieved through variable-latency Multiply & Accumulate (MAC) instructions. This implementation applies to any general-purpose MCU (e.g. [21]) or application-specific processor with a multi-precision instruction set (e.g. Google TPU [22]); it can also be extended to dedicated architectures (either ASIC or FPGA accelerators [23]). This chapter further investigates this strategy by introducing a Pareto analysis of the energy-accuracy space. An optimization engine is used to identify the arithmetic precision that minimizes energy and accuracy loss concurrently. The obtained precision settings can be loaded at run-time with minimal overhead, thus allowing ConvNets to reach the operating conditions that satisfy the requirements imposed at the system level. As test benches we used three real-life applications built upon state-of-the-art ConvNets, i.e. Image Classification [24], Keyword Spotting [25], and Facial Expression Recognition [26]. Experimental results suggest that the proposed strategy is a practical solution for the development of flexible, yet efficient, IoT applications.

The remaining sections are organized as follows. Section 2 gives an overview of related works in the field. Section 3 describes the implementation details of the single weight-set multi-precision arithmetic used in scalable ConvNets. Section 4 recalls the optimization engine and the energy-accuracy models adopted. Finally, Sect. 5 shows the Pareto analysis over the three benchmarks and the performance of the optimization heuristic.

2 Related Works

With the emergence of the edge-computing paradigm, reducing the complexity of ConvNets has become the new challenge for the IoT segment. The problem is being addressed from different perspectives: with the design of custom hardware that improves the execution of data-intensive loops, achieving energy efficiencies of a few pico-Joules per operation [11,27]; with new learning strategies that generate less complex networks [28]; with iso-accuracy compression techniques aimed at squeezing the model complexity. A thorough review is reported in [29]. Note that, while many existing techniques are conceived as static methods, the dynamic management of ConvNets is a less explored field. This work deals with the latter aspect.

2.1 Adaptive ConvNets

Following the recent literature, the concept of adaptive ConvNets may have multiple interpretations and hence different implementations. On the one hand, there are solutions that adapt to the complexity of the input data; on the other hand, there are solutions that adapt to external conditions or triggers, regardless of data complexity. The former class is mainly represented by techniques that implement the general principle of coarse-to-fine computation [30]. These methods make use of branches in the internal network topology, generating conditional deep neural nets [31]. In its simplest implementation, a conditional ConvNet is made up of a chain of two classifiers, a coarse classifier (for “easy” inputs) and a fine classifier (for “hard” inputs) [32]; the coarse classifier is always on, while the fine classifier is only occasionally activated for “hard” inputs (statistically less frequent). As a result, ConvNets can adapt to the complexity of data at run-time. An extension with deeper chains of quantized micro-classifiers is proposed in [33], while in [34] the authors propose the use of Dynamic Voltage Accuracy Frequency Scaling (DVAFS) for the recognition of objects of different complexity.

Concerning the second class, which is the main target of this work, adaptivity is achieved by tuning the computational effort of the ConvNet depending on the desired accuracy. The control knob is the arithmetic precision of the convolutional layers. The work described in [19] goes in this direction, as it introduces a HW-SW co-design to implement multi-precision arithmetic at run-time. Depending on the parallelism of the HW integer units (e.g. 16 or 8 bits), weights can be loaded and processed using different bit-widths, thus achieving different degrees of accuracy under different energy budgets. This is the enabler for accuracy-energy scaling in adaptive ConvNets.
Note that, unlike static quantization methods, where different accuracy levels could be achieved using multiple pre-trained weight sets stored as separate entities, here the precision scaling is achieved using a single set of weights and incomplete arithmetic operations. The same strategy is adopted in this work. Hybrid solutions may jointly exploit the complexity of the input problem and the accuracy imposed at the application level. For instance, the authors of [35] introduce the concept of multi-level classification, where the classification task can be performed at different levels of semantic abstraction: the higher the abstraction, the easier the classification problem. Then, depending on the abstraction level and the desired accuracy, the ConvNet is tuned to achieve the maximum energy efficiency.

2.2 Fixed-Point Quantization

Since the multi-precision strategy adopted in this work encompasses the quantization to fixed-point, this subsection gives a brief taxonomy of the existing literature on the subject. Complexity reduction through fixed-point quantization exploits the characteristics of the weight distributions across different convolutional layers in order to find the most efficient data representation [36]. Two main stages are involved: the definition of the bit-width, i.e. the data parallelism, and the radix-point scaling, i.e. the position of the radix point. A common practice is to define the bit-width depending on hardware availability (e.g. 16 or 8 bits for most architectures), then find the radix-point position that minimizes the quantization error. The existing techniques, mainly from DSP theory, differ in the radix-point scaling scheme. A complete review is out of the scope of this work and the interested reader can refer to [8]. It is worth emphasizing that a one-size-fits-all solution does not exist, as efficiency is affected by the kind of neural network under analysis and the characteristics of the adopted hardware.

A more relevant discriminant factor is the spatial granularity at which the fixed-point format is applied, per-net or per-layer. In the former case all the layers share the same representation; in the latter case, each layer has its own representation. Since the weight distribution may differ substantially from layer to layer, a finer, i.e. per-layer, approach achieves lower accuracy loss [36]. Whatever the granularity, existing works from the machine-learning community, e.g. [36,37], focused on accuracy-driven optimal precision scaling. Only a few papers take hardware resources into account, which is paramount when dealing with embedded systems. The authors of [16] briefly describe a greedy approach where low precision is assigned starting from the first layer of the net (topological order) without considering the complexity of the layer. In [10] the authors describe the design of embedded ConvNets for FPGAs and propose a per-layer precision scaling that is aware of the number of memory accesses. Only very few works, e.g. [20,29], bring energy consumption as a direct variable into the optimization loop.

3 Energy-Accuracy Scalable Convolution

The proposed adaptive ConvNet strategy leverages precision scalable arithmetic. This section introduces a possible implementation of matrix convolution using software-programmable multi-precision Multiply & Accumulate (MAC) instructions. It first describes the algorithmic details, then it presents a custom processing element that accelerates the variable-latency MAC with minimal design overhead.

Fig. 1. Iterative multiply-accumulate algorithm. [Figure not reproduced; labels: N = 16, K = 8; cycles #1 to #4; Half: K×K = 8×8; Mixed: K×N = 8×16; Full: N×N = 16×16.]

3.1 SW: Multiprecision Convolution

For a given layer in a ConvNet, the convolution between the M × M input map matrix I and the M × M weight matrix of a kernel W is the dot-product of the two unrolled vectors I and W of length M × M, i.e. the sum of the M × M products Ii × Wi, as shown in Fig. 1. Assuming an N-bit fixed-point representation (N = 16 in this work), Ii and Wi can be seen as two concatenated halfwords of K = N/2 bits (K = 8): the most significant parts IiH and WiH and the least significant parts IiL and WiL. As depicted in Fig. 1, each single product Ii × Wi is implemented by means of a four-cycle procedure in which the most significant and least significant halfwords are iteratively multiplied, shifted and accumulated. Note that IiH and WiH are signed integers, whereas IiL and WiL are unsigned. Different precision options are reached by stopping the execution at earlier cycles: half (K × K), 1 cycle; mixed (K × N), 2 cycles; full (N × N), 4 cycles. An additional mixed-precision option (N × K) is obtained by swapping the second and the third cycle (2 cycles).

The same four options extend to the dot-product procedure described in Algorithm 1. At half precision, both operands Ii and Wi are reduced to K bits. The first loop (lines 1–2) operates on the most significant parts IiH, WiH. The result is then returned (line 3). At mixed precision, only one operand, the input Ii (or the weight Wi, not shown in the pseudo-code), is reduced to K bits. First, the partial result r is shifted by K bits (line 4), then the second loop (lines 5–6) iterates on IiH and WiL (IiL and WiH) and the result is returned (line 7). At full precision, both Wi and Ii are taken as N-bit operands. In this case the last two loops (lines 8–12) come into play: they iterate on the least significant part IiL and on Wi (both H and L), thus completing the remaining part of the product. To summarize, with N = 16, the available precision options are: half (K × K, i.e. 8 × 8), mixed (N × K, i.e. 16 × 8, or K × N, i.e. 8 × 16), full (N × N, i.e. 16 × 16). Given the regular structure of the algorithm, all of them can be implemented on the same K × K MAC unit.

Algorithm 1. Iterative multiply-accumulate algorithm

    Input: I, W, precision
    Output: Dot-Product r
     1  for i = 0; i < M; i = i + 1 do
     2      r = r + IiH × WiH
     3  if (precision == half) then return r         // half: KxK
     4  r = r << K
     5  for i = 0; i < M; i = i + 1 do
     6      r = r + IiH × WiL
     7  if (precision == mixed) then return r        // mixed: KxN
     8  for i = 0; i < M; i = i + 1 do
     9      r = r + IiL × WiH
    10  r = r << K
    11  for i = 0; i < M; i = i + 1 do
    12      r = r + IiL × WiL
    13  return r                                     // full: NxN

This straightforward algorithm offers a simple way to adjust the precision of the results and the resource usage. Firstly, it allows the computational effort, and hence the energy consumption, to scale with the arithmetic precision; secondly, it relieves the memory bandwidth, as fewer bits need to be moved from/to the memory banks at lower precisions (footnote 1).

3.2 HW: Variable-Latency Processing Element

Figure 2 gives the RTL view of the proposed processing element (PE) for N = 16. The PE is built around a 9 × 9 multiplier, where the 9th bit is used for the sign extension of the operands. As described in the previous subsection, the most significant parts (IiH, WiH) are signed, while the least significant parts (IiL, WiL) are unsigned. Therefore, the MSB of (IiL, WiL) belongs to the magnitude, while that of (IiH, WiH) is the sign. To account for this, we implemented the following mechanism: when (IiH, WiH) are processed, the sign is extended to the 9th bit by concatenating the MSB (i.e. the sign) of I and W; when (IiL, WiL) are processed, a 0 is concatenated. The selection is done through the control signals signed-I and signed-W driven by the local control unit (omitted in the picture for the sake of space). The same control unit is in charge of feeding the MAC with the right sequence of data (H or L) fetched from a local memory. The accumulator has 16 guard bits and an embedded saturation logic to handle underflow and overflow. The role of the programmable shifter is twofold: first, to shift the partial results when needed (see Algorithm 1); second, to implement the dynamic fixed-point arithmetic by moving the radix point of the final accumulation result depending on the desired fractional length [39]. A range-check logic drives bit saturation if the result does not fit the word length.

In order to minimize the dynamic power consumption, a zero-skipping strategy [34] is implemented by means of latch-based operand isolation and clock gating. If one of the operands is zero, the latches prevent the propagation of the inputs, minimizing the switching activity, while the clock-gating cell disables the clock signal, thus reducing the equivalent load capacitance of the clock signal.

Fig. 2. 8 × 8 HW unit for multi-precision MAC. [Figure not reproduced; labelled blocks: input latches for I and W with signed-I/signed-W MSB concatenation, zero-skip detection (== 0) with clock gate, multiplier, shifter, accumulator with saturation, truncation with range check.]

Footnote 1: We assume the availability of memories that support both word (N-bit) and halfword (K-bit) accesses [38].

3.3 Hardware Characterization

The proposed SW-HW precision scaling strategy can be implemented using both FPGA and ASIC technologies. In this work we designed and characterized the 8 × 8 MAC unit using a commercial 28 nm UTBB FDSOI technology and the Synopsys Galaxy Platform, version L-2016.03. The frequency constraint is set to 1 GHz at 0.90 V in a typical process corner (compliant with recent works that used the same technology [40]). Power consumption is extracted using Synopsys PrimeTime L-2016.06 with SAIF back-annotation. Collected results show a standard cell area of 1443 µm² and a total average power consumption of 0.95 mW. Compared to a traditional 8 × 8 MAC unit, the proposed architecture shows a 3.7% area penalty.

Table 1. Energy/MAC vs precision

    Precision (I × W)   Ncycles   EMAC (pJ)
    16 × 16             4         3.80
    16 × 8              2         1.90
    8 × 16              2         1.90
    8 × 8               1         0.95

Table 1 shows the latency (Ncycles) and the energy consumption per MAC operation (EMAC) for the four available precisions. Each row of the table corresponds to a different implementation point in the precision-energy space. If one of the two operands is zero, the energy Ezero reduces substantially thanks to the zero-skipping logic: Ezero = 0.103 · EMAC.
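As an illustration of Sect. 3.1, the four passes of Algorithm 1 can be sketched in a few lines of Python. This is only a behavioural sketch under stated assumptions, not the authors' implementation: the helper names (`split`, `dot`, `energy_pj`) are hypothetical, operands are plain Python integers in the signed 16-bit range, and the energy estimate simply multiplies the number of 8 × 8 cycles by the 0.95 pJ figure of Table 1, ignoring zero-skipping and memory accesses.

```python
K = 8                                        # halfword width in bits (N = 2 * K = 16)
CYCLES = {"half": 1, "mixed": 2, "full": 4}  # latencies from Table 1
E_MAC_PJ = 0.95                              # energy of one 8x8 cycle (Table 1)


def split(x):
    """Split a signed 16-bit integer into (signed high half, unsigned low half)."""
    return x >> K, x & ((1 << K) - 1)        # x == (high << K) + low


def dot(inputs, weights, precision="full"):
    """Dot-product that mirrors the four accumulation passes of Algorithm 1."""
    i_parts = [split(v) for v in inputs]
    w_parts = [split(v) for v in weights]
    pairs = list(zip(i_parts, w_parts))
    r = sum(ih * wh for (ih, _), (wh, _) in pairs)    # lines 1-2: IH x WH
    if precision == "half":
        return r
    r <<= K                                           # line 4
    r += sum(ih * wl for (ih, _), (_, wl) in pairs)   # lines 5-6: IH x WL
    if precision == "mixed":
        return r
    r += sum(il * wh for (_, il), (wh, _) in pairs)   # lines 8-9: IL x WH
    r <<= K                                           # line 10
    r += sum(il * wl for (_, il), (_, wl) in pairs)   # lines 11-12: IL x WL
    return r


def energy_pj(n_macs, precision):
    """Compute energy of n_macs non-zero MACs at the given precision (assumed model)."""
    return n_macs * CYCLES[precision] * E_MAC_PJ
```

At full precision the result equals the exact integer dot-product; at mixed precision it equals the dot-product of the truncated inputs (I >> 8) with the full weights; at half precision both operands are truncated. The three results therefore live on different scales (2^0, 2^-8 and 2^-16 of the full-precision value), which is what the programmable shifter of Sect. 3.2 compensates for.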

4 Energy-Driven Precision Assignment

4.1 Fixed-Point Quantization

The shift from floating-point to fixed-point is a well-known problem in the DSP domain. In this sub-section, we review the basic theory and the main aspects relevant to this work. A floating-point value V can be represented with a binary word Q of N bits using the following mapping function:

$$V = Q \cdot 2^{-FL} \qquad (1)$$

FL indicates the fraction length, i.e. the position of the radix point in Q. Given a set of real values, the choice of N and FL affects the information loss due to quantization. Since the bit-width N is usually given as a design constraint (e.g. 16-bit in this work), the problem reduces to searching for the optimal FL (the integer length IL is then given by N − FL). The choice of FL affects the maximum representable value |Vmax| and the minimum quantization error Qstep. Concerning |Vmax|, the relationship is described by the following equation, where BW denotes the bit-width:

$$FL = \log_2\left(\frac{2^{BW-1} - 1}{|V_{max}|}\right) \qquad (2)$$

A trade-off does exist: the larger the FL, the lower the Qstep, but also the lower the |Vmax|. The decision of which constraint to guard more (|Vmax| or Qstep) mainly depends on the distribution of the original floating-point weights and their importance in the neural model under quantization. A dynamic fixed-point scheme is implemented where the fraction length is defined layer by layer. The FLopt that minimizes the L2 distance between the original 32-bit floating-point values and the quantized values is searched among N − 1 possible values. The search is done over a calibration set built by randomly picking 100 samples from the training set. Note that our problem formulation applies a symmetric linear quantization using a binary radix-point scaling. It is also important to underline that quantization is not followed by retraining, a very time-consuming procedure even for small ConvNets.
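The search for the fraction length described above can be sketched as follows. This is a minimal sketch, not the authors' code: the function names are hypothetical, rounding to nearest and symmetric saturation are assumptions, the floor applied to Eq. 2 is an assumption (the equation as printed shows no rounding), and the candidate range 0 to N − 2 is one possible reading of "N − 1 possible values".

```python
import math

N = 16  # word length in bits


def quantize(v, fl, n=N):
    """Map v to a signed n-bit fixed-point value with fl fractional bits (Eq. 1)."""
    q_max = 2 ** (n - 1) - 1
    q = round(v * 2 ** fl)                 # integer code Q (round to nearest)
    q = max(-q_max, min(q_max, q))         # symmetric saturation (assumption)
    return q * 2 ** -fl


def fl_from_range(v_max, n=N):
    """Fraction length that still represents |v_max| (Eq. 2, floored by assumption)."""
    return math.floor(math.log2((2 ** (n - 1) - 1) / abs(v_max)))


def best_fl(values, n=N):
    """Exhaustive search for the fraction length that minimizes the L2 error."""
    def l2_error(fl):
        return sum((v - quantize(v, fl, n)) ** 2 for v in values)
    return min(range(n - 1), key=l2_error)  # candidates 0 .. n-2 (assumption)
```

For instance, `fl_from_range(1.0)` yields 14 for a 16-bit word, and `best_fl` applied to the weights of one layer returns the smallest fraction length among those with minimum L2 error. A per-layer scheme would simply call `best_fl` once per layer.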

4.2 Multiprecision Fixed-Point ConvNets

Problem Formulation. For a ConvNet of L layers, the classification accuracy can be scaled to different values by optimally selecting the arithmetic precision of each layer. The choice of such optimal precision should be made for the input map (I) and the weight (W) matrices of each layer i, and for the output map matrix (O) of the last layer (footnote 2). Assuming the availability of the four accuracy options described in Sect. 3, i.e. full (16 × 16), mixed (16 × 8 or 8 × 16), half (8 × 8), the precision of I and W of each layer, and that of O for the last layer, can be set to 8-bit or 16-bit. We encode the unknown of the problem as a vector X of (2 × L + 1) Boolean variables xi, where the variable x(2×L+1) refers to O. The encoding map is: x = 0 → 8-bit, x = 1 → 16-bit. The optimal assignment is the one that minimizes the total energy consumption E(X) while ensuring an accuracy loss λ(X) lower than a user-defined constraint λmax.

Energy-Driven Precision Assignment. The optimal precision assignment to each layer is carried out using a custom meta-heuristic based on Simulated Annealing (SA). Algorithm 2 shows the pseudo-code of the SA. It takes as inputs the parameters listed in Table 2.

Table 2. Simulated annealing hyper-parameters

    T0        Initial temperature
    Tf        Final temperature
    X0        Starting solution
    cooling   Temperature derating factor (geometric in our case)
    Kb        Normalization factor of the acceptance probability
    iter      Number of iterations for each temperature T
    λmax      User-defined accuracy drop (percentage)
    cal_set   Calibration set size

In all the experiments, the starting solution X0 assigns full precision (16-bit) to all the L layers (I, W, and O). The accuracy drop is estimated on a subset of images randomly picked from the training set, referred to as the calibration set; its size is defined by the cal_set parameter. At each iteration, the next state is generated as a random perturbation of the current state (line 6). For those states that satisfy the accuracy constraint (line 7), the energy cost function E is evaluated (line 8) through the function energy. If the energy reduces, i.e. ΔE (line 9) is negative (line 10), the new state is accepted (lines 11–12). If not, the new state is accepted following a Boltzmann probability function (lines 10–12); the acceptance ratio gets smaller as T reduces. States that show minimum energy are iteratively saved as best solutions (lines 13–15). Once the total number of iterations is reached (line 5), the temperature T is cooled down (line 17). The process iterates until the minimum temperature Tf is reached (line 4). The bottleneck of the algorithm is the call to the function accuracy_drop. For this reason, the algorithm keeps track of the already processed states (line 16); this information is fed to the accuracy_drop function, which can then bypass the accuracy estimation.

Footnote 2: The precision of O does not impact computation as it only affects the number of memory accesses.

Algorithm 2. Simulated Annealing

    Input: T0, Tf, X0, cooling, Kb, iter, λmax, cal_set
    Output: X
     1  T = T0
     2  E = energy(X0)
     3  Emax = energy(ones(2L + 1)); Emin = energy(zeros(2L + 1))
     4  while (T ≥ Tf) do
     5      for i = 0; i < iter; i = i + 1 do
     6          next_state = move(current_state)
     7          if accuracy_drop(next_state, cal_set, tested) < λmax then
     8              E_next = energy(next_state)
     9              ΔE = (E_next − E_current) / (Emax − Emin)
    10              if (ΔE < 0) or (exp[−ΔE / (Kb · T)] > random(0, 1)) then
    11                  current_state = next_state
    12                  E_current = E_next
    13              if E_current < E_best then
    14                  E_best = E_current
    15                  best_state = current_state
    16          update(tested)
    17      T = T · cooling
    18  return best_state

Energy. The system-level architecture depicted in Fig. 3 serves as a general template to describe application-specific processors for ConvNet computing, e.g. [11]. It consists of a planar array of processing elements (PE), in our case the MAC units described in Sect. 3, a set of SRAM buffers for storing temporary data (Input Buffer, Weight Buffer, and Output Buffer), an off-chip memory (DRAM) and its DMA, and a control unit (RISC) that schedules the operations. The total energy consumption E is the sum of two main contributions: E = Ecomp + Emem. Ecomp is the energy consumed by the PE array; Emem is the energy consumed due to data movement through the memory hierarchy.
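Returning to the search of Algorithm 2, its control flow can be sketched in Python as below. This is a hedged illustration rather than the authors' engine: `energy` and `accuracy_drop` are caller-supplied callables (toy stand-ins in the usage note), the single-bit flip used as `move` is an assumption, the dictionary `tested` plays the role of the cache of already evaluated states, and the default hyper-parameters are placeholders, not the values used in the experiments.

```python
import math
import random


def anneal(x0, energy, accuracy_drop, lam_max,
           t0=1.0, tf=1e-3, cooling=0.9, kb=1e-2, iters=20, seed=0):
    """Simulated annealing over Boolean precision vectors (0 = 8-bit, 1 = 16-bit)."""
    rng = random.Random(seed)
    n = len(x0)
    e_span = energy([1] * n) - energy([0] * n)    # Emax - Emin, used to normalize
    current, e_cur = list(x0), energy(x0)
    best, e_best = list(x0), e_cur
    tested = {}                                   # state -> accuracy drop (cache)
    t = t0
    while t >= tf:
        for _ in range(iters):
            nxt = list(current)
            nxt[rng.randrange(n)] ^= 1            # move: flip one bit (assumption)
            key = tuple(nxt)
            if key not in tested:                 # skip re-evaluating known states
                tested[key] = accuracy_drop(nxt)
            if tested[key] < lam_max:             # accuracy constraint
                e_next = energy(nxt)
                d_e = (e_next - e_cur) / e_span
                if d_e < 0 or math.exp(-d_e / (kb * t)) > rng.random():
                    current, e_cur = nxt, e_next
                if e_cur < e_best:
                    best, e_best = list(current), e_cur
        t *= cooling                              # geometric cooling
    return best
```

With toy models such as `energy = lambda x: sum(1 + b for b in x)` and `accuracy_drop = lambda x: 2.0 * x.count(0)`, a call like `anneal([1] * 5, energy, accuracy_drop, lam_max=5.0)` returns a vector that respects the accuracy constraint and has lower energy than the all-16-bit starting point.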

Fig. 3. Architectural template of ConvNet accelerators. [Figure not reproduced; labelled blocks: DRAM, DMA, RISC Control + IMEM, Input Buffer, Weight Buffer, Output Buffer, 2D SIMD Array of 16 PEs.]

The first term is defined as:

$$E^{comp} = \sum_{i=1}^{L} \left( E^{MAC} \cdot N_{cycles}(x_i) \cdot N_i^{MAC} + E^{zero} \cdot N_i^{zero} \right) \qquad (3)$$

L is the number of layers of the ConvNet. E_MAC is the energy consumption of the half-precision MAC (row 8 × 8 in Table 1). N_cycles is the latency of a single MAC operation of the i-th layer; it is given as a multiple of the latency of the half-precision MAC (row 8 × 8 in Table 1) and it is a function of the precision x_i. N_i^MAC is the number of non-zero MAC operations of the i-th layer. E_zero is the energy consumed under zero-skipping (mostly due to leakage). N_i^zero is the number of zero MAC operations. The second term is defined as:

$$E^{mem} = \sum_{i=1}^{2L+1} E^{MAC} \cdot \left[ \alpha_i(x_i) + \beta_i(x_i) + \gamma_i(x_i) \right] \qquad (4)$$

E_MAC is the same as in Eq. 3, while αi, βi and γi are three parameters that describe the energy consumed by the i-th layer due to reading/writing the input map (αi), the weights (βi), and the output map (γi). More specifically, they represent the ratio between the energy consumption of the memory and the energy consumption of the PE array; here again, the energy unit is the half-precision MAC (row 8 × 8 in Table 1) [11]. α and β do not contribute for the final output layer: αL+1 = 0 and βL+1 = 0. All three parameters are functions of the layer precision xi: both fetch and write-back operations depend on (i) the accuracy of the MAC algorithm, and (ii) the number of zero-multiplications (switching activity to/from memory may change substantially). Moreover, αi, βi, γi change depending on the ConvNet model: number and size of weights/channels per layer, stride and padding. Finally, they also differ depending on the size of the hardware components (PE array and global buffers). Since the target of this work is neither the energy model per se nor the evaluation of different architectural solutions, αi, βi, γi are extracted for the architecture proposed in [11] and then scaled to our precision reduction strategy. The same Emem model applies to different architectures by proper tuning of the three parameters.
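To make Eqs. 3 and 4 concrete, the two energy terms can be evaluated with a few lines of Python. The sketch below rests on assumptions that are not part of the original text: layers are described by plain dictionaries with hypothetical keys, Ncycles is taken from Table 1 as the product of the operand widths expressed in halfwords, Ezero is taken as 0.103 times the half-precision EMAC, and the memory ratios (α, β, γ) are passed in already evaluated for the chosen precision, since their actual values depend on the architecture of [11].

```python
E_MAC = 0.95              # pJ, half-precision MAC (row 8x8 of Table 1)
E_ZERO = 0.103 * E_MAC    # pJ, MAC with a zero operand (assumed relative to 8x8)


def n_cycles(i_bits, w_bits):
    """Latency in 8x8 cycles for the given operand widths (Table 1)."""
    return (i_bits // 8) * (w_bits // 8)


def e_comp(layers):
    """Eq. 3. Each layer is a dict with keys i_bits, w_bits, n_mac, n_zero."""
    return sum(E_MAC * n_cycles(l["i_bits"], l["w_bits"]) * l["n_mac"]
               + E_ZERO * l["n_zero"]
               for l in layers)


def e_mem(mem_ratios):
    """Eq. 4. mem_ratios is a list of (alpha, beta, gamma) tuples, in E_MAC units."""
    return sum(E_MAC * (a + b + g) for a, b, g in mem_ratios)


def total_energy(layers, mem_ratios):
    """E = E_comp + E_mem, in pJ."""
    return e_comp(layers) + e_mem(mem_ratios)
```

As a usage example, a single layer with 100 non-zero MACs at 16 × 16 and 10 skipped MACs gives `e_comp` = 0.95 · 4 · 100 + 0.103 · 0.95 · 10 pJ; lowering its inputs to 8 bits halves the first contribution.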


Accuracy Drop. The accuracy drop is computed as the ratio between the number of misclassified images and the total number of images in the calibration set (cal_set); hence its estimation implies the execution of S feed-forward inferences using the quantized fixed-point model (S being the cardinality of cal_set). Unfortunately, common GPUs do not have integer units. To address this issue we implemented the fake quantization proposed in [37]. It is a SW strategy that emulates the loss of information due to fixed-point arithmetic while still using the floating-point data type. Each layer is wrapped with a software module that converts its input data and weights (32-bit floating-point) into a fake integer, namely, a value that is still a 32-bit floating-point number, but altered by an amount equal to the error that the fixed-point representation would have introduced. The advantage is that all the fixed-point operations are physically run by the high-performance FP units.
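The idea of fake quantization and of the resulting misclassification ratio can be illustrated with a toy example. The sketch below is not the wrapper of [37] nor the authors' code: it assumes round-to-nearest with symmetric saturation, uses a hypothetical single-layer linear classifier in place of a ConvNet, and shares one fraction length between inputs and weights for brevity.

```python
def fake_quantize(values, fl, n=16):
    """Return floats that carry the rounding/saturation error of n-bit fixed point."""
    scale = 2.0 ** fl
    q_max = 2 ** (n - 1) - 1
    out = []
    for v in values:
        q = max(-q_max, min(q_max, round(v * scale)))  # integer code
        out.append(q / scale)                          # back to floating point
    return out


def linear_scores(x, weight_rows, fl, n=16):
    """Toy layer wrapped with fake quantization of its inputs and weights."""
    xq = fake_quantize(x, fl, n)
    return [sum(a * b for a, b in zip(xq, fake_quantize(row, fl, n)))
            for row in weight_rows]


def misclassification_ratio(samples, weight_rows, fl, n=16):
    """Fraction of (input, label) pairs whose arg-max score differs from the label."""
    wrong = 0
    for x, label in samples:
        scores = linear_scores(x, weight_rows, fl, n)
        if scores.index(max(scores)) != label:
            wrong += 1
    return wrong / len(samples)
```

With identity weights and the two samples `([0.9, 0.1], 0)` and `([0.3, 0.4], 1)`, a fraction length of 8 classifies both correctly, whereas a fraction length of 0 rounds the second input to zero and misclassifies it, giving a ratio of 0.5.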

5 Results

5.1 Experimental Set-up

The objective of this work is to provide a Pareto analysis of adaptive ConvNets implemented with the proposed energy-accuracy scaling strategy. As benchmarks we adopted three different applications which are reaching widespread use in several domains: Image Classification (IC), Keyword Spotting (KWS), and Facial Expression Recognition (FER). Additional details are provided in the next subsection. The exploration of the energy-accuracy space is conducted using the SA engine introduced in Sect. 4. More specifically, the algorithm is run under different accuracy-loss constraints, from 1% to 15% with a step of 1%, collecting the energy consumption reached by the optimal precision settings. Table 3 summarizes the SA parameters used in the experiments. For all the networks we selected the same hyper-parameters, except for the number of iterations iter at a given temperature T. As described in the next subsection, the three ConvNets have a different number of layers, hence different complexity; as the cardinality of the search space increases, more iterations are needed to explore the cost function.

Table 3. Simulated annealing hyper-parameter values

    Parameter   IC        KWS       FER
    T0          512       512       512
    Tf          2.5       2.5       2.5
    Kb          1e−2      1e−2      1e−2
    cooling     2.5       2.5       2.5
    iter        10        10^2      10^3
    λmax        1%–15%, step 1% (all benchmarks)
    cal_set     2000      2000      2000

5.2 Benchmarks

The three benchmarks under analysis serve very different purposes; their functionality, main characteristics and training sets are therefore described separately.

Image Classification (IC): the typical image recognition task on the popular CIFAR-10 dataset. The dataset collects 60000 32 × 32 RGB images [24] evenly split into 10 classes, with 50000 and 10000 samples for the training set and the test set respectively. The adopted ConvNet is taken from the Caffe framework [41]; it consists of three convolutional layers interleaved with max-pooling layers and one fully-connected layer.

Keyword Spotting (KWS): a standard problem in the field of speech recognition. We considered a simplified version of the problem (footnote 3). The reference dataset is the Speech Commands Dataset [25]; it consists of 65k one-second-long audio samples collected during the repetition of 30 different words by thousands of different people. The goal is to recognize 10 specific keywords, i.e. “Yes”, “No”, “Up”, “Down”, “Left”, “Right”, “On”, “Off”, “Stop”, “Go”, out of the 30 available words; samples that do not fall in these 10 categories are labeled as “unknown”. There is also an additional “silence” class made up of background noise samples (pink noise, white noise, and human-made sounds). The training set and the test set collect 56196 and 7518 samples, respectively. The adopted ConvNet is the cnn-one-fstride4 described in [42]; it has two convolutional layers, one max-pooling layer and four fully-connected layers. The ConvNet is fed with the spectrogram of the recorded signal, which is obtained through the pre-processing pipeline introduced in [42] (extraction of time × frequency = 32 × 40 inputs w/o any data augmentation).

Facial Expression Recognition (FER): it is about inferring the emotional state of people from their facial expression. Quite popular in the field of visual reasoning, this task is very challenging as many face images might convey multiple emotions. The reference dataset is the Fer2013 dataset provided by the Kaggle competition [26]. It collects 32297 48 × 48 gray-scale facial images split into 7 categories, i.e. “Angry”, “Disgust”, “Fear”, “Happy”, “Sad”, “Surprise”, “Neutral”. The training set consists of 28708 examples, while the remaining 3589 are in the test set. The adopted ConvNet (footnote 4) consists of nine convolutional layers evenly spaced by three max-pooling layers, and one fully-connected layer.

Table 4. Benchmarks overview. Convolutional layers have shape (ch, kh, kw), fully-connected layers have shape (ch), and max-pooling layers have shape (kh, kw), where kh and kw are respectively the kernel height and width, and ch denotes the number of output channels.

    Application          IC                  KWS                    FER
    Dataset              CIFAR-10 [24]       Speech Commands [25]   FER2013 [26]
    Input Shape          3 × 32 × 32         1 × 32 × 40            1 × 48 × 48
    Model Architecture   Conv2d (32,5,5)     Conv2d (186,32,8)      Conv2d (32,3,3)
                         MaxPool2d (3,3)     MaxPool2d (1,1)        Conv2d (32,3,3)
                         Conv2d (32,5,5)     Conv2d (64,10,4)       Conv2d (32,3,3)
                         MaxPool2d (3,3)     Linear (32)            MaxPool2d (2,2)
                         Conv2d (64,5,5)     Linear (128)           Conv2d (64,3,3)
                         MaxPool2d (3,3)     Linear (128)           Conv2d (64,3,3)
                         Linear (10)         Linear (12)            Conv2d (64,3,3)
                                                                    MaxPool2d (2,2)
                                                                    Conv2d (128,3,3)
                                                                    Conv2d (128,3,3)
                                                                    Conv2d (128,3,3)
                                                                    MaxPool2d (2,2)
                                                                    Linear (7)
    Top-1 Acc            83.04%              80.94%                 65.67%
    #MACs                12 298 240          504 128                149 331 456
    #Op. Points          512                 8 192                  2 097 152

Footnote 3: https://www.tensorflow.org/tutorials/sequences/audio_recognition.

Each benchmark is powered by a different model whose topology is described in Table 4. Within the same table we also collected additional information: the top-1 classification accuracy achieved with the original 32-bit floating-point model (Top-1 Acc), trained w/o any optimization; the overall number of MAC instructions for one inference run using 32-bit floating-point representations (#MACs); the number of possible precision configurations, namely the number of possible operating points in the parameter space (#Op. Points). Concerning the Top-1 accuracy reported in Table 4, the results are consistent with the state of the art.
They were obtained with a dedicated training and testing framework integrated into PyTorch, version 0.4.1, with the following settings: 150 training epochs using the Adam algorithm [43]; learning rate 1e−3; linear decay 0.1 every 50 epochs; batch size of 128 samples randomly picked from the training set; non-overlapping testing set and training set.

5.3 Results

Table 5 shows the top-1 prediction accuracy achieved with a coarse per-net precision scaling scheme in which all the layers share the same precision. The table collects the results for the original 32-bit floating-point model and the four fixed-point precision options made available by the multi-precision arithmetic described in Sect. 3. Note that we do not run any retraining after quantization, which allows storing a single set of weights for any desired precision. Previous works suggest a re-training stage to recover the loss due to quantization, which would imply that each precision is coupled with a different fine-tuned model. What we propose instead is the use of a unique set of weights trained at full precision (i.e. 16-bit for both weights and activations); then, at run-time, data are fetched and processed with the right precision. This is the key advantage of the proposed multi-precision scheme and the main enabler for adaptive ConvNets.

As reported in the table, the full-precision fixed-point ConvNets (column 16×16 Fix) keep almost the same accuracy as the original floating-point model (the maximum relative drop is 0.02% for KWS). The results are in line with previous works and motivate the choice of 16×16 as the baseline for comparison. Concerning the mixed-precision options, 8×16 assigns 8 bits to the input maps (I) and 16 bits to the weights (W); 16×8 does the opposite. The 8×16 option is by far more accurate than 16×8: minimum drop of 0.93% for IC; maximum drop of 2.07% for FER. The half-precision option (column 8×8) shows a larger loss: minimum drop of 4.90% for KWS; maximum drop of 10.97% for IC. These numbers suggest that the per-net granularity is too coarse for effective deployment of adaptive ConvNets. Among the four available precision options, a very small set per se, only three are of practical use, i.e. 16×16, 8×16, 16×8. Indeed, when precision is reduced to 8×8 all three benchmarks show a dramatic quality degradation. For instance, when shifted from 8×16 to 8×8, the accuracy drop of IC grows by more than 10× (from 0.93% to 10.97%). This calls for a finer precision assignment policy, which is the technique proposed in this work.

A detailed analysis of the results is provided by means of a Pareto analysis, Fig. 4. The plots show the possible operating points in the energy-accuracy space achieved with a per-net precision scaling (blue ×) and the proposed per-layer precision scaling (red •). Each point comes with a different precision setting.

Table 5. Per-net precision scaling: top-1 accuracy

       32-bit FP   16×16 Fix (full)   8×16 Fix (mixed)   16×8 Fix (mixed)   8×8 Fix (half)
IC     83.04%      83.04%             82.27%             73.07%             73.93%
KWS    80.94%      80.92%             79.99%             77.26%             76.97%
FER    65.70%      65.67%             64.31%             62.78%             59.04%

Fig. 4. Operating points. Accuracy drop normalized w.r.t. full precision (16 × 16). (Color figure online)

4 Inspired by https://github.com/JostineHo/mememoji.
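The drops quoted above are relative to the 16×16 fixed-point baseline. As a minimal sketch (not part of the original framework), the arithmetic can be reproduced from the Table 5 entries; the dictionary layout and the helper name `relative_drop` are illustrative, and the FER 16×16 entry is assumed to be 65.67, the reading consistent with the 2.07% figure in the text.

```python
# Top-1 accuracy (%) as read from Table 5 (fixed-point columns only).
# Assumption: the FER 16x16 entry is 65.67, the reading that matches the
# 2.07% drop quoted in the text for the 8x16 option.
TOP1 = {
    "IC":  {"16x16": 83.04, "8x16": 82.27, "16x8": 73.07, "8x8": 73.93},
    "KWS": {"16x16": 80.92, "8x16": 79.99, "16x8": 77.26, "8x8": 76.97},
    "FER": {"16x16": 65.67, "8x16": 64.31, "16x8": 62.78, "8x8": 59.04},
}


def relative_drop(benchmark, option, baseline="16x16"):
    """Accuracy drop of `option` relative to `baseline`, in percent."""
    ref = TOP1[benchmark][baseline]
    return 100.0 * (ref - TOP1[benchmark][option]) / ref


if __name__ == "__main__":
    for bench, row in TOP1.items():
        for option in ("8x16", "16x8", "8x8"):
            print(f"{bench:3s} {option:5s} {relative_drop(bench, option):6.2f}%")
```

With these inputs the sketch returns 0.93% (IC, 8×16), 2.07% (FER, 8×16) and 10.97% (IC, 8×8), matching the values discussed in the text.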


Table 6. Comparison between the per-net precision scaling and the per-layer precision scaling with the proposed SA optimization: the collected statistics refer to the Pareto curves of the two solutions (full-precision excluded).

Benchmark   Optimization   # Op. Points   Av. Drop (%)   Av. Savings (%)   Av. Exec. Time
IC          Per-Net        2              −5.95          42.56
IC          SA             4              −4.84          42.83             8 s
KWS         Per-Net        2              −3.52          40.09
KWS         SA             4              −2.93          44.95             13 s
FER         Per-Net        2              −6.13          41.41
FER         SA             8              −4.60          39.83             66 min 18 s

The accuracy drop and the energy savings are normalized with respect to full precision (rightmost × marker with 0% accuracy drop). The dotted lines connect the points at the Pareto frontier. As mentioned above, with the per-net granularity only three of the four points are Pareto-optimal. Moreover, the shift from one operating point to another is very coarse, with a substantial accuracy drop. The advantage of the per-layer scheme is twofold. First, the Pareto curve is denser and hence gives more options for finer control; this aspect is evident in larger ConvNets (e.g. FER). Second, the Pareto curve dominates the per-net solutions, thus enabling larger (or comparable) average energy savings.

Table 6 reports some statistics over the subset of Pareto points, both per-net and per-layer. The column #Op. Points gives the number of Pareto points; column Av. Drop refers to the accuracy drop averaged over the Pareto points; column Av. Savings does the same for the energy savings. For all three benchmarks, the energy-accuracy scaling operated with an optimal per-layer multi-precision assignment ensures optimality and usability in several context scenarios.

Table 6 also shows the average execution time taken by the SA engine to draw a Pareto point (column Av. Exec. Time). Results are collected on a workstation powered by an Intel i7-8700K CPU and an NVIDIA GTX-1080 GPU with CUDA 9.0. As expected, the time grows with network complexity. For the largest benchmark (FER) the tool takes 66 min and 18 s. A viable option to improve performance is to reduce the granularity at which the SA explores the parameter space. This can be achieved by constraining the number of iterations for each explored temperature T (parameter iter in Table 2). A quantitative comparison is given in Fig. 5, whose plot shows the Pareto curves obtained with iter = 1000 (the original value), 500 and 250 for the FER benchmark. The execution time reduces linearly, i.e. 66 min 18 s with iter = 1000, 33 min 35 s with iter = 500, and 16 min 36 s with iter = 250, while the quality of results reveals more interesting trends. Whereas it is generally true that a larger iter leads to better absolute numbers, the gain practically fades when considering the relative distance between the obtained curves. With iter = 1000 the average savings across the Pareto points (39.3%) is only about 5 percentage points larger than that obtained using iter = 500 (34.6%) and iter = 250 (34.4%); both iter = 1000 and iter = 500 collect the same number of Pareto points, 7 overall; only with iter = 250 does the number of Pareto points reduce from 7 to 5. This analysis suggests that for larger ConvNets there is a margin for tuning the SA to a reasonable execution time without degrading the quality much.

Fig. 5. Pareto analysis for FER benchmark using different number of iterations during the SA evolution: 1000, 500, 250.
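The Pareto statistics discussed above (number of Pareto points, average drop, average savings) can be illustrated with a short sketch. It is not the authors' tool: the function names and the sample operating points are hypothetical, and each point is assumed to be a pair (energy savings in %, accuracy drop in %) where higher savings and lower drop are both better.

```python
def pareto_front(points):
    """Return the non-dominated points, sorted by increasing energy savings.

    Each point is (energy_savings_pct, accuracy_drop_pct).
    """
    front = []
    best_drop = float("inf")
    # Visit points from the highest savings down; among equal savings the
    # lowest drop comes first. A point survives only if it improves the drop
    # over every point that saves at least as much energy.
    for savings, drop in sorted(points, key=lambda p: (-p[0], p[1])):
        if drop < best_drop:
            front.append((savings, drop))
            best_drop = drop
    return front[::-1]


def average_over_front(front):
    """Average (savings, drop) over the front, full precision (0, 0) excluded."""
    pts = [p for p in front if p != (0, 0)]
    if not pts:
        raise ValueError("front holds no point besides full precision")
    n = len(pts)
    return (sum(s for s, _ in pts) / n, sum(d for _, d in pts) / n)


if __name__ == "__main__":
    # Hypothetical operating points, for illustration only.
    candidates = [(0, 0), (20, 1), (25, 3), (15, 2), (40, 6), (40, 9), (30, 6)]
    front = pareto_front(candidates)
    print(front)
    print(average_over_front(front))
```

A single sort followed by one pass keeps the extraction at O(n log n), which matters little for the handful of per-net points but helps when the per-layer search produces many candidates.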

6 Conclusions

The evolution of ConvNets has been driven by accuracy improvement. High accuracy has come at the cost of large-scale network topologies, which turned inference into too expensive a task for low-power, energy-constrained embedded systems. ConvNet compression is therefore an urgent need for the growth of neural computing at the edge. While most of the existing techniques focus on static optimizations, dynamic resource management represents a viable option to further improve energy efficiency. This chapter introduced a practical implementation of adaptive ConvNets. The proposed strategy allows ConvNets to relax their computational effort, and hence their energy consumption, leveraging the accuracy margin typical of non-critical applications. The technique is built upon a low-overhead implementation of dynamic multi-precision arithmetic. The resulting ConvNets are free to move in the energy-accuracy space, achieving better tradeoffs. A Pareto analysis conducted on three representative applications (Image Recognition, Keyword Spotting, Facial Expression Recognition) quantified the energy savings and suggested potential improvements for the Simulated Annealing (SA) optimization engine. Future work will bring this adaptive strategy to larger ConvNets deployed on real HW implementations.


References

1. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
2. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
3. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
4. Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.-R., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97 (2012)
5. Xu, X., Ding, Y., Hu, S.X., Niemier, M., Cong, J., et al.: Scaling for edge inference of deep neural networks. Nat. Electron. 1(4), 216 (2018)
6. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Lechevallier, Y., Saporta, G. (eds.) Proceedings of COMPSTAT’2010, pp. 177–186. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-7908-2604-3_16
7. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
8. Sze, V., Chen, Y.-H., Yang, T.-J., Emer, J.: Efficient processing of deep neural networks: a tutorial and survey. arXiv preprint arXiv:1703.09039 (2017)
9. Grimaldi, M., Tenace, V., Calimera, A.: Layer-wise compressive training for convolutional neural networks. Future Internet 11(1) (2018). http://www.mdpi.com/1999-5903/11/1/7
10. Szegedy, C., Liu, C., Jia, Y., Sermanet, P., Reed, S., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
11. Chen, Y.-H., Krishna, T., Emer, J.S., Sze, V.: Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid State Circuits 52(1), 127–138 (2017)
12. Courbariaux, M., Bengio, Y., David, J.-P.: BinaryConnect: training deep neural networks with binary weights during propagations. In: Advances in Neural Information Processing Systems, pp. 3123–3131 (2015)
13. Flamand, E., Rossi, D., Conti, F., Loi, I., Pullini, A., et al.: Gap-8: a RISC-V SoC for AI at the edge of the IoT. In: 2018 IEEE 29th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pp. 1–4. IEEE (2018)
14. Moons, B., Verhelst, M.: A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets. In: IEEE Symposium on VLSI Circuits (VLSI-Circuits), pp. 1–2. IEEE (2016)
15. Albericio, J., Delmás, A., Judd, P., Sharify, S., O’Leary, G., et al.: Bit-pragmatic deep neural network computing. In: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 382–394. ACM (2017)
16. Moons, B., De Brabandere, B., Van Gool, L., Verhelst, M.: Energy-efficient ConvNets through approximate computing. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–8. IEEE (2016)
17. Shafique, M., Hafiz, R., Javed, M.U., Abbas, S., Sekanina, L.: Adaptive and energy-efficient architectures for machine learning: challenges, opportunities, and research roadmap. In: 2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 627–632. IEEE (2017)


18. Alioto, M., De, V., Marongiu, A.: Energy-quality scalable integrated circuits and systems: continuing energy scaling in the twilight of Moore’s law. IEEE J. Emerg. Sel. Top. Circuits Syst. 8(4), 653–678 (2018)
19. Peluso, V., Calimera, A.: Weak-MAC: arithmetic relaxation for dynamic energy-accuracy scaling in ConvNets. In: IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5. IEEE (2018)
20. Peluso, V., Calimera, A.: Energy-driven precision scaling for fixed-point ConvNets. In: 2018 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), pp. 1–6. IEEE (2018)
21. Lai, L., Suda, N.: Enabling deep learning at the IoT edge. In: Proceedings of the International Conference on Computer-Aided Design, p. 135. ACM (2018)
22. Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., et al.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, pp. 1–12. ACM, New York (2017). http://doi.acm.org/10.1145/3079856.3080246
23. Moons, B., Verhelst, M.: An energy-efficient precision-scalable ConvNet processor in 40-nm CMOS. IEEE J. Solid State Circuits 52(4), 903–914 (2017)
24. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, Citeseer (2009)
25. Warden, P.: Speech commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209 (2018)
26. Challenges in representation learning: facial expression recognition challenge. http://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge
27. Andri, R., Cavigelli, L., Rossi, D., Benini, L.: YodaNN: an architecture for ultra-low power binary-weight CNN acceleration. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 37, 48–60 (2017)
28. Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., et al.: Recent advances in convolutional neural networks. Pattern Recogn. (2017). http://www.sciencedirect.com/science/article/pii/S0031320317304120
29. Yang, T.J., Chen, Y.H., Sze, V.: Designing energy-efficient convolutional neural networks using energy-aware pruning. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6071–6079, July 2017
30. Fleuret, F., Geman, D.: Coarse-to-fine face detection. Int. J. Comput. Vis. 41(1), 85–107 (2001)
31. Panda, P., Sengupta, A., Roy, K.: Conditional deep learning for energy-efficient and enhanced pattern recognition. In: Proceedings of the 2016 Conference on Design, Automation & Test in Europe, DATE 2016, pp. 475–480. EDA Consortium, San Jose (2016). http://dl.acm.org/citation.cfm?id=2971808.2971918
32. Yan, Z., Zhang, H., Piramuthu, R., Jagadeesh, V., DeCoste, D., et al.: HD-CNN: hierarchical deep convolutional neural networks for large scale visual recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2740–2748 (2015)
33. Neshatpour, K., Behnia, F., Homayoun, H., Sasan, A.: ICNN: an iterative implementation of convolutional neural networks to enable energy and computational complexity aware dynamic approximation. In: Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 551–556. IEEE (2018)
34. Moons, B., Uytterhoeven, R., Dehaene, W., Verhelst, M.: 14.5 Envision: a 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28 nm FDSOI. In: IEEE International Solid-State Circuits Conference (ISSCC), pp. 246–247. IEEE (2017)


35. Peluso, V., Calimera, A.: Scalable-effort ConvNets for multilevel classification. In: 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8. IEEE (2018)
36. Lin, D., Talathi, S., Annapureddy, S.: Fixed point quantization of deep convolutional networks. In: International Conference on Machine Learning, pp. 2849–2858 (2016)
37. Shan, L., Zhang, M., Deng, L., Gong, G.: A dynamic multi-precision fixed-point data quantization strategy for convolutional neural network. In: Xu, W., Xiao, L., Li, J., Zhang, C., Zhu, Z. (eds.) NCCET 2016. CCIS, vol. 666, pp. 102–111. Springer, Singapore (2016). https://doi.org/10.1007/978-981-10-3159-5_10
38. Jahnke, S.R., Hamakawa, H.: Micro-controller direct memory access (DMA) operation with adjustable word size transfers and address alignment/incrementing. US Patent 6,816,921, 9 November 2004
39. Courbariaux, M., Bengio, Y., David, J.-P.: Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024 (2014)
40. Desoli, G., Chawla, N., Boesch, T., Singh, S.-P., Guidetti, E.: 14.1 A 2.9 TOPS/W deep convolutional neural network SoC in FD-SOI 28 nm for intelligent embedded systems. In: 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 238–239. IEEE (2017)
41. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., et al.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678. ACM (2014)
42. Sainath, T.N., Vinyals, O., Senior, A., Sak, H.: Convolutional, long short-term memory, fully connected deep neural networks. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4580–4584. IEEE (2015)
43. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

ReRAM Based In-Memory Computation of Single Bit Error Correcting BCH Code

Swagata Mandal (1), Yaswanth Tavva (2), Debjyoti Bhattacharjee (2), and Anupam Chattopadhyay (2)

(1) Department of Electronics and Communication Engineering, Jalpaiguri Government Engineering College (Autonomous), Jalpaiguri, India
(2) School of Computer Science Engineering, Nanyang Technological University, Singapore, Singapore

Abstract. Error resilient, high speed and robust data communication is the primary need in the age of big data and the Internet-of-things (IoT), where multiple connected devices exchange huge amounts of information. Different multi-bit error detecting and correcting codes are used for error mitigation in high speed data communication, though they introduce delay and their decoding structures are quite complex. Here we discuss the implementation of a single bit error correcting Bose, Chaudhuri, Hocquenghem (BCH) code with a simple decoding structure on a state-of-the-art ReRAM based in-memory computing platform. ReRAM devices offer low leakage power, high endurance and non-volatile storage capabilities, coupled with stateful logic operations. The proposed lightweight library presents the mapping for generation of the elements of the Galois field (GF) for computation of the BCH code, along with encoding and decoding operations on the input data stream using the BCH code. We have verified the results for BCH codes of different dimensions using SPICE simulation. For the (15,11) BCH code, the numbers of clock cycles required for element generation, decoding and encoding are 103, 230 and 251 respectively, which demonstrates the efficacy of the mapping.

Keywords: Error correcting code · BCH code · In memory computing · ReRAM

1 Introduction

In the age of big data and IoT, error resilient data storage, analysis and transmission are crucial in fields such as social media, health care, deep space exploration and underwater surveillance. Even though the chances of data corruption in silicon based semiconductor memory have grown with the shrinking of the technology node, semiconductor based storage devices like random access memory (RAM), read only memory (ROM) and flash memory popularly used in the memory industry still have a large footprint [1]. In order to prevent data corruption in semiconductor memories, various traditional error mitigation techniques like triple modular redundancy (TMR), concurrent error detection (CED) [2] and readback with scrubbing [3] are generally used. These methods consume large area and power and are not suitable for real time applications. Sometimes interleaving is used for error mitigation in memory, but it increases the complexity of the memory and is not useful for small memory devices.

In order to alleviate the drawbacks of TMR, CED or scrubbing, various error detecting and correcting (EDAC) codes are used for error mitigation in the data memory as well as in communication channels. In general, single bit errors in the memory are corrected by using a single bit error correcting code such as the Hamming or Hsiao code. In order to correct multiple erroneous bits, multi-bit error correcting block codes like the Bose, Chaudhuri, Hocquenghem (BCH) code [4] and the Reed-Solomon (RS) code [5] are used. They have greater decoding complexity and larger overhead than single bit error correcting codes, because they need more redundant bits. Data in the memory is arranged as a matrix. Hence, different product codes are used for error mitigation in the memory, where two low complexity block codes are used as component codes. Product codes formed using only Hamming codes as component codes [6], or a Hamming code and a parity code as component codes [6], are used to correct multi-bit upsets in SRAM based semiconductor memory. The error detection capability of different complex EDAC codes can be concatenated with a Hamming code to generate low complexity multi-bit error correcting codes, such as an RS code concatenated with a Hamming code [7] and a BCH code concatenated with a Hamming code [8]. In addition to block codes, memory based convolutional codes [9] are also used for error mitigation in storage devices.

© IFIP International Federation for Information Processing 2019. Published by Springer Nature Switzerland AG 2019. N. Bombieri et al. (Eds.): VLSI-SoC 2018, IFIP AICT 561, pp. 128–146, 2019. https://doi.org/10.1007/978-3-030-23425-6_7
The error detection and correction methods discussed so far are implemented as separate units that read data from memory, perform the encoding and decoding operations and finally write the data back into memory. With the rise of emerging technologies, computing can be performed in the memory itself, alongside the storage of data, unlike in traditional von Neumann computing models [10]. Redox based Random Access Memory (ReRAM) is one of the non-volatile storage technologies that support such in-memory computing [11]. Due to its high circuit density, high retention capability and low power consumption, ReRAM technology can be used as an alternative to NAND or NOR flash in the industry. Unlike CMOS or TTL based semiconductor memory technology, ReRAM uses different dielectric materials to develop its crossbar structure. ReRAM demonstrates good switching characteristics between high and low resistance states compared to other emerging memories like magnetic random access memory (MRAM) and ferroelectric random access memory (FRAM) [12]. ReRAM based memory technology is compatible with the conventional CMOS based design flow and provides inherent parallelism due to its crossbar structure. The working principle of ReRAM technology involves the formation of a low resistance conducting path through a dielectric material by applying a high voltage across it. The conducting path arises due to multiple mechanisms like metal defects, vacancies, etc. [13]. The conducting tunnel through the insulator can be controlled by an external voltage source for performing SET or RESET operations on the device.

Several in-memory computation platforms have already been proposed using ReRAMs, such as general purpose in-memory arithmetic circuit implementations [14], neuromorphic computing platforms [15] and general purpose Programmable Logic-in-Memory (PLiM) [16]. Apart from these general purpose applications, ReRAM based computation platforms are also used to implement different domain specific algorithms like machine learning [17,18], encryption [19] or compression [20]. The authors in [21] proposed an efficient hardware implementation of BCH code. Further, a hardware implementation of non-binary BCH code or RS code is proposed in [22]. The basic building block of error correcting codes is finite field arithmetic. The hardware implementation of high throughput finite field multiplier circuits on field programmable gate arrays (FPGA) and application specific integrated circuits (ASIC) is discussed in [23]. Recently, ReRAM based in-memory computation of Galois field (GF) arithmetic was described in [24].

In this work, we propose the first in-memory BCH encoding and decoding operation library. Specifically, our contributions are as follows:

– This work presents the first in-memory implementation of the encoding and decoding operations of BCH code using a ReRAM crossbar array.
– The proposed mapping harnesses the bit-level parallelism offered by ReRAM crossbar arrays and supports a wide variety of crossbar dimensions.
– In order to perform matrix multiplication during the encoding and decoding operations, we propose a new method of implementing binary matrix multiplication using a ReRAM crossbar array. We refer to the method as BiBLAS-3, since it is a level-3 binary basic linear algebra subprogram.
– The proposed implementation has a very low footprint in terms of devices required as well as energy, which makes it suitable for use as a building block for different applications.

The rest of the paper is organized as follows. Section 2 presents the fundamentals of GF arithmetic and the basics of encoding and decoding operations using BCH code, along with a succinct introduction to ReVAMP, a state-of-the-art ReRAM based in-memory computing platform. Section 3 presents the detailed implementation of element generation of the GF and of the encoding and decoding operations for the ReVAMP platform using BiBLAS-3. Experimental results are described in Sect. 4, followed by the conclusion in Sect. 5.

2 Preliminaries

In this section, we present the fundamentals of encoding and decoding operations using BCH code and introduce the preliminaries of logic operations on the ReVAMP architecture. The encoding and decoding operations of the BCH code are performed over a binary GF, which we describe briefly first.

2.1 Galois Field Arithmetic

A field is a set of elements on which basic mathematical operations like addition and multiplication can be performed without leaving the set. These basic operations must satisfy the distributive, associative and commutative laws [25]. The order of a field is the number of elements in the field. A field with a finite number of elements is known as a Galois field (GF). The order of a GF is always a prime number or a power of a prime number. If $p$ is a prime number and $m$ is a positive integer, then the GF contains $p^m$ elements and is denoted $GF(p^m)$. For $m = 1$, $p = 2$, the elements of the GF are {0, 1} and the field is known as the binary field. Here, we consider the GF of $2^m$ elements built from the binary field $GF(2)$, where $m > 1$. If $U$ is the set of the elements of the field and $\alpha$ is an element of $GF(2^m)$, then $U$ can be represented by Eq. (1), which lists all $2^m$ elements.

$$U = [0, \alpha^0, \alpha^1, \alpha^2, \alpha^3, \ldots, \alpha^{2^m-2}] \qquad (1)$$

Let $f(x)$ be a polynomial of degree $m$ over $GF(2)$. It is said to be irreducible if it is not divisible by any polynomial over $GF(2)$ of degree less than $m$ but greater than zero [26]. An irreducible polynomial is a primitive polynomial if the smallest positive integer $q$ for which $f(x)$ divides $x^q + 1$ is $q = 2^m - 1$. For each value of $m$ there can be multiple primitive polynomials, but we use the primitive polynomial with the least number of terms for computation over the GF.

(a)
$GF(2^m)$   Primitive polynomial
$2^2$       $x^2 + x + 1$
$2^3$       $x^3 + x + 1$
$2^4$       $x^4 + x + 1$
$2^5$       $x^5 + x^2 + 1$
$2^6$       $x^6 + x + 1$
$2^7$       $x^7 + x^3 + 1$

(b)
Power Repr.     Polynomial Repr.                    4-Tuple Repr.
$0$             $0$                                 (0, 0, 0, 0)
$1$             $\alpha^0$                          (0, 0, 0, 1)
$\alpha$        $\alpha^1$                          (0, 0, 1, 0)
$\alpha^2$      $\alpha^2$                          (0, 1, 0, 0)
$\alpha^3$      $\alpha^3$                          (1, 0, 0, 0)
$\alpha^4$      $\alpha + 1$                        (0, 0, 1, 1)
$\alpha^5$      $\alpha^2 + \alpha$                 (0, 1, 1, 0)
$\alpha^6$      $\alpha^3 + \alpha^2$               (1, 1, 0, 0)
$\alpha^7$      $\alpha^3 + \alpha + 1$             (1, 0, 1, 1)
$\alpha^8$      $\alpha^2 + 1$                      (0, 1, 0, 1)
$\alpha^9$      $\alpha^3 + \alpha$                 (1, 0, 1, 0)
$\alpha^{10}$   $\alpha^2 + \alpha + 1$             (0, 1, 1, 1)
$\alpha^{11}$   $\alpha^3 + \alpha^2 + \alpha$      (1, 1, 1, 0)
$\alpha^{12}$   $\alpha^3 + \alpha^2 + \alpha + 1$  (1, 1, 1, 1)
$\alpha^{13}$   $\alpha^3 + \alpha^2 + 1$           (1, 1, 0, 1)
$\alpha^{14}$   $\alpha^3 + 1$                      (1, 0, 0, 1)

Fig. 1. (a) Primitive polynomials for GF of various orders. (b) Representation of the elements of $GF(2^4)$.
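The 4-tuple column of Fig. 1b can be reproduced in software by repeatedly multiplying by $\alpha$ (a left shift) and substituting $\alpha^m$ with the low-order terms of the primitive polynomial whenever the degree reaches $m$. The sketch below is an illustration only, not the ReRAM mapping of this paper; the function name `gf_elements` and the integer encoding of the tuples are assumptions made for the example.

```python
def gf_elements(m, poly_taps):
    """Return [alpha^0, ..., alpha^(2^m - 2)] as m-bit integers.

    poly_taps holds the low-order terms of the primitive polynomial, i.e. the
    value substituted for alpha^m (x^4 + x + 1 -> alpha^4 = alpha + 1 -> 0b0011).
    """
    elems, value = [], 1
    for _ in range((1 << m) - 1):
        elems.append(value)
        value <<= 1                      # multiply by alpha
        if value >> m:                   # degree reached m: reduce
            value = (value ^ (1 << m)) ^ poly_taps
    return elems


if __name__ == "__main__":
    # GF(2^4) with x^4 + x + 1, printed as 4-tuples as in Fig. 1b.
    for k, e in enumerate(gf_elements(4, 0b0011)):
        print(f"alpha^{k:<2d} -> {e:04b}")
```

For $m = 4$ the loop yields 0001, 0010, 0100, 1000, 0011, 0110, and so on up to 1001 for $\alpha^{14}$, in agreement with Fig. 1b.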


Table 1. Variation of dimension of single bit error correcting BCH code with the order of GF.

Order of GF (m)   Dimension of BCH code   $\alpha^k$
3                 (7, 4)                  $\alpha^k = \alpha^{k-2} + \alpha^{k-3}$
4                 (15, 11)                $\alpha^k = \alpha^{k-3} + \alpha^{k-4}$
5                 (31, 26)                $\alpha^k = \alpha^{k-3} + \alpha^{k-5}$
6                 (63, 57)                $\alpha^k = \alpha^{k-5} + \alpha^{k-6}$
7                 (127, 120)              $\alpha^k = \alpha^{k-4} + \alpha^{k-7}$
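The recurrences of Table 1 are sufficient to generate every nonzero field element from the first $m$ powers of $\alpha$, using only XOR of two earlier elements. The following sketch illustrates this under stated assumptions: elements are encoded as $m$-bit integers, and the names `RECURRENCE_LAGS`, `elements_by_recurrence` and `bch_dimension` are hypothetical helpers, not part of the proposed library.

```python
# Lags (a, b) such that alpha^k = alpha^(k-a) + alpha^(k-b), per Table 1.
RECURRENCE_LAGS = {3: (2, 3), 4: (3, 4), 5: (3, 5), 6: (5, 6), 7: (4, 7)}


def elements_by_recurrence(m):
    """Return [alpha^0, ..., alpha^(2^m - 2)] built with the Table 1 recurrence."""
    a, b = RECURRENCE_LAGS[m]
    elems = [1 << i for i in range(m)]          # alpha^0 .. alpha^(m-1)
    for k in range(m, (1 << m) - 1):
        elems.append(elems[k - a] ^ elems[k - b])  # addition in GF(2^m) is XOR
    return elems


def bch_dimension(m):
    """(n, k) of the single bit error correcting BCH code over GF(2^m)."""
    n = (1 << m) - 1
    return n, n - m


if __name__ == "__main__":
    for m in sorted(RECURRENCE_LAGS):
        elems = elements_by_recurrence(m)
        print(m, bch_dimension(m), len(set(elems)))
```

Each recurrence is the primitive polynomial of Fig. 1a multiplied by $\alpha^{k-m}$, which is why the generated elements are all distinct and number $2^m - 1$.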

The list of primitive polynomials for different values of $m$ is shown in Fig. 1a. These primitive polynomials are the basis of computation using the elements of the GF. To generate the elements of the GF, we start from the two basic elements 0 and 1 and a new element $\alpha$. In this paper, we discuss the encoding and decoding operations of single bit error correcting BCH code over $GF(2^m)$, where $m$ varies from 3 to 7. As $\alpha$ is an element of $GF(2^m)$, it must satisfy the primitive polynomial corresponding to $GF(2^m)$. As $m$ varies, not only the primitive polynomial but also the dimension of the BCH code changes, as shown in Table 1. If $\alpha$ is an element of $GF(2^m)$, then $\alpha^k$ (where $k$ is a positive integer and $k > 2$) is also an element of $GF(2^m)$; the recursive expressions used to calculate $\alpha^k$ for different values of $m$ are shown in Table 1. Fig. 1b illustrates the power, polynomial and 4-tuple representations of all the elements of $GF(2^4)$. The encoding and decoding operations of the BCH code are performed on the elements of the GF.

2.2 Basics of BCH Encoding and Decoding Operation

BCH is a powerful random error correcting cyclic code that generalizes the Hamming code to multi-bit error correction. Given two integers $m$ and $t$ such that $m \ge 3$ and $t < 2^{m-1}$, there exists a binary BCH code with block length $n = 2^m - 1$, number of parity check bits $(n - k) \le mt$ and minimum distance $d_{min} \ge 2t + 1$. This is a $t$ error correcting BCH code. If $\alpha$ is a primitive element of $GF(2^m)$, then the generator polynomial $g(x)$ of the $t$ error correcting BCH code of length $2^m - 1$ is the lowest degree polynomial over $GF(2)$ that has $\alpha, \alpha^2, \ldots, \alpha^{2t}$ as its roots. Hence, Eq. (2) must be satisfied.

$$g(\alpha^i) = 0 \quad \forall i \in \{1, 2, \ldots, 2t\} \qquad (2)$$

If $\phi_i(x)$ is the minimal polynomial of $\alpha^i$, then $g(x)$ is formed using Eq. (3).

$$g(x) = \mathrm{LCM}\{\phi_1(x), \phi_2(x), \ldots, \phi_{2t}(x)\} \qquad (3)$$

As $\alpha^i$ and $\alpha^{i'}$ (where $i = i' 2^l$, $i'$ is odd and $l \ge 1$) are conjugates of each other, $\phi_i(x) = \phi_{i'}(x)$. Hence, $g(x)$ is formed using Eq. (4).

$$g(x) = \mathrm{LCM}\{\phi_1(x), \phi_3(x), \ldots, \phi_{2t-1}(x)\} \qquad (4)$$

Since we use single bit error correcting BCH code ($t = 1$), the generator polynomial $g(x)$ for $GF(2^4)$ is given by

$$g(x) = \phi_1(x) = x^4 + x + 1$$

The degree of $g(x)$ is at most $mt$ and the number of parity bits is $(n - k)$. Once $g(x)$ has been generated, the encoding operation multiplies the input data $D(x)$ by $g(x)$, i.e., $C(x) = D(x) \times g(x)$. The decoding operation of BCH code involves the following steps:

1. Syndrome computation.
2. Determine the error locator polynomial $\lambda$ from the syndrome components $S_1, S_2, \ldots, S_{2t}$.
3. Find the error location by solving the error locator polynomial $\lambda(x)$.

Let $r(x) = r_0 + r_1 x + r_2 x^2 + \ldots + r_{n-1} x^{n-1}$ be the received data and $e(x)$ be the error pattern; then $r(x) = C(x) + e(x)$. For a $t$ error correcting BCH code, the parity check matrix is

$$H = \begin{bmatrix}
1 & \alpha & \alpha^2 & \alpha^3 & \cdots & \alpha^{n-1} \\
1 & \alpha^3 & (\alpha^3)^2 & (\alpha^3)^3 & \cdots & (\alpha^3)^{n-1} \\
1 & \alpha^5 & (\alpha^5)^2 & (\alpha^5)^3 & \cdots & (\alpha^5)^{n-1} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \alpha^{2t-1} & (\alpha^{2t-1})^2 & (\alpha^{2t-1})^3 & \cdots & (\alpha^{2t-1})^{n-1}
\end{bmatrix}$$

The syndrome is a $2t$-tuple $S = (S_1, S_2, \ldots, S_{2t}) = r \times H^T$, where $H$ is the parity check matrix. Since we are considering single bit error correcting BCH code, $t$ is equal to 1 and $S = S_1 = r \times H^T$. In the next step, $2t$ nonlinear equations are formed from the syndrome values and solved using either the Berlekamp-Massey or Euclid’s algorithm [27], and an error locator polynomial is formed using the roots obtained by solving these equations. Finally, the roots of the error locator polynomial are found using the Chien search algorithm [27]. A single bit error correcting BCH code generates only one syndrome, whose value directly locates the position of the erroneous bit; hence, we do not discuss the detailed implementation of steps 2 and 3 of the decoding of BCH code.
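For the single bit error correcting case, the encoding and decoding steps above reduce to a polynomial multiplication and one syndrome evaluation. The following software sketch for the (15,11) code over $GF(2^4)$ is an illustration under stated assumptions, not the in-memory mapping proposed in this paper: polynomials over $GF(2)$ are stored as integers (bit $i$ is the coefficient of $x^i$), and all function names are hypothetical.

```python
G = 0b10011          # g(x) = x^4 + x + 1
N, K = 15, 11        # (n, k) of the code


def gf16_powers():
    """[alpha^0, ..., alpha^14] as 4-bit integers, reduced with alpha^4 = alpha + 1."""
    elems, v = [], 1
    for _ in range(N):
        elems.append(v)
        v <<= 1
        if v & 0b10000:
            v ^= G
    return elems


ALPHA = gf16_powers()


def clmul(a, b):
    """Carry-less (GF(2)) polynomial multiplication."""
    out = 0
    while b:
        if b & 1:
            out ^= a
        a <<= 1
        b >>= 1
    return out


def encode(data):
    """C(x) = D(x) * g(x); `data` holds the 11 message bits."""
    return clmul(data, G)


def syndrome(received):
    """S = r x H^T with H = [1, alpha, ..., alpha^14], i.e. r evaluated at alpha."""
    s = 0
    for i in range(N):
        if (received >> i) & 1:
            s ^= ALPHA[i]
    return s


def correct(received):
    """Flip the bit located by the syndrome (valid for at most one bit error)."""
    s = syndrome(received)
    return received if s == 0 else received ^ (1 << ALPHA.index(s))


def poly_divide(c, g):
    """Return (quotient, remainder) of c(x) / g(x) over GF(2)."""
    q = 0
    while c.bit_length() >= g.bit_length():
        shift = c.bit_length() - g.bit_length()
        q ^= 1 << shift
        c ^= g << shift
    return q, c
```

A valid codeword gives a zero syndrome because $g(\alpha) = 0$; a single error at position $j$ gives $S = \alpha^j$, so a lookup of $S$ among the powers of $\alpha$ returns the erroneous bit. The message is then recovered by dividing the corrected codeword by $g(x)$, e.g. `poly_divide(correct(r), G)[0]`.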

2.3 In-Memory Computing Using ReRAM

In this subsection, we describe the ReRAM-based in-memory computing platform ReVAMP, introduced in [28]. The architecture, presented in Fig. 2, utilizes a ReRAM crossbar with lightweight peripheral circuitry for in-memory computing.

[Figure: block diagram with Instruction Fetch, Instruction Decode and Execute stages; labeled blocks include the Instruction Memory (IM), PC, IR, PIR, DMR, the wD × (1 + wD) switch network, the write circuit, the row and column decoders, the sense amplifiers, and the Data and Computation Memory.]

Fig. 2. ReVAMP architecture.

Read  wl
Apply wl s ws wb (v val_{wD−1}) . . . (v val_0)

Fig. 3. ReVAMP instruction format.

The ReRAM crossbar memory is used as data storage and computation memory (DCM). This is where in-memory computation using ReRAM devices takes place. A ReRAM crossbar memory consists of multiple 1-Select 1-Resistance (1S1R) ReRAM devices [29], arranged in the form of a crossbar [30]. A V/2 scheme is used for programming the ReRAM array. Unselected lines are kept at ground. In the readout phase, the presence of a high current (≈5 μA) is considered as logic ‘1’ while presence of a low current (