Algorithms and Architectures for Cryptography and Source Coding in Non-Volatile Flash Memories 9783658344597, 3658344598

English Pages 155 Year 2021

Table of contents :
Acknowledgements
Abstract
Contents
Acronyms
List of Figures
List of Tables
1 Introduction
1.1 Problem Statement and Motivation
1.2 Structure of the Thesis
2 Elliptic Curve Cryptography
2.1 Cryptography for Flash Memory Controllers
2.2 Applications of Elliptic Curve Cryptographic (ECC) Systems
2.2.1 Elliptic Curves
2.2.2 Key Exchange
2.2.3 Digital Signatures
2.3 Elliptic Curve Point Multiplication
2.4 Elliptic Curve Geometry and Group Laws
2.4.1 Elliptic Curve Geometry
2.4.2 Group Laws for Prime Curves
2.5 Reducing the Number of Field Inversions for Elliptic Curves over Prime Fields
2.6 Discussion
3 Elliptic Curve Cryptography over Gaussian Integers
3.1 Gaussian Integer Rings and Fields
3.2 Point Multiplication over Gaussian Integers
3.2.1 Determining the τ-adic Expansions
3.2.2 Elliptic Curve Point Multiplication for Complex Expansions
3.3 Resistance Against side Channel Attacks using Gaussian Integers
3.3.1 Improved τ-adic Expansion Algorithm
3.3.2 Comparison with Existing Non-binary Expansions
3.4 Discussion
4 Montgomery Arithmetic over Gaussian Integers
4.1 Montgomery Arithmetic
4.2 Reduction over Gaussian Integers using the Absolute Value
4.3 Reduction over Gaussian integers using the Manhattan Weight
4.3.1 Montgomery reduction algorithm using the Manhattan weight
4.3.2 Reduction after addition (or subtraction)
4.4 Simplifying the Reduction based on the Manhattan Weight
4.5 Discussion
5 Architecture of the ECC Coprocessor for Gaussian Integers
5.1 Coprocessor Architecture for Gaussian Integers
5.1.1 Basic Concepts of the Proposed Design
5.1.2 Hardware Architecture
5.1.3 Instruction Set Architecture
5.1.4 Data Memory
5.1.5 Arithmetic Unit for Gaussian Integer Fields
5.2 ECC Coprocessor Architecture for Prime Fields
5.2.1 Preliminary Considerations and Modulo Reduction
5.2.2 Architecture
5.2.3 Instruction Set
5.2.4 Arithmetic Unit
5.3 Implementation Results
5.4 Discussion
6 Compact Architecture of the ECC Coprocessor for Binary Extension Fields
6.1 Group Laws and Projective Coordinates for Binary Extension Fields
6.1.1 Group Law for Binary Extension Curves
6.1.2 Projective Coordinates for Elliptic Curves over Binary Extension Fields
6.2 ECC Coprocessor Architecture
6.2.1 Instruction Set Architecture
6.2.2 Data Memory
6.2.3 Arithmetic Unit
6.3 Results and Discussion
7 The Parallel Dictionary LZW Algorithm for Flash Memory Controllers
7.1 Data Compression for Flash Memory Devices
7.1.1 Reducing the Write Amplification
7.1.2 Combining Data Compression and Error-correcting Codes
7.1.3 Suitable Data Compression Scheme
7.2 Parallel Dictionary LZW (PDLZW) Algorithm
7.3 Address Space Partitioning for the PDLZW
7.3.1 Data Model
7.3.2 Partitioning the PDLZW Address Space
7.4 Reducing the Memory Requirements of the PDLZW
7.4.1 Recursive PDLZW Algorithm
7.4.2 Basic Concept of the Word Partitioning Technique
7.4.3 Dimensioning the Layers
7.4.4 Dictionary Architecture
7.4.5 Implementation
7.5 Compression and Implementation Results
7.6 Discussion and Comparison with Other Data Compression Schemes
8 Conclusion
Bibliography

Schriftenreihe der Institute für Systemdynamik (ISD) und optische Systeme (IOS)

Malek Safieh

Algorithms and Architectures for Cryptography and Source Coding in Non-Volatile Flash Memories

Editors-in-chief of the series:
Jürgen Freudenberger, Institut für Systemdynamik, Hochschule Konstanz (HTWG), Konstanz, Baden-Württemberg, Germany
Johannes Reuter, Institut für Systemdynamik, Hochschule Konstanz (HTWG), Konstanz, Baden-Württemberg, Germany
Matthias Franz, Institut für Optische Systeme, Hochschule Konstanz (HTWG), Konstanz, Baden-Württemberg, Germany
Georg Umlauf, Institut für Optische Systeme, Hochschule Konstanz (HTWG), Konstanz, Baden-Württemberg, Germany

The “Series of the Institutes of System Dynamics (ISD) and Optical Systems (IOS)” covers a broad range of topics, from applied computer science to engineering. The Institutes of System Dynamics and Optical Systems together form a research focus of the HTWG Konstanz. The research programs of both institutes cover problems in information technology and control engineering as well as cognitive and imaging systems. The connecting link is the system concept and the systems engineering approach, i.e. the search for methods for solving interdisciplinary, complex problems. The series publishes research results in the form of dissertations.

Further volumes in this series: http://www.springer.com/series/16265

Malek Safieh
University of Ulm
Munich, Germany

Dissertation, Ulm University, 2021

ISSN 2661-8087
ISSN 2661-8095 (electronic)
Schriftenreihe der Institute für Systemdynamik (ISD) und optische Systeme (IOS)
ISBN 978-3-658-34458-0
ISBN 978-3-658-34459-7 (eBook)
https://doi.org/10.1007/978-3-658-34459-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2021

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Responsible Editor: Stefanie Eggert

This Springer Vieweg imprint is published by the registered company Springer Fachmedien Wiesbaden GmbH, part of Springer Nature. The registered company address is: Abraham-Lincoln-Str. 46, 65189 Wiesbaden, Germany

Acknowledgements

First and foremost, I would like to thank my doctoral supervisor Prof. Dr. Jürgen Freudenberger for giving me the opportunity to join the research group at the Institute of System Dynamics (ISD) at the University of Applied Sciences (HTWG) Konstanz, Germany. Thanks to his efforts, I was able to work on my Ph.D. for the last three and a half years. I appreciate the valuable time and effort he devoted to the long and fruitful discussions we had. I am very grateful for his advice and support, which contributed to the improvement of my research skills. I would also like to thank Prof. Dr. Martin Bossert for reading my thesis and being my second supervisor. I am very thankful to all the research members at the ISD. It would be truly unfair to mention only a few people since everyone made my time there a pleasure. The conversations during the coffee breaks in the kitchen, after eating in the canteen, as well as during the after-work beer were always fun and kept me motivated and with a fresh mindset. I would like to express my gratitude to my family and friends for all their help and contribution during my studies and the research period. Further thanks go to everyone who supported me during my doctoral research.

Abstract

Flash memory is an important non-volatile storage medium. Reliable and secure data storage in flash memories requires sophisticated coding and signal processing techniques. Although error-correcting codes are applied in practically all flash storage systems, coding techniques for cryptography and data compression are less developed. Due to the limited computational performance of the flash controller, many flash storage systems rely on symmetric cryptography for message authentication. Asymmetric cryptography, such as elliptic curve cryptographic (ECC) systems, offers additional functionality such as digital signatures and key exchange methods, which allow the verification of integrity and authenticity. In this work, we demonstrate that ECC systems over Gaussian integers are very efficient. Gaussian integers are a subset of the complex numbers with integers as real and imaginary parts. Since many Gaussian integer fields are isomorphic to prime fields, this arithmetic is suitable for ECC systems. Implementations of cryptographic algorithms are prone to side channel attacks. We show that using Gaussian integers can reduce the complexity and memory requirements of hardware implementations that are protected against such attacks. However, computing the modulo reduction over Gaussian integers is extremely expensive. To reduce the complexity, we derive a Montgomery modular arithmetic over Gaussian integers. Moreover, we develop a hardware architecture optimized for the ECC operations targeting low area. The proposed ECC processor for Gaussian integers is a competitive solution for applications in flash memories and other resource-constrained embedded systems.

Data compression provides several advantages for flash memory controllers, e.g. improving the lifetime and storage capacity. In this work, we focus on the dictionary-based Lempel-Ziv-Welch (LZW) algorithm. A fast and compact implementation of this universal data compression procedure is a challenging task due to the recursive structure of the LZW dictionary. To speed up the encoding, we present an architecture that applies multiple dictionaries in parallel. Two dictionary partitioning techniques are introduced that improve the compression rate and reduce the memory size of this parallel dictionary LZW algorithm.


Acronyms

ADD: point addition
AES: advanced encryption standard
ANSI: American National Standards Institute
ASCII: American standard code for information interchange
ASIC: application-specific integrated circuit
ASIP: application-specific instruction set processor
BCH: Bose-Chaudhuri-Hocquenghem
BMH: BWT-MTF-Huffman
BWT: Burrows-Wheeler transformation
CAM: content-addressable memories
DBL: point doubling
DPA: differential power analysis
DP-RAM: dual-port RAM
DSP: digital signal processor
ECC: elliptic curve cryptographic
FF: flip-flops
FPGA: field programmable gate array
FTL: flash translation layer
GF: Galois field
HDD: hard disc drives
HMAC: keyed-hash message authentication code
KB: Kilobyte
LUT: lookup tables
LZW: Lempel-Ziv-Welch
MAC: message authentication codes
MH: MTF-Huffman
MLC: multi-level cell
MTF: move-to-front
NIST: National Institute of Standards and Technology
P/E: program/erase
PDLZW: parallel dictionary LZW
PM: elliptic curve point multiplication
RAM: random-access memory
RIP: randomized initial point
RPA: refined power analysis
RSA: Rivest-Shamir-Adleman
SADD: special point addition
SCA: side channel attacks
SD: secure digital
SPA: simple power analysis
SSD: solid-state drives
TA: timing attacks
USB: universal serial bus
WER: word error rate
ZPA: zero-value point attacks

List of Figures

Figure 2.1 Examples for elliptic curve over real numbers
Figure 2.2 Principle of digital signatures
Figure 2.3 Hierarchy of calculating the point multiplication for prime fields
Figure 2.4 Point addition (ADD) example
Figure 2.5 Point doubling (DBL) example
Figure 3.1 The set of Gaussian integers for π = 4 + i
Figure 4.1 Elements of the Gaussian integer field $G_{29}$ with π = 5 + 2i
Figure 4.2 Illustration of the Montgomery domain for p = 29 and π = 5 + 2i
Figure 4.3 Illustration of the Montgomery domain for p = 53 and π = 7 + 2i
Figure 5.1 Block diagram of the ECC processor
Figure 5.2 Structure of the Karatsuba multiplier
Figure 5.3 Structure of the arithmetic unit (AU)
Figure 6.1 Block diagram of the ECC processor
Figure 6.2 Block diagram of the parallelized least significant bit-first multiplier
Figure 7.1 Example of the garbage collection procedure
Figure 7.2 Codeword structure with and without the combination of data compression techniques and an error-correcting code
Figure 7.3 Illustrating the lifetime improvement using word error rates with different data compression algorithms for the Canterbury corpus according to [4]
Figure 7.4 Example of the PDLZW encoding
Figure 7.5 Illustration of a Markov chain for the PDLZW encoding
Figure 7.6 Stationary probability distribution for Calgary corpus for an address space with 1024 entries
Figure 7.7 Comparing the PDLZW with the recursive PDLZW
Figure 7.8 Word partitioning principle
Figure 7.9 Distribution of the number of unique entries for the different dictionaries
Figure 7.10 Structure of the PDLZW encoding with word partitioning
Figure 7.11 Compression results for the Calgary corpus with different input block lengths and different number of entries (address space)
Figure 7.12 Compression results for the Canterbury corpus with a partitioning derived from the Calgary corpus and q ≤ 0.5 for different input block lengths and different number of entries (address space)

List of Tables

Table 2.1 Comparison of key sizes of ECC and RSA cryptographic systems for equivalent security levels in bits according to [14]
Table 3.1 Examples for primes of the form $p = a^2 + b^2$ with $b = 1$ or $b = a - 1$
Table 3.2 Number of multiplications and squarings using Jacobian coordinates according to [14, 78]
Table 3.3 Number of point operations required for the precomputations and an iteration of the point multiplication with different τ
Table 3.4 Performance results for different τ-adic expansions in comparison with [31] for a binary key length r = 163
Table 4.1 Examples of primes of the form $p = a^2 + b^2$ suitable for ECC applications and the percentage of the number of offset reduction steps required (red.)
Table 5.1 Definition of the instruction set for the Gaussian integers arithmetic
Table 5.2 Definition of the instruction set
Table 5.3 Comparison of the hardware resources with other area-efficient implementations
Table 5.4 Latencies comparison with other area-efficient implementations
Table 6.1 Definition of the instruction set
Table 6.2 Number of arithmetic operations for DBL and ADD over $GF(2^m)$
Table 6.3 Required resources for the proposed architecture
Table 6.4 Comparison with other area-efficient implementations
Table 7.1 Determination of dictionary sizes using the probability distribution
Table 7.2 Proposed dictionary partitioning for different address spaces
Table 7.3 Mean compressed block length of LZW for different block lengths
Table 7.4 Hardware size for the register-based implementation
Table 7.5 Hardware size for the RAM-based implementation
Table 7.6 Mean block size for the PDLZW with word partitioning (4 dict.)
Table 7.7 Comparison with other data compression schemes

1 Introduction

Flash memories have gained popularity in a variety of applications in the last decade [1–4]. Nowadays, non-volatile flash memories are central components in most mobile devices like tablets, smartphones, USB sticks, SD memory cards, and many embedded systems [1, 5]. Flash memory devices in the form of so-called SSDs are rapidly replacing mechanical HDDs. This shift results from the benefits provided by flash memory, such as higher read/write speed, higher mechanical reliability, fast random access, silent operation, lower power consumption, and lower weight [1, 4]. In this work, we devise several algorithms and architectures for cryptography and data compression that can be employed in non-volatile flash memory systems.

1.1 Problem Statement and Motivation

A flash memory system mainly comprises two parts: the memory chip with the flash cells and a controller chip [1]. The FTL, bad block management, and error correction are handled within the controller, allowing the host to perform only simplified read/write operations on the device [1, 2, 4]. Error-correcting codes are applied in practically all flash storage systems to improve the reliability [1, 2, 4–7]. However, techniques for cryptography and data compression are less developed.

Many publications aim to improve the security of flash memory devices from different perspectives [1, 3, 6, 8–13]. In this work, we aim to improve the security of such devices by introducing asymmetric cryptography into the flash memory controller. Due to the limited computational performance of flash memory controllers [6], many flash devices still rely on symmetric cryptography for key distribution and message authentication, using so-called MACs [1]. Asymmetric cryptography, also known as public key cryptography, enables additional functionality that can improve the authenticity and integrity provided by flash memory devices, such as the use of digital signatures and Diffie-Hellman key exchange [14, 15]. The main goal of using digital signatures is to integrate sender authentication, which can be used for many applications such as firmware updates for the flash memory controller.

The most common systems for asymmetric cryptography are the RSA system and elliptic curve cryptography [14–17]. However, both systems are computationally intensive. In this work, we demonstrate that processing these systems over Gaussian integers can reduce the computational complexity, which makes asymmetric cryptography feasible for applications in small embedded systems. Gaussian integers are a subset of the complex numbers such that the real and imaginary parts are integers [18]. Many Gaussian integer rings are isomorphic to rings over ordinary integers [19]; hence, this arithmetic is suitable for ECC and RSA systems. In this work, we focus on ECC systems, because they promise a similar security level with significantly smaller key lengths compared with RSA [14, 15]. However, software implementations of ECC algorithms on flash memory controllers may result in unacceptable computational latency [6]. To reduce the computational latency to an acceptable range of a few milliseconds, we present hardware architectures optimized for the ECC operations targeting a low area, similar to [20–24]. We propose an ECC processor design for Gaussian integers that is a competitive solution for applications in resource-constrained embedded systems such as flash memory controllers.

Implementations of cryptographic algorithms are prone to side channel attacks [14, 25–28]. Such attacks attempt to extract secrets from information leaked from physical processes like power consumption, electromagnetic radiation, timing of operations, and fault analysis [14]. In this work, we focus on side channel attacks on ECC algorithms.
These algorithms are based on the elliptic curve point multiplication, i.e. $kP$, where $P$ is a point on the elliptic curve and $k$ is an integer (the key) [14, 15]. The point multiplication is typically implemented based on the binary expansion of the integer $k = k_0, \ldots, k_{r-2}, k_{r-1}$, where $k_j \in \{0, 1\}$ and $r$ is the key length in bits. The resistance against side channel attacks can be improved using a τ-adic expansion of the integer $k$ with a non-binary basis τ [14, 29–36]. In this dissertation, we propose a new τ-adic expansion algorithm that increases the resistance against such attacks. Furthermore, we demonstrate that considering Gaussian integers for the basis τ can reduce the complexity and memory requirements for implementations of the point multiplication that are protected against side channel attacks.

Nonetheless, calculating the modulo reduction over Gaussian integers is extremely expensive [18]. To reduce the complexity, we consider the Montgomery modular arithmetic over Gaussian integers, similar to [37, 38]. This arithmetic simplifies the modulo reduction [39]. The generalization of the Montgomery reduction from ordinary integers to Gaussian integers is non-trivial. The final reduction step of the algorithm utilizes the total order of the integers [39]. However, such an order relation does not exist for complex numbers [40]. In this work, we introduce two new algorithms for the Montgomery reduction for Gaussian integers. The two algorithms differ in the norm used to measure the size of Gaussian integers, and they have different complexities. We demonstrate that these algorithms are advantageous for RSA and ECC applications.

The second part of this work focuses on universal lossless data compression techniques. Universal data compression for flash memory devices has recently gained interest [4, 41–49], because data compression can improve the storage capacity and the flash memory lifetime. Compression of redundant data reduces the amount of data transferred from the host to the flash memory [4, 7, 41]. Moreover, data compression can be utilized to improve the reliability of the non-volatile storage system [42, 45–47, 49]. Data compression for such applications requires high data throughput, which can only be achieved using hardware implementations. Moreover, it requires a universal data compression algorithm that can be used with small data blocks [4, 46, 48]. In this work, we consider the LZW algorithm, which is one of the most important dictionary-based compression algorithms [50, 51]. A fast and compact implementation of this data compression algorithm is a challenging task due to the recursive structure of the LZW dictionary. To speed up the encoding, we utilize the simplified parallel dictionary LZW algorithm [52, 53]. Instead of a single recursive dictionary, this algorithm employs multiple dictionaries to simplify the search process. We derive new dictionary partitioning techniques that improve the resulting compression rate and reduce the memory requirements of the parallel dictionary LZW algorithm compared with the original proposal [52, 53].
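To make the recursive dictionary structure mentioned above more concrete, the following Python sketch shows a classical LZW encoder and decoder with a single dictionary. It is only an illustration under stated assumptions: it is not the parallel dictionary variant considered in this work, and the function names `lzw_encode` and `lzw_decode` as well as the dictionary limit `max_entries` are hypothetical choices for this example.

```python
def lzw_encode(data: bytes, max_entries: int = 4096) -> list[int]:
    """Classical single-dictionary LZW encoder (illustrative sketch)."""
    # The dictionary initially holds every single-byte word.
    dictionary = {bytes([i]): i for i in range(256)}
    codes = []
    current = b""
    for value in data:
        candidate = current + bytes([value])
        if candidate in dictionary:
            # Keep extending the current word while it is still known.
            current = candidate
        else:
            # Emit the longest known word and store its one-byte extension.
            codes.append(dictionary[current])
            if len(dictionary) < max_entries:
                dictionary[candidate] = len(dictionary)
            current = bytes([value])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decode(codes: list[int], max_entries: int = 4096) -> bytes:
    """Inverse of lzw_encode; rebuilds the same dictionary on the fly."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            # The code refers to the entry that is about to be created.
            entry = previous + previous[:1]
        output.append(entry)
        if len(dictionary) < max_entries:
            dictionary[len(dictionary)] = previous + entry[:1]
        previous = entry
    return b"".join(output)
```

Each new dictionary entry is an existing entry extended by one symbol. This is the recursive structure that makes a fast hardware search difficult, and that the parallel dictionary approach sidesteps by using several dictionaries.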

1.2 Structure of the Thesis

The remainder of this dissertation is organized as follows. In Chapter 2, we review the basics of ECC systems. In Chapter 3, we demonstrate that Gaussian integers are beneficial for ECC and RSA systems. We present the new algorithm for the τ-adic expansion of the key that increases the resistance against side channel attacks. Moreover, we illustrate that applying this algorithm to Gaussian integers can reduce the computational complexity and memory requirements for the ECC point multiplication. The proposed algorithm and some other parts of this chapter are published in [54–57]. In Chapter 4, we derive the Montgomery arithmetic over Gaussian integers and present two associated reduction algorithms. Parts of this chapter are published in [56, 57]. In Chapter 5, we present an efficient hardware architecture for the ECC point multiplication using the Montgomery arithmetic derived in Chapter 4. Furthermore, we propose a second hardware design for the ECC point multiplication over prime fields for comparison. In Chapter 6, we introduce a fast and compact hardware architecture for elliptic curves over binary extension fields. These architectures are published in [55, 58, 59]. In Chapter 7, we discuss universal data compression approaches suitable for flash memories. We also consider the partitioning technique for the parallel dictionary LZW data compression algorithm and corresponding efficient hardware architectures. Moreover, we compare the resulting performance of this concept with other universal data compression approaches from [4]. Parts of these chapters are published in [48, 60–63].
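For orientation, the classical Montgomery reduction over ordinary integers, on which Chapter 4 builds, can be sketched as follows. This is a minimal Python illustration and not one of the reduction algorithms for Gaussian integers derived in this thesis. The function names `montgomery_setup`, `montgomery_reduce`, and `montgomery_multiply` are assumptions chosen for this example. The final comparison `u >= n` is the step that relies on the total order of the integers.

```python
def montgomery_setup(n: int, r_bits: int) -> tuple[int, int]:
    """Return R = 2**r_bits and n' = -n^(-1) mod R for an odd modulus n < R."""
    r = 1 << r_bits
    n_prime = (-pow(n, -1, r)) % r
    return r, n_prime


def montgomery_reduce(t: int, n: int, r_bits: int, n_prime: int) -> int:
    """Compute t * R^(-1) mod n for 0 <= t < n * R without dividing by n."""
    r_mask = (1 << r_bits) - 1
    # m is chosen such that t + m * n is divisible by R.
    m = ((t & r_mask) * n_prime) & r_mask
    u = (t + m * n) >> r_bits
    # Final correction step: u < 2n, so one comparison and subtraction suffice.
    if u >= n:
        u -= n
    return u


def montgomery_multiply(a: int, b: int, n: int, r_bits: int) -> int:
    """Compute a * b mod n via the Montgomery domain (illustrative only)."""
    r, n_prime = montgomery_setup(n, r_bits)
    a_m = (a * r) % n  # map the operands into the Montgomery domain
    b_m = (b * r) % n
    prod_m = montgomery_reduce(a_m * b_m, n, r_bits, n_prime)
    # A second reduction maps the product back to the ordinary domain.
    return montgomery_reduce(prod_m, n, r_bits, n_prime)
```

For example, `montgomery_multiply(7, 11, 29, 5)` returns 19, which equals 77 mod 29. For complex numbers, no counterpart of the comparison `u >= n` is available, which is why the generalization to Gaussian integers needs a different final step.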

2

Elliptic Curve Cryptography

In this chapter, we briefly motivate the use of asymmetric cryptography based on ECC systems for applications in resource-constrained devices like flash memory controllers. We review applications and fundamentals of elliptic curve cryptography, the corresponding one-way function, i.e. the so-called point multiplication (PM), as well as the associated group laws. Furthermore, we summarize methods that reduce the computational complexity and speed up the calculation of the point multiplication. In Chapter 3, we investigate the advantages of ECC systems over Gaussian integers. Subsequently, we present hardware architectures for associated coprocessors for the point multiplication in Chapter 5. This chapter is organized as follows. In Section 2.1, we consider the advantages of asymmetric cryptography for flash memory controllers and compare ECC with RSA systems. In Section 2.2, we briefly review the form of the elliptic curves considered in this work, as well as the use of elliptic curves in cryptographic applications. In Section 2.3, we review different algorithms for calculating the elliptic curve point multiplication. We illustrate the elliptic curve point operations required to compute the point multiplication in Section 2.4. We review a method to reduce the time consumption for calculating the point multiplication in Section 2.5. Finally, we discuss consequences of the basics of ECC systems for this work in Section 2.6.

2.1

Cryptography for Flash Memory Controllers

© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2021. M. Safieh, Algorithms and Architectures for Cryptography and Source Coding in Non-Volatile Flash Memories, Schriftenreihe der Institute für Systemdynamik (ISD) und optische Systeme (IOS), https://doi.org/10.1007/978-3-658-34459-7_2

Nowadays, nearly all modern computer, electronic, and communication devices contain flash memories as a digital storage medium in the form of USB sticks, SSDs, SD cards, and so on [1]. The increasing presence of networked devices has led to higher requirements for information security. Many concepts have been proposed to address different security issues of flash memory devices [1, 3, 6, 8–13, 64]. For example, in [10] a method is presented to protect the data stored in a flash memory using a cryptographic file system. Similarly, in [11–13] new approaches are presented to increase the resistance of the memory content against attackers with physical access to the device. In this work, we focus instead on introducing efficient asymmetric cryptography into highly resource-constrained systems like flash memory controllers. This enables additional functionalities, e.g. key exchange and digital signatures, which can be used to authenticate firmware updates, for example. In general, cryptographic systems are divided into two types, namely symmetric and asymmetric cryptographic systems [14, 15]. Asymmetric cryptographic systems are very computationally intensive [14, 15, 65, 66]. For applications in personal and server computing, sufficient computational performance is available to perform asymmetric cryptographic methods [64]. On the other hand, many resource-constrained systems such as flash memory controllers still rely completely on symmetric cryptography [1, 64] due to the limited computational performance. For message authentication, such systems may utilize a message authentication code like the HMAC to validate the data integrity [14, 15]. However, symmetric cryptography has drawbacks, such as the so-called key distribution problem and key management problem, where a secret (and authenticated) channel is expected [14]. Note that a secret channel is protected from overhearing and tampering, while an authentic channel is resistant against tampering but not necessarily overhearing. In most practical applications, such a secret channel does not exist. Hence, services of an online trusted third party are used for the key distribution [14]. These services violate non-repudiation, because the keys may be shared between two or more users, i.e. it is impossible to distinguish between the individual users.
Consequently, for an attacker it is sufficient to extract the secret key of one user to obtain the secret key of all other users in the communication system. Asymmetric cryptography (also called public key cryptography) provides solutions for the problems of symmetric cryptography, because it only requires an authentic (but not secret) channel for the exchange of the so-called public key [14]. There exist several public key systems in the literature like the elliptic curve cryptography and the Rivest-Shamir-Adleman (RSA) system [14–17, 67], which enable additional functionalities like the key exchange and digital signature. These functionalities comply with the non-repudiation principle and provide origin authentication, which is advantageous for many applications such as the authentication of firmware updates for flash memory controllers. Public key cryptographic systems using elliptic curves were introduced in [68, 69]. These systems have become widespread in the last decade due to the comparatively small key lengths. It was shown in [14] that elliptic curve cryptographic systems require smaller key sizes in comparison with RSA systems to provide the same security level. Note that a security level of k bits indicates that the best-known algorithm for breaking this system would execute $2^k$ steps [14]. Table 2.1 illustrates some examples for the recommended key length for both systems with varying security levels. Consequently, due to the limited performance of highly resource-constrained systems like flash memory controllers, in this work we focus on low-cost coprocessors optimized for the elliptic curve operations. Furthermore, we investigate concepts to improve the resistance of such coprocessors against side channel attacks. Such improvements are beneficial for applications where the private key is employed.

Table 2.1 Comparison of key sizes of ECC and RSA cryptographic systems for equivalent security levels in bits according to [14]

Security level in bits    80      112     128     192     256
ECC key length            160     224     256     384     512
RSA key length            1024    2048    3072    8192    15360

2.2

Applications of Elliptic Curve Cryptographic (ECC) Systems

In this section, we illustrate relevant forms of elliptic curves for this work. In particular, we review the simplified Weierstrass curves [14], i.e. elliptic curves over prime and binary extension fields according to the NIST standard [16]. Furthermore, we consider the key exchange and digital signatures using ECC systems in more detail. Elliptic curves are advantageous for several cryptographic schemes. In [14], a basic ElGamal ECC scheme was considered, where the encryption and decryption are completely based on elliptic curve operations. This cryptographic scheme provides high security, but involves high computational complexity due to the expensive elliptic curve operations. In this work, we focus on applications in flash memory controllers with limited processing power and storage capacity for intermediate computation results. Consequently, this cryptographic scheme is too expensive for applications in such embedded systems. However, small embedded systems can benefit from a hybrid approach, where the key exchange between two users is performed with ECC schemes. Subsequently, the encryption and decryption are undertaken using a low-cost symmetric cryptographic algorithm such as AES [14, 65, 66], using the previously exchanged keys. Moreover, ECC systems are interesting for digital signatures, where the users are supposed to authenticate the message received, which is not necessarily encrypted. In this scenario, the hybrid system profits from the ECC public-private key pair, where the public key is known to all users in the communication system. Subsequently, the public key is utilized for the authentication of the message received, while signing a message is done with a private key.

2.2.1

Elliptic Curves

There exist different forms of elliptic curves that are suitable for ECC applications [14, 16, 17, 67, 70]. In this work, we focus on two types of elliptic curves. Curves of the form

\[
y^2 = x^3 + \alpha x + \beta \tag{2.1}
\]

have been recommended for prime fields GF(p) [17, 67]. Such curves are called prime curves in this work. The parameters α and β are constant coefficients satisfying $4\alpha^3 + 27\beta^2 \not\equiv 0 \pmod{p}$ for primes p > 3. The points P(x, y) on the curve are defined by a pair x and y that satisfies (2.1), and the point at infinity denoted by O [17]. For binary extension fields $GF(2^m)$, typically elliptic curves of the form

\[
y^2 + xy = x^3 + \alpha x^2 + \beta \tag{2.2}
\]

are used [14, 16, 17]. In this work, such curves are called binary extension curves. Figure 2.1 illustrates two examples over the real numbers: a curve of the form (2.1) with α = −20 and β = 40, and a curve of the form (2.2) with α = −4 and β = 7. The parameters α, β and the field used define the points on the curve. This is illustrated in the following example.

Example 2.1 Let p = 53, α = 8, β = 0, and consider the prime curve

\[
y^2 = x^3 + 8x \tag{2.3}
\]

defined over GF(p). This is an elliptic curve that satisfies the condition for the parameters α and β, since $4\alpha^3 + 27\beta^2 = 2048 \equiv 34 \not\equiv 0 \pmod{53}$. In this case, the curve has 58 points including the point at infinity O, e.g. the point (0, 0) satisfies (2.3). The point (1, 3) generates a subgroup of prime order 29. The 29 points of this subgroup are

(1, 3) (46, 5) (52, 16) (37, 4)
(47, 1) (16, 14) (29, 20) (4, 19)
(4, 34) (29, 33) (16, 39) (47, 52)
(37, 49) (52, 37) (46, 48) (1, 50)
(6, 23) (9, 18) (49, 13) (7, 9)
(24, 17) (44, 10) (44, 43) (24, 36)
(7, 44) (49, 40) (9, 35) (6, 30)
O.
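The point set of Example 2.1 can be checked by brute force. The following is a minimal sketch, not part of the original text: it assumes Python 3.8 or newer (for `pow` with a negative exponent), represents the point at infinity O by `None`, and uses the hypothetical helper names `on_curve` and `add`. It counts all solutions of (2.3) over GF(53) and then walks through the multiples of (1, 3) until O is reached.

```python
# Toy parameters from Example 2.1: y^2 = x^3 + 8x over GF(53).
p, alpha, beta = 53, 8, 0

def on_curve(x, y):
    """True if (x, y) satisfies the curve equation modulo p."""
    return (y * y - (x**3 + alpha * x + beta)) % p == 0

def add(A, B):
    """Affine group law on the prime curve; None represents O."""
    if A is None:
        return B
    if B is None:
        return A
    (x1, y1), (x2, y2) = A, B
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                      # A + (-A) = O
    if A == B:                           # tangent slope (doubling)
        m = (3 * x1 * x1 + alpha) * pow(2 * y1, -1, p) % p
    else:                                # chord slope (addition)
        m = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (m * m - x1 - x2) % p
    return (x3, (m * (x1 - x3) - y1) % p)

# Condition on the parameters: 4*alpha^3 + 27*beta^2 must be nonzero mod p.
assert (4 * alpha**3 + 27 * beta**2) % p == 34

# All affine solutions of the curve equation.
affine = [(x, y) for x in range(p) for y in range(p) if on_curve(x, y)]

# Subgroup generated by (1, 3): keep adding the generator until O is reached.
G = (1, 3)
subgroup, Q = [None], G
while Q is not None:
    subgroup.append(Q)
    Q = add(Q, G)

print(len(affine) + 1, len(subgroup))    # curve order (with O), subgroup order
```

Under these assumptions the script reports 58 points on the curve in total and 29 points in the subgroup generated by (1, 3); the latter are the points listed in Example 2.1.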

Figure 2.1 Examples for elliptic curve over real numbers

Asymmetric cryptographic systems are based on one-way functions [15]. A one-way function is a mathematical function that is highly asymmetric in terms of its computational complexity with respect to calculating the function and its inverse. A one-way function is relatively easy to compute in the forward direction, but the computation of the inverse should be extremely difficult [15]. The one-way function of ECC systems is the point multiplication [14, 15]. The calculation of the point multiplication dominates the complexity and execution time of ECC algorithms. Next, we discuss applications of the ECC systems. In Section 2.3, we review the point multiplication in more detail.

2.2.2

Key Exchange

In this section, we consider the key pair generation and the elliptic curve Diffie-Hellman key exchange [15]. The basic idea of the key exchange is to generate a shared secret for two users over an insecure channel. Each of these users requires a public-private key pair, which can be calculated with Algorithm 2.1. This algorithm utilizes the domain parameters of the corresponding elliptic curve, i.e. a prime p, the elliptic curve equation E, a point on the curve P, and the order of this point n. The private key is an integer d, which is selected randomly from the interval {1, . . . , n − 1} [14]. The public-private key pair resulting from this algorithm is denoted by (Q, d).

Algorithm 2.1 Elliptic curve key pair generation according to [14]
input: Elliptic curve domain parameters (p, E, P, n)
output: Public key Q and private key d
1: d ∈ {1, . . . , n − 1}    // Select a random private key d
2: Q = d · P    // Point multiplication
3: return (Q, d)

Now, assume two users Alice and Bob with two key pairs $(Q_A, d_A)$ and $(Q_B, d_B)$, respectively. Both users share their public keys over an insecure channel. Alice receives the public key $Q_B$ and calculates the point S(x, y) according to Algorithm 2.2, where her own private key $d_A$ is required. The shared secret resulting from this algorithm is the coordinate x of the calculated point S(x, y). Bob performs the same process with $Q_A$ and $d_B$. Now, the calculated point in Algorithm 2.2 is

\[
S(x, y) = d_A \cdot Q_B = d_A d_B \cdot P = d_B d_A \cdot P = d_B \cdot Q_A. \tag{2.4}
\]

Hence, the resulting coordinate x is equal for the two users. Note that no other user can calculate the same coordinate x, except by solving the Diffie-Hellman problem [71], i.e. by determining the inverse function of the point multiplication. Furthermore, in many applications the authentication of the users is also achieved using this Diffie-Hellman key exchange. However, in this manner the public key can be vulnerable to a man-in-the-middle attack [71], which is based on replacing the public key $Q_A$ or $Q_B$ with a public key of the attacker during the exchange. Furthermore, the Diffie-Hellman key exchange is used for generating a key pair. In the next subsection, we consider the concept of digital signatures, which is the dominant approach for message authentication.
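The key pair generation and the key exchange can be illustrated on the toy curve of Example 2.1. The sketch below is an illustration under stated assumptions, not a secure implementation: the base point P = (1, 3) with order n = 29 is taken as the domain parameters, `None` stands for O, Python 3.8 or newer is assumed, and the helper names `add`, `mul`, `keygen`, and `shared_secret` are hypothetical. Real systems use standardized curves with key lengths as in Table 2.1.

```python
import secrets

p, alpha = 53, 8          # toy curve y^2 = x^3 + 8x over GF(53), Example 2.1
P, n = (1, 3), 29         # assumed base point and its order

def add(A, B):
    """Affine group law on the prime curve; None represents O."""
    if A is None:
        return B
    if B is None:
        return A
    (x1, y1), (x2, y2) = A, B
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if A == B:
        m = (3 * x1 * x1 + alpha) * pow(2 * y1, -1, p) % p
    else:
        m = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (m * m - x1 - x2) % p
    return (x3, (m * (x1 - x3) - y1) % p)

def mul(k, A):
    """Point multiplication k * A with a simple double-and-add loop."""
    Q = None
    for bit in bin(k)[2:]:
        Q = add(Q, Q)
        if bit == "1":
            Q = add(Q, A)
    return Q

def keygen():
    """Algorithm 2.1: random private key d and public key Q = d * P."""
    d = secrets.randbelow(n - 1) + 1     # d in {1, ..., n - 1}
    return mul(d, P), d

def shared_secret(d_own, Q_peer):
    """Algorithm 2.2: x coordinate of d_own * Q_peer, or None if invalid."""
    S = mul(d_own, Q_peer)
    return None if S is None else S[0]

Q_A, d_A = keygen()                      # Alice
Q_B, d_B = keygen()                      # Bob
x_A = shared_secret(d_A, Q_B)            # computed by Alice
x_B = shared_secret(d_B, Q_A)            # computed by Bob
print(x_A == x_B)
```

Both sides obtain the same x coordinate, as stated by (2.4), because the scalar factors commute.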

2.2.3

Digital Signatures

Algorithm 2.2 Elliptic curve Diffie-Hellman key exchange [71]
input: Q_B, d_A, and the elliptic curve domain parameters (p, E, P, n)
output: Shared secret coordinate x, or invalid
1: S(x, y) = d_A · Q_B    // Point multiplication
2: if (S ≠ O) then
3:     return coordinate x of the point S(x, y)
4: else
5:     return invalid
6: end if

The digital signature is an asymmetric cryptographic method that is necessary to verify whether a message was signed by a specific user, i.e. to ensure the integrity of the message and its origin [72]. The idea behind this method is that a transmitter A generates a digital signature using the private key $d_A$. All other users in the same communication system can verify whether the signature was generated by transmitter A, using the corresponding public key $Q_A$. Using this approach, it is possible to detect whether either the message or the signature was falsified over the insecure channel. Figure 2.2 illustrates the principle of digital signatures with two users, where Alice is the transmitter and Bob is the receiver. As shown in this figure, Alice first computes a hash value from the message that she wants to sign. There exist several standards for hash functions to compute this value [15, 71, 73], but it is essential that Alice and all other users in the communication system agree on the same hash function. Next, the digital signature is generated by encrypting the computed hash value with the private key of Alice $d_A$. This signature is attached to the original message and can be sent over an insecure channel. Bob is one of the users in the same communication system that receives the message and the signature. The public key of Alice $Q_A$ is already known to Bob. Hence, Bob can obtain the hash value by decrypting the signature employing $Q_A$. Bob computes a second hash value for the message received using the same hash function as Alice. The two hash values are then compared. If the two values are identical, this proves that the message was sent by Alice and was received correctly. The generation and verification of digital signatures according to [17] are summarized in Algorithms 2.3 and 2.4, respectively. The public-private key pair (Q, d) is generated by the transmitter with Algorithm 2.1, and the calculated hash value from the message M is denoted by e. The resulting digital signature comprises two values, r and s. These are required for the signature verification in Algorithm 2.4. This algorithm results in a valid signature if the received value r and the determined value v are identical.
This implies

\[
k \cdot P = e' s^{-1} \cdot P + r s^{-1} \cdot Q', \tag{2.5}
\]
\[
k \cdot P = \frac{e' k}{e + d r} \cdot P + \frac{d' r k}{e + d r} \cdot P, \tag{2.6}
\]
\[
(e + d r) k \cdot P = e' k \cdot P + d' r k \cdot P, \tag{2.7}
\]
\[
(e k + d r k) \cdot P = (e' k + d' r k) \cdot P, \tag{2.8}
\]

where all scalar multiplications and additions are calculated modulo n. Moreover, we assumed $Q' = d' \cdot P$ according to Algorithm 2.1 for the signature verification algorithm, while $e$ and $e'$ denote the hash values determined in Algorithms 2.3 and 2.4, respectively. Equality in (2.8) holds for $d = d'$, i.e. for the valid public-private key pair (Q, d) of the transmitter by which the message was signed. Hence, it can be undeniably authenticated that the message was sent by a specific user. Moreover, the two hash values must be identical for the equality in (2.8), i.e. $e = e'$. In this case, the receiver ensures that neither the message nor the signature has been changed or falsified. The public key Q and the value r of the digital signature are calculated using the one-way function of the ECC system, i.e. the point multiplication, where the computation of the inverse to obtain the secret key d is extremely difficult.

Figure 2.2 Principle of digital signatures

Algorithm 2.3 Digital signature generation according to [17]
input: private key d, message M and the elliptic curve domain parameters (p, E, P, n)
output: the signature (r, s) of the message M
1: e = SHA(M)    // Hash the message
2: k ∈ {1, . . . , n − 1}    // Select a random parameter k
3: (x, y) = k · P    // Point multiplication
4: r = x mod n
5: if (r = 0) then
6:     repeat from step 2
7: end if
8: s = (e + dr) k^{-1} mod n
9: if (s = 0) then
10:     repeat from step 2
11: end if
12: return Signature (r, s)

Algorithm 2.4 Digital signature verification according to [17]
input: public key Q, message M, signature (r, s), and domain parameters (p, E, P, n)
output: valid or invalid signature
1: e = SHA(M)    // Hash the message
2: u_1 = e s^{-1} mod n
3: u_2 = r s^{-1} mod n
4: (x, y) = u_1 · P + u_2 · Q    // Two point multiplications
5: v = x mod n
6: if (v = r) then
7:     return valid signature
8: else
9:     return invalid signature
10: end if
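The signature generation and verification steps can be sketched on the toy curve of Example 2.1. The following is a hedged illustration, not a secure implementation: it assumes the base point P = (1, 3) of order n = 29, uses SHA-256 from Python's `hashlib` as a stand-in for the hash function SHA(M), reduces the hash modulo the tiny group order, and adds a range check on (r, s) that is not part of Algorithm 2.4. The helper names `add`, `mul`, `hash_to_int`, `sign`, and `verify` are hypothetical.

```python
import hashlib
import secrets

p, alpha = 53, 8          # toy curve y^2 = x^3 + 8x over GF(53), Example 2.1
P, n = (1, 3), 29         # assumed base point and its order

def add(A, B):
    """Affine group law on the prime curve; None represents O."""
    if A is None:
        return B
    if B is None:
        return A
    (x1, y1), (x2, y2) = A, B
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if A == B:
        m = (3 * x1 * x1 + alpha) * pow(2 * y1, -1, p) % p
    else:
        m = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (m * m - x1 - x2) % p
    return (x3, (m * (x1 - x3) - y1) % p)

def mul(k, A):
    """Point multiplication k * A with a simple double-and-add loop."""
    Q = None
    for bit in bin(k)[2:]:
        Q = add(Q, Q)
        if bit == "1":
            Q = add(Q, A)
    return Q

def hash_to_int(message):
    """e = SHA(M), reduced modulo n for the toy group size (assumption)."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n

def sign(d, message):
    """Algorithm 2.3: returns the signature (r, s)."""
    e = hash_to_int(message)
    while True:
        k = secrets.randbelow(n - 1) + 1          # step 2
        r = mul(k, P)[0] % n                      # steps 3 and 4
        if r == 0:
            continue                              # repeat from step 2
        s = (e + d * r) * pow(k, -1, n) % n       # step 8
        if s != 0:
            return r, s

def verify(Q, message, signature):
    """Algorithm 2.4, plus a range check on (r, s) added for robustness."""
    r, s = signature
    if not (0 < r < n and 0 < s < n):
        return False
    e = hash_to_int(message)
    s_inv = pow(s, -1, n)
    X = add(mul(e * s_inv % n, P), mul(r * s_inv % n, Q))
    return X is not None and X[0] % n == r

d = secrets.randbelow(n - 1) + 1                  # private key
Q = mul(d, P)                                     # public key
signature = sign(d, b"firmware image")
print(verify(Q, b"firmware image", signature))
```

With such a small group order, forged or modified messages can pass the check by chance, so the example only demonstrates that a correctly generated signature verifies.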

2.3

Elliptic Curve Point Multiplication

Next, we discuss the one-way function of ECC systems in more detail. For any point on the curve P and any integer k, the product k · P is an elliptic curve point multiplication. Note that the integer k is usually a key, e.g. it is the private key in Algorithms 2.1 to 2.3, or the x coordinate of the public key Q for the signature verification in Algorithm 2.4. The point multiplication is typically calculated based on the binary expansion $k = \sum_{j=0}^{r-1} k_j 2^j$ of the integer k with binary digits $k_j \in \{0, 1\}$, where r is the length in bits and $k_0, k_1, \ldots, k_{r-1}$ is the binary representation of the integer k. Using the Horner scheme, the point multiplication can be calculated as

\[
k \cdot P = \sum_{j=0}^{r-1} k_j 2^j (P) = 2(\ldots 2(2 k_{r-1} P + k_{r-2} P) + \ldots) + k_0 P, \tag{2.9}
\]

with the double-and-add method [14]. Figure 2.3 illustrates the three-level hierarchy for computing the point multiplication. The point multiplication in level 3 is obtained with the ADD and DBL operations from level 2. Both point operations are computed using modular arithmetic over a finite field, as demonstrated in level 1 of Figure 2.3. In the following, we focus on the point operations in levels 2 and 3, and refer to [14, 15, 74] for the prime and binary extension field arithmetic for level 1. However, we present hardware architectures for the point multiplication in Chapters 5 and 6, for which we discuss the arithmetic operations over both fields in more detail. There exist many algorithms to calculate the point multiplication. In this work, the three most commonly used point multiplication algorithms are considered [14]. The double-and-add method is summarized in Algorithm 2.5, the point multiplication according to the ANSI standard [17] is described in Algorithm 2.6, and the Montgomery ladder point multiplication in Algorithm 2.7. All three algorithms are based on the point addition and doubling operations from level 2 of Figure 2.3. The number of required point additions and doubling operations depends on the key as well as the point multiplication algorithm applied. For Algorithm 2.5, r − 1 point doublings and at most r − 1 point additions are necessary to determine the point multiplication, whereas the number of point additions varies with the integer k, i.e. a point addition is only performed if the current binary digit $k_j = 1$. Note that we assume $k_{r-1} = 1$ [72]. Similarly, r − 1 point doublings and at most r − 1 point additions are required per point multiplication using Algorithm 2.6. This algorithm utilizes two integers k and h of length r + 1, where h is derived from k as h = 3k and $h_r = 1$. In step 4 of this algorithm, the point doubling is performed for the current point Q. Executing the point addition depends on the values of the digits of k and h. In step 6, the points Q and P are added, whereas in step 8 P is subtracted from Q. The point subtraction can be calculated by adding Q and −P, i.e. the additive inverse of P [14, 17]. Note that the additive inverse of any point depends on the form of the curve, as detailed in Section 2.4. In contrast to Algorithms 2.5 and 2.6, the number of point additions in Algorithm 2.7 is fixed to r − 1, which increases the robustness against side channel attacks. However, depending on the actual digit of k, the point doubling is applied to either $Q_0$ or $Q_1$. Overall, r point doubling operations are performed using this algorithm.

Figure 2.3 Hierarchy of calculating the point multiplication for prime fields

Algorithm 2.5 Point multiplication double-and-add [72, 75]
input: P, k = (1, k_{r-2}, . . . , k_1, k_0)_2
output: k · P
1: Q = P
2: for (j = r − 2 down to 0) do
3:     Q = 2Q    // DBL Q
4:     if (k_j = 1) then
5:         Q = Q + P    // ADD Q and P
6:     end if
7: end for
8: return Q

Algorithm 2.6 Point multiplication according to the ANSI [17]
input: P, k = (k_r, k_{r-1}, . . . , k_1, k_0)_2
output: k · P
1: h = 3k
2: Q = P
3: for (j = r − 1 down to 1) do
4:     Q = 2Q    // DBL Q
5:     if ((h_j = 1) && (k_j = 0)) then
6:         Q = Q + P    // ADD Q and P
7:     else if ((h_j = 0) && (k_j = 1)) then
8:         Q = Q − P    // ADD Q and −P
9:     end if
10: end for
11: return Q

Algorithm 2.7 Montgomery ladder point multiplication according to [22, 76]
input: P, k = (1, k_{r-2}, . . . , k_1, k_0)_2
output: k · P
1: Q_0 = P
2: Q_1 = 2P    // DBL P
3: for (j = r − 2 down to 0) do
4:     if (k_j = 1) then
5:         Q_0 = Q_0 + Q_1    // ADD Q_0 and Q_1
6:         Q_1 = 2Q_1    // DBL Q_1
7:     else
8:         Q_1 = Q_0 + Q_1    // ADD Q_0 and Q_1
9:         Q_0 = 2Q_0    // DBL Q_0
10:     end if
11: end for
12: return Q_0
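The three point multiplication algorithms can be compared directly on the toy curve of Example 2.1. The sketch below is one possible transcription under stated assumptions: affine coordinates, `None` for the point at infinity O, Python 3.8 or newer, the point P = (1, 3) of order 29, and hypothetical function names. For the ANSI variant, r is taken as the index of the leading one of h = 3k.

```python
p, alpha = 53, 8          # toy curve y^2 = x^3 + 8x over GF(53), Example 2.1

def neg(A):
    """Additive inverse -P = (x, -y); None represents O."""
    return None if A is None else (A[0], -A[1] % p)

def add(A, B):
    """Affine group law on the prime curve; None represents O."""
    if A is None:
        return B
    if B is None:
        return A
    (x1, y1), (x2, y2) = A, B
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if A == B:
        m = (3 * x1 * x1 + alpha) * pow(2 * y1, -1, p) % p
    else:
        m = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (m * m - x1 - x2) % p
    return (x3, (m * (x1 - x3) - y1) % p)

def double_and_add(k, P):
    """Algorithm 2.5: the ADD is executed only for digits k_j = 1."""
    Q = P
    for j in range(k.bit_length() - 2, -1, -1):
        Q = add(Q, Q)                             # DBL
        if (k >> j) & 1:
            Q = add(Q, P)                         # ADD
    return Q

def ansi_multiply(k, P):
    """Algorithm 2.6: compares the digits of h = 3k and k."""
    h = 3 * k
    r = h.bit_length() - 1                        # h_r is the leading one
    Q = P
    for j in range(r - 1, 0, -1):
        Q = add(Q, Q)                             # DBL
        h_j, k_j = (h >> j) & 1, (k >> j) & 1
        if h_j == 1 and k_j == 0:
            Q = add(Q, P)                         # ADD Q and P
        elif h_j == 0 and k_j == 1:
            Q = add(Q, neg(P))                    # ADD Q and -P
    return Q

def montgomery_ladder(k, P):
    """Algorithm 2.7: one ADD and one DBL per digit, independent of k_j."""
    Q0, Q1 = P, add(P, P)
    for j in range(k.bit_length() - 2, -1, -1):
        if (k >> j) & 1:
            Q0, Q1 = add(Q0, Q1), add(Q1, Q1)
        else:
            Q0, Q1 = add(Q0, Q0), add(Q0, Q1)
    return Q0

P = (1, 3)
k = 23
print(double_and_add(k, P), ansi_multiply(k, P), montgomery_ladder(k, P))
```

All three functions return the same point for a given k. Only the ladder executes the same sequence of point operations for every digit, which is the property that the text relates to side channel robustness.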

Calculating the point doublings and additions is based on the form of the corresponding elliptic curve [14, 17], as discussed in the following.

2.4

Elliptic Curve Geometry and Group Laws

In this section, we explain the point addition and doubling geometrically. As previously mentioned, all operations in the first level of Figure 2.3 are calculated with finite field arithmetic. Therefore, we distinguish between the group laws for curves over prime and binary extension fields, which are the most common fields in ECC applications [14, 15, 17]. In this section, we review group laws for curves over primes and in Section 6.1 for curves over binary extension fields.

2.4.1

Elliptic Curve Geometry

We first consider the point addition geometrically using the chord-and-tangent rule [14, 15]. With this operation, the points on the curve form an Abelian group with the point at infinity O as the identity [14, 15]. Let $P_1(x_1, y_1)$ and $P_2(x_2, y_2)$ be two distinct points on the elliptic curve. Then the sum is a third point $P_3(x_3, y_3)$ on the same curve. Figure 2.4 illustrates the point addition geometrically. First, a line is drawn through $P_1$ and $P_2$. The third intersection point of this line with the curve is denoted by the point T. Finally, the reflection of T around the x-axis yields the result of the point addition $P_3$ [14, 15]. Note that for $P_1 = -P_2$, the point addition results in the point at infinity O. For the point O, we have a finite x coordinate and y = ±∞. The point doubling is a special case of the point addition operation, where $P_1 = P_2$. This case is depicted in Figure 2.5 and geometrically defined as follows. Since we have $P_1 = P_2$, we draw a tangent to the curve at this point. The second intersection point of this tangent with the curve is denoted as the point T. Reflecting T around the x-axis yields a point $P_3$, which is the final result of the point doubling.

Figure 2.4 Point addition (ADD) example

The algebraic definitions for point addition and doubling operations can be derived from the geometric description. For instance, with the introduced process of the point addition, the resulting line through the points $P_1$ and $P_2$ is given by

\[
y = m (x - x_1) + y_1, \tag{2.10}
\]
\[
m = \frac{y_2 - y_1}{x_2 - x_1}. \tag{2.11}
\]

We obtain the y-coordinate of the point T, i.e. $y = y_3'$, by substituting $x = x_3$ in (2.10), where $x_3$ is defined according to [15, 77] as

\[
x_3 = m^2 - x_1 - x_2. \tag{2.12}
\]

Note that $x_3$ is the x-coordinate of both the point T and the point $P_3$ [15, 77]. Consequently, the negation of $y_3'$ yields $y_3$, which is the y-coordinate of the resulting point $P_3$. However, the point addition and doubling depend on the form of the corresponding curve. Hence, in the following we consider the algebraic definitions for prime curves according to [14]. In Chapter 6, we briefly review the definitions for binary extension fields and introduce a suitable hardware architecture for calculating the elliptic curve point multiplication over such fields.
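The expression for the x-coordinate of the third intersection point can be derived in a few lines for curves of the form (2.1). The following derivation is a standard argument added for illustration; the abbreviation $c = y_1 - m x_1$ is introduced here and does not appear in the original text.

```latex
% Insert the line y = m x + c, with c = y_1 - m x_1, into y^2 = x^3 + \alpha x + \beta:
\begin{align*}
(m x + c)^2 &= x^3 + \alpha x + \beta \\
0 &= x^3 - m^2 x^2 + (\alpha - 2 m c)\, x + (\beta - c^2).
\end{align*}
% The three roots of this cubic are the x-coordinates of P_1, P_2 and T.
% By Vieta's formula, their sum equals the negated coefficient of x^2:
\begin{align*}
x_1 + x_2 + x_3 &= m^2
\quad\Longrightarrow\quad
x_3 = m^2 - x_1 - x_2 .
\end{align*}
% Evaluating the line at x_3 gives the y-coordinate of T; reflecting it yields y_3:
\begin{align*}
y_3' &= m (x_3 - x_1) + y_1, &
y_3 &= -y_3' = m (x_1 - x_3) - y_1 .
\end{align*}
```

The same argument applies to the point doubling when m is replaced by the slope of the tangent at $P_1$, in which case $x_1$ is a double root of the cubic.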

2.4.2

Group Laws for Prime Curves

In this section, we review prime curves of the form defined in (2.1), where p > 3. The properties of these curves can be derived geometrically as demonstrated in Subsection 2.4.1. We summarize these properties in the following according to [14, 15].

Figure 2.5 Point doubling (DBL) example

• Identity: P + O = O + P = P for any point P on the elliptic curve over GF(p).
• Negatives: For any point P on the elliptic curve there exists a point −P such that P + (−P) = O holds. This point on the curve is defined as

\[
-P = P(x, -y). \tag{2.13}
\]

• Point addition: Let $P_1$ and $P_2$ be two distinct points on a prime elliptic curve, where $P_1 \neq -P_2$. Then the addition of these points is defined as

\[
P_3(x_3, y_3) = P_1(x_1, y_1) + P_2(x_2, y_2), \tag{2.14}
\]
\[
x_3 = \left( \frac{y_2 - y_1}{x_2 - x_1} \right)^2 - x_1 - x_2, \tag{2.15}
\]
\[
y_3 = \left( \frac{y_2 - y_1}{x_2 - x_1} \right) (x_1 - x_3) - y_1, \tag{2.16}
\]

where $P_3$ is a point on the curve. The calculations of $x_3$ and $y_3$ follow from the geometric derivations (2.12) and (2.10), respectively.
• Point doubling: Let $P_1$ be a point on a prime elliptic curve. Then doubling this point yields

\[
P_3(x_3, y_3) = 2 P_1(x_1, y_1), \tag{2.17}
\]
\[
x_3 = \left( \frac{3 x_1^2 + \alpha}{2 y_1} \right)^2 - 2 x_1, \tag{2.18}
\]
\[
y_3 = \left( \frac{3 x_1^2 + \alpha}{2 y_1} \right) (x_1 - x_3) - y_1, \tag{2.19}
\]

where $P_3$ is a point on the curve.

Note that $x_3$ and $y_3$ in equations (2.15) to (2.19) are calculated using modular arithmetic over prime fields GF(p). Next, we give an example for the addition and doubling of points on the prime curve.

Example 2.2 For p = 53, α = 8, and β = 0, the points $P_1 = (1, 3)$ and $P_2 = (7, 9)$ are located on the curve $y^2 = x^3 + 8x$ defined over GF(p), as shown in Example 2.1. Now, adding $P_1$ and $P_2$ results in $P_1 + P_2 = P_3 = (46, 5)$. Similarly, doubling $P_1$ yields $2P_1 = P_4 = (47, 1)$. The resulting points in both cases are located on the same elliptic curve $y^2 = x^3 + 8x$.

The most expensive arithmetic operation to calculate the point addition and doubling is the multiplicative inversion in the finite field. Therefore, the complexity of the point multiplication strongly depends on the implementation of these inversions. In the following, we demonstrate that using projective coordinates is beneficial to reduce the number of field inversions per point multiplication.
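The results of Example 2.2 can be retraced step by step with (2.15), (2.16), (2.18), and (2.19). The following worked calculation is added for illustration; it uses $6^{-1} \equiv 9 \pmod{53}$, since $6 \cdot 9 = 54 \equiv 1$.

```latex
% Point addition P_1 + P_2 with P_1 = (1, 3), P_2 = (7, 9), p = 53:
\begin{align*}
m   &= (9 - 3)(7 - 1)^{-1} = 6 \cdot 6^{-1} \equiv 1 \pmod{53}, \\
x_3 &= 1^2 - 1 - 7 = -7 \equiv 46 \pmod{53}, \\
y_3 &= 1 \cdot (1 - 46) - 3 = -48 \equiv 5 \pmod{53}.
\end{align*}
% Point doubling 2 P_1 with \alpha = 8:
\begin{align*}
m   &= (3 \cdot 1^2 + 8)(2 \cdot 3)^{-1} = 11 \cdot 9 = 99 \equiv 46 \pmod{53}, \\
x_3 &= 46^2 - 2 \cdot 1 \equiv 49 - 2 = 47 \pmod{53}, \\
y_3 &= 46 \cdot (1 - 47) - 3 \equiv -49 - 3 = -52 \equiv 1 \pmod{53}.
\end{align*}
```

Each of the two operations requires exactly one field inversion, which is the cost that the projective coordinates discussed next aim to avoid.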

2.5

Reducing the Number of Field Inversions for Elliptic Curves over Prime Fields

In this section, we consider different projective coordinate systems that reduce the number of required multiplicative inversions and speed up the point multiplication algorithms [14, 78–80]. Such methods transform each point P(x, y) from affine to projective coordinates P(X, Y, Z), where the computation of point additions and doublings is performed without inversion. After computing a point multiplication, the result is transformed back into affine coordinates by applying a single field inversion. Nonetheless, in some applications the result of a point multiplication does not have to be transformed back to affine coordinates, i.e. representing the result of a point multiplication in projective coordinates is sufficient and no inversion is required. Note that transforming any point P(x, y) from affine to projective coordinates can be achieved by considering an additional coordinate Z = 1, i.e. P(x, y, 1), although the inverse transformation depends on the projective coordinates used. Furthermore, the projection is typically defined for a specific form of the elliptic curve [14, 78–80]. In the following, we consider suitable projective coordinates for prime curves. In Chapter 6, we review the corresponding projective coordinates for binary extension curves. In order to speed up the calculation of the point multiplication, several point representation methods were considered. For elliptic curves over prime fields, the projective homogeneous coordinates [79] and the Jacobian coordinates [14, 78] are the dominant representations. These two representations of the projective coordinates differ in the number of modular arithmetic operations required. In this work, we consider the Jacobian coordinates [14, 78] because overall fewer field multiplications are required. Transforming any point (the result of a point multiplication) from Jacobian to affine coordinates requires a single inversion, and is defined as

\[
x = \frac{X}{Z^2}, \quad y = \frac{Y}{Z^3}. \tag{2.20}
\]

Next, we consider the point addition and doubling from (2.14) and (2.17) in Jacobian coordinates, where no inversion is required. Let $P_1(X_1, Y_1, Z_1)$ and $P_2(X_2, Y_2, Z_2)$ be the Jacobian representation of two distinct points on a prime curve of the form defined in (2.1). The addition of these points is calculated according to [78] as

\[
P_1 + P_2 = (X_3, Y_3, Z_3), \tag{2.21}
\]
\[
X_3 = S^2 - T W^2, \tag{2.22}
\]
\[
2 Y_3 = V S - M W^3, \tag{2.23}
\]
\[
Z_3 = Z_1 Z_2 W, \tag{2.24}
\]

where $S = Y_1 Z_2^3 - Y_2 Z_1^3$, $W = X_1 Z_2^2 - X_2 Z_1^2$, $V = T W^2 - 2 X_3$, $T = X_1 Z_2^2 + X_2 Z_1^2$, and $M = Y_1 Z_2^3 + Y_2 Z_1^3$. The factor two in (2.23) follows from the symmetric form of (2.16), so that $Y_3$ is obtained by a multiplication with $2^{-1} \bmod p$.

Similarly, doubling $P_1(X_1, Y_1, Z_1)$ is defined according to [14, 78] as

\[
2 P_1 = (X_3, Y_3, Z_3), \tag{2.25}
\]
\[
X_3 = W^2 - 2S, \tag{2.26}
\]
\[
Y_3 = W (S - X_3) - T, \tag{2.27}
\]
\[
Z_3 = 2 Y_1 Z_1, \tag{2.28}
\]

where $W = 3 X_1^2 + \alpha Z_1^4$, $S = 4 X_1 Y_1^2$, and $T = 8 Y_1^4$. Calculating the point addition is more expensive than the point doubling, but we can reduce this complexity by exploiting some properties of the Jacobian coordinates. Consider the calculation of the point addition in step 5 of Algorithm 2.5. The second argument P does not change throughout the point multiplication algorithm. Consequently, the point P(x, y) can always be represented in Jacobian coordinates by considering Z = 1 as P = P(X, Y, 1). This enables a point addition with reduced complexity [14, 78, 79]. We call this the special point addition (SADD). For two distinct points $P_1(X_1, Y_1, Z_1)$ and $P_2(X_2, Y_2, 1)$ on a prime curve of the form defined in (2.1), the special point addition is defined as

\[
P_1 + P_2 = (X_3, Y_3, Z_3), \tag{2.29}
\]
\[
X_3 = S^2 - T W^2, \tag{2.30}
\]
\[
Y_3 = S (X_1 W^2 - X_3) - Y_1 W^3, \tag{2.31}
\]
\[
Z_3 = Z_1 W, \tag{2.32}
\]

where $S = Y_2 Z_1^3 - Y_1$, $W = X_2 Z_1^2 - X_1$, and $T = X_1 + X_2 Z_1^2$. A drawback of the representation in projective coordinates for a compact hardware implementation is that many interim results have to be stored. Nonetheless, in Chapter 5 we consider hardware architectures where the intermediate results are stored in a RAM to provide an area-efficient implementation.
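The Jacobian formulas can be checked against the affine results of Example 2.2. The following sketch is an illustration under stated assumptions: it uses the toy curve with p = 53 and α = 8, halves the right-hand side of the general addition formula via $2^{-1} \bmod p$, does not handle the point at infinity or the case $P_1 = \pm P_2$ in the addition routines, and uses hypothetical function names. The parameter `lam` only rescales the representation of a point and does not change the point itself.

```python
p, alpha = 53, 8                      # toy curve y^2 = x^3 + 8x over GF(53)
INV2 = pow(2, -1, p)                  # 2^{-1} mod p, needed for Y3 in ADD

def to_jacobian(P, lam=1):
    """(x, y) -> (x*lam^2, y*lam^3, lam); lam = 1 gives P(x, y, 1)."""
    x, y = P
    return (x * lam**2 % p, y * lam**3 % p, lam % p)

def to_affine(J):
    """Back transformation x = X/Z^2, y = Y/Z^3 with a single inversion."""
    X, Y, Z = J
    z_inv = pow(Z, -1, p)
    return (X * z_inv**2 % p, Y * z_inv**3 % p)

def jac_add(J1, J2):
    """General point addition (ADD) for two distinct points, P1 != -P2."""
    (X1, Y1, Z1), (X2, Y2, Z2) = J1, J2
    S = (Y1 * Z2**3 - Y2 * Z1**3) % p
    W = (X1 * Z2**2 - X2 * Z1**2) % p
    T = (X1 * Z2**2 + X2 * Z1**2) % p
    M = (Y1 * Z2**3 + Y2 * Z1**3) % p
    X3 = (S * S - T * W * W) % p
    V = (T * W * W - 2 * X3) % p
    Y3 = (V * S - M * W**3) * INV2 % p
    return (X3, Y3, Z1 * Z2 * W % p)

def jac_dbl(J1):
    """Point doubling (DBL)."""
    X1, Y1, Z1 = J1
    W = (3 * X1 * X1 + alpha * Z1**4) % p
    S = 4 * X1 * Y1 * Y1 % p
    T = 8 * Y1**4 % p
    X3 = (W * W - 2 * S) % p
    return (X3, (W * (S - X3) - T) % p, 2 * Y1 * Z1 % p)

def jac_sadd(J1, P2):
    """Special point addition (SADD): second point is affine, i.e. Z2 = 1."""
    X1, Y1, Z1 = J1
    X2, Y2 = P2
    S = (Y2 * Z1**3 - Y1) % p
    W = (X2 * Z1**2 - X1) % p
    T = (X1 + X2 * Z1**2) % p
    X3 = (S * S - T * W * W) % p
    Y3 = (S * (X1 * W * W - X3) - Y1 * W**3) % p
    return (X3, Y3, Z1 * W % p)

P1, P2 = (1, 3), (7, 9)
J1, J2 = to_jacobian(P1, lam=5), to_jacobian(P2, lam=11)
print(to_affine(jac_add(J1, J2)))     # expected (46, 5) as in Example 2.2
print(to_affine(jac_dbl(J1)))         # expected (47, 1) as in Example 2.2
print(to_affine(jac_sadd(J1, P2)))    # expected (46, 5)
```

Only the final call to `to_affine` performs a field inversion, whereas the affine formulas need one inversion per point operation.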

2.6 Discussion

We have shown in this chapter that ECC systems provide beneficial functionalities for many applications. However, ECC systems are relatively computationally intensive. Hence, many applications benefit from hybrid systems that exploit the efficiency of symmetric cryptography and the features of public key algorithms. We have demonstrated that the ECC point multiplication is used as a one-way function, which is the most expensive operation. The implementation of the point multiplication dominates the computational complexity and time consumption of the whole ECC system. Consequently, software implementations of the point multiplication algorithms may be too slow on resource-constrained systems, such as flash memory controllers [81, 82]. Hence, coprocessors were proposed that are optimized for calculating the point multiplication. Many such coprocessor designs aim to achieve high computational performance and require only a few microseconds per point multiplication [83–87]. Such designs target applications in network infrastructure and data centers to cope with the increase of secure network traffic, which requires public key algorithms. On the other hand, ECC applications in small embedded systems such as flash memory controllers are less demanding regarding computational performance. For example, a latency of a few milliseconds may be acceptable for the key exchange, the secure authentication, or the verification of a digital signature. In this work, we focus on area-efficient elliptic curve cryptographic coprocessors for calculating the point multiplication. These coprocessors are designed for applications in small embedded systems where high-performance coprocessors are too costly.

3 Elliptic Curve Cryptography over Gaussian Integers

Thus far, we have reviewed the point multiplication k · P according to (2.9) with the binary expansion of the key. In this chapter, we alternatively consider the τ-adic expansion of the integer k with a non-binary basis τ. We call the scalar multiplication with a point on the elliptic curve a complex point multiplication if the basis τ is a complex number such as a Gaussian, Eisenstein, or Kleinian integer [33, 88]. Non-binary expansions were proposed to speed up the point multiplication [30–32, 34–36]. Moreover, it was demonstrated in [22, 27, 29, 32, 34, 36, 75] that non-binary expansions are beneficial to harden ECC implementations against SCA [14, 25–28]. Implementations of the elliptic curve point multiplication are prone to SCA, and different attacks on the point multiplication exist in the literature, such as TA, SPA, DPA, RPA, and ZPA [14, 25–28].

Many publications on non-binary expansions consider implementations for extension fields [88–90]. In this scenario, the point multiplication can be calculated efficiently because the point doubling is replaced by a simple Frobenius mapping (Frobenius endomorphism) τ · P. For example, over binary extension fields, this mapping results for any point P(x, y) in the point (x^2, y^2), which requires only squaring of the field elements x and y. Note that in [89, 90] arithmetic over binary extension fields is considered, where the squaring operation is very efficient [91, 92]. In [88], a multiple-base expansion and an arithmetic over Kleinian integers are used. The approach in [89, 90] is based on a ternary instead of a binary expansion. The corresponding point multiplication has a run-time of r/3 point additions and r − 1 Frobenius mappings τ · P. This point multiplication with the ternary basis is much faster than with a binary basis, because the point doubling is replaced with a simple Frobenius mapping.
In this chapter, we consider an endomorphism for prime fields [14], where the prime field is represented by Gaussian integers. Gaussian integers are a subset of complex numbers with integers as real and imaginary parts. Many finite rings over Gaussian integers are isomorphic to rings over ordinary integers of the same order. Due to this isomorphism, arithmetic over Gaussian integers is suitable for many cryptographic systems such as ECC and RSA systems. Hardware implementations of ECC systems can benefit from arithmetic over Gaussian integers and non-binary expansion. In particular, we show that calculating the complex point multiplication results in a significant reduction in complexity if both the key and τ are Gaussian integers. Furthermore, we present a new algorithm for the τ-adic expansion of the key k. The resulting expansion increases the robustness against SPA and timing attacks. We demonstrate that applying this algorithm for Gaussian integers can reduce the required memory size and the computational complexity for a protected point multiplication. Furthermore, we show that preventing DPA, RPA, and ZPA can be achieved by a simple modification of the complex point multiplication with the RIP method proposed in [31].

In Section 3.1, we briefly review Gaussian integers and their suitability for cryptographic systems. The complex point multiplication over Gaussian integers is presented in Section 3.2. The resistance against side channel attacks is investigated in Section 3.3. Finally, we discuss results of this chapter in Section 3.4.

© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2021. M. Safieh, Algorithms and Architectures for Cryptography and Source Coding in Non-Volatile Flash Memories, Schriftenreihe der Institute für Systemdynamik (ISD) und optische Systeme (IOS), https://doi.org/10.1007/978-3-658-34459-7_3

3.1 Gaussian Integer Rings and Fields

In this section, we briefly review Gaussian integer fields and rings. Furthermore, we show that Gaussian integer fields are suitable for ECC applications, while Gaussian integer rings are applicable for RSA systems. We also present an algorithm and some examples for Gaussian integer fields that are suitable for ECC applications.

The set of Gaussian integers is typically denoted by $\mathbb{Z}[i]$. The modulo function of a Gaussian integer $x = c + di$ with $i = \sqrt{-1}$ and $c, d \in \mathbb{Z}$ is defined as

$$x \bmod \pi = x - \left[\frac{x\pi^*}{\pi\pi^*}\right] \cdot \pi, \tag{3.1}$$

where $\pi^*$ is the conjugate of the Gaussian integer $\pi$ and $[\cdot]$ denotes rounding to the closest Gaussian integer [18]. Accordingly, for a complex number $x = c + di$, we have $[x] = [c] + [d]i$. For primes p of the form $p \equiv 1 \bmod 4$, the set

$$G_p = \{x \bmod \pi : x = 0, \ldots, p-1,\ x \in \mathbb{Z}\} \tag{3.2}$$

is a finite field isomorphic to the prime field GF(p) over ordinary integers [18]. Hence, these sets are suitable for ECC applications. In this case, p is the sum of two perfect squares, i.e. $\pi = a + bi$ with integers a, b and $p = \pi\pi^* = |a|^2 + |b|^2$.
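The modulo function (3.1) and the set (3.2) can be sketched in a few lines. The following is a minimal sketch, assuming Gaussian integers are stored as `(re, im)` tuples of Python integers; the helper name `gauss_mod` is hypothetical. The rounding is done in exact integer arithmetic, which is unambiguous as long as ππ* is odd (no ties can occur).

```python
def gauss_mod(x, pi):
    """Return x mod pi according to (3.1); x and pi are (re, im) tuples."""
    (c, d), (a, b) = x, pi
    n = a * a + b * b                       # pi * conj(pi)
    # x * conj(pi) = (c*a + d*b) + (d*a - c*b) i
    re, im = c * a + d * b, d * a - c * b
    # Round both components of x*conj(pi)/n to the nearest integer.
    qr, qi = (2 * re + n) // (2 * n), (2 * im + n) // (2 * n)
    # Subtract q * pi with q = qr + qi*i.
    return (c - (qr * a - qi * b), d - (qr * b + qi * a))

PI = (4, 1)                                 # pi = 4 + i, p = 17
G17 = [gauss_mod((x, 0), PI) for x in range(17)]
print(G17)
```

For π = 4 + i, the list `G17` contains 17 distinct Gaussian integers, one for each element of GF(17).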


Furthermore, for integers n = cd, where c and d are both primes of the form $c \equiv d \equiv 1 \bmod 4$ and $c \neq d$, the set $G_n$ is a ring isomorphic to the ring $\mathbb{Z}_n$ over ordinary integers [19]. Hence, Gaussian integers are also suitable for the RSA cryptographic system. In the following, we use the notation $G_p$ for Gaussian integer fields and $G_n$ for rings of Gaussian integers. The inverse mapping of a Gaussian integer $z \in G_p$ to an element $z'$ of the prime field GF(p) is defined as

$$z' = (z v \pi^* + z^* u \pi) \bmod p, \tag{3.3}$$

where the parameters u and v are calculated using the extended Euclidean algorithm [18]. Next, we consider an example to illustrate the modular arithmetic over Gaussian integer fields.

Example 3.1 Consider the set $G_{17}$ with $\pi = 4 + i$ and $p = 17$. The elements of the Gaussian integer field according to (3.2) are shown in Figure 3.1. For the ordinary integers $x' = 7$ and $y' = 9$, the corresponding Gaussian integers $x, y \in G_{17}$ are $x = x' \bmod \pi = -1 - 2i$ and $y = y' \bmod \pi = 2i$. The sum $z = (x + y) \bmod \pi = -1$ can also be calculated as $z = (x' + y') \bmod \pi = 16 \bmod \pi = -1$. Similarly, the product $z = (x \cdot y) \bmod \pi = (4 - 2i) \bmod \pi = -1 + i$ can also be expressed as $z = (x' \cdot y') \bmod \pi = 63 \bmod \pi = -1 + i$. Any Gaussian integer $z \in G_{17}$ can be mapped to an element of the prime field $z' \in GF(17)$ using the parameters $v = 2 + i$ and $u = -2$. For instance, for $z = -1 + i$ we have $z' = ((-1 + i) \cdot v\pi^* + (-1 - i) \cdot u\pi) \bmod p = 63 \bmod 17 = 12$.

The complex multiplication of two Gaussian integers $x = c + di$ and $y = e + fi$ can be efficiently calculated as

$$xy = (v_2 - v_3) + (v_1 - v_2 - v_3)i, \tag{3.4}$$

where $v_1 = (c + d)(e + f)$, $v_2 = ce$, and $v_3 = df$. Hence, the complex multiplication requires three integer multiplications. We call such an integer multiplication an atomic multiplication. While the complex multiplication over Gaussian integers does not provide a significant complexity reduction in comparison with the multiplication over ordinary integers, the squaring of a Gaussian integer can be simplified to

$$x^2 = (c + d)(c - d) + (cd + cd)i, \tag{3.5}$$

where only two atomic multiplications and three additions are required [37, 38].
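The product (3.4), the squaring (3.5), and the inverse mapping (3.3) can be illustrated with the numbers from Example 3.1. This is a minimal sketch, assuming Gaussian integers are stored as `(re, im)` tuples; the function names `gauss_mul`, `gauss_sqr`, and `to_prime_field` are illustrative and not taken from the text.

```python
def gauss_mul(x, y):
    """Complex product (3.4) using three atomic multiplications."""
    (c, d), (e, f) = x, y
    v1, v2, v3 = (c + d) * (e + f), c * e, d * f
    return (v2 - v3, v1 - v2 - v3)

def gauss_sqr(x):
    """Complex squaring (3.5) using two atomic multiplications."""
    c, d = x
    cd = c * d
    return ((c + d) * (c - d), cd + cd)

def conj(x):
    return (x[0], -x[1])

def to_prime_field(z, pi, u, v, p):
    """Inverse mapping (3.3): z' = (z*v*conj(pi) + conj(z)*u*pi) mod p."""
    t1 = gauss_mul(gauss_mul(z, v), conj(pi))
    t2 = gauss_mul(gauss_mul(conj(z), u), pi)
    re, im = t1[0] + t2[0], t1[1] + t2[1]
    assert im % p == 0      # the imaginary part vanishes modulo p
    return re % p

# Parameters of Example 3.1: pi = 4 + i, p = 17, v = 2 + i, u = -2.
z_prime = to_prime_field((-1, 1), (4, 1), (-2, 0), (2, 1), 17)
print(gauss_mul((-1, -2), (0, 2)), gauss_sqr((4, 1)), z_prime)
```

The call to `gauss_mul` reproduces the unreduced product 4 − 2i of the example, and `to_prime_field` maps −1 + i back to the integer 12.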


Figure 3.1 The set of Gaussian integers for π = 4 + i

For sufficient security in ECC systems, key lengths of more than 159 bits are typically considered, as shown in Table 2.1. Finding integers a and b with p = a^2 + b^2 for such a bit length is not trivial. However, in [18] an algorithm for primes of the form p ≡ 1 mod 4 to find a, b such that p = a^2 + b^2 was proposed. This algorithm comprises two steps and is summarized in Algorithm 3.1.

Algorithm 3.1 Finding a, b with p = a^2 + b^2 according to [18]
input: p ≡ 1 mod 4
output: a, b such that p = a^2 + b^2
1: Find x such that x^2 ≡ −1 mod p    // use the Tonelli-Shanks algorithm [93]
2: Apply the Euclidean algorithm to p and x. The first two remainders less than √p are a and b
3: return a, b
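The two steps of Algorithm 3.1 can be sketched as follows. This is a minimal sketch under one stated deviation: instead of the Tonelli-Shanks algorithm, the square root of −1 is obtained as g^((p−1)/4) for a quadratic non-residue g, which is a swapped-in shortcut that works for primes p ≡ 1 mod 4. The function names are hypothetical.

```python
from math import isqrt

def sqrt_minus_one(p):
    """Step 1: find x with x^2 = -1 mod p for a prime p = 1 mod 4.

    Assumption: uses g^((p-1)/4) for a non-residue g instead of Tonelli-Shanks.
    """
    for g in range(2, p):
        if pow(g, (p - 1) // 2, p) == p - 1:    # Euler's criterion: g is a non-residue
            return pow(g, (p - 1) // 4, p)
    raise ValueError("no quadratic non-residue found")

def two_squares(p):
    """Step 2: Euclidean algorithm on (p, x); first two remainders below sqrt(p)."""
    r0, r1 = p, sqrt_minus_one(p)
    limit = isqrt(p)
    while r1 > limit:
        r0, r1 = r1, r0 % r1
    return r1, r0 % r1

print(two_squares(17), two_squares(13))
```

For p = 17, the sketch yields a = 4 and b = 1, which corresponds to π = 4 + i from Example 3.1.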


On the other hand, there exist many primes of the form p ≡ 1 mod 4 such that p = a^2 + (a − 1)^2 or p = a^2 + 1. Exploiting this observation, we can search for a such that the sum p = a^2 + 1 or p = 2a^2 − 2a + 1 is prime. This can significantly reduce the search complexity. Table 3.1 illustrates some examples of such primes with sufficient bit lengths for ECC applications.

Table 3.1 Examples for primes of the form p = a^2 + b^2 with b = 1 or b = a − 1

bits   p                                         a                  b
169    4 · 10^50 + 216 · 10^25 + 2917            2 · 10^25 + 54     1
170    8 · 10^50 + 6 · 10^26 + 113               2 · 10^25 + 8      2 · 10^25 + 7
190    8 · 10^56 + 3 · 10^30 + 2813              2 · 10^28 + 38     2 · 10^28 + 37
256    8 · 10^76 + 2516 · 10^38 + 197821         2 · 10^38 + 315    2 · 10^38 + 314
382    9 · 10^114 + 384 · 10^57 + 4097           3 · 10^57 + 64     1

3.2 Point Multiplication over Gaussian Integers

In this section, we demonstrate that τ-adic expansions with $\tau \in \mathbb{Z}[i]$ are beneficial to reduce the computational complexity of a point multiplication, especially by expanding a Gaussian integer key, i.e. $k \in G_n$. In the context of ECC, Gaussian integers were proposed to speed up the point multiplication [14, 33, 94–96]. Such methods use an expansion of the integer k with a non-binary basis τ, i.e. a τ-adic expansion. Similar to the point multiplication with binary keys (2.9), this results in the point multiplication

$$k \cdot P = \sum_{j=0}^{l-1} \kappa_j \tau^j (P) = \tau(\ldots \tau(\tau \kappa_{l-1} \cdot P + \kappa_{l-2} \cdot P) + \ldots) + \kappa_0 \cdot P, \tag{3.6}$$

where the digits $\kappa_j$ are Gaussian, Eisenstein, or Kleinian integers depending on the basis τ [33, 88]. In the following, we introduce the expansion of an integer k with the basis τ, where k and τ are Gaussian integers.

3.2.1 Determining the τ-adic Expansions

Computing the τ-adic expansion of an integer k is summarized in Algorithm 3.2. This algorithm was previously considered in [33]. Algorithm 3.2 is the ordinary division algorithm for base conversion when k and τ are both integers. In this case, the algorithm results in an expansion with $l \approx \log_2(k)/\log_2(\tau)$ digits. However, we consider the expansion where k and τ are Gaussian integers. Hence, the modulo function is determined according to (3.1), and the number of digits can be estimated as

$$l \approx \frac{\log_2(|k|)}{\log_2(|\tau|)}. \tag{3.7}$$

To see this, we note that the quotient resulting from the division $(z - \kappa_l)/\tau$ in step 5 of Algorithm 3.2 is a Gaussian integer. Hence, the absolute value of the numerator $z - \kappa_l$ is diminished by a factor $|\tau|$ in each iteration.

Algorithm 3.2 τ-adic expansion of a Gaussian integer (or an integer) k
input: Gaussian integer k and base τ
output: τ-adic expansion k = (κ_{l−1}, ..., κ_1, κ_0)_τ
1: z = k
2: l = 0
3: while (z ≠ 0) do
4:   κ_l = z mod τ          // remainder
5:   z = (z − κ_l)/τ        // quotient
6:   l = l + 1
7: end while
8: return κ_{l−1}, ..., κ_1, κ_0
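Algorithm 3.2 translates almost line by line into code. The following is a minimal sketch, assuming Gaussian integers are `(re, im)` tuples and the norm of τ is odd so that the rounding in (3.1) has no ties; `gauss_mod`, `tau_adic`, and `evaluate` are hypothetical helper names.

```python
def gauss_mod(x, pi):
    """x mod pi according to (3.1), with exact integer rounding."""
    (c, d), (a, b) = x, pi
    n = a * a + b * b
    re, im = c * a + d * b, d * a - c * b
    qr, qi = (2 * re + n) // (2 * n), (2 * im + n) // (2 * n)
    return (c - (qr * a - qi * b), d - (qr * b + qi * a))

def tau_adic(k, tau):
    """Algorithm 3.2; returns [kappa_{l-1}, ..., kappa_1, kappa_0]."""
    a, b = tau
    n = a * a + b * b
    digits, z = [], k
    while z != (0, 0):
        kappa = gauss_mod(z, tau)                          # remainder
        dr, di = z[0] - kappa[0], z[1] - kappa[1]
        # Exact quotient (z - kappa) / tau = (z - kappa) * conj(tau) / n.
        z = ((dr * a + di * b) // n, (di * a - dr * b) // n)
        digits.append(kappa)
    return digits[::-1]

def evaluate(digits, tau):
    """Horner evaluation of sum kappa_j * tau^j, used as a cross-check."""
    acc = (0, 0)
    for kr, ki in digits:
        acc = (acc[0] * tau[0] - acc[1] * tau[1] + kr,
               acc[0] * tau[1] + acc[1] * tau[0] + ki)
    return acc

print(tau_adic((1, 1), (2, 1)))
```

For k = 1 + i and τ = 2 + i, the sketch returns the digits 1 and −1, since 1 · τ − 1 = 1 + i.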

From the estimate (3.7), it follows that it is advantageous to represent the key as a Gaussian integer. In order to show this, we bound the absolute value of any element of a Gaussian integer ring $G_n$.

Lemma 3.1 For any x in the Gaussian integer ring $G_n$ with $n = \pi\pi^*$, the absolute value $|x|$ is bounded by

$$|x| < \frac{|\pi|}{\sqrt{2}} = \sqrt{\frac{n}{2}}. \tag{3.8}$$

Proof Let c and d be the real part and the imaginary part of $x/\pi$. For $x \in G_n$ we have $x = x \bmod \pi$ and consequently

$$\left[\frac{x\pi^*}{\pi\pi^*}\right] = 0. \tag{3.9}$$

Hence, we obtain $[c] = [d] = 0$. This implies $|c| < 1/2$, $|d| < 1/2$, and

$$\left|\frac{x}{\pi}\right| < \frac{1}{\sqrt{2}}. \tag{3.10}$$

Multiplying this inequality by $|\pi|$ results in (3.8). □

From (3.7) and Lemma 3.1, it follows that the τ-adic expansion of the Gaussian integer k mod π requires only l digits with

$$l \approx \frac{\log_2(\sqrt{n/2})}{\log_2(|\tau|)} = \frac{\log_2(n) - 1}{2\log_2(|\tau|)} = \frac{\log_2(n) - 1}{\log_2(|\tau|^2)}, \tag{3.11}$$

where we consider a Gaussian integer ring $G_n$ with $n = \pi\pi^*$. Note that the size of the key space is determined by the order of the elliptic curve. For a curve of order n, we choose a Gaussian integer key from $G_n$ or a smaller ring. The size of this ring should equal the order to retain the size of the key space. Consequently, using $k \in G_n$ approximately halves the number of digits compared with an expansion of an ordinary integer $k' \in \mathbb{Z}_n$, which has a maximum expansion length $l \approx \log_2(n)/\log_2(|\tau|)$. Next, we consider the elliptic curve point multiplication using the key expansion according to Algorithm 3.2.
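The halving of the expansion length can be observed numerically. The following is a rough sketch, assuming π = a + i with a = 2 · 10^25 + 54 from the first row of Table 3.1 and τ = 2 + i; the key `k_int` is an arbitrary value chosen only for illustration, and the helper names are hypothetical.

```python
def gauss_mod(x, pi):
    """x mod pi according to (3.1), with exact integer rounding."""
    (c, d), (a, b) = x, pi
    n = a * a + b * b
    re, im = c * a + d * b, d * a - c * b
    qr, qi = (2 * re + n) // (2 * n), (2 * im + n) // (2 * n)
    return (c - (qr * a - qi * b), d - (qr * b + qi * a))

def num_digits(k, tau):
    """Number of digits l produced by Algorithm 3.2."""
    a, b = tau
    n = a * a + b * b
    z, l = k, 0
    while z != (0, 0):
        kappa = gauss_mod(z, tau)
        dr, di = z[0] - kappa[0], z[1] - kappa[1]
        z = ((dr * a + di * b) // n, (di * a - dr * b) // n)
        l += 1
    return l

A = 2 * 10**25 + 54           # a from Table 3.1 (first row), b = 1
PI = (A, 1)
N = A * A + 1                 # n = pi * conj(pi)
TAU = (2, 1)

k_int = N // 3                # hypothetical integer key, for illustration only
k_gauss = gauss_mod((k_int, 0), PI)      # the same key represented in G_n
l_int = num_digits((k_int, 0), TAU)
l_gauss = num_digits(k_gauss, TAU)
print(l_int, l_gauss)
```

The Gaussian integer representation satisfies the bound of Lemma 3.1, and its expansion is roughly half as long as the expansion of the ordinary integer.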

3.2.2 Elliptic Curve Point Multiplication for Complex Expansions

In this subsection, we present the point multiplication algorithm based on the τ-adic expansion, where τ is an arbitrary Gaussian integer. This algorithm is suitable for elliptic curves with an appropriate endomorphism mapping τ · P. In [14, 97], several endomorphisms for different curves over prime fields were presented. In order to demonstrate that a τ-adic expansion can reduce the computational complexity for Gaussian integers, we consider the curve (2.1) with β = 0 and p ≡ 1 mod 4 [14, 97]. For the products $\kappa_j \cdot P(x, y)$, we use the endomorphism $iP(x, y) = P(-x, iy)$. Note that $P(-x, iy)$ is a point on the curve (2.1) if $P(x, y)$ is a valid point. The negation of any point $-P(x, y)$ is $P(x, -y)$ according to (2.13). Hence, for the products $\kappa_j \cdot P(x, y)$ with $\kappa_j \in \{\pm 1, \pm i\}$, we have

$$P = P(x, y), \tag{3.12}$$

$$-P = P(x, -y), \tag{3.13}$$

$$iP = P(-x, iy), \tag{3.14}$$

$$-iP = P(-x, -iy). \tag{3.15}$$

With this endomorphism, all possible $\kappa_j \in \mathbb{Z}[i]$ can be constructed using point additions and doublings. For instance, let $\kappa_j = 2 + i$; then the product $\kappa_j \cdot P(x, y) = (2 + i)P = 2P + iP$ can be calculated by doubling the point P and then adding the point iP. This requires one point doubling and one addition. A similar calculation holds for the mapping τ · P, where $\tau \in \mathbb{Z}[i]$. Based on this fact, we summarize the point multiplication with τ-adic expansion and $\tau \in \mathbb{Z}[i]$ in Algorithm 3.3. This algorithm is derived from Algorithm 2.5 for binary point multiplication.

Algorithm 3.3 Complex point multiplication with τ-adic expansion, τ ∈ Z[i]
input: P, k = (κ_{l−1}, ..., κ_1, κ_0)_τ
output: k · P
1: Q = κ_{l−1} · P            // use DBL and ADD operations since κ_{l−1} ∈ Z[i]
2: for (j = l − 2 down to 0) do
3:   Q = τ · Q                // use DBL and ADD operations since τ ∈ Z[i]
4:   if κ_j ≠ 0 then
5:     Q = Q + κ_j · P        // ADD Q and κ_j · P
6:   end if
7: end for
8: return Q

According to (3.1) and (3.3), any key can be uniquely represented as an ordinary or Gaussian integer. We illustrate the point multiplication according to Algorithm 3.3 using a Gaussian integer key in the following example:

Example 3.2 We consider the curve y^2 = x^3 + 3x over the field $G_{17}$ from Example 3.1. Using the generator point P(1, 2), we obtain a subgroup of order 13, which comprises the neutral element O and the following points: (1, 2), (i, 1 + i), (−i, −1 + i), (−2i, −1 − i), (−1, −2i), (2i, 1 − i), (2i, −1 + i), (−1, 2i), (−2i, 1 + i), (−i, 1 − i), (i, −1 − i), (1, −2). Due to the order of the subgroup, we choose a key from $G_{13}$, i.e. k = 1 + i. Using Algorithm 3.2 with τ = 2 + i, we obtain the expansion $(\kappa_1, \kappa_0)_\tau = (1, -1)_\tau$. With the starting point P(1, 2), the point multiplication from Algorithm 3.3 results in the point Q(−2i, 1 + i).

Each iteration of Algorithm 3.3 is more complex than a single iteration of the binary point multiplication from Algorithm 2.5. However, the τ-adic expansion reduces the number of iterations required to calculate the point multiplication, as demonstrated by the following example:

Example 3.3 Let r be the number of bits for the binary key representation and l the number of digits $\kappa_j$, and consider τ = 2 + i. We have $l \approx r/\log_2(5) \approx 0.43r$ due to (3.11). Algorithm 2.5 requires a maximum of r − 1 iterations with one point doubling and one point addition in each iteration. Similarly, the point multiplication based on the τ-adic expansion in Algorithm 3.3 requires a maximum of l − 1 iterations, with one point doubling and two point additions in each iteration. This results in l − 1 ≈ 0.43r point doublings and at most 2(l − 1) ≈ 0.86r additions. Hence, the number of both operations is reduced compared with the binary case. Note that this consideration also holds for the Montgomery ladder point multiplication in Algorithm 2.7.

Next, we discuss the side channel attack resistance of the point multiplication using Gaussian integers.
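Example 3.2 can be replayed in code. The following is a minimal sketch, assuming the field G_17 is represented by ordinary integers modulo 17, where the Gaussian integer i corresponds to 13 (since 13 ≡ i mod (4 + i) and 13^2 ≡ −1 mod 17). For brevity, the sketch uses affine point arithmetic rather than Jacobian coordinates, and all function names are hypothetical.

```python
P_MOD, ALPHA = 17, 3      # curve y^2 = x^3 + 3x over GF(17), isomorphic to G_17
I_MOD = 13                # integer image of i in GF(17)
O = None                  # neutral element

def add(A, B):
    """Affine point addition; also handles doubling and inverse points."""
    if A is O:
        return B
    if B is O:
        return A
    (x1, y1), (x2, y2) = A, B
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O
    if A == B:
        lam = (3 * x1 * x1 + ALPHA) * pow(2 * y1 % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def neg(A):
    return O if A is O else (A[0], -A[1] % P_MOD)

def endo_i(A):
    """Endomorphism (3.14): iP = P(-x, iy)."""
    return O if A is O else (-A[0] % P_MOD, I_MOD * A[1] % P_MOD)

def int_mul(m, A):
    """m * A for an ordinary integer m (double-and-add)."""
    if m < 0:
        return int_mul(-m, neg(A))
    Q = O
    while m:
        if m & 1:
            Q = add(Q, A)
        A = add(A, A)
        m >>= 1
    return Q

def gauss_mul_point(g, A):
    """(a + bi) * A = a*A + b*(iA) for a Gaussian integer g = (a, b)."""
    return add(int_mul(g[0], A), int_mul(g[1], endo_i(A)))

def complex_point_mul(digits, tau, A):
    """Algorithm 3.3 with digits = [kappa_{l-1}, ..., kappa_0]."""
    Q = gauss_mul_point(digits[0], A)
    for kappa in digits[1:]:
        Q = gauss_mul_point(tau, Q)
        if kappa != (0, 0):
            Q = add(Q, gauss_mul_point(kappa, A))
    return Q

# k = 1 + i = (1, -1)_tau with tau = 2 + i, starting point P(1, 2)
Q = complex_point_mul([(1, 0), (-1, 0)], (2, 1), (1, 2))
print(Q)
```

The result (8, 14) is the integer representation of the point Q(−2i, 1 + i) from Example 3.2, because −2i ≡ 8 and 1 + i ≡ 14 modulo 17. On this subgroup, the key 1 + i acts like the ordinary integer 9.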

3.3 Resistance Against side Channel Attacks using Gaussian Integers

In this section, we present a new algorithm for the τ -adic expansion of the key k that is suitable for ordinary as well as Gaussian integers. The resulting expansion from this algorithm prevents SPA and timing attacks on calculating the point multiplication. Furthermore, we show that the robustness against DPA, RPA, as well as ZPA can be improved for this τ -adic expansion using the randomized initial point approach proposed in [31]. Finally, we demonstrate that Gaussian integers can reduce the computational complexity and memory requirements for calculations protected against SCA, by comparing the proposed concept with existing non-binary expansion algorithms from the literature [31].

3.3.1 Improved τ-adic Expansion Algorithm

Consider the binary point multiplication in Algorithm 2.5. The special point addition is only calculated if the current bit of the binary key is non-zero. Moreover, the point doublings and additions employ different numbers of arithmetic operations. Hence, the power consumption and the execution time of an iteration depend on the current bit of the secret key. Similar considerations hold for the point multiplication with Algorithm 2.6. Consequently, an attacker can estimate the current bit of the secret key by observing the power consumption over time. To avoid SPA and timing attacks, many publications balance the number of applied point operations by considering additional dummy additions [27, 29, 32, 34, 75].

Similar to the binary point multiplication, the conditional special point addition in Algorithm 3.3 is computed if $\kappa_j \neq 0$. However, using the τ-adic expansion, the probability of $\kappa_j = 0$ is reduced compared with a binary key representation, i.e. zero digits occur with probability $P(\kappa_j = 0) = 1/|\tau|^2$, which is much smaller than $P(k_j = 0) = 1/2$ for binary keys. Moreover, the calculation of $\kappa_j \cdot P$ is a point multiplication, where the number of point doublings and special additions executed depends on the actual value of $\kappa_j$. Hence, an attacker can infer $\kappa_j$ by estimating the number of point doublings and special additions calculated from the power consumption of the device. Improved robustness against SPA can be achieved by precomputing all possible products $\kappa_j \cdot P$ and storing these values in a memory [29, 31, 32, 34]. Hence, when processing the point multiplication according to Algorithm 3.3, the corresponding product $\kappa_j \cdot P$ is read from the memory. This results in fixed calculation times for the special point addition with any product $\kappa_j \cdot P$. Nonetheless, this concept is not fully protected against SPA and timing attacks. An attacker can still detect all digits of the key expansion with $\kappa_j = 0$, where the conditional special point addition is skipped.

In order to protect the point multiplication in Algorithm 3.3 against SPA and timing attacks, we modify the generation of the τ-adic expansion of the key k, as shown in Algorithm 3.4. This algorithm replaces all values $\kappa_i = 0$ by $\kappa_i = \tau$ in the expansion, where the base τ can be an ordinary or Gaussian integer. Hence, it determines a new base conversion of the key excluding all zero elements. Consequently, the special point addition in step 5 of Algorithm 3.3 is calculated in each iteration independent of the value of $\kappa_j$ and without the need for dummy additions. This results in a constant computation time for a point multiplication and increases the robustness against SPA and timing attacks. In the following, we prove that the key expansions according to Algorithms 3.2 and 3.4 result in the same point multiplication.

Algorithm 3.4 SPA resistant τ-adic expansion of a Gaussian integer (or an ordinary integer) k
input: Gaussian integer k and base τ
output: τ-adic expansion k = (κ_{l−1}, ..., κ_1, κ_0)_τ
1: z = k
2: l = 0
3: while (z ≠ 0) do
4:   κ_l = z mod τ          // remainder
5:   if (κ_l = 0) then
6:     κ_l = τ
7:   end if
8:   z = (z − κ_l)/τ        // quotient
9:   l = l + 1
10: end while
11: return κ_{l−1}, ..., κ_1, κ_0
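The zero-free expansion of Algorithm 3.4 differs from Algorithm 3.2 only in the substitution of zero digits. The following is a minimal sketch, assuming `(re, im)` tuples for Gaussian integers and a base τ with odd norm; the function names are hypothetical.

```python
def gauss_mod(x, pi):
    """x mod pi according to (3.1), with exact integer rounding."""
    (c, d), (a, b) = x, pi
    n = a * a + b * b
    re, im = c * a + d * b, d * a - c * b
    qr, qi = (2 * re + n) // (2 * n), (2 * im + n) // (2 * n)
    return (c - (qr * a - qi * b), d - (qr * b + qi * a))

def spa_resistant_tau_adic(k, tau):
    """Algorithm 3.4; returns [kappa_{l-1}, ..., kappa_0] without zero digits."""
    a, b = tau
    n = a * a + b * b
    digits, z = [], k
    while z != (0, 0):
        kappa = gauss_mod(z, tau)            # remainder
        if kappa == (0, 0):
            kappa = tau                      # replace the zero digit by tau
        dr, di = z[0] - kappa[0], z[1] - kappa[1]
        z = ((dr * a + di * b) // n, (di * a - dr * b) // n)   # exact quotient
        digits.append(kappa)
    return digits[::-1]

def evaluate(digits, tau):
    """Horner evaluation of sum kappa_j * tau^j, used as a cross-check."""
    acc = (0, 0)
    for kr, ki in digits:
        acc = (acc[0] * tau[0] - acc[1] * tau[1] + kr,
               acc[0] * tau[1] + acc[1] * tau[0] + ki)
    return acc

# k = tau^2 = 3 + 4i has the ordinary expansion (1, 0, 0)_tau for tau = 2 + i.
print(spa_resistant_tau_adic((3, 4), (2, 1)))
```

For k = τ^2 with τ = 2 + i, the sketch returns the digits (1, −1, τ), which indeed evaluate to τ^2 − τ + τ = τ^2 without containing a zero.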

Proposition 3.1 Let $\kappa_{l-1}, \ldots, \kappa_1, \kappa_0$ be the expansion of the Gaussian integer k according to Algorithm 3.2 and $\tilde{\kappa}_{\tilde{l}-1}, \ldots, \tilde{\kappa}_1, \tilde{\kappa}_0$ the corresponding expansion according to Algorithm 3.4. These expansions are equivalent representations of the Gaussian integer k, i.e.

$$k = \sum_{j=0}^{l-1} \kappa_j \tau^j = \sum_{j=0}^{\tilde{l}-1} \tilde{\kappa}_j \tau^j. \tag{3.16}$$

Proof First, we note that the two expansions are identical if $\kappa_{l-1}, \ldots, \kappa_1, \kappa_0$ contains no zero element. Next, we consider a zero element $\kappa_j = 0$ and assume that $\kappa_{j+1} \neq 1$. We have $\tilde{\kappa}_j = \tau$ and $\tilde{\kappa}_{j+1} = \kappa_{j+1} - 1 \neq 0$. The corresponding terms in the sums in (3.16) satisfy

$$\kappa_j \tau^j + \kappa_{j+1} \tau^{j+1} = \tilde{\kappa}_j \tau^j + \tilde{\kappa}_{j+1} \tau^{j+1}, \tag{3.17}$$

because $\kappa_j \tau^j + \kappa_{j+1} \tau^{j+1} = \kappa_{j+1} \tau^{j+1}$ and $\tilde{\kappa}_j \tau^j + \tilde{\kappa}_{j+1} \tau^{j+1} = \tau \cdot \tau^j + (\kappa_{j+1} - 1)\tau^{j+1} = \kappa_{j+1} \tau^{j+1}$. Hence, the substitution $\tilde{\kappa}_j = \tau$ does not alter the results of the sums in (3.16). For $\kappa_j = 0$ and $\kappa_{j+1} = 1$, we have $\tilde{\kappa}_j = \tau$ and $\tilde{\kappa}_{j+1} = 0$. The condition $\tilde{\kappa}_{j+1} = 0$ may result in the termination of the algorithm if z = 0. In this case, we have equality in the sums again, because $\kappa_j \tau^j + \kappa_{j+1} \tau^{j+1} = \tau^{j+1}$ and $\tilde{\kappa}_j \tau^j + \tilde{\kappa}_{j+1} \tau^{j+1} = \tau \cdot \tau^j = \tau^{j+1}$. If the algorithm does not terminate, we have $\tilde{\kappa}_j = \tilde{\kappa}_{j+1} = \tau$ and $\tilde{\kappa}_{j+2} = \kappa_{j+2} - 1$ for $\kappa_{j+2} \neq 1$. Consequently, we obtain

$$\kappa_j \tau^j + \kappa_{j+1} \tau^{j+1} + \kappa_{j+2} \tau^{j+2} = \tau^{j+1} + \kappa_{j+2} \tau^{j+2} \tag{3.18}$$

and

$$\tilde{\kappa}_j \tau^j + \tilde{\kappa}_{j+1} \tau^{j+1} + \tilde{\kappa}_{j+2} \tau^{j+2} = \tau \cdot \tau^j + \tau \cdot \tau^{j+1} + (\kappa_{j+2} - 1)\tau^{j+2} = \tau^{j+1} + \kappa_{j+2} \tau^{j+2}. \tag{3.19}$$

Similarly, the proof for sequences $\kappa_j = 0$ and $\kappa_{j+1} = \kappa_{j+2} = \ldots = 1$ follows by induction. □

The number of iterations l for the point multiplication in Algorithm 3.3 is similar for the two expansions. Next, we demonstrate that the robustness against DPA, RPA, and ZPA can be improved for the τ-adic expansion according to Algorithm 3.4 using the randomized initial point method proposed in [31].

DPA attacks are more sophisticated and were introduced in [26]. These attacks are based on collecting several power traces and analyzing them using a statistical tool to extract the secret key [26, 31, 34]. The RPA is a variant of the DPA [31] that was presented in [27]. The idea of RPA is to utilize special points on the corresponding curve that cannot be easily randomized to find correlations between the power consumption and the data processed in the point multiplication [31]. A special variant of the RPA is the ZPA proposed in [28], where points with zero coordinates are exploited to infer data processed in the point multiplication and extract the secret key.

Consider the τ-adic expansion according to Algorithm 3.4. This expansion is not sufficient to prevent DPA, RPA, and ZPA, i.e. an attacker can still exploit some correlations by collecting power traces of the $\kappa_j \cdot P$ values read from the memory. However, in [31] a modification of the point multiplication algorithm was presented that involves the randomized initial point method. This algorithm is suitable for the expansion according to Algorithm 3.4 and it increases the resistance against DPA, RPA, and ZPA. The RIP method is based on introducing a random point R into the calculation of the point multiplication. Consequently, all intermediate results in the calculation of the point multiplication become randomized.

The modified point multiplication with the RIP method from [31] is summarized in Algorithm 3.5, where the corresponding precomputed points $S_{l-1}, \ldots, S_1, S_0$ are determined in steps 1 to 4. The point addition in step 5 results in the point $Q = S_{l-1} + \tau \cdot R = \kappa_{l-1} \cdot P + R$. This point is multiplied with τ and then added to the corresponding precomputation $S_j$ in the loop in steps 6 to 9 of this algorithm. Due to the second term $-(\tau - 1) \cdot R$ of the precomputations, the intermediate result after each iteration of this loop is again offset by exactly R. Correspondingly, the final value of Q is $k \cdot P + R$. Hence, we obtain the correct result of the point multiplication $k \cdot P$ by subtracting R from Q in step 10. The special point addition in step 8 is computed in each iteration, since the τ-adic expansion of the key according to Algorithm 3.4 excludes all zero elements. This prevents SPA and timing attacks. Furthermore, all stored points $S_{l-1}, \ldots, S_1, S_0$ are randomized due to the added random term $-(\tau - 1) \cdot R$. Thus, the resistance against DPA, RPA, and ZPA is increased.

3.3.2 Comparison with Existing Non-binary Expansions

In this subsection, we discuss the computational complexity of the point multiplication that is protected against SPA and timing attacks with a τ-adic key expansion according to Algorithm 3.4. Moreover, we compare the computational complexity and the memory requirements for the point multiplication in Algorithm 3.3 with results from [31] that aim to prevent SPA and timing attacks on calculating the point multiplication for Weierstrass curves over GF(p). Finally, we demonstrate that Gaussian integers can reduce the computational complexity for the point multiplication in Algorithm 3.5, which increases the resistance against DPA, RPA, and ZPA.

Algorithm 3.5 Protected complex point multiplication against SCA using the RIP method according to [31] with τ ∈ Z[i]
input: P, k = (κ_{l−1}, ..., κ_1, κ_0)_τ
output: k · P
1: R = random point                      // randomized initial point
2: for (j = l − 1 down to 0) do
3:   S_j = κ_j · P − (τ − 1) · R         // store all precomputations in memory
4: end for
5: Q = S_{l−1} + τ · R                   // ADD κ_{l−1} · P − (τ − 1) · R and τ · R
6: for (j = l − 2 down to 0) do
7:   Q = τ · Q                           // use DBL and ADD operations since τ ∈ Z[i]
8:   Q = Q + S_j                         // ADD Q and κ_j · P − (τ − 1) · R that is read from the memory
9: end for
10: return Q − R
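The randomization in Algorithm 3.5 can be sketched on the toy curve from Example 3.2. This is a minimal sketch, assuming affine arithmetic over GF(17) with i represented by 13 and a random point R drawn from the subgroup generated by P; these choices and all function names are illustrative assumptions and say nothing about the hardware implementation.

```python
import random

P_MOD, ALPHA, I_MOD = 17, 3, 13   # toy curve y^2 = x^3 + 3x over GF(17); 13 represents i
O = None                          # neutral element

def add(A, B):
    """Affine point addition; also handles doubling and inverse points."""
    if A is O:
        return B
    if B is O:
        return A
    (x1, y1), (x2, y2) = A, B
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return O
    if A == B:
        lam = (3 * x1 * x1 + ALPHA) * pow(2 * y1 % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def neg(A):
    return O if A is O else (A[0], -A[1] % P_MOD)

def endo_i(A):
    return O if A is O else (-A[0] % P_MOD, I_MOD * A[1] % P_MOD)

def int_mul(m, A):
    if m < 0:
        return int_mul(-m, neg(A))
    Q = O
    while m:
        if m & 1:
            Q = add(Q, A)
        A = add(A, A)
        m >>= 1
    return Q

def gmul(g, A):
    """(a + bi) * A = a*A + b*(iA)."""
    return add(int_mul(g[0], A), int_mul(g[1], endo_i(A)))

def rip_point_mul(digits, tau, P, R):
    """Algorithm 3.5 with digits = [kappa_{l-1}, ..., kappa_0] (no zero digits)."""
    mask = neg(gmul((tau[0] - 1, tau[1]), R))              # -(tau - 1) * R
    S = [add(gmul(kappa, P), mask) for kappa in digits]    # steps 2-4, S[0] = S_{l-1}
    Q = add(S[0], gmul(tau, R))                            # step 5
    for S_j in S[1:]:                                      # steps 6-9
        Q = add(gmul(tau, Q), S_j)
    return add(Q, neg(R))                                  # step 10

P = (1, 2)
R = int_mul(random.randrange(1, 13), P)                    # random point in the subgroup
result = rip_point_mul([(1, 0), (-1, 0)], (2, 1), P, R)
print(result)
```

Independent of the choice of R, the sketch returns the same point as the unprotected multiplication in Example 3.2, while every stored point S_j and every intermediate value of Q depends on R.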

First, we consider the arithmetic operations required to derive an approximation for the computational complexity of Algorithm 3.3. As previously mentioned, we utilize Jacobian projective coordinates to reduce the number of field inversions [14, 78]. Similar to the point multiplication in Algorithm 2.5, the point P(x, y) of the second argument $\kappa_j \cdot P$ in step 5 of Algorithm 3.3 does not change throughout the point multiplication algorithm. Hence, this point can be represented in affine coordinates with Z = 1. For $\kappa_j = 1$, we obtain P = P(X, Y, 1) due to the endomorphism (3.12). Similarly, for $\kappa_j = i$, iP = P(−X, iY, 1) holds due to (3.14), and so on. This enables the special point addition in Jacobian coordinates (2.29) with reduced complexity. Using the Jacobian coordinates, all point operations can be calculated without inversion. Hence, the complex multiplication and squaring dominate the computational complexity for Gaussian integers. Both complex operations can be calculated according to (3.4) and (3.5), respectively. We summarize the number of complex multiplications and squarings executed for the point doubling, special point addition, and (normal) point addition according to (2.25), (2.29), and (2.21) in Table 3.2. Furthermore, we assume that the cost of a complex squaring according to (3.5) is about two-thirds of a complex multiplication. Hence, we can approximate the computational complexity in terms of multiplication equivalent operations M similar to [29, 31, 32, 34]. This number is provided in the last column of Table 3.2.

Table 3.2 Number of multiplications and squarings using Jacobian coordinates according to [14, 78]

point operation   squarings   multiplications   M
DBL               6           4                 8
SADD              3           8                 10
ADD               4           12                14.67
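The last column of Table 3.2 follows directly from the stated cost ratio. The following is a small sketch, assuming a squaring costs exactly 2/3 of a multiplication; the dictionary `OPS` and the function `mult_equivalents` are hypothetical names.

```python
from fractions import Fraction

SQUARING_COST = Fraction(2, 3)      # assumed cost of one squaring in multiplications

# (squarings, multiplications) per point operation, taken from Table 3.2
OPS = {"DBL": (6, 4), "SADD": (3, 8), "ADD": (4, 12)}

def mult_equivalents(op):
    """Multiplication equivalent operations M of a point operation."""
    squarings, multiplications = OPS[op]
    return multiplications + SQUARING_COST * squarings

for name in OPS:
    print(name, round(float(mult_equivalents(name)), 2))
```

The same model can be used to weight any sequence of DBL, SADD, and ADD operations by summing the individual values.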

Next, we demonstrate that using Gaussian integers reduces the computational complexity and memory requirements for the precomputations, i.e. for all possible products κ j · P resulting from Algorithm 3.4. First, we show that Gaussian integer rings Gn are symmetric. Lemma 3.2 For any x ∈ Gn with n = ππ ∗ , we have the following symmetry − x, ix, −ix ∈ Gn ,

(3.20)

Proof Substituting x with −x, i x, or −i x in (3.9) results in the same value. Hence, (3.20) holds.  Exploiting this symmetry, we can restrict the precomputations to elements from the first quadrant. Since elements that lie on an axis do not belong to any quadrant,

3.3 Resistance Against side Channel Attacks using Gaussian Integers

39

we additionally consider all elements on the positive real axis. Now, utilizing these elements and the endomorphisms (3.12) to (3.15), we obtain all remaining products $\kappa_j \cdot P$, as illustrated by the following example:

Example 3.4 We consider expansions with $\tau = 4 + i$. The set of all possible digits $\kappa_j$ is depicted in Figure 3.1. The corresponding precomputations can be obtained as follows. First, the product $2P$ is calculated by doubling the point $P$. Next, we calculate the products $\kappa_j \cdot P$ for the remaining digits $\kappa_j$ in the first quadrant, i.e. $(1 + i)P$ and $(1 + 2i)P$, where we use two special point additions and the endomorphism (3.14). The corresponding digits are depicted with filled points in Figure 3.1. For the Gaussian integer $\kappa_j = \tau$, we have to calculate $\tau \cdot P = (4 + i)P$, which additionally requires a point doubling and a special point addition. Due to the symmetry of Gaussian integers according to Lemma 3.2, we can use (3.12) to (3.15) and the elements from the first quadrant to determine all products $\kappa_j \cdot P$ in the remaining quadrants without additional point operations. Hence, the precomputations require a total of two point doublings and three special point additions.

Similarly, the last column in Table 3.3 indicates the number of point operations required for the precomputations with other τ values.

Table 3.3 Number of point operations required for the precomputations and an iteration of the point multiplication (PM) with different τ

|τ|²    τ         iteration of PM     precomputations
5       2 + i     DBL+SADD+ADD        DBL+SADD
17      4 + i     2DBL+SADD+ADD       2DBL+3SADD
29      5 + 2i    2DBL+SADD+2ADD      2DBL+8SADD
113     8 + 7i    3DBL+SADD+2ADD      3DBL+26SADD
257     16 + i    4DBL+SADD+ADD       4DBL+60SADD
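The digit set of Example 3.4 can be checked numerically. The following Python sketch is an illustration only and not part of the original text. It assumes that (3.1) is the rounding-based reduction x mod π = x − [xπ*/(ππ*)]π and that the digit set can be enumerated as {z mod τ : z = 0, …, |τ|² − 1}. The helper names `gauss_mod` and `ring_elements` are hypothetical.

```python
def gauss_mod(x, pi):
    """Reduce x = (re, im) modulo pi = (a, b).

    Assumption: x mod pi = x - [x * conj(pi) / (pi * conj(pi))] * pi,
    where [.] rounds the real and imaginary parts to the nearest integer.
    """
    a, b = pi
    n = a * a + b * b
    re = x[0] * a + x[1] * b          # real part of x * conj(pi)
    im = x[1] * a - x[0] * b          # imaginary part of x * conj(pi)
    qr = (2 * re + n) // (2 * n)      # nearest integer to re / n
    qi = (2 * im + n) // (2 * n)      # nearest integer to im / n
    return (x[0] - (qr * a - qi * b), x[1] - (qr * b + qi * a))


def ring_elements(pi):
    """Enumerate the residues z mod pi for z = 0, ..., n - 1 (assumed ring G_n)."""
    a, b = pi
    return {gauss_mod((z, 0), pi) for z in range(a * a + b * b)}


tau = (4, 1)
digits = ring_elements(tau)

# Digits in the first quadrant, including the positive real axis.
first_quadrant = sorted(x for x in digits if x[0] > 0 and x[1] >= 0)

# Symmetry of Lemma 3.2: -x, ix and -ix are ring elements for every x.
symmetric = all(
    {(-re, -im), (-im, re), (im, -re)} <= digits for re, im in digits
)

print(first_quadrant)
print(symmetric)
```

Under these assumptions, the four first-quadrant digits correspond to P, 2P, (1 + i)P, and (1 + 2i)P from Example 3.4.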

The number of stored points depends on the sign representation used in the hardware implementation. Using the two's complement, we have to store $|\tau|^2$ points if additional operations for negations are to be avoided in order to prevent SPA attacks. Using a sign-magnitude binary format, it is sufficient to store all values from the first quadrant including all elements on the positive real axis, which are represented with filled points in Figure 3.1, and the value τ. The different signs for real and imaginary parts resulting from (3.12) to (3.15) are stored as separate parts of the $\kappa_j$ values. This reduces the number of stored precomputed points to $(|\tau|^2 - 1)/4 + 1$ including the value $\kappa_j = \tau$.

Now, we consider the point operations required for each iteration of the point multiplication in Algorithm 3.3. The mapping $\tau \cdot Q$ is calculated with point doublings and additions, while step 5 of Algorithm 3.3 is determined with a special point addition. Similar to Example 3.3, the mapping $\tau \cdot Q = (4 + i) \cdot Q$ is calculated by doubling the point $Q$ twice, and then adding the point $iQ$. These are followed by a special point addition with the corresponding point $\kappa_j \cdot P$ read from the memory. Note that the total number of iterations per point multiplication for Gaussian integer keys is $l \approx r / \log_2(|\tau|^2)$ due to (3.11). We summarize the number of point operations per iteration of Algorithm 3.3 for different τ values in the third column of Table 3.3.

Next, we illustrate results for the complex point multiplication that is protected against SPA and timing attacks with different Gaussian integer bases τ in Table 3.4. This table provides the number of stored points, the number of iterations $l$ per point multiplication, the number of multiplications M per point multiplication without precomputations, and the total number of multiplications M per point multiplication including precomputations. We provide the number of stored precomputed points with sign-magnitude representation. Furthermore, we include results from [31] in the lower part of this table for comparison. The concept proposed in [31] aims to reduce the memory requirements for precomputed points in comparison with the fixed-base method. For example, compare

Table 3.4 Performance results for different τ-adic expansions in comparison with [31] for a binary key length r = 163

reference           |τ|² or 2^w   stored points   l          M for PM   M for PM and precomputations
proposed            5             2               0.430r     2293       2311
proposed            17            5               0.245r     1622       1678
proposed            29            8               0.206r     1857       1953
proposed in [31]    4             2               0.506r     -          3026
fixed-base [31]     4             3               0.506r     -          3010
proposed in [31]    16            8               0.2515r    -          2726
fixed-base [31]     16            15              0.2515r    -          2710
proposed in [31]    32            16              0.203r     -          2796
fixed-base [31]     32            31              0.203r     -          2780


$\tau = 5 + 2i$ with $|\tau|^2 = 29$ digits and the base $w = 5$ with $2^w = 32$ digits from [31]. The concept from [31] reduces the number of stored points to half the number of digits. Using a sign-magnitude binary format, the new Gaussian integer expansion only requires memory for eight precomputed points, which halves the memory size compared with the result in [31]. Note that both expansions in [31] require $w = 5$ bits to represent a digit of the expansion. For the Gaussian integer expansion, three bits are required to address the stored magnitudes in the memory, while the remaining two bits represent the signs of the real and imaginary parts. The point multiplications have similar numbers of iterations. However, the proposed algorithm requires about 30% fewer multiplications.

Protecting the calculation of the point multiplication against DPA, RPA, and ZPA using the RIP method increases the number of stored precomputed points to $l$, as shown in the for-loop in steps 2 to 4 of Algorithm 3.5. Furthermore, two additional point additions and a doubling on the point $R$ are required for calculating the point multiplication. This results in a slightly higher computational complexity and increases the number of multiplications M in comparison with Algorithm 3.3. However, this is only a small overhead regarding the total number of multiplications executed for a point multiplication. Moreover, using a Gaussian integer key yields a shorter τ-adic expansion and reduces the complexity of a PM, i.e. it results in a smaller number of iterations $l \approx r / \log_2(|\tau|^2)$ for the point multiplication. Hence, Algorithm 3.5 reduces the complexity compared with the point multiplication based on the RIP method from [31].
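The memory and iteration figures quoted for the proposed expansions follow directly from the two expressions in the text, (|τ|² − 1)/4 + 1 stored points and l ≈ r / log₂(|τ|²) iterations. The short Python sketch below only evaluates these two expressions; the function names are hypothetical and the multiplication counts of Table 3.4 are not reproduced.

```python
import math


def stored_points_sign_magnitude(norm):
    """Stored points for sign-magnitude storage: (|tau|^2 - 1)/4 + 1."""
    return (norm - 1) // 4 + 1


def iterations(norm, r):
    """Approximate number of iterations l = r / log2(|tau|^2)."""
    return r / math.log2(norm)


r = 163  # binary key length used in Table 3.4
for norm in (5, 17, 29):
    points = stored_points_sign_magnitude(norm)
    per_bit = iterations(norm, r) / r
    print(f"|tau|^2 = {norm}: {points} stored points, l = {per_bit:.4f} r")
```

For |τ|² = 5, 17, and 29 this gives 2, 5, and 8 stored points, in agreement with the upper part of Table 3.4.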

3.4 Discussion

In this chapter, we have illustrated that Gaussian integers are suitable for RSA and ECC applications. In the context of ECC systems, we have demonstrated that calculating the complex point multiplication with the τ -adic expansion of the key k can reduce the number of executed point operations, especially by considering the key as a Gaussian integer, i.e. k ∈ Gn . This concept provides improved resistance against side channel attacks, since the probability of skipping the conditional special point addition is significantly reduced. Furthermore, we have presented a new key expansion algorithm, where all digits κ j = 0 are replaced by κ j = τ . This balances the number of point doublings and additions required for a complex point multiplication. Consequently, it prevents hardware attacks based on timing and simple power analysis on the point multiplication. Moreover, we have shown that exploiting the symmetry of Gaussian integers, and storing the precomputed points in the sign-magnitude binary format reduces


the computational complexity and memory requirements. To evaluate our results, we refer to the concept from [31], which also aims to reduce the memory size for the precomputed points for a secured point multiplication. We have demonstrated that further attacks such as differential power analysis, refined power analysis, and zero-value point attacks can be prevented using a simple modification of the point multiplication algorithm. This modification was previously considered in [31] and is based on the randomized initial point method. We have shown that performing this algorithm using Gaussian integers results in significant complexity reduction. As an outlook for this chapter, we note that all benefits from performing the point multiplication over Gaussian integers can be investigated for other complex numbers such as Eisenstein-Jacobi integers [98], or quaternions like Lipschitz [99, 100] and Hurwitz integers [101]. Correspondingly, we have published a paper for Eisenstein-Jacobi integers in [102], and we believe that this is a promising direction for further research.

4 Montgomery Arithmetic over Gaussian Integers

Up to now, we have demonstrated that Gaussian integers are suitable for RSA and ECC systems. Moreover, we have illustrated that performing complex point multiplications over Gaussian integers is advantageous regarding robustness against side channel attacks, computational complexity, and memory requirements. However, calculating the modular arithmetic over Gaussian integers is not trivial. Determining the modulo reduction for Gaussian integers according to (3.1) is computationally intensive. In [37, 38], a new arithmetic approach was proposed based on Montgomery modular arithmetic over Gaussian integers. Due to the independent computation of real and imaginary parts of Gaussian integers, this principle reduces the hardware complexity of arithmetic operations, especially of the Montgomery reduction. It was shown in [37, 38] that a significant complexity reduction can be achieved for the RSA cryptographic system using Gaussian integers. Such a cryptographic system is presented in [103, 104]. Similarly, a Rabin cryptographic system was previously considered over Gaussian integers in [105, 106]. Moreover, coding applications over Gaussian integers [18, 19, 40, 107–109] could also benefit from an algorithm for the Montgomery reduction of Gaussian integers.

Nonetheless, no algorithm for the Montgomery reduction over Gaussian integers has previously been devised. The generalization of the Montgomery reduction from ordinary integers to Gaussian integers is not trivial. The final reduction step of the algorithm utilizes the total order of integers, although such an order relation does not exist for complex numbers.

In this chapter, we introduce two new algorithms for the Montgomery reduction of Gaussian integers. For the first approach, we utilize the absolute value to measure the size of a Gaussian integer. We have shown in Lemma 3.1 that the absolute values of all elements of the Gaussian integer ring $G_n$ are bounded. However, not all Gaussian integers that fulfill this bound are representatives of the ring. We demonstrate how the correct representative can be uniquely determined. The second approach aims to reduce the complexity of the Montgomery reduction. In this algorithm, the absolute value is replaced by the Manhattan weight. The calculation of the Manhattan weight requires only addition, whereas squaring is used to determine the absolute value. This algorithm may not obtain a unique solution, but there are at most two possible solutions that are congruent. Moreover, for many applications uniqueness is not an issue for the intermediate results in calculations, such as the elliptic curve point multiplication, which requires many subsequent reduction steps. Consequently, all intermediate results to determine the complex point multiplication can be calculated using Montgomery arithmetic based on the Manhattan weight. For the final result, the correct representative is determined once using absolute values.

Both algorithms presented are suitable for efficient RSA and ECC applications. Moreover, due to the properties of the Montgomery arithmetic, the proposed algorithms enable highly flexible implementations by supporting arbitrary Gaussian integer rings.

In Section 4.1, we review the concept of the Montgomery multiplication and discuss the issues of the generalization to Gaussian integers. In Sections 4.2 and 4.3, we introduce the new Montgomery reductions based on the absolute value and the Manhattan weight, respectively. In Section 4.4, we demonstrate that sequential reduction steps based on the Manhattan weight are beneficial for compact implementations. Finally, we discuss the innovation of this chapter in Section 4.5.

© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2021. M. Safieh, Algorithms and Architectures for Cryptography and Source Coding in Non-Volatile Flash Memories, Schriftenreihe der Institute für Systemdynamik (ISD) und optische Systeme (IOS), https://doi.org/10.1007/978-3-658-34459-7_4

4.1 Montgomery Arithmetic

In this section, we review the basics of the Montgomery arithmetic for ordinary integers. Moreover, we discuss the extension of this arithmetic to complex numbers and the resulting issues for the Montgomery reduction. Later on, we derive two algorithms for the Montgomery reduction of Gaussian integers.

The Montgomery multiplication is based on the arithmetic over a ring $R_n$ that is isomorphic to the ring $\mathbb{Z}_n = \{x \bmod n,\; x \in \mathbb{Z}\}$ over ordinary integers [39]. The main aim of the Montgomery multiplication is to reduce the complexity of the modulo reduction by replacing the expensive calculation mod $n$ by mod $R$, where $R > n$ is a power of two. Note that in this case the modulo reduction is a simple bitwise AND operation with $R - 1$. Each element $x \in \mathbb{Z}_n$ is mapped to the Montgomery domain as

$$X = xR \bmod n. \tag{4.1}$$


The addition and subtraction in the Montgomery domain are equal to the modular arithmetic of ordinary integers, because

$$(X + Y) \bmod n = (xR + yR) \bmod n = (x + y)R \bmod n. \tag{4.2}$$

However, the multiplication in the Montgomery domain requires a special reduction function, defined as

$$\mu(XY) = XYR^{-1} \bmod n = xyR \bmod n, \tag{4.3}$$

where the result is again in the Montgomery domain. The reduction function $\mu(X)$ also determines the inverse mapping, because $\mu(X) = xRR^{-1} \bmod n = x \bmod n$. The Montgomery reduction is described in Algorithm 4.1.

Algorithm 4.1 Montgomery reduction according to [39]
input: $Z = XY$ with $0 \le Z < nR$, $n' = -n^{-1} \bmod R$, and $R = 2^l \ge n$
output: $M = \mu(Z) = ZR^{-1} \bmod n$
1: t = Z n' mod R              // bitwise AND with R − 1
2: q = (Z + t n) div R         // shift right by l
3: if (q ≥ n) then
4:   return M = q − n
5: else
6:   return M = q
7: end if
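As an illustration, the steps of Algorithm 4.1 can be sketched in Python for ordinary integers. This is a minimal sketch rather than an optimized implementation. The function names are hypothetical, the modulus n is assumed to be odd, and `pow(n, -1, R)` requires Python 3.8 or later.

```python
def montgomery_setup(n, l):
    """Return R = 2^l and n' = -n^{-1} mod R for an odd modulus n < R."""
    R = 1 << l
    n_prime = (-pow(n, -1, R)) % R
    return R, n_prime


def montgomery_reduce(Z, n, l, n_prime):
    """Compute Z * R^{-1} mod n for 0 <= Z < n * R, following Algorithm 4.1."""
    mask = (1 << l) - 1
    t = (Z * n_prime) & mask      # step 1: mod R as bitwise AND with R - 1
    q = (Z + t * n) >> l          # step 2: exact division by R as right shift
    return q - n if q >= n else q  # steps 3 to 7: final conditional subtraction


n, l = 29, 5                      # example modulus with R = 32
R, n_prime = montgomery_setup(n, l)

x, y = 7, 11
X, Y = (x * R) % n, (y * R) % n   # map to the Montgomery domain, cf. (4.1)
P = montgomery_reduce(X * Y, n, l, n_prime)      # product in the Montgomery domain
result = montgomery_reduce(P, n, l, n_prime)     # inverse mapping
print(P, result)
```

In this example, P equals xyR mod n and the second call to the reduction function maps the product back to xy mod n.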

Typically, at the beginning of the calculation, all variables are mapped into the Montgomery domain using the reduction function

$$X = \mu(xR^2) = xR^2R^{-1} \bmod n = xR \bmod n, \tag{4.4}$$

which requires multiplication with the precomputed value $R^2 \bmod n$.

The arithmetic operations in the Montgomery domain can simply be extended for Gaussian integers by replacing mod $n$ with mod $\pi$, where $\pi = a + bi$. Nonetheless, extending the Montgomery reduction for Gaussian integers is not trivial. Lines 3 to 7 of the Montgomery reduction according to Algorithm 4.1 uniquely determine the smallest element in the Montgomery domain that is congruent to $ZR^{-1} \bmod n$. However, it is not possible to directly apply Algorithm 4.1 to Gaussian integers, because complex numbers cannot be totally ordered [40]. This problem is not considered in [37, 38]. In the following, we interpret elements of a Gaussian integer field or ring as an equivalent representation for the Montgomery domain.

In this chapter, we consider two reduction strategies. We use the following two different norms for the final reduction step: the absolute value $|x| = \sqrt{|c|^2 + |d|^2}$ of the Gaussian integer $x = c + di$, and the Manhattan weight $\|x\| = |c| + |d|$. Calculating the Manhattan weight is less complex than calculating the absolute value, because only addition is required, whereas squaring is necessary to determine $|x|$.

Figure 4.1 Elements of the Gaussian integer field $G_{29}$ with π = 5 + 2i

4.2 Reduction over Gaussian Integers using the Absolute Value

In the following, we derive a Montgomery reduction for Gaussian integers. We utilize the absolute value to measure the size of a Gaussian integer. First, we consider an important property of Gaussian integer rings Gn .


Lemma 4.1 For any $x \in G_n$ and $x' \notin G_n$ with $x = x' \bmod \pi$, we have

$$|x| < |x'|. \tag{4.5}$$

Proof Let $c$ and $d$ be the real and imaginary part of $x/\pi$, respectively, and consider a Gaussian integer $x' \notin G_n$ with $x = x' \bmod \pi$ and $x'/\pi = c' + d'i$. Note that $x = x' \bmod \pi$ according to (3.1) implies $c = c' - [c']$ and $d = d' - [d']$. Furthermore, $x' \notin G_n$ implies that at least one rounded value $[c']$ or $[d']$ is non-zero. Now, $[c'] \ne 0$ results in $|c| < 1/2 < |c'|$. Similarly, $[d'] \ne 0$ implies $|d| < 1/2 < |d'|$. Consequently, we have (4.5). ∎

The next example demonstrates that the upper bound from (3.8), i.e. $|x| < |\pi|/\sqrt{2}$, is not a sufficient condition for $x \in G_n$.

Example 4.1 We consider the finite field $G_{29}$ of Gaussian integers with $\pi = 5 + 2i$ as depicted in Figure 4.1. Consider $q' = 3$ with $q = 3 \bmod \pi = -2 - 2i$, i.e. $q' \notin G_{29}$ and $q \in G_{29}$. The representative $q = -2 - 2i$ cannot be determined based on the bound (3.8) for the absolute value, because both Gaussian integers satisfy $|q'| < |\pi|/\sqrt{2}$ and $|q| < |\pi|/\sqrt{2}$. On the other hand, comparing the absolute values, we observe $|q| \approx 2.8284 < |q'| = 3$.
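The numbers in Example 4.1 can be reproduced with a few lines of Python. The sketch assumes that (3.1) denotes the rounding-based reduction x mod π = x − [xπ*/(ππ*)]π that is used in the proof of Lemma 4.1; the helper name `gauss_mod` is hypothetical.

```python
import math


def gauss_mod(x, pi):
    """x mod pi for Gaussian integers as (re, im) tuples (assumed form of (3.1))."""
    a, b = pi
    n = a * a + b * b
    re = x[0] * a + x[1] * b          # real part of x * conj(pi)
    im = x[1] * a - x[0] * b          # imaginary part of x * conj(pi)
    qr = (2 * re + n) // (2 * n)      # rounded real part of x / pi
    qi = (2 * im + n) // (2 * n)      # rounded imaginary part of x / pi
    return (x[0] - (qr * a - qi * b), x[1] - (qr * b + qi * a))


pi = (5, 2)
q_prime = (3, 0)                      # q' = 3 is not a ring element
q = gauss_mod(q_prime, pi)            # representative in G_29
bound = math.hypot(*pi) / math.sqrt(2)

print(q, math.hypot(*q), math.hypot(*q_prime), bound)
```

Both absolute values lie below the bound |π|/√2, so the bound alone cannot identify the representative, whereas the comparison |q| < |q'| from Lemma 4.1 does.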

Algorithm 4.2 Montgomery reduction for Gaussian integers
input: $Z = XY$, $\pi' = -\pi^{-1} \bmod R$, $R = 2^l > |\pi|/\sqrt{2}$
output: $M = \mu(Z) = ZR^{-1} \bmod \pi$
1: t = Z π' mod R              // bitwise AND of the real and imaginary parts with R − 1
2: q = (Z + t π) div R         // shift Re, Im right by l
3: if (|q| < ((√2 − 1)/√2) |π|) then
4:   return M = q
5: else if (|q| < |π|/√2) then
6:   determine α' according to (4.6)
7:   return M = q − α' π
8: else
9:   determine α'' according to (4.7)
10:  return M = q − α'' π
11: end if


Next, we consider the Montgomery reduction for Gaussian integers. The reduction is performed according to Algorithm 4.2, where

$$\alpha' = \operatorname*{argmin}_{\alpha \in \{0, \pm 1, \pm i\}} |q - \alpha\pi|, \tag{4.6}$$

$$\alpha'' = \operatorname*{argmin}_{\alpha \in \{\pm 1, \pm i, \pm 1 \pm i\}} |q - \alpha\pi|. \tag{4.7}$$

Note that steps 1 and 2 of Algorithm 4.2 are applied to the real and imaginary parts separately. The following proposition demonstrates that Algorithm 4.2 results in the correct representative.

Proposition 4.1 For $M$ and $Z$ according to Algorithm 4.2, we have

$$M = ZR^{-1} \bmod \pi. \tag{4.8}$$

Proof Consider the sum $Z + t\pi$ in step 2 of Algorithm 4.2. We have

$$Z + t\pi \equiv Z + Z\pi'\pi \equiv Z + Z(-\pi^{-1})\pi \equiv Z - Z \equiv 0 \bmod R, \tag{4.9}$$

which implies $Z \equiv -t\pi \bmod R$. Hence, $R$ divides $Z + t\pi$. Now, consider the corresponding quotient $q = (Z + t\pi) \operatorname{div} R$. We have

$$q \bmod \pi \equiv (Z + t\pi) \operatorname{div} R \bmod \pi \equiv (Z + t\pi)R^{-1} \bmod \pi \equiv ZR^{-1} + t\pi R^{-1} \bmod \pi \equiv ZR^{-1} \bmod \pi, \tag{4.10}$$

which shows that $q$ is congruent to $ZR^{-1} \bmod \pi$. This implies $ZR^{-1} \bmod \pi = q - \alpha\pi$, where α is a Gaussian integer.


The absolute value of $ZR^{-1} \bmod \pi$ is bounded, i.e.

$$\left|ZR^{-1}\right| = |XY|\,R^{-1} \le \frac{|\pi|^2}{2R} < \frac{|\pi|\,R}{\sqrt{2}\,R} = \frac{|\pi|}{\sqrt{2}}. \tag{4.11}$$

Note that even for $|q| < |\pi|/\sqrt{2}$, the quotient $q$ may not be the correct representative $ZR^{-1} \bmod \pi$, as shown in Example 4.1. In this case, $q - \alpha\pi$ with $\alpha \in \{\pm 1, \pm i\}$ are also possible candidates. However, for any $x \in G_n$ and $\alpha \in \{\pm 1, \pm i\}$, the lower bound

$$|x - \alpha\pi| \ge \big||\pi| - |x|\big| > \frac{\sqrt{2} - 1}{\sqrt{2}}\,|\pi| \tag{4.12}$$

follows from the triangle inequality and Lemma 4.1. Hence, for $|q| < \frac{\sqrt{2} - 1}{\sqrt{2}}|\pi|$ the quotient $q$ is the unique solution. Moreover, $|\alpha| > \sqrt{2}$ implies

$$|x - \alpha\pi| \ge \big||\alpha\pi| - |x|\big| > \frac{|\pi|}{\sqrt{2}} \tag{4.13}$$

for any $x \in G_n$. Thus, for $\frac{\sqrt{2} - 1}{\sqrt{2}}|\pi| \le |q| < \frac{|\pi|}{\sqrt{2}}$, only the candidates according to (4.6) are possible. Therefore, the correct candidate can be found by minimizing the absolute value among all candidates, which follows from Lemma 4.1.

If $|q| \ge |\pi|/\sqrt{2}$, the result $M$ is calculated as $M = q - \alpha''\pi$ with $\alpha''$ according to (4.7). In order to demonstrate that (4.7) is correct, we derive the bound $|\alpha| \le \sqrt{2}$. Consider again the quotient $q = (Z + t\pi) \operatorname{div} R$, whereby we aim to compensate the offset $t\pi R^{-1}$ by $\alpha\pi$. Note that the absolute values are bounded by

$$|\alpha\pi| = \left|t\pi R^{-1}\right| = |t|\,|\pi|\,R^{-1} \le \sqrt{2}\,|\pi|, \tag{4.14}$$

because $|t| \le \sqrt{R^2 + R^2} = \sqrt{2}R$ due to the reduction mod $R$. This implies $|\alpha| \le \sqrt{2}$. Consequently, (4.7) includes all possible candidates. ∎

We can now describe the final reduction step of Algorithm 4.2. Note that there are up to eight possible values α according to (4.6) and (4.7), but not all potential values have to be considered. We illustrate the final reduction procedure in the following example.

Example 4.2 Again, we consider the finite field $G_{29}$ of Gaussian integers with $\pi = 5 + 2i$. Figure 4.2 depicts the Montgomery domain and all possible quotients $q$ after the first two steps of Algorithm 4.2. For $q = 3$, we have $\frac{\sqrt{2} - 1}{\sqrt{2}}|\pi| \approx 1.5773 < |q| = 3 < \frac{|\pi|}{\sqrt{2}} \approx 3.8079$. Hence, $\alpha'$ should be determined according to (4.6). As $q$ is in the first quadrant (including only the positive real axis), we have only two possible values for α, i.e. $\alpha \in \{0, 1\}$. Now, calculating $|q - 0| = 3$ and $|q - \pi| \approx 2.8284$ leads to $\alpha' = 1$. Consequently, we obtain the correct representative $M = q - \alpha'\pi = -2 - 2i$.
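The complete procedure of Algorithm 4.2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the reference implementation of the thesis: Gaussian integers are stored as (re, im) tuples, n = ππ* is assumed to be odd, π' is obtained from π⁻¹ = π* n⁻¹ mod R, and all function names are hypothetical. The thresholds are evaluated in floating point for readability.

```python
import math


def gmul(x, y):
    """Product of two Gaussian integers given as (re, im) tuples."""
    return (x[0] * y[0] - x[1] * y[1], x[0] * y[1] + x[1] * y[0])


def gsub(x, y):
    return (x[0] - y[0], x[1] - y[1])


def norm2(x):
    """Squared absolute value |x|^2."""
    return x[0] * x[0] + x[1] * x[1]


def pi_prime(pi, l):
    """pi' = -pi^{-1} mod R with pi^{-1} = conj(pi) * n^{-1} mod R (n odd)."""
    R = 1 << l
    n_inv = pow(norm2(pi), -1, R)
    return ((-pi[0] * n_inv) % R, (pi[1] * n_inv) % R)


def final_reduce_abs(q, pi):
    """Steps 3 to 11 of Algorithm 4.2 (final reduction by absolute value)."""
    abs_q, abs_pi = math.sqrt(norm2(q)), math.sqrt(norm2(pi))
    if abs_q < (1 - 1 / math.sqrt(2)) * abs_pi:       # (sqrt(2)-1)/sqrt(2) * |pi|
        return q
    if abs_q < abs_pi / math.sqrt(2):
        cands = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]           # (4.6)
    else:
        cands = [(1, 0), (-1, 0), (0, 1), (0, -1),
                 (1, 1), (1, -1), (-1, 1), (-1, -1)]                 # (4.7)
    return min((gsub(q, gmul(a, pi)) for a in cands), key=norm2)


def montgomery_reduce_abs(Z, pi, l):
    """Algorithm 4.2: returns Z * R^{-1} mod pi with R = 2^l."""
    mask = (1 << l) - 1
    zt = gmul(Z, pi_prime(pi, l))
    t = (zt[0] & mask, zt[1] & mask)                  # step 1, per component
    s = gmul(t, pi)
    q = ((Z[0] + s[0]) >> l, (Z[1] + s[1]) >> l)      # step 2, per component
    return final_reduce_abs(q, pi)


pi, l = (5, 2), 3                                     # R = 8 > |pi| / sqrt(2)
print(final_reduce_abs((3, 0), pi))                   # quotient q = 3 of Example 4.2
print(montgomery_reduce_abs((3, 0), pi, l))
```

For the quotient q = 3 the sketch returns −2 − 2i, as in Example 4.2. For the input Z = 3 with R = 8, the first two steps give q = 2 + 5i and the final step returns −1 − 2i, which satisfies 8 · (−1 − 2i) ≡ 3 mod π.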

The final reduction step of Algorithm 4.2 always results in the correct representative $M = q \bmod \pi$ satisfying $|M| \le |\pi|/\sqrt{2}$. Hence, Algorithm 4.2 achieves the desired precision reduction. The number of possible candidates α is restricted depending on the absolute value of $q$, as illustrated in (4.6) and (4.7). Moreover, the number of valid candidates α can be further reduced based on the sign of $q$, as shown in Example 4.2. Nonetheless, the reduction according to Algorithm 4.2 can be rather complex due to the squaring required to determine the absolute values. This complexity may not be required in all applications of the Montgomery multiplication. For applications in cryptography, it is adequate to only find the unique representative $ZR^{-1} \bmod \pi$ in the last calculation step, whereas for intermediate results a reduction that achieves $M \equiv ZR^{-1} \bmod \pi$ is sufficient. This motivates a low-complexity Montgomery reduction based on the Manhattan weight for the intermediate calculations, which is considered in the next section.

Figure 4.2 Illustration of the Montgomery domain for p = 29 and π = 5 + 2i. Circles centered with a point are possible quotients q after the Montgomery reduction and filled circles are possible offsets απ

4.3 Reduction over Gaussian integers using the Manhattan Weight

In this section, we derive a low-complexity reduction for Gaussian integers based on the Manhattan weight. First, we note that using the symmetry according to Lemma 3.2, we can restrict the following derivations to the elements of the first quadrant. Next, we consider some important properties of the Manhattan weight.

Lemma 4.2 For $x \in G_p$ the Manhattan weight is upper bounded by

$$\|x\| \le a - 1 = N, \tag{4.15}$$

where we assume $\pi = a + bi$ with $a > b \ge 1$ without loss of generality.

Proof Let $c$ and $d$ be the real and imaginary part of $x/\pi$, respectively. For $x \in G_p$ we have $x = x \bmod \pi$ and consequently

$$\left[\frac{x\pi^*}{\pi\pi^*}\right] = 0. \tag{4.16}$$

This implies $[c] = [d] = 0$ and $|c| < 1/2$, $|d| < 1/2$. Hence, we consider $(1/2 + 1/2\,i)\pi$ to upper bound the real and imaginary parts of $x$ as

$$|\mathrm{Re}\{x\}| < \left|\frac{a - b}{2}\right|, \tag{4.17}$$

$$|\mathrm{Im}\{x\}| < \left|\frac{a + b}{2}\right|. \tag{4.18}$$

Note that either $a$ is odd and $b$ is even or vice versa, because $p$ is an odd prime. Furthermore, the real and imaginary parts of $x$ are integers. Consequently, we have

$$|\mathrm{Re}\{x\}| \le \left|\frac{a - b}{2} - \frac{1}{2}\right|, \tag{4.19}$$

$$|\mathrm{Im}\{x\}| \le \left|\frac{a + b}{2} - \frac{1}{2}\right|. \tag{4.20}$$

We can bound the Manhattan weight of $x$ as

$$\|x\| \le \left\|\frac{a - b - 1}{2} + \frac{a + b - 1}{2}\,i\right\|. \tag{4.21}$$

With $a > b \ge 1$, the Manhattan weight of $x$ is upper bounded by

$$\|x\| \le \frac{a - b - 1}{2} + \frac{a + b - 1}{2} = a - 1. \tag{4.22}$$

Hence, (4.15) holds. ∎

4.3.1 Montgomery reduction algorithm using the Manhattan weight

In this subsection, we present the Montgomery reduction algorithm based on the Manhattan weight. Without loss of generality, we consider $\pi = a + bi$ with $a > b \ge 1$. The precision reduction is described in Algorithm 4.3, where

$$\hat{\alpha} = \operatorname*{argmin}_{\alpha \in \{\pm 1, \pm i, \pm 1 \pm i\}} \|q - \alpha\pi\|, \tag{4.23}$$

and $N = a - 1$. We demonstrate that this algorithm always obtains $M \equiv ZR^{-1} \bmod \pi$, where $\|M\| \le N$.

Algorithm 4.3 Precision reduction for Gaussian integers using the Manhattan weight
input: $Z = XY$, $\pi' = -\pi^{-1} \bmod R$, $R = 2^l > N$
output: $M \equiv \mu(Z) = ZR^{-1} \bmod \pi$
1: t = Z π' mod R              // bitwise AND of the real and imaginary parts with R − 1
2: q = (Z + t π) div R         // shift Re and Im right by l
3: if (‖q‖ ≤ N) then
4:   return M = q
5: else
6:   determine α̂ according to (4.23)
7:   return M = q − α̂ π
8: end if


Next, we consider bounds on the Manhattan weight for the product of two elements $x$ and $y$. Note that we consider arithmetic without modulo reduction.

Lemma 4.3 For any Gaussian integers $x, y$ with $\|x\| \le N$ and $\|y\| \le N$, we have the upper bound

$$\|xy\| \le N^2 \tag{4.24}$$

for multiplications without modulo reduction.

Proof Without loss of generality, we consider two elements $x = c + di$ and $y = e + fi$ from the first quadrant for the product in (4.24). This implies $\|x\| = c + d \le N$ or $d \le N - c$. Similarly, we have $f \le N - e$. For the product $xy$, we have

$$xy = (ce - df) + (ed + cf)i. \tag{4.25}$$

First, consider the absolute value of the imaginary part $\mathrm{Im}\{xy\}$

$$|\mathrm{Im}\{xy\}| = ed + cf \le e(N - c) + c(N - e) = eN + cN - 2ce. \tag{4.26}$$

To determine the maximum value, we consider the bivariate function $g(e, c) = eN + cN - 2ce$ and its partial derivatives

$$\frac{\partial}{\partial e}\, g(e, c) = N - 2c = 0, \tag{4.27}$$

$$\frac{\partial}{\partial c}\, g(e, c) = N - 2e = 0. \tag{4.28}$$

This results in a maximum for $c = e = N/2$ and the bound $|\mathrm{Im}\{xy\}| = ed + cf \le N^2/2$. Due to symmetry, this bound also holds for the absolute value of the real part. Hence, we have

$$\|xy\| \le \frac{N^2}{2} + \frac{N^2}{2} = N^2. \tag{4.29}$$

∎

The following proposition demonstrates that Algorithm 4.3 results in the desired reduction.


Proposition 4.2 For $M$ and $Z$ according to Algorithm 4.3, we have

$$M \equiv ZR^{-1} \bmod \pi, \tag{4.30}$$

$$\|M\| \le N. \tag{4.31}$$

Proof The first two steps are identical in Algorithms 4.3 and 4.2. Hence, the first statement (4.30) follows from the same arguments as in the proof of Proposition 4.1. For $\|q\| \le N$, we immediately have (4.31). For $\|q\| > N$, we consider again the corresponding quotient $q = (Z + t\pi) \operatorname{div} R$. The Manhattan weight of $ZR^{-1} \bmod \pi$ is bounded by

$$\left\|\frac{Z}{R}\right\| = \frac{\|Z\|}{R} = \frac{\|XY\|}{R} \le \frac{N^2}{R} \le \frac{NR}{R} = N, \tag{4.32}$$

where we have used (4.29) and the assumption $R \ge N$. Similar to the proof of Proposition 4.1, we aim to compensate $t\pi R^{-1}$ with $\alpha\pi$. From Proposition 4.1, it follows that $|\alpha| \le \sqrt{2}$. Hence, the minimization in (4.23) over $\alpha \in \{\pm 1, \pm i, \pm 1 \pm i\}$ is sufficient. From (4.32), it follows that there is at least one solution with $\|M\| = \|q - \hat{\alpha}\pi\| \le N$. Consequently, the minimization in (4.23) will find a solution satisfying (4.31). ∎

We can now describe the final reduction step of Algorithm 4.3. Note that there are eight possible values for α according to (4.23), but not all potential values have to be considered. The reduction procedure is demonstrated in the following example.

Example 4.3 Again, we consider the finite field $G_{29}$ of Gaussian integers with $\pi = 5 + 2i$, where all possible values of $q$ after the first two steps of Algorithm 4.3 are depicted in Figure 4.2. Consider the point $q = 3 + 2i$ in the first quadrant. We have $\|q\| = 5 > N = 4$. Hence, further reduction is required. As $q$ is in the first quadrant, we have three possible values for α, i.e. $\alpha \in \{1, i, 1 + i\}$. Calculating $q - \alpha\pi$ results in $-2$, $5 - 3i$, $-5i$, where only the solution $x = -2$ satisfies the condition $\|-2\| = 2 \le N = 4$. Hence, we choose $x = -2$ as the final result.

The final reduction step results in a Gaussian integer $x$ with $\|x\| \le N$. Hence, Algorithm 4.3 achieves the desired precision reduction using the Manhattan weight. However, the result may not always be an element of the ring. For instance, for some π, the value $x = a - 1$ is a ring element. Alternatively, the point $x' = x - \pi = -1 - bi$ can be a ring element. The two values are congruent and can satisfy $\|x\| = a - 1 \le N$, $\|x'\| = b + 1 \le N$ depending on π. The ring element is the value with the smallest absolute value. There exist at most two congruent values with Manhattan weight less than or equal to $N$. Consequently, the correct representative can be selected by minimizing the absolute value of these two congruent values. Depending on the application, this step might be required once to obtain the final result $x = q \bmod \pi$.
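The final step of Algorithm 4.3 only needs additions and comparisons of Manhattan weights. A minimal Python sketch of this step is given below. It assumes π = a + bi with a > b ≥ 1, represents Gaussian integers as (re, im) tuples, and uses hypothetical function names; the first two steps of the algorithm are not repeated here.

```python
def gmul(x, y):
    """Product of two Gaussian integers given as (re, im) tuples."""
    return (x[0] * y[0] - x[1] * y[1], x[0] * y[1] + x[1] * y[0])


def gsub(x, y):
    return (x[0] - y[0], x[1] - y[1])


def weight(x):
    """Manhattan weight ||x|| = |re| + |im|."""
    return abs(x[0]) + abs(x[1])


# Candidate offsets alpha from (4.23).
OFFSETS = [(1, 0), (-1, 0), (0, 1), (0, -1),
           (1, 1), (1, -1), (-1, 1), (-1, -1)]


def reduce_manhattan(q, pi):
    """Steps 3 to 8 of Algorithm 4.3 for pi = (a, b) with a > b >= 1."""
    N = pi[0] - 1
    if weight(q) <= N:
        return q
    return min((gsub(q, gmul(a, pi)) for a in OFFSETS), key=weight)


pi = (5, 2)
print(reduce_manhattan((3, 2), pi))   # quotient q = 3 + 2i from Example 4.3
print(reduce_manhattan((2, 5), pi))   # another quotient with ||q|| = 7 > N
```

For q = 3 + 2i the sketch selects −2, in agreement with Example 4.3. Note that the result is only guaranteed to be congruent to the representative, so a final selection by absolute value may still be needed once at the end of a computation.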

4.3.2 Reduction after addition (or subtraction)

In this subsection, we consider the reduction after the addition of two elements $x, y$ based on the Manhattan weight. First, we demonstrate that the offsets according to (4.7) and (4.23) are not sufficient for the reduction after the addition (or subtraction) of two intermediate values satisfying (4.15), as illustrated in the following example:

Example 4.4 Consider the finite field $G_{53}$ of Gaussian integers with $\pi = 7 + 2i$ and $p = 53$. All possible values $M$ obtained from Algorithm 4.3 are depicted in Figure 4.3. Consider the value $M = 6$ with $\|M\| = 6 = N$ and assume a doubling of this value. This results in $z = 2M = 12$ and $\|z\| = 2N = 12$. Hence, the offsets απ according to (4.7) or (4.23) are not sufficient for the reduction and the offset $\alpha = 2$ is required.

Next, we bound the Manhattan weight for the sum of two elements $x$ and $y$ to determine all required offsets for the reduction.

Lemma 4.4 For any Gaussian integers $x, y$ with $\|x\| \le N$ and $\|y\| \le N$, we have the upper bound

$$\|z\| = \|x + y\| \le 2N \tag{4.33}$$

for the sum without modulo reduction.

Proof This bound follows from the triangle inequality and (4.15), i.e.

$$\|z\| = \|x + y\| \le \|x\| + \|y\| \le 2N. \tag{4.34}$$

∎
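The claim of Example 4.4 can be verified by evaluating the Manhattan weight for all eight offsets from (4.23) and for the additional offset α = 2. The Python sketch below is an illustration with hypothetical helper names; Gaussian integers are stored as (re, im) tuples.

```python
def gmul(x, y):
    """Product of two Gaussian integers given as (re, im) tuples."""
    return (x[0] * y[0] - x[1] * y[1], x[0] * y[1] + x[1] * y[0])


def gsub(x, y):
    return (x[0] - y[0], x[1] - y[1])


def weight(x):
    """Manhattan weight ||x|| = |re| + |im|."""
    return abs(x[0]) + abs(x[1])


pi = (7, 2)
N = pi[0] - 1                     # N = a - 1 = 6
z = (12, 0)                       # z = 2M for M = 6

# Offsets alpha in {+-1, +-i, +-1 +- i} as used in (4.7) and (4.23).
offsets = [(1, 0), (-1, 0), (0, 1), (0, -1),
           (1, 1), (1, -1), (-1, 1), (-1, -1)]

best_with_eight = min(weight(gsub(z, gmul(a, pi))) for a in offsets)
with_alpha_two = gsub(z, gmul((2, 0), pi))

print(best_with_eight, with_alpha_two, weight(with_alpha_two))
```

The smallest weight reachable with the eight offsets exceeds N, whereas α = 2 yields −2 − 4i with Manhattan weight 6 = N.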


Figure 4.3 Illustration of the Montgomery domain for p = 53 and π = 7 + 2i. Circles centered with a point are possible values M resulting from Algorithm 4.3 that satisfy ‖M‖ ≤ N = 6. Filled circles are possible offsets απ according to (4.7). The circle centered with a plus indicates 2π

Consequently, the addition (or subtraction) of two Gaussian integers $x, y$ can potentially result in a Gaussian integer $z$ with $\|z\| > N$. For any $z$ satisfying $N < \|z\| \le 2N$, a precision reduction is required. From Example 4.4, it follows that the offsets according to (4.7) or (4.23) are not sufficient. Next, we consider the precision reduction for the sum based on the Manhattan weight with the associated offsets. Note that

$$|x| \le \|x\| \tag{4.35}$$

holds for any Gaussian integer $x$, which follows from squaring both sides of the inequality.

Proposition 4.3 For the sum $z = x + y$ of any Gaussian integers $x, y$ with $\|x\| \le N$, $\|y\| \le N$, and $N < \|z\| \le 2N$, we obtain the precision reduction as

$$M = z - \tilde{\alpha}\pi, \tag{4.36}$$

$$\tilde{\alpha} = \operatorname*{argmin}_{\alpha \in \{\pm g,\, \pm hi\},\; h, g \in \{1,\, 1+i,\, 2,\, 2+i\}} \|z - \alpha\pi\|, \tag{4.37}$$

where $M \equiv z \bmod \pi$ and $\|M\| \le N$.

Proof Using Lemma 4.4, we can restrict all valid candidates α as $\|\alpha\pi\| - N \le 2N$. Furthermore, from (4.35) follows

Now, since N
1, because the virtual dictionary contains all possible characters.


7 The Parallel Dictionary LZW Algorithm for Flash Memory Controllers

Algorithm 7.1 PDLZW encoding
input: charstream, max
output: intstream
1: l = 0                                   // reset the actual length of string s
2: repeat
3:   c = readchar(charstream)              // read next character
4:   s = [s | c]                           // concatenate s and c
5:   l = l + 1
6:   new = index(s, l − 1)                 // search for s in dictionary l − 1
7:   if (s is not in dictionary l − 1) then
8:     add s to dictionary l − 1
9:     intstream.put([l − 2|old])          // output dictionary number and index
10:    s = c                               // save c for next iteration
11:    l = 1
12:    old = index(s, l)
13:  else if (l − 1 == max) then
14:    intstream.put([max − 1|new])        // output dictionary number and index
15:    l = 0
16:  else
17:    old = new                           // save index for next iteration
18:  end if
19: until charstream is empty
20: intstream.put([l − 1|old])             // output dictionary number and index

If a miss occurs, the string s is added to the dictionary l − 1 at the next free index. Moreover, an output codeword is generated. This output is determined by the string s from the previous iteration, which is represented by the address stored in the variable old. This address was found in dictionary l − 2. Hence, the output is the concatenation [l − 2|old] of the integer l − 2 and the address old. Finally, the character c is stored in the string s and the values of l and old are updated accordingly. In the case where the string s is found in dictionary l − 1, the algorithm depends on the length l of the string. For l < max, the address is stored in the variable old and the algorithm continues with the next iteration. However, l == max indicates the maximum possible length. Hence, the string s cannot be extended and the corresponding codeword [max − 1|new] is written to the output. The algorithm continues with an empty string s, which is indicated by setting l = 0. In the final step, the input is empty, although the string s may contain some symbols. The remaining content of the string s is written to the output as [l − 1|old]. The PDLZW decoding is described in Algorithm 7.2. The decoding process is less complex than the encoding, because no search is required. The codewords received contain the dictionary number l and the addresses c of the dictionary entries.


Hence, the dictionaries can be implemented using RAM. The integers old and new denote the last and the currently received codeword. The decoded symbol sequences are stored in the string s.

Algorithm 7.2 PDLZW decoding
input: intstream
output: charstream
repeat
    [l|c] = readint(intstream)                                  // read next codeword
    if (c is not available in dictionary l) then
        s = string([l − 1|old]) + firstchar(string([l − 1|old]))   // build s
        add s to dictionary l in address c
        charstream.put(s)                                       // output the current string
    else
        s = string([l − 1|old]) + firstchar(string([l|c]))         // build s
        add s to dictionary l
        charstream.put(s)                                       // output the current string
    end if
    old = c                                                     // save codeword for next iteration
until intstream is empty
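The encoding and decoding principle can be illustrated with a small software model. The following Python sketch is not the hardware architecture discussed in this chapter, and it is only one possible reading of Algorithms 7.1 and 7.2: the function names, the representation of a codeword as a tuple (dictionary number, index), and the parameter `sizes` (capacity of dictionaries 1, 2, ...) are assumptions made for illustration. Dictionary 0 is the virtual dictionary, whose index equals the byte value, and dictionary k holds strings of length k + 1.

```python
def pdlzw_encode(data: bytes, sizes: list[int]) -> list[tuple[int, int]]:
    """Encode data; sizes[k-1] is the assumed capacity of dictionary k."""
    dicts = [dict() for _ in sizes]      # dicts[k-1]: string -> index in dictionary k
    max_len = len(sizes) + 1             # longest string that can be stored
    out = []
    s, old = b"", None                   # current match and its codeword
    for byte in data:
        cand = s + bytes([byte])
        k = len(cand) - 1                # dictionary that would hold cand
        if k == 0:
            hit = (0, byte)              # virtual dictionary always matches
        else:
            idx = dicts[k - 1].get(cand)
            hit = None if idx is None else (k, idx)
        if hit is None:                  # miss: store cand, emit previous match
            d = dicts[k - 1]
            if len(d) < sizes[k - 1]:    # a full sub-dictionary accepts no entries
                d[cand] = len(d)
            out.append(old)
            s, old = cand[-1:], (0, byte)
        elif len(cand) == max_len:       # match cannot be extended any further
            out.append(hit)
            s, old = b"", None
        else:                            # keep extending the match
            s, old = cand, hit
    if old is not None:                  # flush the remaining match
        out.append(old)
    return out


def pdlzw_decode(codes: list[tuple[int, int]], sizes: list[int]) -> bytes:
    """Invert pdlzw_encode by rebuilding the same dictionaries."""
    tables = [[] for _ in sizes]         # tables[k-1][i]: string i of dictionary k
    max_len = len(sizes) + 1
    out = bytearray()
    prev = b""
    for number, idx in codes:
        if number == 0:
            w = bytes([idx])
        elif idx < len(tables[number - 1]):
            w = tables[number - 1][idx]
        else:                            # entry is created by this very codeword
            w = prev + prev[:1]
        if prev and len(prev) < max_len:
            t = tables[len(prev) - 1]
            if len(t) < sizes[len(prev) - 1]:
                t.append(prev + w[:1])   # mirror the encoder's dictionary update
        out += w
        prev = w
    return bytes(out)
```

For instance, `pdlzw_encode(b"aaaa", [256, 256, 256])` yields the codewords (0, 97), (1, 0), (0, 97), where the second codeword refers to an entry of dictionary 1 that the decoder has to reconstruct from the previously decoded string.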

The partitioning of the LZW encoding dictionary into multiple dictionaries with the PDLZW algorithm results in a reduced compression gain compared with the original LZW algorithm. This follows from the fact that a sub-dictionary of the PDLZW algorithm may be completely filled, whereas other sub-dictionaries still have free entries. In this case, no further strings can be added to the filled sub-dictionary, while the original LZW algorithm could store such strings until the complete dictionary is filled. Minimizing the compression loss with the PDLZW algorithm requires a suitable partitioning of the available address space among the dictionaries. In Section 7.3, we derive such an address space partitioning technique for the PDLZW dictionaries. Furthermore, the complexity of the PDLZW algorithm is dominated by the required memory size. In Section 7.4, we present two modifications of the PDLZW algorithm that reduce the memory requirements. The first modification is based on organizing the dictionaries recursively, while the second one applies the word partitioning technique from [136, 137] to the PDLZW dictionaries.


Figure 7.5 Illustration of a Markov chain for the PDLZW encoding

7.3 Address Space Partitioning for the PDLZW

In this section, we model statistical dependencies in the input data using a Markov chain [145] to derive an address space partitioning technique that improves the compression rate of the PDLZW.

7.3.1 Data Model

The partitioning of the address space depends on the data. Hence, a suitable data model is required to derive a proper partitioning. We use Markov chains to model statistical dependencies in the data [145]. A similar assumption, namely that there is an underlying Markov model for the data, is made in the original work of Ziv and Lempel [50]. The basic idea of the proposed partitioning method is to use the stationary distribution of the Markov model for the LZW algorithm to derive the address space partitioning for the PDLZW algorithm. We use a Markov chain as depicted in Figure 7.5 to model the LZW encoding process. Each state in the Markov chain represents a state of the encoder, i.e. the encoder is in state $i$ if the current input sequence of length $i + 1$ is found in the dictionary. Furthermore, each state $i$ has a transition probability $q_{i+1}$ that the next, extended symbol sequence is also found in the dictionary. Let $p_i$ be the probability that the Markov chain is in state $i$. The stationary distribution $p_0, p_1, \ldots, p_{n-1}$ is a probability distribution that remains unchanged in the Markov chain as time progresses. Obviously, the probability distribution varies as long as new entries are added to the LZW dictionary. However, once the dictionary is completely filled, we may assume that the distribution is stationary. Moreover,


the probability $p_i$ is proportional to the number of strings of length $i + 1$ in the LZW dictionary. Based on the Markov chain model, we can calculate the probabilities for each state with the recursion

\[ p_i = p_{i-1} \cdot q_i. \tag{7.2} \]

Assuming a homogeneous Markov chain with $q_i = q$, we obtain

\[ p_i = p_0 \cdot q^i, \tag{7.3} \]

where $p_0$ can be determined using the normalization of the probability distribution and the geometric series for $q \neq 1$

\[ \sum_{i=0}^{n-1} p_i = p_0 \, \frac{1 - q^n}{1 - q} = 1, \tag{7.4} \]

resulting in

\[ p_0 = \frac{1 - q}{1 - q^n}. \tag{7.5} \]

Hence, the homogeneous Markov model results in exponentially decreasing state probabilities, where the stationary distribution is determined by the transition probability $q$ and the number of dictionaries $n$. In order to demonstrate that this model is able to characterize the LZW state distribution for real-world data, we consider data from the Calgary and Canterbury corpora [139, 146]. Figure 7.6 illustrates an example of the probability distribution for the complete Calgary corpus. Here an address space with 1024 entries was considered. The probability for the first state (dictionary 0) is not depicted because the first dictionary has a fixed length of 256 entries and does not require partitioning. The probability distribution is obtained by the following procedure. First, the data is encoded with the LZW algorithm until the dictionary is completely filled. Subsequently, further data is encoded and the state probability is estimated by the relative frequency of the corresponding encoder state. Figure 7.6 depicts the logarithm of the state probabilities (solid line) and a regression line (dashed) for

\[ \log(p_i) \approx c + i \log(q). \tag{7.6} \]

These curves indicate that the exponentially decreasing state probabilities of the homogeneous Markov model approximate the actual distribution, where log(q) can be determined from the slope of the regression line.
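The relation between (7.3), (7.5), and the regression in (7.6) can be checked numerically. The following Python sketch is an illustration only: the function names are assumptions, and the regression is applied to synthetic probabilities generated from the model itself rather than to measured encoder statistics such as those in Figure 7.6.

```python
import math


def stationary_distribution(q: float, n: int) -> list[float]:
    """State probabilities p_i = p_0 * q**i with p_0 from (7.5), for 0 < q < 1."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    p0 = (1.0 - q) / (1.0 - q ** n)
    return [p0 * q ** i for i in range(n)]


def estimate_q(probs: list[float]) -> float:
    """Least-squares fit of log(p_i) = c + i*log(q); returns the estimate of q."""
    points = [(i, math.log(p)) for i, p in enumerate(probs) if p > 0.0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return math.exp(slope)               # the slope of the line is log(q)


p = stationary_distribution(0.379, 9)
print(sum(p))                            # normalization (7.4): close to 1
print(estimate_q(p))                     # recovers q up to rounding errors
```

With measured relative frequencies instead of the synthetic values, the fit is no longer exact, and the estimated q describes the slope of the dashed regression line.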


Figure 7.6 Stationary probability distribution for Calgary corpus for an address space with 1024 entries

7.3.2 Partitioning the PDLZW Address Space

Next, we introduce a new address space partitioning technique for the PDLZW algorithm. With a proper partitioning of the address space, the state probabilities $p_i$ for the LZW should be equal to the probabilities of finding a string in dictionary $i$ for the PDLZW. Hence, we can use the state distribution to estimate the number of entries in the PDLZW dictionaries. Let $M$ denote the total number of entries in the dictionaries, i.e. $M$ is the difference between the total size of the address space and the size of the virtual dictionary. Typically, 256 addresses are reserved for the virtual dictionary. With an address space of 1024, only $M = 768$ entries can be stored in the dictionaries. We propose allocating approximately $p_i M$ entries to the dictionary $i$, because $p_i M$ is the expected number of strings of length $i + 1$ in the LZW dictionary. Table 7.1 contains a few examples for the resulting partitioning, where the transition probabilities are


Table 7.1 Determination of dictionary sizes using the probability distribution

data      q       dictionary sizes (1–9)
book1     0.167   640 107 18 3 0 0 0 0 0 0
Calgary   0.379   477 181 68 26 10 4 1 1 0 0
Kennedy   0.744   212 157 117 87 65 48 36 27 20

Table 7.2 Proposed dictionary partitioning for different address spaces

address space       dictionary partitioning (1–12)
512                 128  64   32   16   8    4   4   0   0   0   0   0
1024 for q ≤ 0.5    512  128  64   32   16   8   4   4   0   0   0   0
1024 for q > 0.5    256  256  128  64   32   16  8   4   4   0   0   0
2048                512  512  256  128  128  64  64  32  32  32  16  16

obtained by regression for $\log(p_i)$. The file book1 from the Calgary corpus results in the smallest transition probability, where n = 5 dictionaries would suffice. On the other hand, the Kennedy file has a large transition probability, which results in longer strings. Obviously, this partitioning does not result in a universal solution because the partitioning strongly depends on the transition probability. However, this dependency can be partially resolved by a simple practical constraint. The PDLZW algorithm is predominantly used for hardware implementations, and consequently the dictionary sizes are typically chosen as powers of two to support RAM-based implementations. Restricting the dictionary sizes to powers of two also limits the number of possible address space segmentations. Table 7.2 presents the proposals for the dictionary partitioning for different address spaces. Due to the rounding of $p_i M$ to powers of two, a large range of different transition probabilities results in the same dictionary partitioning. The partitioning for 512 addresses is suitable for all files in the Calgary and Canterbury corpora. For an address space with 1024 addresses, we propose two segmentations depending on q, where the partitioning for q ≤ 0.5 provides good compression gains for both corpora. Table 7.2 is limited to address spaces up to 2048 addresses. Such address spaces are suitable for applications in flash memory devices, where data blocks of 1, 2, or 4 KB are mostly processed today. In order to determine which address space is best suited for an input block length, the input blocks are compressed using the LZW algorithm. Table 7.3 lists the mean block length of compressed data using the LZW


Table 7.3 Mean compressed block length of LZW for different block lengths

corpus       address space   block length
                             1024      2048      4096      8192
Calgary      512             638.3*    1270.0    2554.7    –
Calgary      1024            649.3     1179.3*   2295.2    4593.9
Calgary      2048            714.2     1268.7    2262.9*   4318.7*
Calgary      4096            –         –         2462.3    4394.0
Calgary      8192            –         –         –         4758.8
Canterbury   512             446.4*    871.6     1831.5    –
Canterbury   1024            470.3     841.4*    1599.9*   3331.4
Canterbury   2048            517.3     917.2     1648.3    3097.3*
Canterbury   4096            –         –         1798.2    3259.9
Canterbury   8192            –         –         –         3531.4

(* smallest value per corpus and column; set in boldface in the original)

algorithm depending on the size of the address space. The columns in this table correspond to different input block lengths up to 8 KB. In each column, the smallest mean block length is marked in boldface, which indicates the best address space for the input block length considered. As shown in this table, address spaces with 512, 1024 or 2048 entries are suitable for block lengths up to 8 KB.
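The allocation rule of Section 7.3.2, i.e. approximately $p_i M$ entries for dictionary $i$, can be sketched in a few lines of Python. The function name is an assumption, and rounding to the nearest integer is only one plausible choice. The additional rounding to powers of two that leads to Table 7.2 is not reproduced here, because the exact rounding rule is not specified in the text.

```python
def dictionary_sizes(q: float, n: int, total_entries: int) -> list[int]:
    """Allocate about p_i * M entries to dictionary i for the homogeneous model.

    q: transition probability, n: number of dictionaries,
    total_entries: M, the address space minus the 256 virtual entries.
    """
    p0 = (1.0 - q) / (1.0 - q ** n)              # equation (7.5)
    return [round(total_entries * p0 * q ** i) for i in range(n)]


# Address space of 1024 entries, hence M = 1024 - 256 = 768.
print(dictionary_sizes(0.167, 9, 768))           # [640, 107, 18, 3, 0, 0, 0, 0, 0]
```

For q = 0.167 the sketch reproduces the leading values of the book1 row in Table 7.1. For the other rows, small deviations are possible, since the tabulated values of q are rounded.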

7.4 Reducing the Memory Requirements of the PDLZW

In this section, we present two modifications of the PDLZW algorithm. The first one, called the recursive PDLZW algorithm, aims to reduce the memory requirements without sacrificing performance compared with the non-recursive PDLZW. The recursive PDLZW enables RAM-based implementations, which are beneficial for many applications. The second modification applies the word partitioning concept according to [136, 137] to further reduce the memory requirements of the recursive PDLZW algorithm. This concept results in a small compression loss compared with the non-recursive PDLZW algorithm.

7.4.1 Recursive PDLZW Algorithm

The operating principle of the PDLZW encoding is illustrated as a block diagram in the upper part of Figure 7.7. This diagram shows an example of the encoding with eight dictionaries. Note that dictionary 0 is the virtual dictionary, which does


Figure 7.7 Comparing the PDLZW with the recursive PDLZW


not require hardware resources and is not depicted in this figure. The height of the dictionaries represents their storage capacity. The storage capacity decreases with increasing string length, because the probability of occurrence is smaller for longer strings. The output includes the internal address of the dictionary and the dictionary number. As shown in Figure 7.7, each dictionary for strings of length l ≥ 3 only extends strings from the previous dictionary by one additional symbol. For example, dictionary 2 contains strings of length three. The first two symbols of these strings are already stored in dictionary 1. We exploit this fact to derive a memory-efficient implementation of the PDLZW encoding. This recursive PDLZW encoding is illustrated in the lower part of Figure 7.7. As depicted in this figure, the memory size for each dictionary with strings of length l ≥ 3 can be reduced by storing the address of the sub-string from the previous dictionary instead of the sub-string itself. The memory reduction increases for higher dictionary numbers, because longer sub-strings are represented by addresses. Furthermore, the number of dictionary entries is reduced for longer strings. Hence, fewer bits are required to store the address.

The recursive PDLZW encoding is similar to the PDLZW encoding described in Algorithm 7.1. The only difference is that the string s is not required, i.e. the string is represented by the last found address in the variable old and the newly received symbol c. Consequently, the string is replaced by the tuple s = [old|c]. For example, the first entries of the dictionaries 1–3 in Figure 7.4 will be [0x41|r], [0x03|s], and [0x00|e] instead of the strings “Ar”, “ees”, and “eese”. Algorithm 7.2 describes the decoding of the PDLZW for recursive and non-recursive encoding, because the two implementations of the PDLZW encoding result in the same codewords.

Similarly, we can use a recursive or non-recursive representation of the dictionaries for the decoding. The recursive representation is more memory efficient but requires several read operations to decode a complete string, because exactly one symbol is decoded per clock cycle. By contrast, the non-recursive implementation stores the complete strings in a single RAM, which requires more memory but provides a faster decoding process.

In [135], the authors introduced a PDLZW implementation based on CAM to perform the parallel search. One disadvantage of this implementation is that the resulting CAM module must be synthesized with circuits based on a large number of flip-flops, because most cell-based libraries and FPGA-based devices do not support these CAM modules. To overcome this issue, we propose a RAM-based implementation of the CAMs. The memory size of the RAM-based implementation in [147] increases exponentially with the CAM word size. Hence, such an implementation would not be feasible for the original PDLZW. However, the required word size for the proposed recursive PDLZW is significantly reduced compared with the original


proposal. This enables a RAM-based implementation of the CAMs. Corresponding implementation results are presented in Section 7.5. In the following, we introduce an efficient CAM-based hardware implementation for the recursive PDLZW algorithm that further reduces the memory requirements. In [136, 137], the concept of word partitioning was proposed to reduce the complexity of RAM-based implementations of CAMs with large addresses. We apply this concept to the dictionaries of the recursive PDLZW. In particular, we present a word partitioning that significantly reduces the memory size of the individual dictionaries such that RAM-based implementations become feasible. We consider only four dictionaries, because this is the most compact and simplest implementation. Furthermore, it enables a comparison with the results from [53]. However, the proposed concept can be extended to an arbitrary number of dictionaries.
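The recursive dictionary representation described in Section 7.4.1, where each entry stores the tuple [old|c] instead of the complete string, can be modeled as follows. This Python sketch is purely illustrative: the function name, the data layout (a mapping from dictionary number to a list of tuples), and the example entries are assumptions and do not correspond to the contents of Figure 7.4.

```python
def expand(dicts: dict[int, list[tuple[int, int]]], number: int, address: int) -> bytes:
    """Reconstruct the string stored at `address` of dictionary `number`.

    Each entry of dictionary k >= 1 is a tuple (old, c), where old is an
    address in dictionary k - 1 and c is the appended symbol. Dictionary 0
    is the virtual dictionary, whose address equals the symbol value.
    """
    symbols = []
    while number > 0:                    # one symbol is recovered per read operation
        old, c = dicts[number][address]
        symbols.append(c)
        number, address = number - 1, old
    symbols.append(address)              # virtual dictionary: address is the symbol
    return bytes(reversed(symbols))


# Hypothetical toy dictionaries: entry 0 of dictionary 1 is [0x41|r],
# entry 0 of dictionary 2 extends it by the symbol 'e'.
toy = {1: [(0x41, ord("r"))], 2: [(0, ord("e"))]}
print(expand(toy, 1, 0))                 # b'Ar'
print(expand(toy, 2, 0))                 # b'Are'
```

The loop makes the trade-off visible: a string of length l requires l − 1 dictionary reads in the recursive representation, whereas a non-recursive dictionary returns the complete string with a single read.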

7.4.2 Basic Concept of the Word Partitioning Technique

Figure 7.8 illustrates the basic concept of word partitioning. The input address with a width of C bits is divided into two sub-addresses, each of which is processed independently in a dedicated layer. Each layer is a small CAM, which is used to reduce the length

Figure 7.8 Word partitioning principle


of the address, i.e. we have $m \le C/2$ and $n \le C/2$. Finally, the resulting $m + n$ bits are concatenated to form the CAM address. If an incoming address appears for the first time in a layer, it is stored in the next free position. The output of the layer is the index of the entry where the input address is found. However, the number of entries in the layers is limited to $2^m$ and $2^n$, i.e. only the first $2^m$ and $2^n$ distinct addresses, respectively, can be stored. In the case where all entries are occupied and a new address appears, a miss occurs and the address cannot be stored. Note that the miss rate depends on the size of the layer and the statistics of the input addresses. Using the word partitioning for the PDLZW dictionaries may result in a higher miss rate for dictionary entries, which could affect the achievable compression gain. However, a low miss rate can be achieved with a proper dimensioning of the layers.
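The behavior of a layer can be summarized by a small software model. In the following Python sketch, the class and function names are assumptions chosen for illustration. A layer assigns indices to incoming sub-addresses in order of first occurrence and reports a miss (here `None`) once all of its entries are occupied.

```python
from typing import Optional


class Layer:
    """Small CAM that maps sub-addresses to short indices of out_bits bits."""

    def __init__(self, out_bits: int) -> None:
        self.capacity = 1 << out_bits        # at most 2**out_bits entries
        self.index_of: dict[int, int] = {}

    def lookup(self, address: int) -> Optional[int]:
        idx = self.index_of.get(address)
        if idx is None:
            if len(self.index_of) >= self.capacity:
                return None                  # miss: the layer is full
            idx = len(self.index_of)         # store at the next free position
            self.index_of[address] = idx
        return idx


def partitioned_address(upper_layer: Layer, lower_layer: Layer,
                        upper: int, lower: int, n_bits: int) -> Optional[int]:
    """Concatenate both layer outputs to an (m + n)-bit CAM address."""
    i = upper_layer.lookup(upper)
    j = lower_layer.lookup(lower)
    if i is None or j is None:
        return None                          # a miss in either layer
    return (i << n_bits) | j
```

With two 7-bit layers, for example, two 8-bit sub-addresses are mapped to a 14-bit address, provided that no more than 128 distinct values occur per layer.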

7.4.3 Dimensioning the Layers

The miss rate of a layer depends on the statistics of the addresses. To estimate this, we again consider real-world data from the Calgary and Canterbury corpora [139, 146]. Based on the probability of occurrence for the addresses, a suitable word partitioning for the individual dictionaries can be defined. Figure 7.9 shows the address distribution for the PDLZW for all of the Calgary test files. For this data, we used a PDLZW with four dictionaries. The data was processed in blocks of 1 kilobyte such that each block can be decoded independently. This results in a total of 3190 blocks for the complete Calgary corpus. Figure 7.9 depicts the number of unique address occurrences per block. With four dictionaries, an address space of 1024 addresses, and without a word partitioning technique, each dictionary contains at most 256 entries, although not all addresses occur within each encoded block. As can be seen in this figure, the actual number of unique entries is typically much smaller than the maximum value. For instance, consider the virtual dictionary. For this dictionary, an address value is equal to the 8-bit input data. Hence, the address statistics are determined by the input data. As can be seen in Figure 7.9, there are data blocks where more than 200 unique data symbols occur. However, in the vast majority of all cases, fewer than 128 unique symbols occur. This observation can be used to reduce the address space for dictionary 1. The input address of dictionary 1 comprises two entries from the virtual dictionary, each of which is represented by 8 bits. Now consider the structure in Figure 7.8 with layers that reduce the output addresses of the virtual dictionary from 8 to 7 bits. These two layers reduce the address width of dictionary 1 from 16


Figure 7.9 Distribution of the number of unique entries for the different dictionaries

to 14 bits. With a RAM-based CAM implementation, this word partitioning reduces the memory size for dictionary 1 by a factor of four, i.e. from $2^{16}$ to $2^{14}$ words. For most of the data blocks in the Calgary corpus, this word partitioning does not affect the compression gain because the layers are sufficiently large to represent all possible addresses. However, there are some data blocks where more than 128 unique addresses occur. In such a case, the layers can only represent the first 128 unique addresses. An address that is not found in the layer causes a miss. The corresponding symbol will be encoded with a codeword from the virtual dictionary and no data is lost. Nonetheless, the corresponding layer cannot provide a valid output address. Thus, this symbol cannot be stored in dictionary 1. If this symbol occurs more often, it will always be encoded with a codeword from the virtual dictionary. Consequently, the misses caused by the layers can lead to a loss in compression gain. However, in the next section we will see that the average loss in compression rate is very low.


The statistics of dictionary 1 are a special case because the average number of unique entries is high. As shown in Figure 7.9, the average value is close to 128. Hence, reducing the address space from 256 to 128 would cause a loss of many dictionary entries and a high compression loss. Nevertheless, the address space for dictionary 2 can be reduced. Each entry in dictionary 2 comprises one address from dictionary 1 and one address from the virtual dictionary. For the address from dictionary 1, we use m = C/2 = 8, but for the address from the virtual dictionary we can use n = 7, which reduces the address width of dictionary 2 from 16 to 15-bit and the memory size by a factor of two. For the remaining dictionaries, the average number of unique entries decreases (cf. Section 7.3). Hence, layers with smaller sizes are possible. For the addresses of dictionary 2, a layer with 128 entries suffices such that no loss occurs, whereas 64 entries are sufficient if a small loss is acceptable. Similarly, for dictionary 3, a layer with 64 entries would hardly cause any loss and 32 entries a very small loss. Detailed compression results for the Calgary and Canterbury corpora are discussed in Subsection 7.4.5.

7.4.4 Dictionary Architecture

The resulting dictionary structure is illustrated in Figure 7.10 for the recursive PDLZW with four dictionaries. Each entry of the dictionaries is uniquely addressed with 10 bits, where two bits indicate the dictionary and eight bits the address within the dictionary. In Figure 7.10, all input symbols pass through the sub-CAM 0, which is the layer for the virtual dictionary. Note that the input of dictionary 1 comprises two entries from the virtual dictionary. Hence, only one layer is required to implement the word partitioning for this dictionary. Similarly, the input of all other dictionaries consists of one address from the predecessor dictionary and one address from the virtual dictionary. Hence, sub-CAM 0 can be used as a layer for all addresses from the virtual dictionary. This reduces the hardware costs for the layers, because at most one layer is required per dictionary. As mentioned in the previous subsection, we keep the address width m = 8 for the entries of dictionary 1. Thus, there is only one further layer for the addresses of dictionary 2. This layer is sub-CAM 2 in Figure 7.10.


Figure 7.10 Structure of the PDLZW encoding with word partitioning


7.4.5 Implementation

A CAM can be realized using registers or RAM. The register-based variant has the advantage that the address of an entry can be found within one cycle. This is typically not possible with a RAM-based implementation because the RAM may require several cycles for the read operation. On the other hand, the hardware costs for a register-based CAM with large capacity are high. The sub-CAMs in Figure 7.10 require only 128 or 64 entries with a width of 8 bits. Hence, these small CAMs can be efficiently realized with registers. The implementation with registers also speeds up the encoding process. On the other hand, the actual dictionaries have to store 256 entries with a width between 13 and 15 bits, which would require a much larger number of flip-flops. However, with widths between 13 and 15 bits, these CAMs can be efficiently realized using RAM, where the largest RAM block is required for dictionary 2 with $2^{15}$ words. All RAMs have a word length of 8 bits to represent 256 entries. For the PDLZW with four dictionaries, a total of 57344 bytes of RAM for the dictionaries and 1536 registers for the layers are required. Note that the original PDLZW requires 20635 registers and the recursive PDLZW still requires 12442. A detailed comparison will be provided in the next section.
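The stated totals can be retraced with a short calculation. The following Python sketch rests on assumptions that are inferred from the text rather than stated explicitly: address widths of 14, 15, and 13 bits for dictionaries 1, 2, and 3 (the 13-bit width of dictionary 3 follows only from the stated range and the total), and two register-based sub-CAMs with 128 and 64 entries of 8 bits each.

```python
# Assumed address widths in bits of the RAM-based dictionaries 1 to 3.
address_bits = {1: 14, 2: 15, 3: 13}

# A RAM-based CAM needs one word per possible address.
ram_words = {d: 1 << bits for d, bits in address_bits.items()}

# With a word length of 8 bits, one word corresponds to one byte.
total_ram_bytes = sum(ram_words.values())

# Assumed sub-CAM sizes: 128 and 64 entries, each 8 bits wide.
layer_registers = 128 * 8 + 64 * 8

print(ram_words)          # {1: 16384, 2: 32768, 3: 8192}
print(total_ram_bytes)    # 57344
print(layer_registers)    # 1536
```

Under these assumptions, the sums agree with the 57344 bytes of RAM and the 1536 registers quoted above.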

7.5 Compression and Implementation Results

In this section, we first compare the data compression results of the proposed address space partitioning technique with those of the LZW algorithm and of the partitioning according to [53]. Moreover, we discuss synthesis results for the proposed PDLZW codec with four dictionaries. The synthesis results are obtained for the Xilinx Virtex 7 FPGA. Figure 7.11 illustrates compression rates of all three encoding schemes for the Calgary and Canterbury corpora. These results are depicted for different address spaces with varying input block length. In [53], no partitioning for 512 addresses is provided. However, this address space is interesting for short block lengths. Obviously, the LZW encoding always obtains the best compression results. As depicted in this figure, the proposed partitioning improves the compression compared with the partitioning proposed in [53]. For most cases, the PDLZW algorithm with the proposed partitioning achieves compression gains similar to the LZW encoding. In order to investigate the universality of the partitioning, we encode data from the Canterbury corpus with a partitioning derived from the Calgary corpus. The corresponding results are depicted in Figure 7.12. The dictionary partitioning from Table 7.2 with q ≤ 0.5 improves the compression compared with the partitioning


Figure 7.11 Compression results for the Calgary corpus with different input block lengths and different number of entries (address space)

proposed in [53], except for the case with address space 1024 and block length 1024. However, for blocks of length 1024, the address space with 512 entries is the better choice. For most cases, the PDLZW algorithm with the proposed partitioning achieves compression gains similar to the LZW encoding. Nevertheless, the proposed partitioning cannot be applied universally, since other data sources may result in a different dictionary partitioning. On the other hand, even significantly different data sources can result in the same dictionary partitioning due to the rounding to powers of two. Next, we consider the hardware size and operating frequency for the register-based FPGA implementations. The number of look-up tables, the number of registers, as well as the operating frequencies are listed in Table 7.4. First, we compare


Figure 7.12 Compression results for the Canterbury corpus with a partitioning derived from the Calgary corpus and q ≤ 0.5 for different input block lengths and different number of entries (address space)

the register-based implementation of the original PDLZW algorithm and the proposed recursive PDLZW approach with four dictionaries. As shown in this table, the recursive PDLZW reduces the number of registers by 40% and the logic by 30%. Due to the smaller logic, it can be operated at a higher frequency. Note that with the original proposal of the PDLZW [53], the number of registers increases with the number of dictionaries. For instance, with an address space of 1024 entries, the number of registers increases by 75% if the number of dictionaries is increased from four to ten. Table 7.4 also presents results for the proposed recursive PDLZW with ten dictionaries and a partitioning according to Table 7.2. These results demonstrate that the size of the proposed encoder hardly depends on the number of dictionaries.


Table 7.4 Hardware size for the register-based implementation

version                       LUT     FF      f_clk [MHz]
original PDLZW (4 dict.)      18275   20635   150
recursive PDLZW (4 dict.)     12908   12442   175
recursive PDLZW (10 dict.)    12963   12200   175

Table 7.5 Hardware size for the RAM-based implementation

version                                            LUT    FF     RAM
original PDLZW                         encoding    –      –      38806290430
recursive PDLZW without partitioning   encoding    39     41     1769472
                                       decoding    330    177    12288
recursive PDLZW with partitioning      encoding    2134   1689   516096
                                       decoding    2345   1770   12288

Finally, we consider the RAM-based implementations. The corresponding results are summarized in Table 7.5 for the encoder and decoder, where the decoder is also implemented recursively. The RAM size for the original PDLZW encoding is infeasible. Hence, no synthesis results can be provided. All RAM-based implementations can be operated with a frequency of 220 MHz. Comparing the recursive PDLZW with the proposed approach with address space partitioning, we note that the memory size for the recursive PDLZW is higher by a factor of 3.5. However, the address space partitioning requires more LUT and registers for the sub-CAMs. The word partitioning results in a small compression loss, as demonstrated by the results in Table 7.6. These results are the mean block size in bytes and they were obtained for a block size of 1024 bytes, an address space with 1024 entries, and four dictionaries. As can be seen, the word partitioning results in a very small compression loss compared with the original PDLZW.


Table 7.6 Mean block size for the PDLZW with word partitioning (4 dict.)

data         original PDLZW   proposed PDLZW
Calgary      691.6            691.8
Canterbury   561.9            562.0

The runtime of the encoding is determined by the number of clock cycles to process an input block, because all of the RAM-based implementations presented are operated with the same clock frequency. The numbers of clock cycles required for the original PDLZW and for both presented approaches are similar for the same number of valid dictionaries, since each input symbol is processed in a single clock cycle.

7.6 Discussion and Comparison with Other Data Compression Schemes

The parallel dictionary LZW algorithm improves the throughput of the LZW encoding by using multiple dictionaries, which makes the PDLZW algorithm very interesting for fast hardware implementations. However, it requires a proper partitioning of the address space to achieve a performance close to that of the original LZW algorithm. We have used a Markov model to derive such a dictionary partitioning. A single parameter of this model, i.e. the state transition probability, characterizes the stationary state distribution of the Markov chain. This distribution can be used to estimate the size of the parallel dictionaries. The numerical results presented demonstrate that the data model is suitable and that the proposed partitioning improves the compression gain of the PDLZW compared with the original proposal in [53]. The proposed architecture is scalable for different input block lengths. Nonetheless, the best achievable compression gain depends on the size of the address space that is used for the PDLZW dictionaries. We have demonstrated in Table 7.3 that address spaces with 512, 1024, and 2048 entries are suitable for both use cases in flash memory controllers discussed in Section 7.1, i.e. for input block lengths of 1, 2, 4, and 8 KB. Furthermore, two approaches were presented that enable efficient implementations and reduce the memory requirements of the original PDLZW. The proposed recursive dictionary structure reduces the memory requirements without affecting the compression gain of the PDLZW. This enables RAM-based dictionary implementations. The RAM size for the dictionaries can be further reduced with the proposed word partitioning technique, at the cost of a small compression loss. Both presented approaches achieve a compression rate close to that of the LZW algorithm.

In the following, we compare implementation results for these data compression schemes to illustrate the efficiency of the PDLZW architecture derived. As previously mentioned, the complexity of the BMH scheme is dominated by the implementation of the BWT algorithm. For reduced complexity, many implementations reduce the context sorting order k of this algorithm [141, 142]. This simplifies the encoding process, but the decoding process becomes more complex and requires additional calculation steps. We have introduced a RAM-based pipelined architecture for N = 128 input symbols, each of length 8 bits, and a context sorting order k = 4. The implementation results for this decoding architecture are depicted in the second row of Table 7.7. These results illustrate that the proposed BWT-decoding architecture obtains a high data throughput but still requires considerable logic and many registers. Since the pipeline stages in [62] are implemented using RAM blocks, we observe high RAM requirements. Furthermore, the required successive decoding stages introduce a decoding latency of order O(kN). For the proposed architecture with k = 4, nine consecutive stages, each with N = 128 symbols, are utilized, which introduce a total decoding latency of 1160 cycles or 5.8 µs. For the intended application in flash memory controllers with an input block length that amounts to a multiple of N, for example 1024 symbols, the resulting latency for the BWT decoding alone would be too high. This degrades the random access time and is infeasible for practical applications in flash memory devices. To reduce the complexity of the BMH scheme, the simplified MH scheme was suggested in [4].
In [48, 63], we have introduced a compact hardware architecture for the encoding and decoding of the MH scheme. The corresponding implementation results are given in rows 3 and 4 of Table 7.7, respectively. As shown in this table, the hardware architecture introduced provides low implementation costs, i.e. the encoder and decoder of the MH scheme together utilize a similar amount of hardware resources and RAM compared with the BWT decoder alone (without the BWT encoder). Nonetheless, the resulting data throughput is very low for applications in flash memory controllers. Furthermore, the presented architecture is derived for an input block length of 1 KB and cannot be scaled easily. Note that the encoder and decoder of the MH scheme are included in the BMH scheme, which also reduces the achievable data throughput. Finally, comparing the implementation results for the MH scheme and the PDLZW architecture including the proposed word partitioning technique in Table 7.7, we observe lower logic and register requirements and a higher data throughput for the PDLZW architecture. Moreover, according to Table 7.3, the proposed architecture is scalable for different input block lengths. The best compression gain is achieved by selecting the most suitable address space for the current input block length from Table 7.3. Note that the selection of the best address space (and the corresponding address space partitioning) can be efficiently implemented using simple multiplexing depending on the input block length. A drawback of the proposed PDLZW architecture is the high RAM size. Alternatively, the register-based recursive PDLZW architecture can be used to avoid high RAM requirements for a fast implementation, as presented in Table 7.4.

Table 7.7 Comparison with other data compression schemes

| architecture, reference | LUT | FF | RAM | data throughput [MB/s] |
|---|---|---|---|---|
| BWT-decoder [62] | 3991 | 4238 | 24960 | 198 |
| MH-encoder [48] | 2656 | 2211 | 10640 | 60 |
| MH-decoder [48] | 2675 | 2169 | 10640 | 60 |
| recursive PDLZW-encoder with partitioning | 2134 | 1689 | 516096 | 220 |
| recursive PDLZW-decoder with partitioning | 2345 | 1770 | 12288 | 220 |
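The statements drawn from Table 7.7 can be checked with a few lines of arithmetic. The following Python sketch only re-uses the figures quoted in this section; the dictionary layout and the helper names `combined` and `implied_clock_mhz` are illustrative assumptions, and the clock frequency is derived from the stated latency rather than given in the text.

```python
# Figures transcribed from Table 7.7 and the text of Section 7.6.
# The data layout and helper names are illustrative, not part of the design.
TABLE_7_7 = {
    "BWT-decoder":   {"LUT": 3991, "FF": 4238, "RAM": 24960,  "MB/s": 198},
    "MH-encoder":    {"LUT": 2656, "FF": 2211, "RAM": 10640,  "MB/s": 60},
    "MH-decoder":    {"LUT": 2675, "FF": 2169, "RAM": 10640,  "MB/s": 60},
    "PDLZW-encoder": {"LUT": 2134, "FF": 1689, "RAM": 516096, "MB/s": 220},
    "PDLZW-decoder": {"LUT": 2345, "FF": 1770, "RAM": 12288,  "MB/s": 220},
}


def combined(names, keys=("LUT", "FF", "RAM")):
    """Sum the resource figures of several architectures."""
    return {k: sum(TABLE_7_7[n][k] for n in names) for k in keys}


def implied_clock_mhz(cycles, latency_us):
    """Clock frequency implied by a latency given in cycles and microseconds."""
    return cycles / latency_us  # cycles per microsecond is the same as MHz


# MH encoder plus decoder versus the BWT decoder alone.
mh_codec = combined(["MH-encoder", "MH-decoder"])
bwt_decoder = combined(["BWT-decoder"])

# BWT decoding latency: nine stages of N = 128 symbols give a lower bound
# of 9 * 128 cycles; the text states 1160 cycles in total.
stage_cycles = 9 * 128
extra_cycles = 1160 - stage_cycles
clock_mhz = round(implied_clock_mhz(1160, 5.8))

if __name__ == "__main__":
    print("MH codec:   ", mh_codec)
    print("BWT decoder:", bwt_decoder)
    print("stage cycles:", stage_cycles, "extra cycles:", extra_cycles)
    print("implied clock [MHz]:", clock_mhz)
```

The sums show that the complete MH codec needs somewhat more logic than the BWT decoder alone but a comparable number of registers and slightly less RAM, while the PDLZW architecture reaches 220 MB/s against 60 MB/s for the MH scheme. The stated latency exceeds the nine-stage bound by a few cycles, which the text does not break down further.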

8

Conclusion

In this work, we have investigated asymmetric cryptography and data compression techniques for flash memory devices. Both approaches require very efficient algorithms due to the limited computational performance of flash memory controllers. For the asymmetric cryptography, we have focused on elliptic curves, where the computational complexity is determined by the point multiplication. We have presented a new τ-adic expansion algorithm that excludes all zero elements in the base conversion of the key. The resulting key expansion increases the robustness against simple power analysis and timing attacks. Furthermore, we have demonstrated that applying this expansion algorithm to Gaussian integer keys can reduce the complexity and memory requirements. Note that this concept can be extended to protect against statistical side channel attacks by modifying the point multiplication with the randomized initial point method [31]. The modulo reduction over Gaussian integers is computationally intensive. We have introduced two reduction algorithms for the Montgomery arithmetic over Gaussian integers using different norms. The first reduction utilizes the absolute value to measure the size of Gaussian integers, while the second one replaces the absolute value by the Manhattan weight to reduce the complexity. The reduction with the Manhattan weight obtains at most two congruent solutions, where the correct one can be established by utilizing the absolute value. Hence, this reduction is useful when many intermediate results have to be calculated, e.g. for the ECC point multiplication or RSA systems. To evaluate the benefits of Gaussian integer fields for calculating the point multiplication, we have presented two coprocessor designs, one for ordinary integer fields and one using the proposed arithmetic. We have demonstrated that the coprocessor based on the Montgomery arithmetic over Gaussian integers is a competitive solution.
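To make the notion of a modulo reduction over Gaussian integers concrete, the following Python sketch reduces a Gaussian integer modulo π = a + bi with p = a² + b² by rounding the quotient z·π*/p. This is the plain rounding-based reduction, swapped in here for illustration; it is not the Montgomery reduction summarized above. The function names, the tuple representation (real, imaginary) and the example prime p = 13 with π = 3 + 2i are assumptions made for this sketch.

```python
# Minimal sketch of the rounding-based modulo reduction over Gaussian
# integers. NOT the Montgomery reduction of the thesis; names, the tuple
# representation and the example prime are illustrative assumptions.

def round_div(n: int, d: int) -> int:
    """Integer division of n by d > 0, rounded to the nearest integer."""
    return (2 * n + d) // (2 * d)


def gmul(u, v):
    """Product of two Gaussian integers given as (real, imaginary)."""
    return (u[0] * v[0] - u[1] * v[1], u[0] * v[1] + u[1] * v[0])


def gaussian_mod(z, pi):
    """Return z mod pi as z - round(z * conj(pi) / p) * pi, p = |pi|^2."""
    a, b = pi
    p = a * a + b * b
    t = gmul(z, (a, -b))                       # z * conj(pi)
    q = (round_div(t[0], p), round_div(t[1], p))
    qpi = gmul(q, pi)
    return (z[0] - qpi[0], z[1] - qpi[1])


def absolute_square(z):
    """Squared absolute value |z|^2."""
    return z[0] * z[0] + z[1] * z[1]


def manhattan_weight(z):
    """Manhattan weight |Re(z)| + |Im(z)|."""
    return abs(z[0]) + abs(z[1])


PI = (3, 2)    # 3^2 + 2^2 = 13, and 13 mod 4 = 1
P = 13

# Map the integers 0..12 to their Gaussian integer representatives.
residues = [gaussian_mod((m, 0), PI) for m in range(P)]

if __name__ == "__main__":
    for m, r in enumerate(residues):
        print(m, r, absolute_square(r), manhattan_weight(r))
```

The 13 integers map to 13 distinct representatives whose squared absolute value is at most p/2, and multiplication of representatives followed by a reduction agrees with multiplication modulo p. The Manhattan weight is included only to show the cheaper size measure; the selection between congruent candidates described in the text is not reproduced here.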
This coprocessor enables high flexibility by supporting arbitrary primes of the form p mod 4 = 1 and different key representations. All investigations for Gaussian integer fields can also be undertaken for other fields over complex numbers, such as Eisenstein integer fields, which can be constructed for primes of the form p mod 6 = 1 [98].

Data compression can improve the lifetime of flash memory devices. To support different compression strategies, a universal data compression algorithm is needed that is scalable regarding the input block length. We have considered the parallel dictionary LZW (PDLZW) algorithm, since it provides a better trade-off for the requirements of the intended application compared with other data compression schemes [4, 48, 49, 62]. The PDLZW algorithm can achieve a compression gain close to that of the LZW algorithm. However, the PDLZW algorithm needs a proper address space partitioning. We have derived a technique to adjust the partitioning to a data model. The PDLZW algorithm reduces the complexity of the LZW algorithm, but it has high memory requirements. We have proposed two concepts for the memory reduction. The first one is based on organizing the content of the dictionaries recursively, which significantly reduces the memory size without sacrificing the compression gain. The second concept applies a word partitioning technique, which leads to an additional size reduction at the cost of a small compression loss.

© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2021 M. Safieh, Algorithms and Architectures for Cryptography and Source Coding in Non-Volatile Flash Memories, Schriftenreihe der Institute für Systemdynamik (ISD) und optische Systeme (IOS), https://doi.org/10.1007/978-3-658-34459-7_8
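As a reminder of the baseline that the PDLZW algorithm is compared against, the following Python sketch implements the classical LZW algorithm [51] with a single, unbounded dictionary. It is not the PDLZW architecture of this work: there are no parallel dictionaries, no address space partitioning and no memory limit. The function names `lzw_encode` and `lzw_decode` are chosen for this sketch.

```python
# Classical LZW with one unbounded dictionary, shown only as the baseline.
# NOT the PDLZW scheme: no parallel dictionaries, no address space limits.

def lzw_encode(data: bytes) -> list[int]:
    """Encode a byte string into a list of dictionary indices."""
    dictionary = {bytes([i]): i for i in range(256)}
    word = b""
    codes = []
    for byte in data:
        candidate = word + bytes([byte])
        if candidate in dictionary:
            word = candidate                 # keep extending the match
        else:
            codes.append(dictionary[word])   # emit the longest match
            dictionary[candidate] = len(dictionary)
            word = bytes([byte])
    if word:
        codes.append(dictionary[word])
    return codes


def lzw_decode(codes: list[int]) -> bytes:
    """Rebuild the byte string, reconstructing the dictionary on the fly."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    word = dictionary[codes[0]]
    out = [word]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            entry = word + word[:1]          # code refers to the entry being built
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        dictionary[len(dictionary)] = word + entry[:1]
        word = entry
    return b"".join(out)


if __name__ == "__main__":
    block = b"ABABABA"
    codes = lzw_encode(block)
    print(codes)
    print(lzw_decode(codes) == block)
```

In this baseline every new word is looked up in one growing dictionary, which is the search that a hardware implementation has to bound; the address space partitioning and the memory reduction concepts summarized above address exactly this dictionary storage.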

Bibliography

[1] R. Micheloni, L. Crippa, and A. Marelli, Inside NAND Flash Memories, Springer Netherlands, 2010.
[2] D. Richter, Flash Memories, Springer, 2014.
[3] P. Swierczynski, M. Fyrbiak, P. Koppe, A. Moradi, and C. Paar, “Interdiction in practice–hardware Trojan against a high-security USB flash drive,” Journal of Cryptographic Engineering, vol. 7, no. 3, pp. 199–211, 2017.
[4] M. Rajab, Channel and source coding for non-volatile flash memories, Springer, 2020.
[5] Y. Wang, W. Yu, S. Wu, G. Malysa, G. E. Suh, and E. C. Kan, “Flash memory for ubiquitous hardware security functions: True random number generation and device fingerprints,” in 2012 IEEE Symposium on Security and Privacy, 2012, pp. 33–47.
[6] X. Gong, Z. Dai, W. Li, and L. Feng, “Design and implementation of a SoC for privacy storage equipment,” in Proceedings 2011 International Conference on Transportation, Mechanical, and Electrical Engineering (TMEE), 2011, pp. 435–438.
[7] L. Dolecek and Y. Cassuto, “Channel coding for nonvolatile memory technologies: Theoretical advances and practical considerations,” Proceedings of the IEEE, vol. 105, no. 9, pp. 1705–1724, Sept. 2017.
[8] S. Subha, “An algorithm for secure deletion in flash memories,” in 2009 2nd IEEE International Conference on Computer Science and Information Technology, 2009, pp. 260–262.
[9] N. Ahn and D. H. Lee, “Schemes for privacy data destruction in a NAND flash memory,” IEEE Access, vol. 7, pp. 181305–181313, 2019.
[10] S. Kim and Y. Cho, “The design and implementation of flash cryptographic file system based on YAFFS,” in 2008 International Conference on Information Science and Security (ICISS 2008), 2008, pp. 62–65.
[11] T. Unterluggauer, M. Werner, and S. Mangard, “Meas: memory encryption and authentication secure against side-channel attacks,” Journal of Cryptographic Engineering, vol. 9, no. 2, pp. 137–158, 2019.
[12] S. Skorobogatov, “Local heating attacks on flash memory devices,” in 2009 IEEE International Workshop on Hardware-Oriented Security and Trust, 2009, pp. 1–6.
[13] A. Wang, Z. Li, X. Yang, and Y. Yu, “New attacks and security model of the secure flash disk,” Mathematical and Computer Modelling, vol. 57, no. 11, pp. 2605–2612, 2013, Information System Security and Performance Modeling and Simulation for Future Mobile Networks.

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2021 M. Safieh, Algorithms and Architectures for Cryptography and Source Coding in Non-Volatile Flash Memories, Schriftenreihe der Institute für Systemdynamik (ISD) und optische Systeme (IOS), https://10.1007/978-3-658-34459-7


[14] D. Hankerson, A. J. Menezes, and S. Vanstone, Guide to Elliptic Curve Cryptography, Springer, New York, 2003.
[15] J. Katz and Y. Lindell, Introduction to Modern Cryptography (Chapman & Hall/Crc Cryptography and Network Security Series), Chapman & Hall/CRC, 2007.
[16] G. Locke and P. Gallagher, “Digital signature standard (DSS),” Standard FIPS PUB 186-3, National Institute of Standards and Technology, 2009.
[17] “Public key cryptography for the financial services industry, key agreement and key transport using elliptic curve cryptography,” Standard ANSI X9.63, American National Standards Institute, 1998.
[18] K. Huber, “Codes over Gaussian integers,” IEEE Transactions on Information Theory, pp. 207–216, 1994.
[19] J. Freudenberger, F. Ghaboussi, and S. Shavgulidze, “New coding techniques for codes over Gaussian integers,” IEEE Transactions on Communications, vol. 61, no. 8, pp. 3114–3124, Aug. 2013.
[20] P. M. Matutino, J. Araújo, L. Sousa, and R. Chaves, “Pipelined FPGA coprocessor for elliptic curve cryptography based on residue number system,” in International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), July 2017, pp. 261–268.
[21] Y. Ma, Q. Zhang, Z. Liu, C. Tu, and J. Lin, “Low-cost hardware implementation of elliptic curve cryptography for general prime fields,” in Information and Communications Security, Kwok-Yan Lam, Chi-Hung Chi, and Sihan Qing, Eds., 2016, Lecture Notes in Computer Science, pp. 292–306, Springer International Publishing.
[22] A. Salman, A. Ferozpuri, E. Homsirikamol, P. Yalla, J. Kaps, and K. Gaj, “A scalable ECC processor implementation for high-speed and lightweight with side-channel countermeasures,” in 2017 International Conference on ReConFigurable Computing and FPGAs (ReConFig), Dec. 2017, pp. 1–8.
[23] K. C. C. Loi and S. Ko, “Scalable elliptic curve cryptosystem FPGA processor for NIST prime curves,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 11, pp. 2753–2756, Nov. 2015.
[24] P. H. W. Leong and I. K. H. Leung, “A microcoded elliptic curve processor using FPGA technology,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10, no. 5, pp. 550–559, Oct. 2002.
[25] P. Kocher, “Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems,” in Annual International Cryptology Conference, 1996.
[26] P. C. Kocher, J. Jaffe, and B. Jun, “Differential power analysis,” in Proceedings of the 19th Annual International Cryptology Conference on Advances in Cryptology, Berlin, Heidelberg, 1999, CRYPTO ’99, pp. 388–397, Springer-Verlag.
[27] L. Goubin, “A refined power-analysis attack on elliptic curve cryptosystems,” in International Workshop on Public Key Cryptography. Springer, 2003, pp. 199–211.
[28] T. Akishita and T. Takagi, “Zero–value point attacks on elliptic curve cryptosystem,” in Information Security, Colin Boyd and Wenbo Mao, Eds., Berlin, Heidelberg, 2003, pp. 218–233, Springer Berlin Heidelberg.
[29] B. Möller, “Securing elliptic curve point multiplication against side-channel attacks,” in International Conference on Information Security. Springer, 2001, pp. 324–334.

[30] K. Jarvinen, M. Tommiska, and J. Skytta, “A scalable architecture for elliptic curve point multiplication,” in IEEE International Conference on Field-Programmable Technology, Dec. 2004, pp. 303–306.
[31] M. Hedabou, P. Pinel, and L. Bénéteau, “Countermeasures for preventing comb method against SCA attacks,” in Information Security Practice and Experience, Robert H. Deng, Feng Bao, HweeHwa Pang, and Jianying Zhou, Eds., Berlin, Heidelberg, 2005, pp. 85–96, Springer Berlin Heidelberg.
[32] Z. Tao, F. Mingyu, and Z. Xiaoyu, “Secure and efficient elliptic curve cryptography resists side-channel attacks,” Journal of Systems Engineering and Electronics, vol. 20, no. 3, pp. 660–665, 2009.
[33] C. Heuberger and M. Mazzoli, “Symmetric digit sets for elliptic curve scalar multiplication without precomputation,” Theoretical Computer Science, vol. 547, pp. 18–33, 2014.
[34] C. Vuillaume, K. Okeya, and T. Takagi, “Efficient representations on Koblitz curves with resistance to side channel attacks,” in Information Security and Privacy, Colin Boyd and Juan Manuel González Nieto, Eds., Berlin, Heidelberg, 2005, pp. 218–229, Springer Berlin Heidelberg.
[35] N. Thériault, “SPA resistant left-to-right integer recodings,” in Selected Areas in Cryptography, B. Preneel and S. Tavares, Eds., Berlin, Heidelberg, 2006, pp. 345–358, Springer Berlin Heidelberg.
[36] S. Liu, H. Yao, and X. A. Wang, “SPA resistant scalar multiplication based on addition and tripling indistinguishable on elliptic curve cryptosystem,” in 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), Nov. 2015, pp. 785–790.
[37] A. Koval, “Algorithm for Gaussian integer exponentiation,” in Information Technology: New Generations. 2016, pp. 1075–1085, Springer International Publishing.
[38] A. Koval, Security systems based on Gaussian integers: Analysis of basic operations and time complexity of secret transformations, Dissertation, New Jersey Institute of Technology, 2011.
[39] P. L. Montgomery, “Modular multiplication without trial division,” Mathematics of Computation, vol. 44, no. 170, pp. 519–521, 1985.
[40] C. Martinez, R. Beivide, and E. Gabidulin, “Perfect codes for metrics induced by circulant graphs,” IEEE Transactions on Information Theory, vol. 53, no. 9, pp. 3042–3052, 2007.
[41] Y. Park and J.-S. Kim, “zFTL: power-efficient data compression support for NAND flash-based consumer electronics devices,” IEEE Transactions on Consumer Electronics, vol. 57, no. 3, pp. 1148–1156, Aug. 2011.
[42] N. Xie, G. Dong, and T. Zhang, “Using lossless data compression in data storage systems: Not for saving space,” IEEE Transactions on Computers, vol. 60, no. 3, pp. 335–345, Mar. 2011.
[43] M. Martina, C. Condo, G. Masera, and M. Zamboni, “A joint source/channel approach to strengthen embedded programmable devices against flash memory errors,” IEEE Embedded Systems Letters, vol. 6, no. 4, pp. 77–80, Dec. 2014.
[44] J. Li, K. Zhao, X. Zhang, J. Ma, M. Zhao, and T. Zhang, “How much can data compressibility help to improve NAND flash memory lifetime?,” in Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST15), Feb. 2015, pp. 227–240.
[45] J. Freudenberger, A. Beck, and M. Rajab, “A data compression scheme for reliable data storage in non-volatile memories,” in IEEE 5th International Conference on Consumer Electronics (ICCE), Sept. 2015, pp. 139–142.
[46] T. Ahrens, M. Rajab, and J. Freudenberger, “Compression of short data blocks to improve the reliability of non-volatile flash memories,” in International Conference on Information and Digital Technologies (IDT), July 2016, pp. 1–4.
[47] J. Freudenberger, M. Rajab, and S. Shavgulidze, “A channel and source coding approach for the binary asymmetric channel with applications to MLC flash memories,” in 11th International ITG Conference on Systems, Communications and Coding (SCC), Hamburg, Feb. 2017, pp. 1–4.
[48] J. Freudenberger, M. Rajab, D. Rohweder, and M. Safieh, “A codec architecture for the compression of short data blocks,” Journal of Circuits, Systems, and Computers (JCSC), vol. 27, no. 2, pp. 1–17, Feb. 2018.
[49] J. Freudenberger, M. Rajab, and S. Shavgulidze, “A source and channel coding approach for improving flash memory endurance,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 5, pp. 981–990, May 2018.
[50] J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding,” IEEE Transactions on Information Theory, vol. 24, no. 5, pp. 530–536, Sept. 1978.
[51] T. A. Welch, “A technique for high-performance data compression,” Computer, vol. 17, no. 6, pp. 8–19, June 1984.
[52] M.-B. Lin, “A parallel VLSI architecture for the LZW data compression algorithm,” in Proceedings of Technical Papers. International Symposium on VLSI Technology, Systems, and Applications, June 1997, pp. 98–101.
[53] M.-B. Lin, “A hardware architecture for the LZW compression and decompression algorithms based on parallel dictionaries,” Journal of VLSI signal processing systems for signal, image and video technology, vol. 26, no. 3, pp. 369–381, Nov. 2000.
[54] M. Safieh, J. Thiers, and J. Freudenberger, “Side channel attack resistance of the elliptic curve point multiplication using Gaussian integers,” in 2020 Zooming Innovation in Consumer Technologies Conference (ZINC), Aug. 2020, pp. 231–236.
[55] M. Safieh, J. Thiers, and J. Freudenberger, “A compact coprocessor for the elliptic curve point multiplication over Gaussian integers,” Electronics, vol. 9, no. 12, pp. 2050–2071, Dec. 2020.
[56] M. Safieh and J. Freudenberger, “Montgomery modular arithmetic over Gaussian integers,” in 2020 24th International Conference on Information Technology (IT), Apr. 2020, pp. 1–4.
[57] M. Safieh and J. Freudenberger, “Montgomery reduction for Gaussian integers,” Cryptography, vol. 5, no. 1, pp. 6–24, Feb. 2021.
[58] J. Thiers, M. Safieh, and J. Freudenberger, “An elliptic curve cryptographic coprocessor for resource-constrained systems with arithmetic over Solinas primes and arbitrary prime fields,” in 2020 Zooming Innovation in Consumer Technologies Conference (ZINC), Aug. 2020, pp. 313–318.
[59] M. Safieh, J. Thiers, and J. Freudenberger, “Area efficient coprocessor for the elliptic curve point multiplication,” in 12th International ITG Conference on Systems, Communications and Coding (SCC), Feb. 2019, pp. 1–6.

[60] M. Safieh and J. Freudenberger, “Address space partitioning for the parallel dictionary LZW data compression algorithm,” in 16th Canadian Workshop on Information Theory (CWIT), July 2019, pp. 1–6.
[61] M. Safieh and J. Freudenberger, “Efficient VLSI architecture for the parallel dictionary LZW data compression algorithm,” IET Circuits, Devices & Systems, vol. 13, no. 5, pp. 576–583, 2019.
[62] M. Safieh and J. Freudenberger, “Pipelined decoder for the limited context order Burrows–Wheeler transformation,” IET Circuits, Devices & Systems, vol. 13, no. 1, pp. 31–38, 2019.
[63] M. Safieh, D. Rohweder, and J. Freudenberger, “Implementierung einer speichereffizienten Huffman-Decodierung,” in 58th Multi Project Chip (MPC) Workshop, July 2018, vol. 58/59, pp. 27–31.
[64] J. Crenne, R. Vaslin, G. Gogniat, J.P. Diguet, R. Tessier, and D. Unnikrishnan, “Configurable memory security in embedded systems,” ACM Transactions on Embedded Computing Systems (TECS), vol. 12, no. 3, pp. 1–23, 2013.
[65] PUB FIPS, “197 federal information processing standards publication,” Advanced Encryption Standard (AES), National Institute of Standards and Technology, 2001.
[66] H. Schmidt and M. Schwabl-Schmidt, Eine konkrete Darstellung von AES, pp. 39–52, Springer Fachmedien Wiesbaden, Wiesbaden, 2017.
[67] M. Krisell, Elliptic Curve Digital Signatures in RSA Hardware, Scholar’s Press, 2012.
[68] N. Koblitz, “Elliptic curve cryptosystems,” Mathematics of Computation, vol. 48, no. 177, pp. 203–209, Jan. 1987.
[69] V. S. Miller, “Use of elliptic curves in cryptography,” in Advances in Cryptology — CRYPTO ’85 Proceedings, Hugh C. Williams, Ed., Berlin, Heidelberg, 1986, pp. 417–426, Springer Berlin Heidelberg.
[70] D. J. Bernstein, “Curve25519: new Diffie-Hellman speed records,” in Public Key Cryptography – PKC 2006 (9th International Conference on Practice and Theory in Public-Key Cryptography, New York NY, USA, April 24–26, 2006, Proceedings), M. Yung, Y. Dodis, A. Kiayias, and T. Malkin, Eds., Germany, 2006, Lecture Notes in Computer Science, pp. 207–228, Springer.
[71] D. Brown, “Standards for efficient cryptography, sec 1: elliptic curve cryptography,” Released Standard Version, vol. 1, 2009.
[72] C. Paar and J. Pelzl, Kryptografie verständlich: ein Lehrbuch für Studierende und Anwender, Springer-Verlag Berlin Heidelberg, 2016.
[73] Q. H. Dang, “Secure hash standard,” Standard ANSI X9.63, The Federal Information Processing Standards Publication Series of the National Institute of Standards and Technology (NIST), 2015.
[74] M. Bossert, Einführung in die Nachrichtentechnik, De Gruyter, 2012.
[75] D. Amiet, A. Curiger, and P. Zbinden, “Flexible FPGA-based architectures for curve point multiplication over GF(p),” in 2016 Euromicro Conference on Digital System Design (DSD), Aug. 2016, pp. 107–114.
[76] T. Oliveira, D. F. Aranha, J. López, and F. Rodríguez-Henríquez, “Fast point multiplication algorithms for binary elliptic curves with and without precomputation,” in Selected Areas in Cryptography – SAC 2014, Antoine Joux and Amr Youssef, Eds., Cham, 2014, pp. 324–344, Springer International Publishing.


[77] Henk C. A. van Tilborg, Fundamentals of cryptology, Norwell: Kluwer Academic, 2000.
[78] D. M. Schinianakis, A. P. Fournaris, H. E. Michail, A. P. Kakarountas, and T. Stouraitis, “An RNS implementation of an F_p elliptic curve point multiplier,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 56, no. 6, pp. 1202–1213, June 2009.
[79] É. Brier and M. Joye, “Weierstraß elliptic curves and side-channel attacks,” in Public Key Cryptography, David Naccache and Pascal Paillier, Eds., Berlin, Heidelberg, 2002, pp. 335–345, Springer Berlin Heidelberg.
[80] J. López and R. Dahab, “Improved algorithms for elliptic curve arithmetic in GF(2^n),” in International Workshop on Selected Areas in Cryptography. Springer, 1998, pp. 201–212.
[81] M. S. Albahri and M. Benaissa, “Parallel Comba multiplication in GF(2^163) using homogenous multicore microcontroller,” in IEEE International Conference on Electronics, Circuits, and Systems (ICECS), Dec. 2015, pp. 641–644.
[82] M. N. Hassan and M. Benaissa, “Embedded software design of scalable low-area elliptic-curve cryptography,” IEEE Embedded Systems Letters, vol. 1, no. 2, pp. 42–45, Aug. 2009.
[83] W. N. Chelton and M. Benaissa, “Fast elliptic curve cryptography on FPGA,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 2, pp. 198–205, Feb. 2008.
[84] Z. U. A. Khan and M. Benaissa, “Throughput/area-efficient ECC processor using Montgomery point multiplication on FPGA,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 11, pp. 1078–1082, Nov. 2015.
[85] H. Li, “Modeling of threshold voltage distribution in NAND flash memory: A Monte Carlo method,” IEEE Transactions on Electron Devices, vol. 63, no. 9, pp. 3527–3532, Sept. 2016.
[86] Z. U. A. Khan and M. Benaissa, “High-speed and low-latency ECC processor implementation over GF(2^m) on FPGA,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 1, pp. 165–176, Jan. 2017.
[87] D. B. Roy and D. Mukhopadhyay, “High-speed implementation of ECC scalar multiplication in GF(p) for generic Montgomery curves,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 7, pp. 1587–1600, July 2019.
[88] V. S. Dimitrov, K. U. Järvinen, M. J. Jacobson, W. F. Chan, and Z. Huang, “FPGA implementation of point multiplication on Koblitz curves using Kleinian integers,” in Cryptographic Hardware and Embedded Systems – CHES 2006, L. Goubin and M. Matsui, Eds., Berlin, Heidelberg, 2006, pp. 445–459, Springer Berlin Heidelberg.
[89] J. Lutz and A. Hasan, “High performance FPGA based elliptic curve cryptographic co-processor,” in International Conference on Information Technology: Coding and Computing (ITCC), Apr. 2004, vol. 2, pp. 486–492.
[90] L. Li and S. Li, “Improved algorithms and implementations for integer to τNAF conversion for Koblitz curves,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 1, pp. 154–162, Jan. 2018.
[91] M. S. Hossain, E. Saeedi, and Y. Kong, “High-speed, area-efficient, FPGA-based elliptic curve cryptographic processor over NIST binary fields,” in IEEE International Conference on Data Science and Data Intensive Systems, Dec. 2015, pp. 175–181.

[92] A. Mahboob and N. Ikram, “Faster polynomial basis finite field squaring and inversion for GF(2^m) with cryptographic software application,” in International Symposium on Biometrics and Security Technologies, Apr. 2008, pp. 1–6.
[93] V. Diekert, M. Kufleitner, G. Rosenberger, and U. Hertrampf, Discrete Algebraic Methods: Arithmetic, Cryptography, Automata and Groups, De Gruyter, 2016.
[94] N. Koblitz, “An elliptic curve implementation of the finite field digital signature algorithm,” in Advances in Cryptology (CRYPTO), Santa Barbara, CA, USA, 1998, pp. 327–337.
[95] J. A. Solinas, “Efficient arithmetic on Koblitz curves,” Designs, Codes and Cryptography, vol. 19, pp. 195–249, 2000.
[96] C. Heuberger, R. Avanzi, and H. Prodinger, “Redundant τ-adic expansions I: nonadjacent digit sets and their applications to scalar multiplication,” Des. Codes Cryptogr., pp. 173–202, 2011.
[97] R. Gallant, R. Lambert, and S. Vanstone, “Faster point multiplication on elliptic curves with efficient endomorphisms,” in Advances in Cryptology — CRYPTO 2001, J. Kilian, Ed., Berlin, Heidelberg, 2001, pp. 190–200, Springer Berlin Heidelberg.
[98] K. Huber, “Codes over Eisenstein-Jacobi integers,” Contemporary Mathematics, pp. 165–179, Jan. 1994.
[99] C. Martinez, E. Stafford, R. Beivide, and E. Gabidulin, “Perfect codes over Lipschitz integers,” in IEEE International Symposium on Information Theory (ISIT), June 2007, pp. 1366–1370.
[100] J. Freudenberger and S. Shavgulidze, “New four-dimensional signal constellations from Lipschitz integers for transmission over the Gaussian channel,” IEEE Transactions on Communications, vol. 63, no. 7, pp. 2420–2427, July 2015.
[101] M. Güzeltepe, “Codes over Hurwitz integers,” Discrete Mathematics, vol. 313, pp. 704–714, 2013.
[102] J. Thiers, M. Safieh, and J. Freudenberger, “Side channel attack resistance of the elliptic curve point multiplication using Eisenstein integers,” in 10th IEEE International Conference of Consumer Technology (ICCE), Oct. 2020.
[103] A. Koval and B. S. Verkhovsky, “Analysis of RSA over Gaussian integers algorithm,” in Fifth International Conference on Information Technology: New Generations (ITNG), Apr. 2008, pp. 101–105.
[104] H. Elkamchouchi, K. Elshenawy, and H. Shaban, “Extended RSA cryptosystem and digital signature schemes in the domain of Gaussian integers,” in The 8th International Conference on Communication Systems (ICCS), Nov. 2002, vol. 1, pp. 91–95.
[105] Y. Awad, A. N. El-Kassar, and T. Kadri, “Rabin public-key cryptosystem in the domain of Gaussian integers,” in International Conference on Computer and Applications (ICCA), Aug. 2018, pp. 336–340.
[106] K. Bhargava and V. Soni, “A novice cryptosystem based on nth root of Gaussian integers,” in 2017 International Conference on Computer, Communications and Electronics (Comptelix), July 2017, pp. 271–274.
[107] D. Rohweder, J. Freudenberger, and S. Shavgulidze, “Low-density parity-check codes over finite Gaussian integer fields,” in 2018 IEEE International Symposium on Information Theory (ISIT), June 2018, pp. 481–485.


[108] J. Freudenberger, F. Ghaboussi, and S. Shavgulidze, “Set partitioning and multilevel coding for codes over Gaussian integer rings,” in 9th International ITG Conference on Systems, Communications and Coding (SCC), Munich, Jan. 2013, pp. 1–5.
[109] C. Quilles and R. Palazzo Jr., “Quasi-perfect geometrically uniform codes derived from graphs over Gaussian integer rings,” in IEEE International Symposium on Information Theory (ISIT), Austin, Texas, June 2010.
[110] M. Esmaeildoust, D. Schinianakis, H. Javashi, T. Stouraitis, and K. Navi, “Efficient RNS implementation of elliptic curve point multiplication over GF(p),” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 8, pp. 1545–1549, Aug. 2013.
[111] J. Solinas, “Generalized Mersenne prime,” National Security Agency, Ft. Meade, MD, USA, 2005.
[112] K. Bigou and A. Tisserand, “Improving modular inversion in RNS using the plus-minus method,” in Cryptographic Hardware and Embedded Systems – CHES 2013, G. Bertoni and J.-S. Coron, Eds., Berlin, Heidelberg, 2013, pp. 233–249, Springer Berlin Heidelberg.
[113] S. A. Mozhi and P. Ramya, “Efficient bit-parallel systolic multiplier over GF(2^m),” in International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), Mar. 2016, pp. 4899–4902.
[114] C. C. Wang, T. K. Truong, H. M. Shao, L. J. Deutsch, J. K. Omura, and I. S. Reed, “VLSI architectures for computing multiplications and inverses in GF(2^m),” IEEE Transactions on Computers, vol. C-34, no. 8, pp. 709–717, Aug. 1985.
[115] C. H. Kim, N. S. Chang, and Y. I. Cho, “Modified sequential multipliers for type-k Gaussian normal bases,” in Fifth FTRA International Conference on Multimedia and Ubiquitous Engineering, June 2011, pp. 220–225.
[116] A. P. Fournaris and O. Koufopavlou, “Low area elliptic curve arithmetic unit,” in IEEE International Symposium on Circuits and Systems, May 2009, pp. 1397–1400.
[117] H. Wu, “Bit-parallel finite field multiplier and squarer using polynomial basis,” IEEE Transactions on Computers, vol. 51, no. 7, pp. 750–758, July 2002.
[118] A. P. Fournaris and O. Koufopavlou, “Creating an elliptic curve arithmetic unit for use in elliptic curve cryptography,” in IEEE International Conference on Emerging Technologies and Factory Automation, Sept. 2008, pp. 1457–1464.
[119] C. S. Yeh, I. S. Reed, and T. K. Truong, “Systolic multipliers for finite fields GF(2^m),” IEEE Transactions on Computers, vol. C-33, no. 4, pp. 357–360, Apr. 1984.
[120] M. Schmalisch and D. Timmermann, “A reconfigurable arithmetic logic unit for elliptic curve cryptosystems over GF(2^m),” in 46th Midwest Symposium on Circuits and Systems, Dec. 2003, vol. 2, pp. 831–834.
[121] L. Song and K. K. Parhi, “Low-energy digit-serial/parallel finite field multipliers,” Journal of VLSI signal processing systems for signal, image and video technology, vol. 19, no. 2, pp. 149–166, 1998.
[122] W. Drescher, Schaltungsanordnung zur Galoisfeld-Arithmetik, Dissertation TU Dresden, 2005.
[123] T. Itoh and S. Tsujii, “A fast algorithm for computing multiplicative inverses in GF(2^m) using normal bases,” Information and Computation, vol. 78, no. 3, pp. 171–177, 1988.
[124] J. H. Guo and C. L. Wang, “Systolic array implementation of Euclid’s algorithm for inversion and division in GF(2^m),” in IEEE International Symposium on Circuits and Systems. Circuits and Systems Connecting the World (ISCAS), May 1996, vol. 2, pp. 481–484.
[125] H. Brunner, A. Curiger, and M. Hofstetter, “On computing multiplicative inverses in GF(2^m),” IEEE Transactions on Computers, vol. 42, no. 8, pp. 1010–1015, Aug. 1993.
[126] J. Katz, A. J. Menezes, P. C. Van Oorschot, and S. A. Vanstone, Handbook of applied cryptography, CRC press, 1996.
[127] V. Trujillo-Olaya, J. Velasco-Medina, and J. C. López-Hernández, “Design of polynomial basis multipliers over GF(2^233),” in XIII-IBERCHIP WORKSHOP (IWS 2007), 2007.
[128] D. F. Djusdek, H. Studiawan, and T. Ahmad, “Adaptive image compression using adaptive Huffman and LZW,” in 2016 International Conference on Information Communication Technology and Systems (ICTS), Oct. 2016, pp. 101–106.
[129] S. Singh and P. Pandey, “Enhanced LZW technique for medical image compression,” in 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), Mar. 2016, pp. 1080–1084.
[130] A. Yazdanpanah and M. R. Hashemi, “A new compression ratio prediction algorithm for hardware implementations of LZW data compression,” in 2010 15th CSI International Symposium on Computer Architecture and Digital Systems, Sept. 2010, pp. 155–156.
[131] L. Li and S. Li, “High-performance pipelined architecture of elliptic curve scalar multiplication over GF(2^m),” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 4, pp. 1223–1232, Apr. 2016.
[132] R. Samanta and R. N. Mahapatra, “An enhanced CAM architecture to accelerate LZW compression algorithm,” in 20th International Conference on VLSI Design held jointly with 6th International Conference on Embedded Systems (VLSID’07), Jan. 2007, pp. 824–829.
[133] M. B. Lin and Y. Y. Chang, “A new architecture of a two-stage lossless data compression and decompression algorithm,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 9, pp. 1297–1303, Sept. 2009.
[134] X. Zhou, Y. Ito, and K. Nakano, “An efficient implementation of LZW decompression in the FPGA,” in 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2016, pp. 599–607.
[135] M.-B. Lin, J.-F. Lee, and G. E. Jan, “A lossless data compression and decompression algorithm and its hardware architecture,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 9, pp. 925–936, Sept. 2006.
[136] Z. Ullah, K. Ilgon, and S. Baeg, “Hybrid partitioned SRAM-based ternary content addressable memory,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 59, no. 12, pp. 2969–2979, Dec. 2012.
[137] Z. Ullah, M. K. Jaiswal, and R. C. C. Cheung, “Z-TCAM: An SRAM-based architecture for TCAM,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 2, pp. 402–406, Feb. 2015.
[138] G. Dong, N. Xie, and T. Zhang, “Enabling NAND flash memory use soft-decision error correction codes at minimal read latency overhead,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 60, no. 9, pp. 2412–2421, 2013.
[139] M. Powell, “Evaluating lossless compression methods,” in New Zealand Computer Science Research Students’ Conference, Canterbury, 2001, pp. 35–41.


[140] M. Burrows and D. Wheeler, A block-sorting lossless data compression algorithm, SRC Research Report 124, Digital Systems Research Center, Palo Alto, CA, 1994.
[141] M. Schindler, “A fast block-sorting algorithm for lossless data compression,” in Data Compression Conference, Mar. 1997, p. 469.
[142] B. Balkenhol, S. Kurtz, and Y. M. Shtarkov, “Modifications of the Burrows and Wheeler data compression algorithm,” in Proceedings Data Compression Conference (DCC99), Mar. 1999, pp. 188–197.
[143] P. Elias, “Interval and recency rank source coding: Two on-line adaptive variable-length schemes,” IEEE Transactions on Information Theory, vol. 33, no. 1, pp. 3–10, Jan. 1987.
[144] D. A. Huffman, “A method for the construction of minimum-redundancy codes,” Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, Sept. 1952.
[145] R. B. Ash, Information Theory, Dover Publications Inc., New York, 1991.
[146] T. C. Bell, J. G. Cleary, and I. H. Witten, Text compression, Prentice Hall, Englewood Cliffs, NJ, 1990.
[147] S. V. Kartalopoulos, “RAM-based associative content-addressable memory device, method of operation thereof and ATM communication switching system employing the same,” Aug. 2000, US Patent 6,097,724.