Topics in Computational Number Theory Inspired by Peter L. Montgomery


Table of contents:

Contents
Contributors
Preface
1 Introduction
1.1 Outline
1.2 Biographical Sketch
1.3 Overview
1.4 Simultaneous Inversion
2 Montgomery Arithmetic from a Software Perspective
2.1 Introduction
2.2 Montgomery Multiplication
2.2.1 Interleaved Montgomery Multiplication
2.2.2 Using Montgomery Arithmetic in Practice
2.2.3 Computing the Montgomery Constants μ and R^2
2.2.4 On the Final Conditional Subtraction
2.2.5 Montgomery Multiplication in F_{2^k}
2.3 Using Primes of a Special Form
2.3.1 Faster Modular Reduction with Primes of a Special Form
2.3.2 Faster Montgomery Reduction with Primes of a Special Form
2.4 Concurrent Computing of Montgomery Multiplication
2.4.1 Related Work on Concurrent Computing of Montgomery Multiplication
2.4.2 Montgomery Multiplication Using SIMD Extensions
2.4.3 A Column-Wise SIMD Approach
2.4.4 Montgomery Multiplication Using the Residue Number System Representation
3 Hardware Aspects of Montgomery Modular Multiplication
3.1 Introduction and Summary
3.2 Historical Remarks
3.3 Montgomery's Novel Modular Multiplication Algorithm
3.4 Standard Acceleration Techniques
3.5 Shifting the Modulus N
3.5.1 The Classical Algorithm
3.5.2 Montgomery
3.6 Interleaving Multiplication Steps with Modular Reduction
3.7 Accepting Inaccuracy in Quotient Digits
3.7.1 Traditional
3.7.2 Bounding the Partial Product
3.7.3 Montgomery
3.7.4 Summary
3.8 Using Redundant Representations
3.8.1 Traditional
3.8.2 Montgomery
3.9 Changing the Size of the Hardware Multiplier
3.10 Shifting an Operand
3.10.1 Traditional
3.10.2 Montgomery
3.11 Precomputing Multiples of B and N
3.12 Propagating Carries and Carry-Save Inputs
3.13 Scaling the Modulus
3.14 Systolic Arrays
3.14.1 A Systolic Array for A×B
3.14.2 Scalability
3.14.3 A Linear Systolic Array
3.14.4 A Systolic Array for Modular Multiplication
3.15 Side-Channel Concerns and Solutions
3.16 Logic Gate Technology
3.17 Conclusion
4 Montgomery Curves and the Montgomery Ladder
4.1 Introduction
4.2 Fast Scalar Multiplication on the Clock
4.2.1 The Lucas Ladder
4.2.2 Differential Addition Chains
4.3 Montgomery Curves
4.3.1 Montgomery Curves as Weierstrass Curves
4.3.2 The Group Law for Weierstrass Curves
4.3.3 Other Views of the Group Law
4.3.4 Edwards Curves and Their Group Law
4.3.5 Montgomery Curves as Edwards Curves
4.3.6 Elliptic-Curve Cryptography (ECC)
4.3.7 Examples of Noteworthy Montgomery Curves
4.4 Doubling Formulas without y
4.4.1 Doubling: The Weierstrass View
4.4.2 Optimized Doublings
4.4.3 A Word of Warning: Projective Coordinates
4.4.4 Completeness of Generic Doubling Formulas
4.4.5 Doubling: The Edwards View
4.5 Differential-Addition Formulas
4.5.1 Differential Addition: The Weierstrass View
4.5.2 Optimized Differential Addition
4.5.3 Quasi-Completeness
4.5.4 Differential Addition: The Edwards View
4.6 The Montgomery Ladder
4.6.1 The Montgomery Ladder Step
4.6.2 Constant-Time Ladders
4.6.3 Completeness of the Ladder
4.7 A Two-Dimensional Ladder
4.7.1 Introduction to the Two-Dimensional Ladder
4.7.2 Recursive Definition of the Two-Dimensional Ladder
4.7.3 The Odd-Odd Pair in Each Line: First Addition
4.7.4 The Even-Even Pair in Each Line: Doubling
4.7.5 The Other Pair in Each Line: Second Addition
4.8 Larger Differences
4.8.1 Examples of Large-Difference Chains
4.8.2 CFRC, PRAC, etc.
4.8.3 Allowing d to Vary
5 General Purpose Integer Factoring
5.1 Introduction
5.2 General Purpose Factoring
5.2.1 Two-Step Approach
5.2.2 Smoothness and L-notation
5.2.3 Generic Analysis
5.2.4 Smoothness Testing
5.2.5 Finding Dependencies
5.2.6 Filtering
5.2.7 Overall Effort
5.3 Presieving General Purpose Factoring
5.3.1 Dixon's Random Squares Method
5.3.2 Continued Fraction Method
5.4 Linear and Quadratic Sieve
5.4.1 Linear Sieve
5.4.2 Quadratic Sieving: Plain
5.4.3 Quadratic Sieving: Fancy
5.4.4 Multiple Polynomial Quadratic Sieve
5.5 Number Field Sieve
5.5.1 Earlier Methods to Compute Discrete Logarithms
5.5.2 Special Number Field Sieve
5.5.3 General Number Field Sieve
5.5.4 Coppersmith's Modifications
5.6 Provable Methods
6 Polynomial Selection for the Number Field Sieve
6.1 The Problem
6.2 Early Methods
6.3 General Remarks
6.4 A Lattice Based Method
6.5 Skewness
6.6 Base m Method and Skewness
6.7 Root Sieve
6.8 Later Developments
7 The Block Lanczos Algorithm
7.1 Linear Systems for Integer Factoring
7.2 The Standard Lanczos Algorithm
7.3 The Case of Characteristic Two
7.4 Orthogonalizing a Sequence of Subspaces
7.5 Construction of the Next Iterate
7.6 Simplifying the Recurrence Equation
7.7 Termination
7.8 Implementation in Parallel
7.9 Recent Developments
8 FFT Extension for Algebraic-Group Factorization Algorithms
8.1 Introduction
8.2 FFT Extension for the Elliptic Curve Method
8.2.1 The Product Tree Algorithm
8.2.2 The POLYEVAL Algorithm
8.2.3 The POLYGCD Algorithm
8.2.4 Choice of Points of Evaluation
8.2.5 A Numerical Example
8.3 FFT Extension for the p − 1 and p + 1 Methods
8.3.1 Constructing F(X) by Scaling and Multiplying
8.3.2 Evaluation of a Polynomial Along a Geometric Progression
9 Cryptographic Pairings
9.1 Preliminaries
9.1.1 Elliptic Curves
9.1.2 Pairings
9.1.3 Pairing-Friendly Elliptic Curves
9.2 Finite Field Arithmetic for Pairings
9.2.1 Montgomery Multiplication
9.2.2 Multiplication in Extension Fields
9.2.3 Finite Field Inversions
9.2.4 Simultaneous Inversions
9.3 Affine Coordinates for Pairing Computation
9.3.1 Costs for Doubling and Addition Steps
9.3.2 Working over Extension Fields
9.3.3 Simultaneous Inversions in Pairing Computation
9.4 The Double-Add Operation and Parabolas
9.4.1 Description of the Algorithm
9.4.2 Application to Scalar Multiplication
9.4.3 Application to Pairings
9.5 Squared Pairings
9.5.1 The Squared Weil Pairing
9.5.2 The Squared Tate Pairing
Bibliography
Subject Index


Topics in Computational Number Theory Inspired by Peter L. Montgomery

Peter L. Montgomery has made significant contributions to computational number theory, introducing many basic tools such as Montgomery multiplication, Montgomery simultaneous inversion, Montgomery curves, and the Montgomery ladder. This book features state-of-the-art research in computational number theory related to Montgomery's work and its impact on computational efficiency and cryptography. It covers a wide range of topics such as Montgomery multiplication for both hardware and software implementations; Montgomery curves and twisted Edwards curves as proposed in the latest standards for elliptic curve cryptography; and cryptographic pairings. This book provides a comprehensive overview of integer factorization techniques, including dedicated chapters on polynomial selection, the block Lanczos method, and the FFT extension for algebraic-group factorization algorithms. Graduate students and researchers in applied number theory and cryptography will benefit from this survey of Montgomery's work.

Joppe W. Bos is a cryptographic researcher at the Innovation Center for Cryptography & Security at NXP Semiconductors. He also currently serves as the Secretary of the International Association for Cryptologic Research (IACR). His research focuses on computational number theory and high-performance arithmetic as used in public-key cryptography.

Arjen K. Lenstra is Professor of Computer Science at École Polytechnique Fédérale de Lausanne. His research focuses on cryptography and computational number theory, especially in areas such as integer factorization. He was closely involved in the development of the number field sieve method for integer factorization as well as several other cryptologic results. He is the recipient of the Excellence in the Field of Mathematics RSA Conference 2008 Award and a Fellow of the International Association for Cryptologic Research (IACR).


Topics in Computational Number Theory Inspired by Peter L. Montgomery

Edited by
Joppe W. Bos, NXP Semiconductors, Leuven, Belgium
Arjen K. Lenstra, EPFL, Lausanne, Switzerland


University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
4843/24, 2nd Floor, Ansari Road, Daryaganj, Delhi - 110002, India
79 Anson Road, #06-04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge. It furthers the University's mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781107109353
DOI: 10.1017/9781316271575

© Joppe W. Bos and Arjen K. Lenstra 2017

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2017
Printed in the United States of America by Sheridan Books, Inc.

A catalogue record for this publication is available from the British Library

Library of Congress Cataloging-in-Publication data
Names: Bos, Joppe W., editor. | Lenstra, A. K. (Arjen K.), 1956– editor.
Title: Topics in computational number theory inspired by Peter L. Montgomery / edited by Joppe W. Bos, NXP Semiconductors, Belgium; Arjen K. Lenstra, EPFL, Lausanne, Switzerland.
Description: Cambridge : Cambridge University Press, 2017. | Series: London Mathematical Society lecture note series | Includes bibliographical references and index.
Identifiers: LCCN 2017023049 | ISBN 9781107109353 (pbk. : alk. paper)
Subjects: LCSH: Number theory. | Cryptography – Mathematics. | Montgomery, Peter L., 1947–
Classification: LCC QA241 .T657 2017 | DDC 512.7 – dc23
LC record available at https://lccn.loc.gov/2017023049

ISBN 978-1-107-10935-3 Paperback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party Internet Web sites referred to in this publication and does not guarantee that any content on such Web sites is, or will remain, accurate or appropriate.


Contributors

Joppe W. Bos, NXP Semiconductors, Leuven, Belgium
Arjen K. Lenstra, EPFL, Lausanne, Switzerland
Herman te Riele, CWI, Amsterdam, Netherlands
Daniel Shumow, Microsoft Research, Redmond, USA
Peter L. Montgomery, Self
Colin D. Walter, Royal Holloway, University of London, Egham, United Kingdom
Daniel J. Bernstein, University of Illinois at Chicago, Chicago, USA and Technische Universiteit Eindhoven, Eindhoven, The Netherlands
Tanja Lange, Technische Universiteit Eindhoven, Eindhoven, The Netherlands
Thorsten Kleinjung, University Leipzig, Leipzig, Germany and EPFL, Lausanne, Switzerland
Emmanuel Thomé, Inria, Nancy, France
Richard P. Brent, Australian National University, Canberra, Australia
Alexander Kruppa, Technische Universität München, München, Germany
Paul Zimmermann, Inria/LORIA, Nancy, France
Kristin Lauter, Microsoft Research, Redmond, USA
Michael Naehrig, Microsoft Research, Redmond, USA


Preface

This book was written in honor of Peter L. Montgomery and his inspirational contributions to computational number theory. The editors would like to extend their sincerest thanks to all authors for their enthusiastic response to our invitation to contribute, and to Nicole Verna for the cover design.


1 Introduction
Joppe W. Bos, Arjen K. Lenstra, Herman te Riele, and Daniel Shumow

1.1 Outline

This introductory chapter collects some personal background on Peter L. Montgomery, both provided by himself and based on stories from his colleagues and collaborators. This is followed by a brief introduction to the other chapters, and by a description of Peter's simultaneous inversion method.

1.2 Biographical Sketch

Peter was born in San Francisco, California, on September 25, 1947. He described his activities since his 1965 graduation from San Rafael High School (SRHS) in a letter for a 1995 high-school reunion; this letter is copied in its entirety in Figure 1.1, and further commented on throughout this chapter.

Figure 1.1 Peter's letter for his 1995 high-school reunion.

At Berkeley, Peter's undergraduate advisor was Derrick H. Lehmer, an excellent match given Peter's research interests since high school. Ron Graham, one of Peter's coauthors on his first three papers (all on Ramsey theory and written shortly after his time in Berkeley), recalls [164] that Peter had the "breakthrough idea" for the first of these papers, with coauthor Bruce Rothschild remembering Peter "as an extremely clever guy who came up with surprising ideas" [295]. The papers in question ([144], [145] and [146]) were further coauthored by Paul Erdös, Joel Spencer, and Ernst Straus. After this auspicious start, Peter's next paper describes an algorithm that he developed at SDC to evaluate arbitrary Fortran boolean expressions in such a way that they result in a small number of instructions when compiled on a Control Data 7600 computer [250]. The paper is illustrative of the type of tinkering in which Peter excels.

About twenty years after his time in Berkeley, and by the time that Peter was already "internationally famous" (cf. Figure 1.1), David G. Cantor supervised Peter's PhD dissertation An FFT extension of the elliptic curve method

of factorization [254]. Milos Ercegovac, who served on the PhD committee, related [143] that Cantor referred to Peter as “a genius UCLA never saw before,” how Peter was brilliant in solving engineering problems related to computer arithmetic, and that, as a result of his incisive reviews, authors on several occasions asked if they could include this “marvelous” reviewer as a coauthor because Peter’s comments invariably provided better solutions than


Figure 1.2 Peter L. Montgomery: (a) at age 15; (b) in 2009 (from Peter's collection and Wikipedia).

those given in the submissions. Another member of the committee, Murray Schacher, recalled [299] how Peter "regularly swamped the system" in what may have been his first venture into large-scale computing: verifying reducibility of X^{8k} + X^m + 1 ∈ F_2[X] for large k and 0 < m < 8k. For this project Peter taught himself the Cantor–Zassenhaus method because the non-sparse 8k × 8k matrix required by the usual Berlekamp algorithm surpassed the system's capacity [34, 87].

Peter did not accept the offer for a "research position in a New Jersey community lacking sidewalks" (cf. Figure 1.1). The offer was made after Peter's successful visits at the Bellcore research facility in Morristown, NJ, during the summers of 1994 and 1995. During his stays Peter worked with the second author's summer internship students Scott Contini and Phong Nguyen on implementing Peter's new methods that were useful for the number field sieve integer factorization algorithm (cf. Section 1.3 below): block Lanczos with Scott in 1994 and the next year the final square root method with Phong (who for his project had to learn about integer lattices and lattice basis reduction, and did so only very reluctantly – after a scolding by Ramarathnam "Venkie" Venkatesan). A memorable event was a lab outing where Peter beat Phong at bowling and afterwards bought ice cream for the whole party.


In 1998 Peter accepted a position with the Security and Cryptography group at Microsoft Research (MSR) in Redmond, WA. He retired in 2014. His work at MSR was largely focused on pure research (cf. [227], [99], [259], [137], [138], and [73]). However, Peter began implementing libraries for asymmetric encryption for use in products as well. Peter's first contribution was to write Diffie–Hellman key exchange and the Digital Signature Algorithm in the static bignum library. This was included in the default Microsoft Windows CAPI 1.0 library dssenh.dll, which shipped in Windows 2000. A later version of this library, msbignum, also included RSA as well as elliptic curve cryptography. In 2006, with Windows Vista, msbignum became the basis of the API for the Cryptographic Next Generation and underlay in the following decade all cryptography running on Windows at Microsoft. Peter's role may have been unique at MSR at that time: over the course of the organization's existence there have been very few teams at MSR, let alone individual researchers, delivering entire functional software libraries.

During his time at MSR, Peter continued collaborations with researchers world-wide and was a regular long-term guest at various universities and research institutes. The first three authors of this chapter have fond memories of Peter's visits (and are not the only ones who have lugged around Peter's considerable suitcase when regular cabs were not up to the task [164]) – as Milos Ercegovac put it: a modest man with a sense of humor, a gentle chuckle when people make silly technical comments [143]. Numerous papers were the result of his visits: several record factorizations (140 digits [91], 512 bits [92], 768 bits [209, 211], and Mersenne numbers [71]); security estimates [68]; elliptic curve discrete logarithm cryptanalysis using Playstation 3 game consoles [70, 69]; improved stage two for Pollard's p ± 1 method [261]; better elliptic curve factoring parameter selection [26]; and results on the period of Bell numbers [262].

While discussing his work at SDC in the early software industry Peter liked to proudly show a copy of The System Builders: The Story of SDC [32] with his name displayed on the front cover, about an inch underneath the title (cf. Figure 1.3). Peter never showed a similar pride in his more celebrated algorithmic and mathematical results – possibly because his work as a software engineer greatly motivated his mathematical research and he invented new approaches just to make his implementations more efficient. This is illustrated by Peter's explanation how he got the inspiration for his paper "Modular multiplication without trial division" [251], arguably one of his most widely used results: when implementing textbook multiprecision multiplication in assembler on a PDP series computer, he noticed there were unused registers and


Figure 1.3 Front cover of [32] (© System Development Corporation, 1981).

wanted to find a way to exploit them. He managed to do so by interleaving the multiplication with the modular reduction. When Peter began working at SDC, he programmed using punch cards on time-shared mainframe computers. By the time he had retired from MSR, he had implemented and optimized arithmetic code that was running on the mobile processors in smartphones. Peter worked as a programmer from the earliest stages of software engineering, when computers needed to be held in their own buildings, until the era of ubiquitous computing where people carry computing devices in their pockets. Both his algorithmic and mathematical contributions and his work as a programmer were instrumental to these massive advances that spanned his career – a career of more than forty years, impressive by the standards of the software industry.

1.3 Overview

The subsequent chapters of this book are on active research areas that were initiated by Peter or to which he has made substantial contributions. This section contains an overview of the subjects that are treated.

As mentioned above, Peter's 1985 method for modular multiplication without trial division [251] was invented almost serendipitously when Peter tried to put otherwise idle registers to good use. Because it allows simple implementation both in software and in hardware, Montgomery multiplication (as the method is now commonly referred to) has found broad application. Its software and hardware aspects are covered in Chapters 2 and 3, respectively.

Peter's contributions as described and expanded upon in Chapters 4 through 8 were all inspired by integer factorization, his main research interest since high school. In [252, section 10.3.1] Peter describes how his original implementation


of the elliptic curve method of integer factorization [239] uses the Weierstrass equation with affine coordinates, and how the modular inversion required for the point addition can be made faster if several curves are run at once (cf. Section 1.4). He then writes "The author later discovered an alternative parametrization that requires no inversions" and continues with the description of what is now known as Montgomery curves and the Montgomery ladder. The former is an elliptic curve parametrization that allows fast computation of the sum of two points if their difference is known and that does not require modular inversion. The latter refers to the addition chain that can then be used to compute any scalar multiple of a point. Initially these results were relevant solely to those implementing the elliptic curve integer factorization method, but the later growing importance of elliptic curve cryptography increased the interest in Peter's and yet other curve parametrizations and addition chains. Chapter 4 describes the developments in this field that have taken place since the publication of [252].

In parallel with his work on special-purpose integer factoring methods in [252], Peter also heavily influenced the two most important methods of the more generic integer factorization algorithms that are described in Chapter 5: the quadratic sieve and the number field sieve, two methods that were initially plagued by severe performance problems [282, 234]. Although Peter was not the first to propose a practical version of the quadratic sieve, his multiple polynomial quadratic sieve (as published by Robert Silverman in [315] and described in Chapter 5) represented the state of the art in integer factorization from the mid 1980s until the early 1990s. It was later enhanced by self-initialization [287], but by that time the number field sieve had taken over as the record-breaking factoring method, a position it holds to the present day. Peter played a prominent role in that development, contributing important ideas to all steps of the number field sieve: polynomial selection, (early) sieving, building a matrix, processing the matrix, and the final square root calculation. A brief summary follows, with Chapter 5 describing in more detail how the steps fit together.

In the first step of the number field sieve two or more polynomials must be found that have a root in common modulo the number to be factored. Though theoretically such polynomials are easily found, small modifications may trigger orders of magnitude reductions of the resulting factoring effort. Finding good polynomials has become a subject in its own right since Peter's early contributions (many of which were first published in Brian Murphy's 1999 PhD dissertation [266]). Chapter 6 is entirely devoted to the current state of the art of polynomial selection for the number field sieve.


The number field sieve (and many other sieving-based integer factorization methods) relies on the ability to quickly recognize which values in an arithmetic progression have many small factors. Chapter 5 briefly mentions some of Peter's many tricks which are now part of the standard repertoire to speed up this so-called sieving step. His improvements to the earliest line sieving software for the number field sieve by the second author were used for numerous integer factorizations in the 1990s [78, 235]. For most composites, however, line sieving has since been surpassed by lattice sieving [281].

The sieving step results in a large, very sparse matrix. In the matrix step dependencies among the columns of this matrix must be found. Given its sparsity, it is advantageous to first transform the matrix into an equivalent matrix that is smaller but less sparse. Originally, when regular Gaussian elimination was still used for the matrix step, a set of ad hoc methods referred to as structured Gaussian elimination [226, 286] was used to find a transformed matrix with the smallest possible number of rows, but without paying attention to its density. The more sophisticated tools that became available for the matrix step in the mid 1990s required an adaptation of the matrix transformation method that would minimize the product of the number of rows of the matrix and its number of non-zero entries. These days this is referred to as filtering. The details of the various filtering steps proposed by Peter (and based on the earlier work in [226, 286]) can be found in [89, 90] and are sketched in Section 5.2.6.

The above "more sophisticated tools" are the block Wiedemann and block Lanczos methods, based on the classical Wiedemann and Lanczos methods. They were almost simultaneously and independently developed by Don Coppersmith and Peter, respectively, and have comparable performance [106, 257]. Though both methods allow efficient distributed implementation, for block Lanczos all processors must be able to communicate quickly (i.e., a single cluster with a fast interconnection network), whereas block Wiedemann can be made to run efficiently on a modest number of disjoint clusters. Almost since its inception in the mid 1990s, block Lanczos became the method of choice for all academic factoring projects. Around 2005 it became more convenient for these applications to use a number of separate clusters for the matrix step, which implied a switch to block Wiedemann [209, 211]. Chapter 7 describes the block Lanczos method in detail.

For the number field sieve each column dependency produced by the matrix step requires a square root calculation in the number fields involved before a factorization of the targeted composite may be derived. In [255] Peter presented an iterative approach that cleverly exploits the special form of the squared algebraic numbers, thereby replacing Jean-Marc Couveignes' much slower and


more restrictive method [113]. Peter's method is used, essentially unmodified, to the present day, and is sketched in Section 5.5.3.

Returning to special-purpose integer factorization, in the last paragraph of his p − 1 paper [278] John Pollard mentions "we would theoretically use the fast Fourier transform on Step 2," concluding his paper with "I cannot say at what point this becomes a practical proposition." In [252, section 4] Peter refers to Pollard's remark, and further pursues the idea for Pollard's p − 1 method, first with Silverman in [263] and later with Alexander Kruppa in [261]. For the elliptic curve factoring method it is the subject of Peter's PhD dissertation [254]. Peter's work on this subject is presented in Chapter 8.

In Chapter 9 Peter's contributions to cryptographic pairings are outlined. Given Peter's simultaneous inversion method (described below in Section 1.4) and his fast modular inversion, it became attractive, contrary to conventional wisdom, to replace projective coordinates by affine ones for applications involving multiple or parallelized pairings, products of pairings, or pairings at high security levels [137, 138, 227]. Moreover, Peter coauthored a paper introducing algorithms to compute the squared pairing, which has the advantage, compared to Victor Miller's algorithm, that its computation cannot fail [248].

1.4 Simultaneous Inversion

For completeness, this section describes Peter Montgomery's simultaneous inversion method, a popular method that did not lead to any follow-up work and which is therefore not treated in the later chapters.

At a computational number theory conference held in August 1985 at Humboldt State University in Arcata, CA, Peter gave a presentation about his experience using the elliptic curve integer factorization method ([239], informally announced in February 1985). He proposed to maximize the overall throughput of the method as opposed to minimizing the latency per curve, thus exploiting the method's inherent parallelism. The idea is simple: at a modest overhead the cost of the modular inversion required for each elliptic curve group operation in the Weierstrass model can be shared among any number of simultaneously processed curves, assuming all curves target the same number to be factored. It has become known as simultaneous inversion, and was later briefly mentioned in [252, section 10.3.1]:

    When working over several curves, the program used a scheme similar to (1.2) to do all the inversions at once, since (1/x) = y(1/xy) and (1/y) = x(1/xy). This reduces the asymptotic cost of an inversion to that of 3 multiplications.


The “scheme similar to (1.2)” suggests that Peter took his inspiration from John Pollard's observation that the cost of a number of greatest common divisor calculations with the same to-be-factored modulus can be reduced to that same number of multiplications modulo the same modulus, plus a single gcd calculation. In the inversion context, there may have been a historical precedent for Peter's trick, with Richard Schroeppel recalling he may have seen a similar method to avoid floating divides in pre-1970s Fortran programs. But even if it is not entirely original, Peter's insight cleverly adapts an old idea to a regime where numerics are exact [303]. These days, simultaneous inversion is, for instance, used in record computations of elliptic curve discrete logarithms (e.g., [69] and the record attempt [20]; it was not used in [344]).

Let z_1, z_2, ..., z_k ∈ Z/nZ for some modulus n be the elements that must be inverted; in the rare case that at least one of the inverses does not exist, simultaneous inversion fails, and an appropriate alternative approach is used. Working in Z/nZ and with w_0 = 1, first for i = 1, 2, ..., k in succession calculate and store w_i = w_{i-1} z_i. Next, calculate w = w_k^{-1}. Finally, for i = k, k − 1, ..., 1 in succession first compute z_i^{-1} as w w_{i-1} and next replace w by w z_i (so that w = w_{i-1}^{-1}). The overall cost is 3(k − 1) multiplications, a single inversion, and storage for k temporary values, all in Z/nZ. Given that for relevant modulus sizes and software implementations of arithmetic in Z/nZ inversion is at least five times costlier than multiplication, the resulting speed-up can be significant. One should be aware, though, that in hardware the difference can be made smaller [200].
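As an illustration (not Peter's original code, and with a function name of our choosing), the method can be sketched in a few lines of Python; pow(x, -1, n), available since Python 3.8, provides the single modular inversion.

def simultaneous_inversion(z, n):
    # Invert all elements of z modulo n using one inversion and
    # 3(k - 1) modular multiplications; w[i] = z[0]*...*z[i-1] mod n.
    k = len(z)
    w = [1] * (k + 1)
    for i in range(1, k + 1):
        w[i] = (w[i - 1] * z[i - 1]) % n
    inv = pow(w[k], -1, n)  # the single inversion; raises ValueError if some z_i is not invertible
    result = [0] * k
    for i in range(k, 0, -1):
        result[i - 1] = (inv * w[i - 1]) % n  # z_i^-1 = w * w_{i-1}
        inv = (inv * z[i - 1]) % n            # now inv = w_{i-1}^-1
    return result

# Example: simultaneous_inversion([3, 5, 11], 97) returns [65, 39, 53].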


2 Montgomery Arithmetic from a Software Perspective
Joppe W. Bos and Peter L. Montgomery

We propose a representation of residue classes so as to speed modular multiplication without affecting the modular addition and subtraction algorithms. – Peter L. Montgomery [251]

Abstract. This chapter describes Peter L. Montgomery's modular multiplication method and the various improvements to reduce the latency for software implementations on devices that have access to many computational units.

2.1 Introduction

Modular multiplication is a fundamental arithmetical operation, for instance when computing in a finite field or a finite ring, and forms one of the basic operations underlying almost all currently deployed public-key cryptographic protocols. The efficient computation of modular multiplication is an important research area, since optimizations, even ones resulting in a constant-factor speedup, have a direct impact on our day-to-day information security life. In this chapter we review the computational aspects of Peter L. Montgomery's modular multiplication method [251] (known as Montgomery multiplication) from a software perspective (while the next chapter highlights the hardware perspective).

Throughout this chapter we use the terms digit and word interchangeably. To be more precise, we typically assume that a b-bit non-negative multi-precision integer X is represented by an array of n = ⌈b/w⌉ computer words as

    X = Σ_{i=0}^{n-1} x_i r^i


(the so-called radix-r representation), where r = 2^w for the word size w of the target computer architecture and 0 ≤ x_i < r. Here x_i is the ith word of the integer X. Let N be the positive modulus consisting of n digits while the input values to the modular multiplication method (A and B) are non-negative integers smaller than N and consist of up to n digits.

When computing modular multiplication C = AB mod N, the definitional approach is first to compute the product P = AB. Next, a division is computed to obtain P = NQ + C such that both C and Q are non-negative integers less than N. Knuth studies such algorithms for multi-precision non-negative integers [215, Alg. 4.3.1.D]. Counting word-by-word instructions, the method described by Knuth requires O(n^2) multiplications and O(n) divisions when implemented on a computer platform. However, on almost all computer platforms divisions are expensive (relative to the cost of a multiplication). Is it possible to perform modular multiplication without using any division instructions? If one is allowed to perform some precomputation which only depends on the modulus N, then this question can be answered affirmatively. When computing the division step, the idea is to use only “cheap” divisions and “cheap” modular reductions when computing the modular multiplication in combination with a precomputed constant (the computation of which may require “expensive” divisions). These “cheap” operations are computations which either come for free or at a low cost on computer platforms.

Virtually all modern computer architectures internally store and compute on data in binary format using some fixed word-size r = 2^w as above. In practice, this means that all arithmetic operations are implicitly computed modulo 2^w (i.e., for free) and divisions or multiplications by (powers of) 2 can be computed by simply shifting the content of the register that holds this value. Barrett introduced a modular multiplication approach (known as Barrett multiplication [31]) using this idea. This approach can be seen as a Newton method which uses a precomputed scaled variant of the modulus' reciprocal in order to use only such “cheap” divisions when computing (estimating and adjusting) the division step. After precomputing a single (multi-precision) value, an implementation of Barrett multiplication does not use any division instructions and requires O(n^2) multiplication instructions.

Another and earlier approach based on precomputation is the main topic of this chapter: Montgomery multiplication. This method is the preferred choice in cryptographic applications when the modulus has no “special” form (besides being an odd positive integer) that would allow more efficient modular reduction techniques. See Section 2.3 for applications of Montgomery multiplication in the “special” setting. In practice, Montgomery


multiplication is the most efficient method when a generic modulus is used (see e.g., the comparison performed by Bosselaers, Govaerts, and Vandewalle [75]) and has a very regular structure which speeds up the implementation. Moreover, the structure of the algorithm (especially if its single branch, the notorious conditional “subtraction step,” can be avoided, cf. Section 2.2.4) has advantages when guarding against certain types of cryptographic attacks (for more information on differential power analysis attacks see the seminal paper by Kocher, Jaffe, and Jun [221]). In the next chapter, Montgomery's method is compared with a version of Barrett multiplication in order to be more precise about the computational advantages of the former technique. As observed by Shand and Vuillemin in [310], Montgomery multiplication can be seen as a generalization of Hensel's odd division [184] for 2-adic numbers.

In this chapter we explain the motivation behind Montgomery arithmetic. More specifically, we show how a change of the residue class representatives used improves the performance of modular multiplication. Next, we summarize some of the proposed modifications to Montgomery multiplication, which can further speed up the algorithm in certain settings. Finally, we show how to implement Montgomery arithmetic in software. We especially study how to compute a single Montgomery multiplication concurrently, using either vector instructions or many computational units that can compute in parallel.
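To make the radix-r representation introduced above concrete, here is a small Python sketch (ours, purely illustrative); the word size w = 64 is an assumption and any machine word size works.

def to_radix_r(X, b, w=64):
    # Split a b-bit non-negative integer into n = ceil(b/w) words x_0, ..., x_{n-1}.
    n = -(-b // w)
    r = 1 << w
    return [(X >> (w * i)) % r for i in range(n)]

def from_radix_r(words, w=64):
    # Reassemble X = sum of x_i * r^i.
    return sum(x_i << (w * i) for i, x_i in enumerate(words))

# Example: to_radix_r(2**100 + 5, 101) == [5, 2**36].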

2.2 Montgomery Multiplication

Let N be an odd b-bit integer and P a 2b-bit integer such that 0 ≤ P < N². The idea behind Montgomery multiplication is to change the representatives of the residue classes and change the modular multiplication accordingly. More precisely, we are not going to compute P mod N but 2^{-b} P mod N instead. This explains the requirement that N needs to be odd (otherwise 2^{-b} mod N does not exist). It turns out that this approach is more efficient (by a constant factor) on computer platforms.

Let us start with a basic example to illustrate the strategy used. A first idea is to reduce the value P one bit at a time and repeat this for b bits such that the result has been reduced from 2b to b bits (as required). This can be achieved without computing any expensive modular divisions by noting that

    2^{-1} P mod N = P/2         if P is even,
                     (P + N)/2   if P is odd.


When P is even, the division by two can be computed with a basic operation on computer architectures: shift the number one position to the right. When P is odd one cannot simply compute this division by shifting. A computationally efficient approach to compute this division by two is to make this number P even by adding the odd modulus N, since obviously modulo N this is the same. This allows one to compute 2^{-1} P mod N at the cost of (at most) a single addition and a single shift.

Let us compute D < 2N and Q < 2^b such that P = 2^b D − NQ, since then D ≡ 2^{-b} P mod N. Initially set D equal to P and Q equal to zero. We denote by q_i the ith digit when Q is written in binary (radix-2), i.e., Q = Σ_{i=0}^{b-1} q_i 2^i. Next perform the following two steps b times, starting at i = 0 until the last time when i = b − 1:

    (Step 1) q_i = D mod 2,    (Step 2) D = (D + q_i N)/2.

This procedure gradually builds the desired Q and at the start of every iteration P = 2^i D − NQ remains invariant. The process is illustrated in the example below.

Example 2.1 (Radix-2 Montgomery reduction) For N = 7 (3 bits) and P = 20 < 7² we compute D ≡ 2^{-3} P mod N. At the start of the algorithm, set D = P = 20 and Q = 0.

    i = 0: 20 = 2^0 · 20 − 7 · 0  ⇒  2^{-0} · 20 ≡ 20 (mod 7);  (Step 1) q_0 = 20 mod 2 = 0,  (Step 2) D = (20 + 0 · 7)/2 = 10.
    i = 1: 20 = 2^1 · 10 − 7 · 0  ⇒  2^{-1} · 20 ≡ 10 (mod 7);  (Step 1) q_1 = 10 mod 2 = 0,  (Step 2) D = (10 + 0 · 7)/2 = 5.
    i = 2: 20 = 2^2 · 5 − 7 · 0   ⇒  2^{-2} · 20 ≡ 5 (mod 7);   (Step 1) q_2 = 5 mod 2 = 1,   (Step 2) D = (5 + 1 · 7)/2 = 6.

Since Q = q_0 2^0 + q_1 2^1 + q_2 2^2 = 4 and P = 20 = 2^b D − NQ = 2^3 · 6 − 7 · 4, we have computed 2^{-3} · 20 ≡ 6 (mod 7).
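The bit-at-a-time reduction just described is easily expressed in code. The following Python sketch (an illustration, not taken from the book) computes D with D ≡ 2^{-b} P (mod N) and 0 ≤ D < 2N for an odd modulus N, and reproduces Example 2.1.

def montgomery_reduce_radix2(P, N, b):
    D = P
    for _ in range(b):
        if D & 1:        # Step 1: q_i = D mod 2
            D += N       # make D even without changing it modulo N
        D >>= 1          # Step 2: exact division by 2
    return D

# Example 2.1: montgomery_reduce_radix2(20, 7, 3) returns 6.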

Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003

14

Montgomery Arithmetic from a Software Perspective

Algorithm 2.1 The Montgomery reduction algorithm. Compute PR−1 modulo the odd modulus N given the Montgomery radix R > N and using the precomputed Montgomery constant μ = −N −1 mod R. Input: N, P, such that 0 ≤ P < N 2 . Output: C ≡ PR−1 mod N such that 0 ≤ C < N. 1: q ← μ(P mod R) mod R 2: C ← (P + Nq)/R 3: if C ≥ N then 4: C ←C−N 5: end if 6: return C adding a specific multiple of the modulus N to the current value P ensures that   (2.1) P + N (μP mod R) ≡ P − N N −1 P mod R ≡P−P≡0

(mod R).

Hence, P + N(μP mod R) is divisible by the Montgomery radix R while P does not change modulo N. Let P be the product of two non-negative integers that are both less than N. After applying Equation (2.1) and dividing by R, the value P (bounded by N²) has been reduced to at most 2N, since

    0 ≤ (P + N(μP mod R))/R < (N² + NR)/R < 2N    (2.2)

(since R was chosen larger than N). This approach is summarized in Algorithm 2.1: given an integer P bounded by N², it computes PR^{-1} mod N, bounded by N, without using any “expensive” division instructions when assuming the reductions modulo R and divisions by R can be computed (almost) for free. On most computer platforms, where one chooses R as a power of two, this assumption is indeed true. Equation (2.2) guarantees that the output is bounded by 2N. Hence, a conditional subtraction needs to be computed at the end of Algorithm 2.1 to ensure the output is less than N. The process is illustrated in the example below, where P is the product of integers A, B with 0 ≤ A, B < N.

Example 2.2 (Montgomery multiplication) Exact divisions by 10² = 100 are visually convenient when using a decimal system: just shift the number two places to the right (or “erase” the two least significant digits). Assume the following modular reduction approach: use the Montgomery radix R = 100 when computing modulo N = 97. This example


computes the Montgomery product of A = 42 with B = 17. First, precompute the Montgomery constant

    μ = −N^{-1} mod R = −97^{-1} mod 100 = 67.

After computing the product P = AB = 42 · 17 = 714, compute the first two steps of Algorithm 2.1, omitting the division by R:

    P + N(μP mod R) = 714 + 97 · (67 · 714 mod 100)
                    = 714 + 97 · (67 · 14 mod 100)
                    = 714 + 97 · (938 mod 100)
                    = 714 + 97 · 38 = 4400.

Indeed, 4400 is divisible by R = 100 and we have computed

    (AB)R^{-1} ≡ 42 · 17 · 100^{-1} ≡ 44 (mod 97)

without using any “difficult” modular divisions.
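Algorithm 2.1 translates directly into a few lines of code. The following Python sketch (the function name is our own) computes PR^{-1} mod N from the precomputed constant μ and reproduces Example 2.2; in practice R is a power of two, so the reduction modulo R and the division by R are masking and shifting operations.

def montgomery_reduction(P, N, R, mu):
    q = (mu * (P % R)) % R       # line 1 of Algorithm 2.1
    C = (P + N * q) // R         # line 2: exact division, by Equation (2.1)
    return C - N if C >= N else C

# Example 2.2 (with the decimal radix R = 100 for readability):
# montgomery_reduction(42 * 17, 97, 100, 67) returns 44.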

2.2.1 Interleaved Montgomery Multiplication When working with multi-precision integers, integers consisting of n digits of w bits each, it is common to write the Montgomery radix R as R = rn = 2wn , where w is the word-size of the architecture where Montgomery multiplication is implemented. The Montgomery multiplication method (as outlined in Algorithm 2.1) assumes the multiplication is computed before performing the Montgomery reduction. This has the advantage that one can use asymptotically fast multiplication methods (like e.g., Karatsuba [202], Toom–Cook [327, 102], Schönhage–Strassen [301], or Fürer [152] multiplication). However, this has the disadvantage that the intermediate results can be as large as r2n+1 . Or, stated differently, when using a machine word size of w bits the intermediate results are represented using at most 2n + 1 computer words. The multi-precision setting was already handled in Montgomery’s original paper [251, section 2] and the reduction and multiplication were meant to be interleaved by design (see also the remark in the second to last paragraph of


Section 1.2 on page 5). When representing the integers in radix-r representation,

A = Σ_{i=0}^{n−1} a_i r^i, such that 0 ≤ a_i < r,

the radix-r interleaved Montgomery multiplication (see also the work by Dussé and Kaliski Jr. in [134]) ensures that the intermediate results never exceed n + 2 computer words. This approach is presented in Algorithm 2.2.

Algorithm 2.2 The radix-r interleaved Montgomery multiplication algorithm. Compute (AB)R^{-1} modulo the odd modulus N, given the Montgomery radix R = r^n and using the precomputed Montgomery constant μ = −N^{-1} mod r. The modulus N is such that r^{n−1} ≤ N < r^n and r and N are coprime.
Input: A = Σ_{i=0}^{n−1} a_i r^i, B, N such that 0 ≤ a_i < r, 0 ≤ A, B < R.
Output: C ≡ (AB)R^{-1} mod N such that 0 ≤ C < N.
1: C ← 0
2: for i = 0 to n − 1 do
3:   C ← C + a_i B
4:   q ← μC mod r
5:   C ← (C + Nq)/r
6: end for
7: if C ≥ N then
8:   C ← C − N
9: end if
10: return C

Note that this interleaves the naive schoolbook multiplication algorithm with the Montgomery reduction and therefore does not make use of any asymptotically faster multiplication algorithm. The idea is that every iteration divides by the value r (instead of dividing once by R = r^n as in the “non-interleaved” Montgomery multiplication algorithm). Hence, the value of μ is adjusted accordingly. In [93], Koç, Acar, and Kaliski Jr. compare different approaches to implementing multi-precision Montgomery multiplication. According to this analysis, the interleaved radix-r approach, referred to as coarsely integrated operand scanning in [93], performs best in practice.
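A direct Python transcription of Algorithm 2.2 may also be helpful; this is a sketch under our own naming, with a generic radix r rather than a machine word size, and it is not optimized in any way.

    def interleaved_montmul(A, B, N, r, n):
        # Algorithm 2.2: return A * B * r^{-n} mod N for an odd modulus N
        # with r^{n-1} <= N < r^n and 0 <= A, B < r^n.
        mu = (-pow(N, -1, r)) % r          # mu = -N^{-1} mod r
        C = 0
        for i in range(n):
            a_i = (A // r**i) % r          # i-th radix-r digit of A
            C = C + a_i * B
            q = (mu * C) % r
            C = (C + N * q) // r           # exact division by r
        if C >= N:
            C -= N
        return C

    # With r = 10 and n = 2 (so R = 100) this matches Example 2.2:
    print(interleaved_montmul(42, 17, 97, 10, 2))   # prints 44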

2.2.2 Using Montgomery Arithmetic in Practice

As we have seen earlier in this section and in Algorithm 2.1 on page 14, Montgomery multiplication computes C ≡ PR^{-1} mod N. It follows that, in order to use Montgomery multiplication in practice, one should transform the input


operands A and B to Ã = AR mod N and B̃ = BR mod N: this is called the Montgomery representation. The transformed inputs (converted to the Montgomery representation) are used in the Montgomery multiplication algorithm. At the end of a series of modular multiplications the result, in Montgomery representation, is transformed back. This works correctly since Montgomery multiplication M(Ã, B̃, N) computes (ÃB̃)R^{-1} mod N, and it is indeed the case that the Montgomery representation C̃ of C = AB mod N is computed from the Montgomery representations of A and B, since

C̃ ≡ M(Ã, B̃, N) ≡ (ÃB̃)R^{-1} ≡ (AR)(BR)R^{-1} ≡ (AB)R ≡ CR (mod N).

Converting an integer A to its Montgomery representation Ã = AR mod N can be performed using Montgomery multiplication with the help of the precomputed constant R² mod N, since

M(A, R², N) ≡ (AR²)R^{-1} ≡ AR ≡ Ã (mod N).

Converting (the result) back from the Montgomery representation to the regular representation is the same as computing a Montgomery multiplication with the integer value one, since

M(Ã, 1, N) ≡ (Ã · 1)R^{-1} ≡ (AR)R^{-1} ≡ A (mod N).

As mentioned earlier, due to the overhead of changing representations, Montgomery arithmetic is best used to replace a sequence of modular multiplications, since this overhead is amortized. A typical use-case scenario is computing a modular exponentiation as required in the RSA cryptosystem [294]. As noted in the original paper [251] (see the quote at the start of this chapter), computing with numbers in Montgomery representation does not affect the modular addition and subtraction algorithms. This can be seen from

Ã ± B̃ ≡ AR ± BR ≡ (A ± B)R (mod N),

which is the Montgomery representation of A ± B.

Computing the Montgomery inverse is, however, affected. The Montgomery inverse of a value Ã in Montgomery representation is the Montgomery representation of A^{-1}, i.e., A^{-1}R mod N. This is different from computing the inverse of Ã modulo N, since Ã^{-1} ≡ (AR)^{-1} ≡ A^{-1}R^{-1} (mod N) is the Montgomery representation of the value A^{-1}R^{-2}. One of the correct ways of computing the Montgomery inverse is to invert the number


in its Montgomery representation and Montgomery multiply this result by R³, since

M(Ã^{-1}, R³, N) ≡ ((AR)^{-1}·R³)·R^{-1} ≡ A^{-1}R (mod N),

which is the Montgomery representation of A^{-1}. Another approach, which does not require any precomputed constant, is to compute the Montgomery reduction of a Montgomery residue Ã twice before inverting, since

M(M(Ã, 1, N), 1, N)^{-1} ≡ M((AR)R^{-1}, 1, N)^{-1} ≡ M(A, 1, N)^{-1} ≡ (AR^{-1})^{-1} ≡ A^{-1}R (mod N).
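Reusing the montgomery_reduce sketch given after Example 2.2, the conversions and the R³ inversion trick can be prototyped as follows (our naming; the checks use the toy parameters N = 97, R = 100).

    def to_mont(a, N, R, mu):
        # A~ = a * R mod N, via one Montgomery multiplication with R^2 mod N.
        return montgomery_reduce(a * (R * R % N), N, R, mu)

    def from_mont(a_t, N, R, mu):
        # Convert back: one Montgomery multiplication by the integer 1.
        return montgomery_reduce(a_t, N, R, mu)

    def mont_inverse(a_t, N, R, mu):
        # Invert the Montgomery residue, then Montgomery multiply by R^3:
        # ((A R)^{-1} * R^3) * R^{-1} = A^{-1} R mod N.
        return montgomery_reduce(pow(a_t, -1, N) * pow(R, 3, N), N, R, mu)

    N, R = 97, 100
    mu = (-pow(N, -1, R)) % R
    a_t = to_mont(42, N, R, mu)                       # 42 * 100 mod 97 = 29
    assert from_mont(a_t, N, R, mu) == 42
    assert from_mont(mont_inverse(a_t, N, R, mu), N, R, mu) == pow(42, -1, N)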

2.2.3 Computing the Montgomery Constants μ and R²

In order to use Montgomery multiplication one has to precompute the Montgomery constant μ = −N^{-1} mod r. This can be computed with, for instance, the extended Euclidean algorithm. A particularly efficient algorithm to compute μ when r is a power of two and N is odd, the typical setting used in cryptology, is given by Dussé and Kaliski Jr. in [134]. This approach is recalled in Algorithm 2.3.

Algorithm 2.3 Compute the Montgomery constant μ = −N^{-1} mod r for odd values N and r = 2^w, as presented by Dussé and Kaliski Jr. in [134].
Input: Odd integer N and r = 2^w for w ≥ 1.
Output: μ = −N^{-1} mod r
y ← 1
for i = 2 to w do
  if (Ny mod 2^i) ≠ 1 then
    y ← y + 2^{i−1}
  end if
end for
return μ ← r − y

To show that this approach is correct, it suffices to show that at the start of Algorithm 2.3 and at the end of every iteration we have Ny_i ≡ 1 (mod 2^i). This can be shown by induction as follows. At the start of the algorithm we set y to


one; denote this start setting by y_1. The condition holds since N is odd by assumption. Denote by y_i, for 2 ≤ i ≤ w, the value of y in Algorithm 2.3 at the end of iteration i. When i > 1, our induction hypothesis is that Ny_{i−1} = 1 + 2^{i−1}m for some non-negative integer m at the end of iteration i − 1. We consider two cases.

• (m is even) Since Ny_{i−1} = 1 + (m/2)·2^i ≡ 1 (mod 2^i), we can simply set y_i to y_{i−1} and the condition holds.
• (m is odd) Since Ny_{i−1} = 1 + 2^{i−1} + ((m−1)/2)·2^i ≡ 1 + 2^{i−1} (mod 2^i), we update y_i to y_{i−1} + 2^{i−1}. We obtain N(y_{i−1} + 2^{i−1}) = 1 + 2^{i−1}(1 + N) + ((m−1)/2)·2^i ≡ 1 (mod 2^i), since N is odd.

Hence, after the for loop, y_w is such that Ny_w ≡ 1 (mod 2^w), and the returned value μ = r − y_w ≡ 2^w − N^{-1} ≡ −N^{-1} (mod 2^w) has been computed correctly.

The precomputed constant R² mod N is required when converting a residue modulo N from its regular to its Montgomery representation (see Section 2.2.2 on page 16). When R = r^n is a power of two, which in practice is typically the case since r = 2^w, this precomputed value R² mod N can also be computed efficiently. For convenience, assume R = r^n = 2^{wn} and 2^{wn−1} ≤ N < 2^{wn} (the approach is easily adapted when N is smaller than 2^{wn−1}). Commence by setting the initial value c_0 = 2^{wn−1} < N. Next, starting at i = 1, compute c_i ≡ c_{i−1} + c_{i−1} (mod N) and increase i until i = wn + 1. The final value c_{wn+1} ≡ 2^{wn+1}·c_0 ≡ 2^{wn+1}·2^{wn−1} ≡ 2^{2wn} ≡ (2^{wn})² ≡ R² (mod N), as required, and it is computed with wn + 1 modular additions.
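Both precomputations are short in Python; the sketch below (our naming) implements Algorithm 2.3 and the doubling loop for R² mod N just described.

    def montgomery_mu(N, w):
        # Algorithm 2.3 (Dusse and Kaliski Jr.): mu = -N^{-1} mod 2^w, N odd.
        y = 1
        for i in range(2, w + 1):
            if (N * y) % (1 << i) != 1:
                y += 1 << (i - 1)
        return (1 << w) - y

    def montgomery_R2(N, w, n):
        # R^2 mod N for R = 2^{wn}, assuming 2^{wn-1} <= N < 2^{wn}:
        # start from c_0 = 2^{wn-1} and double modulo N exactly wn+1 times.
        c = 1 << (w * n - 1)
        for _ in range(w * n + 1):
            c = (c + c) % N
        return c

    assert montgomery_mu(97, 8) == (-pow(97, -1, 256)) % 256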

2.2.4 On the Final Conditional Subtraction

It is possible to alter or even completely remove the conditional subtraction from lines 3–4 in Algorithm 2.1 on page 14. This is often motivated either by performance considerations or by turning the (software) implementation into straight-line code that requires no conditional branches. The latter is one of the basic requirements for cryptographic implementations that need to protect themselves against a variety of (simple) side-channel attacks as introduced by Kocher, Jaffe, and Jun [221] (attacks that use physical information, such as elapsed time, obtained while executing an implementation, to deduce information about the secret key material used). Ensuring a constant running time is a first step towards achieving this goal. In order to change or remove this final conditional subtraction, the general idea is to bound the input and output of the


Montgomery multiplication in such a way that they can be re-used in a subsequent Montgomery multiplication computation. This means using a redundant representation, in which the representation of the residues used is not unique and can be larger than N.

2.2.4.1 Subtraction-less Montgomery Multiplication

The conditional subtraction can be omitted when the size of the modulus N is appropriately selected with respect to the Montgomery radix R. (This is a result presented by Shand and Vuillemin in [310]; see also the sequence of papers by Walter, Hachez, and Quisquater [337, 177, 340].) The idea is to select the modulus N such that 4N < R and to use a redundant representation for the input and output values of the algorithm. More specifically, we assume A, B ∈ Z/2NZ (residues modulo 2N), where 0 ≤ A, B < 2N, since then the outputs of Algorithm 2.1 on page 14 and Algorithm 2.2 on page 16 are bounded by

0 ≤ (AB + N(μAB mod R))/R < ((2N)² + NR)/R < (NR + NR)/R = 2N.   (2.3)

Hence, the result can be reused as input to the same Montgomery multiplication algorithm. This avoids the need for the conditional subtraction except in a final correction step (after having computed a sequence of Montgomery multiplications), where one reduces the value to a unique representation with a single conditional subtraction. In practice, this might reduce the number of arithmetic operations whenever the modulus can be chosen beforehand and, moreover, it simplifies the code. However, in the popular use-cases in cryptography, e.g., in the setting of computing modular exponentiations when using schemes based on RSA [294], where the bit-length of the modulus must be a power of two due to compliance with cryptographic standards, the condition 4N < R results in a Montgomery radix R which is represented using one additional computer word (compared to the number of words needed to represent the modulus N). Hence, in this setting, such a multi-precision implementation without a conditional subtraction needs one more iteration (when using the interleaved Montgomery multiplication algorithm) to compute the result compared to a version that computes the conditional subtraction.

2.2.4.2 Montgomery Multiplication with a Simpler Final Comparison

Another approach is not to remove the subtraction but to make this operation computationally cheaper. See the analysis by Walter and Thompson in [342, section 2.2], which is introduced again by Pu and Zhao in [292]. In practice, the


Montgomery radix R = r^n is often chosen as a multiple of the word-size of the computer architecture used (e.g., r = 2^w for w ∈ {8, 16, 32, 64}). The idea is to reduce the output of the Montgomery multiplication to {0, 1, . . . , R − 1} instead of to the smaller range {0, 1, . . . , N − 1}. Just as above, this is a redundant representation, but working with residues from Z/RZ. This representation does not need more computer words to represent the result and therefore does not increase the number of iterations one needs to compute; something which might be the case when the Montgomery radix is increased to remove the conditional subtraction. Computing the comparison whether an integer x = Σ_{i=0}^{n} x_i r^i is at least R = r^n can be done efficiently by verifying whether the most significant word x_n is non-zero. This is significantly more efficient than computing a full multi-precision comparison. This approach is correct since, if the input values A and B are bounded by R, then the output of the Montgomery multiplication, before the conditional subtraction, is bounded by

0 ≤ (AB + N(μAB mod R))/R < (R² + NR)/R = R + N.   (2.4)

Subtracting N whenever the result is at least R ensures that the output is also less than R. Hence, one still needs to evaluate the condition for the subtraction in every Montgomery multiplication. However, the greater-or-equal comparison becomes significantly cheaper, and the number of iterations required to compute the interleaved Montgomery multiplication algorithm remains unchanged. In the setting where a constant running time is required, this approach does not seem to bring a significant advantage (see the security analysis by Walter and Thompson in [342, section 2.2] for more details). A simple constant-running-time solution is to compute the subtraction unconditionally and select this result if no borrow occurred. However, when constant running time is not an issue, this approach (using a cheaper comparison) can speed up the Montgomery multiplication algorithm.
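The compute-then-select idea can be illustrated at word level. The sketch below is ours; Python integers are of course not constant-time, so this only demonstrates the masking pattern that constant-time C or assembly code would use.

    def ct_final_subtract(C, N, n, w=32):
        # Compute D = C - N word by word, tracking the borrow, then select
        # C (if a borrow occurred, i.e., C < N) or D without a branch.
        mask = (1 << w) - 1
        borrow, D = 0, 0
        for i in range(n):
            t = ((C >> (w * i)) & mask) - ((N >> (w * i)) & mask) - borrow
            borrow = (t >> w) & 1             # 1 exactly when t went negative
            D |= (t & mask) << (w * i)
        full = (1 << (w * n)) - 1
        m = (-borrow) & full                  # all-ones mask iff a borrow occurred
        return (C & m) | (D & ~m)             # select C on borrow, else C - N

    assert ct_final_subtract(150, 97, 1) == 53
    assert ct_final_subtract(42, 97, 1) == 42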

2.2.5 Montgomery Multiplication in F_{2^k}

The idea behind Montgomery multiplication carries over to finite fields of cardinality 2^k as well. Such finite fields are known as binary fields or characteristic-two finite fields. The application of Montgomery multiplication to this setting is outlined by Koç and Acar in [219]. Let n(x) be an irreducible polynomial of degree k. Then an element a(x) ∈ F_{2^k} ≅ F_2[x]/(n(x)) can be represented in the


polynomial-basis representation by a polynomial of degree at most k − 1,

a(x) = Σ_{i=0}^{k−1} a_i x^i, where a_i ∈ F_2.

The equivalent of the Montgomery radix is a polynomial r(x) ∈ F_2[x]/(n(x)), which in practice is chosen as r(x) = x^k. Since n(x) is irreducible, this ensures that the inverse r^{-1}(x) mod n(x) exists and the Montgomery multiplication a(x)b(x)r^{-1}(x) mod n(x) is well defined. Let a(x), b(x) ∈ F_2[x]/(n(x)) and let their product be p(x) = a(x)b(x), of degree at most 2(k − 1). Computing the Montgomery reduction p(x)r^{-1}(x) mod n(x) of p(x) can be done using the same steps as presented in Algorithm 2.1 on page 14, given the precomputed Montgomery constant μ(x) = −n(x)^{-1} mod r(x). Hence, one computes

q(x) = p(x)μ(x) mod r(x),
c(x) = (p(x) + q(x)n(x))/r(x).

Note that the final conditional subtraction step is not required, since

deg(c(x)) ≤ max(2(k − 1), (k − 1) + k) − k = k − 1

(because r(x) is a polynomial of degree k). The interleaved Montgomery multiplication from Section 2.2.1 on page 15, presented there for finite fields of large prime characteristic, works here as well.

2.3 Using Primes of a Special Form

In some settings in cryptography, most notably in elliptic curve cryptography (introduced independently by Miller and Koblitz in [247, 217]), the (prime) modulus can be chosen freely and is fixed for a large number of modular arithmetic operations. In order to gain a constant-factor speedup when computing the modular multiplication, Solinas suggested [316] a specific set of special primes, which were subsequently included in the FIPS 186 standard [330] used in public-key cryptography. More recently, prime moduli of the form 2^s ± δ have gained popularity, where s, δ ∈ Z_{>0} and δ < 2^s, such that δ is a (very) small integer. More precisely, the constant δ is small compared to the typical word-size of the computer architectures used (e.g., less than 2^32) and often is chosen as the smallest integer such that one of 2^s ± δ is prime. One should


be aware that the usage of such primes of a special form not only accelerates the cryptographic implementations, but benefits the cryptanalytic methods as well. See, for instance, the work by this chapter's authors, Kleinjung, and Lenstra related to efficient arithmetic to factor Mersenne numbers (numbers of the form 2^M − 1) in [71]. An example of one of the primes suggested by Solinas is 2^256 − 2^224 + 2^192 + 2^96 − 1, where the exponents are selected to be multiples of 32 to speed up implementations on 32-bit platforms (but see, for instance, the work by Käsper [204] on how to implement such primes efficiently on 64-bit platforms). A more recent example, proposed by Bernstein [35], is to use the prime 2^255 − 19 to implement efficient modular arithmetic in the setting of elliptic curve cryptography.
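For a generic prime p = 2^s − δ with small δ, the folding step uses 2^s ≡ δ (mod p). The following Python sketch is ours and is not tied to any particular standardized prime; the check uses Bernstein's prime 2^255 − 19.

    def reduce_pseudo_mersenne(c, s, delta):
        # Reduce c modulo p = 2^s - delta by repeatedly folding the high
        # bits down, using 2^s = delta (mod p); delta is assumed small.
        p = (1 << s) - delta
        while c >> s:
            c = (c & ((1 << s) - 1)) + delta * (c >> s)
        return c - p if c >= p else c

    p = (1 << 255) - 19
    a, b = p - 2, p - 3
    assert reduce_pseudo_mersenne(a * b, 255, 19) == (a * b) % p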

2.3.1 Faster Modular Reduction with Primes of a Special Form

We use the Mersenne prime 2^127 − 1 as an example to illustrate the various modular reduction techniques in this section. Given two integers a and b such that 0 ≤ a, b < 2^127 − 1, the modular product ab mod (2^127 − 1) can be computed efficiently as follows. (We follow the description given by the first author of this chapter, Costello, Hisil, and Lauter in [64].) First compute the product with one's preferred multiplication algorithm and write the result in radix-2^128 representation,

c = ab = c1·2^128 + c0, where 0 ≤ c0 < 2^128 and 0 ≤ c1 < 2^126.

This product can be almost fully reduced by subtracting 2c1 times the modulus, since

c ≡ c1·2^128 + c0 − 2c1·(2^127 − 1) ≡ c0 + 2c1 (mod 2^127 − 1).

Moreover, we can subtract 2^127 − 1 one more time if the bit ⌊c0/2^127⌋ (the most significant bit of c0) is set. Combining these two observations, a first reduction step can be computed as

c′ = (c0 mod 2^127) + 2c1 + ⌊c0/2^127⌋ ≡ c (mod 2^127 − 1).   (2.5)

This already ensures that the result satisfies 0 ≤ c′ < 2^128, since c′ ≤ 2^127 − 1 + 2(2^126 − 1) + 1 < 2^128. One can then reduce c′ further using conditional subtractions. Reduction modulo 2^127 − 1 can therefore be computed without using any multiplications or expensive divisions by taking advantage of the form of the modulus.
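In Python, the reduction just described reads as follows (our naming); the folding step of Equation (2.5) is followed by at most two conditional subtractions, since the folded value is at most 2(2^127 − 1).

    def mul_mod_m127(a, b):
        # Multiply modulo the Mersenne prime p = 2^127 - 1 via Equation (2.5).
        p = (1 << 127) - 1
        c = a * b                               # c < p^2
        c0 = c & ((1 << 128) - 1)               # low 128 bits
        c1 = c >> 128                           # high part, c1 < 2^126
        c = (c0 & p) + 2 * c1 + (c0 >> 127)     # first reduction step, c <= 2p
        while c >= p:                           # at most two subtractions
            c -= p
        return c

    p = (1 << 127) - 1
    assert mul_mod_m127(p - 1, p - 2) == ((p - 1) * (p - 2)) % p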


2.3.2 Faster Montgomery Reduction with Primes of a Special Form

Along the same lines, one can select moduli that speed up the operations when computing the Montgomery reduction. Such special moduli have been proposed many times in the literature to reduce the number of arithmetic operations (see, for instance, the work by Lenstra [231], Acar and Shumow [1], Knezevic, Vercauteren, and Verbauwhede [214], Hamburg [178], and the first author of this chapter, Costello, Hisil, and Lauter [64, 65]). They are sometimes referred to as Montgomery-friendly moduli or Montgomery-friendly primes. Techniques to scale an existing modulus such that the scaled modulus has a special shape which reduces the number of arithmetic operations, using the same techniques as for the Montgomery-friendly moduli, are also known; this is called Montgomery tail tailoring by Hars in [183]. Following the description in the book by Brent and Zimmermann [80], this can be seen as a form of preconditioning as suggested by Svoboda in [320] in the setting of division.

When one is free to select the modulus N beforehand, the number of arithmetic operations can be reduced if the modulus is selected such that μ = −N^{-1} mod r = ±1 in the setting of interleaved Montgomery multiplication (as also used by Dixon and Lenstra in [127]). This ensures that the multiplication by μ can be avoided (since μ = ±1) in every iteration of the interleaved Montgomery multiplication algorithm. This puts a first requirement on the modulus, namely N ≡ ∓1 (mod r). In practice, r = 2^w where w is the word-size of the computer architecture. Hence, this requirement puts a restriction on the least significant word of the modulus (which equals either 1 or −1 ≡ 2^w − 1 (mod 2^w)).

Combining lines 4 and 5 of the interleaved Montgomery multiplication (Algorithm 2.2 on page 16), we see that one has to compute (C + N(μC mod r))/r. Besides the multiplication by μ, one has to compute a multi-word multiplication with the (fixed) modulus N. In the same vein as the techniques from Section 2.3.1 above, one can require N to have a special shape such that this multiplication can be computed faster in practice. This can be achieved, for instance, when one of the computer words of the modulus is small or has some special shape while the remainder of the digits are zero except for the most significant word (e.g., when μ = 1). Along these lines, the first author of this chapter, Costello, Longa, and Naehrig select primes for usage in elliptic curve cryptography where

N = 2^α(2^β − γ) ± 1,   (2.6)

where α, β, and γ are integers such that γ < 2^β ≤ r.


The final requirement on the modulus is to ensure that 4N < R = r^n, since this avoids the final conditional subtraction (as shown on page 20 in Section 2.2.4). Examples of such Montgomery-friendly moduli include

2^252 − 2^232 − 1 = 2^192(2^60 − 2^40) − 1 = 2^224(2^28 − 2^8) − 1

(written in different forms to show the usage on different architectures that can compute with β-bit integers), proposed by Hamburg in [178], and the modulus

2^240(2^14 − 127) − 1 = 2^222(2^32 − 2^18·127) − 1

proposed by the first author of this chapter, Costello, Longa, and Naehrig in [66]. The approach is illustrated in the example below. Other examples of Montgomery-friendly moduli are given in [166, table 4], based on Chung–Hasan arithmetic [98].

Example 2.3 (Montgomery-friendly reduction modulo 2^127 − 1) Let us consider Montgomery reduction modulo 2^127 − 1 on a 64-bit computer architecture (w = 64). This means we have α = 64, β = 63, and γ = 0 in Equation (2.6) on page 24 (2^127 − 1 = 2^64(2^63 − 0) − 1). Multiplication by μ can be avoided since μ = −(2^127 − 1)^{-1} mod 2^64 = 1. Furthermore, due to the special form of the modulus, the multiplication by 2^127 − 1 can be simplified. The computation of (C + N(μC mod r))/r, which needs to be done for each computer word (twice in this setting), can be simplified when using the interleaved Montgomery multiplication algorithm. Write C = c2·2^128 + c1·2^64 + c0 (see line 3 in Algorithm 2.2 on page 16) with 0 ≤ c2, c1, c0 < 2^64; then

(C + N(μC mod r))/r = (C + (2^127 − 1)(C mod 2^64))/2^64
                    = ((c2·2^128 + c1·2^64 + c0) + (2^127 − 1)c0)/2^64
                    = (c2·2^128 + c1·2^64 + c0·2^127)/2^64
                    = c2·2^64 + c1 + c0·2^63.

Hence, only two additions and two shift operations are needed in this computation.


2.4 Concurrent Computing of Montgomery Multiplication

Since the seminal paper by the second author introducing modular multiplication without trial division, people have studied ways to obtain better performance on different computer architectures. Many of these techniques are specifically tailored towards a family of platforms, motivated by the desire to enhance the practical performance of public-key cryptosystems. One approach focuses on reducing the latency of the Montgomery multiplication operation. This might be achieved by computing the Montgomery product using many computational units in parallel. One example is to use the single instruction, multiple data (SIMD) programming paradigm. In this setting a single vector instruction applies to multiple data elements in parallel. Many modern computer architectures have access to vector instruction set extensions to perform SIMD operations. Example platforms include the popular high-end x86 architecture as well as the embedded ARM platform, which can be found in the majority of modern smartphones and tablets. To highlight the potential, Gueron and Krasnov were the first to show, in [175], that the computation of Montgomery multiplication on the 256-bit wide vector instruction set AVX2 is faster than the same computation on the classical arithmetic logic unit (ALU) on the x86_64 platform.

In Section 2.4.2 below we outline the approach by the authors of this chapter, Shumow, and Zaverucha from [73] for computing a single Montgomery multiplication using vector instruction set extensions which support 2-way SIMD operations (i.e., which perform the same operation on two data points simultaneously). This approach allows one to split the computation of the interleaved Montgomery multiplication into two parts which can be computed in parallel. Note that in follow-up work [308] by Seo, Liu, Großschädl, and Choi it is shown how to improve the performance on 2-way SIMD architectures even further. Instead of computing the two multiplications concurrently, as is presented in Section 2.4.2, they compute every multiplication using 2-way SIMD instructions. By careful scheduling of the instructions they manage to significantly reduce the read-after-write dependencies, which reduces the number of bubbles (execution delays in the instruction pipeline). This results in a software implementation which outperforms the one presented in [73]. It would be interesting to see if these two approaches (from [73] and [308]) can be merged on platforms which support efficient 4-way SIMD instructions.

In Section 2.4.4 on page 36 we show how to compute Montgomery multiplication when integers are represented in a residue number system. This approach can be used to compute Montgomery arithmetic efficiently on highly parallel computer architectures which have hundreds of computational cores or more, and when large moduli are used (such as in the RSA cryptosystem).


2.4.1 Related Work on Concurrent Computing of Montgomery Multiplication

Parallel software approaches based on systolic Montgomery multiplication are described by Dixon and Lenstra in [127], by Iwamura, Matsumoto, and Imai in [192], and by Walter in [336]. See Section 3.14 on page 67 for more information about systolic Montgomery multiplication.

Another approach is to use the SIMD vector instructions to compute multiple Montgomery multiplications in parallel. This can be useful in applications where many computations need to be processed in parallel, such as batch-RSA or cryptanalysis. This approach is studied by Page and Smart in [274] using the SSE2 vector instructions on a Pentium 4, and by the first author of this chapter in [63] on the Cell Broadband Engine (see Section 2.4.3 on page 31 for more details about this platform).

An approach based on Montgomery multiplication which allows one to split the operand into two parts, which can then be processed in parallel, is called bipartite modular multiplication; it was introduced by Kaihara and Takagi in [199]. The idea is to use a Montgomery radix R = r^{αn}, where α is a rational number such that αn is an integer and 0 < αn < n: hence, the radix R is smaller than the modulus N. For example, one can choose α such that αn = n/2. In order to compute xy·r^{−αn} mod N (where 0 ≤ x, y < N), write y = y1·r^{αn} + y0 and compute

x·y1 mod N   and   x·y0·r^{−αn} mod N

in parallel, using a regular interleaved modular multiplication algorithm (see, e.g., the work by Brickell [81]) and the interleaved Montgomery multiplication algorithm, respectively. The sum of the two products gives the correct Montgomery product of x and y, since

(x·y1 mod N) + (x·y0·r^{−αn} mod N) ≡ x(y1·r^{αn} + y0)·r^{−αn} ≡ xy·r^{−αn} (mod N).
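The bipartite split is easy to express. In this Python sketch (our naming) the two halves appear as plain modular expressions; in an actual implementation each would be an interleaved algorithm running on its own computational unit.

    def bipartite_montmul(x, y, N, r, an):
        # Compute x * y * r^{-an} mod N by splitting y = y1 * r^an + y0;
        # the two partial products below are independent and can run in parallel.
        y1, y0 = divmod(y, r**an)
        t1 = (x * y1) % N                        # classical interleaved half
        t0 = (x * y0 * pow(r**an, -1, N)) % N    # Montgomery half (factor r^{-an})
        return (t1 + t0) % N

    # Check against a direct computation, with r = 2^32, n = 4, an = 2:
    N = (1 << 127) - 1
    x, y = 2**100 + 12345, 2**120 + 67890
    assert bipartite_montmul(x, y, N, 1 << 32, 2) == (x * y * pow(1 << 64, -1, N)) % N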

2.4.2 Montgomery Multiplication Using SIMD Extensions

This section is an extended version of the description of the idea presented by this chapter's authors, Shumow, and Zaverucha in [73], where an algorithm is presented to compute the interleaved Montgomery multiplication using two threads running in parallel which perform identical arithmetic steps. Hence, this algorithm runs efficiently when using the 2-way SIMD vector instructions frequently found on modern computer architectures. For illustrative purposes we assume a radix-2^32 system, but this can be adjusted to any other radix system. Note that for efficiency considerations the choice of the radix system depends on the vector instructions available.


Algorithm 2.2 on page 16 outlines the interleaved Montgomery multiplication algorithm; every iteration computes two 1 × n → (n + 1) computer-word multiplications, namely a_iB and qN, and a single 1 × 1 → 1 computer-word multiplication (μC mod 2^32). Unfortunately, these three multiplications depend on each other and therefore cannot be computed in parallel. Every iteration computes (see Algorithm 2.2)

1. C ← C + a_iB
2. q ← μC mod 2^32
3. C ← (C + qN)/2^32

In order to reduce latency, we would like to compute the two 1 × n → (n + 1) word multiplications in parallel using vector instructions. This can be achieved by removing the dependency between these two multi-word multiplications by computing the value of q first. The first word c0 of C = Σ_{i=0}^{n−1} c_i 2^{32i} is computed twice: once for the computation of q (in μC mod 2^32) and then again in the parallel computation of C + a_iB. This is a relatively minor penalty of one additional one-word multiplication and addition per iteration to make these two multi-word multiplications independent of each other. This means an iteration i can be computed as

1. q ← μ(c0 + a_ib0) mod 2^32
2. C ← C + a_iB
3. C ← (C + qN)/2^32

and the 1 × n → (n + 1) word multiplications in steps 2 and 3 (a_iB and qN) can be computed in parallel using, for instance, 2-way SIMD vector instructions.

In order to rewrite the remaining operations besides the multiplication, the authors of [73] suggest inverting the sign of the Montgomery constant μ: instead of using −N^{-1} mod 2^32, use μ = N^{-1} mod 2^32. When computing the Montgomery product C = AB·2^{−32n} mod N, one can then compute D (which contains the sum of the products a_iB) and E (which contains the sum of the products qN) separately and in parallel, using the same arithmetic operations. Due to the modified choice of the Montgomery constant μ we have C = D − E ≡ AB·2^{−32n} (mod N), where 0 ≤ D, E < N: the maximum values of both D and E fit in an n-word integer. This approach is presented in Algorithm 2.4 on page 29.


Algorithm 2.4 A parallel radix-2^32 interleaved Montgomery multiplication algorithm. Except for the computation of q, the arithmetic steps in the outer for-loop, performed by computation 1 and computation 2, are identical. This approach is suitable for 32-bit 2-way SIMD vector instruction units.
Input: A, B, M, μ such that A = Σ_{i=0}^{n−1} a_i 2^{32i}, B = Σ_{i=0}^{n−1} b_i 2^{32i}, M = Σ_{i=0}^{n−1} m_i 2^{32i}, 0 ≤ a_i, b_i < 2^32, 0 ≤ A, B < M, 2^{32(n−1)} ≤ M < 2^{32n}, 2 ∤ M, μ = M^{-1} mod 2^32.
Output: C ≡ AB·2^{−32n} mod M such that 0 ≤ C < M.

Computation 1 (run in lockstep with computation 2):
  d_i = 0 for 0 ≤ i < n
  for j = 0 to n − 1 do
    t0 ← a_j·b0 + d0
    t0 ← ⌊t0/2^32⌋
    for i = 1 to n − 1 do
      p0 ← a_j·b_i + t0 + d_i
      t0 ← ⌊p0/2^32⌋
      d_{i−1} ← p0 mod 2^32
    end for
    d_{n−1} ← t0
  end for

Computation 2 (run in lockstep with computation 1):
  e_i = 0 for 0 ≤ i < n
  for j = 0 to n − 1 do
    q ← ((μb0)a_j + μ(d0 − e0)) mod 2^32
    t1 ← q·m0 + e0    // where t0 ≡ t1 (mod 2^32)
    t1 ← ⌊t1/2^32⌋
    for i = 1 to n − 1 do
      p1 ← q·m_i + t1 + e_i
      t1 ← ⌊p1/2^32⌋
      e_{i−1} ← p1 mod 2^32
    end for
    e_{n−1} ← t1
  end for

Final step (after both computations):
  C ← D − E, where D = Σ_{i=0}^{n−1} d_i 2^{32i} and E = Σ_{i=0}^{n−1} e_i 2^{32i}
  if C < 0 then C ← C + M end if
  return C

At the start of every iteration of the outer for-loop over j, the two separate computational streams running in parallel need to communicate information to compute the value of q. More precisely, this requires knowledge of both d0 and e0, the least significant words of D and E respectively. Once the values of both d0 and e0 are known to one of the computational streams, the updated value of q can be computed as

q = ((μb0)a_j + μ(d0 − e0)) mod 2^32 = μ(c0 + a_jb0) mod 2^32,

since c0 = d0 − e0. Except for this communication cost between the two streams, to compute the value of q, all arithmetic computations performed by computation 1 and computation 2 in the outer for-loop are identical but work on different data.
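The two-stream structure can be prototyped sequentially; in the Python sketch below (our naming) the D and E updates correspond to the two SIMD lanes of Algorithm 2.4, and discarding the low word in each stream is exact because t0 ≡ t1 (mod 2^w) every iteration.

    def montmul_2way(A, B, M, w, n):
        # Algorithm 2.4 sketch: compute A * B * 2^{-wn} mod M with the
        # inverted-sign constant mu = +M^{-1} mod 2^w and two digit streams.
        r = 1 << w
        mu = pow(M, -1, r)
        D, E = 0, 0
        for j in range(n):
            a_j = (A >> (w * j)) & (r - 1)
            q = (mu * (a_j * (B & (r - 1)) + D - E)) % r
            D = (D + a_j * B) >> w      # stream 1: accumulate a_j * B
            E = (E + q * M) >> w        # stream 2: accumulate q * M
        C = D - E                        # C = AB * 2^{-wn} mod M, with |C| < M
        return C + M if C < 0 else C     # conditional addition, not subtraction

    M = 97
    assert montmul_2way(42, 17, M, 4, 2) == (42 * 17 * pow(1 << 8, -1, M)) % M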


Table 2.1 A simplified comparison, only stating the number of word-level instructions required, to compute the Montgomery multiplication when using a 32n-bit modulus for a positive even integer n. Two approaches are shown: a sequential one on the classical ALU (Algorithm 2.2 on page 16) and a parallel one using 2-way SIMD instructions (performing two operations in parallel, cf. Algorithm 2.4 on page 29).

Instruction        Classical 32-bit   Classical 64-bit   2-way SIMD 32-bit
add                –                  –                  n
sub                –                  –                  n
shortmul           n                  n/2                2n
muladd             2n                 n                  –
muladdadd          2n(n − 1)          n(n/2 − 1)         –
SIMD muladd        –                  –                  n
SIMD muladdadd     –                  –                  n(n − 1)

This makes this approach suitable for computation using 2-way 32-bit SIMD vector instructions. The technique benefits from a 2-way SIMD 32 × 32 → 64-bit multiplication and matches exactly the 128-bit wide vector instructions present in many modern computer architectures. By changing the radix used in Algorithm 2.4, larger or smaller vector instructions can be supported. Note that, as opposed to the conditional subtraction in Algorithm 2.1 on page 14 and Algorithm 2.2 on page 16, Algorithm 2.4 computes a conditional addition, because of the inverted sign of the precomputed Montgomery constant μ: if D − E is negative (produces a borrow), then the modulus is added to make the result positive.

2.4.2.1 Expected Performance

We follow the analysis of the expected performance from [73], which considers execution time only. The idea is to use the number of arithmetic instructions as an indicator of the expected performance when using a 2-way SIMD implementation instead of a regular (non-SIMD) implementation for the classical ALU. We assume the 2-way SIMD implementation works on pairs of 32-bit words in parallel and has access to a 2-way SIMD 32 × 32 → 64-bit multiplication instruction. A comparison to a regular implementation is not straightforward, since the word-size can be different, the platform might be able to compute multiple instructions in parallel (on different


ALUs), and the number of instructions per arithmetic operation might differ. This is why we present a simplified comparison based on the number of arithmetic operations when computing Montgomery multiplication using a 32n-bit modulus for a positive even integer n. We denote by muladd_w(e, a, b, c) and muladdadd_w(e, a, b, c, d) the computation of e = ab + c and e = ab + c + d, respectively, for 0 ≤ a, b, c, d < 2^w (and thus 0 ≤ e < 2^{2w}). These are basic operations on a computer architecture which works on w-bit words. Some platforms have these operations as a single instruction (e.g., some ARM architectures), while others implement them using separate multiplication and addition instructions (as on the x86 platform). Furthermore, let shortmul_w(e, a, b) denote e = ab mod 2^w: this w × w → w-bit multiplication computes only the least significant w bits of the result and is faster than computing a full double-word product on most modern computer platforms.

Table 2.1 summarizes the expected performance of Algorithm 2.2 on page 16 and Algorithm 2.4 on page 29 in terms of arithmetic operations only. The shifting and masking operations are omitted for simplicity, as are the operations required to compute the final conditional subtraction or addition. When taking just the muladd and muladdadd instructions into account, it becomes clear from Table 2.1 that the SIMD approach uses exactly half the number of instructions compared to the classical 32-bit implementation and almost twice as many compared to the classical 64-bit implementation. However, the SIMD approach requires more operations to compute the value of q every iteration and has various other overheads (e.g., inserting values into and extracting them from the large vector registers). Hence, assuming that all the characteristics of the SIMD and classical (non-SIMD) instructions are identical, which is most likely not the case on most platforms, we expect Algorithm 2.4 running on a 2-way 32-bit SIMD unit to outperform a classical 32-bit implementation using Algorithm 2.2 by at most a factor of two, while being roughly half as fast as a classical 64-bit implementation.

2.4.3 A Column-Wise SIMD Approach

A different approach, suitable for computing Montgomery multiplication on architectures supporting 4-way SIMD instructions, is outlined by the first chapter author and Kaihara in [67]. This approach is particularly efficient on the Cell Broadband Engine (see a brief introduction to this architecture below), since it was designed for usage on this platform, but it can be used on any platform supporting the SIMD instructions used in this approach. It differs from the one described in the previous section in that it uses the SIMD instructions to compute the multi-precision arithmetic in parallel, so it works


on a lower level, while the approach from Section 2.4.2 above computes the arithmetic operations themselves sequentially but divides the work into two steps which can be computed concurrently.

2.4.3.1 The Cell Broadband Engine

The Cell Broadband Engine (cf. the introductions given by Hofstee [187] and Gschwind [173]), denoted by “Cell” and jointly developed by Sony, Toshiba, and IBM, is a powerful heterogeneous multiprocessor which was released in 2005. The Cell contains a Power Processing Element, a dual-threaded Power-architecture-based 64-bit processor with access to a 128-bit AltiVec/VMX single instruction, multiple data (SIMD) unit (which is not considered in this chapter). Its main processing power, however, comes from eight Synergistic Processing Elements (SPEs). For an introduction to the circuit design see the work by Takahashi et al. [323]. Each SPE consists of a Synergistic Processing Unit (SPU), 256 KB of private memory called the Local Store (LS), and a Memory Flow Controller (MFC). To avoid the complexity of sending explicit direct memory access requests to the MFC, all code and data must fit within the LS. Each SPU runs independently from the others at 3.192 GHz and is equipped with a large register file containing 128 registers of 128 bits each. Most SPU instructions work on 128-bit operands, denoted quadwords. The instruction set is partitioned into two sets: one consists of (mainly) 4- and 8-way SIMD arithmetic instructions on 32-bit and 16-bit operands respectively, while the other consists of instructions operating on the whole quadword (including the load and store instructions) in a single instruction, single data (SISD) manner. The SPU is an asymmetric processor; each of these two sets of instructions is executed in a separate pipeline, denoted the even and odd pipeline for the SIMD and SISD instructions, respectively. For instance, the {4, 8}-way SIMD left-rotate instruction is an even instruction, while the instruction left-rotating the full quadword is dispatched into the odd pipeline. When dependencies are avoided, a single pair consisting of one odd and one even instruction can be dispatched every clock cycle. One of the first applications of the Cell processor was to serve as the brain of Sony's PlayStation 3 game console. Due to the widespread availability of this game console and the fact that one could install and run one's own software, this platform has been used to accelerate cryptographic operations [112, 111, 95, 272, 74, 63] as well as cryptanalytic algorithms [318, 72, 69].

2.4.3.2 Montgomery Multiplication on the Cell Broadband Engine

In this section we outline the approach presented by the first author of this chapter and Kaihara, tailored towards the instruction set of the Cell Broadband Engine. Most notably, the presented techniques rely on an efficient 4-way


Figure 2.1 The 16-bit words x_i of a 16(n + 1)-bit positive integer X = Σ_{i=0}^{n} x_i 2^{16i} < 2N are stored column-wise using s = ⌈(n + 1)/4⌉ 128-bit vectors X_j on the SPE architecture.

SIMD instruction to multiply two 16-bit integers and add another 16-bit integer to the 32-bit result, and on a large register file. Therefore, the approach described here uses a radix r = 2^16 to divide the large numbers into words that match the input sizes of the 4-way SIMD multipliers of the Cell. This can easily be adapted to any other radix size for different platforms with different SIMD instructions. The idea is to represent integers X in a radix-2^16 system, i.e., X = Σ_{i=0}^{n} x_i 2^{16i} where 0 ≤ x_i < 2^16. However, in order to use the 4-way SIMD instructions of this platform efficiently, these 16-bit digits x_i are stored in a 32-bit datatype. The use of this 32-bit space ensures that intermediate values of the form ab + c do not produce any carries, since when 0 ≤ a, b, c < 2^16 then 0 ≤ ab + c < 2^32. Hence, given an odd 16n-bit modulus N, a Montgomery residue X such that 0 ≤ X < 2N < 2^{16(n+1)} is represented using s = ⌈(n + 1)/4⌉ vectors of 128 bits. Note that this representation uses roughly twice the number of bits compared to storing the residue in a “normal” radix representation. The single additional 16-bit word is required because the intermediate accumulating result of Montgomery multiplication can be almost as large as 2N (see page 20 in Section 2.2.4, above). The 16-bit digits x_i are placed column-wise in the four 32-bit datatypes of the 128-bit vectors. This representation is illustrated in Figure 2.1. The four 32-bit parts of the j-th 128-bit vector X_j are denoted by X_j = {X_j[3], X_j[2], X_j[1], X_j[0]}. Each of the (n + 1) 16-bit words x_i of X is stored in the most significant 16 bits of X_{i mod s}[⌊i/s⌋]. The motivation for using this column-wise representation


is that a division by 2^16 can be computed efficiently: simply move the digits in vector X_0 “one position to the right,” which in practice means a logical 32-bit right shift, and relabel the indices such that X_j becomes X_{j−1}, for 1 ≤ j ≤ s − 1, while the modified vector X_0 becomes the new X_{s−1}. Algorithm 2.5 on page 35 computes Montgomery multiplication using such a 4-way column-wise SIMD representation. In each iteration, the indices of the vectors that contain the accumulating partial product U change cyclically among the s registers.

In Algorithm 2.5, each 16-bit word of the inputs X, Y and N and of the output Z is stored in the upper part (the most significant 16 bits) of each of the four 32-bit words in a 128-bit vector. The vector μ contains the replicated values of −N^{-1} mod 2^16 in the lower 16-bit positions of the four 32-bit words. In its most significant 16-bit positions, the temporary vector K stores the replicated values of y_i, i.e., each of the parsed coefficients of the multiplier Y corresponding to the i-th iteration of the main loop. The operation A ← muladd(B, c, D), which is a single instruction on the SPE, represents the operation of multiplying the vector B (where data are stored in the higher 16-bit positions of 32-bit words) by a vector with replicated 16-bit values of c across all higher positions of the 32-bit words. This product is added to D (in 4-way SIMD manner) and the overall result is placed into A. The temporary vector V stores the replicated values of u_0 in the least significant 16-bit words. This u_0 refers to the least significant 16-bit word of the updated value of U, where U = Σ_{j=0}^{n} u_j 2^{16j} is stored as the s 128-bit vectors U_{i mod s}, U_{(i+1) mod s}, . . . , U_{(i+n) mod s} (where i refers to the index of the main loop). The vector Q is computed as an element-wise logical left shift by 16 bits of the 4-way SIMD product of the vectors V and μ.

The propagation of the higher 16-bit carries of U_{(i+j) mod s}, as stated in lines 10 and 18 of Algorithm 2.5, consists of extracting the higher 16-bit words of these vectors and placing them into the lower 16-bit positions of temporary vectors. These vectors are then added to the “next” vector U_{(i+j+1) mod s} correspondingly. The operation is carried out for the vectors with indices j ∈ {0, 1, . . . , s − 2}. For j = s − 1, the last index, the temporary vector that contains the words is logically shifted 32 bits to the left and added to the vector U_{i mod s}. Similarly, the carry propagation of the higher words of U_{(i+j) mod s} in line 22 of Algorithm 2.5 is performed with 16-bit word extraction and addition, but requires a sequential parsing over the (n + 1) 16-bit words.

Hence, the approach outlined in Algorithm 2.5 computes Montgomery multiplication by computing the multi-word multiplications using SIMD instructions and representing the integers using a column-wise approach (see Figure 2.1 on page 33). This comes at the cost that a single 16n-bit integer is represented by 128⌈(n + 1)/4⌉ bits: requiring slightly over twice the amount of storage.


Algorithm 2.5 Montgomery multiplication algorithm for the Cell.
Input:
  N, represented by s 128-bit vectors N_{s−1}, . . . , N_0, such that 2^{16(n−1)} ≤ N < 2^{16n} and 2 ∤ N;
  X, Y, each represented by s 128-bit vectors X_{s−1}, . . . , X_0 and Y_{s−1}, . . . , Y_0, such that 0 ≤ X, Y < 2N;
  μ, a 128-bit vector containing (−N)^{-1} (mod 2^16) replicated in all four elements.
Output: Z, represented by s 128-bit vectors Z_{s−1}, . . . , Z_0, such that Z ≡ XY·2^{−16(n+1)} mod N and 0 ≤ Z < 2N.
1: for j = 0 to s − 1 do
2:   U_j = 0
3: end for
4: for i = 0 to n do
5:   /* lines 6–9 compute U = y_i X + U */
6:   K = {y_i, y_i, y_i, y_i}
7:   for j = 0 to s − 1 do
8:     U_{(i+j) mod s} = muladd(X_j, K, U_{(i+j) mod s})
9:   end for
10:  Carry propagation on U_{(i+j) mod s} for j = 0, . . . , s − 1 (see text)
11:  /* lines 12–13 compute Q = μV mod 2^16 */
12:  V = {u_0, u_0, u_0, u_0}
13:  Q = shiftleft(mul(V, μ), 16)
14:  /* lines 15–17 compute U = NQ + U */
15:  for j = 0 to s − 1 do
16:    U_{(i+j) mod s} = muladd(N_j, Q, U_{(i+j) mod s})
17:  end for
18:  Carry propagation on U_{(i+j) mod s} for j = 0, . . . , s − 1 (see text)
19:  /* line 20 computes the division by 2^16 */
20:  U_{i mod s} = vshiftright(U_{i mod s}, 32)
21: end for
22: Carry propagation on U_{i mod s} for i = n + 1, . . . , 2n + 1 (see text)
23: for j = 0 to s − 1 do
24:   Z_j = U_{(n+j+1) mod s}
25: end for

Note, however, that an implementation of this technique outperforms the native multi-precision big-number library on the Cell processor by a factor of about 2.5, as summarized in [67].


2.4.4 Montgomery Multiplication Using the Residue Number System Representation

The residue number system (RNS), as introduced by Garner [156] and Merrill [244], is an approach based on the Chinese remainder theorem to represent an integer by a number of residues modulo smaller (coprime) integers. The advantage of RNS is that additions, subtractions, and multiplications can be performed independently and concurrently on these smaller residues. Given an RNS basis β_n = {r_1, r_2, . . . , r_n}, where gcd(r_i, r_j) = 1 for i ≠ j, the RNS modulus is defined as R = Π_{i=1}^{n} r_i. Given an integer x ∈ Z/RZ and the RNS basis β_n, this integer x is represented as the n-tuple x = (x_1, x_2, . . . , x_n), where x_i = x mod r_i for 1 ≤ i ≤ n. In order to convert an n-tuple back to its integer value, one can apply the Chinese remainder theorem (CRT):

x = ( Σ_{i=1}^{n} (R/r_i)·((R/r_i)^{-1} x_i mod r_i) ) mod R.   (2.7)

Modular multiplication using Montgomery multiplication in the RNS setting has been studied, for instance, by Posch and Posch in [290] and by Bajard, Didier, and Kornerup in [23] and subsequent work. In this section we outline how to achieve this. First note that, for the applications in which we are interested, we cannot use the modulus N as the RNS modulus, since N is either prime (in the setting of elliptic curve cryptography) or a product of two large primes (when using RSA). When computing the Montgomery reduction one has to perform arithmetic modulo the Montgomery radix. One possible approach is to use the RNS modulus R = Π_{i=1}^{n} r_i as the Montgomery radix. This has the advantage that whenever one computes with integers represented in this residue number system they are reduced modulo R implicitly. However, since we are performing arithmetic in the ring Z/RZ, division by R, as required in the Montgomery reduction, is not well defined.

One way this problem can be circumvented is by introducing an auxiliary basis β′_n = {r′_1, r′_2, . . . , r′_n} with auxiliary RNS modulus R′ = Π_{i=1}^{n} r′_i such that gcd(R′, R) = gcd(R, N) = gcd(R′, N) = 1 (and both R and R′ are larger than 4N). The idea is to convert the intermediate result represented in β_n to the auxiliary basis β′_n and perform the division by R there (since R and R′ are coprime this inverse exists). The concept of base extension, converting the representation from one base to another, is due to Szabo and Tanaka [321] (but see also the work by Gregory and Matula in [169]). Methods are either based on the CRT, as used by Shenoy and Kumaresan [312], Posch and Posch [289], and Kawamura, Koike, Sano,


and Shimbo [205], or on an intermediate representation denoted a mixed radix system, as presented by Szabo and Tanaka in [321]. Carefully selected RNS bases can significantly impact the performance in practice, as shown by Bajard, Kaihara, and Plantard in [22] and Bigou and Tisserand in [52]. Another RNS approach is presented by Phillips, Kong, and Lim [277].

With these two RNS bases defined, we can compute the Montgomery product modulo N. Let R = Π_{i=1}^{n} r_i and R′ = Π_{i=1}^{n} r′_i be the two RNS moduli for the RNS bases β_n and β′_n respectively. The representations of N in β_n and β′_n are denoted by N̂ and N̂′. Let A, B ∈ Z/NZ be represented in both RNS bases: β_n (a and b) and β′_n (a′ and b′). Then we can compute the Montgomery product in both RNS bases using the following steps.

1. Compute the product of A and B in both RNS bases β_n and β′_n: d = a·b, where d_i = a_i b_i mod r_i for 1 ≤ i ≤ n, and d′ = a′·b′, where d′_i = a′_i b′_i mod r′_i for 1 ≤ i ≤ n.
2. Compute (ab)·(−N^{-1} mod R) mod R. This is realized by computing q = d·μ̂ in basis β_n, where μ̂ is the representation of μ = −N^{-1} mod R, i.e., q_i = μ̂_i d_i mod r_i for 1 ≤ i ≤ n.
3. Convert q in basis β_n to q′ in basis β′_n (for instance using Equation (2.7)).
4. Compute the final part c′ = (d′ + q′·N̂′)·R̂^{-1} of the Montgomery multiplication (including the division by R) in basis β′_n by computing c′_i = (d′_i + q′_i N′_i)·(R^{-1} mod r′_i) mod r′_i for 1 ≤ i ≤ n.
5. Convert c′ in basis β′_n to c in basis β_n (for instance using Equation (2.7)).

After step 5 we have c = {c_1, c_2, . . . , c_n} and c′ = {c′_1, c′_2, . . . , c′_n} such that c_i ≡ (abR^{-1} mod N) mod r_i and c′_i ≡ (abR^{-1} mod N) mod r′_i.

This approach has been used to implement asymmetric cryptography on highly parallel computer architectures like graphics processing units (e.g., as in [24, 265, 322, 182, 10]). The results presented in these papers show that, when multiple Montgomery multiplications are computed concurrently using RNS, the latency can be reduced significantly while the throughput is increased (compared to computation on a multi-core CPU) when computing with thousands of threads on the hundreds of cores of a graphics processing unit. This highlights the potential of using graphics cards as cryptographic accelerators when large batches of work require processing (and low latency is required). The process is illustrated in the example below.

Example 2.4 (RNS Montgomery multiplication) Let A = 42, B = 17 and N = 67. In this example we show how to compute the Montgomery product ABR^{-1} mod N using a residue number system. Let


us first define the two coprime RNS bases

β_3 = {3, 7, 13} with RNS modulus R = 3 · 7 · 13 = 273,
β′_3 = {5, 11, 17} with RNS modulus R′ = 5 · 11 · 17 = 935,

such that both RNS moduli are larger than 4N. Recall that R plays the role both of the RNS modulus and of the Montgomery radix. First, we need to represent the inputs A and B in both RNS bases: A = 42 is represented as

• a = {42 mod 3, 42 mod 7, 42 mod 13} = {0, 0, 3} in basis β_3,
• a′ = {42 mod 5, 42 mod 11, 42 mod 17} = {2, 9, 8} in basis β′_3,

and B = 17 is represented as

• b = {17 mod 3, 17 mod 7, 17 mod 13} = {2, 3, 4} in basis β_3,
• b′ = {17 mod 5, 17 mod 11, 17 mod 17} = {2, 6, 0} in basis β′_3.

Furthermore, we need to represent the precomputed Montgomery constant μ = −N^{−1} mod R = −67^{−1} mod 273 = 110 in basis β3:

μ̃ = {110 mod 3, 110 mod 7, 110 mod 13} = {2, 5, 6},

as well as the modulus N in basis β′3:

Ñ′ = {67 mod 5, 67 mod 11, 67 mod 17} = {2, 1, 16}.

Compute the first step: the product of these two numbers in both bases. This can be done in parallel for all the individual moduli:

d = a · b = {0 · 2 mod 3, 0 · 3 mod 7, 3 · 4 mod 13} = {0, 0, 12},
d′ = a′ · b′ = {2 · 2 mod 5, 9 · 6 mod 11, 8 · 0 mod 17} = {4, 10, 0}.

Next, compute q = d · μ̃ in basis β3:

q = d · μ̃ = {0 · 2 mod 3, 0 · 5 mod 7, 12 · 6 mod 13} = {0, 0, 7}.

Change the representation of q: convert q = {0, 0, 7}, which is represented in basis β3, to q′, which is represented in basis β′3. This can be done by first converting q back to its integer representation following Equation (2.7) on page 36:

q = ((0 · (273/3)^{−1} mod 3) · (273/3) + (0 · (273/7)^{−1} mod 7) · (273/7) + (7 · (273/13)^{−1} mod 13) · (273/13)) mod 273 = 0 + 0 + 7 · 5 · 21 mod 273 = 189.


From q obtain q′ = {189 mod 5, 189 mod 11, 189 mod 17} = {4, 2, 2}. The final step computes the result c′ in basis β′3 as c′ = (d′ + q′ · Ñ′) · R^{−1}:

c′ = ({4, 10, 0} + {4, 2, 2} · {2, 1, 16}) · {2, 5, 1} = {4, 5, 15}.

When converting this to the integer representation we obtain

c = ((4 · (935/5)^{−1} mod 5) · (935/5) + (5 · (935/11)^{−1} mod 11) · (935/11) + (15 · (935/17)^{−1} mod 17) · (935/17)) mod 935 = (374 + 170 + 440) mod 935 = 49.

This is indeed correct since c = ABR^{−1} mod N = 42 · 17 · (3 · 7 · 13)^{−1} mod 67 = 49.

Acknowledgements

The authors want to thank Robert Granger, Arjen K. Lenstra, Paul Leyland, and Colin D. Walter for feedback and suggestions. Peter Montgomery’s authorship is honoris causa; all errors and inconsistencies in this chapter are the sole responsibility of the first author.


3 Hardware Aspects of Montgomery Modular Multiplication
Colin D. Walter

Abstract

This chapter compares Peter Montgomery’s modular multiplication method with traditional techniques for suitability on hardware platforms. It also covers systolic array implementations and side-channel leakage.

3.1 Introduction and Summary

This chapter looks at the hardware implementation of Peter Montgomery’s Modular multiplication without trial division [251]. Such dedicated hardware is used primarily for arithmetic over the rational integers Z with a very large modulus N, including the prime field GF(p) case when N = p is prime. For simplicity, it is assumed that all the arithmetic here is over the integers. There are increasingly important applications over large finite fields. However, apart from simpler carry propagation when the field characteristic is very small, the main issues covered here have similar solutions. The interested reader might start by consulting [219, 298, 50, 6, 223] for this case. Because of the overhead of translating to and from the Montgomery domain (see Section 2.2.2 on page 16), use of Montgomery’s method is, for the most part, in cryptography where exponentiation is a central operation and the cost of the translation can be amortised over all the modular multiplications in the exponentiation. Diffie–Hellman key exchange, RSA, DSA, ElGamal encryption and their elliptic curve equivalents are chief among the applications [126, 294, 267, 140, 247, 217, 180]. This justifies the concentration on instances over Z where the modulus N is a large integer, typically with 1024 or more bits. A consequence of this is that, with representations in radix r, processing a regular digit position in the long number arithmetic should be as efficient as possible and, with the notation of Algorithm 2.2 on page 16, the


determination of the multiple q of N that has to be added to or subtracted from the running total P should not slow down such processing. Although not used exclusively, Montgomery’s method should be the predominant one for modular multiplication in the applications above. We will consider the reasons for its already widespread adoption by comparing it with classical techniques, demonstrating how it fits in well with standard word sizes whereas standard methods suffer from widespread contamination by over-large digits.¹ The chapter is structured to provide an overview of the main acceleration techniques for hardware modular multiplication and for each of these we deduce that Montgomery multiplication is better than, or at least as good as, classical techniques. Many of the implementations of interest to us occur in smart cards where side-channel leakage is a significant threat. Consequently, the final part of this chapter is a study of the security issues that arise from the use of Montgomery’s method and how they can be mitigated. At the other end of a secure transaction involving a smart card there is probably a server performing a large number of simultaneous cryptographic processes for many secure transactions, and so requiring the capability of very high throughput. This can be achieved using a systolic array and, for this context, Montgomery’s algorithm is really the only sensible choice. The target hardware could be one of a wide variety of different possibilities. Among these there are small smart-card cryptographic co-processors with a single 8- or 16-bit multiplier, ARM-based processors, single core and multi-core Single Instruction Multiple Data (SIMD) processors with pipelined multipliers, systolic arrays with substantial processing power in each processing element, and Application Specific Integrated Circuits (ASICs) which may process all bits of the modulus in parallel. The variety means that allowance must be made for all possible ways of implementing the basic arithmetic operations which appear in a description of Montgomery’s modular multiplication method, particularly addition and scalar multiplication, and indeed including consideration of number representations (such as redundant ones) which are alternatives to the usual binary. The low-level programming of these architectures was covered


¹ Where classical methods are still used, the reason is often ascribed to the difficulty of translating inputs to the Montgomery domain – see Section 2.2.2 on page 16. This requires computing and storing R² mod N where R is the Montgomery radix. However, if the public exponent E is known and the exponentiation is to the power D where DE ≡ 1 mod φ(N), then this reason is spurious. Specifically, start by computing 1^{E−1} using Montgomery multiplication. Then U ≡ R^{2−E} mod N is obtained if E > 2. This generally requires only a small number of multiplications as E is typically a small Fermat prime. Next Montgomery-multiply A and U to obtain A·R^{1−E} mod N. Raising this to the power D > 1 using Montgomery multiplication yields (A·R^{1−E})^D·R^{1−D} ≡ A^D·R^{D(1−E)+(1−D)} ≡ A^D·R^{1−DE} ≡ A^D mod N, since DE ≡ 1 mod φ(N) makes R^{DE−1} ≡ 1 mod N. Thus A^D mod N is obtained directly, the usual post-processing Montgomery-multiplication by 1 is not even needed, and Montgomery representations have been avoided.


in the previous chapter. Here we consider the hardware itself, particularly the size of the digit×digit multiplier, how to increase clock speed, and the effects of bus width and communications.
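The exponent bookkeeping in footnote 1 is easy to check numerically. The sketch below (toy parameters and helper names are ours) emulates Montgomery multiplication with ordinary modular arithmetic; mon_exp is a plain square-and-multiply built from mon_mul, so it returns x^e·R^{1−e} mod N.

```python
# Checking footnote 1 on toy numbers: no R^2 mod N and no Montgomery
# representation of A is ever needed. Parameters and names are ours.
N, E, D = 187, 3, 107          # N = 11*17, and E*D = 321 = 2*phi(N) + 1
R = 256                        # Montgomery radix: a power of 2 exceeding N
R_inv = pow(R, -1, N)

def mon_mul(x, y):
    return x * y * R_inv % N   # emulates a Montgomery multiplication

def mon_exp(x, e):             # square-and-multiply over mon_mul, e >= 1
    acc = x                    # invariant: acc = x^k * R^(1-k) mod N
    for bit in bin(e)[3:]:
        acc = mon_mul(acc, acc)
        if bit == '1':
            acc = mon_mul(acc, x)
    return acc

U = mon_exp(1, E - 1)                    # U = R^(2-E) mod N, needs E > 2
X = mon_mul(42, U)                       # X = 42 * R^(1-E) mod N
assert mon_exp(X, D) == pow(42, D, N)    # A^D mod N obtained directly
```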

3.2 Historical Remarks

By the late 1980s, RSA encryption had been in the public domain for a dozen or so years and was regarded as more secure for encryption than DES because academic mathematicians had failed to make significant progress in factoring large numbers despite considerable effort, and there were mathematically unfounded suspicions that IBM had built some weaknesses into DES that would allow the NSA to crack ciphertexts more easily than could the general public. Consequently, there was great interest in implementing RSA not just for military purposes but also to provide better security in the banking sector, where card payment at point of sale was already very well established using hand-written signatures for verification but only plain text data transmission. There was an expanding market which made it commercially viable to develop point of sale terminals containing RSA for secure data transmission, and which was shortly going to expand to smart cards. Then, “smart” cards were used only in prepaid telephone applications and contained essentially no security. The market for them was yet to develop. At that time RISC-based processors such as the ARM2 with its 32-bit multiplier did not yet yield acceptable performance for software implementations of RSA. ASIC cryptographic co-processors with a full battery of hardware acceleration techniques and the most efficient algorithms were required to yield the necessary encryption times of at most several seconds. (A fraction of a second is required nowadays.) Peter Montgomery’s then recently published work [251] on modular multiplication played, and continues to play, a significant role in achieving user-acceptable encryption and signing speeds, perhaps now more than ever with the proliferation of RFID tags and embedded cryptographic devices in an increasingly security-conscious world.

3.3 Montgomery’s Novel Modular Multiplication Algorithm

Modular multiplication without trial division was published in 1985 [251] and provided a more efficient way of implementing the necessary arithmetic, historically at exactly the right time. Indeed, it arose out of the need to speed up RSA. The details are given in Algorithms 2.1 on page 14 and 2.2 on page 16 of the previous chapter, and the reader is encouraged to renew familiarity with the methods and the notation provided there.


As the title of the paper suggests, Montgomery’s technique avoids the delay caused by the usual trial-and-error method of determining which multiple q of the modulus N needs to be subtracted from partial products P, replacing that trial division with an exact digit×digit multiplication. Readers will be familiar with the schoolbook method of long division in which the first few decimal digits of N and P are used in a trial-and-error manner to determine the first digit of the quotient P/N. A first estimate q̂ is used to compute P − q̂·N and it is then refined to the correct value q depending on whether the result is less than 0 or ≥ N. Similarly, with hardware rather than human processors, the base is a power of 2 such as 2^16 or 2^32 rather than 10, but the problem is otherwise identical for the multi-digit numbers appearing in RSA. As explained in the previous chapter, Peter Montgomery’s method replaces the usual hit-or-miss technique with an exact calculation which depends only on the lowest digit of the partial product and the modulus (see Algorithm 2.2 on page 16). Although there is the penalty of scaling inputs and outputs to or from their Montgomery representations in the Montgomery domain (see Section 2.2.2 on page 16), there are other advantages. For example, the exact process at each digit iteration means much more straightforward firmware, which is both easier to verify and takes up less ROM. Furthermore, subtraction can usually be completely eliminated. This has the significant commercial value of much less likelihood of company critical errors in the implementation as well as shorter time-to-market and smaller die size.

3.4 Standard Acceleration Techniques

The reason for emphasising the commercial and implementation advantages of Peter Montgomery’s algorithm is that the mathematical advantages are often over-stated in the literature. Indeed, with the use of appropriate implementation techniques, the overall complexity seems to be asymptotically exactly the same as for classical modular reduction techniques when the argument size increases. The main disadvantage is that, unlike Montgomery’s algorithm, traditional techniques are ill-matched to standard hardware word sizes, bus widths and multiplier sizes. However, before reaching that conclusion, let us review all the normal hardware acceleration techniques applicable in the traditional computation of A · B mod N and consider their analogues in the Montgomery computation of ABR^{−1} mod N. Most of these were enumerated in [139] and they are covered in detail in the following sub-sections. This comparison will show where the main differences are in the complexity of implementing the arithmetic components. Chief among these is the determination and use of the quotient digit q which is covered in Section 3.7 on page 48. There are usually significant differences between the two algorithms for this aspect when the


hardware multiplier is already provided rather than custom-built: most digit-level multipliers are ill-adapted to processing the slightly larger digits encountered in classical algorithms. At the cutting edge of the fastest or most efficient implementations, it is the fine detail which becomes important, such as the critical path length, the number of load and write operations, loop unrolling, the types of register and memory used, and the area devoted to wiring. The interested reader is referred to the vast research literature such as [242] for this more detailed level of coverage. Further, we omit consideration of enhancements that make modular squaring faster than modular multiplication. Instead, we confine ourselves to what is sufficient in a normal commercial setting, ignoring also minor costs such as those for loop establishment and index calculations. Several of the acceleration techniques are already covered in the previous chapter, there being no clear boundary between what is hardware and what is software. Some techniques involved the choice of moduli with particular properties such as sparseness (of non-zero bits) − see Section 2.3 on page 22. In this chapter it is assumed that the hardware needs to process any modulus, and so inputs may need to be transformed first in order to benefit from the previously mentioned techniques. Such transformations are described in detail. Although “efficiency” is measured here in terms of Time or Area × Time, in many portable and some RFID devices the predominant issue is energy consumption. For cryptographic applications, the choice of cryptosystem combined with the use of dedicated rather than general purpose hardware is critical in reducing the energy used. At the hardware level the use of fields with small characteristic saves power in a multiplier by reducing the switching activity due to propagating carries. Secondly, careful algorithm design to reduce memory access is also very helpful, and is applicable to time efficiency as well. Thirdly, one can employ fast multiplication techniques such as Karatsuba–Ofman [202, 170, 203]. However, for the most efficient use of the space available in this chapter, the discussion of energy efficiency is limited to this paragraph!

3.5 Shifting the Modulus N

3.5.1 The Classical Algorithm

When using the traditional algorithm (see Algorithm 3.1) it is advisable to shift the modulus N up so that its most significant bit always has the same position in the hardware (normally the top bit of a register). This left alignment allows moduli with different numbers of bits to be processed in a more standard way,


reducing combinational logic and ROM code. Identical adjusting shifts are made to A or B and reversed in the output. If S is the power of 2 corresponding to the number of bit positions in the shift then the new values to use for A and N are AS and NS. So the correct result is obtained by computing (A · B) mod N = S^{−1}((AS · B) mod NS) in which S^{−1} is the shift back down. For efficiency, the quotient digit q in line 4 of Algorithm 3.1 is generally computed from only the top bits of P and N. So, if n̄ is the value of the top bits of the register containing N which are used for this, then the shift up means that n̄ has a more limited range of values. In particular, the top bit of n̄ being non-zero keeps its value well away from zero and so provides an upper limit to q, whose definition includes a division by n̄ or n̄ + 1. We look at the definition and computation of q in more detail in Section 3.7 on page 48.

Algorithm 3.1 A radix-r interleaved classical modular multiplication algorithm to compute A · B mod N.
Input: A = ∑_{i=0}^{n−1} ai·r^i, B, N such that 0 ≤ ai < r, 0 ≤ A, B < N < r^n.
Output: P ≡ A · B mod N such that 0 ≤ P < N.
1: P ← 0
2: for i = n − 1 downto 0 do
3:   P ← rP + ai · B
4:   q ≈ ⌊P/N⌋ (a lower approximation to the greatest integer ≤ P/N)
5:   P ← P − q · N
6: end for
7: while P ≥ N do
8:   P ← P − N
9: end while
10: return P
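A direct Python rendering of Algorithm 3.1 may help fix ideas. In this sketch (ours) the lower approximation of line 4 is taken to be exact, so the trailing while-loop never fires; a hardware implementation would instead use the truncated estimate developed in Section 3.7 and occasionally rely on lines 7 to 9.

```python
# Algorithm 3.1 as runnable Python (sketch; the exact quotient stands in
# for the hardware approximation of Section 3.7).
def classical_modmul(A, B, N, r=2**16):
    digits = []                      # radix-r digits of A, most significant first
    while A > 0:
        A, low = divmod(A, r)
        digits.insert(0, low)
    P = 0
    for ai in digits:
        P = r * P + ai * B           # line 3: multiplication step
        q = P // N                   # line 4: here the exact quotient digit
        P = P - q * N                # line 5: reduction step
    while P >= N:                    # lines 7-9: final conditional subtraction
        P -= N
    return P

N = 2**127 - 1
assert classical_modmul(3**50, 5**40, N) == 3**50 * 5**40 % N
```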

3.5.2 Montgomery

For completeness, it is worth observing that there is an analogue of this classical technique for Montgomery multiplication that is slightly more than just the right alignment of the modulus and operands. It makes the Montgomery method generally applicable instead of only in circumstances where the modulus is prime to the radix r of the number representation. If we take N to be a general modulus as in the classical case, then any common factor S with the computation base r needs to be removed before applying


Montgomery’s reduction technique. Replacing N by N′ = S^{−1}·N is the analogous shifting down process. With binary representations, S is a power of 2, N′ is odd, and the lowest non-zero bit of N′ is the bottom one in its register. The Montgomery multiplication is then done modulo N′. For RSA and ECC analogues, N is odd and certainly prime to any conceivable computing base. Hence N does not generally require to be shifted down in this way, and the normal right alignment corresponds to no shift, i.e., S = 1. Nevertheless, with this action Montgomery can be used for general moduli. To complete the computation after the shift adjustment requires application of the Chinese Remainder Theorem (CRT) for the coprime moduli N′ and S. This is best done after exiting from the Montgomery domain (see Section 2.2.2 on page 16). Using Garner’s formula [156] we have

(A · B) mod N = (A · B) mod N′ + N′ · T where T = (N′^{−1} mod S)·((A · B) mod S − (A · B) mod N′) mod S.

Of course, here T is probably calculated very easily even in the most general setting so that the overhead is low. This is because mod S selects the lowest bits of a number to base r. In particular, if S divides the computation base r, then T is a single digit computed from the lowest digit of each of A, B and N′. A similar process of adjusting the modulus can be applied to any modular arithmetic, not just to modular multiplication, thereby enabling the Montgomery modular reduction technique to be used just as easily in a general setting. From here onwards it is assumed that the shifting described in this section is performed. Consequently, the highest or lowest bit of N, as appropriate, is assumed to be 1 in all future discussion.
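A sketch of this even-modulus adjustment in Python (ours): the reduction modulo N′ is written as an ordinary modular product, which is exactly where a Montgomery multiplication would be used in practice, and Garner's formula recombines the two halves.

```python
# Modular multiplication for an even modulus via the split N = S * N'
# (sketch). The plain "% Np" stands in for Montgomery reduction mod N'.
def modmul_even_modulus(A, B, N):
    S = N & -N                             # largest power of 2 dividing N
    Np = N // S                            # the shifted-down odd modulus N'
    v = A * B % Np                         # would use Montgomery mod N'
    T = pow(Np, -1, S) * (A * B % S - v) % S   # Garner's formula
    return v + Np * T                      # lies in [0, N) by construction

N = 2**5 * 1000003                         # an even modulus
assert modmul_even_modulus(123456, 654321, N) == 123456 * 654321 % N
```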

3.6 Interleaving Multiplication Steps with Modular Reduction

As already noted on page 16 in Section 2.2.1, instead of computing the product A · B first and then performing the reduction modulo N, it is advisable to interleave modular subtractions with the shift-and-add steps during the normal calculation of the product. This means the partial product stays roughly the size of the modulus N rather than being as large as A · B. This saves register space. The term integrated is often used for this technique, as opposed to separated, which is when the modular reduction is performed after the multiplication.


However, the product can be computed in two main ways. The easier to organise and more uniform is the operand scanning technique in which the partial product P is increased by ai·B, with i being decremented or incremented at each iteration step according to whether the classical or Montgomery algorithm is being used: P ← P + ai·B. This is how it is computed in Algorithm 3.1 on page 45 and Algorithm 2.2 on page 16. Alternatively, the product scanning technique accumulates all the digit products aj·b_{i−j} (j = 0, 1, 2, . . .) at the ith step, again with i decremented or incremented at each step: P ← P + (∑_j aj·b_{i−j})·r^i for digit base r [134, 171]. For both these alternatives, there is the choice of whether to alternate between the multiplication steps and the reduction steps P ← P ± qN at a digit level, or at a limb level − the level of the outermost loop over multiplier digits. These are referred to as the finely integrated and coarsely integrated operand or product scanning techniques respectively (FIOS and CIOS for operand scanning). The details of the different possibilities have been much studied by Ç. Koç and others [93, 242].² The reader is referred to those publications for further details. Here, as in Algorithms 3.1 and 2.2, we concentrate on CIOS where the multiplication steps P ← P + ai·B alternate with reduction steps P ← P ± qN. However, other processing orders can be beneficial in processing carries and preventing pipeline stalls on certain hardware architectures with SIMD operations [308]. For product scanning, the successive quotient digits q have to be stored. This makes its area requirements greater than for operand scanning. Consequently, interleaved multiplication using operand scanning is generally preferred and that is what is described in this chapter. On the other hand, Liu and Großschädl [242] note that product scanning requires fewer store instructions on an Atmega128 with 32 registers, and they use it for speed. Readers who wish to perform the multiplication independently in advance of the reduction, or perform just a Montgomery reduction, can still use the contents of this chapter almost verbatim. As with hardware multiplication, a library modular multiplication routine is often best presented as a modular multiply-accumulate operation since the initial partial product is almost as easily set to the value of the accumulate argument as to zero. That can be done by a very


² The paper [93] claims that FIOS requires more digit-level additions and reading/writing operations, making it slower than CIOS. This occurs in their algorithm because of an extra digit addition to incorporate the upper word from the first multiply-accumulate digit operation (MAC). However, this extra addition can be avoided by having two carry digit variables instead of one and incorporating one carry into each of the two multiply-accumulate operations as occurs in the MACs of their CIOS version. Thus the algorithms should use the same number of each type of digit operation.


minor modification of Algorithm 2.2 on page 16. So, for a modular reduction only, readers just need to omit the step P ← P + ai B (i.e., set A = 0 or B = 0) after initialising P to the product value which, in this case, may initially have many more digits than N.
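As an illustration of the CIOS processing order, here is a word-level Python sketch (ours, following the shape of Algorithm 2.2): each iteration performs one multiplication step, derives the quotient digit from the bottom digit of P, and then performs one reduction step with an exact division by r.

```python
# Coarsely integrated operand scanning (CIOS) Montgomery multiplication
# (sketch): returns A*B*R^-1 mod N for R = r^n, with gcd(N, r) = 1.
def cios_montgomery(A, B, N, n, r=2**16):
    mu = (-pow(N, -1, r)) % r        # -N^-1 mod r, precomputed once per N
    P = 0
    for _ in range(n):
        A, ai = divmod(A, r)         # consume the next digit of A
        P += ai * B                  # multiplication step
        q = P * mu % r               # quotient digit from bottom digit of P
        P = (P + q * N) // r         # reduction step: division by r is exact
    return P if P < N else P - N     # P < 2N throughout, so one subtraction

r, n = 2**16, 4                      # R = r^n = 2^64
N = 2**63 + 29                       # an odd modulus with n digits
A, B = 0x123456789ABCDEF, 0xFEDCBA987654321
assert cios_montgomery(A, B, N, n) == A * B * pow(r**n, -1, N) % N
```

Initialising P to an accumulate argument instead of zero turns this into the modular multiply-accumulate entry point mentioned above.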

3.7 Accepting Inaccuracy in Quotient Digits

This section takes a close look at the definition and computation of the quotient digit q in order to highlight the main difference between the traditional and Montgomery algorithms. It is primarily this difference that makes Montgomery the preferred method in the majority of high-end commercial cryptographic applications, whether in hardware or software. Readers are invited to skim or skip the technical details of the following sub-section on the traditional modular multiplication algorithm and head straight for Section 3.7.3 on page 52 on Montgomery’s method unless and until they have the need to check the mathematics or perform similar calculations themselves! A final summary in Section 3.7.4 on page 52 covers what is necessary to appreciate the value of Peter’s contribution.

3.7.1 Traditional

The aim of the acceleration technique in this section is to simplify the hardware logic for calculating q so that it is much faster. This is achieved in two ways. One is to replace the division by N in line 4 of the classical Algorithm 3.1 on page 45 with a multiplication by a pre-calculated N^{−1}. Multiplication is faster than division, so this improves the speed. The other aid is to approximate the long numbers P and N^{−1} by using only their most significant digits. The shortened multiplication will also speed up the calculation of q. Recognition of the use of such an approximation process is given by writing “q ≈ . . .” in the algorithm. The resulting method is generally known either as Barrett reduction [31] or as Quisquater’s method [198], depending on the details of the approximations. The more general version here appeared in [334] and later in [122, 123, 124, 213], and it allows more control over how large P might become. These simplifications lead to an occasionally slightly inaccurate value for ⌊P/N⌋. However, with care they can be used in every multiplication step without any correcting adjustment until after the main loop terminates. Clearly, if too low a multiple of N is subtracted on one iteration, a compensating subtraction needs to be made at some future point. Hence the “digit” q may have


to be larger than the natural bound of the base r which is being used for the number representations. The larger digit then compensates for the inaccuracy of the previous choice of quotient digit. At the end of Algorithm 3.1 on page 45, conditional subtractions have been added in case P has grown to be larger than N. Let us denote by p and ninv the approximations to P and N^{−1} that will be used in defining q. They will turn out to be just two or three bits longer than a radix-r digit. For convenience, assume r is a power of 2 and let us aim at k-bit approximations for some small k that we have yet to determine. Define Z (for “Zwei”) as the power of 2 such that

2^{k−1}·Z < N ≤ 2^k·Z.   (3.1)

Then the most significant few bits of the inverse N^{−1} are given by

ninv = ⌊2^{2k}·Z/N⌋,   (3.2)

that is, the greatest integer equal to or less than 2^{2k}·Z/N. Bounds on ninv are easily deduced from these definitions: ninv ≤ 2^{2k}·Z/N < 2^{2k}·Z/(2^{k−1}·Z) = 2^{k+1} and ninv + 1 > 2^{2k}·Z/N ≥ 2^{2k}·Z/(2^k·Z) = 2^k, so that

2^k ≤ ninv < 2^{k+1}.   (3.3)

Thus ninv has exactly k + 1 bits, the leading bit being 1. Next, let us write Pmid for the value of P which is used in the calculation of q (Algorithm 3.1, line 4). To reduce Pmid modulo N, the bits we need from P for the approximation p are those above the position of the top bit of N plus one or two more to make q accurate enough. This is fewer bits than given by ⌊Pmid/Z⌋, so let

p = ⌊Pmid/(zZ)⌋   (3.4)

where z is a (small) power of 2 which will be determined shortly. The new, approximate definition for q is then the following:

q = ⌊p · ninv · z/2^{2k}⌋.   (3.5)

Note the symmetry here with the definition provided by Equation (3.10) on page 52 of q for Montgomery’s algorithm: a product of the top bits of P and N^{−1} is taken rather than a product of the bottom bits of these quantities. Note also that the integer division by 2^{2k} in (3.5) is trivial in hardware logic so that, as in Montgomery’s algorithm, q is essentially determined just by a multiplication. Thirdly, note that, as in computing n0^{−1}, the cost of computing ninv is almost immaterial since ninv will only be computed once for each fresh modulus N.


First, ninv can be approximated using just the top few digits of N, and then that value incremented until (3.2) holds. From (3.5), q ≤ p·ninv·z/2^{2k} ≤ Pmid·ninv/(2^{2k}·Z) ≤ Pmid/N, so the partial product P does not become negative in line 5 of Algorithm 3.1. This establishes 0 as a lower bound on P throughout execution of the algorithm. As each of these comparisons could be equalities, this is in some sense the best approximation we can employ to maintain this lower bound.³
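These definitions are straightforward to check numerically; the following sketch (parameter choices ours) confirms on random data that the approximate q of (3.5) never exceeds ⌊Pmid/N⌋, which is exactly the property just established.

```python
# Numeric check of (3.1)-(3.5): the truncated quotient digit q is always
# a lower approximation to floor(P/N), so line 5 cannot drive P negative.
import random

k, z = 10, 4                              # widths (powers of 2); choices ours
for _ in range(1000):
    N = random.randrange(2**40, 2**41)
    Z = 1
    while not (2**(k - 1) * Z < N <= 2**k * Z):
        Z *= 2                            # fix Z as in (3.1)
    ninv = 2**(2 * k) * Z // N            # (3.2): top bits of the inverse
    P = random.randrange(3 * 2**16 * N)   # an oversized mid-loop partial product
    p = P // (z * Z)                      # (3.4): top bits of P
    q = p * ninv * z // 2**(2 * k)        # (3.5): the approximate quotient
    assert 0 <= q <= P // N
```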

3.7.2 Bounding the Partial Product

The next task is to establish a common upper bound ℬ on the value Pend of the partial product at the beginning and end of each iteration. This bound depends on the number of input bits which are used in calculating q. The more bits are taken, the more accurate q is, and so the lower the value of ℬ can be made. As the bound ℬ must be preserved from one iteration to the next, we require

Pend ≤ ℬ ⇒ rPend + ai·B − qN ≤ ℬ   (3.6)

for each digit ai in A. To determine a suitable value ℬ easily, we now switch from discrete, integer arithmetic and allow continuous real-valued quantities for the variable P. As will be clear from the argument below, the value of ℬ obtained in that context will certainly work for the integer case. The most difficult case to satisfy (3.6) is when the approximation q is smallest compared with the correct value, ℬ is the least upper bound, and Pend and ai·B are maximal. With Pmid the value used to calculate q as in Section 3.7.1 on page 48, this would occur for the following values at the point when q is evaluated: Pend = ℬ, q = (p·ninv·z − 2^{2k} + 1)/2^{2k}, ai = r − 1, B = N − 1, Pmid = p·zZ + zZ − 1 and (ninv + 1)·N = 2^{2k}·Z. Note that the graph of Pmid − qN is like the teeth of a saw, increasing at the same rate as Pmid, but stepping down each time q increases by 1. The restriction on the form of Pmid is to ensure we select a value at the top of one of the teeth as these are the worst cases for satisfying (3.6). The values of these maxima increase with P because q is defined using the upper bound 2^{2k}·Z/ninv for N instead of N itself. So under these conditions the maximum value for P at the start of the loop will lead to the maximum output value at the end of the loop. Hence, the selection of ℬ as the least upper bound means that ℬ will also be the value at the end of the iteration. Substituting in

³ Signed numbers are usually avoided in modular arithmetic and they are avoided here. However, if we wanted the output to be the residue of least absolute value instead of the least non-negative, we might take the nearest integer approximation in the definitions rather than greatest integer below. This would sometimes cause P to be negative and a new negative lower bound would need to be established in the same way as is done next for the upper bound.


these values at the beginning and end of the iteration,

Pmid − qN = p·zZ + zZ − 1 − (p·ninv·z − 2^{2k} + 1)·N/2^{2k} = ℬ,
Pmid = p·zZ + zZ − 1 = rℬ + (r − 1)·N.   (3.7)

Eliminating ℬ and N from the equations (3.7) and ignoring the relatively small terms which are not multiples of Z yields

p·z = ((r − 1)(ninv + 1)z + (2r − 1)·2^{2k} − r) / (ninv + 1 − r).   (3.8)

As the numerator is positive, and p is non-negative, the denominator must be positive. Hence ninv ≥ r, which is expected because ninv must provide q with at least as many bits of accuracy as in a digit. The more bits chosen for ninv, the better the approximation for q and so the lower the bound ℬ. A good choice is to take ninv with three more bits than a digit, i.e., 4r ≤ ninv < 8r so that 2^k = 4r by (3.3) on page 49. Then plugging the above value for p·z into the second equation of (3.7) and using (ninv + 1)·N = 2^{2k}·Z to eliminate k yields an upper bound on Pend of

((r − 1)zZ − Z + (2 − r^{−1})(ninv + 1)N) / (ninv + 1 − r) + zZ − (1 − r^{−1})N

which is easily determined to be less than

((r − 1)zZ − Z + (2r − 1)N) / (3r + 1) + zZ + N = (5r/(3r + 1))·N + ((4rz − 1)/(3r + 1))·Z.   (3.9)

Using the property 2rZ < N which holds for this case, this in turn is less than 2N for z = r in the case of r = 2, and always less than 2N for z = r/2. Thus 2N is always an upper bound on the output of each iteration in the modular multiplication for some z. This yields Pmid < 2rN + (r − 1)(N − 1) < 3rN as a bound on the size of register needed for P. Then the property q ≤ Pmid/N means q < 3r and so q has at most two more bits than a digit. Moreover, N < 4rZ yields Pmid < 12r²Z = 12rzZ or 24rzZ for z = r or z = r/2 respectively. So p will be 4 or 5 more bits than a digit. Appropriate adjustments to these calculations can easily be made for different scenarios, such as i) if B is not fully reduced but has another upper bound than N (such as R or 2N), ii) if some of the quantities have carry-save representations that lead to different bounds on their values, iii) if the multiplication is completed before any reduction, or iv) N has a special form or fixed, known value.


Clearly, once derived, the classical modular multiplication algorithm has a straightforward formula for the multiple q of N which needs to be subtracted. It is only mildly tedious to determine in the above manner i) the best number of bits to use in the calculation and ii) an upper bound on the partial product to ensure sufficient register space is made available. However, the fact that q and the inputs to its calculation are generally larger by several bits than the radix r means that the built-in hardware multiplier and bus width may be too small for computing q and qN in the most obvious, convenient way.

3.7.3 Montgomery

By contrast, the determination, calculation, storage and use of q is much easier in Montgomery’s Algorithm 2.2 on page 16: the value of q is simply

q = −p0/n0 mod r.   (3.10)

Unlike the classical case, all the operations here involve quantities which are within the normal digit range [0, r − 1]. So fewer bits are required from N^{−1} and P for computing q than in the classical case, and a standard digit×digit multiplier suffices for computing qN. As with the traditional algorithm, the definition of q using short division is replaced by one involving multiplication:

q = p0 · ninv mod r, where ninv = −n0^{−1} mod r.   (3.11)

It is clear by induction that 2N is an upper bound on the value of P at the end of each iteration and so also at the end of the loop when B < N and the digits of A are from a non-redundant representation, i.e., in the range [0, r − 1]. However, for several reasons, we need to look at this bound again later because the inputs A and B are more likely to be bounded above by 2N rather than N.
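A quick numeric check of (3.10) and (3.11) (sketch ours): the resulting q is an ordinary digit and makes P + q·N exactly divisible by r, so the division by r in the main loop is exact.

```python
# The Montgomery quotient digit: one ordinary digit multiply, and
# P + q*N then ends in a zero digit (sketch).
import random

r = 2**16
N = random.getrandbits(64) | 1       # any odd modulus
ninv = (-pow(N % r, -1, r)) % r      # (3.11): ninv = -n0^-1 mod r
P = random.getrandbits(80)           # some partial product
q = (P % r) * ninv % r               # (3.11): q = p0 * ninv mod r
assert 0 <= q < r and (P + q * N) % r == 0
```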

3.7.4 Summary

This section has highlighted one of the key problems with the classical algorithm that makes Montgomery’s algorithm so much easier to implement: checking the details of the classical algorithm and implementing it are more complex. In particular, the value of q is typically two bits larger than a normal digit and so requires a larger hardware multiplier and perhaps more clock cycles or wider buses than normal. On the other hand, in Montgomery’s algorithm bounds are easily established, q has a normal digit size, and standard multipliers and buses can be used. Although the computation of q and the cost of the scalar multiplication q · N are slightly more expensive for the classical algorithm, this needs to be


compared with the cost of scaling the inputs and output from Montgomery’s algorithm. In the case of RSA exponentiation, these costs are fairly similar if the multiplier can be chosen accordingly. However, with a fixed digit-sized multiplier, the extra bits in q make the classical algorithm more expensive to design, more complex to implement, hungrier in its firmware area, slower to execute, more prone to implementation errors, and therefore overall less profitable and generally less desirable to choose.

3.8 Using Redundant Representations

The use of redundant representations enables digit calculations to be done in parallel. Typically this employs a carry-save representation in order to avoid the problems of carry propagation exhibited by the decimal addition of 1 to 999...9 or the subtraction of 1 from 100...0. This is the problem that forces all our schoolbook arithmetic to be done from the least significant digit to the most significant. However, with a carry-save representation the carries (or borrows) are absorbed by the next digit up as part of the next operation instead of the current one. Then carries only need to be propagated at the end of all the arithmetic. An addition of two base r digits x and y in the range [0, r − 1] creates a result in the range [0, 2r − 2] which is stored as a base r digit s (the save part) and an overflow bit c (the carry part) such that x + y = s + rc. Fortunately, this addition also has space to incorporate a carry bit c′ from a previous addition in the position below: x + y + c′ is in the range [0, 2r − 1] and so it too can be split into a one bit carry and a save digit. This means long number additions can easily be done in any order, not just from least to most significant digit but also, for example, from most to least significant digit or all together in parallel, or a number of digit additions in parallel using whatever resources are available. In an ASIC the extra combinational logic and wiring for one extra bit per digit is not too significant. Similarly, a multiplication of two base r digits x and y creates a result in the range [0, (r − 1)²] which is split into two digits s and c such that x · y = s + rc. More generally, the operation is usually implemented in hardware as a MAC operation that will add in one or two further digits: MAC(x, y, d, e) = x · y + d + e generates an output in the range [0, r² − 1] and can therefore also be represented in a two-digit carry-save form (s, c) whose value is s + rc. Thus, the carry from one MAC of the form x · y + z can be accumulated by the next digit position up when it performs a similar MAC operation on the next clock cycle. As with additions, this means long number scalar (limb)


multiplications such as P ← P + ai · B can easily be done from least to most significant digit, but also that a succession of long number scalar multiplications P ← P + ai · B, i = 0, 1, . . ., can have their digits processed in parallel, with each scalar multiplication absorbing the carries from the previous such operation. The use of parallel digit operations implies significant hardware resources for long integer operations, with adders, multipliers, memory, and registers required for a number of digit positions. These are available to a limited extent in multi-core processors and more widely on Field-Programmable Gate Array (FPGA) boards and ASICs. However, the usual count of logic gates is insufficient as a measure of area complexity. This is because there can be significant wiring overheads to consider as well since some data needs to be broadcast simultaneously to all digit positions which are performing operations. For example, when modular multiplication digits are computed in parallel, the multiplier and quotient digits ai and q need to be broadcast to each digit position. Whereas the digits ai are known in advance and can be queued ready for use, the digits q may only be known on the previous cycle and require immediate broadcast to all computing elements with consequent heavy wiring cost.
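The digit-level splitting just described is easy to state in code; this sketch (ours) exercises the worst cases quoted above.

```python
# Carry-save digit operations: an addition with an incoming carry bit and
# a MAC both fit back into a (save, carry) pair, so carries need not be
# propagated until the end of all the arithmetic.
r = 2**16

def csa(x, y, c):                    # x + y + c is at most 2r - 1
    s = x + y + c
    return s % r, s // r             # (save digit, one-bit carry)

def mac(x, y, d, e):                 # x*y + d + e is at most r*r - 1
    s = x * y + d + e
    return s % r, s // r             # (save digit, carry digit)

assert csa(r - 1, r - 1, 1) == (r - 1, 1)                  # carry stays one bit
assert mac(r - 1, r - 1, r - 1, r - 1) == (r - 1, r - 1)   # fits in two digits
```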

3.8.1 Traditional

In the classical algorithm with a standard number representation, if we want to avoid computing the top digits twice, the determination of q can only easily be made once the carries have been fully propagated. This limits the faster computation of the modular product even if additional computing elements are available. However, with a carry-save representation, additional multipliers can be used for parallel digit operations or digits can be processed most significant first provided an approximation is used for q that does not require the carry-up to be known. In Section 3.7 on page 48 we saw how approximate values can be used satisfactorily. As carries from lower positions have almost no impact on the computation of the top digit or so of P, and only one or two more bits than the top digit are used in order to choose q, the adjustments to the argument in Section 3.7 to accommodate a redundant representation are minimal, with only a minor increase in the maximum value of P if no other action is taken.

3.8.2 Montgomery

In Montgomery’s algorithm q does not depend on the completion of any carry propagation. Hence multiple computing elements can be used in parallel with ease if a carry-save representation is used. However, there is still one significant carry to take care of, and this can lead to bubbles (which are when computing


elements are left with nothing to do). Specifically, if one takes the iteration boundary to be when q is calculated, then one loop iteration of the interleaved Algorithm 2.2 on page 16 computes first q and then P ← (P + q · N)·r^{−1} + ai · B. Thus, as well as the lowest digit of ai · B, the next value of q depends on the digits with index 1 from P + q · N plus any carry from its digits of index 0. In both algorithms we see that latency − the time from input to output − can be reduced by employing redundancy when more hardware resources are available. The carry propagation need only be done after the final operation of a cryptographic function, not during or after every constituent arithmetic operation although, as we will see later, executing some limited intermediate carry propagation can be very useful to reduce the size of carries. However, as q is on the critical path, even with additional resources both methods are potentially held up while q is computed and distributed to all the computing elements.

3.9 Changing the Size of the Hardware Multiplier

Suppose first that there is no choice over the available integer multiplier because the hardware platform or multiplier design is fixed. Typically, the multiplier will perform a MAC operation a · b + d or perhaps a · b + d + e where a, b, d and e are single words. Then the multiplier may be used more efficiently by supplying the modular multiplication arguments A and B divided into digits which do not adhere to the word boundaries. In particular, if digits have fewer bits than the word size, carries may be incorporated within a word without any overflow. This is very helpful for a carry-save representation and a standard square multiplier may be able to process these values and the over-sized q from the traditional algorithm in one go. As noted in the previous section, the accumulate part conveniently accommodates carries from long integer multiplications. However, the multiplier may not be square (i.e., the above MAC inputs a and b could have different lengths), and may indeed allow several extra bits above the word size for each argument in order to deal with overflows more efficiently or to enable better rounding when used for floating point operations. When a carry-save representation is used, any available extra bits in multiplier arguments can be valuable in processing carries. There may be several multipliers which can be operated in parallel on the chip using a SIMD architecture. Combined in this way, they should operate as a single non-square multiplier enabling several words/digits of long integer arguments to be processed simultaneously in a single MAC operation a · b + d where a and b now have different numbers of bits but which is actually split


over the multipliers. Consequently, in order to make the best use of the multiplier(s), we need the flexibility to consider the arguments of (A · B) mod N to have representations with different bases which may or may not match the word size or multiples of it [189, 273]. However, if there is a choice of multiplier, then one has to balance costs such as power and floor area against any advantages of the various choices as well as noting that, while a larger multiplier reduces the number of clock cycles, it may increase the depth of the hardware that has to be driven so that each clock tick is longer. Hardware synthesis tools (which are commercially available) include Intellectual Property Blocks for parametrisable multipliers, thereby enabling ASIC designers to choose their own MAC operation circuitry without having to design it themselves. A v × w-bit multiplier will have area approximately proportional to vw and critical path depth proportional to log(vw) plus the register delay and set-up and hold times. As observed in Section 3.8 on page 53, having a second argument that can be accumulated in a MAC operation is useful. If the multiplier is non-square, i.e., has multiplier and multiplicand inputs a and b with different numbers v and w of bits respectively, then its maximum output for a · b is (2^v − 1)(2^w − 1) = (2^v·2^w − 1) − (2^v − 1) − (2^w − 1). So it is possible to accumulate arguments d and e of v and w bits respectively without overflowing the v + w bits necessary for the output. When performing a scalar multiplication ai · B where ai has v bits and B is partitioned into w-bit digits, each use of the multiplier performs a v- by w-bit multiplication, and accumulates both the save value of w bits from the previous operation at that position and the carry of v bits from the MAC on the previous position. At the extreme, an n×1-bit multiplier is just an adder; dispensing with a multiplier is a possible optimisation. Without a multiplier but registers that can hold all of A, B and N, the clock speed in a cryptographic co-processor can usually be increased substantially. In the early days of ASICs for RSA cryptography, the fastest chips often used base 4 for A and, rather than employing a multiplier, just added in B and 2B to the partial product as required by the two-bit digits of A. However, besides initialisation, multiplication, additions and subtractions, the modular multiplication algorithms include reading and writing to various types of memory and other movement of data as well as comparisons, incrementing of counters and instruction processing. Some of these rather than the multiplier may limit clock speed, or even dominate overall performance by using many clock cycles. If Montgomery’s modular reduction is used then the multiplication requirements are marginally less than for the traditional algorithm. Specifically, q has around two fewer bits, which may make the scalar multiplication q · N less


expensive (see Section 3.7 on page 48). Multipliers may require more area if the number of bits in the arguments is increased, but their throughput and depth of combinational logic should be logarithmic in the number of bits. Consequently, adding two bits to an argument (and to the bus) to accommodate the classical algorithm may not alter the time a cryptographic process takes. However, the power and area requirements will be marginally increased.
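The accumulation headroom claimed earlier in this section for a non-square v × w multiplier is a one-line identity; a quick check (sketch ours):

```python
# Even after adding a v-bit and a w-bit accumulate input to the largest
# possible product, a v x w multiplier's result still fits in v + w bits.
for v, w in [(8, 16), (17, 53), (32, 64)]:
    assert (2**v - 1) * (2**w - 1) + (2**v - 1) + (2**w - 1) == 2**(v + w) - 1
```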

3.10 Shifting an Operand

In this section we start to make progress on the earlier computation of q so that its calculation ceases to be a bottleneck. Specifically, we want to ensure that none of the hardware devoted to the long number operations P ← P + ai · B and P ← P ± q · N is lying idle while q is computed and broadcast to where it is needed.

3.10.1 Traditional

In the classical Algorithm 3.1 on page 45, the determination of q is on the critical path. The operation P ← P − q · N cannot start until q is known. Therefore, any simplification in computing q has the potential to improve the latency of the modular multiplication. Let us shift up the multiplicand A and the modulus N by several digits before the modular multiplication starts, and perform a compensating shift down at the end. So we calculate ((AS · B) mod (SN))/S where S is the power of r corresponding to the shift. The output is still equal to (A · B) mod N but the inputs to the calculation of q now come from several digits higher up in the registers. Using asi, i = 0, 1, . . ., to denote the digits of AS, the shift must be sufficient to move the position of these input bits to above the most significant bit of each asi·B. Then B is small in comparison with the new modulus SN and addition of the digit multiple asi·B in P ← rP + asi · B will only affect the topmost bits used to compute q if there is a carry which propagates as far as the relevant top bits of P. We saw earlier that q is only computed as an approximation to the correct value, and so it can allow for ignoring that propagated carry. In fact, the relatively smaller value of B means that the formula for q and the bound ℬ in Section 3.7 on page 48 can be improved.⁴ The determination of q can now be advanced to make it available

⁴ ℬ is smaller by about N/3 for the parameter choice in Section 3.7.2 on page 50 and a one digit shift.


without delaying the reduction step: it can be computed after the assignment P ← P − q · N in the previous loop iteration without waiting for the addition P ← rP + asi · B. Surprisingly, the computational complexity is little changed although the shift means more iterations − one more for each digit of shift. For simplicity, let us assume that the shift is by a whole number of digits. Indeed this must be the case if, as in Section 3.5 on page 44, we are aligning the most significant bit of the modulus with the top bit of a digit. Because the number of non-zero digits in AS is the same as in A, the number of digit × digit multiplications ai · bj is unchanged by the shift if the program code can omit the steps for which the bottom digits of AS are known always to be zero. Thus the scalar multiplication steps P ← rP + asi · B should have the same computational complexity as before. Now consider the cost of the reduction steps. First note that the sequence of values q forms a long integer Q such that Q · SN is the quantity subtracted from P by all executions of line 5 in Algorithm 3.1 on page 45, while AS · B is the quantity added to P by all executions of line 3. So AS · B − Q · SN is the output of the main loop. Incrementing Q as necessary to account for any subtraction arising from the final lines 7 to 9 yields Q = ⌊AS · B/SN⌋. This integer quotient is unchanged by the shift since ⌊AS · B/SN⌋ = ⌊A · B/N⌋. So the value of Q is the same in both cases. However, the two representations of Q may be different as a result of the shift. From Section 3.7.2 on page 50, a typical definition of q ensures qi < 3r for each i. Hence, both representations might differ from the standard radix-r representation by a borrow of up to 2 being transferred from each digit to the one below it. So there is almost no scope for changing the number of digits in Q by the shift (unless r is very small). Thus, as program code should ensure the same number of digit × digit multiplications are performed calculating q · SN as q · N, the shifted and unshifted modular multiplications can normally be expected to require the same number of digit × digit multiplications in the modular reduction steps. Before concluding, let us review the minor differences in computational complexity that might arise in the modular reduction steps. Since B is smaller compared to the modulus in the shifted case, the upper bound on P is a little smaller, making the digits qi slightly smaller on average in that case. For certain parameter choices, the lower bound on these digits could reduce the number of bits they require, leading to small but real hardware and power reductions in digit×digit multiplications. The smaller bound on P could also translate to slightly fewer iterations of the final conditional subtraction in lines 7 to 9 of Algorithm 3.1 on page 45. On the other hand, the lower bound on digits qi


may lead to the representation of Q sometimes using one more digit than in the non-shifted case.⁵ So, overall, the main cost of this acceleration technique is just some increased data movement caused by the shifting, an increased register length, not forgetting, of course, the increased program code area to avoid the known multiplications by zero. However, the arithmetic digit operations have very similar costs. How and where does this technique improve performance? Recall that the purpose of shifting A and N up is to enable the digits q to be calculated earlier, and specifically without having to await the addition in line 3 of Algorithm 3.1. The value of the technique arises when several digit × digit multipliers (or equivalent) are available to operate in parallel. One scenario might be using a number of processing elements (PEs) in an FPGA with a pipeline of digits in memory awaiting processing. Another would be a multi-core processor. For simplicity, assume that the digit × digit multiplier can accept q as one of its inputs.⁶ Assume also that one multiplication is enough to calculate q and that in total k multipliers are available. Then, without the shift, for each of the n loop iterations there can be some multipliers idle while the last digits of P ← rP + ai · B are determined and then there are k − 1 multipliers that are idle during each calculation of a quotient digit. However, with the shift and a suitable instruction set, P ← rP + ai · B can commence while one multiplier is computing q, so that q is ready when there are multipliers available for starting on P ← P − q · N. Then no multipliers are idle until the end of the last loop iteration when at most k − 1 multiplier cycles may be lost rather than at least n(k − 1). Thus, at least k − 1 idle PE cycles will have been saved on every iteration except maybe the last.

Remarks
(i) An alternative view of this shifting process is that the reductions are simply delayed until q is available and, in the meanwhile, the processing of the multiplication steps continues.
(ii) Since a carry-save representation is used to enable parallel processing of k digit positions, we could avoid performing the shift and still calculate q in time. Thus, the scheduler could select the topmost digits of line 3 of Algorithm 3.1 to be computed first, then compute q using the next free multiplier cycles, and then complete the execution of line 3. This way, q would be available for use in

6

Note, however, that the maximum value for the leading digit is atypical as it depends closely on the upper bounds for A and B. If both are bounded by 2N, then Q < 4N and its top digit position could only have the value 0, 1, 2 or 3. This property of the multiplier is desirable in an ASIC because half of all digit products involve the quotient “digit” q, but recall that q may exceed r − 1.

Downloaded from https://www.cambridge.org/core. UNIVERSITY OF ADELAIDE LIBRARIES, on 31 Mar 2019 at 10:14:57, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.004

60

Hardware Aspects of Montgomery Modular Multiplication

starting line 5 if there were spare multiplier cycles when finishing the execution of line 3. However, executing line 4 within line 3 is bound to be rather messy for both code and data movement.

3.10.2 Montgomery

In Montgomery's method, the corresponding shifting technique requires making the determination of q independent of the lowest digit of B, which is used in the step P ← P + ai·B. To achieve this, B can be shifted up by computing the Montgomery product P ← (A ⊗ SB) mod N, where the notation ⊗ is a convenience to avoid writing explicitly the introduced power of r, which depends on the number of loop iterations in Algorithm 2.2 on page 16. This number is discussed in the next paragraph. So B is shifted, rather than A and N, which were shifted in the traditional schoolbook method above. Once more, assume the shift is by a whole number of digit places, since otherwise there are needless complications.

As in the classical case, some extra loop iterations are required to process the larger arguments. Thus, if the original Montgomery computation requires n iterations (the number of digits in A and B) and B is shifted up by s digits, i.e., S = r^s, then there should be n + s iterations in P ← (A ⊗ SB) mod N. The extra s iterations process s extra digits an, an+1, …, an+s−1 of A, which are all zero. This means (A·B·r^(−n)) mod N is computed in the unshifted case and (A·r^s·B·r^(−n−s)) mod N in the case of a shift by s digits. Hence the residue class modulo N of the output is unchanged by the shift and the corresponding extra iterations. The value of the partial product P at the end of the last iteration is easily seen to be less than (A·B·S + r^(n+s)·N)·r^(−n−s) = (A·B + r^n·N)·r^(−n) < 2N, which is the same upper bound as in the unshifted case. Hence the output of the shifted algorithm satisfies all the post-conditions of the unshifted algorithm, and is therefore acceptable as is. In fact, a more careful analysis would reveal that the arithmetic is entirely identical, so that the outputs are the same. As noted for the classical algorithm shift in Remark (i) above, the shift simply delays the evaluation of q and the addition of q·N relative to the addition of ai·B.

To appreciate the potential value of the shift, suppose again that there are k hardware multipliers that can simultaneously process k digits of the long number operations P ← P + ai·B or P ← (P + q·N)/r. If the three steps of each loop iteration are performed sequentially using the multipliers as they become available, then up to k − 1 may again be idle waiting while one multiplier computes q before P ← (P + q·N)/r can commence. The solution to this



bottleneck requires earlier scheduling of the computation of q. Without the shift, it would have to take place somewhere in the middle of performing P ← P + ai·B, once the lowest digits have been obtained. But, as with Remark (ii) above for the classical algorithm, this would be messy to implement. It is far cleaner to perform a shift so that the determination of q can precede P ← P + ai·B. Such a solution can also cater for the case when the hardware processes all digit positions of P ← P + ai·SB and P ← (P + q·N)/r in parallel and separate circuitry is used to find q. Then processing one loop iteration requires a depth of hardware equal to two multipliers in each digit position above the shift. However, below the shift position only the depth of a single digit × digit multiplier is required, since the corresponding digits of SB are zero. This gives the time needed to compute q without delaying the rest of the hardware [139]. Moreover, each further shift by one digit provides additional time to calculate and broadcast the quotient digits.

In one sense the overall computational complexity is essentially the same with the shift as without: the same digit × digit products need to be formed, and the extra such products involve an operand digit which is known to be zero and so can be programmed out. The advantage in hardware is that it is much easier to avoid resources being idle while quotient digits are computed. Thus time is reduced. However, as in the case of shifting for the classical algorithm, the number of data movements and other minor operations has increased slightly. Thus the shifting technique applies equally well to the two algorithms.

Remarks
(i) Whilst a shift of about one digit position seems sufficient for both the classical and Montgomery algorithms, a larger shift enables the computation of q to start even earlier and results in more time being available for its calculation and broadcasting to all digit positions when digit operations are done in parallel.
(ii) For large moduli and large numbers of multipliers, the wiring, multiplexers and power for distributing the value of q can become a significant cost that might take several clock cycles and needs to be factored into any measure of circuit complexity. Such wiring may lead to design and implementation problems arising from routing issues and noise from crosstalk. However, shifting inputs as above is free of any need for further pre- and post-processing of I/O.
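To make the functional effect of the shift concrete, here is a word-level Python sketch in the style of the loop of Algorithm 2.2 (the radix, digit counts and operands are illustrative assumptions, and pow(·, −1, m) needs Python 3.8+). In the shifted loop, each quotient digit q is computed before the scalar addition, which is legitimate precisely because the low s digits of SB are zero; the shifted and unshifted outputs agree modulo N and obey the same bound 2N:

    def mont(A, B, N, r, n):                # unshifted word-serial loop (q found late)
        ninv = pow(-N, -1, r)               # -N^{-1} mod r; N assumed odd
        P = 0
        for i in range(n):
            P += ((A // r**i) % r) * B      # P <- P + a_i * B
            q = (P * ninv) % r
            P = (P + q * N) // r
        return P                            # == A*B*r^{-n} (mod N), P < 2N

    def mont_shifted(A, B, N, r, n, s):     # B shifted up by s >= 1 digits
        ninv = pow(-N, -1, r)
        SB = B * r**s                       # low s digits of SB are zero
        P = 0
        for i in range(n + s):              # s extra iterations; a_i = 0 for i >= n
            q = (P * ninv) % r              # q found *before* the addition: SB = 0 mod r
            P = (P + ((A // r**i) % r) * SB + q * N) // r
        return P                            # same residue class and same bound 2N

    r, n, s = 2**8, 4, 1
    N = 0xC0000001                          # an odd toy modulus with N < r**n
    A, B = 0x12345678 % N, 0x0FEDCBA9 % N
    assert mont_shifted(A, B, N, r, n, s) % N == mont(A, B, N, r, n) % N
    assert mont_shifted(A, B, N, r, n, s) < 2 * N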

3.11 Precomputing Multiples of B and N

A much-valued and widely used method for increasing the speed of many computations is to precompute and store frequently used values [139, 271].



Including tables of precomputed values is a space-time trade-off. It reduces algorithmic time by computing repeatedly used data once beforehand instead of every time it is required. The cost is the increased area taken up by the memory used for the data. Here, if we have some control over the design of the multiplier, which may be the case with an ASIC, the digit multiplier can be personalised into essentially an adder of a subset of precomputed values aB and qN. This adapted multiplier should then execute faster than a general multiplier. When memory for the precomputed values is cheap, or is available and otherwise unused, and access to it is fast, this makes good sense.

If only a very small number of bits of A are processed in one clock tick, then every possible value of aB and qN, or aB ± qN, could be precomputed and stored in a look-up table. As the same digits a and q will turn up frequently, the repeated re-computation of aB and qN would be completely avoided. Then each iteration of either modular multiplication algorithm would require just one or two additions and no multiplier. However, in practice a and q can be at most two or three bits in size before the space requirement becomes prohibitive. For larger radices r = 2^w, simply storing the w shifted values 2^i·B and 2^i·N, i = 0, …, w − 1, is not so expensive and may save valuable computing time and power when shifting itself takes time. If the necessary shifts cost virtually nothing (such as when hard-wired into an ASIC), then storing a small number of other combinations, such as B, 3B, N, and 3N, can be worthwhile. Then, for example, P + ai·B can be formed by adding in shifted copies of B and 3B as necessary for every pair of bits in ai.

The advantages of using a look-up table depend critically on the extent to which the technique shortens the critical path length or reduces the power consumption, and these will vary enormously between different implementations. Selecting and loading the required multiples of B and N itself takes time. There are also many different types of memory, some fast but expensive, such as register space, and others cheap but slow. These techniques apply equally well, and in exactly the same way, to both classical and Montgomery modular multiplication, but may or may not be applicable in a particular context.
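As a word-level illustration of forming ai·B from a small table, the following Python sketch (the digit width w and the operands are arbitrary choices, not from the text) uses only the precomputed multiples {B, 2B, 3B} and shifted additions, one bit-pair of ai at a time:

    # Forming a_i * B from precomputed multiples of B, two bits at a time
    # (illustrative Python sketch; w is the digit width, an assumption).
    def scalar_product_via_table(ai, B, w):
        table = (0, B, 2 * B, 3 * B)        # precomputed once, reused for every a_i
        acc = 0
        for k in range(0, w, 2):            # each bit-pair selects a stored multiple
            acc += table[(ai >> k) & 3] << k
        return acc

    w = 16
    ai, B = 0xB7C3, 123456789
    assert scalar_product_via_table(ai, B, w) == ai * B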

3.12 Propagating Carries and Carry-Save Inputs

For a modular multiplication we will always assume the modulus input N is in a standard non-redundant form, i.e., its digits are in the range [0, r − 1] for some radix r. This is because its conversion to such a form will be cheaper than the extra processing involved in the modular multiplication using a redundant



carry-save form, as described in Section 3.8 on page 53. However, it may be desirable to allow redundancy in one or both of the other inputs, i.e., in A or B, because such a form enables digit values to be processed independently. When modular multiplication is used in the context of a modular exponentiation, the output from one modular multiplication is an input to a subsequent such operation. If only one or at most two digit multipliers are available, then the modular multiplication algorithm should process the partial product P sequentially from least to most significant digit, propagating carries on the way and outputting a result in non-redundant form. This is not a problem for inputting to the next modular multiplication. However, since the use of carry-save representations is helpful when a greater number of digit multipliers is available, carry propagation may be necessary between modular multiplications unless the modular multiplier allows inputs in redundant form.

As the hardware modular multiplier may be used for modular squaring, one might expect that both or neither of the arguments should allow for redundant inputs. However, the digit products in a given ai·B may be performed in parallel, and the scalar products ai·B (i = 0, 1, …, n − 1 or i = n − 1, n − 2, …, 1, 0) generated sequentially. This means the digits of operand B are required in parallel, but the digits of A are consumed sequentially. Thus there could be time to convert operand A to a standard form before use, but not operand B. So sometimes only one argument may need to be allowed a redundant form, even for modular squarings.

Clearly, a carry-save representation takes up valuable resources, adding to the cost of reading, processing, writing and storing the output. Consequently, although processing the carries takes time, carries should always be propagated when there is minimal cost, such as by using any spare argument in a MAC operation, as noted earlier. Otherwise, the propagation cost is that of one (full-length) addition when digits are processed sequentially. In practice, this is unlikely to be much less than the cost of a scalar multiplication a·B or q·N. However, even a full-length, digit-parallel addition would reduce each carry-save pair of digits to a single digit and a one-bit carry. When performed on the output from one modular multiplication, it reduces the work for the next modular multiplication: with a full two-digit carry-save representation for B and a single digit for ai, the scalar multiplication ai·B will require 2n digit × digit multiplications, whereas reducing each digit position of B to a digit and a carry bit cuts the work to only n digit × digit multiplications and an n-digit addition. Indeed, the hardware multiplier might be large enough to accommodate the carry bit as well as the digit (Section 3.9 on page 55).

Unfortunately, in the classical algorithm, the digits of input A are consumed from most to least significant. So, if the argument A is not already in a



non-redundant form, it is not easy to convert it fully on a digit-by-digit basis before use if time is an issue. However, A can still be converted from a two-digit carry-save representation to a digit and carry bit representation by processing the digits from most significant to least significant, thereby reducing the work of each modular multiplication in the same way as when improving the representation of B. There may be sufficient time and resources to do this. Then the digits ai will have a range less than that of q, and a multiplier which is large enough to compute the products q·nj may also be sufficient for computing the ai·bj. On the other hand, in Montgomery's algorithm the digits of A are consumed one by one, least significant first. So, if the argument A were not already in non-redundant form, it would be easy to convert it fully just before use. This would make the formation of ai·B cheaper and easier even than in the traditional algorithm.

Now consider evaluating P ← P + ai·B using parallel digit operations, with B in a carry-bit representation and ai in standard binary. The products ai·bj contribute a maximum of r^2 − r to each digit position if the bit carries in each bj are shifted up to contribute 1 to the next position. So the necessary redundancy in P will give a maximum of at least r^2 in each position. This is beyond the full extent of the carry-save representation and does not even leave room to accumulate any carry up from the previous operation without expanding into a third digit. Consequently, even in the Montgomery case, it is necessary either to have a larger digit multiplier⁷ or to insert an extra addition into each loop iteration in order to deal with carries when parallel digit operations are to be performed. The situation is slightly worse for the classical algorithm, where the digit representation of ai may also have a carry bit. This emphasises the desirability of processing digits sequentially from right to left.

Lastly, in Montgomery's algorithm the value of q depends on the lowest digit of the partial product, and this does not require carry propagation except from the digit position deleted in the division by r. On the other hand, for the classical algorithm the value of q depends on the top bits of the partial product, and its accuracy is reduced, thereby causing extra processing, if carries are not propagated. Once again, Montgomery is more efficient for the resources that are typically available.

Overall, this section has illustrated the cost of allowing redundant representations internally and for the I/O of the modular multiplier. It is greater for the traditional algorithm but is, nevertheless, still a significant issue if some parallel processing of digits is envisaged for Montgomery's algorithm.

⁷ "Larger" is relative to the radix r. One can decrease r if the multiplier size is fixed. For example, in a SIMD software context, Intel [189] uses 29 bits for r rather than the more natural 32.
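The baseline mechanism is easy to see at word level. In this minimal Python sketch (toy radix and operands; B here is in non-redundant form), the partial product is held as a two-digit carry-save pair, every digit position is updated independently of the others, and the represented value is preserved; allowing B itself to be redundant is exactly what pushes a position past r^2, as discussed above:

    # Digit-parallel accumulation P += a*B with P in two-digit carry-save form
    # (illustrative Python; radix, lengths and operands are toy assumptions).
    def cs_add_scalar(s, c, a, b, r):
        L = len(s)
        t = [a * b[j] + s[j] + c[j] for j in range(L)]  # independent per position
        s_new = [t[j] % r for j in range(L)]
        c_new = [0] + [t[j] // r for j in range(L - 1)] # carry moves up one position
        assert t[L - 1] // r == 0                       # top position must not overflow
        return s_new, c_new

    def cs_value(s, c, r):                              # value represented by the pair
        return sum((s[j] + c[j]) * r**j for j in range(len(s)))

    r, L = 2**8, 8
    b = [0x12, 0x34, 0x56, 0x78, 0, 0, 0, 0]            # digits of B, padded to length L
    s, c = [0] * L, [0] * L
    total = 0
    for a in (0x0F, 0xA5, 0xFF):                        # accumulate several a*B
        s, c = cs_add_scalar(s, c, a, b, r)
        total += a * sum(b[j] * r**j for j in range(L))
        assert cs_value(s, c, r) == total               # no carry ever propagated fully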



3.13 Scaling the Modulus

The computation of q is a stumbling block for speed in the traditional algorithm, as well as perhaps slowing down Montgomery's. As seen in Section 3.10 on page 57, in both cases operand scaling by shifting reduces the complexity of determining the quotient and so has the potential to speed up the modular multiplication. On the other hand, as observed in Section 2.3 on page 22, special moduli might be chosen to simplify quotient digit selection. For general moduli the same efficiency gains can be achieved through scaling the modulus⁸ [335].

⁸ Sparseness in N does not make any difference to the cost of quotient digit determination. So we do not require scaled moduli for which most digits are zero.

We have already noted in Section 3.10 that for both the traditional and Montgomery algorithms shifting A and N relative to B makes it possible to determine q before the addition P ← P + ai·B, and hence reduce any delay in broadcasting q to the positions where it is needed first for use by a bank of multipliers. However, this delay can be further reduced if the computation of the quotient is made easier by scaling the modulus. So, specifically, we wish to remove, or at least simplify, the multiplications (or divisions) which occur in Equation (3.5) on page 49 and in Equations (3.10) and (3.11) on page 52. This is done by scaling N so that the top digit or so is all one bits in the case of the classical algorithm, and so that the bottom digit is all one bits in the case of Montgomery. To be precise (except for the possible shift in the classical case to place the top bit at the top of a digit, and in the Montgomery case to make N odd), replace the modulus N by its multiple

N* = ninv N    (3.12)

where ninv is the quantity defined by Equation (3.2) on page 49 in the classical case, or by Equation (3.11) on page 52 in the Montgomery case, as appropriate. As well as this pre-processing, some post-processing is required. The final output needs to be reduced modulo N, but all intervening modular arithmetic will now use N*. For most choices of moduli, N* will now have more digits than N, and so qN*, A, B and the partial products will be longer (by the same amount) and more time will have to be spent computing them. The increase is about one digit in length, and so about two more digit × digit multiplications are required on each iteration. However, the special, simple form of an end digit of N* should enable the extra digit × digit multiplication in q·N* to be avoided. So the cost is one digit × digit multiplication, which is exactly what has been saved from computing q. Nevertheless, this simplification of q can speed up the hardware



when more than one digit position needs q immediately or multipliers are lying idle.

Let us consider in detail the properties of the new modulus N* and how it will be used. With Montgomery's technique, ninv is an odd, but standard, digit in the range [1, r − 1], and so scaling by it provides a modulus N* = ninv·N, at most one digit longer than N, whose lowest digit satisfies n*0 ≡ n0·ninv ≡ −1 mod r. So the corresponding inverse digit for N* satisfies n*inv ≡ 1 mod r and, by Equation (3.10) on page 52, the reduction then uses q* = p0. No processing at all is required to obtain q*.

On the other hand, in the revised schoolbook method of Algorithm 3.1 on page 45, suppose we choose ninv to have two or three bits more than a digit, as in Section 3.7.1 on page 48. The scaled modulus ninv·N then needs shifting (up or down) to ensure its top bit is at the top of a digit boundary. So the length of N is increased by probably one or two digits, but perhaps by none or three. In the exceptional case of N being a 2-power, this shift cancels the multiplication by ninv = 2^k, so N = N* = 2^k·Z has an unchanged number of digits and the quotient formula (3.5) on page 49 yields

q* = ⌊P/(2^k Z)⌋.    (3.13)

Again, as in the Montgomery case, almost no processing is required to obtain q*. We will now show this formula also holds for the classical algorithm for all other N*, thereby always removing the multiplication from (3.5). Without loss of generality, assume (3.12) holds as written, with any further shift to place the top bit on a word boundary delayed until later. Since the length of N* is normally different from that of N, the application of the bounds (3.1) on page 49 to N* provides a new power Z* of 2 defined by

2^(k−1) Z* < N* < 2^k Z*.    (3.14)

The maximality of ninv in the definition provided by Equation (3.2) on page 49 gives a tighter bound on N*, namely

2^(2k) Z − N < ninv N < 2^(2k) Z.    (3.15)

So Z* = 2^k Z, the number N* is exactly k bits longer than N, and the leading k bits of N* are all equal to 1. Also from Equation (3.2), n*inv is the unique integer satisfying

2^(2k) Z* − N* < n*inv N* < 2^(2k) Z*.    (3.16)



Hence, as with the special case of a 2-power for N, n*inv = 2^k, because

2^(2k) Z* − N* = 2^(3k) Z − ninv N ≤ 2^k (2^(2k) Z − N) < 2^k ninv N = 2^k N* < 2^(2k) Z*,

where the first inequality holds by (3.3) on page 49, the second by (3.15), and the third by (3.14); so 2^k satisfies the defining inequalities (3.16) and, by uniqueness, equals n*inv. Thus (3.13) holds for all cases of scaling in the classical case. For both the classical and the Montgomery algorithms, modulus scaling is therefore an easy means of removing the multiplication from the definition of q and replacing it with a simple bit selection. Note, however, that with the traditional algorithm, international standards for RSA cryptography already place the top bit of N on a word boundary. The above scaling adds k bits to N, making N* just more than one digit longer, and causing two more digits of processing at many points. On the other hand, scaling in the Montgomery case only adds one digit to the modulus length. Therefore, at least in the classical case, modulus scaling may make more sense only as the number of digits increases.
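A quick numerical check of the Montgomery case may be helpful. In this Python sketch (toy radix, modulus and operands; Python 3.8+ for the modular inverses), the scaled modulus has lowest digit r − 1, each quotient digit is read off as the low digit of P with no multiplication, and the result is still the Montgomery product modulo the original N:

    # Montgomery reduction with a scaled modulus N* = ninv*N, N* = -1 (mod r),
    # so that q is just the low digit of P (illustrative Python; toy values).
    r, n = 2**8, 4
    N = 0xC0000001                      # odd modulus with N < r**n (assumed)
    ninv = (-pow(N, -1, r)) % r         # odd digit with ninv*n0 = -1 (mod r)
    Ns = ninv * N                       # scaled modulus, one digit longer
    assert Ns % r == r - 1

    A, B = 0x12345678 % N, 0x0FEDCBA9 % N
    P = 0
    for i in range(n + 1):              # one extra iteration for the longer modulus
        P += ((A // r**i) % r) * B
        q = P % r                       # quotient digit needs no multiplication
        P = (P + q * Ns) // r
    # P = A*B*r^{-(n+1)} mod N*, hence also mod N; undo the Montgomery factor:
    assert P % N == (A * B * pow(r, -(n + 1), N)) % N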

3.14 Systolic Arrays

A systolic array consists of a number of processing elements (PEs) for repeatedly computing some function which itself consists of the repetition of a small number of different tasks. The PEs are arranged in a regular manner, communicate data only locally in the direction of computation, and each performs just one of the component tasks of the function. The input data for the function is fed in at one end of the array whenever the function has to be evaluated, and the result is eventually output at the opposite end. The array acts like a pipeline in the sense that it contains several instances of the function evaluation at various stages of completion. Such arrays can be particularly useful on servers with high volumes of connections or for other data-intensive processing.

One of the advantages of using the architecture of a systolic array is the potential to avoid the cost of simultaneous data broadcasting (delays, wiring, and multiplexers) to a set of PEs operating in parallel, and thereby to improve Area × Time efficiency. Although overall latency may not be improved by using a systolic array rather than a parallel architecture, throughput can be substantially increased.

There are many papers on the use of systolic arrays in cryptography. The paper [336] is the first to enable modular multiplication in a fully systolic way, i.e., using PEs with only local connections so that there is no need for the simultaneous broadcast of data to many PEs. Savaş et al. [298] provide a wider view, integrating the integer and characteristic 2 field cases, considering scalability issues and providing simulation timings as well as a good bibliography of prior work. In [50] Bertoni et al. apply the concept to finite field extensions GF(p^m)/GF(p).



Since hardware always involves a fixed number of PEs and there are rarely sufficient of them to process all digit positions without re-use, it is necessary to store various intermediate data in queues of digits. This aspect is covered briefly after describing the case where there are sufficient PEs. Freking and Parhi [149] discuss this and the data dependencies in a little more detail for the two-dimensional version of the array, as well as providing more possibilities for arranging the cells. For convenience, we will assume the PEs take a single clock cycle to perform their operation, including reading any required data, forwarding results to the next PE and storing data sent on by the previous PEs. We suppose also that a PE can perform its operation on every clock cycle. In reality, there could be a three or more stage fetch-decode-execute pipeline and some code to perform the PE’s operation as a number of simpler instructions over many clock cycles. This might be the case, for example, when using a Field-Programmable Gate Array (FPGA).

3.14.1 A Systolic Array for A×B

An easy example of a two-dimensional systolic array is one for multiplying A = Σ_{i=0}^{n−1} ai·r^i and B = Σ_{j=0}^{n−1} bj·r^j using an n×n array of PEs. The example follows multiplication by hand, starting with the least significant digits of both A and B, adding one ai·B at a time, and propagating carries upwards every time a digit product ai·bj is added to the partial product. It outputs the digits of the final product as they become known, from least to most significant, with each addition of an ai·B and, for convenience, it shifts down the rest of the partial product at the same time. So the (i, j)th processing element PEi,j (i, j ≥ 0) contributes to the calculation of the product digit pi+j by computing the carry-save value

ci,j+1·r + si+1,j−1 ← ai·bj + si,j + ci,j    (3.17)

and then forwarding ci,j+1 to PEi,j+1, si+1,j−1 to PEi+1,j−1, ai to PEi,j+1 and bj to PEi+1,j, all at time T = 2i + j + 1 relative to the first inputs to the multiplication at T = 0. It is easy to see that the digit values ai, bj, si,j and ci,j are then all received by PEi,j no later than at time 2i + j, which is in time for their use in the above calculation. No other data need to be stored by the PE, except that two digits of B are queued ready for use by the PE rather than one. If one ignores the digits of Q and N, this is just what happens in Figure 3.1 on page 69 ([336], figure 1), where the black dots in the diagram represent latches to delay the forwarding of digits of B (and N) by one clock tick, so they arrive exactly when required.

Figure 3.1 Data flow between modular multiplication cells ([336], figure 1).

The diagonal dashed lines in the data dependency diagram indicate




the times at which each PE executes its operation for this multiplication. It is clear that the data flow is always in the direction of increasing time.

Now consider the periphery of the array. The input data from B and initial zeros for the save digits s0,j trickle in through the top row of the array. On the other hand, digits from A and initial zeros for the carry digits ci,0 come through initialising the right-hand column. If there were a left-hand column with index n, PEi,n would use bn = 0 and si,n = 0. Consequently PEi,n would simply find ci,n+1 = 0 and forward si+1,n−1 = ci,n to PEi+1,n−1. As this column does no computation, it can easily be absorbed into the neighbouring column of index n − 1, to yield an n×n array, thereby saving some hardware. So, if the number of digits in B is always bounded above by the number of PEs in a row, it can be assumed that the actual left-hand column (of index n − 1) has been slightly modified in this way. In the last row, with index n − 1, each PEn−1,j performs its task as normal, outputting the save digit, which is part of the final result, and forwarding the carry digit to the next PE in the row.

The digits pk of the output P = A·B appear at intervals of one or two clock ticks. First, n save values pi = si+1,−1 exit the right-hand column processors PEi,0 two clock ticks apart at times 2i + 1, i = 0, 1, 2, …, n − 1. As is clear from (3.17) above, si+1,−1 does indeed represent a digit coefficient of r^i. Then the remaining n product



digits pn−1+j = sn,j−1 exit the last row processors PEn−1,j one clock tick apart at times 2n − 1 + j, j = 1, 2, …, n (except for p2n−1 = cn−1,n at time 3n − 2 if column n is absorbed into column n − 1), and they are the coefficients of r^(n−1+j). It should be reasonably clear that if each PE uses its data at the correct time and the output is collected at the appropriate time, then the array does indeed compute A·B. Thus, for a given k, each carry digit ci,k−i, save digit si,k−i and product ai·bk−i is processed by PEi,k−i and contributes to the coefficient pk of r^k in the product P. The carry it generates represents a multiple of r^(k+1) and is sent to a PE that deals with r^(k+1). The save it generates represents a multiple of r^k and is sent to the next PE that deals with r^k. So the diagonal sequence of elements PE0,k, PE1,k−1, PE2,k−2, …, PEk,0 computes and outputs the final digit pk of the product, having added in all the necessary digit products from A and B, propagated carries as necessary, and consumed the carries it has been sent.

An alternative row-by-row view is given by noting that Equation (3.17) on page 68 for a fixed i is just that for a carry-save computation of ai·B + P·r^(−1), where P is the output received from the previous row, namely the result of accumulating the inputs to rows 0 to i − 1. The movement of the save digits by one PE diagonally rightwards means P should be weighted by r^(−1). Thus row n − 1 would compute Σ_{i=0}^{n−1} ai·r^i·B = A × B if there were further columns to the right. As those extra columns would use zeros for the digits of B with negative indices, they would not do any computing, and would instead just forward on the save digits, which we decided to collect earlier when they left the column of index 0. All carries have been propagated by the time the last save digit exits the array.

Note that one multiplication is performed like a wave travelling from top right to bottom left in the array, with only a diagonal of PEs (those along a dashed line in the figure) being busy performing the multiplication at any specific clock time. The other PEs are free and so can be used for further multiplications while the first one is still being computed. As another multiplication can start being fed into PE0,0 at every clock tick, and it takes 3n − 2 clock ticks before the last digit is output, the array could be performing 3n − 2 multiplications simultaneously, each starting, progressing and being output one clock tick behind the one in front.
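The cell equation (3.17) and the stated output positions can be checked with a small dataflow simulation. The Python sketch below (an illustrative model, not the text's own code) abstracts the clock away and models only the wiring: carries travel leftwards along a row, saves travel diagonally down and to the right, the virtual column n uses bn = 0, and the product digits are collected exactly where the text says they exit:

    # Dataflow simulation of the n-by-n systolic multiplier of Section 3.14.1
    # (Python sketch; clock timing is abstracted away, only the wiring is modelled).
    def systolic_multiply(a, b, r):
        n = len(a)
        s = {}                                  # s[(i, j)]: save input to PE(i, j)
        c = {}                                  # c[(i, j)]: carry input to PE(i, j)
        p = [0] * (2 * n)
        bb = b + [0]                            # virtual column n with b_n = 0
        for i in range(n):
            for j in range(n + 1):
                t = a[i] * bb[j] + s.get((i, j), 0) + c.get((i, j), 0)
                c[(i, j + 1)] = t // r          # carry on, weight r^(i+j+1)
                if j == 0:
                    p[i] = t % r                # save exits the right-hand column
                else:
                    s[(i + 1, j - 1)] = t % r   # save down-right, weight r^(i+j)
        for j in range(1, n + 1):               # top digits exit the last row
            p[n - 1 + j] = s[(n, j - 1)]
        return p

    r = 2**8
    a = [0x12, 0x34, 0x56, 0x78]                # little-endian digits of A
    b = [0x9A, 0xBC, 0xDE, 0xF0]                # little-endian digits of B
    A = sum(d * r**i for i, d in enumerate(a))
    B = sum(d * r**j for j, d in enumerate(b))
    P = sum(d * r**k for k, d in enumerate(systolic_multiply(a, b, r)))
    assert P == A * B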

3.14.2 Scalability

Still with the multiplication example, it is important to consider its scalability. Typically the hardware resources will be fixed, but they must be able to deal with variable-sized inputs. So, suppose the input arguments have n digits but the array is of size s × t, perhaps with s = t. If s, t ≥ n then there is no problem:



A and B are simply extended to s and t digits respectively by adding zero digits at the most significant ends, the product is computed as above, and then the first 2n digits of output are taken as the product. As before, PEs which have zero digits from the most significant end of A or B simply find themselves with zero for the carries and forward the incoming save digit, which is eventually output. If appropriate, extra control could be put in to extract the result directly from the nth row rather than wait until it exits row s.

However, if s < n but t ≥ n, then there are insufficient rows and the data output from the last row has only computed Ps = Σ_{i=0}^{s−1} ai·r^i·B. This is simply fed back into the top row as it appears digit by digit from the bottom row, and the next set of s digits from A is used to add the next set of s products ai·B, yielding the output P2s = Σ_{i=0}^{2s−1} ai·r^i·B. This is repeated until all the digits of A have been used. Thus the array works like a tube, by effectively having the top and bottom edges joined, and the data just cycles round and round the tube until all n digits of A have been processed.

The next case is when t < n but s ≥ n. This works in a similar way, by effectively connecting the left and right edges of the array so that the save digits which exit the right side are fed back into the left side, and the carry and other digits which flow out of the left side are fed back into the right side. This time, each cycle across the array extends the computed portion of each ai·r^i·B by t more digits of B. For each iteration round the rows, the next set of t digits of B is fed one digit at a time into the top of the array, ready for use when the PEs need them. In both these cases, for each circuit of the data round the array, only one PE of each row and one PE in each pair of adjacent columns is operating on the multiplication. Consequently, when the data exits from a row or column, all the PEs in that row or column are free to be used in the next cycle with the next set of s or t digits.

Things become more complicated in the fourth case, when both s < n and t < n. Again, the array needs to be viewed as having its opposite edges joined together, this time joining both pairs of edges. The data cycles round the rows and round the columns as before, but there is the potential for PEs to be busy when needed. In particular, data may travel from PE0,t−1 back to PE0,0 at the same time as data wants to go from PEs−1,0 back to PE0,0. So PE0,0 is in demand from two parties. The straightforward solution is to queue the data exiting from one edge in a shift register until processors are free to continue the calculation. As memory for holding digits is much cheaper than PEs, and digits in the queue can easily be recovered sufficiently in advance not to hold up the processing, this solution is very cost-effective. One just has to be careful that memory access is fast enough to keep up with the demands of the array.



Figure 3.2 Part of a linear systolic array for multiplication.

3.14.3 A Linear Systolic Array

A special case is when the multiplication array is only one-dimensional, i.e., s = 1 or t = 1. Assume s = 1 and n ≤ t, so that there is a single row of PEs, as in Figure 3.2, and enough PEs to process every digit of B. As described before, the save digits coming out from the last row are fed back into the top row, but at one position to the right. This simply means the save digits go back to the previous PE, and it corresponds to a shifting up of the partial product as each new digit ai is processed. PEj processes digit bj as previously for column j, and its computation is

cj+1·r + sj−1 ← ai·bj + sj + cj

at time 2i + j + 1. This is a contribution to the digit of index i + j. Every two clock ticks it needs the next digit from A; hence the stream of digits a0, a1, a2, …, which are fed into the right-hand end on alternate cycles. At the next clock tick, PEj−1 receives sj−1, which is a coefficient of r^(i+j), and adds to it the contribution from ai+1·bj−1, also of weight r^(i+j). Also at that next clock tick, PEj+1 receives cj+1, which is a coefficient of r^(i+j+1), and adds in both it and the product ai·bj+1, also of weight r^(i+j+1). Thus the digit × digit products, the carries and the saves are all added into the correct total for each digit position. Eventually the digits of A run out, so the carries all become zero and the save digits are passed rightwards for output. At time 2i + 1, i = 0, 1, …, 2n − 1, PE0 ejects the save digit s−1 of weight r^i. As all relevant contributions from A × B have been added to it, this is the value of the digit pi of the product P.

Because of pressure on chip resources, a likely scenario is that there are many fewer PEs than digits of B, i.e., t < n. If the array is indeed too small to hold all the digits of B then, as above, it can compute A × Bj where Bj is a t-digit number. With some simple adjustments, it can be used iteratively to compute P = A × B = Σ_j (A × Bj)·r^(tj), where Bj is a radix-r^t digit of B, i.e., t radix-r digits of B. Each iteration uses t more digits from B to generate t more digits of output for the product P, and a carry C of n digits to be used in the next iteration, as



follows: C·r^t + Pj ← C + A × Bj for j = 0, 1, … This requires initialising the array to the currently incomplete top part C of the product A × B. This top part C is just the first n carries which exit the left end of the array on alternate cycles after the first t cycles, and Pj is given by the first t save values exiting the right end on alternate cycles. The n digits of C need to be fed back in, in parallel with re-inputting the digits of A, and this replaces the initialisation to zero of the right-hand carry-in, which is performed only for the first iteration. This detail was subsumed in the discussion above for the two-dimensional array.

With this set-up, the leftmost PEs are inactive at the end of a multiplication while the t save digits of the output are passed rightwards down the array. These digits are unchanged by the PEs because t further zero digits are supplied on the A pipeline. However, the save digits could be extracted from the lower edge once the final term involving an−1 has been added. This would allow the following part of the multiplication (that involving the next Bj) or another multiplication to commence immediately, resulting in each PE being fully used on every alternate cycle. Of course, with each PE operating only on alternate clock ticks, the array can perform a second (independent) multiplication in the other clock cycles, enabling the full use of its computing power.
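The chunked recurrence C·r^t + Pj ← C + A × Bj can be verified at word level. The short Python sketch below (toy radix and operands; the array mechanics are abstracted away) tiles a multiplication through a t-digit-wide datapath and checks that the output chunks and the final carry reassemble to A × B:

    # The recurrence C*r^t + P_j <- C + A*B_j of the linear array, checked at
    # word level (illustrative Python; radix, width and operands are toy choices).
    def tiled_multiply(A, B, r, t, chunks):
        out, C = 0, 0
        for j in range(chunks):                # B_j: j-th radix-r^t digit of B
            Bj = (B // r**(t * j)) % r**t
            T = C + A * Bj
            out += (T % r**t) * r**(t * j)     # t output digits of the product
            C = T // r**t                      # carry into the next iteration
        return out + C * r**(t * chunks)       # the last carry tops off the product

    r, t = 2**8, 2
    A, B = 0x12345678, 0xF0DEBC9A
    assert tiled_multiply(A, B, r, t, chunks=2) == A * B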

3.14.4 A Systolic Array for Modular Multiplication

Could the array of Section 3.14.1 on page 68 be adapted to perform an interleaved modular multiplication rather than just a multiplication? For the classical algorithms the answer is clearly "no", because the first product digits are already exiting the array by the time any top digits are available for the determination of the multiple q of the modulus N which should be subtracted. However, with Montgomery's algorithm the multiple q is determined initially, and then the addition of qN can progress hand in hand with the addition of ai·B and the propagation of carries. Figure 3.1 on page 69 illustrates how the typical PEs would work. PEi,j now calculates

ci,j+1·r + si+1,j−1 ← ai·bj + qi·nj + si,j + ci,j    (3.18)

at time T = 2i + j + 1. By analogy with the multiplication case, this clearly computes P = A × B + Q × N, where Q = Σ_{i=0}^{n−1} qi·r^i. If Q has been chosen correctly, P will be a multiple of r^n and the lowest n output digits can be ignored to leave the normal Montgomery modular product.



Figure 3.3 The rightmost cells.

The rightmost column of cells (Figure 3.3) needs to compute the digits qi which cause the partial product to be divisible by r when qi·N is added. So PEi,0 must also determine qi such that its save output si+1,−1 is 0:

pi,0 ← ai·b0 + si,0
qi ← pi,0·ninv mod r    (3.19)
ci,1·r ← pi,0 + qi·n0

at time T = 2i + 1, where ninv is the precomputed inverse of n0 as defined in (3.11) on page 52. Note that qi is determined after a delay of two multiplications, and so should take no longer to generate than the carry output in a standard cell of the array. Although it looks as if ci,1 takes longer to compute, a number of methods for simplifying this were discussed earlier in the chapter, including shifting B up so that the first multiplication in (3.19) is eliminated (Section 3.10 on page 57), or scaling N so that the second and third multiplications are removed (Section 3.13 on page 65). In an FPGA these would reduce the time taken by PEi,0 to equal that of the other cells. Alternatively, in an ASIC systolic array, there are considerable simplifications arising i) from only having to compute the less significant digit of pi,0 for input to qi, ii) from the cancellations due to n0·ninv ≡ −1 mod r when calculating n0·(pi,0·ninv mod r) for input to ci,1·r, and iii) from only having to compute the more significant digit of pi,0 + qi·n0.

Suppose, as usual, that n is the number of digits in A, B and N, and the array is large enough for our purposes. Then the leftmost column of cells may need to have an index larger than n − 1, because the intermediate and final values of the product P are bounded above by 2N, cf. the bounds (3.21) on page 78. Moreover, as noted earlier, this leftmost column may forward its carry directly



to the next row to avoid having an extra column of PEs for which the normal processing of a cell is trivial. For the result of the modular multiplication, recall that the first, i.e., lowest, n digits that the multiplication array produced are now ignored. Indeed, they are now zero after the modular reduction, and the output lines for them have been removed. The required digits pj = sn,j (j = 0, 1, …) of the Montgomery modular product follow at times 2n − 1 + j (j = 0, 1, …) if the least significant input digits were multiplied at time 1 and the product digits are output, as before, from the nth row of the array. If the array does not have n rows, then adjustments are made in the same way as in Section 3.14.2 on page 70 for the multiplication array.

As in the case of the multiplication array, further modular multiplications can be fed serially into the array, each starting one clock cycle after the previous one, until every processing element is busy. With the first digit being output at time 1 and the last at time 3n − 2, there is the capacity for 3n − 2 simultaneous modular multiplications taking place in an n×n array at any one time. When, as is usually the case, there are fewer rows or columns than digits in A or B then, as before, any unused diagonals in the array can be allocated to the next part of a modular product computation. More detail is given in the references cited at the start of this section.

For a system-on-a-chip, different applications compete for space and layout can be a problem. As die sizes increase, it becomes more and more feasible to use a systolic array with many PEs for modular multiplication, but area is always going to be an issue. A very useful observation to reduce area is simply to implement one rather than two multipliers in each PEi,j and to split its function over two clock cycles ([338, eqn. 3]), using an intermediate double-digit variable di,j which is computed at time T = 2i + j:

di,j ← ai·bj,
ci,j+1·r + si+1,j−1 ← di,j + qi·nj + si,j + ci,j.    (3.20)

This just requires digits ai and bj to be input one clock cycle earlier. The main carry and save digits are processed as before, at time 2i + j + 1. The multiplier must now be able to add in an extra variable, for which more register space is required, and an extra control bit is needed to distinguish between the two operations of the PE but, apart from that, the area is reduced by a factor of almost 2. This makes it much easier to make full use of all the multipliers on every cycle in the linear version of the array. In particular, when performing an exponentiation in the linear (i.e., one-dimensional) version of the array, the squarings and any necessary multiplications can be performed sequentially with no PE being



idle. Moreover, the layout of a linear array can be much more easily adapted to any odd-shaped area on the chip. The paper [338] provides more detail and more options for such arrays, including a discussion of applications to elliptic curve cryptography.

There are alternative formulations of the normal schoolbook method of multiplying two numbers, and some can be adapted to interleaved modular reductions. For example, Kornerup [222] adapts [336] to a multiplier design in which two digit × digit multiplications of A·B are performed by each PE. However, the extra complexity does not seem to provide an improved measure of Area × Time.
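Stripped of the spatial layout, the arithmetic performed by the modular-multiplication array is the digit-serial Montgomery recurrence built from equations (3.18) and (3.19). The Python sketch below (toy parameters; Python 3.8+ for the modular inverse) makes this explicit and checks the claim that the array computes A × B + Q × N, a multiple of r^n:

    # Digit-serial view of the array's arithmetic: q_i from eqn (3.19), then
    # P <- (P + a_i*B + q_i*N)/r as in eqn (3.18).
    # (Illustrative Python; radix, sizes and operands are toy assumptions.)
    def mont_array_arithmetic(a, B, N, r):
        ninv = pow(-N, -1, r)               # the precomputed inverse used in (3.19)
        P, Q = 0, 0
        for i, ai in enumerate(a):          # digits of A, least significant first
            P += ai * B
            qi = (P * ninv) % r             # q_i makes P + q_i*N divisible by r
            Q += qi * r**i
            P = (P + qi * N) // r           # the lowest digit is discarded, as in the array
        return P, Q

    r, n = 2**8, 4
    N = 0xC0000001                          # an odd toy modulus
    a = [0x78, 0x56, 0x34, 0x12]            # digits of A = 0x12345678
    A = sum(d * r**i for i, d in enumerate(a))
    B = 0x0FEDCBA9
    P, Q = mont_array_arithmetic(a, B, N, r)
    assert P * r**n == A * B + Q * N        # A*B + Q*N is a multiple of r^n
    assert P < 2 * N                        # matching the bound cited for the array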

3.15 Side-Channel Concerns and Solutions

One of the advantages of listening to digital rather than analogue radio stations is that switching on a light or a nearby thunderstorm no longer interferes with the quality of reception. However, this electro-magnetic phenomenon, whereby an electric current suddenly switched on or off radiates energy, led to the discovery of radio waves and the invention of radio communication at the end of the nineteenth century. During the Cold War it was well appreciated that current variation in valves and cathode ray tube (CRT) monitors led to radio emanations which leaked information from electrical apparatuses such as computers, teletypes and telephones. This led to Tempest shielding [268, 120, 332]. To a large extent, this requires putting everything in a metal box, or at least within a metal lattice, including cables, but it also concerns input current variation, and even sound⁹. Considerable understanding and expertise was developed by government bodies for analysing this "side-channel" leakage, undoubtedly making use of advanced statistical methods, very powerful probing tools, and unlimited computer facilities which are still well beyond the funding means of university researchers.

⁹ As well as the sounds made by different keys on a keyboard or key pad, CPUs running at low clock speeds used to make the case of a PC vibrate sufficiently to be able to hear when it was performing RSA cryptography even without any listening equipment beyond the human ear.

Of course, the transistors used in chips are just switching devices and therefore radiate energy during operation just as valves do. This was also well known from their invention. The need for counter-measures to this lower level of leakage has been known for many years [329]. However, the earliest unclassified published demonstrations of successful attacks on cryptographic systems



using side-channel leakage are due to Paul Kocher [220, 221]. He used timing measurements of operations deduced from power variations, and the profile of power use during clock cycles, to determine the secret keys within a device running algorithms for public key and symmetric key cryptography. Without any counter-measures, this just required running the cryptographic operation many times with the same data and averaging the results to reduce the noise sufficiently to distinguish between properties of different secret keys. The main difficulty in performing a similar attack using EMR (electromagnetic radiation) measurements was the manufacture of a sufficiently small antenna, but these are readily available in the heads used to read hard disks, for example. Success in this field of EMR was demonstrated shortly after Kocher's publications by Quisquater, Gemplus and others [293, 155].

The laboratory equipment required to perform such timing, power or EMR measurements is not expensive, and consists mainly of an oscilloscope and a probe. Moreover, understanding of the cause of the leakage is not always necessary. For such attacks it suffices to find a correlation between the secret key and any parts of the recorded phenomena. Averaging many oscilloscope traces is generally important to improve the signal-to-noise ratio, but it is essential to select the traces to amplify the signal rather than reduce it. Some of the more sophisticated methods for trace selection require some knowledge of how the cryptographic algorithms are implemented, in order to target times when leakage may occur, such as when key-dependent material is moved along a bus, and to remove parts of the traces with no key-related information.

In his first paper [220], Kocher identified conditional subtractions in modular multiplications as a primary source of timing variations which revealed the secret key. Bit by bit he reconstructed the secret exponent in RSA by observing the frequency of the subtractions and choosing the next bit to match the observed frequency. Kocher used known data inputs. So the immediate counter-measure was to blind the data fed into RSA [94, 220, 245]. However, this does not solve the problem, since the average behaviour of squarings and multiplications enables them to be distinguished without knowledge of any inputs, and the sequence of squarings and multiplications is directly related to the exponent bits when the usual square-and-multiply exponentiation algorithm is used. The theoretical basis for this was first published by Schindler [300] and Walter & Thompson [342], and is easily illustrated by looking at a small modulus such as N = 5, as in Table 3.1, where the reduced products are marked with an asterisk. If the reduction is done for results equal to, or above, N, the ratio of subtractions for multiplications to squarings is (8/25) : (2/5) = 4/5 in the example, when we assume the N^2 possible products occur with equal frequency and the N possible squares also appear with equal frequency.



Table 3.1 Products of residues A, B mod 5; entries marked * are the reduced products.

A\B    0    1    2    3    4
 0     0    0    0    0    0
 1     0    1    2    3    4
 2     0    2    4    1*   3*
 3     0    3    1*   4*   2*
 4     0    4    3*   2*   1*

Simply put, the average value for a square is greater than the average value for a product, and so the probability of a subtraction is greater for a square. For the large moduli of cryptographic applications, the exact frequencies of the final conditional subtraction in line 8 of Algorithm 2.2 on page 16 can be determined straightforwardly, as in [341]. It requires a precise bounding interval for the output P of the main loop of Montgomery's algorithm. This is given by

ABR^(−1) ≤ P < N + ABR^(−1)    (3.21)

which is easily verified by showing that line 3 of Algorithm 2.2 contributes ABR^(−1) to P and line 5 contributes less than N [340]. When A and B are less than N and N < R, this is a sub-interval of [0, 2N] with length N, so that at most one subtraction is required. In cryptographic applications, it is reasonable to assume N is large and prime or almost prime (i.e., all or almost all natural numbers less than N are prime to N), that A and B are uniformly distributed modulo N, and hence that P is uniformly distributed modulo N, or almost so, and so is effectively uniformly distributed over the given interval. In that case, the probability of the conditional subtraction is proportional to the size of the interval [N, N + ABR^(−1)], i.e., the probability is ABR^(−1)N^(−1). Integrating this with respect to A and B over the range [0, N] then gives a very accurate value for the probability of the conditional subtraction for a multiplication, namely (1/4)NR^(−1). Identifying A and B and integrating over A gives the probability for a squaring as (1/3)NR^(−1). Fixing A and letting B occur with uniform distribution gives the probability of the subtraction for a constant multiplication by A, namely (1/2)AR^(−1). Similar distinguishing probabilities hold for alternative implementations in which the condition for subtracting N is P ≥ r^n rather than P ≥ N, and inputs and outputs are bounded above by r^n rather than N. So the frequency of conditional subtractions is indeed measurably greater for squarings than for multiplications when sufficient observations are made. Constant multiplications can also be distinguished from each other, and from squarings and general multiplications, by their frequencies. Such frequencies



reveal the sequence of multiplications, squarings and constant multiplications in any exponentiation which uses a standard algorithm, such as the binary and m-ary left-to-right and right-to-left algorithms. This leads to the discovery of the secret key [341]. Consequently, where side-channel attacks might be a concern it is necessary to adopt a version of Montgomery’s algorithm which does not involve a conditional subtraction. This was presented in Chapter 2. The solution is to set a bound greater than N, say B, such that all inputs to the algorithm will be less than B and simply perform a large enough fixed number of iterations of the main loop to ensure the output is also less than B. Since R is increased by a factor of r by each extra iteration, it is possible to ensure R ≥

B2 B−N

(3.22)

and then the main loop output P is bounded by B because of (3.21). This can then be used safely in any future modular multiplication. A typical choice is B = 2N. This requires R ≥ 4N and therefore just one more iteration of the main loop in most situations10 . At the end of an exponentiation it also means at most one conditional subtraction of N. However, as is readily verified by setting A = 1 in (3.21), this subtraction can never occur if a modular multiplication by 1 is performed to retrieve the result from the Montgomery domain [337, 177, 340]. Thus, there is no need to implement subtraction in any part of a modular exponentiation using Montgomery’s algorithm, and that means hardware savings on an ASIC. Given that execution time should be independent of secret data in order to decrease side-channel leakage, it should also be noted that compilers may optimise ROM code to eliminate unnecessary multiplications by zero and additions of zero. Especially in the case of small radices r, this might enable sufficient instances of q = 0 to be detected and used to recover a secret key if the input message has not been blinded. For comparison, there are no published claims of comparable leakage from a classical modular multiplication algorithm if the inputs are blinded, but it is necessary, of course, to implement subtraction. If the inputs are not blinded in this traditional case, the exponent can be re-created bit-by-bit by reproducing the exponentiation and choosing bits to match the observed leakage, as Kocher did [220]. 10

¹⁰ So it would be more efficient and secure if standard key and word sizes led to R ≥ 4N > ½R, giving the smallest number of iterations such that R > N, but the world is not always ideal.
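For concreteness, here is a minimal Python sketch of the subtraction-free variant with B = 2N and R = 2^k chosen so that (3.22) holds, i.e., R ≥ 4N. The function name and the one-shot structure are illustrative assumptions only; a hardware implementation would interleave the main loop digit by digit rather than compute the full product at once.

```python
# A minimal sketch, assuming B = 2N and R = 2^k >= 4N as in the text.
def mont_mul_no_subtraction(A, B, N, k):
    """Return (A*B + m*N) / 2^k, congruent to A*B*2^(-k) mod N and lying
    in [0, 2N), assuming A, B < 2N, N odd, and 2^k >= 4N."""
    R = 1 << k
    assert N % 2 == 1 and R >= 4 * N and A < 2 * N and B < 2 * N
    Nprime = (-pow(N, -1, R)) % R     # N * Nprime == -1 (mod R)
    T = A * B                         # T < 4N^2
    m = (T * Nprime) % R
    P = (T + m * N) >> k              # exact division by R = 2^k
    # P < (4N^2 + R*N)/R = 4N^2/R + N <= N + N = 2N: no subtraction needed
    return P
```

Since the output is again below 2N, it can be fed straight back in as an input, so an entire exponentiation runs without any data-dependent subtraction; as noted above, the final conversion out of the Montgomery domain (a multiplication by 1) never triggers one either.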


Besides the removal of conditional subtractions from a modular multiplication algorithm, it should be clear that there is a need to protect other aspects of cryptographic exponentiation implementations from critical side-channel leakage. In particular, squarings should ideally be made to behave in the same way as multiplications. This may mean fetching an argument from memory for a squaring even if it is unused or already present in a register, in order to match behaviour for a multiplication. It may also mean taking action to ensure the two arguments of the squaring are suitably modified so that they appear to be as different as in a multiplication when passing along a bus or when used in a multiplier. Such counter-measures are advisable since single exponentiations can be attacked [339]: it may be unnecessary to perform many exponentiations in order to have sufficient sections of leakage trace to average and achieve a good signal-to-noise ratio. Often a sufficient solution is to randomise the exponent so that averaging over many uses of the secret key removes rather than reveals data dependency in the observed signal.

Side-channel leakage, together with low power, is one of the major concerns for implementers of hardware for cryptography. It is a vast subject. We have merely touched on how Montgomery modular multiplication is affected and provided a counter-measure to one point of leakage, namely the conditional subtraction. With a modification to remove that data-dependent subtraction, other attacks on modular arithmetic in cryptographic hardware, such as active ones involving fault injection, seem not to be specific to the choice of modular multiplication algorithm. Counter-measures to them are then typically generic and applicable to any choice of modular multiplication algorithm.

3.16 Logic Gate Technology

Finally, remember that using static CMOS gates is not the only choice for building circuits. Apart from other considerations, power and side-channel leakage through power variation and electromagnetic radiation (EMR) may be reduced by using alternative technologies. Research into this has not identified any particular style which solves such problems fully. For example, Pass-Transistor logic might be used to reduce power. However, at a minimum, a complete solution to side-channel leakage requires removing the possibility of hardware glitches (which are data dependent and cause power surges), balancing the amount and delay of charging (loading capacitance) and equalising gate switching (transition counts) to make them independent of input values at every level of the circuit. Even if it were possible, this is clearly wildly inconsistent with any desire for area efficiency or low power. A lower target is to balance the total energy used over a clock cycle to make it nominally the same for all inputs of a given program instruction. The most promising attempts at addressing this problem use a Dual-Rail Pre-charge (DRP) logic style, such as Sense Amplifier Based Logic (SABL) [325] or Wave Dynamic Differential Logic (WDDL) [326]. So far, results show these to be expensive and only partially effective for mitigating the level of leakage.

Overall, the clearest leakage comes from data sent along the bus, and next is probably any data which is broadcast widely at the same time to different processing elements through multiplexers. Generally, depending on the logic, it is the Hamming weight of the data, or the Hamming weight of the difference between successive data values, which can be determined most easily. Thus the choice of logic gate technology depends not just on the digit multiplier but on wider considerations.

3.17 Conclusion

Peter Montgomery's "Modular multiplication without trial division" [251] has had a significant effect on the design and efficiency of hardware for arithmetic-based cryptography, as well as providing commercial advantages arising from simpler implementation when compared with traditional methods. In particular, quotient digits stay within the normal range for non-redundant representations, and carry propagation does not need to occur before quotient digit selection. Furthermore, the direction of carry propagation away from the locus of quotient digit calculation means that Montgomery's algorithm is a natural choice for systolic arrays, which can very efficiently handle the highest volumes of decryption and digital signing needed on SSL servers.


4 Montgomery Curves and the Montgomery Ladder

Daniel J. Bernstein and Tanja Lange

4.1 Introduction

The Montgomery ladder is the following remarkably simple method of computing scalar multiples of points on a broad class of elliptic curves. Define sequences (X1, X2, ...) and (Z1, Z2, ...), starting from X1, Z1, A, by the equations

X2n = (Xn² − Zn²)²,
Z2n = 4XnZn(Xn² + AXnZn + Zn²),
X2n+1 = 4(XnXn+1 − ZnZn+1)²Z1,
Z2n+1 = 4(XnZn+1 − ZnXn+1)²X1

for n ≥ 1. Then the points

(Xn/Zn, ±√((1/B)(Xn³/Zn³ + A·Xn²/Zn² + Xn/Zn)))

are, under minor hypotheses, the nth multiples of the points

(X1/Z1, ±√((1/B)(X1³/Z1³ + A·X1²/Z1² + X1/Z1)))

on the Montgomery curve By² = x³ + Ax² + x. The Montgomery ladder is also remarkably fast: the optimized formulas

X2n = (Xn − Zn)²(Xn + Zn)²,
Z2n = ((Xn + Zn)² − (Xn − Zn)²)·((Xn + Zn)² + ((A − 2)/4)·((Xn + Zn)² − (Xn − Zn)²)),
X2n+1 = ((Xn − Zn)(Xn+1 + Zn+1) + (Xn + Zn)(Xn+1 − Zn+1))²Z1,
Z2n+1 = ((Xn − Zn)(Xn+1 + Zn+1) − (Xn + Zn)(Xn+1 − Zn+1))²X1

compute (Xn, Zn) using just 11 multiplications per bit of n.


Montgomery introduced these curves and optimized formulas in a classic 1987 paper "Speeding the Pollard and elliptic curve methods of factorization" [252]. See Chapter 8 for more information about ECM, the elliptic-curve method of factorization.

The advent of ECM prompted further applications of elliptic-curve computations, notably elliptic-curve primality proving (ECPP) and elliptic-curve cryptography (ECC). It is easy to see that these applications can also use the Montgomery ladder. Extensive research has produced a wide range of more complicated scalar-multiplication methods (for pointers see, e.g., [42], [41], and [48]), outperforming the Montgomery ladder for tasks such as computing n → nP for a fixed point P, or computing nth multiples of points on certain special curves, but the Montgomery ladder seems practically unbeatable for the core task of computing n, P → nP on typical curves.

In ECC it is important to avoid failure cases, so the minor hypotheses mentioned above are worrisome. Fortunately, a careful analysis shows that the Montgomery ladder always computes a modified x-coordinate function that identifies ∞ with 0. Working correctly for all inputs is an unusual feature of elliptic-curve formulas: one expects scalar-multiplication methods to have failure cases that require constant attention from implementors. Twenty years later the introduction of "complete Edwards curves" allowed algebraic computations of arbitrary sums n1P1 + · · · + nkPk by the "Edwards addition law" without failure cases. It turned out that complete Edwards curves are birationally equivalent to Montgomery curves with points of order 4 and unique points of order 2, and vice versa. More generally, "twisted Edwards curves" are birationally equivalent to Montgomery curves, and vice versa. The Montgomery ladder is closely related to the Edwards addition law, as we show in Sections 4.4 and 4.5.

The United States National Institute of Standards and Technology (NIST) issued ECC standards fifteen years ago. These standards recommended various non-Montgomery curves that had been selected by the National Security Agency (NSA). The only justification provided for the curve shape was an incorrect claim that the standards provided "the fastest arithmetic on elliptic curves." The simplicity, speed, and completeness of the Montgomery ladder have led to widespread deployment of "Curve25519" [35], the Montgomery curve y² = x³ + 486662x² + x over the prime field Fp where p = 2²⁵⁵ − 19; see Section 4.3.7 for details.

4.2 Fast Scalar Multiplication on the Clock

We define the clock as the curve u² + v² = 1 with specified point (0, 1). More generally, we define a twisted clock as a curve au² + v² = 1 with specified point (0, 1), where a is nonzero. This section introduces the group of points on this curve and relates the computation of point multiples to Lucas sequences. The Lucas ladder can be viewed as a "degeneration" of the Montgomery ladder.

Fix a field k and fix a nonzero a ∈ k. Define Clocka(k) as the set of k-points on the twisted clock, i.e., the set of pairs (u, v) ∈ k × k satisfying the curve equation au² + v² = 1. Then Clocka(k) is an abelian group under the following operations. The neutral element is the specified point (0, 1). The negative −(u, v) of a point (u, v) is (−u, v). The sum (u5, v5) of (u2, v2) and (u3, v3) is given by

(u5, v5) = (u2, v2) + (u3, v3) = (u2v3 + u3v2, v2v3 − au2u3).

The difference is

(u1, v1) = (u3, v3) − (u2, v2) = (−u2v3 + u3v2, v2v3 + au2u3).

For the special case (k, a) = (ℝ, 1) the addition operation can be visualized as adding times on a conventional clock, using 12:00 = (0, 1) as neutral element. For example, 2:00 + 3:00 = 5:00, and 9:00 + 4:00 = 1:00.

The addition can be computed with just 4 multiplications using the sequence of intermediate steps A = u2u3, B = v2v3, C = (u2 + v2)(u3 + v3) to get (u5, v5) = (C − A − B, B − aA). We denote the cost of a general multiplication by M and the cost of multiplication by a curve constant (such as a) by C, so one point addition costs 3M + C. Additions and subtractions are usually not counted: they are significantly cheaper than multiplications for typical fields. Doubling means adding a point to itself, giving

(u4, v4) = (u2, v2) + (u2, v2) = (2u2v2, v2² − au2²) = (2u2v2, 2v2² − 1),

costing M + S, where S denotes the cost of a squaring. Note that, in doubling, v4 is computed purely from v2 and does not involve u2. Similarly, v5 = v2v3 − au2u3 = 2v2v3 − v1, showing that the v-coordinate of the sum P + Q can be computed given the v-coordinates of P, Q, and Q − P.

For each n ≥ 0, the scalar multiple nP is P + P + · · · + P, adding n copies of P together. For example, (u4, v4) above is 2(u2, v2). Computing nP in a naive way takes n − 1 additions for n ≥ 1, i.e., 3(n − 1)M + (n − 1)C, but using the binary expansion n = ∑ ni2ⁱ (summing over 0 ≤ i ≤ c, with ni ∈ {0, 1} and nc = 1) to compute

nP = 2(2(· · · 2(2P + nc−1P) + nc−2P · · · ) + n1P) + n0P

takes only c = ⌊log₂ n⌋ doublings and at most c additions. This double-and-add method takes on average c(M + S) + 0.5c(3M + C) = 2.5cM + cS + 0.5cC to compute nP.
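The group law and the double-and-add method fit in a few lines of Python; in this sketch the prime p, the twist a, and the sample point are illustrative assumptions only.

```python
# Twisted-clock arithmetic a*u^2 + v^2 = 1 over F_p, using the 3M + C
# addition trick A = u2*u3, B = v2*v3, C = (u2+v2)(u3+v3) from the text.
p, a = 101, 2

def clock_add(P, Q):
    (u2, v2), (u3, v3) = P, Q
    A_ = u2 * u3 % p
    B_ = v2 * v3 % p
    C_ = (u2 + v2) * (u3 + v3) % p
    return (C_ - A_ - B_) % p, (B_ - a * A_) % p

def clock_scalarmult(n, P):
    R = (0, 1)                      # neutral element, "12:00"
    for bit in bin(n)[2:]:          # double-and-add from the top bit down
        R = clock_add(R, R)
        if bit == '1':
            R = clock_add(R, P)
    return R

P = (1, 10)                         # 2*1^2 + 10^2 = 102 = 1 in F_101
assert (a * P[0]**2 + P[1]**2) % p == 1
```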


4.2.1 The Lucas Ladder

Fix (u1, v1) ∈ Clocka(k). Define (un, vn) = n(u1, v1) for each n ≥ 2. Then

v2n = 2vn² − 1,    v2n+1 = 2vnvn+1 − v1.

Recursively applying these two formulas computes (vn, vn+1) using just cM + cS if n has c bits. Specifically, to compute (vn, vn+1), first recursively compute (vm, vm+1) where m = ⌊n/2⌋, and then compute vn and vn+1 using two out of the three formulas

v2m = 2vm² − 1,    v2m+1 = 2vmvm+1 − v1,    v2m+2 = 2vm+1² − 1,

namely the first two if n is even, and the last two if n is odd. Either way costs M + S. The base case is n = 0, where (vn, vn+1) = (1, v1). As an example, Figure 4.1 shows the indices used in computing (v73, v74). A double arrow from m to 2m indicates that v2m is computed from vm. Single arrows from m and m + 1 to 2m + 1 indicate that v2m+1 is computed from vm and vm+1. One can save time in the first few lines by skipping recomputations of v1 and v2.

The cost cM + cS here is significantly less than the cost of the double-and-add method. This comparison might seem unfair: if the objective is to compute nth multiples then the double-and-add method produces (un, vn) while this recursion does not seem to produce un. However, vn is sufficient for many applications. Furthermore, the recursion is best understood as producing both vn and vn+1, and solving for un in the addition formula vn+1 = v1vn − au1un produces a "u-recovery" formula un = (v1vn − vn+1)/(au1), assuming u1 ≠ 0.

The sequence (2v1, 2v2, 2v3, ...) is an example of a Lucas sequence of the second kind over k, i.e., a sequence of the form (α + β, α² + β², α³ + β³, ...) where α + β, αβ ∈ k; specifically, take α = v1 + u1√−a and β = v1 − u1√−a. A Lucas sequence of the first kind has nth entry (αⁿ − βⁿ)/(α − β). The special case αβ = 1 used here was introduced by Chebyshev before Lucas: vn, viewed as a polynomial in v1, is the nth Chebyshev polynomial of the first kind.
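A direct Python transcription of the recursion (an illustration, reusing the arbitrary prime of the previous sketch) makes the M + S per-bit cost visible:

```python
# Lucas ladder over F_p: returns (v_n, v_{n+1}) from v_1, using one
# squaring-type and one multiplication-type step per bit of n.
p = 101

def lucas_ladder(n, v1):
    if n == 0:
        return 1, v1 % p                       # base case (v_0, v_1)
    vm, vm1 = lucas_ladder(n // 2, v1)         # (v_m, v_{m+1}), m = floor(n/2)
    v_even = (2 * vm * vm - 1) % p             # v_{2m}
    v_mid  = (2 * vm * vm1 - v1) % p           # v_{2m+1}
    if n % 2 == 0:
        return v_even, v_mid
    v_odd = (2 * vm1 * vm1 - 1) % p            # v_{2m+2}
    return v_mid, v_odd

# v-coordinate of 73P on the clock of the previous sketch, from v(P) = 10:
v73, v74 = lucas_ladder(73, 10)
```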

4.2.2 Differential Addition Chains

Montgomery in [253] introduced an even faster method of computing vn. Experiments show this method taking only about 1.55M per bit of n, as already announced in [252]. The idea of this method applies to computing the nth term xn of any sequence (x0, x1, ...) that satisfies a recurrence of the form

xm+n = f(xm, xn, xn−m),


Figure 4.1 A uniform double-add ladder.

for some function f, starting from some initial values x0 and x1. The clock example has f(v2, v3, v1) = 2v2v3 − v1, starting with x0 = 1 and with x1 as the v-coordinate of an input point.

To compute x8, starting from x0 and x1, we compute x2 = f(x1, x1, x0), x4 = f(x2, x2, x0), and x8 = f(x4, x4, x0). To compute x9 we cannot simply extend this chain using x8 and x1 because we do not have x8−1 = x7. Instead we can compute it via x2 = f(x1, x1, x0), x3 = f(x1, x2, x1), x4 = f(x2, x2, x0), x5 = f(x2, x3, x1), x9 = f(x4, x5, x1). The indices 0, 1, 2 = 1 + 1, 3 = 2 + 1, 4 = 2 + 2, 5 = 3 + 2, 9 = 5 + 4 form a differential addition chain. This means a sequence that starts 0, 1 and that continues with sums n + m where n, m, n − m all appear earlier in the sequence. Montgomery calls these chains "Lucas chains"; other names in the literature include "strong addition chain" and "Chebyshev chain."

The simplest way to build a differential addition chain is to allow only 0 and 1 as differences n − m: i.e., to compute x2m as f(xm, xm, x0) for m ≥ 1 and to compute x2m+1 as f(xm, xm+1, x1) for m ≥ 1. Montgomery calls this the "binary method"; we follow common naming and call it a "ladder." This method takes two evaluations of f per bit of n to compute xn and xn+1 from x0 and x1. In the clock example, f costs S for n = m and M for n = m + 1, for a cost of M + S per bit to compute the v-coordinate of nP from the v-coordinate of P, as mentioned above.

Shorter chains exist for many numbers: e.g., 9 can be reached via the chain 0, 1, 2, 3, 6, 9, taking 4 steps instead of 5. It might also be helpful to allow a differential addition-subtraction chain: this means that one allows not only sums n + m after n, m, n − m, but also differences n − m after n, m, n + m, using some f′ with xn−m = f′(xm, xn, xm+n). For the clock one can take f′ = f.

In [253], Montgomery studied lower bounds for the lengths of these chains, and systematic methods to find short chains. See Section 4.8 for several such methods. Montgomery's PRAC method (short for "Practical Algorithm") achieves the 1.55M per bit mentioned above.

4.3 Montgomery Curves

Fix a field k not of characteristic 2, and fix A, B ∈ k with B(A² − 4) ≠ 0. The curve By² = x³ + Ax² + x is a Montgomery curve. This section introduces the group of points on this curve, both from the historical perspective of Weierstrass curves and from the modern perspective of Edwards curves.

4.3.1 Montgomery Curves as Weierstrass Curves

A short Weierstrass curve is a curve of the form y² = x³ + ax + b where 4a³ + 27b² ≠ 0. A small calculation (relying on the hypothesis that 2 ≠ 0 in k) shows that this curve is geometrically nonsingular: this means that the equation y² = x³ + ax + b, its x-derivative 0 = 3x² + a, and its y-derivative 2y = 0 have no common solutions (x, y) over any extension of k. Indeed, any common solution has y = 0 so x³ + ax + b = 0 so b = −x³ − ax = 2x³, so 4a³ + 27b² = −108x⁶ + 108x⁶ = 0, contradiction.

More generally (even in characteristic 2), a Weierstrass curve is a geometrically nonsingular curve of the form a0y² + a1xy + a3y = x³ + a2x² + a4x + a6 with a0 ≠ 0. The Montgomery curve By² = x³ + Ax² + x has this form with (a0, a1, a3, a2, a4, a6) = (B, 0, 0, A, 1, 0), and is geometrically nonsingular, so it is a Weierstrass curve. The nonsingularity calculation boils down to the calculation that the cubic polynomial x³ + Ax² + x has discriminant A² − 4, which was hypothesized to be nonzero. Concretely, if By² = x³ + Ax² + x and 2By = 0 and 3x² + 2Ax + 1 = 0 then y = 0 so x³ + Ax² + x = 0, so the discriminant identity

A² − 4 = ((−6A² + 18)x + (−4A³ + 15A))(x³ + Ax² + x) + ((2A² − 6)x² + (2A³ − 7A)x + (A² − 4))(3x² + 2Ax + 1)

implies that A² − 4 = 0, contradiction.

The name "Weierstrass curve" arises, historically, from an identity of the form a0(℘′)² = ℘³ + a4℘ + a6 satisfied by the Weierstrass ℘ function and its derivative ℘′, specifically with a0 = 1/4. In other words, (℘, ℘′) are points (x, y) on the Weierstrass curve a0y² = x³ + a4x + a6. We are deviating slightly from standard terminology here. The standard definition of "Weierstrass curve" in the literature assumes a0 = 1, so it allows the Montgomery curve By² = x³ + Ax² + x only in the case B = 1. Dropping the restriction a0 = 1 allows (℘, ℘′) to be points (x, y) on a "Weierstrass curve" without any rescaling, allows Montgomery curves without any rescaling, and makes the theory of Weierstrass curves only negligibly more complicated.

4.3.2 The Group Law for Weierstrass Curves

The set of k-points on a Weierstrass curve W, written W(k), is the set of pairs (x, y) ∈ k × k satisfying the curve equation, together with an extra point ∞. The points (x, y) are called affine points. Define a unary operation − on W(k) as follows:

• −∞ = ∞.
• −(x, y) = (x, −(y + (a1/a0)x + (a3/a0))).

Define a binary operation + on W(k) as follows:

• ∞ + ∞ = ∞.
• ∞ + (x, y) = (x, y).
• (x, y) + ∞ = (x, y).
• (x, y) + (−(x, y)) = ∞.
• If 2a0y + a1x + a3 ≠ 0 then (x, y) + (x, y) = −(x′, y + λ(x′ − x)), where λ = (3x² + 2a2x + a4)/(2a0y + a1x + a3) and x′ = a0λ² + a1λ − a2 − 2x.
• If x′ ≠ x then (x, y) + (x′, y′) = −(x′′, y + λ(x′′ − x)) where λ = (y′ − y)/(x′ − x) and x′′ = a0λ² + a1λ − a2 − x − x′.

One can prove that these definitions cover all cases; that the outputs are in W(k); and that W(k) is a commutative group with ∞ as neutral element, − as negation, and + as addition.


For ease of reference we repeat the rules in the special case of Montgomery curves, using the simplifications a1 = 0, a3 = 0, a0 = B, a2 = A, a4 = 1, a6 = 0, and 2B ≠ 0:

• −∞ = ∞.
• −(x, y) = (x, −y).
• ∞ + ∞ = ∞.
• ∞ + (x, y) = (x, y).
• (x, y) + ∞ = (x, y).
• (x, y) + (x, −y) = ∞.
• If y ≠ 0 then (x, y) + (x, y) = −(x′, y + λ(x′ − x)), where λ = (3x² + 2Ax + 1)/(2By) and x′ = Bλ² − A − 2x.
• If x′ ≠ x then (x, y) + (x′, y′) = −(x′′, y + λ(x′′ − x)) where λ = (y′ − y)/(x′ − x) and x′′ = Bλ² − A − x − x′.

4.3.3 Other Views of the Group Law

The projective k-points on W are all points (X : Y : Z) ∈ P²(k) satisfying the homogeneous equation a0ZY² + a1ZXY + a3Z²Y = X³ + a2ZX² + a4Z²X + a6Z³. Here P²(k) = {(X : Y : Z) : (X, Y, Z) ∈ k³ − {(0, 0, 0)}}. Sometimes (X : Y : Z) is defined as the subspace {(λX, λY, λZ) : λ ∈ k} of the k-vector space k³; sometimes it is defined as the set {(λX, λY, λZ) : λ ∈ k∗}. With either definition, (X′ : Y′ : Z′) = (X : Y : Z) if and only if (X′, Y′, Z′) = (λX, λY, λZ) for some λ ∈ k∗.

For each affine point (x, y) ∈ W(k) there is a corresponding projective k-point (x : y : 1) on W. The point ∞ ∈ W(k) corresponds to the projective k-point (0 : 1 : 0) on W. These cover all projective k-points on W: if Z ≠ 0 then (X : Y : Z) = (x : y : 1) where x = X/Z and y = Y/Z, and then the homogeneous equation implies (x, y) ∈ W(k); if Z = 0 then the homogeneous equation forces (X : Y : Z) = (0 : 1 : 0). Taking projective coordinates thus unifies the two cases in the definition of W(k).

As for the addition law, the whole group definition can be understood as just two rules:

• ∞ is the neutral element in the group.
• There is a standard definition of the multiset of intersection points of a line with the curve; if this multiset consists of three points P, Q, R then P + Q + R = 0 in the group.


There are several reasons that the second rule splits into cases. First, the multiset is not always a set; for example, if a line is tangent to the curve at P then P appears at least twice in the multiset. Second, the multiset is defined in terms of projective points, so it does not always consist of affine points; for example, a vertical line intersects the curve at ∞. Third, if k is algebraically closed then the multiset always has size exactly 3 by Bézout's theorem, but for more general fields k a line can intersect the curve in fewer points.

Sometimes "elliptic curve" is defined more generally as

• a nonsingular cubic curve C in two-dimensional projective space with a specified inflection point I (such as ∞ for Weierstrass curves); or
• a nonsingular cubic curve in two-dimensional projective space with a specified point (not necessarily an inflection point); or
• a nonsingular genus-1 curve in n-dimensional projective space with a specified point.

With the first definition, one can use the same P + Q + R = 0 rule to define a group law on C(k) with I as neutral element. With the second definition, slightly more work is required; see, e.g., [188, Chapter 3, Theorem 1.2]. With the third definition, the standard approach is to abandon the chord-and-tangent approach and instead declare that the zeros and poles of any algebraic function on C, not just a linear function, have sum 0. The main work is then to show that points P of C(k) map bijectively to elements P − I of this "divisor class group"; see [16, chapter 4] for details.

4.3.4 Edwards Curves and Their Group Law

An Edwards curve is a curve of the form u² + v² = 1 + du²v² with d ∉ {0, 1} over a field k not of characteristic 2. Edwards curves were introduced in a slightly less general form by Edwards in [136], who also defined a group operation on them. Bernstein and Lange in [44] introduced this form and defined efficient formulas for the group operation. A twisted Edwards curve [37] is a curve of the form au² + v² = 1 + du²v² with a ≠ d and a, d ≠ 0.

A twisted Edwards curve E where a is square in k and d is non-square in k is called k-complete, or simply complete if k is clear from the context. In this case the group of k-points of E, written E(k), is defined as the set of (u, v) ∈ k × k satisfying the curve equation, with the following operations. The neutral element is (0, 1). The negative of (u, v) is (−u, v). The sum of two points (u2, v2) and (u3, v3) is defined as

(u2, v2) + (u3, v3) = ((u2v3 + u3v2)/(1 + du2u3v2v3), (v2v3 − au2u3)/(1 − du2u3v2v3)).    (4.1)

The denominators are never 0; see [44]. The Edwards addition law (4.1) is a complete addition law, i.e., an addition law that holds for all inputs.

To define E(k) for general a and d, without the requirements of a being square and d being non-square, one needs more work: the addition law (4.1) is defined almost everywhere but can produce divisions by 0. The general definition is as follows. Define E(k) to have the following elements: all (u, v) ∈ k × k satisfying the curve equation; (±1/√d, ∞) if d is a square; and (∞, ±√(a/d)) if a/d is a square. In other words, E(k) is the set of (u, v) ∈ (k ∪ {∞}) × (k ∪ {∞}) satisfying the curve equation, with a careful definition of arithmetic on ∞. Formally, consider the projective embedding of E into P¹ × P¹ = {((U : Z), (V : T))}, namely aU²T² + V²Z² = Z²T² + dU²V². Each affine point (u, v) corresponds to the projective point ((u : 1), (v : 1)). Additional projective points are ((1 : ±√d), (1 : 0)) and ((1 : 0), (±√(a/d) : 1)) if those are defined over k. We identify (1 : 0) with ∞ and identify the rest of P¹(k) with k, so each point is a pair of coordinates in k ∪ {∞}.

As before, the neutral element of E(k) is (0, 1), and the negative of (u, v) is (−u, v), where −∞ means ∞. The sum of two points (u2, v2) and (u3, v3) is defined by the Edwards addition law (4.1) together with the dual addition law

(u2, v2) + (u3, v3) = ((u2v2 + u3v3)/(au2u3 + v2v3), (u2v2 − u3v3)/(u2v3 − u3v2)).    (4.2)

For each pair of points ((u2, v2), (u3, v3)) ∈ E(k) × E(k), at least one of these laws is defined. Here "defined" allows divisions by 0, producing ∞ as output, but does not allow 0/0. If both laws are defined then the results are identical. This is true for each coordinate separately: if both laws have a defined u-coordinate then those u-coordinates are identical; if both laws have a defined v-coordinate then those v-coordinates are identical. E(k) forms an abelian group under these operations. The dual addition law was introduced by Hisil, Wong, Carter, and Dawson in [186]. The completeness of the set of two addition laws was shown by Bernstein and Lange in [46].


The v-coordinate in the dual addition law (4.2) is undefined if and only if (u3, v3) = (u2, v2) or (u3, v3) = (−u2, −v2). By completeness, the Edwards addition law (4.1) is defined in these cases. In particular, the Edwards addition law is a valid formula for doubling any point. Write (u4, v4) = 2(u2, v2); then u2² = (1 − v2²)/(a − dv2²), so

v4 = (v2² − au2²)/(1 − du2²v2²)
= (v2²(a − dv2²) − a(1 − v2²))/(a − dv2² − d(1 − v2²)v2²)
= (2av2² − a − dv2⁴)/(a − 2dv2² + dv2⁴).

Note for future reference that this formula expresses v4 purely in terms of v2. This formula has no exceptional cases: in particular, it works for u2 = ∞ and for v2 = ∞.

4.3.5 Montgomery Curves as Edwards Curves

Edwards curves and Montgomery curves are examples of elliptic curves with some special properties. In particular, Edwards curves have a point of order 4 at (1, 0) and a point of order 2 at (0, −1). Montgomery curves have a point of order 2 at (0, 0) and, over finite fields, at least one of the following: a point of order 4 doubling to (0, 0) or two more points of order 2. The same conditions hold for twisted Edwards curves.

In fact, Montgomery curves and twisted Edwards curves cover the same set of elliptic curves. More precisely, for each Montgomery curve there is a birationally equivalent twisted Edwards curve, and vice versa. Here a birational equivalence between two elliptic curves M, E is a pair of rational maps M → E and E → M that are defined almost everywhere, that are each other's inverses when both are defined, and that map specified neutral element to specified neutral element. One can show that a birational equivalence preserves addition wherever it is defined, and can be extended to a group isomorphism.

Specifically, the transformation formulas from the twisted Edwards curve au² + v² = 1 + du²v² to the Montgomery curve By² = x³ + Ax² + x are

x = (1 + v)/(1 − v) and y = (1 + v)/(u(1 − v)) = x/u,

where the curve parameters have the relationship

A = 2(a + d)/(a − d) and B = 4/(a − d).


Likewise, the formulas from the Montgomery curve to the twisted Edwards curve are

u = x/y and v = (x − 1)/(x + 1),

and the curve parameters satisfy

a = (A + 2)/B and d = (A − 2)/B.

The map from v to x = (1 + v)/(1 − v) is defined for all v ∈ k ∪ {∞} if ∞ is handled carefully as input and output. If v ∈ k − {1} then x ∈ k − {−1}. If v = 1 then x = ∞; here the input point is (0, 1), the neutral element, and the output point is ∞ as required. If v = ∞ then x = −1; here the input is an order-4 point (±1/√d, ∞) whose double is the order-2 point (0, −1), and the output is an order-4 point (−1, ∓√d) whose double is the order-2 point (0, 0). The inverse map from x to v = (x − 1)/(x + 1) is similarly defined for all x ∈ k ∪ {∞}. If x ∈ k − {−1} then v ∈ k − {1}. If x = −1 then v = ∞. If x = ∞ then v = 1. These are inverse maps since v = ((1 + v)/(1 − v) − 1)/((1 + v)/(1 − v) + 1) and x = (1 + (x − 1)/(x + 1))/(1 − (x − 1)/(x + 1)).
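These maps translate directly into code; the following Python sketch, an added illustration that ignores the exceptional points just discussed (v = 1, u = 0, x = −1, y = 0), uses the Curve25519 parameters of Section 4.3.7 as its example values.

```python
# Birational maps between au^2 + v^2 = 1 + du^2v^2 and By^2 = x^3 + Ax^2 + x
# over F_p, away from the exceptional points. Illustrative parameters.
p = 2**255 - 19
A, B = 486662, 1
a = (A + 2) * pow(B, -1, p) % p
d = (A - 2) * pow(B, -1, p) % p

def edwards_to_montgomery(u, v):
    x = (1 + v) * pow(1 - v, -1, p) % p       # x = (1 + v)/(1 - v)
    return x, x * pow(u, -1, p) % p           # y = x/u

def montgomery_to_edwards(x, y):
    return (x * pow(y, -1, p) % p,            # u = x/y
            (x - 1) * pow(x + 1, -1, p) % p)  # v = (x - 1)/(x + 1)
```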

4.3.6 Elliptic-Curve Cryptography (ECC)

Miller in [247], and independently Koblitz in [217], proposed an elliptic-curve variant of the Diffie–Hellman key exchange method [126]. Miller in [247, page 425] suggested exchanging just x-coordinates instead of (x, y)-coordinates: i.e., sending x(P) rather than an entire point P, where x(x, y) = x.

The Diffie–Hellman key exchange with x-coordinates works as follows. One user, say Alice, has a secret key s and a public key x(sP). Here s is an integer, and P is a standard point on a standard Weierstrass curve. Another user, say Bob, has a secret key t and a public key x(tP). Alice and Bob then both know a shared secret x(stP) = x(s(tP)) = x(t(sP)), which seems quite difficult for an attacker to predict.

Note that x(stP) is entirely determined by s and x(tP). Indeed, the only possible ambiguity in recovering tP from x(tP) is the possible distinction between tP and −tP, and this distinction has no effect on x(stP): the x-coordinate is invariant under point negation, so x(s(−tP)) = x(−stP) = x(stP). The same argument applies if x-coordinates on Weierstrass curves are replaced by v-coordinates on twisted Edwards curves.


The bottleneck here is elliptic-curve scalar multiplication: Alice first has to compute her public key x(sP) given her secret key s, and then has to compute the shared secret x(stP) given her secret key s and Bob's public key x(tP). For any Weierstrass curve, Alice can compute a square root to obtain ±tP from x(tP), can use the double-and-add method with the respective doubling and addition formulas to obtain ±stP, and can then discard the y-coordinate to obtain x(stP). For Montgomery curves, Alice can use the more efficient Montgomery ladder to compute x(stP) from x(tP), using the doubling and differential-addition formulas developed in this chapter.

Electronic signatures use secret key s and public key sP, i.e., they require both coordinates of sP. This makes short Weierstrass curves or (twisted) Edwards curves the more common choice. Verification of a signature typically involves a double-scalar multiplication mP + nQ, of which only x(mP + nQ) is used. This can be computed using a two-dimensional addition chain (Section 4.7).
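Schematically, the protocol is just a pair of calls to a scalar-multiplication routine. The sketch below is an illustration, not a complete X25519 specification (it omits the scalar clamping and byte encodings of RFC 7748); `xladder(n, x)` is assumed to return x(nP) from x(P), for example the constant-time ladder sketched in Section 4.6.2, and `base_x = 9` is Curve25519's standard base-point x-coordinate.

```python
# x-coordinate Diffie-Hellman, assuming an xladder(n, x) routine such as
# the one sketched in Section 4.6.2.
base_x = 9

def public_key(secret):                # x(sP) from s
    return xladder(secret, base_x)

def shared_secret(secret, their_pub):  # x(stP) from s and x(tP)
    return xladder(secret, their_pub)

# shared_secret(s, public_key(t)) == shared_secret(t, public_key(s))
```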

4.3.7 Examples of Noteworthy Montgomery Curves

Montgomery introduced Montgomery curves for more efficient factorization of integers in ECM, the elliptic-curve method of factorization; see Chapter 8. Montgomery's thesis [254] includes several Montgomery curves that are particularly suitable for ECM because the curves are guaranteed to have large Q-torsion. These curves form parameterized families; the curve with "Suyama parameter σ = 11" was further analyzed in [26].

In cryptography, Bernstein's Curve25519 [35] has found widespread use. The curve is the Montgomery curve with A = 486662 and B = 1 defined over Fp with p = 2²⁵⁵ − 19. This prime satisfies p ≡ 1 mod 4, and so for a Montgomery curve over Fp either the curve or its quadratic twist has order divisible by 8. The curve is chosen to have (A − 2)/4 minimal among the curves satisfying that the group order is 8ℓ and that the order of the twist is 4ℓ′, where ℓ and ℓ′ are prime numbers, and that the curve satisfies all standard security criteria. See [47] for more security details, and [97] for recent Curve25519 performance results. The WhatsApp messaging system [346] now uses Curve25519 to encrypt all messages from end to end, and the TLS protocol for secure web access [125] has recently added Curve25519 as an option. The EdDSA signature scheme [43] uses twisted Edwards curves, and in particular the Ed25519 signature scheme [42] uses a twisted Edwards curve birationally equivalent to Curve25519.

Curve41417 [40] and Curve448 [179] are two newer examples of Montgomery curves (equivalently, twisted Edwards curves) designed for efficient cryptography over larger prime fields. Curve41417 has been deployed in the BlackPhone [100], and Curve448 is being considered as an alternative for TLS.

cryptography over larger prime fields. Curve41417 has been deployed in the BlackPhone [100], and Curve448 is being considered as an alternative for TLS.

4.4 Doubling Formulas without y

Let P be a point on a Weierstrass curve. Then x(nP), the x-coordinate of nP, is entirely determined by n and x(P), as mentioned above. Even better, similar to the clock example in Section 4.2, x(nP) is a rational function of x(P) for each n. One can compute the numerator and denominator with ring operations, and then divide; there is no need for an initial square-root computation to recover y.

In the case of short Weierstrass curves it is easy to find literature stating explicit "division polynomial" recurrences for nP. Miller's original ECC paper [247] repeated these recurrences, and also mentioned the possibility of avoiding y-coordinates in ECC. However, Miller reported "26 log₂ n multiplications" to compute nP using these recurrences. The Montgomery ladder is much simpler and almost three times faster. The structure of Montgomery curves is important for this simplicity and speed: from the modern Edwards perspective, Montgomery takes advantage of having a point of order 4 on the curve or its twist.

In the above description we have ignored exceptional cases: e.g., dividing a 0 numerator by a 0 denominator. We start by handling the generic case. We return to exceptional cases in Sections 4.4.4, 4.5.3, and 4.6.3. This section begins with the simplest case n = 2: computing x(2P) from x(P). Sections 4.5 and 4.6 handle larger values of n.

4.4.1 Doubling: The Weierstrass View

Theorem 4.1 Fix a field k not of characteristic 2. Fix A, B ∈ k with B(A² − 4) ≠ 0. Define M as the Montgomery curve By² = x³ + Ax² + x. Define x : M(k) → k ∪ {∞} as follows: x(x, y) = x; x(∞) = ∞. Let P be an element of M(k). If x(P) = ∞ then x(2P) = ∞. If x(P) ≠ ∞ and x(P)³ + A·x(P)² + x(P) = 0 then x(2P) = ∞. If x(P) ≠ ∞ and x(P)³ + A·x(P)² + x(P) ≠ 0 then

x(2P) = (x(P)² − 1)²/(4(x(P)³ + A·x(P)² + x(P))).


Proof If x(P) = ∞ then P = ∞ so 2P = ∞ so x(2P) = ∞ as claimed. Assume from now on that x(P) ≠ ∞. Then P = (x, y) for some x, y ∈ k satisfying By² = x³ + Ax² + x. By definition x(P) = x.

If x³ + Ax² + x = 0 then y = 0 so 2P = (x, 0) + (x, 0) = (x, 0) − (x, 0) = ∞ so x(2P) = ∞ as claimed. Assume from now on that x³ + Ax² + x ≠ 0. Then y ≠ 0. By definition (see Section 4.3.2) 2P = (Bλ² − A − 2x, ...) where λ = (3x² + 2Ax + 1)/(2By). Consequently

x(2P) = Bλ² − A − 2x
= B(3x² + 2Ax + 1)²/(4B²y²) − A − 2x
= (3x² + 2Ax + 1)²/(4By²) − A − 2x
= (3x² + 2Ax + 1)²/(4(x³ + Ax² + x)) − A − 2x
= ((3x² + 2Ax + 1)² − 4(x³ + Ax² + x)(2x + A))/(4(x³ + Ax² + x))
= (9x⁴ + 12Ax³ + (4A² + 6)x² + 4Ax + 1 − 4(2x⁴ + 3Ax³ + (A² + 2)x² + Ax))/(4(x³ + Ax² + x))
= (x⁴ − 2x² + 1)/(4(x³ + Ax² + x))
= (x² − 1)²/(4(x³ + Ax² + x))

as claimed. □

4.4.2 Optimized Doublings

Divisions are slow. To avoid divisions, the Montgomery ladder represents x-coordinates as fractions. This also means that doublings inside the Montgomery ladder take their inputs x(P) as fractions. (Small exception: the first doubling in the ladder can be sped up in the normal case that its input is provided with denominator 1.) This requires extra multiplications in Theorem 4.1: for example, computing x(P)² requires squaring both the numerator and the denominator. These extra operations appear inside the simple formulas for X2n and Z2n shown in Section 4.1. A straightforward operation count would suggest that there are six multiplications here (not counting the final multiplication by 4, which can be done with two additions): Xn², Zn², XnZn, AXnZn, X2n, Z2n.

Montgomery's optimized formulas, also shown in Section 4.1, save a multiplication as follows. Start with (Xn + Zn)² and (Xn − Zn)². Compute X2n as the product of these squares, and compute 4XnZn as the difference of these squares. Multiply by (A − 2)/4 to obtain (A − 2)XnZn, add (Xn + Zn)² to obtain Xn² + AXnZn + Zn², and multiply by 4XnZn to obtain Z2n. In total there are two squarings, one multiplication by (A − 2)/4, two more multiplications, two additions, and two subtractions.

Montgomery's formulas can be viewed as expressing doubling as the composition of two 2-isogenies. Montgomery reportedly found these formulas via experiments with 2-isogeny formulas. The same idea has been productively reused to build other curve shapes with efficient formulas for doubling and tripling; see, e.g., [130] and [39].

If d = (A − 2)/(A + 2) is a square, say r², then one can replace 2M + 2S + 1C with 4S + 3C as follows (and with 4S + 2C if one changes coordinates from Xn and Zn to r(Xn − Zn) and Xn + Zn). Precompute s = (1 + r)/(1 − r). Then compute Y = r(Xn − Zn)², Z = (Xn + Zn)², V = s(Z − Y)², W = (Z + Y)², Y′ = W − V, and Z′ = r(W + V). Now (Z′ + Y′, Z′ − Y′) is (X2n, Z2n) times an irrelevant factor 4(d + r). This is a speedup if r has small enough numerator and denominator. This speedup is due in essence to Gaudry; see [158], [159], and [45].
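In code, the 2M + 2S + 1C doubling described above looks as follows; this is an added illustration, with Curve25519's constant (A − 2)/4 as the sample value of a24.

```python
# Optimized doubling: (X2n, Z2n) from (Xn, Zn) with 2M + 2S + 1C.
p = 2**255 - 19
a24 = (486662 - 2) // 4              # (A - 2)/4 for the example curve

def xdbl(Xn, Zn):
    s2 = (Xn + Zn) * (Xn + Zn) % p   # (Xn + Zn)^2
    t2 = (Xn - Zn) * (Xn - Zn) % p   # (Xn - Zn)^2
    X2n = s2 * t2 % p                # product of the two squares
    e = (s2 - t2) % p                # their difference, equal to 4*Xn*Zn
    Z2n = e * ((s2 + a24 * e) % p) % p   # 4XnZn * (Xn^2 + A*XnZn + Zn^2)
    return X2n, Z2n
```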

4.4.3 A Word of Warning: Projective Coordinates

Here is a different way to view representing x(P) as a fraction. Use a tuple (X, Y, Z) to represent a point P = (X : Y : Z) on M in projective coordinates. Discard the Y-coordinate, leaving only the pair (X, Z) to represent X/Z = x(P). One might think that this view smoothly generalizes from affine points to all points on M, and that the case distinctions in Theorem 4.1 are merely artifacts of working in affine coordinates. However, discarding the Y-coordinate from the point (0 : 1 : 0) produces (0 : 0). The definition of P¹ excludes (0 : 0); the standard notion of fractions excludes 0/0. More importantly, converting the generic case of Theorem 4.1 into projective formulas for x(2P), and then applying those formulas to the case P = (0, 0) with x(P) = 0/1, does not produce 0/0 as output; it produces 1/0.

4.4.4 Completeness of Generic Doubling Formulas

The Montgomery ladder defines X2n and Z2n from Xn and Zn using formulas that match the generic case of Theorem 4.1: it defines, e.g., X4 = (X2² − Z2²)² and Z4 = 4X2Z2(X2² + AX2Z2 + Z2²). This raises the question of what exactly these formulas do in the other cases. As noted in Section 4.1, the exceptional cases did not matter for Montgomery's application to ECM, but they do matter for cryptography.


The following theorem says that x(2P) = X4/Z4 if x(P) = X2/Z2. This was proven in [35] under the assumption that A² − 4 is non-square.

Theorem 4.2 Fix a field k not of characteristic 2. Fix A, B ∈ k with B(A² − 4) ≠ 0. Define M as the Montgomery curve By² = x³ + Ax² + x. Define x : M(k) → k ∪ {∞} as follows: x(x, y) = x; x(∞) = ∞. Let X2, Z2 be elements of k. Define X4 = (X2² − Z2²)², Z4 = 4X2Z2(X2² + AX2Z2 + Z2²). Let P be an element of M(k). If (X2, Z2) ≠ (0, 0) and x(P) = X2/Z2 then (X4, Z4) ≠ (0, 0) and x(2P) = X4/Z4.

Here X/Z means the quotient of X and Z in k if Z ≠ 0; it means ∞ if X ≠ 0 and Z = 0; it is undefined if X = Z = 0.

Proof If Z2 = 0 then X4 = X2⁴ ≠ 0 and Z4 = 0 so (X4, Z4) ≠ (0, 0) as claimed. Also x(P) = X2/0 = ∞ so, by Theorem 4.1, x(2P) = ∞ = X4/Z4 as claimed.

Assume from now on that Z2 ≠ 0; then x(P) ≠ ∞. If x(P)³ + A·x(P)² + x(P) = 0 then X2³ + AX2²Z2 + X2Z2² = 0 so Z4 = 4Z2(X2³ + AX2²Z2 + X2Z2²) = 0. Suppose that X4 = 0; then X2² = Z2² so x(P)² = 1 so x(P) = ±1 so 0 = x(P)³ + A·x(P)² + x(P) = A ± 2, contradicting the hypothesis that A² − 4 ≠ 0. Hence X4 ≠ 0 so (X4, Z4) ≠ (0, 0) as claimed. Also, by Theorem 4.1, x(2P) = ∞ = X4/Z4 as claimed.

Assume from now on that x(P)³ + A·x(P)² + x(P) ≠ 0. Multiply by Z2³ ≠ 0 to obtain X2³ + AX2²Z2 + X2Z2² ≠ 0. In particular X2 ≠ 0 so Z4 ≠ 0 so (X4, Z4) ≠ (0, 0) as claimed. Furthermore 4(x(P)³ + A·x(P)² + x(P)) = Z4/Z2⁴ and (x(P)² − 1)² = X4/Z2⁴. By Theorem 4.1, x(2P) = (x(P)² − 1)²/(4(x(P)³ + A·x(P)² + x(P))) = X4/Z4 as claimed. □

4.4.5 Doubling: The Edwards View

We now use the Edwards addition law to give a direct proof of Theorem 4.2, without the calculations from Theorem 4.1.

Alternate proof of Theorem 4.2 M is birationally equivalent to the twisted Edwards curve au² + v² = 1 + du²v² with a = (A + 2)/B and d = (A − 2)/B. Define (u2, v2) and (u4, v4) as the points corresponding to P and 2P respectively. Recall that v4 = (2av2² − a − dv2⁴)/(a − 2dv2² + dv2⁴). We now develop matching formulas for the Montgomery x-coordinates x2 = x(P) and x4 = x(2P). First

x4 = (1 + v4)/(1 − v4)
= (a − 2dv2² + dv2⁴ + 2av2² − a − dv2⁴)/(a − 2dv2² + dv2⁴ − (2av2² − a − dv2⁴))
= 2(a − d)v2²/(2(a − av2² − dv2² + dv2⁴))
= (a − d)v2²/((1 − v2²)(a − dv2²)).

Use v2 = (x2 − 1)/(x2 + 1) and the relation of the curve coefficients:

x4 = (a − d)v2²/((1 − v2²)(a − dv2²))
= (a − d)(x2 − 1)²(x2 + 1)²/(((x2 + 1)² − (x2 − 1)²)(a(x2 + 1)² − d(x2 − 1)²))
= (a − d)(x2 − 1)²(x2 + 1)²/(4x2((a − d)(x2² + 1) + 2(a + d)x2))
= (x2² − 1)²/(4x2(x2² + Ax2 + 1)).

Replace x2 by X2/Z2 and clear denominators. □

4.5 Differential-Addition Formulas

One cannot expect to be able to compute x(P3 + P2) given only x(P3) and x(P2). Usually there are four possibilities for (P3, P2), four possibilities for P3 + P2, and two possibilities for x(P3 + P2). Montgomery's differential-addition formulas instead compute x(P3 + P2) given x(P3), x(P2), and x(P3 − P2), as explained in this section. Similar to the clock addition formulas from Section 4.2, these formulas produce x(3P) given x(2P), x(P), x(P); they produce x(7P) given x(4P), x(3P), x(P); they produce x(13P) given x(7P), x(6P), x(P).

4.5.1 Differential Addition: The Weierstrass View

We begin by deriving Montgomery's differential-addition formulas in essentially the same way that Montgomery did, starting from the definition of addition for Weierstrass curves.

Theorem 4.3 Fix a field k not of characteristic 2. Fix A, B ∈ k with B(A² − 4) ≠ 0. Define M as the Montgomery curve By² = x³ + Ax² + x. Define x : M(k) → k ∪ {∞} as follows: x(x, y) = x; x(∞) = ∞. Let P2, P3 be elements of M(k) with P3 ≠ ∞, P2 ≠ ∞, P3 ≠ P2, and P3 ≠ −P2. Then x(P3) ≠ x(P2) and

x(P3 + P2)·x(P3 − P2) = (x(P3)x(P2) − 1)²/(x(P3) − x(P2))².


Proof P3 ≠ ∞ so P3 = (x, y) for some x, y ∈ k satisfying By² = x³ + Ax² + x; and P2 ≠ ∞ so P2 = (x′, y′) for some x′, y′ ∈ k satisfying B(y′)² = (x′)³ + A(x′)² + x′.

Suppose that x = x′. Then By² = B(y′)² so y = ±y′. If y = y′ then P3 = P2, contradiction. If y = −y′ then P3 = −P2, contradiction. Thus x ≠ x′, and P3 + P2 = (Bλ² − A − x − x′, ...) where λ = (y′ − y)/(x′ − x). Consequently

x(P3 + P2) = Bλ² − A − x − x′
= B(y′ − y)²/(x′ − x)² − A − x − x′
= (B(y′)² + By² − 2Byy′)/(x′ − x)² − A − x − x′
= ((x′)³ + A(x′)² + x′ + x³ + Ax² + x − 2Byy′ − (A + x + x′)(x′ − x)²)/(x′ − x)²
= ((x′)³ + x′ + x³ + x + 2Axx′ − 2Byy′ − (x′ + x)(x′ − x)²)/(x′ − x)²
= ((x′ + x)(1 + xx′) + 2Axx′ − 2Byy′)/(x′ − x)².

Similarly x(P3 − P2) = ((x′ + x)(1 + xx′) + 2Axx′ + 2Byy′)/(x′ − x)². Thus

x(P3 + P2)·x(P3 − P2)·(x′ − x)⁴
= ((x′ + x)(1 + xx′) + 2Axx′)² − (2Byy′)²
= ((x′ + x)(1 + xx′) + 2Axx′)² − 4By²·B(y′)²
= ((x′ + x)(1 + xx′) + 2Axx′)² − 4(x³ + Ax² + x)((x′)³ + A(x′)² + x′)
= (x′ + x)²(1 + xx′)² + 4Axx′(x′ + x)(1 + xx′) + 4A²x²(x′)² − 4(x³ + x)((x′)³ + x′) − 4A((x³ + x)(x′)² + ((x′)³ + x′)x²) − 4A²x²(x′)²
= (x′ + x)²(1 + xx′)² + 4Axx′(x′ + x + (x′)²x + x²x′) − 4(x³ + x)((x′)³ + x′) − 4Axx′(x²x′ + x′ + (x′)²x + x)
= ((x′)² + 2xx′ + x²)(1 + 2xx′ + x²(x′)²) − 4(x³(x′)³ + x³x′ + (x′)³x + xx′)
= (x′)² + 2xx′ + x² + 2x(x′)³ + 4x²(x′)² + 2x³x′ + x²(x′)⁴ + 2x³(x′)³ + x⁴(x′)² − 4(x³(x′)³ + x³x′ + (x′)³x + xx′)
= (x′)² − 2xx′ + x² − 2x(x′)³ + 4x²(x′)² − 2x³x′ + x²(x′)⁴ − 2x³(x′)³ + x⁴(x′)²
= ((x′)² − 2xx′ + x²)(1 − 2xx′ + x²(x′)²)
= (x′ − x)²(xx′ − 1)²

so x(P3 + P2)·x(P3 − P2) = (xx′ − 1)²/(x′ − x)². □

4.5.2 Optimized Differential Addition

As discussed in Section 4.4.2, the Montgomery ladder represents x-coordinates as fractions. This eliminates the division in Theorem 4.3 but uses extra multiplications. The simple formulas for X2n+1 and Z2n+1 in Section 4.1 use 4M for XnXn+1 − ZnZn+1 and XnZn+1 − ZnXn+1, 2S, and 2M by the numerator X1 and denominator Z1 of x(P3 − P2). Montgomery's optimized formulas, also shown in Section 4.1, replace the first four multiplications with just two: they rewrite 2(XnXn+1 − ZnZn+1) and 2(XnZn+1 − ZnXn+1) as the sum and difference of (Xn − Zn)(Xn+1 + Zn+1) and (Xn + Zn)(Xn+1 − Zn+1). In total there are 2S, 1M by X1, 1M by Z1, 2M more, three additions, and three subtractions.

The same trick works for any expressions of the form αβ − γδ and αδ − βγ: except for a rescaling by 2, these are the sum and difference of (α − γ)(β + δ) and (α + γ)(β − δ). In other words, to multiply the polynomials α + γt and β − δt modulo t² − 1, first multiply modulo t − 1 and t + 1, and then interpolate.
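As a companion to the doubling sketch in Section 4.4.2, here is the optimized differential addition in Python (again an added illustration with an example prime):

```python
# Optimized differential addition: x(P3 + P2) = X5/Z5 from x(P2) = X2/Z2,
# x(P3) = X3/Z3 and x(P3 - P2) = X1/Z1. Cost 3M + 2S plus 1M by Z1
# (the multiplication by Z1 is saved when Z1 = 1).
p = 2**255 - 19

def xadd(X2, Z2, X3, Z3, X1, Z1):
    c = (X2 - Z2) * (X3 + Z3) % p        # c + d = 2(X2X3 - Z2Z3)
    d = (X2 + Z2) * (X3 - Z3) % p        # c - d = 2(X2Z3 - Z2X3)
    X5 = (c + d) * (c + d) % p * Z1 % p  # 4(X2X3 - Z2Z3)^2 * Z1
    Z5 = (c - d) * (c - d) % p * X1 % p  # 4(X2Z3 - Z2X3)^2 * X1
    return X5, Z5
```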

4.5.3 Quasi-Completeness

The Montgomery ladder defines X2n+1 and Z2n+1 from Xn+1, Zn+1, Xn, Zn, X1, Z1 using formulas that match Theorem 4.3: it defines, e.g., X5 = 4(X2X3 − Z2Z3)²Z1 and Z5 = 4(X2Z3 − Z2X3)²X1. But there are various hypotheses in Theorem 4.3, raising the question of what happens when these hypotheses are violated, as in Section 4.4.4.

Theorem 4.4 below almost says that x(P3 + P2) = X5/Z5 in all cases. However, it excludes two values of x(P3 − P2), namely 0 and ∞. In other words, it excludes two values of P3 − P2, namely (0, 0) and ∞. Later we will analyze what the Montgomery ladder does for these inputs.

Theorem 4.4 Fix a field k not of characteristic 2. Fix A, B ∈ k with B(A² − 4) ≠ 0. Define M as the Montgomery curve By² = x³ + Ax² + x. Define x : M(k) → k ∪ {∞} as follows: x(x, y) = x; x(∞) = ∞. Let X1, Z1, X2, Z2, X3, Z3 be elements of k. Define

X5 = 4(X2X3 − Z2Z3)²Z1,
Z5 = 4(X2Z3 − Z2X3)²X1.

Let P2, P3 be elements of M(k). Assume that X1 ≠ 0; Z1 ≠ 0; x(P3 − P2) = X1/Z1; (X2, Z2) ≠ (0, 0); x(P2) = X2/Z2; (X3, Z3) ≠ (0, 0); and x(P3) = X3/Z3. Then (X5, Z5) ≠ (0, 0) and x(P3 + P2) = X5/Z5.

We emphasize that both X1 and Z1 are required to be nonzero individually.

Proof If P3 = P2 then X1/Z1 = x(P3 − P2) = x(∞) = ∞ so Z1 = 0, contradiction. Hence P3 ≠ P2.

If P2 = ∞ then X2/Z2 = x(P2) = ∞ so Z2 = 0. Furthermore X1/Z1 = x(P3 − P2) = x(P3) = X3/Z3 so X3Z1 = X1Z3. Hence X5 = 4(X2X3)²Z1 = 4X2²X1X3Z3 and Z5 = 4(X2Z3)²X1 = 4X2²X1Z3². By hypothesis X1 ≠ 0; also X2 ≠ 0 since Z2 = 0; and Z3 ≠ 0 since X3/Z3 = x(P3) ≠ x(P2) = ∞. Hence Z5 ≠ 0 and X5/Z5 = X3/Z3 = x(P3) = x(P3 + P2) as claimed.

Similarly, if P3 = ∞ then X3/Z3 = x(P3) = ∞ so Z3 = 0. Furthermore X1/Z1 = x(P3 − P2) = x(−P2) = x(P2) = X2/Z2 so X2Z1 = X1Z2. Hence X5 = 4(X2X3)²Z1 = 4X3²X1Z2X2 and Z5 = 4(Z2X3)²X1 = 4X3²X1Z2². Again X1 ≠ 0; X3 ≠ 0 since Z3 = 0; and Z2 ≠ 0 since X2/Z2 = x(P2) ≠ ∞. Hence Z5 ≠ 0 and X5/Z5 = X2/Z2 = x(P2) = x(P3 + P2) as claimed.

Assume from now on that P2 ≠ ∞ and P3 ≠ ∞. Note that Z2 ≠ 0 and Z3 ≠ 0.

If P3 = −P2 then X2/Z2 = x(P2) = x(−P3) = x(P3) = X3/Z3 so X2Z3 = Z2X3 so Z5 = 0. We will show in a moment that X5 ≠ 0, so X5/Z5 = ∞ = x(∞) = x(P3 + P2) as claimed. Note that X2 ≠ 0: if X2 = 0 then Z2 ≠ 0 so x(P2) = X2/Z2 = 0 so P2 = (0, 0) so P3 = −P2 = −(0, 0) = (0, 0) = P2, contradiction. Similarly X3 ≠ 0. Now suppose that X5 = 0. Then 4(X2X3 − Z2Z3)²Z1 = 0, but Z1 ≠ 0, so X2X3 = Z2Z3. Consequently (X2 + Z2)(X3 − Z3) = (X2X3 − Z2Z3) − (X2Z3 − Z2X3) = 0. If X2 + Z2 ≠ 0 then X3 − Z3 = 0 so x(P2) = x(−P3) = x(P3) = X3/Z3 = 1. Otherwise X2 = −Z2 so x(P2) = −1. Either way x(P2)² = 1 so x(2P2) = 0 by Theorem 4.2. Hence X1/Z1 = x(P3 − P2) = x(−2P2) = x(2P2) = 0 so X1 = 0, contradiction.

Assume from now on that P3 + P2 ≠ ∞. All hypotheses of Theorem 4.3 are now satisfied, so x(P3) ≠ x(P2) and x(P3 + P2)·x(P3 − P2)·(x(P3) − x(P2))² = (x(P3)x(P2) − 1)². Multiply through by appropriate powers


of Z1, Z2, Z3 to see that X3Z2 ≠ X2Z3 and x(P3 + P2)·X1·(X3Z2 − X2Z3)² = Z1(X2X3 − Z2Z3)²; i.e., Z5 ≠ 0 and x(P3 + P2) = X5/Z5 as claimed. □

4.5.4 Differential Addition: The Edwards View

Alternate proof of Theorem 4.4 Write x1, x2, x3, x5 for, respectively, x(P3 − P2), x(P2), x(P3), x(P3 + P2). As before, M is birationally equivalent to the twisted Edwards curve au² + v² = 1 + du²v² with a = (A + 2)/B and d = (A − 2)/B. Let (u2, v2) and (u3, v3) be the points on the twisted Edwards curve equivalent to P2 and P3 respectively. Write (u5, v5) = (u3, v3) + (u2, v2) and (u1, v1) = (u3, v3) − (u2, v2). The dual addition law (4.2) says

v5 = (u3v3 − u2v2)/(u3v2 − u2v3) and v1 = (u3v3 + u2v2)/(u3v2 + u2v3)

except when (u3, v3) ∈ {(u2, v2), (−u2, −v2), (−u2, v2), (u2, −v2)}, i.e., except when u3² = u2². Assume for the moment that u3² ≠ u2²; the exceptional cases will be treated later. Recall that the maps for x and v between M and E are defined on all of k ∪ {∞}. Now

x1x5 = ((1 + v1)/(1 − v1))·((1 + v5)/(1 − v5))
= (((u3v2 + u2v3) + (u3v3 + u2v2))·((u3v2 − u2v3) + (u3v3 − u2v2)))/(((u3v2 + u2v3) − (u3v3 + u2v2))·((u3v2 − u2v3) − (u3v3 − u2v2)))
= ((u3v2 + u3v3)² − (u2v3 + u2v2)²)/((u3v2 − u3v3)² − (u2v3 − u2v2)²)
= ((u3² − u2²)(v3 + v2)²)/((u3² − u2²)(v3 − v2)²) = (v3 + v2)²/(v3 − v2)²
= ((x3 − 1)(x2 + 1) + (x2 − 1)(x3 + 1))²/((x3 − 1)(x2 + 1) − (x2 − 1)(x3 + 1))²
= (x2x3 − 1)²/(x2 − x3)².

Substitute 1/x1 = Z1/X1, x2 = X2/Z2, and x3 = X3/Z3 to see that x5 = X5/Z5 as claimed.

If (u3, v3) = (u2, v2) then (u1, v1) = (0, 1), corresponding to ∞ on M. If (u3, v3) = (−u2, −v2) then (u1, v1) = (0, −1), corresponding to (0, 0) on M. Both of these points on M are excluded by the hypothesis that x1 ∉ {0, ∞}.

If (u3, v3) = (−u2, v2) then (u5, v5) = (0, 1) so x5 = ∞. Also (u1, v1) = 2(u3, v3) so x1 = (1 + v1)/(1 − v1) = (a − d)v3²/((1 − v3²)(a − dv3²)) as in the alternate proof of Theorem 4.2, and x1 ≠ 0 by hypothesis, so v3 ∉ {0, ∞}, i.e., x3 ∉ {−1, 1}. To summarize, x2 = x3 (since v2 = v3) while x2x3 ≠ 1. Multiply by Z2Z3 to see that X2Z3 = X3Z2 while X2X3 ≠ Z2Z3, i.e., Z5 = 0 while X5 ≠ 0, so X5/Z5 = ∞ = x5 as claimed.

If (u3, v3) = (u2, −v2) then (u5, v5) = (0, −1) so x5 = 0. Also (u1, v1) = 2(u3, v3) + (0, −1) so 1/x1 = (1 − v1)/(1 + v1) = (a − d)v3²/((1 − v3²)(a − dv3²)). Now x1 ≠ ∞, so v3 ∉ {0, ∞}, so x3 ∉ {−1, 1}. To summarize, x2x3 = 1 while x2 ≠ x3. Hence X5 = 0 while Z5 ≠ 0, so X5/Z5 = 0 = x5 as claimed. □

4.6 The Montgomery Ladder

This section combines Montgomery's doubling formulas with Montgomery's differential-addition formulas to obtain one step of the Montgomery ladder, and then iterates these steps to obtain the full Montgomery ladder.

4.6.1 The Montgomery Ladder Step

Theorems 4.2 on page 98 and 4.4 on page 101 together compute x(2P2) = X4/Z4 and x(P3 + P2) = X5/Z5, given as input x(P2) = X2/Z2, x(P3) = X3/Z3, and x(P3 − P2) = X1/Z1, assuming X1 ≠ 0 and Z1 ≠ 0. The merged optimized formulas are shown in Figure 4.2, under the simplifying assumption Z1 = 1. In total there are 5M, 4S, 1C by (A − 2)/4, four additions, and four subtractions. This is cheaper than the total costs from Sections 4.4 and 4.5, for two reasons: first, a multiplication by Z1 has been eliminated; second, X2 + Z2 and X2 − Z2 are reused between the differential addition and the doubling.

The ladder in Section 4.2.1 on page 85 for computing Xn starting from X0 and X1 used one doubling and one differential addition per bit of n. We now combine the doubling and the differential addition into a single step to emphasize the benefits from combining. For fixed (X1, Z1) define step0(X2, Z2, X3, Z3) = (X4, Z4, X5, Z5) where

X4 = (X2² − Z2²)²,                 X5 = 4(X2X3 − Z2Z3)²Z1,
Z4 = 4X2Z2(X2² + AX2Z2 + Z2²),     Z5 = 4(X2Z3 − Z2X3)²X1,

and also define step1(X3, Z3, X2, Z2) = (X5, Z5, X4, Z4). The recursive definition of the Montgomery ladder in Section 4.1 on page 82 can be abbreviated as L_{2n} = step0(L_n) where L_n = (Xn, Zn, X_{n+1}, Z_{n+1}), and implies L_{2n+1} = step1(L_n), so in general L_n = step_{n mod 2}(L_{⌊n/2⌋}) for n ≥ 2.
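The combined step is easy to express in code. The following is a minimal Python sketch of one doubling-and-differential-addition step in the shape of Figure 4.2, assuming arithmetic modulo a prime p and the normalization Z1 = 1; the function name and temporaries are ours, not Montgomery's. The operation count matches the text: 5M, 4S, 1C by (A − 2)/4, four additions, four subtractions.

```python
def ladder_step(X2, Z2, X3, Z3, X1, a24, p):
    """One step0: input x(P2) = (X2:Z2), x(P3) = (X3:Z3), x(P3 - P2) = (X1:1),
    and a24 = (A - 2)/4 mod p; output x(2P2) = (X4:Z4), x(P3 + P2) = (X5:Z5)."""
    s = (X2 + Z2) % p                    # X2 + Z2, reused by both halves
    d = (X2 - Z2) % p                    # X2 - Z2, likewise
    ss, dd = s * s % p, d * d % p        # 2S
    e = (ss - dd) % p                    # e = 4 X2 Z2
    da = (X3 - Z3) * s % p               # 1M
    cb = (X3 + Z3) * d % p               # 1M
    X4 = ss * dd % p                     # 1M: (X2^2 - Z2^2)^2
    Z4 = e * ((ss + a24 * e) % p) % p    # 1M + 1C: 4 X2 Z2 (X2^2 + A X2 Z2 + Z2^2)
    X5 = (da + cb) % p; X5 = X5 * X5 % p              # 1S: 4 (X2 X3 - Z2 Z3)^2
    Z5 = (da - cb) % p; Z5 = X1 * (Z5 * Z5 % p) % p   # 1S + 1M, times X1
    return X4, Z4, X5, Z5
```

Note that ss + a24·e = X2² + AX2Z2 + Z2² since e = 4X2Z2, which is why a single multiplication by the constant (A − 2)/4 suffices.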


Figure 4.2 Montgomery’s optimized formulas for doubling and differential addition, assuming Z1 = 1.

4.6.2 Constant-Time Ladders

A typical problem in ECC is to compute a scalar multiple nP, where n is a secret element of {0, 1, . . . , 2^256 − 1}. Montgomery's original ladder is unsatisfactory in this context: anyone who observes the time taken by the ladder can deduce the position of the top bit set in n, since this position dictates the number of steps of the ladder. One fix is to always arrange for n to have a fixed top bit, for example by adding an appropriate multiple of the order of P. Another fix, which we use in Theorem 4.5 below, is to switch to a more general ladder in which the number of steps can be chosen separately from the position of the top bit set in n. It is important here that one can start the Montgomery ladder from 0P, 1P, rather than from 1P, 2P, and that applying a ladder step to 0P, 1P produces another valid representation of 0P, 1P.

In this context it is also important for each ladder step to involve a constant sequence of operations, without splitting into cases that depend on the secret bits inside n. Notice that step_b can be computed as cswap_b ∘ step0 ∘ cswap_b, where

cswap0(X2, Z2, X3, Z3) = (X2, Z2, X3, Z3),
cswap1(X2, Z2, X3, Z3) = (X3, Z3, X2, Z2).


One should compute cswap_b(X2, Z2, X3, Z3) as

(b(X3 − X2) + X2, b(Z3 − Z2) + Z2, (1 − b)(X3 − X2) + X2, (1 − b)(Z3 − Z2) + Z2),

or some equivalent constant-time arithmetic expression, rather than computing the two cases separately. A composition of two steps produces a cswap-step-cswap-cswap-step-cswap pattern. One can merge the adjacent swaps, defining b as the xor of the two bits. A many-step ladder then follows the pattern cswap-step-cswap-step-cswap-step-cswap etc.
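For concreteness, here is the arithmetic selection written out as a small Python sketch (a hedged illustration: a real implementation would perform the selection branch-free on fixed-width words, e.g., with an all-zeros or all-ones mask; Python itself gives no timing guarantees):

```python
def cswap(b, X2, Z2, X3, Z3):
    # b must be 0 or 1; the expressions select without branching
    return (b * (X3 - X2) + X2, b * (Z3 - Z2) + Z2,
            (1 - b) * (X3 - X2) + X2, (1 - b) * (Z3 - Z2) + Z2)

assert cswap(0, 1, 2, 3, 4) == (1, 2, 3, 4)   # b = 0: identity
assert cswap(1, 1, 2, 3, 4) == (3, 4, 1, 2)   # b = 1: swapped
```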

4.6.3 Completeness of the Ladder

The end of the Montgomery ladder divides Xn by Zn to obtain x(nP). Typically one computes Xn/Zn as Xn·Zn^(#k−2) when k is finite; but for Zn = 0 this computation outputs 0 rather than ∞. A further difficulty arises when the Montgomery ladder is allowed to receive 0 or ∞ as its input x(P) = X1/Z1; we excluded this case in Theorem 4.4 on page 101. The ladder then produces XnZn = 0 for each n, but it is not always true that x(nP) = Xn/Zn: it is possible to have Xn/Zn = ∞ while x(nP) = 0, and it is even possible to have (Xn, Zn) = (0, 0).

Define x0 : M(k) → k as follows: x0(x, y) = x; x0(∞) = 0. Using x0(nP) in place of x(nP) as a ladder output merges ∞ with 0, eliminating all of the above case distinctions. It is then harmless to also use x0(P) in place of x(P) as a ladder input, since inputs 0 and ∞ always produce the same outputs. The idea of using x0 in this context was introduced in [35].

Theorem 4.5 Fix a field k not of characteristic 2. Fix A, B ∈ k with B(A² − 4) ≠ 0. Define M as the Montgomery curve By² = x³ + Ax² + x. Define x0 : M(k) → k as follows: x0(x, y) = x; x0(∞) = 0. Let P be an element of M(k). Let X1, Z1 be elements of k such that Z1 ≠ 0 and x0(P) = X1/Z1. Let c be a nonnegative integer. Let n0, . . . , n_{c−1} be elements of {0, 1}. Define n = 2^(c−1)·n_{c−1} + 2^(c−2)·n_{c−2} + · · · + 2^0·n0. Define (X, Z, X′, Z′) = step_{n0} step_{n1} · · · step_{n_{c−1}}(1, 0, X1, Z1). If Z = 0 then x0(nP) = 0; otherwise x0(nP) = X/Z.

Proof The main case is that X1 ≠ 0. Then x(P) = x0(P) = X1/Z1. We will prove the following statement by induction on c: (X, Z) ≠ (0, 0); X/Z = x(nP); (X′, Z′) ≠ (0, 0); and X′/Z′ = x((n + 1)P). This implies the claim: if Z = 0 then x(nP) = X/0 = ∞ so x0(nP) = 0 as claimed; otherwise x(nP) ≠ ∞ so x0(nP) = x(nP) = X/Z as claimed.


If c = 0 then (X, Z, X′, Z′) = (1, 0, X1, Z1). Evidently (X, Z) = (1, 0) ≠ (0, 0); X/Z = ∞ = x(∞) = x(nP) since n = 0; (X′, Z′) = (X1, Z1) ≠ (0, 0); and X′/Z′ = X1/Z1 = x(P) = x((n + 1)P).

For c ≥ 1: Write (X2, Z2, X3, Z3) = step_{n1} · · · step_{n_{c−1}}(1, 0, X1, Z1). By the inductive hypothesis, (X2, Z2) ≠ (0, 0); X2/Z2 = x(mP) where m = 2^(c−2)·n_{c−1} + 2^(c−3)·n_{c−2} + · · · + 2^0·n1; (X3, Z3) ≠ (0, 0); and X3/Z3 = x((m + 1)P). Now (X, Z, X′, Z′) = step_{n0}(X2, Z2, X3, Z3). If n0 = 0 then (X, Z) ≠ (0, 0) and x(nP) = x(2mP) = X/Z by Theorem 4.2 on page 98; also (X′, Z′) ≠ (0, 0) and x((n + 1)P) = x((2m + 1)P) = X′/Z′ by Theorem 4.4 on page 101. Similar comments apply if n0 = 1. Either way the claimed statement holds, finishing the main case.

The remaining case is that X1 = 0. Then x0(P) = 0 so P = ∞ or P = (0, 0); in both cases nP ∈ {∞, (0, 0)} so x0(nP) = 0. It thus suffices to show that Z = 0 or X = 0. The initial step input (1, 0, X1, Z1) has the form (∗, 0, 0, ∗) since X1 = 0. Note that step0(∗, 0, 0, ∗) = (∗, 0, 0, ∗); step1(∗, 0, 0, ∗) = (0, ∗, ∗, 0); step0(0, ∗, ∗, 0) = (∗, 0, 0, ∗); and step1(0, ∗, ∗, 0) = (0, ∗, ∗, 0). By induction (X, Z, X′, Z′) has the form (∗, 0, 0, ∗) or (0, ∗, ∗, 0); so Z = 0 or X = 0 as claimed. □
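Putting Theorem 4.5, the step of Section 4.6.1, and the swaps of Section 4.6.2 together gives the following self-contained Python sketch of the whole ladder (our naming; the if-based swap here is for readability and would be replaced by the arithmetic swaps above in a constant-time implementation):

```python
def ladder_x0(n, c, x1, A, p):
    """Return x0(nP) for the point P with x0(P) = x1 on By^2 = x^3 + Ax^2 + x
    over GF(p), processing exactly c ladder steps as in Theorem 4.5."""
    a24 = (A - 2) * pow(4, p - 2, p) % p            # the constant (A - 2)/4
    X2, Z2, X3, Z3 = 1, 0, x1 % p, 1                # (1, 0, X1, Z1) with Z1 = 1
    for i in reversed(range(c)):                    # step_{n_{c-1}}, ..., step_{n_0}
        b = (n >> i) & 1
        if b: X2, Z2, X3, Z3 = X3, Z3, X2, Z2       # cswap_b (Section 4.6.2)
        s, d = (X2 + Z2) % p, (X2 - Z2) % p         # one step_0 ...
        ss, dd = s * s % p, d * d % p
        e = (ss - dd) % p
        da, cb = (X3 - Z3) * s % p, (X3 + Z3) * d % p
        X2, Z2 = ss * dd % p, e * (ss + a24 * e) % p
        X3, Z3 = (da + cb) ** 2 % p, x1 * (da - cb) ** 2 % p
        if b: X2, Z2, X3, Z3 = X3, Z3, X2, Z2       # ... and cswap_b again
    return 0 if Z2 == 0 else X2 * pow(Z2, p - 2, p) % p  # x0 merges infinity with 0

# The number of steps c is independent of the top bit of n: leading zero bits
# apply step_0 to a representation of 0P, 1P and leave the output unchanged.
p, A = 2**255 - 19, 486662                          # e.g., Curve25519 parameters
assert ladder_x0(77, 7, 9, A, p) == ladder_x0(77, 64, 9, A, p)
```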

4.7 A Two-Dimensional Ladder

The way that the Montgomery ladder computes x0(nP) for a particular target n ≥ 1 is by computing x0(n′P) for a sequence of scalars n′, namely all integers of the form ⌊n/2^i⌋ and ⌊n/2^i⌋ + 1. Each integer larger than 1 in the sequence is a sum of two smaller integers whose difference is 0 or 1; each use of difference 0 involves Montgomery's doubling formulas, and each use of difference 1 involves Montgomery's differential-addition formulas with difference P.

This section explains an analogous method to compute x0(mP + nQ) for nonnegative integers m, n, starting from x0(P), x0(Q), x0(P − Q). The method computes x0(m′P + n′Q) for each (m′, n′) in a "two-dimensional ladder" defined below. This ladder has the following features:

• It starts from (0, 0), (1, 0), (0, 1), and (1, −1).
• It has 3c additions if m and n fit into c bits.
• For each addition v + w, the difference v − w is either (0, 0) or (1, 0) or (0, 1) or (1, −1) or (1, 1). Consequently the only possible failure cases are P, Q, P − Q, P + Q colliding with (0, 0), ∞.
• c of the additions are doublings, i.e., have difference (0, 0). The doublings appear in a uniform pattern: add, double, add; add, double, add; etc.

Each doubling costs 4M with Montgomery's formulas. Here, for simplicity, we are taking C = 0, which is reasonable if (A − 2)/4 is small, and also taking S = M. Each differential addition costs 5M with Montgomery's formulas if x0(P), x0(Q), x0(P − Q), x0(P + Q) are each provided with denominator 1. The total cost of the chain here is 14cM: 4M for each of the c doublings and 5M for each of the 2c differential additions. For comparison, the Montgomery ladder costs 9cM for a single scalar. Handling two scalars thus increases costs by a factor significantly below 2. In Section 4.8 on page 111 we will see even faster double-scalar methods, although those methods no longer have a uniform pattern of additions and doublings.

It is easy to write down a similar chain using 19cM, handling each bit with a uniform double-add-add-add pattern. A 2000 algorithm by Schoenmakers, published in 2003 [317, section 3.2.3], costs on average 17.25cM, with a variable pattern of additions; an algorithm by Akishita [7] costs on average 14.25cM, again with a variable pattern of additions. The chain described here is slightly faster than Akishita's chain and has the advantage of a uniform add-double-add structure, analogous to the uniform double-add structure of the Montgomery ladder. This chain was introduced by Bernstein in 2006; see [36].

4.7.1 Introduction to the Two-Dimensional Ladder

Figure 4.3 is an example of the differential addition chain used here. Each line after the first has three of the four pairs (a, b), (a + 1, b), (a, b + 1), (a + 1, b + 1) for a unique (a, b). The missing element of (a + {0, 1}, b + {0, 1}) is always chosen as either (even, odd) or (odd, even), where the choice is related to the (A, B) for the next line:

• If (a + A, b + B) is (even, odd) then the choice is (odd, even).
• If (a + A, b + B) is (odd, even) then the choice is (even, odd).
• If (a + A, b + B) is (even, even) then the current and next lines have the same choices.
• If (a + A, b + B) is (odd, odd) then the current and next lines have opposite choices.

The pair (a, b) is also related to (A, B): it is simply (⌊A/2⌋, ⌊B/2⌋).

For comparison: The obvious way to build a two-dimensional ladder uses all four pairs (a, b), (a + 1, b), (a, b + 1), (a + 1, b + 1). The Schoenmakers chain [317, section 3.2.3] omits (a + 1, b + 1). Akishita's chain [7] omits (a + 1 − (A mod 2), b + 1 − (B mod 2)). The ladder presented here omits (a + (a + d + 1 mod 2), b + (b + d mod 2)) where d has a relatively complicated definition.


Figure 4.3 A uniform add-double-add two-dimensional ladder.

4.7.2 Recursive Definition of the Two-Dimensional Ladder

Define CD(A, B) recursively, for all nonnegative integers A and B and for all D ∈ {0, 1}, as Cd(a, b) followed by the three pairs

(A + (A + 1 mod 2), B + (B + 1 mod 2)),
(A + (A mod 2), B + (B mod 2)),
(A + (A + D mod 2), B + (B + D + 1 mod 2)),

where a = ⌊A/2⌋, b = ⌊B/2⌋, and

d = 0       if (a + A, b + B) mod 2 = (0, 1),
    1       if (a + A, b + B) mod 2 = (1, 0),
    D       if (a + A, b + B) mod 2 = (0, 0),
    1 − D   if (a + A, b + B) mod 2 = (1, 1).

Exception: CD(0, 0) is defined as (0, 0), (1, 0), (0, 1), (1, −1).


Here are the first few examples of this chain CD(A, B):

C0(0, 0) is (0, 0), (1, 0), (0, 1), (1, −1).
C1(0, 0) is (0, 0), (1, 0), (0, 1), (1, −1).
C0(1, 0) is (0, 0), (1, 0), (0, 1), (1, −1), (1, 1), (2, 0), (2, 1).
C1(1, 0) is (0, 0), (1, 0), (0, 1), (1, −1), (1, 1), (2, 0), (1, 0).
C0(0, 1) is (0, 0), (1, 0), (0, 1), (1, −1), (1, 1), (0, 2), (0, 1).
C1(0, 1) is (0, 0), (1, 0), (0, 1), (1, −1), (1, 1), (0, 2), (1, 2).
C0(1, 1) is (0, 0), (1, 0), (0, 1), (1, −1), (1, 1), (2, 2), (2, 1).
C1(1, 1) is (0, 0), (1, 0), (0, 1), (1, −1), (1, 1), (2, 2), (1, 2).

Note for future reference that CD(A, B) always contains the pair (1, 1) if (A, B) ≠ (0, 0). The rest of this section shows that CD(A, B) is a differential addition chain starting with (0, 0), (1, 0), (0, 1), (1, −1) and following a uniform add-double-add pattern with all differences in {(0, 0), (1, 0), (0, 1), (1, 1), (1, −1)}. One can easily force the chain to contain any desired pair (m, n) of nonnegative integers by choosing, e.g., (A, B) = (m, n) and D = m mod 2.
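The recursion is short enough to transcribe directly. Here is a hedged Python sketch (function and variable names are ours); the assertions reproduce two of the example chains listed above:

```python
def C(D, A, B):
    """The two-dimensional ladder C_D(A, B) as a list of pairs."""
    if (A, B) == (0, 0):
        return [(0, 0), (1, 0), (0, 1), (1, -1)]
    a, b = A // 2, B // 2
    parity = ((a + A) % 2, (b + B) % 2)
    d = {(0, 1): 0, (1, 0): 1, (0, 0): D, (1, 1): 1 - D}[parity]
    return C(d, a, b) + [
        (A + (A + 1) % 2, B + (B + 1) % 2),       # the odd-odd pair
        (A + A % 2, B + B % 2),                   # the even-even pair
        (A + (A + D) % 2, B + (B + D + 1) % 2),   # the other pair
    ]

assert C(0, 1, 0) == [(0,0), (1,0), (0,1), (1,-1), (1,1), (2,0), (2,1)]
assert C(1, 1, 1) == [(0,0), (1,0), (0,1), (1,-1), (1,1), (2,2), (1,2)]
```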

4.7.3 The Odd-Odd Pair in Each Line: First Addition

Assume that (A, B) ≠ (0, 0). The pair (A + (A + 1 mod 2), B + (B + 1 mod 2)) in CD(A, B) is equal to (2a + 1, 2b + 1) where (a, b) = (⌊A/2⌋, ⌊B/2⌋) as above. If (a, b) = (0, 0) then the pair is (1, 1), which can be obtained by adding (1, 0) to (0, 1) with difference (1, −1); so assume that (a, b) ≠ (0, 0). The chain already includes Cd(a, b), which contains three of the four pairs (a, b), (a + 1, b), (a, b + 1), (a + 1, b + 1). Consequently, (2a + 1, 2b + 1) can be obtained by adding (a + 1, b) to (a, b + 1) with difference (1, −1), or by adding (a + 1, b + 1) to (a, b) with difference (1, 1); recall that (1, 1) is also in Cd(a, b).

4.7.4 The Even-Even Pair in Each Line: Doubling

The next pair (A + (A mod 2), B + (B mod 2)) in the chain CD(A, B) is equal to (2a + 2(A mod 2), 2b + 2(B mod 2)). If (a, b) = (0, 0) then the pair is (2, 0) or (0, 2) or (2, 2), so it can be obtained by doubling (1, 0) or (0, 1) or (1, 1), all of which appear earlier in the chain; so assume that (a, b) ≠ (0, 0). The chain already contains, via Cd(a, b), all pairs (a + {0, 1}, b + {0, 1}) except (a + (a + d + 1 mod 2), b + (b + d mod 2)). If (a + (A mod 2), b + (B mod 2)) equals the missing pair then (a + A, b + B) mod 2 = (2a + d + 1, 2b + d) mod 2 = (1 − d, d); but if (a + A, b + B) mod 2 = (0, 1) then d is 0 by construction, and if (a + A, b + B) mod 2 = (1, 0) then d is 1 by construction, a contradiction in both cases. Thus (a + (A mod 2), b + (B mod 2)) is earlier in the chain, and doubling it produces the desired (A + (A mod 2), B + (B mod 2)).

4.7.5 The Other Pair in Each Line: Second Addition

If D = 0 then the pair (A + (A + D mod 2), B + (B + D + 1 mod 2)) is equal to (2a + 2(A mod 2), 2b + 1). We claim that this pair can be obtained by adding (a + (A mod 2), b + 1) and (a + (A mod 2), b), with difference (0, 1). If (a, b) = (0, 0) then (a + (A mod 2), b + 1) is either (0, 1) or (1, 1), both of which are already in the chain; and (a + (A mod 2), b) is either (0, 0) or (1, 0), both of which are already in the chain. So assume that (a, b) ≠ (0, 0). The chain already contains, via Cd(a, b), all pairs (a + {0, 1}, b + {0, 1}) except (a + (a + d + 1 mod 2), b + (b + d mod 2)). Suppose that the missing pair is equal to (a + (A mod 2), b + 1) or (a + (A mod 2), b). Then a + (a + d + 1 mod 2) = a + (A mod 2), so (a + A) mod 2 = (2a + d + 1) mod 2 = 1 − d. If (a + A, b + B) mod 2 = (0, 1) then d = 0 by construction, contradiction. If (a + A, b + B) mod 2 = (1, 0) then d = 1 by construction, contradiction. If (a + A, b + B) mod 2 = (0, 0) then d = D = 0 by construction, contradiction. If (a + A, b + B) mod 2 = (1, 1) then d = 1 − D = 1 by construction, contradiction.

Similarly, if D = 1, then the pair (A + (A + D mod 2), B + (B + D + 1 mod 2)) in CD(A, B) is equal to (2a + 1, 2b + 2(B mod 2)), which can be obtained by adding (a + 1, b + (B mod 2)) and (a, b + (B mod 2)), with difference (1, 0); the argument that both summands appear earlier in the chain is symmetric to the D = 0 case.

4.8 Larger Differences

Montgomery in [253] also introduced a more complicated method, called PRAC, to compute differential addition-subtraction chains. Recall that these are addition-subtraction chains where each sum computation n + m has n − m already in the chain and where each difference computation n − m has n + m already in the chain. A simple ladder, as in the Lucas ladder and the Montgomery ladder, uses 2 operations (1 differential addition and 1 doubling) for each bit of n; PRAC uses fewer than 1.6 operations for each bit. This section is an introduction to PRAC.

Most of the operations in PRAC are differential additions with large difference, and these are more expensive than doublings with Montgomery's formulas, but PRAC still does slightly better than 9M per bit. PRAC produces much larger speedups in the two-dimensional case discussed in Section 4.7 on page 107, reducing the cost of computing mP + nQ below 11M per bit. The complicated structure of the resulting chains seems to be incompatible with constant-time ECC computations, but is not a problem for ECM.

4.8.1 Examples of Large-Difference Chains

Let d, e be coprime integers with 0 ≤ d ≤ e. This section reviews several ways to construct a one-dimensional differential addition chain that starts from 0, 1 and that contains e − d and d and e. Euclid's chain is the simplest but generally longest; Tsuruoka's chain is the most complicated but generally shortest.

Euclid's chain E(d, e) is defined recursively as follows:

E(d, e) = 0, e             if d = 0,
          E(e − d, e)      if e/2 < d,
          E(d, e − d), e   otherwise.

For example, E(11, 97) = 0, 1, 2, 3, 5, 7, 9, 11, 20, 31, 42, 53, 64, 75, 86, 97. One can easily prove by induction on e that E(d, e) is a differential addition chain that starts from 0, 1 and that contains e − d and d and e. The point is that if e > 1 then e can be obtained by adding d and e − d, since the difference of d and e − d is earlier in the chain.

A more sophisticated differential addition chain S(d, e) is defined recursively as follows:

S(d, e) = 0, e                  if d = 0,
          S(e − d, e)           if e/2 < d,
          S(d, e/2), e − d, e   if 0 < d < e/4 and e ∈ 2Z,
          S(d, e − d), e        otherwise.

For example, S(11, 97) = 0, 1, 2, 3, 4, 5, 9, 10, 11, 21, 32, 43, 75, 86, 97. What's new in S(d, e), compared to E(d, e), is the S(d, e/2), e − d, e case. In this case, e is obtained by doubling e/2; d appears in S(d, e/2) by induction; and e − d is obtained by adding e/2 − d to e/2.
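Both recursions translate directly into code. The following hedged Python sketch (our function names) reproduces the two example chains just given:

```python
def E(d, e):
    """Euclid's differential addition chain containing e - d, d, and e."""
    if d == 0:
        return [0, e]
    if 2 * d > e:                    # the case e/2 < d
        return E(e - d, e)
    return E(d, e - d) + [e]

def S(d, e):
    if d == 0:
        return [0, e]
    if 2 * d > e:
        return S(e - d, e)
    if 4 * d < e and e % 2 == 0:     # the case 0 < d < e/4 and e even
        return S(d, e // 2) + [e - d, e]
    return S(d, e - d) + [e]

assert E(11, 97) == [0, 1, 2, 3, 5, 7, 9, 11, 20, 31, 42, 53, 64, 75, 86, 97]
assert S(11, 97) == [0, 1, 2, 3, 4, 5, 9, 10, 11, 21, 32, 43, 75, 86, 97]
```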

Bleichenbacher's differential addition chain B(d, e), introduced in [55, section 5.3] and republished without credit as the main result of [96], is defined recursively as follows:

B(d, e) = 0, e                        if d = 0,
          B(e − d, e)                 if e/2 < d,
          B(d, e/2), e − d, e         if 0 < d < e/5 and e ∈ 2Z,
          B(d, (e + d)/2), e − d, e   if 0 < d < e/5 and e ∉ 2Z and e + d ∈ 2Z,
          B(d/2, e − d/2), d, e       if 0 < d < e/5 and e ∉ 2Z and d ∈ 2Z,
          B(d, e − d), e              otherwise.

For example, B(11, 97) = 0, 1, 2, 3, 5, 6, 10, 11, 21, 32, 43, 54, 86, 97. Beware that there are two typographical errors in [55, section 5.3]: "x − y, y/2, z" should be "x − y/2, y/2, z" and "x/2, x − y, z" should be "x/2, y − x/2, z."

Tsuruoka's differential addition chain T(d, e), introduced in [328], is defined recursively as follows:

T(d, e) = 0, e                                                if d = 0,
          T(e − d, e)                                         if e < 2d,
          T(d, e/2), e − d, e                                 if 2d ≤ e ≤ 2.09d and e ∈ 2Z,
          T(d, e/2), e − d, e                                 if 3.92d ≤ e and e ∈ 2Z,
          T(d, (e + d)/3), (2e − d)/3, e − d, e               if not and 5.7d ≤ e and e + d ∈ 3Z,
          T(d, (e − d)/3), (e + 2d)/3, (2e − 2d)/3, e − d, e  if not and 4.9d ≤ e and e − d ∈ 3Z,
          T(d, (e + d)/2), e − d, e                           if not and 4.9d ≤ e and d + e ∈ 2Z,
          T(d, e/3), d + e/3, 2e/3, e − d, e                  if not and 6.8d ≤ e and e ∈ 3Z,
          T(d/2, e − d/2), d, e                               if not and 9d ≤ e and d ∈ 6Z,
          T(d, e − d), e                                      otherwise.

For example, T(11, 97) = 0, 1, 2, 3, 4, 7, 11, 14, 25, 36, 61, 86, 97.
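Of these, B(d, e) is still simple enough to transcribe compactly; the following hedged Python sketch (our naming) reproduces the example above. T(d, e) can be coded the same way, case by case.

```python
def B(d, e):
    """Bleichenbacher's differential addition chain containing e - d, d, e."""
    if d == 0:
        return [0, e]
    if 2 * d > e:                          # the case e/2 < d
        return B(e - d, e)
    if 5 * d < e:                          # the three 0 < d < e/5 cases
        if e % 2 == 0:
            return B(d, e // 2) + [e - d, e]
        if (e + d) % 2 == 0:
            return B(d, (e + d) // 2) + [e - d, e]
        if d % 2 == 0:
            return B(d // 2, e - d // 2) + [d, e]
    return B(d, e - d) + [e]

assert B(11, 97) == [0, 1, 2, 3, 5, 6, 10, 11, 21, 32, 43, 54, 86, 97]
```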


All of these chains were designed with the goal of minimizing the number of additions. Slightly different constructions should do better in other cost measures, such as the number of field multiplications in the elliptic-curve context.

4.8.2 CFRC, PRAC, etc.

There is a well-known "duality" between two-dimensional addition chains that contain the pair (d, e) and one-dimensional addition chains that contain both d and e. See, e.g., [317, section 2.2.2]. For example, one way to compute dP + eQ is to first compute P + Q and then compute d(P + Q) + (e − d)Q. This transformation reduces the problem of constructing an addition chain for (d, e) to the problem of constructing an addition chain for (d, e − d). This is the dual of the following reduction, which was used repeatedly above: to build an addition chain for d and e, first build an addition chain for d and e − d, and then compute e as the sum of d and e − d.

Duality does not exactly preserve costs for differential chains; see, e.g., [317, Example 3.28] and [317, section 3.4, final paragraph]. One can nevertheless see a large overlap between ideas for optimizing a two-dimensional chain for the pair (d, e) and ideas for optimizing a one-dimensional chain for d and e. In particular, Montgomery's "CFRC" chain in [253, section 5] is a simple construction of a two-dimensional chain for (d, e), comparable to Euclid's chain. Montgomery's "PRAC" chain in [253, section 7] is a more complicated construction, comparable to (and predating) Tsuruoka's chain. See [36] for an "extended-gcd" chain that, in experiments, produces slightly better results than PRAC.

4.8.3 Allowing d to Vary

The standard way to construct a one-dimensional differential addition chain for e is to choose some d coprime to e and use one of the above algorithms to construct a chain containing e − d, d, e. Here are three refinements:

• Choose d to be very close to 2e/(1 + √5). This guarantees that the top half of the bits of e will be handled with about 1.44 additions per bit; see, e.g., [317, Proposition 3.34]. For example, with e = 100, choosing d = 61 produces a chain ending 17, 22, 39, 61, 100.
• Try many d's and take the shortest chain for e. One can, for example, take a range of d's around 2e/(1 + √5), or around e/α for various constants α having continued fractions consisting of almost entirely 1's and a few 2's.
• If e has a known factor g, construct a chain for e by constructing a chain for e/g, multiplying it by g, and merging the result with a chain for g. This generally produces shorter chains (for a given amount of d-searching time) than handling e directly.

All of these improvements were suggested by Montgomery in [253, section 7], in the context of Montgomery's PRAC.

Here is a simple experiment to illustrate the importance of trying many d's. Consider each prime number e below 10^6. For each e, try several successive d's coprime to e, starting just above 2e/(1 + √5); find the shortest E(d, e), the shortest S(d, e), the shortest B(d, e), and the shortest T(d, e). Average the number of additions in these chains as e varies. The following table shows the resulting averages:

Number of d's      1       2       4       8       16      32      64      128
E((best d), e)   47.550  34.405  31.286  29.912  29.364  28.876  28.579  28.428
S((best d), e)   34.125  29.630  28.758  28.371  28.194  28.048  27.950  27.899
B((best d), e)   30.794  29.606  28.818  28.415  28.241  28.093  27.993  27.936
T((best d), e)   29.159  28.723  28.431  28.220  28.105  27.996  27.919  27.875

Beware that [328, section 4.2] uses only two d’s and reports 1.61 additions per bit, and [317, algorithm 3.33] uses only one d and reports 1.64 additions per bit, while taking more d’s easily reaches 1.56 additions per bit. The benefit of trying many chains, and keeping the shortest, was pointed out by Montgomery but does not seem to have been adequately emphasized in the literature. Note that, as shown by the crossover between the S and B rows in the above table, optimizing chains for many d’s is not the same as optimizing them for a single d.


5 General Purpose Integer Factoring

Arjen K. Lenstra

Abstract This chapter describes the developments since 1970 in general purpose integer factoring and highlights the contributions of Peter L. Montgomery.

5.1 Introduction

General purpose integer factoring refers to methods for integer factorization that do not take advantage of size-related properties of the unknown factors of the composite to be factored. Methods that do are special purpose methods, such as trial division, John Pollard's rho and p − 1 methods [279, 278] and variants [348, 17], and Hendrik Lenstra's elliptic curve method [239]. Some of these methods are discussed in other chapters. The subject of this chapter is general purpose integer factoring.

In 1970 a new general purpose integer factoring record was set with the factorization of the seventh Fermat number F7 = 2^(2^7) + 1, a number of 39 decimal digits [264]. It required about 90 minutes of computing time, accumulated over a period of seven weeks on an IBM 360/91 computer at the UCLA Campus Computing Network. In the late 1970s the 78-digit eighth Fermat number was almost factored by a general purpose method [303]. But its small, 16-digit factor was discovered in 1980 using Pollard's rho method [79], with general purpose factoring achieving 71 digits only in 1984. It leapt to 87 digits in 1987, a development that could for a large part be attributed to Peter Montgomery [88], requiring on the order of two months of computing time contributed by a modest local network in about a week of wall-clock time. It further jumped ahead to 100 digits in the fall of 1988, the first computation that harvested the Internet's computational resources [237].

Around that same time it became necessary to distinguish two different record-categories [234], a distinction that persists until the present day: records for general composites, such as cryptographic moduli [294], and records for composites with a special form, such as Fermat or Cunningham numbers [116, 82]. The first category record stands at 232 decimal digits, with the 2009 factorization of a 768-bit cryptographic modulus [209]; Montgomery's work was used in several steps of this calculation. It required about two thousand core years of computation, accumulated over a period of almost three years on a wide variety of computers world-wide. The current record calculation of the second category is the shared factorization of 2^m − 1 for m ∈ {1007, 1009, 1081, 1111, 1129, 1151, 1153, 1159, 1177, 1193, 1199}, which required about five thousand core years over a period of almost five years [104, 210].

In this chapter the various developments are described that contributed to the progress in general purpose integer factoring since 1970. At first this progress heavily relied on new algorithms, but this came mostly to a halt around 1992. Starting in the late 1980s the proliferation of (easily accessible) hardware, the end of which is still not in sight, played an increasingly important role. For a different perspective refer to [285].

Throughout this chapter n indicates the composite to be factored. It is assumed that n is odd and not a prime power. An integer k divides an integer m if m is an integer multiple of k; this is denoted by k|m.

5.2 General Purpose Factoring

At this point in time all general purpose factoring algorithms and their running time analyses work in more or less the same way, from a high-level point of view. The common framework is presented in this section.

General Idea Following Maurice Kraitchik's variation of Pierre de Fermat's method [224, 225], all general purpose factoring methods are congruence of squares methods. They all construct pairs of integers x, y such that x² ≡ y² mod n; from n | x² − y² = (x − y)(x + y) it follows that gcd(x − y, n) is a non-trivial factor of n if x ≢ ±y mod n. Although most congruence of squares methods are entirely, or to some extent, deterministic, it is reasonable to assume (for some methods this can be proved) that the resulting pairs are more or less random, under the condition that x² ≡ y² mod n. This implies that if n has r distinct prime factors, then x ≡ ±y mod n with probability 2/2^r = 2^(1−r). Thus, in practice a few pairs x, y will suffice to factor n.
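As a tiny illustration of this endgame, here is a congruence of squares for the classic toy composite n = 1649 = 17 · 97 (a hedged example of ours; how such x, y are found is the subject of the rest of this chapter):

```python
from math import gcd

n = 1649                                       # = 17 * 97
x, y = 114, 80                                 # 114^2 ≡ 80^2 ≡ 1453 (mod 1649)
assert (x * x - y * y) % n == 0                # x^2 ≡ y^2 (mod n) ...
assert (x - y) % n != 0 and (x + y) % n != 0   # ... yet x ≢ ±y (mod n)
print(gcd(x - y, n), gcd(x + y, n))            # prints: 17 97
```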

5.2.1 Two-Step Approach

In [264] Michael Morrison and John Brillhart proposed a systematic, two-step approach to construct pairs x, y as above (see also [345]). Let ω ∈ Z_{>0} be a small constant indicating the desired oversquareness (cf. below). (Refer to [311, 302] for squfof, an entirely different method to construct pairs x, y as above; it is based on binary quadratic forms and particularly efficient for small values of n.)

Step 1: Relation Collection In the first step a finite set P of integers is selected depending on n, and a set V of |P| + ω relations is sought, where a relation is a pair (v, e_v) with v ∈ Z and e_v = (e_{v,p})_{p∈P} ∈ Z^|P| such that

w = ∏_{p∈P} p^(e_{v,p})  where  w ≡ v² mod n.    (5.1)

The central distinguishing feature between the general purpose factoring algorithms described in this chapter is the method by which relations are obtained. For instance, and most elementary, it may simply involve selecting integers v and inspecting if w ≡ v² mod n allows factorization over P. The set of representatives for the integers w may be explicitly chosen, for instance as {0, 1, . . . , n − 1} or as {−⌊n/2⌋, . . . , −1, 0, 1, . . . , ⌊n/2⌋}, or may be implicit in the way w is generated during the search. The set P is commonly referred to as the factor base.

Step 2: Linear Algebra Since the oversquareness ω is strictly positive, the vectors e_v resulting from the first step are linearly dependent. In the second step linear algebra is used to determine dependencies modulo 2 among the vectors e_v, i.e., non-empty subsets S of V such that all entries of the vector ∑_{v∈S} e_v = (s_p)_{p∈P} are even. Each subset S gives rise to a pair

x = (∏_{v∈S} v) mod n,    y = (∏_{p∈P} p^(s_p/2)) mod n.

With Condition (5.1) it follows that x² ≡ y² mod n. At least ω independent subsets S can be determined, which should lead to at least ω independent chances of at least 50% to factor n.

Selection of P Morrison and Brillhart chose P as the set consisting of the element −1 along with the primes up to some bound B. All noteworthy general purpose factoring methods since then follow the same two-step approach and essentially make the same choice for P. The choice of B involves a trade-off: with a small B only a few relations are needed, but they may be hard to find, whereas for larger B more relations are required that should, however, be easier to find. The linear algebra effort does not involve a trade-off: it increases some way or another as B increases. When the overall effort is minimized, the optimal choice for B may depend on either of the two separate efforts or both.
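Continuing the toy example n = 1649: with factor base P = {2, 5}, the values v = 41 and v = 43 yield the relations 41² ≡ 32 = 2^5 and 43² ≡ 200 = 2^3 · 5^2 (mod n). Their exponent vectors (5, 0) and (3, 2) sum to the even vector (8, 2), so S may consist of both relations. The following hedged Python sketch carries out Step 2 on these two relations:

```python
from math import gcd

n, P = 1649, [2, 5]
relations = [(41, [5, 0]),               # 41^2 mod 1649 = 32  = 2^5
             (43, [3, 2])]               # 43^2 mod 1649 = 200 = 2^3 * 5^2
s = [sum(col) for col in zip(*(e for _, e in relations))]
assert all(c % 2 == 0 for c in s)        # dependency modulo 2: S = both relations
x = y = 1
for v, _ in relations:
    x = x * v % n                        # x = 41 * 43 mod n = 114
for p, c in zip(P, s):
    y = y * pow(p, c // 2, n) % n        # y = 2^4 * 5^1 = 80
print(gcd(x - y, n))                     # prints: 17
```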


Morrison and Brillhart relied on experience to decide on a value for B that would minimize the relation collection effort for the n-value at hand, knowing from their experiments that the linear algebra step would be relatively easy as long as enough memory is available. Richard Schroeppel was the first, during the mid 1970s, to analyse the trade-off and to derive an expression for the optimal asymptotic effort needed for the relation collection step [294, section IX.A] – disregarding the linear algebra step as he had the same experience that it would involve only a relatively minor effort. Schroeppel's analysis required insight into the probability to find a relation, for which he relied on the result presented below, even though back then it had not been fully proved yet. The first to fully (though still heuristically, cf. Section 5.2.3) analyse the known general purpose integer factoring algorithms was Carl Pomerance in [282].

5.2.2 Smoothness and L-notation

An integer is called B-smooth if all its prime factors are at most B. Smoothness probabilities and the resulting asymptotic estimates are commonly expressed using the generalized L-notation. The L-notation was introduced by Pomerance in [282, section 2] to (citing Pomerance) "streamline many arguments and to have the important magnitudes stand out," reasons that are still valid today. Following the generalization from [233, (3.16)], denote by L_x[r, ψ] any function of x that is

e^((ψ + o(1)) (log x)^r (log log x)^(1−r)),    for x → ∞,

where r, ψ ∈ R and 0 ≤ r ≤ 1 and where logarithms are natural. For fixed r, s, ψ, β ∈ R_{>0} with s < r ≤ 1, it follows from results by Nicolaas De Bruijn [118] and E. Rodney Canfield, Paul Erdös and Pomerance [86] (and citing [233]) that a random positive integer at most L_x[r, ψ] is L_x[s, β]-smooth with probability L_x[r − s, −ψ(r − s)/β], for x → ∞.

With ψ > 0, the expression L_x[0, ψ] is polynomial in log x, whereas L_x[1, ψ] is exponential in log x; for 0 < r < 1, the expression L_x[r, ψ] is called subexponential in log x. To illustrate the quote from Pomerance, if r, s, ψ, β ∈ R are fixed with 0 ≤ s < r ≤ 1 and ψ > 0, then L_x[r, ψ]L_x[s, β] = L_x[r, ψ]: the factor L_x[s, β] disappears in the o(1) in L_x[r, ψ]. This includes the case s = 0 where L_x[0, β] = (log x)^(β+o(1)): factors that are fixed rational polynomials in log x disappear in L_x[r, ψ] if r, ψ > 0. Thus, if B = L_x[r, ψ] for r, ψ > 0, then the number π(B) of primes at most B equals B. Note also that L_x[r, ψ]L_x[r, β] = L_x[r, ψ + β] and L_x[r, ψ] + L_x[r, β] = L_x[r, max(ψ, β)].

Below all efforts expressed in terms of L_n are expected and asymptotic for n → ∞ and, unless noted otherwise, heuristic. In the remainder of this chapter L is used for L_n.


5.2.3 Generic Analysis

Let S(x, B) denote the average effort to inspect a random positive integer at most x for B-smoothness, and let M(m) denote the effort to find dependencies modulo 2 among the columns of an integer m × (m + ω)-matrix. Let L[s, β] for s, β ∈ R_{>0} be the upper bound for the primes in P. For each general purpose factoring method that has been published, there are r, ψ ∈ R_{>0} such that the absolute value of each number that is inspected for smoothness (and that leads to a relation as in (5.1) on page 118 if it is L[s, β]-smooth) is bounded above by L[r, ψ]: for the number field sieve r = 2/3 and for all earlier methods r = 1, as set forth in the sections below. Assuming that these absolute values behave as random positive integers at most L[r, ψ], the overall factoring effort of almost all general purpose factoring methods can be expressed as

(π(L[s, β]) + ω) · L[r − s, ψ(r − s)/β] · S(L[r, ψ], L[s, β]) + M(π(L[s, β]));    (5.2)

here the first factor is the number of relations to be collected, the second is the inverse of the smoothness probability, the third is the average effort to inspect an integer for smoothness, and the final term is the effort for linear algebra. See Section 5.4.1 on page 129 for the exception. For one of the general purpose methods presented below it can be proved that the numbers to be tested for smoothness behave, with respect to smoothness properties, as random positive integers at most L[r, ψ]. For that method, Expression (5.2) is the expected asymptotic factoring effort, for n → ∞. For all other methods the smoothness assumption is a heuristic assumption that has, so far, been supported by empirical evidence. For those methods, Expression (5.2) is the heuristic expected asymptotic factoring effort, for n → ∞.

Optimal choices for s, ψ and β depend on the general purpose factoring method used and on the smoothness testing and linear algebra methods used, and are derived in the sections below. But a few general observations can be made that simplify Expression (5.2). It follows from r > 0 and the factor L[r − s, ψ(r − s)/β] in the first term of Expression (5.2) that the optimal s must be strictly positive. With β > 0 this implies that π(L[s, β]) + ω = L[s, β] and, again with Expression (5.2), that s = r/2 is optimal. As a result, the overall factoring effort becomes

L[r/2, β + ψr/(2β)] · S(L[r, ψ], L[r/2, β]) + M(L[r/2, β]).    (5.3)

Further optimization depends on the general purpose factoring method under consideration, but also on the rules of the game that one decides to play: for both S(x, B) and M(m) either the historically correct methods (i.e., methods


available at the time the general purpose factoring method was developed) or the current best methods (i.e., as may have been developed later and thus possibly anachronistic) can be used, which may give rise to different outcomes. This is further discussed below.

5.2.4 Smoothness Testing

Trial Division consists of testing a non-zero integer w for B-smoothness by testing if p divides w for all primes p ≤ B, while replacing w by w/p and repeating the test for p if indeed p|w. If w = 1 at the end of this process, then the original w is B-smooth and may lead to a relation as in (5.1) on page 118. If 1 < w < B² after trial division, then the original w is B-smooth except for one large prime, i.e., a prime factor larger than B but less than B²; such w-values may lead to a large prime relation, as further discussed in the sections below. The trial division effort for w is O(B log w), implying that S(L[r, ψ], L[r/2, β]) would become L[r/2, β]. Thus, if trial division is used as smoothness test, the first term of Expression (5.3) becomes L[r/2, β + ψr/(2β)] · L[r/2, β] = L[r/2, 2β + ψr/(2β)]. Similarly, using Pollard's rho method [279] would lead to S(L[r, ψ], L[r/2, β]) = L[r/2, β/2].

The Elliptic Curve Method of Factorization may, heuristically, be expected to find a factor at most B of a positive integer w at effort O((log w)² · L_B[1/2, √2]) [239] (see also [233, 4.3]). It follows that S(L[r, ψ], L[r/2, β]) would become L[r/2, 0]. This implies that, if the elliptic curve method is used as smoothness test, the first term of Expression (5.3) becomes L[r/2, β + ψr/(2β)]. As shown by Pomerance in [283], the heuristic arguments can be removed from the analysis, resulting in a slightly modified elliptic curve-based smoothness test that works with high probability (see also Section 5.6 on page 160). Although with elliptic curve-based smoothness testing the S(L[r, ψ], L[r/2, β])-contribution to Expression (5.3) conveniently vanishes in the o(1), in practice its contribution would be considerable, in particular compared to sieving as discussed in the next paragraphs – if sieving can be applied.

Sieving amortizes the smoothness testing effort over all values that have to be tested for smoothness, and achieves the same effect on Expression (5.3) as elliptic curve-based smoothness testing. In practice, sieving is much preferred over using elliptic curves, but compared to the latter it has the disadvantage that it can not be applied in all circumstances: the set of values to be tested must be the set of values of a polynomial evaluated over sufficiently many integers in an arithmetic progression (such as a sufficiently large interval of consecutive integers).
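The trial-division test, including the detection of a single large prime, is a few lines of code; a hedged Python sketch (our naming):

```python
def trial_divide(w, primes, B):
    """Return (exponents, cofactor) for w over the factor base of primes <= B.
    cofactor == 1: w is B-smooth; B < cofactor < B*B: cofactor is one large
    prime, so w may still yield a large prime relation."""
    exponents = {}
    for p in primes:
        while w % p == 0:
            w //= p
            exponents[p] = exponents.get(p, 0) + 1
    return exponents, w

e, c = trial_divide(2 * 2 * 2 * 5 * 5 * 23, [2, 3, 5, 7], 7)
print(e, c)   # {2: 3, 5: 2} 23 -- smooth apart from the large prime 23 < 7*7
```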


Let I be an interval of consecutive integers such that the length |I| of I is at least L[r/2, β] and let d be a small positive integer (d ≤ 2 until Section 5.5 on page 137). Assume that there exists a degree d polynomial f(X) ∈ Z[X] such that {f(i) : i ∈ I} is the set of values that has to be tested for L[r/2, β]-smoothness (more general arithmetic progressions are treated in a similar fashion). Sieving exploits the fact that if p divides f(z) for some z ∈ Z, then p divides f(z + kp) for all k ∈ Z. Thus, once all roots of f modulo p in {0, 1, . . . , p − 1} have been determined, all roots of f modulo p in I are found at additional effort at most d|I|/p. The latter roots correspond to the subset of polynomial-values that are divisible by p. Root finding modulo a prime p takes (probabilistic) effort (log p)^t for some small constant t. Using |I| ≥ L[r/2, β], the overall effort of finding all prime divisors at most L[r/2, β] of the values in {f(i) : i ∈ I} to be tested for L[r/2, β]-smoothness is

∑_{p ≤ L[r/2, β]} ((log p)^t + d|I|/p) = |I|    (5.4)

(where it is used that ∑_{p ≤ L[r/2, β]} 1/p ≈ log log(L[r/2, β])). In the context of Expression (5.3) on page 120, the interval length |I| and thus the overall effort over all |I| values to be tested for smoothness would be L[r/2, β + ψr/(2β)]. As a result, the average smoothness testing effort S(L[r, ψ], L[r/2, β]) becomes L[r/2, 0], namely the quotient of the number |I| of values to be tested for smoothness and the overall smoothness testing effort |I|, so that the first term of Expression (5.3) simplifies to L[r/2, β + ψr/(2β)].

In practice, sieving typically consists of adding a rough approximation of log_b p, for some base b ∈ R_{>0}, to all sieve locations z + kp ∈ I for k ∈ Z, after a root z of f modulo p has been computed. After doing this for all (prime, root) pairs, the locations ℓ ∈ I at which a total value has been accumulated that is close to a rough estimate of log_b f(ℓ) are inspected more closely by attempting to factor f(ℓ) over P. Large prime relations can also easily be recognized when more locations (with smaller but still sufficiently large values) are inspected. The base b is chosen in such a way that a single byte per sieve location suffices to represent an approximation to log_b f(ℓ) for all ℓ ∈ I. Montgomery proposed to choose the initial values of the sieve locations so that a final non-negative value indicates that the location needs to be inspected for actual smoothness, because a four- or eight-byte mask can then be used to check four or eight sieve locations at a time for non-negativity. He also proposed to put a non-negative value right after the last location to be inspected, so that it suffices to check the termination condition for ℓ ∈ I only at locations containing non-negative values.
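A toy version of such a log-sieve, in the style just described but with none of the byte-level tricks, might look as follows (a hedged sketch of ours: f(i) = (m + i)² − n with m = ⌊√n⌋ + 1 is a quadratic-sieve-style choice, and the brute-force root finding stands in for proper modular square roots):

```python
from math import isqrt, log

def sieve_report(n, primes, length):
    """Return sieve locations i whose accumulated logs suggest f(i) is smooth."""
    m = isqrt(n) + 1
    f = [(m + i) ** 2 - n for i in range(length)]
    acc = [0.0] * length
    for p in primes:
        pk = p
        while pk <= f[-1]:                      # also sieve prime powers
            for r in range(pk):                 # brute-force roots of f mod pk
                if (m + r) ** 2 % pk == n % pk:
                    for j in range(r, length, pk):
                        acc[j] += log(p)
            pk *= p
    return [i for i in range(length) if acc[i] >= log(f[i]) - 0.5]

print(sieve_report(1649, [2, 5, 7], 20))   # prints: [0, 2, 6, 16, 17]
# e.g. f(0) = 32, f(2) = 200, f(6) = 560 are {2, 5, 7}-smooth; only these
# few locations would be inspected further by trial division over P.
```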


5.2.5 Finding Dependencies

For all general purpose factoring methods the matrices are sparse, i.e., the number of non-zero entries per column is at most of order log(L[r, ψ]). Regular Gaussian elimination hardly profits from the sparseness: usually, the sparsity-advantage is no longer noticeable after about a fifth of the pivots have been processed. As a result, with regular Gaussian elimination the term M(L[r/2, β]) in Expression (5.3) on page 120 becomes L[r/2, 3β]. If pivots are selected using structured Gaussian elimination [226, 286], the sparse original m × (m + ω)-matrix M over Z/2Z can often easily be reduced to a dense m′ × (m′ + ω)-matrix M′ over Z/2Z with m′ ≈ m/3, in such a way that dependencies among the columns of M′ (for instance determined using regular Gaussian elimination) lead to dependencies among the columns of M. Although this combined approach does not change the term M(L[r/2, β]) = L[r/2, 3β] in Expression (5.3), it was of great practical importance until the mid 1990s: compared to regular Gaussian elimination, it not just reduced the effort by a factor of approximately 3³, it also reduced the storage requirement of m² bits by a factor of about 3².

Volker Strassen's method [319] (applied to M or to M′) reduces M(L[r/2, β]) in Expression (5.3) to L[r/2, (log2 7)β]; with the latest variants of the method by Don Coppersmith and Shmuel Winograd [108, 228, 9] it would become about L[r/2, 2.373β]. Block versions of methods by Cornelius Lanczos [107, 270, 257] or Douglas Wiedemann [347, 106] (see [233, 2.19] for a high-level description) profit much more effectively from the sparseness of M, because for both methods the effort is dominated by a sequence of O(m) multiplications of the matrix M by a vector. Both methods find dependencies modulo 2 among the columns of M in O(mW(M)) bit operations, where the weight W(M) of M is the number of non-zero entries of M. It follows that the term M(L[r/2, β]) in Expression (5.3) can be simplified to L[r/2, 2β]. Storage requirements are limited to storage of the original sparse matrix M and an m-dimensional vector over (Z/2Z)^k, where the constant k is the blocking factor used. Refer to Chapter 7 on block Lanczos in this volume for more information on these methods.

Montgomery contributed not just to block Lanczos, but also did a lot of work on a preprocessing step that is commonly used and that is generally referred to as filtering. The main ideas of this preprocessing step are described in the next section.

5.2.6 Filtering

Let the notation be as in the previous section. Filtering refers to a collection of methods that aim to transform the m × (m + ω)-matrix M into an m′ × (m′ + ω′)-matrix M′ for which m′W(M′) is smaller than mW(M) and such that dependencies modulo 2 among the columns of M′ easily lead to dependencies modulo 2 among the columns of M. The methods of the previous section can then profitably be applied to M′ instead of M: in practice a speedup of one or more orders of magnitude may be expected. Background and more details about the material presented here can be found in [89, 90, 286]. Moduli different from 2, as required for the application in Section 5.5.1 on page 139, are handled in a slightly different but similar manner. Below, M refers to the m × (m + ω)-matrix in transition, with changing values for m and ω, until M = M′, m = m′, and ω = ω′ for the final matrix M′.

Filtering proceeds by first removing duplicates of columns that correspond to identical relations (cf. Section 5.2.1 on page 117), next alternately removing singleton columns and sets of columns that are referred to as cliques, and finally combining the remaining columns in a merge step. These four steps are further described below. Note that, to avoid useless dependencies, duplicates must be removed irrespective of attempts to lower mW(M).

Removing Duplicates In practice duplicate relations turn out to be unavoidable: lattice sieving with many distinct special q-primes will produce identical relations (cf. Section 5.5.2 on page 145), prematurely stopped relation collection jobs may have been restarted, or different relation collection methods may be used for the same factorization (cf. [92]). Assuming canonical representations of relations, a few piped Unix commands remove duplicates at minimal human effort. The storage resources and time required may, however, become substantial. It is common to apply a hash function to each canonical representation, and to locate and further inspect the collisions. Appropriate hash functions are easily designed depending on the application. Refer to [89, section 2.1] for an example of a hash function proposed by Montgomery that is injective for the relations as generated for the factorization reported in [92] (so that collisions correspond to duplicate relations).

Removing Singletons If there is a row in M that contains only a single entry that is non-zero modulo 2 (or another applicable modulus), then the column containing that non-zero entry can not occur in a dependency. Such columns, called singletons, can be removed from M. This is easily done using a frequency table, but because each removal may generate one or more new singletons, several passes are normally required before all singletons have been removed. For large collections of relations each singleton-removal pass is quite time consuming, with a quick drop in the number of removals during the later passes. Continuing until the very end may therefore not be worth the effort.
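The singleton-removal loop is simple to express; a hedged Python sketch of ours, representing each relation by its set of primes with odd exponent:

```python
from collections import Counter

def remove_singletons(relations):
    """Repeatedly drop relations containing a prime that occurs in only one
    relation; each pass can create new singletons, hence the outer loop."""
    rels = list(relations)
    while True:
        freq = Counter(p for r in rels for p in r)
        kept = [r for r in rels if all(freq[p] > 1 for p in r)]
        if len(kept) == len(rels):
            return kept
        rels = kept

print(remove_singletons([{2, 5}, {2, 7}, {5, 7}, {7, 11}, {11, 13}]))
# [{2, 5}, {2, 7}, {5, 7}]: 13 makes the last relation a singleton; removing
# it leaves 11 as a new singleton, so {7, 11} is removed in the next pass.
```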


Removing Cliques To have a better chance to get a low mW(M)-value many more relations are collected than necessary to make the original matrix over-square (even until the original ω is much larger than m). As mentioned in [286] (and as for instance used in the structured Gaussian elimination step of the factorization reported in [236]) the simplest approach is to remove the excess columns in (decreasing) order of their number of odd entries, until the excess is deemed small enough. But [286] also suggests another approach, which is further pursued in [89, 90] following ideas of Montgomery. It has become the most common way to remove the excess columns until ω is reduced to approximately m/2. The method is, in filtering context, referred to as clique removal, notwithstanding the non-standard definition of cliques. Consider the graph with vertex set corresponding to the set of relations, with an edge connecting two vertices if there is a row in M such that the two corresponding relations are the only two relations that share a non-zero entry in that row. The components of this graph are called cliques in [89]. It follows that removal from M of a single relation in a clique triggers a chain of singletons in the same clique, so that the clique can be removed in its entirety. Fast recognition – and removal – of cliques is therefore an efficient way to deal with large amounts of interdependent excess columns, while also lowering the number of rows.

With (e_{v,p})_{p∈P} denoting a column in M (i.e., a relation), and given a frequency table containing, for each row in M, the total number o_p of odd entries in the row for p, it is easy to compute the value ∑_{p∈P: e_{v,p} odd} 2^(−o_p) for each column. Cliques may now be removed by removing the columns for which the computed value is at least 1/4. Because this may remove too many other columns as well, initially a cut-off larger than 1/4 may be used, gradually lowering it until a targeted excess remains. The value will be at least 1/2 for newly created singletons, so they have a good chance to be removed during a next round of clique removal with an updated frequency table.

Merging If there is a p ∈ P for which o_p = 2, it is advantageous to replace the two columns in which p occurs an odd number of times by their sum, because as a result m decreases by at least one and W(M) decreases by at least two. This is called a two-way merge. After the two-way merges have been carried out, the process can be repeated for m-way merges, for increasing values of m, where an m-way merge replaces the m columns sharing a particular p-value with o_p = m by m − 1 independent, pair-wise sums among those m columns. The least weight-increasing set of m − 1 pairs is easily determined as a minimal spanning tree in the complete graph induced by the m columns. The overall effect for larger m-values may, however, become counterproductive. Merges for larger m-values are therefore followed by removal of the heaviest columns, as long as the oversquareness ω remains large enough.

As a result of the merging step, the final matrix M′ is written as the product M′ = M̂T of a pre-merger matrix M̂ and a merging-transformation matrix T. Because in the preferred methods to find dependencies the same final matrix M′ is repeatedly multiplied by a (changing) vector (cf. Section 5.2.5 on page 123), it is advantageous to use this representation M̂T of M′ if W(M̂) + W(T) < W(M′); this approach was first used in [210].

5.2.7 Overall Effort With these insights (which may be anachronistic, depending on the context), Expression (5.3) on page 120 becomes L[ 2r , max(β +

ψr , 2β )]. 2β

(5.5)

Optimization of Expression (5.5) still depends on the applicable r- and ψ-values, as further explained below. In more generality, the overall effort can be expressed as

L[r/2, max((1 + σ)β + ψr/(2β), μβ)]    (5.6)

with σ representing the smoothness testing effort and μ the linear algebra exponent. The value for σ ranges from 0 (as in Expression (5.5)) for elliptic curve-based smoothness testing and for sieving (if applicable), to 1 for trial division (with σ = 1/2 for Pollard’s rho method). The linear algebra exponent μ ranges from 2 (as in Expression (5.5)) for the methods by Lanczos and Wiedemann, to 3 for Gaussian elimination (with μ = log₂ 7 for Strassen’s method and μ ≈ 2.373 for the Coppersmith–Winograd method). An overview of the results as of 1983 (some of which may have been improved since then due to more efficient auxiliary steps) is given in [282, table on page 93], the most important ones of which are also presented below.
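As a numerical sanity check of Expression (5.6) (a sketch; the grid search and its bounds are arbitrary choices of mine, not part of the analysis), the exponent max((1 + σ)β + ψr/(2β), μβ) can be minimized over β:

def effort_exponent(beta, r=1, psi=1, sigma=1, mu=3):
    # The second argument of L[r/2, .] in Expression (5.6).
    return max((1 + sigma) * beta + psi * r / (2 * beta), mu * beta)

def best_beta(step=1e-4, **params):
    # Naive grid search over 0 < beta <= 3 for the minimizing beta.
    return min((i * step for i in range(1, int(3 / step) + 1)),
               key=lambda b: effort_exponent(b, **params))

# Dixon's method (sigma = 1, mu = 3, r = psi = 1): beta ~ 1/2 and exponent 2,
# i.e., L[1/2, 2], matching the analysis in Section 5.3.1 below.
b = best_beta()
print(round(b, 3), round(effort_exponent(b), 3))   # 0.5 2.0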

5.3 Presieving General Purpose Factoring

Let the notation be as in Section 5.2 on page 117, with P the set of primes up to L[r/2, β] for r, β > 0 to be specified below.

5.3.1 Dixon’s Random Squares Method

Not the earliest but conceptually the most straightforward general purpose factoring method that requires subexponential effort is John Dixon’s random squares method [128]. It has never been proved to be practical, because by the time it was proposed more practical methods already existed. The random squares method selects at random integers v ∈ {1, 2, . . . , n − 1} that have not been selected before, computes v^2 mod n = w ∈ {0, 1, . . . , n − 1}, assumes that w ≠ 0 (because n can directly be factored if w = 0), uses trial division to write w = w̃ ∏_{p∈P} p^{e_{v,p}} with w̃ ∈ Z free of factors in the set of primes P, and if w̃ = 1 adjoins (v, (e_{v,p})_{p∈P}) to the set of relations. Once enough relations have been found, it uses Gaussian elimination to find dependencies. Expression (5.6) with σ = 1 and μ = 3 applies to Dixon’s random squares method. The numbers w that are tested for smoothness can be bounded by n = L[1, 1], so that r = ψ = 1 and the overall effort becomes

L[1/2, max(2β + 1/(2β), 3β)],

which is minimized for β = 1/2 and becomes L[1/2, 2]. The effort of the matrix step is L[1/2, 3/2], which is dominated by the relation collection effort L[1/2, 2]. Values v for which v < √n are useless, but as v < √n with probability n^{−1/2} = L[1, −1/2] this is unlikely to occur because only L[1/2, 3/2] values will be selected. With faster smoothness testing and linear algebra methods (both anachronistic), the overall effort becomes (cf. Expression (5.5) with r = ψ = 1)

L[1/2, max(β + 1/(2β), 2β)],

which is minimized for β = (1/2)√2 and becomes L[1/2, √2]. In this analysis both steps require the same effort in L-notation, i.e., disregarding everything that disappears in the o(1)-terms. Using least absolute remainders for w (and adjoining −1 to P) does not change L[1/2, 2] or L[1/2, √2], but should make the method a bit faster in practice. Both these efforts are larger than the efforts required by the factoring methods considered in the remainder of this chapter. However, for Dixon’s random squares method the analysis does not involve heuristics. See also Section 5.6 on page 160.
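A minimal sketch of the relation-collection loop of Dixon’s random squares method (trial division only; the helper names are mine, and real implementations would add the least-absolute-remainder and large-prime refinements mentioned above):

import math
import random

def trial_divide(w, primes):
    # Return the exponent vector of w over `primes`, or None if w is not smooth.
    exponents = [0] * len(primes)
    for idx, p in enumerate(primes):
        while w % p == 0:
            w //= p
            exponents[idx] += 1
    return exponents if w == 1 else None

def dixon_relations(n, primes, count):
    # Collect `count` relations (v, exponent vector) with v^2 = prod p^e mod n.
    relations, seen = [], set()
    while len(relations) < count:
        v = random.randrange(math.isqrt(n) + 1, n)   # v < sqrt(n) is useless
        if v in seen:
            continue
        seen.add(v)
        w = v * v % n
        if w == 0:
            continue        # would factor n directly; ignored in this sketch
        e = trial_divide(w, primes)
        if e is not None:
            relations.append((v, e))
    return relations

# Dependencies among the exponent vectors modulo 2 then yield congruences
# x^2 = y^2 mod n, from which gcd(x - y, n) may split n (not shown here).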

5.3.2 Continued Fraction Method

The continued fraction method (often referred to as CFRAC) developed by Morrison and Brillhart [264] represented the state of the art in general purpose integer factoring from 1970, when F7 = 2^{2^7} + 1 was factored, until the mid 1970s. A special purpose hardware device (the Georgia Cracker) was built implementing it [288]. The continued fraction method was used to factor many Cunningham numbers [82]. It inspired the development of faster general purpose factoring methods, as further described in Section 5.4 below. From the above analysis of Dixon’s random squares method it follows that there are two main issues that would have to be addressed in order to get a more efficient factoring method: the speed of the smoothness test and the size of the integers w to be tested for smoothness. The second issue had already been dealt with in the continued fraction method, several years before Dixon proposed his random squares method. Although generating the w-values is more cumbersome in the continued fraction method than in Dixon’s random squares method, this disadvantage is far outweighed by their much smaller size and thus substantially larger smoothness probability. As explained in detail in [264], and as follows from for instance [181, Theorem 164], the continued fraction expansion of √n leads to a sequence of triples (v_i, t_i, w_i) ∈ Z^3 for i = 0, 1, 2, . . . such that

v_i^2 − n t_i^2 = (−1)^i w_i    where 0 < w_i < 2√n.    (5.7)

It follows that for those i for which w_i is found to be smooth, the value v_i along with the vector of exponents of the factorization of w_i (including the sign) leads to a relation. With r = 1 and ψ = 1/2 (as 0 < w_i < 2√n = L[1, 1/2]), trial division (σ = 1), and Gaussian elimination (μ = 3), the overall effort from Expression (5.6) on page 126 becomes

L[1/2, max(2β + 1/(4β), 3β)],

which is minimized for β = (1/4)√2 and becomes L[1/2, √2]. The effort L[1/2, (3/4)√2] of the matrix step is again dominated by the effort of relation collection, in accordance with Morrison’s and Brillhart’s practical experience. This asymptotic result was first, and informally, derived in the mid 1970s by Schroeppel: informal because the effort of the matrix step was not included in his argument; because the smoothness result used (as stated in Section 5.2.2 on page 119) had by then not been fully proved yet; because the w_i-values are chosen deterministically and can hardly be argued to behave as randomly selected positive integers at most L[1, 1/2]; and finally because only primes p with Legendre symbol (n/p) ∈ {0, 1} can occur in the factorizations of the w_i-values, thus requiring the later and more refined argument from [309, theorem 74]. With σ = 0 and μ = 2 as in Expression (5.5) on page 126 (anachronistic, because the required methods did not exist yet in 1970), and the customary heuristic handwaving, the effort is reduced to

L[1/2, max(β + 1/(4β), 2β)],


which is minimized for β = 1/2 and becomes L[1/2, 1] (with balanced efforts for the two steps). Morrison and Brillhart describe how, depending on n, it is often advantageous to replace n by kn for a small positive multiplier k ∈ Z, in order to boost the smoothness probabilities by aiming for more small primes with (kn/p) = 1 than with (n/p) = 1, or even to use several k-values (most likely leading to more primes that may occur). They also suggest allowing in the factorizations of the w_i-values a large prime less than the square of the smoothness bound. As mentioned in Section 5.2 on page 117 such large prime relations can be recognized at no additional effort during trial division. A pair of large prime relations with the same large prime is easily transformed into a single regular relation (with, however, on average more non-zero entries in the exponent-vector and thus a less sparse matrix).
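The triples of Equation (5.7) can be generated with the standard continued fraction recurrences for √n; the following sketch (variable names mine) yields, for each i, the residue v_i mod n together with the sign (−1)^i and w_i, so that v_i^2 ≡ (−1)^i w_i mod n:

import math

def cfrac_relations(n, limit):
    # Yield (v, sign, w) with v^2 = sign * w (mod n) and 0 < w < 2*sqrt(n),
    # from the continued fraction expansion of sqrt(n) (n not a square).
    a0 = math.isqrt(n)
    m, d, a = 0, 1, a0            # surd recurrences for sqrt(n)
    h_prev, h = 1, a0 % n         # convergent numerators, reduced modulo n
    for i in range(1, limit + 1):
        m = d * a - m
        d = (n - m * m) // d      # exact division; this d is w_i
        a = (a0 + m) // d
        yield h, (-1) ** i, d     # h = v_i mod n, d = w_i
        h_prev, h = h, (a * h + h_prev) % n

# Each smooth w (trial divided over the factor base, sign included) gives a
# relation; only primes p with (n/p) in {0, 1} can divide the w-values.
for v, s, w in cfrac_relations(1234577, 5):
    assert (v * v - s * w) % 1234577 == 0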

5.4 Linear and Quadratic Sieve

5.4.1 Linear Sieve

Schroeppel found a way to replace trial division by sieving, as introduced in Section 5.2.4 on page 121, while keeping ψ almost as small as in the continued fraction method, namely ψ = 1/2 + o(1). Despite a promising start the practical potential of his linear sieve was never conclusively shown: according to [303] its first attempted factorization – that of the eighth Fermat number F8 = 2^{2^8} + 1 in 1980, and a tour de force at that time – was cut short during the first stage of the linear algebra step, because the factorization of F8 was independently announced by others (and later reported in [79]). The linear sieve work on F8 remains unpublished till the present day and was, at the time, only known to those who had been so fortunate to attend the single talk that Schroeppel ever gave about his linear sieve [303]. In the linear sieve the values tested for smoothness are

(i + [√n])(j + [√n]) − n = ij + (i + j)[√n] + [√n]^2 − n    (5.8)

for i, j ∈ Z of relatively small absolute value and with, say, |i| ≥ |j|. Values as in Expression (5.8) have two advantages, and lead to one complication. The first advantage is that they are easier to generate than the w_i-values in Expression (5.7) as used in the continued fraction method, while having a comparable smoothness probability: because [√n]^2 − n is of order √n, each value in Expression (5.8) is only of order |i + j|√n if |i| and |j| are relatively small. More precisely, if |i| and |j| are bounded by L[u_i, γ_i] and L[u_j, γ_j], respectively,


for some u_i, u_j < 1, then

|ij + (i + j)[√n] + [√n]^2 − n| ≤ L[u_i, γ_i]L[u_j, γ_j] + (L[u_i, γ_i] + L[u_j, γ_j])L[1, 1/2] + L[1, 1/2] = L[1, 1/2].

Thus, when expressed in L-notation the values generated by Expression (5.8) are of the same order L[1, 1/2] as the w_i-values in the continued fraction method, and the smoothness disadvantage of |i + j|√n compared to 2√n in the continued fraction method disappears in the o(1). It follows that r = 1 and ψ = 1/2. The second advantage is that for fixed j Expression (5.8) is a linear polynomial in i. This implies that smoothness testing can be done using sieving. From the sieving analysis in Section 5.2.4 on page 121 it follows that as long as the length L[u_i, γ_i] of the interval of i-values is at least as large as the smoothness bound L[1/2, β], the sieving effort for a fixed j equals L[u_i, γ_i]. An overall sieving effort of L[u_i, γ_i]L[u_j, γ_j] then follows. A complication arises from the fact that a smooth w = (i + [√n])(j + [√n]) − n generated by Expression (5.8) leads to

w = ∏_{p∈P} p^{e_{i,j,p}}    where w ≡ (i + [√n])(j + [√n]) mod n,    (5.9)

which does not conform to Condition (5.1) on page 118. This is easily fixed by taking i = j, an idea that was discarded by Schroeppel for reasons set forth below [303]. Schroeppel fixed it in another manner, namely by adjoining to P the (i + [√n])- and (j + [√n])-values with |i| ≤ L[u_i, γ_i] and |j| ≤ L[u_j, γ_j]. With e_{i,j,i+[√n]} = e_{i,j,j+[√n]} = −1, this turns (5.9) into

w/((i + [√n])(j + [√n])) = ∏_{p∈P} p^{e_{i,j,p}}    where w/((i + [√n])(j + [√n])) ≡ 1 mod n,

which is of the required form. As a consequence, however, the cardinality of P, and thus the number of relations to be found, increases from L[1/2, β] to L[1/2, β] + max(L[u_i, γ_i], L[u_j, γ_j]). To find these relations over a search space of L[u_i, γ_i]L[u_j, γ_j] elements, it must be the case that

L[u_i, γ_i]L[u_j, γ_j] ≥ (L[1/2, β] + max(L[u_i, γ_i], L[u_j, γ_j])) L[1/2, 1/(4β)]    (5.10)

which is of the required form. As a consequence, however, the cardinality of P, and thus the number of relations to be found, increases from L[ 12 , β] to L[ 12 , β] + max(L[ui , γi ], L[u j , γ j ]). To find these relations over a search space of L[ui , γi ]L[u j , γ j ] elements, it must be the case that   1 L[ui , γi ]L[u j , γ j ] ≥ L[ 12 , β] + max(L[ui , γi ], L[u j , γ j ]) L[ 12 , 4β ] (5.10) because, as shown above, the values to be tested for smoothness are of order L[1, 12 ] and are thus heuristically assumed to be L[ 12 , β]-smooth with probabil1 ]. It follows that the optimal ui and u j satisfy max(ui , u j ) = 12 . If ity L[ 12 , − 4β 1 1 ui = u j then it must be the case that γi ≥ γi + 4β or that γ j ≥ γ j + 4β (because 1 of Condition (5.10)), which is impossible. Thus ui = u j = 2 , simplifying


Condition (5.10) to

L[1/2, γ_i + γ_j] ≥ L[1/2, max(β, γ_i, γ_j) + 1/(4β)]

and thus

γ_i + γ_j ≥ max(β, γ_i, γ_j) + 1/(4β).    (5.11)

The relation effort is bounded from below by L[u_i, γ_i]L[u_j, γ_j] = L[1/2, γ_i + γ_j] (as argued above), attaining this lower bound if γ_i ≥ β, and the linear algebra effort is L[1/2, μ max(β, γ_i, γ_j)]. Because β + 1/(4β) ≥ 1 (reaching its minimal value 1 for β = 1/2) it follows from Condition (5.11) that γ_i + γ_j ≥ 1 and thus that max(γ_i, γ_j) ≥ 1/2, so that the efforts are bounded from below by L[1/2, 1] and L[1/2, μ/2], respectively. The minima are achieved for β = γ_i = γ_j = 1/2, which is optimal. It is impossible to lower the overall effort (thus, if μ > 2, the effort of the linear algebra step) by balancing the two efforts involved: for β = 1/2 this is obvious, and if β ≠ 1/2, then β + 1/(4β) > 1 and thus max(γ_i, γ_j) > 1/2 and the linear algebra effort becomes larger than L[1/2, μ/2]. With (not anachronistic) μ = log₂ 7 due to Strassen’s method, the overall effort of Schroeppel’s linear sieve narrowly beats the continued fraction method’s L[1/2, √2] because (1/2) log₂ 7 ≈ 1.404 < 1.414 ≈ √2 (see also [282, table on page 93]).

The resulting optimal (but heuristic, expected and asymptotic) relation collection effort L[1/2, 1] is the factoring effort that was cited in [294, section IX.A], neglecting the dominating term L[1/2, μ/2] for the linear algebra step. At the time this was somewhat optimistic but also understandable because experience with the continued fraction method had shown that the linear algebra effort was consistently negligible compared to the relation collection effort. For the purposes of [294], the optimism was later justified by the development of faster linear algebra methods (with μ = 2), and then turned out to be too pessimistic due to the number field sieve.

Variant with i = j As mentioned above, Schroeppel considered taking i = j but rejected this idea, even though (5.9) with i = j would directly conform to Condition (5.1) on page 118 without adjoining the (i + [√n])- and (j + [√n])-values to P (while also effectively reducing the size of P by a factor of two, as shown below). Schroeppel argued that, with i and j independently bounded by L[1/2, 1/2], Expression (5.8) on page 129 generates a total of L[1/2, 1] values that are all bounded by L[1/2, 1/2]√n in absolute value [303]. To generate the same number of values with i = j, the bound on i becomes L[1/2, 1], resulting in a bound of L[1/2, 1]√n on the absolute values generated by Expression (5.8) on page 129. In L-notation, L[1/2, 1/2]√n and L[1/2, 1]√n are both equal to L[1, 1/2], and both lead to smoothness probability L[1/2, −1/(4β)]. But in practice the choice i = j leads to noticeably lower smoothness probabilities. The latter effect was perceived to be worse than having to generate about twice as many relations, because it would result in an overall slowdown of the relation collection step. The more cumbersome linear algebra step that Schroeppel had to deal with by allowing i ≠ j was considered to be a mere nuisance because, so far, the matrix effort had been negligible compared to the relation collection effort. New developments, however, and to some extent Schroeppel’s own analysis and experience, proved him wrong, because it turned out that even with i = j the i-values can be kept small, as further shown in Section 5.4.3 on page 133 and Section 5.4.4 on page 134. Schroeppel also reported [303] that he initially rejected the use of a multiplier as had been used in the continued fraction method, but later reconsidered, and that he allowed two large primes per relation, about a decade before that was independently done in [238].

5.4.2 Quadratic Sieving: Plain

Unfazed by the issue pointed out by Schroeppel, Pomerance proposed using Schroeppel’s linear sieve with i = j. He called it quadratic sieve because, similar to Schroeppel’s linear sieve, it uses a sieve to locate smooth values of the quadratic polynomial

(i + [√n])^2 − n = i^2 + 2i[√n] + [√n]^2 − n    (5.12)

(cf. Section 5.2.4 on page 121). Pomerance’s description [282] is the first paper containing careful and accessible explanations and thorough analyses of general purpose factoring methods and their variants, setting an example for later publications and turning the subject into a more serious scientific endeavor. Initial results obtained by the quadratic sieve were not stellar, with [160] reporting a 47-digit factorization; this may be compared to Schroeppel’s 78-digit linear sieve effort that was aborted during the linear algebra step, and which had, at the time, garnered little or no attention. It took several additional contributions – notably by Jim Davis, Diane Holdridge, and Gus Simmons, by Montgomery, and by Pomerance, Jeffrey Smith, and Randy Tuler – to turn the quadratic sieve into the state of the art in general purpose integer factoring, a position it held until 1994. These developments are described below. An advantage of quadratic sieve over linear sieve is the simplified analysis and, for μ > 2, its better overall heuristic asymptotic effort. Because i = j, the generic analysis from Section 5.2.3 on page 120 applies with r = 1, ψ = 1/2, and σ = 0 in Expression (5.6) on page 126. More precisely, redoing the linear sieve


effort analysis with i = j, the original cardinality L[1/2, β] of P and an i-interval of length L[u, γ] for some u and γ with 0 < u < 1 and γ > 0, Condition (5.10) on page 130 simplifies to

L[u, γ] ≥ L[1/2, β + 1/(4β)]    (5.13)

because the values to be tested for smoothness, in absolute value bounded by L[u, γ]L[1, 1/2] = L[1, 1/2] (since u < 1), are heuristically assumed to be L[1/2, β]-smooth with probability L[1/2, −1/(4β)]. It follows from Condition (5.13) that L[u, γ] ≥ L[1/2, β] so that, with the sieving analysis from Section 5.2.4 on page 121, the relation collection effort becomes L[u, γ]. Minimizing the sum of the relation collection effort L[u, γ] and the linear algebra effort L[1/2, μβ] under Condition (5.13) first leads to u = 1/2 and then, with γ = β + 1/(4β), to overall effort L[1/2, max(β + 1/(4β), μβ)] (as indeed in Expression (5.6) with r = 1, ψ = 1/2, and σ = 0), which depends on μ. For μ = 3 it results in β = (1/4)√2 and L[1/2, (3/4)√2] = L[1/2, 1.061] and for μ = log₂ 7 in β ≈ 0.372 and L[1/2, 1.044]. For μ = 2 it results in β = 1/2 and reaches its minimal value L[1/2, 1]. In all cases the efforts of the two steps are balanced.

If p divides (i + [√n])^2 − n then n ≡ (i + [√n])^2 mod p, so that n is a square modulo p. It follows that only primes p with Legendre symbol (n/p) = 1 can occur in the factorizations of values generated by Expression (5.12), as in the continued fraction method. Following Morrison and Brillhart, the use of a suitable multiplier is therefore recommended. Also, the condition (n/p) = 1 effectively halves the size of P, making the quadratic sieve linear algebra step in practice yet again easier compared to linear sieve. Because Expression (5.12) is a quadratic polynomial in i, finding the roots modulo the primes in P is more cumbersome than for the linear polynomials in the linear sieve; this issue is further discussed on page 136 in Section 5.4.4. The growth of the polynomial values behaves according to Schroeppel’s prediction and has a noticeably counterproductive effect compared to linear sieve. In the remainder of this section it is shown how this problem was overcome. When expressed in the L-notation, all variants presented below require the same effort: the speedups, though practically worthwhile, all disappear in the o(1).
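A sketch of one log-sieving pass for the plain quadratic sieve (the threshold rule and parameter choices below are arbitrary simplifications of mine; real sieves use scaled integer logarithms and many further refinements):

import math
from sympy.ntheory import sqrt_mod   # Tonelli-Shanks square roots modulo p

def qs_sieve_candidates(n, factor_base, interval):
    # Log-sieve f(i) = (i + m)^2 - n with m = isqrt(n) over 1 <= i < interval,
    # and report the i whose accumulated logs suggest f(i) is smooth over the
    # (sorted) factor base; candidates still need to be trial divided.
    m = math.isqrt(n)
    logs = [0.0] * interval
    for p in factor_base:
        r = sqrt_mod(n, p)               # None unless n is a square mod p
        if r is None:
            continue
        for root in {r % p, (-r) % p}:   # the two roots of x^2 = n mod p
            for i in range((root - m) % p, interval, p):
                logs[i] += math.log(p)
    slack = 2 * math.log(factor_base[-1])   # roughly tolerate one large prime
    return [i for i in range(1, interval)
            if logs[i] >= math.log((i + m) ** 2 - n) - slack]

candidates = qs_sieve_candidates(4112783, [2, 3, 5, 7, 11, 13, 17, 19, 23], 5000)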

5.4.3 Quadratic Sieving: Fancy

Davis, Holdridge and Simmons in [117] were the first who managed to avoid a single large sieving interval and the resulting growth of the values to be tested for smoothness. Their method, referred to by the authors as quadratic sieving: fancy, proved to be more effective than the plain quadratic sieve as


used in [160]. In 1984 it was used to set a 71-digit factorization record: on a Cray X-MP mainframe computer the relation collection took 8.75 hours, followed by 45 minutes for the linear algebra. Assume that as a result of regular sieving over i ∈ I with the polynomial in Expression (5.12) on page 132 a number of large prime relations (cf. Section 5.2.4 on page 121) has been found, each involving a single prime larger than the smoothness bound, but smaller than its square. For each such large prime relation, corresponding to an equation of the form

(i_q + [√n])^2 − n = q ∏_{p∈P} p^{e_{i_q,p}}    (5.14)

involving a large prime q and an integer i_q with i_q ∈ I, Davis, Holdridge, and Simmons use a sieve over i ∈ I′ to find smooth values of the quadratic integer polynomial

((qi + i_q + [√n])^2 − n)/q = qi^2 + 2i(i_q + [√n]) + ((i_q + [√n])^2 − n)/q.    (5.15)

Each new smooth value thus found corresponds to a new large prime relation involving the large prime q, and can be combined with large prime relation (5.14) to produce a regular relation (which is, however, less sparse). If I′ ⊆ I, the values of the polynomial in Expression (5.15) may be assumed to have smoothness probabilities comparable to or better than the values encountered during the sieve using the polynomial in Expression (5.12). The advantage compared to the plain quadratic sieve is that a new sieve can be used for each large prime relation found as a result of the sieve using Expression (5.12). In particular, both I and I′ can be chosen considerably smaller than the single large sieving interval used in the plain quadratic sieve. In [117] it is reported that with judicious choices for I and I′ composites of approximately 64 digits could be factored at the same effort as approximately 56-digit ones using the original method.
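The integrality of Expression (5.15) is easy to check in code; the sketch below (toy numbers chosen purely for illustration) derives the coefficients of the quotient polynomial from a pair (q, i_q) satisfying (5.14) and verifies that its values agree with ((qi + i_q + [√n])^2 − n)/q:

import math

def fancy_polynomial(n, q, iq):
    # Coefficients (a2, a1, a0) of Expression (5.15), as integers, given a
    # large prime q dividing (iq + isqrt(n))**2 - n.
    m = math.isqrt(n)
    w = (iq + m) ** 2 - n
    assert w % q == 0, "(5.14) must hold for the pair (q, iq)"
    return q, 2 * (iq + m), w // q      # q*i^2 + 2*(iq + m)*i + w/q

n, q, iq = 4112783, 6089, 4             # toy values: (iq + 2027)^2 - n = 2*q
a2, a1, a0 = fancy_polynomial(n, q, iq)
i = 17                                  # any i; the two evaluations agree
assert a2 * i * i + a1 * i + a0 == ((q * i + iq + math.isqrt(n)) ** 2 - n) // q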

5.4.4 Multiple Polynomial Quadratic Sieve

In the mid 1980s, and independently of Davis, Holdridge, and Simmons, Montgomery invented another way to keep the polynomial values in quadratic sieve relatively small. His method, now known as the multiple polynomial quadratic sieve, was published in [315] and quickly became the general purpose factoring method of choice. It allows straightforward embarrassingly parallel implementation, making it perfectly suitable to use the idle time of the networks of desktop computers that were emerging around that time. Indeed, in [315] the multiple polynomial quadratic sieve was used to factor an 81-digit composite


on a local network, the first general purpose factorization surpassing Schroeppel’s aborted 78-digit F8-attempt, later reaching 87 digits as further described by Thomas Caron and Robert Silverman in [88]. This was quickly and independently followed by the first scientific distributed Internet computation that the author is aware of, reaching for the first time a 100-digit general purpose factorization, as described by the author and Mark Manasse in [237]. The independent, non-networked but nevertheless widely parallelized implementation described in [8] trailed this development by a few unfortunate months; it was particularly challenging because it required daily, campus-wide floppy-disk collection and distribution [284]. Probably the most prominent result obtained using the multiple polynomial quadratic sieve was the 1994 factorization of the 129-digit challenge published in the August 1977 issue of Scientific American, solved in [15] by Derek Atkins, Michael Graff, the author, and Paul Leyland (cf. [15, 229]). It used the software from [237] and did not take advantage of the somewhat faster self-initializing method described at the end of this section, because improving the software was not found to be worth the effort: around that time the much more promising method from Section 5.5.3 on page 152 was about to become practical and competitive with the quadratic sieve (refer to [161, section 5] for a direct comparison). Indeed, in [15] the 129-digit quadratic sieve factorization was already referred to as “probably the last gasp of an elderly workhorse.” Two other old workhorses that were used for the last time for a record factorization were the combination of structured and regular Gaussian elimination (cf. Section 5.2.5 on page 123) and the 16 384-core massively parallel MasPar supercomputer: half a day on a desktop to reduce the original sparse bit-matrix M with m ≈ 525 000 to a dense matrix M′ with m ≈ 188 000, followed by two days of regular Gaussian elimination on the MasPar to find dependencies among the columns of M′.

Montgomery showed how to construct a virtually limitless supply of integer polynomials f as in Expression (5.12) on page 132 and Expression (5.15) by focusing on their two crucial properties. The first of these is that they have a non-zero discriminant that is zero modulo n: this ensures that a smooth value leads to a relation as in Condition (5.1) on page 118. The second is that, for any arbitrarily selected γ ≥ β, the polynomial values must be bounded by L[1/2, γ]√n over the sieving interval of length L[1/2, γ]. This guarantees not just the usual L[1/2, β]-smoothness probability L[1/2, −1/(4β)] but also makes it possible to sieve with many polynomials over short sieving intervals: in theory L[1/2, β + 1/(4β) − γ] polynomials, each over an interval of length L[1/2, γ]. In Expressions (5.12) and (5.15) this is achieved for polynomials of a specific


form and specific γ-values, but there are many degrees of freedom that can be exploited, as shown below. Following Montgomery’s construction as described in [315] and [233, 4.16], consider values f(i) of the quadratic integer polynomial

f(X) = a^2 X^2 + bX + c    (5.16)

for integers i with |i| ≤ L[1/2, γ], with integers a, b, c such that the discriminant Δ = b^2 − 4a^2 c of f is a small odd multiple of n. It follows that

f(i) ≡ (ai + b/(2a))^2 mod n,

so that each L[1/2, β]-smooth f(i) leads to a relation as in Condition (5.1) on page 118. To bound f(i) by L[1/2, γ]√n for |i| ≤ L[1/2, γ], the leading coefficient a^2 of f must be of order √n/L[1/2, γ]. Furthermore, to maximize the probability that f(i) is divisible by primes at most L[1/2, γ], the leading coefficient a^2 must be free of prime factors at most L[1/2, γ]. These two conditions are satisfied if a is chosen as a prime number such that a^2 ≈ √Δ/L[1/2, γ]. To make sure that a solution to b^2 ≡ Δ mod 4a^2 is easy to find as well, a is chosen such that a ≡ 3 mod 4 and with Legendre symbol (Δ/a) equal to one: it follows that b̃ = Δ^{(a+1)/4} mod a satisfies b̃^2 ≡ Δ mod a, after which b̃ is lifted to b with b^2 ≡ Δ mod 4a^2 by first solving (b̃ + ka)^2 ≡ Δ mod a^2 for k, which leads to

k = ((Δ − b̃^2)/a) ((2b̃)^{−1} mod a) mod a

(known as Hensel’s lemma), and defining b = b̃ + ka or b = b̃ + ka − a^2, depending on which of the two is the proper, odd choice. The resulting b and c = (b^2 − Δ)/(4a^2) are of order √Δ/L[1/2, γ] and L[1/2, γ]√Δ, respectively, as required. The number of suitable a-values, and thus of suitable polynomials f(X) as in (5.16), is of order n^{1/4+o(1)}. It is therefore a simple matter to parallelize the sieving effort for the multiple polynomial quadratic sieve: for any n-value for which it is worthwhile to parallelize the factoring effort, disjoint intervals containing an adequate supply of a-values can be farmed out to any realistic number of sieving clients, with the resulting relations collected at a central location. This is how [315, 88] and, independently but later and on a larger scale, [237] worked.

Further Improvements Selecting the best value for γ involves a trade-off because smaller γ-values result in higher smoothness probabilities (of, on average, smaller f(i)-values), but also in more frequent sieve initialization, i.e., computing the roots of the polynomial f(X) modulo all primes p ≤ L[1/2, β] with (n/p) = 1. This requires the relatively costly calculation of a^{−1} modulo all


those primes. As shown in [287], Montgomery’s construction of a and b allows a further generalization where a single a, chosen as the product of κ distinct primes from a collection of (much smaller) primes, gives rise to 2^{κ−1} distinct b-values. In this so-called self-initializing quadratic sieve, due to Pomerance, Smith, and Tuler, the 2^{κ−1} polynomials resulting from each of the choices for a can be ordered in such a way that the roots modulo p for the current polynomial lead with a few additions modulo p to the roots of the next polynomial. In this way the costly inversions can essentially be amortized over 2^{κ−1} polynomials, leading to a speedup of about a factor of two over the multiple polynomial quadratic sieve [101]. Refer to [287], [8] and [276] for details and to [208] for a recent improvement. An additional speedup of a similar order of magnitude can be obtained by allowing more than a single large prime per relation, as shown for the multiple polynomial quadratic sieve in [238] (re-inventing what Schroeppel had already used for linear sieve, but had never published) and in [241].

Though relevant and of some interest when they occurred, with hindsight all developments since the continued fraction method sketched above were only rather modest improvements of its basic idea. Probably the biggest single contributions were Schroeppel’s informal first analysis of the smoothness bound trade-off and his introduction of sieving, followed by Pomerance’s influential more formal treatment of the subject in [282]. The constant c in the factoring effort estimate L[1/2, c] slowly decreased over time, but got stuck at c = 1: as further shown below, (failed) attempts to further reduce c were not sufficiently ambitious by targeting the wrong constant in L[1/2, 1]. An example is the cubic sieve algorithm, a never realized extension of the approach from [107] (see also [233, section 4.E]). As follows from Expression (5.6) on page 126, no general purpose factoring method that is based on the two-step approach from Section 5.2.1 on page 117 can improve on L[1/2, c] for positive c as long as the numbers to be tested for smoothness are of order L[1, ψ] = n^{ψ+o(1)} for positive ψ. It took a new idea (or, actually, a sequence of new ideas) to replace this constant power n^{ψ+o(1)} of n by a vanishingly small power of n: more precisely, by L[2/3, ψ] = n^{o(1)}, which then results in overall effort L[1/3, c] for some positive c (cf. Expression (5.6)). This is further explained in the next section.
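Montgomery’s construction of b described above is short enough to sketch in code; the function below (naming is mine, and a real implementation works with a multi-precision Δ and a prime a of the required size) uses a ≡ 3 mod 4 to obtain a square root of Δ modulo a as Δ^{(a+1)/4} mod a, followed by the Hensel lift to modulus a^2 and the odd choice of representative:

from sympy import isprime

def montgomery_b(delta, a):
    # Given the discriminant delta (an odd multiple of n, assumed = 1 mod 4)
    # and a prime a = 3 mod 4 with (delta/a) = 1, return an odd b with
    # b^2 = delta mod 4*a^2, following the construction in the text.
    assert isprime(a) and a % 4 == 3
    bt = pow(delta, (a + 1) // 4, a)                        # bt^2 = delta mod a
    assert (bt * bt - delta) % a == 0, "delta is not a QR mod a"
    k = ((delta - bt * bt) // a) * pow(2 * bt, -1, a) % a   # Hensel step
    b = bt + k * a                                          # b^2 = delta mod a^2
    if b % 2 == 0:
        b -= a * a                                          # pick the odd lift
    assert (b * b - delta) % (4 * a * a) == 0
    return b

b = montgomery_b(17, 19)   # toy values: b = 215, and 215^2 - 17 = 32 * (4 * 19^2)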

5.5 Number Field Sieve

While the polishing efforts described in the previous section were underway, an independent development took place that started as a cottage industry


Figure 5.1 Tidmarsh Cottage, the birthplace of the number field sieve.

(cf. Figure 5.1) but that quickly took center stage. Triggered by the factorizations of the seventh Fermat number F7 in 1970 and the eighth Fermat number F8 in 1980 (cf. Sections 5.3.2 on page 127 and 5.4.1 on page 129), and rightly concluding from [237] that the ninth Fermat number F9 would be out of reach of general purpose factoring methods for the foreseeable future unless a breakthrough would occur, Pollard designed, in 1988, a new factorization method specifically targeted at Fermat numbers. After using it to factor F7 on his 8-bit Philips P2012 computer (with 64K random access memory and two 640K disk drives), he sent a description of his method (later published as [280]) to Andrew Odlyzko, accompanied by a letter, dated August 31, 1988, with Richard Brent, John Brillhart, Hendrik Lenstra, Claus Schnorr, and Hiromi Suyama in copy:

    For a 40-digit number the time is perhaps a little longer than QS on my computer. With larger numbers, for those able to attempt them, it may have an advantage over QS.

    ... (Perhaps I am talking nonsense?).

    ... If F9 is still unfactored, then it might be a candidate for this kind of method eventually? I would be grateful for any comments.


Pollard was, of course, known for not talking nonsense, but Odlyzko did not take the bait. Lenstra, however, did. This led not only to the factorization of F9 in 1990 – to Pollard’s great numerological relief – but more importantly to the development of the number field sieve integer factoring method, the current state of the art in general purpose integer factorization. As sketched below, Montgomery later played an active role in the auxiliary steps that turned the number field sieve into a practical factoring method. Pollard’s original method, factoring with cubic integers as described in [280], applied only to integers of a special form. It led to the factorization method in [235] which was called the number field sieve (cf. [4]) and which was more general than Pollard’s method because it could use quartic, quintic, etc. instead of just cubic integers, but which still only applied to composites of a special form. This restriction was removed in [84], at which point the original number field sieve became known as the special number field sieve, and the new method from [84] as the general number field sieve. At this point in time, the “general” is dropped most of the time. In this section the various historical developments before and after Pollard’s method from [280] and as collected in [234] are described.

5.5.1 Earlier Methods to Compute Discrete Logarithms

Compared to earlier general purpose integer factorization methods, Pollard’s method in [280] introduced two main new ingredients: factorization into prime ideals of certain elements of an algebraic number field of degree three (or higher), and homomorphically mapping such elements to integers modulo n to get two distinct factorizations that are identical modulo n. Both ingredients had already been used for quadratic fields by Coppersmith, Odlyzko, and Schroeppel in their Gaussian integer method from [107] to compute discrete logarithms over prime fields, a method that is related to Taher ElGamal’s method from [141] to compute discrete logarithms over quadratic extensions of prime fields using prime ideal factorizations. The latter method was a generalization of an earlier method to compute discrete logarithms over prime fields [345, 2], which in turn was based on the same two-step approach to integer factorization described in Section 5.2.1 on page 117. The developments from [345, 2] via [141] to [107] that would ultimately lead to [280] and [234] are described below.

Discrete Logarithms over Prime Fields Let q be a prime number and let g be a generator of the multiplicative group F_q^× of the finite field F_q of q elements. The discrete logarithm of h ∈ ⟨g⟩ with respect to g, denoted log_g h, is


the x ∈ Z/(q − 1)Z such that g^x = h. As shown in [345], the two-step approach to integer factorization from Section 5.2.1 can also be used to compute discrete logarithms, with Leonard Adleman in [2] being the first to use Schroeppel’s approach to show that the required effort is subexponential in log q (cf. Section 5.2.2 on page 119). If in Dixon’s random squares method from Section 5.3.1 on page 126 the values w are selected as w = g^χ ∈ F_q for random exponents χ ∈ Z/(q − 1)Z (and identifying F_q in the canonical manner with the set of integers {0, 1, . . . , q − 1}), a relation w = ∏_{p∈P} p^{e_{χ,p}} leads to the identity

χ = ∑_{p∈P} e_{χ,p} log_g p mod (q − 1).

With |P| linearly independent relations the values log_g p for p ∈ P can be found using linear algebra modulo q − 1, after which, for each h for which log_g h must be calculated, values τ ∈ Z/(q − 1)Z are randomly selected until hg^τ = ∏_{p∈P} p^{e_{τ,p}}, so that

log_g h = ∑_{p∈P} e_{τ,p} log_g p − τ mod (q − 1).
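A sketch of the relation-collection phase just described (the linear algebra modulo q − 1, with its non-invertible pivots, is deliberately omitted; the helper names are mine):

import random

def dlog_relations(q, g, primes, count):
    # Collect relations chi = sum e_p * log_g(p) (mod q - 1) by testing
    # w = g^chi mod q for smoothness over `primes`, as described above.
    relations = []
    while len(relations) < count:
        chi = random.randrange(1, q - 1)
        w = pow(g, chi, q)
        e, rest = [], w
        for p in primes:
            k = 0
            while rest % p == 0:
                rest //= p
                k += 1
            e.append(k)
        if rest == 1:
            relations.append((chi, e))
    return relations

# With |primes| independent relations, solving the system modulo q - 1
# yields log_g(p) for every p in the factor base (not shown).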

Discrete Logarithms over General Finite Fields The above method works because of the canonical identification between the elements of F_q and the elements of the set of integers {0, 1, . . . , q − 1}. This makes it possible to embed F_q into the integers while transferring smoothness-related properties of the set of integers {0, 1, . . . , q − 1} to the corresponding elements of F_q. Given this simple approach, it is a natural question to ask what happens if prime fields are replaced by extension fields. In [270] it is shown that the same approach works again for fixed field characteristic q with the extension degree d going to infinity: with f(X) ∈ F_q[X] irreducible of degree d, the extension field F_{q^d} is isomorphic to (F_q[X])/(f), which is naturally embedded in F_q[X]. An extension field element can thus be defined to be smooth if the corresponding polynomial of degree at most d − 1 in F_q[X] factors into polynomials in F_q[X] of sufficiently small degrees. The required effort is of the form L_q[1/2, c] for constant c ∈ R_{>0}, i.e., subexponential in d (see [270] and also [233, 3.9–3.12]). More refined methods exploit the considerable degree of freedom in the representation of field elements if the characteristic is fixed (but d → ∞). Coppersmith, in his 1984 paper [103], was the first to achieve L[1/3, c]. After almost three decades this line of research was picked up again, resulting in a sequence of dramatic further improvements [162, 196, 27, 165].

Discrete Logarithms over Quadratic Extension Fields Naively doing the same for d = 2 to compute discrete logarithms in F_{q^2}, with prime q, fails


because the elements of F_{q^2} would be identified with polynomials of degree at most one in F_q[X], via the isomorphism between F_{q^2} and (F_q[X])/(f). With the above definition of smoothness of polynomials, all elements are smooth and the algorithm becomes meaningless. ElGamal in [141] showed how to fix this. First he uses the same isomorphism F_{q^2} ≅ (F_q[X])/(f) for the calculation of the w-values, which will be degree one polynomials in X over F_q. In these polynomials he replaces X and F_q by α and Z, respectively, where α is a zero of f regarded as an irreducible polynomial in Z[X] (with the usual canonical map between F_q and the set of integers {0, 1, . . . , q − 1}, and where irreducibility over Z follows from irreducibility modulo q). This results in w-values in Z[α], the smoothness of which is then defined in terms of a smooth prime ideal factorization in the algebraic number field Q(α) ≅ Q[X]/(f).

Prime Ideal Factorization As described in [141, Appendix C] for quadratic number fields and for higher degree number fields in [235], [84] and [236], prime ideal factorizations in Q(α) lead to a myriad of issues. The present informal description is loosely based on [141, Appendix C] and [235, sections 2, 3, 5] to cover both the present discrete logarithm application and the number field sieve in Section 5.5 on page 137. For simplicity it is assumed that Z[α] is a unique factorization domain; for the more general case refer to [141, appendix C], [235, section 3] and [84].

Assume that f is monic and of degree d as above. Because the field F_{q^d} is isomorphic to (F_q[X])/(f), the generator g of the multiplicative group F_{q^d}^× of F_{q^d} can be represented as a polynomial g(X) ∈ (F_q[X])/(f) of degree at most d − 1. For random χ ∈ Z/(q^d − 1)Z, the element w = g^χ ∈ F_{q^d}^× is calculated as g(X)^χ ∈ (F_q[X])/(f), which results in a polynomial w(X) ∈ (F_q[X])/(f) of degree at most d − 1. For the present purpose d = 2, so that the polynomial w(X) has degree at most one. Although in Sections 5.5.2, 5.5.3, and 5.5.4 below more general d-values are used, the different construction that is used there also leads to polynomials w(X) ∈ (F_q[X])/(f) of degree at most one. The resulting polynomial w(X) can thus be written as a − bX ∈ (F_q[X])/(f). This polynomial is interpreted as a − bα ∈ Z[α], tested for smoothness in Z[α], and if smooth written as a product over Z[α] of prime elements in Z[α]. From this product a relation then follows. The test for smoothness is straightforward: a − bα is B-smooth if and only if its norm N(a − bα) = b^d f(a/b) ∈ Z is B-smooth (note that the norm is a degree d integer polynomial that is homogeneous in a and b). The remaining steps are more involved, as briefly described in the next paragraphs. Integer factors that a and b may have in common are easily dealt with in the usual manner. Therefore let gcd(a, b) = 1 from now on, and let N(a − bα)


(and thus a − bα) be B-smooth. One could now hope that if N(a − bα) = ∏_{p∈P} p^{e_{a,b,p}} for some set of primes P, then a − bα = ∏_{𝔭∈P̃} 𝔭^{e_{a,b,𝔭}} with P̃ denoting a set of prime elements in Z[α] that corresponds one way or another to P. This is indeed the case if “prime element” means prime ideal; the equality is interpreted as the factorization of the ideal (a − bα) into prime ideals; and if a final issue is addressed: generally speaking a prime ideal is not uniquely identified by its norm, so ambiguities have to be resolved. The latter is easily done too: because N(a − bα) = b^d f(a/b), a prime p divides N(a − bα) if and only if a/b mod p is a root of f modulo p. Therefore, it suffices to define

P = {(p, z) : p prime, p ≤ B, z ∈ Z, 0 ≤ z < p, f(z) ≡ 0 mod p}

and to rewrite the above factorization of N(a − bα) as

N(a − bα) = ∏_{(p,z)∈P} p^{e_{a,b,p,z}}

where e_{a,b,p,z} = 0 if a ≢ bz mod p. Note that for d = 2 at most two pairs in P share the same prime. After identifying each pair (p, z) ∈ P with the prime ideal 𝔭 generated by p and z − α, the prime factorization of N(a − bα) over P corresponds to the prime ideal factorization

(a − bα) = ∏_{𝔭∈P} 𝔭^{e_{a,b,𝔭}}    (5.17)

of the ideal (a − bα). These prime ideals 𝔭 are first degree prime ideals and are the only prime ideals that can occur in the prime ideal factorization of ideals of the form (a − bα). Two more issues must be addressed to turn Equation (5.17) into a factorization of the element a − bα ∈ Z[α] that holds over Z[α] ≅ (Z[X])/(f) and that can thus be turned into a factorization in (F_q[X])/(f) ≅ F_{q^d}. The latter is required for ElGamal’s method to compute discrete logarithms in F_{q^2} and for the early version of the special number field sieve – later it turned out that the prime ideal factorization in Equation (5.17) suffices (thanks to two other additional ideas, mentioned on pages 155 and 157 in Section 5.5.3).

Factoring a − bα over Z[α] As is, in Equation (5.17), the ideal 𝔭 generated by p and z − α does not contribute in a useful or meaningful fashion to a factorization of a − bα over Z[α], because 𝔭 can not be interpreted as an element of Z[α]. In the context of [141] and [235] this can be fixed by determining, for each 𝔭 = (p, z − α) ∈ P, an element g_𝔭 ∈ Z[α] that generates the same ideal as 𝔭: this is the case if the norm N(g_𝔭) of g_𝔭 equals p and g_𝔭 regarded as a polynomial of degree at most d − 1 has a root z modulo p. Here N(g_𝔭) is as above if d = 2 (in which case g_𝔭 is a polynomial in α of degree at most


one); in general N(g_𝔭) is a degree d integer polynomial that depends on f and that is homogeneous in the d coefficients of g_𝔭 (see also [235, 3.6]). In [141, lemma 4] and [235, section 3] a search-process is described that determines g_𝔭 for all 𝔭 ∈ P and that is efficient for the polynomials at hand in [141] and [235]. Essentially, degree d − 1 integer polynomials with relatively small coefficients are inspected until all generators g_𝔭 have been found. Once all found, the prime ideal factorization (5.17) of the ideal (a − bα) can be rewritten as

(a − bα) = ∏_{𝔭∈P} (g_𝔭)^{e_{a,b,𝔭}}.    (5.18)
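The set P and the exponents in the factorization (5.17) are easy to compute for small examples; the sketch below (brute-force root finding, monic f only, with ramification subtleties ignored) mirrors the definitions just given:

from sympy import primerange

def norm(a, b, coeffs):
    # N(a - b*alpha) = b^d * f(a/b) for monic f(X) = sum_j coeffs[j] * X^j.
    d = len(coeffs) - 1
    return sum(c * a ** j * b ** (d - j) for j, c in enumerate(coeffs))

def first_degree_primes(coeffs, bound):
    # P = {(p, z) : p <= bound prime, f(z) = 0 mod p}; brute force over z.
    f_mod = lambda z, p: sum(c * pow(z, j, p) for j, c in enumerate(coeffs)) % p
    return [(p, z) for p in primerange(2, bound + 1)
                   for z in range(p) if f_mod(z, p) == 0]

def ideal_exponents(a, b, coeffs, pairs):
    # Exponents e_{a,b,p,z} of (5.17): the pair (p, z) receives the full power
    # of p in N(a - b*alpha) exactly when a = b*z mod p (gcd(a, b) = 1).
    N, exps = abs(norm(a, b, coeffs)), {}
    for p, z in pairs:
        while (a - b * z) % p == 0 and N % p == 0:
            N //= p
            exps[(p, z)] = exps.get((p, z), 0) + 1
    return exps, N   # N == 1 means a - b*alpha was smooth over P

pairs = first_degree_primes([-2, 0, 0, 1], 30)        # f(X) = X^3 - 2
exps, rest = ideal_exponents(3, 1, [-2, 0, 0, 1], pairs)
assert exps == {(5, 3): 2} and rest == 1              # (3 - alpha) = p_{5,3}^2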

Even though g_𝔭 ∈ Z[α], Equation (5.18) may not yet be the factorization of a − bα over Z[α] that is aimed for, because the prime ideal generators g_𝔭 are not unique. In principle, any choice for g_𝔭 is as good as any other one, but different choices would lead to different factorizations of a − bα, which can not be correct. This final issue is resolved by finding the unit contribution: if g_𝔭 and ḡ_𝔭 are distinct but generate the same prime ideal, then their quotient is a polynomial u ≠ 1 in Z[α] of norm equal to one. Such a u is a unit. In more generality, given a choice of prime ideal generators g_𝔭 for all 𝔭 ∈ P, the quotient u_{a,b} of a − bα and ∏_{𝔭∈P} (g_𝔭)^{e_{a,b,𝔭}} satisfying Equation (5.18) is called the unit contribution. To be able to deal with the unit contributions, each must be written as a product over a fixed set of units. This is done as follows. During the search for the prime ideal generators g_𝔭, a minimal finite set U ⊂ Z[α] of units can be determined that multiplicatively generates all units, directly by keeping the polynomials of norm equal to one or by considering quotients of two generators that have, in absolute value, the same norm [235, section 3]. Once U has been determined, integers e_{a,b,u} can be found such that u_{a,b} = ∏_{u∈U} u^{e_{a,b,u}}. This can be done using table look-up or using (much faster) complex embeddings as described in [235, section 5]. As a result it is found that

a − bα = (∏_{u∈U} u^{e_{a,b,u}}) (∏_{𝔭∈P} g_𝔭^{e_{a,b,𝔭}})    (5.19)

holds over Z[α] ≅ (Z[X])/(f).

Wrapping up Discrete Logarithms over Quadratic Extension Fields Returning, for d = 2, to where a − bα came from, namely from g^χ = a − bX ∈ (F_q[X])/(f) ≅ F_{q^2}, where g generates F_{q^2}^× and χ is chosen at random from Z/(q^2 − 1)Z, Equation (5.19) is the relation that follows from the smoothness of a − bα. With X substituted for α (also in all u ∈ U and in g_𝔭 for all 𝔭 ∈ P) it holds for integer polynomials modulo the polynomial f(X) ∈ Z[X]. It thus holds modulo q too and, with all polynomials interpreted as elements of F_{q^2},


leads to

χ = log_g(a − bX) = ∑_{u∈U} e_{a,b,u} log_g u + ∑_{𝔭∈P} e_{a,b,𝔭} log_g g_𝔭 mod (q^2 − 1).

As usual, with |U| + |P| relations the discrete logarithms of all u ∈ U and all g_𝔭 for 𝔭 ∈ P can be found using linear algebra modulo q^2 − 1. Individual discrete logarithms can then be calculated as described above. In [141] ElGamal has shown that the required effort is subexponential in log q.

Gaussian Integer Method Returning to the computation of discrete logarithms over prime fields F_q, in [107] Coppersmith, Odlyzko, and Schroeppel show how the ideas of Schroeppel’s linear sieve can be used to substantially speed up the basic algorithm described in one of the earliest paragraphs of this section. One of their methods combines Gaussian integers (similar to ElGamal’s method sketched above, with d = 2) with a homomorphism between the set Z[i] of Gaussian integers and the ring Z/qZ of integers modulo q to find two distinct factorizations that are the same modulo q. This combination can be interpreted as the degree two version of what was later used in the number field sieve for general degrees, and is briefly described below.

Let f(X) = X^2 − t ∈ Z[X], where |t| is small, t < 0, and t is a quadratic residue modulo q. With α such that f(α) = 0, the same assumption as above is made that Z[α] is a unique factorization domain (unnecessarily limiting the choice of t to just nine possibilities for the present simplified description). With m ∈ Z such that m^2 ≡ t mod q, the mapping ϕ from Z[α] to Z/qZ that maps a − bα to a − bm mod q is a ring homomorphism because f(α) = 0 ≡ f(m) mod q. It follows that if the integer a − bm is B-smooth and the Gaussian integer a − bα is smooth as in Equation (5.19), and where a and b are coprime as usual, then

∏_{p≤B} p^{e_{a,b,p}} = a − bm ≡ ϕ(a − bα) mod q

and

ϕ(a − bα) = ϕ((∏_{u∈U} u^{e_{a,b,u}}) (∏_{𝔭∈P} g_𝔭^{e_{a,b,𝔭}})) = (∏_{u∈U} ϕ(u)^{e_{a,b,u}}) (∏_{𝔭∈P} ϕ(g_𝔭)^{e_{a,b,𝔭}}).

This leads to the relation

∏_{p≤B} p^{e_{a,b,p}} ≡ (∏_{u∈U} ϕ(u)^{e_{a,b,u}}) (∏_{𝔭∈P} ϕ(g_𝔭)^{e_{a,b,𝔭}}) mod q    (5.20)


and thus to the identity

∑_{p≤B} e_{a,b,p} log_g p ≡ ∑_{u∈U} e_{a,b,u} log_g ϕ(u) + ∑_{𝔭∈P} e_{a,b,𝔭} log_g ϕ(g_𝔭) mod (q − 1)

between the discrete logarithms of a specific set of elements of F_q^× ≅ (Z/qZ)^×. With sufficiently many identities of this sort, all these discrete logarithms can be determined – assuming log_g g = 1 is among them – after which individual logarithms can be found in, more or less, the customary fashion.

In [107] integers y, z of order √q with y/z ≡ m mod q are determined (by interrupting the iteration of the extended Euclidean calculation of m^{−1} mod q approximately halfway) to replace a − bm of approximate order q by z(a − bm) ≡ az − by mod q of approximate order √q. This considerably increases the smoothness probabilities, at the cost of introducing an additional factor z on the right-hand side of Relation (5.20) (and an additional term log_g z on the right-hand side of the ensuing identity modulo q − 1), but the basic idea remains the same. After this modification, the overall required effort becomes L_q[1/2, 1], using, for the first time, the Lanczos method and thus μ = 2 (cf. Section 5.2.5 on page 123 and Section 5.2.7 on page 126) for the linear algebra step. The method has for a long time been competitive with later number field sieve based discrete logarithm methods [343]. Interestingly, with |a|, |b| ≤ L_q[1/2, 1/2], the smoothness probabilities of the integers and of the algebraic integers are unbalanced: for the integers az − by the probability is L_q[1/2, −1/2], while for the algebraic integers the smoothness probability is bounded from below by a positive constant [107]. If a larger degree polynomial f(X) is used, then the probabilities can be better balanced. This is precisely what happens in the number field sieve.
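The half-interrupted extended Euclidean computation used in [107] is easy to sketch: run the algorithm on q and m and stop at the first remainder below √q (the function name and stopping rule below are mine):

import math

def half_gcd(m, q):
    # Return (y, z) with y = z*m (mod q) and |y| below sqrt(q), by stopping
    # the extended Euclidean algorithm for gcd(m, q) roughly halfway.
    r0, r1 = q, m % q
    z0, z1 = 0, 1          # invariants: r0 = z0*m (mod q), r1 = z1*m (mod q)
    while r1 * r1 >= q:    # stop once the remainder drops below sqrt(q)
        s = r0 // r1
        r0, r1 = r1, r0 - s * r1
        z0, z1 = z1, z0 - s * z1
    return r1, z1

q, m = 1000003, 123456
y, z = half_gcd(m, q)
assert (y - z * m) % q == 0 and abs(y) <= math.isqrt(q)
# Sieving over a*z - b*y instead of a - b*m roughly halves the size (in digits)
# of the values tested for smoothness on the rational side.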

5.5.2 Special Number Field Sieve

Pollard in [280] showed how the ideas from Section 5.5.1 on page 139 can be used with degree d > 2 to factor numbers of the form x^3 − t, for small |t|, while using a unique factorization domain Z[α] for his example factorization of F7. Elsewhere (cf. his letter quoted above in Section 5.5 on page 137) he mentioned that non-unique factorization and degree higher than three seem possible and not too difficult. This was indeed shown to be the case in the follow-up paper [235]. Using d = 5 and a rough first implementation of a generalized version of Pollard’s method, several previously unfactored composites from the Cunningham tables [116, 82] were factored. Many of these numbers were at the time out of reach of the multiple polynomial quadratic sieve or its faster self-initializing variant. Several cases were encountered where Z[α] was


not a unique factorization domain and had to be replaced by the ring of integers of Q(α) to get unique factorization. These early experiments culminated in the summer of 1990 in the factorization of F9, reported by the author, Lenstra, Manasse, and Pollard in [236]. Since then much more refined implementations have been used to obtain a string of factorization records for Cunningham numbers, most recently the shared factorizations from [210] mentioned in the introduction (which also uses one of the ideas from [104], described in Section 5.5.4 on page 158). Pollard’s method, now known as the special number field sieve, is at this point in time still the state of the art for the factorization of Cunningham and other numbers of a similar special form.

Relations in the Special Number Field Sieve A relation in the original variants of the number field sieve [280, 235] is a higher degree variation of the homomorphic equivalence (5.20) on page 144 encountered in the Gaussian integer method in Section 5.5.1 on page 139, with the prime q replaced by the composite n to be factored: when Relation (5.20) is divided by its left-hand side, an equation similar to the one in Condition (5.1) on page 118 is obtained (with w = v = 1):

1 ≡ (∏_{p≤B} p^{−e_{a,b,p}}) (∏_{u∈U} ϕ(u)^{e_{a,b,u}}) (∏_{𝔭∈P} ϕ(g_𝔭)^{e_{a,b,𝔭}}) mod n.

With π(B) + |U| + |P| + ω such equations (with ω the oversquareness as in Section 5.2.1 on page 117) the composite n can most likely be factored. As set forth in Section 5.5.1, Relation (5.20) requires simultaneous smoothness, for coprime integers a, b, of the integers a − bm and N(a − bα) = b^d f(a/b) where f(m) ≡ 0 mod n for an irreducible, degree d polynomial f(X) ∈ Z[X] with f(α) = 0. Because |a| and |b| will be relatively small, the smoothness probabilities depend on the size of m, the degree d, and the sizes of the coefficients of f(X). In the Gaussian integer method the size-issue was addressed by replacing m by y/z for smaller y and z, and by taking f(X) = X^2 − t for small |t|. In the special number field sieve it is done by considering a similar polynomial of degree larger than two, and by considering only specific n-values.

Special Numbers Pollard in [280] targeted composites of the form x^3 − k for small |k|. In [235] this was generalized to x^D − k for small positive integers D and |k|. Examples of such composites are the Cunningham numbers, with factorizations tabulated in [116, 82] and to the present day the subject of intense computations. Given a composite n = x^D − k and a targeted degree d, a polynomial f(X) = X^d − t ∈ Z[X] and integer m with f(m) ≡ 0 mod n are easily found by taking the smallest integer ℓ such that dℓ ≥ D and putting t = kx^{dℓ−D} and m = x^ℓ.
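The special-form polynomial selection just described is a one-liner; this sketch (toy parameters of my choosing) picks ℓ = ⌈D/d⌉, sets t = k·x^{dℓ−D} and m = x^ℓ, and verifies that f(m) ≡ 0 mod n:

def snfs_polynomial(x, D, k, d):
    # For n = x**D - k, return (t, m) with f(X) = X**d - t and f(m) = 0 mod n:
    # m**d - t = x**(d*l) - k*x**(d*l - D) = x**(d*l - D) * (x**D - k).
    n = x ** D - k
    l = -(-D // d)               # smallest l with d*l >= D (ceiling division)
    t = k * x ** (d * l - D)
    m = x ** l
    assert (m ** d - t) % n == 0
    return t, m

# Example: n = 2^227 - 1 with degree 5 gives l = 46, f(X) = X^5 - 2^3, m = 2^46.
t, m = snfs_polynomial(2, 227, 1, 5)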


The choice of d leads to a trade-off between the smoothness probabilities of a − bm and b^d f(a/b), with larger d leading to smaller a − bm but faster growth of |b^d f(a/b)| for larger a- and b-values. This trade-off is analyzed further below. It leads to a d-value that grows as a function of n and smoothness probabilities of a − bm and b^d f(a/b) that are better balanced than in the Gaussian integer method. With f(X) = X^d − t, deriving Relation (5.20) on page 144 from a pair (a − bm, b^d f(a/b)) of smooth integers works as described in Section 5.5.1 on page 139, assuming that Z[α] is a unique factorization domain. If the latter is not the case the general number field sieve approach (cf. below) may be used while still exploiting the favorable smoothness probabilities resulting from the polynomial X^d − t (compared to the polynomials that normally occur in the general number field sieve). Alternatively, as suggested in [235, section 3] and as done for several of the factorizations obtained in [235], the search-based approach from Section 5.5.1 can still be used, but with Z[α] replaced by (1/c)Z[α] for an appropriately chosen small c ∈ Z_{>1} (and with ϕ redefined as ϕ((a − bα)/c) = (a − bm)(c^{−1} mod n) mod n).

Finding Relations

Pairs of coprime integers a, b (with b ≥ 0) such that a − bm and b^d f(a/b) = a^d − t·b^d are both smooth are normally found using a two-stage sieving process. Commonly, everything related to a − bm is referred to as the rational side, and everything related to b^d f(a/b) as the algebraic side. With the notation from Section 5.2.3 on page 120 let L[r/2, β] be the smoothness bound (without loss of generality shared for the smoothness of a − bm and of b^d f(a/b)), where r > 0 and β > 0 are specified in the analysis below. Furthermore, assume that coprime pairs of integers a, b with |a| ≤ L[r/2, γ_a] and 0 ≤ b ≤ L[r/2, γ_b] must be considered, with, without loss of generality, γ_a ≥ γ_b and γ_a + γ_b = 2β (cf. Section 5.2.3 and the analysis below). Under this standard assumption on the size L[r/2, γ_a + γ_b] = L[r/2, 2β] of the search space versus the smoothness bound L[r/2, β], the sieving effort equals L[r/2, 2β] for both methods sketched below.

Line Sieving

The first method to find relations, as used by the earliest implementations, is line sieving. Pollard in [280] used it in a first stage to locate pairs of coprime integers a, b for which a − bm is smooth, and then, in the second stage, used trial division to inspect the corresponding values of b^d f(a/b) for smoothness. In [235] line sieving was used in both stages, i.e., first to find pairs of coprime integers a, b for which a − bm is (likely to be) smooth and next to find the pairs for which b^d f(a/b) is smooth as well. This gave the number field sieve its name. In line sieving, for b = 0, 1, 2, ..., L[r/2, γ_b] in succession the entire line of a-values with |a| ≤ L[r/2, γ_a] is sieved. This is similar to Schroeppel's linear sieve where consecutively for each fixed j-value the interval of i-values is processed. The elementary line siever from [235] was used for, among others, the factorization of F9 reported in [236] and, after considerable improvements by Montgomery, for many other factorizations (see for instance [78]). The order of the two sieving stages in the special number field sieve is explained by the fact that on average the absolute values |a − bm| on the rational side are larger than the absolute values |b^d f(a/b)| on the algebraic side: compared to the reverse order of sieving, fewer candidate locations resulting from the first sieve over the |a − bm|-values remain to be inspected after the second sieve over the |b^d f(a/b)|-values. In the general number field sieve |b^d f(a/b)| is on average larger than |a − bm| so that it becomes more efficient to reverse the order of the sieving stages: there pairs of integers a, b for which b^d f(a/b) is likely to be smooth are located first, and next the resulting set of pairs is further restricted to pairs for which a − bm is smooth as well.

Lattice Sieving

Unless γ_b is small, it is more efficient to use lattice sieving, as suggested by Pollard in [281]. Lattice sieving was used for the first time in [161] (for the general number field sieve), and has from that time on been used for all record factorizations obtained using the special or the general number field sieve (for the record reported in [92] both line sieving and lattice sieving were used). In lattice sieving relation collection is split up, not according to disjoint lines as in line sieving, but into non-disjoint subtasks specified by (prime, root) pairs (q, z) for primes q relatively close to the smoothness bound L[r/2, β]. The prime q is often referred to as special prime or special q-prime. This has nothing to do with the "special" in special number field sieve; the prime q, however, is reminiscent of how Davis, Holdridge, and Simmons managed to get quadratic sieve to work, cf. Section 5.4.3 on page 133. In subtask (q, z) relations are sought specified by pairs of coprime integers a, b for which a/b ≡ z mod q. Because a relation given by a pair of integers a, b may be found by different subtasks (q, z) and (q̄, z̄) if a/b ≡ z mod q and a/b ≡ z̄ mod q̄, duplicates among the relations must be removed. There are on the order of L[r/2, β] (prime, root) pairs for which the prime is close to L[r/2, β], each of which is processed (typically in parallel, in ranges of sequential q-values) until enough distinct relations have been found. It follows that per (prime, root) pair effort at most L[r/2, β] may be spent. How this is achieved is described below.

Let (q, z) with q of order L[r/2, β] be a fixed (prime, root) pair. In the special number field sieve (q, z) is typically chosen such that z ≡ m mod q and subtask (q, z) results in pairs of coprime integers a, b with q dividing a − bm; in the general number field sieve (q, z) is chosen such that f(z) ≡ 0 mod q so that q divides b^d f(a/b) for the pairs a, b resulting from subtask (q, z). Without loss of generality, assume that z ≡ m mod q and that γ_a = γ_b = β. The pairs of integers a, b for which q divides a − bm form an index-q sublattice of Z² with basis {(q, 0)^T, (z, 1)^T} over Z: in subtask (q, z) only elements of this sublattice are considered. Intersecting the sublattice with the original rectangular search space {(a, b) : |a| ≤ L[r/2, β], 0 ≤ b ≤ L[r/2, β]} results in a search space for subtask (q, z) that consists of approximately 2L[r/2, 2β]/q = L[r/2, β] elements. This intersection is not calculated precisely, but only approximated in the sense that a subtask search space is defined that should be approximately as effective as the actual intersection: first a reduced basis {u, v} ⊂ Z² of the original basis {(q, 0)^T, (z, 1)^T} is found (i.e., the vectors u and v should have entries that are in absolute value close to √q) after which the intersection is approximated as {(a, b)^T = iu + jv ∈ Z² : |i| ≤ L[r/2, β/2], 0 ≤ j ≤ L[r/2, β/2]}. The subtask search space is then defined as the rectangle {(i, j) : |i| ≤ L[r/2, β/2], 0 ≤ j ≤ L[r/2, β/2]} in the (i, j)-plane, with each pair (i, j) identified with the a, b pair (a, b)^T = iu + jv with a/b ≡ z mod q.

Sieving by Vectors

The above new rectangle of size 2L[r/2, β/2] × L[r/2, β/2] = L[r/2, β] in the (i, j)-plane must be sieved with all L[r/2, β] distinct (prime, root) pairs (p, z_p) while spending effort L[r/2, β], i.e., proportional to the size of the subtask search space. This implies that line sieving can not be used because it would consider all L[r/2, β/2] consecutive j-values (i.e., all lines in the new rectangle; cf. [83]) and it would do so for each of the L[r/2, β] (prime, root) pairs (p, z_p). Thus, line sieving would take effort at least L[r/2, (3/2)β], which is more than L[r/2, β]. Instead, in the (i, j)-plane sieving with a (prime, root) pair (p, z_p) must be done in such a way that it takes effort at most L[r/2, β]/p; summation over all (prime, root) pairs (p, z_p) then results in an upper bound L[r/2, β] on the total sieving effort (cf. Equation (5.4) on page 122). As above, the points to be visited per pair (p, z_p) belong to an index-p sublattice of the (i, j)-plane. Those among them that belong to the new rectangle in the (i, j)-plane are located in a manner similar to how that new rectangle was defined: first a suitably reduced basis is determined for the index-p sublattice induced by (p, z_p) in the (i, j)-plane, after which the intersection with the rectangle can be determined. Pollard in [281] refers to this approach as sieving by vectors and poses the problem how to quickly generate the points in the intersection. It was done crudely but fairly effectively in [49, 161] by considering small linear combinations of the vectors spanning the reduced bases; refer to [147], however, for the solution to Pollard's problem.
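The basis reduction step is just two-dimensional lattice reduction. The following sketch (illustrative only; not taken from any number field sieve implementation) reduces the basis {(q, 0), (z, 1)} of the special-q sublattice with the classical Lagrange-Gauss algorithm; the (prime, root) pair used is made up:

    def lagrange_gauss(u, v):
        """Lagrange-Gauss reduction of a 2D lattice basis; the returned
        vectors are (up to sign) the two successive minima of the lattice."""
        dot = lambda x, y: x[0] * y[0] + x[1] * y[1]
        if dot(u, u) > dot(v, v):
            u, v = v, u
        while True:
            # nearest integer to <u,v>/<u,u>, in exact integer arithmetic
            k = (2 * dot(u, v) + dot(u, u)) // (2 * dot(u, u))
            v = (v[0] - k * u[0], v[1] - k * u[1])
            if dot(u, u) <= dot(v, v):
                return u, v
            u, v = v, u

    q, z = 1000003, 123456                   # hypothetical special (prime, root) pair
    u, v = lagrange_gauss((q, 0), (z, 1))
    # every point i*u + j*v = (a, b) of the sublattice satisfies a ≡ b*z (mod q)
    for i, j in [(1, 0), (0, 1), (2, -3)]:
        a, b = i * u[0] + j * v[0], i * u[1] + j * v[1]
        assert (a - b * z) % q == 0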


Speedup Obtained by Lattice Sieving

When using lattice sieving with special q-primes between B_0 and B_1 close to the smoothness bound L[r/2, β], a fraction ≈ log(log B_1/log B_0) of the original search space is considered. The precise values depend on how much sieving (or over-sieving) one decides to do, but normally speaking the fraction will be considerably less than one and far outweighs the overhead inherent in sieving by vectors (as the latter requires a basis reduction step for each (prime, root) pair (p, z_p) that must be sieved with). Another negative effect is that relations for which a − bm is (B_0 − 1)-smooth will be missed. Overall, however, for large composites lattice sieving is to be preferred to line sieving. It should be noted that when sieving the values that have a special prime q as a fixed divisor, sieving is normally restricted to (prime, root) pairs (p, z_p) for which the prime p is less than q.

Free Relations

With the prime bound B and the set of first degree prime ideals P as in Section 5.5.1 on page 139, if during the construction of P a prime p ≤ B is encountered such that f(X) splits into linear factors modulo p, a free relation is obtained. Let f(X) ≡ ∏_z (X − z)^{e_z} mod p for distinct integers z ∈ {0, 1, ..., p − 1} and strictly positive integers e_z. For each of these integers z, define p_{p,z} as the first degree prime ideal of norm p generated by p and z − α. Then the ideal generated by p equals the product of the ideals p_{p,z}^{e_z}. With P_p the set containing all these ideals p_{p,z} it follows that (p) = ∏_{p_{p,z}∈P_p} p_{p,z}^{e_z}, which is, with a = p, b = 0, and e_{p,0,p_{p,z}} = e_z, an equation of the same form as Equation (5.17) on page 142. This leads to a useful relation because p ≤ B and P_p ⊂ P. The fraction of relations that thus comes for free is inversely proportional to the degree of the splitting field of f(X).

Heuristic Asymptotic Analysis of the Special Number Field Sieve

In the analysis below the second argument used in the L-notation introduced in Section 5.2.2 on page 119 often involves an o(1)-term, for D → ∞ where n = x^D − k; this term is silently ignored. Let r, ψ_r ∈ R_{>0} be such that max(|a − bm|, |b^d f(a/b)|) ≤ L[r, ψ_r], and let s, β ∈ R_{>0} be such that the largest of the two smoothness bounds is upper bounded by L[s, β] (zero arguments can be seen not to work). Thus, it suffices to find L[s, β] + L[s, β] = L[s, β] coprime (a, b) pairs that satisfy the smoothness requirements. Furthermore, dependencies must be found in a sparse L[s, β] × L[s, β]-matrix, at cost L[s, 2β] (cf. Section 5.2.5 on page 123). With the smoothness probabilities from Section 5.2.2 and heuristically assuming that the values a − bm and b^d f(a/b) behave as independent random integers, it is expected that to find a single satisfactory coprime (a, b) pair, it suffices to consider L[r − s, ψ_s] random pairs, for some ψ_s ∈ R_{>0}. Because L[s, β] pairs suffice, at most L[s, β]·L[r − s, ψ_s] pairs have to be inspected, which is minimized for s = r/2 (this repeats the argument given just before Expression (5.3) on page 120).

The trade-off between the smoothness probabilities of a − bm and b^d f(a/b) now determines the values for r and the degree d. It follows from max(|a − bm|, |b^d f(a/b)|) ≤ L[r, ψ_r] that m ≤ L[r, ψ] for some ψ ∈ R_{>0}. With m ≈ n^{1/d} = e^{(log n)/d}, this bound on m implies that d ≈ δ·(log n/log log n)^{1−r}, where δ = 1/ψ. With a search space that contains L[r/2, β + ψ_s] pairs (a, b) and given the symmetry of |a| and b in |b^d f(a/b)|, both |a| and b may be upper bounded by L[r/2, γ] for some γ ≥ (β + ψ_s)/2, so that max(|a|, b)^d = L[1 − r/2, γδ]. Balancing the upper bounds for |a − bm| and |b^d f(a/b)| leads to the optimal choice r = 1 − r/2, and thus r = 2/3. In terms of the L-notation, no savings can be obtained when different smoothness bounds are used for |a − bm| and |b^d f(a/b)|, so let L[1/3, β] be the smoothness bound for both. With |a| and b both bounded by L[1/3, γ] and m by L[2/3, ψ] and heuristically assuming random behavior and independence, the values a − bm and |b^d f(a/b)| are both L[1/3, β]-smooth with probability L[1/3, −ψ/(3β)]·L[1/3, −γδ/(3β)], so that a total of L[1/3, β + (ψ + γδ)/(3β)] pairs must be inspected to find L[1/3, β] satisfactory ones (and ψ_s = (ψ + γδ)/(3β)). This is minimized when 3β² = ψ + γδ and thus results, with effort L[1/3, 0] per smoothness test (cf. Section 5.2.4 on page 121), in effort L[1/3, 2 max(β, γ)] to find the required (a, b) pairs. Taking γ = β and noting that this satisfies all the above boundary conditions, and including the cost L[1/3, 2β] to find the dependencies, it follows that the overall effort is L[1/3, 2β], which remains to be minimized under the condition 3ψβ² − β − ψ² = 0. The single positive root β = (1/(6ψ))·(1 + (1 + 12ψ³)^{1/2}) attains its minimal value β = (2/3)^{2/3} for ψ = (2/3)^{1/3} (and thus δ = (3/2)^{1/3}). The resulting overall effort is L[1/3, (32/9)^{1/3}].

More General Polynomials with Constant Coefficients

The above analysis of the relation collection and linear algebra effort applies for n → ∞ as long as the absolute values of the coefficients of the polynomial f(X) are bounded by a constant. Even though the algorithm as described in this section may not apply to such more general polynomials (because the search for generators of the first degree prime ideals may fail) one nevertheless says that the special number field sieve applies to n-values that admit polynomials with coefficients bounded by a constant. For these somewhat more general n-values the search for generators may be replaced by the more general approach used for the general number field sieve and described in Section 5.5.3 below.
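The minimization at the end of this analysis is easy to double-check numerically; the sketch below (purely illustrative, not part of the original analysis) scans ψ and recovers 2β = (32/9)^{1/3} ≈ 1.5263 at ψ = (2/3)^{1/3}:

    from math import sqrt

    def beta(psi):
        # single positive root of 3*psi*beta^2 - beta - psi^2 = 0
        return (1 + sqrt(1 + 12 * psi**3)) / (6 * psi)

    psis = [0.5 + i / 10000 for i in range(5000)]       # crude scan of psi
    best = min(psis, key=beta)
    print(2 * beta(best), (32 / 9) ** (1 / 3))          # both ≈ 1.5263
    print(best, (2 / 3) ** (1 / 3))                     # optimal psi ≈ 0.8736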


Large Prime Relations

Relations involving large primes play a much more prominent role in the number field sieve than in earlier general purpose factoring methods, because large primes can relatively easily be found on the rational side (i.e., large primes dividing a − bm) and on the algebraic side (i.e., large primes dividing b^d f(a/b)). Depending on the number of large primes allowed, the number of pairs to be inspected after the sieving may increase considerably, resulting in relatively costly cofactor processing (for which other factoring algorithms, including the elliptic curve method and quadratic sieve, turn out to be useful). The presence of large primes also complicates the linear algebra step (cf. the discussion in Section 5.2.6 on page 123 on filtering and Montgomery's contributions to it) and even deciding if enough relations have been collected becomes a more cumbersome process. Overall, however, usage of large primes leads to a considerable speedup (which, as usual, disappears in the o(1) in the L-notation). Refer to [131] for the earliest results (which were, back then, found to be rather surprising) and to [246, 210] for the most recent ones.
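A sketch of the bookkeeping involved (ours, with made-up thresholds; real implementations use the elliptic curve method or quadratic sieve for the cofactor, not the naive trial division shown here): after removing the factor base primes, a candidate is kept if the remaining cofactor splits into a permitted number of large primes.

    def classify_cofactor(v, B, B_large, max_large):
        """Remove primes below B from |v| by trial division; accept if the
        remaining cofactor is 1 or a product of at most max_large primes
        below B_large (tested here naively, by more trial division)."""
        v = abs(v)
        for p in range(2, B):
            while v % p == 0:
                v //= p
        large = []
        for p in range(B, B_large):
            while v % p == 0:
                v //= p
                large.append(p)
        if v == 1 and len(large) <= max_large:
            return large          # list of large primes in the relation
        return None               # rejected: cofactor not smooth enough

    print(classify_cofactor(2**3 * 1009 * 1013, 1000, 10**6, 2))  # [1009, 1013]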

5.5.3 General Number Field Sieve

Though it had not escaped at least one of the authors of [235] that, if a number of obstructions are ignored, an approach and analysis similar to the special number field sieve could apply to arbitrary composites, Joe Buhler and Pomerance were the first who dared to publicly suggest this. Their optimism turned out to be justified: after several obstacles had been resolved, the general number field sieve became a reality in the early 1990s. As a result (as shown below and with the usual vigorous handwaving) the expected general purpose factoring effort was reduced, quite spectacularly, from L[1/2, 1] to L[1/3, (64/9)^{1/3}] ≈ L[1/3, 1.9230], for n → ∞. This is a bit worse than the special number field sieve's L[1/3, (32/9)^{1/3}] ≈ L[1/3, 1.5263] but not overly so. Given the proven practicality of the special number field sieve, some expected that its generalization would soon turn out to be practical as well – and quite possibly replace quadratic sieve as the best practical general purpose factoring method. Despite the encouraging remarks in [84, section 1], this expectation was not generally shared. Initial experiments were indeed hardly encouraging. In [49] a 66-digit general number was factored (using lattice sieving, cf. Section 5.5.2 on page 145) in a few hours on a MasPar supercomputer (cf. Section 5.4.4 on page 134), where quadratic sieve took only a few minutes. This compares very poorly to the performance of the special number field sieve, which had been used to obtain the record factorization of F9, an achievement that was far beyond the capacity of quadratic sieve. Neither were the results reported in [83] competitive, but it is not clear if sieving by vectors was used in the lattice sieving from [83] (as required to get the right performance).

The first more encouraging estimate appeared in [121, section 1], confirmed by an experiment reported in [161, section 5] suggesting that the 129-digit quadratic sieve factorization reported in [15], at that point in time the state of the art in general purpose factoring, could have been achieved at about a third of the effort using the general number field sieve. In 1996 the general number field sieve finally replaced quadratic sieve as the state-of-the-art general purpose factoring method for non-special numbers as well: the factorization of a 130-digit general composite took an effort that was, according to [114], "a fraction of what was spent on the previous record" (of the 129-digit composite in [15]), and used the advantageous effect, as had already been reported in [131], of the use of multiple large primes on both the rational and algebraic side. Probably the most prominent factorization achieved with the general number field sieve is still the 1999 factorization of a 512-bit cryptographic modulus, in [92]. For 512-bit numbers it thus took almost a decade to close the gap between "special" and "general." The latter factorization required a 500-fold larger effort than the former, so this gap was not entirely closed by Moore's law. The current general number field sieve factorization record stands at 768 bits [209] (the current general number field sieve record for the computation of discrete logarithms over prime fields also stands at 768 bits [212]). There is a 400-fold effort gap between the current special number field sieve record (which stands at about 1200 bits) and a general 1024-bit composite (the factorization of which could have practical implications). Actually closing this gap using current methods would result in a power bill that can not – or hardly – be justified by the importance of the resulting factorization: it would be preferable to have a significantly improved method before embarking on a general 1024-bit factorization. Unfortunately, however, factoring developments over the last two decades have been disappointing. Thus, it seems there comes no end to the number field sieve's "day in the sun" [84, section 1]: true progress in general purpose factoring has come to a standstill since the publication of [84] and [104]. The sole exception is [313], but as it relies on the as yet uncertain realization of quantum computing it has no practical implications, yet.

Polynomial Selection

Finding a suitable polynomial for arbitrary n is easy; finding a good polynomial is much harder and figuring out how to actually use it to factor n is yet another story (part of which is told below). Indeed, for any composite n and d ∈ Z_{>0}, any integer m close to but less than n^{1/d} may be chosen, after which f(X) may be defined as ∑_{i=0}^{d} f_i X^i where n = ∑_{i=0}^{d} f_i m^i is the base m representation of n (i.e., f_i ∈ Z and 0 ≤ f_i < m, for 0 ≤ i ≤ d). If luck has it that the resulting f(X) is not irreducible (this has not happened yet in practice), a factorization of n may follow right away. The order m ≈ n^{1/d} estimate of the coefficients of f(X) (as used in the analysis below) gives only a rough impression of the relative performance of a particular choice. Initially mostly due to the efforts by Montgomery, selecting and distinguishing more effective parameters for the number field sieve has grown into an active area of research, to which the next chapter in this volume is devoted.

No matter how carefully a polynomial f(X) has been selected, however, the rough estimate m ≈ n^{1/d} for its coefficients is inescapable, generally speaking. This leads, unavoidably, to a rather ill-behaved number field Q(α) where the approach sketched in Section 5.5.1 on page 139 meets with a number of obstructions that looked, in the late 1980s, hard to overcome. For instance, although obtaining the prime ideal factorization of the ideal (a − bα) as in Equation (5.17) on page 142 is still possible, turning it into Equation (5.19) on page 143 (as required to obtain Relation (5.20) on page 144) requires finding generators in Z[α] (or in (1/c)Z[α], for some integer c) for the units and the prime ideals in P. For general number fields – as may be expected given a defining polynomial with coefficients of order n^{1/d} – it is not even feasible to write down such generators [235, section 9], let alone find them (and the primitive search described in Section 5.5.1 would most certainly be inadequate). While joint efforts were underway to remove the obstructions, which seemed possible but cumbersome [84], Leonard Adleman proposed an elegant and deceptively simple solution in [3]. This led to the approach sketched below.

Relations in the General Number Field Sieve

In the general number field sieve relations are given by coprime pairs of integers a, b for which the integers a − bm and b^d f(a/b) are both smooth, just as in the special number field sieve. In the latter, relations are turned into identities modulo n between two products by applying ϕ to the relations themselves. Sufficiently many modular identities can then be combined into a single identity modulo n between two squares: an integer square on the left-hand side with on the right-hand side the square of the product of a (large) number of ϕ-values of elements of Z[α]. This approach requires turning Equation (5.17), for each pair a, b under consideration, into something with a right-hand side to which ϕ can be applied, such as Equation (5.19). As mentioned above and as shown in [235] that works if the polynomial f(X) defining the number field has a particularly nice form, but as elaborated upon in [84] (and mentioned above) it is problematic for general f(X). As discussed in [84] there are several ways to overcome this problem, the most convenient one of which is using Adleman's quadratic characters.

Remark

More general descriptions of the general number field sieve no longer refer to a rational side (for a − bm) and an algebraic side (for b^d f(a/b)) but replace a − bm by b^{d̄} g(a/b) for a polynomial g(X) ∈ Z[X] of degree d̄ ≥ 1 that has modulo n a root m in common with f(X). The methods described in this chapter apply to this more general situation as well. Refer to the next chapter in this volume for a discussion on more general pairs of polynomials.

Quadratic Characters

In [3] Adleman proposed to construct the above identity modulo n between two squares in a different manner: an integer square on the left-hand side, as above, but on the right-hand side the square of the ϕ-value of an element of Z[α]. In this way the application of ϕ is postponed as long as possible, and everything "on the right-hand side" stays in Z[α] until the last moment. To get this to work in a naive fashion, sets S of pairs of coprime integers a, b would have to be found such that ∏_{(a,b)∈S} (a − bm) ∈ Z is the square of some x ∈ Z, and such that η = ∏_{(a,b)∈S} (a − bα) ∈ Z[α] is a square so that √η ∈ Z[α] can be computed; the required modular identity x² ≡ y² mod n would then follow with y = ϕ(√η). Unfortunately, this does not work, due to the fourth obstruction listed in [84, section 6], namely that √η does not necessarily belong to Z[α]. But, as also shown in [84, section 6], this can easily be fixed: with f′(X) the derivative of the polynomial f(X) it is the case that f′(α)√η belongs to Z[α] and the modular identity becomes (f′(m)x)² ≡ ϕ(f′(α)√η)² mod n.

The condition on ∏_{(a,b)∈S} (a − bm) is equivalent to a dependency modulo 2 among the exponent vectors of the factorizations of the smooth values a − bm, as usual. The condition on η = ∏_{(a,b)∈S} (a − bα) is only slightly more involved. In the first place, if f′(α)²η is a square in Z[α] the sum of exponent vectors (e_{a,b,p})_{p∈P} as in Equation (5.17) on page 142 is a vector with all even entries. This condition is equivalent to the usual dependency modulo 2 among the vectors (e_{a,b,p})_{p∈P} (which is a stronger condition than just ∏_{(a,b)∈S} b^d f(a/b) being a square). Furthermore, if f′(α)²η is a square in Z[α] then ∏_{(a,b)∈S} (a − bz_q) is a square modulo q, for any (prime, root) pair (q, z_q) with q prime and f(z_q) ≡ 0 mod q [84, section 8]. But, these are only necessary conditions for f′(α)²η to be a square in Z[α]. As shown in [3], an effective version of the converse is true too: f′(α)²η is most likely a square in Z[α] if the vectors (e_{a,b,p})_{p∈P} are dependent modulo 2, and if ∏_{(a,b)∈S} (a − bz_q) is a square modulo q for sufficiently many (q, z_q) pairs as above for which f′(z_q) ≢ 0 mod q and for which the first degree prime ideal generated by q and z_q − α does not belong to P. Refer to [84, section 8] for the number of (prime, root) pairs that suffices in theory; in practice one commonly uses 64 or 128 pairs. To enforce the condition that ∏_{(a,b)∈S} (a − bz_q) is a square modulo q, for each (q, z_q)-pair each vector (e_{a,b,p})_{p∈P} includes an additional bit with value zero if a − bz_q is a square modulo q and with value one otherwise. It remains to compute f′(α)√η ∈ Z[α].

Computing Square Roots in the Number Field Sieve

Let η = ∏_{(a,b)∈S} (a − bα) ∈ Z[α] be an element of known smooth norm for which it is known that f′(α)²η is the square of an element of Z[α]. Several methods have been proposed to compute the latter element f′(α)√η ∈ Z[α] or just ϕ(f′(α)√η) ∈ Z/nZ (which would suffice for the present application); refer to [324] for a recent discussion.

A direct approach would be to calculate the quadratic polynomial X² − f′(α)²η ∈ Z[α][X] and to factor it over Q(α) using a standard (polynomial-time) method to do so; refer to [84, section 9] and [324] for references and an extensive discussion of this method. Back in the early 1990s it was deemed to be infeasible, but at this point in time it enjoys renewed interest, simply because these days symbolic algebra packages seem to be able to handle the resulting problems (involving rather large coefficients) without too much trouble. If it works, it is certainly quite convenient.

The first method to be used in practice (in [49]) was due to Jean-Marc Couveignes [113]. It requires d to be odd, is based on the use of Chinese remaindering, and produces ϕ(f′(α)√η) ∈ Z/nZ. Let Q be a product of distinct primes such that Q/2 bounds the absolute values of the integer coefficients of f′(α)√η and such that f(X) remains irreducible modulo each q dividing Q. Here it is assumed that such a Q can be found, but see [113] and [84, section 9]. For any prime q dividing Q it is the case that (Z/qZ)[X]/(f(X)) is isomorphic to the finite field F_{q^d} of q^d elements, so that ±f′(α)√η mod q can easily be computed in F_{q^d}. The root f′(α)√η ∈ Z[α] can then be computed by combining (using Chinese remaindering) the roots modulo all primes q dividing Q, where the sign-ambiguity (i.e., which of the two choices modulo q to use) is resolved using norm-calculations and the fact that d is odd.

As shown in [113] (and used in [49]) the calculation of f′(α)√η ∈ Z[α] (with huge coefficients, in absolute value only bounded by Q/2) can be avoided in a neat way and ϕ(f′(α)√η) ∈ Z/nZ can be calculated directly without requiring arithmetic with numbers larger than n². Although it is conceptually quite simple and allows (to a large extent) parallelization, the disadvantage of Couveignes' method is that it works only for odd d and, more importantly, that the effort involved grows quadratically with the number of primes dividing Q.
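The Chinese remaindering step and its Q/2 bound can be illustrated as follows (a sketch with toy numbers; the sign resolution via norms is not shown, and the moduli are made up): combining residues and taking least absolute remainders recovers any integer coefficient of absolute value below Q/2. This assumes Python 3.8 or later for math.prod and the modular inverse via pow.

    from math import prod

    def crt_least_absolute(residues, moduli):
        """Combine residues modulo pairwise coprime moduli and return the
        least absolute remainder, i.e., the representative in (-Q/2, Q/2]."""
        Q = prod(moduli)
        x = 0
        for r, q in zip(residues, moduli):
            Qi = Q // q
            x += r * Qi * pow(Qi, -1, q)   # pow(Qi, -1, q): inverse of Qi mod q
        x %= Q
        return x if 2 * x <= Q else x - Q

    moduli = [101, 103, 107]               # toy stand-ins for the primes dividing Q
    c = -31415                             # a coefficient with |c| < Q/2
    assert crt_least_absolute([c % q for q in moduli], moduli) == c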


Montgomery's Square Root Method

Both disadvantages were addressed by the method proposed in 1994 by Montgomery in [255]. Since its initial development it has not led to any new insights or algorithms, because it is perfectly satisfactory as is: currently it is still the method of choice for practical applications (but see also the combination of the direct approach and Couveignes' method in [324, section 4]). A proper description of the method can be found in Montgomery's own paper [255] and in [269]. Here the following rough description suffices.

Let η = ∏_{(a,b)∈S} (a − bα) ∈ Z[α] be such that f′(α)√η ∈ Z[α], as above, and thus the ideal generated by η equals a product ∏_{p∈P} p^{2e_p} (for integers e_p) of squared first degree prime ideals, with P as in Section 5.5.1 on page 139. Montgomery's square root is an iterative process that builds the desired square root in Z[α] by patiently – and measurably – chipping away parts of η. Initialize the square-root-to-be ς ∈ Z[α] as one. The basic idea is to remove a product of some of the squared prime ideals from η, find a generator in Z[α] of an ideal contained in the product of the (non-squared) ideals, to multiply the square-root-to-be ς by that generator, and to iterate until η has become small enough to further compute its square root directly. This works, except that the newly found generator may contain a factor not contained in the product of ideals, so per iteration η may have to be corrected by the square of the inverse of that spurious factor. To make this correction step less cumbersome, η is constructed in a different way (though equivalent from the point of view of the linear algebra), namely as a quotient of two similar products as above, with approximately equal norms in the numerator and the denominator:

    η = ∏_{(a,b)∈S_num} (a − bα) / ∏_{(a,b)∈S_den} (a − bα),

which leads to

    (η) = ∏_{p∈P_num} p^{2e_p} / ∏_{p∈P_den} p^{2e_p},

with P_num ∪ P_den = P. With ς_num, ς_den ∈ Z[α] and a spurious factor s ∈ Z, all with initial value equal to one, this leads to the following slightly more precise description. If the norm of the ideal ∏_{p∈P_num} p^{2e_p} is small enough, then compute the square root ς directly and replace ς_num by ς_num·ς. Otherwise, let P′ be a subset of P_num such that the ideal I = ∏_{p∈P′} p^{e_p} has a norm in some targeted interval and such that p_s ∈ P′ if s ≠ 1. Identifying I with a lattice (for which a basis is easily constructed given the (prime, root) generators of the first degree prime ideals in P′), a short vector ς in the lattice is found (using, for instance, a basis reduction algorithm [232]). The short vector ς can be interpreted as an element of Z[α] and the ideal (ς) is contained in the ideal I. To check equality of those two ideals, the spurious factor s is replaced by the quotient of the norms of the ideals (ς) and I. If s ≠ 1, then the proper ideal p_s of norm s is located (i.e., p_s·I = (ς)) and P_den is replaced by P_den ∪ {p_s}, with e_{p_s} = 1. Finally, P_num is replaced by P_num \ P′ and ς_num is replaced by ς_num·ς. Once ς_num has been updated, repeat the process with the roles of (P_num, ς_num) and (P_den, ς_den) reversed. The targeted interval for the norm of I (i.e., the choice of P′) is probably best determined empirically. It has been proved that per iteration the loss (i.e., the spurious factor s) is relatively small compared to the gain (i.e., the norm of I), and the method requires effort roughly proportional to the size of P.


element of Z[α] and the ideal (ς ) is contained in the ideal I. To check equality of those two ideals, the spurious factor s is replaced by the quotient of the norms of the ideals (ς ) and I. If s = 1, then the proper ideal ps of norm s is located (i.e., ps I = (ς )) and Pden is replaced by Pden ∪ ps , with eps = 1. Finally, Pnum is replaced by Pnum − P and ςnum is replaced by ςnum ς . Once ςnum has been updated, repeat the process with the roles of (Pnum , ςnum ) and (Pden , ςden ) reversed. The targeted interval for the norm of I (i.e., the choice of P ) is probably best determined empirically. It has been proved that per iteration the loss (i.e., the spurious factor s) is relatively small compared to the gain (i.e., the norm of I), and the method requires effort roughly proportional to the size of P. Heuristic Asymptotic Analysis of the General Number Field Sieve The analysis of the general number field sieve proceeds along the same lines as the analysis of the special number field sieve. The main difference occurs when bounding |bd f ( ab )|, which is here bounded by (d + 1)m max(|a|, b)d = δ ] L[ 23 , ψ]L[ 13 , γ ]d = L[ 23 , ψ + γ δ]. It follows that a total of L[ 13 , β + 2ψ+γ 3β 2 pairs (a, b) must be inspected, which is minimized when 3β = 2ψ + γ δ. With γ = β this leads to the modified quadratic equation 3ψβ 2 − β − 2ψ 2 = 0 with 1 (1 + 1 + 24ψ 3 ) which attains its minimal value a single positive root β = 6ψ 1

1

1

1

) 3 ] with δ = 3 3 . β = ( 98 ) 3 for ψ = ( 13 ) 3 . The overall effort becomes L[ 13 , ( 64 9
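To get a feel for these asymptotic expressions, the sketch below evaluates L[r, c] with the o(1) term ignored (so the numbers are indicative only, and noticeably larger than observed record efforts) at a 768-bit composite:

    from math import exp, log, log2

    def L(r, c, n_bits):
        """L[r, c] at n = 2^n_bits, with the o(1) term ignored."""
        ln_n = n_bits * log(2)
        return exp(c * ln_n**r * log(ln_n) ** (1 - r))

    # indicative effort exponents at 768 bits, special vs. general number field sieve:
    print(log2(L(1 / 3, (32 / 9) ** (1 / 3), 768)))    # about 61 (bits)
    print(log2(L(1 / 3, (64 / 9) ** (1 / 3), 768)))    # about 77 (bits)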

5.5.4 Coppersmith's Modifications

Two variants of the general number field sieve were proposed by Coppersmith in [104]. At this point in time (and in the public domain) neither of these methods has proved to be practical yet, though an obvious adaptation of Coppersmith's second method to special numbers was shown to be practical [210].

Using More Number Fields per Composite

The first method lowers the general number field sieve effort from L[1/3, (64/9)^{1/3}] ≈ L[1/3, 1.9230] to L[1/3, ((92 + 26√13)/27)^{1/3}] ≈ L[1/3, 1.9019] using a conceptually straightforward idea. For any m ≈ n^{1/d} and degree d polynomial f(X) ∈ Z[X] with f(m) ≡ 0 mod n (with coefficients of order m) many similar degree d polynomials can easily be constructed, for instance by adding multiples in Z[X] of X − m to f(X). Assume that λ such polynomials have been selected, giving rise to λ distinct algebraic number fields, say Q(α_1), Q(α_2), ..., Q(α_λ). For any coprime pair a, b of integers for which a − bm is smooth there are λ (assumed to be) independent chances for one of the ideals (a − bα_i) to be smooth (as in Equation (5.17) on page 142, with P replaced by P_i). Per smooth a − bm, as many distinct vectors will result as there are distinct smooth prime ideals, where the vectors are (|P| + ∑_{i=1}^{λ} |P_i|)-dimensional: |P| coordinates for the exponents on the rational side plus |P_i| coordinates for each of the λ number fields, where per relation only one of the λ latter parts contains non-zero entries. In [104] the optimal λ-value is derived (as ≈ L[1/3, 0.1250]) along with the smoothness bounds |P| ≈ L[1/3, 1.9019/2] and |P_i| = |P|/λ for 1 ≤ i ≤ λ, which then leads to the effort cited above. Sieving can be used on the rational side to find the pairs for which a − bm is smooth. With λ > 1 it follows from the relative sizes of the rational and algebraic smoothness bounds that on the algebraic sides sieving has to be replaced by elliptic curve-based smoothness testing (cf. Section 5.2.4 on page 121).

Factorization Factory

This method exploits the idea that distinct composites may share a database of smooth values on the rational side. Actually creating such a database could have severe implications because it would reduce the individual factoring effort to L[1/3, ((20 + 8√6)/9)^{1/3}] ≈ L[1/3, 1.6386], which is getting close to the effort required by the special number field sieve. The catch is that this low effort can only be achieved after a preparatory effort L[1/3, ((12 + 5√6)/3)^{1/3}] ≈ L[1/3, 2.0069] to build the database, and that it requires an amount of permanent storage that is proportional to the individual factoring effort: as it refers to storage, this can only be interpreted as staggering. As above, pairs a, b for which a − bm is smooth may be used for different polynomials, but in the present case the polynomials are targeted at different composites to be factored. Let m = 2^{N/d} for some targeted bit size N, and suppose that in a preparatory sieving step a sufficiently large set S of coprime pairs a, b has been collected for which a − bm is smooth. Any N-bit composite n (which does not have to be known before the preparatory step is carried out) can then be factored by first finding pairs a, b in S for which b^d f(a/b) is smooth (for a polynomial f(X) ∈ Z[X] with f(m) ≡ 0 mod n, constructed in the usual manner), after which the linear algebra and square root steps can be carried out in the usual manner. As mentioned in [104] (and analyzed in detail in [119]), optimization leads to matching relatively small rational and algebraic smoothness bounds (both ≈ L[1/3, 0.8193], proportional to the square root of the individual factoring effort), but a relatively large rational sieving rectangle (of size ≈ L[1/3, 2.0069]) to allow collection of sufficiently many smooth values on the rational side. As above, sieving can not be used on the algebraic side. Refer to [210] for a limited scale application of the factorization factory idea where the roles of the rational and algebraic sides are reversed: two examples are presented of a single special polynomial f(X) that is shared by several Mersenne numbers (for a number of different roots per polynomial).


5.6 Provable Methods

This chapter is concluded with a brief description of the relatively poor state of the art in general purpose factoring algorithms that allow a rigorous analysis. None of the rigorous methods below has ever been proved practical. No general purpose factoring method is known for which the expected asymptotic effort is provably of the form L[r, c] for r < 1/2 (and constant c ∈ R_{>0}). All methods below require effort L[1/2, c], for various constants c ∈ R_{>0}, and they all rely on Pomerance's rigorous version of the elliptic curve-based smoothness test from [283] mentioned in Section 5.2.4 on page 121.

So far, the only general purpose factoring method in this chapter for which the analysis does not involve heuristic arguments is Dixon's random squares method from Section 5.3.1 on page 126 with a provable expected asymptotic factoring effort L[1/2, √2]. Brigitte Vallée has shown in [331] how to improve Dixon's random squares method by still choosing the random integers v almost uniformly but such that the least absolute remainder v² mod n is only of order n^{2/3}. This results in a provable expected factoring effort L[1/2, √(4/3)] (cf. Section 5.2.7 on page 126).

Further lowering the effort seems to require using the approach initiated by Martin Seysen in [309]. It replaces Dixon's random integers v by random quadratic forms of negative discriminant Δ = −n, while still using the familiar two-step approach from Section 5.2.1 on page 117. As informally shown in [233, sections 2.C and 4.10–4.14] smooth forms can be combined (using linear algebra) to produce ambiguous forms, and thereby most likely a factorization of |Δ| = n. Because smoothness of the forms used depends on smoothness of integers of order √n, this leads to factoring effort L[1/2, 1] in the usual manner (cf. Section 5.2.7). The generalized Riemann hypothesis can be used to ensure that sufficiently many small primes p exist for which (Δ/p) = 1 (the only ones that can occur in smooth forms), which then leads to a rigorous but conditional effort L[1/2, 1] [230] (see also [233, 4.14]). The dependence on the Riemann hypothesis was removed by Lenstra and Pomerance in [240]. This is, a quarter of a century later, still the state of the art in provable general purpose factoring.

Acknowledgements

The author thanks Scott Contini, Robert Granger, Thorsten Kleinjung, Richard Schroeppel, and Herman te Riele for their contributions and comments.


6 Polynomial Selection for the Number Field Sieve

Thorsten Kleinjung

In this chapter the development of the polynomial selection step of the number field sieve is described, emphasising Peter Montgomery’s contributions.

6.1 The Problem

Given an integer n to be factored, the very first step of the number field sieve consists in choosing two coprime polynomials f1, f2 ∈ Z[x], each of whose coefficients are also coprime, such that the polynomials have a common root r modulo n (equivalently, that the resultant of f1 and f2 is a non-zero multiple of n). In the following step, the sieving stage, one searches for sufficiently many relations, i.e., coprime pairs (a, b) ∈ Z² such that both values f_i(a/b)·b^{deg f_i}, i = 1, 2, are L-smooth, i.e., split into primes below L, for some parameter L. The running time of the sieving stage (and, in general, of the number field sieve) is determined by the number of pairs (a, b) one needs to inspect, which in turn depends on the choice of the polynomials. Therefore it is important to carefully select a polynomial pair so that the running time of the number field sieve computation is minimised, as much as is practically possible. In the following it is assumed that the polynomials f1 and f2 are both irreducible; otherwise one can split n non-trivially or one can replace the polynomials by divisors that have r as a root modulo n, since the polynomial values for a pair (a, b) will be replaced by divisors and thus remain smooth.
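These definitions are easy to make concrete. The toy sketch below (ours, with made-up numbers) builds a valid pair f1, f2 with a common root r modulo n and tests whether a pair (a, b) yields a relation, i.e., whether both homogeneous values are L-smooth:

    def homogeneous_value(f, a, b):
        """f(a/b) * b^deg(f) for f given as coefficient list [f_0, ..., f_d]."""
        d = len(f) - 1
        return sum(fi * a**i * b**(d - i) for i, fi in enumerate(f))

    def is_smooth(v, L):
        """Trial division: does |v| split into primes below L?"""
        v = abs(v)
        if v == 0:
            return False
        for p in range(2, L):
            while v % p == 0:
                v //= p
        return v == 1

    n = 10007 * 10009                # hypothetical toy composite
    r = 123456                       # chosen common root modulo n
    f2 = [-r, 1]                     # f2(x) = x - r
    f1 = [-(r * r % n), 0, 1]        # f1(x) = x^2 - (r^2 mod n), so f1(r) ≡ 0 (mod n)
    assert homogeneous_value(f1, r, 1) % n == 0
    assert homogeneous_value(f2, r, 1) % n == 0

    a, b, L = 11, 7, 1000            # (a, b) is a relation iff both tests succeed
    print(is_smooth(homogeneous_value(f1, a, b), L),
          is_smooth(homogeneous_value(f2, a, b), L))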

6.2 Early Methods

Since the goal of polynomial selection is to minimise the running time of the number field sieve and since it is easy to produce polynomial pairs satisfying the conditions given in the previous section, for instance, f1 = x + n and f2 = x, it is important to have some means to compare (or to assess) polynomial pairs. Obviously, a more precise assessment is more costly than an imprecise one so that, if the number of produced polynomial pairs is too big, a quick imprecise assessment method is needed (or even a chain of methods with increasing accuracy and decreasing speed). The most accurate method, namely executing the remaining steps of the number field sieve, is impractical but can be altered into executing a small but representative fraction of the sieving stage and counting the number of relations. This is still very expensive and can only be used for comparing a small number of polynomial pairs. Faster assessment methods can be devised by observing that the quality of a polynomial pair seems to be mainly influenced by two properties, namely by the size of the coefficients and by the number of roots of the polynomials modulo small primes. Taking into account both properties gives a method (cf. Section 6.7 on page 171) which is faster than the sieving based assessment from above but which is not fast enough for processing a huge set of candidates. A much faster method is to focus only on the size of the coefficients by computing ‖f1‖∞·‖f2‖∞, the product of the maximum norms of the polynomials. This assessment function (and its variant in Section 6.5 on page 168) is considered in the following sections until Section 6.7. Notice that it only gives a meaningful result if the degrees of the polynomials are fixed; polynomial pairs of different degree pairs must be compared by other means, e.g., the sieving based assessment method described above. In the first years after the introduction of the number field sieve the standard procedure was to consider the product of the norms followed by a sieving based assessment for the best polynomial pairs.

The first and most natural method of constructing polynomial pairs consists of setting f2 = x − m for some positive integer m < n, writing n in base m, i.e., n = ∑_{i=0}^{d} a_i m^i with 0 ≤ a_i < m and d minimal, as well as setting f1 = ∑_{i=0}^{d} a_i x^i so that the polynomials are coprime and have the common root m modulo n. It can be assumed that the coefficients of f1 are coprime, otherwise their greatest common divisor is a non-trivial divisor of n. Usually one wants to find f1 of a given degree d, which can be achieved by choosing m appropriately. In the early days of the number field sieve m was chosen to be slightly smaller than n^{1/d} which results in a monic degree d polynomial f1, with the monicity simplifying some parts of the later stages of the number field sieve. This gives coefficients of size about n^{1/d} for each polynomial, so the product of the maximum norms is about n^{2/d}. Heuristically and informally, the size of the polynomial f1 can be reduced by generating many polynomial pairs in this way and hoping to encounter one with small coefficients. More precisely, let C > 1 (C stands for cost) be the number of polynomial pairs one is willing to examine, i.e., for C random choices of m slightly smaller than n^{1/d} the corresponding monic polynomial f1 is computed and its maximum norm is examined. It is assumed that C is not too big, e.g., C < n^{1/(2d)} would do, which is no restriction in practice for integers for which the number field sieve is used. One expects to find (within cost C) one value of m for which the coefficients a_0, ..., a_{d−1} of the corresponding f1 are bounded by C^{−1/d}·n^{1/d}, thus reducing the product of the maximum norms by a factor of C^{1/d}.

More important was the size reduction by choosing m slightly bigger than n^{1/(d+1)}, thus giving rise to a non-monic f1. The sizes of the coefficients of f1 and f2 are about n^{1/(d+1)} so that the product of the maximum norms is about n^{2/(d+1)}, which is smaller than C^{−1/d}·n^{2/d} as C < n^{1/(2d)} is assumed. As above this method can be improved by repeatedly choosing random m and selecting the m that gives rise to the polynomial f1 with the smallest maximum norm. In order to increase the chance of hitting an f1 with small coefficients one observes that by choosing m = ⌊(n/a_d)^{1/d}⌋ for a given leading coefficient a_d the size of the coefficient a_{d−1} in the base m expansion is about the size of a_d (factors like 2^d are considered to be negligible since d is assumed to be small). Indeed, for m ≥ d one has, with (n/a_d)^{1/d} = m + μ, 0 ≤ μ < 1,

    0 ≤ a_{d−1} ≤ (n − a_d·m^d)/m^{d−1} = a_d·((m + μ)^d − m^d)/m^{d−1} < a_d·m·((1 + 1/m)^d − 1) < a_d·m·(e^{d/m} − 1) < a_d·(e − 1)·d.

For the analysis of the expected gain let, as above, C > 1 be the number of polynomial pairs one is willing to examine and let c = C^{1/(d²−1)}. Choosing a_d near c^{−d}·n^{1/(d+1)} and m as above, the values of the d − 1 coefficients a_0, ..., a_{d−2}, which are a priori of size c·n^{1/(d+1)}, will be of size a_d with probability c^{−(d+1)(d−1)} = C^{−1} so that after about C trials one expects to find a polynomial pair with ‖f1‖∞ = c^{−d}·n^{1/(d+1)}. Since ‖f2‖∞ = c·n^{1/(d+1)}, the product of the maximum norms can be decreased for d > 1; with an effort of C a factor of C^{1/(d+1)} can be gained. In all methods described so far a factor of about 2 can be gained by allowing signed coefficients in the base m expansion.

There are two further methods from the early days which were not really used at that time since it was not known how to exploit them, but which became important later on (cf. Section 6.8 on page 173). One consists in choosing a non-monic f2 = lx − m and adapting the base m expansion to a base m/l expansion, thus resulting in a larger supply of linear polynomials. Since monic f2 already provided more than enough polynomials, there was no need to consider non-monic f2. The other method used lattice reduction to find, given f2 = x − m (or f2 = lx − m), polynomials f1 with small coefficients and the same root modulo n as f2. In this method the size of the coefficients is about n^{1/(d+1)}, which is the same as above.
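As an illustration of the base m construction (a sketch of ours, not from the original text; the example numbers are made up, and m is chosen slightly above n^{1/(d+1)} as in the non-monic variant above):

    def base_m_polynomial(n, d, m):
        """Coefficients [a_0, ..., a_d] of f1 from the base m expansion
        n = sum(a_i * m^i); together with f2(x) = x - m this is a valid pair."""
        coeffs = []
        for _ in range(d + 1):
            coeffs.append(n % m)
            n //= m
        assert n == 0                       # m^(d+1) > n, so d+1 digits suffice
        return coeffs

    d = 4
    n = 2**91 - 1                           # toy composite (91 = 7 * 13, so 127 divides n)
    m = int(n ** (1 / (d + 1))) + 1         # slightly above n^(1/(d+1)): non-monic f1
    f1 = base_m_polynomial(n, d, m)
    assert sum(c * m**i for i, c in enumerate(f1)) == n   # f1(m) = n ≡ 0 (mod n)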

6.3 General Remarks

Before proceeding with Montgomery's first involvement in polynomial selection, a counting argument along the lines of [84, section 12] is described that computes what coefficient sizes one can expect for given degrees. Let d_i = deg f_i, i = 1, 2, be fixed, let M be an integer and let c_i ≥ 0, i = 1, 2. The goal is to compute the expected number of triples (n, f1, f2) such that n is in the interval [M, 2M], (f1, f2) is a valid polynomial pair for n and (the absolute values of) the coefficients of f_i are bounded by M^{c_i}. The number of polynomial pairs (f1, f2) satisfying the above restriction on the coefficient sizes is Θ(M^{(d1+1)c1+(d2+1)c2}). If one restricts to pairs for which f1 and f2 each have coprime coefficients and for which gcd(f1, f2) = 1, the number of pairs is still Θ(M^{(d1+1)c1+(d2+1)c2}). This number must be at least Θ(M) in order to obtain a valid polynomial pair for each n in the interval [M, 2M], which gives the condition

    (d1 + 1)·c1 + (d2 + 1)·c2 ≥ 1.    (6.1)

The resultant of f1 and f2 can be bounded by O(M^{d2·c1 + d1·c2}) where the O-constant depends on d1 and d2. Since the resultant must be divisible by n ≥ M, this gives a second condition

    d2·c1 + d1·c2 ≥ 1.    (6.2)

If this second condition is satisfied, the number of triples (n, f1, f2) as above is expected to be Θ(M^{(d1+1)c1+(d2+1)c2}), i.e., for any n in the interval [M, 2M] one expects to find on average Θ(M^{(d1+1)c1+(d2+1)c2−1}) valid polynomial pairs. Heuristically, for the pairs (c1, c2) ∈ R²_{≥0} satisfying the two inequalities above it is possible, for an integer n, to find polynomial pairs with coefficient bounds n^{c_i}. This region is defined by the points in the first quadrant lying above the two lines given by the equality cases of the two conditions (cf. Figure 6.1 on page 165). Since smaller coefficients are assumed to be better, the interesting part consists of the pairs lying on one (or both) lines. Notice the different nature of Conditions (6.1) and (6.2), with the former being based on an elementary counting argument and the latter being imposed by the common root requirement. The line corresponding to (6.1) marks the border at which one can expect to find polynomial pairs for every integer. Below this line only a fraction of these integers admits polynomial pairs; this is the realm of the special number field sieve which is not further considered here.


[Figure 6.1 (plot omitted; axes c1, c2; in-figure labels: SNFS region, forbidden region, P = (3/28, 5/28), (1/6, 1/6)).]

Figure 6.1 For the case d1 = 5, d2 = 1 the two lines corresponding to the equality cases of Conditions (6.1) and (6.2) on page 164 as well as their intersection P are depicted. The region above the bold part of the lines satisfies both inequalities. Moreover, the point (1/6, 1/6) corresponding to the method from Section 6.2 on page 161 is featured.

The other line is, however, a hard bound and no polynomial pair can exist below it. In the case d1 ≠ d2 the two lines intersect in the point

P = ( (d1 − (d2 + 1)) / (d1(d1+1) − d2(d2+1)) , ((d1 + 1) − d2) / (d1(d1+1) − d2(d2+1)) ) ∈ R²≥0.

For d1 > d2 the interesting part consists of the line segment between the points (0, 1/(d2+1)) and P, and the line segment between the points P and (1/d2, 0); similarly, for d1 < d2 it consists of the line segment between the points (0, 1/d1) and P, and the line segment between the points P and (1/(d1+1), 0) (note that one of the segments has length zero if d1 and d2 differ by 1). From the slopes of the two lines it follows that the minimum of c1 + c2 is 2/(d1+d2+1) and it is attained at P; thus the region near this point is the most interesting one. The polynomial selection method of the preceding section achieves (at cost C = 1) in the case d2 = 1 (assuming d1 > 1) coefficient sizes corresponding to the point (1/(d1+1), 1/(d1+1)), which lies on the second line but is not P. Hence it might be advantageous to try to move the coefficient sizes towards P, which is exactly what was done in the preceding section by considering many polynomial pairs. However, attaining the point P results in an exponential cost¹ with the current state-of-the-art algorithms. For the case d1 = 5, d2 = 1 the situation is depicted in Figure 6.1.

¹ This is no longer true if n is prime, i.e., if the number field sieve is used for computing discrete logarithms in Fn. The method in [197] addresses the case d1 = d2 + 1 and attains the point P = (0, 1/d1) by picking a random polynomial f1 with coefficients of size O(1) and computing its roots modulo n (here the primality of n is used). Then, using lattice reduction, each root allows one to construct polynomials f2 with coefficients of size O(n^{1/d1}).


[Figure 6.2 For each of the cases d1 = 3, d2 = 3 (normal line), d1 = 4, d2 = 2 (dashed lines) and d1 = 2, d2 = 4 (dotted lines) the two lines corresponding to the equality cases of Conditions (6.1) and (6.2) on page 164 are depicted. The bold parts of the dashed and dotted lines (connecting the highlighted intersection points) are below the line corresponding to d1 = d2 = 3.]

In the case d = d1 = d2 Condition (6.2) implies Condition (6.1). Therefore the interesting part consists of the line segment between the points (0, 1/d) and (1/d, 0), on which c1 + c2 takes the value 1/d, and one expects to find Θ(n^{1/d}) polynomial pairs on average. It can be argued that d = d1 = d2 is a bad choice of degrees, since the region satisfying the two inequalities for d = d1 = d2 is a proper subset of the union of the corresponding regions for degrees (d + 1, d − 1) and for degrees (d − 1, d + 1); for the case d = 3 this is illustrated in Figure 6.2 above. However, due to the absence of polynomial selection algorithms attaining the point P in acceptable time, the case d = d1 = d2 is still of interest.
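To make these quantities concrete, the following small Python sketch (our own illustration, not part of the original text; the function name is ours) computes the intersection point P and the minimal value of c1 + c2 with exact rational arithmetic. For d1 = 5, d2 = 1 it reproduces the point P = (3/28, 5/28) from Figure 6.1.

from fractions import Fraction

def intersection_P(d1, d2):
    # intersection of (d1+1)c1 + (d2+1)c2 = 1 and d2*c1 + d1*c2 = 1 (d1 != d2)
    D = d1 * (d1 + 1) - d2 * (d2 + 1)
    return (Fraction(d1 - (d2 + 1), D), Fraction((d1 + 1) - d2, D))

for d1, d2 in [(5, 1), (4, 2), (2, 4)]:
    c1, c2 = intersection_P(d1, d2)
    print(f"d1={d1}, d2={d2}: P = ({c1}, {c2}), c1+c2 = {c1 + c2}"
          f" (= 2/(d1+d2+1) = {Fraction(2, d1 + d2 + 1)})")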

6.4 A Lattice Based Method

In 1993 Montgomery [85] addressed the case d = d1 = d2 and described a polynomial selection method using lattices. For d = 2, i.e., two quadratic polynomials, this method produces optimal polynomials in the sense of the preceding section, namely c1 + c2 = 1/2.


By identifying polynomials in Z[x] of degree at most d with vectors in Z^{d+1} via f = Σ_{i=0}^{d} ai x^i ↦ (a0, …, ad)^T, it is easy to see that the set of such polynomials having r modulo n as a root is a lattice Lr ⊂ Z^{d+1} of covolume n. Moreover, polynomials whose absolute values of the coefficients are bounded by O(n^c) correspond to vectors of length (with respect to the Euclidean norm) O(n^c). Therefore, in the case of equal degrees d = d1 = d2, the polynomial selection problem can be rephrased in terms of lattices. The task is to find a lattice Lr such that it contains two short independent vectors for which the corresponding polynomials are coprime. In general, the latter condition is satisfied, but it can be violated in special situations; e.g., if there exists a polynomial f of degree smaller than d with small coefficients such that f(r) ≡ 0 (mod n), then f and xf correspond to short vectors violating the coprimality condition.

For any (d + 1)-dimensional lattice of covolume n a basis v1, …, v_{d+1} satisfying Π_{i=1}^{d+1} |vi| = O(n), with the O-constant depending on d, can, for instance, be found using the lattice basis reduction method from [232]. Therefore, for any r it is possible to find two vectors in Lr with product of their lengths O(n^{2/(d+1)}); in general one also expects that the two vectors have length O(n^{1/(d+1)}) (as well as all other basis vectors). Since this is optimal in the case d = 1 (cf. final paragraph of Section 6.3 on page 166; note that the coprimality condition is satisfied), d ≥ 2 is assumed from now on.

If a lattice Lr admits a basis with two short basis vectors, i.e., the product of their lengths being much smaller than n^{2/(d+1)}, then the other d − 1 basis vectors must be longer on average. One way to achieve this is to construct Lr such that at least one of the vectors in a reduced basis is long, by stipulating the existence of a non-zero linear form λ : Lr → Z with small coefficients, i.e., a short vector in the dual lattice Lr*. Via the standard identification of the dual of Q^{d+1} with Q^{d+1}, the lattice Lr* is generated by Z^{d+1} and the vector w = (1/n)(1, r, …, r^d)^T. Then a vector of length O(n^{−z}) in Lr* ensures the existence of a vector of length at least of order n^z in a reduced basis, which implies the existence of two basis vectors with product of their lengths O(n^{2(1−z)/d}). For z > 1/(d+1) this is shorter than in the simple construction above.

In order to find a lattice Lr such that its dual lattice contains a short vector, one can set r ≡ t/u (mod n) with t = O(n^{1/d}), u = Θ(n^{1/d}), so that the first d coordinates of u^{d−1}w plus an appropriate vector in Z^{d+1} are equal to (1/n)(t^i u^{d−1−i}) for i = 0, …, d − 1, thus are O(n^{−1/d}). If u divides t^d − n (or t^d + hn for some small integer h ≠ 0), the last coordinate is (t^d − n)/(nu) = O(n^{−1/d}). Thus the product of the two shortest basis vectors is O(n^{2(d−1)/d²}), i.e., if the coprimality condition is satisfied, one gets c1 + c2 ≤ 2(d−1)/d² (and in general c1 = c2 = (d−1)/d²). Finding t and u such that u divides t^d − n can be done in many ways; Montgomery suggested choosing u as a prime such that n is a d-th power modulo u.


For d = 2 this gives the optimal bound 1/2 for c1 + c2 (cf. Section 6.3 on page 164); notice that if the coprimality condition is not satisfied, one can split n. Moreover, the number of polynomial pairs of this construction is heuristically Θ(n^{1/2}), which coincides with the expected value from the preceding section (after a slight modification of the construction, namely replacing the prime u by a product of primes for all of which n is a quadratic residue). In the early 1990s this method, with the improvement described in the next section, was used for several at that time large factoring projects [256], [142].
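The following toy sketch illustrates the d = 2 construction on a small composite n. It is entirely our own: it uses a minimal textbook LLL reduction written for clarity rather than speed, a Fermat-style primality filter that is only adequate for a toy, and it omits the coprimality and resultant checks a real implementation would perform.

from fractions import Fraction

def lll(B, delta=Fraction(3, 4)):
    # minimal LLL reduction of an integer basis B (list of row vectors)
    n = len(B)
    dot = lambda u, v: sum(Fraction(a) * b for a, b in zip(u, v))
    def gso():
        Bs, mu = [], [[Fraction(0)] * n for _ in range(n)]
        for i in range(n):
            v = [Fraction(x) for x in B[i]]
            for j in range(i):
                mu[i][j] = dot(B[i], Bs[j]) / dot(Bs[j], Bs[j])
                v = [a - mu[i][j] * c for a, c in zip(v, Bs[j])]
            Bs.append(v)
        return Bs, mu
    k = 1
    while k < n:
        Bs, mu = gso()
        for j in range(k - 1, -1, -1):          # size reduction
            q = round(mu[k][j])
            if q:
                B[k] = [a - q * c for a, c in zip(B[k], B[j])]
                Bs, mu = gso()
        if dot(Bs[k], Bs[k]) >= (delta - mu[k][k - 1] ** 2) * dot(Bs[k - 1], Bs[k - 1]):
            k += 1                              # Lovasz condition holds
        else:
            B[k], B[k - 1] = B[k - 1], B[k]
            k = max(k - 1, 1)
    return B

n = 1000003 * 2000003                           # toy number to be "factored"
root = round(n ** 0.5)
# u prime (Fermat test suffices for a toy), u = 3 (mod 4), n a square mod u
u = next(u for u in range(root, 2 * root)
         if u % 4 == 3 and pow(2, u - 1, u) == 1 and pow(n, (u - 1) // 2, u) == 1)
t = pow(n, (u + 1) // 4, u)                     # t^2 = n (mod u), hence u | t^2 - n
assert (t * t - n) % u == 0
r = t * pow(u, -1, n) % n                       # the common root r = t/u (mod n)
# lattice of quadratics a0 + a1*x + a2*x^2 having the root r modulo n
f1, f2, f3 = lll([[n, 0, 0], [-r, 1, 0], [(-r * r) % n, 0, 1]])
for f in (f1, f2):                              # generically the two short vectors
    assert (f[0] + f[1] * r + f[2] * r * r) % n == 0
print("n^(1/4) ~", round(n ** 0.25), "; f1 =", f1, "; f2 =", f2)

On this toy input the two short quadratics indeed have coefficients around n^{1/4}, while the third basis vector carries the length forced by the dual construction.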

6.5 Skewness

In the sieving stage of the number field sieve a certain set of pairs (a, b) ∈ [−A, A] × [1, B] ∩ Z² for suitable A and B is processed; this set of pairs is called the sieving region. Line sieving processes all pairs in [−A, A] × [1, B] ∩ Z² by first considering the pairs with b = 1, then b = 2 and so on, where each change of b involves some overhead. Additionally, if A is small compared to the size of the primes in the factor base, the procedure becomes less efficient. Therefore it is desirable to keep E = AB constant while decreasing B in order to speed up sieving, if this can be achieved without increasing the values of the polynomials f1 and f2 over the sieving region.

If a pair of quadratic polynomials is selected as described in the preceding section, they will, in general, have coefficients of size n^{1/4}, so most polynomial values for A = B = √E will be of size E n^{1/4}. Thus, by changing A and B as above most polynomial values will be of size (A/B) E n^{1/4}, which is bigger by a factor of S = A/B, called the skewness, due to the increased contribution of the leading terms. There is an efficient procedure to fix this problem by demanding that the leading coefficients are of size S^{−1} n^{1/4}, the middle coefficients of size n^{1/4} and the constant coefficients of size S n^{1/4}; such polynomials are called skewed polynomials. In the following it is supposed that S ≥ 1 holds; the case S ≤ 1 is symmetric.

These considerations led Montgomery to adapt his lattice based method, described in the previous section, to skewed polynomials by changing the inner product from ⟨y, z⟩ = Σ_{i=1}^{d+1} yi zi to ⟨y, z⟩_S = Σ_{i=1}^{d+1} S^{2(i−1)−d} yi zi, where yi resp. zi is the i-th coordinate of y resp. z and d = d1 = d2 as in the previous section. For the dual lattice the inner product has to be changed to ⟨·, ·⟩_{S^{−1}}, and in the construction of r one has to choose u = Θ(S^{−1} n^{1/d}). Care has to be taken to not choose S too large, since otherwise the vector corresponding to ux − t will occur in the reduced lattice basis; for more details cf. [291] and also [260], [115]. In the case d = 2 any skewness S = O(n^{1/4−ε}) with 0 ≤ ε ≤ 1/4 works and produces


polynomial pairs as asked for above. Asymptotically, as well as in practice for integers n of, say, more than 40 digits, E is smaller than n^{1/4}, so that one can even choose S = E, resulting in B = 1, i.e., a sieving region consisting of a single line, which almost completely removes the overhead. Montgomery noticed as well that lattice sieving, a more efficient but more complicated sieving variant, can benefit significantly from squeezing the sieving region to a single line, thus making it extremely efficient in this case. Details and reports of some computations can be found in [142] (beware, in that paper lattice sieving for B = 1 is called line sieving).

The reader may have noticed that the number of polynomial pairs obtained by the construction above with d = 2 is heuristically Θ(S^{−1} n^{1/2}), which becomes smaller as S gets bigger. However, each polynomial pair can be expanded into Θ(S) polynomial pairs of similar coefficient sizes by translating them by an integer h of size up to O(S), i.e., replacing (f1, f2) by (f1(x + h), f2(x + h)). These translated polynomial pairs do not provide new information since the translation corresponds to a shear mapping of the sieving region. Thus, for a fixed skewness S, one gets heuristically again Θ(n^{1/2}) skewed polynomial pairs, although only Θ(S^{−1} n^{1/2}) of them are essentially different.

This is a general phenomenon, as will be explained by revisiting the discussion from Section 6.3 on page 164 in the light of skewness. For a given skewness S = M^s (with the notation from that section, i.e., n being in the interval [M, 2M]) the S-maximum norm of a polynomial f = Σ_{i=0}^{d} ai x^i ∈ Z[x] of degree d is defined as ‖f‖_{S,∞} = max_i(|ai| S^{i−d/2}). Since the existence of a degree d polynomial with ‖f‖_{S,∞} ≤ M^c for c ∈ R implies S^{d/2} ≤ M^c, fixing a skewness S = M^s entails the bound sd/2 ≤ c. As before let di = deg fi, i = 1, 2, be fixed and let ‖fi‖_{S,∞} ≤ M^{ci} for i = 1, 2, so that the number of such polynomial pairs (f1, f2) is again Θ(M^{(d1+1)c1+(d2+1)c2}). On average, translation by an integer of size O(S) does not change the S-maximum norms of f1 and f2 by much, and it does not change the resultant of f1 and f2. Therefore the polynomial pairs are clustered into classes having the same resultant, so that one needs at least Θ(MS) polynomial pairs in order to obtain a valid polynomial pair for each n in the interval [M, 2M]. Thus Condition (6.1) on page 164 becomes

(d1 + 1)c1 + (d2 + 1)c2 ≥ 1 + s.    (6.3)

Condition (6.2) on page 164 is not affected by the skewness and remains the same. In the case d = d1 = d2 Condition (6.2) implies Condition (6.3), since c1 + c2 ≥ 2 · sd/2 = sd ≥ s. Therefore one can expect to find Θ(M^{1/d − s}) essentially different polynomial pairs with c1 + c2 = 1/d whenever ci ≥ sd/2, i = 1, 2. For d1 = d2 = 2 this is exactly what was observed above.
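As a small illustration (our own, with made-up numbers), the following Python lines compute the S-maximum norm and check that translating by h = O(S) changes the norm only by a small factor while leaving the resultant with f2 = x − m unchanged; for linear f2 the resultant is just f1(m).

def s_norm(f, S):
    # S-maximum norm max_i |a_i| * S^(i - d/2) of f = sum a_i x^i, d = deg f
    d = len(f) - 1
    return max(abs(a) * S ** (i - d / 2) for i, a in enumerate(f))

def translate(f, h):
    # coefficient list of f(x + h), computed by Horner: g <- g*(x + h) + a
    g = [0] * len(f)
    for a in reversed(f):
        g = [(g[i - 1] if i > 0 else 0) + h * g[i] for i in range(len(g))]
        g[0] += a
    return g

S, h, m = 100, 57, 10 ** 6
f1 = [31415926, 271828, 3141]      # skewed quadratic: |a_i| roughly a_2 * S^(2-i)
f1_t = translate(f1, h)
print(s_norm(f1, S), s_norm(f1_t, S))              # comparable sizes
res = lambda f, mm: sum(a * mm ** i for i, a in enumerate(f))
assert res(f1, m) == res(f1_t, m - h)              # resultant with x - m unchanged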


In the general case d1 > d2 the two lines given by the equality cases of Conditions (6.3) and (6.2) intersect in the point

Ps = ( (d1(1+s) − (d2+1)) / (d1(d1+1) − d2(d2+1)) , ((d1+1) − d2(1+s)) / (d1(d1+1) − d2(d2+1)) ).

For s = 0, i.e., skewness S = 1, this is the point P from Section 6.3 on page 164, and for increasing s it moves to the right on the line associated to Condition (6.2). For s > s1 = (2d1 − 2d2 − 2) / (d1(d1(d1+1) − d2(d2+1) − 2)) the first coordinate of Ps violates c1 ≥ sd1/2, and for larger s it also violates c2 ≥ sd2/2. Thus for s1 < s ≤ s2 = 1/(d1 d2) the optimal point (with respect to minimising c1 + c2) is (sd1/2, 1/d1 − sd2/2), and for s > s2 it is (sd1/2, sd2/2). When considering these optimal points, the value of c1 + c2 is minimal for s = 0, increases slightly for 0 < s ≤ s1 and more rapidly for s > s1. Similarly, the expected number of polynomial pairs is Θ(1) for s = 0 and grows with s.

These considerations suggest that searching for skewed polynomial pairs is a bad idea. However, as remarked above, there are no known methods for attaining the optimal point P (resp. Ps), so that using skewness might be a good idea, and, as shown in the next section, it turns out that using skewness is indeed very useful. Moreover, assessing the quality of the polynomial pair via the value c1 + c2 is adequate for asymptotic considerations but is a very rough assessment in practice, where one wants to distinguish between c1 and c2 as well as include the number of roots modulo small primes. The impact of using skewness on the latter is discussed in Section 6.7.

6.6 Base m Method and Skewness

Once the concept of skewness has been introduced it is relatively easy to include it in the base m method. In the following the notation from Section 6.2 on page 161 is used, in particular d = d1, d2 = 1, and f2 = x − m. For simplifying the presentation it is also assumed that d > 3; the case d = 3 can be handled with some modifications, although it is probably irrelevant in practice.

As before, many polynomial pairs are generated by choosing ad smaller than n^{1/(d+1)}, picking m near (n/ad)^{1/d} so that the coefficient a_{d−1} is of size ad, and then checking whether the remaining coefficients are small enough with respect to some previously fixed skewness S. More precisely, denote the cost, i.e., the number of polynomial pairs to be inspected, by C (again assuming that C is not too big), set S = C^{2/((d−3)(d−2))}, and choose ad near (n S^{(1−d)d})^{1/(d+1)}. This choice follows from stipulating that a_{d−i}, i = 2, …, d − 1, is bounded by ad S^i, that the bound ad S^{d−1} for a1 is of size m ≈ (n/ad)^{1/d}, and that these bounds are satisfied with probability C^{−1}. Thus, after having checked about C coefficients ad, one can expect to find one polynomial pair satisfying

‖f1‖_{S,∞} ≤ C^{−d/((d−2)(d+1))} n^{1/(d+1)}  and  ‖f2‖_{S,∞} ≤ C^{1/((d−2)(d+1))} n^{1/(d+1)},

i.e., ‖f1‖_{S,∞} ‖f2‖_{S,∞} ≤ C^{−(d−1)/((d−2)(d+1))} n^{2/(d+1)}. In other words, with an effort of C one can gain a factor of C^{(d−1)/((d−2)(d+1))}, which is better than the factor of C^{1/(d+1)} in the unskewed base m method. Notice that the basic operations involved in checking a polynomial pair are the same as in the unskewed base m method, so this speedup carries over to practice.
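A sketch of these parameter choices in Python (the function names and the toy numbers are ours; a real implementation would loop over many values of ad rather than accept the first expansion):

def skewed_base_m_params(n, d, C):
    # S = C^(2/((d-3)(d-2))) and a_d ~ (n * S^((1-d)d))^(1/(d+1)), as in the text
    S = C ** (2 / ((d - 3) * (d - 2)))
    ad = round((n * S ** ((1 - d) * d)) ** (1 / (d + 1)))
    m = round((n / ad) ** (1 / d))
    return S, ad, m

def base_m_expansion(n, m, d):
    # centered digits a_0, ..., a_d with n = sum a_i m^i
    a = []
    for _ in range(d):
        r = n % m
        if r > m // 2:
            r -= m
        a.append(r)
        n = (n - r) // m
    a.append(n)
    return a

n, d, C = 10 ** 30 + 57, 5, 100
S, ad, m = skewed_base_m_params(n, d, C)
f1 = base_m_expansion(n, m, d)
assert sum(a * m ** i for i, a in enumerate(f1)) == n
print(f"S ~ {S:.2f}, m = {m}, a_5 = {f1[5]}, a_4 = {f1[4]} (both far below m)")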

6.7 Root Sieve

Using skewed polynomials provides another advantage, namely that it is possible to increase the number of roots of the polynomials modulo small primes. Montgomery contributed substantially to the development of the algorithm described below (and presented in more detail in [266], cf. also [91]).

With the parameter choice as in the previous section, the bound for ‖f1‖_{S,∞} implies that the coefficient a1 can be of size m and that the coefficient a0 can be of size Sm. Therefore one can expect to find about Θ(S) integers i such that ‖f1 + i f2‖_{S,∞} ≤ C^{−d/((d−2)(d+1))} n^{1/(d+1)}. This provides Θ(S) polynomial pairs (f1 + i f2, f2) which each have approximately the same S-maximum norm as the original pair (f1, f2). Notice that this is a different type of amplifying polynomial pairs than the translations x ↦ x + h discussed in Section 6.5 on page 168. The main difference is that translating conserves the number of roots modulo a prime, whereas adding a multiple of f2 often changes this number. Thus, by considering the polynomials f1 + i f2 for integers i = O(S), the quality of a polynomial pair can be improved at cost O(S), which is O(C) for d = 4 and o(C) for d > 4.

Before describing a procedure for inspecting this set of polynomials it is necessary to know how the number of roots modulo small primes influences the quality of a polynomial pair. In order to simplify the discussion, assume that sieving is done with only one polynomial f of degree d and that the polynomial values f(a/b)b^d over the sieving region do not depend much on the polynomials to be assessed, e.g., all polynomials to be assessed have the same degree d and the sizes of their coefficients do not vary by much. Let p be a (small) prime that neither divides the leading coefficient nor the discriminant of f. This condition implies that the number of roots of f modulo p^k does not depend on k ≥ 1; denote this number


by n_p(f). Therefore, exactly (p − 1)p^{k−1} n_p(f) of the (p² − 1)p^{2k−2} polynomial values f(a/b)b^d where 0 < a, b < p^k, p ∤ gcd(a, b), are divisible by p^k, whereas for the same number of random values one expects (p² − 1)p^{k−2} of them to be divisible by p^k. Summing over all k ≥ 1, one expects that a polynomial value in the sieving region is divisible by p^{p n_p(f)/(p²−1)} on average (geometric mean) and that a random value is divisible by p^{1/(p−1)} on average. This suggests to define α_p(f) = (1/(p−1) − p n_p(f)/(p²−1)) log p. For primes dividing the leading coefficient or the discriminant of f a similar (slightly more complicated) definition can be derived, and one sets

α(f) = Σ_{p < P, p prime} α_p(f)    (6.4)

where P is a small bound, e.g., P = 1000. Notice that the sum Σ_{p prime} α_p(f) converges [28] and that the contribution of p ≥ P can be neglected in practice. The interpretation of α(f) is that on average the P-smooth part of a polynomial value in the sieving region is e^{−α(f)} times the P-smooth part of a random value of similar size. This suggests to use ‖f‖_{S,∞} e^{α(f)} for measuring the quality of a polynomial f.

Notice that α(f) is constant and positive for linear polynomials f, so that the corresponding values f(a/b)b^d behave slightly worse than random integers with respect to the number of roots modulo small primes. Indeed, one has n_p(f) = 1, so that (p − 1)p^{k−1} of the (p² − 1)p^{2k−2} polynomial values considered above are divisible by p^k, which is less than the (p² − 1)p^{k−2} values for random integers. Since f2 is linear in the base m method, the function ‖f1‖_{S,∞} ‖f2‖_{S,∞} e^{α(f1)} can be used for assessing polynomial pairs produced by this method. More functions for determining the quality of polynomial pairs can be found in [266].

Returning to the set of polynomial pairs (f1 + i f2, f2), one notices that ‖f2‖_{S,∞} does not depend on i and that, by construction, ‖f1 + i f2‖_{S,∞} does usually not depend on i, so that it is sufficient to compute α(f1 + i f2) in order to find the best polynomial pairs in the set. Notice that computing α(f) for a polynomial f involves finding the roots of f modulo many small primes p, so that the computation of α(f) is much slower than, say, the computation of ‖f‖_{S,∞}. Therefore it is desirable to speed up this computation, which can be done as follows by using a sieving procedure, called root sieve, that computes an approximation of α(f1 + i f2) for i in an interval. For a prime p < P and an s with 0 ≤ s < p it is easy to determine all i such that f1 + i f2 has s as a root modulo p; prime powers p^k can be handled similarly. The cases where a prime divides the discriminant of f1 + i f2 for an i in the interval require a slightly bigger effort but are rare. Suppressing these cases and prime powers for this


description, the sieving procedure consists of initialising an array indexed by i with Σ_{p < P, p prime} log p/(p−1) and subtracting, for each pair (p, s) with p a prime, p < P, 0 ≤ s < p, (an approximation of) the value p log p/(p²−1) from all positions i for which s is a root of f1 + i f2 modulo p. This gives approximations of α(f1 + i f2) and one can further inspect the best ones, i.e., the smallest values.

In general, it is possible to sieve over polynomials f1 + g f2 for arbitrary g ∈ Z[x], which for deg g = 0 is described above. However, for g of bigger degree the S-maximum norm of f1 + g f2 is usually much bigger. This can be countered by changing the skewness or by translations, but so far, i.e., for n up to 768 bits, it is sufficient to consider g of degree 1. In this case the sieving is carried out for f1 + (i1 x + i0) f2, which can be done as above with the obvious modifications; notice that the bounds on i1 are much smaller than those on i0, so that a simple loop over i1, proceeding for each i1-value as above, is almost sufficient.
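The following sketch is our own simplification of the deg g = 0 case: prime powers, primes dividing the discriminant and projective roots are all ignored, exactly as in the description above. It computes the α-approximations both directly and via the root sieve and checks that they agree.

from math import log

PRIMES = [p for p in range(2, 100) if all(p % q for q in range(2, p))]
ev = lambda f, x, p: sum(a * pow(x, j, p) for j, a in enumerate(f)) % p

def alpha(f, primes=PRIMES):
    # sum over p of (1/(p-1) - p*n_p(f)/(p^2-1)) * log p, n_p = #roots mod p
    return sum((1 / (p - 1)
                - p * sum(ev(f, s, p) == 0 for s in range(p)) / (p * p - 1)) * log(p)
               for p in primes)

def root_sieve(f1, f2, I, primes=PRIMES):
    # approximate alpha(f1 + i*f2) for 0 <= i < I by sieving over pairs (p, s)
    arr = [sum(log(p) / (p - 1) for p in primes)] * I
    for p in primes:
        w = p * log(p) / (p * p - 1)
        for s in range(p):
            if ev(f2, s, p) == 0:
                if ev(f1, s, p) == 0:          # s is a root for every i
                    arr = [v - w for v in arr]
                continue
            i0 = -ev(f1, s, p) * pow(ev(f2, s, p), -1, p) % p
            for i in range(i0, I, p):
                arr[i] -= w
    return arr

f1, f2, I = [37, 41, 43, 47, 53, 1], [-1000, 1], 40
sieved = root_sieve(f1, f2, I)
direct = [alpha([a + i * b for a, b in zip(f1, f2 + [0] * 4)]) for i in range(I)]
assert max(abs(s - d) for s, d in zip(sieved, direct)) < 1e-9
print("best i:", min(range(I), key=lambda i: sieved[i]))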

6.8 Later Developments

In this final section a few later improvements are described. The skewed base m method is a two-stage algorithm: first a brute force search produces a few polynomial pairs with a good product of the S-maximum norms, then for each such pair a root sieve is applied to improve the α-value while almost conserving the norms. Since a better α-value can be expected for a larger search area, it is tempting to increase the search area, thereby increasing the norms, and hoping that this is compensated for by an improved α-value. If this approach is adopted, the time spent in the root sieve will increase and may eventually dominate the polynomial selection running time, so that improvements for speeding up the root sieve are needed. One improvement is to inspect only the most promising candidates in the search area, namely those for which the first few terms in the sum (6.4) are already good, i.e., small. This can result in a significant speedup at the cost of missing a few good candidates, and is described in [19]. The paper also explains how to reduce to almost zero the time spent in dealing with prime powers.

Returning to the first stage of the polynomial selection, the two methods mentioned in the final paragraph of Section 6.2 (on page 163) eventually proved to be useful. The first method, namely considering non-monic f2, was used in [207] (and earlier in the less efficient algorithm [206]) to produce polynomial pairs with a small third coefficient a_{d−2}. Since ad and a_{d−1} are small by construction, a_{d−2} has the biggest impact on ‖f1‖_{S,∞} and it is therefore important to reduce its size. With the notation from Section 6.6 on page 170, one has to check about C^{2/(d−2)} values of ad in order to find an expansion for which the bound


on a_{d−2} is satisfied. The new method produces, with an effort of C^{1/(d−2)} (suppressing logarithmic terms), a polynomial pair satisfying the bound on a_{d−2}. Thus, rebalancing all parameters, one expects, by spending an effort of C, to gain a factor of C^{(d−1)/((d−3)(d+1))}.

After the effort for producing polynomials with small third coefficient a_{d−2} has been reduced, the main obstacle for d > 5 is the fourth coefficient a_{d−3}. This is considered in [18], where the second method from the final paragraph of Section 6.2 is combined with the method just described and a careful choice of a translation. These new developments do not introduce fundamentally new concepts but build on the methods described before, many having been invented by Peter Montgomery.


7 The Block Lanczos Algorithm

Emmanuel Thomé

We present the block Lanczos algorithm proposed by Peter Montgomery, which is an efficient means to tackle the sparse linear algebra problem arising in the context of the number field sieve factoring algorithm and its predecessors. The presentation incorporates some simplifications and improvements.

7.1 Linear Systems for Integer Factoring

For factoring a composite integer N, algorithms based on the technique of combination of congruences look for several pairs of integers (x, y) such that x² ≡ y² mod N. This equality is hoped to be non-trivial for at least one of the obtained pairs, letting gcd(x − y, N) unveil a factor of the integer N. Several algorithms use this strategy: the CFRAC algorithm, the quadratic sieve and its variants, and the number field sieve.

Pairs (x, y) as above are obtained by combining relations which have been collected in these algorithms. Relations are written multiplicatively as a set of valuations. All the algorithms considered seek a multiplicative combination of these relations which can be rewritten as an equality of squares. This is achieved by solving a system of linear equations defined over F2, where equations are parity constraints on each valuation considered, and unknowns indicate whether or not relations are to be selected as part of the combination. We are therefore facing a linear algebra problem. Writing the relations collected as the rows of a matrix M with coefficients in F2, we are to find several solutions to the homogeneous linear system x^T M = 0.


To fix notations, we let the matrix M be square of size N × N. It is noteworthy that the matrix M is extremely sparse, as can be illustrated by data from some factoring experiments: for the factoring of RSA-512 in 1999, the matrix had N ≈ 7 × 10^6 and 62 non-zero coefficients per row [92], and for the RSA-768 factorization in 2009, it had N ≈ 2 × 10^8 and 144 non-zero coefficients per row [209]. This sparsity property can be exploited to yield efficient algorithms which solve the linear system in a "black-box" fashion, that is, without ever modifying the matrix M. The only access to the matrix M which is allowed to such algorithms is the operation of multiplying M (or its transpose) by a vector. The interesting black-box algorithms are those which solve the linear system using at most O(N) times this operation. For sparse matrices, this approach is considerably cheaper than "dense" algorithms that do not exploit the sparsity property, with regard to both the time and space complexity (which would, for dense algorithms, be O(N^ω) and O(N²), where ω is the matrix multiplication exponent).

7.2 The Standard Lanczos Algorithm

Dealing with sparse linear systems is an important topic which goes beyond computational number theory. Among the sparse algorithms which can be employed (reviewed in early works such as [226]), we find the conjugate gradient and the Lanczos algorithms, which were both originally stated in the context of solving numerical systems associated with, for example, partial differential equations. With some adaptation work, it is possible to use these algorithms over finite fields, with limitations which we mention in Section 7.3 on page 179. The Wiedemann algorithm [347] was proposed as a method particularly well adapted to finite fields. We discuss in Section 7.9 on page 187 how it compares with the Lanczos and block Lanczos algorithms.

As a first step towards presenting the block Lanczos algorithm, we give here an overview of how the standard Lanczos algorithm can be used to solve homogeneous or inhomogeneous linear systems over finite fields. Arguments that appear in the justification of the standard Lanczos algorithm are also important in the block Lanczos context, which explains this preliminary overview. Within this section, we assume that the base field is F_p for some prime p.

Briefly put, the Lanczos algorithm is the Gram-Schmidt orthogonalization process applied to a Krylov subspace. We need to work with a symmetric matrix A defined over F_p. Different problems can be stated, for example depending on whether we intend to solve a homogeneous or inhomogeneous linear system.


Another distinction comes from the linear system which we want to solve in the first place. While in some cases it does indeed define a symmetric matrix A, it may also be that we form A as A = MM^T, and solve a linear system involving A as a derived means of solving one involving M. Such a strategy would be natural in the prospect of solving the linear systems as defined in Section 7.1 on page 175. In that case, the matrix A is never actually computed, and the black box "multiplication by A" is instead realized as the composition of the two black boxes multiplying by M^T and M.

For expository purposes, we assume in this section that we have a right-hand side vector b ∈ F_p^N, and intend to solve for x ∈ F_p^N the equation Ax = b. The matrix A being symmetric, we may consider the inner product defined from A as v^T Aw for vectors v, w ∈ F_p^N. We say that v and w are A-orthogonal whenever v^T Aw = 0. A vector is A-isotropic if it is A-orthogonal to itself.

The Lanczos algorithm focuses on the sequence of Krylov subspaces of F_p^N defined as Vi = ⟨v0, Av0, A²v0, …, A^i v0⟩, where v0 = b. It is clear that the sequence of subspaces (vi)_{i≥0} is strictly increasing up to some index, and then stationary. We define a sequence of vectors (vi)_{i≥0}, so that the following two conditions are satisfied:

vi is A-orthogonal to vj whenever i ≠ j,    (7.1)

Vi = ⟨v0, …, vi⟩.    (7.2)

We proceed by induction, and assume that a sequence of vectors v0 to vi has been computed so that the two conditions above hold. We now see how to compute v_{i+1}. We begin by noting (using Condition (7.2) inductively) that

V_{i+1} = ⟨v0⟩ + AVi = ⟨v0⟩ + AV_{i−1} + ⟨Avi⟩ = Vi + ⟨Avi⟩,

so that setting v_{i+1} to be any vector within the affine subspace Vi + Avi fulfils Condition (7.2) for index i + 1. In order to satisfy Condition (7.1), we let:

v_{i+1} = Avi − Σ_{j≤i} (vj^T A² vi / vj^T A vj) vj.

We leave for further discussion the important question of the non-degeneracy of the denominators in the expression of v_{i+1}. It turns out that the equation above defining v_{i+1} can be simplified. Indeed, because Avj ∈ V_{j+1}, we have that vj^T A² vi = 0 whenever j < i − 1. This implies that only two terms in the sum above are non-zero, yielding the following


shorter equation for defining v_{i+1}:

v_{i+1} = Avi − c_{i+1,i} vi − c_{i+1,i−1} v_{i−1},
c_{i+1,i} = vi^T A² vi / vi^T A vi,  c_{i+1,i−1} = v_{i−1}^T A² vi / v_{i−1}^T A v_{i−1}.

Note also that we have Av_{i−1} ∈ vi + V_{i−1}, so that v_{i−1}^T A² vi = vi^T A vi. We can then simplify the expression of c_{i+1,i−1} as

c_{i+1,i−1} = vi^T A vi / v_{i−1}^T A v_{i−1}.

The sequence of vectors (vi)_{i≥0} can thus be computed with a simple recurrence procedure, requiring only a short amount of history to be updated from each iteration to the next (namely, the vectors v_{i+1} and vi as well as the scalar vi^T A vi).

We now discuss the termination of the computation of the sequence of vectors (vi)_{i≥0}. It is clear that v_{i+1} can be computed only as long as the following condition holds:

∀ j ≤ i, vj^T A vj ≠ 0.    (7.3)

We assume that Condition (7.3) holds until some index m (not included), and that vm = 0. This implies Vm = V_{m−1}. Now define x as

x = Σ_{0≤i<m} (vi^T b / vi^T A vi) vi.
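The recurrence translates into only a few lines of code. The sketch below is our own: it runs over F_p with a random symmetric A = MM^T accessed as a black box, and it follows exactly the formulas of this section. It does not handle the self-orthogonality failures discussed in Section 7.3; it merely detects them.

import random
random.seed(1)
p, N = 1000003, 60

matvec = lambda M, v: [sum(m * x for m, x in zip(row, v)) % p for row in M]
dot = lambda u, v: sum(a * b for a, b in zip(u, v)) % p

M = [[random.randrange(p) for _ in range(N)] for _ in range(N)]
MT = [list(col) for col in zip(*M)]
A = lambda v: matvec(M, matvec(MT, v))        # black box for A = M * M^T
b = [random.randrange(p) for _ in range(N)]

x, v, v_prev, vAv_prev = [0] * N, b[:], None, None
Av = A(v)
while any(v):
    vAv = dot(v, Av)
    if vAv == 0:                              # v is A-isotropic, cf. Section 7.3
        raise ZeroDivisionError("self-orthogonal vector encountered")
    inv = pow(vAv, -1, p)
    c = dot(v, b) * inv % p                   # contribution of v_i to x
    x = [(xi + c * vi) % p for xi, vi in zip(x, v)]
    c1 = dot(Av, Av) * inv % p                # c_{i+1,i} = v^T A^2 v / v^T A v
    w = [(avi - c1 * vi) % p for avi, vi in zip(Av, v)]
    if v_prev is not None:
        c2 = vAv * pow(vAv_prev, -1, p) % p   # c_{i+1,i-1}
        w = [(wi - c2 * vpi) % p for wi, vpi in zip(w, v_prev)]
    v_prev, v, vAv_prev = v, w, vAv
    Av = A(v)

assert A(x) == b                              # x solves the system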

Each term in the product is an RLP, thus F(X) is an RLP, so only t/2 + 1 of its coefficients need to be stored. For the purpose of this exposition, we will assume the p − 1 method where Q is known, as this simplifies the algorithm for constructing F(X). It can be found in [261] how to construct F(X) in a way that works for both the p − 1 and p + 1 methods.

We start the recursive construction of F(X) with one of the summands of cardinality 2 in T, in our example with {−35, 35}, which ensures we obtain an RLP of degree 2 (where the degree of an RLP is the maximal degree difference between its monomials):

F1(X) = X + X^{−1} − V35(R).


Next, we process all summands of odd cardinality in T, in our example {−30, 0, 30}, i.e., from F1(X) we construct the RLP

F2(X) = Π_{τ ∈ {−35,35}+{−30,0,30}, τ>0} (X + X^{−1} − Vτ(R))

of degree 6 by scaling F1(X) and multiplying: F2(X) = F1(Q^{−30}X) F1(X) F1(Q^{30}X). The summands of odd cardinality in T are processed early as they are more difficult than the cardinality 2 case because they require products of RLPs of unequal degrees, so we prefer to handle them while the degrees of the RLPs involved are small. Finally, the remaining summands of cardinality 2 in T are processed. Each one doubles the degree of the resulting RLP. If the i-th summand that we process is {−k, k}, we compute Fi(X) = F_{i−1}(Q^{−k}X) F_{i−1}(Q^k X); with m summands in T (in our example, m = 5), Fm(X) = F(X), see Figure 8.3 on page 203. As P is chosen to be highly composite, φ(P) and thus t contain many factors of 2, so that the cost of building F(X) is dominated by processing sets of cardinality 2. The total cost is therefore in O(M(d/2) + M(d/4) + M(d/8) + ⋯ + M(1)) = O(M(d)), smaller by a factor of order log d than the cost of the generic product tree in Algorithm 8.1 on page 193.

For the p + 1 method, Q is not explicitly known, but the arithmetic for building F(X) by scaling and multiplication can be reformulated entirely in terms of Chebyshev polynomials which reference only R = Q + Q^{−1}. Multiplication of RLPs can be performed efficiently via weighted FFTs, see [261] for details.
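The scaling identity behind this construction is easy to check numerically. The following sketch is our own; it works in F_p with an explicitly known Q, as in the p − 1 case, and verifies that F1(Q^{−30}X)·F1(X)·F1(Q^{30}X) equals the degree-6 RLP with roots Q^τ for τ in the sumset:

p, Q, X = 10 ** 9 + 7, 123457, 987654321      # a prime, a unit Q, a test point X
V = lambda k: (pow(Q, k, p) + pow(Q, -k, p)) % p
F1 = lambda x: (x + pow(x, -1, p) - V(35)) % p

sumset = [a + c for a in (-35, 35) for c in (-30, 0, 30)]
rhs = 1
for tau in (t for t in sumset if t > 0):      # the exponents 5, 35, 65
    rhs = rhs * (X + pow(X, -1, p) - V(tau)) % p
lhs = 1
for c in (-30, 0, 30):                        # scale F1 by Q^c and multiply
    lhs = lhs * F1(pow(Q, c, p) * X % p) % p
assert lhs == rhs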

8.3.2 Evaluation of a Polynomial Along a Geometric Progression

After F(X) is built from the roots Q^τ, τ ∈ T, we evaluate it along the geometric progression Q^σ, σ ∈ S, with S = {kP : k ∈ N, B1 ≤ kP ≤ B2}. Note that this choice of S may not cover quite all the primes in (B1, B2], depending on the set T, see [261, §5].

A polynomial can be evaluated along a geometric progression with Bluestein's algorithm [56], related to the chirp-z transform, which is a generalization of the length-ℓ DFT to points of evaluation along a geometric progression other than powers of an ℓ-th primitive root of unity ωℓ.


[Figure 8.3 Example of computing a reciprocal Laurent polynomial F(X) of degree 24 by scaling and multiplying, starting from F1(X) = (X − Q^{−35})(X − Q^{35})/X up to F(X) = F5(X). Each scaling/multiplication step corresponds to one summand in the set T = {−35, 35} + {−42, 42} + {−21, 21} + {−45, 45} + {−30, 0, 30}.]

For a polynomial F(X) of degree t with coefficient vector in monomial basis, the length-ℓ DFT of the coefficient vector is

f̄k = Σ_{0≤i≤t} fi ωℓ^{ik} = F(ωℓ^k) for 0 ≤ k < ℓ,

i.e., it evaluates the polynomial along the geometric progression ωℓ^k for 0 ≤ k < ℓ.
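The reduction of such an evaluation to a convolution rests on the identity ik = T(i + k) − T(i) − T(k) with T(m) = m(m − 1)/2, a variant of the usual chirp ik = (i² + k² − (k − i)²)/2 that avoids square roots of the progression ratio. A small Python check of this rearrangement (our own sketch; the convolution is computed naively for clarity, an FFT would make it fast):

p, q = 10 ** 9 + 7, 3                  # arithmetic in F_p, progression q^k
f = [5, 7, 11, 13, 2]                  # coefficients of F(X), degree t = 4
t, ell = len(f) - 1, 6
T = lambda m: m * (m - 1) // 2         # triangular numbers: i*k = T(i+k) - T(i) - T(k)
u = [fi * pow(q, -T(i), p) % p for i, fi in enumerate(f)]
w = [pow(q, T(j), p) for j in range(t + ell)]
# F(q^k) = q^(-T(k)) * sum_i u_i * w_{i+k}: one convolution gives all values
chirp = [pow(q, -T(k), p) * sum(ui * w[i + k] for i, ui in enumerate(u)) % p
         for k in range(ell)]
assert chirp == [sum(fi * pow(q, i * k, p) for i, fi in enumerate(f)) % p
                 for k in range(ell)]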
Let p > 3 be a prime and Fq be a finite field of characteristic p. Throughout the whole chapter, we consider E to be an elliptic curve defined over Fq, given by a short Weierstrass equation E : y² = x³ + ax + b, where a, b ∈ Fq and 4a³ + 27b² ≠ 0. The point at infinity on E is denoted by O. The curve E can be viewed as the set of affine solutions (x, y) of the curve equation with coordinates in an algebraic closure of Fq together with the special point O. The set of Fq-rational points on E is given by E(Fq) = {(x, y) ∈ Fq × Fq | y² = x³ + ax + b} ∪ {O}, and the number of such points is n = #E(Fq) = q + 1 − t, where t is bounded by |t| ≤ 2√q.

9.1.1.1 The Group Law

There exists an abelian group law + on E, which is given as follows. The point O is the neutral element, i.e., P1 + O = P1 for any point P1 ∈ E. If P1 ≠ O, write P1 = (x1, y1). Then, the inverse element of P1 is the point −P1 = (x1, −y1), i.e., (x1, y1) + (x1, −y1) = O. Given another point P2 = (x2, y2) ∈ E \ {O, −P1}, define

λ = (y2 − y1)/(x2 − x1)  if P1 ≠ P2,
λ = (3x1² + a)/(2y1)   if P1 = P2.    (9.1)

Then P3 = (x3, y3) = P1 + P2 is given by x3 = λ² − x1 − x2 and y3 = λ(x1 − x3) − y1.

Computing the sum of two affine Fq-rational points requires a division in the field Fq. This operation is usually significantly more costly than all the other operations, such as addition, subtraction and multiplication. Therefore, elliptic curve group operations are often computed using a projective coordinate system, in which an additional third coordinate keeps track of all denominators such that field inversions are avoided altogether until the very end of the computation. This comes at the cost of additional field multiplications and other lower-cost operations. There are various forms of projective coordinate systems suitable for different scenarios.
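As an illustration of the inversion-free approach (our own sketch; the curve and prime are arbitrary toy choices), the following compares a doubling in Jacobian coordinates, where (X : Y : Z) represents the affine point (X/Z², Y/Z³), with the affine formulas from Equation (9.1):

p, a, b = 2 ** 31 - 1, 2, 3            # toy curve y^2 = x^3 + 2x + 3 over F_p

def jac_double(X1, Y1, Z1):
    # doubling without any field inversion (a handful of multiplications instead)
    S = 4 * X1 * Y1 % p * Y1 % p
    M = (3 * X1 * X1 + a * pow(Z1, 4, p)) % p
    X3 = (M * M - 2 * S) % p
    Y3 = (M * (S - X3) - 8 * pow(Y1, 4, p)) % p
    return X3, Y3, 2 * Y1 * Z1 % p

x1 = 1                                 # find some affine point (x1, y1) on the curve
while True:
    t = (x1 ** 3 + a * x1 + b) % p
    if pow(t, (p - 1) // 2, p) == 1:
        y1 = pow(t, (p + 1) // 4, p)   # square root, valid since p = 3 (mod 4)
        break
    x1 += 1

lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p   # affine doubling, Eq. (9.1)
x3 = (lam * lam - 2 * x1) % p
y3 = (lam * (x1 - x3) - y1) % p
X3, Y3, Z3 = jac_double(x1, y1, 1)
zi = pow(Z3, -1, p)                    # one inversion at the very end
assert (X3 * zi * zi % p, Y3 * pow(zi, 3, p) % p) == (x3, y3)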


9.1.1.2 Embedding Degree and Torsion Points

Let r be a prime with r | n, and let k be the embedding degree of E with respect to r, i.e., k is the smallest positive integer with r | q^k − 1. This means that the multiplicative group F*_{q^k} contains as a subgroup the group μr of r-th roots of unity. The embedding degree k is an important parameter, since it determines the field extensions over which the groups that are involved in pairing computation are defined.

For m ∈ Z, let [m] : E → E be the multiplication-by-m map, which maps a point P ∈ E to the sum of m copies of P. The kernel of [m] is the set of m-torsion points on E; it is denoted by E[m], and we write E(F_{q^ℓ})[m] for the set of F_{q^ℓ}-rational m-torsion points (ℓ > 0). From now on it is assumed that k > 1, in which case E[r] ⊆ E(F_{q^k}), i.e., all r-torsion points are defined over F_{q^k}.

9.1.2 Pairings

In order to describe the pairing functions relevant to cryptography, we need to introduce divisors on E. A divisor D on E is a formal sum of points with integer coefficients, i.e., D = Σ_{P∈E} nP (P) with only finitely many of the nP different from zero; the finite set of points with non-zero nP is the support of the divisor D. A divisor is called a principal divisor if it is the divisor of a function f on E. In that case the coefficient nP reflects the multiplicity of P as a zero (if nP > 0) or pole (nP < 0) of f. Two divisors are equivalent if they differ by a principal divisor. For P1 ∈ E, let D_{P1} be a divisor that is equivalent to (P1) − (O). Let f_{r,P1} be a function on E with divisor r(P1) − r(O). Evaluating a function f at a divisor Σ nP (P) means to evaluate Π f(P)^{nP}.

9.1.2.1 The Weil Pairing

The first applications of pairings in cryptography used the Weil pairing

er : E[r] × E[r] → μr ⊆ F*_{q^k}, (P, Q) ↦ f_{r,P}(DQ)/f_{r,Q}(DP).

For the computation of f_{r,Q}(DP), we can take DQ = (Q) − (O) and need to choose a suitable point R such that DP = (P + R) − (R) has support disjoint from {O, Q} (and similarly for f_{r,P}(DQ)).

9.1.2.2 The Tate Pairing

Most pairings that are suitable for use in practical cryptographic applications are derived from the reduced Tate pairing

tr : E(Fq)[r] × E(F_{q^k})[r] → μr ⊆ F*_{q^k}, (P, Q) ↦ f_{r,P}(DQ)^{(q^k−1)/r}.


In the case k > 1, to which we restrict in this chapter, one can omit the auxiliary point in the divisor DQ and compute f_{r,P}(Q)^{(q^k−1)/r} instead (see [29, 30, Thm. 1]).

9.1.2.3 Miller's Algorithm

To evaluate the functions occurring in the Weil and the Tate pairing, one follows an iterative approach based on Miller's formulas [248]. For an integer m ∈ Z and a point P ∈ E(F_{q^k})[r] define the function f_{m,P} to be a function with divisor m(P) − ([m]P) − (m − 1)(O). Note that this notation coincides with the one used above for m = r because then [r]P = O. The value of the function f_{r,P} at the point Q can be computed in a square-and-multiply-like fashion via function values f_{m,P}(Q) for m < r with the help of certain line functions. Let P1, P2, Q ∈ E such that P1 ≠ O, P2 ∉ {O, −P1}, and Q ≠ O. Then the line through P1 = (x1, y1) and P2 = (x2, y2), evaluated at Q = (xQ, yQ), is given by

l_{P1,P2}(Q) = (yQ − y1) − λ(xQ − x1),

where λ is the value of the slope computed during the computation of P1 + P2 as in Equation (9.1) on page 208. In the case that P2 = −P1, we have the vertical line function value v_{P1}(Q) = l_{P1,−P1}(Q) = xQ − x1. Define the quotient g_{P1,P2}(Q) = l_{P1,P2}(Q)/v_{P1+P2}(Q). Then, for m1, m2 ∈ Z, Miller's formulas hold as follows:

f_{m1+m2,P}(Q) = f_{m1,P}(Q) f_{m2,P}(Q) g_{[m1]P,[m2]P}(Q),
f_{m1·m2,P}(Q) = f_{m1,P}^{m2}(Q) f_{m2,[m1]P}(Q) = f_{m2,P}^{m1}(Q) f_{m1,[m2]P}(Q).

Specializations of these formulas yield f_{m+1,P}(Q) = f_{m,P}(Q) g_{[m]P,P}(Q) and f_{2m,P}(Q) = f_{m,P}²(Q) g_{[m]P,[m]P}(Q), which lead to Miller's algorithm to compute f_{r,P}(Q) as shown in Algorithm 9.1 on page 211. Note that as written here, in the last iteration of the algorithm, since r0 = 1, the point addition in line 5 computes the point at infinity O in an exceptional case. The value of the function g_{R,P} in this step is defined as g_{R,P}(Q) = vP(Q).

Unlike scalar multiplication algorithms for exponentiation in elliptic curve groups, which multiply curve points by varying scalars that are often secret values in the respective cryptographic protocols, Miller's algorithm works alongside a scalar multiplication with a fixed scalar, namely r. In other words, the exponentiation carried out in Miller's algorithm is always to the same public number. Significant savings can be obtained by choosing this system parameter with a low Hamming weight (or low weight in a signed binary representation) in order to have as few as possible of the addition steps in line 5 of Algorithm 9.1.


Algorithm 9.1 Miller's algorithm
Input: P, Q ∈ E[r], r = (r_{ℓ−1}, r_{ℓ−2}, …, r0)_2, r_{ℓ−1} = 1
Output: f_{r,P}(Q) representing a class in F*_{q^k}/(F*_{q^k})^r
1: R ← P, f ← 1
2: for i from ℓ − 2 downto 0 do
3:   f ← f² · g_{R,R}(Q), R ← [2]R
4:   if (ri = 1) then
5:     f ← f · g_{R,P}(Q), R ← R + P
6:   end if
7: end for
8: return f

The final exponentiation to the power (q^k − 1)/r in the Tate pairing after the Miller loop represents a large part of the computation. It maps classes in F*_{q^k}/(F*_{q^k})^r to unique representatives in μr and is an exponentiation with a known fixed, special exponent. Parts of it are carried out in special subgroups of the multiplicative group F*_{q^k}. This leads to various improvements over a general exponentiation with an exponent of that size, leading to a significant speedup (see, for example, [307, 167, 176]).
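To see Algorithm 9.1 in action, the sketch below (entirely our own toy construction) computes a reduced Tate pairing on the supersingular curve y² = x³ + x over F_p with p ≡ 3 (mod 4), which has embedding degree k = 2 and #E(F_p) = p + 1. The second argument is moved out of ⟨P⟩ with the distortion map (x, y) ↦ (−x, iy), and, since k is even, vertical lines, whose values lie in F_p and vanish under the final exponentiation, are simply skipped.

p, r = 1000003, 89                     # p = 3 (mod 4) prime, r | p + 1
assert p % 4 == 3 and (p + 1) % r == 0

def f2mul(u, v):                       # F_{p^2} = F_p(i) with i^2 = -1
    (a, b), (c, d) = u, v
    return ((a * c - b * d) % p, (a * d + b * c) % p)

def f2pow(u, e):
    x = (1, 0)
    while e:
        if e & 1:
            x = f2mul(x, u)
        u, e = f2mul(u, u), e >> 1
    return x

def ec_add(P1, P2):                    # affine group law on y^2 = x^3 + x, Eq. (9.1)
    if P1 is None or P2 is None:
        return P1 if P2 is None else P2
    (x1, y1), (x2, y2) = P1, P2
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                    # P2 = -P1
    if P1 == P2:
        lam = (3 * x1 * x1 + 1) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def ec_mul(P, n):
    R = None
    while n:
        if n & 1:
            R = ec_add(R, P)
        P, n = ec_add(P, P), n >> 1
    return R

def line(R, S, Qx, Qy):                # l_{R,S} evaluated at psi(Q) = (-Qx, i*Qy)
    (x1, y1) = R
    if R != S and x1 == S[0]:          # vertical line: value in F_p, skip (even k)
        return (1, 0)
    if R == S:
        lam = (3 * x1 * x1 + 1) * pow(2 * y1, -1, p) % p
    else:
        lam = (S[1] - y1) * pow(S[0] - x1, -1, p) % p
    return ((-y1 - lam * (-Qx - x1)) % p, Qy % p)

def tate(P, Q):                        # Algorithm 9.1 plus final exponentiation
    (Qx, Qy), f, R = Q, (1, 0), P
    for bit in bin(r)[3:]:             # bits of r below the leading one
        f = f2mul(f2mul(f, f), line(R, R, Qx, Qy))
        R = ec_add(R, R)
        if bit == '1':
            f = f2mul(f, line(R, P, Qx, Qy))
            R = ec_add(R, P)
    return f2pow(f, (p * p - 1) // r)

P, x = None, 1                         # a point of order r, by cofactor multiplication
while P is None:
    x += 1
    t = (x ** 3 + x) % p
    if pow(t, (p - 1) // 2, p) == 1:
        P = ec_mul((x, pow(t, (p + 1) // 4, p)), (p + 1) // r)

e = tate(P, P)                         # distortion map makes e(P, psi(P)) != 1
assert e != (1, 0) and f2pow(e, r) == (1, 0)
assert tate(ec_mul(P, 2), P) == f2mul(e, e)       # bilinearity
print("e(P, psi(P)) =", e)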

9.1.2.4 The ate Pairing

Often the most efficient choices for implementation are variants of the ate pairing [185], which is a certain power of the reduced Tate pairing and is defined on special subgroups of E[r]. Let φq be the q-power Frobenius endomorphism on E, i.e., φq(x, y) = (x^q, y^q). Define two groups of prime order r by G1 = E[r] ∩ ker(φq − [1]) = E(Fq)[r] and G2 = E[r] ∩ ker(φq − [q]) ⊆ E(F_{q^k})[r]. The group G1 contains only points defined over the base field Fq, while the points in G2 are minimally defined over F_{q^k}. The ate pairing is the map

aT : G2 × G1 → μr, (Q, P) ↦ f_{T,Q}(P)^{(q^k−1)/r},    (9.2)

where T = t − 1. The group G2 has a nice representation by an isomorphic group of points on a twist E′ of E, which is a curve that is isomorphic to E. Here, we are interested in those twists which are defined over a subfield of F_{q^k} such that the twisting isomorphism is defined over F_{q^k}. Such a twist E′ of E is given by an equation E′ : y² = x³ + (a/α⁴)x + (b/α⁶) for some α ∈ F_{q^k}, with isomorphism ψ : E′ → E, (x, y) ↦ (α²x, α³y). If ψ is minimally defined over F_{q^k} and E′ is minimally defined over F_{q^{k/d}} for a d | k, then we say that E′ is a twist of degree d. If a = 0, let d0 = 6; if b = 0, let d0 = 4; and let d0 = 2


otherwise. For d = gcd(k, d0) there exists exactly one twist E′ of E of degree d for which r | #E′(F_{q^{k/d}}) (see [185]). Define G2′ = E′(F_{q^{k/d}})[r]. Then the map ψ is a group isomorphism G2′ → G2 and we can represent all elements in G2 by the corresponding preimages in G2′. Likewise, all arithmetic that needs to be done in G2 can be carried out in G2′. The advantage of this is that points in G2′ are defined over a smaller field than those in G2. Using G2′, we may now view the ate pairing as a map G2′ × G1 → μr, (Q′, P) ↦ f_{T,ψ(Q′)}(P)^{(q^k−1)/r}.

Algorithm 9.2 Miller's algorithm for even k and ate-like pairings
Input: Q′ ∈ G2′, P ∈ G1, m = (m_{l−1}, m_{l−2}, …, m0)_2, m_{l−1} = 1
Output: f_{m,ψ(Q′)}(P) representing a class in F*_{q^k}/(F*_{q^k})^r
1: R′ ← Q′, f ← 1
2: for i from l − 2 downto 0 do
3:   f ← f² · l_{ψ(R′),ψ(R′)}(P), R′ ← [2]R′
4:   if (mi = 1) then
5:     f ← f · l_{ψ(R′),ψ(Q′)}(P), R′ ← R′ + Q′
6:   end if
7: end for
8: return f

Algorithm 9.2 shows Miller's algorithm in the just described setting for an ate-like pairing to compute f_{m,ψ(Q′)}(P) for some integer m > 0. In this algorithm and throughout the rest of this chapter, we assume that k is even and that therefore the denominator elimination technique applies (see [29, 30]), which means that the denominators in the above defined fraction g_{Q1,Q2}(P) of line functions can be omitted without changing the pairing value. This is why Algorithm 9.2 only uses the line functions l_{Q1,Q2}(P). Miller's algorithm builds up the function value f_{m,ψ(Q′)}(P) along a scalar multiplication computing [m]Q′ (which is the value of R′ after the Miller loop). Step 3 is called a doubling step; it consists of squaring the intermediate value f ∈ F_{q^k}, multiplying it with the function value given by the tangent to E at R = ψ(R′), and doubling the point R′. Similarly, an addition step is computed in step 5 of Algorithm 9.2.

The most efficient variants of the ate pairing are so-called optimal ate pairings [333]. They are optimal in the sense that they minimize the size of m and with that the number of iterations in Miller's algorithm to log(r)/ϕ(k), where ϕ is the Euler totient function and log is the logarithm to base 2. For such minimal values of m, the function f_{m,ψ(Q′)} alone usually does not give a bilinear map. To get a pairing, these functions need to be adjusted by multiplying with a small number of line function values (see [333] for details).


9.1.3 Pairing-Friendly Elliptic Curves

Secure and efficient implementation of pairings can be done only with a careful choice of the underlying elliptic curve. The curve needs to be pairing-friendly, i.e., the embedding degree k needs to be small, while r should be larger than √q. A survey of methods to construct such curves can be found in [148]. For security reasons, the parameters need to have certain minimal sizes, which imply suitable values for the embedding degree k for specific security levels.

The requirement for an elliptic curve to have small embedding degree is a strong condition that restricts the set of possible curves to a small subset of elliptic curves. Because of the possibility of a transfer attack on the discrete logarithm problem like the Menezes-Okamoto-Vanstone [243] or the Frey-Rück [151, 150] attacks, such curves are excluded from being used in elliptic curve-based protocols that do not require a pairing function. In general, an arbitrary elliptic curve over a finite field with a large prime divisor of the group order will have a large embedding degree. This means that pairing-friendly curves are special and rare, and finding a curve that satisfies the required properties can be a challenge.

The most popular choices for pairing-friendly curves come from polynomial families of curves. In these families, the base field prime and the prime group order are parameterized by rational polynomials that ensure all conditions are satisfied. Concrete parameters are obtained by searching for an integer parameter such that the above polynomials evaluate to prime integers at the parameter (a small search sketch follows below). A corresponding elliptic curve can then be constructed using the complex multiplication method, or a simpler algorithm that tests for the correct group order in certain special cases.

As explained in subsection 9.1.2.4 on page 211, it is often advantageous to choose curves with twists of degree 4 or 6, so-called high-degree twists, since this results in higher efficiency due to the more compact representation of the group G2. To achieve security levels of 128 bits or higher, embedding degrees of 12 and larger are advantageous. Because the degree of the twist E′ is at most 6, this means that when computing ate-like pairings at such security levels, all field arithmetic in the doubling and addition steps in Miller's algorithm takes place over a proper extension field of Fq.
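As a concrete illustration of such a search, consider the Barreto-Naehrig family with k = 12, where p(u) = 36u⁴ + 36u³ + 24u² + 6u + 1 and r(u) = p(u) + 1 − t(u) with t(u) = 6u² + 1. The sketch below is our own: the starting point and the probabilistic primality test are arbitrary choices, and the loop may take some seconds.

import random

def is_prime(n, rounds=16):            # Miller-Rabin, adequate for a search sketch
    small = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]
    if n < 2 or any(n % q == 0 for q in small):
        return n in small
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for _ in range(rounds):
        x = pow(random.randrange(2, n - 1), d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

p_of = lambda u: 36 * u ** 4 + 36 * u ** 3 + 24 * u ** 2 + 6 * u + 1
r_of = lambda u: 36 * u ** 4 + 36 * u ** 3 + 18 * u ** 2 + 6 * u + 1

u = 2 ** 31                            # search upwards from a round number
while not (is_prime(p_of(u)) and is_prime(r_of(u))):
    u += 1
print(f"u = {u}: p has {p_of(u).bit_length()} bits, r has {r_of(u).bit_length()} bits")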

9.2 Finite Field Arithmetic for Pairings

This section takes a closer look at the field arithmetic needed for pairing algorithms. As above, let Fq be a finite field of characteristic p > 3, and let F_{q^m} be an extension of degree m of Fq for m | k. Note that mostly, pairing-friendly elliptic


curves are defined over a prime field, i.e., q = p. Algorithm 9.2 on page 212 shows that a pairing algorithm requires arithmetic in the base field and also in some of the field extensions F_{q^m}. A crucial building block with major influence on the overall efficiency of pairing-based protocols is multiplication modulo the prime p. An obvious influence of Peter's work is the use of Montgomery arithmetic for modular operations. Another aspect that has been inspired by Peter's way of thinking is the treatment of inversions.

As indicated above, usually the cost for a field inversion is larger than that for a field multiplication, which in turn is larger than that for an addition and a subtraction. What the exact ratios between these are depends on the setting, i.e., the shape of the base field prime, the extension degree, the algorithms used, and the computer architecture they are implemented for. When writing code that has to have a constant execution time, independent of secret input data, inversions are often even computed using a field exponentiation based on Fermat's little theorem. Thus, in implementations of prime fields, inversions are usually very expensive, in which case it does not make sense to use affine coordinates. Pairing implementations using some form of projective coordinates can get away with a single inversion in a pairing evaluation. But still, there are certain scenarios, discussed later, in which it is more efficient to work in affine coordinates when doing the elliptic curve operations during the pairing algorithm. Furthermore, divisions are still needed, for example, to scale curve points to a unique affine representation in the non-pairing operations. After a brief discussion on multiplication, this section describes in some detail the circumstances that allow to achieve relatively low inversion costs.

9.2.1 Montgomery Multiplication

The base field primes for pairing-friendly elliptic curves can generally not be chosen to have a specific implementation-friendly form, as is done for plain elliptic curve cryptography to obtain faster modular multiplications. For example, elliptic curve implementations for key exchange or digital signatures often use curves defined over fields with special prime shapes such as Solinas primes, Montgomery-friendly primes or pseudo-Mersenne primes. Such choices make modular reduction very efficient. Since the parameters of pairing-friendly curves have to satisfy the additional embedding degree condition, adding a condition on the prime shape would be too restrictive. Therefore, base field primes for those curves do not have a special structure that can be exploited to speed up modular reduction.


As a consequence, Montgomery multiplication and Montgomery reduction (see Chapter 2) are a good choice for efficient implementation of modular arithmetic, and are the method used in speed-record pairing implementations [51, 13, 11].

9.2.2 Multiplication in Extension Fields

The field extension F_{q^k} required in the pairing algorithm is usually constructed as a tower of small-degree field extensions (preferably extensions of degree 2 and 3; see the notion of pairing-friendly fields in [218]), depending on the factorization of the embedding degree k. Benger and Scott [33] discuss how to best choose such towers in the pairing setting. Multiplication in each intermediate extension is implemented using the Karatsuba or the Toom-Cook method [218]. The line function values that occur in Miller's algorithm when using the group G2′ on a high-degree twist are sparse elements of F_{q^k}, i.e., some of their coefficients when written as a polynomial over Fq are always zero. Therefore, multiplications with these values can exploit special, more efficient multiplication routines.
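For a quadratic extension the Karatsuba step saves one base-field multiplication: with F_{p²} = F_p[X]/(X² − ω), a product costs three multiplications instead of four. A minimal sketch (our own; the prime is a toy choice, and ω = −1 is a non-square since p ≡ 3 (mod 4)):

import random
p = 2 ** 127 - 1                       # toy base field prime, p = 3 (mod 4)
omega = p - 1                          # omega = -1, a non-square mod p

def mul_fp2(a, b):                     # (a0 + a1*X)(b0 + b1*X) mod (X^2 - omega)
    t0 = a[0] * b[0] % p
    t1 = a[1] * b[1] % p
    t2 = (a[0] + a[1]) * (b[0] + b[1]) % p    # Karatsuba: the third multiplication
    return ((t0 + omega * t1) % p, (t2 - t0 - t1) % p)

a = (random.randrange(p), random.randrange(p))
b = (random.randrange(p), random.randrange(p))
schoolbook = ((a[0] * b[0] + omega * a[1] * b[1]) % p,
              (a[0] * b[1] + a[1] * b[0]) % p)
assert mul_fp2(a, b) == schoolbook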

9.2.3 Finite Field Inversions

Itoh and Tsujii [191] describe a method for computing the inverse of an element in a binary field using normal bases, and Kobayashi et al. [216] generalize the technique to large-characteristic fields in polynomial basis and use it for elliptic-curve arithmetic. It is a standard way to compute inverses in optimal extension fields (see [21, 174] and [129, sections 11.3.4 and 11.3.6]). It can be applied in the following setting. Let F_{q^ℓ} = F_q(α), where α has minimal polynomial X^ℓ − ω for some ω ∈ F_q^*, and assume gcd(ℓ, q) = 1. Then the inverse of β ∈ F_{q^ℓ}^* can be computed as

β^{−1} = β^{v−1} · β^{−v}, where v = (q^ℓ − 1)/(q − 1) = q^{ℓ−1} + · · · + q + 1.

Note that β^v is the norm of β and thus lies in the base field F_q. So the cost for computing the inverse of β is the cost for computing β^{v−1} and β^v, one inversion in the base field F_q to obtain β^{−v}, and the multiplication of β^{v−1} with β^{−v}. The powers of β are obtained by using the q-power Frobenius automorphism on F_{q^ℓ}.

We give a brief estimate of the cost of the above. Let M_{q^m}, S_{q^m}, I_{q^m}, add_{q^m}, sub_{q^m}, and neg_{q^m} denote the costs for multiplication, squaring, inversion, addition, subtraction, and negation in the field F_{q^m}. The cost for a multiplication by a constant ω ∈ F_{q^m} is denoted by M_{(ω)}. We assume the same cost for addition of a constant as for a general addition. Denote the inversion-to-multiplication cost ratio by R_{q^m} = I_{q^m}/M_{q^m}. A Frobenius computation using a look-up table of ℓ − 1 precomputed values in F_q consisting of powers of ω costs at most ℓ − 1 multiplications in F_q (see [216, section 2.3]; note gcd(ℓ, q) = 1). According to [180, section 2.4.4], the computation of β^{v−1} via an addition-chain approach, using a look-up table for each required power of the Frobenius, costs at most ⌊log_2(ℓ − 1)⌋ + h(ℓ − 1) Frobenius computations and fewer multiplications in F_{q^ℓ}. Here h(m) denotes the Hamming weight of an integer m. Knowing that β^v ∈ F_q, its computation from β^{v−1} and β costs at most ℓ base field multiplications, one multiplication with ω, and ℓ − 1 base field additions. The final multiplication of β^{−v} with β^{v−1} can be done in ℓ base field multiplications. This leads to the following upper bound for the cost of an inversion in F_{q^ℓ}:

I_{q^ℓ} ≤ I_q + (⌊log_2(ℓ − 1)⌋ + h(ℓ − 1)) · (M_{q^ℓ} + (ℓ − 1)M_q) + 2ℓM_q + M_{(ω)} + (ℓ − 1)add_q.   (9.3)

Let M(ℓ) be the minimal number of multiplications in F_q needed to multiply two different, non-trivial elements of F_{q^ℓ} not lying in a proper subfield of F_{q^ℓ}. Then the following lemma bounds the ratio of inversion to multiplication costs in F_{q^ℓ} from above by 1/M(ℓ) times the ratio in F_q plus an explicit constant. Thus the ratio in the extension improves by roughly a factor of M(ℓ).

Lemma 9.1 Let F_q be a finite field, ℓ > 1, and F_{q^ℓ} = F_q(α) with α^ℓ = ω ∈ F_q^*. Then using the above inversion algorithm in F_{q^ℓ} leads to R_{q^ℓ} ≤ R_q/M(ℓ) + C(ℓ), where

C(ℓ) = ⌊log_2(ℓ − 1)⌋ + h(ℓ − 1) + (1/M(ℓ)) · (3ℓ + (ℓ − 1)(⌊log_2(ℓ − 1)⌋ + h(ℓ − 1))).

Proof Since M(ℓ) is the minimal number of multiplications in F_q needed for multiplying two elements of F_{q^ℓ}, we can assume that the actual cost of the latter is M_{q^ℓ} ≥ M(ℓ)M_q. Using Inequality (9.3), we deduce

R_{q^ℓ} = I_{q^ℓ}/M_{q^ℓ} ≤ I_q/(M(ℓ)M_q) + C̃(ℓ) = R_q/M(ℓ) + C̃(ℓ),

where C̃(ℓ) = ⌊log_2(ℓ − 1)⌋ + h(ℓ − 1) + (2ℓ + (ℓ − 1)(⌊log_2(ℓ − 1)⌋ + h(ℓ − 1)))/M(ℓ) + (M_{(ω)} + (ℓ − 1)add_q)/(M(ℓ)M_q). Since M_{(ω)} ≤ M_q and add_q ≤ M_q, we get M_{(ω)} + (ℓ − 1)add_q ≤ ℓM_q and thus C̃(ℓ) ≤ C(ℓ).


Table 9.1 Constants that determine the improvement of R_{q^ℓ} over R_q

ℓ         2      3      4      5      6      7
1/M(ℓ)    1/3    1/6    1/9    1/13   1/17   1/22
C(ℓ)      3.33   4.17   5.33   5.08   6.24   6.05

In Table 9.1 we give values for the factor 1/M(ℓ) and the additive constant C(ℓ) that determine the improvement of R_{q^ℓ} over R_q for several small extension degrees ℓ. We take the numbers for M(ℓ) from the formulas given in [259]. For small-degree extensions, the inversion method can easily be made explicit. We state and analyze it for quadratic and cubic extensions in the following two examples.

Example 9.1 (Quadratic extensions) Let F_{q^2} = F_q(α) with α^2 = ω ∈ F_q. An element β = b_0 + b_1α ≠ 0 can be inverted as

1/(b_0 + b_1α) = (b_0 − b_1α)/(b_0^2 − b_1^2ω) = b_0/(b_0^2 − b_1^2ω) − (b_1/(b_0^2 − b_1^2ω))α.

In this case the norm of β is given explicitly by b_0^2 − b_1^2ω ∈ F_q. The inverse of β thus can be computed in one inversion, two multiplications, two squarings, one multiplication by ω, one subtraction, and one negation, all in F_q, i.e., I_{q^2} = I_q + 2M_q + 2S_q + M_{(ω)} + sub_q + neg_q. We assume that we multiply F_{q^2}-elements with Karatsuba multiplication, which costs M_{q^2} = 3M_q + M_{(ω)} + 2add_q + 2sub_q. As in the general case above, we assume that the cost for a full multiplication in the quadratic extension is at least M_{q^2} ≥ 3M_q, i.e., we restrict to the average case where both elements have both of their coefficients different from 0. Thus we can give an upper bound on the I/M-ratio in F_{q^2}, depending on the ratio in F_q, as

R_{q^2} = I_{q^2}/M_{q^2} ≤ I_q/(3M_q) + 2 = R_q/3 + 2,

where we roughly assume that I_{q^2} ≤ I_q + 6M_q. This bound shows that for R_q > 3 the ratio becomes smaller in F_{q^2}. For large ratios in F_q it becomes roughly R_q/3.

Example 9.2 (Cubic extensions) Let F_{q^3} = F_q(α) with α^3 = ω ∈ F_q. Similar to the quadratic case, we can invert β = b_0 + b_1α + b_2α^2 ∈ F_{q^3}^* by

1/(b_0 + b_1α + b_2α^2) = (b_0^2 − ωb_1b_2)/N(β) + ((ωb_2^2 − b_0b_1)/N(β))α + ((b_1^2 − b_0b_2)/N(β))α^2


with N(β) = b_0^3 + b_1^3ω + b_2^3ω^2 − 3ωb_0b_1b_2. We start by computing ωb_1 and ωb_2 as well as b_0^2 and b_1^2. The terms in the numerators are obtained by a two-way Karatsuba multiplication and additions and subtractions, via 3M_q computing b_0b_2, ωb_1b_2, and (ωb_2 + b_0)(b_2 + b_1). The norm can be computed by three more multiplications and two additions. Thus the cost for the inversion is I_{q^3} = I_q + 9M_q + 2S_q + 2M_{(ω)} + 4add_q + 4sub_q. A Karatsuba multiplication can be done in M_{q^3} = 6M_q + 2M_{(ω)} + 9add_q + 6sub_q. We use M_{q^3} ≥ 6M_q, assume I_{q^3} ≤ I_q + 18M_q, and obtain

R_{q^3} = I_{q^3}/M_{q^3} ≤ (I_q/M_q)/6 + 3 = R_q/6 + 3.

9.2.3.1 Inversions in Towers of Field Extensions

Baktir and Sunar [25] introduce optimal tower fields as an alternative to optimal extension fields, where they build a large field extension as a tower of small extensions instead of one big extension. They describe how to use the above inversion technique recursively by passing the inversion down the tower, finally arriving at the base field. They show that this method is more efficient than computing the inversion in the corresponding large extension with the Itoh-Tsujii inversion directly. For towers of field extensions of degree two and three, such as those occurring in pairing-friendly fields, the inversion can be done using the two examples given above.
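As a concrete illustration of Example 9.1, a minimal sketch of the norm-based inversion in F_{q^2} (variable names ours) is:

```python
# Inversion in Fq2 = Fq(α) with α^2 = ω via the norm N(β) = b0^2 - b1^2*ω:
# one base-field inversion plus a few multiplications, as in Example 9.1.
def fq2_inv(b, q, omega):
    b0, b1 = b
    norm = (b0 * b0 - b1 * b1 * omega) % q   # N(β) ∈ Fq
    inv_norm = pow(norm, q - 2, q)           # the single inversion in Fq
    return (b0 * inv_norm % q, -b1 * inv_norm % q)
```

In a tower, the call pow(norm, q - 2, q) would itself be replaced by the inversion routine of the next field down, in the spirit of the optimal tower fields of Baktir and Sunar [25].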

9.2.4 Simultaneous Inversions

The inverses of s field elements a_1, . . . , a_s can be computed simultaneously with Montgomery's well-known inversion-sharing trick [252, section 10.3.1] at the cost of one inversion and 3(s − 1) multiplications. It is based on the following idea: to compute the inverses of two elements a and b, one computes their product ab and its inverse (ab)^{−1}. The inverses of a and b are then found by a^{−1} = b · (ab)^{−1} and b^{−1} = a · (ab)^{−1}, with an overall cost of just one inversion and three multiplications.

In general, for s elements one first computes the products c_i = a_1 · · · a_i for 2 ≤ i ≤ s with s − 1 multiplications and inverts c_s. Then we have a_s^{−1} = c_{s−1} · c_s^{−1}. We get c_{s−1}^{−1} = a_s · c_s^{−1} and a_{s−1}^{−1} = c_{s−2} · c_{s−1}^{−1}, and so forth (see [129, algorithm 11.15]), where we need 2(s − 1) more multiplications to get the inverses of all elements.

Since this method works for general field elements, i.e., it is not restricted to a specific extension degree, we leave out the indices in the notation of the inversion and multiplication costs and their ratio. The cost for s inversions is replaced by I + 3(s − 1)M. Let R_{avg,s} denote the ratio of the cost of s inversions to the cost of s multiplications. It is bounded above by

R_{avg,s} = I/(sM) + 3(s − 1)/s ≤ R/s + 3,

i.e., when the number s of elements to be inverted grows, the ratio R_{avg,s} gets closer to 3. Note that most of the time this method improves the efficiency of an implementation whenever it is applicable. However, in large field extensions, the inversion method described above in Section 9.2.3 on page 215 might lead to an I/M-ratio that is already less than 3. In this case, using the sharing trick at the highest level of the field extension would make the average ratio worse. But a combination of both methods, such as reducing the inversions down to the ground field and sharing them there, can further increase efficiency.
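A minimal sketch of the inversion-sharing trick over a prime field (names ours; the same pattern applies verbatim in any field):

```python
# Montgomery's batch inversion: s inverses for one inversion + 3(s-1) muls.
def batch_invert(elems, p):
    s = len(elems)
    prefix = [elems[0]]                    # prefix[i] = a_1 * ... * a_{i+1}
    for a in elems[1:]:
        prefix.append(prefix[-1] * a % p)
    inv = pow(prefix[-1], p - 2, p)        # the single field inversion
    out = [0] * s
    for i in range(s - 1, 0, -1):
        out[i] = prefix[i - 1] * inv % p   # a_{i+1}^{-1} = c_i * c_{i+1}^{-1}
        inv = inv * elems[i] % p           # c_i^{-1} = a_{i+1} * c_{i+1}^{-1}
    out[0] = inv
    return out
```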

9.3 Affine Coordinates for Pairing Computation

The previous section elaborated on two techniques that achieve a low inversion-to-multiplication cost ratio, namely working in extension fields and doing several inversions at once when possible. There exist scenarios in which one or both of these techniques can be applied and lead to affine coordinates being the most efficient choice. Affine coordinates are useful for curve operations in the group G_2, including those needed in the Miller loop, whenever a pairing-friendly curve is chosen such that G_2 is defined over an extension field of larger degree. This might occur for high-security pairings or whenever high-degree twists cannot be used. The second technique, simultaneous inversion, becomes possible whenever several pairings or a product of several pairings is computed, such that inversions can be synchronized and done simultaneously. This section states the costs for the doubling and addition steps in Miller's algorithm in affine coordinates and explains use cases of the above two techniques for achieving low inversion cost.

9.3.1 Costs for Doubling and Addition Steps

We begin by describing the evaluation of line functions in affine coordinates for ate-like pairings as in Section 9.1.2.4 on page 211, i.e., a point P on E, P ≠ O, is given by two affine coordinates as P = (x_P, y_P). Let R_1, R_2, S ∈ E with R_1 ≠ −R_2 and R_1, R_2, S ≠ O. Then the function of the line through R_1 and R_2 (tangent to E if R_1 = R_2), evaluated at S, is given by

l_{R_1,R_2}(S) = y_S − y_{R_1} − λ(x_S − x_{R_1}),

where λ = (3x_{R_1}^2 + a)/(2y_{R_1}) if R_1 = R_2, and λ = (y_{R_2} − y_{R_1})/(x_{R_2} − x_{R_1}) otherwise. The value λ is also used to compute R_3 = R_1 + R_2 on E by x_{R_3} = λ^2 − x_{R_1} − x_{R_2} and y_{R_3} = λ(x_{R_1} − x_{R_3}) − y_{R_1}. If R_1 = −R_2, then we have x_{R_1} = x_{R_2} and l_{R_1,R_2}(S) = x_S − x_{R_1}.

Let the notation be as described in Section 9.1.2.4; in particular we use a twist E′ of degree d to represent the group G_2 by the group G′_2. Let e = k/d; then G′_2 = E′(F_{q^e})[r]. Let P ∈ G_1, R′, Q′ ∈ G′_2, and let R = ψ(R′), Q = ψ(Q′). Furthermore, we assume that the field extension F_{q^k} is given by F_{q^k} = F_{q^e}(α), where α ∈ F_{q^k} is the same element as the one defining the twist E′, and we have α^d = ω ∈ F_{q^e}. This means that each element in F_{q^k} is given by a polynomial of degree d − 1 in α with coefficients in F_{q^e}, and the twisting isomorphism ψ maps (x′, y′) to (α^2 x′, α^3 y′).

9.3.1.1 Doubling Steps in Affine Coordinates

We need to compute

l_{R,R}(P) = y_P − α^3 y_{R′} − λ(x_P − α^2 x_{R′}) = y_P − αλ′x_P + α^3(λ′x_{R′} − y_{R′})

and R′_3 = [2]R′, where x_{R′_3} = λ′^2 − 2x_{R′} and y_{R′_3} = λ′(x_{R′} − x_{R′_3}) − y_{R′}. We have λ′ = (3x_{R′}^2 + a/α^4)/(2y_{R′}) and λ = (3x_R^2 + a)/(2y_R) = αλ′. Note that [2]R′ ≠ O in the pairing computation. The slope λ′ can be computed with I_{q^e} + M_{q^e} + S_{q^e} + 4add_{q^e}, assuming that we compute 3x_{R′}^2 and 2y_{R′} by additions. To compute the double of R′ from the slope λ′, we need at most M_{q^e} + S_{q^e} + 4sub_{q^e}. We obtain the line function value with a cost of eM_q to compute λ′x_P and M_{q^e} + sub_{q^e} + neg_{q^e} for d ∈ {4, 6}. When d = 2, note that α^2 = ω ∈ F_{q^e}, and thus we need (k/2)M_q + M_{q^{k/2}} + M_{(ω)} + 2sub_{q^{k/2}} for the line.

We summarize the operation counts in Table 9.2 on page 221. We restrict to even embedding degree, and to 4 | k for b = 0 as well as 6 | k for a = 0, because these cases allow using the maximal-degree twists and are likely to be used in practice. We compare the affine counts to the costs of the fastest formulas using projective coordinates, taken from [190] and [110]. For an overview of the most efficient explicit formulas known for elliptic-curve operations, see the Explicit-Formulas Database [48]. We transfer the formulas in [190] to the ate pairing using the trick in [110], where the ate pairing is computed entirely on the twist. In this setting we assume field extensions are constructed in a way that favors the representation of line function values. This means that the twist isomorphism can be different from the one described in this chapter. Still, in the case d = 2, evaluation of the line function cannot be done in kM_q; instead, two multiplications in F_{q^{k/2}} need to be done (see also the discussion in the respective sections of [110]). Furthermore, we assume that all precomputations are done as described in the above papers and that small multiples are computed by additions.


Table 9.2 Operation counts for the doubling step in the ate-like Miller loop, omitting 1S_{q^k} + 1M_{q^k}. The elements a, b ∈ F_q are the curve coefficients, k is the embedding degree with respect to the prime subgroup order r, and d is the degree of the twist, such that the group G′_2 is defined over F_{q^e}, where e = k/d.

DBL           d   coord.         Mq    Iqe   Mqe   Sqe   M(·)          addqe   subqe   negqe
ab ≠ 0, 2|k   2   affine         k/2   1     3     2     1M_(ω)        4       6       −
                  Jac. [190]     −     −     3     11    1M_(a/ω^2)    6       17      −
b = 0, 4|k    4   affine         k/4   1     3     2     −             4       5       1
                  W(1,2) [110]   k/2   −     2     8     1M_(a/ω)      9       10      1
a = 0, 6|k    6   affine         k/6   1     3     2     −             4       5       1
                  proj. [110]    k/3   −     2     7     1M_(b/ω)      11      10      1

9.3.1.2 Addition Steps in Affine Coordinates

The line function value has the same shape as for doubling steps. Note that we can replace R by Q in the line and compute

l_{R,Q}(P) = y_P − α^3 y_{Q′} − λ(x_P − α^2 x_{Q′}) = y_P − αλ′x_P + α^3(λ′x_{Q′} − y_{Q′})

and R′_3 = R′ + Q′, where x_{R′_3} = λ′^2 − x_{R′} − x_{Q′} and y_{R′_3} = λ′(x_{R′} − x_{R′_3}) − y_{R′}. The slope λ′ now is different, namely λ′ = (y_{R′} − y_{Q′})/(x_{R′} − x_{Q′}). Note that R′ = −Q′ does not occur when computing Miller function values of degree less than r. The cost for doing an addition step is the same as that for a doubling step, except that the cost to compute the slope λ′ is now I_{q^e} + M_{q^e} + 2sub_{q^e}. Table 9.3 compares the costs for affine addition steps to those in projective coordinates. Again, we take these operation counts from the literature (see [14, 110, 109] for the explicit formulas and details on the computation). Concerning the field and twist representations and the line function evaluation, similar remarks as for the doubling steps apply here. The multiplication with ω in the case d = 2 can be done as a precomputation, since Q′ is fixed throughout the pairing algorithm. Since the other formulas do not have multiplications by constants, we omit this column in Table 9.3.

9.3.1.3 Affine versus Projective

Doubling and addition steps for computing pairings in affine coordinates include one inversion in F_{q^e} per step. The various projective formulas avoid the inversion, but at the cost of doing more of the other operations.


Table 9.3 Operation counts for the addition step in the ate-like Miller loop, omitting 1M_{q^k}. The elements a, b ∈ F_q are the curve coefficients, k is the embedding degree with respect to the prime subgroup order r, and d is the degree of the twist, such that the group G′_2 is defined over F_{q^e}, where e = k/d.

ADD           d   coord.             Mq    Iqe   Mqe   Sqe   addqe   subqe   negqe
ab ≠ 0, 2|k   2   affine             k/2   1     3     1     −       8       −
                  Jacobian [14]      −     −     8     6     6       17      −
b = 0, 4|k    4   affine             k/4   1     3     1     −       7       1
                  W(1,2) [110]       k/2   −     9     5     7       8       1
a = 0, 6|k    6   affine             k/6   1     3     1     −       7       1
                  proj. [109, 110]   k/3   −     11    2     1       7       −

How much higher these costs exactly are depends on the underlying field implementation and on the ratio of the costs of squaring to multiplication. A rough estimate of the counts in Table 9.3 shows that the cost traded for the inversion in the projective addition formulas is equivalent to at least several M_{q^e}. For doubling steps it is smaller, but still equivalent to a few M_{q^e} in all cases. Since doubling steps are much more frequent in the pairing computation (especially when a low Hamming weight for the degree of the used Miller function is chosen), the traded cost in the doubling case is the most relevant to consider. The subsection concludes with two examples that give specific upper bounds on I_{q^e} such that affine coordinates are more efficient than projective ones.

Example 9.3 Let ab ≠ 0, i.e., d = 2. The cost that has to be weighed against the inversion cost for a doubling step is 9S_{q^{k/2}} − (k/2)M_q + M_{(a/ω^2)} − M_{(ω)} + 2add_{q^{k/2}} + 11sub_{q^{k/2}}. Clearly, (k/2)M_q < S_{q^{k/2}}, and we assume M_{(ω)} ≈ M_{(a/ω^2)} and add_{q^{k/2}} ≈ sub_{q^{k/2}}. We see that if an inversion costs less than 8S_{q^{k/2}} + 13add_{q^{k/2}}, then affine coordinates are better than Jacobian.

Example 9.4 In the case a = 0, d = 6, similar to the previous example, we deduce that if an inversion in F_{q^{k/6}} costs less than 5S_{q^{k/6}} − M_{q^{k/6}} + (k/6)M_q + M_{(b/ω)} + 12add_{q^{k/6}}, then affine coordinates beat the projective ones.

In order to fully assess the effect of using affine instead of projective coordinates, one needs to know the exact cost that is traded for the inversion. This strongly depends on the specific algorithms chosen to implement the operations in the extension fields. For example, the relative costs between multiplications and squarings can differ significantly. Commonly used values in the literature are S_q = M_q or S_q = 0.8M_q when q is a prime; see [48]. Using Karatsuba multiplication and squaring in a quadratic extension leads to a ratio of S_{q^2} ≈ (2/3)M_{q^2}. Therefore, we do not specify these costs any further and leave the evaluation to the point when a specific scenario and an implementation are chosen.
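To make such comparisons concrete, here is a small sketch of the break-even check from Example 9.3 (the cost weights come from the tables; the sample numbers are illustrative assumptions, not measurements):

```python
# Example 9.3 break-even for d = 2: affine beats Jacobian roughly when
# I < 8*S + 13*add, all measured in F_{q^{k/2}} and in units of M.
def affine_wins_d2(inv, sqr, add):
    return inv < 8 * sqr + 13 * add

# E.g., with S = 0.8 M and add = 0.1 M, affine wins whenever I < 7.7 M:
print(affine_wins_d2(7.0, 0.8, 0.1))   # True
```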

9.3.2 Working over Extension Fields

This subsection illustrates the usefulness of affine coordinates when the extension degree e is relatively large. To implement pairings at a given security level, on the one hand one needs to find a pairing-friendly elliptic curve such that r and q^k have a certain minimal size (e.g., given by current estimates for the runtimes of algorithms to solve the ECDLP); on the other hand, for efficiency it is desirable that they are not much larger than necessary. For a pairing-friendly elliptic curve E over F_q with embedding degree k with respect to a prime divisor r | #E(F_q), we define the ρ-value of E as ρ = log(q)/log(r). This value is a measure of the base field size relative to the size of the prime-order subgroup on the curve. The value ρk is the ratio of the sizes of q^k and r. For a given curve, the value ρk describes how well security is balanced between the curve groups and the finite field group. An overview of construction methods for pairing-friendly elliptic curves is given in [148]. In Table 9.4, we list suggestions for curve families by their construction in [148] for high security levels of 128, 192, and 256 bits. The last column in Table 9.4 shows the field extensions in which inversions are done to compute the line function slopes. We not only give families of curves with twists of degree 4 and 6, but also more generic families such that the curves only have a twist of degree 2. Of course, in the latter case the extension field in which inversions for the affine ate pairing need to be computed is larger than when dealing with higher-degree twists. Because curves with twists of degree 4 and 6 are special (they have j-invariants 1728 and 0), there might be reasons to choose the more generic curves. Note that curves from the given constructions are all defined over prime fields. Therefore we use the notation F_p in Table 9.4.

Remark The conclusion to underline from the discussion in this section is that, using the improved inversions in towers of extension fields described here, there are at least two scenarios where most implementations of the ate pairing would be more efficient using affine coordinates.


Table 9.4 Extension fields for which inversions are needed when computing ate-like pairings, for different examples of pairing-friendly curve families suitable for the given security levels.

Security   Construction in [148]   Curve       k    ρ      ρk      d   Extension
128        Ex. 6.8                 a = 0       12   1.00   12.00   6   F_{p^2}
           Ex. 6.10                b = 0        8   1.50   12.00   4   F_{p^2}
           Section 5.3             a, b ≠ 0    10   1.00   10.00   2   F_{p^5}
           Constr. 6.7+            a, b ≠ 0    12   1.75   21.00   2   F_{p^6}
192        Ex. 6.12                a = 0       18   1.33   24.00   6   F_{p^3}
           Ex. 6.11                b = 0       16   1.25   20.00   4   F_{p^4}
           Constr. 6.3+            a, b ≠ 0    14   1.50   21.00   2   F_{p^7}
256        Constr. 6.6             a = 0       24   1.25   30.00   6   F_{p^4}
           Constr. 6.4             b = 0       28   1.33   37.24   4   F_{p^7}
           Constr. 6.24+           a, b ≠ 0    26   1.17   30.34   2   F_{p^{13}}

When higher security levels are required, so that k is large. For example, consider 256-bit security with k = 28, so that most of the computations for the ate pairing take place in the field extension of degree 7, even using a degree-4 twist (second-to-last line of Table 9.4). In that case, the I/M-ratio in the degree-7 extension field would be roughly 22 times less (plus 6) than the ratio in the base field (see the last entry in Table 9.1 on page 217). The costs for doubling and addition steps given in the second lines of Tables 9.2 on page 221 and 9.3 on page 222 for degree-4 twists show that the cost of the inversion avoided in a projective implementation should be compared with roughly 6S_{q^7} + 5add_{q^7} + 5sub_{q^7} extra for a doubling step (and an extra 6M_{q^7} + 4S_{q^7} + 7add_{q^7} + sub_{q^7} for an addition step). In most implementations of the base field arithmetic, the cost of these 16 or 17 operations in the extension field would outweigh the cost of one improved inversion in the extension field.

When special high-degree twists are not being used. In this scenario there are two reasons why affine coordinates will be better under most circumstances. First, the costs for doubling and addition steps given in the first lines of Tables 9.2 and 9.3 for degree-2 twists are not nearly as favorable towards projective coordinates as the formulas in the case of higher-degree twists. For degree-2 twists, both the doubling and addition steps require roughly at least 9 extra squarings and 13 or 15 extra field-extension additions or subtractions for the projective formulas. Second, the degree of the extension field where the operations take place is larger; see the bottom row for each security level in Table 9.4 on page 224, so we have extension degree 6 for 128-bit security up to extension degree 13 for 256-bit security.

9.3.3 Simultaneous Inversions in Pairing Computation

This subsection discusses different scenarios in which the simultaneous inversion technique from Section 9.2.4 on page 218 can be applied and might lead to a low enough I/M-ratio to favor affine coordinates.

9.3.3.1 Sharing Inversions in a Single Pairing Computation

Schroeppel and Beaver [304] demonstrate the use of the inversion-sharing trick to speed up a single scalar multiplication on an elliptic curve in affine coordinates. They suggest postponing addition steps in the double-and-add algorithm to exploit inversion sharing. In order to do that, the double-and-add algorithm must be carried out by going through the binary representation of the scalar from right to left. First, all doublings are carried out, and the points that will be used to add up to the final result are stored. When all these points have been collected, several additions can be done at once, sharing the computation of inversions among them.

Miller's algorithm can also be done from right to left. The doubling steps are computed without doing the addition steps. The required field elements and points are stored in lists, and the addition steps are done at the end. The algorithm is summarized in Algorithm 9.3. Unfortunately, addition steps cost much more than in the conventional left-to-right algorithm as it is given in Algorithm 9.2 on page 212. In the right-to-left version, each addition step in line 10 needs a general F_{q^k}-multiplication and a multiplication with a line function value. The conventional algorithm only needs a multiplication with a line. These huge costs cannot be compensated for by using affine coordinates with the inversion-sharing trick.

9.3.3.2 Parallelizing a Single Pairing

However, the right-to-left algorithm can be parallelized, and this could lead to more efficient implementations by taking advantage of many-core machines. Grabher, Großschädl, and Page [163, Algorithm 2.2] use a version of Algorithm 9.3 to compute a single pairing by doing addition steps in parallel on two different cores. They divide the lists with the saved function values and points into two halves and compute two intermediate values, which are in the end combined in a single addition step. For their specific implementation, they conclude that this is not faster than the conventional non-parallel algorithm.


Algorithm 9.3 Right-to-left version of Miller's algorithm with postponed addition steps for even k and ate-like pairings

Input: Q′ ∈ G′_2, P ∈ G_1, m = (m_{l−1}, m_{l−2}, . . . , m_0)_2, m_{l−1} = 1
Output: f_{m,ψ(Q′)}(P) representing a class in F*_{q^k}/(F*_{q^k})^r
1:  R′ ← Q′, f ← 1, j ← 0
2:  for i from 0 to l − 1 do
3:    if (m_i = 1) then
4:      A_R[j] ← R′, A_f[j] ← f, j ← j + 1
5:    end if
6:    f ← f^2 · l_{ψ(R′),ψ(R′)}(P), R′ ← [2]R′
7:  end for
8:  R′ ← A_R[0], f ← A_f[0]
9:  for (j ← 1; j ≤ h(m) − 1; j++) do
10:   f ← f · A_f[j] · l_{ψ(R′),ψ(A_R[j])}(P), R′ ← R′ + A_R[j]
11: end for
12: return f

Still, this idea might be useful for two or more cores, in case multiple cores can be used with less overhead. It is straightforward to extend this algorithm to more cores. The parallelized algorithm can be combined with the shared-inversion trick when doing the addition steps at the end. The improvements achieved by this approach strongly depend on the Hamming weight of the value m in Miller's algorithm. If it is large, then savings are large, while for very sparse m there is almost no improvement. Therefore, when it is not possible to choose m with low Hamming weight, combining the parallelized right-to-left algorithm for pairings with the shared-inversion trick can speed up the computation. The communication overhead in order to gather data from intermediate computations on different cores to compute the inverse at one core imposes a non-trivial performance penalty, and it remains to be seen whether this approach is worthwhile. Grabher et al. [163] note that when multiple pairings are computed, it is better to parallelize by performing one pairing on each core.

9.3.3.3 Multiple Pairings and Products of Pairings

Many protocols involve the computation of multiple pairings or products of pairings. For example, multiple pairings need to be computed in the searchable encryption scheme of Boneh et al. [57]; and the non-interactive proof systems proposed by Groth and Sahai [172] need to check pairing product equations.


In these scenarios, we propose sharing inversions when computing pairings with affine coordinates. In the case of products of pairings, this has already been proposed and investigated by Scott [305, section 4.3] and by Granger and Smart [168]; see also the more recent work [306] by Scott.

9.3.3.4 Multiple Pairings

Assume we want to compute s pairings on points Q′_i and P_i, i.e., a priori we have s Miller loops to compute f_{m,ψ(Q′_i)}(P_i). We carry out these loops simultaneously, doing all steps up to the first inversion computation for a line function slope for all of them. Only after that, all slope denominators are inverted simultaneously, and we continue with the computation for all pairings until the next inversion occurs. The s Miller loops are thus not computed sequentially, but rather sliced at the slope denominator inversions. The costs stay the same, except that now the average inversion-to-multiplication cost ratio is 3 + R_{q^e}/s, where e = k/d and d is the twist degree. So when computing sufficiently many pairings, so that the average cost of an inversion is small enough, using the sliced-Miller approach with inversion sharing in affine coordinates is faster than using the projective-coordinate explicit formulas described in Section 9.3.1 on page 219.

9.3.3.5 Products of Pairings

For computing a product of pairings, more optimizations can be applied, including the above inversion sharing. Scott [305, section 4.3] suggests using affine coordinates and sharing the inversions for computing the line function slopes as described above for multiple pairings. Furthermore, since the Miller function of the pairing product is the product of the Miller functions of the single pairings, in each doubling and addition step the line functions can already be multiplied together. In this way, we only need one intermediate variable f and only one squaring per iteration of the product Miller loop. Of course, in the end there is only one final exponentiation on the product of the Miller function values. Granger and Smart [168] show that by using these optimizations the cost for introducing an additional ate pairing to the product can be as low as 13% of the cost of a single ate pairing.

9.4 The Double-Add Operation and Parabolas

This section consists largely of lightly edited material from Sections 3, 5, and 6 of [137]. In [137], an improvement to the double-and-add operation on an elliptic curve was introduced, and a related idea was applied to improve the analogous operation in pairing computations on elliptic curves. In affine coordinates, the technique allows one to perform a doubling and an addition, 2P + Q, on an elliptic curve E using only 1M + 2S + 2I (plus an extra squaring when P = Q). This is achieved as follows: to compute 2P + Q, where P = (x_1, y_1) and Q = (x_2, y_2), first find P + Q, but omit its y-coordinate, since it is not needed for the next stage. This saves one field multiplication. Next compute (P + Q) + P. This way, two point additions are performed while saving one multiplication. This trick also works when P = Q, i.e., when tripling a point.

9.4.1 Description of the Algorithm

Suppose P = (x_1, y_1) and Q = (x_2, y_2) are distinct points on E different from O, with x_1 ≠ x_2. The details for the other cases are given in figure 1 of [137]; that figure also covers the special cases where an input or an intermediate result is the point O. Recall from Section 9.1.1 and Equation (9.1) on page 208 that the point P + Q is then given by (x_3, y_3), where x_3 = λ_1^2 − x_1 − x_2 and y_3 = (x_1 − x_3)λ_1 − y_1, with λ_1 = (y_2 − y_1)/(x_2 − x_1). To add P + Q to P, add (x_1, y_1) to (x_3, y_3) using the above rule (assuming x_3 ≠ x_1). The result has coordinates (x_4, y_4), where x_4 = λ_2^2 − x_1 − x_3 and y_4 = (x_1 − x_4)λ_2 − y_1, with λ_2 = (y_3 − y_1)/(x_3 − x_1). The computation of y_3 can be omitted, because it is used only in the computation of λ_2, which can be computed without knowing y_3 as λ_2 = −λ_1 − 2y_1/(x_3 − x_1). Omitting the y_3 computation saves one field multiplication. Each λ_2 formula requires a field division, so the overall saving is this one field multiplication. This trick can also be applied to save one multiplication when computing 3P, the triple of a point P ≠ O, where the λ_2 computation will need the slope of a line through the two distinct points 2P and P. It can be used twice to save two multiplications when computing 3P + Q = ((P + Q) + P) + P. Thus 3P + Q can be computed using one multiplication, three squarings, and three divisions. Such a sequence of operations would be performed repeatedly if an exponent were written in ternary form and left-to-right exponentiation were used. Ternary representation performs worse than binary representation for large random exponents k, but the operation of triple-and-add was explored further in the context of double-base chains in [99, section 5].
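Under the assumptions of this description (distinct non-zero points, x_1 ≠ x_2, x_3 ≠ x_1), a minimal sketch of the double-add operation 2P + Q over a prime field is:

```python
# Double-add 2P + Q = (P + Q) + P, omitting y3 as in Section 9.4.1;
# special cases (points at infinity, equal x-coordinates) are not handled.
def double_add(P, Q, p):
    x1, y1 = P
    x2, y2 = Q
    lam1 = (y2 - y1) * pow(x2 - x1, -1, p) % p         # slope of chord P, Q
    x3 = (lam1 * lam1 - x1 - x2) % p                   # x(P + Q); y3 not needed
    lam2 = (-lam1 - 2 * y1 * pow(x3 - x1, -1, p)) % p  # λ2 computed without y3
    x4 = (lam2 * lam2 - x1 - x3) % p
    y4 = (lam2 * (x1 - x4) - y1) % p
    return (x4, y4)
```

The two pow(·, -1, p) calls are the two divisions; the saved operation is the field multiplication that a direct computation of y_3 would have required.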


9.4.2 Application to Scalar Multiplication

The above double-and-add operation can be applied to scalar multiplication on an elliptic curve in affine coordinates. In the naive left-to-right method of binary exponentiation, the double-and-add trick can be applied repeatedly at each stage of the computation. More efficient variants could use a combination of left-to-right or right-to-left m-ary exponentiation with sliding-window methods, addition-subtraction chains, signed representations, etc. When using affine coordinates, the double-and-add trick can be used on top of these methods for m = 2 to obtain savings, depending upon the window size that is used. Note that we discuss the double-and-add technique in a historic manner and neglect the protection of scalar multiplication against side channels, such as making sure that the algorithm runs in constant time.

Another application in which the double-and-add operation occurs more frequently than in single scalar multiplication is multi-exponentiation, for example a simultaneous three-fold scalar multiplication k_1 P_1 + k_2 P_2 + k_3 P_3, where the three exponents k_1, k_2, and k_3 have approximately the same length. One algorithm creates an 8-entry table with the precomputed points O, P_1, P_2, P_2 + P_1, P_3, P_3 + P_1, P_3 + P_2, P_3 + P_2 + P_1. Subsequently it uses one elliptic-curve doubling followed by the addition of a table entry, for each bit in the exponents [249]. For random scalars k_i and a binary scalar decomposition, about 7/8 of the doublings will be followed by an addition of a table entry other than O.

9.4.3 Application to Pairings

The double-and-add technique can be extended to the function evaluation in Miller's algorithm and thus can be applied to pairing computation as well. The key observation is that one can replace consecutive evaluations of line functions by the evaluation of a parabola, and thus obtain an analogue of the double-and-add operation in the form of a Miller formula.

9.4.3.1 Using the Double-and-Add Trick with Parabolas

We use notation as in the paragraph on Miller's algorithm in Section 9.1.2 on page 209. For two integers b, c, instead of computing f_{2b+c,P}(Q) and [2b + c]P from (f_{b,P}(Q), [b]P) and (f_{c,P}(Q), [c]P) via f_{2b,P}(Q) = f_{b,P}^2(Q) g_{[b]P,[b]P} and [2]([b]P), and then f_{2b+c,P}(Q) = f_{2b,P}(Q) f_{c,P}(Q) g_{[2b]P,[c]P} and [2b]P + [c]P, we now describe an alternative approach that computes the result directly, producing only the x-coordinate of the intermediate point [b]P + [c]P. To combine the two steps, one constructs a parabola through the points [b]P, [b]P, [c]P, −[2b + c]P.


To form f_{2b+c,P}(Q), combine forming f_{b+c,P}(Q) with f_{b+c+b,P}(Q) as follows. We omit the notation for evaluation at Q, simply writing f_{2b+c} etc., and use the notation l_P for the vertical line function l_{P,−P} through P and −P. We have

f_{2b+c,P} = f_{b+c,P} · f_{b,P} · g_{[b+c]P,[b]P}
          = (f_{b,P} · f_{c,P} · g_{[b]P,[c]P}) · (f_{b,P} · g_{[b+c]P,[b]P})
          = (f_{b,P}^2 · f_{c,P} / l_{[2b+c]P}) · (l_{[b]P,[c]P} · l_{[b+c]P,[b]P} / l_{[b+c]P}).

Now, one can replace the second fraction l_{[b]P,[c]P} · l_{[b+c]P,[b]P}/l_{[b+c]P} by the parabola whose formula is given in the next paragraph.

9.4.3.2 Equation for the Parabola

If R and S are points on E, then there is a (possibly degenerate) parabolic equation passing through R twice (i.e., tangent at R) and also passing through S and −[2]R − S. Using the notation R = (x_1, y_1) and S = (x_2, y_2) with R + S = (x_3, y_3) and [2]R + S = (x_4, y_4), a formula for this parabola is

(y + y_3 − λ_1(x − x_3)) · (y − y_3 − λ_2(x − x_3)) / (x − x_3).   (9.4)

The left half of the numerator of (9.4) is a line passing through R, S, and −R − S, whose slope is λ_1. The second half of the numerator is a line passing through R + S, R, and −[2]R − S, whose slope is λ_2. The denominator is a (vertical) line through R + S and −R − S. The quotient has zeros at R, R, S, −[2]R − S and a pole of order four at O. One can simplify Equation (9.4) by expanding it in powers of x − x_3, using the Weierstrass equation for E to eliminate references to y^2 and y_3^2:

(y^2 − y_3^2)/(x − x_3) − λ_1(y − y_3) − λ_2(y + y_3) + λ_1λ_2(x − x_3)
    = x^2 + xx_3 + x_3^2 + a + λ_1λ_2(x − x_3) − λ_1(y − y_3) − λ_2(y + y_3)
    = x^2 + (x_3 + λ_1λ_2)x − (λ_1 + λ_2)y + (x_3^2 + a − λ_1λ_2x_3 + (λ_1 − λ_2)y_3).   (9.5)

Knowing that (9.5) passes through R = (x_1, y_1), one succinct formula for the parabola is

(x − x_1)(x + x_1 + x_3 + λ_1λ_2) − (λ_1 + λ_2)(y − y_1).   (9.6)

In the previous section we can now replace l_{[b]P,[c]P} · l_{[b+c]P,[b]P}/l_{[b+c]P} by the parabola (9.6) with R = [b]P and S = [c]P. Formula (9.6) for the parabola does not reference y_3 and is never identically zero, since its x^2 coefficient is 1.
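A minimal sketch evaluating the succinct parabola (9.6) at a point, for a curve y^2 = x^3 + ax + b over a prime field (names ours; degenerate cases are not handled):

```python
# Evaluate the parabola through R, R, S, -[2]R - S at X = (x, y),
# using formula (9.6); note that y3 is never computed.
def parabola_eval(R, S, X, a, p):
    x1, y1 = R
    x2, y2 = S
    if R == S:
        lam1 = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p   # tangent at R
    else:
        lam1 = (y2 - y1) * pow(x2 - x1, -1, p) % p          # chord through R, S
    x3 = (lam1 * lam1 - x1 - x2) % p                        # x(R + S)
    lam2 = (-lam1 - 2 * y1 * pow(x3 - x1, -1, p)) % p       # slope through R + S and R
    x, y = X
    return ((x - x1) * (x + x1 + x3 + lam1 * lam2)
            - (lam1 + lam2) * (y - y1)) % p
```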


The following example shows how, in the case of affine coordinates, the double-and-add operation can help to make the pairing algorithm more efficient.

Example 9.5 Assume that we are computing an ate-like pairing in the same setting as in Section 9.3.1 on page 219. Let ab ≠ 0, i.e., d = 2, let k be even such that we can use denominator elimination, and note that ω = α^2. Now, assume that we are at a position in the Miller loop where we are computing a doubling step followed by an addition step. This means that the two steps compute f_{[2]R′+Q′,ψ(Q′)}(P) from f_{R′,ψ(Q′)}(P) and R′. Let us first analyse the cost when doing this the usual way. We simply add the costs for doubling and addition from Tables 9.2 on page 221 and 9.3 on page 222, respectively. The resulting cost is kM_q + 2I_{q^{k/2}} + 6M_{q^{k/2}} + 3S_{q^{k/2}} + 1M_{(ω)} + 4add_{q^{k/2}} + 14sub_{q^{k/2}}, plus the squaring cost S_{q^k} for the standard Miller loop squaring and two multiplications 2M_{q^k} with line functions, which are regular field multiplications in F_{q^k} because line functions are not sparse in this case. Next assume that we are using the parabola technique instead of the usual two steps. The squaring cost S_{q^k} in the large field remains, but we only have to do one multiplication M_{q^k} with the parabola. Carrying out the analysis the same way as in Section 9.3.1, one sees that it is possible to compute the parabola and the point [2]R′ + Q′ with cost (k/2)M_q + 2I_{q^{k/2}} + 6M_{q^{k/2}} + 2S_{q^{k/2}} + 2M_{(ω)} + M_{(ω^2)} + 5add_{q^{k/2}} + 8sub_{q^{k/2}} + add_q + sub_q. Overall, the parabola technique saves M_{q^k} + (k/2)M_q + S_{q^{k/2}}; clearly the add and sub costs are lower, but it uses two more multiplications, by ω and ω^2, respectively. In this specific setting, the parabola method is indeed more efficient than the standard Miller loop with only doubling and addition steps. This effect becomes more pronounced for larger embedding degrees.

9.5 Squared Pairings

For applications to cryptography, the idea of using the squared Weil or Tate pairing instead of the usual pairings was introduced in [138]. Peter was very fond of the idea of squared pairings. The denominator cancellation, which improves efficiency, was one very nice consequence, but that was also achieved independently in a different way in [29, 30]. However, an often unrecognized part of the history is that the squared Weil pairing may be viewed as a precursor to the definitions of the ate pairing and of optimal pairings, which are in the same vein as the squared pairing. Namely, those more recent definitions consider a higher power of the Tate pairing and achieve greater efficiency by reducing the Miller loop length. The following exposition of the squared Weil and Tate pairings is a lightly edited version of sections 2 and 3 from [138].

9.5.1 The Squared Weil Pairing

The squared pairing idea was introduced in [138] as a construction of new pairings, called the squared Weil pairing and the squared Tate pairing. An analogous pairing for hyperelliptic curves was also introduced. In each case, a direct formula for evaluating the pairing was given, along with proofs that the values coincide with the value of the original pairing (Weil or Tate) squared. Here we present only the construction of the new pairing and omit the proofs of correctness. These alternate pairings had the advantage of being more efficient to compute than Miller's algorithm for the original Weil and Tate pairings. The squared pairing also has the advantage that it is guaranteed to output the correct answer and does not depend on inputting a randomly chosen point. In contrast, Miller's algorithm may restart, since the randomly chosen point can cause the algorithm to fail.

9.5.1.1 Algorithm for e_r(P, Q)^2

Given two r-torsion points P and Q on E, we want to compute e_r(P, Q)^2. Start with an addition-subtraction chain for r. That is, after an initial 1, every element in the chain is a sum or difference of two earlier elements, until an r appears. Well-known techniques give a chain of length O(log(r)). For each j in the addition-subtraction chain, form a tuple t_j = [[j]P, [j]Q, n_j, d_j] such that

n_j/d_j = (f_{j,P}(Q) · f_{j,Q}(−P)) / (f_{j,P}(−Q) · f_{j,Q}(P)).   (9.7)

Start with t_1 = [P, Q, 1, 1]. Given t_j and t_k, the following procedure gets t_{j+k}.

1. Compute the points [j]P + [k]P = [j + k]P and [j]Q + [k]Q = [j + k]Q.
2. Find the coefficients of the line l_{[j]P,[k]P}(X) = c_0 + c_1x(X) + c_2y(X).
3. Find the coefficients of the line l_{[j]Q,[k]Q}(X) = c′_0 + c′_1x(X) + c′_2y(X).
4. Set n_{j+k} = n_j n_k (c_0 + c_1x(Q) + c_2y(Q)) (c′_0 + c′_1x(P) − c′_2y(P)),
       d_{j+k} = d_j d_k (c_0 + c_1x(Q) − c_2y(Q)) (c′_0 + c′_1x(P) + c′_2y(P)).

A similar construction gives t_{j−k} from t_j and t_k. The vertical lines through [j + k]P and [j + k]Q do not appear in the formulas for n_{j+k} and d_{j+k}, because the contributions from Q and −Q (or from P and −P) are equal. When j + k = r, this simplifies to n_{j+k} = n_j n_k and d_{j+k} = d_j d_k, since c_2 and c′_2 will be zero.


When n_r and d_r are non-zero, then the computation

n_r/d_r = (f_{r,P}(Q) · f_{r,Q}(−P)) / (f_{r,P}(−Q) · f_{r,Q}(P))

has been successful, and we have the correct output. If, however, n_r or d_r is zero, then some factor such as c_0 + c_1x(Q) + c_2y(Q) must have vanished. That line was chosen to pass through [j]P, [k]P, and −[j + k]P, for some j and k. It does not vanish at any other point on the elliptic curve. Therefore this factor can vanish only if Q = [j]P or Q = [k]P or Q = −[j + k]P. In all of these cases Q will be a multiple of P, ensuring e_r(P, Q) = 1. Overall, the squared Weil pairing advances from t_i and t_j to t_{i+j} with 12 field multiplications and 2 field divisions in the generic case, compared to 18 field multiplications and 2 field divisions for Miller's general method. When i = j, each algorithm needs 2 additional field multiplications due to the elliptic-curve doublings.

9.5.2 The Squared Tate Pairing Assume P ∈ E(Fq )[r], and Q ∈ E(Fqk ), with neither being the identity and P not equal to a multiple of Q. Define  k  fr,P (Q) (q −1)/r vr (P, Q) := , fr,P (−Q) where fr,P is as above, and call vr the squared Tate pairing. To justify this terminology, it was shown in [138] that vr (P, Q) = tr (P, Q)2 . 9.5.2.1 Algorithm for vr (P, Q) Given an r-torsion point P on E and a point Q on E, we want to compute vr (P, Q). As before, start with an addition-subtraction chain for r. For each j in the addition-subtraction chain, form a tuple t j = [[ j]P, n j , d j ] such that nj f j,P (Q) . = dj f j,P (−Q)

(9.8)

Start with t_1 = [P, 1, 1]. Given t_j and t_k, the following procedure gets t_{j+k}.

1. Compute the point [j]P + [k]P = [j + k]P.
2. Find the line function g_{[j]P,[k]P}(X) = c_0 + c_1x(X) + c_2y(X).
3. Set n_{j+k} = n_j · n_k · (c_0 + c_1x(Q) + c_2y(Q)) and d_{j+k} = d_j · d_k · (c_0 + c_1x(Q) − c_2y(Q)).
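A minimal sketch of this update step, writing all arithmetic modulo a prime p for simplicity and using hypothetical helpers ec_add (point addition) and line_coeffs (returning (c_0, c_1, c_2) of the line through its two arguments, tangent if they coincide) that are not specified in the text:

```python
# One addition-chain step t_j, t_k -> t_{j+k} of the squared Tate pairing.
def tate_step(tj, tk, Q, p):
    Rj, nj, dj = tj                      # t_j = [[j]P, n_j, d_j]
    Rk, nk, dk = tk
    c0, c1, c2 = line_coeffs(Rj, Rk)     # l(X) = c0 + c1*x(X) + c2*y(X)
    xQ, yQ = Q
    n = nj * nk * (c0 + c1 * xQ + c2 * yQ) % p
    d = dj * dk * (c0 + c1 * xQ - c2 * yQ) % p
    return (ec_add(Rj, Rk), n, d)
```

Iterating this along the addition-subtraction chain for r yields n_r/d_r, which is then raised to the power (q^k − 1)/r.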


A similar construction gives t_{j−k} from t_j and t_k. The vertical lines through [j + k]P and [j + k]Q do not appear in the formulas for n_{j+k} and d_{j+k}, because the contributions from Q and −Q are equal. When j + k = r, one can further simplify this to n_{j+k} = n_j · n_k and d_{j+k} = d_j · d_k, since c_2 will be zero. When n_r and d_r are non-zero, then the computation of (9.8) with j = r is successful, and after raising to the (q^k − 1)/r-th power, we have the correct output. If n_r or d_r were zero, then some factor such as c_0 + c_1x(Q) + c_2y(Q) must have vanished. That line was chosen to pass through [j]P, [k]P, and −[j + k]P, for some j and k. It does not vanish at any other point on the elliptic curve. Therefore this factor can vanish only if Q = [j]P or Q = [k]P or Q = −[j + k]P for some j and k. In all of these cases Q would be a multiple of P, contrary to our assumption.


Bibliography

[1] T. Acar and D. Shumow. Modular reduction without pre-computation for special moduli. Technical report, Microsoft Research, 2010. (Cited on page 24.) [2] L. M. Adleman. A subexponential algorithm for the discrete logarithm problem with applications to cryptography. In Proceedings of the 20th Annual Symposium on Foundations of Computer Science, SFCS ’79, pages 55–60, Washington, DC, USA, 1979. IEEE Computer Society. (Cited on pages 139 and 140.) [3] L. M. Adleman. Factoring numbers using singular integers. In Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, May 5–8, 1991, New Orleans, Louisiana, USA, pages 64–71, 1991. (Cited on pages 154 and 155.) [4] L. M. Adleman. The story of sneakers, the movie and Len Adleman the mathematician. URL: http://www.usc.edu/dept/molecular-science/fm-sneakers.htm, 1991 (accessed April 20, 2017). (Cited on page 139.) [5] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA, 1974. (Cited on pages 195 and 196.) [6] H. Aigner, H. Bock, M. Hütter, and J. Wolkerstorfer. A low-cost ECC coprocessor for smartcards. In M. Joye and J.-J. Quisquater, editors, Cryptographic Hardware and Embedded Systems – CHES 2004, volume 3156 of Lecture Notes in Computer Science, pages 107–118. Springer, Heidelberg, Aug. 2004. (Cited on page 40.) [7] T. Akishita. Fast simultaneous scalar multiplication on elliptic curve with Montgomery form. In S. Vaudenay and A. M. Youssef, editors, SAC 2001: 8th Annual International Workshop on Selected Areas in Cryptography, volume 2259 of Lecture Notes in Computer Science, pages 255–267. Springer, Heidelberg, Aug. 2001. (Cited on page 108.) [8] W. R. Alford and C. Pomerance. Implementing the self-initializing quadratic sieve on a distributed network. In A. J. van der Poorten, I. Shparlinski, and H. G. Zimmer, editors, Number Theoretic and Algebraic Methods in Computer Science (Moscow 1993), pages 163–174. World Scientific, 1995. (Cited on pages 135 and 137.) [9] A. Ambainis, Y. Filmus, and F. Le Gall. Fast matrix multiplication: Limitations of the Coppersmith-Winograd method. In R. A. Servedio and R. Rubinfeld, editors, Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of

235 Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:30, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.011

236

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17] [18]

[19] [20]

[21]

Bibliography

Computing, STOC 2015, Portland, OR, USA, June 14–17, 2015, pages 585–593. ACM, 2015. (Cited on page 123.) S. Antão, J.-C. Bajard, and L. Sousa. RNS-based elliptic curve point multiplication for massive parallel architectures. The Computer Journal, 55(5):629–647, 2012. (Cited on page 37.) D. F. Aranha, P. S. L. M. Barreto, P. Longa, and J. E. Ricardini. The realm of the pairings. In T. Lange, K. Lauter, and P. Lisonek, editors, SAC 2013: 20th Annual International Workshop on Selected Areas in Cryptography, volume 8282 of Lecture Notes in Computer Science, pages 3–25. Springer, Heidelberg, Aug. 2014. (Cited on pages 207 and 215.) D. F. Aranha, L. Fuentes-Castañeda, E. Knapp, A. Menezes, and F. RodríguezHenríquez. Implementing pairings at the 192-bit security level. In M. Abdalla and T. Lange, editors, Pairing-Based Cryptography – Pairing 2012, volume 7708 of Lecture Notes in Computer Science, pages 177–195. Springer, 2012. (Cited on page 207.) D. F. Aranha, K. Karabina, P. Longa, C. H. Gebotys, and J. López. Faster explicit formulas for computing pairings over ordinary curves. In K. G. Paterson, editor, Advances in Cryptology – EUROCRYPT 2011, volume 6632 of Lecture Notes in Computer Science, pages 48–68. Springer, Heidelberg, May 2011. (Cited on pages 207 and 215.) C. Arène, T. Lange, M. Naehrig, and C. Ritzenthaler. Faster computation of the Tate pairing. Journal of Number Theory, 131(5):842–857, 2011. (Cited on pages 221 and 222.) D. Atkins, M. Graff, A. K. Lenstra, and P. C. Leyland. The magic words are squeamish ossifrage. In J. Pieprzyk and R. Safavi-Naini, editors, Advances in Cryptology – ASIACRYPT’94, volume 917 of Lecture Notes in Computer Science, pages 263–277. Springer, Heidelberg, Nov. / Dec. 1995. (Cited on pages 135 and 153.) R. M. Avanzi, H. Cohen, C. Doche, G. Frey, T. Lange, K. Nguyen, and F. Vercauteren. Handbook of Elliptic and Hyperelliptic Curve Cryptography. Chapman & Hall/CRC Press, 2005. (Cited on pages 90, 244, and 245.) E. Bach and J. Shallit. Factoring with cyclotomic polynomials. Mathematics of Computation, 52:201–219, 1989. (Cited on page 116.) S. Bai, C. Bouvier, A. Kruppa, and P. Zimmermann. Better polynomials for GNFS. Mathematics of Computation, pages 1–12, December 2015. (Cited on page 174.) S. Bai, R. P. Brent, and E. Thomé. Root optimization of polynomials in the number field sieve. Mathematics of Computation, 84(295), 2015. (Cited on page 173.) D. V. Bailey, L. Batina, D. J. Bernstein, P. Birkner, J. W. Bos, H.-C. Chen, C.M. Cheng, G. van Damme, G. de Meulenaer, L. J. D. Perez, J. Fan, T. Güneysu, F. Gurkaynak, T. Kleinjung, T. Lange, N. Mentens, R. Niederhagen, C. Paar, F. Regazzoni, P. Schwabe, L. Uhsadel, A. V. Herrewege, and B.-Y. Yang. Breaking ECC2K-130. Cryptology ePrint Archive, Report 2009/541, 2009. http:// eprint.iacr.org/2009/541 (accessed May 3, 2017). (Cited on page 9.) D. V. Bailey and C. Paar. Efficient arithmetic in finite field extensions with application in elliptic curve cryptography. Journal of Cryptology, 14(3):153–176, 2001. (Cited on page 215.)

Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:30, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.011

Bibliography

237

[22] J. Bajard, M. E. Kaihara, and T. Plantard. Selected RNS bases for modular multiplication. In J. D. Bruguera, M. Cornea, D. D. Sarma, and J. Harrison, editors, 19th IEEE Symposium on Computer Arithmetic – ARITH 2009, pages 25–32. IEEE Computer Society, 2009. (Cited on page 37.)
[23] J.-C. Bajard, L.-S. Didier, and P. Kornerup. An RNS Montgomery modular multiplication algorithm. IEEE Transactions on Computers, 47(7):766–776, 1998. (Cited on page 36.)
[24] J.-C. Bajard and L. Imbert. A full RNS implementation of RSA. IEEE Transactions on Computers, 53(6):769–774, June 2004. (Cited on page 37.)
[25] S. Baktir and B. Sunar. Optimal tower fields. IEEE Transactions on Computers, 53(10):1231–1243, 2004. (Cited on page 218.)
[26] R. Barbulescu, J. W. Bos, C. Bouvier, T. Kleinjung, and P. L. Montgomery. Finding ECM-friendly curves through a study of Galois properties. The Open Book Series – Proceedings of the Tenth Algorithmic Number Theory Symposium, pages 63–86, 2013. (Cited on pages 4 and 94.)
[27] R. Barbulescu, P. Gaudry, A. Joux, and E. Thomé. A heuristic quasi-polynomial algorithm for discrete logarithm in finite fields of small characteristic. In P. Q. Nguyen and E. Oswald, editors, Advances in Cryptology – EUROCRYPT 2014, volume 8441 of Lecture Notes in Computer Science, pages 1–16. Springer, Heidelberg, May 2014. (Cited on page 140.)
[28] R. Barbulescu and A. Lachand. Some mathematical remarks on the polynomial selection in NFS. Mathematics of Computation, 86(303):397–418, 2017. (Cited on page 172.)
[29] P. S. L. M. Barreto, H. Y. Kim, B. Lynn, and M. Scott. Efficient algorithms for pairing-based cryptosystems. In M. Yung, editor, Advances in Cryptology – CRYPTO 2002, volume 2442 of Lecture Notes in Computer Science, pages 354–368. Springer, Heidelberg, Aug. 2002. (Cited on pages 207, 210, 212, and 231.)
[30] P. S. L. M. Barreto, B. Lynn, and M. Scott. Efficient implementation of pairing-based cryptosystems. Journal of Cryptology, 17(4):321–334, Sept. 2004. (Cited on pages 210, 212, and 231.)
[31] P. Barrett. Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In A. M. Odlyzko, editor, Advances in Cryptology – CRYPTO’86, volume 263 of Lecture Notes in Computer Science, pages 311–323. Springer, Heidelberg, Aug. 1987. (Cited on pages 11 and 48.)
[32] C. Baum. The System Builders: The Story of SDC. System Development Corporation, 1981. (Cited on pages 4 and 5.)
[33] N. Benger and M. Scott. Constructing tower extensions of finite fields for implementation of pairing-based cryptography. In M. A. Hasan and T. Helleseth, editors, Arithmetic of Finite Fields, Third International Workshop, WAIFI 2010, Istanbul, Turkey, June 27-30, 2010. Proceedings, volume 6087 of Lecture Notes in Computer Science, pages 180–195. Springer, 2010. (Cited on page 215.)
[34] E. R. Berlekamp. Algebraic coding theory. McGraw-Hill, 1968. (Cited on page 3.)
[35] D. J. Bernstein. Curve25519: New Diffie-Hellman speed records. In M. Yung, Y. Dodis, A. Kiayias, and T. Malkin, editors, PKC 2006: 9th International Conference on Theory and Practice of Public Key Cryptography, volume 3958 of Lecture Notes in Computer Science, pages 207–228. Springer, Heidelberg, Apr. 2006. (Cited on pages 23, 83, 94, 98, and 106.)
[36] D. J. Bernstein. Differential addition chains, 2006. https://cr.yp.to/papers.html#diffchain (accessed May 3, 2017). (Cited on pages 108 and 114.)
[37] D. J. Bernstein, P. Birkner, M. Joye, T. Lange, and C. Peters. Twisted Edwards curves. In S. Vaudenay, editor, AFRICACRYPT 08: 1st International Conference on Cryptology in Africa, volume 5023 of Lecture Notes in Computer Science, pages 389–405. Springer, Heidelberg, June 2008. (Cited on page 90.)
[38] D. J. Bernstein, P. Birkner, T. Lange, and C. Peters. ECM using Edwards curves. Mathematics of Computation, 82(282):1139–1179, 2013. (Cited on page 190.)
[39] D. J. Bernstein, C. Chuengsatiansup, D. Kohel, and T. Lange. Twisted Hessian curves. In K. E. Lauter and F. Rodríguez-Henríquez, editors, Progress in Cryptology - LATINCRYPT 2015: 4th International Conference on Cryptology and Information Security in Latin America, volume 9230 of Lecture Notes in Computer Science, pages 269–294. Springer, Heidelberg, Aug. 2015. (Cited on page 97.)
[40] D. J. Bernstein, C. Chuengsatiansup, and T. Lange. Curve41417: Karatsuba revisited. In L. Batina and M. Robshaw, editors, Cryptographic Hardware and Embedded Systems – CHES 2014, volume 8731 of Lecture Notes in Computer Science, pages 316–334. Springer, Heidelberg, Sept. 2014. (Cited on page 94.)
[41] D. J. Bernstein, C. Chuengsatiansup, T. Lange, and P. Schwabe. Kummer strikes back: New DH speed records. In P. Sarkar and T. Iwata, editors, Advances in Cryptology – ASIACRYPT 2014, Part I, volume 8873 of Lecture Notes in Computer Science, pages 317–337. Springer, Heidelberg, Dec. 2014. (Cited on page 83.)
[42] D. J. Bernstein, N. Duif, T. Lange, P. Schwabe, and B.-Y. Yang. High-speed high-security signatures. In B. Preneel and T. Takagi, editors, Cryptographic Hardware and Embedded Systems – CHES 2011, volume 6917 of Lecture Notes in Computer Science, pages 124–142. Springer, Heidelberg, Sept. / Oct. 2011. (Cited on pages 83 and 94.)
[43] D. J. Bernstein, S. Josefsson, T. Lange, P. Schwabe, and B.-Y. Yang. EdDSA for more curves. Cryptology ePrint Archive, Report 2015/677, 2015. http://eprint.iacr.org/2015/677 (accessed May 3, 2017). (Cited on page 94.)
[44] D. J. Bernstein and T. Lange. Faster addition and doubling on elliptic curves. In K. Kurosawa, editor, Advances in Cryptology – ASIACRYPT 2007, volume 4833 of Lecture Notes in Computer Science, pages 29–50. Springer, Heidelberg, Dec. 2007. (Cited on pages 90 and 91.)
[45] D. J. Bernstein and T. Lange. YZ coordinates with square d for Edwards curves, 2009. https://hyperelliptic.org/EFD/g1p/auto-edwards-yz.html (accessed May 3, 2017). (Cited on page 97.)
[46] D. J. Bernstein and T. Lange. A complete set of addition laws for incomplete Edwards curves. Journal of Number Theory, 131:858–872, 2011. (Cited on page 91.)
[47] D. J. Bernstein and T. Lange. SafeCurves: choosing safe curves for elliptic-curve cryptography, 2014. https://safecurves.cr.yp.to (accessed May 3, 2017). (Cited on page 94.)


[48] D. J. Bernstein and T. Lange. Explicit-Formulas Database, 2016. https://hyperelliptic.org/EFD (accessed May 3, 2017). (Cited on pages 83, 220, and 223.)
[49] D. J. Bernstein and A. K. Lenstra. A general number field sieve implementation. pages 103–126 in [234], 1992. (Cited on pages 149, 152, and 156.)
[50] G. Bertoni, J. Guajardo, and G. Orlando. Systolic and scalable architectures for digit-serial multiplication in fields GF(p^m). In T. Johansson and S. Maitra, editors, Progress in Cryptology - INDOCRYPT 2003: 4th International Conference in Cryptology in India, volume 2904 of Lecture Notes in Computer Science, pages 349–362. Springer, Heidelberg, Dec. 2003. (Cited on pages 40 and 67.)
[51] J.-L. Beuchat, J. E. González-Díaz, S. Mitsunari, E. Okamoto, F. Rodríguez-Henríquez, and T. Teruya. High-speed software implementation of the optimal ate pairing over Barreto-Naehrig curves. In M. Joye, A. Miyaji, and A. Otsuka, editors, PAIRING 2010: 4th International Conference on Pairing-based Cryptography, volume 6487 of Lecture Notes in Computer Science, pages 21–39. Springer, Heidelberg, Dec. 2010. (Cited on pages 207 and 215.)
[52] K. Bigou and A. Tisserand. Single base modular multiplication for efficient hardware RNS implementations of ECC. In T. Güneysu and H. Handschuh, editors, Cryptographic Hardware and Embedded Systems – CHES 2015, volume 9293 of Lecture Notes in Computer Science, pages 123–140. Springer, Heidelberg, Sept. 2015. (Cited on page 37.)
[53] I. F. Blake, G. Seroussi, and N. P. Smart, editors. Elliptic Curves in Cryptography. Cambridge University Press, 1999. (Cited on page 208.)
[54] I. F. Blake, G. Seroussi, and N. P. Smart, editors. Advances in Elliptic Curve Cryptography. Cambridge University Press, 2005. (Cited on page 246.)
[55] D. Bleichenbacher. Efficiency and security of cryptosystems based on number theory. PhD thesis, ETH Zürich, 1996. https://cr.yp.to/bib/1996/bleichenbacher-thesis.pdf (accessed May 3, 2017). (Cited on pages 112 and 113.)
[56] L. Bluestein. A linear filtering approach to the computation of discrete Fourier transform. IEEE Transactions on Audio and Electroacoustics, 18(4):451–455, 1970. (Cited on page 202.)
[57] D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano. Public key encryption with keyword search. In C. Cachin and J. Camenisch, editors, Advances in Cryptology – EUROCRYPT 2004, volume 3027 of Lecture Notes in Computer Science, pages 506–522. Springer, Heidelberg, May 2004. (Cited on page 226.)
[58] D. Boneh and M. K. Franklin. Identity-based encryption from the Weil pairing. In J. Kilian, editor, Advances in Cryptology – CRYPTO 2001, volume 2139 of Lecture Notes in Computer Science, pages 213–229. Springer, Heidelberg, Aug. 2001. (Cited on page 206.)
[59] D. Boneh, E.-J. Goh, and K. Nissim. Evaluating 2-DNF formulas on ciphertexts. In J. Kilian, editor, TCC 2005: 2nd Theory of Cryptography Conference, volume 3378 of Lecture Notes in Computer Science, pages 325–341. Springer, Heidelberg, Feb. 2005. (Cited on page 206.)
[60] D. Boneh, B. Lynn, and H. Shacham. Short signatures from the Weil pairing. In C. Boyd, editor, Advances in Cryptology – ASIACRYPT 2001, volume 2248 of Lecture Notes in Computer Science, pages 514–532. Springer, Heidelberg, Dec. 2001. (Cited on page 206.)


[61] D. Boneh, B. Lynn, and H. Shacham. Short signatures from the Weil pairing. Journal of Cryptology, 17(4):297–319, Sept. 2004. (Cited on page 206.)
[62] D. Boneh, A. Sahai, and B. Waters. Functional encryption: Definitions and challenges. In Y. Ishai, editor, TCC 2011: 8th Theory of Cryptography Conference, volume 6597 of Lecture Notes in Computer Science, pages 253–273. Springer, Heidelberg, Mar. 2011. (Cited on page 206.)
[63] J. W. Bos. High-performance modular multiplication on the Cell processor. In M. A. Hasan and T. Helleseth, editors, Workshop on the Arithmetic of Finite Fields – WAIFI 2010, volume 6087 of Lecture Notes in Computer Science, pages 7–24. Springer, 2010. (Cited on pages 27 and 32.)
[64] J. W. Bos, C. Costello, H. Hisil, and K. Lauter. Fast cryptography in genus 2. In T. Johansson and P. Q. Nguyen, editors, Advances in Cryptology – EUROCRYPT 2013, volume 7881 of Lecture Notes in Computer Science, pages 194–210. Springer, Heidelberg, May 2013. (Cited on pages 23 and 24.)
[65] J. W. Bos, C. Costello, H. Hisil, and K. Lauter. High-performance scalar multiplication using 8-dimensional GLV/GLS decomposition. In G. Bertoni and J.-S. Coron, editors, Cryptographic Hardware and Embedded Systems – CHES 2013, volume 8086 of Lecture Notes in Computer Science, pages 331–348. Springer, Heidelberg, Aug. 2013. (Cited on page 24.)
[66] J. W. Bos, C. Costello, P. Longa, and M. Naehrig. Selecting elliptic curves for cryptography: an efficiency and security analysis. J. Cryptographic Engineering, 6(4):259–286, 2016. (Cited on page 25.)
[67] J. W. Bos and M. E. Kaihara. Montgomery multiplication on the Cell. In R. Wyrzykowski, J. Dongarra, K. Karczewski, and J. Wasniewski, editors, Parallel Processing and Applied Mathematics – PPAM 2009, volume 6067 of Lecture Notes in Computer Science, pages 477–485. Springer, Heidelberg, 2010. (Cited on pages 31 and 35.)
[68] J. W. Bos, M. E. Kaihara, T. Kleinjung, A. K. Lenstra, and P. L. Montgomery. On the security of 1024-bit RSA and 160-bit elliptic curve cryptography. Cryptology ePrint Archive, Report 2009/389, 2009. http://eprint.iacr.org/ (accessed May 3, 2017). (Cited on page 4.)
[69] J. W. Bos, M. E. Kaihara, T. Kleinjung, A. K. Lenstra, and P. L. Montgomery. Solving a 112-bit prime elliptic curve discrete logarithm problem on game consoles using sloppy reduction. International Journal of Applied Cryptography, 2(3):212–228, 2012. (Cited on pages 4, 9, and 32.)
[70] J. W. Bos, M. E. Kaihara, and P. L. Montgomery. Pollard rho on the PlayStation 3. In Special-purpose Hardware for Attacking Cryptographic Systems – SHARCS 2009, pages 35–50, 2009. http://www.hyperelliptic.org/tanja/SHARCS/record2.pdf (accessed May 3, 2017). (Cited on page 4.)
[71] J. W. Bos, T. Kleinjung, A. K. Lenstra, and P. L. Montgomery. Efficient SIMD arithmetic modulo a Mersenne number. In E. Antelo, D. Hough, and P. Ienne, editors, IEEE Symposium on Computer Arithmetic – ARITH-20, pages 213–221. IEEE Computer Society, 2011. (Cited on pages 4, 23, and 199.)
[72] J. W. Bos, T. Kleinjung, R. Niederhagen, and P. Schwabe. ECC2K-130 on cell CPUs. In D. J. Bernstein and T. Lange, editors, AFRICACRYPT 10: 3rd International Conference on Cryptology in Africa, volume 6055 of Lecture Notes in Computer Science, pages 225–242. Springer, Heidelberg, May 2010. (Cited on page 32.)
[73] J. W. Bos, P. L. Montgomery, D. Shumow, and G. M. Zaverucha. Montgomery multiplication using vector instructions. In T. Lange, K. Lauter, and P. Lisonek, editors, SAC 2013: 20th Annual International Workshop on Selected Areas in Cryptography, volume 8282 of Lecture Notes in Computer Science, pages 471–489. Springer, Heidelberg, Aug. 2014. (Cited on pages 4, 26, 27, 28, and 30.)
[74] J. W. Bos and D. Stefan. Performance analysis of the SHA-3 candidates on exotic multi-core architectures. In S. Mangard and F.-X. Standaert, editors, Cryptographic Hardware and Embedded Systems – CHES 2010, volume 6225 of Lecture Notes in Computer Science, pages 279–293. Springer, Heidelberg, Aug. 2010. (Cited on page 32.)
[75] A. Bosselaers, R. Govaerts, and J. Vandewalle. Comparison of three modular reduction functions. In D. R. Stinson, editor, Advances in Cryptology – CRYPTO’93, volume 773 of Lecture Notes in Computer Science, pages 175–186. Springer, Heidelberg, Aug. 1994. (Cited on page 12.)
[76] R. P. Brent. New factors of Mersenne numbers (preliminary report), II. AMS Abstracts, 3:132, 82T–10–22, 1982. (Cited on page 199.)
[77] R. P. Brent. Some integer factorization algorithms using elliptic curves. Australian Computer Science Communications, 8:149–163, 1986. (Cited on page 189.)
[78] R. P. Brent, P. L. Montgomery, H. J. J. te Riele, H. Boender, M. Elkenbracht-Huizing, R. Silverman, and T. Sosnowski. Factorizations of a^n ± 1, 13 ≤ a < 100: Update 2, 1996. (Cited on pages 7 and 148.)
[79] R. P. Brent and J. M. Pollard. Factorization of the eighth Fermat number. Mathematics of Computation, 36(154):627–630, 1981. (Cited on pages 116 and 129.)
[80] R. P. Brent and P. Zimmermann. Modern Computer Arithmetic. Cambridge University Press, 2010. (Cited on pages 24 and 197.)
[81] E. F. Brickell. A fast modular multiplication algorithm with application to two key cryptography. In D. Chaum, R. L. Rivest, and A. T. Sherman, editors, Advances in Cryptology – CRYPTO’82, pages 51–60. Plenum Press, New York, USA, 1982. (Cited on page 27.)
[82] J. Brillhart, D. H. Lehmer, J. L. Selfridge, B. Tuckerman, and S. S. Wagstaff Jr. Factorizations of b^n ± 1, b = 2, 3, 5, 6, 7, 10, 11, 12 Up to High Powers, volume 22 of Contemporary Mathematics. American Mathematical Society, First edition, 1983, Second edition, 1988, Third edition, 2002. Electronic book available at: http://homes.cerias.purdue.edu/~ssw/cun/index.html (accessed May 3, 2017), 1983. (Cited on pages 117, 128, 145, and 146.)
[83] J. Buchmann, J. Loho, and J. Zayer. An implementation of the general number field sieve. In D. R. Stinson, editor, Advances in Cryptology – CRYPTO’93, volume 773 of Lecture Notes in Computer Science, pages 159–165. Springer, Heidelberg, Aug. 1994. (Cited on pages 149, 152, and 153.)
[84] J. P. Buhler, H. W. Lenstra Jr., and C. Pomerance. Factoring integers with the number field sieve. pages 50–94 in [234], 1992. (Cited on pages 139, 141, 152, 153, 154, 155, 156, and 164.)


[85] J. P. Buhler, P. Montgomery, R. Robson, and R. Ruby. Technical report implementing the number field sieve. Oregon State University, Corvallis, OR, 1994. (Cited on page 166.)
[86] E. Canfield, P. Erdös, and C. Pomerance. On a problem of Oppenheim concerning “Factorisatio Numerorum.” J. Number Theory, 17:1–28, 1983. (Cited on page 119.)
[87] D. G. Cantor and H. Zassenhaus. A new algorithm for factoring polynomials over finite fields. Mathematics of Computation, 36:587–592, 1981. (Cited on page 3.)
[88] T. R. Caron and R. D. Silverman. Parallel implementation of the quadratic sieve. J. Supercomput., 1:273–290, 1988. (Cited on pages 116, 135, and 136.)
[89] S. Cavallar. Strategies in filtering in the number field sieve. In W. Bosma, editor, ANTS, volume 1838 of Lecture Notes in Computer Science, pages 209–231. Springer, 2000. (Cited on pages 7, 124, and 125.)
[90] S. Cavallar. On the number field sieve integer factorisation algorithm. PhD thesis, Leiden University, 2002. (Cited on pages 7, 124, and 125.)
[91] S. Cavallar, B. Dodson, A. K. Lenstra, P. C. Leyland, W. M. Lioen, P. L. Montgomery, B. Murphy, H. te Riele, and P. Zimmermann. Factorization of RSA-140 using the number field sieve. In K.-Y. Lam, E. Okamoto, and C. Xing, editors, Advances in Cryptology – ASIACRYPT’99, volume 1716 of Lecture Notes in Computer Science, pages 195–207. Springer, Heidelberg, Nov. 1999. (Cited on pages 4 and 171.)
[92] S. Cavallar, B. Dodson, A. K. Lenstra, W. M. Lioen, P. L. Montgomery, B. Murphy, H. te Riele, K. Aardal, J. Gilchrist, G. Guillerm, P. C. Leyland, J. Marchand, F. Morain, A. Muffett, C. Putnam, C. Putnam, and P. Zimmermann. Factorization of a 512-bit RSA modulus. In B. Preneel, editor, Advances in Cryptology – EUROCRYPT 2000, volume 1807 of Lecture Notes in Computer Science, pages 1–18. Springer, Heidelberg, May 2000. (Cited on pages 4, 124, 148, 153, and 176.)
[93] Ç. K. Koç, T. Acar, and B. S. Kaliski Jr. Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro, 16(3):26–33, 1996. (Cited on pages 16 and 47.)
[94] D. Chaum. Blind signatures for untraceable payments. In D. Chaum, R. L. Rivest, and A. T. Sherman, editors, Advances in Cryptology – CRYPTO’82, pages 199–203. Plenum Press, New York, USA, 1982. (Cited on page 77.)
[95] H.-C. Chen, C.-M. Cheng, S.-H. Hung, and Z.-C. Lin. Integer number crunching on the Cell processor. International Conference on Parallel Processing, pages 508–515, 2010. (Cited on page 32.)
[96] S. Y. Chiou and C. S. Laih. An efficient algorithm for computing the Luc chain. IEE Proceedings on Computers and Digital Techniques, 147:263–265, 2000. (Cited on page 112.)
[97] T. Chou. Sandy2x: New Curve25519 speed records. In O. Dunkelman and L. Keliher, editors, Selected Areas in Cryptography – SAC 2015, volume 9566 of Lecture Notes in Computer Science, pages 145–160. Springer, 2016. (Cited on page 94.)
[98] J. Chung and M. A. Hasan. Montgomery reduction algorithm for modular multiplication using low-weight polynomial form integers. In 18th IEEE Symposium on Computer Arithmetic (ARITH-18), pages 230–239. IEEE Computer Society, 2007. (Cited on page 25.)
[99] M. Ciet, M. Joye, K. Lauter, and P. L. Montgomery. Trading inversions for multiplications in elliptic curve cryptography. Des. Codes Cryptography, 39(2):189–206, 2006. (Cited on pages 4 and 228.)
[100] S. Circle. Blackphone website, 2017. (Cited on page 95.)
[101] S. Contini. Factoring integers with the self-initializing quadratic sieve. Masters Thesis, U. Georgia, 1997. (Cited on page 137.)
[102] S. Cook. On the minimum computation time of functions. PhD thesis, Harvard University, 1966. (Cited on page 15.)
[103] D. Coppersmith. Fast evaluation of logarithms in fields of characteristic two. IEEE Transactions on Information Theory, 30:587–594, 1984. (Cited on page 140.)
[104] D. Coppersmith. Modifications to the number field sieve. Journal of Cryptology, 6(3):169–180, 1993. (Cited on pages 117, 146, 153, 158, and 159.)
[105] D. Coppersmith. Solving linear equations over GF(2): Block Lanczos algorithm. Linear Algebra Appl., 192:33–60, Jan. 1993. (Cited on page 179.)
[106] D. Coppersmith. Solving homogeneous linear equations over GF(2) via block Wiedemann algorithm. Mathematics of Computation, 62(205):333–350, 1994. (Cited on pages 7, 123, 187, and 188.)
[107] D. Coppersmith, A. M. Odlyzko, and R. Schroeppel. Discrete logarithms in GF(p). Algorithmica, 1(1):1–15, 1986. (Cited on pages 123, 137, 139, 144, and 145.)
[108] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. J. Symbolic Comput., 9:251–280, 1990. (Cited on page 123.)
[109] C. Costello, H. Hisil, C. Boyd, J. M. González Nieto, and K. K.-H. Wong. Faster pairings on special Weierstrass curves. In H. Shacham and B. Waters, editors, PAIRING 2009: 3rd International Conference on Pairing-based Cryptography, volume 5671 of Lecture Notes in Computer Science, pages 89–101. Springer, Heidelberg, Aug. 2009. (Cited on pages 221 and 222.)
[110] C. Costello, T. Lange, and M. Naehrig. Faster pairing computations on curves with high-degree twists. In P. Q. Nguyen and D. Pointcheval, editors, PKC 2010: 13th International Conference on Theory and Practice of Public Key Cryptography, volume 6056 of Lecture Notes in Computer Science, pages 224–242. Springer, Heidelberg, May 2010. (Cited on pages 220, 221, and 222.)
[111] N. Costigan and P. Schwabe. Fast elliptic-curve cryptography on the cell broadband engine. In B. Preneel, editor, AFRICACRYPT 09: 2nd International Conference on Cryptology in Africa, volume 5580 of Lecture Notes in Computer Science, pages 368–385. Springer, Heidelberg, June 2009. (Cited on page 32.)
[112] N. Costigan and M. Scott. Accelerating SSL using the vector processors in IBM’s cell broadband engine for Sony’s PlayStation 3. Cryptology ePrint Archive, Report 2007/061, 2007. http://eprint.iacr.org/2007/061 (accessed May 4, 2017). (Cited on page 32.)
[113] J.-M. Couveignes. Computing a square root for the number field sieve. pages 95–102 in [234], 1992. (Cited on pages 8 and 156.)
[114] J. Cowie, B. Dodson, R. M. Elkenbracht-Huizing, A. K. Lenstra, P. L. Montgomery, and J. Zayer. A world wide number field sieve factoring record: On to 512 bits. In K. Kim and T. Matsumoto, editors, Advances in Cryptology – ASIACRYPT’96, volume 1163 of Lecture Notes in Computer Science, pages 382–394. Springer, Heidelberg, Nov. 1996. (Cited on page 153.)
[115] N. Coxon. Montgomery’s method of polynomial selection for the number field sieve. Linear Algebra and its Applications, 485:72–102, 2015. (Cited on page 168.)
[116] A. J. C. Cunningham and H. J. Woodall. Factorizations of y^n ± 1, y = 2, 3, 5, 6, 7, 10, 11, 12 up to high powers. Frances Hodgson, London, 1925. (Cited on pages 117, 145, and 146.)
[117] J. A. Davis, D. B. Holdridge, and G. J. Simmons. Status report on factoring (at the Sandia National Laboratories). In T. Beth, N. Cot, and I. Ingemarsson, editors, Advances in Cryptology – EUROCRYPT’84, volume 209 of Lecture Notes in Computer Science, pages 183–215. Springer, Heidelberg, Apr. 1985. (Cited on pages 133 and 134.)
[118] N. De Bruijn. On the number of positive integers ≤ x and free of prime factors > y, II. Indag. Math., 38:239–247, 1966. (Cited on page 119.)
[119] M. Delcourt, T. Kleinjung, and A. K. Lenstra. Analyses of number field sieve variants. Manuscript in preparation, 2017. (Cited on page 159.)
[120] R. L. Dennis. Security in the computing environment. Technical Report SP2440/000/01, System Development Corporation, August 18, 1966. (page 16). (Cited on page 76.)
[121] T. F. Denny, B. Dodson, A. K. Lenstra, and M. S. Manasse. On the factorization of RSA-120. In D. R. Stinson, editor, Advances in Cryptology – CRYPTO’93, volume 773 of Lecture Notes in Computer Science, pages 166–174. Springer, Heidelberg, Aug. 1994. (Cited on page 153.)
[122] J. Dhem. Modified version of the Barrett algorithm. Technical report, DICE, Université Catholique de Louvain, 1994. (Cited on page 48.)
[123] J. Dhem. Design of an efficient public-key cryptographic library for RISC-based smart cards. PhD thesis, Université Catholique de Louvain, 1998. (Cited on page 48.)
[124] J. Dhem and J. Quisquater. Recent results on modular multiplications for smart cards. In Smart Card Research and Applications, CARDIS ’98, volume 1820 of LNCS, pages 336–352. Springer-Verlag, 1998. (Cited on page 48.)
[125] T. Dierks and E. Rescorla. The transport layer security (TLS) protocol version 1.2. RFC 5246 (Proposed Standard), http://www.ietf.org/rfc/rfc5246.txt (accessed May 4, 2017), 2008. (Cited on page 94.)
[126] W. Diffie and M. E. Hellman. New directions in cryptography. IEEE Transactions on Information Theory, 22(6):644–654, 1976. (Cited on pages 40 and 93.)
[127] B. Dixon and A. K. Lenstra. Massively parallel elliptic curve factoring. In R. A. Rueppel, editor, Advances in Cryptology – EUROCRYPT’92, volume 658 of Lecture Notes in Computer Science, pages 183–193. Springer, Heidelberg, May 1993. (Cited on pages 24 and 27.)
[128] J. D. Dixon. Asymptotically fast factorization of integers. Mathematics of Computation, 36(153):255–260, 1981. (Cited on page 127.)
[129] C. Doche. Finite Field Arithmetic, chapter 11 in [16], pages 201–237. CRC Press, 2005. (Cited on pages 215 and 218.)
[130] C. Doche, T. Icart, and D. R. Kohel. Efficient scalar multiplication by isogeny decompositions. In M. Yung, Y. Dodis, A. Kiayias, and T. Malkin, editors, PKC 2006: 9th International Conference on Theory and Practice of Public Key Cryptography, volume 3958 of Lecture Notes in Computer Science, pages 191–206. Springer, Heidelberg, Apr. 2006. (Cited on page 97.)
[131] B. Dodson and A. K. Lenstra. NFS with four large primes: An explosive experiment. In D. Coppersmith, editor, Advances in Cryptology – CRYPTO’95, volume 963 of Lecture Notes in Computer Science, pages 372–385. Springer, Heidelberg, Aug. 1995. (Cited on pages 152 and 153.)
[132] S. Duquesne and G. Frey. Background on Pairings, chapter 6 in [16], pages 115–124. CRC Press, 2005. (Cited on page 208.)
[133] S. Duquesne and G. Frey. Implementation of Pairings, chapter 16 in [16], pages 389–404. CRC Press, 2005. (Cited on page 208.)
[134] S. R. Dussé and B. S. Kaliski Jr. A cryptographic library for the Motorola DSP56000. In I. Damgård, editor, Advances in Cryptology – EUROCRYPT’90, volume 473 of Lecture Notes in Computer Science, pages 230–244. Springer, Heidelberg, May 1991. (Cited on pages 16, 18, and 47.)
[135] W. Eberly and E. Kaltofen. On randomized Lanczos algorithm. In W. W. Küchlin, editor, ISSAC ’97, pages 176–183. ACM Press, 1997. Extended abstract. (Cited on page 178.)
[136] H. M. Edwards. A normal form for elliptic curves. Bulletin of the American Mathematical Society, 44:393–422, July 2007. (Cited on pages 90 and 190.)
[137] K. Eisenträger, K. Lauter, and P. L. Montgomery. Fast elliptic curve arithmetic and improved Weil pairing evaluation. In M. Joye, editor, Topics in Cryptology – CT-RSA 2003, volume 2612 of Lecture Notes in Computer Science, pages 343–354. Springer, Heidelberg, Apr. 2003. (Cited on pages 4, 8, 206, 207, 227, and 228.)
[138] K. Eisenträger, K. E. Lauter, and P. L. Montgomery. Improved Weil and Tate pairings for elliptic and hyperelliptic curves. In D. A. Buell, editor, Algorithmic Number Theory, 6th International Symposium, ANTS-VI, Burlington, VT, USA, June 13-18, 2004, Proceedings, volume 3076 of Lecture Notes in Computer Science, pages 169–183. Springer, 2004. (Cited on pages 4, 8, 206, 231, 232, and 233.)
[139] S. E. Eldridge and C. D. Walter. Hardware implementation of Montgomery’s modular multiplication algorithm. IEEE Transactions on Computers, 42(6):693–699, June 1993. (Cited on pages 43 and 61.)
[140] T. ElGamal. A public key cryptosystem and a signature scheme based on discrete logarithms. IEEE Transactions on Information Theory, 31:469–472, 1985. (Cited on page 40.)
[141] T. ElGamal. A subexponential-time algorithm for computing discrete logarithms over GF(p^2). IEEE Transactions on Information Theory, 31:473–481, 1985. (Cited on pages 139, 141, 142, 143, and 144.)
[142] M. Elkenbracht-Huizing. An implementation of the number field sieve. Experimental Mathematics, 5(3):231–253, 1996. (Cited on pages 168 and 169.)
[143] M. Ercegovac, November 2015. Private communication. (Cited on pages 2 and 4.)
[144] P. Erdös, R. L. Graham, P. L. Montgomery, B. K. Rothschild, J. Spencer, and E. G. Strauss. Euclidean Ramsey theorems, I. Journal of Combinatorial Theory, Series A, 14(3):341–363, 1973. (Cited on page 1.)
[145] P. Erdös, R. L. Graham, P. L. Montgomery, B. K. Rothschild, J. Spencer, and E. G. Strauss. Euclidean Ramsey theorems, II. In A. Hajnal, R. Rado, and V. T. Sós, editors, Colloquia Mathematica Societatis János Bolyai, 10, volume I of Infinite and Finite Sets, pages 529–557. North-Holland, Amsterdam-London, 1975. (Cited on page 1.)
[146] P. Erdös, R. L. Graham, P. L. Montgomery, B. K. Rothschild, J. Spencer, and E. G. Strauss. Euclidean Ramsey theorems, III. In A. Hajnal, R. Rado, and V. T. Sós, editors, Colloquia Mathematica Societatis János Bolyai, 10, volume I of Infinite and Finite Sets, pages 559–583. North-Holland, Amsterdam-London, 1975. (Cited on page 1.)
[147] J. Franke and T. Kleinjung. Continued fractions and lattice sieving. In Special-purpose Hardware for Attacking Cryptographic Systems – SHARCS 2005, 2005. http://www.hyperelliptic.org/tanja/SHARCS/talks/FrankeKleinjung.pdf (accessed May 4, 2017). (Cited on page 149.)
[148] D. Freeman, M. Scott, and E. Teske. A taxonomy of pairing-friendly elliptic curves. Journal of Cryptology, 23(2):224–280, Apr. 2010. (Cited on pages 213, 223, and 224.)
[149] W. L. Freking and K. K. Parhi. Performance-scalable array architectures for modular multiplication. Journal of VLSI Signal Processing, 31:101–116, 2002. (Cited on page 68.)
[150] G. Frey, M. Müller, and H. Rück. The Tate pairing and the discrete logarithm applied to elliptic curve cryptosystems. IEEE Transactions on Information Theory, 45(5):1717–1719, 1999. (Cited on page 213.)
[151] G. Frey and H.-G. Rück. A remark concerning m-divisibility and the discrete logarithm in the divisor class group of curves. Mathematics of Computation, 62(206):865–874, 1994. (Cited on pages 206 and 213.)
[152] M. Fürer. Faster integer multiplication. In D. S. Johnson and U. Feige, editors, 39th Annual ACM Symposium on Theory of Computing, pages 57–66. ACM Press, June 2007. (Cited on page 15.)
[153] S. D. Galbraith. Pairings, chapter IX in [54], pages 183–214. Cambridge University Press, 2005. (Cited on page 208.)
[154] S. D. Galbraith, K. Harrison, and D. Soldera. Implementing the Tate pairing. In C. Fieker and D. R. Kohel, editors, Algorithmic Number Theory – ANTS, volume 2369 of Lecture Notes in Computer Science, pages 324–337. Springer, 2002. (Cited on page 207.)
[155] K. Gandolfi, C. Mourtel, and F. Olivier. Electromagnetic analysis: Concrete results. In Ç. K. Koç, D. Naccache, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2001, volume 2162 of Lecture Notes in Computer Science, pages 251–261. Springer, Heidelberg, May 2001. (Cited on page 77.)
[156] H. L. Garner. The residue number system. In Papers Presented at the March 3–5, 1959, Western Joint Computer Conference, IRE-AIEE-ACM ’59 (Western), pages 146–153, New York, NY, USA, 1959. ACM. (Cited on pages 36 and 46.)
[157] J. von zur Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, Cambridge, 1999. https://cosec.bit.uni-bonn.de/science/mca (accessed May 5, 2017). (Cited on page 192.)
[158] P. Gaudry. Variants of the Montgomery form based on Theta functions, 2006. https://cr.yp.to/bib/2006/gaudry-toronto.pdf (accessed May 4, 2017). (Cited on page 97.)


[159] P. Gaudry and D. Lubicz. The arithmetic of characteristic 2 Kummer surfaces and of elliptic Kummer lines. Finite Fields and Their Applications, 15:246–260, 2009. https://hal.inria.fr/inria-00266565v2 (accessed May 4, 2017). (Cited on page 97.)
[160] J. L. Gerver. Factoring large integers with a quadratic sieve. Mathematics of Computation, 41:287–294, 1983. (Cited on pages 132 and 134.)
[161] R. Golliver, A. K. Lenstra, and K. McCurley. Lattice sieving and trial division. In Algorithmic Number Theory Symposium – ANTS’94, volume 877 of LNCS, pages 18–27, 1994. (Cited on pages 135, 148, 149, and 153.)
[162] F. Göloğlu, R. Granger, G. McGuire, and J. Zumbrägel. On the function field sieve and the impact of higher splitting probabilities — application to discrete logarithms in F_{2^1971} and F_{2^3164}. In R. Canetti and J. A. Garay, editors, Advances in Cryptology – CRYPTO 2013, Part II, volume 8043 of Lecture Notes in Computer Science, pages 109–128. Springer, Heidelberg, Aug. 2013. (Cited on page 140.)
[163] P. Grabher, J. Großschädl, and D. Page. On software parallel implementation of cryptographic pairings. In R. M. Avanzi, L. Keliher, and F. Sica, editors, SAC 2008: 15th Annual International Workshop on Selected Areas in Cryptography, volume 5381 of Lecture Notes in Computer Science, pages 35–50. Springer, Heidelberg, Aug. 2009. (Cited on pages 225 and 226.)
[164] R. Graham, November 2015. Private communication. (Cited on pages 1 and 4.)
[165] R. Granger, T. Kleinjung, and J. Zumbrägel. On the discrete logarithm problem in finite fields of fixed characteristic. Available from http://arxiv.org/abs/1507.01495 (accessed May 4, 2017). (Cited on page 140.)
[166] R. Granger and A. Moss. Generalised Mersenne numbers revisited. Math. Comput., 82(284):2389–2420, 2013. (Cited on page 25.)
[167] R. Granger and M. Scott. Faster squaring in the cyclotomic subgroup of sixth degree extensions. In P. Q. Nguyen and D. Pointcheval, editors, PKC 2010: 13th International Conference on Theory and Practice of Public Key Cryptography, volume 6056 of Lecture Notes in Computer Science, pages 209–223. Springer, Heidelberg, May 2010. (Cited on page 211.)
[168] R. Granger and N. Smart. On computing products of pairings. Cryptology ePrint Archive, Report 2006/172, 2006. http://eprint.iacr.org/2006/172 (accessed May 4, 2017). (Cited on page 227.)
[169] R. T. Gregory and D. W. Matula. Base conversion in residue number systems. In T. R. N. Rao and D. W. Matula, editors, 3rd IEEE Symposium on Computer Arithmetic – ARITH 1975, pages 117–125. IEEE Computer Society, 1975. (Cited on page 36.)
[170] J. Großschädl, R. M. Avanzi, E. Savas, and S. Tillich. Energy-efficient software implementation of long integer modular arithmetic. In J. R. Rao and B. Sunar, editors, Cryptographic Hardware and Embedded Systems – CHES 2005, volume 3659 of Lecture Notes in Computer Science, pages 75–90. Springer, Heidelberg, Aug. / Sept. 2005. (Cited on page 44.)
[171] J. Großschädl and G.-A. Kamendje. Architectural enhancements for Montgomery multiplication on embedded RISC processors. In J. Zhou, M. Yung, and Y. Han, editors, ACNS 03: 1st International Conference on Applied Cryptography and Network Security, volume 2846 of Lecture Notes in Computer Science, pages 418–434. Springer, Heidelberg, Oct. 2003. (Cited on page 47.)


[172] J. Groth and A. Sahai. Efficient non-interactive proof systems for bilinear groups. In N. P. Smart, editor, Advances in Cryptology – EUROCRYPT 2008, volume 4965 of Lecture Notes in Computer Science, pages 415–432. Springer, Heidelberg, Apr. 2008. (Cited on page 226.)
[173] M. Gschwind. The Cell broadband engine: Exploiting multiple levels of parallelism in a chip multiprocessor. International Journal of Parallel Programming, 35:233–262, 2007. (Cited on page 32.)
[174] J. Guajardo and C. Paar. Itoh-Tsujii inversion in standard basis and its application in cryptography and codes. Designs, Codes and Cryptography, 25:207–216, 2001. (Cited on page 215.)
[175] S. Gueron and V. Krasnov. Fast prime field elliptic-curve cryptography with 256-bit primes. J. Cryptographic Engineering, 5(2):141–151, 2015. (Cited on page 26.)
[176] J. E. Guzmán-Trampe, N. C. Cortés, L. J. D. Perez, D. O. Arroyo, and F. Rodríguez-Henríquez. Low-cost addition-subtraction sequences for the final exponentiation in pairings. Finite Fields and Their Applications, 29:1–17, 2014. (Cited on page 211.)
[177] G. Hachez and J.-J. Quisquater. Montgomery exponentiation with no final subtractions: Improved results. In Ç. K. Koç and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2000, volume 1965 of Lecture Notes in Computer Science, pages 293–301. Springer, Heidelberg, Aug. 2000. (Cited on pages 20 and 79.)
[178] M. Hamburg. Fast and compact elliptic-curve cryptography. Cryptology ePrint Archive, Report 2012/309, 2012. http://eprint.iacr.org/2012/309 (accessed May 4, 2017). (Cited on pages 24 and 25.)
[179] M. Hamburg. Ed448-Goldilocks, a new elliptic curve. Cryptology ePrint Archive, Report 2015/625, 2015. http://eprint.iacr.org/2015/625 (accessed May 4, 2017). (Cited on page 94.)
[180] D. Hankerson, A. J. Menezes, and S. Vanstone. Guide to Elliptic Curve Cryptography. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2004. (Cited on pages 40 and 216.)
[181] G. H. Hardy and E. M. Wright. An introduction to the theory of numbers. Oxford Univ. Press, 4th edition, 1960. (Cited on page 128.)
[182] O. Harrison and J. Waldron. Efficient acceleration of asymmetric cryptography on graphics hardware. In B. Preneel, editor, AFRICACRYPT 09: 2nd International Conference on Cryptology in Africa, volume 5580 of Lecture Notes in Computer Science, pages 350–367. Springer, Heidelberg, June 2009. (Cited on page 37.)
[183] L. Hars. Long modular multiplication for cryptographic applications. In M. Joye and J.-J. Quisquater, editors, Cryptographic Hardware and Embedded Systems – CHES 2004, volume 3156 of Lecture Notes in Computer Science, pages 45–61. Springer, Heidelberg, Aug. 2004. (Cited on page 24.)
[184] K. Hensel. Theorie der algebraischen Zahlen. Teubner, Leipzig, 1908. (Cited on page 12.)
[185] F. Hess, N. P. Smart, and F. Vercauteren. The eta pairing revisited. IEEE Transactions on Information Theory, 52(10):4595–4602, 2006. (Cited on pages 211 and 212.)


[186] H. Hisil, K. K.-H. Wong, G. Carter, and E. Dawson. Twisted Edwards curves revisited. In J. Pieprzyk, editor, Advances in Cryptology – ASIACRYPT 2008, volume 5350 of Lecture Notes in Computer Science, pages 326–343. Springer, Heidelberg, Dec. 2008. (Cited on page 91.)
[187] H. P. Hofstee. Power efficient processor architecture and the Cell processor. In High-Performance Computer Architecture – HPCA 2005, pages 258–262. IEEE, 2005. (Cited on page 32.)
[188] D. Husemöller. Elliptic Curves, volume 111 of Graduate Texts in Mathematics. Springer, 2004. (Cited on page 90.)
[189] Intel Corporation. Using streaming SIMD extensions (SSE2) to perform big multiplications, version 2.0. Technical Report AP-941, Intel, 2000. http://software.intel.com/sites/default/files/14/4f/24960. (Cited on pages 56 and 64.)
[190] S. Ionica and A. Joux. Another approach to pairing computation in Edwards coordinates. In D. R. Chowdhury, V. Rijmen, and A. Das, editors, Progress in Cryptology - INDOCRYPT 2008: 9th International Conference in Cryptology in India, volume 5365 of Lecture Notes in Computer Science, pages 400–413. Springer, Heidelberg, Dec. 2008. (Cited on pages 220 and 221.)
[191] T. Itoh and S. Tsujii. A fast algorithm for computing multiplicative inverses in GF(2^m) using normal bases. Inf. Comput., 78(3):171–177, 1988. (Cited on page 215.)
[192] K. Iwamura, T. Matsumoto, and H. Imai. Systolic-arrays for modular exponentiation using Montgomery method (extended abstract) (rump session). In R. A. Rueppel, editor, Advances in Cryptology – EUROCRYPT’92, volume 658 of Lecture Notes in Computer Science, pages 477–481. Springer, Heidelberg, May 1993. (Cited on page 27.)
[193] D. S. Johnson, T. Nishizeki, A. Nozaki, and H. S. Wilf. Discrete algorithms and complexity. Academic Press, Boston, 1987. (Cited on page 255.)
[194] A. Joux. A one round protocol for tripartite Diffie-Hellman. In W. Bosma, editor, Algorithmic Number Theory, 4th International Symposium, ANTS-IV, Leiden, The Netherlands, July 2-7, 2000, Proceedings, volume 1838 of Lecture Notes in Computer Science, pages 385–394. Springer, 2000. (Cited on page 206.)
[195] A. Joux. A one round protocol for tripartite Diffie-Hellman. Journal of Cryptology, 17(4):263–276, Sept. 2004. (Cited on page 206.)
[196] A. Joux. A new index calculus algorithm with complexity L(1/4 + o(1)) in small characteristic. In T. Lange, K. Lauter, and P. Lisonek, editors, SAC 2013: 20th Annual International Workshop on Selected Areas in Cryptography, volume 8282 of Lecture Notes in Computer Science, pages 355–379. Springer, Heidelberg, Aug. 2014. (Cited on page 140.)
[197] A. Joux and R. Lercier. Improvements to the general number field sieve for discrete logarithms in prime fields. A comparison with the Gaussian integer method. Mathematics of Computation, 72(242):953–967, 2003. (Cited on page 165.)
[198] M. Joye. On Quisquater’s multiplication algorithm. In D. Naccache, editor, Cryptography and Security: From Theory to Applications, volume 6805 of LNCS, pages 3–7. Springer-Verlag, 2012. (Cited on page 48.)
[199] M. E. Kaihara and N. Takagi. Bipartite modular multiplication. In J. R. Rao and B. Sunar, editors, Cryptographic Hardware and Embedded Systems – CHES 2005, volume 3659 of Lecture Notes in Computer Science, pages 201–210. Springer, Heidelberg, Aug. / Sept. 2005. (Cited on page 27.)
[200] M. E. Kaihara and N. Takagi. A hardware algorithm for modular multiplication/division. IEEE Transactions on Computers, 54(1):12–21, 2005. (Cited on page 9.)
[201] E. Kaltofen. Analysis of Coppersmith’s block Wiedemann algorithm for the parallel solution of sparse linear systems. Mathematics of Computation, 64(210):777–806, 1995. (Cited on page 187.)
[202] A. A. Karatsuba and Y. Ofman. Multiplication of many-digital numbers by automatic computers. Doklady Akad. Nauk SSSR, 145(2):293–294, 1962. Translation in Physics-Doklady 7, pp. 595–596, 1963. (Cited on pages 15 and 44.)
[203] P. S. Kasat, D. S. Bilaye, H. V. Dixit, R. Balwaik, and A. Jeyakumar. Multiplication algorithms for VLSI – a review. International Journal on Computer Science and Engineering (IJCSE), 4(11):1761–1765, Nov. 2012. (Cited on page 44.)
[204] E. Käsper. Fast elliptic curve cryptography in OpenSSL. In G. Danezis, S. Dietrich, and K. Sako, editors, FC 2011 Workshops, volume 7126 of Lecture Notes in Computer Science, pages 27–39. Springer, Heidelberg, Feb. / Mar. 2012. (Cited on page 23.)
[205] S. Kawamura, M. Koike, F. Sano, and A. Shimbo. Cox-Rower architecture for fast parallel Montgomery multiplication. In B. Preneel, editor, Advances in Cryptology – EUROCRYPT 2000, volume 1807 of Lecture Notes in Computer Science, pages 523–538. Springer, Heidelberg, May 2000. (Cited on page 37.)
[206] T. Kleinjung. On polynomial selection for the general number field sieve. Mathematics of Computation, 75(256):2037–2047, 2006. (Cited on page 173.)
[207] T. Kleinjung. Polynomial selection, presented at the CADO workshop. See http://cado.gforge.inria.fr/workshop/slides/kleinjung.pdf, 2008 (accessed May 4, 2017). (Cited on page 173.)
[208] T. Kleinjung. Quadratic sieving. Mathematics of Computation, 85:1861–1873, 2016. (Cited on page 137.)
[209] T. Kleinjung, K. Aoki, J. Franke, A. K. Lenstra, E. Thomé, J. W. Bos, P. Gaudry, A. Kruppa, P. L. Montgomery, D. A. Osvik, H. J. J. te Riele, A. Timofeev, and P. Zimmermann. Factorization of a 768-bit RSA modulus. In T. Rabin, editor, Advances in Cryptology – CRYPTO 2010, volume 6223 of Lecture Notes in Computer Science, pages 333–350. Springer, Heidelberg, Aug. 2010. (Cited on pages 4, 7, 117, 153, and 176.)
[210] T. Kleinjung, J. W. Bos, and A. K. Lenstra. Mersenne factorization factory. In P. Sarkar and T. Iwata, editors, Advances in Cryptology – ASIACRYPT 2014, Part I, volume 8873 of Lecture Notes in Computer Science, pages 358–377. Springer, Heidelberg, Dec. 2014. (Cited on pages 117, 126, 146, 152, 158, and 159.)
[211] T. Kleinjung, J. W. Bos, A. K. Lenstra, D. A. Osvik, K. Aoki, S. Contini, J. Franke, E. Thomé, P. Jermini, M. Thiémard, P. Leyland, P. L. Montgomery, A. Timofeev, and H. Stockinger. A heterogeneous computing environment to solve the 768-bit RSA challenge. Cluster Computing, (15):53–68, 2012. (Cited on pages 4 and 7.)
[212] T. Kleinjung, C. Diem, A. K. Lenstra, C. Priplata, and C. Stahlke. Computation of a 768-bit prime field discrete logarithm. In J.-S. Coron and J. Nielsen, editors, Eurocrypt 2017, Part I, volume 10210 of Lecture Notes in Computer Science, pages 178–194. Springer, Heidelberg, 2017. (Cited on page 153.)


[213] M. Knežević, F. Vercauteren, and I. Verbauwhede. Faster interleaved modular multiplication based on Barrett and Montgomery reduction methods. IEEE Transactions on Computers, 59(12):1715–1721, 2010. (Cited on page 48.)
[214] M. Knežević, F. Vercauteren, and I. Verbauwhede. Speeding up bipartite modular multiplication. In M. A. Hasan and T. Helleseth, editors, Arithmetic of Finite Fields – WAIFI, volume 6087 of Lecture Notes in Computer Science, pages 166–179. Springer, 2010. (Cited on page 24.)
[215] D. E. Knuth. Seminumerical Algorithms. The Art of Computer Programming. Addison-Wesley, Reading, Massachusetts, USA, 3rd edition, 1997. (Cited on page 11.)
[216] T. Kobayashi, H. Morita, K. Kobayashi, and F. Hoshino. Fast elliptic curve algorithm combining Frobenius map and table reference to adapt to higher characteristic. In J. Stern, editor, Advances in Cryptology – EUROCRYPT’99, volume 1592 of Lecture Notes in Computer Science, pages 176–189. Springer, Heidelberg, May 1999. (Cited on pages 215 and 216.)
[217] N. Koblitz. Elliptic curve cryptosystems. Mathematics of Computation, 48(177):203–209, 1987. (Cited on pages 22, 40, and 93.)
[218] N. Koblitz and A. Menezes. Pairing-based cryptography at high security levels (invited paper). In N. P. Smart, editor, 10th IMA International Conference on Cryptography and Coding, volume 3796 of Lecture Notes in Computer Science, pages 13–36. Springer, Heidelberg, Dec. 2005. (Cited on page 215.)
[219] Ç. K. Koç and T. Acar. Montgomery multiplication in GF(2^k). Designs, Codes and Cryptography, 14(1):57–69, 1998. (Cited on pages 21 and 40.)
[220] P. C. Kocher. Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems. In N. Koblitz, editor, Advances in Cryptology – CRYPTO’96, volume 1109 of Lecture Notes in Computer Science, pages 104–113. Springer, Heidelberg, Aug. 1996. (Cited on pages 77 and 79.)
[221] P. C. Kocher, J. Jaffe, and B. Jun. Differential power analysis. In M. J. Wiener, editor, Advances in Cryptology – CRYPTO’99, volume 1666 of Lecture Notes in Computer Science, pages 388–397. Springer, Heidelberg, Aug. 1999. (Cited on pages 12, 19, and 77.)
[222] P. Kornerup. A systolic, linear-array multiplier for a class of right-shift algorithms. IEEE Transactions on Computers, 43(8):892–898, 1994. (Cited on page 76.)
[223] M. Koschuch, J. Lechner, A. Weitzer, J. Großschädl, A. Szekely, S. Tillich, and J. Wolkerstorfer. Hardware/software co-design of elliptic curve cryptography on an 8051 microcontroller. In L. Goubin and M. Matsui, editors, Cryptographic Hardware and Embedded Systems – CHES 2006, volume 4249 of Lecture Notes in Computer Science, pages 430–444. Springer, Heidelberg, Oct. 2006. (Cited on page 40.)
[224] M. Kraitchik. Théorie des nombres, Tome II. Gauthier-Villars, Paris, 1926. (Cited on page 117.)
[225] M. Kraitchik. Recherches sur la théorie des nombres, Tome II. Gauthier-Villars, Paris, 1929. (Cited on page 117.)
[226] B. A. LaMacchia and A. M. Odlyzko. Solving large sparse linear systems over finite fields. In A. J. Menezes and S. A. Vanstone, editors, Advances in Cryptology – CRYPTO’90, volume 537 of Lecture Notes in Computer Science, pages 109–133. Springer, Heidelberg, Aug. 1991. (Cited on pages 7, 123, 176, and 179.)
[227] K. Lauter, P. L. Montgomery, and M. Naehrig. An analysis of affine coordinates for pairing computation. In M. Joye, A. Miyaji, and A. Otsuka, editors, PAIRING 2010: 4th International Conference on Pairing-based Cryptography, volume 6487 of Lecture Notes in Computer Science, pages 1–20. Springer, Heidelberg, Dec. 2010. (Cited on pages 4, 8, and 206.)
[228] F. Le Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation, ISSAC ’14, pages 296–303, New York, NY, USA, 2014. ACM. (Cited on page 123.)
[229] A. K. Lenstra. Proof of the factorization of the Scientific American challenge. http://www.joppebos.com/petmon/chap5_fig.pdf. (Cited on page 135.)
[230] A. K. Lenstra. Fast and rigorous factorization under the generalized Riemann hypothesis. Indagationes Mathematicae, 50:443–454, 1988. (Cited on page 160.)
[231] A. K. Lenstra. Generating RSA moduli with a predetermined portion. In K. Ohta and D. Pei, editors, Advances in Cryptology – ASIACRYPT’98, volume 1514 of Lecture Notes in Computer Science, pages 1–10. Springer, Heidelberg, Oct. 1998. (Cited on page 24.)
[232] A. K. Lenstra, H. W. Lenstra, and L. Lovász. Factoring polynomials with rational coefficients. Mathematische Annalen, 261(4):515–534, 1982. (Cited on pages 157 and 167.)
[233] A. K. Lenstra and H. W. Lenstra Jr. Algorithms in number theory. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science (Volume A: Algorithms and Complexity), pages 673–715. Elsevier and MIT Press, 1990. (Cited on pages 119, 121, 123, 136, 137, 140, and 160.)
[234] A. K. Lenstra and H. W. Lenstra Jr. The Development of the Number Field Sieve, volume 1554 of Lecture Notes in Mathematics. Springer-Verlag, 1993. (Cited on pages 6, 116, 139, 239, 241, 243, 252, and 255.)
[235] A. K. Lenstra, H. W. Lenstra Jr., M. S. Manasse, and J. M. Pollard. The number field sieve. pages 11–42 in [234], 1989. (Cited on pages 7, 139, 141, 142, 143, 145, 146, 147, 148, 152, and 154.)
[236] A. K. Lenstra, H. W. Lenstra Jr., M. S. Manasse, and J. M. Pollard. The factorization of the ninth Fermat number. Mathematics of Computation, 61(203):319–349, 1993. (Cited on pages 125, 141, 146, and 148.)
[237] A. K. Lenstra and M. S. Manasse. Factoring by electronic mail. In J.-J. Quisquater and J. Vandewalle, editors, Advances in Cryptology – EUROCRYPT’89, volume 434 of Lecture Notes in Computer Science, pages 355–371. Springer, Heidelberg, Apr. 1990. (Cited on pages 116, 135, 136, and 138.)
[238] A. K. Lenstra and M. S. Manasse. Factoring with two large primes. Mathematics of Computation, 63:785–798, 1994. (Cited on pages 132 and 137.)
[239] H. W. Lenstra Jr. Factoring integers with elliptic curves. Annals of Mathematics, 126(3):649–673, 1987. (Cited on pages 6, 8, 116, 121, and 189.)
[240] H. W. Lenstra Jr. and C. Pomerance. A rigorous time bound for factoring integers. Journal of the American Mathematical Society, 5:483–516, 1992. (Cited on page 160.)
[241] P. C. Leyland, A. K. Lenstra, B. Dodson, A. Muffett, and S. S. Wagstaff Jr. MPQS with three large primes. In C. Fieker and D. R. Kohel, editors, Algorithmic Number Theory, 5th International Symposium, ANTS-V, volume 2369 of Lecture Notes in Computer Science, pages 446–460. Springer, 2002. (Cited on page 137.)
[242] Z. Liu and J. Großschädl. New speed records for Montgomery modular multiplication on 8-bit AVR microcontrollers. In D. Pointcheval and D. Vergnaud, editors, AFRICACRYPT 14: 7th International Conference on Cryptology in Africa, volume 8469 of Lecture Notes in Computer Science, pages 215–234. Springer, Heidelberg, May 2014. (Cited on pages 44 and 47.)
[243] A. Menezes, T. Okamoto, and S. A. Vanstone. Reducing elliptic curve logarithms to logarithms in a finite field. IEEE Transactions on Information Theory, 39(5):1639–1646, 1993. (Cited on pages 206 and 213.)
[244] R. D. Merrill. Improving digital computer performance using residue number theory. IEEE Transactions on Electronic Computers, EC-13(2):93–101, April 1964. (Cited on page 36.)
[245] T. S. Messerges, E. A. Dabbish, and R. H. Sloan. Power analysis attacks of modular exponentiation in smartcards. In Ç. K. Koç and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES’99, volume 1717 of Lecture Notes in Computer Science, pages 144–157. Springer, Heidelberg, Aug. 1999. (Cited on page 77.)
[246] A. Miele, J. W. Bos, T. Kleinjung, and A. K. Lenstra. Cofactorization on graphics processing units. In L. Batina and M. Robshaw, editors, Cryptographic Hardware and Embedded Systems – CHES 2014, volume 8731 of Lecture Notes in Computer Science, pages 335–352. Springer, Heidelberg, Sept. 2014. (Cited on page 152.)
[247] V. S. Miller. Use of elliptic curves in cryptography. In H. C. Williams, editor, Advances in Cryptology – CRYPTO’85, volume 218 of Lecture Notes in Computer Science, pages 417–426. Springer, Heidelberg, Aug. 1986. (Cited on pages 22, 40, 93, and 95.)
[248] V. S. Miller. The Weil pairing, and its efficient calculation. Journal of Cryptology, 17(4):235–261, Sept. 2004. (Cited on pages 8 and 210.)
[249] B. Möller. Algorithms for multi-exponentiation. In S. Vaudenay and A. M. Youssef, editors, SAC 2001: 8th Annual International Workshop on Selected Areas in Cryptography, volume 2259 of Lecture Notes in Computer Science, pages 165–180. Springer, Heidelberg, Aug. 2001. (Cited on page 229.)
[250] P. L. Montgomery. Evaluation of boolean expressions on one’s complement machines. SIGPLAN Notices, 13:60–72, 1978. (Cited on page 1.)
[251] P. L. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170):519–521, April 1985. (Cited on pages 4, 5, 10, 13, 15, 17, 40, 42, and 81.)
[252] P. L. Montgomery. Speeding the Pollard and elliptic curve methods of factorization. Mathematics of Computation, 48(177):243–264, 1987. (Cited on pages 5, 6, 8, 83, 85, 189, 197, 207, and 218.)
[253] P. L. Montgomery. Evaluating recurrences of form X_{m+n} = f(X_m, X_n, X_{m−n}) via Lucas chains, 1992. https://cr.yp.to/bib/1992/montgomery-lucas.pdf (accessed May 4, 2017). (Cited on pages 85, 87, 111, 114, and 115.)
[254] P. L. Montgomery. An FFT extension of the elliptic curve method of factorization. PhD thesis, University of California, 1992. (Cited on pages 2, 8, 94, 189, 190, 193, 194, 196, 197, and 198.)


[255] P. L. Montgomery. Square roots of products of algebraic numbers. Mathematics of Computation 1943-1993: A Half-Century of Computational Mathematics, 48:567–571, 1994. (Cited on pages 7 and 157.)
[256] P. L. Montgomery. A survey of modern integer factorization algorithms. CWI Quarterly, 7(4):337–366, December 1994. (Cited on page 168.)
[257] P. L. Montgomery. A block Lanczos algorithm for finding dependencies over GF(2). In L. C. Guillou and J.-J. Quisquater, editors, Advances in Cryptology – EUROCRYPT’95, volume 921 of Lecture Notes in Computer Science, pages 106–120. Springer, Heidelberg, May 1995. (Cited on pages 7, 123, 179, 180, 183, 184, and 186.)
[258] P. L. Montgomery. Parallel block Lanczos, 2000. Slides of presentation at RSA2000, dated January 17, 2000. (Cited on page 186.)
[259] P. L. Montgomery. Five, six, and seven-term Karatsuba-like formulae. IEEE Transactions on Computers, 54(3):362–369, 2005. (Cited on pages 4 and 217.)
[260] P. L. Montgomery. Searching for higher-degree polynomials for the general number field sieve. helper.ipam.ucla.edu/publications/scws1/scws1_6223.ppt, October 2006. (Cited on page 168.)
[261] P. L. Montgomery and A. Kruppa. Improved stage 2 to P±1 factoring algorithms. In A. J. van der Poorten and A. Stein, editors, Algorithmic Number Theory – ANTS-VIII, volume 5011 of Lecture Notes in Computer Science, pages 180–195. Springer, 2008. (Cited on pages 4, 8, 189, 200, 201, 202, and 204.)
[262] P. L. Montgomery, S. Nahm, and S. S. Wagstaff Jr. The period of the Bell numbers modulo a prime. Mathematics of Computation, 79(271):1793–1800, 2010. (Cited on page 4.)
[263] P. L. Montgomery and R. D. Silverman. An FFT extension to the p − 1 factoring algorithm. Mathematics of Computation, 54(190):839–854, 1990. (Cited on pages 8, 189, 190, and 200.)
[264] M. A. Morrison and J. Brillhart. A method of factoring and the factorization of F_7. Mathematics of Computation, 29(129):183–205, 1975. (Cited on pages 116, 117, 127, and 128.)
[265] A. Moss, D. Page, and N. P. Smart. Toward acceleration of RSA using 3D graphics hardware. In S. D. Galbraith, editor, 11th IMA International Conference on Cryptography and Coding, volume 4887 of Lecture Notes in Computer Science, pages 364–383. Springer, Heidelberg, Dec. 2007. (Cited on page 37.)
[266] B. A. Murphy. Polynomial selection for the number field sieve integer factorisation algorithm. PhD thesis, Australian National University, 1999. (Cited on pages 6, 171, and 172.)
[267] National Institute of Standards and Technology (NIST). Digital signature standard (DSS). Technical Report FIPS Publication 186-4, July 2013. (Cited on page 40.)
[268] National Security Agency (NSA). Compromising emanations laboratory test requirements, electromagnetics (U). Technical Report National COMSEC Information Memorandum (NACSIM) 5100A, NSA, 1981. (Classified). (Cited on page 76.)
[269] P. Q. Nguyen. A Montgomery-like square root for the number field sieve. In J. Buhler, editor, ANTS, volume 1423 of Lecture Notes in Computer Science, pages 151–168. Springer, 1998. (Cited on page 157.)

[270] A. M. Odlyzko. Discrete logarithms in finite fields and their cryptographic significance. In T. Beth, N. Cot, and I. Ingemarsson, editors, Advances in Cryptology – EUROCRYPT’84, volume 209 of Lecture Notes in Computer Science, pages 224–314. Springer, Heidelberg, Apr. 1985. (Cited on pages 123 and 140.)
[271] G. Orlando and C. Paar. A scalable GF(p) elliptic curve processor architecture for programmable hardware. In Ç. K. Koç, D. Naccache, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2001, volume 2162 of Lecture Notes in Computer Science, pages 348–363. Springer, Heidelberg, May 2001. (Cited on page 61.)
[272] D. A. Osvik, J. W. Bos, D. Stefan, and D. Canright. Fast software AES encryption. In S. Hong and T. Iwata, editors, Fast Software Encryption – FSE 2010, volume 6147 of Lecture Notes in Computer Science, pages 75–93. Springer, Heidelberg, Feb. 2010. (Cited on page 32.)
[273] K. Pabbuleti, D. Mane, A. Desai, C. Albert, and P. Schaumont. SIMD acceleration of modular arithmetic on contemporary embedded platforms. In IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6. IEEE, 2013. (Cited on page 56.)
[274] D. Page and N. P. Smart. Parallel cryptographic arithmetic using a redundant Montgomery representation. IEEE Trans. Computers, 53(11):1474–1482, 2004. (Cited on page 27.)
[275] B. N. Parlett, D. R. Taylor, and Z. A. Liu. A look-ahead Lanczos algorithm for unsymmetric matrices. Mathematics of Computation, 44(169):105–124, Jan. 1985. (Cited on pages 179 and 180.)
[276] R. Peralta. A quadratic sieve on the n-dimensional cube. In E. F. Brickell, editor, Advances in Cryptology – CRYPTO’92, volume 740 of Lecture Notes in Computer Science, pages 324–332. Springer, Heidelberg, Aug. 1993. (Cited on page 137.)
[277] B. J. Phillips, Y. Kong, and Z. Lim. Highly parallel modular multiplication in the residue number system using sum of residues reduction. Appl. Algebra Eng. Commun. Comput., 21(3):249–255, 2010. (Cited on page 37.)
[278] J. M. Pollard. Theorems on factorization and primality testing. Proceedings of the Cambridge Philosophical Society, 76:521–528, 1974. (Cited on pages 8, 116, 189, 190, 199, and 200.)
[279] J. M. Pollard. A Monte Carlo method for factorization. BIT Numerical Mathematics, 15(3):331–334, 1975. (Cited on pages 116 and 121.)
[280] J. M. Pollard. Factoring with cubic integers. Pages 4–10 in [234], 1988. (Cited on pages 138, 139, 145, 146, and 147.)
[281] J. M. Pollard. The lattice sieve. Pages 43–49 in [234], 1990. (Cited on pages 7, 148, and 149.)
[282] C. Pomerance. Analysis and comparison of some integer factoring algorithms. In H. W. Lenstra Jr. and R. Tijdeman, editors, Computational methods in number theory I, volume 154 of Mathematical Centre Tracts, pages 89–139, Amsterdam, 1982. Mathematisch Centrum. (Cited on pages 6, 119, 126, 131, 132, and 137.)
[283] C. Pomerance. Fast, rigorous factorization and discrete logarithm algorithms. Pages 119–143 in [193], 1987. (Cited on pages 121 and 160.)
[284] C. Pomerance, October 1988. Private communication. (Cited on page 135.)

[285] C. Pomerance. A tale of two sieves. Notices of the AMS, 43(12):1473–1485, December 1996. (Cited on page 117.)
[286] C. Pomerance and J. W. Smith. Reduction of huge, sparse matrices over finite fields via created catastrophes. Experiment. Math., 1:89–94, 1992. (Cited on pages 7, 123, 124, and 125.)
[287] C. Pomerance, J. W. Smith, and R. Tuler. A pipeline architecture for factoring large integers with the quadratic sieve algorithm. SIAM J. Comput., 17:387–403, 1988. (Cited on pages 6 and 137.)
[288] C. Pomerance, J. W. Smith, and S. S. Wagstaff. New ideas for factoring large integers. In D. Chaum, editor, Advances in Cryptology – CRYPTO’83, pages 81–85. Plenum Press, New York, USA, 1983. (Cited on page 127.)
[289] K. Posch and R. Posch. Base extension using a convolution sum in residue number systems. Computing, 50(2):93–104, 1993. (Cited on page 36.)
[290] K. C. Posch and R. Posch. Modulo reduction in residue number systems. IEEE Trans. Parallel Distrib. Syst., 6(5):449–454, 1995. (Cited on page 36.)
[291] T. Prest and P. Zimmermann. Non-linear polynomial selection for the number field sieve. J. Symb. Comput., 47(4):401–409, 2012. (Cited on page 168.)
[292] Q. Pu and X. Zhao. Montgomery exponentiation with no final comparisons: Improved results. In Pacific-Asia Conference on Circuits, Communications and Systems, pages 614–616. IEEE, 2009. (Cited on page 20.)
[293] J.-J. Quisquater and D. Samyde. Electromagnetic analysis (EMA): measures and counter-measures for smart cards. In I. Attali and T. Jensen, editors, Smart Card Programming and Security, E-smart 2001, Cannes, France, September 19–21, 2001, volume 2140 of Lecture Notes in Computer Science, pages 200–210. Springer-Verlag, 2001. (Cited on page 77.)
[294] R. L. Rivest, A. Shamir, and L. M. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the Association for Computing Machinery, 21(2):120–126, 1978. (Cited on pages 17, 20, 40, 117, 119, and 131.)
[295] B. Rothschild, November 2015. Private communication. (Cited on page 1.)
[296] A. Sahai and B. R. Waters. Fuzzy identity-based encryption. In R. Cramer, editor, Advances in Cryptology – EUROCRYPT 2005, volume 3494 of Lecture Notes in Computer Science, pages 457–473. Springer, Heidelberg, May 2005. (Cited on page 206.)
[297] R. Sakai, K. Ohgishi, and M. Kasahara. Cryptosystems based on pairing. In 2000 Symposium on Cryptography and Information Security – SCIS 2000, 2000. (Cited on page 206.)
[298] E. Savas, A. F. Tenca, and Ç. K. Koç. A scalable and unified multiplier architecture for finite fields GF(p) and GF(2^m). In Ç. K. Koç and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2000, volume 1965 of Lecture Notes in Computer Science, pages 277–292. Springer, Heidelberg, Aug. 2000. (Cited on pages 40 and 67.)
[299] M. Schacher, November 2015. Private communication. (Cited on page 3.)
[300] W. Schindler. A timing attack against RSA with the Chinese remainder theorem. In Ç. K. Koç and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2000, volume 1965 of Lecture Notes in Computer Science, pages 109–124. Springer, Heidelberg, Aug. 2000. (Cited on page 77.)

[301] A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7(3–4):281–292, 1971. (Cited on page 15.)
[302] R. J. Schoof. Quadratic fields and factorization. In H. W. Lenstra Jr. and R. Tijdeman, editors, Computational methods in number theory II, volume 155 of Mathematical Centre Tracts, pages 235–286, Amsterdam, 1982. Mathematisch Centrum. (Cited on page 118.)
[303] R. Schroeppel, April 2015. Private communication. (Cited on pages 9, 116, 129, 130, 131, and 132.)
[304] R. Schroeppel and C. Beaver. Accelerating elliptic curve calculations with the reciprocal sharing trick. Mathematics of Public-Key Cryptography (MPKC), University of Illinois at Chicago, 2003. (Cited on page 225.)
[305] M. Scott. Computing the Tate pairing. In A. Menezes, editor, Topics in Cryptology – CT-RSA 2005, volume 3376 of Lecture Notes in Computer Science, pages 293–304. Springer, Heidelberg, Feb. 2005. (Cited on page 227.)
[306] M. Scott. On the efficient implementation of pairing-based protocols. In L. Chen, editor, Cryptography and Coding – IMACC, volume 7089 of Lecture Notes in Computer Science, pages 296–308. Springer, 2011. (Cited on page 227.)
[307] M. Scott, N. Benger, M. Charlemagne, L. J. D. Perez, and E. J. Kachisa. On the final exponentiation for calculating pairings on ordinary elliptic curves. In H. Shacham and B. Waters, editors, Pairing-Based Cryptography – Pairing 2009, Third International Conference, Palo Alto, CA, USA, August 12–14, 2009, Proceedings, volume 5671 of Lecture Notes in Computer Science, pages 78–88. Springer, 2009. (Cited on page 211.)
[308] H. Seo, Z. Liu, J. Großschädl, J. Choi, and H. Kim. Montgomery modular multiplication on ARM-NEON revisited. In J. Lee and J. Kim, editors, Information Security and Cryptology – ICISC 2014, volume 8949 of Lecture Notes in Computer Science, pages 328–342. Springer, 2015. (Cited on pages 26 and 47.)
[309] M. Seysen. A probabilistic factorization algorithm with quadratic forms of negative discriminant. Mathematics of Computation, 48:757–780, 1987. (Cited on pages 128 and 160.)
[310] M. Shand and J. Vuillemin. Fast implementations of RSA cryptography. In E. E. Swartzlander Jr., M. J. Irwin, and G. A. Jullien, editors, 11th Symposium on Computer Arithmetic, pages 252–259. IEEE Computer Society, 1993. (Cited on pages 12 and 20.)
[311] D. Shanks. Class number, a theory of factorization, and genera. In D. J. Lewis, editor, Symposia in Pure Mathematics, volume 20, pages 415–440. American Mathematical Society, 1971. (Cited on page 118.)
[312] A. Shenoy and R. Kumaresan. Fast base extension using a redundant modulus in RNS. IEEE Transactions on Computers, 38(2):292–297, 1989. (Cited on page 36.)
[313] P. W. Shor. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Journal on Computing, 26(5):1484–1509, 1997. (Cited on page 153.)
[314] J. H. Silverman. The Arithmetic of Elliptic Curves, volume 106 of Graduate Texts in Mathematics. Springer-Verlag, 1986. (Cited on page 208.)
[315] R. D. Silverman. The multiple polynomial quadratic sieve. Mathematics of Computation, 48:329–339, 1987. (Cited on pages 6, 134, and 136.)

[316] J. A. Solinas. Generalized Mersenne numbers. Technical Report CORR 99-39, Centre for Applied Cryptographic Research, University of Waterloo, 1999. (Cited on page 22.)
[317] M. Stam. Speeding up subgroup cryptosystems. PhD thesis, Technische Universiteit Eindhoven, 2003. https://dx.doi.org/10.6100/IR564670. (Cited on pages 108, 114, and 115.)
[318] M. Stevens, A. Sotirov, J. Appelbaum, A. K. Lenstra, D. Molnar, D. A. Osvik, and B. de Weger. Short chosen-prefix collisions for MD5 and the creation of a rogue CA certificate. In S. Halevi, editor, Advances in Cryptology – CRYPTO 2009, volume 5677 of Lecture Notes in Computer Science, pages 55–69. Springer, Heidelberg, Aug. 2009. (Cited on page 32.)
[319] V. Strassen. Gaussian elimination is not optimal. Numer. Math., 13:354–356, 1969. (Cited on page 123.)
[320] A. Svoboda. An algorithm for division. Information Processing Machines, 9:25–34, 1963. (Cited on page 24.)
[321] N. S. Szabo and R. I. Tanaka. Residue arithmetic and its applications to computer technology. McGraw-Hill, 1967. (Cited on pages 36 and 37.)
[322] R. Szerwinski and T. Güneysu. Exploiting the power of GPUs for asymmetric cryptography. In E. Oswald and P. Rohatgi, editors, Cryptographic Hardware and Embedded Systems – CHES 2008, volume 5154 of Lecture Notes in Computer Science, pages 79–99. Springer, Heidelberg, Aug. 2008. (Cited on page 37.)
[323] O. Takahashi, R. Cook, S. Cottier, S. H. Dhong, B. Flachs, K. Hirairi, A. Kawasumi, H. Murakami, H. Noro, H. Oh, S. Onish, J. Pille, and J. Silberman. The circuit design of the synergistic processor element of a Cell processor. In International Conference on Computer-Aided Design – ICCAD 2005, pages 111–117. IEEE Computer Society, 2005. (Cited on page 32.)
[324] E. Thomé. Square root algorithms for the number field sieve. In F. Özbudak and F. Rodríguez-Henríquez, editors, WAIFI, volume 7369 of Lecture Notes in Computer Science, pages 208–224. Springer, 2012. (Cited on pages 156 and 157.)
[325] K. Tiri, M. Akmal, and I. Verbauwhede. A dynamic and differential CMOS logic with signal independent power consumption to withstand differential power analysis on smart cards. In European Solid-State Circuits Conference – ESSCIRC 2002, Florence, 24–26 Sept. 2002, pages 403–406. Università di Bologna, 2002. (Cited on page 81.)
[326] K. Tiri and I. Verbauwhede. A logic level design methodology for a secure DPA resistant ASIC or FPGA implementation. In Design, Automation and Test in Europe Conference and Exposition – DATE 2004, Paris, 16–20 February 2004, pages 246–251. IEEE, 2004. (Cited on page 81.)
[327] A. Toom. The complexity of a scheme of functional elements realizing the multiplication of integers. Soviet Mathematics Doklady, 3(4):714–716, 1963. (Cited on page 15.)
[328] Y. Tsuruoka. Computing short Lucas chains for elliptic curve cryptosystems. IEICE Transactions on Fundamentals, E84-A(5):1227–1233, 2001. (Cited on pages 113 and 115.)
[329] M. Ugon. Portable data carrier including a microprocessor. US Patent and Trademark Office, July 8, 1980. US Patent No. 4211919 (Abstract). (Cited on page 76.)

[330] U.S. Department of Commerce/National Institute of Standards and Technology. Digital Signature Standard (DSS). FIPS-186-4, 2013. http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.186-4.pdf (accessed May 4, 2017). (Cited on page 22.)
[331] B. Vallée. Generation of elements with small modular squares and provably fast integer factoring algorithms. Mathematics of Computation, 56:823–849, 1991. (Cited on page 160.)
[332] W. van Eck. Electromagnetic radiation from video display units: An eavesdropping risk? Computers and Security, 4(4):269–286, Dec. 1985. (Cited on page 76.)
[333] F. Vercauteren. Optimal pairings. IEEE Transactions on Information Theory, 56(1):455–461, 2010. (Cited on page 212.)
[334] C. D. Walter. Fast modular multiplication using 2-power radix. International J. Computer Mathematics, 39(1–2):21–28, 1991. (Cited on page 48.)
[335] C. D. Walter. Faster modular multiplication by operand scaling. In J. Feigenbaum, editor, Advances in Cryptology – CRYPTO’91, volume 576 of Lecture Notes in Computer Science, pages 313–323. Springer, Heidelberg, Aug. 1992. (Cited on page 65.)
[336] C. D. Walter. Systolic modular multiplication. IEEE Transactions on Computers, 42(3):376–378, Mar. 1993. (Cited on pages 27, 67, 68, 69, and 76.)
[337] C. D. Walter. Montgomery exponentiation needs no final subtractions. Electronics Letters, 35(21):1831–1832, Oct. 1999. (Cited on pages 20 and 79.)
[338] C. D. Walter. An improved linear systolic array for fast modular exponentiation. IEE Computers and Digital Techniques, 147(5):323–328, Sept. 2000. (Cited on pages 75 and 76.)
[339] C. D. Walter. Sliding windows succumbs to Big Mac attack. In Ç. K. Koç, D. Naccache, and C. Paar, editors, Cryptographic Hardware and Embedded Systems – CHES 2001, volume 2162 of Lecture Notes in Computer Science, pages 286–299. Springer, Heidelberg, May 2001. (Cited on page 80.)
[340] C. D. Walter. Precise bounds for Montgomery modular multiplication and some potentially insecure RSA moduli. In B. Preneel, editor, Topics in Cryptology – CT-RSA 2002, volume 2271 of Lecture Notes in Computer Science, pages 30–39. Springer, Heidelberg, Feb. 2002. (Cited on pages 20, 78, and 79.)
[341] C. D. Walter. Longer keys may facilitate side channel attacks. In M. Matsui and R. J. Zuccherato, editors, SAC 2003: 10th Annual International Workshop on Selected Areas in Cryptography, volume 3006 of Lecture Notes in Computer Science, pages 42–57. Springer, Heidelberg, Aug. 2004. (Cited on pages 78 and 79.)
[342] C. D. Walter and S. Thompson. Distinguishing exponent digits by observing modular subtractions. In D. Naccache, editor, Topics in Cryptology – CT-RSA 2001, volume 2020 of Lecture Notes in Computer Science, pages 192–207. Springer, Heidelberg, Apr. 2001. (Cited on pages 20, 21, and 77.)
[343] D. Weber. Computing discrete logarithms with quadratic number rings. In EUROCRYPT’98, pages 171–183, 1998. (Cited on page 145.)
[344] E. Wenger and P. Wolfger. Solving the discrete logarithm of a 113-bit Koblitz curve with an FPGA cluster. In A. Joux and A. M. Youssef, editors, SAC 2014: 21st Annual International Workshop on Selected Areas in Cryptography, volume 8781 of Lecture Notes in Computer Science, pages 363–379. Springer, Heidelberg, Aug. 2014. (Cited on page 9.)

[345] A. E. Western and J. C. P. Miller. Tables of indices and primitive roots. Royal Society Mathematical Tables, vol. 9, Cambridge University Press, 1968. (Cited on pages 117, 139, and 140.)
[346] WhatsApp Inc. WhatsApp website, 2017. (Cited on page 94.)
[347] D. H. Wiedemann. Solving sparse linear equations over finite fields. IEEE Trans. Inform. Theory, IT-32(1):54–62, Jan. 1986. (Cited on pages 123 and 176.)
[348] H. C. Williams. A p + 1 method of factoring. Mathematics of Computation, 39(159):225–234, 1982. (Cited on pages 116, 191, and 199.)
[349] P. Zimmermann and B. Dodson. 20 years of ECM. In F. Hess, S. Pauli, and M. E. Pohst, editors, Algorithmic Number Theory – ANTS-VII, volume 4076 of Lecture Notes in Computer Science, pages 525–542. Springer-Verlag, 2006. Erratum: http://www.loria.fr/~zimmerma/papers/. (Cited on pages 189, 190, 191, and 197.)

Subject Index

adder, 54, 56, 62
addition chain, 6
  differential, see differential addition, chain
  strong, see differential addition, chain
  two-dimensional, 107
advanced RISC machine, 31, 41, 42
affine point, 88, 91
algebraic side, 147
alignment
  left, 44
  right, 45
antenna, 77
application specific integrated circuit, 41, 53, 54, 56, 59, 62, 74, 79
area, 44, 53–62, 67, 75–76, 80
ARM, see advanced RISC machine
ASIC, see application specific integrated circuit
ate pairing, see cryptographic pairing
Atmega128, 47
Barrett multiplication, 11
Barrett reduction, 48
binary field, 21
bipartite modular multiplication, 27
birational equivalence, 92
block Lanczos algorithm, see Lanczos algorithm
block Wiedemann algorithm, see Wiedemann algorithm
Bluestein’s algorithm, 202, 204
Brent–Suyama extension, 198
bus, 77, 80, 81
  width, 43, 52, 57
capacitance, 80
carry propagation, 34, 35, 40, 53–55, 62–64, 73
carry-save
  representation, 51, 53–55, 59, 62, 63
  value, 68
Cell broadband engine, 32
CFRAC, see continued fraction method
CFRC, 114
characteristic, 40, 44, 87–106, 140, 178
  equal to two, 21, 67, 87, 179, 206
  greater than two, 22, 208, 213, 215
Chebyshev chain, see differential addition, chain
Chebyshev polynomial, 85, 201, 202
Chinese remainder theorem, 36, 46, 156, 197
Chung–Hasan arithmetic, 25
clock
  cycle, 52, 56, 61, 68
  energy, 81
  profile, 77
  speed, 56, 76
clock group, 83
  addition, 84
  scalar multiplication, 84
  twisted, 83
CMOS, 80
co-processor, 42, 56
column-wise representation, 33
complete addition law, 91
complexity
  area, 54
  circuit, 61
  computational, 58, 61
  quotient digit, 65
  systolic array, 76
congruence of squares, 117, 175
  finding dependencies, 123
  generic analysis, 120
  linear algebra, 118, 123, 175
  relation collection, 118, 121
  smoothness testing, see smoothness testing

constant time, 105
continued fraction method, 127–129, 175
  multiplier, 129
Coppersmith–Winograd method, 123, 126
counter, 56
critical path, 44, 55–57, 62
cryptographic pairing, 8, 206
  affine coordinates, 219
  ate pairing, 211
    optimal, 212
  multiple pairing, 227
  pairing-friendly elliptic curve, 213
  product of pairings, 227
  squared pairing, 231–234
  Tate pairing, 209
  Weil pairing, 209
cubic sieve, 137
Cunningham number, 117, 128, 146
Curve25519, 83, 94
Curve41417, 95
Curve448, 95
depth
  hardware, 56, 61
  logarithmic, 56, 57
DFT, see discrete Fourier transform
Dickson polynomial, 198
differential addition, 99–104
  chain, 86, 190
differential addition-subtraction chain, 87
Diffie–Hellman key exchange, 4, 40, 93, 206
digital signature, 4, 40, 94, 206, 207, 214
discrete Fourier transform, 202
discrete logarithm, 139–145, 187
  elliptic curve, transfer attack, 213
division polynomial, 95
Dixon’s random squares method, 126–127
double-and-add method, 84
DSA, see digital signature
dual addition law, 91
ECC, see elliptic curve, cryptography
ECM, see elliptic curve method of factorization
Edwards curve, 83, 90, 92
  addition, 91
  complete, 90
  dual addition law, 91
electromagnetic radiation, 77, 80
elliptic curve, 90, 208
  addition, 88, 89
  affine point, 88
  birational equivalence, 92
  cryptography, 4, 22, 40, 46, 93, 206, 207, 214
  differential addition, 99–104
  divisor, 209
  Edwards curve, see Edwards curve
  group law, 208
  method of factorization, see elliptic curve method of factorization
  Montgomery curve, see Montgomery curve
  pairing-friendly, 213
  projective point, 89
  short Weierstrass curve, see Weierstrass curve, short
  signature, 94
  twist, 211–227
  Weierstrass curve, see Weierstrass curve
elliptic curve method of factorization, 6, 116, 121, 189
  GMP-ECM, 193, 198, 204
embarrassing parallelism, 116, 134
embedding degree, 209
EMR, see electromagnetic radiation
energy, 76, 81
exponent, 79
  randomised, 80
exponentiation, 17, 20, 53, 63, 75, 79, 80
factor base, 118
factoring with cubic integers, 139
factorization factory, 117, 159
fast Fourier transform extension, 2, 8, 189, 190
  elliptic curve method, 2, 191–199
  p − 1 and p + 1 method, 8, 199–205
fast polynomial arithmetic extension, see fast Fourier transform extension
Fermat number, 116, 117, 129, 138, 139
FFT, see fast Fourier transform
field-programmable gate array, 54, 59, 68, 74
filtering, 7, 123–126
finding dependencies, 123
finite field, 67
  characteristic two, 21
  field extension, 215
  inversion, 215
  optimal extension field, 215
  optimal tower field, 218
first degree prime ideal, 142
FPGA, see field-programmable gate array

free relation, 150
Frobenius endomorphism, 211
Garner’s formula, 46
Gaussian elimination, 7, 123, 125, 135
Gaussian integer method, 139, 144
general number field sieve, see number field sieve
Georgia Cracker, 127
GMP-ECM, 193, 198, 204
Gram–Schmidt orthogonalization, 176
graphics processing unit, 37
group law
  clock, 84
  Edwards curve, 91
  Montgomery curve, 89
  Weierstrass curve, 88
Hamming weight, 81
Hensel division, 12
inhomogeneous linear system, 185
integer factoring, 5, 116
Internet computation, 116, 135
Karatsuba multiplication, 15, 44, 215, 217, 223
Kronecker–Schönhage trick, 197
Krylov subspace, 177
L-notation, 119
Lanczos algorithm, 7, 123, 175
  characteristic 2, 179
  look-ahead, 179
  parallel computing, 186–187
  standard, 176–179
  termination, 184
large prime relation, 121, 129, 132, 134, 137, 152–153
large-difference chain, 111–115
latency, 8, 10, 26, 28, 37, 55, 57, 67
Laurent series, 194
leakage, see side-channel, leakage
Lenstra’s ECM, see elliptic curve method of factorization
linear algebra problem
  sparse, 175
linear algebra step, 118
linear sieve, 129–132
  multiplier, 132
logic
  combinational, 48, 49, 53, 54, 57
  dual-rail pre-charge, 81
  pass-transistor, 80
  sense amplifier based logic, 81
  wave dynamic differential, 81
look-up table, 62, 143, 216
loop unrolling, 44
Lucas
  chain, see differential addition, chain
  ladder, 84, 85, 111
  number, 205
  sequence, 84, 85
MAC, see multiply-accumulate
matrix multiplication exponent, 123, 126, 176
Mersenne number, 23, 117, 159, 199
Miller’s algorithm, 210, 212
modular multiplication
  Barrett, see Barrett reduction
  bipartite, see bipartite modular multiplication
  classical algorithm, 45
  Montgomery, see Montgomery multiplication
Montgomery
  addition, 17
  constant
    μ, 18
    R2, 17, 19
    R3, 18
  curve, see Montgomery curve
  domain, 79
  friendly modulus, 24
  friendly prime, 22, 24
  inverse, 17
  ladder, see Montgomery ladder
  multiplication, see Montgomery multiplication
  radix, 13
  representation, 17
  subtraction, 17
  tail tailoring, 24
Montgomery curve, 6, 82, 83, 87, 92, 189
  addition, 87, 89
  Curve25519, see Curve25519
  Curve41417, see Curve41417
  Curve448, see Curve448
  differential addition, 99–104
    speed, 101
  doubling, 95
    completeness, 97
    speed, 96
  quasi-completeness, 101

Montgomery ladder, 6, 82, 104
  constant time, 105
Montgomery multiplication, 4, 10, 12–22, 42
  coarsely integrated operand scanning, 16, 47
  comparison-less, 20
  conditional subtraction, 14, 19, 49, 58, 77, 78, 80
  finely integrated operand scanning, 47
  in F2k, 21
  integrated, 46
  interleaved, 15, 46
  latency, 26, 28, 37
  multi-precision, 15
  operand scanning, 47
  product scanning, 47
  separated, 46
  SIMD extension, 27–31
  subtraction-less, 20
  systolic, 27
  using RNS, 36–39
Morrison–Brillhart approach, 117
msbignum, 4
multi-core processor, 54, 59
multi-exponentiation, 229
multiple polynomial quadratic sieve, see quadratic sieve, multiple polynomial
multiplexer, 61, 67, 81
multiplier, 129, 132, 133
multiply-accumulate, 47, 53–56, 63
NIST, 83
NSA, 42, 83
number field sieve, 6, 137, 152–158, 159, 161, 175
  Coppersmith’s variant, 158
  heuristic analysis, 158
  polynomial selection, see polynomial selection
  quadratic character, 155
  relation, 154
  square root computation, see square root computation for the number field sieve
oversquareness, 118
p − 1 method, 8, 116, 189, 199
p + 1 method, 116, 199
pairing, see cryptographic pairing
pipeline, 59, 67, 73
  fetch-decode-execute, 68
  instruction, 26, 32
  multiplier, 41
  stall, 47
PlayStation 3 game console, 32
Pollard’s p − 1 method, see p − 1 method
POLYEVAL algorithm, 192
POLYGCD algorithm, 195
polynomial GCD, 195
polynomial modular reduction, 193
polynomial selection, 6, 153, 161
  alpha, see quality of a polynomial pair, alpha
  base m method, 163
    with skewness, 170–171
  counting argument, 164
  early abort, 173
  lattice based method, 166–168
  quality of a polynomial pair, see quality of a polynomial pair
post-processing, 65
power, 44, 56–62, 73–77, 80
PRAC, 87, 111, 114
pre-processing, 65
precomputation, 61
prime ideal factorization, 141
principal divisor, 209
product tree algorithm, 192
projective point, 89, 91
provable integer factoring, 126–127, 160
quadratic sieve, 6, 132–137, 175
  fancy, 133–134
  multiple polynomial, 6, 134–136
  multiplier, 133
  plain, 132–133
  self-initializing, 6, 135, 137
quality of a polynomial pair, 162, 170–172
  alpha, 172, 173
Quisquater reduction, 48
radio-frequency identification, 42, 44
radix-r representation, 11
Ramsey theory, 1
rational side, 147
reciprocal Laurent polynomial, 201, 202
record calculation
  discrete logarithm
    extension field, 140
    prime field, 153
  integer factoring
    continued fraction method, 116, 127
    number field sieve, 117, 153
    quadratic sieve, 116, 134, 135
    special number field sieve, 117, 139, 146
reduced instruction set computing, 42
redundant representation, 20, 53
relation, 118, 175
relation collection step, 118
residue number system, 36–39
RFID, see radio-frequency identification
rho method, 116
RISC processor, see reduced instruction set computing
RLP, see reciprocal Laurent polynomial
RNS, see residue number system
ROM, 43, 79
root sieve, 171–173
RSA, 17, 20, 36, 40, 42, 43, 46, 53, 56, 67, 76, 77
RSA challenge
  RSA-512, 153, 176
  RSA-768, 117, 153, 176
scalability, 67, 70
scalar multiplication, 84, 105
  constant time, 105
  differential addition chain, 86
  differential addition-subtraction chain, 87
  double-and-add method, 84, 229
  inversion-sharing, 225
  Montgomery ladder, 104
  PRAC, 111, 114
  two-dimensional addition chain, 107
short Weierstrass curve, see Weierstrass curve, short
side-channel
  attack, 19, 207
  leakage, 41, 76, 80
sieving, 6, 121–122
  by vectors, 149
  lattice sieving, 7, 124, 148–150
  line sieving, 7, 147
signal-to-noise ratio, 77
signed numbers, 50
SIMD, see single instruction, multiple data
simultaneous inversion, 8, 218
single instruction, multiple data, 26, 31, 41, 47, 55, 64
skewness, 168–170
smart card, 41, 42
smoothness
  integers, 119
  polynomials, 140
smoothness testing, 121
  elliptic curve method of factorization, 121
  sieving, see sieving
  trial division, 121
sparseness, 123
SPE, see synergistic processing element
special q-prime, 124, 134, 148
special number, 146
special number field sieve, 139, 145–152
  finding relations, 147–150
  heuristic analysis, 150
  relation, 146
square root computation for the number field sieve, 6, 156–158
  Couveignes’ method, 156, 157
  direct method, 156, 157
  Montgomery’s method, 157–158
squfof, 118
Strassen’s method, 123, 126, 131
support of a divisor, 209
Suyama’s parametrization, 94, 190, 199
switching, 44, 76, 80
symmetric matrix, 177
synergistic processing element, 32–34
system-on-a-chip, 75
systolic array, 41, 67, 81
  linear, 72
  modular multiplication, 73
  multiplication, 68
Tate pairing, see cryptographic pairing
Tempest shielding, 76
throughput, 8, 37, 41, 57, 67
Toom–Cook method, 15, 215
torsion point, 209
trace
  power, 77
trial division, 116, 121
twisted Edwards curve, see Edwards curve
two-dimensional addition chain, 107
two-dimensional ladder, 107–111
unit contribution, 143
vector instruction, 26
  AVX2, 26
  SSE2, 27
Weierstrass curve, 87, 88
  addition, 88
  differential addition, 99
  doubling, 95
  short, 87

Weierstrass equation, 87, 190, 208
Weil pairing, see cryptographic pairing
Wiedemann algorithm, 7, 123, 176, 187
Williams’ p + 1 method, see p + 1 method
wiring, 44, 53, 54, 61, 67
word boundary, 66, 67
x86, 26, 31
