CAD of Circuits and Integrated Systems [1 ed.] 1786305976, 9781786305978

The theme of the book is Computer Aided Design (CAD) of circuits and integrated systems. To this end, it is necessary to


English Pages [288] Year 2020


CAD of Circuits and Integrated Systems

In memory of my parents, to my family

Series Editor Robert Baptist

CAD of Circuits and Integrated Systems

Ali Mahdoum

First published 2020 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address: ISTE Ltd 27-37 St George’s Road London SW19 4EU UK

John Wiley & Sons, Inc. 111 River Street Hoboken, NJ 07030 USA

www.iste.co.uk

www.wiley.com

© ISTE Ltd 2020

The rights of Ali Mahdoum to be identified as the author of this work have been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Control Number: 2020931766

British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library

ISBN 978-1-78630-597-8

Contents

Preface

Chapter 1. Basic Notions on Computational Complexity and Approximate Techniques
1.1. Computational complexity
1.1.1. Introduction
1.1.2. Big O notation
1.1.3. Ω notation
1.1.4. Calculation of T(n)
1.2. Language computability
1.2.1. Turing machine and class P
1.2.2. Non-deterministic algorithm and class NP
1.2.3. NP-complete problems
1.2.4. NP-hard problems
1.2.5. NP-intermediate problems
1.2.6. Co-NP problems
1.2.7. Class hierarchy
1.3. Heuristics and metaheuristics
1.3.1. Definitions
1.3.2. Graph theory
1.3.3. Branch and bound technique
1.3.4. Tabu search technique
1.3.5. Simulated annealing technique
1.3.6. Genetic and evolutionary algorithms
1.4. Conclusion

Chapter 2. Basic Notions on the Design of Digital Circuits and Systems
2.1. Introduction
2.2. History of VLSI circuit design
2.2.1. Prediffused circuit
2.2.2. Sea of gates
2.2.3. Field-programmable gate array – FPGA
2.2.4. Elementary pre-characterized circuit (standard cells)
2.2.5. Full-custom circuit
2.2.6. Silicon compilation
2.3. System design level
2.3.1. Synthesis
2.3.2. Floorplanning
2.3.3. Analysis
2.3.4. Verification
2.4. Register transfer design level
2.4.1. Synthesis
2.4.2. Analysis
2.4.3. Verification
2.5. Module design level
2.5.1. Synthesis
2.5.2. Analysis
2.5.3. Verification
2.6. Gate design level
2.6.1. Synthesis
2.6.2. Analysis
2.6.3. Verification
2.7. Transistor level
2.7.1. NMOS and CMOS technologies
2.7.2. Theory of MOS transistor (current IDS)
2.7.3. Transfer characteristics of the inverter
2.7.4. Static analysis of the inverter
2.7.5. Threshold voltage of the inverter
2.7.6. Estimation of the rise and fall times of a capacitor
2.8. Interconnections
2.8.1. Synthesis of interconnections
2.8.2. Synthesis of networks-on-chip
2.9. Conclusion

Chapter 3. Case Study: Application of Heuristics and Metaheuristics in the Design of Integrated Circuits and Systems
3.1. Introduction
3.2. System level
3.2.1. Synthesis of systems-on-chip (SoCs) with low energy consumption
3.2.2. Heuristic application to dynamic voltage and frequency scaling (DVFS) for the design of a real-time system subject to energy constraint
3.3. Register transfer level
3.3.1. Integer linear programming applied to the scheduling of operations of a data flow graph (DFG)
3.3.2. The scheduling of operations in a controlled data flow graph (considering the speed–power consumption tradeoff)
3.3.3. Efficient code assignment to the states of a finite state machine (aimed at reaching an effective control part in terms of surface, speed and power consumption)
3.3.4. Synthesis of submicron transistors and interconnections for the design of high-performance (low-power) circuits subject to power (respectively time) and surface constraints
3.4. Module level
3.4.1. Design of low-power digital circuits
3.4.2. Reduction of memory access time for the design of embedded systems
3.5. Gate level
3.5.1. Estimation of the average and maximal power consumption of a digital circuit
3.5.2. Automated layout generation of some regular structures (shifters, address decoders, PLAs)
3.5.3. Automated layout generation of digital circuits according to the River PLA technique
3.6. Interconnections
3.6.1. Low-power buffer insertion technique for the design of submicron interconnections with delay and surface constraints
3.6.2. Data encoding and decoding for low-power aided design of submicron interconnections
3.6.3. High-level synthesis of networks-on-chip subject to bandwidth, surface and power consumption constraints
3.7. Conclusion

References

Index

Preface

This book concerns the computer-aided design (CAD) of integrated circuits and systems. It focuses on the presentation and study of techniques for developing heuristic- or metaheuristic-based algorithms for concrete problems encountered in this field (Chapter 3). Such algorithms are intended for intractable problems, since problems of polynomial time complexity can be solved with exact methods. It is therefore essential to recall (Chapter 1) the notion of computational complexity in order to distinguish between a problem of polynomial time complexity (whose solution involves the use of an exact method) and an intractable problem (whose solution involves the use of an approximate method). This chapter also deals with language classification and notions related to heuristics and metaheuristics.

Chapter 2 is dedicated to recalling several fundamental notions of the design of integrated circuits and systems relying on the standard design flow (Y-design flow or Gajski’s chart), which makes it possible to represent a given entity from a behavioral, structural and physical perspective. This is applicable at each level of abstraction (system, register transfer, module, cell). One section at the end of this chapter is dedicated to interconnections, which give rise to specific problems in the systems-on-chip built with recent semiconductor technologies.

The first two chapters thus prepare the reader for the last chapter (which is directly related to the title of this book). Chapter 3 is dedicated to the study of the double difficulty of developing either heuristic- or metaheuristic-based algorithms: the goal is to reach, in a reasonable CPU time, a quality solution (near-optimal or optimal) to an intractable problem. This issue is analyzed by means of concrete problems in the computer-aided design of integrated circuits and systems at various design levels. The chapter shows how to deal with the compromise between CPU time and solution quality thanks to techniques for reducing the (often very large) space of solutions, for moving from one solution to another in a reasoned rather than random manner (to avoid the problem of local optima), etc. In particular, problems dealing with time optimization subject to an energy consumption constraint (real-time systems) will be addressed, as well as energy consumption optimization problems subject to a time constraint (embedded systems). These two parameters (time and energy), which are often at odds with one another, may also have the same priority in the optimization of a multi-objective function.

This book has been written on the basis of our lecture notes/tutorial classes given at the Computer Science and Electronics Departments at the University Saad DAHLAB of Blida, Algeria. It is also a compilation of our research works conducted at the Centre de Développement des Technologies Avancées in Algiers. May God bless these efforts.

Ali MAHDOUM
February 2020

1 Basic Notions on Computational Complexity and Approximate Techniques

1.1. Computational complexity

1.1.1. Introduction

The execution time of a program depends on several factors:
– program data: for example, sorting about 10 numbers requires less execution time than sorting millions of numbers;
– quality of the code generated by the compiler: programming languages do not all produce code of the same quality (consider C, C++ and Java: if the application is not object-oriented and will be executed on a given platform (operating system, processor frequency, memory size, etc.), then it is better to implement it in C, which generates “lighter” code);
– computational complexity: this metric gives the developer an idea of the execution time of their program, independently of the programming language and platform (processor, memory capacity, etc.).

Computational complexity is therefore an important concept, all the more so as it tells the developer whether their program reaches its results in a reasonable CPU time, even when significant resources (processor frequency, memory capacity) are available. In other words, it may signal to the designer the need to modify their algorithm if it takes an “infinite” time to reach results.

CAD of Circuits and Integrated Systems, First Edition. Ali Mahdoum. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.

1.1.2. Big O notation

DEFINITION.– Computational complexity $T(n)$ is $O(f(n))$ if there are constants $c > 0$ and $n_0 \ge 0$ such that:

$$T(n) \le c \cdot f(n) \quad \forall n \ge n_0$$

EXAMPLE 1.1.– $T(n) = (n+1)^2$ is $O(n^2)$. Indeed, $T(n) \le 4n^2 \;\; \forall n \ge 1$ ($c = 4$; $n_0 = 1$).

We will subsequently see how the function $T(n)$, associated with a given algorithm, can be determined.

EXAMPLE 1.2.– Consider $T(n) = n^2 + n$. Is it possible to consider that $T(n)$ is $O(n)$? If yes, we would have:

$$n^2 + n \le c \cdot n \iff n^2 + n(1 - c) \le 0 \iff n(n + 1 - c) \le 0$$

Since $n \ge 0$, we would have $c \ge n + 1$. Hence $c$ is not a constant, and therefore $T(n)$ is not $O(n)$.

1.1.2.1. Sum

If $T_1(n)$ is $O(f(n))$ and $T_2(n)$ is $O(g(n))$, then $T_1(n) + T_2(n)$ is $O(\max(f(n), g(n)))$.

PROOF.–

$T_1(n)$ is $O(f(n))$ $\iff \exists\, c_1 > 0,\ n_1 \ge 0$: $T_1(n) \le c_1 f(n) \;\; \forall n \ge n_1$ $\Rightarrow T_1(n) \le c_1 \max(f(n), g(n)) \;\; \forall n \ge n_1$

$T_2(n)$ is $O(g(n))$ $\iff \exists\, c_2 > 0,\ n_2 \ge 0$: $T_2(n) \le c_2 g(n) \;\; \forall n \ge n_2$ $\Rightarrow T_2(n) \le c_2 \max(f(n), g(n)) \;\; \forall n \ge n_2$

$\Rightarrow T_1(n) + T_2(n) \le (c_1 + c_2) \max(f(n), g(n)) \;\; \forall n \ge \max(n_1, n_2)$

Generally speaking, if $T_i(n) \le c_i f_i(n) \;\; \forall n \ge n_i$; $i = 1, 2, \ldots, p$, then:

$$\sum_{i=1}^{p} T_i(n) \le \max_{1 \le i \le p} f_i(n) \cdot \sum_{i=1}^{p} c_i \quad \forall n \ge \max_{1 \le i \le p} n_i \;\;\Rightarrow\;\; \sum_{i=1}^{p} T_i(n) \text{ is } O\!\left(\max_{1 \le i \le p} f_i(n)\right)$$

NOTE.– Consider $T(n) = n^2 + n$. It can be said that $T(n)$ is:
– $O(n^2 + n)$, as $T(n) \le 1 \cdot (n^2 + n) \;\; \forall n \ge 0$; $C = \max(c_1, c_2) = 1$; $f_1(n) + f_2(n) = n^2 + n$;
– $O(n^2)$, as $T(n) \le 2n^2 \;\; \forall n \ge 0$; $C = c_1 + c_2 = 1 + 1 = 2$; $\max(n^2, n) = n^2$.

1.1.2.2. Product

If $T_1(n)$ is $O(f(n))$ and $T_2(n)$ is $O(g(n))$, then $T_1(n) \cdot T_2(n)$ is $O(f(n) \cdot g(n))$.

PROOF.–

$T_1(n)$ is $O(f(n))$ $\iff \exists\, c_1 > 0,\ n_1 \ge 0$: $T_1(n) \le c_1 f(n) \;\; \forall n \ge n_1$

$T_2(n)$ is $O(g(n))$ $\iff \exists\, c_2 > 0,\ n_2 \ge 0$: $T_2(n) \le c_2 g(n) \;\; \forall n \ge n_2$

$\Rightarrow T_1(n) \cdot T_2(n) \le (c_1 c_2)\, f(n)\, g(n) \;\; \forall n \ge \max(n_1, n_2)$ $\Rightarrow T_1(n) \cdot T_2(n)$ is $O(f(n) \cdot g(n))$

Generally speaking, if $T_i(n) \le c_i f_i(n) \;\; \forall n \ge n_i$; $i = 1, 2, \ldots, p$, then:

$$\prod_{i=1}^{p} T_i(n) \le \prod_{i=1}^{p} c_i \prod_{i=1}^{p} f_i(n) \quad \forall n \ge \max_{1 \le i \le p} n_i,$$

which means that

$$\prod_{i=1}^{p} T_i(n) \text{ is } O\!\left(\prod_{i=1}^{p} f_i(n)\right)$$

1.1.3. Ω notation

DEFINITION.– Computational complexity $T(n)$ is $\Omega(g(n))$ if there are constants $c > 0$ and $n_0 \ge 0$ such that:

$$T(n) \ge c \cdot g(n) \quad \forall n \ge n_0$$

EXAMPLE 1.3.– $T(n) = n^2 + n$. It can be verified that $T(n)$ is $\Omega(n)$, since $T(n) \ge n \;\; \forall n \ge 0$.

1.1.4. Calculation of T(n)

The O and Ω notations cannot be used unless T(n) is determined. This section shows how the function T(n), associated with a given algorithm, is determined.

EXAMPLE 1.4.–

Algorithm Bubble_Sort
{
    For i=1 up to n-1 Do
        For j=n down to i+1 Do
            If A[j-1] > A[j] Then
                { temp=A[j-1]; A[j-1]=A[j]; A[j]=temp; }
            End if
        Done
    Done
}

i=1: j varies from n to 2, hence (n-1) processes
i=2: j varies from n to 3, hence (n-2) processes
………………………………
i=n-1: j varies from n to n, hence only one process

Overall, the number of processes is:

$$S = (n-1) + (n-2) + \cdots + 1$$

Let us write S in a different manner:

$$S = 1 + 2 + \cdots + (n-1)$$

Adding the two expressions term by term, we obtain $2S = n + n + \cdots + n = n(n-1)$, hence $S = n(n-1)/2$.

For each of these $n(n-1)/2$ processes, one condition and at most three assignments must be executed. This execution is independent of n and therefore takes a constant CPU time d, which depends on the platform on which the algorithm is executed. Therefore, for the algorithm Bubble_Sort, we have:

$$T(n) = \frac{n(n-1)}{2}\, d$$
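The count of processes can also be checked experimentally. The following Python sketch is an illustration only: the function name `bubble_sort_count` and the 0-based indexing are choices made here, not part of the book's algorithm. It runs the same two nested loops as Bubble_Sort and counts how many times the inner test is executed, which should be n(n-1)/2 whatever the input.

```python
def bubble_sort_count(values):
    """Bubble sort as in Example 1.4; returns (sorted list, number of inner tests)."""
    a = list(values)              # work on a copy of the input
    n = len(a)
    tests = 0
    for i in range(1, n):         # i = 1 .. n-1, as in the pseudocode
        # The pseudocode loop j = n .. i+1 (1-based) becomes j = n-1 .. i (0-based).
        for j in range(n - 1, i - 1, -1):
            tests += 1            # one "process": a comparison and possibly a swap
            if a[j - 1] > a[j]:
                a[j - 1], a[j] = a[j], a[j - 1]
    return a, tests


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        data = list(range(n, 0, -1))          # reversed input of size n
        result, tests = bubble_sort_count(data)
        print(n, tests, n * (n - 1) // 2, result == sorted(data))
```

The number of tests does not depend on the content of the array, only on its size, which is why T(n) can be written as a function of n alone.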

EXAMPLE 1.5.– Factorial of a number

Factorial(n)
{
    If n ≤ 1 Then Fact=1;
    Else Fact = n * Factorial(n-1);
    End if
}

If n ≤ 1, the process is constant: d. Otherwise, there is a constant process (c) that concerns the multiplication, in addition to the process corresponding to the recursive call Factorial(n-1), of cost T(n-1). Hence:

$$T(n) = \begin{cases} d & \text{if } n \le 1 \\ c + T(n-1) & \text{otherwise} \end{cases}$$

Consider n > 1. We then have: T(n-1) = c + T(n-2) ⇒ T(n) = 2c + T(n-2)

By recurrence, we have: T(n) = i * c + T(n-i)

Consider n-i = 1, hence i = n-1. We then have: T(n) = (n-1) * c + T(1) = (n-1)c + d

It can be easily verified that T(n) is O(n). Indeed, we should then have (n-1)c + d ≤ k * n ∀ n ≥ n0; k > 0; n0 ≥ 0, which means n(k-c) ≥ d-c.

Given that multiplication requires more time than assignment, we have d < c. Taking k = c+1, the condition becomes n ≥ d-c, which holds for every n since d-c < 0. As the closed form (n-1)c + d is valid for n ≥ 1, we get T(n) ≤ (c+1) * n ∀ n ≥ 1.

EXERCISE 1.1.– Let us consider the Tower of Hanoi problem. There are three rods: A, B and C. There are N disks on rod A, arranged in descending order of size from bottom to top. The objective is to move the N disks from A to C using B (the size order must be preserved on each rod). Find the computational complexity of this problem.

As a first step, let us find the algorithm. For this purpose, the following reasoning is used. Let us assume we know how to correctly move (N-1) disks from one rod to another, using the third rod. The solution would then be the following:
– correctly move (N-1) disks from A to B using C;
– move the remaining disk (the largest one) from A to C;
– move the (N-1) disks from B to C, using A (we assume we know how to correctly move the N-1 disks from one rod to another).

We then obtain the following recursive algorithm:

Hanoi(A, B, C, N) // The order of arguments is important
{
    If N = 0 Then return; End if
    Hanoi(A, C, B, N-1); // The order of parameters is important
    Move the remaining disk from A to C;
    Hanoi(B, A, C, N-1); // The order of parameters is important
}

The computational complexity T(N) is then equal to:
– d1 if N = 0;
– T(N-1) + d + T(N-1) otherwise, where d is the CPU time required for moving the remaining disk from one rod to another.

We then have: T(N) = 2*T(N-1) + d for N ≠ 0

By recurrence, we get:

$T(N-1) = 2T(N-2) + d \Rightarrow T(N) = 4T(N-2) + 3d$

$T(N-2) = 2T(N-3) + d \Rightarrow T(N) = 8T(N-3) + 7d$

$T(N-3) = 2T(N-4) + d \Rightarrow T(N) = 16T(N-4) + 15d$

………………………………

$T(N-i) = 2T(N-i-1) + d \Rightarrow T(N) = 2^{i+1}\, T(N-i-1) + (2^{i+1} - 1)\, d$

With $i = N-2$, and since $T(1) = 2T(0) + d = 2d_1 + d$, we get:

$$T(N) = 2^{N-1}\, T(1) + (2^{N-1} - 1)\, d = 2^{N} d_1 + 2^{N-1} d + 2^{N-1} d - d = 2^{N}(d_1 + d) - d$$

Since $d > 0$, we have $T(N) \le (d_1 + d)\, 2^{N} \;\; \forall N \ge 1$ ⇒ $T(N)$ is $O(2^N)$.

EXERCISE 1.2.– Find the computational complexity of the following algorithm:

Begin
    S=0;
    For i=1 up to N by step of 2 Do
        S = S + i!  // i! is the factorial of i
    Done
End

SOLUTION.– Let us recall that the execution time of the factorial of N is equal to (N-1)d + d1, where d is now the time of a multiplication and d1 the time of the assignment in the base case; d2 denotes the time of the addition S + i!.

i=1: d1+d2 (execution time of 1! + execution time of the sum)
i=3: 2d+d1+d2 (execution time of 3! + execution time of the sum)
i=5: 4d+d1+d2 (execution time of 5! + execution time of the sum)
…………………………………..
i=N: (N-1)d+d1+d2; N is odd

$$\Rightarrow T(N) = (d_1+d_2) + (2d+d_1+d_2) + (4d+d_1+d_2) + \cdots + ((N-1)d+d_1+d_2) = (d_1+d_2)\left(1 + \frac{N-1}{2}\right) + d\,(2 + 4 + \cdots + (N-1))$$

Let $S_1$ denote the sum of the even numbers, written in both orders:

$$S_1 = 2 + 4 + \cdots + (N-1)$$

$$S_1 = (N-1) + (N-3) + \cdots + 2$$

$$\Rightarrow 2S_1 = (N+1)\left(1 + \frac{N-3}{2}\right) \Rightarrow S_1 = \frac{(N+1)(N-1)}{4}$$

$$\Rightarrow T(N) = (d_1+d_2)\,\frac{N+1}{2} + d\,\frac{(N+1)(N-1)}{4} = \frac{d}{4}N^2 + \frac{d_1+d_2}{2}N + \frac{d_1+d_2}{2} - \frac{d}{4}$$

Let us show that T(N) is $O(N^2)$, which requires $T(N) \le C N^2 \;\; \forall N \ge N_0$; $C > 0$; $N_0 \ge 0$:

$$\Rightarrow N^2\left(C - \frac{d}{4}\right) - N\,\frac{d_1+d_2}{2} - \frac{d_1+d_2}{2} + \frac{d}{4} \ge 0$$

$$\Delta = \frac{(d_1+d_2)^2}{4} - 4\left(C - \frac{d}{4}\right)\left(\frac{d}{4} - \frac{d_1+d_2}{2}\right)$$

It is worth noting that if $C = d/4$, the inequality would require $d/4 \ge N(d_1+d_2)/2 + (d_1+d_2)/2$, which cannot hold for every N (the value of N can be large, while d, d1 and d2 are constant).

Let us consider $C = 5d/4$. We should note that constants d, d1 and d2 are used for multiplication, assignment and addition. If $d = 2(d_1+d_2)$, then $d/4 = (d_1+d_2)/2$ and we get:

$$\Delta = \frac{(d_1+d_2)^2}{4} - 4d\left(\frac{d_1+d_2}{2} - \frac{d_1+d_2}{2}\right) = \frac{(d_1+d_2)^2}{4}$$

It can then be verified that the roots are:

$$N_1 = 0; \quad N_2 = \frac{d_1+d_2}{2 \cdot 2(d_1+d_2)} = \frac{1}{4}$$

Between these two roots, $C N^2 - T(N)$ is negative. Since the objective is to have $C N^2 - T(N) \ge 0$, only the values outside the interval $[N_1, N_2]$ should be considered. If we took $N_0 = N_1$, we would not have $C N^2 - T(N) \ge 0 \;\; \forall N \ge N_0$, since $C N^2 - T(N) \le 0$ in $[N_1, N_2]$. We should therefore have $N_0 \ge N_2 = 1/4$ and, as N is an integer, $N_0 = \lceil 1/4 \rceil = 1$. Since $C = 5d/4$ and $f(N) = N^2$, we get $T(N) \le \frac{5d}{4} N^2 \;\; \forall N \ge 1$, and therefore T(N) is $O(N^2)$.
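The closed forms of Exercises 1.1 and 1.2 can be cross-checked by evaluating the recurrences directly. The sketch below is only a numerical sanity check under a simple cost model: the function names and the sample values of d, d1 and d2 are assumptions made here. For the Tower of Hanoi, the base case T(0) = d1 is carried through the recurrence, which gives T(N) = 2^N (d1 + d) - d. For Exercise 1.2, the loop overhead is ignored, as in the solution.

```python
from fractions import Fraction


def hanoi_cost(n, d, d1):
    """T(N) of Exercise 1.1: T(0) = d1 and T(N) = 2*T(N-1) + d."""
    return d1 if n == 0 else 2 * hanoi_cost(n - 1, d, d1) + d


def factorial_cost(n, d, d1):
    """Cost of Factorial(n): d1 if n <= 1, else one multiplication plus the recursive call."""
    return d1 if n <= 1 else d + factorial_cost(n - 1, d, d1)


def loop_cost(n, d, d1, d2):
    """T(N) of Exercise 1.2 (N odd): each odd i costs factorial_cost(i) plus one addition d2."""
    return sum(factorial_cost(i, d, d1) + d2 for i in range(1, n + 1, 2))


def loop_closed_form(n, d, d1, d2):
    """N^2*d/4 + N*(d1+d2)/2 + (d1+d2)/2 - d/4, valid for odd N."""
    return (Fraction(n * n * d, 4) + Fraction(n * (d1 + d2), 2)
            + Fraction(d1 + d2, 2) - Fraction(d, 4))


if __name__ == "__main__":
    d, d1, d2 = 4, 1, 1                       # arbitrary values with d = 2*(d1 + d2)
    for n in range(1, 12):
        assert hanoi_cost(n, d, d1) == 2**n * (d1 + d) - d
    for n in range(1, 40, 2):                 # odd N only
        t = loop_cost(n, d, d1, d2)
        assert t == loop_closed_form(n, d, d1, d2)
        assert t <= Fraction(5 * d, 4) * n * n    # bound with C = 5d/4 and N0 = 1
    print("closed forms confirmed on the sampled values")
```

Such a check does not replace the proof, but it quickly reveals an arithmetic slip in a closed form.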

1.2. Language computability

The previous sections have shown how to determine the computational complexity of a given algorithm. This is important because, when the complexity is exponential, a large data size prevents the algorithm from being executed in a reasonable CPU time. Another algorithm should then be developed in order to avoid this difficulty, at the cost of getting a near-optimal solution. Certain problems can thus be optimally dealt with in polynomial time, while for others we have to accept an approximate solution, obtained in a limited execution time (minutes, hours, days, etc.). Therefore, the existing classes of problems and the way they are classified should be considered. We begin by examining the simplest class, which includes the polynomial time problems, namely those whose exact solution is obtained within polynomial time.

1.2.1. Turing machine and class P

A deterministic Turing machine involves:
– a finite set Γ of symbols, including a subset Σ of input symbols (Σ ⊂ Γ) and the blank symbol b ∈ Γ - Σ, which is a delimiter;
– a finite set of states Q, with q0 as the initial state and qY and qN as the final states;
– a transition function δ: (Q - {qY, qN}) × Γ → Q × Γ × {-1, +1}, where -1 (+1) means that the read/write head of the machine moves one position to the left (or, respectively, to the right) on the machine tape, which contains the Γ symbols.

EXAMPLE 1.6.– Let us consider Γ = {0, 1, a, b}; Σ = {0, 1, a}; Q = {q0, q1, qY, qN}.

Consider a Turing machine M with an infinite tape containing Γ symbols, accepting the language LM = {x ∈ Σ*: M accepts x} = {0, 1, a}*1, meaning that M accepts all the strings formed of symbols of Σ that end in 1. It can be verified that the following deterministic automaton makes it possible to reach state qY if the string is accepted, and state qN otherwise:

| State | 0 | 1 | a | b |
|---|---|---|---|---|
| q0 | (q0, 0, +1) | (q0, 1, +1) | (q0, a, +1) | (q1, b, -1) |
| q1 | (qN, b, -1) | (qY, b, -1) | (qN, b, -1) | (qN, b, -1) |

– (q0, 0, +1): when the current state is q0 and the head is over a square containing symbol 0, symbol 0 is written in that square, the head moves one square to the right on the tape and the automaton remains in state q0;
– (qY, b, -1): string accepted by the language;
– (qN, b, -1): string not accepted by the language.

The above-described automaton can be assimilated to a deterministic polynomial time program accepting the language L = {0, 1, a}*1.

Class P can be defined as a set of languages, each of which has a corresponding deterministic polynomial time program recognizing the considered language:

P = {L: there is a deterministic polynomial time program M for which L = LM}

EXERCISE 1.3.– Let us consider language L defined by the following automaton:

| State | 0 | 1 | b |
|---|---|---|---|
| q0 | (qN, 0, +1) | (q1, 1, +1) | (q0, b, +1) |
| q1 | (qN, 0, +1) | (qY, 1, +1) | (qN, b, -1) |

where q0 is the initial state, q1 is an intermediate state, and qY and qN are the final states “YES” and “NO” respectively. 0, 1 and b are symbols belonging to the set Γ, which includes the subset Σ = {0, 1}. What are the strings of Σ* accepted by this language?

Obvious solution: reading 1 in q0 leads to q1, and reading 1 again in q1 leads to qY. The strings accepted by the language are 11{0, 1}*, hence all the strings starting with 11, followed by b (nothing more) or by any combination of 0 and 1.
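The two transition tables (Example 1.6 and Exercise 1.3) can be executed with a few lines of code. The following Python sketch is a minimal simulator written for illustration: the dictionary encoding of δ, the names `run_turing_machine`, `EXAMPLE_1_6` and `EXERCISE_1_3`, and the step limit are assumptions of this sketch. The step limit is needed because the machine of Exercise 1.3 keeps moving right forever on an empty tape (q0 loops on b).

```python
def run_turing_machine(delta, word, start="q0", blank="b", max_steps=10_000):
    """Run a deterministic Turing machine; return True for qY and False for qN."""
    tape = {i: s for i, s in enumerate(word)}   # unwritten squares hold the blank
    state, head = start, 0
    for _ in range(max_steps):
        if state in ("qY", "qN"):
            return state == "qY"
        symbol = tape.get(head, blank)
        state, written, move = delta[(state, symbol)]
        tape[head] = written
        head += move
    raise RuntimeError("step limit reached without entering a final state")


# Table of Example 1.6: accepts the strings over {0, 1, a} that end in 1.
EXAMPLE_1_6 = {
    ("q0", "0"): ("q0", "0", +1), ("q0", "1"): ("q0", "1", +1),
    ("q0", "a"): ("q0", "a", +1), ("q0", "b"): ("q1", "b", -1),
    ("q1", "0"): ("qN", "b", -1), ("q1", "1"): ("qY", "b", -1),
    ("q1", "a"): ("qN", "b", -1), ("q1", "b"): ("qN", "b", -1),
}

# Table of Exercise 1.3: accepts the strings over {0, 1} that start with 11.
EXERCISE_1_3 = {
    ("q0", "0"): ("qN", "0", +1), ("q0", "1"): ("q1", "1", +1),
    ("q0", "b"): ("q0", "b", +1),
    ("q1", "0"): ("qN", "0", +1), ("q1", "1"): ("qY", "1", +1),
    ("q1", "b"): ("qN", "b", -1),
}

if __name__ == "__main__":
    for word in ("0a1", "10", "1", ""):
        print(repr(word), run_turing_machine(EXAMPLE_1_6, word))
    for word in ("11", "1101", "10", "1"):
        print(repr(word), run_turing_machine(EXERCISE_1_3, word))
```

For Example 1.6, the machine scans the whole input before deciding, so the number of steps grows linearly with the length of the string, which is consistent with a polynomial time program.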

1.2.2. Non-deterministic algorithm and class NP

Assume that a function choice(), which returns an arbitrarily chosen element of a set, is available, and that the goal is to find, within polynomial time, the answer (TRUE or FALSE) to a decision problem.

EXAMPLE 1.7.– Given N numbers, find the largest among them.

Algorithm MAX_NON_DETERMINISTIC (N)
{
    max=choice({a1, a2, …., aN});
    For i=1 up to N Do
        If ai > max Then
            { Answer = "FALSE"; exit; }
        End if
    Done
    Answer = "TRUE";
}

Note that this algorithm is:
– non-deterministic, because we do not know how the function choice() has defined max;
– in polynomial time (its computational complexity is O(N)).

The deterministic algorithm is as follows:

Algorithm MAX_DETERMINISTIC (N)
{
    max=a1;
    For i=2 up to N Do
        If ai > max Then max=ai; End if
    Done
}

This algorithm is also in polynomial time, its computational complexity being O(N). Note that, in general, deterministic and non-deterministic algorithms associated with the same problem do not necessarily have the same computational complexity. This is the case for the following problem, for example.

EXAMPLE 1.8.– (Satisfiability problem) Consider:
– X = {x1, x2, ….., xn}, a set of Boolean variables;
– E = C1 . C2 . ………… . Cm, a logical expression, where the dot represents the logical operator AND;
– Ci, a clause where Ci = u1 + u2 + ……… + uk; k = 1, 2, …., n; each ui is one of the variables, either complemented or not, and + represents the logical operator OR.

EXAMPLE 1.9.– E = (x1 + x2 + x3’) . (x1’ + x2) . x3

Let us consider the decision problem: is there an assignment of the variables xk; k = 1, 2, 3 to 0 or 1 such that E = 1?

Algorithm_SAT_NON_DETERMINISTIC(n, E)
{
    For i=1 up to n Do
        xi=choice({0, 1});
    Done
    If E(x1, x2, ….., xn)=1 Then answer=TRUE;
    Else answer=FALSE;
    End if
}

This (non-deterministic) algorithm is in polynomial time (of complexity O(n)). On the other hand, the deterministic algorithm, which tries every assignment, is not: its complexity is $O(2^n)$. Hence, for a given problem, deterministic and non-deterministic algorithms may have the same computational complexity (this is the case of the determination of the maximal number) or not (this is the case of the satisfiability problem). This leads us to a first classification of problems.

DEFINITION.– A problem Π belongs to the class NP if it can be solved within polynomial time by a non-deterministic algorithm.

From this definition, the two problems described above (maximum of N numbers and satisfiability) both belong to the class NP. The problem of the maximum of N numbers can also be solved in polynomial time by a deterministic algorithm. It therefore belongs to the class P (see the deterministic Turing machine). The satisfiability problem, in contrast, has no known deterministic polynomial time algorithm. Class P is thus included in class NP. Note that NP is an acronym of non-deterministic polynomial, and not of non-polynomial: the latter reading would be a contradiction, since the problem of the maximum of N numbers is polynomial and belongs to class NP. Figure 1.1 illustrates the relationship between classes P and NP.
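The contrast between checking one assignment and searching for one can be made concrete on the expression E of Example 1.9. The Python sketch below is an illustration under stated assumptions: the clause encoding (integer k for xk, -k for its complement) and the function names `evaluate` and `satisfying_assignments` are choices made here. `evaluate` plays the role of the final test of the non-deterministic algorithm, while `satisfying_assignments` is the deterministic exhaustive search over the 2^n assignments.

```python
from itertools import product

# E = (x1 + x2 + x3') . (x1' + x2) . x3 from Example 1.9.
# Encoding assumed for this sketch: literal k stands for xk, -k for its complement.
E = [[1, 2, -3], [-1, 2], [3]]


def evaluate(clauses, assignment):
    """Check in polynomial time whether the assignment (tuple of 0/1) gives E = 1."""
    return all(
        any(assignment[abs(lit) - 1] == (1 if lit > 0 else 0) for lit in clause)
        for clause in clauses
    )


def satisfying_assignments(clauses, n):
    """Deterministic exhaustive search: tries all 2**n assignments."""
    return [a for a in product((0, 1), repeat=n) if evaluate(clauses, a)]


if __name__ == "__main__":
    print(evaluate(E, (0, 1, 1)))          # one "guessed" assignment, checked quickly
    print(satisfying_assignments(E, 3))    # every assignment for which E = 1
```

Checking a single guess costs a time proportional to the size of E, whereas the exhaustive search doubles its work each time a variable is added.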


Figure 1.1. First view of the class of non-deterministic polynomial time (NP) problems

EXERCISE 1.4.– Write a non-deterministic algorithm for the problem of colored graphs, defined as follows:

G = (V, E); K ∈ N*; K ≤ |V|

Is there a function f: V → {1, 2, …, K} such that f(u) ≠ f(v) ∀ (u, v) ∈ E?

SOLUTION.– It is sufficient to note that the answer takes the “FALSE” value in the following two cases:
– the number of colors exceeds K;
– two nodes connected by an edge have the same color.

Algorithm_G-COL_NON_DETERMINISTIC(V, E, K)
{
    S = ∅; // S is the set of colors
    W = ∅; // W is the set of already colored nodes
    // G = (V, E) is the graph to be colored
    For each u ∈ V – W Do
    {
        c=choice(colors); // assign a color to a node that is not yet colored
        If c ∈ S Then
            If ∃ v ∈ W of color c and (u, v) ∈ E Then
                { Answer = "FALSE"; Exit(); } // Adjacent nodes having the same color
            End if
        Else
        {
            S = S ∪ {c};
            If |S| > K Then
                { Answer = "FALSE"; Exit(); } // Number of colors > K
            End if
        }
        End if
        W = W ∪ {u};
    }
    Done
    Answer = "TRUE";
}

1.2.3. NP-complete problems

The class NP, as depicted in Figure 1.1, is incomplete: besides class P, it includes other classes. Before presenting these other classes, let us define some further concepts concerning language computability.

Polynomial transformation

DEFINITION.– A polynomial transformation (denoted α) of a language $L_1 \subseteq \Sigma_1^*$ into a language $L_2 \subseteq \Sigma_2^*$ is a function $f: \Sigma_1^* \to \Sigma_2^*$ satisfying the following two conditions:
– there is a polynomial time program executing f on a deterministic Turing machine;
– $x \in L_1$ if and only if $f(x) \in L_2$, $\forall x \in \Sigma_1^*$.

EXAMPLE 1.10.– Given Σ1 = {0, 1}; L1 = (0,1)*00, which means that L1 is the set of strings constituted of symbols 0 and 1, and ending in 00.

Given f: x → x+1 (x being read as a binary number); L2 = (0,1)*01.
– 1 is added to a number within polynomial time (f is therefore polynomial);
– x ∈ L1 ⇒ x ends in 00 ⇒ x+1 ends in 01 ⇒ (x+1) ∈ L2, and conversely (x+1) ∈ L2 ⇒ x+1 ends in 01 ⇒ x ends in 00 ⇒ x ∈ L1.

f is therefore a polynomial transformation of L1 into L2 (L1 α L2).

For decision problems, the following conditions should be met:
– f must be computed within polynomial time;
– $\forall I \in D_{\Pi_1}$, $I \in Y_{\Pi_1}$ if and only if $f(I) \in Y_{\Pi_2}$,

where $D_{\Pi_1}$ is the set of instances of problem Π1 and $Y_{\Pi_i}$ is the set of instances accepted by language $L_i$; i = 1, 2.

EXAMPLE 1.11.– Problem Π1: Given a graph G1 = (V1, E1); m = |V1|. Does G1 contain a Hamiltonian circuit?


Problem Π2: Given a graph G2=(V2,E2) in which each arc (vi, vj) ∈ E2 is associated with a cost d(vi, vj) ∈ Z+, and given a cost B > 0, is there a cycle passing through all the nodes of V2 whose length is ≤ B?

Question: Is there a polynomial transformation of Π1 into Π2?

i) Determination of f: G2 has the same nodes as G1 and contains every pair (vi, vj). If (vi, vj) ∈ E1, then d(vi, vj)=1; else d(vi, vj)=2. Consider B=m. An instance of Π2 is thus built from an instance of Π1 within a polynomial time.

ii) Let us prove that there is a Hamiltonian circuit in G1 if and only if there is a cycle through all the nodes of G2 whose length is shorter than or equal to B.

Let CH be a Hamiltonian circuit of G1. Then CH contains all the nodes of V1 and, since its two ends are connected, it has m edges. As the successive nodes vi and vj of CH are connected in G1, they verify the relationship d(vi, vj)=1 in G2. Consequently, CH is a cycle of length B=m in G2.

Conversely, let us prove that if TVC (a traveling salesman tour) is a cycle of G2 of length ≤ B=m, then there must be a Hamiltonian circuit in G1. TVC has m=|V1| edges, each of cost at least 1, which means that d(vi, vj)=1 for any pair of successive nodes in TVC (otherwise, the length of the cycle would exceed m). Consequently, these successive nodes are connected in G1, and therefore this succession of nodes in TVC is a Hamiltonian circuit in G1.
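The construction of f in step i) can be sketched in a few lines. This is only an illustration under stated assumptions: the graph is taken as undirected, the names hc_to_tsp and has_tour_within are hypothetical, and the brute-force search merely stands in for a decision procedure for Π2 on very small instances (it is exponential).

```python
from itertools import permutations

def hc_to_tsp(nodes, edges):
    """Polynomial transformation f: instance of Π1 -> instance (d, B) of Π2."""
    edge_set = {frozenset(e) for e in edges}
    d = {}
    for i in nodes:
        for j in nodes:
            if i != j:
                # cost 1 for an edge of G1, cost 2 for a pair absent from G1
                d[(i, j)] = 1 if frozenset((i, j)) in edge_set else 2
    return d, len(nodes)  # B = m = |V1|

def has_tour_within(nodes, d, bound):
    """Brute-force answer to Π2 (illustration only, exponential time)."""
    first, *rest = nodes
    for perm in permutations(rest):
        tour = [first, *perm, first]
        if sum(d[(a, b)] for a, b in zip(tour, tour[1:])) <= bound:
            return True
    return False

# A square has a Hamiltonian circuit, a star does not.
square = ([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (3, 0)])
star = ([0, 1, 2, 3], [(0, 1), (0, 2), (0, 3)])
for name, (nodes, edges) in (("square", square), ("star", star)):
    d, B = hc_to_tsp(nodes, edges)
    print(name, has_tour_within(nodes, d, B))
```

Only hc_to_tsp is the transformation itself; its two nested loops make the polynomial cost of f visible.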


Therefore, there is a polynomial transformation of the Hamiltonian circuit problem into the traveling salesman problem, both taken as decision problems (and not optimization problems). The polynomial transformation α makes it possible to classify a language. Indeed:

LEMMA 1.1.– If L1 α L2, then L2 ∈ P ⇒ L1 ∈ P.

PROOF.– L1 α L2 ⇒
i) there is a polynomial time program that executes f (from L1 to L2) on a deterministic Turing machine;
ii) x ∈ L1 if and only if f(x) ∈ L2.
Whether f(x) belongs to L2 or not can be decided in polynomial time (as L2 ∈ P). Consequently, whether x belongs to L1 or not can also be decided in polynomial time.

LEMMA 1.2.– If L1 α L2 and if L2 α L3, then L1 α L3.

PROOF.–
L1 α L2 ⇒ there is a polynomial function f from L1 to L2;
L2 α L3 ⇒ there is a polynomial function g from L2 to L3;
⇒ there is a polynomial function g ∘ f from L1 to L3.
L1 α L2 ⇒ x ∈ L1 if and only if f(x) ∈ L2;
L2 α L3 ⇒ y ∈ L2 if and only if g(y) ∈ L3;
⇒ x ∈ L1 if and only if g(f(x)) ∈ L3.
Therefore, α is transitive.
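Lemma 1.2 can be checked on the languages of Example 1.10. The sketch below is an illustration under assumptions not in the original: strings are binary words of length at least 2, L3 = (0,1)*10 is a hypothetical third language, and g is taken to be the same "add 1" function as f.

```python
def add_one(x: str) -> str:
    """f of Example 1.10: binary increment, zero-padded to the input width."""
    return format(int(x, 2) + 1, "0{}b".format(len(x)))

def ends_with(suffix):
    """Membership test for the language (0,1)*suffix."""
    return lambda x: x.endswith(suffix)

in_L1, in_L2, in_L3 = ends_with("00"), ends_with("01"), ends_with("10")

f = add_one  # L1 α L2 (Example 1.10)
g = add_one  # L2 α L3 (assumed second transformation)

def compose(g, f):
    """g ∘ f: apply f first, then g."""
    return lambda x: g(f(x))

h = compose(g, f)  # candidate transformation of L1 into L3

for x in ("100", "101", "1100"):
    print(x, in_L1(x), h(x), in_L3(h(x)))
```

Composing two polynomial time functions stays polynomial, which is the first half of the proof; the printed membership values agree on both sides, which is the second half.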


DEFINITION.– A language L1 is NP-complete if:
– L1 ∈ NP;
– L2 α L1 ∀ L2 ∈ NP.

THEOREM.– If:
– P1 is NP-complete;
– P1 α P2;
– P2 ∈ NP;
then P2 is NP-complete.

PROOF.– P1 is NP-complete ⇒ P1 ∈ NP ∧ P3 α P1 ∀ P3 ∈ NP (by definition). According to the hypothesis, P1 α P2 ⇒ P3 α P2 ∀ P3 ∈ NP (α is transitive), and P2 ∈ NP ⇒ P2 is NP-complete.

Conjecture

For NP-complete problems, there is no algorithm giving the exact solution within polynomial time. This is just a conjecture. To settle it, we should either:
– find a polynomial time algorithm that yields an exact solution to an NP-complete problem (according to the previous theorem, this algorithm would solve ALL the NP-complete problems and then the class of NP-complete problems would be none other than class P);
– formally prove there can be no exact and polynomial time algorithm for such problems.


The author of an answer to either of these two questions would get a reward of 1 million dollars. Further details can be found at http://www.claymath.org, which lists 6 still unsolved mathematical problems. To this day, we conjecture that there is no exact and polynomial time algorithm for NP-complete problems, which leads us to the second view of class NP, as shown in Figure 1.2.

Figure 1.2. Second view of the class of non-deterministic polynomial time (NP) problems

Examples of NP-complete problems:

– The stable problem

Let us recall: given a graph G=(V, E) and an integer k ≤ |V|, the objective is to find out whether there is a subset V’ ⊆ V such that:
– V’ is a stable of G, meaning that the vertices of V’ are 2 by 2 non-adjacent in G;
– |V’| = k.

Let us prove that this problem is NP-complete. First of all, we should prove that it belongs to the class NP, that is, that it can be solved by a non-deterministic polynomial time algorithm:


ALGO_NON_DETERM_POLYNOM_STABLE (k, V, E)
{ V’ = ∅;
  For i = 1 up to k
  Do { u = choice(V – V’);
       V’ = V’ ∪ {u};
     }
  Done
  If V’ is a stable of G
  Then Answer = ″TRUE″;
  Else Answer = ″FALSE″;
  End if
}

This algorithm is indeed non-deterministic, as it is the function choice() that builds the stable in a manner that is beyond our grasp. The loop has complexity O(k), and checking that V’ is a stable requires at most k(k–1)/2 edge tests; the algorithm is therefore in polynomial time. The stable problem thus belongs to class NP.

Let us now prove that there is a polynomial transformation of the satisfiability problem (proven to be NP-complete by Cook) into the stable problem. Every occurrence of a variable uj in a clause Ci (see the previous section for a definition of the satisfiability problem) is associated with a node. To make it easier to build the graph G, the nodes of G are laid out as a matrix, whose rows correspond to variables and whose columns correspond to clauses:
– all the nodes corresponding to variables of the same clause are 2 by 2 connected (to make sure that the logical expression of a given clause takes the value 1, it is sufficient that one variable of this clause is true);


– two nodes representing the same variable are also connected if this variable is complemented in one clause, while it is not in the other clause.

EXAMPLE 1.12.– Given E = (x1 + x3) . (x1’ + x2’) . (x2 + x3’) . (x1 + x2 + x3), let us build G (E has three variables and four clauses):
Clause C1: x1 and x3 are respectively associated with V1 and V2;
Clause C2: x1’ and x2’ are respectively associated with V3 and V4;
Clause C3: x2 and x3’ are respectively associated with V5 and V6;
Clause C4: x1, x2 and x3 are respectively associated with V7, V8 and V9.
– The nodes associated with variables of the same clause are 2 by 2 connected.
– The nodes associated with the same variable figuring in several clauses are connected depending on whether this variable is complemented or not (e.g. V4 is connected with V5 and V8 because the variable x2 is complemented in clause C2, while it is not complemented in clauses C3 and C4. On the other hand, V5 and V8 are not connected because they are associated with variable x2, uncomplemented both in clause C3 and in clause C4).

We then obtain the following graph (Figure 1.3), deduced by a polynomial transformation from the logical equation E. This construction clearly shows that there is an assignment of 0 and 1 to the variables of E(x1, x2, x3) giving the latter the value 1 if and only if the graph obtained contains a stable whose cardinal is equal to the number of clauses. For this example, the stables of maximal cardinal 4 are:
– {V1, V4, V6, V7} corresponding respectively to x1, x2’, x3’ and x1: E(1, 0, 0) = 1;
– {V2, V3, V5, V8} corresponding respectively to x3, x1’, x2 and x2: E(0, 1, 1) = 1;


Figure 1.3. Graph obtained by polynomial transformation from a logical equation

– {V2, V3, V5, V9} corresponding respectively to x3, x1’, x2 and x3: E(0, 1, 1) = 1.

Let us note that the following graph (Figure 1.4) does not admit a stable of cardinal 4. As the logical expression F = (x1 + x3) . (x1’ + x2) . (x3’) . (x2’ + x3) (from which the graph is deduced) contains four clauses, it can be stated that no assignment of 0 and 1 to the variables of F gives it the value 1.

Figure 1.4. Graph obtained by polynomial transformation from the logical equation F
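The construction used for E and F can be reproduced with a short script. This is a sketch under assumptions: literals are written as strings such as "x1" and "x1'", a node is a (clause index, literal) pair instead of the labels V1, V2, …, and the function names are hypothetical. The stables are enumerated exhaustively, which is only feasible on such small graphs.

```python
from itertools import combinations

def sat_to_stable_graph(clauses):
    """Build the graph: one node per literal occurrence, edges as described above."""
    nodes = [(i, lit) for i, clause in enumerate(clauses) for lit in clause]

    def complementary(a, b):
        return a == b + "'" or b == a + "'"

    edges = set()
    for u, v in combinations(nodes, 2):
        # same clause, or same variable complemented in one clause only
        if u[0] == v[0] or complementary(u[1], v[1]):
            edges.add(frozenset((u, v)))
    return nodes, edges

def stables_of_size(nodes, edges, k):
    """All subsets of k nodes that are 2 by 2 non-adjacent (exhaustive search)."""
    return [s for s in combinations(nodes, k)
            if all(frozenset(p) not in edges for p in combinations(s, 2))]

E = [["x1", "x3"], ["x1'", "x2'"], ["x2", "x3'"], ["x1", "x2", "x3"]]
F = [["x1", "x3"], ["x1'", "x2"], ["x3'"], ["x2'", "x3"]]

for name, formula in (("E", E), ("F", F)):
    nodes, edges = sat_to_stable_graph(formula)
    found = stables_of_size(nodes, edges, len(formula))
    print(name, len(found), "stable(s) of cardinal", len(formula))
```

Building the graph costs a number of steps quadratic in the number of literal occurrences, which is what makes the transformation polynomial; only the search for the stables is expensive.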

It is worth noting that the matrix layout of nodes makes it easier to build the graph from the logical equation considered. It can be verified that the graph in Figure 1.5 is in fact the same as the one in Figure 1.3.

Figure 1.5. Graph obtained by polynomial transformation from the logical equation E

EXERCISE 1.5.– Given the following graph G=(V, E):

[Figure: graph with nodes V1, V2, V3, V4 and V5]

Find an instance F of PSAT associated with this graph. What is the relationship between the number of clauses of F and the maximal cardinal of the stable(s) of G? What is this cardinal? SOLUTION.– A possible instance associated with this graph is: F = (x1 + x2 + x3 + x4) . x4’

[Figure: the graph G, nodes V1 to V5]

The number of clauses of F is the maximal cardinal of the stable(s) of G if F can take the value 1. In this example, this number is equal to the maximal cardinal of stable {V1, V5}, for example, which is equal to 2. NOTE.– For example, F(1, 1, 1, 0) is 1 though {V1, V2, V3, V5} is not a stable: the nodes associated with variables of the same clause are 2 by 2 connected for the unique purpose of establishing a correspondence between the maximal cardinal of a stable and the number of clauses of a given logical function. Consequently, it is possible not to connect 2 by 2 the nodes associated with the variables of the same clause, but the stables should be built from nodes associated with variables of different clauses. Thus, the function F=(x1+x3) . ( x1’ + x2) . (x3’) . (x2’ + x3) – associated with the graph in Figure 1.4 – cannot take the value 1 for any combination of values of the logical variables. However, if the edges between the nodes associated with the variables of the same clause are eliminated, it is possible to find a stable of cardinal equal to 4 (number of clauses): {V1, V2, V4, V8}, for example. Obviously, when building stables from nodes associated with variables of different clauses, it is not possible to find among them one whose cardinal is 4, which is in agreement with the fact that F cannot be 1.


EXERCISE 1.6.– Let E1 and E2 be two instances of the satisfiability problem, having m1 and m2 clauses respectively. Our objective is to build a graph associated with F1 = E1 AND E2.
– Is F1 an instance of the satisfiability problem?
– If yes, let G be the graph associated with F1. What is the relationship between the number of clauses of F1 and the maximal cardinal of stables of G?
– Answer the same questions for F2 = E1 OR E2.
– Let us assume there are combinations of logical values that give E1 and E2 the value 1. Is it certain that F1 = E1 AND E2 can take the value 1?

SOLUTION.–
– As E1 and E2 are instances of PSAT and they are connected by the operator AND, F1 is also an instance of PSAT.
– According to the previously described construction of G, F1 takes the value 1 if and only if G accepts a stable of cardinal equal to (m1+m2).
– F2 is not an instance of PSAT because E1 and E2 are connected by the operator OR.
– F2 takes the value 1 if and only if G1 (or G2) accepts a stable of cardinal m1 (respectively, m2); Gi is the graph associated with Ei.
– If there are combinations of logical values giving E1 and E2 the value 1, F1 will not necessarily be 1. For example, E1 = x1 is 1 for x1=1. Similarly, E2 = x1’ is 1 for x1=0. Nevertheless, F1 = x1 . x1’ is 0.

1.2.4. NP-hard problems

Let us recall that polynomial time problems can be solved by a deterministic Turing machine. Those of class NP (and of class P) can be solved by a non-deterministic Turing machine. Before defining the NP-hard problems, let us now consider the oracle Turing machine, which involves a module (oracle) that executes a program P1.


DEFINITION.– A polynomial Turing reduction of problem P2 to problem P1 (denoted P2 αT P1) is a transformation that enables the polynomial-time resolution of P2 through the algorithm of P1, if the resolution of P1 is assumed to have computational complexity O(1). In other terms, this involves:
– Stage 1: transforming the data of P2 into data of P1;
– Stage 2: solving P1 by means of a module (oracle);
– Stage 3: transforming the results of P1 into results of P2.
Assuming the complexity of stage 2 is O(1), if stages 1 and 3 have polynomial time complexity, then it can be said that P2 αT P1.

DEFINITION.– A problem P2 is NP-hard (NP-difficult) if there is an NP-complete problem P1 such that P1 αT P2.

Based on these definitions, the following can be noted:
– an NP-hard problem (D) is at least as difficult as an NP-complete problem (C), as the resolution of C can be done through that of D;
– according to the literature, an NP-hard problem can belong or not to the class NP (see Figure 1.6; it is however ambiguous for a problem that does not belong to NP to be called NP-hard);
– since there is a polynomial transformation for any pair of NP-complete problems, all the NP-complete problems can be polynomially reduced to an NP-hard problem to which one of these NP-complete problems has a polynomial Turing reduction;
– if the decision problem associated with a given problem is NP-complete, then that problem is NP-hard;
– according to the conjecture P ≠ NP, an NP-hard problem cannot be solved exactly within polynomial time (NP-hard problems are at least as difficult as the most difficult problems of class NP).
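The three stages of a polynomial Turing reduction can be sketched on graph coloring. In this illustration, P2 is the decision problem "can G be colored with at most K colors?" and P1 is the optimization problem returning the smallest number of colors. The function names are hypothetical, and the exhaustive oracle is only a stand-in: in the definition above its cost is counted as O(1), whatever it really is.

```python
from itertools import product

def chromatic_number_oracle(nodes, edges):
    """Stand-in oracle for P1 (exhaustive search, exponential in practice)."""
    for k in range(1, len(nodes) + 1):
        for colors in product(range(k), repeat=len(nodes)):
            f = dict(zip(nodes, colors))
            if all(f[u] != f[v] for u, v in edges):
                return k
    return 0  # graph without nodes

def k_colorable(nodes, edges, K, oracle=chromatic_number_oracle):
    """Decision problem P2 solved through one call to the oracle of P1."""
    # Stage 1: transform the data of P2 (G, K) into data of P1 (G alone)
    instance = (nodes, edges)
    # Stage 2: solve P1 by means of the oracle
    smallest = oracle(*instance)
    # Stage 3: transform the result of P1 into a result of P2
    return smallest <= K

triangle = (["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")])
print(k_colorable(*triangle, K=2), k_colorable(*triangle, K=3))
```

Stages 1 and 3 are a tuple construction and one comparison, hence polynomial, so P2 αT P1 in the sense of the definition.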

Figure 1.6. Classification (according to the literature) of NP-hard problems (regions shown: class P, class NP, class of NP-complete problems, NP-hard problems)

Examples of NP-hard problems: – The search problem P2 of assigning 0 and 1 to the variables of a logical function giving it possibly the value 1 is NP-hard because the decision problem P1 associated with it is NP-complete. Indeed, if an assignment rendering the logical function true can be found (or not), we could answer YES (respectively NO) to the question: Is there an assignment of 0 and 1 to variables of E giving it the value 1? The answer to this question would be in polynomial time assuming that the search for the assignment has the complexity O(1). Therefore, P1 αT P2. It is worth noting that the computational complexity of P2 is not polynomial; therefore, it is not O(1). It is simply a manner of finding out whether the resolution of P1 (via P2) would take a polynomial time assuming the complexity of P2 is constant. – The traveling salesman problem P2 is NP-hard. Indeed, if we could find an optimal Hamiltonian circuit of a certain cost C, the associated decision problem P1 could be solved: Is there a cycle of length ≤ B in a graph G=(V,E)? If the Hamiltonian circuit does not exist, the answer to the decision problem is NO. Otherwise, the comparison between B and C leads


to answering YES or NO to problem P1. Obtaining YES or NO therefore takes a polynomial time assuming that the resolution of P2 is done in O(1). Consequently, P1 αT P2. – The search problem P2 of a stable of a certain cardinal in a graph is also NP-hard. The decision problem P1 associated with it verifies P1 αT P2. EXERCISE 1.7.– The objective is to build a certain number N of medical centers. A center is shared by cities separated by at most M km. Our purpose is to optimize the considered problem PN_CM (to find the exact or near-optimal value of N), and we assume the availability of an algorithm giving the optimal or near-optimal solution to a problem PX to be defined. – What is your proposal of problem PX in order to have a Turing transformation from PN_CM to PX? – Find all the details of this possible transformation taking into consideration the instances and the results of the two problems as well as the computational complexity – How would you classify (polynomial, NP-complete, NP-hard, etc.) the decision and optimization problems associated with PN_CM and PX? Justify your answer. Note that two cities separated by a distance of MORE than M km cannot share the same center (the assumption is that if this condition is not met, the workload of these centers would not be the same). SOLUTION.– – Associate one node with each city. Two nodes are connected by an edge if the distance between them exceeds M km. We could solve PN_CM by an algorithm solving the colored graphs problem (PX): minimizing the number of colors so that they color all the nodes, a given node is assigned one and only one color, and two arbitrary nodes connected by an edge do not have the same color. – The cities whose corresponding nodes have received the same color share the same center (each color is associated with a center).


– The transformation of data of PN_CM into data of PX (PGcol) is polynomial time; the heuristic or metaheuristic solving PGcol is polynomial time. The transformation of the results of PGcol into results of PN_CM is polynomial time ⇒ polynomial Turing transformation of PN_CM into PGcol.
– Classification: PGcol-Decision is NP-complete (see previous section). Since PGcol-Decision αT PGcol-Optimization ⇒ PGcol-Optimization is NP-hard (previously mentioned definition).

Let us consider PGcol-Decision α PN_CM-Decision:
– f: associate a city li with each node vi; the distance between li and lj exceeds M km iff (vi, vj) ∈ E; f is polynomial;
– I ∈ YPGcol-Decision iff f(I) ∈ YPN_CM-Decision:
  I ∈ YPGcol-Decision ⇒ Answer = ″TRUE″ (Nb_colors ≤ val) ⇒ NbCM ≤ val ⇒ Answer = ″TRUE″ for PN_CM-Decision;
  f(I) ∈ YPN_CM-Decision ⇒ NbCM ≤ val (Answer = ″TRUE″). By construction of f, Nb_colors ≤ val ⇒ Answer = ″TRUE″ for PGcol-Decision.
⇒ PGcol-Decision α PN_CM-Decision.

Furthermore, as PN_CM-Decision ∈ NP (it is easy to find a non-deterministic polynomial algorithm, similar to that of colored graphs, previously examined), PN_CM-Decision is then NP-complete (previously mentioned theorem).

Let us consider PN_CM-Optimization. It is easy to prove that PN_CM-Decision αT PN_CM-Optimization (the proof is similar to the previous one). Since PN_CM-Decision is NP-complete, PN_CM-Optimization is then NP-hard.

NOTE.– α and αT should not be confused.

1.2.5. NP-intermediate problems

Certain problems of class NP are not polynomial, NP-complete or NP-hard (see Figure 1.7). Informally speaking, they are more complicated than the polynomial problems and less complicated than the NP-complete problems.


In order to prove that a problem P1 is NP-intermediate, we must prove that: – P1 belongs to the class NP (it can be solved by a non-deterministic polynomial algorithm); – P1 cannot be solved in a deterministic manner within a polynomial time; – there is no polynomial transformation of an arbitrary NP-complete problem P2 into problem P1 (meaning that P1 is not NP-complete); – there is no problem P2 belonging to NP such that P2 αT P1 (meaning that P1 is not NP-hard).

Figure 1.7. NP intermediate problems

EXAMPLE 1.13.– Given two graphs G=(V, E) and G’=(V’, E’), are G and G’ isomorphic? Namely, is there a bijection f: V → V’ such that: (u, v) ∈ E if and only if (f(u), f(v)) ∈ E’?
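The question of Example 1.13 can be made concrete with an exhaustive check. This is a minimal sketch, assuming simple undirected graphs without loops; the name are_isomorphic is hypothetical, and trying all |V|! bijections is of course not a polynomial time method.

```python
from itertools import permutations

def are_isomorphic(V, E, V2, E2):
    """Try every bijection f: V -> V2 and test the edge condition."""
    E_set = {frozenset(e) for e in E}
    E2_set = {frozenset(e) for e in E2}
    if len(V) != len(V2) or len(E_set) != len(E2_set):
        return False
    for image in permutations(V2):
        f = dict(zip(V, image))
        # with equal edge counts, mapping every edge onto an edge is enough
        if all(frozenset((f[u], f[v])) in E2_set for u, v in E_set):
            return True
    return False

path = ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])
relabeled_path = (["p", "q", "r", "s"], [("q", "p"), ("s", "r"), ("q", "r")])
star = (["c", "x", "y", "z"], [("c", "x"), ("c", "y"), ("c", "z")])

print(are_isomorphic(*path, *relabeled_path))  # same shape, different labels
print(are_isomorphic(*path, *star))            # same sizes, different shape
```

Checking one given bijection is fast, which is why the problem is in NP; the difficulty lies in the number of candidate bijections.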


1.2.6. Co-NP problems

A class that is complementary to class NP is defined as follows:
Co-NP = {Πc: Π ∈ NP}, where Πc is the problem complementary to Π (Πc may belong to class NP or not)
      = {Σ* – L: L is a language defined on the alphabet Σ and L ∈ NP}
The relationship between NP and co-NP classes is as shown in Figure 1.8.

EXAMPLE 1.14.– Let us consider K ∈ N*. Are there integers m and n > 1 such that K = m * n? This problem belongs to NP: a proposed pair (m, n) can be checked by a single multiplication. Its complementary problem, which consists of stating whether a number is prime (there are no integers m and n > 1 such that K = m * n), belongs to the co-NP class. This pair of problems was long considered a candidate for the intermediate problems of class NP; since primality testing is now known to be polynomial (the AKS primality test), both problems are in fact in class P.

Figure 1.8. Relationship between NP and co-NP classes (regions shown: class P, class NP, co-NP class, class of NP-complete problems, class of co-NP-complete problems)


1.2.7. Class hierarchy

There are other classes containing more complicated problems. They are designated as follows:
– P class by $\Delta_1^P$;
– NP class by $\Sigma_1^P$;
– co-NP class by $\Pi_1^P$.

Generally speaking, these are known as $\Delta_k^P$, $\Sigma_k^P$ and $\Pi_k^P$ classes, k ≥ 1.

Figure 1.9 shows the relationship between classes of k and k+1 level.

Figure 1.9. Relationship between $\Delta_k^P$, $\Sigma_k^P$, $\Pi_k^P$, $\Delta_{k+1}^P$ and $\Sigma_{k+1}^P$ classes

NOTE.– This book aims to present, by means of concrete problems of VLSI circuits and systems design, how heuristic- and metaheuristic-based algorithms can be developed. We focus on listing several essential notions that help distinguish between polynomial and non-polynomial time problems, namely knowing if an exact or approximate method should be developed for the problem at hand. For further details on language computability, the interested reader is invited to refer to the excellent book indicated in the list of references (Garey and Johnson, 1979).


1.3. Heuristics and metaheuristics

1.3.1. Definitions

As already noted, it is important to determine the computational complexity of a problem. If this complexity is polynomial, then the natural choice is to develop an algorithm giving an exact solution. Otherwise, a heuristic- or metaheuristic-based method should be developed. The two methods can be distinguished by the fact that a heuristic is a specific method for a given problem; therefore, it cannot be applied, in principle, to another problem. On the other hand, a metaheuristic follows a general diagram (mold) that must be adapted to the problem at hand. In general, this diagram is applicable to various problems. An example is a genetic algorithm, which remains simply an enumeration of stages to be followed, nothing more. The content of these stages varies from one problem to another. The choice of developing a heuristic- or metaheuristic-based algorithm will be addressed further below. Both nevertheless represent a challenge and must address the following two issues:
– CPU time must be reasonable (this time can be fixed in seconds, hours, days, etc., depending on the considered problem; we note in passing that a CPU time of one day is reasonable if an interesting solution is reached when implementing a component whose service life is about 10 years or more);
– though approximate, the solution must nevertheless be interesting (its quality can be known by comparing it to other solutions considered interesting, using common test cases known as benchmarks, and even to the exact solution on test cases that enable the exact method to yield the result within a reasonable CPU time (see Figure 1.10)). Indeed, the latter type of comparison makes it possible to verify if the approximate method can be improved or not; in other terms, obtaining a sound approximate solution on one test case means nothing by itself. The question is: what happens for other test cases? On the other hand, obtaining a very poor approximate solution is an incentive to improve the developed method.
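The comparison of an approximate method with an exact one on a small test case can be sketched as follows. This is an illustration under assumptions: the heuristic is a simple greedy coloring (not a method from this book), the exact method is an exhaustive search, and the six-node bipartite benchmark is a hypothetical test case chosen because the greedy result depends on the order in which nodes are visited.

```python
from itertools import product

def greedy_coloring(order, adj):
    """Heuristic: give each node the smallest color unused by its neighbors."""
    color = {}
    for u in order:
        taken = {color[v] for v in adj[u] if v in color}
        c = 0
        while c in taken:
            c += 1
        color[u] = c
    return max(color.values()) + 1  # number of colors used

def exact_coloring(nodes, adj):
    """Exact method: exhaustive search, usable on small test cases only."""
    for k in range(1, len(nodes) + 1):
        for colors in product(range(k), repeat=len(nodes)):
            f = dict(zip(nodes, colors))
            if all(f[u] != f[v] for u in nodes for v in adj[u]):
                return k
    return 0

# Hypothetical benchmark: ai is connected to bj whenever i != j.
A, B = ["a1", "a2", "a3"], ["b1", "b2", "b3"]
adj = {n: set() for n in A + B}
for i, a in enumerate(A):
    for j, b in enumerate(B):
        if i != j:
            adj[a].add(b)
            adj[b].add(a)

print("exact:", exact_coloring(A + B, adj))
print("greedy, order a1 a2 a3 b1 b2 b3:", greedy_coloring(A + B, adj))
print("greedy, order a1 b1 a2 b2 a3 b3:",
      greedy_coloring(["a1", "b1", "a2", "b2", "a3", "b3"], adj))
```

The same heuristic matches the exact result with one visiting order and misses it with another, which is the kind of observation that shows where an approximate method can still be improved.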

Figure 1.10. Example of computational complexities of exact and approximate methods (CPU time of the exact method and of the approximate method; the test cases enabling the execution of the exact method within a reasonable CPU time are marked)

1.3.2. Graph theory

Many optimization problems are related to graph theory. Though not exhaustive, the following list provides such examples:
– the shortest path between a node and all the other nodes in a graph can be found in polynomial time (Dijkstra's algorithm, which can be used, for example, for information routing from an emitter to a receiver);
– a minimum spanning tree can be found in polynomial time (Prim’s algorithm, which can be used, for example, for the optimal design of a communication network);
– colored graph problem: one of the main characteristics of NP-complete problems is the existence of a polynomial transformation for any pair of these problems. In other terms, the algorithm for the resolution of the (NP-hard) problem associated with a given (NP-complete) decision problem is valid for the resolution of ALL the NP-hard problems associated with other NP-complete decision problems. It is therefore important to develop an efficient method in order to obtain a near-optimal or even optimal solution in a reasonable CPU time, and this for ALL THESE NP-hard problems. Note that the (decision) problem of colored graphs, defined below, is proven NP-complete (Garey and Johnson, 1979): G=(V,E); K ∈ N*; K ≤ |V|


Is there a function f: V → {1, 2, …, K} such that f(u) ≠ f(v) ∀ (u, v) ∈ E?

The associated (NP-hard) optimization problem is to find the smallest value of K such that f(u) ≠ f(v) ∀ (u, v) ∈ E. Therefore developing a proper method for this problem would enable its use for solving ALL the NP-hard problems associated with other NP-complete (decision) problems, even when they do not relate to graph theory!
– the effective resolution of the (optimization) problem of the traveling salesman is therefore important for the resolution of many other problems.

The modeling of the (non-polynomial time) problem to be solved makes it possible to choose a heuristic or metaheuristic, whichever is better adapted to the problem at hand. In any case, there is no justification for saying that one or the other yields a better solution. This is still valid for metaheuristics, which will be considered further on. Reaching a proper solution depends on other criteria (choice of initial solution, progression from one solution to another, as well as other aspects that are addressed in Chapter 3 of this book) than the use of such or such metaheuristic. Only the application principle of these metaheuristics will be given here, as their actual uses depend on the considered problems.

1.3.3. Branch and bound technique

This technique involves going through a tree graph in order to reach an optimal or near-optimal solution. This obviously assumes the avoidance of an exhaustive exploration and making a proper choice at each stage. A formal definition of this technique is the following. Given a combinatorial optimization problem, find s* ∈ S (the set of solutions) such that:
$f(s^*) = \min_{s \in S} f(s)$ if it is a minimization problem;
$f(s^*) = \max_{s \in S} f(s)$ if it is a maximization problem.


It can be stated that:
– a subset S’ of S is known as branched into subsets S’1, S’2, …, S’k if S’i ⊂ S’ ∀ i = 1, 2, …, k and $\bigcup_{i \in \{1, 2, \ldots, k\}} S'_i = S'$;
– a subset S’ of S can be bound if a real g(S’) can be determined such that:
  – g(S’) ≤ f(s) ∀ s ∈ S’ for the minimization problem;
  – g(S’) ≥ f(s) ∀ s ∈ S’ for the maximization problem.

EXAMPLE 1.15.– Let us consider a problem whose optimal (but not feasible) solution is 20. Assume that after preliminary explorations, we have found the following (minorant) solutions:
– 23 (feasible solution);
– 24 (feasible solution);
– 25 (not feasible solution);
– 21 (not feasible solution).

It is clear that solutions 24 and 25 will be rejected because solution 23 is better (minimization problem). It is therefore useless to explore solutions issued from 24 and 25, as the best among them will be at best equal to 24 or 25 respectively (due to the exploration method used, 23, 24, 25 and 21 are minorants). Hence, we avoid exploring a large number of solutions, without affecting the quality of the solution found at the current stage. Such a subset of solutions, which cannot yield a feasible solution better than the one already found (23), is said to be sterile. 21 is not a feasible solution, but the exploration of the tree graph whose root is 21 may lead to finding a feasible solution equal to 22, therefore better than 23. In a deterministic approach, we know which parts are worth exploring.

In this example, function f corresponds to the assignment of tasks to workers. Assigning a given task to a given worker has a cost. It is then a matter of finding the assignment that minimizes the overall cost so that all the tasks are assigned and each worker is assigned a single task.
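The assignment example can be sketched in code. This is a minimal illustration, not the algorithm of the book: the bounding function g used here (cost already committed plus, for every remaining worker, the cheapest task still free) is one possible minorant chosen for the example, and the cost matrix is hypothetical. As in Example 1.15, the bound may correspond to a non-feasible solution, since two workers may be given the same cheapest task.

```python
import math

def assign_branch_and_bound(cost):
    """cost[w][t]: cost of assigning task t to worker w (square matrix)."""
    n = len(cost)
    best = {"value": math.inf, "assignment": None}

    def g(worker, used, partial):
        # minorant of every complete assignment extending the partial one
        return partial + sum(
            min(cost[w][t] for t in range(n) if t not in used)
            for w in range(worker, n))

    def explore(worker, used, partial, chosen):
        if worker == n:  # feasible solution reached
            if partial < best["value"]:
                best["value"], best["assignment"] = partial, list(chosen)
            return
        if g(worker, used, partial) >= best["value"]:
            return  # sterile subset: cannot beat the best known solution
        free = sorted((t for t in range(n) if t not in used),
                      key=lambda t: cost[worker][t])
        for t in free:  # branch on the task given to this worker
            used.add(t)
            chosen.append(t)
            explore(worker + 1, used, partial + cost[worker][t], chosen)
            chosen.pop()
            used.remove(t)

    explore(0, set(), 0, [])
    return best["value"], best["assignment"]

costs = [[9, 2, 7, 8],
         [6, 4, 3, 7],
         [5, 8, 1, 8],
         [7, 6, 9, 4]]
print(assign_branch_and_bound(costs))
```

On this matrix the bound at the root is 10 (the sum of the row minima), a value that no feasible assignment reaches, while the best feasible assignment costs more; the gap between the two plays the role of the values 20 and 23 in the example.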


The function g is the strategy used for the exploration of solutions, and it involves the determination of minorants at each stage. Though the “branch and bound” technique does not always yield the exact solution within polynomial time, this example is interesting to the extent that it shows how an exploration, based on logical and deterministic reasoning, can yield a feasible optimal solution in polynomial time, for an intractable problem! Sakarovitch (1983) presents the general algorithm of this technique, but as already mentioned, this algorithm is only a general framework and under no circumstances could it be applied as such to any optimization problem. The entire strategy resides in developing the function g (in this example, effectively determining the minorants), which cannot be the same for all the problems. Moreover, the definition of these metaheuristics is beyond the scope of this book (many other books, as well as online resources, do a great job in this sense). On the other hand, reviewing them will help clarify the context when applying them to problems of integrated circuits and systems design (see Chapter 3). As for the above-mentioned example, Chapter 3 will contribute to better approaching complex problems, avoiding certain useless explorations and finding inspiration for dealing with other problems. The branch and bound algorithm given in Sakarovitch (1983) is as follows:

Algorithm
{ If s* is undefined // s* is the best known solution
  Then v = +∞; (respectively –∞ for a maximization problem)
  End if
  Declare S non-sterile; // S is an element of F, a family of subsets of S covering it
  Decision(F, s*);
  While End is not reached
  Do { choose a non-sterile subset S’ of F;
       If S’ has not been bound
       Then Bound(S’);
       Else Branch(S’);
       End if
     }
  Done
}

Decision(F, s*)
{ If all the subsets of F are sterile
  Then { If s* is undefined
         Then the problem has no solution;
         Else s* is the optimal solution;
         End if
         Exit; // End of algorithm
       }
  End if
}

Bound(S’)
{ calculate g(S’);
  If g(S’) ≥ v (respectively ≤ v for a maximization problem)
  Then Declare S’ sterile;
  Else If the bound is exact
       Then { s* = s*(S’);
              v = f(s*(S’)) = g(S’);
              Declare S’ sterile;
            }
       End if
  End if
  Decision(F, s*);
}

Branch(S’)
{ Create the subsets S’1, S’2, …, S’k and declare them non-sterile;
  F = F ∪ {S’1, S’2, …, S’k} – {S’};
  Decision(F, s*);
}

1.3.4. Tabu search technique

For a minimization problem, the current solution s should be moved to a solution s’. The choice of s’ is such that:
$f(s') = \min_{s'' \in N(s)} f(s'')$, where N(s) is the neighborhood of s (the set of solutions neighboring s).

A method based on this single principle would have the drawback that if a local minimum s is at the bottom of a deep valley, it would be impossible to get out of it through a single iteration, and a displacement of solution s to another solution s’ ∈ N(s) such that f(s’) > f(s) may lead to the reverse displacement at the next iteration since s ∈ N(s’) and f(s) < f(s’). The latest visited solutions (tabu solutions) should be kept in mind and returning to


them should be forbidden for a fixed number of iterations. The purpose is to allow the algorithm enough time to get out of any valley containing a local minimum. If memory space is limited, a set of movements leading back to already visited solutions should be forbidden instead.

It is sometimes interesting to explore a new region neighboring s. An aspiration function is then used: when s’ (neighboring s) is part of the tabu solutions and moreover satisfies the aspiration (meaning that f(s’) < A(f(s))), the tabu status of this solution s’ is lifted and it becomes a candidate in the selection of the best neighbor of s. In general, A(f(s)) takes the value of the best solution s* encountered (the objective is to determine a better solution than s*).

For certain problems, the size of the neighborhood N(s) of the current solution s is large, and the only way to determine the solution s’ minimizing f over N(s) is to review the entire set N(s). The preferred approach is then to generate a subset N’ ⊆ N(s) that only contains a sample of solutions neighboring s and to choose the solution s’ ∈ N’ whose value f(s’) is the best.

In general, a maximum number of iterations between two upgrades of the best solution s* encountered is fixed. In certain cases, it is possible to find a lower bound f* of the objective function and the search can then be stopped when a solution s of value f(s) close to f* has been reached. The algorithm, such as that presented in Hertz et al. (1995), is as follows:

Algorithm
{ Choose an initial feasible solution s* = s ∈ Xa;
  Nb_iter = 0; /* current iteration */
  L = ∅;
  /* Initialize the aspiration function A */
  f* = –∞;
  m_iter = 0;
  /* Iterative process; a minimization problem is assumed */
  While (f(s) > f* and Nb_iter – m_iter < Nb_max)
  Do { Nb_iter++;
       Generate a set N’ ⊆ N(s) of solutions neighboring s;
       Choose the best solution s’ ∈ N’ such that f(s’) ≤ A(f(s)) or s’ ∉ L;
       Update the aspiration function A and the list L of tabu solutions;
       s = s’;
       If f(s) < f(s*)
       Then { s* = s;
              m_iter = Nb_iter;
            }
       End if
     }
  Done
}

NOTES.–
Xa is a set of feasible solutions;
f is the objective function;
N(s) is a neighborhood of a solution s ∈ Xa;
L is the set of tabu solutions;
Nb_max is the maximal number of iterations between two upgrades of s*;
N’ is a subset of N(s);
f* is a lower bound of the objective function.

1.3.5. Simulated annealing technique

The simulated annealing method draws its inspiration from thermodynamics. It is the result of an analogy with the physical phenomenon undergone by a melted body that is cooled and passes into solid

state. The analogy used by simulated annealing consists of treating the function f to be minimized as an energy function, while a solution S is seen as a given state of matter whose energy is f(S). The algorithm of this technique is as follows:

Algorithm
{
  X0 is an initial solution;
  T0 is an initial temperature;
  Tmin is a fixed minimal temperature;
  Max_iter is the fixed maximal number of iterations;
  Nb_iter = 0;
  While (Nb_iter < Max_iter)
  Do
  {
    X1 is a neighbor of X0;
    If f(X1) < f(X0)   // a minimization problem is assumed
    Then
      If T0 ≤ Tmin
      Then { X* = X1;   // X* is the best solution found
             Exit;      // the best solution is X*
           }
      Else T0 = T0 - ΔT;
      End if
    Else
      { Generate a random number N;   // 0 < N < 1
        If 0 < N < e^((f(X0) – f(X1)) / T0)
        Then X* = X1;
        End if
      }
    End if
    Nb_iter++;
  }
  Done
}

It is worth recalling once again that this algorithm is just a general framework whose solution quality depends on the choice of the initial solution and temperature, on how the neighbor of a solution is determined and on how the temperature is decremented. In particular, the choice of the neighbor of the current solution is essential in avoiding being trapped in a local minimum.

1.3.6. Genetic and evolutionary algorithms

These algorithms rely on the idea that, when a process is applied to an initial population, some individuals are more interesting (they better optimize the objective function) than others, which are considered weak and therefore destined to die, in the sense of being discarded (Holland, 1975). Starting from the best elements of the current stage, the objective is to generate other individuals likely to further improve the objective function. New individuals are generated by operations known as crossover and mutation. Therefore, at the end of a certain number of stages, the expectation is to find an interesting (near-optimal or even optimal) solution. The difference between genetic and evolutionary algorithms lies essentially in the encoding of the individuals. A genetic algorithm uses binary encoding (a chromosome is described by a string x ∈ Σ* = {0, 1}*), whereas in an evolutionary algorithm the individual can be described by any other representation (integer, real, etc.). The choice between the two types relies on the representation that is best adapted to the problem at hand. Before giving the general diagram of this algorithm, we should recall that the quality of the solution does not depend on which type of algorithm is applied. The characteristic of genetic and evolutionary algorithms is that they can be executed as such on a parallel architecture. Slight modifications nevertheless make the other algorithms suitable for parallel

execution (several initial solutions instead of only one; communicating, at each stage, the best solutions to the processes executing in parallel, for better decision-making in the following stage; etc.).

Algorithm
{
  Choose the size N of the population;
  Nb_gen is the fixed number of generations;
  Generate the initial population constituted of N individuals;
  For i = 2 up to Nb_gen
  Do
  {
    Evaluate each of the N individuals;   // Apply the objective function
    Eliminate the unfit individuals;      // Let m be the number of eliminated individuals
    Update X*, the fittest individual;
    Generate m individuals by crossover and mutation operations;
  }
  Done
  X* is the best solution;
}

NOTES.–
– In this common general diagram, the size of the population is the same for all generations. Nevertheless, it may vary if information is available that makes it possible to save CPU time, optimize better, etc.

– The crossover operation generates a new individual from two others (the new individual is composed of two parts, one originating from each of the two parents).
– The mutation operation changes certain characters from 0 to 1 (respectively from 1 to 0).

Descriptions of crossover and mutation operations can be found in many books; they are beyond the scope of the present one. Let us note once again that genetic algorithms are not better than the other techniques unless the initializations, crossovers and mutations are logically and carefully carried out. Below is a list of our comments on the application of genetic and evolutionary algorithms (Mahdoum, 2006):
– the size of the population, as well as the initial population itself, should not be randomly determined without an in-depth study of the problem at hand;
– individuals that, by definition of the problem at hand, cannot be an optimal solution should not be generated;
– crossover and mutation operations should not be randomly carried out, for the following reasons:
* certain individuals that have already been generated during previous generations may be regenerated (a waste of CPU time, which is better devoted to the study of new elements);
* the study of redundant individuals weighs down the processing and delays convergence towards better solutions, or may even prevent such solutions from being obtained;
* random crossover and mutation operations may generate solutions that render the process stationary (from a certain stage on, the generations of individuals are exactly the same as the previous ones), which would not enable the enhancement of the best current solution (CPU time is needlessly wasted, and increasing the number of generations would not improve the solution);
* random crossover and mutation operations could generate non-feasible solutions (e.g. TCBDACF… for the traveling salesman problem due to twice passing through the same city C);

* crossover and mutation operations should be properly carried out if a solution enhancement is expected (note the analogy with moving from one solution to another in the other techniques: tabu, simulated annealing, branch and bound, etc. to avoid the local optimum problem).

1.4. Conclusion

As a conclusion to this chapter, it is worth recalling that this book focuses on the study of techniques for the development of heuristic- or metaheuristic-based algorithms for concrete problems in the computer-aided design of integrated circuits and systems. Such algorithms are applicable to intractable problems, whereas polynomial-time problems can be solved using exact methods. The notion of computational complexity, which differentiates a problem of polynomial time complexity (which can be solved by an exact method) from a problem that is not (which can be solved by an approximate method), has been reviewed. Then, several conventional metaheuristics were reviewed, and we noted that obtaining a proper solution does not depend on which method is applied, but rather on other criteria (initializations, moving from one solution to another, etc.). As far as genetic and evolutionary algorithms are concerned, we have simply noted their advantage of being parallelizable as described, while nothing qualifies them as better than others. We have in particular underlined that random crossover and mutation operations could have very significant (negative) consequences for the quality of the solution, similar to the improper displacement from one solution to another carried out with other techniques (local optimum problem). Chapter 3 of this book presents techniques applied to concrete applications. It will be noted that avoiding the random aspect both at the initialization level and during the displacement from one solution to another leads to obtaining high-quality solutions. 
Similarly, for intractable problems, characterized by a very wide space of solutions, it will be seen that a proper analysis of the problem at hand can lead to a reduction of this space, without altering the quality of the solution, to a smaller-size space for better convergence towards high-quality solutions. However, before proceeding to Chapter 3, several fundamental notions on the design of digital circuits and systems are needed. This is the object of Chapter 2 of this book.
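
To make the tabu search framework of section 1.3.4 more concrete, here is a minimal sketch in Python. It is not the implementation of Hertz et al. (1995): the function name tabu_search, the parameters tenure, nb_max and f_lower, and the small one-dimensional landscape are all assumptions chosen for illustration. The whole neighborhood N(s) is examined instead of a sample N’, the list L holds the last few visited solutions, and the aspiration level A(f(s)) is taken to be f(s*).

```python
from collections import deque

def tabu_search(f, neighbors, s0, tenure=3, nb_max=10, f_lower=float("-inf")):
    """Tabu search for a minimization problem (illustrative sketch)."""
    s = best = s0
    tabu = deque([s0], maxlen=tenure)   # L: the last `tenure` visited solutions
    nb_iter = m_iter = 0                # m_iter: iteration of the last improvement of best
    while f(s) > f_lower and nb_iter - m_iter < nb_max:
        nb_iter += 1
        # Admissible neighbors: non-tabu ones, plus tabu ones meeting the aspiration criterion
        candidates = [n for n in neighbors(s) if n not in tabu or f(n) < f(best)]
        if not candidates:
            break                       # every neighbor is tabu: stop
        s = min(candidates, key=f)      # best admissible neighbor, even if worse than s
        tabu.append(s)
        if f(s) < f(best):
            best, m_iter = s, nb_iter
    return best

# Hypothetical landscape: index 2 is a local minimum, index 7 the global one.
landscape = [6, 4, 2, 3, 5, 4, 1, 0, 2, 3]

def f(i):
    return landscape[i]

def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]

best = tabu_search(f, neighbors, s0=0)
print(best, f(best))   # prints: 7 0
```

On this landscape, a plain descent started at index 0 would stop at index 2; forbidding the return to recently visited solutions is what lets the search climb out of that valley.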

2 Basic Notions on the Design of Digital Circuits and Systems

2.1. Introduction

While the first chapter covered basic notions on computational complexity, this chapter focuses on several aspects of the design of digital circuits and systems. It is our expectation that the reader will thus be better equipped for approaching the third chapter, which presents several optimization problems using notions presented in the first two chapters.

2.2. History of VLSI circuit design

2.2.1. Prediffused circuit

This involves a partial premanufacturing by the foundry, enabling an array of transistors to be embedded without connection (matrix). Circuit design is limited to the design of connection levels, and manufacture concerns only the interconnections (masks of metal levels and contacts). This is followed by finalizing the manufacturing of wafers and encapsulation.

2.2.2. Sea of gates

This is also a prediffused circuit, but in comparison to the first type, the fixed sites of gates and routing channels are suppressed. This enables a better integration density.

CAD of Circuits and Integrated Systems, First Edition. Ali Mahdoum. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.

A prediffused circuit has the advantage of a short manufacturing time, since several logical functions are already implemented on silicon and those required by the user application only need to be interconnected. This type of circuit unfortunately has several drawbacks: its performance may be limited (already sized transistors, etc.), the complexity of the intended circuit is limited by the size of the array, flexibility is low and wasted surface generates additional costs (functions that are not used by the application are nevertheless present on the silicon).

Figure 2.1. Prediffused circuit (gate array)

Figure 2.2. Prediffused circuit, before design. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

Figure 2.3. Prediffused circuit, after design. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

Figure 2.4. Sea of gates. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

2.2.3. Field-programmable gate array – FPGA

This is essentially composed of configurable logic blocks (CLBs), a network of horizontal and vertical interconnections and input/output ports.

Figure 2.5. FPGA circuit (figure labels: configurable logic block, I/O block, horizontal routing channel, vertical routing channel)

Using an FPGA presents several advantages: it can be reprogrammed (the same FPGA can be programmed for several applications), and it is a component that can be available in the workplace (office, practical training rooms, etc.), thus eliminating the need for a foundry. It is in fact suited to small series and to the development of prototypes. The drawbacks are essentially its limited uses (it cannot be used for chip cards, mobile phones, electronic watches, etc.), the limited scope for the designer to conduct optimization operations (fixed parameters such as transistor size, etc.) and a unit price that becomes uncompetitive for larger series.

2.2.4. Elementary pre-characterized circuit (standard cells)

The development of this circuit involves a library of pre-characterized cells (according to the specifications of a given technology). The cells are assembled in bands (rows of the same width) separated by routing channels of variable widths. As a main advantage, it offers better density and more flexibility than the prediffused circuit, but its drawbacks are the design cost and the time required for manufacturing the set of masks and wafers.

Figure 2.6. Example of pre-characterized circuit. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

2.2.5. Full-custom circuit

This approach is adopted to design a circuit fully optimized in terms of density and performance, thanks to a specific (logical, electrical and physical) design. Besides this optimization advantage, which suits this circuit to large series and strategic applications, it does however require solid design experience, quite substantial design time and the manufacture of the set of masks and wafers.

Figure 2.8 shows various approaches to circuit design and gives an overview of their use in a given context.

2.2.6. Silicon compilation

Current technologies enable the integration of all system components on the same chip (system on chip (SoC)), which leads to additional design problems (the design of single-chip systems requires multidisciplinary knowledge, from software to physical phenomena of submicron technologies: API development, real-time operating systems, drivers for varied and numerous peripheral equipment, design of mixed digital/analog circuits, RF circuits, solutions to electrothermal phenomena, etc.). Addressing these problems requires a new design methodology. It is first of all a top-down methodology that avoids problems due to the bottom-up approach, according to which a circuit is developed starting from its basic components. If the circuit thus developed does not fulfill the required functionality, time and money are wasted. The silicon compilation approach starts from the most abstract entity of the system to be designed (often a functional description of the application, in algorithmic form) and ends with the finer details (mask layout), passing through successive design levels. The passage from one level of abstraction to the level immediately below occurs only after validation of the entity obtained at the current design level (functional, electrical, optimal aspects, etc.). This approach aims to increase the probability of the system operating properly at low design levels, as an error occurring at such levels is more costly than at higher design levels. The current systems on chip are implemented in hardware as well as in software. This takes advantage of the two modes of implementation and meets criteria that will be explored further on. A block diagram of hardware–software co-design is shown in Figure 2.9. 
Such a design flow enables the implementation of systems by architectures such as those represented in Figure 2.10 (mixed architecture, bus-based and crossbars) and in Figure 2.11 (more recent architecture, based on networks-on-chip). Software and hardware integration is essentially conducted by drivers. Hardware design details are shown in Figure 2.12, while the details of software design can be found by the interested reader in Jerraya (2002) and Gauthier (2003). It is worth noting that synthesis, analysis and verification tasks are executed at each design level. Nevertheless, the CAD tools employed for each of these tasks differ from one abstraction level to another (e.g. the synthesis tools used at the system level are entirely different from those used at the cell level).

Figure 2.7. Example of a designed circuit using a cell library (figure labels: VHDL, scheme, simulation, placement, routing; library; user). For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

Figure 2.8. Context in which circuit design techniques are used (source: Laboratoire d’Informatique, de Robotique et de Microélectronique (Computer science, robotics and microelectronics laboratory) in Montpellier – LIRMM, France, via INTERNET). For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

– Synthesis is a CAD tool or set of tools enabling the transformation, at a given design level, of the behavioral description of an entity into a structural

description, or its structural description into a physical description. Sometimes, the synthesis transforms the entity representation into a more refined, optimized representation of the same type. Generally speaking, the synthesis takes into account optimization parameters, constraints, specifications, etc. – Analysis involves the determination or estimation of the entity characteristics, notably aspects of surface, speed and power consumption. This analysis enables the designer to decide whether to pass to a lower design level or to reconsider the entity design at the current abstraction level. This saves time and money spent with useless design at lower levels in case of non-compliance with specifications at the current level. – Verification relates to the entity functionality at the given design level. It may rely on tests, simulations or it may be formal. Formal verification is obviously more rigorous than tests, which cannot be exhaustive, but it is still undergoing research for complex systems. It is worth noting that analysis may sometimes serve as verification. A temporal analysis, for example, serves to analyze whether the signals are generated in accordance with the fixed temporal specifications and, at the same time, to verify if these signals are in accordance with the behavior of the considered entity.

Figure 2.9. Methodology of a system’s hardware–software co-design

Figure 2.10. Mixed architecture (bus and crossbars)

Figure 2.11. Network-on-chip-based architecture

2.3. System design level

2.3.1. Synthesis

The VHDL (then VHDL-AMS) language has been used for many years, like VERILOG, to describe systems of a certain degree of complexity for which the register transfer level (RTL) was highly convenient. Nevertheless, due to the increasing complexity of the systems to be designed, RTL was no longer sufficient and a higher level of abstraction was needed: the system level. This level required more elaborate description languages. One of these languages is SYSTEMC, an extension of C++ (an object-oriented language) featuring a communication protocol library that enables designers to define their own protocols. It also offers the possibility of using other predefined types related to signals, ports, communication protocols, etc. and of integrating a kernel for simulation. For further details on this language, the interested reader is invited to refer, for example, to

(https://www.accellera.org/downloads/standards/systemc), which provides much information on this language (in particular its description). Nevertheless, several aspects that distinguish this language from other languages are presented here. Due to these aspects, this language is essential for design at the system abstraction level. The designer can describe their system at this level in two successive stages:
– functional modules communicating through various abstract communication channels;
– durations expressed in absolute time units are assigned to processes, for modeling that enables performance analysis, trade-off evaluation, etc.

SYSTEMC essentially involves the following elements:
– Method: a C++ function that is a member of a class.
– Module: a structural entity that can contain processes, ports, channels and other modules (for a hierarchical structure).
– Signal: used for communications between the processes of the same module.
– Interface: defines a set of method declarations, but does not define any method implementation or data field.
– Channel: implements one or several interfaces and serves as a component for communications.
– Port: an object enabling a module to access a communication channel interface.
– Primitive channel: it is atomic, in the sense that it contains neither modules nor processes, and it cannot directly access other channels.
– Hierarchical channel: a module that can contain processes or other modules and can directly access other channels.
– Event: enables process activation or interruption.
– Sensitivity: defines the moment when a process is activated.
– Static sensitivity: statically declared during the elaboration phase (just before simulation) and cannot be modified once the simulation has started.

– Dynamic sensitivity: can be modified during simulation.
– Thread: non-preemptive sequence of execution (there is no built-in mechanism protecting shared data against corruption due to unsecured access; it is the user’s task to take this into account, using mutual exclusion, for example).
– wait(): a method that suspends the execution of a sequence. Its arguments determine the moment when the execution of the sequence may resume.
– SC_THREAD: a process of a module that has its own execution sequence and can call code that, in turn, calls wait(). SC_THREAD processes are automatically activated as soon as an event in the sensitivity list makes it possible.
– SC_METHOD: a process of a module that does not have its own execution sequence (all the operations described by it are executed in one go, from beginning to end) and cannot call wait().
– SC_CTHREAD: a process of a module that has its own execution sequence and whose sensitivity list contains only one event, a rising or falling clock edge. It can call wait(), but with a restricted list of arguments. SC_CTHREAD processes are automatically activated (as soon as the event in the sensitivity list makes it possible).

Besides these elements, SYSTEMC is characterized by transaction-level modeling (TLM) and relies on two levels to help designers partition their system into hardware and software:
– untimed functional level (UTF): system decomposition into functional modules communicating through abstract channels. The various processes are executed in a well-established order, without considering the time aspect;
– timed functional level (TF): at this level, a functional process is assigned an execution time for performance modeling, trade-off analysis for the hardware (HW)–software (SW) partition and for resource allocation. 
Two design levels (Figure 2.13) then follow: – bus cycle accurate (BCA) level: HW-SW and HW-HW communications between processes are accurately modeled in the form of cycles using communication protocols; – cycle accurate level (CA): HW processes are modeled at this level (which is in fact different from the system level, as indicated in Figure 2.12) from which various stages are conducted up to logic gate level.

It is worth noting that companies such as CADENCE, MENTOR GRAPHICS and SYNOPSYS had invested heavily in the development of tools for the computer-aided design of VLSI circuits using languages such as VHDL (VHDL-AMS) and VERILOG, and translating all these tools into SYSTEMC would involve additional costs and time. Translators from a SYSTEMC description into other languages have therefore been developed. Thus, the results of tasks accomplished at system level (in SYSTEMC) are transcribed into other languages in order to continue the execution of the design flow from the register transfer level (RTL) without modifying the already developed tools.

Figure 2.12. Block diagram of a silicon compiler (figure labels: behavioral, structural and physical representations; system: processors, operational parts, control parts, memories, bus, etc.; RTL (example, operational part): functional units, registers, interconnections, etc.; functional unit: cells; cell: transistors; tasks: synthesis, analysis, verification; floorplanning, layout generation, placement, routing)

Hardware–software partitioning is essentially conducted according to the following criteria:
– obtaining a system that achieves the sought compromise, such as performance versus power consumption;
– implementing in hardware the parts of the system that do not change over time (taking advantage of the speed of hardware) and in software those that will be frequently modified (to improve service quality, for example). Such partitioning also targets the speed of designing one product from another (design reuse) as well as the speed of introducing a product to the market (time to market);
– for optimization purposes, functional partitioning can also be applied to the hardware part (Vahid et al., 1992; Vahid et al., 1995).

Figure 2.13. Design levels treated in SYSTEMC, upstream of the register transfer level (RTL)

A possible hardware implementation of a VLSI system could be conducted using a single operational part and a single control part. The operational part would execute all the application tasks by means of functional units (arithmetic, logical and relational operations), registers and memories (storage of the values of the variables), interconnections (data transfer between the memory elements and the functional units), etc., under the command of the control part, which must ensure that tasks are executed according to a precise order or schedule. For a system of substantial complexity, such an implementation would generate a lot of problems. Indeed, the length of interconnections would impact data transfer times. This would also have a negative impact on the dynamic and static power consumption (the latter is no longer negligible in submicron and nanometer technologies). This high density would lead to an increase in the circuit temperature, generating in turn reliability problems and therefore a circuit malfunction. As a remedy, techniques for the functional decomposition of the system into subsystems are adopted according to certain criteria:
– disconnecting the power supply of the parts that do not operate during a given time interval, in order to reduce power consumption (dark silicon technique);
– assigning different supply voltages and frequencies by arranging the circuit parts into islands (or clusters): dynamic voltage and frequency scaling (DVFS) (Cortes et al., 2005; Mahdoum et al., 2006);
– decomposing the system into subsystems so that it meets the time and power consumption constraints (Mahdoum, 2012b).

As these techniques are essentially related to computer science and operations research, they are revisited in Chapter 3 of this book. Let us note that system decomposition into N VLSI subsystems requires the value of N to be justified and involves the assignment of highly intercommunicating tasks to the same subsystem (Figure 2.14). Hence, in each subsystem, interconnections are shorter and processing is often done locally. The global bus (a critical resource with high capacitance, which leads to significant power dissipation) will seldom be used for data exchanges between tasks assigned to different subsystems (Figure 2.15).
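
The assignment principle stated above (heavily intercommunicating tasks end up in the same subsystem, so that the global bus is seldom used) can be sketched with a simple greedy merge. This is only an illustration, not the method of Mahdoum (2012b), which also handles time and power consumption constraints: the communication weights, the cap max_size on subsystem size, the threshold min_weight and the function names are all hypothetical.

```python
def partition_tasks(tasks, comm, max_size, min_weight):
    """Greedily merge the clusters of heavily communicating tasks (union-find)."""
    parent = {t: t for t in tasks}
    size = {t: 1 for t in tasks}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]   # path halving
            t = parent[t]
        return t

    # Consider the heaviest communications first
    for (a, b), w in sorted(comm.items(), key=lambda kv: -kv[1]):
        ra, rb = find(a), find(b)
        if w >= min_weight and ra != rb and size[ra] + size[rb] <= max_size:
            parent[rb] = ra
            size[ra] += size[rb]

    clusters = {}
    for t in tasks:
        clusters.setdefault(find(t), []).append(t)
    return sorted(clusters.values())

def bus_traffic(clusters, comm):
    """Communication volume that must cross subsystem boundaries (global bus)."""
    where = {t: i for i, c in enumerate(clusters) for t in c}
    return sum(w for (a, b), w in comm.items() if where[a] != where[b])

# Hypothetical communication volumes between six tasks
tasks = ["t1", "t2", "t3", "t4", "t5", "t6"]
comm = {("t1", "t2"): 9, ("t2", "t3"): 8, ("t4", "t5"): 7,
        ("t5", "t6"): 6, ("t1", "t6"): 2, ("t3", "t4"): 1}

subsystems = partition_tasks(tasks, comm, max_size=3, min_weight=5)
print(subsystems)                     # [['t1', 't2', 't3'], ['t4', 't5', 't6']]
print(bus_traffic(subsystems, comm))  # 3 (out of a total volume of 33)
```

With these assumed weights, only the two weakest communications remain on the global bus.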

Figure 2.14. Task assignment to subsystems obtained by functional partitioning subject to time and power consumption constraints. Panels: a) graph of communicating tasks t1 to t12 (legend: strongly communicating tasks; averagely communicating tasks; poorly communicating tasks; non-communicating tasks assigned to the same component due to intense communications between t9 and t7, and between t9 and t8); b) graph to be colored; c) partitions of sub-systems: SS1 = {t1, t3, t12}, SS2 = {t2, t4, t10, t11}, SS3 = {t5, t6}, SS4 = {t7, t8, t9}

2.3.2. Floorplanning

Once the system has been decomposed as indicated in Figures 2.14 and 2.15, the subsystems that intercommunicate the most (though infrequently, since processing is mostly local to each subsystem) must be placed in the same neighborhood, to allow rapid data transfers between tasks assigned to different subsystems. Decomposition according to the DVFS technique requires space for voltage converters. Finally, subsystems can engage in asynchronous intercommunication. For this purpose, SYSTEMC enables the definition of an abstract protocol (using simple links) and then its refinement using the handshaking protocol and adapter modules between emitter and receiver, once functional verification has been validated with the abstract protocol.
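
The placement rule above (subsystems that exchange the most data should be neighbors) can be illustrated by a small exhaustive search. This is a sketch under stated assumptions, not a floorplanning tool: the 2 × 2 grid of slots, the traffic values and the function names are hypothetical, and the cost used here is simply the traffic-weighted Manhattan distance between subsystems.

```python
from itertools import permutations

def placement_cost(slots, traffic):
    """Sum of traffic-weighted Manhattan distances between placed subsystems."""
    return sum(w * (abs(slots[a][0] - slots[b][0]) + abs(slots[a][1] - slots[b][1]))
               for (a, b), w in traffic.items())

def best_placement(subsystems, grid, traffic):
    """Exhaustive search over all assignments; practical only for a handful of subsystems."""
    best = min(permutations(grid, len(subsystems)),
               key=lambda cells: placement_cost(dict(zip(subsystems, cells)), traffic))
    return dict(zip(subsystems, best))

# Hypothetical inter-subsystem traffic and a 2 x 2 grid of slots
subsystems = ["SS1", "SS2", "SS3", "SS4"]
grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
traffic = {("SS1", "SS2"): 5, ("SS3", "SS4"): 4, ("SS1", "SS4"): 1}

slots = best_placement(subsystems, grid, traffic)
print(placement_cost(slots, traffic))   # 10: every communicating pair is adjacent
```

For realistic numbers of subsystems, the factorial growth of this search is exactly the kind of intractability for which the heuristics of Chapter 1 are intended.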

Figure 2.15. Example of architecture corresponding to the distribution of tasks to subsystems obtained and indicated in Figure 2.14

2.3.3. Analysis

– Surface: As the synthesis of the VLSI system has been done at the system level, a certain number of processors, ASICs, memories, buses, etc. have been obtained, as shown in Figure 2.12. Each component generated at the system level will itself be synthesized at the design level immediately below (RTL). Nevertheless, this is not initiated unless the obtained structural entity meets the specifications, and the disposition of these components obtained by synthesis verifies the behavior of the system to be designed. At this abstraction level, where implementation details are not yet known (they will be known at lower levels), the surface can be estimated only as a function of the number of processors, ASICs, memories, etc. If, for example, the number of processors proves to be large, then it can be reduced (more tasks assigned to the same processor), but this would be done at the expense of performance (reduced parallelism).

– Speed: As for the surface, since the implementation details (such as transistor sizes, parasitic parameters, etc.) are not yet known, the DVFS (dynamic voltage and frequency scaling) technique, which at this level involves assigning voltages and frequencies to tasks, may give an estimate of speed (to be optimized while meeting the imposed energy constraint). One of the models employed is given by the following equation:

$$\tau_i = k \, \frac{V_i}{(V_i - V_{th})^{\alpha}} \, (M_i + O_i) \qquad [2.1]$$

where τi is the execution time of a task Ti, Vi is the supply voltage assigned to Ti, Vth is the threshold voltage of the transistor in the respective technological process, Mi (Oi) is the number of mandatory (optional) CPU cycles required to execute the mandatory operations (all or certain optional operations) of Ti, k is a constant that depends on the adopted technology and 1.4 ≤ α ≤ 2 is the velocity saturation index. Note that in the proposed model, all the operations of Ti may be mandatory (Ni = Mi + Oi). The additional time $\Delta\tau_{i,j} = p \, |V_i - V_j|$, where p is a constant, is due to the time required to switch the supply voltage Vi to voltage Vj. The estimated time is then $\sum_i \tau_i + \Delta\tau_{i,j}$. This problem relates to the optimization of an objective function subject to meeting constraints. This subject is revisited in Chapter 3 of this book. Finally, it is also possible to use a temporal simulator (timed functional level – Figure 2.13).
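
As a worked illustration of this time estimate, the sketch below evaluates the model in Python. It assumes the form τi = k·Vi/(Vi − Vth)^α·(Mi + Oi) for the execution time and a switching overhead proportional to |Vi − Vj|; the proportionality constant p, the function names and all numerical values are assumptions chosen to keep the arithmetic easy to follow, not data from a real technology.

```python
def task_time(v, m_cycles, o_cycles, k, v_th, alpha):
    """Execution time of one task: k * V / (V - Vth)**alpha * (M + O)."""
    if v <= v_th:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return k * v / (v - v_th) ** alpha * (m_cycles + o_cycles)

def switch_time(v_i, v_j, p):
    """Overhead for switching the supply voltage from v_i to v_j (assumed p * |v_i - v_j|)."""
    return p * abs(v_i - v_j)

def schedule_time(tasks, k, v_th, alpha, p):
    """tasks: list of (V_i, M_i, O_i) tuples in execution order."""
    total = sum(task_time(v, m, o, k, v_th, alpha) for v, m, o in tasks)
    total += sum(switch_time(a[0], b[0], p) for a, b in zip(tasks, tasks[1:]))
    return total

# Hypothetical values in arbitrary units: two tasks of 100 mandatory cycles each
tasks = [(1.5, 100, 0), (1.0, 100, 0)]
total = schedule_time(tasks, k=1.0, v_th=0.5, alpha=2, p=10)
print(total)   # 555.0 = 150 (task at 1.5 V) + 400 (task at 1.0 V) + 5 (voltage switch)
```

The example shows the trade-off that DVFS exploits: lowering the supply voltage from 1.5 to 1.0 makes the same 100 cycles take 400 time units instead of 150.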

– Power consumption: Using the same DVFS technique adopted at the system abstraction level, where the task is the manipulated entity, the following equations make it possible to estimate power consumption (to be optimized while meeting the imposed time constraint):

$$E_i = C_i \, V_i^{2} \, (M_i + O_i) \qquad [2.2]$$

where Ci is the effective transition capacitance (rise–fall), Vi is the supply voltage used to execute the task Ti, and Mi and Oi are as defined previously. Since not all the tasks are executed under the same supply voltage, we should also take into account (when optimizing the objective function under constraints) the energy dissipated when voltage switches from Vi to Vj:

, =

− , with Cr being the capacitance of the power rail. The consumed energy E is then estimated by: ∑ _ + ∆, . Let us note that the mean dissipated power is the product obtained from energy * mean frequency. It is also worth noting that the system performance and its energy consumption may have the same priority degree for their optimizations according to the PARETO criterion. 2.3.4. Verification SYSTEMC language integrates a kernel that makes it possible to use a behavioral simulator that obeys the following algorithm: 1) all the clock signals that must change their value at the current time are updated; 2) all SC_METHOD and SC_THREAD processes whose inputs have changed their values are run. Nevertheless, while all the operations of a SC_METHOD process are run, running a SC_THREAD process is interrupted when encountering wait(); 3) update, from a list, the outputs of all the SC_CTHREAD processes (without their execution) that are controlled by clock signals. These new values are arranged in a list in order to be better used later on at 5. There is

Basic Notions on the Design of Digital Circuits and Systems

67

also an update of all the outputs of SC_METHOD and SC_THREAD processes that have been executed at 2; 4) steps 2 and 3 are repeated until no input signal changes its value; 5) execution of all the SC_CTHREAD processes controlled by a clock whose outputs have been listed at the third step of the algorithm. The new output values are then updated at the next ascending (or descending) front of the clock signal that controls them at the third step of the algorithm; 6) simulation time is incremented at the next ascending (descending) front of the clock signal used for the simulation, and the algorithm resumes at the first step. The simulation is controlled by calling the function sc_start(). This is done by the routine sc_main(). Simulation time then depends on the argument of the function sc_start(). Simulation can be stopped by the function sc_stop(). The user knows the current time of the simulation using the function sc_simulation_time(). The simulation can be visualized using any graphic editor supporting the following files that are created by SYSTEMC: – VCD (Value Change Dump) files; – ASCII WIF (Waveform Intermediate Format) files; – ISDB (Integrated Signal Data Base) files. These files contain all the values of all the variables and signals that the designer wants to visualize, as they change during the simulation. For example, in order to generate a VCD file, the function sc_create_vcd_trace_file(name_of_file) must be used from routine sc_main(), once all the modules and signals have been defined. The function sc_close_vcd_trace_file(name_ of_file) makes it possible to close the concerned file. Once the structural entity obtained by the synthesis of the system is validated by the analysis and verification tasks, each constituent of this entity undergoes synthesis, analysis and verification tasks. This takes place at the design level immediately below, namely the register transfer level. 
Before approaching the register transfer level, it is worth noting that CADENCE, in collaboration with INTEL, has quite recently developed the Cynthesizer solution, which lets designers describe their system by means of very abstract SYSTEMC models, the detailed description in this language being produced by Cynthesizer. This is done as follows (Figure 2.16):

1) edit the abstract diagram of the system to be designed using the Intel CoFluent Studio tool;

2) rapid performance exploration → generation of the description in SYSTEMC;

3) execution of Cynthesizer (Cadence C-to-Silicon compiler) → µarchitecture;

4) execution of other design flow tools → effective RTL code;

5) execution of other Cadence tools → layout of the system to be designed.

Starting with the RTL code, CAD tools of other companies (MENTOR GRAPHICS, SYNOPSYS, etc.) may be used.

Figure 2.16. CADENCE high-level design flow. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip


2.4. Register transfer design level

2.4.1. Synthesis

The main constituents of the structural entity obtained at the system abstraction level are essentially processors, means of communication, operational and control parts and memories. As far as processors are concerned, it is not the designer’s task to develop them (as they are abundantly available on the market). Nevertheless, a synthesis tool is applied to determine the instances of processors (depending on their operating frequencies) to be used. This is also the case for the other constituents that are already in the library and do not need to be redesigned. Here, synthesis acts as a mapping tool that determines the combination of instances of all the concerned components that optimizes a given objective function while meeting certain constraints. It is worth noting that several instances of a given component can exist in the library. These instances all have the same functionality (that of the considered component) but differ in terms of their characteristics (surface, speed, power consumption). If a component does not already exist in the library, it should be developed by the execution of synthesis, analysis and verification tasks. As with processors, it is beyond the scope of this book to address the design of memories (which are in any case available on the market), but it is worth seeing how reorganizing the code of the application to be implemented can improve the memory access time in reading and writing. Since interconnections play an important role in submicron and nanometer technologies, they are dealt with separately. The following sections focus on the other types of components involved in RTL.

2.4.1.1. Operational part

Synthesis is essentially related to the scheduling of operations involved in the algorithm of the application to be implemented and to the allocation of physical resources.
Scheduling: The main scheduling techniques are as follows:

ASAP (as soon as possible): a given operation is scheduled as soon as possible, but neither before nor at the same time as an operation whose result it needs;

ALAP (as late as possible): a given operation is scheduled as late as possible, but neither after nor at the same time as an operation that depends on it.

The example in Figure 2.17 shows that the ASAP technique performs better than the ALAP technique (four time units for ASAP and five time units for ALAP, assuming that multiplication takes twice as much time as addition and subtraction) but requires a larger surface (two simultaneous multiplications with ASAP, and therefore two multipliers, compared to only one for ALAP). Obviously, for other cases, ALAP may perform better.

Figure 2.17. Example of scheduling of a DFG with ASAP and ALAP techniques
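The two techniques can be sketched on a small data flow graph. The graph below is hypothetical (it is not the one of Figure 2.17); it only keeps the assumption stated in the text that a multiplication takes two time units and an addition or subtraction takes one. ASAP places each operation right after its latest predecessor finishes; ALAP places it as late as its earliest successor (or the deadline) allows.

```python
from graphlib import TopologicalSorter  # Python 3.9+

LATENCY = {"mul": 2, "add": 1, "sub": 1}

# Hypothetical DFG: operation -> (kind, list of predecessor operations).
DFG = {
    "m1": ("mul", []),
    "m2": ("mul", []),
    "a1": ("add", ["m1", "m2"]),
    "s1": ("sub", ["a1"]),
    "a2": ("add", []),
}


def topological_order(dfg):
    """Operations ordered so that predecessors come first."""
    graph = {op: preds for op, (_, preds) in dfg.items()}
    return list(TopologicalSorter(graph).static_order())


def asap(dfg):
    """Earliest start time of every operation."""
    start = {}
    for op in topological_order(dfg):
        _, preds = dfg[op]
        start[op] = max((start[p] + LATENCY[dfg[p][0]] for p in preds), default=0)
    return start


def alap(dfg, deadline):
    """Latest start time of every operation that still meets the deadline."""
    succs = {op: [] for op in dfg}
    for op, (_, preds) in dfg.items():
        for p in preds:
            succs[p].append(op)
    start = {}
    for op in reversed(topological_order(dfg)):
        finish = min((start[s] for s in succs[op]), default=deadline)
        start[op] = finish - LATENCY[dfg[op][0]]
    return start


ASAP = asap(DFG)
LENGTH = max(ASAP[op] + LATENCY[DFG[op][0]] for op in DFG)
ALAP = alap(DFG, LENGTH)
MOBILITY = {op: ALAP[op] - ASAP[op] for op in DFG}
print(ASAP, ALAP, MOBILITY)
```

The difference between the two start times (the mobility) shows which operations can be moved to reduce the number of functional units used simultaneously.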

Besides the ASAP and ALAP techniques, there are other scheduling methods subject to constraints:

List scheduling: This involves minimizing the number of scheduling steps taking into account resource constraints. For example, if the number of multipliers should be below two, the ASAP scheduling in Figure 2.17 cannot be accepted.

Force-directed scheduling: In contrast, in this case, the number of resources should be minimized for a scheduling in a fixed number of steps.

The advantage of the above-described techniques is that they have a flexible formulation of the problem and apply to applications dominated by a data flow graph (DFG), such as signal processing. Nevertheless, they are not well adapted to applications described by a control data flow graph (CDFG), which contain if-then-else structures. Indeed, control structures are badly handled because the operations are partitioned into blocks connected by forks and joins. This leads to the creation of additional steps for the blocks (operations belonging to different blocks are scheduled in different steps even if they are independent). Hence, the two branches of a control structure are scheduled with the same number of steps, namely the maximum required by either branch (then or else part). Finally, mutual exclusion is limited to operations scheduled in different states, which involves improper use of physical resources (if there is, for example, one multiplication in the then part and one multiplication in the associated else branch, mutual exclusion would not be detected and two multipliers would be used instead of only one for this control structure).

On the other hand, transforming a CDFG into DFGs before scheduling, by finding all the possible execution traces (for example, an algorithm containing two un-nested if-then-else structures has four possible traces, depending on the data), enables each trace to be scheduled independently. If, for example, a then part contains only one operation while the else part contains 20, the number of steps would not be the same for the two branches, whereas scheduling conducted without determining the traces of the CDFG would lead to 20 steps both for the then part and for the else branch. This would not lead to high-performance scheduling, since the operation following the then operation would needlessly wait 19 steps to be executed. A further advantage of expanding the CDFG into DFGs is better use of physical resources (well-processed mutual exclusion). Nevertheless, Chapter 3 of this book shows that scheduling cannot be solved by a polynomial-time algorithm (intractable problem), which calls for an algorithm based on a heuristic or metaheuristic for scheduling subject to constraints (surface, speed and/or power consumption). Finally, let us note that scheduling step and execution step have different meanings. Indeed, two operations, one belonging to a then branch and the other to the else branch of the same if, can be scheduled in the same step but cannot be executed at the same time, as they are mutually exclusive (this is the case, for example, of operations 2 and 4 in Figure 2.18, where !X signifies not(X)).

Figure 2.18. Example of scheduling of the CDFG in Figure 2.19

1   x = a - b;
    if (x)                  // if x ≠ 0
    then { 2   y = a + c;
           3   z = e + f; }
    else { 4   y = a - c;
           5   z = y + b;
           6   t = z + d; }
    endif
7   v = u - w;              // note that 7 can be scheduled at the 1st step
    if (!v)                 // if v = 0
    then { 8   s = a + b;
           9   z = z - s;
           10  y = y + z; }
    else { 11  s = y + b;
           12  z = z + d;
           13  y = y - a; }
    endif

Figure 2.19. Example of CDFG
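The expansion of a CDFG into its execution traces can be sketched on the code of Figure 2.19. The representation used here (a list whose elements are either a straight-line block of operation numbers or a pair of then/else blocks) is an assumption made for the example, and data dependencies inside each trace are ignored: the sketch only enumerates which operations belong to each trace.

```python
from itertools import product

# Figure 2.19 as a sequence of segments: a list is a straight-line block,
# a tuple (then_block, else_block) is an un-nested if-then-else.
CDFG = [
    [1],
    ([2, 3], [4, 5, 6]),
    [7],
    ([8, 9, 10], [11, 12, 13]),
]


def traces(cdfg):
    """Enumerate every execution trace (one DFG per combination of branches)."""
    choices = [seg if isinstance(seg, tuple) else (seg,) for seg in cdfg]
    return [[op for block in combo for op in block] for combo in product(*choices)]


TRACES = traces(CDFG)
for t in TRACES:
    print(len(t), "operations:", t)
```

Two un-nested if-then-else structures yield four traces, each of which can then be scheduled on its own, so that a short branch is not padded to the length of the long one.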


Allocation of physical resources: The operational part is essentially composed of functional units (modules executing arithmetic, logical and relational operations), registers (for storing the values of various variables) and interconnections that enable data transfers between functional units and registers. The number of functional units of a given type is determined by the scheduling tool (for each DFG, it is the maximal number of operations of this type that are executed simultaneously; for the CDFG, the total number of units of a given type is then the maximal among those of units of the same type determined for each DFG). The instance to be used for this unit is determined by the mapping tool, which is a combinatorial optimization tool making it possible to find the combination of instances of functional units of the operational part optimizing an objective function, while meeting certain constraints: time, surface and/or power consumption. Note that scheduling yields a certain number of scheduling steps as well as the assignment of each operation to a given step, while mapping yields physical quantities (in s, cm2, mW) as the various instances in the library are characterized by their performances, surfaces and power consumptions, besides the instance to be used for each operation. As far as the register allocation is concerned, the objective is to determine the near-optimal number (as the problem considered is intractable – but this number may be optimal in certain cases) of registers enabling the storage of all the variables without conflict (not overwriting the value of a variable whose lifetime has not yet expired – meaning that it is or will still be used until the execution of a certain operation). Indeed, if the algorithm of the application to be implemented involves 1 million variables, then it will not be possible to use 1 million registers. 
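A minimal sketch of this register-sharing idea, assuming that the lifetime of each variable is given as an inclusive interval of scheduling steps (the variable names and lifetimes are hypothetical). Variables are visited by increasing start step, and each one receives the lowest-numbered register not held by a variable whose lifetime overlaps its own. This greedy rule is only one possible heuristic, not the method of the book.

```python
def overlap(a, b):
    """True if two inclusive lifetimes (first_step, last_step) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def allocate_registers(lifetimes):
    """Map each variable to a register index so that overlapping lifetimes never share one."""
    register = {}
    for name in sorted(lifetimes, key=lambda n: lifetimes[n][0]):
        busy = {register[other] for other in register
                if overlap(lifetimes[name], lifetimes[other])}
        r = 0
        while r in busy:  # lowest free register index
            r += 1
        register[name] = r
    return register


# Hypothetical lifetimes: variable -> (first step, last step), both inclusive.
LIFETIMES = {"a": (0, 2), "b": (1, 3), "c": (3, 5), "d": (4, 6), "e": (6, 7)}

REGISTERS = allocate_registers(LIFETIMES)
print(REGISTERS, "->", len(set(REGISTERS.values())), "registers")
```

Five variables fit in two registers here because at most two lifetimes overlap at any step.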
This problem could also be formulated as follows: find the minimal number of registers such that any two variables having overlapping lifetimes are assigned to different registers. This problem is exactly the following graph coloring problem:

$$G = (V, E), \qquad F : V \to \{1, 2, \dots, K\},\ K \in \mathbb{N}, \qquad F(u) \neq F(v) \quad \forall (u, v) \in E$$


Find the minimal value of K such that any two nodes connected by an edge have different colors. Note that 1, 2, …, K are fictitious colors whose number should be minimized. The approach is as follows:

– associate each variable with a node of the graph;

– connect by an edge two nodes whose corresponding variables have overlapping lifetimes;

– color the graph (using a graph coloring algorithm in order to minimize the number of colors);

– assign to the same register the variables whose associated nodes have received the same color (there will obviously be as many registers as colors).

The interconnections are dealt with at the end of this book, in the networks-on-chip section. Before approaching the synthesis of the control part, let us note that another synthesis task, contributing to optimization at the design level, involves placing the registers, as far as possible (the problem is still combinatorial), in the immediate neighborhood of the functional units that use them the most. This decreases the lengths of the interconnection segments used for the data transfers between the registers and the functional units that use them most frequently, thus anticipating the reduction of the capacitances, resistances and inductances at the interconnection level. This would positively impact both the performance of the considered operational part and its power consumption (obviously, a further effort for the determination of interconnection widths, on which the values of these electrical parameters depend, is required at another abstraction level, during the physical representation, i.e. the layout).

2.4.1.2. Control part

The control part can be implemented by a PLA (Programmable Logic Array), which has the advantage of being a regular structure. The constituents of this PLA are essentially transistors, interconnections and D flip-flops (for memorizing the state of the finite state machine). A possible synthesis of a
control part would be to minimize the number of product terms, thus minimizing the number of transistors and the height of the PLA, and therefore the length of vertical interconnections. One tool for such a synthesis is ESPRESSO (Rudell, 1985), which simplifies the logical functions jointly rather than separately (it avoids simplifying each logical equation independently of the others). This optimization problem is proved NP-hard and its resolution yields a near-optimal solution (optimal in certain cases). A further, equally interesting method, still based on the reduction of product terms, is that of Sasao (1984). Rather than the PLA inputs being logical variables and their complements, decoders are used. As an illustration of this method, let us consider the already simplified truth table shown in Table 2.1. A conventional implementation of this table by a PLA would also yield 11 product terms (Figure 2.20). On the other hand, using 2-bit decoders, the number of product terms is only nine (Figure 2.21). If other decoders are used properly, the number goes from nine to five (Figure 2.22). Finally, by selecting the output functions (the output itself or its complement), the number of product terms is only four (Figure 2.23).

X1   X2   X3   X4   F0   F1   F2
1    –    1    –    1    0    0
–    1    1    1    1    0    0
1    1    –    1    1    0    0
–    0    –    1    0    0    1
–    1    –    0    0    0    1
0    1    0    1    0    1    0
1    1    1    1    1    1    0
0    1    1    0    0    1    0
1    0    0    1    0    1    0
1    –    0    0    0    1    0
0    0    1    –    0    1    0

(– : input that does not appear in the product term)

Table 2.1. Example of simplified truth table (11 product terms)


Figure 2.20. PLA with 11 product terms

Figure 2.21. PLA reduced to nine product terms (implementation using 2-bit decoders)

Figure 2.22. PLA reduced to five product terms

Figure 2.23. PLA reduced to four product terms

In the above section, the PLA obtained enables the effective implementation of logical functions. This technique can also implement a control part. This involves states of a finite state machine that is formally defined as follows:


X = {x1, x2, …, x|X|} is the set of control signals;

Y = {y1, y2, …, y|Y|} is the set of machine states;

Z = {z1, z2, …, z|Z|} is the set of command signals;

δ: X × Y → Y is the function that determines the next state of the machine;

λ: X × Y → Z (respectively λ: Y → Z) is the function generating command signals for a Mealy (respectively Moore) machine.

Generating the control part directly from the corresponding finite state machine would lead to a control part with poor characteristics in terms of surface, speed and power consumption, due to an enormous number of transistors, long interconnection lines and many parasitic parameters. Certain methods, such as the one known as one-hot encoding, involve the minimization of the control part surface by reducing the number of lines, hence an implicit reduction of the number of transistors and interconnections, which directly improves speed and power consumption. Other more interesting techniques, including ours (Mahdoum, 2000; Mahdoum, 2002a), involve a much more significant reduction in the number of lines: given N states, the codification is not done on a length of log2 N bits, but on a potentially longer one. This relaxation of the length of state codes often leads to better performance. Nevertheless, in certain cases, the results are poorer compared to those generated by techniques such as one-hot encoding. This is due to the fact that the number of columns obtained may be much larger than the one induced by a codification on log2 N bits, which would yield a larger surface than the one produced by one-hot encoding. Though our technique operates over a length of codes that may differ from log2 N bits, it solves this problem. Even better, the code length may be below log2 N bits, while at the same time the number of lines is reduced. This is possible by assigning the same code to different states, without compromising the proper functionality of the finite state machine and therefore that of the control part. This characteristic is specific to our method, given that all the other methods yield a length of at least log2 N bits. Let us consider the following example:


Control signals   Current state   Next state   Command signals
00                s7              s1           10
00                s8              s1           10
00                s9              s4           11
00                s10             s4           11
01                s5              s2           11
01                s6              s2           01
01                s7              s8           11
01                s1              s8           00
10                s1              s4           00
10                s2              s4           00
10                s3              s4           00
10                s5              s7           00
10                s6              s7           00
11                s1              s5           01
11                s4              s5           01
11                s2              s5           01
11                s8              s2           10
11                s3              s1           11

The first line means: if the control signals are 00 and the machine is in state s7, then the new state is s1 and the command signals are 10. Let us group in the same (composed) state all the states that, for the same control signals, lead to the same new state and the same command signals. Then, we have:

Control signals   Current state(s)   Next state   Command signals
00                {s7,s8}            s1           10
00                {s9,s10}           s4           11
01                s5                 s2           11
01                s6                 s2           01
01                s7                 s8           11
01                s1                 s8           00
10                {s1,s2,s3}         s4           00
10                {s5,s6}            s7           00
11                {s1,s2,s4}         s5           01
11                s8                 s2           10
11                s3                 s1           11
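The grouping step can be sketched as follows, with the 18 transitions of the example entered as tuples (control signals, current state, next state, command signals). The function name and data layout are assumptions made for the illustration; the sketch only merges the rows that share the same control signals, next state and command signals, and does not perform the subsequent state coding.

```python
from collections import defaultdict

# (control signals, current state, next state, command signals)
TRANSITIONS = [
    ("00", "s7", "s1", "10"), ("00", "s8", "s1", "10"),
    ("00", "s9", "s4", "11"), ("00", "s10", "s4", "11"),
    ("01", "s5", "s2", "11"), ("01", "s6", "s2", "01"),
    ("01", "s7", "s8", "11"), ("01", "s1", "s8", "00"),
    ("10", "s1", "s4", "00"), ("10", "s2", "s4", "00"),
    ("10", "s3", "s4", "00"), ("10", "s5", "s7", "00"),
    ("10", "s6", "s7", "00"), ("11", "s1", "s5", "01"),
    ("11", "s4", "s5", "01"), ("11", "s2", "s5", "01"),
    ("11", "s8", "s2", "10"), ("11", "s3", "s1", "11"),
]


def group_states(transitions):
    """Merge rows with identical control signals, next state and command signals."""
    groups = defaultdict(list)
    for control, state, nxt, command in transitions:
        groups[(control, nxt, command)].append(state)
    return [(control, tuple(states), nxt, command)
            for (control, nxt, command), states in groups.items()]


GROUPED = group_states(TRANSITIONS)
for row in GROUPED:
    print(row)
```

The 18 original lines collapse into the 11 grouped lines listed above.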


This grouping leads to the minimal number of lines. The number of columns of the control part should then be reduced as much as possible. Before discussing this point, let us give the following definitions using an example.

– The simple states are: s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, E1 = {s7,s8}, E2 = {s9,s10} and E3 = {s1,s2,s4}.

– The composite states are: Ec1 = {s1,s2,s3} and Ec2 = {s5,s6}.

– The partitions are P1, P2, P3 and P4, and correspond to the control signals 00, 01, 10 and 11.

Note that E1 = {s7,s8} is in fact a simple state as there is no partition containing both s7 and s8. On the other hand, Ec1 = {s1,s2,s3} is a composite state because the partition P4 (11) contains both s1 and s2. Let c1 = 00, c2 = 01 and c3 = 11 be the respective codes of s1, s2 and s3. The code of E = {s1,s2,s3}, which is the intersection of c1, c2 and c3, is c = ** = {00,01,10,11}. Since the code c covers all the possible codes on 2 bits, it would not be possible to code another state (s4, for example) on 2 bits without introducing non-determinism if s4 were in the same partition as E (that is, if it were reached under the same control signals). In this case, the only possibility would be to increase the length of the code by 1 bit in order to eliminate this non-determinism. In a more general and complex case, a poorly chosen codification would lead to very long codes and consequently (at lower abstraction levels) to a control part that is inefficient in terms of surface, speed and power consumption. The mathematical formulation of this optimization problem is given in Chapter 3 of this book, dedicated to the presentation of heuristics for the aided design of integrated circuits and systems. We briefly note that our method has been compared to various other methods using 43 MCNC FSM benchmarks. This comparison has shown that:

– our method yields better results in 74% of cases;

– only our method can yield a length of codes below log2 N bits, N being the number of states of the finite state machine (using the following lemma), while preserving the proper behavior of the finite state machine.


LEMMA.– Two arbitrary states si and sj can receive the same code without any effect on the proper behavior of the control part if:

i) there is no partition Pk such that si ∈ Pk and sj ∈ Pk; or

ii) if such partitions Pk exist, si and sj are part of a composite state included in Pk.

PROOF.–

i) Let x = {xi1, xi2, …, xi|x|} be the set of control signals introduced in the control part concatenated with the code of si, and x’ = {xj1, xj2, …, xj|x|} that of the control signals concatenated with the code of sj. As si and sj do not belong to the same partition, we have x ≠ x’. Consequently, the new state of the machine and the command signals are determined in a deterministic manner.

ii) If, in an arbitrary partition Pk, there is a composite state Ec to which si and sj both belong, then si and sj induce the same new state and the same command signals. Because there is no other partition where si and sj induce different new states or different command signals, the control part cannot be non-deterministic.

2.4.1.3. Physical synthesis of PLAs

Due to the rapid development of semiconductor technology, the library of circuits must be constantly updated. Adopting a full-custom methodology would certainly lead to high-quality circuits but may not be suited to a short time to market. An alternative approach would involve the rapid (automated) generation of the physical structure (layout) of a circuit while offering an acceptable solution in terms of quality. The technique known as River PLAs (Mo et al., 2002) addresses this issue, in that a PLA is a regular structure which consequently generates fewer parasitic parameters than an irregular structure. In what follows, we describe the main stages of our own contribution to the application of this technique. Figure 2.24 represents the general operation of our tool. Starting from the initial description, the tool determines the number of hierarchical levels (logical depth) of the circuit.
This number corresponds to the number of PLAs to generate. Then, for each PLA, the tool generates the corresponding truth table (which will be simplified by ESPRESSO) and determines, using the length constraint imposed by the designer and the length of each PLA, the column to which the latter belongs (PLAs are arranged in matrix form). It is worth noting that the number of PLAs per column (of the matrix) depends on the constraint imposed by the designer and on their height. Then, for each PLA column, the tool determines the minimal number of vertical interconnections implementing the PLA input signals. Finally, using the basic cells (CIF – Caltech Intermediate Format – files) and the previously obtained results (truth tables, minimal number of interconnections), the tool outputs the layout of the circuit in CIF format (this format makes a layout portable across editing platforms such as CADENCE, MENTOR GRAPHICS and SYNOPSYS). Figure 2.25 shows an example of CIF description. The basic cells (Figure 2.26) have been designed with the IC_STATION tool of Mentor Graphics, which was used to draw their layouts. Then, the IC_LINK tool is used to convert these layouts into CIF format, which is handled by our tool. As an illustration, let us consider the logical circuit shown in Figure 2.27. Because this circuit involves three logical levels, it is implemented by three PLAs (Figure 2.28). The improvement that we brought to the River PLAs technique involves sharing one column between two or more signals of various PLAs (a column is reused once the signal it carried is no longer needed by the other PLAs). This aims to reduce the width of the circuit, hence its surface. Figure 2.29 shows the automatically generated layout of the ISCAS Add4bits circuit. Finally, let us note that this technique can also implement a control part.

[Block diagram: the basic cells (Cell 1 … Cell N) are produced with IC_STATION and IC_LINK; from circuit.dir and the designer's constraint, the tool derives Nb_PLAs, Nb_Col., the truth tables and Nb_Interconn.; the layout generation step outputs circuit.cif]

Figure 2.24. Block diagram of our use of the River PLAs technique


Figure 2.25. A basic cell and its corresponding CIF format. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

Figure 2.26. Layouts of some basic cells. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

Figure 2.27. ISCAS C17 circuit (the numbers are those of logical signals). For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip
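The logical depth that fixes the number of PLAs can be computed with a short sketch. The netlist below is the commonly published ISCAS'85 c17 description (six two-input NAND gates, primary inputs 1, 2, 3, 6 and 7); it is assumed here, rather than taken from the text, that its signal numbers match those of Figure 2.27. Each gate is placed one level above its deepest input, and the gates of level k would go into the k-th PLA.

```python
# c17 netlist as commonly published: gate output -> (input, input), all NAND gates.
C17 = {
    10: (1, 3),
    11: (3, 6),
    16: (2, 11),
    19: (11, 7),
    22: (10, 16),
    23: (16, 19),
}


def gate_levels(netlist):
    """Logic level of every gate; primary inputs are at level 0."""
    level = {}

    def depth(signal):
        if signal not in netlist:  # primary input
            return 0
        if signal not in level:
            level[signal] = 1 + max(depth(s) for s in netlist[signal])
        return level[signal]

    for gate in netlist:
        depth(gate)
    return level


LEVELS = gate_levels(C17)
NB_PLAS = max(LEVELS.values())
print(LEVELS, "->", NB_PLAS, "PLAs")
```

For this netlist the depth is three, in agreement with the three PLAs mentioned for this circuit.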


Figure 2.28. Left: each logical signal occupies a column. Right: certain logical signals share the same column (reduction of the circuit surface)

Considering the ISCAS Add4bits circuit, which contains seven logical levels, Table 2.2 indicates the reduction of the number of columns occupied by the interconnections carrying the logical signals for the three types of implementation:

– River PLAs without any column being shared by two or more signals;

– River PLAs I, in which one column is shared by two or more logical signals, with the PLAs arranged one after the other;

– River PLAs II, in which one column is shared by two or more logical signals, with the PLAs arranged in matrix form (layout subject to the constraint on the circuit height).

In this table, the last column gives the number of the column to which the considered PLA belongs (matrix arrangement). For this example, the height constraint imposed by the designer made it possible to place the PLAs in three columns for the River PLA II version. Consequently, this last technique offers the best optimization of the number of columns assigned to logical signals. Table 2.3 presents the same comparative study considering other ISCAS circuits.


Figure 2.29. Left: layout of an Add4bits circuit obtained with River PLA II. Right: the one obtained with River PLA I. The numbers refer to seven PLAs (logical depth of this circuit). For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

Num_PLA   No optimization   River PLA I   River PLA II   Num_Column
PLA 1     20                8             8              1
PLA 2     20                8             7              2
PLA 3     20                8             7              2
PLA 4     20                8             7              2
PLA 5     20                8             4              3
PLA 6     20                8             4              3
PLA 7     20                8             4              3

Table 2.2. Number of vertical interconnections for each method

The technique we have just presented enables the automated generation of the layout of a control part. As far as the operational part is concerned, if any of its functional units does not exist in the library, it is still possible to use River PLA II to generate its physical representation.


ISCAS circuit   Nb_Logic gates   Nb_PLAs   Constraint (λ)   No optim.   River PLA I   River PLA II   CPU time (s)
C17             6                3         500              9           5             4              2
Equ4bits        5                2         800              12          8             6              2
Mux4bits        8                3         500              14          9             6              2
Add4bits        17               7         1,000            20          8             6              3
Add8bits        37               15        3,000            44          16            10             9
Add16bits       77               31        5,000            92          32            23             22
Comp8bits       40               3         1,500            48          32            21             23
C432            160              17        12,000           190         66            60             75
C499            202              11        13,000           163         80            76             105

Table 2.3. Number of vertical interconnections for each method and for each ISCAS circuit

2.4.1.4. Memory synthesis

As semiconductor technology scales, processors with ever higher performance are launched on the market. As in any integrated system, processors cannot operate on their own, which is to a certain extent a drawback, because memory cannot keep up with their pace. Fortunately, certain modern memories feature an access mode (page mode) that can be used to improve the memory access time, an important parameter in the design of certain systems, such as embedded systems. A methodology developed by Kim et al. (2005) uses this page mode in the manner described below. Let us consider the following example as an illustration:

Instruction                      Memory accesses
C[i-4] = F + K                   Write C[i+1]
X = D[i+5] - F                   Read D[i+1]
C[i] = A[i] + A[i+1] + B[i+1]    Read A[i]; Read A[i+1]; Read B[i+1]; Write C[i]
D[i] = K                         Write D[i]
B[i+1] = E – N                   Write B[i+1]
G = C[i] + D[i]                  Read C[i]; Read D[i]
W = A[i] + F                     Read A[i]
M = B[i] – C[i]                  Read B[i]; Read C[i]
A[i+1] = A[i] * B[i]             Read A[i]; Read B[i]; Write A[i+1]

Figure 2.30 indicates the total number of cycles as well as the surface obtained using the memory module M4, whose characteristics are indicated in Table 2.4. NR, NW, PR and PW respectively denote a normal read, a normal write, a read in page mode and a write in page mode. A read (write) in page mode requires fewer cycles than a normal read (respectively write). An access is counted in page mode when the read (write) targets the table that was last accessed in the same memory module. Hence, in Figure 2.30, the reading of C[i] at the 6th instruction is done in page mode and not in normal mode. This is due to the fact that the last table accessed in the memory module storing C is precisely this same table (at the 4th instruction), even though another memory access (that of A at the 5th instruction, but in a memory module other than the one containing C) has been performed after the access to C in the 4th instruction. If, in a given instruction, memory accesses are performed sequentially in the same memory module, then they must all be counted (this is the case in the 3rd instruction). If, on the other hand, memory accesses are performed in different memory modules, then the maximum number of cycles must be considered (parallel access; this is the case of the 8th instruction, in which only one normal read must be counted instead of two). In Kim et al. (2005), the idea has been to maximize the number of memory accesses in page mode rather than in normal mode. This is possible in two different ways:

– properly assigning the tables to different memory modules (Figure 2.31 indicates an improvement in access time due to another assignment of tables);

– an effective rescheduling of instructions, without altering the initial behavior of the algorithm (Figure 2.32).

Memory module    Size (bits * words)    Surface (mm²)
M1               16 * 1,024             5.4154
M2               32 * 1,024             10.8309
M3               16 * 2,048             7.6586
M4               32 * 2,048             15.3171

Table 2.4. Example of module characteristics

Module M4 (first instance): tables A and B
Module M4 (second instance): tables C and D

No.  Instruction                      Memory accesses                                     Access modes
1    C[i-4] = F + K                   Write C[i+1]                                        NW
2    X = D[i+5] - F                   Read D[i+1]                                         NR
3    C[i] = A[i] + A[i+1] + B[i+1]    Read A[i]; Read A[i+1]; Read B[i+1]; Write C[i]     NR PR NR NW
4    C[i+1] = K                       Write C[i]                                          PW
5    A[i+1] = E - N                   Write A[i+1]                                        NW
6    G = C[i] + D[i]                  Read C[i]; Read D[i]                                PR NR
7    W = B[i] + F                     Read B[i]                                           NR
8    M = A[i] - C[i]                  Read A[i]; Read C[i]                                NR // NR = NR
9    B[i+1] = A[i] * B[i]             Read A[i]; Read B[i]; Write B[i+1]                  PR NR PW

Total: 7NR + 3NW + 3PR + 2PW = 7 * 5 + 3 * 8 + 3 * 2 + 2 * 3 = 71 cycles
Surface: Surface(M4) + Surface(M4) = 2 * 15.3171 = 30.6342 mm²

Figure 2.30. Assignment of tables to memory modules and its effect on the number of cycles and surface

Module M3: tables A and C
Module M4: tables B and D

No.  Instruction                      Memory accesses                                     Access modes
1    C[i-4] = F + K                   Write C[i+1]                                        NW
2    X = D[i+5] - F                   Read D[i+1]                                         NR
3    C[i] = A[i] + A[i+1] + B[i+1]    Read A[i]; Read A[i+1]; Read B[i+1]; Write C[i]     NR PR NW
4    C[i+1] = K                       Write C[i]                                          PW
5    A[i+1] = E - N                   Write A[i+1]                                        NW
6    G = C[i] + D[i]                  Read C[i]; Read D[i]                                NR
7    W = B[i] + F                     Read B[i]                                           NR
8    M = A[i] - C[i]                  Read A[i]; Read C[i]                                NR NR
9    B[i+1] = A[i] * B[i]             Read A[i]; Read B[i]; Write B[i+1]                  NR PW

Total: 7NR + 3NW + 1PR + 2PW = 7 * 5 + 3 * 8 + 1 * 2 + 2 * 3 = 67 cycles
Surface: Surface(M3) + Surface(M4) = 7.6586 + 15.3171 = 22.9757 mm²

Figure 2.31. Another assignment of tables to memory modules and its effect on the number of cycles and surface

Figure 2.31 shows a reduction in the number of cycles (67 cycles instead of 71) and in the surface (22.9757 mm² instead of 30.6342 mm²) thanks to another assignment of tables and the use of another memory module. Note that the chosen memory module must have the smallest possible surface, while still offering the capacity required by the tables and simple variables assigned to it. Finally, Figure 2.32 shows that it is still possible to reduce the number of cycles (64 < 67 < 71) while keeping the same surface (22.9757 mm², using M3 and M4). This is achieved by an appropriate rescheduling that increases the number of accesses in page mode while preserving the correct execution of the algorithm (compliance with the initial behavior).

Instruction                      Memory accesses                          Access modes
Op6: G = C[i] + D[i]             Read C[i]; Read D[i]                     NR
Op7: W = B[i] + F                Read B[i]                                NR
Op9: B[i+1] = A[i] * B[i]        Read A[i]; Read B[i]; Write B[i+1]       NR PW
Op8: M = A[i] - C[i]             Read A[i]; Read C[i]                     PR NR

Figure 2.32. Only one permutation of instructions (the 8th and the 9th) leading to a reduction in the number of cycles (64 instead of 67)
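The counting rules behind Figures 2.30 to 2.32 can be sketched in Python. This is a minimal model and not the algorithm of Kim et al. (2005): the cycle costs (NR = 5, NW = 8, PR = 2, PW = 3) are read off the totals in the figures, the table-to-module assignments are those consistent with the access modes shown in the figures, and the rule that the writes of an instruction follow its reads is an assumption that happens to reproduce the three totals. The names `COST`, `PROGRAM`, `total_cycles` and `surface` are illustrative only.

```python
# Cycle costs per access: (kind, page_mode) -> cycles, taken from the figure totals.
COST = {("R", False): 5, ("R", True): 2, ("W", False): 8, ("W", True): 3}

# Surfaces of the memory modules used (Table 2.4), in mm^2.
SURFACE = {"M3": 7.6586, "M4": 15.3171}

# The nine instructions as (tables read, tables written).
PROGRAM = [
    ([], ["C"]),               # 1
    (["D"], []),               # 2
    (["A", "A", "B"], ["C"]),  # 3
    ([], ["C"]),               # 4
    ([], ["A"]),               # 5
    (["C", "D"], []),          # 6
    (["B"], []),               # 7
    (["A", "C"], []),          # 8
    (["A", "B"], ["B"]),       # 9
]

# Table -> module instance ("M4#1" and "M4#2" are two instances of M4).
FIG_2_30 = {"A": "M4#1", "B": "M4#1", "C": "M4#2", "D": "M4#2"}
FIG_2_31 = {"A": "M3", "C": "M3", "B": "M4", "D": "M4"}


def total_cycles(program, module_of):
    """Count cycles: page mode applies when a module is re-accessed on the same table."""
    last_table = {}  # module instance -> last table accessed in it
    total = 0
    for reads, writes in program:
        read_cycles = {}  # reads are sequential inside a module, parallel across modules
        for table in reads:
            module = module_of[table]
            page = last_table.get(module) == table
            read_cycles[module] = read_cycles.get(module, 0) + COST[("R", page)]
            last_table[module] = table
        total += max(read_cycles.values(), default=0)
        for table in writes:  # assumption: writes are issued after the reads
            module = module_of[table]
            page = last_table.get(module) == table
            total += COST[("W", page)]
            last_table[module] = table
    return total


def surface(module_of):
    """Sum the surfaces of the distinct module instances used by an assignment."""
    return sum(SURFACE[m.split("#")[0]] for m in set(module_of.values()))


# Figure 2.32: instructions 8 and 9 are swapped.
RESCHEDULED = PROGRAM[:7] + [PROGRAM[8], PROGRAM[7]]

print(total_cycles(PROGRAM, FIG_2_30))      # 71
print(total_cycles(PROGRAM, FIG_2_31))      # 67
print(total_cycles(RESCHEDULED, FIG_2_31))  # 64
```

Such a function can serve as the cost evaluation inside a search over assignments and schedules; the search strategy itself is outside the scope of this sketch.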

Chapter 3 of this book presents our CAD tool, which uses this same methodology but employs our own heuristics; its results compare favorably to those obtained by Kim et al. (2005).

2.4.2. Analysis

2.4.2.1. Surface

Thanks to the mapping tool that assigns circuits of the library (characterized by their surfaces, speeds and power consumption) to the various components (functional units, etc.), it is possible, from this design level (RTL), to give a quite accurate estimation of the surface of the entity obtained at this level, while also taking into consideration the surface of the control part and that of the interface between the operational part and the control part. The designer can then decide to keep this implementation or choose another synthesis by adding, for example, new constraints. If a given functional unit does not exist in the library, it is always possible to generate it automatically by a method similar to the one previously mentioned (River PLAs). Another means of developing this functional unit is to use the full-custom methodology at the module design level, taking into account the characteristics determined for it by the synthesis tool so that the assembly meets the specifications.

2.4.2.2. Speed

The scheduling of operations as well as the mapping tool offer the designer the possibility of having a quite accurate estimation of the performance of the entity obtained at the RTL. As for the surface, the designer can still choose another scheduling, subject to other constraints, and obtain another configuration in association with the mapping tool.


2.4.2.3. Power consumption

Though the circuits in the library can be characterized by their power consumption, it is nevertheless not possible to derive the dissipated power directly (by simple addition of powers, for example). Since the estimation of this parameter is more complicated than that of the surface (a sum of surfaces) and of the performance (sums for operations executed sequentially, combined with the max operator for operations executed in parallel), it is revisited in Chapter 3 of this book. For now, let us note that the consumption due to short circuits is nearly zero in CMOS technology (which is not the case with NMOS and pseudo-NMOS technologies, which were abandoned due to this effect). We should nevertheless take into account the static consumption (caused by current leakages, due to the very low threshold voltages of transistors, and by parasitic electrical parameters induced between parallel interconnections), which is no longer negligible in submicron and nanometer technologies (as it was in micron technologies).

2.4.3. Verification

The tool (or set of tools) used for temporal verification is, in itself, a behavioral verification tool. Indeed, applying combinations of values to the primary input signals yields information on both the performance of the entity and its results, in the form of logical signals. The function connecting the output signals to the input signals makes it possible to verify that the entity behaves as expected. However, since an exhaustive verification is not possible (astronomical number of combinations), this type of test is far more useful as a means of invalidating the design (by a counterexample) than of validating it. This is quite pessimistic, as the designer's experience must be reckoned with, but research efforts on formal verification are ongoing in laboratories around the world (University of California, Berkeley in the United States, TIMA in Grenoble, France, etc.). This type of verification relies essentially on theorem provers, models and substantial software. Until this type of verification becomes really possible at any design level, testing and testability techniques are used (and have been for a long time, in fact), such as the stuck-at 0 or 1 fault model and the integration of testing circuitry into the circuit to be developed. Nowadays, this Design For Test (DFT) is complemented by another, equally important type of design, namely Design For Reliability (DFR), due to the significant number of transistors present on the same chip and the increasing complexity of integrated systems (phenomena such as electro-migration and temperature increase may lead to a circuit malfunction).

2.5. Module design level

2.5.1. Synthesis

Synthesis is approached at this level of design if the desired module (functional unit) does not exist in the library or if its existing instances do not have the expected characteristics (surface, speed, power consumption). These characteristics are determined at the next higher level of abstraction, the register transfer level, so that the whole (operational and control parts and their interface) meets the specifications. The circuit concerned at this design level (module) can be an adder, a multiplier, etc. The synthesis is conducted according to these considerations, taking into account, in particular, the antagonism between performance and power consumption. Relatively recent synthesis techniques rely on the multi VDD multi Vth concept to reach the sought-for compromise between these two antagonistic parameters. Indeed, using VDD,L, VthN,H, VthP,L (respectively VDD,H, VthN,L, VthP,H) for the same circuit favors low power consumption (respectively performance) at the expense of the response time (respectively the power consumption) of the circuit. This problem relates to the optimization of an objective (or multi-objective) function subject to constraints. This aspect is revisited in Chapter 3, which is dedicated to this purpose.

NOTE.–
– VDD,L (VDD,H) is a low (or, respectively, high) supply voltage feeding a given logic gate of the circuit (example: 1 V and 3.3 V);
– VthN,L (VthN,H) is a low (or, respectively, high) threshold voltage of a given NMOS transistor;
– VthP,L (VthP,H) is a low (or, respectively, high) threshold voltage of a given PMOS transistor.


2.5.2. Analysis

2.5.2.1. Surface

Once the physical representation has been generated (automatically or in full custom), the surface of the module is simply that of the smallest rectangle enclosing all the constituents of the module.

2.5.2.2. Speed

The use of an electrical simulator makes it possible both to verify the functionality of the module (via the function connecting the output signals obtained to the input signals) and to have, at this design level, accurate timing information that takes into account all the electrical parameters, including the parasitic ones, particularly after the extraction of the electrical schematic (sizes of transistors, resistances, capacitances and inductances, both intrinsic and parasitic). Let us note that if the physical entity is not automatically generated, a first simulation is conducted using a text file, the NETLIST, describing all the transistors, their types, their connectivity (through their electrical nodes: drain, gate, source, bulk) and their dimensions. This file also contains essential information such as the supply value, the node to which it is attached, as well as the values of certain capacitances (notably those at the outputs of certain logic gates). The purpose of this first simulation is to spare the designer the effort of redrawing the circuit several times, with different sizes for certain transistors, whenever the electrical simulation does not yield the expected times. Since such a change of dimensions interacts with the design rules of the given technology, redrawing the circuit several times would be too time-consuming. Since the parasitic parameters are not yet known during this first simulation, the designer should size the transistors so as to obtain a response time below the expected one (in anticipation of the parasitic parameters).
Let us note that automated layout generation can be accompanied by a transistor autosizing tool (using mainly temporal models of transistors, the delays of interconnections, the logical schematic of the circuit and the expected response time). Once this first simulation validates the temporal aspect of the circuit, the designer can approach the layout by sizing the transistors according to the contents of the NETLIST file (see Figure 2.33). It should also be noted that this first simulation cannot be done unless the electrical simulator has been provided with the values of certain load capacitances (the rise and fall times at the corresponding nodes depend on these values and on the transistor sizes). The designer's experience may help provide the values of these capacitances before even starting the circuit layout, but there is also a means to estimate the capacitance C at the output of a logic gate, using the following formula:

$$C = \sum_{i} C_{G,i} + \sum_{i} C_{D,i} + C_{r}$$ [2.3]

where:
– CG,i is the gate capacitance of any transistor whose gate is connected to the electrical node of the concerned capacitance (this capacitance being connected between this node and the ground VSS);
– CD,i is the diffusion capacitance of any transistor having a node (source or drain) in common with the electrical node of the concerned capacitance;
– Cr is the routing capacitance (the output of the logic gate is extended by metal to form an interconnection).

Note that the capacitances per unit of surface and per unit of length of the polysilicon, of the diffusion (N or P) and of the metal in the concerned technology, together with the widths of transistors, the channel length and certain design rules (for calculating surfaces and perimeters), give a quite accurate estimation of the capacitances at the outputs of logic gates.
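As an illustration of equation [2.3], the following Python sketch sums the three contributions. The per-unit capacitance values and transistor dimensions are hypothetical and do not come from any real technology file; the helper names `gate_cap`, `diffusion_cap` and `output_capacitance` are likewise assumptions of this sketch, as is the simple area-plus-perimeter model used for the diffusion.

```python
def gate_cap(width_um, length_um, cap_per_um2):
    """Gate capacitance from the channel area (W * L) and a capacitance per unit of surface."""
    return width_um * length_um * cap_per_um2


def diffusion_cap(area_um2, perimeter_um, cap_per_um2, cap_per_um):
    """Diffusion capacitance from an area term and a perimeter term."""
    return area_um2 * cap_per_um2 + perimeter_um * cap_per_um


def output_capacitance(gate_caps, diffusion_caps, routing_cap):
    """Equation [2.3]: sum of gate, diffusion and routing capacitances (farads)."""
    return sum(gate_caps) + sum(diffusion_caps) + routing_cap


# Hypothetical example: two driven transistor gates, one diffusion region, one metal wire.
gates = [gate_cap(2.0, 0.5, 2e-15), gate_cap(4.0, 0.5, 2e-15)]
diffusions = [diffusion_cap(4.0, 8.0, 0.5e-15, 0.25e-15)]
c_out = output_capacitance(gates, diffusions, routing_cap=3e-15)
print(f"{c_out:.3e} F")
```

A value obtained this way can be entered as the load capacitance of the first simulation, before the layout and its extracted parasitics are available.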

Figure 2.33. Example of NETLIST file for an electrical simulation. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip


Figure 2.33 shows the following:
– the use of the keyword subckt. This keyword is used when a sub-circuit is repeated several times in the circuit that contains it (which is not the case here; it only serves as an illustration). It is then sufficient to describe such a part of the circuit once, and to call it wherever it is needed (its connectivity with the other parts of the circuit being given by the numbers of its electrical nodes);
– the sizes of transistors (W) are also given just as an illustration. Indeed, as the mobility of electrons (conduction in the NMOS transistor channel) is around three times greater than that of holes (the positive charges that enable conduction in the PMOS transistor channel), it is the PMOS transistors that must have larger widths than the NMOS transistors (as compensation). Nevertheless, the two mobilities are known to be almost identical in 7 nm technology and beyond;
– the load capacitance CLoad is located between node 3 (output of the logic gate) and the ground, and its value in this example is 10⁻¹³ F;
– the supply is connected between electrical node 5 and the ground (its value here is 3.3 V);
– the current is initialized at 0 A at nodes 1 and 2 (input signals of the circuit);
– the electrical simulation is conducted over a period of 90 ns, with a simulation step of 0.1 ns (the smaller the simulation step, the more accurate, but also the slower, the simulation);
– the PULSE and PLOT lines specify respectively the test cases and the signals to be visualized. Let us note that for a circuit with a significant number of input signals, the number of exhaustive test cases would be very large. A smaller number should be chosen, one that covers the possible cases (a representative sample). The PULSE lines indicate, over the 90 ns period, the time intervals in which a logical signal is 1 or 0, as well as the rise and fall times of the signals (0–1 and 1–0 transitions).

2.5.2.3. Power consumption

Consumption due to short circuits was present in NMOS and pseudo-NMOS technologies. Because of this important problem, they were abandoned in favor of CMOS technology. In the latter technology, the short circuit (current flow between the supply line VDD and the ground VSS) has a very short duration, when new signals are applied to the gates of the NMOS and PMOS transistors. Very quickly, one of the two blocks (NMOS or PMOS) becomes blocked, which prevents any flow of current between VDD and VSS. As bipolar transistors are faster but consume more power than MOS transistors, there was a time when BiCMOS technology was employed. This involved placing bipolar transistors on the critical paths of the circuit (for their speed), while MOS transistors (less surface and less power consumption) were placed everywhere else. With the development of semiconductor technology, speed is no longer an issue for submicron or nanometer CMOS technology (fast switching of transistors thanks to their low threshold voltages). This technology has therefore become a candidate for the implementation of quite a number of integrated systems. Two types of consumption are consequently considered:
– static consumption;
– dynamic consumption.

While it was quite negligible in micron technologies, static consumption (due to current leakages) is a significant part of the total consumption in submicron or nanometer technologies. Figure 2.34 shows the various possibilities for current leakage.

Figure 2.34. Various current leakage possibilities in a transistor. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

– Reverse bias between the drain of the PMOS transistor and the n-well.
– Reverse bias between the n-well and the P-type substrate.
– Reverse bias between the drain of the NMOS transistor and the P-type substrate.
– Source–drain leakage current (Vth, very low transistor threshold voltage).

A model used for the power consumption due to current leakages in the transistor channel is given by the following equation (BACPAC – Berkeley Advanced Chip Performance Calculator):

$$P_{leak} = 0.28125 \cdot V_{DD} \cdot K \cdot W_{avg} \cdot L \cdot \left( NbTransN \cdot 10^{-V_{tN}/\alpha_V} + NbTransP \cdot 10^{V_{tP}/\alpha_V} \right)$$ [2.4]

where:
– VDD is the supply voltage;
– K = 10 µA/µm;
– L is the channel length in the given technology;
– Wavg is the average width of transistors;
– NbTransN is the number of NMOS transistors in the circuit;
– NbTransP is the number of PMOS transistors in the circuit;
– VtN is the initial threshold voltage of an NMOS transistor;
– VtP is the initial threshold voltage of a PMOS transistor (note that VtP < 0);
– αV = 0.095 V.

The dynamic power consumption in a gate (due to a charge or discharge of the capacitance connected to the gate output) is given by the following equation:

$$P_{dyn} = 0.5 \cdot V_{DD}^{2} \cdot f \cdot C_{G}$$ [2.5]

where:
– VDD is the supply voltage;
– f is the frequency;
– CG is the capacitance at the node of the gate output.
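A short numerical illustration of equation [2.5] in Python may help fix orders of magnitude. The supply voltage (3.3 V) and the capacitance (10⁻¹³ F) are the values quoted for the NETLIST example of Figure 2.33; the 100 MHz frequency and the function name `dynamic_power_gate` are assumptions of this sketch.

```python
def dynamic_power_gate(vdd, freq, c_gate):
    """Equation [2.5]: 0.5 * VDD^2 * f * CG, in watts."""
    return 0.5 * vdd ** 2 * freq * c_gate


# One gate driving 1e-13 F at an assumed switching frequency of 100 MHz.
p_high = dynamic_power_gate(vdd=3.3, freq=100e6, c_gate=1e-13)
p_low = dynamic_power_gate(vdd=1.0, freq=100e6, c_gate=1e-13)

print(f"{p_high * 1e6:.2f} uW at 3.3 V")
print(f"{p_low * 1e6:.2f} uW at 1.0 V")
print(f"ratio: {p_high / p_low:.2f}")
```

Because the supply voltage enters the equation squared, lowering VDD from 3.3 V to 1 V divides this term by about 10.9, which is the motivation for the low supply voltage VDD,L in the multi VDD approach mentioned in section 2.5.1.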


For a circuit containing N gates, the dynamic consumption is:

$$P_{dyn} = 0.5 \cdot V_{DD}^{2} \cdot f \cdot \sum_{i=1}^{N} \left( N_{Gi} \cdot C_{Gi} \right)$$ [2.6]

where:
– NGi is the number of charges and/or discharges of capacitance CGi.

Because of parameter NGi, the estimation of the dynamic consumption of a circuit is far from being as simple as equation [2.6] suggests. Indeed, this parameter depends essentially on the input signals of the circuit. For a circuit with M input signals, the input signal of each gate takes the state 0 or 1 for each of the 2^M input vectors. But, in order to know whether the capacitance of a given gate has been charged or discharged (in which case CGi is recorded) or has preserved its previous state, the states of the N capacitances obtained with the current input vector must be compared with those obtained with the previous vector. In order to determine the average or maximal dynamic power consumption, the exhaustive process would then involve evaluating equation [2.6] for a number of cases equal to 2^M * 2^M, that is, 2^(2M) times. To get an idea of the complexity of this problem, let us consider an adder of just two numbers, each coded on 32 bits. The number of evaluations of equation [2.6] would then be equal to 2^64 * 2^64 = 2^128. Moreover, current systems are much more complicated than an adder (processors, ASICs, memories, etc.), the parameter NGi also depends on the design style of the circuit (static, dynamic, presence or absence of pass transistors between logical levels), and taking possible glitches into account complicates the problem further. Due to its intractability, this problem can only be solved using an algorithm based on a heuristic or metaheuristic. This subject is dealt with in Chapter 3 of this book. Since interconnections present a significant power consumption problem in submicron or nanometer technologies, they are addressed separately, at the end of this chapter.

2.5.3. Verification

As previously mentioned, temporal analysis (the use of an electrical simulator) is, in itself, a temporal and functional verification. Nevertheless, because the entities involved at this design level are less complex than those dealt with at the system and register transfer levels, formal verification at the module and cell levels is much more feasible. The stuck-at 0 or 1 fault method, as well as the integration of test circuitry into the original circuit, are still applicable. In terms of the physical representation (layout) of an entity, there are other verification means:
– Layout Versus Schematic (LVS) comparison;
– Design Rule Checking (DRC) of a given technology (this tool enables the verification of spacings, crossings, extensions, etc.);
– Electrical Rule Checking (ERC) (this tool verifies, for example, whether a transistor node has been left unconnected).

2.6. Gate design level

2.6.1. Synthesis

The optimization of a cell may relate to the logical synthesis, as shown by the example in Table 2.5.

a  b  c  S
0  0  0  1
0  0  1  1
0  1  0  1
0  1  1  1
1  0  0  1
1  0  1  0
1  1  0  0
1  1  1  0

Group 2^i adjacent cases (two adjacent cases differ by only one bit) carrying the value 1:
000, 001 → 00*
010, 011 → 01*
00*, 01* → 0**
100

Resulting table:

a  b  c  S
0  *  *  1
1  0  0  1

Table 2.5. Example of logical simplification
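The result of Table 2.5 can be checked mechanically. The Python sketch below compares a cover, written as cubes over (a, b, c) with `*` as a don't-care position, against the truth table. The names `ON_SET`, `matches` and `equivalent` are illustrative, and the second cover tested is an observation of this sketch rather than something taken from the table.

```python
from itertools import product

# Input combinations (a, b, c) for which S = 1 in Table 2.5.
ON_SET = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0)}


def matches(cube, bits):
    """True if the input combination is covered by the cube ('*' matches 0 and 1)."""
    return all(c == "*" or int(c) == b for c, b in zip(cube, bits))


def equivalent(cover):
    """True if the OR of the cubes gives exactly the truth table of S."""
    return all(
        any(matches(cube, bits) for cube in cover) == (bits in ON_SET)
        for bits in product((0, 1), repeat=3)
    )


print(equivalent(["0**", "100"]))  # cover obtained in Table 2.5
print(equivalent(["0**", "*00"]))  # 100 grouped with 000: one literal fewer
print(equivalent(["0**"]))         # incomplete cover: 100 is missed
```

Both of the first two covers reproduce S; the second uses the adjacency between 000 and 100 and therefore needs two literals instead of three for its second product term.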


Table 2.6 shows a further possibility for the simplification of a logical function.

Table 2.6. Another example of logical simplification

The logical simplification of several functions can be carried out by considering each function separately or all of them globally, as shown by the example below:

Table 2.7. Separate and global simplifications of logical functions


In other cases, global simplification can generate a markedly smaller number of product terms. Simplification as conducted by conventional methods (Karnaugh maps, etc.) is effective only for a small number of logical variables: this is a combinatorial problem, and not a tractable one. The ESPRESSO tool (Rudell, 1985), which relies on a heuristic and opts for a global optimization of functions, is very widely used for the interesting results that it yields. There are other methods for the logical synthesis of functions, such as:
– one that involves achieving a logical function with (n + 3)/2 gates, where n is the number of variables;
– one that involves achieving a logical function with the smallest possible number of logic gates;
– one that involves achieving a logical function by minimizing the cost function h(G, I), where G is the number of gates and I is the number of interconnections;
– one that involves achieving a function with logic gates of similar complexity.

The design style also has an effect on the characteristics of the cell. Figure 2.35 (static style) shows the use of six transistors to represent the logical function S = NOT(a.b + NOT(c)), while the dynamic representation uses only five. In general, for a function with N logical variables, the number of transistors is equal to 2*N for the static style and to (N + 2) for the dynamic style. Therefore, from the surface point of view, the dynamic style is quite appropriate, but it requires the transmission of a clock signal to each logic gate. A set of logic gates also requires proper synchronization of the clock signals, as well as a solution to the problem of accidental capacitor discharge caused by the cascading of these gates. In Figure 2.36, the pre-charge to "1" is done when ϕ = 0 (charging the capacitor at the gate output; depending on the values of the input signals, this capacitor will subsequently either remain charged or be discharged). At ϕ = 1, the PMOS pre-charge transistor is blocked while the NMOS evaluation transistor conducts. The output signal is then evaluated with the signals applied to the gates of the NMOS transistors (during the pre-charge phase, the logical signals supplying the gates of these transistors are set to "0").
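The two-phase operation of a dynamic gate and the transistor counts quoted above can be sketched behaviorally in Python. This is a purely logical model (no charge sharing, leakage or timing), and the function names are illustrative; the NMOS block is assumed to conduct exactly when a.b + NOT(c) is true, which is what the function S = NOT(a.b + NOT(c)) requires.

```python
from itertools import product


def transistor_count(n_inputs, style):
    """Transistor counts given in the text: 2*N (static) and N + 2 (dynamic)."""
    if style == "static":
        return 2 * n_inputs
    if style == "dynamic":
        return n_inputs + 2
    raise ValueError(f"unknown style: {style}")


def dynamic_gate_phase(phi, nmos_block_conducts, stored):
    """Output capacitor state after one clock phase of a dynamic gate."""
    if phi == 0:
        return 1  # pre-charge through the PMOS transistor
    # Evaluation: the capacitor discharges only if the NMOS block conducts.
    return 0 if nmos_block_conducts else stored


def evaluate_s(a, b, c):
    """S = NOT(a.b + NOT(c)) obtained by a pre-charge followed by an evaluation."""
    stored = dynamic_gate_phase(0, False, stored=0)
    conducts = bool((a and b) or (not c))
    return dynamic_gate_phase(1, conducts, stored)


print(transistor_count(3, "static"), transistor_count(3, "dynamic"))  # 6 5
for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, evaluate_s(a, b, c))
```

The model makes the asymmetry of the style visible: the output can only go from 1 to 0 during evaluation, which is why a wrongly discharged capacitor cannot recover before the next pre-charge.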


Figure 2.35. Static design style


Figure 2.36. Dynamic design style

Referring to the cascading of dynamic gates (DOMINO logic), let us consider Figure 2.37. Assume that the evaluation yields Z = 0. Then, for example, with A = 1 and B = 0 (in the 2nd gate), the output signal of the 2nd gate should be equal to 1. Nevertheless, this output signal takes the value 0, and not 1. This is due to the fact that the pre-charge to 1 has been falsified for the 2nd gate (discharge of the capacitance of this gate through the NMOS transistor whose gate is supplied by Z, Z being at 1 after its own pre-charge). During the evaluation, even though the transistors supplied by Z and B are blocked, the logical signal "1" has already been lost (significant discharge via the transistor supplied by Z). We shall briefly present techniques for addressing the accidental capacitor discharge; for further details, the interested reader is invited to refer to Weste et al. (1993).

Figure 2.37. Problem of cascading dynamic logic gates

In Figure 2.38a, the PMOS transistor whose gate is connected to the ground (it is therefore always conducting) makes it possible to compensate for the loss of charge. Nevertheless, it should be weakly sized with respect to the NMOS transistors, to avoid the falsification of logic (obtaining "1" instead of "0" when "0" is expected). The PMOS transistor in Figure 2.38b plays the same role (compensation for a loss of charge), but its sizing raises fewer problems as it is not always conducting (it blocks when the inverter that supplies it has logical "0" as its input signal).

Figure 2.38. Techniques for compensating for the loss of charge. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip


For cascaded gates, it is sufficient to insert an inverter between two successive levels (modified DOMINO logic, or NORA for NO RAce). The role of each of these inverters is to transform the logical "1" of gate i into "0", to prevent the capacitor of gate (i + 1) from discharging through the transistor whose gate is supplied by the output signal of gate i. However, inserting these inverters involves transforming the logical functions in order to preserve the initial behavior of the circuit. This is illustrated by Figure 2.39.

Figure 2.39. NORA logic

There are other variants of NORA logic in which some logic gates are evaluated using their NMOS blocks (with a PMOS transistor for the pre-charge to "1"), while other gates of the same circuit are evaluated using their PMOS blocks (with an NMOS transistor for the pre-charge to "0"). This is indicated in Figures 2.40 and 2.41 (note that an inverter is required at the output of a gate that is pre-charged to "0" and supplies PMOS blocks, in order to avoid falsifying the pre-charge to "0" in the cascaded gates whose evaluation is done by PMOS blocks). This latter technique can be used efficiently, taking into account that the equivalent resistance of transistors in series is higher than that of transistors in parallel. Since the NMOS and PMOS blocks are complementary in CMOS technology (transistors in series, respectively in parallel, in the NMOS block are in parallel, respectively in series, in the PMOS block), it is recommended to evaluate with the type of block that has most of its transistors in parallel, and to do the pre-charge (to "1" or "0") with a transistor of the other type. In particular, it is not advisable to choose, for the evaluation, a PMOS block made of many PMOS transistors in series (which is doubly penalizing, because of the significant equivalent resistance and because the mobility of holes, the positive charges enabling conduction in the PMOS transistor channel, is lower than that of electrons) and to use only one NMOS transistor (already favored by its better conduction compared to PMOS) for the pre-charge to "0".

Figure 2.40. Variant of NORA logic

Figure 2.41. Another variant of NORA logic


In this specific case, it is better to conduct the evaluation with NMOS transistors in parallel and to proceed with the pre-charge to "1" using a PMOS transistor. The way the pre-charge is conducted may also take into account the power consumption. For example, a decoder of memory addresses features 2^N lines at its output for an address coded on N bits. It is then preferable to pre-charge these 2^N lines to "0" and, when a memory word is selected, to charge only the addressed line to "1", thus causing a single capacitance charge among the 2^N lines. An initial pre-charge to "1" of the 2^N lines would cause the discharge of (2^N – 1) capacitances (transition from 1 to 0 for all the capacitances except the one corresponding to the addressed line), which would generate a high dynamic power consumption.

As already mentioned, the dynamic power consumption is due to the charging and discharging of capacitors. Some of these charges and discharges are partial (glitches): the capacitor does not fully charge or discharge, and goes through an intermediate rather than a final state. This is due to the fact that the input signals of a given logic gate do not arrive at the same time. If there is a significant number of such charges/discharges (current circuits can contain thousands of gates), part of the dynamic power is unnecessarily consumed. In certain cases (Figure 2.42, where signal D is delayed with respect to C), it is possible to eliminate such charges and discharges by using another circuit that achieves the same function F but ensures identical signal arrival times. It is also possible to solve this problem by inserting, between two successive logical levels of the circuit, pass transistors that play the role of switches, controlled by signals such that the input signals of a given logic gate reach it at the same time. Though this design style is interesting for the dynamic power consumption, it requires more surface (use of pass transistors) and an adequate management of the control signals supplying the gates of these additional transistors. It also requires additional time, thus rendering the circuit slower (this design style is therefore useful when the system to be implemented is not critical in time and surface, but is critical in power consumption). It is also possible to use gates controlled by switches (Figure 2.43, clocked CMOS logic, where the gate is active at ϕ = 0 and blocked at ϕ = 1).
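Returning to the address decoder, the effect of the pre-charge value on the number of capacitance transitions can be counted directly. The Python sketch below is a purely logical count that ignores glitches; the function name `decoder_transitions` is an assumption of this sketch.

```python
def decoder_transitions(n_bits, precharge_value, address):
    """Number of output lines that change state when one address is selected."""
    n_lines = 2 ** n_bits
    precharged = [precharge_value] * n_lines
    # After selection, only the addressed line is at 1.
    selected = [1 if line == address else 0 for line in range(n_lines)]
    return sum(before != after for before, after in zip(precharged, selected))


n = 4  # address coded on 4 bits, hence 16 output lines
print(decoder_transitions(n, precharge_value=0, address=5))  # 1
print(decoder_transitions(n, precharge_value=1, address=5))  # 15
```

With a pre-charge to "0" a single capacitance is charged, against 2^N - 1 discharges with a pre-charge to "1"; in terms of equation [2.6], this amounts to a much smaller sum of NGi over the output lines.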


Figure 2.42. Example of elimination of partial charges and discharges (glitches). For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

Figure 2.43. CMOS clocked design style


If an output signal and its complement are simultaneously necessary, the designer can use the CVSL (Cascode Voltage Switch Logic) style (Figure 2.44). An illustration is given in Figure 2.45 for a NAND gate, whose output is S and whose complemented output is NOT(S). With A = B = 0, the two NMOS transistors supplied by NOT(A) and NOT(B) are on and therefore discharge the electrical node corresponding to NOT(S) (hence NOT(S) = 0). This node turns on the PMOS transistor on the left, which charges the node corresponding to S to "1" (hence S = 1); node S in turn blocks the PMOS transistor on the right, thereby preventing NOT(S) from being 1. For A = B = 1, the two NMOS transistors supplied by A and B are on and therefore discharge the electrical node corresponding to S (hence S = 0). Node S turns on the PMOS transistor on the right, which charges the node corresponding to NOT(S) to "1" (hence NOT(S) = 1); this node blocks the PMOS transistor on the left, thereby preventing S from being 1.

Figure 2.44. CVSL design style

Figure 2.45. NAND gate implemented in CVSL


The design style based on transmission gates may be useful for the implementation of certain circuits, such as the multiplexors. The example shown in Figure 2.46 enables the selection of data among the four inputs, depending on the values of A and B – for example, F(1,0) = D3. Note that in this figure, the pass transistors are in CMOS, not only in NMOS (or PMOS). This is due to the fact that the NMOS (PMOS) transistor transmits a proper “0” (or, respectively, “1”), which enables the generation at output of a quality signal. From an electrical point of view, this is reflected by the fact that if the voltage at the drain node of NMOS transistor is VDD, applying a voltage equal to VDD on the grid of this transistor would open it, but only (VDD – Vth) would be obtained at the other node. Then, if this transistor is in series with another, it would have (VDD – Vth – Vth = VDD – 2Vth) at the other node. Hence, for N transistors in series, the initially transmitted data would go from VDD to VDD – N*Vth (Vth being the initial threshold voltage of the NMOS transistor), hence a degradation of the logical “1”. The same reasoning can be applied to the PMOS transistor in order to transmit the logical “0”. In order to avoid fan-in and fan-out problems, we should avoid the implementation of a logic gate with a significant number of transistors, but instead split it into a certain number of gates each containing a reduced number of transistors, while avoiding the increase in logical depth of the circuit. Indeed, the number of charges and discharges of a capacitance at the output of a given gate will be large as this gate is located at a deep logical level, which involves the dissipation of significant dynamic power. From the point of view of physical representation, the metal-oriented layout offers the advantage of facilitating automated layout generation, while obtaining a regular structure, which enables the minimization of the negative impact of parasitic parameters. 
This technique involves the arrangement of the circuit elements in matrix form: polysilicon and one metal level run horizontally (or vertically), while diffusion and another metal level run vertically (or, respectively, horizontally). It is worth recalling that the intersection of Metal1 and Metal2 does not form the same electrical node. This is an advantage: routing remains simple and no surface is wasted on the detours that are required when only one level of metal is available, as in previous technologies. Moreover, the intersection of polysilicon with type N (P) diffusion creates an NMOS (or, respectively, PMOS) transistor channel. Since polysilicon is very resistive and capacitive, it is used exclusively for creating transistor channels. Interconnections are realized with metal (aluminum, copper), which has better electrical characteristics. Nevertheless, the 2nd level of polysilicon (which is very capacitive) featured by certain semiconductor technologies, besides its role as an electrode, has the advantage of realizing, on a small surface, a capacitance of quite high value, established by the designer. Figure 2.47 shows an example of metal-oriented layout implementing the function S2 = A and (B or C) in two logic gates (the 2nd gate is an inverter).

Figure 2.46. Transmission gates

Figure 2.47. Example of metal-oriented layout. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip


2.6.2. Analysis

2.6.2.1. Surface

Similar to an entity at the module level, the surface of the cell is that of the smallest rectangle including all its components. This holds from the point of view of the physical representation (layout). Before getting to this representation, the surface can be estimated in terms of the number of product terms (behavioral representation) or in terms of the number of transistors and of logical depth (structural representation).

2.6.2.2. Speed

The means for estimating the response time of a cell are in fact those used at the module abstraction level (functional and electrical simulations). From the results of electrical simulation, which yields the values of voltage and current at successive instants (depending on the fixed simulation step), the designer can determine the response time of the circuit and the transition time of the signals as follows:
– propagation time: this is defined as the time interval whose limits are the times when the input and output signals each reach 50% of their values (e.g. 2.5 V for the two signals in the case of an inverter for a supply voltage equal to 5 V);
– signal transition time: this is defined as the time interval whose limits are the time when the signal reaches 10% (90%) of its rated value and the time when this signal reaches 90% (10%) of its value. For example, if the supply voltage is 5 V, then the rise time TR (fall time TF) is the interval that separates the moments when the signal voltage passes through 0.5 V (4.5 V) and 4.5 V (0.5 V). The average transition time of the signal is then determined as (TR + TF)/2.

Obviously, since a circuit has several input signals (possibly even several output signals), we should consider the least favorable case for determining the propagation time of the circuit.

2.6.2.3. Power consumption

Similar to the module design level, power consumption due to current leakages is estimated by equation [2.4]. Equation [2.6] gives an estimation of the dynamic consumption. Let us recall that this equation is, in fact, implemented by a whole algorithm (based on a heuristic or a metaheuristic), because of parameter NGi, which depends on successive vectors of the input signals, as well as on the circuit design style.

2.6.3. Verification

Everything that was said for verification at the module abstraction level is applicable to the cell design level. Let us just add that the tool for the design rule checking of a given technology can be used interactively (a possible error is signaled as soon as it occurs). It is worth noting that a design rule concerning, for example, the minimum spacing D between two metal lines does not forbid a spacing D1 larger than D. Though D1 has an unfavorable influence on the surface, it reduces the parasitic capacitance formed between two parallel metallic lines (this capacitance is inversely proportional to the distance separating the two interconnections). As the two metal rectangles drawn by the designer will actually take other forms during the fabrication of the circuit (possibly overflowing the sides), a spacing below D may merge them into the same electrical node, contrary to the expected logic.

2.7. Transistor level

2.7.1. NMOS and CMOS technologies

Figure 2.48 shows an example of a circuit according to the NMOS technology. As the threshold voltage of the load transistor is negative and VGS = 0 (therefore > Vth), this transistor is always on, charging the capacitance CL. Whether this charge is maintained depends on the input signals evaluated by the NMOS block.

Figure 2.48. Example of NMOS circuit


The pseudo-NMOS technology (Figure 2.49) involves the use, as load transistor, of a PMOS transistor whose gate is connected to VSS, and which is therefore always on. These two technologies have the advantage of using only (N + 1) transistors for a circuit with N input signals; they have nevertheless been abandoned because of the power dissipation due to the short-circuit current (the flow of a supply current to the ground). The CMOS technology solves this problem by using two blocks of NMOS and PMOS transistors that conduct in a mutually exclusive manner. The static design style uses 2 * N transistors (Figure 2.50), but this number can be reduced to (N + 2) transistors if the dynamic design style (already presented in this book) is chosen. Since bipolar transistors are faster but consume more power than MOS transistors, there was a time when the BiCMOS technology was employed. This involved placing bipolar transistors on the critical paths of the circuit (because of their speed), while MOS transistors (less surface and less power consumption) were placed everywhere else. However, as semiconductor technology developed, CMOS technology became widely used for the implementation of a significant number of integrated circuits and systems (transistor switching became fast thanks to low initial threshold voltages).

Figure 2.49. Example of pseudo-NMOS circuit

Figure 2.50. Static design style in CMOS


2.7.2. Theory of MOS transistor (current IDS)

The current IDS is one of the parameters that influence power consumption and the response time of the circuit. Let us consider Figure 2.51 to show the current IDS (drain–source current) in the channel of the NMOS or PMOS transistor.

Figure 2.51. NMOS and PMOS transistors

The gate voltage involved in the creation of the channel is VGS – Vt. The charge of the channel is then Q = C’Ox (VGS – Vt), where C’Ox is the capacitance of the oxide, VGS is the gate–source voltage and Vt is the threshold voltage of the transistor. We have:

$$I_{DS} = \frac{Q}{\tau}$$ [2.7]

where τ is the electron transfer time in the channel of the NMOS transistor. The speed of the electrons under the effect of the electrical field is V = µ*E = µ (VDS/L), where µ is the electron mobility. Since:

$$\tau = \frac{L}{V}$$ [2.8]

equations [2.7] and [2.8] yield:

$$I_{DS} = \frac{Q}{\tau} = \frac{Q}{L/V} = \frac{Q\,V}{L} = Q\,\mu\frac{V_{DS}}{L^2} = C'_{Ox}(V_{GS} - V_t)\,\mu\frac{V_{DS}}{L^2}$$

As C’Ox = COx * W * L, where COx is the capacitance per surface unit, we have:

$$I_{DS} = \mu C_{Ox}\frac{W}{L}(V_{GS} - V_t)V_{DS} = \beta (V_{GS} - V_t)V_{DS}, \quad \text{with } \beta = \mu C_{Ox}\frac{W}{L}$$

In this model, the charge is assumed constant along the channel. A more exact expression of IDS is the following:

$$I_{DS} = \beta\int_0^{V_{DS}} (V_{GS} - V_t)\,dV = \beta\int_0^{V_{DS}} (V_{GS} - V_{t0} - \lambda V)\,dV, \quad \text{with } V_t = V_{t0} + \lambda V$$

It is worth noting that λ is a parameter due to the substrate effect. Hence, the INITIAL threshold voltage of a transistor (Vt0) is constant (it depends on a given technological process), but the threshold voltage of a transistor (Vt) IS NOT: it varies with VDS, as shown in Figure 2.52. Assuming that λ = 1, we get:

$$I_{DS} = \beta\left[(V_{GS} - V_{t0})V_{DS} - \frac{V_{DS}^2}{2}\right]$$ [2.9]

Figure 2.52. Parameters that influence IDS in the NMOS transistor

Figure 2.52 shows the following:
– the initial threshold voltage of the transistor (Vt0) is constant;
– the threshold voltage of the transistor (Vt) is not constant (it varies as a function of VDS, due to the substrate effect);
– the pinch point (or channel constriction) is located at the intersection between VGS and Vt curves;
– the surface limited by VS, VD, Vt and VGS curves is IDS/β;
– for a given value of VDS (VGS), IDS increases with VGS (or, respectively, VDS). This happens up to the pinch point.

As IDS depends on β (= μ COx W/L), the designer may vary IDS by varying W, the transistor width (the other parameters of β being fixed).

As indicated by Figure 2.52, IDS is not constant. Let us examine the case when it is maximal. Let α = β (VGS – Vt0). We then have:

$$I_{DS} = \alpha V_{DS} - \frac{\beta}{2}V_{DS}^2, \qquad \frac{dI_{DS}}{dV_{DS}} = \alpha - \beta V_{DS}$$

The variation table in Figure 2.53 shows that IDS is maximal for VDS = α/β (β ≠ 0). This means:

$$I_{DS,max} = \alpha\frac{\alpha}{\beta} - \frac{\beta}{2}\frac{\alpha^2}{\beta^2} = \frac{\alpha^2}{2\beta} = \frac{\beta^2 (V_{GS} - V_{t0})^2}{2\beta}$$

We then have:

$$I_{DS,max} = \frac{\beta}{2}(V_{GS} - V_{t0})^2 = \frac{\mu C_{Ox} W}{2L}(V_{GS} - V_{t0})^2$$ [2.10]

Figure 2.53. Calculation of IDS, max

We then have (Figure 2.54):
– VGS < Vt0: blocked transistor;
– VDS < VGS – Vt0: linear region;
– VDS = VGS – Vt0: channel constriction;
– VDS > VGS – Vt0: saturation.

Figure 2.54. Various regions of transistor operation
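The operating regions listed above can be turned into a small current model. The sketch below is only illustrative: it assumes the λ = 1 model, that is, β[(VGS – Vt0)VDS – VDS²/2] in the linear region (equation [2.9]) and (β/2)(VGS – Vt0)² from the pinch point onwards (equation [2.10]). The function name `ids_nmos` and the numerical values are hypothetical.

```python
def ids_nmos(vgs: float, vds: float, vt0: float, beta: float) -> tuple[str, float]:
    """Return (region, IDS) for an NMOS transistor in the lambda = 1 model."""
    if vgs <= vt0:
        return "off", 0.0                  # blocked transistor: no channel
    vov = vgs - vt0                        # VGS - Vt0
    if vds < vov:
        # linear region: IDS still grows with VDS
        return "linear", beta * (vov * vds - 0.5 * vds ** 2)
    # from the pinch point onwards, IDS keeps its maximal value
    return "saturation", 0.5 * beta * vov ** 2


# Illustrative values (assumptions): VDD = 5 V, Vt0 = 1 V, beta = 1 A/V^2
for vds in (0.5, 2.0, 4.5):
    print(vds, ids_nmos(5.0, vds, 1.0, 1.0))
```

Because the two expressions coincide at VDS = VGS – Vt0, the modelled current is continuous at the pinch point.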

2.7.3. Transfer characteristics of the inverter

The objective is to characterize the output of the inverter as a function of its input, that is, to determine VDS as a function of VGS.

Figure 2.55. Transfer characteristics of the inverter in NMOS technology. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

As VGS = 0 and the load transistor is in depletion mode (Vth < 0), it is always conducting. To characterize the transfer of the inverter, it is sufficient to equalize the currents of the load and signal transistors at various values of VGS for the signal transistor. The intersection of each of these curves with that of the load transistor yields VDS as a function of VGS. For example, for VGS = 0.3VDD = 1.5 V (with VDD = 5 V), we obtain approximately VDS = 4 V, while for VGS = 0.6VDD = 3 V, we obtain approximately VDS = 1.25 V, and so on. In general, the more VGS increases (decreases), the more VDS decreases (or, respectively, increases).

2.7.4. Static analysis of the inverter

Given VGS = Vin = VDD; Vt0 = 0.2VDD; VDS,S = 0.1VDD, we have VGS – Vt0 = 0.8VDD > VDS,S. The signal transistor therefore operates in the linear region and, from equation [2.9]:

$$I_{DS,S} = \beta_S\left[(V_{DD} - 0.2V_{DD})\,0.1V_{DD} - \frac{(0.1V_{DD})^2}{2}\right] = 0.075\,\beta_S V_{DD}^2$$

For the load transistor, VGS = 0; Vdep = –0.8VDD; VDS,C = VDD – VDS,S = 0.9VDD, where Vdep is the initial threshold voltage of the depleted load transistor. We have VGS – Vdep = 0.8VDD < VDS,C. Therefore, the load transistor operates in saturation mode:

$$I_{DS,C} = \frac{\beta_C}{2}V_{dep}^2 = 0.32\,\beta_C V_{DD}^2$$ (replacing VGS by 0 in equation [2.10]).

Given βR = βS/βC, and equalizing the currents in the two transistors, we have:

$$\beta_R = \frac{\beta_S}{\beta_C} = \frac{0.5\,(0.8)^2}{0.8 \cdot 0.1 - 0.5\,(0.1)^2} = \frac{0.32}{0.075} \approx 4.3$$


Figure 2.56. Influence of βR on the logical levels

Figure 2.56 shows that for Vin = VDD, βR = 4 gives a better value of Vout (0.1VDD) for the inverter. For a convenient logical level, we should therefore have: βR ≥ 4

2.7.5. Threshold voltage of the inverter

The threshold voltage of the inverter (not to be confused with the threshold voltage of a transistor) is the voltage for which Vin = Vout = 0.5VDD.

Given VGS = Vin = 0.5VDD; Vt0 = 0.2VDD; VDS,S = Vout = 0.5VDD, we have VGS – Vt0 = 0.3VDD < VDS,S. The signal transistor is in saturation mode:

$$I_{DS,S} = \frac{\beta_S}{2}(0.5V_{DD} - 0.2V_{DD})^2 = 0.045\,\beta_S V_{DD}^2$$

For the load transistor, VGS = 0; Vdep = –0.8VDD; VDS,C = VDD – Vout = 0.5VDD. We have VGS – Vdep = 0.8VDD > VDS,C. The load transistor operates in the linear region:

$$I_{DS,C} = \beta_C\left[0.8V_{DD} * 0.5V_{DD} - \frac{(0.5V_{DD})^2}{2}\right] = \beta_C\left[0.4V_{DD}^2 - \frac{0.25V_{DD}^2}{2}\right] = 0.275\,\beta_C V_{DD}^2$$ (from equation [2.9]).

Considering βR = βS/βC, and equalizing the currents in the two transistors, we have:

$$\beta_R = \frac{\beta_S}{\beta_C} = \frac{0.275}{0.045} \approx 6$$
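The two current balances above (static analysis and inverter threshold) can be checked numerically. The following is a minimal sketch, assuming the λ = 1 current model of equations [2.9] and [2.10] and voltages normalized to VDD = 1; the helper names `linear_current` and `saturation_current` are hypothetical and return IDS/β.

```python
def linear_current(vov: float, vds: float) -> float:
    """IDS / beta in the linear region: (VGS - Vt) * VDS - VDS^2 / 2."""
    return vov * vds - 0.5 * vds ** 2


def saturation_current(vov: float) -> float:
    """IDS / beta in saturation: (VGS - Vt)^2 / 2."""
    return 0.5 * vov ** 2


VDD = 1.0            # all voltages are expressed as fractions of VDD
VT0 = 0.2 * VDD      # initial threshold voltage of the signal transistor
VDEP = -0.8 * VDD    # initial threshold voltage of the depletion load

# Static analysis: Vin = VDD, Vout = 0.1 VDD
i_signal = linear_current(VDD - VT0, 0.1 * VDD)     # signal transistor, linear
i_load = saturation_current(0.0 - VDEP)             # load transistor, saturated
beta_r_static = i_load / i_signal                   # beta_S / beta_C

# Inverter threshold: Vin = Vout = 0.5 VDD
i_signal_th = saturation_current(0.5 * VDD - VT0)         # signal, saturated
i_load_th = linear_current(0.0 - VDEP, VDD - 0.5 * VDD)   # load, linear
beta_r_threshold = i_load_th / i_signal_th

print(round(beta_r_static, 2), round(beta_r_threshold, 2))
```

Under these assumptions the first ratio comes out slightly above 4 and the second close to 6, in line with the values quoted in the text.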

As β depends on W, the transistors should be properly sized in order to have convenient logical levels. Note that βR may be equal to 8 if Vin = VDD – Vt0 (this is the case when the transistors of the inverter are driven by the output signal of a pass transistor).

2.7.6. Estimation of the rise and fall times of a capacitor

2.7.6.1. Estimation of the rise time of a capacitor

Let us consider an inverter in NMOS technology (this will be revisited for CMOS technology). A voltage Vout = 0.9VDD is acceptable for the representation of a logical “1”. For the load transistor, we have VDS,C = VDD – Vout. As Vout varies from 0 to 0.9VDD, VDS,C varies from VDD to 0.1VDD. As Figure 2.57 shows, the load transistor operates successively in saturation and linear modes.

Figure 2.57. Operating modes of the load transistor

– Saturation: the voltage across CL (at the output node of the inverter) is the solution to the differential equation:

$$C_L\frac{dV}{dt} = I_{DS,C} = \frac{\beta_C}{2}V_{dep}^2$$

Varying t from 0 to t1 and V (Vout) from 0 (VDD – VDD) to 0.2VDD (VDD – 0.8VDD), we have:

$$t_1 = \frac{2\,C_L\,(0.2V_{DD})}{\beta_C V_{dep}^2}$$

– Linear region: varying t from t1 to t2 and V (Vout) from 0.2VDD (VDD – 0.8VDD) to 0.9VDD (VDD – 0.1VDD), the voltage across CL is the solution to the differential equation:

$$C_L\frac{dV}{dt} = I_{DS,C} = \beta_C\left[-V_{dep}(V_{DD} - V) - \frac{(V_{DD} - V)^2}{2}\right]$$

This yields:

$$t_3 = \frac{C_L}{\beta_C\,|V_{dep}|}\ln 15$$

with t3 = t2 – t1 and Vdep = –0.8VDD. Hence:

$$T_{Rise} = t_1 + t_3 \approx \frac{4\,C_L}{\beta_C V_{DD}}$$

2.7.6.2. Estimation of the fall time of a capacitor

CL discharges through the signal transistor. A voltage Vout = 0.1VDD is acceptable for the representation of logical “0”. As Vout = VDS,S, VDS,S varies from VDD to 0.1VDD. Based on Figure 2.58, let us consider the two operating modes (saturation and linear) of the signal transistor.

– Saturation: the voltage across CL (at the output node of the inverter) is the solution to the differential equation:

$$-C_L\frac{dV}{dt} = I_{DS,S} = \frac{\beta_S}{2}(V_{DD} - V_{t0})^2$$

Varying t from 0 to t1 and V (Vout) from VDD to (VDD – Vt0), we get:

$$t_1 = \frac{2\,C_L\,V_{t0}}{\beta_S (V_{DD} - V_{t0})^2}$$

– Linear region: varying t from t1 to t2 and V (Vout) from (VDD – Vt0) to 0.1VDD, the voltage across CL is the solution to the differential equation:

$$-C_L\frac{dV}{dt} = I_{DS,S} = \beta_S\left[(V_{DD} - V_{t0})V - \frac{V^2}{2}\right]$$

This yields:

$$t_3 = \frac{C_L}{\beta_S (V_{DD} - V_{t0})}\ln 15$$

with t3 = t2 – t1 and Vt0 = 0.2VDD. Hence:

$$T_{Fall} = t_1 + t_3 \approx \frac{4\,C_L}{\beta_S V_{DD}}$$


Figure 2.58. Operating regions of the signal transistor
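The rise-time estimate can be cross-checked by integrating dt = CL dV / I numerically over the two operating regions of the load transistor. The sketch below is only a sanity check under stated assumptions: the λ = 1 current model, VGS = 0 for the load, Vdep = –0.8VDD, and normalized units CL = βC = VDD = 1. The function name `rise_time` is hypothetical.

```python
import math


def rise_time(cl: float, beta_c: float, vdd: float, steps: int = 20_000) -> float:
    """Time for Vout to rise from 0 to 0.9*VDD through a depletion load (VGS = 0).

    Integrates dt = CL * dV / I_load(V) with the midpoint rule.
    """
    vov = 0.8 * vdd                      # -Vdep, with Vdep = -0.8 * VDD
    dv = 0.9 * vdd / steps
    t = 0.0
    for k in range(steps):
        vds = vdd - (k + 0.5) * dv       # VDS of the load = VDD - Vout
        if vds >= vov:                   # saturation: constant current
            i_load = 0.5 * beta_c * vov ** 2
        else:                            # linear region
            i_load = beta_c * (vov * vds - 0.5 * vds ** 2)
        t += cl * dv / i_load
    return t


# Normalized units, so the result is expressed in units of CL / (beta_C * VDD).
t_numeric = rise_time(1.0, 1.0, 1.0)
t_closed = 0.625 + math.log(15) / 0.8    # saturation part + linear part
print(round(t_numeric, 3), round(t_closed, 3))
```

By symmetry of the two integrals, the same routine applied with βS and Vt0 = 0.2VDD gives the fall time through the signal transistor.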

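NOTE.–

$$\frac{T_{Rise}}{T_{Fall}} = \frac{4C_L/(\beta_C V_{DD})}{4C_L/(\beta_S V_{DD})} = \frac{\beta_S}{\beta_C} = \beta_R$$

As βR ≠ 1 (βR ≥ 4), the inverter does not operate symmetrically (TRise ≠ TFall). Here, the rise time TRise is at least four times longer than the fall time TFall.

CMOS inverter: As the NMOS has already been described, let us now examine the PMOS transistor. Its threshold voltage is equal to Vt0P – λV = Vt0P – V (with λ = 1). Figure 2.59 shows the parameters that influence its drain–source current IDS. The two equations of this current, when the transistor operates in linear or saturation mode, are the following:

– Linear mode:

$$I_{DS,P} = \beta_P\left[(V_{GS,P} - V_{t0,P})V_{DS,P} - \frac{1}{2}V_{DS,P}^2\right]$$

As VS,P = VDD, we have:

$$I_{DS,P} = \beta_P\left[(V_{G,P} - V_{DD} - V_{t0,P})(V_{D,P} - V_{DD}) - \frac{1}{2}(V_{D,P} - V_{DD})^2\right]$$ [2.11]

– Saturation:

$$I_{DS,P} = \frac{\beta_P}{2}(V_{GS,P} - V_{t0,P})^2 = \frac{\beta_P}{2}(V_{G,P} - V_{DD} - V_{t0,P})^2$$ [2.12]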


Figure 2.59. Parameters influencing IDS in the PMOS transistor

We then have:

| NMOS transistor | PMOS transistor |
|---|---|
| VGS,N ≤ Vt0,N: transistor OFF | VGS,P ≥ Vt0,P: transistor OFF |
| VDS,N < VGS,N – Vt0,N: linear region | VDS,P > VGS,P – Vt0,P: linear region |
| VDS,N = VGS,N – Vt0,N: channel constriction | VDS,P = VGS,P – Vt0,P: channel constriction |
| VDS,N > VGS,N – Vt0,N: saturation | VDS,P < VGS,P – Vt0,P: saturation |

Table 2.8. Operating regions of NMOS and PMOS transistors

Table 2.9 indicates that the CMOS inverter operates mainly in regions A and E and does not operate in regions B, C and D except for a short moment (variation of VGS involving a short circuit in this very short time interval).

| N \ P | Off | Linear | Saturation |
|---|---|---|---|
| Off | X | A | X |
| Linear | E | X | D |
| Saturation | X | B | C |

Table 2.9. Complementary operation of NMOS and PMOS transistors


Influence of βN/βP on the transfer characteristics: Figure 2.60 shows that the higher this ratio, the faster the discharge of the capacitor at the output of the inverter. Similarly, the higher βP/βN, the faster the charge of this capacitor.

Figure 2.60. Transfer characteristics of the inverter

2.8. Interconnections

The considerable development of semiconductor technology has opened the way for a high level of integration, better efficiency (for a constant size of the die) and cost reduction (in 2003, 0.01 USD for 100,000 transistors). This development has nevertheless caused problems, particularly related to interconnections. Among these, it is worth noting the energy dissipation due to the parasitic effects induced by parallel interconnections (this problem was less significant with micron technologies), delay, noise and signal integrity. While the delay of logic gates is continuously improving (rapid switching of the transistor thanks to very low initial threshold voltage), the situation is different for interconnections. Though IBM improved the delay related to interconnections (around the year 2000) by replacing aluminum with copper (which has lower resistivity), this delay continues to grow due to increasingly dense integration. This is clearly shown in Figure 2.61, due to ITRS (International Technology Roadmap for Semiconductors), which indicates, besides the decreasing delay of gates depending on the technological process, the variations of the delay related to interconnections as a function of this same process and other parameters (nature of the metal, dielectric constant).

Figure 2.61. Delays of gates and interconnections. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

Due to the fact that global interconnections have not followed the scale factor, that the delay related to submicron interconnections has become more significant than the one related to logic gates, that the performances of the current complex systems are increasingly dictated by those of interconnections and that the progress – in terms of materials – is insufficient to meet the specifications, the efforts should therefore focus on the design and optimization of interconnections in terms of speed, power and noise. In this context, the following sections focus on the techniques for buffer insertion on an interconnection (signal regeneration and delay improvement), data encoding and decoding (power decrease due to parasitic parameters) and to interconnections carrying the clock signals (delay, power and surface of this type of interconnection must be taken into account in the global evaluation of the system to be designed). Still in relation to the clock signals, particular attention is given to the synthesis of PLLs in a complex system operating at several clock frequencies.


2.8.1. Synthesis of interconnections

2.8.1.1. Buffer insertion

2.8.1.1.1. Delay

The delay on an interconnection is proportional to the product resistance * capacitance. Many research works, among which Alpert et al. (1999), show that buffer insertion, besides having a positive impact on the regeneration of a clock signal on a long interconnection, improves the delay on this interconnection. Hence, by decoupling a high resistance and a high capacitance into several low resistances and low capacitances (Figure 2.62), the delay T2 becomes less than or equal to T1. This is clearly shown in Figure 2.63¹, which indicates that repeaters reduce delay. Nevertheless, this insertion must take the delay, surface and power consumption into account. This is in fact a combinatorial problem for which, in practice, a solution can only be found by a heuristic. The literature offers the reader excellent works on this subject. This book presents our own contribution to this problem. The space of solutions being very wide, our method relies on a heuristic (detailed in Chapter 3 of this book) that reconciles CPU time and representative coverage of this space. Our method involves placing a number of inverters (from 1 to a maximal value Nmax) in predetermined positions in order to minimize the power consumption, while meeting, if possible, the time and surface constraints (these constraints may be poorly estimated by the designer for this combinatorial problem). Let us consider, for this purpose, Figure 2.64, which illustrates the insertion of three inverters on an interconnection.

Figure 2.62. Buffer insertion on an interconnection. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

1 Source: ITRS, 2003.


Figure 2.63. Delays on interconnections with or without repeaters. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

Figure 2.64. Example of three inverters insertion on an interconnection. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

The maximum number of inverters to be inserted on the interconnection is linked to the length of the interconnection between the two nodes (source and destination) and to the distance fixed between two successive inverters. Considering Figure 2.64, our method involves placing 1, 2 or 3 inverters on this interconnection. The possibilities would then be those indicated in Figure 2.65. Let us note that while for the three inverters only one placement is possible, this is not the situation for one or two inverters. Generally speaking, there are

$$C_N^k = \frac{N!}{k!\,(N-k)!}$$

possibilities to insert k inverters (1 ≤ k ≤ N, with N being the maximum number of inverters to be inserted). Hence, the total number of combinations to be considered is equal to

$$\sum_{k=1}^{N} C_N^k = 2^N - 1$$

As confirmed by the results obtained, this insertion method makes it possible to reconcile CPU time (obtaining results within a reasonable CPU time) and a good coverage of the very wide space of solutions. Hence, for the insertion of at most three inverters, we consider seven possibilities (see Figure 2.65), of which the one that minimizes the power consumption, under fixed time and surface constraints, is retained.

Figure 2.65. Possibilities for the insertion of one to three inverters, at predetermined positions on an interconnection. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip
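The candidate placements counted above can be enumerated directly. The following sketch only illustrates the counting argument (it is not the heuristic of Chapter 3); the function name `insertion_candidates` and the numbering of positions from 1 to N are assumptions.

```python
from itertools import combinations
from math import comb


def insertion_candidates(n_max: int) -> list[tuple[int, ...]]:
    """All ways of placing 1..n_max inverters on n_max predetermined positions."""
    positions = range(1, n_max + 1)
    return [subset
            for k in range(1, n_max + 1)
            for subset in combinations(positions, k)]


candidates = insertion_candidates(3)
for subset in candidates:
    print(subset)

# One term C(N, k) per number k of inserted inverters
total = sum(comb(3, k) for k in range(1, 4))
print(len(candidates), total)
```

The number of candidates doubles with each additional position, which is why an exhaustive evaluation is only reasonable for a small maximal number of inverters.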

Since the calculation of the delay must take into account the time of logic gates and that of data transfer on the segments of interconnections, let us consider Figure 2.66. This shows all the possible delays following a given insertion (number of inverters and their positions).

Figure 2.66. Possible delays for inserting a maximum of four inverters

We then have: D12 = D1 + d12; D23 = D2 + d23; D34 = D3 + d34; …; D16 = D1 + d16 + D6

where:
– D1 (D6) is the delay of the gate at the source (destination) node;
– Di (i ≠ 1; i ≠ 6) is the delay of the inverter placed at position i (note that this delay depends on the characteristics – dimensions and threshold voltages of the transistors – of the considered inverter). Note that Di6 = Di + di6 + D6; 1 ≤ i ≤ 5;
– dij (i ≠ 6) is the delay on the interconnection segment linking nodes i and j.

This is illustrated by Figure 2.67a, showing the delays generated by the insertion of only one inverter at position 3, and Figure 2.67b, for the insertion of three inverters at three predefined positions. Then, we have the following:
– the global delay Dg for the insertion of the inverter at position 3 is: Dg = (D1 + d13) + (D3 + d35 + D5) = D13 + D35;
– insertion of three inverters: Dg = (D1 + d12) + (D2 + d23) + (D3 + d34) + (D4 + d45 + D5) = D12 + D23 + D34 + D45.

Figure 2.67. Delays involved in the insertion of one inverter at position 3 (a) or in the insertion of three inverters (b)


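Let us then consider the two types of delay:

– Inverter: A common model of the delay of an inverter (published in the specialist literature) is given by the following equation:

$$D_i = \frac{C_{Load}(i) * L}{\mu\,C_{ox}\,W_i\,(V_{DD}(i) - V_{th}(i))}\left[\frac{2V_{th}(i)}{V_{DD}(i) - V_{th}(i)} + \ln\left(\frac{4\,(V_{DD}(i) - V_{th}(i))}{V_{DD}(i)} - 1\right)\right]$$ [2.13]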

where:
– CLoad(i) is the capacitance at the output node of inverter i;
– L is the length of the transistor channel in the adopted technology;
– μ is the mobility of electrons or that of holes;
– Cox is the capacitance of the oxide;
– Wi is the size of the NMOS or PMOS transistor of the inverter i;
– VDD(i) is the supply voltage of the inverter i (in a multi VDD multi Vth design, the inverters may have different supply voltages);
– Vth(i) is the initial threshold voltage of the NMOS or PMOS transistor.

In fact, equation [2.13] concerns the delay of a transistor, but it can be applied to an inverter using the NMOS (PMOS) transistor for the discharge (charge) of CLoad. It is clear that the parameters of this equation have a direct influence on the surface, delay and power consumption.

– Interconnection: The adopted model of the delay on an interconnection is the one published in Rao (2005). It is given by equation [2.14]:

$$d_{ij} = \frac{1}{2}\,r_{wij}\,C_{wij}\,l_{wij}^2 + r_{wij}\,l_{wij}\,C_{bj}$$ [2.14]

where:
– rwij (Cwij) is the resistance (capacitance), per unit length, of the wire connecting the nodes i and j;
– Cbj is the input capacitance of the buffer (inverter) j;
– lwij is the length of the interconnection segment connecting nodes i and j (Figure 2.68).


Figure 2.68. Delay on an interconnection segment
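To see why splitting a long wire can shorten the global delay, the segment delay and the chained sum Dg can be sketched as follows. This is an illustration only: it assumes the usual distributed-RC form for the segment delay, 0.5·r·c·l² + r·l·Cb, and constant gate delays; the function names, node coordinates and unit values are hypothetical.

```python
def wire_delay(r: float, c: float, length: float, c_buf: float) -> float:
    """Delay of a wire segment of given length driving an input capacitance c_buf.

    r and c are the wire resistance and capacitance per unit length.
    """
    return 0.5 * r * c * length ** 2 + r * length * c_buf


def global_delay(x, gate_delay, c_in, r, c):
    """Sum of all gate delays and of all segment delays along the path.

    x[k]          : coordinate of node k (source, inserted inverters, destination)
    gate_delay[k] : delay of the gate placed at node k
    c_in[k]       : input capacitance of the gate placed at node k
    """
    total = sum(gate_delay)
    for j in range(1, len(x)):
        total += wire_delay(r, c, x[j] - x[j - 1], c_in[j])
    return total


# Wire of length 4 with r = c = 1; gate input capacitances neglected (assumption).
unbuffered = global_delay([0.0, 4.0], [0.0, 0.0], [0.0, 0.0], r=1.0, c=1.0)
one_inverter = global_delay([0.0, 2.0, 4.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0],
                            r=1.0, c=1.0)
print(unbuffered, one_inverter)
```

The quadratic term is what makes the split worthwhile: halving a segment divides its wire delay by four, so an inserted inverter pays off as long as its own delay stays below the saving.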

2.8.1.1.2. Surface

The surface (S) before the insertion of inverters is assumed known. The total surface would then be STot = S + SI, with SI being the additional surface due to the insertion of inverters. SI can then be estimated as follows:

$$S_I = \sum_{i=1}^{N_I} (W_i + K) * H = H\sum_{i=1}^{N_I} (W_i + K)$$ [2.15]

Thus, SI is estimated as the sum of surfaces of the rectangles including the NI inserted inverters. Let us note that:
– Wi is the width of the PMOS transistor of the inverter i (generally larger than that of the associated NMOS transistor, in order to compensate for the difference between the mobilities of electrons and holes);
– K is a distance that makes it possible to comply with the design rules of the targeted technology;
– H is the height of the rectangle including the inverter i (H depends essentially on the design rules, diffusion-metal contact and the length of the channels of NMOS and PMOS transistors – as a rule identical – of the inverter i).

2.8.1.1.3. Power consumption

The estimation of static and dynamic power consumption is done with our tool developed on the basis of a genetic algorithm and using the models indicated in section 2.4.2.3. Nevertheless, within multi VDD multi Vth design (reaching the expected compromise between delay and power consumption), equations [2.4] and [2.6] have been transformed as follows:

= 0.28125 ∗






∗ ∑#

,

∗ 10

,

⁄∝

+

,

∗ 10

,

⁄∝



,

[2.16]

where:
– NbN,i (NbP,i) is the number of NMOS (PMOS) transistors in the gate i;
– VtN,i (VtP,i) is the initial threshold voltage of NMOS (PMOS) transistors of gate i. VtN,i (VtP,i) may be equal to Vth,Low or Vth,High;
– VDD,i (VDD,Low or VDD,High) is the voltage supplying gate i;
– the other parameters are as defined in section 2.4.2.3.

It is worth noting that:
– the use, in the same circuit, of several initial threshold voltages of transistors would require several additional masks, which would render circuit manufacture more difficult. Because such an approach would lead to a very expensive circuit, the number of these threshold voltages is limited to 2 for each type of transistor;
– equation [2.16] indicates the use of the same initial threshold voltage for all the transistors of the same type belonging to the same logic gate (this facilitates the manufacture of the circuit);
– in the case of insertion of inverters on an interconnection, each gate is an inverter (containing only one transistor of each type).

= 0.5 ∗



,

∑|

|

,



,

+

,

∑|

|

,



,

[2.17]

where:
– EL (EH) is the set of logic gates supplied by VDD,L (VDD,H);
– CgL,i (CgH,i) is the capacitance at the output of the ith gate supplied by VDD,L (VDD,H);
– NgL,i (NgH,i) is the number of charges and discharges of CgL,i (CgH,i);
– the other parameters are as defined in section 2.4.2.3.

Our technique for the optimization of the total (static and dynamic) power consumption subject to time and surface constraints will be detailed in Chapter 3 of this book.


2.8.1.2. Data encoding and decoding

As semiconductor technology scales, power consumption due to parasitic parameters is no longer negligible, unlike in the case of micron technologies. Figure 2.69 shows, in addition to the intrinsic resistances, capacitances and inductances, the resulting parasitic parameters. These latter parameters involve additional power consumption that is far from negligible. Indeed, the literature states that the capacitance Cij, formed between two parallel interconnections i and j, is six times higher than the capacitance CLoad of each of the two interconnections in the 0.18 μm CMOS technology. This parallel capacitance becomes eight times higher in the 0.13 μm CMOS technology.

Figure 2.69. Intrinsic and parasitic electrical parameters generated in the implementation of interconnections in submicron technologies

Moreover, it has been shown that this power dissipation strongly depends on the data that flow through these interconnections. It is then a matter of using the data that generate the least power dissipation thanks to proper data encoding/decoding. In Sotriadis et al. (2003), a model for the calculation of power consumption during the data transfer on interconnections in submicron technologies, as well as an encoding/decoding technique are defined. The main points are presented in what follows, and we will then describe our small contribution to encoding/decoding, adopting the same model (in Chapter 3 of this book, the details of the implementation of our technique are shown).


Equation [2.18] has two parts: the energy consumption due to dynamic transitions of the intrinsic capacitance of a given interconnection (E1), and the consumption due to the parasitic capacitance formed between two neighboring and parallel interconnections (E2). = ∑ ( =

+  ∑

− ) −

− +

(

− =



) [2.18]

+

where: – (1 ≤ j ≤ N);

is the current (previous) bit on the jth interconnection

– − capacitance

indicates if there has been a transition of the  ;

– λ is a parameter due to technology (=8 in the 0.13 μm CMOS technology); ) indicates if the current bits on the two neighboring − –( interconnections are identical; − indicates if the previous bits on the two neighboring – interconnections were identical. NOTE.– In my opinion, the parentheses of equation [2.16] should be replaced by absolute values. Indeed, a 1 → 0 change of bit involves, similar to a 0 → 1 change, a transition and therefore a discharge of CL (instead of a charge in the 0 → 1 case) to be added (+ CL) and not deducted (- CL). That being said, let us consider a bus whose initial size is 2 bits. Paradoxically, it will be seen that by increasing the size to 3 bits, and thanks to proper encoding of data, the dissipated energy will be less significant. Let us consider Table 2.10 showing the K values obtained (energy = ∗ ). Hence, if the previous data (the previous state of the 3-bit bus) was, for example, 000 and the current state of the bus is 001, then the value of K is equal to 4. For each current state of the bus encoded on M bits

Basic Notions on the Design of Digital Circuits and Systems

135

(3 in this example), there are 2M new possible states of the bus. Since the original size of the bus was equal to N (2 in this example), such that N < M and therefore 2N < 2M, the idea is to use – for each previous state of the bus – the least energy-intensive 2N possibilities among the 2M ones. Thus, for the previous state 000, for example, the four possibilities of coding the original data (00, 01, 10, 11) at lower energy cost are the current states of the bus 000, 001, 100 and 111, as they generate the lowest values of K and therefore of energy (0, 4, 4 and 3 respectively). 000

001

010

011

100

101

110

111

000

0

4

7

5

4

8

5

3

001

0.5

0

10.5

4

4.5

4

8.5

2

010

0.5

7.5

0

1

7.5

14.5

1

2

011

1

3.5

3.5

0

8

10.5

4.5

1

100

0.5

4.5

10.5

8.5

0

4

4

2

101

1

0.5

14

7.5

0.5

0

7.5

1

110

1

8

3.5

4.5

3.5

10.5

0

1

111

1.5

4

7

3.5

4

6.5

3.5

0

Table 2.10. Value of

=



for each state transition
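As a cross-check, the entries of Table 2.10 can be regenerated numerically. The sketch below is not taken from the source. It assumes a coupling ratio λ = 3 (rather than the value of 8 quoted for the 0.13 μm technology), charges each line that ends at 1 for its own capacitance and for its coupling to both neighbors, and adds half a unit per falling line. The function name `k_value` and these modelling choices are assumptions, retained only because they reproduce the tabulated values.

```python
from itertools import product

LAMBDA = 3  # assumed coupling ratio, chosen because it matches Table 2.10


def k_value(prev: str, curr: str, lam: float = LAMBDA) -> float:
    """Normalized energy K of the transition prev -> curr (bit strings)."""
    p = [int(b) for b in prev]
    c = [int(b) for b in curr]
    n = len(p)
    d = [c[j] - p[j] for j in range(n)]  # +1 rising, -1 falling, 0 idle
    k = 0.0
    for j in range(n):
        if c[j] == 0:
            continue  # in this model, only lines ending at 1 draw energy
        k += d[j]  # intrinsic capacitance of line j
        if j > 0:
            k += lam * (d[j] - d[j - 1])  # coupling with the left neighbor
        if j < n - 1:
            k += lam * (d[j] - d[j + 1])  # coupling with the right neighbor
    k += 0.5 * sum(1 for x in d if x == -1)  # assumed cost of each falling line
    return k


states = ["".join(bits) for bits in product("01", repeat=3)]
for prev in states:
    print(prev, [k_value(prev, curr) for curr in states])
```

Each printed row corresponds to one previous state, with the columns in the same order as in the table.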

As shown in Table 2.10, a possible encoding is as follows, where each entry reads data: (previous state, current state):
00: (000, 000); 01: (000, 001); 10: (000, 100); 11: (000, 111)
00: (001, 000); 01: (001, 001); 10: (001, 011); 11: (001, 111)
00: (010, 000); 01: (010, 010); 10: (010, 011); 11: (010, 110)
00: (011, 000); 01: (011, 001); 10: (011, 011); 11: (011, 111)
00: (100, 000); 01: (100, 100); 10: (100, 101); 11: (100, 111)
00: (101, 000); 01: (101, 001); 10: (101, 101); 11: (101, 111)
00: (110, 000); 01: (110, 010); 10: (110, 110); 11: (110, 111)
00: (111, 000); 01: (111, 001); 10: (111, 110); 11: (111, 111)
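The encoding above acts as a codebook indexed by the previous bus state. A minimal sketch of the resulting encoder and decoder follows. The dictionary `CODEBOOK` transcribes the encoding listed above, while the function names (`encode`, `decode`, `transmit`, `receive`) and the initial bus state 000 are illustrative assumptions.

```python
# For each previous bus state, the code words assigned to data 00, 01, 10, 11
# (transcribed from the encoding listed above).
CODEBOOK = {
    "000": ("000", "001", "100", "111"),
    "001": ("000", "001", "011", "111"),
    "010": ("000", "010", "011", "110"),
    "011": ("000", "001", "011", "111"),
    "100": ("000", "100", "101", "111"),
    "101": ("000", "001", "101", "111"),
    "110": ("000", "010", "110", "111"),
    "111": ("000", "001", "110", "111"),
}
DATA = ("00", "01", "10", "11")


def encode(prev_state: str, data: str) -> str:
    """Return the 3-bit word to drive on the bus for a 2-bit data word."""
    return CODEBOOK[prev_state][DATA.index(data)]


def decode(prev_state: str, curr_state: str) -> str:
    """Recover the 2-bit data word from the previous and current bus states."""
    return DATA[CODEBOOK[prev_state].index(curr_state)]


def transmit(words, state="000"):
    """Encode a sequence of data words, tracking the bus state."""
    sent = []
    for word in words:
        state = encode(state, word)
        sent.append(state)
    return sent


def receive(bus_states, state="000"):
    """Decode a sequence of bus states back into data words."""
    words = []
    for curr in bus_states:
        words.append(decode(state, curr))
        state = curr
    return words


print(transmit(["10", "01", "11", "00", "01"]))
```

Both sides only need to remember the previous bus state, which is why decoding requires no additional information.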


In other words, if, for example, the current state is 000 and the actual data to send is 10, then the data to transmit on the 3-bit bus is 100. Decoding is straightforward: if the previous and current states are 000 and 100 respectively, then the actual data to recover is 10. Before presenting other, more interesting encoding techniques, let us consider the energy gain obtained by increasing the size of the original bus. This is shown in Table 2.11, which indicates, for example, that the value of K falls from 9.75 for an initial data size of 6 bits to only 6.57 (4.78) if a 7-bit (9-bit, respectively) bus is used.

| Initial size \ Bus size | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2.25 | 1.50 | 1.28 | 1.13 | 1.00 | 0.90 | 0.82 | 0.75 | 0.70 | 0.66 | 0.62 |
| 3 | | 4.13 | 2.48 | 2.16 | 1.92 | 1.73 | 1.56 | 1.42 | 1.30 | 1.20 | 1.12 |
| 4 | | | 6.00 | 3.77 | 3.15 | 2.80 | 2.53 | 2.30 | 2.11 | 1.94 | 1.80 |
| 5 | | | | 7.88 | 5.13 | 4.23 | 3.75 | 3.40 | 3.12 | 2.87 | 2.66 |
| 6 | | | | | 9.75 | 6.57 | 5.41 | 4.78 | 4.34 | 3.99 | 3.69 |
| 7 | | | | | | 11.63 | 8.01 | 6.65 | 5.87 | 5.33 | 4.91 |
| 8 | | | | | | | 13.50 | 9.50 | 7.94 | 7.01 | 6.37 |

Table 2.11. Value of K for various bus sizes
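The way such averages arise can be sketched as follows. This is an assumption, not the author's stated procedure: for every previous state of an M-bit bus, the $2^N$ cheapest next states are retained, and K is averaged over all retained transitions. The transition cost reuses an assumed model (λ = 3, half a unit per falling line) that is not given in the source. Under these assumptions, the sketch returns 2.25, 1.5 and 4.125 for the (N, M) pairs (2, 2), (2, 3) and (3, 3), in line with the corresponding table entries after rounding.

```python
from itertools import product


def k_value(prev, curr, lam=3):
    """Assumed normalized transition energy between two bit tuples."""
    n = len(prev)
    d = [curr[j] - prev[j] for j in range(n)]
    k = 0.0
    for j in range(n):
        if curr[j] == 1:
            k += d[j]
            if j > 0:
                k += lam * (d[j] - d[j - 1])
            if j < n - 1:
                k += lam * (d[j] - d[j + 1])
    return k + 0.5 * sum(1 for x in d if x == -1)


def average_k(data_bits, bus_bits, lam=3):
    """Mean K when each previous state keeps its 2**data_bits cheapest successors."""
    states = list(product((0, 1), repeat=bus_bits))
    keep = 2 ** data_bits
    total = 0.0
    for prev in states:
        costs = sorted(k_value(prev, curr, lam) for curr in states)
        total += sum(costs[:keep])
    return total / (len(states) * keep)


for n_bits, m_bits in [(2, 2), (2, 3), (3, 3)]:
    print(n_bits, m_bits, average_k(n_bits, m_bits))
```

The exhaustive enumeration grows as $4^M$, so this sketch is only practical for small bus sizes.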

This being the case, it can be seen that increasing the bus size stops reducing energy beyond a certain size, where dynamic consumption starts to outweigh the consumption due to parasitic capacitances. This limit, considered together with the surface parameter, enables the designer to choose an adequate bus size. The encoding technique presented above can be improved so that the most frequently used data are favored in the encoding, in view of a further reduction of energy consumption. For the previous state 000, the previous encoding generates the following K values: 00: (000, 000) → 0; 01: (000, 001) → 4; 10: (000, 100) → 4; 11: (000, 111) → 3


Let us assume that the data, ranked from the most to the least frequently transmitted, are 01, 10, 11, 00. A proper choice would be the following encoding: 01: (000, 000) → 0; 00: (000, 001) → 4; 11: (000, 100) → 4; 10: (000, 111) → 3

This encoding relies on the periodic determination of the data transmission frequencies. Rather than being static, it is dynamic in the sense that it can vary at each period. Nevertheless, the probabilities of occurrence of the data in packet (n + 1) are estimated from those of packet n, which is not necessarily correct. For this reason, we have slightly modified this technique by first memorizing the data packet before transmission. The occurrences of each data word are then calculated exactly (not estimated) in order to decide the appropriate encoding before data transmission. Such a technique makes it possible to further reduce the energy consumption, but at the expense of a slight increase in the surface for data memorization (buffering) and of some delay before the packet transmission, due to this memorization and to the exact calculation of occurrences.

Table 2.12 shows, for various data cases, the results obtained with various techniques (CWP: no encoding is used; CFP: static encoding; CDP: dynamic encoding with an estimation of occurrences; EC: dynamic encoding with an exact calculation of occurrences). For an initial size of 8 bits, this table shows that:
– all the techniques consume less energy than the technique without encoding, with the exception of static encoding (for 10- and 16-bit buses). This is explained by an improper encoding that does not reduce parasitic consumption, while increasing dynamic consumption;
– the exact calculation of data occurrences yields the best result for each bus size;
– the increase in bus size may improve energy consumption thanks to the reduction of parasitic consumption when this increase does not strongly influence the dynamic consumption (EC: 9 bits and 11 bits);
– a specific data case may increase dynamic consumption without significantly reducing parasitic consumption (EC: 9 bits and 10 bits).


| Bus size | CWP | CFP | CDP | EC |
|---|---|---|---|---|
| 9 | 1527.5 | 1494.5 | 1039.5 | 494 |
| 10 | 1360.5 | 1362.5 | 1003.5 | 498.5 |
| 11 | 1292.5 | 1271.0 | 986 | 481 |
| 12 | 1288 | 1252.0 | 971.5 | 471.5 |
| 13 | 1248 | 1201 | 943.5 | 472.5 |
| 14 | 1201 | 1200.5 | 937 | 468 |
| 15 | 1222 | 1170.5 | 933.5 | 466 |
| 16 | 1117.5 | 1135.5 | 941 | 452.5 |
| 17 | 1163 | 1128.5 | 913 | 459.5 |

Table 2.12. Comparative study of various encoding techniques
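The exact-count variant (EC) can be illustrated for a single previous state. The sketch below is a simplification and not the implementation detailed in Chapter 3: it buffers a packet, counts the occurrences of each data word, and gives the cheapest code words to the most frequent words. The function names, the sample packet and the restriction to the previous state 000 (with the K values 0, 4, 4 and 3 quoted earlier) are assumptions made for illustration.

```python
from collections import Counter


def assign_codewords(packet, costs):
    """Give the cheapest code words to the most frequent data words.

    Only words that occur in the buffered packet receive a code word.
    """
    counts = Counter(packet)
    by_frequency = [word for word, _ in counts.most_common()]
    by_cost = sorted(costs, key=costs.get)  # cheapest code word first
    return dict(zip(by_frequency, by_cost))


def packet_cost(packet, mapping, costs):
    """Total K of a packet if every word were sent from the same previous state."""
    return sum(costs[mapping[word]] for word in packet)


# K values of the four code words retained for the previous state 000.
COSTS_FROM_000 = {"000": 0, "001": 4, "100": 4, "111": 3}

# Hypothetical buffered packet in which 01 is the most frequent word.
packet = ["01"] * 4 + ["10"] * 3 + ["11"] * 2 + ["00"]

static = {"00": "000", "01": "001", "10": "100", "11": "111"}
adaptive = assign_codewords(packet, COSTS_FROM_000)

print(adaptive)
print(packet_cost(packet, static, COSTS_FROM_000),
      packet_cost(packet, adaptive, COSTS_FROM_000))
```

A complete encoder would repeat this assignment for every previous state and would also have to communicate the chosen mapping to the decoder, which this sketch leaves out.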

2.8.1.3. Distribution of clocks and PLLs

Clock distribution poses a difficult problem in the design of synchronous systems. A clock signal reaching a port of one module earlier than a port of another module may cause a system malfunction. Various design techniques are used to solve this skew problem, such as buffer insertion on a long interconnection or H-tree distribution of the clock signal. However, such techniques are only effective for less complex systems. For complex systems, a GALS (globally asynchronous, locally synchronous) type of design is the option. Hence, at the local level, the clock signal can be synchronized more easily. At the global scale of the system, communication protocols (such as handshaking) are used instead for asynchronous communications.

Still concerning clock signals, which are generated by PLLs (Phase-Locked Loops), there is a compounded problem posed by the use of PLLs in a system-on-chip (SoC). Since such a system may contain a significant number of IPs (Intellectual Property blocks: processors, memories, DSPs, etc.) operating at various frequencies, as many PLLs as individual frequencies would have to be designed. Besides the surface occupied by these many PLLs, there is a problem of power consumption. As an example, a PLL generating a frequency of 6 GHz consumes no less than 11 mW, which is enormous for a single PLL. In general, the power consumed by a PLL is equal to (Lee et al., 2012):

2 + 1.5 * F (in mW)


where:
– 2 is a power value (2 mW);
– 1.5 is an energy constant;
– F is the frequency expressed in GHz (so that F = 6 GHz gives the 11 mW quoted above).

A relatively recent work (Reuben et al., 2014) deals with reducing the number of PLLs by using a single PLL (and dividers, which are not large power consumers) for several frequencies. This relies on the least common multiple (LCM). Hence, frequencies of, for example, 270 MHz, 45 MHz, 540 MHz and 180 MHz can be obtained with a single PLL generating 1,080 MHz (Figure 2.70) by dividing respectively by 4, by 24 (dividing the output of the divider by 4 by a further 6 amounts to dividing by 24), by 2 and by 6 (dividing the output of the divider by 2 by a further 3 amounts to dividing 1,080 by 6 to obtain 180). Note that the power consumed by a divider is clearly below that consumed by a PLL, as indicated by the following equation (Reuben et al., 2014):

Pdiv = F * 1.8 * Number of dividers / 45,000

where 1.8 is an energy constant. Note that using a single PLL in Figure 2.70 would have required the generation of a very high initial frequency, that of the LCM of all the expected frequencies (generally, a PLL generating a frequency above 2.5 GHz is not recommended due to its high power consumption). Moreover, since each IP operates within an interval of frequencies, it is not possible to find the LCM exhaustively within a reasonable CPU time. Indeed, let us consider a system composed of 50 IPs, each of which operates at a frequency within an interval containing 10 values. The total number of cases to be considered would then be equal to $10^{50}$, among which the lowest LCM would be selected (the lower the frequency generated by a PLL, the lower its power consumption). If the calculation of a single LCM requires 1 ns, the CPU time would then be equal to $10^{50}$ ns, that is, about $3.17 \times 10^{33}$ years. It is obvious that such a problem can only be solved using an approximate method. As this relates to computer science and operational research, it will be revisited in Chapter 3 of this book, where our own heuristic (Toubaline et al., 2018), whose results compare favorably with those obtained by Reuben et al. (2014), will be presented.

[Figure 2.70: two PLLs followed by frequency dividers. PLL1 = 900 MHz (dividers /4, /2, /3, /3) provides 225 MHz, 450 MHz, 150 MHz and 50 MHz; PLL2 = 1,080 MHz (dividers /4, /6, /2, /3) provides 270 MHz, 45 MHz, 540 MHz and 180 MHz.]

Figure 2.70. Frequency generation
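The figures quoted in this subsection can be checked with a few lines of arithmetic. The sketch below assumes that F is expressed in GHz in the PLL power formula (the unit for which 6 GHz yields the quoted 11 mW), takes the target frequencies from Figure 2.70, and uses helper names (`lcm`, `pll_power_mw`) that are not part of the source.

```python
from functools import reduce
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a // gcd(a, b) * b


def pll_power_mw(freq_ghz: float) -> float:
    """PLL power in mW, with the frequency assumed to be in GHz."""
    return 2 + 1.5 * freq_ghz


# Frequencies served by PLL2 in Figure 2.70 and the resulting division ratios.
pll2_mhz = 1080
targets_pll2 = [270, 45, 540, 180]
ratios = [pll2_mhz // f for f in targets_pll2]

# A single PLL for both groups would need a common multiple of all targets.
targets_pll1 = [225, 450, 150, 50]
single_pll_mhz = reduce(lcm, targets_pll1 + targets_pll2)

# One PLL per frequency versus one shared PLL (divider power not included).
separate_mw = sum(pll_power_mw(f / 1000) for f in targets_pll2)
shared_mw = pll_power_mw(pll2_mhz / 1000)

# Exhaustive search: 50 IPs with 10 candidate frequencies each, 1 ns per LCM.
cases = 10 ** 50
years = cases * 1e-9 / (365.25 * 24 * 3600)

print(ratios, single_pll_mhz, separate_mw, shared_mw, years)
```

The smallest common multiple of the eight target frequencies is 2,700 MHz, which exceeds the 2.5 GHz limit mentioned above and is consistent with the use of two PLLs in the figure.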

2.8.2. Synthesis of networks-on-chip

With technological progress, it is now possible to integrate millions of transistors on the same chip. In addition, the delays of logic gates are increasingly short because of the rapid switching of transistors due to their increasingly reduced threshold voltages. Unfortunately, besides these advantages, designers have to cope with many problems. Here, we only mention the problem related to the context of this synthesis. The ITRS curves (International Technology Roadmap for Semiconductors, Figure 2.61) related to the delays of logic gates and interconnections clearly show that the first type of delay constantly improves as the transistor channel length decreases. This is unfortunately not the case for interconnections: global communication becomes increasingly slow as the semiconductor technology scales. Though partial solutions such as buffer insertion or the replacement of aluminum with copper (IBM 2000) have been found, this is still a problem. Moreover, as current systems are increasingly complex and are characterized by a high degree of communication, it has become impossible to use a conventional bus, because it is a critical and shared resource, which significantly reduces the bandwidth. An intermediate solution has been found (use of crossbars), but it is nevertheless insufficient. A better solution had to be chosen: the design of an application-specific network meeting the constraints of bandwidth, power consumption and surface.

For an architecture hosting dynamic applications, the tasks of a given application are executed on a network-on-chip-based architecture in the following stages:
– scheduling of tasks;
– task assignment to IPs (processors);
– IP assignment to the network tiles;
– definition of a routing strategy for carrying the data from the IP-emitter to the IP-receiver.

For the applications to be implemented on an ASIC (Application-Specific Integrated Circuit), this involves finding the configuration that meets the bandwidth (number of bits transmitted per second), surface and power consumption constraints. Hence, Figure 2.71 (Murali et al., 2006) shows two configurations, among the many possible, for a MOTOROLA application, which clearly illustrates the problem. The architectures proposed for this type of application rely on several methodologies, the most prominent of which rely on fat trees, mesh networks and variants of these (Figure 2.72).

For mesh network architectures and their variants, the main problem is related to routing. Indeed, two or more packets of data may compete for one or more segments of interconnections (a shared and critical resource). Routing strategies are then adopted in combination with other techniques, such as the decomposition of a data packet into flits and the sending of these flits along various paths. This often poses the problem of receiving the flits in order, which is not always easy.

The other type of architecture involves the use of fat trees, which provide a path for each data transfer. This solves the conflict between the various transfers but, unfortunately, it generates a major problem, namely that of density. Indeed, Figure 2.72 shows that for only 8 IPs, the density is such that the number of parasitic parameters is very significant, which is detrimental to power consumption, surface and speed. The use of such a topology for an SoC containing a significant number of IPs is certainly not advisable.


Our contribution to this issue has been to exploit the fact that the application is known in advance, as is the case when it is implemented on an ASIC. To this effect, let us consider Figure 2.73, which shows a control data flow graph (CDFG).

Figure 2.71. Problem of network-on-chip design. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

Figure 2.72. Architectures implementing networks-on-chip


Figure 2.73. Possible traces in a CDFG

In this figure, depending on the logical conditions A and B, only one trace among all the possible ones will be executed. In the case of a large application whose execution requires several IPs, the identification of these traces, as well as the knowledge of data transfer times, would avoid the following two extreme cases:
– using only one bus, which is therefore a critical and shared resource, generates a contention and arbitration problem that becomes more serious as the number of transfers increases. Moreover, as only one transfer can be made at a time, the bandwidth becomes limited;
– using an interconnection for each data transfer (point-to-point link) is a big advantage for the speed of data transfers, but at the expense of the surface and power consumption.

Our methodology (Mahdoum, 2012a) is an alternative to these two extreme cases. Let us consider Figure 2.74, which shows the execution of the traces of a given application on several IPs. It also indicates the possible data transfers occurring at specific times. These times can be known from calculation models that take into account the time needed to execute instructions and data transfers, or determined during the simulations involved in the design of the integrated system implementing the given application (a more practical and interesting possibility). With this latter possibility, the execution times of the instructions are retrieved by the computer-aided design tool of the network-on-chip, in order to determine a configuration (number of interconnections and their sizes in bits, connectivity between IPs, etc.) that makes it possible to meet the specifications. An effective exploration of the space of solutions (which is very wide, as Chapter 3 of this book shows) makes it possible to determine the best feasible configuration. For this purpose, we have developed two heuristics:
– time optimization while meeting surface and power consumption constraints: a heuristic adapted to real-time system implementation;
– power consumption optimization while meeting surface and time constraints: a heuristic adapted to the implementation of systems for which power consumption is a critical parameter, such as embedded systems.

Figure 2.74. Traces of an application being executed by several IPs


In Figure 2.74 (where Tij denotes a data transfer from IPi to IPj), assume that the logical conditions A, B and C are true (“1” is the TRUE logical constant). It is clear that:
– transfers T12 and T32 can share the same interconnection without any conflict, since the time intervals required for their data transfers are disjoint. Nevertheless, transfers T12 and T32 cannot share the same interconnection if the logical conditions not(A) and C are true, because in this case their time intervals overlap;
– T12 and T42 cannot share the same interconnection, since their time intervals overlap whether the logical condition A is true or false.

Let us consider a CDFG containing N control structures (if then else). The total number of traces would then be equal to $M = \prod_{i=1}^{N} N_i$, with $N_i$ being the number of traces induced by the ith control structure. In the specific case where there are no nested control structures, the number of traces would be equal to $2^N$. Developing an efficient communication network for the targeted application requires precise time intervals, which cannot be known without scheduling the operations on each trace. Though the computational complexity of this scheduling is O(M), it can be conducted within a reasonable CPU time, at the expense of a compromise on the number of cycles. Indeed, as mentioned in Bergamashi (1997), it is possible to significantly reduce the number of traces within a reasonable CPU time, thus enabling the scheduling of a possible CDFG within a reasonable CPU time (in table II of the above-referred article, on page 91, the least favorable case, involving the reduction of a significant number of traces into one, clearly shows this). Hence, it is possible to consider all the possible scenarios for data transfer during the execution of the application and to determine, in advance, the signals that control the transfers operated during this execution.
Note that our methodology is dynamic because it anticipates the execution of the application. The question is how to effectively determine the interconnections that ensure all the data transfers without any conflict, and how to control these transfers. For this purpose, let us reconsider Figure 2.74. It is clear that a single interconnection is not sufficient for enabling all the transfers without conflict. It can easily be verified that four sets of interconnections are required to eliminate any conflict, while maximizing the number of parallel transfers: S1 = {T14, T42}, S2 = {T24, T23}, S3 = {T12}, S4 = {T32}. Indeed, the time intervals T12 and T14 do not overlap (T12 and T14 can therefore share the same interconnection without conflict). T24 and T23 can also share the same interconnection because they are exclusive (two exclusive transfers can share the same interconnection even though their time intervals “overlap”). The time intervals T12 and T14 do not overlap, but those of T12 and T42 do: therefore, T12 cannot belong to S1.

It is worth noting that each set of interconnections has a certain size in bits (16 bits, 32 bits, 64 bits, etc.) depending on the data size and on the bandwidth. For example, if the size of a given interconnection cannot be equal to 64 bits (due to surface and/or power consumption constraints) when a data packet of 64 bits must be transferred, this packet can be transmitted in two successive sub-packets of 32 bits each (assuming that the interconnection can take this size), where the 32 most significant bits are transmitted first, followed by the other bits. For this example, the minimal number of sets enabling the maximization of the number of parallel transfers has been easily determined. The situation is different in practice because this problem is combinatorial (it is precisely NP-hard), and can only be solved by an adequate method, which is presented in Chapter 3 of this book.

This being said, once the interconnections and their sizes have been determined, it is important to connect them conveniently to the input/output (I/O) ports of the various IPs, depending on the allocation of data transfers to interconnections. In Figure 2.75, transfers T14 and T12 are assumed to relate to the same port of IP1. The sets Si being those previously defined, the transfer T14 is done via the 1st interconnection (which belongs to S1), while T12 is done through the 3rd interconnection (which belongs to S3). Note that each of these interconnections has a certain size (8 bits, …, 64 bits) depending on the bandwidth, surface and power consumption constraints.
Once the I/O ports have been correctly connected to the interconnections, the various switches (implemented by simple CMOS pass transistors) must be controlled so that they authorize or prohibit transfers. Each source or drain node of the pass transistors is connected to an I/O port and to an interconnection, while the gate of the NMOS (PMOS) transistor is driven by a signal (complemented signal, respectively) emitted by the control part that manages the various data transfers according to the application protocol. Figure 2.76 shows the details of switch No. 6: the source node is connected to a port of IP2; its drain is connected to interconnection S2; the gate of the NMOS transistor is controlled by the signal (SOP24 + SOP23) to enable a transfer from IP2 to IP4 (when SOP24 = 1) and from IP2 to IP3 (when SOP23 = 1) on the same interconnection, without possibility of conflict, because these two transfers are exclusive.
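The sharing rule described above (two transfers may use the same interconnection if their time intervals are disjoint or if they belong to mutually exclusive traces) can be sketched with a first-fit grouping. This is only an illustration: the time intervals and guards below are invented, since Figure 2.74 is not reproduced here, and the first-fit strategy is a stand-in for the heuristic presented in Chapter 3, not that heuristic itself.

```python
def exclusive(guard_a, guard_b):
    """True when the two guards require opposite values of some condition."""
    return any(var in guard_b and guard_b[var] != val
               for var, val in guard_a.items())


def overlap(a, b):
    """True when the half-open intervals a and b intersect."""
    return a[0] < b[1] and b[0] < a[1]


def conflict(t1, t2):
    """Two transfers conflict if they can be active at the same time."""
    return (overlap(t1["interval"], t2["interval"])
            and not exclusive(t1["guard"], t2["guard"]))


def group_transfers(transfers):
    """First-fit assignment of transfers to conflict-free interconnections."""
    groups = []
    for name, transfer in transfers.items():
        for group in groups:
            if not any(conflict(transfer, transfers[other]) for other in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical data: intervals in arbitrary time units, guards on condition B.
TRANSFERS = {
    "T14": {"interval": (0, 2), "guard": {}},
    "T42": {"interval": (3, 5), "guard": {}},
    "T24": {"interval": (1, 5), "guard": {"B": True}},
    "T23": {"interval": (1, 5), "guard": {"B": False}},
    "T12": {"interval": (4, 7), "guard": {}},
    "T32": {"interval": (4, 6), "guard": {}},
}

print(group_transfers(TRANSFERS))
```

With these invented intervals, the grouping yields four sets, the same number as in the example above. First-fit gives no optimality guarantee in general, which is consistent with the problem being NP-hard.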


Figure 2.75. Data transfers without conflict

Figure 2.76. Details of a data transfer

Our methodology has subsequently been extended to 3D architectures (without, however, any 3D circuit having been manufactured). Indeed, this type of architecture is a complementary solution to the problem of interconnections, to the extent that delay and energy consumption are widely dominated by interconnections. The first 3D architectures involved device layers that were manufactured beforehand and then connected by a type of contact known as through-silicon vias (TSVs). This solution has unfortunately induced an important problem related to these contacts. Indeed, they involve a non-negligible additional surface, as well as a significant parasitic capacitance (from tens to hundreds of fF). A recent and more interesting solution (Figure 2.77) circumvents this problem as follows:
– device layers are manufactured sequentially;
– alignment is accurate (the step is of the order of only 10 nm during the lithography phase, while it is of the order of 0.5 µm for the technology using TSVs);
– the highest-level layer can be very thin (about 30 nm only);
– the characteristics of MIVs (Monolithic Inter-tier Vias), which are the contacts used in this technology, are clearly more interesting than those of TSVs (dimensions of 70 * 140 nm versus 3,000 * 30,000 nm, and a very low induced capacitance, below 0.1 fF).

Let us recall that, unlike the point-to-point methodology, our architecture is scalable. Moreover, our routing technique is defined in advance, at design time. At run time, a simple configuration of bits applied to the gates of the pass transistors makes it possible to prohibit or authorize transfers on predefined interconnections, with no conflict. Hence, our network-on-chip design methodology barely weighs on the time budget dedicated to the user application (for the other methodologies studied, the routing algorithm is run during the run time of the application itself), which qualifies it as a candidate for real-time system design.

Figure 2.77. Monolithic 3D architecture with three metal levels. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip


Because the interconnections are materialized as resistances, capacitances and inductances, their influence on the power consumption and data transfer time is obvious. In order to take this aspect into consideration, we have opted for placing IPs in such a way that the interconnection segments are as short as possible (the IPs that exchange data most frequently are placed, to the extent that it is possible, side by side). This placement problem is also combinatorial and can be effectively solved only by an appropriate heuristic, which is presented in Chapter 3 of this book. Since this is a combinatorial problem (the computational complexity is O(N!), with N being the number of IPs), it may happen that IPs that frequently exchange data are far from one another in a 2D architecture (as their places are occupied by other IPs). A 3D architecture is a solution to this problem to the extent that such IPs can be placed near the concerned IPs, on another plane of the 3D architecture. Figure 2.78 shows a placement possibility assuming that IPs Z and X frequently communicate with IP Y (with Z communicating strongly with W).

[Figure 2.78: 2D placement with Y, Z and W side by side and X further away; 3D placement with X on Plane2 and Y, Z and W on Plane1.]

Figure 2.78. 2D and 3D architectures

Let us note that the use of additional contacts (MIVs) in a 3D architecture is not a significant disadvantage to the extent that additional time and power consumption involved by an MIV are only of the order of 3 ps and 0.2 fW, respectively. On the other hand, there will be a gain on these two parameters thanks to a more interesting placement of IPs with respect to 2D architectures (for example, the placement of IP X – Figure 2.78). Indeed, the problem of IP placement on a 3D architecture is a relaxation of the one concerning 2D architectures. Formally, S2D ⊆ S3D, with S2D and S3D being respectively the sets of solutions for the 2D and 3D placements. It is clear that any optimal solution s* belonging to S2D also belongs to S3D (Figure 2.79a), but the reverse may be false (Figure 2.79b).
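The relaxation argument can be stated compactly. Writing f for the placement cost being minimized (a symbol introduced here for illustration, not taken from the source), the inclusion of the solution sets gives:

```latex
S_{2D} \subseteq S_{3D}
\;\Longrightarrow\;
\min_{s \in S_{3D}} f(s) \;\le\; \min_{s \in S_{2D}} f(s)
```

In other words, the best 3D placement is never worse than the best 2D placement, and it is strictly better whenever no optimal 3D solution belongs to S2D.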


Figure 2.79. Relaxation of the problem of IP placement

Assuming that our algorithm for the placement of IPs on a 2D architecture yields the order 4 2 5 10 3 9 7 6 8 1, these IPs will be placed in a 3D architecture taking into account this order and the dimensions of the architectural planes. Let us finally note that when a plane becomes full, the next IP is placed on the next plane, in the same column as the previously placed IP (meandered placement on a given plane, as shown in Figure 2.80).
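One plausible reading of this meandered placement is sketched below. It is an interpretation rather than the author's algorithm: the plane dimensions (2 rows by 3 columns) are hypothetical, and reversing the serpentine path on every other plane is an assumption made so that the first IP of a new plane sits directly above the last IP of the previous one.

```python
def serpentine(rows, cols):
    """Cells of one plane visited row by row, alternating the direction."""
    path = []
    for r in range(rows):
        columns = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        path.extend((r, c) for c in columns)
    return path


def place_3d(order, rows, cols):
    """Map each IP to (plane, row, column), following the given 2D order."""
    base = serpentine(rows, cols)
    per_plane = rows * cols
    placement = {}
    for index, ip in enumerate(order):
        plane, k = divmod(index, per_plane)
        # Odd planes walk the path backwards, starting above the last cell used.
        path = base if plane % 2 == 0 else base[::-1]
        placement[ip] = (plane, *path[k])
    return placement


order = [4, 2, 5, 10, 3, 9, 7, 6, 8, 1]
placement = place_3d(order, rows=2, cols=3)
for ip in order:
    print(ip, placement[ip])
```

With these dimensions, IP 9 closes the first plane at row 1, column 0, and IP 7 opens the second plane at the same row and column, so the two can be linked by a single vertical contact.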

Figure 2.80. Placement of IPs on a 3D architecture (vertical bars are implemented by MIV contacts)


In Figure 2.80, there are as many MIVs as transfers between two distinct planes of the architecture. We have improved this placement by minimizing the number of MIVs while determining their adequate positions (Toubaline et al., 2017). This work relies essentially on the adequate determination of clusters of IPs sharing the same MIV, which is placed at the “center of gravity” of each cluster.

2.9. Conclusion

In this chapter, we have described the main notions of the design of integrated circuits and systems in direct relation to the content of Chapter 3, which deals with the CAD of integrated circuits and systems. We have relied on the conventional diagram of the silicon compilation, highlighting the synthesis, analysis and verification tasks at each design level. In what follows, several techniques for the computer-aided design of integrated circuits and systems are presented. These techniques involve the notions presented in the first two chapters.

3 Case Study: Application of Heuristics and Metaheuristics in the Design of Integrated Circuits and Systems

3.1. Introduction

Since integrated systems have grown in complexity, the methods used for their design increasingly rely on computer science and operational research. However, even though current processors are characterized by high performance in terms of executing the instructions of a given program, it is still true that a conventional resolution of certain problems does not yield the expected results within a reasonable CPU time. Determining the computational complexity of a given (tractable or intractable) problem is, in fact, the basis for deciding whether to implement an exact algorithm or to opt for an approximate method. For this purpose, the notions of computational complexity and problem classification were examined in Chapter 1. Since the exact or approximate algorithm for a given problem cannot be determined without knowledge of that problem, Chapter 2 focused on clarifying this issue and defining its various aspects. We hope that these two chapters will enable the reader to better approach Chapter 3, which presents the exact or approximate methods in the context of a given problem. It will be seen that many of the problems approached, which are related to the design of integrated circuits and systems, are not tractable, and therefore need to be solved by approximate methods. Since these methods face the double difficulty of running in polynomial time (to yield the result within a reasonable CPU time) while providing a high-quality solution (near-optimal or even optimal in some cases), proposals will be advanced, case by case, in order to address this double difficulty.

3.2. System level

3.2.1. Synthesis of systems-on-chip (SoCs) with low energy consumption

Current applications are so complex and have such a degree of communication that it is no longer possible to implement them in a single operational part and a single control part. Indeed, such an implementation would generate a significant density of components involving, in turn, the appearance of many parasitic parameters, which would be at the expense of the performance of the integrated system and its power consumption. A functional partitioning is then required, to cope with the congestion of communications while addressing the aspects of performance and energy or power consumption (P = E * F, with F being the frequency). In contrast to other methods, which fix the number of subsystems implementing the concerned application, our method is characterized by the automatic determination of the number of these communicating subsystems. It is worth noting that we refer here to a hardware functional partitioning that follows the first partitioning, which determines the part(s) to be implemented in software or hardware. Our partitioning is characterized by a double assignment: the assignment of a given task to a processing unit (processor or ASIC) and the assignment of each unit to one of the subsystems obtained. As already mentioned in Chapter 2, the tasks that strongly communicate are assigned to the same subsystem so that communications most often occur locally (through a network-on-chip that is specific to each subsystem considered), which makes it possible to reduce the number of electrical parameters (resistances, capacitances and inductances), both intrinsic and parasitic (Figures 2.14 and 2.15).
It is worth noting that the rare communications between the tasks belonging to different subsystems are implemented through the global bus and that the type of unit (ASIC or processor) executing a given task will be automatically determined, to optimize the power consumption while meeting the time constraint.


3.2.1.1. Finding out the number of subsystems and task assignment

The number of subsystems is determined as follows:
– Determine the graph to be colored based on the graph of tasks (Figure 2.14). Note that the nodes associated with two tasks whose communication is poor or absent are connected by an edge. This provides them with two different colors and therefore assigns their corresponding tasks to two different subsystems.
– Color the graph. The number of subsystems is the number of colors, and the tasks whose associated nodes have obtained the same color are assigned to the same subsystem.

The graph coloring optimization problem can be formulated as follows:

Given a graph $G = (V, E)$ and a mapping $F: V \rightarrow \{1, 2, \ldots, K\}$,
minimize $K$ such that $F(u) \neq F(v) \;\; \forall (u, v) \in E$

We have already seen in Chapter 1 that the decision problem (P1) associated with this optimization problem (P2) is NP-complete. As there is a Turing transformation from (P1) to (P2), (P2) is therefore NP-hard. Since it is still conjectured that P ≠ NP (which is one of the mathematical problems that have not yet been solved, and a reward has been offered for its resolution – see http://www.claymath.org), no polynomial algorithm is yet known that yields the exact solution of the graph coloring problem for every instance of the graph.

Having noted that the number of colors depends on the order in which nodes are colored, we have developed a heuristic based on this concept. We determine classes such that the nodes that belong to the same class have the same degree. This makes it possible to perform MCl! (read as MCl factorial) colorations (considering all the possible arrangements of the MCl classes), and then to select the coloration which yields the minimal number of colors. Nevertheless, if all the nodes of the graph have different degrees, then there would be N! colorations (only one node in each class; N = |V|). For a substantial value of N, the number of colorations would be very large, so that obtaining a solution within a reasonable CPU time could not be guaranteed. In order to obtain a sufficiently low value of MCl, nodes with the same degree or slightly different degrees are assigned to the same class. Since this assignment of nodes to classes is not unique, we choose the one that is the solution to the following optimization problem:

min […]
such that:
i) $M_{Cl}$ = […]
ii) $\bigcup_{i=1}^{M_{Cl}} Cl_i = V$

where dMi and dmi are, respectively, the largest and the smallest degree of the nodes in class Cli.

Table 3.1 shows that:
– this heuristic offers near-optimal and even optimal solutions for certain DIMACS benchmarks. For the myciel7 benchmark, for example, the solutions are 9 and 8 colors with 1 and 3 classes respectively (assigning the nodes of this graph to 3 classes has made it possible to reach the exact known value, which is equal to 8 colors);
– the mean relative error percentage, 100 × (Solution_obtained – Solution_exact)/Solution_exact, drops from 8.60 (with the use of only 1 class) to only 1.92 (with the use of a suitable number of classes for each benchmark);
– the CPU time of our heuristic is generally reported as 0 s (that is, well below 1 s), except for some rare benchmarks, particularly queen8_8, at 1,858 s (approximately 31 min).

DIMACS benchmark | Exact solution | #Colors (#Classes=1) | % Err. | CPU time (s) | (#Colors, #Classes) | % Err. | CPU time (s)
anna | 11 | 11 | 0 | 0 | (11, 1) | 0 | 0
david | 11 | 11 | 0 | 0 | (11, 1) | 0 | 0
myciel3 | 4 | 4 | 0 | 0 | (4, 1) | 0 | 0
myciel4 | 5 | 5 | 0 | 0 | (5, 1) | 0 | 0
myciel5 | 6 | 6 | 0 | 0 | (6, 1) | 0 | 0
myciel6 | 7 | 7 | 0 | 0 | (7, 1) | 0 | 0
huck | 11 | 11 | 0 | 0 | (11, 1) | 0 | 0
zeroin.i.3 | 30 | 30 | 0 | 0 | (30, 1) | 0 | 0
miles750 | 31 | 31 | 0 | 0 | (31, 1) | 0 | 0
miles500 | 20 | 20 | 0 | 0 | (20, 1) | 0 | 0
miles1500 | 73 | 73 | 0 | 0 | (73, 1) | 0 | 4
miles1000 | 42 | 42 | 0 | 0 | (42, 1) | 0 | 1
games120 | 9 | 9 | 0 | 0 | (9, 1) | 0 | 0
jean | 10 | 10 | 0 | 0 | (10, 1) | 0 | 0
myciel7 | 8 | 9 | 12.5 | 0 | (8, 3) | 0 | 0
queen5_5 | 5 | 7 | 40 | 0 | (5, 7) | 0 | 2
miles250 | 8 | 9 | 12.5 | 0 | (8, 3) | 0 | 0
queen6_6 | 7 | 10 | 42.86 | 0 | (8, 8) | 14.29 | 49
queen8_8 | 9 | 14 | 55.56 | 0 | (11, 8) | 22.22 | 1,858
Mean relative error (%) | | | 8.60 | | | 1.92 |

Table 3.1. Results of the graph coloring heuristic
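The class-ordering idea behind the heuristic can be illustrated with a minimal Python sketch. It is not the author's implementation: it assumes that, for a given ordering, each node simply receives the smallest color not used by its already colored neighbors (the text does not say how nodes are colored within an ordering), it builds one class per distinct degree (the merging of close degrees is omitted), and the function names and the 4-node graph are hypothetical.

```python
from itertools import permutations

def greedy_coloring(adj, order):
    """Color nodes in the given order with the smallest color unused by neighbors."""
    color = {}
    for u in order:
        used = {color[v] for v in adj[u] if v in color}
        c = 1
        while c in used:
            c += 1
        color[u] = c
    return color

def degree_classes(adj):
    """One class per distinct degree (strict variant; close degrees are not merged)."""
    classes = {}
    for u in sorted(adj):
        classes.setdefault(len(adj[u]), []).append(u)
    return list(classes.values())

def best_coloring(adj):
    """Try the MCl! orderings of the classes and keep the coloring with fewest colors."""
    best = None
    for perm in permutations(degree_classes(adj)):
        order = [u for cls in perm for u in cls]
        coloring = greedy_coloring(adj, order)
        if best is None or max(coloring.values()) < max(best.values()):
            best = coloring
    return best

# Hypothetical 4-node path a-b-c-d (not one of the DIMACS benchmarks).
PATH = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
coloring = best_coloring(PATH)
n_subsystems = max(coloring.values())  # number of colors = number of subsystems
```

On this small graph, coloring the degree-1 class first wastes a color, while coloring the degree-2 class first does not, which is the order dependence the heuristic exploits.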

3.2.1.2. Task assignment to physical resources

Having determined the number of subsystems and the task assignment to these subsystems, the type of unit (ASIC or processor) executing each task must be defined so as to optimize the power consumption while meeting the time constraint. Our initial solution is therefore not random, but carefully considered. Since a processor is slower than an ASIC, but consumes less power (due to the sequential execution of operations, contrary to an ASIC, for which parallelism is possible), our initial solution minimizes the power consumption as much as possible, which means that it assigns each task to a processor. If the time constraint is met, the exploration of the solution space stops immediately (the best solution is obtained in the initialization stage). Knowing the tasks that have been assigned to a given subsystem SSi, there will initially be as many processors in SSi as threads in the graph of tasks assigned to this subsystem. Let us consider Figure 3.1, which represents a graph of tasks and the associated possible threads. The threads are selected based on the degree of communication between various tasks. Hence, the strongly intercommunicating tasks are part of the same thread and are executed by the same processor. This has the advantage of using the same local memory and simplifying the operating system: the notion of a global variable is sufficient for passing values from one task to another, unlike the case when two tasks are assigned to different processors, for which the operating system should enable this communication, through a global memory, for example.

Figure 3.1. Possible threads for a task graph


If Ti3 and Ti4 are assumed to be highly communicating, then the only candidates are the pairs of threads (thread31, thread32), (thread41, thread42), (thread51, thread52) and (thread61, thread62). Moreover, assuming that Ti5 is highly communicating with Ti2 and that Ti1 and Ti2 are highly communicating, then the pair (thread31, thread32) is retained. Upon initialization, tasks Ti1, Ti2 and Ti5 are assigned to the same processor, while Ti3 and Ti4 are assigned to a second processor (since there are two threads, SSi initially contains two processors).

After this initialization, if the time constraint is not met, other solutions should be explored. With the same aim of finding a near-optimal or even optimal feasible solution, the passage from one solution to another is not random in our technique: it tries to remain close to the current solution (which is interesting, but not feasible, since it violates the constraint). Hence, the change is not sudden, but rather progressive. In other terms, since the current solution has violated the time (or power) constraint, a processor (or, respectively, an ASIC) is replaced by an ASIC (or, respectively, a processor). If the replacement is not sufficient, then the process is iterated until a feasible solution is obtained. Let us note that this optimization problem has no solution in two cases:
– each unit is a processor and the power constraint is not met;
– each unit is an ASIC and the time constraint is not met.

For these two cases, it can be stated that the designer of the integrated system has poorly estimated one or both constraints. This approach (i.e. the automatic determination of the number of subsystems, assignments of tasks to units – processors or ASICs, assignments of units to subsystems, initialization of the solution and passage from the current solution to another solution) avoids the random aspect, and is implemented in a reasoned and deterministic manner in order to dedicate CPU time exclusively to exploring interesting solutions. It is worth noting that the size of the space of solutions associated with the considered problem is very large, which forbids any exhaustive design space exploration. For further details, the reader is invited to refer to Mahdoum (2012b). Let us simply note that, for a given configuration of the integrated system, the power consumption analysis (to determine whether the current solution meets the power constraint) relies on models and concepts related to this parameter (as noted in Chapter 2), which are essential for the power consumption estimation tool. As for the time parameter (to know whether the current solution meets the time constraint), it is estimated following the execution of the DVFS (dynamic voltage and frequency scaling) technique, which is presented below.

3.2.2. Heuristic application to dynamic voltage and frequency scaling (DVFS) for the design of a real-time system subject to energy constraint

As already noted in Chapter 2, the purpose of silicon compilation, which is a top-down approach, is to develop an integrated system from the most abstract possible level to the fine details, while verifications are performed at each level of abstraction. The objective is to increase the probability that the system operates correctly and to reduce the design costs (time and money), because an error is all the more costly as it occurs at lower design levels. Therefore, in order to design a complicated integrated system that meets the user specifications, this issue must be addressed at the highest design level, that is, the system level. At this design level, one of the synthesis tasks is to anticipate obtaining, at low design levels, the integrated system that meets the user specifications.
This synthesis task can be carried out in various manners, the most important of which are the following:
– design an integrated system according to the Pareto criterion, according to which all parameters (notably in terms of performance and power consumption) are given the same priority;
– design an integrated system that optimizes power consumption while meeting the time constraint (systems embedded on satellites, portable systems, etc., in short, any system for which power consumption is a critical parameter);
– design an integrated system that optimizes performance, while meeting the energy constraint (the case of real-time systems).

Section 3.2.1 describes the first stage of the design of a system optimized in terms of power consumption, while meeting the time constraint. The same can be applied for approaching the design of a high-performance system that meets the power consumption constraint. As already seen, this first synthesis stage uses the DVFS technique to determine whether a given configuration of the system meets the specifications. The following section discusses the basic techniques used for this purpose:
– static techniques;
– dynamic techniques;
– quasi-static techniques.

Consider a task graph G = (T, E), where T = {T1, T2, …, Tn} is the set of tasks and E is the set that describes the data dependencies between various tasks. Each task is assumed to have a mandatory part and an optional part, whose executed content should be maximized (the largest possible number of optional operations should be executed). These two parts are characterized by the CPU cycles Mi and Oi. Each task Ti is associated with a reward function Ri(Oi). Similar to other works, the dynamic energy consumed by the task Ti is assumed to be:

$E_i = C_i V_i^2 (M_i + O_i)$

where Ci is the effective switched capacitance, Vi is the supply voltage at which Ti is executed and (Mi+Oi) is the number of cycles executed by Ti. The energy overhead for switching from Vi to Vj is:

$\varepsilon^{\Delta V}_{i,j} = C_r (V_i - V_j)^2$

where Cr is the capacitance of the power rail. Similar to other works, we consider that the runtime of Ti executing (Mi+Oi) cycles with a voltage Vi is:

$\tau_i = k \frac{V_i}{(V_i - V_{th})^{\alpha}} (M_i + O_i)$

where k is a technology-dependent constant, α (1.4 ≤ α ≤ 2) is the saturation velocity index and Vth is the threshold voltage of the transistor in the concerned technological process. The time overhead for switching from Vi to Vj is:

$\delta^{\Delta V}_{i,j} = p \, |V_i - V_j|$

where p is a constant. Assuming that the tasks are executed in a fixed order and that the target processor supports voltage scaling from Vmin to Vmax, the three different DVFS techniques are formulated as follows.

3.2.2.1. Static V/O assignment

This involves finding, for each task Ti, 1 ≤ i ≤ n, the voltage Vi and the number of optional cycles Oi, in order to:

Maximize $\sum_{i=1}^{n} R_i(O_i)$ [3.1]

such that:

$V_{min} \leq V_i \leq V_{max}$ [3.2]

$s_{i+1} = t_i = s_i + p \, |V_{i-1} - V_i| + k \frac{V_i}{(V_i - V_{th})^{\alpha}} (M_i^{wc} + O_i) \leq d_i$ [3.3]

$\sum_{i=1}^{n} \left( C_r (V_{i-1} - V_i)^2 + C_i V_i^2 (M_i^{wc} + O_i) \right) \leq E_{max}$ [3.4]
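To make the energy and timing model concrete, the following Python sketch evaluates the constraints of a static (V, O) assignment for tasks executed in a fixed order. It is only an illustration under stated assumptions: the name `p` for the switching-time constant, the function names and all numerical values are hypothetical, and the units are arbitrary.

```python
def exec_time(V, cycles, k, V_th, alpha):
    """Runtime of `cycles` cycles at supply voltage V."""
    return k * V / (V - V_th) ** alpha * cycles

def dyn_energy(C, V, cycles):
    """Dynamic energy of `cycles` cycles at voltage V with switched capacitance C."""
    return C * V ** 2 * cycles

def static_assignment_is_feasible(tasks, V0, p, C_r, k, V_th, alpha, E_max):
    """Check every deadline and the energy budget for a fixed task order.

    Each task is a dict with keys V, M, O, C and d (deadline); V0 is the
    voltage before the first task; p is the assumed switching-time constant.
    """
    t, energy, V_prev = 0.0, 0.0, V0
    for task in tasks:
        V, cycles = task["V"], task["M"] + task["O"]
        t += p * abs(V_prev - V) + exec_time(V, cycles, k, V_th, alpha)
        energy += C_r * (V_prev - V) ** 2 + dyn_energy(task["C"], V, cycles)
        if t > task["d"]:
            return False
        V_prev = V
    return energy <= E_max

# Hypothetical numbers, chosen only to keep the arithmetic easy to follow.
TASKS = [
    {"V": 2.0, "M": 10, "O": 5, "C": 1.0, "d": 40.0},
    {"V": 2.0, "M": 10, "O": 0, "C": 1.0, "d": 60.0},
]
PARAMS = dict(V0=2.0, p=1.0, C_r=1.0, k=1.0, V_th=1.0, alpha=2.0)
feasible = static_assignment_is_feasible(TASKS, E_max=100.0, **PARAMS)
```

An optimizer for the static problem would search over the voltages and optional cycles of each task and keep the feasible assignment with the largest total reward; the sketch only covers the feasibility test.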

Equation [3.3] shows that each task Ti must be completed before its deadline di. The execution of Ti starts at the moment si and its duration is equal to the sum of the voltage-switching time overhead and the execution time of Ti once the voltage has been defined. The energy consumed by Ti is the sum of the energy due to the potential switching from Vi-1 to Vi and the energy relative to the switching of the capacitance Ci. The total energy consumed by all the tasks should not exceed the allotted energy budget Emax. It is worth noting that the assignment (V, O) is performed during the design (before the execution of the application) by considering the least favorable case (the worst-case number of mandatory cycles), which does not suitably maximize the objective function (equation [3.1]). Indeed, since for each task Ti the unfavorable number of mandatory cycles is considered, the optimization algorithm has a tendency to assign Vmax to various tasks, in order to meet their delay constraints. This involves significant energy consumption for the tasks that could be executed with Vmin while meeting their delay constraints, as their execution does not necessarily operate on unfavorable mandatory cycles. Therefore, this poor allocation of voltage consumes a large part of the energy budget, which prohibits the execution of a considerable number of optional cycles (poorly optimized objective function) or, even worse, consumes the whole energy budget so that no optional parts can be executed, while the energy budget is assumed to be sufficient for executing the mandatory parts of the tasks, which is not necessarily true.

3.2.2.2. Dynamic V/O assignment

This optimization problem can be formulated as follows: For c+1 ≤ i ≤ n, find Vi and Oi in order to:

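Maximize $\sum_{i=c+1}^{n} R_i(O_i)$ [3.5]

such that:

$V_{min} \leq V_i \leq V_{max}$ [3.6]

$s_{i+1} = t_i = s_i + \delta_{dyn} + p \, |V_{i-1} - V_i| + k \frac{V_i}{(V_i - V_{th})^{\alpha}} (M_i^{ex} + O_i) \leq d_i$ [3.7]

$EC_c + \sum_{i=c+1}^{n} \left( \varepsilon_{dyn} + C_r (V_{i-1} - V_i)^2 + C_i V_i^2 (M_i^{ex} + O_i) \right) \leq E_{max}$ [3.8]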

Unlike static techniques, the dynamic techniques use the information provided once c tasks (1 ≤ c < n) have been executed in order to decide on the ongoing assignment (V, O), enabling the effective optimization of the objective function. Hence, assuming that tasks T1, T2, …, Tc have been executed, a certain energy budget ECc will have been consumed and tc+1 is the instant at which Tc has been fully executed. Unlike the static techniques, the actual execution cycles (Mi_ex) are considered here, not the worst-case cycles (Mi_wc). Unfortunately, energy and time overheads (εdyn and δdyn) are required for the online computations of the (V, O) assignment associated with the task Ti under execution. Because a great amount of the time and energy budgets is consumed for online computations, the dynamic (V, O) scheduler can produce a total reward that is inferior to that of the static scheduler, or, even worse, might not be able to fulfill the time and energy constraints.

3.2.2.3. Quasi-static V/O assignment

Quasi-static techniques aim to reduce εdyn and δdyn while taking advantage of the dynamic aspect of the assignment (avoiding the static aspect, which only considers unfavorable cases). For this purpose, a set of (V, O) assignments is determined during the design (before the execution of the considered application) and memorized. The selection, for each task, of a pair (V, O) from this set during the execution phase will then require less time and energy than the dynamic technique. This selection relies on the parts of the time and energy budgets consumed before the execution of the task in progress. This problem can then be formulated as follows:

Maximize $\sum_{i=1}^{n} R_i(O_i)$ [3.9]

such that:

$V_{min} \leq V_i \leq V_{max}$ [3.10]

$s_{i+1} = t_i = s_i + \delta_{Q\_S} + p \, |V_{i-1} - V_i| + k \frac{V_i}{(V_i - V_{th})^{\alpha}} (M_i + O_i) \leq d_i$ [3.11]

$\sum_{i=1}^{n} \left( \varepsilon_{Q\_S} + C_r (V_{i-1} - V_i)^2 + C_i V_i^2 (M_i + O_i) \right) \leq E_{max}$ [3.12]

As soon as the task Ti ends, the (V, O) assignment for Ti+1 is made, depending on the set determined during the design and the budgets of time and energy consumed by Ti. Hence, it is sometimes operated as follows:

if t1 ≤ 75 µs ∧ EC1 ≤ 77 µJ then { V2 = 1.4; O2 = 66924; } else ……

In the above-mentioned assignment example, it is worth noting that the ≤ sign is used instead of the = sign. In other terms, the same value is assigned to O2, irrespective of whether EC1 is equal to 1 µJ or 77 µJ! It is clear that the determination of the set of pairs (V, O) in this way cannot effectively optimize the objective function. The following section presents our own method within a quasi-static assignment. This takes advantage of the above-described techniques, while avoiding their drawbacks. More precisely, our technique:
– drastically reduces δdyn and εdyn (the time and energy overheads, which are very significant when the dynamic method is applied);
– interrupts the application only once for the selection of (V, O), while this interruption is generated each time a task is executed with the conventional quasi-static methods. Since the selection occurs once rather than once per task, only a very small part of the time and energy budgets is consumed, which leaves almost all of these budgets intact for executing the application itself;
– avoids taking into account only unfavorable cases, as with static techniques.

The problem associated with our method can be formulated in this way:

Maximize $\sum_{i=1}^{n} R_i(O_i)$ [3.13]

such that:

$V_{min} \leq V_i \leq V_{max}$ [3.14]

$s_{i+1} = t_i = s_i + a_i \, \delta_{O\_T} + p \, |V_{i-1} - V_i| + k \frac{V_i}{(V_i - V_{th})^{\alpha}} (M_i^{ex} + O_i) \leq d_i$ [3.15]

$\sum_{i=1}^{n} \left( a_i \, \varepsilon_{O\_T} + C_r (V_{i-1} - V_i)^2 + C_i V_i^2 (M_i^{ex} + O_i) \right) \leq E_{max}$ [3.16]

$a_i = 1$ if $i = 1$, $a_i = 0$ otherwise [3.17]

Equations [3.15], [3.16] and [3.17] show that the time and energy overheads of the selection are 0 for any value of i that differs from 1 (there is no time and energy consumption for the selection except during the execution of the 1st task). Hence, these time and energy savings serve to further optimize the objective function.

Technique | Number of mandatory operations considered (Mi) | Energy overhead | Time overhead
Static | Mi = Mi_wc | 0 | 0
Dynamic | Mi = Mi_ex | Nb_tasks ∗ εdyn | Nb_tasks ∗ δdyn
Quasi-static | Mi_ex ≤ Mi ≤ Mi_wc | Nb_tasks ∗ εQ_S | Nb_tasks ∗ δQ_S
Our technique | Mi = Mi_ex | 1 ∗ εO_T | 1 ∗ δO_T

Table 3.2. Summary of the four techniques

Table 3.2 uses the abbreviations ex, Q_S, wc and O_T, which stand for exact, quasi-static, worst case and our technique respectively. In order to show how our technique detects the trace of a given task being executed (therefore taking into account the exact value of the mandatory cycles), let us consider the CDFG (controlled data flow graph) of Figure 3.2, from which the various traces are deduced (Figure 3.3). Let Mi (Oi) be the number of mandatory (or, respectively, optional) operations executed by the task Ti. Figure 3.2 shows 10 mandatory operations. Since the real trace to be executed is not yet known during the design phase, Mi takes the value 7 (operations 1, 2, 4, 5, 6, 7 and 10) with the static technique. However, if condition A is true, the real number of operations that will be executed would only be 4 (operations 1, 2, 3 and 10). Therefore, the static technique dedicates part of the time and energy budgets to 7 mandatory operations (rather than 4), which does not leave enough time and energy for maximizing the optional operations. It is worth noting that in a real application, the gap between these two numbers may be very significant. Let us consider Table 3.3, which includes the values of voltages assumed to be obtained during the design and which meet the time and energy constraints. Nevertheless, allocating a voltage to each operation could increase the additional time and energy costs caused by potential frequent switching between various voltages. It is therefore recommended to allocate the same voltage to all the operations of the same task (Table 3.4).

Figure 3.2. Example of a CDFG (operations 1 to 10, conditions A and B, Task 1 and Task 2)


Figure 3.3. Traces of the CDFG in Figure 3.2
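The gap between the worst-case and the actual number of mandatory operations can be checked with a few lines of Python. This is only an illustrative sketch: the first two traces are those quoted in the text, the third one (operations 1, 2, 8, 9 and 10) is read from Table 3.3, and the trace numbering and function names are assumptions.

```python
# Operations executed along each trace of the CDFG of Figure 3.2.
TRACES = {
    1: [1, 2, 3, 10],            # condition A is true
    2: [1, 2, 4, 5, 6, 7, 10],   # longest trace, used by the static technique
    3: [1, 2, 8, 9, 10],         # read from Table 3.3 (assumption)
}

def worst_case_ops(traces):
    """Number of mandatory operations assumed by the static technique."""
    return max(len(ops) for ops in traces.values())

def overestimate(traces, executed_trace):
    """Mandatory operations budgeted by the static technique but never executed."""
    return worst_case_ops(traces) - len(traces[executed_trace])
```

For the trace taken when condition A is true, `overestimate(TRACES, 1)` gives the 3 operations whose time and energy could have been spent on optional operations.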

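Voltage (V)
Task | Operation | Trace 1 | Trace 2 | Trace 3
1 | 1 | 1.0 | 1.0 | 3.3
1 | 2 | 1.0 | 1.0 | 1.0
2 | 3 | 1.0 | – | –
2 | 4 | – | 3.3 | –
2 | 5 | – | 3.3 | –
2 | 6 | – | 1.0 | –
2 | 7 | – | 3.3 | –
2 | 8 | – | – | 3.3
2 | 9 | – | – | 3.3
2 | 10 | 1.0 | 3.3 | 1.0

Table 3.3. Allocation of voltages to operations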

Voltage (V)
Task | Trace 1 | Trace 2 | Trace 3
1 | 1.00 | 1.00 | 2.15
2 | 1.00 | 2.84 | 2.53

Table 3.4. Allocation of voltages to tasks


For a trace t of a given task i, the voltage allocated is given by the equation:

$V_{i,t} = \frac{1}{N_{i,t}} \sum_{j=1}^{N_{i,t}} V_{i,t,j}$ [3.18]

where Vi,t,j is the voltage assigned to operation j of task i when its trace t is executed, and Ni,t is the number of operations of task i in trace t. As will be seen in what follows, our technique can detect the trace that will be executed for a given task i, which makes it possible to assign the pairs of exact values (V, O) (which is the advantage of the dynamic technique), but without a substantial excess in the time and energy budgets (which is the drawback of the dynamic technique). It is worth recalling that δO_T and εO_T of our technique are clearly below δdyn and εdyn. There is, however, an exception for the 1st task, for which no information making it possible to detect the trace to be executed exists; consequently, it is not possible to assign it pairs of (V, O) values soundly determined during the design. Hence, the voltage V1 assigned to the 1st task is determined as follows:

$V_1 = \frac{1}{Nb\_traces} \sum_{j=1}^{Nb\_traces} V_{1,j}$ [3.19]

where V1,j is the voltage determined at design time for the trace j of the 1st task. Therefore, considering Table 3.5 obtained, for example, during the design, the voltage allocated to the 1st task is: V1 = (1.13 + 1.11 + 1.12)/3 = 1.12 V.

Each trace entry lists Mi, Oi, Vi (V), Ei (pJ), Di (ns).
Task | Trace 1 | Trace 2 | Trace 3
1 | 100, 90, 1.13, 9.73, 4.04 | 120, 95, 1.11, 11.01, 4.57 | 80, 41, 1.12, 6.20, 2.57
2 | 80, 20, 1.12, 5.09, 6.18 | 65, 17, 1.14, 4.17, 6.32 | 70, 17, 1.13, 4.43, 4.43
3 | 60, 30, 1.13, 4.59, 8.10 | 62, 30, 1.13, 4.69, 8.29 | 61, 30, 1.13, 4.64, 6.37
4 | 540, 236, 1.15, 39.18, 24.74 | 516, 201, 1.13, 36.21, 23.66 | 520, 210, 1.12, 36.86, 22.03
5 | 75, 20, 1.12, 4.83, 26.77 | 200, 80, 1.12, 14.24, 29.64 | 150, 45, 1.13, 9.92, 26.19
… | … | … | …

Table 3.5. Example of (V, O) assignment during the design phase
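As a quick numerical check, the short Python sketch below recomputes the per-task voltages of Table 3.4 from the per-operation voltages of Table 3.3, and the mean voltage V1 from the task 1 voltages of Table 3.5, reading equations [3.18] and [3.19] as arithmetic means. The data layout and the names `TABLE_3_3`, `TASK_OPS` and `task_voltage` are illustrative assumptions, not part of the author's tool.

```python
from statistics import mean

# Table 3.3: trace -> {operation: voltage in V}.
TABLE_3_3 = {
    1: {1: 1.0, 2: 1.0, 3: 1.0, 10: 1.0},
    2: {1: 1.0, 2: 1.0, 4: 3.3, 5: 3.3, 6: 1.0, 7: 3.3, 10: 3.3},
    3: {1: 3.3, 2: 1.0, 8: 3.3, 9: 3.3, 10: 1.0},
}
# Operations of each task, as grouped in Table 3.3.
TASK_OPS = {1: {1, 2}, 2: {3, 4, 5, 6, 7, 8, 9, 10}}

def task_voltage(trace, task):
    """Mean voltage of the operations of `task` executed in `trace` (equation [3.18])."""
    return mean(v for op, v in TABLE_3_3[trace].items() if op in TASK_OPS[task])

# Equation [3.19]: mean of the design-time voltages of task 1 over its traces (Table 3.5).
V1 = mean([1.13, 1.11, 1.12])

for task in (1, 2):
    print(task, [round(task_voltage(trace, task), 2) for trace in (1, 2, 3)])
print(round(V1, 2))
```

Under this reading, the rounded values agree with Table 3.4 and with the value of 1.12 V quoted in the text.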




The values of Table 3.5 will be used by the process below for the assignment of (V, O) pairs during the execution phase of the application:

Assignment_V_O( )
// The values below are not related to those in Table 3.5
{
  V1 = 1.13 V;
  if M1 = 100 then
    if E1 = 9.73 && D1 = 4.04 then   // Trace 1 is currently being executed
    {
      O'1 = determine_O1( );   // This process is described further below
      V2 = 1.13 V; O2 = 20;
      V3 = 2.12 V; O3 = 25;
      V4 = 1.14 V; O4 = 97;
      V5 = 1.22 V; O5 = 24;
      …
    }
    else if E1 = 8.49 && D1 = 5.06 then   // Trace 2 is currently being executed
    {
      O'1 = determine_O1( );   // This process is described further below
      V2 = 1.14 V; O2 = 17;
      V3 = 1.13 V; O3 = 30;
      V4 = 1.13 V; O4 = 201;
      V5 = 1.12 V; O5 = 80;
      …
    }
    else …………
    ……………………………………
  end if
}

determine_O1( )
{
  V1 is determined by equation [3.19];   // Vr_1 = Vs_1 ∀ traces r and s
  O'1_t is obtained from equations [3.25], [3.26] and [3.27];
  Vi_t, Oi_t are the values obtained at design time;   // i = 2, 3, …, Nb_tasks; t = 1, 2, …, Nb_traces
}

It is worth recalling that the process Assignment_V_O( ) consumes the time δO_T and the energy εO_T. These quantities are clearly below those of the dynamic technique, and in comparison to other quasi-static techniques, δO_T and εO_T are consumed only once, instead of for each task. Therefore, our technique leaves almost all of the time and energy budgets allocated to the application intact, which enables better optimization of the objective function (the maximization of the number of optional operations to be executed).

More precisely, the function determine_O1( ) is based on the following. Let us assume that a certain trace t of task 1 is under execution. Let V1,t be the voltage assigned during the design to this trace of the 1st task (equation [3.18]). As the (V, O) assignment has been initially obtained with V1,t rather than with V1 (equation [3.19]), the time and/or energy constraints can be violated using V1 instead of V1,t. Therefore, in order to maintain the initially assigned (V, O) of the remaining tasks – which is exact – while meeting the time and energy constraints, the (V, O) assignment must be modified for the 1st task only. We have:

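$E_{1,t} = C_1 V_{1,t}^2 (M_{1,t} + O_{1,t})$ [3.20]

$E'_{1,t} = C_1 V_1^2 (M_{1,t} + O_{1,1,t})$ [3.21]

$\tau_{1,t} = k \frac{V_{1,t}}{(V_{1,t} - V_{th})^{\alpha}} (M_{1,t} + O_{1,t})$ [3.22]

$\tau'_{1,t} = k \frac{V_1}{(V_1 - V_{th})^{\alpha}} (M_{1,t} + O_{1,2,t})$ [3.23]

where the primed quantities are obtained with V1 instead of V1,t. In order to continue to meet the time and energy constraints for all the tasks, we have to establish:

$E'_{1,t} \leq E_{1,t} \;\wedge\; \tau'_{1,t} \leq \tau_{1,t}$ [3.24]

From equations [3.20], [3.21], [3.22], [3.23] and [3.24], we obtain:

$O_{1,1,t} \leq \left\lfloor \frac{V_{1,t}^2}{V_1^2} (M_{1,t} + O_{1,t}) - M_{1,t} \right\rfloor$ [3.25]

$O_{1,2,t} \leq \left\lfloor \frac{V_{1,t} \, (V_1 - V_{th})^{\alpha}}{V_1 \, (V_{1,t} - V_{th})^{\alpha}} (M_{1,t} + O_{1,t}) - M_{1,t} \right\rfloor$ [3.26]

⌊X⌋ is the largest integer value below or equal to X (e.g. ⌊3.9⌋ = 3). Since the purpose is to maximize the objective function, the = sign must be used instead of the ≤ sign in equations [3.25] and [3.26]. Furthermore, because certain voltages V1,t are higher than the mean voltage V1, we need to have O1,1,t ≤ O1,max,t and O1,2,t ≤ O1,max,t, where O1,max,t is the maximal value of the optional cycles of the 1st task when its trace t is executed. Hence, in order to continue to meet the time and energy constraints for all the tasks, the (V, O) assignment for the 1st task is given by:

V1 is as defined in equation [3.19]

$O'_{1,t} = \min \left( O_{1,1,t}, \; O_{1,2,t}, \; O_{1,max,t} \right); \quad 1 \leq t \leq Nb\_traces$ [3.27]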
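One possible reading of equations [3.25] to [3.27] can be sketched in Python as follows. This is a hedged illustration rather than the author's code: it assumes that the energy and time spent by the 1st task at the mean voltage V1 must not exceed what was budgeted at design time with V1,t, the clamp to 0 is an added safeguard, and the function name, argument names and numerical values (including V_th and alpha) are hypothetical.

```python
from math import floor

def determine_o1(V1, V1_t, M, O, O_max, V_th, alpha):
    """Optional cycles of task 1 when it runs at V1 instead of the design-time V1_t."""
    def time_per_cycle(V):
        # Runtime per cycle up to the constant k, which cancels in the ratio.
        return V / (V - V_th) ** alpha

    # Energy bound: C1 * V1^2 * (M + O1) <= C1 * V1_t^2 * (M + O)
    o_energy = floor((V1_t / V1) ** 2 * (M + O) - M)
    # Time bound: runtime at V1 <= runtime budgeted at V1_t
    o_time = floor(time_per_cycle(V1_t) / time_per_cycle(V1) * (M + O) - M)
    # Never exceed the maximal optional cycles of the trace, never go below 0.
    return max(0, min(o_energy, o_time, O_max))

# Equal voltages: the design-time value of O is kept.
same = determine_o1(1.12, 1.12, M=100, O=90, O_max=90, V_th=0.4, alpha=2.0)
# Mean voltage above the design-time voltage: fewer optional cycles fit the energy budget.
fewer = determine_o1(1.2, 1.1, M=100, O=90, O_max=90, V_th=0.4, alpha=2.0)
```

When V1 equals V1,t both ratios are 1, so the design-time number of optional cycles is returned unchanged, in line with the remark that follows on equal voltages.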

O'1_t (used in the process determine_O1()) is obtained from equations [3.25], [3.26] and [3.27]. It is worth noting that if the time and energy constraints are met, our technique will yield the same value of Oi (i > 1) as the ideal technique (dynamic technique with δdyn = 0 and εdyn = 0). If the voltages allocated to the traces of the 1st task are equal, equations [3.25] and [3.26] show that our technique yields the same result as the ideal technique. Otherwise, the value of O1 obtained by our method may differ from the one obtained by the ideal technique for certain traces of the 1st task. More formally: ≥

i) if obtain: ,

=

=



, ,

+

_

,



_

=

,

∀ 1 ≤ , ≤

,

, we



+



_ ,,

_ , ,

+

=0

_ ,,

The error Rel_Err11 = R1,1/RI_C,t is therefore equal to 0%. where I_C, O_T, i and t correspond to the ideal case, our technique, the task i and the trace t respectively. RI_C,t and RO_T,t are the numbers of optional operations that can be executed by the ideal case and our technique; ii) if ∃ ≠



,

=



= max = max = max

≠ +

,



_ ,

, ; 1 _

≤ , ≤ , we obtain:



_ , ,

_ ,

_

+



_ ,,



_ ,,

+

_ ,,

_ , ,



_ , ,

, ,

+

+

,

,

,

1− 1−

,

, )∝

( ,






The error Rel_Err12 = R1,2/RI_C,t is then equal to 100*R1,2/RI_C,t %. It is worth noting that if V1,t is close to V1, then R1,2 will be close to 0 and therefore Rel_Err12 will be close to 0%. The least favorable case corresponds to V1,t being very close to Vmin and V1 being very close to Vmax. However, since V1 is a mean voltage (equation [3.19]), Rel_Err12 should remain acceptable for a significant number of traces. If the energy budget is equal to the energy consumed by the ideal technique (meaning that Emax = EI_C,t), it is possible that certain optional operations will not be executed by our technique, since εdyn = 0 for the ideal technique while εO_T is not zero. Noting that the budget εO_T is consumed by our technique for the selection, whereas it is used for executing operations by the ideal technique (which does not really exist), let us determine the number of optional operations that will not be executed by our technique. Because Oi is determined before Oj (for i < j), NL optional operations will not be executed: =

_

max _

_ , , ,

_ , ,

,

_ _ , , ,

_ , , ,



where N is the number of tasks, and EI_C,O,N,t and τI_C,O,N,t are the energy and time budgets for executing the optional part of the Nth task. ⌈X⌉ is the least integer value greater than or equal to X (e.g. ⌈3.1⌉ = 4). Let us then consider the following two cases: –

_ _ , , ,

≤ 1 

_ _ , , ,

≤ 1: the number of optional operations whose

execution is enabled by our technique is equal to: _ , ,



=

_ _ , , ,

1−

> 1 

_ _ , , ,

_ _ , , ,

_ , ,

, 1−

_ _ , , ,

> 1: the number of optional operations NL that

will not be executed is distributed to the tasks j, j+1, ….., N (1  j  N) such that: +

_ , ,

=

_ , ,


For these two cases, we have: iii) if the voltages allocated during the design are equal for all the traces of the 1st task, then O1,t,O_T = O1,t,I_C. This implies that: R21 = RI_C – RO_T = NL The error would then be: Rel_Err21 = 100·R21/RI_C % =

100

max



_

iv) if ∃ ≠ : then we obtain:

,

_ ,,



, ; 1

≤ , ≤

_



=

_ ,

,

R22 = RI_C − RO_T ,

1− =

max _

(

,

1−

− ,

+

, ,



,

+

,

, , _

+

,

)∝ ∝

+

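The error is Rel_Err22 = 100·R22/RI_C %.

3.3. Register transfer level

3.3.1. Integer linear programming applied to the scheduling of operations of a data flow graph (DFG)

A very interesting formulation due to Gajski (Gajski, 1992), based on integer linear programming and applicable to data-flow-dominated descriptions, combines ASAP and ALAP scheduling (see Chapter 2). This formulation, which schedules the operations so as to meet the constraints on the number of resources of a given type, is given by:

$\min \sum_{k=1}^{m} C_{t,k} \, M_{t,k}$

such that:

1) $\forall \; 1 \leq i \leq n: \;\; \sum_{j=E_i}^{L_i} x_{i,j} = 1$

2) $\forall \; 1 \leq j \leq s, \; \forall \; 1 \leq k \leq m: \;\; \sum_{i \,:\, type(O_i) = k} x_{i,j} \leq M_{t,k}$

3) $\forall \; i, j$ such that $O_j$ depends on $O_i$: $\;\; \sum_{k=E_i}^{L_i} (k \cdot x_{i,k}) - \sum_{l=E_j}^{L_j} (l \cdot x_{j,l}) \leq -1$

where n is the number of operations, s is the number of scheduling steps, m is the number of operation types, $x_{i,j}$ is equal to 1 if operation Oi is scheduled at step j (and to 0 otherwise), and $M_{t,k}$ is the number of units allocated to the operation type k.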

For a given operation k of the type t having a cost Ct,k (surface, power consumption), the objective is to minimize the global cost of scheduled operations so that the following constraints are met:
1) A given operation Oi can be scheduled only once. Moreover, its scheduling step must be in the interval [Ei, Li], where Ei and Li are, respectively, its ASAP and ALAP scheduling steps.
2) In each scheduling step, the number of operations of a given type (e.g. multiplication) should not exceed the number allocated for this type of operation (e.g. if the circuit must contain at most three multipliers, it is forbidden to schedule more than three multiplications in the same step, even if it is possible).
3) The 3rd constraint concerns the dependence of an operation in relation to another. Noting that xik and xjl can only take the value 0 or 1 (xik = 1 means that the operation i is scheduled at the kth step), if the operation j depends on operation i, then the value of k in the scheduling obtained must be strictly below that of l for any possible value of k contained in the interval [Ei, Li] and any possible value of l contained in the interval [Ej, Lj].

EXAMPLE.– Let us assume that Oi and Oj (which depends on Oi) can be respectively scheduled in the intervals [4, 7] and [5, 9]. If Oi is scheduled in the 6th step, for example, Oj must be scheduled at the 7th (6-7 = -1), 8th (6-8 = -2) or 9th step (6-9 = -3), but not at the 5th or 6th step.

Let us recall that for the case of a DFG, the scheduling step and the execution step have the same significance, which is not necessarily the case for a CDFG (the details were given in Chapter 2). The result of the optimization of this problem relates to the values of the variables xij. If xij takes the value of 1, this would mean that the operation Oi is scheduled (and executed for a DFG) at the jth step of scheduling. This result meets the three constraints that have been mentioned above while minimizing the global cost.
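The three constraints can be illustrated on a very small instance with the Python sketch below. It does not call an ILP solver: it enumerates every assignment of steps within the [Ei, Li] windows, which is only workable for toy examples, and it assumes that the number of units of each type is the peak number of same-type operations in a step and that the cost is the sum of unit costs. The instance, the costs and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

def precedes(k, l):
    """Constraint 3: step k of O_i and step l of a dependent O_j satisfy k - l <= -1."""
    return k - l <= -1

def schedule(ops, deps, cost):
    """ops: name -> (type, E, L); deps: (i, j) pairs where O_j depends on O_i;
    cost: type -> cost of one unit. Returns (best cost, step of each operation)."""
    names = list(ops)
    # Constraint 1: each operation gets exactly one step inside [E, L].
    windows = [range(ops[n][1], ops[n][2] + 1) for n in names]
    best = None
    for steps in product(*windows):
        s = dict(zip(names, steps))
        if not all(precedes(s[i], s[j]) for i, j in deps):
            continue
        # Constraint 2: units needed per type = peak load of that type over the steps.
        load = Counter((ops[n][0], s[n]) for n in names)
        units = {}
        for (op_type, _), count in load.items():
            units[op_type] = max(units.get(op_type, 0), count)
        total = sum(cost[t] * m for t, m in units.items())
        if best is None or total < best[0]:
            best = (total, s)
    return best

# Hypothetical instance: two multiplications feeding one addition, plus a free addition.
OPS = {"m1": ("mul", 1, 2), "m2": ("mul", 1, 2), "a1": ("add", 2, 3), "a2": ("add", 1, 3)}
DEPS = [("m1", "a1"), ("m2", "a1")]
COST = {"mul": 4, "add": 1}
best_cost, steps = schedule(OPS, DEPS, COST)
```

In this instance the cheapest schedule places the two multiplications in different steps, so that a single multiplier and a single adder suffice.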


3.3.2. The scheduling of operations in a controlled data flow graph (considering the speed–power consumption tradeoff)

Performance and power consumption parameters are generally antagonistic. CDFG scheduling can be performed using the method presented above by:
– decomposing the CDFG into DFGs;
– scheduling each DFG using the method described in section 3.3.1, imposing no constraint on the number of operations of the same type to be scheduled in the same step.
At this stage, the scheduling of each DFG enables the highest possible performance. If the constraint imposed on the power consumption is met by each of the DFGs, this scheduling can be retained. Otherwise, additional processing is required. For the DFG(s) in which the power consumption constraint has not been met, a possible approach is to reduce parallelism by moving operations that were initially scheduled in the same step to different steps. Nevertheless, the elimination of all possible parallelism (to obtain as many scheduling steps as there are operations) would lead to performance degradation. Hence, the allocation of parallelizable operations to different steps must be made progressively, until the power consumption constraint is met for each of the traces of the DFG. For further details, the interested reader is invited to refer to the article “A Low-Power Scheduling Tool for SOC Designs” (Mahdoum et al., 2005a).

3.3.3. Efficient code assignment to the states of a finite state machine (aimed at reaching an effective control part in terms of surface, speed and power consumption)

Similar to the operational part, the control part of an integrated circuit must meet requirements in terms of surface, speed and/or power consumption.
A decrease in the number of transistors and a reduction in the width and height are parameters that influence the characteristics of the control part, so that it properly delivers the control signals to the operational part and contributes to meeting the specifications of the circuit. In this context, several optimization methods have been developed to reduce the number of product terms, thus leading to a reduction in the number of transistors and


the height of the control part. Such methods involve the assignment of arbitrary codes to the states of the finite state machine from which the control part originates. When the code length is ⌈log2 N⌉, with N being the number of states, what follows is a logical optimization of the output variables (the variables of the new state and those related to the command signals) as a function of the input variables (variables related to the control signals and to the current state of the finite state machine). Another of these methods is the one known as 1-hot encoding, whose state code length is equal to N (one bit per state).
With the surface of a control part estimated by (2*i + 3*v + o) * p, there are other techniques that involve the determination of the codes to be assigned to the different states (avoiding an arbitrary assignment), often leading to a smaller surface. Here, i is the number of primary input variables (control signals), each counted with its complement; v is the number of variables associated with the state of the machine (one variable and its complement as inputs for each bit of the code of the current state, and one output variable for each bit of the next state of the machine); o is the number of command signals; and p is the number of product terms obtained after the logical synthesis (using, for example, ESPRESSO).
Before presenting our technique for optimizing a control part by assigning adequate codes to the states, we review below the formal definition of a finite state machine:
– X={x1,x2,…,x|X|} is the set of control signals;
– Y={y1,y2,…,y|Y|} is the set of states of the machine;
– Z={z1,z2,…,z|Z|} is the set of command signals;
– δ: X × Y → Y is the function that determines the next state of the machine;
– λ: X × Y → Z (λ: Y → Z) is the function generating the command signals for a Mealy (or, respectively, Moore) machine.
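The surface estimate (2*i + 3*v + o) * p given above can be evaluated with a one-line function. The sketch below is only illustrative: the function name is hypothetical, i is read as the number of control signals and v as the number of bits of the state code (one possible reading of the definitions), and the numerical values are invented to show how a longer code widens the control part when the number of product terms is unchanged.

```python
def control_part_surface(i, v, o, p):
    """Estimated surface: width (2*i + 3*v + o) times height p."""
    width = 2 * i + 3 * v + o   # input columns, state columns, output columns
    return width * p            # p product terms = p lines

# Hypothetical control part: 2 control signals, 2 command signals, 11 product terms.
short_codes = control_part_surface(i=2, v=2, o=2, p=11)  # 2-bit state codes
long_codes = control_part_surface(i=2, v=4, o=2, p=11)   # 4-bit state codes
print(short_codes, long_codes)  # 132 198
```

For the same height, doubling the code length from 2 to 4 bits raises the estimate from 132 to 198, which is why the code length must be kept as small as possible once the height has been reduced.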
The techniques used for assigning codes to the states of the finite state machine aim at a maximal reduction in the number of lines, which leads to a maximal optimization of the height of the control part. This requires an adequate codification of the machine states. As will be seen, the state code length may exceed ⌈log2 N⌉ in order to preserve the proper behavior of the control part. Unfortunately, if the codes are too


long, then the width is considerably larger than that obtained by conventional methods, which operate with arbitrary codes of length ⌈log2 N⌉. Eventually, the surface would be larger than that generated by conventional methods, although the height has been reduced to the maximum. Our contribution to this issue is to avoid a significant increase in width through the proper allocation of codes to states. Moreover, the length obtained may even be below ⌈log2 N⌉, without affecting the logical behavior of the control part. Before a detailed presentation of our technique, several definitions are provided below:
– Simple state: One of the states of the finite state machine.
– Composite state: A set of simple states that are used to obtain the same next state and the same command signals.
– Partition: A set of simple and composite states supplying the control part with the same control signals.
– Count of a composite state: Let Ec be a composite state and Pi an arbitrary partition containing Ec; countj(Ec) (1 ≤ j ≤ number of partitions; j ≠ i) is the maximal number of simple states contained in Ec and in the partition Pj (i ≠ j) such that these simple states do not belong to the same composite state in Pj. We then define count(Ec) = max (1, max countj(Ec); 1 ≤ j ≤ number of partitions; j ≠ i).
– Count of a partition: Let Pi be a partition.

$$\mathrm{count}(P_i) = ns + \sum_{j=1}^{nc} \mathrm{count}(Ec_j)$$

where nc is the number of composite states belonging to Pi, Ecj is a composite state belonging to Pi, and ns is the number of simple states that belong to Pi but not to any composite state Ecj.
– max_count = max count(Pi); i=1, 2, …, number of partitions.


These definitions are explained by the following example, in which each line contains four columns (control signals, current state, next state and command signals):

00  s7   s1  10
00  s8   s1  10
00  s9   s4  11
00  s10  s4  11
01  s5   s2  11
01  s6   s2  01
01  s7   s8  11
01  s1   s8  00
10  s1   s4  00
10  s2   s4  00
10  s3   s4  00
10  s5   s7  00
10  s6   s7  00
11  s1   s5  01
11  s4   s5  01
11  s2   s5  01
11  s8   s2  10
11  s3   s1  11

A maximal reduction of the height of the control part can be realized by grouping the lines where the simple states generate the same next state and the same command signals when the control signals are identical. For the above-mentioned example, this leads to the following reduction:

00  {s7,s8}     s1  10
00  {s9,s10}    s4  11
01  s5          s2  11
01  s6          s2  01
01  s7          s8  11
01  s1          s8  00
10  {s1,s2,s3}  s4  00
10  {s5,s6}     s7  00
11  {s1,s2,s4}  s5  01
11  s8          s2  10
11  s3          s1  11
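The grouping of lines described above is a simple "group by" on the triple (control signals, next state, command signals). The following Python sketch reproduces the reduction on the 18 lines of the example; the names `TABLE` and `reduce_rows` are hypothetical and the code is an illustration only, not the tool described in this section.

```python
from collections import defaultdict

# The 18 lines of the example: control signals, current state, next state, command signals.
TABLE = """
00 s7 s1 10
00 s8 s1 10
00 s9 s4 11
00 s10 s4 11
01 s5 s2 11
01 s6 s2 01
01 s7 s8 11
01 s1 s8 00
10 s1 s4 00
10 s2 s4 00
10 s3 s4 00
10 s5 s7 00
10 s6 s7 00
11 s1 s5 01
11 s4 s5 01
11 s2 s5 01
11 s8 s2 10
11 s3 s1 11
"""

def reduce_rows(table):
    """Group current states sharing control signals, next state and command signals."""
    groups = defaultdict(list)  # insertion order of the keys is preserved
    for line in table.split("\n"):
        if line.strip():
            control, current, nxt, command = line.split()
            groups[(control, nxt, command)].append(current)
    return [(control, tuple(states), nxt, command)
            for (control, nxt, command), states in groups.items()]

reduced = reduce_rows(TABLE)
for control, states, nxt, command in reduced:
    label = states[0] if len(states) == 1 else "{" + ",".join(states) + "}"
    print(control, label, nxt, command)
print(len(reduced), "lines")
```

The 18 lines collapse into 11; a group with several current states corresponds to a composite state, and a group with one state to a simple state.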


Hence, the 1st resulting line (among the 11 lines obtained) is due to the fact that states s7 and s8 generate the same next state (s1) and the same command signals (10) when the control part is supplied with control signals 00. It is worth noting that even the best logical synthesis tool cannot produce fewer than 11 product terms for the above-described example. In other terms, this grouping of lines generates the optimal control part in terms of height. However, in order to minimize the surface, the width must not become too large, which can happen because the state code length can exceed ⌈log2 N⌉, the length used by conventional methods for arbitrarily coding the N states. Before analyzing why the maximal reduction of the height may lead to a code length exceeding ⌈log2 N⌉, we illustrate the definitions on the obtained reduction:
– the simple states are: s1, s2, s3, s4, s5, s6, s7, s8, s9 and s10;
– the composite states are: Ec1={s7,s8}, Ec2={s9,s10}, Ec3={s1,s2,s3}, Ec4={s5,s6} and Ec5={s1,s2,s4}. Later, it will be seen that certain of these states are in fact simple states;
– the partitions are: P1, P2, P3 and P4, which correspond respectively to the control signals 00, 01, 10 and 11;
– count(sj) = 1; j=1, 2, …, 10;
– count(Ec1) = 1 as s7 and s8 belong to different partitions (P2 and P4), other than the one containing Ec1 (P1 only, in this example);
– count(Ec2): s9 and s10 do not belong to any partition other than P1: count(Ec2)=max(1,0)=1;
– count(Ec3): count4(Ec3)=2 as s1 (or s2) belongs to the same partition (P4) as s3. count2(Ec3)=1 as s1 belongs to the partition P2: count(Ec3)=max(1,max(2,1))=2;
– count(Ec4): count2(Ec4) = 2 as s5 and s6 belong to the same partition (P2). Therefore, count(Ec4)=max(1,2)=2;
– count(Ec5): count2(Ec5) = 1 as s1 belongs to the partition P2; count3(Ec5) = 1 as s1 (or s2) belongs to the partition P3. Therefore, count(Ec5) = max(1, 1, 1) = 1;
– count(P1)=0+count(Ec1)+count(Ec2)=2;
– count(P2)=4;
– count(P3)=0+count(Ec3)+count(Ec4)=4;


– count(P4)=2+count(Ec5)=3;
– max_count=max(2,4,4,3)=4.
For more clarity, the reduced description of the control part is represented in Figure 3.4. Other code allocation tools initialize the code length at 1, which is not the case with ours. The definition of max_count is important in that it gives, from the beginning, the minimal length of the code, which saves CPU time. Our initial length is equal to ⌈log2 max_count⌉, that is ⌈log2 4⌉ = 2 bits for our example. This means that, for this instance, it is not possible to code the states with fewer than 2 bits without impacting the proper operation of the control part. It is worth noting that the states Ec1, Ec2 and Ec5 (for which count = 1) are in fact simple states. Because the height of the control part has been reduced to the maximum, a correct coding of the states is required to avoid an unnecessary increase in the code length and its negative impact on the surface. To address this issue, one of our strategies, which is a characteristic of our tool, is to assign the same code to two or several states, without affecting the proper behavior of the control part.
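The counts listed above can be reproduced with a few lines of Python. This is an illustrative sketch under stated assumptions, not the tool itself: `PARTITIONS`, `count_state` and `count_partition` are hypothetical names, each partition is written as a list of groups (a singleton group standing for a simple state), count_j(E) is computed as the number of distinct groups of partition Pj that share a state with E, and ⌈log2 n⌉ is obtained with `(n - 1).bit_length()`.

```python
# Reduced description of the example, one list of groups per partition.
PARTITIONS = {
    "P1": [{"s7", "s8"}, {"s9", "s10"}],
    "P2": [{"s5"}, {"s6"}, {"s7"}, {"s1"}],
    "P3": [{"s1", "s2", "s3"}, {"s5", "s6"}],
    "P4": [{"s1", "s2", "s4"}, {"s8"}, {"s3"}],
}

def count_state(state, own, partitions):
    """count(E) = max(1, max over the other partitions Pj of count_j(E))."""
    best = 0
    for name, groups in partitions.items():
        if name == own:
            continue  # the definition excludes the partition containing E
        best = max(best, sum(1 for g in groups if g & state))
    return max(1, best)

def count_partition(name, partitions):
    """count(Pi) = ns + sum of count(Ecj) over the composite states of Pi."""
    groups = partitions[name]
    ns = sum(1 for g in groups if len(g) == 1)
    return ns + sum(count_state(g, name, partitions)
                    for g in groups if len(g) > 1)

counts = {name: count_partition(name, PARTITIONS) for name in PARTITIONS}
max_count = max(counts.values())
initial_length = (max_count - 1).bit_length()  # ceil(log2(max_count))
print(counts, max_count, initial_length)
```

The sketch returns 2, 4, 4 and 3 for P1 to P4, hence max_count = 4 and an initial code length of 2 bits, in agreement with the values given in the text.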

Figure 3.4. Instance of the considered problem, corresponding to the given example

This is confirmed by the following lemma: Two arbitrary simple states si and sj can receive the same code, without affecting the logical behavior of the control part, if: i) there is no partition Pk such that si ∈ Pk and sj ∈ Pk; or ii) whenever such partitions Pk exist, si and sj belong to the same composite state of Pk.
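The two conditions of the lemma translate directly into a test on the partitions. The sketch below applies it to the example of Figure 3.4; the names `PARTITIONS` and `can_share_code` are hypothetical, and the groups of a partition are assumed to be disjoint (a state appears at most once per partition).

```python
# Reduced description of the example, one list of groups per partition.
PARTITIONS = {
    "P1": [{"s7", "s8"}, {"s9", "s10"}],
    "P2": [{"s5"}, {"s6"}, {"s7"}, {"s1"}],
    "P3": [{"s1", "s2", "s3"}, {"s5", "s6"}],
    "P4": [{"s1", "s2", "s4"}, {"s8"}, {"s3"}],
}

def can_share_code(si, sj, partitions):
    """True if si and sj never meet in a partition, or only inside one composite state."""
    for groups in partitions.values():
        gi = next((g for g in groups if si in g), None)
        gj = next((g for g in groups if sj in g), None)
        if gi is not None and gj is not None and gi is not gj:
            return False  # same partition, different groups: codes must differ
    return True

print(can_share_code("s7", "s8", PARTITIONS))  # True
print(can_share_code("s1", "s3", PARTITIONS))  # False
```

Here s7 and s8 may share a code (they only meet inside Ec1), whereas s1 and s3 may not, because they belong to different groups of P4.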


PROOF.– i) Let x={xi1,xi2,…,xi|x|} be the values of the bits supplying the PLA with the variables of si, and x’={xj1,xj2,…,xj|x|} be those introduced to the PLA with the variables of sj. As si and sj do not belong to the same partition, we then have x ≠ x’. Consequently, the functions δ and λ determine the new state and the command signals in a deterministic manner. ii) Since in each partition Pk there is a composite state Ec containing si and sj, these states contribute (with the control signals associated with Pk) to producing the same next state and the same command signals. As there are no other partitions containing si and sj, the PLA implementing the control part cannot be non-deterministic.
Let c1=00, c2=01 and c3=11 be the respective codes of s1, s2 and s3. Consider E={s1,s2,s3}. The code of E, which is the “intersection” of c1, c2 and c3, is c = ** = {00,01,10,11}. We have c1 ∩ c2 = ∅ (c1 and c2 do not share any code) and c1 ∩ c = {00}. We define B={0,1} and B’={0,1,*}, and formulate the problem of code assignment to the states subject to constraints as follows:
(P):
Instance:
J={j; j ∈ B’^|X| are the bits corresponding to the control signals and correspond to a certain partition P} (∅) for a Mealy (Moore) machine;
S={s; s is a state such that count(s)=1};
C={E; E is a composite state; count(E)>1}.
Objective function: f: (J, S ∪ C) → {codes c such that the code length of c is equal to k}.
Find the minimal value of k such that: f(j, Em) ∩ f(j, En) = ∅ when Em ≠ En and ∃ a partition P such that Em ∈ P, En ∈ P and there is no state E ∈ C such that Em ⊆ E, En ⊆ E in P.


Here are some considerations with respect to the above formulated problem:
a) As noted above, c1 ∩ c ≠ ∅ even though c1 ≠ c (as c covers c1). Let us assume that C = ∅ (there is no composite state). In order to meet the constraint of the problem, it is sufficient to have (c1=f(j, Em)) ≠ (c2=f(j, En)), as c1 (c2) cannot cover c2 (c1).
b) Let us now assume that C ≠ ∅, but Em ∩ En = ∅ ∀ m ≠ n. Let us assume there is a function g such that: f(j, Em) = cm ∘ g(j, Em), where c1 ∘ c2 stands for concatenating code c2 to code c1. In order to meet the constraint of the problem, we simply need to make sure that cm ≠ cn (note that each bit of code cm or cn must belong to B). In other terms, cm (cn) cannot cover cn (cm).
These two observations enable a significant gain in CPU time. Indeed: a) for two simple states that belong to the same partition, it is sufficient that they have different codes for the constraint to be met; b) even though g(j,Em) covers g(j,En), the codes of Em and En cannot cover each other as cm ≠ cn. We define ci (1 ≤ i ≤ |C|) as follows: ci is a code whose length is ⌈log2 |C|⌉ and whose decimal value ranges between 0 and |C| - 1. It is worth noting that |C| is the number of distinct composite states. The function g will be detailed subsequently. Let us note that cm ≠ cn for two arbitrary states Em and En that do not belong to the same composite state, and that the partial code of a simple state included in a composite state Ei is equal to ci.
Because of these two considerations, we have decomposed the posed problem (P) into three sub-problems:
(SP1): solve (P) if C = ∅;
(SP2): solve (P) if |C| > 0, but with Em ∩ En = ∅ ∀ m ≠ n;
(SP3): solve (P) if |C| > 0 with the existence of at least two states Ei and Ej such that Ei ∩ Ej ≠ ∅; Ei ∈ C; Ej ∈ C; i ≠ j.
Figures 3.5 and 3.6 (note that E1 is a simple state) are two instances of (SP1).


Figure 3.5. Instance of (SP1)

Figure 3.6. Another instance of (SP1)

Figures 3.7 and 3.4 are respectively instances of (SP2) and (SP3).

Figure 3.7. Instance of (SP2)

Resolution of (SP1)

(SP1) can be solved by a graph coloring heuristic. Indeed, (SP1) can be transformed into a graph coloring problem as follows:
* associate a node with each state;
* color the graph;
* define the code length as L = ⌈log2 (number of colors)⌉;
* the “colors” 1, 2, …, K being converted into binary on length L, assign the codes to the states by establishing the correspondence between the nodes of the graph and the states (a given state receives the “color” assigned to its corresponding node).
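The steps above can be sketched in Python with a simple greedy coloring. Several points are assumptions made for this illustration only: the text does not spell out how the edges are built, so two nodes are connected here when the corresponding states appear in the same partition; the coloring order (decreasing degree) is a common greedy choice, not necessarily the one used by the tool; and the instance `partitions` is hypothetical. E1 denotes a composite state of count 1, treated as a single node whose code is inherited by its simple states.

```python
from itertools import combinations

def greedy_coloring(partitions):
    """Color the conflict graph: states of the same partition get different colors."""
    nodes, adj = [], {}
    for groups in partitions:
        for g in groups:
            if g not in adj:
                adj[g] = set()
                nodes.append(g)
        for a, b in combinations(groups, 2):
            adj[a].add(b)
            adj[b].add(a)
    color = {}
    for n in sorted(nodes, key=lambda n: -len(adj[n])):  # highest degree first
        used = {color[m] for m in adj[n] if m in color}
        color[n] = next(c for c in range(len(nodes)) if c not in used)
    return color

def assign_codes(partitions):
    """Convert colors to binary codes of length ceil(log2(number of colors))."""
    color = greedy_coloring(partitions)
    n_colors = max(color.values()) + 1
    length = (n_colors - 1).bit_length()
    codes = {n: format(c, "0{}b".format(length)) if length else ""
             for n, c in color.items()}
    return codes, length

# Hypothetical instance of (SP1): three partitions over E1, s4, s5 and s6.
partitions = [["E1", "s4", "s5"], ["s6", "s4"], ["s6", "s5"]]
codes, length = assign_codes(partitions)
print(length, codes)
```

In this instance three colors are needed, so the codes are 2 bits long, and E1 and s6 receive the same code because they never appear in the same partition.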


For Figure 3.6, we have what follows: nb_bits = ⌈log2 max_count⌉ = ⌈log2 3⌉ = 2; code(E1) = code(s1) = code(s2) = code(s3) = 01; code(s4) = 10, code(s5) = 00, code(s6) = 01. This coding verifies that:
– each state has one and only one code;
– the states that belong to the same partition have different codes (meeting the constraint).
Note that:
– even though states s1, s2 and s3 have the same code (feature of our technique), the PLA operates deterministically (previously mentioned lemma);
– code(s6) is the same as code(E1) (but since the bits of the control signals that are concatenated with the codes of s6 and E1 are different, the new state and the command signals are generated deterministically, without ambiguity);
– the length of the codes obtained is less than 3 bits (= ⌈log2 6⌉).
For Figure 3.8, our technique detects that it is useless to code the states (only one simple state per partition). Indeed, the control signals are sufficient to generate the next state and the command signals, according to the associated finite state machine.


Figure 3.8. Instance of (SP1) that does not require coding


Resolution of (SP2)

For (SP2), there is at least one composite state for which the count is greater than 1. Let k be the count value. This means that at least k simple states belonging to this composite state must have different codes, whereas the simple states belonging to the same composite state all have the same code in (SP1). Hence, the heuristic used to solve (SP1) is not valid for the resolution of (SP2). Before discussing this resolution, we make the following significant remarks using the example of Figure 3.7. Let us assume that code 00 has been assigned to the simple state s1. On the one hand, if code 11 is assigned to s2, the code of the composite state E1 would be ** (this code is the “intersection” of codes 00 and 11), thus covering the four codes 00, 01, 10 and 11. There would then exist no code on 2 bits for state s3, compelling us to increase the length of the codes by 1 bit. On the other hand, if code 01 is assigned to s2, code(E1) would be 0*, thus covering only two codes (00 and 01). Consequently, 10 or 11 could be assigned to s3, thus avoiding the unnecessary increase in the length of the codes. This brings us to the following definition:
DEFINITION.– Two codes are adjacent if they differ by 1 and only 1 bit.
This definition is important in that assigning adjacent codes to the simple states of the same composite state avoids the unnecessary increase in the length of the code (the “intersection” of adjacent codes covers few codes). For the instances of (SP2), two fields of code are used:
– the 1st field makes it possible to distinguish between two different composite states (two states are different if E1 ∩ E2 ≠ E1 and E1 ∩ E2 ≠ E2, where E1 and E2 can belong to only one or several partitions);
– the 2nd field makes it possible to differentiate the simple states belonging to the same composite state while involving different next states and different command signals (e.g. s1 and s2 in the partition P2, Figure 3.9).
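The remark on adjacent codes can be checked with a small Python sketch of the code “intersection” used above. The function names are hypothetical; a composite code keeps a bit where all member codes agree and uses * elsewhere.

```python
def intersection_code(codes):
    """Code of a composite state: common bits are kept, differing bits become '*'."""
    return "".join(bits[0] if len(set(bits)) == 1 else "*"
                   for bits in zip(*codes))

def covers(cube, code):
    """True if the code with '*' positions (cube) covers the fully specified code."""
    return all(c == "*" or c == b for c, b in zip(cube, code))

def covered_codes(cube):
    """All fully specified codes covered by the cube."""
    n = len(cube)
    all_codes = [format(v, "0{}b".format(n)) for v in range(2 ** n)]
    return [c for c in all_codes if covers(cube, c)]

def adjacent(a, b):
    """Two codes are adjacent if they differ by one and only one bit."""
    return sum(x != y for x, y in zip(a, b)) == 1

print(intersection_code(["00", "11"]), covered_codes("**"))  # ** covers 4 codes
print(intersection_code(["00", "01"]), covered_codes("0*"))  # 0* covers 2 codes
```

Choosing the adjacent pair (00, 01) for s1 and s2 leaves 10 and 11 free for s3, whereas the non-adjacent pair (00, 11) leaves no 2-bit code available.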

Figure 3.9. Another instance of (SP2)


In order to code the 2nd field of each composite state E, the function g uses an initial code length equal to nb2 bits: nb2 = max(nb_bits - nb1, ⌈log2 max(count(Ei))⌉); i=1, …, nc. As a simple state belonging to a composite state is coded, its code is stored in an initially empty set A. Let us note that bit 0 is added at the end of the codes of A each time it is required (removing any non-determinism of the finite state machine). Once the simple states of the composite state E are coded, the intersection of the codes of A is assigned to E. Then, the simple states that do not belong to any composite state are in turn coded, using an initial code length equal to the sum of the lengths of the two fields. The second heuristic, used to solve (SP2), is explained by means of the example in Figure 3.10: nc=2; count(E1)=2; count(E2)=2; count(P1)=2+1+1=4; count(P2)=2+2=4; count(P3)=1+1=2; count(P4)=4; count(P5)=2; max_count = max(4, 4, 2, 4, 2) = 4; nb_bits = ⌈log2 4⌉ = 2; nb1 = ⌈log2 nc⌉ = 1; nb2 = max(nb_bits - nb1, ⌈log2 max(count(E1), count(E2))⌉) = max(1, 1) = 1.
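The initial lengths of the two fields in this example can be recomputed as follows. The function names are hypothetical, and ⌈log2 n⌉ is computed with `(n - 1).bit_length()`, which is exact for positive integers.

```python
def ceil_log2(n):
    """Smallest number of bits able to distinguish n values (n >= 1)."""
    return (n - 1).bit_length()

def field_lengths(max_count, composite_counts):
    """Return (nb_bits, nb1, nb2) from max_count and the counts of the composite states."""
    nb_bits = ceil_log2(max_count)
    nb1 = ceil_log2(len(composite_counts))           # 1st field: one code per composite state
    nb2 = max(nb_bits - nb1, ceil_log2(max(composite_counts)))
    return nb_bits, nb1, nb2

# Example of Figure 3.10: max_count = 4, count(E1) = count(E2) = 2.
print(field_lengths(4, [2, 2]))  # (2, 1, 1)
```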

Figure 3.10. Example of an instance of SP2

The 1st field of E1, s1, s2 and s3 is initialized at bit 0, and that of E2, s5 and s6 receives bit 1. At this level, all the states belonging to different composite states have different codes. Then, using nb2=1 bit as the initial code length, the 2nd field of the simple states belonging to E1 and E2 is coded: E1: A=∅; s1 receives code 00; A={00}; s2 receives code 01; A={00,01} (00 cannot be assigned to s2: see partition P4). There is no more 1-bit length


code (for the 2nd field) for s3: the total length is therefore increased to 3 bits and code 001 is assigned to s3. We then have A={000,010,001}. The codes of E1, s1, s2 and s3 will then be 0**, 000, 010, 001 respectively. E2: A=∅; s5 receives code 100; s6 receives code 101. The code of E2 will then be 10*. Finally, s4 receives code 100. Note that this coding requires a length equal to ⌈log2 6⌉ = 3 and verifies that: a) each state has one and only one code; b) no code covers a forbidden code (no non-determinism for the finite state machine). This heuristic is efficient and is able to code a large number of simple and composite states rapidly. Unfortunately, it cannot solve (SP3), which requires an appropriate heuristic. This is the object of what follows.

Resolution of (SP3)

In (SP3), two or several composite states can share simple states, which is not the case for (SP2). A code assigned to a simple state can be efficient in relation to one composite state, but not necessarily to the other composite states that contain it. Moreover, in order to avoid the unnecessary increase in the length of the codes, this heuristic takes into account two aspects: arranging the states in order before coding them, and identifying the best code for the state being coded.
Order of states: the 1st simple states that our heuristic codes are those that belong to at least two composite states. These simple states are arranged in order depending on their occurrences in the composite states. This is a good strategy in that, if these states were the last to be coded, it would be difficult to find codes preserving the proper operation of the finite state machine without increasing the length of the codes several times. The set of priority states for the coding is defined as follows: G={esj ∈ Eci; i=1,2,…,max1 such that there is at least 1 state esj ∈ Eck; k ≠ i; max1 is the number of composite states}.


We also define three sets as follows: H={Eci; esj ∈ Eci; esj is the state being coded}; I={Eci; i=1,2,…,max1 such that there is 1 state esj ∈ Eci with esj ∈ G and at least 1 state esk ∈ Eci with esk ∉ G; k ≠ j}; J={Eci; i=1,2,…,max1 such that ∀ esj ∈ Eci, esj ∉ G}. Once all the simple states belonging to G are coded, we start the coding of the composite states belonging to I, giving priority to those with the largest “count” value. This is followed by the coding of the composite states of J, giving priority to those with the largest cardinal. Finally, the simple states that do not belong to any composite state are coded.
To avoid the unnecessary increase in the code length, our tool first tries to assign a code that has already been used to the state being coded, without impacting the proper behavior of the finite state machine. If this is not possible, the code to be assigned must meet the following two conditions:
– the code must be adjacent to the largest possible number of codes assigned to the states of the set H (“intersection” of codes covering the smallest possible number of codes);
– the code must be adjacent to as few codes as possible among those of the composite states included in the same partitions as those containing the set H.
Let esj be the state being coded. We define: H={Eci; esj ∈ Eci; count(Eci) > 1}; F={Eci; Eci ∈ Ps; Eci ∉ H; count(Eci) > 1}, where Ps is the set of partitions containing the composite states of H. Since certain codes can belong to both CH and CF, we define D = CH ∩ CF, where CH={codes of simple states included in Eci; Eci ∈ H} and CF={codes of simple states included in Eci; Eci ∈ F} (see Figure 3.11). We then define C’H = CH - D and C’F = CF - D, from which degree1 and degree2 are determined: degree1 is the number of codes included in C’H that are adjacent to a certain code c; degree2 is the number of codes included in C’F that are adjacent to a certain code c. We can therefore consider D as a set of parasitic codes.


Figure 3.11. Example of an instance of SP3

Let s3 be the simple state being coded: H={E1,E3}; Ps={P1,P2}; F={E2,E4}. Let us assume that s1 and s2 have already been coded. Then, we have: D={code(s1), code(s2)}. Let us note that the code c which will be assigned to esj is a non-forbidden code taken in the interval [0, 2^nb_bits - 1], with nb_bits being the current length of the codes. This code is determined as follows:
a) C’H = ∅ and C’F = ∅: this is the case when CH=CF=D. In fact, this is the least favorable case, and any non-forbidden code is assigned to esj;
b) C’H = ∅ and C’F ≠ ∅: this is the case when CH=D and CH ⊂ CF. As C’H is empty, only degree2 can determine the quality of the code c that can be assigned to esj: c must be non-adjacent to the largest possible number of codes included in C’F;
c) C’H ≠ ∅ and C’F = ∅: this is the case when CF=D and CF ⊂ CH. The code c that gives degree1 the largest value is assigned to esj;
d) C’H ≠ ∅ and C’F ≠ ∅: this is the case when CH ⊄ CF and CF ⊄ CH. The approach to determining c is as follows:
if |C’H| > |C’F| then
   c is the code that gives degree1 the largest possible value degree_a;
   /* if many codes like c exist, select the one that is the least adjacent to the codes included in C’F and satisfies degree1=degree_a */
else
   c is the code that gives degree2 the smallest possible value degree_n;
   /* if many codes like c exist, select the one that is the most adjacent to the codes included in C’H and satisfies degree2=degree_n */
endif
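The four cases a) to d) can be written as a single selection function. The sketch below is an illustration under stated assumptions, not the implementation of the tool: `choose_code` and its parameters are hypothetical names, the set of forbidden codes is supplied by the caller, ties are broken by taking the smallest binary value, and the earlier attempt to reuse an already assigned code is not modeled.

```python
def adjacent(a, b):
    """Two codes are adjacent if they differ by one and only one bit."""
    return sum(x != y for x, y in zip(a, b)) == 1

def degree(code, others):
    """Number of codes of `others` that are adjacent to `code`."""
    return sum(adjacent(code, o) for o in others)

def choose_code(nb_bits, forbidden, ch, cf):
    """Pick a non-forbidden code following cases a) to d); None if no code is left."""
    d = ch & cf                       # parasitic codes
    ch1, cf1 = ch - d, cf - d         # C'H and C'F
    candidates = [format(v, "0{}b".format(nb_bits)) for v in range(2 ** nb_bits)]
    candidates = [c for c in candidates if c not in forbidden]
    if not candidates:
        return None                   # the code length has to be increased
    if not ch1 and not cf1:           # case a): any non-forbidden code
        return candidates[0]
    if not ch1:                       # case b): least adjacent to C'F
        return min(candidates, key=lambda c: degree(c, cf1))
    if not cf1:                       # case c): most adjacent to C'H
        return max(candidates, key=lambda c: degree(c, ch1))
    if len(ch1) > len(cf1):           # case d), first branch
        return max(candidates, key=lambda c: (degree(c, ch1), -degree(c, cf1)))
    return min(candidates, key=lambda c: (degree(c, cf1), -degree(c, ch1)))

# Case b) with 3-bit codes: C'H is empty and C'F = {000, 101, 011}.
used = {"000", "101", "011"}
print(choose_code(3, forbidden=used, ch=set(), cf=used))  # 110
```

In the usage line, the three codes of CF are assumed to be forbidden for the state being coded; 110 is then the only remaining code that is adjacent to none of them.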


There may be cases when the current code length must be increased by 1 bit because the codes of certain composite states cover forbidden codes. In this case, a bit of value 1 is concatenated to the code of each simple state esj verifying esj ∈ Eci; Eci ∈ H, where Eci belongs to the partition in which the code of Eci has covered forbidden codes. A bit of value 0 is concatenated to the codes of all the other simple states and to those of the composite states that do not include esj. Then, for each Eck ∈ H (k ≠ i), we verify whether the code of Eck covers a forbidden code. If such is the case, the same process is applied. We give further explanations through Figure 3.12.

Figure 3.12. Another example of an instance of SP3

count(E1)=2; count(E2)=1; count(E3)=1; count(P1)=2+1+1=4; count(E4)=1; count(E5)=2; count(P2)=1+2=3; count(E6)=2; count(E7)=1; count(E8)=1; count(P3)=2+1+1=4; count(P4)=1+1=2; max_count = max(4, 3, 4, 2) = 4; nb_bits = ⌈log2 max_count⌉ = ⌈log2 4⌉ = 2. J = {E1, E2, E3, E4, E5, E6, E7, E8}; I = ∅;

G={s4,s2,s3,s5,s7,s8}.

The arrangement of the elements of G in the descending order of their occurrences in the composite states yields: G={s3,s7,s4,s2,s5,s8}. s3 receives code 00. s7: H={E2,E5,E6}; Ps={P1, P2, P3}; s3 ∈ E6, but 00 cannot be assigned to s7 (see partition P1).


CH = {code(s3)} = {00}; CF = {code(s3)} = {00}. D = CH ∩ CF = {00}; C’H = CH - D = ∅; C’F = CF - D = ∅. This is the least favorable case, since the code of s7 would have to be simultaneously adjacent and non-adjacent to the code of s3! Consequently, any non-forbidden code (e.g. 01) can be assigned to s7.

s4: H={E1,E4}; Ps={P1, P2}; our heuristic assigns 00 (which is the code of s3) to s4.
s2: H={E1,E5}; Ps={P1,P2}; neither the code of s4 nor that of s3 (see P2) can be assigned. The code of s7 cannot be assigned to s2 either (see P1). CH = {code(s4), code(s3), code(s7)} = {00, 01}. CF = {code(s7), code(s3), code(s4)} = {01, 00}; D={00,01}. Therefore, C’H=C’F=∅. This is again the unfavorable case, and any non-forbidden code (e.g. 10) can be assigned to s2. We then have code(E5)=**, thus covering code(E4)=00 in the partition P2. Therefore Pi = P2, and a bit of value 1 is concatenated to the codes of the states included in E5 (s2 and s7), while a bit of value 0 is concatenated to the codes of the states other than s2 and s7 and to those of the composite states that include neither s2 nor s7. We then obtain: code(s3)=000, code(s7)=011, code(s4)=000, code(s2)=101, code(E1)=*0*, code(E2)=011, code(E4)=000, code(E5)=**1, code(E6)=0**. Then, we verify whether there is any code in Pj ∈ (Ps - {P2}) that covers a forbidden code. Since Ps={P1,P2}, this verification is conducted only in the partition P1: s2 ∈ E1; code(E1) does not cover code(E2).
s5: H={E2,E6}; code 011 (of s7) is assigned to s5.
s8: H={E3,E7}; Ps={P1,P3}. CH=∅; CF={code(s4), code(s2), code(s3), code(s5), code(s7)}={000,101,011}; D = CH ∩ CF = ∅. Hence, C’H=∅ and C’F=CF. Since C’H is empty, the best code to assign to s8 is the one that is the least adjacent to 000, 101 and 011. Code 110 is then assigned to s8.
At this level, all the states included in G are coded, and we have: I={E2,E3,E7} and J={E8}.
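The lengthening step applied above after the coding of s2 can be replayed with a short Python sketch. The helper names are hypothetical; the 2-bit codes and the memberships used (s2 and s7 in E5, s4 and s2 in E1, s3 and s7 in E6) are those given in the text, restricted to the states coded so far.

```python
def intersection_code(codes):
    """Code of a composite state: common bits are kept, differing bits become '*'."""
    return "".join(bits[0] if len(set(bits)) == 1 else "*"
                   for bits in zip(*codes))

def lengthen(codes, offending_members):
    """Append 1 to the codes of the states of the offending composite state, 0 elsewhere."""
    return {s: c + ("1" if s in offending_members else "0")
            for s, c in codes.items()}

codes = {"s3": "00", "s7": "01", "s4": "00", "s2": "10"}   # 2-bit codes before the step
E5 = {"s2", "s7"}

before = intersection_code([codes[s] for s in sorted(E5)])  # '**' covers code(E4) = 00
codes = lengthen(codes, E5)
print(before, codes)
print(intersection_code([codes["s2"], codes["s7"]]))  # code(E5) = **1
print(intersection_code([codes["s4"], codes["s2"]]))  # code(E1) = *0*
print(intersection_code([codes["s3"], codes["s7"]]))  # code(E6) = 0**
```

After the step, code(E5) = **1 no longer covers code(E4) = 000, and the composite codes *0* and 0** match those listed in the text.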


E2: s6: A={code(s5), code(s7)}={011}; 011 is assigned to s6. Hence, code(E2) = 011. E3: s10: A={code(s8)}={110}; the heuristic assigns 110 to s10. Therefore, code(E3)=110. E7: s9: A={code(s8)}={110}; code 110 is assigned to s9. Hence, code(E7)=110. J={E8}: s11 and s12 receive the same code 100. Hence, code(E8) = 100. Then, s1 and s13 are respectively coded with 000 and 001. Note that the length of the codes obtained (3) is therefore less than log2 13 = 4 bits, and that certain states (e.g. s1 and s4) have the same code (using the previously mentioned lemma), even though they generate different next states and different command signals. In addition to these examples, our three heuristics have been compared to published works, using 43 MCMC FSM benchmarks. The results obtained (Table 3.6) show that: – only our tool (SAFT) can give a length of the codes that is less than log2 Number of states; – our results are the best in 74% of cases. By avoiding the creation of composite states and applying only the 1st heuristic (applicable to simple states), the two versions of our tool give the best result in 100% of cases (choosing the best result obtained by the two versions). H e u r.

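| FSM | Heur. | CPU time (s) | SAFT | STOIC | ENCORE | NOVA | ENC | KISS | DIET | Best tool | 1st % | 2nd % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ex2 | h3 | 2 | 8 | 5 | 6 | 6 | ? | 6 | 6 | STOIC | 120% | 160% |
| ex3 | h3 | 1 | 5 | 4 | 6 | 6 | ? | 5 | 7 | STOIC | 125% | 175% |
| ex5 | h3 | 0 | 3 # | 4 | 5 | 6 | ? | 5 | 5 | SAFT | 133% | 200% |
| ex6 | h3 | 0 | 1 # | 3 | 4 | 5 | ? | 5 | 4 | SAFT | 300% | 500% |
| ex7 | h3 | 1 | 6 | 4 | 6 | ? | ? | ? | 6 | STOIC | 150% | 150% |
| keyb | h3 | 3 | 2 # | 5 | 8 | 7 | 9 | 8 | 8 | SAFT | 250% | 450% |
| lion9 | h3 | 1 | 2 # | 4 | 4 | ? | ? | 4 | 4 | SAFT | 200% | 200% |
| shiftreg | h1 | 0 | 3 | 3 | 3 | 3 | ? | ? | 3 | SAFT, STOIC, ENCORE, NOVA, DIET | -% | -% |
| train11 | h3 | 1 | 5 | 4 | 5 | 5 | ? | 6 | 5 | STOIC | 125% | 150% |
| bbsse | h3 | 1 | 4 | 6 | 6 | 6 | 7 | 6 | 6 | SAFT | 150% | 175% |
| dk27 | h3 | 0 | 4 | 4 | 3 | 3 | ? | ? | ? | ENCORE, NOVA | 133% | 133% |
| ex1 | h3 | 3 | 4 # | 7 | 7 | 7 | 9 | 7 | 7 | SAFT | 175% | 225% |
| ex4 | h1 | 0 | 3 # | 6 | 4 | ? | ? | ? | 4 | SAFT | 133% | 200% |
| opus | h1 | 0 | 2 # | 7 | 4 | ? | ? | ? | ? | SAFT | 200% | 350% |
| pma | h1 | 1 | 4 # | 8 | ? | ? | ? | ? | ? | SAFT | 200% | 200% |
| s1 | h1 | 2 | 2 # | 6 | 5 | 5 | 7 | 5 | 5 | SAFT | 250% | 350% |
| sse | h3 | 1 | 4 | 6 | 6 | ? | ? | ? | 6 | SAFT | 150% | 150% |
| tma | h1 | 1 | 4 # | 10 | ? | ? | ? | ? | ? | SAFT | 250% | 250% |
| bbara | h3 | 1 | 7 | 8 | 5 | 5 | ? | ? | ? | ENCORE, NOVA | 140% | 160% |
| cse | h3 | 1 | 4 | 9 | 5 | 5 | 7 | 6 | 5 | SAFT | 125% | 225% |
| dk14 | h3 | 1 | 5 | 6 | 4 | 5 | ? | 4 | 4 | ENCORE, KISS, DIET | 125% | 150% |
| dk15 | h3 | 0 | 5 | 4 | 4 | 4 | ? | 4 | 4 | STOIC, ENCORE, NOVA, KISS, DIET | 125% | 125% |
| dk16 | h3 | 2 | 12 | 6 | 8 | 10 | 12 | 10 | 8 | STOIC | 133% | 200% |
| dk17 | h3 | 1 | 4 | 6 | 4 | 4 | ? | 4 | 4 | SAFT, ENCORE, NOVA, KISS, DIET | 150% | 150% |
| dk512 | h3 | 0 | 4 | 6 | 5 | 6 | 9 | 6 | 5 | SAFT | 125% | 225% |
| lion | h1 | 0 | 1 # | 3 | 2 | ? | ? | ? | 2 | SAFT | 200% | 300% |
| mc | h1 | 0 | 1 # | 4 | 2 | ? | ? | ? | ? | SAFT | 200% | 400% |
| planet | h3 | 2 | 5 # | 8 | 7 | 6 | F | ? | F | SAFT | 120% | 160% |
| train4 | h3 | 0 | 2 | 3 | 2 | ? | ? | ? | 2 | SAFT, ENCORE, DIET | 150% | 150% |
| tav | h1 | 1 | 2 | ? | 2 | ? | ? | ? | 2 | SAFT, ENCORE, DIET | -% | -% |
| s8 | h3 | 0 | 3 | ? | 3 | ? | ? | ? | 3 | SAFT, ENCORE, DIET | -% | -% |
| bbtas | h2 | 0 | 3 | ? | 3 | ? | ? | 3 | ? | SAFT, ENCORE, KISS | -% | -% |
| beecount | h3 | 1 | 3 | ? | 4 | 4 | ? | ? | ? | SAFT | 133% | 133% |
| modulo12 | h1 | 0 | 4 | ? | 4 | ? | ? | ? | ? | SAFT, ENCORE | -% | -% |
| mark1 | h1 | 0 | 4 | ? | 4 | ? | ? | 5 | 4 | SAFT, ENCORE, DIET | 125% | 125% |
| kirkman | h1 | 13 | 3 # | ? | 6 | ? | 11 | ? | 6 | SAFT | 200% | 366.7% |
| s1a | h1 | 2 | 2 # | ? | 5 | ? | 7 | 5 | 5 | SAFT | 250% | 350% |
| donfile | h3 | 3 | 39 | ? | 6 | 15 | 12 | 12 | 7 | ENCORE | 116.7% | 650% |
| styr | h3 | 4 | 6 | ? | 6 | 9 | F | 6 | 6 | SAFT, ENCORE, KISS, DIET | 150% | 150% |
| s298 | h3 | 891 | 15 | ? | ? | ? | ? | ? | ? | SAFT | -% | -% |
| tbk | h3 | 18,325 | 96 | ? | 23 | F | 12 | ? | F | ENC | 191.7% | 800% |
| scf | h1 | 8 | 7 | ? | 8 | ? | F | 8 | F | SAFT | 114.3% | 114.3% |
| sand | h2 | 3 | 5 | ? | 6 | 6 | 11 | 6 | 6 | SAFT | 120% | 220% |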

hi: applied heuristic (h1: SP1; h2: SP2; h3: SP3). F: failed. ?: unavailable result. #: length of the codes < log2 n (n is the number of states of the finite state machine). 1st percentage: 2nd best code length/best code length. 2nd percentage: worst code length/best code length.

Table 3.6. Comparison of SAFT to other tools for assigning codes to the states of a finite state machine
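As a quick check of how the two percentages in the legend are obtained, the following Python sketch recomputes them for one row of Table 3.6. The function names `min_code_length` and `row_ratios` are illustrative and are not part of SAFT; it is assumed here that "2nd best" means the second smallest distinct code length, and tools whose result is "?" or "F" are simply left out of the input.

```python
from typing import Dict, List, Optional, Tuple


def min_code_length(n_states: int) -> int:
    """Smallest length giving every state a distinct code: ceil(log2 n)."""
    return (n_states - 1).bit_length()


def row_ratios(
    lengths: Dict[str, int],
) -> Tuple[List[str], Optional[float], Optional[float]]:
    """Return (best tools, 2nd best/best in %, worst/best in %).

    Both percentages are None when all tools give the same length
    (printed as "-%" in the table).
    """
    best = min(lengths.values())
    best_tools = [tool for tool, n in lengths.items() if n == best]
    distinct = sorted(set(lengths.values()))
    if len(distinct) == 1:
        return best_tools, None, None
    second = round(100 * distinct[1] / best, 1)
    worst = round(100 * distinct[-1] / best, 1)
    return best_tools, second, worst


# Row "ex2" of Table 3.6 (ENC is "?" and therefore omitted)
ex2 = {"SAFT": 8, "STOIC": 5, "ENCORE": 6, "NOVA": 6, "KISS": 6, "DIET": 6}
print(min_code_length(13))   # 4 bits for the 13-state example above
print(row_ratios(ex2))       # (['STOIC'], 120.0, 160.0)
```

A "#" in the SAFT column then simply corresponds to a code length smaller than `min_code_length(n)`.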

3.3.4. Synthesis of submicron transistors and interconnections for the design of high-performance (low-power) circuits subject to power (respectively time) and surface constraints

The characteristics of an integrated circuit depend not only on the logic gates it is composed of, but also on the interconnections between these gates. We present here our technique for aiding the design of a high-performance (respectively low-power) integrated circuit subject to power (respectively delay) and surface constraints. Because the solution space is very large and the dimensions of transistors and interconnections cannot be arbitrary (in practice, they lie within intervals of values), we made the following choices:
– each transistor has either a minimal width Wm or a maximal width WM;
– each interconnection likewise has either a minimal width Lm or a maximal width LM.


It is worth noting that the length of a transistor is defined by the target technology, and that the length of an interconnection is deduced from the topology of the circuit considered and from the technology used. Based on these indications, let us consider that the circuit consists of NT transistors and NI interconnections. Then, we have the following choices:

Transistor 1: Wm or WM
Transistor 2: Wm or WM
……..
Transistor NT: Wm or WM
Interconnection 1: Lm or LM
Interconnection 2: Lm or LM
……..
Interconnection NI: Lm or LM

The exhaustive method, which considers all possible cases and retains the most interesting solution, has to examine 2 * 2 * 2 * … * 2 = 2^(NT + NI) cases. This number grows exponentially with the size of the circuit, so the method does not run in polynomial time. It is then a matter of developing a method based on a heuristic or metaheuristic that finds, for each transistor and each interconnection, the width that optimizes the objective function while meeting the constraints. In practice, there are essentially two types of integrated systems:
– real-time systems (where response time is a critical parameter);
– portable systems (where power consumption is a critical parameter).
For this purpose, we have developed two algorithms based on heuristics:
– one giving priority to the time parameter, while making sure that the surface and power consumption constraints are met;
– the other giving priority to the power consumption parameter, while ensuring that the surface and time constraints are met.
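To give a feel for the size of this search space, here is a minimal Python sketch that counts and, for tiny circuits only, enumerates the binary width assignments. The names `search_space_size` and `enumerate_assignments` are hypothetical helpers introduced for illustration.

```python
from itertools import product
from typing import Iterator, Tuple


def search_space_size(nt: int, ni: int) -> int:
    """Number of assignments when each of the NT transistors and
    NI interconnections takes either its minimal or maximal width."""
    return 2 ** (nt + ni)


def enumerate_assignments(nt: int, ni: int) -> Iterator[Tuple[str, ...]]:
    """Yield every assignment; the first nt entries are transistors,
    the last ni entries are interconnections."""
    yield from product(("min", "max"), repeat=nt + ni)


# Exhaustive search is only practical for toy sizes
print(len(list(enumerate_assignments(2, 1))))  # 8
print(search_space_size(30, 20))               # 1125899906842624
```

Even a circuit with 30 transistors and 20 interconnections already yields about 10^15 candidates, which motivates the heuristics described next.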


Heuristic giving priority to the time parameter

Before giving more details on these two algorithms, let us note that, for the first algorithm, the best solution obtained so far, Si, is replaced by a new solution Sj if the latter improves the time parameter, even if the surface and the power consumption increase, provided that the specifications are still met. The same is true for the second algorithm: solution Si is replaced by the new solution Sj if the latter improves the power consumption, even if the surface and time increase, provided that the constraints fixed by the user are still met. Let us recall that the quality of the solution depends essentially on the initialization, the passage from one solution to another and the criterion for terminating the execution of the algorithm. In the development of these two algorithms, we have deliberately avoided random solution generation.

Let us now consider how the initial solution is defined. Based on the delay models used, we know precisely which parameters influence the delay. With certain parameters (e.g. supply and transistor threshold voltages) being fixed by the targeted technological process, we are interested in the influence of the dimensions of transistors and interconnections on the delay. Clearly, the larger the width W of a transistor, the shorter its response time. In a real design, however, an arbitrary width cannot be used: the width of a transistor is chosen from two extreme values Wmin and Wmax (denoted Wm and WM above). For the initial solution, it is therefore natural to assign the width Wmax to each transistor, since the algorithm gives priority to the time parameter. The other parameter that influences the delay concerns interconnections. For a given interconnection length, a width Lmax ensures a better data transfer delay than Lmin.


The time required to transfer a data item being proportional to the R * C (Resistance * Capacitance) product, for an interconnection of length L, we have the following:
– width l1: T1 = K [2(L+l1) * CpuL + (L*l1) * CpuS] * 6 Rc
– width l2: T2 = K [2(L+l2) * CpuL + (L*l2) * CpuS] * 3 Rc
where CpuL (CpuS) is the capacitance per unit length (per unit surface) of the material considered (aluminum, copper) and Rc is the resistance per square. The factors 6 and 3 already reflect the assumption l2 = 2*l1 used below: for a given length, doubling the width halves the resistance. With l2 = 2*l1, we have:
T2 = K [2(L+2*l1) * CpuL + (L*2*l1) * CpuS] * 3 Rc
Therefore, we have:

\[
\frac{T_2}{T_1}
= \frac{K \left[ (2L + 4 l_1)\, C_{puL} + 2 L\, l_1\, C_{puS} \right] \cdot 3 R_c}
       {K \left[ (2L + 2 l_1)\, C_{puL} + L\, l_1\, C_{puS} \right] \cdot 6 R_c}
= \frac{(L + 2 l_1)\, C_{puL} + L\, l_1\, C_{puS}}
       {(2L + 2 l_1)\, C_{puL} + L\, l_1\, C_{puS}}
\]

Since 2L + 2l1 > L + 2l1, the denominator exceeds the numerator; we then have T2 / T1 < 1, and therefore T1 > T2. Hence, for the same material and the same length, the interconnection with the larger width offers a better data transfer time. The initial solution, which is in fact the best solution for the time parameter, therefore assigns:
– the width Wmax to each transistor;
– the width Lmax to each interconnection.
What is the interest of such an initialization? It is clear that if the constraints are met with this initialization, the best solution is obtained in only one iteration!
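The inequality T2 / T1 < 1 can be checked numerically. The sketch below evaluates both the original expressions (with the factors 6 Rc and 3 Rc) and the simplified ratio. The numerical values are arbitrary placeholders chosen for illustration, not data from a real technology, and the function names are assumptions.

```python
import math


def transfer_time(length, width, c_pul, c_pus, resistance, k=1.0):
    """T = K * [2(L + l) * CpuL + (L * l) * CpuS] * R."""
    return k * (2 * (length + width) * c_pul + length * width * c_pus) * resistance


def ratio_t2_t1(length, l1, c_pul, c_pus):
    """Simplified ratio T2/T1 for l2 = 2 * l1."""
    numerator = (length + 2 * l1) * c_pul + length * l1 * c_pus
    denominator = (2 * length + 2 * l1) * c_pul + length * l1 * c_pus
    return numerator / denominator


# Arbitrary positive values (placeholders, not technology data)
L, l1, c_pul, c_pus, rc = 100.0, 0.5, 0.04, 0.03, 0.08

t1 = transfer_time(L, l1, c_pul, c_pus, 6 * rc)        # width l1
t2 = transfer_time(L, 2 * l1, c_pul, c_pus, 3 * rc)    # width l2 = 2 * l1

print(t2 / t1, ratio_t2_t1(L, l1, c_pul, c_pus))
```

Both printed values agree and are below 1 for any positive L, l1, CpuL and CpuS, since the two expressions differ only by the extra L * CpuL term in the denominator.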


Otherwise, we still aim at a quality solution. To avoid moving too far away from the best solution (which has unfortunately not met the surface and/or power consumption constraint), the shift from the current solution to another solution occurs in a reasoned (non-random) manner, by gradually operating on the elements of the solution. The parts of our algorithm are explained below:
– Part A: the time constraint (too strong or poorly estimated) is not met even by the best solution (Wmax for all the transistors and Lmax for all the interconnections). The algorithm stops and the user receives a notification. Although the optimization relates to time, we assume that the generated optimum should not exceed a given value.
– Part B: test whether the power consumption constraint is met. If this is the case, consider part C.
– Part C: test whether the surface constraint is met. If this is the case, write the solution and stop the execution (it makes no sense to continue iterating, as the solution cannot be improved: we should recall that the initial solution is the best).
– Part D: the power constraint is met, but the surface constraint is not. If all the dimensions of the transistors and all the interconnections are minimal, it is not possible to decrease the surface: the algorithm stops and the user receives a notification.
– Part E: it is possible to decrease the surface. To avoid moving too far away from the current solution (which is advantageous for the time parameter), the dimensions of the transistors of only one gate are changed at a time. Because of this change in dimensions, the circuit time is recalculated. If the time constraint is met, no further change is made (to preserve the quality of the solution) and the power consumption is recalculated (the process resumes from part B). Otherwise, the transistors of another logic gate are considered. If, after the transistors of all the gates have been considered, it has not been possible to meet the time constraint while trying to meet the surface constraint, the algorithm analyzes the widths of the interconnections (part H). If it has been possible to meet the time constraint by decreasing the width of one or more interconnections, the process resumes from part B (recalculate the power consumption and the surface, etc.). If, during the process, changing the width of an interconnection violates the time constraint, the width Lmax is reassigned to the interconnection being considered (so that the time constraint is still met). If no interconnection width could be changed (the time constraint would no longer be met), then the algorithm stops by notifying the user that the time and/or surface constraint(s) is (are) very strong.
– Part K: the power consumption constraint is not met. Starting from the current solution, the transistor dimensions are changed from Wmax to Wmin for only one gate at a time (to preserve the quality of the solution). After each of these changes, the circuit time is recalculated. If the time constraint is met, the process resumes from the calculation of the power consumption, etc. (part B). Note that each time a change in the transistor dimensions violates the time constraint, the previous dimension is reassigned to the gate under consideration (so that the time constraint is always met) and another gate is considered. If considering all the gates does not solve the problem, an attempt is made on the interconnections. Similarly, only one interconnection is considered at a time (to avoid altering the current solution too much). Each time a change in the width of an interconnection violates the time constraint, the previous width is reassigned to that interconnection, and another interconnection is taken into account. If a change in the width of an interconnection still allows the time constraint to be met, then the power consumption is recalculated (part B). If no interconnection width could be changed without violating the time constraint, the algorithm stops by notifying the user that the time and/or power constraint(s) is (are) very strong.

Let us recall that the characteristics of such an algorithm are the following:
– the initial solution is defined so that we obtain the best response time of the circuit;
– the passage from one solution to another is gradual, and not sudden (to avoid degrading too much the quality inherited from the best solution);
– in certain cases, the algorithm stops and notifies the user as soon as it is certain that further iterations are useless.
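The control flow of parts A to K can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the author's implementation: it assumes toy models in which circuit delay, power and surface are plain sums of per-element contributions (a real tool would use the delay and power models of Chapter 2 and a critical-path analysis), it merges the surface and power branches into one loop, and all names (`totals`, `time_priority`) are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# (delay, power, area) contribution of one element at a given width
Triple = Tuple[int, int, int]
# (values at maximal width, values at minimal width)
Element = Tuple[Triple, Triple]


def totals(elements: Sequence[Element], widths: Sequence[str]) -> Triple:
    """Toy model (assumption): delay, power and area are plain sums."""
    t = p = a = 0
    for (at_max, at_min), w in zip(elements, widths):
        d, pw, ar = at_max if w == "max" else at_min
        t, p, a = t + d, p + pw, a + ar
    return t, p, a


def time_priority(
    elements: Sequence[Element], t_fixed: int, p_fixed: int, a_fixed: int
) -> Optional[List[str]]:
    """Elements are listed gates first, then interconnections."""
    widths = ["max"] * len(elements)            # best solution for time
    if totals(elements, widths)[0] > t_fixed:   # part A: time cannot be met
        return None
    while True:
        _, p, a = totals(elements, widths)
        if p <= p_fixed and a <= a_fixed:       # parts B and C
            return widths
        for i, w in enumerate(widths):          # one element at a time
            if w != "max":
                continue
            widths[i] = "min"
            if totals(elements, widths)[0] <= t_fixed:
                break                           # keep the change, re-test
            widths[i] = "max"                   # restore previous width
        else:
            return None                         # constraints too strong


# Two gates followed by one interconnection (made-up numbers)
circuit = [((1, 10, 4), (2, 5, 2)), ((1, 10, 4), (3, 4, 1)), ((1, 6, 2), (2, 3, 1))]
print(time_priority(circuit, t_fixed=5, p_fixed=20, a_fixed=10))
# ['min', 'max', 'min']
```

In this run the second gate keeps its maximal width because shrinking it would violate the time constraint, which mirrors the "reassign the previous dimension" rule of part K.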


Finally, let us note that the previously described algorithm considers, in order, the dimensions of the transistors and then those of the interconnections. The obvious question is: what would be the quality of the solution if the interconnections were examined first, followed by the transistors? It is worth noting that if the power consumption constraint:
– is met, then part E (transistors are considered) is examined before part H (interconnections are considered);
– is not met, then part K (transistors are considered) is examined before part N (interconnections are considered).
To make sure that a proper solution is obtained, the algorithm is executed four times, taking into account the following four possibilities:
– 1: examine, in order, the parts E, H, K, then N;
– 2: examine, in order, the parts E, H, N, then K;
– 3: examine, in order, the parts H, E, K, then N;
– 4: examine, in order, the parts H, E, N, then K.
Hence, the solution generating the best circuit time while meeting all the constraints is retained.

Heuristic giving priority to the power consumption parameter

As in the first heuristic, the initial solution is not generated randomly, but rather in such a manner that it yields the best value of power consumption. This is obtained by assigning:
– the width Wmin to each transistor;
– the width Lmin to each interconnection.
Once again, the interest of such an initialization is clear: if the constraints are met with this initialization, the best solution is obtained in only one iteration! Otherwise, we still aim at a quality solution. Hence, in order to avoid moving too far away from the best solution (which has unfortunately not met the surface and/or time constraint), the shift from the current solution to another solution occurs in a reasoned (non-random) manner, by gradually operating on the elements of the solution. The parts of our 2nd heuristic are explained below:
– Part A: the power constraint (too strong or poorly estimated) is not met even by the best solution (Wmin for all the transistors and Lmin for all the interconnections). The algorithm stops and sends the user a notification. Although the optimization relates to the power consumption, we assume that the generated optimum should not exceed a given value.
– Part B: test whether the time constraint is met. If this is the case, then consider part C.
– Part C: test whether the surface constraint is met. If this is the case, write the solution and stop the execution (there is no point in continuing to iterate since the solution cannot be improved: let us recall that the initial solution is the best).
– Part D: the time constraint is met, while the surface constraint is not. If all the dimensions of transistors and all the interconnections are minimal, it is not possible to decrease the surface: the algorithm stops and sends the user a notification.
– Part E: it is possible to decrease the surface. To avoid moving too far away from the current solution (which is advantageous for the power consumption parameter), we change the dimensions of the transistors of only one gate at a time. Since the dimensions have changed, the circuit time is recalculated. If the time constraint is met, no further change is made (to preserve the quality of the solution) and the power consumption is recalculated (the process restarts from the calculation of the surface indicated in part B). Otherwise, the transistors of another logic gate are considered. If, after the transistors of all the gates have been considered, it has not been possible to meet the time constraint while trying to meet the surface constraint, then the algorithm analyzes the widths of the interconnections (part H).
If it is possible to meet the time constraint by decreasing the width of one or more interconnections, then the power consumption is recalculated and the process resumes from the calculation of the surface indicated in part B. If, during the process, changing the width of an interconnection violates the time constraint, then the width Lmax is reassigned to the interconnection under consideration (so that the time constraint remains met). If no interconnection width could be changed (the time constraint would no longer be met), the algorithm stops by notifying the user that the time and/or surface constraint(s) is (are) very strong.
– Part K: the time constraint is not met. Starting from the current solution, the transistor dimensions are changed from Wmin to Wmax for only one gate at a time (so as not to move too far from the quality of the solution). After each of these changes, the circuit time is recalculated. If the time constraint is met, the power consumption is recalculated. If it exceeds the fixed maximal value, the algorithm stops and sends the user a message. Otherwise, the process resumes from the surface calculation, etc. (part B). Note that each time a change in the transistor dimensions does not allow the time constraint to be met, the previous dimension is reassigned to the gate under consideration and another gate is considered. If considering all the gates does not solve the problem, an attempt is made on the interconnections. Similarly, only one interconnection is considered at a time (to avoid altering the current solution too much). If a change in the width of an interconnection enables the time constraint to be met, then the power consumption is recalculated. If it exceeds the fixed maximal power, a notification is given to the user and the algorithm stops. Otherwise, the process resumes from the surface calculation (part B). If, despite considering all the interconnections, the time constraint has not been met, then the algorithm stops by notifying the user that the time and/or power constraint(s) is (are) very strong.
Let us also recall that the characteristics of such an algorithm are the following:
– the initial solution is defined such that the best power consumption of the circuit is obtained;
– the passage from one solution to another is gradual, and not sudden (to avoid degrading too much the quality inherited from the best solution);
– in certain cases, the algorithm stops and notifies the user as soon as it is certain that further iterations are useless.
Finally, it is worth noting that, as in the first algorithm, the above-described algorithm considers, in order, the dimensions of the transistors, followed by those of the interconnections. However, the solution may be different if the interconnections are considered before the transistors. Indeed, if the time constraint:
– is met, then part E (transistors are considered) is examined before part H (interconnections are considered);
– is not met, then part K (transistors are considered) is examined before part O (interconnections are considered).
As in the first algorithm, and in order to obtain a proper solution, the algorithm is executed four times, taking into account the following four possibilities:
– 1: examine, in order, the parts E, H, K, then O;
– 2: examine, in order, the parts E, H, O, then K;
– 3: examine, in order, the parts H, E, K, then O;
– 4: examine, in order, the parts H, E, O, then K.
Hence, the solution generating the best power consumption of the circuit while meeting all the constraints is retained.

In the following, we present two tables that summarize the results obtained for this problem. Table 3.7 shows the following:
– in the 1st row, all the constraints have been met (it is in fact the best solution that can be obtained for the time);
– in the 2nd row, there is no solution because the time constraint is very strong (the possible minimal value is 40.71 ps);
– in the 3rd row, there is no solution because the surface constraint is very strong (the possible minimal value is 0.016 mm2);
– in the 5th row, there is no solution because the power constraint is very strong (the possible minimal value is 91.71 mW);
– the combinations A, B, C and D do not always yield the same results: in the 7th row, the modification of dimensions to meet the power consumption constraint has started with the interconnections (combinations B and D), which has decreased certain capacitances, and hence the power consumption, with little or no change in the dimensions of the transistors (initialized at Wmax). This has made it possible to obtain better values of time and power consumption than with combinations A and C.

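Constraints are given in the first three columns; each combination cell lists TF (ps) / PF (mW) / SF (mm2).

| TF (ps) | PF (mW) | SF (mm2) | Combination A | Combination B | Combination C | Combination D |
|---|---|---|---|---|---|---|
| 45 | 950 | 5 | 40.71 / 245.24 / 0.027 | 40.71 / 245.24 / 0.027 | 40.71 / 245.24 / 0.027 | 40.71 / 245.24 / 0.027 |
| 35 | 950 | 5 | No solution: very strong time constraint | No solution: very strong time constraint | No solution: very strong time constraint | No solution: very strong time constraint |
| 50 | 150 | 0.015 | No solution: very strong surface constraint | No solution: very strong surface constraint | No solution: very strong surface constraint | No solution: very strong surface constraint |
| 50 | 150 | 5 | 41.77 / 149.71 / 0.017 | 40.71 / 148.30 / 0.020 | 41.77 / 149.71 / 0.017 | 40.71 / 148.30 / 0.020 |
| 45 | 90 | 5 | No solution: very strong power constraint | No solution: very strong power constraint | No solution: very strong power constraint | No solution: very strong power constraint |
| 41 | 150 | 5 | 40.71 / 138.54 / 0.020 | 40.71 / 148.30 / 0.020 | 40.71 / 138.54 / 0.020 | 40.71 / 148.30 / 0.020 |
| 60 | 140 | 10 | 41.77 / 134.23 / 0.017 | 40.71 / 131.22 / 0.020 | 41.77 / 134.23 / 0.017 | 40.71 / 131.22 / 0.020 |
| 42 | 92 | 10 | 41.77 / 91.71 / 0.016 | 41.77 / 91.71 / 0.016 | 41.77 / 91.71 / 0.016 | 41.77 / 91.71 / 0.016 |
| 41 | 92 | 10 | No solution: very strong time or power constraint | No solution: very strong time or power constraint | No solution: very strong time or power constraint | No solution: very strong time or power constraint |
| 42 | 150 | 10 | 41.77 / 149.71 / 0.017 | 40.71 / 148.30 / 0.020 | 41.77 / 149.71 / 0.017 | 40.71 / 148.30 / 0.020 |
| 42 | 150 | 5 | 41.77 / 149.71 / 0.017 | 40.71 / 148.30 / 0.020 | 41.77 / 149.71 / 0.017 | 40.71 / 148.30 / 0.020 |
| 50 | 250 | 0.03 | 40.71 / 245.24 / 0.027 | 40.71 / 245.24 / 0.027 | 40.71 / 245.24 / 0.027 | 40.71 / 245.24 / 0.027 |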

Table 3.7. Circuit synthesis (priority given to time)

Table 3.8 shows the following:
– in the 1st row, all the constraints have been met (in fact, it is the best solution that can be obtained for the power consumption);
– in the 2nd row, there is no solution because the power constraint is very strong (the possible minimal value is 91.71 mW);
– in the 3rd row, there is no solution because the surface constraint is very strong (the possible minimal value is 0.016 mm2);
– in the 5th row, there is no solution because the time constraint is very strong (the possible minimal value is 40.71 ps);
– the combinations A, B, C and D do not always yield the same results: in the 7th row, the modification of dimensions to meet the time constraint has started with the transistors (combinations A and C), which has made it possible to meet the time constraint with little or no change in the dimensions of the interconnections (initialized at Lmin). This has made it possible to obtain better values of power consumption than with combinations B and D;
– in the 11th row, combinations A and C yield a solution, while combinations B and D do not: in order to meet the time constraint (very close to the minimal value), B and D have changed all the dimensions of the interconnections (initialized at Lmin), which has increased the capacitances Cload and hence the power consumption, which then exceeds the imposed constraint. Combinations A and C, however, have started by changing certain dimensions of the transistors, which has led to meeting the time constraint with little or no change in the dimensions of the interconnections.

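Constraints are given in the first three columns; each combination cell lists TF (ps) / PF (mW) / SF (mm2).

| TF (ps) | PF (mW) | SF (mm2) | Combination A | Combination B | Combination C | Combination D |
|---|---|---|---|---|---|---|
| 45 | 350 | 5 | 41.77 / 91.71 / 0.016 | 41.77 / 91.71 / 0.016 | 41.77 / 91.71 / 0.016 | 41.711 / 91.71 / 0.016 |
| 45 | 90 | 5 | No solution: very strong power constraint | No solution: very strong power constraint | No solution: very strong power constraint | No solution: very strong power constraint |
| 45 | 900 | 0.010 | No solution: very strong surface constraint | No solution: very strong surface constraint | No solution: very strong surface constraint | No solution: very strong surface constraint |
| 41 | 990 | 5 | 40.71 / 126.84 / 0.020 | 40.71 / 245.24 / 0.027 | 40.71 / 126.84 / 0.020 | 40.71 / 245.24 / 0.027 |
| 40.6 | 150 | 5 | No solution: very strong time constraint | No solution: very strong time constraint | No solution: very strong time constraint | No solution: very strong time constraint |
| 41 | 150 | 5 | 40.71 / 126.84 / 0.020 | No solution | 40.71 / 126.84 / 0.020 | No solution |
| 40.8 | 1,500 | 5 | 40.71 / 126.84 / 0.020 | 40.71 / 245.24 / 0.020 | 40.71 / 126.84 / 0.020 | 40.71 / 245.24 / 0.027 |
| 50 | 150 | 5 | 41.77 / 91.71 / 0.016 | 41.77 / 91.71 / 0.016 | 41.77 / 91.71 / 0.016 | 41.77 / 91.71 / 0.016 |
| 50 | 150 | 0.015 | No solution: very strong surface constraint | No solution: very strong surface constraint | No solution: very strong surface constraint | No solution: very strong surface constraint |
| 40.7 | 150 | 5 | No solution: very strong time constraint | No solution: very strong time constraint | No solution: very strong time constraint | No solution: very strong time constraint |
| 40.8 | 150 | 5 | 40.71 / 126.84 / 0.020 | No solution | 40.71 / 126.84 / 0.020 | No solution |
| 40.8 | 250 | 5 | 40.71 / 126.84 / 0.020 | 40.71 / 245.24 / 0.027 | 40.71 / 126.84 / 0.020 | 40.71 / 245.24 / 0.027 |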

Table 3.8. Circuit synthesis (priority given to power consumption)

3.4. Module level

3.4.1. Design of low-power digital circuits

With the increasing number of portable systems on the market (tablets, cellular phones, laptops, etc.) and the limited lifetime of batteries, the design of low-power circuits has become a necessity. Despite the efforts made to increase battery lifetime (replacement of NiCd batteries with NiMH batteries), significant autonomy has not yet been reached. Moreover, current technologies enable the integration of millions of transistors on a chip (implementation of SoC). Even when a VLSI system is powered from the mains, a conventional design would consume a lot of power and would consequently raise the operating temperature of the system, which may lead to a malfunction. Furthermore, it is worth noting that thermal effects decrease the reliability of components and increase their cooling costs. Hence, power dissipation is currently addressed at each level of abstraction, starting from the highest design levels:
– application of techniques, at high design levels, which involve the use of several supply voltages in order to minimize the energy dissipation while meeting the temporal constraint;
– comparison of architectures achieving the same behavior in order to pick the one with the least consumption under imposed values of supply voltages, threshold voltages and frequency;
– generation of voltage islands and assignment of voltages embedded in the floorplanning process for optimization purposes;
– electrical disconnection of inactive components, reduction of intrinsic and parasitic capacitances, etc.
Since the simultaneous use of several threshold voltages leads to a prohibitive manufacturing cost of the circuit, only two threshold voltages are generally used in order to reconcile the two antagonistic parameters: performance and energy consumption. In our case, we also use two threshold voltages for each type of transistor (N or P) and only two supply voltages (fewer voltage transformers). These assignments of voltages are accompanied by an auto-sizing of transistors in order to minimize the energy consumption while meeting the surface and time constraints. Since the computational complexity of this problem is O(2^(3*N)), where N is the number of logic gates, we have used an appropriate technique to generate an interesting solution in a reasonable CPU time for this intractable problem. Because we are equally interested in dynamic and static consumption (the latter being due to leakage currents in transistors), we adopt the models defined by equations [2.14] and [2.15]:

[Equations [2.14] and [2.15], with leading coefficients 0.28125 and 0.5 respectively; the remaining terms are not legible in this copy.]


The initial solution is not randomly generated, as is often the case, but is rather defined according to the problem addressed and the targeted purpose. Since the aim here is to minimize the power consumption while meeting the time constraint, we start from the solution that yields the LOWEST power. If the time constraint is met, then the BEST solution is obtained in a very short CPU time. The element studied at each iteration is a character string (chain) of length 3*Nb_gates (Nb_gates being the number of gates of the circuit). For each gate, there are three successive characters coding the values of Vdd (supply voltage), VthN (threshold voltage of the NMOS transistor) and VthP (threshold voltage of the PMOS transistor) used for the considered gate:
– chain[3*(i-1)] = ‘0’ (‘1’) if Vdd,L (Vdd,H) supplies the ith gate;
– chain[3*(i-1)+1] = ‘0’ (‘1’) if VthN,L (VthN,H) is assigned to the threshold voltages of the NMOS transistors of gate i;
– chain[3*(i-1)+2] = ‘0’ (‘1’) if VthP,L (VthP,H) is assigned to the threshold voltages of the PMOS transistors of gate i.
It is worth noting that Vdd,L (Vdd,H) represents the lowest (highest) supply voltage used, while VthN,L (VthN,H) corresponds to the lowest (highest) threshold voltage of the NMOS transistor. The same is applicable to VthP,L (VthP,H) for the PMOS transistor.

Our algorithm operates as follows: at the first step, we verify whether the optimal solution is feasible, namely whether the assignment of Vdd,L, VthN,H and VthP,L to all the gates, all the NMOS transistors and all the PMOS transistors, respectively, meets the time and surface constraints. If this is the case, the algorithm stops. Otherwise, the time and surface constraints must be satisfied. Let us note that, in order to reach an interesting solution in terms of power, these constraints are satisfied progressively, starting from the chain that generates the optimal solution. This is achieved by changing the smallest possible number of values:
– Vdd,L into Vdd,H values;
– VthN,H into VthN,L values;
– VthP,L into VthP,H values.
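The encoding of a solution as a chain of 3*Nb_gates characters can be illustrated with a short Python sketch. The helper names `initial_chain` and `decode_gate` are hypothetical; only the character layout follows the description above.

```python
from typing import Dict


def initial_chain(nb_gates: int) -> str:
    """Lowest-power starting point: Vdd,L ('0'), VthN,H ('1') and
    VthP,L ('0') for every gate."""
    return "010" * nb_gates


def decode_gate(chain: str, i: int) -> Dict[str, str]:
    """Decode the three characters of the i-th gate (i starts at 1)."""
    base = 3 * (i - 1)
    level = {"0": "L", "1": "H"}
    return {
        "Vdd": level[chain[base]],
        "VthN": level[chain[base + 1]],
        "VthP": level[chain[base + 2]],
    }


chain = initial_chain(2)
print(chain)                   # 010010
print(decode_gate(chain, 2))   # {'Vdd': 'L', 'VthN': 'H', 'VthP': 'L'}
print(2 ** len(chain))         # 64 possible chains for two gates
```

The last line shows why exhaustive search is ruled out: the number of chains is 2^(3*Nb_gates).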


These changes are made progressively by the first three parts (case where reject=1) of the process Generate_combination() until A ≤ Afixed is met. When a solution meets the time and surface constraints, the algorithm also tries to improve this solution (case where reject=0) by assigning as many values of Vdd,L, VthN,H and VthP,L as possible to the gates, NMOS transistors and PMOS transistors, respectively. If an assignment violates the time or surface constraint, another assignment is tried on the previous solution. Throughout, the widths of the transistors are determined automatically so that the time constraint is met.

General algorithm

Area = +∞; Power = +∞;
code1 = code = (char *) malloc((unsigned int) 3*Nb_gates + 1);   // +1 for the terminating '\0'
// the code chain contains the assignments of Vdd, VthN and VthP
For i=1 to Nb_gates
Do {
    *code++ = '0';   // assign Vdd,L to the i-th gate
    *code++ = '1';   // assign VthN,H to NMOS transistors of i-th gate
    *code++ = '0';   // assign VthP,L to PMOS transistors of i-th gate
}
Done
*code = '\0';
code = code1;
Determine the sizes of transistors W such that: |D-Dfixed| ≤ ε;
Determine A = Circuit surface;
if A ≤ Afixed   // Afixed is a surface fixed by the designer
then {
    Ptot = Total_Power_Dissipation(circuit);
    exit;   // the lowest dissipation is obtained
}
endif
reject = 1;
new = 1;    // new=0: the solution cannot be improved
iter = 1;
While (new = 1 && iter ≤ Nb_individuals)
Do {
    Generate_combination();
    Ptot = Psw + Pleak;
    if (Power ≥ Ptot)
    then
        if (Power > Ptot)
        then {
            Power = Ptot;
            Area = A;
            strcpy(code_sav, code1);   // save the content of code1 in the chain code_sav
        }
        else
            if (Area > A)
            then {
                Area = A;
                strcpy(code_sav, code1);
            }
            endif
        endif
    endif
}
Done
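The rule used in the main loop to keep the best solution can be restated compactly: a candidate replaces the saved solution if it dissipates less power, or the same power with a smaller surface. The Python sketch below illustrates this rule only; it assumes that the comparison lost in the listing is "Power ≥ Ptot", and the name `update_best` and the tuple layout are hypothetical.

```python
import math
from typing import Optional, Tuple

# (total power, surface, code chain)
Solution = Tuple[float, float, Optional[str]]


def update_best(best: Solution, candidate: Solution) -> Solution:
    """Keep the candidate if it has lower power, or equal power and
    a smaller surface; otherwise keep the current best."""
    best_power, best_area, _ = best
    cand_power, cand_area, _ = candidate
    if cand_power < best_power:
        return candidate
    if cand_power == best_power and cand_area < best_area:
        return candidate
    return best


# Power and Area start at +infinity, as in the general algorithm
best: Solution = (math.inf, math.inf, None)
for candidate in [(120.0, 0.020, "010110"), (120.0, 0.018, "010010"), (130.0, 0.010, "110010")]:
    best = update_best(best, candidate)
print(best)   # (120.0, 0.018, '010010')
```

The third candidate is rejected despite its smaller surface, since power has priority over surface in this heuristic.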

Process Generate_combination()
{
if (reject = 1)   // surface constraint not met
then {
    reject = 0;
    While (A > Afixed)   // Assign Vdd,H to several gates in order to meet the time and surface constraints
    Do {
        order in a list l the gates in descending order of the sizes W of their transistors;
        Find in l the 1st gate Gi whose Vdd = Vdd,L;
        if (such a gate does not exist)   // meaning that Vdd = Vdd,H for each gate
        then break;
        endif
        Assign Vdd,H to Gi;
        For each gate directly or indirectly feeding Gi
        Do assign Vdd,H to it;
        Done
        Determine Wtransistors such that: |D-Dfixed| ≤ ε;
        Determine A = surface(circuit);
    }
    Done
    While (A > Afixed)   // Assign VthN,L to certain NMOS transistors in order to meet the time and surface constraints
    Do {
        order in the list l the gates in descending order of the sizes W of their transistors;
        Find in l the 1st gate Gi whose VthN = VthN,H;
        if (such a gate does not exist)   // meaning that VthN = VthN,L for each NMOS transistor
        then break;
        endif
        Assign VthN,L to each NMOS transistor of gate Gi;
        Determine Wtransistors such that: |D-Dfixed| ≤ ε;
        Determine A = surface(circuit);
    }
    Done
    While (A > Afixed)   // Assign VthP,H to certain PMOS transistors in order to meet the time and surface constraints
    Do {
        order in the list l the gates in descending order of the sizes W of their transistors;
        Find in l the 1st gate Gi whose VthP = VthP,L;
        if (such a gate does not exist)   // meaning that VthP = VthP,H for each PMOS transistor
        then {
            write "the value of Afixed is too constraining and generates no solution: increase it and restart";
            exit;
        }
        endif
        Assign VthP,H to each PMOS transistor of gate Gi;
        Determine Wtransistors such that: |D-Dfixed| ≤ ε;
        Determine A = surface(circuit);
    }
    Done
}
else {
    // the time and surface constraints are met: we then try to exploit this
    // in order to gradually minimize the total dissipation of energy
    order in the list l the gates in descending order of the sizes W of their transistors;
    Search in l the 1st gate Gi whose Vdd = Vdd,H;
    end = 0;

214

CAD of Circuits and Integrated Systems

new=0;

// new is a global variable

While(end=0) // try to reduce the total dissipation of energy by // gradually assigning Vdd,L to certain gates Do {if(such a gate does not exist) // meaning that Vdd=Vdd,L for //each gate then break; endif (else), assign Vdd,L for each gate directly or indirectly fed by Gi; Determine Wtransistors such that: |D-Dfixed|  ; Determine A= surface(circuit); if(A > Afixed) // failed attempt then {assign VddH to Gi; assign VddH to each gate directly or indirectly fed by Gi; } else {new=1; iter++;} endif Determine from l the next gate Gj following Gi and for which Vdd=VddH; Rename Gj, Gi; } Done Determine A= surface(circuit); if(A  Afixed) then {order in the list l the gates by descending values of their sizes W of NMOS transistors; Find in l the 1st gate Gi for which VthN= VthN,L;

Case Study: Application of Heuristics and Metaheuristics

While (end=0) // try to reduce the total dissipation of energy // by gradually assigning VthN,H to certain // NMOS transistors Do {if(such a gate does not exist) // meaning that VthN=VthN,H // for all NMOS transistors then break; endif (else), assign VthN,H for each NMOS transistor of the gate Gi; Determine Wtransistors so that: |D-Dfixed|  ; Determine A= surface(circuit); if(A > Afixed) // failed improvement attempt then assign VthN,L to each NMOS transistor of Gi; else {new=1; iter++; } endif Determine from l the next gate Gj following Gi and whose VthN=VthN,L; Rename Gj, Gi; } Done } Endif Determine A= surface(circuit); if(A  Afixed) then { order in the list l the gates in descending order of their sizes W of PMOS transistors; Find in l the 1st gate Gi for which VthP= VthP,H;

215

216

CAD of Circuits and Integrated Systems

While(end=0) // try to reduce the total dissipation of energy by // gradually assigning VthP,L to certain // PMOS transistors Do{if(such a gate does not exist) // meaning that VthP=VthP,L // for all PMOS transistors then break; endif (else), assign VthP,L for each PMOS transistor in the gate Gi; Determine Wtransistors such that: |D-Dfixed|  ; Determine A= surface(circuit); if(A > Afixed) // failed improvement attempt then assign VthP,H to each PMOS transistor of Gi; else {new=1; iter++; } endif Determine from l the next gate Gj following Gi and for which VthP=VthP,H; Rename Gj, Gi; } Done } endif } endif } Figure 3.13. Algorithm for the optimization of power consumption, subject to time and surface constraints, of a digital circuit
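The encoding and the acceptance test of the general algorithm can be illustrated with a minimal Python sketch. The function names `initial_code` and `keep_best` are hypothetical, and the sizing and power-estimation steps of FREEZER2 are not modeled; only the three-characters-per-gate code chain and the rule "lower power wins, equal power is settled by lower area" are taken from the algorithm above.

```python
def initial_code(nb_gates):
    """Build the starting code chain: three characters per gate.

    '0' -> Vdd,L for the gate, '1' -> VthN,H for its NMOS transistors,
    '0' -> VthP,L for its PMOS transistors (the ideal low-power assignment).
    """
    return "010" * nb_gates


def keep_best(best, candidate):
    """Return the solution to save, mirroring the test on Power and Area.

    Each solution is a tuple (power, area, code). A candidate replaces the
    saved solution if it dissipates less power, or the same power with a
    smaller area.
    """
    if best is None:
        return candidate
    power, area, _ = candidate
    if power < best[0] or (power == best[0] and area < best[1]):
        return candidate
    return best


if __name__ == "__main__":
    print(initial_code(2))  # code chain for a two-gate circuit
    saved = None
    # Illustrative (power, area, code) values, not taken from the book.
    for solution in [(20.0, 90.0, "110010"), (18.0, 95.0, "010110"), (18.0, 80.0, "010011")]:
        saved = keep_best(saved, solution)
    print(saved)
```

Keeping the saved code chain next to its power and area makes it straightforward to restore the best assignment once the loop ends.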


In the algorithm presented above, let us note the following:
– the shift from one solution to another is not made randomly, but rather in a deterministic manner (the bits are set to '0' or '1' depending on whether the objective is to improve the current solution or to meet the two constraints);
– since the random generation of solutions is avoided, no solution is explored twice, which makes it possible to escape potential local optima and increases the probability of improving the solution at each iteration.
The following results confirm the effectiveness of our algorithm. The results indicated in Tables 3.9 and 3.10 have been obtained with Vdd ∈ {1 V, 3.3 V}, VthN ∈ {0.1 V, 0.3 V} and VthP ∈ {-0.3 V, -0.1 V}. Table 3.9 shows the variations of the dissipated power with the response time of the circuit. For a maximal surface imposed by the designer of 3,200 λ², that is, 128 (µm)² (λ = 0.20 µm in the 0.35 µm TSMC CMOS technology), the power consumed in the ideal case (Vdd = 1 V, VthN = 0.3 V, VthP = -0.3 V) varies from 1.55 to 2.63 µW when the response time of the circuit imposed by the designer varies from 150 to 30 ps. It is worth noting that, unfortunately, the ideal case does not enable certain time and/or surface constraints to be met (in Table 3.9, the surface constraint is not met for any delay in the ideal case of power consumption). Moreover, in order to minimize the total power consumption while trying to meet the time and surface constraints, the delays of all the paths are made equal to those of the critical paths (value of delay imposed by the designer). This makes it possible to assign smaller dimensions to the transistors of noncritical paths, which increases the response time of such paths and consequently reduces the total dissipated power.
The 5th column in Table 3.9 indicates the values of the dissipated power obtained for various values of delay. Let us note, however, that the problem has no solution when the values of delay and surface imposed by the designer are 30 ps and 128 (µm)², respectively. In this case, our tool FREEZER2 dimensions the transistors so that the delays of all the paths do not exceed 30 ps. This leads to a surface (estimated as the sum of the surfaces of all the transistors of the circuit) above the imposed value. It is also worth noting that the surface obtained for Tf = 50 ps is below that obtained with Tf = 80 ps. This is explained by the fact that in order to meet a delay of 50 ps, FREEZER2 had to assign more supply voltages equal to Vdd,H, and more


threshold voltages equal to VthN,L and VthP,H for NMOS and PMOS transistors, respectively. This assignment of voltages has obviously increased the power dissipation, but is sufficient for meeting the time constraint. Consequently, the dimensions of the transistors did not need to be increased as much, and hence a smaller surface has been obtained (77.22 × 10⁻¹² m²).

| Tf (ps) | Sf ((µm)²) | P ideal case (µW) | S ideal case ((µm)²) | P (µW) | S ((µm)²) |
| 150 | 128 | 1.55 | 974.78 > 128 | 17.25 | 50.07 |
| 100 | 128 | 2.63 | 982.47 > 128 | 17.25 | 68.84 |
| 90 | 128 | 2.63 | 985.07 > 128 | 17.25 | 79.68 |
| 80 | 128 | 2.63 | 988.36 > 128 | 17.25 | 100.46 |
| 50 | 128 | 2.63 | 1006.12 > 128 | 23.44 | 77.22 |
| 30 | 128 | 2.63 | 1042.50 > 128 | No solution (S > Sf) | |

Table 3.9. Comparison of minimal power to that obtained under time and surface constraints for the same ISCAS circuit

Table 3.10 compares, for different ISCAS circuits, the total power obtained under time and surface constraints to those corresponding to the favorable and unfavorable cases. The unfavorable case involves assigning 3.3 V to the supply voltages of all the gates, 0.1 V to the threshold voltages of all NMOS transistors and -0.1 V to those of all PMOS transistors. Let us also note that FREEZER2 produces the minimal power if the imposed surface and time values are above or equal to those enabling this minimal power. Hence, if the value of the power obtained differs from the minimal power, the time and/or surface constraints did not allow the ideal solution to be obtained. In this case, the voltages and dimensions of transistors have been assigned so as to minimize the power consumption while meeting the two constraints, insofar as the optimization problem has a solution. Table 3.10 shows that the surface constraint is not met for any circuit in the ideal case. This explains the significant gap between the power consumed in the ideal case and that consumed when the time and surface constraints are met. The table also shows a reduction in dissipation of 7.42 to 63.27% with respect to the unfavorable case, which shows that FREEZER2 is a promising tool that enables the designer to obtain a circuit of lower consumption compared to a conventional design, subject to time and surface constraints.


Ideal case (power P1): Vdd = 1 V; VthN = 0.3 V; VthP = -0.3 V
Unfavorable case (power P2): Vdd = 3.3 V; VthN = 0.1 V; VthP = -0.1 V
Time and surface constraints are met (power P3)

| P1 (µW) | T = Tfixed (ps) | S > Sfixed ((µm)²) | P2 (µW) | T < Tfixed (ps) | S < Sfixed ((µm)²) | P3 (µW) | T = Tfixed (ps) | S ≤ Sfixed ((µm)²) | Reduction of P3 with respect to P2 (%) | CPU time* (s) |
| 2.64 | 70 | 992.64 | 65.55 | 69.67 | 19.20 | 29.75 | 70 | 21.01 | 54.61 | 1 |
| 10.53 | 105 | 4018.71 | 145.08 | 104.51 | 52.80 | 66.19 | 105 | 58.99 | 54.38 | 10 |
| 25.89 | 200 | 10627.58 | 437.01 | 197.40 | 144.00 | 219.45 | 200 | 157.92 | 49.78 | 404 |
| 56.68 | 65 | 33923.10 | 2071.57 | 60.96 | 931.20 | 760.83 | 65 | 939.01 | 63.27 | 4,142 |
| 38.00 | 235 | 19631.44 | 722.87 | 232.23 | 267.20 | 359.42 | 235 | 278.77 | 50.28 | 1,584 |
| 10.95 | 210 | 7101.06 | 407.62 | 209.01 | 107.20 | 193.74 | 210 | 119.88 | 52.47 | 44 |
| 15.58 | 400 | 14498.17 | 1234.66 | 394.80 | 228.80 | 876.81 | 400 | 238.04 | 28.98 | 1,752 |
| 44.16 | 770 | 10839.45 | 5423.25 | 766.37 | 472.00 | 4757.36 | 770 | 479.90 | 12.28 | 18,070 |
| 20.72 | 255 | 43929.97 | 1221.79 | 249.65 | 515.20 | 852.74 | 255 | 518.62 | 30.21 | 357 |
| 45.98 | 585 | 31943.85 | 1403.79 | 583.49 | 630.40 | 1299.69 | 585 | 639.17 | 7.42 | 21,107 |

*For each ISCAS circuit, the indicated CPU times concern the obtaining of all the results in the three cases.

Table 3.10. Comparative study of minimal and maximal powers to that obtained under time and surface constraints
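As a quick check of the "Reduction of P3 with respect to P2" column, the sketch below recomputes the percentage as 100 × (P2 − P3) / P2 for a few rows of Table 3.10. The formula is our reading of the column header rather than a definition stated in the text, and the function name `reduction_percent` is hypothetical.

```python
def reduction_percent(p2, p3):
    """Relative reduction of P3 with respect to P2, in percent."""
    return 100.0 * (p2 - p3) / p2


# (P2 in µW, P3 in µW, reduction reported in Table 3.10)
ROWS = [
    (65.55, 29.75, 54.61),
    (145.08, 66.19, 54.38),
    (437.01, 219.45, 49.78),
    (2071.57, 760.83, 63.27),
    (1403.79, 1299.69, 7.42),
]

if __name__ == "__main__":
    for p2, p3, reported in ROWS:
        print(f"P2={p2} P3={p3} computed={reduction_percent(p2, p3):.2f} reported={reported}")
```

Under this reading, the computed values agree with the reported ones to two decimal places for the rows listed.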

3.4.2. Reduction of memory access time for the design of embedded systems

In many embedded systems, the memory access delay limits system performance: memory access is far from being as fast as the processor, which has a negative impact on the global operation of the system. Fortunately, modern memories offer what is known as the page mode and the read-modify-write feature. In Kim et al. (2005), memory access optimization has been approached through a combination of scheduling of the operations of the concerned application, memory allocation and the assignment of arrays of the application to memory modules. We follow the same methodology to address this issue, but with our own heuristics. At the end of the description of these heuristics, we compare our results to those obtained by Kim et al. (2005). As this issue has been described in detail in Chapter 2 of this book (section 2.3.1.4), we describe below the heuristics that we have developed:
– Assignment of arrays of the concerned application to memory modules:


This is performed based on the following two observations:
– a memory access is performed in PR – Page Read (PW – Page Write) mode if a reading (respectively writing) is done in the same array in two successive accesses in the same memory module;
– two arrays required in the same operation involve max(N1, N2) cycles if they are assigned to two different memory modules, and N1+N2 cycles otherwise.
Let us recall that the memory access modes Normal Read (NR), Normal Write (NW), Page Read (PR) and Page Write (PW) require 5, 8, 2 and 3 cycles, respectively. Finding the best possible assignment of arrays to memory modules is an intractable problem when subject to surface constraints: the optimization problem is NP-hard, while the associated decision problem is NP-complete. In Chapter 1 of this book, we presented the Turing transformation. We use it here to solve this problem, exploiting our graph coloring technique, whose problem is formulated as follows:
G = (V, E)
F: V → {1, 2, … , K}; K ∈ N
F(u) ≠ F(v) ∀ (u, v) ∈ E
Find the minimal value of K such that two arbitrary nodes connected by an edge have different colors.
Note that 1, 2, … , K are fictitious colors whose number must be minimized. Having observed that the order of coloring the nodes of a graph has an impact on the number of colors, and taking into account that trying all the possible arrangements of nodes would require |V|! colorings (which is not possible, within a reasonable CPU time, for a significant number of


nodes), we have assigned a priori these nodes to M classes (M […]

[…] Area   // Area is the allowable area
then if E = ∅
     then { write "the area constraint is too hard: there is no feasible solution"; exit; }
     else { Find the 2 arrays Ai and Aj that are assigned to 2 different memory modules and that yield the least increase of the delay when stored in the same memory module;
            build G = (V, E') such that E = E' ∪ {e};   // e=(vi,vj) is deleted from E; vi and vj are respectively associated to Ai and Aj
            E = E';
          }
     endif
else finished = 1;
endif
} end

Figure 3.15. Algorithm assigning arrays of an application to memory modules
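The observation that the coloring order affects the number of colors can be illustrated with a plain greedy coloring tried over several node orders. This is a generic sketch, not the class-based technique of the book: the functions `greedy_coloring` and `best_of_orderings` are hypothetical names, nodes stand for arrays, an edge joins two arrays required in the same operation, and a color stands for a memory module.

```python
def greedy_coloring(adj, order):
    """Color the nodes in the given order with the smallest free color."""
    color = {}
    for v in order:
        used = {color[u] for u in adj[v] if u in color}
        c = 1
        while c in used:
            c += 1
        color[v] = c
    return color


def best_of_orderings(adj, orderings):
    """Keep the coloring that uses the fewest colors among the tried orders."""
    best = None
    for order in orderings:
        coloring = greedy_coloring(adj, order)
        if best is None or max(coloring.values()) < max(best.values()):
            best = coloring
    return best


# Conflict graph of four arrays forming a path a - b - c - d.
ADJ = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

if __name__ == "__main__":
    poor = greedy_coloring(ADJ, ["a", "d", "b", "c"])
    good = best_of_orderings(ADJ, [["a", "d", "b", "c"], ["a", "b", "c", "d"]])
    print(max(poor.values()), max(good.values()))
```

On this small graph, the order a, d, b, c forces a third color, while the order a, b, c, d needs only two, which is why exploring several orders (here by brute force, in the book through classes of nodes) can pay off.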


– Rescheduling of operations of the concerned application:
In Chapter 2 of this book (section 2.3.1.4), we have seen that the permutation of only two operations has an influence on the number of cycles. Hence, a rescheduling of the operations of the concerned application (which obviously must not modify the original behavior of the application) can significantly improve the memory access time. To respect the dependence of operations, only the independent operations should be rescheduled. As the number of sets of such operations is not significant, this task, aimed at maximizing the number of accesses in page mode, can be executed in polynomial time. Retaining, in the end, the best rescheduling in each set makes it possible to significantly reduce the memory access time.
As shown by the results obtained, our heuristics are efficient because they give near-optimal (even optimal in some cases) results within reasonable CPU times. Concerning the graph coloring, Table 3.13 shows that our graph coloring technique has been successfully tested on DIMACS benchmarks. Indeed, the mean relative error is equal to 1.92%, compared to known exact solutions, while a single coloring gives a mean relative error of 8.60%. Let us note that:
– for certain benchmarks, the error is zero when more than one class is used (which is the case, for example, for the 15th graph myciel7, for which the error passes from 12.5 to 0%);
– the exact solution is obtained for 17 (among 19) benchmarks.
Table 3.14 shows a reduction of memory access time compared to the case where there is no technique for the assignment of arrays of a given application to memory modules. The average gain ranges from 31.87% (assignment only) to 41.16% when this assignment is accompanied by an appropriate rescheduling of operations of the given application. Compared to the results obtained by Kim et al. (2005), our results show an improvement: an average gain of 41% compared to 38.80% for the same methodology (assignment of arrays and rescheduling of operations), but with the use of different optimization techniques.
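The effect of rescheduling independent operations on page-mode accesses can be sketched as follows. This is a simplified model under stated assumptions, not the heuristic of the book: a single memory module is considered, an access is served in page mode only when it has the same kind (read or write) and targets the same array as the previous access, and the best order of a small set of independent operations is found by trying all permutations. The names `sequence_cycles` and `best_schedule` are hypothetical.

```python
from itertools import permutations

# Cycles per access: (kind, page_mode) -> cycles, using NR=5, NW=8, PR=2, PW=3.
CYCLES = {("R", False): 5, ("W", False): 8, ("R", True): 2, ("W", True): 3}


def sequence_cycles(accesses):
    """Count the cycles of a sequence of (kind, array) accesses on one module."""
    total = 0
    previous = None
    for access in accesses:
        kind = access[0]
        page_mode = access == previous  # same kind and same array as just before
        total += CYCLES[(kind, page_mode)]
        previous = access
    return total


def schedule_cost(operations):
    """Cycles needed when the operations are executed in the given order."""
    return sequence_cycles([access for op in operations for access in op])


def best_schedule(independent_ops):
    """Try every order of a small set of independent operations."""
    return min(permutations(independent_ops), key=schedule_cost)


# Three independent operations, each reading one array.
OPS = [[("R", "A")], [("R", "B")], [("R", "A")]]

if __name__ == "__main__":
    print(schedule_cost(OPS))                 # original order A, B, A
    print(schedule_cost(best_schedule(OPS)))  # the two reads of A made adjacent
```

Enumerating permutations is only reasonable because each set of independent operations is assumed to be small; a larger set would call for a heuristic ordering instead.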


| Instance of a DIMACS benchmark | Exact solution | #Colors (#classes = 1) | % Error | CPU time (s) | (#Colors, #classes) | % Error | CPU time (s) |
| 1. anna | 11 | 11 | 0 | 0 | (11, 1) | 0 | 0 |
| 2. david | 11 | 11 | 0 | 0 | (11, 1) | 0 | 0 |
| 3. myciel3 | 4 | 4 | 0 | 0 | (4, 1) | 0 | 0 |
| 4. myciel4 | 5 | 5 | 0 | 0 | (5, 1) | 0 | 0 |
| 5. myciel5 | 6 | 6 | 0 | 0 | (6, 1) | 0 | 0 |
| 6. myciel6 | 7 | 7 | 0 | 0 | (7, 1) | 0 | 0 |
| 7. huck | 11 | 11 | 0 | 0 | (11, 1) | 0 | 0 |
| 8. zeroin.i.3 | 30 | 30 | 0 | 0 | (30, 1) | 0 | 0 |
| 9. miles750 | 31 | 31 | 0 | 0 | (31, 1) | 0 | 0 |
| 10. miles500 | 20 | 20 | 0 | 0 | (20, 1) | 0 | 0 |
| 11. miles1500 | 73 | 73 | 0 | 4 | (73, 1) | 0 | 4 |
| 12. miles1000 | 42 | 42 | 0 | 1 | (42, 1) | 0 | 1 |
| 13. games120 | 9 | 9 | 0 | 0 | (9, 1) | 0 | 0 |
| 14. jean | 10 | 10 | 0 | 0 | (10, 1) | 0 | 0 |
| 15. myciel7 | 8 | 9 | 12.5 | 0 | (8, 3) | 0 | 0 |
| 16. queen5_5 | 5 | 7 | 40 | 0 | (5, 7) | 0 | 2 |
| 17. miles250 | 8 | 9 | 12.5 | 0 | (8, 3) | 0 | 0 |
| 18. queen6_6 | 7 | 10 | 42.86 | 0 | (8, 8) | 14.29 | 49 |
| 19. queen8_8 | 9 | 14 | 55.56 | 0 | (11, 8) | 22.22 | 1,858 |
| Mean error (%) | | | 8.60 | | | 1.92 | |

Table 3.13. Graph coloring
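The two mean errors of Table 3.13 can be recomputed from the color counts. The sketch below assumes that the relative error of a benchmark is 100 × (#colors found − exact) / exact and that the mean is taken over all 19 benchmarks; this reading reproduces the reported values. The name `mean_relative_error` is hypothetical.

```python
# Color counts of Table 3.13, in benchmark order 1 to 19.
EXACT = [11, 11, 4, 5, 6, 7, 11, 30, 31, 20, 73, 42, 9, 10, 8, 5, 8, 7, 9]
SINGLE_CLASS = [11, 11, 4, 5, 6, 7, 11, 30, 31, 20, 73, 42, 9, 10, 9, 7, 9, 10, 14]
MULTI_CLASS = [11, 11, 4, 5, 6, 7, 11, 30, 31, 20, 73, 42, 9, 10, 8, 5, 8, 8, 11]


def mean_relative_error(found, exact):
    """Mean over all benchmarks of 100 * (found - exact) / exact."""
    errors = [100.0 * (f - e) / e for f, e in zip(found, exact)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    print(round(mean_relative_error(SINGLE_CLASS, EXACT), 2))
    print(round(mean_relative_error(MULTI_CLASS, EXACT), 2))
    print(sum(f == e for f, e in zip(MULTI_CLASS, EXACT)), "exact solutions out of", len(EXACT))
```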


| Benchmark | Gain in assignment (%) | Gain in assignment and rescheduling (%) |
| Amotry.c | 23.53 | 33.34 |
| simpr.c | 43.42 | 45.92 |
| toeplz.c | 33.68 | 45.19 |
| frprmn.c | 24.97 | 36.37 |
| svdvar.c | 33.75 | 44.96 |
| Mean gain (%) | 31.87 | 41.16 |

Table 3.14. Gain in memory access time

| Benchmark | Gain (%) (Kim et al., 2005) | Gain (%) Our techniques |
| fourfs | 28 | 45 |
| spline | 43 | 31 |
| stoerm | 65 | 30 |
| pzextr | 34 | 56 |
| ratint | 24 | 43 |
| Mean gain (%) | 38.80 | 41.00 |

Table 3.15. Comparison of memory access reduction techniques

3.5. Gate level

3.5.1. Estimation of the average and maximal power consumption of a digital circuit

As mentioned in section 3.4.1, power consumption has become a critical parameter both from the perspective of technological development (more significant current leakages, non-negligible static consumption due to submicron interconnections) and from the perspective of the increasing complexity of integrated systems (systems embedded on satellites or on


other modules where a mains power supply is not available). Thus, it has become essential to develop low-power systems (limited battery autonomy and reliability: high power consumption generates an increase in temperature, which can lead to a malfunction of transistors). Hence, the issues related to the reduction of power consumption are approached at each level of design. For a given level of abstraction, this involves not passing to the next level as long as the constraint on power dissipation is not met. In fact, this is often a matter of energy reduction (Power = Energy * Frequency): to reduce power consumption, it is sufficient to choose a very low frequency if there is no constraint on performance. Otherwise, the frequency must maintain a certain value, and the energy must be reduced in order to reduce power consumption. Therefore, the decision to pass from one design level to another relies on the availability of a tool for estimating power dissipation at the given design level. For this purpose, we present in this section our tool (named SPOT) for the estimation of the maximal and average consumption of an integrated circuit. Since there are two types of consumption, let us recall the two corresponding equations (mentioned in section 2.5.2.3):
– Dynamic consumption: P = 0.5 ∗ … ( … )
– Static consumption: P = 0.28125 ∗ … ∗ 10^… ∗ … ∗ ( … ∗ 10^… + … )

The estimation of dynamic consumption requires, besides feeding the circuit with a vector of input signals, knowing which capacitances (at the outputs of logic gates) keep their state (the capacitance remains charged or discharged) and which change state (a given capacitance can pass through several intermediate states, known as glitches, before stabilizing). The only way to know this is to feed the circuit with a first vector of signals V0 in order to stabilize the capacitances at certain


states, feed the circuit with a second vector Vt to observe the various possible transitions (calculate NGi), and then use the equation to determine the dynamic consumption due to this pair of vectors (V0, Vt). Note that the number of transitions NGi may be above 1 for the gates whose logical depth is above 1, because their input signals may be unsynchronized or change their values in time. Hence, for a circuit with N input signals, the exhaustive method would involve the consideration of 2^(2N) possible cases:
– as the dynamic power is initialized at 0, we add to it each time the power dissipated for one of the 2^(2N) cases. The average dynamic power is simply the sum obtained, divided by 2^(2N);
– as the calculation for the 2^(2N) cases is executed, the maximal dynamic power, initialized at −∞, is updated.
Considering a circuit with only three input signals, the exhaustive method involves the initialization of V0 at:
– 000 and vary Vt from 000 to 111: 2^3 = 8 cases;
– 001 and vary Vt from 000 to 111: 2^3 = 8 cases;
– and so on, up to 111 and vary Vt from 000 to 111: 2^3 = 8 cases.
Hence, for each of the 2^N values of V0, we must consider 2^N values of Vt, or a total of 2^N * 2^N = 2^(2N) power calculations. This calculation approach is not applicable in practice for a substantial value of N, since its computational complexity is O(2^(2N)). Indeed, considering a simple circuit that is an adder of two numbers of 32 bits each (or 64 input bits), the number of calculations would be equal to 2^(2*64) = 2^128, which is intractable. Moreover, for each of the 2^128 cases, the calculation of the various values of NGi depends on the design style of the circuit (static with or without pass transistors, dynamic, etc.), which would require a prohibitive CPU time. As the calculation of static consumption from the previously indicated equation does not pose any particular problem, we present in this section our heuristic for the estimation of dynamic (average and maximal)


power, an issue that is characterized by a very large solution space (2^(2N) elements) for a circuit with a significant number of input signals. In the previous presentation of our heuristics, we have seen how the random initialization of the solution is avoided in favor of a well-thought-out solution that could be the ideal solution. For this problem, we show how this enormous solution space can be effectively explored, in order to generate a near-optimal solution (or even an optimal solution in certain cases) in a reasonable CPU time.
1st observation: by feeding the circuit with a vector V0 of input signals, the capacitances at the output nodes of the logic gates of the circuit each take a certain state (charged or discharged). By re-injecting the same vector V0, each of the capacitances preserves its previous state. In other terms, the term NGi in the equation of dynamic consumption is zero for all the gates of the circuit. This means that for the pair of vectors (V0, V0), the dynamic consumption is zero; therefore, it is useless to consider such pairs of vectors. Note that the number of these pairs is equal to 2^N, with N being the number of input signals.
2nd observation: by feeding the circuit with a vector V0 of input signals, followed by a different vector of signals Vt, each capacitance at the output node of an arbitrary logic gate undergoes a certain number of transitions NGi (discharge: transition 1 → 0; charge: transition 0 → 1), where NGi can be zero. Feeding the circuit with the pair (Vt,V0), the reverse electrical effect takes place: a capacitance that is charged (discharged) with (V0,Vt) is discharged (respectively charged) with (Vt,V0). But both the charge and the discharge are counted as transitions. Hence, the number of transitions of a given gate that occur with (V0,Vt) is exactly THE SAME as with (Vt,V0), and the power consumed with (V0,Vt) is EQUAL to that consumed with (Vt,V0).
In this case, it makes no sense to waste CPU time considering both (V0,Vt) and (Vt,V0). Therefore, to effectively explore the solution space, we calculate the dynamic consumption with only one of the two pairs: if the calculation is performed with a given pair (V0,Vt) – respectively (Vt,V0) –, then the pair (Vt,V0) – respectively (V0,Vt) – will not be considered. By defining X as the total number of solutions (2^(2N)) from which we deduct the number of pairs (V0,V0) that are not taken into account for the calculation, we then


have: X = 2^(2N) − 2^N. Among the remaining elements, half are eliminated because (V0,Vt) and (Vt,V0) generate the same result. Therefore, X/2 elements are eliminated, that is, 2^(2N−1) − 2^(N−1) solutions. Based on these two observations, the solution space is effectively explored, since the number of eliminated solutions is equal to Y = 2^N + 2^(2N−1) − 2^(N−1). In other terms, the size of the solution space is reduced from 2^(2N) to Z = 2^(2N) − Y. Therefore, Z = 2^(2N) − (2^N + 2^(2N−1) − 2^(N−1)) = 2^(2N) − 2^N − 2^(2N−1) + 2^(N−1), which simplifies to Z = 2^(2N−1) − 2^(N−1), while Y = 2^(2N−1) + 2^(N−1). It can thus be verified that Z < Y, which means that the number of solutions that can be explored is below that of eliminated solutions. These two observations show once again how random generation is abandoned in favor of a set of solutions that leads to a near-optimal solution (or even an optimal solution in certain cases) in a reasonable CPU time for a problem of exponential computational complexity.
This being said, the size of the set of solutions that can be explored is still quite significant (Z elements). It is worth recalling that the calculation of dynamic power for only one pair of vectors already requires a certain CPU time (depending on the vector size, the design style of the circuit, the logical depth of the circuit, etc.). Therefore, it would not be possible to consider the Z elements in a limited CPU time. It is thus important to eliminate, in an effective manner, some more of the Z elements. But how should this be done? The observation that "(V0, V0) involves no power consumption" can be interpreted as follows: if no bit of the 1st vector V0 differs from the corresponding bit in the 2nd vector V0, the power consumption is zero. But what happens if the two vectors V0 and Vt differ by 1 bit, 2 bits, …, N bits?
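The counting argument above can be checked numerically. The sketch below computes the total number of pairs, the number Y of eliminated pairs and the number Z of remaining pairs for a given N, and cross-checks Z for N = 3 by enumerating the unordered pairs of distinct vectors. The function name `solution_space_counts` is hypothetical.

```python
from itertools import combinations


def solution_space_counts(n):
    """Return (total, eliminated Y, remaining Z) for n input signals."""
    total = 2 ** (2 * n)       # all ordered pairs (V0, Vt)
    identical = 2 ** n         # pairs (V0, V0): no dynamic consumption
    x = total - identical      # pairs with V0 != Vt
    mirrored = x // 2          # (Vt, V0) duplicates of (V0, Vt)
    y = identical + mirrored
    z = total - y
    return total, y, z


if __name__ == "__main__":
    print(solution_space_counts(3))
    # Cross-check: unordered pairs of distinct 3-bit vectors.
    print(len(list(combinations(range(2 ** 3), 2))))
    # For the 64-input adder, Z is still astronomically large.
    print(solution_space_counts(64)[2] == 2 ** 127 - 2 ** 63)
```

For N = 3, the 64 ordered pairs reduce to 28 pairs to explore, which is why a further selection (the classes introduced next) is still needed for realistic values of N.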
As there is nothing to indicate which difference of bits has the strongest impact on power consumption, we have defined N classes (N being the size of the vector of input signals), where an arbitrary pair (V0,Vt) belongs to class Ci if and only if V0 and Vt differ by exactly i bits (1 ≤ i ≤ N). Note that class C0 does not exist, as its elements have no influence on power consumption. In order to deal with representative elements, we have decomposed each class Ci into 2^i subsets in which the pairs of vectors to be dealt with are (V0, Vt) = (V0_{i,k,j}, Vt_{i,l,j}), where Si,k is the kth subset in Ci; 1 ≤ k ≤ 2^i; l = 2^i − k + 1; V0_{i,k,j} (Vt_{i,l,j}) is the jth vector in the subset


Si,k (respectively Si,l); 1 ≤ j ≤ 2^(N−i). Therefore, the classes C1, C2, …, CN make it possible to consider several representative elements. Nevertheless, for a better exploration of the solution space, we have added the class CX, including the pairs of vectors (V0j, Vtm) such that 1 ≤ j ≤ 2^N; m = j+1 (j−1) if j is odd (respectively even).

[Figure: for each of the classes C1, C2, C3 and CX, the columns v0 and vt list the vectors 000 to 111, grouped into the subsets S11 and S12 (C1), S21 to S24 (C2) and S31 to S38 (C3)]

Figure 3.16. Pairs of vectors explored by SPOT

For a circuit with three input signals, the pairs of vectors to be dealt with are those indicated in Figure 3.16, where the pair (V0_{1,1,1}, Vt_{1,2,1}), for example, corresponds to (000, 100). Let us recall that the pairs (V0,V0) are not explored, nor are the pairs (Vt,V0) if the pairs (V0,Vt) have been explored. The pairs of vectors are processed alternately in the subsets Si,k and Si,l, with Ci = ⋃ Si,k = ⋃ Si,l. The results obtained for ISCAS benchmarks are given in Table 3.16. The exact result (exhaustive exploration) has been possible, within a reasonable CPU time, only for circuits whose number of input signals does not exceed 9. Compared to the exact results, the relative error does not exceed 14% for a fixed CPU time. Note also that among the nine cases for which the comparison with the exact solution has been possible, SPOT gives the exact solution for six benchmarks (2/3 of cases). These results have been obtained on a SUN Sparc LX station running at a frequency of 50 MHz. With current processors (frequency over 3 GHz), SPOT could explore MORE elements in the same CPU time, which should reduce the relative error.
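The construction of the classes can be sketched in a few lines. The code below follows our reading of the subset definition: class Ci is cut into 2^i consecutive blocks of 2^(N−i) vectors, and the jth vector of block k is paired with the jth vector of block l = 2^i − k + 1. Only the pairs with k < l are kept, since the mirrored pairs (Vt,V0) are not explored. The function names `class_pairs` and `cx_pairs` are hypothetical, and vectors are represented as integers.

```python
def class_pairs(n, i):
    """Pairs (V0, Vt) of class Ci for n input signals, mirrored pairs dropped."""
    block = 2 ** (n - i)              # number of vectors per subset S(i,k)
    pairs = []
    for k in range(1, 2 ** i + 1):
        l = 2 ** i - k + 1            # index of the partner subset S(i,l)
        if k >= l:
            continue                  # (k, l) already covered as (l, k)
        for j in range(block):
            pairs.append(((k - 1) * block + j, (l - 1) * block + j))
    return pairs


def cx_pairs(n):
    """Pairs of class CX: vector j with j+1 (j odd, 1-based), mirrored pairs dropped."""
    return [(j - 1, j) for j in range(1, 2 ** n + 1, 2)]


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    for i in (1, 2, 3):
        print(i, [(format(a, "03b"), format(b, "03b")) for a, b in class_pairs(3, i)])
    print("X", [(format(a, "03b"), format(b, "03b")) for a, b in cx_pairs(3)])
```

Because l − 1 is the bitwise complement of k − 1 on i bits, every pair built this way differs in exactly i bits, which matches the membership condition of class Ci.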


| Circuit | #Primary inputs | #Gates | Logical depth | Exact method (mW) | CPU time | Heuristics (mW) | CPU time | % Error |
| c17 | 5 | 7 | 3 | 0.5625 | 3" | 0.5625 | 1" | 0 |
| too large | 38 | 86 | 4 | - | - | 1.0000 | 2h31'20" | - |
| c1355 | 43 | 634 | 28 | - | - | 78.2500 | 3h36'17" | - |
| decod | 5 | 36 | 4 | 0.7500 | 45" | 0.7500 | 5" | 0 |
| count | 35 | 94 | 34 | - | - | 7.0000 | 3h53'07" | - |
| cordic | 23 | 204 | 26 | - | - | 19.8750 | 2h46'32" | - |
| alu4 | 14 | 224 | 24 | - | - | 12.8750 | 4h24'07" | - |
| comp | 32 | 110 | 12 | - | - | 1.7500 | 2h52'51" | - |
| apex6 | 135 | 476 | 16 | - | - | 17.3750 | 1h44'47" | - |
| parity | 16 | 30 | 8 | - | - | 0.5000 | 5'17" | - |
| majority | 5 | 4 | 4 | 0.2500 | 3" | 0.2500 | 1" | 0 |
| b1 | 3 | 12 | 4 | 0.7500 | 0" | 0.7500 | 0" | 0 |
| cm82a | 5 | 12 | 4 | 0.8750 | 6" | 0.7500 | 1" | 14 |
| cm151a | 12 | 18 | 10 | - | - | 2.3750 | 1'45" | - |
| cm152a | 11 | 2 | 2 | - | - | 0.1250 | 28" | - |
| cm138a | 6 | 18 | 4 | 0.5000 | 45" | 0.5000 | 5" | 0 |
| 9syml | 9 | 88 | 12 | 6.6250 | 6h16'32" | 5.6975 | 4'45" | 14 |
| x2 | 10 | 24 | 4 | - | - | 2.0000 | 1'31" | - |
| cm42a | 4 | 26 | 6 | 1.2500 | 4" | 1.1000 | 1" | 12 |
| z4ml | 7 | 16 | 4 | 1.0000 | 5'54" | 1.0000 | 22" | 0 |

Table 3.16. Results generated by SPOT for ISCAS benchmarks


3.5.2. Automated layout generation of some regular structures (shifters, address decoders, PLAs)

As the design of a VLSI system is time consuming and expensive, techniques that rapidly generate accurate results are desirable. Silicon compilation (using a library of standard cells) is one of the promising methods for the design of integrated systems. Nevertheless, due to the rapid development of semiconductor technology, it is necessary to update the various libraries used by the designer. The updates of certain libraries may be quite time consuming, which slows down the design of a system and is therefore not in line with the "Time to market" policy. Among these libraries, the cells in the form of layout are characterized by a tedious design that is subject to verifications (design and electrical rules). The automated generation of such cells for a given technology is therefore highly recommended. This section presents our tool CMORGEN, which enables the generation of regular cells such as shifters, memory address decoders and PLAs. For each of these three structures, we define, starting from the structural representation, the leaf cells, which are repeated in the circuit. Each of these cells is designed using a graphic editor, verified electrically, rule checked with respect to the targeted technology, and then simulated. This has been performed using MAGIC (developed at UC Berkeley), which integrates an electrical simulator. Our tool CMORGEN places these cells and connects them according to the topology of the circuit. The mirroring technique has been used to obtain a more compact and optimized circuit. Let us recall that this technique involves sharing the same supply line Vdd or the same line Vss between the logic gates that are in two neighboring rows. This is achieved by vertically flipping every other row of gates.
The first benefit of such a methodology is that the library of cells can be rapidly updated by modifying only the basic cells, depending on the new technology targeted. The second benefit is that the automated generation of a given structure is done generically (the size of the shifter or decoder, or the behavioral description of the PLA, may vary).
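The mirroring technique can be pictured with a minimal placement sketch. This is only an illustration of tiling a leaf cell in rows and flipping every other row so that neighboring rows can share a supply line; it does not reproduce CMORGEN or the MAGIC file format. The function `place_rows`, the cell dimensions and the record fields are all hypothetical.

```python
def place_rows(n_rows, n_cols, cell_width, cell_height):
    """Tile a leaf cell on a grid, flipping every other row vertically.

    Returns one record per cell instance with its origin and orientation.
    Odd rows are flipped so that they face the even row below them and can
    share the same Vdd or Vss line.
    """
    placements = []
    for row in range(n_rows):
        flipped = row % 2 == 1
        for col in range(n_cols):
            placements.append({
                "row": row,
                "col": col,
                "x": col * cell_width,
                "y": row * cell_height,
                "flipped": flipped,
            })
    return placements


if __name__ == "__main__":
    # Hypothetical 4 x 5 array with an arbitrary leaf cell size (in lambda units).
    for cell in place_rows(4, 5, cell_width=40, cell_height=60)[:7]:
        print(cell)
```

Because only the leaf cell geometry enters the placement, retargeting to a new technology amounts, in this simplified picture, to changing the cell and its dimensions while the tiling logic stays the same.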

Figure 3.17 shows the block diagram of CMORGEN, while Figures 3.18–3.20 show the automatically generated layouts that correspond respectively to a shifter array, a memory address decoder and a PLA. Finally, Tables 3.17, 3.18 and 3.19 show that this type of structure is generated automatically within a few seconds, whereas its manual design would be tedious and would take several days.

Figure 3.17. Block diagram of CMORGEN (blocks shown: MAGIC, SPICE, leaf cells, main.c, shifter.c, decoder.c, pla.c, include.h; output: shifter.mag or decoder.mag or pla.mag)

Lines * columns (# rectangles) | CPU time (s), SUN station (50 MHz) | CPU time (s), PENTIUM (100 MHz)
3 * 3 (2,400) | 1 | 0
10 * 4 (11,073) | 3 | 0
10 * 10 (35,193) | 8 | 2
20 * 10 (70,373) | 16 | 4
20 * 20 (192,372) | 39 | 11

Table 3.17. CPU time for the generation of shifters of various sizes

# Inputs (# rectangles) | CPU time (s), SUN station (50 MHz) | CPU time (s), PENTIUM (100 MHz)
4 (8,409) | 1 | 0
7 (104,857) | 13 | 3
9 (523,439) | 67 | 20
10 (1,151,738) | 149 | 45

Table 3.18. CPU time for the generation of memory address decoders of various sizes

PLA | # Primary input symbols | # State variables | # Primary output symbols | # Product terms | # Rect. | CPU time (s) SUN/PENT.
PLA 1 | 2 | 3 | 2 | 9 | 65,558 | 2/0
PLA 2 | 4 | 16 | 8 | 40 | 83,572 | 14/3
PLA 3 | 4 | 16 | 8 | 90 | 176,682 | 35/8

Table 3.19. CPU time for the automated layout generation of various PLAs

Figure 3.18. Automated layout generation for a shifter array 4 * 5. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

Figure 3.19. Automated layout generation for a memory address decoder with four inputs. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

Figure 3.20. Automated layout generation of a PLA. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

3.5.3. Automated layout generation of digital circuits according to the River PLA technique

We have presented above the automated layout generation of some regular structures. Nevertheless, certain circuits are irregular a priori. Generating them automatically spares the designer a tedious design effort that may take several days and would thus not comply with the “Time to market” commercial policy. An irregular structure nevertheless presents several drawbacks, the most important of which is the creation of parasitic electrical parameters, which have a negative influence on performance, surface and power consumption. It is therefore highly desirable to transform circuits with irregular structures into circuits with regular structures. Since the tool CMORGEN presented above can automatically generate only a priori regular circuits, we have developed another tool named River PLA, a name derived from the methodology initially developed by Mo et al. (2002). This methodology involves the transformation of any structure into a set of interconnected PLAs. Since a PLA is a regular structure, the resulting circuit has a regular structure. We have therefore adopted the same methodology, but used our own generation and optimization techniques. The related details are given in Chapter 2, section 2.4.1.3. Figures 2.24–2.29 and Tables 2.2 and 2.3 provide the details concerning the two versions of our tool: River PLA I (where all the PLAs are arranged in the same “column”) and River PLA II (where the PLAs are arranged in matrix form). River PLA II, which is far better adapted to practical applications (the height and width of the automatically generated circuit are those desired by the designer), produces the layout within a reasonable CPU time (as shown in Table 2.3, it is, for example, 2 s for the ISCAS C17 circuit, with 6 logic gates, and 105 s for the C499 circuit, with 202 logic gates). Finally, it is worth noting that the placement of the various basic cells and their interconnection are performed in polynomial time. Concerning the optimization of the size of the PLAs, the height is reduced by reducing the number of product terms (ESPRESSO, UC Berkeley), while the width is reduced by a Turing transformation of the column-sharing problem (letting signals that supply the various PLAs share the same column) into the graph coloring problem (solved by our tool FEWERCOLORS): the graph is constructed by associating a node with each signal, and two nodes are connected by an edge if and only if the associated signals supply the same PLA. After graph coloring, the signals that share the same column are those whose corresponding nodes have received the same color.

3.6. Interconnections

3.6.1. Low-power buffer insertion technique for the design of submicron interconnections with time and surface constraints

As already mentioned, the development of semiconductor technology has enabled a significant improvement of the delay of logic gates, but unfortunately the same is not true of interconnections. Hence, several techniques, among which is buffer insertion on an interconnection, have been developed in order to improve the data transfer time between emitting and receiving nodes. Further details are provided in Chapter 2, section 2.8.1.1, which enables us in this chapter to directly approach the algorithmic aspect of our technique, which involves the introduction of at most N inverters at regular intervals. In order to reduce the delay between the emitter and the receiver while meeting the power consumption constraint, one obvious possibility would be to consider all the possible cases (insert 1, 2, …, or N inverters, then choose the best solution). However, in order to insert only one inverter, there are N possibilities: place it at node 2, 3, …, or (N+1); there are N+2 nodes, the 1st and the last node being respectively the emitter and the receiver. It is easy to verify that the total number of possibilities for the insertion of 1, 2, …, or N inverters is equal to $\sum_{i=1}^{N} C_N^i = \sum_{i=1}^{N} \frac{N!}{i!\,(N-i)!} = 2^N - 1$, which is a very significant number of possibilities.
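The count of candidate insertions can be checked numerically. The short Python sketch below enumerates every choice of 1 to N inverter positions among the N intermediate nodes and compares the result with the sum of binomial coefficients; the function names are illustrative only, and the enumeration counts positions only, not the electrical parameters of each inverter.

```python
from itertools import combinations
from math import comb


def count_insertions(n: int) -> int:
    """Count, by explicit enumeration, the ways to insert 1..n inverters."""
    candidate_nodes = range(2, n + 2)  # nodes 2..N+1; nodes 1 and N+2 are emitter and receiver
    return sum(1 for k in range(1, n + 1) for _ in combinations(candidate_nodes, k))


def count_closed_form(n: int) -> int:
    """Sum of the binomial coefficients C(n, i) for i = 1..n."""
    return sum(comb(n, i) for i in range(1, n + 1))
```

For N = 6 both functions return 63, and the count doubles (plus one) with each additional candidate node, which is why exhaustive exploration quickly becomes impractical.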

Since this problem is not tractable, we developed a heuristic aimed at obtaining a quality solution within a reasonable CPU time. As with the heuristics already presented, we avoid generating random solutions and instead choose a carefully targeted solution during initialization: the best one in terms of power consumption. If the time and surface constraints are met, this initialization gives the best solution and avoids the unnecessary exploration of other solutions (a significant gain in CPU time). If one or both constraints are not met, other solutions are explored progressively, starting from the solutions already considered. This avoids altering too much the quality of the explored solutions that have not met the constraint(s). The realizable solution obtained is near-optimal because it is generated from the best candidates (sharing certain characteristics). Before further describing our algorithm, it is worth noting that, with some modifications, it can be adapted to the problem of optimizing the delay subject to surface and power consumption constraints (in the version presented here, the objective function is the power consumption). For each equipotential line, and for each of its interconnections, the main module of our algorithm determines, if possible, the positions of the inverters that minimize the power consumption while meeting the time and surface constraints.

The Determine_buffer_positions() process is composed of three main parts. For each combination (namely, a certain number of inverters placed at certain positions of the interconnection) among the M possible ones that concern the insertion of the same number of inverters, this process generates the ideal solution, the one that involves the lowest power consumption. If the time and surface constraints are met, the exploration process continues with another combination. Otherwise, this process attempts to find a solution that meets the constraints by making slight modifications to the previous solution (not altering the quality of the solution too much, while meeting the constraints). These modifications concern the supply voltage, the threshold voltages of the transistors and their dimensions: this is the part where k takes the value 2 in the Generate_Individual() process. If the previous solution is realizable (meeting the constraints), but is not ideal, then Generate_Individual() uses it to converge to a better realizable solution: this is the part where k > 2.

If the problem has a solution (the designer may specify inadequate constraints), the Select_configuration() function returns three possible solutions:

– Ecand_1 (set including the positions of the inverters whose electrical parameters are stored in Ebuffer_1) and Ebuffer_1 (set including the solutions meeting the delay constraint while taking the least surface and the lowest power consumption);
– Ecand_S (set including the positions of inverters whose electrical parameters are stored in Ebuffer_S) and Ebuffer_S (set including the solutions meeting the delay constraint while taking the least surface, but not the lowest power consumption);
– Ecand_P (set including the positions of inverters whose electrical parameters are stored in Ebuffer_P) and Ebuffer_P (set including the solutions meeting the delay constraint while consuming the least power, but not necessarily the least surface).

Out of these three types of solutions, obviously, the most interesting case is when Ecand_1 and Ebuffer_1 are not empty. Based on these comments, we provide below the details of the algorithm.

BEGIN /* Main module */
For each equipotential line
Do {
    Determine all the interconnections belonging to this equipotential line, then sort them in descending order of their lengths;
    /* this aims to meet the constraints for long interconnections */
    For each interconnection
    Do {
        Determine G = (V, E);
        /* V = {nodes of the interconnection, source and sink nodes included}, E = {(vi, vj); vi ∈ V ∧ i < j}; see Figure 2.66 */
        Determine_buffer_positions(); /* determines the number of inverters and their positions */
        Select_configuration(); /* if several solutions generate the same delay, select the one that is better suited to the application: power and/or surface is the most essential parameter for the concerned application */
    } Done
} Done

Determine_buffer_positions()
Smin = +∞; Pmin = +∞;
Ecand_1 = ∅; Ecand_S = ∅; Ecand_P = ∅;
Ebuffer_1 = ∅; Ebuffer_S = ∅; Ebuffer_P = ∅;
/* Smin (Pmin) is the minimal surface (power) of the inverter configuration (number and their positions) meeting the time and power (surface) constraints
   Ebuffer_1 (set including the solutions meeting the delay constraint while taking the least surface and the least power consumption)
   Ecand_1 (set including the positions of inverters whose electrical parameters are stored in Ebuffer_1)
   Ebuffer_S (Ebuffer_P) is the set of solutions meeting the delay constraint while consuming the least surface (power), but not the least power (surface)
   Ecand_S (Ecand_P) is the set including the positions of inverters whose electrical parameters are stored in Ebuffer_S (Ebuffer_P) */
For i = 1 to M /* M is the number of explored combinations; $M \le \sum_{j=1}^{N} C_N^j$ */
Do {
    Generate_Ideal_Individual(); /* Assign WL, VthNH, VthPL, VddL to all the inverters of the current combination */
    /* An ideal individual is a set of n inverters (n ≤ N) such that any of them is designed with WL, VthNH, VthPL and VddL, hence the solution that reduces the most the power consumption; indices L and H correspond to Low and High respectively */
    k = 1;
    LABEL:
    D = delay(); // calculate the delay of interconnection
    S = estimate_area(); // calculate the area of inverters
    If |D - Tf| ≤ ε and |S - Sf| ≤ ε /* Tf and Sf are the time and surface constraints, respectively */
    Then {
        P = Power(); /* calculate the power generated by the current combination */
        If S < Smin and P < Pmin
        Then {
            Ebuffer_1 = {combination i}; /* the combination i corresponds to electrical parameters W, Vdd, Vth of the m inverters (1 ≤ m ≤ N) */
            Ecand_1 = {the least expensive path determined until present}; /* this path includes the source and sink nodes, as well as the n inverters */
            Smin = S; Pmin = P;
        }
        Else {
            If S < Smin
            Then If P = Pmin
                 Then {
                     Ebuffer_S = {combination i};
                     Ecand_S = {the least expensive path determined until present};
                 }
                 Else {
                     Ebuffer_S = Ebuffer_S ∪ {combination i};
                     Ecand_S = Ecand_S ∪ {the least expensive path determined until present};
                 }
                 Endif
            Endif
            If P < Pmin
            Then If S = Smin
                 Then {
                     Ebuffer_P = {combination i};
                     Ecand_P = {the least expensive path determined until present};
                 }
                 Else {
                     Ebuffer_P = Ebuffer_P ∪ {combination i};
                     Ecand_P = Ecand_P ∪ {the least expensive path determined until present};
                 }
                 Endif
            Endif
        }
        Endif
        If k = 1 // ideal case
        Then continue; /* interrupt the generation of individuals with the current combination, consider another one */
        Endif
    }
    Endif
    k++;
    If k ≤ nb_individuals
    Then {
        Generate_Individual(D, S, P);
        goto LABEL;
    }
    Endif
}
End

Generate_Individual(D, S, P)
{
If k > 2
Then {
    i = 1;
    While |D - Tf| ≤ ε and i ≤ nb_buffers
    Do {
        If Wi = WH /* minimize the power and the surface while meeting the delay constraint */
        Then {
            Wi = WL;
            calculate D;
        }
        Endif
        i++;
    } Done
    If |D - Tf| > ε and i > 1
    Then {
        i--;
        Wi = WH;
    }
    Endif
    // Beginning of the processing with VthN
    i = 1;
    While |D - Tf| ≤ ε and i ≤ nb_buffers in the current combination
    Do {
        If VthN,i = VthNL /* minimize the current leakage in the channels of NMOS transistors */
        Then {
            VthN,i = VthNH;
            calculate D;
        }
        Endif
        i++;
    } Done
    If |D - Tf| > ε and i > 1
    Then {
        i--;
        VthN,i = VthNL;
    }
    Endif
    // End of the processing with VthN
    // Beginning of the processing with VthP
    Same process as with VthN, by replacing: VthN by VthP, VthNL by VthPH, VthNH by VthPL
    // End of the processing with VthP
    // Beginning of the processing with Vdd
    Same process as with VthN, replacing: VthN by Vdd, VthNL by VddH, VthNH by VddL
    // End of the processing with Vdd
}
Else { // k=2: Generate a solution from the ideal individual that has not met the delay constraint
    i = 1;
    While |D - Tf| > ε and i ≤ nb_buffers in the current combination
    Do {
        If Wi = WL
        Then {
            Wi = WH; /* Attempt to meet the delay constraint by increasing the width of transistors */
            S1 = S;
            calculate S;
            If |S - Sf| ≤ ε
            Then calculate D;
            Else {
                Wi = WL;
                S = S1;
            }
            Endif
        }
        Endif
        i++;
    } Done
    // Beginning of the processing with VthN
    i = 1;
    While |D - Tf| > ε and i ≤ nb_buffers in the current combination
    Do {
        If VthN,i = VthNH
        Then {
            VthN,i = VthNL; /* Attempt to meet the delay constraint by reducing the threshold voltages of NMOS transistors */
            calculate D;
        }
        Endif
        i++;
    } Done
    // End of the processing with VthN
    // Beginning of the processing with VthP
    Resume the last processing with VthN, by replacing: VthN by VthP, VthNL by VthPH, VthNH by VthPL
    // End of the processing with VthP
    // Beginning of the processing with Vdd
    i = 1;
    While |D - Tf| > ε and i ≤ nb_buffers in the current combination
    Do {
        If Vdd,i = VddL
        Then {
            Vdd,i = VddH;
            For j = i+1 up to nb_buffers in the current combination
            Do {
                Vdd,j = VddH; /* Attempt to meet the delay constraint by increasing the supply voltages of inverters */
                j++;
            } Done
            i = j;
        }
        Else i++;
        Endif
    } Done
    // End of the processing with Vdd
}
Endif
}
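The k = 2 branch above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the author's implementation: the delay and area models are passed in by the caller (the `toy_delay` and `toy_area` functions are hypothetical stand-ins with made-up integer units), the constraint test is simplified to "delay ≤ t_max" and "area ≤ s_max" instead of the |D − Tf| ≤ ε form of the pseudocode, and only the width and NMOS-threshold phases are shown.

```python
from dataclasses import dataclass
from typing import Callable, List

W_L, W_H = 0.22, 1.76         # transistor widths in µm (values quoted in the text)
VTHN_L, VTHN_H = 0.45, 0.55   # NMOS threshold voltages in V (values quoted in the text)


@dataclass
class Inverter:
    w: float = W_L        # the ideal individual starts with the narrow width
    vthn: float = VTHN_H  # and the high (low-leakage) threshold


def repair_delay(buffers: List[Inverter],
                 delay: Callable[[List[Inverter]], float],
                 area: Callable[[List[Inverter]], float],
                 t_max: float, s_max: float) -> bool:
    """Upgrade inverters one at a time until the delay constraint is met."""
    # Phase 1: widen transistors; undo any step that breaks the area budget.
    for b in buffers:
        if delay(buffers) <= t_max:
            return True
        if b.w == W_L:
            b.w = W_H
            if area(buffers) > s_max:
                b.w = W_L  # roll back, as in the Else branch of the pseudocode
    # Phase 2: lower the NMOS thresholds (faster, but leakier).
    for b in buffers:
        if delay(buffers) <= t_max:
            return True
        if b.vthn == VTHN_H:
            b.vthn = VTHN_L
    return delay(buffers) <= t_max


def toy_delay(bs: List[Inverter]) -> int:
    """Hypothetical delay: each wide device saves 2 units, each low-Vth device 1."""
    return 30 - sum(2 * (b.w == W_H) + (b.vthn == VTHN_L) for b in bs)


def toy_area(bs: List[Inverter]) -> float:
    """Hypothetical area: proportional to the sum of the widths."""
    return sum(b.w for b in bs)
```

With three inverters, `t_max=26` and a generous area budget, the sketch widens the first two inverters and stops, leaving the third in its low-power state; this mirrors the idea of moving away from the ideal individual as little as possible.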

Select_configuration()
{
If Ebuffer_1 = ∅ and Ebuffer_P = ∅ and Ebuffer_S = ∅
Then {
    Write “No solution for this problem: very severe constraints”;
    exit();
}
Endif
If Ebuffer_1 ≠ ∅ /* This set includes the solutions minimizing the power consumption while meeting the delay and surface constraints */
Then Use Ebuffer_1 and Ecand_1 for inverter insertion;
Else if Ebuffer_S ≠ ∅ and Ebuffer_P ≠ ∅
     Then {
         select 1 combination ∈ Ebuffer_S (respectively ∈ Ebuffer_P) /* if the surface constraint (respectively the power constraint) has priority for the application being considered */
         Use (Ebuffer_S and Ecand_S) or (Ebuffer_P and Ecand_P) for the insertion of inverters;
     }
     Else {
         Select 1 combination among those in the non-empty set;
         Use (Ebuffer_S and Ecand_S) (resp. (Ebuffer_P and Ecand_P)) for the insertion of inverters; /* depending on whether Ebuffer_S ≠ ∅ (resp. Ebuffer_P ≠ ∅) */
     }
     Endif
Endif
}
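The decision logic of Select_configuration() is compact enough to restate as a small Python function. This is a sketch only: the sets are represented as plain lists, the `prefer` argument is a hypothetical way of expressing which constraint has priority for the application, and "select 1 combination" is simplified to taking the first element.

```python
from typing import List, Optional, Tuple


def select_configuration(ebuffer_1: List, ebuffer_s: List, ebuffer_p: List,
                         prefer: str = "surface") -> Optional[Tuple[str, object]]:
    """Return (set label, chosen combination), or None when no set has a solution."""
    if not (ebuffer_1 or ebuffer_s or ebuffer_p):
        return None  # constraints too severe: no solution
    if ebuffer_1:
        return ("1", ebuffer_1[0])  # least surface and least power
    if ebuffer_s and ebuffer_p:
        # Both partial optima exist: the application's priority decides.
        return ("S", ebuffer_s[0]) if prefer == "surface" else ("P", ebuffer_p[0])
    # Exactly one of the two sets is non-empty.
    return ("S", ebuffer_s[0]) if ebuffer_s else ("P", ebuffer_p[0])
```

Returning `None` instead of calling `exit()` is a design choice of the sketch, so that the caller can report the failure for one interconnection and continue with the next.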

Table 3.20 summarizes a part of the results obtained (targeting the 0.18 µm CMOS technology) for various lengths of interconnections and the delay and surface constraints.

Method | Path | Total power (µW) | Delay of interc. (ps) | Surface (µm²) | CPU time (s)
Without insertion of inverters | - | 12.05 | 2.60 | - | -
Heuristic | 0→2→4→7 | 0.85 | 2.49 | 0.237600 | 4
Heuristic | 0→3→5→7 | 0.59 | 2.41 | 0.237600 |
Heuristic | 0→2→5→7 | 0.83 | 2.52 | 1.069200 |
Exact method | 0→3→5→7 | 0.59 | 2.41 | 0.237600 | 1,015

Table 3.20. Results obtained for an interconnection length equal to 750 µm, TF=2.60 ps, SF=5.70 µm²

Assuming that VddL = 1.8 V, VddH = 3.3 V, VthNL = 0.45 V, VthNH = 0.55 V, VthPL = -0.55 V, VthPH = -0.45 V, WL = 0.22 µm and WH = 1.76 µm, Table 3.20 shows some paths explored by our heuristic. The heuristic converged to the optimal path (insertion of two inverters at nodes 3 and 5, with nodes 0 and 7 being respectively the source and the sink), which, while meeting the delay and surface constraints, yields a power consumption equal to 0.59 µW, clearly below that consumed in the same interconnection without insertion of inverters (12.05 µW). This is the case for delay and surface constraints equal to 2.60 ps and 5.70 µm² respectively. Finally, let us note that the CPU time consumed by the heuristic (4 s) is clearly below that required by the exact method (1,015 s for an exhaustive exploration of paths). Our heuristic obtained the exact solution in a short CPU time thanks to a proper initialization of the solution, a reasoned shift from one solution to another and the avoidance of randomness.

3.6.2. Data encoding and decoding for low-power aided design of submicron interconnections

Since Chapter 2, section 2.8.1.2, provides a detailed presentation, in this chapter we directly approach the algorithmic aspect of this issue. It is worth mentioning that this takes into account the non-negligible power dissipated due to parasitic electrical parameters (among which are the capacitances) generated by the interconnection lines placed side by side. Having observed that this additional consumption strongly depends on the data flowing through the interconnections (Sotriadis et al., 2003), it has become obvious that data inducing lower energy consumption should be used, by means of adequate data encoding and decoding (Figure 3.21). Let N be the original size of the bus, and M be the size of the coded bus (M > N). The basic principle of encoding is to use the $2^M \times 2^N$ transitions that consume the least energy, therefore eliminating $2^M \times (2^M - 2^N)$ additional transitions. Starting from Table 2.10, Figure 3.22 shows the possible transitions of a coded bus, with N=2 and M=3, after the elimination of additional transitions.

Figure 3.21. Global block diagram of data encoding and decoding

Indeed, Table 2.10 indicates that for each state among the eight ($2^M = 2^3$) states, there are $2^M = 8$ possible transitions. But since the original size of the bus is N=2, only $2^N = 4$ transitions are required for each of the eight states. The total number of transitions is then $8 \times 2^N = 32$, and the transition from one state to another occurs depending on the considered encoding. For example, Figure 3.23 illustrates the four transitions of the state 000, which are the four least energy consuming transitions from Table 2.10. We have nevertheless seen, in section 2.8.1.2, that other encoding/decoding techniques are more interesting in terms of energy consumption. It is worth recalling here our technique, which, like other techniques, involves dynamic encoding/decoding, but calculates the data occurrences exactly (instead of estimating them) in order to reduce the energy as much as possible. This comes at the expense of memorizing data packets, and consequently of a certain delay in data transmission:

1) store the data packet in a buffer;
2) determine the probability of each type of data;
3) use equation [2.16] to determine the energy consumed during a transition from one state to another;
4) encode the data to reduce the energy as much as possible, then transmit them;
5) if there are still data to be transmitted, then go to (1). Else, STOP.

Since this encoding/decoding problem poses no particular difficulty, it does not require the presentation of other elements of information besides those presented in section 2.8.1.2 or those presented above.
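The transition-pruning principle (keep, for each of the $2^M$ coded states, only the $2^N$ cheapest transitions) can be sketched in Python. The energy function used here is a hypothetical stand-in, since Table 2.10 and equation [2.16] are not reproduced in this section: it charges each toggling line and each pair of adjacent lines whose voltage changes differ, which is only a rough proxy for coupling capacitance.

```python
from itertools import product
from typing import Callable, Dict, List


def cheapest_transitions(m: int, n: int,
                         energy: Callable[[str, str], int]) -> Dict[str, List[str]]:
    """For every M-bit state, keep the 2**n next states with the lowest energy."""
    states = ["".join(bits) for bits in product("01", repeat=m)]
    table = {}
    for s in states:
        # Sort by energy; ties are broken by the state label for determinism.
        ranked = sorted(states, key=lambda t: (energy(s, t), t))
        table[s] = ranked[: 2 ** n]
    return table


def toy_energy(s: str, t: str) -> int:
    """Hypothetical cost: self-switching plus coupling between adjacent lines."""
    d = [int(b) - int(a) for a, b in zip(s, t)]  # per-line change: -1, 0 or +1
    self_cost = sum(x * x for x in d)
    coupling = sum((x - y) ** 2 for x, y in zip(d, d[1:]))
    return self_cost + coupling
```

For M = 3 and N = 2 the table holds 8 × 4 = 32 transitions, and the other 8 × (8 − 4) = 32 are eliminated, in line with the counts given above. Which transitions survive depends entirely on the energy model, so the sketch should not be read as reproducing Figure 3.22.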

Figure 3.22. Possible transitions for a size 3 bus. For a color version of this figure, see www.iste.co.uk/mahdoum/CAD.zip

Figure 3.23. Example of bus transitions (states shown: 000, 001, 100, 111; data values d = 00, 01, 10, 11)

3.6.3. High-level synthesis of networks-on-chip subject to bandwidth, surface and power consumption constraints

It is worth recalling that the development of semiconductor technology has significantly reduced the delay at the logic gate level, but unfortunately not at the interconnection level. At the same time, integrated systems have grown in complexity and are characterized by a high degree of communication. This has led designers to do without bus-based architectures (the bus being a critical resource that creates a bottleneck during data transfers) and crossbar-based architectures (an upgrade of bus-based architectures, which does not properly meet the requirements of current integrated systems). Architectures based on a network-on-chip (NoC) are then adopted to better meet the specifications of the integrated system to be developed. We have seen, in Chapter 2, that an NoC can implement either dynamic applications (the same network is used for various applications) or well-defined applications. This chapter focuses on the latter type of applications and takes into account bandwidth, surface and power consumption constraints. As already seen in Chapter 2, our technique uses the knowledge (by calculations or simulations) of the instants at which data transfers take place. This essential information is the main originality of our technique, since it makes it possible to know, in advance, which data transfers are made in parallel and which are sequential or exclusive. This is important for the better use and optimization of the hardware resources (sequential or exclusive transfers can share the same interconnections), while anticipating the routing. Indeed, during the execution of the application, the interconnections carrying a data transfer are already known.
A simple configuration (at 0 or 1) of the gates of pass transistors (interface between the IPs and the interconnection network) enables the data packets to be directed towards proper interconnections, avoiding the risk of collision with other simultaneously transmitted packets. This characteristic of our methodology leaves a very large part of the time budget allocated to the application itself intact, which qualifies it as a candidate for the design of real-time systems. We have also developed a second version involving the optimization of power consumption while meeting the time and surface constraints. The two versions of our tool were first developed for 2D architectures, and then extended to 3D architectures. The general flowchart of our methodology is shown in Figure 3.24.
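The idea that sequential transfers can share an interconnection while simultaneous ones cannot may be illustrated with classical interval partitioning. The Python sketch below is not the author's N0 computation: it assumes each transfer is known as a half-open time window [start, end), ignores exclusive (mutually conditional) transfers, and the transfer names and time units are invented for the example.

```python
import heapq
from typing import Dict, Tuple


def assign_interconnections(transfers: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Map each transfer to an interconnection index so that transfers whose
    time windows overlap never share the same interconnection."""
    assignment = {}
    busy = []       # min-heap of (end_time, link) for links currently in use
    free = []       # min-heap of link indices that have been released
    next_link = 0
    # Process transfers in order of start time (then end time).
    for name, (start, end) in sorted(transfers.items(), key=lambda kv: kv[1]):
        # Release every link whose transfer has finished by `start`.
        while busy and busy[0][0] <= start:
            _, link = heapq.heappop(busy)
            heapq.heappush(free, link)
        if free:
            link = heapq.heappop(free)   # reuse: this transfer is sequential to an earlier one
        else:
            link = next_link             # all links are busy: open a new one
            next_link += 1
        assignment[name] = link
        heapq.heappush(busy, (end, link))
    return assignment
```

For the windows t1 = (0, 4), t2 = (0, 3), t3 = (4, 6) and t4 = (3, 5), two interconnections suffice: t1 and t2 run in parallel on different links, while t4 reuses the link of t2 and t3 reuses the link of t1. Because the assignment is computed before execution, it could then be stored as the fixed gate configuration described above.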

Figure 3.24. Our design flow of a distributed integrated system implemented on a network-on-chip-based architecture

The implementation of a complex integrated system by only one operational part and only one control part is not at all recommended. This is due to the density generated by such an implementation, which in turn generates many parasitic parameters that are harmful to the characteristics of the system. Therefore, based on a task graph describing the concerned application, we opt for a functional decomposition of the system into a set of subsystems (for further details see section 2.3.1). For each subsystem SSi obtained, the following synthesis and analysis tasks are conducted to implement an NoC-based architecture that is specific to the considered subsystem (all the components will then constitute the distributed system).

– An assignment of tasks to processors and ASICs of the considered subsystem (section 3.2.1.1) is conducted (taking into account the specifications: performance and power consumption) jointly with the application of the DVS technique (section 3.2.2.3).
– Based on the estimation of the time needed for executing each type of operation (or during simulation), the instants of the various data transfers are determined. This information enables the efficient use of the hardware resources, since it is known, before the execution of the application itself, which transfers are exclusive, sequential or simultaneous. Moreover, this enables the replacement of the routing algorithm (which may consume a large part of the time budget allocated to the application) by a simple configuration (to 1 or 0) of the pass transistors' gates (the interface between the IPs and the network), in order to direct a data packet to the proper interconnection, of a certain size in bits, without any conflict with the other simultaneously transmitted packets.
– Considering performance as the optimization parameter of the NoC (some modifications are made to optimize the power consumption instead), the initial number of interconnections N0 is then determined (N0 enables the maximization of the number of data transfers that are possible simultaneously, with no conflict). Moreover, N0 enables the better identification of the space of solutions, whose size is very significant (the number of interconnections, their sizes, the arrangement of IPs).
– With N0 interconnections, various instances of configurations of architectures can be analyzed in terms of surface, performance and power consumption (variation of the sizes of interconnections, arrangement of the IPs, previously ordered depending on their degrees of communication, on 1 line, 2 lines, etc.).
– If one of the constraints is not met, the value of N0 is reduced by 1 and the process continues (see Figure 3.24). It is worth noting that, as in our other heuristics, neither the solution initialization nor the shift from one solution to another is conducted in a fortuitous, random manner, but in a well-argued way. Here, the initialization of N0 focuses on the maximization of the number of data transfers that may occur simultaneously and without conflict, with the lowest possible number of interconnections. When the solution obtained is not feasible (a constraint is broken), N0 is reduced by only one interconnection, to avoid altering the quality of the solution too much.
– When all the constraints are met, a file with a detailed description of the network is generated (the various transistors, their dimensions, their nodes: the connectivity of their gates with the signals emitted by the control part managing the NoCi of the subsystem SSi, and the connectivity of their source and drain nodes with the ports of the IPs and the interconnections).
– Since the application to be implemented is known in advance, we elaborate the equivalent of the routing algorithm during the design. A finite state machine is generated for each NoCi in order to configure the gates of the pass transistors: during the execution of the application, bits at 1 or 0 applied on the gates of the transistors enable the connection (or disconnection) of the ports of the various IPs to (or, respectively, from) the network. Consequently, this enables or forbids, from a given port, sending information on the network (see Figures 2.75 and 2.76). Since this configuration is pre-established (during the design), data packets are directed to the proper interconnections (to avoid any conflict) nearly instantaneously, which consumes only a tiny part of the time budget allocated to the application itself.
– Since the frequencies of the various IPs and that of the NoC are not necessarily the same, a system malfunction due to metastability is unavoidable unless an interface that assures data transfer synchronization is designed (we have partially accomplished this task).
– Finally, the integration of all the generated components constitutes the NoCi dedicated to the subsystem SSi.

We present hereafter the results obtained by our tool for the NoC-aided design. This tool is executed with one of two options, –v1 and –v2. The option –v1 (–v2) is chosen when the user wants a high-performance network (or, respectively, a network consuming the least possible power) while meeting the two other constraints related to surface and power consumption (or, respectively, performance). The first three columns of Table 3.21 indicate the constraints fixed by the user (bandwidth, surface, power consumption). Then, for each of the two versions V1 and V2, the following values are provided:

– the number of interconnections and their size in bits;
– the surface covered by the network of interconnections;
– the maximal and average power consumption;
– the frequency of the network.

It is worth noting that:

– for certain test cases, there is no solution because of too strong constraints, which have been poorly estimated (for the 1st test case, the surface constraint, which was underestimated, is broken);
– for each test case, option –v1 yields a higher frequency (better performance) than option –v2. The bandwidth is obtained by multiplying the network frequency by the size of interconnections;
– option –v2 gives lower values of the maximal and average power consumption compared to those obtained with option –v1;
– the surface obtained with –v2 is clearly above that obtained with –v1. This is because when priority is given to power consumption, low frequencies of the network are targeted during the exploration of the solution space. But, in order to meet the constraint related to the bandwidth, an appropriate size of interconnections must be determined (Bandwidth = Frequency * Size of interconnections). This size, which is slightly larger than with option –v1, generates a larger surface of the network of interconnections;
– due to effective exploration of the solution space, the execution time does not exceed 1 s for reaching these partial results (the execution times of all the tasks indicated in Figure 3.24 have not been counted here).
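The relation Bandwidth = Frequency * Size of interconnections lends itself to a one-line check. The Python sketch below (the function name and the unit conversion are ours, not part of the tool) computes the network frequency needed to reach a bandwidth constraint for a given interconnection size.

```python
def network_frequency_mhz(bandwidth_gbps: float, size_bits: int) -> float:
    """Frequency (MHz) at which `size_bits` parallel lines deliver `bandwidth_gbps`.

    Bandwidth = Frequency * Size, with 1 Gb/s = 1000 Mb/s.
    """
    return bandwidth_gbps * 1000.0 / size_bits
```

For a 24 Gb/s constraint, 16-bit interconnections call for 1,500 MHz whereas 48-bit interconnections need only 500 MHz. This is the trade-off described above: targeting a lower frequency to save power requires wider interconnections, and therefore a larger surface.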

Table 3.21. Characteristics of the resulting network-on-chip

Columns: user constraints F_Bandwidth (Gb/s), F_Area (mm2) and F_Power (mW); then, for each of the two versions V1 and V2, # Int.; size (b), Area (mm2), Maximal power (mW), Average power (mW) and Freq. (MHz); finally, CPU time (s)*.

# Int.; size (b)      Area (mm2)        Freq. (MHz)
V1        V2          V1      V2        V1         V2
4 ; 36    7 ; 64      9.45    12.89     1388.8     781.2
4 ; 16    4 ; 24      8.63    8.96      1,500      1,000
4 ; 32    4 ; 36      9.29    9.45      750        666.6
4 ; 24    4 ; 48      8.96    9.96      1,000      500
4 ; 32    4 ; 48      9.29    9.96      750        500
4 ; 32    7 ; 64      9.29    12.89     750        375
4 ; 40    4 ; 48      9.62    9.96      600        500
4 ; 40    4 ; 64      9.62    10.66     600        375
4 ; 52    4 ; 60      10.13   10.48     461.54     400

For three test cases: No solution: some constraint(s) is (are) too hard

* The CPU time does not include the time exhausted by all the tasks of the NoC design


At the end of section 2.8.2, we provided a number of indications on the placement of IPs on a 2D or 3D architecture. Recall that the computational complexity of this problem is O(N!), with N being the number of IPs. This problem is therefore intractable and requires an adequate heuristic as soon as N is substantial. The strongly intercommunicating IPs are placed side by side in order to optimize the data transfer time between them. In the first step, such a placement is made on one line only; then, during the exploration of the solution space, this line can be replaced by two or more lines, meandering the IPs (to preserve the neighborhood order) until the configuration that best addresses the fixed specifications is reached. However, as this problem is combinatorial, IPs that should be immediate neighbors may end up distant from each other, because the space between them is occupied by other IPs that strongly communicate with those in the same neighborhood. 3D architectures improve such placements, as the interconnections can be made horizontally and vertically, via TSVs (Through-Silicon Vias) or MIVs (Monolithic Inter-tier Vias), which are contacts between two successive layers (see Figure 2.80).

Using a Turing transformation, we have also solved this problem by exploiting our graph coloring technique. The method associates a node of the graph with each IP and connects a pair of nodes if the corresponding IPs communicate weakly or not at all. The coloring algorithm is then executed recursively until each color colors at most two nodes. The IPs whose corresponding nodes have received the same color are then placed side by side. Further details are given by means of the example shown in Figure 3.25. Let us suppose that coloring a graph with 10 nodes has given the following result: C1: 4 2; C2: 5 3 10 9; C3: 8 7 6; C4: 1.


C1: 4 2; C2: 5 3 10 9; C3: 8 7 6; C4: 1
Sub-graph of C2 (IP5, IP3, IP10, IP9): C21: 5 10; C22: 3 9
Sub-graph of C3 (IP8, IP7, IP6): C31: 7; C32: 6 8

Figure 3.25. Example of recursive graph coloring

Each time a color Ci colors more than two nodes, a sub-graph composed of the nodes colored by Ci is formed. At the previous stage, these nodes were not connected because their corresponding IPs have similar degrees of communication. At the current stage, the average of the degrees of communication (calculated on the basis of the data packets transmitted between the IPs whose corresponding nodes have received the color Ci) is calculated, and any pair of nodes whose corresponding IPs have a degree of communication below this average is connected. This coloring process is repeated recursively until each color colors at most two nodes. The example presented in Figure 3.25 gives the following placement: 4 2 5 10 3 9 7 6 8 1. Note that the placement 4 2 5 10 6 8 7 3 9 1, for example, is not good because, at a certain stage, nodes 5, 10, 3 and 9 received the same color. They should therefore be placed in the same neighborhood, at the level of granularity obtained during recursive coloring, which means that the pair of nodes 3 and 9 should be next to the pair of nodes 5 and 10 (as 3, 9, 5 and 10 were previously colored by C2). On the other hand, the placement 5 10 3 9 4 2 7 6 8 1 can be retained, since only a permutation of the colors C1 and C2 has taken place. Figure 3.26 presents the placement algorithm, illustrated by an example.


Placement (G_Com)   // G_Com is the communications graph
{
  From the communications graph G_Com, determine the graph G_col to color, by connecting 2 nodes if the corresponding IPs are weakly communicating (e.g. the degree of communication is less than the average of the communication degrees);
  Color the graph G_col;
  End = 0;
  While (End = 0) Do
  {
    End = 1;
    For each obtained color C Do
      If C colors at most 2 nodes
      Then write this (these) node(s) in a file
      Else
      {
        End = 0;
        Generate the new communications graph G_Com of the IPs that correspond to the nodes colored by C;
        Placement (G_Com);   // Recursive function
      }
      Endif
    Done
  }
  Done
  Return;
}

Figure 3.26. Algorithm of IP placement
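The recursive scheme of Figure 3.26 can be sketched in Python as follows. This is only an illustration under stated assumptions: a plain greedy coloring stands in for the author's coloring technique, the communication volumes are given as a hypothetical dictionary `comm`, the order is returned as a list instead of being written to a file, and a guard (not present in the figure) stops the recursion when all pairs communicate equally.

```python
from itertools import combinations


def greedy_coloring(nodes, edges):
    """Group nodes into color classes so that adjacent nodes never share a class.

    Plain greedy coloring, used here only as a stand-in for the
    coloring technique referred to in the text.
    """
    adjacent = {v: set() for v in nodes}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    color = {}
    # Color high-degree nodes first; ties are broken by node label.
    for v in sorted(nodes, key=lambda x: (-len(adjacent[x]), x)):
        used = {color[u] for u in adjacent[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    classes = {}
    for v in nodes:
        classes.setdefault(color[v], []).append(v)
    return [classes[c] for c in sorted(classes)]


def placement(ips, comm):
    """Return a linear order in which strongly communicating IPs are neighbors.

    comm maps frozenset({a, b}) to the communication volume between a and b
    (missing pairs count as 0).
    """
    ips = list(ips)
    if len(ips) <= 2:
        return ips
    pairs = list(combinations(ips, 2))
    volume = {p: comm.get(frozenset(p), 0) for p in pairs}
    average = sum(volume.values()) / len(pairs)
    # Connect weakly communicating IPs so that they receive different colors.
    edges = [p for p in pairs if volume[p] < average]
    classes = greedy_coloring(ips, edges)
    if len(classes) == 1:
        # Assumption: all pairs communicate equally, so no further split is possible.
        return ips
    order = []
    for same_color in classes:
        order.extend(placement(same_color, comm))  # recursive refinement
    return order


# Hypothetical example: IP1 talks to IP3, IP2 talks to IP4.
comm = {frozenset((1, 3)): 10, frozenset((2, 4)): 10}
print(placement([1, 2, 3, 4], comm))  # [1, 3, 2, 4]
```

Each recursive call works on a strictly smaller set of IPs, so the sketch always terminates; the quality of the resulting order depends on the coloring routine that is plugged in.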


When the order of IPs is obtained, the placement on a 3D architecture is done, preserving the same order, by arranging the IPs on successive layers in a meandered manner: the IP placed at the last column of a line of a certain layer has its successor (in the resulting order) placed at the same column, on the following line of the same layer. If the layer has no following line, this successor is placed on the next layer, still at the same column (Figure 2.80).

The results obtained by the placement of IPs on a 3D architecture are given in Table 3.22. The first two columns give the numbers of IPs and traces (each trace is a data flow graph, DFG), followed by the constraints related to bandwidth, surface and power consumption. The results obtained in terms of bandwidth, surface and power consumption are given next. Finally, T_CPUPl is the execution time for the placement of IPs on a 3D architecture, while T_CPUG is the time consumed by the other tasks of the automated NoC generation. Note that the surface of each IP is estimated at 1 mm2 (1 mm * 1 mm), and that the power consumption only refers to that due to data transfers. These results indicate that the placement is obtained in less than 1 s for instances of various sizes, while meeting the various constraints (the ps -elf command of the Linux operating system, executed as the last instruction in our programs, reports only integer values of seconds, minutes and hours; therefore, 0 s signifies a CPU time below 1 s).

Finally, it is worth noting that MIVs feature better characteristics than TSVs: 70*140 nm versus 3,000*30,000 nm in terms of dimensions, and they induce a very low capacitance, below 0.1 fF. Consequently, only MIVs are used as contacts between the layers of 3D architectures, because they induce a short information transfer delay and a low power consumption between two successive layers (approximately 3 ps and 0.2 fW respectively).

The surface required by an MIV is certainly small (9,800 nm2), but a potentially large number of data transfers between two successive layers would require a similar number of MIVs (assuming that one MIV is used for each data transfer) and would therefore substantially increase the network surface. This led us to improve our technique for IP placement on a 3D architecture (Toubaline et al., 2017b) by sharing the same MIV among a subset of data transfers (each MIV is placed at the “gravity center” of its subset).
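The meandered arrangement of an IP order on successive layers can be sketched as below. The function name, the grid dimensions (2 lines by 3 columns per layer) and the choice of traversing odd layers backwards are assumptions made for illustration; the only property taken from the text is that consecutive IPs stay physically adjacent, within a layer and between two successive layers.

```python
def meander_3d(order, rows, cols):
    """Map each IP of `order` to (layer, row, col) along a serpentine path."""
    per_layer = rows * cols
    position = {}
    for k, ip in enumerate(order):
        layer, j = divmod(k, per_layer)
        if layer % 2 == 1:
            # Odd layers follow the same path backwards, so that the first IP
            # of a layer sits exactly above the last IP of the previous layer.
            j = per_layer - 1 - j
        row, offset = divmod(j, cols)
        # Even rows run left to right, odd rows right to left.
        col = offset if row % 2 == 0 else cols - 1 - offset
        position[ip] = (layer, row, col)
    return position


order = [4, 2, 5, 10, 3, 9, 7, 6, 8, 1]  # order obtained in the example of Figure 3.25
pos = meander_3d(order, rows=2, cols=3)
for ip in order:
    print(ip, pos[ip])
```

With these dimensions, IP5 ends the first line at column 2 and its successor IP10 starts the second line at the same column, while IP7, the first IP of the second layer, is placed directly above IP9, the last IP of the first layer.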

                 User constraints                        Obtained results
#IPs  #Traces    Bandwidth  Area   Pwr dissip.(a)        Bandwidth  Area   Pwr dissip.(a)  T_CPUPl  T_CPUG
                 (Gb/s)     (cm2)  (µW)                  (Gb/s)     (cm2)  (µW)            (s)      (s)
10    200        40         3.0    0.40                  40         0.20   0.02            0        0
20    400        40         3.5    0.50                  40         0.51   0.11            0        0
30    900        40         3.0    0.50                  40         1.05   0.17            0        1
40    400        40         3.0    0.55                  40         1.49   0.10            0        0
50    800        40         3.0    0.60                  40         2.47   0.19            0        1
60    720        40         3.5    0.70                  40         2.08   0.32            0        1
70    700        40         3.5    0.70                  40         2.57   0.42            0        2
80    800        40         3.5    0.70                  40         2.66   0.61            0        3
90    900        40         3.5    0.70                  40         1.52   0.68            0        5
100   800        40         3.5    0.70                  40         1.73   0.68            0        4

(a) The switching power dissipation only concerns the one which is due to the data transfers between the IPs.

Table 3.22. Placement of IPs on an architecture

3.7. Conclusion

As this book deals with the CAD of integrated circuits and systems using both computer science and design concepts, we thought it appropriate to dedicate a chapter to each of these two aspects. Chapter 1 is essentially a review of notions related to language classification, the determination of the computational complexity of a problem and the definition of intractable problems such as NP-complete, NP-hard and NP-intermediate problems. Indications on heuristics and metaheuristics have also been given to show, at the very beginning of the book, that certain problems (the intractable ones) require a specific resolution in order to address a double challenge: developing a tractable algorithm while aiming at a certain quality of solution. This is worth recalling, considering that nearly all the problems approached in this chapter are intractable and therefore cannot be solved in the conventional manner. As the objective is to develop an algorithm to aid the design of a given integrated circuit or system, it is essential to have good knowledge of the problem at hand for better optimization, analysis, etc. Hence, in Chapter 2, we have presented notions on the design of integrated circuits and systems in direct relation with the applications presented in Chapter 3.


Finally, problem-solving techniques based on heuristics or metaheuristics (developed in our research work at the Center for Development of Advanced Technologies, Algiers) have been presented and can be used as a guide by students or young researchers to approach the resolution of other problems. Compared with many academic works and known metaheuristics (such as tabu search, simulated annealing, ant colonies, genetic and evolutionary algorithms), our techniques avoid the random aspect, both for the initialization of the solution and for passing from one solution to another (in our personal approach, randomness is only used for the generation of test cases, for example). Indeed, random navigation in a very large solution space would have negative consequences for the quality of the solution:
– generation of non-feasible solutions, and therefore useless CPU time consumption (e.g. the solution BEADECB is not feasible for the traveling salesman problem, because the path passes twice through town E);
– regeneration of already explored solutions, which also involves useless CPU time consumption;
– even worse, the process of solution generation can become stationary, meaning that the same solutions are generated at each period (in this case, even though the CPU time allocated is significant, the quality of the solution remains exactly that obtained in the first period).
Since such problems can occur even with the use of known metaheuristics (such as tabu search, simulated annealing, ant colonies, genetic and evolutionary algorithms, which are often used with unjustified crossovers and mutations), the best approach is to analyze the considered problem correctly, in order to initialize the solution properly and to shift better from one solution to another.

Among the heuristics we have presented, an example is the one for the aided design of a low-power integrated circuit:
– Rather than randomly generating the initial solution, this solution is defined by taking into account all the parameters that affect power consumption. If the surface and delay constraints are met, the best solution is obtained at initialization, and the code execution stops!
– When a constraint is not met, the next solution is generated from the previous (high-quality but not feasible) solution by fine-tuning a few parameters only, to avoid altering the quality of the solution.
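The notion of a non-feasible solution mentioned above for the traveling salesman problem can be made concrete with a short Python check. The set of towns A to E is an assumption inferred from the example route BEADECB, and the function name is hypothetical.

```python
def is_feasible_tour(tour, towns):
    """True if `tour` starts and ends in the same town and visits every town exactly once in between."""
    if len(tour) < 2 or tour[0] != tour[-1]:
        return False
    visited = tour[:-1]  # drop the final return to the starting town
    return len(visited) == len(towns) and set(visited) == set(towns)


towns = "ABCDE"  # assumed set of towns
print(is_feasible_tour("BEADECB", towns))  # False: town E is visited twice
print(is_feasible_tour("BEADCB", towns))   # True
```

A generator that only produces permutations of the towns never needs such a check, which is one way of avoiding the CPU time wasted on non-feasible candidates.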


The size of the solution space is often very large. Once again, rather than navigating randomly (with unnecessary consumption of CPU time and no improvement in the quality of the solution), a proper analysis of the problem at hand is necessary in order to narrow down, based on observations, the space of solutions to a smaller space which still includes the best solutions. As an example, let us mention the problem of the estimation of the maximal and average power consumption, whose computational complexity is $O(2^{2N})$, with N being the number of input signals of the circuit. We have proceeded as follows:
– Noting that the pairs of identical vectors (V0, V0) do not generate any power consumption, such pairs are discarded in advance, during the exploration of the solution space. The number of such pairs is $X = 2^N$, which is significant.
– Noting that (V0, Vt) involves the same power consumption as (Vt, V0), we discard one of the two pairs if the other has been considered during the processing. This observation makes it possible to discard $Y = (2^{2N} - 2^N)/2 = 2^{2N-1} - 2^{N-1}$ elements, which is also significant.
– Therefore, (X+Y) elements are eliminated, obviously without altering the quality of the solution, and the “shift” from one solution to another is made in a deterministic manner within the classes C1, C2, ..., CN (where the class Ci is composed of the pairs of vectors that differ in exactly i bits).
– The CPU time allocated is, therefore, clearly used for exploration in a space of reduced size. It can be easily verified that $(X+Y) > (2^{2N} - X - Y)$, since $X + Y = 2^{2N-1} + 2^{N-1}$ while $2^{2N} - X - Y = 2^{2N-1} - 2^{N-1}$. The two observations have thus eliminated more elements than remain, which increases the probability of converging to a high-quality solution.
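The two observations above can be checked on a small instance with the following Python sketch. It is an illustration only: input vectors are represented as the integers 0 to 2^N - 1 (an assumption of this sketch), and the function name is hypothetical.

```python
from itertools import combinations


def reduced_pair_space(n):
    """Enumerate the vector pairs that remain after the two observations.

    Pairs of identical vectors are skipped and only one of (V0, Vt) and
    (Vt, V0) is kept. The remaining pairs are grouped by the number i of
    differing bits, which gives the classes C1 .. Cn.
    """
    classes = {i: [] for i in range(1, n + 1)}
    for v0, vt in combinations(range(2 ** n), 2):
        differing_bits = bin(v0 ^ vt).count("1")
        classes[differing_bits].append((v0, vt))
    return classes


n = 3
classes = reduced_pair_space(n)
remaining = sum(len(pairs) for pairs in classes.values())
x = 2 ** n                        # pairs of identical vectors
y = (2 ** (2 * n) - 2 ** n) // 2  # mirrored pairs
print(x, y, remaining)            # 8 36-8=28 discarded mirrors, 28 pairs remain
print([len(classes[i]) for i in range(1, n + 1)])
```

For N = 3, the 64 pairs reduce to 28, distributed over C1, C2 and C3, and the number of discarded pairs (36) indeed exceeds the number of remaining ones.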
Based on these considerations, it is our personal conviction that the quality of the solution does not depend on the use of one or the other metaheuristic, but rather on the manner in which the initialization, the shift from one solution to another (moving away from a local optimum, avoiding the generation of a non-feasible solution) and the exploration (effectiveness, determinism) of the solution space are conducted. Our arguments for this are the following:


– If one of the metaheuristics is the most effective, why does the scientific community in the field continue to use such an abundance of metaheuristics?
– Why do works using the same metaheuristic to deal with the same problem produce such different results for the same benchmarks?
Finally, the “relentlessness” with which the scientific community in the field seeks to improve the CPU time for tractable problems (for example, for the problem of sorting N numbers, the computational complexity has been reduced from $O(N^2)$ for bubble sort to $O(N \log_2 N)$, on average, for quicksort) could also be applied to intractable problems, while improving the quality of the solution. In this sense, we have discussed an intractable problem above, the estimation of power consumption, where the size of the space of solutions has been reduced from $2^{2N}$ to $2^{2N} - (2^N + 2^{2N-1} - 2^{N-1})$, while being certain that the quality of the solution has not been altered; on the contrary, the argued reduction of the dimension of the solution space increases the probability of converging to a high-quality solution. Is it possible that a thorough consideration of such reflections could one day lead to generating the exact solution of an NP-complete problem (some NP-hard problems, as classified until now, see Figure 1.6, also belong to NP) with a polynomial deterministic algorithm? If such were the case, we would proceed to another classification of the problems and would certainly claim that P is equal to NP! It is worth recalling that P ≠ NP is just a conjecture, and that its confirmation or invalidation is one of the six problems still unsolved to this day (see http://www.claymath.org).

References

Alpert, C.J., Devgan, A., and Quay, S.T. (1999). Buffer insertion with accurate gate and interconnect delay computation. 36th ACM/IEEE (DAC’99), 479–484.
Belkouche, H., Louiz, F., and Mahdoum, A. (2000). FEWERCOLORS: A new technique for solving the graph coloring problem. 15th IEEE/DCIS (Design of Circuits and Integrated Systems), Montpellier, France, 21–24 November.
Bergamaschi, R.A., Raje, S., Nair, I., and Trevillyan, L. (1997). Control-flow versus data-flow-based scheduling: Combining both approaches in an adaptive scheduling system. IEEE Transactions on VLSI, 5, 82–100.
Berrandjia, M.L. and Mahdoum, A. (2008a). Optimisation de circuits par les graphes colorés. COSI (Colloque sur l’Optimisation et les Systèmes d’Information), Tizi Ouzou, Algeria, June 8–10.
Berrandjia, M.L. and Mahdoum, A. (2008b). Un outil de génération automatique de dessins de masques de circuits digitaux basé sur la technique River PLAs. Knowledge Based Industries & Nanotechnology Conference, 84–85, Doha, Qatar, February 11–12.
Berrandjia, M.L., Mahdoum, A., Benmadache, R., and Chenouf, A. (2008). River_PLAs2 : un outil de génération automatique de dessins de masques basé sur une variante de la technique River PLAs. CIEEAM (Conférence Internationale sur l’Electrotechnique, l’Electronique, l’Automatique et la Maintenance), Oran, Algeria, December 16–17.
Cortes, L.A., Eles, P., and Peng, Z. (2005). Quasi-static assignment of voltages and optional cycles for maximizing rewards in real-time systems with energy constraints. Proceedings of the Design Automation Conference (IEEE/DAC 2005), 889–894.

CAD of Circuits and Integrated Systems, First Edition. Ali Mahdoum. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.


Djamah, H., Moussa, H.H., and Mahdoum, A. (1997). CMORGEN: A CMOS regular structure generator. Synthèse, Journal of University Badji Mokhtar of Annaba, 6, 207–218.
Gajski, D.D. (1992). High Level Synthesis. Springer, Berlin, Germany.
Garey, M.R. and Johnson, D.S. (1979). Computers and Intractability: A Guide to the Theory of NP-completeness. Freeman, San Francisco, USA.
Gauthier, L. (2003). Conception des logiciels embarqués pour les systèmes monopuces. Hermes-Lavoisier, Paris.
Hertz, A. and Widmer, M. (1995). La méthode tabou appliquée aux problèmes d’ordonnancement. Revue française d’automatique, d’informatique et de recherche opérationnelle – automatique, productique et informatique industrielle, 29(4–5), 353–378.
Jerraya, A.A. (2002). Conception de haut niveau des systèmes monopuces. Hermes-Lavoisier, Paris.
Kim, J. and Kim, T. (2005). Memory access optimization through combined code scheduling, memory allocation, and array binding in embedded system design. Proceedings of the Design Automation Conference (IEEE/DAC 2005), 13–17 June.
Lee, J.Y., Park, M.J., Min, B.H., Kim, S., Park, M.Y., and Yu, H.K. (2012). A 4-GHz all digital PLL with low-power TDC and phase-error compensation. IEEE Transactions on Circuits and Systems, 59(N8), 1706–1719.
Mahdoum, A. (1997). SPOT: An estimation of the switching power dissipation for CMOS circuits and data paths tool. SASIMI’97, Osaka, Japan, December 1–2.
Mahdoum, A. (1999). SPOT : un outil à base d’un algorithme génétique pour estimer la consommation maximale de la puissance dynamique des circuits CMOS. CSCA’99 (Conference on Soft Computing and their Applications), 94–103, Hôtel Sheraton, Algiers, Algeria, November 8–9.
Mahdoum, A. (2000). SAFT: An efficient state assignment for finite state machine tool. Information, Journal of the University of Hosei, 3(3), 379–395, Tokyo, Japan, July.
Mahdoum, A. (2002a). Démonstrations sur l’outil SAFT. Forum de la Conférence Design, Automation and Test in Europe (IEEE/DATE). Palais des Congrès, Paris, France, March 4–8 (Abstract published in Designer’s Forum Proceedings Date’02, Paris, France), 261.


Mahdoum, A. (2002b). Démonstrations sur l’outil SPOT. Forum de la Conférence Design, Automation and Test in Europe (IEEE/DATE). Palais des Congrès, Paris, France, March 4–8 (Abstract published in Designer’s Forum Proceedings Date’02, Paris, France), 260.
Mahdoum, A. (2006). Book review: Representations for genetic and evolutionary algorithms written by Franz Rothlauf and edited by Springer Verlag Editions. The Computer Journal, Oxford Journals, 49(5), September.
Mahdoum, A. (2007a). Cours algorithmique avancée [Lesson]. University Saad Dahlab of Blida, Algeria.
Mahdoum, A. (2007b). Cours CAO de circuits et de systèmes intégrés [Lesson]. University Saad Dahlab of Blida, Algeria.
Mahdoum, A. (2009). Synthèse de systèmes monopuce à faible consommation d’énergie. FTFC’09 (Faible Tension, Faible Consommation), Centre Suisse d’Electronique et de Microtechnique, Neuchatel, Switzerland, June 3–5.
Mahdoum, A. (2012a). A new design methodology of networks on chip. IEEE/ASQED (Asian Symposium on Quality Electronic Design), 1–8, Penang, Malaysia, July 10–12.
Mahdoum, A. (2012b). Combined heuristics for synthesis of SoCs with time and energy constraints. Computers & Electrical Engineering, 38(6), 1687–1702, July.
Mahdoum, A. (2013). Architectural synthesis of networks on chip. 8th IEEE/ICIEA (International Conference on Industrial Electronics and its Applications), 1889–1894, Melbourne, Australia, June 19–21.
Mahdoum, A. (2014a). A new design methodology of networks-on-chip. JSS’14 (Journées sur les Signaux et Systèmes), Guelma, Algeria, 19–20 November.
Mahdoum, A. (2014b). Networks on chip design for real-time systems. 9th IEEE/ICIEA (International Conference on Industrial Electronics and its Applications), 1–6, Hangzhou, China, June 9–11.
Mahdoum, A. (2014c). Networks on chip design for real-time systems. 27th IEEE/SOCC (International System On Chip Conference), 165–170, Las Vegas, USA, September 2–5.
Mahdoum, A. (2015). An efficient network-on-chip design flow. 16th IEEE/ICCT (International Conference on Communication Technology), 1–8, Hangzhou, China, October 18–20.


Mahdoum, A. and Berrandjia, M.L. (2007). FREEZER2 : un outil à base d’un algorithme génétique pour une aide à la conception de circuits digitaux à faible consommation de puissance. IEEE/FTFC’07 (Faible Tension, Faible Consommation), 143–148, Paris, France, May 21–23.
Mahdoum, A. and Souna, B. (2016). An efficient IPs placement on 3D architectures. 31st IEEE/DCIS (Design of Circuits and Integrated Systems), 1–6, Granada, Spain, November 23–25.
Mahdoum, A., Louiz, F., and Belkouche, H. (2000). FEWERCOLORS: A new technique for solving the graph coloring problem. 7th IEEE/ICECS (International Conference on Electronics, Circuits and Systems), Beirut, Lebanon, 17–20 December.
Mahdoum, A., Louiz, F., and Belkouche, H. (2002). Démonstrations sur l’outil Fewercolors. Forum de la Conférence Design, Automation and Test in Europe (IEEE/DATE). Palais des Congrès, Paris, France, 4–8 March (Abstract published in Designer’s Forum Proceedings Date’02, Paris, France), 262.
Mahdoum, A., Bessalah, H., and Badache, N. (2005a). A low-power scheduling tool for SOC designs. IEEE/IWSSIP’05, International Workshop on Systems, Signals and Image Processing, 367–372, Chalkida, Greece, September 22–24.
Mahdoum, A., Boutamine, A., Touahri, D., and Toubaline, N. (2005b). FREEZER1 : un outil d’aide à la conception de circuits digitaux à faible consommation de puissance. IEEE/FTFC’05 (Faible Tension, Faible Consommation), 65–70, Paris, France, May 18–20.
Mahdoum, A., Badache, N., and Bessalah, H. (2006). An efficient assignment of voltages and optional cycles for maximizing rewards in real-time systems with energy constraints. Journal of Low-Power Electronics (JOLPE), American Scientific Publisher, 2(2), 189–200, August.
Mahdoum, A., Badache, N., and Bessalah, H. (2007a). A low-power scheduling tool for system on chip designs. WSEAS Transactions on Circuits and Systems, 6(12), 608–624, December.
Mahdoum, A., Dahmri, O., and Zair, M. (2007b). A new memory access optimizer using array binding, memory allocation, and combined code in embedded system design. 25th IEEE/NORCHIP, Aalborg, Denmark, November 19–20.
Mahdoum, A., Benmadache, R., Berrandjia, M.L., and Chenouf, A. (2010a). An efficient low-power buffer insertion with time and area constraints. 14th ICC (International Conference on Circuits), Corfu, Greece, July 22–24.


Mahdoum, A., Benmadache, R., Chenouf, A., and Berrandjia, M.L. (2010b). A low-power synthesis of submicron interconnects with time and area constraints. International Journal of Circuits, Systems and Signal Processing, 4(3), 112–119, July–September.
Mahdoum, A., Hamimed, L., Louzri, M., and Messaoudi, M. (2010c). Data-coding methods for low-power aided design of interconnections. CECS’10 (International Workshop on Systems Communication and Engineering in Computer Science), Batna, Algeria, October 3–5.
Mahdoum, A., Hamimed, L., Louzri, M., and Messaoudi, M. (2011a). Data-coding methods for low-power aided design of interconnections. IEEE/WOSSPA’11 (International Workshop on Systems, Signal Processing and their Applications), Tipaza, Algeria, May 9–11.
Mahdoum, A., Hamimed, L., Louzri, M., and Messaoudi, M. (2011b). Data-coding methods for low-power aided design of submicron interconnects. IEEE/FTFC’11 (Faible Tension, Faible Consommation), Marrakech, Morocco, May 30–June 1.
Mo, F. and Brayton, R.K. (2002). River PLAs: A regular circuit structure. 39th IEEE/DAC (Design Automation Conference), New Orleans, USA, June 10–14.
Murali, S., Meloni, P., Angiolini, F., Atienza, D., Carta, S., Benini, L., De Micheli, G., and Raffo, L. (2006). Designing application-specific networks on chips with floorplan information. ICCAD, 355–362, November.
Panth, S., Samadi, K., Du, Y., and Lim, S.K. (2013). High-density integration of functional modules using monolithic 3D-IC technology. 18th ASP-DAC, 681–686.
Rao, R.R., Blaauw, D., Sylvester, D., Alpert, C.J., and Nassif, S. (2005). An efficient surface-based low-power buffer insertion algorithm. ISPD’05, 86–93.
Reuben, J., Kittur, H.M., and Mohd, S. (2014). A novel clock generation algorithm for system-on-chip based on least common multiple. Computers and Electrical Engineering, 40(7).
Rudell, R.L. (1985). Espresso-MV: Algorithms for multiple-valued logic minimization. Proceedings of the International Circuit Conference, Portland, USA, May.
Sakarovitch, M. (1983). Optimisation combinatoire [Lesson handout]. Institut Polytechnique de Grenoble, France, May.
Sasao, T. (1984). Input variable assignment and output phase optimization of PLA’s. IEEE Transactions on Computers, C-33(10), 879–894, October.


Sotiriadis, P.P. and Chandrakasan, A.P. (2003). Bus energy reduction by transition pattern coding using a detailed deep submicrometer bus model. IEEE Transactions on Circuits and Systems, 50(10), October.
Toubaline, N., Bennouar, D., and Mahdoum, A. (2017a). A classification and evaluation framework for NoC mapping strategies. Journal of Circuits, Systems and Computers, 26(N2), 1730001-1–1730001-44, May.
Toubaline, N., Bennouar, D., and Mahdoum, A. (2017b). Optimal number and placement of vertical links in 3D networks-on-chip. International Journal of Computer and Information Engineering, 11(3).
Toubaline, N., Mahdoum, A., and Bennouar, D. (2018). An efficient clock generation algorithm for system-on-chip based on least common multiple. Journal of Circuits, Systems and Computers, 27.
Vahid, F. and Gajski, D.D. (1992). Specification partitioning for system design. Proceedings of the Design Automation Conference (IEEE/DAC 1992), 219–224.
Vahid, F. and Gajski, D.D. (1995). Clustering for improved system-level functional partitioning. Proceedings of the 8th International Symposium on System Synthesis, 13–15 September.
Weste, N.H.E. and Eshraghian, K. (1993). Principles of CMOS VLSI Design: A Systems Perspective. Addison-Wesley, Boston, USA.

