The third edition of Computer Architecture and Organization features a comprehensive updating of the materialespecially
1,599 205 28MB
English Pages 604 [628]
Computer Architecture and Organization
in
McGRAWHILL INTERNATIONAL Computer Science
Seri
EDIT!
Computer Architecture and Organization
McGrawHill
Series in
Computer Science
SENIOR CONSULTING EDITOR C.L. Liu, University of Illinois at
UrbanaChampaign
CONSULTING EDITOR Allen B. Tucker, Bowdoin College Fundamentals of Computing and Programming
Computer Organization and Architecture Computers
in Society/Ethics
Systems and Languages Theoretical Foundations
Software Engineering and Database Artificial Intelligence
Networks, Parallel and Distributed Computing
Graphics and Visualization
The
MIT Electrical and Computer Science Series
McGrawHill Series Bell and Newell:
Computer
in
Computer Organization and Architecture
Structures: Readings
and Examples
Cavanagh: Digital Computer Arithmetic: Design and Implementation
Feldman and
Retter:
Computer Architecture and Logic Design
Gear: Computer Organization and Programming: With an Emphasis on Personal Computers
Hamacher, Vranesic, and Zaky: Computer Organization Hayes: Computer Architecture and Organization Hayes: Digital System Design and Microprocessors Horvath: Introduction to Microprocessors Using the
MC6809 or the MC68000
Hwang: Scalable Parallel and Cluster Computing: Architecture and Programming
Hwang and
Briggs:
Computer Architecture and Parallel Processing
Lawrence and Mauch: RealTime Microcomputer System Design Siweiorek, Bell and Newell: Computer Structures: Principles Stone: Introduction to
Stone and Siewiorek: Introduction
PDP11
Ward and
&
Examples
Computer Organization and Data Structures to
Computer Organization and Data
Edition Halstead: Computational Structures
Structures:
McGrawHill
Series in
Computer Engineering
SENIOR CONSULTING EDITORS Stephen W. Director, University of Michigan,
UrbanaChampaign
C.L. Liu, University of Illinois,
Bartee:
Ann Arbor
Computer Architecture and Logic Design
Bose, Liang: Neural Network Fundamentals with Graphs, Algorithms, and Applications
Chang and
De
Sze:
ULSI Technology
Micheli: Synthesis
Feldman and
Retter:
and Optimization of Digital
Circuits
Computer Architecture: A Designer's Text Based on a Generic RISC
Hamacher, Vranesic, and Zaky: Computer Organization Hayes: Computer Architecture and Organization Horvath: Introduction
to
Microprocessors Using the
Hwang: Advanced Computer Architecture:
MC6809 or the MC68000
Parallelism, Scalability, Programmability
Hwang: Scalable Parallel and Cluster Computing: Architecture and Programming
Kang and
Leblebici:
CMOS Digital Integrated Circuits: Analysis and Design
Kohavi: Switching and Finite Automata Theory
Krishna and Shin: RealTime Systems
LawrenceMauch: RealTime Microcomputer System Design: An Introduction Levine: Vision in
Navabi:
VHDL:
Man and Machine
Analysis and Modeling of Digital Systems
Peatman: Design with Microcontrollers
Peatman: Digital Hardware Design Rosen: Discrete Mathematics and
Its
Applications
Ross: Fuzzy Logic with Engineering Applications Sandige:
Modern
Sarrafzadeh and
Digital Design
Wong: An
Introduction to VLSI Physical Design
Schalkoff: Artificial Neural Networks Stadler: Analytical Robotics
Sze:
and Mechatronics
VLSI Technology
Taub: Digital Circuits and Microprocessors
Wear, Pinkert, Wear, and Lane: Computers:
An
Introduction to
Hardware and Software Design
Computer Architecture and Organization THIRD EDITION
John
P.
Hayes
University of Michigan
[i3l WCB
MiMcGrawHill
Boston
Burr Ridge, IL
Dubuque, IA Madison, WI Bogota Caracas Lisbon
Bangkok Mexico City
Milan
New
Delhi
Seoul
New York
San Francisco
London Madrid Singapore Sydney Taipei
St.
Toronto
Louis
WCB/McGrawHill A
McGrawHill Companies
Division of The
COMPUTER ARCHITECTURE AND ORGANIZATION International Editions 1998
Exclusive rights by McGrawHill
Book Co 
Singapore, for manufacture and export. This book
cannot be reexported from the country to which
Copyright
©
it is
consigned by McGrawHill.
1998 by The' McGrawHill Companies,
Inc. All rights reserved.
under the United States Copyright Act of 1976, no part of distributed in any
form or by any means, or stored
this publication
in a data
KKP
FC
20
9
Library of Congress CataloginginPublication Data Hayes, John
P.
Computer p.
(John Patrick) (date) architecture and organization
/
John
P.
Hayes.  3rd ed.
cm.  (Electrical and computer engineering)
Includes bibliographical references and index.
ISBN 0070273553 1.
Construction architecture.
and construction. electrical
I.
Title.
2. II.
Electronic digital computersDesign Series:
McGrawHill
1998
9745598
621.39'2dc21
www.mhhe.com
When
series in
and computer engineering.
QA76.9.A73H39
ordering this
Printed in Malaysia
title,
use
ISBN 0071159975
or
base or retrieval system, without the
prior written permission of the publisher.
234567890
Except as permitted
may be reproduced
ABOUT THE AUTHOR JOHN
P.
HAYES
department
at the
is
a professor in the electrical engineering and
computer science
University of Michigan, where he was the founding director of the
Advanced Computer Architecture Laboratory. He teaches and conducts research
in
computer architecture; computeraided design, verification, and testing; VLSI design; and faulttolerant systems. Dr. Hayes is the author of two patents, more than 150 technical papers, and five books, including Layout Minimization for
the areas of
CMOS
Cells (Kluwer, 1992, coauthored with R. L. Maziasz) and Introduction to
Digital Logic Design (AddisonWesley, 1993). journals, including the
IEEE
has served as editor of various
was technical program chairman of Computer Architecture Symposium, Toronto.
the Journal of Electronic Testing, and
International
He
Transactions on Parallel and Distributed Systems and the 1991
Hayes received his undergraduate degree from the National University of Ireand his M.S. and Ph.D. degrees in electrical engineering from the University of Illinois, UrbanaChampaign. Prior to joining the University of Michigan, he was a faculty member at the University of Southern California. Dr. Hayes has also held visiting positions at various academic and industrial organizations, including Stanford University, McGill University, Universite de Montreal, and LogicVision Inc. He is a fellow of the Institute of Electrical and Electronics Engineers and a member of the Association for Computing Machinery and Sigma Xi. Dr.
land, Dublin,
To
My Father
Patrick
J.
Hayes
(19101968) In
Memoriam
CONTENTS
Preface
xiii
Computing and Computers 1.1
1
The Nature of Computing 1.1.1
1
The Elements of Computers /
1 1 .2 Limitations .
of Computers 1.2
1.3
The Evolution Of Computers
12
7.2.7
The Mechanical Era / 1.2.2 Electronic Computers /
1.2.3
The Later Generations
The VLSI Era
35
1.3.1 Integrated Circuits / 1.3.2
Processor Architecture /
1.3.3 System Architecture
1.4
Summary
56
1.5
Problems
57
1.6
References
62
Des ign Methodology
64
2.1
2.2
System Design
64
2.7.7
System Representation / 2.1.2 Design Process /
2.1.3
The Gate Level
The Register Level 2.2.7 RegisterLevel
Components /
83 2.2.2
Programmable
Logic Devices / 2.2.3 RegisterLevel Design 2.3
The Processor Level 2.3.1
ProcessorLevel Components /
114 2.3.2 ProcessorLevel
Design
Summary
126
2.5
Problems
127
2.6
References
136
2.4
Processor Basics 3.1
CPU i.7.7
Organization Fundamentals / 3.1.2 Additional Features
137 137
x
3.2
Data Representation 3.2.1 Basic
Contents
Formats / 3.2.2 FixedPoint Numbers /
3.2.3 FloatingPoint
3.3
Numbers
Instruction Sets 3.3.1 Instruction
3.3.3
4
160
178
t
Formats / 3.3.2 Instruction Types /
Programming Considerations
3.4
Summary
211
3.5
Problems
212
3.6
References
221
Datapath Design 4.1
223
FixedPoint Arithmetic 4.1.1 Addition
223
and Subtraction /
4.1.2 Multiplication /
4.1.3 Division
4.2
ArithmeticLogic Units 4.2.1 Combinational
4.3
252
ALUs
/ 4.2.2 Sequential ALUs
Advanced Topics
266
4.3.1 FloatingPoint Arithmetic / 4.3.2 Pipeline Processing
5
4.4
Summary
292
4.5
Problems
293
4.6
References
301
Control Design
303
5.1
Basic Concepts 5.7.7 Introduction / 5.1.2
5.1.3 Design
5.2
Examples
Microprogrammed Control 5.2.7 Basic
5.2.3
5.3
303
Hardwired Control /
332
Concepts / 5.2.2 Multiplier Control Unit /
CPU Control Unit 364
Pipeline Control 5.3.1 Instruction Pipelines / 5.3.2 Pipeline
Performance /
5.3.3 Superscalar Processing
5.4
Summary
5.5
Problems
392
References
399
5.6
6
Memory 6.1
390
Organization
Memory Technology 6.7.7
Memory
Device Characteristics / 6.1.2 Random
Access Memories / 6.1.3 SerialAccess Memories
400 400
6.2
Memory Systems 6.2.1 Multilevel
6.2.3
6.3
426
Memories /
6.2.2 Address Translation /
6.3.1
Contents
Memory Allocation
Caches
452
Main Features /
6.3.3 Structure versus
6.3.2 Address
Mapping /
Performance
6.4
Summary
471
6.5
Problems
472
6.6
References
478
Sysl em Organization
480
7.1
Communication Methods 7.1.1 Basic
7.2
7.2.7
7.2.3
7.3
480
Concepts / 7.1.2 Bus Control
10 And System Control
504
Programmed 10 / 7.2.2 DMA and Interrupts / 10 Processors / 7.2.4 Operating Systems
Parallel Processing
539
7.3.1 ProcessorLevel Parallelism / 7.3.2 Multiprocessors /
7.3.3 Fault Tolerance
7.4
Summary
7.5
Problems
579
7.6
References
587
Index
xi
578
589
PREFACE
This book
about the design of computers;
is
and
architecture,
it
covers both their overall design, or
their internal details, or organization.
hensive and selfcontained view of computer design
It
at
aims
to
provide a compre
an introductory level,
pri
marily from a hardware viewpoint. The third edition of Computer Architecture and Organization is intended as a text for computer science, computer engineering, and electrical engineering courses at the
undergraduate or beginning graduate levels;
should also be useful for selfstudy. This text assumes
little in
the
way
it
of prerequi
beyond some
familiarity with computer programming, binary numbers, and Like the previous editions, the book focuses on basic principles but has been thoroughly updated and has substantially more coverage of performancesites
digital logic.
related issues.
The book
is
divided into seven chapters. Chapter
1
discusses the nature and lim
itations of computation. This chapter surveys the historical evolution of
design to introduce and motivate the key ideas encountered
later.
computer
Chapter 2 deals
with computer design methodology and examines the two major computer design levels, the register (or register transfer)
and processor
levels, in detail.
reviews gatelevel logic design and discusses computeraided design
It
also
(CAD) and
performance evaluation methods. Chapter 3 describes the central processing unit (CPU), or microprocessor that lies at the heart of every computer, focusing on
and data representation. The next two chapters address
instruction set design
CPU
design issues: Chapter 4 covers the dataprocessing part, or datapath, of a processor,
while Chapter 5 deals with controlunit design. The principles of arithmetic
logic unit
(ALU) design
for both fixedpoint and floatingpoint operations are Both hardwired and microprogrammed control are examined along with the design of pipelined and superscalar processors. Chap
covered in Chapter Chapter
in ter
pal
5,
4.
6 deals with a computer's
memory
memory
subsystem; the chapter discusses the princi
technologies and their characteristics from a hierarchical viewpoint,
with emphasis on cache memories. Finally, Chapter 7 addresses the overall organization of a computer system, including inter and intrasystem communication,
inputoutput (10) systems, and parallel processing to achieve very high perforreliability. Various representative computer systems, such as von Neumann's classic IAS computer, the ARM RISC microprocessor, the Intel Pentium, the Motorola PowerPC, the MIPS RXOOO, and the Tandem NonStop faulttolerant multiprocessor, appear as examples throughout the book. The book has been in use for many years at universities around the world. It contains more than sufficient material for a typical onesemester (15 week) course,
mance and
allowing the instructor some leeway in choosing the topics to emphasize. Much of the background material in Chapter 1 and the first part of Chapter 2 can be left as a reading assignment, or omitted
advanced material loss of continuity.
in
if
the students are suitably prepared.
Chapter 7 can be covered briefly or skipped
The
Instructor's
if
The more
desired without
Manual contains some representative course
outlines.
This edition updates the contents of the previous edition and responds to the its users while retaining the book's timeproven emphasis.on basic
suggestions of
concepts. material
is
The third edition is somewhat more accessible to readers who
shorter than
its
predecessors, and the
are less familiar with computers. Every
Preface
section has been rewritten to reflect the dramatic changes that have occurred in the computer industry over the last decade. The main structural changes are the reorganization of the two old chapters on processor design and control design into
new Chapters
and the consolidation of the two old new Chapter 7. The treatment of performancerelated topics such as pipeline control, cache design, and superscalar architecture has been expanded. Topics that receive less space in this edition include gatelevel design, microprogramming, operating systems, and vecthree chapters: the
and
3, 4,
5;
chapters on system organization and parallel processing in the
The third edition also includes many new examples (case studies) and endofchapter problems. There are now more than 300 problems, about 80 percent of which are new to this edition. Course instructors can obtain an Instructor's Manual, which contains solutions to all the problems, directly from the pubtor processing.
lisher.
The
specific changes
material in Chapter
1
made
in the third edition are as follows:
design has been deemphasized in Chapter
A new
evaluation has been expanded.
(PLDs) has been added, and the stressed.
The old
The
historical
has been streamlined and brought up to date. Gatelevel 2,
while the discussion of performance
section
on programmable
logic devices
(CAD)
has been been split into Chapter 3, "Datapath Design." Chapter 3 contains an
role of computeraided design
third chapter (on processor design) has
"Processor Basics," and Chapter 4, expanded treatment of RISC and CISC CPUs and their instruction sets. It introduces the ARM and MIPS RX000 microprocessor series as major examples; the Motorola 680X0 series continues to be used as an example, however. The material on computer arithmetic and ALU design now appears in Chapter 4. The old chapter on control design, which is now Chapter 5, has been completely revised with a more practical treatment of hardwired control and a briefer treatment of microprogramming. A new section on pipeline control includes some material from the old Chapter 7, as well as new material on superscalar processing. Chapter 6 presents an updated treatment of the old fifth chapter on memory organization. Chapter 6 continues to present a systematic, hierarchical view of computer memories but has a greatly expanded treatment of cache memories. Chapter 7, "System Organization," merges material from the old sixth and seventh chapters. The sections on operating systems and parallel processing have been shortened and modernized. The material for this book has been developed primarily for courses on computer architecture and organization that versity of Southern California to
my
and
colleagues and students
at
I
have taught over the years, initially at the Uniof Michigan. I am grateful
later at the University
these and other schools for their
comments and suggestions. As always, I owe a special thanks as well as her neverfailing support
to
my
many
helpful
wife Terrie for proofreading assistance,
and love. John
P.
Hayes
CHAPTER
1
Computing and Computers
This chapter provides a broad overview of digital computers while introducing many of the concepts that are covered in depth later. It first examines the nature and limitations of the computing process. Then it briefly traces the historical development of computing machines and ends with a discussion of contemporary VLSIbased computer systems.
1.1
THE NATURE OF COMPUTING Throughout history humans have relied mainly on their brains to perform calculations; in other words, they were the computers [Boyer 1989]. As civilization advanced, a variety of computing tools were invented that aided, but did not replace, manual computation. The earliest peoples used their fingers, pebbles, or tally sticks for counting purposes. The Latin words digitus meaning "finger" and calculus meaning "pebble" have given us digital and calculate and indicate the ancient origins of these computing concepts. Two early computational aids that were widely used until quite recently are the abacus and the slide rule, both of which are illustrated in Figure 1.1. The abacus has columns of pebblelike beads mounted on rods. The beads are moved by hand to positions that represent numbers. Manipulating the beads according to certain simple rules enables people to count, add, and perform the other basic operations of
arithmetic. The slide rule, on the other hand, represents numbers by lengths marked on rulerlike scales that can be moved relative to one another. By adding a length a on a fixed scale to a length b on a second, sliding scale, their combined length c = a + b can be read off the fixed scale. The slide rule's main scales are logarithmic, so that the process of adding two lengths on these scales effectively multiplies two
1
SECTION
.
,
1
1.1
The Nature of Computing
B= •
l
t
•
I
•
1
1
1
1
r
1
i

1
1
1
1
:=
:=
sum
C(I).
N
into
Decrement count
then go to M(6, 20:39) Test 0:19)
AC.
N
by one.
N and branch to 6R
if
nonnegative.
Halt.
AC
Update count N. Increment
1
AC + M(2) AC + M(2)
9R
AC
10L
M(4, 8:19):=AC(28:39)
10R
gotoM(3,0:19)
:=
AC.
into
A(I) + B( I).
Load count
5L
AC
•
AC by one.
Modify address
in
3L.
Modify address
in
3R.
Modify address
in
4L.
Branch
to 3L.
Figure 1.15
An IAS program
for vector addition.
4L, respectively. Thus the program continuously modifies ure 1.15
shows
tion, the first
3L
AC:=M(1001)
3R
AC:=AC + M(2001)
4L
M(3001):=AC
Critique. In the years that pleted,
.
during execution. Figthe
computa
have elapsed since the IAS computer was comin computer design have appeared. Hindsight
numerous improvements
enables us to point out 1
itself
program before execution commences. At the end of three instructions will have changed to the following: the
some of the IAS's shortcomings.
The program selfmodification process
illustrated in the
preceding example for
and debugging a prochange themselves is difficult and errorprone. Further, before every execution of the program, the original version must be reloaded into M. Later computers employ special instruction types and registers for index control, which eliminates the need for addressmodify instructions. decrementing the index
gram whose
instructions
I is
inefficient. In general, writing
2.
CPU results in a great deal of unproCPU and main memory M; it also adds to program length. Later computers have more CPU registers and a special memory called a cache that acts as a buffer between the CPU register? and M.
The small amount of storage space
in the
ductive datatransfer traffic between the
3.
No
4.
The
facilities
no procedure
were provided for structuring programs. For example, the IAS has call or return instructions to link different
instruction set
is
programs.
biased toward numerical computation. Programs for non
numerical tasks such as text processing were difficult to write and executed slowly. 5.
Inputoutput (10) instructions were considered of minor importance
they are not mentioned in Burks, Goldstine, and von
Neumann
—
in fact,
[1946] beyond
IAS had two basic and rather inefficient 10 The input instruction INPUT(X, N) transferred N words from an input device to the CPU and then to N consecutive main memory locations, starting at address X. The OUTPUT(X, N) instruction transferred N consecutive words from the memory region with starting address X to an outnoting that they are necessary. instruction types [Estrin 1953].
put device.
1.2.3
The Later Generations
on size and speed imposed by early electronic technology, the IAS and other firstgeneration computers introduced many features that are central to later computers: the use of a CPU with a small set of registers, a separate main memory for instruction and data storage, and an instruction set with a limited range of operations and addressing capabilities. Indeed the term von Neumann computer has become synonymous with a computer In spite of their design deficiencies and the limitations
of conventional design.
The second generation. Computer hardware and software evolved rapidly of the first commercial computers around 1950. The vacuum tube quickly gave way to the transistor, which was invented at Bell Laboratories in 1947, and a second generation of computers based on transistors superseded the after the introduction
first
generation of
vacuum tubebased machines. Like
a
vacuum
tube, a transistor
serves as a highspeed electronic switch for binary signals, but
it
is
smaller,
much less power than a vacuum tube. Similar of memory technology, with ferrite cores becoming
cheaper, sturdier, and requires
progress occurred in the field
dominant technology for main memories until superseded by alltransistor memories in the 1970s. Magnetic disks became the principal technology for sec
the
ondary memories, a position
that they continue to hold.
circuits, the second generation, which spans the decade 195464. introduced some important changes in the design of CPUs and their instruction sets. The IAS computer still served as the basic model, but more registers were added to the CPU to facilitate data and address manipulation. For
Besides better electronic
example, index registers were introduced to store an index variable
I
of the kind
appearing in the statement
C(I):=A(I) + B(I)
(1.9)
27
CHAPTER
1
Computing and Computers
28
SECTION
1.2
The Evolution of Computers
Index registers make it possible to have indexed instructions, which increment or decrement a designated index I before (or after) they execute their main operation. Consequently, repeated execution of an indexed operation like (1.9) allows it to
The index value
step automatically through a large array of data.
CPU
I
is
stored in a
and not in the program, so the program Itself does not change during execution. Another innovation was the introduction of two programcontrol instructions, now referred to as call and return, to facilitate the linking of proregister
grams; see also Example
1.5.
"Scientific" computers of the second generation, such as the
appeared
1962,
in
IBM 7094
which
introduced floatingpoint number formats and supporting
instructions to facilitate numerical processing. Floating point
is
a type of scientific
number such as 0.0000000709 is denoted by 7.09 X 10" 8 floatingpoint number consists of a pair of fixedpoint numbers, a mantissa notation where a
.
M
A
M
M
= X B~E In the preceding example and an exponent E, and has the value 7.09, E = 8, and B = 10. In their computer representation and E are encoded in binary and embedded in a word of suitable size; the base B is implicit. Floating.
M
point
numbers eliminate
the
need for number scaling; floatingpoint numbers are The hardware needed to implement
automatically scaled as they are processed. arithmetic
floatingpoint
quently,
many computers
instructions
(then and
directly
now)
rely
ment floatingpoint operations via fixedpoint
is
relatively
expensive. Conse
on software subroutines
to imple
arithmetic.
Inputoutput operations. Computer designers soon realized that IO operations, that
is,
the transfer of information to
and from peripheral devices
and done
like printers
secondary memory, can severely degrade overall computer performance
if
Most IO transfers have main memory as their final source or destinaand involve the transfer of large blocks of information, for instance, moving a program from secondary to main memory for execution. Such a transfer can take place via the CPU, as in the following fragment of a hypothetical IO program: inefficiently.
tion
Location
LOOP
Comment
Instruction
AC
:= D(I)
M(I)
:=
Input
AC
I:=I+1 if I
(a)
(b)
(c)
Figure 1.18
Some
representative IC packages: (a) 32pin smalloutline Jlead (SOJ); (b) 132pin plastic
quad flatpack (PQFP);
(c)
84pin pingrid array (PGA). [Courtesy of Sharp Electronics
Corp.]
1.3.1
Integrated Circuits
The integrated
circuit
was invented
Corporations [Braun and
McDonald
in
1959
1982].
It
at
Texas Instruments and Fairchild
quickly became the basic building
block for computers of the third and subsequent generations. (The designation of
computers by generation largely fell into disuse after the third generation.) An IC is an electronic circuit composed mainly of transistors that is manufactured in a tinyrectangle or chip of semiconductor material. The IC is mounted into a protective plastic or ceramic package, which provides electrical connection points called pins or leads that allow the IC to be connected to other ICs, to inputoutput devices like a keypad or screen, or to a power supply. Figure 1.18 depicts several representative IC packages. Typical chip dimensions are 10 X 10 mm, while a package like that of
Figure 1.18b
is
approximately 30
ably bigger than the chip
it
X 30 X 4 mm. The
IC package
is
often consider
contains because of the space taken by the pins.
The
PGA
package of Figure 1.18c has an array of pins (as many as 300 or more) projecting from its underside. A multichip module is a package containing several IC chips attached to a substrate that provides mechanical support, as well as electrical connections between the chips. Packaged ICs are often mounted on a printed circuit board that serves to support and interconnect the ICs. A contemporary computer consists of a set of ICs, a set of
IO devices, and
a
power supply. The number
of ICs can range from one IC to several thousand, depending on the computer's size
and the IC types I C density.
An
it
uses.
integrated circuit
is
roughly characterized by
defined as the number of transistors contained in the chip.
its
density,
As manufacturing
tech
niques improved over the years, the size of the transistors in an IC and their inter1 pm. (By comparison, the width of a human hair is about 75 ujn.) Consequently, IC densities have increased steadily, while chip size has varied very little. The earliest ICs the first commercial IC appeared in 1961 contained fewer than 100 transistors and employed smallscale integration or SSI. The terms mediumscale, largescale, and verylargescale integration (MSI, LSI and VLSI.
connecting wires shrank, eventually reaching dimensions below a micron or
—
—
37 •
,•*
lGbit
DRAM a
10 9
CHAPTER
„' . • •
Computers u
c
DRAM
c j
C
106
./*
6_bj t
A*
microprocessor
v^
£ v»
lKbit
 a y
./^ ./• 64bit
lMbit
DRAM 10 3
8bit
microprocessor
32bit
microprocessor
microprocessor
./^
—
•
_,
4bit microprocessor
MSI
•SSI i
1
i
1970
1960
i
1980
i
2000
1990
2010
Year
Figure 1.19 Evolution of the density of commercial ICs.
respectively) are applied to ICs containing hundreds, thousands, and millions of transistors, respectively.
VLSI
The boundaries between
these IC classes are loose, and
often serves as a catchall term for very dense circuits. Because their
facture
is
highly automated
tured in high
volume
at
—
it
resembles a printing process
low cost per
circuit.
manu
—ICs can be manufac
Indeed, except for the latest and
densest circuits, the cost of an IC has stayed fairly constant over the years, implying that newer generations of ICs deliver far greater value (measured by computing performance or storage capacity) per unit cost than their predecessors did. Figure 1.19 shows the evolution of IC density as measured by two of the dens
dynamic randomaccess memory (DRAM), a basic component CPU or microprocessor. Around 1970 it became possible to manufacture all the electronic circuits for a pocket calculator on a single IC chip. This development was quickly followed by singlechip DRAMs est chip types: the
of main memories, and the singlechip
and microprocessors. As Figure 1.19 shows, the capacity of the largest available DRAM chip was IK = 2 10 bits in 1970 and has been growing steadily since then, 20 reaching 1M = 2 bits around 1985. A similar growth has occurred in the complexity of microprocessors. The first microprocessor, Intel's 4004, which was introduced in 1971, was designed to process 4bit words. The Japanese calculator manufacturer Busicom commissioned the 4004 microprocessor, but after Busicom's early demise, Intel successfully marketed the 4004 as a programmable controller to replace standard, nonprogrammable logic circuits. As IC technology
improved and chip density increased, the complexity and performance of onechip microprocessors increased steadily, as reflected in the increase in CPU word size to 8 and then 16 bits by the mid1980s. entire
CPU
By 1990 manufacturers could
of a System/360class computer, along with part of
on a single IC. The combination of a CPU, memory, and IO small
number of ICs)
is
called a microcomputer.
1
Computing and
its
circuits
fabricate the
main memory, in one IC (or a
.
IC families. Within IC technology SECTION
.
several subtechnologies exist that are dis
tinguished by the transistor and circuit types they employ.
Two
of the most impor
1.3
The VLSI Era
tant of these technologies are bipolar
and unipolar; the
latter is
normally referred to
MOS (metaloxidesemiconductor) after its physical structure. Both bipolar and MOS circuits have transistors as their basic elements! They differ, however, in the as
polarities of the electric charges associated with the
primary carriers of electrical
signals within their transistors. Bipolar circuits use both negative carriers (elec
and positive
MOS circuits,
on the other hand, use only one MOS (PMOS) and negative in the case of Ntype MOS (NMOS). Various bipolar and MOS IC circuit types or IC families have been developed that provide tradeoffs among density, operating speed, power consumption, and manufacturing cost. An MOS family that efficiently combines PMOS and NMOS transistors in the same IC is complementary MOS or CMOS. This technology came into widespread use in the 1980s and has been the technology of choice for microprocessors and other VLSI ICs since then because of its combination of high density, high speed, and very low power consumption [Weste and Eshragian 1992]. trons)
carriers (holes).
type of charge carrier: positive in the case of Ptype
EXAMPLE
A
1.6
NOLOGY. To
CMOS The
circuit
ZERODETECTION CIRCUIT EMPLOYING CMOS TECHof transistors in computing,
illustrate the role
whose function
circuit's output z
should be
is 1
when
to detect
when x
x
]
x2x i = 0000;
combinations of input values. Zero detection cessing. For example, if
it is
is
word x
a 4bit
quite a
it
we examine x l
should be
common
a small
x 2x i becomes
zero.
for the other 15
operation in data pro
used to determine when a program loop terminates, as
in the
statement (location 5R) appearing in the IAS program of Figure 1.15.
Figure 1.20 shows a particular implementation sentative
CMOS
symbolic form
subfamily
in
denoted 5,:5 7 and
known
Figure 1.20a.
NMOS
It
as static
ZD of zero detection
CMOS. The
circuit is
using a repre
shown
in
standard
numbers of PMOS transistors Each transistor acts as an onoff
consists of equal
SS :S U
transistors denoted
.
switch with three terminals, where the center terminal c controls the switch's
When turned on,
a signal propagation path
is
state.
created between the transistor's upper and
lower terminals; when turned off, that path is broken. An NMOS transistor is turned on by applying 1 to its control terminal c; it is turned off by applying to c. A PMOS transistor, on the other hand, is turned on by c and turned off by c = 1 Each set of input signals applied to ZD causes some transistors to switch on and others to switch off, which creates various signal paths through the circuit. In Figure 1.20 the constant signals and 1 are applied at various points in ZD. (These signals are derived from ZD's electrical power supply.) The 0/1 signals "flow" through the circuit
along the paths created by the transistors and determine various internal signal values,
main output line z. Figure 1.20b shows the signals and signal transmission paths produced by x x x 2x3  0001. The first input signal x = is applied to PMOS transistor 5, and NMOS transistor 5 g hence S, is turned on and 5 g turns S 2 on and S 9 off. A path is created through S, and is turned off. Similarly, x, = S2 which applies 1 to the internal line y,, as shown by the leftmost heavy arrow in Figand y 3 = 1. ure 1. 20b. In the same way the remaining input combinations make y 2 = as well as the value applied to the
i
;
,
latter signal is applied to the two rightmost transistors turning S 7 off and 5 14 on, which creates a path from the zero source to the primary output line via 5 14 so z = as
The
,
required.
we change
input x 3 from 1 to in Figure 1.20b, the following chain of events 54 turns on and 5,, turns off, changing y2 to 1. Then 5 I3 turns on and S6 turns making y 3 = 0. Finally, the new value of y 3 turns 57 on and S ]4 off, so z becomes If
occurs: off,
1
39
CHAPTER
1
Computing and Computers
Output
PMOS
Inputs
transistor
NMOS transistor
(a)
;
x =
xQ =
x2 
l
*3 =
1
Transistor switched on
=
Transistor switched off
(b)
Figure 1.20 (a)
CMOS circuit ZD for zero detection; (b)
xQx x2x } = 0001 making l
z
=
state
of
ZD
with input combination
0.
the zero input combination x x x 2x 3 = 0000 makes c = readily be verified that no other input combination does this.
Hence
A
transistor circuit like that of Figure 1.20
circuit at a
1
as required.
It
can
l
low
models the behavior of a digital Because many of the
level of abstraction called the switch level.
ICs of interest contain huge numbers of transistors, it is rarely practical to analyze computing functions at the switch level. Instead, we move to higher abstraction levels, two of which are illustrated in Figure 1.21. At the gate or logicAexe\ illustrated by Figure 1.21a. we represent certain common subcircuits by symbolic their
40
SECTION
1.3
Zero
The VLSI Era
detector
NAND gate
NOT gate (inverter)
NOR gates 00
(a)
Figure 1.21
The zerodetection ter level
circuit of Figure
1
.20
modeled
at (a) the
gate level and (b) the regis
of abstraction.
components called A, B, C, and
D
comprises four gates
(logic) gates. This particular logic circuit
of three different types as indicated; note that each gate type has a
distinct graphic
symbol. In moving from the switch
transistor circuit into a single gate
tage of the logic level
is
that
it is
and discard
all its
we
level,
collapse a multi
internal details.
technology independent, so
it
A key
advan
can be used equally
well to describe the behavior of any IC family. In dealing with computer design,
we
also use an even higher level of abstraction
transfer level.
It
known
as the register or register
treats the entire zerodetection circuit as a primitive or indivisible
component, as in Figure 1.21b. The register level is the level at which we describe the internal workings of a CPU or other processor as, for example, in Figures 1.2 and 1.17. Observe that the primitive components (represented by boxes) in these diagrams include
registers,
memory, or computer level of abstraction,
1.3.2
ALUs, and
the like.
When we
treat
an entire
component, we have moved called the processor or system level.
as a primitive
which
is
CPU,
to the highest
Processor Architecture
By 1980 computers were
classified into three
main
types:
mainframe computers,
minicomputers, and microcomputers. The term mainframe was applied to the traditional "large" computer system, often containing thousands of ICs and costing millions
of dollars.
It
typically
served as the central computing facility for an
organization such as a university, a factory, or a bank. Mainframes were then
roomsized machines placed
in special
computer centers and not
directly accessible
average user. The minicomputer was a smaller (desk size) and slower version of the mainframe, but its relatively low cost (hundreds of thousands of dollars)
to the
made
it
suitable as a "departmental"
a small business, for example.
computer
to be shared
by
The microcomputer was even
cheaper (a few thousand dollars), packing
all
a
group of users
smaller, slower,
the electronics of a
computer
—
in
and
into a
handful of ICs, including microprocessor (CPU), memory, and IO chips.
Personal computers. Microcomputer technology gave
rise to a
new
class of
generalpurpose machines called personal computers (PCs), which are intended for a single user. These small, inexpensive computers are designed to sit on an office
desk or fold into a compact form to be carried. The more powerful desktop computers intended for scientific computing are referred to as workstations.
A
typical
PC
has the von Neumann organization, with a microprocessor, a multimegabyte main memory, and an assortment of 10 devices: a keyboard, a video monitor or screen, a magnetic or optical disk drive unit for highcapacity secondary memory, and interface circuits for connecting the PC to printers and to other computers. Personal computers have proliferated to the point that, in the more developed societies, they are present in most offices and many homes. Two of the main applications of PCs are word processing, where personal computers have assumed and greatly expanded all the functions of the typewriter, and dataprocessing tasks like financial record keeping. They are also used for entertainment, education, and increasingly, communication with other computers via the World Wide Web. Personal computers were introduced in the mid1970s by a small electronics kit maker, MITS Inc. [Augarten 1984]. The MITS Altair computer was built around the Intel 8008, an early 8bit microprocessor, and cost only $395 in kit form. The most successful personal computer family was the IBM PC series introduced in 1981. Following the precedent set by earlier IBM computers, it quickly
become
the de facto standard for this class of machine.
standardization process
—namely, IBM's decision
called an open architecture, by
making
its
A new factor also aided the
to give the
PC what came
to be
design specifications available to other
manufacturers of computer hardware and software. As a
became very popular, and many versions of
—
result,
PC
the
IBM PC
—
were produced by others, including startup companies that made the manufacture of lowcost PC clones their main business. The PC's open architecture also provided an incentive for the development of a vast amount of applicationspecific software from many sources. Indeed a new software industry emerged aimed at the massit
the socalled
production of lowcost, selfcontained programs aimed the
at specific
clones
applications of
IBM PC and a few other widely used computer families. The IBM PC series is based on Intel Corp.'s 80X86 family of microprocessors,
which began with the 8086 microprocessor introduced in 1978 and was followed 2 by the 80286 (1983), the 80386 (1986), the 80486 (1989), and the Pentium (1993) [Albert and Avnon 1993]; the Pentium II appeared in 1997. The IBM PC series is also distinguished by its use of the MS/DOS operating system and the Windows graphical user interface, both developed by Microsoft Corp. Another popular personal computer series is Apple Computer's Macintosh, introduced in 1984 and built around the Motorola 680X0 microprocessor family, whose evolution from the 68000 microprocessor (1979) parallels that of the 80X86/Pentium [Farrell 1984. In
1994 the Macintosh
CPU
was changed
to a
new microprocessor known
as the
PowerPC. Figure 1.22 shows the organization of a typical personal computer from the compare Its legacy from earlier von Neumann computers is apparent
—
mid1990s.
Figure 1.22 to Figure 1.17. At the core of this computer
is
a singlechip micropro
As we will see, the microprocessor's internumber of speedup features not found in A system bus connects the microprocessui to a main memor)
cessor such as the Pentium or PowerPC.
nal (micro) architecture usually contains a its
predecessors.
based on semiconductor
DRAM technology and to an IO subsystem. A separate IO
bus, such as the industry standard
2
A
legal ruling that
PCI (peripheral component interconnect)
"'local'"
microprocessor names that are numbers cannot have trademark protection, resulted a microprocessor called the Pentium rather than the 80586.
80486 being followed by
in the
41
CHAPTER
1
Computing and Computers
r 42
Secondary
Microprocessor
SECTION
(hard disk) 1.3
memory
The VLSI Era
Video
Communication
Keyboard
monitor
T
CPU
IO expansion
Main
Hard disk
Video
Keyboard
Network
memory
control
control
control
control
M
Cache
Bus
IO devices
network
—
UTI
u IO
'Li
interface unit
slots
(local)
r
zr
2 bus
Peripheral (IO) interface control unit
ir
T
System bus
Figure 1.22
A typical personal
computer system.
bus, connects directly to the is
IO devices and
their individual controllers.
linked to the system bus, to which the microprocessor and
memory
The IO bus
are attached
via a special bustobus control unit sometimes referred to as a bridge.
The IO
devices of a personal computer include the traditional keyboard, a CRTbased or flatpanel video monitor,
and disk drive units for the hard and flexible (floppy) disk
storage devices that constitute secondary
memory. More recent
additions to the
IO
CDROMs
(compact disc readonly memories), which have extremely high capacity and allow sound and video images to be stored and retrieved efficiently. Other common audiovisual IO devices in personal computers are microphones, loudspeakers, video scanners, and the like, which are referred to as multimedia equipment. device repertoire include drive units for
Performance considerations. As processor hardware became much sive in the 1970s, thanks mainly to advances in
VLSI technology
less
expen
(Figure 1.19),
computer designers increased the use of complex, multistep instructions. This reduces N, the total number of instructions that must be executed for a given task, since a single complex instruction can replace several simpler ones. For example, a multiply instruction can replace a multiinstruction subroutine that implements multiplication by repeated execution of add instructions. Reducing N in this way tends to reduce overall program execution time T, as well as the time that the CPU spends fetching instructions and their operands from memory. The same advances in VLSI made it possible to add new features to old microprocessors, such as new instructions, data types, instruction sets, and addressing modes, while retaining the ability to execute programs written for the older machines. The Intel 80X86/Pentium series illustrates the trend toward more complex instruction sets. The 1978vintage 8086 microprocessor chip, which contained a mere 20,000 transistors, was designed to process 16bit data words and had no instructions for operating on floatingpoint numbers [Morse et al. 1978]. Twentyfive years later,
its
direct descendant, the Pentium, contained over 3 million transis
processed 32bit and 64bit words directly, and executed a comprehensive set of floatingpoint instructions [Albert and Avnon 1993]. The Pentium accumulated
tors,
most of the architectural features of its various predecessors in order to enable it to execute, with little or no modification, programs written for earlier 80X86series machines. Reflecting these characteristics, the 80X86, 680X0, and most older com3 puter series have been called complex instruction set computers (CISCs). By the 1980s it became apparent that complex instructions have certain disadvantages and that execution of even a small percentage of such instructions can sometimes reduce a computer's overall performance. To illustrate this condition, suppose that a particular microprocessor has only
which requires k time
units, to execute.
instructions in 100k time units.
Thus
fast,
simple instructions, each of
the microprocessor can execute 100
Now
suppose that 5 percent of the instructions are To execute an average set of 100 instructions therefore requires (5x21+ 95)k = 200k time units, assuming no other factors are involved. Consequently, the 5 percent of complex instrucslow, complex instructions requiring 2lk time units each.
program execution time. program size, this technology does not necessarily translate into faster program execution. Moreover, complex instructions require relatively complex processing circuits, which tend to put CISCs in the largest and most expensive IC category. These drawbacks were first recognized by John Cocke and his colleagues at IBM in the mid1970s, who developed an experimental computer called 801 that aimed to achieve very fast overall performance via a streamlined instruction set that could be executed extremely fast [Cocke and Markstein 1990]. The 801 and subsequent machines with a similar design philosophy have been called reduced instruction set computers (RISCs). A number of commercially successful RISC microprocessors were introduced in the 1980s, including the IBM RISC System/6000 and SPARC, an "open" microprocessor developed by Sun tions can, as in this particular example, double the overall
Thus while complex
instructions reduce
Microsystems and based on RISC research at the University of California, Berkeley [Patterson 1985]. Many of the speedup features of RISC machines have found their
way
into other
new computers, including such CISC microprocessors as the PenRISC is often used to refer to any computer with an instruc
tium. Indeed, the term tion set
and an associated
CPU
organization designed for very high performance:
the actual size of the instruction set
A its
computer's performance
is
is
relatively unimportant.
also strongly affected by other factors besides
instruction set, especially the time required to
move
instructions and data
M
CPU
and, to a lesser extent, the time required to and main memory and IO devices. It typically takes the CPU about move information between than from one of its internal registers. five times longer to obtain a word from
between the
M
M
This difference in speed has existed since the first electronic computers, despite strenuous efforts by circuit designers to develop memory devices and processor
memory
interface circuits that are fast
enough
to
keep up with the
fastest micro
CPUM
speed disparity has become such a feature of standard (von Neumann) computers that is sometimes referred to as the von Neumann bottleneck. RISC computers usually limit access to main memory to a few load processors. Indeed the
and store instructions; other instructions, including
all
gramcontrol instructions, must have their operands in
3
dataprocessing and pro
CPU
registers. This so
The public became aware of CISC complexity when a design flaw affecting the floatingpoint division Pentium was discovered in 1994. The cost to Intel of this bug. including the replacement cost of Pentium chips already installed in PCs. was about $475 million. instruction of the
43
CHAPTER
I
Computing and Computers
44
called loadstore architecture
section
mann bottleneck by reducing the CPU.
3
The VLSI Era
Performance measures. "basic" operations that
it
is
intended to reduce the impact of the von Neu
number of
the total
A
memory
the
CPU speed A typical
rough indication of
can perform per unit of time.
two
the fixedpoint addition of the contents of
registers
accesses
is
the
made by
number of
basic operation
Rl and R2,
as in the
is
sym
bolic instruction
Rl :=R1 Such operations
+R2
are timed
by a regular stream of signals (ticks or beats) issued by a The speed of the clock is its frequency / millions of ticks per second; the units for this are megahertz (MHz).
central timing signal, the system clock.
measured in Each tick of the clock the operation
period
is 1//
triggers a basic operation;
microseconds
Tdock For example,
((is).
hence the time required to execute
This value
is
called the clock cycle or clock
a computer clocked at 250
MHz
can perform one basic Complicated operations such as division or operations on floatingpoint numbers can require more than one clock cycle to complete their execution. .
operation in the clock period
Tdock =
1/250 = 0.004
(is.
Generally speaking, smaller electronic devices operate faster than larger ones,
accompanied by a from 1981 to 1995 to 100 MHz. Clock
so the increase in IC chip density discussed above has been
steady, but less dramatic, increase in clock speed. For example,
microprocessor clock speeds increased from about 10 speeds of
1
gigahertz
versions of current
GHz
(1
CMOS
or 1000
technology.
MHz
MHz) and beyond It
are feasible using faster
might therefore seem possible to achieve
any desired processor speed simply by increasing the
CPU
clock frequency.
How
which clock frequency is increasing due to IC technology improvements is relatively slow and may be approaching limits determined by the speed of light, power dissipation, and similar physical considerations. Extremely fast cir
ever, the rate at
cuits also tend to be very
expensive to manufacture.
The CPU's processing of an
instruction involves several steps, each of
which
requires at least one clock cycle: 1.
2. 3.
memory M. Decode the instruction's opcode. any operands needed unless they Load (read) from Fetch the instruction from main
M
are already in
CPU
regis
ters.
4.
Execute the instruction via a registertoregister operation using an appropriate functional unit of the CPU, such as a fixedpoint adder.
5.
Store (write) the results in
The
fastest instructions
have
M unless they are to be retained in CPU registers. all their
operands
cuted by the
CPU
The slowest
instructions require multiple
in a single
in
clock cycle, so steps
memory
CPU 1
registers
to 3 all take
and can be exeone clock cycle.
accesses and multiple register
complete their execution. Consequently, measures of instruction execution performance are based on average figures, which are usually determined experimentally by measuring the run times of representative or benchtoregister operations to
mark programs. The more
representative the programs are, that
rately they reflect real applications, the better the
is,
the
more accu
performance figures they provide.
Suppose
benchmark program or set (suite) of on a given CPU takes T seconds and involves the execution of a total of N machine (object) instructions. Here N is the actual number of instructions executed, including repeated executions of the same instruction; it is not the number of instructions appearing in Q. As far as the typical computer user is concerned, the key performance goal is to minimize the total program execution time T. While T can be determined accurately only by measurement of n 2 Assume .
x
that
n„
that subtracts a
unary number n : (rem another unary num
n 2 and the result n,  n : are stored in the formats described ,
57
CHAPTER
1
Computing and Computers
Example 1.1. That is, the tape initially contains only n and n 2 separated by a blank, while the final tape should contain only n,  n 2 Describe your machine by a program
58
in
,
.
SECTION
1.5
listing
with comments, following the style used
in
Figure
1
.4.
Problems 1.4.
Construct a Turing machine program Countjup in the style of Figure 1.4 that incre
ments an arbitrary binary number by one. For example, if the number 1001 1 denoting 19 is initially on an otherwise blank tape, Count _up should replace it with 10100 denoting 20. Assume that the readwrite head starts and ends on the blank square immediately to the left of the listing
structions
1.5.
number on
the tape. Describe your
with comments, following the style used in Figure
employing fewer than 10
at
around 10
120 .
Is
.4.
machine by a program Fewer than 20 in
[Hint:
states suffice for this problem.]
The number of possible sequences of moves ed
1
(distinct
games)
in
chess has been estimat
developing a surefire winning strategy for chess therefore an un
solvable problem?
1.6.
Determine whether each of the following computational tasks is unsolvable. undeor intractable. Explain your reasoning, (a) Determining the minimum amount of wire needed to connect any set of n points (wiring terminals) that are in specified but arbitrary positions on a rectangular circuit board. Assume that at most two wires may be attached to each terminal, (b) Solving the preceding wiring probcidable,
lem when the n points and the wires periphery of the board; that
is,
that
connect them are constrained
to lie
the wire segments connecting the n points
must
on the lie on
a fixed rectangle.
1.7.
Most wordprocessing computer programs contain a spelling checker. An obvious bruteforce method to check the spelling of a word Wis to search the entire online dictionary from beginning to end and compare W to every entry in the dictionary. Outline a faster method to check spelling and compare its time complexity to that of the bruteforce method.
1.8.
Consider the four algorithms
maximum problem times faster than faster than 1.9.
The
listed in Figure 1.7.
With
size that each algorithm can handle
M. Repeat
the calculation for a
the given data, calculate the
on a computer M'
computer
M" that is
that is
10,000
1,000.000 times
M.
bruteforce technique illustrated by the Eulercircuit algorithm in
Example
1.2,
which involves the enumeration and examination of all possible cases, is applicable to many computing problems. To make the method tractable, problemspecific techniques are used to reduce the number of cases that need to be considered. For example, the eightedge graph of Figure 1 .6b can be simplified by replacing the edgepair eg with a single edge because any Euler circuit that contains c must also contain g, and vice versa. Similarly, the pair dh can be replaced by a single edge. The problem then reduces to checking for an Euler circuit in a sixedge graph. For the same problem, suggest another
method
must considered, 1.10.
that
can sometimes substantially reduce the number of cases that
illustrating
it
Consider the heuristic method
with a different graph example.
salesman problem discussed briefproblem involving at most five cities, for which
to solve the traveling
ly in section 1.1.2. Construct a specific
the total distance dhem traveled in the heuristic solution is not the minimum distance dmin Conclude from your example (or from other considerations) that the heuristic solution can be made arbitrarily bad, that is, "worst case" problems can be contrived in dmin can be made arbitrarily large. .
— 1.11.
2
Consider the computation of x by the method of differences covered in Example 1.3. Suppose we want to determine x2 for x = 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, that is, at intervals of 0.5. Explain
how
to
modify the method of Example
accomplish
1.3 to
59
CHAPTER
this task.
1
Computing and 1.12.
Use the method of differences embodied x for integer values of * from 1 to 10.
1.13.
Use is
in
method of differences to compute x5 for integer values of x from to 8. What 5 smallest value of / for which the z'th difference of x is a constant? What is the
the
the
Babbage's Difference Engine to compute
1
,
value of that constant? 1.14.
Consider the problem of computing a table of the natural logarithms of the integers from 1 to 200,000 to 19 decimal places, a task carried out manually in 1795. Select any modern commercially available computer system with which you are familiar and estimate the total time it would require to compute and print this table. Define all the parameters used in your estimation.
1.15.
Discuss the advantages and disadvantages of storing programs and data
memory
program concept). Under what circumstances programs and data in separate memories? 1.16.
(the stored
is it
in the
same
desirable to store
Computers with separate program and data memories implemented in RAMs and respectively, are sometimes called Hanardclass machines after the Harvard Mark 1 computer. Computers with a single (RAM) memory for program and data storage are then called Princetonclass after the IAS computer. Most currently installed computers belong to one of these classes. Which one? Explain why the class you selected is the most widely used.
ROMs,
1.17.
2
Write a program using the IAS computer's instruction set (Figure 1.14) to compute x by means of the method of finite differences described in Example 1.3. For simplicity, assume that the numbers being processed are 40bit integers and that the only dataprocessing instructions you may use are the IAS's add and subtract instructions. The 2 2 2 2 should be stored in k consecutive memory results x (x + l) ...,(x + k l) (x + 2) ,
,
,
,
locations with starting address 3001. 1.18.
A vector of
10 nonnegative numbers
is
stored in consecutive locations beginning in lo
memory of the IAS computer. Using the
cation 100 in the
instruction set of Figure 1.14.
computes the address of the largest number in this locations contain the largest number, specify the smallest address. write a
1.19.
program
that
IAS decided not to implement a square root instruction (ENIAC = x m can be computed iteratively and very efficiently following formula known in ancient Babylon:
The designers of had one), via the
array. If several
the
—
citing the fact that y
>'
J+ \
Here; =
1, 2,
3, ...,
and
>•„ is
an
=(.v,
initial
+
*/y,)/2
approximation to x
]f2 .
Assuming
cesses real (floatingpoint) numbers directly, construct a program ure 1.15 to calculate the square root of a given positive
that
IAS
in the style
number x according
pro
of Figto this
formula.
IAS and other firstgeneration computers as some of their predecessors. In what sense was the IAS a parallel computer? What forms of parallelism do modern computers have that are lacking in the IAS?
1.20. Early
computer
literature describes the
"parallel." unlike
1.21.
The IAS had no
designed for transferring control between and return can be programmed using the IAS's
call or return instructions
programs, (a) Describe
how
call
Computers
60
original instruction set. (b)
What
would you suggest adding
feature
to the
IAS
to
support call and return operations?
SECTION Problen^
1.5
and a stack program of the kind given
122. Construct both a Polish expression
in
Figure
1.16a to evaluate the following expression:
f:=(4x(a 2 + 1.23.
From
b +
c)d)/(e+fxg)
the data presented in Figure 1.19, estimate
how
(1.14)
long
it
takes,
on average, for the
density of leadingedge ICs to double. This doubling rate, which has remained remark
ably constant over the years,
cofounder of 1.24.
Using the
Intel Corp.,
circuit
is
who
referred to as
formulated
of Figure 1.20 as an
CMOS
it
Moore 's
Gordon
law, after
Moore, a
E.
in the 1960s.
illustration, discuss
and
justify the following
Power consumption is very low and most of it occurs when the circuit is changing state (switching), (b) The logic signals and correspond to electrical voltage levels, (c) The subcircuits that constitute logic gates draw their power directly from the global power supply rather than from the external general properties of
circuits: (a)
1
(primary) input signals: hence the gates perform signal amplification. 1.25.
The
CMOS
zerodetection circuit of Figures 1.20 and 1.21 can be implemented as a
single fourinput logic gate. Identify the gate in question and redesign the circuit in the
more compact 1.26.
Design a
singlegate form.
CMOS
onesdetection circuit in the multigate style of Figure 1.20.
It
should
produce the output z = 1 if and only if x x x 2x 3  1111. Give both a transistor (switchlevel) circuit and a gatelevel circuit for your design. ]
1.27.
Discuss the impact of developments
computer hardware technology on the evolution
in
of each of the following: (a) the logical complexity of the smallest replaceable components; (b) the operating speed of the smallest replaceable components;
and
(c) the for
mats used for data and instruction representation. 1.28.
Define the terms software compatibility and hardware compatibility.
What
role
have
they played in the evolution of computers? 1.29. Identify
and briefly describe three distinct ways in which parallelism can be introduced computer in order to increase its overall instruction ex
into the microarchitecture of a
ecution speed. 1.30.
Compare and
contrast the
IAS and PowerPC processors in terms of the complexity of Use the vector addition programs of
writing assemblylanguage programs for them.
Figures 1.15 and 1.27 to illustrate your answer. 1.31.
A
popular microprocessor of the 1970s was the Intel 8085, a direct ancestor of the
80X86/Pentium size in the
CPU
series,
and
M
which has is
IC package has only 40 pins, the the
CPU and M
as well as
the structure
shown
in
lines
AD
is
is
16
1
.32.
The data word
bits.
AD is used to attach IO devices
are shared (multiplexed) as indicated.
M to the 8085; there
Figure
Because the 8085's for transmitting addresses and data between
8 bits, while the address size
also a separate serial (two line)
IO
port.
The 8085 has
most complex arithmetic instructions are addition and subtraction of 8bit fixedpoint (binary and decimal) numbers. There are six 8bit registers designated B, C, D, E, H, and L, which, with the accumulator A, form a about 70 different instruction types.
generalpurpose
CPU
address registers.
register
file.
Its
The
registerpairs
A program counter PC
byte required from
M
in the usual
BC, DE, and
HL
serve as 16bit
maintains the address of the next instruction
manner. The 8085 also has stack pointer SP that
points to the top of a userdefined stack area in
M.
(a)
What
is
the
maximum
capacity
.
Data/ Address Address low high
10 devices
Serial
CHAPTER
Li
System bus (to
Serial
10
61 Control
M and 10)
Computers
port 8bit internal data bus
Accumu
C
B
lator
D
E
H
L
Status
A
register
SR
Program
Instruction
IR
register
control
SP
Stack pointer 8bit
ALU
*— 8— «— 8—
Program counter PC
»•
8/16bit register file
Figure 1.32 Structure of the Intel 8085 microprocessor.
Comment
Location
Instruction
ADDEC:
LXI
D,
NUM1
LXI
H,
NUM2 +
LOOP:
+ 16 16
Initialize address:
DE
:=
NUM1
+
16.
Initialize address:
HL
:=
NUM2 +
16.
MVI
C, 16
Initialize count:
LDAX ADC DAA
D
Load
M
A
MOV
M,A
Store data:
DCX DCX
D
Decrement address:
DE
:=
DE 
H
Decrement address:
HL
:=
HL 
DCR
C
Decrement count:
JNZ
LOOP
Jump
:=
data:
D
:=
C
:= 16.
M(DE).
A + CY + M(HL).
Convert sum
to
in
A
M(HL)
LOOP
if
Update
CY
flag.
to decimal. :=
C
A.
:=
Z*
C

1
1
1
Update
.
Z
flag.
1.
Figure 1.33
An 8085 program
to
add two 32digit decimal
of the 8085 's main (d) Identify three 1.32.
Consider the
memory?
(b)
What
is
integers.
the size of
PC?
(c)
What
is
the purpose of
common features of more recent microprocessors that the 8085
Intel
8085 described
in the preceding
problem.
SP?
lacks.
A taste of its software can
program ADDEC written in 8085 assembly language that performs the addition of two long (n digit) decimal numbers NUM1 and NUM2. The numbers are added two digits (8 bits) at a time using the instructions ADC (add with carry) and DAA (decimal adjust accumulator). ADC takes a byte from be found
in
Figure 1.33, which
lists a
M
and, treating
it
as an 8bit binary
1
Computing and
number, adds
it
and a carry
bit
CY
to the contents of
.
62
A
the
DAA
register.
then changes the binary
sum
in
A
to
binarycoded decimal form.
This calculation uses several flag bits of the status register SR: the carry flag
SECTION
1.6 is
References
set to
whenever
(0)
1
which
decrement the
is set
1.34.
8bit addition is
1
(0);
CY, which
and the zero
as
CISC
to
1
(0)
(non0), (a)
is
NUM2 is stored, 1.33.
from an
when the result of an arithmetic instruction such as add or From the information given here, determine the size n of numbers being added and the (symbolic) location in M where the sum NUM1 +
flag Z,
it
the 9th bit resulting
or
(b) Ignoring the size of the
RISC?
Justify
8085 's instruction
set,
would you
classify
your answers.
The performance of a 100 MHz microprocessor P is measured by executing 10,000,000 instructions of benchmark code, which is found to take 0.25 s. What are the values of CPl and MIPS for this performance experiment? Is P likely to be superscalar? Suppose
that a singlechip microprocessor
P operating at a clock frequency of 50 MHz
new model P which has the same architecture as P but has a clock frequency of 75 MHz. (a) If P has a performance rating of p MIPS for a particular benchmark program Q, what is the corresponding MIPS rating p for P ? (b) P takes is
replaced by a
250
s to
execute
,
Q in a particular personal computer system C. On replacing P by P in Q drops only to 220 s. Suggest a possible reason for this dis
C, the execution time of
appointing performance improvement.
What
CISC and RISC? Identify two key archiRISC and CISC machines, (b) When developing the RISC/6000, the direct predecessor of the PowerPC, IBM viewed the word RISC to mean "reduced instruction set cycles." Explain why this meaning might be more appropriate for the PowerPC than the usual one.
1.35. {a)
are the usual definitions of the terms
tectural features that distinguish recent
1.6
REFERENCES 1.
Albert, D. and D.
2.
Avnon. "Architecture of the Pentium Microprocessor." IEEE Micro,
1993) pp. 1121.
vol. 13 (June
Augarten, S. Bit by Bit:
An
New
Illustrated History of Computers.
York: Ticknor and
Fields, 1984. 3.
Barwise,
J.
and
A
Etchemendy. Turing's World 3.0: An Introduction
J.
Theory. Stanford,
CA: CSLI
to
Computability
Publications, 1993.
History of Mathematics. 2nd ed. New York: Wiley, 1989. MacDonald. Revolution in Miniature. The History and Impact of Semi
4.
Boyer, C. B.
5.
Braun, E. and
6.
conductor Electronics. 2nded. Cambridge, England: Cambridge University Press, 1982. Burks, A. W., H. H. Goldstine, and J. von Neumann. "Preliminary Discussion of the
S.
Logical Design of an Electronic Computing Instrument." Report prepared for U.S.
7.
Army
Ordnance Department, 1946. (Reprinted in Ref. 26, vol. 5, pp. 3479.) Cocke, J. and V. Markstein. "The Evolution of RISC Technology at IBM." IBM Journal of Research and Development, vol. 34 (January 1990) pp. 41 Cormen, T. H., C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, and McGrawHill, New York, 1990. Diefendorf, K., R. Oehler, and R. Hochsprung. "Evolution of the PowerPC Architec1
8.
9.
ture."
IEEE Micro,
1994) pp. 34^9.
G. "The Electronic Computer
ical Tables
11.
vol. 14 (April
at the Institute for Advanced Studies." Mathematand Other Aids to Computation, vol. 7 (April 1953) pp. 10814. Farrell, J. J. "The Advancing Technology of Motorola's Microprocessors and Microcomputers." IEEE Micro, vol. 4 (October 1984) pp. 5563.
10. Estrin,
12.
Garey,
M.
R. and D. S. Johnson. Computers
and
Intractability.
San Francisco: W. H.
63
Freeman, 1979. H. H. and J. von Neumann. "Planning and Coding Problems for an Electronic Computing Instrument." Part II, vols. 1 to 3. Three reports prepared for U.S. Army Ord
13. Goldstine,
nance Department, 19471948. (Reprinted
in Ref. 26, vol. 5, pp.
14.
Hwang, K. Advanced Computer Architecture.
15.
Morrison, P. and E. Morrison
16.
Morse,
New
(eds.).
New
York: McGrawHill, 1993.
Charles Babbage and His Calculating Engines.
Microprocessors: 8008 to 8086." Santa Clara,
CA:
1978.
Intel,
(Reprinted in Ref. 24, pp. 61546.) 17.
Motorola
Inc.
PowerPC 601 RISC Microprocessor
(Also published by
IBM
User's Manual. Phoenix, AZ, 1993.
VT, 1993). The Java Virtual Machine
Microelectronics, Essex Junction,
M. and M. Tremblay.
18.
O'Connor,
19.
(March/April 1997) pp. 4553. Patterson, D. "Reduced Instruction Set Computers." Communications of the 28, (January 1985) pp. 821.
20.
Poppelbaum, W.
ware."
J.
IEEE Micro,
"picoJavaI:
Hard
J. et al.
"Unary Processing." Advances
in
ACM,
vol.
Computers, vol. 26, ed. M.
New
York: Academic Press, 1985, pp. 4792. 21. Prasad, N. S. IBM Mainframes: Architecture and Design. Yovits.
in
vol. 17
New
York: McGrawHill,
1989. 22. Randell, B. (ed.)
The Origins of Digital Computers: Selected Papers. 3rd
ed. Berlin:
Springer Verlag, 1982. 23. Russell, R.
M. "The CRAY1 Computer System." Communications of the ACM,
vol. 21
(January 1978), pp. 6378. (Reprinted in Ref. 24, pp. 74352.) 24. Siewiorek, D. P., C. G. Bell, and A. Newell. Computer Structures: Readings and Exam
New York: McGrawHill, 1982. Swade, D. D. "Redeeming Charles Babbage' s Mechanical Computer." Scientific American, vol. 268 (February 1993) pp. 8691. 26. von Neumann, J. Collected Works, ed. A. Taub, 6 vols. New York: Pergamon, 1963. 27. Weiss, S. and J. E. Smith. Power and PowerPC San Francisco, CA: Morgan Kaufmann, ples.
25.
.
1994. 28. Weste, N.
and K. Eshragian. Principles of CMOS VLSI Design. 2nd
ed. Reading,
MA:
AddisonWesley, 1992. 29. Wilkes,
M. V. and
J.
"Microprogramming and the Design of Control CirComputer." Proc. Cambridge Phil. Soc, pt. 2, vol. 49
B. Stringer.
cuits in an Electronic Digital
(April 1953) pp. 23038. (Reprinted in Ref. 24, pp. 15863.)
1
Computing and Computers
80235.)
York: Dover, 1961. S. P. et al. "Intel
CHAPTER
CHAPTER
2
Design Methodology
This chapter views the design process for digital systems at three basic levels of abstraction: the gate, the register, and the processor levels. It discusses the nature of the design process, examines design at the register and processor levels in detail, and briefly introduces computeraided design (CAD) and analysis methods.
2.1
SYSTEM DESIGN A
computer
tion
—
an example of a system, which is defined informally as a colleccomplex one of objects called components, that are con
is
—
often a large and
nected to form a coherent entity with a specific function or purpose. The function of the system
is
determined by the functions of its components and
nents are connected.
We
how
the
compowhose
are interested in informationprocessing systems
is to map a set A of input information items (a program and its data, for example) into output information B (the results computed by the program acting on the data). The mapping can be expressed formally by a mathematical function/ from A to B. If/ maps element a of A onto element b of B, we write b = /(a) or b := f(a). We also restrict membership of A and B to digital or discrete quantities, whose
function
values are defined only at discrete points of time.
2.1.1
A
System Representation
useful
set
way
of objects
of modeling a system
V = {v,^^,...^,,}
whose members
are
(ordered)
(V!,v3 ),...,(vn _,,v„)}of all v,
64
to
node
v..
A
graph
is
such
is
a graph.
A
(directed)
graph consists of a
called nodes or vertices and a set of edges
pairs
pairs.
of nodes
The edge
e =
taken from the (v,,yp
set
E
{(v ,v 2 ), l
joins or connects node
often defined by a diagram in which nodes are repre
—
.
sented by circles, dots, or other symbols and edges are represented by lines: this diagram is synonymous with the graph. The ordering implied by the notation (v,,v ) may be indicated in the diagram by an arrowhead pointing from v, to v as, for instance, in Figure 2.1.
classes of objects: a set of informationlines
S
that carry
information signals
between components. In modeling the system by a graph G, we associate C with the nodes of G and S with the edges of G; the resulting graph is often called a block diagram. This name comes from the fact that it is convenient to draw each node (component) as a block or box in which its name and/or its function can be written. Thus the various diagrams of computer structures presented in Chapter 1 Figure 1.29, for instance are block diagrams. Figure 2.2 shows a block diagram representing a small gatelevel logic circuit called an EXCLUSIVEOR or modulo2 adder. This circuit has the same general form as the more abstract graph of Fig
—
2.
1
Two central properties of any system are its strucand behavior; these very general concepts are often confused. We define the structure of a system as the abstract graph consisting of its block diagram with no functional information. Thus Figure 2.1 shows the structure of the small system of Figure 2.2. A structural description merely names components and defines their interconnection. A behavioral description, on the other hand, enables one to deterStructure versus behavior.
ture
mine
for any given input signal a to the system, the corresponding output /(a).
define the function/to be the behavior of the system.
sented in
many
different ways. Figure 2.3
EXCLUSIVEOR
is
called a truth table.
behavior can be written in
follows, noting that /(a) =/(x,^c2 )
:
/(0,0) =
/(0,1)=1 /(1,0)
=
/(U)
=
Figure 2.1
A
graph with eight nodes and nine edges.
The behavior/may be
We
repre
shows one kind of behavioral description all possible combinations of Another description of the same terms of mathematical equations as
for the logic circuit of Figure 2.2. This tabulation of
inputoutput values
CHAPTER
2
Design
Methodology
The systems of interest comprise two processing components C and a set of
ure
65
1
66 *1
SECTION
AND
2.1
System Design
NOT •
OR
»o
x,
©
x2
NOT p x2 o
AND
"
Figure 2.2
A block diagram representing
The
structural
an
EXCLUSIVEOR
logic circuit.
and behavioral descriptions embodied
in
Figures 2.1 and 2.3 are
independent: neither can be derived from the other. The block diagram of Figure 2.2 serves as both a structural and behavioral description for the logic circuit in question, since from
we can
derive Figures 2.1 and 2.3. diagram conveys structure rather than behavior. For example, some of the block diagrams of computers in Chapter 1 identify blocks as being arithmeticlogic units or memory circuits. Such functional descriptions do not completely describe the behavior of the components in question; therefore, we cannot deduce the behavior of the system as a whole from the block diagram. If we it
In general, a block
need a more precise description of system behavior, we generally supply a separate narrative text, or a
more formal description such
as a truth table or a
list
of equa
tions.
Hardware
description languages.
As we have
seen,
we can
system's structure and behavior by means of a block diagram
—
—
fully describe a
the term schematic
diagram is also used in which we identify the functions of the components. We can convey the same detailed information by means of a hardware description language (HDL), a format that resembles (and is usually derived from) a highlevel programming language such as Ada or C. The construction of such description languages can be traced back at least as far as Babbage [Morrison and Morrison 1961]. Babbage's notation, of which he was very proud, centered around the use of special symbols such as —> to represent the movement of mechanical components. In modern times Claude E. Shannon [Shannon 1938] introduced Boolean algebra
Output
Input a x
\
x2
1
1
fia)
1
1
Figure 2.3 1
1
Truth table for the
EXCLUSIVEOR
function.
as a concise
and rigorous descriptive method for logic
1950s, academic and industrial researchers developed
circuits.
eventually evolved into a few widely used languages, notably
which were standardized
in the
Beginning
in the
many ad hoc HDLs. These
VHDL and Verilog,
1
1980s and 90s [Smith 1996; Thomas and Moorby
1996].
Hardware description languages such
as
VHDL have several advantages. They
can provide precise, technologyindependent descriptions of digital circuits
at vari
ous levels of abstraction, primarily the gate and register levels. Consequently, they are widely used for documentation purposes. Like programming languages, HDLs
can be processed by computers and so are suitable for use with computeraided design (CAD) programs which, as discussed later, play an important role in the
HDL
design process. For example, an
employed
to simulate the
On
been specified.
behavior of
the negative side,
P
P
description of a processor
can be
its
design have
HDL descriptions are often long
and verbose;
before
all
the details of
they lack the intuitive appeal and rapid insights that circuit diagrams and less for
mal descriptive methods provide. 2.1 VHDL DESCRIPTION OF A HALF ADDER. To illustrate the use HDLs, we give in Figure 2.4a a VHDL description of a simple logic component known as a half adder. Its purpose is to add two 1bit binary numbers x and y to form a 2bit result consisting of a sum bit sum and a carry bit carry. For example, if x = y= 1. the half adder should produce carry = 1, sum = 0, corresponding to the binary number
EXAMPLE of
10, that
two.
is,
A VHDL The is,
description has two
entity part
is
main
parts:
an entity part and an architecture
part.
a formal statement of the system's structure at the highest level, that
as a single component.
It
describes the system's interface, which
is
the "face" pre
sented to external devices but says nothing about the system's behavior or
its
internal
example the entity statement gives the half adder's formal name half_adder and the names assigned to its inputoutput (IO) signals; 10 signals are referred to in VHDL by their connection terminals or ports. Inputs and outputs are structure. In this
entity half_adder
port
is
(x.y: in bit;
sum. earn, out
Outputs
Inputs bit);
X
end halfjudder;
v
sum carry
sum architecture behavior of half_adder
is 1
1
half_adder
begin
1
1
sum alpha); NAND2: nand_gate port map (d => alpha, e => alpha. /=> end
carry);
structure:
(a)
half_adder
xor_circuit
XOR
c
nand_gate
nand_gate alpha
NAND1 /
c
NAND2 /
Figure 2.5 Half adder: (a) structural
states that
has
its
VHDL description;
(b)
block diagram.
half_adder has a component called
d, e,
NAND1
.
which
and /ports (terminals) mapped (connected)
is
of type nand_gate and
to the signals x, v,
and alpha,
respectively.
2.1.2
Design Process
Given a system's structure, the task of determining its function or behavior is termed analysis. The converse problem of determining a system structure that exhibits a given behavior
Design problem.
is
design or synthesis.
We can now state in broad terms the problem
facing the
com
puter designer or, indeed, any system designer.
Given a desired range of behavior and a set of available components, determine a structure (design) formed from these components that achieves the desired behavior with acceptable cost and performance. the new design's behavior is the overriding goal of the design process, other typical requirements are to minimize cost as measured
While assuring the correctness of
70
SECTION
2.1
System Design
by the cost of manufacture and to maximize performance as measured by the speed of operation. There are some other performance and costrelated constraints to satisfy such as high reliability, low power consumption, and compatibility with existing systems. These multiple objectives interact in poorly understood ways that depend on the complexity and novelty of the design. Despite careful attention to detail and the assistance of
new system
versions of a
often
fail to
CAD
tools, the initial
meet some design objective, sometimes
in
and hardtodetect ways. This failure can be attributed to incomplete specifications for the design (some mode of behavior was overlooked), errors made by human designers or their CAD tools (which are also ultimately due to human error), and unanticipated interactions between structure, performance, and cost. For example, increasing a system's speed to a desired level can make the cost unacsubtle
ceptably high.
The complexity of computer systems is such that the design problem must be broken down into smaller, easier tasks involving various classes of components. These smaller problems can then be solved independently by different designers or design teams. Each major design step is often implemented via the multistep or iterative process depicted by a flowchart in Figure 2.6. An initial design is created, perhaps in ad hoc fashion, by adapting an existing design of a similar system. The result is then evaluated to see if it meets the relevant design objectives. If not, the design is revised and the result reevaluated. Many iterations through the redesign and evaluation steps of Figure 2.6 may be necessary to obtain a satisfactory design. Computeraided design. The emergence of powerful and inexpensive desktop computers with good graphics interfaces provides designers with a range of programs to support their design tasks. CAD tools are used to automate, at least in
(
Begin J
Construct an initial
design
Evaluate
its
cost
and performance
Modify the design to meet the goals
Figure 2.6 Flowchart of an iterative design process.
part, the
ways
more tedious design and evaluation
steps
and contribute
tions or schematic diagrams,
such as
HDL descrip
which humans, computers, or both can
efficiently
process.
new
Simulators create computer models of a
•
design's behavior and help designers determine
design, which can
how
mimic
the
well the design meets vari
ous performance and cost goals. Synthesizers automate the design process itself by deriving structures that imple
•
ment
or part of
all
Editing
Some
is
some design
step.
the easiest of these three tasks, and synthesis the
most
difficult.
synthesis methods incorporate exact or optimal algorithms which, even
easy to program into resources.
Many
CAD
tools, often
if
demand excessive amounts of computing
synthesis approaches are therefore based on trialand error meth
ods and experience with earlier designs. These computationally efficient but inexact
methods are called heuristics and form the basis of most Design
levels.
The design of
a
complex system such
practical
as a
CA© t»«ls.
computer
is
carried
out at several levels of abstraction. Three such levels are generally recognized in
computer design, although they are referred
to
by various different names
in the
lit
erature: • •
•
The processor level, also called the architecture, behavior, or system The register level, also called the registertransfer level (RTL). The gate level, also called the logic level.
As Figure
we
level.
key component treated as The processor level corresponds to a user's or manager's view of a computer. The register level is approximately the level of detail seen by a programmer. The gate level is primarily the concern of 2.7 indicates
are
naming each
level for a
primitive or indivisible at that level of abstraction.
These three design levels also correspond roughly to the major subdivisions of integratedcircuit technology into VLSI, MSI, and SSI components. The boundaries between the levels are far from clearcut, and it is common to encounter descriptions that mix components from more than one level. the hardware designer.
It
Information
Level
Components
density
units
Time
Gate
Logic gates,
SSI
Bits
10' 2 to 10" 9
MSI
Words
lfr'toio^s
VLSI
Blocks of
ur'io
Register
flipflops.
Registers, counters,
combinational
units
circuits,
small sequential circuits.
Processor
CPUs, memories, 10
devices.
words
Figure 2.7
The major computer design
levels.
71
CHAPTER
CAD editors or translators convert design data into forms
•
important
in three
to the overall design process.
io's
s
2
Design
Methodology
72
section 2 System Design
A
few basic component types from each design level are listed in Figure 2.7. rec °g n i ze d as primitive at the gate level include AND, OR, NAND, NOR, and NOT gates. Consequently, the EXCLUSIVEOR circuit of Figure 2.2 is an example of a gatelevel circuit composed of five gates. The component marked XOR in Figure 2.5b performs the EXCLUSIVEOR function and so can be thought of as a more abstract or higherlevel view of the circuit of Figure 2.2, in which all internal structure has been abstracted away. Similarly, the halfadder block of Figure 2Ab represents a higherlevel view of the threecomponent circuit of Figure 2.5b. We consider a half adder to be a registerlevel component. We might regard the circuit of Figure 2.5b as being at the register level also, but because NAND is another gate type and XOR is sometimes treated as a gate, this circuit can also be viewed as gate level. Figure 2.7 indicates some further differences between the design levels. The units of information being processed increase in complexity as one goes from the gate to the processor level. At the gate level individual bits (Os and Is) are processed. At the register level information is organized into multibit words or vectors, usually of a small number of standard types. Such words represent numbers, instructions, and the like. At the processor level the units of information are blocks of words, for example, a program or a data set. Another important difference lies in the time required for an elementary operation; successive levels can differ by several orders of magnitude in this parameter. At the gate level the time required to switch the output of a gate between and 1 (the gate delay) serves as the time unit and typically is a nanosecond (ns) or less. A clock cycle of, say, 10 ns, is a commonly used unit of time at the register level. The time unit at the processor level might be a program's execution time, a quantity that can vary widely.
^
e
^°^ c £ ates
System hierarchy.
more complex
It is
customary
to refer to a
design level as high or low; the
the components, the higher the level. In this
book we
are primarily
concerned with the two highest levels listed in Figure 2.7, the processor and register levels, which embrace what is generally regarded as computer architecture. The ordering of the levels suggested by the terms high and low
A
component in any level L, from the level L, _ beneath ,
mally speaking, there
is
is,
in fact, quite strong.
is
equivalent to a (sub) system of components taken
it.
This relationship
a onetoone
is
illustrated in Figure 2.8. For
mapping h between components
in L,
and
t
system with levels of this type is called a hierarchical system. Thus in Figure 2.8 the subsystem composed of blocks 1, 3, and 4 in the lowlevel description maps onto block A in the highlevel description. Figdisjoint
subsystems
in level L,.,;a
2Ab and 2.5b show two hierarchical descriptions of a halfadder circuit. Complex systems, both natural and artificial, tend to have a welldefined hierarchical organization. A profound explanation of this phenomenon has been given by Herbert A. Simon [Simon 1962]. The components of a hierarchical system at each level are selfcontained and stable entities. The evolution of systems from ures
simple to complex organizations is greatly helped by the existence of stable intermediate structures. Hierarchical organization also has important implications in the design of computer systems. It is perhaps most natural to proceed from higher to
lower design levels because
this
sequence corresponds to a progression of succes
Thus if a complex system is to be designed using IC composed of standard cells, the design process might
sively greater levels of detail.
smallscale ICs or a single
consist of the following three steps.
73 x \
CHAPTER
*
2
1
2
Design 1
Methodology
A
5 1
r
4
3
l
*
5
to
(a)
Figure 2.8
Two
descriptions of a hierarchical system: (a) low level; (b) high level.
1.
Specify the processorlevel structure of the system.
2.
Specify the registerlevel structure of each component type identified in step
3.
Specify the gatelevel structure of each component type identified in step
This design approach
is
termed top down;
it
is
extensively used in both hardware
and software design.
If the
ICs or standard
then the third step, gatelevel design,
cells,
As might be Only
ferent.
foregoing system
is
1.
2.
to
be designed using mediumscale is
no longer needed.
expected, the design problems arising at each level are quite dif
in the
case of gatelevel design
is
there a substantial theoretical basis
(Boolean algebra). The register and processor levels are of most interest puter design, but unfortunately, design at these levels
is
in
com
largely an art that depends
skill and experience. In the following sections we examine design and processor levels in detail, beginning with the betterunderstood register level. We assume that the reader is familiar with binary numbers and with gatelevel design concepts [Armstrong and Gray 1993; Hayes 1993; Hachtel and Somenzi 1996], which we review in the next section.
on the designers' at the register
2.1.3
The Gate Level
Gatelevel (logic) design
is
concerned with processing binary variables whose pos
The design components and which are simple, memoryless processing elements, and flipflops,
sible values are restricted to the bits (binary digits)
are logic gates,
1
.
which are bitstorage devices. Combinational logic. A combinational film rum, also referred to as a logic, or a Boolean function, is a mapping from the set of 2" input combinations of n binary Such a function is denoted by r(.v,. v : and variables onto the output values 1
.
74
SECTION
xn ) or simply by z. The function z can be defined by a truth table, which specifies for every input combination (jc x 2 ,..., x n ) the corresponding value of z{x x ,..., 1 2 xn ). Figure 2.9a shows the truth table for a pair of threevariable functions, s (xq, v'o c_,) and c (xq, Vq, c_,), which are the sum and carry outputs, respectively, of a logic circuit called a full adder. This useful logic circuit computes the numerical ,
,
x
2.1
System Design
sum of its
three input bits using binary (base 2) arithmetic:
c& = xQ phisy plusc_ For example, the the
sum
When
three.
last
of three Is
row of
is
CqS
=
the truth table of Figure 2.9a expresses the fact that 11 2
,
OR
is,
we
the base2 representation of the
number
normally reserve the plus symbol (+) operation, and write out plus for numerical addition. We will
also use a subscript to identify the text; for
that
discussing logic circuits,
for the logical
(2.2)
]
example, twelve
is
will
number base when
denoted by 12 10
in
it is
not clear from the con
decimal and by
1
100 2
in binary.
A
combinational function z can be realized in many different ways by combinational circuits built from the standard gate types, which include AND, OR,
Outputs
Inputs
xo
co
Ci
>'o
EXCLUSIVEOR
Half adder
1
1
gate
5o
Half adder
OR
>o 1
1
1
1
1
1
1
1
1
gate
CO 1
1
1
NAND gate 1
1
1
1
NAND gate
1
used as an inverter 1
(b)
(a)
NOT gate (inverter)
x
'o
c i
+ Vo^'i +
*o>'o c i
c = (* + c_{)(xQ + y
whose
)(\' ()
+
(2.3)
*o?oCi
(24)
+ c_i)
structure also corresponds closely to that of the circuit.
By analogy with
ordinary algebra, (2.3) and (2.4) are referred to as sumofprochuts (SOP) and productofsums (POS) expressions, respectively. The circuit of Figure 2.9c is called a twolevel or depthtwo logic circuit because there are only
AND and one y
,
c_, to its
OR, along each
primary outputs
,v
two
gates,
one
path from this adder's external or primary inputs .
c
,
()
assuming each primary input variable
is
2
Methodology
if
1
CHAPTER Design
x
OR:
75
v,,.
avail
able in both true and inverted (complemented) form. The number of logic levels is defined by the number of gates along the circuit's longest 10 path. Because each
76
through s
some delay
gate imposes
The
D
it,
(typically
1
ns or so) on every signal that propagates
the fewer the logic levels, the faster the circuit.
half adderbased circuit of Figure 2.9b has
gates and so
is
10 paths containing up
considered to have four levels of logic. If
propagation delay, then the twolevel adder (Figure 2.9c)
same
twice as fast as the
is
However, the twolevel adder has more gates and
fourlevel design (Figure 2.9b).
so has a higher hardware cost.
to four
gates have the
all
A
basic task in logic design
to synthesize a gate
is
level circuit realization of a given set of combinational functions that achieves a
satisfactory balance between hardware cost as measured by the number of gates, and operating speed as measured by the number of logic levels used. Often the types of gates that may be used are restricted by IC technology considerations, for example, to NAND gates with five or fewer inputs per gate. The design of Figure 2.9d, which has essentially the same structure as that of Figure 2.9c, uses NAND and NOR gates instead of ANDs and ORs. In this particular case the primary inputs are provided in true (noninverted) form jc y c_, only; hence inverters are introduced to generate the inverted inputs x c_ y Computeraided synthesis tools are available to design circuits like those of Figure 2.9 automatically. The input to such a logic synthesizer is a specification of ,
,
,
.
,
x
the desired function, such as a truth table like Figure 2.9a, or a set of logic equations like (2.3) or (2.4); these are often
embedded
in a behavioral
HDL description.
Also given to the synthesizer are such design constraints as the gate types to use and restrictions on the circuit's interconnection structure. One such restriction is an upper bound on the number of inputs (fanin) of a gate G. Another is an upper bound on the number of inputs of other gates to which G's output line may connect; this
is
called the
(maximum) fanout of G. The output of
structural description of a logic circuit that
the synthesizer
is
a
implements the desired function and
meets the specified constraints as closely as possible. Exact methods for designing twolevel circuits like that of Figure 2.9c (or Figits inverters removed) using the minimum number of gates have long
ure 2.9' 3 ,> 2'>'i'>'o) an d computes their sum S = (s 3 ,s 2 ,S\,s y, it also accepts an input x
,
carry signal c_, and produces an output carry c 3
primitive
component
at the register level, as
internal structure or logic design
Flipflops.
may no
By adding memory
storage elements called flipflops, rely
2
on an external clock signal
This design, which
in
Chapter
4.
is
known
to a
we
CK to
.
A
multibit adder
shown Figure
is
treated as a
2.10b, at which point
its
longer be of interest.
combinational circuit
in the
form of
1bit
obtain a sequential logic circuit. Flipflops
synchronize the times
at
which they respond
as a ripplecarry adder, and other types of binary adders are
examined
in detail
*3
>3
x
y
x2
>'2
x
y
77 '
cu
Full adder Cnnr
x
cu
Full adder
y
c„
Full adder
c«
y
c
Design
Methodology
Full adder
S
c
(b)
Figure 2.10 Fourbit ripplecarry: (a) logic structure; (b) highlevel symbol.
changes on their input data lines. They are also designed to be unaffected by changes (noise) produced by the combinational logic that feeds them. An efficient way to meet these requirements is edge triggering, which conto
transient signal
narrow window of time around one edge (the CK. Figure 2. 1 1 summarizes the behavior of the most common kind of flipflop, an edgetriggered D {delay) flipflop. (Another wellknown flipflop type, the JK flipflop, is discussed in problem 2.1 1.) The output signal y constitutes the stored data or state of the flipflop. The D flipflop reads in the data value on its D line when the 0tol triggering edge of clock signal CK arrives; this D value becomes the new value of y. The triangular symbol on the clock's input port in Figure 2. la specifies edge triggering; its omission indicates level triggering, in which case the flipflop (then usually referred to as a latch) responds to all changes in signal value on D. Since there is just one triggering edge in each clock cycle, there can be just one change in y per clock cycle. Hence we can view the edgetriggered flipflop as trafines the flipflop's state
changes
to a
0tol or lto0 transition point) of
1
versing a sequence of discrete state values
The eral
input data line
D can
v(/),
one for every clock cycle
i.
be varied independently and so can go through sev
changes in any clock cycle
i.
However, only
before the arrival of the triggering edge of
To change the flipflop's state, the D period known as the setup time Tselup
CK
signal
the data value D{i) present just
determines the next state y{i +
must be held steady
for a
1).
minimum
before the flipflop is triggered. For examwhich shows a sample of the D flipflop's behavior, we have D(l) = 1 and v(l) = in clock cycle 1. At the start of the next clock cycle, y changes to 1 in response to D(l) = 1. making v(2) = 1. In clock cycle 3, y changes ple, in
Figure 2.1
lc,
CHAPTER 2
78 PRE
SECTION
—
D
Data
2.1
Input Di >
System Design
State
1
>CK
Clock
i
)
1
1
Next state vO'+l)
CLR 1
(a)
(b)
Triggen ng edge
T 'setup
Glitch /
1
\
Clock
/
CK 1
1
1
1
DataD
II
Time
State y
Cycle
/
1
i
2
1
DataD(/)
4
3
State >(/)
6
5
D
1
1
1
1
1
1
(c)
Figure 2.11
D
flipflop: (a) graphic
back
to 0,
during the
symbol; (b)
making y(3) = critical
0.
state table; (c)
Even though
setup phase of cycle
the spurious pulse or glitch affecting
D=
timing diagram.
1
for
most of clock cycle
3,
D(3) =
thus ensuring that y(4) = 0. Observe that
3,
D in
cycle 5 has no effect on
y.
Hence edge
triggered flipflops have the very useful property of filtering out noise signals
appearing
When itly
at their inputs.
a flipflop
brought to a
(reset) the flipflop at the start
is first
known
asynchronously, that
of operation.
control inputs,
switched on.
initial state. It is
CLR
To
(clear)
its
state
is,
PRE
is
uncertain unless
it
is
explic
independently of the clock signal CK,
this end, a flipflop
and
y
therefore desirable to be able to initialize
can have one or two asynchronous
(preset), as
shown
in Figure 2.11a.
designed to respond to a brief input pulse that forces y to 1 in the case of PRE.
in the case
of
Each
CLR
is
or to
normal synchronous operation with a clock that is matched to the timing its flipflops, we can be sure that one welldefined change of state takes place in a sequential circuit during each clock cycle. We do not have to worry about the exact times at which signals change within the clock cycle. We can therefore consider the actions of a flipflop, and hence of any sequential circuit employIn
characteristics of
ing
it,
to
occur
at
a discrete sequence of points of time
/=
1, 2, 3, ...
In effect, the
clock quantizes time into discrete, technologyindependent time steps, each of
which represents a clock cycle. We can then describe a behavior by the following characteristic equation:
y(/+l) = D(/)
D
flipflop's nextstate
(2.5)
which simply says that y takes the value of D delayed by one clock cycle, hence the D flipflop's name. Figure 2.1 \b shows another convenient way to represent the flipflop's nextstate behavior. This state table tabulates the possible values of the next state y{i + 1) for every possible combination of the present input D(i) and the present state y(i). It is not customary (or necessary) to include clocksignal values explicitly in characteristic
equations or state tables. The clock
of time steps and so
always present
is
is
considered to be the implicit generator
in the
background. Asynchronous inputs are
also omitted as they are associated only with initialization.
A
Sequential circuits.
and a
set
sequential circuit consists of a combinational circuit
of flipflops. The combinational logic forms the computational or data
processing part of the circuit. The flipflops store information on the circuit's past behavior; this stored information defines the circuit's internal state
mary inputs Y,
are
X and the
denoted Z(X,Y).
primary outputs are Z, then
Z is
which the
flipflops
the resulting circuit is said to be clocked or synchronous.
Each
X and
can also trigger changes
in the
primary output
tance of state behavior, the term finitestate machine
Z
change
state;
tick (cycle or
Y
period) of the clock permits a single change in the circuit's state it
the pri
usual to supply a sequential circuit with a precisely con
It is
trolled clock signal that determines the times at
above;
Y. If
a function of both
as discussed
Reflecting the impor
(FSM)
is
often applied to a
sequential circuit.
The behavior of
a sequential circuit can be specified by a state table that
primary outputs and
its
internal states. Figure
2.12a shows the state table of a small but useful sequential
circuit, a serial adder,
includes the possible values of
which
is
intended to add two unsigned binary numbers X, and
sum Z =
length, producing their is,
bit
by
its
bit,
and the
X
result is also
{
plus
X
2
.
produced
The numbers
X2
of arbitrary
are supplied serially, that
serially. In contrast, the
combinational
Input x 1*2
00
Present state
S (y =
01
0)
5
.0
5
S,(y=l)
s
.i
S,,0
.1
Next
Present
state
output
10
5
11
.1
s,.o
5,.0
5,.l
D nipHop
Clock
(a)
Figure 2.12 (a) State table; (b) logic circuit for a serial adder.
79
CHAPTER
2
Design
Methodology
1
1
80
adder of Figure
2.
10
gation delays, adds
^
1
a "parallel" adder, which, ignoring
is
its
1
internalsignal propa
of the input numbers simultaneously. In one clock cycle
all bits
se " a ^ adder receives 2 input bits Xy(i) and x 2 (i) and computes 1 bit z(i) of Z It computes a carry signal c(i) that affects the addition in the next clock cycle. Thus the output computed in clock cycle i is '"'
System Design
1
also
c(i)z(i)
where
c(i

= x^Oplus x2 (i)plus
c(i

must be determined from the adder's present
1)
(2.6)
1)
state S(i).
Observe
that
(2.6) is equivalent to the expression (2.2) for the fulladder function defined earlier.
follows that two possible internal states exist: 5
meaning that the previous carry 1. These considerations lead to the twostate state table of Figure 2.12a. An entry in row 5(0 and column x^x^i) of the state table has the format S(i + 1), z(i), where S(i + 1) is the next internal state that the circuit must have when the present state is 5(0 and the present primary input combination is x (i)x2 (i); z(i) is the corresponding primary output signal that must be generated. Because the serial adder has only two internal states, its memory consists of a single flipflop storing a state variable y. There are only two possible ways to assign 0s and Is to y. We select the "natural" state assignment that has y = for 5 and y = 1 for S since this equates >(/) with the stored carry signal c(i  1). Assume It
signal c(i

=
1)
0,
and S
x
,
meaning
that c(i

1)
,
=
l
,
x
that
we
use an edgetriggered
tional logic
ary output signal D(i) that
behavior
is
D
C then must generate defined by
is
its
The combina
flipflop (Figure 2.11) to store y.
two
signals: the
applied to the
D
primary output
flipflop's data input.
characteristic equation (2.5); that
and a second
z(i)
The +
y(i
is,
flipflop's 1)
=
D(i).
Hence we have D(i) = c(0 It
follows from the above discussion that
C can
be implemented directly by a
full
sum output is z and whose carry output is D; see Figure 2.12b. Before entering two new numbers to be added, it is necessary to reset the serial adder to the 5 state. The easiest way to do so is to adder circuit such as that of Figure 2.9b, whose
apply a reset pulse to the flipflop's asynchronous clear (CLR) input.
Example
2.2 involves a similar, but
onstrates the use of
example
2.2
more complex sequential
circuit
and dem
CAD tools in its design.
design of a 4bitstream serial adder. Consider another
type
of serial adder that adds four number streams instead of the two handled by a conventional serial adder (Figure 2.12). The new adder has four primary input lines jc,, x 2 x 3 x4 and a single primary output z. To determine the circuit's state behavior often the most difficult part of the design process we first identify the information to be stored. As in the standard serial adder case, the circuit must remember carry information computed in earlier clock cycles. The current 2bit sum SUM(i) = c(i)z(i) is given by
—
—
SUM(i) = x {i)plus x 2 (i)plus x3 (/)/?/us x4 (i)plus
c(i
,
,
1)
x
and If c(i  1) is = 4 = 100 2 so c(i) = 10 2 With c(i  1) = 10 2 SUM{i) becomes 6 = 1 10 2 making c{i) = 1 2 Finally, c(i  1) = 1 makes SUM(i) = 1 1 2 and c(i) = 1 2 which is the maximum possible value of c. 2 The carry data to be stored is a binary number ranging from 00 2 to 1 2 which implies
where
c(i

each xfi) =
1) is the 1,
carry
computed
then SUM(i) =
1
plus
in the
1
plus
preceding clock cycle. 1
plus ,
,
1
plus
,
.
,
,
.
1
1
1
1
needs four states and two flipflops. We will denote the four states by S 2 S3 where 5, represents a stored carry of (decimal) value i. Figure 2.13a shows the adder's state table, which has four rows and 16 columns. For present state S(i) and input combination j, the nextstate/output entry Sk ,z is obtained by adding i 2 and the 4 input bits that determine 7 to form SUM(i) = (k 2 k k ) 2 It follows that k = (k 2 k ) 2 and z = k For example, with present state S 2 and present input plus 1 plus 1 plus 1 plus 10 2 = 101 2 so z = 1 and A: =10 2 = 2, making S, 7, SUM(i) = that the adder
5
,
5,,
il
,
,
CHAPTER
Methodology
.
]
.
i
,
the nextstate. Following this pattern,
With
table.
D
it is
straightforward to construct the adder's state
+ l)y2 (i + I) coincide with the flipThe adder thus has the general structure shown in
flipflops, the nextstate values >',(/
flops' data input values
D
(i)D 2 (i).
{
Figure 2.13£>.
A
truth table for the combinational logic
C
appears in Figure 2.13c.
It is
derived
from Figure 2.13a with the states assigned the four bit patterns of >',y 2 as follows: S = 00, 5, = 01, S 2 = 10, and 5 3 = 11. Suppose we want to design Cas a twolevel directly
Present inputs x x 2 x i x4 (decimal) l
2
1
Present state
4
3
6
5
7
9
8
14
15
S,,0 S,,l
S,.l
S 2 ,0
S 2 .0
S 2 .0
S2
,
s2
,
s2
,
s3
,
s3
,
s3
.
s3
.
10
11
S,,0
S„l
S,,l
S 2 .0
S,,
12
So
S ,0
S„.
s,.o
S
S,.0
S,.0
S,.l
S
5
s
s,,o
s,.o
S,.l
S,,0
S,,
S,,
S 2 ,0
S,,0 S„
S>
S,,0 S,.l
S,,l
s 2 ,o
S,.
S 2 .0
S 2 ,0
S2
.
S,.l
S 2 .0
S 2 ,0
S 2 .l
s2
.
S3
S„
S 2 ,0
s2
s2
s2
s2
s3
,
s2
s2
s2
S3.0
s2
.
i
,
1
1
1
5 : .0
S
,
1
.
.
.
1
1
,
1
1
,
1
1
1
S[,0
1
.
.
.
1
1
.
13
1
1
1
(a)
A"
Present
Present
Secondary
Primary
inputs
state
outputs
output
j
X 2 Xy X4
>'l
>'2
£»,
D2
z
Combinational logic
C 1
1
2
3
CK
,
33 34
10101 01101 10011 01011 00111 11111 11111 11111 11111 1111111111 1111 1111 1111
35 36 37 38 39 40 111—1 41 1111— 42 Ill 43 —111 44 1—11 45 1 11 46 —11147 11148 1—1149 11150 111
— —
51
11—1.
001 001 001 001 001 010 010 010 010 010 001 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
Figure 2.14 Minimal twolevel (SOP) design puted by ESPRESSO.
e
AX
\
y 2 D,
£> 2 z
for
C com
minimum number of gates. Manual minimizamethods [Hayes 1993] are painfully slow in this case without computer aid. We have therefore used a logic synthesis program called Espresso [Brayton et al. 1984; Hachtel and Somenzi 1996] to obtain a twolevel SOP design. To instruct Espresso to compute the minimumcost SOP design on a UNIXbased computer requires issuing a circuit like that of Figure 2.9c, using the
tion
command
like
^espresso seradd4 where
seradd4
is
a
file
containing the truth table of Figure 2.13c or an equivalent
description of C. Espresso responds with the table of Figure 2.14, which specifies an
SOP
design containing the fewest product terms (these are in a minimal form called
prime implicants [Hayes 1993]), format
in this case, 51.
x x2 x 3 x4 y y 2 ]
i
D D z=
states that output z (but not the outputs
product terms. The dash in 1010in the
D, or (1
conclude from Figure 2.14 that an
adder has 5
1
10101
D2
001
has x x 2 x i x 4 y 2 as one of
)
1
  1 
y lt
100) states that x x 2 y
realization of
product terms, none of which happen to be shared
C
is i
a term of
for the fourstream
among
the output func
This conclusion implies a twolevel circuit containing the equivalent of
—
chosen
that is not included i
SOP
its
i
indicates a literal, in this case
term in question. Similarly, row 51
D y We tions.
1
2
i
For example, row 26, which has the
—
at least
54 gates (51 ANDs and three ORs), some especially the OR gates with very high fanin, which makes this type of twolevel design expensive and impractical for many IC technologies. Example 2.6 in section 2.2.3 shows an alternative approach that leads to a lowercost, multilevel design for this adder.
Minimizing the number of gates
in a sequential circuit is difficult
because
affected by the flipflop types, the state assignment, and, of course, the
which the combinational subcircuit
C is
it
way
is
in
designed. Other design techniques exist to
simplify the design process at the expense of using
more
logic elements.
It
is
impractical to deal with complete binary descriptions like state tables tain
more
dozen
than, say, a
states.
if they conConsequently, large, sequential circuits are
designed by heuristic techniques whose implementations use reasonable but nonminimal amounts of hardware [Hayes 1993; Hachtel and Somenzi 1996]. These
designed
circuits are often best
at the
more
abstract register level rather than the
gate level.
2.2
THE REGISTER LEVEL At the
grouped into
register or registertransfer level, related information bits are
ordered sets called words or vectors. The primitive components are small combinational or sequential circuits intended to process or store words.
2.2.1
RegisterLevel
Components
Registerlevel circuits are
composed of wordoriented
devices, the
more important
of which are listed in Figure 2.15. The key sequential component, which gives this level of abstraction
Other
common
its
name,
is
a (parallel) register, a storage device for words.
sequential elements are shift registers and counters.
A
number of
standard combinational components exist, ranging from generalpurpose devices,
such as word gates, to more specialized
Type Combinational
Sequential
circuits,
such as decoders and adders.
Component
Functions
Word
Logical (Boolean) operations.
gates.
Multiplexers.
Data routing: general combinational functions.
Decoders and encoders.
Code checking and conversion.
Adders.
Addition and subtraction.
Arithmeticlogic units.
Numerical and logical operations.
Programmable
General combinational functions.
logic devices.
(Parallel) registers.
Information storage.
Shift registers.
Information storage; serialparallel conver
Control/timing signal generation.
Counters.
Programmable
logic devices.
Figure 2.15
The major component types
at the register level.
General sequential functions.
83
CHAPTER 2 Design
Methodology
84
SECTION
components
Registerlevel
groups of
are linked to
form
by means of wordcarrying
circuits
lines, referred to as buses.
2.2
The Register Level
Types.
The component types of Figure 2.15
level design; they are available as
VLSI
cells in
design
libraries.
MSI
are generally useful in register
parts in various IC series
However, they cannot be
and as standard
identified a priori based
on
some property analogous to the functional completeness of gatelevel operations. For example, we will show that multiplexers can realize any combinational function. This completeness property is incidental to the main application of multiplexwhich is signal selection or path switching. There are no universally accepted graphic symbols for registerlevel components. They are usually represented in circuit diagrams by blocks containing an
ers,
A single signal line in a
abbreviated description of their behavior, as in Figure 2.16.
diagram can represent a bus transmitting m > 1 bits of information in parallel; m is indicated explicitly by placing a slash (/) in the line and writing m next to it (see Figure 2.16). A components's 10 lines are often separated into data and control lines.
An
may be
mbit bus
given a
name
that identifies the bus's role, for
the type of data transmitted over a data bus.
operation determined by the line in
its
A
control line's
1
value.
A
small circle representing inversion
when is
its
placed
lines at
assume
1
state. Alternatively, the
name of
a signal
Unless
the logi
an input or output
port of a block to indicate that the corresponding lines are active in the inactive in the
example,
indicates the
active, enabled, or asserted state.
otherwise indicated, the active state of a bus occurs cal
name
whose
state
active value
and is
includes an overbar. fall into two which specify one of several possible operations that the unit is to perform, and enable lines, which specify the time or condition for a selected operation to be performed. Thus in Figure 2.16, to perform some operation F first set the select line F to a bit pattern denoting F and then activate the edgetriggered enable line £by applying a Oto1 edge signal. Enable lines are often connected to clock sources. The output control signals, if any, indicate when or how the unit completes its processing. Figure 2.16 indicates termination by 5 = 0. The arrowheads are omitted when we can infer signal direction from the circuit structure or signal names.
The
input control lines associated with a multifunction block
broad categories: select
x
lines,
,
{
Data input Ai i
/f
Function select
lines Ai
A?
m /T
mX
k
F
Enable
E Control
Control
output lines
input lines
Z,
Z2
Data output
lines
Figure 2.16 Generic block representation of a registerlevel
component.
2
is concerned with combinational funcfrom the twovalued set B = {0,1} and form a Boolean algebra. We can extend these functions to functions whose values are taken m m from B the set of 2 mbit words, rather than from B. Let z(x ,x2 ,...,xn ) be any twovalued combinational function. Let X ,X2 ,...,Xn denote mbit binary words having the form X, = (x i,xi^,...,xi^) for / = 1,2, ...,«. We define the word opera
Operations. Gatelevel logic design
tions
whose
signal values are
,
x
x
it
tion z as follows:
z(X ,X2 ,...,Xn ) = [z(x l { ,x2 u l
.
.
.,xn
l
)^(x l ,x22 ,.
.
.,xn2),. .,zix ljn ,x2jn ,. .
This definition simply generalizes the usual Boolean operations,
and so have
from
forth,
1bit to mbit
X +X2 + +Xn = (* l
•
which applies
lfl •'
words.
If z is the
+ x2A +
OR
(2.7)
AND, NAND,
function, for instance,
+
X \jn + xljm
.,x njn)]
.
xnA s h2 + x 22 + + '•• + xn,m)
+
x n2
we
,
OR bitwise to the corresponding bits of n mbit words. mn
The
2
combinational functions defined on n mbit words forms a Boolean algebra with respect to the word operations for AND, OR, and NOT. This generalization of Boolean algebra to multibit words is analogous to the extension of the ordinary algebra from single numbers (scalars) to vectors. Pursuing this analogy, we can treat bits as scalars and words as vectors, and obtain more comset
of
all
2
plex logical operations, such as
yX=(yx ,yx2 l
y+
Wordbased level design.
X = (y + x
,
...,yxj
,y + x 2 ...,y + ,
]
xj
(2.8)
some aspects of registerHowever, they do not by themselves provide an adequate design the
logical operations of this type are useful in
ory for several reasons. •
The operations performed by some
basic registerlevel
components are numeriBoolean frame
cal rather than logical; they are not easily incorporated into a
work. •
Many
of the logical operations associated with registerlevel components are
—
complex and do not have the properties of the gates interchangeability of inputs, for example that simplify gatelevel design. Although a system often has a standard word length w based on the width of some important buses or registers, some buses carry signals with a different number of bits. For example, the outcome of a test on a set 5 of vvbit words (does S have property PI) is 1 bit rather than w. The lack of a uniform word size
—
•
for all signals
on these
makes
it
difficult to define a useful algebra to describe operations
signals.
Lacking an adequate general theory, registerlevel design is tackled mainly with heuristic and intuitive methods. We next introduce the major combinational and sequential components used in design
at the register level.
(Refer to Figure 2.15).
85
CHAPTER 2 Design
Methodology
86
SECTION
Word gates. Let X = (x u x 2 ,...,xm ) and Y = (yi,y 2 ,...,y„,) be two mbit binary As noted already, it is useful to perform gate operations bitwise on X and Y obtain another mbit word Z = (zi,z 2 ,...,Z m ). We coin the term wordgate opera
words. 2.2
The Register Level
to
tions for logical functions of this type. In general, if
write bit
Z=f(X,Y)
if z, =/(jc,,y,)
for
i
=
l,2,...,m.
/is any logic operator, we Z= XY denotes the m
For example,
NAND operation defined by Z=(z
NAND
This generalized
l
is
,z 2 ,...,z m )
= (x y ,x 2 y 2 l
,
1
...,x m y m )
realized by the gatelevel circuit in Figure 2.17a.
NAND
represented in registerlevel diagrams by the twoinput 2.17b, which
is
an example of a word gate.
It is
It is
symbol of Figure
also useful to represent scalar
vector operations by a single gate symbol. For example, the operation y +
X
defined by (2.8) and realized by the circuit of Figure 2.18a can be represented by the registerlevel gate
Word
symbol of Figure 2.18b.
gates are universal in that they suffice to implement any logic circuit;
moreover, wordgate circuits can be analyzed using Boolean algebra. In practice,
however, the usefulness of word gates
is
severely limited by the relative simplicity
of the operations they perform and by the variability in word size found ister level.
*i
y\
x2
>'2
m
X
Y
,'
/m
V z (b)
(a)
Figure 2.17 Twoinput, mbit
NAND word gate:
(a) logic
diagram and
symbol.
(b)
X
y
m /
/
1
Z (b)
(a)
Figure 2.18
OR word gate
implementing y + X:
(a) logic diagram; (b)
symbol.
at the reg
A
Multiplexers.
multiplexer
several sources to a
common
specified by applying
is
appropriate control (select) signals to the multiplexer. If the
data sources
is
from one of
a device intended to route data
is
destination; the source
k and each 10 data line carries
as a kinput (or kway), mbit multiplexer.
maximum number
of
m
It is
bits, the multiplexer is referred to convenient to make k = 2 P so that ,
determined by an encoded pattern or address of p bits. the range 00...0, 00...1, ..., 11...1 = 2 P  1. A multiplexer is easily denoted by a suitably labeled version of the generic block symbol of Figure 2.16; the tapered block symbol shown in Figure 2.19, where the narrow data source selection
is
The 2P addresses then cover
end indicates the data output side, is also common. Let a = 1 when we want to select the mbit input data bus X, = (jc,^*, ,..., x i,m\) °f me multiplexer of Figure 2.19. Then a = 1 when we apply the word corresponding to the binary number i to the select bus 5. The binary variable a, denotes the selection of input data bus X, a, is not a physical signal. The data word on X, is then transferred to Z when e = 1. The operation of the 2^input wbit multiplexer is therefore defined by m sumofproduct Boolean equations of the form (
{
t
—
+ x lj a l +
Zj= (x0j a
•••
+x 2 p_i
j
a 2p_ )e
for;'
=
i
0,
1,
...,m
(2.9)
1
or by the single wordbased equation
Z=
(X a +
X
l
a +
{
l
a 2Pi)e
Figure 2.20 shows a typical gatelevel realization of a twoinput, 4bit multiplexer. Several &input multiplexers can be used to route more than k data paths by
connecting them in the treelike fashion shown in Figure 2.21.
A
glevel tree cir
forms a ^input multiplexer. A distinct select line is associated with every level of the tree and is connected to all multiplexers in that level. Thus each level performs a partial selection of the data line X, to be connected to the cuit of this type
output Z. Multiplexers as function generators. Multiplexers have the interesting property that they can
versal
logic
compute any combinational function and so form a type of
uni
MUX
can
generator.
Specifically,
a
2"input,
1bit
generate any ^variable function z(v ,v,,...,v„_,). This
is
multiplexer
accomplished by apply
ing the n input variables v ,v,,...,v n _, to the n select Ymes s ,s ,...,s n _ of MUX, and 2" functionspecific constant values (0 or 1) to MUX's 2" input data lines .v ,.v, ]
Data
in
%2P
X, 1
i
P Select S
\
mr
. .
.
m
)r
1
Multiplexer
(MUX) Enable e
Figure 2.19 Data out
Z
A
2 /'input, mbit multiplexer.
]
87
CHAPTER
2
Design
Methodology
»
Data
SECTION
2.2
x o.o n
in
x *i.o
ft
*o.
x i,
1
*0,
1
x 0.3
^1.2
2
x
l.l
Select s
The Register Level Enable e
Data out
z
Figure 2.20 Realization of a twoinput, 4bit multiplexer.
Data
in
An
—
—AToMux T7
^'
Select
•
>> z
In each case a bit of stored data
new
data bit x
ister consists
is
of
brought in
1
bit at
at the
O x)
is lost
:=
( z m\' Z m2>' Z \' Zo)
from one end of the
other end. In
its
shift register,
while a
simplest form, an mbit shift reg
m flipflops each of which is connected to its left or right neighbor.
Data can be entered (read)
' (^ml' Z m2'' 2 l' Zo)
Z m\' Z m2'' Z
1
bit at a
time
at
one end of the register and can be removed
a time from the other end; this process
Figure 2.30 shows a 4bit shift register built from
accomplished by activating the
SHIFT enable
line
called serial inputoutput.
is
D
flipflops.
connected
of each flipflop. In addition to the serial data lines,
m
to the
A
right shift
clock input
is
CK
input or output lines are
often provided to permit parallel data transfers to or from the shift register. Additional control lines are required to select the serial or parallel input
ther refinement
is
to permit both left
4
and
rightshift operations.
,'
1
2way 5
LOAD
,
multiplexer
X 4 /
LOAD CLOCK
>
Register
CLEAR
Z (*)
Figure 2.29
A 4bit D
register with parallel load: (a) logic diagram; (b) symbol.
Z
CHAPTER
2
Design
Methodology
Registers like those of Figures 2.28 and 2.29 are designed so that external data
can be transferred to or from
95
modes.
A
fur
96
SECTION
2.2
The Register Level
SHIFT
CLEAR (a)
SHIFT Shift register
CLEAR
(b)
Figure 2.30
A 4bit,
rightshift register: (a) logic
diagram; (b) symbol.
components
Shift registers are useful design
in a
number of
applications,
including storage of serial data and serialtoparallel or paralleltoserial data con
They can
version.
numbers, because
The
two.
also be used to perform certain arithmetic operations
most computers include
instruction sets of
A
Counters.
counter
line.
The k
shift operations.
a sequential circuit designed to cycle through a prede
is
termined sequence of k distinct states 5
on an input
on binary
corresponds to multiplication (division) by
left (right) shifting
states represent
,
5,
,
.
.
.
,
Sk _
j
in
response to signals
( 1
pulses)
k consecutive numbers, so the state transitions
can be described by the statement
M :=
S Each
1
5,
plus
(modulo
1
k)
viewed as depending on the
input increments the state by one; the circuit can therefore be
counting the input
number codes
Is.
Counters come in
used, the modulus
many
different varieties
and the timing mode (synchronous or asynchro
k,
nous).
Figure 2.31 shows a counter designed to count
ENABLE input 2",
and
it
The counting states S n S,
line.
has 2"
COUNT =
Sj,
control line
DOWN.
is
,
'21
and the count sequence In the upcounting S, +1 := 5,
1
modulo2"; that
mode
plus
1
is,
The output either
is
pulses applied to the counter's
is
its
COUNT
modulus k =
an nbit binary number
up or down, as determined by the
(DOWN= 0), the counter's behavior is (modulo
2")
97
COUNT ENABLE
Modulo2" updown
CLEAR
CHAPTER 2 Design
counter
DOWN
Methodology
Figure 2.31
A modulo2'
COUNT
whereas
in the
downcounting mode 5, +1 := S
In
some counters modulusselect
(DOWN =
minus
1
1),
1
updown
counter.
the behavior
(modulo
becomes
2")
control lines can alter the modulus; such counters
termed programmable. Counters have several applications in computer design. They can store the state of a control unit, as in a program counter. Incrementing a counter provides an efficient means of generating a sequence of control states. Counters can also generare
ate timing signals
and introduce precise delays into a system.
Buses. A bus is a set of lines (wires) designed to transfer all bits of a word from a specified source to a specified destination on the same or a different IC; the source and destination are typically registers. A bus can be unidirectional, that is, capable of transmitting data in one direction only, or it can be bidirectional. Although buses perform no logical function, a significant cost is associated with them, since they require logic circuits to control access to them and, when used over longer distances, signal amplification circuits (drivers and receivers). The pin
requirements and gate density of an IC increase rapidly with the number of external
buses connected to
must also be taken
To reduce
If these
it.
buses are long, the cost of the wires or cables used
into account.
costs,
buses are often shared, especially
A shared bus
when
they connect
many
can connect one of several sources to one of several destinations. Bus sharing reduces the number of connecting lines but requires more complex buscontrol mechanisms. Although shared buses are relatively devices.
is
one
that
cheap, they do not permit simultaneous transfers between different pairs of devices,
which
is
possible with unshared or dedicated buses.
explored further in Chapter
2.2.2
Bus
structures are
7.
Programmable Logic Devices
Next we examine a class of components called programmable logic devices or PLDs, a term applied to ICs containing many gates or other generalpurpose cells whose interconnections can be configured or "programmed" to implement any desired combinational or sequential function [Alford 1989]. PLDs are relatively easy to design and inexpensive to manufacture. They constitute a key technology for building applicationspecific integrated circuits (ASICs). Two techniques, are
used to program PLDs: mask programming, which requires a few special steps in
98
SECTION
the IC chipmanufacturing process, and field
programming, which
is
done by
designers or end users "in the field" via small, lowcost programming units. 2.2
The Register Level
PLDs
fieldprogrammable
Some
same IC can be reproespecially convenient when developing
are erasable, implying that the
grammed many
times. This technology is and debugging a prototype design for a new product.
Programmable a
PLD
arrays.
The connections leading to and from logic elements in programmed to be permanently
contain transistor switches that can be
switched on or switched
off.
These switches are
laid out in
minimum IC
twodimensional arrays
The programmable x denoting a programmable connection or crosspoint in a gate's input line. The absence of an x means that the corresponding connection has been programmed to the off (disso that large gates can be implemented with
logic gates of a
connected)
PLD
area.
array are represented abstractly in Figure 2.32b, with
state.
The gate structures of Figure 232b can be combined in various ways to implement logic functions. The programmable logic array (PLA) shown in Figure 2.33 is
intended to realize a set of combinational logic functions in minimal
It
consists of an array of
AND
gates (the
AND plane),
uct terms (prime implicants), and a set of
OR
various logical sums of the product terms.
The
which
gates (the
OR
inputs to the
SOP
form.
realize a set of prod
plane),
AND
which form
gates are pro
grammable and include all the input variables and their complements. Hence it is possible to program any desired product term into any row of the PLA. For example, the top row of the PLA in Figure 2.33 is programmed to generate the term x 2 x 3 x 4 y y 2 which is used in computing the output D 2 the last row is programmed to generate x x 2 y for output D,. The inputs to the OR gates are also programmable, so each output column can include any subset of the product terms produced by the rows. The PLA in Figure 2.33 realizes the combinational part C of the 4bitstream adder specified in Figure 2.13. The AND plane generates the 51 sixvari,
\
}
x
x
able product terms according to the
SOP
design given in Figure
:L>(a)
X
X
x
x2
\
>
*\
(b)
Figure 2.32
AND and OR gates:
(a)
normal notation;
(b)
PLD notation.
*2
2. 14.
v V Y 99
AND plane
w
vy /^
w w
•
w
s< ?
J
i_
Register
—
Instruction
1
1
1
i
Pr ogram cc >n tro
Datapath Control signal s
un it
(
Iunit )
(Eunit)
Figure 2.46 Internal organization of a
The
CPU
is
CPU and cache
memory.
a synchronous sequential circuit
puter's basic unit of time. In one clock cycle the
operation, such as fetching an instruction
ing
whose clock period
is
the
com
CPU can perform a registertransfer
word from
M via the system bus and load
into the instruction register IR. This operation can be expressed formally
it
IR
where
PC
is
the
program counter
the
:=
by
M(PC);
CPU
uses to hold the expected address of the
next instruction word.
Once
actions needed for
execution; for example, perform an arithmetic operation on
its
data words stored in
CPU
in the Iunit,
registers.
The
an instruction
is
decoded
Iunit then issues the
signals that enables execution of the instruction in question. fetching, decoding, and executing an instruction constitutes
to
determine the
sequence of control
The entire process of the CPU's instruction
cycle.
Memories.
CPUs and
other instructionset processors operate in conjunction
with external memories that store the programs and data required by the processors.
Numerous memory technologies
of operation.
exist,
and they vary greatly
and perspeed several major
in cost
memory device generally increases rapidly The memory part of a computer can be divided into
formance. The cost of a
with
its
subsystems: 1.
Main memory M,
consisting of relatively fast storage ICs connected directly
and controlled by, the CPU.
to,
2.
3.
Secondary memory, consisting of less expensive devices that have very high storage capacity. These devices often involve mechanical motion and so are much slower than M. They are generally connected indirectly (via M) to the CPU and form part of the computer's 10 system. Many computers have a third type of memory called a cache, which is positioned between the CPU and main memory. The cache is intended to further reduce the average time taken by the or
all
of the cache
may be
CPU
integrated on the
memory
to access the
same IC chip
as the
system.
CPU
Some
itself.
M
Main memory is a wordorganized addressable randomaccess memory (RAM). The term random access stems from the fact that the access time for every location in
memory ries are
use
M
the same.
is
Random
access
is
contrasted with serial access, where
access times vary with the location being accessed. Serial access
slower and less expensive than
some form of
serial access.
RAMs; most secondarymemory
Because of
their
memodevices
lower operating speeds and
access mode, the manner in which the stored information
is
organized
in
serial
secondary
memories is more complex than the simple word organization of main memory. Caches also use random access or an even faster memoryaccessing method called associative or content addressing. Memory technologies and the organization of stored information are covered in Chapter 6.
IO devices.
Inputoutput devices are the means by which a computer
nicates with the outside world.
A
primary function of 10 devices
is to
commu
act as data
from one physical representation to 10 devices do not alter the information content or meaning of the data on which they act. Since data is transferred and processed within a computer system in the form of digital electrical signals, input (output)
transducers, that
is,
to convert information
another. Unlike processors,
devices transform other forms of information to (from) digital electrical signals.
Figure 2.47 involve.
lists
Many
some widely used 10 devices and
the information
media they
of these devices use electromechanical technologies; hence their
is slow compared with processor and mainmemory speeds. Although the CPU can take direct control of an IO device it is often under the immediate control of a specialpurpose processor or control unit that directs the flow of information between the IO device and main memory. The design of 10 systems is considered in Chapter 7.
speed of operation
Interconnection networks.
Processorlevel
wordoriented buses. In systems with
components
communicate
many components, communication may
by be
controlled by a subsystem called an interconnection network; terms such as switching network, communications controller, and bus controller are also used in this context.
The function of
munication paths
among
reasons, these paths are
network is to establish dynamic comcomponents via the buses under its control. For cost usually shared. Only two communicating devices can
the interconnection
the
access and use a shared bus at any time, so contention results when several system components request use of the bus. The interconnection network resolves such contention by selecting one of the requesting devices on some priority basis and connecting it to the bus. The interconnection network may place the other requesting devices in a queue.
117
CHAPTER 2 Design
Methodology
118
Type
SECTION
Medium
2.3
IO
device
Output
Input
The Processor Level
to/from which
Analogdigital converter
X
Analog (continuous)
CDROM
drive
X
Characters
Document scanner/reader
X
Images on paper
Dotmatrix display panel
device
electrical signals
^nd coded
images) on optical disk
Images on screen
X
Keyboard/keypad
IO
transforms digital electrical signals
Characters on keyboard
X
Images on paper
Laser printer
X
Loudspeaker
X
Spoken words and sounds
Magneticdisk drive
X
X
Characters (and coded images) on magnetic disk
X
Characters (and coded images) on magnetic tape
Magnetictape drive
X
Microphone
X
Spoken words and sounds
Mouse/touchpad
X
Spatial position
on pad
Figure 2.47
Some
IO devices.
representative
Simultaneous requests for access to some unit or bus result from the fact that communication between processorlevel components is generally asynchronous in that the components cannot be synchronized directly by a common clock signal. This synchronization problem can be attributed to several causes. •
A
high degree of independence exists
CPUs and IOPs quently and •
•
at
among
the components. For example,
execute different types of programs and interact relatively infre
unpredictable times.
Component operating speeds vary over a wide range. CPUs operate from 1 to 10 times faster than mainmemory devices, while mainmemory speeds can be many orders of magnitude faster than IOdevice speeds. The physical distance separating the components can be too large to permit synchronous transmission of information between them.
one of the functions of a processor such as a CPU or an IOP. An bus to which many IO devices are connected. The IOP is responsible for selecting a device to be connected to the IO bus and from there to main memory. It also acts as a buffer between the relatively slow IO devices and the relatively fast main memory. Larger systems have special processors whose sole function is to supervise data transfers over shared buses.
Bus
IOP
control
controls a
is
common IO
2.3.2 ProcessorLevel
Processorlevel design ister level.
This
is
due
Design is
less
desired system behavior.
programs supplied
to
amenable
to
formal analysis than
is
design
at the reg
in part to the difficulty of giving a precise description of the
it is
To of
say that the computer should execute efficiently little
help to the designer. The
common
approach
all
to
at this level is to take a prototype design of known performance and modify where necessary to accommodate new technologies or meet new performance requirements. The performance specifications usually take the following form:
design it
• • • •
The computer should be capable of executing a instructions of type b per second. The computer should be able to support c memory or 10 devices of type d. The computer should be compatible with computers of type e. The total cost of the system should not exceed/
Even when a new computer
is closely based on a known design, it may not be posperformance accurately. This is due to our lack of understanding of the relation between the structure of a computer and its performance. Performance evaluation must generally be done experimentally during the design pro
sible to predict
its
by computer simulation or by measurement of the performance of a copy of the machine under working conditions. Reflecting its limited theoretical basis, only a small amount of useful performance evaluation can be done via math
cess, either
ematical analysis [Kant 1992].
We
view the design process as involving two major and adapt it to satisfy the given performance constraints. Then determine the performance of the proposed system. If unsatisfactory, modify the design and repeat this step; continue until an acceptable design is obtained. This conservative approach to computer design has been widely followed and accounts in part for the relatively slow evolution of computer architecture. It is rare to find a successful computer structure that deviates substantially from the norm. The need to remain compatible with existing hardware and software standards also influences the adherence to proven designs. Computer owners are understandably reluctant to spend money retraining users and programmers, or Prototype structures.
steps: First select a prototype design
replacing welltested software.
The systems of interest here are generalpurpose computers, which differ from in the number of components used and their autonomy. The
one another primarily
communication structures used is fairly small. We by means of block diagrams that are basically graphs Figure 2.48 shows the structure that applies to firstgeneration com
variety of interconnection or will represent these structures
(section 2.1.1).
puters and
many
small,
modern microprocessorbased systems. The
addition of
specialpurpose 10 processors typical of the second and subsequent generations
Central
M
CPU
ocessing unit
System
Main memory
ICN
bus
IO devices
D,
D2
D*
Figure 2.48 Basic computer structure
is
119
CHAPTER
2
Design
Methodology
120 Central
SECTION
CPU
processing unit 2.3
Main memory
The Processor Level Cache
CM
memory
System
ICN
bus
IO processors
IOP,
IOP,
Figure 2.49 IO
Computer with cache and IO
devices
processors.
Here ICN denotes an interconnection (switching) network memoryprocessor communication. Figure 2.50 shows a prototype structure employing two CPUs; it is therefore a multiprocessor. The uniprocessor systems of Figures 2.48 and 2.49 are special cases of this structure. Even more complex structures such as computer networks can be obtained by linking several
shown
in Figure 2.49.
that controls
copies of the foregoing prototype structures.
Performance measurement.
Many
derived from the characteristics of
Central
processing units
CPU,
—
CPU,
its
performance figures for computers are
CPU. As observed
— Main memory
Cache memories
CM,
CM,
M,
M2
Crossbar switching
ICN
network
IO processors
IOP,
IOP2
IOP„
D,
D2
D3
D*
IO devices
Figure 2.50
Computer with multiple CPUs and main memory banks.
in section 1.3.2,
CPU
speed can be measured easily, but roughly, by
its
clock frequency /in megahertz.
Other, and usually better, performance indicators are
MIPS, which
is
the average
instruction execution speed in millions of instructions per second, and CPI, is
the average
number of
CPU
clock cycles required per instruction.
which
As discussed
performance measures are related to the average time microseconds (us) required to execute N instructions by the formula in section 1.3.2, these
7*
in
NxCPI Hence
the average time tE to execute an instruction t
E
=T/N=
is
CPI/f us
While / depends mainly on the IC technology used to implement the CPU, CPI depends primarily on the system architecture. We can get another perspective on tE by considering the distribution of instructions of different types and speeds in typical program workloads. Let /,, I2 ..., /„ ,
be a
set
of representative instruction types. Let
(us) of an instruction of type /,
/,
and
let
p denote i
is
denote the average execution time the occurrence probability of type
Then
instructions in representative object code.
time t E
f,
the average instruction execution
given by
I PA
us
(2.20)
from the CPU specifications, but accumust usually be obtained by experiment. The set of instruction types selected for (2.20) and their occurrence probabilities define an instruction mix. Numerous instruction mixes have been published that represent various computers and their workloads [Siewiorek, Bell, and Newell 1982]. Figure 2.51 gives some recent data collected for two representative
The
/,
figures can be obtained fairly easily
rate Pj data
Probability
ol
occurrence
Program A
Program B
Instruction type
(commercial)
(scientific)
Memory
0.24
0.29
store
0.12
0.15
Fixedpoint operations
0.27
0.15
Floatingpoint operations
0.00
0.19
Branch
0.17
0.10
Other
0.20
0.12
Memory
load
Figure 2.51 Representative instructionmix data. Source: McGrory, Carlton, and Askins 1992.
121
CHAPTER
2
Design
Methodology
122
SECTION
2.3
The Processor Level
programs running on computers employing the HewlettPackard PARISC architecture under the UNIX operating system [McGrory, Carlton, and Askins 1992]. The execution probabilities are derived from counting the number of times an instruction of each type is executed while running each program; instructions from both the application program and the supporting system code are included in this count. Program A is a program TPCA designed to represent commercial online transaction processing. Program B is a scientific program FEM that performs finiteelement modeling. In each case, memoryaccess instructions (load and store) account for more than a third of all the instructions executed. The computationintensive scientific program makes heavy use of floatingpoint instructions, whereas the commercial program employs fixedpoint instructions only. Conditional and unconditional branch instructions account for 1 in 6 instructions in program A and for 1 in 10 instructions in program B. Other published instruction mixes suggest that as many as 1 in 4 instructions can be of the branch type. A few performance parameters are based on other system components, especially memory. Main memory and cache size in megabytes (MB) can provide a rough indication of system capacity. A memory parameter related to computing
maximum rate in millions of bits per second which information can be transferred to or from a memory unit. Memory bandwidth affects CPU performance because the latter' s processing speed is ultimately limited by the rate at which it can fetch instructions and data from its cache or main memory. Perhaps the most satisfactory measure of computer performance is the cost of executing a set of representative programs on the target system. This cost can be the total execution time T, including contributions from the CPU, caches, main memory, and other system components. A set of actual programs that are representative of a particular computing environment can be used for performance evaluation. Such programs are called benchmarks and are run by the user on a copy (actual or simulated) of the computer being evaluated [Price 1989]. It is also useful to devise artificial or synthetic benchmark programs, whose sole purpose is to obtain data for performance evaluation. The program TPCA providing the data for program A in Figure 2.51 is an example of a synthetic benchmark. speed
is
(Mb/s)
bandwidth, defined as the
at
EXAMPLE 2.8 PERFORMANCE COMPARISON OF SEVERAL COMPUTERS [MCLELLAN 1993]. Figure 2.52 presents some published data on the performance of three machines manufactured by Digital Equipment Corp. in the early 1990s, its 64bit Alpha microprocessor. The SPEC (Standard Performance Evaluation Cooperative) ratings are derived from a set of benchmark programs that computer companies use to compare their products. The SPECint92
based on various versions of
and SPECfp92 parameters indicate instruction execution speed relative to a standardized 1MIPS computer (a 1978vintage Digital VAX 11/780 minicomputer) when executing benchmark programs involving integer (fixed point) and floatingpoint
Hence the SPEC figures approximate MIPS measurements two major classes of application programs like those of Figure 2.51. The remaining data in Figure 2.52 are relative performance figures for executing some other wellknown benchmark programs, most aimed at scientific computing. Data of this sort are better suited to measuring relative rather than absolute performance. For example, suppose we wish to compare the performance of the Digital 3000 and 10000 machines listed in Figure 2.52. The ratio of their SPECint92 MIPS numbers is 104.3/63.8 = 1.65. The corresponding ratios for the other five benchmarks range operations, respectively.
for
123
DEC 3000
DEC 4000
DEC
Performance measure
Model 400
Model 610
Model 610
CPU
133
160
200
clock frequency
(MHz)
10000
CHAPTER
2
Design
Methodology
Cache
size
(MB)
0.5
4
1
SPECint92
63.8
81.2
104.3
SPECfp92
112.2
143.1
200.4
114
155
Linpack 1000 x 1000
90
BM suite
18.1
22.9
Cernlib
16.9
21.0
26.0
Livermore loops
18.7
22.9
28.1
Perfect
28.6
Figure 2.52 Performance comparison of three computers based on the Digital Alpha processor.
Source: McLellan 1993.
from 1.50
to 1.79, suggesting that the Digital
Digital 3000.
Note also
Queueing models. In order
we
10000
is
about twothirds faster than the
that the ratio of their clock frequencies is
outline an approach based
200/133 =
1.50.
to give a flavor of analytic performance modeling, on queueing theory. The origins of this branch of
applied probability theory are usually traced to the analysis of congestion in tele
phone systems made by the Danish engineer A. K. Erlang (18781929)
Our treatment
is
quite informal; the interested reader
is
in 1909.
referred to [Allen 1980;
Robertazzi 1994] for further details.
The queueing model
that
we
case depicted in Figure 2.53; this sons.
It
will consider is the singlequeue, singleserver is
known
represents a "server" such as a
CPU
as the
M/M/l model
for historical rea
or a computer with a set of tasks (pro
be executed. The tasks are activated or arrive at random times and are until they can be processed or "serviced" by the CPU on a firstcome firstserved basis. The key parameters of the model are the rate at which tasks requiring service arrive and the rate at which the tasks are serviced, both meagrams) queued
to in
memory
sured in tasks/s. The mean or average arrival and service rates are conventionally denoted by A (lambda) and p (mu), respectively. The actual arrival and service rates vary randomly around these mean values and are represented by probability
Shared resource "
1r
Items
Queue
Server
Serviced items
Figure 2.53 Simple queueing Quel leing sy stem
model of
;.
computer.
124
_____., SECTION
The latter are chosen to approximate the actual behavior of the system being modeled; how well they J do so must be determined by J observation and distributions.
_ , 1.5
The Processor Level
measurement. The symbol p (rho) denotes A/p and represents the mean utilization of the server, that is, the fraction of time it is busy, on average. For example, if an average of two tasks arrive per second (X = 2) and the server can process them at an average rate of eight tasks per second (p = 8), then p = 2/8 = 0.25. The arrival of tasks at the system is a random process characterized by the interarrival time distribution
p
cess 1
—named — which
after
840)
for
the
defined as the probability that
{t)
x
arrives during a period of length
The M/M/l case assumes
t.
This exponential distribution has
toward
at a rate
1
one task
French mathematician SimeonDenis Poisson (1781
the probability distribution Pl (t)
steadily
at least
a Poisson arrival pro
p
(t)
is
= le~h
when
=
x
determined by
t
=
0.
As
increases,
t
p
x
(t)
increases
Exponential distributions characterize
X.
randomness of many queueing models quite well. They are also mathematically and lead to simple formulas for various performancerelated quantities of interest. It is therefore usual to model the behavior of the server (the service process) by an exponential distribution also. Let p s (t) be the probability that the service required by a task is completed by the CPU in time t or less after its removal from the queue. Then the service process is characterized by the
tractable
p s (t)=\e^' Various performance parameters can characterize the steadystate performance of the singleserver queueing system under the foregoing assumptions. •
The
utilization
p =
A/p of the server, that
is,
the average fraction of time
it is
busy. •
The average number of
queued
tasks
in the system, including tasks waiting for
service and those actually being served.
length and
is
denoted by /Q
.
/
•
The average time
is
is
called the
mean queue
Q = p/(1P)
that arriving tasks
and being served, which
The parameter
can be shown [Robertazzi 1994] that
It
spend
called the
/q are related directly as follows.
An
(2.21) in the system,
both waiting for service
mean waiting time tQ The .
average task
X passing
quantities
r
and
Q through the system
under steadystate conditions should encounter the same number of waiting tasks when it enters the system as it leaves behind when it departs from the system
/q
left behind is Xtq, which is the number of tasks system at rate X during the period tQ when X is present. Hence we conclude that / = Xtq, in other words, Q
after
being serviced. The number
that enter the
Q=
t
Equation (2.22)
is
tems, not just the
(2.22)
Iq/X
called Little's equation.
It is
M/M/l model. Combining t
Q
= l/(p  X)
valid for all types of queueing sys
(2.21) and (2.22),
we
get
(2.23)
The
quantities
/
Q and tQ
refer to tasks that are either waiting for access to the
The mean number of
server or are actually being served.
queue excluding those being served
denoted by
is
/
w
,
while
tasks waiting in the
w
r
denotes the
mean
W
time spent waiting in the queue, excluding service time. (The subscript stands for "waiting.") The mean utilization of the server in an M/M/l system, that is, the
mean number of yields
w
/
tasks being serviced,
is
X/\i;
hence subtracting
this
from
/
Q
:
^2
'w =
'o
P =
(2.24)
\l(\LX)
Similarly
'w = 'o 
where 1/p
we
mean time
the
is
w
see that
f
=
w/X;
l
it
1/M
=
(2.25)
H(Hk)
takes to service a task.
Comparing
(2.24) and (2.25)
therefore, Little's equation holds for both the
Q
and the
W
subscripts.
To is
illustrate the
use of the foregoing formulas, consider a server computer that
way that can be approximated by the M/M/l model. Arriving memory until they are fully executed in one step by the
processing jobs in a
jobs are queued in main
CPU, which
therefore
is
minute, and the computer questions:
What is
New
the server. is,
jobs arrive
on the average,
idle
at
an average rate of 10 per
25 percent of the time.
We ask two
T that each job spends in the computer? What is main memory that are waiting to begin execution?
the average time
the average
number of jobs
To
we assume
N in
from which it follows busy 75 percent of the time, p = X/]i = 0.75. We are given that X = 10 jobs/min; hence the service rate p. is 40/3 jobs/min. Substituting into (2.23) yields T= t = 1/(40/3  10) = 0.3 min. From Little's equaQ tion, N=l = Xt = 3; hence by (2.25), / w = 3  0.75 = 2.25 jobs. Q Q answer,
that
T is
tq,
and
EXAMPLE An
/
.
Since the system
is
ANALYSIS OF SHARED COMPUTER USAGE [ALLEN
2.9
company has staff.
N is w
that steadystate conditions prevail,
a
computer system with a single terminal
that is
1980].
shared by
its
A
small
engineering
average of 10 engineers use the terminal during an eighthour work day, and
each user occupies the terminal for an average of 30 minutes, mostly for simple and routine calculations. since the system that
it
is
is
The company manager
idle
overutilized, since they typically wait an hour or
minal; they want the manager to purchase
We
will
feels that the
now
computer
is
underutilized,
an average of three hours a day. The users, however, complain
new
more
to gain access to the ter
terminals and add them to the system.
attempt to analyze this apparent contradiction using basic queueing the
ory.
Assume that the computer and its users are adequately represented by an M/M/l queueing system. Since there are 10 users per eight hours on average, we set X = 10/8 users/hour = 0.0208 users/min. The system is busy an average of five out of eight hours; hence the utilization p = 5/8, implying that u = 1/30 = 0.0333. Substituting these values for X and u into (2.25) yields fw = 50 mm, which confirms the users' estimate of their average waiting time for terminal access.
The manager
is
now convinced
agrees to buy enough to reduce
many new
r
w
that the
from 50
terminals should he buy?
We
company needs additional terminals and The question then arises: How
to 10 min.
can approach
this
problem by representing
125
CHAPTER 2 Design
Methodology
M/M/l queueing system. Let m be the make t w < 10 or, equivalents, tn < 40. The arriving users are assumed to divide evenly into m queues, one for each terminal. The arrival rate X* per terminal is taken to be X/m = 0.0208/m users/min. If, as indicated
126
SECTION
2 4
each terminal and
its
minimum number
of terminals needed to
users by an independent
CPU
above, the computer's
few additional terminals should we assume that each ter= 0.0333 users/min. To meet the desired perfor
lightly utilized, then a
is
not affect the response time experienced at a terminal*, hence
minal's
mance
mean
goal,
service rate
we
t*
from which
it
is
u* =
p.
require
= l/(u*  X*) = i/(n  X/m) < 40
Q
follows that
m
>
2.5.
Hence
minals should be acquired. This result
is
three terminals are needed, so
two new
ter
pessimistic, since the users are unlikely to
form three separate queues for three terminals or to maintain the independence of the queues by not jumping from one queue to another whose terminal has become available. Nevertheless, this simple analysis gives the useful result that
m
should be 2 or
3.
2.4
SUMMARY The
problem facing the
central
circuit,
digital
system designer
is to
a devise a structure (a
network, or system) from given components that exhibits a specified
at minimum cost. Various and behavior, including block diagrams (for structure), truth and state tables (for behavior), and HDLs (for behavior and structure). Computer systems can be viewed at several levels of abstraction, where each level is determined by its primitive components and information units. Three levels have been presented here: the gate, register, and processor levels, whose components process bits, words, and blocks of words, respectively. Design at all levels is a complex process and depends heavily on CAD tools. The gate level employs logic gates as components and has a welldeveloped theory based on Boolean algebra. A combinational circuit implements logic or Boolean functions of the form z{x x2 ..., xn ), where z and the x,'s assume the values and 1 The circuit can be constructed from any functionally complete set of gate types such as {AND, OR, NOT} or {NAND}. Every logic function can be realized by a twolevel circuit that can be obtained using exact or heuristic minimization techniques. Sequential circuits implement logic functions that depend on time; unlike combinational circuits, sequential circuits have memory. They are
behavior or performs a specified range of operations
methods
exist for describing structure
,
,
x
.
from gates and 1bit storage elements (flipflops) that store the circuit's state and are synchronized by means of clock signals. Registerlevel components include combinational devices such as word gates, built
multiplexers, decoders, and adders, as well as sequential devices such as (parallel)
and counters. Various generalpurpose programmable elePLAs, ROMs, and FPGAs. Little formal theory exists for the design and analysis of registerlevel circuits. They are often described by HDLs whose fundamental construct is the registertransfer statement
registers, shift registers,
ments also
exist, including
cond:
Z
:=
F,(X ,X2 ,...,X 1
it
);
denoting the conditional transfer of data from registers
F
a combinational processing circuit
datapath unit and a control unit. struct a formal
(HDL)
The
.
{
X
l
,X2 ,...,Xk
Z via
step in registerlevel design
is
to con
description of the desired behavior from which the
compo
first
nents and connections for the datapath unit can be determined.
needed
to register
Registerlevel circuits often consist of a
The
logic signals
to control the datapath are then identified. Finally, a control unit is
designed
that generates these control signals.
The components recognized at the processor level are CPUs and other procesmemories, 10 devices, and interconnection networks. The behavior of processorlevel systems is complex and is often specified in approximate terms using sors,
average or worstcase behavior. Processorlevel design
A
of prototype structures.
prototype design
is
is
heavily based on the use
selected and modified to
meet the
given performance specifications. The actual performance of the system evaluated, and the design
is
further modified until a satisfactory result
is
is
then
achieved.
Typical performance measures are millions of instructions executed per second
(MIPS) and clock cycles per
mance evaluation
instruction (CPI).
A few analytical methods for perfor
—notably queueing theory —but
exist
their usefulness is limited.
Instead, experimental approaches using computerbased simulation or performance
measurements on an actual system are used extensively.
2.5
PROBLEMS 2.1.
Explain the difference between structure and behavior in the digital system context.
your answer by giving
lustrate
(a) a purely structural description
and
havioral description of a half subtracter circuit that computes the 1bit difference
2.2.
x
v
(a)
Following the example of Figure
Il
(b) a purely be
d=
and also generates a borrow signal b whenever x < y. 2.4, construct a behavioral
VHDL description of
the fulladder circuit of Figure 2.9b. (b) Following Figure 2.5, construct a structural
VHDL description of the full 2.3.
adder.
Construct both structural and behavioral descriptions
OR circuit appearing in Figure 2.4.
Figure 2.54 describes a half adder bols for the logic operations
in
VHDL of the EXCLUSIVE
2.2.
in the
widely used Verilog
AND, OR, EXCLUSIVEOR,
HDL. The
and
Verilog sym
NOT are &, \ and ~. I.
respectively, (a) Is this description behavioral or structural? (b) Construct a similar description in Verilog for a full adder.
module halfjudder Input x assign
.
s
(xQ v
y y output
s
;
=x
assign c = x
A
,
,
y
&y
,
s
c
,
c o)'
;
;
;
Figure 2.54
endmodule
Verilog description of a half adder.
127
CHAPTER
2
Design
Methodology
128
Outnuts
Inputs
SECTION
2.5
Kl
y{
*i
4
bi
1
1
Problems 1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
Figure 2.55 1
1
1
Truth table of a
2.5.
Assign each of the following components to one of the three major design levels cessor, register, or gate
bers
a
jV,
and
N2
number AO
N to N.
(d)
.
(b)
—and
your answers,
justify
An identity circuit that outputs a
are the same;
it
1
if all its
otherwise, (c)
outputs a
—
pro
A multiplier of two nbit num
(a)
n inputs (which represent
A negation circuit that converts
A firstin firstout (FIFO)
the order received;
2.6.
full subtracter.
it
also outputs the
memory, that stores a sequence of numbers numbers in the same order.
in
Certain very smallscale ICs contain a single twoinput gate. The ICs are manufactured
—
—
in three varieties NAND, OR, and EXCLUSIVEOR as indicated by a printed label on the ICs package. By mistake, a batch of all three varieties is manufactured without their labels, (a) Devise an efficient test that a technician can apply to any IC from this batch to determine which gate type it contains, (b) Suppose the batch of unlabeled ICs contains NOR gates, as well as NAND, OR, and EXCLUSIVEOR. Devise an efficient
testing procedure to determine each
2.7.
ICs
gate type.
Construct a logic circuit implementing the 1bit
(full)
subtracter defined in Figure 2.55
using as few gates as you can.
2.8. {a)
Obtain an efficient allNAND realization for the following four variable Boolean
function:
f
x
(a,b,c,d)
= a(b + c)d + a(b + d)(b + c)(c + d)+ b c d
(b) Construct an efficient 2.9.
the 3bit
and 2.10.
allNOR design ioxf
x
Design a twolevel combinational
sum of two
2bit binary
numbers. The
sumofproducts style that computes
circuit
is
to
be implemented using
AND
OR gates.
Consider the
D
flipflop of Figure 2.11. (a)
the flipflop's state y. (b) This flipflop it
{a,b,c,d).
circuit in the
triggers
on the positive
(rising or
is
Explain
why
to 1)
edge of the clock CK.
triggered flipflop triggers on the negative (falling or indicated by placing an inversion bubble at the
Redraw
the y part of Figure 2.
1 1
CK
of one. The J input
is
D
1
to 0)
A
negative edge
edge of CK, which
input like that at the
y
is
output.
for a negative edgetriggered flipflop.
2.11. Figure 2.56 defines a 1bit storage device called a
triggered clocking as the
the glitch does not affect
said to be positive edgetriggered because
flipflop of Figure 2.
activated to store a
1
JK flipflop. 1 1
It
has the same edge
but has two data inputs instead
in the flipflop; that
is,
JK =
10 sets y =
129 Set
00
 yCK
Clock Reset
Inputs
JK
01
10
J
— K
CHAPTER
11
ioio
State >'(')
1
(a)
Next y('
Methodology state
+
D
(b)
Figure 2.56
JK
graphic symbol; (b) state table.
flipflop: (a)
1.
Similarly, the
sets
>•
to 0.
input
activated to store a
is
1 1
always changes, or toggles, the
JK
flipflop,
flop and a 2.12.
K
Derive a
What Show how to
state, (a)
analogous to (2.5)? (b)
few
synchronous sequential
unsigned number
N of arbitrary
length
causing the circuit to output serially the number intuitive 2.13.
An SC
JK =
is,
01 re
the state unchanged, while
is
JK =
the characteristic equation for a
build a
JK
flipflop
from
a
D
flip
NAND gates.
state table for a
An
menter.
in the flipflop; that
The input combination JK = 00 leaves
meaning of each
state
and identify the
circuit that acts as a serial increis
N+
entered serially on input line \
on
its
output line
z.
x,
Give the
reset state.
alternative to a state table for representing the behavior of a sequential circuit
whose nodes denote states is a state diagram or state transition graph, {S^Sj,...^} and whose edges, which are indicated by arrows, denote transitions between states. A transition arrow from 5, to & is labeled XJZV if, when SC is in state 5, and input Xu is applied, the (present) output Z v is produced and SC's next state is Sj. (a) Construct a state table equivalent to the state diagram for SC appearing in Figure 2.57. (b) How many flipflops are needed to implement 5C?
2.14.
Design the sequential flipflops
output
and
line.
SC whose behavior is defined in SC has a single primary input line
circuit
gates.
Your answer should include
few gates and 2.15.
NAND
flipflops as
you can
in
D
complete logic diagram for SC. Use as
Implement the sequential circuit SC specified in JK flipflops (see problem 2.11) and NOR SC and use as few gates and flipflops as you can.
the preceding problem, this time gates. Derive a logic
diagram
for
Design a serial subtracter analogous to the serial adder. The subtracter's inputs are two unsigned binary numbers n and n 2 the output is the difference n,  n 2 Construct ;
x
Reset
Figure 2.57 State
Figure 2.57 using
and a single primary
your design.
using
2.16,
a
diagram for a sequential
circuit
SC.
2
Design
11
.
130
a state table, an excitation table, and a logic circuit that uses
JK
flipflops
and
NOR
gates only.
SECTION problems
2.5
2.YI
.
Design a sequential length by
The
N
3.
is
circuit that multiplies
2.18.
An
emerges
result representing 3/V
your
struct a state table for
flops and
an unsigned binary number
entered serially via input line x with serially
from the
its
circuit's output line
and give a complete logic
circuit
N of arbitrary
least significant bit first.
circuit that uses
z.
D
Conflip
NAND gates only. is
functional completeness, which ensures that a
all
types of digital computation, (a)
important property of gates
complete gate
adequate for
set is
serted that functional completeness
is
It
irrelevant at the register level
has been as
when
dealing
with components such as multiplexers, decoders, and PLDs. Explain concisely
why
Suggest a logical property of sets of such components that might be
this is so. (b)
substituted for completeness as an indication of the components' general usefulness in digital design.
2.19.
Give a brief argument supporting your position.
Redraw the gatelevel multiplexer circuit of Figure 2.20 at the register level using word gates. Use as few such gates as you can and mark all bus sizes. Observe that a signal such as e that fans out to
rying the wbit
word£ =
m
lines
can be considered to create an mbit bus car
(e,e,...,e).
2.20. Figure 2.55 gives the truth table for a full subtracter,

Xj
>',
 b_ i
where b
, i
t
d where b denotes ,
{
{
_
x
denotes the borrowin
the borrowout
bit.
bit.
Show how
which computes the difference
The
subtracter's outputs are b t ,
to use (a) an eightinput multi
plexer and (b) a fourinput multiplexer to realize the full subtracter. 2.21.
Show how
to design a 1/16
decoder using the 1/4 decoder of Figure 2.236 as your
sole building block.
how
2.22. Describe
ANDOR is
much
to
circuit
implement the and
priority
(b) a multiplexer
less costly than the other
encoder of Figure 2.25 by (a) a twolevel
of suitable
size.
Demonstrate
and derive a logic diagram for the
that less
one design expensive
design. 2.23.
Design a 16bit priority encoder using two copies of an 8bit priority encoder. You use a few additional gates of any standard types in your design, if needed.
may 2.24.
A
compares two unsigned numbers X and Y and prowhich indicate X= Y,X>Y, and X < Y, respectively. (a) Show how to implement a magnitude comparator for 2bit numbers using a single 16input, 3bit multiplexer of appropriate size, (b) Show how to implement the same comparator using an eightinput, 2bit multiplexer and a few (not more than five) magnitudecomparator
duces three outputs
twoinput 2.25.
z,,
circuit
z2 , and z3 ,
NOR gates.
Commercial magnitude comparators such as the 74X85 have three control inputs confusingly labeled X = Y, X > Y. and X < Y, like the comparator's output lines. These inputs permit an array of k copies of a 4bit magnitude comparator to be expanded to form a Akhil magnitude comparator as shown in Figure 2.58. Modify the 4bit magnitude comparator of Figure 2.27 to add the three new control inputs and explain briefly how they work. [Hint: The unused carry input lines denoted c in in Figure 2.27 play a central role in the modification.]
2.26.
Show how
connect n half adders (Figure 2.5) to form an «bit combinational inis to add one (modulo 2") to an «bit number X. For ex
to
crementer whose function ample,
2.27.
X
if
1 1 1 1 1 1 1 1
,
it
Show how
LOAD
= 10100111. the incrementer should output Z = 00000000.
Z =
10101000;
if
X
=
should output the
line to
register circuit
of Figure 2.29 can be simplified by using the
enable and disable the register's clock signal
CLOCK.
Explain clear
ly
why
gatedclocking technique
is
often considered a violation of
useful operation related to shifting
is
called rotation. Left rotation of an rabit
this
good de
131
sign practice.
CHAPTER 2 2.28.
A
register is defined
( Z m2>
(a)
•'^0'
Zml)
the 4bit rightshift register
right rotation, (b)
show how 2.29.
Zn
to
Methodology
:_ (Zml»Zin_2v.>Zi.Zo)
Give an assignment statement similar
how
Design
by the registertransfer statement
SR
(2.26)
Show
to (2.26) that defines right rotation.
of Figure 2.30 can easily be
made
implement
to
Using as few additional components and control lines as possible, SR to implement both right shifting and right rotation.
extend
component types: 4bit Dtype reggates. The counter's inputs are a CLEAR signal that resets it to the all0 state and a COUNT signal whose 0tol (positive) edge causes the current count to be incremented by one. Use as few components as you can, assuming for simplicity that each component type has the same Design an
8bit counter using only the following
half adders, full adders, and twoinput
isters,
NAND
cost.
2.30.
Assuming
that input variables are available in true
FPGA cell of Figure 2.35a and EXLCLUSIVEOR functions. the Actel
2.31. (a)
Assuming
of the largest
form only, show how
realize twoinput versions of the
form only, what
that input variables are available in true
NAND
make
to
NAND, NOR, the fanin
is
gate that can be implemented with a single Actel
FPGA
cell
What is the largest NAND if both true and complemented inputs and we allow some or all of the inputs to the NAND to be inverted?
(Figure 2.35a)? (b) are available 2.32.
Show how
to implement the full subtracter defined in Figure 2.55 using as few copyou can of the Actel Cmodule. Again assume that the input variables are supplied in true form only. ies as
shows the Actel FPGA Smodule, which adds a D flipflop Cmodule discussed in the text. Show how to use one copy of
2.33. Figure 2.59
put of the
to the outthis cell to
implement the edgetriggered JK flipflop defined in problem 2.11, assuming only the true output y is needed and that either one of the flipflop's J or K inputs can be complemented.
X \2 *15
X4 *7
*3 Y,
4bit
Y
Y
X>Y
1
o
—
X>Y
>
r,5
\ Y
comparator
4bit
magnitude
magnitude
comparator
X>Y
4bit
Y
comparator
X
X
X
X u
4bit
magnitude
magnitude comparator
4
4
4
4
Yn
Y*
Yi
X>Y
X>Y
X>Y
X>Y
X>Y
X>Y
X=Y
X=Y
X=Y
X= Y
X'/il C /i2
,c n _i) n _ x ,y n _ x
{.x
=
,
+ xnl>nl Cn2
(0,0,1)
and
(1,1,0),
the truth table of Figure 3.22, then z n _ x
is
which make v =
defined correctly for
1,
are
all
the
remaining combinations by the equation z„_i
=
*„_i
© y„_i © c„_ 2
Consequently, during twoscomplement addition the sign
the
of the operands can
in the
Outputs
Inputs
X n\
bits
same way as the remaining (magnitude) bits. A related issue in computer arithmetic is roundoff error, which results from fact that every number must be represented by a limited number of bits. An
be treated
>Vi
C n1
Znl
1
1
1
1
V
1
1
1
1
1
1
Figure 3.22 1
1
1
1
1
1
1
1
Computation of the sign bit ;,,_, and the overflow indicator v in twoscomplement addition.
1
operation involving nbit numbers frequently produces a result of more than n
For example, the product of two Aibit numbers contains up to In bits, which must normally be discarded. Retaining the n most significant result without modification is called truncation. Clearly the resulting
all
bits.
but n of
bits
of the
number
is
in
by the amount of the discarded digits. This error can be reduced by a process called rounding. One way of rounding is to add r;/2 to the number before truncation, where r 7 is the weight of the least significant retained digit. For instance, to round 0.346712 to three decimal places, add 0.0005 to obtain 0.347212 and then take the three most significant digits 0.347. Simple truncation yields the less accurate value 0.346. Successive computations can cause roundoff errors to build up unless countermeasures are taken. The number formats provided in a computer should have sufficient precision that roundoff errors are of no consequence to most users. It is also desirable to provide facilities for performing arithmetic to a higher degree of precision if required. Such high precision is usually achieved by using several words to represent a single number and writing special subroutines to perform multiword, or multipleprecision, arithmetic. error
Decimal numbers. Since humans use decimal arithmetic, numbers being first be converted from decimal to some binary representation. Similarly, binarytodecimal conversion is a normal part of the computer's output processes. In certain applications the number of decimalbinary conversions forms a large fraction of the total number of elementary operations performed by the computer. In such cases, number conversion should be carried out rapidly. The various binary number codes discussed above do not lend themselves to rapid conversion. For example, converting an unsigned binary number xni x n2 xo t0 decimal requires a polynomial of the form entered into a computer must
/ii
Jfc+i
L*,2' to
be evaluated.
by by a sequence of bits. Codes of this kind are called decimal codes. The most widely used decimal code is the BCD {binarycoded decimal) code. In BCD format each digit d of a decimal number is denoted by its 4bit equivalent b i3 b i2 b iA b j0 in standard binary form, as in (3.7). Thus the BCD number representing 971 is 1001011 10001. BCD is a weighted (positional) Several
number codes
encoding each decimal
exist that facilitate rapid binarydecimal conversion
digit separately
i
number code,
BCD numbers employ decimal complement formats. The 8bit ASCII code rep
7 since b Lj has the weight 10'2 Signed .
versions of the signmagnitude or resents the 10 decimal digits
by a
ASCII code word have no numerical
Two
4bit
BCD
field; the
remaining 4
bits
of the
significance.
other decimal codes of moderate importance are
shown
in
Figure 3.23.
The excessthree code can be formed by adding 001 2 to the corresponding BCD number hence its name. The advantage of the excessthree code is that it ma\ be processed using the same logic used for binary codes. If two excessthree num
—
bers are added like binary numbers, the required decimal carry is automatically generated from the highorder bits. The sum must be corrected by adding +3. For
171
CHAPTER
3
Processor Basics
1/2
Decimal code
SECTION
Decimal 3.2
digit
Data Representation
BCD
ASCII
Excessthree
Twooutoffive
0000
0011 0000
0011
11000
1
0001
0011 0001
0100
ooo it
2
0010
0011 0010
0101
00101
3
0011
00110011
0110
00110
4
0100
0011 0100
0111
01001
5
0101
0011 0101
1000
01010
6
0110
00110110
1001
01100
7
0111
0011 0111
1010
10001
8
1000
0011 1000
1011
10010
9
1001
0011 1001
1100
10100
Figure 3.23
Some
important decimal number codes.
example, consider the addition 5 + 9 = 14 using excessthree code.
1000 = 5 1100 = 9
+ Carry
J 0260 
0108
0264
MOVE.L#4001. A2 ABCD(AO), (Al)
010C
0114
BXE
I 
0268
> Program
0276
SF6
03E9
1001
Vector
A 07D1
2001
Vector
> Data
B
0BB9
3001 Vector
Figure 3.38
C OFAO
Memory allocation for the 680X0 vector addition
4000
program.
To determine how
best to implement (3.40) in assembly language, the available and addressing modes must be examined carefully. The ABCD instruction, besides being limited to byte operands, allows only two operand addressing modes: direct register addressing and indirect register addressing with predecrementinstruction types
ing.
As explained
register to
This approach vector,
mode
earlier, the latter
causes the contents of the designated address
be automatically decremented just before the add operation is
and hence
convenient for stepping through it is
selected here.
to address or point to the current
addition step
is
Two
lists, in this
is
carried out.
A0 and Al are chosen and B, respectively. Thus the basic
of the address registers
elements of
A
implemented by the instruction
ABCD (AOMAl) which
is
case the elements of a
(3.41)
equivalent to
A0:=A01, Al :=A11; M(A1) := M(A0) + M(A1) + carry:
A
third address register
(3.41)
is
stored in the
C
A2
is
used to point to vector C, and the result computed by
region by the 1byte data transfer instruction
MOVE.B
(A1),(A2)
(3.42)
1
Because addresses are predecremented, AO, Al, and
The foregoing
be initialized to values
and (3.42) are executed 1000 times,
instructions (3.41)
lowest address (1001 the
A2 must
in the
case of vector A)
that
is,
until the
reached. This point can be detected by
is
which
sets the zerostatus flag is
made back
resulting code,
Z
to
A0
if
1
to (3.41) using the
= 1001 and
START
BNE
otherwise.
to
BNE (branch if not equal to
which also appears with comments
When Z ^
1) instruction.
1,
a
The
in Figure 3.13, is as follows:
MOVE.L MOVE.L MOVE.L
#2001,
ABCD
(A0),(A1)
AO
#300 1, A #4001,A2
MOVE.B
(A1),(A2)
CMPA
#1001,A0
START
Figure 3.39 shows an assembly listing of the foregoing code with various direc
added for both
tives
illustrative
purposes and to complete the program. The assembly
language source code appears on the right side of Figure 3.39, while the assembled object program appears on the left in hexadecimal code. the
and
memory data,
The
leftmost
column contains
addresses assigned by the assembler to the machinelanguage instructions
which are then
listed to the right of these
memory
addresses.
The
first
ORG
program at the hexadecimal assigned by EQU directives to the
directive causes the assembler to fix the start of the
address 0100. The symbolic names A, B, and
addresses of the
MOVE.L
first
(move long)
C
are
elements of the three corresponding vectors. The subsequent instructions contain arithmetic expressions that are evaluated
during assembly and replaced by the corresponding numerical value. For example, the
A + 1000 appearing in the first MOVE.L instruction is replaced by 1001 + 1000 = 2001. In general, assembly languages allow arithmeticlogic expressions to be used as operands, provided the assembler can translate them to the form needed for the object program. The statement MOVE.L #2001, A0 is thus the first executable statement of the program, and its machinelanguage equivalent 2078 07D1 is loaded into memory locations 0100:0103 (hex), as indicated in Figures 3.38 and 3.39. The remainder of the short program is translated to machine code and allocated to memory in simexpression
ilar
fashion.
Many 680X0 branch address
is
branch instructions use relative addressing, which means
computed
relative to the current address stored in the
that the
program counter
PC. Consider, for instance, the conditional branch instruction BNE START, the last in the vectoraddition program. As shown by Figure 3.39. the corresponding machinelanguage instruction is 66F6 in which 66 is the opcode BNE and F6 is an 8bit relative address derived from the operand START. Now F6 )6 = executable instruction
111101 10 2 which when interpreted as a twoscomplement number is 10 l0 or 0A, 6 BNE START has been fetched from memory locations 01 14 16 and 01 15 16 PC is .
,
After
,
automatically incremented to point to the next consecutive
Hence tion
at this point
BNE,
required,
is
it
PC = 000001
16, 6
.
Now when
computes the branch address as
the
CPU
memory
location 0116, 6
.
executes the branch instruc
PC + (0A) = 0000010C !6 which, as (ABCD) with the symbolic address .
the physical address of the instruction
START. The remainder of
the vectoraddition
directives that define data regions.
data region; in this case the
start
program
illustrates the
assemblylanguage
ORG is used again to establish a start
address
is
1001
1(
,
CHAPTER Processor Basics
CMPA (compare address) instruction CMPA #1001, A0
branch
207
one greater than the highest addresses assigned to A, B, and C, respectively.
that are
address for the
= 03E9 I6 The DS.B (define storage .
3
208
Machine language
SECTION
3.3
Assembly language
Location Code/Data
68000/68020 program
for vector addition
Instruction Sets
The vectors
are
composed of a thousand
1byte (two digit) decimal
numbers. The starting (decimal) addresses of A, B, and
C
are
1001, 2001, and 3001, respectively.
;
Define origin of program
at
0100
ORG
03E9
A
EQU
1001
07D1 B
EQU
2001
0BB9 C
EQU
3001
;
;
hex address 100
$100
Define symbolic vector
start
addresses
Begin executable code
0100
2078
07D1
MOVE.L A+1000,AO
;
Set pointer beyond end of
0104
2278
0BB9
MOVE.L B+1000.A1
;
Set pointer
beyond end of B
0108
2478
0FA1
MOVE.L C+
;
Set pointer
beyond end of C
010C
C308
010E
1511
0110
B0F8
0114
66 F6
START
03E9
;
1
000, A2
ABCD
(A0), (Al); Decrement pointers
MOVE.B
(A1),(A2)
;
Store result in
CMPA
A,A0
;
Test for termination
BNE
START
:
Branch
to
A
& add
C
START if Z *
1
End executable code
Begin data definition
03E9
03E9
ORG
A
;
Define
DS.B
1000
:
Reserve 1000 bytes for
;
Initialize
elements 1:3 of
B
:
Initialize
elements 4:6 of
B
;
End program
07D1
010101
DC.B
1.1.1
07D4
16 16 16
DC.B
22,22,22
END
start
of vector
A A
Figure 3.39
Assembly
listing
of the
680X0 program
for vector addition.
in bytes) directive reserves a region
assembler's
memory
location
of 1000 bytes. This directive merely causes the
counter,
which
addresses, to be incremented by the specified
it
uses to keep track of
number of
bytes.
As
memory
indicated by Figure
makes the location counter point to the start of the region storing vecThe two DC. B (define constant in bytes) commands initialize six elements of B
3.38, this action tor B.
to the specified constant values. Finally the
assemblylanguage program.
END
directive indicates the
end of the
Two
Macros and subroutines.
useful tools for simplifying program design by
allowing groups of instructions to be treated as single entities are macros and subroutines.
A
macro
is
defined by placing a portion of assemblylanguage code
between appropriate directives as follows: .
.
. ,
operand
Body of macro
ENDM The macro is subsequently invoked by treating the userdefined macro name, which appears in the label field of the MACRO directive, as the opcode of a new (macro) instruction. Each time the macro opcode appears in a program, the assembler replaces it by a copy of the corresponding macro body. If the macro has operands, then the assembler modifies each copy of the macro body that it generates by inserting the operands included in the current macro instruction. Macros thus allow an assembly language to be augmented by new opcodes for all types of operations; they can also indirectly introduce new data types and addressing modes. A macro is
typically used to replace a short sequence of instructions that occur frequently in
a program.
Note
macros shorten from it.
that although
the object code assembled
the source code, they
do not shorten
Suppose, for example, that the following twoinstruction sequence occurs in a
program for the
Intel
8085
[Intel 1979]:
LDHL ADR
MOV
;Load
A,M
A
This code implements the operation treating
LDAI
ADR
as an indirect
memory
M(HL)
into address register
M(M(ADR)), which
:=
We
address.
HL
into accumulator register
can define
it
A
loads register
as a
A
macro named
(load accumulator indirect) as follows:
LDAI
MACRO ADR LDHL ADR
MOV ENDM With
M(ADR)
;Load
this
macro
A,M
;Load
M(ADR)
;Load
M(HL)
definition present in an
into address register
into register
8085 program,
HL
A
LDAI becomes
a
new
assemblylanguage instruction for the programmer to use. The subsequent occurrence of a statement such as
LDAI in the
same program causes
1000H
the assembler to replace
LDHL
1000H
MOV
A,M
(3.43) it
by the macro body
with the immediate address 1000 16 from (3.43) replacing the macro's dummy input ADR. Note that the macro definition itself is not part of the object pro
parameter gram.
A
subroutine or procedure
invoked by name, much
is
also a sequence of instructions that can be
(macro) instruction. Unlike a macro, howassembled into object code. It is subsequently used.
like a single
ever, a subroutine definition
is
CHAPTER 3 Processor Basics
MACRO operand,
name
209
210
SECTION
not by replicating the body of the subroutine during assembly, but rather during 3.3
Instruction Sets
program execution by establishing dynamic links between the subroutine object code and the points in the program where the subroutine is needed. The necessary links are established by means of two executable instructions named CALL or
JUMP TO SUBROUTINE,
and
RETURN.
Consider, for example, the following
code segment:
CALL
SUB
1
NEXT
Main
(calling)
program
SUB1 Subroutine
SUB
1
RETURN CALL SUB1 has been fetched, the program counter PC contains the address NEXT of the instruction immediately following CALL; this return address must After
be saved to allow control to be returned instruction first saves the contents of
forms the operand of the
the address that
SUB
1
is
save area.
SUB1
It
call
then transfers
PC. and also
The processor then begins execution of
the sub
serves as the subroutine's name. is
call statement,
in this case, into
returned to the original program from the subroutine by execut
RETURN,
restores
it
to
CALL
which simply retrieves the previously saved return address and PC.
RETURN may
and
tions to store return addresses. its
main program. Thus a
later to the
in a designated
the address of the first executable instruction in the subroutine
routine. Control
ing
PC
use specific
The RX000,
register file to save a return address
CPU
registers or
mainmemory
for instance, uses a
on executing any of
its
CPU
loca
register
from
jump/branchand
linkregister instructions, which serve as call instructions; see Figure 3.36. Many computers use a memory stack for this purpose. CALL then pushes the return address into the stack, from which it is subsequently retrieved by RETURN. The stack pointer SP automatically keeps track of the top of the stack, where the last
was pushed by
return address
CALL
and from which
it
will
be popped by
RETURN.
CALL
Figure 3.40 illustrates the actions taken by the realization.
For simplicity,
memory word
long.
The
we assume
instruction
1000 and 1001, and we assume
that
CALL SUB1
that the
program counter
PC
mented
and stores
to 1002.
At
CALL
contains the address 1000, as
and
PC
ing the instruction as a subroutine call, the the instruction
it
in the (buffer)
this point the
system
stored in
is
is
locations
SUB1
with the
instruction cycle begins, the
shown
in Figure 3.40a.
incremented to 1001.
CPU
On
shown
AR; again PC
into the stack.
Then
the contents of
2000 of
AR are
is
in Figure 3.40b,
contains the return address to the main program. Next the contents of
pushed
The
identify
fetches the address part
address register
state is as
one
all
memory
assembler has replaced
physical address 2000. Immediately before the
CALL opcode is fetched and decoded,
instruction in a stack
opcodes and the addresses are
incre
and
PC
PC
are
transferred to PC, and the stack
211
CALL
_CALL_
1000

MOV
Main program

AR
1
IR
AR
7893
CALL
Processor
Main
Basics
program 2000
2000
2000
PC
1000
SP
PC
Subroutine "
CHAPTER
2000 _
2000 IR
1000

1002
•{
SUB1 SP
3500
3500 I
I
RETURN
1034
*
Stack
3500
3500
y» (b)
(a)
_CALL_
1000
2000 IR

CALL

2000
Main

program
AR
2000 2000
Subroutine
SUB1 3499
«f[
RETURN 1002
3500
(c)
Figure 3.40 Processor and
memory
state
during execution of a
CALL
(b) state
immediately after fetching the instruction, and
pointer
SP
is
decremented by one. The resulting
instruction: (a) initial state,
(c) final state.
state
of the system
is
depicted in
Figure 3.40c.
3.4
SUMMARY
M
and task of a CPU is to fetch instructions from an external memory execute them. This task requires a program counter PC to keep track of the active
The main
instruction,
and
registers to store the instructions
and data
as they are processed.
a central data register called an accumulator, along capable of addition, subtraction, and wordoriented logic operations.
The simplest CPUs employ with an In
most
ALU
CPUs
a register
file
containing 32 or more generalpurpose registers
3
SECTION Problems
ARM
RISC processors such as the and the MIPS RXOOO allow only load and store instructions to access M, and use small instruction sets
212
replaces the accumulator. 3.5
and techniques such as pipelining to improve performance. CISC processors such as the Motorola 680X0 have larger instruction sets and some more powerful instructions that improve performance in some applications but reduce it in others.
The
arithmetic capabilities of simpler processors are limited to the fixedpoint
(integer) instructions unless auxiliary coprocessors are used.
have builtin hardware
Computers
store
More powerful CPUs
to execute floatingpoint instructions.
and process information
storage (the smallest addressable unit)
handle data in a few fixedword
is
sizes, 32bit
The
in various formats.
the 8bit byte.
CPU
The
words being
is
basic unit of
designed to
The two major Fixedpoint numbers
typical.
formats for numerical data are fixedpoint and floatingpoint.
decimal, meaning a binary code such as found in ordinary (base 10) decimal numbers. The most common binary number codes are sign magnitude and twos complement. Each code simplifies the implementation of some arithmetic operations; twos complement, for example, simplifies the implementation of addition and sub
can be binary (base 2)
or, less frequently,
BCD that preserves the decimal weights
traction
and so
is
generally preferred.
A
floatingpoint
number comprises
a pair of
fixedpoint numbers, a mantissa M, and an exponent E and represents numbers of X B E where B is an implicit base. Floatingpoint numbers greatly the form
M
word size but require much numbers require. The IEEE 754
increase the numerical range obtainable using a given
more complex
arithmetic circuits than fixedpoint
standard for floatingpoint numbers
The functions performed by instruction consists of an
a
is
widely used.
CPU
opcode and a
techniques called addressing
modes
operands can be in the instruction
are defined
set of
by
its
instruction set.
operand or address
are used to specify operands.
itself
(immediate addressing),
fields.
An
in
An
Various
instruction's
CPU registers,
memory M. Operands in registers can be accessed more rapidly than M. An instruction set should be complete, efficient, and easy to use in
or in external
those in
some broad
grouped into several major types: data transand inputoutput instructions), data processing (arithmetic and logical instructions), and program control (conditional and unconditional branches). All practical computers contain at least a few instructions of each type, although in theory one or two instruction types suffice to perform all fer (load,
sense. Instructions can be
store,
move
register,
computations. RISCs are characterized by streamlined instruction sets that are supported by fast hardware implementations and efficient software compilers. While
CISCs have larger and more complex instruction sets, they simplify the programming of complex functions such as division. The use of subroutines (procedures) and macroinstructions can simplify assemblylanguage programming in all types of processors.
3.5
PROBLEMS 3.1.
Show how
to use the
10member
instruction set of Figure 3.4 to
implement the follow
many computers; use as few inof memory location X to memory location
ing operations that correspond to single instructions in structions as Y. (b)
you can.
(a)
Copy
the contents
Increment the accumulator AC.
(c)
Branch
to a specified address
adr
if
AC ^ 0.
:
3.2.
Use the instruction set of Figure 3.4 to implement the following two operations assuming that signmagnitude code is used, (a) AC := M(X). (b) Test the rightmost bit b of the word stored in a designated memory location X. If b = 1, clear AC; otherwise, leave AC unchanged. [Hint: Use an AND instruction to mask out certain bits of a
Consider the possibility of overlapping instruction fetch and execute operations when executing the multiplication program of Figure 3.5. (a) Assuming only one word can
be transferred over the system bus
at a time,
determine which instructions can be over
lapped with neighboring instructions, (b) Suppose that the
CPUmemory
interface
is
redesigned to allow one instruction fetch and one data load or store to occur during the
same clock
Now determine which instructions, if any, in the multiplication can
cycle.
not be overlapped with neighboring instructions.
3.4.
Write a brief note discussing one advantage and one disadvantage of each of the
lowing two unusual features of the
ARM6:
(a) the inclusion of the
in the general register file; (b) the fact that execution
3.5.
Use
HDL
of the
ADD 3.6.
R6,R6,R6;
ARM6 EOR
(d)
new
Identify the
is
PC
conditional.
instructions:
MOV
MOV
(a)
R6,#0;
MVN
(b)
R6,#0;
(c)
contents
initial register
(all
given in hex code):
R2 = 0000FFFF; R3 = 12345678; NZCV = 0000
contents of every register or flag that
following instructions.
(e)MOV
of every instruction
fol
R6,R6,R6.
ARM6 has the following
Suppose the
state, (a)
program counter
notation and ordinary English to describe the actions performed by each
following
Rl = 11110000;
3.7.
Assume each
R1,R2;
(b)
MOVCS
is
changed by execution of the
executed separately with the foregoing
is
R1.R2;
(c)
MVNCS
R2.R1;
(d)
MOV
initial
R3,#0;
R3,R4, LSL#4.
Suppose the ARM6 has the following hex code):
initial register
and memory contents
(all
given
in
Rl = 00000000; R2 = 87654321 Identify the
new
initial
state,
R3 = A05B77F9;
;
contents of every register or flag that
of the following instructions. ing
(a)
ADD
Assume each
R1,R2,R3;
(b)
is
NZCV = 0000 changed by execution
is
executed separately with the forego
ADDS
R1.R3.R3;
(c)
SUBS
R2,R1,#1;
(7!l
(4.5)
from the sign position,
c„_,, the carry output signal
is
defined by xn _ y n _ + l
l
xn _iC n _2 +
}'„_i c n2' fr°
m which
follows that
it
v
=
c„
(4.6)
Either (4.5) or (4.6) can be used to design overflow detection logic for twos
complement addition or subtraction. Overflow detection in the case of magnitude numbers is similar and is left as an exercise (problem 4.6).
sign
Highspeed adders. The general strategy for designing fast adders is to reduce form carry signals. One approach is to compute the input carry needed by stage i directly from carrylike signals obtained from all the preceding the time required to
stages
i

l,i

stage to stage. /ibit
full
2,...,0, rather
Adders
than waiting for normal carries to ripple slowly from
carrylookahead adder
is
formed from n
adder modified by replacing
called gj and
/?,,
carrylookahead adders.
that use this principle are called
stages,
each of which
carry output line
its
c,
An
basically a
is
by two auxiliary signals
or generate and propagate, respectively, which are defined by the
following logic equations:
&=*# The name generate comes from independent of the value of c propagates c
words,
if Xj
Now be sent
M
+
;
y,
that
=
is, it
i
_
if c,
=
both 1
x,
+
1,
c,
=
jc,v,+ *,',, will be yj = 1 and leading 0s should be shifted into A as before, until the first 1 in X is encountered. Multiplication of Y by this 1 and addition of the result to A causes />, to become negative, from which point on leading Is rather than 0s must be shifted
x7 =
0,
;
zero,
into A.
3.
These rules ensure
complement code. x7 = 1, v 7 = 0; that is,
that a right shift corresponds to division
X is negative and Y is
by 2
positive. This follows case
1
in twos
for the first
seven addandshift steps yielding the partial product
For the
final step, often referred to as a correction step, the subtraction
performed. The result
P
is
6
P=
Y +
P := P7
yis
then given by
/
X2"V = U+
6
\
7
E2'
Jc
Jy
I
XxY
4.
which is by (4.21). x 1 = y1 = 1; that is, both X and Kare negative. The procedure used here follows case 2, with leading 0s (Is) being introduced into the accumulator whenever its contents are zero (negative).
which ensures
The correction
(subtraction) step of case 3
that the final product in
Each addition/subtraction
A.Q
is
also performed,
nonnegative.
is
step can be performed in the usual
twoscomplement is needed in
fashion by treating the sign bits like any other and ignoring overflow. Care the shift step to ensure that the correct
new
position A[7]. This value must be a leading positive or zero, and
assigned to A[7].
F
1
if
it is
negative.
We
to 0,
and
is initially set
F
:=
value if
is
placed in the accumulator's sign
the current partial product in
introduce a flipflop is
F
subsequently defined by
(v 7 anY2
i
k
Y=2
k
Y(2
k
/,
= 2'*
=
1
2
m
1
£2 m r m
k
\)Y
m+
'
k
Y
(4.23)
=
Suppose the index m is replaced by j = m + i  k. Then the upper and lower limits of  k, respectively, to /  1 and summation in (4.23) change from k  and implying that (4.22) and (4.23) are, in fact, the same. It follows that Booth's algorithm correctly computes the contribution of X*, and hence of the entire multiplier
the
/'
1
X, to the product P. Equation (4.20) implies that the contribution of a negative
X*
239
CHAPTER 4 Datapath Design
2W BoothMult
INBUS;
(in:
out:
OUTBUS);
SECTION 4.1
register A[7:0], M[7:0], Q[7:l],
FixedPoint
bus INBUS[7:0], OUTBUS[7:0];
Arithmetic
BEGIN:
A := 0, COUNT := 0,
INPUT:
M := INBUS;
COUNT[2:0],
Q[7:0]:= INBUS, Q[1]:=0;
SCAN:
if
Q[l] Q[0] = 01 then A[7:0]
:=
A[7:0] + M[7:0], go to
TEST; else if Q[l] Q[0]
COUNT =
= 10 then A[7:0]
if
RSHIFT:
A[7]
INCREMENT: OUTPUT:
COUNT := COUNT + 1, go to SCAN; OUTBUS := A, Q[0] := 0; OUTBUS :=Q[7:0];
:= A[7],
7 then go to
A[7:0]  M[7:0];
:=
OUTPUT,
TEST:
A[6:0].Q = A.Q[7:0],
end BoothMult;
Figure 4.15
HDL description of an 8bit multiplier implementing the basic Booth algorithm.
to
P
can also be expressed in the formats of (4.20) and (4.23); a similar argument
demonstrates the correctness of the algorithm for negative multipliers. The argu
ment for fractions is essentially the same as that for integers. The twoscomplement multiplication circuit of Figure 4.12 can easily be modified to implement Booth's algorithm. Figure 4.15 describes a straightforward implementation of the Booth algorithm using the above approach with n = 8 and a circuit based on Figure 4.12. An extra flipflop Q[l] is appended to the right end of the multiplier register Q, and the sign logic for A is reduced to the simple sign extension A[7] := A[7]. In each step the two adjacent bits Q[0]Q[1] of Q are
examined, instead of Q[0] alone as in Robertson's algorithm, to decide the operation (add Y, subtract Y, or no operation) to be performed in that step. For comparison with Robertson's method in Figure 4.13, the operands are assumed to be fractions.
method
in
The
application of this algorithm to the example solved by Robertson's
Figure 4.14 appears in Figure 4.16. where the bits stored in Q[0]Q[1] in
each step are underlined.
Combinational array multipliers. Advances
in
VLSI technology have made
possible to build combinational circuits that perform n
x
it
Hbit multiplication for
An example is the Integrated Device Technology which can multiply two 16bit numbers in 16 ns [Integrated Device Technology 1995]. These multipliers resemble the «step sequential multipliers discussed above but have roughly n times more logic to allow the product to be computed in one step instead of in n steps. They are composed of arrays of simple combinational elements, each of which implements an add/subtractandfairly
shift
values of n.
large
IDT721CL
multiplier chip,
operation for small slices of the multiplication operands.
Suppose
that
two binary numbers X = xn _ xn _ 2 ...x x and Y = y„_iy„_2)'i)'o assume that X and Fare unsigned integers. The ]
are to be multiplied. For simplicity,
product
P = X X Kcan
therefore be expressed as
l
241 Step
Accumulator
Register
Initialize registers
00000000 00000000
101100110
SetQ[l]toO
10110011 = multiplier
Subtract
M from A
00101011
1011001K)
A.Q
00010101
110110011
Rightshift
Skip add/subtract Rightshift
A.Q
00010101
1101100U
00001010
111011001
CHAPTER 4
X
Datapath Design
= mulitplicand
11010101
1
2
Q
Action
Y=
M
11010101
3
Add
M to A
Rightshift
4
A.Q
Skip add/subtract Rightshift
A.Q
11011111
11101100J.
11101111
111101100
11101111
111101100
11110111
111110110
11010101
5
Subtract
M from A
Rightshift
A.Q
Skip add/subtract
6
Rightshift
A.Q
00100010 00010001
1111 10110 011111011
00010001
oi ii
00001000
101111101
non
11010101
7
Add
M to A
Rightshift
A.Q
11011101
loiimpi
11101110
110111110
11010101
8
Subtract
M from A
Set Q[0] to
00011001
110111110
00011001
1101 11 100 = product/3
Figure 4.16
Booth multiplication algorithm.
Illustration of the
X
p=
2
V
(4.24)
= o
corresponding to the bitbybit multiplication style of Figure 4.10.
Now
(4.24) can
be rewritten as
P= 12'
X*^2 ;
Each of
the n
twoinput
2
1bit
AND
in the 1bit case.
Figure 4.17
summed
product terms
gate
—observe
all
(4.25)
=
appearing in (4.25) can be computed by a and logical products coincide
that the arithmetic
Hence an n x n
can compute
xyi
"'
array of twoinput
the x^j
ANDs
of the type
shown
in
terms simultaneously. The terms are
according to (4.25) by an array of n(n  1) 1bit full adders as shown in is a kind of twodimensional ripplecarry adder. The
Figure 4.18; this circuit
and 2j factors
in (4.25) are implemented by the spatial displacement of the adders along the x and y dimensions. Note the similarities between the circuit of Figure 4.17 and the multiplication examples of Figures shifts
implied by the
4.10 and 4.11.
2'
242
SECTION 4.1 FixedPoint
gle
The AND and add functions of the array component (cell) as illustrated in Figure
multiplier can be
combined
into a sin
4.19. This cell realizes the arithmetic
expression
Arithmetic jr
An
n x
rtbit
s
= a plus b plus xy
multiplier can be built using n
some
nent, although, as in Figure 4.18,
inputs set to
or
1,
(4.26)
copies of this cell as the sole compo
cells
on the periphery of the array have from (4.26) a plus bplus xy
effectively reducing their operation
plus b (a half adder). The multiplication time for this multiplier is determined by the worstcase carry propagation and, ignoring the differences between the internal and peripheral cells, is {In \)D, where D is the delay of the basic cell. Multiplication algorithms for twoscomplement numbers, such as Robertson's and Booth's, can also be realized by arrays of combinational cells as the next example shows. to a
EXAMPLE 4.3 ARRAY IMPLEMENTATION OF THE BOOTH MULTIPLICATION algorithm [KOREN 1993]. Implementing the Booth method by a combinational array requires a multifunction cell capable of addition, subtraction, and no
operation (skip). Such a cell
B
shown
is
selected by a pair of control lines
required functions of
B
are defined Z
c 0M =
H
and
in
D
Figure 4.20a. as indicated.
Its It
various functions are is
easily seen that the
by the following logic equations.
=a
© bH ® cH
(a@ D)(b +
c)
+ bc
When HD = 10, these equations reduce to die usual fulladder equations HD =11, they reduce to the corresponding fullsubtracter equations
[email protected]@c c
Figure 4.17
AND array for 4 x 4bit unsigned multiplication.
„,
= ab + ac + be
(4.1);
when
Wo
243
CHAPTER 4 Datapath Design
carry
out
Po
Figure 4.18 Fulladder array for 4 x 4bit unsigned multiplication.
Figure 4.19 Cell
in
which c and cout assume
M for an unsigned array multiplier.
the roles of borrowin
and borrowout, respectively.
When
H = 0, z becomes a, and the carry lines play no role in the final result. An
Hbit multiplier
nected as
shown
in
is
constructed from n
2
Figure 4.20b. The extra cells
+ n(n at the
l)/2 copies
top
left
of the
B
cell
con
change the array's shape
to a trapezium and are employed to signextend and subtraction. Note how the diagonal lines marked b
from the parallelogram of Figure 4.18 the multiplicand
Y
for addition
deliver the signextended
Y
directly to every
signextended by leading 0s; this
is
row of B
cells.
When Y
is
positive,
is
it
implicit in the array of Figure 4.18. In the present
when Kis negative, it must be explicitly signextended by leading Is. The operation to be performed by each row of B cells is decided by bits x x _ of the operand X. To allow each possible ^ rv,_, pair to control row operations, we case,
;
i
j
i
244
SECTION 4.1
H D
Function
FixedPoint
X
Arithmetic 1
bit
adder/
z
= a (no operation)
coul z = apluf bplus c (add)
1
subtracter 1
1

c out z = a
b c
(subtract)
(a)
ZJLZJU'JLVJL^.L/JV
s
*
B
10
Jo
B
*
B Jo
B
V HT Pe
5?
*
Ps
I
^
lo
Pi
Pi
/>4
Po
P\
(b)
Figure 4.20 Combinational array implementing Booth's algorithm: (b) array multiplier for
4x
4bit
introduce a second cell type denoted
main
cell
B
and
C
in Figure
4.20b to generate the control input
D required by the B cells. Cell C compares with x _ values of HD required by Figure 4.20a; these values are as follows:
signal the
H
(a)
numbers.
and
and generates
jc,
j
]
4.1.3 Division
two numbers, a divisor V and a dividend D, are given. The third number Q, the quotient, such that Q X V equals or is D. For example, if unsigned integer formats are being used, Q is com
In fixedpoint division
object
is
to
compute a
very close to
puted so that
D=QX V+R
where R, the remainder,
is
required to be less than V, that
E2 and El  E2 =
bit positions.
The
1
k,
shifted mantissa
and also
M2 is
is
to
1
:
the result
is
used
to
determine the length of
rightshifted by k digit posi
then added to or subtracted from
the other mantissa via adder 2, a 56bit parallel adder with several levels of carry look
ahead. The resulting sum or difference is placed in a temporary register R where examined by a special combinational circuit, the zerodigit checker. The output this circuit indicates the
tive
it
z
is
of
number of leading digits (or leading Fs in the case of negain R. The number z is then used to control the final nor
numbers) of the number
The contents of R are leftshifted z digits by shifter 2. and the result is M3. The corresponding adjustment is made to the exponent by subtracting z using adder 3. In the event that R = 0, adder 3 can be used to set all bits of E3 to 0, which denotes an exponent of 64. malization step.
placed in register
272
Data
SECTION 4.3
M2
Ml
E2
El
Advanced Topics
Exponent comparison and
"3_
mantissa
Adder
Shifter
1
alignment
1
E1E2
Mantissa
Adder
2
addition
subtraction
Zerodigit
.
„
checker "
\\
ll
Result
Adder
Shifter 2
normalization
3
M3
E3 Data
Figure 4.44 Floatingpoint add unit of the
IBM
System/360 Model 91.
Coprocessors. Complicated arithmetic operations like exponentiation and trig
onometric functions are costly
to
implement
implementations of these operations are slow.
in
A
CPU
hardware, while software
design alternative
is
to use auxil
iary processors called arithmetic coprocessors to provide fast, lowcost
implementations of these special functions. In general, a coprocessor
is
hardware a separate
is closely coupled to the CPU and whose instructions and registers are direct extensions of the CPU's. Instructions intended for the coprocessor are fetched by the CPU, jointly decoded by the CPU and the coprocessor, and executed by the coprocessor in a manner that is transparent to the programmer. Specialized coprocessors like this are used for tasks such as managing
instructionset processor that
the
memory system
or controlling graphics devices.
example, was designed
to
allow the
CPU
to operate
The MIPS RX000
series, for
with up to four coprocessors
[Kane and Heinrich 1992]. One of these is a conventional floatingpoint processor, which is implemented on the main CPU chip in later members of the series. Coprocessor instructions can be included in assembly or machine code just like any other CPU instructions. A coprocessor requires specialized control logic to link the CPU with the coprocessor and to handle the instructions that are executed by the coprocessor. A typical CPUcoprocessor interface is depicted in Figure 4.45. The coprocessor is attached to the CPU by several control lines that allow the
273 Coprocessor
Select
CHAPTER 4
address
decoder
Datapath
"
Design
CPU
Busy
Coprocessor
Interrupt request
Synchronization signals k i
1
System bus 1r
.
To main memory
J
and 10 devices
Figure 4.45 Connections between a
CPU
and a coprocessor.
of the two processors to be coordinated.
activities
To
the
CPU,
the coprocessor
whose registers can be read and written external memory. Communication between the
passive or slave device
into in
same manner
CPU
as
much
is
a
the
and copro
cessor to initiate and terminate execution of coprocessor instructions occurs auto
Even
matically as coprocessor instructions are encountered.
if
no coprocessor
actually present, coprocessor instructions can be included in
because
if
the
CPU knows
that
CPU
is
programs,
no coprocessor is present, it can transfer program location where a software routine implement
memory
control to a predetermined
is stored. This type of CPUgenerated interprogram flow is termed a coprocessor trap. Thus the coprocessor approach makes it possible to provide either hardware or software support for certain instructions without altering the source or object code of the program being executed.
ing the desired coprocessor instruction
of normal
ruption
A opcode
coprocessor instruction typically contains the following three
F
from other
that distinguishes coprocessor instructions
the address
F
]
of the particular coprocessor to be used
allowed, and finally the type coprocessor.
The
the coprocessor
F2
field
F2
an
several coprocessors are
if
can include operand addressing information.
same time
fields:
instructions,
of the particular operation to be executed by the
monitor the system bus,
instruction at the
CPU
as the
CPU;
it
By having
can decode and identify a coprocessor
the coprocessor can then proceed to exe
some
early
coprocessors but has the major drawback that the coprocessor, unlike the
CPU,
cute the coprocessor instruction directly. This approach
does not
know
is
found
the contents of the registers defining the current
modes. Consequently,
it is
cessor instruction, fetch
common
all
CPU
in
memory
addressing
decode every coprorequired operands, and transfer the opcode and operto
have the
partially
ands directly to the coprocessor for execution. This
is
the protocol followed in
680X0based systems employing the 68882 floatingpoint coprocessor, which
is
the topic of the next example.
EXAMPLE 4.7 THE MOTOROLA [motorola 1989]. The Motorola
68882 FLOATINGPOINT COPROCESSOR 68882 coprocessor extends 680X0series CPUs
274
Type
Opcode
Operation specified
Data transfer
FMOVE FMOVECR FMOVEM
Move word to/from coprocessor data or control register Move word to/from ROM storing constants (0.0, 7t, e, etc.) Move multiple words to/from coprocessor
Data processing
FADD FCMP
Compare
FDIV
Divide
FMOD FMUL
Modulo remainder
FPvEM FSCAJLE
Remainder (IEEE format)
FSGLMUL
Singleprecision multiply
FSGLDIV FSUB FABS
Singleprecision divide
SECTION 4.3 Advanced Topics
Program control
Add
Multiply
Scale exponent
Subtract
Absolute value
FACOS
Arc cosine
FASIN
Arc sine
FATAN FATANH
Arc tangent
FCOS FCOSH FETOX FETOXMI FGETEXP
Cosine
FGETMAN
Extract mantissa
FINT
Extract integer part
Hyperbolic arc tangent
Hyperbolic cosine
power of x power of x) minus
e to the
(e to the
1
Extract exponent
FINTPvZ
Extract integer part rounded to zero
FLOGN
Logarithm of x
FLOGNP1
Logarithm of x +
FLOG 10 FLOG2 FNEG
Logarithm
to the base 10
Logarithm
to the base 2
to the 1
base e
to the base e
Negate
FSIN
Sine
FSINCOS FSINH
Simultaneous sine and cosine Hyperbolic sine
FSQRT
Square root
FT AN
Tangent
FTANH FTENTOX
Hyperbolic tangent
FTWOTOX
2 to the
FLOGN
Logarithm of x
FBcc
Branch
FDBcc
Test, decrement count, and branch
FNOP FRESTORE FSAVE
No
10 to the power of x
power of x
if
to the base e
condition code (status) cc
is 1
on cc
operation
Restore coprocessor state
FScc
Save coprocessor state Set (cc = 1) or reset (cc = 0) a specified byte
FTST FTRAPcc
Conditional trap
Set coprocessor condition codes to specified values
Figure 4.46 Instruction set of the Motorola
68882 floatingpoint coprocessor.
68020 (section 3.1.2) with a large set of floatingpoint instructions. The 68882 and the 68020 are physically coupled along the lines indicated by Figure 4.45. While decoding the instructions it fetches during program execution, the 68020 identifies coprocessor instructions by their distinctive opcodes. After identifying a coprocessor instruction, the 68020 CPU "wakes up" the 68882 by sending it certain control signals. like the
The 68020 then transmits as an instruction register.
which can proceed
When
opcode to a predefined location in the 68882 that serves The 68882 decodes the instruction and begins its execution, the
in parallel
with other instructions executed within the
the coprocessor needs to load or store operands,
it
asks the
CPU
CPU
proper.
to carry out the
necessary address calculations and data transfers.
The 68882 employs the IEEE 754 floatingpoint number formats described in Example 3.4 with certain multipleprecision extensions; it also supports a decimal floatingpoint format. From the programmer's perspective, the 68882 adds to the CPU a set of eight 80bit floatingpoint data registers FP0:FP7 and several 32bit control (opcode) and status registers. Besides implementing a wide range of arithmetic operations for floatingpoint numbers, the 68882 has instructions for transferring data to and from its registers, and for branching on conditions it registers, including instruction
encounters during instruction execution. Figure 4.46 summarizes the 68882' s instruc
These coprocessor instructions are distinguished by the prefix F (floatingmnemonic opcodes and are used in assemblylanguage programs just
tion set.
point) in their like regular
680X0series instructions; see Fig. 3.12. The status or condition codes cc
generated by the 68882
when executing
floatingpoint instructions include invalid
operation, overflow, underflow, division by zero, and inexact result. Coprocessor status is recorded in a control register, set
of calculations, enabling the
response.
As some coprocessor
which can be read by the host
CPU
CPU
at the
end of a
to initiate the appropriate exceptionprocessing
instructions have fairly long (multicyle) execution
68882 can be interrupted
in the middle of instruction execution. Its state must then be saved and subsequently restored to complete execution of the interrupted
times, the
instruction.
The appearance of coprocessors stems in part from the fact that until the 1980s IC technology could not provide microprocessors of sufficient complexity to
Once such microprocessors became possible, CPU chips, losing some of their separate identity in the process especially in the case of CISC processors. For example, the 1990vintage Motorola 68040 microprocessor integrates a 68882style include onchip floatingpoint units.
arithmetic coprocessors began to migrate onto
—
floatingpoint coprocessor with a 68020style
[Edenfield et
al.
menting the performance of a RISC ciency of the
CPU in
a single microprocessor chip
1990]. Arithmetic coprocessors provide an attractive
CPU
itself.
CPU
way
of aug
without affecting the simplicity and
The multiple function (execution)
effi
units in superscalar
microprocessors like the Pentium resemble coprocessors in that each unit has an instruction set that
it
can execute independently of the program control unit and the
other execution units.
4.3.2 Pipeline Processing
Pipelining
is
a general technique for increasing processor throughput without
requiring large amounts of extra hardware to the
[Kogge 1981; Stone 1993].
It is
applied
design of the complex datapath units such as multipliers and floatingpoint
275
CHAPTER 4 Datapath
Design
276
SECTION 4.3
adders.
It is
improve which we return
also used to
cessor, a topic to
Advanced Topics
Introduction.
the overall throughput of an instruction set proin
Chapter
5.
A pipeline processor consists of a sequence of m dataprocessing cir
cuits, called stages or
segments, which collectively perform a single operation on a
stream of data operands passing through them.
each stage, but a
final result is
through the entire pipeline.
word
As
input register or latch
The
tional.
/?,'s
Some
processing takes place in
obtained only after an operand set has passed
illustrated in Figure 4.47, a stage 5, contains a multi
R
, :
and a datapath
circuit C, that is usually
hold partially processed results as they
move through
combina
the pipeline;
they also serve as buffers that prevent neighboring stages from interfering with one
A common
another.
clock signal causes the
/v,'s
to
change
synchronously.
state
Each Rj receives a new set of input data D,_, from the preceding stage 5,_! except for R\ whose data is supplied from an external source. D,_, represents the results computed by C _ during the preceding clock period. Once D _ has been loaded into R h Cj proceeds to use D,_, to compute a new data set D Thus in each clock period, every stage transfers its previous results to the next stage and computes a i
]
j
l
.
t
new
set
At
of results.
operation.
m
to
a pipeline
first sight Its
advantage
is
seems a costly and slow way
that
independent sets of data operands. These data sets
stage
by stage so
that
when
to
implement the
target
an mstage pipeline can simultaneously process up
the pipeline
is full,
move through
the pipeline
m separate operations are being exe
cuted concurrently, each in a different stage. Furthermore, a new, final result
emerges from the pipeline every clock cycle. Suppose that each stage of the mstage pipeline takes T seconds to perform its local suboperation and store its results.
Then
that
the time to complete a single operation,
is,
7" is
throughput of the per second is
one.
is 1/7/.
When
The delay or latency of the pipeline, is therefore mT. However, the pipeline, that is, the maximum number of operations completed Equivalently, the number of clock cycles per instruction or CPI
the pipeline's clock period.
performing a long sequence of operations
in the pipeline, its perfor
determined by the delay (latency) T of a single stage, rather than by the delay mT of the entire pipeline. Hence an mstage pipeline provides a speedup fac
mance
tor of
is
m compared to a nonpipelined implementation of the same target operation.
Control unit
1
)ata
'
'
r
p
R
C.
i
i
c.
R
in
—v Stage S
Figure 4.47 Structure of a pipeline processor.
i
r
r
"*..."*
i
C
R
Data out
V
V
Stage S 2
Stage S„
Any
operation that can be decomposed into a sequence of suboperations of
about the same complexity can be realized by a pipeline processor. Consider, for
example, the addition of two normalized floatingpoint numbers x and
a topic
y,
discussed in section 4.3.1. This operation can be implemented by the following
compare the exponents, align the mantissas (equalize the expoadd the mantissas, and normalize the result. These operations require the fourstage pipeline processor shown in Figure 4.48. Suppose that x has the normalized floatingpoint representation (x M ,xE ), where xM is the mantissa and xE is the k exponent with respect to some base B = 2 In the first step of adding x = (xM jcE ) to = y (y M ,y E ), which is executed by stage S of the pipeline, x E and y E are compared, fourstep sequence:
nents),
.
{
an operation performed by subtracting the exponents, which requires a fixedpoint
adder (see Example
4.6).
S
{
identifies the smaller of the exponents, say,
xE whose ,
mantissa xM can then be modified by shifting in the second stage S2 of the pipeline
form a new mantissa x' M that makes (x' M ,y E ) = (xM ,xE ). In the third stage the mantissas x' M and y M which are now properly aligned, are added. This fixedpoint addition can produce an unnormalized result; hence a fourth and final step is to
,
needed
to normalize the result.
Normalization
is
done by counting the number k of
leading zero digits of the mantissa (or leading ones in the negative case), shifting the mantissa k digit positions to normalize
ment
it,
and making a corresponding adjust
in the exponent.
Figure 4.49 illustrates the behavior of the adder pipeline
sequence of
N floatingpoint
additions of the form x + (
when performing
y, for the case
N=
6.
a
Add
this type arise when adding two yVcomponent real (floatingpoint) At any time, any of the four stages can contain a pair of partially processed scalar operands denoted Qt,,y,) in the figure. The buffering of the stages ensures that S, receives as inputs the results computed by stage 5,_ during the preceding clock period only. If Tis the pipeline's clock period, then it takes time 4T to compute the single sum x, + y,; in other words, the pipeline's delay is AT. This value is approximately the time required to do one floatingpoint addition using a nonpipelined processor plus the delay due to the buffer registers. Once all four stages of the pipeline have been filled with data, a new sum emerges from the last stage S4 every T seconds. Consequently, N consecutive additions can be done in time (N + 3)T, implying that the fourstage pipeline's speedup is
sequences of vectors.
,
S(4) =
4N
N+3
x  (x M x E ) .
Exponent
Exponent Ri
adder
Mantissa *:
shifter
R3
Mantissa adder
c3
c2 >'
= (>'m.Ve)
Data
R>
adder and mantissa shifter
Data
Q
out
'V Stage 5,
Stage S 2
Stage S 3
Stage 54
(Exponent comparison)
(Mantissa
(Mantissa
(Normalization)
alignment)
addition)
Figure 4.48 Fourstage floatingpoint adder pipeline.
277
CHAPTER 4 Datapath
Design
278
SECTION 4.3
(*6 >6>
(x 5 ,y 5 )
(x 6 ,y 6 )
(x4
(x5
Advanced Topics ,
y4 )
,
y5 )
(x6 ,
y6 )
(xj.yj)
(xA ,y4 )
(x5 ,y 5 )
(x 6 ,y6 )
(x2 ,y2 )
(x 3 ,y 3 )
(x4 ,y4 )
(x s ,y 5 )
(x6 ,y 6 )
*
*
*
(*4 >'4)
(*i. ^5)
(*6 >'6>
(*i.yi)
~1~
(*2 v 2)
(x 3 , y 3 )
nrri X
(^i.yi)
C*3 >'3>
*
(*4 >'*)
5,
• *2
(*5 >5>
*
Ui.yi)
C*2. >2)
(*3 >3)
(*4 >4)
• C*5. >s)
t Current result
(x 2
.
\
;
l
'3>
(x4 ,y4 )
*3
(*e ye)
*
(*5>'5>
6)
54
t •*1+>1
1
+ >'2
x 3 + yi
x\ + y*
*5 + >5
*6 + >'6
*i+yi
«2 + >'2
•»3+> 3
«4
+ >'4
x s+y$
X+y,
X2 + >2
^ + y3
Jt
*,+>',
* 2 + >'2
*3 +
2)
1
+ y,
4
+ >'4
:y3
x2 + >2
t
5
6
7
9
8
10
Figure 4.49 Operation of the fourstage floatingpoint adder pipeline.
For large N, 5(4) = 4 so that results are generated at a rate about four times that of a comparable nonpipelined adder. If it is not possible to supply the pipeline with data at the maximum rate, then the performance can fall considerably, an issue to which
we
return in Chapter 5.
Pipeline design. Designing a pipelined circuit for a function involves
first
finding a suitable multistage sequential algorithm to compute the given function.
This algorithm's steps, which are implemented by the pipeline's stages, should be balanced in the sense that they should all have roughly the same execution time. Fast buffer registers are placed between the stages to allow
all
necessary data items
be transferred from stage to stage without interfering with one another. The buffers are designed to be clocked at the maximum rate (partial or
complete
results) to
be transferred reliably between stages. Figure 4.50 shows the registerlevel design of a floatingpoint adder pipeline
that allows data to
based on the nonpipelined design of Figure 4.44 and employing the fourstage organization of Figure 4.48. The main change from the nonpipelined case is the inclusion of buffer registers to define and isolate the four stages. cation has been
made
to
implement fixedpoint as well
A
further modifi
as floatingpoint addition.
— Data
_
_L
1
'
_
279
i
LHAF1 EK 4
M2
Ml
E2
El
.
.
1
J
i i
I
II
,
Datapath
Exponent comparison
Design
Adder 1
Si 1
1
,
u
1,
E4
f
II
ji
E5
M4
M5 :
1 1
1
Mantissa alignment
T
1
Shifter
1
s2
?" i
"'
"1 i 1
u
v
t
\r
E6
1
*
i
1
M6
M7
i
Mantissa addition/
1
.
subtraction
Adder
i
i
2
i
*3
r
ji
E7
R
1
Zerod
check er
w
'
Adder
i
3
r
Shifter 2 i
Normalization
54 u
I 1
M3
E3
j
__
i
out
Figure 4.50 Pipelined version of the floatingpoint adder of Figure 4.44.
The
circuits that
perform the mantissa addition
shown by broken operands. To perform
in stage
5 3 and the corresponding
accommodate
buffers are enlarged, as
lines in Figure 4.50. to
fullsize fixedpoint
a fixedpoint addition, the input oper
ands are routed through 5 3 only, bypassing the other three stages. Thus the circuit of Figure 4.50 is an example of a multifunction pipeline that can be configured either as a fourstage floatingpoint adder or as a onestage fixedpoint adder.
Of
course, fixedpoint and floatingpoint subtraction can also be performed by this circuit; subtraction
and addition are not usually regarded as
context, however.
distinct functions in this
The same function can sometimes be
280
section 4
partitioned into suboperations in several
different ways, 3
Advanced Topics
depending on such factors as the data representation, the style of tne '°^ c design, an(* the need to share stages with other functions in a multifunction pipeline. A floatingpoint adder can have as few as two stages and as many as six. For example, fivestage adders have been built in which the normalization stage (54 in Figure 4.50)
is split
two
into
one to count the number k of lead
stages:
ing zeros (or ones) in an unnormalized mantissa and a second stage to perform the
k shifts that normalize the mantissa.
Whether or not a particular function or set of functions F should be implemented by a pipelined or nonpipelined processor can be analyzed as follows. Suppose that
F can be broken down into m independent sequential steps F ,F2 ,..,Fm so Pm Let F, be realizable by a logic l
that
it
has an mstage pipelined implementation
.
7*,. Let T R be the delay of each and associated control logic. The longest 7", the pipeline and force the faster stages to wait, doing no the slower stages become available. Hence the delay
circuit C, with propagation delay (execution time)
stage Sj due to
its
buffer register
times create bottlenecks in useful computation, until
between the emergence of two
minimum
7?,
results
Tc = max{r,} + TR The throughput of Pm
Pm
from
clock period (the pipeline period) for
Tc i
\ITC = l/(max{r,} +
is
Z /= T
is
=
the
maximum
value of
7",.
The
defined by the equation
is
1,2,...
7"
R ).
A
,m
(4.44)
nonpipelined implementation
TX
We throughput of 1/(X J=1 i conclude the mstage pipeline Pm has greater throughput than P that is, pipelining P
x
of
F has
a delay of
j
or, equivalently, a
\
x
increases performance
if
Equation (4.44) also implies that it is desirable for all 7", times the same; that is, the pipeline stages should be balanced.
be approximately
to
Feedback. The usefulness of a pipeline processor can sometimes be enhanced by including feedback paths from the stage outputs to the primary inputs of the pipeline. Feedback enables the results computed by certain stages to be used in subsequent calculations by the pipeline. We next illustrate this important concept by adding feedback to a fourstage floatingpoint adder pipeline like that of Figure 4.50.
example
4.8
summation by
lem of computing
the
Consider the proba pipeline processor sum of ./V floatingpoint numbers b ,b 2 ,,bN It can be solved by .

l
adding consecutive pairs of numbers using an adder pipeline and storing the
partial
sums temporarily in external registers. The summation can be done much more efficiently by modifying the adder as shown in Figure 4.5 1 Here a feedback path has been added to the output of the final stage 54 allowing its results to be fed back to the first stage 5. A register/? has also been connected to the output of S4 so that stage's results can be stored indefinitely before being fed back to 5,. The input operands of the modi.
,
,
fied pipeline are derived
obtained from a
such operands as the all0 and result
computed by S4
X
that is typically
location; a constant source
K that can apply
from four separate sources: a variable
CPU register or a memory in the
all
1
words; the output of stage S4 representing the ,
preceding clock period; and,
puted by the pipeline and stored
in the
output register R.
finally,
an earlier result com
Input
Input
tf
X
281
CHAPTER 4 Datapath
V
1
1
r
i
'
r
Design r* 
Multiplexer
Multiplexer
\
I
\
Stage 5,
3
4
i
1
1
/
«
4

4
Control
1
Stage 5 2
'
r
Stage
S3
«
4
«
4
'
i
Stage 5 4
Output 1
r
Register
^
/?
Figure 4.51 Pipelined adder with feedback paths.
The jVnumber summation problem following way. The external operands b
is
solved by the pipeline of Figure 4.5
,b 2 ,...,b N are
x
tinuous stream via input X. This process requires a sequence of register or fetch operations,
which
are easily
implemented
register/memory locations. While the
first
if
1
in the
entered into the pipeline in a con
memory
the operands are stored in contiguous
four numbers b ,b 2 ,b i ,b A are being entered, x
the all0
word denoting
K, as illustrated in Figure 4.52 for times
=
sum
number zero
the floatingpoint
+ b =
t
=
1
:4.
applied to the pipeline input
is
After four clock periods, that
is,
at
time
emerges from 54 and is fed back to the primary inputs of the pipeline. At this point the constant input K = is replaced by the current result 5 4 = b v The pipeline now begins to compute b + b$. At t = 6, it begins to compute b 2 + b6 at / = 7, computation of fc + b n begins, and so on. When b + b 5 emerges from the pipe3 line at t = 8, it is fed back to 5, to be added to the latest incoming number b 9 to initiate computation of b + b 5 + b9 (This case does not apply to Figure 4.52, where b % = b s is t
5, the first
x
6,
;
x
x
.
t
the last item to be
pipeline and pipeline
is
is
summed.)
sum b 2 + bb emerges from
the
number b w Thus at any time, four partial sums of the form
the
In the next time period, the
fed back to be added to the incoming
engaged
in
computing
in its four stages
.
(4.45)
b i + b 1 + b u +b^ + + &,, + fe.fi + fc 4
+&
... ...
282
SECTION 4.3
b.
L J
i
Advanced Topics
(0.M
1
J
1
1
(0, b,)
i3 )
(0,
HZ
L
1
1
(O.fc,)
(CAJ
z
1
bA )
(0,
JL_
b3 )
(0,
(0,6,)
(0. fc,)
1
HZ (0J>.i
»
ii
T
/
=
f
1
r
= 3
f
I
1
I
'
= 2
r
(*i.*s)
(by b
1
T
1
(0, 64)
(fci.65)
*
ZIZ (0.
(fe
2
4)
(*i.
fc
2
(0,
)
fc
3)
(0.
(** *8)
JL fc
,
1
fi
)
1
HZ
I (0.
fc
=4
I
'
ib* b 6 )
(0. fej)
r
(*>
3,
*7>
J_ *5 )
I fc
4)
HZ (fc,.fc 5 )
„ + b 3 + b 7 emerges from 54 at t  16. At this point the outputs of 5 4 and R are fed back to S v The final result is produced four time periods later at t = 20 in the case of N = 8. the desired result b ture
is
shown
in
x
+ b2 +
...
.
Figure 4.52 for the case
x
x
.
x
—
283
CHAPTER 4
I
i (0,0)
(fc,
b% )
,
(b 4
(0,0)
,
Datapath
+ b g b 3 + b7 ) ,
Design
z
4 (b 4
+ 6 5 b 2 + b6 )
4
(0.0)
(bl
+
/?
b>
5.
+
/>
(0.0)
6)
1
ziz
*
4
ZEZ
4
(by
b6 )
(*2,
bj)
1
1
b^ +
t= 10
t=
1
f= 12
11
I
ZJl. + b 5 + b 2 + b(, b 4 + bg + b 3 + b 7 )
(b
+ bs b} + ,
t
bj)
(0,0)
1
*
4
(0,0)
(0,0)
(0,0)
4
HZ
1
4 (0.0)
(£>
4
4 (fc,
*4 + *8
b.
!
(0.0)
(0,0)
(b4
b 2 + b6
5,
'
b2 + b6
1
fc
(0,0)
4
'
= 9
+
(ft , fcg)
f
l
(fc,
1
'
b + b5
/
JL
(0.0)
(&,. b,)
+ b8
fc^
,
+
(0,0)
b,)
+ fc 5 + b 2 +
fc
(b 4 + b% +
(0.0)
6)
(0.0)
~4~
4
4
b
4
pipeline's clock period, that
adder requires time 1
1),
floatingpoint
is.
numbers
as
scheme of Figure 4.52 11)7", where T is the
time (N +
the delay per stage. Since a
4NT to compute SUM. we
which approaches 4
in
comparable nonpipelined 4N/(N +
obtain a speedup here of about
N increases.
The foregoing summation operation can be invoked by
a single vector instruc
tion of a type that characterized the vectorprocessing, pipelinebased
"supercom
puters" of the 1970s and 1980s [Stone 1993]. For instance. Control Data Corp.'s
STAR 100 computer
[Hintz and Tate 1972] has an instruction
SUM
that
computes
284
sum of
B = (b b 2 ,...,bN) of and places the result in a CPU register. The starting (base) address of B, which corresponds to a block of main memory, the name C of the result register, and the vector length N are all specified by operand fields of SUM. We can see from Figure 4.52 that a relatively complex pipeline control sequence is needed to implement a vector instruction of this sort. This complexity contributes the
the elements of a specified floatingpoint vector
,
x
SECTION 4.3 Advanced Topics
N
arbitrary length
and cost of vectororiented computers. Moreover,
significantly to both the size
achieve
maximum
way
speedup, the input data must be stored in a
vector elements to enter the pipeline numberpair per clock cycle.
at the
maximum
possible rate
to
that allows the
—generally one
The more complex arithmetic operations in CPU instruction sets, including most floatingpoint operations, can be implemented efficiently in pipelines. Fixedpoint addition and subtraction are too simple to be partitioned into suboperations suitable for pipelining.
As we
see next, fixedpoint multiplication
is
well suited to
pipelined design.
Pipelined multipliers. Consider the task of multiplying two nbit fixedpoint binary numbers pliers of the
X = xn _
l
xn _ 2
.
.x
.
and Y =
kind described in section
4.
1
)'„_
.2
1
y„_ 2

•
•
v
.
Combinational array multi
are easily converted to pipelines
by the
addition of buffer registers. Figure 4.53 shows a pipelined array multiplier that
employs the
1bit
M computes a
product
xy and
stage and a carry bit generated
S
it
,
bit
,
i
r
X2
A out
Register/?,
M
i
=1
Xt M
M Register
R2
4
M
M
Register
^iL M
^^ M
M /f
3
*3
M
r i Ps
Pa
Pi
Pi
P\
Figure 4.53 Multiplier pipeline using ripplecarry propagation.
each stage
(4.46)
y a x
carry
cells in
Pq
JT[
i
X
L
with the final product
Pn _
x
XY being
=
computed by the
last stage. In addition to
storing the partial products in the buffer registers denoted
and
all
hitherto
An
unused multiplier
must also be stored
bits
/istage multiplier pipeline
in
the multiplicand
/?,,
Y
when
and can generate a new result every clock cycle. slow speed of the carrypropagation logic
in
main disadvantage
the rela
is
each stage. The number of
M
2
needed is n and the capacity of all the buffer registers is approximately 3« 2 (see problem 4.31); hence this type of multiplier is also fairly costly in hardware. For these reasons, it is rarely used. Multipliers often employ a technique called carrysave addition, which is parcells
,
ticularly well suited to pipelining. full adders. Its
sum
the n
bits
input
is
three
/7bit
nbit carrysave adder consists of n disjoint
be added, while the output consists of
to
forming a word 5 and the n carry
adders discussed so
The outputs 5 and
far,
there
C can be
in Figure 4.54, they
bits forming a word C. Unlike the no carry propagation within the individual adders.
is
fed into another «bit carrysave adder where, as
can be added
connections are shifted to the general,
An
numbers
number W. Observe
to a third nbit
left to
shown
that the carry
correspond to normal carry propagation. In
m numbers can be added by a treelike network of carrysave adders to pro
duce a result
form (5,C). To obtain the
in the
final
sum, S and
C must be
added by
a conventional adder with carry propagation.
Multiplication can be performed using a multistage carrysave adder circuit of
shown in Figure 4.55; this circuit is called a Wallace tree after its inven[Wallace 1964]. The inputs to the adder tree are n terms of the form M, =
the type tor
xi Y2k Here M, represents the multiplicand Y multiplied by the /th multiplier bit weighted by the appropriate power of 2. Suppose that is In bits long and that a full doublelength product is required. The desired product P is Z£Tq This sum is computed by the carrysave adder tree, which produces a 2«bit sum and a .
M
x2
*3
z3
Y\ z2
'
'
i
M
Xd y2
y3
>'o
z
zi
'
1
I
1
x
CS / V
T
w2
/
/
"
T
r 1
$3
T
ca irn, Si ive i iddei
i
1
adder
adder
«
^O
S
i
CS
»
c\
z
C
1
T
/ *1
ci
w
"0
r T
Figure 4.54
At WO sta ge
/
/ T
$2 c'i
/
l
r
/ 1
w
y
II
i
w3
Datapath
Design
multiplying fixedpoint vectors,
Its
tively
CHAPTER 4
/?,.
of this type can overlap the computation of n
separate products, as required, for example,
285
C
i
5'
.
;
286
SECTION 4.3
I
1.
Advanced Topics Multiplier decoding and multiplicand gating
M5 M4 M3
C
form the
result
CHAPTER 4
Vr, If LX1 < M, subtract X from y ^3>0 (modulo 2"). If 1X1 > \Y\, then set v„_, to and subtract Y from X (modulo 2"). X negative; Y positive: If in < LX1, subtract Y from X (modulo 2"). If in > 1X1, set.r„_, to and subtract X from Y (modulo 2"). X and Y both negative: Add X and Y (modulo 2") and set z„_ to
Y negative: Let
1X1
= xn _2xn _y..xn and >fl
\Y\
= y„
Datapath
Design
1
,
Figure 4.61 Algorithm for subtracting signmagnitude numbers.
1 nbit adder
subtracter
SUB "}
"}
Figure 4.62
An
lines appearing in the figure
for
4.6.
nbit addersubtracter circuit.
can be used to compute
v.
Construct a suitable logic circuit
v.
Consider again the addersubtracter of Figure 4.62, assuming now that it has been designed for signmagnitude numbers. It computes Z = X + Y when SUB = and Z = X Y when SUB = 1. Assume that the circuit contains an nbit ripplecarry adder and a similar «bit rippleborrow subtracter and that
you have access
Derive a logic equation that defines an overflow flag v for
4.7.
to all internal lines.
this circuit.
Give an informal interpretation and proof of correctness of the two expressions (4.12) and g that define the propagate and generate conditions, respectively, for a 4bit
for/?
carrylookahead generator.
4.8.
Show how
to
extend the 16bit design of Figure 4.8 to a 64bit adder using the same
two component 4.9.
types: a 4bit adder
Stating your assumptions and
module and
a 4bit carrylookahead generator.
showing your calculations, obtain an good estimate
for
each of the following for both an nbit carrylookahead adder and an nbit ripplecarry adder: (a) the total (c) the
4.10.
maximum
number of gates used;
(b) the circuit depth
(number of
levels):
and
gate fanin.
Another useful technique for fast binary addition is the conditionalsum method. It and a closely related method called carryselect addition are based on the idea of simultaneously generating two versions of each sum bit s\: a version s], which assumes that its input carry c,_, = 1. and a second version s?, which assumes that c _ = 0. A multiplexer controlled by