Microprocessors: A Programmer's View
 9780070166394


MICROPROCESSORS: A PROGRAMMER'S VIEW

Robert B. K. Dewar
Matthew Smosna

Computer Science Department
Courant Institute, New York University

McGraw-Hill Publishing Company
New York  St. Louis  San Francisco  Auckland  Bogota  Caracas  Hamburg  Lisbon  London  Madrid  Mexico  Milan  Montreal  New Delhi  Oklahoma City  Paris  San Juan  Sao Paulo  Singapore  Sydney  Tokyo  Toronto

MICROPROCESSORS: A PROGRAMMER'S VIEW

Copyright © 1990 by McGraw-Hill, Inc. All rights reserved. Printed in the United States of America. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a data base or retrieval system, without the prior written permission of the publisher.

1234567890 DOC DOC 10 9 5 4 3 2

ISBN 0-07-016638-2 (soft)
ISBN 0-07-016639-0 (hard)

This book was set in Adobe Garamond and Helvetica by the authors using Xerox Ventura Publisher, Adobe Illustrator, and Corel Draw. The editor was David M. Shapiro. R. R. Donnelley & Sons Company was printer and binder.

Library of Congress Cataloging-in-Publication Data

Dewar, Robert B. K.
    Microprocessors: A Programmer's View / Robert B. K. Dewar, Matthew Smosna
        p. cm.
    Includes bibliographical references.
    ISBN 0-07-016638-2 (soft). ISBN 0-07-016639-0 (hard)
    1. Microprocessors - Programming. I. Smosna, Matthew. II. Title.
    QA76.6.D515 1990
    005.26 - dc20    89-77320

Figures 3.5 (p. 90), 4.1 (p. 105), 4.3 (p. 117), 4.4 (p. 118), 11.1 (p. 343), 11.2 (p. 344), 11.3 (p. 346), 11.4 (p. 347), 11.5 (p. 363), and 11.6 (p. 364) were reprinted with the permission of Intel, Inc., Copyright © Intel Corporation. The terms i386, i386 DX, i376, i860, and i486 are trademarks of the Intel Corporation.

Figures 6.1 (p. 166), 6.5 (p. 198), 7.1 (p. 204), 7.2 (p. 205), and 7.4 (p. 217) were reprinted with the permission of Motorola, Inc.

Figures 9.1 (p. 266), 9.2 (p. 268), 9.3 (p. 269), 9.4 (p. 273), 9.6 (p. 288), 9.7 (p. 289), 9.8 (p. 295), and 9.9 (p. 297) were reprinted with the permission of MIPS Computer Systems, Inc.

Figures 10.1 (p. 303), 10.2 (p. 306), 10.3 (p. 308), and 10.5 (p. 318) were reprinted with the permission of Sun Microsystems, Inc. © Copyright Sun Microsystems, Inc., 1989. All rights reserved.

To my parents, Michael and Mary Dewar

To my father, Stanislaw Smosna

ABOUT THE AUTHORS

Robert B. K. Dewar is a Professor of Computer Science and past chair of the department at the Courant Institute of Mathematical Sciences at New York University. He has been involved with computers for over twenty-five years and has written major software systems, including real-time operating systems for Honeywell on early microprocessors and a series of compilers. The SPITBOL compiler, which he originally wrote nearly twenty years ago for mainframe computers, has now been ported to most major microprocessors, including most recently the SPARC. He wrote the back end and run-time library for the Realia COBOL compiler for the IBM PC, and more recently has been involved with the Ada language, for which he was one of the language reviewers. He has also been involved in the design and implementation of the Alsys Ada compilers for the IBM PC and other microprocessors.

Matthew Smosna is a Research Scientist at the Courant Institute of Mathematical Sciences at New York University. He has worked on several implementations of the SETL system (SETL is a set theoretic language developed at NYU), and is currently involved in the implementation of a new Ada compiler for the IBM RP3 (an experimental parallel processor). His main field of research is compiler technology, with an emphasis on code generation techniques. He has taught graduate and undergraduate compiler courses at several universities, including NYU, and is currently writing a textbook on compiler design, based on the class notes, for McGraw-Hill.

CONTENTS

Preface

Chapter 1   Microprocessors
    What Is a Microprocessor?
    The User-Level View of a Microprocessor
    The System-Level View of a Microprocessor
    CISC and RISC Microprocessors
    Registers, Addressing, and Instruction Formats
    Register Sets
    Addressing Modes
    Designing Instruction Formats
    Data Representation
    Representation of Characters
    Representation of Integers
    Packed Decimal
    Floating-Point Values
    Memory Organization
    Big-Endian vs Little-Endian Byte Ordering
    Big-Endian vs Little-Endian Bit Ordering
    The Alignment Issue
    Procedure Calls
    The Call Instruction
    Building a Stack Frame
    Why a Frame Pointer Is Needed
    Hardware Support for Stack Frames
    Accessing Non-Local Variables
    Addressing Modes
    Direct Memory Addressing
    Indexed Addressing
    Based Addressing
    Base Plus Index Addressing
    Indirect Addressing
    Indirect Addressing with Indexing
    Even More Complicated Addressing Modes
    Memory Mapping
    Virtual Memory
    Memory Caching
    Tasking
    Exceptions
    Hardware Support for Exceptions

Chapter 2   Introduction to the 80386
    Register Structure
    Special Registers and Instructions
    Maintaining Compatibility with the 8086/88
    The User Instruction Set
    Basic Data Movement Instructions
    Basic Arithmetic and Logical Operations
    Multiplication and Division Instructions
    Decimal Arithmetic
    String Instructions
    Shift Instructions
    The Set on Condition Instructions
    Summing Up
    Registers and the Run-time Stack
    Why EBP Is Needed
    Instructions That Make Use of ESP and EBP
    Instruction Timing
    Timing the ENTER Instruction
    Pipelining and Instruction Timings

Chapter 3   Addressing and Memory on the 80386
    Memory on the 386
    Addressing Memory Using 16-Bit Addressing
    Alignment Requirements
    Byte and Bit Ordering
    Addressing Modes
    Direct Addressing
    Based Addressing
    Based Addressing with Displacement
    Double Indexing
    Double Indexing with Scaling
    Segmentation on the 80386
    Historical Aspects
    The Global Descriptor Table
    Levels of Protection
    Protection Mechanisms
    Operating System Structure
    Validating Parameters
    Is All This Worthwhile?
    The 80386 Instruction Formats

Chapter 4   Tasking, Virtual Memory, and Exceptions on the 80386
    Tasking
    The Local Descriptor Table
    Context Switching
    Shared Memory
    The System Stack
    Virtual Machine Support
    But Is It All Worth It?
    Virtual Memory Management
    Virtual Segmentation
    Segment-Swapping Algorithms
    Paging
    The Format of Virtual Addresses
    Handling Page Faults
    Virtual Segmentation and Virtual Paging
    An Anecdote on Paging and Protection
    Paging and Virtual 8086 Mode
    Exceptions on the 386
    A Sad Story
    How an Exception Is Handled
    Asynchronous Interrupts
    Writing Exception Handlers
    Fault Traps
    Debugging Support

Chapter 5   Microprocessors and Floating-Point Arithmetic
    Floating-Point Implementations
    Floating-Point Operations: A Programmer's Nightmare
    The Ada Approach
    The IEEE Floating-Point Standard
    Basic Formats
    Rounding Modes
    Extended Precision Formats
    Overflow and Infinite Values
    Not a Number (NaNs)
    Handling of Underflow
    Specialized Operations
    Implementing the IEEE Standard
    The Intel 387 Chip
    The 387 and the IEEE Standard
    The Register Set of the 387
    The Instruction Set of the 387
    Executing 387 Instructions
    Context Switching
    Coprocessor Emulation
    The Weitek Chipset: An Alternative Approach
    Memory Mapped Access
    The Weitek Instruction Set

Chapter 6   The 68030
    The 68030 User Programming Model
    The 68030 User Register Set
    Special Purpose Registers
    The 68030 Linear Address Space
    C, Pointers, and the Linear Address Space
    Data and the Linear Address Space
    Byte Ordering
    Bit Ordering
    The 68030 User-Level Instruction Set
    Data Movement Instructions
    Integer Arithmetic Instructions
    Logical and Shift Instructions
    Bit Field Instructions
    Program Control Instructions
    Decimal Instructions
    The CAS2 (Compare and Double Exchange) Instruction
    The 68030 Addressing Modes
    Addressing Modes and Instruction Sizes
    Simple Data Movement
    Postincrement and Predecrement Modes
    The Register Indirect with Displacement Mode
    The Register Indirect with Index Modes
    The Memory Indirect Addressing Modes
    PC Relative Addressing Modes and Position Independent Code
    Restrictions on the Use of the Addressing Modes
    Floating-Point on the 68030
    Instruction Formats on the 68030
    Conclusion

Chapter 7   The 68030 Supervisor State
    The Supervisor State Registers
    The Privileged Instruction Set
    Addressing on the 68030
    Caching on the 68030
    Cache Organization
    Cache Performance
    Cache Control
    The 68030 Memory Management Unit
    The Address Translation Cache
    The 68030 Paging Mechanism
    The Structure of a 68030 Page Table
    Transparent Translation
    Context Switching
    Trace Control
    Exceptions
    Trap Processing
    Interrupt Processing
    Reserved Exceptions

Chapter 8   An Introduction to RISC Architectures
    CISC Architectures
    The IBM 360 Series
    What Is RISC?
    One Instruction per Clock Cycle
    Pipelining
    Simplified Memory Addressing
    Avoiding Microcoding
    Register-to-Register Operations
    Simple Instruction Formats
    Register Sets in RISC Machines
    CISC, RISC, and Programming Languages
    The First RISC Processors
    The CDC 6600
    The IBM 801 Project
    The Berkeley RISC and Stanford MIPS Projects
    Summary

Chapter 9   The MIPS Processors
    The MIPS Chip
    Register Structure of the CPU
    The Instruction Pipeline
    The Stall Cycle
    The Instruction Set
    The Instruction Formats
    The Load and Store Instructions
    The Computational Instructions
    Immediate Instructions
    The Jump and Branch Instructions
    Procedure Call Instructions
    The Coprocessor Instructions
    Special Instructions
    Addressing Modes
    Direct Addressing
    Indexed and Base/Index Addressing
    Base Plus Offset Addressing
    Memory Management on the MIPS
    The Address Space
    The Instruction and Data Caches
    The Translation Lookaside Buffer
    Floating-Point Operations on the MIPS
    Instruction Scheduling
    Trap Handling and Overlapped Execution
    Exception Handling on the MIPS
    Hardware Interrupts
    Conclusion

Chapter 10  The SPARC Architecture
    General Organization
    The SPARC Signals
    The IU Register Set
    The User Register Set
    The System Register Set
    Register Windows
    Managing the Register File
    SPARC Addressing Modes
    The SPARC Instruction Set
    The Call Instruction Format
    The General Instruction Format
    The SETHI Instruction Format
    The Conditional Branch Instructions
    Exceptions
    Floating-Point on the SPARC
    Floating-Point Registers
    Overlapped Multiplication and Addition
    The SPARC Implementations
    Conclusion

Chapter 11  The Intel i860
    A Summary of the i860
    Basic Structure of the i860
    Instruction Formats
    The Processor Status Registers
    Extended Processor Status Register
    Debugging Support
    Memory Management
    The i860 Cache
    The Integer Core Instruction Set
    Load and Store
    Integer Addition and Subtraction
    Integer Multiplication and Division on the i860
    The Shift Instructions
    The Logical Instructions
    Control Transfer (Branch and Jump) Instructions
    A Digression on Ada - Access Before Elaboration
    Floating-Point Operations on the i860
    Floating-Point Load and Store
    Floating-Point Addition
    Floating-Point Multiplication
    Adding and Multiplying at the Same Time
    Using Dual Instruction Mode
    Floating-Point Division
    Floating-Point Square Root
    IEEE 754 Compatibility
    The i860 Graphics Unit
    Graphics Pixel Data Type
    Graphics Instructions
    Exceptions
    Context Switching
    Programming Model

Chapter 12  The IBM RISC Chips
    The IBM RIOS Architecture
    The Branch Unit
    The Arithmetic-Logic Unit
    The Floating-Point Unit
    Register Renaming
    Data Cache
    The RIOS Instruction Set
    Bit Field Instructions
    Complex Instructions
    Floating-Point Instructions
    Branch Instructions
    Condition Flag Instructions
    Memory Addressing
    Addressing Modes
    Direct Addressing
    Operand Alignment
    Big-Endian Ordering
    Memory Management
    Paging Mechanisms
    Hardware Locking
    An Example: Matrix Multiplication
    Scheduling Comparison Instructions
    Hand-Coded Routines
    Summary

Chapter 13  The INMOS Transputer
    The Transputer and Occam
    The Structure of the Transputer
    Instruction Format
    Register Structure
    Memory Structure
    Loading Values from Memory
    Extending Operand Values
    The Remaining Basic Instructions
    The Extended Instruction Set (Operate)
    Using the Evaluation Stack
    Communication Between Transputers
    Internal and External Channels
    Process Control
    Interrupt Handling
    Error Handling
    Possible Network Arrangements
    Other Network Arrangements
    Conclusion

Chapter 14  The Future of Microprocessor Design
    Developments in Instruction-Set Designs
    The CISC Chips Fight Back
    CISC vs RISC

Glossary

Bibliography

Index

PREFACE

The introduction of microprocessors some ten years ago was an important milestone in the use of computers. The early microcomputers had limited power, but there are many tasks which are satisfied by this limited power, such as control of washing machines, automobile ignition systems, and computer games. As a result, the average house is likely to have dozens of devices that would be regarded as powerful computers by the standards of the early developers in the field.

More recently, the technology has advanced to the point where microprocessors have achieved very substantial computing power, challenging much larger systems, and this book examines and compares these powerful microprocessor architectures. What we attempt to do in writing this book is to look at these processors from a software point of view. You will find few schematic diagrams in the book, since we are not interested in the hardware level design. You will, on the other hand, find many assembly language programming examples, showing the significance of the architectural variations between the processors we examine.

The challenge of describing what a programmer needs to know about the architectural features of microprocessors has been made even more difficult, but also more entertaining, by a basic split in the architectural philosophies influencing microprocessor design. Until recently, microprocessor development has shown a trend to ever more complex hardware, including specialized features intended to support the use of high-level languages and operating systems. As VLSI techniques have allowed architects to pack more transistors on a chip, they have been able to produce microprocessors with capabilities going well beyond the mainframes of only a decade ago.

In a recent sharp reaction to this trend, a number of designers have proposed, designed, and implemented a class of processors known as RISC processors, or Reduced Instruction Set Computers. Reduced instruction set computers are streamlined processors with a simplified instruction set. This simplified instruction set allows a hardware designer to use specialized techniques to increase the performance of a machine in a manner that is peculiar to these architectures. The RISC view is basically that more is not better if the efficiency cost is too high.

Mainstream architectural designs have been dubbed CISC, for Complex Instruction Set Computers, by RISC proponents. The implicit criticism in this acronym suggests that these processors are much too complicated. Some advocates of CISC designs have retorted that the term CISC should be taken to refer to Complete Instruction Set Computers. It all depends on the point of view.

In this book, we examine most of the important microprocessors, including both RISC and CISC processors. We certainly don't attempt to describe every feature of every representative processor in complete detail (the book would be too heavy to carry around if we did), but we do attempt to cover the most interesting points, and the RISC vs CISC debate is a unifying theme that runs through the book. The importance of RISC processors is well established; Wall Street is almost as familiar with the term as the computer science establishment. In this book we attempt to provide a perspective on the issues and to give a basis for looking into the future to see where this design controversy might lead.

The text is based on a "special topics" graduate course taught at New York University in the spring semester of 1989 by Robert B. K. Dewar. Matthew Smosna began taping the lectures, transcribing and typesetting them, and finally organized the notes into the first version of this book. With the help of our reviewers' comments, we then made several passes through that version, making many technical corrections and additions, and the result is the book that you are now reading.

Selected chapters of the text were read by several of our friends and colleagues at New York University, including Fritz Henglein (now at Rijksuniversiteit Utrecht); Yvon Kermarrek, New York University; Cecilia Panicali, New York University; and Jay Vandekopple, Marymount College. Jim Demmel, New York University, reviewed Chapter 5. Stephen P. Morse, the principal designer of the 8086, helped by telling the inside story of the design of this processor. Dan Prener, IBM Research, helped us to better understand the RIOS. Marc Auslander and Peter Oden, also of IBM Research, shared their recollections of the 801 Project. Our special thanks go to Richard Kenner and Ed Schonberg, both of New York University, who read the whole manuscript, and in some cases read some chapters several times.

We would also like to thank our official reviewers: John Hennessy, Stanford University; Kevin Kitagawa, Sun Microsystems; Daniel Tabak, George Mason University; and Safwat G. Zaky, University of Toronto.

Our proximity to McGraw-Hill in New York City led us to come into unusually close contact with several members of the McGraw-Hill staff. Our sponsoring editor, David Shapiro, provided an enormous amount of daily support. Joe Murphy, senior editing manager, assisted the authors in the art of book design, which was done entirely by the authors using desktop publishing on Compaq PCs (we don't just talk about microprocessors!). Jo Satloff, our copy editor, did a wonderful job editing what was at times a very rough manuscript. Ingrid Reslmaier, editorial assistant, helped with a multitude of miscellaneous tasks and telephone calls.

Finally, in the best tradition of book authors, we wish to thank our wives Karin and Liz for putting up with us and providing invaluable support during the very busy year of 1989, during which we prepared this book.

Robert B. K. Dewar
Matthew Smosna

CHAPTER 1

MICROPROCESSORS

Microprocessors have revolutionized the use of computers at all levels of society. We are rapidly reaching the point where every kitchen appliance and every child's toy will contain a fairly sophisticated processor. In recent years, microprocessor technology has advanced to the point that performance levels rivalling those of mainframes can be achieved. This book addresses the subject of these high-end microprocessors.

Two important events led to the greatly increased usage of microprocessors. First, the introduction of the IBM PC led to the widespread use of personal computers based on the Intel series of microprocessors. Second, a number of companies, including Sun and Apollo, marketed workstations based on the Motorola microprocessors. This popularized the notion in engineering circles that it was often more effective to have a reasonably powerful workstation on your desk than a small share of a powerful mainframe.

It looked for a while as though the microprocessor products of Intel and Motorola would dominate the marketplace in high performance personal computers and workstations. However, requirements for ever increasing performance, particularly for engineering workstations, combined with continuing work in the design and implementation of microprocessor architectures, have recently led to an explosion of alternative architectures.

These alternative architectures are based on new concepts of microprocessor design, collectively referred to as reduced instruction set computers (RISC). In the past, large instruction sets were generally considered an advantage; manufacturers would proudly advertise "over 200 distinct instructions" in their glossy brochures. The RISC advocates have turned this idea on its head by proclaiming that when it comes to microprocessor instruction sets, "less is more."

The fundamental idea behind the RISC architectural philosophy is that by simplifying and reducing instruction sets, eliminating non-essential instructions, the remaining instructions can be made to run very much faster. By non-essential instructions we mean those executed so infrequently that replacing them with sequences of simpler instructions does not have any noticeable impact on efficiency. The observation that inspired the original RISC research was that only a small part of the instruction set of most processors was commonly executed; a large number of instructions were executed rather infrequently.

The existing philosophy, which has recently been described as Complex Instruction Set Computers (CISC) in a somewhat derisive fashion, is by no means dead, and indeed virtually all personal computers, including those made by IBM, Apple, Commodore (Amiga), and Atari, still use CISC chips. When it comes to workstations, RISC chips are making significant inroads, although many workstations are still powered by CISC chips.

The continuing controversy between the CISC and RISC camps is a fierce one. A recent New York Times article¹ describing a conference on the West Coast sounded more like coverage of a boxing match than a scientific meeting. The winner in the judgment of the reporter was RISC, but certainly not by a knockout. In this book we look at a number of representative CISC and RISC microprocessor designs, as well as some which do not clearly fall into either category; the line between the two philosophies is not always completely clear from a technical point of view. Our intention is to understand the strengths and weaknesses of the two approaches, and to begin to guess how the argument will eventually be settled.

WHAT IS A MICROPROCESSOR?

One of the distinguishing characteristics of the microprocessor is that it is usually implemented in VLSI. This means that, unlike minicomputers and mainframes, the complete machinery of the computer is present on a single chip, or possibly a very small number of chips. Floating-point operations, for example, are often implemented using a separate coprocessor chip.

As the architectural features of microprocessors have become more sophisticated, they have become less distinctive as a separate category of machine. With the most advanced microprocessors now being used in workstations, the gap between minicomputer and microcomputer has become somewhat blurred, at least from a programmer's point of view.

¹ "Computer Chip Starts Angry Debate," New York Times, September 17, 1989.

Another important characteristic of microprocessors is that they are relatively inexpensive commodity items. They can be bought off the shelf (the price range is typically $5 to $800), and computers are then built around the microprocessors by second-party manufacturers. Unlike the IBM 370, where the processor is simply one inseparable part of a complete computer system, the microprocessor is a separate chip that appears in many different hardware environments. For instance, the Intel 8088 is the chip used in IBM PCs, but it also turns up as the controller chip for advanced automobile ignition systems. Very few computers using the 8088 are manufactured by Intel, and indeed Intel is not really in the business of manufacturing computers.

One consequence of this approach is that there is no such thing as an Intel 8088 computer or a Motorola 68030 computer. In both cases, and in the case of most of the other microprocessors we will look at in this book, there are a great variety of computers using these chips, and two computers using the same chip may be quite incompatible in many respects. While they may share the same basic instruction set, such issues as memory access, input/output devices, and even the way floating-point computations are performed may vary from one computer to another.

The actual cost of producing a microprocessor is very small, probably just a few dollars. Of course, this figure does not take into account the fact that designing a new microprocessor may cost tens of millions of dollars, which must be recovered in the selling price. However, it does mean that in applications where sufficient numbers of chips can be sold, it becomes feasible to mass produce what are in effect extremely sophisticated computers at remarkably low prices. These chips appear not only in automobile ignition systems, but in microwave ovens, washing machines, televisions, and many other items usually not thought of as requiring the power of a computer.

An interesting application for the near future is in high-definition television (HDTV). HDTV requires sophisticated real-time data compression and decompression algorithms, which need powerful processing capabilities. Within a few years, every living room will probably have more processing power available than the typical large computer center of a few years past. The microprocessor makes such a commitment to large-scale computing practical.

The User-Level View of a Microprocessor

When experienced programmers open up a manual describing a new microprocessor for the very first time, there are certain questions that they have learned to ask. What is the register structure like? What data types are supported by the machine? Are there any interesting or unusual instructions? Does it support tasking or virtual memory? How are interrupts handled? In the first chapter of this book, we will cover some of these general issues to set the scene for looking at individual designs.

Three common places to jump in and look at a new microprocessor are the register set, the instruction set, and the addressing modes. These aspects of a computer architecture are particularly important to assembly language programmers and compiler writers, who must understand this part of the processor perfectly in order to take advantage of the machine. Ideally, high-level language programmers need to know nothing about the inner workings of a processor for which their programs are compiled.

To a large extent this is true in practice; a C programmer can move C programs from one processor to another without knowing details of the different architectures. However, it is often useful to know what's going on, especially when things go wrong. It's much the same situation as driving a car: it is possible to switch from driving one car to another without being a mechanic, but if the engine suddenly conks out, it is useful to be able to look under the hood and know what is there and how it works. In that spirit, we hope to be able to provide a description of microprocessor architectures which will allow you to judge the impact of the instruction set, addressing modes, and many aspects of the hardware that influence how software is written for these machines.

The System-Level View of a Microprocessor

In addition to the user or applications view of a microprocessor, an operating system designer must understand those features of the processor that are intended for implementing system tasks, including

•  Tasking and process management.
•  Memory management and cache control.
•  Exceptions (traps and interrupts).
•  Coprocessor and floating-point unit support.

These are the basic issues we will look at as we examine several microprocessors from the point of view of someone designing a complete system.

The issues of tasking and process management have to do with the support provided in the hardware that allows two or more tasks with separate threads of control to execute on the processor as if each of them were executing simultaneously. Since a single processor can execute only one task at a time, this is achieved by allowing each task to control the processor for a few cycles, with the operating system switching between the different tasks so that they seem to be executing at the same time.

Memory management and cache control both have to do with how an operating system controls a task's use of memory. Most microprocessors have hardware support for virtual memory, allowing a task's addressable memory to be larger than the real memory on a machine, as well as allowing several tasks to share that memory. Caches are small (but fast) memories that hold copies of the data in the most frequently used memory locations.

Exceptions are events that cause the normal execution of a program to be interrupted. They can occur due to an internal event such as an attempt to divide by zero (a trap), or due to an external event such as a keyboard stroke (an interrupt).

Finally, floating-point support is provided by almost all microprocessors, either on the chip itself or using a separate coprocessor chip. The ability to perform floating-point computations in hardware is particularly important given the fact that microprocessors are commonly used to build workstations for scientific and engineering use. For each microprocessor, we will describe how these features are supported and the great variety of approaches used in the design of these processors.

CISC and RISC Microprocessors

A question which is commonly asked by both applications programmers and operating system designers is: To what extent does the processor provide specialized instructions that aid in solving the problem at hand? RISC designs generally provide only a minimal set of instructions from which more complex instructions can be constructed. On CISC processors we often find elaborate instructions intended to simplify programming of frequently occurring specialized operations.

The basic CISC philosophy is to provide an extensive instruction set covering all sorts of special-purpose needs: "You don't have to use them if you don't need them." The contrasting RISC attitude is that these fancy instructions are not really used often enough to justify the extra complexity in implementing the hardware, and that this complexity tends to slow down the more commonly executed instructions.

In practice, the dividing line is not so clear. For example, floating-point division is an extremely complicated operation that, like all other complicated operations, can be programmed using simpler instructions. However, nearly all RISC processors include a floating-point division instruction, because in this particular case it is used often enough to justify its inclusion. On the other hand, the more sophisticated system-level instructions appearing on some CISC processors, such as those that handle tasking, do not seem to be important enough to be universally included in RISC designs.

REGISTERS, ADDRESSING, AND INSTRUCTION FORMATS

Two of the major issues in the design of an architecture are the register structure of the machine and the set of addressing modes provided by the hardware. Deciding on the structure of the register set generally involves deciding how many registers a processor should have, and the degree to which any or all of the registers should have specialized functions. The latter issue is that of register uniformity, that is, to what extent is one register similar or identical to another register. In the design of a set of addressing modes, the designer must decide which particular addressing modes will be useful, and how they will be specified in the instruction.

Along with the design of the instruction set, both of these issues have a significant impact on the design of the instruction formats of a machine, that is, the exact way in which the bits are laid out for each instruction in an instruction set. Some of the bits in an instruction must be used to define the opcode. Other bits are used to define the registers or the memory addresses, or both, that will participate in the operation. Another set of bits must be used to specify the addressing modes. For example, when a machine allows direct addressing, that is, the ability to directly reference a memory location as an operand, space must be allocated so that the bit pattern that defines the memory address can be fit into the instruction format.

We begin by looking at some of the issues involved in the design of register sets and addressing modes, and their impact on the final instruction format of a machine.

Register Sets

The number of registers which are to be included on a machine is a fundamental parameter that has a significant effect on the instruction formats of a processor. The trade-off is very simple: the more registers there are, the more bits are required in the instruction format to reference those registers. For example, on a machine with 32 general-purpose registers, 5 bits in an instruction will be used up each time a register appears as part of the instruction. If a designer wishes to allow register-to-register operations in which some operation is applied to two registers and the result is placed in a third register, then a 16-bit instruction format cannot be used, because 15 of the 16 bits would be taken up, leaving only a single bit for the opcode. There is no hope of being able to fit all three of these register operands into a 16-bit instruction format.

The issue of keeping instructions short is one that often comes up. Many of the CISC designs date from the days when memory was relatively expensive and code density (the number of bytes required to program a given function) was an important consideration. Another factor favoring compact instructions was the concern with execution speed. Instructions must be loaded from memory into the processor to be executed; the more bytes that are needed for the instructions, the more time this takes.

In more recent RISC designs, there is much less concern over instruction density. In the first place, memory is now much cheaper, and we are no longer horrified by programs that occupy several megabytes of code. Second, modern architectural techniques, including instruction lookahead and caching, have reduced the penalty for loading longer instructions, so it is now feasible to have larger numbers of registers than were previously practical.

Although keeping data in registers generally speeds up processing considerably, the point of diminishing returns is reached fairly rapidly. If one plots the speed of a program against the number of registers which are available, the curve flattens out, that is, after a while a compiler cannot make use of more registers. It seems to be generally agreed that no more than 32 registers are needed at any one point. Also, having a large number of registers is not without some cost, since at least some of them will have to be stored when the processor switches from one task to the next.

The other fundamental issue in the design of a register set is that of register uniformity. The term general register was first used in conjunction with the IBM 360 architecture, referring to a set of registers that can all be used in identical ways.

Why don't all machines have uniform register sets? The main reason is that it is tempting to design instructions in which certain registers have been designated for special purposes. By doing so, it becomes unnecessary to allocate space in the instruction format for the register; the use of that register is implied in the instruction. For example, the XLAT instruction on the 386 (used for translating character sets) assumes that one particular register (EBX) points to a translation table and that another particular register (AL) contains the character to be translated. XLAT is only 1 byte long. If it had been designed so that both registers needed to be specified explicitly, it would have needed more bytes. Since code density was a major design point for the 8086, an ancestor of the 386, this lack of uniformity seems like a reasonable trade-off.
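As a rough illustration (written in C rather than 386 assembly, with made-up names), the effect of XLAT is a one-byte table lookup; in the real instruction the table pointer and the character are implied registers, which is exactly what lets the whole operation fit in a single opcode byte:

    #include <stdio.h>

    /* A minimal sketch of what XLAT accomplishes: translate one character
       through a 256-byte table.  In hardware the table pointer (EBX) and
       the character (AL) are implied; here they are ordinary parameters. */
    static unsigned char xlat(const unsigned char *table, unsigned char al)
    {
        return table[al];          /* AL = table[EBX + AL] */
    }

    int main(void)
    {
        unsigned char to_upper[256];
        for (int i = 0; i < 256; i++)
            to_upper[i] = (unsigned char)((i >= 'a' && i <= 'z') ? i - 32 : i);

        printf("%c\n", xlat(to_upper, 'q'));   /* prints Q */
        return 0;
    }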

Even RISC processors occasionally break the tradition of register uniformity under the same pressures. For example, most RISC processors have a procedure call instruction that stores the return point into one specially designated register. Why choose a particular register to store the return point? In a standard RISC processor with a 32-bit instruction format and a 32-bit address space, it is desirable to have the call instruction cover the largest possible range of addressability. Using a dedicated register to hold the return address frees up almost all of the 32 bits in the instruction format to hold the address; allowing different registers to be specified would reduce the range of addressability by 5 bits (for a 32-register machine).

Non-uniform registers are a particular menace to compiler writers. In writing the code generator for a compiler, you want to be able to treat the set of registers as a pool of interchangeable resources. Compilers typically are written so that there is a routine whose responsibility is to allocate registers. It is much easier to write this routine if the compiler does not need to deal with requests such as: "I need a register, and it has to be either special register SI or DI; none of the others will do." If every instruction has its own idiosyncratic set of register requirements, then the problem of allocating register use in an optimal manner becomes very much more complicated, and typically the result is that it simply isn't attempted.
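A toy sketch (ours, not taken from any real compiler) makes the point about that register-allocating routine: with a uniform register set the allocator can hand out any free register, while an instruction that insists on particular registers forces a second, special-cased kind of request:

    #include <stdio.h>

    #define NREGS 8
    static int reg_free[NREGS] = { 1, 1, 1, 1, 1, 1, 1, 1 };

    /* "I need a register, and any one will do." */
    static int alloc_any(void)
    {
        for (int r = 0; r < NREGS; r++)
            if (reg_free[r]) { reg_free[r] = 0; return r; }
        return -1;                      /* no register free: must spill */
    }

    /* "It has to be one of these particular registers" -- the awkward case
       forced on the allocator by non-uniform instructions. */
    static int alloc_from(const int *allowed, int n)
    {
        for (int i = 0; i < n; i++)
            if (reg_free[allowed[i]]) { reg_free[allowed[i]] = 0; return allowed[i]; }
        return -1;                      /* may have to shuffle live values around */
    }

    int main(void)
    {
        int si_di[2] = { 6, 7 };        /* pretend registers 6 and 7 play the role of SI and DI */
        printf("any: r%d, constrained: r%d\n", alloc_any(), alloc_from(si_di, 2));
        return 0;
    }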

Addressing Modes

In choosing a set of addressing modes, the issues of complexity versus utility arise just as in the case of register set size. As we shall see in a later section of this chapter that describes the relationship between high-level programming languages and addressing modes, the CISC tradition has been to include increasingly complex addressing modes that directly support the use of high-level languages. For example, a compiler writer will recognize one addressing mode as the one to be used for addressing variables local to a (recursively callable) procedure, and another addressing mode as the one to be used for accessing global variables.

Just as increasing the number of registers on a machine increases the size of an instruction format, increasing the number of addressing modes may have the same effect. That trade-off has been resolved differently on different machines. In particular, we will see that the 68030 has a rich variety of addressing modes, which results in an instruction size that can vary widely, while the RISC processors all carefully restrict them to a small but important set so that they will all fit into a 32-bit instruction format.

Whether an addressing mode is important or not is quite application-dependent. The first high-level languages to be used extensively in the United States were FORTRAN and COBOL. Both languages have an essentially static view of data, which means that a smaller and simpler set of addressing modes is necessary. For example, it is not so important to provide double indexing, the ability to add two registers in a single instruction to form an address, since FORTRAN array accesses do not need this kind of addressing. In Europe, on the other hand, ALGOL 60 was much more popular. ALGOL 60 has a much more complicated addressing structure, involving the use of a stack to manage recursion. Some of the early European machines had more complex addressing mechanisms reflecting this emphasis. On the home doorstep, Burroughs was a great fan of ALGOL and built machines that reflected this attraction.

These days, stack-based languages, including C, Ada, and Pascal, are in common use, and furthermore, they all support dynamic storage allocation. Modern CISC designs especially reflect anticipated use of more complicated addressing modes that arise from the use of a stack and dynamically allocated data.
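A small illustration (plain C, with invented names) of why static and dynamic data want different addressing: a statically allocated array can be addressed as a fixed base address plus one computed offset, while an array reached through a pointer or a stack frame naturally wants a base register plus an index register combined in a single operand:

    #include <stdio.h>

    static int global_table[10];      /* static data: address fixed at link time */

    /* The compiler can address global_table[i] as "constant address + i*4":
       one register holds i, the base is a constant in the instruction. */
    static int read_static(int i)
    {
        return global_table[i];
    }

    /* Here the base address itself is a run-time value, so the access wants
       base-plus-index addressing: one register for p, one for i. */
    static int read_dynamic(const int *p, int i)
    {
        return p[i];
    }

    int main(void)
    {
        int local[4] = { 10, 20, 30, 40 };
        global_table[3] = 7;
        printf("%d %d\n", read_static(3), read_dynamic(local, 2));   /* 7 30 */
        return 0;
    }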

Designing Instruction Formats

In designing instruction formats, there are two extreme positions. One approach is to have a very small number of formats and fit all the instructions into this small set. The other approach is to design an optimal format for each instruction. Roughly speaking, RISC designers take the first approach, and CISC designers tend more to the second, although even in CISC processors there will be a degree of uniformity in that very similar instructions might as well have very similar formats.

A fundamental decision has to do with the size of the opcode, that is, the number of bits reserved for indicating the particular operation to be performed. Obviously, if more bits are used, then more distinct instructions can be supported. The cleanest approach is to use a fixed number of opcode bits for all instructions. Interestingly, although RISC processors do have uniform instruction sets, they are not quite that uniform, whereas there have been some CISC designs in the past (notably the IBM 360) which always used an 8-bit opcode.

In practice, a designer will recognize that certain operations are much more common than others and react by adjusting the number of opcode bits appropriately. For example, if we have determined that only 4 bits are necessary to represent the most commonly used instructions, then 16 possible bit patterns are available. Fifteen of these are used for the most common 15 instructions, and the sixteenth is used to indicate all of the other instructions. Additional bits then need to be allocated elsewhere in the instruction format so that these less common instructions can be distinguished. In the CISC designs, this sort of principle is carried to extremes. For example, the number of opcode bits in 80386 instructions ranges from 5 to 19. On the other hand, RISC machines tend to have fewer instructions, so fewer opcode bits are needed.

Since various operations need different numbers and kinds of operands, space can be saved if the layout of instructions is specialized to the particular needs of the instruction. Furthermore, once this typical CISC philosophy is followed, there is no particular requirement that different instructions have similar operand structures. For example, the CAS2 instruction on the 68030 (a very complicated beast which we will dissect in detail in Chapter 6) takes six operands, but they are all registers, so the entire instruction with its operands can be fit into a specialized 48-bit format.

On the other hand, RISC designs strongly favor a small number of uniform instruction formats, preferably all of the same size. The regularity of these formats simplifies the instruction-decoding mechanism, and means that a technique known as pipelining can be used. One aspect of pipelining is that several instructions will typically be in the pipeline, allowing the overlapped decoding and execution of several instructions. This kind of overlapped decoding becomes much more difficult for the numerous and complex instruction formats of CISC processors.
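The escape-code idea described above can be made concrete with a small sketch. The format below is hypothetical (it is not the encoding of any real processor): a 16-bit instruction word whose top 4 bits name the fifteen most common operations directly, with the sixteenth pattern escaping to a longer opcode at the expense of operand bits:

    #include <stdint.h>
    #include <stdio.h>

    enum { ESCAPE = 0xF };                       /* the sixteenth primary pattern */

    static void decode(uint16_t insn)
    {
        unsigned primary = insn >> 12;                    /* top 4 bits            */
        if (primary != ESCAPE) {
            printf("common opcode %u, operand field 0x%03X\n",
                   primary, (unsigned)(insn & 0x0FFF));   /* 12 bits of operands   */
        } else {
            unsigned secondary = (insn >> 8) & 0xF;       /* 4 extra opcode bits   */
            printf("extended opcode %u, operand field 0x%02X\n",
                   secondary, (unsigned)(insn & 0x00FF)); /* only 8 bits remain    */
        }
    }

    int main(void)
    {
        decode(0x2ABC);   /* primary opcode 2              */
        decode(0xF3CD);   /* escaped, secondary opcode 3   */
        return 0;
    }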

DATA REPRESENTATION

The issue of data representation has been complicated by the variety of conventions used in different manufacturers' hardware. Machines have had different word lengths, different character sets, different ways of storing integers and floating-point values, and so on. Most microprocessors, on the other hand, have a similar view of how various data types should be stored. This is one area where CISC and RISC designers have few disagreements. Since the data representations are so similar, we will treat them here in Chapter 1, with the understanding that they will apply with only minor modifications to all of the remaining chapters of this book.

Representation of Characters

Through the years, the methods used to represent characters have varied widely, but the basic approach has always been the same: choose a fixed number of bits and then designate a correspondence between bit patterns and characters. The number of bits chosen limits the total number of distinct characters that can be represented. For example, 6-bit codes, used on a number of earlier machines such as the CDC 6600, allow for 64 characters. This is enough to include the uppercase letters, digits, and a selection of special characters, but not the lowercase letters. One of the authors once heard a CDC salesman proclaim in the mid 1970s that "None of our customers need lowercase; it really isn't an issue." Times have certainly changed, and the use of 7- or 8-bit codes allowing lowercase letters is now universal.

Although IBM has persisted in the use of their own EBCDIC code for character representation, the rest of the computer world has standardized on the use of the ISO (International Standards Organization) code. This exists in several national variants, and the variant used in the United States is called ASCII, the American Standard Code for Information Interchange. All the microcomputers that we will look at use ASCII as the code for character representation. This fits well with the basic memory organization of most processors, which is a sequence of 8-bit bytes, each of which can be separately addressed.

The use of ASCII is usually not assumed in the design of the processor itself. There is nothing in the instruction set of most processors that is concerned with what particular character 01000001 represents. In ASCII, this is the code for uppercase A, but that is not the processor's concern.

Some mainframe processors do have instructions that know about character codes. One example is the EDIT instruction on the IBM 370, which in a single instruction implements the kind of picture conversion that appears in COBOL programs. A single EDIT instruction, for example, can convert the integer 123456 to the character string $123,456.00, with the resulting output being represented in EBCDIC characters. None of the microprocessors we describe in this text, not even the CISC processors, have instructions of that level of complexity. Instead, there is a set of instructions for manipulating arbitrary 8-bit quantities, including, in the case of some CISC designs, fancy instructions for scanning, comparing, and moving strings of 8-bit characters, and their instruction sets are completely neutral with respect to the choice of character codes. It would thus be quite possible to implement an EBCDIC-oriented system on a microprocessor, although not even IBM has indulged in such strange behavior.

We should note that 8 bits is not enough for representing characters in languages like Japanese and Chinese. Not only do such languages require very much larger character sets, but even in English, the increased use of desktop publishing and fancy displays means that one wants not only to represent the full set of characters, but also to do it using a variety of fonts. Both requirements lead to the need for larger character sets in the future, sets of 16- or 32-bit quantities. Luckily, most microprocessors are equally at home manipulating strings of 16- or 32-bit quantities, so in this respect they are built for the future.

Japanese is the main focus of these efforts since Japan is so prominent in the computer field. The issue of character sets is perceived as an international problem that needs a smooth international solution. Japan itself is most interested in having international standards to solve such problems. At a recent meeting at which the issue of representing Japanese characters in Ada was discussed, the Japanese delegate to the relevant ISO committee explained that Japan is concerned with complaints from other countries over non-tariff barriers to imports. A Japanese standard that is not internationally accepted can be regarded as being a non-tariff barrier. It is interesting that an international political conflict can ultimately affect the representation of character codes on microprocessors!
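To make the neutrality point concrete, here is a small C illustration (the printed values are ASCII only because that is the character set of the underlying system; nothing in the processor cares): characters are just 8-bit numbers, and string operations are just byte manipulations.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        unsigned char a = 'A';
        printf("%c is stored as %d (01000001 in ASCII)\n", a, a);

        /* String moves and comparisons handle arbitrary 8-bit quantities;
           they would work identically on EBCDIC-coded bytes. */
        unsigned char src[] = "HELLO";
        unsigned char dst[6];
        memcpy(dst, src, 6);
        printf("%s, compare = %d\n", (char *)dst, memcmp(dst, src, 6));
        return 0;
    }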

Representation of Integers These days, everyone agrees that storing integers in binary format is a good idea. But it hasn't always been so! Early on there was quite a constituency of decimal machines, especially in the days when tubes were used to build computers. In those days, it was cheaper to build one 10-state tube than to build four binary-state tubes. In most scientific

allows for

more

written in

COBOL,

efficient

the

programming,

2

integers are stored in binary format,

which

handling of computations. Even in commercial applications

COMPUTATIONAL format allows

programmers

to specify the

use of binary format for quantities that will be used for extensive computations.

For unsigned integers, the binary representation indicate powers of 2,

decimal integer 130

00000000 00000 1

and the most

10000010 when

is

when

1

significant bit

it is

stored as a

it 1

is

is first

2

The

first

author's uncle

worked first

6-bit binary value.

a 12-state device

(it

was

for Plessey's (a large

a

For example, the

From time to

time,

row of decimal

(called the

devices, followed

called a duo-decatron).

Even the

some

we should write integers the other

computer firm

computer. This machine

numeric quantities consisting of

and

left).

Alan Turing, the famous computer

(least significant digit first).

involved with their very

the

stored as an 8-bit binary value and

mathematicians have tried to persuade the world that

way around

obvious. Successive bits

is

(at

in

England)

PEP) had by

British

at

scientist,

one time and was

registers for representing

a binary device, a

decimal device,

might have forgotten that that

is

a

reasonable format for pounds, shillings (which went up to 20) and pence (which went up to 12), because the British long ago

knew what

its

changed

to a decimal

domain was going

to be!

money

system. This

is

a

remarkable case of hardware that

really

DATA REPRESENTATION

TABLE The

1.1

representation of signed and unsigned 4-bit

vali

Unsigned Value Signed Value

Bit Pattern

+0

0000 0001

1

+1

0010

2

+2

0011

3

+3

0100

4

+4

0101

5

+5

0110

6

+6

0111

7

+7

1000

8

-8

1001

9

-7

1010

10

-6

1011

11

-5

1100

12

-4

101

13

-3

1110

14

-2

1111

15

-1

1

always wrote numbers the "wrong" way, but he did not scientists to follow his lead! It

significant bit

have no that

1 1

we

real significance

on

more

as

a matter

being on the

we we need

look

are several

of convention

left,

convince computer to regard the

because, of course,

a silicon chip. Nevertheless, the

ways of representing signed

convention

integers, but

at use the twos complement approach, so this

to look at in detail.

confusing, even for those architectures,

we need

to

The

two's

who know

it

left is

most

and right

so universal

To keep our examples

all

the microprocessors

the only representation that

complement representation can be

when and why we need

simple,

we

will for the bits.

moment assume

that unsigned

1.1. Bit patterns starting

and

In a 4-bit register, unsigned values range

and signed two's complement values range from minus 8

shown in the Table same values in both

quite

quite well. In looking at instruction set

and unsigned numbers.

signed integers are represented using four to 15,

is

have a clear understanding to see

separate instructions for signed

from

to

always think of integers being stored this way.

There will

is

of a binary integer

manage

with a zero

bit

on the

left

to plus 7, as

represent the

and signed case. Bit patterns starting with a one bit numbers in the signed case. The starting, or leftmost bit, is Whether or not a bit pattern whose sign bit is set to 1 is to be regarded the unsigned

are interpreted as negative called the sign

bit.

as negative (i.e.,

whether the number

is

to be regarded as signed or unsigned)



is

programmer you cannot look in a register, see the sign number is present. For example, suppose that a register contains the bit pattern 1101. This may represent either 13 or minus 3, and it is the logic of the program which determines how it is to be interpreted.

something that bit set

is

up

and know that

to the

a negative

MICROPROCESSORS

12

For these potentially negative numbers, the signed and unsigned interpretations always differ by 2

This

is

k

where k

,

the

is

number of bits

important to note, because

subtraction

work

— 16

why

in the case

of 4-bit numbers.

the operations of addition and

both signed and unsigned values. Consider the operation:

for

= 1111

0010 + 1101

numbers

explains

it

is adding 2 to 1 3 to get 1 5. If the numbers same addition is adding +2 to -3 to get -1 What is really k happening is that the normal binary addition is addition mod 2 that is, factors of 16 are simply ignored. Since the signed and unsigned values differ by 16, the resulting bit patterns are the same in the unsigned and signed case. In designing instruction sets, we only need one set of addition and subtraction instructions, which can then be used for signed or unsigned operands at the programmer's choice. The one difference between signed and unsigned addition arises

If the

are regarded as unsigned, this

are regarded as signed, this

.

,

in detecting overflow.

= 1110

+ 0111

0111

Considered

as

Consider the addition:

unsigned, this adds 7 to 7 to give a result of 14. However,

are interpreted as signed, clearly

wrong. The

mathematical

we

are adding

result has

result.

What we

the

+7

wrong

have here

is

to

+7 and

sign

and

if the

getting a result of —2, is

operands

which

is

16 different from the true

an addition that from the signed point ofview

causes arithmetic overflow.

A programmer will

often

want

to

be able to detect arithmetic overflow for signed

The programming language Ada requires that these overflows be detected, since an overflow can raise an exception known as a CONSTRAINT_ERROR, which can be

values.

handled by the program. Processors take one of two possible approaches to satisfying sets of addition and subtraction instrucwhich differ only in the detection of overflow, or they provide one set of instructions which set two separate flags, a carry flag which detects unsigned overflow, and a separate signed overflow flag for the signed case.

this

requirement. Either they do provide two

tions,

What about other operations? For multiplication, is

there are

two

single length, then the resulting bit patterns are, like addition

same

for the

0001

x

unsigned and signed =

1111

1

1

cases.

1111

For unsigned operands,

we have + times —

cases. If the result

and subtraction, the

we have

1

times 15 giving a result of 15. For the signed case,

giving a result of — 1 As with addition and subtraction, the overflow .

conditions are different, but the resulting bit patterns are the same, so only one single-length multiplication operation

Many machines

is

required.

also provide a multiplication instruction

which

length result. In this case, the signed and unsigned cases are different:

0001

x

1111

=

00001111 (unsigned case)

0001

x

1111

=

11111111

(signed case)

gives a double-

3

DATA REPRESENTATION

This means that

double length result multiply instruction

if a

is

provided,

it

1

should be

provided in two forms, signed and unsigned. Given only one of these two possible forms, the result for the other can be obtained with only moderate effort, but

much more convenient

certainly

Division

is

to

it is

have both.

different in the signed

and unsigned

cases even

where

all

operands are

single length:

If a

1110 - 1111

= 0000

(unsigned case)

1110 - 1111

= 0010

(signed case)

machine provides divide

should be provided.

It is

instructions, then separate signed

and unsigned forms

quite difficult to simulate one of these results given only the

other instruction. For example, simulating unsigned division given only a signed divide instruction

is

unpleasant.

The final operation

to be considered

signed and unsigned operands

is

comparison. Here again the situation with

is

obviously different:

1110 > 0001

(as unsigned values)

1110 < 0001

(as signed values)

As with addition and subtraction, there are two approaches that can be taken. Either two sets of comparison instructions must be provided, or a single set of comparison instructions is used which sets two sets of flags, and then there are two sets of conditional branch instructions, one giving the effect of unsigned comparisons, and the other for signed comparisons.

SIGN-EXTENSION. To move an unsigned number to ing a value with zero bits on the left. For instance, contains an unsigned value in the range a 32-bit register

by supplying 24 zero

If a signed value

if

an 8-bit

memory

location

to 255, then this value can be loaded into

bits

must be extended

a larger field involves extend-

on the

left.

in size, then the sign bit

must be copied

into

on the left. This process is called sign extension. For example, if the 4-bit pattern 1100 must be extended to 8-bits, then the result is 11111110. There are various the extra bits

approaches to providing sign extension capabilities. instructions for sign extending values. If there are

extension

is

no

Some

processors have specific

specific instructions, then sign

usually achieved using the arithmetic right shift instruction,

which prop-

agates sign bits, as in the following example:

Byte value

Load

in

memory:

1

01 01 01

into 32-bit

register zero extended: Shift left

24

bits:

Shift right arithmetic

24

bits

ADDRESS ARITHMETIC. One ing addresses.

On

:

00000000 00000000 00000000 10101010 10101010 00000000 00000000 00000000 11111111 11111111 11111111 10101010

important use of unsigned arithmetic

is

in

comput-

the 32-bit microprocessors discussed in this book, address arith-

metic uses 32-bit unsigned addition and subtraction.

MICROPROCESSORS

14

Unsigned arithmetic has "wrap-around" semantics, which means that carries are An important consequence is that the effect of signed offsets can be achieved

ignored.

without signed arithmetic. For instance, of an

offset,

if an

mode provides

addressing

for the addition

then adding an offset of all one-bits has the effect of subtracting one. Even

though the address arithmetic

unsigned, the offsets can be regarded as signed, since

is

signed and unsigned addition gives the same results.

For the same reason, sign extension of offsets also makes sense, even though the address arithmetic

which

is

unsigned.

A common arrangement

8-bit offset field

first

is

address with an unsigned addition.

arithmetic overflow

is

We stress

integers that

cant

fit

is

unsigned since

don't

want any kind

that address arithmetic

as a result

MULTIPLE-PRECISION ARITHMETIC. Software on

to provide short offset fields

not relevant for address computation

of overflow error conditions to be signalled

tic

is

added into the address. For example, an sign extended to 32 bits, and then the result is added to the

are then sign extended before being

—we

of computing addresses.

routines for performing arithme-

into registers can be handled natura ll y by

instructions using algorithms similar to those used by

o n long number s. Most processors

we will look

at

most humans

to

t

he proces sor

do

arith

m etic

have some support for assisting

in

writing such routines. For addition and subtraction, a carry indication and special versions of the add and subtract operations that include the carry

from a previous and we find such instructions even on most RISC processors. For multiplication and division, we need double-length operations, and some RISC machines don't even have single-length multiply and divide, so we don't necessarily get much help when it comes to multiple-precision multiply and divide. stage are needed,

Packed Decimal With current design techniques, of four binary that

is

called

bits

it is

more reasonable to store decimal data as a sequence

than in a single 10-state device. This

packed decimal

If

you have 4 binary

bits

is

a very standard data format

per decimal character with the

obvious binary encoding, then the decimal integer 13 looks

This format

is

like

processing must be able to deal with numbers in decimal format to

0001001

1

in binary.

important, because computer languages intended for commercial

do mostly I/O operations and

relatively little arithmetic.

The

if a

program

is

going

conversion of binary

is a rather expensive operation whether it is done in hardware Adding two packed decimal numbers, on the other hand, is less efficient that adding two's complement integers, but not terribly so. Multiplication and division of packed decimal numbers is not nearly as efficient, but since these operations may not be performed as frequently as addition and subtraction, this may not be an important concern. If all that is done is a little bit of addition and subtraction and a small amount of other arithmetic, it may be attractive to store integers in decimal

to decimal (and vice versa)

or in software.

format, since

When

it

improve the efficiency of input/output operations. done on integers in this packed decimal format it is nice

will greatly

arithmetic

is

the hardware provides instructions that support this format. Full-scale

if

CISC machines

MEMORY ORGANIZATION

l

IBM 370 have

ike the

15

add two packed decimal numbers, each with is done in a single hardware instruction.

instructions that

16 digits, giving a 16-digit result. All of this

Of

course,

microprocessors are capable of operating on packed decimal

all

numbers using s oftware. Even on some of the RISC processors specialized su pport for

the case of the is

no

slower than the hardware instructions on the IBM mainframes. In 80386 and 68030, we do not have full-blown decimal arithmetic, but

much

tions are not

there

that have absolutely

packed decimal, the speeds of these software-supported opera-

a small set

of instructions to

assist in

writing software routines of this type.

Floating-Point Values

The

as numerous as which have supported them. One unpleasant consequence of has created an incompatible mess of hardware where floating-point

formats used used to represent floating-point numbers have been

the variety of machines this variety

that

is

it

some

calculations have yielded slightly, or in

cases completely, different results as they

were moved from one machine to another.

The IEEE P754

standard for floating-point arithmetic, approved and published

attempted to remedy

in 1985,

this situation

and operating on floating-point

data.

specifying a highly desirable approach,

by specifying

Although it

has

it

still

a

uniform method

for storing

has been widely recognized as

not been universally adopted. Too

much hardware has been built using proprietary formats such as those of IBM and DEC. However,

in the microprocessor world, the

point just as Intel was designing the the 8087. This chip

is

not quite

first

100%

IEEE standard appeared

compatible with the standard, because there

were a few last-minute changes in the standard that i/lfiowever,

all

The complex.

details

We

to the 8087, are compatible with the

of

how

8087 was designed, including the 80287 and

just after the

subsequent microprocessor floating-point chips,

80387 follow-ons

at a cr it ical

commercial floating-point coprocessor chip,

IEEE standard and manipulated .

floating-point values are stored

devote the whole of Chapter 5 to

are quite

this subject, reflecting the fact that

floating-point calculations are extremely important in the microprocessor world. In the

and more and top-end video games rely on

case of engineering workstations, floating-point performance

mundane efficient

applications like high-definition television

and accurate floating-point operations

is

critical,

.

MEMORY ORGANIZATION Almost

all

microprocessors organize

memory

into 32-bit words, each of

which

is

divided into four 8-bit bytes These bytes can be individually addressed, so for .

purposes one can equally well regard the

T he

two ways

memory

as

being logically

some composed of a

which the various processors differ are the order in which successive bytes of multiple byte quantities are stored and whether such quantities must be aligned on specific boundaries. sequence of 8-bit bytes.

in

MICROPROCESSORS

16

Big-Endian vs Little-Endian Byte Ordering

The

memory

organization of

to be addressed.

means

into bytes

that the ordering of these bytes needs

As English speakers, we normally think of data

to right, rather than right to

left.

When we

being arranged

left

think of successive bytes in memory,

we

think of lower-numbered bytes as being to the

example,

we think of a

32-bit

number

When the

left

a

number

and the

is



I

I

32

stored in a register,

least significant bit

I

bits

we

bytes

out

to 3 laid

as

D—

2

1 I

of higher-numbered bytes. For

left

memory occupying

in

as

-

think of the most significant bit being on

being on the right, because

this

is

the

way numbers

are represented in English:




bits

natural to assume that

when

a 32-bit

is loaded from memory, the high-order bit of the number is the leftmost and the low-order bit of the number is the rightmost bit of byte 3:

2

1

3

bit

number

of byte

I

'

r

low

high

32

J

bits

This picture corresponds to big-endian byte ordering, where the "big end" or the most significant byte

is

stored in the lowest addressed byte in

indeed store multibyte quantities in

memory

in this

memory. Many

However, the apparent naturalness of this ordering dent on our writing customs. Arabic left to right.

3

is

written right to

left,

is,

do

of course, simply depen-

but numbers are

Arab readers might therefore find it more natural

in the following

processors

manner.

to write the

still

written

above picture

manner 2

3

1

high

o

\

low |

32

'

Train schedules

right

in the

(

bits

lasablanca station, for instance, have familiar times, but the departure

of the board and the destination

is

on the

— most confusing

left

for

Western

readers!

is

on the

MEMORY ORGANIZATION

17

and might therefore naturally expect to find the high-order bit in the leftmost bit of byte 3 and the low-order bit in the rightmost bit of byte 0. This picture corresponds to little-endian byte ordering, where the "little end" of the number is stored in the lowest memory byte. The reason we mention Arabic here is to emphasize that there is nothing inherently natural in choosing one ordering over the other. little-endian ordering "backwards," since they are

being organized

left to right,

and

so they think of the little-endian picture

low

high

32

-

at the

hardware

there

level,

call

as

as:

2

1

However,

You may hear people

determined to think of memory

no

is

|

-

bits

and

left

right,

and even the convention of

thinking of the register as having the most significant byte on the

left is

purely arbitrary.

For various historical reasons, both kinds of byte ordering are found in currently

We will

available microprocessors.

find four different approaches:



Processors like the Intel 80386, which always use little-endian byte addressing.



Processors like the Motorola 68030, which always use big-endian byte addressing



Processors like the

MIPS 2000, where

big- or little-endian addressing

a signal at reset

to be used,

is

.

time determines whether

and the mode then never subse-

quently changed. •

Processors like the Intel i860, where there

is

a software instruction to

backwards and forwards between the two modes while

From

a

programming point of view,

it

a

program

generally does not matter very

is

change

running.

much which

type

of addressing we have, although there are times when we certainly have to be aware of the endian i

nstance

is

if

mode. In

we

particular,

transfer a binary

68030-based

—we

identical except {or the

this picture,

a PC,

is

which

passed between machines is



for

386-based, to a Sun-3, which

the

is,

way integers and characters are represented,

are generally

annoying difference in endianness. Furthermore, there

algorithm for the conversion

From

binary data

from

often have considerable trouble. For example, the data formats of

the two processors, that

containing a 4-byte

when

file

field,



it

is

Fl, followed by

we can

is

no

set

data dependent. Consider the case of a record

two 2-byte

see that the pattern

fields,

F2 and F3

(see Figure 1.1).

of byte swapping required to convert

from one format to the other is dependent on a detailed knowledge of the data layout. There are a few cases where one of the orderings is more convenient than the other. For example, if a

big-endian ordering

is

dump

of

memory

more convenient

the situation exactly reversed). Generally,

unfortunately, there

is

is

displayed byte by byte

left to right,

(but Arabic-speaking programmers might find it

which ordering is used, but camps are well established and

doesn't matter

no hope of agreement,

since both

each regards the other as being hopelessly backwards.

MICROPROCESSORS

18

F2

F1

MSB

LSB

MSB LSB

|

|

v

v

V

MSBJ^J-SBJ

j

V

V

MSB LSB MSB

LSB >

A

r

msb|

LSB A

F2

F1

FIGURE

F3

1.1

Converting

from big-

a record

to little-endian.

Big-Endian vs Little-Endian Bit Ordering

When bit as

a binary value

being on the

Although

this

is

is

we might

we

this

the

left to right

(the

left

most

it is

significant

left

to right.

well established, and the pictures and

book, and indeed throughout the reference manuals for

all

discuss in this book, use this convention.

However, there remains an from

we normally think of the most

think of bytes as being laid out

an arbitrary convention,

diagrams throughout the processors

stored in a register,

left just as

issue

of whether the

or from right to

left.

The

significant bit),

and

bit

bits in the register are

left-to-right ordering

31

is

on the

means

numbered

that bit

is

on

right (the least significant bit): 31

_ —

„J°!?lZI

l£llL~.

-




32

MICROPROCESSORS

procedure THINK X, Y,

A

Z

:

is

INTEGER;

array (1..100) of

:

INTEGER;

begin

X

:=

A(Y);

end

The

addressing of A(Y) involves both using the frame pointer as a base pointer and

using

Y

as

an index

(see Figure 1.6).

Computing

the address of A(Y) involves three

elements: the base address, in this case the frame pointer; the starting offset, which

known at Some

compile time; and the index value, which

may

typically

need

scaling.

processors provide this type of base-index addressing, sometimes called

double indexing,

since, as

registers are similar.

On

we observed

before, the functions of base registers

and index

such processors, the fetching of A(Y) corresponds to a single

load instruction. Other processors not provid in g this double indexing feature

um

may

which the necessary indexing address that is, th e scaled index, m ust be computed and placed in an of the frame pointer and the

require a sequence of instructions in s

is

,

index register so that single indexing can be used. also

It is

important to note that in the case

an element of an array allocated

to access

if a

compiler needs to generate code

in a stack frame, there are really three

Run-time stack

fp



(frame pointer

Offset to first element of A, known at compile time

points to current

frame, THINK)

_Z_ A(100)

A(Y)

A(2)

Y

is

the index value

A(1)

FIGURE

1.6

Use of based plus index addressing

to address arrays allocated

within stack frames.

ADDRESSING MODES

33

components involved: the frame pointer, the starting offset of the array, and the (possibly scaled) index. As we shall see when we compare the 386 and the 68030 to the RISC chips, only the CISC processors provide an addressing mode which allows one to access such an array element in a single instruction. Some RISC chips do have the double ndexing, but none of them allow a programmer to add two registers as well as a constant i

displacement to form an address. This

which

is

a

consequence of the instruction formats,

consquence of the decision to use pipelining, and

are in turn a

is

consistent with

the philosophy of keeping things simple.

Indirect Addressing

When

parameters are passed to procedures, the value passed and stored for use by the

calling procedure

is

often the address of the actual parameter, rather than a copy of the

value of the parameter. In this

method,

=

+

1

programming language may (e.g.,

is

the

require the use of

VAR parameters of Pascal).

optional. Consider the case of the

procedure:

SUBROUTINE QSIMPLE 1

cases, a

of passing parameters

method of passing parameters

In other cases, the

FORTRAN

some

call by reference,

(I)

1

END Within QSIMPLE, the value stored for the parameter is not the value of I, but the address of I. This means that when I is referenced, there is an extra step of fetching the address of I and then dereferencing it (see Figure 1.7(a)). Obviously the reference to I can be achieved by

and However, some addressing which in a single

using an instruction to load the address of

first

I

into a base register

then using based addressing (with an offset of zero) to access processors provide an addressing instruction

mode

fetches the pointer to

first

the actual value of I. extra instruction

is

Of course,

I

called indirect

and then uses

this still takes

an extra

I.

this pointer to fetch (or store)

memory

data reference, but an

not required.

Indirect Addressing with Indexing If the

parameter being passed

addressing must be

FORTRAN

combined

=

an array, then indirect addressing and indexed

an element of the

array.

If in the

above

example, the parameter had been an array

SUBROUTINE QARRAY DIMENSION D(100) D(I)

is

to access

D(I)

+

(D)

1

END then accessing D(I) would involve getting the address of the subscript

I

(see Figure 1.7(b)).

D

and then indexing

it

with

34

MICROPROCESSORS

Memory

Static

address

Static data

of

Memory

Static

Addr

I

address

item,

known

at

link

of

D

Static data item, address

time

known link

value of

at

time

D(100)

I

D(l)

Index value

is

(scaled) value or

D(2)

I

D(1) (a)

FIGURE

(b)

1.7

Indirect addressing of a simple variable,

As

mode look

this gets

and an

array.

more complicated, the issue of whether to provide a single addressing becomes more contentious. Only one of the processors we

that handles this case

Motorola 68030) has this addressing mode built in. On other processors, two or more instructions is needed to access an indirect array element.

at (the

a sequence of

INDIRECT ADDRESSING WITH BASING. so

far,

In our examples o f indirect addressing

the pointer has been allocated statically. However, in a stack-based language,

word

the pointer

required to access

procedure

itself it.

may be allocated on the QSIMPLE in Pascal

Written

QSIMPLE

(var

I

:

stack,

and thus base addressing

instead of

is

FORTRAN,

INTEGER);

begin

I

:=

I

+

1;

end QARRAY; then the parameter passed for in the stack

an

frame for

I

would be

QSIMPLE

a pointer to

(see Figure

1

.8).

offset to a base pointer to get the pointer to

the value of

I.

Again we could do

addressing modes, but

some

this

I

and

this

pointer

Now addressing

I,

and then using

I

would be stored

involves

first

adding

this pointer to access

with a sequence of instructions using simpler

processors have this addressing

mode

built in.

ADDRESSING MODES

35

Run-time stack



>

fp

(frame pointer

Offset to

points to current

is

frame,

I

known

pointer at

compile time

QSIMPLE) Address

-*j

of

value of

I

I

I

-ttJ-§ :

FIGURE

1.8

Using indirect addressing with based indexing.

INDIRECT ADDRESSING WITH BASING AND INDEXING. For consider the case where an array

QARRAY

(D

in

INTARRAY)

:

the grand finale,

passed as a parameter in a stack-based language.

QARRAY

Suppose that we had written procedure

is

Ada

instead of

FORTRAN

is

begin 6(1) := D(l)

...

+

1

...

;

end QARRAY; then the parameter passed for

D would be a pointer to the array, and this pointer would QARRAY (see Figure .9). Now the access to an element

be stored in the stack frame for

of D involves three

1

we use based addressing to get the pointer to D; then we finally we used base plus index addressing, using the pointer

steps: first

dereference this pointer;

and the subscript as the index. This is getting quite complicated, and the Motorola 68030) have a few processors (just one among our examples

as the base

relatively



specialized addressing

mode

allowing a single instruction to be used for this access.

other processors, accessing an element of D

may

take

up

On

to four instructions.

Even More Complicated Addressing Modes It is

and data

possible to write structures

accesses in high-level languages corresponding

to arbitrarily complicated addressing sequences:

type

A

= array

REC2

[1

INTEGER;

..10] of

= record

....

REC1 = record X = array [1 ..10]

AA

var

G I

:

X;

:= X(I) A

:

A;

...

end;

A

Q REC2; A of REC1;

...

.Q A .AA(J);

:

...

end record;

36

MICROPROCESSORS

Run-time stack

FP (frame pointer) points to current

frame.

1

QARRAY)

I

Offset to is

Pointer to

known

D

pointer

compile time

at

D

D(100)

D(l)

Index value

D(2) D(1)

FIGURE 1.9 One use of indirect

We won't

the I

addressing with basing and indexing.

even attempt to draw

expression!

is

(scaled) value of

You can imagine

a picture

of th e

that a processor

memory

might be

access c orresponding; to this

built

with an amazing addressing ;

mode

exactly

corresponding to the required access sequence.

H owever,

not even the

most ardent CISC advocate would expect to see a processor go this far in providing modes! How far is far enough? This is an important point in designing microprocessors Qne of the important factors differentiating; CISC and RISC designs is precisely that o f specialized addressing

.

,

a ddressing

modes. RISC processors tend

uniform, and highly efficient

set

to concentrate

on providing

paths can be constructed as a sequence of instructions designs tend to include a complex set of addressing;

common

a relatively small,

of addressing modes from which complex addressing

when

needed, whereas CISC

modes intended

high-level language situations such as those

we have

to take care

described here

.

of

The

Motorola 68030 goes further than the examples here and includes some even more is difficult to explain in terms of programming langauge Whether this is an appropriate design choice is one of the questions to be answered as the CISC and RISC designers battle things out in the marketplace.

complicated modes whose use features.

MEMORY MANAGEMENT 37

MEMORY MANAGEMENT At the hardware level, the main memory of a microprocessor can be regarded as a vector of 8-bit bytes, where the vector subscript is the memory address. In earlier machines, and in some simple machine designs today, the logical view of memory is identical to t his

t

—when an

hardware view

instruction references a

memory location,

it

o fetching or storing the data from the designated locations in physical

correspond s

memory

.

Although this view of memory results in a very simple organization from both a sofware and hardware point of view, it is quite unsatisfactory for a number of reasons: •

If several programs are running on the same processor in a multi-programmed manner, then we have to make sure that they do not conflict in their use of memory. If programs reference physical memory directly, then this avoidance of conflicts would have to be done at the program level.



Physical

memory

is

limited in

size. If programs

address physical

memory directly,

then they are subject to the same limitations. Furthermore, the amount of physical

memory

varies

from one machine to another, and we would prefer that these way progams are written.

variations not affect the •

Compared

to the

has to access

speed of processors, memories are rather slow.

memory every

time

it

matter on every instruction, since the instruction

memory,

t hen access to the

If a

program

really

executes a load or store instruction, or for that

memory would become

the overall execution speed unacceptably

itself

has to be fetched from

a bottleneck that

would

limit

.

To address these problems, the microprocessors we discuss in this book all provide for memory management. T his phrase refers to a combination of hardware and operating system features which provide for efficient logical

and physical memory

memory

access by separating the notion of

accesses.

Memory Mapping To solve the problem of separate programs intefering with one another, some kind of memory mapping facility is provided by the hardware. This automatically performs a mapping function on all addresses used by a program so that the addresses used within a program do not correspond directly to physical addresses. The simplest approach is to simply relocate all addresses by a constant, as shown in Figure 1.10. By providing a limit register, this

scheme

memory

which

indicates the length of the logical

also allows the

outside

its

own

memory

for a given

program,

hardware to check that a program does not reference

logical region.

This simple base/limit approach has two limitations. First, the memory for a given program must be contiguous. It is always more difficult to allocate large, variable-sized, contiguous chunks of memory, than to allocate in small fixed-sized blocks. Secondly, there is no way that two programs can share memory. Although the general idea is to separate the logical address space of separate programs, there are cases in which we do

38

MICROPROCESSORS

Main memory

Program one addresses

Memory

for I

program one

as though

address

J

for

I

program two

FIGURE 1.10 Memory management through

want

to share

code

itself

memory. In

0.

this section of

as though J

memory

started at

it

Program two addresses

1

Memory

this section of

address

memory

started at

it

0.

the use of relocation.

particular, if two

programs are using the same code, then the

can certainly be shared.

A more flexible scheme divides the logical address space of a program, i

ts

virtual address space, into a sequence

pages are individually

mapped

of fixed-length chunks called

into corresponding physical pages

contiguous in physical memory, allowing a simpler, more

pages.

also called

These virtual

which need not be

efficient allocation

of physical

memory.

The mechanism table

for

mapping the pages is typically quite complex, and involves in memory. Since it would be unacceptably slow to search

lookup structures stored

these structures for every

memory reference,

the processor has a small piece of the table

stored locally in a translation lookaside buffer (TLB). reference

found

is

to first look in the

there. If not, the

The

The approach on

a

memory

TLB, and hope that the necessary translation entry

main memory

is

tables are consulted.

of how these translation tables are stored and accessed, and the how much of this process is in the hardware, and how much is left up to

details

decision of

the operating system, vary considerably from one processor to another. several quite different

schemes

as

we look

at the various processors.

We

will see

MEMORY MANAGEMENT 39

Memory

Virtual

mapping scheme with

Given that we implement

a

implement the concept of

virtual memory. This allows a

memory

not limited by the

space that

is

we have

to

All

do

The

translation tables that says "page not present."

The

traps to the operating system.

on

and when

disk,

it

is

now

to reference a virtual

typically a single bit, to the page

translation process sees this bit

and

it

reads the required page into

memory, swapping

up the page table entries to indicate program can then continue.

that the

new page

present. Execution of the

This approach that

fixes

a small step to

operating system maintains these not-present pages

gets the trap,

out some other page, and

program

it is

of physical memory.

add information,

to

is

size

fixed-size pages,

demand paging,

called

is

swapped

since pages are

in

on demand,

is,

when

they are referenced. Obviously the execution speed becomes painfully

if

every

memory

slow

reference results in a disk read, but

what we hope

is

that in

practice, the great majority of references are to pages which are present, so the overhead

of page swapping

minimal

is

To minimize

this

.

make

overhead, the operating system must

appropriate deci-

swap out when new pages are demanded. There are many algorithms designed to optimize these decisions. Most are based on some variation of the least recently used (LRU) principle, which suggests that the appropriate page to which pages

sions as to

discard

some

is

the one

to

which was

least recently accessed.

Most paging hardware provides

we when a page is accessed, and the other, modified. The latter information is important,

limited support to assist in implementing such algorithms. In particular,

usually find

two

called the dirty

bits in the bit, set

page

when

tables,

page

a

is

one

since pages that have not been modified

old image on disk

set

do not need

to be written

back

to disk (the

valid).

is still

Memory Caching To avoid

the problem of referencing the relatively slow

r eference

instruction,

microprocessors use

much more

to obtain the instructions to

memory caches^ These

microprocessor chip

memories

and

itself,

or

are relatively small,

faster

main memory on every memory be executed J high-performance

are small, very fast

on intimately connected

it is

memories, either on the

separate chips

economically feasible to use

hardware, resulting in the ability to access

memory within

main memory. A memory reference then becomes a two-step operation.

.

Since these

much more

expensive,

the cache

much

rapidly than

to see if the desired

memory

location

is

present. If so, then

First the it

cache

is

checked

can be accessed in the

main memory. If not, then main memory As with the TLB and page table accesses, we hope that most of the time the memory we want ism the cache, so that the overhead of the slow main memory cache, completely avoiding the relatively slow

must be is

accessed.

minimized.

How

often will

the size of the cache favorable



we

find the data in the cache? This obviously depends

and the pattern of references

for example,

when we

in the

program.

Some cases

on both

are clearly

execute a tight loop, the instructions of the loop can

/

40

MICROPROCESSORS

generally be expected to be found in the cache, a situation

the other hand, following a linked

bad case which

A cache

list

256

lines,

is

organized into

is

On

refer to as a cache hit.

is

a

lines

where if a

a line

cache

is

is

a contiguous sequence of bytes

on

4K bytes long, it might be organized The choice of line size is an number of references to main fewer lines, and we are less likely

each containing a 16-byte chunk of memory.

important design parameter.

memory

we

around the memory space

all

will result in cache misses.

an appropriate boundary. For example, as

which roams

increased. If

If

it is

too small, then the

too large, then there are

it is

memory we want in the cache. Typical choices are in the 16-byte range, although we will see caches where this parameter varies considerably. One important design consideration is that it must be possible to search the cache very efficiently, since this search takes place on every memory reference. Going back to to find the

our example of a

4K cache divided into 256 lines of 16-bytes each,

the task

is

to quickly

if any of the 256 lines contains the data we are looking for. At one extreme, a fully associative cache can be constructed, where the data we want can be stored in any of the 256 lines, so that at least from a logical point of view, all 256 lines have to be checked. Obviously we cannot search the lines serially, so we

determine

heed some rather elaborate hardware

to search

all

possibilities in parallel. It

is

possible

to construct such hardware, but quite difficult to keep the performance high enough

down memory

to avoid significantly slowing

At the other extreme,

references.

a directly addressed'cache

is

organized so that a given

location can only be stored in one particular cache line.

One way of doing this

memory is

to use

a portion of the address to specify the cache line. For example, for 32-bit addresses

cached

in

The 4-bit line

is

our

field

4K cache with 256

is

lines, the

address could be divided into three

the byte within the cache line

The

addressed.

issue

is

and the

we want, that is whether the top 20-bits match. the data we want is not in the cache. Referencing a directly addressed cache associative cache. case,

8-bit field indicates

whether that particular cache

However, there

is

If not,

is

much

then

fields:

which cache

line contains the address

we immediately know

easier

than referencing a

that

fully

a significant disadvantage. In the directly addressed

two memory locations that correspond

to the

same cache

line

can never be in the

cache simultaneously. This considerably increases the probability of encountering

4K

from one array would result in none of the data ever being present in the cache, since we would be bounding backwards and forwards between two memory locations which needed the same cache line. unfortunate cases. In the case of our example

to another

where the

A compromise

arrays are separated

is

by

its

our 1

4K

cache

as a

lines.

The

12

bits

A given memory location can be stored

any of the separate caches. For example, we could organize 4-way set associative cache, where each section of the cache had 64

corresponding

6-byte

of 2

to design a set-associative cache. This essentially consists of a

collection of separate directly addressed caches. in

cache, a string copy

a multiple

lines in

address

is

now

interpreted as follows:

41

TASKING

22

The

6-bit field

that a given

is

4

bits

number within any of the four

the line

memory

6

bits

and destination

easier at the

performed

much

is

When

hardware

level since the

no time

set associative

a

is

random

made on which

must be

data to evict from is no There

only one alternative, so there

cache, or a fully associative cache, there

to execute fancy algorithms, so typical

rather simple approaches.

make

cache

set associative

searches that

smaller.

the cache. For a directly addressed cache, there

is

multiway

number of parallel

a cache miss occurs, a decision must be

problem. For a

of the cache. This means

arrays can be cached in separate

sections of the cache, avoiding the conflict. Searching a

much

parts

location can be stored in any of four different cache lines. In the

string copying example, the source

is

bits

hardware makes

One approach which works quite well

is

a choice.

this decision

in practice

is

using

simply

to

choice.

TASKING The fundamental

idea behind tasking

is

that a

machine can have two or more processes

wit h separate threads of control that are both executing in a multiprogramming sense.

Each of those t asks owns the processo r and the a

machine

that has only

registers

when

it is

executin g.

one processor, one program counter, and one

can execute only one program

at a time.

But

it is

set

Of course, of

registers

possible to effectively simulate true

multiprogramming by executing some instructions for one task and then, for whatever reason, switching back to the next task and executing its instructions. In switching from one task to another, an operation known as a context switch, the operating system needs to save the state of the executing task, called the machine state.

The machine

state

is

essentially everything that

include the state of memory, because each task has includes the instruction pointer, the flags that

might have been

Once

set,

and

its

show

is

in the processor.

is

memory

When

(TCB)

used by the task

still

completely removed so that some other task can use all

of the original

task's

information back into the processor and the processor will be ready to

state

execute the instruction that

block

state

the registers.

all

the processor. Later on, the operating system will arrange to put

occurred.

does not

the results of condition tests that

the machine state has been saved (with the

preserved) the processor state

machine

It

own memory. The machine

is

a task

is

it

was

just

about to execute when the context switch

temporarily suspended, a data structure calle d a task control

used to store the machine state information for the inactive task

A context switch saves the machine state

for the current task in

its

.

TCB and

then

TCB for some other task that is ready to execute to restore its machine state. With most processors, the context switch is accomplished by software, using a sequence uses the

of instructions to save the current machine

state and a corresponding sequence of machine state. The TCB is an operating system data determined by the operating systems software.

instructions to restore the current

object

whose

structure

is

42

MICROPROCESSORS

There are situations ation.

This typically

which the context switching time is a critical considerwhich must rapidly switch attention

in

arises in real-time systems,

between input/output devices. In such

situations, context switching represents yet

another possible target for the eager CISC designer, always ready to implement specialized instructions to help with common operations.

Most of the processors covered in to support tasking, but there are

this book do not have any hardware instructions two exceptions, the Intel 80 S86^ nd the INMOS ,

Transputer, where the processor provides hardwar e sup port for

ta sking Particularly in

the case of the Transputer, this hardware support makes very rapid context switching practical.

EXCEPTIONS An

exception

two

is

an interruption of the normal flow of instruction processing. There are

which

situations in

this occurs.

The

first,

which we

when

occurs

call a trap,

the

processor recognizes that the execution of an instruction has caused an error of some kind.

The

second, which

we

call

an interrupt, occurs

when

processor signals that a certain event should be brought to

ogy

its

a device external to the

attention.

The

terminol-

widely between processors, in a rather random manner. This

in this area varies

is

one area where we prefer to adopt a consistent terminology at the expense of not always matching every manufacturer's idiosyncratic usage.

The

instructions

on some

processors signal an error condition (such as overflow)

by setting a status flag which can be tested in a subsequent instruction. Integer overflow is handled this way on most of the processors we will look at. The alternative approach is to generate a trap, which causes a sudden transfer of control, much like a procedure the calling location

call in that

differing

from

a

(i.e.,

normal procedure

the instruction causing the problem) call in

supervisor or protected state. This

means

operating system. In some cases, this

is

memory

environment, a trap

will

in supervisor state to

do

it.

logically be

handle the

if

the required page

is

not present. Clearly the

by the operating system.

done by the application program

situation in

(for instance,

If the

it is

not so clear

handling should

an Ada program needs to

which the flow of instructions

is

program

as

needed.

suddenly modified

is

external interrupt occurs, typically signalling the completion of an input/out-

put operation. Again,

makes

certainlv needs to

CONSTRAINT_ERROR exception that results from a division by zero), then

The second

it

it

In other cases such as divide by zero,

the operating system can transfer control back to the application

when an

by the

obviously appropriate. For example, in a virtual

occur

that the condition should be handled

saved, but

that trap conditions are handled

operating system needs to handle this condition, and furthermore

be

is

that traps typically cause a transition into

this interrupt functions similarly to a

a transition to supervisor state.

procedure

Input/output handling

of the operating system. In the general

case,

it

is

is

call,

except that

clearly the province

quite likely that the task that

is



do with the interrupt it may well belong to some other task or some other currently running program in a multiprogramming system. The distinction between traps and interrupts is not always precise. Traps are interrupted has nothing at

all

to

generally synchronous, since they occur in conjunction with the execution of particular

EXCEPTIONS

instructions. Interrupts,

occur

at

any point

on the other hand, However, there

in time.

43

are generally asynchronous, since they can are intermediate cases. For example,

some

floating-point coprocessors have overlapped execution, so that a floating-point overfl-

ow



which from one point of view is a trap, since it is caused by a specific instruction on the coprocessor behaves like an interrupt, since it occurs asynchronously on the main processor a number of instructions after the one that led to the overflow.



Hardware Support we

All the processors

for Exceptions will

look

at

have hardware support for interrupts and

traps.

This

mechanism (usually available only in supervisor state) to turn the processor interrupts on and off to control whether hardware interrupts are recognized. When an exception occurs, the machine state of the executing task must not be logically altered. includes a

This

is

particularly important in the case of an

registers or flags disappearing

the very least, the hardware registers

may

and

asynchronous interrupt

—we

can't

have

without warning anywhere an interrupt might occur! At

must save the instruction

flags to the interrupt

routine

itself.

pointer, leaving the saving of

Alternatively, the

hardware designer

more of the machine state automatically. We will see variety of possibilities here as we study a range of processors. Another respect in which processors differ is the extent to which they separate

arrange to save considerably

a full

exceptions. At one extreme, every separate condition, including each interrupt separate device,

80386

is

is

automatically handled by a separate exception handl er.

an example of such an organization. At the other extreme, there

from

The

a

Intel

a single

is

exception handler which must handle

all traps and interrupts and has to check various what is going on. The i860 is arranged in this simpler manner it is simpler for the hardware designer, but, of course, the operating systems programmer who has to write the exception routine may not see things quite the same

status flags



and

registers to see

way.

The handling of hardware interrupts poses some special problems, because several same time and some interrupts require very rapid attention. Most processors have some kind of interrupt priority logic that assigns interrupts to interrupts can occur at the

various priority levels

one that

is



made not by

the decision as to

which device

to attach to

which

priority

the designer of the processor but by the system designer

puts together a processor using a specific chip. For example,

is

who

on the IBM PC, the timer

has the highest priority, to ensure that no timer interrupts are lost and that the time of

day

stays accurate.

Other

facilities

often include the ability to temporarily mask an interrupt.

operating system uses this in organizing the handling of interrupts. generally

you don't want the same device

to interrupt again before

you have

handling the previous interrupt. By masking the device until the handling a

second interrupt

is

inhibited until

Another consideration

in

it is

once again convenient

handling device interrupts

is

to

handle

There

are

two approaches

to this

problem.

interruptible, so that they can be interrupted

One

is

is

finished

complete,

it.

that instructions that take

a very long time to execute are problematic if interrupts occur only tions.

The

One example is that

to

between instruc-

make long

instructions

and then resume from where they

left off.

44

MICROPROCESSORS

The

string instructions of the

80386

are designed this way.

divide up a complex instruction into a sequence of

The second approach

component

i

s

to

instructions. For

example, the Transputer floating-point instructions include Begin Square Root, Continue Square Root,

and Finish Square

Root.

To compute

a square root, these three

instructions are issued in sequence (they are never used separately).

approach, a hardware interrupt does not have to wait

computation

is

complete.

By using

this

until the entire square root

CHAPTER

2 INTRODUCTION TO THE

The 386

an example, perhaps the example of a CISC architectu re. Describing a

is

complex instruction this

80386

chapter

we look

set

is

a

complex operation, so we devote three chapters to it. In from an application programmer's point of

at the instruction set

view, and in later chapters describe

its

support for operating systems.

REGISTER STRUCTURE

A starting point for looking at any microprocessor architecture

is its

register structure,

of CISC whose design can be

this affects the instruction set to a large degree, particularly in the case

s ince

processors.

The

traced back to

of the 386

386,

as

we

some of the

register set,

we

shall see, has earliest Intel

an unusual

microprocessors. Before looking at the details

will describe

some of

this heritage.

p redecessor of the 8086, was an 8-bit machine, and 8-bit registers, reflected this organization.

some of

register set

It

its

The

Intel

8080, the

register structure, a set

was possible

of eigh t

for very limited purposes to

form 16-bit registers, but it was nevertheless an most other respects. When the 8086 was designed, the issue of compatibility with the 8080 was an important one. On the one hand, Intel marketing was interested in guaranteeing total

join

8-bit

these 8-bit registers to

machine

in

45

— INTRODUCTION TO THE

46

80386

compatibility with the 8080, and the sales force gave the impression that the 8086

would be upwards compatible. Customers were then writing 8080 programs, and had an interest in protecting their software investments. On the other hand, the redesign was seen by the engineering group as an. opportunity for enhancements. At

clearly

memory from 64K to 128K was important. seem to have been a number of conflicting desires and requirements. Some constituencies had thoughts of major surgery, and there was talk of eliminating the very

Beyond

least,

this,

doubling the addressable

there

the non-symmetrical nature of the 8080. If taken seriously, this

would have meant a was concerned about maintaining its customer base, that time under siege from Zilog (the manufacturer of the Z80, a popular

complete redesign. Since

which was

at

Intel

replacement for the 8080), these requirements were important to management.

i

The principal designer of the 8086, Stephen Morse, steered an interesting course n the middle of these conflicting requirements On the one hand, the 8086 is q uite .

On

compatible with the basic structure of the 8080.

the other hand, the design

was

not constrained by an absolute compatibility requirement, and in particular, the

beyond the 128K that had been originally envisioned, to an address space of one megabyte, which at that time seemed huge for a microprocessor. One important benefit, at least in retrospect, even if it was not a deliberately intended effect, was that the attempt to maintain a reasonable level of compatibility helped to reduce the design work required, and therefore contributed to the important goal of getting something out fast. At the same time the final result was much more than Intel management's original concept of a slightly beefed-up 8080 To resolve the compatibility issue, a translation program was created which c onverted 8080 assembly language to 8086 assembly langua ge. In practice this program generated horrible code, and no one in the engineering department at Intel ever addressing was extended

far

.

expected

it

to be used.

On

now

the other hand, the sales force could

customers

talk to

them "Don't worry, all you have to do is to feed your code through our translator program which fixes up the "minor" discrepancies between the 8080 and

and

tell

8086, and you'll never

know

that the architecture has changed." This kind of discrep-

ancy between what engineering thinks and what the

sometimes less

it

results

is

not

uncommon a

it is

more

or

conscious deception to keep customers locked into a manufacturer's product.

Returning to the affected all

sales force says

from confusion and wishful thinking, sometimes

by the

registers

register set

register structure

of the 8086 are 16

of the 8086, we will see that

its

design

of the 8080. With the exception of the

bits

wide

Some

(see Figure 2.1).

is

strongly

flags register,

of the registers have

curious names, which reflects their special uses. Each of the 16-bit registers AX, BX,

CX, and

DX

is

divided up into two 8-bit components that can be used

individual registers.

being the top 8 in

bits

an attempt to

registers

The AX register, of AX and the

map

for example,

latter

is

divided into

being the lower 8

the register structure of the 8-bit

of t he 16-bit 8086. If you look only at these eight

of the 8086 looks

just like that

was

they were

This was done partly into the

bottom four

registers, the register structure

of the 8080.

In addition to duplicating the register structure

instructions

bits.

8080

as if

AH and AL, the former

also provided. In general there are

two

of the 8080, a sets

full set

of 8-bit

of instructions on the 8086.

47

REGISTER STRUCTURE

16

bits

AH BH CH DH

AX: BX:

CX: DX:

AL BL

CL DL

SI:

Dl:

(Stack Pointer)

SP: BP:

CS DS SS ES BP:

(Instruction Pointer)

FL:

(Flags Register)

FIGURE 2.1 The

register structure

There

one

is

of the Intel 8086.

bit in every

opcode

that determines

whether an instruction

version of the instruction or the 8-bit version. This bit for

word bk, and

The i

ncluded

it is

set to

1

as part

called the W-bit,

is

the 16-bit

which stands

for the 16-bit case.

structure of the instruction formats of the

8086

is

such that the W-bit

is

of the opcode part of an instruction. Following the opcode in various

numbers.

places there are 3-bit fields that are operand/ register register

operands indicated by one of these

while

the

if

is

W-bit

is

fields will

If the

be interpreted

W-bit

is set,

the

as a 16-bit register,

off it will be interpreted as an 8-bit register.

Notice that having the W-bit in the opcode commits one from an architectural point of view to having

all

operands in an instruction be the same length. Looking

the kind of instructions that are available

MOV

AL, BL

MOV

AX, BX

;

on the 8086,

8-bit register

copied

it is

at

possible to write

to 8-bit register

or

but

it is

not possible to write

MOV What would low order 8

AX, BL the last instruction bits

of the

BX

mean?

register

A reasonable

interpretation

should be copied into the

AX

would be

register,

that the

with either

48

INTRODIK

1

:

ION

TO THE

sign or zero extension.

the is

W-bit that

is

80386

But neither of these reasonable interpretations is permitted since MOV opcode specifies that the size of all register operands

part of the

either 8 bits or 16 bits.

A

more

general architectural design

types and operands into the operands themselves, (the gives a quite general

mixing of operands of different

VAX

types.

is

to put designators of

uses this approach). This

But

that,

of course, takes

m ore bits and more logic, b ecause every operand would require such a bit. All of this

is

just a

matter of whether

it is

possible to

fit

to extend an 8-bit value into 16 bits. This can, of course, be It is

very simple to program that operation.

what

It is

into the

opcode the

programmed

written by zero-extending

if

ability

necessary.

BL

(if that's

required), using the sequence

is

MOV MOV

AL, BL

AH,

Special Registers and Instructions

Each of the registers on the 8086 can be distinguished from every other register in some way. This lack of orthogonality is something for which the 8086 is well known. This architecture is thus at the opposite design extreme from machines with uniform register sets. Enumerating all the specialized uses of the registers would take too long and be too messy, so we will simply give some examples.

Multiplication on most machines involves putting a result into a register pair, since the result of an n-bit by n-bit multiplication will in general require 2n bits. The 8086 has a 16- by 16-bit multiply, which yields a 32-bit result. The solution on the 8086 is to require that operands and the result be placed in specific registers. This multiplication specifically requires that the multiplicand be put into AX, with the 32-bit result put into the DX:AX register pair.
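A minimal sketch of this convention, assuming the multiplier is already in BX and multiplicand is a hypothetical 16-bit memory word:

    MOV   AX, multiplicand   ; the multiplicand must be in AX
    MUL   BX                 ; the 32-bit product is left in DX:AX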

The CX register is another register with a special use. The 8086 has a loop instruction (LOOP) that automatically decrements the CX register and then executes a jump if CX is not equal to zero. To execute a loop 15 times, the code is

          MOV   CX, 15
    LP:   ...               ; body of the loop
          LOOP  LP

This is very much the kind of instruction that is "mission-oriented," that is, intended for use in a very specific situation. Since the normal format of a jump instruction does not have enough room to designate a register (most conditional jump instructions test special bits within a status register such as the carry flag and the overflow flag), the operands are usually implied rather than being explicitly specified in the instruction. In this sense, the choice to use CX rather than another register as the basis of whether or not to jump is somewhat arbitrary.

The XLAT instruction makes special use of both the BX register and the AL register. The memory location whose address is formed by adding the contents of the BX and AL registers is loaded into the AL register. One obvious use of this instruction is for translating character sets, hence the name.
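A sketch of that use, assuming TABLE is a hypothetical 256-byte translation table (one that maps lowercase letters to uppercase, say) and char is the byte to be translated:

    MOV   BX, OFFSET TABLE   ; BX points at the translation table
    MOV   AL, char           ; AL holds the character to translate
    XLAT                     ; AL is replaced by the byte at TABLE + AL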

The index registers SI and DI have special uses in connection with string instructions that copy a sequence of bytes from one location to another. ("S" stands for source, and "D" for destination.)

BP and SP have special uses in conjunction with the call stack. We will discuss the special uses of each of these registers in detail later on.

Before we go further, let us describe the register structure of the 386 and talk about the operand formats. The 386 has exactly the same register structure as the 8086, except that each of the registers is 32 bits wide and each register is renamed by putting an "E" on the front (you may think of the "E" as meaning "extended"). The bottom 16 bits of each register has a name that corresponds to the old 8086 names, so the right-hand half of this picture is identical in all respects to the 8086 register model (Figure 2.2).

Maintaining Compatibility with the 8086/88

The register structure of the 386 would seem rather peculiar if we did not understand its 8086 origins (see Figure 2.2). At the right, you can see a structure that looks identical to the 8086 and is completely compatible with it, but the registers are extended to 32 bits. The 16-bit CX register on the 8086, for example, becomes a 32-bit extended register called ECX on the 386. The problem is that the instruction formats of the 386 have to be pretty much the same as those of the 286, because the compatibility requirement is very strong. Recall that on the 8086 and 80286 there is a W-bit in the opcode byte (the general form of an instruction is an 8-bit opcode followed by other fields) that says whether to use 8 or 16 bits. There isn't room in the 8-bit opcode field to fit an extra bit saying, "Please use 32 bits." If the 386 were being designed from scratch, it would probably have been preferable to have three possible designators so that 8-bit, 16-bit, and 32-bit references could be freely mixed. But there just is not enough room in the existing instruction formats.

The trick that is used to solve this problem is the following. There is an overall mode for the processor that can be set to put the machine into either 16-bit mode or 32-bit mode. If W is set to 0, the processor always uses 8-bit operations, regardless of mode, but if W is set to 1 then the processor uses either 32-bit or 16-bit operations depending on the mode. In the 32-bit mode there is a choice between 32-bit and 8-bit operands, while in the 16-bit mode there is a choice between 16-bit and 8-bit operands.

In order to write code that is compatible with the IBM PC or to run PC-compatible code, the processor will operate in 16-bit mode. In this mode, none of the code ever uses the upper half of the 32-bit registers; all the instructions work in such a way that they are blind and oblivious to the higher-order bits. To operate the 386 as a 32-bit machine, it must operate in 32-bit mode. Eight-bit operations are still available, of course, since characters are important whatever the word size.

There is one trick that gives a programmer a little more flexibility. There is an operand prefix byte (it has a special coding of 66 hex, which is different from any opcode value) that directs the processor to change modes for the next instruction. That allows you to mix some 16-bit mode instructions into 32-bit code, or vice versa. If you have code that heavily mixes 16- and 32-bit instructions, then the code will be covered with these prefixes (wasting time and space). There is no practical way to flip the processor back and forth between its 16- and 32-bit operating modes.

FIGURE 2.2  The user register set of the 80386 (the 32-bit registers EAX, EBX, ECX, EDX, ESI, EDI, ESP, and EBP, whose low halves are the 8086 registers; the segment registers CS, DS, SS, ES, FS, and GS; EIP, the extended instruction pointer; and EFL, the extended flags register).
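To make the effect of the prefix concrete, here is a hedged illustration of what an assembler does with operand sizes in 32-bit code (the particular instruction chosen is arbitrary):

    ADD   EAX, EBX      ; 32-bit add: no prefix needed in 32-bit mode
    ADD   AX, BX        ; 16-bit add: the assembler emits a 66H operand-size prefix
    ADD   AL, BL        ; 8-bit add: selected by W = 0, no prefix required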

This mechanism is rather clumsy, and probably not what would have been chosen if the design were started from scratch. If the design were not constrained by compatibility considerations, the 16-bit operations might have been omitted, or at least a more usable mechanism devised for mixing the three operand lengths.

THE USER INSTRUCTION SET

In this section, we will give a brief overview of the general design of the 386 instruction set. We will not describe every single instruction in detail; such a description can be found in the Intel 80386 Programmer's Reference Manual and in many other books on the 386. What we want to do is to get a general idea of the instructions that are available and concentrate on unusual instructions that exhibit the CISC philosophy of providing specialized instructions for common high-level programming constructs.

Basic Data Movement Instructions

The 386 move instructions allow you to move data between registers, and between registers and memory, but not directly between different memory locations:

    MOV   reg1, mem     ; load reg1 from memory
    MOV   mem, reg1     ; store the value in reg1 into memory
    MOV   reg1, reg2    ; copy a value from reg2 into reg1

The simplicity of this description hides the fact that the memory references implied by the mem operands actually allow a programmer to use a relatively rich set of addressing modes in defining a memory address.
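As a brief, hedged illustration of that richness, a single mem operand can combine a base register, a scaled index register, and a displacement; K and TABLE here are hypothetical data labels:

    MOV   EAX, K                    ; simple direct address
    MOV   EAX, [EBX]                ; register indirect
    MOV   EAX, TABLE[EBX + ESI*4]   ; base + scaled index + displacement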

Basic Arithmetic and Logical Operations

The most commonly used instructions take two operands, one of which is a register; the other can be a register or a memory location. In assembly language, one format for the addition instruction is

    ADD   EAX, K

This instruction adds the contents of memory location K to the contents of the EAX register, leaving the result in EAX. The addition is a 32-bit addition that can be regarded as unsigned or two's complement. Three flags are set by the result:

•  CF, the carry flag, is set if there is an unsigned overflow.
•  OF, the overflow flag, is set if there is a signed overflow.
•  ZF, the zero flag, is set if the result is all zero bits.

Unlike many of the other processors that we will look at, the 386 permits operations to memory as well as operations from memory:

    ADD   K, EAX

computes the same sum, but the result is stored back into memory location K. The same instruction format can also be used for operations between registers:

    ADD   EAX, EBX

This instruction computes the sum of EAX and EBX, placing the sum in EAX. A large number of two-operand instructions share this basic instruction format, with the result always replacing the contents of the left operand:

    ADC    op1, op2     ; addition including CF
    SUB    op1, op2     ; subtraction
    SBB    op1, op2     ; subtraction including CF
    CMP    op1, op2     ; comparison (like subtraction, but no result stored)
    AND    op1, op2     ; logical AND
    OR     op1, op2     ; logical OR
    XOR    op1, op2     ; logical exclusive OR
    TEST   op1, op2     ; bit test (like AND, but no result stored)
    MOV    op1, op2     ; copy operand 2 to operand 1
    LEA    op1, op2     ; place address of operand 2 in operand 1
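LEA deserves a short aside: because it computes an address without actually referencing memory, it is often pressed into service for ordinary arithmetic. A hedged example (the register contents are arbitrary):

    LEA   EAX, [EBX + ECX*4 + 10]   ; EAX = EBX + 4*ECX + 10

Unlike ADD, this computes a sum and a scaled product in one instruction and leaves the flags undisturbed.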

The ADC and SBB instructions are useful for multiple-precision addition and subtraction, since they include the carry flag from the previous operation, so, for example, a typical triple-precision (96-bit) addition can be written as:

    ADD   EAX, EDX      ; add low-order words
    ADC   EBX, ESI      ; add next word with carry from previous
    ADC   ECX, EDI      ; ECX:EBX:EAX = ECX:EBX:EAX + EDI:ESI:EDX

The comparison instruction, CMP, behaves exactly like a subtraction but does not store a result. It does, however, set the OF, CF, and ZF flags, from which a full set of signed and unsigned comparison conditions can be deduced. A complete set of jumps is available to test these conditions:

    JMP   lbl           ; unconditional jump
    JA    lbl           ; jump above (greater than, unsigned)
    JAE   lbl           ; jump above or equal (unsigned)
    JB    lbl           ; jump below (less than, unsigned)
    JBE   lbl           ; jump below or equal (unsigned)
    JE    lbl           ; jump equal (same for signed or unsigned)
    JNE   lbl           ; jump not equal (same for signed or unsigned)
    JG    lbl           ; jump greater than (signed)
    JGE   lbl           ; jump greater than or equal (signed)
    JL    lbl           ; jump less than (signed)
    JLE   lbl           ; jump less than or equal (signed)
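Putting CMP and a conditional jump together gives the usual compare-and-branch idiom; a small sketch, in which LIMIT is an assumed constant and TOOBIG a label elsewhere in the program:

    CMP   EAX, LIMIT    ; sets OF, CF, and ZF as if computing EAX - LIMIT
    JG    TOOBIG        ; taken if EAX > LIMIT as a signed comparison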

Operations that take only a single operand can be used with either a register or a memory operand:

    INC   op            ; increment operand by 1
    DEC   op            ; decrement operand by 1
    NEG   op            ; negate operand
    NOT   op            ; invert the bits of the operand

The operations described so far can operate on 8-bit operands (using one of the 8-bit registers, AL, BL, ...), 16-bit operands (using one of the 16-bit registers, AX, BX, ...), or 32-bit operands (using one of the 32-bit registers, EAX, EBX, ...). The following instructions are one of the few cases where operands of different lengths can be mixed:

    MOVSX   op1, op2    ; move with sign extension
    MOVZX   op1, op2    ; move with zero extension

The motivation behind the inclusion of these instructions in the instruction set is to allow the second operand to be shorter than the first and either sign- or zero-extended to fill the larger operand. For example, op1 can be EAX and op2 can be a byte in memory. In this case MOVSX loads a byte from memory, sign-extending it to fill 32 bits.
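A short sketch of the difference between the two, assuming B is a byte variable in memory that currently holds 0FFH:

    MOVSX  EAX, B       ; sign extension: EAX becomes 0FFFFFFFFH (that is, -1)
    MOVZX  EAX, B       ; zero extension: EAX becomes 000000FFH (that is, 255)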


Multiplication and Division Instructions

We will complete the picture of integer arithmetic by describing the set of multiply and divide instructions. The basic multiply instruction takes only one operand:

    MUL    op1          ; unsigned multiplication
    IMUL   op1          ; signed multiplication

The second operand is always the accumulator (AL, AX, or EAX, depending on the length of the operand). The result always goes in the extended accumulator (AX, DX:AX, or EDX:EAX). This specialized use of registers keeps the instructions shorter, since the instruction need not specify one of the operands. On the other hand, it complicates life for the assembler programmer and particularly for a compiler writer, because it means that multiplication must be treated in a special way compared to addition and subtraction and that EAX must be treated differently from the other registers. Division is similarly specialized:

    DIV    op1          ; unsigned division
    IDIV   op1          ; signed division

The dividend is always in the extended accumulator. The remainder and quotient are stored back in the two halves of the extended accumulator. For example, in the 32-bit form, EDX:EAX is divided by the 32-bit operand, with the remainder stored in EDX and the quotient stored in EAX.
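A minimal sketch of this convention for an unsigned 32-bit divide (the variable names are hypothetical 32-bit memory words):

    MOV   EAX, dividend     ; low half of the dividend
    XOR   EDX, EDX          ; clear the high half for a single-length dividend
    DIV   divisor           ; quotient is left in EAX, remainder in EDX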

On the 8086, this was the complete set of multiply and divide instructions. The 386 has some additional instructions to perform multiplication:

    IMUL   op1, op2             ; single-length multiply
    IMUL   op1, op2, immediate  ; single-length multiply by an immediate value

The first form performs a single-length multiplication (8-, 16-, or 32-bit), putting the result in the left operand as usual. It is interesting to note that there is no MUL in this format; none is needed, since, as in the case of addition and subtraction, the signed and unsigned results are the same if only the low-order bits are generated. This multiply instruction corresponds to the normal multiplication required in high-level languages like C or FORTRAN, so it is highly convenient for a compiler.

The second format is highly idiosyncratic. It multiplies op2, which can be a register or memory, by the immediate operand and places the resulting single-length product in op1, which must be a register. There are no other three-operand instructions of this type in the instruction set. Why on earth did this instruction get added? This is a good example of another mission-oriented CISC instruction. Consider the case of indexing an array, where the elements of the array are 32 bytes long. The following instruction is just what is needed:

    IMUL   EBX, I, 32

EBX now contains the byte offset into the array whose subscript is I. Is it worth having this special instruction? That is always the $64,000 question!

On the one hand, array indexing is a common operation. On the other hand, RISC advocates would argue that a decent compiler can eliminate nearly all such multiplication instructions using a standard optimization called strength reduction. Consider the following loop:

    for I in 1 .. 100 loop
       S := S + Q(I).VAL;
    end loop;

Let us assume that Q is an array of records where each record is 32 bytes long and the VAL field is in the first 4 bytes of each record. Naive code for this loop can make nice use of the special IMUL instruction:

          MOV   ECX, 1            ; use ECX to hold I
    LP:   IMUL  EAX, ECX, 32      ; get offset of I in EAX
          MOV   EBX, Q[EAX]       ; load VAL field
          ADD   S, EBX            ; add VAL field to S
          INC   ECX               ; increment I
          CMP   ECX, 100          ; test against limit
          JNE   LP                ; loop until I = 100

A clever compiler using strength reduction would replace I by 32*I, generating the following code:

          MOV   ECX, 32           ; ECX holds 32 * I
    LP:   MOV   EBX, Q[ECX]       ; load VAL field
          ADD   S, EBX            ; add to S
          ADD   ECX, 32           ; add 32 to 32 * I
          CMP   ECX, 3200         ; compare against adjusted limit
          JNE   LP                ; loop until I * 32 = 100 * 32

This code is clearly much more efficient, since it does not need to make use of the fancy multiply instruction.

eliminated, so the situation

world, so

it

use of the fancy

true that compilers can always get rid of these multiplica-

is

clouded. There are also

many

can also be argued that relying on clever compilers

DOUBLE-LENGTH MULTIPLY AND DIVIDE. Not ble-length forms of multiply and divide, but as

all

all

of them can be

"stupid" compilers in the is

somewhat

unrealistic.

processors provide the dou-

we have

seen the 386

is

an example

Looking at high-level languages, one might wonder whether these instructions are of any use. Among all the commonly used high-level languages, only COBOL gives access to them. This is done using statements such as: of

a processor that has both.

MULTIPLY SINGLE-ONE BY SINGLE-TWO GIVING DOUBLE-RESULT DIVIDE DOUBLE-DIVIDEND BY SINGLE-DIVISOR GIVING SINGLE-QUOTIENT. There less

are three reasons for providing these instructions. First,

free in the

hardware. To multiply 32

bits

algorithms require 32 steps of shifting and adding.

by 32

bits,

it

tends to be

more

or

the standard hardware

A 64-bit result is naturally developed

without any extra work. Similarly, a 32-bit division involves 32 steps and can naturally deal with a double-length dividend.

THE USER INSTRUCTION SET

55

which these double-length When you learn multiplication in grade school, you are taught that multiplying a single digit by a single digit can give a result of up to two digits in the form of the multiplication tables up to 10 by 10 (the 10 times table is redundant, but it is easy, and we teach it to reinforce the notion of multiplication by 10 being equivalent to moving the decimal point). In addition, there are two

programming

situations in

instructions are useful. First, consider multiple-precision arithmetic.

When

grade-school students are taught the 9 times table, they learn that nine 9s are



you don't learn that nine 9s are 1 and the carry doesn't matter! That's because if you want to do long multiplication by hand you need that carry-digit on multiplication. The same principle applies to programming multiple-precision multiplications. When multiplying 10 words by 10 words, you need the one word by one word giving two words as the component instruction in the algorithm. Similarly, multiple-precision division, which is much more complicated, also requires the double-length divide. A second situation arises in computing expressions of the form B * C / D with 81

integer operands.

With double-length

operations, the result of the multiplication can

temporarily overflow into double length, with the division then bringing the quotient

back into single-length range. At DISC, a typesetting company in Chicago (owned by the brother of the

first

author), the primary application repeatedly evaluates expressions

of this type for scaling graphics and type on the screen and printer. For is

important that double-length

The on

results

original version of the

DISC

this scaling,

it

be permitted. application was written in assembly language

was no problem. The most and runs on the 386. Although the 386

a processor providing double-length results, so there

recent version of the software

is

written in

provides the double-length operations, a choice of catastrophes.

a =

a

int

wrong

results

which case

single precision (in

and hence

The

result) or

(long(b)

*

it is

C

C

does not provide access to them, so there

of all arithmetic operations can either be

possible to get an overflow

everything can be converted to 64

long(c))

/

on the multiplication bits:

long(d)

This implies a multiplication by two 64-bit values, and there certainly instruction to

do

that.

is

left in

Consequently,

a

C

compiler will generate a

call to a

is

not an

time-con-

suming software multiply routine, followed by another call to an even more time-consuming 64-bit division routine, even though all of this could have been done in assembly language in two instructions. At DISC, they finally had to resort to doing these scaling operations with a small assembler routine. Even with the extra overhead of the call, the application was speeded up by nearly 20%. That is still sort of sad, isn't it? The machine they use has the right instructions, but C does not give access to them. There isn't always perfect communication between language designers and hardware designers.

GETTING BOTH THE QUOTIENT AND REMAINDER. Another in the

how

same do

to

on the 386

feature of the di-

that

it

provides both the quotient and the remainder

instruction. Again, this

is

almost free

vide instruction

a long division.

is

At the end of

at

the hardware level

a division, the

— think about

remainder

is

left as a

;

56

INTRODUCTION TO THE 80386

consequence of doing the division. Once again, among high-level languages, only

COBOL

gives direct access to this instruction:

REMAINDER

DIVIDE A BY B GIVING C It

may

be a

C

:=

D

:=

and hope

little

A/B; A rem

we can do

it.

In Ada,

we have

to write

B;

that our compiler

We

division.

verbose, but at least

D.

will

clever

is

enough

probably be disappointed

to notice that

—even

if

it

only needed to do one

the compiler recognizes and

common subexpressions, it may well miss this case, because the common at the source level, only at the level of the generated code.

eliminates is

not

expression

Decimal Arithmetic The decimal ophy

arithmetic operations provide a nice example of the

in action. Let's consider

DAA

detail. if

one of them, Decimal Adjust

after

CISC design

philos-

Addition (DAA),

in

performs the following sequence of computations:

((AL and OFH) > 9) or (AF =

1 )

then

AL^ AL + 6; AF

9FH)or(CF AL

386

B