Operating Systems [3 ed.]
 9780070702035

Table of contents:
Cover
Half Title
About the Authors
Title Page
Copyright
Dedication
Contents
Preface
Visual Tour
Chapter 1: INTRODUCTION TO OPERATING SYSTEMS
1.1 Zeroth Generation: Mechanical Parts
1.2 First Generation (1945–1955): Vacuum Tubes
1.3 Second Generation (1955–1965): Transistors
1.4 Third Generation (1965–1980): Integrated Circuits
1.4.1 (a) Integrated Circuits
1.4.2 (b) Portability
1.4.3 (c) Job Control Language
1.4.4 (d) Multiprogramming
1.4.5 (e) Spooling
1.4.6 (f) Time Sharing
1.5 Fourth Generation (1980–1990): Large Scale Integration
1.5.1 Batch Systems
1.5.2 Real time Systems
Summary
Terms and Concepts Used
Review Questions
Chapter 2: COMPUTER ARCHITECTURE
2.1 Introduction
2.2 A 4GL Program
2.3 A 3GL (HLL) Program
2.4 A 2GL (Assembly) Program
2.5 A 1GL (Machine Language) Program
2.5.1 Assembler
2.5.2 Instruction Format
2.5.3 Loading/Relocation
2.6 0GL (Hardware Level)
2.6.1 Basic Concepts
2.6.2 CPU Registers
2.6.3 The ALU
2.6.4 The Switches
2.6.5 The Decoder Circuit
2.6.6 The Machine Cycle
2.6.7 Some Examples
2.7 The Context of a Program
2.8 Interrupts
2.8.1 The Need for Interrupts
2.8.2 Computer Hardware for Interrupts and Hardware Protection
2.9 Storage Structure
2.9.1 Random Access Memory (RAM)
2.9.2 Secondary Memory
2.10 Storage Hierarchy
Terms and Concepts Used
Summary
Review Questions
Chapter 3: OPERATING SYSTEM FUNCTIONS
3.1 What is an Operating System?
3.2 Different Services of the Operating System
3.2.1 Information Management (IM)
3.2.2 Process Management (PM)
3.2.3 Memory Management
3.3 Uses of System Calls
3.4 The Issue of Portability
3.5 User’s View of the Operating System
3.6 Graphical User Interface (GUI)
3.7 The Kernel
3.8 Booting
3.9 Virtual Machine
3.10 System Calls
3.10.1 Validations
3.10.2 Open Input File
3.10.3 Output File
3.10.4 File Close
Summary
Terms and Concepts Used
Review Questions
Chapter 4: FILE SYSTEMS
4.1 Introduction
4.1.1 Disk Basics
4.1.2 Direct Memory Access
4.2 The File System
4.2.1 Introduction
4.2.2 Block and Block Numbering Scheme
4.2.3 File Support Levels
4.2.4 Writing a Record
4.2.5 Reading a Record
4.2.6 The Relationship Between the Operating System and DMS
4.2.7 File Directory Entry
4.2.8 OPEN/CLOSE Operations
4.2.9 Disk Space Allocation Methods
4.2.10 Directory Structure: User’s View
4.2.11 Implementation of a Directory System
4.2.12 File Organization and Access Management
4.2.13 File Organization and Access Management
4.2.14 File Sharing and Protection
4.2.15 Directory Implementation
4.2.16 Directory Operations
4.2.17 Free Space Management
4.2.18 Bit Vector
4.2.19 Log Structured File System
Terms and Concepts Used
Summary
Review Questions
Chapter 5: I/O MANAGEMENT AND DISK SCHEDULING
5.1 Introduction
5.1.1 The Basics of Device Driver
5.1.2 Path Management
5.1.3 The Submodules of DD
5.1.4 I/O Procedure
5.1.5 I/O Scheduler
5.1.6 Device Handler
5.1.7 The Complete Picture
5.2 Terminal I/O
5.2.1 Introduction
5.2.2 Terminal Hardware
5.2.3 Terminal Software
5.3 CD-ROM
5.3.1 The Technical Details
5.3.2 Organizing Data on the CD-ROM
5.3.3 DVD-ROM
5.4 Terms and Definitions
5.4.1 Disk Scheduling
5.4.2 SCAN
5.4.3 Circular SCAN (C-SCAN)
5.4.4 LOOK
5.4.5 Circular LOOK (C-LOOK)
5.4.6 Swap Space Management
5.4.7 Disk Space Management
5.4.8 Block Size
5.4.9 Keeping Track of Free Blocks
Terms and Concepts Used
Summary
Review Questions
Chapter 6: PROCESS MANAGEMENT
6.1 Introduction
6.2 What is a Process?
6.3 Evolution of Multiprogramming
6.4 Context Switching
6.5 Process States
6.6 Process State Transitions
6.7 Process Control Block (PCB)
6.8 Process Hierarchy
6.9 Operations on a Process
6.10 Create a Process
6.11 Kill a Process
6.12 Dispatch a Process
6.13 Change the Priority of a Process
6.14 Block a Process
6.15 Dispatch a Process
6.16 Time up a Process
6.17 Wake up a Process
6.18 Suspend/resume Operations
6.19 CPU Scheduling
6.19.1 Scheduling Objectives
6.19.2 Concepts of Priority and Time Slice
6.19.3 Scheduling Philosophies
6.19.4 Scheduling Levels
6.19.5 Scheduling Policies (For Short Term Scheduling)
6.20 Multithreading
6.20.1 Multithreading Models
6.20.2 Implementation of Threads
Terms and Concepts Used
Summary
Review Questions
Chapter 7: PROCESS SYNCHRONIZATION
7.1 The Producer–Consumer Problems
7.2 Solutions
7.2.1 Interrupt Disabling/Enabling
7.2.2 Lock-flag
7.2.3 Primitives for Mutual Exclusion
7.2.4 Overview of Attempts
7.2.5 Alternating Policy
7.2.6 Peterson’s Algorithm
7.2.7 Hardware Assistance
7.2.8 Semaphores
7.3 The Classical IPC Problems
7.3.1 Algorithms
7.3.2 Monitors
7.3.3 Message Passing
Terms and Concepts Used
Summary
Review Questions
Chapter 8: DEADLOCKS
8.1 Introduction
8.2 Graphical Representation of a Deadlock
8.3 Deadlock Prerequisites
8.3.1 Mutual Exclusion Condition
8.3.2 Wait for Condition
8.3.3 No Preemption Condition
8.3.4 Circular Wait Condition
8.4 Deadlock Strategies
8.4.1 Ignore a Deadlock
8.4.2 Detect a Deadlock
8.4.3 Recover from a Deadlock
8.4.4 Prevent a Deadlock
8.4.5 Avoid a Deadlock
Summary
Review Questions
Terms and Concepts Used
Chapter 9: MEMORY MANAGEMENT (MM)
9.1 Introduction
9.1.1 Relocation and Address Translation
9.1.2 Protection and Sharing
9.2 Single Contiguous Memory Management
9.2.1 Relocation/Address Translation
9.2.2 Protection and Sharing
9.2.3 Evaluation
9.3 Fixed Partitioned Memory Management
9.3.1 Introduction
9.3.2 Allocation Algorithms
9.3.3 Swapping
9.3.4 Relocation and Address Translation
9.3.5 Protection and Sharing
9.3.6 Evaluation
9.4 Variable Partitions
9.4.1 Introduction
9.4.2 Allocation Algorithms
9.4.3 Swapping
9.4.4 Relocation and Address Translation
9.4.5 Protection and Sharing
9.4.6 Evaluation
9.5 Non-contiguous Allocation - General Concepts
9.6 Paging
9.6.1 Introduction
9.6.2 Allocation Algorithms
9.6.3 Swapping
9.6.4 Relocation and Address Translation
9.7 Segmentation
9.7.1 Introduction
9.7.2 Swapping
9.7.3 Address Translation and Relocation
9.7.4 Sharing and Protection
9.8 Combined Systems
9.9 Virtual Memory Management Systems
9.9.1 Introduction
9.9.2 Relocation and Address Translation
9.9.3 Swapping
9.9.4 Relocation and Address Translation
9.9.5 Protection and Sharing
9.9.6 Evaluation
9.9.7 Design Considerations for Virtual Systems
9.9.8 Virtual Memory
9.9.9 Paging
9.9.10 Demand Paging
9.9.11 Process Creation
Terms and Concepts Used
Summary
Review Questions
Chapter 10: OPERATING SYSTEM: SECURITY AND PROTECTION
10.1 Introduction
10.2 Security Threats
10.3 Attacks on Security
10.3.1 Authentication
10.3.2 Browsing
10.3.3 Trap Doors
10.3.4 Invalid Parameters
10.3.5 Line Tapping
10.3.6 Electronic Data Capture
10.3.7 Lost Line
10.3.8 Improper Access Controls
10.3.9 Waste Recovery
10.3.10 Rogue Software and Program Threats
10.3.11 Covert Channel
10.4 Security Violation through Parameters
10.4.1 Denial of Service and Domain of Protection
10.4.2 A More Serious Violation
10.4.3 The Cause
10.4.4 Solution: Atomic Verification
10.5 Computer Worms
10.5.1 Origins
10.5.2 Mode of Operation
10.5.3 The Internet Worm
10.5.4 Safeguards against Worms
10.6 Computer Virus
10.6.1 Types of Viruses
10.6.2 Infection Methods
10.6.3 Mode of Operation
10.6.4 Virus Detection
10.6.5 Virus Removal
10.6.6 Virus Prevention
10.7 Security Design Principles
10.7.1 Public Design
10.7.2 Least Privilege
10.7.3 Explicit Demand
10.7.4 Continuous Verification
10.7.5 Simple Design
10.7.6 User Acceptance
10.7.7 Multiple Conditions
10.8 Authentication
10.8.1 Authentication in Centralised Environment
10.8.2 Authentication in Distributed Environment
10.9 Protection Mechanisms
10.9.1 Protection Framework
10.9.2 Access Control List (ACL)
10.9.3 Capability List
10.9.4 Combined Schemes
10.10 Data Encryption
10.10.1 Risks Involved
10.11 Basic Concepts
10.11.1 Plain Text and Cipher Text
10.11.2 Substitution Cipher
10.11.3 Transposition Cipher
10.11.4 Types of Cryptography
10.12 Digital Signature
Terms and Concepts Used
Summary
Review Questions
Chapter 11: PARALLEL PROCESSING
11.1 Introduction
11.2 What is Parallel Processing?
11.3 Difference between Distributed and Parallel Processing
11.4 Advantages of Parallel Processing
11.5 Writing Programs for Parallel Processing
11.6 Classification of Computers
11.7 Machine Architectures Supporting Parallel Processing
11.7.1 Bus-based Interconnections
11.7.2 Switched Memory Access
11.7.3 Hypercubes
11.8 Operating Systems for Parallel Processors
11.8.1 Separate Operating Systems
11.8.2 Master/Slave System
11.8.3 Symmetric Operating System
11.9 Issues in Operating System in Parallel Processing
11.9.1 Mutual Exclusion
11.9.2 Deadlocks
11.10 Case Study: Mach
11.10.1 Memory Management in Mach
11.10.2 Communication in Mach
11.10.3 Emulation of an Operating System in Mach
11.11 Case Study: DG/UX
Terms and Concepts Used
Summary
Review Questions
Chapter 12: OPERATING SYSTEMS IN DISTRIBUTED PROCESSING
12.1 Introduction
12.2 Distributed Processing
12.2.1 Centralized vs Distributed Processing
12.2.2 Distributed Applications
12.2.3 Distribution of Data
12.2.4 Distribution of Control
12.2.5 An Example of Distributed Processing
12.2.6 Functions of NOS
12.2.7 Overview of Global Operating System (GOS)
12.3 Process Migration
12.3.1 Need for Process Migration
12.3.2 Process Migration Initiation
12.3.3 Process Migration Contents
12.3.4 Process Migration Example
12.3.5 Eviction
12.3.6 Migration Processes
12.4 Remote Procedure Call
12.4.1 Introduction
12.4.2 A Message Passing Scheme
12.4.3 Categories of Message Passing Scheme
12.4.4 RPC
12.4.5 Calling Procedure
12.4.6 Parameter Representation
12.4.7 Ports
12.4.8 RPC and Threads
12.5 Distributed Processes
12.5.1 Process-based DOS
12.5.2 Object-based DOS
12.5.3 Object Request Brokers (ORB)
12.6 Distributed File Management
12.6.1 Introduction
12.6.2 File Replication
12.6.3 Distributed File System
12.7 NFS—A Case Study
12.7.1 Introduction
12.7.2 NFS Design Objectives
12.7.3 NFS Components
12.7.4 How NFS Works
12.8 Cache Management in Distributed Processing
12.9 Printer Servers
12.10 Client-based (File Server) Computing
12.11 Client–Server (Database Server) Computing
12.12 Issues in distributed database systems
12.13 Distributed Mutual Exclusion
12.14 Deadlocks in Distributed Systems
12.15 LAN Environment and Protocols
12.15.1 Introduction
12.15.2 Data Communication Errors
12.15.3 Messages, Packets, Frames
12.15.4 NIC Functions: An Example
12.15.5 LAN Media Signals and Topologies
12.16 Networking Protocols
12.16.1 Protocols in Computer Communications
12.16.2 The OSI Model
12.16.3 Layered Organization
12.16.4 Physical Layer
12.16.5 Data Link Layer
12.16.6 Network Layer
12.16.7 Transport Layer
12.16.8 Session Layer
12.16.9 Presentation Layer
12.16.10 Application Layer
Terms and Concepts Used
Summary
Review Questions
Chapter 13: WINDOWS NT/2000: A CASE STUDY
13.1 Introduction
13.2 Windows NT
13.2.1 Process Management
13.3 Windows NT
13.3.1 Process Synchronization
13.3.2 Memory Management
13.4 Windows 2000
13.4.1 Win32 Application Programming Interface (Win32 API)
13.4.2 Windows Registry
13.4.3 Operating System Organization
13.4.4 Process Management in Windows 2000
13.4.5 Memory Management in Windows 2000
13.4.6 File Handling in Windows 2000
13.4.7 Important Features of NTFS
13.4.8 File Compression and Encryption
13.4.9 Security in Windows 2000
13.4.10 Windows 2000 and Kerberos
13.4.11 MS-DOS Emulation
Terms and Concepts Used
Summary
Review Questions
Chapter 14: UNIX: A CASE STUDY
14.1 Introduction
14.2 The History of UNIX
14.3 Overview of UNIX
14.4 UNIX File System
14.4.1 User’s View of File System
14.4.2 Different Types of Files
14.4.3 Mounting/Unmounting File Systems
14.4.4 Important UNIX directories/files
14.4.5 The Internals of File Systems
14.4.6 Run-time Data Structures for File Systems
14.4.7 “Open” System Call
14.4.8 “Read” System Call
14.4.9 “Write” System Call
14.4.10 Random Seek — “Lseek” System Call
14.4.11 “Close” System Call
14.4.12 Create a File
14.4.13 Delete a File
14.4.14 Change Directory
14.4.15 Implementation of Pipes
14.4.16 Implementation of Mount/Unmount
14.4.17 Implementation of Link/Unlink
14.4.18 Implementation of Device I/O in UNIX
14.5 Data Structures for Process/memory Management
14.5.1 The Compilation Process
14.5.2 Process Table
14.5.3 u-area
14.5.4 Per Process Region Table (Pregion)
14.5.5 Region Table
14.5.6 Page Map Tables (PMT)
14.5.7 Kernel Stack
14.6 Process States and State Transitions
14.7 Executing and Terminating a Program in UNIX
14.7.1 Introduction
14.7.2 “Fork” System Call
14.7.3 “Exec” System Call
14.7.4 Process Termination — “Exit” System call
14.7.5 “Wait” System Call
14.8 Using the System (Booting and Login)
14.8.1 Booting Process: Process 0, Process 1
14.8.2 Login Process
14.9 Process Scheduling
14.10 Memory Management
14.10.1 Introduction
14.10.2 Swapping
14.10.3 Demand Paging
14.10.4 An Example Using Demand Paging
14.11 Solaris Process/thread Management and Synchronization: A Case Study
14.11.1 Solaris Thread and SMP Management
14.11.2 Solaris Process Structure
14.11.3 Solaris Thread Synchronization
Terms and Concepts Used
Summary
Chapter 15: LINUX–A CASE STUDY
15.1 Introduction
15.2 UNIX and Linux: A Comparison
15.3 Process Management
15.4 Process Scheduling
15.5 Memory Management
15.6 File Management
15.7 Device Drivers
15.8 Security
15.8.1 Access Control
15.8.2 User Authentication
Terms and Concepts Used
Summary
Review Questions (Common for Chapters 14 and 15)
Answers to True & False
Answers to Multiple Choice Questions
Index


ABOUT THE AUTHORS

Achyut Godbole is currently the Managing Director of Softexcel Consultancy Services, Mumbai. His professional career spans 32 years, during which he has served in world-renowned software companies in India, the UK and the USA. He has contributed to the multifold growth of companies such as Patni, Syntel, L&T Infotech, Apar and Disha. He did his BTech in Chemical Engineering from IIT Bombay and subsequently worked for the welfare of Adivasi tribes for one year. Godbole has authored best-selling textbooks from Tata McGraw-Hill such as Operating Systems, Data Communications and Networking, and Web Technologies, including international editions and Chinese translations. In addition, he has authored several highly rated books on various subjects (like computers, management, economics, etc.) in Marathi and has written several popular columns in Marathi newspapers/magazines on science, literature, medicine, and technology. He has conducted numerous programmes on television pertaining to technology, science, and economics. He has traveled abroad on more than 150 occasions to several countries to promote software business. Godbole also runs a school for autistic children. He has won several awards, including an award from the Prime Minister of India, ‘Udyog Ratna’, ‘Distinguished Alumnus’ from IIT, the ‘Kumar Gandharva’ award at the hands of Pandit Bhimsen Joshi, ‘Navaratna’ from the Sahyadri TV channel, the ‘Indradhanu Puraskar’, and the ‘Parnerkar Puraskar’ for his contributions to Economics. Besides this, he was ranked 16th in merit in the Maharashtra Board Examination. A brilliant student, he was always a topper in class and won many prizes in Mathematics. He has a website (www.achyutgodbole.com) and can be reached at [email protected].

Atul Kahate has been working with Oracle Financial Services Software Limited (earlier i-flex solutions limited) as Head—Technology Practice for over eight years. He has 15 years of experience in Information Technology in India and abroad in various capacities. Previously, he has worked with Syntel, L&T Infotech, American Express and Deutsche Bank. He has a Bachelor of Science degree in Statistics and a Master of Business Administration in Computer Systems. He has authored 24 highly acclaimed books on Technology, Cricket, and History, published by Tata McGraw-Hill and other reputed publishers. Some of these titles include Web Technologies—TCP/IP to Internet Application Architectures, Cryptography and Network Security, Fundamentals of Computers, Information Technology and Numerical Methods, Introduction to Database Management Systems, Object Oriented Analysis and Design, and Schaum’s Series Outlines—Programming in C++, XML and Related Technologies, as well as international and Chinese translated editions. Several of his books are being used as course textbooks or sources of reference in a number of universities/colleges/IT companies all over the world. He has authored ‘Flu chi kahani—Influenza te Swine flu’ (The story of Flu) and has also co-authored a book in Marathi titled IT t ch jayachay (I want to enter into IT). He has authored two books on cricket, and has written over 3000 articles on IT and cricket in leading Marathi newspapers/magazines/journals in India and abroad. He has deep interest in history, teaching, science, economics, music, and cricket, besides technology.
He has conducted several training programmes on a wide range of technologies in a number of educational institutions and IT organizations, including prestigious institutions such as IIT, Symbiosis, I2IT, MET, and the Indira Institute of Management. He has done a series of programmes for the IBN Lokmat, Star Majha, and SAAM TV channels, explaining complex technology. Kahate has also worked as the official cricket statistician and scorer in a number of Test and limited-overs international cricket matches. He has contributed to cricket websites such as CricInfo and Cricket Archive. He is also a member of the Association of Cricket Statisticians, England. He has won several awards, both in India and abroad, including the Computer Society of India (CSI) award for IT education and literacy, the noted ‘Yuvonmesh Puraskar’ from Indradhanu-Maharashtra Times, and the ‘IT Excellence Award’ from the Indira Group of Institutes. He has a website (www.atulkahate.com) and can be reached at [email protected].

Managing Director
Softexcel Consultancy Services
Mumbai

Head–Technology Practice
PrimeSourcing Division™
Oracle Financial Services Software Limited

Tata McGraw Hill Education Private Limited
NEW DELHI

McGraw-Hill Offices
New Delhi New York St Louis San Francisco Auckland Bogotá Caracas Kuala Lumpur Lisbon London Madrid Mexico City Milan Montreal San Juan Santiago Singapore Sydney Tokyo Toronto

Published by Tata McGraw Hill Education Private Limited, 7 West Patel Nagar, New Delhi 110 008

Operating Systems, 3/e

Copyright © 2011, 2005, 1996, by Tata McGraw Hill Education Private Limited. No part of this publication can be reproduced or distributed in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, or stored in a database or retrieval system without the prior written permission of the publishers. The program listings (if any) may be entered, stored and executed in a computer system, but they may not be reproduced for publication.

This edition can be exported from India only by the publishers, Tata McGraw Hill Education Private Limited.

ISBN-13: 978-0-07-070203-5
ISBN-10: 0-07-070203-9

Vice President and Managing Director—MHE, Asia Pacific Region: Ajay Shukla
Head—Higher Education Publishing and Marketing: Vibha Mahajan
Manager—Sponsoring: Shalini Jha
Assistant Sponsoring Editor: Surabhi Shukla
Development Editor: Surbhi Suman
Executive—Editorial Services: Sohini Mukherjee
Senior Production Executive: Suneeta Bohra
Deputy Marketing Manager—SEM & Tech. Ed.: Biju Ganesan
General Manager—Production: Rajender P Ghansela
Assistant General Manager—Production: B L Dogra

Information contained in this work has been obtained by Tata McGraw-Hill, from sources believed to be reliable. However, neither Tata McGraw-Hill nor its authors guarantee the accuracy or completeness of any information published herein, and neither Tata McGraw-Hill nor its authors shall be responsible for any errors, omissions, or damages arising out of use of this information. This work is published with the understanding that Tata McGraw-Hill and its authors are supplying information but are not attempting to render engineering or other professional services. If such services are required, the assistance of an appropriate professional should be sought.

Typeset at Print-O-World, 2579, Mandir Lane, Shadipur, New Delhi 110 008, and printed at Lalit Offset Printer, 219, F.I.E., Patpar Ganj Industrial Area, Delhi 110 092

Cover Printer: Rashtriya Printers

To

Shobha Godbole and

Anita Kahate for their support, understanding, patience, and perseverance


PREFACE

OVERVIEW

Almost everybody involved in the development of software comes in contact with different Operating Systems. There are two major groups of people in this context. One group is concerned with knowing how an operating system is designed, what data structures are used by an operating system, and how the various algorithms within an operating system are organized in various layers to execute different functions. This is the class of system programmers who are required to study the internals of an operating system and later on participate in designing, installing and managing an operating system (including performance tuning), or in enhancing it by writing various device drivers to support new devices, etc.

When Mr Achyut Godbole came up with the first edition of this book in 1995, the subject of Operating Systems was a topic of immense interest to technologists studying system software. UNIX and Microsoft Windows were the leading operating systems of the time. The book was not designed with any particular syllabus or course in mind. It was merely an effort to explain the way operating systems function, so that someone with a basic background in computer technology would be able to understand the subject thoroughly.

As the 1990s gave way to the new millennium, computer technologies evolved much more rapidly than ever before. The role of the Internet changed the entire scenario dramatically. Suddenly, desktop computing gave way to distributed computing. Web servers and, subsequently, application servers assumed significant importance. Database servers started hosting billions of bytes of data, which has now easily run into trillions of bytes, and more. At the same time, however, the client (or the Web browser) also became very crucial. All this meant that operating systems catering to the needs of these diverse sets of users had to adapt to these requirements. We came out with a second edition of the book to reflect these changes. The main additions were the case studies on Windows 2000 and Linux. Many other supplementary changes were also made.

The third edition of the book is now in your hands. This edition reflects the current technology trends, and also captures some of the other topics that we felt were necessary to make the book even more exhaustive.

WHAT'S NEW IN THE THIRD EDITION

As in the previous editions, comprehensive coverage of all important topics using lucid explanations is given utmost emphasis. Furthermore, step-by-step guidance and over 400 diagrams help lend clarity to the subject. The main changes carried out from the previous edition are as follows:

The chapter on Information Management is split into two chapters for ease of reference.
The File Systems chapter contains all the modern techniques pertaining to file management.
The chapter on I/O Management and Disk Scheduling contains details of disk driver management, I/O software and hardware, and the organization of data on disks.
The chapter on Interprocess Communication is renamed as Process Synchronization. It has been updated further and contains the concepts of concurrent programming.
Coverage of CPU Scheduling is enhanced.

History of Operating Systems now contains batch and real-time concepts as well.
System Calls and System Protection are now covered more comprehensively.
Memory Management contains additional topics on virtual memory.
The chapter on Deadlocks now has topics on the system model pertaining to deadlocks.
Enhanced pedagogy, including detailed questions and answers at the end of every chapter, totalling over 200 test questions and 275 objective-type questions.

SCOPE / TARGET AUDIENCE

This book precisely covers the topics required for an introductory course on operating systems for bachelor's and master's degrees in Science, Engineering or Management, and is meant for students of such courses.

The book is also meant for the group of professionals consisting of application programmers writing programs in Java, C#, C++, ORACLE or any other third or fourth generation language. This group comes into contact with only the command language (JCL and SHELL) of the operating system. For this group, this JCL is what constitutes the “knowledge of” or “experience on” a particular operating system. But for this group, the operating system remains a mystery, despite working with it for years. For instance, how does the operating system translate the commands issued at the terminal into actual actions in terms of machine instructions? There are excellent books available today which describe the internal workings of an operating system, and there are also books which teach an application programmer what commands to issue at a terminal to achieve a specific result. The point is to link these two levels and demystify the subject. This book aims at doing this, in a step-by-step manner. Thus, existing application programmers will stand to gain a lot from this book.

We are quite confident that the changes made in this edition will make the book even more useful for all the students of various Operating System courses at the graduate/post-graduate levels. The book will also be useful for all the professors/teachers of these courses. Additionally, computer programmers/managers/CTOs who want to learn more about Operating Systems will also find the book a great help.

ROADMAP FOR VARIOUS TARGET COURSES

The book has been carefully designed so that a reader who is not familiar with the details of computer architecture can start from the first chapter, which provides a detailed overview of the history of computers. The second chapter provides a very lucid and comprehensive introduction to the functioning of a computer from inside. We believe that this understanding is crucial for a better appreciation of this book. However, those familiar with computer architecture can skip these chapters and move on to the third chapter. For the rest of the book, no specific sequence is needed for reading, since the various topics covered are fairly independent in nature, and the reader can grasp them depending on how the course is designed, or on what he/she exactly wants to know.

CHAPTER ORGANISATION

Chapter 1 deals with the history of Operating Systems. It covers the various milestones in the history of Operating Systems, and also the modern trends in Operating Systems.

Chapter 2 begins with an overview of programming language levels, and presents a view at each of these levels, viz. 4GL, 3GL, 2GL and 1GL (machine language). It shows the relationships amongst these levels, which essentially provide views of the same system at different levels of capabilities and, therefore, abstractions.

Chapter 3 introduces the Operating System functions as provided by the various system calls. It presents the user's/application programmer's view of the Operating System and also that of the system programmer, and shows how these are related. It discusses the system calls in three basic categories: Information Management (IM), Process Management (PM), and Memory Management (MM). It also shows the relationship between these three modules.

Chapter 4 introduces File Systems. It explains file organization and access management, and file sharing and protection. Directory systems are then discussed based on their levels. The chapter also defines directory operations, free-space management, bit vectors and log structured file systems.

Chapter 5 is on I/O Management and Disk Scheduling. It defines the concept of a "block" and goes on to explain how the data for a file is organized on a disk. It explains the functioning of hard and floppy disks in detail. It explains how the Operating System does the address translation from logical to physical addresses to actually read/write any record. It goes on to show the relationship between the Application Program (AP), the Operating System (O/S), and the Data Management Software (DMS).

Chapter 6 defines a "process" and explains the concepts of context switching as well as multiprogramming. It defines various process states and discusses different process state transitions. It gives the details of a data structure called the "Process Control Block (PCB)" and uses it to show how different operations on a process, such as "create", "kill", or "dispatch", are implemented, each time showing how the PCB chains would reflect the change. It then discusses the different methods used for scheduling various processes.

Chapter 7 describes the problems encountered in Process Synchronization by taking the example of the Producer-Consumer algorithms. It illustrates the various solutions that have been proposed so far for mutual exclusion. The chapter concludes with a detailed discussion of semaphores and the classic problems in Inter Process Communication.

Chapter 8 describes and defines a deadlock and also shows how the situation can be represented graphically. It states the prerequisites for the existence of a deadlock. It then discusses various strategies for handling deadlocks, viz. ignore, detect, recover from, prevent, and avoid. It concludes with a detailed discussion of the Banker's algorithm.

Chapter 9 elucidates various contiguous and non-contiguous memory allocation schemes. For all these schemes, it states the support that the Operating System expects from the hardware and then goes on to explain in detail the way the scheme is implemented.

Security is an important aspect of any Operating System. Chapter 10 discusses the concept of security, and the various threats to and attacks on it. It then goes on to discuss how security violations can take place due to parameter passing mechanisms. It discusses computer worms and viruses, explaining in detail how they operate and grow. The chapter discusses various security design principles and also various protection mechanisms to enforce security.

Chapter 11 introduces the concept of Parallel Processing and contrasts it with uniprocessing as well as distributed processing, discussing the merits and demerits of all.
The chapter also demonstrates how programs are written for parallel processing, and discusses the classification of computers.

Chapter 12 defines the term "distributed processing" and contrasts centralized versus distributed processing. It also describes three ways in which the processing can be distributed, viz. distributed application, distributed data and distributed control. It takes an example to clarify these concepts.

Chapter 13 offers a detailed case study of Windows NT and Windows 2000. Along with Linux, the Windows family has become one of the most important Operating Systems that a technologist should know. The chapter provides an in-depth discussion of Windows, including its architecture, design principles, and various Operating System algorithms/data structures.

Chapter 14 presents a similar detailed case study of UNIX. The chapter provides a detailed description of UNIX, including its architecture, design principles, and various Operating System algorithms/data structures.

Chapter 15 is similar to Chapter 14, except that it describes Linux and not UNIX. The chapter provides notes on the differences between these two Operating Systems at appropriate places.

WEB SUPPLEMENTS

The Web supplements can be accessed at http://www.mhhe.com/godbole/os3 and contain the following resources:

For Instructors: Chapterwise PowerPoint slides.

For Students: Chapterwise solutions for True and False questions and MCQs, Frequently Asked Questions from OS, a detailed case study of UNIX, and a chapter on Multimedia Operating System for extra reading.

ACKNOWLEDGEMENTS

We are thankful to Shobha Godbole and Anita Kahate for their constant support in numerous ways. Without them, this book would not have been possible. We are also grateful to Sapna and Umesh Aherwadikar for their significant help in many ways.

We would also like to acknowledge the following reviewers who took out time to review the book, which helped us in giving a final shape to the revised edition:

D S Kushwaha, Motilal Nehru National Institute of Technology (MNNIT), Allahabad, Uttar Pradesh
Nitin Gupta, National Institute of Technology (NIT), Hamirpur, Himachal Pradesh
Amritanjali, Birla Institute of Technology, Mesra, Jharkhand
Madhura V Phatak, Institute of Technology (MIT), Pune, Maharashtra
K Poulose Jacob, Cochin University of Science and Technology, Cochin, Kerala

V Shashikiran, Sri Venkateswara College of Engineering, Sriperumbudur, Tamil Nadu
Saidalavi Kalady, National Institute of Technology (NIT), Calicut, Kerala

Finally, we would like to thank the Tata McGraw-Hill Education team, especially Vibha Mahajan, Shalini Jha, Surbhi Shukla, Surbhi Suman, Sohini Mukherjee, Suneeta Bohra and Baldev Raj, for their enthusiastic support and guidance in bringing out the revised edition of the book. We hope that the reader likes this revised edition and finds it useful in learning the concepts of Operating Systems.

Achyut Godbole
Atul Kahate

Constructive suggestions and criticism always go a long way in enhancing any endeavour. We request all readers to email us their valuable comments / views / feedback for the betterment of the book at [email protected] mentioning the title and author name in the subject line. Please report any piracy spotted by you as well!

VISUAL TOUR


Comprehensive coverage of all topics presented with lucid explanations in simple language. Step-by-step guidance is given wherever necessary for easier understanding of the concepts.

Diagrams form an important part of every textbook on Science and Engineering. This book contains over 400 diagrams which lend clarity to the concepts discussed.


Terms and Concepts Used are highlighted in a separate section at the end of each chapter. These help students have a quick overview of the important terms discussed in the chapter, look up the definitions and memorise them as part of self-study.


A bulleted Summary given at the end of each chapter helps students revise the important concepts learnt in the chapter.


As part of the self-assessment programme, a detailed set of questions is present in every chapter. The Question Set contains various types of problems, including Test Questions, True and False Questions and Multiple-Choice Questions.

The history of Operating Systems is inextricably linked with the history and development of various generations of computer systems. In this chapter, we will trace the history of Operating Systems by delineating the chronological development of hardware generations.

The first digital computer was designed by Charles Babbage (1791–1871), an English mathematician. This machine had a mechanical design in which wheels, gears, cogs and so on were used. As this computer was slow and unreliable, this design could not really become very popular. There was no question of any operating system of any kind for this machine.

Several decades later, a solution evolved which was electronic rather than mechanical. This solution emerged out of the concerted research carried out as part of the war effort during the Second World War. Around 1945, Howard Aiken at Harvard, John von Neumann at Princeton, J. Presper Eckert and John Mauchly at the University of Pennsylvania, and K. Zuse in Germany succeeded in designing calculating machines with vacuum tubes as the central components. These machines were huge, and their continued use generated a great deal of heat. The vacuum tubes also used to burn out very fast (during one computer run, as many as 10,000–20,000 tubes could be wasted!).

The programming was done only in machine language, which could be termed the first generation language. There was no assembly language, nor any higher level language. Again, there was no operating system for these machines either! These were single-user machines, which were extremely unfriendly to users/programmers.

Around 1955, transistors were introduced in the USA at AT&T. The problems associated with vacuum tubes vanished overnight. The size and the cost of the machines dwindled dramatically, and the reliability improved. For the first time, new categories of professionals called systems analysts, designers, programmers and operators came into being as distinct entities. Until then, the functions handled by these categories of people had been managed by a single individual. Assembly language, as a second generation language, and FORTRAN, as one High Level Language (third generation language), emerged, and the programmer's job was greatly simplified. However, these were batch systems. The IBM-1401 belonged to that era. There was no question of having multiple terminals attached to the machine, carrying out different inquiries. The operator was continuously busy loading or unloading cards and tapes before and after the jobs. At a time, only one job could run. At the end of one job, the operator had to dismount the tapes and take out the cards ('teardown operation'), and then load the decks of cards and mount the tapes for the new job ('setup operation'). This consumed a lot of computer time, and valuable CPU time was therefore wasted. This was the case when the IBM-1401 was in use. An improvement came when the IBM-7094, a faster and larger computer, was used in conjunction with the IBM-1401, which was then used as a 'satellite computer'. The scheme used to work as follows:

(i) There used to be 'control cards' giving information about the job, the user and so on, sequentially stacked, as depicted in Fig. 1.1. For instance, $JOB specified the job to be done, the user who was doing it, and maybe some other information. $LOAD signified that what would follow were cards with executable machine instructions punched onto them, and that they were to be loaded in the main memory before the program could be executed. These cards were, therefore, collectively known as an 'object deck' or an 'object program'. When the programmer wrote his program in an assembly language (this was called a 'source program'), a special program called an 'assembler' would carry out the assembly process and convert it into an object program before it could be executed. The assembler would also punch these machine instructions onto the cards in a predefined format. For instance, each card had a sequence number to help the deck to be rearranged in case the cards fell out by mistake. The column in which the 'op code' of the machine instruction started was also fixed (e.g. column 16 in the case of Autocoder), so that the loader could do its job easily and quickly.

The $LOAD card would essentially signify that the object cards following it should then be loaded in the memory. Obviously, the object program cards followed the $LOAD card as shown in the figure. The $RUN control card would specify that the program just then loaded should be executed by branching to the first executable instruction specified by the programmer in the "ORG" statement. The program might need some data cards which then followed. $END specified the end of the data cards and $JOB specified the beginning of a new job again! (ii) An advantage of stacking these cards together was to reduce the efforts of the operator in 'set up' and 'teardown' operations, and therefore, to save precious CPU time. Therefore, many such jobs were stacked together one after the other as shown in Fig. 1.1. (iii) All these cards were then read one by one and copied onto a tape using a "card to tape" utility program. This was done on an IBM-1401 which was used as a satellite computer. This arrangement is shown in Fig. 1.2. Controls such as 'total number of cards read' were developed and printed by the utility program at the end of the job to ensure that all cards were read.

(iv) The prepared tape (Tape-J shown in Fig. 1.2) was taken to the main 7094 computer and processed as shown in Fig. 1.3. The figure shows Tape-A as an input tape and Tape-B as an output tape. The printed reports were not actually printed on the 7094; instead, the print image was dumped onto a tape (Tape-P) which was carried back to the slower 1401 computer, which did the final printing as shown in Fig. 1.4. Due to this procedure, the 7094, which was the faster and more expensive machine, was not locked up for a long time unnecessarily. The logic of splitting the operation of printing into two stages was simple. The CPU of a computer was quite fast as compared to any I/O operation, because the CPU was a purely electronic device, whereas I/O involved electromechanical operations. Secondly, of the two types of I/O operations, writing on a tape was faster than printing a line on paper. Therefore, the time of the more powerful, more expensive 7094 was saved. This is because the CPU can execute only one instruction at a time. If the 7094 was used to print a report, it would be idle for most of the time; when a line was being printed, the CPU could not be doing anything else. Of course, some computer had to read the print image tape (Tape-P) and print a report. But then, that could be delegated to a relatively less expensive satellite computer, say the 1401. Writing on tape and then printing on the printer appears to be wasteful and more expensive, but it was not so, due to the differential powers and costs of the 7094 and 1401. This scheme was very efficient and improved the division of labour. The three operations required for the three stages shown in Figs. 1.2 to 1.4 were repetitive, and the efficiency increased.

The only difference was that the 7094 had to have a program which read the card images from Tape-J and interpreted them (e.g. on hitting a $LOAD card image, it actually started loading the program from the following card image records). This was essentially a rudimentary Operating System. The IBM-7094 had two Operating Systems: 'IBSYS' and the 'Fortran Monitor System (FMS)'. Similarly, the IBM-1401 had to have a program which interpreted the print images from the tape and actually printed the report. This program was a rudimentary 'spooler'. One scheme was to have the exact print image on the tape. For instance, if there were 15 blank lines between two printed valid report lines, one would actually write 15 blank lines on the print image tape. In this case, the spooler program was very simple. All it had to do was to dump the tape records on the printer. But this scheme was clearly wasteful, because the IBM-7094 program had to keep writing actual blank lines; additionally, the tape utilization was poor. A better scheme was to use special characters (which are normally not used in common reports, etc.) to denote end-of-line, end-of-page, number of lines to be skipped and so on. In this case, the program on the IBM-7094 which created the print-image tape became a little more complex but far more efficient. The actual tape was used far more efficiently, but then the spooler program also became more complex. It had to actually interpret the special characters on the tape and print the report!

This was a single-user system. Only one program belonging to only one user could run at a time. When the program was reading or writing a record, the CPU was idle; and this was very expensive. Due to their electromechanical nature, the I/O operations used to be extremely time-consuming as compared to the CPU operations (this is true even today despite great improvements in the speeds of the I/O devices!). Therefore, during the complete execution of a job, the actual CPU utilization was very poor. Despite these limitations, the rudimentary Operating System did serve the purpose of reducing operator intervention in the execution of computer jobs. Setup and teardown were then applicable only for a set of jobs stacked together instead of for each job. During this period, the mode of file usage was almost always sequential. Database Management Systems (DBMS) and On-line systems were unheard of at that time. One more development of this era was the introduction of a library of standard routines. For example, the 'Input Output Control System (IOCS)' was developed in an assembly language of the IBM-1401, called 'Autocoder', and was supplied along with the hardware. This helped the programmers significantly because they no longer had to code these tedious and error-prone routines every time in their programs.
The concept of a 'system call', where the Operating System carried out a function on behalf of the user, was still not in use. These routines in the source code had therefore, to be included along with the other source program for all programs before the assembly process. Therefore, these routines went through the assembly process every time. An improvement over this was to predetermine the memory locations where the IOCS was expected to be loaded, and to keep the preassembled IOCS routines ready. They were then added to the assembled object program cards to be loaded by the loader along with the other object deck. This process saved the repetitive assembly of IOCS routines every time along with every source program. The source program had simply to "Branch" to the subroutine residing at a predefined memory address to execute a specific I/O instruction.

In the early 60s, many companies such as National Cash Register (NCR), Control Data Corporation (CDC), General Electric (GE), Burroughs, Honeywell, RCA and Sperry Univac started providing their computers with Operating Systems. But these were mainly batch systems, concerned primarily with throughput. Transaction processing systems started emerging as the users felt the need for more and more on-line processing. In fact, Burroughs was one of the few companies which produced an Operating System, called the 'Master Control Program (MCP)', which had many features of today's Operating Systems, such as multiprogramming (execution of many simultaneous user programs), multiprocessing (many processors controlled by one Operating System) and virtual storage (program size allowed to be more than the available memory).

IBM announced the System/360 series of computers in 1964. IBM had designed various computers in this series which were mutually compatible, so that the effort of converting programs from one machine to another in the same family was minimal. This is how the concept of a 'family of computers' came into being. The IBM-370, 43xx and 30xx systems belong to the same family of computers. IBM faced the problem of converting the existing 1401 users to the System/360, and there were many. IBM provided the customers with utilities such as 'simulators' (totally software driven and therefore a little slow) and 'emulators' (using hardware modifications to enhance the speed at extra cost) to enable the old 1401-based software to run on the IBM-360 family of computers. Initially, IBM had plans for delivering only one Operating System for all the computers in the family. However, this approach proved to be practically difficult and cumbersome. The Operating System for the larger computers in the family, meant to manage larger resources, was found to create far more burden and overheads if used on the smaller computers. Again, the Operating System that could run efficiently on a smaller computer would not manage the resources of a large computer effectively. At least, IBM thought so at that time. Therefore, IBM was forced to deliver four Operating Systems within the same range of computers.

The major advantages/features and problems of this computer family and its Operating Systems were as follows:

The System/360 was based on 'Integrated Circuits (ICs)' rather than transistors. With ICs, the cost and the size of the computer shrank substantially, and yet the performance improved.

The Operating Systems for the System/360 were written in assembly language. The routines were therefore, complex and time-consuming to write and maintain. Many bugs persisted for a long time. As these were

written for a specific machine and in the assembly language of that machine, they were tied to the hardware. They were not easily 'portable' to machines with a different architecture not belonging to the same family.

Despite these problems, the user found them acceptable, because, the operator intervention (for setup and teardown) decreased. A 'Job Control Language (JCL)' was developed to allow communication between the user/programmer and the computer and its Operating System. By using the JCL, a user/programmer could instruct the computer and its Operating System to perform certain tasks, in a specific sequence for creating a file, running a job or sorting a file.

The Operating System supported mainly batch programs, but it made 'multiprogramming' very popular. This was a major contribution. The physical memory was divided into many partitions, each holding a separate program. One of these partitions held the Operating System, as shown in Fig. 1.5. However, because there was only one CPU, only one program could be executing at any instant. Therefore, there was a need for a mechanism to switch the CPU from one program to the next. This is exactly what the Operating System provided. One of the major advantages of this scheme was the increase in 'throughput'. If the same three programs shown in Fig. 1.5 were to run one after the other, the total elapsed time would have been much more than under a scheme which used multiprogramming. The reason was simple. In a uniprogramming environment, the CPU was idle whenever any I/O for the program was going on (and that was quite a lot!), but in a multiprogramming Operating System, when the I/O for one program was going on, the CPU was 'switched' to another program. This allowed the I/O of one program to be overlapped with the processing of some other program by the CPU, thereby increasing the throughput.

The concept of 'Simultaneous Peripheral Operations On-Line (spooling)' was fully developed during this period. This was the outgrowth of the same principle that was used in the scheme discussed earlier and depicted in Figs. 1.2 to 1.4. The only difference was that you no longer had to carry tapes to and from the 1401 and 7094 machines. Under the new Operating System, all jobs in the form of cards could be read onto the disk first (shown as 'a' in the figure), and later on, the Operating System would load as many jobs into the memory, one after the other, as the available memory could accommodate (shown as 'b' in the figure). After many programs were loaded in different partitions of the memory, the CPU was switched from one program to another to achieve multiprogramming. We will later see different policies used to achieve this switching. Similarly, whenever any program printed something, it was not written directly on the printer; instead, the print image of the report was written onto the disk in the area reserved for spooling (shown as 'c' in the figure). At any convenient time later, the actual printing from this disk file could be undertaken (shown as 'd' in the figure). This is depicted in Fig. 1.6.

Spooling had two distinct advantages. One was that it allowed smooth multiprogramming operations. Imagine if two programs, say, Stores Ledger and Payslips Printing, were allowed to issue simultaneous instructions to write directly on the printer: the result would be a hilarious report with intermingled lines from both the reports on the same page. Instead, the print images of both the reports were first written onto the disk at two different locations of the spool file, and the spooler program subsequently printed them one by one. Therefore, while printing, the printer was allocated only to the spooler program. In order to guide this subsequent printing process, the print image copy of the report on the disk also contained some predefined special characters, such as one for skipping a page. These were interpreted by the spooler program at the time of producing the actual report. Spooling had another advantage too! All the I/O of all the jobs was essentially pooled together in the spooling method, and therefore it could be overlapped with the CPU-bound computations of all the jobs at an appropriate time chosen by the Operating System to improve the throughput.

The System/360 with its Operating Systems enhanced multiprogramming, but the Operating Systems were not geared to meet the requirements of interactive users. They were not very suitable for query systems, for example. The reason was simple. In interactive systems, the Operating System needs to recognize a terminal as an input medium. In addition, the Operating System has to give priority to the interactive processes over batch processes. For instance, if you fire a query on the terminal such as "What is the flight time of Flight SQ024?", and the passenger has to be serviced within a brief time interval, the Operating System must give higher priority to this process than, say, to a payroll program running in the batch mode. The classical 'multiprogramming batch' Operating Systems did not provide for this kind of scheduling of various processes. A change was needed. IBM responded by giving its users a program called the 'Customer Information Control System (CICS)', which essentially provided a 'Data Communication (DC)' facility between the terminal and the computer. It also scheduled various interactive users' jobs on top of the Operating System. Therefore, CICS functioned not only as a Transaction Processing (TP) monitor but also took over some functions of the Operating System, such as scheduling. IBM also provided the users with the 'Time Sharing Option (TSO)' software later to deal with the situation.

Many other vendors came up with 'Time Sharing Operating Systems' during the same period. For instance, DEC came up with TOPS-10 on the DEC-10 machine, RSTS/E and RSX-11M for the PDP-11 family of computers and VMS for the VAX-11 family of computers. Data General produced AOS for its 16-bit minicomputers and AOS/VS for its 32-bit super-mini computers. These Operating Systems could learn from the good/bad points of the Operating Systems running on the System/360. Most of these were far more user/programmer friendly. Terminal handling was built into the Operating System. These Operating Systems provided for batch as well as on-line jobs by allowing both to coexist and compete for the resources, but giving higher preference to servicing the on-line requests.

One of the first time sharing systems was the 'Compatible Time Sharing System (CTSS)' developed at the Massachusetts Institute of Technology (M.I.T.). It was used on the IBM-7094 and it supported a large number of interactive users. Time sharing became popular at once. 'Multiplexed Information and Computing Service (MULTICS)' was the next one to follow. It was a joint effort of MIT, Bell Labs and General Electric. The aim was to create a 'computer utility' which could support hundreds of simultaneous time sharing users. MULTICS was a crucible which generated and tested almost all the important ideas and algorithms which were to be used repeatedly over several years in many Operating Systems. But the development of MULTICS itself was very painful and expensive. Finally, Bell Labs withdrew from the project. In fact, in the process, GE gave up its computer business altogether. Despite its relative failure, MULTICS had a tremendous influence on the design of Operating Systems for many years to come.

One of the computer scientists, Ken Thompson, who had worked on the MULTICS project at Bell Labs, subsequently got hold of an unused PDP-7 machine. Bell Labs had already withdrawn from MULTICS. Ken Thompson hit upon the novel idea of writing a single-user, stripped down version of MULTICS for the PDP-7. Another computer scientist, Brian Kernighan, started calling this system 'UNICS' in fun. Later on, the name UNIX was adopted. None of these people were aware of the tremendous impact this event was to have on all the future developments. The UNIX Operating System was later ported to a larger machine, the PDP-11/45. There were, however, major problems in this porting. The problems arose because UNIX was written in assembly language. A more adventurous idea struck another computer scientist, Dennis Ritchie: that of writing UNIX in a higher level language. Ritchie examined all the existing Higher Level Languages (HLLs) and found none suitable for this task. He, in fact, designed and implemented a language called 'C' for this purpose. Finally, UNIX was written in C. Only 10% of the kernel and the hardware-dependent routines, where the architecture and the speed mattered, were written in the assembly language of that machine. All the rest (about 90%) was written in C. This made the job of 'porting' the Operating System far easier. Today, to port UNIX to a new machine, you need a C compiler on that machine to compile the 90% of the source code written in C into the machine instructions of the target computer. You also need to rewrite, test and integrate only the remaining 10% of assembly language code on that machine.

Despite this facility, the job of porting is not a trivial one, though it is far simpler than porting the earlier Operating Systems. This was a great opportunity for the hardware manufacturers. With new hardware and newer architectures, instead of writing a new Operating System each time, porting UNIX was a far better solution. They could announce their products far faster, because all the other products such as Database Management Systems, Office Automation Systems, language compilers, and so on could also then be easily ported, once the System Calls under UNIX were known and available. After this, porting of Application Programs also became a relatively easier task.

Meanwhile, Bell Labs, the research arm of AT&T, licensed the UNIX source code to many universities almost freely. It became very popular amongst the students, who later became designers and managers of software development processes in many organizations. This was one of the main reasons for its popularity (by now it had a multiuser version).

When 'Large Scale Integration (LSI)' circuits came into existence, thousands of transistors could be packaged on a very small area of a silicon chip. A computer is made up of many units such as a CPU, memory, I/O interfaces, and so on. Each of these is further made up of different modules such as registers, adders, multiplexers, decoders and a variety of other digital circuits. Each of these, in turn, is made up of various gates (for example, one memory location storing 1 bit is made up of as many as seven gates!). These gates are implemented in digital electronics using transistors. As the size of a chip containing thousands of such transistors shrank, obviously the size of the whole computer also shrank. But the process of interconnecting these transistors to form all the logical units became more intricate and complex. It required tremendous accuracy and reliability. Fortunately, with Computer Aided Design (CAD) techniques, one could design these circuits easily and accurately, using other computers themselves! Mass automated production techniques reduced the cost and increased the reliability of the computers produced. The era of microcomputers and Personal Computers (PCs) had begun.

With the hardware, you obviously need the software to make it work. Fortunately, many Operating System designers on the microcomputers had not worked extensively on the larger systems, and therefore many of them were not biased in any manner. They started with fresh minds and fresh ideas to design the Operating System and other software on them. 'Control Program for Microcomputers (CP/M)' was almost the first Operating System on the microcomputer platform. It was developed on the Intel 8080 in 1974 as a File System by Gary Kildall. Intel Corporation had decided to use PL/M instead of assembly language for the development of systems software and needed a compiler for it badly. Obviously, the compiler needed some support from some kind of utility (Operating System) to perform all the file related operations. Therefore, CP/M was born as a very simple, single-user Operating System. It was initially only a File System to support a resident PL/M compiler. This was done at Digital Research Inc. (DRI). After the commercial licensing of CP/M in 1975, other utilities such as editors, debuggers, etc. were developed, and CP/M became very popular. CP/M went through a number of versions. Finally, a 16-bit multiuser, time sharing 'MP/M' was designed with real time capabilities, and a genuine competition with the minicomputers started. In 1980, 'CP/NET' was released to provide networking capabilities, with MP/M as the server to serve the requests received from other CP/M machines. One of the reasons for the popularity of CP/M was its 'user-friendliness'. This had a lot of impact on all the subsequent Operating Systems on microcomputers.

After the advent of the IBM-PC based on the Intel 8086 and then its subsequent models, the 'Disk Operating System (DOS)' was written. IBM's own PC-DOS and MS-DOS by Microsoft are close cousins with very similar features. The development of PC-DOS again was related to CP/M. A company called 'Seattle Computer' developed an Operating System called QDOS for the Intel 8086. The main goal was to enable the programs developed under CP/M on the Intel 8080 to run on the Intel 8086 without any change. The Intel 8086 was upward compatible with the Intel 8080. QDOS, however, had to be faster than CP/M in disk operations. Microsoft Corporation was quick to realize the potential of this product, given the projected popularity of the Intel 8086. It acquired the rights for QDOS, which later became MS-DOS (the IBM version is called PC-DOS). MS-DOS is a single-user, user-friendly Operating System. In quick succession, a number of other products such as Database Systems (dBASE), Word Processing (WORDSTAR), Spreadsheets (LOTUS 1-2-3) and many others were developed under MS-DOS, and the popularity of MS-DOS increased tremendously. The subsequent development of compilers for various High Level Languages such as BASIC, COBOL and C added to this popularity, and, in fact, opened the gates to a more serious software development process. This was to play an important role after the advent of Local Area Networks (LANs). MS-DOS was later influenced by UNIX and has been evolving towards UNIX over the years. Many features such as a hierarchical file system have been introduced in MS-DOS over a period of time.

With the advent of the Intel 80286, the IBM PC/AT was announced. The hardware had the power of catering simultaneously to multiple users, despite the name 'Personal Computer'. Microsoft quickly adapted UNIX to this platform to announce 'XENIX'. IBM joined hands with Microsoft again to produce a new Operating System called 'OS/2'. Both of these run on 286 and 386 based machines and are multi-user systems. While XENIX is almost the same as UNIX, OS/2 is fairly different from, though influenced by, MS-DOS, which runs on the IBM PC/AT as well as the PS/2. With the advent of 386 and 486 computers, bit-mapped graphic displays became faster and therefore more realistic. Therefore, Graphical User Interfaces (GUIs) became possible, and in fact necessary, for every application. With the advent of GUIs, some kind of standardization was necessary to reduce development and training time. Microsoft again reacted by producing MS-WINDOWS. MS-WINDOWS is actually not an Operating System. Internally, it still uses MS-DOS to execute various system calls. On top of DOS, however, MS-WINDOWS enables a very user friendly Graphical User Interface (as against the earlier text based ones) and also allows windowing capability. MS-WINDOWS did not lend a true multitasking capability to the Operating System. WINDOWS-NT, developed a few years later, incorporated this capability in addition to being windows based. (OS/2 and UNIX provided multitasking, but were not windows based; they had to be used along with Presentation Manager or X-WINDOWS/MOTIF respectively to achieve that capability.)

With the era of smaller but powerful computers, 'Distributed Processing' started becoming a reality. Instead of a centralized large computer, the trend towards having a number of smaller systems at different work sites, connected through a network, became stronger. There were two responses to this development. One was the Network Operating System (NOS) and the other, the Distributed Operating System (DOS). There is a fundamental difference between the two. In a Network Operating System, the users are aware that there are several computers connected to each other via a network. They also know that there are various databases and files on one or more disks, and also the addresses where they reside. But they want to share the data on those disks. Similarly, there are one or more printers shared by various users logged on to different computers. NOVELL's NetWare 286 and the subsequent NetWare 386 Operating Systems fall in this category.
In this case, if a user wants to access a database on some other computer, he has to explicitly state its address. A Distributed Operating System, on the other hand, represents a leap forward. It makes the whole network transparent to the users. The databases, files, printers and other resources are shared amongst a number of users actually working on different machines, but who are not necessarily aware of such sharing. Distributed systems appear to be simple, but they actually are not. Quite often, distributed systems allow parallelism, i.e. they find out whether a program can be segmented into different tasks which can then be run simultaneously on different machines. On top of that, the Operating System must hide the hardware differences which exist between the different computers connected to each other. Normally, distributed systems have to provide for a high level of fault tolerance, so that if one computer is down, the Operating System can schedule the tasks on the other computers. This is an area in which substantial research is still going on. This clearly is the future direction in Operating System technology.

In the last few years, new versions of the existing Operating Systems have emerged, and have actually become quite popular. Microsoft has released Windows 2000, which is technically Windows NT Version 5.0. Microsoft had maintained two streams of its Windows family of Operating Systems: one was targeted at the desktop users, and the other was targeted at the business users and the server market. For the desktop users, Microsoft enhanced its popular Windows 3.11 Operating System to Windows 95, then to Windows 98, Windows ME and Windows XP. For the business users, Windows NT was developed, and its Version 4.0 had become extensively popular. This meant that Microsoft had to support two streams of Windows Operating Systems: one was the stream of Windows 95/98/ME/XP, and the other was the Windows NT stream. To bring the two streams together, Microsoft developed Windows 2000, and it appears that going forward, Windows 2000 would be targeted at both the desktop users as well as the business users. On the UNIX front, several attempts were made to take its movement forward. Of them all, the Linux Operating System has emerged as the major success story. Linux is perhaps the most popular UNIX variant at the time of going to press. The free software movement has also helped Linux to become more and more appealing. Consequently, today, there are two major camps in the Operating System world: Microsoft Windows 2000 and Linux. It is difficult to predict which one of these would eventually emerge as the winner. However, a more likely outcome is that both would continue to be popular, and continue to compete with each other.

In a batch system, users entered input on punched cards. The input collected was then read onto a magnetic tape using a computer such as the IBM 1401. These computers were good at performing tasks like reading cards, copying tapes and printing outputs. In a batch system, the user does not interact with the computer directly. The user submits the job to the operator, and the operator collects jobs from various users. Thus, the operator prepares a batch of jobs. Programmers also leave their programs with the operator. The batch of similar jobs or similar programs would be processed when computer time became available. After the execution of a job or program was complete, the output would be sent to the appropriate user/programmer. In batch system execution, CPU utilization is poor and the CPU is often idle, because most computing jobs involve I/O operations, and in the early days I/O operations required a lot of mechanical action; mechanical movements are much slower than electronic devices.

Real time systems are used when time is a critical factor. There are certain systems in the world with rigid timing requirements: the execution of a task must finish within a specific time period, otherwise the whole system will fail. Unlike batch systems, input to real time systems comes directly and immediately from the users or from other systems, and real time systems are capable of analyzing and processing that data immediately. In real time systems, time is a critical factor and such systems are designed to execute within a certain time period, whereas batch systems are not time dependent.

There are two types of real-time systems: hard real-time and soft real-time. Hard real-time systems guarantee the execution of critical tasks on time. These systems are very restrictive in terms of time constraints, and their critical operations always have high priority over other tasks. Soft real-time systems are less restrictive in terms of time constraints: a critical real-time task gets priority over the other tasks, and when the high priority tasks are finished, the other operations start executing once again.


Computer architecture is a very vast subject and cannot be covered in great detail in a small chapter in a book on Operating Systems. However, no book on Operating Systems can be complete unless it touches upon the subject of computer architecture. This is because Operating Systems are intimately connected to the computer architecture. In fact, the Operating System has to be designed taking into account various architectural issues. For instance, the Operating System is concerned with the way an instruction is executed and the concept of instruction indivisibility. The Operating System is also concerned with interrupts: what they are and how they are handled. The Operating System is concerned with the organization of memory into a hierarchy, i.e. disk, main memory, cache memory and CPU registers. Normally, at the beginning of any program, the data resides on the disk, because the entire data is too large to be held in the main memory permanently. During the execution of a program, a record of interest is brought from the disk into the main memory. If the data is going to be required quite often, it can be moved further up to the cache memory, if available. Cache can be regarded as a faster memory. However, no arithmetic or logical operations such as add or compare, or even data movement operations, can be carried out until the data is finally moved from the memory to the CPU registers. This is because the circuits to carry out these functions are complex and expensive. They cannot be provided between any two memory locations randomly. They are provided only for a few locations which we call CPU registers. The circuits are actually housed in a unit called the Arithmetic and Logical Unit (ALU), to which the CPU registers are connected, as we shall see. The point is: who decides what data resides where? It is the Operating System which takes this important decision of which data resides at what level in this hierarchy. It also controls the periodic movements between them. The Operating System takes the help of the concept of Direct Memory Access (DMA), which forms the very foundation of multiprogramming. Finally, the Operating System is also concerned with parallelism. For instance, if the system has multiple CPUs (a multiprocessing system), the philosophy that the Operating System employs for scheduling various processes changes. The Operating System, in fact, makes a number of demands on the hardware to function properly. For instance, if the virtual memory management system has to work properly, the hardware must keep track of which pages in a program are being referenced more often/more recently and which are not, or which pages have been modified. We will present an overview of computer architecture, limited to what a student of Operating Systems needs to be aware of.

As we know, the hardware and software of a computer are organized in a number of layers. At each layer, a programmer forms a certain view of the computer. This, in fact, is what is normally termed the level of a programming language, which implies the capabilities and limitations of the hardware/software of the system at a given level. This structured view helps us to understand various levels and layers comprehensively in a step-by-step fashion. For instance, a manager who issues a '5GL' instruction to 'produce the Sales Summary report' does not specify which files/databases are to be used to produce this report or how it is to be produced. He just mentions his basic requirements. Therefore, it is completely non-procedural. A non-procedural language allows the user to specify what he wants rather than how it is to be done. A procedural language has to specify both of these aspects.

A 4GL programmer (e.g. a person programming in ORACLE, SYBASE) has to be bothered about which databases are to be used, how the screens should be designed and the logic with which the sales summary is to be produced. Therefore, a 4GL program is not completely non-procedural, though almost all vendors of the so-called 4GLs claim that they are. As of today, the 4GLs are in between completely procedural and completely non-procedural languages. Today’s 4GLs have a lot of non-procedural elements built into them. For instance, they can have an instruction to the effect ‘Print a list of all invoices for all customers belonging to a state XYZ and where the invoice amount is >500 and the list should contain invoice number, invoice amount and the invoice date’.

A 3GL program is completely procedural. COBOL, FORTRAN, C and BASIC are examples of 3GLs. In these languages, you specify in detail not only what you want, but also how it is to be achieved. For instance, the same 4GL instruction described in Sec. 2.2 could give rise to a 3GL program carrying out the following steps:
1. Until it is the end of the invoice file, do the following:
2. Read an invoice record.
3. If the invoice amount is greater than 500, continue; otherwise, go back to step 1 for the next record.
4. Check whether the customer on the invoice belongs to state XYZ; if not, go back to step 1.
5. Print the invoice number, the invoice amount and the invoice date for this invoice.
6. Go back to step 1 for the next record.
7. At the end of the file, stop.
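The following is a minimal 3GL (C) sketch of these steps. The record layout, the field names and the file name invoices.dat are hypothetical, chosen only to make the procedural nature of a 3GL concrete; they are not taken from any real application.

/* A hedged C sketch of the report program described by the steps above.
   The record layout and file name are invented for illustration. */
#include <stdio.h>
#include <string.h>

struct invoice {
    char   number[11];   /* invoice number                       */
    char   state[4];     /* customer state code, e.g. "XYZ"      */
    double amount;       /* invoice amount                       */
    char   date[11];     /* invoice date, e.g. "2009-01-31"      */
};

int main(void)
{
    FILE *fp = fopen("invoices.dat", "rb");   /* hypothetical data file */
    struct invoice inv;

    if (fp == NULL) {
        perror("invoices.dat");
        return 1;
    }
    /* Steps 1 and 2: until end of file, read an invoice record. */
    while (fread(&inv, sizeof inv, 1, fp) == 1) {
        /* Steps 3 and 4: apply the selection conditions. */
        if (inv.amount > 500.0 && strcmp(inv.state, "XYZ") == 0) {
            /* Step 5: print the required fields. */
            printf("%s %10.2f %s\n", inv.number, inv.amount, inv.date);
        }
    }
    fclose(fp);
    return 0;
}

Notice how every detail of the "how" (opening the file, looping, testing each condition, formatting the output) has to be spelt out by the programmer, which is exactly what distinguishes a 3GL from the 4GL instruction of Sec. 2.2.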

Consider a command such as > RUN PAYROLL, where ">" is displayed by the Command Interpreter (CI) as a prompt for the user to type in his command, RUN is a command in the Command Language (CL), and its argument PAYROLL is the name of a program already compiled, linked and stored on the disk for future use, which the user wants to execute now. Figure 3.10 illustrates the picture of the memory, which is occupied by the Operating System, where a part is kept free for any AP to be loaded and executed.

The operating system portion is shown as further divided into a portion for CI. This portion logically consists of compiled versions of a number of routines within the CI. The figure shows one routine for one command. The figure shows a “Scratch Pad for CI” where all commands input by a user are stored temporarily before the input is examined by the watchdog program as shown in Fig. 3.9. The Operating System area also has a portion for other Operating System routines or system calls in the areas of IM, PM and MM. The following steps are now carried out. (i) CI watchdog prompts “>” on the screen and waits for the response. (ii) The user types in “RUN PAYROLL” as shown on the screen in Fig. 3.10. This command is transferred from the terminal keyboard to a memory buffer (scratch pad) of the CI for analysis as shown in Fig. 3.11.

(iii) The CI watchdog program as shown in Fig. 3.9 now examines the command and finds a valid command RUN, whereupon it invokes the routine for RUN. It passes the program name PAYROLL as a parameter to the RUN program. This is as shown in Fig. 3.12. (iv) The RUN routine, with the help of a system call in the IM category, locates a file called PAYROLL and finds out its size. (v) The RUN routine, with the help of a system call in the MM category, checks whether there is free memory available to accommodate this program, and if so, requests the routine in MM to allocate memory to this program. If there is not sufficient memory for this purpose, it displays an error message and waits or terminates, depending upon the policy. (vi) If there is sufficient memory, with the help of a system call in the IM category, it actually transfers the compiled program file for “PAYROLL” from the disk into those available memory locations (loading). This is done after a system call in the IM category verifies that the user wanting to execute the PAYROLL program has an “Execute” Access Right for this file. The picture of the memory now looks as shown in Fig. 3.13. (vii) It now issues a system call in the PM category to schedule and execute this program (whereupon we shall start calling it a “Process”). The picture is a little simplistic, but two points become clear from the discussion above:

• The system calls in IM, PM and MM categories have to work in close cooperation with one another.

• Though the user's or Application Programmer's view of the Operating System is restricted to the CL, the commands are executed with the help of a variety of system calls internally (which is essentially the system programmer's view of the Operating System).
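The way the CI watchdog examines a command and invokes the corresponding routine, as in steps (i) to (iii) above, can be pictured as a lookup in a small dispatch table. The following is a minimal C sketch of that idea only; the command names and the routines are invented for illustration and do not belong to any real Operating System's CI.

/* A toy "watchdog" loop: read a command line, look the verb up in a
   table of command names versus routines, and dispatch to the routine.
   All names here are hypothetical. */
#include <stdio.h>
#include <string.h>

static void do_run(const char *arg)  { printf("loading and running %s\n", arg); }
static void do_del(const char *arg)  { printf("deleting %s\n", arg); }
static void do_list(const char *arg) { printf("listing %s\n", arg); }

struct command {
    const char *name;                    /* command verb, e.g. RUN     */
    void (*routine)(const char *arg);    /* routine that implements it */
};

static const struct command table[] = {
    { "RUN",  do_run  },
    { "DEL",  do_del  },
    { "LIST", do_list },
};

int main(void)
{
    char line[128], verb[32], arg[96];

    printf("> ");                                    /* the CI prompt */
    while (fgets(line, sizeof line, stdin) != NULL) {
        arg[0] = '\0';
        if (sscanf(line, "%31s %95s", verb, arg) >= 1) {
            size_t i, n = sizeof table / sizeof table[0];
            for (i = 0; i < n; i++) {                /* look the verb up */
                if (strcmp(verb, table[i].name) == 0) {
                    table[i].routine(arg);           /* dispatch */
                    break;
                }
            }
            if (i == n)
                printf("unknown command: %s\n", verb);
        }
        printf("> ");
    }
    return 0;
}

In a real CI, of course, the routine for RUN would itself issue the IM, MM and PM system calls listed in steps (iv) to (vii) instead of merely printing a message.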

The latest trend today is to make the user’s life simpler by providing an attractive and friendly Graphical User Interface (GUI) which provides him with various menus with colours, graphics and windows, as in OS/2 under PS/2. Therefore, the user does not have to remember tedious syntaxes of the Command Language, but can point at a chosen option by means of a mouse. However, this should not confuse us. Internally, the position of the mouse is translated into the coordinates or the position of the cursor on the screen. The CI watchdog program maintains a table of routines such as RUN, DEL, etc. versus the possible screen positions of a cursor manipulated by the mouse. When the user clicks at a certain mouse position, the Operating System invokes the corresponding routine. The Operating System essentially refers to that table to translate this screen position into the correct option, and then calls that specific routine (such as RUN) which, in turn, may use a variety of system calls as discussed earlier. An example of the user-friendly interface is shown in Fig. 3.14. If you move the mouse, the cursor also moves. However, you have to move the mouse along a surface so that the wheels of the mouse also turn. The revolutions of these wheels are translated into the distance which

is mapped into (x, y) coordinates of the cursor by electromechanical means. If the cursor points at a “RUN” option, as shown on the screen, the cursor coordinates (x, y) are known to the watchdog program of CI. This program is coded such that even if the cursor points to coordinates corresponding to any of the characters R, U or N on the screen, it still calls the RUN routine, which as we have seen before, initiates other system calls, in turn. Figure 3.14 shows that if you point a mouse at LIST and then you point it to FILE-B as shown in the figure, the Operating System will convert the mouse positions into screen coordinates first, which it will then convert into an instruction LIST FILE-B and then execute it by calling the appropriate routines.
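To make the coordinate-to-command translation concrete, here is a toy C sketch of the table lookup described above. The screen regions, coordinates and command names are invented purely for illustration; a real GUI keeps far richer information about each menu item.

/* Map a cursor position (x, y) to the command whose on-screen region
   contains it.  Regions and commands are hypothetical. */
#include <stdio.h>

struct region {
    int x1, y1, x2, y2;      /* rectangle covering the menu item */
    const char *command;     /* command the region stands for    */
};

static const struct region menu[] = {
    {  0, 0,  9, 0, "RUN"  },
    { 10, 0, 19, 0, "LIST" },
    { 20, 0, 29, 0, "DEL"  },
};

/* Return the command at the clicked position, or NULL if none. */
static const char *command_at(int x, int y)
{
    for (size_t i = 0; i < sizeof menu / sizeof menu[0]; i++)
        if (x >= menu[i].x1 && x <= menu[i].x2 &&
            y >= menu[i].y1 && y <= menu[i].y2)
            return menu[i].command;
    return NULL;
}

int main(void)
{
    int x = 12, y = 0;                    /* pretend the mouse clicked here */
    const char *cmd = command_at(x, y);
    printf("click at (%d,%d) -> %s\n", x, y, cmd ? cmd : "no menu item");
    return 0;
}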

The Operating System is a complicated piece of software. It consists of a number of routines. Obviously, the size of the Operating System is very large and it is not very wise to keep the full Operating System in the memory all the time because very little space would be left for other Application Programs, due to the limited size of the memory. Therefore, the Operating System is divided into two parts. One consists of the very essential routines which are required very often and almost all the time and the other consists of routines which are required sometimes, but not always. In this sense, they are not vital. The vital portion is called the Kernel of the Operating System. This is the innermost layer of the Operating System close to the hardware, and controlling the actual hardware. It is the heart of the Operating System. If you want to find out the memory overhead the Operating System puts on the system, you should determine the size of the kernel. All the other routines are loaded from the disk to the memory, as and when needed, as shown in Fig. 3.15. This scheme

saves the usage of memory but then you lose the time it takes to load the required routines when necessary. This is the trade-off which decides the size of the kernel.

We need to answer one basic question: if the Operating System is responsible for any I/O operation, who loads the Operating System itself into the memory? You cannot have another program, say P, to load the Operating System, because in order to execute, that program will need to be in the memory first, and then we will need to ask: who brought that program P into the memory? How was the I/O possible without the Operating System already executing in the memory? It appears to be a chicken-and-egg problem. In some computers, the part of the memory allocated to the Operating System is in ROM; and therefore, once brought (or etched) there, one need not do anything more. ROM is permanent; it retains its contents even when the power is lost. A ROM-based Operating System is always there. Therefore, in such cases, the problem of loading the Operating System in the memory is resolved. However, the main memory consists of RAM in most computers. RAM is volatile. It loses its contents when the power is switched off. Therefore, each time the computer is switched on, the Operating System has to be loaded. Unfortunately, we cannot give a command of the type LOAD Operating System, because such an instruction would be a part of the CI, which is a part of the Operating System which is still on the disk at that time. Unless it is loaded, it cannot execute. Therefore, it begs the question again! The loading of the Operating System is achieved by a special program called BOOT. Generally this program is stored in one (or two) sectors on the disk with a pre-determined address. This portion is normally called the 'Boot Block', as shown in Fig. 3.19. The Read Only Memory (ROM) normally contains a minimal program. When you turn the computer on, the control is transferred to this program automatically by the hardware itself. This program in ROM loads the BOOT program into pre-determined memory locations. The idea is to keep the BOOT program as small as possible, so that the hardware can manage to load it easily and in very few instructions. This BOOT program in turn contains instructions to read the rest of the Operating System into the memory. This is depicted in Figs. 3.16 and 3.17.

The mechanism gives an impression of pulling oneself up by one's bootstraps; hence the nomenclature 'bootstrapping', or its short form, 'booting'. What will happen if we somehow tamper with the boot sector where the BOOT program is stored on the disk? Either the Operating System will not be loaded at all, or it will be loaded wrongly, producing wrong and unpredictable results, as in the case of a computer virus.

The concept of a virtual machine came about in the 1960s. The background to this is quite interesting. IBM had developed the OS/360 operating system for its System/360 mainframes, and it had become quite popular. However, the major concern regarding this operating system was that it was batch-oriented in nature. There was no concept of online computing or timesharing. Users were increasingly feeling the need for timesharing, since batch processing alone was not adequate. In order to add timesharing features to the System/360, IBM appointed a dedicated team, which started to work on a solution to this problem. This team came up with a new operating system called TSS/360, which was based on the System/360, but also had timesharing features. Although technically this solution was acceptable, as it turned out, the development of TSS/360 took a lot of time, and when it finally arrived, people thought that it was too bulky and heavy. Therefore, a better solution was warranted. Soon, IBM came up with another operating system, called CP/CMS, which was later renamed VM/370. The VM/370 operating system is quite interesting. It contains a virtual machine monitor. The term virtual machine indicates a machine (i.e. a computer) which does not physically exist, and yet makes the user believe that there is a machine. This virtual machine monitor runs on the actual hardware, and performs the multiprogramming functions. The idea is shown in Fig. 3.18. Here, we assume that three application programs A, B and C are executing with their own operating systems (again A, B and C, shown as Virtual Machine A, B and C, respectively). The virtual machine, in this case, is an exact copy of the hardware. In other words, it provides support for the kernel/user mode, input/output instructions, interrupts, etc. What significance does this have? It means that there can actually be more than one operating system running on the computer! The way this works is as follows:
1. Each application program is coded for one of the available operating systems. That is, the programmer issues system calls for a particular operating system.
2. The system call reaches its particular operating system, from all those available (depending on which operating system the programmer wants to work with).
3. At this stage, the system call of the program's operating system is mapped to the system call of VM/370 (i.e. the actual system call to be executed on the real hardware). The virtual machine monitor now makes the actual system call, addressed to the physical hardware.
This more or less completes an overview of an Operating System. We will now try to uncover the basic principle of the design of any Operating System.

System calls are an interface provided to communicate with the Operating System. An Operating System manages the entire functioning of a computer on its own, but on many occasions explicit direct (initiated by the user through a program or using commands) or indirect (not initiated by the user directly) calls are required to perform various operations. The routines or functions that are used to perform Operating System functions are called system calls. Most system calls of an Operating System are also available in the form of commands.

System call instructions are normally available in assembly language. High level languages such as C, C++ and Perl also provide facilities for systems programming, and today C/C++ are widely used for this purpose. We can call UNIX or MS-DOS system routines from C or C++, and those system calls will be executed at run-time. Suppose we have a program in C which copies the contents of one file into another file, i.e. it is a backup utility. This C program would require two file names and their paths: (1) the input file (which already exists) and (2) the output file (which will be created). When we execute this program, it will prompt for the names of the two files, and while processing it will display error messages if it encounters any problems; otherwise it will display a 'success' message. This is what is visible to us, and we can interact with the program. But inside this program, many things are happening in the background (a minimal C sketch putting them together follows this discussion). These are:

To ensure that the entered file names are as per the standards or nomenclature. This is normal processing and does not involve any system calls.

To copy the contents of the input file into the output file, we need to open the input file, which is present on the disk. Hence we use a function provided by C/C++ which accepts the file name as a parameter. The C/C++ function tries to locate the file and then open it. This is a system call. If the file exists, it will be opened. If there is any error, such as the file not being present, there not being enough memory, or the user not having access rights to open that file, then the program aborts, which involves another system call.

The same is true in the case of the output file.

After copying is done, both the files must be closed so that other processes can use them. File close is also a system call, and if there is any problem while closing the file, another system call will be made. Following are the types of system calls:
• Process control
• File management
• Device management
• Information maintenance
• Communication
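Putting the steps of the backup utility together, the following is a minimal C sketch. It is only an illustration: the buffer size and the command-line interface are arbitrary choices, and the error handling is deliberately simple. Each fopen/fread/fwrite/fclose here is ultimately serviced through system calls (open, read, write, close) made on the program's behalf by the C library.

/* A hedged C sketch of the copy (backup) utility discussed above. */
#include <stdio.h>

int main(int argc, char *argv[])
{
    FILE *in, *out;
    char buf[4096];
    size_t n;

    if (argc != 3) {                               /* validate the arguments */
        fprintf(stderr, "usage: %s input output\n", argv[0]);
        return 1;
    }
    if ((in = fopen(argv[1], "rb")) == NULL) {     /* open the input file    */
        perror(argv[1]);
        return 1;
    }
    if ((out = fopen(argv[2], "wb")) == NULL) {    /* create the output file */
        perror(argv[2]);
        fclose(in);
        return 1;
    }
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)    /* copy, chunk by chunk */
        if (fwrite(buf, 1, n, out) != n) {
            perror("write");
            fclose(in);
            fclose(out);
            return 1;
        }
    fclose(in);                                    /* close both files       */
    if (fclose(out) != 0) {
        perror("close");
        return 1;
    }
    printf("success\n");
    return 0;
}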


Information Management (IM) consists of two main modules:
• File Systems (FS)
• Device Driver (DD) or Device Management (DM)

If we want to talk about IM meaningfully, we must understand the concept of a block. A block is a logical unit of data that the Operating System defines for its convenience. Rather than defining what a block is, we will comment upon what it is not. (i) A block is not the same as a sector, though many people use both these terms interchangeably. A sector is a physical unit of data on the disk. A block may be equal to a sector or may be twice or four times as big as a sector, e.g. a sector could be of 512 bytes and a block could be of 1024 bytes. For the sake of convenience, the block size is normally an exact multiple of the sector size. In many cases, the block size is the same as the sector size (multiplication factor = 1). (ii) A block is not a unit of data which is transferred between the disk and the main memory at a time in one shot. Whenever an Application Program (AP) needs to read any data (such as a customer record), the File System translates this request into the one for reading one or more sectors from the disk, and instructs the Device Driver (DD) to read these sectors. It is the DD which is ultimately responsible for reading the data from the disk, and most of the controllers can read multiple sectors at a time. The DDs can use this property of controllers. However, a block is an

entity defined by the Operating System (software), which is quite different from this property of the controller (hardware) to read multiple sectors. (iii) A block is not the unit of data in which disk space is allocated to files. We will learn more about disk space allocation in later sections. Some Operating Systems believe in contiguous allocation, where a very large chunk is allocated to a file at a time. Some other Operating Systems allocate a sector or multiple sectors, called 'clusters' or 'elements', at a time to a file. However, this unit of disk space allocation is not the same as a block, though this allocation/deallocation is normally done in terms of blocks. (iv) A block is not a logical record like a customer record. In some application systems, there was a provision to define a Record Length (RL) and a Block Length (BL). For instance, if the customer record length was 700 characters, the designer/programmer would declare the BL as 2100 characters. In such a case, the Blocking Factor (BF) would be 2100 / 700 = 3. It would mean that the designer would like three logical records to be read at a time. Each program would reserve an I/O area of 2100 characters and internally access the records one after the other by accessing the requisite bytes in a block of three records. When all the three records were processed, a new block of 2100 characters would be read in. This was a useful scheme in sequential processing. The I/O time would reduce, but then the memory requirements for each program would increase. One point needs to be kept in mind. In this scheme, as far as the Operating System was concerned, it had to read 2100 characters at a stretch and therefore, it treated the 2100 characters as one record only. As far as the Operating System is concerned, it expects the AP to supply to it the following parameters: (a) The file id (b) The starting position in the file (c) The number of bytes to be read (d) The starting address of memory where the data is to be read. This 'record' for the Operating System is 2100 bytes in this case, but it could be a different figure for a different Application System. Therefore, it has nothing to do with the concept of a block, which the Operating System defines and uses internally for its processing and data manipulations. Given an Operating System, this figure is constant. Therefore, though the term is quite widely used in the literature, it is a pity that it is not defined properly and uniquely. Instead of trying to define the term, we will call a block a unit of data which the Operating System defines for the sake of convenience. Normally, the Operating System also keeps all its data structures in terms of blocks. For instance, it would view the entire disk as comprising a number of blocks. It would maintain a list of free blocks for the sake of allocations. We know that from a hardware point of view, a disk consists of a number of sectors. However, from the point of view of the Operating System, it consists of a number of blocks, where each block is one or more sectors. Therefore, the Operating System needs a mechanism to translate a block number into physical sector numbers. Once this is in place, the Operating System can carry out this "abstraction" and then think of the disk as a series of blocks 0, 1, ... to N, and talk only in terms of blocks thereafter. We will see how all this is done in the sections to follow.
It is quite imaginable to have a sector size of 512 bytes, a block of 1024 bytes, disk space allocation made in clusters of 2048 bytes, and an AP wanting to read a record of 1500 bytes. It is evident that a number of translations are necessary in the whole procedure. Carrying them out is exactly one of the important functions of the IM module within the Operating System.
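The arithmetic behind the first of these translations (from a byte range requested by the AP to block and sector numbers) can be sketched in a few lines of C. The sizes below are the illustrative ones from the text (512-byte sectors, 1024-byte blocks), and the sector numbers computed are relative to the file; the further mapping of file blocks to actual disk blocks through allocation data structures is deliberately left out, since it is covered in later sections.

/* A hedged C sketch of the byte-range to block/sector translation. */
#include <stdio.h>

#define SECTOR_SIZE        512u
#define BLOCK_SIZE        1024u
#define SECTORS_PER_BLOCK (BLOCK_SIZE / SECTOR_SIZE)

int main(void)
{
    unsigned offset = 3000;   /* record starts at byte 3000 of the file */
    unsigned count  = 1500;   /* the AP wants a 1500-byte record        */

    unsigned first_block = offset / BLOCK_SIZE;
    unsigned last_block  = (offset + count - 1) / BLOCK_SIZE;

    for (unsigned b = first_block; b <= last_block; b++) {
        unsigned first_sector = b * SECTORS_PER_BLOCK;   /* file-relative */
        printf("block %u -> sectors %u..%u\n",
               b, first_sector, first_sector + SECTORS_PER_BLOCK - 1);
    }
    return 0;
}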

Let us now see how the Operating System would actually perform the function of reading the record on behalf of the AP. This example should also clarify the exact functions and interfaces between the AP, the Operating System (FS and DD) and the hardware (the controller and the device). We will discuss only an overview of this. A detailed discussion of the same follows in subsequent sections. When the AP wants to read a record of 1500 bytes as in the example given, the following happens. (i) The High Level Language (HLL) programmer writes an instruction in the form of “fread (&rec, sizeof (rec), 1, FP)” in C or “READ CUST-REC ... “ in COBOL in his program written in the HLL. (ii) The compiler substitutes the system call for Read in the place of the HLL instruction. The compiler also generates the preparatory instructions of loading various CPU registers or the stack with different parameters such as number of bytes, the starting addresses on the disk and memory, etc. (iii) At the time of execution, the Operating System code for the system call “Read” picks up the parameters from the CPU registers or the stack as per the case. (iv) The File System now translates the AP’s request into a request to read the desired block(s). It uses the file allocation data structure (linked lists, indexes, etc.) to carry out this translation. The File System then requests the Device Driver (DD) to read the desired blocks. (v) The DD issues instructions to the controller for the disk to read the required blocks. The controller is a small computer which understands only specific I/O instructions from the DD. It has a small memory to store this small program loaded by the DD into it. The controller also has some memory to temporarily store the data read from the disk. (vi) The controller reads the data sector by sector and stores it in its own memory until the desired block(s) are read in. The DD is responsible for the translation from block numbers to sector numbers on the disk. (vii) The data from the controller’s buffer memory is transferred to the main memory. It can be transferred into the memory of the AP directly, but it can be, and normally is, read first into the buffer of the Operating System. This transfer between the controller’s buffer and the main memory takes place using Direct Memory Access (DMA). (viii) The File System picks up the required bytes from the blocks read into the memory buffer of the Operating System to formulate the logical customer record demanded by the AP consisting of 1500 bytes and transfers it to the memory of the AP. (ix) The AP which is blocked (i.e., could not proceed) until this happens, then gets ready to execute the next instruction after the Read instruction. When it is ready, it starts executing from that point onwards. This would give an overview of the interactions among the AP, FS, DD and the hardware. We will now consider these modules in more detail. But before doing that, we must know how a disk works. Therefore, let us study that first.
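For concreteness, here is a hedged C sketch of the Application Program's side of step (i) only. The structure layout and the file name are hypothetical, and everything from step (ii) onwards happens inside the compiler-generated system call, the Operating System and the hardware, exactly as described above.

/* The AP's view: read one 1500-byte record with fread.  Names are
   invented for illustration. */
#include <stdio.h>

struct customer {
    char data[1500];          /* stands in for the real 1500-byte record */
};

int main(void)
{
    struct customer rec;
    FILE *fp = fopen("customer.dat", "rb");

    if (fp == NULL) {
        perror("customer.dat");
        return 1;
    }
    if (fread(&rec, sizeof rec, 1, fp) == 1)    /* the call from step (i) */
        printf("read one %zu-byte record\n", sizeof rec);
    fclose(fp);
    return 0;
}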

The disk constitutes a very important I/O medium that the Operating System has to deal with very frequently. Therefore, it is necessary to learn how the disk functions. The operating principle of a floppy disk is similar to that of a hard disk. In fact, a hard disk can simplistically be considered as multiple floppy disks stacked one above the other. We will study floppy disks in the subsequent sections, but the discussion is equally applicable to hard disks as well.

Disks are like long play music records except that the recording is done in concentric circles and not spirally. A floppy disk is made up of a round piece of plastic material, coated with a magnetized recording material. The surface of a floppy disk is made of concentric circles called tracks. Data is recorded on these tracks in a bit serial fashion. A track contains magnetized particles of metal, each having a north and a south pole. The direction of this polarity decides the state of the particle. It can have only two directions. Therefore, each such particle acts as a binary switch taking values of 0 or 1, and 8 such switches can record a character in accordance with the coding methods (ASCII/EBCDIC). In this fashion, a logical record which consists of several fields (data items), each consisting of several characters is stored on the floppy disk. This is depicted in Fig. 4.1.

A disk can be considered as consisting of several surfaces, each of which consists of a number of tracks, as shown in Fig. 4.2. The tracks are normally numbered from 0 as the outermost track, with the number increasing inwards. For the sake of convenience, each track is divided into a number of sectors of equal size. Sector capacities vary, but typically, a sector can store up to 512 bytes. Double-sided floppies have two sides or surfaces on which data can be recorded. Therefore, a given sector is specified by an address made up of three components: Surface number, Track number and Sector number.

How is this sector address useful? It is useful to locate a piece of data. The problem is that a logical record (e.g. Customer Record) may be spread over one or more sectors, or many logical records could fit into one sector, depending upon the respective sizes. These situations are depicted in Fig. 4.3. The File System portion of Information Management takes care of the translation from logical to physical address. For instance, in a 3GL or a 4GL program, when an instruction is issued to read a customer record, the File System determines which blocks need to be read in to satisfy this request, and instructs the Device Driver accordingly. The Device Driver then determines the sectors corresponding to those blocks that need to be read.

The disk drive for the hard disk and its Read/Write mechanism is shown in Fig. 4.4. You will notice that there is a Read/Write head for each surface connected to the arm; this arm can move in or out, to position itself on any track while the disk rotates at a constant speed. Let us assume that the File System asks the DD to read a block which, when translated into a sector address, reads Surface = 1, Track = 10, Sector = 5. Let us call this our target address. Let us assume that the R/W heads are currently at a position which we will call the current address. Let us say it is Surface = 0, Track = 7, Sector = 4. We have to give electrical signals to move the R/W heads from the current address to the target address and then read the data. This operation consists of three stages, as shown in Fig. 4.5. These three stages are shown graphically in Fig. 4.6. The total time for this operation is given by: total time = seek time + rotational delay + transmission time. The disk drive is connected to another device called an 'interface' or 'controller'. This device is responsible for issuing the signals to the drive to actually position the R/W heads on the correct track, to choose the correct sector and to activate the R/W head of the appropriate surface to read the data.

The controller normally has a buffer memory of its own, varying from 1 word to 1 trackful. We will assume that the controller has a buffer to store 1 sector (512 bytes) of data in our example. We will now see the way the three operations described in Fig. 4.5 can be performed. In essence, we will see how the controller executes some of its instructions such as seek, transfer, etc.
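For instance, a back-of-the-envelope calculation of the total access time formula quoted above might look like this (all timing figures are assumed, purely for illustration):

#include <stdio.h>

int main(void)
{
    /* Assumed drive characteristics (illustrative figures only). */
    double ms_per_track_step = 3.0;     /* time to move the arm by one track */
    double rpm               = 7200.0;  /* rotational speed                  */
    int    sectors_per_track = 10;
    int    tracks_to_move    = 3;       /* e.g. current track 7, target 10   */

    double revolution_ms = 60000.0 / rpm;                     /* one full rotation   */
    double seek_ms       = tracks_to_move * ms_per_track_step;
    double rot_delay_ms  = revolution_ms / 2.0;               /* average: half a turn */
    double transfer_ms   = revolution_ms / sectors_per_track; /* one sector           */

    printf("total = %.2f + %.2f + %.2f = %.2f ms\n",
           seek_ms, rot_delay_ms, transfer_ms,
           seek_ms + rot_delay_ms + transfer_ms);
    return 0;
}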

The Seek instruction requires the target track number to which the R/W heads have to be moved. The instruction issued by the DD to the controller contains the target address which has this target track number. The controller stores such instructions in its memory before executing them one by one. The controller also stores the current track number, i.e. the track number at which the R/W arm is positioned at any time. The hardware itself senses the position of the R/W arm and stores it in the controller’s memory. At any time the R/W arm moves in or out, this field in the controller’s memory is changed by the hardware automatically.

Therefore, at any time, the controller can subtract the current track number from the target track number (e.g. 10 – 7 = 3) to arrive at the number of steps that the R/W arm has to move. The sign + or – of the result of this subtraction also tells the controller the direction (in or out) in which the R/W arm has to move. Figure 4.7 depicts a floppy disk after it is inserted into a floppy drive through the slot (shown in the figure on the left hand side). When a floppy disk is inserted into the floppy drive, the expandable cone seats the floppy disk on the flywheel. The drive motor rotates at a certain predetermined speed and therefore, the flywheel and the floppy disk mounted on it also start rotating along with it.

An electromagnetic R/W head is mounted on the disk arm which is connected to a stepper motor. This stepper motor can rotate in both directions. If the stepper motor rotates in the clockwise direction, the R/W arm moves 'out'. If the stepper motor rotates anticlockwise, the R/W arm moves 'in'. If you study the figure carefully, this will become clear. This stepper motor is given a signal by the disk controller. Depending upon the magnitude (or the number of pulses) and the direction of the signal, the direction of rotation and also the number of steps (1 step = distance between two adjacent tracks) are determined, and therefore, the R/W arm can be positioned properly on the desired track (Seek operation). As this is basically an electromechanical operation, it is extremely time-consuming.
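A minimal sketch of the step calculation described here (and worked out numerically in the next paragraph): the controller subtracts the current track number from the target track number; the magnitude gives the number of steps for the stepper motor and the sign gives the direction. The helper name is ours, purely for illustration.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: decide how to drive the stepper motor for a seek. */
static void plan_seek(int current_track, int target_track)
{
    int diff = target_track - current_track;
    /* Track 0 is the outermost track, so a positive difference means
       the arm must move 'in' (towards the centre), a negative one 'out'. */
    printf("move %d step(s) %s\n", abs(diff), diff >= 0 ? "in" : "out");
}

int main(void)
{
    plan_seek(7, 10);   /* the example in the text: 10 - 7 = 3 steps in */
    plan_seek(10, 4);   /* 6 steps out */
    return 0;
}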

For instance, in our last example, the controller calculates the difference between the track numbers of the target track and the current track. In this case, this is 10 – 7 = 3. It means that the R/W arm must move 'in' by 3 steps. Remember the numbering scheme for the tracks? The outermost track is Track 0, and this number increases as one goes inwards. The controller generates appropriate signals to the disk drive to rotate the stepper motor to achieve the Seek operation. This completes the seek operation given in (i) of Fig. 4.5.

Figure 4.8 depicts possible connections between a controller and a disk drive. A controller is an expensive piece of equipment and is normally shared amongst multiple drives to reduce the cost, but then at a reduced speed. Even if overlapped Seek operations are possible, we will not consider them. We will assume that a controller controls only one drive at a time. After that drive is serviced, it turns its attention to the next one which is to be serviced. At any moment, the drive being serviced is chosen by the drive select signal shown in the figure. In this case, one could imagine some kind of a decoder circuit in action. If a controller is connected to two disk drives, the connections shown in Fig. 4.8 will exist between the controller and each of the drives. In this case, there will also be a control signal to the controller which can take on two values - low or high (0 or 1). Depending upon this signal, the drive select signal between the controller and the desired drive will be made high while that between the controller and the other drive will be kept low, thereby selecting the former drive in effect.

The controller selects a drive and then issues signals to the selected drive to select a surface, to give the direction (IN/OUT) of rotation of the stepper motor, to specify the number of steps and finally to indicate whether the data is to be read or written. For all these, the signals (i.e. instructions) given by the controller are sent to the drive as shown in the figure. After giving these instructions to the drive, either the DATA IN or DATA OUT wire carries the data bit serially, depending on the operation. For instance, in a read operation, data comes out of the DATA OUT wire from the drive to the controller's buffer.

Let us now see how the correct sector number is recognized and accessed on that track. To understand this, we should understand the format in which data is written on the disks. In the earlier days, IBM used to follow the Count Data (CD) and Count Key Data (CKD) disk formats. IBM's ISAM was based on the CKD format. There was no concept of fixed sized sectors in those days. This concept was introduced much later and was called Fixed Block Architecture (FBA). The majority of computers follow this scheme today and therefore, we will assume fixed size sectors in the discussions to follow. Each sector follows a specific format as shown in Fig. 4.9. According to this format, each sector is divided into address and data portions. Each of these is further divided into four subdivisions as shown in the figure. Let us study them one by one.

(i) The address marker (with a specific ASCII/EBCDIC code which does not occur in normal data) denotes the beginning of the address field. As the disk rotates, each sector passes under the R/W head. The disk controller always 'looks' for this address marker byte. On encountering it, it knows that what follows it is the actual address. The address marker is also used for synchronization, so that the address can be read properly.

(ii) The actual address, as discussed earlier, consists of three components (surface number, track number, sector number). This address is written on each sector at the time of formatting. As this address passes under the R/W heads, it is read by them and sent to the controller bit by bit in a bit serial fashion. After receiving it completely, the controller compares it with the sector number in the target address stored in the controller's buffer, loaded by the DD. If it does not match, the controller bypasses that sector and waits for the next sector by looking for the next address marker. If the address now matches, it concludes that the required sector is found, and the operation is complete. Otherwise, the procedure is repeated for all the sectors on the track. If the sector is not found at all (wrong sector number specified), it reports an error.
A question that arises is: why maintain the address on each sector? Is it not wasteful in terms of disk space? The controller knows the beginning of a track, which is where Sector 0 starts on that track (given by the index hole in the floppy disk). If we did not write the addresses on each sector, the controller would have to wait for Sector 0 or the beginning of a track to pass under the R/W heads, then wait for an exact time given by time = (target sector number × time to traverse a sector), and then start reading mechanically. This is obviously impractical and highly error-prone. Another approach is to maintain only some kind of marker to indicate a new sector. As the disk rotates and the R/W heads pass over a new sector, the controller itself could add 1 to arrive at the next sector number and then compare it with the sector number in the target address. But this is not done, for the sake of accuracy. Even a slight error in reading the data from the correct sector can be disastrous. Matching the full address is always safer.
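The matching logic described above can be pictured roughly as follows. This is a sketch only; real controllers do this in hardware as the sectors rotate past, and the structure and field names here are invented for illustration.

#include <stdio.h>

/* Invented representation of the address field written on each sector. */
struct sector_addr {
    int surface, track, sector;
};

/* As sectors pass under the R/W head, compare each address with the
   target; give up after one full revolution (all sectors seen).       */
static int find_sector(const struct sector_addr *on_track, int count,
                       struct sector_addr target)
{
    for (int i = 0; i < count; i++) {
        if (on_track[i].surface == target.surface &&
            on_track[i].track   == target.track   &&
            on_track[i].sector  == target.sector)
            return i;           /* found: start reading the data portion */
    }
    return -1;                  /* wrong sector number specified: error  */
}

int main(void)
{
    struct sector_addr track10[10];
    for (int s = 0; s < 10; s++)
        track10[s] = (struct sector_addr){ 1, 10, s };

    struct sector_addr target = { 1, 10, 5 };
    printf("sector found at position %d\n", find_sector(track10, 10, target));
    return 0;
}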

(iii) The address CRC is kept to detect any errors while reading the address. This is an extension of the concept of a parity bit, basically to ensure the accuracy of the address field read in from the disk. As the sector address is written by the formatting program, it also calculates this CRC by using some algorithm based on the bits in the address field. When this address is read into the controller's memory for comparison, this CRC is also read in. At this time, the CRC is calculated again, based on the address read and using the same algorithm as used before. This calculated CRC is now compared with the CRC read from the sector. If they match, the controller assumes that the address is correctly read. If there is an error, the controller can be instructed to retry a certain number of attempts before reporting the error. We need not discuss at greater length the exact algorithms used for CRC. Suffice it to say that CRC calculation can be done by the hardware itself these days, and therefore, it is fairly fast.
(iv) The gap field is used to synchronize the Read operation. For instance, while the address and CRC comparisons are going on, the R/W heads would be passing over the gap. At this time, the controller can issue instructions to start reading the data portion after the R/W heads go over the gap, if the correct sector was indeed found. The gap allows these operations to be synchronized. If there were no gap at all, the floppy would have already traversed some distance by the time the controller had taken the decision to read the data.
(v) The data marker, like the address marker, indicates that what follows is data.
(vi) The data is the actual data stored by the user. This is normally 128 or 512 bytes per sector.
(vii) The data CRC is used for error detection in data transmission between the disk sector and the controller's buffer, in the same way that the address CRC is used.
(viii) The gap field, again, is for synchronization before the next address marker for the next sector is encountered.
Finally, a question arises as to who writes all this information such as the addresses, CRCs, markers, etc. on the sectors of a disk? It is the formatting program. Formatting can be done by hardware as well as software. Given a raw disk, this program scans all the sectors, writes the various markers and addresses, calculates and writes the address CRC, leaves the desired gaps and goes over to the next sector. It follows this procedure until all the sectors on the disk are covered. Therefore, even if the actual sector capacity is higher, the data portion is only 128 or 512 bytes. This is the reason why people talk about the 'formatted' and 'unformatted' capacities of a disk. It is only the formatted capacity which is of significance to the user/programmer, as only that can be used to write any data. Without formatting, you cannot do any Read/Write operation. The controller and the Operating System just will not accept it. The formatting program does one more useful thing. It checks whether a sector is good or bad. A bad sector is one where, due to some hardware fault, reading or writing does not take place properly. To check this, the formatting program writes a sequence of bits in the data area of the sector from predefined memory locations - say AREA-1. It then reads the same sector into some other memory locations - say AREA-2. It then compares these two areas in the memory, viz. AREA-1 and AREA-2. If they do not match, either the read or the write operation is not performing properly.
The formatting process prepares a list of such bad sectors or bad blocks which is passed on to the Operating System, so that the Operating System ignores them while allocating the disk space to different files, i.e. the Operating System does not treat these as free or allocable at all. Using the information already written on the sectors, a correct sector is located on a selected track as discussed above.
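The bad-sector check performed by the formatting program can be sketched as below. The write_sector()/read_sector() routines are placeholders standing in for the real controller operations (here they merely copy through a simulated sector so the sketch runs), and the test pattern is arbitrary.

#include <stdio.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Placeholder "hardware" routines: they just copy through a simulated
   sector so that the sketch is self-contained and runnable.            */
static char simulated_sector[SECTOR_SIZE];
static void write_sector(int sn, const char *area) { (void)sn; memcpy(simulated_sector, area, SECTOR_SIZE); }
static void read_sector (int sn, char *area)       { (void)sn; memcpy(area, simulated_sector, SECTOR_SIZE); }

/* Formatting-time check: write a known pattern (AREA-1), read it back
   (AREA-2) and compare; a mismatch marks the sector as bad.            */
static int sector_is_bad(int sn)
{
    char area1[SECTOR_SIZE], area2[SECTOR_SIZE];
    memset(area1, 0xA5, SECTOR_SIZE);   /* arbitrary test pattern */
    write_sector(sn, area1);
    read_sector(sn, area2);
    return memcmp(area1, area2, SECTOR_SIZE) != 0;
}

int main(void)
{
    printf("sector 42 is %s\n", sector_is_bad(42) ? "bad" : "good");
    return 0;
}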

With this explanation, the operation (ii) as given in Fig. 4.5 should become very clear. Having found the correct sector, the controller activates the appropriate R/W head to actually start the data transmission as discussed earlier. After this, the data traversing in a bit by bit fashion is collected into the bytes of the memory buffer of the controller. This completes the entire picture of how the disk drive works. Even though some examples or figures have been taken from hard disks and some others from floppy disks, the concepts for both are the same. Again, the exact formats and sizes of sectors and data storage vary from system to system, but the basic pattern is similar. Some drives do not use stepper motors, but may use a more advanced technology. But the basic technique still remains the same. There are some very expensive head per track or fixed head systems wherein there is one R/W head per track, thereby obviating the need for the Seek operation entirely. In such a case, only steps (ii) and (iii) are necessary to read a specific sector, making the operation very fast but expensive. Even if disks are faster than tapes and many other media, their operations involve electromechanical movements of disk arms, actual disk rotation, and so on. Therefore, these operations are far slower than the CPU operations, which are purely electronic. This mismatch in speeds is one of the important reasons for the development of the concept of multiprogramming, as we shall see.

We will now study the controller in a little more detail. Figure 4.10 shows the schematic of a controller. The figure shows the following details:
(i) This is the area to store the data read from the disk drive or to be written onto it. It is a temporary buffer area. If any data is to be read from the disk to the main memory, it is first read into this controller's buffer memory and then it is transferred into the main memory using DMA. Even if the data is read a sector at a time, some controllers can store a full track. Some controllers can read a full track at a time, in which case, this buffer memory in the controller has to be at least a trackful.
(ii) The data in and data out wires are responsible for the data transfer between the disk drive and the controller as shown in Fig. 4.8 and also in Fig. 4.10. As the data arrives in a bit serial fashion, the electronics in the controller collects it and shifts it to make room for the next bit. After shifting eight bits, it also accepts the parity bit, if maintained, depending on the scheme. The hardware also checks the parity bit for correctness. The received bytes are then stored in the controller's memory as discussed in (i) above.
(iii) This stores the track number on which the R/W heads are positioned currently. This is done by the controller's hardware directly, and it is updated by it if the position of the heads changes. This information is used by the controller in deciding the number and the direction of the steps by which the R/W heads need to move, given any target address.
(iv) This gives the information about the status (busy, etc.) or about errors, if any, encountered while performing read/write operations.
(v) This is any other information that a specific controller may want to store.
(vi) This is also a part of the memory of the controller where the I/O instructions, as the controller would understand them, are stored. Any controller is like a small computer which understands only a few I/O instructions like "start motor", "check status", "seek", etc. The DD sets up an I/O program consisting of these instructions for this controller.

After receiving from the File System the instructions to read specific blocks, the DD develops this I/O program and loads it in the controller's memory. In many cases, the basic skeletal program is already hardwired in the controller's instruction area, and the Device Driver only sets a few parameters or changes a few instructions. After this tiny I/O program is loaded, the controller itself can carry out the I/O operation to read/write any data from/onto the disk into/from its own memory independently. After the specific blocks are read into its memory, the bytes pertaining to the requested logical record (e.g. customer record) can be picked up and transferred to the main memory by DMA. For all these operations, the main CPU is not required. It is this independence which is the basis of multiprogramming.

The seek instruction shown in Fig. 4.10 has the target address which contains the target track number. The controller can subtract the current track number, as discussed in (iii) earlier, from it to decide the number and the direction of the steps that the R/W heads need to move, as we have studied earlier. As we know, the program in the controller's memory set up by the DD needs to be executed one instruction at a time. Therefore, the controller will need some registers like IR, PC, MAR, etc. as in normal computers. It will also require some electronics to execute these instructions. The controller also contains some electronics to control the device. Therefore, all the device control wires shown between the controller and the device in Fig. 4.8 are connected to this portion. The control signals are shown in Fig. 4.10 also. We have already studied them.

In order to transfer the data from the controller to the main memory by DMA, the controller has to have a few registers and some electronics or logic to manage this transfer. For instance, it requires registers to store the amount of data to be transferred (count) and the starting memory address where it is to be transferred. The instructions loaded by the Device Driver in the controller's memory, shown in section (vi) pertaining to Fig. 4.10, contain the instructions to set up these DMA registers too. When the DMA operation starts, data is transferred word by word. Each time a word is transferred, the count has to be reduced and the memory address has to be incremented by 1. The DMA electronics continues to transfer the data word by word until the count becomes 0. All this and the actual data transfer are managed by the DMA electronics shown in this section.

This completes the picture of how the disk, the disk drive and the controller function, including their interconnections. In order to achieve the functions (i), (ii) and (iii) mentioned in Fig. 4.5, the controller is provided with a number of instructions that it can execute directly. One could imagine that the controller has an instruction register whose wires will carry various signals to the actual disk drive to execute the instruction directly. For instance, when the 'seek' instruction is in this register, the hardware itself will find the target track number and the current track number and accordingly generate a signal to the stepper motor of the disk drive to move the R/W head the required number of steps in the appropriate direction.

We have assumed all along that the controller stores all the instructions and then executes them one by one. However, some controllers have to be supplied with one instruction at a time by the DD. In such a case, the overall control still remains with the DD in terms of supplying the next instruction and checking whether all the instructions are executed or not.

The controller's instruction set consists of a limited number of instructions required for the I/O. Some of these are listed in Fig. 4.10(a). Typically, for reading the data, the floppy disk device driver would issue the following sequence of instructions to the controller:
(i) Check the device status. If not working, report. If busy, keep waiting.
(ii) If the device is free, but the motor is off, start the motor (unlike the hard disk drive, the floppy disk drive is not always kept rotating; it is started only when necessary).

(iii) Achieve the 'seek' operation using the target track and the current track by calculating the number of steps and the direction as discussed earlier.
(iv) Read the data from the correct sector into the controller's buffer. This involves matching the sector address with the target address as discussed earlier.
(v) Check for transmission errors by using CRC/parity, etc. and take appropriate actions such as retrying or reporting an error after a specific number of attempts.
(vi) If everything is OK, set up the DMA registers - the Memory Address and the Count - for the data transfer.
(vii) Transmit the data into the main memory by DMA. (The DMA registers are already set up.) This is done chunk by chunk, updating the DMA registers each time as discussed earlier.
(viii) Check for transmission errors and take appropriate actions.
(ix) Stop the motor after a predetermined time, unless a new request arrives before this period expires.
Therefore, these instructions in 0s and 1s are loaded by the DD into the controller's memory, whereupon each instruction is fetched and then executed. The routine looks quite simple, but it is actually fairly complex. For each type of device/controller, you need a separate DD, unless the two devices/controllers have enough similarity to have only one DD for them. Also, the number of errors that the DD has to cater to is large. Some of the errors encountered are given in Fig. 4.10 (b). The controller has a status register indicating an error in any operation, as discussed earlier and shown in Fig. 4.10. Depending upon the type of error, the hardware itself sets a particular bit on in this register. The DD has to check the bits in this status register in the controller for possible errors after each operation, and take appropriate actions. This is what makes the DD a complex and tedious piece of software.
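Purely as a sketch of the sequence listed above (the instruction names and the controller interface are invented; a real Device Driver also handles retries and the many error conditions just mentioned), the tiny "I/O program" a driver loads could be represented like this:

#include <stdio.h>

/* Invented controller "instruction set" mirroring the steps above. */
enum ctrl_op { CHECK_STATUS, START_MOTOR, SEEK, READ_SECTOR,
               CHECK_CRC, SETUP_DMA, DMA_TRANSFER, STOP_MOTOR };

static const char *names[] = { "CHECK_STATUS", "START_MOTOR", "SEEK",
                               "READ_SECTOR", "CHECK_CRC", "SETUP_DMA",
                               "DMA_TRANSFER", "STOP_MOTOR" };

int main(void)
{
    /* The I/O program the DD would load into the controller's memory. */
    enum ctrl_op program[] = { CHECK_STATUS, START_MOTOR, SEEK, READ_SECTOR,
                               CHECK_CRC, SETUP_DMA, DMA_TRANSFER, STOP_MOTOR };

    /* The controller fetches and executes these one by one; checking the
       status register after each step is what makes a real DD complex.   */
    for (unsigned i = 0; i < sizeof program / sizeof program[0]; i++)
        printf("controller executes: %s\n", names[program[i]]);
    return 0;
}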

We will now study the data transfer between the controller's memory and the main memory. How is this transfer achieved? Normally, this is achieved by using a mechanism called Direct Memory Access (DMA). This transmission takes place in a bit-parallel fashion, using the same data bus as the one used for other transfers between the CPU registers and the memory. In this scheme, the controller has two registers as shown in Fig. 4.10. These are the Memory Address and Count registers. The Memory Address register is similar to the MAR in the main CPU. It gives the target address of the memory locations where the data is to be transferred. The Count tells you the number of words to be transferred. The MAR from the DMA is connected to the address bus and the appropriate memory word is selected by using the memory decoder circuits. The DD uses privileged instructions to set up the initial values of these DMA registers to carry out the appropriate data transfer. Once set up, the DMA electronics transfers a word from the controller's memory to the target memory location whose address is in the Memory Address Register of the DMA. This takes place via the data bus. It then increments the Memory Address Register within the DMA and decrements the Count Register to effect the transfer of the next word. It continues this procedure until the count is 0.
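The register manipulation described here can be summarised in a few lines of C. This is only a simulation of the idea, not real DMA hardware; the variable names are assumptions.

#include <stdio.h>

#define WORDS 4

int main(void)
{
    int controller_buf[WORDS] = { 11, 22, 33, 44 };  /* data read from the disk */
    int main_memory[16] = { 0 };

    /* DMA registers set up by the Device Driver via privileged instructions. */
    int memory_address = 8;       /* target word address in main memory */
    int count          = WORDS;   /* number of words to transfer        */
    int src            = 0;

    /* The DMA electronics: move one word, bump the address, decrement
       the count, and stop when the count reaches zero.                  */
    while (count > 0) {
        main_memory[memory_address] = controller_buf[src++];
        memory_address++;
        count--;
    }

    printf("main_memory[8..11] = %d %d %d %d\n",
           main_memory[8], main_memory[9], main_memory[10], main_memory[11]);
    return 0;
}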

When the DMA is transferring the data, the main CPU is free to do the work for some other process. This forms the basis of multiprogramming. But there is a hitch in this scheme. Exactly at the time when the DMA is using the data bus, the CPU cannot execute any instruction such as LOAD or STORE which has to use the data bus. In fact, if the DMA and the CPU both request the data bus at the same time, it is the DMA which is given the priority. In this case, the DMA ‘steals’ the cycle from the CPU, keeping it waiting. This is why it is called cycle stealing. The CPU, however, can continue to execute the instructions which involve various registers and the ALU while the DMA is going on. An alternative to the scheme of DMA is called programmed I/O, where each word from the controller’s memory buffer is transferred to the CPU register (MBR or ACC) of the main computer first and then it is transferred from there to the target memory word by a ‘store’ instruction. The software program then decrements the count and loops back until the count becomes 0 to ensure that the desired number of words are transferred. This is a cheaper solution but then it has two major disadvantages. One is that it is very slow. Another is that it ties up the CPU unnecessarily, therefore, not rendering itself as a suitable method for multiprogramming, multiuser systems. We will presume the use of the DMA in our discussion. Figure 4.10 (c) shows both the possible methods of I/O. From the data buffer in the controller, the data can be transferred through the Operating System buffers or directly into the memory of the application program.
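For contrast, programmed I/O as described above would keep the CPU in the loop for every word, roughly as in the following sketch (again a simulation; the "load" and "store" are just ordinary assignments standing in for the corresponding machine instructions):

#include <stdio.h>

#define WORDS 4

int main(void)
{
    int controller_buf[WORDS] = { 11, 22, 33, 44 };
    int main_memory[16] = { 0 };
    int count = WORDS, src = 0, dst = 8;

    /* The CPU itself loads each word into a register (acc) and stores it,
       decrementing the count and looping back until the count is zero.
       The CPU can do nothing else while this loop runs.                   */
    while (count > 0) {
        int acc = controller_buf[src++];  /* "load" into a CPU register */
        main_memory[dst++] = acc;         /* "store" to the target word */
        count--;
    }

    printf("transferred %d words\n", WORDS - count);
    return 0;
}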

We use files in our daily lives. Normally a file contains records of a similar type of information, e.g., an Employee file, a Sales file or an Electricity bills file. If we want to automate various manual functions, the computer must support a facility for a user to define and manipulate files. The Operating System does precisely that. The user/Application programmer needs to define various files to facilitate his work at the computer. As the number of files at any installation increases, another need arises: that of putting various files of the same type of usage under one directory, e.g., all files containing data about finance could be put under a "Finance" directory. All files containing data about sales could be put under a "Sales" directory. A directory can be conceived as a "file of files". The user/application programmer obviously needs various services for these files/directories such as "Open a file", "Create a file", "Delete a directory" or "Set certain access controls on a file", etc. This is done by the File System, again, using a series of system calls or services, each one catering to a particular need. Some of these system calls are generated by the compiler at the appropriate places within the compiled object code for the corresponding HLL source instructions such as "open a file", whereas others are used by the CI while executing commands such as "DELETE a file" or "Create a link" issued by the user sitting at a terminal. The File System in the IM allows the user to define files and directories and allocates/deallocates the disk space to each file. It uses various data structures to achieve this, which is the subject of this section. We have already seen how the Operating System uses the concept of a block in manipulating these data structures.

The Operating System looks at a hard disk as a series of sectors, and numbers them serially starting from 0. One of the possible ways of doing this is shown in Fig. 4.11, which depicts a hard disk. If we consider all the tracks of the same size on the different surfaces, we can think of them as a cylinder due to the obvious similarity in shape. In such a case, a sector address can be thought of as having three components: Cylinder number, Surface number, Sector number. In this case, the Cylinder number is the same as the track number used in the earlier scheme where the address consisted of Surface number, Track number and Sector number. Therefore, both these schemes are equivalent. The figure shows four platters and therefore eight surfaces, with 10 sectors per track. The numbering starts with 0 (possibly aligned with the index hole in the case of a floppy), at the outermost cylinder and topmost surface. We will assume that each sector is numbered anticlockwise so that if the disk rotates clockwise, it will encounter sectors 0, 1, 2, etc. in that sequence. When all the sectors on that surface on that cylinder are numbered, we go to the next surface below on the same platter and on the same cylinder. This surface is akin to the other side of a coin. After both the surfaces of one platter are over, we continue with the other platters for the same cylinder in the same fashion. After the full cylinder is over, we go to the inner cylinder, and continue from the top surface again. By this scheme, Sectors 0 to 9 will be on the topmost surface (i.e. surface number = 0) of the outermost cylinder (i.e. cylinder number = 0). Sectors 10 to 19 will be on the next surface (at the back) below it (i.e. surface number = 1), but on the same platter and the same cylinder (i.e. cylinder number = 0). Continuing this, with 8 surfaces (i.e. 8 tracks/cylinder), we will have Sectors 0–79 on the outermost cylinder (i.e. Cylinder 0).

When the full cylinder is over, we start with the inner cylinder, but from the top surface, and repeat the procedure. Therefore, the next cylinder (Cylinder = 1) will have Sectors 80 to 159, and so on. With this scheme, we can now view the entire disk as a series of sectors numbered 0, 1, 2, 3, ..., N as Sector Numbers (SN).

This is the way one can convert a three dimensional sector address into a unique serial number. Now, in fact, one can talk about a contiguous area on the disk for a sequential file. For instance, a file can occupy Sectors 5 to 100 contiguously, even if they happen to be on different cylinders and surfaces. A simple conversion formula can convert this abstract one dimensional sector number (SN) back into its actual three dimensional address consisting of surface, track and sector number (or cylinder, surface and sector number).

For example, if SN = 7, what will be its physical address? We know that a track in our example contains 10 sectors. Therefore, the first 10 sectors with SN = 0 to SN = 9 have to be on the outermost cylinder (cylinder = 0) and the uppermost surface (surface = 0). Therefore, SN = 7 has to be equivalent to cylinder = 0, surface = 0 and sector = 7. By the same logic, the sectors with SN = 10 to SN = 19 will be on cylinder = 0 and surface = 1. Therefore, if SN = 12, it has to be cylinder = 0, surface = 1 and sector = 2. Similarly, if SN is between 80 and 159, the cylinder or track number will be 1, and so on. By a similar logic, given a three dimensional address of a sector, the Operating System can convert it into a one dimensional abstract address, viz. the Sector Number or SN. The formatting program discussed earlier maintains a list of bad sectors which the Operating System refers to. Therefore, these bad sectors are not taken into account while allocating/deallocating disk space for various files. As we know, the Operating System deals with block numbers for all its internal manipulation. A block may consist of one or more contiguous sectors. If a block for the Operating System is the same as one physical sector, the SN discussed above will be the same as the Block Number or BN. If a block consists of 2 sectors or 1024 bytes, the view of the disk by the Operating System will be as shown in Fig. 4.12. In this case, if BN = x, this block consists of sector numbers 2x and 2x + 1. For instance, Block number 2 consists of Sectors 4 and 5 as the figure depicts. Similarly, for a sector number SN, BN = the integer value of SN/2 after ignoring the fractional part, if any. For instance, Sector 3 must be in Block 1 because the integer value of 3/2 is 1.
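These conversions follow directly from the numbering scheme assumed in the running example (10 sectors per track, 8 surfaces) and can be written down as a small sketch; the function names are ours.

#include <stdio.h>

/* Geometry assumed in the running example. */
#define SECTORS_PER_TRACK 10
#define SURFACES           8
#define SECTORS_PER_CYL  (SECTORS_PER_TRACK * SURFACES)   /* 80 */

static void sn_to_css(long sn, long *cyl, long *surf, long *sec)
{
    *cyl  = sn / SECTORS_PER_CYL;
    *surf = (sn % SECTORS_PER_CYL) / SECTORS_PER_TRACK;
    *sec  = sn % SECTORS_PER_TRACK;
}

static long css_to_sn(long cyl, long surf, long sec)
{
    return cyl * SECTORS_PER_CYL + surf * SECTORS_PER_TRACK + sec;
}

int main(void)
{
    long c, s, k;
    sn_to_css(12, &c, &s, &k);
    printf("SN 12 -> cylinder %ld, surface %ld, sector %ld\n", c, s, k);
    printf("(cylinder 0, surface 1, sector 2) -> SN %ld\n", css_to_sn(0, 1, 2));
    return 0;
}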

Therefore, given a block number (BN), you could calculate the one dimensional abstract sector number (SN) and then calculate the actual three dimensional sector addresses for both the sectors quite easily, and vice versa. For example, block number 1 would mean Sectors 2 and 3 (based on our sector numbering scheme). Given that SN = 2 and 3, it is now easy to calculate the three component addresses for these sectors, as has been discussed. In our examples hereinafter, we will assume a block = 512 bytes. This is assumed to be the same as the sector size for simplicity. Therefore, in our examples, BN will be the same as SN. The File System internally does all the allocation/deallocation in terms of blocks only. Only when a Read/Write operation is finally to be done are the block numbers (BNs) converted by the Device Management to sector numbers (SNs), and the SNs converted into the physical addresses (cylinder, surface, sector) as discussed earlier. These are then used by Device Management for actually reading those sectors by setting the 'seek' instruction in the memory of the controller accordingly. The controller, in turn, sends the appropriate signals to the disk drives to move the R/W arms, to read the data into the controller's buffer and later to transfer it to the main memory through the DMA.

Here too, though other schemes are possible, for logical clarity, we will assume that the data is read into the Operating System buffer first and then transferred to the AP's memory (e.g., the FD area) by DMA. Some Operating Systems follow the technique of interleaving. This is illustrated in Fig. 4.13. Starting from Sector 0, you skip two sectors and then number the next sector as 1, then again skip two sectors and call the next sector 2, and so on. We call this interleaving with factor = 3. Generally, this factor is programmable, i.e. adjustable. This helps in reducing the rotational delay. The idea here is simple. While processing a file sequentially, after reading a block, the program requesting it will take some time to process it before wanting to read the next one. In the non-interleaving scheme, the next block will have gone past the R/W heads due to the rotation by that time, thereby forcing the controller to wait until the next revolution for the next block. In the interleaving scheme, there is a greater probability of saving this revolution if the timings are appropriate.
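A small sketch of how logical sector numbers could be laid out with an interleave factor of 3 on a 10-sector track follows. The exact placement policy differs between systems; this is only meant to show the skipping pattern described above.

#include <stdio.h>

#define SECTORS 10
#define FACTOR   3   /* interleave factor: skip two physical slots each time */

int main(void)
{
    int layout[SECTORS];   /* layout[physical slot] = logical sector number */

    /* This works cleanly when FACTOR and SECTORS have no common divisor. */
    for (int logical = 0; logical < SECTORS; logical++)
        layout[(logical * FACTOR) % SECTORS] = logical;

    printf("physical slot : logical sector\n");
    for (int slot = 0; slot < SECTORS; slot++)
        printf("      %2d      :      %2d\n", slot, layout[slot]);
    return 0;
}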

The Operating System is responsible for the translation from the logical to the physical level for an Application Program. In earlier days, this was not so. In those days, an application programmer had to specify to the Operating System the actual disk address (cylinder, surface, sector) to access a file. If he wanted to access a specific customer record, he had to write routines to keep track of where that record resided and then he had to specify this address. Each application programmer had a lot of work to do, and the scheme had a lot of problems in terms of security, privacy and complexity. Ultimately, somebody had to translate from the logical to the physical level. The point was, who should do it? Should the Application Program do it or should the Operating System do it? The existing Operating Systems differ a great deal in how they answer this question. Some (like UNIX) treat a file as a sequence of bytes. This is one extreme of the spectrum, where the Operating System provides the minimum support. In this case, the Operating System does not recognize the concept of a record. Therefore, the record length is not maintained as a file attribute in the file system of such an Operating System like UNIX or Windows 2000. Such an Operating System does not understand an instruction such as "fread" in C or "Read...record" in COBOL. It only understands an instruction "Read byte numbers X to Y". Therefore, something like the application program or the DBMS it uses has to do the necessary conversion. At a little higher level of support, some others treat files as consisting of records of fixed or variable length (like AOS/VS). In this case, the Operating System keeps the information about record lengths, etc. along with the other information about the file. At a still higher level of support, the Operating System allows not only the structuring of files (such as records, fields, etc.) but also allows different file organizations at the Operating System level. For instance, you could define the file organization as sequential, random or indexed, and then the Operating System itself would maintain the records and their relationships (such as hashing or keys, etc.).

In this case, the application programmer could specify "get the customer record where Customer # = 1024". This is really a part of the Data Management System (DMS) software function that the Operating System takes upon itself. This is true of the current version of VAX/VMS, which has subsumed RMS under it. It is also true of many Operating Systems running on the IBM mainframes. Of course, the services provided are not sufficient to represent the complex relationships existing in actual business situations. This is why you need a separate Database Management System (DBMS) on top of the Operating System. At the other extreme, the Operating System can actually provide full fledged Database functions allowing you to represent and manipulate complex data relationships existing in business situations, such as "Print all the purchase orders for a supplier where the category of an ordered item = 'A'". As an example, OS/400 running on the AS/400 has embedded relational database functions as a part of the Operating System itself. The PICK Operating System also belongs to the same class. The whole spectrum from UNIX (stream of bytes) to OS/400 (integrated Database) varies in terms of how much the Operating System provides and how much the user process has to do (ultimately, somebody has to do the required translation). This is shown in Fig. 4.14.

We will assume that the Operating System recognizes the file structure as consisting of various records for our subsequent discussions.

We have said before that the File System is responsible for translating the address of a logical record into its physical address. Let us see how this is achieved by taking a simple example of a sequential file of, say, customer records. Let us assume that a customer record (also referred to in our discussion as logical record) consists of 700 bytes. The Application Program responsible for creating these records written in HLL has instructions such as “WRITE CUST-REC” to achieve this. As we know, at the time of execution, this results in a system call to the Operating System to write a record. Therefore, the Operating System is presented with customer records one after the other. These records may or may not be in any specific sequence (such as customer number). The Operating System assigns a Relative Record Number (RRN) to each record starting with 0, as these records are written. This is because we have assumed that the Operating System recognizes a file made up of logical records. In case of UNIX, the application program will have to perform this.

For instance, if 10 customer records are written, 700×10 = 7000 bytes will be written onto the customer file for RRN = 0 to 9. You can, in fact, imagine all the customer records put one after the other like carpets as shown in Fig. 4.15. It is important to know that this is a logical view of the file as seen by the application programmer. This is the view the Operating System would like the Application Programmer to have. It does not, however, mean that in actual practice, the Operating System will put these records one after the other in a physical sense of contiguity, as given by the sector/block numbering scheme. The Operating System may scatter the records in a variety of ways, hiding these details from the application programmer, each time providing him the address translation facilities and making him feel that the file is written, and therefore read, contiguously.

The Operating System can calculate a Relative Byte Number (RBN) for each record. This is the starting Byte number for a given record. RBN is calculated with respect to 0 as the starting byte number of the file and again assuming that the logical records are put one after the other. For instance, Fig. 4.16 shows the relationship between RRN and RBN for the records shown in Fig. 4.15. It is clear from Fig. 4.16 that RBN = RRN×RL, where RL = record length. This means that if a record with RRN = 10 is to be written (which actually will be the 11th record), the Operating System concludes that it has to write 700 bytes, starting from Relative Byte Number (RBN) = 7000. Therefore, if an Operating System recognizes a definition of a record, it can be supplied with only the RRN and the record length. It then can calculate the RBN as seen earlier. For an Operating System like UNIX which considers a file as only a stream of bytes, it has to be supplied with the RBN itself along with the number of bytes (typically equal to RL) to be read. The next step is to actually write these logical records onto various blocks. Let us assume that we have a disk as shown in Fig. 4.11 with 8 surfaces (0 to 7), each surface having 80 tracks (0 to 79) and each track has 10 sectors ( 0 to 9). Therefore, the disk has 8×80×10 = 6400 sectors of 512 bytes each. Let us also assume that one block = 1 sector = 512 bytes and all these blocks are numbered as discussed earlier. Therefore, the Operating System will look at the disk as consisting of 6400 ‘logical blocks’ (0 to 6399) each of 512 bytes, as shown in Fig. 4.17.

Let us assume that we consider our file as a stream of bytes written into various blocks which we will consider as contiguous for convenience (that is why they were called 'logical'). In such a case, block number 0 starts at Relative Byte Number (RBN) = 0, block number 1 starts at RBN = 512, and block number 2 starts at RBN = 1024. Therefore, it is clear that a block with a block number (BN) = N starts at Relative Byte Number (RBN) = BL×N, where BL = Block Length. In our example, RBN = 512 N, because BL = 512. Therefore, we now have two logical views. One is the logical view of records given by Fig. 4.15 and the other is the logical view of blocks given by Fig. 4.17. The point is to map the two and carry out the translation between them. As a disk is shared amongst many files of different users, the Operating System has to act as an arbitrator and therefore, it is responsible for allocating/deallocating blocks to various files. There are basically three ways of allocating these blocks to a file: Contiguous, Indexed and Chained. We will discuss these later in greater detail. In this example, we will assume contiguous allocation. In this scheme, the user has to specify, at the time of creating a file, the maximum number of contiguous blocks required for the file. For instance, if a user knows that he currently has 500 customers, but will never have more than 730 customers, he must ask for 730×700 bytes or 730×700/512 blocks = 998.05, i.e. approximately 1000 blocks. Let us assume that the Operating System has already allocated block numbers 0 to 99 for some other file and that blocks 100 and thereafter are free. (The Operating System has to maintain a list of free blocks to know this!) Therefore, let us assume that the Operating System allocates block numbers 100 to 1099 to our customer file. Let us assume that the AP is reading 700 byte customer records from a tape and is writing them onto the disk one by one. Every time the AP wants to write a record on the disk, the Relative Record Number (RRN) increases and the Operating System has to keep track of it. The Operating System maintains a field called the Cursor to keep track of the current RRN, which is incremented after each record is read/written. This is true of an Operating System which recognizes the concept of a record. In UNIX, the compiler of the AP has to generate the machine instructions to do this. Let us assume that at a specific juncture, the Operating System has to write a 700 byte record with RRN = 4. We will assume that 4 records with RRN = 0 to 3 have already been written. Given RRN = 4, the Operating System can compute the physical address as follows:
(i) RRN = 4 means RBN = 2800 (refer to Fig. 4.16). This means that the Operating System has to start writing at logical byte number 2800 in the file. We now have to decide which logical block this RBN falls in as per Fig. 4.17. Instead of referring to that figure, we would like a general formula that the Operating System can use.
(ii) Divide RBN by the block length, i.e. 2800/512. We get quotient = 5 and remainder = 240. Therefore, 5 becomes the Logical Block Number (LBN). This means that the 2800th byte is the same as the 240th byte in logical block number 5. Logical block numbers 0 to 4 occupy RBN = 0 to 2559, and logical block number 5 starts at RBN = 2560 (refer to Fig. 4.17). RBN = 2800 falls in this block. RBN = 2560 would be the 0th byte in LBN = 5, and RBN = 2561 would be byte number 1 in LBN = 5. Therefore, extrapolating this logic, RBN = 2800 would be byte number (2800–2560) or byte number 240 in LBN = 5. This fits our formula.
(iii) The Operating System now has to translate the logical block number (LBN) into the Physical Block Number (PBN).
Logical block number 0 corresponds to physical block number = 100, because physical block numbers 100 to 1099 are allocated to this file. Therefore, logical block number 5 of this file is the same as physical block number 100 + 5 = 105.

(iv) Therefore, the Operating System knows that it has to start writing from byte number 240 of the 105th physical block of the disk.

However, there are only (511–239) = 272 bytes left in block number 105 starting from byte 240. Therefore, the Operating System has to continue writing the logical record into the next block too! It will have to use the first (700–272) = 428 bytes (byte numbers 0–427) of block 106. This is shown in Fig. 4.18. Therefore, to write the fifth customer record (RRN = 4), the Operating System will have to write 272 bytes at the end of physical block number 105 and 428 bytes at the beginning of physical block number 106, because 272 + 428 = 700, which is the record length.
(v) The DD portion of the Operating System now translates the physical block numbers 105 and 106 into their physical addresses as discussed earlier, based on the sector/block numbering scheme. For instance, given the disk characteristics of 8 surfaces of 80 tracks each, where there are 10 sectors per track, we get the following:
Block 105: Surface = 2, Track (or Cylinder) = 1, Sector = 5
Block 106: Surface = 2, Track (or Cylinder) = 1, Sector = 6
The reader should verify this, keeping in mind our numbering scheme. The track number is synonymous with the cylinder number. Cylinder 0 has blocks 0–79, Cylinder 1 has blocks 80–159 and so on. Again, within cylinder 1, surface 0 has sectors 80–89, surface 1 has sectors 90–99, surface 2 has sectors 100–109 and so on. Therefore, the procedure of translating the logical record address to the physical sector address can be summed up as below (a code sketch of this translation follows the list):
(a) Given the RRN (maintained in the cursor) and the record length (RL), calculate RBN = RRN×RL.
(b) Divide the RBN by the block length to obtain the Logical Block Number (the quotient) and the offset within that block (the remainder).
(c) Translate the LBN into the Physical Block Number using the disk space allocation information for that file (in contiguous allocation, PBN = starting block number + LBN).
(d) If the record does not fit within that block, include the subsequent block(s) in the request as well.

(e) While creating any file, the Operating System starts from cursor = 0 and therefore, RBN = 0. After each record is written, the cursor is incremented by 1 by the Operating System so that the RBN is incremented by the record length. After the Operating System knows the RBN and the number of bytes to be written, further translation can be performed easily.

(f) After getting the physical block numbers and their offsets, the File System can request the DD to read the required blocks. The DD then translates the physical block numbers into the three dimensional addresses and reads the desired sectors after instructing the controller. For this, the DD normally ‘constructs’ a program for this I/O operation, and loads it into the memory of the controller. As we know, one of the instructions in the program is the “seek” instruction in which the three dimensional address is specified as a target address as discussed earlier. After this, the desired sectors are located and the I/O operation is then performed as studied in detail earlier. The File System then picks up the required bytes to form the logical record. (g) While writing fixed length records in a file, the Operating System can keep track of the number of records and/or the number of bytes written in that file. This information in the form of file size is normally kept in the File Directory where there is one entry for each file for file size.
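The summary above can be condensed into a short sketch. Contiguous allocation is assumed, with the geometry and starting block of the running example; the function names are ours, not an actual Operating System interface.

#include <stdio.h>

/* Assumptions taken from the running example. */
#define RL           700L    /* record length                  */
#define BL           512L    /* block length (= 1 sector here) */
#define START_PBN    100L    /* first physical block of file   */
#define SECTORS_PER_TRACK 10L
#define SURFACES           8L

static void translate_write(long rrn)
{
    long rbn    = rrn * RL;                 /* (a) relative byte number      */
    long lbn    = rbn / BL;                 /* (b) logical block number      */
    long offset = rbn % BL;                 /*     and offset within it      */
    long pbn    = START_PBN + lbn;          /* (c) physical block number     */
    long last   = START_PBN + (rbn + RL - 1) / BL;  /* (d) spill-over blocks */

    long cyl  = pbn / (SECTORS_PER_TRACK * SURFACES);
    long surf = (pbn % (SECTORS_PER_TRACK * SURFACES)) / SECTORS_PER_TRACK;
    long sec  = pbn % SECTORS_PER_TRACK;

    printf("RRN %ld: RBN %ld -> LBN %ld offset %ld -> PBN %ld..%ld "
           "(first sector: cylinder %ld, surface %ld, sector %ld)\n",
           rrn, rbn, lbn, offset, pbn, last, cyl, surf, sec);
}

int main(void)
{
    translate_write(4);   /* the example in the text: RBN 2800, PBN 105-106 */
    return 0;
}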

Let us now see how records are read by the Operating System at the request of the AP. An unstructured AP for processing all the records from a file sequentially would be as shown in Fig. 4.19 (a) for C and Fig. 4.19 (b) for COBOL. During the processing, every time it executes the "fread" or "READ" instruction respectively, the AP gives a call to the Operating System, which then reads a logical record on behalf of the AP as we know. The Operating System blocks the AP during this time, after which it is woken up.

If there are 50 customer records, the AP requests the Operating System to read records from RRN=0 to 49, one by one, for fread (in C) and READ CUST-REC (in COBOL) instruction by using it in a loop. The problem is: How does the Operating System resolve the problem of address translation? Let us now study this.

(i) The Operating System maintains a cursor which gives the "next" RRN to be read at any time. This cursor is initially 0, as the very first record to be read is the one with RRN = 0. After each record is read, the Operating System increments it by 1 for the next record to be read. This is done by the Operating System itself and not by the AP. We know that the compiler generates a system call in the place of the HLL instruction such as Read.
(ii) When the AP requests the Operating System to read a record by a system call at the "fread" (in C) or "READ CUST-REC..." (in COBOL) instruction, the Operating System calculates the RBN (Relative Byte Number) as RBN = RL×RRN, where this RRN is given by the cursor. Therefore, initially RBN will be 0, because RRN = 0. For the next record, RRN will be 1 and RBN will be 1×700 = 700, and for RRN = 2, RBN will be 2×700 = 1400. The RBN tells the Operating System the logical starting byte number from which to read 700 bytes. Needless to say, for an Operating System like UNIX, the RBN will have to be supplied to it directly instead of the RRN, as it has no concept of records. Let us now see how a record at this juncture with RRN = 2 and RBN = 1400 is read by the Operating System on behalf of the AP.
(iii) The File System calculates the logical block number as the integer value of RBN/512. For instance, for RRN = 2 and RBN = 1400, 1400/512 = 2 + (376/512). Therefore, logical block number (LBN) = 2, offset = 376. This means that the File System has to start reading from byte number 376 of LBN = 2. But if only this is done, the Operating System will get only (511–375) = 136 bytes out of this block. This is far less than 700.
(iv) The File System will have to read the next block with LBN = 3 fully to get an additional 512 bytes to achieve 136 + 512 = 648 bytes in all. This is still less than 700. The Operating System will have to read the next block with LBN = 4 and extract the first 52 bytes to finally make it 648 + 52 = 700 bytes. Therefore, for this instruction to read one logical record, the Operating System has to translate it into reading a sequence of logical blocks first, as shown in Fig. 4.20.
(v) At this stage, the File System does the conversion from LBN to PBN by adding 100 to the LBN, because the starting block number is 100 and all allocated blocks are contiguous, as per our assumption.
(vi) Therefore, the File System decides to read 136 bytes (376–511) in PBN 102 + all (512) bytes in PBN 103 and 52 bytes (0–51) from PBN 104. This is shown in Fig. 4.21. The File System issues an instruction to the DD to read Blocks 102 to 104.

(vii) As before, the DD translates the PBNs into three dimensional physical sector addresses as given below:
Block 102 = Surface 2, Track 1, Sector 2
Block 103 = Surface 2, Track 1, Sector 3
Block 104 = Surface 2, Track 1, Sector 4

This can be verified easily. (viii) The DD now directs the controller to read these sectors one by one in the controller’s memory first and then to transfer the required bytes into the buffer memory of the Operating System by setting the appropriate DMA registers as studied earlier. (ix) After all the data is read, the File System picks up the relevant bytes as shown in Fig. 4.21 to form the required logical record and finally transfers it to the I/O area of the AP. (x) This procedure is repeated for all the records in the file, each time incrementing the cursor and therefore, the RRN by 1. Therefore, the logical records are read one by one without the AP being aware of the actual translation process. (xi) In COBOL, the instruction says “READ CUST-REC... AT END...GO TO EOJ”. Similarly, in C, it checks to see if the ‘fread call’ returned any records, i.e. whether the end of the file is reached. In order to execute this correctly, the Operating System has to indicate to the AP when all the records are processed, so that the AP can jump to EOJ. For fixed length records, the Operating System can do that easily by comparing the cursor (after incrementing) with the total number of records in that file which can be kept in the file directory. File size can also be used to indicate the end of file. Alternatively, an End Of File (EOF) marker - yet another ASCII/EBCDIC character written by the Operating System at the end of the file, can be used to indicate the end of file. For an Operating System such as UNIX, where there is no concept of data records in the Operating System, the address translation is done differently. There is no concept of a logical record for UNIX. Hence, there is no system call to read such a record. You have to specify the starting RBN and the number of bytes to be read and target memory address to UNIX. Therefore, the compiler of a C/COBOL program under UNIX has to generate a routine to define the cursor, increment it after each Read/Write and arrive at RBN = RRN (given by CURSOR)×Record Length. Alternatively, the cursor can directly give the RBN itself. In this case, after a record is read, the cursor is incremented by the number of bytes read (i.e. the RL) which is supplied through the system call itself. All this is done by the compiled AP itself at the time of execution. At this juncture, the compiler generated system call to UNIX to read the number of bytes equal to the record length, starting from the computed RBN is executed. Therefore, somebody has to finally do the work of this translation if an instruction in an AP, written in a language such as C/COBOL, is to be supported. Whether the compiler does it or the Operating System is the main question.
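The read side of this translation can be sketched in the same way: given the cursor (RRN) and the record length, the blocks to be fetched and the byte ranges to extract follow directly (same assumptions as in the earlier sketch; the function name is ours).

#include <stdio.h>

#define RL        700L   /* record length                     */
#define BL        512L   /* block length                      */
#define START_PBN 100L   /* contiguous allocation starts here */

static void translate_read(long rrn)
{
    long rbn       = rrn * RL;                  /* first byte of the record  */
    long first_lbn = rbn / BL;
    long last_lbn  = (rbn + RL - 1) / BL;
    long offset    = rbn % BL;                  /* offset in the first block */

    printf("RRN %ld: read PBN %ld..%ld, start at byte %ld of the first block, "
           "take %ld byte(s) from the last block\n",
           rrn, START_PBN + first_lbn, START_PBN + last_lbn,
           offset, (rbn + RL - 1) % BL + 1);
}

int main(void)
{
    translate_read(2);   /* the example in the text: PBN 102..104 */
    return 0;
}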

As we have seen, the Operating System can present to the AP, records in the same sequence that they have been written, i.e. the way they had been presented by the AP to the Operating System for writing in the first place. If another AP wanted to process the records in a different sequence - say by Customer number, it is the duty of the AP to ensure that they are presented to the Operating System in that fashion, so that they are

also retrieved in the same fashion. If that sequence was different, what should be done? One way is to sort the original file in the customer number sequence if all the records are to be processed in that sequence only. Alternatively, if only some records are to be selectively processed in that sequence as in a query (given customer number, what is his name, address or balance?), it is advisable to maintain some data structure like an index to indicate where a specific record is written. One advantage of this scheme is that you can access the file in the original sequence as well as the customer number sequence. A sample index on customer number as the key is shown in Fig. 4.22. Notice that RBN is used to indicate the address of that specific record. We will revisit this later. Normally, there is another piece of software to maintain and access these indexes. It is known as the Data Management Systems (DMS). It is obvious that the index has to be in the ascending sequence of the key, if the search time is to be reduced. If a new record is to be added to the customer file, it will be written by the Operating System in the next available space on the disk for that file, and therefore, will not necessarily be in the sequence of customer number. However, that does not matter anymore, because the index is maintained in the customer number sequence. As soon as a record is added, the Operating System knows at what RBN the record was written. For instance, we have studied in Sec. 4.2.4 that the fifth record, i.e. with RRN = 4 was written at RBN = 2800 if the RL was 700. In fact, at any time the File directory maintains a field called file size. A new record has to be written at RBN = File size. After the record is written, the file size field also is incremented by the RL. The Operating System can pass this to DMS and the DMS can then add an entry consisting of the key and the RBN of the newly added record to the index at an appropriate place to maintain the ascending key sequence. Let us assume that till now, four records are written with RBNs as shown in Fig. 4.22. As is obvious from the RBNs, they have been written in the sequence of C001, C009, C004 and C003, because that is the sequence in which records were created and presented to the Operating System. But you will notice that the index is in the ascending sequence of the key. Let us now assume that a new record with customer number (key) = C007 is added. Let us trace the exact steps that will take place to execute this (assuming the same scenario of RL = 700 and that the physical blocks 100 - 1099 are allocated to this file contiguously). (i) The AP requests the DMS to write the next record from its I/O area in the memory, specifying the position of the key in the record. (ii) The DMS extracts the key (in this case C007) and stores it for future use. (iii) DMS now requests the Operating System to write the record onto the customer file and return the RBN to DMS. (iv) The Operating System knows that till now, 4 records of 700 bytes each, i.e. totally 2800 bytes have been written from RBN = 0 to 2799. Therefore, RBN for this new record is 2800. This can be derived from the file size. (v) The File System of the Operating System now does the address translation into logical blocks to be written. The DD translates them in turn into the physical blocks required to be written, as shown in Fig. 
4.18 to discover that it has to write the last 272 bytes of block 105 (Surface = 2, Track = 1, Sector = 5) and first 428 bytes of block 106 (Surface = 2, Track = 1, Sector = 6).

(vi) The DD requests the controller to write these sectors one by one after loading the physical target addresses in the controller’s memory. It also transfers the data from the main memory to the controller’s memory using DMA. One important thing that the Operating System has to take care of is that data already written is not lost. For instance, if only the last 272 bytes of block 105 are to be written, that particular block is read first, its last 272 bytes are then updated with the desired data and then the block is written back. If this is not done, the first 240 bytes of that block will be lost.

(vii) The controller generates the appropriate signals to the device to seek the correct track and to write the data on the correct sector.

(viii) The DD supervises the whole operation to ensure that all the sectors corresponding to that logical record are written on the disk.

(ix) Having successfully written the logical record, the Operating System passes the RBN (which is 2800) to the DMS, as requested by the DMS.

(x) The DMS uses this RBN along with the already stored key (C007) to modify the index. The modified index is shown in Fig. 4.23.

In the earlier systems, the data management functions, such as the maintenance of an index, were part of the Operating System, as in the case of IBM’s old ISAM, but as these functions started getting more complex, separate Data Management Systems (DMS) were written. A DMS can either be a File Management System (FMS) or a Database Management System (DBMS). RMS on VAX, VSAM on IBM and INFOS on DG are examples of FMS, whereas ORACLE, DB2 and INFORMIX are examples of DBMS. A DMS is a complex piece of software and its detailed discussion is beyond the scope of the current volume. What is important here is to understand the exact role of the DMS, and where it fits in the layered approach. The DMS sits between the AP and the Operating System, as shown in Fig. 4.24. The DMS is responsible for maintaining all these index tables based on keys. The index shown in Fig. 4.23 is a very simple one. There can be more complicated indexes with multiple levels. Again, an index is only one of the possible data structures used by a DMS. The relational DBMS mainly uses indexes. The hierarchical DBMS such as IMS/VS on IBM, or network DBMS like IMAGE on HP, DG/DBMS on DG, IDMS on IBM, and DBMS-10 and DBMS-11 on the DEC systems, normally use chains or some other techniques in addition to indexes. However, in the chains too, the RBN can serve a useful purpose as an address in the chain. For instance, if the address of the next child record for the same parent is to be maintained, the RBN again can form this address. In any data structure such as a chain or an index, the DMS can use different methods to store the address. Some of these methods are listed below:

• Physical Sector Addresses
• RRN
• RBN

Storing actual physical sector addresses is a very fast method. While reading a record, given a customer number, if the physical address is maintained in the index, the DMS itself can access the final addresses without having to go through various levels of address translation. But then, it has a number of disadvantages. For one logical record, you may have to store multiple addresses, as one logical record can span multiple sectors. A more important disadvantage is hardware dependence and the resultant lack of portability. If a sector goes bad, or if the database is moved to a larger disk with more sectors/track and more tracks/surface, the old database/indexes will not be usable. IBM’s old ISAM had this problem, forcing the development of VSAM, which is hardware independent.

Using the RRN as an address in an index or a chain is a better method, but then it requires an Operating System which recognizes the concept of a record. This also limits portability (UNIX will not know what to do with the RRN, for instance). Using the RBN as an address in an index or a chain is by far the most popular method, as it preserves portability. We will continue to assume that the RBN is used in our example.

Let us study the exact sequence of events that take place when an AP requests a specific record (say for customer C004) to answer a specific query like “What is the balance of customer C004?” We will assume that the DMS is maintaining an index of customer numbers versus the RBNs as shown in Fig. 4.22. What happens is as follows:

(i) The AP for customer inquiry prompts for the customer number in which the user is interested.

(ii) The user keys in “C004”. This is stored in the memory of the AP. (iii) The AP supplies the key (in this case, C004) to the DMS and requests the DMS to read the desired record. (iv) DMS carries out the index search to arrive at the RBN of the desired record to be read (in this case, 1400). If the DMS was using chains, it will have to have algorithms to traverse through them to search for the record with customer number = C004. (v) DMS supplies the extracted RBN to the Operating System and requests the Operating System to read the record by a system call (in this case, the record of 700 bytes with RBN = 1400). (vi) The File System of the Operating System translates this RBN into the logical address(es) by the same techniques as described in the last two sections. The DD now translates them into the physical address(es) and then reads the required blocks in the Operating System buffer first via the controller’s memory using DMA. (vii) The File System then formulates and transfers the logical record for that customer to the DMS buffer reserved for this AP. DMS is a generalized piece of software catering to many programs at a time. Therefore, normally it allocates as many memory buffers as the number of processes using it. (viii) The DMS transfers the required record from its buffer into the I/O area of the AP. (ix) The AP now uses the details such as the balance, etc. in the record read in its area to display them on the screen as desired. While studying the address translation scheme, we have assumed contiguous allocation of blocks. In some systems, blocks allocated to a file are not contiguous but they are scattered on the disk. The Operating System then maintains an index or chains to access all the blocks for a file one after the other. (These indexes

or chains are different from the ones maintained by DMS for faster on-line access to specific records such as customer number index. We will study this later.) Even if the disk space allocation to files is non-contiguous, the address translation scheme substantially remains unchanged. The only difference arises in the way the PBN(s) are derived from the LBN(s), as we will study.
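The index search of step (iv) can be sketched as follows. The index of Fig. 4.22 is modelled here as an in-memory array of (key, RBN) pairs kept in ascending key order and searched by binary search; the structure and names are illustrative, not those of any particular DMS.

/* Sketch of a DMS index lookup: given a key, return the RBN of the
 * record so that the Operating System can be asked to read it. */
#include <stdio.h>
#include <string.h>

struct index_entry { char key[6]; long rbn; };

/* Contents follow the example of Fig. 4.22 as described in the text. */
static struct index_entry cust_index[] = {
    { "C001", 0 }, { "C003", 2100 }, { "C004", 1400 }, { "C009", 700 },
};

/* Binary search over the ascending keys; returns -1 if the key is absent. */
long lookup_rbn(const char *key)
{
    int lo = 0, hi = (int)(sizeof cust_index / sizeof cust_index[0]) - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = strcmp(key, cust_index[mid].key);
        if (cmp == 0) return cust_index[mid].rbn;
        if (cmp < 0)  hi = mid - 1; else lo = mid + 1;
    }
    return -1;
}

int main(void)
{
    long rbn = lookup_rbn("C004");              /* -> 1400               */
    printf("read %d bytes from RBN %ld\n", 700, rbn);
    /* The DMS would now ask the Operating System, via a system call,
       to read 700 bytes starting at this RBN.                          */
    return 0;
}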

For each file created by the Operating System, the Operating System maintains a directory entry, also called the Volume Table of Contents (VTOC) in IBM jargon, as shown in Fig. 4.25. The figure shows the possible information that the Operating System keeps at a logical level for each file. Physically, it could be kept as one single record or as multiple records, as in the case of AOS/VS, where the access control information constitutes one small record and all the dates, etc. constitute the other. In hierarchical file systems, this logical record is normally broken further into two logical records to allow sharing of files, as we will learn later. We will talk about the significance and the contents of these two logical records later. Again, these logical records for the VTOC or file directory entries are stored on the disk using some hashing algorithm on the file name. Alternatively, an index on the file names can be maintained to allow faster access to this information once the file name is supplied. However, in the case of UNIX, no hashing technique is used. Entries are created sequentially or in the first available empty slot. So, every time, the Operating System goes through the entire directory starting from the first entry to access a directory entry, given a file name. Therefore, the algorithm for this is very simple, though quite time consuming during execution.

The only thing of significance at this stage for address translation is the file address field for each file. This signifies the address (i.e. block number) of the first block allocated to the file. If the allocation is contiguous, finding out the addresses of the subsequent blocks is very easy. If the allocation is chained or indexed, the Operating System has to traverse through that data structure to access the subsequent blocks of the same file. When you request the Operating System to create a file with a name, say CUST.MAST and request the Operating System to allocate 1000 blocks, the Operating System creates this file directory entry for this file. As in the last example, if blocks 100 to 1099 are allocated to it, the Operating System also creates the file address within the file directory entry for that file as 100. This is subsequently used by the Operating System for the I/O operations.

An AP written in C/COBOL or any other HLL has to “Open” a file for reading or writing. As we know, “Open” is an Operating System service, and therefore, a compiler substitutes an Operating System call in the place of the “Open” statement in the HLL program. The system call for “Open” at the time of execution searches for a file directory entry for that file using the file name and copies that entry from the disk in the memory. Out of a large number of files on the disk, only a few may be opened and being referred to at a given time. The list of the directory entries for such files is called Active File List (AFL) and this list in the memory is also arranged to allow faster access (index on AFL using file name, etc.). After copying in the memory, the Operating System ensures that the user is allowed to perform the desired operations on the file using the access control information. As we have seen, for calculating the physical address for every “Read” and “Write” statement, the starting block number (which was 100 in the previous example) in the file directory entry in the memory has to be added to the logical block number. If the AP adds new records to a file, the file size is altered correspondingly by the Operating System in the file directory entry in the memory (AFL). Similarly, any time the file is referred to or modified as well as created initially, the dates and times of this creation/modification in the file directory entry are changed accordingly. When the file is closed by yet another system call, the updated directory entry is written back to the disk and removed from the AFL in the memory, unless it is used by another user. For the sake of security, the writing back on the disk can be done more frequently also.
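The directory entry and the AFL can be visualized with a pair of hypothetical C structures. The field names below are only indicative of the information discussed above; no real Operating System uses exactly these declarations.

/* Hypothetical, simplified layouts of a file directory entry and of an
 * Active File List (AFL) slot, following the fields discussed above. */
#include <stdio.h>
#include <string.h>
#include <time.h>

struct dir_entry {
    char   name[32];           /* file name                               */
    long   file_size;          /* current size in bytes                   */
    long   start_block;        /* file address: first allocated block     */
    time_t created, modified;  /* dates/times kept by the File System     */
    unsigned short access;     /* access control bits                     */
};

struct afl_entry {             /* in-memory copy made at "Open" time      */
    struct dir_entry d;
    int dirty;                 /* set when size/dates change; the entry
                                  is written back to disk at "Close"      */
};

/* "Opening" a file then amounts to copying its directory entry into the
   AFL; every later Read/Write uses d.start_block for address translation. */
int main(void)
{
    struct dir_entry on_disk = { "CUST.MAST", 2800, 100, 0, 0, 0644 };
    struct afl_entry afl;

    memcpy(&afl.d, &on_disk, sizeof on_disk);
    afl.dirty = 0;
    printf("%s starts at block %ld, size %ld bytes\n",
           afl.d.name, afl.d.start_block, afl.d.file_size);
    return 0;
}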

Basically, there are two major philosophies for the allocation of disk space (blocks) to various files, namely: contiguous and non-contiguous.

This is what most of the earlier IBM Operating Systems such as DOS/VSE used to follow. We have assumed contiguous allocation up to now in our examples in Sec. 4.2.4. In this scheme, the user estimates the maximum file size that the file will grow to, considering the future expansion and requests the Operating System to allocate those many blocks through a command at the time of creation of that file. The main disadvantages of this scheme are space wastage and inflexibility. For instance, until the file grows to the maximum size, many blocks allocated to the file remain unutilized because they cannot be allocated to any other file either, thereby wasting a lot of disk space. On the other hand, by some chance, if the file actually grows more than the predicted maximum, you have a real problem. For instance, if all the blocks between 100 to 1099 are written and if now you want to add new customers, how can you do it? The blocks from 1100 onwards may have been allocated by the Operating System to some other file. One simple way is to stop selling and adding new customers. But this solution is not likely to be particularly popular. If you want to add a few blocks without disturbing contiguity, you have to load this file onto a tape or some other area on the disk, delete the old directory entry, request the Operating System to allocate new contiguous space with the new anticipated maximum file size (may be 3000 blocks), and load back the file onto the new area. At this juncture, the earlier allocated area can be freed. Before all this is done, your program to “ADD A CUSTOMER” cannot proceed. It will abort with a message “No disk space.” This gives rise to inflexibility.

Despite these disadvantages, it was used by some Operating Systems in the past mainly due to its ease of implementation. There is one advantage which can result out of the contiguous allocation, however. If the processing is sequential and if the Operating System uses buffered I/O, the processing speed can enhance substantially. In buffered I/O, the Operating System reads a number of consecutive blocks at a time in the memory in anticipation. This reduces the head movement and, therefore, increases the speed of the I/O. This is because all the blocks read in are guaranteed to be used in that sequence only. But they are read at the most appropriate time with the least amount of head movements. However, in an on-line query situation, an Application Program may have a query on a record residing in one block and then the next query could be on a record somewhere else altogether, which is not in any of the blocks held by the Operating System in the memory. The buffering concepts may not be very useful in such a case. In fact, it may be worse. The records read in anticipation may not be used at all. The situation while writing records may however be different. Even if records are created randomly in any sequence, the Operating System could buffer them in the memory and write them in one shot while passing their respective RBNs to the DMS for the index entry creation. This definitely reduces the R/W head movement and enhances the I/O speed with contiguous allocation. But, then, this buffering gives rise to a new complication in real time, on-line systems. For instance, if an AP requests for some details of a customer added to the system just a while ago, and the record is only in the main memory still and not yet written back on the disk, the DMS or Operating System should not search for it only on the disk and abort the search with a remark “Not found”. To achieve this, some additional data structures (The customer index having, not the RBN but the memory address, with an indicator whether the record is in the memory or the disk etc.) and some additional algorithms are needed. In fact, a common method is to search through the index to find out whether the data is in the memory or not and then issue an I/O request only if it is not in the memory but it is on the disk. One point needs to be understood in this context. If buffering is not used, i.e. if the Operating System does not read more consecutive blocks than necessary, the contiguity of disk space allocation does not necessarily enhance the response time in a time sharing environment even if the processing is sequential. The reason is that when a record for that process is read and is being processed, the CPU may be switched to another process. This process may request an I/O from an entirely different disk area causing the disk R/W heads to move. When the original process is reactivated and the next record for that process needs to be read, the R/W heads have to move again! It is interesting to know that without buffering, even the writing speed does not increase even if the processing is sequential and the disk space allocation is contiguous. The reason for this is the same as explained above. How does the Operating System manage the disk space in the case of contiguous allocation? It normally uses either a blocks allocation list or a bit map. We will now study these. A sample block allocation list is shown in Fig. 4.26. The list basically gives the account of all the blocks on the disk. 
It tells you whether a specific chunk is free and if it is not, to which file it is allocated.

In fact, it is more useful to maintain two different tables as shown in Fig. 4.27. One table shows the blocks allocated to various files, and the other shows the details of the free blocks. In the tables shown in Figs. 4.26 and 4.27, both the “from” and “to” columns are not strictly necessary, because the number of blocks is also maintained. These columns are shown here only for better comprehension. Whenever a new file is created, depending upon the size of the file requested, the Operating System can allocate a contiguous area by looking at the free blocks list as shown in Fig. 4.27. A question may arise: if there are a number of entries in the free blocks list, called ‘holes’, that can satisfy this request, which one should be chosen? There are multiple methods by which this choice can be made. The easiest method of allocation is called First fit. For instance, if a new file NEW wants 7 blocks, the Operating System using the method of first fit will go through the free blocks list and allocate from the first entry which has free blocks equal to or more than 7. It is the very first entry in our case, which has 16 free blocks. Therefore, the two tables after the allocation will look as shown in Fig. 4.28.
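A minimal sketch of first fit over such a free blocks list is given below. The hole sizes roughly follow the running example (16, 8, 100 and 4399 free blocks); the structures and numbers are only illustrative.

/* First fit over a free blocks list such as the one in Fig. 4.27. */
#include <stdio.h>

struct hole { long start; long count; };          /* one free chunk       */

static struct hole free_list[] = {
    { 5, 16 }, { 41, 8 }, { 101, 100 }, { 2001, 4399 },
};
static int nholes = sizeof free_list / sizeof free_list[0];

/* Allocate n contiguous blocks, first fit; returns starting block or -1. */
long alloc_first_fit(long n)
{
    for (int i = 0; i < nholes; i++) {
        if (free_list[i].count >= n) {
            long start = free_list[i].start;
            free_list[i].start += n;              /* shrink the hole      */
            free_list[i].count -= n;
            return start;
        }
    }
    return -1;                                    /* no hole big enough   */
}

int main(void)
{
    /* 7 blocks come out of the first 16-block hole, as in Fig. 4.28.     */
    printf("file NEW gets blocks from %ld onwards\n", alloc_first_fit(7));
    return 0;
}

Best fit and worst fit, discussed next, differ only in which hole this loop selects: the smallest hole that is still large enough, or the largest hole, respectively.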

There are two other methods, viz., Best fit and Worst fit methods to choose an entry from the free blocks list for the allocation of free blocks. Both of these methods would require the free blocks list to be sorted by number of free blocks. Such a list before the allocation would be as shown in Fig. 4.29. The best fit method would choose an entry which is smallest amongst all the entries which are bigger than the required one. To achieve this, this sorted table is used. In our case, where we want 7 blocks, the first entry in the sorted list is such an entry. Therefore, blocks 41–47 will be allocated. The resulting two tables, similar to the ones shown in Fig. 4.28 can now be easily constructed. If 10 blocks were requested for a file, we would have to use the second entry of 16 blocks in

the sorted list and allocate blocks 5 to 14 to the new file. After this allocation, there would be only 16 – 10 = 6 free blocks left in this hole. As 6 is less than 8, which is the number of free blocks in the first entry, the list obviously would need resorting - therefore, consuming more time. However, the best fit method claims to reduce the wastage due to fragmentation, i.e. the situation where blocks are free, but the holes are not large enough to enable any allocation. This is because, this method uses a hole just enough for the current requirement. It does not allocate blocks from a larger hole unnecessarily. Therefore, if subsequently, a request for a very large allocation arrives, it is more likely to be fulfilled. The advocates of worst fit method do not agree. In fact, they argue that after allocating blocks 41 to 47, block number 48 which is free in the example above cannot be allocated at all. This is because it is far less likely to encounter a file requiring only one block. Therefore, they recommend that the required 7 blocks should be taken from the largest slot, provided that it is equal to or larger than our requirement (i.e. 7). Therefore, by this philosophy, blocks 2001 to 2007 will be allocated, thereby leaving the remaining blocks with numbers 2008 to 6399 still allocable. This chunk is large enough to cater to other large demands. At some point, however, in the end, it is likely to have very few free blocks remaining and those would most probably be unallocable even in the worst fit scenario. But by then, some other blocks are likely to be freed, thereby creating larger usable chunks after coalescing. It is fairly straight forward to arrive at the resulting two tables after the allocation using this philosophy. In either of these philosophies, the tables have to be recreated/resorted after creation/deletion of any file. In fact, after the deletion of a file, the Operating System has to check whether the adjacent areas are free and if so, coalesce them and create a newly sorted list. To achieve this, the Operating System needs both the tables shown in Fig. 4.28. For instance, let us assume that the block allocation to various files is as shown in Fig. 4.28 at a specific time. Let us assume that CUSTOMER file is now deleted. The Operating System must now follow the following steps: (i) It must go through the file allocation list as given in Fig. 4.28(a) to find that 52 blocks between 49 and 100 will be freed after the deletion. (ii) It must now go through the free blocks list as in Fig. 4.28(b) to find that free blocks 41–48 and 101–200 are free and are adjacent to the chunk of blocks 49–100. Therefore, it will therefore coalesce these three as shown in Figs. 4.30 and 4.31 and work out a new free blocks list. This new list is shown in Fig. 4.32.

(iii) It will sort this new free blocks list, as shown in Fig. 4.33. The new list can be used later for best or worst fit algorithms. Another method of maintaining the block allocation list is by using chains. Figure 4.34 shows such chains or the allocations as per Fig. 4.26. The Operating System can reserve various slots to maintain the information about chunks of blocks. The figure shows 16 such slots, out of which only 13, i.e. 0 to 12 are used. This is because, Fig. 4.26 contains only 13 entries. Each slot consists of 5 fields. These are listed below. (i) Slot number (shown for our understanding - the Operating System can do without it), which is shown in a bold typeface in the figure (ii) (A)llocated/(F)ree status code (iii) Starting block number of that chunk (iv) Number of blocks in that chunk (v) Next slot number for the same status (A or F as per the case). An asterisk ( * ) in this field denotes the end of the chain. At the top of the figure, we show two separate list headers. This allows us to traverse through the allocated or free list. Therefore, this method does away with two separate tables of Fig. 4.27. If we want to go through the free list, we would read the free list header - start address. It is 1 in this case, as shown in Fig. 4.34. We will

read slot number 1. It says that there is a free chunk starting from block 5 of 16 blocks. The next slot number with free status is given in slot 1. It is 3 in this case. This is called a chain. We then go to slot 3, and so on. When a file is deleted, the status indicator of that slot is changed from ‘A’ to ‘F’ and the next slot number fields are updated in the appropriate slots to reflect this change. After this is done, the Operating System goes through the slots sequentially without using the chains to decide about coalescing. For allocating, it goes through free block chains. Using this method, it can perform coalescing in a better fashion, but then, the algorithms for best or worst fit are still time-consuming because the free block chunks are not accessed in the descending sequence of the size of free blocks in a chunk. The Operating System has to manipulate these slots and their chains quite frequently when a file is deleted or created or extended. We leave it to the reader to develop these algorithms. This scheme requires a larger memory to maintain various slots. The time taken to readjust the chains can also be considerable after the blocks are allocated. Imagine, for instance, if 7 blocks (blocks 5 – 11) are allocated to a file NEW, the slot will need splitting into two slots. The Operating System will have to acquire slot number 13 to manage this. After coalescing, some slots may be unused and this scheme will have to have an algorithm to manage the free slots also. A variation of this scheme which is far less time-consuming is through the use of bit maps. Bit maps can be used with both contiguous and non-contiguous allocation schemes. We will now study this method. A bit map is another method of keeping track of free/allocated blocks. A bit map maintains one bit for every block on the disk as shown in the Fig. 4.35. Bit 0 at a specific location indicates that the corresponding block is free and Bit 1 at a specific location indicates that the corresponding block is allocated to some file.

The figure shows a bit map corresponding to the original table in Fig. 4.26. The first 5 blocks are allocated, the next 16 are free, the next 20 are again allocated and so on. A bit map is used only to manage the free blocks and therefore, it need not show the file to which a specific block is allocated. In contiguous allocation, the file directory entry contains the file size and the starting block number. This information is sufficient for accessing all the blocks in the file one after the other or at random. You do not need any help from the bit map in this regard. When a file is deleted, the file directory entry is consulted and the corresponding blocks are freed, i.e. the corresponding bits in the bit map are set to 0. When a file is to be created, for example, NEW with 7 blocks, normally the first fit algorithm is chosen and the routine in the File System searches for 7 consecutive zeroes in the bit map starting from the beginning. Having found the first such 7 zeroes, it allocates them, i.e. changes them to 1 and creates the corresponding file directory entry with the appropriate starting block number. To implement Best fit and Worst fit strategies using a bit map is obviously tougher unless the Operating System also maintains the tables of free blocks in the sequence of hole size. This is fairly expensive and is the main demerit of a bit map over a free blocks list. However, a bit map has the advantage of being very simple to implement. The first look may suggest that the block list method will occupy much more memory than the bit map method. But this is deceptive. The reason is that, normally, the free blocks list itself is kept in free blocks and, therefore, does not require any extra space. In non-contiguous allocations, the maximum size of a file does not have to

be predicted at the beginning. The file can grow incrementally with time, as per the needs. This gives a lot of flexibility at reduced wastage of disk space. Another advantage is that the Operating System automatically allocates additional blocks if the file gets full during the execution of a program, without aborting the program and without asking for the operator’s intervention. But then, the Operating System routines to handle all this become more complex. There are two main methods of implementing non-contiguous allocation, viz. chained allocation as in MS-DOS or OS/2, and indexed allocation as in UNIX, AOS/VS or VAX/VMS. We will consider these one by one.

Chained allocation allocates to a file blocks which are not contiguous. But then, the Operating System must have a method of traversing to the next block or cluster of blocks allocated to that file, so that all the blocks in a file are accessible, and therefore, the whole file can be read in sequence. A pointer is a field which gives the address of the next block(s) in the same file. This address could comprise the block number instead of the physical address, as we have seen. One of the early ideas was to reserve two bytes in a block of 512 bytes to give this address. This scheme is not very popular today, but we will start our description with it. In one of the schemes of chained allocation, the following happens (Windows 2000 follows a slightly different version of this, as we shall see):

(a) The file directory entry gives the first block number allocated to this file. For instance, Fig. 4.36 shows that the first block for FILE A is 4.

(b) A fixed number of bytes (normally 2) in each block are used to store the block number of the next block allocated to the same file. We will call it a pointer. This means that in a 512 byte block, only 510 bytes can be used to store the data, as 2 bytes are used for the pointer.

(c) With 2 bytes, i.e. 16 bits, to denote the block number, the maximum number of blocks on the disk, and therefore in a file, can be 2^16 = 65536 = 64 K. Therefore, the maximum file size would be 32 MB (1 block = 512 bytes = 0.5 KB). However, out of this, the actual data is obviously slightly less. It will actually be 510 × 32 MB/512.

(d) Some special character with a predefined ASCII code, used as this pointer in the last block of a file, indicates the end of the chain. This is shown as ‘*’ in Fig. 4.36. For instance, if blocks 4, 14, 6, 24 and 20 are allocated to a file, they will be chained as shown in Fig. 4.36.

A better scheme, as followed in MS-DOS or OS/2, is to keep these pointers externally in a File Allocation Table (FAT) as shown in Fig. 4.37. In this scheme, the block can have the full 512 or 1024 bytes of actual data, depending upon the block length (in MS-DOS, the block length is 1024 bytes). You still have the overhead of the FAT, which is again 2 bytes per pointer and therefore per block. The only difference is that in the FAT, the pointers are kept externally. Let us assume that there are three files in our system. File A has been allocated blocks 5, 7, 3, 6 and 10. File B has been allocated blocks 4, 8 and 11, and File C has been allocated blocks 9, 2 and 12. The file directory entries for these three files, shown in the figure, mention the first block number in these respective files, viz. 5, 4 and 9.
Against the directory entry on the right side is the list of blocks allocated to different files which is shown only for our understanding, because the actual blocks allocated to that file are maintained in the FAT in the form of a chain as we shall see. This is, in a sense, a conceptual and a graphical representation of FAT which is shown on the left hand side. The study of FAT will reveal how the chain works. Each FAT entry is of fixed size - say 2 bytes. Therefore, a block of 1024 bytes can have 512 FAT entries or if a block is of 512 bytes, it can contain 256 FAT entries.
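A minimal sketch of following such a FAT chain is given below, anticipating the worked example in the following paragraphs. The FAT contents mirror the description of Fig. 4.37 (FILE A: 5-7-3-6-10, FILE B: 4-8-11, FILE C: 9-2-12); the use of 0 for a free entry and -1 for the EOF marker is only illustrative.

/* Sketch of FAT chain traversal: fat[n] holds the next block of the
 * file that owns block n; the chain starts at the block number kept
 * in the file directory entry. */
#include <stdio.h>

#define EOFBLK -1

static int fat[16] = {
    /* 0*/ 0, 0, /* 2*/ 12, /* 3*/ 6, /* 4*/ 8, /* 5*/ 7, /* 6*/ 10,
    /* 7*/ 3, /* 8*/ 11, /* 9*/ 2, /*10*/ EOFBLK, /*11*/ EOFBLK,
    /*12*/ EOFBLK, 0, 0, 0
};

/* Translate a logical block number into a physical one by walking the
   chain from the file's first block (taken from the directory entry). */
int lbn_to_pbn(int first_block, int lbn)
{
    int pbn = first_block;
    for (int i = 0; i < lbn && pbn != EOFBLK; i++)
        pbn = fat[pbn];
    return pbn;
}

int main(void)
{
    /* FILE A starts at block 5, so LBN 2 should be physical block 3. */
    printf("LBN 2 of FILE A is PBN %d\n", lbn_to_pbn(5, 2));
    return 0;
}

The worked example in the next paragraphs traces the same kind of chain by hand for FILE C.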

Therefore, normally specific block(s) are allocated to contain FAT entries themselves, depending upon the total number of blocks on the disk. These are extremely important blocks on the disk which contain the information about all other blocks which are either bad (unallocable), free (allocable) or already allocated to different files. A computer virus corrupting the FAT entries can bring the whole system to a standstill, because, if this happens, no file can be accessed reliably. Therefore, normally more than one identical copies of FAT are maintained for protection. Before any file is read or written, the Operating System brings the block(s) containing the FAT entries in the memory for faster operation. The serial numbers of the blocks shown on the left hand side of FAT in Fig. 4.37 and outside the box are shown only for our convenience. They are not a part of the FAT, and therefore, are not actually stored. After reading the FAT in the memory, if we know the Byte Address of the beginning (i.e. the zeroth entry) of the FAT (say A), the address of any entry can be found as A + 2n, where n = entry number. This is because, each entry takes 2 bytes. Hence, there is no reason for storing them. Suppose that we want to read FILE C sequentially. The procedure will be fairly straightforward. The file directory entry which will have been copied to the memory at the time of opening the file gives the first block number which is 9. The Operating System can read that data block after the necessary address conversions/ translations into its three dimensional address. After this, it would read the next block allocated to this file. To do this, it treats this 9 as a random key and then it calculates the address of entry number 9 of FAT in the main memory, by the formula A + (2×9) = A + 18, where A is the starting address of the FAT. Now the Operating System accesses entry number 9 in the FAT which gives the next block number allocated to this file as 2. We can verify this in the FAT (9th entry) and also the graphical representation on the right side. The Operating System now can read this data block after the necessary translations. After this, the Operating System accesses entry number 2 of FAT at memory address of A + (2×2) = A + 4. This entry

gives the next block number as 12 as the third data block allocated in this file ( i.e. FILE C ). The Operating System can now read block number 12. Again the Operating System treats 12 as the random key and accesses the 12th entry of the FAT at memory address of A + (2 ×12) = A + 24. This entry says “EOF”, i.e. End Of File. This indicates that there are no more blocks for this file. This is how, if you want to read FILE C sequentially, the Operating System can read block numbers 9,2 and 12 one after the other. The point is that the AP normally reads logical data records (e.g. Customer record ) one after the other. Therefore, the conversion of a logical data record to blocks still needs to be done. This conversion is done to logical blocks (0, 1, 2 in this case) first and then to physical blocks (9, 2, 12 in this case). For instance, when the AP reads a very first data record ( RRN = 0 ) of 700 bytes in a system where block consists of 512 bytes, the file system will have to read all 512 bytes from LBN = 0 (i.e. PBN = 9) and the first 188 bytes from LBN = 1 (i.e. PBN = 2) We will take another example to illustrate this. How will our previous example of a sequential processing with “While (!eof) – fread” in C or “READ CUST-REC... AT END” in COBOL work in this case? It is easy to imagine. Most of the processing is absolutely similar to what we have described earlier, until we arrive at the logical block numbers (LBN) to be read. In our example in Sec. 4.2.5, we had to read the following to get one CUSTOMER record of 700 bytes starting at RBN = 1400 (refer to point (iv) in Sec. 4.2.5 and Fig. 4.20.) In our example, if File A, shown in Fig. 4.37 is the CUSTOMER file, the logical blocks 2, 3 and 4 are the respectively 3rd, 4th and the 5th blocks from the beginning (because logical block 0 is the first block in the file). These are physical block numbers 3, 6 and 10. Therefore, the Operating System will have to read blocks 3, 6 and 10 in the controller’s memory, form a logical record of 700 bytes as shown above and then transfer it

(i.e. the last 136 bytes of logical block 2 + all 512 bytes of logical block 3 + the first 52 bytes of logical block 4 = 700 bytes in total)
to the I/O area of the AP. The reading of these blocks 3, 6 and 10 one after the other is obviously facilitated due to the chains maintained in the FAT. You can now easily imagine how the Operating System can satisfy the AP’s requests to read logical customer records sequentially one after the other by using an internal cursor, and traversing through these chains in the FAT. With chained allocations, on-line processing however tends to be comparatively a little slower. A Data Management System (DMS) used for an on-line system will have to use different methods for a faster response. An index shown in Fig. 4.22 is one of the common methods used in most of the Relational Database Management Systems (RDBMS). Imagine again that we have an index as shown in the figure. An Application Program (AP) for “Inquiring about customer information” is written. A user asks for the details of a customer with customer number = “C009”. How will the query be answered? Let us follow the exact steps: (a) The AP will prompt for the customer number for which details are required. (b) The user will key in “C009” as the response. (c) This will be stored in the I/O area for the terminal of the AP. (d) The AP will supply this key “C009” to the DMS to access the corresponding record (e.g. MS-Access under Windows 2000). (e) DMS will refer to the index and by doing a table search, determine the RBN as 700 (refer to Fig. 4.22). (f) DMS now will request the Operating System, through a system call, to read a record of 700 bytes from RBN = 700. (g) The Operating System will know that it has to read 700 bytes starting from Relative Byte Number = 700 (after skipping the first 700 bytes, i.e. 0 to 699.) This gives us the starting address = 700/512 = 1 + 188/512, i.e. the reading should start from byte number 188 of logical block number (LBN) = 1. But only 511 – 187 = 324 bytes will be of relevance in that block. Therefore, we would therefore need 700 – 324 = 376 bytes from the next block (i.e. LBN = 2) (h) The Operating System will translate this as:

  LBN 1 : last 324 bytes (188 – 511)
+ LBN 2 : first 376 bytes (0 – 375)
= Total : 700 bytes

(i) In our example, if FILE A is the CUSTOMER file, as per the FAT, logical block 0 (given in the directory entry) = physical block number 5. Similarly, logical block 1 will be physical block number 7 and logical block 2 will be physical block number 3 (refer to Fig. 4.37). Therefore, the DD issues instructions to the controller to read physical blocks 7 and 3, pick up the required bytes as given in point (h) and formulate the logical record as desired by the AP. (j) The Operating System transfers these 700 bytes to the I/O area of the AP (perhaps through the DMS buffers, as per the design).

(k) The AP then picks up the required details from the record read, to be displayed on the screen.

This is the way the interaction between an Operating System such as Windows 2000 and any DMS such as Access takes place. An interesting point emerges. In this method, how does the Operating System find out which are the logical blocks 1 and 2? The Operating System has to do this by consulting the file directory entry, picking up the starting block number, which is LBN 0. It then has to consult the FAT for the corresponding entry (in this case, entry number 5) and proceed along the chain to access the next entry, each time adding 1 to arrive at the LBN and checking whether this is the LBN that it wants to read. There is no way out. If logical block numbers 202 and 203 were to be accessed, the Operating System would have to go through a chain of 202 pointers in the FAT before it could access LBN = 202 and 203, get their corresponding physical block numbers and then ask the controller to actually read the corresponding physical blocks.

If we had the pointers embedded in the blocks, leaving only 510 bytes for the data in each block, the chain traversal would be extremely slow, because the next pointer would be available only after actually reading the previous block, thereby requiring a lot of I/O operations, which are essentially electromechanical in nature. If the FAT is entirely in the memory, as in MS-DOS or OS/2, the chain traversal is not very slow, because you do not have to actually read a data block to get the address of the next block in a file. However, as the chain sizes grow, this is not the best method to follow, especially for on-line processing, as we have seen. This is the reason why indexes are used for disk space allocation in some other Operating Systems.

An index can be viewed as an externalized list of pointers. For instance, in the previous example for File A, if we make a list of pointers as shown below, it becomes an index. All we will need to do is to allocate blocks for maintaining the index entries themselves, and the file directory entry should point towards this index.

5   7   3   6   10

The problem is how and where to maintain this index. There are many ways. CP/M found an easy solution. It reserved space in the directory entry itself for 16 blocks allocated to that file as shown in Fig. 4.38. If the file requires more than 16 blocks, the directory entry is repeated as many times as necessary. Therefore, for a file with 38 blocks, there would be 3 directory entries. The first two entries will have all the 16 slots used (16×2=32). The last entry will use the first 6 slots (32+6=38) and will have 10 slots unused. In each directory entry, there also is a field called ‘block count’ which, if less than 16, indicates that there are

some free slots in the directory entry. Therefore, this field will be 16, 16 and 6 in the 3 directory entry records in our example given above. Figure 4.38 shows this field as 5, because there are only 5 blocks (5, 7, 3, 6 and 10) allocated to this file. This corresponds to File A of Fig. 4.37. When the AP wants to write a new record onto the file, the following happens:

(a) The AP makes a request to the Operating System to write a record.

(b) If there is not enough free space in the blocks already allocated to the file to accommodate the new record, the Operating System calculates the number of blocks needed to accommodate it, and then acquires these blocks in the following manner:

• The Operating System consults the free blocks pool and chooses the required free blocks. Let us say, 3 blocks are needed for this operation.
• It updates the free blocks pool (i.e. removes those 3 blocks from the free blocks pool).
• It now writes as many block numbers as there are vacant slots in the current directory entry, given by (16 – block count). If all the block numbers (in this case 3) are accommodated in the same directory entry, nothing else is required. It only writes these block numbers in the vacant slots of the directory entry and increments the block count field within that entry. However, if, after writing some block numbers, the current directory entry becomes full (block count = 16), it creates a new directory entry with block count = 0 and repeats this step until all the required block numbers have been written.

(c) Now the Operating System actually writes the data into these blocks (i.e. into the corresponding sectors, after the appropriate address translations).

Reading the file sequentially in this scenario is fairly straightforward and will not be discussed here. For on-line processing, if you want to read 700 bytes starting from RBN = 700, as given by the index in the last example in the section on chained allocation under non-contiguous allocation, it effectively means reading logical block numbers 1 and 2. The File System can easily read the directory entry and pick up the second and third slots (i.e. logical block numbers 1 and 2, which correspond to physical blocks 7 and 3, as per Fig. 4.38). In the same way, picking up logical blocks 35 and 36 is not as difficult now as it was in chained allocation. The File System can easily calculate that logical block numbers 35 and 36 will be available in slot numbers 3 and 4 of the third directory entry for that file. Therefore, it can directly access these blocks, improving the response time.

With the scheme of allocating only one block at a time (as is done in CP/M, MS-DOS or UNIX), you have a lot of flexibility and minimum disk space wastage, but then the index size increases, thereby also increasing the search times. Do we have any via media between allocating only one block at a time and allocating all the blocks for a file at once, as discussed earlier in contiguous allocation? AOS/VS running on Data General machines and VMS running on VAX machines provide such a via media. AOS/VS defines an element as the unit in which disk space allocation is made. VAX/VMS calls this a cluster. For a detailed discussion, we will follow the AOS/VS methodology, though VMS follows a very similar one. An element consists of multiple contiguous blocks, with the default as 4. The user can define a different element size for each file.
Where response time is important and wastage of disk space is immaterial, the user can choose a very high element size. Whenever a file wants more space, the Operating System will find, from the bit map, as many contiguous blocks as the element size specifies, and then allocate them. If the element size is very large, we come very close to contiguous allocation as in MVS. On the other hand, if the element size is only one block, we get closer to the other extreme, as in the UNIX implementation. The element is only a unit in which the disk space is allocated, essentially to reduce the index length.
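Locating one element then reduces to finding a run of contiguous 0 bits in the bit map. A minimal sketch of such a search is given below; the bit map size and block numbers are illustrative.

/* Sketch of finding one element, i.e. a run of contiguous free blocks,
 * in a bit map (bit 0 = free, bit 1 = allocated). */
#include <stdio.h>
#include <string.h>

#define NBLOCKS 1024
static unsigned char bitmap[NBLOCKS / 8];        /* one bit per block    */

static int  is_allocated(int b)  { return (bitmap[b / 8] >> (b % 8)) & 1; }
static void set_allocated(int b) { bitmap[b / 8] |= 1 << (b % 8); }

/* First-fit search for n contiguous free blocks; marks them allocated
   and returns the starting block number, or -1 on failure. */
int alloc_element(int n)
{
    int run = 0;
    for (int b = 0; b < NBLOCKS; b++) {
        run = is_allocated(b) ? 0 : run + 1;
        if (run == n) {
            int start = b - n + 1;
            for (int i = start; i <= b; i++)
                set_allocated(i);
            return start;
        }
    }
    return -1;
}

int main(void)
{
    memset(bitmap, 0, sizeof bitmap);            /* everything free      */
    for (int b = 0; b < 41; b++)                 /* pretend 0-40 are used*/
        set_allocated(b);
    /* Prints 41: the element occupies blocks 41-44, as in Fig. 4.39.    */
    printf("element of 4 blocks starts at block %d\n", alloc_element(4));
    return 0;
}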

Let us illustrate how the concept of the element or cluster works by taking a default element size of 4 blocks, or 2048 bytes. In AOS/VS, 4 bytes (or 32 bits) are reserved for specifying the starting block number of an element allocated to a file. Therefore, one block of 512 bytes can contain 512/4 = 128 pointers to 128 elements allocated to a file. Let us discuss how a file grows, in a step by step fashion. Let us imagine that initially, we create a file called F1 using a command given to the Command Language Interpreter (CLI) on the Data General machine. Later, we copy records from a file on a tape into this file on the disk. The file on the disk and its corresponding indexes grow in various stages that we will trace now. The good point is that the user/programmer need not be aware of this, as his intervention is not needed at all. When you create a file, say F1, using the command “CRE F1” for the AOS/VS CLI, the Operating System will do the following:

• Find 1 free element, i.e. 4 contiguous free, allocable blocks, using the bit map.
• Allocate them to F1 and mark them as allocated by setting the appropriate bits.
• Create a directory entry for F1, containing the block number of the first block in that element as a pointer. This is as shown in Fig. 4.39, assuming that blocks 41, 42, 43 and 44 were allocated.

At this stage, imagine that you are running a program copying a tape file onto F1 on the disk. Let us assume a logical record length of 256 bytes for the sake of convenience. Therefore, the first element of 2048 bytes can hold 8 data records. The pseudocode for the AP to copy the file will be as shown in Fig. 4.40.
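Fig. 4.40 itself is not reproduced here; one possible C rendering of such a copy loop, under the assumptions of the text (256-byte records, the tape modelled simply as another file descriptor, /dev/tape0 being a hypothetical device name), might be:

/* Copy fixed-length records from a tape file to the disk file F1.
 * Each write that does not fit into the blocks already allocated to F1
 * makes the Operating System allocate a fresh element, as described.   */
#include <fcntl.h>
#include <unistd.h>

#define RL 256

int main(void)
{
    char rec[RL];
    int tape = open("/dev/tape0", O_RDONLY);          /* hypothetical   */
    int disk = open("F1", O_WRONLY | O_CREAT, 0644);

    while (read(tape, rec, RL) == RL)   /* read a tape record...         */
        write(disk, rec, RL);           /* ...and append it to F1        */

    close(tape);
    close(disk);
    return 0;
}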

While copying the first 8 records, the Operating System will have no problem. After writing each record, a cursor within the file is incremented to tell the Operating System where to write the next record. The file size field can also serve this purpose. For the 9th record, when the AP gives a system call for “Write Disk-file-record ...”, the Operating System will realize that there is no disk space left for the file, as the only element of 4 blocks or 2048 bytes has already been filled by the first 8 records. The Operating System finds this out using the file size/cursor. The Operating System will now proceed as follows:

• It will locate another free element from the free blocks pool. A point needs to be made here. Any Operating System is normally designed in a modular way. Therefore, there will be a routine within any Operating System to acquire n contiguous blocks. This routine can also be organized as a System Service or System Call. This will be called and executed at this juncture. Let us say that the acquired free blocks were blocks 81, 82, 83 and 84.
• It will allocate this element to F1 and mark these blocks as allocated by setting the appropriate bits in the bit map. Now there are two elements allocated to F1, one starting at block 41 and the other starting at block 81, but the directory entry for F1 can point to only one of these. Therefore, we need some scheme to accommodate both these elements. Essentially, we need an index. This index has to be kept in yet another free block.
• It will find one more free block using the bit map and allocate it to an index which will contain 128 entries of 4 bytes each, and then update the bit map accordingly. Let us say that this index block was block 150. Therefore, this block will be removed from the free blocks pool.
• Now the directory entry will point to this index block, i.e. it will contain the pointer 150.

The first two entries in the index block will be updated to point to the two elements, as shown in Fig. 4.41, i.e. they will contain the pointers 41 and 81 respectively. The remaining 126 slots in the index block will still be free. The copying of another 8 tape records can now continue, thanks to the new element consisting of 4 blocks.

As the tape records are read, new elements are allocated to F1 and the index entries for F1 are filled in block 150. What happens if the file requires more than 128 elements? The index block 150 can contain only 128 pointers pointing to 128 elements, or 128 × 4 = 512 blocks. For file sizes larger than this, there is a problem. In such a case, the Operating System acquires one more index block, as shown in Fig. 4.42. Let us say that block 800, in addition to block 150, is allocated as an index block. We are faced with the same problem. Which pointer - 150 or 800 - should be maintained in the directory entry now? There are 2 index blocks, whereas there is space for only one pointer. The Operating System uses the same trick to solve this problem. Another block, say 51, is acquired and is assigned to a higher level index. This block also can contain 128 entries of 4 bytes each. Each entry in this case holds a pointer to a lower level index. The first two entries in block 51 are pointers to the lower level indexes (150 and 800, respectively). The remaining 126 slots in block 51 are unused at this juncture. Now, the file directory entry points to block 51. A closer study of Fig. 4.42 will clarify how this system works. If two levels of indexes are not sufficient, a third level of index is introduced. AOS/VS supports 3 levels of indexes. It is easy to imagine this, and therefore, it is neither discussed nor shown.

Reading a file sequentially is fairly straightforward in this case. If the file is being processed from the beginning, the Operating System does the following to achieve this:

• From the file size maintained as a field in the file directory, the Operating System determines the index levels used. For instance, if the file size is less than the element size, no index will be required. If the file size is between 1 element and 128 elements, there will be one index level, and so on. This tells the Operating System the meaning of the pointer in the file directory entry, i.e. the level of index to which it points.
• The Operating System picks up the pointer in the file directory and traverses down to the required level, as given above, to reach the data blocks.
• It can now read data blocks 41, 42, 43, 44 and so on, after the appropriate address translations, in that sequence. Actually, an AP will normally want to process the logical records sequentially. The translation from logical records to logical blocks and then to physical blocks has already been discussed earlier and therefore, it will not be repeated here.
• After reading the data blocks in the controller’s memory, the relevant bytes from the physical blocks are transmitted to the main memory to form a logical record, which is presented to the AP.
• After block 44, the Operating System knows that one element is over, and it has to look at the next pointer in the index (which is 81 in this case).
• By the same logic, when data block 8, which is the last data block pertaining to that index block, is read (refer to Fig. 4.42), the Operating System knows that it has to look up the next index block, whose address is given as the second pointer in the higher level index (which is block number 800 in this case).
• By repeating this procedure, the entire file is read.

For on-line processing, the AP will make a request to the DMS to read a specific record, given its key. The DMS will use its key index to extract the RBN and request the Operating System to read that record. At this stage, AOS/VS will determine the logical blocks to be read to satisfy that request.
Given the LBNs and the file size, AOS/VS can find out the index level and the element which is to be read. For instance, if the Operating System knows from the RBN that LBNs 3 and 4 are to be read, it can work backwards to find out where these pointers will be. It knows that LBNs 0–3 are in element 0 and LBNs 4–7 are in element 1. Therefore, it wants to read the last block of element 0 + the first block of element 1 in order to read the blocks

with LBN = 3 and 4. These are physical blocks 44 and 81 respectively, as shown in Fig. 4.41. It also knows that elements 0–127 are in index block 0. Therefore, the first two entries in this index block (given by block 150 shown in the figure), will give pointers to these two elements, i.e. element 0 and element 1. It can read those pointers almost directly and then access the actual block numbers thereafter. This scheme therefore, has a direct advantage over the chained allocation for on-line processing. You do not have to go through the previous pointers to arrive at a desired one. Given RBN and number of bytes to be read in the record, the DMS requests the operating system to read the record.
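The “working backwards” arithmetic can be sketched as follows, assuming the element size of 4 blocks and the 128 pointers per index block used in the text; the names are illustrative.

/* Given a logical block number, find the element it belongs to, the
 * block within that element, the slot within the index block and, for
 * a two-level index, the slot within the higher level index. */
#include <stdio.h>

#define BLOCKS_PER_ELEMENT   4
#define POINTERS_PER_INDEX 128

int main(void)
{
    long lbn = 3;                                     /* e.g. LBN 3       */

    long element     = lbn / BLOCKS_PER_ELEMENT;      /* which element    */
    long within      = lbn % BLOCKS_PER_ELEMENT;      /* block inside it  */
    long index_slot  = element % POINTERS_PER_INDEX;  /* slot in index    */
    long higher_slot = element / POINTERS_PER_INDEX;  /* only if 2 levels */

    printf("LBN %ld: element %ld, block %ld within it, "
           "slot %ld of index block %ld\n",
           lbn, element, within, index_slot, higher_slot);
    /* For LBN 3 this gives element 0, slot 0 - i.e. the last block of the
       element pointed to by the first entry of index block 150 (Fig. 4.41). */
    return 0;
}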

From the user’s viewpoint, a directory is a file of files, i.e. a file containing details about other files belonging to that directory. Today, almost all operating systems such as Windows 2000, UNIX, AOS/VS, VMS and OS/2 have the hierarchical file structure. Therefore, we will not consider the single level (as in CP/M) or two level (as in RSTS/E on PDP-11) directory structures. In a sense, these can be considered as special cases of the hierarchical file system. In the hierarchical file structure, a directory can have other directories as well as files. Therefore, it forms a structure like an inverted tree - as shown in Fig. 4.43. The figure shows most of the facilities and aspects of a hierarchical file system. All the directories are drawn as rectangles and all the files are drawn as circles. At the top is the “ROOT” directory which contains a file, viz., VENDOR and two subdirectories, viz., PRODUCTION and COSTING. PRODUCTION and COSTING in turn, have other directories/files under them.

Normally, when a user logs on, he is automatically positioned at a home directory. This is done by keeping the home directory name in the record for that user, called a profile record, in a profile file. In UNIX, this record is in the /etc/passwd file. Each user in the system has a profile record which contains information such as username, password, home directory, etc. This is created by the system administrator at the time he allows a user to use the system. For instance, while assigning a production manager as a valid user on the system, the system administrator may decide that his home directory is /PRODUCTION. This is stored in the record for the production manager in the profile file. When the production manager logs onto the system, the login process, while checking the password, itself consults the profile file and extracts the home directory. After this, he is put automatically in the /PRODUCTION directory. At this juncture, after login, if he immediately gives a command to list all the files and directories, without mentioning any directory name explicitly from which he wants to list, the Operating System will respond by giving a list which has BUDGET, WORKS-ORDER and PURCHASING. The list will also mention that, out of these, BUDGET is a file, whereas the other two are directories. He then can start manipulating/using them. Similarly, a profile record for the costing manager may contain /COSTING as his home directory. In this case, when the costing manager logs onto the system, after checking the typed username and password against those in the profile record, the Operating System uses this home directory to position him at the /COSTING directory. This information is copied at this juncture into the process control block (PCB) (or the u-area in UNIX) for the process created by him, for further manipulation, as we shall see. The Operating System allows you to start from the top, i.e. the ROOT directory, and traverse down to any directory or any file underneath. At any time, right from the time you login, you are positioned in some working or current directory (which is the same as the home directory immediately after logging in).

To reach any other directory or file from your current directory, you can specify the complete path name. For example, if you want to print the BUDGET file under the directory PRODUCTION, you can give the following command from any current directory that you may be at: PRINT /PRODUCTION/BUDGET. The first slash (/) denotes the root. This instruction, in effect, directs the Operating System to start from the root, go into the PRODUCTION directory, search for the file called BUDGET within that directory and then print it. In hierarchies which are large, specification of the complete access path can become cumbersome. To help the user out of this, many systems allow the user to specify a partial path name or relative path name, beginning with some implied point of reference such as the working or current directory. For example, assume that a user is currently in the directory /PRODUCTION. The user can now just say PRINT BUDGET to have the same effect as specifying the complete path name as given above. The absence of a slash (/) at the beginning of the path name tells the Operating System command interpreter that the path is partial with respect to the current directory, and it does not have to start with the root. This is obviously possible because the Operating System maintains a cursor or pointer at the current position in the hierarchy for each user in the PCB or the u-area of the process that the user is executing (in this case, it may be the command interpreter or a shell process). In UNIX, by convention, the directory name "." refers to the current directory and ".." always refers to the directory which is one level higher (i.e. the parent directory). Let us assume that after login, you are in your home directory /PRODUCTION. The path name "..", therefore, would refer to the root directory. Now assume that you are in the WORKS-ORDER directory under the /PRODUCTION directory. If you want to move from there to the PURCHASING directory, under the same parent directory, you can use a command with "../PURCHASING" as the pathname to move to that directory. This is because the ".." in the command takes you one level up, i.e. to the /PRODUCTION directory, and therefore, "../PURCHASING" would take you to the /PRODUCTION/PURCHASING directory. This now becomes your current or working directory, which is stored in the PCB or u-area of your process. You can now issue a command to list all the files in the PURCHASING directory under /PRODUCTION without having to specify any pathname at all. If you are in the /PRODUCTION/WORKS-ORDER directory and you want to open the BUDGET file under the /COSTING directory, you will have to issue a command to open "../../COSTING/BUDGET". After this, an Operating System such as UNIX will execute this command if you have the appropriate access rights. This is because the first ".." takes you to /PRODUCTION, the next ".." takes you to the root (/) itself, and therefore, "../../COSTING/BUDGET" resolves to the desired file /COSTING/BUDGET.
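How a command interpreter might combine the current directory with such a relative path, purely at the string level, is sketched below. This is only an illustration (a real Operating System resolves each component against directory entries, as we shall see later in this chapter); the function name resolve() and the buffer sizes are invented for the example.

    #include <stdio.h>
    #include <string.h>

    /* Combine a current working directory and a path typed by the user into one
       absolute path, honouring "." and "..".  Purely a string-level illustration. */
    static void resolve(const char *cwd, const char *path, char *out, size_t n)
    {
        char work[512];
        char *comp, *save;

        if (path[0] == '/')                       /* absolute path ignores the cwd  */
            snprintf(work, sizeof work, "%s", path);
        else
            snprintf(work, sizeof work, "%s/%s", cwd, path);

        out[0] = '\0';                            /* rebuild the path component-wise */
        for (comp = strtok_r(work, "/", &save); comp != NULL;
             comp = strtok_r(NULL, "/", &save)) {
            if (strcmp(comp, ".") == 0)
                continue;                         /* current directory: no effect    */
            if (strcmp(comp, "..") == 0) {        /* parent: drop the last component */
                char *slash = strrchr(out, '/');
                if (slash != NULL)
                    *slash = '\0';
                continue;
            }
            strncat(out, "/", n - strlen(out) - 1);
            strncat(out, comp, n - strlen(out) - 1);
        }
        if (out[0] == '\0')                       /* everything cancelled: the root  */
            snprintf(out, n, "/");
    }

    int main(void)
    {
        char abs[512];
        resolve("/PRODUCTION/WORKS-ORDER", "../PURCHASING", abs, sizeof abs);
        printf("%s\n", abs);                      /* /PRODUCTION/PURCHASING          */
        resolve("/PRODUCTION/WORKS-ORDER", "../../COSTING/BUDGET", abs, sizeof abs);
        printf("%s\n", abs);                      /* /COSTING/BUDGET                 */
        return 0;
    }

Running this prints /PRODUCTION/PURCHASING and /COSTING/BUDGET, matching the two examples above.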

If there was no hierarchy, and all the files of all the users were put together under one big, global directory, there would be chaos. First of all, there would be the problem of avoiding duplicate file names. How can one prevent a user from giving his file a name which some other user has already used? Of course, a user can keep on suggesting a name and the Operating System can keep checking it against the list of all already existing files (which could run into hundreds). But this is cumbersome. Therefore, it is much more convenient to have separate directories for different users or applications. In fact, with the passage of time, a user or an application also may have hundreds of files underneath. It is obviously handy to subdivide even these, in turn, into different subdirectories, depending upon the subject or interest. In such a case, the same filename can be used a number of times in the whole system, so long as it does not appear more than once under one directory. Figure 4.43 shows the same name BUDGET

used for two files. This is legitimate, as these files are under different directories and therefore have different pathnames.
With a hierarchical file system, sharing of files or directories is possible, thereby obviating the need for copying the entire file or directory. This saves disk space. Figure 4.43 shows the PURCHASING directory, along with all the files underneath it, as shared by two directories, viz. /PRODUCTION and /COSTING. Therefore, you can reach the PURCHASING directory and/or all the files belonging to it from either /PRODUCTION or /COSTING. This saves duplication. This is achieved by what is known as linking. The idea is that there is only one copy of the PURCHASING directory and also one copy each of all the files underneath it; however, it is pointed to from two directories. We will study linking later in detail. The hierarchical file system normally also allows aliases, that is, referencing the same file by two names. For example, the same physical file depicted in Fig. 4.43 can be accessed with the name VENDOR from the root directory and with the name SUPPLIER from the /PRODUCTION/PURCHASING directory. This means that the access paths /VENDOR and /PRODUCTION/PURCHASING/SUPPLIER denote the same physical file. This also is achieved by linking. For sharing, the file or directory has to previously exist under one directory, after which you can create links to it from another directory. For instance, if the file VENDOR already exists under the root directory, you can create a link from the directory PURCHASING to the same file but call it by a different name such as "SUPPLIER".
There can be two types of links, as is the case with different implementations of UNIX. One is called a soft link or symbolic link. In this case, in the PURCHASING directory, you create a file called SUPPLIER, but in that file, you only create one record giving the pathname, which in this case is /VENDOR. Therefore, when the Operating System tries to access the SUPPLIER file, it will come across this pathname record. It will then separate the / and the VENDOR and actually resolve the pathname by traversing the path from the root (/) to the VENDOR file. This is how the Operating System can reach the same file. Since the link merely records the pathname symbolically, it is called a soft or symbolic link. The other method of file sharing is the hard link. In this case, there is a field maintained for each file in the directory entry known as a usage count, which will be 2 if a file is shared from 2 directories. Every time a hard link is created to an existing file, the usage count maintained in the file directory entry for that file is incremented. Similarly, when the file is deleted from any directory, its usage count is reduced by 1 first, and only if the usage count now becomes 0 is the file physically deleted - i.e. the blocks allocated to that file are added to the list of free blocks, because the Operating System can now be sure that the file will no longer be used by anybody.
What will happen if you are in the root directory and you delete the file VENDOR? If SUPPLIER is a soft link, the file will be physically deleted and the blocks will be freed immediately. If you then try to access it from the PURCHASING directory, you will open the SUPPLIER file, pick up the pathname /VENDOR and fail, because that path no longer exists. However, if SUPPLIER is a hard link, the physical file is not deleted and the blocks are not freed, because the usage count before executing the DELETE instruction was 2. After executing this DELETE instruction, the Operating System breaks the connection between the VENDOR file and the root directory, and reduces the usage count in the directory entry by 1. Now the usage count becomes 1. The Operating System does not delete the file physically because the usage count is 1 and not 0. The same file is still accessible from the PURCHASING directory. For all the files with usage count = 1 (i.e. unshared), the DELETE instruction would result in making the usage count 0. At this time, the file is physically deleted, thereby freeing the blocks occupied by that file.
Files represent information which is very valuable for any organization. For each piece of data, the management would like to assign some protection or access control codes. For instance,

a specific file should be readable only for users A and B, whereas only users B and C should be allowed to update it, i.e. write to it, while perhaps all others should not be able to access it at all. This access control information is a part of the directory entry and is checked before the Operating System allows any operation on any data in that file. With a hierarchical file system, you can group various files by common usage, purpose and authority. Therefore, you can set up different access controls at the directory level instead of doing so at each individual file level, making things easier and the controls better. When you deny a user access to a directory, obviously he cannot access any file within it either! In our figure, you could say that the /PRODUCTION directory should be under the control of the production manager, the /COSTING directory should be under the control of the costing manager, and so on. The PURCHASING directory, and therefore all the files under it, could be read by the costing, production and purchase managers, etc. Specifying these access controls at different levels is facilitated by having a hierarchical file system.
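Both the usage count discussed earlier and this access control information end up as simple fields kept with the file's directory entry. The following minimal sketch shows only the usage-count bookkeeping behind deleting a hard-linked file; the structure and function names are invented for the illustration and do not correspond to any particular Operating System.

    #include <stdio.h>

    /* A toy directory entry carrying only what the discussion above needs:
       a usage count and a flag telling whether the data blocks are still held. */
    struct file_entry {
        char name[32];
        int  usage_count;    /* number of directories pointing at this file */
        int  blocks_freed;   /* 1 once the data blocks have been released   */
    };

    /* Deleting one directory reference under the usage-count (hard link)
       scheme: the data blocks are freed only when the last name goes away.  */
    static void delete_reference(struct file_entry *f)
    {
        f->usage_count--;
        if (f->usage_count == 0) {
            f->blocks_freed = 1;         /* return the blocks to the free pool */
            printf("%s: last link removed, blocks freed\n", f->name);
        } else {
            printf("%s: still reachable, usage count now %d\n",
                   f->name, f->usage_count);
        }
    }

    int main(void)
    {
        /* VENDOR is shared from the root and (as SUPPLIER) from PURCHASING. */
        struct file_entry vendor = { "VENDOR", 2, 0 };

        delete_reference(&vendor);   /* delete /VENDOR: the file survives      */
        delete_reference(&vendor);   /* delete the last name: blocks are freed */
        return 0;
    }

The first call mimics deleting /VENDOR: the count drops to 1 and the blocks survive, exactly as described above; only the second call, which removes the last name, frees the blocks.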

In the last section, we have seen different aspects and benefits of the hierarchical file system. In this section, we will examine how this is normally implemented internally by the Operating System. One idea to implement this would be to treat a directory as a file of files. For instance, the file for root directory shown in Fig. 4.43 will have entries as shown in Fig. 4.44. Each of the entries is essentially a record for the file or a subdirectory within it (denoted by the “Type” field). The field “Type” indicates whether it is a directory or a file. If it is a file, the address field tells you the address of the first block, element or the block number of the index at an appropriate level depending upon the disk space allocation method used. Using this, the address translation between the logical block number (LBN) and the physical block number (PBN) is achieved as studied earlier. If it is a DIRECTORY entry, the address is the block number where the details of files and directories within that directory are found. The details will be maintained in the same fashion, as shown in Fig. 4.44. This means that once the “record layout” of the entries or records for the “directory file” is determined by the Operating System, it can easily pick up the relevant fields such as “Type”, “Address” within a record and take the necessary action. For instance, if you read block 50 shown against PRODUCTION in Fig. 4.44, you would get the entries for PRODUCTION directory as shown in Fig. 4.45. This information tallies with Fig. 4.43. Now, if the file BUDGET is to be read, the Operating System can access the field for “Address” in the entry for that file, and access all the data blocks as per the disk space allocation method. Similarly, if you read block number 70, you would get the details of the COSTING directory, as shown in Fig. 4.46. However, there is a problem in this scheme. It

duplicates the information about the PURCHASING directory, which is essentially a shared directory, in both the directories to which it belongs. This is evident from Figs. 4.45 and 4.46. In an actual environment, where a file or a directory can be shared by many programmers/users, this duplication or redundancy can be expensive in terms of disk space. The fact that the file name PURCHASING is duplicated cannot be helped, but all the other details, apart from the name, are unnecessarily repeated, and this is not insignificant (refer to Fig. 4.25). This redundancy is expensive from another point of view as well. When you delete a file or a directory, how can the Operating System decide when to actually free the blocks allocated to that file? For instance, even if you delete the PURCHASING directory from both the PRODUCTION as well as the COSTING directories, how can the Operating System take the decision of actually freeing the blocks unless it goes through all the directories to ensure that PURCHASING does not belong to a third directory as well? Essentially, how can it use the idea of the usage count that we talked about earlier? To solve these problems, normally the file directory entry is split into two parts:
- the Basic File Directory (BFD), and
- the Symbolic File Directory (SFD).
The BFD gives the basic information about each physical file, including its type, size, access rights, address and a lot of other things such as various dates, etc., as mentioned and shown in Fig. 4.25 (in UNIX, each BFD entry is called an i-node and the BFD itself is called the inode list). On the other hand, the SFD gives you only the file name and the pointer to the BFD entry (by BFD Entry Number or BEN) for that file. Therefore, if the same physical file, with two same or different names in two different directories, is shared by these directories, there will be two SFD entries with the appropriate (same or different) file names, but both pointing towards the same BFD entry, i.e. both of these will have the same BEN. The usage count in the BFD entry will be 2 in this case. The SFD entry does not have a usage count. There is one and only one BFD entry for every physical file or directory in the system, regardless of its sharing status. If a file or a directory is shared, the usage count for that file will be more than 1 and there will be that many entries in different symbolic file directories (SFDs) pointing towards the same BFD entry, i.e. with the same BEN. For the directory structure shown in Fig. 4.43, we have shown the BFD and SFD entries in Figs. 4.47 and 4.48 respectively. We will now study these in a little more detail.

The BFD contains one entry for every file. The fields in the BFD entry are as follows:
(i) BFD Entry Number (BEN) or File ID: This is a serial number for each entry, starting from 0. Each file has a unique BEN. As this is a serial number, it actually need not be stored in the BFD. Knowing the length of each BFD entry, the Operating System can access any BFD entry directly, given its BEN. This field is still shown as a part of the BFD only for better comprehension.
(ii) Type: This denotes whether a file is a data file (DAT), a directory (DIR) or the Basic File Directory (BFD) itself (the very first entry in the BFD, with BEN = 0).
(iii) Size: This refers to the file size in blocks.
(iv) Usage Count: This refers to the number of directories sharing this file/directory. If this is 1, it is not shared, but if it is more than 1, it means that sharing is involved.
(v) Access Rights: This gives information about who is allowed to perform what kind of operations (read, write, execute, etc.) on the file/directory.
(vi) Other Information: Other information that the BFD can hold is as shown in Fig. 4.25 and is discussed below.
(a) File dates of creation, last usage, last modification: These are self-explanatory. The Operating System updates these any time anybody touches the file in any way. This information is useful in deciding questions such as "When was the program modified last?", and therefore, it enhances security.
(b) Number of processes having this file open: While on the disk, this field is 0. When a process opens this file, the BFD entry for that file is copied in memory, and this field is set to 1. Any time another process also opens it, the BFD entry for that file does not need to be copied from the disk; the Operating System only increments this field by 1. This field is, therefore, used like the usage count in the BFD, except that it is used at run time during execution. When a process closes the file, the Operating System decrements this field by 1, and it removes the in-memory BFD entry only if this field becomes 0.
(c) Record length: This is maintained by an Operating System only if it recognizes a logical record as a valid entity. If maintained, this is used in calculating the relative byte number from the request of the user process to read a logical record. In UNIX, this field does not exist.
(d) File key length, position, etc.: This is maintained by an Operating System basically for non-sequential access. Typically, this is done when access methods such as the Indexed Sequential Access Method (ISAM) are a part of the Operating System. This is not a very commonly followed practice today. The user has to specify the length and position of the key within a record to the Operating System, which is then used by the Operating System for building indexes automatically as records are added.
(e) File organization: Again, this is maintained depending upon the support level of the Operating System (refer to Sec. 4.2.3).
(vii) Address: This gives the block number of the first data block or element or the index at an appropriate level, depending upon the disk space allocation method used and the file size. For instance, in contiguous and chained allocations, it is the block number of the first block. In indexed allocation, it is the block number of either the data or the index block at the appropriate level, depending upon the file size. This has been discussed earlier.
Therefore, the BFD is a directory containing information about all the files and directories in the file system. The BFD itself is normally kept at a predefined place on the disk, which the Operating System knows about; it may be said to be hardcoded. The first entry in the BFD is for the BFD itself. This is basically kept for the sake of completeness, because the BFD itself is kept as a file. We will ignore this entry (with BEN = 0). The second entry in the BFD, with BEN = 1, is for the root directory, and this also is fixed (in UNIX, it is BEN or inode number = 2). When you want to read any file, you have to read the root directory first. Because its place is fixed, the Operating System can easily read it and bring it into the main memory. The entry in our example (Fig. 4.47) says that it is a directory (type = DIR), and it occupies 1 block (size = 1) at block number 2 (Address = 2). If the Operating System actually wants to read it, it will have to read block 2 with the necessary address translation. After reading this, the Operating System will get the contents of the root directory as shown in the SFD of Fig. 4.48 (a). After the first two entries, the BFD has one entry for each unique file/directory.
The SFD, on the other hand, consists of a simple listing of all the files/directories in a specific directory and their corresponding BENs. The format of each entry in the SFD is known to the Operating System.
For instance, the Operating System knows the length of each entry in the SFD, and within it, how many bytes are reserved for the symbolic name and how many are reserved for BEN. Using this information, the Operating System can interpret each entry. Let us illustrate this by a simple example. If we refer to Fig. 4.43, we will realize that the root directory consists of a VENDOR file, a PRODUCTION directory and

the COSTING directory. If we compare this with the SFD for the root directory in Fig. 4.48 (a), we will realize that the SFD reflects exactly the same. Let us assume that we want to access the VENDOR file. The SFD for the root directory tells us that the BEN for the VENDOR file is 3. If we access the BFD entry with BEN = 3, as shown in Fig. 4.47, we get the details for the VENDOR file, which say that this file consists of 200 blocks, starting at block number 501, and that it is being shared from two directories, as suggested by the usage count. This tallies with Fig. 4.43. The BFD entry also gives the access rights information. The Operating System can verify this access rights information and, if the user process is allowed to access that file for reading, the Operating System can access block 501 and start reading the VENDOR file on behalf of that user process. As we have seen, this block, denoting the address, could be a data block itself or an index block pointing towards the data blocks. The procedure for reading subsequent blocks has already been discussed. What is true about the SFD for the root directory is also true about the SFDs for the others. For instance, in the SFD for the root (Fig. 4.48 (a)), there is an entry for PRODUCTION with BEN = 4. If we refer to the BFD entry with BEN = 4 in Fig. 4.47, we are told that it is an unshared directory of 1 block at block number 34. If we read block number 34, we get the SFD for PRODUCTION. This is shown in Fig. 4.48 (b). It shows that this directory consists of a file BUDGET and the directories WORKS-ORDER and PURCHASING. This tallies with the file structure shown in Fig. 4.43. We can follow this procedure iteratively to traverse any path, as we shall study in an example that follows. Each SFD also has two other entries, marked as "." and "..". The entry with "." conveys the BEN for the current directory. For instance, for the root it is 1 and for PRODUCTION it is 4. The entry with ".." tells us the BEN for the parent directory, which is one level up. For the PRODUCTION directory, it is 1, meaning thereby that if the Operating System has to go one level up in the path name from the /PRODUCTION directory, i.e. up to the root (/) directory, it will have to read the BFD entry with BEN = 1. This, as we know, is the BEN for the root directory, which is its parent directory. Each directory has a parent directory and therefore a valid ".." entry in the SFD. But this is not true about the root, because you cannot traverse one level up from the root. Therefore, conceptually, the parent of the root is taken as the root itself. Therefore, in the SFD for the root, the ".." entry has the BEN for the parent directory as 1, i.e. the same as for the root directory itself. The ".." entry is used in moving from one working directory to another. For instance, if we have to move from WORKS-ORDER to PURCHASING as the working directory, we have to use the relative path name ../PURCHASING. It is at the time of translating this path name into the actual movement from one working directory to another that the entry ".." is used. For instance, in the WORKS-ORDER directory, we have a ".." entry with BEN = 4, as shown in Fig. 4.48 (e). This is the BEN of the BFD entry for the PRODUCTION directory, which is its parent directory, as depicted in Fig. 4.43. If we now read the SFD of the PRODUCTION directory (the BFD entry with BEN = 4 tells us that it occupies block 34), we will find an entry for the PURCHASING directory, as shown in Fig. 4.48 (b). Its BEN is found to be 9. This is how we have resolved the relative path name ../PURCHASING.
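The shape of these two directories can be summarized in a small C sketch. The field names and sizes below are purely illustrative (a real BFD entry carries many more fields, as Fig. 4.25 shows), and the sample data is taken from Figs. 4.47 and 4.48 (a).

    #include <stdio.h>
    #include <string.h>

    struct bfd_entry {                 /* one per physical file or directory */
        char type[4];                  /* "DIR" or "DAT"                     */
        int  size_blocks;
        int  usage_count;
        int  address;                  /* first data/index block number      */
    };

    struct sfd_entry {                 /* one per name inside a directory    */
        char name[16];
        int  ben;                      /* index into the BFD                 */
    };

    /* Search one directory's SFD for a name and return its BEN, or -1. */
    static int lookup(const struct sfd_entry *sfd, int n, const char *name)
    {
        for (int i = 0; i < n; i++)
            if (strcmp(sfd[i].name, name) == 0)
                return sfd[i].ben;
        return -1;
    }

    int main(void)
    {
        /* A few BFD entries and the root SFD, as in Figs. 4.47 and 4.48 (a). */
        struct bfd_entry bfd[16] = {
            [1] = { "DIR", 1, 1, 2 },        /* root directory                 */
            [3] = { "DAT", 200, 2, 501 },    /* VENDOR, shared from two places */
            [4] = { "DIR", 1, 1, 34 },       /* PRODUCTION                     */
            [5] = { "DIR", 1, 1, 75 },       /* COSTING                        */
        };
        struct sfd_entry root[] = {
            { ".", 1 }, { "..", 1 },
            { "VENDOR", 3 }, { "PRODUCTION", 4 }, { "COSTING", 5 },
        };

        int ben = lookup(root, (int)(sizeof root / sizeof root[0]), "VENDOR");
        printf("VENDOR: BEN %d, %d blocks at block %d, usage count %d\n",
               ben, bfd[ben].size_blocks, bfd[ben].address, bfd[ben].usage_count);
        return 0;
    }

lookup() is exactly the table search that the SFD discussion a little later refers to: given a name, it yields the BEN, after which the BFD entry supplies everything else about the file.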
You will notice that some SFDs in the figure have zero values as symbolic file names. These are empty entries. This means that a file was once created under that directory, but was deleted thereafter. The entry is still maintained as an empty entry with zero values. The actual values used for the empty entries vary with the Operating System. Similarly, the Operating System has to have various methods to manage the space for the BFD. The Operating System can allocate a fixed number of blocks for storing the BFD. This also puts an upper limit on the number of files/directories that can exist in the file system. At any time, some BFD entries will be allocated to some files or directories, and the remaining will be free. The Operating System will have to maintain some kind of a data structure like linked lists or indexes to keep track of the free BFD entries, so that

when a file is being created, a free entry can be allocated to maintain its details. After the allocation, that entry will obviously have to be removed from the free list. If the file is deleted, the corresponding BFD entry will be added to the free pool. The Operating System also has to have techniques to manage the space allocated to the BFDs and the SFDs themselves. Some Operating Systems maintain one consolidated record for each BFD entry, whereas some others (e.g. AOS/VS) split it into various logical units, such as one for various dates, another for access controls, etc. Each BFD entry can occupy one or more blocks on the disk, depending upon the level of detail maintained in the BFD and the way it is maintained. The Operating System, therefore, has to have various ways to manage these BFDs and the space allocated to them on the disk. The Operating System has to have various algorithms to maintain the entries in the SFDs too. For instance, the actual number of files in a directory can be far more than shown in the SFDs of Fig. 4.48. In such a case, given a file name, the Operating System has to perform a table search to locate that entry in the relevant SFD. This search can be very time consuming and expensive (remember that the SFD/BFD accesses take place very often during the execution of a process and therefore, these access/search techniques have to be extremely efficient). One way is to keep the file names in an alphabetic sequence and use a binary chop method while searching for a specific name. Another would be to use some kind of a hashing algorithm to locate the correct entry. But these methods are more complex and they may not be the most efficient in all situations! UNIX does not use any hashing algorithm at all. It uses a straight table search whereby each entry in a table is compared. This method can be very slow for directories having a very large number of files. Despite this, it is used due to its basic simplicity.
We can now trace the steps involved in traversing the absolute path /COSTING/PURCHASING/PR after a user instructs the Operating System, through a command, to open that file.
(i) The command interpreter program (e.g. the shell) interpreting your command parses this instruction. It finds "/" first. It knows that this means the root directory, and that it corresponds to the BFD entry with BEN = 1. This is fixed and predetermined.
(ii) It reads the BFD entry with BEN = 1. It verifies the access rights and reads the block number for the SFD of the root, i.e. block number 2. The contents of this block are as shown in Fig. 4.48 (a).
(iii) It now picks up (parses) the name until the next "/" in the command. In this case, it finds "COSTING".
(iv) It searches the SFD for the root for "COSTING" as a file/directory name. It finds it, and stores its BEN, which is 5 (refer to Fig. 4.48 (a)).
(v) It now accesses the BFD entry with BEN = 5. It finds that it is an unshared directory of one block at block number 75.
(vi) It verifies the access rights and reads block number 75 for the SFD of COSTING. This is shown in Fig. 4.48 (c).
(vii) It now picks up (parses) the name in the command until the next "/". It finds "PURCHASING" in this case.
(viii) It now searches the SFD for COSTING for "PURCHASING" as the file/directory name. It finds it and stores its BEN, which is 9 (refer to Fig. 4.48 (c)).
(ix) It now accesses the BFD entry with BEN = 9. It finds that PURCHASING is a shared directory of one block at block number 109.
(x) It verifies the access rights and reads block 109 for the SFD of PURCHASING. The contents of this block are as shown in Fig. 4.48 (d).

(xi) It now picks up (parses) the next name until the end. In this case, it is "PR".
(xii) It now searches the SFD of PURCHASING for the name PR. It finds it and again stores its BEN, which is 13 (refer to Fig. 4.48 (d)).
(xiii) Now it reads the BFD entry with BEN = 13. It says that PR is an unshared data file of 150 blocks starting at block number 2001.
(xiv) It now verifies the access rights for that process to open that file in the desired mode (read, write, etc.) and then proceeds to open it. After opening the file, the data blocks for that file can be accessed one after the other, starting from 2001, as seen earlier.
This procedure may appear quite cumbersome. If the Operating System had to go through all these chains and read all these blocks from the disk every time it subsequently accessed the BFD entry for PR, it would be very wasteful. It is for this reason that the BFD entry is copied into the memory once and for all. This entry is removed from the memory only after the file is closed by all the processes accessing it. When a process reads a file after opening it, does the Operating System go through all this procedure of parsing the file name again, before knowing which BFD entry to refer to? Certainly not. In order to avoid this repetitive process, the Operating System call to open a file allocates a specific, unique Active File Number (AFN), like a channel number, for each file. When the BFD entry is copied into the memory, the Operating System links this BFD entry to this Active File Number (AFN). Therefore, at any time, there will be in-memory BFD entries with assigned AFN values 0, 1, 2, ..., etc. Given the AFN, one can directly access the in-memory BFD entry. Needless to say, the Operating System will have to reserve a fixed area to store these in-memory BFD entries with AFNs from 0 to n. The Operating System will have to have routines to manage the space allocated to them as well as the space for unused (free) entries. The subsequent Read, Write or even Delete system calls also use the same AFN, so that the Operating System can directly access the required BFD entry in the memory without having to parse the entire file name, let alone reading the intermediate blocks.
Let us assume that we want to delete the file given by the absolute path name /COSTING/PURCHASING/PR. For this, we give an appropriate command at the terminal for the command interpreter. The following now happens:
(i) The Operating System converts the pathname of this file into the BEN using the data structures described previously. As we know, for this file, BEN = 13. It now checks the access rights for this file (in the BFD entry with BEN = 13) to ensure that the user has the right to delete this file. We will talk more about access rights later.
(ii) If the user does not have this right, it gives an error message and exits. If the user has the right to delete, then it proceeds as follows:
- It removes the entry corresponding to the name PR from the SFD for the directory PURCHASING. The Operating System may use the same hashing or directory search techniques that have been used for their maintenance to begin with. After removing the entry, the Operating System may maintain zeroes, spaces or some special characters to signify a free entry in the SFD.
- It subtracts 1 from the usage count in the BFD entry for this file, i.e. the one with BEN = 13. Now, if the usage count does not become 0, it takes no action and exits the routine. However, if it does become 0 (which it does in this case), it does the following.
- It frees the blocks allocated to this file and adds them to the free blocks pool maintained as lists, indexes or bit maps. The Operating System uses the Address field in the BFD entry to traverse through the chains or indexes (as per the allocation method) to access the block numbers allocated to that file before it can do this job.
- It removes the BFD entry for that file.

It should be noted that all these changes in the BFD are made in the in-memory BFD images first, using the various data structures described above. Periodically (for better recovery), and finally at shut down, the updated data structures are copied onto the disk at the appropriate locations in the on-disk BFD, so that the next time the system is used, you get the updated values. An algorithm to create a file under a directory is almost the reverse of this and can be easily imagined. An algorithm for creating a hard link to a file/directory essentially parses the path name, locates the BFD entry and increments its usage count. It also inserts the file/directory name in the required SFD with the same BEN as that of the file being linked.
If the user wants to go up in a directory, the entry with ".." in the SFD can be used. For instance, if we want to traverse from PURCHASING -> COSTING -> BUDGET, using the relative path name ../BUDGET when we are in the PURCHASING directory, the following algorithm is executed:
(i) The Operating System will parse the path name.
(ii) The Operating System will read the ".." entry in the SFD of PURCHASING, as shown in Fig. 4.48 (d). It gives BEN = 5. This is the BEN of the parent directory (which in this case is the COSTING directory).
(iii) It will access the BFD entry with BEN = 5.
(iv) It will learn that it is a directory starting at block 75. It will verify its access rights and then read that block to get the SFD of COSTING. The contents of the SFD for COSTING are as shown in Fig. 4.48 (c).
(v) It will now perform the search for the name BUDGET in the SFD for COSTING and will store its BEN, which is 10.
(vi) Having located the desired file, it will proceed to take any further permissible actions.
The data structures that are maintained in the memory have to take care of various requirements. They have to take into account the following possibilities:
- One process may have many files open at a time, and in different modes.
- One file may be opened by many processes at a time, and in different modes.
For instance, a Customer file may be opened by three processes simultaneously. One may be printing the name and address labels sequentially. Another may be printing the statements of accounts, again sequentially, and the third may be answering an on-line query on the balances. It is necessary to maintain separate cursors to denote the current position in the file for each process, to differentiate an instruction to read the next record in each case, so that the correct records are accessed. The exact description of these data structures and algorithms is beyond the scope of this text, though it is not difficult to imagine them. This brings us to the end of our discussion about the File Systems. We need to examine the DDs more closely to complete the picture.
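Before we move on to the device drivers, the Active File Number idea described above can be summarized in a short sketch. The table layout and the function names are invented for illustration; as just noted, a real Operating System would additionally keep a per-process cursor (current position) outside this shared table.

    #include <stdio.h>

    #define MAX_AFN 8

    struct in_memory_bfd {
        int in_use;       /* is this AFN slot allocated?               */
        int ben;          /* which on-disk BFD entry it mirrors        */
        int open_count;   /* processes currently having the file open  */
    };

    static struct in_memory_bfd afn_table[MAX_AFN];

    /* "Open": assume the path has already been resolved to a BEN, as in the
       traversal above.  Reuse the slot if the file is already open in memory,
       otherwise allocate a free AFN.  Returns the AFN, or -1 if none is free. */
    static int file_open(int ben)
    {
        int i, free_slot = -1;
        for (i = 0; i < MAX_AFN; i++) {
            if (afn_table[i].in_use && afn_table[i].ben == ben) {
                afn_table[i].open_count++;        /* entry already in memory */
                return i;
            }
            if (!afn_table[i].in_use && free_slot < 0)
                free_slot = i;
        }
        if (free_slot < 0)
            return -1;
        afn_table[free_slot].in_use = 1;          /* copy the BFD entry from disk */
        afn_table[free_slot].ben = ben;
        afn_table[free_slot].open_count = 1;
        return free_slot;
    }

    /* "Close": drop the in-memory entry only when the last opener closes. */
    static void file_close(int afn)
    {
        if (--afn_table[afn].open_count == 0)
            afn_table[afn].in_use = 0;
    }

    int main(void)
    {
        int a = file_open(13);    /* e.g. /COSTING/PURCHASING/PR, BEN = 13 */
        int b = file_open(13);    /* a second process opens the same file  */
        printf("AFN %d and %d, open count %d\n", a, b, afn_table[a].open_count);
        file_close(b);
        file_close(a);
        return 0;
    }

Here, the second open of the same file merely increments the in-memory count, and the entry disappears only when the last process closes it, mirroring the "number of processes having this file open" field described earlier.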

A file is a collection of data, which is stored on secondary storage devices like hard disks, magnetic tapes, CD-ROMs, etc. When data is being processed, that data is present in the primary memory or RAM of the computer. Primary memory or RAM is volatile and cannot be used for permanent data storage. In almost all computing applications, we require data to be stored permanently for future use. Processes require data for processing, and these processes execute in the primary memory. Primary memory is limited in size, and large data cannot be stored in it. Even if a particular process is able to accommodate large data, other processes may not get enough space for their execution. Primary memory is meant for data processing and not for storage. Hence, files are used to store large data permanently on the secondary devices, even after the process has completed its execution (either successfully or unsuccessfully).

File organization means how a file is stored on the storage devices. Access management describes how a file is accessed from the storage devices for processing. Processing may sometimes require all the records, a set of records, a particular record, the first record or the last record, etc. Early Operating Systems provided only one type of file access, which is called "sequential access". Sequential access means all the bytes of the file are read sequentially (from the beginning to the end of the file), one by one. It is not possible to skip particular records and jump to any specific record. In other words, we cannot select any specific record by skipping, or without reading, the intermediate records in that file. Sequential files are widely used when the storage device is a magnetic tape, since magnetic tapes provide only sequential access. A magnetic/optical disk provides access to any record directly. It allows us to choose any bytes or records out of order. It is also possible to choose a record by using a key rather than the position of the record, and to move directly to a particular position by specifying the byte number. This access type is called "random access". Random access is the requirement of many applications, and all DBMS and RDBMS packages use the random access mechanism.
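The difference between the two access types is easy to see with ordinary C stream I/O on a file of fixed-length records. The file name customers.dat and the record layout below are invented for the illustration.

    #include <stdio.h>

    struct record { long key; char data[60]; };     /* fixed-length record */

    int main(void)
    {
        struct record r;
        FILE *fp = fopen("customers.dat", "rb");    /* illustrative file   */
        if (fp == NULL)
            return 1;

        /* Sequential access: read every record from beginning to end. */
        while (fread(&r, sizeof r, 1, fp) == 1)
            printf("key %ld\n", r.key);

        /* Random access: jump straight to record number 500 by byte
           position, without reading the intermediate records.          */
        if (fseek(fp, (long)(500 * sizeof r), SEEK_SET) == 0 &&
            fread(&r, sizeof r, 1, fp) == 1)
            printf("record 500 has key %ld\n", r.key);

        fclose(fp);
        return 0;
    }

The fseek() call is what makes the second read "random": the 500 intermediate records are never touched.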

When an Operating System allows multiple users to work simultaneously, it is quite possible that more than one user will demand the same file. When many users are just reading the same file, there is no problem; but if more than one user is writing to or updating the same file (here, the same file means a file with the same name at the same location or path), this can lead to problems. In such situations, the data of one user may be stored while the data of the other users may be deleted or overwritten. We cannot exactly predict what will happen to the data of that file. In a single user Operating System, this will not happen, since there will be only one user working at any given time. A multiuser Operating System must provide appropriate ways to deal with all such incidents, i.e. a file sharing mechanism. In a multiuser environment, the Operating System has to maintain more file/directory attributes than a single user Operating System. A multiuser Operating System maintains attributes such as the owner of the file/directory. The owner of the file is able to perform all operations on that file. There is also a provision to maintain the other users of that file/directory. These users can perform a subset of operations (e.g. read the file), but they are not able to perform all operations on the file (e.g. write to the file or delete it) as the owner can.

The owner ID and the IDs of the other users or members of a given file are stored with the other file attributes. When a user requests any operation on a file, the user ID is compared with the owner attribute of the file to determine whether the user is the owner or not; likewise for the other users. The result of this comparison determines which permissions are applicable to that user.
Files are stored in directories or folders. A directory is a collection of files, and each file must belong to a directory. The following are some directory implementations used by Operating Systems.
In a single level directory system, there is only one directory at the top, which is also called the root directory. All the files are present in the root directory only, and users cannot create subdirectories under the root directory. In a single level directory system, the same file name cannot be used more than once. Even if the system allows a file to be created with an existing file name, the old file with the same name will be destroyed first. Also, all the files are visible to all the users, though they may not be able to view each other's files. It is not possible to keep files separately, depending on their classification, under various subdirectories.
In a two-level directory system, the users are allowed to create a directory directly inside the root directory. However, once such a directory is created by the user, the user cannot create subdirectories under that directory. This design helps users to keep their files separately under their own directories. It allows having a file with the same name more than once on the disk, but under different user directories. In this structure, there should be a system directory to access all the system utilities. Otherwise, all the users would need to copy the system utilities into their own directories, which would result in a wastage of disk space.
The hierarchical (tree) structure goes beyond the two-level system and allows users to create a directory under the root directory and also to create subdirectories under it. Here, the user can create many subdirectories and then maintain different files in different directories based on the types of the files.
In this directory structure, files are identified and accessed by their locations. The file location is described using a path. There are two types of paths. An absolute path name describes the file name and location, considering the root directory as the base directory, e.g. /usr/david/salary.doc - this means that the salary.doc file is present in the david directory, david is a subdirectory of the usr directory, and usr is present at the root of the disk. In the relative path convention, the file name is described considering a user's specific directory as the base or reference directory. The base directory can be the user's current working directory. For example, if /usr/david/prg/payroll is the directory structure and prg is the base directory, then to access the "bonuscal.c" file, we can use the path payroll/bonuscal.c.
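Returning to the owner comparison described at the beginning of this discussion, a minimal sketch of it is shown below. The permission bits and the structure are invented for the illustration and use only two classes of users (owner and others), whereas real systems usually add at least a group or member class.

    #include <stdio.h>

    #define PERM_READ  4
    #define PERM_WRITE 2

    struct file_attr {
        int owner_id;
        int owner_perms;      /* permissions that apply to the owner        */
        int other_perms;      /* permissions that apply to everybody else   */
    };

    /* Decide whether the requesting user may perform the requested operation,
       by first deciding which permission set applies to him.                 */
    static int allowed(const struct file_attr *f, int user_id, int wanted)
    {
        int perms = (user_id == f->owner_id) ? f->owner_perms : f->other_perms;
        return (perms & wanted) == wanted;
    }

    int main(void)
    {
        struct file_attr salary = { 101, PERM_READ | PERM_WRITE, PERM_READ };

        printf("owner write: %s\n", allowed(&salary, 101, PERM_WRITE) ? "yes" : "no");
        printf("other write: %s\n", allowed(&salary, 205, PERM_WRITE) ? "yes" : "no");
        printf("other read : %s\n", allowed(&salary, 205, PERM_READ)  ? "yes" : "no");
        return 0;
    }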

The usual operations that an Operating System provides on directories are as follows.
- Create: create a new directory. The name must be unique under that particular directory. When a new directory has just been created, it contains only the dot and dotdot entries.
- Delete: delete an empty directory. A directory which contains files and subdirectories cannot be deleted. When a directory contains only dot and dotdot, the directory is considered to be empty.
- Open: open a directory in order to browse its contents.
- Close: close an opened directory.
- Read: read/display the contents of an opened directory.
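In a UNIX-like environment, these operations map onto the standard POSIX calls mkdir, opendir, readdir, closedir and rmdir. The short program below simply exercises them on an illustrative directory name; it is one common interface to the operations above, not the only one.

    #include <stdio.h>
    #include <dirent.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        DIR *d;
        struct dirent *entry;

        /* Create a directory (the name must not already exist in its parent). */
        if (mkdir("reports", 0755) != 0)
            perror("mkdir");

        /* Open and read it: a newly created directory shows only "." and "..". */
        d = opendir("reports");
        if (d != NULL) {
            while ((entry = readdir(d)) != NULL)
                printf("%s\n", entry->d_name);
            closedir(d);
        }

        /* Delete it: rmdir succeeds only because the directory is empty. */
        if (rmdir("reports") != 0)
            perror("rmdir");
        return 0;
    }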

Files are stored on the disk. As the disk space is limited, we need to reuse the space from deleted files for new files. So, the management of disk space is a major concern to file system designers. To keep track of free disk space, the system maintains a free-space list. The free-space list records all the free disk blocks that are not allocated to any file or directory. To create a file, we search the free-space list for the required amount of space, then that space is allocated to a new file. This space is then removed from the free-space list. Conversely, when we delete a file, its disk space is added to the free-space list.

The free-space list is implemented as a bit-map or bit vector. Each block is represented by 1 bit. If the block is free, the bit is 1; if the block is allocated the bit is 0. The main advantage of this approach is its efficiency in finding the free blocks on the disk.
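A minimal sketch of such a bit vector, using the convention just stated (bit = 1 means the block is free), might look as follows. The sizes and helper names are invented for the illustration.

    #include <stdio.h>
    #include <string.h>

    #define NUM_BLOCKS 64
    static unsigned char bitmap[NUM_BLOCKS / 8];   /* one bit per disk block */

    static int  is_free(int b)  { return (bitmap[b / 8] >> (b % 8)) & 1; }
    static void set_free(int b) { bitmap[b / 8] |=  (unsigned char)(1u << (b % 8)); }
    static void set_used(int b) { bitmap[b / 8] &= (unsigned char)~(1u << (b % 8)); }

    /* Find the first free block (bit = 1), mark it allocated and return it. */
    static int allocate_block(void)
    {
        for (int b = 0; b < NUM_BLOCKS; b++)
            if (is_free(b)) {
                set_used(b);
                return b;
            }
        return -1;                                 /* disk full              */
    }

    int main(void)
    {
        memset(bitmap, 0xFF, sizeof bitmap);       /* initially all free     */
        set_used(0); set_used(1);                  /* e.g. boot block, BFD   */

        int b = allocate_block();
        printf("allocated block %d\n", b);         /* prints 2               */
        if (b >= 0)
            set_free(b);                           /* file deleted: reclaim  */
        return 0;
    }

Finding a free block is then just a scan for the first set bit, which is why the scheme is efficient for allocation; its cost is the space for one bit per block.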

CPUs are getting faster day by day, and memory sizes are also getting bigger and bigger. The same is true of disk space. But one parameter that is not improving along with all these changes is the disk seek time. The log structured file system is aimed at reducing the cost of disk seeks and improving the overall disk write operation. A log structured file system reduces disk-memory trips for fetching data from the disk and loading it into memory: since we have an increased memory size, we can load all the required data into memory and use that for processing. When we execute a write operation, the time taken to complete that operation will not be exactly the same as the time taken for the actual write; there are other factors such as seek time, rotational delay, etc. Moreover, one write operation involves changes to the disk at various places, such as the i-node entry, the FCB, the directory block, etc. A delay or failure in even one of these operations can leave these structures inconsistent and cause a big problem. This problem can be solved by maintaining all the write operations in a log file and then committing all the write operations periodically.
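A heavily simplified sketch of this idea follows: each update is only appended to an in-memory log, and a periodic commit applies the whole batch in one pass. A real log structured file system writes complete segments to disk sequentially and keeps maps to locate the latest copy of each block; all the names below are invented for the illustration.

    #include <stdio.h>

    #define LOG_SIZE 16

    struct log_record {          /* one metadata/data update to be applied  */
        int  block_no;           /* which disk block the update touches     */
        char description[32];
    };

    static struct log_record wlog[LOG_SIZE];
    static int log_count = 0;

    /* Append an update to the log: cheap, no seek to the block's final
       location is done yet.                                               */
    static void log_write(int block_no, const char *what)
    {
        if (log_count < LOG_SIZE) {
            wlog[log_count].block_no = block_no;
            snprintf(wlog[log_count].description,
                     sizeof wlog[log_count].description, "%s", what);
            log_count++;
        }
    }

    /* Commit: apply every logged update in one sequential pass, then clear
       the log.  Here "apply" is just a printf.                             */
    static void commit(void)
    {
        for (int i = 0; i < log_count; i++)
            printf("apply to block %d: %s\n",
                   wlog[i].block_no, wlog[i].description);
        log_count = 0;
    }

    int main(void)
    {
        /* One logical file write touches several structures, as noted above. */
        log_write(7,   "update i-node entry");
        log_write(34,  "update directory block");
        log_write(812, "write data block");
        commit();                 /* done periodically, not per operation    */
        return 0;
    }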


We have seen the functions of a disk controller in the previous sections. We have also seen various instructions that this controller understands, i.e. the instruction set of the controller, and how the DD uses these instructions to construct various routines for the Read and Write operations. In this section, we will look at other aspects of the DD. In most computers, including the IBM-PC family, the devices are attached to the CPU and the main memory through a bus and interfaces or controllers, as depicted in Fig. 5.1. The figure shows a serial interface for the terminals, a parallel interface for the printer and the DMA for the disk. Let us illustrate the connections by another figure. Figure 5.2 shows the motherboard of a microcomputer, on which all the electronics required for the CPU and various related components is mounted. A bus links the CPU to a series of slots that are used to attach other boards to the system. Figure 5.3 shows the same scheme from a different angle for further clarification. Figure 5.4 shows various slots used for different interfaces. Notice how the memory and the various interfaces are attached to the bus. Therefore, when a word of 16 bits is to be transferred from a desired memory location to a device (say a disk), the data is deposited on the data bus, which carries the data to the required interface. The data reaches the device through its corresponding interface, which is the same as a controller. The interface can be a parallel or a serial one. If it is a serial interface, such as the one for the keyboard, the connection between

the device and the interface for the data bits is by only two wires, one for sending and one for receiving the data (governed by the RS-232C or perhaps the RS-449 protocol). The connection between the interface and the bus is by 16 parallel wires for data. Therefore, some conversion is necessary. The interface has the hardware (such as shifters) to convert the parallel data into serial data and vice versa. In the case of the keyboard, when a key is pressed, an ASCII or EBCDIC code is generated, but it is sent bit by bit from the keyboard to the interface. The interface shifts one bit each time, to make room for the next arriving bit, until a word (i.e. 16 bits) in the memory of the interface is full, after which the data is sent in parallel over 16 wires from the interface to the main memory by the bus (if the bus is a 16-bit bus), as depicted in Fig. 5.4. The data is sent to the kernel area of the Operating System first and subsequently to the memory of the user process controlling that terminal. This is what happens while executing the scanf function call in C or the ACCEPT statement in COBOL. A block diagram of this scheme is depicted in Fig. 5.5.
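The serial-to-parallel conversion done by the shifter can be modelled in software as shown below. This is only a conceptual illustration of the mechanism just described, not real device-level code; the bit ordering and the 16-bit word size are assumptions made for the example.

    #include <stdio.h>

    static unsigned short word = 0;   /* the interface's 16-bit data register */
    static int bits_received = 0;

    /* Called once per arriving serial bit; returns 1 when a full 16-bit
       word is ready to be placed on the parallel data bus.                 */
    static int shift_in(int bit)
    {
        word = (unsigned short)((word << 1) | (bit & 1));
        if (++bits_received == 16) {
            bits_received = 0;
            return 1;                 /* word complete: transfer in parallel */
        }
        return 0;
    }

    int main(void)
    {
        /* Feed two 8-bit ASCII characters, most significant bit first. */
        int stream[16] = { 0,1,0,0,0,0,0,1,     /* 'A' = 0x41 */
                           0,1,0,0,0,0,1,0 };   /* 'B' = 0x42 */
        for (int i = 0; i < 16; i++)
            if (shift_in(stream[i]))
                printf("word ready: 0x%04X\n", word);   /* prints 0x4142 */
        return 0;
    }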

Controllers require large memory buffers of their own. They also have complicated electronics and therefore,

are expensive. One idea to reduce the cost is to have multiple devices attached to only one controller, as shown in Fig. 5.6. At any time, a controller can control only one device, and therefore, only one device can be active, even if some amount of parallelism is possible due to overlapped seeks. If there are I/O requests for both the devices attached to a controller, one of them will have to wait. All such pending requests for any device are queued in a device request queue by the DD. The DD creates a data structure and has the algorithms to handle this queue. If the response time is very important, these I/O waits have to be reduced, and if one is ready to spend more money, one can have a separate controller for each device. We have already studied the connections between a controller and a device such as a disk drive in Fig. 4.8. In the scheme of a separate controller for each device, this connection will exist between each controller/device pair. In such a case, the drive select input shown in Fig. 4.8 will not be required. This scheme obviously is faster, but it is also more expensive. In some mainframe computers, such as the IBM-370 family (i.e. IBM 370, 43XX, 30XX, etc.), the functions of a controller are very complex, and they are split into two units. One is called a Channel and the other is called a Control Unit (CU). A channel sounds like a wire or a bus, but it is actually a very small computer with the capability of executing only some specific I/O instructions. If you refer to Fig. 5.6, you will notice that one channel can be connected to many controllers and one controller can be connected to many devices. A controller normally controls devices of the same type, but a channel can handle controllers of different types. It is through this hierarchy that the data transfer from/to memory/device finally takes place. It is obvious that there could exist multiple paths between the memory and the devices, as shown in Fig. 5.6. These paths could be symmetrical or asymmetrical, as we shall see. Figure 5.7 shows a symmetrical arrangement. In this arrangement, any device can be reached through any controller and any channel.

We could also have an asymmetrical arrangement, as shown in Fig. 5.8. In this scheme, a device cannot be reached through any controller and channel, but only through specific preassigned paths. Due to multiple paths, the complexity of the DD routine increases, but the response time improves. For instance, if one controller controls two devices, the speed will be much slower than if there were only one controller per device. This is because, in the latter case, there is no question of path management, or of waiting until a controller gets free. This is also clearly shown by many benchmarks.

The Information Management as we know is divided into two parts: the File System and the DD. We can

consider DD, in turn, conceptually to be divided into four submodules: the I/O procedure, the I/O scheduler, the Device handler and the Interrupt Service Routines (ISRs). We will take an example to outline the interconnections among them. Let us warn the reader at this juncture that these four are conceptual submodules. An Operating System designer may split the entire task into any number of submodules - maybe only two or three, or five. The number and names of the modules that we have used are the ones that we have found suitable for explaining this in a step by step fashion. All these submodules are intimately connected. Broadly speaking, the I/O procedure converts the logical block numbers into their physical addresses and manages the paths between all the devices, controllers, channels, etc. It also chooses a path and creates the pending requests if a specific piece of hardware is busy. The I/O scheduler manages the queue of these requests along

with their priorities and schedules one of them. The device handler actually talks to the controller/device and executes the I/O instruction. On completion of the I/O, an interrupt is generated, which is handled by the ISRs. Let us assume that the file system has translated the request from the Application Program to read a logical record into a request to read specific blocks. Now the DD does the following: (a) The I/O procedure translates the block number into the physical addresses (i.e. cylinder, surface, sector numbers) and then creates a pending request on all the elements on the path (i.e. channel, control unit and the device). While doing this, if multiple paths are possible, the I/O procedure chooses the best one available at that time (e.g. where there are minimum pending I/O requests on the channel/CU connecting the device) and adds the current request to the pending queues maintained for all the units on the path such as a device, controller or a channel. (b) An I/O scheduler is a program which executes an infinite loop. Its whole purpose is to pick up the next pending I/O request and schedule it. It is for this purpose that all the I/O requests for the same device from different processes have to be properly maintained and queued. When a request is being serviced, the I/O scheduler goes to sleep. When the device completes an I/O, an interrupt is generated which wakes up the I/O scheduler and sets the device free to handle the next request. From that time onwards, the device continuously generates interrupts at specific time intervals, suggesting that it is free and ready for work. Every time an interrupt is generated by a device, the appropriate Interrupt Service Routine (ISR) for that device starts executing. The ISR activates the I/O scheduler which, in turn, checks whether there are some pending I/O requests for that device. If there are, it organizes them according to its scheduling algorithm and then picks up the next request to be serviced. On scheduling it (i.e. instructing the device handler about it), it goes to sleep only to be woken up by the execution of the ISR again, which is executed on the completion of the I/O request. If there are no pending requests for that device, the scheduler goes to sleep immediately. By this mechanism of generating interrupts continuously at regular interval for a “free” device, the checking on the pending I/O requests is continuously done without keeping the I/O scheduler running and consuming CPU power all the time. If some I/O operation is complete, the ISR, apart from waking up the I/O scheduler, intimates this to the device handler for error checking on the data read in. (c) When the I/O scheduler schedules a request, the Device Handler uses the details of the request such as the addresses on the disk, memory, the number of words to be transferred and constructs the instructions for the disk controller. It then issues these to the controller in the form of a program which the controller understands. In most of the cases, the device handler can construct a full I/O program for the operation such as “read” or “write” and load the entire program in the controller’s memory. After this, the controller takes over and executes the I/O operation as we have studied earlier. The controller sets some hardware bits to denote the success/failure of the operation after it is over, and generates an interrupt. In some schemes, the device handler instructs the controller one instruction at a time and monitors the operation more closely. 
We will assume the former scenario in our subsequent discussion. (d) The controller then actually moves the R/W arms to seek the data, check the sector addresses and read the data to its own memory. It then transfers it to the main memory using DMA as we have studied earlier. (e) After the data is read/written, the hardware itself generates an interrupt. The current instruction of the

executing process is completed first, and then the hardware detects the interrupt and automatically branches to its ISR. This ISR puts the current process to sleep. The ISR also wakes up the I/O scheduler, which deletes the serviced request from all the queues and schedules the next one, before going to sleep again. (f) The device handler checks for any errors in the data read, and if there are none, intimates the I/O procedure to form the logical record for moving it into the memory of the AP. (g) The process for which the record is read/written is now woken up and inserted into the list of ready processes, depending upon its priority. This process is eventually dispatched, at which point it starts executing again. We will now study the functions of these four submodules a little more closely.

In order to perform the path management and to create the pending requests, the I/O procedure maintains the data structures described below.
(i) Channel Control Block (CCB): This data structure maintains the information shown in Fig. 5.9.
(ii) Control Unit Control Block (CUCB): This data structure maintains the information shown in Fig. 5.10.
(iii) Device Control Block (DCB): This data structure maintains the information shown in Fig. 5.11.
In the Device Control Block (DCB), we maintain the device characteristics and the device descriptor to achieve a kind of device independence. The idea is to allow the user to write the I/O routine in a generalized fashion so that it is applicable to any device once the parameters for that device are supplied to that routine. These parameters are the same as the device characteristics in the DCB. Therefore, the idea is that whenever an Operating System wants to perform an I/O for a device, it reads the DCB for that device and extracts these device characteristics. It then invokes the "common I/O routine" and supplies these device characteristics as parameters. The ultimate dream is to be able to have only one common I/O routine. However, this is quite an impractical goal. A via media is to have a common routine for the same types of devices at least. If you study the contents of the DCB, it is easy to imagine how the Operating System would maintain these fields. Most of them could be updated at the time of system generation. But the list of processes waiting for that device changes with time, and is updated at run time.

How does the I/O procedure maintain this list? It does this by creating a data structure called the Input/Output Request Block (IORB) for each process waiting for that device.
(iv) Input/Output Request Block (IORB) This data structure maintains the information shown in Fig. 5.12.
These IORB records are chained together with the DCB for that device. For instance, if there are five processes waiting for I/O on a specific device, there will be five IORB records chained to one another as well as to the DCB, as shown in Fig. 5.13. Similarly, the I/O procedure can create data structures for processes waiting for a control unit or a channel. Again, the CCBs, CUCBs and DCBs can be connected with a pointer chain structure to denote things like the list of CUs connected to a channel, etc. In a situation with no channel, only one controller and one or multiple disks, these structures become very simple. In such a case, the task of the I/O procedure becomes far easier. This is true with a number of modern mid-range machines.
When some data is required to be read from a specific device, the I/O procedure can use the DCB for the device and check whether it is free. Then, it can trace the pointer chains from the DCB to the CUCBs using the field in the DCB "List of CUs connected to this device". It can access these CUCBs one by one, checking their status (free, busy, ...) and choose a CU with a CUCB which is either free or the one with the least number of pending requests. It can use the field in the CUCB "List of processes waiting for this CU" for this purpose. It will now trace all the CCBs connected to that CUCB, and choose the CCB which represents a channel which is either free or the one with the least number of pending requests. This selection procedure is not very simple if one also wants to optimize. For instance, one may choose a controller with the least number of requests, but that controller may be connected to a channel which has a long queue. On the other hand, there could be a controller with a long queue but it could be connected to a channel which is free (at least at that moment). These situations make the decision a complex affair. Having chosen the path, it can create IORBs on all the components and chain them in the way depicted in Fig. 5.13. The figure shows the IORBs connected to a DCB. Similarly, the IORBs for the pending requests for a CU will be connected to the CUCB, and so on. This data structure is maintained in the memory and it is updated as a new request arrives and as it is serviced. One can imagine a certain area in the memory dedicated to contain these IORB records. This area could have a number of slots for the IORB records for all the devices, CUs and channels. The management of free slots, allocation of these slots and linking and unlinking these IORBs to the proper DCBs, CUCBs and CCBs will require complex algorithms which are a part of the I/O procedure.
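To make the discussion a little more concrete, the following C declarations sketch what these four control blocks could look like. The field names, the fixed-size arrays and the use of plain pointers are purely illustrative assumptions; a real Operating System keeps many more fields (device characteristics, error flags, and so on) and these declarations are not taken from any actual system.

struct iorb;
struct ccb;
struct cucb;
struct dcb;

struct iorb {                        /* Input/Output Request Block           */
    int          process_id;         /* requesting process                   */
    int          target_track;       /* physical address after translation   */
    struct iorb *next, *prev;        /* two-way chain of pending requests    */
};

struct ccb {                         /* Channel Control Block                */
    int          status;             /* free, busy, ...                      */
    struct iorb *pending;            /* IORBs waiting for this channel       */
};

struct cucb {                        /* Control Unit Control Block           */
    int          status;
    struct ccb  *channels[4];        /* channels this CU is connected to     */
    struct iorb *pending;            /* IORBs waiting for this control unit  */
};

struct dcb {                         /* Device Control Block                 */
    int          status;
    struct cucb *cus[4];             /* "list of CUs connected to this device" */
    struct iorb *pending;            /* IORBs waiting for this device        */
};

Path management then amounts to walking from a dcb through its cus array to the CUCBs, and from there to the CCBs, choosing at each level a unit that is free or lightly loaded, and linking the same IORB into the pending chains of all the chosen units.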

The I/O scheduler sequences the IORBs chained to a DCB according to the scheduling policy and picks up an IORB from that chain when a request is to be serviced. It then requests the device handler to actually carry out the I/O operation, and deletes that IORB from all the queues of pending requests on the appropriate DCB, CUCB and CCB after setting the appropriate flags in the DCB, CUCB and CCB to denote that they are now busy. The I/O scheduler can use a number of policies for scheduling IORBs. In fact, the IORBs are chained to the DCB and to one another depending upon this policy only. For instance, if the First Come First Served (FCFS) method is used, the new IORB is just added at the end of the queue. Therefore, one can imagine that the I/O procedure prepares the IORB and hands it over to the I/O scheduler. The I/O scheduler then chains

it with the DCB as per its scheduling policy. In many modern systems, scheduling is done by the controller hardware itself thereby making the task of the I/O scheduler simpler and the whole operation much faster. In some others, the software in the Operating System has to carry it out. Requests for I/O operations pending for a device normally originate from different processes with different priorities (we will study process priorities in the chapter on process management later). An apparently rational solution to the problem of scheduling the I/O requests is by the priorities of requesting processes. But, this is generally not followed. The reason for this is simple. The I/O operation is electromechanical and therefore, it is very slow. Hence, the whole emphasis of I/O scheduling is on reducing the disk arm movement, and consequently the seek time. This improves the overall throughput and response time even if it might mean a little extra wait for a process with higher priority. For instance, let us imagine that there are two processes waiting for a device to get free (and therefore, there are two IORBs for that device). Let us also assume that the process with the lower priority wants to read the data from the same track where the R/W arms currently are, therefore, requiring no seek time. This is shown in Fig. 5.14. The figure shows that the R/W arms are currently positioned on track number 5 which is the target track of the IORB with the lower priority (IORB = 0). The IORB with higher priority (IORB = 1) has a target track number = 25, requiring some head movement which is essentially mechanical and therefore, slow. In this case, the I/O for the process with lower priority is performed first, thereby improving the overall

efficiency at a little injustice to some other processes. This methodology also ensures that the processes of lower priorities are not indefinitely postponed. If the Operating System selects the IORB strictly according to the process priority, the total throughput can decrease substantially. Put in other words, the response time for a process with higher priority improves a little, but that of processes with slightly lower priorities deteriorates disproportionately, because I/O is the slowest of the operations and controls the overall throughput. In a Real Time Operating System, where the response time, and not necessarily the throughput, is the main concern, one could think of picking up the IORB according to the process priorities. This will entail having a data item "process priority" in the IORB, maintaining the IORB chains on the DCB in the process priority sequence and adjusting them every time a new process is added or deleted. As we have seen, IORB scheduling is done by some controllers in the hardware itself. In other cases, it is done by the Operating System. Regardless of the implementation, the algorithms are similar and interesting. Let us, therefore, study them. We will study policies such as First Come First Served (FCFS), SCAN, N-SCAN and Circular SCAN (C-SCAN).

Figure 5.15 illustrates the First Come First Served (FCFS) method. The figure depicts four requests, 0, 1, 2 and 3, and shows them at their respective target track numbers. If the

requests have arrived in the sequence of 0, 1, 2 and 3, they are also serviced in that sequence, causing the head movement as shown in the figure, starting with the R/W head position which is assumed to be between the target tracks of requests 2 and 3. FCFS is a 'just' algorithm, because the process that makes a request first is served first, but it may not be the best in terms of reducing the head movement, as is clear from the figure. To implement this in the Operating System, one will have to chain all the IORBs to a DCB in the FIFO sequence. Therefore, the DCB at any time points to the next IORB to be scheduled. After one IORB is dispatched, that IORB is deleted, and the DCB now points to the next one in the time sequence, i.e. the one which came later. When a new IORB arrives, it is added at the end of the chain. In order to reduce the time needed to locate this 'end of the chain', the DCB can also contain a field "Address of the Last IORB" for that device. For recovery purposes, IORB chains, like all other chains, are normally maintained as two-way chains, i.e. each IORB has the address of the next IORB for the same device as well as the address of the previous IORB for the same device. We can now easily construct the algorithms to maintain these IORB queues for this method.
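A minimal sketch of these queue-maintenance algorithms is given below. The structures are illustrative stand-ins for the DCB and IORB described earlier, reduced to just the chaining fields, and do not come from any particular Operating System.

#include <stddef.h>

/* Illustrative stand-ins for the DCB and IORB (only the chaining fields). */
struct iorb { struct iorb *next, *prev; /* ...request details... */ };
struct dcb  { struct iorb *first, *last; };

/* Under FCFS, the next request to schedule is simply the head of the chain. */
struct iorb *fcfs_next(struct dcb *d)
{
    return d->first;
}

/* Add a newly arrived IORB at the end of the chain, using the
   "Address of the Last IORB" field to avoid walking the whole list.        */
void fcfs_add(struct dcb *d, struct iorb *r)
{
    r->next = NULL;
    r->prev = d->last;
    if (d->last) d->last->next = r; else d->first = r;
    d->last = r;
}

/* Delete a serviced IORB from the two-way chain and fix the DCB pointers.  */
void fcfs_delete(struct dcb *d, struct iorb *r)
{
    if (r->prev) r->prev->next = r->next; else d->first = r->next;
    if (r->next) r->next->prev = r->prev; else d->last  = r->prev;
    r->next = r->prev = NULL;
}

The two-way chain is what makes the deletion constant-time: the serviced IORB can be unlinked without searching for its predecessor.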


Figure 5.17 illustrates the SCAN method. In this method, the head starts a scan in some direction, picking up all the requests on the way, and then reverses the direction. While the scan in one direction is going on, new IORBs for new requests can be added to the IORB chains, but only the pending IORBs in the chosen direction are serviced first. If the newly added IORB is lucky, and is properly positioned in the direction of the scan, it will be serviced during the scan. All the remaining IORBs will be serviced in the reverse scan. Figure 5.17 depicts this. We have numbered the requests in the sequence in which they are serviced, so that we can understand the policy better. One of the ways to achieve this is to chain the IORBs to the DCB in both ascending and descending orders of the target track number in the IORB. This can be done by keeping the next and previous pointers in every IORB, and the first and the last pointers in the DCB. Every new arrival has to be carefully added with the adjustment of the chains.
Figure 5.18 illustrates the N-SCAN method. This method allows the addition of new IORBs to the chain only at the end of one scan, before reversing the direction of the traversal of the R/W arms. Therefore, while the scan is going on in one direction, any new additions are kept separately and not added to the chain. They are all added when the direction is reversed. At this time, the seek steps required are calculated for each IORB and then they are chained at proper places. For instance, the figure depicts that IORB 0 is serviced first and then IORB 1 is serviced. We assume that while this is going on, IORB 3 arrives. But since it has arrived after the scan has started, it is ignored. Therefore, IORB 2 is serviced directly. IORB 5 meets the same fate. Both of these are serviced in the reverse scan. Again, the numbers of the requests show the sequence in which the requests are serviced, viz. 0, 1, 2, 3,

4 and 5. This, obviously, has no relation to the sequence in which the requests have arrived. This is done so that we can understand the policy better. In this method, both the next and previous pointers can be used, but the chain readjustments need not be done every time a new IORB is added, because all additions are kept in a separate area (Area-1). After all the IORBs currently connected to the DCB are serviced, these IORBs are deleted from the chain, and the IORBs accumulated in Area-1 by then are chained to the DCB in the sequence of target track numbers. At this time, Area-1 is cleared to enable fresh accumulation of IORBs which arrive late during the next scan. Now all those IORBs which are chained to the DCB are scheduled in the reverse order by using the previous chains, as the figure shows.
Figure 5.19 illustrates the Circular SCAN (C-SCAN) method. This method is like N-SCAN, but it scans only in one direction. In the C-SCAN method too, any new arrivals are grouped together in a separate area. But at the end of the scan, the R/W arm is brought back to the original position and the scan is started again in the same direction. Essentially, at the end of the scan, when the original IORBs chained to the DCB have been serviced and therefore deleted, and the IORBs in Area-1 are chained to the DCB, the sequence of chaining will have to be opposite of what it was in N-SCAN. The same method of having pointers as in SCAN can work, except that you need only the next pointers and therefore, it is in fact simpler to implement than the SCAN method.
There is one interesting point that the Operating System designer has to keep in mind. For instance, a given scheduling philosophy may choose a specific IORB to be scheduled. At this juncture, the device is certainly free (it is the device becoming free that caused the interrupt whose ISR triggered the I/O scheduler, which executes this scheduling algorithm to select the new IORB). But what if the control unit or the channel connected to that device is not free at the same time? This is what makes the combination of the scheduling

algorithm and path management a very complex task, especially when there are multiple channels, control units and devices in the system with multiple connections. In fact, the algorithm may become so complex, with no guarantee of better performance, that it may be worthwhile to just choose the first IORB for which the entire path is free. In systems with only one controller and a single device or multiple devices, these complications do not arise.
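Before leaving disk scheduling, here is a small sketch of the selection step behind SCAN (strictly speaking, of its LOOK-like variant, since the arm reverses as soon as nothing is pending ahead of it). The pending requests are kept in a plain array of target track numbers; IORB chains, priorities and path management are deliberately left out, so this is only an illustration of the idea, not of any real implementation.

#include <stdio.h>

/* Pick the next target track: keep moving in the current direction
   (+1 = towards higher track numbers, -1 = towards lower ones); if no
   request lies ahead, reverse the direction and look again.
   Returns the index of the chosen request, or -1 if none is pending.   */
int scan_pick(const int *pending, int n, int head, int *direction)
{
    for (int pass = 0; pass < 2; pass++) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            int d = (pending[i] - head) * (*direction);
            if (d < 0)
                continue;                       /* lies behind the head    */
            if (best == -1 || d < (pending[best] - head) * (*direction))
                best = i;                       /* nearest one ahead so far */
        }
        if (best != -1)
            return best;
        *direction = -*direction;               /* nothing ahead: reverse   */
    }
    return -1;
}

int main(void)
{
    int pending[] = { 12, 40, 3, 25 };
    int head = 20, dir = 1;
    int i = scan_pick(pending, 4, head, &dir);
    if (i != -1)
        printf("next track to service: %d\n", pending[i]);   /* prints 25 */
    return 0;
}

When the same selection has to respect a path, the chosen request may still have to be skipped if its control unit or channel is busy, which is exactly the complication described above.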

The device handler essentially is a piece of software which prepares an I/O program for the channel or a controller, loads it for them and instructs the hardware to execute the actual I/O. After this happens, the device handler goes to sleep. The hardware performs the actual I/O and on completion, it generates an interrupt. The ISR for this interrupt wakes up the device handler again. It checks for any errors (remember, the controller has an error/status register which is set by the hardware if there are any errors). If there is no error, the device handler instructs the DMA to transfer the data to the memory.

Let us now take a complete example to study, in a step-by-step manner, how all these pieces of software are interconnected. When an AP wants to read a logical record, the following happens:
(i) AP-1 issues a system call to read a logical record, assuming an Operating System which can recognize an entity such as a logical record. For systems such as UNIX, which treat a file as a stream of bytes, a system call to read a specific number of bytes starting from a given byte is generated in its place. Upon encountering this, AP-1 is put aside (blocked) and another ready process, say AP-2, is initiated.

(ii) The File System determines the blocks that need to be read for the logical record of AP-1, and requests the DD to read the same. We have seen how this is achieved.
(iii) The I/O procedure within the DD prepares an IORB for this I/O request.
(iv) The I/O procedure within the DD now establishes a path and chains the IORB to the device control block and the other units, as discussed earlier. The I/O procedure actually constructs an IORB and hands it over to the I/O scheduler. The I/O scheduler chains the IORB in the appropriate manner as per the scheduling philosophy.
(v) Whenever a device is free, it keeps on generating interrupts at regular intervals to attract attention ("I want to send some data" or "does anybody want to send anything to me? I am free"). The ISR for that interrupt wakes up the I/O scheduler. The I/O scheduler checks the IORB queues and then services them one by one, as per the scheduling philosophy. When the device is free, the controller may not be. Or even if it is, the channel may not be free at that time. For the actual I/O, the entire path has to be free, and this adds to the complication. The I/O scheduler ensures this before scheduling any IORB. After this is done, it makes a request to the device handler to actually carry out the I/O.
(vi) After an IORB has finally been scheduled, the I/O scheduler now makes a request to the device handler to carry out the actual I/O operation.
(vii) The device handler within the DD now picks up the required details from the IORB (such as the source and destination addresses, etc.), prepares a channel program (CP) and loads it into the channel, which in turn, instructs the controller. If there is no channel, the device handler directly instructs the controller about the source and target addresses and the number of bytes to be read. As we know, the device handler can issue a series of instructions which can be stored in the controller's memory.
(viii) The controller finally calculates the direction and the number of steps that the R/W heads have to traverse for the seek operation. Depending upon this, appropriate signals are generated from the controller to the device (refer to Figs. 4.7 and 4.8) and the R/W heads actually move onto the desired track.
(ix) The R/W head is now on the desired track. It now accesses the correct sector on the track as the disk rotates. For every sector, it looks for an address marker followed by the address and then matches it electronically with the target address stored in the controller's buffer before starting the data transfer. On hitting the desired sector, the data is transferred bit serially into the controller's buffer, where it is collected as bytes and these bytes are then collected into a larger unit (512 bytes or more) in the controller's buffer.
(x) This buffer is transferred to the memory buffer within the Operating System using DMA, under the direction of the channel and/or the controller, but through the data bus in a bit parallel fashion.
(xi) When the I/O operation is completed for AP-1, the hardware itself generates an interrupt.
(xii) The Interrupt Service Routine (ISR) within the DD starts executing. It signals the completion of the requested I/O for AP-1 and informs the device handler regarding the same.
(xiii) The device handler checks for any errors and if none is found, it signals this to the I/O procedure.
(xiv) The I/O procedure now deletes the IORB and signals the File System after all the blocks for that logical record have been read.
(xv) The File System now formulates a logical record from the read blocks and transfers it to the I/O area of the AP-1. In some cases, the data can be directly read in the AP’s memory. In others, a common buffer can be maintained between the Operating System and the AP. One can refer to one or the other by just mapping logical to physical addresses appropriately.

(xvi) The process AP-1 is now moved from the blocked to the ready state, whereupon it can be scheduled in due course. At this juncture, assuming that there are only two processes, AP-1 and AP-2, in the system, AP-2 can be thrown out of the control of the CPU and AP-1 can be scheduled. Alternatively, AP-2 can continue executing and AP-1 is scheduled later. This depends upon whether the process scheduling philosophy is pre-emptive or non-pre-emptive. We will study how this happens in the section on Process Management (PM).
(xvii) When AP-1 is next scheduled by the PM module of the Operating System, the program can assume that the logical record is already in the I/O area of AP-1 and therefore, it can start processing it.
While all this happens, the device from which the data needed to be read becomes free and the flag in the DCB is updated to indicate this. From this time on, the device generates an interrupt at regular time intervals. The ISR of this interrupt wakes up the I/O scheduler to check if there are any IORBs to be scheduled. And the cycle continues thereafter.
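From the point of view of the Application Program, this entire machinery is hidden behind a single blocking system call. The short C program below illustrates this for a UNIX-like system, using the standard open/read/close calls; the file name is hypothetical, and the process simply remains blocked between steps (ii) and (xvi) above.

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    char buf[512];
    int fd = open("customer.dat", O_RDONLY);    /* hypothetical file name   */
    if (fd < 0)
        return 1;

    /* The process is blocked inside this call while the file system, the
       DD and the hardware carry out steps (ii)-(xv); it is made ready
       again in step (xvi) and resumes here with the data in buf.          */
    ssize_t n = read(fd, buf, sizeof buf);
    if (n > 0)
        printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}

All the queueing, scheduling, seeking and DMA described above happens "inside" that one read call as far as AP-1 is concerned.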

A terminal or visual display unit (VDU) is an extremely common I/O medium. It would be hard to find any programmer or user who has not seen and used a terminal. Ironically, there is not much popular literature available explaining how terminals work and how the Operating System handles them. We want to provide an introduction to the subject to uncover the mysteries around it.

Terminal hardware can be considered to be divided into two parts: the keyboard, which is used as an input medium, and the video screen, which is used as an output medium. These days, if one uses light pens and similar devices, the screen can be used as an input medium also. A terminal can be a dumb terminal or an intelligent terminal. Even the dumb terminal has a microprocessor in it, on which some rudimentary software can run. It can also have a very limited memory. The dumb terminal is responsible for the basic input and output of characters. Even then, it is called 'dumb' because it does no processing on the input characters. As against this, the intelligent terminal can also carry out some processing (e.g. validation) on the input. This requires more powerful hardware and software. We will assume a dumb terminal for our discussions. Terminals can be classified in a number of ways, as shown in Fig. 5.20.

A detailed discussion of all these is beyond the scope of the current text. We will consider only the memory mapped, character oriented alphanumeric terminals. These terminals have a video RAM as shown in Fig. 5.21. This video RAM is basically the memory that the terminal hardware itself has. The figure shows that the video RAM in our example has 2000 data bytes (0 to 1999) preceded by 2000 attribute bytes (0 to 1999). There is therefore, one attribute byte for each data byte. A typical alphanumeric screen can display 25 lines, each consisting of 80 characters, i.e. 25×80 = 2000 characters. This is the reason why Fig. 5.21 shows 2000 data bytes. This is typically the case with the monochrome IBM-PC.

Anytime, all the 2000 characters stored in the video RAM are displayed on the screen by the video controller using display electronics. Therefore, if you want to have a specific character appear on the screen at a specific position, all you have to do is to move the ASCII or EBCDIC code for that character to the video RAM at the corresponding position with appropriate coordinates. The rest is actually handled by the video controller using display electronics. Therefore, when one is using any data entry program where the data keyed in has to be displayed on the screen or one is using an enquiry program where data from the desired database or file is to be displayed on the screen, it has to be ultimately moved into the video RAM at appropriate places, after which display electronics displays them. What is then the attribute byte? The attribute byte tells the video controller how the character is to be displayed. It signifies whether the corresponding data character which is stored next to it in the video RAM is to be displayed bold, underlined, blinking or in reverse video etc. All this information is codified in the 8 bits of the attribute byte. Therefore, when you give a command to a word processor to display a specific character in bold, the word processor instructs the terminal driver to set up the attribute byte for that character appropriately in the video RAM after moving the actual data byte also in the video RAM. The display electronics consults the attribute byte which, in essence, is an instruction to the display electronics to display that character in a specific way.

For the monochrome IBM-PC display, only one attribute (i.e. 8 bits) is sufficient to specify how that character is to be displayed. For bit oriented color graphics terminals, one may require as many as 24 or 32 bits for each byte or even for each bit if a very fine distinction in colors and intensities is needed. This increases the video RAM capacity requirement. It also complicates the video controller as well as the display electronics. But then you get finer color pictures. Why is this terminal called memory mapped? It is because the video RAM is treated as part of the main memory only. Therefore, for moving any data in or out of the video RAM, ordinary load/store instructions are sufficient. You do not need specific I/O instructions to do this. This simplifies things but then it reduces the memory locations available for other purposes. Figure 5.22 shows a typical arrangement of all the components involved in the operation. It also shows the data bus connecting all these parts.
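The following C fragment sketches what "writing to a memory-mapped display" amounts to for the layout of Fig. 5.21, i.e. 2000 attribute bytes followed by 2000 data bytes. The base address used here is only an assumed constant for illustration; on a real machine it is fixed by the hardware, and real video adapters often arrange the data and attribute bytes differently.

#include <stdint.h>

#define ROWS  25
#define COLS  80
#define CELLS (ROWS * COLS)                   /* 25 x 80 = 2000 positions    */

/* Assumed, illustrative base address of the video RAM in the address space. */
#define VIDEO_RAM ((volatile uint8_t *)0xB0000)

/* Place one character and its attribute at screen position (row, col).
   Ordinary stores suffice precisely because the video RAM is mapped into
   the memory address space; no special I/O instruction is required.        */
void put_char_at(int row, int col, char ch, uint8_t attr)
{
    int pos = row * COLS + col;               /* 0 .. 1999                   */
    VIDEO_RAM[pos]         = attr;            /* attribute byte region first */
    VIDEO_RAM[CELLS + pos] = (uint8_t)ch;     /* data byte region follows    */
}

A word processor asking for a bold character at a given position ultimately boils down to two such stores: one for the character code and one for the attribute byte that the display electronics will interpret.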

When a character is keyed in, the electronics in the keyboard generates an 8 bit ASCII/EBCDIC code from the keyboard. This character is stored temporarily in the memory of the terminal itself. Every key depression causes an interrupt to the CPU. The ISR for that terminal picks up that character and moves it into the buffers maintained by the Operating System for that terminal. It is from this buffer that the character is sent to the video RAM if the character is also to be displayed (i.e. echoed). We will shortly see the need for these buffers maintained by the Operating System for various terminals. Normally, the Operating System has one buffer for each terminal. Again, the Operating System can maintain two separate buffers for input and output operations. However, these are purely design considerations. When the user finishes keying in the data, i.e. he keys in the carriage return or the new line, etc., the data stored in the Operating System buffer for that terminal is flushed out to the I/O area of the Application Program which wants that data and to which the terminal is connected (e.g. the data entry program). Therefore, there are multiple memory locations involved in the operation. These are: l A very small memory within the keyboard itself l The video-RAM (data and attribute bytes)

l The Operating System buffers l The I/O area of the AP (FD, working storage, etc.) There are various Operating System routines and drivers to ensure the smooth functioning between the keyboard, the screen, all these memory areas and the Application Program itself. This is the subject of our next section.

Imagine that an Application Program written in HLL wants to display something on the terminal. The compiler of the HLL generates a system call for such a request, so that at the time of execution, the Operating System can pick up the data from the memory of the AP, dump it into its own output buffers first and then send it from these buffers to the terminal. The rates of data transfers between the memory of the AP to the Operating System buffers and finally from there to the terminal are very critical if the user has to see the output continuously, especially in cases such as scrolling or when using the Page Up/Page Down facility etc. What is transferred between the AP to the Operating System and finally to the terminal is the actual data as well as some instructions typically for screen handling (given normally as escape sequences). Let us say that the AP wants to erase a screen and display some data from its working storage section. The AP will request the Operating System to do this. The Operating System will transfer the data from the working storage of the AP to its own buffers and then send an instruction (in the form of escape sequences) to the terminal for erasing the screen first. The terminal’s microprocessor and the software running on it will interpret these escape sequences, and as a result, move spaces to the video RAM, so that the screen will be blanked out. The Operating System then will transfer the data from its buffers to the terminal along with the control information such as where on the screen it should be displayed and how. Again, the terminal’s microprocessor and the software will interpret this control information and will move the data into appropriate places of the video RAM. Now the actual data and also the attribute bytes of the video RAM will be correctly set. We know that the rest is done by the display electronics. But there is a problem in this scheme. For instance, when the AP is displaying some matter, if a user keys something in, where should it be displayed? Should it be mixed in the text being displayed or should it be stored somewhere and then displayed later? This is the reason the Operating System needs some extra storage area. Let us take another example. Let us say a user is keying in some data required by an Application Program. The user keys it in. As soon as a key is depressed, we know that an interrupt is generated and the interrupt service routine (ISR) for that terminal will call a procedure which will take that character and move it to the memory of the AP. But there is a problem in this scheme of transferring the data keyed in directly to the memory of the AP too! What if the user types a “DEL” key? Should it be sent to the memory of the AP? What if he types a “TAB” key? What if he types a “CTRL-C” to abort the process? It is obvious that the Operating System needs a temporary storage area for all the characters keyed in whether they are displayable characters or command characters such as “DEL” or “TAB”, etc. It also needs to have a routine to interpret these command characters and carry out the processing of special command characters such as DEL, backspace, TAB etc. This routine moves the cursor for the DEL command or inserts the necessary spaces for the “TAB” command in this temporary storage area. Having done this processing, the Operating System will need to transfer the data from this temporary area to the Application Program on receiving a carriage return (CR) or a line feed (LF) character. 
It is clear that the Operating System needs some temporary buffer space and various data structures to manage that space between the video RAM and the AP’s memory. We will now study these.

The Operating System reserves a large input memory buffer to store the data input from various terminals before it is sent to the respective APs controlling these terminals. Similarly, it reserves an output buffer to store the data sent by the AP before it is sent to the respective terminal screens for displaying. For large systems, there could be dozens if not hundreds of users logging on and off various terminals throughout the day. The Operating System needs a large area to hold the data for all these terminals for the purpose of input (the data keyed in by the users) and the output (the data to be displayed). These are the buffers which are the most volatile in nature. They get allocated and deallocated to various terminals by the Operating System a number of times throughout the day. There are two ways in which this buffer space is allocated to various terminals. In the first scheme (static allocation), the Operating System estimates the maximum buffer that a terminal will require, and reserves a buffer of that size for that terminal. This is depicted in Fig. 5.23. The advantage of this scheme is that the algorithms for allocation/deallocation of buffers are far simpler and faster. However, the main disadvantage is that it can waste a lot of memory, because this scheme is not very flexible. For instance, it is quite possible in this scheme that one terminal requires a larger buffer than the one allocated to it, whereas some other terminal grossly underutilizes its allocated buffer. This scheme is rigid in the sense that the Operating System cannot dynamically take away a part of a terminal's buffer and allocate it to some other terminal. In the second scheme (dynamic allocation from a central pool), the Operating System maintains a central pool of buffers and allocates them to various terminals as and when required. This scheme obviously is more flexible and it reduces memory wastage. But the algorithms are more complex and time consuming. We will see this trade-off time and again in all allocation policies of the Operating System, be it for memory or for disk space! In this scheme of a central pool, normally, the merits outweigh the demerits and therefore, this scheme is more widely followed. AT&T's UNIX System V follows this scheme, for instance. We will illustrate this scheme with an example. A buffer is divided into a number of small physical entities called Character blocks (Cblocks). A Cblock is fixed in length. A logical entity called a Character List (Clist) consists of one or more Cblocks. For instance, for each terminal, there would be a Clist to hold the data input through a keyboard as it was keyed in. This Clist would also actually store the ASCII or EBCDIC codes for even the control characters, such as "TAB", "DEL" etc., along with those for data characters, in the same sequence that they were keyed in. If a Cblock is, say, 10 bytes long, and if a user keys in a customer name which is 14 characters, it will be held in a Clist requiring 2 Cblocks. In this case, only 6 bytes would be wasted because the allocation/deallocation takes place in units of full Cblocks. If a user keys in an address which is 46 characters long, that Clist will require 5 Cblocks, thereby wasting only 4 bytes. All Cblocks in a buffer are numbered serially. Therefore, the terminal buffer can be viewed as consisting of a series of Cblocks 0 to n of fixed length. When the Clist is created or when it wants to expand because its already allocated Cblock is full, the Operating System allocates another free Cblock to that Clist.
The Cblocks assigned to a Clist need not be the adjacent ones, as they are dynamically allocated and deallocated to a Clist.
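The arithmetic in the example above is easy to check with a few lines of C; the 10-byte Cblock size and the 14- and 46-character field lengths are the ones used in the text.

#include <stdio.h>

int main(void)
{
    const int cblock = 10;                    /* Cblock size in bytes        */
    const int lengths[] = { 14, 46 };         /* the name and the address    */

    for (int i = 0; i < 2; i++) {
        int need  = (lengths[i] + cblock - 1) / cblock;  /* Cblocks required */
        int waste = need * cblock - lengths[i];          /* bytes left over  */
        printf("%d characters -> %d Cblocks, %d bytes wasted\n",
               lengths[i], need, waste);
    }
    return 0;
}

This prints 2 Cblocks with 6 wasted bytes for the 14-character name and 5 Cblocks with 4 wasted bytes for the 46-character address, matching the figures quoted above.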

Therefore, the Operating System has to keep track of which Cblocks are free and which are allocated to which Clist. The Operating System normally does that by chaining the Cblocks belonging to the same Clist together, with a header for each Clist. As the user keys in characters, they are stored first in the terminal's own memory and then in the video RAM if they are to be echoed, as we have seen before. After this, the character is pushed into a Clist from the terminal's memory by the ISR for that terminal. While doing this, if a Cblock of that Clist has an empty space to accommodate this character, it is pushed there; otherwise the Operating System acquires a new free Cblock for that Clist, delinks that Cblock from the pool of free Cblocks, adjusts all the necessary pointers and then pushes the character at the beginning of the newly acquired Cblock. The size of the Cblock is a design parameter. If this size is large, the allocation/deallocation of Cblocks will be faster, because the list of free and allocated Cblocks will be shorter, thereby enhancing the speed of the search routines. But in this case, the memory wastage will be high. Even if one character is required, a full Cblock has to be allocated. Therefore, the average memory wastage is (Cblock size - 1)/2 for each Clist. If the size of the Cblock is reduced, the allocation/deallocation routines will become slower as the list of free and allocated Cblocks will be longer, and also the allocation/deallocation routines will be called more often, thereby reducing the speed. Therefore, there is a trade-off involved - again similar to the one involved in deciding the page size in memory management or the size of a cluster or an element used in disk space allocation. Each Cblock has the format shown in Fig. 5.24, assuming that the Cblock contains 10 bytes, as in our examples.

We will now study the various fields that constitute a Cblock.
l Cblock number is the serial number of the Cblock in the buffer (0 to n).
l Next pointer is the Cblock number of the next Cblock allocated to the same Clist. "*" in this field indicates the end of the Clist.
l Start offset gives the starting byte number of the valid data. For instance, Fig. 5.25 shows 10 bytes, but bytes 0 to 3 may contain some junk remaining from the past. Therefore, the start offset in this case is 4, as shown in the figure. The Cblock stores the employee number as "EMP01". Normally, the start offset is set to 0 for a newly acquired Cblock of a Clist.
l Last offset gives the byte number of the last significant byte in the Cblock. For instance, Fig. 5.25 shows that byte number 8 is this byte number. That is where the employee number ends. Byte number 9 contains "-", which is again of no consequence. It could be garbage from the past. This field indicates the byte after which newly keyed-in data can be stored in that Cblock, if there is room.
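Gathered together, the fields of Fig. 5.24 might be declared in C roughly as follows. Using an integer Cblock number for the next field, with -1 playing the role of the "*" end marker, is just one possible representation chosen for this sketch.

#define CBLOCK_SIZE 10     /* design parameter; 10 bytes in the running example */

/* One Cblock, following the layout of Fig. 5.24. */
struct cblock {
    int  number;                /* serial number of this Cblock (0 to n)        */
    int  next;                  /* number of the next Cblock of the same Clist;
                                   -1 plays the role of "*" (end of the Clist)  */
    int  start_offset;          /* first byte holding valid data                */
    int  last_offset;           /* last byte holding valid data                 */
    char data[CBLOCK_SIZE];     /* the characters themselves                    */
};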

Let us illustrate the Clist and Cblock structures by taking an example. Let us say that we have two terminals, T0 and T1. At a given moment, let us also assume that T0 has 3 Clists: CL0, CL1 and CL2. T1 is currently unused. Therefore, no Clist is allocated to T1. A question arises: why are multiple Clists required for a terminal? We defer the answer to this question for a while. As we have described before, one or more Cblocks are normally associated with each Clist. Let us assume the size of a Cblock to be 10 characters in our example. These Cblocks are linked together with pointer chains for the same Clist using the "next" pointers in the Cblocks. Let us also assume that there is a list of free Cblocks called CLF which links all the free, unallocated Cblocks. Whenever a new Cblock is to be allocated to a Clist, the Operating System goes through this CLF and allocates a free Cblock which is at the head of the chain of CLF to the required Clist. After this, it adjusts the pointer chains for that Clist as well as for CLF to reflect this change. In the same way, if a Cblock has served its purpose, the Operating System delinks it from that Clist and adds it to the pointer chains of CLF at the end of the chain. There is no reason why it should be added anywhere else. This is because there is no concept of priority or preference in this case. Any free Cblock is as good as any other for its allocation. Our example clarifies this in a step by step manner. Let us now imagine that the Operating System has allocated the buffer for all the Clists for all the terminals put together, such that it can accommodate 25 Cblocks in all. This is obviously too small to be realistic, but it is quite fine to clarify our concepts. Let us assume that the Cblocks are assigned in the manner shown in Fig. 5.26. To represent this, the Operating System maintains the data structures as shown in Fig. 5.27. These are essentially the headers of the chains. The actual Cblocks would be as shown in Fig. 5.28. We can easily verify that if we start traversing the chains, using the starting Cblocks from Fig. 5.27 and the "next" pointers in Fig. 5.28, we will get the same lists as shown in Fig. 5.26. Figure 5.28 shows only 5 Cblocks allocated to terminal T0 in detail, with their data contents. The others are deliberately not filled up, to avoid cluttering and to enhance our understanding. We can make out from the figure that the user has keyed in, through the terminal T0, the following text: "Clist number 0 for the terminal T0 has 5 Cblocks." This is stored in Cblocks 0, 5, 8, 17 and 21. All Cblocks in this

example are fully used because there are exactly 50 characters in this text, and therefore, in this example, the start offset is 0 and the last offset is 9 for all those Cblocks. As we know, this need not always be the case. Let us assume that a user logs on to the terminal T1 at this juncture, where he runs a program which expects the user to key in an employee name. Let us assume that the name is of 16 characters (Achyut S.Godbole), requiring 2 Cblocks to hold it. We know that each key depression causes an interrupt and activates the Interrupt Service Routine (ISR), which invokes a procedure to pick up the byte keyed in and deposit it from the memory of the terminal itself into one of the Clists associated with that terminal. We have also seen that if the character is to be displayed on the screen, i.e. echoed, it is also moved, along with its attribute byte, to the appropriate location in the video RAM. We will later see the different routines within the Operating System required to handle the terminal, their exact functions and the exact sequence in which they work together. For now, let us assume that the first character is to be deposited into the Clist for T1.

At this juncture, there is no Clist and consequently no Cblocks allocated to T1. The following procedure will now be adopted:
(i) The Operating System routine goes through the entry for free Cblocks (CLF) in the table shown in Fig. 5.27.
(ii) It will find Cblock number 3 as the entry for the first free Cblock, as shown in CLF in Fig. 5.27.
(iii) It will create a Clist CL3 for T1 and allocate Cblock 3 to it. At this juncture, the "starting" free Cblock will be Cblock number 6 (refer to Fig. 5.26 for the row for CLF). The terminal data structure will now look as shown in Fig. 5.29.
(iv) Assuming that the user keys in the first character of the entire text "Achyut S.Godbole", viz. the character "A", it will move this character "A" into Cblock 3. After this is moved, the Cblocks will look as shown in Fig. 5.30. Notice that the start as well as the last offsets for this Cblock 3 are set to 0, because only the zeroth character in that Cblock has some worthwhile data (in this case, it is "A").
As the user keys in the first 10 characters ("Achyut S.G"), Cblock 3 will keep getting full. Let us now imagine that the user has keyed in the 11th character "o". Now, a new Cblock has to be acquired. The list of free Cblocks, i.e. CLF in Fig. 5.29, now points towards Cblock 6. Therefore, it will be allocated to CL3, with both the offsets set to 0, and the 11th character will be moved into byte number 0 of Cblock 6 of CL3. The terminal data structure will then look as shown in Fig. 5.31. The Cblocks at this stage will look as shown in Fig. 5.32. It is possible that at any moment, terminal T0 may acquire new Cblocks for its Clists or it may relinquish them. We have, however, assumed that the Cblocks for T0 have not changed during the period of this data entry for T1.
We now assume that all the 16 characters are keyed in. The user then keys in a "Carriage Return (CR)". At this juncture, the terminal data structure and the Cblocks will continue to look as shown in Figs. 5.31 and 5.32, respectively (and therefore, they are not repeated here), except that Cblock 6 would be different and would look as shown in Fig. 5.33. Notice that the "last" offset has now been updated to 5. This means that a new character should be entered at position 6. As soon as "CR" is hit, the Operating System routines for the terminals understand that it is the end of the user input. At this juncture, a new routine is invoked which moves all the 16 characters into the AP's memory in a field appropriately defined (e.g. char[17] in C or PIC X(16) in the ACCEPT statement or screen section of COBOL). After this, Cblock 3 and Cblock 6 are released and they are chained in the list of free Cblocks again. We also know that if the name is to be displayed as each character is entered, it would at that time itself have been sent to the video RAM, from which it would have been displayed. After all the characters are keyed in, the terminal data structure and the Cblocks again look as shown in Figs. 5.27 and 5.28, respectively. This is where we had started from.
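The hand-traced example can be condensed into a small routine. The sketch below assumes the Cblock fields of Fig. 5.24, a pool of 25 Cblocks as in Fig. 5.26 and a deliberately naive free list (CLF); the names and the omission of the explicit Cblock-number field (the array index serves that purpose here) are simplifications for illustration, not the code of any real kernel.

#define CBLOCK_SIZE 10
#define NCBLOCKS    25

struct cblock {
    int  next;                    /* next Cblock of the same Clist; -1 = "*"    */
    int  start_offset;            /* SO                                          */
    int  last_offset;             /* LO; LO + 1 is where the next character goes */
    char data[CBLOCK_SIZE];
};                                /* the Cblock "number" is simply its index     */

struct clist {
    int first, last;              /* Cblock numbers; -1 when the Clist is empty  */
};

static struct cblock pool[NCBLOCKS];
static int clf = 0;               /* head of the free-Cblock chain (CLF)         */

static void init_pool(void)      /* chain all Cblocks into CLF                   */
{
    for (int i = 0; i < NCBLOCKS; i++)
        pool[i].next = (i + 1 < NCBLOCKS) ? i + 1 : -1;
}

static int acquire_cblock(void)  /* delink one Cblock from the head of CLF       */
{                                /* (assumes CLF is not empty)                   */
    int n = clf;
    clf = pool[n].next;
    pool[n].next = -1;
    pool[n].start_offset = 0;    /* SO = 0, LO = -1 for a fresh Cblock           */
    pool[n].last_offset  = -1;
    return n;
}

/* Append one keyed-in character at the end of a Clist, acquiring a new
   Cblock from CLF when the current one is full.                                 */
void clist_putc(struct clist *cl, char ch)
{
    if (cl->last == -1 || pool[cl->last].last_offset == CBLOCK_SIZE - 1) {
        int n = acquire_cblock();
        if (cl->last != -1)
            pool[cl->last].next = n;   /* chain it after the old tail            */
        else
            cl->first = n;             /* very first Cblock of this Clist        */
        cl->last = n;
    }
    pool[cl->last].data[++pool[cl->last].last_offset] = ch;   /* store at LO + 1 */
}

A caller would initialize the pool once with init_pool() and create a Clist with both first and last set to -1; keying in the 16-character name would then cause exactly the two Cblock acquisitions traced in the example.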

We will now study some of the algorithms associated with Clists and Cblocks (Figs 5.34 and 5.35).

In these, “free the Clist and Cblocks for that terminal” can be further exploded easily. We give below the algorithm for “insert a character” which is a little complex. The algorithm for “Acquire a free Cblock” is as shown in Fig. 5.36. In the algorithm, routines to “Allocate FCB” and “Deallocate FCB” need further refinements. They essentially are the routines to add an element to a linked list and to remove an element from the linked list. These must be fairly clear from our example showing Cblocks and Clists in Figs. 5.26 to 5.33. Whenever a Cblock is added to a Clist, the fields in it are set as follows: (i) Cblock number = the free Cblock number allocated from the head of CLF (ii) Next pointer = “*” (iii) Start offset (SO) = 0

(iv) Last offset (LO) = –1 (v) Other fields = blank. Notice that LO is set to –1 because LO + 1 gives the byte number within that Cblock where the next character is to be stored. For a newly acquired Cblock, a character has to be stored in byte number 0. How does the Operating System remove these 16 characters from these Cblocks and send them to the user process running the AP? While doing that, one thing has to be taken care of. The characters must be sent to the user process in the same sequence that they were keyed in, as if Clists and Cblocks were not present - i.e. in the FIFO manner. The kernel of the Operating System normally has an algorithm for removing the characters from the Clist in this fashion. It looks as shown in Figs. 5.37 and 5.38. The kernel can, and in fact normally does, provide for an algorithm to extract or copy only 1 character from the Clist also. In actual practice, at any given moment, many Cblocks are being acquired and many others are being released for different terminals in a multiuser system. Things do not necessarily happen for one terminal after the other as far as Cblocks are concerned. Our example must have clarified the different algorithms and data structures required for Cblock and Clist maintenance. The procedure for displaying some value by a process is exactly the reverse. The data is moved from the AP's memory into the Cblocks of the Clists (after acquiring them, if required) for the target terminal; and then the data is moved from the Cblocks to the terminal. As we know, in addition to the actual data, the commands to manipulate the screen (e.g. erase screen) are also sent to the terminal as some specific escape sequences. The hardware and the software within the terminal interpret these as commands or data and update the video RAM with the appropriate data and attribute bytes. The display electronics now does the rest. The screen now shows what the user wants to display. After this is done, the Cblocks are freed again. The kernel normally has a number of procedures to handle a variety of requirements. Some of them are listed below: (i) Allocate a Cblock from a list of free Cblocks to a given Clist. (ii) Return a Cblock to a list of free Cblocks. (iii) Retrieve the first character from a Clist. (iv) Insert a character at the end of a Clist.

(v) Extract all the characters belonging to a Cblock within a Clist (and then free that Cblock). (vi) Extract all characters from a Clist. (vii) Place a new Cblock of characters at the end of a Clist. We have already shown the algorithms for some of them. e.g. for extracting all the characters from a Clist in Figs. 5.37 and 5.38. The algorithms for all the others by now should be fairly easy to construct from the preceding discussions in the last two sections. We now will see various components of these routines in the terminal driver, but before doing that, it is

necessary for us to know why different Clists are normally needed for each terminal. There are two modes in which a terminal can be operated by the Operating System. They are the 'raw' mode and the 'cooked' (or 'canonical') mode.

In the 'raw' mode, whatever the user keys in is passed on to the user process faithfully by the Operating System without any processing. In such a case, the Operating System needs only one Clist for the input from a terminal apart from the one for the output to the screen. In this mode, the user keys in the data which is deposited in the input Clist and on encountering a CR or an LF, it is transported to the user process. This is exactly what we had traced in our example. The raw mode does not do any processing on the characters keyed in such as "DEL" (delete a character just keyed in) or "TAB" (jump to the next TAB column). As we know, DEL, TAB keys etc. also have some ASCII and EBCDIC codes associated with them. The raw mode extracts the 8 bit code associated with the character from the terminal, puts it into the input Clist and passes it on to the user process wanting that character. It is then up to the user process to interpret these special characters and take actions accordingly. This mode is especially used by many editors. A specific control character sequence or escape sequence means a specific thing to one editor, but the same may mean a different thing to a different editor running under the same Operating System. In this case, it is useless for the Operating System to interpret and process these characters. It is better for the Operating System to pass these special characters to the editor and let the editor interpret them the way it wants. Therefore, a raw mode is used. However, in addition to the input Clist, the raw mode requires one output Clist from where the characters are to be displayed on the screen. For instance, if a user keys in A, B, C and then the TAB character, the input Clist will contain "ABC(TAB)" whereas the output Clist will contain "ABC" followed by spaces, as per the TAB. After the TAB character is faithfully sent to the user process (such as an editor), the user process interprets it and sends some escape sequence back to the terminal driver operating in the raw mode. The terminal driver interprets this sequence and expands the "ABC(TAB)" to "ABC" followed by the appropriate spaces for the output Clist. It is then sent from the output Clist to the video RAM to fill up the data and the attribute bytes. The rest is known to us. Let us assume that the user keys in a function key "F1" as an input and a user program is written such that if "F1" is encountered, the user should be taken back to the previous screen. How is this accomplished? The raw mode stores the ASCII/EBCDIC code for "F1" in the input Clist after it arrives from the terminal buffer. After this, it faithfully sends the character "F1" to the user process for interpretation. The user process is written in such a way that it checks for "F1" and if encountered, it sends the instructions to "Erase screen" back to the terminal driver. It then also sends the data of the previous screen, which is to be displayed again, to the terminal driver. Both these instructions and the data are sent by the terminal driver to the terminal. The terminal hardware and software interpret these and set up the video RAM accordingly. For instance, the instruction to erase the screen in the form of special characters (escape sequences) will make the terminal move spaces to the video RAM, so that the screen is blanked out. After that, the characters from the previous screen are moved to the video RAM appropriately.
If scrolling is involved, more than a screenful of data can be sent from the user process to the output Clist and stored there, and the scrolling is then synchronized with the rate at which the data is sent from the user process to the output Clist and from there to the video RAM.

The 'cooked' or 'canonical' mode, on the other hand, processes the input characters before they are passed on to the user process. It is for this reason that in this mode, the driver requires an additional input Clist associated with the same terminal. In this case also, there is only one output Clist from which characters are sent to the screen for displaying. However, this mode demands two input Clists. These are called the 'raw' and 'cooked' Clists. In this case, as the characters are keyed in, they are first input in the raw input Clist. This is the same as was the case in the 'raw' mode. After this is done, each character is examined. If it is an ordinary data character, it is copied to the second "cooked" input Clist. If it is a command or control character such as "F1" or "DEL" etc., it is then processed according to the character and the result is moved into the second cooked input Clist. It is from this Clist that the data finally goes to the user process. The Clist/Cblock management algorithms and data structures are as we have already discussed in the last section. As an example, if the user keys in a TAB character, the ASCII code for TAB is moved into the raw Clist. The terminal driver in the cooked mode then calculates the next TAB position from the current cursor position, calculates the number of spaces to be inserted and actually moves the result along with the spaces into the cooked Clist. If the user keys in a "DEL" character, the Operating System will store the ASCII code for DEL in the raw Clist first. After this, the Operating System will move spaces to the last character in the cooked Clist and decrement the Last offset position in the Cblock of that list. It will also decrement the cursor position and send the instruction to the terminal accordingly, to actually display the cursor at the previous position. The terminal microprocessor, along with the software running on it, will interpret it and actually move the cursor character from the current position to one position to the left in the video RAM. The attribute byte is also moved so that the nature of the cursor (blinking, etc.) remains the same. All this enables the user to key in the next character in the same place. From the cooked Clist, data is normally sent to the user process only when a CR or an LF or an NL is encountered. A character such as CR is stored in the raw Clist, but it is a "command" character, and therefore, it is not sent as it is. Thereafter, if echoing is required, the data from the cooked input Clist is moved (after the required processing, if any) to the output Clist, from where it is sent to the video RAM through the hardware and software of the terminal for display. If a character is to be displayed as soon as it is keyed in, it is sent immediately to the output Clist and thereafter to the video RAM. Figure 5.39 shows a partial list of characters handled specially in cooked mode, along with their possible interpretations. Therefore, in cooked mode, after each key depression, the driver has to check whether the character input is an ordinary character or whether it needs special interpretation. But if we want to actually input such a special character as an ordinary character, what should be done? For instance, suppose we want to key in "50 pieces @ 2 per week will take 25 weeks for delivery". What will happen after we input the character "@", given its special meaning as per Fig. 5.39? For this reason, the backslash (\) character is used to denote that what follows is to be treated as an ordinary character.
This will be clear from Fig. 5.39. Therefore, the message given above should be sent as “50 pieces \@ 2 per week will take 25 weeks for delivery”. If the user actually wants to use “\” in his actual message, then he must type “\\”. After encountering any backslash, the

terminal driver sets a flag denoting that the next character is to be treated as an ordinary character. Therefore, the first "\" itself is not entered in the Clist. When a user types "DEL", the driver must interpret it and send a message/instruction back to the terminal, which must in turn take an action equivalent to three steps:
(i) Backspace (decrement the cursor, etc.)
(ii) Move a blank character
(iii) Backspace
The reason that another backspace is needed in step (iii) is that after moving a blank character in step (ii), the cursor would have advanced by 1 position. While interpreting the "DEL" command, if the previous character was a TAB character, the problem becomes complex. The terminal driver has to keep track of where the cursor was prior to the TAB. In most systems, backspacing can erase characters on the current line of the screen only. This simplifies the driver routine, whereas it is certainly possible to allow erasing of characters from the previous lines too! This would enhance user friendliness but would make the terminal driver routines complex. CTRL-Q and CTRL-S are normally used to control scrolling. On encountering them, the user process will start or stop sending data to the output Clist. Many editors and other sophisticated programs need to manipulate the screen in a variety of ways. To support this need, most of the terminal drivers provide for various routines, as listed in Fig. 5.40.
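To tie the cooked-mode discussion together, here is a toy C model of how one keyed-in character might be processed. The buffer sizes, the tab width of eight and the use of plain arrays in place of Clists are all simplifying assumptions, and the DEL handling deliberately ignores the TAB complication mentioned above.

#include <stdio.h>
#include <stdbool.h>
#include <string.h>

/* Every keystroke is kept verbatim in the "raw" buffer, while the "cooked"
   buffer holds the processed line that will be handed to the user process. */
static char raw[256], cooked[256];
static int  nraw = 0, ncooked = 0;
static bool literal_next = false;   /* set after a backslash                 */

static void deliver_line(void)      /* stand-in for "send to the AP"         */
{
    printf("line to user process: \"%.*s\"\n", ncooked, cooked);
    nraw = ncooked = 0;
}

void cooked_mode_input(char ch)
{
    raw[nraw++] = ch;               /* the raw Clist keeps everything        */

    if (literal_next) {             /* previous character was '\'            */
        literal_next = false;
        cooked[ncooked++] = ch;
        return;
    }
    switch (ch) {
    case '\\': literal_next = true; break;
    case '\t':                      /* expand TAB to the next tab stop       */
        do { cooked[ncooked++] = ' '; } while (ncooked % 8 != 0);
        break;
    case 0x7F:                      /* DEL erases the last cooked character  */
        if (ncooked > 0) ncooked--;
        break;
    case '\r': case '\n':           /* CR/LF: hand the line over             */
        deliver_line();
        break;
    default:
        cooked[ncooked++] = ch;
    }
}

int main(void)
{
    const char *keys = "lbte\x7f\x7f\x7f" "ate\r";   /* the "late" example */
    for (size_t i = 0; i < strlen(keys); i++)
        cooked_mode_input(keys[i]);
    return 0;
}

Feeding it "lbte", three DEL characters, "ate" and a CR, an example discussed a little further on in the text, makes it hand over exactly the line "late", which is what the cooked Clist would pass to the user process.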

There is normally a fixed protocol in terms of certain escape sequences between the user process and the terminal driver for each of these routines. We have seen before how these escape sequences are interpreted by the terminal as instructions to set up the video RAM accordingly. In most of the systems, CTRL-C aborts the current process. When the user types this, how should this be handled? The reason for this question is that CTRL-C is neither a data character nor a command for screen manipulation but is a command to the Operating System itself. Therefore, there are three types of characters that can be keyed in by the user: (i) Ordinary, displayable characters requested by the user process or sent by it for display. (ii) Terminal control characters, such as “DEL”, “TAB” etc.

(iii) Process control characters, such as "CTRL-C". As soon as a character is keyed in, the driver has to analyze which category it belongs to and then, depending upon the cooked or raw mode, take action by itself or wait until the user process takes action. There is also an output Clist associated with each terminal. The contents of this are sent to the video RAM for display. Normally, whatever is keyed in is also echoed, except for special characters or a password, etc. Characters to be displayed are moved to the output Clist after interpretation, if necessary. In the raw mode, this processing is not done. Therefore, if you type "lbte" instead of "late" and then realize your mistake, you will have to type 3 "DEL" characters and then type "ate" followed by a carriage return. In the raw mode, all these 11 characters will be stored as "lbte(DEL)(DEL)(DEL)ate(CR)" in the input Clist and then they will be sent to the output Clist and displayed. In the cooked mode, only "late" would be moved into the cooked input Clist and then the output Clist. After this, it will be moved to the video RAM and displayed. One problem that arises is that if a user process wants to display some data on a terminal, the data will be moved from the user memory to the output Clist, from where it will be moved to the video RAM. But if a user is also keying in something at that time which has to be echoed, it will go to the raw input Clist first, and then to the output Clist. This method can intermix both these outputs, resulting in garbage. Therefore, the Operating System has a very delicate job of processing the keyed-in input and displaying it only at the appropriate time. If the Operating System does not have this sophistication, one can observe the display intermixed with the keyed-in input. Therefore, depending upon the mode, two or three Clists are assigned for each terminal. This mode can be changed by changing the terminal characteristics through software. In any input/output operation on the terminal, there are basically five layers, as shown in Fig. 5.41. Each of these has a specific function.

The I/O system call is what the compiler substitutes in place of a high-level language instruction such as DISPLAY or ACCEPT or the screen section instructions. The terminal driver is responsible for the movement of the data between the memory of the user process and the Clist (input or output Clist, depending upon the operation). Line Discipline is responsible for the following functions:
- To parse input strings into lines.
- To process the special characters as given in Fig. 5.39 to form a cooked Clist, if applicable.
- To echo the received characters on the terminal, if required (except for passwords, etc.).
- To allow for input in the raw mode without interpreting the special characters.
- To control the flow.
- To check transmission errors.
The driver should not be confused with the terminal driver. The driver is responsible for the movement of data between the Clist and the device. Actually, we will need two drivers, the first being the

keyboard driver, which is a piece of software responsible for moving data from the keyboard to the raw input Clist. As we know, each key depression causes an interrupt. An interrupt service routine (ISR) is then executed, which invokes this keyboard driver. The screen driver, on the other hand, is responsible for moving data from the output Clist to the device I/O. If the screen is to be manipulated (e.g. to erase the screen), the screen driver sends the appropriate escape sequence to the device, as seen before. Device I/O consists of the microprocessor and the software running inside the terminal. This is responsible for both input and output. When any key is depressed, this device I/O is responsible for generating the ASCII code from the row number, column number or the key number. It is also responsible for cursor management. For instance, when you depress a key, the cursor is advanced by 1. When you depress the back arrow ‘←’, the cursor is brought back by 1 position. When you depress the forward arrow ‘→’, the reverse takes place. All this is managed within the terminal, even if it is a dumb terminal. It is made possible due to this software. Similarly, device I/O is responsible for receiving characters from the driver, interpreting them if necessary (e.g. escape sequences for screen management) and moving the data into the video RAM for display. We will now take a final concrete example to cement all our ideas. Scenario: An Application Program “Customer Enquiry” prompts for a customer number on the terminal. The user keys in the desired number on the keyboard. The Application Program gets the information for the specified customer with the help of the data management software (DMS) and outputs it on the screen. Assumptions: The Application Program does not do any special input handling like (i) forcing the user to enter four and only four characters, (ii) proceeding even without the user pressing the ENTER key, or (iii) not echoing the entered characters, etc. All validations are done by the Application Program after the user has entered the full customer number. The Application Program gets control only after the user hits the ENTER key (input is in canonical mode). Startup Information: The C/COBOL compiler, when producing the object code for the program, would have generated the library calls ‘write’ and ‘read’ for the DISPLAY and ACCEPT verbs, respectively. The setup code that is executed before the first statement in the main function (for C) or the PROCEDURE DIVISION (for COBOL) would have opened the terminal for input and output (stdin, stdout). There are two modes of operation for all UNIX programs - user and kernel. All user Application Programs start in the user mode. We will now trace the steps followed for two events. One is the Application Program displaying the characters “Enter Customer Number” on the screen as a prompt. The second is the user entering the input on the keyboard. The compiler would have generated the necessary code to push two parameters onto the stack (the file descriptor of the terminal, as a terminal is also treated as a file in UNIX, and the memory address of the string to be displayed, which in this case is “Enter Customer Number” and which must have been already stored in memory). The compiler would have also generated the code to call the library procedure for write (DISPLAY). The library code pushes some constant on the stack (which depends upon the UNIX implementation)

to indicate to the kernel that it is the write system call that needs to be executed. A mode switch from user to kernel is then done. The method to do this varies from CPU to CPU (on the 680x0 CPU, a mode switch is achieved by executing a trap instruction). The program is now executing in the kernel mode. A table look-up is done for the system call to determine validity, number of parameters, etc. The arguments are then fetched from the stack. Depending on the arguments, the pertinent device driver is called (in this case, the terminal driver). The characters are read by the terminal driver from the Application Program’s address space into the kernel address space. Finally, these characters are passed to the line discipline. The line discipline (canonical) places the characters on a Clist. Necessary processing is done (expansion of tabs to spaces, etc.). When the number of characters on the Clist becomes large, or if the Application Program requests flushing, or the Application Program desires input from the keyboard, the line discipline invokes the driver output procedure. In this case, no processing is required for our text ‘Enter Customer Number’. Therefore, the line discipline merely moves it to the Clist. As the Application Program desires input, the line discipline now invokes the driver. The driver outputs the characters to the device I/O of the device (the terminal). All characters are sent (including escape sequences) to the terminal. The Application Program then goes to sleep. Device I/O software within the terminal displays the characters (doing escape sequence processing for instructions like erase screen, display bold, display with underline, etc.). Positioning of characters, wrapping of long lines, etc. are done by this device I/O software running inside the terminal. This step sets up the video RAM with the appropriate data and attribute bytes, and then the video controller comes into action for actually displaying the characters. Upon completion of the display operation, the video controller interrupts the UNIX Operating System. If the process is sleeping on this device, it is awakened in the ISR. The system call returns, the mode switches back to user mode and the Application Program continues. We assume that the C/COBOL compiler would have generated the code necessary to push 2 parameters onto the stack (the file descriptor of the terminal and the memory address of the variable into which the data entered by the user, e.g. 0001, is to be moved). We also assume that the compiler would have generated the necessary code to call the library procedure for read (ACCEPT). The library code pushes some constant (which depends upon the UNIX implementation) to indicate to the kernel that it is the read system call that needs to be executed. A mode switch from user to kernel is then done as seen before. The program is now executing in the kernel mode. A table look-up is done for the system call to determine the validity, number of parameters, etc. for that system call. The arguments are then fetched from the stack. Depending on the arguments, the pertinent device driver is called (in this case, the terminal driver). If enough characters exist on the cooked input Clist to be flushed out to the process, step 7 is executed and the data is moved to the memory of the process. Otherwise, if nothing is keyed in at all, the terminal driver has nothing to do except to wait for additional data. It then invokes the line discipline, which in turn invokes the driver routine. The driver also cannot do anything unless the user keys in some data.
When the user types in a character, this character is now available in the terminal after the interrupt is processed through the ISR. The terminal driver now picks up this character and puts it in the raw Clist.

The line discipline now examines the character and processes it, taking any appropriate action if necessary. This has been discussed before. The line discipline moves the data into the cooked input Clist as well as the output Clist for echoing. The driver moves the data from the output Clist to the terminal (device I/O). The device I/O moves the data (‘0001’ in this case) to the video RAM. The display electronics now displays it at the appropriate place, along with the other text such as the original prompt. The data is now moved from the cooked input Clist to the memory of the user Application Program after CR or LF is entered, or after the entry of a pre-determined number of characters, as the case may be. The user process uses ‘0001’ as the key for the database search, and gets the record in the user address space. The user process now formulates its display and goes through the same procedure to display the information on the screen.
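To cement the idea of cooked (canonical) mode processing used in this walkthrough, here is a small, self-contained sketch, written for illustration only (it is not the actual UNIX line discipline), that turns the raw keystroke sequence “lbte(DEL)(DEL)(DEL)ate(CR)” from the earlier example into the cooked line “late”:

/* A toy model of canonical-mode line assembly: DEL erases the previous
 * character and CR terminates the line.  All names are illustrative. */
#include <stdio.h>

#define DEL 0x7F
#define CR  0x0D

/* Build a cooked line from raw keystrokes; returns its length. */
size_t cook_line(const unsigned char *raw, size_t n, char *cooked)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        if (raw[i] == DEL) {
            if (len > 0)
                len--;              /* erase the previous character          */
        } else if (raw[i] == CR) {
            break;                  /* end of line: hand over to the process */
        } else {
            cooked[len++] = (char)raw[i];
        }
    }
    cooked[len] = '\0';
    return len;
}

int main(void)
{
    const unsigned char raw[] = { 'l','b','t','e', DEL, DEL, DEL, 'a','t','e', CR };
    char cooked[64];
    cook_line(raw, sizeof raw, cooked);
    printf("cooked input: %s\n", cooked);   /* prints "late" */
    return 0;
}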

For years, magnetic storage such as disk and tape was popular. However, in the recent past, optical storage has become extremely commonplace. The biggest advantage of optical devices is their huge storage capacity as compared to disks and tapes. Also, the access time is relatively fast. One small, inexpensive disk can replace about 25 magnetic tapes! The main disadvantage of CD-ROMs, as the name implies, is that we cannot record data onto them more than once. This means that they can be created only once and thereafter, they are read-only. However, recent improvements in technology have resulted in CDs that can be rewritten.

The CD-ROMs (Compact Disk-Read Only Memory) are 1.2 mm thick and 120 mm across. There is a 15-mm hole at the center of the disk. Recall that the magnetic disks use the principle of a binary switch to indicate 0 or 1. In the case of CDs, the surface has areas called pits and lands. A CD is prepared by using a high-power infrared laser. Wherever the laser strikes the disk surface (made up of polycarbonate material), a burn occurs. This is the pit. It is like a 1 on the magnetic disk. The land is like a 0: no laser beam strikes the surface, and hence, the surface remains unburned. During playback, a low-power laser shines infrared light on the pits and lands as they pass by. A pit reflects less light back as compared to a land. A land reflects back strong light. These reflections can then be converted into the corresponding electric signals. This is how the drive distinguishes a 0 from a 1. Figure 5.42 shows a CD-ROM disk. Notice that the main difference between magnetic disks and CDs is that in the case of CDs, the track-like structure is continuous. A single unbroken spiral contains all the pits and lands. Figure 5.43 shows what happens when a CD-ROM is played back. When the laser beam strikes a land, as shown in the left portion of the figure, the beam is reflected back and thus the sensor receives the beam. Therefore, it is read as a 1 bit. However, when the laser beam strikes a pit, the pit does not reflect back the laser beam. As a result, the sensor does not receive anything, which leads to the conclusion that the data there is a 0 bit.

When music is being played from a CD, the lands and pits should pass by at a constant speed. For this, the rate of rotation of the CD is continuously reduced as the read head moves from the inside of the CD to its outer parts. At the inside, the speed of rotation is 530 revolutions per minute (RPM), and it reduces to 200 RPM at the outside. Philips and Sony realized, sometime in 1984, that CDs could also be used for storing computer data, and published a standard called the Yellow Book for this. Until then, CDs had been used only for storing music. The CDs used for storing computer data from this time onwards were called CD-ROMs, to distinguish them from audio CDs. To keep CD-ROMs similar to audio CDs, the same specifications, such as physical size and mechanical and optical compatibility, were retained. The Yellow Book standard defined the format

of computer data. Earlier, when only music was stored on CDs, it was all right to lose some tones. However, when computer data was to be stored on CDs, it was very important to make sure that no data was lost. For this, error correction mechanisms were also decided. Every 8-bit byte from a computer is stored as a 14-bit symbol on a CD-ROM. The hardware does the 14-to-8 conversion. Thus, one symbol on a CD-ROM = 14 bits. Next, 42 such consecutive symbols make up one frame. Each frame contains 588 bits (42 symbols, each consisting of 14 bits). Out of these, only 192 bits (24 bytes) are used for data. The remaining 396 bits are used for error detection and control for a given frame. This matches the format of audio CDs. Going one step ahead of audio CDs, 98 frames are combined to form a CD-ROM sector. Since we are talking about data bytes alone, we have the following equation: 24 data bytes per frame × 98 such frames = 2352 data bytes per sector. Every sector has a 16-byte preamble. The first 12 of these 16 bytes are used to allow the player to detect the fact that a new sector is beginning. The next three bytes give the sector number. The last byte contains information about the mode, which is explained later. The next 2048 bytes contain the actual data. Finally, the last 288 bytes contain error correction and control data for a given sector. First, let us have a look at these concepts in a diagrammatic form, as shown in Fig. 5.44.

The Yellow Book has defined two modes. Mode-1 takes the form shown in the figure. Here, the data part is 2048 bytes and the error correction part is 288 bytes. However, not all applications need such stringent error correction mechanisms. For instance, audio and video applications should preferably have more data bytes. Mode-2 takes care of this. Here, all the 2336 bytes are used for data. Note that we are talking about three levels of error correction: (a) single-bit errors are corrected at the byte level, (b) short burst errors are corrected at the frame level, and (c) any other remaining errors are corrected at the sector level.
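The sizing arithmetic above can be recomputed mechanically. The small program below is purely illustrative; it simply re-derives the numbers quoted in the text:

/* A small sketch that reproduces the CD-ROM sizing arithmetic above.
 * The constants are the figures from the text. */
#include <stdio.h>

int main(void)
{
    const int bits_per_symbol   = 14;   /* one 8-bit byte stored as 14 bits */
    const int symbols_per_frame = 42;
    const int data_bytes_frame  = 24;   /* 192 data bits per frame          */
    const int frames_per_sector = 98;

    int bits_per_frame   = symbols_per_frame * bits_per_symbol;   /* 588  */
    int bytes_per_sector = data_bytes_frame * frames_per_sector;  /* 2352 */

    /* Mode-1 split of the 2352-byte sector */
    int preamble = 16, data = 2048, ecc = 288;

    printf("bits per frame    : %d\n", bits_per_frame);
    printf("bytes per sector  : %d\n", bytes_per_sector);
    printf("mode-1 check      : %d\n", preamble + data + ecc);        /* 2352 */
    printf("mode-2 data bytes : %d\n", bytes_per_sector - preamble);  /* 2336 */
    return 0;
}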

The definition of CD-ROM data format has since been extended with the Green Book, which added graphics and multimedia capabilities in 1986.

The Digital Versatile Disk Read Only Memory (DVD-ROM) uses the same principle as a CD-ROM for data recording and reading. The main difference between the two, however, is that a smaller, sharper laser beam is used in the case of a DVD. This allows the tracks on a DVD to be placed closer together, and hence pack more data, which is not possible in the case of a CD. Another advantage is that data can be written on both surfaces of a DVD. The typical capacity of each surface of a DVD is about 8.5 gigabytes (GB) – equivalent to the storage of about 13 CDs. Together, the two surfaces can accommodate 17 GB! The main technical differences between a CD-ROM and a DVD are the following:

The Operating System is responsible for using devices properly and efficiently. Normally, we would have devices such as disks, printers, scanners, etc. Disks are considered important devices, since they are capable of storing large amounts of data and are involved in almost all read/write or I/O operations. The Operating System has to monitor and control these actions to achieve the best performance. Disk scheduling is about deciding the order in which the read/write requests issued by the processes are serviced. There are two important parameters regarding disk operations: access time and disk bandwidth. Access time has two components. Seek time is the delay involved in physically moving the read/write head of the disk to the correct place in order to read or write at a particular location; this process is known as seeking, and the time required to move the head to the correct place is called the seek time. Rotational delay is the time required for the addressed area of the disk to rotate into a position where it is accessible by the read/write head.

Disk bandwidth is the capacity of the disk to transfer data from memory to disk and from disk to memory. It is the total number of bytes transferred, divided by the total time between the first request for service and the actual completion of the last transfer. There are various algorithms for disk scheduling, i.e. for ordering the disk read/write operations.

In the SCAN algorithm, the drive head sweeps across the entire surface of the disk, visiting the outermost cylinders before changing direction and sweeping back to the innermost cylinders. It services the next waiting request whose location it will reach on its path backwards and forwards across the disk. Thus, the head movement should be less than in FCFS, and the policy is clearly fairer than SSTF.

C-SCAN is similar to SCAN but the I/O requests are only satisfied when the drive head is travelling in one direction across the surface of the disk. The head sweeps from the innermost cylinder to the outermost cylinder satisfying the waiting requests in the order of their locations. When it reaches the outermost cylinder, it sweeps back to the innermost cylinder without satisfying any requests and then starts again.

In LOOK, as in SCAN, the drive sweeps across the surface of the disk, satisfying requests in alternating directions. However, the drive now makes use of the information it has about the locations of the waiting requests. For example, a sweep out towards the outer edge of the disk will be reversed when there are no waiting requests for locations beyond the current cylinder.

Based on C-SCAN, C-LOOK involves the drive head sweeping across the disk satisfying requests in one direction only. As in LOOK, the drive makes use of the locations of the waiting requests in order to determine how far to continue a sweep, and where to commence the next sweep. Thus, it may curtail a sweep towards the outer edge when there are no more requests for cylinders beyond the current position, and commence its next sweep not at the innermost cylinder, but at the innermost cylinder for which a request is currently pending.
Selecting a disk-scheduling algorithm:
- SSTF is commonly used and has natural appeal.
- SCAN and C-SCAN perform better in cases that place a heavy load on the disk.
- Disk I/O performance depends on the number and types of requests.
- The disk-scheduling algorithm should be written as a separate module of the Operating System, so that it is easy to replace with a different algorithm when necessary.
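To make the policies concrete, the sketch below (illustrative only; the request list and head position are arbitrary sample values) prints the order in which SCAN would service a set of pending cylinder requests while the head is moving outwards. SCAN would additionally move the head to the last cylinder before reversing, which LOOK avoids, but the service order of the requests themselves is the same:

/* SCAN ordering sketch: service everything at or beyond the head while
 * moving outwards, then reverse and service the rest while moving inwards. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

static void scan_order(int head, int *req, int n)
{
    qsort(req, n, sizeof(int), cmp_int);

    /* Outward pass: requests at or beyond the current head position. */
    for (int i = 0; i < n; i++)
        if (req[i] >= head)
            printf("%d ", req[i]);

    /* Inward pass: remaining requests, in decreasing order. */
    for (int i = n - 1; i >= 0; i--)
        if (req[i] < head)
            printf("%d ", req[i]);
    printf("\n");
}

int main(void)
{
    int requests[] = { 98, 183, 37, 122, 14, 124, 65, 67 };   /* sample data */
    scan_order(53, requests, (int)(sizeof requests / sizeof requests[0]));
    return 0;
}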

Swap space is an area on a high-speed storage device (almost always a disk drive), reserved for use by the virtual memory system for deactivation and paging processes. At least one swap device (primary swap) must be present on the system. During system startup, the location (disk block number) and size of each swap device is displayed in 512KB blocks. The swapper reserves swap space at process creation time, but it does not allocate swap space from the disk until pages need to go out to disk. Reserving swap at process creation protects the swapper from running out of swap space. You can add or remove swap as needed (that is, dynamically) while the system is running, without having to regenerate the kernel. System memory used for swap space is called pseudo-swap space. It allows users to execute processes in memory without allocating physical swap. Pseudo-swap is controlled by an Operating System parameter called swapmem_on. By default, swapmem_on is set to 1, enabling pseudo-swap. Typically, when the system executes a process, swap space is reserved for the entire process, in case it must be paged out. According to this model, to run one gigabyte of processes, the system would have to have one gigabyte of configured swap space. Although this protects the system from running out of swap space, disk space reserved for swap is under-utilized if minimal or no swapping occurs. When using pseudo-swap as the swapping mechanism, the pages are locked; as the amount of pseudo-swap increases, the amount of lockable memory decreases. For factory-floor systems (such as controllers), which perform best when the entire application is resident in memory, pseudo-swap space can be used to enhance performance: we can either lock the application in memory or make sure that the total memory consumed by the processes created does not exceed three-quarters of system memory. Pseudo-swap space is set to a maximum of three-quarters of system memory because the system can begin paging once three-quarters of the available system memory has been used. The unused quarter of memory allows a buffer between the system and the swapper to give the system computational flexibility. When the memory consumed by the processes created approaches this capacity, the system might exhibit thrashing and a decrease in system response time. If necessary, we can disable pseudo-swap space by setting the tunable parameter swapmem_on in /usr/conf/master.d/core-hpux to zero. The regions that have pseudo-swap allocated are kept on a doubly linked, null-terminated list whose head is called pswaplist. There are two kinds of physical swap space: device swap and file-system swap. Device swap space resides in its own reserved area (an entire disk or a logical volume of an LVM disk) and is faster than file-system swap, because the system can write an entire request (256 KB) to a device at once. File-system swap space is located on a mounted file system and can vary in size with the system's swapping activity. However, its throughput is slower than device swap, because free

file-system blocks may not always be contiguous; therefore, separate read/write requests must be made for each file-system block. To optimize system performance, file-system swap space is allocated and de-allocated in swchunk-sized chunks. swchunk is a configurable Operating System parameter; its default is 2048 KB (2 MB). Once a chunk of file system space is no longer in use by the paging system, it is released for file system use, unless it has been preallocated with swapon. If swapping to file-system swap space, each chunk of swap space is a file in the file system swap directory, and has a name constructed from the system name and the swaptab index (such as becky.6 for swaptab[6] on a system named becky).

Files are stored on disk. Disk space management is a challenge and a concern for file system designers. There are two methods of writing files to the disk: 1) the complete file is stored sequentially, one byte after another, occupying consecutive bytes on the disk; or 2) the file is not stored sequentially, but is split into several blocks and stored wherever the disk has free space. One big concern when a file is stored in a consecutive manner on the disk is that it becomes difficult to accommodate the file when its size grows. For this reason, file systems break files into fixed-size blocks that need not be adjacent.

Having decided to store a file in fixed-size blocks, the next challenge is deciding the appropriate size for the block. If the block size is large and the file is small, disk space is wasted. If we decide on a small block size, the number of blocks per file becomes high. This means that many blocks have to be read when we read a file, and the read operation becomes slow. Choosing an appropriate block size is therefore a decision for the file system designers, and the block size varies from one Operating System to another.

Keeping track of free blocks is necessary for the allocation of unused/free blocks to store a file on disk. There are two methods widely used to keep track of free blocks: (1) The first method consists of a linked list of disk blocks, with each block holding as many free disk block numbers as will fit. Often, free blocks themselves are used to hold the free list. (2) The second technique is the bitmap. A disk with n blocks requires a bitmap with n bits. Free blocks are represented by 1s in the map, allocated blocks by 0s. A 16 GB disk has 2^24 1-KB blocks and thus requires 2^24 bits for the map, which requires 2048 blocks. Bitmaps require less space than the linked list method, since they use only 1 bit per block.
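The bitmap method can be sketched in a few lines. The fragment below is illustrative only (the helper names are ours); it reserves a bitmap for the 2^24 1-KB blocks of a 16 GB disk, which indeed occupies 2048 KB, and shows how a free block is found and allocated:

/* A minimal free-block bitmap sketch: bit i is 1 when block i is free
 * and 0 when it is allocated, following the convention described above. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NBLOCKS (1u << 24)                 /* 2^24 1-KB blocks on a 16 GB disk */

static uint8_t bitmap[NBLOCKS / 8];        /* 2^21 bytes = 2048 1-KB blocks    */

static void mark_free(uint32_t b)      { bitmap[b / 8] |=  (1u << (b % 8)); }
static void mark_allocated(uint32_t b) { bitmap[b / 8] &= ~(1u << (b % 8)); }

/* Find a free block, mark it allocated and return its number (-1 if none). */
static long alloc_block(void)
{
    for (uint32_t b = 0; b < NBLOCKS; b++)
        if (bitmap[b / 8] & (1u << (b % 8))) {
            mark_allocated(b);
            return (long)b;
        }
    return -1;
}

int main(void)
{
    memset(bitmap, 0xFF, sizeof bitmap);   /* initially every block is free */
    printf("bitmap size: %zu bytes (%zu KB)\n",
           sizeof bitmap, sizeof bitmap / 1024);
    long b = alloc_block();
    printf("allocated block: %ld\n", b);   /* block 0 */
    mark_free((uint32_t)b);                /* and return it to the free pool */
    return 0;
}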


In a multiuser system, many users and programmers sit at their respective terminals and execute the same or different programs. One may be working on a spreadsheet application, someone else may be querying a database of customers, and yet another may be compiling and testing a program. Despite many users working on the system, each one feels as if the entire system is being used only by him. How does this happen? This is made possible by the Operating System which arbitrates amongst all the users of the computer system. The disk stores programs and data for all users and we have seen how the Information Management (IM) module of the Operating System keeps track of all directories and files belonging to various users. In the next chapter, we will see how the Memory Management (MM) module divides the main memory into various parts for allocating to different users. The Operating System enables the CPU to switch from one user to another, based on certain pre-determined policy, so rapidly that the users normally do not become aware of it. Each one thinks that he or she is the only user of the system. However, we know that at any given time, the CPU can execute only one instruction and that instruction can belong to only one of the programs residing in the memory. Therefore, the Operating System will have to allocate the CPU time to various users based on a certain policy. This is done by the Process Management (PM) module which we will discuss here in this chapter. We will study only the uniprocessor Operating System in this chapter.

In order to understand Process Management, let us first understand what a process is and how it is different from a program as far as the Operating System is concerned. In simple terms, a program does not compete for the computing resources like the CPU or the memory, whereas a process does. A program may exist on paper or reside on the disk. It may be compiled or tested, but it still does not compete for the CPU time and other resources. Once a user wants to execute a program, it is located on the disk and loaded into the main memory; at that time, it becomes a process, because it is then an entity which competes for the CPU time. Many definitions of a process have been put forth, but we will call a process “a program under execution, which competes for the CPU time and other resources”.

How did multiprogramming come about? Did it exist from the beginning? In the earlier days, there were only uniprogramming systems. Only one process was in the memory and being executed at a given time. Let us go through a typical calculation of CPU utilization to understand the problems involved in this scheme. Let us say that a program is reading a customer record and printing a line to the customer report after processing and calculations. It does this for all the customer records in a file. The program will look as shown in Fig. 6.1 (shown unstructured). There are two I/O statements in this program: READ and WRITE, and there are 200 processing and calculation instructions in between. As depicted in the figure, however, all these instructions basically use only the main memory and the CPU registers, and therefore, the data transfers or calculations amongst them take place electronically. Hence, these instructions are very fast. For instance, 200 instructions might take only 0.0002 seconds on any modern computer. However, READ and WRITE instructions are different. The Operating System carries out the I/O on behalf of the Application Program (AP), with the help of the controller which finally issues the signals to the device. The entire operation is electromechanical in nature and takes anywhere from 0.0012 to 0.0020 seconds on any modern computer (these figures are only representative). We will assume 0.0015 seconds as an average. Therefore, the time taken for processing one record completely can be calculated as given below.

Read                     : 0.0015 seconds
Execute 200 instructions : 0.0002 seconds
Write                    : 0.0015 seconds
Total                    : 0.0032 seconds

When the Operating System issues an instruction to the controller to carry out an I/O instruction, the CPU is idle during the time the I/O is actually taking place. This is because the I/O can take place independently by DMA, without involving the CPU. Hence, in a single user system, the CPU utilization will be 0.0002/0.0032 = 6.25 per cent. Figure 6.2 (drawn out of proportion) depicts the progression of time from the point of view of the CPU.

In earlier days, computer systems were very costly, and therefore, this idleness had to be reduced. Hence, it was desirable that, by some means, one could run two processes at the same time, such that when process 1 waits for an I/O, process 2 executes and vice versa. There would be some time lost in turning attention from process 1 to process 2, called context switching. The scheme would work if the time lost in the context switch was far lower than the time gained due to the increased CPU utilization. This is generally true, as depicted by Fig. 6.3, showing two processes, and Fig. 6.4, showing three processes running simultaneously. This is the rationale of multiprogramming, where the Process Management (PM) portion of the Operating System is responsible for keeping track of various processes and scheduling them.
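A small back-of-the-envelope program makes this argument concrete. The sketch below is illustrative only: it uses an idealized model in which I/O overlaps perfectly with computation and context-switch time is ignored, recomputes the 6.25 per cent figure, and shows how the upper bound on CPU utilization grows as more programs are kept in memory:

/* Idealized CPU-utilization figures for the record-processing example. */
#include <stdio.h>

int main(void)
{
    const double io_time  = 0.0015 * 2;   /* READ + WRITE per record (seconds) */
    const double cpu_time = 0.0002;       /* 200 instructions per record       */
    const double total    = io_time + cpu_time;

    printf("single program : %.2f%%\n", 100.0 * cpu_time / total);

    /* With n programs resident, the CPU can run another process while one
     * waits for I/O; at best, utilization rises to n * (cpu/total), capped
     * at 100%.  This ignores context-switch overhead entirely. */
    for (int n = 2; n <= 20; n += 6) {
        double u = n * cpu_time / total;
        printf("%2d programs    : %.2f%%\n", n, u > 1.0 ? 100.0 : 100.0 * u);
    }
    return 0;
}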

We have already seen in Chapter 4 on Information Management (IM) how multiprogramming becomes feasible. We know that the disk controller can independently transfer the required data for one process by DMA while the CPU is executing another process. DMA transfers the data between the disk and the memory in bursts, without involving the CPU. When the DMA is using the data bus for one process, the CPU can execute at least limited instructions not involving the data bus for some other process (e.g. register-to-register transfers or ALU calculations). However, between the bursts of data transfer, when there is no traffic on the data bus, the CPU can execute any instruction for the other process. This is the basis of multiprogramming. The number of processes running simultaneously and competing for the CPU is known as the degree of multiprogramming. As this increases, the CPU utilization increases, but then each process may get delayed attention, causing a deterioration in the response time.

How is this context switching done? To answer this, we must know what is meant by the context of a process. If we study how a program is compiled into the machine language and how each machine language instruction is executed in terms of its fetch and execute cycles, we will realize that at any time, the main memory contains the executable machine program in terms of 0s and 1s. For each program, this memory can be conceptually considered as divided into certain instruction areas and certain data areas (such as I/O and working storage areas). The data areas contain, at any moment, the state of various records read, various counters and so on. Modern compilers normally produce code which is reentrant, i.e. it does not modify itself. Therefore, the instruction area does not get modified during execution. But the Operating System cannot assume this. If we interrupt a process at any time in order to execute another process, we must store the memory contents of the old process somewhere. This does not mean that we necessarily have to dump all these memory areas on the disk; it is sufficient to keep pointers to them. Also, all CPU registers such as PC, IR, SP, ACC and other general purpose registers give vital information about the state of the process. Therefore, these also have to be stored. Otherwise, restarting this process would be impossible. We would not know where we had left off and, therefore, where to start from again. The context of the process comprises precisely these two entities mentioned above. If we could store both of these, we have stored the context of the process. Where does one store this information? If the main memory is very small, accommodating only one program at a time, the main memory contents will have to be stored onto the disk before a new program can be loaded in it. This will again involve a lot of I/O operations, thereby defeating the very purpose of multiprogramming. Therefore, a large memory to hold more than one program at a time is almost a prerequisite of multiprogramming. It is not always true, but for the sake of the current discussion, we will assume that the memory is sufficient to run all the processes competing for the CPU. This means that even after the context switch, the old program will continue to be in the main memory. Now what remains to be done is to store the status of the CPU registers and the pointers to the memory allocated to this process. This is done by the Operating System in a specific memory area called the Register Save Area, of which the Operating System maintains one for each process. Normally, this area is a part of a Process Control Block (PCB), again maintained by the Operating System, one for each process, as we shall see later. When a process issues an I/O system call, the Operating System takes over this I/O function on behalf of that process, keeps this process away and starts executing another process, after storing the context of

the original process in its register save area. When the I/O is completed for the original process, that process can be executed again. But at this juncture, the CPU may be executing the other process and, therefore, its registers will be showing the values pertaining to that process. The context of that process has now to be saved in its register save area, and the CPU registers have to be loaded with the saved values from the register save area of the original process to be executed next (for which the I/O is complete). At that time, the Operating System restores the CPU registers, including the PC which gives the address of the next instruction to be executed (but not yet executed, because the CPU was taken away from it). This, in essence, resumes the execution of the process. This operation is carried out so fast that to the user, there is seldom a perceived break or delay during the execution of “his” process. This is depicted in Fig. 6.5. Figure 6.5 shows that before the context switch, process A was running, denoted by the dotted lines. At the time of the context switch, the Operating System stores the state of the CPU registers for process A (step (i) shown in the figure), restores (loads) the already saved registers of process B onto the CPU registers (step (ii) shown in the figure) and starts executing process B (step (iii) shown in the figure). Since processes A and B can both be in the memory, this context switch does not require any swapping, thereby saving the time-consuming I/O operations. After some time, when process A is scheduled again, the registers of process B are stored and the registers for process A are restored, in and from their respective register save areas. Process A then continues from where it had left off. The entire operation is very fast and therefore, the user thinks that he is the only one using the whole machine. That, in fact, is the essence of multiprogramming.
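To visualize what "saving and restoring the context" means, here is a deliberately simplified sketch. It is not an actual Operating System routine (real kernels do this in a few assembly instructions, on the hardware registers themselves), and all names in it are ours:

/* A simplified model of a register save area and a context switch. */
#include <stdint.h>

struct reg_save_area {          /* one per process, kept inside its PCB */
    uint32_t pc;                /* program counter                      */
    uint32_t sp;                /* stack pointer                        */
    uint32_t acc;               /* accumulator                          */
    uint32_t gpr[8];            /* general purpose registers            */
};

/* Stand-in for the real CPU registers; in a real kernel these are the
 * hardware registers, copied to/from memory by assembly code. */
static struct reg_save_area cpu;

static void save_context(struct reg_save_area *rsa)       { *rsa = cpu; }
static void load_context(const struct reg_save_area *rsa) { cpu = *rsa; }

/* Switch the CPU from the currently running process to the next one. */
void context_switch(struct reg_save_area *current, struct reg_save_area *next)
{
    save_context(current);   /* step (i) : store the context of process A  */
    load_context(next);      /* step (ii): restore the saved context of B  */
    /* step (iii): execution resumes at next->pc, i.e. process B runs      */
}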

In order to manage switching between processes, the Operating System defines three basic process states, as given below. A running process is the one which is being executed by the CPU at any given moment; there is only one such process at a time. In multiprocessor systems with multiple CPUs, however, there will be many running processes and the Operating System will have to keep track of all of them. A process which is not waiting for any external event such as an I/O operation is said to be in the ready state. Actually, it could have been running, but for the fact that there is only one processor, which is busy executing instructions from some other process, while this process is waiting for its chance to run. The Operating System maintains a list of all such ready processes and when the CPU becomes free, it chooses one of them as per its scheduling policy and dispatches it for execution. When you sit at a terminal and give a command to the Operating System to execute a certain program, the Operating System locates the

program on the disk, loads it in the memory, creates a new process for this program and enters this process in the list of ready processes. It cannot directly make it run because there might be another process running at that time. Eventually it is scheduled, and at that time its state is changed to running. When a process is waiting for an external event such as an I/O operation, the process is said to be in a blocked state. The major difference between a blocked and a ready process is that a blocked process cannot be directly scheduled even if the CPU is free, whereas a ready process can be scheduled if the CPU is free. Imagine, for instance, a process running a program as shown in Fig. 6.1. At the time of execution, after the READ instruction is executed, the process will be blocked. If it were scheduled again before the desired record was read into the main memory, it would execute an instruction on the wrong data (maybe by using the previous record!) and therefore, there is no sense in scheduling this blocked process until its I/O is over, i.e. until it is changed to the ready state. Let us trace the steps that will be followed when a running process encounters an I/O instruction.
(i) Let us assume that process A was running and it issues a system call for an I/O operation.
(ii) The Operating System saves the context of process A in the register save area of process A.
(iii) The Operating System now changes the state of process A to blocked, and adds it to the list of blocked processes.
(iv) The Operating System instructs the I/O controller to perform the I/O for process A.
(v) The I/O for process A continues by DMA in bursts, as we have seen.
(vi) The Operating System now picks up a ready process (say process B) out of the list of all the ready processes. This is done as per the scheduling algorithm.
(vii) The Operating System restores the context of process B from the register save area of process B. We assume that process B was an already existing process in the system. If process B were a new process, the Operating System would locate on the disk the executable file for the program to be executed. The header normally gives the initial values of the CPU registers, such as the PC. It stores these values in the register save area for this newly created process, loads the program in the main memory and starts executing process B.
(viii) At this juncture, process B is executing, but the I/O for process A is also going on simultaneously, as we have seen earlier.
(ix) Eventually, the I/O requested by process A is completed. The hardware generates an interrupt at this juncture.
(x) As a part of the Interrupt Service Routine (ISR), the Operating System now moves process B from the running to the ready state. It does not put it in a blocked state. This is because process B is not waiting for any external event at this juncture; the CPU was taken away from it because of the interrupt. The Operating System essentially needs to decide which process to run next (it could well be process B again, depending upon the scheduling algorithm and process B’s priority!).
(xi) The Operating System moves process A from the blocked to the ready state. This is done because process A is not waiting for any event any more.
(xii) The Operating System now picks up a ready process from the list of ready processes for execution. This is done as per the scheduling algorithm. It could choose process A, process B or some other process.

(xiii) This selected process is dispatched after restoring its context from its register save area. It now starts executing. In addition to these, there are two more process states, namely new and halted. They do not participate very frequently in the process state transitions during the execution of a process. They participate only at the beginning and at the end of a process and, therefore, are not described in detail. When you create a process, before getting into the queue of ready processes, it might wait as a new process if the Operating System feels that there are already too many ready processes to schedule. Similarly, after the process terminates, the Operating System can put it in the halted state before actually removing all details about it. In UNIX, this state is called the Zombie state.

Process state transitions can be depicted by a diagram as shown in Fig. 6.6. Figure 6.6 shows the way a process typically changes its states during the course of its execution. We can summarise these steps as follows:

(a) When you start executing a program, i.e. create a process, the Operating System puts it in the list of new processes, as shown by (i) in the figure. The Operating System at any time wants only a certain number of processes to be in the ready list, to reduce competition. Therefore, the Operating System introduces a process in the new list first and, depending upon the length of the ready queue, upgrades processes from the new to the ready list. This is shown by the ‘admit (ii)’ arrow in the figure. Some systems bypass this step and directly admit a created process to the ready list.
(b) When its turn comes, the Operating System dispatches it to the running state by loading the CPU registers with the values stored in its register save area. This is shown by the ‘dispatch’ (iii) arrow in the figure.
(c) Each process is normally given a certain time to run. This is known as the time slice. This is done so that a process does not use the CPU indefinitely. When the time slice for a process is over, it is put in the ready state again, as it is not waiting for any external event. This is shown by the (iv) arrow in the figure.
(d) While running, if the process wants to perform some I/O operation, denoted by the I/O request (v) in the diagram, a software interrupt results because of the I/O system call. At this juncture, the Operating System makes this process blocked, and takes up the next ready process for dispatching.
(e) When the I/O for the original process is over, denoted by I/O completion (vi), the hardware generates an interrupt, whereupon the Operating System changes this process into a ready process. This is called a wake-up operation, denoted by (vi) in the figure. Now the process can again be dispatched when its turn arrives.
(f) The whole cycle is repeated until the process is terminated.
(g) After termination, it is possible for the Operating System to put this process into the halted state for a while before removing all its details from the memory, as shown by the (vii) arrow in the figure. The Operating System can, however, bypass this step.
The Operating System, therefore, provides for at least seven basic system calls or routines. Some of these are callable by the programmers, whereas others are used by the Operating System itself in manipulating various things. These are summarised in Fig. 6.7.

Each of these system calls takes the process-id as a parameter and carries out the corresponding process state transition. We will now study how these are actually done by the Operating System.
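The states and transitions of Fig. 6.6 can be summarized in a few lines of code. The sketch below is illustrative only; the state names follow the text, and everything else is ours:

/* Basic process states and a few of the transitions from Fig. 6.6. */
typedef enum { NEW, READY, RUNNING, BLOCKED, HALTED } proc_state;

struct process {                 /* a tiny stand-in for a PCB */
    int        pid;
    proc_state state;
};

/* I/O request: a running process becomes blocked. */
void block(struct process *p)    { if (p->state == RUNNING) p->state = BLOCKED; }

/* I/O completion ("wake up"): a blocked process becomes ready again. */
void wakeup(struct process *p)   { if (p->state == BLOCKED) p->state = READY; }

/* Dispatch: a ready process is given the CPU. */
void dispatch(struct process *p) { if (p->state == READY)   p->state = RUNNING; }

/* Time slice over ("time up"): a running process goes back to ready. */
void timeup(struct process *p)   { if (p->state == RUNNING) p->state = READY; }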

The Operating System maintains the information about each process in a record or a data structure called Process Control Block (PCB) as shown in Fig. 6.8. Each user process has a PCB. It is created when a user creates a process and it is removed from the system when the process is killed. All these PCBs are kept in the memory reserved for the Operating System. Let us now study the fields within a PCB. The fields are as follows:

Process-id is a number allocated by the Operating System to the process on creation. This is the number which is used subsequently for carrying out any operation on the process, as is clear from Fig. 6.7. The Operating System normally sets a limit on the maximum number of processes that it can handle and schedule. Let us assume that this number is n. This means that the Pid can take on values between 0 and n–1. The Operating System starts allocating Pids from number 0. The next process is given Pid 1, and so on. This continues till n–1. At this juncture, if a new process is created, the Operating System wraps around and starts with 0 again. This is done on the assumption that, at this juncture, the process with Pid = 0 would have terminated. UNIX follows this scheme. There is yet another scheme which can be used to generate the Pid. If the Operating System allows for a maximum of n processes, the Operating System reserves a memory area to define the PCBs for n processes. If one PCB requires x number of bytes, it reserves nx bytes and pre-numbers the PCBs from 0 to n–1. When a process is created, a free PCB slot is selected, and its PCB number itself is chosen as the Pid number. When a process terminates, the PCB is added to a free pool. In this case, the Pids are not necessarily allocated in ascending sequence. The Operating System has to maintain a chain of free PCBs in this case. If this chain is empty, no new process can be created. We will assume this scheme in our further discussions. We have studied different process states such as running, ready, etc. This information is kept in a codified fashion in the PCB. Some processes are required to be completed more urgently (higher priority) than others (lower priority). This priority can be set externally by the user/system manager, or it can be decided by the Operating System internally, depending on various parameters. You could also have a combination of these schemes. We will study more about these in later sections on process scheduling. Regardless of the method of computation, the PCB contains the final, resultant value of the priority for the process. As studied before, the register save area is needed to save all the CPU registers at the context switch. The PCB also gives direct or indirect addresses of pointers to the locations where the process image resides in the memory. For instance, in paging systems, it could point towards the page map tables, which in turn point towards the physical memory (indirect). In the same way, in contiguous memory systems, it could point to the starting physical memory address (direct). The PCB also gives pointers to other data structures maintained for that process, such as the information about the files opened by it. This can be used by the Operating System to close all open files not closed by a process explicitly on termination.

This gives the account of the usage of resources such as CPU time, connect time, disk I/O used, etc. by the process. This information is used especially in a data centre environment or cost centre environment where different users are to be charged for their system usage. This obviously means an extra overhead for the Operating System as it has to collect all this information and update the PCBs with it for different processes. As an example, with regard to the directory, this contains the pathname or the BFD number of the current directory. As we know, at the time of logging in, the home directory mentioned in the system file (e.g. user profile in AOS/VS or /etc/passwd in UNIX) also becomes the current directory. Therefore, at the time of logging in, this home directory is moved in this field as current directory in the PCB. Subsequently, when the user changes his directory, this field also is appropriately updated. This is done so that all subsequent operations can be performed easily. For instance, at any time if a user gives an instruction to list all the files from the current directory, this field in the PCB is consulted, its corresponding directory is accessed and the files within it are listed. Apart from the current directory, similar useful information is maintained by the Operating System in the PCB. This essentially gives the address of the next PCB (e.g. PCB number) within a specific category. This category could mean the process state. For instance, the Operating System maintains a list of ready processes. In this case, this pointer field could mean “the address of the next PCB with state = “ready”. Similarly, the Operating System maintains a hierarchy of all processes so that a parent process could traverse to the PCBs of all the child processes that it has created. Figure 6.9 shows the area reserved by the Operating System for all the PCBs. If an Operating System allows for a maximum of n processes and the PCB requires x bytes of memory each, the Operating System will have to reserve nx bytes for this purpose. Each box in the figure denotes a PCB with the PCB-id or number in the top left corner. We now describe in Fig. 6.9, a possible simple implementation of PCBs and its data structures. The purpose is only illustrative and a specific Operating System may follow a different methodology, though essentially to serve the same purpose. Any PCB will be allocated either to a running process or a ready process or a blocked process (we ignore the new and halted processes for simplicity). If the PCB is not allocated to any of these three possible states, then it has to be unallocated or free. In order to manage all this, we can imagine that the Operating System also maintains four queues or lists with their corresponding headers as follows: One for a running process, one for the ready processes, one for the blocked process and one for free PCBs. Therefore, we assume for our current discussion, that a process is admitted to the ready queue directly after its creation. We also know that there can be only one running process at a time. Therefore, its header shows only one slot. But all other headers have two slots each. One slot is for the PCB number of the first PCB for a process in that state, and the second one is for the PCB number of the last one in the same state. Each PCB itself has two pointer slots. These are for the forward and backward chains. The first slot is for the PCB number of the next process in the same state. 
The second one is for the PCB number of the previous process in the same state. In both cases, ‘*’ means the end of the chain. Though we could manage with pointers in only one direction, we have assumed bidirectional pointers to aid data recovery. These slots are shown at the bottom right corner of each PCB. The PCB also shows the Pid or PCB number in the top left corner. This is shown only for our better comprehension. As all PCBs are of the same size, given the PCB number, the kernel can directly access any PCB, and therefore, this PCB number does not actually need to be a part of the PCB.
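Putting the fields of Fig. 6.8 and the chain slots just described together, a PCB table could be declared roughly as follows. This is an illustrative sketch, not the layout of any particular Operating System; the field names are ours:

/* A rough PCB layout and the four list headers described in the text. */
#include <stdint.h>

#define MAX_PROC 15            /* n: maximum number of processes (0..n-1) */
#define NIL      (-1)          /* plays the role of '*' (end of chain)    */

typedef enum { FREE, READY, RUNNING, BLOCKED } pcb_state;

struct pcb {
    pcb_state state;           /* process state                           */
    int       priority;        /* final, resultant scheduling priority    */
    struct {                   /* register save area                      */
        uint32_t pc, sp, acc, gpr[8];
    } regs;
    void     *mem_info;        /* pointers to the process's memory image  */
    int       open_files[16];  /* information about files opened          */
    long      cpu_time_used;   /* accounting information                  */
    int       next, prev;      /* forward/backward chain within a state   */
};

struct list_header { int first, last; };

static struct pcb         pcb_table[MAX_PROC];      /* nx bytes reserved  */
static struct list_header ready_q    = { NIL, NIL };
static struct list_header blocked_q  = { NIL, NIL };
static struct list_header free_q     = { NIL, NIL };
static int                running_pcb = NIL;        /* only one at a time */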

At the bottom, we also have shown some area (currently blank) to list all the PCB numbers of all the processes in different states. This will enable us to follow the pointers while we give further description. This area is only for our clarification and the Operating System does not actually maintain it. It maintains only the PCBs. Whenever a process terminates, the area for that PCB becomes free and is added to the list of free PCBs. Any time a new process is created, the Operating System consults the list of free PCBs first, and then acquires one of them. It then fills up the PCB details in that PCB and finally links up that PCB in the chain for ready processes. We assume that, to begin with, at a given time, process with Pid = 3 is in running state. Processes with Pid = 13, 4, 14 and 7 are in ready state. Processes with Pid = 5, 0, 2, 10 and 12 are in the blocked state. PCB slots with PCB number = 8, 1, 6, 11, 9 are free (we have shown only Pids 0–14). The same is shown in Fig. 6.10. In the PCB list of blocked processes or that of the free PCBs, there is no specific order or sequence. A list of free PCBs grows as processes are killed and PCBs are freed, and there is no specific order in which that will necessarily happen. The Operating System can keep the blocked list in the sequence of process priorities. But that rarely helps, because that is not the sequence in which their I/O will be necessarily completed to move them to the ready state. On the other hand, the ready processes are normally maintained in a priority sequence. For instance, a process with Pid = 13 is the one with the highest priority and the one with Pid = 7 is with the lowest priority in the list of ready processes shown in Fig. 6.10. In such a case, at the time of dispatching the highest priority ready process, all that the Operating System needs to do is to pick up the PCB at the head of the chain. This can be easily done by consulting the header of the list (which gives the PCB with Pid = 13 as shown in Fig. 6.10) and then adjusting the header to point to the next ready process (which is with Pid = 4 in the figure). If the process scheduling philosophy demanded

the maintenance of PCBs in the ready list in FIFO sequence, as in the Round Robin philosophy, instead of the priority sequence, the Operating System would maintain the pointer slots accordingly. In this case, any new PCB will necessarily be added at the end of the list. Some Operating Systems have a scheduling policy which is a mixture of the FIFO and priority-based philosophies. The Operating System in this case will have to chain the PCBs in the ready list accordingly. In fact, the Operating System may have to sub-divide the ready list further into smaller lists according to sub-groups within the ready list. In this case, each sub-group corresponds to a priority level, but all the processes belonging to a sub-group are scheduled in the Round Robin fashion. We will study more about this when we study Multilevel Feedback Queues in the section on process scheduling. At this juncture, let us assume that there is only one list of ready processes, maintained in the same sequence as the Operating System wants to schedule them. Let us trace one chain completely to see how it works. As we know, the PCB contains two pointers: the next and the prior, for the PCBs in the same state. For instance, if we want to access all the PCBs in the ready state, we can do that in the following manner:
(i) Access the ready header. Access the first slot in the header. It says 13. Hence, PCB number 13 is the first PCB in the ready state (i.e. with Pid = 13).
(ii) We can now access PCB number 13. We confirm that the state is ready (written in the box). Actually, the process state is one of the data items in the PCB which gives us this information.
(iii) We access the next pointer in PCB 13. It says 4. It means that PCB number 4 is the one for the next process in the ready list.
(iv) We now access PCB 4 and again confirm that it is also a ready process.

(v) The next pointer in PCB 4 gives 14.
(vi) We can now access PCB 14 as the PCB for the next ready process, and confirm that it is for a ready process.
(vii) The next pointer in PCB 14 is 7.
(viii) We can access PCB 7 and confirm that it is for a ready process.
(ix) The next pointer of PCB 7 is “*”. It means that this is the end of this chain.
(x) This tallies with the ready header which says that the last PCB in the ready list is PCB 7.
We thus have accessed PCBs 13, 4, 14 and 7 in that order. We know from the box at the bottom of Fig. 6.10 that these are all the ready processes in the system, to be scheduled in that order. If we wanted to access them in the reverse order, i.e. 7, 14, 4 and 13, we could start with the last pointer in the header and use the prior pointers in the PCBs. This is called a two-way chain and is normally maintained for recovery purposes in case of data corruption. We leave it to the reader to traverse through the blocked and free PCB chains. The above procedure will also throw some light on the algorithms needed to access a PCB in a given state to remove it from the chain or add to it.
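The traversal just described is a plain walk of a linked list held in an array. The self-contained sketch below (illustrative only) sets up the ready chain 13, 4, 14, 7 of Fig. 6.10 and follows the next pointers until the ‘*’ marker, represented here by -1:

/* Walking the ready chain of Fig. 6.10 by following the 'next' pointers. */
#include <stdio.h>

#define MAX_PROC 15
#define NIL      (-1)                      /* plays the role of '*' */

struct pcb { int state; int next, prev; }; /* only the fields needed here */
struct list_header { int first, last; };

static struct pcb pcb_table[MAX_PROC];
static struct list_header ready_q = { 13, 7 };   /* as in Fig. 6.10 */

static void init_example(void)
{
    /* ready chain 13 -> 4 -> 14 -> 7, as in the text */
    pcb_table[13].next = 4;   pcb_table[13].prev = NIL;
    pcb_table[4].next  = 14;  pcb_table[4].prev  = 13;
    pcb_table[14].next = 7;   pcb_table[14].prev = 4;
    pcb_table[7].next  = NIL; pcb_table[7].prev  = 14;
}

int main(void)
{
    init_example();
    for (int p = ready_q.first; p != NIL; p = pcb_table[p].next)
        printf("ready PCB %d\n", p);       /* prints 13, 4, 14, 7 */
    return 0;
}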

Normally, the Operating System allows a process to have offspring. The new process thus created is called a child process. This child process can, in turn, create further child processes, hence creating the process hierarchy, as shown in Fig. 6.11. The following example will clarify the concept of process hierarchy:
(i) When a user logs on to a computer, he executes a command interpreter (CI, e.g. CLI under AOS/VS or the shell under UNIX).
(ii) Now let us assume that the user says, “RUN PAYCALC”. At this juncture, the Operating System creates a process for the program “PAYCALC” after finding this program on the disk and loading it in memory. This process is created as a child process of the original CI process.
(iii) Let us suppose that this program “PAYCALC” calls another sub-program, say “TAXCALC”, by an instruction “CALL PROGRAM TAXCALC”. At this juncture, the Operating System locates the program TAXCALC, loads it into the memory and creates yet another process “TAXCALC” as a child process of “PAYCALC”.
This is the way a process hierarchy grows. How is this process hierarchy normally implemented by the Operating System? The Operating System maintains a separate pointer chain to link the related PCBs to represent the process hierarchy (very similar to a hierarchical database system). Hence, each PCB will have some additional pointer fields, as shown in Fig. 6.12. Using these pointers for the process hierarchy in Fig. 6.11, we get the picture as depicted in Fig. 6.13

(we will show only pointers A and C from Fig. 6.12 to avoid cluttering). A shaded box in Fig. 6.12 represents the end of the respective chain in this figure. For instance, PCB-A does not have a twin. Therefore, its second pointer slot is shown as shaded. We have not shown these pointers in Fig. 6.10 to avoid complications and cluttering. These pointers are used to traverse through the process hierarchy. For instance, if the Operating System wants to know about all the children for a parent, these pointer chains can be used. This knowledge is necessary to implement various policies followed by the Operating System. For instance, some systems forcibly kill all the child processes if you kill the parent process. Some systems do not allow a parent process to be killed unless you have already killed all its child processes. The pointers given in Fig. 6.12 allow the Operating System to take these actions.
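The hierarchy pointers can be sketched along the same lines. The field names parent, first_child and next_twin below are illustrative stand-ins for the pointers of Fig. 6.12, and the recursive walk shows how an Operating System that kills all children along with the parent could visit every descendant.

#define NIL     (-1)
#define MAX_PCB 16

/* Extra hierarchy pointers kept in (or alongside) each PCB.            */
struct pcb_hier {
    int parent;                   /* the creating process               */
    int first_child;              /* head of this process' child chain  */
    int next_twin;                /* next PCB under the same parent     */
};

struct pcb_hier hier[MAX_PCB];

/* Visit every descendant of process p, children before their parent,
   e.g. to kill the whole subtree when the parent itself is killed.     */
void for_each_descendant(int p, void (*action)(int pid))
{
    for (int c = hier[p].first_child; c != NIL; c = hier[c].next_twin) {
        for_each_descendant(c, action);   /* grandchildren first        */
        action(c);
    }
}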

We present in Fig. 6.14 a list of very common operations on processes. Any Operating System has to provide for these system calls or services. Pid is supplied as a parameter to all of these system calls. Most of these result in some changes in the process states, and hence in the linking/delinking of PCBs to/from the various queues maintained for the different process states. In the subsequent sections, we will consider these one by one, each time giving a step by step procedure and the accompanying process state transitions. We will assume that, to begin with, there are many processes in the system in different states, as depicted in Fig. 6.10. We will construct an imaginary sequence of events to study the process state transitions more closely. While doing that, we will create a process, kill a process, dispatch a process, change the process' priority, block a process due to an I/O request, dispatch yet another process, time up a process and wake up a process. Essentially, we are trying to simulate a realistic example.

When you sit at a terminal and give a command to the CI to execute a program or your program gives a call to execute a sub-program, a new child process is created by executing a system call. The Operating System follows a certain procedure to achieve this which is outlined below. 1. The Operating System saves the caller’s context . If you give a command to the CI, then the CI is the caller. If a sub-program is being executed within your program, your program is the caller. In both the cases, the caller process will have its PCB. Imagine that you are executing process A, as shown in Fig. 6.15.

After the divide instruction (instruction number 7), the program calls another sub-program at instruction 8, after the completion of which the main program must continue at instruction 9. At the time of execution, instruction 8 gives rise to the creation of a child process. The point is that after the child process is executed and terminated, the caller process must continue at the proper point (instruction 9 in this case). As we know, while executing instruction 8, the program counter (PC) will have already been incremented by 1. Hence, it will already be pointing to the address of instruction 9. Hence, it has to be saved so that when it is restored, the execution can continue at instruction 9. This is the reason why the caller’s context has to be saved. All CPU registers are saved in the register save area of the caller’s PCB, before a new child process is created and a PCB is allocated to it. After saving its context, the caller’s process is blocked. 2. The Operating System consults the list of free PCBs and acquires a free PCB. Assuming that the states of various processes correspond to Fig. 6.10, the Operating System will find that PCB number 8 is free (it is at the head of the free chain).

3. It assigns Pid = 8 for the new process. 4. It updates the free PCB header to take the value 1 as the first free PCB number. The header for the free PCBs now looks as shown below:

5. The Operating System now consults the IM for the location of the sub-program file on the disk, its size and the address of the first executable instruction (such as the first instruction in the Procedure Division in COBOL or the first statement in main() in C) in that program. The compiler normally keeps this address and other information in the header of the executable compiled program file. The Operating System also verifies the access rights to ensure that the user can execute that program. 6. The Operating System consults the MM to determine the availability of the free memory area to hold the program and allocates those locations. 7. The Operating System again requests the IM to actually load the program in the allocated memory locations. 8. The Operating System determines the initial priority of the process. In some cases, the priority can be assigned externally by the user at the time of process creation. In others, it is directly inherited from the caller. Priorities can be global (or external) or local (or internal). We will talk about priorities later. 9. At this juncture, the PCB fields at PCB number 8 are initialised as follows (refer to Fig. 5.8). (i) Process id = 8 (ii) Process state = ready (iii) Process priority = as discussed above in point 8. (iv) Register Save Area: the PC is set to the address of the first executable instruction as discussed in point 5, the SP is set to the beginning of the stack, and all the other relevant registers are also initialised.

(v) Pointers to process’ memory: Address of the beginning of the program in the physical memory for contiguous allocation or address of the page map tables for paging systems, as we will learn later. The limit registers such as PMTLR also are set to proper values, depending upon the memory allocation method and actual locations allocated (we will learn more about these in the subsequent chapters). These limit registers are used by the memory management module to ensure the protection of the memory, and also that a process is accessing the correct locations allocated only to that process. At the time of a context switch, these values of the limit registers are picked up from the PCB and actually loaded onto those registers. This is how only a single set of hardware limit registers suffices. (vi) Pointers to other resources (such as semaphores): None at the time of creating a process. (vii) List of open files: None to begin with. (viii) Accounting information: The Operating System notes down the starting time and initialises the other fields (such as CPU time, disk I/O, etc.) to be updated later during the course of the execution of the process. (ix) Other information: This is initialised, as required.

(x) Pointers to other PCBs: These are set up as discussed in point 10. 10. The Operating System links this PCB in the list of ready processes. One of the algorithms used could be Round Robin. In this case, each process is allocated a fixed time slice, at the end of which the PCB for the process is removed from the head of the ready chain and introduced at the end of it, assuming that it consumes the time slice without requiring any I/O. (If it does require an I/O, it will be blocked before its time slice is up.) If we follow this policy, the new PCB with Pid = 8 would be introduced at the end of the ready queue. The ready queue would have been 13, 4, 14, 7 and 8 after this addition (refer to Fig. 6.10). If the scheduling algorithm is priority driven, the PCB will have to be linked at the appropriate place. For instance, Fig. 6.10 shows the PCBs in ready state as 13, 4, 14 and 7. In a strictly priority driven scheme, the priority of the process with Pid = 13 will be the highest and that of the process with Pid = 7 the lowest. The simple reason is that the scheduler just picks up the PCB at the head of the list and dispatches that process for execution. Hence, if our new process has a priority (as discussed in point 8) which is the highest of all the ready processes, the queue and the header of ready processes are changed. We assume that the caller creates a child process and then both of these processes continue to exist within the same system, competing for the CPU. We also assume that, after linking the process with Pid = 8 at the head of the ready queue, the caller process with Pid = 3 continues to run if its priority is higher than that of the process with Pid = 8. The PCBs now look as shown in Fig. 6.16. 11. The Operating System now updates the master list of all known processes. This list can be in the Pid sequence as shown in Fig. 6.17. This list is used to track all the known processes in the system.

12. The PCB for the created process is linked to another process, according to the process hierarchy. This has already been discussed in Sec. 6.8.
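A compilable sketch of the core of this procedure is given below. It covers only the PCB bookkeeping (steps 2 to 4, 9 and 10 above, assuming a priority-ordered ready chain); consulting the IM and MM, loading the program and updating the master list are merely indicated by comments, and all names and field layouts are assumptions made for illustration.

#include <string.h>

#define MAX_PCB 16
#define NIL     (-1)

enum pstate { FREE, READY, RUNNING, BLOCKED };

struct pcb {
    int         pid;
    enum pstate state;
    int         priority;         /* larger value = higher priority      */
    long        pc, sp;           /* part of the register save area      */
    int         next, prior;
};

struct pcb pcbs[MAX_PCB];
int free_first  = NIL;            /* would be 8 in the Fig. 6.10 example */
int ready_first = NIL, ready_last = NIL;

int create_process(int priority, long first_instr, long stack_base)
{
    /* Steps 2-4: acquire the PCB at the head of the free chain; its
       number becomes the Pid, and the free header is advanced.          */
    int pid = free_first;
    if (pid == NIL) return NIL;               /* no free PCB available   */
    free_first = pcbs[pid].next;

    /* Steps 5-8 (IM lookup, MM allocation, program loading, priority
       rules) happen here in a real system; they are omitted in this
       sketch.                                                           */

    /* Step 9: initialise the PCB fields.                                */
    memset(&pcbs[pid], 0, sizeof pcbs[pid]);
    pcbs[pid].pid      = pid;
    pcbs[pid].state    = READY;
    pcbs[pid].priority = priority;
    pcbs[pid].pc       = first_instr;         /* first executable instr  */
    pcbs[pid].sp       = stack_base;

    /* Step 10: link the PCB into the ready chain in priority order.     */
    int prev = NIL, cur = ready_first;
    while (cur != NIL && pcbs[cur].priority >= priority) {
        prev = cur;
        cur  = pcbs[cur].next;
    }
    pcbs[pid].next  = cur;
    pcbs[pid].prior = prev;
    if (prev == NIL) ready_first = pid; else pcbs[prev].next  = pid;
    if (cur  == NIL) ready_last  = pid; else pcbs[cur].prior = pid;

    /* Steps 11-12: the master list of known processes and the process
       hierarchy pointers would be updated here.                         */
    return pid;
}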

This system call is executed either after the logical completion of a program (typically at the STOP RUN statement in the COBOL program) or after forcibly terminating or aborting a program (by typing CTRL-C, CTRL-A at the terminal). The Operating System follows the steps as given below. In our example, let us assume that the process with Pid = 3 terminates. 1. Pid is supplied as a parameter to this system call as shown in Fig. 6.14. 2. The system call causes a software interrupt and the Operating System routine starts executing. No user process is in the running state at this juncture. 3. The Operating System stores the Pid (in this case 3) for future use. 4. The Operating System accesses the PCB with Pid = 3. It refers to the data items such as ‘pointers to the process’ memory’ and ‘pointers to the other resources’ in the PCB for Pid = 3 and frees those resources with the help of the MM and IM routines. 5. The Operating System refers to the list of open files in the PCB 3 and ensures that they are closed (especially if the killing of the process is due to CTRL-C). 6. The Operating System now adds PCB 3 to the list of free PCBs at the end. 7. The PCB chains governed by the process hierarchy as discussed before are also updated appropriately, depending upon where the process with Pid = 3 belonged. Also, other required actions are taken. For instance, if the Operating System has a philosophy of forcibly killing all the children on the termination of a parent, the Operating System traverses the pointer chains for process hierarchy and eradicates all the processes below it and frees all the corresponding PCBs and their corresponding resources. If the Operating System believes in allowing a parent process to die only after all the children are dead, this also can be verified at this juncture by using the same pointer chains. 8. The master list of known processes, as shown in Fig. 6.17 is updated. 9. The PCB chain now looks as shown in Fig. 6.18 (the free PCBs now are 1, 6, 11, 9 and 3). 10. The Operating System dispatches the next ready process in the queue for execution which is with PCB = 8 in this case.

We will simply assume that the currently running process gets over (e.g. process with Pid = 3 as in the last example) and therefore, there is a need to dispatch a new process. We will assume that the Operating System has finished the kill process procedure as outlined in Sec. 5.11. The PCBs will be in a state as shown in Fig. 6.18. 1. The Operating System accesses the ready header and through it, it accesses the PCB at the head of the chain. In this case, it will be the PCB with Pid = 8.

2. It removes PCB 8 from the ready list and adjusts the ready header. It changes the status of PCB 8 to running. The PCBs will look as shown in Fig. 6.19. 3. The Operating System updates the running header to Pid = 8. 4. The Operating System loads all the CPU registers with the values stored in the register save area of PCB 8. 5. The process with Pid = 8 now starts executing where it had left off earlier, or from the first executable instruction if it has only just started executing. 6. The master list of known processes as shown in Fig. 6.17 is also updated.
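The dispatch operation can be sketched in the same style. Loading the CPU registers from the register save area is hardware specific and is therefore only indicated by a comment; the names used are again illustrative.

#define NIL (-1)

enum pstate { FREE, READY, RUNNING, BLOCKED };
struct pcb { enum pstate state; int next, prior; /* register save area ... */ };

struct pcb pcbs[16];
int ready_first = NIL, ready_last = NIL;
int running     = NIL;            /* the running header: a Pid or NIL    */

int dispatch(void)
{
    int pid = ready_first;        /* step 1: PCB at the head of the chain */
    if (pid == NIL) return NIL;   /* nothing to run                       */

    ready_first = pcbs[pid].next; /* step 2: remove it from the ready list */
    if (ready_first == NIL) ready_last = NIL;
    else pcbs[ready_first].prior = NIL;

    pcbs[pid].state = RUNNING;    /* step 2: change its status            */
    running = pid;                /* step 3: update the running header    */

    /* step 4: load PC, SP and the other CPU registers from the register
       save area of pcbs[pid]; step 5 then resumes the process itself.    */
    return pid;
}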

This system call is very simple and can be executed after the Operating System is supplied with the Pid and the new priority as parameters. The Operating System now does the following: 1. The Operating System accesses the PCB for that Pid. 2. It now changes the priority field within the PCB to the new value. 3. If the scheduling algorithm is based on the process priorities, then the ready list and the corresponding header is updated to reflect the new change. We will assume in our example for simplicity that after the change of the priority of a process, the sequence of ready processes remains unchanged. Even then, reflecting the changes in the priorities in the PCBs is essential, because it may affect the placement of any PCBs added to the ready queue later on, depending upon their priorities. We will assume that Fig. 6.19 still depicts the status of PCBs.

Let us now assume that the running process with Pid = 8 issues a system call to read a record. The process with Pid = 8 will have to be blocked by a system call. This is executed in the following steps: 1. All CPU registers and other pointers in the context for Pid = 8 are stored in the register save area of the PCB with Pid = 8. 2. The status field in the PCB with Pid = 8 is updated to blocked. 3. PCB 8 is now added at the end of the blocked list. We have seen why it is not necessary to link the blocked processes in any order such as by priority. 4. The running header is updated to reflect the change. We know that the scheduler process within the Operating System is executing at this juncture. 5. The master list of known processes, as shown in Fig 6.17 is updated accordingly. The PCBs will look as shown in Fig. 6.20.
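A sketch of this blocking path is shown below. Saving the CPU registers is again only indicated by a comment, and the blocked chain is maintained as a simple unordered list with a tail append, which, as noted above, is all that is needed for blocked processes.

#define NIL (-1)

enum pstate { FREE, READY, RUNNING, BLOCKED };
struct pcb { enum pstate state; int next, prior; };

struct pcb pcbs[16];
int blocked_first = NIL, blocked_last = NIL;
int running       = NIL;

void block_running_process(void)
{
    int pid = running;
    if (pid == NIL) return;

    /* Step 1: the CPU registers would be saved into the register save
       area of pcbs[pid] here.                                            */

    pcbs[pid].state = BLOCKED;                      /* step 2             */

    pcbs[pid].next  = NIL;                          /* step 3: append at  */
    pcbs[pid].prior = blocked_last;                 /* the tail of the    */
    if (blocked_last == NIL) blocked_first = pid;   /* blocked chain      */
    else pcbs[blocked_last].next = pid;
    blocked_last = pid;

    running = NIL;            /* step 4: the scheduler itself runs now    */
}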

As the process with Pid = 8 gets blocked, there is a need to dispatch the next ready process (dispatch is being discussed twice only to simulate a realistic example). We will now assume that the ready process at the head of the chain, i.e. process with Pid = 13 will be dispatched. We have already discussed the detailed algorithm for the dispatch operation in Sec. 6.12 and hence, it need not be discussed here again. The PCBs at the end of the operation will now look as shown in Fig. 6.21.

In order to be fair to all processes, the time sharing Operating System normally provides a specific time slice to each process. We will study this further in the section on scheduling algorithms. This time slice is changeable. There is a piece of hardware called the timer which is programmable. The Operating System loads the value of the time slice (e.g. 32 ms) into the register of this timer. In the computer system, there is a system clock provided by the hardware. Each clock tick generates an interrupt. At the end of each clock tick, some actions may be necessary. For instance, the Operating System may believe in lowering the priority of a process as it executes for a longer period. This is done by the Operating System in the interrupt service routine (ISR) for the clock tick. The clock tick is normally a very small period, and the time slice given by the Operating System to each process

is normally made up of multiple clock ticks. After the time slice value is loaded in the timer, the hardware keeps on adding 1 for each clock tick until the time elapsed becomes equal to the time slice. At this juncture, another interrupt is generated for the time-up operation. The Operating System uses this interrupt to switch between processes so that a process is prevented from grabbing the CPU endlessly. At this juncture, the Operating System executes a system call: “process time up”, given the Pid. Let us assume that the time slice is up for our running process with Pid = 13. The Operating System now proceeds in the following fashion: 1. The Operating System saves the CPU registers and other details of the context in the register save area of the PCB with Pid = 13. 2. It now updates the status of that PCB to ready. It may be noted that the process is not waiting for any external event, and so it is not blocked. 3. The process with Pid = 13 now is linked to the chain of ready processes. This is done as per the scheduling philosophy as discussed before. Meanwhile, let us assume that, externally, the priorities of all other ready processes have been increased more than that of 13, and hence, the PCB with Pid = 13 is added at the end of the ready queue. The ready header is also changed accordingly. 4. The running header is updated to denote that the scheduler process is executing. 5. The master list of known processes, as shown in Fig. 6.17 is now updated to reflect this change. The PCBs now look as shown in Fig. 6.22.
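The tick counting and the time-up action can be sketched as follows. The figure of 8 ticks per slice (e.g. a 32 ms slice with a 4 ms clock tick) and all the names are assumptions chosen only to illustrate the mechanism.

#define NIL (-1)

enum pstate { FREE, READY, RUNNING, BLOCKED };
struct pcb { enum pstate state; int next, prior; };

struct pcb pcbs[16];
int running = NIL, ready_first = NIL, ready_last = NIL;

static int       ticks_used  = 0;
static const int slice_ticks = 8;    /* e.g. 32 ms slice / 4 ms tick     */

static void append_ready(int pid)
{
    pcbs[pid].state = READY;
    pcbs[pid].next  = NIL;
    pcbs[pid].prior = ready_last;
    if (ready_last == NIL) ready_first = pid; else pcbs[ready_last].next = pid;
    ready_last = pid;
}

void clock_tick_isr(void)
{
    /* per-tick housekeeping (ageing priorities, accounting) goes here    */
    if (running == NIL) return;
    if (++ticks_used < slice_ticks) return;       /* slice not yet over   */

    /* time up: step 1 would save the context into the register save
       area of pcbs[running]; steps 2-4 follow.                           */
    append_ready(running);        /* ready again, at the tail of the chain */
    running    = NIL;             /* the scheduler will dispatch the next  */
    ticks_used = 0;
}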

When the I/O for a process is completed by the hardware, before the execution of the wake up system call, the following things happen: (i) The hardware itself generates an interrupt. (ii) The device which has generated this interrupt and the corresponding interrupt service routine (ISR) are identified first (directly by hardware in the case of vectored interrupts or by a software routine otherwise). (iii) The ISR for that device is executed. (iv) The ISR accesses the Device Control Block (DCB) for that device. We have discussed the DCB in the Information Management (IM) module. As we know, the DCB maintains a list of processes waiting on that device. The Operating System also knows from the DCB the current process for which the I/O is completed. The Pid of this process is of importance. (v) Now the ISR executes the wake up system call for that specific process. Let us assume that in our example, the I/O is completed on behalf of a process with Pid = 2 and therefore, it needs to be woken up. (a) The Operating System changes the status of the process with Pid = 2 to ready. (b) It removes the process with Pid = 2 from the blocked list and also updates the blocked header if needed (in this case, it is not necessary).

(c) It chains the process with Pid = 2 in the list of ready processes. As we know, this is done as per the scheduling philosophy. We assume that this PCB is added at the end of the ready list. It also updates the ready header if necessary (in this case, it is necessary). (d) It updates the master list of known processes. The PCBs now look as shown in Fig. 6.23, assuming that the process with Pid = 4, shown in Fig. 6.22, which is at the head of the ready list, has already been dispatched.
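The wake up path itself can be sketched as below. The interesting detail is the removal of the PCB from an arbitrary position in the blocked chain, which the prior pointer makes cheap; the names are illustrative and the DCB lookup that yields the Pid is assumed to have happened already.

#define NIL (-1)

enum pstate { FREE, READY, RUNNING, BLOCKED };
struct pcb { enum pstate state; int next, prior; };

struct pcb pcbs[16];
int blocked_first = NIL, blocked_last = NIL;
int ready_first   = NIL, ready_last   = NIL;

void wake_up(int pid)                   /* called from the device's ISR   */
{
    /* (a)/(b): unlink the PCB from wherever it is in the blocked chain.  */
    int n = pcbs[pid].next, p = pcbs[pid].prior;
    if (p == NIL) blocked_first = n; else pcbs[p].next  = n;
    if (n == NIL) blocked_last  = p; else pcbs[n].prior = p;

    /* (c): mark it ready and append it at the tail of the ready chain.   */
    pcbs[pid].state = READY;
    pcbs[pid].next  = NIL;
    pcbs[pid].prior = ready_last;
    if (ready_last == NIL) ready_first = pid; else pcbs[ready_last].next = pid;
    ready_last = pid;

    /* (d): the master list of known processes would be updated here.     */
}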

We have seen a variety of commonly used operations on the processes and the way they are executed. We will now study some operations which are less frequently used, but which are quite necessary, nevertheless. There is sometimes a need to be able to suspend a process. Imagine that you are running a payslip printing program for a company with 6000 employees. After printing 2000 payslips, you suddenly realise that there could possibly be a mistake in your calculations. At this juncture, you want to suspend the process for a short

while, check the results and then resume it. You do not want to abort the run, because the processing/printing for 2000 employees might go waste if on inspection, you find that you were actually right. What will you call the state of such a process after suspension? There could be two possibilities. When you suspend it by hitting a specific key sequence (CTRL and H, for instance), at that very moment, the original process could be in either running or ready or blocked states. The Operating System defines two more process states to take care of suspension while in different states. These are suspendready and suspendblocked. If the process was in either running or ready state exactly at the time of suspension, the Operating System puts it in the suspendready state. If the process was in the blocked state exactly at the time of suspension, it puts the process in the suspendblocked state. If the process is in the suspendready state, it continues to be in that state until the user externally resumes it (by hitting another specific key sequence). After the user resumes it, the Operating System puts the process in the ready state again, whereupon it is eventually dispatched to the running state. This is depicted in Fig. 6.24. However, if the process is in the suspendblocked state, two things can happen subsequently. (a) The I/O, for which the process was initially blocked before being suspended, is completed before the user resumes the process. In this case, the process is internally moved by the Operating System from the suspendblocked state to the suspendready state. The logic behind this is clear. The process is suspended all right, but apart from this fact, it is not waiting for any external event such as I/O. Hence, it is not suspendblocked any more. Therefore, it is moved to the suspendready state. After the user resumes it, it is then moved to the ready state, whereupon it is eventually dispatched to the running state. This is depicted in Fig. 6.24.

(b) If the I/O still remains pending, but the user resumes the process before the I/O is completed, the Operating System moves the process from the suspendblocked to blocked state. Again, the logic is clear. After resuming, the process continues to be blocked all right, as it is waiting for an I/O, but it is no longer suspended. When the I/O is eventually completed for that process, the Operating System moves it to the ready state, whereupon it is eventually dispatched to the running state. This also is depicted in Fig. 6.24. The Operating System has to maintain two more queue headers corresponding to the suspendready and suspendblocked states, and it has to chain all the processes belonging to the same state together. Using these

PCB chains, the Operating System has to implement the corresponding system calls for suspending and resuming processes (see Fig. 6.21).

It should be fairly straightforward to imagine the headers that are necessary for the PCB chains and also the algorithms for implementing all of these system calls. We leave it to the reader to construct them.
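As a starting point, the state transitions themselves can be written down as a small, self-contained sketch; the enum values follow the text's suspendready and suspendblocked states, and the functions encode the rules of Fig. 6.24.

enum pstate { READY, RUNNING, BLOCKED, SUSPENDREADY, SUSPENDBLOCKED };

/* The user hits the "suspend" key sequence.                             */
enum pstate on_suspend(enum pstate s)
{
    if (s == RUNNING || s == READY) return SUSPENDREADY;
    if (s == BLOCKED)               return SUSPENDBLOCKED;
    return s;                                   /* already suspended     */
}

/* The user hits the "resume" key sequence.                              */
enum pstate on_resume(enum pstate s)
{
    if (s == SUSPENDREADY)   return READY;      /* eventually dispatched */
    if (s == SUSPENDBLOCKED) return BLOCKED;    /* I/O still pending     */
    return s;
}

/* The pending I/O completes.                                            */
enum pstate on_io_complete(enum pstate s)
{
    if (s == SUSPENDBLOCKED) return SUSPENDREADY;   /* case (a) above    */
    if (s == BLOCKED)        return READY;
    return s;
}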

While scheduling various processes, there are many objectives for the Operating System to choose from. Some of these objectives conflict with each other, and therefore, the Operating System designers have to choose a set of objectives to be achieved before designing an Operating System. Some of these objectives are fairness, good throughput, good CPU utilization, low turnaround time, low waiting time and good response time. Some of these objectives are conflicting. We will illustrate this by considering fairness and throughput. Fairness refers to being fair to every user in terms of the CPU time that he gets. Throughput refers to the total productive work done by all the users put together. Let us consider traffic signals as an example (Fig. 6.26) to understand these concepts first and then see how they conflict. There is a signal at the central point S which allows traffic in the directions AB, BA, CD or DC. We assume the British method of driving and signals in our examples. Imagine that there are a number of cars at point S, standing in all the four directions. The signalling system gives a time-slice for traffic in every direction. This is common knowledge. We define throughput as the total number of cars passed in all the directions put together in a given time. Every time the signal at S changes the direction, there is some time wasted in the context switch for changing the lights from green to amber and then

subsequently to red. For instance, when the signal is amber, only the cars which have already started and are half way through are supposed to continue. During this period, no new car is supposed to start (at least in principle) and hence, the throughput during this period is very low. If the time slice is very high, say 4 hours each, the throughput will be very high, assuming that there are sufficient cars wanting to travel in that direction. This is true, because there will be no time lost in the context switch procedure during these 4 hours. But then, this scheme will not be fair to the cars in all the other directions, at least during this time. If this time slice is only 1 hour, the scheme becomes fairer to others, but the throughput falls because the signals are changing direction more often. Therefore, the time wasted in the context switch is more. Waiting for 1 to 4 hours at a signal is still not practical. If this time slice is 5 minutes, the scheme becomes still fairer, but the throughput drops still further. At the other extreme, if the time slice is only 10 seconds, which is approximately equal to the time that is required for the context switch itself, the scheme will be the fairest, but the throughput will be almost 0. This is because almost all the time will be wasted in the context switch itself. Hence, fairness and throughput are conflicting objectives. Therefore, a good policy is to increase the throughput without being unduly unfair.
The Operating System also is presented with similar choices as in the case of street signals. When the Operating System switches from one process to the next, the CPU registers have to be saved/restored in addition to some other processing. This is clearly the overhead of the context switch, and during this period, totally useless work is being done from the point of view of the user processes. If the Operating System switches from one process to the next too fast, it may be fairer to the various processes, but then the throughput may fall. Similarly, if the time slice is very large, the throughput will increase (assuming there are a sufficient number of processes waiting which can make use of the time slice), but then, it may not be a very fair policy.
Let us briefly discuss the meaning of the other objectives. CPU utilization is the fraction of the time that the CPU is busy on the average, executing either the user processes or the Operating System. If the time slice is very small, the context switches will be more frequent. Hence, the CPU will be busy executing the Operating System instructions more than those of the user processes. Therefore, the throughput will be low, but the CPU utilization will be very high, as this objective does not care what is being executed, and whether it is useful. The CPU utilization will be low only if the CPU remains idle. Turnaround time is the elapsed time between the time a program or a job is submitted and the time when it is completed. It is obviously related to the other objectives. Waiting time is the time a job spends waiting in the queue of the newly admitted processes for the Operating System to allocate resources to it before commencing its execution. This waiting is necessary due to the competition from other jobs/processes in a multiprogramming system. It should be clear by now that the waiting time is included in the turnaround time. The concept of response time is very useful in time-sharing or real-time systems.
Its connotation in these two systems is different and therefore, it is called terminal response time and event response time, respectively. Essentially, it means the time to respond with an answer or result to a question or an event, and it is dependent on the degree of multiprogramming, the efficiency of the hardware along with the Operating System, and the policy of the Operating System to allocate resources. If these different objectives were not conflicting, a designer would have desired all of them. However, that is not the case, as we have seen. Therefore, the Operating System designers choose only certain objectives (e.g. response time is extremely important for online or real time systems) and the design of the Operating System is guided by this choice.
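The two time-based measures can be illustrated with a tiny, self-contained calculation. The job data below are made up; waiting time is taken, as defined above, to be the delay between submission and the start of execution, and is therefore contained in the turnaround time.

#include <stdio.h>

struct job { const char *name; int submitted, started, completed; };

int main(void)
{
    struct job jobs[] = {             /* times in arbitrary units        */
        { "A", 0,  0,  9 },
        { "B", 1,  9, 14 },
        { "C", 2, 14, 30 },
    };
    for (int i = 0; i < 3; i++) {
        int waiting    = jobs[i].started   - jobs[i].submitted;
        int turnaround = jobs[i].completed - jobs[i].submitted;
        printf("%s: waiting = %d, turnaround = %d\n",
               jobs[i].name, waiting, turnaround);
    }
    return 0;
}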

Due to many processes competing for the same available resources like CPU and memory, the concept of priority is very useful. Like in any capacity planning or shop loading situation, the priority can be global (i.e. external) or it can be local (i.e. internal). An external priority is specified by the user externally at the time of initiating the process. In many cases, the Operating System allows the user to change the priority externally even during its execution. If the user does not specify any external priority at all, the Operating System assumes a certain priority called the default priority. In many in-house situations, most of the processes run at the same default priority, but when an urgent job needs to be done (say for the chairman), the system manager permits that process to be created with a higher priority. In data centre situations where each user pays for the time used, normally higher priority processes are charged at a higher rate to prevent each user from firing his job at the highest priority. This is known as the scheme of purchased priorities. It is the function of the Operating System to keep track of the time used by each process and the priority at which it was used, so that it can then perform its accounting function. To prevent the highest priority process from running indefinitely, the scheduler can decrease the priority of such a process slightly at some regular time interval, depending on its CPU utilization. After some time, if its priority drops below that of another ready process, a context switch between them takes place. This operation is aided by the system clock, and this is also the reason why an interrupt is generated after each clock tick, so that the scheduler can do this checking. This change in priority is not monitored externally, but the Operating System can carry this out internally and intelligently using its knowledge about the behaviour of various processes. The concept of internal priority is used by some scheduling algorithms. They base their calculation on the current state of the process. For example, each user, while firing a process, can be forced to also specify the expected time that the process is likely to take for completion. The Operating System can then set an internal priority which is the highest for the shortest job (Shortest Job First or SJF) algorithm so that at only a little extra cost to large jobs, many short jobs will complete. This has two advantages. If short jobs are finished faster, at any time, the number of processes competing for the CPU will decrease. This will result in a smaller number of PCBs in the ready or blocked queues. The search times will be smaller, thus improving the response time. The second advantage is that if smaller processes are finished faster, the number of satisfied users will increase. However, this scheme has one disadvantage. If a stream of small jobs keeps on coming in, a large job may suffer from indefinite postponement. This can be avoided by setting a higher external priority to those important large jobs. The Operating System at any time calculates a resultant priority based on both external and internal priorities using some algorithm chosen by the designer of the Operating System. The internal priority can also be based on other factors such as expected remaining time to complete which is a variation of the SJF scheme discussed above. This scheme is identical with the previous one at the beginning of the process. 
This is because in the beginning, the remaining time to complete is the same as the total expected time for the job to complete. However, this scheme is more dynamic as the process progresses. At regular intervals, the Operating System calculates the expected remaining time to complete (total expected completion time minus the CPU time already consumed) for each process and uses this to determine the priority. The overhead in this scheme is that as soon as a process uses a certain CPU time, the Operating System has to keep track of the same, and recalculate the priority at a regular interval. Some Operating Systems do not operate on the concept of priority at all. They use the concept of time slice as was described in our example of a traffic signal. Each process is given a fixed time slice, irrespective of its

importance. The process switch occurs only if a process consumes the full time slice, or if it requests an I/O before the time slice is over. In the latter case also, a process switch is done because there is no sense in wasting the remaining time slice just waiting for the I/O to complete. The context switch, consisting of CPU/memory-related instructions within an Operating System routine, is far less time consuming than the I/O that a process is waiting for. Some Operating Systems use a combination of the concepts of priority and time slice to schedule various processes, as we will discuss in the later sections. These concepts can be applied to different levels of scheduling, which is the topic of our discussion in the next section.

There are basically two scheduling philosophies: Non-Preemptive and Preemptive. Depending upon the need, the Operating System designers have to decide upon one of them. A non-preemptive philosophy means that a running process retains the control of the CPU and all the allocated resources until it surrenders control to the Operating System on its own. This means that even if a higher priority process enters the system, the running process cannot be forced to give up the control. However, if the running process becomes blocked due to any I/O request, another process can be scheduled because, the waiting time for the I/O completion is too high. This philosophy is better suited for getting a higher throughput due to less overheads incurred in context switching, but it is not suited for real time systems, where higher priority events need an immediate attention and therefore, need to interrupt the currently running process. A preemptive philosophy on the other hand allows a higher priority process to replace a currently running process even if its time slice is not over or it has not requested for any I/O. This requires context switching more frequently, thus reducing the throughput, but then it is better suited for online, real time processing, where interactive users and high priority processes require immediate attention. Imagine a railway reservation system or a bank, hotel, hospital or any place where there is a front office and a back office. The front office is concerned with bookings, cancellations and many types of enquiries. Here, the response time is very crucial; otherwise customer satisfaction will be poor. In such a case, a preemptive philosophy is better. It is pointless to keep a customer waiting for long, because the currently running process producing some annual statistics is not ready to give up the control. On the other hand, the back office processing will do better with the non-preemptive philosophy. Business situations with workloads large enough to warrant a separate computer for front and back office processing, in fact, can go in for different Operating Systems with different scheduling philosophies if they are compatible in other respects.

Figure 6.27 shows three different levels at which the Operating System can schedule processes: long term scheduling, medium term scheduling and short term scheduling. An Operating System may use one or all of these levels, depending upon the sophistication desired.

The scheme works as follows: (a) If the number of ready processes in the ready queue becomes very high, the overhead on the Operating System for maintaining long lists, context switching and dispatching increases. Therefore, it is wise to let in only a limited number of processes in the ready queue to compete for the CPU. The long term scheduler manages this. It disallows processes beyond a certain limit for batch processes first and in the end also the interactive ones. This is shown in Fig. 6.27. As seen before, this scheduler controls the admit function as shown in Fig. 6.6.

(b) At any time, the main memory of the computer is limited and can hold only a certain number of processes. If the availability of the main memory becomes a great problem, and a process gets blocked, it may be worthwhile to swap it out on the disk and put it in yet another queue for a process state called swapped out and blocked which is different from a queue of only blocked processes, hence, requiring a separate PCB chain (we had not discussed this as one of the process states to reduce complications, but any Operating System has to provide for this). The question that arises is as to what happens when the I/O is completed for such a process and if the process is swapped out? Where is the data requested by that process read in? The data required for that process is read in the memory buffer of the Operating System first. At this juncture, the Operating System moves the process to yet another process state called swapped out but ready state. It is made ready because it is not waiting for any I/O any longer. This also is yet another process state which will require a separate PCB chain. One option is to retain the data in the memory buffer of the Operating System and transfer it to the I/O area of the process after it gets swapped in. This requires a large memory buffer for the Operating System because the Operating System has to define these buffers for every process as a similar situation could arise in the case of every process. Another option is to transfer the data to the disk in the process image at the exact location (e.g. I/O area), so that when the process is swapped in, it does so along with the data record in the proper place. After this, it can be scheduled eventually. This requires less memory but more I/O time.

When some memory gets freed, the Operating System looks at the list of swapped but ready processes, decides which one is to be swapped in (depending upon priority, memory and other resources required, etc.) and after swapping it in, links that PCB in the chain of ready processes for dispatching. This is the function of the medium term scheduler as shown in Fig. 6.27. It is obvious that this scheduler has to work in close conjunction with the long term scheduler. For instance, when some memory gets freed, there could be competition for it from the processes managed by these two schedulers. (c) The short term scheduler decides which of the ready processes is to be scheduled or dispatched next. These three scheduling levels have to interact amongst themselves quite closely to ensure that the computing resources are managed optimally. The exact algorithms for these and the interaction between them are quite complex and are beyond the scope of this text. We will illustrate the scheduling policies only for the short term scheduler in the subsequent section.

We will now discuss some of the commonly used scheduling policies, belonging to both pre-emptive and non-preemptive philosophies and using either the concept of priority or of time slice, or both. It should be fairly easy to relate these policies to the kind of PCB chains for ready processes that will be needed for implementing them. The simplest of these policies is Round Robin with time slicing, which holds all the ready processes in one single queue and dispatches them one by one. Each process is allocated a certain time slice. A context switch occurs only if the process consumes the full time slice (i.e. a CPU bound job doing a lot of calculations) or if it requests an I/O during the time slice. If the process consumes the full time slice, the process state is changed from running to ready and it is pushed at the end of the ready queue. The reason why it is changed to a ready state is that it is not waiting for any external event such as an I/O operation. Therefore, it cannot be put in a blocked state. It is pushed at the end of the ready queue because it is a Round Robin policy. The process will be served in strict sequence only after serving all the other processes ahead of it in the ready queue. After adding the PCB for this process at the end of the ready queue, the PCB pointers and the headers are changed as discussed earlier. If a running process requests an I/O before the time slice is over, it is pushed into the blocked state. It cannot be in the ready state because, even if its turn comes, it cannot be scheduled. After its I/O is complete, it is again introduced at the end of the ready queue and eventually dispatched. This continues until the process is complete. It is at this time that the PCB for that process is removed from the system. All the new processes are introduced at the end of the ready queue. This is depicted in Fig. 6.28. The policy treats all the processes equitably and therefore, it is extremely fair, but if the number of users is very high, the response time may deteriorate for online processes that require fast attention (e.g. railway or airline reservations, etc.). The efficiency and throughput of this policy are dependent upon the size of the time slice, as discussed in the analogy with traffic signals. If the time slice is very high, the policy tends towards a single user FIFO policy. This is not fair, even though the throughput can be higher in this scheme. On the other hand, if the time slice is reduced, the policy is fair but it produces a lower throughput due to the overhead of a higher frequency of context switches. The implementation of this scheme can be done by maintaining a PCB chain in the FIFO sequence of all the ready processes, with the chain pointers being adjusted every time a context switch takes place.
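A tiny self-contained simulation of this policy is given below. The three CPU burst lengths and the time slice of 4 units are arbitrary, and the context switch overhead discussed earlier is ignored, so the numbers only illustrate the requeuing behaviour.

#include <stdio.h>

int main(void)
{
    int remaining[3] = { 10, 4, 7 };      /* CPU time still needed        */
    int queue[64], head = 0, tail = 0, clock = 0, slice = 4;

    for (int i = 0; i < 3; i++) queue[tail++] = i;   /* initial FIFO order */

    while (head < tail) {
        int p   = queue[head++];                     /* head of the queue  */
        int run = remaining[p] < slice ? remaining[p] : slice;
        clock        += run;
        remaining[p] -= run;
        if (remaining[p] > 0)
            queue[tail++] = p;        /* full slice used: back to the tail */
        else
            printf("process %d completes at time %d\n", p, clock);
    }
    return 0;
}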

Priority-based policy can be preemptive or non-preemptive as studied earlier. A preemptive one gives more importance to the response time for real time processes than it gives to fairness. A pure priority driven preemptive policy is at the other extreme with respect to pure Round Robin with time slicing. In this case, if the highest priority process is introduced in the system at any moment, with no regard to the currently running process or the queue of ready processes, the new process will grab the CPU, hence, forcing a context switch. In fact, if the kernel calculates the new priorities at every clock tick, the PCB chain also is changed appropriately and the highest priority process then can be dispatched. Thus, a different process can be dispatched at any clock tick even if no new process is introduced in the system. This can happen if the internal priorities are modified by the kernel ‘intelligently’. Next to this can be a priority based non-preemptive policy which schedules the highest priority process bypassing the queue of ready processes, but only after the currently running process gives up its control of the CPU due to either an I/O request or termination. After the new process starts running, it in turn does not give up control unless it requires an I/O or it terminates. If it gets blocked due to an I/O request, the ready process with the highest priority is dispatched. When the original high priority process becomes ready due to the I/O completion, the ready queue is again ordered in the priority sequence and if that process happens to be still of the highest priority, it is again dispatched; otherwise the other process with the highest priority is dispatched. To implement these policies, the kernel needs to take actions as shown in Fig. 6.29. In both of these schemes, the final priority is a result of external and internal priorities. Again, as we have seen, there are a number of ways to calculate the internal priorities. We have seen one of these in the “Shortest Job First (SJF)” method. Another one could be based on “Shortest Remaining Time First (SRTF)”.

The strictly priority-driven policies are really very good for real time events, but they can lead to indefinite postponement or unfairness to low priority processes. Imagine, for example, a process whose whole purpose is to count from 1 to 100 and then to start all over again after initialising the counter, requiring no I/O at all. If this process is introduced as the highest priority process, it can virtually bring the whole system to a standstill. A limited solution to this problem is to introduce and use the concept of priority class. According to this philosophy, the Operating System allows only a limited number of priority classes instead of a very large range of priority numbers. It then essentially splits the chain of ready processes into as many different PCB chains as there are priority classes. This is as shown in Fig. 6.30. Within each priority class, you could have different scheduling policies. You could run all of them Round Robin, for instance. In this case, after a process consumes its time slice, the PCB is linked at the end of the chain for that priority class instead of at the end of all the PCBs in the ready chain. If a process gets blocked, it is put into the blocked queue and, after the I/O completion, it is reintroduced at the end of the ready queue for the same priority class that it originally belonged to. If the currently running process consumes the full time slice, it is introduced at the end of the ready queue for the same priority class. The 'dispatch' system call in this case will look for the first PCB in the queue for the highest priority class. Only if that queue is empty will it traverse down the queues of PCBs with lower priorities to look for a process to be dispatched. It is not necessary for the Operating System to have the same scheduling philosophy for all the priority classes. For instance, the Operating System could have a strictly event driven priority-based policy for the processes in the highest priority class and a Round Robin policy for the processes in the next priority class. These variations still do not completely solve the problem of indefinite postponement. What you need is a system which will be better in fairness, throughput and response time, all at one time. It should be good for online as well as batch jobs. There is one such policy. It belongs to a group called heuristic scheduling, because it modifies the priority depending upon the past behaviour of the process. We will now discuss this. Let us take a CPU bound program performing mainly calculations. As an example, after each I/O, it does calculations for, say, 200 milliseconds before requesting the I/O for the next record. In

a Round Robin policy, if the time slice for this process is, say, 25 ms, then during the calculation phase, it will have to give up the CPU 8 times due to the 'time up' situation. Eight times, the Operating System will have to incur overheads due to context switching, hence decreasing the throughput. Is increasing the time slice the solution? If we increase the time slice, it will be good for this process, but then this big a time slice may be quite unnecessary for the other processes. You need a policy whereby the Operating System can increase the time slice only for certain types of processes (which are CPU bound). But then, in order to be fair, it should allocate this larger time slice less frequently to it, i.e. it should reduce its priority. For I/O bound processes, the actions should be just the reverse. Hence, if you want to be fair and also increase the throughput, for CPU bound processes, you need to increase the time slice and reduce the priority. Therefore, such a process should get a bigger time slice, but less frequently. Similarly, for I/O bound processes, you need to decrease the time slice and increase the priority. Hence, an I/O bound process should get a smaller time slice each time, but it should get it more frequently. The reason is that such a process cannot utilize a bigger time slice anyway! Notice the fairness in the policy. For both CPU bound and I/O bound processes, the total time allocated is more or less equitable, if not exactly the same. How should this policy be implemented? When the process is initiated, the Operating System does not know whether it is I/O bound or CPU bound. What we need is a heuristic approach for the Operating System, which will monitor the performance of the process in terms of the frequency of I/O calls (I/O boundness) and then change the priority and the time slice of that process accordingly. This is normally implemented using Multilevel Feedback Queues (MFQ) as shown in Fig. 6.31. The scheme works as follows: (a) The list of ready processes is split up into many queues with levels from 0 to n (in the figure shown, we have assumed n = 3). At each level, the PCBs are chained together as before. (b) Each level corresponds to a value of time slice. For instance, level 0 has time slice = t0, level 1 has time slice = t1 and so on. These time slice values are stored by the Operating System. When it wants to dispatch a process belonging to a specific queue, it loads the corresponding value of the time slice into the timer, so that there will be a 'time up' interrupt generated after that much time, as we have studied earlier. This is organised in such a way that as you go down the levels, i.e. from level 0 to level 3, the time slice increases, i.e. t3 > t2 > t1 > t0. In practice, if t0 = x milliseconds, t1 could be 2x, t2 could be 4x and t3 could be 8x. (c) As you go down the levels, the priority decreases. This is implemented by having the scheduler search through the PCBs at level 0 first, then level 1, then level 2 and so on for choosing a process for dispatching. Hence, if a PCB is found in level 0, the scheduler will schedule it without going to level 1, implying thereby that level 0 has a higher priority than level 1. It will search the queue at level 1 only if the level 0 queue is empty. The same philosophy applies to all the levels below. Hence, as we traverse from level 0 to level 3, the time slice increases and the priority decreases.
After studying the past behaviour at regular intervals, the kernel now needs to somehow keep pushing the I/O bound processes to the upper levels and the CPU bound processes to the lower levels. Let us now see how this is achieved.

(d) A new process always enters at level 0 and it is allocated a time slice t0. (e) If the process uses the time slice fully (i.e. if it is CPU bound), it is pushed to the lower level, thereby increasing the time slice but decreasing the priority. This is done for all levels, excepting if it is already at the lowest level in which case it is reintroduced at the end of the same (lowest) level only, because, obviously, it cannot be pushed any further. (f) If the process requests for an I/O before the time slice is over (i.e. if it is I/O bound), the process gets blocked and when the I/O is complete, it is pushed up to the next higher level excepting if it is already at level 0, it is reintroduced at the end of level 0 only. Hence, instead of only one queue header for ready processes, the Operating System will have to maintain four queue headers for four different queues for the processes in ready state. The CPU bound jobs will keep getting pushed down, whereas the I/O bound jobs will get pushed up. This is exactly what we wanted. We have shown only four queues in our example. The number of queues is really an issue of the Operating System design which the designers have to choose. This completes the description of MFQ. It was conceived as early as 1962 for the Operating System called CTSS and is implemented in many modern Operating Systems (i.e. AOS/VS on DG machines). UNIX

implements a variant of this method. The Operating System can adapt to the changing nature of the process even during the execution—e.g. a process is CPU bound for some time initially, after which it becomes interactive. In this case, this scheme will dynamically change the time slice and the priority as per the recent behaviour of the process. The drawback of this scheme is however a very high overhead on the Operating System to provide this kind of intelligence and adaptability by managing various queues. Therefore, though it is one of the best systems available, it may not be the best in all the situations in the benchmarks for performance.
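The promotion and demotion rules of the MFQ scheme can be captured in a very small sketch. The number of levels, the slice values and the function names are assumptions for illustration; a real implementation would, of course, also maintain the per-level PCB queues and the timer programming described above.

#include <stdio.h>

#define LEVELS 4

static const int slice_ms[LEVELS] = { 10, 20, 40, 80 };  /* t0 < t1 < t2 < t3 */

struct mfq_proc { int level; };

void on_full_slice_used(struct mfq_proc *p)      /* CPU bound behaviour  */
{
    if (p->level < LEVELS - 1) p->level++;       /* demote: bigger slice,
                                                    lower priority        */
    /* the PCB is requeued at the tail of its (possibly new) level        */
}

void on_io_before_slice_over(struct mfq_proc *p) /* I/O bound behaviour  */
{
    if (p->level > 0) p->level--;    /* promoted when it becomes ready
                                        again after the I/O completes     */
}

int main(void)
{
    struct mfq_proc p = { 0 };                   /* new process: level 0 */
    on_full_slice_used(&p);                      /* burns two full slices */
    on_full_slice_used(&p);
    on_io_before_slice_over(&p);                 /* then turns I/O bound  */
    printf("level %d, next slice %d ms\n", p.level, slice_ms[p.level]);
    return 0;
}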

A thread can be defined as an asynchronous code path within a process. Hence, in Operating Systems which support multithreading, a process can consist of multiple threads, which can run simultaneously in the same way that a multiuser Operating System supports multiple processes at the same time. In essence, multiple threads should be able to run concurrently within a process. This is why a thread is sometimes referred to as a lightweight process. Let us illustrate the concept of multiple threads using an example without multithreading first and then using one with it. Let us consider a utility which reads records from a tape with a blocking factor of 5, processes them (it may select or reformat them, for instance) and writes them onto a disk one by one. Obviously, the speed of the input or read operation may be quite different from the speed of the output or write operation. The logic of the program is given in Fig. 6.32. Let us imagine that we have Round Robin scheduling with a time slice for each process of 25 ms. When the process running our program is scheduled, 25 ms will be allocated to it, but almost immediately, the process will be blocked due to the 'Read' system call, thereby utilizing only a small fraction of the time slice. When the record is eventually read, the process becomes ready and then it is dispatched. But almost immediately, it will be blocked again due to the 'Write' system call, and so on. In the Round Robin philosophy, 25 ms will be allocated to a process regardless of its past behaviour. Hence, I/O bound processes such as this one suffer in the bargain. We have seen that heuristic scheduling provides one of the solutions to this problem, but it is not inexpensive to provide this type of scheduling and not many Operating Systems choose that path. Multithreading provides yet another improvement in solving this problem. The idea is simple and it works as follows: (i) The programmer (in this case the one who is writing this tape to disk copy utility) defines two threads within the same process as shown in Fig. 6.33. The advantage is that they can run concurrently within the same process if synchronized properly. We need not bother about the exact syntax with which a programmer can define a thread within a process. Let us just assume that it is possible. (ii) The compiler recognises these as different threads and maintains their identity as such in the executable code. In our example in Fig. 6.33, a thread is encapsulated between Thread-N-Begin and Thread-N-End statements for thread N. (iii) When the process starts executing, the Operating System creates a PCB as usual, but now in addition,

it also creates a Thread Control Block (TCB) for each of the recognised and declared threads within that process. The TCB contains apart from other things, the register save area to store the registers at context switch of a thread instead of a process. Because, the idea is to run different threads simultaneously within a process, similar concepts, ideas and data structures are used here as the ones used for multiple simultaneous processes. Hence, you need a queue of TCBs and a queue header in addition to that for PCBs. A thread can now get blocked like a process can. Hence, a TCB needs to have a register save area for each thread within a process. (iv) The threads also could have priorities and states. A thread can be in a ready, blocked or running state, and accordingly all the TCBs are linked together in the same way that PCBs are linked in different queues with their separate headers. (v) When the Operating System schedules a process with multiple threads and allocates a time slice to it, the following happens: (a) The Operating System selects the highest priority ready thread within that process and schedules it. (b) At any time, if the process’ time slice is over, the Operating System turns the process as well as currently running thread into ready state from running state. (c) If the process time slice is not over but the current thread is either over or blocked, the Operating System chooses the next highest priority ready thread within that process and schedules it. But if there is no ready thread left to be scheduled within that process, only then does the Operating System turn the process state into a blocked state. And it is in this procedure that there is an advantage which we will see later. It is worth noting that the process itself does not get blocked if there is at least one thread within it which can execute within the allocated time slice. (d) Different threads need to communicate with each other like different processes do. Our example can be treated as a producer–consumer problem. The tape-read is the producer task and diskwrite is the consumer task. Hence, in multitasking, both Inter Task Communication and Task Synchronisation are involved and the Operating System has to solve the problems of race

conditions through mutual exclusion. We shall study this problem in more detail in the next chapter. Clearly an Operating System with multithreading will be far more complex to design and implement than the one without it. What did we gain? Is it worth it? The answer is not clearly a yes or no. A multithreading Operating System has an overhead, but it also allows the programmer flexibility and improves CPU utilisation. For instance, in our example, if Thread 0 is blocked, instead of blocking the entire process, the Operating System will find out whether Thread 1 can be scheduled. When both the threads are blocked, only then will the entire process be blocked. Again, even if any thread becomes ready, the process can be moved to a ready list from the blocked list and then scheduled. This reduces the overheads of context switching at a process level, though adding to those at a thread level. The latter is generally far less time consuming; and this is where the advantage stems from. The example may show some advantage, but may not reveal the magnitude of the benefit because the threads in our example consist mainly of I/O only. If you imagine more complex threads with more processing, the advantages will be clear.

In practice, threads can be implemented at two different levels, namely, the user level and the kernel level. The threads implemented at the kernel level are known as kernel threads. Kernel threads are handled entirely by the Operating System scheduler; an application programmer has no direct control over them. The threads implemented at the user level are known as user threads. The API for handling user threads is provided by a thread library, which maps the user threads to the kernel threads. Depending on the way the user threads are mapped to the kernel threads, there are three multithreading models, as described below.

The many-to-one model associates many user threads with a single kernel thread. Figure 6.34 depicts the many-to-one model. The thread library in the user space provides very efficient thread management. The user threads are not directly visible to the kernel and they require no kernel support. As a result, only one user thread has access to the kernel at a time, and if that thread blocks, the entire process gets blocked. The Green Threads library on the Solaris Operating System implements this model. This model is also implemented on Operating Systems that do not provide kernel threads.

In the one-to-one model, there is one kernel thread corresponding to each user thread. Thus, the kernel provides full thread support. The thread management provided by the kernel is slower compared to the thread management provided by the user library. On the other hand, as the kernel provides thread scheduling, even if one thread blocks, others can still run. Thus, the one-to-one model provides greater concurrency compared to the many-to-one model. As the creation of a user thread requires the creation of a kernel thread, application programmers must be careful when creating a large number of user threads. Moreover, Operating Systems that implement the one-to-one model, such as Windows 2000 and OS/2, usually limit the number of threads supported by the system. Figure 6.35 shows the one-to-one model.

The many-to-many model maps many user threads to an equal or smaller number of kernel threads. This model provides very fast and efficient thread management, resulting in better application performance and system throughput. Unlike the one-to-one model, the application programmers need not worry about the number of threads being created. While very powerful and flexible, the many-to-many model is also complex to implement. As a result, debugging an application can be complicated at times. Figure 6.36 depicts the many-to-many model.

In this section, we will briefly discuss the various implementations of threads. An in-depth discussion of these topics is beyond the scope of this book and the reader is encouraged to refer to other excellent texts available on these subjects for more information.

Prior to the existence of POSIX threads, each hardware vendor implemented its own version of threads. As each implementation was significantly different from the others, writing portable multithreaded applications was very difficult. Thus, the need was felt to standardize the APIs for thread management. IEEE then came up with the POSIX 1003.1c standard in 1995, which specified APIs for thread management. POSIX stands for 'Portable Operating System Interface'. The implementations of threads that conform to the IEEE POSIX 1003.1c specification are called POSIX threads or Pthreads. The current Pthreads API is defined only for the C programming language and is implemented as functions with a header file pthread.h and a thread library. The Pthreads API consists of over 60 functions covering thread management, thread synchronization and communication between threads. The naming convention for Pthreads is well defined and the prefix of all the objects or functions is pthread. Additionally, based on the prefix, it is possible to get an idea about the functionality of the identifier. For example, all the functions with the prefix pthread_ cover thread management, those with the prefix pthread_attr_ cover thread attribute objects, and so on. Most flavours of the Unix Operating System such as HP-UX, AIX, Solaris, etc. and the Linux Operating System now provide an implementation of Pthreads. Solaris, being a flavour of Unix, provides a Pthread implementation in addition to its proprietary implementation of threads. Please refer to Section 13.18 for a detailed discussion on Solaris thread management and synchronization.

Threads in Linux are handled quite differently from most other operating systems due to the open source nature of Linux. Although the Linux kernel has supported user threads since version 1.x, kernel thread support was added only after version 2.x. An important difference between Linux threads and other threads is the fact that Linux does not distinguish between a process and a thread. A task represents the basic unit of work for Linux. To create a child process, Linux provides two system calls. The first is fork, which we have already studied in Chapter 3. The second, Linux-specific system call is clone. It creates a child process like the fork call, but the important difference between the two is that fork creates a child process that has its own process context similar to the parent process, whereas the child process created by clone shares parts of its execution context with the calling process, such as the memory space, the file descriptor table, and the signal handler table. As a result, the clone system call is used to implement kernel threads in Linux. At the user level, various libraries that implement Pthreads are available. Some examples are LinuxThreads and NPTL (Native POSIX Threads Library).

In Windows 2000, a process is composed of a set of threads and a thread represents the basic unit of execution. As soon as a thread starts, the Windows memory manager allocates memory to allow the thread to run. When a thread terminates, it releases the memory used by it to the memory manager. Windows 2000 implements the one-to-one multithreading model.
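As a concrete illustration of the Pthreads API mentioned above, the following is a minimal sketch in C that creates two threads within one process, roughly echoing the tape-read and disk-write threads of Fig. 6.33. The function names reader_thread and writer_thread, and the use of sleep() to stand in for blocking I/O, are illustrative assumptions and are not taken from the book's figures.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Thread 0: stands in for the tape-read thread of Fig. 6.33. */
static void *reader_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < 3; i++) {
        sleep(1);                          /* simulate a blocking tape read  */
        printf("reader: block %d read\n", i);
    }
    return NULL;
}

/* Thread 1: stands in for the disk-write thread of Fig. 6.33. */
static void *writer_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < 3; i++) {
        sleep(1);                          /* simulate a blocking disk write */
        printf("writer: record %d written\n", i);
    }
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;

    /* Both threads share the process' address space; each gets its own
       stack and register context (saved in its TCB on a thread switch).    */
    pthread_create(&t0, NULL, reader_thread, NULL);
    pthread_create(&t1, NULL, writer_thread, NULL);

    /* The process as a whole blocks only when every thread in it is
       blocked or finished.                                                 */
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```

On Linux, such a program would typically be compiled with cc file.c -pthread, which links in the NPTL implementation mentioned above.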


In practice, several processes need to communicate with one another simultaneously. This requires proper synchronization and the use of shared data residing in shared memory locations. We will illustrate this by what is called a 'Producer–Consumer problem'. Let us assume that there are multiple users at different terminals running different processes, but each one running the same program. This program prompts for a number from the user and, on receiving it, deposits it in a shared variable at some common memory location. As these processes produce some data, they are called 'Producer processes'. Now let us imagine that there is another process which picks up this number as soon as any producer process outputs it, and prints it. This process, which uses or consumes the data produced by the producer processes, is called a 'Consumer process'. We can, therefore, see that all the producer processes communicate with the consumer process through a shared variable where the shared data is deposited. This is depicted in Fig. 7.1.

UNIX has a facility called a 'pipe' which works in a very similar manner. Instead of just one variable, UNIX assigns a file, typically 4096 bytes in size, which could reside entirely in memory. When Process A wants to communicate with Process B, Process A keeps writing bytes into this shared file (i.e. the pipe) and Process B similarly keeps reading from this shared file in the same sequence in which the bytes were produced. This is how UNIX can provide a facility of pipes through which the output of one process becomes the input of the next process. This is shown in Fig. 7.2 (a small C sketch using the pipe() system call is given after this discussion). Let us say that the sales data is to be selected for a specific division by a program or utility P1. The output of P1 (i.e. the selected records) is fed to the query program (P2) as input through a pipe. The query program P2 finally displays the results of the enquiry. All that the user has to do is to give commands to the UNIX shell to execute P1 to select the data, pipe it to P2 and execute P2 using this piped input data to produce the results. The pipe is internally managed as a shared file. The beauty of this scheme is that the user is not aware of this shared file; UNIX manages it for him. Thus, it becomes a vehicle for 'Inter Process Communication (IPC)'. However, one thing should be remembered. A pipe connects only two processes, i.e. it is shared only between two processes, and it has a "direction" of data flow. A shared variable is a much more general concept. It can be shared amongst many processes and it can be written/read arbitrarily.

Another example of a producer–consumer situation and of IPC is the spooler process within the Operating System. The Operating System maintains a shared list of files to be printed, for the spooler process to pick up one by one and print. At any time, any process wanting to print a file adds the file name to this list. Thus, this shared list becomes a medium of IPC. This is depicted in Fig. 7.3.

Sometimes, two or more processes need to be synchronized based on some event. The common or shared variables again provide the means for such synchronization. For instance, let us imagine that Process A has to perform a task only after Process B has finished a certain other task. In this case, a shared variable (say, a flag) could be used to communicate the completion of the task by Process B. After this, Process A can check this flag and proceed, depending on the flag, in the end resetting the flag for future use.
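This is the small sketch referred to above. It uses the real UNIX system calls pipe(), fork(), read() and write(); the parent process plays the role of P1 (the producer) and its child plays the role of P2 (the consumer). The "sales record" strings are, of course, made up purely for illustration.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fd[2];                 /* fd[0]: read end, fd[1]: write end         */
    if (pipe(fd) == -1)
        return 1;

    if (fork() == 0) {         /* child = consumer (P2)                     */
        char buf[64];
        ssize_t n;
        close(fd[1]);          /* P2 only reads from the pipe               */
        while ((n = read(fd[0], buf, sizeof(buf) - 1)) > 0) {
            buf[n] = '\0';
            printf("P2 received: %s", buf);
        }
        close(fd[0]);
        _exit(0);
    }

    /* parent = producer (P1) */
    close(fd[0]);              /* P1 only writes into the pipe              */
    const char *records[] = { "sales record 1\n", "sales record 2\n" };
    for (int i = 0; i < 2; i++)
        write(fd[1], records[i], strlen(records[i]));
    close(fd[1]);              /* closing the write end signals end-of-data */
    wait(NULL);
    return 0;
}
```

Closing the unused ends in each process is what gives the pipe its one-way "direction" of data flow mentioned above.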

In a sense, all the examples discussed above are typical of both 'Process Synchronization' and IPC, because both are closely related. For example, in the first case, unless any of the producer processes outputs a number, the consumer process should not try to print anything. Again, unless the consumer process prints it, none of the producer processes should output the next number if overwriting is to be avoided (assuming that there is only one shared variable). Thus, it is a problem of process synchronization again! There is, however, a serious problem in implementing these schemes. Let us again go back to our first example and see how we can achieve this synchronization to avoid overwriting. Let us imagine that apart from a shared variable to hold the number, we also have another flag variable which takes on the value 0 or 1. The value of the flag is 1 if any of the producer processes has output a number. Hence, no producer process should output a new number if this flag = 1, to avoid overwriting. Similarly, the consumer process will print the number only if the flag = 1, and will set the flag to 0 thereafter. Again, the consumer process should not print anything if this flag = 0 (i.e. nothing is ready for printing). We illustrate this in the programs in Fig. 7.4.
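Since Fig. 7.4 itself is not reproduced here, the following C-style sketch shows the flag-based scheme just described. The labels in the comments (P.0 to P.2 and C.0 to C.2) correspond to the pseudo-instructions discussed in the next paragraphs, but the exact code of the book's figure may differ, and the helper routines read_number_from_user() and print_number() are hypothetical.

```c
/* Shared variables (assumed to live in memory common to all processes).   */
int flag = 0;         /* 1 = a number has been produced but not yet printed */
int shared_number;    /* the number deposited by a producer                 */

extern int  read_number_from_user(void);   /* hypothetical input routine    */
extern void print_number(int n);           /* hypothetical output routine   */

void producer(void)
{
    int n = read_number_from_user();
    while (flag == 1)                  /* P.0: wait while the previous       */
        ;                              /*      number is still unprinted     */
    shared_number = n;                 /* P.1: output the number             */
    flag = 1;                          /* P.2: mark it as ready to print     */
}

void consumer(void)
{
    while (flag == 0)                  /* C.0: wait until something is       */
        ;                              /*      ready for printing            */
    print_number(shared_number);       /* C.1: print the number              */
    flag = 0;                          /* C.2: mark the slot as empty        */
}
```

As the trace below shows, this version is not safe: a context switch between P.1 and P.2 lets another producer overwrite shared_number.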

In this scheme, as we know, instruction P.0 i.e. “while flag = 1 do;” is a wait loop so long as the flag continues to be = 1. The very moment the flag becomes 0, the program goes down to step P.1 and thereafter to P.2 whereupon the flag is set to 1. When a process reaches instruction P.2, it means that some producer process has output the number in a shared variable at instruction P.1. At this juncture, if another producer process tries to output a number, it should be prevented from doing so in order to avoid overwriting. That is the reason, the flag is set to 1 in the instruction P.2. After this, the while-do wait loop precisely achieves this prevention. This is because as long as the flag = 1, the new producer process cannot proceed. A similar philosophy is applicable for instructions C.0, C.1 and C.2 of the consumer process. For instance, the consumer process does not proceed beyond C.0 as long as the flag continues to be = 0, which indicates that there is nothing to print. As soon as the flag becomes 1, indicating that something is output and is ready for printing, the consumer process executes C.1 and C.2, whereupon the number is printed

and the flag is again set to 0, so that subsequently the consumer process does not print non-existent numbers but keeps on looping at C.0. Everything looks fine; where then is the problem? The problem will become apparent if we consider the following sequence of events.
(i) Let us assume that initially the flag = 0.
(ii) One of the producer processes (PA) executes instruction P.0. Because the flag = 0, it does not wait at P.0, but goes to instruction P.1.
(iii) PA outputs a number in the shared variable by executing instruction P.1.
(iv) At this moment, the time slice allocated to PA gets over and that process is moved from the running to the ready state. The flag is still = 0.
(v) Another producer process PB is now scheduled. (It is not necessary that a consumer process is always scheduled after a producer process is executed once.)
(vi) PB also executes P.0 and finds the flag as 0, and therefore goes to P.1.
(vii) PB overwrites the shared variable by instruction P.1, therefore causing the data to be lost.
Hence, there is a problem. An apparent problem is that the setting of the flag to 1 in the producer process is delayed. If the flag is set to 1 as soon as a decision is made to output the number, but before actually outputting it, what will happen? Can it solve the problem? Let us examine this further. The modified algorithms of the producer and consumer processes will be as shown in Fig. 7.5.

However, in this scheme, the problem does not get solved. Let us consider the following sequence of events to see why. (i) Initially flag = 0 (ii) PA executes instruction P.0 and falls through to P.1, as the flag = 0. (iii) PA sets flag to 1 by instruction P.1. (iv) The time slice for PA is over and the processor is allocated to another Producer process PB. (v) PB keeps waiting at instruction P.0 because flag is now = 1. This continues until its time slice also is over, without doing anything useful. Hence, even if the shared data item (i.e. the number in this case) is empty, PB cannot output the number. This is clearly wasteful, though it may not be a serious problem. Let us proceed further. (vi) A consumer process CA is now scheduled. It will fall through C.0 because flag = 1. (It was set by PA in step (iii) )

(vii) CA will set flag to 0 by instruction C.1.
(viii) CA will print the number by instruction C.2 before the producer has output it (maybe the earlier number will get printed again!). This is certainly wrong!
Therefore, just preponing the setting of the flags does not work. What then is the solution? Before going into the solution, let us understand the problem correctly. The portion of any program which accesses a shared resource (such as a shared variable in the memory) is called a 'Critical Section' or 'Critical Region'. In our example, instructions P.1 and P.2 of the producer process or instructions C.1 and C.2 of the consumer process constitute the critical region. This is because both the flag and the data item where the number is output by the producer process are shared variables. The problem that we were facing was caused by what is called a 'race condition'. When two or more processes are reading or writing some shared data and the outcome is dependent upon which process runs precisely when, the situation is called a 'race condition'. We were clearly facing this problem in our example. This is obviously undesirable, because the results are unpredictable. What we need is a highly accurate and predictable environment.

How can we avoid race conditions? A closer look will reveal that the race conditions arose because more than one process was in the critical region at the same time. One point must be remembered. A critical region here actually means a critical region of any program. It does not have to be of the same program. In the first example (Fig. 7.4), the problem arose because both PA and PB were in the critical region of the same program at the same time. However, PA and PB were two producer processes running the same program. In the second example (Fig. 7.5), the problem arose even though processes PA and CA were running separate programs and both were in their respective critical regions simultaneously. This should be clear by going through our example with both alternatives as in Figs. 7.4 and 7.5.

What is the solution to this problem then? If we could guarantee that only one process is allowed to enter any critical region (i.e. of any process) at a given time, the problem of race conditions will vanish. For instance, in any one of the two cases depicted in Figs. 7.4 and 7.5, when PA has executed instruction P.1 and is timed out (i.e. without completing and getting out of its critical region), and if we find some mechanism to disallow any other process (producer or consumer) from entering its respective critical region, the problem will be solved. This is because no other producer process (such as PB) would be able to execute instructions P.1 or P.2 and no consumer process (such as CA) would be allowed to execute instructions C.1 or C.2. After PA is scheduled again, only PA would then be allowed to complete the execution of the critical region. Until that happens, all the other processes wanting to enter their critical regions would keep waiting. When PA gets out of its critical region, one of the other processes can then enter its critical region; and that is just fine. Therefore, what we want is 'mutual exclusion', which could turn out to be a complex design exercise. We will outline the major issues involved in implementing this strategy in the next section. It is important to remember that mutual exclusion is a necessary, though not a sufficient, condition for a good Operating System. This is because the process for achieving mutual exclusion could be very expensive.
We do not want an Operating System which achieves mutual exclusion at the cost of being extremely slow, or by making processes wait for a very long time. In fact, we can list five conditions which can make any solution acceptable. They are: (i) No two processes should be allowed to be inside their critical regions at the same time (mutual exclusion). (ii) The solution should be implemented only in the software, without assuming any special feature of the machine such as specially designed mutual exclusion instructions. This is not strictly a precondition

but a preference, as it enhances portability to other hardware platforms, which may or may not have this facility. (iii) No process should be made to wait for a very long time before it enters its critical region (indefinite postponement). (iv) The solution should not be based on any assumption about the number of CPUs or the relative speeds of the processes. (v) Any process operating outside its critical region should not be able to prevent another process from entering its critical region. We will now proceed to seek solutions which satisfy all these conditions.

We know that we can have instructions to disable or enable interrupts. One solution is to disable all interrupts before any process enters the critical region. As we know, a process switch after a certain time slice happens due to an interrupt generated by the timer hardware. If all interrupts are disabled, the time slice of the process which has entered its critical region will never get over until it has executed and come out of its critical region completely. Hence, no other process will be able to enter its critical region simultaneously. This solution is depicted in Fig. 7.6, which shows what the instructions in any program with shared variables will look like. The problem with this scheme is obvious. Giving ordinary user processes the power of playing around with interrupts is extremely dangerous. Imagine a process in the critical region executing an infinite loop due to a silly bug. The interrupts will never be enabled, and no other process can ever proceed. The machine will come to a standstill. We surely need an alternative.
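A schematic rendering of this idea in C-like form is given below. The routines disable_interrupts() and enable_interrupts() are hypothetical privileged operations (on x86 hardware they would correspond to the CLI and STI instructions); the exact layout of the book's Fig. 7.6 may differ.

```c
/* Hypothetical privileged routines (CLI/STI on x86, for example). */
extern void disable_interrupts(void);
extern void enable_interrupts(void);

static int shared_counter;              /* some shared variable             */

void update_shared_data(void)
{
    disable_interrupts();               /* no timer interrupt, hence no     */
                                        /* process switch from here on      */
    shared_counter = shared_counter + 1;    /* the critical region          */
    enable_interrupts();                /* process switches possible again  */
}
```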

One of the alternatives is to use a lock-flag. Consider a lock-flag which takes on only two values “F” or “N”. If any process is in its critical region, this lock-flag is set to “N” (Not free). If no process is in any critical region, it is set to “F” (Free). Using this, the algorithm for any process could be as shown in Fig. 7.7. A process wanting to enter the critical region checks whether the lock-flag is “N”. If it is “N”, i.e. the critical region is not free, it keeps waiting, indicated by the instruction 0, where the “While...do” suggests a wait loop. When the lock-flag becomes “F”, the process falls through to instruction 1, where it sets the lock-flag to “N” and enters its critical region. The idea is that while one process is in its critical region, no other process should be allowed to enter its critical region. This can be achieved because, every process has a structure as shown in Fig. 7.7. Therefore, any other process also will check the lock-flag before entering its critical region. Since the lock-flag is a shared variable amongst all the processes, its value will be “N” and therefore, the new process will keep

waiting. It will not enter its critical region. When the process in the critical region gets out of it, it sets the lock-flag to "F" as shown in instruction 3 of Fig. 7.7. At this juncture, the other process waiting on the lock-flag at instruction 0 can enter its critical region. Therefore, this procedure is designed to ensure that at any time, only one process is in the critical region. We should not confuse the lock-flag with the flag in our earlier example. Earlier, the flag indicated whether the memory location contained any valid number to be printed. In this case, the lock-flag indicates whether any critical region has been entered by any process (i.e. whether it is busy) or not.

At first sight, the solution seems satisfactory. But this solution also has a major bug. In fact, it does not solve the problem of mutual exclusion at all. Consider the following sequence of events, assuming that initially the lock-flag = "F".
(i) Process A executes instruction 0 and finds the lock-flag = "F", and decides to go to instruction 1, but loses control of the CPU before actually executing it, as the time slice is up. Hence, the lock-flag still remains "F".
(ii) Process B is now scheduled. It also executes instruction 0 and still finds the lock-flag = "F".
(iii) Process B executes instruction 1 and sets the lock-flag to "N".
(iv) Process B enters the critical region and, half way through, loses control of the CPU as the time slice is up.
(v) Process A is scheduled again. It resumes from instruction 1 and sets the lock-flag to "N" (which was already set to "N" by Process B).
(vi) Process A also enters the critical region.
Thus, both the processes (A and B) are in their critical regions simultaneously. The objective of mutual exclusion has not been achieved. The race condition and the resultant inconsistency may occur. What we need are mutual exclusion primitives or instructions which will guarantee mutual exclusion. There have been many attempts in that direction. We need not at this juncture worry about whether a primitive can be implemented using hardware or software. The meaning of primitives will become very clear in the subsequent discussions.
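The lock-flag scheme of Fig. 7.7 can be sketched in C as follows; the comment marks the exact window in which the context switch of steps (i) to (v) above breaks the scheme. The function name is illustrative.

```c
char lock_flag = 'F';            /* 'F' = free, 'N' = not free              */

void use_critical_region(void)
{
    while (lock_flag == 'N')     /* instruction 0: wait while busy          */
        ;
    /* <-- a context switch here lets a second process also see 'F'         */
    lock_flag = 'N';             /* instruction 1: claim the region         */

    /* instruction 2: the critical region (use the shared variables)        */

    lock_flag = 'F';             /* instruction 3: release the region       */
}
```

The test at instruction 0 and the set at instruction 1 are two separate steps, and that gap is precisely what the primitives discussed next must close.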

Let us imagine that we have two primitives: Begin-Critical-Region and End-Critical-Region. We can use them as a boundary of a critical region in any process. They really act like security guards. The system uses these primitives to recognize the critical region and allows only one process to be in any critical region at any given time. Let us, for now, not bother about how these primitives are implemented, but assume that they exist. We could now rewrite our producer–consumer programs of Fig. 7.4 as shown in Fig. 7.8. The idea is that for any process when Begin-Critical-Region is encountered, the system checks if there is any other process in the critical region and if yes, no other process is allowed to enter into it. This guarantees mutual exclusion. If we retrace steps (i) to (vii) discussed in connection with Fig. 7.4, we will realize this. For cross-reference, we have retained the numbers such as P.0, P.1 etc. We have added only P.S1, P.S2 and C.S1 and C.S2 as mutual exclusion primitives in Fig. 7.8. The whole scheme will work in the following way:

(i) Let us assume that initially the flag = 0 (ii) A producer process PA executes P.0. Because flag = 0, it falls through to P.S1. Again, assuming that there is no other process in the critical region, it will fall through to P.1. (iii) PA outputs the number in a shared variable by executing P.1. (iv) Let us assume that at this moment the time slice for PA gets over, and it is moved into the ‘Ready’ state from the ‘Running’ state. The flag is still 0. (v) Another producer process PB now executes P.0. It finds that flag = 0 and so falls through to P.S1. (vi) Because PA is in the critical region already, PB is not allowed to proceed further, thereby, avoiding the problem of race conditions. This is our assumption about the mutual exclusion primitives. We can verify that the scheme works in many different conditions. The problem that remains now is only that of implementing these primitives.

There have been a number of attempts to implement the primitives for mutual exclusion. Many algorithms do not solve the problem of mutual exclusion at all. Some algorithms are based on the assumption that there are only two processes. Some assume the existence of a special hardware instruction such as 'Test and Set Lock' (TSL). All these solutions were refined by the Dutch mathematician Dekker into a feasible solution, which was further developed by E.W. Dijkstra. But the solution was very complicated. Finally, in 1981, G. L. Peterson developed a simple but feasible solution. However, all these solutions, including Peterson's, required a phenomenon called 'busy waiting'. This can be explained again using our producer–consumer example. In our example, as depicted in Fig. 7.4, if flag = 0 and the producer process enters its critical region, the consumer process keeps looping to check the flag until it becomes 1. This is called busy waiting. It is highly undesirable because it wastes CPU resources. In this scheme, the consumer process is still a process which can contend for CPU resources because it is a ready process. It is not blocked. It is not waiting for any I/O operation. It is waiting on the flag. If, somehow, the consumer process could be blocked and therefore kept away from competing for CPU time, it would be useful. Allocating a time slice to a process which is going to waste it in "busy waiting" anyway is quite unproductive. If we avoid this, the CPU would be free to be scheduled to other ready processes. The blocked consumer process would be made ready only after the flag status changed. After this, the consumer process could continue and enter its critical region.

This is exactly what we had wanted. It effectively means that we treat this flag operation as an I/O operation as far as process states are concerned, so that a process could be blocked not only while waiting for an I/O but also while waiting for a change in the flag. The problem is that none of the solutions, including Peterson's, avoids busy waiting. In fact, if this were put down as a sixth condition, in addition to the five conditions listed at the end of Sec. 7.1, for making a solution acceptable, none of these solutions would be acceptable, however brilliant they may be. A way out of all this had to be found. In 1965, Dijkstra suggested a solution using a 'Semaphore', which is widely used today. We will discuss some of these earlier solutions and their shortcomings in the following sections. At the end, we will discuss Dijkstra's solution.

This was the first attempt to arrive at the mutual exclusion primitives. It is based on the assumption that there are only two processes, A and B, and that the CPU strictly alternates between them. It first schedules A, then B, then A again, and so on. The algorithms for the programs run by processes A and B are outlined in the succeeding lines. We assume that the variable Process-ID contains the name of the process, such as A or B. This is a shared variable between these processes and is initially set up by the Operating System for them. Figure 7.9 depicts this alternating policy. We can verify that mutual exclusion is guaranteed. A.0 and A.2 are the instructions which encapsulate the critical region and therefore functionally play the role of the primitives for mutual exclusion. This is true of instructions B.0 and B.2 also. Let us see how this works.
(i) Let us assume that initially Process-ID is set to "A" and Process A is scheduled. This is done by the Operating System.
(ii) Process A will execute instruction A.0 and fall through to A.1 because Process-ID = "A".
(iii) Process A will execute the critical region and only then is Process-ID set to "B" at instruction A.2.
Hence, even if a context switch takes place after A.0 or even after A.1 but before A.2, and if Process B is then scheduled (remember, we have assumed that there are only two processes!), Process B will continue to loop at instruction B.0 and will not enter the critical region. This is because Process-ID is still "A". Process B can enter its critical region only if Process-ID = "B". And this can happen only in instruction A.2, which in turn can happen only after Process A has executed its critical region in instruction A.1. This is clear from the program for Process A as given in Fig. 7.9. Mutual exclusion can be guaranteed in this scheme, but the scheme has some major problems, as listed below:

(a) If there are more than two processes, the system can fail. Imagine three processes PA1, PA2, PA3 executing the algorithm shown for Process A and PB1 executing the algorithm shown for Process B. Now consider the following sequence of events when Process-ID = "A".
• PA1 starts executing. It executes A.0 and falls through to A.1. Before it executes A.1, a process switch takes place.
• PA2 starts executing, and because Process-ID is still "A", it also goes through A.0 to A.1, and during A.1 (i.e. while in the critical region), again a process switch takes place.
• PA1 resumes from instruction A.1 and enters its critical region, thereby defeating the scheme of mutual exclusion. Both PA1 and PA2 are in the critical region simultaneously.
The algorithm for multiple simultaneous processes is fairly complex. Dijkstra proposed a solution, but it involved the possibility of indefinite postponement of some processes. Knuth suggested an improvement to Dijkstra's algorithm, but it still involved a possibility of long delays for some processes. Many revised algorithms have been suggested, but they are very complex and far from satisfactory.
(b) This algorithm also involves busy waiting, and wastes CPU time. If Process B is ready to be dispatched, it may waste the full time slice waiting at instruction B.0, if Process-ID = "A".
(c) This algorithm forces Processes A and B to alternate in a strict sequence. If the speed of these two processes is such that Process A wants to execute again before Process B takes over, it is not possible.
It is clear from the demerits discussed above that this solution violates many of the five conditions discussed earlier, and therefore, it is not a good solution. As mentioned earlier, many other solutions have been put forth. We will not discuss all of these. We will now present Peterson's algorithm.

This algorithm is also based on two processes only. It uses three variables. The first one, called Chosen-Process, takes the value "A" or "B" depending upon the process chosen. This is as in the earlier case. PA-TO-ENTER and PB-TO-ENTER are two flags which take the value "YES" or "NO". For instance, if PA wants to enter the critical region, PA-TO-ENTER is set to "YES" to let PB know about PA's desire. Similarly, PB-TO-ENTER is set to "YES" if PB wants to enter its critical region, so that PA can know about it if it tests this flag. The following algorithm (Fig. 7.10) will clarify the concepts. Let us assume that we start with the following values and trace the sequence of events.
PA-TO-ENTER = "NO", PB-TO-ENTER = "NO", and Chosen-Process = "A"
(i) Let us say that PA is scheduled first.
(ii) After executing A.0 and A.1, PA-TO-ENTER will become "YES", and Chosen-Process will be "B".
(iii) At A.2 and A.3 (it is one statement only!), because PB-TO-ENTER is "NO", it will fall through to A.4. This is because of the "AND" condition in A.2. It will now start executing Critical Region-A at A.4.
(iv) Let us assume that at this time, a process switch takes place and PB is scheduled. PA is still in its critical region.

(v) PB will execute B.0 and B.1 to set PB-TO-ENTER to "YES" and Chosen-Process to "A".
(vi) But at B.2, it will wait, because both the conditions are met, i.e. PA-TO-ENTER = "YES" (set in step (ii)) and Chosen-Process = "A" (set in step (v)). Thus, PB will be prevented from entering its critical region.
(vii) Eventually, when PA is scheduled again, it completes instruction A.5 to set PA-TO-ENTER to "NO", but only after coming out of its critical region.
(viii) Now if PB is scheduled again, it will resume at instruction B.2 and fall through B.2 and B.3 (because PA-TO-ENTER = "NO" as per step (vii)) to execute B.4 and enter the critical region of B. However, this has happened only after PA has come out of its critical region.
Peterson's algorithm is simple but brilliant. However, it suffers from the same shortcomings as discussed earlier. The algorithm does not allow more than two processes. Again, it is based on the inefficient 'busy waiting' philosophy.
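A C rendering of the algorithm just traced is sketched below. The variable names mirror PA-TO-ENTER, PB-TO-ENTER and Chosen-Process, and only process A's side is shown (process B is symmetric, with the roles reversed); the label comments follow the A.0 to A.5 numbering used above, though the book's Fig. 7.10 may word them differently.

```c
/* Shared variables, initialised as in the trace above. */
int  pa_to_enter = 0;            /* 0 = "NO", 1 = "YES"                     */
int  pb_to_enter = 0;
char chosen_process = 'A';

void process_a(void)
{
    pa_to_enter = 1;             /* A.0: announce A's desire to enter       */
    chosen_process = 'B';        /* A.1: give way to B                      */

    /* A.2/A.3: wait only while B also wants to enter AND B is the chosen
       process; either condition failing lets A fall through.               */
    while (pb_to_enter == 1 && chosen_process == 'B')
        ;

    /* A.4: Critical Region-A */

    pa_to_enter = 0;             /* A.5: A is no longer interested          */
}
```

On modern multiprocessors such a naive shared-variable version would additionally need memory barriers, but that refinement is beyond the scope of this discussion.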

All the solutions discussed till now were software solutions which required no special help from the hardware. However, many computers have a special instruction called the 'Test and Set Lock (TSL)' instruction. This instruction has the format "TSL ACC, IND", where ACC is the accumulator register and IND is the symbolic name of a memory location which can hold a character to be used as an indicator or a flag. The following two actions are taken when this instruction is executed:
(i) The current value of IND is copied into ACC.
(ii) IND is then set to "N", irrespective of its earlier value.

The interesting point is that this is an indivisible instruction, which means that it cannot be interrupted during its execution consisting of these two steps. Therefore, the process switch cannot take place during the execution of this TSL instruction. It will be either fully executed (i.e. both the actions mentioned above) or it will not be executed at all. How can we use this TSL instruction to implement the mutual exclusion? Let us assume that IND can take on the value “N” (indicating thereby that the critical region is being used currently, i.e. it is NOT free.) or “F” (indicating thereby that the critical region is not being used currently, i.e. it is FREE). When this flag is “F”, a process can set it to “N”, and only then can it enter the critical region.

Obviously, if IND = "N", no process can enter its critical region, because some process is already in the critical region. We had used this scheme earlier. The only difference in this scheme is the use of the TSL instruction. Checking whether IND is "F" and setting it to "N" regardless of its value can now be done in one shot, without interruption, thereby removing the earlier problem. For instance, we can write two common routines as shown in Fig. 7.11. These routines, "Enter-Critical-Region" and "Exit-Critical-Region", can now constitute the mutual exclusion primitives. They are written in the assembly language of a hypothetical computer. They can easily be written for any other computer as well.
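On modern C compilers, the effect of such an indivisible test-and-set can be approximated with the C11 atomic_flag type, whose atomic_flag_test_and_set() operation is likewise uninterruptible. The sketch below mirrors the spirit of the Enter-Critical-Region and Exit-Critical-Region routines of Fig. 7.11 (here, the flag being set plays the role of the value "N" and the flag being clear plays the role of "F"); it is an analogue, not the book's assembly code.

```c
#include <stdatomic.h>

static atomic_flag ind = ATOMIC_FLAG_INIT;   /* clear = "F", set = "N"      */

void enter_critical_region(void)
{
    /* Like EN.0/EN.1: atomically fetch the old value and set the flag;
       busy wait as long as the old value was already set ("N").            */
    while (atomic_flag_test_and_set(&ind))
        ;                                    /* someone else is inside      */
}

void exit_critical_region(void)
{
    atomic_flag_clear(&ind);                 /* like EX.0: IND becomes "F"  */
}
```

Any process would then bracket its critical region with calls to these two routines, exactly in the manner of Fig. 7.12 discussed next.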

The format of the algorithm for any process using shared variables is the same and is as given in Fig. 7.12. Let us imagine that both PA and PB are executing the same algorithm as given in Fig. 7.12. The following sequence of events can be traced to verify the result:
(i) Initially, let us say that IND = "F".
(ii) PA is scheduled.
(iii) Instruction 0 is executed and then, through instruction 1, Enter-Critical-Region is called.
(iv) Instruction EN.0 is executed. Now ACC becomes "F", but IND becomes "N".
(v) Instruction EN.1 recognizes that ACC has the value "F" and therefore prepares to go into the critical region by executing EN.3, i.e. returning to the caller (in this case, instruction 2 of PA).
(vi) Let us assume that PA loses control to PB due to a context switch at this juncture, just before the critical region is actually entered.
(vii) PB now executes instruction 0 of process PB.
(viii) PB executes instruction 1 and therefore calls Enter-Critical-Region of process PB.
(ix) EN.0 is executed for PB. ACC now becomes "N" (in step (iv), IND had become "N", which is now moved to ACC) and IND continues to hold the value "N".
(x) EN.1 is now executed for PB and, because of the unequal comparison, it loops back and therefore does not get out of the Enter-Critical-Region routine. Thus, PB cannot reach instruction 2 and enter its critical region. This is because PA is in its critical region.
(xi) Eventually, PA is again scheduled. It gets into the critical region (executes instruction 2 of the main program).
(xii) While executing the critical region (instruction 2) for PA, if a context switch takes place again, and if PB is again scheduled, PB will still loop because both IND and ACC still continue to be "N" from step (ix). (Busy waiting!)
(xiii) Let us assume that PA completes instruction 2 and gets out of its critical region. PA executes instruction 3 of the main program and calls the Exit-Critical-Region routine. EX.0 is now executed for PA, where IND becomes "F" again. PA gets out of instruction 3 of the main program by executing EX.1.
(xiv) Let us assume that PB is now scheduled again, where it executes EN.0 once more. (It was looping in Enter-Critical-Region, remember?) After EN.0, ACC becomes "F" (because IND had become "F" in step (xiii) and IND is moved to ACC) and IND is changed to "N" due to the TSL instruction.
(xv) At EN.1, it now finds that ACC is equal to "F" (due to step (xiv)) and therefore it goes to EN.3 and returns to the main program, i.e. instruction 2.

(xvi) PB now can enter its critical region. Thus, we can see that PB could enter its critical region only after PA came out of its critical region. We leave it to the reader to ensure that this solution works in all cases. The algorithm shown in Fig. 7.12 does not specify whether it is for a producer process or a consumer process. It is valid for any process. In short, if any process which uses critical region is written in this fashion, race conditions can be avoided. The solution, however, is not without demerits. In the first place, it uses special hardware, and therefore, cannot be generalized to all the machines. Secondly, it also is based on the principle of busy waiting and therefore, is not the most efficient solution. Finally, Dijkstra in 1965 found a new method using the concept of Semaphores. It can tackle most of the problems mentioned above, depending upon its implementation. We will now study this method.

Semaphores represent an abstraction of many important ideas in mutual exclusion. A semaphore is a protected variable which can be accessed and changed only by operations such as "DOWN" (or P) and "UP" (or V). It can be a 'Counting Semaphore' or a 'General Semaphore', in which case it can take on any non-negative value. Alternatively, it can be a 'Binary Semaphore', which can take on only the values 0 or 1. Semaphores can be implemented in software as well as in hardware. The concept of semaphores is as follows. "DOWN" and "UP" form the mutual exclusion primitives for any process. Hence, if a process has a critical region, it has to be encapsulated between these DOWN and UP instructions. The general structure of any such process then becomes as shown in Fig. 7.13. The "DOWN(S)" and "UP(S)" primitives ensure that only one process is in its critical region. All other processes wanting to enter their respective critical regions are kept waiting in a queue called a 'semaphore queue'. The queue also requires a queue header, and all the PCBs in this queue also need to be chained in the same way as in the ready and blocked queues. Hence, the Operating System can traverse through the PCBs of all the processes waiting on the semaphore (i.e. waiting for the critical region to become free).

Only when a process which is in its critical region comes out of it should the Operating System allow a new process to be released from the semaphore queue. Figure 7.14 shows the flowcharts for the "DOWN(S)" and "UP(S)" routines. Semaphores work on the following basic principles.
(i) As is clear from Fig. 7.13, unless a process executes the DOWN(S) routine successfully, without getting added to the semaphore queue at instruction 1, it cannot get into its critical region at instruction 2. Thus, if a process is in its critical region, we can safely assume that its DOWN(S) instruction must have been executed. We have to study the DOWN(S) routine to understand this.
(ii) Let S be a binary semaphore. This means that S can take only the values 0 or 1. Let us decide that a process can enter its critical region only if S = 1. The flowchart in Fig. 7.14 (a) shows that if S is > 0, it is reduced by 1 and becomes 0 in the DOWN routine (instruction 1 of Fig. 7.13) itself. Only then is the process allowed to enter its critical region at instruction 2. Therefore, if any process is in its critical region, there is a guarantee that S = 0. This is clear from Fig. 7.14 (a).
(iii) Hence, no new process can get into its critical region when S = 0. If it tries to execute the DOWN(S) routine, it will be added to the semaphore queue as shown in Fig. 7.14 (a). It will not be able to proceed into its critical region because S = 0. The process is pushed from the running state into the semaphore queue in instruction 1 itself. Because it is not in a running state, it cannot proceed and therefore cannot enter its critical region to execute instruction 2.
(iv) S can become 1 again only in the UP(S) routine, as is clear from Fig. 7.14 (b). From Fig. 7.13, it is also clear that UP(S) is executed in instruction 3 only after a process has come out of its critical region. This again is the reason that if S = 1, there is a guarantee that there is no other process in the critical region at that time. Therefore, a new process can be allowed to enter it. This is exactly what the DOWN(S) routine does (i.e. it allows a process to enter its critical region only if S = 1). We will study the use of the semaphore queue with an example later.
One of the requirements of this scheme is the indivisibility of the DOWN(S) and UP(S) instructions. The Lock and Unlock operations in these routines shown in Fig. 7.14 are essentially for this purpose. On a single processor computer, Lock/Unlock can be implemented by "Disable interrupts" and "Enable interrupts" instructions, so that during the execution of DOWN(S) or UP(S), no process switch can take place. On multiprocessor computers, it is possible for two or more processes running on different processors to enter a wait condition. In such a case, a hardware instruction such as Test and Set Lock (TSL) is used to implement indivisibility, despite its drawback of 'busy waiting'. We will assume that DOWN(S) and UP(S) are routines implemented in the kernel of the Operating System as system calls. We now present the algorithms for DOWN(S) and UP(S) on a uniprocessor system. These are shown in Fig. 7.15 (a) and Fig. 7.15 (b). In Figs. 7.15 (a) and 7.15 (b), "Wait on S" in D.3 means moving the PCB of the running process into the semaphore queue. "Release a process" in U.3 means moving the first PCB from the semaphore queue to the ready queue. As we have seen, S = 0 indicates that the DOWN(S) operation has been performed but UP(S) has not been completed. This means that there is a process in a critical region. At this time, there could be other processes wanting to enter their critical regions. They cannot be put in a blocked state. This is the reason why they are put in a semaphore queue if S is not > 0 in the DOWN(S) routine. Thus, the semaphore queue is a list of PCBs for all the processes which are waiting for the critical region to become free. As soon as it becomes free and UP(S) is performed, S is made = 1 and one process at a time is admitted from the semaphore queue to the ready queue in the UP(S) routine.
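A C-style sketch of the uniprocessor DOWN(S) and UP(S) routines just described is given below. The types and helper routines (disable_interrupts(), add_to_semaphore_queue(), release_one_process(), queue_is_empty(), current_process) are illustrative assumptions standing in for the kernel data structures of Figs. 7.15 (a) and (b).

```c
struct pcb;                                  /* the PCB structure (not shown)   */
struct pcb_queue;                            /* a chained queue of PCBs         */

extern struct pcb *current_process;
extern void disable_interrupts(void);
extern void enable_interrupts(void);
extern void add_to_semaphore_queue(struct pcb_queue *q, struct pcb *p);
extern int  queue_is_empty(struct pcb_queue *q);
extern void release_one_process(struct pcb_queue *q);   /* move 1st PCB to ready */

typedef struct {
    int s;                                   /* the semaphore value             */
    struct pcb_queue *queue;                 /* PCBs of processes waiting on S  */
} semaphore_t;

void down(semaphore_t *sem)
{
    disable_interrupts();                            /* D.0: indivisibility     */
    if (sem->s > 0)                                  /* D.1                     */
        sem->s = sem->s - 1;                         /* D.2                     */
    else                                             /* D.3: wait on S          */
        add_to_semaphore_queue(sem->queue, current_process);
                                                     /* D.4: endif              */
    enable_interrupts();                             /* D.5                     */
}

void up(semaphore_t *sem)
{
    disable_interrupts();                            /* U.0                     */
    sem->s = sem->s + 1;                             /* U.1                     */
    if (!queue_is_empty(sem->queue))                 /* U.2                     */
        release_one_process(sem->queue);             /* U.3: release a process  */
                                                     /* U.4: endif              */
    enable_interrupts();                             /* U.5                     */
}
```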

Let us illustrate in a step-by-step manner how these algorithms work. Let us assume that S = 1 to begin with, and that there are four processes PA, PB, PC and PD in the ready queue. Each of these processes has the format shown in Fig. 7.13. Let us now assume that PA gets scheduled and dispatched first. The following now takes place:
(i) PA executes instruction 0 shown in Fig. 7.13.
(ii) PA starts executing instruction 1 shown in Fig. 7.13, i.e. the DOWN(S) routine.
(iii) Interrupts are disabled by the "DOWN" call at D.0 as shown in Fig. 7.15 (a), to ensure indivisibility.
(iv) It checks in D.1 if S > 0. As S = 1, it finds that S > 0, and so it decrements S in D.2. Now S becomes 0. It then comes out of the "If" statement by "Endif" at D.4 in Fig. 7.15 (a) (it skips D.3).
(v) It enables interrupts at D.5.
(vi) PA starts executing instruction 2 shown in Fig. 7.13, i.e. it enters the critical region.
(vii) Let us assume that the time slice for PA gets over while it is in the critical region and PA is moved from the "Running" to the "Ready" state.
(viii) Let us assume that PB is now dispatched.
(ix) PB executes instruction 0 of Fig. 7.13.
(x) PB starts executing instruction 1 of Fig. 7.13, i.e. the DOWN(S) operation.
(xi) The "DOWN" call disables interrupts at instruction D.0 of Fig. 7.15 (a).
(xii) It checks if S > 0 at D.1. At this juncture, S = 0 (see step (iv)) and therefore this check will fail.
(xiii) Therefore, it skips D.2 and executes D.3, i.e. it adds the PCB of PB to the semaphore queue and keeps it waiting. It is no longer running.
(xiv) It comes out of the "If" statement by "Endif" at D.4.
(xv) It enables interrupts again at D.5.
(xvi) As PB is no longer a "Running" process, let us assume that the scheduler schedules PC and dispatches it.
(xvii) PC will go through the same steps as PB, and its PCB will also get added to the semaphore queue, because S is still = 0. We know that S can become 1 only in the UP(S) operation, which takes place after the execution of the critical region portion of any process. We also know that only PA, when rescheduled, can achieve this, since no other process can even enter its critical region so long as PA has not come out of its own. All will only continue getting added to the semaphore queue. Only the UP(S) instruction can again set S to 1; but UP(S) can get executed only after PA has come out of its critical region.
(xviii) Let us assume that PA is eventually rescheduled and it resumes where it had left off last time, at step (vii).
(xix) PA completes instruction 2 of Fig. 7.13, i.e. its critical region.
(xx) PA calls the UP(S) routine at instruction 3 of Fig. 7.13.
(xxi) UP(S) disables interrupts at U.0.
(xxii) It increments S by 1 at U.1. Now S becomes 1.
(xxiii) It checks the semaphore queue at U.2 and finds that it is NOT empty.
(xxiv) It releases PB, i.e. moves it from the semaphore queue to the ready queue. PC still continues in the semaphore queue.

(xxv) It executes U.4 and U.5 and comes out of UP(S) after enabling the interrupts again.
(xxvi) PA starts executing instruction 4 of Fig. 7.13. Let us assume that during the execution of instruction 4 (actually, instruction 4 is a set of instructions), PA's time slice gets over, and PD gets scheduled (maybe PD had a higher priority than PB!).
(xxvii) PD executes instruction 0 of Fig. 7.13.
(xxviii) PD calls the "DOWN" routine at instruction 1 of Fig. 7.13.
(xxix) The "DOWN" routine goes through the instructions exactly as discussed in steps (iii), (iv) and (v). It will decrement S to 0 and allow PD to enter its critical region.
(xxx) Let us assume that PD finishes instruction 2 of Fig. 7.13, i.e. the critical region.
(xxxi) Let us assume that the time slice for PD is over after it has finished instruction 2 of Fig. 7.13 (i.e. the critical region) but before it executes instruction 3 of Fig. 7.13 (i.e. the "UP(S)" call).
(xxxii) Let us assume that PB gets scheduled. (It can now be scheduled, because it was made "Ready" in step (xxiv).)
(xxxiii) PB executes instructions 0 and 1 of Fig. 7.13. Because S = 0, PB is added to the semaphore queue again.
(xxxiv) Let us assume that PD is scheduled again, and it completes the UP(S) operation to set S to 1. Also, because the semaphore queue is NOT empty, it will release PC into the ready queue.
(xxxv) PC can now be scheduled.
This procedure is repeated for all the processes until all are over. Two interesting points emerge out of this discussion.
(a) When a process enters a critical region, if a process switch takes place before it completes the UP(S) instruction, what is the use of scheduling any other process? What is the purpose of picking up another process from the ready queue and moving it to the semaphore queue? At a later time, it will have to be moved back to the ready queue before dispatching, in any case. Also, even if this happens and it is dispatched from the ready queue once again, if by that time UP(S) has still not been finished by the other process, i.e. if S is still 0, our process will be moved again from the ready queue to the semaphore queue. In short, what is the sense in scheduling any process when S = 0? The answer could be: not all processes have critical regions. Thus, the queue of ready processes contains only a few which have the format given in Fig. 7.13. If a process having a critical region could be marked separately for the scheduler to ignore while scheduling when S = 0, it could work. But then the overhead for all this is just not worth it.
(b) Due to the scheme of the semaphore, the order in which the scheduler wanted the processes to proceed can change. For instance, in our example we had planned our processes in the order PA, PB, PC and PD. They were, however, actually scheduled in the order PA, PD, PC and PB.
Semaphores have very wide uses and applications wherever shared variables are used. In fact, the Operating System can use them to implement the scheme of blocking/waking up of processes when they wait for any event such as the completion of an I/O. Thus, semaphores can be used to synchronize processes through block/wakeup operations as shown in Fig. 7.16. Note that one of the processes has only the DOWN instruction and the other has only the UP instruction. We leave it to the reader to verify the detailed steps involved in the interaction between these two processes and how they can perform the block/wakeup operations.
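The block/wakeup use of semaphores in the manner of Fig. 7.16 can be tried out directly with POSIX semaphores. In the sketch below, a hypothetical arrangement using two threads rather than two processes (purely for compactness), the waiter performs only the DOWN (sem_wait) and the signaller performs only the UP (sem_post).

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

static sem_t event;               /* initialised to 0: event not yet occurred */

static void *waiter(void *arg)    /* has only the DOWN operation              */
{
    (void)arg;
    sem_wait(&event);             /* blocks (no busy waiting) until the UP    */
    printf("waiter: woken up, proceeding\n");
    return NULL;
}

static void *signaller(void *arg) /* has only the UP operation                */
{
    (void)arg;
    sleep(1);                     /* pretend to complete some work or an I/O  */
    printf("signaller: work done, waking the waiter\n");
    sem_post(&event);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    sem_init(&event, 0, 0);       /* pshared = 0 (threads of one process),
                                     initial value 0                          */
    pthread_create(&t1, NULL, waiter, NULL);
    pthread_create(&t2, NULL, signaller, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&event);
    return 0;
}
```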

The operating system literature has some extremely interesting problems pertaining to IPC. Three of them seem to be the most well known:
(a) The dining philosophers' problem
(b) The readers' and writers' problem
(c) The sleeping barber problem
We shall now examine these three problems and the three corresponding algorithms to tackle them.

Dijkstra posed the dining philosophers' problem in 1965, and also solved it himself. The problem can be described as follows. Five philosophers are seated on five chairs around a table. Each philosopher has a plate full of spaghetti. The spaghetti is very slippery, so each philosopher needs a pair of forks to eat it. However, there are only five forks available altogether, arranged in such a manner that there is one fork between any two plates of spaghetti. This arrangement is shown in Fig. 7.17.
(i) Each philosopher performs two activities continuously: thinking for some time and eating for some time. Obviously, thinking does not require any special algorithm here.
(ii) However, eating does. In order to eat, a philosopher lifts two forks, one to his left and the other to his right (not necessarily in that order).
(iii) If he is successful in obtaining two forks, he starts eating.
(iv) After some time, he stops eating and puts both the forks down.
A simple algorithm to implement this solution is shown in Fig. 7.18. For simplicity, we shall assume that a philosopher always picks up the left fork first and the right fork second. Let us examine this algorithm closely. On the face of it, it appears that it should work perfectly. However, it has a major flaw. What if all the five philosophers decide to eat at the same time? All the five philosophers would attempt to pick up two forks at the same time. But since only five forks are available, perhaps none of them would succeed. To improve the algorithm, let us add a condition between the two Take_fork instructions shown in the algorithm.
(i) When a philosopher succeeds in obtaining the left fork, he checks to see if the right fork is available.

(ii) If the philosopher does not succeed in obtaining the right fork, he puts down his left fork and waits for some time.
(iii) After this pre-defined time elapses, the philosopher again starts with the action of picking up the left fork.
Unfortunately, this solution can also fail. What if all the five philosophers pick up their left fork at the same time and attempt to pick up their right fork simultaneously? Obviously, they will all fail to obtain the right fork, and therefore abandon their attempt. Moreover, they might wait for the same amount of time and reattempt lifting their left fork. This would continue without any philosopher succeeding in actually obtaining both the forks.

This problem, wherein many programs running at the same time simply wait for some event to happen without performing any useful task, is called starvation (which, actually, is quite appropriate here, as the philosophers would really be starved!). One possible solution is to add randomness to this. The time for which the philosophers wait can be made different for each one of them.
(i) Thus, if all the philosophers pick up their left fork at the same time, all the five can put their left forks down.
(ii) Then, the first philosopher could wait for three seconds, the second philosopher could wait for five seconds, the third philosopher could wait for just one second, and so on.
(iii) This randomness can be further randomized, just in case any two philosophers somehow manage to wait for the same time.
With such a scheme in place, this problem might be resolved. However, this is not a perfect solution which guarantees success in every situation. A still better scheme is desired. The appropriate solution to this problem is the use of a binary semaphore.
(i) Before a philosopher starts acquiring the left fork, he does a DOWN on mutex (i.e. he disallows other philosophers from testing him).
(ii) After eating is over, a philosopher performs an UP on mutex (i.e. he allows other philosophers to test him).
(iii) With five forks, at the most two philosophers can eat at the same time. Therefore, for each philosopher, we define three possible states: eating, hungry (making an attempt to acquire forks) or thinking.
(iv) A philosopher can move into the eating state only if both of his neighbors are not eating.
The algorithm shown in Fig. 7.19 presents an answer to the dining philosophers' problem. We allocate one semaphore to each philosopher. This allows each philosopher to maintain his current state (e.g. a hungry philosopher can wait before he can move into the eating state). The algorithm shows the steps carried out by a single philosopher. The same logic applies for all the other philosophers. In which practical situations would the dining philosophers problem apply? Clearly, it is useful when the number of resources (such as I/O devices) is limited, and many processes compete for exclusive access to those resources. This problem is unlikely in the case of database access, for which the next problem is applicable.

Imagine a large database containing thousands of records. Assume that many programs (or processes) need to read from and write to the database at the same time. In such situations, it is quite likely that two or more processes make an attempt to write to the database at the same time. Even if we manage to take care of this, while a process is writing to the database, no other process must be allowed to read from the database, to avoid concurrency problems. But we must allow many processes to read from the database at the same time.
(i) A proposed solution tackles these issues by assigning higher priorities to the reader processes, as compared to the writer processes.
(ii) When the first reader process accesses the database, it performs a DOWN on the database. This prevents any writing process from accessing the database.
(iii) While this reader is reading the database, if another reader arrives, that reader simply increments a counter RC, which indicates how many readers are currently active.

(iv) Only when the counter RC becomes 0 (which indicates that no reader is active) can a writer write to the database. An algorithm to implement this functionality is shown in Fig. 7.20. Clearly, this solution assigns higher priority to the readers than to the writers. (i) If many readers are active when a writer arrives, the writer must wait until all the readers finish their reading jobs. (ii) Moreover, if a few more readers keep coming in, the writer has to wait until all of them finish reading. This may not always be the best solution, but it is certainly safe.

An interesting IPC problem can happen inside a barber shop! A barber shop has one barber, one barber chair for the customer being served currently (if any) and n chairs for the waiting customers (if any). The barber manages his time very efficiently.

(i) When there are no customers, the barber goes to sleep on the barber chair. (ii) As soon as a customer arrives, he has to wake up the sleeping barber and request a haircut or a shave. (iii) If more customers arrive while the barber is serving a customer, they either sit down in the waiting chairs (if any are empty) or simply leave (if there are no empty waiting chairs). (iv) The challenge is to write an algorithm that manages all these activities without causing any problems or race conditions. This situation is shown in Fig. 7.21.

We define three semaphores: customers (which specifies the number of waiting customers, excluding the one in the barber chair), barbers (0 means that the barber is free, 1 means he is busy) and mutex (our standard mutual exclusion variable). We maintain another copy of the customers counter with the name waiting. This variable is required so that a new customer entering the shop can check whether the number of waiting customers is the same as the number of waiting chairs (in which case he leaves the barber shop, as there is no free chair to wait in). The algorithm shown in Fig. 7.22 provides the solution. (i) When the barber starts his working day, he executes a procedure called Barber. (ii) This blocks on the semaphore customers until a customer arrives. He then straightaway goes to sleep! This is shown in Fig. 7.21.

(iii) When the first customer arrives, he executes another procedure called Customer. This results in the acquisition of the critical region (mutex). (iv) Thus, if a second customer arrives immediately after the first one, the second customer cannot be attended to, as the first one has not released mutex yet. (v) The incoming customer compares the number of waiting customers with the number of waiting chairs. If no waiting chair is free, the customer leaves without a haircut. (vi) If the incoming customer was able to locate an empty chair, he increments the counter variable waiting and does an UP on the semaphore customers, which wakes the barber up. (vii) When the customer releases mutex, the barber acquires it to perform his own tasks, and when these are over, he gives the haircut. (viii) After the haircut is complete, the customer leaves the shop. Since a haircut is not a repeating activity for the customer, no loop is required. However, the barber loops and attempts to get the next customer. If a new customer is available, a new haircut is given. Otherwise, the barber goes back to what he knows next best – sleep! A sketch of this logic appears below.
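What follows is a compact sketch of the barber-shop logic just described, not the book's Fig. 7.22 listing. POSIX semaphores stand in for the DOWN/UP primitives (sem_wait ~ DOWN, sem_post ~ UP), and the constant CHAIRS, the number of customer threads and the thread set-up are illustrative assumptions.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

#define CHAIRS 5

sem_t customers;          /* number of customers waiting for service      */
sem_t barbers;            /* barber ready to cut hair                     */
sem_t mutex;              /* protects the 'waiting' counter               */
int   waiting = 0;        /* copy of 'customers' that can be inspected    */

void *barber(void *arg)
{
    for (;;) {
        sem_wait(&customers);     /* go to sleep if no customer is waiting */
        sem_wait(&mutex);
        waiting--;                /* one waiting customer will be served   */
        sem_post(&barbers);       /* the barber is now ready               */
        sem_post(&mutex);
        printf("barber: cutting hair\n");
        sleep(1);                 /* cut_hair()                            */
    }
    return NULL;
}

void *customer(void *arg)
{
    sem_wait(&mutex);
    if (waiting < CHAIRS) {       /* is there a free waiting chair?        */
        waiting++;
        sem_post(&customers);     /* wake the barber if he is asleep       */
        sem_post(&mutex);
        sem_wait(&barbers);       /* wait until the barber is free         */
        printf("customer: getting haircut\n");
    } else {
        sem_post(&mutex);         /* shop full: leave without a haircut    */
    }
    return NULL;
}

int main(void)
{
    pthread_t b, c[8];
    sem_init(&customers, 0, 0);
    sem_init(&barbers, 0, 0);
    sem_init(&mutex, 0, 1);
    pthread_create(&b, NULL, barber, NULL);
    for (int i = 0; i < 8; i++) pthread_create(&c[i], NULL, customer, NULL);
    for (int i = 0; i < 8; i++) pthread_join(c[i], NULL);
    return 0;
}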

Semaphores offer a good solution to the concurrency issues that arise when multiple processes want to access the same resource. However, the problem with semaphores is that their actual use requires programming at the level of system calls. That is, the application developer needs to explicitly invoke semaphore-related system calls to ensure concurrency. This is not only tedious, but can actually lead to erroneous code. Consequently, better schemes are desired. Monitors offer such a solution.

A monitor is a high-level construct. It is an abstraction over semaphores. Coding with monitors is similar to programming in a high-level language, whereas working with semaphores is akin to working in assembly language. Monitors are easy to program. The compiler of a programming language usually implements them, thus reducing the scope for programming errors. A monitor is a group or collection of data items (variables), data structures and procedures. It is somewhat similar to an object (in the sense of object technology). The client processes cannot access the data items inside a monitor directly. The monitor guards them closely. The clients can only invoke the services of the monitor in the form of the monitor's procedures. This provides a shield against the internal details of a monitor.

A monitor is similar to a critical section. At any given time, only one process can be inside a monitor. If another process attempts to enter the monitor while one process is already inside, the attempt fails, and the latter process must wait until the process that is already inside the monitor leaves. The processes that use the services of the monitor need not know about its internal details. They need not know, for instance, how it is implemented, or the sequence in which the monitor executes its instructions. In contrast, a programmer who works with semaphores actually needs to use this sort of information while coding. Of course, one argument in favor of semaphores as against monitors is that semaphores, by virtue of their low-level interface, provide more granular control. This is analogous to assembly language, which gives the application programmer finer-grained control than a high-level language does. However, if the programmer does not need such low-level features, a monitor is the better choice. The sketch below shows how a monitor-like construct can be built.
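C has no native monitors, so the sketch below only emulates one: the shared data is private to the file and is touched only inside procedures that first acquire a single lock, and condition variables provide the waiting that a monitor would. The bounded-buffer example, the buffer size N and all the names are illustrative assumptions, not the book's figure.

#include <pthread.h>
#include <stdio.h>

#define N 10

static int buffer[N], count = 0, in = 0, out = 0;        /* hidden data    */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* "monitor" lock */
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

/* Monitor procedure: only one thread can be "inside" at a time. */
void deposit(int item)
{
    pthread_mutex_lock(&lock);
    while (count == N)                       /* wait until there is room */
        pthread_cond_wait(&not_full, &lock);
    buffer[in] = item;
    in = (in + 1) % N;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

/* Monitor procedure: symmetric to deposit(). */
int withdraw(void)
{
    pthread_mutex_lock(&lock);
    while (count == 0)                       /* wait until there is data */
        pthread_cond_wait(&not_empty, &lock);
    int item = buffer[out];
    out = (out + 1) % N;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return item;
}

static void *producer(void *arg) { for (int i = 0; i < 5; i++) deposit(i); return NULL; }
static void *consumer(void *arg) { for (int i = 0; i < 5; i++) printf("%d\n", withdraw()); return NULL; }

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

Notice that the callers of deposit() and withdraw() know nothing about the lock or the condition variables, which is exactly the shielding of internal details described above.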

The need for message passing arose because techniques such as semaphores and monitors work fine only in a local scope. In other words, as long as the processes are local (i.e. on the same CPU), these techniques work perfectly. However, they are not intended to serve the needs of processes that are on physically different machines. Such processes, which communicate over a network, need some mechanism to communicate with each other and yet ensure concurrency. Message passing is the solution to this problem. Using the technique of message passing, one process (the sender) can safely send a message to another process (the destination), without having to worry about how the message reaches the destination process. This is conceptually similar to the technology of Remote Procedure Calls (RPC), the difference being that message passing is an operating system concept, whereas RPC is a data communications concept. In message passing, two primitives are generally used: send and receive. The sender uses the 'send' call to send a message, and the receiver uses the 'receive' call to receive one. These two calls take the following form: send (destination, &message); receive (source, &message);
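One concrete realization of these two primitives is the POSIX message queue API. The sketch below is an assumption-laden illustration, not the book's example: the queue name "/demo_q", the message text and the fact that one program plays both sender and receiver are all chosen only for brevity (on Linux, compile with -lrt).

#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = 64 };
    mqd_t q = mq_open("/demo_q", O_CREAT | O_RDWR, 0600, &attr);
    if (q == (mqd_t)-1) { perror("mq_open"); return 1; }

    char out[64] = "hello";
    mq_send(q, out, strlen(out) + 1, 0);        /* send(destination, &message) */

    char in[64];
    mq_receive(q, in, sizeof(in), NULL);        /* receive(source, &message)   */
    printf("received: %s\n", in);

    mq_close(q);
    mq_unlink("/demo_q");
    return 0;
}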

Notably, the two processes can be local, i.e. on the same machine, or they can be remote, i.e. on physically different machines. If the two processes are local, the message passing mechanism is quite simple. However, if the two processes are not on the same machine, a lot of overhead is required to ensure that the message passing is successful. For instance, the receiver has to send an acknowledgement (either a positive acknowledgement, i.e. ACK, or a negative acknowledgement, i.e. NAK) to the sender, and the sender has to take appropriate action accordingly. There can be other issues as well. How long should the sender wait for an acknowledgement before re-sending the message? What if the re-transmission fails as well? How does the receiver distinguish between the various parts of a message, if the sender has broken the original message into multiple parts and sent them separately? We can see that to handle such situations, the message passing mechanism has to be somewhat similar to the Transmission Control Protocol (TCP), which guarantees error-free, only-once and assured delivery of messages.

A process is a program (or operation) that is currently executing. A process is created as a result of a specific type of system call. In a multi-user environment, many users work simultaneously, and each user generally requests for and initiates some processing. This means that many processes need to execute simultaneously to satisfy all the users. Even in a single-user Operating System, multiple processes run to perform multiple tasks. Each process requires resources such as memory, CPU time, access to files/directories, etc. In the case of concurrent processes, memory and CPU time are distributed among all the processes. The process management function of the Operating System monitors concurrent process execution. Each process goes through the following stages during its execution: Start (the process starts), Wait (the process waits) and Terminate (the process exits when its task is complete). In addition, a process can create child processes. Parent and child processes can execute concurrently, and a parent process can wait until the execution of the child process is finished (a sketch of this appears after this discussion).

When concurrent processes are executing, there are two types of processes: (1) independent processes and (2) cooperating processes. Independent process – A process which does not affect the execution of other running processes, and which cannot be affected by their execution, is an independent process. When a process runs independently, it does not share data or resources with any other process. Cooperating process – When the execution of a process depends on other processes, the processes are called cooperating processes. Cooperating processes share data and resources among each other. We require a supporting environment for cooperating processes for the following important reasons: Information sharing – When many users are interested in the same file or database table, no single process can operate exclusively on that file/database table. Hence, concurrent access to such resources is required for information sharing. CPU utilization – When a task is divided into small processes, each of which executes concurrently, the CPU is utilized better and the overall task can be finished in less time.
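The POSIX fork()/wait() pair is one concrete way of creating a child process and letting the parent wait for it, as described above. This is only a minimal sketch; the messages printed are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();               /* create a child process             */
    if (pid < 0) { perror("fork"); exit(1); }

    if (pid == 0) {                   /* child: runs concurrently            */
        printf("child %d: doing some work\n", getpid());
        exit(0);                      /* terminate when the task is complete */
    }

    int status;
    waitpid(pid, &status, 0);         /* parent: wait for the child to finish */
    printf("parent %d: child finished\n", getpid());
    return 0;
}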

Concurrent processes are programs designed in such a way that the whole program can be divided into small, interacting executable pieces, which can run in parallel or sequentially. Each piece of the program behaves as a separate computational process. The processes may run close to each other or be distributed across a network. The main challenge in designing concurrent programs is ensuring that process execution happens in synchronization, without resource locking problems or situations such as deadlocks. There are also challenges such as process coordination, inter-process communication, coordinating access to resources and sharing of resources among the processes.

Some I/O media, such as disks, are easily sharable. Multiple processes can use the same disk drive for reading or writing. But we cannot do the same for certain I/O media such as a tape drive, a printer or a plotter. For instance, it is not easy to imagine a printer allocated to two processes, and worse yet, belonging to two different users. In such a case, a printed report might contain some lines of payslips interspersed with lines of sales analysis or production figures. One can imagine the resulting chaos! This is why some I/O media have to be allocated exclusively to only one process. Because of their non-sharable nature, the user process has to request for the entire device explicitly, and the Operating System has to allocate such I/O devices accordingly. Only when the user process gives up a device explicitly can the Operating System take it back and add it to the free pool.

Some problems arise when an I/O medium is allocated to a process in an exclusive manner. Let us imagine two processes, PA and PB, running simultaneously. Half way through, PA requests the Operating System for a file on the tape drive, and PB requests the Operating System for the printer. Let us assume that both the requests are granted. After a while, PA requests for the printer without giving up the tape drive. Similarly, PB requests for the tape drive without giving up control of the printer. Assuming that the system has only one tape drive and one printer, what will happen? It is clear that both the processes cannot proceed. PA will wait until PB releases the printer. But that can happen only if PB can proceed further and finish its processing with the printer. And this can happen only if PB gets the tape drive that PA is holding. PB can get the tape drive only if PA can proceed further and completes its work with the tape drive. That cannot happen either, unless PA gets the printer which PB is holding. This situation is called a 'deadlock'. It is not necessary that a deadlock can take place only in the context of I/O media. It can, in fact, happen for any shared resource, such as the internal tables maintained by the Operating System or even semaphores.

Due to the problem of race conditions, we want semaphores and certain shared variables to be accessed in an exclusive manner, as seen earlier. These too can then become the causes of a deadlock. For instance, let us assume that an Operating System allows a maximum of 48 processes because it has allocated an area for only 48 PCBs and other data structures. If a process creates a child process, a new PCB has to be acquired and allocated to it. If no new PCB is available, the parent process waits, and after a while, attempts to acquire a PCB again. Normally, after a while, if some other process is killed, a PCB will become available and the attempt to create a child process may succeed. So far so good. But imagine that there are 8 processes running simultaneously, each of which needs to create 9 subprocesses or children. In this case, assuming that the nature and speed of the processes are the same, after each process has created 5 subprocesses, the total number of processes will be 8 parent processes + (8 x 5) child processes = 48. When the 8 parent processes start creating their 6th child process each, the PCB space will be exhausted, and all the processes will go into an endless wait loop, hoping that some day there might be some space for new PCBs to be created. But that day will never arrive, because it is a deadlock. No other process exists which will eventually terminate to free a PCB and allow a forking parent process to proceed and create further child processes. All the processes will keep on waiting. A similar problem exists in Database Management Systems due to the locking of records. Process A has locked record REC–0 and wants to read REC–1. Process B has locked REC–1 and issues a call to read REC–0. What will happen?

To represent the relationship between processes and resources, a certain graphical notation is used. Figure 8.1 shows square boxes as resources named R1 and R2. Similarly, processes shown as hexagons are named P1 and P2. The arrows show the relationship. For instance, in part (a) of the figure, resource R1 is allocated to process P1, or in other words, P1 holds R1. In part (b) of the figure, process P2 wants resource R2, but it has not yet got it. It is waiting for it. (The moment it gets the resource, the direction of the arrow will change.) These graphs are called 'Directed Resource Allocation Graphs (DRAG)'. They help us in understanding the process of detection of a deadlock, as we shall see. Now let us imagine a typical scenario for the deadlock: (i) P1 holds R1 but demands R2, and (ii) P2 holds R2 but demands R1. If we draw a DRAG for this situation, it will look as shown in Fig. 8.2. You will notice that there is a closed loop involved. Therefore, this situation is called a 'circular wait' condition. We should not get confused by the shape of the graph. For instance, the same DRAG can be drawn as shown in Fig. 8.3. If you start from any node and follow all the arrows, you must return to the original node. This is what makes it a circular wait or a deadlock situation. The shape is immaterial.

This principle is used by the Operating System to detect deadlocks. However, what we have presented is a simplistic picture. In practice, the DRAGs can get very complicated, and therefore, the detection of a deadlock is never so simple! At any moment, when the Operating System realizes that the existing processes are not finishing for an unduly long time, it can find out whether there is a deadlock situation or not. All resource allocations are made by the Operating System itself. When any process waits for a resource, it is again the Operating System which keeps track of this waiting. Therefore, the Operating System knows which processes are holding which resources and which resources those processes are waiting for. In order to detect a deadlock, the Operating System can give some imaginary coordinates to the nodes, R and P. Depending upon the relationships between resources and processes (i.e. the directions of the arrows), it can keep traversing, each time checking whether it has returned to a node it has already visited, to detect the incidence of a deadlock. What does the Operating System do if it finds a deadlock? The only way out is to kill one of the processes so that the cycle is broken. Many large mainframe computers use this strategy. Some systems do not go through the overhead of constructing a DRAG. They simply monitor the performance of all the processes. If none finishes for a very long time, the Operating System kills one of them. This is a crude but quicker way to get around the problem.

What causes a deadlock? Coffman, Elphick and Shoshani showed in 1971 that there are four conditions, all of which must be satisfied for a deadlock to take place. These conditions are given below:

Resources must be allocated to processes at any time in an exclusive manner and not on a shared basis for a deadlock to be possible. For instance, a disk drive can be shared by two processes simultaneously. This will not cause a deadlock. But printers, tape drives, plotters etc. have to be allocated to a process in an exclusive manner until the process completely finishes its work with it (which normally happens when the process ends). This is the cause of trouble.

Even if a process holds certain resources at any moment, it should be possible for it to request for new ones. It should not have to give up the already held resources to be able to request for new ones. If this is not true, a deadlock can never take place.

If a process holds certain resources, no other process should be able to take them away from it forcibly. Only the process holding them should be able to release them explicitly.

Processes (P1, P2, ...) and Resources (R1, R2, ...) should form a circular list as expressed in the form of a graph (DRAG). In short, there must be a circular (logically, and not in terms of the shape) chain of multiple resources and multiple processes forming a closed loop as discussed earlier. It is necessary to understand that all these four conditions have to be satisfied simultaneously for the existence of a deadlock. If any one of them does not exist, a deadlock can be avoided.

Various strategies have been followed by different Operating Systems to deal with the problem of a deadlock. These are listed below: (i) Ignore it, (ii) Detect it, (iii) Recover from it, (iv) Prevent it, and (v) Avoid it. We will now discuss these strategies one by one. These are also the areas in which research is going on, because none of the approaches available today is really completely satisfactory.

There are many approaches one can take to deal with deadlocks. One of them, and of course the simplest, is to ignore them: pretend as if you are totally unaware of them. (This is the reason why it is called, interestingly, the 'Ostrich algorithm'.) People who like exactitude and predictability do not like this approach, but there are very valid reasons to ignore a deadlock. Firstly, the deadlock detection, recovery and prevention algorithms are complex to write, test and debug. Secondly, they slow down the system considerably. As against that, if a deadlock occurs only rarely, you may have to restart the affected jobs, but the time lost this way is infrequent and may not be significant. UNIX follows this approach on the assumption that most users would prefer an occasional deadlock to a very restrictive, inconvenient, complex and slow system.

We have discussed one of the techniques for the detection of a deadlock in Sec. 8.2. The graphs (DRAGs) provide good help in doing this, as we have seen. However, a realistic DRAG is normally not as straightforward as a DRAG between two processes (P1, P2) and two resources (R1 and R2) as depicted in Fig. 8.2. In reality, there could be a number of resource types such as printers, plotters, tapes and so on. For instance, the system could have two identical printers, and the Operating System must be told about it at the time of system generation. It could well be that a specific process could do with either of the printers when requested. The complexity arises because the Operating System allocates a specific resource to a process, depending upon availability, whereas the process normally makes its request to the Operating System for only a resource type (i.e. any resource belonging to that type). A very large number of processes can make the DRAG look more complex and the deadlock detection more time-consuming.

We will denote multiple instances of the same resource type by means of multiple symbols within the square. For example, consider the DRAG shown in Fig. 8.4. R1 is a resource type – say, a tape drive of a certain kind – and let us assume that there are two tape drives, R10 and R11, of the same kind known to the system. R2 may be a printer of a certain type, and there may be only one of that type available in the system – say, R20. The DRAG shows the possibility of an apparent circular wait, but it is actually not so. Therefore, it is NOT a deadlock situation. In the figure, R10 is allocated to P1. P1 is waiting for R20. R20 is allocated to P2. Now comes the question of the last leg in the diagram. Let us assume that R11 is free and P2 wants it. In this case, P2 can actually grab R11. And if it does so, an arrow will actually be drawn from R11 to P2 as shown in Fig. 8.5. If you traverse from a node, following the arrows, you will not arrive at the starting node. This violates the rule for a circular wait. Therefore, P2 in this case need not wait for R11. It can go to completion. The point is that the visual illusion of a cycle should not deceive us. It is not a circular wait condition. If R11, however, is also not free and has already been grabbed by, say, P1, it can lead to a deadlock if P2 requests R11.

We will now follow a method to detect a deadlock where there are multiple instances of a resource type, using the DRAG. However, what is discussed here is only to clarify the concepts. By no means is it the only way to detect deadlocks. In fact, today there exist far more efficient and better algorithms for this purpose. The Operating System has to treat each resource separately, regardless of its type. The type is important only while the Operating System is allocating resources to the processes, because normally any free resource of a given type can be allocated. For instance, if a process demands R1, the Operating System could allocate R10 or R11 depending upon the availability. The Operating System, in this case, could do the following to detect a deadlock:

(i) Number all processes as P0, P1, ... PN.

(ii) Number each resource separately, using a meaningful coding scheme. For instance, the first character could always be "R", denoting a resource. The second character could denote the resource type (0 = tape, 1 = printer, etc.) and the third character could denote the resource number, or an instance within the type. For example, R00, R01, R02, ... could be different tape drives of the same type; R10, R11, R12, ... could be different printers of the same type, with the assumption that resources belonging to the same type are interchangeable. The Operating System could pick up any of the available resources within a given type and allocate it without any difference. If this is not true for certain resources, the Operating System should treat them as different resource types, so that the principle of interchangeability of resources within the same resource type holds true.

(iii) Maintain two tables as shown in Figs. 8.6 and 8.7. One is a resourcewise table giving, for each resource, its type, allocation status, the process to which it is allocated and the processes that are waiting for it. In fact, we know from Device Management that the Operating System maintains the information about the process currently holding a device in the 'Device Control Block (DCB)' maintained for each device. We also know that for each process waiting for the device, there is a data structure called an "Input Output Request Block" or IORB, which is linked to the DCB. Revisiting Fig. 5.22 will clarify that the Operating System already maintains this information in some form. The other is a processwise table giving, for each process, the resources held by it and the resources it is waiting for. This is normally held along with the PCB. Logically, it is a part of the PCB, but an Operating System could choose to maintain it in a separate table linked to the PCB for that process. Therefore, if the DCB and PCB data structures are properly designed, all the information needed by the Operating System to allocate/deallocate resources to various processes will already be available. The Operating System could use this information to detect any deadlock, as we shall see later.

(iv) Whenever a process requests the Operating System for a resource, the request is obviously for a resource belonging to a resource type. The user would not really care which one exactly is allocated (if he did, a new resource type would have to be created). The Operating System then goes through the resourcewise table to see if there is any free resource of that type, and if there is, allocates it to the process. After this, it updates both these tables appropriately.

If no free resource of that type is available, the Operating System keeps the process waiting on one of the resources of that type. (For instance, it could add the process to the waiting queue of the resource where the wait list is the shortest.) This also necessitates updating of both tables. When a process releases a resource, again both the tables are updated accordingly.

(v) At any time, the Operating System can use these tables to detect a circular wait or a deadlock. Typically, whenever a resource is demanded by a process, before actually allocating it, the Operating System could use this algorithm to see whether the allocation can potentially lead to a deadlock or not. It should be noted that this is by no means the most efficient algorithm for deadlock detection. Modern research has come out with a number of ingenious ideas, which are being discussed and debated. Some of these have been implemented too! What we present here is a simplified, accurate (though a little inefficient) method to clarify the concepts. The algorithm simulates the traversal along the DRAG to detect whether the same node is reached, i.e. a circular wait. It works as follows:

(a) Go through the resourcewise table entries one by one, each time storing the values processed. This is useful in detecting a circular wait, i.e. in finding out whether we have reached the same node or not.

(b) Ignore entries for free resources (such as the entry for R00 in Fig. 8.6).

(c) For all other entries, access the process to which the resource is allocated (e.g. resource R01 is allocated to process P1 in Fig. 8.6). In this case, store the numbers R01 and P1 in separate lists called the resource list and the process list respectively.

(d) Access the entry in the processwise table (Fig. 8.7) for that process (P1 in this case).

(e) Access, one by one, the resources this process (P1) is waiting for. For example, P1 is waiting for resource R20. Check if this is the same as one already encountered, i.e. if R20 is the same as R01 stored in step (c). In short, check if a circular wait has already been formed. If yes, the deadlock is detected. If no, store this resource (e.g. R20) in the resource list. This list will now contain R01 and R20. The process list still contains only P1. Check from Fig. 8.7 whether there is any other resource, apart from R20, that process P1 is waiting for. If there is, this procedure will have to be repeated. In this example, there is no such resource. Therefore, the Operating System goes to the next step (f).

(f) Go to the entry in the resourcewise table (Fig. 8.6) for the next resource in the resource list after R01. This is resource R20 in this case. We find that R20 is allocated to P5.

(g) Check if this process (i.e. P5) is one already encountered in the process list (e.g. if P5 is the same as P1). If it is the same, a deadlock is confirmed. In this case, P5 is not the same as P1. So simply store P5 after P1 in the process list and proceed. The process list now contains P1 and P5. The resource list is still R01, R20 as in step (e). After this, the Operating System will have to choose R10 and R23, as they are the resources process P5 is waiting for. It finds that R10 is allocated to P1. And P1 already exists in the process list. Hence, a deadlock (P1 → R20 → P5 → R10 → P1) has been detected. Therefore, the Operating System has to maintain two lists – one list of resources already encountered and a separate list of all the waiting processes already encountered. Any time the Operating System hits either a resource or a process which already exists in the appropriate list while going through the algorithm, the deadlock is confirmed.

(h) If a deadlock is not confirmed, continue this procedure for all the permutations and combinations, e.g. for all the resources that a process is waiting for and then, for each of those resources, the processes to which they are allocated. This procedure has to be repeated until both the lists are exhausted, one by one. If all the paths lead to resources which are free and allocable, there is no deadlock. If, while doing this, the Operating System comes back to a process or resource already encountered, it is a deadlock situation. Having finished one row, go to the next one in Fig. 8.6 and repeat this procedure for all the rows where the status is NOT = free.

Let us verify this algorithm for a deadlock. For instance, the two tables corresponding to Fig. 8.2 are shown in Fig. 8.8 and Fig. 8.9.

If you follow the algorithm described above, you will traverse from R1 to P1 (resourcewise table), from P1 to R2 (processwise table), from R2 to P2 (resourcewise table) and from P2 back to R1 (processwise table), thereby returning to R1 again, completing a circular wait and revealing a deadlock. The sketch below walks through the same example.
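The code below is a simplified sketch of the two-table walk just verified, applied to the example of Figs. 8.8 and 8.9. The array layout (allocated_to, waiting_for) and the assumption that each process here waits for at most one resource are illustrative simplifications; the full algorithm in steps (a)-(h) would follow every waiting edge of every process.

#include <stdbool.h>
#include <stdio.h>

#define NRES  2
#define NPROC 2
#define FREE  -1

/* Resourcewise table: which process each resource is allocated to.
   R1 -> P1, R2 -> P2 (0-indexed here).                                  */
int allocated_to[NRES] = { 0, 1 };

/* Processwise table: which resource each process is waiting for
   (FREE means it is not waiting).  P1 waits for R2, P2 waits for R1.    */
int waiting_for[NPROC] = { 1, 0 };

/* Starting from a resource, alternately follow the "allocated to" link
   (resourcewise table) and the "waiting for" link (processwise table);
   if we reach a resource already visited, a circular wait exists.       */
bool deadlock_from(int r)
{
    bool seen_res[NRES] = { false };
    while (r != FREE && allocated_to[r] != FREE) {
        if (seen_res[r]) return true;      /* same node reached again     */
        seen_res[r] = true;
        int p = allocated_to[r];           /* resourcewise table step     */
        r = waiting_for[p];                /* processwise table step      */
    }
    return false;                          /* path ended at a free node   */
}

int main(void)
{
    for (int r = 0; r < NRES; r++)
        if (deadlock_from(r)) { printf("deadlock detected\n"); return 0; }
    printf("no deadlock\n");
    return 0;
}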

Deadlock recovery is complicated by the fact that some processes definitely lose something in the bargain. Basically, there are two approaches to solving this problem: suspending a process or killing it. We will consider these one by one.

A process is selected based on a variety of criteria (low priority, for instance) and is suspended for a long time. The resources are reclaimed from that process and then allocated to other processes that are waiting for them. When one of the waiting processes gets over, the original suspended process is resumed. This scheme looks attractive on the face of it, but there are several problems in its implementation: (i) Not all Operating Systems support the suspend/resume operations, due to the overheads involved in maintaining so many more PCB chains for the added process states and also due to the added system calls. (ii) This strategy cannot be used in any on-line or real-time system, because the response time of some processes then becomes unpredictable, and clearly this is unacceptable. (iii) Suspend/resume operations are not easy to manage physically or programmatically for this purpose. Imagine that a tape has been read half way through and then the process holding the tape drive is suspended. The operator will have to dismount that tape and mount the new tape for the new process to which the tape drive is now to be allocated. When the old process is resumed, the tape for the original process will have to be mounted again and, more importantly, it will have to be positioned exactly. The problem with the printer is worse and can easily be imagined. Therefore, this solution is not normally implemented.

In the other approach, the Operating System decides to kill a process and reclaim all its resources after ensuring that such action will resolve the deadlock. (The Operating System can use the DRAG and deadlock detection algorithms to ensure that after killing a specific process, there will not be a deadlock.) This solution is simple, but involves the loss of at least one process. Choosing the process to be killed, again, depends on the scheduling policy and the process priority. It is safest to kill a low-priority process which has just begun, so that the loss is not very heavy. However, the matter becomes more complex when one thinks of database recovery (the process which is killed may have already updated some databases on-line) or Inter-Process Communication. As yet, there is no easy solution to this problem, and it is a subject of research today.

This strategy aims at creating circumstances in which deadlocks are prevented from occurring at all. A study of Coffman's four conditions, discussed in Sec. 8.3, shows that if any one of these conditions is not met, there cannot be a deadlock. This strategy was first suggested by Havender. We will now discuss the ways to achieve this and the problems encountered while trying to do so.

If every resource in the system were sharable by multiple processes, deadlocks would never occur. However, such sharing is not practicable. For instance, a tape drive, a plotter or a printer cannot be shared amongst several processes. At best, what one can do is to use the spooling technique for the printer, where all the printing requests are handled by a separate program, therefore eliminating the very need for sharing. When the spooler is holding the printer, no other process is even allowed to request a printer, leave alone get it. All that a process is allowed to do is to add its data to the spooler, to be printed subsequently. Unfortunately, the same technique cannot be used for all other devices. Moreover, deadlocks are concerned with a number of resources such as various Operating System tables, disk areas and records, in addition to the external I/O devices. Besides, all resources do not lend themselves to an easy application of the technique of spooling. Therefore, it is very difficult to guarantee that we can avoid this condition.

By prohibiting a process from waiting for more resources while already holding certain resources, we can prevent a deadlock. This can be achieved by demanding that, at the very beginning, a process must declare all the resources that it expects to use. The Operating System should find out at the outset whether all of these are available, and only if they are available, allow the process to commence. In such a case, the Operating System obviously must update its list of free, available resources immediately after this allocation. This is an attractive solution, but it is obviously inefficient and wasteful. If a process does calculations for 8 hours updating some files and, at the end, uses the tape drive for updating the control totals record for only one minute, the tape drive has to be allocated to that process for the entire duration, and it will, therefore, be idle for 8 hours. Despite this, no other process can use it during this period.

Another variation of this approach is possible. The Operating System can make a process requesting some resources give up its already held resources first and then try for the requested resources. Only if the attempt is successful are the relinquished resources reallocated to the process, so that it can run. However, if the attempt fails, the relinquished resources are taken back, and the process waits until those resources are available. Because the already held resources are relinquished every time such a request is made, a deadlock can never take place. Again, there are problems involved in this scheme. After giving up the existing resources, some other process might grab one or more of them for a long time. In general, it is easy to imagine that this strategy can lead to long delays, indefinite postponement and unpredictability. Also, this technique can be used for shared resources such as tables, semaphores and so on, but not for printers and tape drives. Imagine a printer given up by a process half way through a report and grabbed by some other process!

Guaranteeing a situation in which the "no preemption" condition is never met is very difficult. If we allow the resources allocated to a process to be taken away from it forcibly, it may solve the problem of a deadlock, but it will give rise to worse problems. Taking away a tape drive forcibly from an incomplete process which has processed only part of the records on the tape, because some other process requires it, is clearly an unacceptable situation due to the problems of mounting/dismounting, positioning and so on. With printers, the situation is worse.

It is obvious that attacking the first three conditions is very difficult. Only the last one remains. If the circular wait condition is prevented, the problem of deadlock can be prevented too. One way in which this can be achieved is to force a process to hold only one resource at a time. If it requires another resource, it must first give up the one that it holds and then request another. This obviously has the same flaws as discussed above while preventing condition (iii). If a process P1 holds R1 and wants R2, it must give up R1 first, because another process P2 should be able to get it (R1). We are again faced with the problem of assigning a tape drive to P2 after P1 has processed only half the records. This, therefore, is also an unacceptable solution.

There is a better solution to the problem, in which all resources are numbered as shown in Fig. 8.10. A simple rule can now tackle the circular wait condition: any process has to request all the required resources in numerically ascending order during its execution, assuming again that grabbing all the required resources at the beginning is not an acceptable solution. For instance, if a process P1 requires a printer and a plotter at some time during its execution, it has to request the printer first and only then the plotter, because 1 < 2. This prevents a deadlock. Let us see how. Let us assume that two processes P1 and P2 each want a tape drive and a plotter. A deadlock can take place only if P1 holds the tape drive and wants the plotter, whereas P2 holds the plotter and requests the tape drive, i.e. if the order in which the resources are requested by the two processes is exactly opposite. And this contradicts our assumption. Because 0 < 2, a tape drive has to be requested before a plotter by each process, whether it is P1 or P2. Therefore, it is impossible to get a situation that will lead to a deadlock. What holds true for two processes is also true for multiple processes. However, there are some minor and major problems with this scheme also.

Imagine that there are two tape drives, T1 and T2, and two processes, P1 and P2, in the system. If P1 holds T1 and requests T2, whereas P2 holds T2 and requests T1, a deadlock can occur. What numbering scheme should then be followed, as both are tape drives? If both tape drives are given the same number (e.g. 0) and a request is allowed for a resource with a number equal to or greater than that of the previous request, a deadlock can still occur, as shown above. This minor problem, however, can be solved by following a certain coding scheme in numbering the resources. The first digit denotes the resource type and the second digit denotes the resource number within the resource type. Therefore, the numbers 00, 01, 02, ... would be for different tape drives and 10, 11, etc. would be for different printers. The process requests a resource type only (such as a tape drive). The Operating System internally translates it into a request for a specific resource such as 00 or 01. Applying this scheme to the situation above, we realize that our basic assumption would be violated if the situation were allowed to exist. For instance, our situation is that P1 holds 00 and requests 01. This is acceptable because 00 < 01. But in this situation, P2 holds 01 and requests 00. This is not permitted, because 00 < 01 and requests must be made in ascending order.

The major problem is that not only the external I/O media but all the resources, including all the process tables and disk areas such as spooler files, will have to be numbered. That would be a cumbersome process. It is almost impossible to make all the processes request resources in a globally predetermined order, because the processes may not actually require them in that order. The waiting periods and the consequent wastage could be enormous. Therefore, we conclude that there is as yet no universally acceptable, satisfactory method for the prevention of deadlocks, and it is still a matter of deep research.
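Before moving on to deadlock avoidance, here is a minimal sketch of the ascending-order rule of Fig. 8.10 as refined above (first digit = resource type, second digit = instance). The per-process bookkeeping, the function names request_allowed()/grant() and the numbers used in main() are all illustrative assumptions, not a description of any real Operating System's interface.

#include <stdbool.h>
#include <stdio.h>

#define NPROC 10

/* Highest resource TYPE already held by each process (-1 = none). */
int highest_type_held[NPROC];

/* A request is legal only if the requested type is not lower than the
   highest type the process already holds (requests in ascending order). */
bool request_allowed(int proc, int resource_number)
{
    int type = resource_number / 10;           /* first digit = type      */
    return type >= highest_type_held[proc];
}

void grant(int proc, int resource_number)
{
    int type = resource_number / 10;
    if (type > highest_type_held[proc])
        highest_type_held[proc] = type;
}

int main(void)
{
    for (int p = 0; p < NPROC; p++) highest_type_held[p] = -1;

    /* P1 asks for tape drive 00, then printer 10: both allowed (0 <= 1). */
    printf("%d\n", request_allowed(1, 0));   grant(1, 0);
    printf("%d\n", request_allowed(1, 10));  grant(1, 10);

    /* P2 holds printer 11 and now asks for tape drive 01: refused,
       because type 0 < type 1, so granting it could close a circular wait. */
    grant(2, 11);
    printf("%d\n", request_allowed(2, 1));
    return 0;
}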

Deadlock prevention was concerned with imposing certain restrictions on the environment or the processes so that deadlocks can never occur. But we saw in the last section the difficulties involved in deadlock prevention. Therefore, a compromise is sought by the Operating System. The Operating System aims at avoiding a deadlock rather than preventing one. What is the exact difference between the two? The difference is quite simple. Deadlock avoidance starts with an environment where a deadlock is theoretically possible (it is not prevented), but some algorithm in the Operating System ensures, before allocating any resource, that after allocating it, a deadlock can still be avoided. If that cannot be guaranteed, the Operating System does not grant the request of the process for that resource in the first place.

Dijkstra was the first person to propose an algorithm for deadlock avoidance, in 1965. This is known as the 'Banker's algorithm', due to its similarity to the problem of a banker wanting to disburse loans to various customers within limited resources. Using this algorithm, the Operating System can know in advance, before a resource is allocated to a process, whether the allocation can lead to a deadlock (an 'unsafe state') or whether a deadlock can certainly be avoided (a 'safe state').

The Banker's algorithm maintains two matrices on a dynamic basis. Matrix A consists of the resources allocated to different processes at a given time. Matrix B maintains the resources still needed by the different processes at the same time. These resources could be needed one after the other or simultaneously; the Operating System has no way of knowing this. Both these matrices are shown in Fig. 8.11. Matrix A shows that process P0 is holding 2 tape drives at a given time. At the same moment, process P1 is holding 1 printer, and so on. If we add these figures vertically, we get a vector of Held Resources (H) = 432.

This is shown as the second row among the vectors. It says that at the given moment, the total resources held by the various processes are: 4 tape drives, 3 printers and 2 plotters. This should not be confused with the decimal number 432; that is why it is called a vector. By the same logic, the figure shows that the vector of Total Resources (T) is 543. This means that in the whole system, there are physically 5 tape drives, 4 printers and 3 plotters. These resources are made known to the Operating System at the time of system generation. By subtracting (H) from (T) columnwise, we get a vector (F) of free resources, which is 111. This means that the resources available to the Operating System for further allocation are: 1 tape drive, 1 printer and 1 plotter at that juncture. Matrix B gives, processwise, the additional resources that are expected to be required in due course during the execution of these processes. For instance, process P2 will require 2 tape drives, 1 printer and 1 plotter, in addition to the resources already held by it. This means that process P2 requires in all 1 + 2 = 3 tape drives, 2 + 1 = 3 printers and 1 + 1 = 2 plotters. If the vector of all the resources required by all the processes (the vector addition of Matrix A and Matrix B) is less than the vector T for each of the resources, there will be no contention and, therefore, no deadlock. However, if that is not so, a deadlock has to be avoided. Having maintained these two matrices, the algorithm for deadlock avoidance works as follows:

(i) Each process declares the total required resources to the Operating System at the beginning. The Operating System puts these figures in Matrix B (resources required for completion) against each process. For a newly created process, the row in Matrix A is all zeros to begin with, because no resources have yet been assigned to it. For instance, at the beginning of process P2, the figures in the row for P2 in Matrix A will all be 0s, and those in Matrix B will be 3, 3 and 2 respectively.

(ii) When a process requests the Operating System for a resource, the Operating System finds out whether the resource is free and whether it can be allocated, by using the vector F. If it can be allocated, the Operating System does so, and updates Matrix A by adding 1 to the appropriate slot. It simultaneously subtracts 1 from the corresponding slot of Matrix B. For instance, starting from the beginning, if the Operating System allocates a tape drive to P2, the row for P2 in Matrix A will become 1, 0 and 0.

The row for P2 in Matrix B will correspondingly become 2, 3 and 2. At any time, the total vector of these two rows, i.e. the addition of the corresponding numbers in the two rows, is always constant and is equivalent to the total resources needed by P2, which in this case is 3, 3 and 2.

(iii) However, before making the actual allocation, whenever a process makes a request to the Operating System for any resource, the Operating System goes through the Banker's algorithm to ensure that after the imaginary allocation, there need not be a deadlock, i.e. after the allocation, the system will still be in a 'safe state'. The Operating System actually allocates the resource only after ensuring this. If it finds that there can be a deadlock after the imaginary allocation at some point in time, it postpones the decision to allocate that resource. It calls the state of the system that would result after this possible allocation an 'unsafe state'. Remember: the unsafe state is not actually a deadlock. It is a situation of a potential deadlock, arrived at by arithmetic comparison.

The point is: how does the Operating System conclude whether a state is safe or unsafe? It uses an interesting method. It looks at vector F and each row of Matrix B. It compares them on a vector-to-vector basis, i.e. within the vectors, it compares each digit separately to conclude whether all the resources that a process is going to need to complete are available at that juncture or not. For instance, the figure shows F = 111. It means that at that juncture, the system has 1 tape drive, 1 printer and 1 plotter free and allocable. (The first row in Matrix B, for P0, is 100.) This means that if the Operating System decides to allocate all the needed resources to P0, P0 can go to completion, because 111 > 100 on a vector basis. Similarly, the row for P1 in Matrix B is 110. Therefore, if the Operating System decides to allocate resources to P1 instead of to P0, P1 can complete. The row for P2 is 211. Therefore, P2 cannot complete unless one more tape drive becomes available. This is because 211 is greater than 111 on a vector basis. The vector comparison should not be confused with an arithmetic comparison. For instance, if F were 411 and a row in Matrix B were 322, it might appear that 411 > 322 and, therefore, the process can go to completion. But that is not true. As 4 > 3, the tape drives would be allocable. But as 1 < 2, both the printers and the plotters would fall short.

The Operating System now does the following to ensure the safe state:

(a) After the process requests a resource, the Operating System allocates it on a 'trial' basis.

(b) After this trial allocation, it updates all the matrices and vectors, i.e. it arrives at the new values of F and Matrix B as if the allocation had actually been done. Obviously, this updating has to be done by the Operating System in a separate work area in the memory.

(c) It then compares the F vector with each row of Matrix B on a vector-to-vector basis.

(d) If F is smaller than every row in Matrix B on a vector basis, i.e. even if all of F were made available to any one of the processes in Matrix B, none would be guaranteed to complete, the Operating System concludes that it is an 'unsafe state'. Again, this does not mean that a deadlock has resulted; it means that one can take place.

(e) If F is greater than (or equal to) some row for a process in Matrix B, the Operating System proceeds as follows: It allocates all the needed resources to that process on a trial basis. It assumes that after this trial allocation, that process will eventually get completed and, in fact, release all its resources on completion. These resources are then added to the free pool (F). It now calculates all the matrices and F after this trial allocation and the imaginary completion of this process, and removes the row for the completed process from both the matrices. It then repeats the procedure from step (c) above. If, in the process, all the rows in the matrices get eliminated, i.e. all the processes can go to completion, it concludes that it is a 'safe state'. If that does not happen, it concludes that it is an 'unsafe state'.

(f) For each request for any resource by a process, the Operating System goes through all these trial or imaginary allocations and updates, and if it finds that after the trial allocation, the state of the system would be 'safe', it actually goes ahead and makes the allocation, after which it updates the various matrices and tables in a real sense. The Operating System may need to maintain two sets of matrices for this purpose. At any time, before any allocation, it could copy the first set of matrices (the real one) into the other, carry out all the trial allocations and updates on the copy, and if a safe state results, update the former set with the allocation.

Two examples are presented here to understand this algorithm clearly.

Example 1: Suppose process P1 requests 1 tape drive when the resources allocated to the various processes are as given in Fig. 8.11. The Operating System has to decide whether to grant this request or not. The Banker's algorithm proceeds to determine this as follows:

(i) If a tape drive is allocated to P1, F will become 011 and the resources still required by P1 in Matrix B will become 010. After this, the free resources are such that only process P1 can complete, because each digit in F, i.e. 011, is equal to or more than the corresponding digit in the row of required resources for P1 in Matrix B, i.e. 010. Therefore, hypothetically, if no other process demands anything in between, the free resources can satisfy P1's demands and lead it to completion.

(ii) If P1 is given all the resources it needs to complete, the row of assigned resources for P1 in Matrix A will become 120, and after this allocation, F will become 001.

(iii) At the end of the execution of P1, all the resources used by P1 will become free and F will become 120 + 001 = 121. We can now erase the row for P1 from both the matrices. This is how the matrices will look if P1 is granted its first request for a tape drive and is then allowed to go to completion.

(iv) We repeat the same steps with the other rows. For instance, now F = 121. Therefore, the Operating System has sufficient resources to complete either P0 or P3, but not P2. This is because P2 requires 2 tape drives to complete, but the Operating System at this imaginary juncture has only 1. Let us say the Operating System decides to allocate the resources to P0 (it does not matter which one is chosen). Assuming that all the required resources are allocated to P0 one by one, the row of assigned resources for P0 in Matrix A will become 300 and that in Matrix B will obviously become 000. F at this juncture will have become 121 – 100 = 021. If P0 is now allowed to go to completion, all the resources held by P0 will be returned to F. We can now erase the row for P0 from both the matrices. F would then become 300 + 021 = 321.

(v) Now either P2 or P3 can be chosen for this 'trial allocation'. Let us assume that P3 is chosen. Going by the same logic and steps, we know that the resources required by P3 are 111. Therefore, after the trial allocation, F will become 321 – 111 = 210, and the resources assigned to P3 in Matrix A will become 212. When P3 completes and returns the resources to F, F will become 212 + 210 = 422.

(vi) At the end, P2 is allocated its resources and completed. At this juncture, the resources allocated to P2 will be 332, and F will be 422 – 211 = 211.
In the end, all the resources will be returned to the free pool. At this juncture, F will become 332 + 211 = 543. This is the same as the total resources vector T that

are known to the system. This is as expected because after these imaginary allocations and process completions, F should become equal to the total resources known to the system. l The Operating System does all these virtual or imaginary calculations before granting the first request of process P1 for a tape drive. All it ensures is that if this request is granted, it is possible to let some processes complete, adding to the pool of free resources and by repeating the same logic, it is possible to ultimately complete all the processes. Therefore, this request can be granted because after the allocation, the state is still a ‘safe’ state. It should be noted that after this allocation, it is not impossible to have a deadlock if subsequent allocations are not done properly. However, all it ensures is that it is possible to avert the deadlock. The Operating System now actually allocates the tape drive. After the actual allocation, the Operating System updates both the matrices and all the vectors. An interesting point is: After this, the processes need not actually complete in the same sequence as discussed earlier. Example 2 Let us go back to Fig. 8.11 which depicts the state of the system at some stage. Imagine that process P2 instead of P1 requests 1 tape drive. Let us now apply Banker’s algorithm to this situation. If it is granted, F will become 011, and this is not sufficient to complete any process. This is because the vector F = 011 is less than every row in Matrix B after the allocation, since there is no tape drive free. Therefore, there can be a deadlock. There is no certainty because even if 1 tape drive is allocated to P2, P2 can relinquish it during the execution before its completion (may be in a short while), and then the processes can complete as in Example-1. The Operating System still does not grant this request because it is an unsafe state which may not be able to avert a deadlock. Therefore, the Operating System waits for sometime until some other process releases some resources during or at the end of the execution. It then ensures that it is a safe state by the same logic as discussed above, and then only grants the request. The algorithm is very attractive at first sight, but it is not easy for every process to declare in advance all the resources it is going to require, especially if the resources include such intangibles as shared data, variables, files, tables etc. Therefore, deadlock avoidance is also a matter of research today. A system consists of many resources, which are shared or distributed among several competing processes. Memory, CPU, disk space, printers and tapes are the example of resources. When a system has two CPUs then we can say that there are two instances of CPUs. Similarly, in a network, we may have ten printers and we can say that there ten instances of printers. In such situations, we are not bothered about which instance of the requested resource is processing the request. When a process is executing, it requests for a resource before using it and it must release the resource after using it. Any process can request as many requests to carry out the assigned task. It cannot make more requests than the maximum number available in the system. A process uses the request in the following sequence. Process requests for necessary resource(s). If the resources are not free then the process has to wait until the resources are free so that it can acquire control on the resources. Process can operate/use the acquired resource to carry out assigned task. 
(iii) The process releases the resource when the operation on it is complete.
In the request and release steps, a process makes system calls such as disk read/write, printing, memory allocation, etc. Therefore, it is necessary to make sure that there is no conflict, i.e. a situation where two processes acquire the same resource at the same time.
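The safety check at the heart of Banker's algorithm can be sketched as follows. This is only a minimal illustration, not code from the book: it assumes a matrix alloc of resources currently held by each process, a matrix need of resources still required (the Matrix B of the examples) and a vector free_res (the vector F), with purely illustrative sizes.

#include <stdbool.h>

#define NPROC 3      /* number of processes, illustrative      */
#define NRES  3      /* number of resource types, illustrative */

/* Returns true if the state described by alloc (resources held),
 * need (resources still required) and free_res (the vector F) is
 * 'safe', i.e. some order exists in which every process can finish. */
bool is_safe(int alloc[NPROC][NRES], int need[NPROC][NRES], int free_res[NRES])
{
    int work[NRES];
    bool done[NPROC] = { false };

    for (int r = 0; r < NRES; r++)
        work[r] = free_res[r];

    for (int finished = 0; finished < NPROC; ) {
        bool progressed = false;
        for (int p = 0; p < NPROC; p++) {
            if (done[p])
                continue;
            bool can_finish = true;
            for (int r = 0; r < NRES; r++)
                if (need[p][r] > work[r]) { can_finish = false; break; }
            if (can_finish) {
                /* Imagine p completing and returning everything it holds. */
                for (int r = 0; r < NRES; r++)
                    work[r] += alloc[p][r];
                done[p] = true;
                finished++;
                progressed = true;
            }
        }
        if (!progressed)
            return false;   /* no process can finish: unsafe state */
    }
    return true;            /* all processes could finish          */
}

To decide on a request, the Operating System would tentatively move the requested amounts from free_res to the requester's row of alloc, reduce its row of need accordingly, and grant the request only if is_safe() still returns true for the resulting imaginary state.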


We will now discuss the last portion of the Operating System viz. the functions of Memory Management (MM). As mentioned earlier, the topic discussed in this chapter assumes special importance when a number of users share the same memory. In general, this module performs the following functions: (a) To keep track of all memory locations-free or allocated and if allocated, to which process and how much. (b) To decide the memory allocation policy i.e. which process should get how much memory, when and where. (c) To use various techniques and algorithms to allocate and deallocate memory locations. Normally, this is achieved with the help of some special hardware. There is a variety of memory management systems. Fig. 9.1 lists them. These systems can be divided into two major parts: 'Contiguous' and 'Non-contiguous'. Contiguous Memory Management schemes expect the program to be loaded in contiguous memory locations. Non-contiguous systems remove this restriction. They allow the program to be divided into different chunks and loaded at different portions of the memory. It is then the function of the Operating System to manage these different chunks in such a way that they appear to be contiguous to the Application Programmer/User. In 'paging', these chunks are of the same size, whereas in 'segmentation', they can be of different sizes. Again, Memory Management can be of 'Real Memory'

whereby the full process image is expected to be loaded in the memory before execution. Virtual Memory Management systems can, however, start executing a process even with only a part of the process image loaded in the memory. We will now discuss these schemes one by one. In each case, the following issues are involved:

Relocation and Address Translation refers to the problem that arises because at the time of compilation, the exact physical memory locations that a program is going to occupy at the run time are not known. Therefore, the compiler generates the executable machine code assuming that each program is going to be loaded from memory word 0. At the execution time, the program may need to be relocated to different locations, and all the addresses will need to be changed before execution. This will be illustrated later with an example and also different methods of Address Translations will be discussed.

Protection refers to the preventing of one program from interfering with other programs. This is true even when a single user process and the Operating System are both residing in the main memory. A common question is: “If the compiler has generated proper addresses and relocation is properly done, can one program interfere with others?” One of the answers is: “hardware malfunction.” Imagine an instruction "JMP EOJ" in an assembly language which can get translated to 0111000011111001 [JMP = 0111 and EOJ is assumed to be at the address (249) in decimal or (000011111001) in binary]. If due to hardware malfunction, two high order bits in the address change from 0 to 1, it will not be detected by the parity checking mechanism due to cancelling errors. The addresses in the memory will now be (110011111001) in binary which is (3321) in decimal. If the program actually jumps to this location, there might be a serious problem. Sure enough, such hardware malfunction does not happen very often, and also some of these cases can be detected by a few ingenious methods. However, our objective is to guarantee complete accuracy. That is why protection is important. In most cases, the protection is provided by a special hardware, as we shall see later. This is so important that until this protection was provided, many Operating Systems, especially on the microcomputers, could not provide the multiuser facility, despite having all the other algorithms ready.

Sharing is the opposite of protection. In this case, multiple processes have to refer to the same memory locations. This need may arise because the processes might be using the same piece of data, or all processes might want to run the same program. e.g. a word processor. Having 10 copies of the same program in the memory for 10 concurrent users seems obviously wasteful. Though achieving both protection and sharing of memory are apparently contradictory goals, we will study various schemes for accomplishing this task a little later. Each of these memory management methods can be judged in terms of efficiency by using the following norms: Wasted memory is the amount of physical memory which remains unused and therefore, wasted. Access time is the time to access the physical memory by the Operating System as compared to the memory access time for the bare hardware without the overheads of the Operating System, basically caused due to Address Translation. Time complexity is related to the overheads of the allocation/deallocation algorithm and the time taken by the specific method. Regardless of the method used, it has to work in close co-operation with the other two managers within the Operating System, viz., Information and Process Managers, as discussed earlier.

In the scheme of Single Contiguous Memory Management, the physical memory is divided into two contiguous areas. One of them is permanently allocated to the resident portion of the Operating System (monitor) as shown in Fig. 9.2. (CP/M and MS-DOS fall in this category.) The Operating System may be loaded at the lower addresses (0 to P as shown in Fig. 9.2) or it can be loaded at the higher addresses. This choice is normally based on where the vectored Interrupt Service Routines are located, because these addresses are determined at the time of hardware design in such computers. At any time, only one user process is in the memory. This process is run to completion and then the next process is brought into the memory. This scheme works as follows: All the ‘ready’ processes are held on the disk as executable images, whereas the Operating System holds their PCBs in the memory in the order of priority. At any time, one of them runs in the main memory. When this process is blocked, it is ‘swapped out’ from the main memory to the disk. The next highest priority process is ‘swapped in’ to the main memory from the disk and it starts running. Thus, there is only one process in the main memory even if conceptually it is a multiprogramming system. Now consider the way this scheme solves various problems as stated in Section 9.1.

In this scheme, the starting physical address of the program is known at the time of compilation. Therefore, the problem of relocation or Address Translation does not exist. The executable machine program contains absolute addresses only. They do not need to be changed or translated at the time of execution.

Protection can be achieved by two methods: 'Protection bits' and 'Fence register'. In 'Protection bits', a bit is associated with each memory block because a memory block could belong either to the Operating System or the application process. Since there could be only these two possibilities, only 1 bit is sufficient for each block. However, the size of a memory block must be known. A memory block can be as small as a word or it could be a very large unit consisting of a number of words. Imagine a scheme in which a computer has a word length of 32 bits and 1 bit is reserved for every word for protection. This bit could be 0 if the word belongs to the Operating System and it could be 1 if it belongs to the user process. At any moment, the machine is in the supervisor (or privileged) mode executing an instruction within the Operating System, or it is in the 'user' mode executing a user process. This is indicated by a mode bit in the hardware. If the mode changes, this hardware bit is also changed automatically. Thus, at any moment, when the user process refers to memory locations within the Operating System area, the hardware can prevent it from interfering with the Operating System because the protection bit associated with the referenced memory block (in our example, a word) is 0. However, normally the Operating System is allowed unrestricted access to all the memory locations, regardless of whether they belong to the Operating System or a user process. (i.e. when the mode is privileged and the Operating System makes any memory reference, this protection bit is not checked at all!) If a block is as small as a word of say 32 bits, protection bits constitute (1/32) × 100 ≈ 3.1% overhead on the memory. As the block size increases, this overhead percentage decreases, but then the allocation unit increases. This has its own demerits such as the memory wastage due to the internal fragmentation, as we shall study later. The use of a Fence register is another method of protection. This is like any other register in the CPU. It contains the address of the fence between the Operating System and the user process as depicted in Fig. 9.3, where the fence register value = P. Because it contains an address, it is as big as the MAR. For every memory reference, when the final resultant address (after taking into account the addressing modes such as indirect, indexed, PC-relative and so on) is in the MAR, it is compared with the fence register by the hardware itself, and the hardware can detect any protection violations. (For instance, in Fig. 9.3, if a user process with mode bit = 1 makes a reference to an address within the area for the Operating System which is less than or equal to P, the hardware itself will detect it.) Sharing of code and data in memory does not make much sense in this scheme and is usually not supported.
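A minimal sketch of the fence register check, modelled in software purely for illustration (the real check is done by the hardware for every memory reference), assuming a mode bit with 0 = supervisor and 1 = user and a fence register holding the address P of Fig. 9.3:

#include <stdbool.h>

#define SUPERVISOR 0
#define USER       1

/* Models the check made once the final resultant address is in MAR:
 * a user-mode reference at or below the fence (the Operating System
 * area in Fig. 9.3) is a protection violation.                       */
bool reference_allowed(unsigned int mar, unsigned int fence, int mode_bit)
{
    if (mode_bit == SUPERVISOR)
        return true;            /* the OS may access any location      */
    return mar > fence;         /* user code must stay above address P */
}

In the real machine, a false result here would raise a hardware trap rather than return a value to software.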

This method does not have a large wasted memory (it cannot be used even if it were large, anyway!). This scheme has very fast access times (no Address Translation is required) and very little time complexity. But its use is very limited due to the lack of multiuser facility.

Operating systems such as OS/360 running on IBM hardware used the Fixed Partitioned Memory Management method. In this scheme, the main memory was divided into various sections called 'partitions'. These partitions could be of different sizes, but once decided at the time of system generation, they could not be changed. This method could be used with swapping and relocation or without them. In this method, the partitions are fixed at the time of system generation. (System generation is a process of tailoring the Operating System to specific requirements. The Operating System consists of a number of routines for supporting a variety of hardware items and devices, all of which may not be necessary for every user. Each user can select the necessary routines depending upon the devices to be used. This selection is made at the time of system generation.) At this time, the system manager has to declare the partition sizes. To change the partitions, the operations have to be stopped and the Operating System has to be generated again (i.e. loaded and created) with different partition specifications. That is the reason why these partitions are also called 'static partitions'. On declaring static partitions, the Operating System creates a Partition Description Table (PDT) for future use. This table is shown in Fig. 9.4. Initially, all the entries are marked as "FREE". However, as and when a process is loaded into one of the partitions, the status entry for that partition is changed to "ALLOCATED". Fig. 9.4 shows the static partitions and their corresponding PDT at a given time. In this case, the PCB (Process Control Block) of each process contains the ID of the partition in which the process is running. This could be used as a "pointer to the physical memory locations" field in the PCB. For instance, in the PCB for process A, the ID of the PDT will be specified as 2. Using this, the Operating System can access the entry number 2 in the PDT. This is how, using the partition ID as an index into the PDT, information such as starting address, etc. could easily be obtained. The Operating System, however, could keep this information directly in the PCB itself to enhance the speed at the cost of redundancy. When the

process terminates, the system call "kill the process" will remove the PCB, but before removing it, it will request the MM to set the status of that partition to "FREE". When a partition is to be allocated to a process, the following takes place: (i) The long term process scheduler of the PM decides which process is to be brought into the memory next. (ii) It then finds out the size of the program to be loaded by consulting the IM portion of the Operating System. As seen earlier, the compiler keeps the size of the program in the header of the executable file. (iii) It then makes a request to the partition allocation routine of the MM to allocate a free partition with the appropriate size. This routine can use one of the several algorithms for such allocations, as described later. The PDT is very helpful in this procedure. (iv) With the help of the IM module, it now loads the binary program in the allocated partition. (Note that it could be loaded in an unpredictable partition, unlike the previous case, making Address Translation necessary at the run time.) (v) It then makes an entry of the partition ID in the PCB before the PCB is linked to the chain of ready processes by using the PM module of the Operating System. (vi) The routine in the MM now marks the status of that partition as “allocated”. (vii) The PM eventually schedules this process.

The Operating System maintains and uses the PDT as shown in Fig. 9.4. In this case, partition 0 is occupied by the Operating System and is thus, unallocable. The “FREE” partitions are only 1 and 4. Thus, if a new process has to be loaded, we have to choose from these two partitions. The strategies of partition allocation are the same as discussed in disk space allocation, viz., first fit, best fit and worst fit. For instance, if the size of a program to be executed is 50k, both the first fit and the worst fit strategies would give partition ID = 1 in the situation depicted by Fig. 9.4. This is because the size of the partition with partition ID = 1 is 200k which is > 50k and also it is the first free partition to accommodate this program. The best fit strategy for the same task would yield partition ID = 4. This is because the partition size of this partition is 100k, which is the smallest partition capable of holding this program. The best fit and the worst fit strategies would be relatively faster, if the PDT was sorted on partition size and if the number of partitions was very high. The processes waiting to be loaded in the memory (ready for execution, but for the fact that they are on the disk or swapped out) are held in a queue by the Operating System. There are two methods of maintaining this queue, viz., Multiple queues and Single queue. In multiple queues, there is one separate queue for each partition as shown in Fig. 9.5. In essence, the linked list of PCBs in "ready but not in memory" state is split into multiple lists-one for each partition, each corresponding to a different size of the partition. For instance, queue 0 will hold processes with size of 0–2k, queue 2 will be for processes with size between 2k and 5k (the exact size of 2k will be in this queue) and queue 1 will take care of processes with size between 5k and 8k, etc. (The exact size of 5k will be in this queue.) When a process wants to occupy memory, it is added to a proper queue depending upon the size of the process. If the scheduling method is round robin within each queue, the processes are added at the end of the proper queue and they move ahead in the strict FIFO manner within each queue. If the scheduling method is priority driven, the PCBs in each queue are chained in the sorted order of priority.
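The first fit, best fit and worst fit searches over the PDT described above can be sketched roughly as follows; the structure fields, constants and function names are purely illustrative and not taken from any particular system.

#define NPART 5

struct pdt_entry {
    int          id;      /* partition ID                      */
    unsigned int start;   /* starting address of the partition */
    unsigned int size;    /* partition size in bytes           */
    int          free;    /* 1 = FREE, 0 = ALLOCATED           */
};

/* Returns the partition ID chosen by the given strategy for a program
 * of 'need' bytes, or -1 if no free partition is large enough.
 * strategy: 0 = first fit, 1 = best fit, 2 = worst fit.               */
int allocate_partition(struct pdt_entry pdt[NPART], unsigned int need, int strategy)
{
    int chosen = -1;
    for (int i = 0; i < NPART; i++) {
        if (!pdt[i].free || pdt[i].size < need)
            continue;
        if (strategy == 0)                                            /* first fit */
            return pdt[i].id;
        if (chosen == -1 ||
            (strategy == 1 && pdt[i].size < pdt[chosen].size) ||      /* best fit  */
            (strategy == 2 && pdt[i].size > pdt[chosen].size))        /* worst fit */
            chosen = i;
    }
    return (chosen == -1) ? -1 : pdt[chosen].id;
}

With a 50k program and free partitions of 200k (ID 1) and 100k (ID 4), first fit and worst fit would both return partition 1, while best fit would return partition 4, matching the discussion above.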

An advantage of this scheme is that a very small process is not loaded in a very large partition. It thus avoids memory wastage. It is instead added to a queue for smaller partitions. A disadvantage is obvious. You could have a long queue for a smaller partition whereas the queue for the bigger partition could be empty as shown in Fig. 9.6. This is obviously not an optimal and efficient use of resources!

In the Single queue method, only one unified queue is maintained of all the ready processes. This is shown in Fig. 9.7. Again, the order in which the PCBs of ready processes are chained, depends upon the scheduling algorithm. For instance, in priority based scheduling, the PCBs are chained in the order of priority. When a new process is to be loaded in the memory, the unified queue is consulted and the PCB at the head of the queue is selected for dispatching. The PCB contains the program size which is copied from the header of the executable file at the time a process is created. A free partition is then found based on either first, best or worst fit algorithms. Normally, the first fit algorithm is found to be the most effective and the quickest. However, if the PCB at the head of the queue requires memory which is not available now, but there is a free partition available to fit the process represented by the next PCB in the queue, what procedure is to be adopted? The non-availability of the partition of the right size may force the Operating System to change the sequence in which the processes are selected for dispatching. For instance, in Fig. 9.7, if the partition with size = 5k is free, the highest priority process at the head of the chain with size = 7k cannot be dispatched. The Operating System then has to find the next process in the queue which can fit into the 5k partition. In this case, the Operating System finds that the next process with size = 2k can fit well. However, this may not be the best decision in terms of performance.
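The 'pick the first process in the unified queue that fits the free partition' decision described above can be sketched as follows; the PCB structure shown is purely illustrative.

#include <stddef.h>

struct pcb {
    int          pid;
    unsigned int size;       /* program size, copied from the executable header */
    struct pcb  *next;       /* next PCB in the ready queue (priority order)    */
};

/* Returns the first PCB in the queue whose program fits the free
 * partition, skipping those that do not, or NULL if none fits.     */
struct pcb *pick_process(struct pcb *head, unsigned int free_partition_size)
{
    for (struct pcb *p = head; p != NULL; p = p->next)
        if (p->size <= free_partition_size)
            return p;
    return NULL;             /* nothing in the queue fits; wait      */
}

As the text notes, picking the first process that fits (here, the 2k process when only a 5k partition is free) is simple, but it is not necessarily the best decision in terms of overall performance.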

If the Operating System had a "lookahead intelligence" feature, it could have possibly known that the partition with size = 2k is likely to get free soon. In this case, choosing the next process of 5k for loading in the partition of size = 5k could have been a better decision. Almost immediately after this, the 2k partition would get free to accommodate the 2k process with higher priority than the one with size = 5k. The highest priority process with size = 7k will have to wait until the partition with size = 8k gets free. There is no alternative to this. This kind of intelligence is not always possible and it is also quite expensive. If the Operating System chooses a simple but relatively less intelligent solution and loads the process with size = 2k in the partition with size = 5k, the process with size = 5k keeps waiting. After a while, even if the 2k partition gets free, it cannot be used, thus causing memory wastage. This is called external fragmentation. Contrast this with internal fragmentation in which there is a memory wastage within the partition itself. Imagine a partition of 2k executing a process of 1.5k size. The 0.5k of the memory of the partition cannot be utilized. This wastage is due to internal fragmentation. This discussion shows that the MM and the PM modules are interdependent and that they have to cooperate with each other.

One more way in which the partitioned memory management scheme is categorized is based on whether it supports swapping or not. Lifting the program from the memory and placing it on the disk is called 'swapping out'. To bring the program again from the disk into the main memory is called 'swapping in'. Normally, a blocked process is swapped out to make room for a ready process to improve the CPU utilization. If more than one process is blocked, the swapper chooses a process with the lowest priority, or a process waiting for a slow I/O event for swapping out. As discussed earlier, a running process also can be swapped out (in priority based preemptive scheduling). Swapping algorithm has to coordinate amongst Information, Process and Memory Management systems. If the Operating System completes the I/O on behalf of a blocked process which was swapped out, it keeps the data read recently in its own buffer. When the process is swapped in again, the Operating System moves the data into the I/O area of the process and then makes the process ‘ready’. In demand paging, some portion of the memory where the record is to be read can be ‘locked’ or ‘bound’ to the main memory. The remaining portion can be swapped out if necessary. In this case, even if the process is ‘blocked’ and ‘swapped out’, the I/O can directly take place in the AP's memory. This is not possible in the scheme of ‘fixed partition’ because, in this case, the entire process image has to be in the memory or swapped out on the disk. The Operating System has to find a place on the disk for the swapped out process image. There are two alternatives. One is to create a separate swap file for each process. This method is very flexible, but can be very inefficient due to the increased number of files and directory entries thereby deteriorating the search times for any I/O operation. The other alternative is to keep a common swap file on the disk and note the location of each swapped out process image within that file. In this scheme, an estimate of the swap file size has to be made initially. If a smaller area is reserved for this file, the Operating System may not be able to

swap out processes beyond a certain limit, thus affecting performance. The medium term scheduler has to take this into account. Regardless of the method, it must be remembered that the disk area reserved for swapping has to be larger in this scheme than in demand paging because, the entire process image has to be swapped out, even if only the "Data Division" area undergoes a change after the process is loaded in the memory. A compiled program is brought into the memory through a single unified queue or through multiple queues. At the time of compilation, the compiler may not know which partition the process is going to run in. Again a process can be swapped out and later brought back to a different partition. A question is: How are the addresses managed in such a scheme? This is what we will learn in the section to follow.

Imagine a program which is compiled with 0 as the starting word address. The addresses that this program refers to are called ‘virtual addresses or logical addresses’. In reality, that program may be loaded at different memory locations, which are called 'physical addresses'. In a sense, therefore, in all Memory Management systems, the problem of relocation and Address Translation is essentially to find a way to map the virtual addresses onto the physical addresses. Let us imagine that there is an instruction equivalent to "LDA 500" i.e. "0100000111110100" in a simple machine language, in a compiled COBOL program, where 0100 is the machine op. code for LDA, and 500 in decimal is 000111110100 in binary. The intention of this instruction is obviously to load a CPU register (usually, an accumulator) with the contents of the memory word at address = 500. Obviously, this address 500 is the offset with respect to the starting physical address of the program. If this program is loaded in a partition starting from word address 1000, then this instruction should be changed to "LDA 1500" or "0100010111011100", because 1500 in decimal is 010111011100 in binary. Address Translation (AT) must be done for all the addresses in all the instructions except for constants, physical I/O port addresses and offsets which are related to a Program Counter (PC) in the PC-relative addressing mode, because all these do not change depending upon where the instruction is located. There are two ways to achieve this relocation and AT: Static and Dynamic. Static relocation is performed before or during the loading of the program in memory, by a relocating linker or a relocating loader. In this scheme too, the compiler compiles the program assuming that the program is to be loaded in the main memory at the starting address 0. In this scheme, this relocating linker/loader uses this compiled object program (with 0 as the starting address) essentially as a source program and the starting address of the partition (where the program is to be loaded) as a parameter, as shown in Fig. 9.8. The relocating linker/loader goes through each instruction and changes the addresses in each instruction of the program before it is loaded and executed. Obviously, the relocating linker/loader will have to know which portion of the instruction is an address, and depending upon the type of instruction and addressing mode, it will have to decide whether to change it or not (e.g. do not change PC-relative addresses), and this is not very trivial. Yet another problem is really to decide where one instruction ends and the next one starts. This may not be a very easy task, especially in machines where there are a number of instructions with different lengths and each with multiple options. This scheme was used in earlier IBM systems. It has two problems. Firstly, it is a slow process because it

is a software translation. The software routine for relocation is also not trivial. Secondly, because it is slow, it is used only once before the initial loading of the program. Each time a process is swapped out which then needs to be swapped in, it becomes fairly expensive to carry out this relocation. Dynamic relocation is used at the run time, for each instruction. It is normally done by a special piece of hardware. It is faster, though somewhat more expensive. This is because, it uses a special register called 'base register'. This register contains the value of relocation. (In our example, this value is 1000 because the program was loaded in a partition starting at the address 1000.) In this case, the compiled program is loaded at a starting memory location different than 0 (say, starting from 1000) without any change to any instruction in the program. For instance, Fig. 9.9 shows the instruction “LDA 500” actually loaded at some memory locations between 1000 and 1500. The address 500 in this instruction is obviously invalid, if the instruction is executed directly. Hence, the address in the instruction has to be changed at the time of execution from 500 to 1500. Normally, any instruction such as “LDA 500” when executed, is fetched to Instruction register (IR) first, where the address portion is separated and sent to Memory Address Register (MAR). In this scheme however, before this is done, this address in the instruction is sent to the special adder hardware where the base register value of 1000 is added to it, and only the resulting address of 1500 finally goes to MAR. As MAR contains 1500, it refers to the correct physical location. For every address needing translation, this addition is made by the hardware. Hence, it is very fast, despite the fact that it has to be done for every instruction. Imagine a program with a size of 1000 words. The 'virtual address space' or ‘logical address space’ for this program comprises words from 0 to 999. If it is loaded in a partition starting with the address 1000, 1000 to 1999 will be its 'physical address space' as shown in Fig. 9.9 though the virtual address space still continues to be 0 to 999. At the time of execution of that process, the value of 1000 is loaded into the base register. That is why, when this instruction “LDA 500” is executed, actually it executes the instruction “LDA 1500”, as shown in Fig. 9.9. The base register can be considered as another special purpose CPU register. When the partition allocation algorithm (first fit, best fit, etc.) allocates a partition for a process and the PCB is created, the value of the base register (starting address of the partition) is stored in the PCB in its Register Save Area. When the process is made "running", this value is loaded back in the base register. Whenever the process gets blocked, the base register value does not need to be stored again as the PCB already has it. Next time when the process is to be dispatched, the value from the PCB can be used to load the base register if the process has not been swapped out. After a process gets blocked and a new process is to be dispatched, the value of the base register is simply picked up from the PCB of the new process and the base register is loaded with that value. This is again assuming that the new process is already loaded in one of the partitions.

However, if a process is swapped out and swapped into a new partition later, the value of the base register corresponding to the new partition will have to be written into the PCB before the PCB of this process is chained in the queue of ready processes to be dispatched eventually. This is the most commonly used scheme amongst the schemes using fixed partitions, due to its enhanced speed and flexibility. A major advantage is that it supports swapping easily, i.e. a process can be swapped out and later swapped in at different locations very easily. Only the base register value needs to be changed before dispatching.
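As a minimal sketch (modelling in C what the adder hardware of Fig. 9.9 does), dynamic relocation amounts to nothing more than the following:

/* Models the adder hardware of Fig. 9.9: the address portion of the
 * instruction in the IR is added to the base register before being
 * placed in the MAR.                                                */
unsigned int relocate(unsigned int virtual_addr, unsigned int base_reg)
{
    /* With base_reg = 1000, the address 500 in "LDA 500" becomes 1500. */
    return base_reg + virtual_addr;
}

At every dispatch, base_reg would simply be loaded from the Register Save Area of the PCB of the process being dispatched, which is what makes swapping a process into a different partition so easy in this scheme.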

A process should not, by mistake or on purpose, become capable of interfering with other processes. There are two approaches for preventing such interference and achieving protection and sharing. These approaches involve the use of: (a) Protection bits, and (b) the Limit register. Protection bits are used by the IBM 360/370 systems. The idea is the same as in single user systems, except that 1 bit will not suffice for protection. A few bits are reserved to specify each word's owner (e.g. 4 bits if there are 16 user processes running in 16 partitions). This scheme, however, is expensive. If the word length is 32 bits, a 4 bit overhead for every word would mean 4/32 = 1/8 or a 12.5% increase in the overheads. Hence, the IBM 360 series of computers divided the memory into 2 KB blocks and reserved 4 protection bits called the 'key' for each such block, again assuming 16 users in all. The size of each partition had to be a multiple of such blocks and could not be any arbitrary number. This resulted in memory wastage due to internal fragmentation. Imagine that the block size is 2 KB, and the process size is 10 KB + 1 byte. If two of the partitions are of sizes 10 KB and 12 KB, the Operating System will have to allocate the partition of 12 KB for this process. The one with 10 KB size will not do. Hence, an area of 2 KB - 1 byte will be wasted in that partition. It can easily be seen that the maximum internal fragmentation per partition is equal to block size - 1, the minimum is 0, and the average is equal to (block size - 1)/2 per process. All the blocks associated with a partition allocated to a process are given the same key value in this

scheme. If the number of partitions is 16, there can be maximum 16 user processes at any moment in the main memory. Therefore, a 4 bit key value ranging from 0000 to 1111 serves the purpose of identifying the owner of each block in the main memory. This is shown in Fig. 9.10. Considering a physical memory of 64 KB and assuming a block of 2 KB size, there would be 32 blocks. If a 4 bit key is associated with a block, 32×4 = 128 bits have to be reserved for storing the key values. At the time of system generation, the System Administrator would define a maximum of 16 partitions of different sizes out of these 32 total blocks-available. One partition could be of 1 block, another of 3 blocks, and yet another of 2 or even 5 blocks. Each partition is then assigned a protection key from 0000 to 1111. After declaring various partitions with their different sizes, all the 128 bits reserved for the key values (4 per block) are set. This is done on the principle that all the blocks belonging to a partition should have the same key value. Figure 9.10 illustrates this. When a process is assigned to a partition, the key value for that partition is stored in 'Program Status Word (PSW)'. Whenever a process makes a memory reference in an instruction, the resulting address (after taking into account the addressing mode and the value of the base register ) and the block in which that address falls are computed. After this, a 4 bit protection key for that block is extracted from the 128 bit long protection keys, and it is tallied with the key stored in PSW. If it does not match, it means that the process is

trying to access an address belonging to some other partition. Thus, if due to hardware malfunction, a high order 0 of an address becomes 1, the process is still prevented from interfering with an address in some other partition belonging to a different process. However, if this hardware malfunction generates another address belonging to the same partition, this protection mechanism cannot detect it! This scheme has four major drawbacks: (i) It results in memory wastage because the partition size has to be in multiples of a block size (internal fragmentation). (ii) It limits the maximum number of partitions or resident processes (due to the key length). (iii) It does not allow sharing easily. This is because the Operating System would have to allow two possible keys for a shared partition if that partition belongs to two processes simultaneously. Thus, each block in that partition should have two keys, which is cumbersome. Checking the keys by hardware itself will also be difficult to implement. (iv) If hardware malfunction generates a different address but in the same partition, the scheme cannot detect it because the keys would still tally. Another method of providing protection is by using a Limit register (see Fig. 9.11), which ensures that the virtual address present in the original instruction (as moved into the IR, before any relocation/Address Translation) is within the bounds of the process. For instance, in our example in Sec. 9.3.4.3, where the program size was 1000, the virtual addresses would range from 0 to 999. In this case, the limit register would be set to 999. Every logical or virtual address will be checked to ensure that it is less than or equal to 999, and only then added to the base register. If it is not within the bounds, the hardware itself will generate an error message, and the process will be aborted. The limit register for each process can also be stored in the corresponding PCBs and can be saved/restored during the context switch in the same way as the base register. Sharing poses a serious problem in fixed partitions because it might compromise on protection. One approach to sharing any code or data is to go through the Operating System for any such request. Because the Operating System has access to the entire memory space, it could mediate. This scheme is possible but it is very tedious and increases the burden on the Operating System. Therefore, it is not followed in practice. Another approach is to keep copies of the sharable code/data in all partitions where required. Obviously, it is wasteful, apart from giving rise to possible inconsistencies, if for instance, the same pieces of data are updated differently in two different partitions. Another way is to keep all the sharable code and data in one partition, and with either key modification or change to the base/limit registers, allow a controlled access to this partition even from outside by another process. This is fairly complex and results in high overheads. Besides, it requires specialized hardware registers. This is the reason it is not followed widely.
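A rough sketch combining the two hardware checks discussed above, i.e. the limit register check on the virtual address and the 4-bit protection key check on the resulting block; the sizes and names are illustrative (64 KB of memory and 2 KB blocks, as in the IBM 360 example):

#include <stdbool.h>

#define BLOCK_SIZE 2048   /* 2 KB blocks                      */
#define NBLOCKS    32     /* 64 KB of memory / 2 KB per block */

/* keys[b] holds the 4-bit protection key of block b; psw_key is the key
 * of the running process, kept in the PSW; limit_reg bounds the virtual
 * address before relocation; base_reg is the partition start address.  */
bool access_allowed(unsigned int virtual_addr, unsigned int limit_reg,
                    unsigned int base_reg, const unsigned char keys[NBLOCKS],
                    unsigned char psw_key)
{
    if (virtual_addr > limit_reg)             /* limit register check   */
        return false;
    unsigned int physical = base_reg + virtual_addr;
    unsigned int block    = physical / BLOCK_SIZE;
    if (block >= NBLOCKS)                     /* outside physical memory */
        return false;
    return keys[block] == psw_key;            /* protection key check   */
}

In the real machine both checks are wired into the memory reference path, so a mismatch raises a trap instead of returning a value.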

If a partition of size 100k is allocated to a process of size 60k, then the 40k space of that partition is wasted, and cannot be allocated to any other process. This is called 'internal fragmentation'. However, it may happen that two free partitions of size 20k and 40k are available and a process of 50k has to be accommodated. In this case, both the available partitions cannot be allocated to that process because it

will violate the principle of allocation, viz. that only contiguous memory should be allocated to a process. As a result, there is wastage of memory space. This is called 'external fragmentation'. In fixed partitions, a lot of memory is thus wasted due to fragmentation of both kinds. We have seen examples of these earlier. Access times are not very high due to the assistance of special hardware. The translation from virtual to physical address is done by the hardware itself enabling rapid access. Time complexity is very low because allocation/deallocation routines are simple, as the partitions are fixed.

We have studied the problems associated with fixed partitions, especially in terms of fragmentation and restriction on the number of resident processes. This puts restrictions on the degree of multiprogramming and in turn on the CPU utilization. Variable partitions came into existence to overcome these problems and became more popular. In variable partitions, the number of partitions and their sizes are variable. They are not defined at the time of system generation. At any time, any partition of the memory can be either free (unallocated) or allocated to some process, in pretty much the same way as given in the PDT in Fig. 9.4. The only difference is that with variable partitions, the starting address of any partition is not fixed, but it keeps varying, as is depicted in Fig. 9.12. The eight states of the memory allocations in the figure correspond to the eight events given below. We will trace these events and study Fig. 9.12 to understand how this scheme works.

(i) The Operating System is loaded in the memory. All the rest of the memory is free.
(ii) A program P1 is loaded in the memory and it starts executing (after which it becomes a process).
(iii) A program P2 is loaded in the memory and it starts executing (after which it becomes a process).
(iv) A program P3 is loaded in the memory and it starts executing (after which it becomes a process).
(v) The process P1 is blocked. After a while, a new high priority program P4 wants to occupy the memory. The existing free space is less than the size of P4. Let us assume that P4 is smaller than P1 but bigger than the free area available at the bottom. Assuming that the process scheduling is based on priorities and swapping, P1 is swapped out. There are now two chunks of free space in the memory.
(vi) P4 is now loaded in the memory and it starts executing (after which it becomes a process). Note that P4 is loaded in the same space where P1 was loaded. However, as the size of P4 is less than that of P1, still some free space remains. Hence, there are still two separate free areas in the memory.
(vii) P2 terminates. Only P4 and P3 continue. The free area at the top and the one released by P2 can now be joined together. There is now a large free space in the middle, in addition to a free chunk at the bottom.
(viii) P1 is swapped in as the Operating System has completed the I/O on its behalf and the data is already in the buffer of the Operating System. Also, the free space in the middle is sufficient to hold P1 now. Another process P5 is also loaded in the memory. At this stage, there is only a little free space left.
The shaded area in the figure shows the free area at any time. Notice that the numbers and sizes of processes are not predetermined. It starts with only two partitions (Operating System and the other) and at stage (vi), there are 6 partitions. These partitions are created by the Operating System at the run time, and they differ in sizes. The procedure to be followed for memory allocation is the same as described for fixed partitions in steps (i) to (vii) of Sec. 9.3.1, excepting that the algorithms and data structures used may vary. We will not repeat these steps here. An interested reader can go through that section to refresh the memory.

The basic information needed to allocate/deallocate is the same as given in the PDT in Fig. 9.4. However, because the number of entries is uncertain, it is rarely maintained as a table, due to the obvious difficulties of shifting all the subsequent entries after inserting any new entries. (Try doing that for our example in the previous section, shown in Fig. 9.12.)

Therefore, the same information is normally kept as bitmaps or linked lists much in the same way that you keep track of free disk blocks. To do this, like a block on the disk, the Operating System defines a chunk of memory (often called a block again) This chunk could be 1 word or 2 KB, or 4 KB or whatever. The point is that for each process, allocation is made in multiples of this chunk. In a bit map method, the Operating System maintains 1 bit for each such chunk denoting if it is allocated (=1) or free (=0). Hence, if the chunk is a word of 32 bits, 1 bit per 32 bits means about 3.1% of memory overhead, which is pretty high. However, the memory wastage is minimal. This is because the average wastage, as we know, is (chunk size-1)/2 per process. If the chunk size is high, the overhead is low but the wasted memory due to internal fragmentation is high. In a linked list, we create a record for each variable partition. Each record maintains information such as: Allocated/free (F = Free, A = Allocated) Starting chunk number Number of chunks Pointer (i.e. the chunk number) to the next entry. Figure 9.13 depicts a picture of the memory at a given time. Corresponding to this state, we also show in b and c, - a bit map and a linked list, where a shaded area denotes a free chunk. You will notice that the corresponding bit in the bit map is 0. The figure shows 29 chunks of memory-(0 to 28), of which 17 are allocated to 6 processes. As is clear, the bit map shows that the first 4 chunks are free, then the next 3 are allocated, then, again, the next 2 are free and so on. We can ensure that the linked list also depicts the same picture, essentially, any of these two methods can be used by the Operating System. Each has merits and demerits. In this scheme, when a chunk is allocated to a process or a process terminates, thereby freeing a number of chunks, the bit map or the linked list is updated accordingly to reflect these changes. In addition, the Operating System can link all the free chunks together by bidirectional pointers. It can also maintain a header for these free chunks, giving the start and end of this chain. This chain is used to allocate chunks to a new process, given the desired size. When a process terminates, this chain is appropriately updated with the chunks freed due to the terminated process. At any time, the PCB contains the starting chunk number of the chunks allocated to that process. Observe the merits/demerits for both the approaches. Bit maps are very fast for deallocations. For instance, if a process terminates, the Operating System just carries out the following steps:

(i) It finds the starting chunk number and the number of chunks held by the terminating process; this information can be found out from the PCB. (Program size may indicate the number of chunks.) (ii) It sets the bits corresponding to those chunks back to 0 in the bit map. But bit maps can be very slow for allocations. For instance, let us assume that a new process wants 4 chunks. The algorithm has to start from the beginning of a bit map and check for 4 consecutive zero bits. This certainly takes time. Linked lists, on the other hand, are time-consuming for deallocations, but they can be faster for allocations. A study of the algorithms for these will reveal the reason for this. Again for allocations, in both the methods, you could use the 'first fit', 'best fit' and the 'worst fit' algorithms as discussed earlier. For best fit and worst fit algorithms, linked lists are far more suitable than bit maps as they maintain the chunk size as a data item explicitly in the linked list. However, you need to have these chunks sorted in the order of chunk size for both the best and worst fit methods. For the first fit method, which is the

most common, all that the Operating System needs to do is to access the queue header for the free chunks, traverse through the chain of slots for Free chunks until you hit a first chunk with size >= the size needed. There is also an algorithm called 'quick fit' which maintains different linked lists for some of the commonly required process sizes e.g. there is one linked list for 0 – 4k holes, and there is another list for 4k – 8k holes and yet another one for 8k – 12k holes and so on. The hole when created (due to process termination) is added to the appropriate list. This reduces the search times considerably. All these techniques normally 'coalesce' the adjacent holes. For instance, in step (vii) of Fig. 9.12, when process P2 got over, two adjacent holes are created. The Operating System looks around to see if there are adjacent holes, and if yes, it creates only one large hole. (To do this, a linked list in the original sequence as shown in Fig. 9.13 is much better than multiple linked lists.) Having created a large hole, depending upon its size, it may have to be added to the appropriate list if 'quick fit' is used. There is yet another method of allocation/deallocation called 'Buddy System' proposed by Knowlton and Knuth to speed up merging of adjacent holes. Unfortunately, it is very inefficient in terms of memory utilization. Various modified buddy systems have been proposed to improve this, but the discussion of those is beyond the scope of this text. One problem with all these systems is external fragmentation. In states (v), (vi) and (vii) shown in Fig. 9.12, There are two holes, but at separate locations so that it is not possible to coalesce them. If a process

requires more memory than each hole individually, but less than both the holes put together, that process cannot run even if the total free memory available is larger than what it requires. What, then, is the solution to this problem? The technique used to solve it is called ‘Compaction’. We will now study it. This technique shifts the necessary process images to bring the free chunks to adjacent positions in order to coalesce them. There could be different ways to achieve compaction. Each one results in the movement of different chunks of memory. For instance, Fig. 9.14 (a) shows the original memory allocations and Fig. 9.14 (b), (c) and (d) show three different ways in which compaction could be achieved. These three ways result in the movement of chunks of sizes 1200, 800 and 400 respectively. While calculating the movements, imagine that the live processes are actually moving rather than the free chunks. For instance, in the method shown in Fig. 9.14 (b), the Operating System has to move P3 (size = 400) and P4 (size = 800). Hence, the total movement is 1200. In Fig. 9.14 (c), only P4 (size = 800) is moved into the free chunk of 800 available (1200 – 2000). Hence, the total movement is only 800. In Fig. 9.14 (d), you move only P3 (size = 400) into the free chunk of 400 available (3800 – 4200). In this case, the total movement is only 400. The free contiguous chunk is in the middle in this case, but it does not matter. Obviously, the method depicted in Fig. 9.14 (d) is the best. The Operating System has to evaluate these alternatives internally, and then choose. It is obvious that regardless of the method used, during the compaction operation, normally no user process can proceed, though in the case depicted in Fig. 9.14 (d), it is possible to imagine that P1, P2 or P4 can continue while compaction is going on, because they are unaffected. Any process for which the image is being shuffled around for compaction has to be blocked until the compaction is over. After the 'external event' of compaction is over, the PCB is updated for memory pointers and the PCB is chained to the 'ready' list. Eventually, this process is dispatched. Obviously, whenever a process terminates, the Operating System would do the following: (i) Free that memory. (ii) Coalesce, if necessary. (iii) Check if there is another free space which is not contiguous and if yes, go through the compaction process. (iv) Create a new bit map/linked list as per the new memory allocations. (v) Store the starting addresses of the partitions in the PCBs of the corresponding processes. This will be loaded from the appropriate PCB into the base register at the time the process is dispatched. The base register will be used for Address Translation of every instruction at the run time as seen earlier. Compaction involves a high overhead, but it increases the degree of multiprogramming. This is because, after compaction, the system can accommodate a process with a larger size which would have been impossible before compaction. Both IBM and ICL machines have used this scheme for their operating systems.
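As a rough sketch (with an illustrative data structure, not the book's), the first fit search for a hole and the cost of the simplest compaction plan could be written as follows:

struct chunk {
    unsigned int size;     /* size of this chunk in bytes */
    int allocated;         /* 1 = process image, 0 = hole */
};

/* First fit: index of the first hole that can hold 'needed' bytes,
 * or -1 if no single hole is large enough.                         */
int first_fit_hole(const struct chunk layout[], int n, unsigned int needed)
{
    for (int i = 0; i < n; i++)
        if (!layout[i].allocated && layout[i].size >= needed)
            return i;
    return -1;
}

/* Cost (bytes of process images moved) of the simplest compaction plan:
 * slide every allocated chunk towards the low addresses so that all the
 * holes coalesce into one chunk at the high end of memory.              */
unsigned int slide_up_cost(const struct chunk layout[], int n)
{
    unsigned int cost = 0;
    int seen_hole = 0;
    for (int i = 0; i < n; i++) {
        if (!layout[i].allocated)
            seen_hole = 1;             /* everything beyond this must move */
        else if (seen_hole)
            cost += layout[i].size;
    }
    return cost;
}

If first_fit_hole() returns -1 while the total free memory is sufficient, the Operating System would compute costs like slide_up_cost() for several candidate plans and pick the cheapest one, such as the 400-byte movement of Fig. 9.14 (d).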

The swapping considerations are almost identical to those discussed in Sec. 9.3.3 for fixed partitions, and therefore, need no further discussion.

This is substantially the same as in fixed partitions. This scheme also depends upon the base register, which is saved in and restored from the PCB at the context switch. The physical address is calculated by adding

the base register to the virtual address as before, and the resulting address goes to MAR for decoding. After swapping or compaction operations, if the processes change their memory locations, these values also need to be changed as discussed earlier.

Protection is achieved with the help of the limit register. Before calculating the resultant physical address, the virtual address is checked to ensure that it is equal to or less than the limit register. This register is loaded from the PCB when that process is dispatched for execution. As this value of limit register does not undergo any change during the execution of a process, it does not need to be saved back in the PCB at the context switch. Sharing is possible only to a limited extent by using 'overlapping partitions' as shown in Fig. 9.15.

The Figure depicts that process A occupies locations with addresses 3000 to 6999, and process B occupies locations with addresses 6000 to 8999. Thus, in this case, locations with addresses between 6000 and 6999 are overlapping, as they belong to both the partitions. This is possible only because the partitions were variable and not fixed. Though it may sound like a very good idea, in practice, it has a number of limitations. Firstly, it allows sharing only for two processes. Secondly, the shared code must be either reentrant or must be executed in a mutually exclusive way with no preemptions. While mapping all the virtual addresses to physical addresses, references to itself within the shared portion must map to the same physical locations from both the processes. Due to these difficulties, this method is not widely used for sharing.

This scheme wastes less memory than the fixed partitions because there is theoretically no internal fragmentation if the partition size can be of any length. In practice, however, the partition size is normally a multiple of some fixed number of bytes, giving rise to a small internal fragmentation. If the Operating System adopts the policy of compaction, external fragmentation can also be done away with, but at some extra processing cost. Access times are not different from those in fixed partitions due to the same scheme of Address Translation using the base register. Time complexity is certainly higher with the variable partition than that in the scheme of fixed partitions, due to the various data structures and algorithms used. Consider, for instance, that the Partition Description Table (PDT) shown in Fig. 9.4 is no longer of fixed length. This is because the number of partitions is not fixed. Also consider the added complexity of bit maps/linked lists due to coalescing/compaction.

Up to now, various contiguous memory allocation schemes and the problem of fragmentation that arises therefrom have been studied. Compaction provides a method to reduce this problem, but at the expense of a lot of computer time in shifting many process images to and fro. Non-contiguous allocation provides a better method to solve this problem. Consider Fig. 9.16, for instance. Before compaction, there are holes of sizes 1k and 2k.

If a new program of size = 3k is to be run next, it could not be run without compaction in the earlier schemes. However, compaction would force most of the existing processes also to stop running for a while. A solution to this has to be found. Solving the problem of fragmentation involves answers to the following questions: (a) Can the program be broken into two chunks of 1k and 2k to be able to load them into two holes at different places? This will make the process image in the memory noncontiguous. This raises several questions. (b) How can such a scheme be managed? (c) How can the addresses generated by the compiler be mapped into those of the two separate noncontiguous chunks of physical memory by the Address Translation mechanism? (d) How can the problems of protection and sharing be solved? One thought would be to have two base registers for the two chunks belonging to our process in the above example. Each base register will have the value of the memory address of the beginning of that chunk. For instance, in Fig. 9.16 the program 2 of 3k size will be loaded in the two chunks of sizes 1k and 2k starting at the physical memory addresses 500 and 2100 respectively. In this case, the two base registers will have the values of 500 and 2100. An address in the process will belong to either chunk-A or chunk-B. Thus, to arrive at the final physical address by the Address Translation, the respective base register will have to be added to the original address depending upon which chunk the address belongs to. Values of both of the registers could be initially stored in the PCB and restored from the PCB at every context switch as before. Thus, this scheme could be conceptually an extension of earlier ideas. As many base and limit registers as there are chunks in a program will be needed. For instance, if a program is loaded in n non-contiguous chunks, you will need n base registers and n limit registers for that program alone.
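A minimal sketch of Address Translation with one base/limit pair per chunk, using the 1k and 2k chunks loaded at physical addresses 500 and 2100 from the example above; the table layout and function name are our own illustration, not a real mechanism from any particular machine:

#define NCHUNKS 2

/* chunk_limit[i] is the highest virtual address belonging to chunk i;
 * chunk_base[i] is where that chunk starts in physical memory.        */
static const unsigned int chunk_limit[NCHUNKS] = { 1023, 3071 };  /* 1k, then 2k     */
static const unsigned int chunk_base[NCHUNKS]  = { 500, 2100 };   /* as in Fig. 9.16 */

/* Returns 0 and fills *physical on success, -1 if the virtual address
 * is out of bounds (the hardware would raise a trap in that case).     */
int translate_chunked(unsigned int virtual_addr, unsigned int *physical)
{
    unsigned int chunk_start = 0;
    for (int i = 0; i < NCHUNKS; i++) {
        if (virtual_addr <= chunk_limit[i]) {
            *physical = chunk_base[i] + (virtual_addr - chunk_start);
            return 0;
        }
        chunk_start = chunk_limit[i] + 1;
    }
    return -1;
}

Here virtual address 100 would map to physical address 600 (chunk-A), while virtual address 1100 would map to physical address 2176 (chunk-B).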

A question is: What should be the sizes of these chunks? Approaches used for solving this are listed here:
(a) One approach is called 'paging'. In this case, the process image is divided into fixed sized pages.
(b) The other approach is called 'segmentation'. In this case, the process image is divided into logical segments of different sizes.
(c) If the entire process image (all chunks) has to reside in the main memory before execution can commence, the system is a 'real memory' management system. If execution can start with only some of the chunks in the main memory, the remaining ones being brought into the main memory as and when required, the system is called a ‘virtual memory management system’ (not to be confused with ‘virtual address’). The term ‘virtual address’ can be used meaningfully even in the pure paging system, but that by itself will not make it a virtual memory management system.
(d) One method of virtual memory management is called ‘demand paging’. The other is called ‘working set method’. These methods differ in the way the chunks are brought from the disk into the main memory.
(e) There is also a combined ‘segmented paged method’, in which each process image is divided into a number of segments of different sizes, and each segment in turn is divided into a number of fixed sized pages. Again, this scheme can be implemented using virtual memory, though it is possible to implement it using 'real' memory.
These methods are considered one by one in subsequent paragraphs.

As discussed earlier, the chunks of memory are of equal sized pages in the paging scheme. The logical or virtual address space of a program is divided into equal sized pages, and the physical main memory also is divided into equal sized page frames. The size of a page is the same as that of the page frame, so that a page can exactly fit into a page frame and therefore, it can be assigned to any page frame, which is free. (Questions of first fit, etc. do not arise.) In order that this scheme works, the following must happen: (i) The process address space of a program is thought of as consisting of a number of fixed sized contiguous pages (hence, the name 'virtual or logical pages'). (ii) Any virtual address within this program consists of two parameters: a logical or virtual page number (P) and a displacement (D) within the page. (iii) The memory is divided into a number of fixed sized page frames. The size of a page frame is the same as that of a logical page. The Operating System keeps track of the free page frames and allocates a free page frame to a process when it wants it. (iv) Any logical page can be placed in any free available page frame. After the page (P) is loaded in a page frame (F), the Operating System marks that page as "Not free".

(v) Any logical address in the original program is two dimensional (P, D), as we know. After loading, the address becomes a two-dimensional physical address (F, D). As the sizes of the page and the page frame are the same, the same displacement D appears in both the addresses. (vi) When the program starts executing, the Address Translation mechanism has to find out the physical page number (F), given the virtual page number (P). After this, it has to append or concatenate D to it to arrive at the final physical address (F, D). Hence, in the virtual address, it must be possible to separate out the bits for the page (P) and the ones for D, in order to carry out this translation. However, there is a problem in our scheme. How does the compiler generate a two-dimensional address? We know that the compiler generates only one-dimensional single address in binary. How then is it possible to separate out the address into two components, P and D? The secret of the solution lies in the page size. If the page size is a power of 2 such as 32, 64, ... 1k, 2k, etc., this problem vanishes. This is because, the single binary address can be shown to be the same as a two-dimensional address, i.e. automatically some high order bits correspond to P and the remaining low order bits correspond to D. This can happen only if the page size is a power of 2. This is the reason the compiler does not have to generate any separate two dimensional address specifically for the paging system address. It generates only a single binary address, but it can be interpreted as a two-dimensional address. This is what helps in separating P, translating it to F and then concatenating the same D to it to arrive at the final address. If the page size is not a power of 2, this automatic separation of P and D does not take place. We will consider an example to illustrate this. Let us say that page size = 100 and that the address in question is 107 is decimal. The address 107 in binary would be 01101011. This is essentially a single dimensional address in 8 bits as the compiler would generate. In a two-dimensional address with page size equal to 100, page 00 will have addresses 0 to 99, and page 01 will have addresses 100 to 199. Thus, address 100 would correspond to P = 1, D = 0, address 101 would correspond to P = 1 and D = 1, and so on. Therefore, address 107 would be that of the location number = 7 in page number 1. Therefore, P = 01, D = 000111 in binary, if we reserve two bits for P and six for D. If we concatenate the two, we get the two dimensional address as 01000111, as against a one dimensional address of 01101011. Notice that these two are different. The Address Translation at the time of execution will pose difficulties, if the compiler produces an one-dimensional address which has no correlation with a two-dimensional one. This problem can be easily solved if the page size is a power of 2, which is normally the case. Assume in this case that the page size is 32. Thus, locations 0–31 are in page 0, 32–63 in page 1, 64–95 in page 2 and 96–127 in page 3. Therefore, location 96 means P = 3 and D = 0, location 97 means P = 3 and D = 1. We can, therefore, easily see that the address 107 will mean P = 3 and D = 11. Hence, the two-dimensional address for decimal 107 in binary is P = 011 and D = 01011. If we concatenate the two, we will get 01101011, which is exactly same as the one-dimensional address in binary that the compiler produces. An interesting point is worth noting. 
Even if the page size were 64 instead of 32, the two-dimensional address would remain the same. In this case, page 0 would have addresses 0–63 and page 1 would have 64–127. Hence, location 107 would mean location 43 in page 1. Therefore, a two-dimensional address for 107 would be page (P) = 01 and displacement (D) = 101011 in binary. Concatenating the two, we still get 01101011, which is the same as the one-dimensional binary address that the compiler produces. Therefore, the compiler does not have to produce a different address merely because it is going to be treated as a two-dimensional one. The compiler compiles addresses as if they were absolute addresses

with respect to 0 as the starting address. These are the same as one-dimensional addresses. At the time of execution, the page number (P) can be separated out quite easily by considering only a few high order bits of the address, and the displacement (D) by considering the remaining low order bits of the address. This is shown in Fig. 9.17. The point is: How many bits should be reserved for P and how many for D? The answer to this depends upon the page size, which determines D, and the maximum number of pages in a process image, which determines P. Given that the total size of the process image = page size × number of pages, which is a constant, a number of possibilities can arise. For a process image of 256 bytes, for instance, a page size of 16 bytes gives 16 pages (4 bits for P and 4 bits for D), a page size of 32 bytes gives 8 pages (3 bits for P and 5 bits for D), and a page size of 64 bytes gives 4 pages (2 bits for P and 6 bits for D); in every case the address is 8 bits long.
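The following short C sketch (an illustration, not part of the original text) verifies this property for the address 107 used above: when the page size is a power of 2, splitting the address into P and D and concatenating them back always reproduces the original one-dimensional address.

#include <stdio.h>

int main(void) {
    unsigned addr = 107;                 /* one-dimensional address from the compiler */
    unsigned page_sizes[] = {32, 64};

    for (int i = 0; i < 2; i++) {
        unsigned size = page_sizes[i];
        unsigned P = addr / size;        /* high-order bits: page number  */
        unsigned D = addr % size;        /* low-order bits : displacement */
        unsigned rebuilt = P * size + D; /* concatenation of P and D      */
        printf("page size %2u: P = %u, D = %2u, concatenated = %u\n",
               size, P, D, rebuilt);     /* rebuilt is always 107         */
    }
    return 0;
}

With a page size of 32 it prints P = 3, D = 11, and with a page size of 64 it prints P = 1, D = 43; the concatenated value is 107 in both cases.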

The choice of page size is an architectural issue, which has an effect on performance. We will study this later. The point is: the beauty of the binary system is such that, whatever the page size may be, the one-dimensional address is the same as the two-dimensional one. Normally, in commercial systems, the page size chosen varies from 512 bytes to 4 KB. Assuming that 1 Megabyte or 1 MB (= 1024 KB) of memory is available and that the page size as well as the page frame size is 2 KB, we will require 1024/2 = 512 page frames, numbered from 0 to 511 or from 000000000 to 111111111 in binary. Hence, the 9 high order bits of the address can be reserved to denote the page frame number. Each page has 2 KB (2048) locations numbered from 0 to 2047, thus requiring 11 bits for the displacement D. (512 requires 9 bits, 1024 would require 10 and 2048 would require 11 bits.) Thus, the total address would be made up of 9 bits for the page frame number + 11 bits for the displacement = 20 bits.
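As a quick check of this arithmetic, the following small C sketch (illustrative, not from the text) recomputes the 9 + 11 = 20-bit split from the assumed memory size of 1 MB and page size of 2 KB.

#include <stdio.h>

/* Number of bits needed to address 'n' distinct values (n a power of 2). */
static unsigned bits_for(unsigned long n) {
    unsigned b = 0;
    while ((1UL << b) < n) b++;
    return b;
}

int main(void) {
    unsigned long memory    = 1024UL * 1024UL;  /* 1 MB of physical memory      */
    unsigned long page_size = 2048UL;           /* 2 KB pages and page frames   */

    unsigned long frames = memory / page_size;  /* 512 page frames              */
    unsigned f_bits = bits_for(frames);         /* 9 bits for the frame number  */
    unsigned d_bits = bits_for(page_size);      /* 11 bits for the displacement */

    printf("%lu frames -> %u bits for F, %u bits for D, %u-bit address\n",
           frames, f_bits, d_bits, f_bits + d_bits);
    return 0;
}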

Similarly, any virtual address produced by the compiler can be thought of as made up of two components: the page number (P) and the displacement (D). An interesting point of this scheme is that when a page is loaded into any available page frame (the sizes of both being the same), the displacement (D) for any address is the same in the virtual as well as the physical address. Hence, all that is needed is to load pages into available page frames and keep some kind of index as to which page is loaded where. This index is called a 'Page Map Table (PMT)', which is the key to the Address Translation. At execution time, all that is needed is to separate out the high order bits of the address reserved for the page number (P), convert them into the page frame number (F) using this PMT, and concatenate F and D to arrive at the physical address, as we know that D remains the same. This is the essence of Address Translation. The Page Map Table (PMT) is shown in Fig. 9.18. There is one such PMT maintained for each process. The PMT in the figure shows that the virtual address space of a process consists of 4 pages (0 to 3) and they are loaded in physical page frames 5, 3, 9 and 6, respectively.
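A minimal C sketch of this translation is given below, using the PMT of Fig. 9.18 (pages 0 to 3 loaded in frames 5, 3, 9 and 6) and assuming, purely for illustration, the 2 KB page size of the earlier example.

#include <stdio.h>

#define PAGE_SIZE 2048u   /* 2 KB pages, as in the example above (assumed) */

/* PMT of Fig. 9.18: logical pages 0..3 loaded in frames 5, 3, 9 and 6. */
static const unsigned pmt[] = {5, 3, 9, 6};

/* Translate a virtual address into a physical address using the PMT. */
static unsigned translate(unsigned vaddr) {
    unsigned P = vaddr / PAGE_SIZE;   /* virtual page number            */
    unsigned D = vaddr % PAGE_SIZE;   /* displacement within the page   */
    unsigned F = pmt[P];              /* page frame number from the PMT */
    return F * PAGE_SIZE + D;         /* concatenate F and D            */
}

int main(void) {
    unsigned vaddr = 3 * PAGE_SIZE + 100;          /* page 3, displacement 100 */
    printf("virtual %u -> physical %u\n", vaddr, translate(vaddr));
    /* page 3 maps to frame 6, so the physical address is 6*2048 + 100 = 12388 */
    return 0;
}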

In this scheme, in order to load a page, any page frame is as good as any other, so long as it is free; there is nothing to choose one against the other. In other words, the memory allocation algorithm amounts to maintaining a list of free page frames and allocating as many page frames from it as there are pages to be loaded. The scheme for allocating page frames of physical memory at the time of process creation has to coordinate among the Information Management (IM), Process Management (PM) and Memory Management (MM) modules. It works as follows:

(i) MM keeps track of the free page frames at any time. In the beginning, all page frames except those occupied by the Operating System itself are free. Thus, MM maintains this list of free page frames in addition to the PMTs.

(ii) When a process is to be loaded into the memory, PM requests IM for the size of the program.

(iii) IM goes through the directory structure to resolve the path name and accesses the Basic File Directory (BFD) information to get the size of the object file to be executed.

(iv) Having obtained this size, PM requests MM to allocate memory of that size.

(v) MM calculates the number of page frames that need to be allocated. This is equal to (program size / page frame size) rounded up to the next integer.

(vi) MM now consults the list of free page frames and, if possible, allocates them to the process. You know that they need not be contiguous. MM now updates the list of free page frames to mark these page frames as "allocated". It also creates a PMT for that process. If there are not enough free, allocable page frames, MM indicates that to PM, which postpones the loading of this process. (In a Virtual Memory Management system, where the execution of a process can commence with only a part of the process image in the memory, the story would have been different!) If this process is of high priority, it is for PM to swap out an existing low priority process to make room for the new one.

(vii) Having allocated the required page frames, MM now signals PM to load the process.

(viii) PM loads the various pages of the process address space into the allocated physical page frames of the memory with the help of IM, and links the PCB for that process into the list of ready processes. The PCB also maintains a pointer to the starting address of the PMT in the memory. This is used for Address Translation for this process when it is dispatched after a context switch.

(Steps (v) and (vi) are sketched in code after the example that follows.)

Let us take an example to illustrate how the free page frames are allocated to a new process and how a PMT is created for it. Fig. 9.19 shows three processes, A, B and C with their respective PMTs which map the virtual or logical pages (P) onto the physical page frames (F). A list of free page frames is shown on the top

of the figure. This list is not necessarily maintained in any particular order; the order is dictated by the way the page frames become free and are allocated. The figure also shows that a new process (Process D) has arrived, wanting to occupy two page frames. The Operating System will consult the list of free page frames, allocate the first two page frames in that list, i.e. page frames 10 and 14, to Process D and then create a PMT for it. It will then remove those page frames from the free list. This is depicted in Fig. 9.20.

There is one PMT for each process, and the sizes of different PMTs are different. Study the free page frames list before and after the allocation. The page frame list need not always be in the sorted order of frame numbers. As and when frames are freed by processes which are terminated or swapped out, they are added to the list, and that order can be random, because one cannot predict which page frame will become free and when. Hence, there is no specific sequence maintained in that list. It is important to note that this order is also of no consequence. While allocating, any page frame is as good as any other. The need for contiguity has also vanished. Hence, the allocation algorithms of best fit, first fit, etc. are of no value in this scheme. If 4096 bytes are required, MM will calculate this as two pages of 2 KB each and will allocate the first two free page frames in the list of free page frames. These could be physically quite distant. But this does not matter, because the Address Translation is done separately for each page using the PMT.
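The following C sketch (an illustration under assumed names, not the book's algorithm) puts steps (v) and (vi) together for this example: 4096 bytes with 2 KB frames gives two pages, which are given the first two frames from the free list (10 and 14 here), and a PMT is built accordingly.

#include <stdio.h>
#include <stdlib.h>

#define FRAME_SIZE 2048u                 /* 2 KB page frames (assumed) */

/* Allocate page frames for a new process: returns a newly built PMT of
 * '*pages_out' entries, or NULL if not enough free frames are available.
 * 'free_list' holds free frame numbers; '*free_count' is updated in place. */
unsigned *allocate_pmt(unsigned program_size,
                       unsigned *free_list, unsigned *free_count,
                       unsigned *pages_out) {
    /* Step (v): number of frames = program size / frame size, rounded up. */
    unsigned pages = (program_size + FRAME_SIZE - 1) / FRAME_SIZE;
    if (pages > *free_count)
        return NULL;                     /* not enough frames: postpone loading */

    /* Step (vi): take the first 'pages' frames from the free list. */
    unsigned *pmt = malloc(pages * sizeof *pmt);
    for (unsigned p = 0; p < pages; p++)
        pmt[p] = free_list[p];           /* logical page p -> this frame */

    /* Remove the allocated frames from the free list. */
    for (unsigned i = pages; i < *free_count; i++)
        free_list[i - pages] = free_list[i];
    *free_count -= pages;

    *pages_out = pages;
    return pmt;
}

int main(void) {
    unsigned free_list[] = {10, 14, 21, 7};  /* free frames, in no particular order */
    unsigned free_count = 4, pages;

    unsigned *pmt = allocate_pmt(4096, free_list, &free_count, &pages);
    for (unsigned p = 0; p < pages; p++)     /* 4096 bytes -> 2 pages: frames 10, 14 */
        printf("page %u -> frame %u\n", p, pmt[p]);
    free(pmt);
    return 0;
}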

The considerations for swapping are similar to those discussed earlier. In paging, if a process is swapped, it is swapped entirely. Keeping only a few pages in the main memory is useless, because a process can run only if all its pages are present in the main memory. AOS, running on 16-bit Data General machines, follows this pure paging philosophy. The process cannot run unless the entire process image is in the memory, even though the process image is divided into pages and a few of them may be in the main memory. Therefore, the entire process image is swapped out, if required. Which process is to be swapped out depends upon the priorities and states of the processes already existing in the main memory and the size of the new process to be accommodated. Normally, a blocked process with a very low priority can be swapped out if space is to be created in the memory for a new process. These issues are handled by the medium level process scheduler and have already been discussed in the section on process management. When a process is swapped out, the area in the memory which holds the PMT for that process is also released. When it is swapped in again, it may be loaded into different page frames, depending upon which are free at that time. At that time, a new PMT is created as the PCB for the process is chained to the ready processes. We know that the PCB also contains the memory address of the PMT for that process.

Relocation has already been discussed in Sec. 9.6.1, showing that different pages are loaded into different page frames before execution. After loading, the addresses in the program, as it resides in the main memory, are still the virtual addresses generated by the compiler (i.e. assuming that the program is loaded contiguously from address 0). It is only at run time that Address Translation is done using the PMT. This is shown in Fig. 9.21. Let us assume that we have a machine where the word length is 8 bits. Let us also assume that our machine has a main memory with a capacity of 512 words or bytes. This memory is divided into 16 pages of 32 words each (16 × 32 = 512). Hence, we will require 4 bits to represent the page number P (0 to 15) and 5 bits to represent the displacement D (0 to 31). Therefore, the total number of bits in the address will be 9 (i.e. 4 + 5). We verify that with 9 bits, the maximum address that can be generated is 511, which is as expected, because we have a memory size of 512 words (0 to 511). This is shown in Fig. 9.21. It has been seen that any address generated by the compiler is automatically divided into two parts - page number (P) and displacement (D). This is because the page size is a power of two. When the instruction is fetched into IR, depending upon the addressing mode (direct, indirect, indexed, etc.), the resultant address

ultimately resides in the CPU register. It is this address that is split into P and D. P is fed as an input to the Address Translation mechanism. Address Translation finds out the page frame (F) corresponding to P using the PMT, and generates the physical address F + D. This is shown in Fig. 9.21.

Let us assume that there is a COBOL program with an instruction "ADD BASIC, DA GIVING TOTAL". The compiler would generate many machine instructions for this one instruction. Let one of those instructions be LDA 107, and let the instruction itself be at virtual address 50 (i.e. P = 1, D = 18) in the program, with 0 as the starting address. The following discussion shows how this instruction is executed.

At the fetch cycle, when the Program Counter (PC) gets incremented from 49 to 50 (i.e. 000110010), this address is transferred to MAR by the microinstruction PC → MAR. The bits in MAR act as the control signals for the address decoder which activates the desired memory location. It is at this stage that we need to modify the address, so that the resulting address can finally be put on the address bus which accesses the physical memory. The PMT in Fig. 9.21 shows that page 1 is mapped onto page frame 4, and thus, the physical address at which you will find the instruction "LDA 107" will be within page frame 4, at a displacement of 18. Page frame 4 contains physical addresses 128 to 159. Therefore, displacement 18 within that page frame would mean a physical address of 128 + 18 = 146 in decimal, or 010010010 in binary. Hence, we need to fetch the instruction not at location 50 but at location 146. To achieve this, the address coming out of MAR, which is 50 in decimal or 000110010 in binary, is split into two parts. The page number P is used to find the corresponding page frame F using the PMT. Figure 9.21 shows that P = 0001 corresponds to F = 0100. F + D is now used as the address which acts as the control signal to the memory decoder. Thus, actually the instruction at 146 in decimal or 010010010 in binary is fetched into the Instruction Register (IR). This is just what is required.

At the 'execute cycle', the hardware "knows" that it is an LDA instruction using direct addressing. It, therefore, copies the address portion 107, i.e. 001101011 (P = 0011 = 3, D = 01011 = 11), to MAR for fetching the data by giving a 'read' signal. The figure shows that page 3 is mapped onto page frame 2. Hence, the data at virtual address 107 (decimal) will now be found at the physical address with page frame (F) = 2 = 0010 and displacement (D) = 11 = 01011, i.e. binary address 001001011 or 75 in decimal, instead of 107. Again, this Address Translation is done using the PMT on the address in MAR, and the resultant translated address is put on the address bus, so that the correct addresses are actually used for address decoding. This is shown in Fig. 9.22.

Implementation of PMTs is a major decision which affects performance. The main factor is the maximum size of the PMT. This, in turn, is dependent upon the maximum program size (which is dependent on the width of the address bus in bits) and the page size. AOS on Data General machines, for instance, can have 16 pages of 2 KB each. Some systems may allow a larger number of pages per process. Each process must have a PMT, and in theory, the PMT must be large enough to have as many entries as the maximum number of pages per process. However, very few processes actually use up all the pages, and thus, there is scope for reducing the PMT length and saving memory. This is achieved by a register called the 'Page Map Table Limit Register (PMTLR)', which contains the number of pages in a process. There is one PMTLR value for each PMT, maintained in the register save area of the PCB for each process. Correspondingly, there is a hardware register PMTLR in the CPU.
At the context switch, like the other registers, PMTLR is also restored from the PCB. PMTLR is used for protection purposes, to detect any invalid references, as will be seen later. There are basically three methods of actually implementing PMTs:

In this method, the Operating System keeps all the PMTs in the main memory. The starting word address of the PMT for a process is known at the time the process is created, when its pages are loaded and a PMT is created and stored in the main memory. This address is also stored in the PCB. At the context switch, this address is loaded from the PCB into another hardware register in the CPU called the 'Page Map Table Base Register (PMTBR)'. This register is used to locate the PMT itself in the memory. If a process is swapped out and some time later it is swapped in again into different page frames, a new PMT may be created, and that too may be loaded at different memory locations. PMTBR will also change accordingly in this case. If we reserve one word for each PMT entry, word 0 within the PMT would correspond to virtual page number 0 of that process, word 1 would correspond to virtual page number 1, and so on. Therefore, given a virtual page number (P), the entry for P in the PMT can easily be found by computing PMTBR + P. For instance, if PMTBR is 500 for the PMT of a process (i.e. the PMT for that process starts at word number 500), then the entry for virtual page number 8 would be found at word number 508. This is on the assumption that each PMT entry is one word long. Both PMTBR and PMTLR are known at the time the process and its PMT are created. Thus, after the new process starts running, PMTBR and PMTLR point to the correct values of the PMT for that process. From the logical address, the physical address can now be computed fairly easily, as shown in Fig. 9.23. We will use the terms displacement (D) and offset interchangeably. The exact steps followed for Address Translation are described below. The same steps (a) to (g) are also shown in the figure for reference.

(a) The logical or virtual address is divided into two parts: the page number (P) and the displacement (D), as discussed earlier.

(b) The page number P is now checked for its validity by ensuring that it does not exceed the limit given by PMTLR. This comparison takes place in the CPU registers only, hence taking negligible time.

(c) If P is valid, it is used as an index into the PMT, whose starting address is given by PMTBR. Therefore, PMTBR + P gives the address of the required PMT entry. This addition takes place in the ALU, and hence, takes very little time.

(d) The selected entry of the PMT is fetched into a CPU register. This operation requires one memory access, because the PMT resides in the memory.

(e) The page frame number (F) is extracted from the selected PMT entry brought into the CPU register. This, again, takes negligible time.

(f) The original displacement (D) is concatenated to the F obtained in step (e) to get the final physical address F + D. This takes virtually no time, as it is done by the hardware itself in the CPU registers.

(g) This address is put on the address bus to locate the desired data item in the memory. This, again, requires a memory access.

The merit of this scheme is that it is a very simple method to comprehend and implement. It is also inexpensive. However, it is a very slow process. For every memory reference, an additional memory reference is required to fetch the appropriate PMT entry from the main memory into the CPU. Hence, as compared to a pure memory access time (tma), the pure software method of address translation requires at least 2 tma, thereby degrading the performance by a factor of at least 2.

A pure hardware method would use 'associative registers'. They are also known by other names such as 'associative memory', 'look-aside memory', 'translation look-aside buffer', 'content addressable memory' or 'cache'. (This cache is different from the normal cache memory.) The essence of this method is that the "table search" is done in the hardware itself, and hence, it is very fast. For instance, if P is supplied to this associative memory, F is output directly by the hardware itself in one shot. This F is then concatenated with D to give the resultant physical address. We have studied these registers under "Hardware Prerequisites" in Chapter 3. A pure hardware method of Address Translation works as follows (Figure 9.24 shows the corresponding steps for reference):

(a) The system will have as many associative registers as the maximum number of entries in any PMT, which is the same as the maximum number of pages that are possible in a program. For instance, in a machine whose address bus is 16 bits wide, the maximum number of addresses that it can generate is 2^16. This, theoretically, will be the maximum size of any program which can run on that machine. If the page size of that machine is 2048 = 2^11 words, the maximum number of pages that a program can have is 2^16 / 2^11 = 2^5 = 32. Therefore, the maximum size of the PMT is 32 words, if we reserve one word for each PMT entry. In the pure hardware method, we may have to have 32 associative registers which can carry out the table search to find F, given P, purely in hardware, in parallel, in one shot. In a machine with a still wider address bus, the maximum size of the PMT, and therefore the number of associative registers required, would also increase substantially. At any moment, only one process is running. When that process is dispatched, the associative registers are loaded with the PMT for that process.
This is how, even though there is one PMT per process, system-wide there is only one set of associative registers, which reduces the cost.

(b) The Operating System maintains a separate PMT for each process as before, with the starting address of the PMT (PMTBR) and the number of pages (PMTLR) stored in the PCB of that process.

(c) As a process is dispatched from the "Ready" to the "Running" state, the PMT for that process is located using its starting address (PMTBR) stored in the PCB, and the entries in the PMT are loaded into these associative registers as a part of the context switch. Each associative register now contains a virtual page number and its corresponding page frame number, ready to carry out the hardware table search.

(d) At the time of the execution of an instruction, the virtual address is separated into the page number (P) and the displacement (D), as seen earlier. This takes virtually no time.

(e) This P is now fed as an input to all the associative registers, after ensuring its validity, i.e. by ensuring that P does not exceed the limit given by PMTLR. This is shown in Fig. 9.24. This is a pure hardware operation, requiring virtually no time.

(f) In one shot, the corresponding page frame number (F) is output from the matching associative register. This requires a little time for the associative memory access (tama).

(g) The displacement (D) is concatenated to F, giving the resultant physical address, which is put on the address bus. This takes negligible time.

(h) After this, the physical memory is accessed as discussed earlier. This requires the usual memory access time (tma).
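To make these steps concrete, here is a small C sketch (illustrative names and sizes, not the actual hardware or the book's code) that combines the PMTLR validity check, an associative (TLB-like) search, and a fall-back to the memory-resident PMT located through PMTBR. The fall-back corresponds to the software method described earlier, and the combination anticipates the hybrid method discussed next.

#include <stdio.h>

#define PAGE_SIZE 2048u
#define TLB_SIZE  8u                       /* a small associative memory (assumed) */

struct tlb_entry { unsigned page, frame, valid; };
static struct tlb_entry tlb[TLB_SIZE];

/* The memory-resident PMT of the running process, located through PMTBR;
 * PMTLR gives the number of pages in the process (both are illustrative). */
static const unsigned *pmtbr;
static unsigned pmtlr;

/* Returns the physical address, or -1 for an invalid page reference. */
long translate(unsigned vaddr) {
    unsigned P = vaddr / PAGE_SIZE;        /* split the virtual address       */
    unsigned D = vaddr % PAGE_SIZE;

    if (P >= pmtlr)                        /* validity check against PMTLR    */
        return -1;

    for (unsigned i = 0; i < TLB_SIZE; i++)          /* associative search    */
        if (tlb[i].valid && tlb[i].page == P)
            return (long)tlb[i].frame * PAGE_SIZE + D;

    unsigned F = pmtbr[P];                 /* miss: one extra memory access   */
    tlb[P % TLB_SIZE] = (struct tlb_entry){P, F, 1}; /* remember it next time */
    return (long)F * PAGE_SIZE + D;
}

int main(void) {
    static const unsigned pmt[] = {5, 3, 9, 6};      /* PMT of Fig. 9.18      */
    pmtbr = pmt; pmtlr = 4;

    printf("%ld\n", translate(1 * PAGE_SIZE + 18));  /* page 1 -> frame 3     */
    printf("%ld\n", translate(1 * PAGE_SIZE + 18));  /* second time: TLB hit  */
    printf("%ld\n", translate(7 * PAGE_SIZE));       /* invalid page: -1      */
    return 0;
}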

Thus, the total time required for any memory reference = tama + tma. As tama is very small, the time overhead in this method is very low. This method is fast but expensive. This is because the associative registers are

not very cheap. Having a large number of associative registers is very expensive. Therefore, a via media is needed. The hybrid method provides such a mechanism. In this method, associative memory is present, but it consists of only 8, 16 or some other manageably small number of registers. This reduces the cost drastically. Only the pages referenced most frequently are kept in the associative memory, in the hope that they will be referenced again. The Address Translation in the hybrid method is carried out in the following fashion:

(a) The virtual address is divided into two parts: the page number (P) and the displacement (D), as discussed earlier. This takes negligible time.

(b) The page number (P) is checked for its validity by ensuring that P