These are the revised and illustrated notes of the Software Architecture lecture of the Master in Software and Data Engineering held at the Software Institute at USI Lugano, Switzerland.
English, 688 pages, 2023
Table of contents:
Preface
Acknowledgements
Introduction
De Architectura
The Art and Science of Building
Foundation
Platforms
Closed
Open
Interfaces
Forces
When do you need an architect?
How Large?
Software Architecture
Lecture Context
Why Software Architecture?
Hiding Complexity
Abstraction
Communication
Representation
Visualization
Quality
Change
Evolution
Quality Attributes
Quality
Functional, Extra-Functional Qualities
Internal vs. External
Static vs. Dynamic
Meta-Qualities
Quality Attributes
Design Qualities
Feasibility
Affordability
Slack
Time to Market
Modularity
Reusability
Design Consistency
Simplicity
Complexity
Clarity
Stability
Composability
Deployability
Normal Operation Qualities
Performance
Scalability
Capacity
Usability
Ease of Support
Serviceability
Visibility
Dependability Qualities
Reliability
Availability
Security Qualities
Defensibility, Survivability
Privacy
Change Qualities
Flexibility
Configurability
Customizability
Resilience, Adaptability
Extensibility, Modifiability
Elasticity
Compatibility
Portability
Interoperability
Ease of Integration
Long Term Qualities
Durability
Maintainability
Types of Maintenance
Sustainability
Definitions
Who is a software architect?
Functional Organization
Cross-Functional Organization
Facilitate Communication
Software Engineering Lead
Technology Expert
Risk Management
Architect Tribes
Software Architecture and the Software Development Lifecycle
Bridge the Gap
Think Outside the Box
Evolutionary Model
Agile Unified Process
System Lifecycle
Defining Software Architecture
Architecture vs. Code
Architecture vs. Technology
Architecture vs. Design
Basic Definition
Design Decisions
Design Process
Design Decisions
Decision Making Phases
Decision Trees
Design Outcome
Modeling Architecture
Can this skeleton fly?
Prescriptive vs. Descriptive Architecture
Green Field Development
Brown Field Development
Architectural Degradation
Causes of Architectural Drift
From Drift to Erosion
Entropy
Architecture or Code First?
Architecture Hoisting
Presumptive vs. Reference
Solution vs. Product
M-Architecture vs. T-Architecture
The $10000 boolean flag
Art or Science?
Science or Art?
References
Modeling
Capturing the Architecture
What is modeling?
Abstraction and Interpretation
Solving Problems with Models
Question first, Model second
Scope of the Model
What is a view?
How many views?
Multiple Views
View Consistency
Domain and Design Models
Modeling = Learning
Domain Model
Example Domain Model
Design Model
Example Design Model (Interfaces)
Example Design Model (Implementation)
Some Modeling Notations
Use Case Scenarios
Example Music Player Scenarios
Feature Models
Feature Model Example
Feature Model Constraints
Constrained Feature Model Example
Feature Configuration
From C4 to C5
System Context View
System Context View Example
Containers View
Container View Example
Example Containers
Components View
Components View Example
C4
Classes View
C5
Connectors View
Connectors View Example
4+1
Logical View
Logical View Notation
Example Logical View
Process View
Example Process View
Development View
Example Development View
Physical View
Example Deployment View
Content is more important than representation
Model Quality
Accuracy vs. Precision
Model Quality - Advice
Model-Driven Architecture
References
Modularity and Components
What is a Software Component?
Hardware Component
Software Component
Examples: Application-specific Components
Examples: Infrastructure Components
Black Box
Recursive Components
Clustering Components
Design vs. Run-time
Distributed Components
Component Lifecycle Decisions
Externally Sourced Components
Discovery
Selection
Integration
Test
Release
Deploy
Stateful Components
Migration
Backup and Recovery
Properties of Components
Component Roles
Stateless vs. Stateful
Stateless vs. Stateful Code
Stateless vs. Stateful Operations
Components vs. Objects
Component Technology
Component Frameworks
Component Frameworks Demo Videos
Where to find components?
Buy vs. Build
How much does it cost?
References
Reusability and Interfaces
Interfaces
Component Interface
Provided Interface
Provided Interfaces and Component Roles
Required Interface
Explicit Interfaces Principle
Information Hiding
Effective Encapsulation
Example
Describing Interfaces
Principle of Least Surprise
Easy to use?
Interface Description Languages
Java/RMI
C/RPC
RAML
OpenAPI/Swagger
Working With IDL
API Documentation Demo Videos
What is an API?
Is it really API?
Many Applications
Developers, Developers, Developers
Where to find APIs?
Operating Systems
Programming Languages
Hardware Access
User Interfaces
Databases
Web Services
API Design
Where is the API?
API Design: Where to start?
Who to please?
Reusable Interfaces
Usability vs. Reusability
Easy to reuse?
Performance vs. Reusability
Small Interfaces Principle
How many clients can these APIs satisfy?
Uniform Access Principle
Few Interfaces Principle
Clear Interfaces Principle
Let's create a new Window
Expressive? No: Stringly Typed
Consistent?
Primitive Operations
Design Advice
References
Composability and Connectors
Software Connectors
Connector: enabler of composition
Software Connector
Components vs. Connectors
Connectors are Abstractions
Connector Roles and Runtime Qualities
Connectors and Transparency
Connector Cardinality
Connectors and Distribution
Connectors and Availability
Software Connector Examples
RPC: Remote Procedure Call
File Transfer
Shared Database
Message Bus
Stream
Linkage
Shared Memory
Disruptor
Tuple Space
Web
Blockchain
References
Compatibility and Coupling
Compatibility
Compatible Interfaces
There's an app adapter for that!
Adapter
Wrapper
Mismatch Example
Partial Wrappers
Types of Interface Mismatches
Synchronous vs. Asynchronous Interfaces
Half-Sync/Half-Async
Sync to Async Adapter
Half-Async/Half-Sync
Async to Sync Adapter
How many Adapters?
Scaling Adapters with N Interfaces
Composing Adapters
One or Two Adapters?
Reusable Adapters and Performance
How Standards Proliferate
On Standards
Standard Software Interfaces
Representation Formats
Operations
Protocols
Addressing
Interface Description
Coupling
Understanding Coupling
Coupling Facets
Session Coupling Examples
Binding Times
Be liberal in what you accept, and conservative in what you send.
Water or Gas Pipe?
References
Deployability, Portability and Containers
The Age of Continuity
Deployability Metrics
Release
Release Frequency
Speed vs. Quality
Software Production Pipeline
High Quality at High Speed
Types of Testing
Types of Release
Gradual Phase-In
Essential Continuous Engineering Practices
Tools
Build Pipeline Demo Videos
Container Orchestration Demo Videos
Virtualization and Containers
Virtualization
Lightweight Virtualization with Containers
VM vs. Container
Containers inside VMs
Images and Snapshots
Virtual Machine Migration
Inverted Hypervisor
Virtual Hardware = Software
References
Scalability
Will it scale?
Scalability and Workload
Scalability and Workload: Centralized
How to scale?
Scalability and Resources: Decentralized
Scalability and Resources
Centralized or Decentralized?
Scalability at Scale
Scale Up or Scale Out?
Scaling Dimensions
Scalability Patterns
Directory
Dependency Injection
Directory vs. Dependency Injection
Scatter/Gather
Master/Worker
Master Responsibilities
Worker Responsibilities
Load Balancing
Variants
Sharding
Computing the Shard Key
Looking up the Shard Key
References
Availability and Services
Components vs. Services
Business model: how to sell
Design decisions
Technology: how to use
Availability
Availability Questions
Monitoring Availability
Which kind of monitor?
Availability Incidents
Downtime Impact
Contain Failure Impact
Retry
Circuit Breaker
Canary Call
Redundancy
State Replication
Which kind of replication?
CAP Theorem
CAP Theorem Proof
Eventual Consistency
Event Sourcing
References
Flexibility and Microservices
API Evolution
Only one chance...
API Evolution
API Compatibility
Semantic Versioning
Changes and the Build Pipeline
Version Identifier
Two in Production
API Sunrise and Sunset
To break or not to break
Who should keep it compatible?
Layers
Layering Examples
Data on the Inside, Data on the Outside
Tolerant Reader
Which kind of reader?
Extensibility
Extensibility and Plugins
Microservices
Monolith vs. Microservices
Will this component always terminate?
Will this service run forever?
Will this microservice continuously change?
DevOps
DevOps Requirements
Feature Toggles
How small is a Microservice?
Continuous Evolution
Hexagon Model
Decomposition
Independent DevOps Lifecycle
Isolated Microservices
Splitting the Monolith
Microservice Best Practices
Bezos's Mandate (2002)
Evans's Bounded Context (2004)
Bob Martin's Single Responsibility Principle (2003)
UNIX Philosophy (1978)
Separation of Concerns (1974)
Parnas's Criteria (1971)
Conway's Law (1968)
Vogels's Lesson (2006)
References
Software Architecture visual lecture notes
Cesare Pautasso
This book is for sale at https://leanpub.com/software-architecture. This version was published on February 20, 2023. We plan to update the book from time to time; you may check on LeanPub whether you still have the latest version.
©2020–2023 Cesare Pautasso
Preface

This book collects the revised and illustrated notes of the Software Architecture lecture of the Master in Software and Data Engineering held at the Software Institute at USI Lugano, Switzerland, during the Spring of 2020. The book includes the script for the following lectures:

1. Introduction
2. Quality Attributes
3. Definitions
4. Modeling Software Architecture
5. Modularity and Components
6. Reusability and Interfaces
7. Composability and Connectors
8. Compatibility and Coupling
9. Deployability, Portability and Containers
10. Scalability
11. Availability and Services
12. Flexibility and Microservices

The lecture is designed for master students attending the second semester of the MSDE master program, with a background in data management, software engineering, software design and modeling, domain-specific languages and programming styles.
Acknowledgements

I started to teach Software Architecture and Design back in the Spring of 2008, encouraged by Mehdi Jazayeri, the founding Dean of the USI Faculty of Informatics, who shared with me a preview of Software Architecture: Foundations, Theory and Practice by Richard N. Taylor, Nenad Medvidovic, and Eric Dashofy, to whom I will always be indebted.

I would not have been successful in teaching Software Architecture without the insightful experience and practical advice on architectural decision making shared by Olaf Zimmermann, whom we had the pleasure of welcoming to Lugano as a guest speaker in many editions of the lecture. It has been really great to harvest Microservice API Patterns for the past four years together with Olaf as well as Mirko Stocker, Uwe Zdun and Daniel Lübke.

During later revisions of the lecture I started to recommend to my students to read and study Just Enough Software Architecture: A Risk-Driven Approach by George H. Fairbanks (2010) as well as the more recent Design It!: From Programmer to Software Architect by Michael Keeling (2017).

It is still a challenge to model something as complex as a software architecture. But since Architectural Blueprints—The “4+1” View Model of Software Architecture was proposed by Philippe Kruchten (1995), we have come a long way with, for example, Simon Brown’s C4 model (context, containers, components and code), which as part of this lecture has been augmented with connectors (C5).

The lecture also includes valuable concepts, good ideas and their powerful visualizations borrowed from: Joshua Bloch, Grady Booch, Jorge Luis Borges, Eric Brewer, Peter Cripps, David Farley, Martin Fowler, Luke Hohmann, Gregor Hohpe, Jez Humble, Ralph Johnson, James Lewis, Rohan McAdam, M. Douglas McIlroy, Bertrand Meyer, David Parnas, Jon Postel, Eberhardt Rechtin, John Reekie, Nick Rozanski, Mary Shaw, Ian Sommerville, Michael T. Nygard, Will Tracz, Stewart Brand, Werner Vogels, Niklaus Wirth, and Eoin Woods.

The lecture has achieved its current shape also thanks to the feedback of many generations of students attending our Master of Software and Data Engineering and the generous support of my past, present and future teaching assistants (in chronological order): Romain Robbes, Alessio Gambi and Danilo Ansaloni, Marcin Nowak, Daniele Bonetta, Masiar Babazadeh, Andrea Gallidabino, Jevgenija Pantiuchina, and Souhaila Serbout. A big thank you to everyone!

Since 2013, the entire lecture has been delivered through ASQ, a tool for Web-based interactive presentations developed by Vasileios Triglianos, which now can also automatically generate books like the one you are reading.
Cesare Pautasso
Lugano, Switzerland
Software Architecture
Introduction
Contents

• Architecture and other Metaphors
• Architectural Forces
• When do you need an architect?
• Why software architecture?
• Course Overview
De Architectura

• Durability: the building should last for a long time without falling down on the people inside
• Utility: the building should be useful for the people living in it
• Beauty: the building should look good and raise the spirits of its inhabitants

Vitruvius, 23 BC
Software architecture, like software engineering, is a metaphor that we borrow from another discipline and bring into Informatics. There is something that we do when we develop software at scale with large teams, when we need to make certain kinds of important decisions, which reminds us of what architects do. We call those decisions architectural.

How well does the architecture metaphor fit with software? There are some similarities and there are some mismatches. So before we talk about software architecture, we want to spend a little bit of time introducing what it is from architecture that we can bring in and learn from. After all, software architecture is a relatively young discipline; architecture has been around for much longer.

For example, one of the basic ideas of architecting buildings is that you should know how to design a building that can last for a long time. There are buildings from thousands of years ago that are still standing today, built using stones only. As with every technology that you choose to adopt, you should know its limits and you should know when it is applicable and when it is not. You should be confident that the building will not collapse while or after you attempt to build it.

The second principle is that what you build should be something useful: a bridge which connects two sides across a river, making it possible to cross it; a house for people to live in; a university with classrooms for students.

The third one concerns the effect of the building on the people inside. Does it provide a welcoming place for people to live in? Do people enjoy working in it? Or do people want to escape from it as soon as they can?
The Art and Science of Building

• Architects are not concerned with the creation of building technologies and materials—making glass with better thermal qualities or stronger concrete.
• Architects are concerned with how building materials can be put together in desirable ways to achieve a building suited to its purpose.
• The design process that assembles the parts into a useful, pleasing (hopefully) and cost-effective (sometimes) whole is what is called “architecture”.
Architects are not concerned with how to create new and improved building technologies. They are also not concerned with the actual construction of the buildings themselves. They are mostly interested in how to select and compose the right tools, introduce the right elements, and connect them appropriately. How do you establish the right kind of relationships between the elements so that you get a good building that fits its purpose?

While doing this, the architect is making the decisions recorded in the building’s plan, which will be executed by someone else to construct the building. If you would like to learn how to be an architect, you need to learn how to plan, and you will learn how to do it before the actual building exists. The same holds for software architecture: it is full of ideas that you should be able to express in your design for the software before it is built, so that someone else can go and write the actual code for it.

As an experienced individual developer, you may wonder how this works: where does the architect fit between the customer requirements and my work writing the software? The architect is key to leading the development effort and scaling it so that you can have hundreds of developers working on the same project, while making sure these developers agree on what the architecture for this code should look like.

Overall, this is what we call architecture: a kind of design process, with selection and composition activities, but a process rather different from simply writing and testing the code.
“Architecting, the planning and building of structures, is as old as human societies – and as modern as the exploration of the solar system.”

Eberhardt Rechtin, 1991
Architects follow a process going from a concept, represented with sketches, which are refined into mockups and detailed plans, which can then be built. A successful result of the whole process will embed and highlight the original concept. Following such a process helps to catch and correct mistakes early, explore different options, and validate the concept while also taking into account its relationship with its context (no city building exists in isolation from the ones surrounding it). Likewise, it helps to obtain buy-in from the customer, who can attempt to influence the outcome of the process. You should learn how to sketch the architecture of a software system. You should learn how to refine your sketch and turn it into a plan so that somebody else can write its code.
Another concept that we borrow is the idea that when we build something, we need a foundation for it. This is the point of contact with the ground, the surface that supports the building. Also when you write your software, you are building it on top of a foundation made of software and hardware. It is important to be aware of the kind of foundation you select to deploy your software on.
Even if we do not usually call it a “software foundation”, there are many different “platforms” on which you can deploy your software and for which you can write it. The latest metaphor to describe a foundation for software is not based on something solid like the ground; instead, nowadays our software runs in the clouds. For some reason this seems to be a good way to sell access to so-called “platforms as a service” (PaaS). You place your software in the Cloud, and this somehow gives you a stable foundation you can rely on.

Which platforms have you built software for? Is a programming language like Java a platform? The language can be part of a platform. Still, while Java was marketed as platform independent, it actually is a special platform for writing software which can then run on multiple platforms. This is where the foundation metaphor starts to show its limits.
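To see how even a “platform-independent” platform still rests on the platform beneath it, here is a tiny, hypothetical Java probe (the class name `PlatformProbe` is illustrative, not taken from the lecture): the same bytecode runs on any JVM, yet the underlying operating system leaks through the JVM’s standard system properties.

```java
import java.io.File;

public class PlatformProbe {
    public static void main(String[] args) {
        // Identical bytecode on every machine, but the foundation
        // underneath still shows through the standard properties.
        System.out.println("OS:             " + System.getProperty("os.name"));
        System.out.println("Architecture:   " + System.getProperty("os.arch"));
        System.out.println("Java version:   " + System.getProperty("java.version"));
        System.out.println("Path separator: " + File.separator);
    }
}
```

Running this sketch on two different machines prints two different foundations under the same “platform-independent” program, which is exactly the limit of the metaphor.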
Successful platforms last for a long time. Successful platforms attract developers, who flock from platform to platform. Unsuccessful platforms are discontinued, as it is not economically viable to maintain and provide support for them. Platforms also evolve as their underlying technology improves, gets faster and cheaper. Planned obsolescence, on the other hand, is used to quicken platform upgrade cycles by force, so that it becomes possible to make platform adopters pay again and again, especially when there is only one platform left in the market.

Eventually, you come to the realisation that nothing is built on a solid foundation. Everything is built on shaky ground. But we have to pretend that when we decide to invest in building a piece of software, we can do so by reusing some kind of platform and trusting or hoping that it will remain there long enough. It is no longer feasible for a single developer to design their hardware, define a processor instruction set, build their own computer, invent a programming language, write a compiler and a standard library for it, use it to write the operating system for such a computer, and still have enough time left to write useful apps to run on it.
Another metaphor that we use when we talk about architecture is the notion of a boundary. The first thing you should define is the system’s boundary, for example by drawing a box delimiting the inside of your system from the outside, the rest of the world.

This boundary can be closed if your system is isolated and does not exchange any information with the outside. You have complete knowledge about the state of the system, and there is no information that you have to go somewhere else to get. Another way to look at the closed metaphor is to consider the functional requirements that the closed system should satisfy. For example, consider the game of chess or the game of Go: you can list the rules, you know exactly what they are, and the rules are not changing. These rules are limited and well known in advance. This makes it possible to solve the game.
9
Your system is no longer closed when there is uncertainty about its boundary. Not only may requirements be partially known, but they may change in the future in unexpected ways. Let's say you build an application, publish it on the open Web, and it goes viral. A few days after you launch, you get 2 million users that you did not anticipate. But what if it never happens? Should every architecture be designed to scale to "Web scale"? Or is it more important to be able to grow and adapt your system when it becomes necessary? Enter the Internet. Everything becomes connected to everything else. Today every system is an open system. Whether you want it or not, you cannot build a wall around your system: it needs to be integrated with other systems to be useful. Your system becomes interdependent and interconnected with other systems over which you have no control. Service providers may disappear without notice or change their terms of service (e.g., switching from a free to a paid service, or a stiff price increase), making it no longer possible for your system to work with them.
10
When we have open systems which connect with other systems, the points where their boundaries interact become very important. We call these interfaces, and we will dedicate some time to dissect the issue of how to build a good API. This biological picture represents a slice through a vegetable whose cells are quite large. Every cell has a boundary and an interface towards the other cells. Life has taken a long time to find out how to evolve from single-celled organisms into multicellular ones. It is not a trivial problem for heterogeneous, autonomous and distributed systems to communicate, interact and inter-operate so that you can use them as building blocks to construct something larger, something whose value is greater than the sum of its parts. This becomes particularly difficult when each system is built by people who do not know one another, are unable to talk to one another, and have a tendency to apply changes without informing the others in advance.
11
Forces
Viability: Time, Resources, Costs
Desirability: Requirements
Feasibility: Technology
Architecture: Aesthetics, Simplicity, Elegance
What are the challenges of designing an architecture that you, as an architect, have to deal with? Your architecture should be useful: how do you find out exactly what the requirements are? What is the problem you are supposed to solve? It is usually not your problem, but somebody else's: how do you talk to your customer in the right way, so that you can understand what they want you to do? Since software can be applied across any domain, this is particularly hard, unless you become an expert in the particular domain. Assume you understand what you need to solve, and assume you can solve it with software. Will you be able to do so within a budget? Can you build it within a reasonable amount of time, without spending too much money? Can you always speed up your project by adding more developers to the team? Be careful not to fall into the mythical man-month trap. Once you have a reasonable budget, you still need to pick a suitable technology platform. Learn more than one technology, and be critical: every technology you choose has limits, and these limits affect where it is applicable and where it should not be applied. The choice of an expensive technology will impact your budget; an immature or unknown technology will reduce your team's productivity, although developers tend to be motivated to play with shiny new tools. Finding a viable, feasible and desirable solution is also what engineers do. What sets architects apart is their focus on making the solution as simple and elegant as possible. Everything else being equal, never underestimate how challenging it is to find an essential, beautiful and simple architecture. This is something worth striving for, even if it may not be so obvious what makes a piece of software beautiful; with enough experience you should be able to tell the difference. For example, interfaces may be used to hide the ugly internal details. Developers will enjoy or hate working with a certain language or library: you can help them come to your platform by making it simple and easy to learn. However, doing this is hard.
12
It depends. Is it a first-of-a-kind project? Or is it something your team has done many times before? Also, it depends on how large and complex your software is going to be.
13
Small One Person can build it • Minimal planning • Basic tools • Simple process
Medium An experienced team can build it • Modeling plans • Power tools • Well-defined process
14
Large Do not try this without an architect
Ultra-Large Was this designed by an architect?
15
CodeCity, Richard Wettel
How Large? Always choose the appropriate tools for the size of the project you are working on
16
When you start a project, ask yourself: how large is it going to be, and how long is it going to stay alive? Then choose the right team organization and the appropriate tools for the job.
How Large? • How much time to build it? • How many people in the project? • How much did it cost? How to measure software size? Lines of code (LOC), GB, $
There are many different ways to measure the size of software. For example, you can ask: how long does it take to build it? This is actually quite ambiguous. You can define it as the development wall time, starting from the very first git init command, when no code exists yet, and checking how long it takes to write all the code. Or, if you are using DevOps and continuous integration: how long does it take to build and release it? I push a change, go home, and come back the next morning to a fresh new build. Or I push a change and see the effect instantaneously, as in a live programming environment. It is a totally different developer experience. Small systems can be written by one developer, but most are written by teams. You can have larger and larger teams, but how do you organize 5000 people working on the same system?
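How would you actually count lines of code? As a toy illustration (the class name and the choice to ignore blank lines are my own assumptions, not a standard definition of LOC), a few lines of Java are enough to get a rough measure of a codebase:

import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

// Minimal sketch: measure "size" as physical lines of code in .java files.
// Even this crude metric requires decisions: do blank lines or comments count?
public class LocCounter {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        try (Stream<Path> files = Files.walk(root)) {
            long loc = files.filter(p -> p.toString().endsWith(".java"))
                            .mapToLong(LocCounter::countLines)
                            .sum();
            System.out.println("Total LOC: " + loc);
        }
    }

    private static long countLines(Path file) {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.filter(line -> !line.isBlank()).count(); // skip blank lines
        } catch (IOException e) {
            return 0; // unreadable file: ignore it in this rough estimate
        }
    }
}

Note how even this tiny sketch embeds debatable decisions (should comments count? generated code?), which is exactly why LOC is only one of several possible size measures.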
17
What is the largest software project you have been working on?
[Survey chart: largest project worked on, plotting Time (1 month to 1 year) against Team size (1, 2, 10, 100 developers)]
Software Architecture
As the size and complexity of a software system increase, the design decisions and the global structure of a system become more important than the selection of specific algorithms and data structures.
18
As the size and complexity of your software grow, something gradually becomes more important than knowing which algorithm or which data structure to pick. Going back to the building metaphor, algorithms and data structures are like the choice of the type of electrical wires, plumbing pipes or window glass.
Lecture Context • Software Engineering • Programming Languages • Algorithms and Complexity • Databases • Software Architecture • Component-based Software Engineering
How to build components from scratch
How to design large systems out of reusable components
Most of the courses you have studied give you the impression that you will build everything from scratch every time. Design a new database schema. Write a new program. The real world is not like this. You don't often go into a software company to write an application from scratch. You work on their existing code, and the existing code can have millions of lines. What you build is usually an extension or an integration: a new piece that connects with other pieces. We assume that you know how to build the individual parts. But once we have all these components, how do we integrate them? How do we connect them so they can work together? In this lecture, we talk about the big picture.
19
To motivate the need to learn how to work with software architecture, we need to place ourselves in the right context, where one or more teams of developers are working on a large piece of software. How do we keep it alive and well, but also under control?
20
Hiding Complexity
You should be able to work with powerful concepts that carry a lot of information underneath. You can still talk about them because you have learned the right technical terms, which describe something complex in a very compact way.
21
Abstraction
The difficult part is to know which elements to leave out and which to emphasize
In the geography domain, you can have a satellite picture or a map. Which one represents the territory, the real thing? When we build a model, we are trying to capture the information that is necessary to solve certain problems. Here, for example, we have a street map, and you want to know: how do I go from A to B? Should I take the highway? The difficult part when you abstract is to know which information to keep in the model and which information to ignore. When you draw a picture of your software, which has a million lines of code, you have to decide which lines are important and which are not, which ones to visualize and which ones to leave out. In the example, we have three different representations, which in an architectural model we call viewpoints or perspectives. They show different aspects of the same architectural model. It is important that all representations, while showing different aspects, are consistent. Otherwise, the developers that look at one map will do something different than the ones looking at a different map, because they read an inconsistent message about the system they are trying to build. At the end of the day, the code is the ultimate source of truth about a software system: what you write in the code embeds all the decisions that were made to design it. However, how can you gain an understanding of millions of lines of code? It is impossible to read every single one. This is where using abstraction to manage complexity helps: you can see the software architecture as a map to navigate large amounts of code.
22
Communication
Business Architecture
IT
A second very important aspect of architecture is enabling communication between different team members. The goal is agreement between the developers about what to do, so that they will build the same system and write code that can be integrated together. But there is also a need for communication between the customers, the business side of the world, or the domain experts, and the technical people. The architect is the point of contact, who should be able to translate back and forth. As an architect, you have to be an expert in the technology: you need to know how to talk to developers and be able to tell them how to build the software. But you also have to be able to talk to the business side and explain why you need a certain amount of time for a project, or why you need to pay so much money to so many developers. This may be a challenge: if you just study software and data engineering, you are very strong on one side, but do keep in mind that as an architect you also need to be able to bridge this gap.
23
Representation
Architecture visualization was born with those famous boxes-and-lines diagrams. Sometimes the discipline is criticized because it seems architects spend their time drawing and looking at these diagrams: how is that connected with real software? However, when you have very large software, the argument is that to describe and understand it, it is important to be able to represent it: not with a programming language, but with some kind of diagram following some kind of visual notation. There are many notations; UML is not the only one. This one is more like a data flow diagram, showing how data is transformed while in transit between different storage elements. Whatever notation you choose, the visualization works if it tells you a story, if it explains how the software works and how it is built. So let's look at this example. You should be able to tell me the name, or at least the purpose, of the system. A search engine? Yes. Why? It shows many individual functions (the bubbles) that you would expect to find in such a system, and also, if you look at the overall flow, there is a loop for crawling the Web, and everything leads down into the searcher. Between some functions you have state (the barrels), which functions read from and write into. Why are there so many barrels? Because it is important to show, at this architectural level, that the index is huge: they probably use sharding and also need fast access to it to improve search query performance. Exactly which search engine is this diagram representing? This is actually an early picture of Google; you can tell by the "Pagerank" bubble, which takes the Web link topology into account when ranking the search results. This was their key innovation. Sometimes it is enough to add one new component to an existing architecture to make a difference.
24
Visualization
There is, of course, a whole branch of software architecture visualization, of which the layer cake is one of the most famous examples. This is also the classical way to explain the connection between operating systems and applications, the different roles of user space and kernel mode, and to illustrate how hardware is abstracted away through several layers of software. This type of visualization also shows the concept of "foundation" on which your application is built: user programs use the services provided by the "underlying" infrastructure. What is also important in this picture is that every point of contact between the layers becomes an interface. Choosing a layered architecture gives you a fundamental way not only to subdivide the system, but also to decide how the different parts are connected together (and which parts cannot be connected) and what kind of interfaces they should provide.
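As a small illustration of that last point, here is a hypothetical Java sketch (all names invented) in which every point of contact between two layers is an explicit interface, so a layer depends on the contract of the layer below, never on its implementation:

// Lowest layer: abstracts the storage "foundation"
interface Storage {
    byte[] read(String key);
    void write(String key, byte[] data);
}

// One possible foundation; could be swapped for, say, a CloudStorage
class FileStorage implements Storage {
    public byte[] read(String key) { /* e.g., read from disk */ return new byte[0]; }
    public void write(String key, byte[] data) { /* e.g., write to disk */ }
}

// Middle layer: application services, visible to the user programs on top
interface UserService {
    String profileFor(String userId);
}

class DefaultUserService implements UserService {
    private final Storage storage; // the only point of contact: the interface
    DefaultUserService(Storage storage) { this.storage = storage; }
    public String profileFor(String userId) {
        return new String(storage.read("profile:" + userId));
    }
}

Swapping FileStorage for another implementation does not ripple upwards, because the layer above was never allowed to touch anything but the Storage interface.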
25
Visualization Load Balancer (assigns a web server)
Web Server (PHP assembles data)
Memcache (fast, simple)
Database (slow, persistent)
Another layered architecture, a bit simpler than the previous one. This is the type of picture that you would see if you joined the company: an overview shown when they start explaining the architecture of their system. What system does it represent? Something online, because there is a Web Server. Something that needs to scale to handle lots of users, because there is a load balancer. Since it mentions PHP, it could be Facebook. These are all valid guesses, but independently of the actual name of the system, we can look at the diagram and think about the most important concerns of the architects who designed it. How can this describe Facebook without mentioning the social network graph of friends? Well, this picture tells a different story. It says: we have billions of users and we need to scale. We need a load balancer; we use PHP (but we rewrote the compiler to improve performance); we use a database, but we put a memory cache in front of it to speed up read operations. If you are the architect in a company that went from 0 to billions of users in a short time, your core concern is how to scale. This scalability problem is one of the many possible qualities that you should worry about as an architect.
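The read path implied by this diagram can be sketched as the classic cache-aside pattern. This is an illustrative toy (the names and the in-memory map standing in for memcache are invented), not actual Facebook code:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Cache-aside read: try the fast cache first, fall back to the slow
// database on a miss, then populate the cache for the next reader.
class CacheAsideReader {
    interface Database { String query(String key); } // stands in for the persistent store

    private final Map<String, String> cache = new ConcurrentHashMap<>(); // stands in for memcache
    private final Database database;

    CacheAsideReader(Database database) { this.database = database; }

    String get(String key) {
        String cached = cache.get(key);
        if (cached != null) return cached;   // cache hit: no database round trip
        String value = database.query(key);  // cache miss: slow, persistent path
        cache.put(key, value);               // warm the cache for later reads
        return value;
    }
}

The architectural decision visible in both the diagram and the sketch is the same: absorb most read operations in the fast layer so the slow, persistent layer survives the load.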
26
Quality
Cost Functionality Elasticity
Ease of support
Compatibility Time to Market Reliability/Availability
Testability
Security
Usability
Dependability
Performance Scalability
Ease of Integration Customizability Resilience/Maintainability
Portability Reusability
You may have heard about the difference between functional and non-functional requirements. If you satisfy the functional requirements, you have a correct system. If you satisfy the non-functional requirements, you have a non-functional system, a system that doesn't work. In this lecture, we use the term "extra-functional" requirements to describe all qualities of a system that go beyond its basic correctness. A functional system works, but how well does it work? If you want to satisfy your customer, delivering a correct solution is necessary, but is it sufficient? What if you deliver a super scalable solution (like in the Facebook example) but forget to implement the "invite a friend" feature? How do you write a program that is correct? There are plenty of lectures where you can learn that, so I assume that you know how to do it. What we are going to do here is look at everything else. For example, once you deliver it to the user, your software should be usable. You can study human-computer interaction (HCI) to learn how to design user interfaces that take advantage of what the hardware can do (e.g., keyboard and mouse, touch, VR/AR, or voice) and how to choose the appropriate one for the type of system you are going to build. What happens if something is not usable? People need to be trained to learn how to use it. So it is more expensive to start using the system, because people cannot use it by themselves: they need to go to a training course. However, when users get in trouble, they ask for help; so how easy is it to support your users? Do you have a way to actually see what they are doing in the system, so you can help them if they get stuck? Another quality is how expensive your software is going to be. Are you going to sell it for a certain price? If you give it away for free, then you don't make any money with it. How are people supposed to be supported while using it? Or, in other words, if you have a business model where the software is free, how are you supposed to pay the salaries of the developers that build and maintain it? It is also an important decision: what is the business model behind your software? Do you offer free software, or do people have to pay? Is there a one-time license, or a monthly subscription? Instead of writing software and selling it to end users, you may write a component that is sold to be used within other systems, and collect
27
royalties. You can also sell it as a service: people pay you, and depending on how much traffic they send you, you charge them in their monthly bill. This is what happens with service-oriented architectures and cloud computing. Resilience determines how easy it is to keep maintaining the system. For example, when a user tells you there is a bug, how much work does it take to fix it? When they have new feature requests, how easy is it to extend the system? What is the impact of change within your architecture? Maybe you have designed and built the perfect system, but you should never touch it again, because it is not possible to change it. Maybe you lost the source code; maybe the system is still working fine, but nobody has the source code anymore, so you need to call the reverse engineering consultants to do the retrofitting. Another aspect, different from usability, is how reusable your software is. What is easy to use for a certain application may be difficult to reuse for another one. Usually, when you make it reusable, it tends to be very general: you can transport it and apply it in many different contexts, but then it is more complex, because it has a general interface, which is more difficult to learn and thus less usable. Reusability also interacts with your business model. If you are in the business of selling software as a component, then you want these components to be highly reusable, so you can sell them to a lot of customers. If you build a highly customized component for one customer, specialized for one particular context, then you cannot reuse it anywhere else, so the price will be higher. What about security? It is a very big topic, and we are not going to go very deep into it, but it is another important aspect to consider. What kind of information does your software work with? Does it hold any sensitive data? Is it important that only some users can do certain things, as opposed to others? Then we need to put some kind of access control into it. How available and how reliable is the system that we built? Once people want to use your software, is it going to be there for them when they need it? Or will they have to live with the uncertainty that, when they need the software now, maybe it is not working and they have to come back later and hope for the best? As an architect, you have to worry about whether your design is fault tolerant, and then deploy it in the right kind of environment with enough redundancy. This is covered under the broader quality of dependability. There are many other qualities. Sometimes you have the problem of interfacing a system with another one: how compatible is it? How easy is it to integrate? How portable is it? How much does it cost to redeploy the system across different platforms? How easy is it to test my system when I want to make sure it is correct? Do I have tests for it? How expensive is it to write tests with good coverage? How easy is it to customize my system? Does it scale to many users? How good is the performance? These are the usual examples when you think of requirements that go beyond correctness. Of course, we have a lecture on software performance. You may have the most scalable and performant system, but it is no good unless you can ship it today. What if you need to change something, fix a bug? How long does it take to fix it and then deliver the fixed version to the customer? Last but not least, how elastic is your system?
Assume we run it in the cloud: can it scale up just enough to handle the load, and then shrink back to a small configuration so that it is cheaper to run? These are all qualities of a software system, and as an architect you should see yourself inside the circle with all these arrows pointing at you. You need to decide which ones you are going to consider and which ones you are going to push back against. We decide we will make it really secure, but we have to sacrifice a little bit of usability, because people have to enter a password or scan their faces. We have to sacrifice a little bit of
28
performance, because we need to encrypt the messages. Or we will make it portable, so that we can sell it to many customers, but it will need to be virtualized and will run a bit slower. The message is that you cannot have the perfect system where every quality is maximized, so you always have to balance and trade off one quality against another. One thing that we will practice is how to make these trade-offs between different qualities. You can also look at this picture as a way to structure your path into the topic of software architecture. We will spend some time going over the qualities and defining them in detail, and then we will discuss how to design architectures which can deliver each quality in a way that should be independent of the way you program the code. That is the main idea for this lecture.
29
Software is expected to change. Software is meant to change. The software architecture includes all the design decisions about the software which are hard to change. If you decide to write your code in a certain programming language, you may not be able to afford to spend six months developing your system and then go back on your decision and switch to another language. Once you make an architectural decision, invest in a technology, and start moving in a certain direction, it becomes more and more difficult to switch or steer your project onto another path later on.
30
Evolution Architectural decisions: hard to change later on and that you wish you could get right early in a project
[Shearing layers diagram: Site, Structure, Skin, Services, Space Plan, Stuff (Stewart Brand, Ralph Johnson)]
Architectural decisions are important because, being difficult to change, you have to get them right the first time you make them. You usually cannot change the location of a building after you have started to build it. It is difficult to add more floors on top of an existing building, unless the structure has been originally designed for the building to grow. It is easy to repaint a facade to change its color. It is more difficult, for example, to add a kitchen or a bathroom where the plumbing is not available. It is also possible to knock down non-load-bearing walls if you need a bigger room, or want to open or close a window. And some furniture is meant to be rearranged on the fly. When you think about your software, think about which elements are like the chairs, easy to change, and which elements are like the columns, which shouldn't be touched or everything will collapse.
31
Why Software Architecture? • Manage complexity through abstraction • Communicate, remember and share global design decisions among the team • Visualize and represent relevant aspects (structure, behavior, deployment, …) of a software system • Understand, predict and control how the design impacts quality attributes of a system • Define a flexible foundation for the maintenance and future evolution of the system
Course Overview • Theory • Modeling exercises • Technology demos • Design workshops
32
Modeling exercises • Learn how to sketch, refine, communicate and defend your architecture from multiple perspectives • Describe an idea for your next software project, or represent an existing one • One model presented by each student
Technology Demos • Learn about how software architecture meets the code • Give a convincing demo of a component framework, a connector, a continuous integration tool, a deployment tool • One video for each student
33
Design Workshop • Learn about how to make architectural decisions • Present your alternative • Compare it with another one, argue why your team should adopt it • One design discussion for two/three students (one student per alternative)
References • Michael Keeling, Design it! From Programmer to Software Architect, Pragmatic Bookshelf, 2017 • Michael T. Nygard, Release It! Design and Deploy Production-Ready Software, 2nd edition, Pragmatic Bookshelf, 2017 • Richard N. Taylor, Nenad Medvidovic, Eric M. Dashofy, Software Architecture: Foundations, Theory and Practice, John Wiley & Sons, 2009 • Simon Brown, Software Architecture for Developers, LeanPub, 2015 • George Fairbanks, Just Enough Software Architecture: A Risk Driven Approach, M&B 2010 • Eberhardt Rechtin, Systems Architecting: Creating and Building Complex Systems, Prentice Hall 1991 • Henry Petroski, Small Things Considered: Why There Is No Perfect Design, Vintage 2004 • Ian Gorton, Essential Software Architecture, Springer 2004 • Christopher Alexander, The Timeless Way of Building, Oxford University Press 1979 • Stewart Brand, How Buildings Learn, Penguin 1987 • Amy Brown and Greg Wilson (eds.) The Architecture of Open Source Applications, 2012
34
Software Architecture
Quality Attributes
2
Contents • Internal vs. External Quality • Meta-quality • Quality attributes along the software lifecycle: design, operation, failure, attack, change and long-term
35
Quality Defective
Required
Desired
Ideal
Quality is the reason why we need software architecture. If we want software that has a certain quality, we need to design it into the system. Later, we will need to make sure that we actually get the quality after we write the code. But before we start writing the code, we need to think about: what is the goal? What do we want to get out of the system? Depending on the context, the customer, and the users, the required quality will change. When we talk about quality, we talk about something that we can observe, something that we can measure, and something for which we have a reference that tells us whether the quality is good or bad. Should you just go for the minimum acceptable level? Is your software good enough? Or should you never ship it until it is perfect? While perfection is typically impossible to achieve with finite time and resources, shooting for the stars could land you on the moon. Anytime you design something, you have to have a reference, a target to be achieved. And the customer should set it. What if the customer has no idea what level of quality is required? Then you have a bigger problem to solve. It is much harder to hit a moving target.
36
Types of Functionality Functional
It works!
Non-Functional
It doesn't work
Dys-Functional
It doesn't work properly
Extra-Functional
It works well
Functional • Correctness • Completeness • Compliance (e.g., Ethical Implications)
Extra-Functional • Internal vs. External • Static vs. Dynamic
Functional requirements need to be satisfied: your software should deliver what the customer wants, correctly. Your software should also comply with internal or external regulations and legal constraints. When we talk about functional requirements, you should also consider completeness: is your software feature complete? Is the system already doing everything it is supposed to do? It is important to know when you are done. This assumes the customer does not come up with new requirements; since most successful software in use tends to make customers ask for more features, most projects are never truly complete. Functional requirements tell you whether the software works or not. What about describing how well it works? Extra-functional requirements describe all the other qualities that need to be delivered to make your stakeholders happy. Internal qualities focus on your developers; external ones on the customer experience. Static qualities can be checked before starting your software, by analyzing its source code or its architectural model; dynamic qualities (e.g., performance) need to be observed while running the system. What about non-functional requirements? Those are used to specify under which conditions your software shouldn't work. Like when an online supermarket flooded by customers locked down at home sends you an apology email after mistakenly accepting your order: the checkout page should have been non-functional at that time.
37
Internal vs. External
External qualities concern the fitness for purpose of the software product, whether it satisfies stakeholder concerns. They are affected by the deployment environment. Internal qualities describe the developer's perception of the state of the software project and change during the design and development process.
Note that whether you can deliver externally-visible qualities will also depend on what kind of deployment environment you run your software in. Performance can sometimes be improved without rewriting the software, by upgrading from a slow Raspberry Pi to a faster computer. As a software architect, to guarantee certain qualities, you also need to control the target deployment environment, which can be made of hardware or, nowadays, mostly virtual hardware, e.g., software-defined hardware platforms in the Cloud. Internal qualities determine the beauty of your code, or how badly written it is; whether your team wrote it in a rush, or whether it is fully covered by tests. Unless the code is open source, nobody else will be able to appreciate these internal but fundamentally important qualities, which will affect how rapidly your software can grow while still allowing you to keep a good level of understanding and control over it.
38
Static vs. Dynamic Static qualities concern structural properties of the system that can be assessed before it is deployed in production Dynamic qualities describe the system's behavior: • during normal operation • in the presence of failures • under attack • responding to change • in the long term
As soon as you design your architecture and write the code, you can already measure and predict static qualities. But until you deploy, start, and run your software for the first time, you cannot measure dynamic qualities. Most qualities which matter to your customers and users (e.g., performance, availability) will need to be monitored in production. Still, if you can afford it, you can run some performance or reliability tests in a dedicated environment, so that if something goes wrong your users are not affected. How representative will the workload be? Will you be able to reproduce typical failure scenarios? It may be hard to fully anticipate and reproduce the actual production traffic, failure scenarios, or even malicious attacks on your software. Different qualities cover different aspects of the software lifecycle and how your software reacts to events which you may not fully control (a random cosmic ray hitting a memory cell and flipping a bit). Will your software tolerate failures, survive attacks, and recover back to normal operating conditions? Another set of qualities describes how you can change, extend and evolve your software. How flexible is your software? How many times can you go around the cycle? Does software wear out? Do bits rot? For some reason, in the long term it may become harder and harder to maintain your software: dependencies disappear, hardware regularly becomes obsolete, or it may become impossible to reboot your system to install the latest upgrades. What if your software is running on an autonomous space probe and can only be updated before the rocket launch? Or maybe you lost the original developers, and the new developers replacing them are not familiar with the system and have a hard time maintaining it. Or maybe you thought you had built a throwaway prototype, but it ends up deployed in production and a business happily runs it for a few decades. How do you balance short-term thinking during rapid, incremental delivery cycles with the long-term perspective needed to make your software architecture sustainable?
39
Meta-Qualities • Observability • Measurability • Repeatability (Jitter) • Predictability • Auditability • Accountability • Testability
Before we look at the actual qualities, let us mention some qualities about qualities. For every quality we will introduce later, you should first ask yourself: how can I observe and measure this quality? Can I give a number for it? What kind of measure, what kind of metric is it? Money for cost; time (in nanoseconds) for performance (response time). But some qualities are not so easy to measure: look at the architecture and tell me how complex it is. Measurability is important because if you cannot measure a quality, you cannot control it, and you cannot argue "let's improve it", because you have no way to assess and compare the impact of your architectural decisions on the quality of your software system. Another meta-quality concerns measuring multiple samples over time: how stable is the result? Are the values all over the place, or do you get a nice distribution? If you measure the system for a while, can you predict where the quality is going? Can you extrapolate and plan for when the target will be reached? Can you detect if the quality is degrading? Who wants to measure the quality? Who is interested in knowing about a certain quality? You want to measure it yourself, as an architect or as a developer. Maybe the customer wants to see it as well. There can be contractual agreements stating the expected performance of the delivered system, and if the agreement is violated, there may be financial penalties involved. So you need to trust the measurement, as the results can have legal consequences. Sometimes you are supposed to keep an audit log that shows when the system was up and running over a given period, to track availability issues. And often you need someone to blame, or to be held accountable, when a certain quality drops below an acceptable level. Then you have the problem of quality assurance: you have a quality to achieve, you go through the development process, you release your system: how easy is it to check that what you deliver indeed has the quality that you promised? As an architect, you do your best to design the system in the right way. The developers take your design and turn it into code. Did they do a good job? Did they violate your design, and therefore the quality? How do you make sure that it is possible to test that a certain quality has been achieved? Automated testing is typically done for functional correctness, but it can also be done for other qualities. For performance, for example, you run a benchmark and
40
measure the performance. If you’re concerned about reliability, you can follow chaos engineering practices to inject failures and check whether the system survives and whether your recovery strategy works. Different qualities have their own way to check whether they have been achieved.
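As a rough sketch of what measuring such a dynamic quality can look like (the class name and placeholder workload are invented, and a real benchmark would use a harness such as JMH to avoid warm-up pitfalls), you can sample a latency distribution and report more than just the average, so that repeatability (jitter) becomes visible:

import java.util.Arrays;

// Toy latency probe: collect many samples, then look at the distribution.
// A wide gap between median and p99 reveals jitter that an average hides.
class LatencyProbe {
    public static void main(String[] args) {
        long[] samples = new long[1000];
        for (int i = 0; i < samples.length; i++) {
            long start = System.nanoTime();
            operationUnderTest();                  // the thing whose quality we measure
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        System.out.printf("median=%dns p99=%dns max=%dns%n",
            samples[samples.length / 2],
            samples[(int) (samples.length * 0.99)],
            samples[samples.length - 1]);
    }

    static void operationUnderTest() {
        Math.sqrt(Math.random());                  // placeholder workload
    }
}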
41
Long-term
Change
Attack
Failure
Operation
Design
Meta Observability Measurability Repeatability Predictability Auditability Accountability Testability
Functionality Correctness Completeness Compliance Ethics
External
Evolvability Maintainability Durability Disposability Understandability Explainability Sustainability
Compatibility Portability Interoperability Ease of Integration
Resilience Adaptability Extensibility
Flexibility Modifiability Configurability Elasticity Customizability
Feasibility Time to Market Affordability Consistency Simplicity Aesthetics Clarity Stability Modularity Reusability Deployability Composability Usability Accessibility Ease of support Manageability Serviceability Performance Scalability Visibility Dependability Safety Recoverability Reliability Availability Security Confidentiality Defensibility Integrity Authentication Authorization Non-Repudiation Survivability Privacy
Internal
Stakeholders
Quality Attributes
Design Qualities
• Feasibility
• Consistency
• Simplicity
• Clarity
• Aesthetics
• Stability
• Modularity
• Reusability
• Composability
• Deployability
42
Feasibility What's the likelihood of success for your new project?
• Affordability • Time to Market
When you worry about feasibility, you ask yourself: what is the likelihood of success of your project? Are we going to make it? The project can be completely new software, or it can be a modification: I have a system in this state; I have to extend it, change it, fix it. Are we going to successfully deliver the improvement? What are the two main factors that affect this? One is whether you can afford it. The other one is time: how fast are we going to be? Because maybe, even if you are going to make it, you are going to be late.
43
Affordability Are there enough resources to complete the project? • Money • Hardware • People (Competent, Motivated) • Time • Slack
What does it mean to have enough resources for a project? What do you need for a software project? You need money. The money turns into people paid to develop the software. Developers also need the right hardware and development tools, both to set up a development environment to work on the system and, later on, to test and operate it. The other aspect is time. It turns out that designing and writing code is not instantaneous: it takes time. You can try to trade off time for money, under the assumption that if you pay more you can go faster, because you get more people. But there is a point at which you have too many people on the same project and it starts taking longer. This is called the mythical man-month. And then you also need to have something called slack; not that one, the real slack.
44
Slack Are there enough free resources (just in case)? • Deal with unexpected events • Breathing space to recharge • Planning, Backlog Grooming • Keep track of the big picture • Reflect and Refactor • Learn and Experiment
If you look at your project plan, you have a schedule, and you know who will work on what and when. The plan is very detailed: you can predict what is expected to happen during the next few months down to the minute. Does it have a little bit of space between the activities? More in general, do you have just barely enough of what is necessary to complete the project, or is there a little bit more? Do you have enough spare capacity? Are you facing a tight delivery deadline? Or maybe you know that you will finish one week early. Just in case, because you never know; and maybe at the end of the day you can use the extra week. If you are a student, you should think about slack when you work on your assignments. At the beginning of the semester, you have a lot of slack. But somehow, inexplicably, as the end of the semester approaches, the slack disappears. And then you have a hard time dealing with unexpected events. There is no breathing room left. You cannot keep track of the big picture because you are fighting fires one after the other, trying to deliver all these assignments. You can no longer, for example, think about how to prioritize the work. No time is left to do what the agile literature calls backlog grooming. Your backlog accumulates what should be done, and you are not supposed to just take on random activities: you are actually trying to see what is the best thing to do next. But to do that, you need to afford the time to think about it, to reflect, to hold a retrospective: to check the quality of the outcome, you have to check the quality of the process that was used to produce it, and then you have to learn something from past mistakes so that you can improve the way you work. Another thing that is typically lost when there is a lack of slack is the ability to do refactorings and care about the internal quality of your software. You write the code in a rush so you can ship it and immediately start working on the next iteration. You build on it, and every time you add stuff, but you never have time to clean up the code by refactoring it. Technology also evolves: when do you keep up with it? You have to learn how new technologies work, what is new, and check how changes affect your system. This also takes time: either you plan for keeping up, or you do it in your spare time.
45
When you budget time and resources for a project, always add a little breathing room in the schedule; and maybe you also want some extra money, just in case you need more people. Then you can take the time to hire them without stress, and it will be less likely that you need to deal with burnouts. And you can afford to waste some time chatting on the other Slack as well.
46
Time to Market How soon can we start learning from our users?
Slow → Fast
Build from scratch → Reuse and Assemble
Perfect Product → Minimum Viable Product
Design by Committee → Dedicated Designer
How do you know you have allocated or acquired enough resources for your project to be feasible? You cannot know unless you have a target, a deadline to ship. It is feasible because we will be ready to ship tomorrow. If you are writing gaming software, you have to ship before people start looking for Christmas presents: if you ship in January, chances are your competitors have won the whole market for the season. Getting your software not only running, but delivered into the hands of your customer, is fundamental, and it is becoming more and more necessary to shorten this time to market. Ask yourself: can I give it to my customer right away, or how long do I need to wait before somebody can actually use it? Why is it important that they start using it? Because you can start learning from them: either you ask them for feedback, or they come asking you for new features, or you just sit back and observe them. What if shortening your time to market means you do not write any code at all? What if you just rapidly prototype, combining and composing products out of existing reusable components? What if you ship user interface mockups with fake buttons, and only implement their behavior if you detect users who try to click on them? If nobody ever clicks, why should you spend time writing code that will never run? When you start a new project, your goal should be to ship an MVP: a minimum viable product, or a walking skeleton. Definitely something not perfect. If you are a perfectionist, you should look for a different line of work. Perfect software tends to have an infinite time to market, because you can always tweak it or polish it just a little bit more before giving it to your users. Another way to be fast is to improve your architectural decision-making process. The fastest way to make decisions is to make them by yourself. If instead a group of people needs to meet, reach a democratic consensus, and make sure that everybody's voice is heard as everybody's suggestions are somehow included or crammed into the design, not only does this take time, but the complexity of what you are designing will explode, and the time to get there will increase as well. To keep your time to market under control, it is better if you can get the right person to make the right decisions by themselves at the earliest (or latest) possible right time.
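To make the fake-buttons idea concrete, here is a minimal hypothetical sketch using the JDK's built-in HTTP server (the endpoint and feature name are invented): the button's backend implements nothing; it only counts how many users asked for the feature, which tells you whether it is worth building at all:

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicLong;

// "Fake door": the feature does not exist yet; clicking it only records demand.
public class FakeDoor {
    public static void main(String[] args) throws Exception {
        AtomicLong clicks = new AtomicLong();
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/export-to-pdf", exchange -> {
            long n = clicks.incrementAndGet(); // measure interest instead of serving the feature
            byte[] body = "Coming soon! Thanks for your interest.".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) { out.write(body); }
            System.out.println("export-to-pdf requested " + n + " times");
        });
        server.start();
    }
}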
47
Modularity Is there a structural decomposition of the architecture? Prerequisite for: Code Reuse, Separate Compilation, Incremental Build, Distributed Deployment, Separation of Concerns, Dependency Management, Parallel Development in Larger Teams
Programming Languages with modules: Ada, Algol, COBOL, Dart, Erlang, Fortran, Go, Haskell, Java, Modula, Oberon, Objective-C, Perl, Python, Ruby
Programming languages without modules: C, C++, JavaScript
Modularisation is a design issue, not a language issue (David Parnas)
This is the first example of an internal quality, helping you to observe the structure of your architecture. If your system is a single box, there is only one element, only one module: you have a monolithic system, also called the big ball of mud. You will have no chance, for example, to reuse existing code: there is only one element, so either you reuse the whole thing or you rewrite the whole thing. It takes time to compile a large program, so splitting it into modules helps to compile them separately and incrementally rebuild only the parts that have changed. If you touch one line of the monolith, you need to rebuild the whole thing. If you have one module and you want to run it, you have to run it in one place. There is no possibility, for example, to run the user interface on a mobile phone and the back end in the cloud: it is all in one piece, so either you download it all onto the phone or you upload it all into the cloud. Modularity is a prerequisite for distributed deployment. If you have one module and one developer, everything works fine. If you have one component but 100 developers working on it, they will probably spend most of their time dealing with merge conflicts. To manage the parallel construction of your software, you need a modular architecture. Modularity is so important that many programming languages have embedded it into structural constructs, such as abstract data types, modules, packages, or namespaces. Still, very few languages are aware of the architectural concept of software components. The goal is to represent inter-dependent partitions of your code which can be developed independently. At some point, to build the whole integrated system, the separate components will need to be composed and connected together. Despite its importance, it is surprising that some famous languages have no explicit modularity constructs. They are still successful and widely used, because you can still do modular design and give a shape to the structure of your code even if the programming language does not support it and has no way, for example, to manage module dependencies or document module interfaces to make it easier to discover and reuse them.
48
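For instance, in a language with explicit modules, the module boundary and its dependencies become artifacts the compiler can check. A minimal sketch using Java's module system, with invented module and package names:

// module-info.java: one way a language can make modularity explicit (Java 9+)
module com.example.billing {
    requires com.example.persistence;  // declared dependency, checked at build time
    exports com.example.billing.api;   // only the interface package is visible
    // com.example.billing.internal stays hidden from all other modules
}

In a language without such constructs, the same boundary can still exist, but only as a design convention that nothing enforces.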
"Well used" so tware Which car would you buy? • Used by 1 user • Used by 100'000 users
Once we have modules, your goal is to assemble them into a larger system, but also to amortize your investment in them through reuse. Which software would you rather reuse? One that has never been used before, or another that has lots of existing users? When it comes to software reuse there is an avalanche effect: the more it gets reused, the more likely it becomes to be reused again. But the avalanche does not start easily for software that has never been used before: it is very difficult to bootstrap such a virtuous cycle.
49
Reusability Can we use this software many times for different purposes?
• Reuse Mechanism: Fork (duplication) vs. Reference (dependencies) • Origin: Internal vs. External (Not Invented Here Syndrome) • Scope: General-purpose vs. Domain-specific • Pre-requisites for reuse: trusted "quality" components, standardized and documented interfaces, marketplaces It is often easier to write an incorrect program than to understand how to reuse a correct one (Will Tracz, 1987)
When you want to reuse a piece of software, there are two fundamentally different ways to do it. The first is copy'n'paste. Every developer does it. Another way to call this is to "fork it on GitHub". You have a project in which you initially developed a component; then you clone it and carry it over to the next project, where you reuse it. What is the disadvantage of reuse by copy'n'paste? What is the problem? You make a replica: you have been duplicating code, and you have been duplicating bugs. If you fix them in one place, they still need to be fixed in the other copies. If you fork on GitHub you can have pull requests, or you can try to synchronize, so it is not so bad, but the problem of the redundant clones remains. How do you do software reuse without this problem? You do not duplicate. You call. You refer to. You define shared dependencies, so there is only one copy to be maintained well, and then you refer to it by importing it in many other places. What can possibly go wrong? Different parts change at different speeds. Both strategies have benefits and disadvantages. The advantage here is that if we need to make an improvement, we can fix everything at once; and the same is also a disadvantage: if we make an improvement which is not compatible, we break everything that depends on it. This does not happen when you fork, because you decide whether you want to pull the changes or not. If you do not pull the changes, you do not get the fixes, but you can avoid pulling breaking changes. Fork or dependencies: this is a fundamental decision you need to make if you want to bring reusability into your architecture. When you decide to reuse software, another big question concerns where it comes from. Do you intend to reuse your own code? Every developer, over time, builds a golden chest of treasures: very nicely done code brought along from project to project. This is your own code. You wrote it yourself, maybe in a previous project. You know it exists, you know it because you wrote it, maybe you documented it for yourself, and you do not hesitate before you take it and reuse it.
50
Scale this concept to the whole organization: my team did it, we already have a lot of experience with it, we have no problem using it. The difficulty comes when you move from your own closed world to an open world where the sources of code meant for reuse are external vendors: are those components as good as they claim to be? Do we trust the external service provider? Can we integrate them without adverse consequences? What if we picked a very popular package, and the maintainer no longer has time to keep updating it? Somebody else takes over, and then they take advantage of the fact that this is a popular package everybody is using: they inject an "improvement" which is actually doing something malicious, and the next time you upgrade you get a virus, a back door, a sniffer that is stealing credit card numbers. This is code that is externally sourced: who knows what is in there? And this is even more so if you do not have access to the source code. If you reuse not at the source code level but at the binary level, you should be even more careful. If you think about the purpose of the software, we can reuse general-purpose software, for example a database. Databases help us to store information persistently; we need that in most systems, so let's reuse a database. Of course, you need to choose what kind of database out of many different ones. If you are somebody who builds databases, you can sell them to customers from different application domains: finance, health, transportation, it doesn't matter. The database can store any kind of data, but it also doesn't know how to store anything by itself. You have to define the schema. You have to write the queries. You have to take the general database and specialize it to your domain. The alternative we can consider is reusing within the same domain. Say we have a very successful library, like a game engine. This is still very general, but its scope of applicability is limited to a particular domain. So this is a choice that you make: the more specific your software, the fewer potential users. But these users will be pleased to reuse your solution, because it fits their domain: it does exactly what they want. The moment you want to establish an industrial approach to software development, you need reusable components, and you have to assess and trust the quality of such components. You cannot afford to keep following the "not invented here" syndrome. The points at which your software comes into contact with external, reusable software are fundamental: the points of contact between the external components and your system are called interfaces. Interfaces of reusable components have to be documented. They have to be standardized, so that it is easy to connect them together. And we also need a way to discover these components: where do you find them? There has to be a marketplace, a repository; this could be a way for people to sell components and for people to buy them. This used to be one of the Holy Grails of software engineering: when the field was born in 1968, D. McIlroy argued that software would grow from craft to a serious engineering discipline and industry only once there was a market of reusable components, where people make money not by selling entire applications to their end users, but by selling parts that can be assembled into larger systems.
It has been a while since then, and I am happy to say that such reusable software components now exist. They are everywhere. Making a living by writing and selling reusable software components is possible.
51
Design Consistency What's the design's conceptual integrity and coherence?
Understanding a part helps to understand the whole
Avoid unexpected surprises: • Pick a naming convention • Follow the architectural style constraints
Know the rules (and when to break them)
It is better to have a system reflect one set of design ideas, than to have one that contains many good but independent and uncoordinated ideas (Fred Brooks, 1995)
After looking at the structure, whether it is modular, and whether the modules can be reused, we can observe the overall consistency of the design. Why is a consistent design important? Because if you have a large but consistently designed architecture, once you understand how one part of the system is designed, you can take that knowledge and apply it everywhere else. One example is naming conventions: once you learn the convention, you can expect to read code following it everywhere. If this is not the case, then every module that you read, every file that you open, will be different, and it will be much more expensive and time-consuming to understand it. When we talk about architecture, we talk about design decisions; we can also talk about constraints over these design decisions, as when we talk about architectural styles. If you claim that your architecture follows a style, then you expect a certain level of consistency, because the style is guiding you by constraining the way you make the decisions. There can be some exceptions, but you need to be aware of them, and before you break the rules, you need to know what the rules are. Otherwise, you get a random and totally inconsistent design. You can infer the experience of designers just by looking at the consistency of their designs. If a design has no consistency at all, there is probably not a lot of experience behind it. But when a consistent design varies in a few places, the right places, and there are good reasons for introducing those variations: that is the mark of an experienced designer. There are so many good architectural styles, so many good ideas, patterns, and design concepts, that we want to adopt them all. Maybe this is your experience as students of design patterns: you learn a lot of patterns, and the next time you do a project you want to use all of them. Then you do not get a consistent design, because you just put too much stuff into it. It is better to have the system use a few good patterns that make sense, as opposed to trying to reuse everything you know. It is very easy, in other words, to over-engineer your design and lose consistency. Learn to be selective. Learn how to recognize which tools are necessary. Learn to pick which tools are sufficient.
52
Simplicity What's the complexity of the design?
• A simple solution for a complex problem
• One general solution vs. many specific solutions:
• Lack of duplication
• Minimal variability
• Conciseness
• Resist changes that compromise simplicity
• Refactor to simplify
As simple as possible, but not simpler (Albert Einstein)
How simple is your solution? It's great to strive for simple solutions to complex problems. Or is complexity unavoidable? Among alternative designs, pick the simplest one. Sometimes you will be surprised by how simple the solution can be, but how hard it is to find. If you are pressed for time, chances are that your design will not be the simplest possible one.

How can you simplify? Look for the essence. Abstract and make your solution general. Or make it super specific, knowing that it will work for only a simple case, but that's what you need anyway. Sometimes you gain simplicity because you do not repeat yourself: DRY solutions are simpler because they lack redundant parts. The code is smaller, concise, compressed one may say. The code takes less time and effort to write after you have invested in simplifying its design. Avoiding duplication also helps to keep your design consistent when you evolve it.

There is a point in the evolution of the system at which you achieve a simple version 1.0. You won your fight against complexity and you have released a nice clean solution. Congratulations. Don't let your guard down: entropy and complexity will start to grow again as change creeps in. Refactoring (e.g., factoring out commonality, building layers of powerful abstractions) is what can help you simplify your code. There is also the problem of over-simplification, the opposite of over-engineering. So don't make it too simple. That's easy to say, but also difficult to achieve.
Complexity What is the primary source of complexity?
• The number of components of the architecture
• The number of connections between the components
Say we have a modular architecture. Where does its complexity come from? Does it get more complex because you have a larger and larger number of different modules? Or does it depend on the number of possible connections between them? Complication (or size) grows linearly, while complexity grows quadratically.

Imagine you have a developer for each component. If every component is interdependent with every other component, all developers need to talk to one another all the time to understand how each component works and to coordinate how they can be changed. How large can your architecture grow? The larger the coordination meeting, the longer it takes, and the less time remains available for coding. Your job as an architect is not only to draw the boxes and arrows, but to cut the unwanted or unnecessary edges to keep complexity under control. If you prevent these connections from happening, you will keep modules independent, loosen their coupling, and keep developers from attending too many meetings.
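To make the quadratic growth concrete, here is a small sketch (illustrative, not from the lecture): with n components there are at most n(n-1)/2 pairwise connections.

```python
# Possible pairwise connections between n components: n * (n - 1) / 2.
def possible_connections(n: int) -> int:
    return n * (n - 1) // 2

for n in [5, 10, 50, 100]:
    print(n, "components ->", possible_connections(n), "possible connections")
# 5 -> 10, 10 -> 45, 50 -> 1225, 100 -> 4950:
# size grows linearly, potential coupling grows quadratically.
```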
Clarity Is the design easy to understand?
A clear architecture distills the most essential aspects into simple primitive elements that can be combined to solve the important problems of the system
Freedom from ambiguity and irrelevant details
Definitive, precise, explicit and undisputed decisions
Opposite: Clutter, Confusion, Obscurity
Architecture represents the system so that different people can talk about it. How easy is it to understand what people are talking about? How clear is their view of the details? Do they easily reach a shared understanding? If you hand an unclear architecture to developers, they will be confused. They will not know what code to write, how the solution is supposed to work, or even worse, which problem they are supposed to solve. Clarity can be enhanced by simplicity, but even complexity needs to be clearly described.

As an architect, you make a decision. It has to be clear cut: you say black or white, yes or no; you cannot say maybe. Unclear decisions breed ambiguity, which leads to conflicting interpretations. Some of the developers will do it one way, the others the opposite way, and you get an inconsistent implementation riddled with conflicting decisions. Learn how to make clear decisions; explain decisions clearly, don't include irrelevant details, and don't leave ambiguous options open.

This is something that we will emphasize when you work on modeling. We will look at your architectural models and try to assess their clarity. Notations help a lot to avoid fuzzy sketches; learn how to use them well. If you use a standard notation to represent your architecture, you clearly express the meaning behind what you draw, and people should be able to understand it more clearly.
Stability How likely to change is your design?
Unstable → Stable
Prototype → Product
Implementation → Interface
Likely to break clients → Platform to build upon
Experimental spike, throwaway code → Worthy of further investment: building, testing, documenting
The design process progresses over time. You incept the original concept, refine it, modify it, and extend it until you reach a point at which you stop: your architecture has reached a stable state. Then you can move on to the implementation phase.

We label different types of software artifacts to signal whether they are stable or not. For example, if you hear the word "prototype" you know that this code may work today but might not continue to work tomorrow, and it definitely shouldn't be used in production. It is important to set expectations about stability with the users or developers who may depend on a particular artifact.

Which elements of an architecture are supposed to be stable? The points of contact at the interfaces between components. Interfaces should be very stable, while the internals of the components can be a bit more unstable. You can easily change the implementation as long as you don't touch the interface, because the change stays confined within the component, and the other components – as they only depend on the stable interfaces – are not affected. Platforms are meant to give a stable foundation on which developers build applications. What if you write your application and it stops working when you upgrade the compiler for the language you used to write it?

If you write code that is unstable – a spike, or a quick and dirty prototype – you want to learn something from it and then throw it away. If you have a product that is stable, then it's worth the investment to freeze its interfaces, write the documentation, train users, and write automated tests for it. Unstable user interfaces have the same issues: every improvement will impact users, who may get confused and may need to relearn how to do things they already spent time figuring out in a previous version.
Composability How easy is it to assemble the architecture from its constituent parts?
• Assuming all components are ready, putting them together is fast, cheap and easy
• Cost(Composition) < Cost(Components)
• Components can be easily recomposed in different ways

How easy is it to take all of your modules and assemble them as components of a larger system? We talk about a composable architecture when the effort it takes to assemble it is less than the cost of actually building all of its parts. How much effort does it take to replace an existing component? Can you just plug the new one in and everything will still run? Or does it take hours of configuration? Or a month-long, high-risk integration project, where you end up developing your own middleware connectors? Even if individual components may still be expensive to build, the consequence of a composable architecture is that their composition is cheap. And sometimes not only the components are reusable: you can also start reusing the glue between them.
Deployability How difficult is it to deploy the system in production?

Hard → Easy
Manual Release → Automated Release
Scheduled Updates → Continuous Updates
Unplanned Downtime → Planned or No Downtime
Wait for Dependencies → No Synchronization
Changes cannot be undone → Rollback Possible
We are now making the transition from design to run time. We have a design, we have the code. We make a release so that we can deploy it and people can start to use the newly released version. This deployment process can be manual or automatic. You can have a continuous integration and delivery pipeline: I push into the repository, something checks out the code, compiles it, builds it, tests it, and, if everything is green, even automatically pushes it out to the production environment. Maybe somebody has to check and sign off before the golden master can be shipped, and take responsibility for it. If deployment is automatic, it will be a repeatable, deterministic process: it will always produce the same output given the same input. If it is manual, then it will depend on who does it and what they had for breakfast; sometimes it works, sometimes it doesn't, and the release unexpectedly fails.

How often do you ship a new release? Once a year or 365 times per day? Are you making very small, incremental but continuous changes? Or do you ship a whole new system every once in a long while? Earlier on, it was a big deal to make a release and it would happen infrequently; look at the history of most operating systems. Nowadays, releases are deployed much more frequently. If you sell mobile apps, the bottleneck is how long it takes for the app store to check and publish your updates. If you provide a Web service and you have control over your data center infrastructure, you can open a stream of continuous fixes and updates. Why is it better to deploy continuously? Because you put every improvement in the hands of your customers as soon as possible. It takes time to debug and write the fix; it should not take time to ship the new version so that it lands on your users' devices.

There are also risks when you deploy changes. Can you depend on a system which is continuously changing under your feet? Is it really always getting better? Do all users always need or like all changes? And what if you're driving a car and in the middle of the highway they make a new release of the auto-pilot software? How safe is it to patch a live system?

When you deploy, are the users going to notice? What if you have to make a payment, you open the e-banking website and you read: 'Sorry, we are down while deploying an upgrade. It will be marvellous when it starts working. Come back tomorrow when we are back up and running again.' Deployment is not only a transition from design time to run time, but also a transition from old (and broken) to new (and better). It introduces a change of the system that affects the users both while the change is being put into place and afterwards. If downtime is planned, your users can expect it; for example, don't assume certain services work at night or during weekends. But if it's unplanned, and the very moment you need the system it's not available because it's being upgraded, this is not really good in terms of deployability. You can annoy your users by simply politely asking them if they would like to reboot their computer while they are watching a movie or giving a live presentation.

Now, suppose you make a change, you want to ship the new release, and you have multiple components. Can you change one without asking the others for permission? Or do you need to wait until everybody is ready to ship and then do a big release at the same time? If you need to coordinate the release, it is much more complicated, because everybody has to be ready to give you the green light, and this usually doesn't happen at the same time. If you can ship everything independently, you can do releases much faster.

What happens if the deployment goes wrong? What happens if you ship a bad release, you install it, and then you start getting all kinds of problems? Can you click on undo? Can you go back to the previous version? If you can't, then your system is not as deployable as one that can be rolled back with ease. When you install software, you should treat it like a database transaction that can be aborted or committed, and can also be rolled back. Installation of software packages is for some reason not always like that. Sometimes you install it and you will never be able to disentangle it from the rest of its environment unless you throw away and reinstall the whole system from scratch. The ultimate failed deployment leaves you with a brick or an expensive doorstop.
Normal Operation
• Performance
• Scalability
• Capacity
• Usability
• Ease of Support
• Serviceability
• Visibility

After deployment, during operation, we start to use the system. The system is running. Let's look at the most important quality attributes during normal operation.
Performance How timely are the external interactions of the system?
• Latency
• Communication/Computation Delay
• User-Perceived: First Response vs. Completion Time
• Throughput
• Computation: Number of Requests/Time
• Communication: Limited by Bandwidth (Data/Time)
Everybody has some experience and can give examples of software that has good performance or poor performance: it either meets or exceeds their expectations for how long an action should take, or it is so slow that they wonder if it will ever get done.

From the user's perspective, performance can be observed in terms of response time: the delay between the last input and the first corresponding output. I send a request, I click a button, and wait until I see the output. I have received a response with a result, and it took time to compute it. If the latency of the response is short, we have good performance. If the latency is extremely long, at some point we will start to suspect that maybe the system is never going to answer; then we need a way to entertain the user, show a spinner, flash a colorful progress bar, preview partial approximate results, and check that the job is still alive. Ideally, we can give an estimate of when the output will arrive. It's all about expectation management.

The latency is affected both by the processing time and by the communication time. With a distributed deployment, network latency has an impact. Systems deployed in different data centers in the cloud will behave differently depending on where the users are with their mobile phones. You might choose a data center closer to them to improve the performance. This has nothing to do with the software itself; it concerns the choice of where you deploy it in relation to the users' location.

From the service provider's perspective, we can also observe performance in terms of how many customers a system can handle per unit of time. This is different from the time it takes for individual interactions: maybe those are slow individually, but it is possible to execute many of them concurrently. This is what we call throughput, and we can measure it for computational tasks: how many clients and how many requests can your system process over time? If we have a multicore computer, we can run computations in parallel threads and get better throughput. Regarding communication, throughput is usually measured in terms of bandwidth: how many kilobits, megabits or gigabits can you send or receive in one second? These are all different types of indicators that can help you define, measure, predict and compare the performance of your architecture.
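As a rough illustration of the difference between the two indicators (the timings are made up), latency is a property of each request, while throughput is a property of a time window:

```python
# Hypothetical request timings (in seconds) observed over a 10-second window.
request_latencies = [0.12, 0.08, 0.35, 0.10, 0.22, 0.15, 0.09, 0.18]
window_seconds = 10

mean_latency = sum(request_latencies) / len(request_latencies)
throughput = len(request_latencies) / window_seconds  # requests per second

print(f"mean latency: {mean_latency * 1000:.0f} ms")
print(f"throughput:   {throughput:.1f} req/s")
# Slow individual requests (high latency) can coexist with high
# throughput if many requests are processed concurrently.
```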
Scalability Is the performance guaranteed with an increasing workload?
• Architecture designed for growth:
• client requests (throughput)
• number of users (concurrency)
• amount of data (input/output)
• number of nodes (network size)
• number of software components (system size)
by taking advantage of additional resources
• Scalability is limited by the maximum capacity of the system
• Software systems are expected to handle workload variations of 3-10 orders of magnitude over short/long periods
Related to performance, scalability is a function of how performance is affected by changes in the workload or in the resources allocated to run your architecture. You may have incredible performance as long as you use the system by yourself; the moment additional users show up, the performance degrades noticeably. Scalability by itself is a meaningless term: since scalability is a function, you need to specify which variable the performance depends on. Scalability in terms of the workload may depend on the number of concurrent requests, the number of concurrent clients or users, or the amount of data sent as part of each request. Scalability in terms of resources may depend on the number of available CPU cores, the amount of memory, or the amount of storage. Will your computation scale to fill up a supercomputer? Will your architecture store and process big amounts of data? Regarding its internal structure, an architecture may need to scale to use a large number of components. Your development process may need to scale to involve thousands of developers.

In general, a system scales if, when the workload grows, you can proportionally increase the resources and the performance stays the same. The system doesn't scale if, when you increase the workload or the resources, the performance degrades. What if you had infinite resources? Would you get instantaneous results, if you were willing to pay for them? There is a limit to the amount of resources that you can have, and this limit is what we call the capacity of your system. We can see a scalable system as one that can fill up the available capacity. It's impossible to scale beyond it: if the workload grows, the system first saturates and then overloads.

If you look at the workloads that software is usually expected to handle, they can change between 3 and 10 orders of magnitude. Billions of users. Petabytes of storage. So it's important to be aware of what kind of variations to expect during the lifetime of your architecture. If you expect a small variation you will design it one way; if you expect a huge variation the design will need to be different. The advice is not to over-engineer your system for scalability when you don't need it. Some systems need to grow and scale beyond their initial workloads, but not all systems will go viral overnight, and those that do will die of their own success unless they can figure out how to scale.
Capacity How much work can the system perform?
• Capacity: Maximum achievable throughput without violating latency requirements
• Utilization: Percentage of time a system is busy
• Saturation: Full utilization, no spare capacity
• Overload: Beyond saturation, performance degradation, instability
• Ensure that there is always some spare capacity
Capacity gives us an idea of how much work we can get out of the system. It is the maximum throughput achievable without violating the latency expectations of individual requests: there are only so many requests per second we can perform without the queue of incoming traffic getting longer and longer. If servicing the workload means running at 100% utilization of your processing or storage resources, you are running at the limit and entering saturation, where there is no slack left: no spare capacity. If your workload goes beyond saturation, you overload the system. This is when performance suffers: clients will randomly disconnect after the network protocols time out or, worse, they will start hitting their refresh buttons, further increasing the workload.

The concept of slack we have seen for project management is also applicable to devops. We can adopt the same good idea during operations: avoid running with your workload matching the exact capacity that you have. If, for some reason, one more user than expected shows up, you and everyone else will notice.
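A minimal sketch of these definitions, assuming a hypothetical measured capacity of 1000 requests/second:

```python
# Capacity: maximum sustainable throughput (req/s) that still meets the
# latency requirement. Utilization compares the offered workload to it.
capacity = 1000  # req/s, assumed measured under load tests

def classify(workload: float) -> str:
    utilization = workload / capacity
    if utilization < 1.0:
        return f"{utilization:.0%} utilized, spare capacity left"
    if utilization == 1.0:
        return "saturated: full utilization, no slack"
    return "overloaded: queues grow, latency degrades"

for workload in [400, 1000, 1300]:
    print(workload, "req/s ->", classify(workload))
```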
Measuring Normal Operation Qualities
• Results are displayed 1 second after users submit their input.
• The system can process messages sent by 1M concurrent clients.
• After 1 hour of initial training, users are already productive.
• Last Friday the workload reached 1000 requests/second.
Usability Is the user interface intuitive and convenient to use?
• Learnability (first time users)
• Memorability (returning users)
• Efficiency (expert users)
• Satisfaction (all users)
• Accessibility
• Internationalization
Beyond performance, scalability and capacity, another externally visible quality concerns the interaction between users and the system's user interface. To assess this quality, you need to think about who the target users of your system are. Which expectations do we have to meet? This can be unpacked by looking at the ability of users to learn how to use the system. Make the user interface intuitive, so that beginners can be productive without training. At the same time, the user interface should not get in the way of experts.

How do you give users a path to discover the features of your system? Welcome new users so that they have a good experience and come back. Make them travel the journey from first-time user to expert quickly. If they come back, will they remember the user interface, or will they need to be reminded with subtle cues on how to pick it up again? Expert users by definition don't have problems learning or remembering, but they do have a problem being efficient: they typically need shortcuts, a fast track to get the job done.

In general, are users satisfied after using your system? There is a standard system usability scale which can help to measure that, but it requires explicitly asking users. This means users need an incentive to provide you with feedback beyond their sheer happiness with or hate for the system. There are also users with special needs, and accessibility is the specific usability quality that checks whether your user interface can support them. Internationalization is also important for supporting people speaking different languages: do you need to rebuild your app to switch its display language? These qualities might be ignored in early releases, but need to be considered once the user population grows worldwide.
Ease of Support Can users be effectively helped in case of problems?
Hard → Easy
Cryptic Error Messages → Self-Correcting Errors
Heisen-bugs → Reproducible Bugs
Unknown Configuration → Remotely Visible Configuration
No Error Logs → Stack Traces in Debug Logs
User in the Loop → Remote Screen
Intuitiveness and convenience help until users get into trouble. Something goes wrong and all users get is a cryptic error message: Error 25. How are they supposed to understand what they did wrong? How can they correct their mistake by themselves? How can they get help? Ideally, when something goes wrong, your system should be designed to self-correct the problem so users do not even notice: that's the idea of a robust, fault tolerant, or self-healing design. You try to open a file to edit it and the file doesn't exist: do you get an error, or do you create the file? What's the best design? It is better to empower people to solve problems by themselves.

A big challenge for supporting users comes from hard-to-reproduce issues: Heisen-bugs, the ones which disappear when the user calls support for help, and reappear right after hanging up the phone. To reproduce a bug, you need to know not just the input, but also the configuration of the container or specific runtime environment. Where was the system deployed? How is it configured? Does an ordinary user know that? And how can support personnel access that information? Without it, there is no context for troubleshooting. Is there just a cryptic error message popup? Or is there an error log with user accounts, stack traces, IP addresses and timestamps? Leaking stack traces or other details can also be a security issue: it reveals internal details which are necessary for supporting users in trouble, but can also open up an attack vector.

Is it easier to land a pilot-less plane by directing some random passenger over the radio or by remotely taking over the controls? Having users in the loop means that the user is helping you to debug. This can be great for heisenbug problems, but it can also be a problem, because the user may misunderstand what you tell them, and you can only indirectly observe the screen through their eyes. If you have remote access into the system, you can even type commands yourself to fix the problem or to reproduce the bug. This may not always be possible, as it happened to that consultant who ended up in a security-conscious company and could only tell employees what to type, but was not legally allowed direct access to the terminal's keyboard. This is a particularly expensive way to support your users.
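One way to move from cryptic error messages toward stack traces in debug logs is to log the full context for support while showing the user a short, actionable message. A sketch using Python's standard logging module (the file name, function and message text are invented):

```python
import logging

# Users see a short, actionable message; the debug log keeps the full
# stack trace and context that support needs to reproduce the problem.
logging.basicConfig(filename="debug.log", level=logging.DEBUG)
log = logging.getLogger("app")

def open_profile(path: str) -> str:
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        log.exception("profile missing: path=%r", path)  # stack trace goes to the log
        raise SystemExit(f"Could not find your profile at {path}. "
                         "Create one with 'app init' or contact support.")
```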
Serviceability How convenient is the ordinary maintenance of the system?
Hard → Easy
Complete Operational Stop → Service Running System
Reboot to upgrade → Transparent upgrade
Install Wizard → Unattended Installation Script
Restart to apply configuration change → Hot/live configuration
Manual Bug Reports → Automatic Crash Report
If we switch to the internal perspective, during normal operation we need to maintain the system: do whatever it takes to keep it running. For example, update or upgrade a component or its execution environment. Sometimes your Mac shows you a popup where you read: "to update, please click to reboot". How many of you click it right away, and how many of you postpone it until tomorrow, every time? It would be much better (both for keeping a system up to date and for reducing user disruption) if system improvements were delivered without the user noticing, applied on the fly while the system is running.

To prepare a new machine for service, sometimes you have to install software on it. How many of you are using "brew"? Why are you doing that? Because it is so much more straightforward to install software from the command line as opposed to dragging and dropping, or filling out wizard forms and watching progress bars. If you have to install something once on your personal machine, you might be willing to go through the effort; but if you have to prepare all the computers of the whole University, are you going to pay a person to sit and click through point-and-click wizards for days?

What if you are tuning a system's performance by changing some configuration parameters? Is it better if you have to reboot it after every change? Or if you just change a few parameters and the system picks the new values up without any restart?

When something goes wrong, do you need a trusted person to give you a good bug report, featuring screenshots and a step-by-step guide on reproducing the issue? Or do you have an automatic crash detection and reporting bot, which sends the developers all the necessary stack and heap memory dumps and input/output logs, so that they have enough information to fix it?

While planned outages and service downtimes were acceptable with local users taking weekends off, since the maintenance work could be scheduled during weekends or at night, with global users expecting 24/7 availability, it gets more challenging to achieve serviceability.
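As an illustration of hot/live configuration, a common Unix pattern is to re-read settings on a signal instead of restarting the service. A minimal sketch (the file name and settings are invented):

```python
import json
import os
import signal

CONFIG_PATH = "service.json"        # hypothetical configuration file
config = {"log_level": "info"}      # sensible default

def load_config(signum=None, frame=None):
    """Re-read settings without restarting the service."""
    global config
    if os.path.exists(CONFIG_PATH):
        with open(CONFIG_PATH) as f:
            config = json.load(f)
    print("configuration (re)loaded:", config)

load_config()                       # read settings at startup
if hasattr(signal, "SIGHUP"):       # Unix only
    signal.signal(signal.SIGHUP, load_config)  # `kill -HUP <pid>` applies changes live
# The service keeps running; no restart is needed to pick up a change.
```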
Visibility Is it possible to monitor runtime events and interactions?
To what extent can the system behavior and internal state be observed during operation? Are there logs to debug, detect errors or audit the system in production? Is the system self-aware?
Process Visibility: can the progress of the project be measured and tracked?
We see in order to move; we move in order to see. (William Gibson)
When it comes to observing what the system is doing while it's running, we need to go beyond its external input/output interface. To get visibility into the system, we need to open up the black box and look inside. Can developers observe the internal state of the system, or trace what the system is doing while it's running and processing our requests? Are there normal operation logs? What about error logs? Can users monitor and track their usage, especially if they get billed on a pay-per-use basis?

Basic visibility is a one-way information flow from the inside to interested parties outside. What if this flow could be fed back to the system itself? Visibility is the prerequisite for self-awareness: if you know you are running out of battery, you can kill non-essential tasks. If you detect a workload peak that brings you into overload, you can automatically provision additional resources to add capacity for absorbing the peak, and free them once they are no longer needed. Doing this requires being able to observe the consumption of resources and manage their internal allocation.

We also talk about visibility in the context of project management, when tracking the progress of the project. You can observe commit logs, count issues in the backlog, spot trends concerning failing tests, and even try out preview releases. Aggregating these statistics helps to track the speed of a project or predict whether it will be late. Good visibility also helps when something goes wrong and you need to detect faults or deal with the consequences of failures.
Dependability Qualities
• Availability
• Reliability
• Recoverability
• Safety
• Security

What could possibly go wrong? Will your software still work if you flip one random bit of its code? What if the data gets corrupted or lost? Faults can originate from within, but can also be due to external problems (e.g., failed dependencies, network communication outages) affecting our system. Was the lost packet carrying an alarm from someone's wearable heart monitoring sensor? Or just an unimportant frame of a social media video? How much can we depend on our system? Can a life depend on your software? If so, when something goes wrong, someone may die. Or maybe it's just a little bit of digital money that gets locked forever inside a smart contract. Maybe all you have to lose is a little bit of your precious time, just a nuisance. Dependability is a very broad quality: we can decompose it into availability and reliability; into how easy it is to recover normal operation when things go wrong (recoverability); and safety and robustness, as well as security, are also considered part of dependability.
Reliability How long can the system keep running?
• MTBF - Mean Time Between Failures
• MTTF - Mean Time To Failure
Recoverability How long does it take to repair the system?
• MTTR - Mean Time to Recovery
• MTTR - Mean Time to Repair
• MTTR - Mean Time to Respond
If the system is running OK now, for how long is it still going to run? We need to use the system from now into the future, for as long as possible. So we rely on the system to work, to be there for us. We distinguish two time metrics to quantify reliability: the time until the next failure and the mean time between failures.

To understand reliability, think about two states: up and down. Up means everything is fine: the system is working. Down happens when a failure occurs: the system is no longer working. How much time passes between two failure events? If they are far apart, a system is more reliable than one that crashes every hour. What if you don't know when the last failure manifested itself? Then you can look into the future and worry about how long it will take until the next failure. What happens when you unbox a new phone or a new laptop? What are your expectations about its reliability? Will it break as soon as the warranty expires?

We distinguish MTTF from MTBF because it is important to consider not just the transition from green to red, up to down, working to failed, but hopefully also the opposite direction: coming back up after recovery. Only biological systems fail once and then they're dead, and unfortunately there is no way to bring them back. Artificial systems fail, and then hopefully there is a possibility to recover them.

Once something fails, you first need to be able to observe and detect the fact that it has failed. Maybe the root cause, the fault triggering a complex chain of events, happened earlier, but there is a moment at which its ultimate consequence visibly breaks the system, which has failed. This is when we can start our attempt to recover. What is the classical and universal recovery strategy across computer science? Reboot it. How long does it take? Depending on the operating system, rebooting may not be instantaneous. And what if recovery requires something more complex than a simple reboot?

Before you can recover something, you have to fix the actual cause of the problem: you have to repair it. And before you can perform the actual repair, you need to know who is the right person to call. Until systems can fix themselves, we need to get the right person to the right place so they can start typing the right commands on the right keyboard. Before you recover, you have to repair; and before repairing, you have to respond. Sometimes support organizations will guarantee that when something goes wrong they will respond within a contractually specified number of minutes: from the time they are called and notified of the problem, there will be a person physically present to start the repair process. But this is just a first response, which doesn't guarantee that the problem is solved right away; it just states that somebody will be working on it. And before you can respond, you need to know who to call. What if this person is on vacation, quarantined or simply unavailable? Your down time just got longer. Depending on what went wrong, it may not be trivial to fix. If a backup plan or redundant part has been foreseen in advance, it may be a good time to fail over while the original component is being worked on. If there is no slack, then your recoverability track record just got worse.
Availability How likely is it that the system is functioning correctly?

Availability and Reliability
• Availability = MTTF / (MTTF + MTTR)

Availability and Downtime
• Availability = (Ttotal - Tdown) / Ttotal

Availability → Downtime (1 Year)
99% → 3.65 days
99.9% → 8.76 hours
99.99% → 53 minutes
99.999% → 5.26 minutes
99.9999% → 31.5 seconds
What's the difference between reliability and availability? The system is on, but how can you be sure that it's working properly? How do you measure something that is uncertain? If you have a reference point for how long it takes the system to respond to a specific probe, you can try probing it by sending a request. After asking a system to do something, how long will it take to complete the operation? How do you know that it will actually finish eventually? What if it never responds? Maybe it doesn't matter whether an answer will actually materialize; all that matters is how long you are willing to wait for it. When you reach the point at which you say "I'm not willing to wait anymore", that's when the system is not available.

How do you deal with such uncertainty? What's the probability for the system to give an answer within 1 second? 100%? Sure. Well, that depends on the input, and it depends on whether it is about to fail. Maybe you don't get an answer after all. We call availability the probability for the system to successfully process your input and provide a timely response at the time you actually need it.

How can we measure the probability that the system is functioning correctly, i.e., that it gives a correct answer within the time that I'm willing to wait for it? We can use the relationship between availability and reliability: when the system is failing, when it is down, it's not available. So we divide the mean time to failure (MTTF) by the total time, which also includes the mean time to recover (MTTR). If it takes 0 seconds to recover, you have 100% availability, because the moment you fail you immediately bounce back, recovered, and nobody even notices that the system was unavailable. If it takes a non-zero time to recover, you have a smaller availability. We can fudge the metric if we do not count planned outages, which are expected by users, as down time, and only measure unplanned outages.

What if you don't consider this, and just say we want to always be able to use the system for one year: if something goes wrong, for how long is the system going to be unavailable during that time? Sometimes you see availability defined by counting nines. They say we have five-nines availability: 99.999%. This means approximately 5 minutes of downtime over one year. It's actually quite challenging and expensive to add one nine. It takes redundancy, automation, processes, and still making sure that people don't go on vacation at the wrong time: if you have to call somebody at the beach to fix something, you have already lost the critical 5 minutes. All of these metrics, availability and time to respond, repair and recover, are part of service level agreement contracts and can be tied to financial penalties, not to mention the loss of reputation and customer trust.
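The formulas and the downtime table above can be checked with a few lines (a sketch with illustrative numbers: one failure per month of operation, one hour to recover):

```python
# Availability from reliability metrics (all times in hours).
mttf = 720.0   # mean time to failure: ~1 month of operation
mttr = 1.0     # mean time to recover: 1 hour
availability = mttf / (mttf + mttr)
print(f"availability: {availability:.3%}")   # ~99.861%

# Yearly downtime implied by a given number of nines.
hours_per_year = 365 * 24
for a in [0.99, 0.999, 0.9999, 0.99999]:
    print(f"{a:.3%} -> {(1 - a) * hours_per_year:.2f} hours down per year")
# 99% -> 87.6 h (3.65 days), 99.9% -> 8.76 h, 99.99% -> ~53 min, 99.999% -> ~5.26 min
```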
Measuring Availability and Reliability The service level agreement states up to 1 month
, an availability of
1 hour
99.861%
After we call support, they need to be there within Rebooting the server takes
Safe Secure
30 minutes
5 seconds
The uptime of our oldest server has reached
Robust
downtime per
4 years
Is damage prevented during erroneous use outside the operating range? Is damage prevented during use within the operating range? Is damage prevented during intentional/hostile use outside the operating range?
B. Meyer
Before we go into the security discussion, we should mention this important definition of the safety of a system. To understand it, you need to consider the type of input sent to the system. If you are sending valid input to a system that is running in normal conditions, then the system shouldn't fail: it's safe if no humans come to harm under these conditions. What if we are sending wrong, invalid, malformed input to a system that is failing or misconfigured? Then it's not about safety, but about robustness, which means that the system can tolerate being used in the wrong way. Unintended erroneous use happens all the time. That's why dangerous editing operations come with an undo option. That's why you shouldn't be able to put your car into reverse gear when driving down the highway. That's why a cat walking on the keyboard of the control panel of a nuclear plant should not cause a meltdown. If the system is robust, there is no damage when the system is misused unintentionally.

The difference between a robust system and a secure one lies in the intention of the user. Is it just an honest mistake due to incompetence? Or is the incorrect input sent as part of an attack? Like when someone is trying to trick your server into remotely executing the content of a message whose size overflows the buffer allocated to receive it. When you design your system, it has to be safe: minimize the risk that users trained to work with it come to harm. Then you can go to the next level and make it robust: no damage even under random input. Security is the most challenging to achieve, as attackers are intentionally misusing your system and targeting its weaknesses.
Security
• Authentication How to confirm the user's identity?
• Authorization How to selectively restrict access to the system?
• Confidentiality How to avoid unauthorized information disclosure?
• Integrity How to protect data from tampering?
• Availability How to withstand denial of service attacks?
We can decompose security into different aspects. We should know the user: we should be able to authenticate the identity of the user. After we know who the user is, we need to know whether the user is allowed to do what they are doing: are they authorized? To do so, you need to put a filter on top of the architecture, which tracks the user's input and checks whether the user is allowed to perform it or not. Once the user is sharing information with the system, is this information going to leak everywhere else? Or does the system preserve the confidentiality of the input as well as the output? Is the output given only to the corresponding users, on a need-to-know basis? How do we prevent external attackers from tampering with the information inside the system? This is about guaranteeing the integrity of the storage, but also of the communication: when we transmit data, nobody should be able to intercept and modify it without us being able to detect it. Security is also connected with availability. A system can be overloaded because it is very successful and everyone wants to use it, or because it is undergoing a denial of service attack, where malicious users flood it with requests so that ordinary users get stuck in the queue and are no longer provided a timely service.
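A minimal sketch of the authentication and authorization filter described above (all tokens, users and permissions are invented):

```python
# A request passes two checks before reaching the system itself:
# authentication (who are you?) then authorization (may you do this?).
TOKENS = {"t-123": "alice", "t-456": "bob"}   # stand-in for real session tokens
PERMISSIONS = {"alice": {"read", "write"}, "bob": {"read"}}

def authenticate(token: str) -> str:
    if token not in TOKENS:
        raise PermissionError("authentication failed: unknown token")
    return TOKENS[token]

def authorize(user: str, action: str) -> None:
    if action not in PERMISSIONS.get(user, set()):
        raise PermissionError(f"{user} is not authorized to {action}")

def handle_request(token: str, action: str) -> str:
    user = authenticate(token)   # confirm the user's identity
    authorize(user, action)      # selectively restrict access
    return f"{action} performed for {user}"

print(handle_request("t-123", "write"))   # ok: alice may write
# handle_request("t-456", "write")        # raises PermissionError: bob may only read
```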
Defensibility Is the system protected from attacks?
Survivability Does the system survive the mission?
If you look at military applications, you need to worry about how defensible your system is. If you design a system that is 99% secure and you connect it to the open Internet, how long does it take for someone to hack into it? Sooner or later, it's pretty much certain that it will be hacked, because all it takes is one point of weakness, and the Internet is ready to exploit it. You need protection, a firewall. Maybe you build an air-gap defense where you don't have any network connectivity into your system; but it is still possible to monitor electromagnetic radiation and read what you display on your screen. It's difficult to defend electronic systems. If the blockchain cryptographic protocols are considered secure, the focus of the attackers will shift to the exchanges and to the wallet where you store your private keys. How strong is your password? Do you remember to change it every 24 hours?

Survivability is another aspect. If you use your system as intended, how much of it survives the mission and can be used again later? If you launch a rocket: will it deliver the payload and return to base? Or will the whole system get destroyed after a successful mission? Going back to civilian applications: how long does your phone keep working? For how many years can you keep installing apps on it? When does the manufacturer stop supporting it with operating system updates? The choice of embedding planned obsolescence into a business model will also impact how your system survives its normal usage; usage which, even if exactly as intended, will slowly cause your software to rot.
Privacy How to keep personal information secret?
Good vs. Poor privacy:
• Default: Opt-in vs. Opt-out
• Purpose: Specific, explicit vs. Generic, unknown
• Tracking: None vs. Third-party Fingerprinting
• Personal identification: Data anonymization vs. Data re-identification
• Retention: Delete after use vs. Forever
• Breach: Prompt Notification vs. Silent
Security should not be confused with privacy. Privacy is about personal information. You need to distinguish information that is made public, because you want everybody to know about it, from information that you should keep to yourself, or share only with a controlled set of people. Please note that a shared secret is no longer a secret: you can allow someone to get a copy of your bits, but once the copy is out there, digital rights management tools are somehow supposed to prevent those bits from further spreading the information, but information wants to be free...

A quick example illustrating the difference between privacy and security: consider cloud providers. They claim to be highly secure: nobody else can get into your account, and they have the massive size required to invest in hiring top talent for keeping their data centers secure. To make use of their services, however, you send them a copy of all your data. So your data is now in the cloud, stored right next to that of your competitors. All the systems processing your data run in the cloud. They are in a secure environment. But there is no privacy, because you have just released a copy of your personal or corporate information into an external environment. You gave a copy to the cloud provider, who guarantees that nobody else can see it, but they can! So implicitly or explicitly you have to trust them not to violate your privacy.

This trust is represented by regulations such as GDPR, by privacy policies, and by those friendly cookie popups which are meant to obtain informed consent. They describe what is going to happen to the information once you put it into the system: is it going to stay private or not? Typically it is not, but you still have to ask people to give up their privacy, even if you just do it by default: if you come here, I take everything for free, unless you opt out.

When you take information from people, do you use it for one specific purpose only, or do you keep a copy around just in case? You share a picture of a fleeting moment with a friend: a copy will be kept safely stored and digitally preserved in case the police need it a few years down the road. Or if you ride your e-scooter through a pedestrian zone, or set your e-car's cruise control above the speed limit, you will get an automatically generated fine.

What kind of tracking do you do on the users? Do you receive their input, give them the output, and forget the interaction ever took place? Or do you preserve detailed logs of every transaction forever? Is this tracking outsourced to a third party? Are such logs kept with personal information, or are you going to strip them of the associated user identity? There is a whole industry dedicated to harvesting supposedly anonymous usage logs and reconstructing de-anonymized user profiles from them.

Finally, there is the retention issue. Once you put data into the system, does it stay there forever? Or is it going to be deleted after a certain time? And when something leaks, and everything leaks eventually, do you tell people that their emails or their credit card numbers have been stolen? Or do you just stay silent and pretend it didn't happen by not sending out any breach notifications?
Change Qualities What changes are expected in the future?
No Change: put it in Hardware
Software is expected to Change
Versioning

Before we discuss how easy or how difficult it is to change our software system, or to evolve our software architecture, let's discuss why change matters. Where does change come from? Why is software supposed to change? On the one hand, functionality will change, so correctness is a moving target. Concerning completeness: if the system is only partially complete, we have to complete it. Users come up with new ideas for more and more features, so we need to keep them engaged with the system: we need to give them more. The expectations for what software can do today have changed compared to what software was expected to achieve decades ago.

Another possible source of change: we have to fix bugs. A dangerous activity, because fixing a bug may introduce more bugs; this puts a limit on whether such improvements can be applied forever. Every change that we introduce is risky, because we go from a known state, which might be flawed, but whose flaws are known, into the unknown. Maybe it's a better state, who knows?

Changes can be driven by users, but also be imposed by the technology platform on which we depend. How stable is this platform? Information technology quickly evolves following technology cycles, in which we make progress, and sometimes we run just to stay in the same place. Hardware provides more performance so that software can waste it spectacularly. There are new and improved ways in which we can interact with the user: you used to type commands into physical terminals, then you could do the same in a virtual terminal window, or type commands into a chatbot channel, or use your voice to speak the same commands. Keeping up with technology evolution either requires fundamentally rewriting your software every decade or so, or building layer upon emulation layer to keep the semblance of a backwards compatible environment.

Ultimately, if you never expect the software to change, then what's the point of writing software? Just burn it into the chip and do it in hardware. Hardware is expected not to change as quickly as software. The main reason for doing it in software is that you expect it to change, to grow or to adapt: in general, to be flexible.

The main tool to control change is versioning. You make a version of the software; then you change it, and therefore make a new version that includes the change. There is a whole technology stack just to be able to do so reliably and efficiently with software code. Never write any code without keeping track of it with a version control system. There are many versioning schemes, some linear (with a counter), others non-linear (with multiple counters: v1.0.1) to distinguish internal from externally visible changes. While the original purpose of versioning is technical, it has also been exploited for marketing purposes: to establish a vintage (the 1995 edition) and to put pressure on users to keep buying the latest shiny software, again under the assumption that newer is better. Versioning is also used for determining compatibility: whether changes break dependencies.
Flexibility
• Configurability
• Customizability
• Modifiability
• Extensibility
• Resilience
• Adaptability
• Elasticity
Flexibility is the most general quality. Configurability, customizability, modifiability, and extensibility deal with how we can satisfy changes of functional requirements: how we can keep our users happy. Changes can come and go, like the seasons. Resilience deals with temporary changes, while we need to adapt to changes of a permanent nature: either we make the transition, or we're not going to be able to adapt and survive the change. Elasticity is about changing the resources allocated to run the system to deal with workload changes. In general, try to avoid one-way changes. It's always good to be able to change without the pressure of making irreversible changes. Rule number one of flexible software is: I can always go back.
Configurability Can architectural decisions be delayed until after deployment?
• Component Activation, Instantiation, Placement, Binding
• Resource Allocation
• Feature Toggle

Poor → Good → Better
Undocumented configuration options → Documented configuration options → Sensible defaults provided
Hard-coded parameters (rebuild to change) → Startup parameters (restart to change) → Live parameters (instant change)
The first thing you can do when you design your software and you want to make it easy to change is to avoid changing it altogether. In other words, write code that is a bit more general and can be specialized by configuring it. As an architect, you make a decision not to decide; as a developer, you then have to write enough code to handle multiple configurations, which will be actually chosen by whoever ends up with the pleasure of configuring the system: the operator, the installer, the consultant, or even the brave end user.

Configuring can be as simple as activating or deactivating which components get installed, or which features get switched on or off. We don't make this selection in advance; we leave it open, and then somebody who knows how to activate a feature can go there and do it if they actually need it. The basic configuration may be accessible for free, but advanced configurations may be expensive. It's the same software: we just switch different parts on or off depending on how much money you're willing to pay.

Another dimension that is usually very flexible concerns deploying the software in a certain execution environment. As long as the environment is compatible, the moment you deploy you have a choice: how powerful does the execution environment need to be to give the user a certain level of performance? The software is the same; you just dramatically change its performance by paying for faster hardware.

Every deployment can be done with the option to activate or deactivate features. You can start by making a dark launch, where you have the feature implemented in the code, but you don't activate it. You check whether the system is stable enough. When you're confident that the change you introduced with the new feature doesn't affect the stability of the system, you can activate the toggle. Does the new feature work? If the newly activated feature is problematic, you can switch off the toggle and revert back to the stable configuration.

Let's see some more concrete examples. Even if my system is very configurable, I may just not tell you how to configure it. There are hidden folders where you can find configuration files; maybe you're not supposed to know about them, and only developers can understand them. But if you know where to touch it, you can try to reconfigure it at your own risk. Unsupported configuration changes may void the warranty. Sometimes configuration options are well documented, so that users are told in which ways the system is expected to be configured. Giving default values for every option is a good idea, to ensure the system still works even if someone forgot to configure it. You'd be surprised by how many systems out of the box cannot start without proper configuration: they are so flexible that they will not stand straight on their own.

How expensive is it to change the configuration? Do you need a full rebuild? Are you going to set the configuration early in the pipeline? Many options are hard-coded into the source code, and if you want to change something, you need to recompile. Or the system is more dynamic, so you can change the configuration up until startup, and the system will read and apply the settings when it boots. Or you can make it configurable even while the system is running: you don't have to stop and restart it in order to configure it. So there are two configuration philosophies: the one in which, after setting the option, you have to click on apply, then click on ok, then reboot, and it will eventually work; and the other one in which you just set the option live. Guess which one is easier to recover from when you end up in a dark corner of the configuration space?

The assumption behind configurability is that the software stays the same and we change its behaviour by changing its configuration. There is one system, with one codebase, written with infinite variability. This has been the foundation of business empires, where the software is so flexible that armies of highly-paid consultants go out just to configure it. And yes, there is a thin line separating configuration languages from programming languages.
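A small sketch of a feature toggle with sensible defaults (the option names and file are invented): the same codebase serves both configurations, and the toggle decides after deployment.

```python
import json

# Defaults ensure the system starts even with no configuration file;
# deployed toggles can override them without rebuilding the code.
DEFAULTS = {"new_checkout_flow": False, "max_upload_mb": 25}

def load_settings(path: str = "settings.json") -> dict:
    settings = dict(DEFAULTS)              # sensible defaults provided
    try:
        with open(path) as f:
            settings.update(json.load(f))  # documented, external overrides
    except FileNotFoundError:
        pass                               # no file: run with defaults
    return settings

settings = load_settings()
if settings["new_checkout_flow"]:          # dark-launched feature toggle
    print("serving the new checkout flow")
else:
    print("serving the stable checkout flow")
```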
Customizability Can the architecture be specialized to address the needs of individual customers?
• One size Fits All
• Product Line
• White Labeling
• UI Theming, Skin
• Configurability, Composability
What do you call a piece of software which has been written or configured to satisfy the exact requirements of one customer? Very few customers can afford custom-made software, written entirely from scratch. But there are tricks one can play with the architecture to make custom solutions, custom configurations, or custom extensions. The architecture is so general that you can derive a whole family of specialized products from it. The customization process can follow multiple steps along the software vendor food chain. As you write the software, you do not give it a name; you do not brand it with a logo. Someone else will, as they resell your software under different labels. From the user interface it looks completely different, but under the hood it's the same software. This is similar to skinning the user interface: the scope of the change is limited to the visible surface. This is a way to limit the impact of the change while maximizing its visibility. Users don't usually notice code refactorings, but they will notice if you change the colors or the visual language of the toolbar icons. And they will be delighted if they can buy a new theme to slightly change the layout of their website. How do you customize a piece of software? Change the configuration to compose its components in a customer-specific way.
Change Duration • Temporary: Resilience Can the architecture return to the original design after the change is reverted?
• Permanent: Adaptability Can the architecture evolve to adapt to the changed requirements?
We should also consider whether the change is temporary or permanent. Can the need to deal with the change be ignored? Eventually, it will go away, but there may be consequences if your architecture is not resilient. Resilience is about the ability to change but also to undo the change. Resilience may also be achieved only for small changes, but if the impact of the change goes beyond a limit, then it may not be possible to absorb it. Adaptation uses an ecological metaphor. How well does your software fit within its niche? Which evolutionary forces make your software adapt? What if a competitor appears? What if the environment changes? Adapt or face extinction.
86
Adapt to Changing Requirements • New Feature: Extensibility Can functionality be added to the system?
• Existing Feature: Modifiability Can already implemented functionality be changed? Can functionality be removed from the system?
Extensibility is the ability to grow and support new and unexpected features. Successful software is never complete: you can always add more features to it. It is possible to design extensibility into the architecture. It is more difficult to do the opposite and change a system by removing features. The effort may not be much, but the risk will be much bigger than when the feature was added. It's safer just to disable the corresponding API and hide the feature from the user interface. In general, changing existing features without breaking everything else requires discipline, focus, and clarity of thought. Always think about the impact of changes before applying them. While extensions can be independently developed and bolted onto an existing plugin interface (see the sketch below), invasive changes require developing an understanding and building enough confidence to touch already written code. It doesn't matter if the code was written by someone else, or by a younger version of yourself.
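Here is a minimal sketch (TypeScript, hypothetical names) of extensibility through a plugin interface: new features are bolted on by registering extensions, while "removing" a feature is done the safe way, by disabling it rather than deleting the code.

// A minimal plugin registry sketch.
interface Plugin {
  name: string;
  execute(input: string): string;
}

const registry = new Map<string, Plugin>();
const disabled = new Set<string>(); // hide features instead of deleting them

function register(plugin: Plugin) { registry.set(plugin.name, plugin); }

function run(name: string, input: string): string | undefined {
  if (disabled.has(name)) return undefined; // feature hidden, code untouched
  return registry.get(name)?.execute(input);
}

// Extensions can be developed independently and bolted on:
register({ name: "uppercase", execute: (s) => s.toUpperCase() });
console.log(run("uppercase", "extensible")); // "EXTENSIBLE"
disabled.add("uppercase");                   // "remove" the feature safely
console.log(run("uppercase", "extensible")); // undefined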
87
Elasticity Can workload changes be absorbed by dynamically re-allocating resources?
• Assumption: Scalability + Pay as you go • Cost(SLA Violation) >> Cost(Extra Resource) • Example: Cloud Computing
How can we be flexible in our capacity so that we can deal with a dynamic and unpredictable workload? Allocated capacity costs money, but if we're not using it, can we avoid paying for it? What if the workload increases with a peak beyond the capacity: how can we increase the capacity to deal with it? And what do we do after the wave washes away? Elasticity is the ability of a system to dynamically adjust its capacity by allocating or deallocating resources. So you can make this part bigger or smaller, you can make its processor more or less powerful. You can add or remove storage. You can add or remove cores, memory, network connectivity and everything else that you need to run the system. The goal is to avoid overload situations. Avoid the penalty of making your customers wait in line. If they wait for too long, they are not happy, and they switch to a faster competitor. The assumption is that you can afford it, because increasing capacity costs more. An elastic architecture is sustainable if the cost of the extra capacity is less than what you would need to pay to the customers who are not getting serviced as promised. This is the main quality that defines Cloud computing as a specific type of execution environment. One of the original unique selling points of the Cloud was elastic scalability.
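A sketch of such an elasticity rule (thresholds, limits and names are purely illustrative): scale out while utilization is high, scale back in once the wave washes away, and never allocate more than the configured maximum.

// Hypothetical autoscaling rule.
interface Pool { replicas: number; readonly min: number; readonly max: number; }

function autoscale(pool: Pool, utilization: number): number {
  if (utilization > 0.8 && pool.replicas < pool.max) {
    pool.replicas++;      // absorb the peak with extra resources
  } else if (utilization < 0.3 && pool.replicas > pool.min) {
    pool.replicas--;      // stop paying for idle capacity
  }
  return pool.replicas;
}

const pool: Pool = { replicas: 2, min: 1, max: 10 };
for (const u of [0.9, 0.95, 0.5, 0.1]) {
  console.log(`utilization ${u} -> ${autoscale(pool, u)} replicas`);
}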
88
Elasticity Can workload changes be absorbed by dynamically re-allocating resources?
[Charts: a static resource allocation sized for the peak (wasted capacity) or below it (overload), compared with the ideal and the real elastic resource allocation curves tracking the workload]
89
To visualize elasticity, consider first the situation in which you have the perfect capacity to deal with the highest workload peak: all the capacity above the workload curve is wasted, because it is unnecessary for processing the rest of the workload. If you statically allocate less capacity, there will not be enough to deal with the peak. That's when your customers will experience outages: availability goes down. In a non-elastic architecture, the capacity line can only move up or down; it cannot change over time. This fits with how physical hardware behaves. Once your server is built and installed, it can only process so many requests. Depending on the manufacturer's design, you cannot easily replace its CPU or increase the amount of memory or storage. Ideally, we would like to have a system in which, if you don't use anything, you don't pay anything. The moment that you have some work to do, you dynamically add the necessary resources, which can be freed up when the work goes away. This is the most efficient type of resource allocation. What can one do to achieve this type of curve? You can monitor the workload and the performance of your system, and if you notice it lags, you can provide more resources. Can these resources become available instantaneously? What is the delay involved in ordering a new piece of hardware, delivering it, and installing it? What if it's enough to spin up a new virtual machine in the Cloud? If you rent a virtual machine: what's the minimum period you will have to pay for it?
90
Compatibility Does X work together with Y?
• Interfaces • Protocols and Data Formats (Interoperability) • Platforms (Portability) • Source vs. Binary • Semantic Versioning (Backwards and Forwards Compatibility)
Sometimes your users get so happy: you start from one user, you get their family, then you get all the friends, all the neighbors, and eventually you have the whole city. Then everybody from all over the world comes to try your awesome cookies. And you have to grow. Your architecture has to work at scale. You shouldn't touch the recipe of your cookies, but you have to hire more cooks. So you introduce a new clone of the architecture and adapt it to fit the local environment. The moment you have a modular architecture (it's enough to split it into two parts), it means you have multiple components. And those multiple parts are going to have to work together. Why is this connected with flexibility and change? Usually, you start from a state in which everything is compatible. Everything works together; then something changes on one side, and you break the compatibility with the other. You update your dependencies and you can no longer recompile your system. When assessing compatibility, you focus on the point at which the two systems are touching. We call it the interface. This is the point at which you can determine whether the two elements are compatible or not: if the interfaces fit, the two sides can work together. There are two types of compatibility. One is along the vertical direction, which is about compatibility between your system and the platform on top of which it runs. That's also called portability: we want to be able to take our system and transfer it between different platforms. Interoperability is horizontal. It describes the ability of different components across two independent systems to work together. There is also the distinction between source and binary compatibility. With source compatibility, after changing something you just have to recompile it and it will work again. With binary compatibility, you can make a change and still work with the old binaries: you don't have to recompile anything. Which is better? How do you know whether two elements are compatible? You can do a fully-fledged compatibility test, going through a detailed checklist to make sure every interface fits. Or you can compress such information into the version identifier. Take a look at the link on semantic versioning if you want to know more. Recently, semantic versioning itself went to version 2.0, so you can even version versioning schemes. Very briefly, you can introduce a convention over multi-dimensional version numbers to express whether there is forwards or backwards compatibility.
91
If a platform is backwards compatible, it means you can update the operating system and still run old apps on it. You can update your compiler and use the new version to compile the old code. Forwards compatibility is the opposite: new applications can run on an old operating system, and a web browser from the early 1990s can still display (perhaps with degraded rendering) a website written today.
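As a small worked example of the convention (simplified to MAJOR.MINOR, ignoring PATCH and pre-release tags), a consumer built against version 2.1 of a component works with any provider 2.x where x >= 1, but not with 3.0:

// Simplified semantic versioning compatibility check (sketch).
function compatible(provider: string, consumerBuiltAgainst: string): boolean {
  const [pMajor, pMinor] = provider.split(".").map(Number);
  const [cMajor, cMinor] = consumerBuiltAgainst.split(".").map(Number);
  // Same MAJOR: no breaking changes; provider MINOR >= consumer MINOR:
  // every feature the consumer relies on is still there.
  return pMajor === cMajor && pMinor >= cMinor;
}

console.log(compatible("2.3.1", "2.1.0")); // true: backwards-compatible upgrade
console.log(compatible("2.0.0", "2.1.0")); // false: consumer needs newer features
console.log(compatible("3.0.0", "2.1.0")); // false: MAJOR bump may break the consumer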
92
Portability Can the software run on multiple execution platforms without modification? • Write Once, Compile/Run/Test Anywhere • Cost(porting) < Cost(rewriting)

[Diagram: a direct connection, where the Client invokes f(x) on the Server by name, compared with an indirect connection, where the Client emits messages into a Queue and the Server receives them by subscribing with on]
Different connectors will also affect to which degree the connected components are aware of each other. Let's see if I can give you an example. When you make a function call, you need to know the name (the identity) of the function that you are calling. So the client expects that the function named f, which takes one parameter, actually exists on the other side. The client knows that it is calling this particular function. If you rename the function on the other side, the component that depends on this function f will be surprised: you just changed the name of the function, and the call can no longer be completed unless you also change the name of the function known by the client. We will see there are many forms of coupling, but this one is rather important: do you need to be aware of the existence or the actual identity of the component that you depend on? If you have this type of connection, the client directly connects to the server: it directly depends on the existence and on the identity of the specific component to which it has been connected (the function f). It is also possible to have different types of connections, for example a message queue, which isolates components and keeps them unaware of each other. If you connect components indirectly through a message queue, the nice property of messaging is that components only need to know and agree on the identity of the queue into which they deliver messages, or from which they listen for messages. The same holds if you have a message bus: I subscribe to a topic; when a message about this topic arrives, call me back. You want to publish a message about this topic? Please forward it to all interested subscribers. Both sides are directly connected to the queue, but they are indirectly connected to each other. When you publish a message into the queue, you have absolutely no idea whether the message will actually be received by anybody else. You copy it into the
358
queue: only if someone is interested will they pick it up. Maybe they don't pick it up today; they pick it up later. Or maybe nobody is subscribing, so the message just ends up in the dead letter queue. But you as a sender will have no awareness of who actually received the message. The communication will work, the data will flow, but the recipients will not need to know about the identity of the original sender. This is a form of connection which preserves a higher degree of independence between two components, because they remain unaware of each other's existence and identity. Still, there has to be some shared assumption: if you change the name of the queue, of course both components will be affected. Both components have to agree on the name of the queue, because they are directly connected to it. The queue itself acts as a layer of indirection between the two components.
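The contrast can be sketched in a few lines of TypeScript (using Node's built-in EventEmitter as a stand-in for a real message queue; note that a real queue would also buffer messages, which EventEmitter does not):

import { EventEmitter } from "node:events";

// Direct connection: the client knows the identity of the function f.
// Renaming f on the server breaks this call site.
const server = { f: (x: number) => x + 1 };
console.log(server.f(41)); // the client depends on the name "f"

// Indirect connection: both sides only agree on the name of the queue.
const queue = new EventEmitter();
queue.on("orders", (msg) => console.log("received", msg)); // receiver subscribes
queue.emit("orders", { id: 1 }); // sender publishes, unaware of who (if anyone) listens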
359
Connector Cardinality Point to Point (1-1)
Multicast (N-M)
The other classification we can make is about whether connectors connect only a pair of components (they are shown visually as the classical edge between the nodes in the graph) or whether more complex connectors can be used to connect more than two components together (the connector view then becomes a hypergraph). It is easy to transform one into the other, since a connector with cardinality N>2 can be split into N connectors of cardinality 2, which connect each of the components to a new node representing the original connector. As we have seen before, this node could represent the queue through which all other components can send and receive messages. At some abstraction level, it is simpler to visualize a single multi-pronged connector as opposed to many point-to-point ones.
360
Connectors and Distribution
• At design-time connectors define the points where the distribution boundaries are crossed so that components can be deployed over multiple physical hosts at run-time
Connectors are also fundamental to enable a distributed deployment of your architecture across multiple containers. At design time they help you to pinpoint the spots in which you will need to cross the boundaries between different runtime environments. So do all connectors give you support for a distributed deployment? Not all connectors support remote interactions. Some of those are actually local. For example, a shared memory buffer helps to transfer some data between processes running on the same operating system, but it is difficult to use shared memory in a distributed environment. You can pick a local or a remote connector and your choice will constrain whether components need to be co-located in the same host or can be freely deployed anywhere across the Internet. Or conversely, you can place your components in the corresponding containers, and then pick suitable local or remote connectors to make sure they can stay within reach. Warning: make sure that the decisions you represent in the deployment view are consistent with the decisions you make in the connector view. Take care to ensure that all your components can communicate and coordinate with each other, depending on where they are.
361
Connectors and Availability
Synchronous: both components need to be available at the same time
Asynchronous: the connector makes the communication possible between components even if they are not available at the same time
The last characteristic that I wanted to point out concerns the difference between synchronous and asynchronous connectors. This has a big impact on whether the pair or set of components that are connected can be available independently from one another. In the case of synchronous connectors, when we draw a line between the boxes, we assume that this will work only if both components are available at the same time. This means that when a component needs to interact with the other one, the other component has to be there. If it's not, the component may fail, and in case your design features a beautiful chain of synchronous connectors, you will have a cascading failure, potentially affecting the entire architecture. If, on the other hand, we have asynchronous connectors, the communication or the interaction will be successful even if some of the components are not available at the same time. How is that possible? Let's go back to the message queue example again. If you put a message into the queue, whoever receives the message doesn't need to be available at that exact time. You just assume that eventually they can receive it and read the message from the queue. When they are available, the message is delivered. For the delivery to work, the original sender does not have to be available at that time. You just have to make sure that there is a moment in which both the sender and the queue are available at the same time, so they can exchange the message being published. Later on, the queue and the receiver have to be available at the same time, so that the message can be delivered to its destination. But the original sender and the ultimate receiver of the message do not have to be available at the same time, thanks to the asynchronous connector between them. Another way to put it: if there is a change in the state of availability of one component, this change will leave the other side unaffected only if there is an asynchronous connector between them.
362
Software Connector Examples
Procedure Call Remote Procedure Call Remote Method Invocation
File Transfer
Linkage
Shared Memory
Message Bus Shared Database
Stream Disruptor
Tuple Space
Web
Blockchain
In the second part of the lecture, I want to make the topic of software connectors more concrete by giving you a number of options that you can choose from when you model the connector view of your architecture. The goal is to take the graph linking the components that are logically dependent on each other and refine it by classifying each of the edges: which connector are you going to use? And, as a consequence, which properties can you expect about the overall behavior of the system? A bridge is a very important type of connector in the real world: it enables people to safely cross a river; it allows the people on one side to meet the ones on the opposite side. Over time, the bridge becomes not only a way of solving the problem of getting across the river without boats, without hanging from a rope, or without having to swim against the current. Some bridges become meeting places, where people start to work and live on the bridge. You have a marketplace, with tourist shops and customs checkpoints. The function of the bridge actually gets quite overloaded with many more features in addition to the original use case. A lot of software connectors have experienced the same type of complexity growth: you start trying to solve a simple problem, and then you add more and more responsibility onto the connector. Let's go over the software connector examples that we are about to compare in detail. You have probably already seen some of them, and you may have already used some, for example if you ever shared the same database between multiple components. The most common mechanism used to share information between different systems is the file transfer. Files and databases are concerned with data exchange. A more advanced form of data exchange is the stream: a continuous exchange of infinite amounts of data. As opposed to file transfer, where files are exchanged one at a time, bits flow through a stream continuously. We can also promote our local shared database to make it
363
accessible from the whole World Wide Web, so we can do data sharing at a global scale. Or we can have a more powerful form of database, which supports not only communication but also coordination among a set of components: the tuple space. If you are interested in coordination, there is the procedure call or remote procedure call. With it, components can transfer control with a very simple and easy-to-program type of interaction. One makes a call; the other does the work to process the data and then gives back a result when it's done. And while this happens, the caller waits. The call blocks and can continue only after the result arrives. Calls handle both communication and coordination. If you want to avoid the synchronous/blocking limitations of a call, you can switch your decision to use the asynchronous message bus as a connector. The call introduces a direct connection between the two components. The bus is an indirect connection, because all interactions between publishers and subscribers happen via the message queues in the bus. Another very simple connector, the linkage, is used to link together different components. What is interesting about linkage is that it can be not only static but also dynamic. So if you want to build an extensible, open, plugin-based architecture, you will need a linkage connector somewhere. Shared data can be slowly persisted over long periods of time, or be volatile, with local, shared memory-based connectors. What if you have a local (co-located) set of processes which need to share some data? What if you need a low-latency, lock-free solution? Take a look at the disruptor. One of the most recent additions is the blockchain, which helps to share an immutable, append-only transaction log maintained over a decentralized network of untrusted peers. As we go over each of these examples, we will see in more detail how each connector works. We can also classify them according to whether they are synchronous or asynchronous, whether they establish a direct or indirect connection, and whether their purpose is to perform data or control transfer, or both.
364
RPC: Remote Procedure Call
Call
[Sequence diagram: the Client sends a request to the Server and blocks; the Server processes it and sends back the response, after which the Client resumes]
• Procedure/Function Calls are the easiest to program with. • They take a basic programming language construct and make it available across the network (Remote Procedure Call) to connect distributed components. • Calls are synchronous and blocking interactions
The first connector is one of the simplest ones: the call, friend of every programmer. We have seen components as a basic construct helping to design modular systems, with the ability to delimit and encapsulate reusable parts of your code into functions (or procedures). If functions separate and structure your code, what is the corresponding construct to assemble functions together? How do you decide that you need to make use of this function at the right place? You call the function. What does it mean to call a function? First, make a request passing some input data. The other side is going to receive the request and process it. And, as soon as possible, a response will deliver the result back to the caller. In the sequence diagram, you notice that there is a gap in the control flow. First the client is active; then, as the call starts, control is transferred to the server, which was not active before the call started. Since the client is waiting for the result, we say that the client is blocked until the response comes back. Once the server completes processing the request, the response is delivered and the client becomes active again. I focus on this detail to show you that with the call connector we have both communication and coordination. We have a request message carrying the input parameters, followed by result data that is sent back with the response. We also have coordination, because the client will block as it waits for the processing on the server to finish. This interaction is also about transferring control: making the server (a passive entity) do something and then resuming processing once the server sends back the response to the client. This combination makes calls very popular, as most developers know how to write code that uses method calls, function calls, different types of calls, and we can use the same abstraction remotely. It is much more complicated to implement: a local call is just one CPU instruction to jump and execute code at a different address. It takes a few nanoseconds. If you try to make a call across the Internet with the server running on a different continent, then you have a significant latency due to the finite speed of light.
365
Also, while it is possible to locally jump to a bad address, as you make a call across the network, the other side may not necessarily be available when you try to call it. The network itself could also be problematic, dropping the response message although the request went through ok. If you decide to introduce a call because you want to make it easy to implement, you pay, as a consequence, the price of these drawbacks in terms of performance and reliability, as you rely on a synchronous connector. Since the call will also block the client, the interaction is in a way inefficient, because the client will remain idle while the call is happening.
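A sketch of what such a call looks like from the client side (TypeScript on Node.js 18+ with ES modules; the endpoint URL and the wire format are made up): the request marshals the input parameters, and the await makes the blocking nature of the call explicit.

// Hypothetical remote procedure call over HTTP with JSON marshalling.
async function rpc(name: string, args: unknown[]): Promise<unknown> {
  const response = await fetch("http://localhost:3000/rpc", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ name, args }), // request: marshal the input
  });
  if (!response.ok) {
    throw new Error(`remote call failed: ${response.status}`); // remote calls can fail
  }
  return response.json(); // response: unmarshal the result
}

// The client blocks (awaits) and stays idle until the result comes back:
const result = await rpc("f", [42]);
console.log(result);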
RPC: Remote Procedure Call
Call
✓ Data ✓ Control
× Local ✓ Remote
✓ Direct × Indirect
✓ Synchronous × Asynchronous
✓ 1-1 × N-M
Let's summarize the main properties of this connector. When you make a call, you jump to the other side. If the call has parameters or returns a result, then you will also be sending some data back and forth: calls certainly imply a control transfer, but also data exchange. You can have both local and remote calls. Whenever we have remote interactions, in most cases they can also happen locally. A call is a direct connection between exactly two components: the caller and the callee. The component being called must be available to answer the call: calls are by definition a synchronous type of interaction. We will enumerate these properties for all connectors, so you have a frame to compare them against.
366
File Transfer
Write Copy Watch Read • A component writes a file, which is then copied on a different host, and fed as input into a different component. • The transfers can be batched with a certain frequency • Transferring files does not require to modify existing components which can already read/write files • Source and destination file formats need to match
If you're interested in communication and transferring data, the file transfer is actually one of the most popular connectors. You will be surprised how many architectures still use the file transfer protocol (FTP) to transfer information between different systems. I was once visiting a company and they were very excited about having found an anonymous FTP server somewhere on the Internet, which they discovered they could actually use to copy data between their systems. You know, let their users upload some files from their desktop computers to the FTP server so that the files can be downloaded on the other side: information gets shared this way. You should know FTP is an ancient Internet protocol that is totally insecure; if you want to use the file transfer connector today, pick an encrypted protocol. What makes it a popular connector is the only assumption that you need to make about the components that can be connected by file transfer: their ability to read and write files. You can take existing code written many years ago, and the simplest way for it to interact with the external world is by reading an input file and writing to an output file. If you want to connect and integrate this program with another one, just copy the file and give it as input to the next step. When we transfer files, we need to decide when we actually transfer them. If you make the decision to use a file transfer, you have to know: which data gets transferred? What's the right frequency? Once per day? Every night? Or do you transfer the file after it reaches a certain size? Another important assumption, beyond reading and writing files, is that if you write a file and export it from a component, the component that is going to import the file has to be compatible. Many integration projects have failed because components could not even import by themselves the files that they were exporting. Both components share assumptions about the format and the content of the file being transferred, which needs to be understood by both sides. If you write a file in a certain format and the other side doesn't understand it, what can you do? You never had this problem? What if your friend sends you some files: can you always open them? How can you solve the mismatching file format problem?
367
File Transfer — Write Copy Watch Read
[Sequence diagram: the Source writes a File, the connector copies it to the Destination host, and the Destination reads its local copy]
Let's see in more detail how file transfer works: here we have the two components. What are the primitives associated with the connector? How do we use them to describe what happens when a file gets transferred? There is a lot of machinery under the hood for file transfers to work successfully. The origin component has to write the file. How often does this happen? There will be a point at which the file is ready. The transfer can start. The connector makes a copy. This already has a performance impact: you have to double the storage cost, since each side needs to have storage space allocated to keep a copy of the file, especially if there is no shared file system among the containers in which each component is deployed. If you are in a distributed runtime environment, you use FTP, SCP, or some other (hopefully secure) file copy protocol to execute the transfer. After the actual transfer has happened and the file has arrived on the other side, the component can read it. This completes the basic file transfer interaction.
368
File Transfer — Write Copy Watch Read
[Sequence diagram: the Destination watches a Hot Folder; the Source writes the File, which is copied into the Hot Folder; the watch produces a notification and the Destination reads the file]
While copying a file is a form of data transfer, how does the destination component get to know that the file has been copied successfully and is ready to be processed? We can look at the size of the file: if the file is not empty, or its size has stopped increasing for a while, then you know that it can be read. One can test if the file exists. You can check if its latest modification date has changed: this may mean that it has been refreshed, so there is probably something new in there. These are all heuristics, which go under the 'watch' primitive offered by the connector. Watch is what should be done on the destination side to become aware of the change in the modification date or in the size of the file, or simply to detect the appearance of a new file. Watch produces a notification, which triggers the processing of the file. If we want to actually coordinate the execution between the two sides, and not only solve the data transfer problem but also add a little bit of coordination on top, the interaction becomes slightly more complicated. We use the concept of a hot folder to represent the fact that the destination expects files to appear in a certain location. The hot folder is being watched by the destination component. There are file system APIs where a component can subscribe to be notified if something happens inside this folder. For the coordination to work, the files need to be copied into the right location. This is another shared assumption: sender and recipient agree on the content, the format, as well as the location of the hot folder in which the file should be placed. Each recipient component can have its own hot folder so that files being transferred are routed appropriately. Once the file appears in the watched location, the component is notified and it can start reading it. The destination should avoid reading the file while it is still being copied. There has to be a mechanism that detects when the copy operation is finished. Only
369
after the file being copied is closed does the other side get the notification: the file can now be opened for reading. So we also have a clear mechanism to transfer control. You can see that the sender is not blocked and can continue writing more files while the recipient is processing the ones which have just been copied across. In general, different connectors provide different primitives to manage the data and control transfer. In the simplest case, we have seen there is only a call (which performs both control and data flow). The file transfer connector actually has four different primitives (write, copy, watch, read) that you need to know how to combine to make the interaction work.
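The destination side of this interaction can be sketched with Node's file system watch API (TypeScript; the hot folder path is hypothetical, and a production version would also need a heuristic to detect that the copy has finished before reading):

import { watch } from "node:fs";
import { readFile } from "node:fs/promises";

const hotFolder = "/var/spool/incoming"; // agreed-upon location (hypothetical)

// Watch the hot folder; the callback is the notification that triggers the read.
watch(hotFolder, async (eventType, filename) => {
  if (eventType === "rename" && filename) { // a new file appeared
    const content = await readFile(`${hotFolder}/${filename}`, "utf8");
    console.log(`processing ${filename}: ${content.length} characters`);
  }
});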
370
File Transfer
Write Copy Watch Read
✓ Data ✓ Control
✓ Local ✓ Remote
× Direct ✓ Indirect
× Synchronous ✓ Asynchronous
× 1-1 ✓ 1-M × N-M
File transfer is about copying the data across. Sometimes you FTP the data, then you make a phone call and tell the other side: the data has arrived; run your code to process it. So the control flow is achieved via out-of-band communication. The transfer can work both locally and remotely. Is this connector direct or indirect? Are the two components directly aware of each other? What is the assumption they make in order for the transfer to work? Recipients know about the file but not about its source, the sender. The most important assumption concerns where the file is going to appear (the hot folder); maybe they need to agree on the name of the file. But they don't care where the file is coming from. The interaction is mediated through the file, which doesn't necessarily carry any knowledge about its origin. Is this connector synchronous or asynchronous? Does the availability of the destination impact the ability of the source component to write the file? There is a point in time at which both parties need to be present: when you make the copy. For transferring the file across two different systems, both of them need to be present. But writing the file is a local operation. And when reading it, the file has already arrived, so that is also a local operation and the other side doesn't need to be present. So there is only one step, when you are actually doing the transfer, during which synchronous availability is needed. From the perspective of the components, since they are indirectly connected via the file, they can interact asynchronously. For example, during the day you collect all the changes and write them into a file. At the end of the day, you make the copy so the other side can process it at night. The file transfer will be delayed until both sides wake up to be able to exchange the data. Regarding the topology: the simplest case is of course one to one, with one writer and one reader. It is possible to generalize this to multiple readers. Once you have a file, you can transfer it and send it to multiple destinations. To summarize: this is the most popular connector, as it doesn't require any change to the code. You can use it as long as your existing software components know how to read and write files. Don't forget: the content and the format of the files need to be compatible.
371
Shared Database
Create Read Update Delete • Sharing a common database does not require to modify pre-existing components, if they all can support the same schema • Components can communicate by creating, updating and reading entries in the database, which can safely handle the concurrency
Here is another approach, which helps to solve the problem of having multiple writers to the same shared file. Instead of copying the file around, just connect all components to the same database. A database is already present in most systems that have stateful components which persistently store their state. The shared database connector simply proposes to reuse (or misuse) this already existing database and configure multiple components to use the same database to store their states. If all the components share the same database, they can actually communicate via the database. One component, for example, writes information into it and another component can read it. Components already know how to query a database. We just have to trick them into connecting to the same one. It is enough to reconfigure the connection strings of all of your components and point them to the same database. Assuming they support the same schema, the data that they store is compatible. If they understand each other's data, the shared database is a big improvement over the file transfer. One problem with the file transfer is that it is a batched operation that happens with limited frequency (e.g. once per day). The advantage of switching to a shared database is that right after one component writes into the database and the transaction is committed, the information becomes visible to all other components. This can have a significant performance impact. Many companies work with multiple systems that are customer facing. If they use file transfer to synchronize the data, it can happen that customers perform one operation – for example, in person in the store – and when they go home and check the result through the Internet, they find that the result of the operation is not yet visible. The updates performed via the front office system in the store would be sent to the Web backend database using file transfer, and the changes would be propagated to the rest of the systems only after 24 hours. So there is a very inconvenient and annoying delay from the customer experience point of
372
view, due to file transfer. If you put a shared database instead, the propagation delay disappears. As soon as the transaction commits, the data is already available, so that customers can even get a notification about their purchase on their mobile phone. The advantage is obtained by centralizing all the information in one place. Another advantage is that the database is designed to support multiple concurrent writers. Components run different atomic and isolated transactions, and the database serializes them to ensure that the shared data remains consistent. Another feature provided by databases is access control, so we can use it to check whether components should be allowed to write or only read certain shared data items. What could be a disadvantage? If you worry about scalability, that could be an issue: with lots of clients, is a database going to scale to handle queries from all of them? Performance depends on your expectations. Compared to the file transfer connector, latency is better. What about throughput? What is the most efficient way to transfer terabytes of data? Availability: if the database is not available, nothing works, because everything depends on it, so that is a serious issue. The shared database is a single point of failure. What about the schema? What could possibly go wrong? In the same way that with the file transfer we assume that the format of the file is compatible, here we assume that the schema of the database is and remains compatible. All components have to be compatible with the shared database schema. What happens if you make a change to the schema? If one component needs to modify the structure of the data, you need to ensure the compatibility of the schema modifications with all of the other components. Since the schema needs to be kept compatible, you can no longer independently evolve your components concerning the way they represent and store their state. Every component is affected, since its state is no longer private, but potentially shared with all other components. Changes to the state's representation (not to its content) become much slower, since all components need to agree with the changes and be brought up to date in lock step.
373
Shared Database
Create Read Update Delete
✓ Data × Control
× Local ✓ Remote
× Direct ✓ Indirect
× Synchronous ✓ Asynchronous
× 1-1 ✓ N-M
So far we have discussed data transfer; what about coordination? We share the data using these primitives: create, read, update. We can also delete it. But are the other components able to react to those state changes? Probably not, unless they poll (repeatedly read) the database very frequently to see if some changes happened, so that if something changes, they can do something about it. This is a very inefficient way to transfer control. Is there a primitive provided by the database which would enable components to be notified about changes? With plain CRUD (create, read, update and delete) primitives, there is no way that you can efficiently transfer control. This connector is also remote and indirect: we have database servers that we can connect to from anywhere, which act as intermediaries between components, which do not need to be aware of one another as long as they share the same database connection string. They connect to the database, and then through the database they can exchange information. Do the components need to be available at the same time? Well, this can also be seen as a consequence of the indirect and remote properties. If you have an indirect, remote connection, then it will most probably be asynchronous. As long as one component can talk to the database, it doesn't matter if other components are available at the same time to read the information being updated. This may happen any time later. Databases support multiple connected clients, so the shared database connector cardinality is many to many.
374
Message Bus
Publish Subscribe Notify • A message bus connects a variable number of components, which are decoupled from one another. • Components act as message sources by publishing messages into the bus; Components act as message sinks by subscribing to message types (or properties based on the actual content) • The bus can route, queue, buffer, transform and deliver messages to one or more recipients
The fourth most popular connector, after file transfer, remote procedure call and shared database, is the message bus. A message bus routes and delivers messages between different queues. A message bus is used to connect two or more components while keeping them unaware of each other. Message buses are very useful to decouple the components connected to them. The only assumption components need to make concerns the addressing scheme that they use to identify the destination or source of the messages. It is possible to publish a message into the bus without any knowledge of where or when exactly the message will be delivered, or whether it will be delivered at all. As long as you receive the message from the bus, you do not need to know who sent it (unless you need to reply exactly to the sender). Subscriptions can be based on meta-data associated with the message, on its actual content, or simply on the name or address of the queue. To successfully exchange messages, senders and recipients need to agree on the queue name. With content-based routing, instead, subscriptions are defined based on the content of the messages: you just look inside each message and, based on the information you find in the actual data, you can decide who is interested in receiving it.
375
Message Bus — Publish Subscribe Notify
[Sequence diagram: the Destination subscribes with the Bus; the Source publishes a message, which the Bus routes and delivers by notifying the Destination]
The messaging primitives are: subscribe, publish and notify. First, the subscription tells the bus about the topic of interest, about the queue, or about the content in which the destination is interested. Only after the subscription is active are published messages that match the subscription going to be delivered to the subscriber. There is a clear sequence: if you subscribe after a message has already been published, you are not going to be notified about it. First you need to subscribe, so the bus knows where it should deliver future messages. Notification combines data and control transfer: once the message arrives, it gets delivered, and the notification wakes up the recipient so that they know there is a new message to process. This is also a way to transfer control, not only data (the content of the message): the recipient reacts to the 'you have got mail' event. As you notice by looking at the sender: after the sender publishes a message into the bus, the sender is not blocked. So you send a message and then you can continue working on your side, because you are not interested in hearing any answer, as opposed to the call, which would block after sending the request until receiving the corresponding response. Here we just send a message and we can continue while the message is delivered and processed elsewhere. Sometimes the primitive 'publish' is also called 'emit'. The notification is an event, so we use 'on'. When you write 'on', you are making a subscription, and you pass the listener that will be called back with the notification every time a message arrives. Message buses always work with these three basic interactions: first you say "I am interested". Then somebody says something that is interesting, and then you get the notification and you work with it. Within the message bus, many things can happen. The bus routes, queues, buffers, transforms and delivers the messages. Routing means figuring out, based on the message, where it
376
should be delivered to. Based on the current subscriptions, the destination can actually be multiple recipients. Not all recipients may be available to pick up their copy of the message at the same time, so the bus may need to store messages in transit and forward them along when the recipients are ready for them. We use the term queue to indicate that there may be multiple messages going through the bus and there has to be some ordering between the messages. Different systems give you different guarantees. In some cases, you can ensure that the order in which the messages are sent is the same order in which they are received on the other side. But especially in case of failures this is not always true, so in some cases you send the messages in order and then you receive them in exactly the opposite order. Why can this happen? Because inside the message bus there are buffers where the messages are stored, and if something fails during their transmission the bus will retry sending the message. These repeated attempts can cause out-of-order but also duplicate message delivery. Very powerful "enterprise" message buses can also perform message transformations: they can convert messages between different formats. This is typically not available in a file transfer connector, where you copy the bits of the file as they are. A message bus can be configured with message transformation rules to apply translations and modifications to the content in transit.
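The subscribe-before-publish ordering can be sketched with Node's EventEmitter standing in for the bus (a real bus with durable queues would buffer the early message instead of losing it):

import { EventEmitter } from "node:events";

const bus = new EventEmitter();

bus.emit("news", "too early");                    // no subscriber yet: the message is lost
bus.on("news", (msg) => console.log("got", msg)); // first subscribe...
bus.emit("news", "delivered");                    // ...then publish: prints "got delivered"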
Message Bus
Publish Subscribe Notify
✓ Data ✓ Control
× Local ✓ Remote
× Direct ✓ Indirect
× Synchronous ✓ Asynchronous
× 1-1 ✓ N-M
Let's summarize the properties of the message bus connector: both control and data transfer; remote (but also local); indirect; asynchronous (by definition); with one-to-many cardinality (thanks to multicast or broadcast, where multiple recipients are involved), but also many-to-many (if different components can send messages to the same address).
377
Stream
Send Receive • Data streams let a pair of components exchange an infinite sequence of messages (discrete streams) or data (continuous streams, like video/audio) • A streaming connector may buffer data in transit and provide certain reliability guarantees with flow control • Streams are mostly used to set up data flow pipelines
Similar to the file transfer connector, the stream also enables two different components to communicate, one sending the data, the other one receiving it. We talk about the stream data source and the stream data sink. As opposed to the batched file transfer, or the discrete message bus, the stream is continuous and is meant to deliver an infinite sequence of messages. The file transfer happens once, while the stream flows all the time. The advantage is that the information that we need to communicate appears on the other side as soon as possible, as opposed to the file transfer, which requires copying the entire file before the other side can start processing it. We can see the difference between file transfer and streaming with another example: when you watch a video online, if you stream it, you can start watching as soon as enough data has been received. If you download it, it means you are doing a file transfer, so you need to wait for the whole transfer to complete before you can actually open the video file and watch it. Also, once you stop watching the video stream, it is gone, while if you downloaded the video, you have a local copy which you can watch as many times as you want. The streaming connector's purpose is to make the data flow continuously: to do so, the data may be buffered while in transit between the two sides, because the rates of sending and receiving may vary. A stream connector may also employ flow control protocols to slow down or speed up the sender if the receiving end is not able to keep up or is waiting for more data to arrive. A stream connects two components, but multiple components can be connected along a streaming pipeline. Such a linear topology of components is found in data flow pipeline architectures. In that case, we have one source of data which will be streamed and processed by multiple filters, until it eventually reaches the final sink. The role of the sink is usually to store the information that has been processed through the pipeline, or to display and visualize it for the user, or both.
378
Stream — Send Receive
[Sequence diagram: the Source sends stream elements into a Pipe; the Filter receives each element, processes it, and sends the result into the next Pipe; the Sink receives it at the end of the pipeline]
Let's see in more detail how the various components interact along the pipeline. The source sends stream data elements into the pipe. The pipe will buffer each element and deliver it as soon as the filter is ready to receive it. Once the filter is done processing, it will send the result along to the next pipe, which will buffer it and then wait for the sink to receive it. This happens once for the first stream element to go through the whole pipeline. When the source sends the next stream data element, the whole pipeline will process it again in the same way. As opposed to the message queue, which delivers individual messages to one or more subscribers, here we have the pipe, which is responsible for delivering an infinite sequence of messages from one sender to one receiver.
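Such a source-filter-sink pipeline can be sketched directly with Node's stream API (TypeScript, ES modules), where the pipes buffer elements and apply back-pressure (flow control) between the stages:

import { Readable, Transform, Writable } from "node:stream";
import { pipeline } from "node:stream/promises";

const source = Readable.from(["a", "b", "c"]); // stream data source

const filter = new Transform({
  objectMode: true,
  transform(chunk, _encoding, done) {
    done(null, String(chunk).toUpperCase()); // process one element at a time
  },
});

const sink = new Writable({
  objectMode: true,
  write(chunk, _encoding, done) {
    console.log("sink:", chunk); // store or display the processed elements
    done();
  },
});

await pipeline(source, filter, sink); // the pipes handle buffering and flow control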
379
Stream
Send Receive
✓ Data × Control
× Local ✓ Remote
✓ Direct × Indirect
✓ Synchronous × Asynchronous
✓ 1-1 × N-M
What is the data stream used for? Mainly data transfer, since all components are actively producing and consuming data into and from the streams they are connected to. The stream can be local (an in-memory structure), or it can be implemented across the network using some streaming protocol. The stream establishes a direct connection between a pair of components, which need to agree on which end of the pipe they connect to. The stream is a synchronous form of communication, in which both components need to be available and connected at the same time. While the pipe can provide buffering, usually the stream is processed live. Once the information has flowed through the stream, it is not persisted: if you were not available at the time it was streamed, sorry, you missed it. And the topology is one to one, which results in linear data flow pipelines, sometimes also known as pipe-and-filter architectures. Some components can split or demultiplex a stream into multiple outgoing streams, or merge or multiplex multiple incoming streams into one.
380
Linkage
Load Unload • Statically Linking components enables them to call each other, but also to share data in the same local address space • Dynamic linking also enables the components to be loaded and unloaded without stopping the whole system
The next connector that we are going to discuss is the linkage. Linkage takes two pieces of software and links them together so that they can call each other and access each other's state. Linkage can happen statically, while the system is being built: as a result, the two input components are linked into one resulting component, which can be deployed as a single unit. This is an operation that happens after you compile the code and before you package it so that you can release it. Linkage can also be done dynamically. This means that the two components are already deployed: we can dynamically load one component and link it with another one. It is also possible for the opposite to happen, where we detach a linked component and separate it from another component before unloading it. With dynamic linking, in the ideal case, we can do so without having to stop the whole system.
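In a JavaScript runtime, dynamic linkage can be sketched with a dynamic import (the plugin path and its interface are hypothetical; note that module systems differ in how fully they support unloading, so dropping all references is often the best approximation):

// Sketch of dynamic loading/unloading of a plugin component.
interface LoadedPlugin { activate(): void }

let plugin: LoadedPlugin | null = null;

async function load(path: string): Promise<void> {
  const mod = await import(path); // dynamic linking at run-time, no restart needed
  plugin = mod.default as LoadedPlugin;
  plugin.activate();
}

function unload(): void {
  plugin = null; // drop the reference; true unloading depends on the runtime
}

await load("./plugins/spell-checker.js"); // hypothetical plugin module
unload();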
381
Linkage
Load Unload
✓ Data ✓ Control
✓ Local × Remote
✓ Direct × Indirect
✓ Synchronous × Asynchronous
✓ 1-1 × N-M
Linkage is a special connector: it is used to enable data sharing and calls between the components, but by itself it doesn't directly implement either of those. As it is more of a tool used to construct a component out of smaller parts, it is a local operation. The two components are directly linked with each other. And, by definition, the components have to be present and available at the time they are linked together. If you can link two components, then you can link any set of components, one pair at a time.
382
Shared Memory
Read Write Lock Unlock • Shared memory buffers support low-latency data exchange between threads/processes • Concurrent access must be coordinated via read-write locks or semaphores
Like a shared database for remote components, the shared memory connector works with local (co-located) components. The purpose is to coordinate access to a shared data structure, which can be read or written by the different components. As opposed to a shared database, this connector does not provide persistent storage, but it provides high performance, thanks to low-latency, zero-copy access to shared data. Exactly the same memory is mapped into the different components, so they can transfer data with very low latency, just by sharing a pointer to it. You just copy a pointer and then, dereferencing it, you find the data that you want. This is the most efficient way to communicate. The problem is that, since you are sharing a pointer to a shared data structure, you need to coordinate concurrent read and write access to it. One solution involves using locks or semaphores.
383
Shared Memory — Read Write Lock Unlock
[Sequence diagram: the Source writes a value into the shared Buffer; the Destination then reads it]
This is the simplest scenario. We have a component that is going to write into the buffer. Somehow the destination knows the reference to the same buffer. The source component can write information into the shared memory buffer before the destination component reads that particular value. This example shows you how we can use the read and write primitives of this connector.
384
Shared Memory — Read Write Lock Unlock
[Sequence diagram: Thread 1 locks the Buffer, writes, and unlocks; Thread 2, which was blocked waiting for the lock, then locks the Buffer, reads and writes, and unlocks; Thread 1 must wait for the lock again before accessing the data once more]
Here we show how we can synchronize access to the shared memory buffer. For example, one thread needs to lock the buffer before interacting with it. Then it can write some data into it. After this update is complete, it can unlock it. This makes it possible for other components to acquire the lock only when the information is in a consistent state. We can see that the second component is blocked until the first component releases the lock on the buffer. While this second component holds the lock, it can read the information and modify it, and then release the lock on the buffer again. The first component then tries to access the data once more, but it has to wait until the second component has unlocked the buffer. This interaction is more complicated, but thanks to the locks we can make sure that we only read the data after somebody else's write has completed.
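This lock/unlock discipline can be sketched in Node.js with a SharedArrayBuffer and Atomics as the locking primitive (TypeScript, ES modules; a sketch of mutual exclusion only — it does not guarantee in which order the two threads run):

import { Worker, isMainThread, workerData } from "node:worker_threads";

const LOCK = 0, DATA = 1; // slot 0 holds the lock flag, slot 1 the shared value

function lock(buf: Int32Array): void {
  // Spin until we atomically flip the lock from 0 (free) to 1 (held).
  while (Atomics.compareExchange(buf, LOCK, 0, 1) !== 0) {
    Atomics.wait(buf, LOCK, 1); // block while someone else holds the lock
  }
}

function unlock(buf: Int32Array): void {
  Atomics.store(buf, LOCK, 0);
  Atomics.notify(buf, LOCK, 1); // wake up one waiting thread
}

if (isMainThread) {
  const sab = new SharedArrayBuffer(8);
  const buf = new Int32Array(sab);
  new Worker(new URL(import.meta.url), { workerData: sab });
  lock(buf);
  buf[DATA] = 42; // write only while holding the lock
  unlock(buf);
} else {
  const buf = new Int32Array(workerData as SharedArrayBuffer);
  lock(buf);
  console.log("read", buf[DATA]); // read only while holding the lock
  unlock(buf);
}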
385
Shared Memory
Read Write Lock Unlock
✓ Data ✓ Control
✓ Local × Remote
× Direct ✓ Indirect
✓ Synchronous × Asynchronous
× 1-1 ✓ N-M
Let's summarize the main properties of the shared memory connector. We can use it for efficient data transfer and communication thanks to the read and write primitives, but we can also use it for coordination because of the locking primitives. Shared memory is by definition a local connector. It is indirect, because the components only know the address of the shared buffer but remain unaware of each other. The interaction is synchronous because, since this is a local deployment, either all the components are present or none is. Like the shared database, this is also a connector that enables two or more components to share access to the same local memory buffer.
386
Disruptor
Next Publish WaitFor Get • Lock Free: Single Writer, Multiple Consumers • Cache-friendly Memory Ring-buffer • High Throughput (Batched Reads) and Low Latency
The next connector is an optimization of the shared memory buffer, which works without the expensive locks that can block concurrent access. There are many data structures that are lock free. The disruptor is one example, which is particularly interesting if you have a large number of concurrent threads that need to process a large amount of shared information. For example, here we can see one producer and two consumers. They share a memory buffer of a fairly large size. The buffer uses a circular data structure, which we call a ring buffer. It offers very good performance in terms of throughput and latency because of the lack of synchronization. It is also interesting because it allows consumers to read the information element by element, but also to perform batched reads.
387
Disruptor — Next Publish WaitFor Get
[Sequence diagram: the Producer claims slots with Next 1–3 and publishes elements 1–3 into the Ring Buffer; Consumer A gets elements 1, 2 and then 3 one at a time, while Consumer B catches up by getting elements 1, 2 and 3 in one batch; both consumers then block on WaitFor until the Producer claims and publishes element 4, which they both get]
Let's see how the disruptor works. In this configuration we have one producer entering data elements into the buffer. Before we can write some information into the buffer, we have to ask the buffer where to put it, and this is done using the 'next' primitive. The next primitive gives us the address of the next available location for writing. We can repeatedly publish new data into the next available position in the buffer. The buffer is being populated by the producer which is writing into it. The first consumer wakes up and asks the ring buffer for the first element. So far we have managed to transfer one piece of data between two components like before, but there is more data, so we can keep getting all of it. The other consumer is independent from the first one, but it gets all the data present in the buffer with one single sweep. This is an efficient way to catch up. Once the buffer is empty, the two consumers can register with the buffer and ask to be notified when more information is added. This happens right after the producer publishes the next element: the two consumers are unblocked and can get it. As opposed to the previous solution, where we had locks at every step, here we can read and write information without locks, because the consumers will always read one element behind where the producer is writing. Only when the producer runs out of space for writing the next element, or a consumer runs out of new elements to read, do we have synchronization.
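The sequencing logic can be sketched in a simplified, single-threaded form (this is not the real LMAX implementation — no memory barriers, a single producer and a single consumer cursor — just the next/publish/get bookkeeping):

// Simplified ring buffer sketch inspired by the disruptor's sequence counters.
class RingBuffer<T> {
  private slots: T[];
  private produced = 0; // sequence number of the next slot to publish
  private consumed = 0; // sequence number of the next slot to read

  constructor(private size: number) { this.slots = new Array<T>(size); }

  next(): number {
    // The big-enough-buffer assumption: the writer should never lap the reader.
    if (this.produced - this.consumed >= this.size) throw new Error("buffer full");
    return this.produced;
  }

  publish(seq: number, value: T): void {
    this.slots[seq % this.size] = value; // slots are recycled modulo the size
    this.produced = seq + 1;
  }

  getBatch(): T[] {
    const batch: T[] = [];
    while (this.consumed < this.produced) {
      batch.push(this.slots[this.consumed++ % this.size]); // catch up in one sweep
    }
    return batch;
  }
}

const ring = new RingBuffer<number>(8);
for (const v of [1, 2, 3]) ring.publish(ring.next(), v);
console.log(ring.getBatch()); // [1, 2, 3] read as a single batch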
Disruptor
Next Publish WaitFor Get
✓ Data ✓ Control
✓ Local × Remote
× Direct ✓ Indirect
✓ Synchronous × Asynchronous
× 1-1 ✓ N-M

The disruptor connector was originally implemented for high-throughput, low-latency multi-core processing scenarios which occur, for example, in high frequency trading applications. It can also be found within the pipes of big data processing pipelines, as long as all producers and consumers are deployed in the same container with lots of memory available. The assumption is that if the ring buffer is big enough it never gets full, because the producer never laps the slowest consumer: after the slots have been read, they are recycled for the next round of writing.
Tuple Space
In Out Rd
• A tuple space acts like a shared database, but offers a simpler set of primitives and persistence guarantees
• Components can write tuples into the space (Out) or read matching tuples from it (Rd)
• Components that read tuples can also atomically take them out of the space (In)
• The connector also supports different kinds of synchronization (blocking or non-blocking reads)

The next connector is also a data sharing connector, which improves upon the shared database because it adds a few interesting coordination primitives. A tuple space organizes the shared data as a set of tuples. Tuples are sets of typed values. For example, you can have a tuple which contains a customer name and telephone number, and another with an account number and balance. These tuples are all mixed together into a space. It is not necessary to structure them into tables or predefined relations: you just throw all the tuples inside the space. Components can also read them by looking for tuples matching a given structure. For example, to retrieve the phone number of a given customer, they can query the space with a tuple containing the given name and a typed placeholder, which will be filled with the number if a tuple matching the name is present and found. It is easier to understand the primitives that you can use with the tuple space from the point of view of the components invoking them. If we have a component that is writing into the tuple space, it will use the ’out’ primitive. The opposite will be the ’in’ or the ’read’ primitive. What is the difference between them? With ’read’ you simply read whatever tuple is present that matches the expected structure. Read does not affect the content present in the tuple space: it is a non-destructive operation. However, there is also the possibility to read tuples and atomically take them out of the space in one shot. This is what the ’in’ operation does: take information from the tuple space and bring it into the component reading and removing it. What happens if you try to read something, but the information that you’re looking for is not present yet? If you query a normal database, when you send a query to read some information, the database will simply answer: sorry, I don’t have any information that matches this query at this time, please come back later. If you do the same with the tuple space, it is possible to actually block the reply until a tuple that matches what you try to read appears in the space. Thanks to this very simple concept you can actually implement interesting coordination scenarios between multiple independent components that not only share information in the tuple space but can also use it to coordinate their work.
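As a sketch of these primitives, here is a minimal in-process tuple space in JavaScript, assuming tuples are arrays and templates use null as a typed placeholder; blocking is modeled with promises and the method names follow the slide (out, rd, in).

// Minimal tuple space: out (write), rd (non-destructive blocking read),
// in (destructive blocking take). Waiters are served first come, first served.
class TupleSpace {
  constructor() {
    this.tuples = [];
    this.waiters = []; // blocked rd/in requests: { template, destructive, resolve }
  }
  matches(template, tuple) {
    return template.length === tuple.length &&
           template.every((v, i) => v === null || v === tuple[i]);
  }
  out(tuple) {
    this.tuples.push(tuple);
    const pending = this.waiters;       // re-try all blocked readers in order
    this.waiters = [];
    pending.forEach((w) => this.tryServe(w));
  }
  tryServe(w) {
    const i = this.tuples.findIndex((t) => this.matches(w.template, t));
    if (i < 0) { this.waiters.push(w); return; }            // keep waiting
    const tuple = w.destructive ? this.tuples.splice(i, 1)[0] // atomic take
                                : this.tuples[i];             // copy stays
    w.resolve(tuple);
  }
  rd(template) {  // non-destructive blocking read
    return new Promise((resolve) =>
      this.tryServe({ template, destructive: false, resolve }));
  }
  in(template) {  // destructive blocking read
    return new Promise((resolve) =>
      this.tryServe({ template, destructive: true, resolve }));
  }
}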
[Sequence diagram: several Components around a central Tuple Space. One component performs Out; two components perform Rd and each receives a copy of the matching tuple; both then block on In; the producer’s next Out is delivered to exactly one of the waiting components, and a further Out unblocks the other.]
Let’s take a look at a concrete example of how these various primitives can be used. We have the components connected to the same tuple space in the middle. One component is going to perform an ’out’ operation to write a certain tuple into the space. Another component wakes up and looks for a tuple matching its template. Since this is a read operation, the tuple is not removed from the space: only a copy is sent to the component, which can process it. The same tuple can also be sent to another component, because the read operation is non-destructive. After both components finish processing the tuple they have read, they perform an ’in’ operation. In this case, there is nothing matching in the tuple space yet, so they block and wait. When the other component, which plays the role of the producer, ’out’s a matching tuple, it will be the responsibility of the tuple space to deliver this tuple to one of the components that was waiting for it. At this point the tuple space detects that there are actually two components waiting to read and will need to decide which one wins the race to extract the tuple that is coming in. It can use a first-come-first-served policy or some other load balancing algorithm to deliver it to the first component, which will receive it and unblock. The other component doesn’t unblock: since the only matching tuple was delivered to the other one, it is still waiting. However, when the producer performs the next ’out’ with another matching tuple, there will be only one component waiting to receive it. The blocking ’rd’ and ’in’ primitives are similar to message bus subscriptions; however, they are subscriptions that disappear the moment the matching tuple is delivered. This example shows how we can use a combination of simple primitives to share and coordinate access to information in a distributed environment. The tuple space runs like a database server, accepting remote connections from the other components.
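The walkthrough above could look like this with the sketch from the previous page; the ’payment’ tuples are made up for illustration.

const space = new TupleSpace();
// Two consumers block on 'in' waiting for a matching tuple.
space.in(["payment", null]).then(([, amount]) => console.log("A got", amount));
space.in(["payment", null]).then(([, amount]) => console.log("B got", amount));
space.out(["payment", 42]); // delivered to exactly one waiting consumer (A)
space.out(["payment", 7]);  // the other consumer (B) unblocks with this one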
Tuple Space
In Out Rd
✓ Data ✓ Control
× Local ✓ Remote
× Direct ✓ Indirect
× Synchronous ✓ Asynchronous
× 1-1 ✓ N-M
To summarize the properties of the tuple space: like a shared database, it helps to transfer data. But thanks to the blocking read primitives, and the possibility to atomically read and delete matching tuples by a set of competing consumers, the tuple space helps to coordinate them as well. The tuple space runs its own server for remote components, which interact indirectly – they just need to agree on the structure and the content of the tuples they are reading and writing. Otherwise they are completely unaware of the presence and location of each other. The tuple space connector is also asynchronous: the various components can come and go, and as long as the tuple space is available they can interact through it asynchronously. Its cardinality is many to many, with multiple producers and multiple consumers connected to the same tuple space in the middle routing everything.
Web
Get Put Delete Post
• Components reliably transfer state among themselves using shared resources accessed with the GET, PUT, DELETE primitives. POST is used for unsafe interactions.
• Components refer to their shared state with global addresses (URI) which can be embedded into resource representations (Hypermedia).
If you use the Web as a software connector, you can build a shared data structure of global proportions and connect an arbitrary number of components to it. What is the purpose of the Web as a software connector? It enables a set of components to reliably transfer data among themselves using a shared data structure that is stored in various distributed locations, which can be addressed globally. Components deployed anywhere can access the Web and can become part of it. It is enough for them to have a link: the reference to a global address of a shared Web resource. They will follow it and dereference it to get information from the corresponding Web resource. Reading is what happens 99% of the time, as information gets downloaded from a website. But it’s also possible to PUT or POST new content associated with a Web address, or even DELETE it altogether. The interesting thing about using these global Web addresses is that Web resources are discovered dynamically by asking a resource where to find related resources. This is the core idea of hypermedia: decentralized discovery by referral.
[Sequence diagram: Components sharing Web Resources (URLs). One component PUTs data to a URL, other components GET copies of it; a later PUT updates the resource state, further GETs read the new state, and a final DELETE ends the resource lifecycle.]
Let’s see how we can use the Web as a connector to perform these reliable state transfers between different components. Let’s say that there is a component that wants to inform the others about some data: this component can publish it by associating the data with a certain Web address. After this is done, the information is persistently stored at this particular location. This data can then be transferred onto the components interested in processing it. When each component is ready to receive it, it will perform a GET request that retrieves a copy of the current state associated with the particular address. The GET operation is non-destructive, so you can read the current state of the resource as many times as you want; multiple components can also read it at the same time by retrieving and downloading their own copy to process locally. Another example: we take information from a component A and publish it on the Web, so that two other components can retrieve it. In these examples the Web acts as a form of shared memory at the global scale: each memory address is not located within the local environment in which components are deployed, and every component connected to the Web can be running anywhere. Also, the address of each Web resource can be found anywhere on the Web (as opposed to being matched against the tuples stored within a particular space). The interaction can be more complex: we can have some component decide to update the data, and this data can then be read by the other components. It’s also possible, once a resource has been initialized with the first PUT, to have another component clean it up and DELETE it. This example concludes the whole life cycle of the shared Web resource. This is similar to what you would do with a shared database where components can insert data into tables which then eventually are dropped. As opposed to the tuple space, there is only one type of read operation, which is called GET, and there is no way to atomically read and delete a resource: DELETE is actually a separate primitive.
The Web enables the reliable transfer of state because each of these primitives (GET, PUT, DELETE) is idempotent: if there is a failure, you can retry the interaction and eventually the data will make it across. If a component is unable to connect to the resource when attempting to write into it, the component can keep retrying the PUT request, and eventually the data will make it across. The same is even more true for the read operations: if the GET fails, just retry it as many times as necessary.
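A minimal sketch of such a retry loop, assuming a hypothetical resource URL and the standard fetch API:

// Retrying an idempotent PUT until it succeeds. Because PUT is idempotent,
// repeating it after a failure cannot corrupt the resource state.
async function reliablePut(url, data, retries = 5) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      const res = await fetch(url, {
        method: "PUT",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(data),
      });
      if (res.ok) return res;
    } catch (err) {
      // network failure: safe to retry an idempotent request
    }
    await new Promise((r) => setTimeout(r, 1000 * attempt)); // back off
  }
  throw new Error(`PUT ${url} failed after ${retries} attempts`);
}

// Usage: await reliablePut("https://example.org/resource/42", { amount: 10 });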
Web

Get Put Delete Post

✓ Data × Control
× Local ✓ Remote
× Direct ✓ Indirect
× Synchronous ✓ Asynchronous
× 1-1 ✓ N-M
Let’s summarize the properties of the Web as a software connector. The Web is used primarily for globally publishing shared information among a very large number of components that are mostly interested in reading it. However, there is no control transfer primitive associated with the Web’s HTTP protocol. The connector is by definition as remote as it can get; all the components interact indirectly through the shared Web resources. Each of the components can be active at a completely different time: we have observed on the Web that bits published 25 years ago are still available today. Also, the Web can scale to support a very large number of components sharing the same resource.
Blockchain
AddTx Read
• The blockchain is a trusted shared transaction log built on top of an untrusted and decentralized network of peers
• Components may read the transaction history and add transactions to extend the blockchain
• All information shared on the blockchain is replicated on each peer
• The content of the blockchain can only be changed if the majority of peers agrees to do so

The last connector of this collection is also the most recent addition. The blockchain comes into play when you have a large number of components which need to agree on the value of shared data, but do not have a centralized place in which they can store the copy of the information they have to collectively agree about. What is the blockchain? A fully replicated data structure, whose copies are stored with every component that needs to access it. By using cryptography, it is possible to check that the existing data remains unchanged and that additions are only allowed at one end of the chain. If you look at this particular chain, we can see the initial block, a.k.a. the origin block. This is followed by other blocks attached to it over time. The most recent block is being formed with the new information. From the past you can only read, while if you want to write you can only append transactions at the top. Is it true that the chain is immutable? Well, it is immutable as long as the majority of the components do not agree that it’s OK to change it. Since the blockchain is also meant to scale to a very large number of peers, it would be too expensive to gain control over the majority of the replicas, and therefore there will not be a chance to modify the past. Technically, the blockchain is a trusted shared transaction log: a linear data structure which goes from old entries to newer entries. And you build trust by sharing such a transaction log among an untrusted and decentralized network of replicas. The components that use a blockchain as a software connector can use it to read the past transaction history, which is similar to having a shared, but immutable, database. They can only modify the content found at the head of the log. So the blockchain can always grow and can never shrink. It is not possible to update existing blocks or to remove blocks from it. Everything that has been logged in the past will always be remembered.
There are also costs involved with all of this, because we have full replication of all the information shared on the blockchain. The larger the system grows, the more copies you need to keep, so the more storage you consume; you also have to continuously scan the chain to make sure that nobody has been trying to tamper with it, and this is also very wasteful in terms of CPU power and energy consumption.
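To illustrate why the past cannot be silently rewritten, here is a toy append-only hash chain in JavaScript (using Node’s crypto module); it deliberately omits replication and consensus, which are the hard parts of a real blockchain.

// Toy append-only hash chain: each block commits to its predecessor's hash,
// so tampering with any past block invalidates every later one.
const { createHash } = require("node:crypto");
const sha256 = (s) => createHash("sha256").update(s).digest("hex");

class Chain {
  constructor() {
    this.blocks = [{ index: 0, prevHash: "0", tx: "origin", hash: sha256("0origin") }];
  }
  addTx(tx) {  // append-only write at the head of the log
    const prev = this.blocks[this.blocks.length - 1];
    const hash = sha256(prev.hash + tx);
    this.blocks.push({ index: prev.index + 1, prevHash: prev.hash, tx, hash });
  }
  read() {     // read the full transaction history
    return this.blocks.map((b) => b.tx);
  }
  verify() {   // detect tampering with past blocks
    return this.blocks.every((b, i) =>
      i === 0 || (b.prevHash === this.blocks[i - 1].hash &&
                  b.hash === sha256(b.prevHash + b.tx)));
  }
}

// Usage: const c = new Chain(); c.addTx("pay 10 CHF"); console.log(c.verify()); // true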
Blockchain

AddTx Read

✓ Data × Control
× Local ✓ Remote
× Direct ✓ Indirect
× Synchronous ✓ Asynchronous
× 1-1 ✓ N-M
So the blockchain is about data transfer, in particular from components which append transactions to the blocks to other components which can read these transactions forever. The blockchain is a highly decentralized distributed system, so it supports remote, indirect and asynchronous interactions between multiple components reading and appending transactions onto it.
Software Connector Demo Videos

• Cardinality: 1-1 (Point to Point)
  • How to exchange data from A to B (and back)?
  • How to transfer control from A to B (and back)?
  • How to discover the existence and location of A and B?
  • How to perform a rendez-vous/barrier synchronization (A waits for B or B waits for A)?
• Cardinality: 1-N (Multicast)
  • How to exchange data from A to B*?
  • How to coordinate multiple connected components?
  • How to safely concurrently modify data shared between multiple components?
• Reliability: What happens if A or B are not available?
• Maintainability: What happens to A if the interface of B is changed?
• Adaptation: How to connect heterogeneous but compatible interfaces? (if possible)
References
• Richard N. Taylor, Nenad Medvidovic, Eric M. Dashofy, Software Architecture: Foundations, Theory and Practice, John Wiley & Sons, January 2009, ISBN 978-0470167748.
• Nikunj R. Mehta, Nenad Medvidovic, Sandeep Phadke, Towards a taxonomy of software connectors, Proc. of the 22nd International Conference on Software Engineering (ICSE 2000), pages 178-187.
• Andrew D. Birrell, Bruce Jay Nelson, Implementing remote procedure calls, ACM Transactions on Computer Systems, Volume 2, Number 1, pages 39-59, February 1984.
• Martin Thompson, Dave Farley, Michael Barker, Patricia Gee, Andrew Stewart, Disruptor, May 2011.
• David Gelernter, Generative communication in Linda, ACM Transactions on Programming Languages and Systems, Volume 7, Number 1, pages 80-112, January 1985.
• Cesare Pautasso, Erik Wilde, Why is the Web Loosely Coupled? A Multi-Faceted Metric for Service Design, pages 911-920, Proc. of the 18th International World Wide Web Conference (WWW 2009), ACM Press, Madrid, Spain, April 2009.
• Xiwei Xu, Cesare Pautasso, Liming Zhu, Vincent Gramoli, Alexander Ponomarev, An Binh Tran, Shiping Chen, The Blockchain as a Software Connector, 13th Working IEEE/IFIP Conference on Software Architecture (WICSA 2016), Venice, Italy, April 2016.
• Cesare Pautasso, Olaf Zimmermann, The Web as a Software Connector: Integration Resting on Linked Resources, IEEE Software, 35(1):93-98, January/February 2018.
Software Architecture
Compatibility and Coupling
8
Contents
• Adapters and Wrappers
• Interface Mismatches
• Adapters and Interoperability
• Adapting between Synchronous and Asynchronous Interfaces
• Scaling Adapters to N Interfaces
• Compatibility and Interface Standards
• Coupling Facets
• Binding Times
How is it possible for the train to roll on the tracks? The wheels are the point at which the train touches the tracks, so they are the interface element between the rest of the train and the tracks. The assumption behind their compatibility is that the distance between the wheels needs to match the distance between the rails. There is not only one standard setting for this value: depending on the history, the geographic location, and the type of train network, there are many different possibilities. The chosen distance is actually arbitrary, a historical accident. It may be due to political decisions to make the trains compatible or incompatible between different train networks. The distance also has an impact on the size and the speed of the trains. Spanish trains have the most comfortable riding experience because they are so much larger. If you are riding a train across the Swiss Alps, there are lots of sharp turns: it pays off to shrink the track gauge so that trains can climb to the top of the mountains. What is the connector in this scenario between the train and tracks? Gravity. It’s good to have such a force on your side if you want to keep components together.
Compatible Interfaces
• To be connected, component interfaces need to match perfectly

The main assumption that we make when we connect together two components is that they are compatible. In other words, the interfaces connected by the connector need to match, perfectly. But what is it that keeps the two software components together? The coupling between matching and connected interfaces.
There's an app adapter for that!
You can build a whole industry just by solving interface compatibility issues. Provide people with a solution to connect incompatible interfaces: there is an adapter for that. You may rightfully wonder if the design of incompatible interfaces is actually intentional. You would not be able to sell any adapters if all interfaces were compatible in the first place.
Adapter
• Adapters help to connect mismatching interfaces
• Adapters deal with transforming the interaction provided by one interface to the interaction required by the other
• Warning: if data or functionality are missing, the adaptation may be impossible

When we work with software we assume that the component interfaces have to match perfectly so they can be connected. If you have two components but their interfaces do not match and they are not compatible, then you have a problem. Sometimes you can solve the problem by building an adapter. The first part of the lecture is going to be about: how do we work with software adapters? What kinds of incompatibilities can be solved with adapters, and which ones are impossible? What is an adapter? It’s a specific kind of software connector which is dedicated to solving interface mismatches. To deal with such mismatches, some kind of data transformation might be needed: each interface comes with its own data model defining the data types that the interface understands and can exchange with other interfaces sharing the same data types and structures. The first function of the adapter is to convert between different data representations, between different formats within the same semantic domain. But there is more than just incompatible data representations. There are also different types of interactions. For example, you could have the blue component working with an API based on synchronous calls, and the red component assuming that it depends on a message-based asynchronous interface. Adapters may also need to transform between asynchronous and synchronous interactions. Is adaptation always possible? Not always: the data that you depend on, or the functionality that you require, may be missing, and the adapter cannot find it anywhere within the interface it is trying to adapt. In this case, the adaptation fails, unless you overload the adapter with a re-implementation of the missing functionality, or make it stateful by storing the data which is not found in the original interface. This goes beyond what an adapter should be dedicated to: it is, after all, just a connector between two components, rewiring the pins. But if you go back to the hardware example, you will be surprised by the complexity of some of those adapters. Not only are the interfaces incompatible, but missing implementations have been outsourced into the adapters themselves. Here is the adapter: with it we are able to connect two components even though their original interfaces are not compatible. Remember: adaptation is not always possible.
Wrapper
• The mismatching end of the adapter is hidden inside a wrapper component
• The adaptation logic is encapsulated within the wrapper and made reusable
What is a wrapper? A wrapper makes the adapter reusable by hiding it inside a compatible interface. The blue interface is the original one, which may be incompatible. The red interface is the interface of the wrapper, which allows other components to connect directly, without realizing that the wrapped component is actually incompatible with what they require. The red interface is compatible because it wraps the incompatible (blue) one. From the outside, you do not see the old interface, but you can use the adapter directly. Wrapping is a combination of adaptation and abstraction: adaptation with encapsulation gives you a wrapper.
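A minimal wrapper sketch; the legacy interface shape here is hypothetical:

// The adaptation logic lives inside the wrapper, behind a compatible
// (red) interface, so clients never see the legacy (blue) one.
class LegacyImageStore {                    // blue interface (incompatible)
  store(bytes, ownerId) { /* ... */ return 42; }
  fetch(imageId) { /* ... */ return new Uint8Array(); }
}

class ImageStoreWrapper {                   // red interface clients expect
  constructor() {
    this.legacy = new LegacyImageStore();   // adaptee is fully encapsulated
  }
  upload(user, image) {
    return this.legacy.store(image, user);  // rename + reorder parameters
  }
  download(id) {
    return this.legacy.fetch(id);
  }
}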
Mismatch Example

id upload(user, image);
image download(id);
setTitle(id, title);
title getTitle(id);
ids[] list(user);
A
id upload(user, image, title);
{user, title, time} getImageMetaData(id);
image getImageData(id);
ids[] list();
B
Are Interfaces A and B equivalent?
× Yes (A can replace B and vice-versa)
× A can be replaced by B
× B can be replaced by A
✓ Not completely
This example shows the type of problem you have to deal with when attempting to connect two mismatching interfaces. Equivalence means that it is possible to write an adapter which transforms one into the other and vice versa. Is one interface replaceable by the other and vice versa? If this is true in both directions, then they are equivalent. In general, it’s unlikely that a pair of arbitrary interfaces can be considered equivalent, because there is always going to be some feature that is present in one interface but missing from the other. It is more likely that the replacement is possible in one direction only: it may be possible to have an adapter from one interface to the other, but not the other way around. Also, if it is not possible to build an adapter for the complete interface, you may be lucky: your component may not require the whole interface, but only a subset for which the adaptation is possible. Consider the example and try to answer the question before we write the two adapters.
Partial Wrappers

A:
id upload(user, image);
image download(id);
setTitle(id, title);
title getTitle(id);
ids[] list(user);

B:
id upload(user, image, title);
{user, title, time} getImageMetaData(id);
image getImageData(id);
ids[] list();

Adapter A➙B (implements A on top of B):
id upload(user, image) { return b.upload(user, image, ""); }
image download(id) { return b.getImageData(id); }
setTitle(id, title) { ? }
title getTitle(id) { return b.getImageMetaData(id).title; }
ids[] list(user) { return b.list().filter(user); }

Adapter B➙A (implements B on top of A):
id upload(user, image, title) {
  id = a.upload(user, image);
  a.setTitle(id, title);
  return id;
}
{user, title, time} getImageMetaData(id) { return {?, a.getTitle(id), ?}; }
image getImageData(id) { return a.download(id); }
ids[] list() { return a.list("*"); }
In the example, it’s not possible to completely replace one with the other. Let’s start from A to B: we have a component that depends on A and we try to write an adapter that implements A’s interface using B’s interface. There is one property offered by A which has the ’getTitle’ and ’setTitle’ accessors. While we can map the property read to the ’getImageMetaData’ getter, we cannot find the corresponding write feature in B. A workaround would be to re-upload the image every time the title changes, but this would invalidate the image identifier and fill up the B implementation with duplicate images. The adapter would need to keep track of the identifiers associated with duplicate images, thus becoming stateful. If you switch perspective and try to go from B to A, then we spot one case in which we have a very broad operation: the ’getImageMetaData’, which returns all the metadata for a certain ID. If you look at what is available to do the same in interface A, we only have the title, so you will be able to return the title, but you will not be able to return the user or the time unless you keep track of those inside the adapter. To do so, you will have to make a stateful adapter that remembers when the image was uploaded as well as which user is associated with it. Adding state to an adapter introduces redundancy (overlapping information about the same item is stored in different places) which leads to potential inconsistency. Stateless adapters simply map or transform interfaces. Stateful adapters do so too, but also need to remember the entire history of the interactions with the interface, and their lifecycle becomes tightly coupled with the one of the component they are adapting. If you really need to add state to an adapter, then make it a wrapper, so that you can hope that all interactions with the mismatching component go through the adapter. Bypassing the adapter and using the original interface may render the adapter state inconsistent.
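A sketch of such a stateful B➙A adapter, following the example interfaces; the metadata map inside the adapter is exactly the duplicated state the warning above is about.

// Stateful B➙A adapter: interface A offers no user/time metadata, so the
// adapter must remember it at upload time. If anyone bypasses the adapter
// and uploads through A directly, this state becomes inconsistent.
class BOnTopOfA {
  constructor(a) {
    this.a = a;
    this.meta = new Map();                 // id -> { user, time } (duplicated state)
  }
  upload(user, image, title) {
    const id = this.a.upload(user, image);
    this.a.setTitle(id, title);
    this.meta.set(id, { user, time: Date.now() });
    return id;
  }
  getImageMetaData(id) {
    const m = this.meta.get(id) || { user: undefined, time: undefined };
    return { user: m.user, title: this.a.getTitle(id), time: m.time };
  }
  getImageData(id) { return this.a.download(id); }
  list() { return this.a.list("*"); }
}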
Types of Interface Mismatches

• Cannot be solved with adapter:
  • Missing required feature
• Can be partially solved with adapter:
  • Different data types
• Can be solved with adapter:
  • Same feature with different name/order
  • Operation granularity (one vs. many)
  • Data granularity (one structured parameter vs. many simple parameters; scalar vs. vector)
  • Interaction (e.g., synchronous vs. asynchronous)

In general, if we consider what types of interface mismatches there are, we can see that if a required feature is missing, we cannot solve this with an adapter. Required features missing from the provided interface are a big problem, because there is no adapter that will actually solve it, unless you’re willing to re-implement the missing feature inside the adapter. But that’s not the job of the adapter. There is another case in which we can solve the mismatch partially with an adapter: if, for example, we have different but partially overlapping data types. Consider transforming integers into floating point numbers: all integers can be represented as floating point values, but if you go the other way then you have a rounding problem. There are other cases in which it’s easy to do the adaptation. For example, if you have a different name for the same thing, you just have to rename it. If you have the wrong granularity, it’s possible to decompose a large operation by composing the small ones offered by an interface. It’s more difficult to do the opposite with a stateless adapter: mapping many fine-grained operations on top of a coarse-grained interface may require accumulating all the small calls and then making the big call in one shot. But how do you know when you have enough data to pass on to the coarse-grained interface? And what if the last piece of information never comes? The big call will never happen. Mismatches can be found at the level of operations, events and properties, but they can also be at the level of the data models. There can be APIs that work with individual data items and others which work in batch mode with collections of items. It can be more efficient to transfer a whole batch of objects that have the same structure, as opposed to having to make calls for every separate object. Another very challenging case concerns mismatches in the type of interaction. The data is compatible. The operation is exactly the same. But there is still a difference: one interface is synchronous while the other is asynchronous.
Synchronous vs. Asynchronous Interfaces

Synchronous:
y = f(x)
• Block the caller
• Easier to program with
• Require polling/busy waiting to detect events

Asynchronous:
f(x, (y) => {})
emit(x) on((y) => {})
• Closer to the hardware
• Use callbacks for event notification
Let’s compare synchronous and asynchronous interfaces. You can recognize the difference in the code snippets, and also in the properties, challenges and constraints each comes with. In some scenarios it is actually useful to have a synchronous interface, and in other scenarios it is better to have an asynchronous one. So it is not always clear which one to pick. As a result, it’s unavoidable to have both in the same architecture, and as a consequence you have to be able to bridge this mismatch. Synchronous interfaces involve calls: you make a call and wait until you get the answer; as a consequence, you – the caller – are blocked. You will need to wait until the result comes back. Since there is nothing simpler to program than calling a function, they are the easiest to write your code against. Synchronous interfaces have a little drawback. What if the purpose of the call is to monitor state changes inside the component being called? In other words, if you want to detect whether an event has happened, then synchronous interfaces require busy waiting: you have to keep calling to retrieve the latest state until you detect something happened, or you call once and remain blocked until something happens. Both are extremely inefficient ways to detect events: they keep the caller busy, and they also consume resources on the component being monitored because it’s being called all the time. The interface would work much better if you could just have a callback to notify about the event. And this is the main reason why we have asynchronous interfaces. You can make a call and pass the callback to receive the result. Or you can just use different primitives: you can have event-based or message-based interfaces to receive asynchronous event notifications. Whenever our interfaces encapsulate hardware components, the software within the components will be triggered by interrupts signaling low-level events. For example, consider an operating system API that gives you access to the mouse. You will be able to observe the mouse moving by receiving an event that tells you its new position. You can also check the current position with a call, but you shouldn’t keep asking for the current state all the time if you want to efficiently track the movement.
Synchronous and Asynchronous
How to mix?
Part of our architecture works synchronously (with call-based interfaces) and the other part works asynchronously (with messages or event-based interfaces). The challenge is: how do we mix them? How do we come up with an adapter? How to connect synchronous and asynchronous interfaces? How can we transform between calls and messages? We are going to use an adapter, which will be different depending on whether we transform one synchronous call into the exchange of two asynchronous messages (the request followed by the response) or vice versa.
Half-Sync/Half-Async

How to connect synchronous (call) and asynchronous (message queue) interfaces?

Use an adapter hiding asynchronous interactions behind a synchronous interface (or vice versa)

• Sync2Async adapter: buffer synchronous calls using queues and map them to low-level asynchronous events.
• Async2Sync adapter: invoke synchronous interfaces when receiving the corresponding message and emit another message with the call results.
Good news: the mismatch can be solved in both directions. If we map from synchronous to asynchronous, we have a synchronous call coming in. We have to buffer it using message queues. The incoming call produces an outgoing request message. The adapter listens for the event representing the arrival of the incoming response message. When this happens, the adapter is able to answer the call and unblock the caller. We can also have an asynchronous client that needs to be interfaced and connected with synchronous interfaces. Also in this case, the adapter will be listening for inbound messages from the asynchronous client. When the message arrives, we can extract the input parameters needed to make the call. When the call returns, we emit the outbound message with the results back to the original client.
Half-Sync/Half-Async

[Diagram: Synch Client —(1. Call)→ Adapter —(2. Send Request)→ message bus —(3. Receive)→ Asynch Interface —(4. Send Response)→ message bus —(5. Receive)→ Adapter, which answers the original call]

The adapter converts from synchronous calls to asynchronous messages
We start from the case in which we have a synchronous client which needs to interact with an asynchronous interface, which expects to be connected to a message bus. The adapter bridges between two different connectors: on the left side we have a synchronous call; on the right side, the asynchronous message bus. You cannot directly connect them together. By now we have seen that it is possible to solve this mismatch with an adapter. How does the adapter actually work? The interaction begins with the call, intercepted by the adapter. Then the call gets transformed into a request message, which the adapter sends out on the bus together with the call’s input parameters. The asynchronous interface is going to receive it directly from the bus. The asynchronous interface will do whatever it has to do when such messages arrive. Eventually it will produce a response message, which will go out on the bus. The adapter will be listening for it, and will receive it. Once the response has been received, it can be delivered to the synchronous client. From the perspective of this client it looks exactly as if it was interacting with a synchronous interface. Behind the adapter we have something completely different: a bus with asynchronous interactions.
Sync to Async Adapter

// Sync to Async Adapter
f(x) {
  var result;
  bus.onreceive = (y) => {
    result = y;
    notify;
  }
  bus.send(f, x);
  wait;
  return result;
}

[Sequence diagram: Client calls f(x) on the Sync Interface; the Adapter sends (f, x) on the Bus and waits; the Async Interface receives (f, x) and sends back y; the Adapter receives y, is notified, and returns y to the blocked Client.]
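For comparison, here is a runnable promise-based version of the same idea, assuming the slide’s hypothetical bus API (send/onreceive) and at most one outstanding call at a time:

// The blocking 'wait/notify' pair from the slide becomes awaiting a Promise
// that the bus callback resolves when the response message arrives.
function makeSyncFacade(bus) {
  return function f(x) {
    return new Promise((resolve) => {
      bus.onreceive = (y) => resolve(y); // 'notify': response message arrived
      bus.send("f", x);                  // request message with the call input
    });                                  // 'wait': the caller awaits the promise
  };
}

// Usage (given some bus implementation):
//   const f = makeSyncFacade(bus);
//   const y = await f(x);  // looks like a synchronous call to the client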
Half-Async/Half-Sync

[Diagram: Asynch Client —(1. Send Request)→ message bus —(2. Receive)→ Adapter —(3. Call)→ Synch Interface; the Adapter —(4. Send Response)→ message bus —(5. Receive)→ Asynch Client]

The adapter converts from asynchronous messages to synchronous calls

The structure remains the same. However, the order of the interactions is inverted. The asynchronous client is connected with the bus, which delivers the request message to the adapter. When the message arrives, the adapter converts it into the call. The call is synchronous and blocks the adapter until the result is returned. The adapter packages the call result into the response message, which is sent back to the client.
Async to Sync Adapter

// Async Client
f(x, c) {
  bus.onreceive = (y) => { c(y); }
  bus.send(f, x);
}

// Async to Sync Adapter
bus.onreceive = (f, x) => {
  var y = f(x);
  bus.send(y);
}

[Sequence diagram: Client sends (f, x) on the Bus; the Adapter receives it, makes the synchronous call f(x) on the Sync Interface, and sends the result y back on the Bus, where the Client’s onreceive callback picks it up.]
Half-Sync/Half-Async

• Benefits of adaptation:
  • Simplify access to low-level asynchronous interfaces (no need to deal with buffers, interrupts, and concurrency)
  • Visibility of interactions (all communication happens through the message bus)
• Disadvantages:
  • Potential performance loss due to layer crossing (data copy, context switching, synchronization)
  • Mapping requires each call to match one pair of messages
To summarize, we cannot build systems with only one type of connector. It’s too simplistic to assume we will just make calls everywhere.
You have to be able to mix asynchronous messaging with calls. You have to mix data streams with shared databases and all the different kinds of connectors we have seen. Interfaces are affected by the expectations they make on the type of connectors that you can use with them. For example, if you have an interface that is close to some low-level stateful hardware device, you will need to be able to query its current state but also efficiently monitor state changes by listening for events. This can be complicated, so in some scenarios having a simple blocking call that returns when the event is detected can be easier on the client. Calls are also more difficult to monitor, unless you can inspect the content of the stack of the current thread. Message buses can easily log and trace messages in transit. This makes it possible to observe at runtime the actual interactions and the communication flow between various components. Every adaptation has a penalty: you cannot just use an interface directly, you have to convert. You have to copy data. Maybe you have locks on which you need to ’notify’ and ’wait’. A local call is very efficient, while transforming it into four different calls to send and receive two messages that go back and forth on the bus can add significant overhead. If you look closely, you will be surprised how many times this sync/async adaptation happens along the whole software stack. One asynchronous layer is hidden under a synchronous one, which again is turned into an asynchronous interface, and so forth.
How many Adapters?

• One Point to Point Adapter between each pair of interfaces: the number of adapters grows quadratically with N
• Two Adapters via intermediate standard interface: the number of adapters grows linearly (2·N)
So far we have discussed why we need adapters to connect incompatible interfaces, and how we can use encapsulation together with adapters to build wrappers of components, so they present an interface that is more compatible than their original one. We have also discussed different types of interface mismatches: some can be resolved and others are impossible to resolve with adapters. Then we looked at how to build adapters that help us connect pairs of incompatible interfaces, focusing in particular on the problem of mapping between synchronous and asynchronous interfaces. In this part we are going to work on: how do we scale the notion of adaptation to work with more than two interfaces? We will see how standards play an important role in that. We will conclude by looking at the other effect of connecting two components: they become coupled together. We will discuss different facets of the coupling concept, and in particular we will highlight when coupling is established. How can we scale adaptation to multiple incompatible components? If you look at this picture, every square has a different color, meaning that it has a different interface. Each line connecting the boxes means that it’s possible to do the adaptation: it’s possible to convert from yellow to orange using a specific adapter. If you have some information inside the yellow component and you would like to transfer it over to the orange one, you can do so because the adapter can transform it from one format to the other. You can limit the number of adapters but go through multiple adaptations to transfer information along the outer rim, or you can transform directly between the source and destination formats. It’s always possible to build an interoperable architecture by adding more and more adapters until eventually you reach the level of compatibility that you need. Sometimes this is inefficient because there are many transformations that need to be executed: every time you go through one of these edges, there is a cost to pay. And there is also a risk that a transformation could lose information, which is fine for a direct adapter but could make it challenging to compose multiple adapters. What if you draw an adapter directly between every pair of components that you want to connect? If you take this to its ultimate consequences with a fully connected graph, you end up with a lot of adapters. Every time you add one more component to your architecture and you want to integrate it with the rest of the system, you will need to make sure that this component can interact with every other component in your system. Every time you do so, you have to add a larger and larger number of adapters. It doesn’t scale, as it gets increasingly more expensive with quadratic complexity. To control the complexity we can introduce a standard: an intermediate interface to and from which all of the components can be converted. In this case, with only two transformations, we can go from one side to the other. Instead of going directly from yellow to orange, we go from yellow to black and then from black to orange. We can go from one interface to the intermediate, and then from the intermediate we can go everywhere else. This is an important concept that helps you reduce the complexity of the scalable adaptation problem. You do not have a direct transformation anymore. You have a combination of two transformations, but this is a compromise compared to the earlier scenario in which you had to go through a transformation all around the outer rim of the circle and use up to N/2 combined adapters. So here we are in the middle. This is sometimes called a hub and spoke architecture. We go through the center so that we can convert to the standard, and then we have access to everything else. If we want to add one more component to the system like before, then it’s much less expensive: all we have to do is write an adapter that converts between the new component interface and the standard and vice versa, as opposed to building a mapping with every other existing interface.
Scaling Adapters with N Interfaces

How to enable communication between many different heterogeneous interfaces?

Use a standardized intermediate representation and communication protocol

Before they can be connected, incompatible components need to be wrapped with adapters. These adapters enable interoperability as they implement a mapping to common means of interaction.
If we take a heterogeneous system with many different interfaces and we want all of these to be compatible and, more precisely, interoperable, the solution is to first develop a common standard, then build adapters which translate every local interface to the common standard; with that, the interoperability problem is solved. How does such a standard emerge? How is it possible to find such a common representation? It sounds deceptively easy, but it’s difficult to achieve in practice. Do you seek the lowest common denominator? Does the standard cover the intersection (to make sure all mappings are feasible), or the union, so that every interface will be fully covered and fully represented, even though some adapters may need to disregard, or provide defaults for, data elements which are missing from their local interface?
Composing Adapters

[Diagram: Component (Platform A) → Adapter (Platform A) → common representation → Adapter (Platform B) → Component (Platform B)]

Components of different platforms can interoperate through adapters mapping the internal message format to a common representation and interaction style.

If we look at this pattern from a process viewpoint, the interaction between mismatching interfaces happens through a two-step transformation. First, from the local platform, the local language, the local representation, we go through an adapter that converts it to the standard. Then we do the reverse. Such transformations have to work on both sides: from local to global, and from global to the other local. This is the minimum requirement for mono-directional interactions. If we need a reply to go back in the opposite direction, then we need adapters able to perform the inverse transformations.
One or Two Adapters?

• Why two adapters?
• For Point to Point solutions one adapter is enough (direct mapping between two interfaces, message translator)
• In general, when integrating among N different heterogeneous interfaces, the total number of adapters is reduced by using an intermediate representation (or canonical data model)
In case you have to do a quick integration between two components, you don’t have to come up with a standard. This is called a point to point integration, where you just design a direct mapping with one adapter. In general, if you foresee the need to scale the system to a larger number of heterogeneous interfaces, it pays off to invest into a common standard so that you can later keep the cost of further growing the system under control. Sometimes this standard is called an intermediate representation, or canonical data model. You can find these standards in any kind of integration architecture spanning thousands of different applications, which work together because they agree on a common canonical data model. You find it as well inside the architecture of a compiler pipeline. First you parse the input language; the result is an abstract syntax tree (AST), which works like an intermediate representation, as different stages of the pipeline operate on it performing different transformations, optimizations, and code generation for specific backends. Having such an intermediate representation separates (or de-couples) the parser of the input language from the back-end emitting the target language. And it even makes it possible to generate code for different processor instruction sets without having to change the parser.
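A small sketch of the hub-and-spoke idea, with made-up ’yellow’ and ’orange’ message formats mapped through a canonical payment model:

// Every format maps to/from the canonical model, so integrating a new format
// costs two adapters instead of one per existing format. Field names invented.
const toCanonical = {
  yellow: (m) => ({ amount: m.value, currency: m.curr }),
  orange: (m) => ({ amount: m.total, currency: m.currencyCode }),
};
const fromCanonical = {
  yellow: (c) => ({ value: c.amount, curr: c.currency }),
  orange: (c) => ({ total: c.amount, currencyCode: c.currency }),
};

function convert(message, from, to) {
  return fromCanonical[to](toCanonical[from](message)); // two hops via the hub
}

// Usage: convert({ value: 10, curr: "CHF" }, "yellow", "orange")
// -> { total: 10, currencyCode: "CHF" }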
Reusable Adapters and Performance

• Adapters do not have to be re-implemented for each component, but can be generalized and shared among all components of a given interface type
• This pattern helps to isolate the complexity of having to deal with interface differences when building each component. The adapters make the heterogeneity transparent.
• Warning: performance may suffer due to the overhead of applying potentially complex transformations at each adapter (better to design compatible interfaces if possible)
If you target a standard, adapters can become reusable. You can generalize these types of transformations between common formats so that, as long as you can map your local interface to the standard, there can be many different reusable transformations that you can use to connect with other standards. Be aware: by resorting to one or more combined adapters, there is a performance penalty, because the more layers you have to traverse, the more transformations or adaptation layers you have to inject, the longer the interaction will take. As a design goal, it is much better to try to be compatible in the first place. After all, adaptation is always a backup strategy to recover compatibility, but if you can design something to be compatible in the first place, by all means do it.
It’s always possible to attempt to come up with a universal standard that fully covers all the use cases partially embedded in existing standards. As a consequence, here is yet one more standard. And that’s how standards proliferate. Typically the young developer, or the inexperienced architect, will attempt to do just that: create a new standard. You – as a student – may not have seen this happen in our industry yet, but after you get a bit more experience you will see that this standardization cycle happens all the time. People try to come up with better standards. This is progress. But for every new standard, we seem never to be able to get rid of the legacy. This is why I would like to open a little parenthesis on where standards come from. As you may have noticed, standards are particularly important when we assess the compatibility of component interfaces.
On Standards

[Diagram: Standard ↔ Implementations — de jure: the standard precedes its implementations; de facto: an existing implementation becomes the standard]
• Standard compliant component interfaces are compatible and their implementations replaceable
• Component vendors need to comply with standards to get to the customers depending on them
• Claims of standard compliance should be tested and independently certified
If an interface is standard compliant, it means that clients that depend on the interface being compatible with the standard will be able to connect to it. Standard compliance makes life easier for the clients which need to interact with your interface. At the same time, you also make it easier for clients to replace the implementation: the interface remains the same – it’s standard – while the implementation can change without affecting the clients. We can observe a fundamental tension between vendors competing to give you the best implementation of the standard, and customers enjoying a low cost of switching between different implementations which are standard compliant and thus easy to replace. For vendors, there is always the temptation to go beyond the standard: to offer components that do fulfill the standard interface, but also offer extra proprietary interfaces. These may fill the gaps of the standard (with useful non-standard features) or may just be designed to be more convenient to use than the basic standard. Once the clients are hooked, they start to depend on the standard as well as on the non-standard part of the interface. This is the vendor lock-in problem. We originally chose this component due to its standard-compliant interface, but then we started to use and depend on the other features so much that we are no longer able to replace it. This ensures the future of the vendor of the specific implementation, who now has a pool of captive clients. Whenever you see a label on a component stating compliance with a certain interface standard, you shouldn’t just believe the claim. It can be a marketing ploy: as vendors review a checklist of features, it is cheap to add a mark under standard compliance, but more expensive to provide a test suite or have an independent certification of such a label. If you don’t test it, later on you might encounter some surprises, for example, discover that the standard was only partially supported and that the exact feature you need has not yet been implemented.
On Standards
Which is first, the standard interface or its compliant implementation?
• De facto standards promote existing proven and successful implementations (winner takes all, first mover advantage)
• De jure standards may still need to be implemented (design by committee)
Where do standards come from? This is like a ”chicken and egg“ problem. Do you start from the standard and then implement it? Or would you rather find some existing proven implementation that becomes the standard? Here we distinguish de facto vs. de jure standards. If we start from a standard specification and build one or more implementations from it, we have a de jure standard. Sometimes there is only one implementation that is so fast to market and so powerful, so successful, that it becomes – even if there’s no formal standard – a de facto standard. Everyone knows that in practice this is the only commonly accepted way to solve that problem. Being the original developer of a de facto standard gives you first mover advantage and control over who is allowed to implement it after you and who is allowed to actually use it. What if we implement a standard only after it has been specified? This is a bit risky: how can one standardize something without knowing yet how to make it work? And how can you come up with a design for your interface that is worthy of standardization if no implementation already exists? There is a risk that if you have too many people around the table you get what is called design by committee. The result is often sub-standard in terms of quality (e.g., lack of simplicity), so there is uncertainty on whether it will actually be feasible to implement it in a timely manner. After the initial version, standards typically iterate: the specification is cleaned up by incorporating feedback from attempts at its implementation. Committees will standardize, developers will implement, and feedback to clarify and improve the standard will be provided. Not necessarily in this order.
On Standards
Who owns the standard?
• Open Standards: everyone can participate and implement them
• Closed Standards: the owner controls who can implement them; competitors are usually not allowed
Standards may have an open nature. For example, there may be an open source reference implementation that everybody can fork, improve and integrate into their products. With an open standard, everybody shares its ”intellectual property“ and can participate in a standardization process based on “rough consensus and working code”. When the standard is about to be released, there is the possibility for anybody to comment, provide feedback, and vote to finalize it. Also, there is no limit to who can actually implement it. The alternative is to have a closed standard, where there is a clear ownership of the technology and the interface design, as well as strict control over who is allowed to implement it. If you want to re-implement a certain API, you have to get permission first. And to get it you may have to pay some licensing fees. And if you are a direct competitor, maybe you’re not going to be allowed. There was once a big controversy between Oracle and Google concerning whether it’s legal for Google to provide an alternative implementation of the interface standard known as the Java API. Google reimplemented it for their Android operating system, while Oracle had bought it from Sun. While Sun was promoting Java, they took an open standardization strategy. But after Oracle bought it, the strategy changed, and this gave rise to a legal battle about whether it’s possible to actually have multiple implementations of an interface unless the owner of the interface allows it. This can have a big impact on the future of the software industry, because if you can control a software API and you can use the law to limit who is allowed to implement it, this gives you significant monopoly power. The U.S. Supreme Court eventually declared that Google’s usage fell well within their rights. Now if only someone would try to provide alternative implementations of popular Web APIs such as those of Twitter and Facebook.
On Standards
Who writes standards?
• IETF, W3C, OASIS, IEEE, OMG
• ISO
• SWIFT (Financial Messaging)
Where do standards come from? They come from standards organizations, like ISO: the International Organization for Standardization. Within the information technology sector, we have standards related to networking, hardware, modeling notations, programming languages, as well as software interfaces. I also added the SWIFT standard to the list in case you’re interested in banking and financial messaging. That’s just an example of a domain-specific standard for ensuring the compatibility of a certain class of applications. This should give you an idea of how compatibility is achieved. Ultimately, compatibility comes from the designers providing interfaces agreeing with the developers who will be consuming them. One way to reach and encode the agreement is to vote on a standard spelling it out explicitly. If we focus on interfaces, there are several aspects that can be standardized.
Standard Software Interfaces

[Diagram: aspects to standardize when connecting interfaces A and B: Transport, Representation (Syntax), Content (Semantics), Operation, Addressing]
When you would like to successfully connect your component with the interface of a separate component, you need to achieve interoperability between them, with or without an adapter. You have to first make sure that the two parties can communicate. Whatever connector you choose, you will need to make sure that it is possible to transfer some data from A to B. How can the communication work? First, there needs to be an agreement on the type of transport protocol. If you’re deploying A and B in a distributed environment, the communication happens through the network. If they are co-located, then we know that we can use other kinds of connectors, like for example shared memory. In this case the transport is a bit simpler, but no matter what type of transport you use, you are going to exchange some data that has a certain representation format, using a certain syntax. There has to be an agreement between the two sides about which syntax they intend to use. When A writes out the message as a sequence of bits or bytes across the transport, the bits appear on the other side and then you reconstruct a message carrying the same content. Once you have exchanged the bits, then you have to give them a structure so that their semantics can be understood. For example, the recipient has to understand that a message about a payment just arrived: this is the amount of money, and this is the currency. So you have to assume that both sides agree not only on how they represent the amount of money (using integers, floating point, or some other way to encode financial quantities) but also on the currency. Maybe it’s just a string, but you don’t have to come up with your own set of expected currency values, as there are standards listing all possible currencies that you can pick and choose from. If both parties cannot agree on which standard works best, then you need an adapter to convert between different formats and mismatching data models. You want to connect components to invoke operations, so the components can call each other. And the set of operations that you can choose from is also a good candidate for standardization. This way there would be an agreement on what the set of operations provided by the interface is, as well as what each operation actually means. The same can be extended to all the interface features: properties, events and the data model. You can also have a standard way of locating or addressing the interface that you want to interact with. We need to be able to identify it, we need to know where it is, and we need to know how to reach it. These are all various aspects to which standardization efforts have been devoted. As a result, there is a whole technology landscape related to how to construct interoperable and compatible software interfaces. Let’s use this as a map to position, classify and compare different tools.
Representation Format (Meta-Model) → Data Model
• Plain Text
  • XML → RSS, HTML, SVG, SOAP
  • JSON → JSON-LD, GeoJSON
• Binary
  • Protobuf
Regarding the representation of the data being exchanged across the connector, the first decision you have to make is whether the data will be sent over the network as text or as binary. Sending text messages, which can be read by a text editor, doesn't necessarily make them human-readable, but at least they can be copied and pasted into an email, for example. If you choose binary, you can use all the bits (increase the information density of the message) while making it more difficult to view and debug without custom tools. If you prefer to stay with a readable format, a bit less dense and less efficient in terms of bandwidth, then you can choose, for example, between the JavaScript Object Notation (JSON) and the Extensible Markup Language (XML). The valuable property of these formats is that they prescribe a syntax, but they don't assume anything about the actual structure of the messages that you are going to send following that syntax. That's the reason why XML has an X: its syntax is used to define other markup languages by fixing the set of tags and their semantics. So from XML we can derive HTML (for Web pages), SVG (for vector graphics images), SOAP (for messages), WSDL (for service interface descriptions), RSS feeds (for event logs). These are all specific formats: once you choose to go with XML, you still need to make another decision which will set the exact type of XML document to be exchanged. The same is true for JSON, which gives a more lightweight representation for the structure of objects; you can add more assumptions about the field names, or the convention used to name the fields. This way you can obtain, for example, the GeoJSON format to exchange your geo-located datasets, or JSON-LD, to represent linked data, with documents that carry references to other documents. This little decision tree should help you get started with the interface design decision: how am I going to pick a representation syntax? Before you try to define your own, be aware that there are many standards to choose from. Picking one of them would make it much easier for others to send messages to your interface, or to read the information your interface provides them.
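To make the text-versus-binary trade-off concrete, here is a minimal JavaScript sketch (the payment fields 'amount' and 'currency' are invented for this illustration) encoding the same message both ways:

const payment = { amount: 100.5, currency: "CHF" };

// Textual: self-describing, readable, easy to copy and paste
const text = JSON.stringify(payment); // '{"amount":100.5,"currency":"CHF"}'

// Binary: denser, but opaque without knowing the layout in advance
const buffer = new ArrayBuffer(11);
const view = new DataView(buffer);
view.setFloat64(0, payment.amount); // 8 bytes for the amount
for (let i = 0; i < 3; i++) { // 3 bytes for an ISO 4217 currency code
  view.setUint8(8 + i, payment.currency.charCodeAt(i));
}

The textual form can be inspected with any editor; the binary form only makes sense to a receiver that shares the exact same layout assumptions.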
• Operations
• HTTP Methods (GET, HEAD, PUT, DELETE, POST, PATCH, OPTIONS)
Just one brief example about the operations, showing the most widely successful case of standardizing the set of operations that can be performed on a certain interface. The methods that are part of the Hypertext Transfer Protocol (HTTP) offer a limited, but clearly distinct, set of operations that any Web resource can perform. Anywhere on the Web you will find billions of different resources, which let you read their whole state (GET), or just a subset of it (HEAD). Some also support updates (PUT) and can be deleted (DELETE). You can call the resource and perform arbitrary computations (POST). You can also incrementally modify the current state (PATCH). Additionally, you can perform reflection: ask the interface what you can do with it. OPTIONS is a meta-method, which tells you which subset of the previously mentioned methods this particular resource can perform. So you don't have to assume in advance which operations are actually there, but you can discover them dynamically.
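As a small sketch of this dynamic discovery (the URL is hypothetical), a client can ask any Web resource which methods it supports before invoking them:

const response = await fetch("https://example.org/resource", { method: "OPTIONS" });
console.log(response.headers.get("Allow")); // e.g. "GET, HEAD, PUT, OPTIONS"

The standard Allow header carries the subset of HTTP methods this particular resource supports, so no prior assumption about the available operations is needed.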
• Protocols
• HTTP (Hypertext Transfer Protocol)
• SMTP (Simple Mail Transfer Protocol)
• MQTT (Message Queue Telemetry Transport)
• AMQP (Advanced Message Queuing Protocol)
• XMPP (Extensible Messaging and Presence Protocol)
The chosen transport protocol helps to exchange data from A to B. Are you going to work with sockets? Or can you choose a higher-level protocol? Here are a few example protocols that sit above TCP or UDP on the networking stack. If you are interacting with a remote Web API, you will probably use the HTTP protocol. There was a time in which people were trying to also use the email protocol (SMTP) to do the same. Then other, more appropriate, messaging protocols emerged, like MQTT or AMQP. This is also a space where you find many proprietary or de-facto standards for transporting messages between two components. Usually it takes decades before the dust settles and people have found an agreement on the winning protocol. Your choice may depend on what kind of device you plan to deploy your software on. Will the device be part of the Internet of Things? Or the Web of Things? Will it be a cloud-based service? Each of these comes with different presumptive transport protocols: the kind of protocol you would be expected to choose by default.
• Addressing
• URI (Uniform Resource Identifier)
• UUID (Universally Unique Identifier)
• DNS (Domain Name System)
• IPv4, IPv6
Another important aspect is: where do we find the interface? How do we address elements found within the component interface? To solve these problems, we can again use networking standards. For example, IP addresses: they have run out for IP version 4, but we can have a bigger address space with IP version 6. We can then use a standard registry for transforming symbolic addresses into numeric IP addresses: the Domain Name System (DNS). We can choose to take advantage of Web technology: URIs and URI schemes help to invent or reuse powerful mechanisms to structure address identifiers. Another problem related to addressing is: who is responsible for producing new addresses? Who is managing the addresses? And who is giving names to things? If you work with the UUID scheme, it is possible to decentralize that operation: all the components in your system can come up with unique identifiers without having to agree beforehand. This independence is more difficult to achieve, for example, with DNS, which is hierarchical: names still depend on top-level domains, where control over how addresses are handed out is centralized. If you think about Java and Java packages, how are you naming them? Basically you're following a convention to use the reversed DNS symbolic address: if you own a domain, you can name your components after it, and no one else should be able to do the same.
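A minimal sketch of decentralized identifier minting, using the standard Web Crypto API available in modern browsers and Node.js:

const id = crypto.randomUUID(); // e.g. "3b241101-e2bb-4255-8caf-4136c566a962"

Every component can generate such identifiers locally; the probability of two components producing the same UUID is negligible, so no central naming authority is needed.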
• Interface Description (Metadata)
• Schema Definition Languages (XML Schema, JSON Schema)
• Interface Description Languages (WSDL, OpenAPI, AsyncAPI)
• Data Semantics (schema.org)
There is another level, which is about making sure that we can agree on how to describe the interface. It's not about the content of the interface itself, but about the languages and the tools that we use to model and represent what is inside the interface. To do so, we can benefit from standard data management technologies. Schema languages describe a data model: what information does this software interface consume or produce? Just specify the schema. Even though you're not necessarily going to write queries for that schema or store data according to it, you have a schema that tells you, given a certain message, whether the content of the message fits within the expected schema. If the message doesn't comply with the schema, it can be filtered, bounced back, or simply discarded and ignored. In the same way that you work with the type system to compile your code and statically check the validity of object structures based on their classes, here we can check the messages as they arrive and validate whether they fit within the constraints of the corresponding schema types. The schema is only the data modeling part of your interface. The interface, in addition to the data model, also has operations, properties and events. Those are the focus of interface description languages (IDLs): for example, OpenAPI for Web-based APIs or the newer AsyncAPI, focusing on message-based interfaces. You can further constrain messages by specifying not only their syntax and structure but also the semantics of the content. This way, whoever is trying to use your interface knows what the data actually means, and the purpose of each operation. In addition to the plethora of Semantic Web languages, it is worth looking at the conventions followed by schema.org. If you're interested in this, you can follow these pointers to find examples of what it means to describe interfaces, including their semantics.
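As an illustration, here is a minimal JSON Schema (the payment fields are invented for this example) that an interface could publish so that incoming messages can be validated before being processed:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "amount": { "type": "number", "minimum": 0 },
    "currency": { "type": "string", "pattern": "^[A-Z]{3}$" }
  },
  "required": ["amount", "currency"]
}

A message missing the currency, or using a lowercase code, would fail validation and could be bounced back to the sender instead of corrupting the receiver's state.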
What is the effect of connecting two components together? We couple them. There is a benefit, because connected components can interact: they can exchange information and communicate, they can transfer control and coordinate their work, they can call each other. But there are also other, less positive effects. For example, they depend on each other: what could possibly go wrong? If you need a component and the component is not there, there may be consequences for you: you could have cascading failures, propagating along the fault lines of your architecture, induced by coupling. What if the other side is going to change? You make a change on one component (either the implementation or its interface). Since you are connected and coupled to it, is this change going to affect you? It depends. If you simply touched the implementation, you have a chance to control the impact of the change. If you modified or removed existing features of an interface, then you are looking for trouble.
Highlight the code making these assumptions: Representation Format, Location Address, Content Semantics
IPHostEntry hostInfo = Dns.GetHostByName("asq.click");
IPAddress address = hostInfo.AddressList[0];
IPEndPoint endpoint = new IPEndPoint(address, 8888);
Socket socket = new Socket(address.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
socket.Connect(endpoint);
Solution answer = new Solution("asq.click", 1);
byte[] message = answer.getBytes();
socket.Send(message);
socket.Close();
I would like to ask you to apply what you understood about coupling to highlight, in these concrete examples, the assumptions that we make about the representation format, the address of the component, as well as the semantics of the information that we exchange. Here is a piece of code that embeds all of these assumptions. The program is able to connect to a server, revealing the intricacies of opening a socket and sending information through it so that the other side can get the data that you're trying to exchange. The code embeds assumptions that will affect the coupling between your client and the server. These assumptions could change or become invalid: being aware of them, and of the coupling that you introduce into your architecture when you write this type of code, is what we want to discuss today. The location address encodes the knowledge about where the server is that we want to connect this client socket to. The part of the code that knows about the location uses the symbolic address of the server. This is the address that we look up with the DNS: if you want to switch the address, you have to change this string. Two lines later there is a port number (8888) as well, which is also part of the address. Everything else is just boilerplate code that you need to write if you want to open a TCP socket using the address as an input, but you don't have to change it unless you want to switch to a different protocol. The content semantics is basically embedded in the code that is responsible for knowing the meaning of the data being exchanged. 'Solution' is a class that encodes a certain string with a certain color, following the interface data model defined elsewhere. When you transform the 'answer' object into a set of bytes, this is where you decide which representation format to use: the data is a solution that contains a string and a color, and the byte array is the format of the message in which that data gets sent.
We can recognize the need to perform a transformation between the internal representation of the data (stored in memory within the structure given by the 'Solution' class) and the external representation of the data (also stored in memory, as a byte array, so that it is ready to be copied along the protocol stack). This 'getBytes()' is the adapter between the inside and the outside; this is where all the assumptions on how the data gets serialized are found. These assumptions need to match the ones of the message parser on the server side.
Highlight the code making these assumptions: Representation Format, Location Address, Content Semantics
fetch("https://asq.click/api", { headers: { 'Accept': 'application/json' } })
  .then(response => response.json())
  .then(sol => { highlight(sol.text, sol.color); })
It's very rare that you would actually program sockets directly these days, so let's take a look at a higher-level example. This is no longer C#, this is JavaScript, but you should still be able to pick out the same three aspects. Here we can see the code of a client that contacts a Web API over HTTP via the asynchronous JavaScript 'fetch' API. The code processes the response by parsing the JSON string, which gets transformed into an object; two specific fields of the object are then passed to the 'highlight' function. The location address is probably the easiest to spot, since it is found in only one place, written as a URI string encoding the absolute address of the API of the Web resource. By default fetch uses the GET method, so we do not have to configure it explicitly. How does the code control which representation format will be used? There are two places where this assumption is encoded. The first configures the 'Accept' header to ask the server to use a particular media type as the representation of the resource being fetched. The second is where the response is transformed into an object: the body of the response is actually received as a string, which we parse and transform into the object, assuming the string uses the JSON syntax. Then we work with the object and start making assumptions about its content semantics, by extracting the text and the color fields. This code is actually independent of where the content of the object comes from and how it got there. You can also interpret this code in terms of connectors and adapters. The fetch call represents the low-level HTTP protocol used to connect this client code with any remote Web resource (here you need to make sure you select the right address). Its result needs to be fed into the specific, local interface 'highlight'. There is a mismatch between the extremely general result (a string) downloaded as the payload of an HTTP response and what the local function requires as input (two specific parameters with a domain-specific
semantics: text and color). How do we solve this mismatch? Use an adapter which can translate the original string into an object, from which the specific fields can be extracted. Does the local call to highlight the solution depend on where its input data parameters came from? No. Does the actual content of the object depend on the server sending the expected content with the correct semantics in the response? Yes. What if you change the structure of the object to represent a true/false answer? Which part of the code will need to be changed? Can you reuse the fetch and the JSON parse steps without changing them? They are completely orthogonal to the content semantics. Likewise, you could reuse the same highlight function and pass it an object hydrated from a SQL query sent by an object-relational mapper, as long as the schema would match the content that should be in the object. Addresses, protocols, representations and semantics are always present when we connect two different pieces of software together. They can be controlled, configured and designed independently: we can take the same content with the same semantics and represent it in completely different ways. And if for some reason we need to move the server to a different location, most of this code will not be affected. The question is whether it's a good idea to hard-code these addresses into your code, or whether they should be read from some separate configuration settings, to keep the code location-agnostic.
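A minimal sketch of keeping the client location-agnostic: the address is read from configuration instead of being hard-coded (the environment variable name API_URL is an assumption made for this example).

const API_URL = process.env.API_URL ?? "https://asq.click/api"; // fallback for local runs
const response = await fetch(API_URL, { headers: { 'Accept': 'application/json' } });
const sol = await response.json();
highlight(sol.text, sol.color);

Moving the server now only requires changing the deployment configuration; the client code itself only needs to be rebuilt when its behavior changes.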
Understanding Coupling
• Different connectors introduce different degrees of coupling between components.
• Only disconnected components are fully de-coupled.
• What type of assumptions are implied by a connector?
• component exists
• component matches interface
• component deployed at a given location
• component running and available to be invoked
• component can freely evolve
Depending on the connector that you use, you will have a different degree of coupling. The different types of assumptions you make will affect the coupling between the components. As long as components are connected, they are coupled. Sometimes you read about different degrees of coupling: loose coupling, tight coupling. Not all coupling is the same, but only components that are completely disconnected are fully decoupled. Coupling is about the impact that what happens on one side has on the other side. This impact directly depends on the assumptions that you make when you select and configure a certain type of connector. The most fundamental assumption, which you already make at design time, is that there is a component on the other side that exists and that matches the interface you have on your side. This is implied just by the fact that you are drawing the line between the boxes. The edges connecting the nodes of the structural dependency graph of the logical view of your architecture represent an important assumption: each connected node assumes the other side exists and matches the interface it requires or provides. When you switch between the design of your system and its deployment, you also need to know where to find the component that you depend on. So we assume the component exists, that it has been deployed in some container, and that we know where to find it. When do we discover the existence of a component? When do we learn its location? And do we hard-code these answers, like in the previous examples? There, a string with the IP address is written into the code: if we want to change the location, we have to change the code and recompile it. That sounds like strong coupling: a change of location requires a rebuild. Instead, we should be able to separate the information and the knowledge about the location of our dependencies from the code that actually uses them.
After the deployment we start the system; its components start running. That's when components start to interact with each other. Components that need to interact, depending on the connector that you choose, may assume that the other side will be available to answer their invocations. Or, with different connectors, a component can take the opportunity to interact with another component that is not available at the same time. Depending on the connector, the availability state of one component may affect the success of the interaction with the other component. While you operate your system, something will fail. How does a failure of one component affect the other side? Will the failure of a random component, which you didn't even know existed, bring down your component? Again, this depends on the type of connector, which can either propagate the failure or isolate it. As you evolve the system, you make a new version, you ship a new release and deploy it. Components you depend on have a life of their own: they change. Components that used to work stop working after upgrading them. Interfaces which used to match are no longer perfectly matching. Which changes on one side are going to be visible and affect the other side? When you connect two components together, you establish a dependency. You couple them together. That means they are no longer independent from each other: they cannot change independently. Changes of location, state, availability, implementation, as well as interface will propagate in different ways along the coupling relationships introduced by different connectors.
Coupling Facets
[Diagram: a Client connected to an Interface/API, whose Implementation runs on a Runtime Platform]
• Discovery: how does the client find the component location?
• Binding: when does the client select the component?
• Interaction: must the client directly connect to the component?
• Timing: must the client and the component be available at the same time to exchange a message?
• Session: do clients and component share session state? Or is each message independent?
• Platform: is the client affected if the component is ported to another platform?
• Interface: to what extent can the API change without breaking clients?
Let's look more precisely at these different coupling facets. Let's decompose the definition of coupling between two different components into multiple facets. Here we can see that there is a client connected to a certain interface. The interface has an implementation, which runs in a certain container on a certain runtime environment. The first question that you should ask yourself when you're about to choose the type of connector to introduce is: should both the client and the component it depends on be available at the same time in order to communicate successfully? The answer can help you constrain your choice of connector. The timing facet of coupling depends on whether the connector is synchronous or asynchronous. If the connector is synchronous, for example a remote procedure call, we assume that when we make the call, the other side is available to answer it within a certain time window. If it takes too long, the client will not receive an answer and typically the call will fail due to a timeout. The next facet of coupling is related to the location: how does the client discover the address of the interface? How does the client learn about the location of the other side? The client needs to know where to find the component that it depends on. This can be resolved when you write the code of the client: you hard-code the IP address of the server into the client. Or you write it in a configuration file: when you start the client, it reads the configuration file and discovers the address of the server. This could also be something that the user gives to the client. If you open a Web browser, the first thing you have to write is the address of the Web server: the browser doesn't know it; it's the user who knows. These are different approaches to keep the client as independent as possible from the knowledge about the location of the server. The more the client knows, the larger the impact a server migration will have.
Another difference separates direct from indirect connections between the two components. For example, a message bus is indirect, because the client sends a message into the queue. There is a direct connection between the client and the queue, but the other side does not talk to the client: it just talks to the queue, to receive the message once it's ready to be delivered and the server is available to process it. Also in this case, your choice of connector will change the interaction facet of coupling. The binding facet is related to an earlier decision: the selection of which component you are going to depend on. Once you have picked the component, you can discover its location. But first you need to know which component you are going to use in the first place. Given a certain interface, we can have multiple implementations. So when do we select the implementation? A binding decision is the decision you make when you choose the component that you want to connect with. Discovery is different: once you have chosen the component, you know who you want to talk to; then you still have to find out where it is. The same component can be located in different places, and you may want to search for the nearest location to minimize latency among a globally replicated deployment. The binding is done based on the functional requirements; the discovery helps to optimize the extra-functional requirements of performance and availability. What if you work with a component and all of a sudden they tell you that, from now on, you have to pay ten times as much if you want to keep calling it? Here it doesn't matter where the component is located. What you want is to switch to a cheaper provider and change your binding (assuming you can find a replacement). As a consequence, you will also have to change the location. We can see the location as a more technical facet, related to having a distributed runtime where components need to communicate over the network. The binding is more of a business decision, which establishes a relationship with a given provider. Platform independence is a facet which determines whether changes in the runtime environment of the container in which one component is deployed will affect the other side. For example, consider switching the programming language used to implement a component: is the client going to notice? This usually shows up in the data model of an interface and in how data is represented in different languages. If you expose a binary representation which directly maps the memory layout of your objects, clients may notice even if you just switch to a different compiler version of the same programming language. If you use programming-language-independent standards, such as JSON or XML, there are already enough standard adapters between most programming languages and these representations that your interface becomes platform-independent. This way, changes to the dependencies of your dependencies are abstracted away and will not impact whoever depends on you. Another critical coupling facet concerns the assumptions the client makes about the content of the interface that it depends on. We can pick the right connector to mitigate changes to the location, representation format, or timing. We can even rewrite the implementation using a different programming language and still keep the client compatible, thanks to information hiding. But what if you deprecate or eventually remove some interface feature?
The client would be totally broken, as it tries to use something that is not there anymore. We will study, in the lecture about flexibility, what kind of changes we can make to an API that do not affect clients, and what kind of changes you should never make to an API unless you want to break all of your clients. The session facet depends on the type of interaction between the two sides. If the client performs an operation on the interface, will the result of the operation depend on the previous interactions of this client? This is similar to stateful interfaces, for which the results depend on the history of the previous interactions of all clients. Establishing a
session between the client and the interface means that interactions are no longer independent from previous ones. The interface will remember and the client will assume the interface remembers. What could possibly go wrong? Let’s see an example.
Session Coupling Examples
Stateful shell session:
> cd tmp
> rm -rf *
Conversational session:
> Hi! What's your name?
> Hi Olaf, this is today's menu. What would you like to order?
Session-less request:
POST /order
customer: Olaf
choice: Menu 1
Let's start from the shell command example: how many of you would want to run the second command independently of the result of the previous one? The first shell command sets the current directory, which will definitely determine the outcome of the second command. That's why it is sometimes called a shell session: every command changes the state of the shell, and you enter the next command assuming the shell is in a certain state. When your assumption fails to match the state of the shell, that's when you could end up deleting your entire file system. Don't try this at home. Always 'pwd' and check the result before doing a 'rm -rf *'. The chatbot example shows a conversational session. Each party will exchange multiple messages in a certain order. The big question is whether they are expected to remember previous messages or whether each message is treated independently of the previous ones. Will a previous message affect how further messages are processed? Is it even possible to send a message without having sent another message beforehand? The advantage of chatbots is that they are supposed to be more human-friendly, more natural: you feed them information step by step. On the other side, the information gets accumulated until you get to the point in which you want to place the order and commit the transaction. It makes perfect sense to have this conversation within a few seconds, but how long is the waiter-bot willing to wait for the second part of the conversation? Human short-term memory tends to expire after a brief time; this is not the case for bots, which would gladly keep waiting forever for you to complete the order. But if you resume the conversation months later, do you assume that the state on the other side of the session is still there? How likely is that? Have you ever experienced a long hiatus in a personal relationship and then resumed the conversation decades later as if no time had passed? In this case, the protocol should include mechanisms for checking whether the state of the session is still in sync between the two sides. There is nothing worse than entering data into a complex Web form and getting a session timeout error.
Speaking of Web forms, the third example is about making an HTTP request to place an order. The request contains both the selection of the menu and the name of the customer: with one command we get the job done. This is an example of an interface designed to work without establishing a session. If you have interfaces which require multiple rounds of interaction, where every interaction builds on the previous ones, then the longer the session takes, the higher the chances that something may go wrong and there may be a partial failure: a failure in which one of the two sides of the interaction loses the state of the interaction, while the other still remembers it. This makes the session state that you have established across the connection inconsistent. The risk of partial failure, and the complexity of recovering from it, makes session-ful connectors more coupled. Session-less connectors can simply retry failed interactions, since both sides are always in sync: their session state by definition gets reset after every interaction.
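A small sketch of why session-less interactions are easy to recover (the URL is hypothetical; the fields follow the slide example): since every request is self-contained, a failed one can simply be retried without resynchronizing any session state.

async function placeOrder(customer, choice, retries = 3) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      // all the state travels inside the message itself
      return await fetch("https://example.org/order", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ customer, choice })
      });
    } catch (e) {
      // no session to repair: just try again
    }
  }
  throw new Error("order failed after " + retries + " attempts");
}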
Binding Times
[Diagram: binding times mapped onto the system lifecycle (Plan, Code, Build, Select, Integrate, Test, Release, Deploy, Operate, Monitor, Backup, Recover, Migrate, Discover): early binding during Plan and Code, static binding from Build through Release, late binding during Deploy, dynamic binding during Operate, very late binding during Recover]
Binding Times
• Depending on the connector, the topology of the graph of connected components may be defined, constrained, refined or modified at different points during the system lifecycle:
• design-time (as components are selected)
• build-time (as components are packaged together into deployable units)
• deployment-time (as components are unpacked and installed)
• or even modified at run-time (to optimize the behavior of the system, e.g., switching between service providers)
• during recovery, when something fails and needs to be replaced
When do you make the selection of the components that you want to reuse and will need to depend on? This can happen throughout the whole lifecycle of your architecture. Early binding indicates a choice made before running the system: a choice made based on the documentation of the interface we have discovered. We select it, but we haven't yet put it into action. An early decision is made before you even compile the code. Static binding also happens before starting up the system. This can be done both for components that you built yourself and for components that you have selected to be integrated from somewhere else. After you have built your system, you can run tests with it; this is where binding decisions become more dynamic. Binding after you start to run the system is also known as late binding. For example, after you ship a release, during its installation one may choose which components get deployed. Sometimes deployment is atomic: all the binding decisions have been made statically and you can only drop in the entire package or nothing. Sometimes you can customize which components get plugged into or out of the system at deployment time: you have already built the system, it's already tested and released, and now, during deployment, is the time to do the customization. Dynamic binding happens while the system is running. It requires support for dynamically loading (and unloading) components without the need to restart the system. This can be useful to postpone the decision about which component to load until the very last moment, right before you actually need to call it. Until you're about to make the call, you can still switch the implementation targeted by the call. If you call something and it fails, your binding decision on what to call was misguided: maybe it used to work, but for this particular call it didn't, because your dependency crashed. If, as part of your recovery, you decide to change your selection of the component that you're going to use, you re-evaluate the binding, doing what we call very late binding. The difference between late and very late binding is that late binding happens during deployment or startup; very late binding happens after a failure, during recovery, which may involve another startup cycle. As part of recovery, to stabilize your system, you decide not to depend anymore on a flaky, unreliable provider: you found a better provider, or you had a backup provider, and you can change the binding. The more dynamic your binding, the later you wait, the more flexible your system is going to be. You make early binding decisions by prescribing a connection between two components based on their documentation. As you build your system, you can statically bind components while packaging a release artifact. Static linking results in a single artifact, which can be atomically deployed. Dynamic linking supports swapping or adding more components at runtime, while the system is operating, making decisions based on information only available at runtime (e.g., performance, workload, resource utilization, cost). The earlier components are bound together, the stronger the coupling between them and the more difficult it will be to replace some of them without having to go through the entire lifecycle again (e.g., rebuild or reboot) for your system.
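A minimal sketch of dynamic (and very late) binding in JavaScript, using dynamic module imports; the module paths, the config flag, and the exported call() function are assumptions made for this example.

const config = { useBackup: false }; // e.g., read from a configuration file at startup
const providerPath = config.useBackup ? "./backup-provider.js" : "./primary-provider.js";
const provider = await import(providerPath); // the implementation is loaded only now
provider.call();
// very late binding: after a failure during recovery, flip the flag
// and re-import to switch provider without rebuilding the system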
Be liberal in what you accept, and conservative in what you send. Jon Postel
[Diagram: a strict component A sending a message to a flexible component B]
We can conclude this lecture on compatibility and coupling with two messages: some food for thought. The first concerns the compatibility of two connected components. For example, they're supposed to comply with a standard interface: how do you check or enforce this decision? You have two options. One is to be as strict as possible in how each side follows the standard, e.g., how it represents the data being exchanged. The other option is to be as flexible and forgiving as possible while exchanging information, which may deviate from the standard, but still work. One side sends a message to the other: which side should be strict and which side should be lenient? Having both sides strict is not practical, as it would imply having only a single interpretation (or implementation) of the standard. Flexibility on both sides is a recipe for chaos, since there would be too many possible variations from the agreed-upon interface: you could have components producing noisy, colorful messages, and strict components rejecting most of them. This also does not make it easy to grow the size of the system while keeping most components compatible. The only viable alternative for large-scale interoperability is to be forgiving as you accept messages from unknown senders, but strict in the messages you send to the rest of the world. This principle helped to support the growth of the Internet protocols, as it recognizes that while there can be many interpretations of a standard, it is important to provide incentives for compliance.
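A sketch of Postel's principle applied to message handling (sometimes called a "tolerant reader"; the order fields are illustrative): be liberal when parsing incoming messages, strict when producing outgoing ones.

function readOrder(msg) {
  // liberal: ignore unknown extra fields, tolerate a missing optional note
  return { customer: msg.customer, choice: msg.choice, note: msg.note ?? "" };
}

function writeOrder(order) {
  // conservative: emit only the agreed-upon fields, in a fixed shape
  return JSON.stringify({ customer: order.customer, choice: order.choice });
}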
Water or Gas Pipe?
Sometimes incompatibility is the goal
The second message is that in some cases compatibility can be a problem, like in this example, where we have two identical pipes that we want to reuse to carry different content: water and gas. This makes it more efficient to source the pipes while constructing the building. However, how can you avoid connecting them together? After all, you want to prevent starting a fire when opening your bathroom faucet. The solution is simple: make the interfaces incompatible so that they cannot be connected together by mistake. You just need a standard, or a convention, where gas and water pipes are threaded in opposite directions, so that for safety reasons they cannot fit together. Safety can justify incompatibility, and you should not try to work around it with some adapter. Incompatibility can also be a business strategy to create vendor lock-in, keeping customers captive within your walled garden as it becomes too expensive to switch to alternative platforms. Incompatibility can also fuel the growth of an industry dedicated to producing adapters.
References
• Richard N. Taylor, Nenad Medvidovic, Eric M. Dashofy, Software Architecture: Foundations, Theory and Practice, John Wiley & Sons, January 2009, ISBN 978-0470167748
• Gregor Hohpe and Bobby Woolf, Enterprise Integration Patterns, Addison-Wesley, October 2003, ISBN 0321200683
• Douglas C. Schmidt, Half Sync/Half Async, 1999
• Cesare Pautasso, Erik Wilde, Why is the Web Loosely Coupled? A Multi-Faceted Metric for Service Design, pp. 911-920, Proc. of the 18th International World Wide Web Conference (WWW 2009), ACM, Madrid, Spain, April 2009
• Jon Postel (ed.), Transmission Control Protocol, RFC 761
• Cesare Pautasso, Gustavo Alonso, Flexible Binding for Reusable Composition of Web Services, Proc. of the 4th Workshop on Software Composition (SC 2005), Edinburgh, Scotland, April 2005
• XKCD 927: Standards
• Klint Finley, The Oracle-Google Case Will Decide the Future of Software, Wired, 23 May 2016
Software Architecture
Deployability, Portability and Containers
9
Contents
• Deployability Metrics
• To Change or not to Change
• Continuous Integration, Delivery and Deployment
• Platform Independence: Virtualization and Containers
The Age of Continuity
[Timeline: from 1 release every 2-3 years (1985, 1987, 1990, 1992, 1995, 1998, 2000) to 1 release every second (2014)]
After discussing the qualities of a software architecture at design time, today we're ready to make the big transition: we're going to deploy our system in production. This is the lecture in which we focus on how easy it is to deploy in production the software system that we are designing, and on the architectural decisions we can make that affect this deployability. Before we can deploy the system, we have to build it and make sure that it can be executed in a container in a runtime environment. So today we will also discuss virtualization and containers in the second part, with an eye towards portability as well. From the next lecture on, we will focus on runtime qualities such as scalability, availability and flexibility. Why is deployability important? The way we build our software systems has been undergoing a dramatic change: we have been switching away from an age in which software was built on a slow-moving cycle, with a major release every couple of years. It's important to worry about this quality, and to properly design the pipeline with which we build our system, because the frequency with which we ship releases and deploy software has changed. In the past, making a release was a major undertaking that happened relatively infrequently: here you can see, for example, the first decade of Windows releases happening every two or three years. Somewhere between that time and today we have dramatically increased the frequency with which software is released. Some people actually speak about the end of the software release: from the user's perspective, there is no longer the concept of a version of a software system, because every time you access the system, it has changed. Today deployability is about making releases continuously, so that we can continuously improve the qualities of our software: add more features, fix more bugs, and do so with confidence at high speed. Some major cloud providers announced, for example, that in 2011 they were making new releases every 10 seconds. Then in 2014 they actually increased the speed tenfold: they now make new releases every second. How can we achieve this high release speed without sacrificing the other qualities? That is the major challenge that has been addressed in the industry and which we will discuss in this lecture.
Deployability Metrics
• Latency:
• What's the delay between developers writing code and users experiencing the improvement?
• What's the delay between users giving feedback and developers releasing a fix?
• Throughput:
• How many releases are shipped to production every hour/day/month?
Deployability is a quality attribute that can be measured, and there are many ways to measure the deployability of a system. One concerns the latency, which we can define as the delay between a developer writing the code and the user observing the effect of that code after it gets deployed in production. If the developer is writing a new feature: how long does it take before this new feature gets delivered to the user? And if the developer is fixing a bug: how long does it take before the user that reported the bug observes that the bug is no longer there? These are examples going forward, but you can also observe and measure deployability in the backward direction. For example, a user reports a bug and gives some feedback about something that should be addressed. Of course, you have to route the feedback to the right developer and make a decision about whether you will actually act on it or not. Afterwards you can again measure how long it takes between the developer writing the fix and the user seeing and accepting the improvement. So how long does it take you to do that? Is it just a click on the 'deploy' button, after which the user automatically gets the update and no additional effort is required to benefit from the improvement? Or is it a major project over several months to go through this cycle once? Considering the throughput, we can measure deployability for the whole system in terms of how many releases you ship to production over time. Do you make a release every two years, or multiple releases every second? That is the expected range for deployability throughput.
Deployability Metrics
• Time:
• For how long will the system not be available during upgrades?
• Is the release late with respect to the marketing target deadline?
• Cost:
• How many people are involved in a release?
• How many Virtual Machines are needed to host the software production pipeline?
Let's also consider how expensive it is to do a deployment. Is this something that one developer can do with a click, launching the corresponding script in their IDE? Or does it entail a major undertaking with multiple teams involved, where developers have to coordinate with operations to schedule a suitable upgrade window? Or do multiple development teams have to synchronize? For example, one team is making a release for the server side and all the affected client endpoints need to be redeployed as well. When you're doing an upgrade, is the system availability going to be affected? Are the users going to see a little message apologizing for the ongoing maintenance: you cannot use the system while it's being upgraded, take a long coffee break? Your architecture has better deployability if, during the change, as a new version of the system is deployed, the users do not notice. It is very important not to disrupt the productivity of your users. If you have to stop what they're doing to wait for an upgrade, consider upgrading over the weekend. Disruptive, unplanned upgrades in the middle of the work day can be a very expensive proposition. Often one hears marketing announcements about the availability of new software products. For example, a new game will be released for the Black Friday or Christmas season. It is critical to hit this particular launch window: if you are late and cannot ship in December, this is probably a big missed revenue opportunity. So we can also measure deployability in terms of whether the release is shipped on time, so that we can meet the customer expectations generated by the marketing campaign. Another aspect of the cost of deployability can be observed by assessing how expensive the infrastructure for the build and the testing along the software production pipeline is. Can you build your system on your developer's laptop? Or do you need a whole cluster of virtual machines to process the compilation of the code and the parallel execution of all the automated quality assurance tests? There are systems with millions of lines of code which require several hours to compile, and whose complete test suite runs for a couple of days. In this situation, making a build is computationally very expensive, and if you make a small change and then you
have to wait overnight before you see the result of the nightly build, your productivity becomes limited by the time it takes to get feedback, due to the expensive deployability of large architectures.
Deployability Metrics

Metric                       Traditional                           Continuous
Latency                      High                                  Small
Throughput (Releases/Time)   1/Years                               Many/Day
Downtime                     Noticeable                            0
Deadline                     Death March                           Not Applicable
People                       Dedicated Team of Release Engineers   0 (Full Automation)
Infrastructure               ?                                     Investment Required
Traditional deployments, as an extreme example, have a very low throughput of yearly releases, and when making a deployment, users will notice. If you need to catch a release deadline, you will attempt to do so through so-called death marches, which lead to burnout and high developer turnover. Since every release is risky and requires heroic efforts, you do not want to go through crunch times too often. Making releases requires a big investment into a team of release engineers who take the code, build it, test it and package it so that it can be released by copying it on some website from which it can be downloaded and installed. In the good old days this also included people or robots burning the software on CDs and placing them into shrink-wrapped boxes. There was no viable concept of automated builds and continuous integration and delivery pipelines. Deployment was ad-hoc, manual, error-prone, very slow and expensive. Nowadays, if you want to make a release, you can do it quickly. You can do it often. Your users do not notice, as you can apply techniques that help you minimize the disruption involved in making a change to a production system. Since the release is continuous, you don't have a target window anymore: you just do it all the time. There is no longer a single opportunity; you can continuously improve the system. This has become possible because the process is fully automated: you just have to invest into setting up the pipeline with the proper tooling and the proper infrastructure to build your software. Then the pipeline reliably runs unattended: shipping a release has matured from a craft involving black arts into a well-oiled industrial workflow.
Release Opportunity or Risk?
Opportunities:
• Fulfill requirements
• Deliver new features
• Bug fixes
• Performance improvements
• Generate revenue
• Retain and grow customers
• Match or exceed competition
Risks:
• New failure modes
• New bugs
• Higher resource capacity required
• Peak of support calls
• Deployment may fail
• Deployment failure may be unrecoverable
(Michael T. Nygard)
Every time you introduce a change, you need to make a new release. When you make an improvement to your software, it's a big opportunity, because you are going to deliver new features, fix bugs and, in general, attempt to make your users happier because you better fulfill their requirements; this is a positive opportunity to address users' constructive feedback. New releases can be expected to improve not only the correctness but also extra-functional qualities, for example: better performance. Another consequence of shipping a new release is the opportunity to make your customers pay for the upgrade. This is one of the business models for software components: every time you make a new release, your users have to pay to install the upgrade. Some customers have a tendency to flock to freshly released software. Others, more conservative, wait for the release to age before daring to install it. From a business point of view, making a release is critical to retain existing customers and gain new ones, since your software now delivers more features than the competition, or it has finally been improved to catch up with the competition. These are all reasons why you should not hesitate before making a new release of your software. However, there is also a flip side: when you make a new release, all of a sudden the system behaves in a new way, sometimes in an unexpected way, which may lead to new failure modes. You already knew in which configurations and for which inputs your previous release was unstable. You make a change and now you need to discover even more cases in which things can go wrong. For every bug that you fix, you may introduce a couple of new ones; you have to be careful to catch those new bugs before you ship them. The performance might be improved, but maybe your capacity requirements have also increased: now you need more resources than before, so it has just become more expensive to run the system. Your new features interfere with the normal operations of the users. They confuse them and lead to increased costs for training and support, helping users get used to the change: Why did this change? Where did my favorite feature go? Why did you move the menu somewhere that I cannot find it anymore?
While you're in the middle of the deployment, things can go wrong. The deployment itself is a dangerous operation which changes the state of a running system into an unknown one. If the new state works, congratulations on your success! But if the deployment fails, what happened to the previous release? Avoid ending up in a situation in which the new release is broken and the old release is gone. Your users will definitely notice. That's why weekends have two days: Saturday to deploy the new release, and Sunday, just in case, to clean up or revert back to the previous one. Always have a time buffer and a strategy to recover from a broken upgrade. If you're not careful in the way you do the deployment, failures could even be unrecoverable. Thanks to virtualization this should never happen anymore, but consider when people used to work with physical computers: you have a physical object in which you installed an operating system, and then you have a disk with the software and the data in a certain version. Since it's expensive to get a new physical server on which to place the new release, some brave system administrators may just decide to install the new version over the existing one. If you do so and something goes wrong, the old one is gone, because you have just formatted the disk and forgot that you were supposed to migrate the old data. While recovering the old software may be possible with enough effort, losing data during deployments is unforgivable.
No Change = No Risk
The only software which does not need to change is dead software which nobody uses.
Minimize risk due to change, instead of minimizing change:
• Release many small improvements often.
• Every release should be reversible.
For these reasons, experienced operators grow conservative. When they hear about your new release, their reflex is to ask: are you really sure you want to install the update? Because, after all, there is a risk that it may not work as it used to. It is reasonable to hesitate before going from a known state to an unknown one. In the extreme case: simply don't make changes, ever. Stay in known territory. If you don't change, there is no risk: problem solved. However, the only software which doesn't need to change is dead software used by nobody. Anyone using your software will always find some ideas about how to improve it, or make friendly suggestions for some corrections. If you avoid risk by never daring to make changes, your release process takes years before you gather the courage to actually make the step. What you want instead is to minimize the risk of change, as opposed to reducing changes themselves. How do you do that? You make the magnitude of each change small, and you make those changes often. This way you get used to making changes, and changing your system becomes a normal, frequently executed operational process: not something that you do just once in your lifetime. You release and deploy every day, so that it becomes a habit and you learn how to do it properly. Also, you shouldn't make changes without a safety net. Like database transactions: begin the deployment; apply all the changes; commit. If something goes wrong during a transaction, roll back to the initial state. If you can find a way to make a software release work like an atomic database transaction, which goes from a consistent state to another consistent state, then you have a safety net which reduces the risk of a failed deployment that will destroy your system, tarnish your organization's reputation and eventually get you fired.
Release Frequency
[Diagram: Big Bang Release: one long Development (Build) phase followed by Operation (Run). Continuous Release: Development (Change) and Operation (Run) interleaved continuously]
If it hurts, do it more often (so that practice makes perfect).
In the classical waterfall, a lot of time is spent in development, building the software until it is feature-complete. All tests are green. Eventually the development project is finished: developers package and ship the release, and then they throw it over the wall to the operations team. The operations team has done nothing until this point, and now they take over. They install the system and they run it. After this transition, one could literally fire the developers, because you don't need them anymore: the software is done, the project has been completed. So developers can move on to another project as the operations department runs the system. What can go wrong from a development point of view? You think that you're done because you have clicked compile and generated some executable artifacts. Given some tests, maybe you even run them and check that they pass. Still, even if it passes all the tests, there is no guarantee that it is actually correct. The real test starts when actual users start to run it. But that's no longer your job; that is somebody else's problem. Yet it's very rare that the system is perfect on the first shot, in the first release. The idea is that if this transition is expensive, complicated, risky (in other words: if it is a pain), then you should do it more often, so that it becomes less painful. This situation is called a 'Big Bang' release. Big Bang releases are a once-in-a-lifetime event (the Big Bang only happened at the beginning of the universe). That's the only chance you have: if it goes well, you are done; if it goes wrong, there is no second chance. There are many reasons making it challenging to achieve a successful Big Bang release, but this was the traditional model, where the software is built by one part of the organization and then a different group takes over to execute it and deal with whatever operational problems emerge as a consequence of a successfully shipped development project. In this setup, there is very little feedback going back from operations to developers. How do we make releases happen in a continuous way? How do we do it more often?
We are going to make operation and development happen in parallel, so they're no longer sequential steps. It is no longer: first we develop and then we operate; while we develop, we also operate the system. The transitions between the system being built, the changes being introduced during development cycles, and the deployment and operation happen continuously. This doesn't mean that continuous deployments will always be successful. You can also have failed builds, broken releases and unsuccessful deployments. Look at the timeline for the red dots: the first releases are good, but then something goes wrong while we keep improving the system, and eventually we hit another good release. What is fundamental here is that we have a continuous feedback cycle between the developers that make the changes and the operations that watch the impact of changes on users and feed their feedback back to the developers for further improving the system. There is no separation anymore between who is responsible for applying the changes and who is living with the consequences. Because of such a tight feedback cycle between the two sides, and by making the release something that happens continuously, we have a lot of opportunities for trying out new ideas and making improvements that might not necessarily stick. Maybe not all the users appreciate them, so we can always go back and undo those changes. This is a paradigmatic shift from building first and operating later to doing both at the same time, as well as from separate teams dedicated to each activity to the same team involved with both activities.
[Figure: Speed vs. Quality – the supposed trade-off between going fast with low quality and going slow with high quality.]
Thanks to Automation, Speed vs. Quality is no longer a Trade-Off

Another change in perspective that happens with DevOps and continuous releases concerns the trade-off that used to exist between the speed with which you develop and release and the quality of the result. Conventional wisdom was that you get low quality when you go fast: the faster you develop something, the less time you can dedicate to doing so with good quality (both external and internal). Would you like to improve the quality? That's OK, but then you have to slow down. Doing continuous releases wouldn't work if this trade-off still held, because we cannot sacrifice quality to increase the release throughput; otherwise we would always be in a broken state with a failed release. When you automate all your quality assurance and testing tasks, and when you automate your production process to build and package the images and releases so that the process becomes not only fast but also error-free, controllable and repeatable, then you can break free of this trade-off and have a continuous stream of high-quality releases.
[Figure: the DevOps loop – Plan, Code, Build, Test, Release, Deploy, Run, Monitor – with a Fix feedback loop on the Development side and a Tune feedback loop on the Operations side.]
Software Production Processes are Software too
Let's go through the DevOps cycle once again: you start with the backlog, prioritize it and make a plan for the next iteration. You implement the stories. You build your system, you write and run the tests. You package the release, deploy it, install it and start it. While the system is running you monitor its behavior, gather feedback, and try to learn something from observing the users. You get all the bug reports, put them in the backlog and go around once again. This sequence of steps is what we call a process, because your development and operations teams go through a certain number of activities. When you describe a process, you are basically describing a piece of software. The question is: how can we automate this process? How can we write software that helps us to build, test, release, deploy, run and monitor software? If you have automated such an infrastructure, if you have tools which can support each of these activities, then the whole cycle can be fast, reliable and produce a high-quality outcome. The only bottleneck that is left concerns the developers who need to write the code as well as the tests, but the actual execution of the other steps should be fully automated. The cycle may stop from time to time: you do not necessarily want to make a change, build it, release it, and then wait until users complain before scrambling back to fix all their issues as quickly as possible. That's why we want the little cycle on the development side in which you catch all the problems as early as possible, thanks to testing and test-driven development. If the test is red, we go back and fix the code, then test it again before shipping the release. The same is true on the other side. Sometimes when you monitor some performance parameters you may be able to correct issues by allocating more resources or by changing the configuration of your system, and this can be done directly on the operational side. You don't have to go back and make a change to the system to do it, especially if the system is configurable.
[Figure: Software Production Pipeline – changes are pushed and merged into the Source Code Repository and flow through the Build stage (Compile, Unit Test), the Integration stage (Integration Test), the Capacity stage (Performance Test) and the Acceptance stage (User Test); after the Release Approval decision, the release is deployed to Production and checked with Smoke Tests.]
How can we design a piece of software that supports this process? What kind of software can enable such a continuous life cycle? This is called a software production pipeline. The input to this pipeline is a piece of software: the code of your software, which gets processed going through a series of stages. On the other side, out of the pipeline comes an image, a piece of software ready for deployment. This is an executable software artifact that can be put in so-called production: it can be made accessible to your users so that they can work with it. The software production pipeline transforms the software from its original source form into executable form, but it's not just a compilation step; that's only the very beginning. The build is actually not only compilation, but also checking that the unit tests pass. We want to make sure that all the components that we compile separately are passing quality control. Once you have built all the components you have to integrate them, so you won't have to release and deploy individual components, but a single artifact with your software and as many of its dependencies as possible packaged inside. This can be subject to the integration test, another quality assurance step. If this is green, further testing – for example, performance testing – helps to check that the system is not getting slower: you run a benchmark and compare the performance results with the previous release. You can also give some users a preview so they can tell you whether they accept how the new features have been implemented. Based on the outcome of all these tests, the process reaches a fundamental decision: to release or not to release. That is the architect's responsibility: to approve the release, to sign off and state: "yes, we reached a state in which we can publish our work and make it available to the users so they can start working with it in production". Once this happens, you deploy the system in production, and before you open the door to the users you still do a quick sanity check to make sure that everything works as intended. The most frequent word that you find in this slide is the word test. Software production pipelines emphasize automated quality assurance. At every step, you need
some automatic way to check that your intermediate product meets expectations. You have to be able to encode these expectations so they can be checked automatically, both in terms of correctness and in terms of extra-functional requirements. How often do you run this process? Do you rebuild and retest every time a developer pushes a new change? Do you do it on a time basis, for example, every night? How often do you meet and decide that you are ready to make a new release? That's another decision that you have to take. How often would you like to potentially disrupt the production environment? Some conservative companies, for example, freeze their systems in November, because the last part of the year is critical from a business point of view and they don't want to risk the negative impact of any change. Either you make a release before November or you wait until the New Year; the last part of the pipeline stops during this window. You can still work in the first part of the pipeline, but you cannot access production. That's why it's important to take responsibility for the decision to ship a new release.
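As a toy illustration of the pipeline as code, here is a minimal Python sketch (the stage names, their behavior and the approval hook are invented for illustration; real pipelines are usually described in the configuration language of a CI tool) in which every stage is an automated quality gate and the release approval guards the door to production:

```python
def build(artifact):      return True, artifact + ["compiled + unit tested"]
def integrate(artifact):  return True, artifact + ["packaged + integration tested"]
def capacity(artifact):   return True, artifact + ["benchmarked vs. previous release"]
def acceptance(artifact): return True, artifact + ["previewed by pilot users"]

def run_pipeline(commit, stages, approve):
    artifact = [commit]
    for stage in stages:                 # each stage is an automated quality gate
        ok, artifact = stage(artifact)
        if not ok:                       # a red stage stops the pipeline early
            return f"pipeline failed at {stage.__name__}"
    if approve(artifact):                # release approval: to ship or not to ship
        return "deployed to production"
    return "built and fully tested, but not released"

print(run_pipeline("commit 42", [build, integrate, capacity, acceptance],
                   approve=lambda artifact: True))
```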
[Figure: the same Software Production Pipeline, with the Continuous Integration label covering its first part: push and merge changes, Build (Compile, Unit Test) and Integration (Integration Test).]
Frequent commit, push and merge of changes with automated build with tests (At least once per day) We use the term continuous integration to refer to the first part of this pipeline. Developers commit, push and merge changes into the build with a certain frequency (e.g., at least once per day). The pipeline will pull the changes from the repository and go as far as building, testing and integrating the changes with the rest of the system. The result is something that we can run. There is a working version of the system that has passed unit and integration tests.
[Figure: the Software Production Pipeline with the Continuous Delivery label extending through the Capacity and Acceptance testing stages up to a Manual Release Approval, followed by Deploy, Smoke Tests and Production.]
Frequent automated packaging of releases, fully tested and ready for manual deployment in production

Continuous delivery goes one step further. In this case we obtain not only something that is ready to be executed, but something that is also fully tested: it passed all the tests that we have and it's packaged, ready to be released for manual deployment in production. With continuous delivery we go all the way to the decision step, in which somebody has to take the decision. If the decision is yes, we are ready to start the deployment in production. The deployment itself can be manual or automated, but it will only start if someone takes responsibility for it. This is difficult if you want to deploy new releases every second. During the compilation step, the compiler targets a certain runtime platform. The goal is to run the software in an environment that is as close to production as possible. If your users have different platforms (e.g., different operating systems), then your whole build and test pipeline has to reproduce those platforms. You will need to do a build for Windows, Mac, Linux, Android and iOS. The infrastructure to do this gets rather expensive, because you need to instantiate all of the pipeline stages for every one of those platforms, especially for running tests. The platform is defined not only by the OS or the web browser; it may also include other configuration parameters describing the assumptions your architecture makes about the environment in which users will run your system, for example, different screen sizes or input/output devices. If you want to run a systematic test of how your responsive user interface adapts to all of these different platforms, browsers and device screens, the size of the configuration space to test will explode, as the sketch below illustrates.
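A back-of-the-envelope sketch of this explosion, with illustrative (assumed) platform dimensions:

```python
# Every release must be validated on the cross-product of the platform
# parameters; each combination needs its own pipeline instance.
from itertools import product

oses     = ["Windows", "macOS", "Linux", "Android", "iOS"]
browsers = ["Chrome", "Firefox", "Safari"]
screens  = ["phone", "tablet", "desktop"]

configurations = list(product(oses, browsers, screens))
print(len(configurations))  # 45 test configurations for just three dimensions
```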
[Figure: the Software Production Pipeline for Continuous Deployment – every release is automatically deployed to production, as long as it passes all automated tests; the manual Release Approval step is replaced by an Automatic Release Approval.]
If your goal is to continuously deploy, then you have to get rid of this manual decision point and make the decision automatic as well. You need to write rules that determine that, as long as the new version passes all the tests, it gets deployed automatically (see the sketch below). Such an approach puts a lot of trust in the quality of your automated quality assurance tasks. Coverage should be very extensive, and you need to really trust that your tests mimic the conditions under which your users will work with the system. We can also break the decision into two steps. On the provider side, you approve the system as being stable and ready for prime time. On the consumer side, you decide if you want to accept the risk of deploying the release. This also depends on the kind of runtime environment targeted by your architecture. If you split into the classical Cloud/Mobile architecture, you can continuously push to the Cloud, but there will be a significant delay before mobile clients receive their side of the release, both due to AppStore approval processes (those are not automated and can add significant delay) and due to mobile phone users who give a low priority to keeping their apps up to date. Do you really want to have automatic updates on your mobile OS and the apps deployed on it? If you accept that, you trust whoever is making the release decision. You assume that if they send you an update, it is actually an improvement, in your best interest. For example, if you want to get the attention of users and nudge them towards accepting the update, you should always tell them: it's a security update. If you don't update your OS, you're going to get blamed if you get hacked, because you didn't install the security patches. What if instead the OS update would slow down your hardware to increase the likelihood that you buy a new phone sooner? In the long term, the big question is how much control users are going to have over "their own" hardware. Ownership, transparency and control of hardware devices is becoming questionable, since a lot of software running on "your own" computers is deployed without the full knowledge or awareness of the end users.
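A minimal sketch of what such an automatic approval rule could look like; the report fields and the 5% latency-regression threshold are invented for illustration:

```python
# The automated decision is only as good as the tests feeding the report.
def approve_release(report):
    return (report["unit_passed"]
            and report["integration_passed"]
            and report["acceptance_passed"]
            # reject releases that noticeably regress the 95th-percentile latency
            and report["p95_latency_ms"] <= report["previous_p95_latency_ms"] * 1.05)

report = {"unit_passed": True, "integration_passed": True,
          "acceptance_passed": True,
          "p95_latency_ms": 210, "previous_p95_latency_ms": 205}
print(approve_release(report))  # True: no human in the loop
```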
Overall, automated release approval and deployment acceptance gives you the performance that you need to avoid blocking the software production pipeline because a committee has to meet and agree that your software can be shipped. If you have a release that is automatically built, tested, released and deployed, you have joined the Brave New World in which every time your users connect to your system they are potentially using a brand new system. Probably it will not be completely different from the one they used the day before, because the changes that you introduce are small, but over time these changes accumulate. Congratulations: your software has reached the age of continuity.
[Figure: the Software Production Pipeline annotated with all three scopes – Continuous Integration covers push and merge, Build and Integration; Continuous Delivery extends through Capacity and Acceptance testing up to a Manual Release Approval; Continuous Deployment automates the Release Approval and the deployment to Production.]
To summarize, continuous integration is developer oriented. Developers already benefit from it because they have a safety net as they make changes to the code: the pipeline will reliably produce integrated artifacts that they can test and execute in their environment. With continuous delivery we extend continuous integration with a manual decision to make a release; the decision is based on the information gathered during the automated tests. Continuous deployment extends continuous delivery by automating both the release approval decision and the upgrade process to deploy the new release in production. Which is the most common option? It depends on the maturity of the organization, and on whether they trust their tests to give them sufficient information to make the decision automatically. When you introduce continuous deployment, users also need to be aware that they cannot stay behind and will be exposed to a continuous stream of changes.
High Quality at High Speed
• Every team member can see and influence the build process outcome
• Any pushed change can be released at any time
• Small, frequently committed changes are less likely to conflict
• Keep the build fast (get feedback as early and quickly as possible)
• Build after every commit, build every night (and run more tests)
• Reliably and reproducibly build any version of the software (keep a snapshot of all dependencies)
• Released images are immutable

If you work with continuous integration, delivery and deployment of your software, you follow these best practices as you work at high speed to produce high-quality software releases. Every team member has an impact on and visibility into the build process. This is about making everybody responsible and giving them the possibility to fix the problems that they introduce. It should not be somebody else's problem to watch over the build: everybody has to keep it green. It's a big responsibility to push your changes into one of these pipelines because, potentially, your commit – if it passes all the tests and there is an automatic decision – can end up in a release at any time. Developers should be aware of the impact of their changes, keep the changes small and commit them as frequently as possible to minimize conflicts with other developers. It's important to keep at least the early stages of the build very fast, because when you push a change and you make a mistake you want to see that the build breaks and get this critical feedback as soon as possible, while you still have fresh in your mind what exactly you touched and how to undo the problematic change. Depending on how expensive it is to run the build, you should do at least an incremental build after every commit. During the nightly builds, since there is more time, run more tests to extensively check the system, as opposed to after every commit, when you want to give an outcome as fast as possible. How are nightly builds handled when you have developers across the world? If the sun never sets on your software development teams, just make sure you regularly schedule a full build; some of your developers will get the results when they wake up. The build process should be not only reliable, but also reproducible. Based on the code repository, on your versioning system, you should be able to build any version ever released in the entire history of your software. The outcome should be the same no matter
when you run the build, even years later. As a consequence, you do not only want to version your code, but also take a snapshot and version your dependencies as well. This is a typical beginner's mistake: you do not version your dependencies, and then years later, when you try to build an earlier release to reproduce a bug, you're no longer able to do so because the dependencies have changed and you cannot access the old versions of the dependencies that you used in the old release. Once you depend on something, you want to grab a copy of it and store it in the repository together with your own code. You want to keep the copy around because dependencies change as well, and your code is not necessarily forwards or backwards compatible. Likewise, once you make a release, you tag it and you don't change it anymore, so that the version number and the build number refer to an immutable artifact. While there is always the temptation to go into your Docker image and make a few quick changes, try to avoid doing that: you would be doing something manual in a process that is supposed to be automatic. In the same way that you typically do not edit the bytecode or assembly code produced by your trusty compiler, if you want to make those small, last-minute corrections, you make them to the input source. Then you make a new build, and you have a new image with a new build number. A minimal sketch of dependency pinning follows.
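This sketch assumes a hypothetical lockfile that stores the exact version and a content hash of each dependency next to your own code; the build refuses to proceed if a fetched dependency no longer matches the snapshot:

```python
import hashlib

# Stand-in for the bytes of the dependency archive you grabbed a copy of.
archive = b"contents of libparser-2.3.1.tar.gz"

# Hypothetical lockfile entry, committed together with the source code.
PINNED = {"libparser": ("2.3.1", hashlib.sha256(archive).hexdigest())}

def verify(name, data):
    version, digest = PINNED[name]
    if hashlib.sha256(data).hexdigest() != digest:
        raise RuntimeError(f"{name} {version} no longer matches its pinned hash")

verify("libparser", archive)  # passes today, and years from now
```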
[Figure: a detailed software build pipeline. Developers push changes to the Source Code Repository. The Build Stage fetches changes and dependencies, compiles, runs coding style checks, static analysis and unit tests with test coverage, generates documentation and packages container/VM images, which are stored with the build logs and reports. The Integration Stage deploys the images and their dependencies and runs smoke and integration tests. The Acceptance Stage deploys the images for user tests performed by testers. The Capacity Stage runs smoke, performance and capacity tests. Architects grant the final Release Approval: if Ok, images and dependencies are deployed to Production (with smoke tests) for the users; if not, the changes are reverted. Each stage has its own configuration and produces its own outcome.]
The software build pipeline can grow to become a relatively complex piece of software on its own. There is a whole set of products that software development companies use to automate this whole process. We start from the top, where developers push changes into their code repository. From here the software becomes a piece of data that has to be processed, checked and transformed. Eventually it flows down towards production, where we have the users. Before the final deployment-in-production step, we have to decide whether to approve the release based on all of the artifacts, metrics and reports which are the outcome of the build and test phases. The first stage is about building the original source code. Building it involves a compilation step, but there can also be a static analysis step. This is also where one can generate documentation. Unit tests run after a successful compilation, and their coverage can be measured. Since these tests are focused on each individual component, mocks help to isolate the component under test. As a result, you have packaged your component as a deployable artifact: an image. What happens to the artifacts obtained along the pipeline? They get stored in an image repository with a certain version number called the build number. The process works with these images, checking that they can be successfully deployed. In the integration stage, components are combined together with their dependencies and further tested. Finally, it becomes possible to run an end-to-end test with the complete system, including all of its dependencies. If the integration is successful, we move on to acceptance testing, which can sometimes be automated, but sometimes requires a dedicated crowd of test users. The goal is to observe the end-to-end behavior by stimulating the system from the user interface. Different use case scenarios can be validated to reproduce real-world usage of the system. At the same time, the capacity, performance, or even scalability testing stages check how many resources are required to achieve a certain level of performance. Here the
focus is on the extra-functional quality attributes. Acceptance reports whether the system works correctly; here you care whether the correct result is delivered on time, given the available resources. The capacity stage deals with the question: how much processing power do we need to allocate so that the system meets the performance targets? Can we estimate how much it costs to run the system under a given workload? If all of these outcomes are green and the architects (or the automated release approval rules written by them) agree that this is a good release, then it is pushed to production. If there is a problem with the build, you have to fix the broken build. One way to fix it is to undo the changes that broke it: this is always a possibility. If it turns out that it was not possible to correct the problems, then you go back to the latest known state in which the system used to work, and you restart from there. Why do you still need to do testing even after you deploy your system in production? Because there should be an automatic way to check that the system is successfully initialized and to assess whether it's available: did it start correctly? Is it ready to work with the users? Such a pipeline looks complicated and also very expensive to set up. Remember: software to build software is still software; make sure you at least use version control to keep track of how you change it, if not a fully fledged production pipeline to build your main pipeline. You could try to get away with a simpler pipeline. Maybe it's enough if you build your system: compile it, package it, but once you have an image, just send it to your users and skip all the time-consuming testing. This is called testing in production. If you put the new version in production, the users will be in charge of reporting whether it works or not. If it doesn't, you will hear loud complaints from the users soon enough. This is a form of outsourcing the quality assurance for your software to your user community. If users are willing to pay you to become your testers, then you can consider skipping most of the pre-release testing. Clearly, users should be aware that they are using partially tested releases (sometimes these proudly sport a Greek letter in their name: alpha, or beta). Otherwise your reputation will suffer. For every piece of software, testing in production always happens; only in some cases can one afford to also do unit, integration and acceptance testing and try to avoid pushing out bad releases.
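Picking up the smoke-test question above, here is a minimal sketch of such a post-deployment probe. The `/health` endpoint is an assumption; substitute whatever readiness probe your system actually exposes:

```python
import urllib.request

def smoke_test(base_url, timeout=5):
    """Probe the freshly deployed system before routing users to it."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as r:
            return r.status == 200   # started correctly, ready for users
    except OSError:
        return False                 # unreachable: deployment is not ready
```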
Types of Testing
• Unit: Test individual components immediately after they are built (if necessary mocking their dependencies); a minimal example follows the list
• Integration: Test the complete system (integrated with all dependencies)
• User: Test the end-to-end behavior from the user interface (check that use-case scenarios work); only partially automated
• Capacity: Observe whether extra-functional quality targets (e.g., performance, scalability, reliability, capacity) are met, to determine the resources required for a given workload
• Acceptance: Test the system in an environment that is as close as possible to production, to determine whether or not a system satisfies the acceptance criteria and to enable the customers to decide whether to accept or reject the new version of the system
• Smoke: Quickly probe the system to check its successful initialization and state of readiness and availability

Note: Every company performs testing in production with end users; some companies can afford a dedicated staging environment for running all of the other types of tests before shipping a release
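As a minimal illustration of the Unit row above, the following sketch tests one component in isolation by mocking its dependency (all names are invented for illustration):

```python
from unittest import mock

def price_with_tax(order, tax_service):
    # component under test: depends on an external tax service
    return order["amount"] * (1 + tax_service.rate(order["country"]))

tax_service = mock.Mock()
tax_service.rate.return_value = 0.25       # the real service is not deployed here
assert price_with_tax({"amount": 100, "country": "CH"}, tax_service) == 125.0
tax_service.rate.assert_called_once_with("CH")
```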
Types of Release
• Big Bang/Plunge: Everyone switches over to the new release after some down-time (with high hopes and no way to go back)
• Blue/Green: Everyone switches over to the new release at once and can revert back in case of problems
• Shadow: The release is deployed but not used (dark launches test the release with real user input without revealing the output)
• Pilot: The deployed release is evaluated by a few users for a limited time

How do we make different types of releases? The original, oldest type of release is the Big Bang release, where we do it in one shot: every user goes to the new version. Deployment takes some effort and produces some downtime and disruption, because the release requires first switching off the existing system before the new one can replace it. Hopefully the new one will work. It must work, as there is no way to go back in case it doesn't. The Big Bang release is a high-risk proposition, which you want to avoid. If your architecture has good deployability, you can support other types of releases. A blue/green release still involves switching as an atomic step: every user will go to the new version. But in case there are problems, you still have the old software deployment around and you can easily switch back and forth between the two. This already gives you a very important advantage: the ability to undo a failed release. It is, however, more expensive, because you need both the blue system and the green system, so the number of virtual machines that you need doubles. The old ones should be kept immutable. Once the new ones are ready, you switch. If everything works fine, then after a certain time you can stop and decommission the old one. If the new version doesn't work, you can go back. (A minimal sketch of the blue/green switch follows.) As opposed to the blue/green release, where all users will see either the new or the old system, if you do a shadow release, you make the release but you don't use it. Why do we go to the trouble of building the system, testing and releasing it, and then not give users access to it? What you really want to do is connect the new release to the input of the users, so that the system actually undergoes the production workload, but you don't show its output to the users yet. Users still see the output from the old version, since the new one is still kept in the shadow. If you do so, you have a chance to observe and compare the output. If the output is different and you don't expect it to be different, maybe you just detected a problem. Shadow releases are a subtle way to let real users test your system with the real production workload, but without affecting the users in case something goes wrong. Once you are confident that the behavior of the new version doesn't have any regressions, you can bring it out of the shadow and make its output visible to the users as well.
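A minimal sketch of the blue/green switch as code: two complete deployments exist at once, and a router decides atomically which one the users see. The environments are modeled as plain functions for illustration:

```python
class BlueGreenRouter:
    def __init__(self, blue, green):
        self.envs = {"blue": blue, "green": green}
        self.live = "blue"                       # users currently see blue

    def switch(self):
        # atomic cut-over; calling switch() again undoes a failed release
        self.live = "green" if self.live == "blue" else "blue"

    def handle(self, request):
        return self.envs[self.live](request)

router = BlueGreenRouter(blue=lambda r: "v1: " + r, green=lambda r: "v2: " + r)
router.switch()                 # go live with the new release
print(router.handle("hello"))   # v2: hello
```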
Another technique is the pilot release. When you make a change, you deploy it, but you only make it visible to a small set of pioneer users. These users are going to give feedback, and then you can decide whether it is worth extending access to this new feature to the rest of the user population. Here the main challenge is: how do you select the users that are willing to try the new version? You need a set of trusted users who provide constructive criticism and – in case something goes wrong – will not blame you. One way is to call for volunteers. Have you ever started an app and seen a pop-up asking you to switch over to a preview of the new version? You could have been randomly selected. Or maybe they observe your interactions (or your poor reviews) and see that you could potentially benefit from the new features, so chances are you're willing to give them a try.
Types of Release
• Gradual Phase-in: The deployed release is used by a gradually increasing number of users. Requires support for multiple active versions in production.
• Canary: A gradual phase-in release with very few initial users who do not mind dealing with failures
• A/B Testing: Roll out multiple experimental versions to different subsets of users to pick the best version to eventually be used by everyone
There are also releases in which the switch between the old and the new version happens gradually. To do so, you run multiple versions of your system in production and gradually spill over more and more users to the new version. A pilot involves just a small set of users, and you're not yet sure whether you will go ahead with every user. After a successful pilot, you can gradually get more and more users on board. This way you can increase your confidence that the new version is actually stable and powerful enough. Canary is a similar term for a pilot (it comes from the canary in the coal mine). In this case you make a canary release because you don't trust its quality yet, but you make it accessible to users who are aware of this. They don't mind a few crashes because they know they're working with an unstable system. A/B testing is also related to the concept of having multiple active versions in production. It is used when you do not know whether a change is actually an improvement or not. You make multiple experimental versions available to different users. Based on some metric (typically some kind of monetary metric), you compare which version outperforms the others. For example, which color of the links makes people click on the ads so that we can make more money? That's one of the original use cases for this technique. To support these advanced types of releases, the infrastructure for your continuous
integration does not only contain all the quality assurance phases that we have discussed; it actually becomes much more complicated, because you have to collect feedback metrics measured from the user behavior in production. Then you need to feed them into some statistical inference to see whether your experiment validates the hypothesis that version A is better than version B or not. You can afford this when you have millions of users coming to your system every day: very quickly you get the results necessary to support your experiments. If you just have 10 users, it's very difficult to draw any conclusions by passively observing their behavior. Related to this is the idea of making releases with partially implemented functionality. For example, you show buttons on the screen, but you don't implement the corresponding behavior of the application. If users never click on these buttons, it means that the feature is not wanted by your user population. What's the point of investing in the implementation of a feature that nobody is interested in clicking on? If it turns out that people are actually clicking on the button, then you show them a little message that says: "Please sign up to be notified when we release this new feature". If they sign up, they are really committed, and you have a chance to deliver a rough implementation to them as quickly as possible. Now you know that some users are actually interested in it, and you can justify your investment into building it, as opposed to building it in the hope that users will come. You will be surprised by what you can learn from this: it can completely change the way you prioritize your backlog.
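A minimal sketch of the routing mechanism shared by gradual phase-in, canary and A/B releases: deterministically assign each user to a version based on a hash of their identifier, and grow the rollout percentage over time. This bucketing scheme is one common choice, not the only one:

```python
import hashlib

def bucket(user_id):
    # stable value in [0, 100): the same user always lands in the same bucket
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100

def version_for(user_id, rollout_percent):
    return "new" if bucket(user_id) < rollout_percent else "old"

# Start at 5% (canary), then raise the knob as confidence grows.
print(version_for("alice", 5), version_for("alice", 50))
```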
[Figure: Gradual Phase-In – clients routed to each version over time. Version 2 is launched and gradually takes over until Version 1 is retired; Version 3 is launched but recalled; Version 4 gradually replaces Version 2 and is itself retired after Version 5 is launched.]
Here is a visualization of the gradual phase-in type of release. On the vertical axis we have the number of users of each version, and horizontally we have time. At the beginning we have version 1 in production, and every user is using it. Then we launch the new version, but we don't do a Big Bang release where every user switches over at the same time. Instead, we have a gradual phase-in where initially only a small number of users have access. As we observe the logs, we can conclude that it doesn't crash and users seem to be happy with it. There are not a lot of bug reports, so we gradually move over more and more users, until version 1 gets retired: nobody is using it anymore, so we might as well switch it off.
These transitions can take some time, especially if you leave users the choice whether to upgrade or not. While we're still in that first transition, we attempt a pilot for version 3. We launch it, but very soon discover that it's not a good one. There is no acceptance by the users that have a chance to look at it, so you quietly pull it back and switch every user back to the previous version 2. Since version 3 was only seen by a small minority of the user population, the failed pilot is not disruptive. Version 4 happily works better, so gradually every user will be migrated to it. Depending on how long each transition takes, you might end up, at a given point in time, having to maintain multiple versions of your system in production at the same time. This is more expensive than a simple instantaneous switch. You have to ask yourself how many of those you can afford to keep running, and this will naturally give you a limit on how often you can make new releases, depending on how long such transitions take.
Essential Continuous Engineering Practices (Jez Humble and David Farley)
• Automate everything
• Push changes regularly
• Commit only if locally green
• Watch the build until it's green
• Never go home on a broken build
• Don't push on a broken build
• Never comment out failing tests
• Always be prepared to revert
• Time-box fixes before reverting
• Direct changes in production are forbidden
Now that we have seen both how to build and how to release, let's conclude the continuous delivery discussion with a number of best practices that you may want to adopt into your development toolbox from now on. Make sure that the process you use to write and test your software is automated: don't try to do things manually, because you wouldn't do them anyway, and if you did, you would be slow. Push your changes regularly. That means at least once a day: you attend the stand-up meeting in the morning, you work, and before you leave you push. If the build is broken, you shouldn't push more stuff on it to break it even more. Before you push, make sure that your local build is green. Never push something that doesn't compile, because you would break the build for everybody else. After you push your change, don't go away: watch the build until it is green. And never go home on a broken build; if you break it, it's your responsibility to fix it. In the worst case, you just have to undo the work of the day, and tomorrow it will go better. A big temptation when fixing a broken build is to remove failing tests. You can always blame the tests and not your code: my code is good and these tests are obsolete, so I might as well comment them out. If you do that, you reduce the coverage and of course you get a green build, but it will be a fake green build (just like when someone recommended slowing down testing to improve the pandemic statistics, but I digress). If the tests are failing and you blame the tests, then fix the tests. Don't just skip them. How long should you stay to fix a red build? Some developers never go home. You should time-box it: if there is a problem with my latest change and I cannot fix it within one hour, then I revert the changes. The work of one day has disappeared, but I can go home to rest. There is always the temptation to fix problems directly in production. You forgot a
comma. You open the config file and edit it, and the lights are on again. But this is not enough, because where did the missing comma come from? After rescuing the day, don't forget to fix it at the source. Then trust your tools to smoothly produce a new release and push it to production, and everything is fixed. It might take a little bit longer, but the problem will be fixed for good and will not come back to haunt you the next time. If you choose to adopt this distilled wisdom, then you are actually practicing continuous software engineering.
Tools

Continuous Integration:
• Cruise Control
• Jenkins
• Travis
• UrbanCode
• GoCD
• Packer
• GitHub Actions

Container Technology:
• chroot
• Docker
• Docker Compose
• Docker Swarm
• Kubernetes
• Mesos
Build Pipeline Demo Videos

Explain how to set up an automatic software production pipeline for your tool:
• How do you build a component?
• How do you test a component?
• How do you plan integration testing in your pipeline?
• How do you plan capacity and acceptance testing in your pipeline?
• How do you package a new release?
• Which types of release are supported by your tool? Demo at least one type.
Build Pipeline Demo Videos

Deployment:
• How do you deploy an application?
• Can you automatically deploy the application if the tests are green?

Failure Scenarios:
• What happens if something goes wrong in your build?
• Does the tool provide reports or logs after each phase?
• Can you show different examples of a broken build and how to fix them?
Container Orchestration Demo Videos
• What is a cluster? How do you create a cluster?
• How do you configure Kubernetes?
• How do you create a deployment?
• How do you create nodes and pods?
• How do you open an application to the users?
• How do you create multiple instances of an application?
• Which types of release are supported by Kubernetes? Show at least one.
Today's XKCD is about the benefits of automation. Should you just get something done, or invest in automating it? When you expect that you will do it very frequently, it may be worth investing some time and effort into automating it, so that eventually you will have more free time. This is our ultimate goal too: let your automated build pipeline work on your release. In theory, automation takes over and your life is better. In reality, it turns out that the software running the software production pipeline needs to be written, tested and debugged, or it needs to be selected, deployed, configured, and then tested and debugged. As usual, the development never ends: instead of having more free time, you don't have any time left for the original task. But what if you could automate the writing of software production pipelines as well? Before we jump into infinite recursion, remember the lessons of how to bootstrap programming language compilers and see if you can apply them to this other problem.
[Figure: Virtualization – Virtual Machine 1 and Virtual Machine 2 running on Virtualization Software (Hypervisor), which runs on the Real Physical Hardware Machine.]

The enabling technology for deployability is virtualization. More recently, virtualization evolved or specialized into containers. Virtualization is about isolating your code from the actual environment in which it runs, like when you deploy it up in the Cloud.
[Figure: Virtualization – five applications (App 1 to A5) running on three guest operating systems (OS 1 to OS 3), each inside its own virtual machine (VM 1 to VM 3), on top of a Hypervisor running directly on the Hardware.]

Hardware made into Software. Hypervisor: an intermediate software layer that decouples the software stack above from the underlying hardware. Virtualization can be applied to different hardware resources (CPU, memory, storage, and network).
At the bottom we have the actual hardware. On top, we place a layer of isolation: the hypervisor. On top of it, we obtain virtual machines on which an operating system can run your applications. From the perspective of your application, the virtual machines are like hardware, but they're actually simulated or emulated by this piece of software, itself running on the real hardware. This isolation layer makes it possible to deploy any application on any physical runtime environment, since the application is decoupled from it. Clearly, there is a cost. The larger the gap between the actual hardware and the one emulated by the virtual machine, the more expensive it will be to run the application with a sufficiently good level of performance. Even if the virtual CPU and the actual CPU are the same, the other important goal of virtualization (which is also shared with containerization) is to isolate the various applications and their operating systems from each other. If you run an operating system directly on the hardware, all the applications or processes in this operating system are both aware of one another and can somehow interfere with one another. This may be a feature, since they can use, for example, the shared memory or shared file system to exchange data. This may be a security issue, if the applications run on behalf of different users, who may be business competitors. Running these conflicting applications on separate virtual machines guarantees isolation between them, almost as if they were deployed on physically distinct machines. Other advantages of virtualization include so-called elasticity: how much of the underlying physical hardware should be dynamically allocated to each of the virtual machines? Depending on the workload of each application, the virtual hardware can be made more or less powerful. In other words, as we've seen many times in software architecture, adding an intermediate software layer helps to decouple what is above from what is underneath. Let's see different examples of how we can play this layering game.
[Figure: hosted virtualization – Applications on a Guest Operating System inside a Virtual Machine, on a Hypervisor that itself runs as an application on the Host Operating System, which runs on the Hardware.]

Hosted hypervisors require an operating system to run. Example: VirtualBox, VMWare Workstation.
In this scenario, we have two operating systems running: the host and the guest. The host runs directly on the actual physical hardware. From its perspective, the hypervisor is an application. Within this application we can run simulated hardware on which we are going to have the guest operating system that will run our own applications. This was one of the early virtualization approaches, and it is also what you still do today when you run virtual machines on your computer. For example, you have the MacOS operating system running on the Apple hardware; on top of it you can run Linux, Windows, or whatever other operating system. This is great because applications which depend on Windows can still run on the Mac. Applications are not aware of the virtualization, and it is convenient for users who cannot afford to buy separate computers to run different operating systems. For some reason, the opposite scenario, running MacOS in a virtual machine on top of a different OS, is not so common. Here we also have many layers and the performance suffers. Let's try to optimize this architecture.
[Figure: native virtualization – Applications and Operating System inside a Virtual Machine, on a Hypervisor running directly on the Hardware.]

High-performance, native hypervisors run directly on the hardware. Example: original IBM Mainframe z/VM, Xen, VMWare ESX.
What if we blend the bottom layers? If we decide that the only purpose of this particular piece of hardware is to host virtual machines, then we do not need a fully fledged operating system on the hardware. We can delegate to the hypervisor the role of abstracting the hardware and offering virtual machines on which to run multiple (guest) operating systems. This reduces the redundancy in the stack, and the hypervisor can be optimized for this specific purpose.
[Figure: lightweight virtualization with containers – Applications inside Containers, running directly on the Host Operating System, which runs on the Hardware.]

Containers are lightweight solutions to isolate applications that can share the same operating system. Example: Docker, rkt, OpenVZ, Hyper-V.
The other approach comes from the opposite side. There are lots of layers between the application and the hardware. What if we can assume that all applications will run on the same guest operating system, and this happens to be the same as the host? Then the goal becomes to isolate them as if they were running in separate virtual machines, but without the overhead. This is why containers are sometimes called "lightweight virtualization" within the same host operating system. If you compare with the previous figures, the applications now run directly on the host operating system, but application and operating system are separated by the container. This layer provides the needed isolation without the overhead of having a hypervisor underneath.
|                     | Virtual Machine       | Container                            |
|---------------------|-----------------------|--------------------------------------|
| Footprint           | GB                    | MB                                   |
| Boot Time           | Minutes               | Sub-Second                           |
| Guest OS            | Different in each VM  | Uniform for all containers (Linux)   |
| Deployed Components | Multiple, Dynamic     | Single, Fixed                        |
| Isolation           | Hardware              | Namespace OS Sandboxing              |
| Persistence         | Stateful              | Stateless                            |
| Overhead            | High (with Emulation) | Low (Native)                         |
| Baking Time         | Slow                  | Fast                                 |
So which one is better? Containers or virtual machines (VMs)? In terms of resources, the term lightweight virtualization is justified for containers. Virtual machines need to simulate a whole hardware environment in which to run a full operating system: they consume gigabytes of memory. The footprint of containers depends mostly on what you deploy inside, since the container itself adds little overhead. How long does it take to start a virtual machine? You have to boot the emulated hardware, start the operating system, and finally wait for your application to start. Depending on the operating system and application, this may be a relatively slow process. If you have a container, the operating system is already running; you just have to initialize the container and start the application. This tends to be much faster. What about security? What about isolation? Since the virtual machine has full control over the hardware running a dedicated hypervisor, it can take advantage of specific processor features to achieve isolation. With containers, it is questionable whether two applications deployed in separate containers sharing the same operating system are really as isolated. Namespacing and other sandboxing techniques pretend that processes running on the same operating system are inside separate containers and therefore should not detect the presence of each other nor interfere with each other. Ultimately you get what you pay for. Baking refers to the process of building the images, which can contain a whole operating system, including the whole application and its data. This is a slow process when building an image for a virtual machine. With a container it is relatively faster, because the image simply needs to store fewer bits.
[Figure: containers inside VMs – Applications in Containers on an Operating System inside a Virtual Machine, on a Hypervisor on the Hardware.]

Containers and VMs can be combined to get the best of both worlds (however, performance will suffer compared to raw iron). Example: VMWare vSphere integrated containers.
Good ideas can be combined with each other. Typically one needs to pay a Cloud provider to rent virtual machines. If you choose an infrastructure-as-a-service provider, you get billed for the time during which your virtual machine is running. The more applications you run, the more expensive it gets, because each should be deployed in a separate VM to keep it isolated. Still, within a VM you can run whatever operating system you want. What if you install Docker inside a virtual machine? Then each application can be deployed within its own container, and you just need to pay for one virtual machine, which needs to be allocated enough capacity to host all containers. Also in this case, the operating system can be optimized for the specific purpose of hosting containers.
[Figure: Images and Snapshots – virtual machines (VM 1 to VM 3) on a Hypervisor on Hardware, connected to an Image Repository and a Snapshot Repository.]

Virtual Machines or Containers are booted using a preconfigured image selected from a repository. Snapshots or checkpoints of the entire virtual machine state or the container file system can be saved, backed up and restored.

What are these images in which software component releases are packaged? While simulating or virtualizing hardware, it becomes possible for the hypervisor to take a snapshot of the state of execution of the virtual machine. This includes a copy of the whole memory and a copy of the processor state. This snapshot can be written into a file; by reading it, it is possible to restore the state of the virtual machine, which can continue running from a known state. This uses the same techniques that your laptop uses to save a snapshot of your system state just before it runs out of battery: the laptop takes all the volatile information and writes it out to disk, so that when you plug it back in and switch it on, it will restore the running applications and operating system from the image. We can benefit from this technique to provide an image from which we can start our application in a known state. This image is prepared (during the image baking process) by creating a virtual machine, installing the operating system as well as all of the applications, starting them, and then freezing it. Later we can copy it and deploy it where we want to run it: it is just a matter of initializing the virtual machine from the given image, since all the necessary software is already installed and ready to go. A similar process is used for container images as well.
[Figure: Virtual Machine Migration – VM 3 moves from one Hypervisor/Hardware host to another.]

Virtual Machines can be moved between different physical hosts for load balancing, workload consolidation and planned hardware maintenance. Some hypervisors support live migration, during which the VM does not need to be stopped before it is restarted elsewhere.
Virtual machines also support live migration from one hosting environment to another. In case the hardware fails, we can recover the virtual machine from its latest snapshot and keep running it elsewhere. If the hardware needs to be upgraded or changed, we can temporarily move the virtual machines to different physical hosts. Live migration is something that used to be unthinkable and impossible to achieve with physical hardware. In other words, thanks to virtualization, the deployment decisions you make can be easily changed even after the system has been started. As we will see, this can also help with elastic scalability, load balancing, or workload consolidation.
[Figure: Inverted Hypervisor – on the left, the usual stack: Applications (App 1 to App 3) on Guest Operating Systems (OS 1 to OS 3) in Virtual Machines (VM 1 to VM 3) over one Hypervisor on a single Hardware host; the hypervisor supports multiple isolated virtual machines sharing the same hardware. On the right, an Inverted Hypervisor runs one virtual machine over multiple distributed hosts (HW 1 to HW 3).]
What is the purpose of the hypervisor? The hypervisor multiplexes multiple virtual machines over one shared physical machine. This helps with using the hardware as efficiently as possible. The single physical processor will be mapped to multiple virtual processors that are going to run over it. If each application is not always fully utilizing its processor, it would be inefficient to allocate a full physical processor to it. By sharing the hardware, we can make sure it gets fully utilized all the time by multiple applications. If we flip the structure, we have the inverted hypervisor. Here we can take one application and its operating system and run it across multiple physical hardware resources. What if the hardware that you have is not powerful enough for one application? What if there is not enough memory? To reach the capacity needed to run your application, you can compose multiple pieces of hardware; thanks to the inverted hypervisor, they will look like a single, very large virtual machine which your application can take advantage of. It is the inverted hypervisor's responsibility to aggregate multiple physical hardware resources and partition the large application's workload among them.
Virtual Hardware = Software

Deployment Configuration and Resource Allocation (CPU, RAM, Disk, Network) becomes part of the software code (and should be managed accordingly, with versioning, change management and testing). Different, independent and isolated deployment configurations should be used at different stages of the software production pipeline (never mix production and testing environments).

Virtual hardware becomes software. This is the second technological advance that brought us into the age of continuity. The first one was that the production of software is software too; now we also realize that the hardware itself becomes software. All of the configuration details about the container that you need in order to run the software in production become a piece of software. They are specified in a corresponding language, and then you can apply version control and proper change management processes. You can even test the configuration to make sure it doesn't get modified incorrectly. We saw that in the software production pipeline we have to specify configurations for the integration stage, the capacity stage, the acceptance stage and the production stage. This configuration is not only the configuration of your software application, but also the configuration of the virtual hardware environment that you will use to run the system. It may change as your software evolves (adding features requires more processing power). It may also change while a given version is running (adding more users requires more memory). Resource allocation becomes as easy as pushing a new commit of a configuration file: let the Cloud provider scramble to install the hardware capacity that you allocated with a few keystrokes. This configuration information becomes part of your software. The code you write to implement your component needs to include both the tests that will be run during the pipeline and the specification of the virtual execution environment in which it will be deployed (a minimal sketch follows). This lecture on deployability concludes our transition from design time to run time. In the next lectures we will discuss how to design architectures which can deliver critical runtime qualities: scalability, availability and flexibility.
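As a minimal illustration, here is a resource-allocation configuration expressed as code, together with a validation check that the pipeline could run before applying it. The schema is invented for illustration; real systems would use the configuration language of their deployment tool:

```python
# Resource allocation kept as versioned, testable data rather than
# manual console clicks: changing capacity = editing, reviewing and
# committing this file.
DEPLOYMENT = {
    "integration": {"cpus": 2, "ram_gb": 4,  "replicas": 1},
    "capacity":    {"cpus": 8, "ram_gb": 32, "replicas": 4},
    "production":  {"cpus": 8, "ram_gb": 32, "replicas": 8},
}

def validate(config):
    # one of the tests run by the pipeline before the config is applied
    assert set(config) >= {"integration", "production"}, "missing stage"
    for stage, r in config.items():
        assert r["cpus"] > 0 and r["ram_gb"] > 0 and r["replicas"] > 0, stage

validate(DEPLOYMENT)
```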
References
• Jez Humble, David Farley. Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation. Addison-Wesley, 2010, ISBN 978-0-321-60191-9
• Len Bass, Ingo Weber, Liming Zhu. DevOps: A Software Architect's Perspective. 2015, ISBN 978-0134049847
• Michael T. Nygard. Release It! Design and Deploy Production-Ready Software. 2018, ISBN 978-1-68050-239-8
• Gene Kim, Kevin Behr, George Spafford. The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win. 2013, ISBN 978-0988262591
• Mendel Rosenblum. The Reincarnation of Virtual Machines. ACM Queue, Volume 2, Issue 5, August 31, 2004
• The Twelve-Factor App
• Open Container Initiative
• XKCD 1319: Automation
491
Software Architecture
Scalability
10
Contents • Scalability: Workloads and Resources • Scale up or Scale out? • Scaling Dimensions: Number of Clients (Workload), Input Size, State Size, Number of Dependencies • Location Transparency: Directory and Dependency Injection • Scalability Patterns: Scatter/Gather, Master/Worker, Load Balancing and Sharding
492
Scalability: this is the first of the quality attributes beyond raw performance that have to do with our system being in production at runtime. The question is: how do we scale the architecture? The term scalability is often tossed around in discussions, but by itself it doesn't mean anything. You have to be more precise. You have to put it into context: in which dimension do you want to scale your architecture? You will produce a completely different design if your goal is to scale the number of concurrent clients that you want to support, or to scale your system to produce or consume a large amount of data, or to scale your architecture to be deployed over millions of different devices. First we will define what scalability is and what challenges are involved in designing systems that can scale. Then we will see what kind of solutions, what kind of patterns we can introduce in our architecture so that we can claim that our system is going to scale. We start by defining scalability in terms of the workload, and in terms of making our system fully use a large number of resources. We will see the difference between scalability techniques that involve decentralization and techniques that scale up towards larger and larger centralized systems. What are the basic mechanisms at the foundation of architectures that can scale? We need to enable the dynamic discovery of components, their identity and location: who is running where. Then we will look at more detailed scalability patterns: master/worker, load balancing and sharding.
493
Scalability and Workload
[Charts: response time as a function of the workload; throughput (req/s) as a function of the workload]
Workload = traffic, number of clients or their number of concurrent requests. Ideal system: response time not affected by the workload; throughput grows proportionally to the workload. Real systems will show this behavior up to their capacity limit.
To observe the scalability of our system, we first need to measure its performance. Performance can be measured in terms of latency, with the response time: to perform a certain amount of work, the system takes some time. We can also look at performance in terms of throughput: how much work over time can our system perform? In these charts we see that the performance of the system is a function of its workload. The busier your system is, the longer it will take to process the work. We start from a situation in which there is no impact of the workload on the response time. As the workload increases, the response time stays flat: it always takes the same amount of time, no matter how busy the system is. We can also see that the throughput grows together with the workload. The workload can also be defined in terms of requests per second. We can observe that there is a growing number of clients sending traffic to our system, but the system can absorb the traffic. The more work you send, the busier the system gets. But since there is no impact on the response time, we can say that it scales, and it will do so until it reaches its capacity limit. That's where it stops scaling with this behavior, and we say that it has reached a saturation point. The system is saturated because even if you keep increasing the workload (as shown with the dotted line) and send more and more requests, the system cannot cope with them. We recognize the saturation phase because the throughput stays flat and the response time starts to increase. Let's not confuse the throughput with the response time. When the system scales, the response time remains flat and the throughput increases proportionally to the workload. When the system saturates, the throughput flattens and the response time starts to increase. This can be intuitive to understand: imagine there is a queue of requests somewhere. Why would requests stay longer and longer in the queue? Because the processing capacity is limited and only so many requests can be processed per second.
494
If you further increase the workload, you go from saturation to overload. Here we see that the system is completely overwhelmed. As a consequence, its behavior becomes uncontrolled, its queues get clogged, the throughput goes down and the response time goes beyond what the clients will accept. That's when you start to see disconnections. Clients start to get timeouts. Clients will react by resending their requests, further growing the length of your queue, and overall the system becomes unstable. To summarize: the workload your system is subject to can change over time. Depending on the level of the traffic, the number of clients, the number of concurrent requests from the clients, we can observe the performance of the system in relation to the workload. Linear scalability occurs when the response time is not affected by the workload and the throughput grows proportionally. The more work you give, the busier the system will be, but the system can keep up. Infinite scalability does not exist, as every real system has a limited capacity. The linear scalability behavior will occur only within a certain workload range. There will be a point at which there is just too much work: there is too much traffic, the system has reached its saturation point, and if the workload keeps growing it becomes overloaded. So this is true for every scalable architecture: there is always going to be a capacity limit. The question is how you find the limit and how you push it a little bit higher.
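To make the three phases concrete, here is a toy model – my own illustration, not from the lecture – of how throughput and response time behave as the offered workload crosses the capacity limit:

# Toy model of the scalability / saturation / overload phases (illustrative only).
def throughput(workload_rps: float, capacity_rps: float) -> float:
    # While scaling, throughput tracks the workload; at saturation it flattens.
    return min(workload_rps, capacity_rps)

def response_time(workload_rps: float, capacity_rps: float,
                  service_time_s: float = 0.05) -> float:
    if workload_rps < capacity_rps:
        return service_time_s    # flat response time: the system scales
    # Beyond capacity the queue grows without bound: waiting time dominates
    # and eventually clients time out (overload).
    return float("inf")

for load in (100, 500, 900, 1000, 1500):
    print(load, throughput(load, 1000), response_time(load, 1000))
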
495
Scalability and Workload: Centralized
[Charts: response time as a function of the workload; diagram: many clients sending work to one central component]
Where do these curves come from? There is a growing number of clients – this is where the work comes from – which call the component in the middle. In your architecture model you just draw two boxes, but then you know that one of the boxes will be instantiated many times, deployed all over the world, and at runtime there will be many instances that depend on this central component to perform some work, with some expectations about its performance. In this particular scenario, the limit to scalability comes from the fact that this is a centralized architecture, in which there is one element that has to service requests coming from many elements. There is a fundamental imbalance: we have one element in the center working for all the others, which becomes increasingly busy the more clients you attach to it. We can also observe its behavior by introducing a message queue connector. When the system is working within its capacity, it can consume the messages produced by the clients without them stacking up in the queue. In this producer/consumer scenario, the producer and consumer work at the same rate: every message to be processed will take the same time, as we can see from the response time. However, if we add more clients, there is more work to do: there will be a point at which the system falls behind and the queue doesn't get emptied anymore. In the worst-case scenario, the queue itself fills up beyond its capacity and has to start preventing clients from adding more messages. In a centralized architecture the workload of a single component comes from a variable number of clients. Their number can grow or change dynamically. There is only one component that is going to service them. We can say that this component becomes the bottleneck: if it is not fast enough, the architecture will not scale beyond the capacity of its centralized components.
496
How to scale?
• Work faster
• Work less
• Work later
• Get help
The question that I would like to ask is: how would you solve this scenario so that we can avoid completely filling up the queue? Clustering over many servers, distributing the workload on multiple machines: good ideas. So let's spread out the load over multiple servers: if one server is busy, we find another one. Another interesting idea is to put a time limit on the messages: once we enter a message in the queue, we expect that this message will be processed within a certain time window, and if it takes too long, we drop expired messages. We can empty the queue this way without doing any work. Another way is to make the server more powerful and, in general, to balance the processing power between producers and consumers. The queue helps to even out temporary imbalances, but overall you need to make sure the queue length does not grow unbounded. Caching is also a great idea which helps to scale. It basically makes it unnecessary to send a message into the queue, since the client already knows the result. Avoid recomputing already known results: invest in extra storage for the cache so that you save CPU cycles. The problem of scaling comes from having to deal with a workload peak. Such peaks can be predictable or totally unpredictable. Once your architecture has to absorb such a peak, you have to figure out how to design it by taking advantage of these suggestions. First, a very abstract and very simple principle that can help you: whatever you're trying to do, just do it faster. If you do it faster, it takes less time. That means that in the same time you can process more requests. This is where your algorithm optimization skills come into play. Reducing the complexity of an algorithm makes your system more efficient. If you cannot squeeze the algorithmic complexity any further, you can throw money at the problem: if you can afford it, you can buy a faster computer. Or you just wait a couple of years and then Moore's law is going to come to your rescue. Eventually things that used to be impossible or too expensive become affordable and easy to do, thanks to faster and cheaper hardware.
497
The second idea is also simple: instead of performing the whole computation, we compute an approximate result. This should also take less time; the result may not be as good, but at least clients do not time out. Another variation is not to return the freshest result, but to add a cache. Caching helps to reduce the workload because you remember historical results and look them up before processing an incoming request. Or you can always push back and say: I'm not going to do this for you, you do it yourself. This goes even beyond caching: in this case we are actually offloading the work. We are not doing it here, because we have limited capacity; the client, which has plenty of processing power, does it itself. If you think about how powerful the latest smartphones are, you might as well offload the work onto so-called edge devices. Parts of the components that used to require a powerful server to execute get to run elsewhere: this way the server in the center becomes more scalable, because the work is done on the mobile devices at the edge. If you are in an emergency situation, you can sometimes control which features you are going to offer and disable unimportant ones. You partially degrade the user experience, but you can support a much larger number of clients. Users will understand: you are undergoing a huge load at the moment, and here is a very simple version of the application that still covers 80% of the use cases, with the expensive ones cut off. Come back later, after the peak has passed, and we will reactivate them. Another technique that you can use, if you know that the peak is short-lived, is to buffer the work and then empty the queue after the peak is gone. If you survive the worst, you can catch up later. You don't necessarily need to perform the work right now if you don't have real-time constraints. If there is no deadline, you can absorb the peak when you reach a time in which you're not so busy. If all of these ideas are not enough, there is still another one, which is also what you suggested right away: if you cannot keep up by yourself – there is always a limit on how much you can do – you will need to ask your friends for some help. In this case, what you want is to be able to create multiple servers, add multiple cores, start additional parallel threads. Will introducing parallelism into your architecture help you to scale? Is your software going to benefit from adding more computational resources?
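As a minimal sketch of the "work less" idea, here is an in-process cache using Python's standard functools; the computed function is an invented stand-in, and caching only pays off if the results remain valid over time:

# "Work less": avoid recomputing already known results (illustrative sketch).
from functools import lru_cache
import time

@lru_cache(maxsize=10_000)        # trade extra storage for saved CPU cycles
def expensive_computation(n: int) -> int:
    time.sleep(0.1)                # stand-in for heavy work or a busy backend call
    return n * n

expensive_computation(42)          # slow: computed once
expensive_computation(42)          # fast: served from the cache, no work done
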
498
Scalability and Resources: Decentralized
[Charts: speedup as a function of the allocated resources; diagram: one client in the center spreading its work across many server replicas]
Let's look at how parallelism affects scalability. In this scenario, when you want to spread out the work on a cluster, you reverse the relationship between the clients and the central server. Here we have one client that has to spread out its work across multiple servers. Don't be confused by the location of the boxes: in the previous picture, the clients were outside, sending work to the server in the center that was getting overloaded. With a decentralized architecture we invert these roles: we have one client in the center which, instead of talking to only one server, sends the work across a large set of servers. The boxes on the outside represent multiple copies of the server which can receive the work. If we look at it from the queue point of view, we are in a situation in which the client is overloading the server. The queue is getting full. There is too much work. So if you want to scale beyond the capacity of one consumer, you need to add more copies of the server. Working all together, they will be able to work in parallel, consume the queue and bring the system back into balance. This is what we call a system in which the bottleneck has been solved by replicating the server components consuming the work from the queue. Even if you have a single client that produces too much work for one server, you can absorb and handle it in a scalable way by adding more and more resources to process the work. Here we don't see performance as a function of the workload. Instead, we study how the performance behaves in terms of the amount of allocated resources: how many copies of the server do we need to process the work? We can still measure the response time, but the X axis has changed. More precisely, we fix the workload. If we had more resources – if instead of just one server we had a whole cluster, or if we went from a cluster to a data center, or from a data center to multiple regions all over the world – how would the increase in resources affect the performance?
499
Scalability and Resources
[Charts: response time and speedup as functions of the allocated resources]
For the same workload, will performance improve by adding more resources? Ideal system: linear speedup with infinite resources. Real systems will only benefit up to a limited amount of resources.
We have a scalable architecture if the response time decreases when we increase the resources available to process a given workload. With an ideal system, we achieve linear speedup: the speedup (the performance relative to a centralized, single-resource solution) improves linearly. The system scales as long as the speedup grows linearly when you increase the resources. For example, if you start from one server and then double the capacity of the system, you now have two servers: with a scalable architecture you would expect the response time to be half of the previous configuration. And if you double the resources again, you expect the response time to be four times lower than the original one. Can you keep doing this with an arbitrary amount of resources? Can you always double the number of servers, double the number of cores within a processor, and expect the performance to improve just like that? Also in this case there is a limit, after which you don't get any benefit. Sometimes it even gets worse: you buy one more server and the performance no longer improves. There is a point at which the speedup doesn't grow linearly anymore, but actually starts to become flat. This is the other limit of scalability: in terms of resources, with a fixed workload. Sometimes this is also called the law of diminishing returns. If you go from one to two CPU cores, you get a big performance improvement in absolute terms. But if you double again later, for example going from 20 to 40 resources, the improvement is not as big anymore. There will be a point at which it becomes too expensive to add even more resources for the very small improvement that you actually get.
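One classic way to make this limit concrete – my addition, the lecture only names the law of diminishing returns – is Amdahl's law: if a fraction s of the work is inherently serial, the speedup with n resources can never exceed 1/s, no matter how many resources you add.

# Amdahl's law as a sketch of diminishing returns: a serial fraction s
# bounds the achievable speedup at 1/s regardless of the resources n.
def speedup(n: int, s: float) -> float:
    return 1.0 / (s + (1.0 - s) / n)

for n in (1, 2, 4, 8, 16, 32, 1024):
    print(n, round(speedup(n, s=0.05), 2))
# With 5% serial work: 1.0, 1.9, 3.48, 5.93, 9.14, 12.55, 19.64 ...
# the speedup can never exceed 20x, however many servers you buy.
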
500
Centralized or Decentralized?
Centralized: Consistent – Single Point of Failure – Bottleneck – Client/Server
Decentralized: Hot Spot – Peer to Peer – Churn – Partial Failure
Since we have seen these two sides of scalability – a variable workload on a centralized system, and an increasing amount of resources in a decentralized system – how are you going to design your architecture? Do you choose a centralized or a decentralized one? Can you connect these concepts with their implications? In a centralized architecture some elements become a scalability bottleneck. Compared to other elements with a bigger capacity, the bottleneck is the weakest link along your chain, where the performance is lost: the first element whose capacity gets fully utilized. Similar to a bottleneck, a hot spot is the element most affected by a workload peak. Even in a decentralized system, when you have a hot spot, one element reaches its capacity limit sooner than the others and you need to devote more and more resources to it. Client/server, in its simplest form, implies a centralized architecture in which multiple clients share the same server. But you can also have a server architecture that is decentralized, with multiple replicas of the server. I would like to use the term to refer to an active client sending work to a passive server. This is the opposite of a peer-to-peer architecture, in which each element acts as client and server at the same time. This means that you have elements in your architecture which can both send and receive work to and from others, which is mostly found in decentralized architectures. Still, even large peer-to-peer systems use so-called "super peers" in order to scale to a large number of resources. If you want consistency in your state, obtaining it with a centralized architecture is much easier. If you have only one copy in one place, then by definition it is equal to itself, so it is easy to keep consistent. As soon as you introduce decentralization, e.g., with multiple replicas, you will copy the data to different places.
501
You can still try to keep it consistent, but it will be much more expensive to do so. So when you need to deliver consistency as a requirement, you tend to pick a centralized solution. When dealing with scalability requirements, you may need to introduce replication within a decentralized architecture: in this case, what is the impact on the consistency of the data? The decision also impacts possible failure modes. We will talk about failures in more detail when we get to the availability and reliability discussion. But clearly, if you have a centralized architecture, you have one element which becomes the single point of failure. With a centralized architecture, either everything works or nothing works. Partial failure instead happens with a decentralized architecture: to have a partial failure, it must be possible to split the system into more than one element, where some of those elements fail while others don't. This is a challenging situation to be in, especially when you have to recover your system: while only some of the elements are down, the rest of the system continues to work, so when you restart the failed ones they will need to be brought into a consistent state with the others. Churn is also typical of decentralization: the more elements are part of your system, the higher the probability that some of them will not be available. As your architecture scales to run across a very large number of resources, be prepared to deal with a continuous stream of disappearing, reappearing and lost devices, which will need to be continuously replaced with new ones just to keep the system running.
502
Scalability at Scale
Constraint: limit the number of edges into each node.
Let's turn the design of a scalable architecture into a graph problem. No matter whether you choose a centralized or decentralized design, you will hit a limit. It's typically a networking problem: the number of connections that we can establish into each element of our system is limited. Look at these graphs: what is the maximum number of connections N at each node? In the first case, there is no center: every element is connected to every other one. In the second graph, this number is no longer the same for every element: most of the elements have only one connection, but in the center, here comes the bottleneck. How many connections can the black node support? Maybe thousands, a high number, but still limited. This intrinsically limits how many components you can attach to the central one. So if you go for this solution – the centralized architecture – there will be a physical limit to the number of network connections that you can have towards all the clients that you put around your server. The limit may depend on the hardware, the operating system configuration, the networking protocols, but there will be a limit. What if you choose a decentralized solution? Does this limit disappear? Decentralized means that you remove the server from the picture and go peer to peer. This means that all the elements in your architecture will potentially connect to everyone else. What is the maximum number of connections that you can have now, in this fully connected graph? Exactly: the number is still N. Getting rid of the server doesn't really solve your problem, because now every node behaves like the original server, and every node is limited like the central server used to be. How to scale beyond this limit? You have to remove the assumption of establishing a fully connected graph. You have to work on your graph topology under the constraint that there is a limit on how many connections each node may have. You can take this idea of having a central element and apply it recursively (or
503
hierarchically): here we have the black central element, which is itself connected to the white central element of the central elements. So we have a tree of two levels. We call this a hierarchical architecture. Every element of the system can find a route through the tree to talk with all the other ones. There will be a lot of traffic towards the root of the tree: the next limiting factor will not be the number of connections of each node, but the bandwidth of each edge. One way to scale is to remove the root and have a hybrid topology, which is sometimes called a peer-to-peer system with super peers. Super peers are peers that are stable and perform better, and therefore we attach a local neighborhood of clients to them. The super peers themselves establish an overlay network, which can deliver more bandwidth than having to go through a single root node: we split the root and allow the black elements to connect directly among themselves. There is a third approach, not shown in the pictures, which is the epidemic or gossip-based design. Here each element connects to a random subset of the others, in a way that keeps the graph connected while minimizing the number of hops between the nodes. With high churn it's not easy to keep a tree topology in place, and a random graph is more resilient, although more difficult to visualize. I hope this idea helps you to design something that can grow beyond these fundamental limits. Removing the centralized element is not enough, since the connectivity bottleneck is still there for every element of the architecture. Instead, you have to be clever with the interconnection structure that you establish, and learn from the experience of peer-to-peer networks.
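A back-of-the-envelope sketch (my own illustration) of why the topology matters: the number of links each node must sustain in a star, a full mesh, and a k-ary tree of n elements:

# Links each node must sustain in different topologies of n elements.
def star_degrees(n: int) -> tuple[int, int]:
    return n - 1, 1            # the hub handles n-1 links; each leaf handles 1

def full_mesh_degree(n: int) -> int:
    return n - 1               # every node behaves like the hub of the star

def kary_tree_degree(k: int) -> int:
    return k + 1               # at most k children plus one parent, independent of n

n = 100_000
print(star_degrees(n))         # (99999, 1): the hub is the physical limit
print(full_mesh_degree(n))     # 99999 at *every* node: the limit is everywhere
print(kary_tree_degree(16))    # 17: bounded fan-out, at the price of multi-hop routes
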
504
Scale Up or Scale Out?
[Diagram: scale up grows a single node bigger; scale out places more copies of the same node side by side]
Let's now look at how to scale the nodes themselves. How can we increase the storage capacity? How can we get more powerful processing? There are two ways: either we grow the size of the center, or the architecture becomes more and more decentralized. Scaling up means that we start from something small and, as it grows, we keep it centralized. We keep it within the same element; we just make it bigger and bigger. For example, you buy your phone and you estimate: one terabyte of storage will be enough. What to do if it gets full? Buy a bigger phone, a phone with more space. The same holds for the memory and for processor speed and cores. Is your system too slow? Run it on a faster CPU with more cores. Also in this case, you reach a limit: the maximum available storage capacity is technologically limited, or its cost simply becomes too high. Once you reach the limit of scaling up, you can still scale out if you need to grow further. Just take multiple copies of the same thing and place them side by side. Even if the size of the individual disk is limited, we just use two or more disks instead of one; as a consequence, we have twice as much storage. One processor is not fast enough? If you can, parallelize your code to take advantage of multiple cores. Which is more expensive, scaling up or out? While the price of adding capacity to an existing node may become increasingly expensive, the price of adding more nodes grows linearly with the number of copies. Still, there is a coordination overhead: when you add more disks, you need to invest in the RAID controller; when you add more servers, you need to invest in a good load balancer. Scaling up and out are not alternatives; they are often used in combination. First you try to scale up. Once it becomes too expensive, you can still scale out.
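A toy cost model (all numbers invented) of why the two strategies get combined: the cost of ever-bigger machines tends to grow superlinearly, while commodity nodes cost roughly linearly plus a fixed coordination overhead:

# Toy cost model for scale up vs. scale out (all numbers are invented).
def scale_up_cost(capacity_units: int) -> float:
    # A single, ever-bigger machine gets disproportionately expensive.
    return 100.0 * capacity_units ** 1.8

def scale_out_cost(capacity_units: int) -> float:
    # Commodity nodes cost roughly linearly, plus a fixed coordination
    # overhead (RAID controller, load balancer, ...).
    return 100.0 * capacity_units + 2000.0

for units in (1, 2, 4, 8, 16, 64):
    print(units, round(scale_up_cost(units)), round(scale_out_cost(units)))
# Below the crossover point scaling up is cheaper; beyond it, scaling out
# wins, which is why the two strategies are used in combination.
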
505
Scaling Dimensions
• Number of Clients (Workload)
• Input Size
• State Size
• Number of Dependencies
So far we have been defining the challenges of scalability. Let's summarize them with a map, to understand the different solutions, because the problems that we face are not exactly the same. We need to scale our system because its workload grows. The workload of the system is generated by the clients that concurrently send requests into it. The number of clients is dynamic and unpredictable: the more successful your application is, the more clients you will need to serve. Even if you have only one client, it may send you a growing amount of data (larger requests, or more frequent requests). These two factors affect the workload independently. Another dimension is how large the amount of state you need to manage is. Some systems are stateless; these are easy to scale in terms of a large workload: in the extreme case you can start a server for each client. Stateful components are not so easy to scale, and special sharding techniques are used to grow the amount of state beyond the capacity of individual disks or storage elements. While we discussed the case of one server under the load of a growing number of clients, we can also look at the opposite perspective of one client which needs to work with many servers. Will this decentralized architecture scale in terms of the number of dependencies that this client can have? How to take full advantage of a growing amount of resources? How to share the load among a larger and larger pool of workers?
506
Scalability Patterns
• Number of Clients (Workload): Load Balancing
• Input Size: Master/Worker
• State Size: Sharding
• Number of Dependencies: Directory, Dependency Injection, Scatter/Gather
If you want to scale in terms of the number of clients – if this is your challenge, you're facing a workload peak – then load balancing comes to mind as a solution. If you can assume the requests of each client to be independent from one another, you can share the growing load by adding multiple servers and making sure that, on average, they're all busy in a similar way, so you have a fair allocation of the work. What if you don't have many clients, but just one client that sends you a huge amount of work, so that the input size grows beyond your capacity? Is it possible to partition this work into independent units? You introduce the master/worker architecture so that you can still take advantage of a growing amount of resources. If what you're trying to grow is the amount of storage, look into data sharding. In general, all of these solutions require scattering the work (or the data) and then gathering the results back together when it has been processed. To know where you have to send it, you have to discover which workers are available and where they are; we will see that this can be done in two different ways, with the directory or with the dependency injection pattern.
507
Directory
How to facilitate location transparency? Use a directory to find interface endpoints based on abstract descriptions. Clients avoid hard-coding knowledge about required interfaces as they look up their dependencies through a directory which knows how and where to find them.
Let's start from the foundations, so that we can build a system that can dynamically grow and scale. This foundation is about being able to discover where the components and the resources that we can count on to process the work are. Given the queue of incoming work, we want to be able to switch between different instances, replicas or alternative workers that execute the same component. To deliver the capacity we need to work at scale, we must be able to switch between them so that we can send the work where capacity is available. If we talk about scalability in terms of storage, in terms of the amount of state that you have to store, then the question is: where is this data going to be located? If you go back to the lecture about continuous delivery pipelines and deployability, there we can also use discovery as a tool to configure the system to use the right dependencies along the pipeline. A system that is in production will have some precise dependencies that cannot be mixed with the dependencies that you use for testing. Typically this is fundamental for what concerns the storage of your system's data. The state of the system in production is critical, whereas the one that you use for testing should be kept separate: if it gets corrupted, you can still regenerate it, while losing the production state is a major problem. You have to be able to configure the system so that, when it starts in a testing environment, it will not use the same dependencies that you use in production. If you think about different types of releases, when we do a blue/green release or a pilot, you need a way to bind clients to different versions of your implementation. When you make a new release, you may only want, say, 10% of the clients to be routed to the new version. In other words, when we talk about scalability, you typically have a homogeneous system in which the component implementations that you instantiate are all the same: you just have multiple replicas if you need more capacity. If you talk about the evolution of your system, then the system is not exactly homogeneous: you might have different versions, with different traffic being routed to different versions of the implementation. If you use discovery, you're trying to make components independent from each other
508
regarding their location. Even if most components have dependencies, they should not care where they are: this is called location transparency. You can use a component without knowing where it is, because when you need to talk to it, somebody will find it for you and put you in contact. That's the job of the directory, a solution for many discovery problems. If we want to facilitate location transparency – if we want to make it possible for a component to discover where another component is, so that for example it can send some work there to be processed in parallel – then we can use a directory to look up its location, to look up which component to talk to based on some requirements, performing some matchmaking between what the client needs and what the component provides. The goal of the directory is to know what, who and where: which components are available and where they are, and to perform the matchmaking between the components that have dependencies and the components that can satisfy those dependencies. This works because the clients do not need to know where the dependencies are: they can look them up through the directory at the latest possible moment.
509
Directory
[Diagram: the component's Interface registers its Interface Description with the Directory (1. Register); the Client looks up the interface in the Directory (2. Lookup) and then directly invokes the component (3. Invoke)]
Clients use the Directory to look up published interface descriptions that will enable them to perform the actual invocation of the component they depend on.
When a client is about to make a call, it needs to know where to find the component to be called: the directory can give its location. Here is how a directory works. We start from the provider side. There is a component offering an interface with a certain description. This description is registered with the directory to announce its availability. Registration is where we say: this is who we are, this is what we can do, and this is where we are. The directory remembers that. Later, at a certain point in time, a client comes and performs a lookup. The client asks: I'm looking for a component that matches my requirements. The requirements come into the directory, the directory goes over the set of registered interfaces, performs the matchmaking, and decides if there is a suitable interface that can satisfy the requirements. If the matchmaking is successful, the lookup response will contain the location of the implementation of that interface. Then the client can talk directly to the other component: the third step is the actual invocation. This is the actual message that goes between the client making the call and the server being called. The idea of the directory is that clients actively look for suitable implementations of their dependencies; the directory knows where to find them and tells the client about their location. Before we do the lookup, we know what we need, but we don't know whether it exists, nor where to find it. After a successful lookup, we know it exists and where to find it; then we can use it. This assumes that the directory performs a successful matchmaking. It's also possible to look something up and have the directory find nothing suitable, so a lookup may fail, or the result of the lookup may no longer work: the information that was registered may have become obsolete. Even with a successful match, there is still a risk that it was based on obsolete information. Only after you perform the actual invocation will you know if everything has worked.
510
If you have a whole architecture with many components and many dependencies among them, typically there will be one element of the architecture which plays the role of the directory. It is responsible for managing the location of the components: knowing which components are there and keeping track of where they are. This critical element will be used by all the clients to do their lookups. One important issue: we have a single directory for the whole system. As with every centralized design, is this directory going to become a single point of failure? If the directory doesn't answer the lookups, the clients are basically unable to perform their work, because they cannot find where to send their calls. The directory may also become a performance bottleneck: if all the client lookups happen at the same time, how long does it take to get an answer? Before a client can do the actual work, it is waiting for the directory to provide the information needed to find where to send its invocation. Lookups need to happen as fast as possible. So in which way can we improve the design to deal with this issue? We can apply some of the concepts that we discussed before to this concrete design problem. We have a slow, centralized directory: we could run multiple directory instances and load balance each client to a different replica. The good thing about this replication is that the data of the directory doesn't change very often: we start the system, the interfaces are registered, and then, if the system is not too dynamic, we can assume that the information in the directory is pretty static, so it's easy to make a replica. We can also try a radical, fully decentralized approach: each client keeps its own directory. We deploy these two elements (clients and their directory) closely together, so they run on the same host. The local directory lookup table can be synchronized through gossip: each client transfers all the changes to its neighbors. Once an interface wants to register, it has to find out where at least one of these copies is, in order to send information about its description there. Eventually this will be propagated around and reach the other clients. This is a nice idea if we want to completely remove the central element and have a copy of the directory running in each client. The downside is that it may take time before updates reach all the clients, because of the huge redundancy that we introduce in a system where everybody has their own independent copy. What if I ask you: do clients need to perform the lookup every time? In this sequence diagram we see a lookup before each invocation. Is this really necessary? If you want to avoid the cost of doing a lookup before every invocation, you can use a cache in the client. This is also a copy of the directory content, one which gets populated on demand. After clients send a lookup request to the central directory, they store a copy of the result in the cache. If they need to look up the same interface twice, the second time they already have the result locally. This is a way to scale the directory, because you do not send it so many requests: you only send them the first time. What if the cache becomes obsolete? Then the invocation will fail, because the client attempts to reach an obsolete endpoint: a location that is no longer there.
In this case, clients can look up again with the directory to refresh their cache. So this is an example of how we can scale the directory by caching the results. You can generalize this idea and apply it anywhere you have a client retrieving information from a component, where this information remains valid for a certain amount of time: you can avoid resending the same request if you can recycle the results from previous invocations.
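Here is a minimal, in-process sketch of the register/lookup/invoke protocol together with the client-side cache; all class names and endpoints are invented, and a real directory would be a separate, replicated network service:

# Minimal sketch of the Directory pattern (register / lookup / invoke),
# including a client-side lookup cache. All names are invented.
class Directory:
    def __init__(self):
        self._registrations: dict[str, str] = {}   # description -> endpoint

    def register(self, description: str, endpoint: str) -> None:
        # Only authorized providers should be allowed to do this.
        self._registrations[description] = endpoint

    def lookup(self, description: str) -> str:
        # Matchmaking is a plain exact match here; real directories may
        # match on syntax, semantics or business constraints.
        return self._registrations[description]    # KeyError = failed lookup

class Client:
    def __init__(self, directory: Directory):
        self._directory = directory
        self._cache: dict[str, str] = {}

    def invoke(self, description: str, request: str) -> str:
        if description not in self._cache:          # lookup only on cache miss
            self._cache[description] = self._directory.lookup(description)
        endpoint = self._cache[description]
        return f"sent {request!r} to {endpoint}"    # stand-in for the real call

directory = Directory()
directory.register("price-quote", "host-42:8080")   # 1. register
client = Client(directory)
print(client.invoke("price-quote", "LUG->ZRH"))      # 2. lookup + 3. invoke
print(client.invoke("price-quote", "ZRH->LUG"))      # cached: no second lookup
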
Let me now ask you a different question. We said that the directory helps the clients to locate the implementation of a certain interface: the directory knows where to find the components in the system, and the clients use the directory to find out where the components are. However, the directory itself is a component. The directory is a component with a very simple interface that contains the register operation and the lookup operation. The directory is a component which needs to be deployed and started on some container, like any other component in your architecture. We can use the directory to locate all the other components; but how do we find out where the directory itself is located? The location of the directory has to be known by the clients in advance. In other words, the directory gives us location transparency for all the components in the architecture apart from itself. Clients do not have to know where to find the components, but they have to know where to find the directory. Introducing a directory makes it cheap to move components around, because you just have to update the directory and the clients will dynamically retrieve the updated information; but if you want to move the directory itself, then you may have to reconfigure or even rebuild all your clients, depending on how strongly this information is hardcoded. Another way to discover the directory's location is to use some networking tricks, such as broadcasting: clients advertise their existence, and the directory listens for such messages and performs a rendezvous. This would generate too much traffic if used for all components all the time, but you can afford to run this protocol to let each component discover the directory on startup. That can be another solution that avoids hardcoding the knowledge into the clients themselves. What kind of information do you store in a directory? This is a matter of domain modeling for interface description and discovery. For example, consider the Domain Name System (DNS): this is a very simple type of lookup, where you go from a symbolic name to a numeric IP address. Directories can also store a complete copy of the interface description, or the actual interface description is stored by the component itself and the directory just stores a reference to it. That depends on how easy you want to make it to update the directory: if you store a copy, you have to replace it every time it changes; if you store a reference, the component just updates its local copy and the directory will follow the reference to the latest version. How complicated is the matchmaking performed by the directory? Given the syntax and semantics of the client's dependency, the directory can attempt to retrieve a compatible implementation. You can also have business constraints, such as usage prices or rate limits. Whatever metadata the directory collects, it is fundamental that it can be trusted. Directories are not only a performance and scalability bottleneck and a single point of failure, but also a security weakness. If you want to attack the system, you pollute its directory. You can call it a phishing attack, or just an attack which exploits the trust that clients place in the lookup results from the directory. Clients will invoke whatever component the directory tells them to talk to. This requires strict validation of who is performing the registration. If an attacker manages to register or update the address of an existing component, in the future clients will send all their messages to the attacker. To prevent this, while the lookup can be relatively open, the registration has to be restricted to authorized components.
If you run a directory for several years, it will contain a history of all the deployments of your system, because every time you deploy the system the components register themselves and the directory stores their location. Every time you redeploy, the locations may change, so the question is whether you want to keep a historical record, or whether you want to clean up the directory from time to time and associate an expiration date with the registrations. If you register a component, the registration will remain valid for a certain time window, for example a couple of months. After that time,
512
unless the component registers itself again, this information will be purged from the directory. This avoids registrations accumulating and becoming obsolete. Obsolete information means that clients will fail their invocations due to out-of-date lookup results. In the simple design that we have shown here, there is no feedback: if the directory information is obsolete, the client invocation fails, but clients don't have a mechanism to inform the directory: "by the way, the location was incorrect, because the invocation failed." More sophisticated directories will either have a monitoring solution inside the directory, which periodically checks whether the registered implementations are still where they claim to be, or will accept feedback from the clients and keep track of success/failure rates to estimate the availability of a given interface. This was the first example of the basic building blocks of scalable architectures. The directory is an architectural pattern that delivers location independence, and every time you connect two components you will need to decide: how are these two components going to discover each other? How do they find out where they are, so that they can talk to each other? We will see that directories are not the only way to solve this problem. With directories, the client actively looks up its dependencies. It is also possible to completely flip the relationship between the client and the directory: this is called dependency injection.
513
Dependency Injection
How to facilitate location transparency? Use a container which updates components with bindings to their dependencies. Clients avoid hard-coding knowledge about required interfaces as they expose a mechanism (setter or constructor) so that they can be configured with the necessary bindings.
In the directory pattern, the client calls the directory to perform lookups: the interaction starts from the client, while the directory is passive. This relationship can be inverted. We can also use a directory to configure the dependencies of a passive client, one which does not or cannot call the directory. That's what dependency injection is about. With dependency injection, we assume there is a container in which components are deployed. The container is responsible for updating its components with knowledge about where to find their dependencies. In the same way as with the directory, clients do not need to know where to find their dependencies; but they have to expose a mechanism so that they can be configured with the information they need to locate them. This way, the container can feed the information into them, so that they can locate their dependencies without having to ask for them. That's the most important difference.
514
Dependency Injection
[Diagram: the component's Interface registers its Interface Description with the Container (0. Register); the Container reads the Client's required dependencies (1. Get Dependencies) and configures it with the corresponding bindings (2. Configure); the Client then directly invokes the component (3. Invoke)]
As components are deployed in the container, they are updated with bindings to the interfaces they require.
So you can see how the interaction works. We still have the registration phase – the same as with the directory – where the interface advertises its location, this time to the container. When the component is deployed inside the container, the container asks the component for its dependencies. The interaction starts from the container, and the client passively answers this request, typically with some configuration file or some other way to get the metadata for this particular component. The container will inspect it and check whether these dependencies can be satisfied based on the registrations that have occurred so far. If the matchmaking is successful, it will configure the component so that, when the component starts processing and needs to invoke a certain interface, it already knows where to find it. These steps happen during the initialization of the container. As you start the system, the dependencies are injected; afterwards the components can operate normally and proceed to the invocation under the assumption that their dependencies have been satisfied and located. As you can see, the configuration of the client's dependencies is passive: the component that needs to be configured is not doing anything by itself, it is just reacting to the container that injects the dependencies. When you use a directory, you can wait until the latest possible moment before you need to know where to find your dependencies: this is late binding, very dynamic. Here, instead, we anticipate the binding at startup time. With dependency injection you have a startup phase in which you configure the dependencies, so that you don't have to do it later. This can save time during the actual invocation, because the invocation target is already bound. If the location of some dependency changes between startup time and the time of the call, you will need to reconfigure the clients: whenever the registration changes, you have to see which clients are affected. Then you can still deal with it and avoid failed invocations due to obsolete registrations.
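A minimal constructor-injection sketch (class names invented for illustration): the component is passive and just declares what it needs, while an external "container" function decides the binding for each environment:

# Minimal sketch of dependency injection. The client exposes a constructor
# parameter instead of looking anything up itself.
class Client:
    def __init__(self, price_service):         # the dependency is injected here
        self._price_service = price_service

    def run(self) -> None:
        print(self._price_service.quote("LUG->ZRH"))

class RealPriceService:
    def quote(self, route: str) -> str:
        return f"quote for {route} from the production backend"

class MockPriceService:                         # easy to swap in for testing
    def quote(self, route: str) -> str:
        return f"fake quote for {route}"

# The "container": it alone decides which binding each environment gets.
def container(environment: str) -> Client:
    service = RealPriceService() if environment == "production" else MockPriceService()
    return Client(service)                      # 2. configure; the client then invokes

container("testing").run()
container("production").run()
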
Dependency Injection
• Used to design architectures that follow the inversion of control principle:
• "don't call us, we'll call you", Hollywood Principle
• Components are passively configured (as opposed to actively looking up interfaces) to satisfy their dependencies:
• Components should depend on required interfaces so that they are decoupled from the actual component implementations (which may be changed anytime)
Such inversion of the relationship between the directory and the component that is looking for dependencies is called the "Hollywood principle". When you go to an audition, the usual answer at the end is that they still need to decide whether you got the part: please don't call us, we will call you (in case your audition was successful). This clarifies the relationship between who has the information and who needs the information: we know the dependencies that you need, and we will tell you how to satisfy them; you don't have to ask us. With dependency injection, components are passive. They do not know how to look up the interfaces that they need. They just sit there and wait to be configured. The idea is that components know what their dependencies are, and they are configured by an external entity which actually satisfies these dependencies.
516
Dependency Injection
• Flexibility:
• Systems are a loosely coupled collection of components that are externally connected and configured
• Component bindings can be reconfigured at any time (multiple times)
• Testability:
• Easy to switch components with mockups
The environment in which the component is deployed is in control of when and where these connections are established. This happens, of course, when you start the system, but it can also happen later, in case you need to move some of the dependencies to a different location. With the directory, you would reconfigure the directory; as soon as the client tries to call the old location, the call fails, the client looks up the directory again and is given the updated information. With dependency injection you can prevent the failure: whenever you know that something has changed, you reconfigure the components with the updated information. Another advantage of using dependency injection is that it helps you to use the correct configuration along the build and continuous integration pipeline. If you deploy a component in a testing or staging environment, the container will configure the dependencies so that you use, for example, mockups, or components that are supposed to be used only during testing. When you deploy in production, the container configured as a production environment will use the correct production dependencies. With a directory, if for some reason a component forgets to perform the lookup with the right directory and relies on hardcoded information, this is difficult to rectify without rebuilding the component (e.g., to avoid using testing dependencies when deployed in production). Since with dependency injection the component is passive and the responsibility for setting up the right environment lies entirely outside of the component, it is easier to check that the system is not deployed with inconsistent dependencies: you control the environment, as opposed to controlling each component.
517
Directory vs. Dependency Injection
[Diagrams: with Dependency Injection, the Component registers with the Container, the Container configures the Client, and the Client invokes the Component; with the Directory, the Component registers with the Registry, the Client performs a lookup in the Registry, and then invokes the Component]
To summarize the difference between directory and dependency injection, let's take a look at the logical perspective of how the three elements depend on each other. On one side, the component the client depends on registers with the container; on the other, the component registers itself with the registry, which is a synonym for directory. The invocation is also the same on both sides: the client invokes the component. What is the only difference? It's a small one, but very important: the relationship between the client and the container/registry. With the directory, the client looks up in the registry the location of the component it wants to invoke: the client is active while the registry is passive. With dependency injection we have exactly the opposite: the container configures the component deployed inside it, setting the location of the component's dependencies. The container is active, while the client is passive.
518
Scatter/Gather
How to compose many equivalent interfaces? Broadcast the same request and aggregate the replies. Send the same request message to all recipients, wait for all (or some) answers and aggregate them into a single reply message.
Scatter/Gather is a structure that we are going to use to understand how scalability patterns such as load balancing, master/worker and sharding work. Let's take it from a more abstract perspective first. The problem here is that we are in a decentralized design in which we have multiple equivalent interfaces that we can use. The question is: how do we make use of them as a whole, as a single unit? How do we compose them together so that, from the client's perspective, they behave as a single component? When we receive a request from the client, if we send this request to only one element, that element would soon become a bottleneck as its capacity fills up: a single element processing all requests would be centralized. So we decompose the architecture by creating multiple copies of these bottleneck elements. Then we broadcast the request of the client to all of the copies, to all replicas. Once the replicas process the request, we aggregate the replies and send them back in one response message to the client. How do we perform the request broadcast? And how do we perform the response aggregation?
519
Scatter/Gather
[Diagram: the Client sends one request to the Scatter/Gather element, which broadcasts the request to the equivalent components A, B and C and aggregates their results into a single reply]
Example:
• Contact N airlines simultaneously for price quotes
• Buy ticket from either airline if price ≤ 200 CHF
• Make the decision within 2 minutes
We have one request message coming in from the client. This request can be copied, so we can send the same request to all the replicas. But then we may get different answers from each of them, both in terms of content and timing. If we have a set of answers, how do we reduce it to only one? Because the client expects a single response. Do we really need to wait for all of them? If our goal is to keep the latency under control, in order to produce a fast response we could just send back to the client the response of the fastest replica. One example: you need to travel somewhere. Different airlines receive the same input about your trip and respond with an offer describing a possible connection and its price. The rule for aggregating the results can be as follows: if you get a cheap enough price, you take the offer; there's no need to wait any longer. However, if the prices are a bit more expensive, you want to wait long enough to compare different answers, and then you will take the cheapest. You cannot wait forever, though: if you found an answer within 2 minutes which is within your budget, you will buy it. This gives you both a way to aggregate the results (you take the cheapest) and a strategy to decide how long you want to wait. If one of the backends that you contact is overloaded, it will not answer within the deadline. It's better not to wait forever for all replies, but to increase the chances that at least one replica will reply before the timeout.
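Here is a sketch of the airline example as an asynchronous scatter/gather with a deadline; airline names, prices and latencies are simulated, and a real implementation would make remote calls instead:

# Scatter/gather with a deadline, modeled on the airline example (illustrative).
import asyncio, random

async def ask_airline(name: str) -> tuple[str, float]:
    await asyncio.sleep(random.uniform(0.1, 3.0))   # simulated network latency
    return name, random.uniform(150.0, 400.0)       # simulated price in CHF

async def scatter_gather(airlines: list[str],
                         good_enough: float = 200.0,
                         deadline_s: float = 2.0) -> tuple[str, float] | None:
    tasks = [asyncio.ensure_future(ask_airline(a)) for a in airlines]   # scatter
    offers: list[tuple[str, float]] = []
    try:
        for fut in asyncio.as_completed(tasks, timeout=deadline_s):     # gather
            offer = await fut
            offers.append(offer)
            if offer[1] <= good_enough:   # cheap enough: stop waiting
                return offer
    except asyncio.TimeoutError:
        pass                              # deadline reached: use what we have
    finally:
        for t in tasks:
            t.cancel()                    # drop replies from overloaded backends
    return min(offers, key=lambda o: o[1]) if offers else None  # take the cheapest

print(asyncio.run(scatter_gather(["AirA", "AirB", "AirC"])))
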
520
Scatter/Gather
Which components should be involved?
• The recipients are kept hidden from the client: they can be dynamically discovered using subscriptions or directory registrations
• The recipients are known a priori by the client: the request includes a distribution list with the targeted recipient addresses
How do we know which replicas to invoke? There are two options. One is to keep the fact that you’re scattering the request hidden from the client: the client thinks it’s talking to one component, but behind it you have multiple ones. Alternatively, the client is going to tell you whom to invoke. This makes a very big difference in terms of how transparent your interface is. If you keep it hidden, it means that you are responsible for managing the replication of the implementation, the discovery of the back-end replicas (you can use a directory behind the scenes), scattering the client request and gathering the replies. The client doesn’t need to know. Otherwise, you push all the directory management and all the knowledge about where to forward the request to the client. Whenever the client sends a request, the request will include the list of addresses of the components that the client wants you to target. Considering the airline example, it makes sense that the client can give you some constraints about which airlines it is interested in contacting. Some travel reservation agencies play with this: they make you an offer with particularly good prices, but they do not tell you which airline is behind it. Only if you accept the offer will they reveal this information.
Scatter/Gather How to aggregate the responses? • Send all (packaged into one message) • Send one, computed using some aggregation function (e.g., average) • Send the best one, picked with some comparison function • Send the majority version (in case of discrepancy)
In the other direction, once we get multiple responses, how do we aggregate them? These aggregation strategies do not only contribute to enhancing scalability, but can also help to ensure the result is correct. The simplest solution is to just package multiple responses into one reply message. A message can always carry a collection of messages within it, so it’s easy to do that and delegate the aggregation to the client. We can condense multiple messages with some statistics, e.g., calculating the cheapest, calculating the average, or selecting the best according to some utility function. In a Byzantine environment in which you do not necessarily trust all of the back ends, you can use a voting strategy. If, out of five copies of the same input, one of the replies is a completely different result but the other four agree with each other, we drop the outlier and forward the majority response.
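A minimal sketch of the voting strategy (a generic helper, not from the lecture), which forwards a reply only if a strict majority of the replicas agree on it:

```java
import java.util.*;

public class MajorityVote {
    // Given the replies of all replicas, return the value produced by a strict
    // majority, dropping Byzantine outliers; empty if no majority exists.
    static <T> Optional<T> majority(List<T> replies) {
        Map<T, Long> counts = new HashMap<>();
        for (T r : replies) counts.merge(r, 1L, Long::sum);
        return counts.entrySet().stream()
                .filter(e -> e.getValue() > replies.size() / 2) // strict majority
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        // Four replicas agree, one returns garbage: the outlier is dropped.
        System.out.println(majority(List.of(42, 42, 42, 7, 42))); // Optional[42]
        System.out.println(majority(List.of(1, 2, 3)));           // Optional.empty
    }
}
```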
Scatter/Gather Synchronization strategies • When to send the aggregated response? • Wait for all messages • Wait for some messages within a certain time window • Wait for the first N out of M messages (N < M) • Return fastest acceptable reply • Warning: the response-time of each component may vary and the response-time of the scatter/gather is the slowest of all component responses
How long do you wait before sending back the answer? Waiting for all the replicas to respond is bound by the slowest. Follow this strategy only if you really need to wait for all of them to guarantee complete coverage. If you expect the result will be the same, maybe you can disregard some replicas if they are too slow. There will be a time window in which the client is expecting a response, and you will wait only until this window is about to expire before sending the best aggregated reply that has been gathered until then. If the time window is unclear, it is also possible to just count how many replies have arrived: once the first N out of M messages have arrived, you have enough responses and you can already answer the client. Sometimes you really want to make it as fast as possible, so you take the first reply, or the first that is acceptable. This requires validating the answers to avoid accepting super fast replies full of random bits (or high velocity garbage).
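A minimal sketch of the first-N-out-of-M strategy, assuming made-up replica tasks; the slowest replies are simply abandoned once N have arrived:

```java
import java.util.*;
import java.util.concurrent.*;

public class FirstNofM {
    // Answer as soon as the first N of the M replicas have replied.
    static List<Integer> gatherFirstN(List<Callable<Integer>> replicas, int n)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        CompletionService<Integer> cs = new ExecutorCompletionService<>(pool);
        replicas.forEach(cs::submit);              // scatter to all M replicas
        List<Integer> replies = new ArrayList<>();
        for (int i = 0; i < n; i++)                // gather only the first N
            replies.add(cs.take().get());
        pool.shutdownNow();                        // stop waiting for the rest
        return replies;
    }

    public static void main(String[] args) throws Exception {
        List<Callable<Integer>> replicas = List.of(
            () -> { Thread.sleep(100); return 1; },
            () -> { Thread.sleep(200); return 2; },
            () -> { Thread.sleep(5000); return 3; }); // slowest: never waited for
        System.out.println(gatherFirstN(replicas, 2)); // [1, 2]
    }
}
```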
Master/Worker How to speed up processing large amounts of input data? Split a large job into smaller independent partitions which can be processed in parallel. The master divides the work among a pool of workers and gathers the results once they arrive
Synonyms: Master/Slave, Divide-and-Conquer
How can we refine the scatter/gather pattern so that we can use it to scale, not necessarily to a large number of clients, but to deal with a single client that is sending a large amount of input data? The idea is to split the input data into smaller partitions which are processed independently and therefore in parallel. This is the assumption that we make when applying the master/worker pattern. The master’s role is to accept the input data, divide it, and then schedule the work among a pool of identical workers. Workers are supposed to work independently and in parallel on each of the partitions. Once the results come back, it’s like in the scatter/gather pattern: we have to accumulate and concatenate the results. Once all the workers have completed, we can send the merged result back. This pattern is also known as divide and conquer, especially when it is applied recursively. The problem is too large to be solved as a whole, so we split it, we solve each of the parts, and then we reassemble the solution together.
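A minimal sketch, assuming a toy job (summing a large array) stands in for the large input to be partitioned:

```java
import java.util.*;
import java.util.concurrent.*;

public class MasterWorker {
    public static void main(String[] args) throws Exception {
        // The "large job": sum a big array. Workers are identical and independent.
        long[] data = new long[1_000_000];
        Arrays.fill(data, 1);

        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Long>> results = new ArrayList<>();

        // Master: partition the input uniformly, assign each slice to a worker.
        int chunk = data.length / workers;
        for (int w = 0; w < workers; w++) {
            int from = w * chunk;
            int to = (w == workers - 1) ? data.length : from + chunk;
            results.add(pool.submit(() -> {
                long sum = 0;                 // worker: process one partition
                for (int i = from; i < to; i++) sum += data[i];
                return sum;
            }));
        }

        // Master: gather and merge the partial results into one response.
        long total = 0;
        for (Future<Long> f : results) total += f.get();
        System.out.println(total); // 1000000
        pool.shutdown();
    }
}
```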
[Figure: Master/Worker — the client’s request is partitioned by the master, the partitions are assigned to Workers A and B, and the merged results are returned as a single response.]
Example: • Matrix Multiplication (compute each row independently) • Movie Rendering (compute each picture frame independently) • Seti@home (volunteer computing) We can see the interaction has a similar structure to scatter/gather, with a small but important difference: the incoming request is not directly forwarded to the workers. Before doing so, there is a partitioning step which takes the large input request and partitions it into smaller ones. Each of these partitions needs to be assigned to a worker. While the workers perform the same computation, they don’t necessarily complete it with the same performance. When all of them have processed their partition, the results are merged into a single response. From the client perspective, such partitioning and parallel processing is also something that you can do completely transparently. The client doesn’t see that its request is partitioned and the results are reassembled back together. There is only one request coming in and one result going out. The only observation the client can make is that for some reason this implementation scales because, despite the fact that you’re sending larger and larger amounts of data, the response time doesn’t grow as much as one would expect. This holds if you can add more and more workers to deal with larger and larger amounts of data. We have examples from scientific computing where this happens quite often with parallel supercomputers. If you are trying to render a movie, for example, you know that each frame can be rendered independently, so you can actually use master/worker to scale to longer movies with higher resolution. You set up a so-called rendering farm, where you have a huge number of graphics cards and processors; you input the movie and then you don’t have to do the computation sequentially. If you scale this to the whole Internet, where the workers are actually running on desktops, recycling wasted computing cycles all over the world, then you have something called volunteer computing. The amount of work that you have to do is so large that you need millions of these workers all over the place. This was introduced for the search for
extraterrestrial life, where the signals from radio telescopes were partitioned into blocks of time and different frequencies, and each block was sent out to screen savers running on desktop computers all over the world. After downloading a block, they would analyze it to search for interesting signals and possibly send back some detection events. This application became very popular, with the number of workers growing beyond the initial expectations. Here’s another scalability issue: when you really have millions of these workers coming and asking for work, you must have a huge queue of partitions that need to be processed. Eventually they ran out of data, so they started to re-process the same data with more expensive detection algorithms. They also came up with incentives to keep the workers interested: a reward system for people to participate and donate their CPU cycles. You could spot your name in the top 10 or the top 100 of individuals or organizations that were contributing their resources to SETI@home. Not surprisingly, when you put gamification into play, some people are driven to win the competition, and others hacked their way to the top by running modified screensavers, which would just return some random result after downloading a piece of work without really analyzing it, to get the reward while saving on CPU usage and the corresponding energy bills. If you do not trust the workers, how can you check that they are actually doing what they are supposed to do? You can send the same partition to multiple workers, scatter, gather, and compare the outcomes. Or you can have some expectations about how long it is going to take to process it: if you get an answer too soon, there is something suspicious going on.
Master/Worker Master Responsibilities • Transparency: Clients should not know that the master delegates its task to a set of workers • Partitioning strategies: Uniform, Adaptive (based on available worker resources), Static/Dynamic • Fault Tolerance: if a worker fails, resend its partition to another one • Computational Accuracy: scatter the same partition to multiple workers and compare their results to detect inaccuracies (assuming workers are deterministic) • Master is application independent There are many second-order design decisions that you have to make when you decide to follow the master/worker pattern. Let’s discuss them in more detail. From the master perspective, the goal is to hide from clients the fact that it is splitting the work and forwarding it to a set of parallel workers. When partitioning the work, there can be different ways to do it. If you know that workers are homogeneous – every worker has the same computational power – then you can split the work in a uniform way. If you know some workers are faster than others, you can give them more work while assigning less work to the slowest workers. This requires a model to predict how much work every worker can perform, so that at the end you get back the results at the same time. The set of workers can also fluctuate. You don’t necessarily know, at the time the partitioning happens, how many workers will be available. If the computation takes a few seconds, maybe you have a reasonable expectation about how many chunks are needed, but if the computation takes a few months, it’s difficult to foresee in advance the availability of workers. In this case, there can be a dynamic partitioning strategy which starts to slice the work progressively, while keeping some work from being allocated in case more workers show up. If more workers happen to join the system, it will either repartition, reshuffle or reassign work that was not given out yet. We can also have a work stealing strategy, where a fast worker looks for work that has been assigned to other workers that are slower. There is also the problem of failures: if workers fail, we need to be aware of that, because before we can send back the response to the client, we need to gather the results from all the partitions. It’s important that all the workers successfully complete their job. In case of failure we have to resend the partition to be processed somewhere else, and whenever we use a retry strategy to deal with failures – as we’re going to see in the reliability lecture – the assumption is that it’s possible to repeat the work: that
processing a partition does not have any side effects. With deterministic computations the same result should be obtained no matter which worker produces it. In this case, it is possible to take the same input, the same partition, and have it processed by multiple workers. This way we can compare the results to detect whether some of the workers are cheating by sending back incorrect results. The more replicas you make of each partition, the more expensive it becomes. There is a tradeoff between how much fault tolerance you want, how much performance you want, and how many replicas you want to use for processing each partition. All of these design decisions – about partitioning, about retrying failed jobs, about sending out multiple copies of the same partition – are typically independent of what the processing is about. As long as it is possible to partition the input and merge the results back together, the master does not care about its content. The part of the system that is of course application dependent is the worker. The worker needs to know how to process the input and what the application domain is about.
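A minimal sketch of the master’s fault-tolerance duty (a hypothetical helper; it assumes processing a partition is side-effect free, so repeating it is safe):

```java
import java.util.concurrent.*;

public class RetryPartition {
    // If a worker fails while processing a partition, resend the partition
    // to another worker, up to a maximum number of attempts.
    static <T> T processWithRetry(ExecutorService pool, Callable<T> partition,
                                  int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return pool.submit(partition).get(); // assign to some worker
            } catch (ExecutionException e) {
                last = e;                            // worker failed: try another
            }
        }
        throw last; // all attempts failed
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // A flaky partition that crashes about half of the time; with 5
        // attempts the demo almost always succeeds.
        int result = processWithRetry(pool, () -> {
            if (Math.random() < 0.5) throw new RuntimeException("worker crashed");
            return 42;
        }, 5);
        System.out.println(result);
        pool.shutdown();
    }
}
```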
Master/Worker Worker Responsibilities • Each worker runs its own parallel thread of control and may be distributed across the network • Worker churn: they may join and leave the system at any time (may even fail) • Workers do not usually exchange any information among themselves • Workers should be independent from the algorithm used to partition the work • Workers are application domain-specific To speed up the work by scaling over more resources, or to keep the response time flat with a larger input, you need to parallelize the work: each worker should be distributed across different cores or cluster nodes. How stable are your workers? Here we see the term churn coming up again. Can you assume that once you get the request from the client, you have a set of workers and they stay exactly the same throughout the whole computation? Or will you have an unpredictable set of workers that come and go at any time? What if some disappear and never send you back a result? Another assumption that simplifies applying the master/worker pattern is that when you give a processing job to a worker, the worker can perform the computation on its own. It doesn’t need to know who the other workers are. It doesn’t have to exchange any information with them. The only communication flow is from the master to the worker and back, but workers don’t have to talk to each other. There can be more complex, less embarrassingly parallel schemes in which workers periodically need to exchange some information, but that makes running the parallel computation much more problematic. Finally, the partitioning should be the exclusive responsibility of the master. Workers should just know how to process one piece and not care about how the partitioning works. Only if the pattern is applied recursively, with divide and conquer, will workers decide whether it’s worth further splitting their partition and delegating the parallel processing to some sub-workers.
Load Balancing How to speed up processing multiple requests of many clients? deploy many replicated instances of stateless components on multiple machines The Load Balancer routes requests among a pool of workers, which answer directly to the clients
Let’s see now how we design a scalable architecture which, instead of supporting one client sending a huge amount of data, can grow to handle multiple requests of many clients. This was the original challenge of building a scalable architecture in terms of the workload. You have one component that is the bottleneck. You can tell by measuring its response time: when too many clients are sending requests, they need to wait longer and longer for the responses. If we can assume that the bottleneck component is stateless – the responses only depend on the input and don’t depend on previous requests – we can deploy many copies, many replicas of the component on different machines, and we can spread the requests across all of them using a special component called the load balancer. The load balancer is involved in routing the requests. The responses can go directly to the clients. As opposed to scatter/gather, there is no need to aggregate responses: each request is processed by one worker, so its response can go back directly as soon as it’s ready.
[Figure: Load Balancing — client requests arrive at the load balancer, which uses a directory to assign each request to Worker A or Worker B; workers reply directly to the clients.]
This is how the flow works: we have lots of clients. A request arrives from one client. The load balancer needs to find a suitable worker for processing this request. Here we have a directory that manages the knowledge about which workers are available and where to find them. The load balancer will use the directory to perform the matchmaking between what the client wants to do and where there is a worker that can do it. Once the assignment has been made, the request will be forwarded to the worker. The worker will take it, process it, and send back the response directly. Also in this case the client doesn’t need to know where the worker is, because that is hidden behind the load balancer, which makes the decision for the client about where to send the message. The worker needs to know where to send back the response: if you have headers in the messages, the request message will have a ”reply to“ header which contains the identity of the client that is supposed to receive the response. When the next client comes in, the directory will make a different decision, because this time the first worker is busy with the first client’s request. The decision is about: does it make sense to queue a request for this worker? What if we have another one that is free? In this case, the request will be forwarded to the worker that is free, the worker will process it, and the response will come back to the client. As you can see, there is a scatter phase, during which different requests go to different places, but there is no gather phase at the end, because the responses go back directly. This implementation with the directory is also dynamic, so the set of workers can change. Over time we can add more workers when there is increasing traffic and we can remove them when there is not so much traffic. The function of the load balancer can also be implemented at a lower level of the network, since it is about routing and forwarding packets to the worker endpoint addresses, which are set by the directory. We can split the mechanism of forwarding the work, which has to happen really fast – otherwise the load balancer becomes a bottleneck itself – from the directory, which is the place where the strategy can be implemented to decide: how to rotate the work between different available workers? how to keep track of them? how to detect if a worker is busy or not? What if a worker fails?
The directory takes care of this, helping to keep the load balancer simple: it takes an incoming message, picks the next available worker, and forwards the message to it. These operations should be performed rather quickly. If you really want to scale, you perform them using hardware, like when introducing a specialized networking device dedicated to spraying packets around your cluster.
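A minimal sketch of this split, with hypothetical names: the forwarding mechanism just takes the next free worker, and the feedback loop is modeled as a queue that workers re-join when they become available.

```java
import java.util.concurrent.*;

public class SimpleLoadBalancer {
    // Directory of available workers: workers register here when they are free
    // (the feedback loop), and the balancer takes the first available one.
    private final BlockingQueue<String> availableWorkers = new LinkedBlockingQueue<>();

    void workerAvailable(String worker) { availableWorkers.add(worker); }

    // Forward one request: pick the next free worker; the reply then goes
    // back directly to the address in the request's "reply-to" header.
    String assign(String request) throws InterruptedException {
        String worker = availableWorkers.take(); // blocks if all workers are busy
        System.out.println("forwarding '" + request + "' to " + worker);
        return worker;
    }

    public static void main(String[] args) throws InterruptedException {
        SimpleLoadBalancer lb = new SimpleLoadBalancer();
        lb.workerAvailable("worker-A");
        lb.workerAvailable("worker-B");
        lb.assign("request-1 (reply-to: client-1)");
        lb.assign("request-2 (reply-to: client-2)");
    }
}
```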
Load Balancing Strategies
• Strategy: Round Robin, Random, Random Robin, First Available, Nearest
• Location: Server (transparent from client), Client (aware of choice between alternative workers), Directory (e.g., DNS)
• Layer: Hardware (network device), Software (OS, or user-level)
There can be different strategies for balancing the work among different workers; a sketch of the feed-forward ones follows below. The simplest one is to use a counter: you know how many workers there are, and for every request you increase the counter and use its value to select the worker. If you run out of workers, you start again from zero. We call this the round-robin strategy. It’s super fast, but doesn’t take into account any information about the availability or the current load of the workers. The assumption is that by the time you go around, the worker will be free again. Depending on how long it takes to process each request, this assumption may be true or not. What if, instead of incrementing a counter, we assign requests to a random worker? As long as you pick random workers with uniform probability, we can say that statistically the load will be distributed evenly, but we don’t necessarily always do it in the same order, and there can be a benefit in that. You can also combine the two strategies: you go around once, then shuffle the workers for the next round; every time you have assigned work to all of them, you reshuffle them for the next round. If there is a way to keep track of the availability of the workers, it is possible to create a feedback loop between the worker that takes the work and the directory. Workers have to notify the directory about their availability state, and the directory just keeps a list of available workers and takes the first one off the list. So there are feed-forward and feed-back strategies. The former just keep a list of workers
and go through the list in some sequence. The latter assume that the workers inform the directory about their state of availability. The nearest strategy can be used when scaling to “Web scale”: if you have clients all over the world, it can make sense to route requests to workers which are located near the clients. Do you know where the client request is coming from? You can forward it to a data center located in the same region as the client, so that you can minimize the latency and the ping between the worker and the client, especially when you want to send the answer back. Sometimes this is done at the DNS level: when you do a DNS lookup, the DNS server can see where the lookup is coming from and can reply with a different numeric IP address depending on where the client is located. This brings us to the awareness of the location of the workers. How transparent do we want this to be? The idea is that the client can send a request to a certain server. The client knows the location of the server (it has to discover it using some directory). But, behind the server, the fact that the request will be load balanced is kept transparent from the client. The load balancer knows where to find the workers, but the client doesn’t. From the client perspective there is only one single endpoint, only one place you talk to, and behind the scenes we do the load balancing. This intrinsically leaves a bottleneck, because of the single endpoint that all the clients have to talk to: the load balancer becomes a hot spot, since every client will have to send a message to this particular IP address even if behind it there are others that can take care of servicing the requests. This first step will eventually limit how much we can scale. To avoid this, we can push the work out to the clients; this is one of the basic scaling strategies that we discussed last week: if you cannot handle the work yourself, just get whoever is asking you to do the work to do it for you. In this case, we can give a list of endpoints to the clients, and if a client notices that one endpoint is becoming slow, it can switch and send requests to another one. Clients can themselves follow a random strategy: out of these possible addresses, each picks one randomly, and over millions of clients doing the same, statistically the work will even out so that all the endpoints will be loaded in a similar way. The downside is that the client has to be aware of the fact that there are different workers, and you have to implement not only a client that sends a request, but also a decentralized load balancing strategy inside the client. And you have to keep the list of workers up to date, because even if you choose one randomly, your choice may end up pointing to an obsolete address, and then you have a problem. But if you remember about caching and the directory, the idea was that the client looks up the list in the directory, keeps it in the cache, and only when it expires repeats the lookup again. If we have a load balancer that uses a directory behind the server, the client sends a request to the load balancer and the directory is located behind. However, we could also expose the directory to the client, so the client can do the lookup and then send directly to the worker, without having to go through this single point.
As an alternative, we can have the directory itself perform the load balancing. We can have a load-balancing DNS, for example. Clients need to look it up anyway, because they don’t know where the server is, so what’s the point of asking the directory “give me the location of a server”, and then calling the server, which will have to load balance the request over its pool of workers? You might as well have the directory send back the location of the worker, because the client will do the lookup anyway. If you want to build a highly scalable system, you will do a combination of these. You will start by having a load-balancing DNS. The DNS will send addresses to clients that
are, for example, nearest to them on the network. The clients will contact one of these addresses – maybe they get multiple lookup results, maybe they get only one – choosing randomly among them. And once they get to these addresses, the server again will have a pool of workers behind it and will do the load balancing. So these are not necessarily alternatives; they can also be introduced in combination. When you use them in combination, you’re talking about a serious investment in terms of components dedicated to scalability, which can also be overkill for most scenarios, but if you really need to go all the way, you know that you have a lot of tools at your disposal. The load balancer can become a bottleneck for the reasons that we discussed, so another way to mitigate this is to implement it in hardware. There are network devices that are just doing this: they don’t even open the packets, they just rewrite the addresses in the packet headers. You can also do it at the OS level, you can have load-balancing Web server proxies, or you can always implement load balancing inside your user applications.
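A minimal sketch of the three feed-forward selection strategies mentioned above (round robin, random, and the shuffled “random robin”); worker names are illustrative, and the demo exercises one strategy at a time:

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;

public class BalancingStrategies {
    static final List<String> workers = new ArrayList<>(List.of("A", "B", "C", "D"));
    static final AtomicInteger counter = new AtomicInteger();
    static final Random rnd = new Random();

    // Round robin: increment a counter and wrap around the worker list.
    static String roundRobin() {
        return workers.get(Math.floorMod(counter.getAndIncrement(), workers.size()));
    }

    // Random: uniform choice, keeps no state about previous assignments.
    static String random() {
        return workers.get(rnd.nextInt(workers.size()));
    }

    // "Random robin": go around the whole list once, then reshuffle it
    // so the next round visits the workers in a different order.
    static String randomRobin() {
        int i = Math.floorMod(counter.getAndIncrement(), workers.size());
        if (i == 0) Collections.shuffle(workers, rnd); // new order each round
        return workers.get(i);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 8; i++) System.out.print(roundRobin() + " "); // A B C D A B C D
    }
}
```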
Load Balancing Variants • Stateless: every request from any client goes to any worker (which must be stateless) • Session-based: requests from the same client always go to the same worker (which could be stateful) • Elastic: the pool of workers is dynamically resized based on the amount of traffic Load balancing as we have introduced it makes the assumption that the components that you are replicating are stateless. This is the simplest way to do load balancing: every request is independent, so if you have a client that sends multiple requests, you don’t have to remember your decision about where to send each of them. You can just take every request, send it to the next available worker, and forget about it. If, however, we have stateful components, meaning that the order of the requests and the sequence in which they are processed matter, you cannot process a request without having seen the previous ones. This puts a constraint on the load balancer: it has to remember where the previous requests from a given client went, so that later requests are sent to the same worker. The worker accumulates state from the stream of requests, which cannot be processed unless they all arrive in the right order at the worker. So we relax the assumption that we were making before. Load balancing is easy to introduce as a scalability technique if you assume that components are stateless. If they’re stateful, you have this constraint, and you solve it by establishing a session between the load balancer and the client. The load balancer, in other words, has to distinguish the
identity of the client and remember for each client which worker was assigned. Unknown clients can be assigned a worker using the previously mentioned strategies, but follow-up requests from the same client must be routed to the same worker. This also limits how much churn, how much change, you can expect from your pool of workers. If you have a stateless load balancer, there is no problem: you don’t care whether a worker is new or whether a worker disappears. The load balancer is completely free to assign any request to any worker. If you use a session-based load balancer, you need to be careful that new workers only take requests from new clients; once a client has already sent a request before, it has to go to the worker that has seen that client before. Another important design decision is: how long do you keep this information? How long are your sessions? If you keep them too short, your clients will complain because they will start to get incorrect results. If you keep them too long, you may over-constrain the life cycle of the workers, which need to stay available for a long time. There is also an overhead in terms of how much session information you need to keep track of. Since we are talking about scalability, we are talking about millions of clients: it is expensive to remember all of them within the mapping table connecting each client with each worker. The complexity of the protocol between the client and the load balancer also increases. There needs to be a mechanism to establish the session, but also a way to stop the session. After a session expires, the client will need to discover this. There could also be a way for the client to prolong the session or negotiate its duration. The last variant of load balancing is about the relationship between the pool of workers and the amount of traffic. Are you in a static environment where the number of workers is fixed? Then the capacity of your system is fixed, and you just have to use the workers as efficiently as possible. Or is it possible, when you have a traffic spike, to increase the size of the pool of workers? While at one point in time you need additional capacity, when the traffic decreases you can free up some resources to do something else as the spike goes away. Traditional load balancing assumes a static environment in which you have a set of workers and you spread the work between them using the strategies that we discussed before, with the constraint on whether the workers are stateless or stateful. If you have an elastic load balancer, things are even more complicated: the load balancer is going to measure the amount of traffic and make decisions on whether to dynamically allocate more workers or dynamically stop some of them. More in detail, we have a queue of pending client requests, and the load balancer consumes them by assigning them to a worker. If the queue grows above a certain threshold, you can assume that the currently available workers are not enough. You don’t have enough capacity, so you might as well look for additional workers and try to increase the capacity of the system so that the queue of requests goes under the threshold again. If the queue gets empty, you will notice that workers are idle. Idle workers can self-destruct if they don’t receive work within a given time, or they can be stopped by the load balancer and restarted only as the next wave starts to rise again.
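A minimal sketch of session-based (“sticky”) routing, assuming a made-up client identifier as the session key:

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

public class SessionBasedBalancer {
    private final List<String> workers = List.of("A", "B", "C");
    private final Random rnd = new Random();
    // Session table: remembers which worker was assigned to each client.
    private final Map<String, String> sessions = new ConcurrentHashMap<>();

    // Unknown clients get any worker; follow-up requests from the same
    // client are always routed to the same (possibly stateful) worker.
    String route(String clientId) {
        return sessions.computeIfAbsent(clientId,
                id -> workers.get(rnd.nextInt(workers.size())));
    }

    // Sessions must eventually expire, or the table grows without bound.
    void expire(String clientId) { sessions.remove(clientId); }

    public static void main(String[] args) {
        SessionBasedBalancer lb = new SessionBasedBalancer();
        System.out.println(lb.route("client-1")); // e.g., B
        System.out.println(lb.route("client-1")); // same worker as before
    }
}
```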
Sharding How to scale beyond the capacity of a single database? partition the data across multiple databases Route the queries to the corresponding data partition ensuring each partition remains independent and balanced
The last pattern focuses on how to scale beyond the limits of a single storage element, for example, a single database. When we talk about the capacity of a single database, this can be limited by the capacity of the physical storage, for example, when going beyond tens of terabytes. How do you scale to storing petabytes of data when there is a clear physical limit on how much space each storage device can provide? To go beyond, it should be possible to partition the data into pieces and store it across multiple databases, so this is clearly a scale-out strategy. You have been scaling up until you reached the limit: say, just to pick a number, there is no disk that can store more than 10 terabytes. You can take two disks, put them side by side, and now you have storage for 20. However, we will need to partition the data so that each part fits within the limit of the storage space we have available. Once we partition the data, we have to know and remember where it is located, so again we have a directory problem, a discovery problem. Where do we find the data? Where do we find the partition to which we can route the queries? When a client needs to read or update the information, we need to send the query to the corresponding partition. The challenge is to make sure that most queries get routed to only one partition. If all queries are routed everywhere, then all partitions of the whole system will be busy servicing all the queries, thus limiting its scalability in terms of how many queries it can process. While the main goal of sharding is to scale out beyond the limits of an individual storage element, an additional goal is to also take advantage of the partitions to do load balancing. Different queries can be serviced in parallel by reading or writing to different partitions, which will be stored on different disks, thus increasing the amount of I/O operations which can be performed by the storage.
[Figure: Sharding — a query router uses the shard_key to direct each query to the corresponding shard (A or B); a common, non-sharded dataset is shared and may be replicated on each shard.]
Queries are directed to the corresponding shard (partition). Queries should not involve more than one independent shard, even if they may use some shared non-sharded dataset that may need to be replicated on each shard. Sharding is also a relatively complex pattern, which can be understood as combining all of the ideas on how to scale that we have discussed so far. We have a query; we have to find where the data is that needs to be read or written by the query: this is a discovery problem. We have lots of queries: we have to balance the load among the partitions. We need to do so without creating hot spots, without ending up with a disk that is always involved in all queries. If you can keep the traffic balanced, each storage resource will be uniformly busy and fill up at a similar rate. Sharding works by load balancing each query coming into your storage system. There is an element that makes the decision on where to send the query, based on some property of the data that you’re trying to access. Information extracted from the query can be used to compute the location of the target partition. We call this the key of the shard. Shard is a synonym for partition, and the key identifies which shard you are going to use. Based on the shard key, we execute the query on the corresponding shard, and the result is sent back to the original component. If a different query comes in, depending on the corresponding shard key, it may end up in a different shard. This is a way to scale the bandwidth for reading or writing to physical disks. If you have multiple partitions stored on separate disks, the amount of data which can be transferred grows if read and write operations are performed in parallel on the various disks. What if each query reads data from a specific partition but also needs to access common data from a shared partition? If this shared data is read-only, it can be replicated on each partition to ensure the queries can still be processed independently.
Sharding • Two goals: • scale capacity (storage size, bandwidth, concurrent transactions) • balance query workload (avoid hotspots) • What to shard? • Large data collections (they do not fit) • Hot spots (increase throughput of queries targeting a subset of the data)
When we talk about sharding, we have two goals. The first is to go beyond the limits of the available amount of storage space; the second is to grow the I/O bandwidth: how many concurrent transactions can a database process? When we hit the limit, we replicate and partition the data accordingly. The second challenge is how to balance the queries so that we avoid sending all the queries to the same partition. We use sharding to work with very large amounts of data which do not fit in a single storage element. To do so, we have to partition the data. In case we detect a hot spot, we can use sharding to replicate it and thus increase the throughput and raise the capacity of the database to process concurrent queries. If there is a limit on how many read transactions your database can perform, you can always make a copy of the data (it doesn’t change anyway) and then you can double the throughput, because you can read from each copy independently.
Sharding Computing the Shard Key • Key Ranges • Geo-spatial (country shards) • Time Range (year, month, day) • Hash Partitioning • Modulo (Number of Shards) Usually assumes a fixed and static number of shards The key decision that you need to make when you decide to introduce sharding is: how are you going to compute the key that you use to decide where to place the different partitions? You already know that in your database you have keys that identify the elements you are storing. One easy solution is to take the values that you assigned to your keys and map the range of values to fit within the number of partitions you plan to use. For example, if you use integer keys that go from zero to 1,000,000, you can make two shards: one will be for entries from zero to 500,000, and the next shard will store the ones above 500,000. This also gives you an easy way to figure out how to transform the key into the shard identifier. If you are storing geographical data, you can, for example, partition your data set by country. Since some countries are much bigger than others, you may not end up with uniform partitions. If you work with time series, you have data with a historical perspective, going back many years. You could consider storing each year in a different shard. Given the time, it’s easy to determine where to find the corresponding data. The current year is the place where you read and write, while past years go in a read-only database, since old data should no longer change. The granularity of the partitioning is critical. If you shrink the time range and make a shard per month, week, or even day, it will be very efficient to access information on a daily basis, but if you have to run a query to aggregate the quarterly results, you will need to scatter the query and gather the results from many different partitions. Based on the level of aggregation of your system’s data access patterns, you can decide how to partition the data. You have to balance being too fine grained – risking that you have to aggregate information from multiple shards – against keeping it too coarse grained and then having hot spots. When everything else fails, you can always try hashing: take a piece of data, hash it, and map the hash to the shard key. If hashing is expensive, you can take the modulo of some numeric identifier within the number of shards. For example, if you have 10 shards you will take the last digit of the numeric identifier to know where
to store each object. No matter which strategy you choose, the assumption is that the set of shards is fixed and static. So sharding is one of those decisions that you make up front. How you plan to partition the data will constrain how you can access it efficiently (and vice versa). Since sharding affects the physical storage location of the information, if you change your decision, you will have to physically move data around, and this is super expensive for large amounts of data.
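A minimal sketch of the three ways of computing a shard key mentioned above (key range, time range, and hash modulo); the ranges and shard counts are illustrative:

```java
public class ShardKey {
    static final int SHARDS = 10; // assumes a fixed, static number of shards

    // Key range: integer keys 0..999,999 split into two halves.
    static int byRange(int key) { return key < 500_000 ? 0 : 1; }

    // Time range: one shard per year.
    static int byYear(java.time.LocalDate date) { return date.getYear(); }

    // Hash partitioning: hash the key, then take the modulo of the shard count.
    static int byHash(String key) {
        return Math.floorMod(key.hashCode(), SHARDS);
    }

    public static void main(String[] args) {
        System.out.println(byRange(123_456));                            // 0
        System.out.println(byYear(java.time.LocalDate.of(2023, 5, 1)));  // 2023
        System.out.println(byHash("customer-42"));                       // 0..9
    }
}
```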
Sharding Looking up the Shard Key • Business domain-driven (customer/tenant id) • Helps to keep data balanced • Requires maintaining a master index (i.e., a directory for shards, e.g., ZooKeeper)
Computing the shard key in general requires coming up with a function which maps your data to the shard key. The function should be deterministic if you want to be able to consistently find where the data was placed. The advantage of computing the shard key is that you can derive it from the data itself. The disadvantage is that it may not be trivial to come up with such a function. Mapping data objects or tuples to partitions can also be seen as a lookup problem. If the previous strategies don’t work out well, you can always store the key: somewhere in your data model there will be a column which stores the identifier of the shard in which the data is located. The advantage is that if you notice that some shards are getting bigger than others, you can just change the key to move data around, and you don’t have to change the way that you compute the key, because the key is just another attribute that you associated with your data. However, if you follow this strategy, you have to remember where the shards are located. In other words, how do you go from the shard key to the actual location of the data? A directory can help. Given a piece of data which contains a shard key, you have a logical identifier which can be used to look up the physical location of the corresponding shard: the address of the database server in which the shard is found. It is important to keep this indirection, because if you store the physical location of the database and you ever need to move it, every single piece of data referring to another shard in your system will need to be updated, which again you may want to avoid. If you just remember the logical identity of the shard, then you can just change the lookup table and you will know where to go and look for the corresponding storage element.
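A minimal sketch of this indirection (the shard keys and server addresses are hypothetical):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ShardDirectory {
    // The shard key stored with the data is a *logical* identifier; the
    // directory maps it to the *physical* location of the database server.
    private final Map<String, String> locations = new ConcurrentHashMap<>();

    void register(String shardKey, String address) { locations.put(shardKey, address); }

    String locate(String shardKey) { return locations.get(shardKey); }

    public static void main(String[] args) {
        ShardDirectory dir = new ShardDirectory();
        dir.register("tenant-eu", "db1.example.com:5432");
        dir.register("tenant-us", "db2.example.com:5432");
        // Moving a shard only requires updating the directory entry,
        // not every piece of data that refers to the shard.
        dir.register("tenant-eu", "db3.example.com:5432");
        System.out.println(dir.locate("tenant-eu")); // db3.example.com:5432
    }
}
```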
Re-sharding your data is expensive and typically involves downtime, because while you’re shuffling the data around you really don’t want to change it, and depending on how much data you’re talking about, this takes time.
Sharding • Changing the number of shards or the sharding strategy may require an expensive and complex repartitioning operation • Transactions should only involve one shard (some shared data may need to be replicated on each shard) • Sharding was originally implemented outside the data layer, sometimes as part of the application logic. Some databases are starting to offer native sharding support Sharding was introduced when people were hitting the limits of how much a single database could manage. If you hit such a limit in a component, you keep the component as it is: you make copies of it and, in front of it, you develop something that can give you a unified view over all of these copies. Sharding was originally implemented by the applications accessing the database. Over time, since it was used in many large-scale applications, this feature migrated down into the databases themselves. Nowadays, when you buy a database that can scale beyond a certain size limit, it will offer you some native sharding support, with the advantage that, from the application side, it looks like the database can indeed store petabytes of data. Behind the scenes, all of these strategies will be available to be configured, which can help you to scale.
Sharding Different systems use different terms to name data partitioning for scalability:
• Shard: MongoDB, Elasticsearch, SolrCloud
• Region: HBase
• Tablet: Bigtable
• vnode: Cassandra, Riak
• vbucket: Couchbase
References • Martin L. Abbott and Michael T. Fisher, The Art of Scalability, Pearson, 2015 • Gregor Hohpe and Bobby Woolf, Enterprise Integration Patterns, Addison-Wesley, October 2003, ISBN 0321200683 • Martin Kleppmann, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable and Maintainable Systems, O'Reilly, 2017, ISBN 978-1-449-37332-0
Software Architecture
Availability and Services
11
Contents • Components vs. Services • Monitoring Availability • Impact of Downtime • Control Cascading Failures: Circuit Breaker, Canary Call • Redundancy with Replication: Consistency vs. Availability • Event Sourcing
After discussing how we can scale, let’s look into the availability of the architecture. As we have been looking at different qualities, we have been trying to find the architectural concepts (e.g., components and modularity, or interfaces and reusability, and so on) that are conceptually close to the corresponding quality attribute. Today we will make the jump from software components to services. We will use the difference between components and services to define the challenge of availability. When we look into the quality of flexibility and how the architecture can deal with change, we will make the next jump, going from services to microservices. As part of the lecture, we’re going to give an operational definition of availability. What does availability imply? And what is the impact if the system is not available? What could possibly go wrong if we have a downtime? This is particularly critical for services. We will show how we can monitor availability and which design patterns can help us build systems that are highly available. In the same way that we looked into how to scale and learned how to design a scalable architecture, we are going to look into how to design highly available architectures.
Example
Components vs. Services Map Drawing Component //lat, lng - any GPS coordinate on Earth //zoom - show details up to 1m resolution Map getMap(lat, lng, zoom)
• What is the "size" of this component? • How to deliver the component to customers so that it can be included in their own applications? To introduce the idea of what a service is and what the difference between a service and a component is, I would like to present you with this simple component interface. The component’s purpose is to display a geographical map. In the interface there is a single operation which receives GPS coordinates anywhere on the planet, together with a zoom level. The result is a map: you can see it as an image, or some other visual representation. Knowing that you can show a very detailed map, up to 1 meter resolution, or you can zoom out and just show the borders between the different countries, what I would like to ask you is: if you have to build this component and then deliver it to your customers, who will take the component and, for example, use it to display a map in their application – what is the size of this component? We know that we have to deploy components before we can use them in production, and to deploy them we have to package them so they can be released. And if we have to make them self-contained so that there are no dependencies, all the software and all of the data (the most important element in this case) that is necessary for the component to work will be included in the package. Do you have an estimate for how big this component is going to be? How many bits? If you try to go back in time, from your personal experience, do you remember a time in the history of software architecture in which the Internet didn’t exist? All software had to work offline. There used to be a time in which, if you had to package and release the component, you had to actually copy it from a golden master onto some physical media: floppy disks, CDs, DVDs, USB sticks. Clearly, you are limited by the capacity of the media. If you have to deliver the map component whose size is 7 terabytes, how many discs do you need? How much is that going to cost you?
Example
Components vs. Services
The software industry before the Internet looked like this. We were writing code, but there was a physical aspect to it: at the end of the build process you had to burn the software onto CDs. Imagine how much plastic e-waste was produced with every upgrade to deliver the latest version, how much energy to ship those disks around the world. Also, someone had to physically unwrap them and use them to load the software onto each piece of hardware to deploy the upgrade. One problem with maps is that they tend to change. New houses are built, new roads and bridges are built, and the map tends to become obsolete over time. We need a stream of CDs to ship whenever the map needs to be updated. After the Internet came into existence, people started to find ways for the software to be delivered more efficiently. You can take an image of the CD, make it available as a file, and then put it on a website or an FTP server. Users can connect their computers to the FTP server, download the image of the software and install it. This greatly streamlined the software delivery experience. There was no longer the ”commercial off the shelf” (COTS) software metaphor, because there was no shelf from which you would take a beautifully designed box, open it and get the CD out of it. Now you just had to find out the proper address of the website, and maybe after exchanging some payment, you would download and install the app. After many attempts, app stores were born and you had a huge selection of software. But this was still about finding an efficient mechanism for transferring the software: the bits to unpackage and install from their source into the environment in which they get deployed so that they can run. Another opportunity for delivering software through the Internet is actually to switch to a distributed architecture: a client/server style in which your software is not being delivered at all. It runs somewhere else, and if you want to use it, you’re just sending messages to it, getting results and interacting with it from a client. In this case you have the option of whether a specific client has to be installed. You solve half of the problem: the majority of the software, for example the data, you keep
it on the server. This makes it easier to keep it up to date, and we don’t have to download terabytes with the map of the entire planet if the client just needs to draw the map of a city. Just keep the data on the server and let the clients access the part that they are interested in. However, you still have the problem of delivering a compatible client that can connect to your server. Eventually this problem was solved by building a universal client, called the Web browser, which can connect to any application in the cloud. You don’t have to change the browser when you change the application that you’re using. The software in the client is the same, and it’s just so flexible that it can work with any back-end system.
Business model: how to sell
Components vs. Services
• Component developers charge on a per-deployment basis: whenever a new client downloads the component. • Component upgrades may be sold separately to generate a revenue stream • Components can be licensed to be redistributed within larger systems and developers can demand royalties from the revenue of the final product
• Service providers can charge on a per-call basis: each time an existing client interacts with a service by exchanging a new message. • Service providers can charge a monthly/yearly flat access fee • Services can be made available for free and providers can support them with advertising revenue
If you are trying to make money with software, what are the opportunities to do so if you work with a component-based approach? The idea is that you make money every time someone needs to deploy your software. In the old days your customers would need to pick up the box with the CD and would have to pay to carry it out of the store. Now the equivalent is that your credit card gets charged when you click on the ‘Install App‘ button in the App Store. Every time this installation happens, every time there is a new deployment, you make money. We all know that software is never finished, never completely done, especially software that is being actually used. Every time you make a change, every time you improve it, is another opportunity to extract money from your user community. At some point you decide that you need more money, so instead of releasing free updates, you decide that the next one about to be shipped is a major upgrade and users have to pay for it. The toolbar icons and the app logo design have changed so much, after all. Sometimes, after you buy the software, updates are free for a year, but afterwards you
have to pay again to get additional improvements, keeping the revenue stream flowing for the developers. These examples hold if you’re doing retail: if you’re selling the software to the end users directly. However, there is a whole other industry of OEM integration, where you are selling components to companies that integrate them into applications for the end users. In this case the users pay for the applications and then you get a royalty because your component was used inside the application that was sold. That’s another business model. The advantage is that you don’t have to worry about interacting with and supporting so many users; you just have one client, and the client will resell your software to the larger audience, maybe in combination with other components. These are the opportunities to make money which are implied by this simple – but revolutionary – idea that you can actually make people pay to get the right to use a piece of software. It was not always like this, because the software used to come for free together with the hardware. There was no point in selling software by itself because it wouldn’t work without the hardware: you paid for the hardware and the software was just an afterthought. At some point somebody thought that they could actually make money by selling the software by itself. And with this simple idea, you know the story, one of the largest fortunes in the world was made. However, at some point the Internet showed up and disrupted this model. When we turn our component into a service available over the Internet, other opportunities to make money appear. For example: every time users click, every time they interact with your service running in the cloud, you can charge them for that. If you think about how often someone installs a piece of software, as opposed to how often they use it: this is a paradigm shift. Having one exchange of value with one interaction, when users download and install your software, doesn’t provide any feedback on what they are doing with the software. Switching to a business model in which you are going to be paid every time they interact with your system changes everything. Pay-per-use could be too fine grained. Maybe it’s complicated to track, and there is a risk that it becomes too expensive if users use the service a lot. It’s also possible to select a flat-rate subscription model: to access my system, you have to pay every month, and you get a discount if you pay for a year in advance. This is often found with software as a service (SaaS) providers. More and more software is available for free. So if you make it available for free, how can you make money with it? One answer is the so-called surveillance economy, where users are hooked into using free applications and then everything that they do with the applications is tracked, mined and monetized, for example, to sell targeted advertising. If people give you something for free, you should wonder how they can survive financially. And what are they doing with your data? They may be going back to the original model where the software is free, but you pay for the hardware to run it. This will tempt them into using the software as an incentive to drive hardware sales, like when you need to buy a new phone to get the latest free updates of your favourite apps.
IT as a "manufacturing" industry (ship software components)
IT as a "service" industry (publish software on the Web)
In general, while going through the software lifecycle there is a point at which money is exchanged: for example, between the release and the installation, at deployment time, or later at run-time. While pay-per-deploy is the original model tied to software components, pay-per-use was not possible before, because after the software gets installed, users run it on their own system, which may be offline. This makes it challenging to track what the user is doing. We can see here the two reference examples of the two different models for the software industry. One follows the metaphor that we are manufacturing software components. This means that at a certain point they come out of the factory. As components get finished, they are shipped as part of the release; and that’s when you can sell them. It turned out to be a highly successful concept, especially before the Internet. Exploiting the opportunities of the Internet, we can make the software available as a service through the Web. We still need to build the software, but the software is not something physical that you will ship out. You just deploy it on some servers in the Cloud. What really matters is: what is the quality of service that you provide to your users? You switch from a product-oriented type of industry into a service business where you establish long-term relationships with your customers, where customers come together with you to create value, as opposed to just taking a box off the shelf and walking out of the store.
Design decisions
Components vs. Services
• Constraint: Encapsulate implementation behind well-defined interfaces, free choice of connector
• Decision: to buy or to make a component?
• Promote reusability and independent component development
• Constraint: Remote connector (e.g., message bus)
• Services operated by different organizations: high availability, security/trust and loose coupling (independent evolution)
• Many service providers compete to deliver "Software as a Service" from the Cloud
Let’s look from a technical perspective at the differences between components and services. Let’s just recap the definition of a software component. We draw a boundary between the component and the rest of the system so that we can encapsulate the implementation behind a well-defined interface. Components are meant to be connected to each other using a suitable software connector. In this case we don’t have a constraint on the type of connector that we can use. The main decision, once we identify that there is a specific component in our system, is whether we are going to buy it or make it ourselves. If we buy it, somebody is selling it to us and we go back to the previous discussion on the business model. If we make it in house, we need to write the code and go through all the quality assurance processes before we can release it in production. The purpose of all the component technology that we have seen so far is to make it possible to reuse components, and to make it possible for them to be developed independently from each other. You decompose an architecture into different components so you can parallelize the construction work: you don’t want to wait for a component to be finished before you start with the next one. You can do all of them together because they have well-defined interfaces.
If we switch to services, we assume we can access our software through the Internet. The critical decision becomes: how are we going to connect to the software if it is delivered as a service? We have to go through the Internet, therefore we will have a remote or distributed deployment; as a consequence this puts a limit on which types of connectors we can use. It doesn’t make sense to invoke a service running in the Cloud while attempting to use a shared memory buffer to exchange the data. The service is over there, your client is over here: there is no way that you can use local shared memory to send and receive information. Still, various connectors are possible. The one most frequently mentioned when we hear about services is the message bus, which used to be known as the enterprise service bus (ESB). Due to its asynchronous nature, it also gives you better availability.
While in the original component paradigm you would need to download and install the software locally, this would imply that somehow – not necessarily in a legal sense – you would be owning the piece of software. The license to use it gives you the right to install and start the application on your computer. Nobody else is doing it for you: you are in control of the operational side. You don’t necessarily build it yourself, but once you buy it, you download it and you run it, so you operate it yourself. With services, it’s exactly the opposite: the software that you use is operated by someone else. This other organization is responsible for making it accessible to you with good enough quality, including availability. Additionally, there are multi-tenancy issues related to trusting that when you send your data into their system, this data is not, for example, revealed to your competitors. It is theoretically possible to send encrypted data into the cloud so that it can be processed there without decrypting it, but doing so is still very expensive. If you cannot run the software locally, when you send the data to the other side you trust that the data is forgotten once the response comes back.
The fact that services are operated by someone else means that someone else is going to make decisions about their life cycle: they will switch them on and off whenever they want, and they will apply changes to their services without necessarily warning you. Some popular service APIs post countdown timers on websites announcing breaking changes years in advance, so that client developers know the changes are coming. Especially if the changes are incompatible, they need time to prepare for that. In other cases you wake up one morning and discover that your system is broken, because the service provider went bankrupt and has disappeared. It is similar to buying a flight ticket, going to the airport, and discovering that the airline is bankrupt and your ticket is not going to take you anywhere. This can always happen: when you make the choice to rely on a service, your software depends on its availability and you are willing to take the risk that it may disappear.
The other interesting aspect is that once you can access the software as a service in the cloud, you have a whole marketplace with a set of competing providers that offer you their software. You have a choice where to go and which software to use, at different price points, with different features, delivered with certain qualities and, most important, associated with a brand which you may or may not trust, also depending on the reputation of the provider.
Components were born in an offline world, so component technology is much older than services. Still, you can have remote and distributed components. We still call them components if the organization that is running them is the organization that bought them. So you buy the component and you run it yourself, even if you deploy it across a distributed environment. You are in control of its life cycle. A service is a component that is remote but is also provided to you by a separate organization that is in full control of its lifecycle. All you can do is call the service and hope that it answers you. There can be some expectations and guarantees, but you cannot do anything if the service provider is offline, or when a service disappears.
With the component you still have the bits of the software, which can run as long as you can deploy them in a compatible environment. In other words, if you are in a company with an IT department that supports the applications within the company, when the lights go out and the application stops working, you have a phone number that you can call and somebody from the same company answers it. This is internal support for ensuring the availability of the component. If the component is delivered as a service, the number that you call will be answered by an external entity, and you hope that they still answer and fix the problem with a comparable speed.
Overall, the main difference between remote components and services is not really a technical one: it is about ownership and responsibility, that is, who to blame for unavailability.
Technology: how to use
Components vs. Services
• Components need to be packaged to be deployed as part of some larger application system
• Services need to be published on the Web by advertising their location to potential clients
• Components need to be downloaded and installed, assuming they are compatible with the existing framework used to develop the system
• Services are invoked using standard protocols based on HTTP, XML or JSON
• Problem: There are (too) many incompatible component frameworks
• Problem: Services are distributed, remote components outside your control. Beware if they become unavailable!
Let’s look more closely at the detailed technological steps needed to deliver the software so that it can be integrated within our architectures. Delivering software as a service from the cloud is different from just releasing your software, publishing it on a website so that users can download, install and run it locally. We have already seen the build pipeline that we use to go from the original source of the software to the component that we can release and install. There is a point at which you transfer bits and copy them onto some environment that you control. Assuming that you are downloading multiple components and you want to use them together in the same architecture, you make the assumption that the components are compatible. Component interfaces must be compatible with each other, but they also need to be compatible with the underlying framework they require, and with the programming language runtime. In other words, to take a random example, if you go to the Eclipse marketplace and look for plugins, you look for components meant to be deployed on this platform, so you expect that you will find components written in Java. The chances that you can easily take a component that is written in Ruby – they are called gems – which is meant to work within the Rails framework, and deploy it within Eclipse are slim. Those are completely different technological platforms. NPM modules cannot be installed using “apt-get”. Package managers work great as long as you stay within the same platform. If you try to mix components from different languages and different frameworks, then you will have some problems, due to the large number of programming languages and the even larger number of different, incompatible by design, frameworks.
Some of you might mention the word ”Docker”. This is a recent development which tries to solve such component incompatibility problems and provide a standard way of packaging and installing a component, no matter which programming language it is written in, as long as it can run on a Linux-flavoured runtime. Components deployed in containers expose programming-language-independent interfaces which make it easier to connect them together. Such standard interfaces were originally invented to connect services.
If we switch to services, the goal is completely different. One of the reasons why there are so many frameworks is that vendors make money with the frameworks as they attempt to establish a virtuous cycle: a framework is successful because many developers write components for it, and the more components there are, the more developers will use the framework to find components for their applications. Therefore, if you establish a new framework, you want to lock in the developers and – even if the programming language appears to be the same – you will define framework interfaces and APIs that are incompatible. Once you write a component for that framework, it will be very challenging to port it somewhere else due to its framework-specific dependencies. This makes sense if your business model is tied to getting people to use your technology and making them dependent on it.
If your goal is to make your software accessible through the Internet, then you want the software to be as compatible as possible. You try to make it accessible from clients written in any programming language. You seek the freedom to implement your service behind a standard interface running in the cloud, so that you can implement it with the most suitable language. You also don’t want this information to be visible or known to your clients. It wouldn’t make sense to state: you can use Google, as long as you use this particular version of C++ with this compiler to invoke its search API. If you switch the compiler, then you risk becoming unable to call this particular Web API. This has been a huge evolutionary shift in mindset and also in the technology base. After decades spent standardizing all the necessary protocols so that you can exchange data and represent it in a programming-language-independent way, today most interfaces use HTTP as a way to transfer the data and, on top of it, either XML or JSON to represent the payloads, the message content sent back and forth. Another important aspect is the discovery of matching interfaces. We already discussed registries and directories. They are also relevant with services, because you need to find out where on the Web a reputable service provider that you can trust can be found.
For these reasons we have a big difference: services put the emphasis on interoperability, so that the service interface is accessible from as many clients as possible. Components face the challenge of portability: components written in some languages need to run in different, potentially incompatible operating environments. This is not a problem with services, because the service provider is in control of the operating environment. They can optimize it and choose the most suitable one to ensure good performance, scalability, and availability.
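To make the interoperability point concrete, here is a minimal sketch (in Python, though any language with an HTTP client would do) of a client invoking a service through an HTTP+JSON interface. The endpoint URL and the payload shape are hypothetical:

```python
import json
import urllib.request

# The client only needs to speak HTTP and parse JSON: it neither knows
# nor cares which language the service behind this URL is written in.
request = urllib.request.Request(
    "https://api.example.com/search",  # hypothetical service endpoint
    data=json.dumps({"query": "software architecture"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request, timeout=5) as response:
    results = json.loads(response.read())  # payload decoded from JSON
print(results)
```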
Components vs. services also impact software piracy. If your business model depends on people downloading and installing the software, then it’s very easy to make a copy and steal it. If you are delivering the software as a service, nobody will ever see your software. The software stays safely locked inside your data center and people just call it from all over the Internet. To run a Google search you do not have to install your personal copy of google.exe – such a piece of software is never going to be downloaded anywhere. It stays within the boundaries of the service provider, which however will need to deal with account sharing and fake API keys.
While services can be made accessible from anywhere, the remaining issue is that they run in a distributed environment outside of your control. Once you draw a dependency in your architectural model stating that your system depends on a service API, a big red flag should go up: what if this service API becomes unavailable? Such unavailability could be temporary – the service becomes unreachable for just a few minutes and nobody notices. Or the service disappears for a few days and then you read about the outage in the newspapers. Or maybe the service provider goes bankrupt and then your system depends on something that doesn’t exist anymore. If you deliver a service which depends on external services, their availability will have a strong impact on your own availability. How can you guarantee that your own system will actually work if you are not sure whether your dependencies will be there when you need them?
When software runs in the cloud, sometimes we have thunderstorms. The system may not be reachable because there are network problems. What if you try to access a system while you’re traveling around with your mobile device? What if you are crossing the Swiss Alps and you don’t have very good connectivity? Will the software work in offline mode? You have experienced a network partition: the user device is disconnected. How is this going to affect the user experience? Are you going to just disable the user interface and stop the application from working? Once you reconnect, you enable the user interface and everything works again because the backend in the cloud is reachable. How can you deal with this problem? Add a backup link. Make the connectivity more reliable. Switch from the WiFi to the 4G or 5G antenna (or vice versa).
We can generalize this concept: the fundamental idea that we can use to improve availability is to introduce redundancy into the architecture. If something stops working, you have a backup, an alternative that hopefully will take over so that the system still works. We have seen that redundancy and replication also help with scalability. For example, consider a content delivery network: we need to serve read-only assets and we can spread them out throughout the CDN so that we can improve the bandwidth and latency towards a large number of clients. If we make multiple copies of the data, we have introduced redundancy: somewhere we have a mirror which we can use if the main site is offline or unreachable.
But is it a good idea to stop the user from working when offline? If there is a glitch with the Internet, should the world stop? Within every architecture there should be the possibility to revert back to a state before the Internet existed. Once upon a time, we used to know how to run software without the Internet. We should still be able to do it today. Just download a local copy of the software and the relevant data. Even if we are disconnected, we can still allow users to work with it. Then at some point we assume that the connection comes back and we can synchronize: apply the local changes to the server-side state, while fetching any updates from the server. There is a price to pay: conflicts need to be detected and possibly resolved.
Availability Questions
Is it running? Where is it running? Is it reachable? How long does it take to reply?
[Plots: response time as a function of the workload, and response time as a function of the available resources]
How long are you willing to wait for the reply? The first question you can ask to measure whether something is available concerns whether it is running or not. What is the current state of execution of your system? Either the system started, it’s running, it’s ready for the users: it is available. Or the system stopped, crashed, is stuck in an infinite loop, takes forever to respond: it is not available. Now if it is running, you may also need to know where it is running. That’s a different level of awareness: from the user perspective, it is enough to know ”if I need to use it, it’s running and I can use it“. But from an operational perspective, you should also have a more precise idea: we started it and it is now running on this machine. We deployed it using this container and we run the container on this machine, so if something happens to that particular machine, this component will be affected. If you just know that it is running somewhere, you have no ability to predict what kind of failures will impact it. If you know where it is running, then you know which machines you should protect and which you should be careful not to unplug at the wrong time. This is a most important concern on the service provider side: I’m running it, it’s over there, and I’m watching that nothing bad happens to that machine. Since it’s remote, there is also a concern from the client perspective: the provider says it’s running and reachable, but it doesn’t work for me. So availability may be
affected by network reachability issues. If you get disconnected, having a second network connection can help mitigate the problem and improve the reachability of the service. The fact that it is running somewhere is necessary, but it is not sufficient if you cannot reach it, so both of these conditions have to be true: it has to be running and it has to be reachable. Then we can start to look into the details and ask: how long does it take to reply? I’m going to upload my video on Panopto, and it takes an hour and a half to process it. Is that really available? Yes, it’s reachable, I can transfer the data and it’s answering when I refresh, but there is this progress bar that is going up and down. You see something that tells you we are at 95%, and then after one minute it’s gone back down to 20%. So it’s particularly hard to estimate when you’re going to get the final answer. One way to explain this is to look again at the scalability curves: the response time depends on the workload, on how busy the system is. Green values show that the system has an acceptable response time. How long does it take to reply? Not too long: if I can get the video uploaded before the lecture starts, it’s fine. But if we are in the red zone, we still get a reply, eventually, but the reply is late. If the reply is late you have to ask yourself: are you willing to wait that long? If the reply comes too late, it may be correct, but it may be completely useless. When you worry about the availability of a system, only as a first approximation can it be black or white: if it’s running and reachable, it’s available; when it is down and offline, it is not available. Then you can start to wonder: do I really care about availability unless I need to use it? And when I need to use it, is it going to give me a timely reply? This way we can translate the availability question into: how long does it take, and how long am I willing to wait for it? This can also be seen as: how soon will the client time out? When you send a message, you may expect a response; but if the response doesn’t come within a reasonable amount of time, you can conclude the response will never come. You’re simply not going to wait anymore. Even if the response comes later, you no longer wait for it, so you are going to miss it, because you already concluded that the service is not available. From your perspective, the perspective of the observer, or the client: if a system is too slow, it is not available.
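In other words, the client’s timeout is what operationally defines availability from its point of view. A minimal sketch of this idea, assuming a hypothetical health-check URL:

```python
import urllib.request

def is_available(url: str, patience_seconds: float = 2.0) -> bool:
    """Available, from this client's perspective: running, reachable,
    and replying before the client gives up waiting."""
    try:
        urllib.request.urlopen(url, timeout=patience_seconds)
        return True
    except OSError:  # unreachable, connection refused, or too slow (timeout)
        return False

# A service that answers in 3 seconds is "down" for a client only
# willing to wait 2: availability depends on the observer's patience.
print(is_available("https://service.example.com/health"))
```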
Monitoring Availability
[Sequence diagrams. Watchdog: the watchdog periodically probes the service and receives timely "ok" replies; when probes repeatedly time out, it concludes "service is down". Heartbeat: the service periodically sends "beat" messages to a monitor paced by a clock; when beats stop arriving for too long, the monitor concludes "service is down".]
You cannot know if a service is available unless you try to interact with it. To keep track of availability, however, we don’t want to rely on the users reporting outages. Instead, we want an automated system that can continuously check the availability state of a service. Given the need for an observer to determine the availability of a system, there are two different options for monitoring the availability of services, depending on whether the service is passive or active. One is to periodically send probes from another system, called the watchdog. The watchdog periodically probes the service whose availability we want to check. In case the service is active, we use a heartbeat message which is periodically sent from the service back to the monitor (or passive watchdog).
The watchdog is a component whose purpose is to call the service and check whether the service gives an answer within the expected time. The watchdog (active) calls the service (passive). If it receives a timely answer, we can conclude that at this point in time the service is available. If we send another request, another probe, after some time, maybe it takes a bit longer to answer. The watchdog does not only measure whether we get an answer or not – that’s black and white – but it can also measure how long it takes to get an answer. Based on the response time, we can determine whether the system is fast enough. However, if we send a probe and we don’t get an answer, we cannot wait forever: there will be a point in time at which we decide we have waited long enough. The watchdog times out. And then we try again. If we don’t get an answer after a certain number of attempts, then we can conclude that there is an availability problem with this particular service. We go from a state of availability where we get a timely answer, to an intermediate state in which the performance is degrading. Eventually we have lost the service: we have been patiently waiting for a very long time but the service still doesn’t answer. That’s the worst case scenario: a slow failure. It’s also possible that when we
try to connect to the service, we immediately get an error. This is a faster way to determine that the service is not available. With this type of watchdog, the service is passive: the service is just servicing yet another request. It doesn’t care, and it should not need to know, whether requests are coming from a normal client or from the watchdog. The watchdog is just one more client that checks whether the service is capable of processing some probe request. Still, you need to decide whether the watchdog calls a normal operation of the service, or whether you have a special interface that you use just from the watchdog to check whether the whole service is available. If you have a dedicated and separate probing endpoint, you need to deal with false alarms, for example when the probe times out but the main interface is still working, or vice versa. The same holds if you use separate network planes to transfer production traffic independently of the systems management messages.
If our goal is to monitor the availability of the service, is this solution enough? What could possibly go wrong? We’re making an assumption here. We have two components, one watching the other. What happens to the watchdog itself? Who’s watching the watcher? Consider setting up a cross configuration, with a ring or a chain of watchdogs that check each other for availability, and then also collectively check the service. We need to worry not only about the availability of the service, but also about the infrastructure that is checking whether the service is available or not.
The service is available if it answers the probe within the expected time. The probe can be any message accepted by the service API. The challenge of having a watchdog that probes the service is to make sure that a successful answer is a representative observation, where the same behavior would also be perceived by any other client. Better avoid the situation in which the watchdog thinks that the service is available, but the service does not work for any other client. If you don’t get answers, you insist: you keep sending multiple probes. Only if no answer comes back after multiple attempts can you conclude that the service was not available. How many times should you keep trying? It’s also possible that when you try to connect, you cannot reach the servers. Also in this case, from your perspective there is an availability issue, but the problem could be due to networking issues on your side: although the service is running, you cannot reach it. This is why you may need multiple watchdogs to reach the conclusion that the service appears not to be available from anywhere. Is it down for you too?
There are some scenarios in which you can afford to have a service which is active. This means that the service will periodically send heartbeat messages back to a monitoring component, which will basically check whether the messages arrive when they are expected. This alternative design is also used in practice, with the assumption that the service is actually going to notify the watchdog about its existence, for example when registering its location with a directory. We have a service that periodically notifies the passive watchdog of its availability. That is, the watchdog expects to hear from the service every so often. In case this doesn’t happen, the service has skipped a heartbeat, or the watchdog missed it. Probably something happened on the other side.
There will be a window in which the watchdog is listening for messages; if the message arrives on time, it can conclude that the service was running. If you don’t get any message you can start to worry, but the message could still arrive late: we have a slow service that for some reason is behind schedule. If however this situation persists like in the example, with two missing messages – it could be more, depending on how sensitive the watchdog is – we can conclude that the service was lost. While the monitor can only show when was the last time we heard
from the service and how many heartbeats went missing, if this time grows above a certain threshold, the watchdog may start to actively probe the service to confirm that it is indeed no longer running. In the same way the directory helps to discover the location of the other components, the watchdog helps to keep track of the availability state of the other components. Both suffer from the same limitation, since the watchdog cannot monitor itself and the directory location cannot be looked up on the directory itself.
What is the role of the clock? The clock tells the monitor when to expect a new heartbeat message. There has to be an agreement between the clock and the service regarding how often the message is sent. This is similar to the decision on how often the watchdog should probe the service. You don’t want to do it too often, because otherwise the watchdog is overloading the service. But the less frequently you do it, the longer the delay between when the service goes down and when you can actually detect it. In the same way that you have to decide for how long, or for how many attempts, you are willing to wait before you give up and declare a connection timeout, we also need to configure how frequently to perform each attempt. It may also be a good idea to employ exponential backoff in case the probe itself may overload the service. Another aspect to consider, for example with deployments in an Internet of Things environment, is the energy consumption induced by the watchdog or monitoring system. Sending probes and heartbeat messages has an energy cost. The more you send, the faster the battery gets depleted. So you need to trade off the delay with which you detect outages against the duration of the window in which a battery-powered watchdog can run unattended.
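As an illustration, here is a minimal sketch of an active watchdog probing a passive service; the probe URL, probing interval, and failure threshold are hypothetical configuration choices that embody the trade-offs just discussed:

```python
import time
import urllib.request

PROBE_URL = "https://service.example.com/health"  # hypothetical probe endpoint
PROBE_INTERVAL = 30  # seconds between probes: too short overloads the
                     # service, too long delays outage detection
MAX_MISSED = 3       # failed attempts before declaring the service down

def watchdog() -> None:
    missed = 0
    while True:
        started = time.monotonic()
        try:
            urllib.request.urlopen(PROBE_URL, timeout=5)
            missed = 0  # timely answer: the service is available
            print(f"service up ({time.monotonic() - started:.2f}s)")
        except OSError:  # timeout, connection refused, unreachable
            missed += 1
            if missed >= MAX_MISSED:
                print("service is down!")  # alert the operators
        time.sleep(PROBE_INTERVAL)
```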
Which kind of monitor?

                     Watchdog                 Heartbeat
Service is           passive                  active
Must be reachable    the service              the monitor
Can track            response time            –
Detects              missing response,        missing request
                     connection refused
To understand the difference between these two solutions, I would like to ask you to fill out the different properties for each of them. Unlike a monitor, with a watchdog you can not only check whether the service is available or not, but also track its response time. If you use a watchdog, the service has to be reachable, since the interaction starts from the watchdog, which has to be able to contact the service. You can use an active watchdog when the service is passive: the service by itself is not aware that it is getting probed. It does not have to do anything extra so that you can track its availability. The service is just deployed and started; the watchdog calls it and the service replies as if it were receiving requests from any other client. The watchdog periodically sends messages to the service. This is the difference with the heartbeat, in which the service has to send the messages to the watchdog. And this means that the service, in addition to all the other things it is doing, has to do one extra thing: it has to remember to send the heartbeat on time. For this to work, the monitor has to be reachable.
When it comes to the errors, you should put yourself in the perspective of the component that is monitoring the system, that is, the perspective of the watchdog or the monitor. If you send a probe and after a while you don’t get a response, then you know that the service is probably not available. If you send another probe and the response is also missing, then you can raise an issue with the availability. If you try to send a probe and the connection is refused, that’s an even stronger sign. You don’t need to wait to detect that there is a problem: since the service does not accept your connection, you know that it will be impossible to send out the probe. As opposed to waiting for a missing response after sending the probe – the probe goes out into outer space and you listen into the void for a faint echo – if you cannot even send the probe, then it’s clear the service is not available.
From the monitoring perspective, the heartbeat helps to detect availability issues when there is no request coming from the service that is supposed to send the heartbeat messages. The service itself can still get a connection refused or a missing response, but since the service is the active party, these errors tell the service that the monitor which is supposed to receive the heartbeat is not available; from the monitoring system perspective this is not useful information. If you introduce a watchdog, you have a service that is just there waiting for clients, and the watchdog becomes yet another client that of course has to be able to reach the service. It has to be able to connect and send the request. If the response comes back, you know that the service is available. If you choose the heartbeat alternative, you are in the opposite situation: you have a service that is active and needs to be able to send messages to the monitor. If the monitor doesn’t get the heartbeat message from the service, it knows that the service has disappeared. In terms of connectors, since no response is required for heartbeat messages, they can be sent asynchronously over a message bus, while the watchdog uses a call interaction with a request followed by a response message.
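A minimal sketch of the monitor side of the heartbeat design, assuming beat messages arrive asynchronously (for example, via a message bus subscription) and a clock invokes the periodic check; the period and tolerance are hypothetical:

```python
import time

HEARTBEAT_PERIOD = 10  # agreed with the service: one beat every 10 seconds
TOLERATED_MISSES = 2   # how many skipped beats before raising an alarm

last_beat = time.monotonic()

def on_heartbeat() -> None:
    """Called whenever a beat message arrives from the service."""
    global last_beat
    last_beat = time.monotonic()

def on_clock_tick() -> None:
    """Called periodically by the clock to check the silence so far."""
    silence = time.monotonic() - last_beat
    if silence > HEARTBEAT_PERIOD * (TOLERATED_MISSES + 1):
        print("service is down!")  # no beat for too long: probe or alert
```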
Availability Incidents
No Impact: The service is down while no-one needs it. A client may retry the call and find the service back up again (assuming idempotent interactions). Automated failover mechanisms can mitigate the failure.
Low Impact: Manual intervention required (a few hours) to recover the service, some customers start to complain. Easy to spot the cause of the failure, more difficult to find the operator/developer in charge to fix it.
High Impact: Media visibility of the (1+ day-long) outage. A whole disaster recovery team needed to bring it back up.

Let’s see what happens when something is not available. Of course, if you are down when nobody calls, nobody notices: there is no impact. Just like software nobody uses has no bugs, services nobody calls are always available. What happens when things go wrong? Sometimes you might have experienced a temporary glitch. You just retry the call and it works again. Something failed, but some automated recovery process takes over and fixes the issue. The impact is very limited, the failure has been contained. Maybe only one user notices something, but the other users are not really affected.
We can also have incidents with low impact. There is no automated recovery; we need some manual intervention. Typically the delay is caused by the fact that we have to find the right people who know which button to click and have the right to restart the system. It’s easy to fix once you find the right person who knows how to do it. When the downtime grows beyond a few hours, many people start to notice and you start to hear complaints. Since, however, the system works again after you restart it, user disruption is contained.
You can also have a major impact, when recovering from the disaster requires setting up a whole team and executing a small project that takes days to succeed. Imagine a popular service, some kind of social network: when such an incident happens there will be some visibility in the news. The media will report the outage. Or the outage will be reported by users through their social network accounts. In the same way you need multiple watchdogs to monitor one another, we need multiple social networks, because people can vent on one social network and complain about not being able to use the other one, and vice versa. If you subscribe to the view that there is no bad publicity as long as people are talking about you, you can also take advantage of the spotlight to enhance your brand visibility and be creative with the error messages (the famous fail whale comes to mind). Even though people cannot achieve what they want when they try to use your unavailable service, they still get a little bit of entertainment and they don’t go away totally frustrated.
Downtime Impact
Revenue Loss: While the service is not available, the business is disrupted and the revenue may stop flowing.
Reputation: Customers are unlikely to trust a service with their traffic and their data if the service is not responsive and the data is out of reach when they need it. Planning and announcing outages in advance, as well as some degree of transparency in the recovery process of unexpected downtimes, may help to manage customer expectations.
Recovery Cost: The activities to diagnose and repair faults may require a significant effort and the provisioning of backup resources.
Refund: Long lasting outages may violate Service Level Agreements and trigger refunds (in cash or credit for future usage) to avoid litigation.

What’s the impact of worst-case scenarios? You lose money. When something doesn’t work, the business stops, economic value creation stops, financial flows stop. For example, there is a particular system to get payments from your customers: if this doesn’t work, you’re not getting the money. This can quickly become a critical issue; especially for high-throughput money-making machines, every second of downtime can translate to millions in lost revenue.
There is also an impact on your reputation. Clients over time learn to expect a certain quality of service. If, when they come to you, you don’t deliver on the expectations you created, if you break your promises – your service is not responsive, users entrusted your service with their data and oops, their irreplaceable bits have been lost, we are sorry for the inconvenience – this is a major issue. It takes a long time to build up a reputation and a very short time to tarnish it. Expectations matter. There is a big difference between planned outages, which can be announced beforehand, and the ones that happen without any previous warning. If you promise that you will be available again after a short downtime, then as long as clients are informed, they will not be so disappointed.
In 2020, this can also happen in the opposite way. There is now an expectation that when you try to order something in an online shop, typically the delivery windows are never available because they are so overloaded at the moment. One day I was surprised to find an open delivery window right away. I connected and everything was green. I placed an order and they actually took my order. Just like in the good old days: you need something and they are available, they will take care of it. And then a few hours later, I received one of those emails: actually there was an error, our service was mistakenly available at that time. It was supposed to be closed, but for some reason it was misconfigured and it was open and taking orders. So we are going to cancel the order and
refund the payment. Thank you for your understanding.
Availability is affected by overload and by the lack of capacity to scale beyond a certain workload, and thus it can be improved by investing in additional resources (assuming the architecture can scale to benefit from them). Restoring availability after failures requires paying the cost of recovery. How much effort do you need to bring it back up? Is it just a matter of restarting something? Or do you have to rebuild an entire data center somewhere else because the hurricane flooded the current one? How much time do you need to get the backup resources online? Depending on your service business model, if you have some subscribers that are not able to use your service during a certain time because of the outage, you might start to incur some liability. Not only do you lose revenue from pay-per-usage clients, but you also have to refund subscriptions or give them credit for future usage because your system was not available.
In general, with services, the expectation is that you offer them with 24/7 availability. Users from all over the world should be able to access and use your system no matter what, no matter when or where they are. And this is a very challenging thing to achieve. Let’s see what kind of architectural patterns we can introduce to actually be able to do that. How can we contain the impact of failures? As you can see in this model, there was this idea that even though an iceberg may punch a hole through the hull of the ship, only one compartment would get flooded, so the ship would not sink. How many flooded compartments can the ship survive? The idea is that you try to isolate the impact of the failure; you try to protect the overall system, even though parts of it might suffer. You need to prevent cascading failures from bringing down the whole system. You install fireproof doors which should remain closed. You put travelers in quarantine to stop the plague from spreading. You dig a second side tunnel across the mountain. To access the Internet, you keep both wireless and wired access, even for laptops. What’s the simplest idea that we can introduce in our software architecture to be able to do this? It boils down to the notion of redundancy. You want to have redundancy in your architecture if you want to increase the availability and reliability of your system. But with redundancy, the software becomes more expensive to build and to operate. You can ask your accountant, who is skeptical about the cost of the redundancy, whether you can afford the cost of the recovery.
Retry
How to recover from temporary failures? Retry the failed operation.
If the call fails or no response is received (a timeout occurs), repeat it. If a message is lost, resend it.
Let’s start with a very simple form of redundancy over time, which applies if you can expect the failure to be temporary. From the client side, when you try to invoke a service and it doesn’t work, what can you do? You just try again. Try again later, hoping for the best. You don’t give up; keep trying until it works. There are many events which may lead to the opportunity to retry a synchronous call: the call doesn’t go through, the call fails, or a timeout or disconnection occurs during the call. You are sure the request was sent, but you don’t get a response. After waiting long enough, just hit the refresh button, maybe it will work now. If something goes wrong while sending a message, just resend it. This implies that you need to keep a copy of the message until you’re sure that the other side acknowledges its receipt. Only when the other side confirms that they received the message can you forget it. Doing this will improve the chances that the message gets across, eventually.
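A minimal sketch of this resend-until-acknowledged discipline, assuming a hypothetical connector offering send and wait_for_ack operations:

```python
def send_reliably(connector, message, max_attempts=5, ack_timeout=2.0):
    """Keep a copy of the message and resend it until its receipt
    is acknowledged (or until we give up)."""
    for attempt in range(max_attempts):
        connector.send(message)                  # we still hold our copy
        if connector.wait_for_ack(ack_timeout):  # block until ack or timeout
            return True                          # acked: safe to forget it
    return False                                 # report the delivery failure
```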
Retry
[Sequence diagrams. Left: the client’s calls to a downed service time out; each timeout triggers a retry (possibly after a random wait) until the service is back up and the call succeeds. Right: against a slow service, the client times out and retries too soon, so the same request is processed twice and the client receives two responses (a late success, then success again).]
Retries eventually succeed (with temporary failures)
Retries introduce duplicates (redundant rework)
How does the retry mechanism work? We try to talk with the service, but the service is down, so there is no answer. Even if we wait, the answer doesn’t come back. At some point, a timeout triggers the repetition of the same call: the same request is resent. We’re lucky, the service is back up; now the service is available and we get the reply back. To retry is to be optimistic. It assumes that the service will eventually recover, that the failure is temporary. It also assumes that the messages can be resent as many times as you want. What kind of messages can be repeated? Can you always do that with arbitrary messages and arbitrary operations over service interfaces? There is a problem: when we repeat a message, when we retry, we introduce redundancy in the system, and this can be a problem. Let’s take a look at this other case, in which we have a slow service. Even if the service is slow, the first message still gets across. The message gets processed, but it takes longer than the client expects. If the client is not patient, if the client resends the message too soon, we now have a copy of the message that will also be processed. The client will receive not only the response to the original message, but also the response to the second one. Since we have introduced a duplicate message into our system, an operation that was completed slowly once is actually done slowly twice. As a result, we have more load on the service due to the client timeout being too short. The client was not willing to wait anymore, so the service had to do the work twice.
Retry
• Which operations can be safely retried?
• Idempotent operations (result does not depend on how many times they are executed)
• Stateless components (result only depends on input, no side-effects on the component state)
• How soon?
• Retry immediately (risk of overload by refresh)
• Retry after delay (exponential backoff)
• Which is better?
• fail after the first try (soon)
• keep retrying until it works (maybe forever)

Think about what kind of messages can be retried. Can you give me an example of a message which cannot be safely processed more than once? What could go wrong if we process the message twice? Maybe the request is about reading some information. If you retry, you may get two copies of what you were trying to read. Maybe one copy is more recent than the other, but there is no side effect on the service: read operations can be retried. What if, for example, we have a financial transaction, and the semantics of the message is to deposit some money into the user’s account? Of course the user is happy if you do it twice, but the bank is not so happy. And if you attempt to withdraw money, you want to make sure that you withdraw the money exactly once. Any kind of interaction which changes the state of the service incrementally, based on the previous state of the service, cannot be retried. Otherwise, side effects accumulate and both the result and the state of the service will be different depending on whether the request was retried or not. The idea is that retries should enhance availability but not change the basic functionality. If the service interface offers operations whose result does not depend on how many times you execute them – these are called idempotent operations – then retries are fine. You can also have stateless components: as we have seen, the result of stateless operations depends only on the input, so there is no side effect, and it’s possible to just retry the computation. You will waste CPU cycles, but there is no problem with the consistency of the system. How can you turn every other type of operation into something that can be retried? You have to put unique identifiers on the original messages and then perform message deduplication: you have to remember that you’ve already seen a message and you shouldn’t process it twice. There is a cost to check every incoming message against
the identifiers of all messages that have been previously received. The duration of the time window used to remember previous message identifiers should be set according to the retry timeout.
If the retry doesn’t work, we can keep retrying for a while, maybe waiting longer and longer between retries, to avoid flooding the service with too many of these retries. When a service is overloaded – it is not yet down but just on the brink of collapse – a few of these extra retries from impatient clients may be enough to bring it down for good. This happens more often than you think. Most users have been trained: if it takes too long, just refresh. If the service is slow and it takes too long, what do users do? Click refresh, and therefore the service gets even slower, even more loaded with even more requests. This can start one of these positive feedback loops in which the longer it takes, the more people refresh, and then it takes even longer. It’s hard to recover from these situations unless you’re willing to stop retrying, at least for a while.
Since we can protect the service from duplicate messages induced by retries, should we always retry? Should we keep retrying forever? The service is bound to come back up, right? What if we recover the service on a separate server, and the client keeps sending messages to the previous address? When is the client going to give up and consider doing a directory lookup to discover that the service has moved and was already available at the new location? We should not retry forever, since in the worst case failures are not temporary. The question is: how soon do we give up? When can we escape the retry loop to attempt a higher-level recovery strategy? What if we skip the retry at the first sign of failure, so that we can report the error as soon as possible? Retry is a good strategy if you have temporary failures that can be expected to be recovered soon, so-called glitches. While stuck in a retry loop, the system will not necessarily report that there is a problem, because it’s hoping that it will work at the next retry. This means that users may perceive the system as hanging, and their usual recovery strategy (click refresh) is already being attempted by the system, so it’s no use for them to add another manual retry on top of the automatic ones. The message is that if infinite loops are the enemy of developers, infinite retry loops are the enemy of operators dealing with outages. It would be a good idea to be able to exit the loop and just fail, as opposed to waiting forever while delaying the inevitable. Somewhere there should be a configuration option called: maximum number of retries.
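Putting these ingredients together, here is a minimal sketch of a bounded retry loop with exponential backoff and jitter; it assumes the operation is idempotent and signals failure by raising TimeoutError:

```python
import random
import time

def call_with_retries(operation, max_retries=5, initial_delay=0.1):
    """Retry an idempotent operation, waiting longer and longer in between."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_retries:
                raise  # escape the loop: let a higher-level strategy recover
            time.sleep(delay + random.uniform(0, delay))  # jitter avoids
            delay *= 2                                    # lockstep retries
```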
Circuit Breaker
How to avoid retrying synchronous calls forever? Turn a slow failure into a fast failure.
If a failure/timeout occurs during a remote call, the circuit breaker trips and avoids performing future calls to the failed service.
Wouldn’t it be nice to be able to remember from the past whether you were trying to interact with a service that was not available? You learned the hard way that it’s down, because you kept retrying for two days and it still didn’t work. So don’t even enter the retry loop the next time. If it didn’t work before, what’s the use of trying so hard only to give up after reaching the maximum number of allowed retry attempts? While it’s clear that skipping retries altogether is for pessimists, there is some pragmatism in trying to learn from previous experience. When we introduce a circuit breaker into the architecture, we specialize the remote procedure call connector to use a smart retry strategy. The problem with remote procedure calls is that they block the client until the call has completed. This may not be a problem if the service is available and the call completes with the usual speed, but once retries come into play, completing the call will start to take much, much longer than usual. The circuit breaker is a special kind of connector, a variation of the remote procedure call connector, which deals with these temporary failures, these availability issues, and helps clients avoid getting stuck in an endless retry loop while attempting to call a service which is already known to be unavailable. In other words, we want to remember the state of availability, or at least an estimate based on the last known availability state of the service that we’re trying to call. If we expect the service not to be available, we give up immediately; if we expect the service to be available, we call it. If our expectation turns out to be invalid, we switch between the two states. If it becomes possible to immediately respond that there is a problem, it may be possible to take some alternative action, recover on the client side, or simply inform the user right away and let them decide if they want to refresh. For example, if the call didn’t work, you can display the cached results from the previous call. You can make users aware of the unavailability without having to slow down their system.
Circuit Breaker
[Sequence diagram: the client’s first call is forwarded by the circuit breaker to the unavailable service and fails slowly, only after the timeout expires; the breaker remembers the failure, so the next call is not forwarded at all and fails fast.]
Let’s see how it works. If you make a call and the call fails, or it times out after some retries, then you will stop calling the service in the future. While the first failure takes some time – you have to wait for the timeout, after all – the next call fails immediately. We can see here how the interaction works. We have the client making the call. The circuit breaker is in the middle: it can be abstracted as a connector which takes the call and forwards it to the service. If the service is not available, it takes time before we detect that there is no answer. When there is a timeout, we have an error, which gets forwarded to the original client: the service you are trying to call is not available; despite several attempts, we didn’t receive an answer. Additionally, the circuit breaker remembers this outcome. The next time we try to make a call to that service, the circuit breaker will immediately respond with the error. This may seem like a minor optimization. However, imagine a complex architecture in which you have not only a client calling a service, but a client calling a service which calls another service which calls yet another service. If there are a lot of nested calls, a slow failure in one call will affect all the other calls. And it may take a very long time before the client notices that somewhere down the stack a service is unavailable, after everyone in between finally gives up retrying, having gone into timeouts several times.
Circuit Breaker
[State diagram: during normal operation the breaker is closed and forwards calls from the client to the service; when a failure is detected the breaker opens; during recovery, trial calls are let through to check whether the service is back up.]
Within the circuit breaker, the decision on whether to forward calls or stop them depends on the state of the breaker, which reflects the latest known availability state of the service protected by the breaker. A closed circuit breaker lets the call through when the service is assumed to be available. When the client makes the call, the circuit breaker forwards it to the service and the answer is returned within the usual time, depending on the performance and the workload of the service. However, if the circuit breaker detects that there is a problem with the service, it switches to the open state, in which the next call is bounced back to the client immediately. The detection can happen by observing the failed outcome of previous calls, or with the help of a watchdog. The metaphor works like in electrical circuits: when accidents like short circuits happen, the circuit breaker can save your life because it disconnects and removes the power automatically. To restore the original state from the disconnected state, you have to reset the breaker manually, that is, after solving the problem which tripped the breaker in the first place. Tracking the state of the breakers can help to monitor the availability of the corresponding services. One can easily build a monitoring dashboard showing the state of all circuit breakers in the architecture. You notice the ones that have tripped and, after recovering the corresponding services, you can manually reset them to let the calls through again. Or you can also try to do it automatically: every once in a while, when a call arrives, the breaker will try to forward it to the service to see if the service is back up. In case this particular call works, the breaker goes back to the normal operation state. You start this recovery attempt only after a certain amount of time, to give enough breathing room to the people trying to recover the service, so that you don’t immediately flood it with traffic.
Circuit Breaker
• Isolate the impact of failures (errors and timeouts)
• Need to determine how many failures can be tolerated before tripping the breaker
• Recovery: the circuit breaker needs to be reset (manually or automatically)
• Apply whenever you do not trust your dependencies to be available at all times
• The client needs to expect failures and tolerate them (use default or cached response value, temporarily disable functionality)
The main idea of the circuit breaker is that for a synchronous interaction, the client is calling the service and the service needs to be there to take the call. The success of this interaction is highly sensitive to the availability of the service. If you make a call and the service is not available, you typically end up waiting before you give up and conclude the answer is never going to arrive. The idea is that we can close the circuit breaker and immediately fail the call: “sorry, you’re trying to call something that is not going to answer you“. Clients want to know this as soon as possible. What can clients do when they get such a response? This response is not coming from the service because the service is not available. If clients are trying to read information from the service, this can be cached. If the service is not available with the latest updates, a previous version can be fetched from the cache, whose availability should be independent from the one of the service. Clients will not get the latest information, but at least get a copy from the past and maybe that’s good enough. If clients are trying to write into the service, the circuit breaker quickly informs them that the update didn’t go through. They will need to retry it again later when the service recovered. So it is not the responsibility of the circuit breaker to buffer messages and automatically resend them. The circuit breaker is not a message queue. Depending on the client, the rejected updated will need to be stored locally, or an error is reported to the users, who can wait for better times to retry the operation. One important decision regards: how sensitive is the breaker? Is it going to trip on the first failure? Or how many failures should happen before it switches to the closed state? Also, how do you recover it? How do you reset the state? Not only you have to recover the service, after the service is back up, remember to reset its breaker, or clients will not be able to call it. We introduce this type of solution when we do not trust our dependencies. In case dependencies are expected to go up and down, and we want our system to avoid get575
In case dependencies are expected to go up and down, and we want our system to avoid getting stuck either in a retry loop or just waiting for a timeout, we add a circuit breaker in between. When the dependencies become unavailable, clients will behave more predictably. They will fail rapidly and consistently, as opposed to just hanging and timing out after some time. This idea only works if the client is built in a way that can handle failures. When the circuit breaker trips, calls are guaranteed to fail fast. The circuit breaker does not remove the exception, it just makes it happen faster. Instead of waiting for 10 minutes and then raising the exception, when the circuit breaker trips, clients need to deal with the exception right away. Clients need a strategy to tolerate unavailable services. They can disable part of the user interface, but keep the main user workflow available. Maybe the call was to get some optional information about a minor subcomponent of the user interface. If it takes time before you realize you are never going to get the answer, you are going to block the whole system for nothing. If instead you can just hide or disable the failing component, but show everything else immediately, that's a much better experience for the user.
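As a concrete illustration, here is a minimal sketch of the state machine described above, written in Python. The class, the thresholds, and the wrapped `operation` callable are hypothetical choices for illustration, not taken from any specific library; the half-open probing implements the automatic recovery strategy just discussed.

```python
import time

class CircuitBreaker:
    """Wraps calls to an unreliable service. States: closed (forward calls),
    open (fail fast), half-open (probe whether the service has recovered)."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures    # failures tolerated before tripping
        self.reset_timeout = reset_timeout  # breathing room before probing again
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "half-open"        # let one probe call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.max_failures:
                self.state = "open"         # trip (or re-trip) the breaker
                self.opened_at = time.time()
            raise
        self.failures = 0
        self.state = "closed"               # probe succeeded: resume normal operation
        return result
```

A client would wrap each remote invocation, e.g. `breaker.call(lambda: fetch_profile(user_id))` (with `fetch_profile` being a hypothetical service call), and fall back to a cached or default response when the breaker raises.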
Canary Call
How to avoid crashing all recipients of a poisoned request?
Use a heuristic to evaluate the request: try the potentially dangerous request on one recipient and scatter it only if the recipient survives.
The canary call is a useful pattern when there is a chance that the recipient of a request may crash as a direct consequence of processing it. We call this a poisoned request. The idea of the canary call is that you do not forward suspicious requests to everyone: you first observe how a single recipient behaves when processing the request. If this gives you confidence that the request is not poisoned, then you can broadcast it. First try the request with one recipient, and scatter it to all the others only if the first recipient survives. We use the canary-in-the-coal-mine metaphor: if this particular recipient has a problem, you can expect that everybody else will have the same problem. The context for this pattern is the scatter/gather scenario, in which naively forwarding a poisoned request would crash all of your workers. The goal is to avoid a potentially high recovery cost.
577
Canary Call: Scatter/Gather (figure: a client request is scattered to workers A, B, and C)
Warning: there is a chance that scattering the request will result in crashing all the workers receiving it.
We illustrate the problem in the context of the scatter/gather pattern. You are not only sending the request to one component: you send it to many components at the same time. For example, with a master/worker architecture, what if there is something wrong with the input data and, after partitioning it, you send the chunks out to a cluster with thousands of parallel workers and crash all of them? To recover, you will need to reboot the cluster. It will take a long time before the system is available again. You didn't just lose one node, you lost everything, so all other requests got stuck in the queue. What we can do instead is first try the request on one worker. If it crashes, we don't forward the request to the others, we just reject it. This is a simple way to evaluate how dangerous the request is, with the advantage that if the request is rejected after one worker crashed, the other workers survive to process the next set of requests. If the request is successful on one worker, we can send it to everybody else. This is a pattern that we can apply in the context of a scalable architecture in which we want to spread out requests across multiple elements, and we want to protect these elements and prevent them from crashing when they receive a bad request.
Canary Call (figure: the client's request is first tried on worker A; the worker crashes, the request is rejected, and workers B and C survive)
The first request is sent before all others to check if it would harm the worker receiving it (canary request). If the canary request crashes the worker, it is not forwarded to the others, which survive the attack.
Canary Call (figure: the client's request is first tried on worker A; the request is successful, so it is scattered in parallel to workers B and C)
The first request is sent before all others to check if it would harm the worker receiving it (canary request). The decision on whether all other requests should be scattered depends on a heuristic: continue if the canary request is successful.
Canary Call
• Performance/Robustness trade-off:
• Decreased performance: the scatter phase waits for the canary call to succeed (response time doubles)
• Increased robustness: most workers survive a poisonous call, which would have failed anyway
• Apply when worker recovery is expensive, or when there are thousands of workers involved
• Heuristic: failed canary calls are not necessarily caused by poisonous requests
If we introduce this pattern, what's the advantage in terms of failures? Only one element will fail, as opposed to every element. In the successful case, when no failure occurs, what is the disadvantage? What do we lose? Compared to the original case, in which we directly risk forwarding the request to all the elements, it takes more time. The canary call requires a trade-off: how many of our workers do we want to save, and how much longer will it take before we can send the final response? First we have to try out the request in one place, and this takes time. Only if the request is valid can we work in parallel as we scatter it out. In the best-case scenario we have doubled the time it takes to service the request, since we first have to wait to see whether it works and only then can we process it in the other places. The canary call is not something to be introduced without being aware of its performance cost: to protect the availability of your cluster, you make every request run twice as slow. You want to apply it when recovering a worker is very expensive, or when so many workers could potentially fail that it would severely impact your data center. For example, consider a request to modify the server configuration or install a security patch. Would you fire off such a request to all of your servers without first upgrading one and observing whether it remains stable for a little while? If you lose half a data center, it takes time before you can recover it, and you may not have any backup while you are doing so. If a request fails one worker, is that a good indication to predict that all workers will fail? How can you be sure that a crash in one place means that it was a poisoned request? Or was that machine going to fail anyway? That's difficult to tell: the pattern relies on a heuristic decision. You try, you observe, and then you draw an inference about what may happen elsewhere. What's interesting is that today there are many tools to make such predictions; machines can learn. The canary call is also applicable with more advanced heuristics to validate requests. If too many requests are discarded, clients may complain.
If poisonous requests do not always crash a worker, they may make it past the filter and wreak havoc on your data center. A request could be perfectly valid, but it triggers a bug in the recipient; until the recipient is fixed, the request should be stopped from propagating. In the simplest case, you can always try to run it somewhere and see what happens. If you can anticipate and detect malicious requests without actually running them, that's even better.
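The pattern itself fits in a few lines. Below is a sketch, assuming a hypothetical `send(worker, request)` function that delivers the request and raises `WorkerCrashed` when the worker dies while processing it; the sequential canary phase followed by the parallel scatter phase is what doubles the response time in the best case.

```python
from concurrent.futures import ThreadPoolExecutor

class WorkerCrashed(Exception):
    """Raised when a worker dies while processing a request."""

def canary_scatter(request, workers, send):
    """Try the request on one canary worker first; scatter it to the
    rest only if the canary survives."""
    canary, others = workers[0], workers[1:]
    try:
        first_result = send(canary, request)   # sequential canary phase
    except WorkerCrashed:
        raise ValueError("request rejected: it crashed the canary worker")
    with ThreadPoolExecutor() as pool:         # parallel scatter phase
        results = list(pool.map(lambda w: send(w, request), others))
    return [first_result] + results
```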
Redundancy
High Availability is achieved by introducing redundancy for critical components (primary and at least one backup)
Cost of redundancy < (1 − Availability) × Cost of Unavailability
Fail-over is the process of automatically switching from the primary to a backup instance, which is standing by:
• Hot, ready to begin working immediately
• Cold, needs to be initialized before it can take over
Load Balancing also helps to increase availability when faulty workers are pulled from rotation

If you are going to fly on an airplane and something goes wrong with one of the engines, would you rather be flying on an airplane with four engines or an airplane that only has one engine? Redundancy is about paying the price to increase the reliability of your architecture by dedicating enough spare resources so that in case something goes wrong you can still land safely. If you cut costs in the name of being lean, then you will probably sacrifice the minimum level of redundancy that makes it possible for you to survive when something goes wrong. It may be a good idea to have some spare parts. To invest in cash reserves. To anticipate expected or unexpected problems in the future. Keep a few extra masks around the house for the next wave. When we design a software architecture, we can decide to introduce redundancy. That means that we will make multiple copies, replicas, instances at different levels of granularity. The entire system could be deployed in different data centers. Individual components will be installed and running in different containers, hopefully running on different physical hosts, installed across different data centers. Or the same interface could be re-implemented by different independent development teams to avoid replicating bugs. Redundancy will have a cost, but this cost should be less than the penalty that you will need to pay in case something goes wrong. If, when something goes wrong, there is no real problem – after all, users are always glad to take some unplanned vacations while you rebuild their server and restore the database from backups, and will just start happily working again when the problem is resolved – then maybe you can do without it. But if you cannot afford that, then it's better to invest up front to introduce enough redundancy. It will pay for itself, or even turn out cheaper, when things go wrong but your company workflows are not disrupted thanks to the redundant components:
when one of them fails, we can still use the others. If one of the engines of the airplane doesn't work, we can still fly because we have the others. To minimize disruption, in some cases you will keep these backup components hot. This means that as soon as something stops, you can immediately continue by using the backup. We say that the failover is instantaneous. Alternatively, you have to start from a cold backup, which takes longer to initialize and bring up to speed. If the hardware is not running to save energy, you have to power up the hardware. This takes time. Then you have to start the software and make sure it has the latest version of the data. Cold standby means that it's ready to go, but you have to switch it on and wait for it to come online. If it's hot, there is a higher cost to keep it running, but the advantage is that failover is supposed to happen instantaneously. You can also use redundancy in combination with the load balancing strategies that we have discussed earlier. With load balancing you introduce redundancy for the purpose of scalability, because you need more capacity. As a consequence, if one of the workers becomes problematic, you remove it from the pool. You stop sending work to this particular failed component. The rest of the system can survive because you already have the other workers, which are still available.
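As a back-of-the-envelope check of the inequality on the slide, consider a hypothetical service with 99.9% availability whose downtime costs 10,000 per hour; all numbers are invented for illustration.

```python
availability = 0.999                 # fraction of the year the service is up
hours_per_year = 24 * 365
downtime_cost_per_hour = 10_000      # hypothetical business loss while down

expected_downtime = (1 - availability) * hours_per_year        # ~8.76 hours/year
cost_of_unavailability = expected_downtime * downtime_cost_per_hour  # ~87,600/year

redundancy_cost = 50_000             # hypothetical yearly cost of a hot standby
worth_it = redundancy_cost < cost_of_unavailability            # True in this example
```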
State Replication (figure: clients write to and read from a single storage vs. replicated storage. With strong consistency, a write is propagated to all replicas before it can be read from any of them; with weak consistency, a read may reach a replica before the write has been propagated to it. Replication provides high availability, but a partition between replicas leads to inconsistency.)
It's easy to build a redundant architecture if you have stateless components. Trivial: you can start as many copies as you want and quickly initialize each component so that it can receive messages and process them, without worrying about its state, because there is none. So we're not going to talk about that. What we will focus on is: how do we make a redundant architecture out of stateful components? In other words, how do we keep replicated state synchronized and consistent?
That's the main challenge. When you have stateful components, you make multiple copies, each represented with the usual cylinder. When there is no replication, life is easy because we can update the information atomically. There is only one place where this information is stored, and that means that whenever we change it, it becomes immediately available for reading. After we introduce replication, to help in case one of the copies is lost, we have one or more backups. From a high-level perspective it should look the same. Clients can interact with the component, change its state, and read the updates made by other clients. Ideally, there should not be any outside impact due to the internal replication. Clients shouldn't notice that there are now multiple replicas that need to be kept synchronized. They should also not notice when we lose one copy and fail over to the other one. We want to keep the same interface for the stateful component and just deploy it in a redundant configuration, which can tolerate the loss of some replicas. However, imagine there are two copies; then we have two scenarios. In the first, you write into one copy, and then you synchronize with the other before reading. The first step is that you send an update: you write something into the system. One copy transitions to the new state, then the replica follows, and then you read. Since both copies have synchronized, when you read it doesn't matter where you read from, because they're all in the same state. We call this situation strong consistency. We guarantee that no matter how many replicas you have, the information that you write can be read from any of the copies. It's the ideal scenario, because you write something and then you can actually read it back. There is a cost: before you can read it, you have to propagate the update to all replicas. You have to synchronize the update, and this takes time. The other option does not give such a guarantee. You write into one copy, but when you read, you may get the latest value or not. It can happen that you read before the synchronization has happened: you read from a stale replica. This is what we call weak consistency. There is a risk that if you introduce replication, you are not always able to see the changes that you make immediately. There will be a delay before all replicas have transitioned to the latest version. Remember that we defined availability in terms of how long you are willing to wait. In some failure scenarios the replication delay becomes very long; maybe it becomes impossible for the two copies to synchronize because we have a partition between them. A network partition just means that the two sides cannot talk to each other, and therefore the synchronization will never happen, at least until the partition is resolved. Even if the replicas cannot synchronize, we can still write on one side and read from the other. Even if we retry, we keep reading the old data, because it doesn't get synchronized due to the partition. As a consequence, we observe an inconsistency between the different replicas. This is just a very high-level view of what can happen when we try to replicate stateful components. This problem occurs no matter whether they use a database or just keep their state in memory. As soon as we make another instance and deploy it somewhere, there will be a second copy of the component state, which will suffer from this issue. The advantage is that when something fails, you don't lose the data: you can use the other copy.
In this scenario, you have the main copy and the backup. You have two copies, so you can survive if one fails. When you lose a copy, you're back in the non-replicated scenario. Your goal is to avoid that, since you are only one more failure away from losing everything. If something fails, you need to spawn a new replica as soon as possible. Depending on how big the amount of state to replicate is, this will take time. Failures strike fast – disk crashes occur rapidly – while creating a new backup, especially if you need to copy several terabytes, will take a significant amount of time.
Which kind of replication? (figure: two sequence diagrams. With synchronous replication, the client's Write(X) is forwarded to every replica, and the client is acknowledged only after all replicas have acknowledged the write: strong consistency. With asynchronous replication, the client is acknowledged as soon as one replica has stored the write, while the other replicas are synchronized in the background: weak, eventual consistency.)
What is the difference between synchronous and asynchronous replication? Here is a more detailed view of what we have just discussed about strong and weak consistency. When you are doing synchronous replication, you write some information into one of the replicas, then you send the copy to the other replica. You wait for the replica to confirm that the update has actually been stored, and only then do you tell the client that the write operation has completed. If you do this, you get strong consistency as a consequence. Since the client has to wait for all the replicas to agree that the new state is the one that the client is writing, writes take as long as the slowest replica. If replicas are scattered around the globe in different data centers, there will be a high latency before they confirm, and you have to wait for the slowest replica to confirm before you acknowledge the client. That's the price to pay if you want to guarantee strong consistency using synchronous replication. Unless all replicas agree, the write didn't happen. If the client is in a hurry, you can speed up the write with the asynchronous replication alternative. You update one copy. As soon as you send out the update to the other replicas you already – very optimistically – tell the client that the information has been stored. What does this mean? Well, it has been stored in one place, but the synchronization with the others is still in progress. It can happen that the client receives the acknowledgement and, for some reason, wants to read the value. There is now a race between the client read and the synchronization. It is possible that the read will not return the latest value, because it reads from a replica that is not yet up to date. We call it asynchronous because the synchronization happens in the background. The main reason for doing this is to make the writes from the client faster. We give the replicas the opportunity to catch up at their own pace. As a consequence, we have weak consistency, since there is a point in time in which the replicas do not agree.
Since the synchronization is in progress, eventually the replicas will actually get to the new state. That's why it's also known as eventual consistency. We have seen two simple replication protocols, with different behavior when something fails. If one waits for all the replicas to reach the new state, the information will be redundantly stored before the write is confirmed. If it is not possible to duplicate the data, the write will fail. With the optimistic approach, the synchronization happens asynchronously. If you lose the only up-to-date copy before you managed to propagate it to the replica, you can have information loss after informing the client that the write was successful. It is possible that as you acknowledge the client you discover that there is a partition which doesn't let the write through to the replica. This means that until the partition is resolved, there will be only one replica which has the latest state. If you lose it, there is no backup.
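The two protocols can be contrasted in a short sketch. The `replica.store` interface and the `background` executor (e.g., a `concurrent.futures.ThreadPoolExecutor`) are hypothetical stand-ins for the real replication machinery.

```python
def write_synchronous(value, replicas):
    """Strong consistency: acknowledge the client only after every replica
    has confirmed the write. The write is as slow as the slowest replica,
    and it fails if any replica is unreachable."""
    for replica in replicas:
        replica.store(value)          # raises if the replica cannot confirm
    return "ack"                      # all replicas now agree

def write_asynchronous(value, primary, replicas, background):
    """Weak/eventual consistency: acknowledge as soon as the primary has
    stored the value; synchronize the other replicas in the background.
    A read may race ahead of the synchronization and return stale data."""
    primary.store(value)
    for replica in replicas:
        background.submit(replica.store, value)   # synchronization in progress
    return "ack"                      # optimistic: replication not yet complete
```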
CAP Theorem (Eric Brewer)
• Consistency: all replicas agree on the latest version of their state
• Availability: every request routed to a non-failing replica must result in a timely response
• Partition Tolerance: the network may lose arbitrarily many messages between replicas
(figure: Venn diagram of the three properties; only the pairwise intersections are feasible: CA systems are not replicated, CP systems are not available during a partition, AP systems are not consistent during a partition)
A distributed replicated system cannot have both Strong Consistency and Availability

What we have discussed is also called the CAP theorem. When you have a stateful component that is replicated across a distributed system, this theorem states a limit: you cannot have availability, strong consistency, and partition tolerance all at the same time. What does it mean to have only two out of the three properties? As you can see from the Venn diagram, we have consistency if all of the replicas agree on the latest version of the state. We have availability if, when we send a request, we get a timely answer. That's the quality we have been talking about today. And we have partition tolerance if the system keeps working even when the network loses the synchronization messages exchanged between the replicas. The CAP theorem states that the intersection of all three properties is not possible. You have to choose which pair of properties you want to have. Which one do you prefer: availability or consistency? Let's explore the trade-off. If you design an architecture in which you want to provide consistency, then, when a partition occurs, you will have to sacrifice availability. If you prefer to have availability, you will need to sacrifice consistency. Why do I say that only these two cases are interesting? Because the only kind of system that can ignore partitions is a centralized system that is not replicated. Only in such systems can you have both availability and consistency. As soon as you replicate, you bring in distribution and decentralization, and then you have to choose. A real network will fail. You will have partitions, and therefore your architecture can either be not available or not consistent.
CAP Theorem Proof (figure: a client writes X to one replica while a partition separates it from the other replica, which still holds the old value Y. A subsequent read can only either 1. time out (not available) or 2. return Y (not consistent).)
This is probably the only proof that we do in the lecture. Let's go back to the previous scenario, in which a client is writing and updating one of the replicas. Then the client wants to read, to check that the state transition was successful. Let's say a partition has just happened, so it is not possible to synchronize the replicas. Because of this problem, what could the read result be? If the client wrote X but reads Y from an out-of-date replica, the result is not consistent. Since synchronization was not possible, the result is stale. What if we wait, to give just enough time for the partition to resolve itself and the synchronization to work? Is the client going to be willing to wait that long? Eventually the client is going to time out and give up reading, but that means that the read was not available. Well, at least the client didn't read an inconsistent value; actually, it didn't read any value at all. These are the only two cases when we have a partition: either immediately read an obsolete result (available, but inconsistent), or wait forever for the result to arrive and in practice give up waiting (not available). In the second case, at least, if the client ever receives a result, it will be the consistent one.
Eventual Consistency (figure: two replicas evolve together from state A to state B; during a partition, one replica is updated to state C while the other remains at B, creating an inconsistency; after the partition is resolved, reconciliation brings both replicas to a consistent state C)
Some applications prioritize availability at the expense of consistency. During the recovery from a partition, conflicts between multiple inconsistent states need to be resolved manually or automatically (e.g., last writer wins). The data semantics within some applications may tolerate this for short periods of time.
While availability and consistency cannot both be guaranteed at the time of the partition, if we choose availability we can still recover consistency, eventually, after the partition has been resolved. If one waits long enough, the replicas will get synchronized. Eventually, after rebooting your WiFi router, you will be online again. Eventual consistency helps to keep your system available. You just don't promise that the latest state will be visible everywhere; you only say that if you wait long enough, it will get across. What does this mean? Here we have a replicated system that is in a certain state. Then we have the partition: the two replicas can no longer talk to each other. Is the system automatically inconsistent? Not yet. It's just a partition, but the two copies are still the same, so we're still consistent. There's still a chance that if you fix the network and reconnect, you never lost consistency. The inconsistency is only introduced when you do the update. If you change the state, only one of the replicas is updated, and the other one becomes obsolete and inconsistent. Is the inconsistency visible at that time? Not unless you can reach all the replicas and compare their state. After solving the partition, you have to decide which replica has the latest state. How to solve this conflict? It is possible to associate a version counter or a timestamp and see which has the most recent version. You may have experienced something similar when you tried to pull changes into your code and there was a merge conflict. Sometimes the merge works automatically, sometimes you need to resolve it manually.
If the inconsistency is small enough – if the amount of change that you have introduced is not too big, and the time during which some replica was offline is not too long – then it's possible to reconcile it in a finite amount of time. If a replica has been cut off for a number of years, then you have a lot of work ahead if you want to reconcile all the changes accumulated over such a long time. Still, despite the fact that you were inconsistent for a while, eventually you can go back to a consistent state. It all depends on how long "eventually" means for you.
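Here is a minimal sketch of the last-writer-wins reconciliation mentioned on the slide, assuming each replica tracks a timestamp alongside its state. Note that this policy silently discards the losing replica's update, which is exactly the kind of behavior only some applications can tolerate.

```python
def reconcile_last_writer_wins(replica_a, replica_b):
    """After a partition heals, resolve the conflict by keeping the state
    with the most recent timestamp and copying it to the other replica.
    Each replica is assumed to be a dict {'state': ..., 'updated_at': ...}."""
    if replica_a["updated_at"] >= replica_b["updated_at"]:
        winner = replica_a["state"]
    else:
        winner = replica_b["state"]
    replica_a["state"] = replica_b["state"] = winner   # consistent again
    return winner
```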
Event Sourcing
How to reconstruct past states? How to synchronize replicated states?
Keep an explicit representation of the state history: log all events that led to state transitions so that they can be replayed.
Speaking of accumulating changes, this is what the Event Sourcing pattern is about. This is relevant for highly available, redundant architectures in which there are replicated stateful components. The idea is not to store only a snapshot of their latest state, but to store the history of all the changes from which the current state can be obtained. This helps when you have to deal with conflict resolution, since you don't just have two inconsistent states to compare: you can see how you got there by tracking the sequence of changes. Even though your source code files conflict, you can resolve the conflict because developers have been editing different lines and you can merge them in the right sequence. Instead of storing the latest value, keep a history of all the changes that you've made. Since events trigger state transitions, we can use the term event to represent a change. If you still remember the previous state before the event occurred and the change that was applied, then you can obtain the new state. To store all events, we use a log structure. If you want to take it to the extreme, you can use a blockchain, but you don't have to. A simple log in which you have a timestamp and the information associated with the change event is enough. If timestamps are not possible, it is enough to keep an ordering between the events. If you log all the input that you gave to a stateful and deterministic component, given a known initial state, you can always replay the log and stop at a certain point to reconstruct the past history of the state of this particular component.
Event Sourcing (figure: a client sends PUT, GET, DELETE requests to a service backed by an event log. PUT X appends X to the initially empty log []; GET replays the log [X] and returns X; DELETE appends the event D; the next GET replays the log [X,D] and returns 404.)
Track individual changes as opposed to complete snapshots
For example, if we make a change to the state of a certain service and we use event sourcing, it means that we keep track of state-changing operations by appending them to the event log. A PUT operation, which means that you want to write the value X as the next state, does not simply set the current state to X, but logs the event (the state is now X) into the log. The log is represented as an array of all the changes that have occurred so far. It's easy to reconstruct the current state, because there is only one entry in the log. The initial state was empty, and after the PUT operation we enter the state X, which is what the GET (a read operation) returns. Here comes another change: DELETE should reset the state to the initial one. We're not simply going to do it directly. We're not emptying the state; we're actually storing the fact that we have a request to delete the state. The log grows to include this event (represented as D). When we read the current state again, we replay the log. The state is not X anymore because of the delete event. As a consequence, we show to the outside that after you delete something, the information is gone. However, in the log it is still present. If you wanted to undo the effect of some of the changes, you can still time travel along the past history.
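The PUT/GET/DELETE walkthrough can be condensed into a small sketch; the class and the event encoding are invented for illustration.

```python
class EventSourcedStore:
    """State is never stored directly: every change is appended to an
    immutable log, and reads replay the log from the initial (empty) state."""

    def __init__(self):
        self.log = []                         # append-only list of events

    def put(self, value):
        self.log.append(("PUT", value))       # fast write: just append

    def delete(self):
        self.log.append(("DELETE", None))     # deletion is also an event

    def get(self):
        state = None                          # known initial state
        for event, value in self.log:         # slow read: replay the history
            state = value if event == "PUT" else None
        return state                          # None plays the role of the 404
```

For example, after `store.put("X")` a `store.get()` returns "X"; after `store.delete()` it returns None, while the log still contains both events, so past states can be reconstructed.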
Event Sourcing Performance Trade-off
• Append-only log of immutable events (fast write)
• Replay log to compute the latest state snapshot from known initial state (slow read)
• Cheaper to store (or transfer) a log of state transitions vs. the complete history of state snapshots
• Cheaper to store one snapshot than the log of all state transitions to reproduce it
When you use event sourcing to implement state storage, writes are fast because it's enough to append them to the log, while reads can be slow because you need to replay the whole log, unless you keep a cached version of the latest state. The amount of data that you have to transfer is the critical bottleneck when you introduce replication in your stateful components. Event sourcing can help to build an eventually consistent, replicated solution in which you store the component state across multiple copies. As opposed to transferring entire and potentially large snapshots every time a few bits change, it is more efficient to send a stream of changes to keep the copies consistent. In general, event sourcing makes sense when the cost of logging individual state changes is smaller than that of storing a snapshot of the whole dataset. Consider introducing it if the data is large and there are few small changes. It may be inefficient if the data is small and there are lots of changes, so that the log grows above the size of just taking a snapshot. How long is the history of state transitions? If you keep your stateful component running for a long time with lots of clients invoking it, then you have a long history, and it becomes cheaper to store the current state and read it directly, as opposed to replaying the whole log to compute the latest state. At some point, even with event sourcing, you will take a snapshot anyway and only keep the changes from the snapshot onwards. A good time to do so is when all replicas are in sync. By flattening a large log of state changes into a snapshot of the current state you can save space, but you lose the ability to time travel.
Event Sourcing Advantages
• Record audit trail (what happened when, who did what)
• Synchronization (replay from the last known event to catch up)
• Conflict Resolution (make sure events are played in the same order)
• Rollback (replay up to a given event)
• Reproduce bugs (replay production log on a test/upgraded system)
In addition to the synchronization between replicas and the fine-grained conflict resolution, event sourcing has more advantages. You can establish some kind of legal audit trail: you track not only all the changes that happened and when they took place, but also who is responsible for them. This can be very useful also from a business point of view. Implementing undo is just a matter of replaying the log and stopping before reaching the final event. It's also great for testing and reproducing bugs. The log is a valuable source of input for your tests. Replay the log and check whether different versions of the component end up in the same state. After you make a new release of the system, you can spot compatibility issues or regressions by simply applying the event log and comparing the obtained state.
Event Sourcing Disadvantages
• Storage size (logs tend to grow unbounded, take a snapshot from time to time)
• Serializability (what's the right order of the events?)
• Consistency (what if events get lost?)
• Evolvability (how to version events in the log? how to migrate old logs so that they can be played with newer clients?)
Disadvantages include the problem of the unbounded growth of the event log. The more state transitions happen, the more events need to be listed in the log. If you want to join the mining effort, you get the privilege of downloading and storing your own copy of all bitcoin transactions since the origin block. You have to be careful about the order in which you log things, especially if you have multiple logs in a decentralized architecture. It's not trivial to know the right sequence of events unless there is a centralized log or you are really careful with clock synchronization. It is also critical not to lose any of the events. Given an incomplete log, there is no guarantee that replaying it will result in the same state. That's why blockchain miners dedicate so much energy to continuously re-checking all the hashes to make sure that the log didn't get tampered with. What if the format of the events changes? Can a new release of the component still play back events logged by an older version of the component? There is an issue of migrating the logs, which – due to their historical nature, spanning multiple releases of the component – is more complex than just migrating snapshots. We will see more about software evolution in the next lecture.
References
• Werner Vogels, Eventually Consistent, ACM Queue, Dec 2008
• Eric Brewer, CAP Twelve Years Later: How the "Rules" Have Changed, Computer, 45(2):23-29, Feb 2012
• Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff (Eds.), Site Reliability Engineering: How Google Runs Production Systems, O'Reilly, April 2016, ISBN 978-1491929124
• Michael T. Nygard, Release It! Design and Deploy Production-Ready Software, 2018, ISBN 978-1-68050-239-8
• Ali Basiri, Nora Jones, Aaron Blohowiak, Lorin Hochstein, Casey Rosenthal, Chaos Engineering, O'Reilly, August 2017, ISBN 978-1491988459
• Guy Pardon, Cesare Pautasso, Olaf Zimmermann: Consistent Disaster Recovery for Microservices: the BAC Theorem. IEEE Cloud Computing, 5(1):49-59, 2018
• Is it down right now?
Software Architecture
Flexibility and Microservices
12
Contents
• API Evolution: To break or not to break
• Layers of defense against change
• Extensibility and Plugins
• Microservices
• Feature Toggles
• Splitting the Monolith
This lecture is organized following some of the quality attributes a software architecture can have. We started from many design-time qualities. We've seen deployability as we made the transition to operations and runtime. Then we talked about scalability and availability. Today we close the cycle as we look at how easy it is to change a software architecture. This goes under the overall umbrella term of flexibility. We will see that the construct that is meant to make it easy to change a software architecture is called: "microservice". This is the latest addition to our zoo of architectural concepts, starting from the good old days of component-based software engineering, followed by the dawn of service-oriented architectures. As we are about to enter microservices land – before we define what microservices are by combining many ideas and concepts that we have already illustrated in the previous lectures – I want to have a more general discussion on what it means to change software, focusing on the most difficult part to change: the interfaces. How do we change an API? We will see that the most critical decision will be about whether to break or not to break clients. From the client point of view, it becomes critical to isolate clients and protect them from the impact of changes using layering. Layers give you a powerful design mechanism to contain the impact of changes. For example, every component is encapsulated by an interface. The interface can be seen as a layer mapping between the external data representation and the internal data representation. We will also take a look at how it is possible to design an open and extensible architecture with the use of plugins. As we finally introduce microservices, we will see how it is possible to develop highly flexible architectures. In particular, we will discuss the feature toggle mechanism and a procedure to gradually migrate monolithic architectures and split them into microservices.
How do we evolve interfaces? You can see here a physical example which shows how hardware technology has evolved to feature very different kinds of ports on different types of devices. We can spot a trend: over the years the set of ports has been shrinking, and their diversity has been significantly reduced. One driver behind this change has been the idea that laptops should become thinner. Another observation is that we can abstract and generalize the type of interface. Each USB-C connector can play the role of any of the ports of the previous interfaces, and can also be converted back to the previous interface by purchasing the corresponding dongle. What's most surprising is that you can also power the machine through the same connector which is used to exchange data. This was pioneered on mobile phones, where the surface which can be dedicated to ports is much more limited. This interface evolution occurred through decades of technological improvement. Beyond reducing thickness, courage, and simplification, what are the drivers that caused it? Is it a sign of progress to remove features and simplify the interface so much that you can design a future where only one type of plug will be sufficient? Life would be much simpler if one could cleanly break away from past decisions. Of course, such disruptive interface evolution will have a big impact if you try to connect your shiny new laptop with older devices. Your decision to break backwards compatibility can sometimes lead to creating new business opportunities, like selling adapters as a solution to the problem you created. With the additional benefit that if something breaks, you can always blame the adapter and not the incompatible hardware. How are the users of this type of system impacted? Will they just follow along and silently live with the consequences of these upgrades?
Who has the power to force change across an interface boundary? This requires a closer look at the relationship between the two sides of an interface. Do you have a collaborative relationship, where both parties are trying to build different components that have to match based on previously negotiated agreements? Or, as represented in the picture, do you have an imbalanced relationship, where one side is in control while the captive side needs to deal with the consequences of whatever decisions the other side takes?
Only one chance... ...to get the design right:
• Application Programming Interface
• Programming Language (and Standard Library)
• External Data Model:
  • File/Document Format
  • Database Schema
  • Wire/Message Representation Format
Once the API becomes public, it is out of your hands and it will be very expensive to change!
Why is it so critical to get the design of an interface right? Because you only have one chance to get it right. Once you have designed your API, other developers will build applications on top of it. There will hopefully be many programs written on top of the platform that you provide for them. If the platform is not stable or poorly designed, the life of many developers will be a misery, unless they simply refuse to get near it. This is the same position in which you find yourself if you invent a new programming language. I am sure some of you in your future career will invent a programming language or a domain-specific language. Once you release the language specification, the compiler, and maybe also the standard library that usually goes with it, it will be out of your hands. Other developers will try to write code in your language, and they will notice if you make a change to it, as they will probably have to throw away all the code they have been writing and start from scratch. We also have this type of situation when we design an external data representation: for example, the way data is represented as it is exchanged through an interface; but also documents written in a certain format; or data persisted in a database following a given schema. In general, we decide what the syntax, structure, and semantics of the messages exchanged between the outside world and our system will be.
These decisions are going to affect all the external entities that will ever interact with our system: all components that will ever have to read or write data that we can understand. If all of a sudden you make a change to the format, every other component that used to be able to exchange data with us will be affected. So once you release the API, as a designer you are no longer fully in control, because you share it with other people who depend on it. While before the first release you could iterate quickly and make whatever modification and improvement you wanted, once it's released this becomes a much slower process, and also a much more expensive one: in case you make a mistake, it will be super expensive or even impossible to fix it. If you made a small mistake in a programming language which eventually became highly used, people will still be laughing about it 25 years later.
API Evolution
• Once in the API, keep it there forever
• Never add something you do not intend to keep forever
• Easier to add than to remove
• Make sure you explicitly version all changes (including documentation) pointing out incompatibilities
• Publish preview (0.x) versions of an API before freezing it to get early feedback from clients
• Rename a component if its API has changed too much so you can start the redesign from scratch without breaking old clients
• Keep changes backwards and forwards compatible

Is there a simple rule we can follow to guide us with the evolution of our API? Once a feature is in the API, the simplest thing to avoid breaking your clients is to keep it in the API as it is, unchanged, forever. If you are in doubt, leave it out: never add something you do not intend to keep in the API forever. It's much easier to add something later than to remove it or to change it. Evolution by addition has visible consequences in many APIs, languages, and data formats which have been around for a long time. Every change that requires removing a feature is hard. APIs hardly ever get simpler or cleaner; APIs are rarely refactored. For example, if you discover a better name for a feature, you can easily introduce the new name, but you may only deprecate the old name, as removing it completely would break clients. Instead, after each iteration, each new release, APIs just keep growing as they accumulate new features. Since nothing should be removed, the size and complexity of old APIs tends to grow.
Another important point is that there needs to be a clear specification of the interface. If there is one place where you have to use version control, it is exactly in this documentation of the API. This will help clients realize whether the new version becomes incompatible. For every change that you introduce, you need to predict whether it breaks clients that depend on the previous version of the interface. There are some conventions that you can follow, for example using preview or experimental releases. This means that you are going to tell your clients that this version is not stable: the API is not yet frozen. Only later can the frozen API be expected not to change anymore. This helps to get feedback from interested clients: Is this feature useful? How can we improve it? But you still warn people not to seriously depend on it, because it may still change. This is how you can avoid the one-shot release that is very hard to get right because you only have one chance. To converge towards the perfect API design you need multiple iterations with clients that are willing and brave enough to depend on an interface which is not yet stable. It is only fair to warn clients that the API is not finished and may end up in a completely different shape. If an API has evolved so much that you cannot recognize it anymore, due to the amount of new and deprecated features that have encrusted it over the years, it is better at a certain point to mark it as obsolete and start from scratch. Give it – this is very important – a new name, so that clients do not confuse it with the previous one and somehow still expect it to be compatible. Especially if it's going to be totally different, you might as well change its name (not only the version identifier) as you make drastic changes to the interface.
API Compatibility
• Backwards compatibility: new version of API compatible with old client
• Forwards compatibility: old version of API compatible with new client
(figure: a client and an API evolve from version 1.0 to 2.0. Backwards compatibility: client 1.0 works with API 2.0. Forwards compatibility: client 2.0 works with API 1.0.)
What is important is to try to minimize breakage and to keep changes backwards and forwards compatible. Let's see what this means. Compatibility requires that the interfaces match: clients require something that is provided, and there is a perfect fit, so you don't need an adapter between the two interfaces. Backwards and forwards compatibility use a spatial metaphor to describe how compatibility is affected by interface evolution. Each side of an interface can evolve independently. Here we can see that there is a client that depends on a certain API, and they're both at the same version. Then this version evolves from 1.0 to 2.0 on both sides, so this is also a situation in which we can expect the two sides to be compatible. However, since we're talking about two different components, they may evolve at their own speed. In this case we have a scenario in which we have changed the API – the interface on which the component depends – but the component itself hasn't been touched. We have the old client talking to the new API. From the perspective of the API, we're looking backwards in time: we're trying to receive an incoming message that was sent from the past. If an older client works with a newer API, it means that the change that you introduced in the API is backwards compatible. How can you achieve such backwards-compatible changes? The 2.0 version of the API is a superset of the 1.0 version. That means that when you plug the old client into it, the old API features it depends on are still there. From the perspective of the old client, the interface is unchanged, and the client will just ignore all the new features. That's easy to achieve if you are not touching what is already there. Forwards compatibility refers to the opposite scenario, in which the client is from the future and the API is from the past. This is a bit more complicated to achieve, because the client could expect to depend on features that the API doesn't provide yet. The question for you to think about is: how can you make an interface that not only works with today's clients, but can also still work with tomorrow's clients?
This was the definition of the difference between forwards and backwards compatibility. The term is defined from the perspective of the interface. If you go backwards, it means that the client is older than the interface, and if you go forward, the client is newer than the interface.
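One common way to achieve a degree of forwards compatibility is the tolerant-reader approach: ignore unknown fields and supply defaults for missing ones, so that messages from newer clients do not break an older receiver. Here is a minimal sketch with an invented message format:

```python
def parse_order(message: dict) -> dict:
    """Tolerant reader: pick out only the fields we understand, supply
    defaults for anything missing, and silently ignore unknown fields
    that a newer client may have added."""
    return {
        "order_id": message["order_id"],         # required in all versions
        "quantity": message.get("quantity", 1),  # optional, with a default
        # any extra keys sent by future client versions are ignored
    }
```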
Semantic Versioning (https://semver.org/)
MAJOR.MINOR.PATCH
• MAJOR: incompatible API changes
• MINOR: new functionality, but backwards-compatible
• PATCH: backwards-compatible bug/security fixes
Additional versioning metadata:
• Build counter
• Release candidates
Once a version has been released, its content is immutable. Any changes must be released as a new version.

One way to identify the amount and the type of change that we have introduced in a certain element of the architecture is to use semantic versioning. This convention uses three numbers: Major, Minor, and Patch. As opposed to floating-point numbers, where there is only one dot, in computer science we use version numbers with two dots to identify how things evolve. By convention, when you increase the Patch number (on the right), this represents a small, internal change which can be assumed to be backwards compatible. Bumping up the patch counter typically occurs for fixes rather than new features. You might have a security fix. You might have a bug fix. Apart from the improvement, the system is feature-equivalent to before. If you make a minor change, this means that the system is growing with new features, but still keeps backwards compatibility. As long as the first number doesn't change, you can expect some degree of backwards compatibility as well as some improvement. When you touch the major component of the version identifier, you are signaling that there has been some incompatibility. This is a way to warn clients that they will not work without putting in some effort to bring them up to speed. Sometimes you also include additional numbers: for example, a build counter. While setting the major.minor.patch is something that you have to decide manually, based on the impact of the change that you introduce, you can adjust the build counter automatically in your continuous integration pipeline. This is a very simple counter that is incremented
every time you run the pipeline, and it can be appended to the version identifier so that every build results in a unique identifier. Sometimes, as you are about to make a big jump – a major release – you want to iterate quickly over it to polish and iron out the last showstoppers. To do so, you can make release candidates: for example, version 1.0-RC1. It is not yet stable, it is converging, but it is meant to be tested and not for production. The RC version counter will eventually disappear once the version stabilizes and the release is accepted. Version identifiers help to enforce a very important general rule: immutable releases. Anything produced by your automated build pipeline should be immutable and versioned. That means that each particular version with its unique identifier corresponds to a known entity built using a certain reproducible process. Every version identifier corresponds to an explicitly tagged source that you have in your version control system. If you make any change to the source, you should run the build again and generate a new version, no matter what you do, no matter how small the change. That's why we have a build counter anyway: even if you don't touch the other version counters, the build counter will increase, and that helps to guarantee this property and to track the lineage of your release artifacts. Also, once you have generated the artifacts tagged with a certain version, you should never touch them manually. You should always recompile, regenerate, and retest them through the automatic build pipeline.
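The convention can be made operational with a few lines of code. This sketch ignores pre-release tags and build metadata, and the compatibility rule it encodes is the optimistic reading of semantic versioning described above:

```python
def parse_version(version: str) -> tuple:
    """Split 'MAJOR.MINOR.PATCH' into integers, e.g. '1.4.2' -> (1, 4, 2)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)

def backwards_compatible(client_version: str, api_version: str) -> bool:
    """Under semantic versioning, an old client can expect to work with a
    newer API as long as the MAJOR number is unchanged and the API is not
    older than the client."""
    client, api = parse_version(client_version), parse_version(api_version)
    return client[0] == api[0] and api >= client

# backwards_compatible("1.2.0", "1.5.3")  -> True  (minor/patch growth)
# backwards_compatible("1.2.0", "2.0.0")  -> False (major bump: breaking)
```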
Changes and the Build Pipeline (flowchart: after you change a component's dependency, ask whether a rebuild is needed. If no rebuild is needed: does it run? Yes means binary compatibility; no means broken binary compatibility, so a rebuild is needed. When rebuilding: does it build? No means broken source compatibility, and the component must be changed; yes means source compatibility. In both paths, finally: does it pass the tests? Yes means semantic compatibility; no means broken semantic compatibility.)
Speaking of the build pipeline, let's see how different types of changes impact compatibility at different levels. Whenever we make a change, we have to ask ourselves: are the dependents going to be affected by this change? In which way? For example, if you change your library, do you have to rebuild its clients or not? If not – if the clients can run without rebuilding them – we just update the library, and the clients only need to be rebooted to load the new version. If this works, you have achieved binary compatibility. This is great: not only are your clients not affected and you don't have to recompile them, they still work even though you have changed their dependencies. If this is not the case, you have broken binary compatibility. You will need to rebuild the clients. Clients need to be recompiled every time their dependencies change. After you have checked out the new version of the library, does the build still work? If not, you have also broken source compatibility: your client will need to be rewritten, extended, or modified so that it can work with the updated dependency. However, let's be optimistic: after pulling the updated dependencies, you rebuild and it works. This means the change is source compatible. After you compile, of course, you should re-run the tests. This is true also when you have binary compatibility: after you update the library and restart, the new library links successfully, but will it pass the tests? If you pass the tests with the changed dependency, then you have an even stronger level of compatibility: you have not broken the compatibility of the semantics of the implementation between the two sides. If the tests do not pass, then even if you achieved source or binary compatibility, your change has broken the actual semantics which the client expected from its dependencies.
You can use this flowchart to define the concepts of binary vs. source vs. semantic compatibility. The three concepts fit with the major.minor.patch versioning scheme: patches could in some cases be binary compatible, minor changes could still be kept source compatible, while with a major change you can expect even the semantic compatibility to be broken.
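To make the distinction concrete, here is a hypothetical library function evolving across releases; adding an optional parameter with a default keeps old call sites working (a minor, source-compatible change), while changing the required parameters breaks every caller (a major change).

```python
# Version 1.0.0 of a hypothetical library:
def send(message):
    ...

# Version 1.1.0: a backwards-compatible (minor) change.
# Old call sites like send("hi") still work unchanged.
def send(message, retries=0):
    ...

# Version 2.0.0: a breaking (major) change.
# Every caller must now be rewritten to pass a destination.
def send(destination, message, retries=0):
    ...
```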
Version Identifier (figure: a client and a service, whose interface descriptions, component implementations, and exchanged messages each carry their own version identifiers, e.g., 1.0.0, 1.0.1, 1.0.3, 1.1.1)
• Version Interface Descriptions
• Version Message Payloads
• Version Component Implementations
• Version Container Images and Addresses
• Version Database Schemas

After introducing version identifiers, let's see which elements of the architecture can be versioned. If you take it to the extreme, every architectural element can carry a version identifier. For sure, interface descriptions should be versioned: if there is an API that you depend on, this API is specified, and there is a version identifier associated with it. We can also version component implementations: the implementation behind the interface can have its own version. Also, the client component on the other side has its own version. When a client looks up an interface, the directory will bind it to a certain implementation. Version identifiers associated with interfaces and implementations can be used by the client to constrain or identify which version it depends on. When clients look up dependencies, they can look for the latest version, or for a specific version within a certain range. Once clients interact with implementations, the messages that are exchanged can also carry their own version identifiers. These are much more fine-grained than the whole interface: they can identify the version of the interface endpoint to which the message is directed. The message version identifier can also be used to check that it matches the version of the interface.
A more coarse-grained version identifier refers to the container, or to the image deployed in a container. Images aggregate many components and interfaces, which need to be labeled so that they can be versioned as a whole. When you start a given image on a container or runtime environment (whose configuration should be versioned), you will run a service on a certain server. The server also has network identifiers, for example its IP address. Addresses can also include version identifiers, for example if multiple versions of a service are active at the same time: messages can be routed to their endpoints based on the version identifier embedded in the corresponding addresses. Finally, we also version the database schema used to store the state of stateful components. Why is versioning the schema important? Because if we change the structure of the data, we might need to do a migration from a certain version of the schema to another one. Therefore it's important to know what changed, and the fact that we are going to connect our stateful component to a database which uses a different schema. Ultimately, the goal of using version identifiers is to have an easy, lightweight mechanism to spot differences. You can read an entire API description or schema specification and compare it against an older one; this is quite a lot of information, and it requires effort to spot the differences, not only to detect which parts have changed by feeding them through an automated diff checker, but also to assess the importance or the impact of such changes. Or you can just read the metadata: there is an attribute that identifies the version, and if the numbers change, you assume that the artifacts are different; depending on which number has changed, you can efficiently infer whether it's a minor or a major change.
Two in Production
[Slide: Clients 1.0 and 2.0 send requests to an API Gateway, which routes them to API versions 1.0 and 2.0 running side by side; over the transition time the traffic share shifts from 100% on 1.0 and 0% on 2.0 to the opposite]
Gradually transition clients to the new API version without breaking old ones
Once all clients have been migrated the old version can be retired
How can we use this notion of having different versions of interfaces and dependencies to support the gradual evolution of our system? At some point, when the system grows beyond a certain size, it is no longer possible to apply a change in lockstep everywhere. If you have a very small system, you have the luxury of deciding that you make a change and everything changes atomically. As the system grows, inevitably some elements fall behind. You will make a change on the server side and the client remains the old one for a while. Therefore it helps to be able to avoid breaking all clients and to support their gradual transition over time. If this transition takes time, it means that for a while you will have both old and new clients talking to your service through your interface. How do you keep them both compatible as cheaply as possible? Does this require any additional coding effort? It’s just a matter of running side by side the old implementation for the old clients and the new implementation for the new clients. You can directly connect the clients to the corresponding implementation. Or you can try to make it transparent and have a single endpoint: you do not touch the address, and then you use a special kind of load balancer that is going to route the requests depending on the version of the client. Each request message carries a version identifier which will help the load balancer to send it to the corresponding new or old implementation. It’s important to put the version in the messages, because once this component gets a message coming from client 1.0, it can forward it to the correct implementation. If you have this setup to keep track of the traffic, you can keep track of how many clients have been updated and already use the new version, as well as how many clients are still behind and need to run against the old version. At the beginning, most of your clients are routed to the old interface. Over time you will notice more and more clients shifting to the latest version. This will also give you a chance to gain confidence that the new version is working. Most of your clients are still depending on the old one, which may be out of date. Maybe it doesn’t have the features that you need, but at least you know that it’s stable and it works.
And then you have some of the clients – the pioneers – going towards the future: they can work with the new system. As you gain more confidence and you put additional effort into migrating the clients, you will see that at the end you only have a few laggards, a few clients that are stuck in the old ways, and eventually they disappear. Once this happens, you notice that there is no more traffic going to the old version, so you can switch it off and just keep the new one. This pattern is called “two in production”. It allows you to buy the necessary time to perform this transition, because there is a cost to migrating all the clients at once. Sometimes other people have to pay the price, but they still need to be willing to do it. And this way you can support them during this transition. The alternative would be to switch off the old version of the API immediately: let the old clients break, and unless they upgrade, nothing will work on their side. If you use two in production this doesn’t happen, because the old version still works. This means that you have to somehow give clients an incentive to switch, because of course if you keep supporting the old version forever, then there is no incentive for them to do the migration. Two in production is just a simple example. In reality, in some architectures you have three in production: 1.0 is the super old version that very few still use, but it is still necessary to run it; 2.0 is the majority, that’s the stable version; 3.0 is the experimental one about which you are starting to get feedback from the clients. You will soon want to introduce 4.0, but it’s too expensive to run four versions side by side, and you will only do it when 1.0 disappears. If you have three in production, you have a way to handle very old clients. The majority is in the center, and then you also have a chance to test out new features oriented towards the future. So you have the past, the present and the future represented in your architecture. By keeping track of the traffic, this solution also helps you to know when you can discontinue the support for the past.
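Here is a minimal sketch of the routing component described above (endpoints and message format are invented for illustration): it dispatches each request based on the version identifier carried in the message, and counts the traffic per version so you know when the old version can be retired.

    from collections import Counter

    traffic = Counter()
    backends = {"1.0": "http://api-v1.internal", "2.0": "http://api-v2.internal"}

    def route(message: dict) -> str:
        # Read the version identifier carried by the message and pick the
        # matching implementation; count requests per version on the way.
        version = message.get("version", "1.0")
        traffic[version] += 1
        return backends[version]

    route({"version": "2.0", "body": "..."})
    if traffic["1.0"] == 0:
        print("no more 1.0 traffic: the old version can be switched off")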
API Sunrise and Sunset
Manage expectations and avoid surprises: warn clients that APIs they depend on are about to change and inform them about which features may change or disappear
• Experimental Preview: still likely to change, feedback welcome, help give clients an early start
• Deprecation: will no longer be supported in future releases
• Limited Lifetime: will be supported during a limited time only
• Eternal Lifetime: will you be able to keep the promise?
Regarding change and the life cycle of dependencies, it’s important to manage the expectations of your clients. You want to give them a warning that the APIs they depend on may change, and sometimes even disappear. To visualize the beginning and the end of the life cycle, we can use the sunrise and sunset metaphors. Change is mostly positive: you grow the features, you make the system more powerful, you improve the performance and the availability, you correct bugs; everything gets better and better all the time. But sometimes you also have to remove features that people need, because it’s too costly to maintain them. Your clients need to be warned that this will happen. To do so, we use these terms associated with different maturity levels of the interfaces and different kinds of promises that we make towards the clients. If you release an experimental preview, this means that your goal is to involve clients in the design of the API and still change it based on their feedback. There’s a point in which you say: I think this will work, this is a good enough design, let me see whether the clients agree. So you open it up and release it as an experimental preview. Some curious clients are willing to try it and give you feedback. Maybe you need to pay them for this. But in general their incentive is to get an early start on their side, so that later, when you make the final stable release, they are already well on their way to building something that can make use of it. Of course, if during the experimental preview you discover that the design will never work and you have to rewrite and redesign the whole thing, they will be severely affected, because whatever they invested into their client prototypes will also have to be thrown away. The goal is to minimize the chances this happens and to have a good starting point in the first place. That’s how you start right at the beginning with every API. Every interface should go through such an early feedback phase before you actually freeze it. Once you make a 1.0, it’s frozen for good. But if you are at zero point something (0.99), you know that you’re in experimental preview mode.
On the other side, when you want to start to remove things, you should never make something disappear from one day to the next. You should first tag features that you plan not to support in the future as deprecated. If you write code, there are standard annotations which you can use not only for the documentation, but also so that warnings will show up in the compiler logs whenever you are calling something that has been deprecated. Of course it’s only a warning and can be ignored, but you cannot say you have not been warned if eventually the feature disappears and your build breaks: you should have already stopped depending on it. After this phase, in which a feature that was stable becomes deprecated, you can also add additional pressure: while it is ”only“ deprecated now, it will disappear within one year. This way, clients know they have one year of time to find an alternative to rely on, due to the limited lifetime during which they can use what we offer. End of Life is an important aspect of technology. Due to the constant churn of technology evolution, there are always better ways that are found, new discoveries, innovations, and therefore obsolete stuff to be thrown away in the wastebin of technology history. This evolution is also part of the business model: I sell you something that only works for three months, and then if you want to keep using it, you have to buy a new item that will be released right on time as the first one is expiring. When they invented light bulbs, they made them so good that the light bulbs would never burn out. Then they realized they would soon run out of customers. So they actually designed the light bulbs to burn out after some amount of time, so that people would have to buy new ones to replace them. You can either know in advance when something will expire, or you can have a surprise that makes your life interesting: why did it have to run out today, of all days? Now, if you try to get a competitive edge, you can also release an interface and promise that it will be supported forever. This should make your API more attractive. If clients need to build a system which will depend on you, it is much better for them if a dependency will be stable and supported forever, as opposed to having uncertain information about its lifetime or just a limited guarantee: we will support it for a couple of years, but then it will most probably change in a non-backwards compatible way. Clients will have to renew their license, or sometimes bribe us so that we continue to support the old version for an even longer time. Or they look for the cheapest upgrade path; if they are unable to find a compatible replacement, in the worst case clients will need to adapt and rewrite their code to fit with the new version.
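In code, such deprecation warnings can be emitted mechanically. A hedged sketch (the function names are invented; in Java the standard @Deprecated annotation plays the same role):

    import warnings

    def fetch_account(user_id: str) -> dict:   # the replacement feature
        return {"id": user_id}

    def fetch_profile(user_id: str) -> dict:   # still works, but flagged as deprecated
        warnings.warn("fetch_profile() is deprecated; use fetch_account() instead",
                      DeprecationWarning, stacklevel=2)
        return fetch_account(user_id)

Clients calling fetch_profile() keep working, but every call leaves a warning in their logs: they cannot say they have not been warned.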
To break or not to break
• Sometimes it is unavoidable
• Cost(Keep Service Compatible) > Cost(Fix Broken Clients)
• Warn clients in advance
• Provide migration path (to alternative provider) or fallback solutions (shims, polyfills)
As you consider retiring an API – depending on the languages that you use to describe it – you can annotate its features with these terms: deprecation or sunset. These are mechanisms within the way that you describe interfaces (deprecation) or the way that you interact with them (sunset headers) to annotate them with ”expiration date” or ”guaranteed to work until” metadata: this particular feature is deprecated, or this particular endpoint will disappear within a certain time. Even when we sunset or deprecate APIs, callers will still come knocking! The question is: why do these breaking changes happen? Sometimes they are unavoidable. Maybe it’s not even your fault, like when the break is caused by a dependency down the stack that is outside of your control. You wish you could maintain the compatibility and support this interface for longer, but then you discover that something else stops working; as a consequence your component stops working in the way it used to, and your client is broken. There is a simple comparison you can make: is the cost of keeping the service compatible and supporting all clients greater than the cost of fixing the clients that will be broken by dropping the compatibility? The cost of keeping the service compatible is paid by the service provider, whereas the bill for fixing the broken clients is paid by the clients themselves. What kind of relationship have you established with your customers? If you pay the price for maintaining compatibility, can you offset this additional cost to the customers themselves? If you are unable to maintain compatibility and you break them, then they will have to deal with the consequences. It may be cheaper for clients to avoid breaking changes by paying the service provider to keep the service compatible, as opposed to fixing the clients. Also, if you’re going to break it, don’t do it in the middle of the night when nobody
is watching. Break it with an advance warning. Here is an old example, called the “OAuthpocalypse“: the Twitter API changed on a fundamental level the way that users authenticated. As a result, this ended up breaking all of the apps and all of the clients trying to access the Twitter API. To warn developers (and users) they set up a website prominently featuring a big countdown clock. The message was clear: you have nine weeks left to update your application or to pressure its developers to release an update for it. Despite this advance warning, the day they flipped the switch there were still many clients that stopped working. This is related, in general, to how people react to deadlines. Too bad they could not simply redirect the old clients to the new endpoints. Together with a warning, you should also provide a migration path, or some fallback solution, so that the client can adapt and avoid the worst possible outcome.
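For HTTP APIs, the sunset announcement mentioned above can travel in the response itself. A sketch, assuming the Sunset header of RFC 8594 (the date and URL are invented):

    # Response headers announcing that this endpoint is guaranteed to work
    # only until the given date; the link points clients to migration advice.
    headers = {
        "Sunset": "Sat, 31 Dec 2025 23:59:59 GMT",
        "Link": '<https://api.example.com/v2/migration>; rel="sunset"',
    }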
Who should keep it compatible?
[Slide: Client 1.0 talks to Service 1.0; everything is compatible]
We’ve seen that when you break compatibility, you introduce a mismatch between interfaces. A possible solution is an adapter, if we assume that there is an equivalent alternative way to do the same thing. If a feature is removed from the new interface, the feature is gone and there is very little an adapter can do, unless the missing feature gets re-implemented from scratch inside the adapter. Let’s take a look at the interplay between adaptation and evolution. This can also help us to explain in more detail who needs to pay the price to deal with the consequences of breaking changes. Here we start – nice and easy – from where everything is compatible.
Who should keep it compatible?
[Slide: Client 1.0 still talks to Service 1.0, while Service 2.0 has been released and runs side by side]
Two in Production: Client still using old version
Now we introduce a change. There is a new version of the interface. Although there is a new version, we decided that we’re not going to stop the old one, because otherwise the old client will break. To avoid that, we are using two in production: two versions of the service are running side by side and the client is still depending on the old one. With two in production, the advantage is that the investment in the old version of the service is already amortized and we keep running it as it originally was. Basically, we don’t touch it. Next to it there is a new version of the service that has just been released. Freshly developed, maybe after a complete rewrite from scratch, or just with some changes which unfortunately break compatibility.
Who should keep it compatible?
[Slide: Client 1.0 talks to a 1.0->2.0 adapter, run by the provider in front of Service 2.0]
Provider keeps supporting old clients by running the adapter
As it is running on the side, conceptually, we could use an adapter to replace the old implementation and switch to the new implementation while keeping the interface compatible with the old version, which is the one required by the old clients. For example, such an adapter may need to fill in default values for data fields which were added in the new version, or rename a few fields from the old (and deprecated) names to the new ones, which offer equivalent semantics. In general, if it’s possible to implement it using an adapter, the old clients will not notice it, as they still think that they are using the old version. Behind the old interface, the adapter just forwards their calls to the new version by transforming them as necessary. Why would you do this as opposed to keeping two in production? What if your service is stateful? If you run both versions in parallel, their state will drift apart. The state of the old service is isolated from the state of the new version: if the client makes a change to the data through the old API, this data will not be visible from the new interface, because it’s a completely separate system. If you can merge the two versions (or at least their state) in this way, the old client and the new client will be able to see the same state of the service, each from a compatible interface. If you can manage to have a single database where you keep the state for both interfaces, then, for example, a new client can write some information into the system through the new interface and then you can read the same information from the old client using the original version of the interface. Still, the cost of running this adapter is something that the service provider is paying for at the moment. Because we are still offering two versions of the interface, we have just merged the underlying implementation and state, which are all running on the service provider side.
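Here is a minimal sketch of such a 1.0->2.0 adapter (the field names are invented for illustration): it renames old fields, fills in defaults for fields that only exist in the new version, and forwards the call to the 2.0 implementation.

    def service_v2(request: dict) -> dict:
        return {"status": "ok", "echo": request}

    def adapter_v1(request: dict) -> dict:
        # Transform a 1.0 request into a 2.0 request and forward it.
        upgraded = dict(request)
        upgraded["full_name"] = upgraded.pop("name", "")  # field renamed in 2.0
        upgraded.setdefault("locale", "en")               # new field in 2.0, fill a default
        return service_v2(upgraded)

    adapter_v1({"name": "Ada"})  # old clients keep calling the 1.0 interface

The same code could later be shipped to the remaining old clients as a shim, which is exactly the move discussed below.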
Who should keep it compatible?
[Slide: the 1.0->2.0 adapter moves to the client side: the old Client 1.0 runs a local adapter to access Service 2.0]
Old clients need to run the adapter to use the new service version
We have reached the point in which the service provider is no longer willing to pay the price of maintaining the old version of the interface and running the corresponding adapter. While there is a way to do the adaptation, the service is no longer motivated or interested to perform it for its old clients, whose population may have been steadily disappearing. The client should actually upgrade and use the new version, but what if the client cannot do that? What if the client is stuck in the stone age? It may happen that clients are unable to upgrade due to their obsolete hardware/operating system version. Or simply their source code was lost, or there are no developers left on the team capable of correctly modifying their code. In this case, it is still possible to keep the adaptation running, but just push it onto the client. From the service provider perspective, only version 2.0 is now in production. Old clients have to install and run a local adapter in between, so that the old client can access the new service interface through the adapter deployed on the client side. This is what happens when you hear about installing a piece of software called a “shim” or a “polyfill”: you have an old client, and you augment it so that it can use the new service, the new version of the API.
Layers
How to contain the impact of change?
• Change of Dependencies
• Change of Technology Platforms
• Change of External Service Provider APIs
• Change of Data Representation Formats
What if we expect changes in our dependencies? Something that we depend on will change. There can be changes in the dependencies underlying the system: the technology platforms on which we build it tend to slowly evolve underneath. There can also be sudden changes in service provider APIs. Likewise, data sources move and data formats shift, so we have to avoid coupling the data that we use within our system too tightly to the external one. How can we control and limit the impact of changes? Add a layer of indirection.
Layers
Changes of one layer only affect the neighbours
Cross-layer changes may impact all layers
The simplest idea to isolate the impact of changes is to use this concept of architectural layers. Consider this layered architecture. What if we change the bottom layer? Which layer is affected? Only the one next to it. If I tell you the changes are happening at the top layer, then you probably see that the layer that is affected is only the one right underneath. If you change something in the middle, this will probably affect both sides: these layers are directly in contact with the one that is affected. It may also be possible to constrain the impact to only one side, depending on the direction of the dependency relationship between the components of each layer. There can be layers where the impact of changes goes only in one direction, for example, from the bottom to the top. If you change something on top, you don’t affect what is underneath. This helps to constrain the impact of changes. But if we change fundamental layers at the bottom, also in this case, if layers are done right, the change will impact only the immediate layer on top and should not ripple further up the stack. There are some changes, however, which unfortunately happen to impact all the layers of your architecture. Depending on the strategy you followed to introduce layering, it will be difficult to anticipate all possible changes and just keep them isolated within certain layers. So-called cross-cutting changes will impact all layers of the architecture. While usually you can change the implementation, for example, optimize its performance, without touching the interface so that clients above are not affected, what if you need to add support for a new feature? This new feature will require an extension of the implementation. It will also need to be exposed through the interface. Likewise, the component above will need to be extended so that it can make use of the new feature. Ultimately this will ripple all the way up to the user interface, so that users can actually see and interact with the new feature. Also, the change will sink through the layers all the way to the database, because the new feature needs information from the user and this information has to be stored somewhere. For such an extension, for such a change, your layering is completely useless, because you have to start from the top of the system, add a new format, the new UI widget, extend the API, extend the business logic controllers, extend the model, make the new information persistent, all the way to the bottom. If you’re planning to grow the system in this way, you chose the wrong layering strategy. Maybe your goal should have been to slice the architecture into vertical layers instead. This way, your changes, your new features, are encapsulated, and the impact of adding them stays within one of those vertical sections while the others are minimally affected. Depending on the level of granularity, depending on the goal of your decomposition strategy, you should not feel constrained to use only one kind of layering. If you zoom out far enough, layers will appear to be oriented vertically, while if you zoom into each of the vertical slices, horizontal layers will appear. The main concept remains valid: use layers to surround components likely to change to control the impact of such changes on the rest of the system.
Layers
[Slide: three layered stacks side by side: the classic Application / Operating System / Hardware; a refined stack with Application / Operating System API / Operating System / Device Driver / Hardware; and a virtualized stack with Application / Language Virtual Machine / Container / Operating System / Virtual Machine / Hardware]
Let’s take a look at some examples. The classical layers: application, operating system, hardware. The OS layer isolates the application from the hardware on which it is running, and at the same time abstracts it. Within these layers we can add more intermediate layers. We can add the operating system API, which hides the implementation from the application. There is a clear point in which the application touches the operating system. The operating system can change underneath without affecting the application, as long as the API remains stable. We also have device drivers to abstract the low-level hardware and separate it from the higher levels of the operating system. Again, originally we have the operating system on top of the hardware. More in detail, there is an additional layer in the middle that will need to change if you change the hardware, but as long as the driver interface (e.g., for a generic class of devices) remains the same, the rest of the operating system is not affected. If we introduce virtualization, we add a few more layers. There are virtual machines for specific programming languages that are running inside containers. There are guest operating systems that run inside virtual machines. And you can have as many layers as you want, so that if you make a change underneath, what is above is not affected. While we have seen such abstraction layers when we were discussing portability, here we emphasize their benefits concerning the evolution of the system. Users want to be able to upgrade the operating system to get performance improvements, apply security patches and so on, but users do not want their applications to be affected. If every time you ship an upgrade of the operating system you break all the applications, many users will be annoyed. Designing appropriate interface layers helps to deal with this backwards compatibility requirement.
Layers
[Slide: Presentation (User Interface), Logic (API, Business Logic), State (Object-Relational Mapper, Database)]
Another classical example is the layering of an architecture into: presentation, for the user interface interaction; the business logic; and the persistence layer, where we separately store and manage the information processed by the application. The persistence layer deals with consistency, concurrency control, optimized query planning, and all the features that you get from a database. Why does this layering help? Suppose you make a change to the user interface: you switch from using a command line terminal to a graphical user interface where you can use the mouse to select menus, click on buttons and dialog boxes; then you switch to a touch-based user interface; then you switch one more time to voice control; but then augmented or virtual reality becomes fashionable. These are all superficial changes affecting the user interface of your system. They should not affect the underlying logic nor the information content, only its visual representation. You can focus the impact of the user interface technology evolution on your system in one place (the user interface layer) so that the rest can remain the same. We can also zoom in, and we can see that to isolate the business logic from the user interface we can again introduce an API: we make an explicit interface. Likewise, to isolate the business logic from the data storage we can have a dedicated mapping layer, like for example an object-relational mapper. This layer is going to take the memory data structures that we use when we process the information and transform them so they can be stored persistently. As a result, the business logic becomes independent from the database schema. The representation that we use to store the data does not need to be the same one that we use to process it.
Layers
External Data Representation
Validation, Mapping: Serialization/Deserialization
Internal Data Representation
Generalizing from the previous case, as a last example, we can also distinguish “Data on the Outside versus Data on the Inside”. The external representation for the information can be different from the internal one, and this can happen every time you cross any component boundary, any interface in your architecture. Each component has an inside and an outside. Any information that is traveling between components will cross such boundaries. That is, the external data representation is used to exchange information between components, which may however store it as part of the state of the component using the internal representation, which is local to each component. You have the opportunity, as you cross the boundary of the component, to transform between these two representations. This allows you to couple the implementation of the component to the internal representation, which has a local scope, so that the implementation can evolve without being coupled to the assumptions that you make on how the data is represented outside. If you don’t take advantage of this opportunity, it means that every time you modify your data (e.g., how it is represented, its syntax, structure and semantics) somewhere in your architecture, every component will be affected. Every component will need to know that the globally defined and universally shared data types have changed. For example, the data structure to store the customer phone numbers has been updated, so you have to rebuild all the components that deal with customer data. If you can keep this change local within the internal data representation of one component, then you keep components compatible and need to rebuild only one component. Going from external to internal and vice versa, we find a mapping layer sandwiched in between. This layer is responsible for serializing the internal data so that you can send it as the payload of an outgoing message, or deserializing data when you receive the message. With every interaction, messages need to be received, buffered, read and parsed, so that the information needed by the component can be extracted and processed internally. How much validation to do before or after this transformation? As the data enters (or leaves) the boundary, there is an opportunity to check that what arrived from the outside is correct. Ensure that it makes sense before you transform it into a suitable internal representation. The layer not only builds a line of defense against design-time coupling between data models, but can also be used at run-time to detect illegal border crossings of invalid messages. The layers shown in this example help to reduce coupling because when change happens on either side, the only place that will be affected is the mapping layer. If the external data representation has changed, adapting the mapping makes it possible to keep the same internal one. Everything else within the component that depends on it will not be affected. This is possible thanks to the mapping layer, which absorbs this change. Likewise, if you want to change the way that you represent the data within your component, the rest of the architecture is not affected, because the mapping is going to transform the data on the inside to the shared standard representation visible on the outside.
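Here is a minimal sketch of such a mapping layer (the Customer type and its fields are invented): the component keeps its own internal representation, and the mapping validates and converts the external JSON representation at the boundary.

    import json
    from dataclasses import dataclass, field

    @dataclass
    class Customer:                      # internal representation, local to the component
        name: str
        phones: list = field(default_factory=list)

    def deserialize(payload: str) -> Customer:
        # Data on the outside becomes data on the inside: validate first.
        raw = json.loads(payload)
        if "name" not in raw:
            raise ValueError("invalid message: missing required field 'name'")
        return Customer(name=raw["name"], phones=raw.get("phones", []))

    def serialize(customer: Customer) -> str:
        # Data on the inside becomes data on the outside.
        return json.dumps({"name": customer.name, "phones": customer.phones})

If the external format changes, only deserialize and serialize need to be adapted; the rest of the component keeps working against Customer.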
Tolerant Reader
How can clients or services function properly when receiving some unexpected content?
Ignore unexpected content and fail only if required content is missing
Tolerant Readers ignore new items, the absence of optional items, and unexpected data values for which defaults can be used instead
Every time data is exchanged through the interface there is an opportunity to perform validation, especially for what concerns external messages that arrive into your component. Before accepting them, one should check them, so that if the message goes past the interface, the rest of the system can make some assumptions about the content of the message and will not have to worry that this might be an invalid or incorrect message. When we do this validation we have two options: to be strict or to be lenient. How can clients or services function properly when receiving some unexpected content from the outside? The idea of introducing a tolerant reader is to ignore the unexpected content and only reject the incoming message if content that is actually required is invalid or missing, as opposed to saying: this message has one bit that doesn’t fit with my expectations, therefore I reject everything. If the bit that you didn’t expect is not something that you actually depend on, then you can survive by ignoring it and just reading out what you really need. Suppose we introduce a change in the structure of the messages that we exchange, and in particular we add something new. We may wonder whether the change is backwards and forwards compatible. If we have a tolerant reader, it is much easier to keep the compatibility: any unexpected items (including the new ones) will be ignored. Also, if you have optional items that are missing, they will be ignored as long as the reader can fall back to a default value. Any value that is missing, as well as out of range, can be reverted back to a default, so that the tolerant reader can accept the input.
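A minimal sketch of a tolerant reader (field names and defaults are invented): it extracts only the content it depends on, falls back to defaults for missing optional items, and silently ignores everything unexpected.

    def tolerant_read(message: dict) -> dict:
        # Fail only if required content is missing...
        if "order_id" not in message:
            raise ValueError("missing required field 'order_id'")
        return {
            "order_id": message["order_id"],
            "currency": message.get("currency", "EUR"),  # optional: fall back to a default
            # ...and ignore any new or unknown items in the message.
        }

    tolerant_read({"order_id": 42, "coupon": "NEW"})  # the unknown 'coupon' does not break the reader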
Tolerant Reader
Which kind of reader?
Strict: crash as soon as one bit is off; fail if input validation fails, no matter what. Examples: XML Parser, Static Typing.
Tolerant: survive non-critical invalid input; validate the input only if actually needed; hide the impact of malformed input. Examples: HTML Parser, Duck Typing.
Which type of behavior fits the definition of a tolerant reader, and which one is the behavior of a strict reader? A strict reader will crash as soon as one bit is off, and will fail if the input validation fails, no matter what. The advantage of a tolerant reader is that it will only validate the input which is actually needed: if something that is not needed is invalid, it will still accept the message for downstream processing. Attempting to hide the impact of malformed input can be seen as a good recipe for being flexible and surviving the impact of change. It can also be a problem in case you are trying to estimate or detect the impact of a change, because if you hide the impact of such problematic changes, then it will actually be more difficult to discover the original source of the malformed input. Two representative languages (HTML and XML) show the behavior of each type of reader in their corresponding parsers. As most beginner Web developers have experienced, web browsers are very forgiving when reading strangely formatted HTML pages. Instead, if you try to parse XML messages, the same kind of errors will not be tolerated: as soon as you forget to correctly nest one tag, the whole document will be rejected. You can also think of the difference between languages that are statically typed and languages which are dynamically typed. You notice the difference when you try to take a JSON document and parse it into a Java object. If the structure of the JSON document doesn’t fit your class definition, then there will be a problem to transform the external representation, which uses a flexible format, into the internal one, which is much more constrained.
Let’s think about the future because the future will be here sooner than you think. How does the future influence the design of our architecture? There is a tension between your expectations about what the future will bring and whether you should already act on them in the current version of your architecture. We have already seen in a past lecture the term YAGNI (you are not going to need it) and how this influences API design. Still, if there are some things that you can anticipate that you will most probably need, then you might as well get your architecture ready for supporting them in a future version. If you might need it, you should prepare for it. You don’t have to build it yet, but it’s a good idea to ensure your architecture is extensible so that these future needs can be accommodated without having to redesign and rewrite the whole system from scratch.
Extensibility and Plugins
How to design an extensible architecture? use plugins
Allow end-users to independently choose which components to deploy anytime as long as they fit with the extension point interface
Plugins are a particularly useful mechanism to design an open and extensible architecture which can grow in anticipated ways. With plugins, the choice of the components that make up the architecture of your system is pushed and delegated all the way to the end users who are going to deploy and run your architecture. The end users will choose which components – in this case we call them plugins – your architecture will be configured to load, either on startup or while it runs. How do you control which components can be used? You can only use compatibility as a constraint: if users try to plug in something that is not compatible, clearly the plugin is not going to work. But as long as it fits, the user can plug in any component they want into your architecture. Introducing a plugin system opens up your design to a highly flexible and extensible system. It is also risky, because the more the architecture can be extended by third parties, the more you are willing to accept code developed by someone else. When something goes wrong, you don’t know exactly who to blame. Adding unknown plugins from random sources may void the warranty. If users attempt to extend your system in unsupported ways, they breach your support contract. Not to mention if untrusted plugins act as trojan attack vectors.
Extensibility and Plugins
[Slide: an Extensible Component exposes an Extension Point into which the Extension of a Plugin Component fits]
The architecture defines explicit extension points so that the system can be customized or even dynamically composed out of plugins
Examples: Browser Extensions, Eclipse, Photoshop, WordPress
How does the plugin architecture work? We are still in the context of a component-based architecture, in which, however, you define a specific type of interface called the “extension point”. This makes it possible to design extensible components whose interfaces contain explicit extension points to which other components can be connected (or plugged in), as long as they match the assumptions and expectations encoded in the extension point interface. We see this type of extensible architecture everywhere. Web browsers can be easily extended and customized by downloading and installing so-called browser extensions. Another example is the Eclipse technology platform, where the whole development environment is actually built out of plugins. Opening up your system with plugins can also support a business model where you make money not because you sell the platform, but because you sell plugins. Like Photoshop filters. Like WordPress, a content management system which is also the foundation for a rich ecosystem of all kinds of extensions.
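A minimal sketch of an extension point (all names are invented): the extensible component declares the interface that plugins must fit, and accepts whatever compatible extensions the end user decides to plug in.

    class ExportPlugin:                       # the extension point interface
        def export(self, document: str) -> bytes: ...

    class PdfExport(ExportPlugin):            # a plugin component
        def export(self, document: str) -> bytes:
            return b"%PDF..." + document.encode()

    class Editor:                             # the extensible component
        def __init__(self):
            self.exporters: list[ExportPlugin] = []

        def plug_in(self, plugin: ExportPlugin) -> None:
            # Accepted as long as it fits the extension point interface.
            self.exporters.append(plugin)

    editor = Editor()
    editor.plug_in(PdfExport())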
Extensibility and Plugins
Pre-requisite: Plugin interface description and discovery
Mechanism: Plug in and plug out
When are plugins plugged in?
• Static (configuration/deployment/installation)
• Startup
• Dynamic (without restarting the component)
Ecosystem: the extensible component becomes a platform where alternative plugins compete
When are plugins initialized or selected? You can extend an architecture statically when you install the system. As part of its configuration you can refer to a number of pre-deployed plugins, which will be loaded when the system starts. Configuring which plugins should be loaded can be as simple as storing the corresponding artifacts or packages in a given folder. Still, unless you stop and restart it, you will not be able to change which plugins get loaded and activated at runtime. We also have the option of a dynamic plugin system where the linkage between the plugins and the base components is actually done at runtime. What’s interesting is that you can plug in and also in some cases plug out components without having to restart the whole architecture.
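As a hedged sketch of startup-time loading (the plugins folder layout and the register() convention are assumptions, not a standard): every module found in the folder is imported and asked to hook itself into the extensible component.

    import importlib
    import pkgutil

    def load_plugins(editor, package: str = "plugins") -> None:
        # Discover and import every module deployed in the plugins package,
        # then let each one register itself with the extensible component.
        for info in pkgutil.iter_modules([package]):
            module = importlib.import_module(f"{package}.{info.name}")
            module.register(editor)

A dynamic variant would call load_plugins() again at runtime (and offer a way to unregister plugins), instead of running it only once at startup.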
Extensibility and Plugins
Cardinality: multiple plugins can fit in the same extension point
Recursive: plugins get extended by other plugins
[Slide: an Extensible Component hosts a Plugin Component in its Extension Point; the plugin offers its own Extension Point to a further Plugin Component]
In the simplest case, we have one component that is the foundation, which can be extended by plugins. This design can be applied recursively. Not only can we have an extension point into which we plug multiple plugins, but we can also recursively plug plugins into plugins. Plugins themselves, in other words, can be designed to be extensible and offer their own extension points. We have seen how to achieve extensibility with plugins and extension points. These are solutions to design flexible architectures from the days of component-based software engineering. Plugins assume that you physically download and install the components to run them. As you do so, you can also download, configure and activate additional plugins. Users enjoy the flexibility of this kind of open system, because they can decide how to extend it long after the original release has been shipped.
Microservices
Martin Fowler and James Lewis
The microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own container and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies.
We have already discussed the evolution of components into services because of the impact of the Internet on the software industry. While services focus on availability and successful services also need scalability, what about flexibility? We have now reached the need for the ultimate level of flexibility, in which we not only deliver the software as a service, but we do so in a way that it can rapidly evolve and is easy to change. Here you can read the classical microservice definition, in which we can see microservices positioned as an architectural style: an approach to develop a single application decomposed into multiple small services. So the definition builds on the notion of service, as long as their size can be measured and kept small. Each of these microservices is deployed and runs in its own container. We already discussed what containers are. Many of the concepts we have previously seen are coming back. Each runs in a dedicated container so that it can have an independent operations lifecycle. To communicate, microservices use some lightweight connector mechanisms, for example, some API based on HTTP. This makes it possible for them to be deployed remotely. They don’t have to be all running in the same local environment, because they can send each other HTTP messages. Then there is the sentence that describes how to decompose your architecture into these microservices. Each of them should do something specific from a business point of view. Each implements a different feature. Each delivers a different business capability. These design heuristics should guide you in your quest to decompose the system and select which services you will need. What is most important is that they are independently deployable. If each can be deployed independently, your architecture is flexible. You can change a microservice and you can deliver the change as soon as possible. You can release it, you can install it and you can run it, all independently of the other microservices that you have in your architecture. This is fundamental, and if it is not true for your architecture, then you can hardly call its components microservices. Since such deployment occurs often and you need to do it independently for many microservices, you shall do it supported by some automated deployment machinery. Yes,
we have already discussed how this works during the lecture on deployability, with continuous build, integration, delivery and deployment pipelines. You should also try to keep centralized management to a minimum. Every team is dedicated to an independent microservice, and teams need to coordinate only concerning the interfaces, the microservice APIs. All other kinds of decisions about how the microservices are implemented – for example, the choice of programming language, or how the data is going to be stored and which database should manage the state of each microservice – can be taken independently. Why is this important? There are many different programming languages. There are many different storage technologies, and you want to pick the best solution for the problem that you have. It is not true that if you use the same database you will get the same performance or the same expressiveness needed to represent information in the appropriate way for all microservices. The idea of using microservices is that you can pick and choose the right technology for the job without any centralized constraint. The only constraint which needs to be in place concerns the interfaces and the connectors. If you want software written in different programming languages to be deployed, you can use Docker; if these heterogeneous components need to talk to each other, they can speak HTTP. If you want to reliably change them, you use all the automated deployment machinery (containers), and to make it fast to evolve them, you must keep their size small. Or at least, that’s according to this classical definition.
To share with you a visual image of the elements that we are going to discuss in this lecture, I would like to introduce to you the Monolith. The Big Monolith at the bottom represents the situation in which you do not want to be. If you make a change to the monolith, you will do so slowly because it just takes too much time and effort before you can make a new release for the whole thing. It will also be very expensive to deploy it in the Cloud. What we do is break it down into microservices, which are supposed to be small. Here on top of the monolith you can see a couple of stacks of microservices. They are so lightweight and easy to deploy, for example, as a cloud native application. If one stops working, you can just throw it away and cut a better one.
Will this component always terminate? function f() { ... return 42; }
Development
To understand how architectural abstractions have evolved all the way to microservices, let’s take another look at components. The main concern when developing software components is whether they are correct: whether they give correct and timely results. Thanks to their interfaces, components become reusable. They can also be interoperable thanks to standardized interfaces. We can try to minimize their dependencies and ensure they are self-contained. The most challenging part of building high quality components is to prove that their implementation is correct and, just in case, also to write some tests that highlight the presence of bugs, should regressions occur after releasing bug fixes. You definitely want to avoid releasing components which, when invoked, will take forever to reply.
Will this service run forever? while (true) { on f { return f() }; }
Operations
When you move to the cloud, when you publish your components on the Internet, you have another issue, which is no longer to make sure that the component terminates, but actually the opposite. When a software component is delivered as a service, you must ensure that it is provided with an acceptable level of availability. This means that the service has to run forever. You have to do whatever it takes to make sure that the service never stops. While the service is running, clients will call it, and its developers then inherit the previous challenge: the service has to be able to compute something and give an answer as fast as possible, no matter how many concurrent clients there are. We have discussed the concerns of scalability and availability before. These are quality attributes achieved through operations. We’ve seen that you can introduce redundancy into the architecture. You have to monitor heartbeats, and you run a watchdog which will take swift recovery actions when something fails. Likewise, you will run an autoscaling controller which can allocate or deallocate resources depending on the workload. Architects, developers and operators learn how to design, build and run service oriented architectures which have to be available no matter what. And in some cases, if they are successful, they also need to scale.
Will this microservice continuously change? while (true) { on f { /* return f() */ return f2() }; }
DevOps
As we evolve into microservices, we have another challenge ahead. We not only have to build something that runs correctly and gives back a result; not only do we have to keep it available and possibly scale it; but we also have to be able to continuously change it. The only way to do all of the above requires developers and operators to work together and close the software lifecycle. This is why it is called the development and operations (DevOps) lifecycle. There’s no longer a boundary: there is no distinction between development and operations anymore, since we are continuously changing and redeploying our system. It’s easier to do that if you keep the amount of changes that you make and the size of the overall component as small as possible. To summarize the differences between components and services: components are reusable; they need to be deployed and installed, and they are operated under the ownership and control of whoever is using them. So typically you build a component and you release it. And then you can forget about it, as it runs outside of your control. If we make the component remotely accessible, to also give quality of service guarantees about its availability, we need to be in charge of running it, not only of developing it. So we inherit all the challenges component developers face, but we also become an operator tasked to keep the service alive, so that users can call it from all over the world. If we can do all of that but also keep continuously evolving our system, then we have a microservice architecture. So I hope this has clarified the core differences between software components, services and microservices. Do not worry too much about the size. We will see that it is not so important. What is critical is this ability to continuously change all the elements of your architecture in a way that doesn’t limit the rate of change and can push it to hundreds of releases per day.
DevOps
[Slide: the DevOps loop: Plan, Code, Build, Test, Release, Deploy, Run, Monitor]
Close the Feedback Loop between Operations and Development to significantly speed up the development release cycle
Scripting of Automated Build, Continuous Integration, Testing, Release, Deployment and Operations Management tasks
We have seen this DevOps cycle already. While before we focused on the deployment transition between development and operations, with microservices we close the loop and run it as rapidly as possible. To do so, you need automation: you introduce continuous integration and build pipelines, and you ensure quality with tests. If you can make sure that making a release and a new deployment is as easy as clicking a button, if there are no risks in trying out a new version with green/blue deployments, and if doing experiments for A/B testing is not a major multiyear, high-risk project but something common which happens every day, then you are already going through this cycle multiple times. Adopting microservices implies adopting this type of practices and all the technologies (automation, continuous build pipelines, containers) that go with them. The goal is to be faster in doing iterations over the lifecycle. And if you can iterate faster, you can be faster in improving your system as well as learning from your users by observing them, getting explicit feedback and dealing with it. Also, you can afford to move fast and break things. It will happen: you will make mistakes and break things, but as fast as you break them you should also fix them. We have seen you should be able to undo failed releases. Also, with two in production you should not have to expose all clients to a new service version. Overall, with DevOps the goal is to move fast while keeping the quality high. Quality and speed are no longer a trade-off, given the appropriate safety net of practices and infrastructure.
DevOps Requirements (Gregor Hohpe)
• Confidence in the code (code reviews, automated tests, small, incremental releases, feature toggles)
• 100% automated, dependable and repeatable deployment
• Monitoring feedback (smoke tests, A/B testing, analytics)
• Secure build and runtime (every change may introduce weaknesses)
• Elastic runtime (dynamically adapt to workload changes)
How can you check if you are ready to adopt microservices? Are you in a company that not only wants to adopt this architectural style, but is also willing to follow these continuous development and release practices? First, you need to have confidence in the quality of what you do. There are many techniques which help boost your confidence. For example, you want to adopt code reviews. I hope you have been hearing about what a code review is in some of the software engineering lectures. You should also have automatic quality assurance. This requires writing, running and maintaining all kinds of tests: unit tests, integration tests, capacity tests, performance and scalability tests. Regarding availability, you can use chaos engineering to turn unlikely or infrequent events into regularly occurring ones: inject failures in a controlled way and see if your system survives thanks to the automated recovery tools you have introduced. The larger your code, the more difficult it is to be confident about whether it meets its functional and extra-functional requirements. Here is the microservice size argument again: changes should be small so that their impact is easier to control. Also, you want to make frequent incremental releases. You do not want to make a Big Bang release every year. You want to make continuous releases every few seconds. One technique that helps you do that, which we are going to discuss, is called feature toggles. Microservices cannot be adopted without continuous integration technology. 100% automated means that the whole deployment and release process is reliable and repeatable. It does not depend on the presence or the good health of certain team members. Everyone should be entrusted to push code into production, and everybody should be responsible for the consequences. If things go wrong, you want to have a feedback loop which makes it possible to observe what is happening and detect problems, with or without the help of your users, so that you are able to react to them. We saw, for example, that as soon as you activate the new release you have to run a smoke test. You have to be able to quickly check if you did something stupid, in which case you want to pull the plug and revert back to the previous stable state. You want to learn from your users, so you want to be able to do experiments with A/B testing. In general, choose the right metrics so that you can measure the way your system operates. You want to quantitatively observe and track users. This is no longer a piece of software that someone in some unknown corner of the world takes off the shelf and installs on their machine which is disconnected from the Internet. Users instead are going to interact with you every day. You get to know who they are, you can build a profile, and you can remember everything they ever did. With this knowledge you can try to do all kinds of analysis that can give you actionable insight into – let’s put it in a positive light – how to improve your system so that users become more efficient at what they do. A not-so-positive consequence, now that you can observe and track your users, is to study their behavior and experiment with ideas on how to get more money out of their pockets, to increase their engagement or addiction with your service, or to nudge them towards paying for those extra features they may not actually need. We’ve seen also that the runtime environment not only has to be highly available, but we can expect it to achieve elastic scalability: given a target performance and by observing the workload, it can allocate or deallocate the resources that are necessary to deliver that level of performance. As the workload is something that you cannot control or predict, you have to be able to adapt dynamically to changes in traffic. Also critical, since you are making these changes very frequently: you do not want anybody to tamper with your build pipeline and inject malicious code into it. You have to be really careful about the dependencies, where they come from and whether they do something more than they claim to do. Today’s tools will happily fetch the dependencies, package them together with your software as you ship it and deliver it to be deployed everywhere. You have to trust the quality and safety of this code that you depend on. More and more attacks nowadays are not focused on the systems in production, because they are well protected: after all, the Cloud is secure and data centers are hard to get into. But maybe the developer machines are weak from a security standpoint. Maybe the build servers and the continuous integration pipelines are weak, so you attack those. And that’s when you can inject into the new release the backdoors that you can use later to get into the system and exfiltrate sensitive information. You should invest into securing not only the production environment – of course you have to protect the data and the servers – but also your build pipeline, from which a stream of fresh software releases flows onto the servers. As you can see, most of the concepts that we have introduced this semester are applicable to microservices.
Feature Toggles How to minimize the cost of undoing changes? compose features using configuration Activate or deactivate features with configuration switches, without having to change, rebuild and redeploy the code
Let’s now focus on this particular technique for gradually and gracefully introducing change into an architecture. Feature toggles help to keep the cost of introducing changes low and also help to easily undo the effect of problematic changes. You’re writing some code, but you’re not so sure that this code is going to be an improvement. You should design your code in a way that makes it as cheap as possible to add or remove the feature. After you check whether it works, you can still quickly go back to the previous version if it doesn’t. This is where configuration comes into play. We use configuration to postpone making design decisions until our users are ready to make them. Features can be activated or deactivated (toggled) as part of the system’s configuration. The configuration composes the features that you want to use in production. Every feature in your code is associated with a configuration toggle, a switch that you can activate or deactivate. By definition, a configuration change is cheap because it does not require you to recompile and reinstall the system, but only, in the worst case, to restart it.
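As a minimal sketch of this idea (the function names and the toggles.json file are invented for illustration and not taken from any specific framework), a configuration-driven toggle point might look like this:

```python
# A configuration-driven toggle point: flipping the flag requires no rebuild,
# at worst a restart (or a configuration reload) of the system.
import json

def legacy_search(q): return [q.lower()]        # version 1: the stable feature (stub)
def improved_search(q): return [q.lower(), q]   # version 2: dark-launchable (stub)

def load_toggles(path="toggles.json"):
    """Example file content: {"improved_search": false}"""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}  # default: every new feature stays off

TOGGLES = load_toggles()

def search(q):
    if TOGGLES.get("improved_search", False):  # toggle point
        return improved_search(q)
    return legacy_search(q)
```

Editing the flag in the configuration file switches the behavior without touching, rebuilding or redeploying the code.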
Feature Toggles
[Figure: Release lifecycle with a feature toggle. Deploy an intermediate release containing both Version 1 and Version 2 with the toggle inactive; if the test is ok, activate the toggle so Version 2 runs; if the test fails, revert to Version 1; once stable, deploy the new release containing only Version 2.]
Given a stable, configurable system we can produce a new version with a new configuration option. Activating or deactivating such an option is a very simple way to make the feature come alive, or to remove it because it doesn’t work. We are in a transition phase, going from one version to another. We know that when we make this evolution step, the change is risky. We’re going to build three different versions of the system. We start from the initial version, where the new feature is missing. Then we have a version in which both variants are embedded and the new feature can be activated or deactivated. And then eventually we get to the new version, in which the new feature is stable, always present and running. This hybrid, intermediate version is the one that also contains the feature toggle. In the initial phase we are going to introduce the change, release it, but keep the change inactive. In reality, we’re still running the old version. Users do not see the effect of your change yet. But the change is already baked into your system. You may ask: what’s the point of introducing a change into a system, building it, releasing it, but not using it? Why would you want to go through all that? It is worth doing if you’re not yet sure about the new version. You make a so-called Dark Launch. You launch the new version, but the new feature is not visible. Still, you can check if the mere presence of the new code has an effect on the previous version. You can see if there is some conflict between a version assumed to be stable and the new code that you have added. The new code is not active, but if the original version was stable and this hybrid version is unstable even without running the new code yet, you know that there is already some problem. If you detect some instability even before you activate the feature toggle, you can already go back to version one and try again.
If, on the contrary, the dark launch result is still stable, then you can finally switch the toggle. Now we go into a different state in which the new version is active, so users can actually try to use it. And then you can observe what happens. This transition is very cheap: just flip one bit to activate the new feature. If you have a problem, it’s again only one bit that you have to switch to revert back to the original version. This idea really works only if the switch between the two versions is fast. If you have to rebuild your system to activate the feature, then this is not a feature toggle. Feature toggles are part of the configuration. If you’re happy with the new version, then of course you can make a stable new release, and it’s highly recommended to throw away the old version. Why do you want to throw it away and not keep the feature toggle around? It’s meant as a rhetorical question: you don’t need it anymore because now you trust the new version. But if you want to keep your system flexible, why not be able to activate and deactivate every feature all the time? What if there is a version 3? Yes, you should be able to switch between versions 1, 2 and 3. Keep in mind that this is not necessarily a version toggle, but a feature toggle. So for every feature that you introduce in your system, you should be able to activate or deactivate that feature. How many possible versions are there as a result of all possible combinations of each feature of your system? Here we have two versions because we have one feature that can be active or inactive. Imagine that you have F independent features and you want to control each of them independently. How many versions do you need to test? How many possible versions or variants of your architecture are there? The number of configurations, 2 to the power of F, grows exponentially, and it is simply not feasible to maintain such a large and fast-growing configuration space. Feature toggles are a great technique to have flexibility during transitions between stable releases. You may think: once we built the feature toggle, we might as well keep it over the entire history of our system, you know, to keep the architecture flexible. Ok, but you might end up with a combinatorial explosion of possible configurations, possible feature combinations, and if you really have to run tests for all of them, it will be very expensive and take longer and longer to check all possible versions. So expensive that you will probably not do it exhaustively and only manage to test the most frequently used feature combinations. There is historical evidence of systems that failed because they forgot to remove a feature toggle. After a number of years, somebody by mistake deactivated or activated an old feature toggle and this led to a major failure in production. The toggle was still there, but that particular combination was untested, and they learned the hard way that it wouldn’t work.
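To make the combinatorial explosion tangible, here is a tiny sketch (the feature names are made up) that enumerates every configuration an exhaustive test suite would have to cover:

```python
# Enumerate all configurations induced by F independent boolean feature toggles.
from itertools import product

features = ["new_checkout", "fast_search", "dark_mode"]  # F = 3, hypothetical
variants = list(product([False, True], repeat=len(features)))
print(len(variants))  # 2**3 = 8; with F = 20 toggles it is already 1,048,576
for combo in variants:
    config = dict(zip(features, combo))
    print(config)  # an exhaustive run_test_suite(config) quickly becomes infeasible
```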
Feature Toggles • How to implement? • Toggle Points: if, #ifdef, late binding, strategy, factory • Toggle Routers: Configuration File, Configuration Store, Environment Flag, Command Line Argument • Context: A/B testing, Canary Releases • When to toggle? On build, deploy, startup, live • How to test? Always run tests on both sides of the toggle • Feature Toggle Combinatorial Explosion: remove toggles after features become stable to keep the release validation complexity under control and avoid potential unexpected feature interactions There are many ways to implement feature toggles. We can use compilers, macros, simple if statements, late binding, virtual methods. Are you familiar with the strategy pattern? You can also use the factory pattern to create an instance of different classes depending on which feature is activated. In general, you have variation points or branches in your code which you can use to program how to toggle between an activated or deactivated feature. How do you control what gets executed at those points? For example, an if statement tests a value that comes from a configuration file. This assumes that somewhere there is a file that you specify using some language (nowadays typically YAML, but it doesn’t have to be). This configuration file may be centrally managed in a repository or in an operating system registry. Depending on the platform there are conventions for locating config files, environment variables, and command line settings for each particular deployment (e.g., development, testing, staging, production). There can also be specific databases that are just used for storing configuration settings. Configuration settings can be persistent, but also specified or overridden at startup through environment variables. They can be passed as command line arguments. Some configuration changes can also be dynamically applied by the end users based on their preferences. For example, canary users are able to activate or deactivate feature toggles. They signed a special agreement to indicate they are willing to risk it and run experimental features on the latest version. They are aware that some feature toggles could be unstable, but they’re willing to try, knowing that they can easily revert back to the stable version by changing a few flags. This can also work with software as a service: when we get requests from such a user, based on their profile configuration we will activate the feature only for them, while all the other users will run the previous configuration with the feature deactivated.
This kind of feature toggle is not about changing a config file option, but about associating the toggle with the user account or the API key that is sending you the requests. Depending on how the feature toggle is built, we can constrain who can flip the switch and when this can happen. Build-time feature toggles typically use macro expansion or similar metaprogramming constructs. Changes of configuration files can be done at deployment time by installer scripts prompting operators. At startup, the system can read command line arguments or environment flags. We can also ask users live, and this helps to combine feature toggles with A/B testing or canary releases. How do feature toggles interact with testing and quality assurance? Before and after you activate a feature toggle you should run the tests. Green tests are a pre-requisite to switch it on. Green tests are needed to keep it on. After you are confident about the new feature, to avoid the feature toggle combinatorial explosion, remove the toggle. Once you trust the feature, burn it in: hard-wire it into your system so that you cannot go back, because you don’t need to go back anyway. The more feature toggles you have open at the same time, the more variability you have and the more difficult it is to keep unwanted feature interactions under control.
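The following sketch combines a toggle router with a toggle point for canary users (the environment flag, the user names and the checkout functions are all hypothetical):

```python
import os

def legacy_checkout(cart): return sum(cart)            # stable feature (stub)
def new_checkout(cart): return round(sum(cart), 2)     # experimental feature (stub)

CANARY_USERS = {"alice", "bob"}  # users who agreed to run experimental features

def is_enabled(feature: str, user: str) -> bool:
    # Toggle router: an environment flag acts as the global switch...
    if os.environ.get(f"FEATURE_{feature.upper()}", "off") != "on":
        return False
    # ...and the user's profile decides who actually sees the active feature.
    return user in CANARY_USERS

def handle_request(user: str, cart: list) -> float:
    # Toggle point: the branch selects between the two embedded versions.
    if is_enabled("new_checkout", user):
        return new_checkout(cart)
    return legacy_checkout(cart)

os.environ["FEATURE_NEW_CHECKOUT"] = "on"      # simulate flipping the flag
print(handle_request("alice", [9.999, 20.0]))  # canary user, new code path
print(handle_request("carol", [9.999, 20.0]))  # everyone else stays on version 1
```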
How small is a Microservice? Full control of the devops (code, build, test, release, deploy and operate) cycle by one team Iterate fast: Many small frequent releases better than few large releases Rapid Evolution: If you have to hold a release until some other team is ready you do not have two separate microservices Avoid Cascading Failures: A failed microservice should not bring down the whole system Focused Investment: Scale and Replicate critical parts of the system as opposed to the whole system Let’s go back to the discussion about microservices and their size. What do we mean by small? How small is a microservice, really? The limiting factor is how many developers you have, or how many teams of developers there are to build your architecture. Each team is responsible not only for writing the code, but also for building and testing it, for making the release, deploying it and operating the microservice. We are talking about a DevOps cross-functional team. One of these teams will have between 5 and 10 people. What’s the largest possible piece of software that a team of about 10 people can assemble? This already gives you a limit on how large each microservice is going to grow. Then you can ask: is every team going to run their own microservice? Or are there ambitious teams responsible for multiple microservices? For sure, you will not have more than one team working on the same microservice. That would contradict the autonomy principle: every microservice is independent. Therefore you have an independent team that is going to both develop and operate it. Size also impacts the speed at which the team can iterate and incrementally grow their microservice. Let’s consider a client/server architecture. If you make a change in the server, can you release it before the client has been upgraded? If client and server are developed and operated by autonomous teams, one group can change the server without asking the client team for permission. You have autonomy. You have the independence that you need to set up a microservice architecture. If one microservice team is ready for the release but they have to hold it to ask for somebody else’s permission, then they’re not really independent microservices. There is a point at which their processes block, and their DevOps cycle has to synchronize with the other. This is a very simple behavior to observe: we announce we want to make a change; we will push the release tomorrow. If somebody calls you in the middle of the night and says please, please don’t do it, because otherwise we’re going to be severely impacted by the change, then you do not have a microservice architecture.
If you can make a change whenever you want, wherever you want, then you have managed to keep your microservices decoupled and independent. This property is related to size, but it’s really a different property. So far we have discussed examples of changes from design time all the way to deployment time. Let’s talk about availability and run-time failures. One of the advantages of microservices is that each microservice can fail independently of the others. If one fails, then the rest will survive. This is also known as partial failure: the idea is to avoid bringing down everything when just one element fails. We’ve seen how to achieve this with redundancy and also by introducing, for example, circuit breakers or asynchronous messaging. These types of solutions help the whole system to survive the loss of availability of individual elements, which can come and go, go up and down, without affecting the rest of the system. A monolith has the advantage that there are no partial failures: either everything works or nothing works. Of course, if nothing works we have a major problem. With microservices, the idea is that most of them will work most of the time, so users can still access the applications and their workflow is not entirely compromised if only a few features or activities get disabled from time to time. If you have a hot spot, an element of your architecture that becomes a performance bottleneck, you know where to invest to provide the additional capacity. Microservices also help with scalability because you can control each one independently and decide how many resources to allocate: how much computing power, how much storage, how much bandwidth should be given to run this particular element. Overall this can be more efficient than having to scale a large system. Focusing the investment on just a few critical elements is cheaper than coming up with a fully redundant deployment of the entire system. These are all different forces that help us keep the size of the microservice in check. You want to encourage high-frequency, incremental releases, and therefore keep the changes small. You want to grow the microservice, but not too much, otherwise you would have to deal with all these synchronizations with other microservices that would block your continuous stream of releases. You want to have something small that can fail while the rest survives. If you grow too big, then when this part fails, maybe it’s so important that the rest cannot survive, and most users will start to notice, as it may take a long time to recover. The same is true for scaling: small microservices which become hotspots are cheaper to scale up and scale out.
Continuous Evolution
[Figure: Code size plotted over time; one large commit after a month versus many small daily changes reaching the same point.]
Another measurable aspect is the amount of change applied to a system. While the overall system can be large or small, the change that we apply to it should be kept small. To make changes often, each change must be small. Only by changing less frequently can we afford to apply larger changes. The picture shows size as a function of time: the history of the growth of your system, measured with some metric counting the size of the code. It does not necessarily grow monotonically. In most cases, the size will grow as more code, more features, more extensions get integrated. But some changes may refactor, drop some features, consolidate and reduce redundancy by shrinking the system. Each point is measured when a new version is released. The vertical difference shows the growth in the code, while the horizontal distance indicates how long it took to write, build, test and release the new version: how long it took to apply the change. We can make a big jump from the initial 1.0 version to the new 2.0 version. How long did it take? Months, years? How much code was added or changed? Thousands, millions of lines of code? The alternative is to follow a continuous evolution path, in which changes happen every day. But since we push multiple changes every day, there is only so much that can change in one day. The amount of change will not be as large, but it will happen more often. Which strategy will help you reach the goal faster? A few big jumps or many small ones? Are you sure you will end up in the same spot? In this chart, the two trajectories overlap: we may either make a big jump or a lot of small jumps, but we actually end up in the same place. Also, we hope that we land in the right place. What does a release bring? The opportunity to learn whether the work invested into it works in production with actual users. Every time we make a new release, we have a chance to fail, recover, and learn; whereas if you make one Big Bang release, one big jump, the probability that you fail is much higher, given the larger amount of change involved. If you have to recover, maybe you have to work for a long time to fix it and
you will still struggle with the quality of a large fix. Also, experimenting with so many changes together in one release will make it more difficult to observe the relationship between individual changes and the resulting user satisfaction. We have illustrated that small changes help to gain confidence at every step, so that iterations can proceed at a higher rate. The increased overhead of making more releases (with small changes) is compensated by avoiding the increased risk of making fewer releases (with large changes).
Hexagon Model
[Figure: Two hexagonal microservices connected through their APIs. Production axis: UI on top, DB at the bottom; quality-assurance axis: test drivers and a test database; integration axis: provided and required APIs.]
Microservices are often visualized as hexagons. Where does this hexagon visualization come from? When representing an architecture, either in the logical view or sometimes in the deployment perspective, you will see this shape associated with microservices. It looks great, since you can tile many microservices side by side and it will look less boring than the usual boxes. However, the hexagon as a metaphor for components is even older. The idea is that a hexagon has six sides, and different sides represent different contexts in which the component can be used. The main vertical axis represents the production side of your system. When you put a component in production, it means that you have a user interface on top. If the component is stateful (most of them usually are), it will use a database, shown at the bottom, to store its state persistently. So this is the production axis: users access and process their actual data. Before you get to production, you are staging and testing your component, and doing manual tests through the user interface is expensive. Instead you run automated tests that control the component. These are often encapsulated in test drivers and are shown attached to the top-right side of the hexagon. They exercise the code, trying to cover 100% of all the paths. When you run the tests, you shouldn’t do it on top of your production database and potentially corrupt all the data. Instead, you want to attach, as shown on the bottom-left side, a copy of the database dedicated to testing. This has the advantage that you can write test oracles which have expectations about the initial state of the component. The visualization shows that stateful components under test are attached to a test database, and the test drivers use it to check that expectations are fulfilled when the component is using the test data. This is the quality assurance axis. But this is a hexagon, so we have one more axis: the integration axis. That’s
where we find the API of the component which will be used to integrate it with other components. The top-left side represents the interface provided by our microservice to the rest of the microservices which require it, while the bottom-right side shows the opposite dependencies, the ones required by our microservice and satisfied elsewhere in the architecture. For unit testing, the microservice wouldn’t work without its dependencies: in the same way stateful components are bound to a test dataset, we can connect the microservice to mock microservices, which are then switched to the real dependencies for integration testing. This hexagonal component model – recently adopted for microservices – helps us to distinguish three different types of interfaces along the production axis, the quality assurance axis and the integration axis. In general, elements attached to the top sides depend on our component, while elements attached to the bottom sides represent our own dependencies.
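A minimal sketch of this ports-and-adapters idea, with invented names: the component's core depends only on an abstract port; production attaches a database adapter on the bottom side of the hexagon, while the test driver attaches an in-memory test double with a known initial state instead.

```python
from abc import ABC, abstractmethod

class OrderStore(ABC):                      # port: what the component requires
    @abstractmethod
    def save(self, order_id: str, total: float) -> None: ...
    @abstractmethod
    def load(self, order_id: str) -> float: ...

class SqlOrderStore(OrderStore):            # production adapter (stubbed here)
    def save(self, order_id, total): ...    # would issue INSERT/UPDATE statements
    def load(self, order_id): ...           # would issue a SELECT

class InMemoryOrderStore(OrderStore):       # test adapter, bottom-left side
    def __init__(self): self.rows = {}
    def save(self, order_id, total): self.rows[order_id] = total
    def load(self, order_id): return self.rows[order_id]

class OrderService:                         # the hexagon's core logic
    def __init__(self, store: OrderStore): self.store = store
    def place(self, order_id, total): self.store.save(order_id, total)

# Test driver (top-right side) exercising the core against the test double:
store = InMemoryOrderStore()
OrderService(store).place("o-1", 42.0)
assert store.load("o-1") == 42.0
```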
Decomposition
[Figure: A monolith decomposed into many small microservices.]
When we introduce microservices, the big challenge is how to transform an architecture with a single element into one with multiple elements. While this is relatively easy to do for the code, following the good old rule of maximizing cohesion and minimizing coupling, it is more difficult to split the database: a large schema with many relationships has to be broken into many small ones. Once you have decomposed the monolith into multiple microservices, you also need to recompose them using the right type of connectors. But first we go in the opposite direction: we start from something large and try to break it into little pieces.
Independent DevOps Lifecycle
[Figure: Each microservice runs its own complete Plan, Code, Build, Test, Release, Deploy, Operate, Monitor loop, while the monolith is served by one single shared loop.]
Why do we want to do that? Because this way every one of those pieces will follow its own independent development and operation life cycle. You will be able to introduce changes into each of the microservices without affecting the life cycle of the others. Instead, if you want to make a change in the monolith, you have to stop everything and restart the whole thing every time, and every change can potentially affect everything else in the architecture, since there is no structure to contain and absorb the impact of changes.
Isolated Microservices
[Figure: On the left, a modular monolith whose Customer and Order components share a single database and one DevOps lifecycle; on the right, Customer and Order microservices, each with its own database and its own independent lifecycle.]
One important constraint that is also part of the microservice definition regards what kind of connections we can establish between them. In this picture we can see a shared database connector between two different components. They use the same database to store their state, and they can also communicate through this database. The boundary surrounds both of them: on the left we are still talking about a modular monolith. It is deployed with a single database, but the code inside is already modular; we can already distinguish the different subcomponents. If we break this into microservices, then each of the code components will have its own independent storage. The question is: how do these elements communicate? There are two possibilities. The black arrow represents the correct solution, in which one microservice asks the other one for some information. The red arrow instead bypasses the API of the microservice and goes directly to the database to fetch the data. This mimics what happened in the original version, where it was allowed because the database was shared. With microservices the data is not meant to be shared at all, so using this connector is actually a mistake. Why is this such a big deal? It has to do with flexibility - today’s topic. It has to do with how easy it is to introduce changes into the internal representation of the data owned by each microservice. If both components access the same database, the database schema is shared between the two components. With microservices we want to avoid that, because if we do not share the schema, we can change it without affecting external components. However, if a component directly reads the information from the other database and we change the schema of this database, the component will be affected. Microservices should encapsulate their database behind their API so that from the outside you can still access the information, but you do not make any assumptions about the way it is represented inside. The constraint recommends using appropriate layers to offer a compatible representation of shared data to the outside, while keeping the freedom to change the way it is represented inside the microservice.
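The following sketch contrasts the two arrows, assuming a hypothetical customer service with a REST API and a private SQLite database; only the first function respects the microservice boundary:

```python
import urllib.request, json

def get_customer_ok(customer_id: str) -> dict:
    # Correct (black arrow): go through the owning service's published API;
    # the customer service can change its schema freely behind this interface.
    with urllib.request.urlopen(f"http://customers/api/{customer_id}") as r:
        return json.load(r)

def get_customer_wrong(customer_id: str) -> dict:
    # Anti-pattern (red arrow): reach into the other service's database.
    # This hard-wires its internal schema into our code; any schema change
    # over there silently breaks us here.
    import sqlite3
    con = sqlite3.connect("customers.db")   # somebody else's private state
    row = con.execute("SELECT name FROM customer WHERE id = ?",
                      (customer_id,)).fetchone()
    return {"name": row[0]}
```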
Isolated Microservices
Werner Vogels
For us service orientation means encapsulating the data with the business logic that operates on the data, with the only access through a published service interface. No direct database access is allowed from outside the service, and there’s no data sharing among the services.
Introducing microservices requires keeping them isolated. This means that you use encapsulation not only for the business logic but also for the data. The only way you can access the data inside a microservice is through its API, the service interface. There is no data sharing; no direct database access is allowed. This is one of those rules that you have to follow if you want to call your architecture a microservice architecture. If for some reason you bypass the interface and directly extract the information from the database, you’re violating this design constraint. In the short term, maybe you get an advantage: maybe it’s faster to just send a query to the database to read the data that another microservice needs. In the long term this will create a problem, because it will prevent the microservice from independently evolving the internal representation of its data. The boundary you draw then no longer includes only the microservice and its data: it has to include all other microservices which share some assumption about the data, which is no longer private, hidden inside the microservice, but has potentially become shared among all microservices of your architecture.
Splitting the Monolith 1. Start from a working system with a modular, monolithic architecture 2. Decompose it into microservices as it grows too large for a single team
Later it becomes easier to see where to draw the boundaries
Martin Fowler
It is faster to start with a working centralized system
How can we design a microservice architecture? Should we start from scratch with microservices? Or should we split an existing monolith? Let’s start from something that works but is just a single element. We notice that it is now growing beyond the capacity of a single team. We hit a limit and can no longer manage the growth of this particular element. It is just too large and our velocity is slowing down. This is a good starting point for decomposing it. The message is: don’t start from scratch with microservices; start with whatever it takes to get it working. And once it works, you can improve the way you can evolve it in the future by introducing the practices we have discussed so far. Microservices are better for evolving your system, but they’re not as fast for starting a new system from scratch. How do we make the transition? Once you have a system that works but is centralized, you can start to see which boundaries you can draw around its internal components. Along which lines can we split it? This is an example of an architectural refactoring of a system. We are not going to refactor the code. We will refactor the design so that we can make the transition from a monolith that has grown too large to multiple microservices. As we split it into different parts, the constraint is that we want to keep the client working during the transition. From the client perspective, the system should be available all the time. It’s like when they do some construction work on the highway, but they cannot close the highway: you want to keep the cars driving through. Or maybe a more appropriate image for software: you are flying an airplane and you decide that you need to make a few changes, but you cannot land anywhere to replace the engine.
Splitting the Monolith
[Figure: Client connected directly to the Monolith.]
Assumption: the Monolith should have some kind of interface At the beginning we have a monolith, and we have some kind of interface that allows the client to interact with it. This is the initial state of our architecture. The first thing you should do is to isolate the client from the system you are working on.
Splitting the Monolith
[Figure: Client connected to the Monolith through a Proxy.]
Add an intermediate layer to intercept and redirect traffic You should then introduce an intermediate layer so that you can control the communication between the two parties: you can intercept it and redirect it. This is usually done through some kind of proxy. The HTTP protocol supports proxying for free. If the protocol is more complex, then you have to find a solution. In the worst case you can resort to a man-in-the-middle attack.
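A hand-rolled sketch of such a proxy (the paths and ports are invented; in practice you would configure an off-the-shelf reverse proxy such as nginx or Envoy rather than write your own):

```python
# Path-routing proxy: requests for already-extracted functionality go to the
# microservice, everything else still goes to the monolith.
from http.server import BaseHTTPRequestHandler, HTTPServer
import urllib.request

MONOLITH = "http://localhost:8080"      # hypothetical backends
MICROSERVICE = "http://localhost:9090"

class Proxy(BaseHTTPRequestHandler):
    def do_GET(self):
        backend = MICROSERVICE if self.path.startswith("/orders") else MONOLITH
        with urllib.request.urlopen(backend + self.path) as upstream:
            body = upstream.read()
        self.send_response(200)          # simplified: status/headers not forwarded
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 8000), Proxy).serve_forever()
```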
Splitting the Monolith
[Figure: The Proxy routes client requests either to the Monolith or to the first extracted Microservice.]
Carve out the first microservice and redirect requests towards it To test if the replacement works, run them in parallel and compare results Assumption: the microservice is stateless Now that you have intercepted the traffic, you can start to carve out, extract, and package the first microservice. Redirect the relevant requests towards it. The client depends on a certain interface. Part of this interface is a subset we can implement in the microservice. When the client wants to access that part, we send the request over to the microservice; when the client wants to access everything else, the request still goes to the original system, from which we have cut off a small piece. In reality the piece is still there, but we don’t use it anymore. Typically you never actually touch the monolith: you are afraid to change it. If you want to test how successful the replacement was, you can take a request, send it to both sides and then compare the results. If the microservice and the monolith agree, then you know that you have made a good replacement. If the microservice sends back a different result, you can blame it: the monolith is the ground truth. This helps to catch mistakes in the transition, using the original implementation as the oracle. This is very easy to do if the microservice is stateless: you can just take the code and put it in a new container; there is no state that you have to migrate. For stateful microservices you also need to partition the database. You will need to extract a subset of the schema which will be encapsulated within the microservice. If you need to migrate part of the state of the monolith and bring it into the microservice, you can still follow this process, but at some point you will need to decide which fork of the state will become the master version. At the beginning the state within the microservice is likely to drift and become incorrect, so it will often need to be reseeded from the master copy maintained by the monolith. Eventually, however, you will stop updating the state in the monolith, after you fully trust the correctness of the microservice.
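This parallel run can be sketched as follows (the URLs are hypothetical): every request is answered by the monolith, while the microservice's answer is only compared against it and mismatches are logged:

```python
# Shadow traffic: serve the monolith's answer (the ground truth) and log
# whenever the freshly extracted microservice disagrees with it.
import urllib.request, logging

def fetch(base: str, path: str) -> bytes:
    with urllib.request.urlopen(base + path) as r:
        return r.read()

def shadow_call(path: str) -> bytes:
    truth = fetch("http://localhost:8080", path)          # monolith: the oracle
    try:
        candidate = fetch("http://localhost:9090", path)  # extracted microservice
        if candidate != truth:
            logging.warning("mismatch on %s", path)       # blame the microservice
    except OSError:
        logging.warning("microservice failed on %s", path)
    return truth                                          # clients only see this
```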
Splitting the Monolith
[Figure: The Proxy now routes client requests to the Monolith or to several extracted Microservices.]
Keep extracting microservices (most valuable, least risky first)
After the first step, the first microservice is working. The client didn’t notice the switch. We can repeat the process as many times as needed. Here’s the second piece. We have prioritized the work to extract the most valuable piece of code, and the one that is least risky to repackage as a standalone microservice. Sometimes during the transition we are actually rewriting the code, for example from COBOL to Java.
Splitting the Monolith This migration can be done gradually without affecting clients Much easier to split stateless components than stateful ones
Only at the very end can the monolith be retired
[Figure: The Client talks to an API Gateway which routes to the Microservices; the Monolith is gone.]
This major undertaking can take a long time, but eventually you have managed to trick the client into believing it is still talking to the old Monolith, while in reality, behind the scenes, you have your microservice architecture and you have retired the Monolith. This process can be done slowly and gradually, and at every step along the way you can check that the clients do not notice.
In this lecture we have seen how to make architectures flexible. This included the microservice architectural style. The term emphasizes the small size of the software components, with some even bringing mini-services or nano-services into play. In my humble opinion, size doesn’t really matter. What matters is whether and to what extent the elements in your architecture are coupled to each other. If they are loosely coupled or not coupled, they can change independently and they can change often. Does size impact coupling? For sure, but you can also have large components which have very low coupling and can therefore evolve independently, or end up with many small but tightly coupled components. Another heuristic related to size is known as the ”2-pizza team”: it should be possible to feed each microservice team with 2 pizzas. In Italy this would translate to a maximum of one developer and one operator; elsewhere pizzas can feed a few more people. As we have almost finished this lecture, let’s go back over the history of software architecture and see where some of these ideas, so popular nowadays with microservices, come from. We will see that they are actually much older than the buzzword they are associated with. This is a very common trend in our industry: repackaging old ideas with new labels. It is easier to keep selling new technologies if you just change their names. For students, this might be hard to believe because you’re trying to learn the current state of the art, but after a few years of industry experience you will start to observe these paradigm shifts sweeping through the technology landscape. You should realize that the problems you are trying to solve always remain the same, so you can just learn how to match problems with solutions, no matter what they are called today.
Bezos's Mandate (2002) 1. All teams will henceforth expose their data and functionality through service interfaces. 2. Teams must communicate with each other through these interfaces. 3. There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network. 4. It doesn’t matter what technology they use. HTTP, Corba, Pubsub, custom protocols — doesn’t matter. 5. All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions. Almost 20 years ago, even before the cloud started to appear, there was a company in which, one day, the whole IT department got to work and received an email from their boss. The email informed them that from now on, every piece of software developed by a team would have to be exposed through a service interface. If you want to call software provided by other teams, you have to go through this interface. If they want to call your software, they have to use its service interface. There is no other integration mechanism allowed. You cannot directly read the data from somebody else’s database; this is forbidden. You cannot share data through memory. You cannot come up with any other kind of back door. It is only possible and allowed to communicate through the front door: a service interface. This is given as a constraint on an abstract level: call the service interface across the network. It’s ok to use HTTP, CORBA, MQ, gRPC; you choose the protocol, you pick the most appropriate connector. There is some flexibility in the choice of connector, but you have to go through the service interface. This constraint created a billion-dollar business out of a very simple design decision. All service interfaces without exception must be designed to be externalizable. What does it mean? We need to be ready to support both internal and external clients. It means that once you have an interface, everyone may call the service. You can first build something for internal use, and later you can open it up to the rest of the world so that, for example, you can make money with it. No exceptions. If, by the way, you ignore these rules, you will be fired. That’s not part of the slide, but it was actually part of the email. This impulse started the service-oriented architecture trend, which eventually evolved into cloud computing, everything as a service, and microservices.
Evans's Bounded Context (2004) There is no single uniform model to design large systems. Models are only valid and consistent within a given context boundary. Some translation may be needed to cross boundaries.
From a similar time frame, domain-driven design (DDD) is also highly relevant today, as with microservices we need to find a way to decompose a large system into small pieces. There is a limit not only in the number of people that you can work with to support the system, but also in the conceptual integrity of the content managed by the system. If its domain grows beyond a certain size, its complexity will also grow as you try to cover, within the same system, concepts that are becoming increasingly different. Consider your natural language understanding skills. You are born in a certain village, you go to primary school, and everybody speaks the same language. Then one day you take a walk to the next village and you start to realize that people speak in a slightly different way. Eventually you grow up and move to the city; your accent is different, and people can tell where you come from. One day you cross the border into a new country and there the language is completely different. The same happens with software systems: when they are young and small, they’re very consistent because they just focus on doing precisely something that is very narrow. Eventually they grow by adding more features, more concepts, more data, until they hit a boundary. They need to interoperate with another system, and that’s when you start to need a translation, because the other side speaks a different language to represent shared concepts. You can hardly mix data on the inside with data from the outside. The same holds for systems that are initially used by only one user. As the user community grows, there is more and more pressure to support different use cases and personalized scenarios. With a sufficiently large user population, it is impossible to find the one way to do something, or to keep doing things in the same way for everyone. The only way to keep growing is to manage diversity by drawing boundaries. Is there ever going to be a single universal solution? Or just many failed attempts at over-generalization?
Bob Martin's Single Responsibility Principle (2003) A class should have only one reason to change. Gather together things that change for the same reason, and separate those things that change for different reasons.
This is a similar concept for keeping your components cohesive: the single responsibility principle states that if you have a piece of software – in this case a class in object-oriented programming, but one can also extend it to a component or a microservice – it should have only one reason to change. If things change for the same reason, you keep them together (high cohesion). And if things change for different reasons, you separate them (low coupling). This abstract advice can help you with your monolith decomposition efforts. Think about who would request possible changes, trace their impact across the architecture, and bring together all elements that can be directly or indirectly affected by the requirements of each key stakeholder.
UNIX Philosophy (1978) Write programs that do one thing and do it well. Write programs to work together.
Design and build software to be tried early, ideally within weeks. Don't hesitate to throw away the clumsy parts and rebuild them.
M. Douglas McIlroy
Write programs to handle text streams, because that is a universal interface.
We keep going even further back in time. In the early days of UNIX, its creators started to reflect on its design. They didn’t call it architecture, but they wrote about the design philosophy of the system. These ideas also resonate with today’s microservices; back then they were called programs. When you write some programs, each program should do one thing, and do it well. There is no need to replace a program with some other program if you already have the best program to do a certain thing, to solve your problem. However, since programs only do one thing, sometimes you have to combine them to actually do something useful. Since you cannot anticipate and should not constrain how components can be composed, you have to write them so that they offer a universal interface. Back then, the universal interface was textual data streams: standard input and standard output. We still have it today. For composing distributed components, HTTP could be considered a universal interface. As you start with a new project: design and build software to be tried and tested as early as possible, ideally within one or two weeks. Do not wait for the end of the project to deliver the first result that finally works. You should have something that works as early as possible, and then you can keep growing it. Today we call this a minimum viable product (MVP). Don’t hesitate to throw away the clumsy parts and rebuild them. This is also an idea often cited with microservices: since they are small, you can afford to throw them away and rebuild them. If they become too large, you have invested so much into building them that you will think twice before replacing them. If you hesitate before replacing a microservice, it has already grown beyond the limit of what keeps a microservice small.
Separation of Concerns (1974) One is willing to study in depth an aspect of one's subject matter in isolation, for the sake of its own consistency, all the time knowing that one is occupying oneself with only one of the aspects. Edsger W. Dijkstra
But nothing is gained -- on the contrary! -- by tackling these various aspects simultaneously.
The term “separation of concerns” is really old, but still helpful today as we struggle with microservice decomposition. When you try to understand how to solve a problem, you want to separate its different aspects so that you can solve them in isolation. This helps you to focus your problem-solving skills on something that is manageable. If you try to solve everything together at the same time, it will be very difficult to make it work. Also, if you find multiple problems that can be solved independently, you can involve multiple teams of developers to build the corresponding solutions, also known as microservices.
Parnas's Criteria (1971) One begins with a list of difficult design decisions or design decisions which are likely to change. Each module is then designed to hide such a decision from the others. It should be possible to make drastic changes to one module without a need to change others
Only three years after the Garmisch conference where the term software engineering was born, David Parnas proposed his criteria for decomposing systems into modules. At that time they didn’t call them components; microservices were yet to come. You start by making a list of difficult design decisions, or what we have called architectural decisions: hard to change later on, but which should make it possible to design a flexible architecture. Each module should be designed to hide such a decision from the others. This makes it possible to change what’s inside one module without affecting the rest. Encapsulation, modularity and information hiding lead to the constraint of using exclusively the service interface, as opposed to offering direct access to the underlying implementation. We can rewrite an entire microservice using a different programming language, and the rest will not notice because the service interface remains the same.
Conway's Law (1968) Any organization that designs a system will inevitably produce a design whose structure is a copy of the organization's communication structure
We now reach what is probably the most cited concept in industry talks about microservices: Conway’s law, sometimes also mentioned as the “reverse Conway manoeuvre”. Any organization which designs a software architecture will inevitably produce a design whose structure is a copy of the communication structure within the organization itself. If people are used to communicating in a hierarchical reporting structure, where every team talks to its manager and the managers talk to their managers, all the way up to the CEO, the software that they design will actually look like a hierarchical decomposition of components. Taken to the extreme, depending on how large the organization is, the number of components in the software architecture will be proportional to the number of developers, the number of offices, or the number of floors. If you would like the software architecture to change, you will need to rearrange the chairs around the table, or the cubicle floorplan.
Conway's Law (1968)
[Figure: A monolithic system built by functional silos: users and a support helpdesk in front, UI designers owning the UI, developers owning the monolith, testers, database administrators owning the database, and system administrators running it all.]
The monolithic architecture shows an example in which development is kept separate from operations. Developers build the system, release it, and then push it over the wall to somebody else who is in charge of running it.
Conway's Law (1968)
[Figure: Two microservices, Catalog (Products) and Order (Orders), each with its own API, user interface and database, and each owned by a cross-functional team of UI designers, testers, developers, a database administrator and a system administrator.]
Only by changing the organization of your teams can you actually achieve a flexible microservice architecture. Each microservice is developed and operated by a cross-functional team with enough members to play all of the roles covering the whole life cycle of the software.
Vogels's Lesson (2006) The services model has been a key enabler in creating teams that can innovate quickly with a strong customer focus. Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. Why is it important to have cross-functional teams dedicated to each microservice? Let’s go back to where we started: service orientation has been a key enabler in creating teams that can innovate quickly with a strong customer focus. Empowering developers with operational responsibilities also enhances the quality of the services, both from a customer and a technology point of view. This is a polite way of saying that if a developer gets a call in the middle of the night and has to fix their service which just went down, the developer will be much more careful about introducing bugs, performance or availability issues into their system than when it’s somebody else’s problem to fix. The same should hold for architects; I wonder why they never get to enter their beautiful buildings and try to live inside them for a day or two.
References • Stephan Murer, Bruno Bonati, Frank J. Furrer, Managed Evolution: A Strategy for Very Large Information Systems, Springer, 2011, ISBN 978-3-642-01632-5 • Werner Vogels, Learning from the Amazon technology platform (interview), ACM Queue, 4(4), June 30, 2006 • Melvin E. Conway, How do Committees Invent?, Datamation, 14(5): 28–31, April 1968 • Edsger W. Dijkstra, On the role of scientific thought, In: Selected Writings on Computing: A Personal Perspective, Springer, 1982, ISBN 0-387-90652-5 • David L. Parnas, On the criteria to be used in decomposing systems into modules, Communications of the ACM, 15(12): 1053-1058, December 1972 • Eric Evans, Domain-Driven Design: Tackling Complexity in the Heart of Software, Addison-Wesley, 2004, ISBN 978-032-112521-7 • James Lewis, Martin Fowler, Microservices, 2014 • Sam Newman, Building Microservices, O'Reilly, February 2015, ISBN 978-1491950357 • https://www.microservice-api-patterns.org/ (API Evolution Patterns)