Distributed Computer Systems Engineering: Lecture Notes

Cite as: Robert Morris, course materials for 6.824 Distributed Computer Systems Engineering, Spring 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology.

6.824 2006 Lecture 1: Introduction and O/S Review

Opening
  building distributed systems
    construction techniques, robustness, good performance
  lectures on design, paper reading for case studies
  you'll build real systems
  why take the course?
    synthesize many different areas in order to build working systems
    Internet has made area much more attractive and timely
    hard/unsolved: not many deployed sophisticated distrib systems

Example: how to build HotMail?
  mail arrives from outside world
  store it until...
  user's Outlook/Eudora reads/deletes/saves it
  Simple solution: one server w/ disk to store mail-boxes
  [picture: MS, sending "clients", reading clients]
  What happens as your mail service gets popular?

Topic: stable performance under high load
  Example: Starbucks. 5 seconds to write down incoming request,
    10 seconds to make it.
  [graph: x=requests, y=output]
  max thruput at 4 drinks/minute
  what happens at 6 req/min? 30 sec/min goes to taking orders,
    so only 3 drinks get made
  thruput goes to zero at 12 requests/minute
    (all 60 seconds spent taking orders)
  Efficiency *decreases* with load -- bad.
  Careful system design to avoid this -- flat line at 4 drinks.
    Peets, for example.
  Better: build systems whose efficiency *increases* w/ load
    w/ e.g. batching, disk scheduling
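The arithmetic behind that graph, as a toy model (the 5-second and 10-second figures are just the made-up numbers from above; one worker, 60 seconds per minute):

    // Toy model of the Starbucks example: accepting an order costs 5s
    // whether or not the drink ever gets made; making a drink costs 10s.
    #include <stdio.h>

    int drinks_per_minute(int offered) {    // offered load, requests/minute
        int accept_time = offered * 5;      // seconds spent writing down orders
        int time_left = 60 - accept_time;   // seconds left for making drinks
        if (time_left < 0) time_left = 0;
        int made = time_left / 10;
        return made < offered ? made : offered;
    }

    int main() {
        for (int load = 0; load <= 14; load++)
            printf("offered %2d/min -> completed %d/min\n",
                   load, drinks_per_minute(load));
        return 0;
    }

Throughput peaks at 4/min, drops to 3/min at an offered load of 6, and hits zero at 12 and beyond.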

Topic: scalable performance
  What if more clients than one Hotmail server can handle?
  How to use more servers to handle more clients?
  Idea: partition users across servers
    bottlenecks: how to ensure incoming mail arrives at the right server?
    scaling: will 10 servers allow us to handle 10x as many users?
    load balance: what if some users get much more mail than others?
    layout: what if we want to detect spam by looking at all mailboxes?

Topic: high availability
  Can I get at my HotMail mailbox if some servers / networks are down?
  Yes: replicate the data.
  Problem: replica consistency. delete mail, re-appears.
  Problem: physical independence vs communication latency
  Problem: partition vs availability. airline reservations.
  Tempting problem: can 2 servers yield 2x availability AND 2x performance?

Topic: global scalability
  this is really an opportunity
  we have the entire Internet as a resource
    what neat new big systems can we build that take advantage?
    are there any principles to be discovered?
  finding objects
  storing objects "out there"
  serving same objects to many consumers
  widely distributed computing (e.g. grid computing)

Topic: security
  old view: secrecy via encryption (msg to Moscow embassy)
    user authentication via passwords &c
    all parties know each other!
  Internet has changed focus.
    global exposure to random attacks
    from millions of bored students and serious hackers,
      e.g. intrusions for spam bot nets
  you fetch a new Firefox binary, how do you know it hasn't been hacked?
  how do you know that was Amazon you gave your credit card number to?
  how does Amazon know it was you?
  no purely technical approach is likely to solve these problems

We want to understand the individual techniques, and how to assemble them.

------------- Course structure

URL
meetings: 1/2 lectures, 1/2 paper discussions
  research papers on working systems, starting next week
  must read papers before class
    otherwise boring, and you can't pick it up by listening
  we will post paper questions 24 hours in advance
  hand in answer on paper in class, one or two paragraphs
two in-class quizzes (no final)
Labs: build a real cluster file server
  cache consistency, locking
Project. look at the project information page!
  design, implement, report
  teams, proposal conferences, two drafts, demo, report
Emil is TA, office hours TBA
Look at the web site:
  sign up for course machine accounts
  look at the first lab, due in a week

------------- O/S kernel overview

context in which you build distributed systems
o/s has big impact on design, robustness, performance
  sometimes because of o/s quirks
  mostly because o/s solves some hard problems
This should be review for most of you

Want to tell what I think is important
Give you a chance to ask questions

What problems does o/s solve?
  sharing hardware resources
  protection
  communication
  hardware independence
  (everyone faces these problems)

Approach to solutions?
  o/s designers think like programmers, abstractions + interfaces

UNIX abstractions (we'll be programming UNIX in labs, my favorite O/S)
  process
  address space
  thread of control
  user ID
  file system
  file descriptor
    on-disk file
    pipe
    network connection
    device
All this is implemented by a "kernel" with hardware privileges

Note we're partially virtualizing
  o/s multiplexes physical resource among multiple processes
    CPU, memory, disk, network
  to share, to control, to provide a simple model to apps
  abstraction helps virtualization: easier to share TCP conns than enet
  Can't completely virtualize
    file system and network stack not the same as physical foundation
    the differences make sharing possible

abstractions interact, must form a coherent set
  if o/s can start programs, it must know how to read files

System call interface to kernel abstractions
  looks like function call, but special
  fork, exec
  open, read, creat

Standard picture
  app (show two of them, mark addresses from zero)
  libraries
  ----
  FS
  disk driver
  (mention address spaces, protection boundaries)
  (mention h/w runs kernel address space w/ special permissions)

Why big kernels have been successful:
  easy for kernel subsystems to cooperate

  disk buffer shares phys mem with virtual mem system
  all kernel code is 100% privileged
    very simple security model
  easy to implement sophisticated and efficient services

Why UNIX abstractions are not perfect
  kernel is big
    kernel has room for lots of bugs; it's all privileged
  kernel limits flexibility
    multiple threads per process?
    single thread crossing into a different address space?
    control disk layout of files for performance?
    don't like the kernel's TCP implementation?
  we'll discuss a number of improved abstractions

Alternate set of abstractions: micro-kernel
  Move complex abstractions to server processes
  Talk to FS server, rather than FS module in kernel
  Kernel mostly handles IPC
    also grants h/w access to privileged servers
    e.g. FS server can read/write disk h/w
  Looks like a miniature distributed system!
    Move FS server to a different machine, via network?
    Lots of overlap with our concerns in this class.

Let's review some basics which will come up a lot:
  process / kernel communication
  how processes and kernel wait for events (disk and network i/o)

Life-cycle of a simple UNIX system call
  [diagram: process, kernel]
  See the handout...
  Interesting points:
    protected transfer
      h/w allows process to get kernel permissions
      but only by jumping to *known* entry point in kernel
    process suspended until system call finishes

What if the system call needs to wait, e.g. for the disk?
  We care: this is what busy servers do

  sys_open(path)
    for each pathname component
      start read of directory from disk
      sleep waiting for the disk read
      process the directory contents

  sleep()
    save *kernel* registers to PCB1 (including SP)
    find runnable PCB2
    restore PCB2 kernel registers (SP...)
    return

Note: each user process has its own kernel stack
  [draw in diagram]
  kernel stack contains state of partially executed system call
    "kernel half"
  trap handler must execute on the right stack
a "blocking system call"
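Pulling the pieces together, a C-style sketch of the blocking path (all names here -- pcb, find_runnable, swtch, disk_start_read -- are invented for illustration, and the helpers are left as declarations; this is the shape of the code, not a runnable kernel):

    enum pstate { RUNNABLE, WAITING };

    struct pcb {
        void *kernel_sp;     // saved kernel stack pointer
        enum pstate state;
        void *chan;          // what this process is waiting for
    };

    struct pcb *cur;                     // currently running process
    struct pcb *find_runnable(void);     // scheduler: pick another process
    void swtch(struct pcb *from, struct pcb *to);  // save/restore kernel registers
    void disk_start_read(void *buf);     // start the I/O, don't wait

    void sleep(void *chan) {
        cur->state = WAITING;
        cur->chan = chan;
        swtch(cur, find_runnable());     // run some other process's kernel half
        // we resume *here*, much later, after the interrupt handler marks us
        // RUNNABLE and the scheduler picks us again
    }

    void read_block_and_wait(void *buf) {
        disk_start_read(buf);
        sleep(buf);          // block; our own kernel stack preserves the
                             // half-finished sys_open state across the wait
    }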

"blocking system call" What happens when disk completion interrupt occurs? Device interrupt routine finds the process waiting for that I/O. Marks process as runnable. Returns from interrupt. Someday process scheduler will switch to the waiting process. Now let's look at how services use this kernel structure. Explain server_1 web server in handout Problem [draw this time-line] Time-lines for CPU, disk, network Server alternates waiting for each of them CPU, disk, network are each idle much of the time OK if only one client. Not OK if there are clients waiting for service. We may have lots of work AND idle resources. Not good. s/w structure forces one-at-time processing How can we use the system's resources more efficiently?


6.824 2006 Lecture 2: I/O Concurrency

Recall timeline [draw this time-line]
  Time-lines for CPU, disk, network
  How can we use the system's resources more efficiently?

What we want is *I/O concurrency*
  Ability to overlap I/O wait with other useful work.
  In web server case, I/O wait mostly for net transfer to client.
  Could be disk I/O: compile 1st part of file while fetching 2nd part.
  Could be user interaction: emacs GC while waiting for you to type.

Performance benefits of I/O concurrency can be huge
  Suppose we're waiting for disk for client one, 10 milliseconds
  We can probably serve 100 other clients from cache during that time!

Typical ways to get concurrency. This is about s/w structure.
There are any number of potential structures. [list these quickly]
  0. (One process)
  1. Multiple processes
  2. One process, many threads
  3. Event-driven
Depends on O/S facilities and type of application.
Degree of interaction among different sub-tasks.

One process can be better than you think!
  O/S provides I/O concurrency transparently when it can
  O/S does read-ahead into cache, write-behind from buffer
  works for disk and network connections

I/O concurrency with multiple processes
  Start a new UNIX process for each client connection / request
  Master process hands out connections.
  Now plenty of work available to keep system busy
  Still simple: look at server_2() in handout. fork() after accept()
  Preserves original s/w structure.
  Isolated: bug for one client does not crash the whole server
  Most interaction hidden by O/S. E.g. lock the disk queue.
  If > 1 CPU, CPU concurrency as a side effect

We may also want *CPU concurrency*
  Make use of multiple CPUs on shared memory machine.
  Often I/O concurrency tools can be used to get CPU concurrency.
  Of course O/S designer had to work a lot harder...
  CPU concurrency much less important than I/O concurrency: 2x, not 100x
  In general, very hard to program to get good scaling.
  Usually easier to buy two separate computers, which we *will* talk about.
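A sketch of the fork()-after-accept() structure described above (again hypothetical, not the handout's server_2):

    // Process-per-connection: the parent keeps accepting while each
    // child blocks on its own client's I/O.
    #include <unistd.h>
    #include <sys/socket.h>
    #include <signal.h>

    void serve_client(int c);   // the same sequential code as before

    void server_2(int s) {
        signal(SIGCHLD, SIG_IGN);            // don't accumulate zombies
        for (;;) {
            int c = accept(s, 0, 0);
            pid_t pid = fork();
            if (pid == 0) {                  // child: handle one connection
                close(s);
                serve_client(c);             // may block; blocks only this child
                close(c);
                _exit(0);
            }
            close(c);                        // parent: the child owns c now
        }
    }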

Multiple process problems
  Cost of starting a new process (fork()) may be high.
    New address space &c.
    300 microseconds *min* on my computer.
  Processes are fairly isolated by default
    E.g. they do not share memory
    What if you want a web cache? Must be shared among processes.
    Or even just keep statistics?

Concurrency with threads
  Looks a bit like multiple processes
  But thread_fork() leaves address space alone
  So all threads share memory
  One stack per thread, inside process
  [picture: thread boxes inside process boxes]
  Seems simple -- still preserves single-process structure.
  Potentially easier to have e.g. shared web cache
  But programmer needs to know about some kind of locking.
  Also easier for one thread to corrupt another

There are some low-level but very important details
that are hard to get right.
  What happens when a thread calls read()?
    Or some other blocking system call?
  Does the whole process block until disk I/O has finished?
  If you don't get this right, you don't get I/O concurrency.

Kernel-supported threads
  O/S kernel knows about each thread
  It knows a thread was just blocked, e.g. in disk read wait
    Can schedule another thread
  [picture: thread boxes dip down into the kernel]
  What does kernel need for this?
    Per-thread kernel stack.
    Per-thread tables (e.g. saved registers).
  Semantics:
    per-process resources: addr space, file descriptors
    per-thread resources: user stack, kernel stack, kernel state
  Kernel can schedule one thread per CPU
  This sounds like just what we want for our server
  BUT kernel threads are usually expensive, just like processes
    Kernel has to help create each thread
    Kernel has to help with each context switch?
      So it knows which thread took a fault...
    lock/unlock must go through kernel, but bad for them to be slow
  Many O/S do not provide kernel-supported threads, not portable

User-level threads
  Implemented purely inside program, kernel does not know
  User scheduler for threads inside the program
    In addition to kernel process scheduler
  [picture]
  User-level scheduler must:
    Know when a thread is making a blocking system call.
      Don't actually block, but switch to another thread.
    Know when I/O has completed so it can wake up original thread.

  Answer:
    thread library has fake read(), write(), accept(), &c system calls
    library knows how to *start* syscall operations without waiting
    library marks threads as waiting, switches to a runnable thread
    kernel notifies library of I/O completion and other events
    library marks waiting thread runnable

    read() {
      tell kernel to start read;
      mark thread as waiting for read;
      sched();
    }
    sched() {
      ask kernel for I/O completion events
      mark threads runnable
      find a runnable thread;
      restore registers and return;
    }

  Events we would like from kernel:
    new network connection
    data arrived on socket
    disk read completed
    client/socket ready to receive new data
  Like a miniature O/S inside the process

Problem: user-level threads need significant kernel support
  1. non-blocking system calls
  2. uniform event delivery mechanism
Typical O/S provides only partial support for event notification
  yes: new TCP connections, arriving TCP/pipe/tty data
  no: file-system operation completion
Similarly, not all system call operations can be started w/o waiting
  yes: connect(), socket read(), write()
  no: open(), stat()
  maybe: disk read()

Why are non-blocking system calls hard in general?
  Typical system call implementation, inside the kernel: [sys_read.c]
  Can we just return to user program instead of wait_for_disk?
    No: how will kernel know where to continue?
    i.e. should it run userspace code or continue in the kernel syscall?
  Big problem: keeping state for multi-step operations.
Options:
  Live with only partial support for user-level threads
  New operating system with totally different syscall interface.
    One system call per non-blocking sub-operation.
    So kernel doesn't need to keep state across multiple steps.
    e.g. lookup_one_path_component()
  Microkernel: no system calls, only messages to servers.
    and non-blocking communication
  Helper processes that block for you (Flash paper next week)

Threads are hard to program
  The point is to share data structures in one address space
  Thread *model* involves CPU concurrency even on a single CPU
    so programmer may need to use locks
    even if only goal was to overlap I/O wait

  But *events* usually occur one at a time
    could do CPU processing sequentially, overlap only the I/O waiting

Event-driven programming
  Suggested by user threads implementation
  Organize the s/w around arrival of events
  Write s/w in state-machine style
    When this event occurs, execute this function
  Library support to register interest in events
  The point: this preserves the serial nature of the events
    Programmer sees events/functions occurring one at a time


a2ps -A fill -o x --line-numbers=1 events.c webclient.c webclient_libasync.c

Office hours
Discussion mailing list (Wiki?)

****

Recall simple web-like server?
Now we'll look at a client for the server
  Show how to write some asynchronous code
  Build up to libasync, which you will be (and have been) using for labs

[webclient.c]
Where does this block?
  connect: makes a tcp connection
  write(s): is remote side willing to take data?
  read(s): has data come back from the remote side?
  write(1): is terminal ready for output?

How to program in event style?
  Identify events and appropriate responses: state machine
  programmer has to know when something might block!
  Write a loop that handles incoming events (I/O events)
  [events.c Example 1]

select()
  Need a way to multiplex sockets
  Program must then interleave operations
  [write prototype on the board: nfds, reads, writes, excepts, time]
  Condition is actually "read() would not block"; might be EOF.
  select() blocks (if timeout > 0) to avoid wasteful polling.
    this is important; you *do* want to block in select().
  Translate low-level system events into application-level events
    Buffer net I/O, maintain individual application state

Writing this event loop for each program is tedious
  [sketch implementation on the board; see the sketch below]
  What if your program does the same thing many times in parallel
  (e.g. many clients)?
    Have to partition up state for each client
    Need to maintain sets of file descriptors
  What if your program does many things?
    e.g. let's add DNS resolution
  Hard to be modular if event loop knows about all activities.
    And knows how to consult all state.
  We would prefer abstraction...

Use a library to provide main loop (e.g. libasync)
  Programmer provides "callbacks" to handle events
  [events.c: Example 2]
  Break up code into functions with non-blocking ops
  let the library handle the boring async stuff
  [prototypes in webclient_libasync.c]
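A minimal version of the select() loop sketched on the board (hypothetical, not the events.c handout; one socket plus stdin, no buffering or error handling):

    // Wait for "fd is readable" events instead of blocking in read()
    // on any single fd.
    #include <unistd.h>
    #include <sys/select.h>

    void event_loop(int sock) {
        for (;;) {
            fd_set rfds;
            FD_ZERO(&rfds);
            FD_SET(0, &rfds);        // stdin
            FD_SET(sock, &rfds);     // server connection
            select(sock + 1, &rfds, 0, 0, 0);  // blocks until some fd is ready
            if (FD_ISSET(0, &rfds)) {
                char buf[512];
                int n = read(0, buf, sizeof(buf));  // won't block now
                if (n > 0) write(sock, buf, n);     // (could block if sock isn't
                                                    //  writable! a real loop
                                                    //  buffers and uses the
                                                    //  write fd_set too)
            }
            if (FD_ISSET(sock, &rfds)) {
                char buf[512];
                int n = read(sock, buf, sizeof(buf));
                if (n <= 0) return;                 // EOF or error
                write(1, buf, n);
            }
        }
    }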

It's unfortunately hard for async programs to maintain state
  [draw logical diagram of select loop and function calls]
  Ordinary programs, and threads, use variables.
    Which persist across function calls, and blocking operations.
    Since they are stored on the stack.
  Async programs can't keep state on the stack.
    Since each callback must return immediately.
  How can they maintain state across calls?
    Use global variables
    Use the heap:
      Programmers package up state in a struct, malloc struct
      Each callback could take a void * (libevent)
      (In C++, can do this somewhat implicitly using an object.)
    This turns out to be hard to program
      No type safety
      Must declare structs for every set of state transfer
      User has to manage memory in potentially tricky cases

libasync provides a form of closures
  cb = wrap(fn, a, b) generates a closure.
    That is, a function pointer plus an environment of saved values.
  cb() calls fn(a, b)
  Also provides something like function currying.
    useful later on when callbacks do different things based on input
  Given a function with signature "R fn (A, B)":
    cb = wrap (fn) -> callback::ref
    use it like this: cb (a, b)
    Or: wrap (fn, a) -> callback::ref
  Limited compared to Scheme closures:
    You must explicitly indicate what variables to keep around.
    Can only pass a certain number of arguments

How are callbacks implemented?
  See callback.h: one of the few commented files in libasync.
  templates to generate dynamic structs holding values
  templates provide type safety:
    R fn (A, B);
    cb = wrap (fn) -> callback::ref; cb (a, b)
    cb = wrap (fn, a) -> callback::ref; cb (b)
    cb = wrap (fn, a, b) -> callback::ref; cb ()
  callbacks are reference counted to simplify mem mgmt
    normally, arguments in the wrap would have been on stack
    now, values are stored in closures created by wrap().
    How do we know when we've used a callback the last time?
    That's why they're reference counted.

What is the result? [webclient_libasync.c]
  what's the difference between filename and buf?
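To make wrap() concrete, a small sketch in the labs' style (fdcb() registers interest in an fd and amain() runs the select loop, as in libasync; treat exact types and signatures here as approximate):

    // Per-connection state travels in the closure instead of a global.
    #include "async.h"

    void do_read(int fd, str filename) {
        // fd and filename were saved by wrap() below; filename could
        // label output, pick a cache entry, etc.
        char buf[512];
        int n = read(fd, buf, sizeof(buf));   // won't block: fd was ready
        if (n <= 0) {
            fdcb(fd, selread, NULL);          // no longer interested in fd
            close(fd);
            return;
        }
        write(1, buf, n);
    }

    void start(int fd, str filename) {
        // wrap() builds a reference-counted closure holding fd and filename;
        // the event loop later calls do_read(fd, filename) when fd is readable.
        fdcb(fd, selread, wrap(do_read, fd, filename));
    }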

This is still somewhat tedious...
  Must handle memory allocation for strings
  Must manually buffer data to and from client
  Have to translate network read/writes into application-level events

libasync provides some solutions:
  suio and aios handle raw and line-oriented i/o
  reference-counted data (strings and general dynamic structs)
  asynchronous RPC library
but you still have to do work
  like splitting your code up into functions
  loops can still be a pain


Today's plan:
  [Any questions about lock server lab?]
  Reviewing event-driven programming
  Outline structure of the remaining labs
  Common libasync/libarpc/nfsloop programming idioms:
    writing rpc client code
    writing async functions that call RPCs
    writing rpc server code
  Flash

Event-driven programming
  Achieve I/O concurrency for communication efficiently
  Threads give cpu *and* i/o concurrency
    Never quite clear when you'll context switch: cpu+i/o concurrency
  State-machine style execution
    Lots of "threads": request handling state machines in parallel
    Single address space: no context switch overhead ==> efficient
  Have kernel notify us of I/O events that we can handle w/o blocking
  The point: this preserves the serial nature of the events
    Programmer sees events/functions occurring one at a time
    Simplifies locking (but when do you still need it?)
  libasync handles most of the busywork
    [draw amain/select on board again]
    e.g. write-ability events are usually boring
  libarpc translates to events that the programmer might care about: rpcs

ccfs architecture:
  [draw block diagram on the board: OS [app, ccfs] --> blockserver, lockserver]
  reply causes the sbp to get deleted.

Writing user-level NFS servers:
  classfsd code will allow you to mount a local NFS server w/o root
  nfsserv_udp handles tedious work, we register a dispatch function
  Similar to generic RPC server but use nfscall *, instead of svccb.
    Adds features like nc->error ()
  You'll need to do multiple operations to handle each RPC
    [draw RPC issue timeline: os -> kernel -> ccfs -> lockserver/blockserver]
    Not unlike how we might operate:
      get an e-mail from friend: can you make it to my wedding?
      check class calendar on web, check research deadlines
      send IM to wife, research ticket prices, reply
    Or Amazon.com login... [Example 6]

An aside on locking:
  No locking etc needed usually: e.g. to increment a variable
  When do you need locking? When an operation involves multiple stages
  Be careful about callbacks that are supposed to happen "later"
    e.g. delaycb (send_grant);

Parallelism and loops
  [Example 7a]: synchronous code
  [Example 7b]: serialized
  [Example 7c]: and async parallelism
  [Example 7d]: but yet... better parallelism?

Summary

Events programming gives programmer a view that is roughly
consistent with what happens.
  Can build abstractions to handle app-level events
  Need to break up state and program flow
  but always know when there's a wait,
    and have good control over parallelism


6.824 (6.033 notes, Appendix 4-B): Case study of the Network File System (NFS)

The original paper on NFS:
  Design and Implementation of the Sun Network File System
  Sandberg, Goldberg, Kleiman, Walsh, Lyon
  Usenix 1985

NFS is a neat system:
  NFS was very successful, and still is
  You can view much net fs research as fixing problems with NFS
  You'll use NFS in labs

Why would anyone want a network file system?
  Why not store your files on the local disk?

What's the architecture?
  Server w/ disk
  LAN
  Client hosts
    apps, system calls, RPCs

What RPCs would be generated by reading file.txt, e.g.:
  fd = open("file.txt", 0)
  read(fd, buf, 8192)
  close(fd)
(see the trace after this section)

What's in a file handle?
  i-number (basically a disk address)
  generation number
  file system ID
What's the point of the generation number?
Why not embed file name in file handle?

How does client know what file handle to send?
  Client got it from previous LOOKUP or CREATE
  Returned handle stored in appropriate vnode
  A file descriptor refers to a vnode
  Where does the client get the very first file handle?

Why not slice the system at the system call interface?
  I.e. directly package up system calls in RPCs?
  UNIX semantics were defined in terms of files, *not* just file names
    the files themselves have identities, i-number in the disk file system
  These refer to the same object even after a rename:
    File descriptor
    Home directory
    Cache contents
  So vnodes are there to remember file-handles
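For the open/read/close question above, a plausible NFS v2-style trace (assumes nothing is cached and file.txt is at the root of the mount; exact RPCs vary by client implementation):

    fd = open("file.txt", 0)  -->  LOOKUP(root fh, "file.txt")
                                   returns file.txt's fh + attributes
                                   (the very first fh came from the MOUNT
                                   protocol; many clients also GETATTR here
                                   for consistency checks)
    read(fd, buf, 8192)       -->  READ(fh, offset=0, count=8192)
                                   returns data + attributes
    close(fd)                 -->  no RPC at all in NFS v2:
                                   the server keeps no open-file state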

write(fd, "foo", 3); close(fd); If a server crashes and reboots, will client requests still work? Will client's file handles still make sense? File handle == disk address of i-node. What if the server crashes just after client sends it an RPC? What if the server crashes just after replying to a WRITE RPC? So what has to happen on the server during a WRITE? I.e. what does it do before it replies to the RPC? Data safe on disk. Inode with new block # and new length safe on disk. Indirect block safe on disk. Three writes, three seeks, 45 milliseconds. 22 writes per second. 180 kb/sec. How could we do better than 180 kb/sec? Write whole file sequentially at a few MB/sec. Then update inode &c at end. Why doesn't NFS do this? NFS v3 unstable WRITE and COMMIT help solve this performance problem. server doesn't write to disk on WRITE, just caches, waits to batch many writes server returns a "verifier" that changes on reboot client leaves written data dirty in its file cache remembers verifier in client close(): make sure all WRITEs sent and replied to send COMMIT for file handle wait for reply if reply verifier != any cached block verifier, re-send all WRITEs and COMMIT else free file cache blocks why in close()? for cache consistency among clients, in case server crashes after close() and reveals old data to other clients What caches do typical NFS implementations have? And why exactly is each cache helpful? Server caches disk blocks, and maybe others. Client caches file content blocks, clean and dirty. Client caches file attributes. Client caches name-> fh mappings. Client caches directory contents. You will need to think a little about the client caches for your labs. They suppress RPCs you might expect to receive at the server. They may become stale and cause client to see things different from server and other clients. What if client A has something cached, and client B changes it? Examples where we might care about cache consistency? Two windows open, different clients, Cite as: Robert Morris, course materials for 6.824 Distributed Computer Systems Engineering, Spring 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].

What caches do typical NFS implementations have?
  And why exactly is each cache helpful?
  Server caches disk blocks, and maybe others.
  Client caches file content blocks, clean and dirty.
  Client caches file attributes.
  Client caches name -> fh mappings.
  Client caches directory contents.
You will need to think a little about the client caches for your labs.
  They suppress RPCs you might expect to receive at the server.
  They may become stale and cause client to see things
    different from server and other clients.

What if client A has something cached, and client B changes it?

Examples where we might care about cache consistency?
  Two windows open, different clients,
    emacs -> make
    make -> run the program
  Or distributed app (cvs?) with its own locks.
Examples where we might not care about consistency?
  I just use one client workstation.
  Different users don't interact / share files.

Straw man consistency protocol:
  Every read() asks server if file has changed;
    if not, use cached copy.
  Is that sufficient to make each read see latest write?
  What's the effect on performance?
  Do we need that much consistency?

Compromise: close-to-open consistency
  this is what most NFS clients do
  promise: if client A writes a file, then close()s it,
    then client B open()s the file, and reads it,
    client B's reads will reflect client A's writes.
  the point: clients only need to contact server during open() and close()
    not every read and write
  close-to-open consistency fixes the emacs/make example
    but the user has to wait until emacs says it's done writing!
    and cvs has to wait until close() returns before releasing lock

How NFS implements close-to-open consistency:
  taken from FreeBSD source; NFS spec doesn't say.
  client keeps file mtime and size for each cached file block
  close() starts WRITEs for all of the file's dirty blocks
  close() waits for all of server's replies to those WRITEs
  open() always sends GETATTR to check file's mtime and size, caches fattr
  read() uses cached blocks only if mtime/length have not changed
  client checks cached directory contents w/ GETATTR and ctime

name-to-filehandle cache may not be checked for consistency on each LOOKUP
  you may get a stale file handle error if file was deleted
  or the wrong answer if file was renamed,
    and a new file created w/ same name

What prevents random people from sending NFS messages to my NFS server?
  Or from forging NFS replies to my client?

Would it be reasonable to use NFS in Athena?
  Security -- untrusted users with root on workstations
  Scalability -- how many clients can a server support?
    Writes &c always go through to server.
    Even for private files that will soon be deleted.
  Can you run it on a large complex network?
    How is it affected by latency? Packet loss? Bottlenecks?
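Returning to the close-to-open implementation above, a compressed sketch (names and types invented for illustration; the real FreeBSD code is far more involved):

    #include <stdint.h>

    struct fattr { uint64_t mtime; uint64_t size; };
    struct nfsnode {
        fattr cached_attr;   // mtime/size we last saw from the server
        // cached blocks, dirty list, fh, ...
    };

    fattr getattr_rpc(nfsnode *np);
    void invalidate_cached_blocks(nfsnode *np);
    void write_rpc_all_dirty_blocks(nfsnode *np);
    void wait_for_write_replies(nfsnode *np);

    void cto_open(nfsnode *np) {
        fattr a = getattr_rpc(np);          // always ask the server at open()
        if (a.mtime != np->cached_attr.mtime || a.size != np->cached_attr.size)
            invalidate_cached_blocks(np);   // someone else wrote the file
        np->cached_attr = a;
    }

    void cto_close(nfsnode *np) {
        write_rpc_all_dirty_blocks(np);     // start WRITEs for dirty blocks
        wait_for_write_replies(np);         // don't return until they're done
        // now another client's open() is guaranteed to see our writes
    }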


6.824 2006 Lecture 5: RPC Transparency

Goal: transparency
  Preserve original client code
  Preserve original server code
  RPC glues client/server together w/o changing behavior
  Programmer doesn't have to think about network
How well does this work?
  Let's use NFS as a case study of RPC transparency

What is the non-distributed version of NFS?
  I.e. what are we letting RPC split up?
  apps, syscalls, kernel file system, local disk
Where's the module boundary that NFS RPCs cut at:
  In kernel, just below syscall interface. "vnode" layer.
  NFS client code essentially a set of stubs for system calls.
  Package up arguments, send them to the server.

Does NFS preserve client function call API? At syntactic level?
  Yes: no change to arguments / return values of system calls.

Does NFS use server-side implementation w/o changes?
  Fairly close.
  NFS provides in-kernel threads.
    They act much like processes making system calls.
  But NFS server required file system implementation changes:
    File handles instead of file descriptors.
    Generation numbers in on-disk i-nodes.
    User-id carried as arguments instead of implicit in process owner.
    Flag arguments indicating synchronous updates.

Does NFS preserve semantics of file system operations?
  It is not enough to preserve just the API.
  System calls must *mean* the same thing.
  Otherwise existing programs may compile and run but not be correct.

New semantics: server failure
  Before, open() only failed if file didn't exist.
  Now it (and all others) can fail if server has died.
    Apps have to know to retry or fail gracefully.
  *Or* open() could hang forever, which was never the case before.
    Apps have to know to set their own timeouts if they don't want to hang.
  This is fundamental, not an NFS quirk.

New semantics: close() might fail if server disk out of space.
  Side effect of async write RPCs in client, for efficiency.
    Client only waits in close().
  close() never returns an error for local file system.
  So apps have to check close() for out-of-space, as well as write().
  This is caused by NFS trying to hide latency by batching.
  They could have made write() synchronous (and much slower).
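Concretely: code written for a local file system tends to assume close() cannot fail; on NFS both calls need checking:

    #include <stdio.h>
    #include <unistd.h>

    int careful_write(int fd, const char *buf, int n) {
        if (write(fd, buf, n) < 0) {
            perror("write");
            return -1;
        }
        if (close(fd) < 0) {
            perror("close");  // on NFS, out-of-space may show up here,
                              // not at write()
            return -1;
        }
        return 0;
    }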

New semantics: error returns from successful operations
  I call rename("a", "b") on an NFS file.
  Suppose server performs rename, crashes before sending reply.
  So NFS client re-sends rename().
  But now "a" doesn't exist, so I get an error.
  This never used to happen.
  Side-effect of NFS's statelessness.
    NFS server could remember all operations it has performed.
    But hard to fix: hard to keep that state consistent across crashes.
    Update state first? Or perform operation first?

New semantics: deletion of open files
  I open a file for reading.
  Some other client deletes it while I have it open.
  Old behavior: my reads still work.
  NFS behavior: my reads fail.
  Side-effect of NFS's statelessness.
    They could have fixed this; server could track opens.
    AFS keeps the state required to do this.

Preserving semantics leads to bad performance:
  If you write() local file, UNIX can delay write to disk.
    To collect multiple adjacent blocks and write with single seek.
    What if machine crashes?
      Then you lose both app and the dirty buffers in memory.
  Suppose we did the same with NFS server:
    WRITE request just writes buffer in server's memory, then returns.
    What if server crashed and rebooted?
    App would *not* have crashed, in fact would not notice.
    But its written data would mysteriously disappear.
  To solve this, NFS server does synchronous writes.
    Does not reply to write RPC until data is on disk.
    So if write returns and server crashes, data safe on disk.
    Takes three seeks: data, indirect block, i-node. i.e. 45 milliseconds.
    So 22 writes per second, 180 kb/sec.
    That's about 10% of max disk throughput.
  This is just NFS v2 lameness; AFS and NFS v3 fix it.

Improving performance leads to different consistency semantics:
  Suppose clients cache disk blocks when they read them.
  But writes always go through to the server.
  This is not enough:
    Suppose I'm using two workstations (w/ two X windows).
    Write out editor buffer on one workstation.
    Type make on the other workstation.
    Does make/compiler see my changes?
  Ask server "has the file changed" before every read()?
    Almost as slow as just reading from the server.
  NFS solution: Ask server "has the file changed" at each open().
    But don't ask for individual read()s after open().
    So if you change file while I have it open, I don't see your changes.
    This is OK for editor/make, but not always what you want:
      make > out (on compute server)
      tail -f out (on my workstation)

  Side-effect of NFS's statelessness.
    Server could remember who has read-cached blocks.
    Send them invalidation messages.

Security is totally different
  On local system:
    UNIX enforces read/write protections
    Can't read my files w/o my password
  How does NFS server know who the user is?
  What prevents random people from sending NFS requests to my NFS server?
    Or from forging NFS replies to my client?
  Does it help for the server to look at the src IP address?
  Why aren't NFS servers ridiculously vulnerable?
    Hard to guess correct file handles.
  This is fixable: SFS, AFS, even some NFS variants do it.
    Require clients to authenticate themselves cryptographically.
    Very hard to reconcile with statelessness.

However, it turns out none of these issues prevent NFS from being useful.
  People fix their programs to handle new semantics.
  Or install firewalls for security.
  And get most advantages of transparent client/server.

Multi-module example
  NFS very simple: only one server, only one data type (file handle).
  What if symmetric interaction, lots of data types?
  suppose program starts w/ three modules in same address space
    example modules: web front end; customer DB; order DB
    class connection { int fd; int state; char *buf; }
    easy to pass object references among all three
      e.g. pointer to current connection
  what if you split all three off into separate servers
    how do you pass a "class connection"?
    Stub generator would just send the structure elements?
    what if processing for connection goes through order DB,
      then customer DB, then back to front end to send reply
    front end only knows *contents* of passed connection object
      *real* connection may have changed...
    so we actually wanted to pass object references, not object contents
    NFS solves this with file handles
      But got no support from the RPC package

Areas of RPC non-transparency
  1. Partial failure, network failure
  2. Latency
  3. Efficiency/semantics tradeoff
  4. Security. You can rarely deal with it transparently.
  5. Pointers. Write-sharing. Portable object references.
  6. Concurrency (if multiple clients)
Solution 1: expose RPC to application
Solution 2: work harder on transparent RPC

Conclusion

  Automatic marshaling has been a big success
  Mimicking procedure call interface is not that useful
  Attempt at full transparency has been mostly a failure
  But you can push this pretty hard: Network Objects, Java RMI


6.824 2006 Lecture 6: Crash Recovery for Disk File Systems

Common theme in system-building:
  You have some persistent state you're maintaining, reading and writing.
  Maybe the data is replicated in multiple servers' memory,
    maybe it's on one or more disks.
  You use caching and fancy data structures for efficiency.
  The hard part always turns out to be recovery after a crash.

Goals:
  1. maintain storage system's internal invariants
  2. preserve ordered prefix of user's operations

Most solutions have a similar flavor:
  1. each operation takes storage from legal state to legal state,
     perhaps w/ multiple updates to storage system
  2. order updates to persistent store so there are commit points
  3. recovery procedure can finish or un-do partial operations.
it all has to be fast! persistent storage has always been slow.

Case study: disk file systems.
  Critical for performance and crash recovery of individual machines.
  Interacts with distributed protocols, for both reasons.
  Crash recovery techniques similar to those in distributed systems.
  Trade-offs are often the same (performance vs durability).

A file system is a fairly complex abstract data structure:
  (this is for UNIX)
  tree rooted at root i-node
  directory entries
  file/subdirectory i-nodes
  file blocks
  file block number lists
i-nodes and directory contents usually called meta-data.
  As opposed to file contents.

Even at this abstract level there are crash recovery problems:
  What if you crash in the middle of a rename()?

But there is more:
  These objects live somewhere on the disk.
  [circle with i-node, data, &c]
  The file system objects have disk block addresses. Sector number?
  File system must allocate disk blocks for new i-nodes &c.
    Someone decides where to place i-nodes.
  File system must release unused blocks after deletion.
    So there must be a list of free disk blocks.
  And you don't want allocated block on free list!
  Will it be expensive to keep free list up to date on the disk?

What does recovery do?
  For example, UNIX fsck program that runs at boot-time.
  Very similar to a mark-and-sweep garbage collector.
  Descends tree, remembers all allocated i-nodes and blocks.
  All others must be free, so fsck just re-initializes free lists.

  Also checks for block used by two files,
    file length != number of blocks, &c.
  May find problem it can't fix
    (which file should get the doubly-used block).
    Asks the user.

Goal: finish or un-do ops in progress at the time of the crash.
  Leave file system in a state that it could have been in
    if the crash had happened either before or after last user op.
  So perhaps user loses last few ops, no other problems.
  Or notify the user if that's not possible.

Final ingredient:
  Kernel's in-memory disk buffer cache of recently used blocks.
  Hugely effective for reads (all those root i-node accesses).
  The result is that the bottleneck is often disk writes.
  So disk caches are also usually write-*back*: they hold dirty blocks.
  Dirty blocks are lost if there's a crash!

What are the specific problems?
Example:
  fd = create("d/f", 0666);
  write(fd, "hello", 5);
Ignore reads since cached; here are the block writes:
  1. i-node free bit-map (get a free i-node for f)
  2. f's i-node (write owner &c)
  3. d's contents (add "f" -> i-number mapping)
  4. d's i-node (longer length, mtime)
  5. block free bit-map (get a free block for f's data)
  6. data block
  7. f's i-node (add block to list, update mtime and length)

How fast can we create small files?
  If each write goes to disk, then 70 ms/file, or 14 files/second.
    Pretty slow if you are un-tarring a big tar file.
  If FS only writes into disk cache: very fast.
    But cache will eventually fill up with dirty blocks,
      must write to disk.
    Then writes 1, 2, 3, 4, 5, and 7 are amortized over many files
    But write 6 is one per file.
    Sustained rate of maybe 100 files/second.
      10x faster than sync writes.
  So you would like write-back!

And unlink:
  unlink("d/f");
  8. d's contents
  9. d's i-node (mtime)
  10. free i-node bitmap
  11. free block bitmap

Can we recover sensibly with a write-back cache?
  The cache module may write to disk in any order.
  The game: a few dirty blocks flushed to disk, then crash, recovery.
  Example: 1-7 and 8, recovery sees unused i-node, frees it.

  Example: just 3, recovery sees used i-node on free list; ask the user?
  Example: 1-7 and 10, recovery sees used i-node on free list;
    ask the user?
  These are benign, though annoying for the user.
  Clearly there's a vast number of outcomes. Are they all benign?

Here's the worst:
  unlink("f1");
  create("f2");
  Create happens to re-use the i-node freed by the unlink.
  Suppose only create's write #3 goes to disk,
    but none of the unlink's writes.
  Crash. After re-start, what does recovery see?
  The file system looks correct! Nothing to fix.
  But file f1 actually has file f2's contents!
    Serious *undetected* inconsistency.
  This is *not* a state the file system could have been in
    if the crash had occurred slightly earlier or later.
    We didn't just lose the last few updates.
  And fsck did not notify the user there was an unfixable problem!

How can we avoid this delete/create inconsistency?
  Observation: we only care about what's visible in the file system tree.
  Goal: on-disk directory entry must always point to
    correct on-disk inode.
  Unlink rule: remove dirent *on disk* before freeing i-node.
  Create rule: initialize new i-node *on disk*
    before creating directory entry.
  In general, directory entry writes should be commit points.
    Crash just before leaves us with unused allocated i-node.
    Crash just after is fine.
  Synchronous disk writes in the order I gave is sufficient.
    (see the sketch below)

For most file system operations,
there is some recoverable synchronous order.
  Because the file system is a tree, you can prepare the new sub-tree
    and cause it to appear in the old one with one operation.
  And most operations are "small", just affect leaves.
  What about rename()?

Can we eliminate some of the sync writes in file creation?
  To speed up file creation.
  What ordering constraints can we identify?
    #2 before #3
    #6 before #7
    #3 before #4? hard to say; maybe fsck can correct length,
      but not mtime.
    #1 and #5 need never occur, since fsck recovers it.
    #3 can be deferred, since it is a commit point
      once #2 has completed.
  So perhaps only #2 and #6 need to be synchronous.
    To force them to occur before #3 and #7.
    Perhaps #3 to force it to happen before #4.
  UNIX: #2, #3, but not #6.
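A sketch of file creation with the ordering rules above made explicit (all helper names invented; the numbers refer to the write list above):

    struct inode { long mtime; /* block list, length, owner, ... */ };

    int alloc_inode(void);                 // pick a free i-number
    struct inode *init_inode(int inum);
    void add_dirent(struct inode *dir, const char *name, int inum);
    void *dir_contents(struct inode *dir);
    void sync_write(void *block);          // write to disk and *wait*
    void write_back_later(void *block);    // leave dirty in the buffer cache
    long now(void);

    void create_file(struct inode *dir, const char *name) {
        int inum = alloc_inode();       // #1: bitmap write can be deferred;
                                        //     fsck can rebuild free lists
        struct inode *f = init_inode(inum);
        sync_write(f);                  // #2: f's i-node on disk *before*
                                        //     its directory entry
        add_dirent(dir, name, inum);
        sync_write(dir_contents(dir));  // #3: the commit point -- after this
                                        //     write, "d/f" survives a crash
        dir->mtime = now();
        write_back_later(dir);          // #4: fsck can fix up length/mtime
    }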

  UNIX defends only its own meta-data, not your file data.
    Use fsync() if you care.
  Performance: two disk writes/seeks for a file create,
    so 50 files/second.
  Can we do better?


6.824 2006 Lecture 7: Logging

What's the overall topic?
  Atomic updates of complex data w.r.t. failures.
  Today just a single system; we'll be seeing distributed versions later.

Why aren't synchronous meta-data updates enough?
  (from last lecture on file system crash recovery)
  They're slow
  Recovery may require scanning the whole disk
  Some operations don't have an obvious single committing write

Example: FFS rename
  editor could use rename from temp file for careful update
    echo a > d1/f1
    echo b > d2/f2
    mv d2/f2 d1/f1
  need to update two directories, stored in two blocks on disk.
  remove then add? add then remove? probably want add then remove
  what if a crash? what does fsck do?
    it knows something is wrong, since link count is 1, but two links.
    can't roll back -- which one to delete?
    has to just increase the link count.
    this is *not* a legal result of rename!
    but at least we haven't lost the file.
  so FFS is slow *and* it doesn't get semantics right.

You can push tree update one step farther.
  Prepare a new copy of the entire affected sub-tree.
  Replace old subtree in one final write.
  Very expensive if done in the obvious way.
  But you can share structure between old and new tree.
    Only need new storage between change points and sub-tree root.
  (NetApp WAFL does this and more.)
  This approach only works for tree data structures.
    and doesn't support concurrent operations very well

What are the reasons to use logging?
  atomic commit of compound operations, w.r.t. crashes.
  fast recovery (unlike fsck).
  well-defined post-recovery state: serial prefix of operations.
    as if synchronous, and crash had occurred a bit earlier
  can be applied to almost any existing data structure
    e.g. database tables, free lists
  representation is compact on a disk, so very fast to append
  useful to coordinate updates to distributed data structures
    let's all do this operation
    oops, someone didn't say "yes"
    how to back out or complete?

Transactions
  The main point of a log is to make complex operations atomic.
  I.e. operations that involve many individual writes.
  You want all writes or none, even if a crash in the middle.

A "transaction" is a multi-write operation that should be atomic. The logging system needs to know which sets of writes form a transaction. re-organize code to mark start/end of group of atomic operations create() begin_transaction update free list update i-node update directory entry end_transaction app sends writes to the logging system there may be multiple concurrent transactions e.g. if two processes are making system calls Terminology in-memory data cache on-disk data in-memory log on-disk log dirty vs clean sync write vs async naive re-do log keep a "log" of updates B TID [begin] W TID B# new-data [write] E TID [end == commit] Example: B T1 W T1 B1 25 E T1 B T2 W T2 B1 30 B T3 W T3 B2 99 W T3 B3 50 E T3 for now, log lives on its own infinite disk note we include record from uncommitted xactions in the log records from concurrent xactions may be inter-mingled we can write dirty in-memory data blocks to disk any time we want recovery 1. discard all on-disk data 2. scan whole log and remember all Committed TIDs 3. scan whole log, ignore non-committed TIDs, replay the writes why can't we use any of on-disk data's contents during recovery? don't know if a block is from an uncommitted xaction i.e. was written to disk before commit the *real* data is in the log! the on-disk data structure is just a cache for speed since it's hard to *find* things in a log so what have we achieved? atomic update of complex data structures: gets rename() right recoverable operations are fast problems: we have to store the whole log forever Cite as: Robert Morris, course materials for 6.824 Distributed Computer Systems Engineering, Spring 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].

so what have we achieved?
  atomic update of complex data structures: gets rename() right
  recoverable operations are fast
problems:
  we have to store the whole log forever
  recovery has to replay from the beginning of time

re-do with checkpoint
  most logs work like this, e.g. FSD
  allows much faster recovery: can use on-disk data
  write-ahead rule:
    delay flushing dirty blocks from in-memory data cache
    until corresponding commit record is on disk
    so keep updates of uncommitted xactions
      in in-memory data cache (not disk)
  so no un-committed data on disk.
    but disk may be missing some committed data
    recovery needs to replay committed data from the log
  how can we avoid re-playing the whole log on recovery?
    recovery needs to know a point in log at which it can start
    a "checkpoint", pointer into log, stored on disk
    how to ensure recovery can ignore everything before the checkpoint?
    checkpoint rule: all data writes before checkpoint
      must be stable on disk
    checkpoint may not advance beyond first uncommitted Begin
    in background, flush a bunch of early writes, update checkpoint ptr
  three log regions:
    data guaranteed on disk (checkpoint)
    data might be on disk (log write point)
    data cannot be on disk (end of in-memory log)
  on recovery, re-play committed updates from checkpoint onward
  it's ok if we flush but crash before updating checkpoint pointer
    we will re-write exactly the same data during recovery
  can free log space before checkpoint!
  problem: uncommitted transactions use space in in-memory data cache
    a problem for long-running transactions
    (not a problem for file systems)

un-do/re-do with checkpoint
  suppose we want to write uncommitted data to disk?
    need to be able to un-do them in recovery
    so include old value in each log record:
      W TID B# old-data new-data
  now we can write data from in-memory data cache to disk
    after log entry is on disk
    no need to wait for the End to be on disk
    so we can free in-memory data cache blocks
      of uncommitted transactions
  recovery: for each block mentioned in the log
    find the last xaction that wrote that block
    if committed: re-do
    if not committed: un-do

  two pointers stored on disk: checkpoint and tail
    checkpoint: all in-memory data cache entries flushed
        up to this point
      no need to re-do before this point
      but may need to un-do before this point
    tail: start of first uncommitted transaction
      no need to un-do before this point
      so can free before this point
  it's ok if we crash just before updating the tail pointer itself
    we would have advanced it over committed transaction(s)
    so we will re-do them, no problem
  what if there's an un-do record for a block never written to disk?
    it's ok: un-do will re-write same value that's already there
  what if:
    B T1
    W T1 B1 old=10 new=20
    B T2
    W T2 B1 old=20 new=30
    crash
  The right answer is B1 = 10, since neither committed
  But it looks like we'll un-do to 20
  What went wrong? How to fix it?

careful disk writing
  log usually stored in a dedicated known area of the disk
    so it's easy to find after a reboot
  where's the start? checkpoint, a pointer in a known disk sector
  where's the end? hard if crash interrupted log append
    append records in order
    include unique ascending sequence # in each record
    also a checksum for multi-sector records (maybe in End?)
    recovery must search forward for highest sequential #
  i'm assuming disk sector writes are atomic, and "work correctly"
    see FSD paper for better handling of disk failures

why is logging fast?
  group commit -- batched log writes.
    could delay flushing log -- may lose committed transactions
    but at least you have a prefix.
  single seek to implement a transaction.
    maybe less if no intervening disk activity, or group commit
  write-behind of data allows batched / scheduled writes.
    one data block may reflect many transactions.
    i.e. create many files in a directory.
  don't have to be so careful since the log is the real information


6.824 2006 Lecture 8: Tutorial on Cache Consistency and Locking

lecture overview
  a tutorial to help you with labs 4 and 5
  lab 4: locking for correctness with multiple servers
  lab 5: caching for performance
  overall goal: ccfs-based distributed file system
    try to increase number of clients supported by single block server
    assume that (usually) clients work w/ different files
    so let's make this case efficient using caching
    but let's also preserve correctness
  start with your lab 3 ccfs
  [draw picture: two ccfs servers, one block server]

first: correctness w/ multiple servers
  suppose both servers executing a CREATE RPC on same directory
  they both get() dir contents, add a new entry, put() contents
  first put is overwritten, so one file is lost
  how do we know this was the wrong answer?
    need a definition of correctness for concurrent operations
    traditional definition: atomicity
      result of two concurrent operations must be the same
      as if they were run in some one-at-a-time order
  usual solution: serialize operations
    wait for one to finish, then start the second
    if you serialize, and each operation is correct when run alone,
      then the whole system is correct
    don't need to reason specifically about
      every concurrent interleaving
  you'll serialize w/ locks in lab 4
  [add lock server to picture]

what should each lock protect?
  whole file system? no: prevents concurrency that would have been OK.
  just one block? maybe, but then need one lock per dirent
    for NFS3_CREATE.
  i-node + contents: perhaps this will match
    atomic operation granularity.
  so let's have locks with name == file handle

what operations need to be atomic in ccfs?
  certainly CREATE, due to get()-modify-put()
  SETATTR?
  WRITE? (sub-block writes to same block, or updating block lists)
  READ?
    maybe confusing if size != actual amount of data
    and atime update requires read-modify-write

span of a lock in time? (see the sketch below)
  CREATE checks if file exists, creates new i-node,
    reads directory contents, writes contents,
    writes directory i-node
  better hold the directory lock the whole time!
  in general, acquire lock first, release when totally done
  then we get serialization
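A sketch of the lab 4 CREATE path under this discipline, written synchronously for clarity (the real ccfs is event-driven with callbacks; the class and helper names here are stand-ins, not the lab's exact API):

    #include <string>
    typedef std::string filehandle;

    struct lock_client { void acquire(filehandle); void release(filehandle); };
    struct block_client {
        void get(filehandle, std::string *);
        void put(filehandle, std::string);
    };
    bool dir_contains(const std::string &contents, const char *name);
    filehandle new_file_handle();
    std::string new_inode_rep();
    void add_entry(std::string *contents, const char *name, filehandle f);

    void nfs3_create(lock_client *lc, block_client *db,
                     filehandle dir, const char *name) {
        lc->acquire(dir);            // before any get(): serialize on the dir
        std::string contents;
        db->get(dir, &contents);
        if (!dir_contains(contents, name)) {
            filehandle f = new_file_handle();
            db->put(f, new_inode_rep());   // new file's i-node
            add_entry(&contents, name, f);
            db->put(dir, contents);        // commit: directory now names f
        }
        lc->release(dir);            // only after the last put() has completed
    }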

  lucky we're using file handle as lock name, which means
    we can acquire lock before any get()
  can't release lock until after last put() completes
  and better not reply to RPC until put() completes

what if a single ccfs gets concurrent CREATEs in the same dir?
  must still execute one at a time
  so you actually need locks even for a single ccfs
  that's why we never wrote more than 8192 bytes in lab 3 tester
    (NFS client sends WRITEs concurrently for same file)

do we ever need multiple locks?
  CREATE involves two file handles (directory and new file)
  REMOVE involves both a file and a directory
    do we need to hold two locks?
  RENAME probably requires two locks, if two directories
    deadlock, order of acquisition

what about performance?
  every NFS RPC now involves many RPCs to block and lock server
  likely to be slow

Lab 5 plan:
  want to operate out of local cache, w/ no RPCs to block/lock servers
    as long as only one ccfs is using a given file &c
  only talk to block/lock servers when others need our blocks/locks

step 1: add block caching to ccfs
  you will modify blockdbc.C and .h
  get() checks local cache first
    if in local cache: just return
    otherwise: fetch from block server, add to local cache, return
  put() *just* adds to cache, marks block as dirty
  you can copy some code from blockdbd.C: the hash table
  need to know when another ccfs wants to read a block
      that's dirty in our cache
    and when another ccfs wants to write a block
      that's clean in our cache
    and need to ensure at most one ccfs has a dirty copy
      of any given block

we need "cache consistency"
  informally, a read sees the most recent write
  here's a good rule:
    you can cache a block (dirty or not) if you hold the file's lock
    you cannot have a block cached if you don't hold
      the corresponding lock
  so need to "flush" blocks before releasing lock back to server
    drop clean blocks from cache
    put() dirty blocks

but this hasn't helped performance!
  must flush data cache before each release()
  still doing many get()/put()/acquire()/release() per NFS RPC

idea: cache the locks also!

but this alone hasn't helped performance!
  we must flush the data cache before each release(),
  so we're still doing many get()/put()/acquire()/release() per NFS RPC
idea: cache the locks also!

so you need to change lock_client to cache locks locally
  so that release() just marks the lock locally as released
  if you acquire() it again, no need to talk to the lock server
  need to make the lock server send a REVOKE if some other client is waiting
  lock_client should tell fs.C what lock is being revoked
  fs.C should tell the block client to send that file's dirty blocks
    to the block server, and drop the file's clean blocks
    all the file's blocks: content, attributes, &c
  fs.C should tell lock_client when the block server has replied to all the PUTs
  then lock_client should send a RELEASE RPC to the server

details
  given a lock name, how to figure out the keys of the blocks that should be flushed?
  the lock name should be the file handle (so it's easy to flush the attributes)
  name the file's other blocks in a predictable way derived from the file handle

typical sequencing when interacting with locks
  client #1 is caching the lock and dirty blocks; client #2 calls acquire()
    #2 -> LS : ACQUIRE
    LS -> #2 : reply
    LS -> #1 : REVOKE
    #1 -> LS : reply
    #1 -> BS : put(fh, v)
    BS -> #1 : reply
    #1 -> LS : RELEASE
    LS -> #1 : reply
    LS -> #2 : GRANT
    #2 -> LS : reply
    #2 -> BS : get(fh)
  #1 must ensure the block server has the dirty data before releasing!

lab 5 quirks
  NFS3_READ must take the lock, not for atomicity, but to get the latest data
  NFS3_REMOVE may need the file's lock to force the file handle to be stale
    if you only lock the directory, you leave the i-node in other caches,
    so a future GETATTR for the file may succeed
  NFS3_CREATE may need to grab the lock on the new i-node
    to force others to read from our cache

What you're *not* responsible for:
  atomicity w.r.t. crashes
  recovering lock state after a lock server reboot
  replicating the block server
  client crash while holding locks: un-do partial operations?
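One plausible way (not the only one, and not the lab solution) to structure the cached-lock states in lock_client; the state names and handlers here are invented for illustration, following the REVOKE/RELEASE sequence above.

  #include <map>
  #include <string>

  // Per-lock state for a lock-caching client; a sketch only.
  enum lock_state {
    NONE,      // the server doesn't think we hold it
    FREE,      // we hold it at the server, but no local thread is using it
    LOCKED,    // we hold it and a local thread is between acquire/release
    RELEASING  // server sent REVOKE; flushing dirty blocks before RELEASE
  };

  struct cached_lock { lock_state state = NONE; };

  struct lock_client {
    std::map<std::string, cached_lock> locks;

    // acquire(): if the lock is cached FREE, no RPC is needed.
    void acquire(const std::string &name) {
      cached_lock &l = locks[name];
      if (l.state == FREE) { l.state = LOCKED; return; }  // cached: no RPC
      // otherwise send ACQUIRE to the lock server, wait for GRANT
      l.state = LOCKED;
    }

    // release(): just mark it FREE locally; keep caching the lock.
    void release(const std::string &name) { locks[name].state = FREE; }

    // revoke(): the lock server wants the lock back for another client.
    // Tell fs.C to put the file's dirty blocks and drop its clean ones,
    // and send RELEASE only after the block server has ACKed every put().
    void revoke(const std::string &name) {
      locks[name].state = RELEASING;
      // fs.C flushes, then calls released(name)
    }
    void released(const std::string &name) {
      locks[name].state = NONE;
      // send the RELEASE RPC to the lock server here
    }
  };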


6.824 2006 Lecture 9: Memory Consistency (1)

Replicated data: a huge theme in distributed systems.
  For performance and fault tolerance.
  Often easier to replicate data than computation (Hypervisor...).
  Examples:
    Replicated mailboxes and user info in Porcupine.
    Caches in NFS and Echo.
    The file server in labs 4 and 5.
    Shared-memory multiprocessors.
  All these examples involve sophisticated optimizations for performance.
  How do we know if an optimization is correct?
  We need to know how to think about correct execution of distributed programs.
  Most of these ideas are from the multiprocessors of 20-30 years ago,
    so I'll talk about memory, loads, and stores.
  But the ideas are similar for e.g. lab 5.
  For now, just correctness and efficiency, not fault tolerance.

What is "correct" for uniprocessor programs?
  First, define "correct" for each instruction independently:
    it takes the machine from one state to another,
    e.g. ADD R1, R2, R3; or LD gets the value of the last ST to the same address.
  Rule: a result is "correct" if it's the same as a result obtainable by
    executing the instructions one at a time, waiting for each to complete.
  Uniprocessor correctness is a useful definition because the programmer
    can use it to predict what a program will do, and thus write correct programs.
  You can tell this is a *definition* because modern CPUs don't work like this;
    they break this rule; lots of logic to ensure they *look* like they enforce it.

Example of why a CPU doesn't want to *implement* the uniprocessor rule:
  MUL R1, R2, R3
  ST  x, R1
  LD  y, R4
  MUL is pretty slow, so the ST has to wait for it. Dependency via R1.
  But the LD does not need to wait: it gets the same result if executed early.
  So generally the LD executes before the ST, for speed.
  But the CPU h/w checks whether &x == &y, and stalls the LD if so.
  Point: this optimization is only possible w/ a definition!

What about correctness for distributed computations?
  Multiple hosts, shared memory.
  The memory could be files, DSM (next paper), or a DHT.

Naive distributed memory:
  [internet cloud; hosts CPU0, CPU1, CPU2]
  Assume each host has a local copy of all of memory.
  Reads are local, so they are very fast.

  Send write msgs to each other host (but don't wait).

Example 1: a simple mutual exclusion algorithm, for locking.
  x and y start as zero on both CPUs.
  CPU0: x = 1; if(y == 0) critical section;
  CPU1: y = 1; if(x == 0) critical section;
  Intuitive explanation for why this should "work":
    if CPU0 sees y == 0, CPU1 can't yet have reached "y = 1",
    so CPU1 must see x == 1, so it won't execute the critical section.
    Perhaps neither will enter, but never both.

Example 1 fails w/ naive distributed memory (and on most multiprocessors).
  Problem A:
    CPU0 sends the "write x=1" msg, reads local y=0.
    CPU1 reads local x=0 before the write msg arrives.
    Local memory and slow writes cause disagreement about r/w order:
      CPU0 thinks its x=1 was before CPU1's read of x;
      CPU1 thinks its read of x was before the arrival of x=1.
    So both can enter the critical section!

Example 2:
  CPU0: v0 = f0(); done0 = true;
  CPU1: while(done0 == false) ; v1 = f1(v0); done1 = true;
  CPU2: while(done1 == false) ; v2 = f2(v0, v1);
  Intuitive intent: CPU2 should execute f2() with results from CPU0 and CPU1;
    waiting for CPU1 implies waiting for CPU0.

Example 2 won't work with naive distributed memory:
  Problem B:
    CPU0's writes of v0 and done0 may be interchanged by the network,
    leaving v0 unset but done0=true.
  But assume each CPU sees the others' writes in issue order:
  Problem C:
    CPU2 sees CPU1's writes before CPU0's writes,
    i.e. CPU2 and CPU1 disagree on the order of CPU0's and CPU1's writes.
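Example 1's failure is easy to reproduce on a real machine: relaxed atomics in modern C++ permit the same store/load reordering as the naive distributed memory, so both threads can read 0 and both "enter the critical section". A self-contained sketch (the bad outcome is timing- and hardware-dependent, so it may take many iterations to observe):

  #include <atomic>
  #include <cstdio>
  #include <thread>

  // Store-buffering litmus test: with relaxed ordering, both loads can
  // see 0, just like CPU0/CPU1 in Example 1. (seq_cst would forbid it.)
  std::atomic<int> x(0), y(0);
  int r0, r1;

  int main() {
    int violations = 0;
    for (int i = 0; i < 100000; i++) {
      x.store(0, std::memory_order_relaxed);
      y.store(0, std::memory_order_relaxed);
      std::thread t0([] {
        x.store(1, std::memory_order_relaxed);   // x = 1
        r0 = y.load(std::memory_order_relaxed);  // if (y == 0) ...
      });
      std::thread t1([] {
        y.store(1, std::memory_order_relaxed);   // y = 1
        r1 = x.load(std::memory_order_relaxed);  // if (x == 0) ...
      });
      t0.join(); t1.join();
      if (r0 == 0 && r1 == 0) violations++;      // both "entered the critical section"
    }
    std::printf("both saw 0 in %d of 100000 runs\n", violations);
  }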

Lesson:
  either naive distributed memory isn't "correct",
  or we should not have expected the examples to work.

How can we write correct distributed programs w/ shared storage?
  The memory system promises to behave according to certain rules.
  We write programs assuming those rules.
  The rules are a "consistency model":
    a contract between the memory system and the programmer.

What makes a good consistency model?
  There are no "right" or "wrong" models.
  A model may make it harder or easier to program,
    i.e. lead to more or less intuitive results.
  A model may be harder or easier to implement efficiently.

How about "strict consistency"?
  Each instruction is stamped with the wall-clock time at which it started,
    across all CPUs.
  Rule 1: LD gets the value of the most recent previous ST to the same address.
  Rule 2: each CPU's instructions have time-stamps in execution order.
  Essentially the same as on a uniprocessor.

Would strict consistency execute Example 1 intuitively?
  Could both CPUs be in the critical section? i.e. could both CPUs read 0?
  I.e. can we show a time-stamp ordering of operations, consistent with
    the rules, that leads to both CPUs in the critical section?
  Rule 2 says each CPU's operations occur in time-stamp order:
    CPU0: w(x)1 r(y)0
    CPU1: w(y)1 r(x)0
    but we're not sure of the interleave.
  CPU0's r(y)0 means w(y)1 hadn't executed yet, by Rule 1, so:
    CPU0: w(x)1 r(y)0
    CPU1:              w(y)1 r(x)0
  But now we've violated Rule 1, since w(x)1 is followed by r(x)0.
  So both CPUs cannot be in the critical section.
  In general, strict consistency produces intuitive behavior.

How do you implement strict consistency?
  Time:  1    2    3    4
  CPU0:  ST        ST
  CPU1:       LD        LD
  Time passes between instructions:
    the LD yields r(x)1, because w(x)1 must have finished before r(x) starts.

Example of a faster consistency model?
  We're willing to accept more work for the programmer,
    though we still want a well-defined model,
  and in return we expect faster execution.

Release consistency.
  You rarely see programs like the "x=1; if(y==0)" example,
    because it's so hard to reason about them.
  Instead, parallel programs typically lock data that is shared and mutable,
    to create atomic multi-step sequences.
    (Not the same as cache ownership tokens...)
  Example: bank account transfer:
    acquire(l);
    b1 = b1 + x;
    b2 = b2 - x;
    release(l);

Other CPUs aren't allowed to look at b1 or b2 while l is locked.
So the CPU could do the operations in any order within the critical section,
  i.e. load b2 before storing b1.
Rules:
  1. The CPU may not re-order any LD/ST to before the acquire().
     (Otherwise you might read b1 while someone else holds the lock.)
  2. All writes must finish before the release() completes.
     (Otherwise other CPUs might not see the writes.)
The memory system can re-order, cache, &c inside the acquire/release region, so it's fast.
But: the memory system must understand locks, acquire(), and release().
The TreadMarks paper is all about implementing release consistency.
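The two rules map directly onto what a modern locking primitive guarantees; here is the transfer example rendered with C++'s std::mutex (not TreadMarks' API), with the rules as comments:

  #include <mutex>

  std::mutex l;
  long b1 = 100, b2 = 100;

  void transfer(long x) {
    l.lock();      // Rule 1: no LD/ST from below may move above this point
    b1 = b1 + x;   // inside the section the memory system is free to
    b2 = b2 - x;   // reorder, cache, and batch -- no one else may look
    l.unlock();    // Rule 2: both writes must be visible before unlock completes
  }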


6.824 2006 Lecture 12: Vector Timestamps

Topic: vector timestamps / version vectors.
  Usually used for file synchronizers; TreadMarks uses them also.

So far we've talked about enforcing some order of operations,
  either implicitly because they go thru a file server,
  or because of e.g. Ivy DSM sequential consistency.
Nice semantics, but sometimes awkward:
  requires lots of chatter, w/ the file server or other clients.
Can you sensibly operate out of your cache
  without seeing other people's updates?

Example: disconnected operation with laptops.
  A source-code repository, cooperating programmers.
  Each programmer's laptop has a complete copy of the files.
  We want to be able to modify files while disconnected.
  Programmers synchronize files pair-wise: a "file synchronizer".
  How to decide which files to copy, and in which direction?
    Non-conflicting changes: take the latest version.
    Conflicting changes: report an error, ask humans to resolve.
  "Optimistic replication."

Example 1: just focus on modifications to a single file "f".
  (time runs left to right; "->H2" means "sync to H2")
  H1: w(f)1  ->H2          ->H3
  H2:              w(f)2
  H3:                             ->H2
  What is the right thing to do?
  Is it enough to simply take the file with the latest modification time?
    Yes in this case, as long as you carry the mtimes along correctly,
    i.e. H3 remembers the mtime assigned by H1, not the mtime of the sync.

Example 2:
  H1: w(f)1  ->H2           w(f)2
  H2:              w(f)0            ->H1
  H2's mtime will be bigger.
  Should the file synchronizer use "0" and discard "2"?
    No! They were conflicting changes. We need to detect this case.
  Modification times are not enough by themselves.

What is the principle here?
  We're not *enforcing* a sequential update history.
  No locks, no Ivy-style owners.
  But *if* the updates were actually sequential, we want to detect that
    and carry along the most recent copy.
  This is where the "optimism" comes in.
  And if the updates were not sequential, we want to detect the problem
    (after the fact).

So what is the problem with using wall-clock time?
  Certainly if T1 < T2, T1 does not supersede T2.
  *But* if T2 > T1, that doesn't imply that T2 supersedes T1.
  Ordered in time doesn't mean everyone agrees on the order.

How can we decide whether we're looking at a sequential history?
  We could record each file's entire modification history:
    a list of hostname/localtime pairs,
    and carry the history along when synchronizing between hosts.
  For Example 1:
    H2: H1/T1, H2/T2
    H3: H1/T1
  For Example 2:
    H1: H1/T1, H1/T2
    H2: H1/T1, H2/T3
  Then it's easy to decide if version X supersedes version Y:
    if Y's history is a prefix of X's history.
  This exactly captures our desire for an *ordered* sequence of updates.
  Note we're not comparing times, so this works w/ out-of-sync host clocks.

Why is complete history not such a great solution?
  The histories grow without bound if we modify a file a lot.

Can we compress the histories? Proposal: vector timestamps.
  Summarize a history w/ each host's highest time,
    i.e. throw away all but the last history record for each host.
  The last TS from each host preserves the information essential to
    detecting conflicting updates:
    had I seen your latest change when I made my change?
    We only care about the latest change.

A vector timestamp is a vector of times, one entry per host.
  The times are (as in the histories) local wall-clock times,
    though they could be any monotonic count, e.g. a local version number.
  If a and b are vector time-stamps:
    a = b if they agree at every element;
    a <= b if a[i] <= b[i] for every i;
    a and b conflict if a[i] < b[i] but a[j] > b[j], for some i,j.
  If one history was a prefix of the other,
    then one vector timestamp will be less than or equal to the other.
  If one history is not a prefix of the other,
    then (at least by example) the VTs will not be comparable.

So now we can perform file synchronization.
  Laptops carry along a version vector with each file,
    rather than just the single timestamp of the last change.
  When a pair of laptops syncs a file, they accept the file
    with the higher vector timestamp,
  and signal an error if the VTs conflict.
  Illustrate with the two examples.

What if there *are* conflicting updates?
  VTs can detect them, but then what? Depends on the application.
  Easy: a mailbox file with distinct messages; just merge.
  Medium: changes to different lines of a C source file.
  Hard: changes to the same line of C source.
  Reconciliation must be done manually for the hard cases.
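The comparison rule can be written down directly; a sketch in C++, assuming one integer entry per host, indexed by host number:

  #include <cassert>
  #include <vector>

  enum vt_order { VT_EQUAL, VT_BEFORE, VT_AFTER, VT_CONFLICT };

  // Compare two vector timestamps of the same length.
  vt_order vt_compare(const std::vector<long> &a, const std::vector<long> &b) {
    assert(a.size() == b.size());
    bool a_less = false, b_less = false;  // some a[i] < b[i]?  some b[i] < a[i]?
    for (size_t i = 0; i < a.size(); i++) {
      if (a[i] < b[i]) a_less = true;
      if (b[i] < a[i]) b_less = true;
    }
    if (a_less && b_less) return VT_CONFLICT;  // concurrent updates: humans resolve
    if (a_less) return VT_BEFORE;              // b supersedes a: take b's file
    if (b_less) return VT_AFTER;               // a supersedes b: take a's file
    return VT_EQUAL;                           // same version: nothing to do
  }

For Example 2 above, H1's vector is (T2, 0) and H2's is (T1, T3) (entries for H1 and H2 respectively); each exceeds the other in one entry, so vt_compare returns VT_CONFLICT.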

CVS keeps extra information, equivalent to the complete history.
  So it can generate diffs from a common ancestor.
  Much better than diffing the two final versions against each other:
    which is newer? or did they both change?

What about file deletion?
  Deletes have to have VTs so you can detect delete/write conflicts.
  Could treat a delete as a special kind of write.
  But then you have to remember the deletion timestamp indefinitely,
    in case some host out there hasn't heard about it.
  Can we garbage collect deletes?

We've used VTs to *detect* violations of sequential consistency.
  We can also use VTs to help *ensure* consistency.
  TreadMarks uses a similar technique...


6.824 2006 Lecture 13: Two-Phase Commit

General topic: coordinating servers in a distributed system.
  Prev lectures: agree on the order of updates to replicas, no failures.
  Today: general-purpose agreement on some action, w/ failures.
  Next few lectures: agree on updates to replicas, w/ failures.

Example: transfer money from bank A to bank B.
  Debit at A, credit at B, tell the client "ok".
  We want both to do it, or both not to do it.
  We *never* want only one to act.
  We'd rather have nothing happen (this is important).
  We want an "atomic commit protocol".

Bigger transaction-processing picture: we want two kinds of atomicity:
  Serializability: as if in some one-at-a-time order; locking.
  Recoverability: all or nothing, no partial results.
  Today is really about recoverability.
  I'll assume some external force serializes:
    perhaps a lock server forcing one-at-a-time transactions,
    perhaps there's only one source of transactions.

Atomic commit is hard.
  "I'll commit if you commit."
  What if I don't hear back from you?
  Neither party can finally decide.

Straw man: invent a Transaction Coordinator as a single authoritative entity.
  Four entities: client, Transaction Coordinator (TC), bank A, bank B.
  Client sends "go" to the TC.
  TC sends "debit" to A.
  TC sends "credit" to B.
  TC reports "ok" to the client.
  [timeline] No ACKs.

How can this go wrong?
  Maybe there's not enough money in A's bank account.
  Maybe B's bank account no longer exists.
  Maybe the network link to B is broken.
  Maybe A or B has crashed.
  Maybe the TC crashes between sending the messages.

We want two properties (TC, A, and B each have a notion of committing):
  Correctness:
    if one commits, no one aborts;
    if one aborts, no one commits.
  Performance:
    if no failures, and A and B can commit, then commit;
    if there are failures, come to some conclusion ASAP.

Let's do correctness first.

Correct atomic commit protocol:
  TC sends "prepare" messages to A and B.
  A and B respond, saying whether they're willing to commit.
  If both say "yes", the TC sends "commit" messages.
  If either says "no", the TC sends "abort" messages.
  A/B "decide to commit" if they get a commit message,
    i.e. they actually change the bank account.
  Why is this now correct? Neither can commit unless they both agreed.

What about performance?
  Crashes or message losses can still prevent completion.
  We have two types of problems:
    Timeout: I'm up, but I don't receive a msg I expect.
      Maybe the other host crashed. Maybe the network is broken.
      We usually cannot tell the difference, so we must be correct in either case.
    Reboot: I crashed, I'm rebooting, and I need to clean up.

Let's deal with timeouts first. Where do hosts wait for messages?
  1) The TC waits for yes/no.
  2) A and B wait for commit/abort.

Handling TC timeout while waiting for yes/no:
  The TC has not sent any "commit" messages,
  so the TC can safely abort, and send "abort" messages.
  We've preserved correctness but sacrificed performance:
    maybe A and B were both prepared to commit, but we lost a message;
    we could have committed, but the TC didn't know it.
    The TC is being conservative.

Handling A/B timeout while waiting for commit/abort:
  Let's talk about just B (A is symmetric).
  If B voted "no", it can unilaterally abort.
  So what if B voted "yes"? Can B unilaterally decide to abort?
    No! The TC might have gotten "yes" from both,
    sent "commit" to A, but crashed before sending it to B.
    Then A would commit and B would abort: incorrect.
  B can't unilaterally commit, either: A might have voted "no".
  B could just wait forever until it gets commit/abort from the TC.
    But we can do better than that.

Termination protocol for B, if it voted "yes":
  B sends a "status" request message to A:
    asks whether A knows how the transaction should be decided.
  If B doesn't hear a reply from A: no decision possible, wait for the TC.
  If A received "commit" or "abort" from the TC:
    B decides the same way. It can't disagree with the TC...
  If A hasn't voted yes/no yet: B and A both abort.
    The TC can't have decided "commit", so it will eventually hear from A or B.
  If A voted "no": B and A both abort.
    The TC can't have decided "commit".

  If A voted "yes": no decision possible!
    The TC might have decided "commit" and replied to the client,
    or the TC might have timed out and aborted.
    A and B must wait for the TC.

What have we achieved?
  We can resolve some timeout situations w/ guaranteed correctness.
  But sometimes A and B must block,
    due to TC failure, or failure of the TC's network connection.

How to handle crash/reboot?
  We cannot back out of a commit if it was already decided:
    the TC crashes just after deciding "commit";
    A/B crash just after sending "yes".
  Big trouble if they reboot and don't remember saying "yes"!
    They might change their minds after the reboot.
  Or *even after everybody re-starts* they may not be able to decide!
  If all nodes know their state before the crash,
    you can use the termination protocol,
    but also talk to the TC, which may know that it committed.

How do you know what state you were in before the crash?
  Assume non-volatile memory, perhaps a disk.
  But do you write the disk, then send the "yes" message ("commit" if TC)?
  Or send the message and then update the state on the disk?
  We cannot send the message before writing the disk:
    then we might change our mind after the reboot,
    i.e. B might vote "yes", reboot, then vote "no".
  It does work to write the disk before sending the message:
    for the TC w.r.t. "commit", and for A/B w.r.t. "yes".

Recovery protocol w/ non-volatile state:
  If you're the TC, and there's no "commit" on disk: abort.
    No commit on disk -> you didn't send any "commit" messages.
  If you're A/B, and there's no "yes" on disk: abort.
    No "yes" on disk -> you didn't vote yes -> nobody could have committed.
  If you're A/B, and there is a "yes" on your disk:
    run the ordinary termination protocol; you might block.
  If everyone has rebooted and is reachable, you can decide,
    based just on whether the TC has "commit" on its disk.

This protocol is called "two-phase commit".
What properties does it have?
  1. All hosts that decide reach the same decision.
  2. No commit unless everyone said "yes".
  3. If no failures and everyone said "yes", then commit.
  4. If failures, then repair, then wait long enough: some decision.
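The recovery rules condense into a little decision code; a sketch in C++, with the on-disk records reduced to booleans (a real system would log identities and values too):

  enum decision { COMMIT, ABORT, BLOCK };

  // TC recovery: the disk is written *before* any "commit" message is
  // sent, so no commit record on disk means no one can have committed.
  decision tc_recover(bool commit_on_disk) {
    return commit_on_disk ? COMMIT : ABORT;
  }

  // Participant (A/B) recovery: "yes" is forced to disk before the vote
  // is sent, so no "yes" record means we never voted yes -- safe to abort.
  // With a "yes" on disk we must run the termination protocol, and may
  // block until the TC or the other participant reveals the outcome.
  decision participant_recover(bool yes_on_disk) {
    return yes_on_disk ? BLOCK : ABORT;
  }

The BLOCK case is exactly the weakness noted above: a participant that voted "yes" can be stuck until the TC is repaired.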


6.824 2006 Lecture 14: Paxos
(From "Paxos Made Simple", by Leslie Lamport, 2001.)

Introduction.
  2-phase commit is good if different nodes are doing different things,
  but in general you have to wait for all sites and the TC to be up:
    you have to know whether each site voted yes or no,
    and the TC must be up to decide.
  Not very fault-tolerant: it has to wait for repair.
  Can we get work done even if some nodes can't be contacted?
  Yes: in the special case of replication.

State machine replication.
  Works for any kind of replicated service: storage, lock server, whatever.
  Every replica must see the same operations in the same order.
  If the service is deterministic, the replicas will end up with the same state.
  How to ensure all replicas see operations in the same order?

Primary + backup(s).
  Clients send all operations to the current primary.
  The primary chooses the order, sends to the backups, replies to the client.
  What if the primary fails?
    Need to worry about the last operation, possibly not complete.
    Need to pick a new primary.
  Can't afford to have two primaries!
  Suppose the lowest-numbered live server is the primary:
    after a failure everyone pings everyone,
    then everyone knows who the new primary is?
    Well, maybe not:
      pings may be lost => two primaries;
      pings may be delayed => two primaries;
      partition => two primaries.

Idea: a majority of nodes must agree on the primary.
  At most one network partition can have a majority.
  If there are two potential primaries, their majorities must overlap.

Technique: a "view change" algorithm.
  The system goes through a sequence of views.
  A view: a view# and a set of participants.
  Ensure agreement on the unique successor of each view.
  The participant set allows everyone to agree on the new primary.

A view change requires "fault-tolerant agreement":
  at most a single value is chosen.
  Agree despite lost messages and crashed nodes.
  We can't really guarantee to agree,
    but we can guarantee not to "agree" on *different* values!

Paxos: a fault-tolerant agreement protocol.
  Eventually succeeds if a majority of participants are reachable.
  The best known algorithm.

General Paxos approach:

  One (or more) nodes decide to be the leader.
  The leader chooses a proposed value to agree on
    (here: the new view# and participant set).
  The leader contacts the participants, tries to assemble a majority:
    the participants are all the nodes in the old view (including unreachable ones),
    or a fixed set of configuration master nodes.
  If a majority respond, we're done.

Why agreement is hard:
  What if two nodes decide to be the leader?
  What if a network partition leads to two leaders?
  What if the leader crashes after persuading only some of the nodes?
  What if the leader got a majority, then failed, without announcing the result?
    Or announced the result to only a few nodes?
    A new leader might choose a different value, even though we agreed.

Paxos has three phases;
  it may have to start over if there are failures/timeouts.
  State (per view):
    n_a, v_a: the highest proposal number this node has accepted, and its value
    n_h: the highest n seen in a Q1
    done: the leader says agreement was reached; we can start the new view

Paxos Phase 1:
  A node (maybe more than one...) decides to be the leader.
  It picks a proposal number n.
    Must be unique; good if it's higher than any known number.
    How about: last known proposal number, plus one, append node ID.
  It sends Q1(n) to every node (including itself).
  If a node gets Q1(n) and n > n_h:
    n_h = n
    return R1(n_a, v_a)

Paxos Phase 2:
  If the leader gets R1 from a majority of nodes (including itself):
    if any R1(n, v) carried a value, v = the value with the highest n
    else the leader gets to choose a value:
      old view# + 1, and the set of pingable nodes
    send Q2(n, v) to all responders
  If a node gets Q2(n, v) and n >= n_h:
    n_a = n
    v_a = v
    return R2()

Paxos Phase 3:
  If the leader gets R2() from a majority:
    send Q3() to all
  If a node gets Q3():
    done = true
    the primary is the lowest-numbered node in v_a

If at any time any node gets bored (times out),
  it declares itself a leader and starts a new Phase 1.
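The per-node (acceptor) side of Phases 1 and 2 is mechanical; a sketch using the lecture's names (n_h, n_a, v_a), with proposal numbers as plain integers rather than (number, node-ID) pairs. In a real node n_h and n_a/v_a must be kept on disk, for the reasons worked out below.

  #include <string>

  // Acceptor state and handlers for Paxos Phases 1-2; a sketch only.
  struct acceptor {
    int n_h = -1;       // highest n seen in a Q1
    int n_a = -1;       // highest n accepted
    std::string v_a;    // value accepted with n_a

    // Q1(n): promise not to accept anything numbered n_h or below.
    bool handle_q1(int n, int &out_n_a, std::string &out_v_a) {
      if (n <= n_h) return false;   // stale proposal: ignore
      n_h = n;
      out_n_a = n_a;                // report any already-accepted value,
      out_v_a = v_a;                // so the leader must re-use it
      return true;                  // this is the R1 reply
    }

    // Q2(n, v): accept unless we've promised a higher n.
    bool handle_q2(int n, const std::string &v) {
      if (n < n_h) return false;
      n_a = n;
      v_a = v;
      return true;                  // this is the R2 reply
    }
  };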

If nothing goes wrong, Paxos clearly reaches agreement.
How do we ensure a good probability that there is only one leader?
  Every node has to be prepared to be the leader, to cope w/ failure.
  So delay a random amount of time after you realize a new view is required,
  or delay your node ID times some constant.

Key danger: nodes w/ different v_a receive Q3.
Goal: if a Q3 *could* have been sent,
  future Q3s are guaranteed to carry the same v_a.

What if there is more than one leader?
  Due to timeout, partition, or lost packets.
  Say the two leaders used different n: 10 and 11.
  If 10 didn't get a majority to R2:
    it never will, since no one will R2 10 after seeing 11's Q1;
    or perhaps 10 is in a network partition.
  If 10 did get a majority to R2, i.e. it might have sent Q3:
    10's majority saw 10's Q2 before 11's Q1
      (otherwise they would have ignored 10's Q2, so no majority).
    So 11 will get an R1 from at least one node that saw 10's Q2.
    So 11 will be aware of 10's value.
    So 11 will use 10's value, rather than making up a new one.
    So we agreed on a single v after all.

What if the leader fails before sending any Q2s?
  Some node will time out and become a leader.
  The old leader didn't send any Q3, so we don't care what it did.
  It's good, but not necessary, that the new leader chooses a higher n.
    If it doesn't: timeout, and some other leader will try.
    Eventually we'll get a leader that knew the old n and will use a higher n.

What if the leader fails after sending a minority of Q2s?
  Same as two leaders...

What if the leader fails after sending a majority of Q2s?
  I.e. potentially after reaching agreement!
  Same as two leaders...

What if a node fails after receiving a Q2?
  If it doesn't restart: possible timeout in Phase 3, new leader.
  If it does restart: it must remember v_a/n_a! (on disk)
    The leader might have failed after sending a few Q3s;
    the new leader must choose the same value,
    and our node might be the intersecting node of the two majorities.

What if a node reboots after sending an R1?
  Does it have to remember n_h on disk?
  It uses n_h to reject Q1/Q2 with smaller n.
  Scenario:
    leader1 sends Q1(n=10); a bare majority sends R1,
    so node X's n_h = 10.

    leader2 sends Q1(n=11); a majority intersecting only at node X sends R1,
    so node X's n_h = 11.
    leader2 got no R1 with a value, so it chooses v=200.
    Node X crashes and reboots, losing n_h.
    leader1 sends Q2(n=10, v=100); its bare majority accepts it,
      including node X (which should have rejected it...),
    so we have agreement w/ v=100.
    leader2 sends Q2(n=11, v=200); its bare majority all accept the message,
      including node X, since 11 > n_h,
    so we have agreement w/ v=200. Oops.
  So: each node must remember n_h on disk.

Conclusion: what have we achieved?
  Remember the original goal was replicated state machines,
    and we want to continue even if some nodes are not available.
  After each failure we can perform a view change using Paxos agreement:
    that is, we can agree on exactly which nodes are in the new view,
    so, for example, everyone can agree on a single new primary.
  But we haven't talked at all about how to manage the data.


6.824 2006 Lecture 15: Viewstamped Replication
(From "Viewstamped Replication: A New Primary Copy Method to Support
Highly-Available Distributed Systems," Oki and Liskov, PODC 1988.)

Recall the overall topic: replicated fault-tolerant services,
  e.g. a lock server that can continue even if some servers crash.
Last lecture: the Paxos view change (we'll use it in this lecture).
This lecture: reconstructing the data after failures (i.e. during a view change).
Next class: a complete system built with these tools (Harp).

Overall plan for a fault-tolerant service: primary copy.
  Elect a primary. Clients send all operations through the primary.
  The primary processes client operations in some order.
  The primary sends each operation to the backups, waits for ACKs.
  If the primary fails, clients start using a backup as the primary.
  Problem: the status of the last operation if the primary fails.
  Problem: a network partition might lead to two primaries.
  Problem: net or backup failure may cause a replica to miss some operations.

Core idea: quorums.
  Majority -> only one view.
  Intersection -> can find the latest data from the old view.
  Quorums are not as simple as they may seem:
    A, B, and C are running the lock service.
    C is separated, leaving A and B.
    A crashes and loses its memory, and B separates.
    A restarts, and C's network connection is fixed.
    A and C form a majority -- but do they know that C's data is stale?

Solution: viewstamped replication.
  Harp uses a variant of viewstamped replication.
  I'll present a simplified form,
    more suitable for e.g. a lock server than for a file server.
  Properties:
    Assume there are 2b+1 replicas.
    Can tolerate failure (node or net) of up to b replicas,
      including primary failures.
    Will operate if b+1 replicas are reachable.
    Handles partition correctly:
      if one partition has b+1 or more, that partition can proceed;
      a partition with b or fewer replicas does nothing.
    Mimics a single reliable server:
      will only operate if it can guarantee a linear operation history,
      i.e. every server's state reflects the same op sequence (harder than FAB),
      and it guarantees not to forget an ACKed operation.

Overview of viewstamped replication.
  The system goes through a series of views, one at a time.
  A "view" consists of a primary and the replicas it can talk to.

Start a new view if the set of reachable nodes changes (e.g. the primary crashes).
In each view:
  Agree on the primary and the set of backups (much like Paxos).
  Recover the latest state from the previous view.
  The primary accepts client operations, sends them to the other replicas.

Why the system is going to be correct, i.e. mimic a single reliable server:
  A view must have a majority, so there's one view at a time.
  The primary sequences client operations within each view.
  The primary sends each operation to every server in the view.
  The next view must also be a majority,
    thus it includes at least one replica from the previous view.
  So all operations that completed in the previous view are known to the next view.
  This part is harder than it seems, and it's the core of the technique.

Data types:
  viewid: (viewcount, replica id)
  viewstamp: (viewid, operation sequence number)

State maintained by each replica:
  cur_viewid
  data
  last_viewstamp
  max_viewid
  crashed
  cur_viewid is on disk, preserved even if there's a crash.
  The others (including the data) are in memory, lost in a crash.

Operation within a view:
  [draw a timeline, to help understand last-operation issues]
  A view consists of the primary and at least b other replicas.
  Clients send operations to the primary.
    They know who it is because non-primary replicas will redirect them.
  The primary picks the next client operation.
  The primary assigns that operation the next viewstamp.
  The primary sends the operation+viewstamp to all the replicas *in the view*.
    Thus it's possible to proceed despite some failed replicas.
  The primary waits for all ACKs
    (otherwise different replicas apply different subsets of the operations).
  The primary sends an ACK to the client.

When does a view change occur?
  The primary notices a backup in the view has died.
  Any node in the view notices the primary has died.
  Any node in the view notices a node outside the view has revived.

First step: pick a new primary and a new view #.
  This is much like the Paxos view change from the previous lecture.
  One or more servers send invitations (maybe simultaneously).
  Each invite contains a new viewid, higher than max_viewid.

We want the highest viewid to win: a higher viewcount always wins;
  if the viewcounts are the same, the higher replica_id wins.
If a node sees an invite with viewid > max_viewid,
  it drops everything and participates in the election.
The reachable replicas will agree on a viewid
  if nothing fails/joins for long enough.
The node that proposed the winning viewid is the new primary.

The primary needs to reconstruct the state.
  Which server (if any) has up-to-date state, reflecting all commits?
  We don't want to forget about any ACKed operations.
  So the primary needs to establish three properties:
    1. At most one view at a time, so there's a well-defined "previous view".
    2. The new view knows what the previous view and viewid were.
    3. The new view knows the last state of the previous view,
       up through the last operation that could have been ACKed to a client.
  Property 1 is satisfied if the primary assembles a majority.
  Property 2 is also satisfied with a majority:
    any majority must include a server from the previous view;
    that server has the previous view's cur_viewid on disk, even if it crashed.
    We're assuming disks don't fail,
      so we're really assuming a majority with intact disks.
  Property 3 is satisfied if we have one non-crashed server from the previous view:
    all servers in the previous view saw all committed operations,
    since the old primary would not have committed
      w/o an ACK from every replica in the old view.
    So if our one replica doesn't have the viewstamp, it didn't commit.
    Servers *know* whether they crashed+rebooted since the last view change.

The new primary sends all servers in the view the state
  from the server it has chosen.
All servers in the view write cur_viewid to disk.
The new primary can now start processing client requests.

Are we guaranteed to include the last operation if it was ACKed to the client?
  The old primary only sent the ACK after all replicas ACKed,
  so any non-crashed server from the previous view has the latest ACKed operation.
  So: yes.

What about operations that hadn't quite committed when the view change started?
  The primary had sent the operation to one server, but not all of them.
  Since the primary had not gotten the ACKs, it didn't respond to the client.
  So we can commit it, or not commit it; either is legal.
  It depends on which server the primary chooses as "most recent state".
  So we might accept a final operation that the old primary never ACKed.
    The client may re-send; the primary needs to recognize a duplicate operation.
    So somehow transaction IDs or RPC seq #s must be in the replicated state?
    Or operations must be safely repeatable.
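The viewid ordering rule above (higher viewcount wins, replica id breaks ties) is just a lexicographic comparison; a small sketch:

  // A viewid orders views: viewcounts first, replica ids break ties.
  struct viewid {
    long viewcount;
    int replica_id;
  };

  bool later_view(const viewid &a, const viewid &b) {  // is a later than b?
    if (a.viewcount != b.viewcount) return a.viewcount > b.viewcount;
    return a.replica_id > b.replica_id;
  }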

When *can't* we form a new view?
  Remember the earlier example:
    A, B, and C are running the lock service.
    C is separated, leaving A and B.
    A crashes and loses its memory, and B separates.
    A restarts, and C's network connection is fixed.
    A and C form a majority -- but do they know that C's data is stale?
  How does VSR prevent the formation of this new view?
    A kept the last cur_viewid on disk, so it can tell C isn't up to date.
    And A knows it isn't up to date either, since it crashed.
    They must wait for B to rejoin; it has the latest state.
  Note that A and B are enough by themselves:
    a majority means they can figure out the latest viewid;
    B not having crashed means they have the latest state.
  Similarly, B and C are enough.

Are we ever forbidden to form a view when it would actually be correct?
  I.e. is this the best fault tolerance possible?
  Suppose A has intact data from the previous view,
    but B and C have crashed and are still down.
  It would actually be correct for A to proceed alone.
  But A cannot know that, so it must wait.

Is this system perfect?
  Must copy the full state around at each view change:
    OK for a lock service, a disaster for an NFS service.
  Vulnerable to power failures:
    nodes lose state if power fails, and it may strike all nodes at the same time.
  The primary executes operations one at a time:
    probably slow; it cannot overlap execution of one op with the send of the next;
    it would be even slower if every op had to be written to disk.
  We'll see solutions to these problems in the Harp paper:
    a replicated NFS server that uses viewstamped replication.


Replication in the Harp File System
Liskov, Ghemawat, Gruber, Johnson, Shrira, Williams
SOSP 1991

Key points:
  2b+1 servers; keeps going w/ up to b failures; might recover from > b.
  Improvements over viewstamped replication:
    supports concurrent/overlapped operations, with a log to maintain order;
    supports huge state, using the log to bring recovered disks up to date;
    can recover after simultaneous power failures (of all 2b+1).

Outline of basic operation. Client, primary, backups, witnesses.
  Client -> Primary
  Primary -> Backups
  Backups -> Primary; the primary waits for all backups
  Primary replies to Client
  Primary tells the backups to commit

Why does Harp use a log?
  1. To keep track of multiple concurrent ops.
  2. The log is the in-memory state recovered by VSR.
  3. The log maintains order, so we recover a prefix after a failure.
  4. The log can bring a separated backup's disk up to date.
What is in a typical log record?

Why does Harp have so many log pointers?
  FP:  most recent client request
  CP:  commit point (real in the primary; latest heard-of in a slave)
  AP:  highest record sent to the disk on this node
  LB:  disk has completed up to here
  GLB: all nodes' disks have completed up to here

Why the FP-CP gap?
  So the primary doesn't need to wait for ACKs from each backup
    before sending the next operation to the backups.
  Higher throughput: overlap the wait for the previous op with the next.
  Probably most useful when there are multiple clients.
Why the CP-AP gap? Why not apply to disk at the CP?
  Exactly what happens at the AP? How are ops applied?
Why the AP-LB gap?
  Allows delay between the issue of an op and when it must complete to disk. Why?
  What is the LB? How does Harp find out what the current LB is?
Why the LB-GLB gap? I.e. why not delete a log record when the disk write completes?
  Can't throw away log records until we know *everyone* has applied them,
  because we might need to use our log to bring someone up to date.
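One way to picture the five pointers is as operation numbers into a single ordered log, with the invariant GLB <= LB <= AP <= CP <= FP. A sketch (pointer names from the lecture, everything else invented):

  #include <deque>
  #include <string>

  // Sketch of a Harp-style log and its pointers; indexes are op numbers.
  // Invariant: glb <= lb <= ap <= cp <= fp.
  struct harp_log {
    long fp  = 0;  // most recent client request appended to the log
    long cp  = 0;  // commit point: every backup has ACKed through here
    long ap  = 0;  // highest record handed to the local file system
    long lb  = 0;  // local disk has finished applying through here
    long glb = 0;  // *every* node's disk has finished through here
    std::deque<std::string> log;  // records (glb, fp]; older ones discarded

    // Only records everyone has applied may be discarded -- anything
    // newer might be needed to bring a recovering node up to date.
    void discard_through(long new_glb) {
      while (glb < new_glb && !log.empty()) { log.pop_front(); glb++; }
    }
  };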

How does failure recovery work?

Scenarios (5 servers, S1-S5; S1 is usually primary, S2-S3 backups, S4-S5 witnesses):

S2's cooling fan fails, so its CPU melts, and it crashes.
  New view. S4 is promoted (witness -> backup).
  S4 gets a copy of the log starting at the GLB
    (i.e. all ops not known to be on the disks).
  S4 starts logging all operations, but doesn't apply them.
  But the GLB advances, so the primary can discard log entries.
  Why bother promoting S4?

S2 gets a new CPU and reboots.
  New view. S4 sends its big log to S2;
  S2 plays it to get all the missing operations.

S2 suffers a disk failure.
  It needs to get a complete disk image + log from S1 or S3.

What if S1 crashes just after replying to a client?
  Where will the new primary's FP and CP be after the view change?
  Does the new primary have to do anything special about ops between CP and FP?
    Did the other backups get the op?
  Does this do the right thing for ops that the old primary *didn't* reply to?

All nodes suffer a power failure just after S1 replies to a client.
  Then they all re-start. Can they continue?
  Where were the logs stored while the power was out?
  What if they had all crashed -- could they continue?
    Crash == lost memory contents (despite the UPS).
    How do they tell the difference?
  Why does Harp focus so much on UPSes and power failure?
    Since it already has a good story for more serious h/w failures?

S2 and S3 are partitioned (but still alive).
  Can S1+S4+S5 continue to process operations?
S4 moves to S2/S3's partition.
  Can S2+S3+S4 continue?
S2 and S3 are partitioned (but still alive);
  S4 crashes, loses its memory contents, and reboots in S2/S3's partition.
  Can they continue? Depends on what S4's on-disk view # says.
Everybody suffers a power failure.
  S4's disk and memory are lost, but it does re-start after repair.
  S1 and S5 never recover.
  S2 and S3 save everything on disk, and re-start just fine.
  Can S2+S3+S4 continue?

In general, how do you know you can form a view?
  1. No other view is possible.
  2. You know about the most recent view.

  3. You know all ops from the most recent view.
  #1 is true if you have b+1 nodes in the new view.
  #2 is true if you have b+1 nodes that did not lose their view # since the last view.
    The view # is stored on disk, so they just have to know the disk is OK.
    One of them *must* have been in the previous view,
    so just take the highest view number.
  Now that we know the last view number, we need a disk image and a log
    that together reflect all operations through the end of the previous view.
    Perhaps from different servers: e.g. the log from a promoted witness,
    and the disk from a backup that failed multiple views ago.

If a node recovers w/ a working disk, can you really replay a log into it?
  What if the log contains operations already applied to the disk?
If a node recovers but its disk needs fsck:
  Is it legal to run fsck? Does Harp run fsck?
  Can you avoid fsck and repair by re-doing the log, as in FSD?
If a node recovers w/o its disk contents, i.e. w/ an empty disk:
  Does it work to copy another server's disk?
  What if the other server is actively serving Harp/NFS ops?
  Can we avoid pausing for the entire time of the disk copy?

How does the primary generate return values for ops?
  It replies at the CP, before ops have been applied to the file system!
  For example, how do you know an UNLINK would succeed?
  Or the file handle of a CREATE?

How does Harp handle read-only operations, e.g. GETATTR?
  Why doesn't it have to consult the backups?
  Why is it correct to ignore ops between CP and FP when generating the reply?
  What if a client sends a WRITE then a READ before the WRITE reaches the CP?

Does Harp have performance benefits?
  Yes: thanks to the UPS, there's no need for synchronous disk writes.
  But in general, not 3x performance.

Why graph x=load, y=response-time? Why does this graph make sense?
  Why not just graph the total time to perform X operations?
  One reason: systems sometimes get more/less efficient w/ high load,
    and we care a lot how they perform under overload.
Why does response time go up with load?
  Why first gradual?
    Queuing and random bursts?
    And some ops are more expensive than others, causing temporary delays.
  Then almost straight up?
    Probably hard limits, like disk I/Os per second.

Frangipani: A Scalable Distributed File System
Thekkath, Mann, Lee
SOSP 1997

Why not primary copy?
  Actually they do use primary copy inside Petal,
  but there are many Petal primary/backup pairs in one FS.

Work out our own design...
  How to divide the data among the network disks?
  What does Petal do / guarantee?

What happens if a client e.g. creates a file?
  What steps does the server go through?
  Acquire the lock, append to the *local* log, update local meta-data,
    release the lock locally, reply to the client.

What if a client on a different server reads that file?
  S1 gets the REVOKE;
  it writes its log to Petal, writes the meta-data to Petal,
    then RELEASEs the lock (see the sketch at the end of this section).
  Why must it write the log entry to Petal before writing the meta-data?
  Why must it write the meta-data to Petal before releasing the lock?

What if two clients try to create the same file at the same time?
  The locks are doing two things:
    atomic multi-write transactions;
    serializing updates to meta-data (cache consistency).

What if a server dies and it is not holding any locks?
  Can the other servers totally ignore the failure?
What if a server dies while holding locks?
  Can we just ignore it until it comes back up and recovers itself?
  Can we just revoke its locks and continue?
  What does Frangipani do to recover?
  What's in a log record?

S1 creates f2, crashes while holding the lock. How does replay work...
  if S1 crashed before any flush of anything?
  mid-way through flushing the log?
  mid-way through flushing the data?
  just after all flushing, before releasing the lock?
  just after releasing the lock?

What effect will the logging have on ordinary performance?

Suppose S1 deletes f1, flushes its block+log, and releases the lock.
  Then S2 acquires the lock and creates a new f1.
  Then S1 crashes.
  Will recovery re-play the delete?
  The details depend on whether S2 has written the block yet.

Does the recovery manager have to acquire locks before playing records?
  What if some other server currently holds the lock?
  Might the other server have stale data cached? From before the replay?

What if two servers crash at about the same time?
  And they both modified the same file, then released the lock.
  How do we know what order to replay their logs in?
  I.e. can we replay one, then the other?
  Or must we interleave in the original order?

What if a power failure affects all the servers?
  Suppose S1 creates f1, creates f2, then crashes.
  What combinations of f1 and f2 are allowed after recovery?

What if a server runs out of log space?
  What if it hasn't yet flushed the corresponding blocks to Petal?

What happens if the network partitions?
  Could a partitioned file server perform updates?
  Serve stale data out of its cache?
  What if the partition heals just before the lease expires?
  Could the file server and lock server disagree about who holds the lock?

Why isn't the lock service a performance bottleneck?
What if a lock server crashes?
Why does their lock service use Paxos?

Why does Frangipani have a disk-like interface to Petal?
  Frangipani was never intended to use a physical disk,
    so there's no compatibility reason;
  might some other interface work better?

Table 2: why are creates relatively slow, but deletes fast?
Why is figure 5 flat? Why not more load -> longer run times?
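The ordering questions raised above about the REVOKE path have one consistent answer: log record to Petal first, then the meta-data, then the lock release. A sketch of that handler, with invented stand-ins for the Petal and lock-service calls (the real interfaces differ):

  #include <string>
  #include <vector>

  // Invented stand-ins; a sketch only, not Frangipani's real API.
  void petal_write_log(const std::vector<std::string> &records) { /* RPC to Petal */ }
  void petal_write_metadata(const std::string &block,
                            const std::string &data) { /* RPC to Petal */ }
  void lock_release(const std::string &lockname) { /* RPC to lock service */ }

  // What a server does when its lock is revoked. The order matters:
  //  1. log to Petal first  -- a crash after step 2 can then be replayed;
  //  2. meta-data to Petal  -- the next lock holder must read current data;
  //  3. only then release   -- so no one reads Petal before it's updated.
  void on_revoke(const std::string &lockname,
                 const std::vector<std::string> &log_records,
                 const std::string &metadata_block,
                 const std::string &metadata) {
    petal_write_log(log_records);                    // step 1
    petal_write_metadata(metadata_block, metadata);  // step 2
    lock_release(lockname);                          // step 3
  }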


6.824 2006: Scalable Lookup

Prior focus has been on traditional distributed systems,
  e.g. NFS, DSM/Hypervisor, Harp:
  machine room: well maintained, centrally located;
  relatively stable population: can be known in its entirety;
  focus on performance, semantics, recovery.
  The biggest such system might be Porcupine.

Now: Internet-scale systems.
  Machines owned by you and me: no central authority.
  A huge number of distributed machines: can't know everyone.
  E.g. from e-mail to Napster.

Problems:
  How do you name nodes and objects?
  How do you find other nodes in the system (efficiently)?
  How should data be split up between nodes?
  How to prevent data from being lost? How to keep it available?
  How to provide consistency?
  How to provide security? anonymity?

What structure could be used to organize nodes?
  Central contact point: Napster.
  Hierarchy: DNS for e-mail, WWW.
  Flat? Let's look at a system with a flat interface: a DHT.

Scalable lookup: provide an abstract interface to store and find data.
  Typical DHT interface:
    put(key, value)
    get(key) -> value
    loose guarantees about keeping data alive
    log(n) hops, even for new nodes
    guarantees about load balance, even for new nodes

Potential DHT applications:
  publishing: DHT keys are like links
  file system: use the DHT as a sort of distributed disk drive;
    keys are like block numbers; Petal is a little bit like this
  location tracking: keys are e.g. cell phone numbers;
    a value is a phone's current location

Basic idea: two layers, routing (lookup) and data storage.
  The routing layer handles naming and arranging nodes, and then finding them.
  The storage layer handles actually putting and maintaining the data.

What does a complete algorithm have to do?
  1. Define IDs, and the document-ID-to-node-ID assignment.
  2. Define per-node routing table contents.
  3. A lookup algorithm that uses the routing tables.
  4. A join procedure to reflect new nodes in the tables.
  5. Failure recovery.


  6. Move data around when nodes join.
  7. Make new replicas when nodes fail.

Typical approach:
  Give each node a unique ID.
  Have a rule for assigning keys to nodes based on node/key IDs,
    e.g. key X goes on the node with the ID nearest to X.
  Now how, given X, do we find that node?
  Arrange nodes in an ID space, based on ID,
    i.e. use IDs as coordinates.
  Build a global sense of direction. Examples:
    a 1D line, a 2D square, a tree based on bits, a hypercube, or an ID circle.
  Build routing tables to allow ID-space navigation:
    each node knows about its ID-space neighbors,
      i.e. knows the neighbors' IDs and IP addresses;
    perhaps each node also knows a few farther-away nodes,
      to move long distances quickly.

The "Chord" peer-to-peer lookup system
  by Stoica, Morris, Karger, Kaashoek and Balakrishnan
  http://pdos.csail.mit.edu/chord/
  An example system of this type.

ID-space topology:
  A ring: all IDs are 160-bit numbers, viewed in a ring.
  Everyone agrees on how the ring is divided between the nodes,
    based just on the ID bits.

Assignment of key IDs to node IDs:
  A key is stored on the first node whose ID is equal to or greater than the key ID.
  "Closeness" is defined as the clockwise distance.
  If node and key IDs are uniform, we get reasonable load balance.
  Node IDs can be assigned, chosen randomly, a SHA-1 hash of the IP address...
  Key IDs can be derived from the data, or chosen by the user.

Routing:
  A query is at some node.
  The node needs to forward the query to a node "closer" to the key.
  Simplest system: either you are the "closest" or your neighbor is closer.
    Hand off queries in a clockwise direction until done.
    The only state necessary is the "successor":
      n.find_successor(k):
        if k in (n, successor]: return successor
        else: return successor.find_successor(k)
  Slow but steady; how can we make this faster?
    This looks like a linked list: O(n).
  Can we make it more like a binary search?
    We need to be able to halve the distance at each step.
  Finger-table routing:
    keep track of nodes exponentially further away;
    new state: succ(n + 2^i).

Many of these entries will be the same in a full system: expect O(lg N) distinct.
  n.find_successor(k):
    if k in (n, successor]: return successor
    else:
      n' = closest_preceding_node(k)
      return n'.find_successor(k)
  Maybe node 8's finger table looks like this:
    8 + 1  -> 14
    8 + 2  -> 14
    8 + 4  -> 14
    8 + 8  -> 21
    8 + 16 -> 32
    8 + 32 -> 42

There's a complete tree rooted at every node:
  it starts at that node's row 0,
  threaded through other nodes' row 1, &c.
  Every node acts as a root, so there's no root hotspot.
  This is *better* than simply arranging the nodes in one tree.

How does a new node acquire correct tables?
  General approach:
    assume the system starts out w/ correct routing tables;
    use the routing tables to help the new node find information;
    add the new node in a way that maintains correctness.
  The new node issues a lookup for its own key, to any existing node;
    this finds the new node's successor.
  Ask that node for a copy of its finger table.
  At this point the new node can forward queries correctly,
    tweaking its own finger table as necessary.

Does routing *to* us now work?
  If the new node doesn't do anything,
    a query will go to where it would have gone before we joined,
    i.e. to the existing node numerically closest to us.
  So, for correctness, we need to let people know that we are here.
  Each node keeps track of its current predecessor.
  When you join, tell your successor that its predecessor has changed.
  Periodically ask your successor who its predecessor is:
    if that node is closer to you, switch to it.
  Is that enough? Everyone must also continue to update their finger tables:
    periodically look up your (n + 2^i)-th key.

What about concurrent joins?
  E.g. two new nodes with very close IDs might have the same successor,
    e.g. 44 and 46 may both find node 48... a spiky tree!
  Good news: periodic stabilization takes care of this.
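The successor-only and finger-table lookups above can be checked with a tiny single-process simulation. Everything here (a 64-ID ring, the node set) is made up for illustration; each node's fingers are computed on the fly from the global node set rather than stored per node.

  #include <cstdio>
  #include <set>

  const int RING = 64;
  std::set<int> nodes = {1, 8, 14, 21, 32, 38, 42, 48, 51, 56};

  // Clockwise distance from a to b on the ring.
  int dist(int a, int b) { return (b - a + RING) % RING; }

  // successor(id): the first node at or clockwise-after id.
  int successor(int id) {
    auto it = nodes.lower_bound(id);
    return it == nodes.end() ? *nodes.begin() : *it;
  }

  // find_successor(k) from node n: return the successor if k is in
  // (n, successor]; otherwise forward to the finger succ(n + 2^i)
  // that lands furthest along (n, k) -- the closest preceding node.
  int find_successor(int n, int k, int &hops) {
    int succ = successor((n + 1) % RING);
    if (dist(n, k) > 0 && dist(n, k) <= dist(n, succ)) return succ;
    int next = succ, best = dist(n, succ);
    for (int i = 1; (1 << i) < RING; i++) {
      int f = successor((n + (1 << i)) % RING);
      int d = dist(n, f);
      if (d > best && d < dist(n, k)) { next = f; best = d; }
    }
    hops++;
    return find_successor(next, k, hops);
  }

  int main() {
    int hops = 0;
    int owner = find_successor(8, 52, hops);
    std::printf("key 52 is stored on node %d (%d forwarding hops)\n", owner, hops);
  }

With only successor pointers the query would crawl around the ring in O(n) hops; the fingers cut the example lookup to a couple of forwards.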

What about node failures?
  Assume nodes fail w/o warning
    (strictly harder than graceful departure).
  Two issues:
    other nodes' routing tables refer to the dead node;
    the dead node's predecessor has no successor.
  If you try to route via the dead node: detect the timeout,
    treat it as an empty table entry,
    i.e. route to the numerically closer entry instead.
  Repair: ask any node on the same row for a copy of its corresponding entry.
    Or any node on the rows below.
    All of these share the right prefix.
  For a missing successor:
    the failed node might have been the closest to the key ID!
    We need to know the next-closest:
      maintain a _list_ of successors: r of them.
      If you expect really bad luck, maintain O(log N) successors.
  We can route around failure; the system is effectively self-correcting.

Locality.
  A lookup takes log(n) messages,
    but they go to random nodes on the Internet! Often very far away.
  Can we route through nodes close to us on the underlying network?
  This boils down to whether we have choices:
    if there are multiple correct next hops, we can try to choose the closest.
  Chord doesn't allow much choice. But observe:
    the strict successor for a finger isn't necessary;
    sample the nodes in the true finger's successor list, pick the closest.
  What's the effect?
    Individual hops have lower latency.
    But there's less and less choice (lower node density)
      as you get close in ID space,
    so the last few hops are likely to be very long.
    Thus you don't *end up* close to the initiating node;
      you just get there quicker.
  How fast could proximity routing be? 1 + 1/4 + 1/16 + 1/64 ...
    Not as good as real shortest-path routing!
  Any downside to locality routing?
    Harder to prove independent failure.
      Maybe no big deal, since there's no locality in successor-list sets.
    Easier to trick me into using malicious nodes in my tables.

What about security?
  Self-authenticating data, e.g. key = SHA1(value),
    so a DHT node can't forge data.
    Of course it's annoying to have immutable data...
  Can a DHT node claim that data doesn't exist?
    Yes, though perhaps you can check other replicas.
  Can a host join w/ IDs chosen to sit under every replica?
    Or "join" many times, so it is most of the DHT nodes?
    How are IDs chosen?

Why not just keep complete routing tables?

  So you can always route in one hop?
  Danger in large systems: timeouts, or the cost of keeping the tables up to date.
  Accordion (NSDI '05):
    trades off between keeping complete state and performance,
    subject to a bandwidth budget.

Are there any reasonable applications for DHTs?
  For example, could you build a DHT-based Gnutella?

Next time: wide-area storage.
