The Architecture of Open Source Applications
Volume II: Structure, Scale, and a Few More Fearless Hacks
Edited by Amy Brown and Greg Wilson

This work is licensed under the Creative Commons Attribution 3.0 Unported license (CC BY 3.0). You are free:

• to Share—to copy, distribute and transmit the work
• to Remix—to adapt the work

under the following conditions:

• Attribution—you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).

with the understanding that:

• Waiver—Any of the above conditions can be waived if you get permission from the copyright holder.
• Public Domain—Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
• Other Rights—In no way are any of the following rights affected by the license:
  – Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
  – The author's moral rights;
  – Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.
• Notice—For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by/3.0/

To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.

The full text of this book is available online at http://www.aosabook.org/. All royalties from its sale will be donated to Amnesty International.

Product and company names mentioned herein may be the trademarks of their respective owners. While every precaution has been taken in the preparation of this book, the editors and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Front cover photo © James Howe.

Revision Date: April 10, 2013
ISBN: 978-1-105-57181-7

In memory of Dennis Ritchie (1941–2011). We hope he would have enjoyed reading what we have written.

Contents

Introduction, by Amy Brown and Greg Wilson
1. Scalable Web Architecture and Distributed Systems, by Kate Matsudaira
2. Firefox Release Engineering, by Chris AtLee, Lukas Blakk, John O'Duinn, and Armen Zambrano Gasparnian
3. FreeRTOS, by Christopher Svec
4. GDB, by Stan Shebs
5. The Glasgow Haskell Compiler, by Simon Marlow and Simon Peyton Jones
6. Git, by Susan Potter
7. GPSD, by Eric Raymond
8. The Dynamic Language Runtime and the Iron Languages, by Jeff Hardy
9. ITK, by Luis Ibáñez and Brad King
10. GNU Mailman, by Barry Warsaw
11. matplotlib, by John Hunter and Michael Droettboom
12. MediaWiki, by Sumana Harihareswara and Guillaume Paumier
13. Moodle, by Tim Hunt
14. nginx, by Andrew Alexeev
15. Open MPI, by Jeffrey M. Squyres
16. OSCAR, by Jennifer Ruttan
17. Processing.js, by Mike Kamermans
18. Puppet, by Luke Kanies
19. PyPy, by Benjamin Peterson
20. SQLAlchemy, by Michael Bayer
21. Twisted, by Jessica McKellar
22. Yesod, by Michael Snoyman
23. Yocto, by Elizabeth Flanagan
24. ZeroMQ, by Martin Sústrik

Introduction
Amy Brown and Greg Wilson

In the introduction to Volume 1 of this series, we wrote:

Building architecture and software architecture have a lot in common, but there is one crucial difference. While architects study thousands of buildings in their training and during their careers, most software developers only ever get to know a handful of large programs well. . . As a result, they repeat one another's mistakes rather than building on one another's successes. . . This book is our attempt to change that.

In the year since that book appeared, over two dozen people have worked hard to create the sequel you have in your hands. They have done so because they believe, as we do, that software design can and should be taught by example—that the best way to learn how to think like an expert is to study how experts think. From web servers and compilers through health record management systems to the infrastructure that Mozilla uses to get Firefox out the door, there are lessons all around us. We hope that by collecting some of them together in this book, we can help you become a better developer.

— Amy Brown and Greg Wilson

Contributors

Andrew Alexeev (nginx): Andrew is a co-founder of Nginx, Inc.—the company behind nginx. Prior to joining Nginx, Inc. at the beginning of 2011, Andrew worked in the Internet industry and in a variety of ICT divisions for enterprises. Andrew holds a diploma in Electronics from St. Petersburg Electrotechnical University and an executive MBA from Antwerp Management School.

Chris AtLee (Firefox Release Engineering): Chris is loving his job managing Release Engineers at Mozilla. He has a BMath in Computer Science from the University of Waterloo. His online ramblings can be found at http://atlee.ca.

Michael Bayer (SQLAlchemy): Michael Bayer has been working with open source software and databases since the mid-1990s. Today he's active in the Python community, working to spread good software practices to an ever wider audience. Follow Mike on Twitter at @zzzeek.

Lukas Blakk (Firefox Release Engineering): Lukas graduated from Toronto's Seneca College with a Bachelor of Software Development in 2009, but started working with Mozilla's Release Engineering team while still a student thanks to Dave Humphrey's (http://vocamus.net/dave/) Topics in Open Source classes. Lukas Blakk's adventures with open source can be followed on her blog at http://lukasblakk.com.

Amy Brown (editorial): Amy worked in the software industry for ten years before quitting to create a freelance editing and book production business. She has an underused degree in Math from the University of Waterloo. She can be found online at http://www.amyrbrown.ca/.

Michael Droettboom (matplotlib): Michael Droettboom works for STScI developing science and calibration software for the Hubble and James Webb Space Telescopes. He has worked on the matplotlib project since 2007.

Elizabeth Flanagan (Yocto): Elizabeth Flanagan works for the Open Source Technologies Center at Intel Corp as the Yocto Project's Build and Release engineer. She is the maintainer of the Yocto Autobuilder and contributes to the Yocto Project and OE-Core.
She lives in Portland, Oregon and can be found online at http://www.hacklikeagirl.com.

Jeff Hardy (Iron Languages): Jeff started programming in high school, which led to a bachelor's degree in Software Engineering from the University of Alberta and his current position writing Python code for Amazon.com in Seattle. He has also led IronPython's development since 2010. You can find more information about him at http://jdhardy.ca.

Sumana Harihareswara (MediaWiki): Sumana is the community manager for MediaWiki, serving as the volunteer development coordinator for the Wikimedia Foundation. She previously worked with the GNOME, Empathy, Telepathy, Miro, and AltLaw projects. Sumana is an advisory board member for the Ada Initiative, which supports women in open technology and culture. She lives in New York City. Her personal site is at http://www.harihareswara.net/.

Tim Hunt (Moodle): Tim Hunt started out as a mathematician, getting as far as a PhD in non-linear dynamics from the University of Cambridge before deciding to do something a bit less esoteric with his life. He now works as a Leading Software Developer at the Open University in Milton Keynes, UK, working on their learning and teaching systems, which are based on Moodle. Since 2006 he has been the maintainer of the Moodle quiz module and the question bank code, a role he still enjoys. From 2008 to 2009, Tim spent a year in Australia working at the Moodle HQ offices. He blogs at http://tjhunt.blogspot.com and can be found @tim_hunt on Twitter.

John Hunter (matplotlib): John Hunter is a Quantitative Analyst at TradeLink Securities. He received his doctorate in neurobiology at the University of Chicago for experimental and numerical modeling work on synchronization, and continued his work on synchronization processes as a postdoc in Neurology working on epilepsy. He left academia for quantitative finance in 2005. An avid Python programmer and lecturer in scientific computing in Python, he is the original author and lead developer of the scientific visualization package matplotlib.

Luis Ibáñez (ITK): Luis has worked for 12 years on the development of the Insight Toolkit (ITK), an open source library for medical imaging analysis. Luis is a strong supporter of open access and the revival of reproducibility verification in scientific publishing. Luis has been teaching a course on Open Source Software Practices at Rensselaer Polytechnic Institute since 2007.

Mike Kamermans (Processing.js): Mike started his career in computer science by failing technical Computer Science and promptly moved on to getting a master's degree in Artificial Intelligence instead. He's been programming in order not to have to program since 1998, with a focus on getting people the tools they need to get the jobs they need done, done. He has focussed on many other things as well, including writing a book on Japanese grammar, and writing a detailed explanation of the math behind Bézier curves. His under-used home page is at http://pomax.nihongoresources.com.

Luke Kanies (Puppet): Luke founded Puppet and Puppet Labs in 2005 out of fear and desperation, with the goal of producing better operations tools and changing how we manage systems. He has been publishing and speaking on his work in Unix administration since 1997, focusing on development since 2001.
He has developed and published multiple simple sysadmin tools and contributed to established products like Cfengine, and has presented on Puppet and other tools around the world, including at OSCON, LISA, Linux.Conf.au, and FOSS.in. His work with Puppet has been an important part of DevOps and delivering on the promise of cloud computing.

Brad King (ITK): Brad King joined Kitware as a founding member of the Software Process group. He earned a PhD in Computer Science from Rensselaer Polytechnic Institute. He is one of the original developers of the Insight Toolkit (ITK), an open source library for medical imaging analysis. At Kitware Dr. King's work focuses on methods and tools for open source software development. He is a core developer of CMake and has made contributions to many open source projects including VTK and ParaView.

Simon Marlow (The Glasgow Haskell Compiler): Simon Marlow is a developer at Microsoft Research's Cambridge lab, and for the last 14 years has been doing research and development using Haskell. He is one of the lead developers of the Glasgow Haskell Compiler, and amongst other things is responsible for its runtime system. Recently, Simon's main focus has been on providing great support for concurrent and parallel programming with Haskell. Simon can be reached via @simonmar on Twitter, or +Simon Marlow on Google+.

Kate Matsudaira (Scalable Web Architecture and Distributed Systems): Kate Matsudaira has worked as the VP Engineering/CTO at several technology startups, including currently at Decide, and formerly at SEOmoz and Delve Networks (acquired by Limelight). Prior to joining the startup world she spent time as a software engineer and technical lead/manager at Amazon and Microsoft. Kate has hands-on knowledge and experience with building large-scale distributed web systems, big data, cloud computing and technical leadership. Kate has a BS in Computer Science from Harvey Mudd College, and has completed graduate work at the University of Washington in both Business and Computer Science (MS). You can read more on her blog and website at http://katemats.com.

Jessica McKellar (Twisted): Jessica is a software engineer from Boston, MA. She is a Twisted maintainer, Python Software Foundation member, and an organizer for the Boston Python user group. She can be found online at http://jesstess.com.

John O'Duinn (Firefox Release Engineering): John has led Mozilla's Release Engineering group since May 2007. In that time, he's led work to streamline Mozilla's release mechanics, improve developer productivity—and do it all while also making the lives of Release Engineers better. John got involved in Release Engineering 19 years ago when he shipped software that reintroduced a bug that had been fixed in a previous release. John's blog is at http://oduinn.com/.

Guillaume Paumier (MediaWiki): Guillaume is Technical Communications Manager at the Wikimedia Foundation, the nonprofit behind Wikipedia and MediaWiki. A Wikipedia photographer and editor since 2005, Guillaume is the author of a Wikipedia handbook in French. He also holds an engineering degree in Physics and a PhD in microsystems for life sciences. His home online is at http://guillaumepaumier.com.

Benjamin Peterson (PyPy): Benjamin contributes to CPython and PyPy as well as several Python libraries. In general, he is interested in compilers and interpreters, particularly for dynamic languages.
Outside of programming, he enjoys music (clarinet, piano, and composition), pure math, German literature, and great food. His website is http://benjamin-peterson.org.

Simon Peyton Jones (The Glasgow Haskell Compiler): Simon Peyton Jones is a researcher at Microsoft Research Cambridge, before which he was a professor of computer science at Glasgow University. Inspired by the elegance of purely-functional programming when he was a student, Simon has focused nearly thirty years of research on pursuing that idea to see where it leads. Haskell is his first baby, and still forms the platform for much of his research. His home page is at http://research.microsoft.com/~simonpj.

Susan Potter (Git): Susan is a polyglot software developer with a penchant for skepticism. She has been designing, developing and deploying distributed trading services and applications since 1996, recently switching to building multi-tenant systems for software firms. Susan is a passionate power user of Git, Linux, and Vim. You can find her tweeting random thoughts on Erlang, Haskell, Scala, and (of course) Git at @SusanPotter.

Eric Raymond (GPSD): Eric S. Raymond is a wandering anthropologist and trouble-making philosopher. He's written some code, too. If you're not laughing by now, why are you reading this book?

Jennifer Ruttan (OSCAR): Jennifer Ruttan lives in Toronto. Since graduating from the University of Toronto with a degree in Computer Science, she has worked as a software engineer for Indivica, a company devoted to improving patient health care through the use of new technology. Follow her on Twitter @jenruttan.

Stan Shebs (GDB): Stan has had open source as his day job since 1989, when a colleague at Apple needed a compiler to generate code for an experimental VM and GCC 1.31 was conveniently at hand. After following up with the oft-disbelieved Mac System 7 port of GCC (it was the experiment's control case), Stan went to Cygnus Support, where he maintained GDB for the FSF and helped on many embedded tools projects. Returning to Apple in 2000, he worked on GCC and GDB for Mac OS X. A short time at Mozilla preceded a jump to CodeSourcery, now part of Mentor Graphics, where he continues to develop new features for GDB. Stan's professorial tone is explained by his PhD in Computer Science from the University of Utah.

Michael Snoyman (Yesod): Michael Snoyman received his BS in Mathematics from UCLA. After working as an actuary in the US, he moved to Israel and began a career in web development. In order to produce high-performance, robust sites quickly, he created the Yesod Web Framework and its associated libraries.

Jeffrey M. Squyres (Open MPI): Jeff works in the rack server division at Cisco; he is Cisco's representative to the MPI Forum standards body and is a chapter author of the MPI-2 standard. Jeff is Cisco's core software developer in the open source Open MPI project. He has worked in the High Performance Computing (HPC) field since his early graduate-student days in the mid-1990s. After some active duty tours in the military, Jeff received his doctorate in Computer Science and Engineering from the University of Notre Dame in 2004.

Martin Sústrik (ZeroMQ): Martin Sústrik is an expert in the field of messaging middleware, and participated in the creation and reference implementation of the AMQP standard. He has been involved in various messaging projects in the financial industry.
He is a founder of the ØMQ project, and currently is working on integration of messaging technology with operating systems and the Internet stack. He can be reached at sustrik@250bpm.com, at http://www.250bpm.com, and on Twitter as @sustrik.

Christopher Svec (FreeRTOS): Chris is an embedded software engineer who currently develops firmware for low-power wireless chips. In a previous life he designed x86 processors, which comes in handy more often than you'd think when working on non-x86 processors. Chris has bachelor's and master's degrees in Electrical and Computer Engineering, both from Purdue University. He lives in Boston with his wife and golden retriever. You can find him on the web at http://saidsvec.com.

Barry Warsaw (Mailman): Barry Warsaw is the project leader for GNU Mailman. He has been a core Python developer since 1995, and release manager for several Python versions. He currently works for Canonical as a software engineer on the Ubuntu Platform Foundations team. He can be reached at barry@python.org or @pumpichank on Twitter. His home page is http://barry.warsaw.us.

Greg Wilson (editorial): Greg has worked over the past 25 years in high-performance scientific computing, data visualization, and computer security, and is the author or editor of several computing books (including the 2008 Jolt Award winner Beautiful Code) and two books for children. Greg received a PhD in Computer Science from the University of Edinburgh in 1993.

Armen Zambrano Gasparnian (Firefox Release Engineering): Armen has been working for Mozilla since 2008 as a Release Engineer. He has worked on releases, developers' infrastructure optimization and localization. Armen works with youth at the Church on the Rock, Toronto, and has worked with international Christian non-profits for years. Armen has a bachelor's degree in Software Development from Seneca College and has taken a few years of Computer Science at the University of Malaga. He blogs at http://armenzg.blogspot.com.

Acknowledgments

We would like to thank Google for their support of Amy Brown's work on this project, and Cat Allman for arranging it. We would also like to thank all of our technical reviewers: Johan Harjono, Justin Sheehy, Nikita Pchelin, Laurie McDougall Sookraj, Tom Plaskon, Greg Lapouchnian, Will Schroeder, Bill Hoffman, Audrey Tang, James Crook, Todd Ritchie, Josh McCarthy, Andrew Petersen, Pascal Rapicault, Eric Aderhold, Jonathan Deber, Trevor Bekolay, Taavi Burns, Tina Yee, Colin Morris, Christian Muise, David Scannell, Victor Ng, Blake Winton, Kim Moir, Simon Stewart, Jonathan Dursi, Richard Barry, Ric Holt, Maria Khomenko, Erick Dransch, Ian Bull, and Ellen Hsiang. Special thanks go to Tavish Armstrong and Trevor Bekolay, without whose above-and-beyond assistance this book would have taken a lot longer to produce. Thanks also to everyone who offered to review but was unable to for various reasons, and to everyone else who helped and supported the production of this book.

Thank you also to James Howe (http://jameshowephotography.com/), who kindly let us use his picture of New York's Equitable Building for the cover.

Contributing

Dozens of volunteers worked hard to create this book, but there is still a lot to do. You can help by reporting errors, helping to translate the content into other languages, or describing the architecture of other open source projects. Please contact us at aosa@aosabook.org if you would like to get involved.
Chapter 1: Scalable Web Architecture and Distributed Systems
by Kate Matsudaira

Open source software has become a fundamental building block for some of the biggest websites. And as those websites have grown, best practices and guiding principles around their architectures have emerged. This chapter seeks to cover some of the key issues to consider when designing large websites, as well as some of the building blocks used to achieve these goals.

This chapter is largely focused on web systems, although some of the material is applicable to other distributed systems as well.

1.1 Principles of Web Distributed Systems Design

What exactly does it mean to build and operate a scalable web site or application? At a primitive level it's just connecting users with remote resources via the Internet—the part that makes it scalable is that the resources, or access to those resources, are distributed across multiple servers.

Like most things in life, taking the time to plan ahead when building a web service can help in the long run; understanding some of the considerations and tradeoffs behind big websites can result in smarter decisions at the creation of smaller web sites. Below are some of the key principles that influence the design of large-scale web systems:

Availability: The uptime of a website is absolutely critical to the reputation and functionality of many companies. For some of the larger online retail sites, being unavailable for even minutes can result in thousands or millions of dollars in lost revenue, so designing their systems to be constantly available and resilient to failure is both a fundamental business and a technology requirement. High availability in distributed systems requires the careful consideration of redundancy for key components, rapid recovery in the event of partial system failures, and graceful degradation when problems occur.

Performance: Website performance has become an important consideration for most sites. The speed of a website affects usage and user satisfaction, as well as search engine rankings, a factor that directly correlates to revenue and retention. As a result, creating a system that is optimized for fast responses and low latency is key.

Reliability: A system needs to be reliable, such that a request for data will consistently return the same data. In the event the data changes or is updated, then that same request should return the new data. Users need to know that if something is written to the system, or stored, it will persist and can be relied on to be in place for future retrieval.

Scalability: When it comes to any large distributed system, size is just one aspect of scale that needs to be considered. Just as important is the effort required to increase capacity to handle greater amounts of load, commonly referred to as the scalability of the system. Scalability can refer to many different parameters of the system: how much additional traffic can it handle, how easy is it to add more storage capacity, or even how many more transactions can be processed.

Manageability: Designing a system that is easy to operate is another important consideration. The manageability of the system equates to the scalability of operations: maintenance and updates. Things to consider for manageability are the ease of diagnosing and understanding problems when they occur, ease of making updates or modifications, and how simple the system is to operate.
(That is, does it routinely operate without failure or exceptions?)

Cost: Cost is an important factor. This obviously can include hardware and software costs, but it is also important to consider other facets needed to deploy and maintain the system. The amount of developer time the system takes to build, the amount of operational effort required to run the system, and even the amount of training required should all be considered. Cost is the total cost of ownership.

Each of these principles provides the basis for decisions in designing a distributed web architecture. However, they can also be at odds with one another, such that achieving one objective comes at the cost of another. A basic example: choosing to address capacity by simply adding more servers (scalability) can come at the price of manageability (you have to operate an additional server) and cost (the price of the servers).

When designing any sort of web application it is important to consider these key principles, even if it is to acknowledge that a design may sacrifice one or more of them.

1.2 The Basics

When it comes to system architecture there are a few things to consider: what are the right pieces, how these pieces fit together, and what are the right tradeoffs. Investing in scaling before it is needed is generally not a smart business proposition; however, some forethought into the design can save substantial time and resources in the future.

This section is focused on some of the core factors that are central to almost all large web applications: services, redundancy, partitions, and handling failure. Each of these factors involves choices and compromises, particularly in the context of the principles described in the previous section. In order to explain these in detail it is best to start with an example.

Example: Image Hosting Application

At some point you have probably posted an image online. For big sites that host and deliver lots of images, there are challenges in building an architecture that is cost-effective, highly available, and has low latency (fast retrieval).

Imagine a system where users are able to upload their images to a central server, and the images can be requested via a web link or API, just like Flickr or Picasa. For the sake of simplicity, let's assume that this application has two key parts: the ability to upload (write) an image to the server, and the ability to query for an image. While we certainly want the upload to be efficient, we care most about having very fast delivery when someone requests an image (for example, images could be requested for a web page or other application). This is very similar functionality to what a web server or Content Delivery Network (CDN) edge server (a server a CDN uses to store content in many locations so content is geographically/physically closer to users, resulting in faster performance) might provide.

Other important aspects of the system are:

• There is no limit to the number of images that will be stored, so storage scalability, in terms of image count, needs to be considered.
• There needs to be low latency for image downloads/requests.
• If a user uploads an image, the image should always be there (data reliability for images).
• The system should be easy to maintain (manageability).
• Since image hosting doesn't have high profit margins, the system needs to be cost-effective.

Figure 1.1 is a simplified diagram of the functionality.
Figure 1.1: Simplified architecture diagram for image hosting application

In this image hosting example, the system must be perceivably fast, its data stored reliably, and all of these attributes highly scalable. Building a small version of this application would be trivial and easily hosted on a single server; however, that would not be interesting for this chapter. Let's assume that we want to build something that could grow as big as Flickr.

Services

When considering scalable system design, it helps to decouple functionality and think about each part of the system as its own service with a clearly defined interface. In practice, systems designed in this way are said to have a Service-Oriented Architecture (SOA). For these types of systems, each service has its own distinct functional context, and interaction with anything outside of that context takes place through an abstract interface, typically the public-facing API of another service.

Deconstructing a system into a set of complementary services decouples the operation of those pieces from one another. This abstraction helps establish clear relationships between the service, its underlying environment, and the consumers of that service. Creating these clear delineations can help isolate problems, but also allows each piece to scale independently of one another. This sort of service-oriented design for systems is very similar to object-oriented design for programming.

In our example, all requests to upload and retrieve images are processed by the same server; however, as the system needs to scale it makes sense to break out these two functions into their own services. Fast-forward and assume that the service is in heavy use; such a scenario makes it easy to see how longer writes will impact the time it takes to read the images (since the two functions will be competing for shared resources). Depending on the architecture this effect can be substantial. Even if the upload and download speeds are the same (which is not true of most IP networks, since most are designed for at least a 3:1 download-speed:upload-speed ratio), files being read will typically come from cache, while writes will have to go to disk eventually (and perhaps be written several times in eventually consistent situations). Even if everything is in memory or read from disks (like SSDs), database writes will almost always be slower than reads. (See Pole Position, an open source tool for DB benchmarking, at http://polepos.org/, and its results at http://polepos.sourceforge.net/results/PolePositionClientServer.pdf.)

Another potential problem with this design is that a web server like Apache or lighttpd typically has an upper limit on the number of simultaneous connections it can maintain (defaults are around 500, but can go much higher), and in high traffic, writes can quickly consume all of those. Since reads can be asynchronous, or take advantage of other performance optimizations like gzip compression or chunked transfer encoding, the web server can serve reads faster and switch between clients quickly, serving many more requests per second than its maximum number of connections (with Apache and the maximum connections set to 500, it is not uncommon to serve several thousand read requests per second). Writes, on the other hand, tend to maintain an open connection for the duration of the upload: uploading a 1 MB file could take more than 1 second on most home networks, so such a web server could handle only 500 simultaneous writes.

Planning for this sort of bottleneck makes a good case to split out reads and writes of images into their own services, shown in Figure 1.2.
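To make the split concrete, here is a minimal sketch in Python of the two functions running as separate services. It is illustrative only, not taken from any real deployment: the ports, the ROLE environment variable, and the single local directory standing in for the shared image corpus are all assumptions, and a production system would put a load balancer in front of each pool and use a shared object store.

    import os
    from http.server import BaseHTTPRequestHandler, HTTPServer

    STORE = "/tmp/images"  # hypothetical stand-in for the shared image corpus

    class ReadHandler(BaseHTTPRequestHandler):
        """Read service: short-lived, cache-friendly downloads."""
        def do_GET(self):
            path = os.path.join(STORE, os.path.basename(self.path))
            if not os.path.isfile(path):
                self.send_error(404)
                return
            with open(path, "rb") as f:
                data = f.read()
            self.send_response(200)
            self.send_header("Content-Type", "application/octet-stream")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    class WriteHandler(BaseHTTPRequestHandler):
        """Write service: holds a connection open for the whole upload."""
        def do_PUT(self):
            length = int(self.headers["Content-Length"])
            path = os.path.join(STORE, os.path.basename(self.path))
            with open(path, "wb") as f:
                f.write(self.rfile.read(length))
            self.send_response(201)
            self.end_headers()

    if __name__ == "__main__":
        os.makedirs(STORE, exist_ok=True)
        # Each service would normally run on its own pool of machines; here a
        # hypothetical ROLE variable picks which one this process plays.
        if os.environ.get("ROLE") == "write":
            HTTPServer(("", 8081), WriteHandler).serve_forever()
        else:
            HTTPServer(("", 8080), ReadHandler).serve_forever()

The handlers themselves are beside the point; what matters is the deployment boundary. Because reads and writes live in different processes behind different endpoints, each pool can be sized, tuned, and upgraded without touching the other.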
Figure 1.2: Splitting out reads and writes

This allows us to scale each of them independently (since it is likely we will always do more reading than writing), but also helps clarify what is going on at each point. Finally, this separates future concerns, which would make it easier to troubleshoot and scale a problem like slow reads.

The advantage of this approach is that we are able to solve problems independently of one another—we don't have to worry about writing and retrieving new images in the same context. Both of these services still leverage the global corpus of images, but they are free to optimize their own performance with service-appropriate methods (for example, queuing up requests, or caching popular images—more on this below). And from a maintenance and cost perspective each service can scale independently as needed, which is great because if they were combined and intermingled, one could inadvertently impact the performance of the other, as in the scenario discussed above.

Of course, the above example can work well when you have two different endpoints (in fact this is very similar to several cloud storage providers' implementations and Content Delivery Networks). There are lots of ways to address these types of bottlenecks though, and each has different tradeoffs. For example, Flickr solves this read/write issue by distributing users across different shards such that each shard can only handle a set number of users, and as users increase more shards are added to the cluster. (See the presentation on Flickr's scaling at http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html.) In the first example it is easier to scale hardware based on actual usage (the number of reads and writes across the whole system), whereas Flickr scales with their user base (but forces the assumption of equal usage across users, so there can be extra capacity). In the former an outage or issue with one of the services brings down functionality across the whole system (no one can write files, for example), whereas an outage with one of Flickr's shards will only affect those users. In the first example it is easier to perform operations across the whole dataset—for example, updating the write service to include new metadata or searching across all image metadata—whereas with the Flickr architecture each shard would need to be updated or searched (or a search service would need to be created to collate that metadata—which is in fact what they do).

When it comes to these systems there is no right answer, but it helps to go back to the principles at the start of this chapter, determine the system needs (heavy reads or writes or both, level of concurrency, queries across the data set, ranges, sorts, etc.), benchmark different alternatives, understand how the system will fail, and have a solid plan for when failure happens.

Redundancy

In order to handle failure gracefully a web architecture must have redundancy of its services and data. For example, if there is only one copy of a file stored on a single server, then losing that server means losing that file. Losing data is seldom a good thing, and a common way of handling it is to create multiple, or redundant, copies.

This same principle also applies to services. If there is a core piece of functionality for an application, ensuring that multiple copies or versions are running simultaneously can secure against the failure of a single node.
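As a rough illustration of failing over to a healthy copy, here is a sketch of client-side failover across redundant replicas of the read service from the earlier example. The replica hostnames are hypothetical, and real systems usually push this logic into a load balancer or a service-discovery layer rather than into every client.

    import urllib.request

    # Hypothetical redundant copies of the read service, ideally spread
    # across machines (or data centers) so one failure cannot take out all.
    REPLICAS = [
        "http://img-read-1.example.com:8080",
        "http://img-read-2.example.com:8080",
        "http://img-read-3.example.com:8080",
    ]

    def fetch_image(name, timeout=2):
        """Try each replica in turn; fail only if every copy is down."""
        last_error = None
        for base in REPLICAS:
            try:
                with urllib.request.urlopen(f"{base}/{name}",
                                            timeout=timeout) as resp:
                    return resp.read()
            except OSError as err:  # urllib.error.URLError subclasses OSError
                last_error = err    # this node is down or degraded; fail over
        raise RuntimeError(f"all replicas failed: {last_error}")

This is failover in its crudest form: the redundant copies do nothing until a request actually fails, at which point the client silently moves on to the next one.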
Creating redundancy in a system can remove single points of failure and provide a backup or spare functionality if needed in a crisis. For example, if there are two instances of the same service running in production, and one fails or degrades, the system can fail over to the healthy copy. Failover can happen automatically or require manual intervention.

Another key part of service redundancy is creating a shared-nothing architecture. With this architecture, each node is able to operate independently of the others, and there is no central "brain" managing state or coordinating activities for the other nodes. This helps a lot with scalability, since new nodes can be added without special conditions or knowledge. Most importantly, there is no single point of failure in these systems, so they are much more resilient to failure.

For example, in our image server application, all images would have redundant copies on another piece of hardware somewhere (ideally in a different geographic location in the event of a catastrophe like an earthquake or fire in the data center), and the services to access the images would be redundant, all potentially servicing requests. (See Figure 1.3; load balancers are a great way to make this possible, but there is more on that below.)

Figure 1.3: Image hosting application with redundancy

Partitions

There may be very large data sets that are unable to fit on a single server. It may also be the case that an operation requires too many computing resources, diminishing performance and making it necessary to add capacity. In either case you have two choices: scale vertically or horizontally.

Scaling vertically means adding more resources to an individual server. So for a very large data set, this might mean adding more (or bigger) hard drives so a single server can contain the entire data set. In the case of the compute operation, this could mean moving the computation to a bigger server with a faster CPU or more memory. In each case, vertical scaling is accomplished by making the individual resource capable of handling more on its own.

To scale horizontally, on the other hand, is to add more nodes. In the case of the large data set, this might be a second server to store parts of the data set, and for the computing resource it would mean splitting the operation or load across some additional nodes. To take full advantage of