A distributed system is a collection of networked computers that appears to its users as a single coherent system. [1] We have already created distributed systems that involve the browser, the server and the database:
- The web browser communicates with the web server over HTTP/HTTPS
- The web server communicates with a database management system (e.g., PostgreSQL or MongoDB) using the database’s own network protocol (both links are sketched in the code below)
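To make these two links concrete, the following is a minimal sketch, assuming a Node.js web server written in TypeScript with the Express framework and the pg driver; the route, the messages table and the connection string are placeholders rather than part of any application built earlier.

```ts
import express from "express";
import { Pool } from "pg";

// Link 2: the web server talks to PostgreSQL using the database's own
// wire protocol; the connection string is a placeholder.
const db = new Pool({ connectionString: "postgres://localhost/example" });

const app = express();

// Link 1: the browser talks to the web server over HTTP/HTTPS.
app.get("/messages", async (_req, res) => {
  // The "messages" table is hypothetical.
  const result = await db.query("SELECT id, body FROM messages");
  res.json(result.rows);
});

app.listen(3000);
```

From the browser's point of view there is only one URL; the database sits behind the web server and is never contacted directly.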
During development, these systems often run on the same computer. In production, they typically run on separate servers.
If the software grows in popularity, it may soon become necessary to run each component on multiple computers, because no single computer can handle all of the demand.
Shifting from a small number of computers to a distributed system of networked computers creates new challenges. There are more points of failure and more ways that things can go wrong.
Fallacies of distributed computing
The eight fallacies of distributed computing [2] are assumptions that developers too often make when building distributed systems:
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn’t change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Real-world practice inevitably violates these assumptions:
- The network is unreliable: IP packets are routinely lost or incorrectly routed. Firewalls can block access. Some protocols can fail. For example, a running microwave oven creates interference that can interrupt WiFi packets.
- Latency is high: The laws of physics prevent information from being transmitted faster than the speed of light. It takes 240 milliseconds (each direction) over satellite internet or 70 milliseconds (each direction) over fiber optic cable to send data from Sydney to Silicon Valley. In practice, there are many other sources of delay: slow routers, overloaded or shared links, inefficient message encodings or signaling, and transmission over slower media.
- Bandwidth is limited: Users in rural locations, in developing nations, on board aircraft or cruise ships will not be able to download large files. Residential high-speed broadband may also be overloaded when users are streaming video.
- The network is insecure: As I noted in Chapter 9, the network is extremely hostile. Nothing can be trusted. Even internal networks can be insecure, and attacks could come from authenticated users.
- Topology changes: The network is constantly changing. Your users may install software in new and unexpected configurations. End-users might change firewall configurations or switch from wired Ethernet to a mobile hotspot.
- Information technology departments are complex: End-users might not be able to change their network settings. In large corporations, there may be multiple people in charge of IT decisions and configuration.
- Overhead is everywhere: Every part of the networking stack adds overhead, for example when data is streamed to the operating system, when the operating system encodes the data into TCP/IP packets, when the network card encodes them into Ethernet frames and when signals propagate through the physical medium. Every layer incurs a penalty of additional headers and processing time for encoding and translation.
- The internet is heterogeneous: Users may run Linux, Windows, macOS, iOS or Android; Firefox, Chrome, Edge, Safari, Internet Explorer, Opera or the KaiOS Browser. They may have different network configurations and use VPNs, firewalls and proxies. Some devices may be faulty. Some devices may not strictly follow internet standards.
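One practical consequence of the first two violations is that every remote call in application code may be slow or may never return. The following is a minimal sketch of defensive client code, assuming a runtime that provides the fetch API and AbortSignal.timeout (recent browsers and Node.js); the timeout and retry policy are illustrative, not prescriptive.

```ts
// A sketch of defensive client code: the network is not reliable and
// latency is not zero, so bound each attempt and retry on failure.
// The timeout and retry policy below are illustrative placeholders.
async function fetchWithRetry(
  url: string,
  attempts = 3,
  timeoutMs = 2000
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      // Give up on any single attempt after timeoutMs milliseconds.
      const response = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
      if (response.ok) return response;
      lastError = new Error(`HTTP ${response.status}`);
    } catch (error) {
      // The request may have timed out or been lost entirely.
      lastError = error;
    }
    if (attempt < attempts) {
      // Back off before retrying so an overloaded server is not made worse.
      await new Promise((resolve) => setTimeout(resolve, attempt * 500));
    }
  }
  throw lastError;
}
```

Retrying like this is only appropriate when repeating the request causes no harm; a request retried after a lost response may otherwise be applied twice.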
Reliable systems from unreliable components
Users have come to expect the ability to access websites anytime and anywhere, and businesses today expect to offer 24×7 availability. [3]
The goal of 24×7 availability brings us to the most exciting problem in internet programming:
How can reliable systems be built when networks are unreliable?
Most machines stop working when an essential component is damaged or removed. For example, a fridge with a broken thermostat will stop cooling food.
However, it would be unacceptable for an entire web system to fail whenever a single server crashes. Network and server failures happen far too often for this to be sensible.
Instead, we should design our systems to avoid single points of failure. In a sense, this is the challenge faced by life itself: individual cells in our bodies continuously die and are replaced, yet we live on as people.
Is it possible?
Before discussing how to build reliable systems, it is important first to consider whether it is even possible to build reliable systems from unreliable parts.
Once we have established that such systems are possible, we can explore how to realize them.
The following is a simple (but highly impractical) scheme:
- Deploy two copies of the same application on separate servers (e.g., 1.example.com and 2.example.com).
- As a user, you must always open both URLs in separate windows.
- As a user, you must perform every action identically on both websites (e.g., you’ll create an account on each site and then upload the same messages to each website).
- If one of the servers crashes, then you telephone a system administrator. They replace the failed server with a copy of the server that did not fail.
Of course, very few users would have the patience to entertain this kind of impractical setup. However, it does suggest that there are ways to build reliable systems. As long as one server stays online, it should be possible for the administrators to restore the other system.
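No real user would carry out these steps by hand, but the duplication itself could be automated. The following is an illustrative sketch only: the host names and the /messages endpoint are hypothetical, and it ignores the data loss and corruption issues that Chapter 13 addresses. It shows a small client helper that sends every write to both copies of the application.

```ts
// An illustrative sketch of the "do everything twice" scheme.
// The host names and the /messages endpoint are hypothetical.
const REPLICAS = ["https://1.example.com", "https://2.example.com"];

// Send the same write to every copy of the application.
async function postToAllReplicas(path: string, body: unknown): Promise<void> {
  const results = await Promise.allSettled(
    REPLICAS.map((host) =>
      fetch(host + path, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(body),
      })
    )
  );
  // The data survives as long as at least one copy accepted the write;
  // a failed copy still needs to be repaired by an administrator later.
  const anySucceeded = results.some(
    (r) => r.status === "fulfilled" && r.value.ok
  );
  if (!anySucceeded) {
    throw new Error("Every copy failed to accept the write");
  }
}

// Example: record the same message on both servers.
await postToAllReplicas("/messages", { text: "hello" });
```

Note what the sketch does not solve: if one copy accepts a write that the other rejects, the two servers quietly drift apart. Keeping combined servers from losing or corrupting data is the subject of Chapter 13.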
Not only reliability
Reliability is not the only concern. Systems need to be practical, easy to use, fast and responsive. As much as possible, users should experience what they believe is a single, reliable, fast computer. The distributed system hides the fact that it has many unreliable and perhaps slow components.
Performance efficiency and reliability were among the quality characteristics discussed in Chapter 10. In this chapter and the following chapters, I will focus on the end-user experience of these quality characteristics. In particular, I will consider performance efficiency and reliability in terms of the following quantitative metrics:
- Throughput: the total number of requests handled over some duration (e.g., a website might have a throughput of 10,000 requests per second)
- Response time: the total time to respond to a single request (e.g., on average it might take 100ms to respond to a request)
- Latency: the amount of delay caused by the network (e.g., there might be a 140ms latency for requests between Sydney and Silicon Valley due to the speed of light)
- Scalability: how well the site grows with demand (e.g., if the same software can support more users by installing additional servers, then the site is scalable)
- Availability: the proportion of the time users can access the system (e.g., Google is said to routinely achieve 99.999% availability: less than 5.26 minutes of downtime per year [4], as the calculation after this list shows)
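The availability figure above is easy to verify with a little arithmetic. The following sketch (plain TypeScript; the list of availability levels is chosen only for illustration) converts an availability percentage into a yearly downtime budget and reproduces the 5.26-minute figure:

```ts
// Convert an availability percentage into a yearly downtime budget.
const MINUTES_PER_YEAR = 365.25 * 24 * 60; // 525,960 minutes

function downtimeMinutesPerYear(availabilityPercent: number): number {
  const unavailableFraction = 1 - availabilityPercent / 100;
  return MINUTES_PER_YEAR * unavailableFraction;
}

// Availability levels chosen only for illustration ("two to five nines").
for (const availability of [99, 99.9, 99.99, 99.999]) {
  console.log(
    `${availability}% available => ` +
      `${downtimeMinutesPerYear(availability).toFixed(2)} minutes of downtime per year`
  );
}
// Last line printed: 99.999% available => 5.26 minutes of downtime per year
```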
Distributed architectures
In this and the following two chapters, I will address how to build reliable systems on unreliable networks, in three parts:
- In this chapter, I will explore the architectures that allow multiple servers to be combined to appear as a single system (Chapter 12)
- Next, I will explore how to ensure that combining multiple servers does not result in data loss or corruption (Chapter 13)
- Finally, I will explore how to increase performance so that reliable does not mean slow (Chapter 14)