Multiply by a Million
I have spent over a decade designing and building online systems. Systems that need to run flawlessly with tens of thousands of people using them at any given time. In this environment one lesson you learn pretty quickly is that computers behave very differently with that many active users.
This isn't just true of computer systems, but systems in general. Whether you're stuck in traffic, queueing at a local coffee shop or sat in a hospital waiting room you're experiencing a system that is at or above capacity. A system that is starting to creak under the weight of the number of people using it.
Learning about how systems function can teach us a lot about the world we live in, because as I mentioned in this post - life as we know it is just a whole bunch of interconnected and interdependent systems. So what happens when systems break, and why should you care?
Push It To The Limit
On the 25th of June, 2009 I went to see a band called The Misfits live in concert. They are one of my all-time favourite bands and it was a great show. The reason I mention it is because something else significant happened that night. It was the night that Michael Jackson died.
In case you somehow don’t know, Michael Jackson was a pop music superstar (and alleged paedophile) and had been for decades. He was also a hugely controversial figure whose private life had been the subject of some pretty intense scrutiny. These things combined meant that millions of people all over the world stood up and paid attention when they heard the news.
It’s also worth noting that almost exactly two years earlier the iPhone had launched, meaning that we were two years into an internet connectivity revolution. Social media had also taken off at this point, with Facebook and Twitter already household names. The stage was set for millions of people to take to the internet, all clamouring for more information about the death of an icon.
Google News, Twitter, Facebook and MySpace, giants of the internet at the time, all experienced problems during this surge of activity. Google’s systems perceived the huge spike in traffic to be an attempt at hacking their servers. Facebook and MySpace users reported lag (the websites were functioning, but very slowly). During the peak of the social media frenzy Twitter saw double the usual number of tweets per second, bringing the site to its knees for a brief period.
There have been other events that have caused similar ripples in the internet continuum, but this is the most significant one that I can remember. The failure of these sites to gracefully handle this unexpected surge in traffic shows a general principle that is present in all systems:
Every system has a capacity and when that capacity is reached that system will cease to function normally.
Most of us actually experience this principle on a daily basis in the form of traffic. Road networks are systems that are capable of supporting a certain number of cars at any given time. If you go out for a drive in a city at 4am you will likely find the roads to be empty and your journey stress free. At this time the system is operating well under capacity. If, however, you were to make that same journey at 8:30am, right in the middle of rush hour, you would find something completely different. There would be queues of cars at every set of traffic lights, often accompanied by the sounds of car horns. The system at this point is operating at, or close to, capacity.
Just like a computer system, things can start to go wrong when the road system is under such load. The chance of an accident occurring increases significantly and if that accident blocks a lane on a busy road the traffic gets worse still. Eventually everything just grinds to a halt. This is equivalent to a website crashing. If you visit a website but the pages fail to load, or do load but very slowly, it is likely that the server is just too busy to handle your request. What’s more, your request to that server is adding to the problem. There were already too many people sending requests to that server and now there is one more. Remember, you're never stuck in traffic; you are traffic.
Finding your limits
When creating software to run on servers it is surprisingly difficult to test for these extreme scenarios. The normal development cycle involves writing code, testing that your code works and then writing more code until finally you have built an entire system. As you are building the system it will also be subjected to rigorous testing by a dedicated QA (Quality Assurance) team. This team may consist of as few as two people in a small company or hundreds in a larger one. They will do their best to break the system that you have created. When they find bugs (and they will) they create a ‘ticket’ for the development team, who will review and fix the issue.
Testing software is difficult because there are so many different combinations of actions and states that a computer can be in. QA teams need to try to test every possible scenario that the user may come up against. For example any piece of software that needs to take time into account will have to ensure that the software works as expected after our clocks have been adjusted for daylight saving. It is very easy to forget something like that (as I discussed in a previous post: Everything is more complicated than you think #1: Dates).
While QA teams do a great job of testing the core functionality of a software system, one thing they will struggle with is pushing a system to its limit. A team of, say, ten people could hardly be expected to generate the traffic of ten thousand people. For that we need computers. We need to generate load by creating what is essentially a simulation of a user. If the average user sends ten requests to our server in the space of one minute and then disconnects, then we need to write a program that sends ten requests to our server in a minute and then disconnects. Then we need to run that program thousands of times, simultaneously, meaning it will simulate thousands of concurrent users. Once we do this we will get a much more accurate picture of what might happen once we release our system out into the wild. We call this test a ‘load test’, and I have yet to see a server system gracefully handle the load on the first run. Usually the whole thing crashes and burns. From there we can review exactly what happened, fix the system and then try again.
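To make the idea of a simulated user a little more concrete, here is a minimal sketch in Python. It is not how any particular load testing tool works. The `FakeServer` class, its `capacity` setting and the tiny sleep that stands in for real work are all assumptions made up for illustration, and the one minute session is compressed into a few milliseconds so the example finishes quickly.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeServer:
    """Hypothetical stand-in for a real server. It can work on at most
    `capacity` requests at once and rejects anything beyond that."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0
        self.lock = threading.Lock()

    def handle_request(self):
        with self.lock:
            if self.in_flight >= self.capacity:
                return False  # over capacity: the request is rejected
            self.in_flight += 1
        try:
            time.sleep(0.001)  # pretend to do some work
            return True
        finally:
            with self.lock:
                self.in_flight -= 1


def simulated_user(server, requests_per_user=10):
    """One simulated user: send ten requests, then disconnect."""
    return [server.handle_request() for _ in range(requests_per_user)]


def run_load_test(server, users):
    """Run `users` simulated users at the same time, one thread each,
    and return how many requests were served and how many were rejected."""
    with ThreadPoolExecutor(max_workers=users) as pool:
        sessions = list(pool.map(lambda _: simulated_user(server), range(users)))
    outcomes = [ok for session in sessions for ok in session]
    return outcomes.count(True), outcomes.count(False)


if __name__ == "__main__":
    served, rejected = run_load_test(FakeServer(capacity=20), users=200)
    print(f"served={served} rejected={rejected}")
```

The exact split between served and rejected requests will vary from run to run, which is rather the point: the interesting behaviour only shows up once many users arrive at once.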
The exact same principles apply to real life. Think about it like this: A car driving at 60kph will handle a lot differently than a car going 200kph. The steering will get shaky and the entire car will become more sensitive to tiny little issues. A slight bump in the road at 200kph is a very different story to hitting that same bump at 60kph. The tires will also wear out a lot faster, as will the brake pads. As a result, you couldn’t just test a car at 60kph and call it quits, you need to push it to its limit. Once you know where the limit is you are better prepared to reduce the likelihood of catastrophic failure.
From a design point of view it is essential to be aware of your system’s capacity so that you can plan for huge spikes in traffic. When building a new road for example, you may create a three lane motorway despite there being very little traffic on that particular route. An upfront investment like this in the face of future increases in traffic is a lot more cost effective than building a small single lane road that will need to be expanded ten years down the line, while it is already operational.
We need to look at our actions today and the systems we are building now, but also think about the future. The ironic thing is that a system that successfully achieves its goal will likely suffer from capacity issues due to its success. If the road authority in any given country builds a road that cuts people’s commute time in half, people will of course use that road. The road has done what it was designed to do. The problem is that now there is so much traffic on the road that the commute time is the same as the old alternate route.
I have seen this happen first hand in Dublin, Ireland. There is a motorway that circles the city called the M50. When it first opened it eliminated much of the traffic from the city, where bottlenecks frequently occurred as people tried to get from north to south or vice versa. Before long, however, as more and more people started using the M50, their commute times steadily increased. It wasn’t long before people began referring to it as the country's largest car park. Once this system hit capacity extensive construction began in order to add an additional lane and make the on/off ramps flow better. This caused even more traffic and greater disruption. Once the work was completed the cars began to flow like water once again. A few years later it was back to being a car park. This system struggles as a consequence of its success.
We see the same thing in computers all the time. Imagine you built a website that crashed due to heavy traffic. An optimist might call this a good problem. At least people are using your website, right?
The Problem of Potential Users
Most of the time our systems just seem to tick along nicely, but working as a server engineer has left me wary of these times of relative calm. I am constantly wondering ‘what if’? What if ten thousand people log in at the same time? What if five million people create accounts? Will we have enough space to store all of that information? What if there is some problem with the servers, how will we be able to handle the support tickets from that many people? I am essentially worrying about one thing here, and I call it ‘potential users’. The online systems I have worked on have generally had a fairly predictable stream of activity, but what about the millions of people out there that could potentially become users of the system at any given time?
We can illustrate the problem of ‘potential users’ by looking at a very interesting phenomenon known as the ‘TV Pickup’.
When watching a particularly engrossing show or intense sports event on your TV, do you ever wait until the ad break, or even the end, to use the toilet or get yourself a snack? Well so do millions of other people. In the UK it is a regular occurrence and the electrical and sewerage companies need to account for it. On the 4th of July, 1990 England faced West Germany in the World Cup semi-final. The game ended with a penalty shootout, with a large percentage of the English population on the edge of their seats. As soon as it was all over millions of people got up to use the toilet, grab something from the fridge and turn on their kettle for a hard earned cup of tea. This caused surges at water and sewerage pumping stations and imposed a 2800 megawatt demand on the electrical network. Whenever some significant event is televised the nation tunes in and once it is over they all head for the toilet and the fridge.
Operators of these systems need to keep this in mind and have even designed systems to monitor their systems. They look at historical data to try and predict when surges will occur and keep up to date with the latest TV shows in order to anticipate when some climactic event might occur that draws the attention of the nation. It is usually soap operas, sporting events and reality TV that cause this, though the effect has been somewhat in decline in recent times as more channels become available and improved technology allows live TV to be paused or recorded. There are, however, other events that these engineers need to be mindful of too, such as an eclipse. The entire population is a group of potential users, but sometimes they all become active users at the same time.
The ‘TV Pickup’ in the UK is a national example of this, but what about globally? The internet connects the entire world, so we have to view the entire global population as potential users of the system. One trend that has emerged recently in the online world is the concept of ‘Going Viral’. This usually consists of a video that is being circulated between friends that suddenly explodes in popularity. Millions of people are exposed to this content as it is shared around social media between friends. I might share it with one friend, who shares it with ten friends and then each of them shares it with ten friends and so on. As humans it can be difficult for us to grasp large numbers and how quickly something like this can spread. Let's do some maths.
Let's say I share a video with two people and each of those people shares it with two different people, and each of those people shares it with two people and so on. Now let's consider each group of people sharing as a single ‘step’, with me as the first step. Provided everyone that shares the video shares it with someone that has not seen it before we will have surpassed the entire world population after just 33 steps. This seems crazy but it is simple addition and is easy to test. Start with me, then add two people, for each of those two people add two more and so on, which looks like this: 1 + 2 + 4 + 8 + 16… By the time you get to 30 steps the video will have reached around one billion people. Every step after that roughly doubles the total: around two billion after 31 steps, four billion after 32, and more than eight billion after 33.
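The addition is easy to check with a few lines of Python. This is only a sketch of the sum described above, under two assumptions of mine: the original sharer counts as the first step, and the world population is taken as roughly 7.6 billion. The function name `steps_to_reach` is made up for the example.

```python
def steps_to_reach(target, shares_per_person=2):
    """Count the steps until the running total reaches `target`.

    Step 1 is the original sharer alone. Every later step is
    `shares_per_person` times larger than the step before it.
    """
    total = 0
    new_people = 1
    steps = 0
    while total < target:
        total += new_people
        new_people *= shares_per_person
        steps += 1
    return steps, total


if __name__ == "__main__":
    print(steps_to_reach(1_000_000_000))  # (30, 1073741823)
    print(steps_to_reach(7_600_000_000))  # (33, 8589934591)
```

Changing `shares_per_person` to ten, as in the friends example, shows how much faster things move when each person shares more widely.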
There is an ancient Indian legend that beautifully illustrates this principle. According to the legend a local king who really enjoyed playing chess would frequently challenge people to a game. One day a traveling sage passed through his kingdom and was promptly challenged by the zealous king. He offered the traveller any reward he could name if he won. The sage, a seemingly modest man, simply asked for a few grains of rice. If he won, the king was to place a single grain of rice on the first square of the chess board, then two on the second square, four on the third and so on. The wise traveller went on to win the game and so the king, true to his word, asked his servants to fetch a bag of rice so that he could give the traveller his reward. He started to lay out the grains on the chess board but quickly realized that he had a problem. By the time he got to the final square he would have to provide over 18,000,000,000,000,000,000 grains of rice, or more than 200 billion tons - more rice than the earth is capable of producing. The traveling chess player went on to reveal that he was actually Lord Krishna (a Hindu deity) and told the king that he could pay his debt over time. To this day it is still being repaid at the feast of Paal Paysam, celebrated every year.
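The king's problem can be checked the same way. The short Python sketch below simply adds up the grains square by square, and the variable names are my own. It shows that the figure of over 18 quintillion is the total for the whole board, with the final square alone accounting for about half of it.

```python
SQUARES = 64

# One grain on the first square, doubling on every square after that.
grains_per_square = [2 ** i for i in range(SQUARES)]
total_grains = sum(grains_per_square)

if __name__ == "__main__":
    print(f"{total_grains:,}")           # 18,446,744,073,709,551,615
    print(f"{grains_per_square[-1]:,}")  # 9,223,372,036,854,775,808
```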
It is naturally very difficult for humans to think at this sort of scale. You have likely never seen a crowd of one million people. I know I haven’t, though I have been at a music festival where the numbers exceeded seventy five thousand people, and that crowd felt immense. Once the festival finished all of these people left the venue at the same time, descending upon the local infrastructure and pushing it to breaking point. Traffic needed to be controlled by police and caused delays for the poor innocent commuters who knew nothing of the festival. The public transport system couldn’t keep up with the volume of people. There were also bins overflowing with rubbish and long queues for the portable toilets. These systems just weren’t built to manage seventy five thousand people all at once.
As server engineers we have ways to handle this. It usually involves monitoring traffic and trying to react if we notice a spike in that traffic, just like the engineers in the electrical plants and sewerage stations we talked about earlier. If we notice that one of our servers is struggling we may decide to create an additional server and start sending some of the requests to it. Some websites will actually consist of hundreds of servers with incoming traffic being distributed evenly among them.
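One simple way to distribute traffic evenly is to hand each incoming request to the next server in line, often called round robin. The Python sketch below illustrates that one strategy only; real load balancers are considerably more sophisticated, and the class name, method names and server names here are all hypothetical.

```python
from collections import Counter


class RoundRobinBalancer:
    """Hand each request to the next server in the list, wrapping
    around to the first server once the end is reached."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._index = 0

    def add_server(self, name):
        """Add capacity: the new server joins the rotation."""
        self.servers.append(name)

    def route(self, request):
        """Return the name of the server that should handle `request`."""
        server = self.servers[self._index % len(self.servers)]
        self._index += 1
        return server


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["server-1", "server-2"])
    print(Counter(balancer.route(n) for n in range(100)))

    # One server is struggling, so we add another and keep routing.
    balancer.add_server("server-3")
    print(Counter(balancer.route(n) for n in range(300)))
```

With two servers each one receives half of the requests. Once the third is added, each server's share of new requests drops to a third.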
This is how we solve capacity problems in general - we add more capacity. For websites that would involve adding more servers. For our road networks we add more roads, or expand the existing ones. We can also optimize the existing systems, make our servers run faster with improved technology and software. To deal with traffic jams on our roads we could invest in traffic light systems that better optimize the flow of traffic. All of these solutions come at a cost however. Upgrading our servers to be more efficient will cost time and money, while adding more of them also increases the cost of running them. Same goes for our roads. A time and energy investment is needed, but that’s not all. We are also limited geographically.
There is only so much space for roads in any given country. Ireland, for example, is a small island. We need space for houses, businesses and recreation. Ideally we would also want to maintain some of the scenic beauty that earned our little island the nickname ‘The Emerald Isle’. This means we are dealing with a finite resource in terms of our ability to address road traffic. As we will see, everything we do relies on a finite resource, as the internet isn’t the only system that (potentially) connects every single person on the planet. We are all connected by the fact that we live on this rock.
It can be fun to think about the interesting ways in which humans are globally connected. For example, the very same particles of air that I exhale from my body today could conceivably be inhaled by someone in Australia in a year's time. However, once we start down this road we quickly see that there are some pretty scary implications to the fact that we all share this planet. In a study funded by NASA in 2014, scientists concluded that the excessive pollution in China, a country that experienced a meteoric rise in economic production and the pollution associated with it, can have a very real effect on the weather experienced by other countries. As consumers our demand for cheaper products has fuelled this rise in pollution. Simply put, the social systems in one country can affect economic systems in another country, which in turn affect weather systems globally.
This is important because we rely on nature to sustain us, though it can be hard to appreciate that fact because we have created systems that put distance between us and the natural world. It used to be the case that humankind’s primary activity was seeking food and shelter. In modern society these concerns have become secondary. We have systems in place that deliver food to us without us having to think about where it came from or how it was produced. We now have a deep understanding of how to procure food through this man made system, but not through nature, the core system. The same goes for the clothes on our backs and the shoes on our feet. We build robust shelters that we can depend on year after year, though we generally don’t build them ourselves or worry about where the materials came from.
Creating these systems on top of the natural systems we have always depended on does not replace those systems, it augments them. We have simply become more efficient at taking what nature has to offer and distributing it amongst ourselves. This gives us easier access to the world’s natural resources, meaning more and more potential users become active users and the core underlying system gets pushed closer to capacity.
I named this blog post ‘Multiply by a Million’ because it is somewhat of a mantra that I repeat to myself on a daily basis. Whenever I do anything that affects anyone or anything other than myself I like to think ‘what if a million people were to do this’, just like I do when I build server systems. I am asking myself whether or not the system I am interacting with could cope with millions of people using it the way I do.
This idea can be applied to simple everyday tasks, things we do habitually with little or no thought. For example, do you ever leave the tap running when brushing your teeth? This might not waste all that much water when you do it, maybe three litres or so if you spend two minutes brushing. But what if one million people all did this twice a day? At three litres per person that would be six million litres of water a day wasted. If you live in a country with a modern water system then all that water has most likely been treated, which consumes energy - more waste, and an important observation:
Systems do not live in isolation; they are both composed of smaller systems and components of larger systems.
Oftentimes the connections between systems and the effects they have on each other are only visible when operating at a large scale. In our previous example the impact of one person leaving the water running while brushing their teeth is negligible, but once we scale up to a million people it is far more noticeable. What if we take this a step further and start thinking on a global scale?
The advent of modern technology has made the world feel a lot smaller. Traveling across the Atlantic Ocean from Europe to America used to be an arduous boat trip that took weeks whereas now it takes a few hours by airplane and is far less dangerous. We can communicate instantly over the internet and buy products online from any country in the world. Despite all of this it is still very difficult for us to think about how the actions of millions can combine on a global scale, often with devastating effect.
At the time of writing, the world population is sitting at somewhere near 7.6 billion people. The United Nations has predicted that this number will be closer to 10 billion by 2055, which is a staggering number of people, and enough ‘traffic’ to have an unpredictable effect on our environmental systems. So in what way do our actions as individuals quickly become magnified due to the fact that there are so many of us and what are the consequences of those actions?
Firstly, it's probably worth focusing on the things that we cannot live without:
Heat (for those of us in colder climates)
Since we cannot live without these things it would be wise to consider just how sustainable the underlying systems are. The answer, unfortunately, is not very.
Since 1960 global meat production has more than trebled. At the same time milk production has doubled and egg production quadrupled. This can be attributed largely to both the global rise in population and a general decrease in poverty. Basically a large number of potential users have become active users. It is estimated that around 30% of the entire planet’s land surface (not including ice covered land) is currently dedicated to the raising of livestock. As the global population increases the competition for land and water to support this system will also increase. Even now, huge swaths of rain forest in the Amazon, also known as the lungs of the world due to the clean air it produces, are being cleared to make way for grazing land for cattle. It is also worth noting that demand is steadily increasing in Asia, where the diet has traditionally consisted of a lot less meat but is now shifting more towards the western diet. This market is huge, with a population of over four billion. A lot of potential users.
We know that we cannot add capacity to planet earth; it is finite. Perhaps one day if we manage to colonize Mars we can increase the overall land mass available to us, but right now this isn’t an option. So what are the alternatives? One possible solution is to try and increase the amount of meat produced per square meter, or ‘factory farming’ as it is currently known. This raises certain ethical questions as well as very real health concerns. Cramming animals into tiny spaces is certainly cruel, but also increases the likelihood of disease spreading between those animals. This means that the animals need to be given antibiotics in order to keep them healthy. Exposure to these medicines has a side effect, however: it increases the bacteria’s resistance to the treatment itself, resulting in new strains that we do not yet know how to treat. This can be devastating to livestock and even transfer to the human population.
This problem isn’t exclusive to the food industry either. The fashion industry has a similar impact on our environment. In recent times we have seen what is known as ‘fast fashion’ become prevalent in our society. It used to be the case that there were four seasons in the year when it came to fashion. Clothing lines would generally be released for these seasons. Now however there is a constant churn of new lines of clothing being manufactured for low prices in developing countries and being shipped all around the world. These factories produce large amounts of waste in the process, affecting the local environment.
The same can even be said for the technology industry. The demand for low cost electronics has increased massively over the past couple of decades. Take a brief walk around any major city and you will likely see hundreds of people all using smart phones. The manufacturing of these devices has a real environmental cost, often requiring precious metals, which are mined at a great cost to the environment.
The transportation industries that facilitate all of these commodities pump thousands of tons worth of carbon into our atmosphere, which is widely accepted by scientists to be a leading cause of climate change. Climate change, of course, affects everyone. As the sea temperature rises our weather becomes more severe and unpredictable. This will likely increase pressure on the food production industries and lead to droughts all around the world. These can be major contributing factors to the outbreak of violence and even war, as people struggle to survive.
This all seems pretty bleak, and rightly so. This is the effect of ‘multiply by a million’ but on an even larger scale. The only real solution is for each and every one of us to hold ourselves accountable for our actions. If we eat red meat every day while also buying new cheap clothes weekly in order to keep up with the latest fashion trends we are pushing our environment, the system that we are all a part of, to capacity. In the modern world we are actively trying to lead lives of convenience, but the cost of this convenience is the huge volume of waste. The planet is a system, just like the internet, just like our hospitals and road network, and it too has a capacity. There will come a point where we have to ask ourselves just how many people can the planet support at the current levels of consumption?
There is an old adage that says “There’s strength in numbers”. As we have seen there is also great destructive power in numbers, but the potential for good is there too. In the last section we painted a pretty bleak picture of the future of humankind on this planet, but there is hope. The ‘multiply by a million’ effect can also work positively. If you reduce your personal consumption, get educated about sustainable sources of nutritional food, reduce consumption of electronics and cheap convenient products it may not reverse the effects of climate change and undo the damage done to our environment, but what if a million people do, or a billion?
There is something called ‘distributed computing’ that shows what can be done when millions of people chip in together. This basically involves people ‘donating’ small amounts of their CPU to research projects. When our phones are sitting in our pockets they aren’t doing much, same goes for our computers when we are simply browsing the internet or listening to music. There is a lot more power there that can be used to perform complex calculations in aid of scientific research. Some research groups have decided to exploit this fact by creating distributed computing projects that allow you to run a small program in the background that performs small calculations, which in turn become part of much larger calculations. Projects include cancer research, climate change modelling, investigation of greener power sources and even running simulations at CERN (The European Organization for Nuclear Research).
In this distributed computing system users are only contributing a tiny amount of their computer's resources to a much larger project. When combined together we can achieve immense levels of computational power, leading to real advancement for everyone. What your device does itself may be inconsequential, but when combined with millions of other devices it is literally changing the world.
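The pattern of splitting one large calculation into many small ones can be sketched in a few lines of Python. This is a toy illustration rather than how any real project is built: the ‘volunteers’ are just threads on one machine instead of devices spread across the internet, counting prime numbers stands in for the scientific workload, and every function name here is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(n):
    """Trial division: slow, but enough for an illustration."""
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


def count_primes(start, stop):
    """The small calculation a single volunteer device performs."""
    return sum(1 for n in range(start, stop) if is_prime(n))


def split_into_work_units(limit, unit_size):
    """Cut the range 0..limit into small, independent pieces."""
    return [(start, min(start + unit_size, limit))
            for start in range(0, limit, unit_size)]


def run_project(limit, unit_size, volunteers):
    """Hand the work units to the volunteers and combine their answers."""
    units = split_into_work_units(limit, unit_size)
    with ThreadPoolExecutor(max_workers=volunteers) as pool:
        partial_counts = pool.map(lambda unit: count_primes(*unit), units)
        return sum(partial_counts)


if __name__ == "__main__":
    print(run_project(limit=10_000, unit_size=500, volunteers=8))  # 1229
```

Each work unit is tiny and knows nothing about the others, yet adding the partial answers together gives the same result as doing the whole job in one place.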
What does it all mean
Computers are often used for running simulations. We can simulate the laws of physics, or chemical reactions. Weather systems are tracked and the paths they might take predicted. However computers can themselves be used as a simulation for systems in general. We can study them and apply general principles to other systems. We cannot push the environment in which we live to the point of collapse in order to learn where that point is, but we can learn the general underlying rules that govern all systems and use those rules to make smart choices.
Regardless of the type of system we are looking at, be it the internet, health care, traffic or nature we must accept the fact that the decisions we make today will have far reaching consequences for the future users of those systems.
The internet today is burdened by the fact that when it was designed no one had anticipated the monster that it would grow to become. We cannot change this but we can learn from it.
Our hospitals need to be prepared for the sudden outbreak of an epidemic, or risk collapse and chaos if it happens.
Our roads need to account not just for the number of cars on the road today but also for the number of cars that may be there in ten years, or risk incurring the huge cost of expansion.
And finally nature, the system on which all other systems depend.
The decisions we make today may not affect us drastically, but they will certainly affect the next generation and the generation after that.
It’s also worth remembering that systems rarely fail gracefully or gently; more often than not the failure is catastrophic and hard to recover from.