What a Systems Administrator Does When the Entire System Goes Down

Our dependence on technology is stronger than ever, and that has consequences: when services stop working, and all of them do sooner or later, chaos follows. The impact of losing Gmail or WhatsApp is enormous, but the same is true, if not more so, when the servers that go down belong to Amazon or Microsoft.

How can companies of this size suffer failures like these? All of them have intricate infrastructure in which redundancy and high availability are recurring concerns. Even so, something always goes wrong. How do these companies try to avoid outages, and how do they act when one happens anyway?

Believe us: WhatsApp going down is not the problem

The truth is that failures are more common than any of us might imagine. IEEE Spectrum has spent more than a decade tracking failures in all kinds of computer systems, and last year it published an extensive series of reports that tried to draw conclusions from them.

One of those conclusions concerns the enormous sums spent on projects that fail, or are never even deployed, after a huge investment. The United States Department of Defense spent 1 billion dollars and 12 years on a human resources system for its armed forces (PDF), but far more gruesome was the 18 billion dollars the United States Army spent on Future Combat Systems (FCS), a program meant to modernize its systems that was canceled once it became clear that spending would soar past 200 billion dollars.

The security of any system is as strong as its weakest link

IEEE Spectrum's conclusions after that decade were clear: “IT systems are becoming increasingly complex and large (in terms of data captured, stored and manipulated), which means that they are not only becoming increasingly difficult and costly to develop, but also more complicated to manage”. The complexity of these systems also stems from another of technology's great constants, its continual evolution and improvement:

Over the past decade we have seen major efforts to modernize IT in industry, banking, finance, aviation and especially government, usually aimed at replacing legacy IT systems deployed in the 80s and 90s, if not earlier. Many of these efforts sought to replace multiple disparate IT systems with a single system, which has traditionally proven far more difficult, both technically and managerially, and far more expensive than anticipated.

The excellent infographics that IEEE Spectrum has produced look beyond the failures we mentioned at the start of this story: we complain when WhatsApp goes down or when we suddenly cannot reach our Gmail inbox, but those problems are almost anecdotal compared with failures in infrastructure that our lives depend on, such as power grids, transport networks (a minor problem at an airport can have catastrophic consequences) or anything that affects the health sector.

Cyberattacks are not the only thing to guard against

On August 27, 2013, China lost internet access for four hours. The problem was not (only) human error: it was one of the largest denial-of-service attacks in that country's “digital” history. Protection against cyberattacks is obviously one of the most important parts of any company's technology infrastructure.

Administrators take many measures to prevent this kind of problem, but the truth is that theft of passwords and sensitive data is increasingly common across all kinds of services. The security of any system is only as strong as its weakest link, as the saying goes, and that is an absolute truth in this field, where any weakness can be found and exploited by attackers with enough motivation and patience.

Even in those cases, as we will see later, we are not always given a particularly detailed explanation of what happened. The explanation of how a hacker or cracker gained access to certain systems often comes from outside experts, not from the companies that suffered the attack: acknowledging and explaining mistakes rarely earns good press with the public in general or with customers in particular, but it is something the computer security community is trying to encourage companies to do openly.

There are two types of companies: those that have been hacked and those that have been hacked but do not know it yet

Examples abound, and there are excellent collections of these large data thefts, such as the one compiled by Information is Beautiful. There you can find more data on the 164 million passwords stolen from MySpace, the 145 million from eBay or the far more shocking case of Mossack Fonseca, and see that these intrusions almost always exploited mistakes made by the people who manage these systems.

In the case of Mossack Fonseca, the mistake was running an old version of Drupal and an out-of-date WordPress plugin on two of its websites: something as seemingly trivial as updating both platforms would have avoided a scandal that has implicated thousands of companies and individuals. In reality it is just one example, albeit a very striking one, of an increasingly common situation. There is already an old saying in computer security: there are two types of companies, those that have been hacked and those that have been hacked but do not know it yet.

Defending against these attacks is one of the most serious problems these experts face. They deploy all kinds of measures to guard against fearsome denial-of-service attacks (which, if large enough, are hard to withstand) and against the exploitation of vulnerabilities that are very difficult to track in the necessary detail. Yet that is only part of the equation: firms must also face risks such as natural disasters, as Amazon knows well, and the ever-present human errors and oversights.

What many cloud providers have done is try to report on the state of their services, so that users can at least check whether there really is a problem. In some cases there are status dashboards hosted completely separately from the services they report on (Amazon is a good example, and Google offers that information too), and there are even “meta-dashboards” such as DownDetector or IsItDownRightNow that bring together information on many services, complementing what some companies publish through channels such as Twitter.
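As a rough illustration of the meta-dashboard idea, the sketch below combines status readings from several independent sources into one verdict per service. It is not how DownDetector or IsItDownRightNow actually work: the function name `aggregate_status`, the `"up"`/`"down"` labels and the majority rule are all assumptions made for this example.

```python
from collections import Counter


def aggregate_status(reports: dict[str, list[str]]) -> dict[str, str]:
    """Combine per-source readings ("up" or "down") into one verdict per service.

    Hypothetical rule: a service counts as "down" only when a strict
    majority of its sources say so; with no readings it is "unknown".
    """
    summary = {}
    for service, statuses in reports.items():
        if not statuses:
            summary[service] = "unknown"
            continue
        down_votes = Counter(statuses)["down"]
        summary[service] = "down" if down_votes * 2 > len(statuses) else "up"
    return summary


if __name__ == "__main__":
    # Made-up readings for three made-up services.
    readings = {
        "mail": ["up", "down", "down"],
        "chat": ["up", "up", "down"],
        "store": [],
    }
    print(aggregate_status(readings))
```

Requiring a majority means a single faulty probe cannot declare an outage by itself, which is one reason to host status information separately from the service it describes.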

Secrecy: silence does not mean consent

When we began preparing this story we ran into a hard reality: companies do not talk openly about these issues, because the information is highly confidential. Todd Hoff, who runs High Scalability, one of the reference blogs in this field, told us that “it is very hard to get information on this subject: much of it is the ‘secret sauce’ of those companies, and the security implications are huge”.

This is what an outage looks like: in this case, the Steam outage of May 7. Source: ThousandEyes.

Take Microsoft, for example, one of the world's most important technology companies and one with an especially complex infrastructure. Its representatives could not give specifics, and instead offered us an overview that conveys the complexity of these systems. Among the key data:

Companies do not talk openly about these issues, because the information is highly confidential

Microsoft has more than 100 data centers in more than 40 countries. Their locations are not public, in order to “ensure high security for our customers' data”. Those data centers host more than 200 cloud services used by “more than 1 billion customers in 90 markets […] Microsoft data centers have been set up to deliver 99.999 percent availability to meet the SLA and service needs of our customers”.
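To put the 99.999 percent figure in perspective, an availability target can be converted into the downtime it allows per year. The sketch below is plain arithmetic rather than anything Microsoft has published; the function name `downtime_minutes_per_year` is made up for the example, and it assumes a 365-day year with availability measured over the whole year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {downtime_minutes_per_year(target):.2f} min/year")
```

Under these assumptions, "five nines" leaves a budget of a little over five minutes of downtime per year, compared with almost nine hours for 99.9 percent.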

The Redmond company manages those services through the Microsoft Operations Center (MOC), and has geo-replicated facilities in California, Redmond and India that can “bring up” all its services “for instant availability and continuity of the services in the event of natural disasters or calamities”. There is more data on Microsoft's informational website, but the reality is that key details are scarce. They are, as Hoff told us, the secret sauce of its operational capability in the cloud.

The situation Microsoft describes extends to other companies, and it also often happens that when a company's systems go down, information about the outage and how it was resolved is limited. This is a constant complaint from the experts at IEEE Spectrum, who criticize the lack of “post-mortems”: detailed “public autopsies” of what happened that discuss what triggered the problem.

On GitHub, for example, there is a compilation of some of the internal failure analyses of incidents that technology companies have suffered in recent years. Fortunately, here we do find the autopsies that let us learn what the problem was. It is striking how often these failures turn out to be simple configuration errors that went undetected during deployment and then took longer than expected to find.

Open Compute Project, a good step on a long road

It is true that some companies are sharing more and more interesting information, releasing both data and projects that are key to their infrastructure. The aforementioned Microsoft, Google and Facebook share more and better, and Facebook's case is particularly notable because Mark Zuckerberg's company is behind the so-called Open Compute Project (OCP).

This initiative shares data center product designs and tries to communicate best practices and know-how of all kinds, so that businesses of every type can access scalable, highly efficient and highly available solutions. Facebook uses data centers that are 100% “certifiable” by OCP, and the effort has been joined by Intel, Google, Apple, Microsoft, Rackspace, Ericsson, Cisco, Juniper Networks, Goldman Sachs, Fidelity, Lenovo and Bank of America.

That does not seem to be enough, of course. There is pointed criticism of a project which, according to some experts, has created fragmentation in the industry by encouraging yet other standards, and has meant that vendors of “OCP-certified” hardware have limited reach, aimed primarily at large companies with their own engineering teams.

The certification process has also been criticized, and there still seem to be many loose ends to resolve, such as those concerning data integrity. The project continues to advance, of course, and although there are already clear benefits (the industry needs reliable standards, and OCP is trying to provide them), there is obviously a long way to go.