Five Considerations For Large Scale Systems

Introduction

With the growth of the Internet, and of connected networks in general, the development and deployment of large scale systems have become increasingly common. A large scale system is one that supports multiple simultaneous users who access its core functionality through some kind of network. In this sense, such systems are very different from the historically typical application, generally deployed on CD, where the entire application runs on the target computer.

The benefits of using a three tier architecture for large scale system design are generally well known. A three tier system is separated – both logically and physically – into:

    • The client tier, responsible for interacting with the user via a Graphical User Interface (GUI) and submitting requests via the network to the mid tier.

    • The mid tier, responsible for gathering requests from clients and executing transactions against the data tier.

    • The data tier, responsible for physical storage and manipulation of the information represented by application queries and the responses to those queries.

A diagram of such an architecture is shown in Figure 1. There are many popular choices for implementation at each tier. One particularly common choice for the client tier is the use of a browser, such as Microsoft Internet Explorer or Netscape Navigator. In this case, the mid tier is responsible for producing HTML. If an alternate client tier technology is chosen, some other format may be used. The Extensible Markup Language (XML) is often employed, as support tools are widely available on many platforms. The data tier is almost always some sort of relational database engine, such as the Oracle RDBMS or Microsoft SQL Server.

Figure 1 - Typical Three Tier Architecture
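
To make the division of labor concrete, the following minimal sketch (written in Python, using the standard http.server module) shows a mid tier request handler that accepts a request from the client tier and returns an XML response. The port number and payload are purely illustrative; a production mid tier would normally be built on a full application server, but the responsibility is the same.

    # Minimal sketch of a mid tier request handler (hypothetical example).
    # It accepts HTTP requests from the client tier and returns XML that a
    # browser-based or custom client could render.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class MidTierHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # In a real system this is where business logic would run and
            # the data tier would be queried; here we return a canned response.
            body = b"<response><status>ok</status></response>"
            self.send_response(200)
            self.send_header("Content-Type", "text/xml")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), MidTierHandler).serve_forever()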

The benefits of a three tier architecture include:

    • Centralization of business logic for ease of maintenance

    • Separation of user interface logic from data access logic

    • The ability to spread work over several machines (load balancing)

    • When the client tier is a browser, independence from the platform used to execute user interface logic, giving the application a broader reach

Given that more and more projects are developing large scale systems, and given that this architectural framework is so widely employed, why do many large scale development efforts fail? The answer, of course, is that an architecture is not enough – understanding requirements, employing appropriate technology, hiring skilled developers, and many other factors also contribute to the success or failure of a particular effort.

No single document could hope to address all pertinent factors in such a way as to ensure success for those who followed its advice. However, there is still a broad set of guidelines that – when kept in mind by system designers throughout the product cycle – can greatly increase the chances of a project meeting its design goals.

This whitepaper will present just such a set of guidelines. We will try to avoid focusing on any particular technology, instead presenting a broad outline that should apply to the technologies of any vendor.

Aspects of Successful Large Scale System Design

While nothing can guarantee the success of any project, there are five factors that – when kept in mind throughout design and implementation – can help system architects ensure that they haven’t overlooked something important. These five are:

    • Scalability

    • Availability

    • Manageability

    • Security

    • Development practices

Let’s examine each of these in detail.

Scalability

A system is said to be scalable if it can handle an increased load without redesign. The lion’s share of the cost of system development is usually labor, so being able to adjust to increasing load without having to rewrite the system every time ten new users are added is a crucial feature.

Scalable systems grow to adjust to increasing loads by adding hardware. In the mid-tier, for example, we make use of load balancing technologies, such as Cisco LocalDirector or the F5 BIG-IP Controller. A load balancer provides scalability by distributing requests across several machines, so that each handles only a percentage of the load. This is shown in Figure 2.

Figure 2 - Load Balancing
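
The core decision a load balancer makes can be sketched in a few lines of Python. The machine names and connection counts below are invented for illustration; commercial products track this state in the network layer, but the principle is the same: send each incoming request to the least busy machine.

    # Sketch of least-busy request distribution. Machine names and connection
    # counts are invented; real load balancers track this state in the network layer.
    active_connections = {"web01": 12, "web02": 7, "web03": 9}

    def forward(request, machine):
        # Stand-in for sending the request over the network to the chosen machine.
        return f"{machine} handled {request}"

    def dispatch(request):
        # Send each request to the machine currently handling the fewest requests.
        machine = min(active_connections, key=active_connections.get)
        active_connections[machine] += 1      # request is now in flight
        try:
            return forward(request, machine)
        finally:
            active_connections[machine] -= 1  # request has completed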

Load balancing must be taken into consideration during design in order to achieve scalability. For example, mid-tier logic that stores information in memory on a particular machine across requests cannot be efficiently load balanced, since subsequent requests must return to the same machine to access that information. This situation is shown in Figure 3, and while the ability to direct a particular client to the same machine request after request is generally supported by commercially available load balancing products, designs that require this behavior should be avoided.

Figure 3 - Sticky Sessions

To understand why, consider a trip to your neighborhood grocery store. At checkout time, you generally choose the lane with the line you feel will take the shortest amount of time. In the same manner, a good load balancer will direct an incoming request to the least busy available machine.

If, however, your grocery store remembered which lane you checked out in last time, and required that you use that same lane every time you visited the store, you would quickly take your business elsewhere. After all, what if the line in your assigned lane was the longest in the store but other lanes were completely empty? And what if it were closed entirely? This is clearly less efficient, yet it is exactly analogous to what happens in designs that require all requests from a particular client to be directed to the same physical machine.
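
One common way to avoid this situation is to keep per-user state out of mid-tier memory entirely and place it in a store that every mid-tier machine can reach, such as the data tier or a dedicated session store. The sketch below assumes a hypothetical SharedSessionStore class; the point is simply that any machine can service any request, because the state travels with the session ID rather than with the machine.

    # Sketch of a stateless mid tier: per-user state lives in a shared store
    # (the SharedSessionStore class is hypothetical), so any machine can
    # handle any request and the load balancer is free to pick the least busy one.
    class SharedSessionStore:
        """Stand-in for a database table or distributed cache reachable by
        every mid-tier machine."""
        def __init__(self):
            self._data = {}

        def load(self, session_id):
            return self._data.get(session_id, {})

        def save(self, session_id, state):
            self._data[session_id] = state

    store = SharedSessionStore()

    def handle_add_to_cart(session_id, item):
        # No state is kept in this process between requests.
        state = store.load(session_id)
        cart = state.setdefault("cart", [])
        cart.append(item)
        store.save(session_id, state)
        return len(cart)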

Scalability in the data tier is achieved in an entirely different manner. Because the whole point of the data tier is to manage information, simply splitting requests across more than one machine doesn’t help us at all. If we did that, we’d have to maintain all of the data on each machine, which – while possible – leads to enormous complications in trying to keep each copy in sync with the others.

Large scale designs should attempt to make use of partitioning to achieve scalability in the data tier. Partitioning is putting a subset of the data on each of several machines according to some set of rules, and then using those same rules to route requests from the mid-tier machines to the appropriate data store. This is shown in Figure 4.

Figure 4 - Partitioning

Sometimes partitioning is straightforward – for example, sales data can easily be divided by region – but there are cases where effective partitioning can be very difficult. Although some database vendors (Oracle and Microsoft, for example) are beginning to provide support for automatic partitioning, this can still be a challenging area to address in system design.
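
The routing rule itself can be quite simple. The sketch below routes requests to a data store based on region; the region names and server names are invented for illustration, and in practice the rules might live in configuration or be provided by the database vendor's partitioning support.

    # Sketch of routing mid-tier requests to a partitioned data tier.
    # The partitioning rule here is "divide sales data by region"; the
    # region and server names are hypothetical.
    PARTITIONS = {
        "north": "db-north.example.com",
        "south": "db-south.example.com",
        "east": "db-east.example.com",
        "west": "db-west.example.com",
    }

    def data_store_for(region):
        # The same rule that decided where a row was written must be used
        # to decide where to read it back.
        try:
            return PARTITIONS[region]
        except KeyError:
            raise ValueError(f"No partition defined for region {region!r}")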

Availability

Large scale systems often need to be highly available. Availability is the ability of a system to be operational a large percentage of the time – the extreme being so-called “24/7/365” systems. The largest challenge to availability is surviving system instabilities, whether from hardware or software failures.

In the mid-tier, building highly available systems starts with using high-quality hardware with features such as RAID drive arrays and redundant power supplies, housed in a data center with adequate protection from power outages. However, even with such hardware, systems can still fail. Fortunately, if the system has been designed to be load balanced, and more than one machine is handling the load, it can survive the failure of any individual machine. This situation is shown in Figure 5. Of course, this assumes that the load balancing product can detect the failure of an individual machine and reroute requests appropriately. Many commercial load balancers have this capability.

Figure 5 - High Mid Tier Availability
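
The failure detection just described can be pictured as a periodic health check: probe each machine and dispatch requests only to those that respond. The sketch below uses a plain TCP connection test and invented host names; commercial load balancers use more sophisticated probes, but the idea is the same.

    # Sketch of failure detection in a load balanced mid tier: probe each
    # machine and keep only the responsive ones in the rotation.
    # Host names are hypothetical.
    import socket

    MACHINES = ["web01.example.com", "web02.example.com"]

    def is_alive(host, port=80, timeout=2.0):
        # Treat a machine as healthy if it accepts a TCP connection in time.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def machines_in_rotation():
        return [machine for machine in MACHINES if is_alive(machine)]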

Availability in the data tier is achieved through a different technology: clustering. Clustering is the use of more than one machine to provide access to the same data store. Often this means that two machines are connected to the same physical drive array, with one of the machines on “standby.” If something should happen to the active machine, the second machine comes online and begins serving the mid-tier requests in place of the first machine. These two machines are collectively referred to as a cluster, although the term could apply to groups containing more than two machines if higher degrees of availability were required.
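
The standby node's job can be summarized as watching for a heartbeat from the active node and taking over when it stops. The sketch below is highly simplified; real clustering products also arbitrate access to the shared drive array, and the timeout values and callback names here are purely illustrative.

    # Sketch of active/standby failover in a two node cluster: the standby
    # promotes itself if the active node's heartbeat goes quiet. The timeout
    # values are illustrative; real products also arbitrate shared storage.
    import time

    HEARTBEAT_TIMEOUT = 10.0   # seconds of silence before taking over
    CHECK_INTERVAL = 2.0

    def standby_loop(get_last_heartbeat, become_active):
        # get_last_heartbeat() returns the timestamp of the most recent
        # heartbeat received from the active node; become_active() begins
        # serving mid-tier requests in its place.
        while True:
            if time.time() - get_last_heartbeat() > HEARTBEAT_TIMEOUT:
                become_active()
                return
            time.sleep(CHECK_INTERVAL)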

Clustering can be combined with partitioning to provide a data tier that is both highly available and highly scalable. Note, however, that each individual partition must be separately clustered. This is shown in Figure 6.

Figure 6 - Partitioned and Clustered Data Tier

Manageability

The key to high availability is redundancy: if you have more than one of everything, you’re not susceptible to any single point of failure. However, although redundancy means that the system as a whole will be up when you need it to be up, individual components are still bound to fail – that fact is the whole reason we use technologies like clustering in the first place.

It is important to know when this happens. After all, in a two node cluster, when the first node fails, we’ve lost our backup, and there is now a single point of failure, jeopardizing the high availability characteristics we so carefully crafted our system around. For this reason, manageability is an important aspect of successful large scale system design.

Manageability is twofold. It includes both the ability to detect faults (or the absence of faults) and the ability to automatically react to them. These abilities come from instrumentation and enterprise monitoring, respectively.

Instrumentation is the feature of system components that lets us determine their health at any given point in time. The simplest form of software instrumentation is code that logs a message when a fault is detected. While this capability is necessary as an absolute minimum, system designers should strive for a more complete solution. If a system is unresponsive, for example, but no faults have been recorded, is it merely idle? Is it hung? Is the logging mechanism itself faulty?

To be able to answer these important questions, complete instrumentation is called for. This should include not only fault notification, but notification of proper behavior, and possibly even “heartbeat” capability to determine that a component is functioning but idle. Without instrumentation, systems operators are flying blind in the face of alleged or actual systems failures.
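
What complete instrumentation might look like in code is sketched below, using Python's standard logging module. The component name and heartbeat interval are invented; the point is that the component reports faults, reports normal behavior, and emits a periodic heartbeat so that silence itself becomes meaningful.

    # Sketch of component instrumentation: report faults, report normal
    # behavior, and emit a periodic heartbeat so operators can tell
    # "idle" from "hung". The component name is hypothetical.
    import logging
    import threading

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("order-processor")

    def heartbeat(interval=30.0):
        # "Alive but possibly idle", reported on a fixed schedule, so that
        # silence itself becomes a meaningful signal.
        log.info("heartbeat: alive")
        timer = threading.Timer(interval, heartbeat, args=(interval,))
        timer.daemon = True
        timer.start()

    def process(order):
        try:
            result = do_work(order)
            log.info("processed order %s", order)   # report success, not just faults
            return result
        except Exception:
            log.exception("fault while processing order %s", order)
            raise

    def do_work(order):
        return "ok"   # stand-in for the component's real logic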

Instrumentation – even complete instrumentation – is only half the manageability story, however. Without enterprise monitoring tools, instrumentation only helps if someone happens to be looking at the instrumentation output when a fault occurs. If the output is not constantly monitored, corrective action can only be taken after something more serious occurs, like the failure of a second node in a clustered system. Obviously, we want to know as soon as possible when a failure occurs.

Enterprise monitoring tools, then, are a necessity, unless system designers wish to include in their project plans a budget for an army of low-paid, caffeinated interns dedicated to the task of watching instrument output for failure conditions. Enterprise monitoring tools – such as Tivoli Management Solutions – automate this task, watching systems for health and for faults. Most can be configured to automatically notify administrators in response to specific events.

In the best of all worlds, the enterprise monitoring tool should directly integrate with the system’s instrumentation product. This allows for maximum flexibility and speed in responding to fault events. And given that it can take even well-trained and responsive system support personnel fifteen or more minutes to identify and correct a problem – once noticed – automation is an important facet of maintaining a highly available, scalable system.
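
The consuming side of that instrumentation, the enterprise monitor, can be pictured as a loop that checks when each component last reported in and raises an alert when one goes quiet. The notify_administrator function and the threshold below are placeholders for whatever paging or ticketing mechanism the chosen monitoring product provides.

    # Sketch of the monitoring side: alert when a component's last heartbeat
    # is too old. notify_administrator() is a placeholder for the paging or
    # ticketing integration a real monitoring product provides.
    import time

    HEARTBEAT_MAX_AGE = 90.0   # seconds; roughly three missed 30-second heartbeats

    def notify_administrator(message):
        print(f"ALERT: {message}")   # stand-in for the real notification channel

    def check_components(last_heartbeats):
        # last_heartbeats maps component name to the timestamp of its last heartbeat.
        now = time.time()
        for component, seen in last_heartbeats.items():
            age = now - seen
            if age > HEARTBEAT_MAX_AGE:
                notify_administrator(
                    f"{component} has not reported in {age:.0f} seconds")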

Security

Security is an important aspect of system design, and all the more so for distributed systems, since they are often open to attack from any of millions of locations worldwide. Therefore, system designers need to carefully consider what mechanisms they will use for authentication and authorization.

Authentication is the process of having users of the system identify themselves to the system. This typically takes the form of the user entering a username and password into a form or dialog box. Once it has been determined that an appropriate password has been provided for the given username, the system must then use that information to determine whether or not that user should be allowed to perform the set of operations that they are requesting. This is what is meant by authorization, sometimes also called access control.

There is a common myth among systems designers that the most secure authentication and authorization mechanisms are those whose algorithms are secret – perhaps even created specifically for the project at hand. This is known as “security through obscurity,” and it should be avoided. The best way to defeat a determined attacker is to use one of the many publicly available algorithms – like Kerberos, TripleDES, MD5, and SHA-1 – that can withstand a determined attack even when the details of the algorithm are fully known.
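
As a concrete illustration of relying on published algorithms rather than homegrown ones, the sketch below verifies a password using the PBKDF2 and constant-time comparison primitives from Python's standard library, then performs a simple role-based authorization check. The stored user record and role table are invented for the example.

    # Sketch of authentication and authorization built on published
    # algorithms (PBKDF2 with SHA-256) from the standard library rather than
    # a homegrown scheme. The role table is hypothetical.
    import hashlib
    import hmac
    import os

    ITERATIONS = 100_000

    def hash_password(password, salt=None):
        salt = salt or os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
        return salt, digest

    def authenticate(password, salt, stored_digest):
        candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
        # Constant-time comparison avoids leaking information through timing.
        return hmac.compare_digest(candidate, stored_digest)

    ROLES = {"alice": {"place_order", "view_report"}}

    def authorize(username, operation):
        # Authorization: is this authenticated user allowed to do this?
        return operation in ROLES.get(username, set())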

However, simply choosing publicly available algorithms will not make your application secure. Aside from the many considerations of actually using these algorithms in a secure manner – in itself no small task – there are the human factors to consider. Bruce Schneier, in his excellent book "Secrets and Lies," states the three necessary ingredients for a secure system: prevention, detection, and reaction.

Prevention is the one that people tend to get caught up in. Prevention technologies are about denying an attacker any access to the system at all. This is where authentication and access control come in. But it's a losing game – given enough time and resources, a determined attacker can always defeat your prevention mechanism. Always. Thinking otherwise is setting yourself up to get taken advantage of in a big way. But it's not hopeless if you have meaningful detection mechanisms in place.

Detection means building your application in such a way that you can tell when it is being attacked. It goes hand in hand with manageability, as instrumentation can go a long way towards assessing whether anomalous behavior is the result of a glitch in the system or an intruder. Detection is a crucial addition to prevention – having one does not allow you to omit the other – but it still isn't the end of the story. Reaction is still required.

Reaction is the ability of your system – and here I mean the whole system, including the administrators and support staff – to take appropriate measures when an attack is detected. In the context of writing code, this might mean putting switches in the program to allow you to quickly disable it while returning a meaningful message to the users, as sketched below. Or it might mean isolating the attacker so they can be observed without their knowledge, allowing you to gain clues to their identity or to locate the way they circumvented your protection mechanisms.
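
In code, such a switch can be as simple as a flag that operators can flip without redeploying the system; every request checks the flag and, when it is set, the system refuses service with a meaningful message instead of continuing to expose the vulnerable path. The sketch below is illustrative only; in practice the flag would live in shared configuration so that every mid-tier machine sees it.

    # Sketch of a reaction "kill switch": operators can disable a feature
    # while an attack is investigated, and users get a meaningful message
    # instead of an error. The flag store and messages are illustrative.
    DISABLED_FEATURES = set()   # in practice this would live in shared configuration

    def disable(feature):
        DISABLED_FEATURES.add(feature)

    def handle(feature, request):
        if feature in DISABLED_FEATURES:
            return "This service is temporarily unavailable. Please try again later."
        return process(feature, request)

    def process(feature, request):
        return f"{feature} handled {request}"   # stand-in for real request handling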

Think of these three components in the context of a safe: you wouldn't build one without a lock (prevention), but a thief could still cut through it given enough time. That's why you have burglar alarms (detection). But burglar alarms only do you any good if they result in the police showing up (reaction). All three pieces are necessary.

Development Practices

Properly speaking, this last ingredient is not an aspect of the system itself, but rather an element of the process used to develop the system. Still, employing proper development practices is fundamental, and it must be kept in mind during planning just as much as scalability, availability, manageability, and security.

The exact details of what constitutes proper development practice have long been debated, and so much has been written on the topic that there is no need (nor space) to revisit the complete subject here. Here is a (very) incomplete list of considerations:

    • Use of source control is essential

    • Adequate documentation must be generated at every phase of development

    • System developers must be sufficiently trained on the technologies they will be employing

    • Specific requirements for availability, scalability and performance must be captured

Suffice it to say that systems developers are unlikely to succeed unless they take a structured, methodical approach to system construction, using personnel experienced in the technologies they will be employing.

Conclusion

Successfully designing a large scale system is difficult, despite the promises to the contrary by development tool vendors. Building software that will execute correctly and consistently in a distributed environment, where hundreds or even millions of requests need to be serviced on a daily basis, is no small task. While nothing can guarantee success, designers who remember to design for scalability, availability, manageability, and security, and who follow proper development practices, will greatly decrease the chance of project failure.