A critical advantage of Cloud-hosted services is elasticity, the ability to rapidly scale out a service without needing to purchase and install physical hardware. Cloud providers allow tenant services to add additional virtual servers on-demand, enabling them to meet changes in load (rate of arriving client requests).
In this project, I implemented various techniques to scale out a simulated, Cloud-hosted, multi-tier web service (a web storefront). The key task for an auto-scaling system is to evaluate the system, understand which component is the bottleneck at different loads, and use that information to decide when to scale out which components in order to improve performance. At the same time, minimizing the number of servers used to handle a particular workload also matters, because these resources are not free.
The web service is a simulated online store. My servers must handle two kinds of client requests: browse requests (which return information on items and categories of items) and purchase requests (which buy an item). The system simulates clients arriving at random intervals. Each client performs one or more browse requests and may follow up with a purchase, but only if the replies arrive quickly enough. If a client's request is dropped, takes too long, or results in an error, the client leaves unhappy. Clients expect browse requests to be serviced within 1 second and purchases within 2 seconds.
The work in the simulated system consists of three parts: front-end work, application (back-end) work, and database work. Simulated clients send their requests to a system-provided load balancer, which distributes them among the registered servers.
The simulation system also provides interfaces to control virtual machines (implemented as processes).
The goals of the project are:
minimizing the number of "unhappy" clients (any whose request ends with a status other than "ok" or "purchased"), and
minimizing the amount of cloud resources (total VM time) consumed.
Figure 1. Architecture
Starting a VM takes 5 seconds.
Each unit of front-end work takes 60 ms.
Each unit of back-end (app-tier) work takes 200 ms.
Each transaction takes 200 ms in the database.
Each read request takes 50 ms in the database.
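For reference, these fixed simulation costs can be collected in one place. The class and constant names below are my own illustration, not names from the project's starter code.

```java
// Hypothetical constants class collecting the simulation's fixed costs.
public final class SimCosts {
    public static final long VM_BOOT_MS        = 5000; // VM start -> running
    public static final long FRONT_END_MS      = 60;   // front-end work per request
    public static final long APP_TIER_MS       = 200;  // back-end work per request
    public static final long DB_TRANSACTION_MS = 200;  // purchase transaction in the database
    public static final long DB_READ_MS        = 50;   // read (browse) query in the database
    public static final long BROWSE_SLO_MS     = 1000; // clients wait 1 s for browse replies
    public static final long PURCHASE_SLO_MS   = 2000; // clients wait 2 s for purchase replies

    private SimCosts() {}
}
```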
The whole system consists of an identification system, sampling system, auto-scaling system, cool-down system, smart-drop system, and cache system.
My design uses a central request queue and a central cache. To avoid spinning up an extra VM just for these shared pieces, I place them on the coordinator server, which is also a front-tier server. All front-tier servers push requests into the central queue, and all app-tier servers pull requests from it.
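A minimal sketch of the central queue as a remote service, assuming Java RMI-style RPC; the interface, method names, and the idea of carrying an estimated arrival time with each request are illustrative, not the project's actual API.

```java
import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;

// Hypothetical remote interface for the central queue hosted on the coordinator
// (which is also a front-tier server). Front-tier servers call enqueue();
// app-tier servers call dequeue().
public interface CentralQueue extends Remote {

    // A request together with the front end's estimate of when it arrived.
    class QueuedRequest implements Serializable {
        public final Serializable request;  // the client request payload
        public final long arrivalTimeMs;    // estimated arrival time at the system
        public QueuedRequest(Serializable request, long arrivalTimeMs) {
            this.request = request;
            this.arrivalTimeMs = arrivalTimeMs;
        }
    }

    void enqueue(QueuedRequest r) throws RemoteException;

    // Returns null if the queue is currently empty.
    QueuedRequest dequeue() throws RemoteException;

    int length() throws RemoteException;
}
```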
Each time the system starts, exactly one server is running, and I treat this server as the main server.
The main server (VM_ID = 1) acts as the coordinator of the whole system and is responsible for sampling and auto-scaling. The coordinator also provides a lookup service so that each newly started server can learn what kind of server it should be (front-end or app-tier), since that role is decided by the auto-scaling system.
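A sketch of the role-lookup idea, assuming a simple registry on the coordinator that new VMs query over RPC when they boot; the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical coordinator-side role registry. When the auto-scaler starts a VM
// it records the intended role; the new VM asks for its role once it is running.
public class RoleRegistry {
    public enum Role { FRONT_END, APP_TIER }

    private final Map<Integer, Role> plannedRoles = new ConcurrentHashMap<>();

    // Called by the auto-scaling logic right after it requests a new VM.
    public void planRole(int vmId, Role role) {
        plannedRoles.put(vmId, role);
    }

    // Called (over RPC) by a newly booted VM to learn what it should become.
    public Role lookupRole(int vmId) {
        // The main server (VM_ID = 1) is always the coordinator / front end.
        if (vmId == 1) return Role.FRONT_END;
        return plannedRoles.getOrDefault(vmId, Role.APP_TIER);
    }
}
```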
The sampling system periodically (every 1000 milliseconds in my case) collects statistics from all the front-end servers: the request arrival rate (rather than the client arrival rate, since the request rate is the more useful of the two), the queue length, and the response time. This information is hard to measure exactly, because a front-end server blocks waiting for new requests whenever its queue is empty. To get a reasonably accurate arrival rate, I therefore make an approximation: after each iteration, the front-end server checks its queue length, and if new requests have appeared in the queue, I assume they all arrived at the same time.
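A sketch of that approximation on a front-end server, using my own simplified data structure; the real project code may organize this differently.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical per-front-end sampler. Whenever the server notices a batch of
// new requests after a wait, it assumes they all arrived at the same instant;
// the coordinator polls rate() once per sampling period (e.g. every 1000 ms).
public class ArrivalSampler {
    private final Deque<Long> arrivalStamps = new ArrayDeque<>(); // one entry per request

    // Called after each iteration with the number of requests newly observed
    // in the queue (including the one just taken, if any).
    public synchronized void recordBatch(int newRequests, long nowMs) {
        for (int i = 0; i < newRequests; i++) {
            arrivalStamps.addLast(nowMs); // approximation: whole batch shares one arrival time
        }
    }

    // Requests per second over the last sampling window.
    public synchronized double rate(long nowMs, long windowMs) {
        while (!arrivalStamps.isEmpty() && arrivalStamps.peekFirst() < nowMs - windowMs) {
            arrivalStamps.removeFirst(); // drop arrivals outside the window
        }
        return arrivalStamps.size() * 1000.0 / windowMs;
    }
}
```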
My scaling metrics consist of two parts. The first is the queue length, which hints that it may be time to scale out. The second is the arrival rate, which is the most useful metric in all circumstances; the scaling decision itself is based purely on the arrival rate.
After each sampling round, a scaling decision can be made based on the current state of the system.
I ran many benchmarks to derive heuristics: for a given request arrival rate, I know how many front-end and app-tier servers are needed to handle the workload successfully.
Moreover, thrashing can occur if the system reacts symmetrically to every change. To prevent it, the scale-out factor is 0.5 while the scale-in factor is 0.25; these two factors define how sensitively the system reacts to the request arrival rate, making it quicker to add servers than to remove them.
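A minimal sketch of how such a decision rule might look, assuming the benchmark heuristics reduce to a single requests-per-second capacity per app-tier server; the way the 0.5 and 0.25 factors enter the thresholds here is my illustration, not the exact rule derived from my benchmarks.

```java
// Hypothetical auto-scaler sketch with asymmetric (hysteresis) thresholds.
public class AutoScaler {
    private static final double SCALE_OUT_FACTOR = 0.5;  // more sensitive to increases
    private static final double SCALE_IN_FACTOR  = 0.25; // less sensitive to decreases

    private final double capacityPerAppServer;           // req/s one app-tier server can absorb (assumed benchmark value)

    public AutoScaler(double capacityPerAppServer) {
        this.capacityPerAppServer = capacityPerAppServer;
    }

    // Returns the desired number of app-tier servers given the sampled arrival
    // rate and the current count, using asymmetric thresholds to avoid thrashing.
    public int decide(double arrivalRate, int currentServers) {
        double capacity = currentServers * capacityPerAppServer;
        if (arrivalRate > capacity + SCALE_OUT_FACTOR * capacityPerAppServer) {
            // Load clearly exceeds current capacity: grow to the benchmarked target.
            return (int) Math.ceil(arrivalRate / capacityPerAppServer);
        }
        if (arrivalRate < capacity - (1 + SCALE_IN_FACTOR) * capacityPerAppServer) {
            // Load is well below capacity: shrink, but never below one server.
            return Math.max(1, (int) Math.ceil(arrivalRate / capacityPerAppServer));
        }
        return currentServers; // stay put inside the hysteresis band
    }
}
```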
As mentioned above, a VM takes 5 seconds to go from starting to running, while shutting down a VM happens instantly. This property creates a special case that needs to be handled. In Figure 2, two servers are running at time 0 s. At time 3 s, the sampling system decides that five servers in total are needed to cope with the current arrival rate; those servers will not be running until time 8 s. At time 4 s, the arrival rate drops sharply and one server would be enough. At time 5 s, the rate rises again and the system once more needs five servers.
In this case there is a spurious drop in the arrival rate. To deal with it, the system enters a cool-down period whenever a scale-out is requested, and no scale-in is allowed until all the newly started servers are running. If the arrival rate has really dropped, the downward trend will still be visible a few seconds later, and the scale-in decision can be made then.
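A sketch of the cool-down rule, assuming the 5-second boot time above; the class and method names are illustrative.

```java
// Hypothetical cool-down guard around scale-in decisions.
public class CoolDown {
    private static final long VM_BOOT_MS = 5000; // VM start -> running
    private long coolDownUntilMs = 0;

    // Called whenever the auto-scaler starts new VMs.
    public void onScaleOut(long nowMs) {
        coolDownUntilMs = Math.max(coolDownUntilMs, nowMs + VM_BOOT_MS);
    }

    // Scale-in is only honored once all pending VMs are up; if the arrival rate
    // has really dropped, the same downward trend will still be visible later.
    public boolean scaleInAllowed(long nowMs) {
        return nowMs >= coolDownUntilMs;
    }
}
```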
The system would rather keep more servers running to make clients happy than save money on servers.
Figure 2. Time and Server Number
The system needs a drop mechanism because it is meaningless to process old requests that are already doomed to make their clients unhappy. When the system realizes that a request cannot be handled within its time limit, it drops the request, so that time is spent making more clients happy rather than being wasted on requests that will time out anyway.
I designed two drop mechanisms, used by both the front-end and app-tier servers.
The first is pre-drop: any request in the queue that can no longer be handled within its time limit is dropped. This usually happens at the head of the queue.
The second is after-drop: right before the server starts working on a request, it checks again and predicts whether the request can be completed within its time limit; if not, the request is dropped.
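A sketch of the two checks, assuming each request carries its estimated arrival time and its deadline (1 s for browse, 2 s for purchase) and that the remaining processing cost can be estimated; the helper names are mine.

```java
// Hypothetical smart-drop checks used by both tiers. Both predicates ask the
// same question; they differ only in when they are applied.
public class SmartDrop {
    // Pre-drop: applied to requests sitting in the queue, usually at the head.
    public static boolean preDrop(long arrivalMs, long deadlineMs,
                                  long remainingWorkMs, long nowMs) {
        return nowMs + remainingWorkMs > arrivalMs + deadlineMs;
    }

    // After-drop: applied right before the server starts working on the request,
    // using an estimate of the remaining processing cost.
    public static boolean afterDrop(long arrivalMs, long deadlineMs,
                                    long estimatedCostMs, long nowMs) {
        return nowMs + estimatedCostMs > arrivalMs + deadlineMs;
    }
}
```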
The key part of this mechanism is measuring when a request arrives at the system, i.e., at the front-end servers. The system makes an optimistic estimate of each request's arrival time. For example, suppose the queue is initially empty and three requests then arrive. After the front-end server breaks out of its wait, it finds two requests in the queue; counting the one it just picked up, three requests were present before the wait ended. Since a front-end server takes 60 milliseconds to handle one request, the system estimates that all three requests arrived at the same time, namely the current time minus 60 milliseconds. After finishing each unit of work, the front-end server checks the queue again for new requests.
If the front-end server processes the request without any problem, it passes the estimated arrival time along with the request to the app-tier server.
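A sketch of that optimistic estimate, assuming the 60 ms front-end cost; the structure is simplified from my implementation.

```java
// Hypothetical arrival-time estimator for the front end. After finishing one
// 60 ms unit of work, the server checks the queue; any requests it finds are
// optimistically assumed to have arrived when that unit of work began.
public class ArrivalEstimator {
    private static final long FRONT_END_MS = 60;

    // wasBusy: true if the server just finished a 60 ms unit of front-end work,
    // false if it was blocked waiting on an empty queue.
    public static long estimateArrival(long nowMs, boolean wasBusy) {
        return wasBusy ? nowMs - FRONT_END_MS : nowMs;
    }
}
```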
There is one central cache. Each app-tier server uses RPC to check whether a read hits the cache, and then decides on its own whether to access the database directly. All app-tier servers share the benefit of reads performed by the others.
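A sketch of the central cache as a remote service, again assuming Java RMI-style RPC; the interface name, method names, and string types are hypothetical.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

// Hypothetical remote interface for the central cache on the coordinator.
// App-tier servers call get(); on a miss they read the database themselves
// and then call put(), so every server benefits from the others' reads.
public interface CentralCache extends Remote {
    // Returns the cached value for this browse key, or null on a miss.
    String get(String key) throws RemoteException;

    // Stores a value read from the database so other app-tier servers can reuse it.
    void put(String key, String value) throws RemoteException;
}
```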
During the first 5 seconds after the system starts, the main server plays all the roles and handles requests by itself.
GitHub Repository for This Project: Auto-Scaling