Fine-grained background job

Every so often we are expected to perform some processing in the background at a scheduled interval. Since this is not triggered by a user, we do not have the same kind of response-time constraints. That freedom takes us towards creating long-running background jobs, or batch jobs. There are some significant downsides to this approach.

Let us start with an example: a patient management system. This application is responsible for managing patient data and is used by hospital staff (doctors, nurses, administration). The application has an interesting feature of reminding patients, who sign up for it, about their dosages. The notification goes to the patient via SMS or IVR, depending on their preference. A simple batch-job-based approach would probably work as follows.

  1. We define a scheduled (cron-like) job in the operating system. This job invokes a background application when it runs. Let's say we run it once every hour.
  2. When this job runs, it loads all the patients whose dosage is due in the next hour.
  3. The application loops through all the loaded patients and calls a patient alert service.
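
The three steps above can be sketched roughly as follows. The `AlertService` and `Dosage` names are illustrative stand-ins, not the actual system's types:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;

public class HourlyAlertBatch {

    // Illustrative stand-in for the real alert service (SMS or IVR per preference).
    interface AlertService {
        void sendAlert(String patientId);
    }

    // One pending dosage for one patient.
    record Dosage(String patientId, Instant due) {}

    // Step 2: load all patients whose dosage is due within the next hour.
    static List<Dosage> dueWithinNextHour(List<Dosage> all, Instant now) {
        Instant cutoff = now.plus(1, ChronoUnit.HOURS);
        return all.stream()
                .filter(d -> !d.due().isBefore(now) && !d.due().isAfter(cutoff))
                .toList();
    }

    // Step 3: loop through the loaded patients and call the alert service.
    static void run(List<Dosage> all, AlertService alerts, Instant now) {
        for (Dosage d : dueWithinNextHour(all, now)) {
            alerts.sendAlert(d.patientId()); // a single failure here aborts the whole run
        }
    }
}
```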

This batch job would work fine as long as the number of such patients is low and there are no failures.

If the batch job takes longer than an hour to run, it would start skipping some patients because of boundary conditions: by the time the next run starts, some dosages that were due have already slipped out of its one-hour window.

In step 3, if some failure condition occurs, the patient alert service call would fail for some patients.

Let's see what options we have to resolve these issues.

Handling failures

Fail the entire job

When sending the alert fails for any patient, abort the job at that point. If the failure was because of network or service unavailability, the issue would resolve itself by the time the job runs next. But when it fails because of incorrect data on a particular patient (or a bug), we cannot use this approach: the job would keep aborting at the same patient. Arguably one can classify the failure types and then take the decision based on that. Personally, I have found this tricky because one can never be sure of all the failure types (Java exceptions).
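
That classification might look something like the sketch below. The exception types listed here are an open-ended guess, which is exactly why this approach is tricky:

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

public class FailureClassifier {

    // Transient failures (network, service unavailability) are worth retrying on
    // the next run; anything else is treated as a data problem or a bug.
    // This list can never be exhaustive, which is the weakness of the approach.
    static boolean isTransient(Throwable t) {
        return t instanceof IOException || t instanceof TimeoutException;
    }
}
```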

Manage work items

An alternative would be to create work items for all the patients to whom alerts need to be sent. These work items represent the task that needs to be performed for a patient. If we successfully send the alert for the patient, we delete the work item; otherwise we leave it behind for the next run. Some of these work items might expire after some time; in other words, there is no point in sending the alert to a patient at night when they are sleeping.
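
A minimal sketch of that work-item loop, assuming an in-memory list standing in for the real work-item store:

```java
import java.time.Instant;
import java.util.Iterator;
import java.util.List;

public class WorkItemProcessor {

    // Illustrative work item: the alert still owed to one patient.
    record WorkItem(String patientId, Instant expiresAt) {}

    interface AlertService {
        void sendAlert(String patientId) throws Exception;
    }

    // Delete items on success, drop expired items, and leave failed ones
    // behind so the next run retries them.
    static void process(List<WorkItem> items, AlertService alerts, Instant now) {
        Iterator<WorkItem> it = items.iterator();
        while (it.hasNext()) {
            WorkItem item = it.next();
            if (item.expiresAt().isBefore(now)) {
                it.remove(); // expired: too late to alert this patient
                continue;
            }
            try {
                alerts.sendAlert(item.patientId());
                it.remove(); // success: the work item is done
            } catch (Exception e) {
                // failure: keep the item so a later run can retry it
            }
        }
    }
}
```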

Dealing with large volume

Our approach of running one scheduled job has an upper bound on the number of work items it can process. The intuitive next step is to partition (shard) the work items; in our case we can do this based on the patient id. Instead of one background application we run multiple jobs, or better, launch multiple threads from the same job, assigning a range of patient ids to each thread. Anyone who has developed large systems can see that this approach is still not scalable: at some point a single process, or even a single machine, cannot process all the work items, and putting in a more powerful machine only postpones the limit.
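
A rough sketch of the threaded variant, hashing the patient id instead of assigning explicit id ranges (a simplification for brevity):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ShardedBatch {

    // Deterministically map a patient id to one of `shards` shards.
    static int shardOf(String patientId, int shards) {
        return Math.floorMod(patientId.hashCode(), shards);
    }

    // One worker thread per shard; each thread processes only its own patients.
    static void run(List<String> patientIds, int shards, Consumer<String> alert) {
        ExecutorService pool = Executors.newFixedThreadPool(shards);
        for (int s = 0; s < shards; s++) {
            final int shard = s;
            pool.submit(() -> patientIds.stream()
                    .filter(id -> shardOf(id, shards) == shard)
                    .forEach(alert));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The ceiling is visible right in the sketch: however many threads we add, they all live in one process on one machine.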

Fine grained background job

We just saw that sharding helps us process more work items per process/machine. If sharding is a good idea, then why not take it to the extreme and put every work item in its own shard? That is, define one scheduled job per patient. And if you are using the operating system (cron) job scheduler to do this, you can use something like Quartz instead. (I have noticed that even when people use Quartz, the batch mindset still kicks in; check out the implementation of the aggregator pattern in Spring Integration.) Quartz works off a relational database for storing its jobs. This means that we can scale out to multiple nodes working from the same Quartz job store (database). If you do not have issues with adding a relational database to your infrastructure, then this option works much better than the batch approach.
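
To show the shape of one-job-per-patient without pulling in Quartz, here is a sketch on the JDK's ScheduledExecutorService; what Quartz adds on top of this is precisely the persistent, clusterable job store:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class PerPatientScheduler {

    private final ScheduledExecutorService exec = Executors.newScheduledThreadPool(4);

    // One independent scheduled task per patient, fired at that patient's
    // dosage time; patientId is kept in the signature to mirror the real setup.
    public ScheduledFuture<?> scheduleAlert(String patientId, Instant due, Runnable sendAlert) {
        long delayMs = Math.max(0, Duration.between(Instant.now(), due).toMillis());
        return exec.schedule(sendAlert, delayMs, TimeUnit.MILLISECONDS);
    }

    public void shutdown() {
        exec.shutdown();
    }
}
```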

You do not need to write your own infrastructure code for managing failures. In our example, when something goes wrong while processing one patient's background task, we need to worry about only that one task. And since Quartz allows quite sophisticated schedules, you can include retries for every scheduled job, re-running it after a certain duration.
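
A sketch of that per-job retry, again on plain JDK scheduling standing in for Quartz's trigger schedules:

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RetryingTask {

    // If the task throws, reschedule just that one job after a fixed back-off,
    // giving up after a bounded number of attempts.
    static void scheduleWithRetry(ScheduledExecutorService exec, Runnable task,
                                  long delayMs, int attemptsLeft, Runnable onGiveUp) {
        exec.schedule(() -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                if (attemptsLeft > 1) {
                    scheduleWithRetry(exec, task, delayMs, attemptsLeft - 1, onGiveUp);
                } else {
                    onGiveUp.run(); // out of attempts: surface for manual follow-up
                }
            }
        }, delayMs, TimeUnit.MILLISECONDS);
    }
}
```

Only the failing patient's job is retried; all the other jobs run to completion untouched.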

This approach essentially makes background processing work like foreground processing, much like a user sending individual requests. It doesn't completely solve the scalability problem, because now the database is the bottleneck. But you can find a lot of ways to scale that out by sharding the scheduled-job store. We can do this based on scheduled-job type, the patient's district, and so on.
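
The routing to a job-store shard can be as simple as a deterministic hash on the chosen key; the district key and JDBC URLs below are made up for illustration:

```java
import java.util.List;

public class JobStoreRouter {

    // Illustrative list of job-store shards (the JDBC URLs are invented).
    static final List<String> STORES = List.of(
            "jdbc:postgresql://db1/quartz",
            "jdbc:postgresql://db2/quartz");

    // Pick the job store for a job based on an attribute such as the
    // patient's district; the same key always lands on the same store.
    static String storeFor(String district) {
        int idx = Math.floorMod(district.hashCode(), STORES.size());
        return STORES.get(idx);
    }
}
```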

Another important thing is that I don't need any tools for monitoring background jobs, killing them, unblocking them and so on. I feel this simple design change provides a good alternative to, say, Resque (https://github.com/blog/542-introducing-resque). Though to be completely honest, ours is not GitHub scale at this point.

We have followed this approach in our rewrite of Motech Ghana and in a few other places where we have implemented Motech.