High Performance Web: Efficient and scalable cronjobs

Benjamin Coutu on 10/02/2017.

Speed and performance are paramount for any mission-critical business application. In our blog segment "High Performance Web", our chief technology officer Benjamin Coutu shares the tricks and challenges involved in making ZeyOS perform at "lightning speed", so that the development community can benefit from what we have learned here at ZeyOS.

Attention: This is a strictly technical tutorial, targeted at developers who are familiar with the Linux shell.

Being a quintessential cloud computing system, ZeyOS is deployed as a multi-tenancy architecture, with multiple customer instances running on a single platform. That platform comprises a conventional software stack: Ubuntu Linux, PostgreSQL, Nginx and PHP-FPM.

Among the key benefits of multi-tenancy are greater scalability and a simpler release management process. Nevertheless, because of concurrency control and the generally added complexity, multi-tenant applications typically require a larger development effort. Today I'd like to highlight one particular implementation challenge: efficiently spawning multiple PHP maintenance scripts via cronjob. I have chosen the specific example of regularly checking for new e-mails to best illustrate a general solution to this kind of problem.

Please note that this is a simplified and somewhat contrived example. However, I will try to make it as easily reusable (and copy-pastable) as possible.

The task at hand is to invoke a PHP maintenance cronjob that checks for new e-mails for every customer instance, every single minute.

Basic setup

First, we register a cronjob by placing the following line in the cron file /etc/cron.d/checkmail:

*   *   *   *   *   root    /opt/checkmail.sh

This tells the cron daemon to execute the shell script /opt/checkmail.sh each and every minute (with root privileges).

Now, let's see what the basic structure of /opt/checkmail.sh could actually look like:

#!/bin/sh

for INSTANCE in /srv/instances/*; do
  # ... perform task for each instance ...
done

This will loop over all instances, represented as sub-directories of our base directory /srv/instances, and perform a given task for each one of them.

Parallel sub-process spawning

Let's say our PHP script for actually checking and fetching new e-mails for a specific instance is located at /var/www/<instanceid>/checkmail.php. A naive implementation of our shell script might then look like the following (I will omit the shebang line #!/bin/sh for much of the remainder of this article):

for INSTANCE in /srv/instances/*; do
  ID=`basename "$INSTANCE"`
  /usr/bin/php "/var/www/$ID/checkmail.php"
done

The problem with this is that it checks every ZeyOS instance for e-mails sequentially, one at a time. So if one instance takes a long time to retrieve its e-mails, the other instances have to wait until it finishes. Remember, we are talking about hundreds of ZeyOS instances here. We clearly cannot have one customer's instance wait for another customer's instance to fetch all its e-mails. Our goal is a throughput of one e-mail check per instance per minute, with no delay for any single instance.

There is of course an easy fix for this: we simply start all checkmail processes in parallel by moving each sub-process to the background (note the ampersand at the end of line 3):

for INSTANCE in /srv/instances/*; do
  ID=`basename "$INSTANCE"`
  /usr/bin/php "/var/www/$ID/checkmail.php" &
done

wait

OK, great, it's done, or is it? Not so fast! A new issue has arisen: we now spawn hundreds of PHP processes, one for each instance, all at the same time. This clearly will not scale and we will probably run out of resources almost immediately.

Scatter sub-process spawning

Let's try to spread the spawning of sub-processes over a given period, in this case the one minute that we have defined to be our time frame for checking e-mails. There is no harm in checking one instance at second #1 of the minute, another one at second #2 of the minute, and others later in the minute, as long as each one is effectively checked once every single minute.

How do we go about this? We divide our 60-second time frame into as many slots as there are instances. In our example we have 100 instances, so consecutive invocations should be spaced 60/100 = 0.6 seconds apart. We can use the sleep command to wait for those 0.6 seconds (GNU sleep, as shipped with Ubuntu, accepts fractional seconds), meanwhile handing control to the system kernel and thereby effectively releasing resources.

INTERVAL=0.6

for INSTANCE in /srv/instances/*; do
  ID=`basename "$INSTANCE"`
  /usr/bin/php "/var/www/$ID/checkmail.php" &
  sleep $INTERVAL
done

wait

Let's generalize that and calculate the interval from the actual number of instances. We must also avoid a division by zero by exiting early if there are no instances at all. Note that we count directory entries with ls instead of relying on wildcard expansion, because in a POSIX shell an unmatched glob expands to itself and would yield a bogus count of one:

COUNT=`ls /srv/instances | wc -l`

if [ $COUNT -eq 0 ]; then
  exit
fi

INTERVAL=`echo $COUNT | awk '{print 60 / $1}'`

for INSTANCE in /srv/instances/*; do
  ID=`basename "$INSTANCE"`
  /usr/bin/php "/var/www/$ID/checkmail.php" &
  sleep $INTERVAL
done

wait

We are now starting 100/60 (~ 1.7) sub-processes per second, and we guarantee that each instance is checked for e-mails exactly once per minute (once within each 60-second window, though not necessarily at the top of the minute).
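If you want to convince yourself of the even spread, a quick debugging variant of the loop (a sketch, not part of the actual cronjob; the log path is arbitrary) records a timestamp for each spawn:

INTERVAL=0.6

for INSTANCE in /srv/instances/*; do
  # record minute:second of every spawn, then inspect the log
  echo "$(date '+%M:%S') $(basename "$INSTANCE")" >> /tmp/checkmail-spawns.log
  sleep $INTERVAL
done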

Avoid sub-process overhead

Currently we are scanning /srv/instances twice, once to retrieve $COUNT and again for the for-loop. Also, wildcard expansion via /srv/instances/* returns full paths, though all we actually need are the respective base names. We can use find with the -printf option to directly produce a list of instance identifiers, which lets us avoid invoking basename on every loop iteration. We now simply bail out early when the list is empty.

INSTANCES="$(find /srv/instances -mindepth 1 -maxdepth 1 -printf "%f\n")"

if [ -z "$INSTANCES" ]; then
  exit
fi

INTERVAL=$(echo "$INSTANCES" | wc -w | awk '{print 60 / $1}')

for ID in $INSTANCES; do
  /usr/bin/php "/var/www/$ID/checkmail.php" &
  sleep $INTERVAL
done

wait

Avoid overlapping execution

What if some checkmail processes take a long time to complete, several minutes for example? We have to make sure that an instance is not checked again while a previous check of that same instance is still running. Otherwise we might trigger race conditions and the like.

We can accomplish this through mutual exclusion, using one lock file per instance. We check for the existence of the lock file and only proceed if there is no such file. If we can acquire the lock, we spawn the sub-process as usual and release the lock again afterwards. Of course, lock handling and PHP execution must be moved to the background together as one logical unit, so that sleep can take over in the foreground. It is wise to create lock files under the /var/lock directory, because this path is usually mounted in memory (tmpfs), so the locks don't accidentally persist across reboots (and as a bonus they won't require costly real I/O).
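By the way, you can quickly verify that /var/lock really is memory-backed on your system (on modern Ubuntu releases it is a symlink into /run, which is a tmpfs):

df -T /var/lock

With one lock file per instance, our script becomes: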

INSTANCES="$(find /srv/instances -mindepth 1 -maxdepth 1 -printf "%f\n")"

if [ -z "$INSTANCES" ]; then
  exit
fi

INTERVAL=$(echo "$INSTANCES" | wc -w | awk '{print 60 / $1}')

for ID in $INSTANCES; do
  LOCK="/var/lock/checkmail-$ID"

  if [ ! -f "$LOCK" ]; then # WARNING: race condition begins
    (
      touch "$LOCK" # WARNING: race condition ends
      /usr/bin/php "/var/www/$ID/checkmail.php"
      rm "$LOCK"
    ) &
  fi

  sleep $INTERVAL
done

wait

This is of course a flawed approach, as it still carries a slight risk of a race condition: the existence check and the actual creation of the lock file are two separate, non-atomic steps. We can fix this by using the mkdir command, which is guaranteed to be atomic and, unlike touch, returns a non-zero exit code if it cannot create the directory. So we know whether acquiring the lock succeeded without having to check for its existence first.
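To see the pattern in isolation, here is a throwaway illustration (not part of the script) of mkdir acting as an atomic test-and-set:

# the first mkdir succeeds and thereby acquires the "lock" ...
mkdir /tmp/demo-lock && echo "acquired"

# ... any further attempt fails until the directory is removed again
mkdir /tmp/demo-lock 2> /dev/null || echo "already locked"
rmdir /tmp/demo-lock

Applied to our script: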

INSTANCES="$(find /srv/instances -mindepth 1 -maxdepth 1 -printf "%f\n")"

if [ -z "$INSTANCES" ]; then
  exit
fi

INTERVAL=$(echo "$INSTANCES" | wc -w | awk '{print 60 / $1}')

for ID in $INSTANCES; do
  LOCK="/var/lock/checkmail-$ID"

  if mkdir "$LOCK" 2> /dev/null; then
    (
      /usr/bin/php "/var/www/$ID/checkmail.php"
      rmdir "$LOCK"
    ) &
  fi

  sleep $INTERVAL
done

wait

Alternatively, instead of manually handling our lock file/directory, we can simply use flock with the non-blocking option -n. The useful thing about flock is that the lock is held until our checkmail process terminates, whether it completes successfully or not.
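The non-blocking behavior is easy to demonstrate with a throwaway lock file (a sketch):

# hold the lock for 10 seconds in the background ...
flock -n /var/lock/flock-demo -c "sleep 10" &
sleep 1

# ... meanwhile a second attempt gives up immediately instead of blocking
flock -n /var/lock/flock-demo -c "echo acquired" || echo "lock is busy"

With flock, the body of our loop shrinks to a single line: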

INSTANCES="$(find /srv/instances -mindepth 1 -maxdepth 1 -printf "%f\n")"

if [ -z "$INSTANCES" ]; then
  exit
fi

INTERVAL=$(echo "$INSTANCES" | wc -w | awk '{print 60 / $1}')

for ID in $INSTANCES; do
  flock -n "/var/lock/checkmail-$ID" -c "/usr/bin/php '/var/www/$ID/checkmail.php'" &
  sleep $INTERVAL
done

wait

"Use the web server, Luke!"

Is there anything else we can improve upon? One thing we can optimize is the actual invocation of our PHP script. In our example we use the command line version of PHP (PHP-CLI) to run our checkmail script. Each time we execute checkmail via PHP-CLI we have to load our entire code base and cannot utilize shared caching facilities such as PHP's OPcache (compilation unit caching) or APCu (user data caching). In practice that means compiling a huge number of includes, classes, etc. again and again for every invocation of checkmail.php. We wouldn't have that problem when using a web server with PHP-FPM (or PHP-CGI for that matter), as OPcache would always be available, residing in a shared memory segment allocated by the global parent PHP process.
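You can easily check whether CLI invocations benefit from OPcache at all (a quick inspection; on stock installations opcache.enable_cli is off, and even when it is on, each CLI process builds its own cache that dies with the process):

# prints opcache.enable and opcache.enable_cli as seen by the CLI SAPI
# (no output at all means the OPcache extension is not even loaded)
php -i | grep -i 'opcache.enable'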

Given that we have configured our web server accordingly, we can invoke our checkmail script via an HTTP request to its URL. Let's say, for simplicity, that /var/www is the web server's document root; then our script is accessible via localhost at http://127.0.0.1/<instanceid>/checkmail.php. To issue a proper HTTP request to that URL we can use the command line tool wget.

So, instead of

/usr/bin/php "/var/www/$ID/checkmail.php"

we would just do

wget -O /dev/null -q -T 0 --spider "http://127.0.0.1/$ID/checkmail.php"

The wget option -O /dev/null redirects all output to Nirvana (the null device); -q tells it to be quiet (non-verbose); -T 0 tells it not to time out (our CLI script wouldn't time out either); and --spider makes it behave like a web crawler, which effectively means that it won't download any content of the HTTP response body (we don't expect our checkmail.php script to output anything).
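If you prefer curl over wget, an equivalent invocation might look like this (a sketch; curl has no exact counterpart to --spider, but discarding the response body achieves the same effect, and curl applies no overall timeout unless one is set explicitly):

curl -s -o /dev/null "http://127.0.0.1/$ID/checkmail.php"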

As we are calling the local host (127.0.0.1), we don't need to access the script via a secure HTTPS connection; that would just add unnecessary overhead. Also, in production, we would configure the server to allow access to that URL route from localhost only, so that no one but our shell script can invoke the checkmail mechanism.
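Once such a deny rule is in place, it is easy to verify from the shell (a sketch; <instanceid> and <public-ip> are placeholders for a real instance and the server's external address):

# from the server itself: expect HTTP status 200
curl -s -o /dev/null -w '%{http_code}\n' "http://127.0.0.1/<instanceid>/checkmail.php"

# via the public interface: expect 403 (forbidden)
curl -s -o /dev/null -w '%{http_code}\n' "http://<public-ip>/<instanceid>/checkmail.php"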

Production-ready

Our final shell script /opt/checkmail.sh is:

#!/bin/sh

INSTANCES="$(find /srv/instances -mindepth 1 -maxdepth 1 -printf "%f\n")"

if [ -z "$INSTANCES" ]; then
  exit
fi

INTERVAL=$(echo "$INSTANCES" | wc -w | awk '{print 60 / $1}')

for ID in $INSTANCES; do
  flock -n "/var/lock/checkmail-$ID" -c "wget -O /dev/null -q -T 0 --spider 'http://127.0.0.1/$ID/checkmail.php'" &
  sleep $INTERVAL
done

wait