Race conditions and caching variables

I would like to claim an utter hatred of race conditions. This is where code is written in such a way that it doesn’t fully consider the possibility of another thread (e.g. another website hit) or threads occurring concurrently. Consider the following which has been increasingly frustrating me recently:

Drupal stores variables in the ‘variables’ table. It also caches these in the ‘cache’ table so that rather than doing multiple SELECT queries for multiple variables, it simple gets all the variables straight out of the cache table in one SELECT then unserializes them.

cron_semaphore is one of these variables which is created when cron starts, then it deletes it when finishing. If it isn’t deleted it should mean that cron hasn’t finished running yet, so the next time cron tries to run it will quit straight away. But due to a certain race condition it doesn’t always get properly deleted as follows (p2 is an abbreviation for an unrelated process running concurrently, e.g. a visitor to your website):

1, cron starts, cron_semaphore variable inserted (and variables cache is deleted)
2. p2 starts, variables cache is empty so “SELECT * FROM {variables}” then…
3. cron finishes, cron_semaphore variable deleted and the variables cache is cleared
4. … p2 inserts result of “SELECT * FROM {variables}” into cache, but that SELECT was called before cron deleted the variable
5. you now have no mention of cron_semaphore in the variables table, but there it is back in the variables cache!

Consider many visits to your website concurrently and you soon realise this can become a very common occurrence. In fact, this exact problem inflicts us at least a handful of times every day. As a result cron keeps trying to run but immediately quits when it sees the semaphore variable still there. After an hour it finally deletes the semaphore but in the meantime crucial stuff doesn’t get done.

Web applications can quickly become riddled with race conditions such as these. I’ve spotted at least two more in Drupal’s core code in the past. When the ‘bugs’ occur as a result they can be tricky to pin down, appearing to be freakish random occurrences. Worse yet, even when found they can be a royal pain to untangle.

3 thoughts on “Race conditions and caching variables

  1. mikeytown2 says:

    We removed cron.php and replaced it with our own file. It does a couple of things we like. Logs the slowest cron hook to a file; only allows cron to be ran every 45 minutes via checks to cron_last by directly querying the database; uses locks to prevent more then 1 cron from running at a time.

    Like

  2. dalin says:

    If I understand your explanation correctly, for this to happen drupal_cron_run() runs start to finish faster than P2 can run cache_get(). Seems unlikely even if there’s almost no data in the DB, or if no implementations of hook_cron choose to do anything. The only way that I could see this happening is if you are using MyISAM for the cache table and are experiencing some table locks due to a high insert rate.

    Don’t get me wrong, I think http://drupal.org/node/973436 is a good thing, but your situation seems very strange.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s