Cache-able Queries

We just released the first version of Johnny Cache. It's very exciting to see it done and available to the public. Jason wrote a great write up about the path Johnny has taken recently, so I won't re-hash what he said. Instead, I thought I would give an overview of query cache-ability using the framework.

Just to give a brief overview of how Johnny Cache works, it takes all reads from the database and caches them based on a group of generational keys per table. When a table is written to, the generational key is changed "invalidating" that table from reads. Johnny Cache builds a hash of the tables, sql, and parameters to cache the query.

There are a few ways to optimize the query caching of Johnny.

1. Avoid Dynamic Queries

Dynamic queries are queries that have a dynamic component, typically a date/time stamp. Johnny will not be able to cache queries that are randomly seeded or stamped with the current time since the date/time is inserted into the sql before it is cached. An example:

Post.objects.filter(created_at__gte=\
(datetime.datetime.now() - datetime.timedelta(60))

will yield (in a simplified form):

SELECT * FROM "blog_post" WHERE "blog_post"."created_at" >=
 "2009-12-28 21:48:14.253433"

As you can see, the date is injected into the sql query, so every time the query is run it will change the cached query. A simple way of make this query cache-able is by changing the date to some solid number. An example:

Post.objects.filter(created_at__gte=\
(datetime.datetime.now().date() - datetime.timedelta(60))

Adding the date() will insert the current date instead of datetime allowing the query to be cached for a full day (or until blog_post is written to)

2. Avoid Joins

Joins across multiple tables will force johnny to use the aggregate generation keys from each table. If one table is changed, all joined queries are invalidated. Though you do want to minimize queries, de-normilization in a web environment is typically a "good" thing (remind me again why we have relational databases on the web?), and will allow for scalability sometimes outside of Johnny. But it also causes johnny to only query each table individually, causing less cache misses.

Ultimately in this case, you need to understand your query and understand that Johnny caches on a per table basis. Stats about your cache hits/misses are great for understanding each table and each page. I recently made a very basic stat analyzer that uses mongodb...not suggested for production, but the link is http://github.com/finder/cache_plot. It will hopefully get better in the near future.

3. Use Transactions

If your database supports transactions, you're in luck...so does Johnny! Transactions delay the actual committing of a query until a defined point in time, whether it be at the end of a request, or manually entered. Johnny caches all local transaction requests to the LocalStore instead of the global cache store (read memcached) until the point of the transaction being committed. This includes table invalidation...so while you're in the middle of a long request changing multiple tables, other users loading in other threads will still be able to read the cached tables until you commit.

You can peruse Django's transaction documentation here: http://docs.djangoproject.com/en/dev/topics/db/transactions/

There are of course other things you can do to avoid cache misses, such as sharding data, but they are well beyond the scope of this article. The most useful tool you can have is an understanding of what is going on at the caching and database layer to avoid cache misses, or perhaps cut down on the number of queries you are doing in the first place.



Comments

Click here to post a comment.

I do not think I have seen this depicted that way before. You really have made this so much clearer for me. Thanks! ()

I do not think I've seen this described that way before. You actually have made this so much clearer for me. Thank you!


Re: johnny-cache vs. mysql query cache ()

Sounds useful. Thanks for the explanations.


johnny-cache vs query cache ()

Niran, I plan on writing a more detailed post outlining why you would want johnny-cache even with something like the mysql query cache (which is more or less mentioned in johnny's documentation).

The first benefit is that it's easier to horizontally scale reads with memcached than with master-slave setups. If you are using django pre 1.2, it's actually quite difficult to use multiple database hosts at all.

The second benefit is that the less memory mysql has to burn on its query cache, the more memory it has to load indexes, etc. This can be a big problem for larger projects; disabling mysql's query cache (to free mem for indexes) is sort of a right of passage epoch the scaling history of a website using mysql.

There are all sorts of little side-benefits towards doing this, too. Connections to the memcached don't count towards your max_connections in mysql, mysql doesn't have to deal with cached queries *at all*, etc.

As for performance, it isn't really designed to out-perform the mysql query cache. Honestly, my gut feeling is that memcached outperforms mysql's query cache, since memcached is designed especially for speedy key-value lookups. For a lot of applications, reducing database load (cpu usage, mem usage) is a huge win, the MQC won't get this for you.


Re: johnny-cache vs. mysql query cache ()

I have not compared the performance differences between Johnny cache and MySql's query cache. It'd be an interesting test, and may be something I need to do. There are a few advantages of using both however. Here's MySql's answer:

http://dev.mysql.com/doc/refman/5.0/en/ha-memcached-faq.html#qandaitem-18-5-5-1-16

In the case where johnny was written, the database was a bottleneck even on a machine with eight cores and 16G ram with a very large query cache (though I'm not the DBA so I don't know exactly how much). The MySql query cache also needs to pre-analyze every query to see if it's in the cache. Johnny does a very fast md5 hash of the query and results, and memcache is a very simple key/value lookup, so that analysis is moved and simplified on your app servers. If it's a miss...mysql may still have the query in its own cache. However, it does add a round trip to memcache, so on a write heavy site it may even be a performance decrease (though from what I've seen I seriously doubt it).

The main advantage that memcache gives you is cheap, expandable, horizontal scalability. To do the same thing in mysql would require replicated databases (more disk space), or a sharded table. Memcache allows you to just add more machines or more memory and get an instant performance gain (to an extent...at least without a more customized installation than Johnny).

After we installed (the prequel to) Johnny we saw a gigantic performance increase on the database since it didn't have to handle nearly as many requests. The site was much faster, the dba thought our application was broken because of a lack of load on the database, we can handle easily 3x, perhaps 10x more traffic (we haven't hit the wall yet like we used to).


johnny-cache vs. mysql query cache ()

Have you compared the performance of johnny-cache with MySQL's query cache? I'd assume there would be little to gain from caching queries in memcache instead of the DB.