Apex cold starts and class caching misses

In the last couple of posts I have looked at class deploy times and class recompile times. To round this off, it’s time to look at class caching on Salesforce.

You have almost certainly seen the impact of class caching at some point, or rather of cache misses, when some page takes an unusually long time to load the first time you try to use it. At work we generally talk of these events as being ‘cold starts’, which is a play on a cold cache. However, they can also be caused by recompilation of invalid classes, so to avoid any confusion, in this post I am only going to look at cold starts caused by a cold cache. None of the tests I did for this post involve invalid class recompilation.

What do they look like?

It turns out cold starts are actually quite hard to capture. I had to run quite a lot of tests to get a few samples, nine days’ worth of tests actually:

This shows the execution time for an anonymous Apex call measured every hour for those nine days. The spikes here are what we are looking for: calls that take significantly longer than the average. There is no real pattern as to when these occurred. Some were in the middle of the night, others during work hours. Some clustered close together, others occurred in isolation.
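If you want to pick spikes like these out of your own timing data, a minimal sketch is to flag any sample well above the median. The sample values, the function name `find_cold_starts` and the threshold factor of 3 are all assumptions for illustration, not the data or method used in this post.

```python
from statistics import median

def find_cold_starts(samples_ms, factor=3.0):
    """Return (index, value) pairs for samples above factor x the median.

    The median is used as the baseline because a handful of large spikes
    would drag a mean upwards and hide themselves.
    """
    baseline = median(samples_ms)
    threshold = baseline * factor
    return [(i, t) for i, t in enumerate(samples_ms) if t > threshold]

# Hypothetical hourly timings in milliseconds (NOT the measurements from this post).
hourly_ms = [210, 198, 205, 1950, 202, 215, 207, 2400, 199, 204]

spikes = find_cold_starts(hourly_ms)
print(spikes)
```

A fixed multiple of the median is crude, but it is probably good enough when the spikes are as far from the baseline as cold starts tend to be.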

I am using anonymous Apex just because it was easy to run from my test harness. Customers normally see this type of behaviour when trying to access pages, but the real cause is somewhere in the class cache system: this org was not being used by anyone while the test was running, so no classes could have been invalidated. A number of these cold starts may have been caused by maintenance work on the instance. There is really no way to tell the root cause, but the impact is obvious.

If you have watched the new compiler talk you might recall that there are two caches used for Apex classes, a level 1 cache on each application server and a level 2 cache shared between application servers. It’s not clear from this data which cache is causing these spikes. I would like to understand that, but pragmatically it does not really matter to our customers. If they see this kind of behaviour often enough, they are going to start questioning our ability to write good software.

Parallel Tests

The test results above are from a binary tree of classes, with a call to a method on the root node being timed. This method calls the same method on its child class nodes, which in turn call the same method on their child classes. This of course requires all the classes in the tree to be loaded for the call to complete. For this test I used a tree with 2048 classes and added 1kB of unrelated Apex code to each.
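To make the shape of that test concrete, here is a rough sketch of how such a tree of Apex classes could be generated. It is not my actual harness: the class names (`Node0`, `Node1`, ...), the method name `touch` and the heap-style numbering of children are assumptions made for the example, and the 1kB of padding code is left out.

```python
def children(index, total):
    """Heap-style child indexes of a node in a binary tree of `total` nodes."""
    return [c for c in (2 * index + 1, 2 * index + 2) if c < total]

def apex_class_source(index, total):
    """Build Apex source for one node (class and method names are hypothetical).

    Each class returns 1 plus whatever its child classes return, so calling
    the root forces every class in the tree to be loaded.
    """
    calls = "".join(f" + Node{c}.touch()" for c in children(index, total))
    return (
        f"public class Node{index} {{\n"
        f"    public static Integer touch() {{\n"
        f"        return 1{calls};\n"
        f"    }}\n"
        f"}}\n"
    )

def count_loaded(total):
    """Simulate the root call in Python: count the classes it would visit."""
    stack, seen = [0], 0
    while stack:
        node = stack.pop()
        seen += 1
        stack.extend(children(node, total))
    return seen

print(apex_class_source(0, 2048))
print(count_loaded(2048))
```

The timed call is then just the root method, something like `Node0.touch()` run as anonymous Apex, and the value it returns doubles as a check that every class was reached.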

As I knew this was going to take some time to run, I ran the same tests on four other trees of different sizes at the same time so we can compare the impact. Each of these trees has the same total amount of code spread over its classes, so that any costs due to total code size can be ignored. Looking at one day we get this:

Here we have a couple of cold start issues early in the morning. This looked very much like the primary cost was related to the number of classes so I used the data to calculate a cost/class when a cold start happens to get this:

What I think this is telling us is that the cache miss cost is mostly a function of the number of classes, but bigger classes do have some impact. Some part of the cost is proportional to the amount of code you put in the class, which is consistent with the description of ‘inflation’ from the new compiler talk.
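The cost/class figure is simple arithmetic: take the extra time a cold start adds over a normal run and divide it by the number of classes in the tree. Here is a sketch of that calculation. The timings below are invented placeholders to show the sums, not my measurements, and the function name is made up for the example.

```python
def cold_start_cost_per_class(cold_ms, warm_ms, class_count):
    """Extra milliseconds per class that a cold start adds over a warm run."""
    return (cold_ms - warm_ms) / class_count

# Placeholder numbers only (NOT results from this post). Every tree holds the
# same total amount of code, so trees with fewer classes have bigger classes.
runs = [
    {"classes": 128, "warm_ms": 100, "cold_ms": 420},
    {"classes": 512, "warm_ms": 100, "cold_ms": 1124},
    {"classes": 2048, "warm_ms": 100, "cold_ms": 3172},
]

costs = {
    run["classes"]: cold_start_cost_per_class(
        run["cold_ms"], run["warm_ms"], run["classes"]
    )
    for run in runs
}

for class_count, cost in costs.items():
    print(f"{class_count} classes: {cost:.2f} ms/class")
```

If the cost depended only on the number of classes, the per-class figure would be the same for every tree. A higher figure for the trees with fewer, bigger classes is what you would expect if class size matters too.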

Should I worry about this?

This is a hard question to answer. The pages and triggers of your product need to pull in quite a lot of classes for a cold start to be significant, so this is not really a problem for smaller products, but it also depends on how tolerant your customers are of slow response times.

If you already have, or suspect you will have, response time concerns, what you should be wary of is how you architect and design your product. Using lots of small classes will help your deploy times but could also increase the impact of cold starts. What I can’t really shed much light on is whether the same sort of patterns are common across instances, or whether the incidence varies much over the medium to long run of months. This is just a snapshot of about nine days on one instance.

Next Steps

My takeaway from the results in this post and the last two is that some of the things I do when coding affect my own developer experience and customer response times in ways I have not thought through. This feels poor, but I don’t yet have a new mental model of what I should be aiming for instead. If I find a happy place I will let you know…
