I work on http://www.topgear.com. We get quite a lot of traffic. Quite a lot being 30 million url requests per day, adding up to about 500GB. That comes from about 800,000 page views. Spikes are nasty too – about 1000 urls/second at the tastiest times. And interestingly enough, we could get more eyes on our prize if we could serve it. My current focus is to make this possible.
At the provocation of @fastchicken, I will knock out a few posts about what direction we are going with this. This is a simple one to start with, it gets funkier from here! I’ll skip any talk of optimising on the server side – it’s kind of obvious what to do here and depends on the technologies involved. In our case we cache lots of things in memory (especially database calls via MVC action caching), and try and cache them for as long as possible. But I can elaborate if prompted.
What I plan to focus on is static http caching, especially via the use of CDNs. So, a quick rundown on CDNs (Content Delivery Networks). They are basically big caches/proxies/http dumps. They sit in front of your servers, and if you say so, they hold onto some of your urls for a bit. When User A hits http://www.topgear.com/images/fakenose.jpg, the CDN holds onto the response, so that when User B hits http://www.topgear.com/images/fakenose.jpg your servers don’t serve it, the CDN does. That is one less request served by you, which equates to a slightly cooler server. Like most good things it is a simple idea, with infinite considerations and permutations. We use the CDN Akamai, so I’ll drop the term CDN and just say Akamai from here.
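The User A / User B scenario above can be sketched in a few lines. This is a toy model, not how Akamai is actually built – `EdgeCache`, `origin_fetch` and the TTL are all made up for illustration:

```python
import time

class EdgeCache:
    """Toy model of a single CDN edge server (hypothetical, for illustration)."""

    def __init__(self, origin_fetch, ttl_seconds=3600):
        self.origin_fetch = origin_fetch   # function that actually hits the origin
        self.ttl = ttl_seconds
        self.store = {}                    # url -> (response, expiry timestamp)
        self.origin_hits = 0

    def get(self, url):
        cached = self.store.get(url)
        if cached and cached[1] > time.time():
            return cached[0]               # served from the edge: origin untouched
        self.origin_hits += 1
        response = self.origin_fetch(url)
        self.store[url] = (response, time.time() + self.ttl)
        return response

# User A and User B request the same image; the origin serves it only once.
cache = EdgeCache(lambda url: f"bytes of {url}")
cache.get("http://www.topgear.com/images/fakenose.jpg")
cache.get("http://www.topgear.com/images/fakenose.jpg")
print(cache.origin_hits)  # 1
```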
As of now, topgear.com only Akamais purely static urls. Static urls can be defined as “for a given url, the exact same resource should be returned, at least for a given time period”. In our case this means released files (js/css/design images) and editorial images (images published through our cms). Both of these types are totally cool to serve to everyone with exactly the same response.
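In HTTP terms, “totally cool to serve to everyone with exactly the same response” is usually expressed with response headers along these lines (the exact values are illustrative, not our production config):

```
HTTP/1.1 200 OK
Content-Type: image/jpeg
Cache-Control: public, max-age=31536000
```

`public` says any shared cache may store it, and `max-age` is the “given time period” in seconds – here a year, which is reasonable for released files whose urls change on each release.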
This alone (before yesterday) was offloading about 70% of our traffic. That is to say, 70% of the bytes being sucked into browsers were not coming from our servers. We still serve the other 30% because:
a) The CDN is actually made of 10,000s of servers, and each has its own cache and must request the original url itself.
b) HTML is considered dynamic, so isn’t cached
Yesterday we turned on an option that lessens the effect of a). Akamai talks in terms of origin traffic (stuff coming out of your servers) and edge traffic (stuff coming out of their servers going to the browser). As a customer you pay for each byte of edge traffic. But there is a third type, midgress traffic. This is traffic within the Akamai network, which can be leant on to get your origin traffic down. Put simply, instead of having the 10,000s of edge servers going straight to your origin on a cache miss, they can be set up to communicate via a mid tier of servers. This is known as the cache hierarchy or tiered distribution. These mid servers hang on to the origin responses that are heading to the edges, and use them to satisfy other edge servers that would normally have hit the origin.
Magic!
You have to pay for the extra midgress traffic coursing through Akamai’s veins, hence why it ain’t on by default, and why a bit of digging was needed to find out how to do it. For us though, this has decreased our origin traffic by a full 33%, for an increased cost of 15%. We now have 80% origin offload with a flick of a switch.
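For the sceptical, the arithmetic checks out: a 33% cut to the 30% of traffic we were serving lands us at roughly 80% offload.

```python
total = 100.0                                # total edge traffic, arbitrary units
origin_before = total * 0.30                 # 70% offload meant 30% came from us
origin_after = origin_before * (1 - 0.33)    # the 33% midgress saving
offload_after = (total - origin_after) / total
print(round(offload_after * 100))  # 80
```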
Excellent.
Next time, we’ll get much dirtier and discuss how to Akamai html. This is a work in progress (not Live yet). It aims to cache HTML even though each url does not always return the same response (you might have a different header because you are logged in). How? Clues are – it doesn’t rely on javascript (BBC sites can’t) and Akamai lets you set the cache key based on anything in the response, not just the url.