Go faster, Top Gear!

I work on http://www.topgear.com. We get quite a lot of traffic. Quite a lot being 30 million url requests per day, adding up to about 500GB. That comes from about 800,000 page views. Spikes are nasty too – about 1000 urls/second at the tastiest times. And interestingly enough, we could get more eyes on our prize if we could serve it. My current focus is to make this possible.

At the provocation of @fastchicken, I will knock out a few posts about the direction we are going with this. This is a simple one to start with; it gets funkier from here! I’ll skip any talk of optimising on the server side – it’s kind of obvious what to do there, and it depends on the technologies involved. In our case we cache lots of things in memory (especially database calls via MVC action caching), and try to cache them for as long as possible. But I can elaborate if prompted.

What I plan to focus on is static http caching, especially via the use of CDNs. So, a quick rundown on CDNs (Content Delivery Networks). They are basically big caches/proxies/http dumps. They sit in front of your servers, and if you say so, they hold onto some of your urls for a bit. When User A hits http://www.topgear.com/images/fakenose.jpg, the CDN holds onto the response, so that when User B hits http://www.topgear.com/images/fakenose.jpg your servers don’t serve it, the CDN does. That is one less request served by you, which equates to a slightly cooler server. Like most good things it is a simple idea, with infinite considerations and permutations. We use the CDN Akamai, so I’ll drop the term CDN and just say Akamai from here on.

As of now, topgear.com only Akamais purely static urls. Static urls can be defined as “for a given url, the exact same resource should be returned, at least for a given time period”. In our case this means released files (js/css/design images) and editorial images (images published through our cms). Both of these types are totally cool to serve to everyone with exactly the same response.
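
As an aside, how does the origin “say so”? Through ordinary http caching headers. Here is a minimal sketch, assuming a hypothetical ASP.NET MVC controller and action (in reality our static files are served by the web server plus Akamai config, but the principle is the same): the response declares itself publicly cacheable with a long lifetime, so any cache in front of us is allowed to hold onto it.

using System.Web.Mvc;
using System.Web.UI;

public class ImagesController : Controller
{
    // Duration is in seconds; Location = Any is roughly equivalent to sending
    // "Cache-Control: public" with a one-year lifetime, which tells Akamai
    // (and browsers) they may serve this without touching the origin.
    [OutputCache(Duration = 31536000, Location = OutputCacheLocation.Any, VaryByParam = "none")]
    public ActionResult FakeNose()
    {
        return File(Server.MapPath("~/images/fakenose.jpg"), "image/jpeg");
    }
}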

This alone (before yesterday) was offloading about 70% of our traffic. That is to say, 70% of the bytes being sucked into browsers were not coming from our servers. We still serve the other 30% because:

a) The CDN is actually made up of tens of thousands of servers; each has its own cache and must request the original url itself.
b) HTML is considered dynamic, so it isn’t cached.

Yesterday we turned on an option that lessens the effect of a). Akamai talks in terms of origin traffic (stuff coming out of your servers) and edge traffic (stuff coming out of their servers going to the browser). As a customer you pay for each byte of edge traffic. But there is a third type, midgress traffic. This is traffic within the Akamai network, which can be leaned on to get your origin traffic down. Put simply, instead of tens of thousands of edge servers going straight to your origin, their cache can be set up so that the edges communicate via a mid tier of servers. This is known as the cache hierarchy, or tiered distribution. These mid servers hang onto the origin responses that are heading to the edges, and use them to satisfy other edge servers that would normally have hit the origin.

Magic!

You have to pay for the extra midgress traffic coursing through Akamai’s veins, hence why it ain’t on by default, and why a bit of digging was needed to find out how to do it. For us though, this has decreased our origin traffic by a full 33%, for an increased cost of 15%. We now have 80% origin offload (origin traffic went from 30% of total bytes down to 20% – hence the 33% drop) with a flick of a switch.

Excellent.

Next time, we’ll get much dirtier and discuss how to Akamai html. This is a work in progress (not live yet). It aims to cache HTML even though a given url does not always return the same response (you might have a different header because you are logged in). How? Clues: it doesn’t rely on javascript (BBC sites can’t), and Akamai lets you set the cache key based on anything in the request, not just the url.

South right hemisphere

Not letting a holiday get in the way of a good interview, I met up with a company called Right Hemisphere while I was in Auckland this week. The plan being to work with them when I return to NZ next year. A simple and slightly daring plan. Some of the best plans are, I suppose.

The gist of what they do seems to be centred around the management of 3D information, especially in the domain of industrial design. In essence it is about creating a seamless information flow from the design application (CAD etc) through to manufacturing and maintenance. Content management for helicopters, basically. I can’t help but feel I might become a real programmer again. No offence web world, but no real work gets done in a browser, right?

To get the wheels spinning, I have now decided to embark on two different streams of work. First is Spatula, as outlined in my previous post. Second, I plan to get my WPF skills back up to semi-pro levels. My tactic is going to be to port an app I wrote at uni: a physical cloth simulation originally created in C++ and OpenGL. I don’t think I will change too much – probably just make the cloth itself directly manipulable and use nice UI elements to set the various force parameters.

Check out here for my original code. Amazing to think I used to actually kind of know maths and stuff. There is blimmin matrix multiplication and discrete integration in there, by heck.
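
For the uninitiated, here is a rough idea of what the “discrete integration” bit means in a cloth sim. This is a minimal sketch, not the original code: one explicit Euler step for a single hypothetical particle. The real thing does this for a whole grid of spring-connected particles, summing spring, damping and gravity forces each frame.

// One particle of the cloth, in 2D for brevity.
struct Particle
{
    public double X, Y;    // position
    public double Vx, Vy;  // velocity
}

static class ClothStep
{
    // Advance a particle by one timestep dt, given the net force acting on it
    // (springs + damping + gravity, summed elsewhere).
    public static Particle Step(Particle p, double forceX, double forceY, double mass, double dt)
    {
        p.Vx += (forceX / mass) * dt;  // acceleration integrated into velocity
        p.Vy += (forceY / mass) * dt;
        p.X += p.Vx * dt;              // velocity integrated into position
        p.Y += p.Vy * dt;
        return p;
    }
}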

Ooh almost forgot, blogs are meant to be opinionated in some way. In that case, I would like to say my home country is more beautiful than your home country. See the attached views out my plane window as proof.

Spatula – aka jumping on the NoSQL bandwagon

That’s what I’ll call it. This O/C/DMS thing. It’s a simple name for a simple idea.

And I mean a really simple idea. It really won’t do much. It’s not ambitious, overly clever or particularly revolutionary. It’s just something I wish I had. And based on the assumption that I am not in a completely weird position, it might be useful for others. It might disappear; I might find it is not novel and should be replaced with something blatantly obvious that already exists. But for now, it’s interesting to me.

So, in short, this is what I think it should do:

  1. Provide a friendly UI for the editing of content, including the usual mixture of dates, hyperlinks, html, numbers, text, etc. I think this content should be very much limited to the real core content – avoiding view/layout-specific stuff wherever possible.
  2. Provide workflow facilities on top of this content, to allow the publishing model that almost every real-world content editing scenario needs.
  3. Incorporate versioning in this workflow, so that content clients can detect and act on changes.
  4. Not be statically dependent on any model or schema for this content, to allow general reuse and consistency.
  5. Handle assets, such as images and video.
  6. Have a mechanism to make this content available to a client, preferably using a strong domain-specific model. This is in contrast to the common situation of being faced with key/string pairs that are a nightmare to write code on top of.
  7. Allow items of content to have structure and relationships with other items of content.

These are the things I have found to be necessary when creating content for websites and services.

Here is my broad plan for how this could be done (note: this assumes basic knowledge of document databases – you might need to look some of this up to follow what I mean):

  1. Use document databases (such as CouchDB, Mongo, Raven) for their ability to store JSON documents without having to have static knowledge of the document resources they are storing.
  2. Use the attachment features of these dbs to manage assets such as videos and images.
  3. Use the document structure to represent the “natural” aggregate structure of content. For example, a car page is made of subparts (the car name, review, makes, models) which are most easily understood by editors as a single thing.
  4. Use the index features in these databases to allow relationships to be set to documents outside the current aggregate. An example might be a home page aggregate, in which you would choose a number of articles via an index into those articles. This index could limit the articles in any way desired, such as by date range, category or any arbitrary part of the article document. These references between documents are a natural part of all document dbs.
  5. Use the versioning features of these dbs to handle workflow. The versioning strategy may depend on the db, but will probably require one document per version, with an extra key to tie these versions together.
  6. JSON schema documents will be used as the “ui overlay” to allow these documents to be created, validated and edited easily and dynamically. That is, Spatula will read a list of JSON schemas (annotated with quite a few extras) and use these to construct a UI around each schema. When this UI is filled out, a document matching the schema is written to the document db.
  7. The client then simply needs to read straight from the document db and deserialize these documents into in-memory objects using whatever techniques the document db has available. Most seem to have http/REST, at the very least. The result is very simple – the client (e.g. a REST service) has all the objects it needs in its own native format – no mapping, slicing or coercing needed (see the sketch after this list). These could even be updated (even in ways that invalidate the original schema) with no hassles caused to Spatula. Probably.
  8. As a bonus… any website or service written on top of this would absolutely fly, because most pages would involve loading only a single-digit number of documents from a document db.
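
To make points 3 and 7 concrete, here is a minimal sketch – all names hypothetical, using Json.NET for the deserialization – of a car page aggregate coming straight out of the document db and into a domain object:

using System;
using System.Collections.Generic;
using Newtonsoft.Json;

// Hypothetical aggregate: a car page and its natural subparts, stored as one JSON document.
public class CarPage
{
    public string Id { get; set; }
    public string Name { get; set; }
    public string Review { get; set; }
    public List<string> Models { get; set; }
    public DateTime Published { get; set; }
}

public static class SpatulaClientSketch
{
    public static void Main()
    {
        // Stand-in for a document fetched over the db's http/REST interface.
        string json = @"{ ""Id"": ""car-pages/1"", ""Name"": ""A car"",
                          ""Review"": ""Quite quick."", ""Models"": [""Base"", ""Fast""],
                          ""Published"": ""2010-07-01T00:00:00Z"" }";

        // Straight from document to domain object - no mapping, slicing or coercing.
        CarPage page = JsonConvert.DeserializeObject<CarPage>(json);
        Console.WriteLine(page.Name + " (" + page.Models.Count + " models)");
    }
}

That CarPage class is also, roughly, the shape the editing UI would be generated from via its JSON schema (point 6).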

A lot of this is already done in the Top Gear system, and the end result is very similar. Except we use a flat key/value-based CMS database that is mapped at publish time to a strong domain model. This is then stored via nHibernate into a sql database. Then on page render, this model is loaded out of the database into the domain model again, which is then rendered in the usual MVC way. Spatula, I believe, could achieve this much more directly, more simply and definitely much more quickly.

My Plan A is to use Mongo as the db and RoR for the document editing UI. Mongo for its attachment and versioning support, RoR for its general no-fuss-ness and dynamic nature. I suspect the RoR bit could possibly even be replaced with some sort of plugin into another CMS. I think I will only know once I’m there though.

In my next post I might be at the point where I can give some very specific examples, or even code.

CM bloody S

There mustn’t be many sentences in the world of IT more frightening than:

“There isn’t a CMS that suits our needs. I think we should write one”

Having said that, every site I have worked on at the BBC used… a custom CMS. Urk! Why? Dear god!!!

The reason we did this, as far as I can tell, is because… well… we never actually needed a CMS. Not what most people call a CMS anyway.

For me a pure CMS is this:
A system that allows editorial users to manage web content.

For everyone else on the internet, CMS seems to mean:
A system that allows editorial users to manage a website. So we need articles. And to be able to set page titles. And layout. And set the colours of the headings. And control SEO. And set what the 404 page looks like. And introduce paging of comments. And forums (with moderation!). And blogs of course. And it has to allow extensibility through scripting. And manage users. And, and, and…

According to this definition, there are many stable, strong, excellent products in the world, whose job is to provide powerful, simple tools for creating websites: Umbraco, Drupal, Joomla, WordPress, Expression Engine. These products can tick through the average set of requirements with confident ease. If you need one of these things – well, happy days, son. You’ve got choices!

But I don’t want any of them. I actually don’t need all of the things these systems are so proud of.

The systems I have worked on take a very opinionated and kind of arrogant attitude. They are all written by developers. Good developers, backed by strong design and editorial control. We know how to write web sites! We don’t need a CMS to get paging going on a gallery. We really don’t. And we like the advantages that this control affords us.

Does it cost money? Definitely. But it means we can use whatever technology we want, and change when we want. We can run it all off a database with a strong schema. We can control deployment through local, testing and production environments. It means that the website part of the system is easy to integrate with other behind-the-scenes processes that shovel content to and from 3rd parties. It means when we need to write a shopping cart we just go ahead and do it, without needing to tiptoe around a “CMS” that has decided to set the rules of the game. For sure, this choice has serious implications. But time and time again, we have decided to take that choice.

Not only that, sometimes we aren’t even writing a damn website! My last project was writing a service to back an iPhone app. What does Drupal have to say to that? Service content (in this case versioned content via JSON/REST) is still content. So why am I left in the wilderness? By trying to do so very much, the big guns are just massively inappropriate for these kinds of needs. So we went our own way. Again.

But… we still need something that controls how we get content into our system. So we wrote it, called it a CMS, and have confused everyone at our company essentially forever since, whenever they try to compare it to all the other ones everyone raves about.

So either we (we being BBC Worldwide, in particular the creators of http://www.topgear.com) are mad… or just daring. Maybe there should be a rule that everyone, when faced with a big project (a website, or something that can be forced into the shape of a website) should just download Drupal/Expression Engine/etc and get on with it, no exceptions. Maybe this is true, and I (and many others) are simply blind to this sense.

Or alternatively we are similar to many others in the world. We write our systems ourselves, thank you. We just need a good way of getting content into them. Top Gear developed such a way. It’s not bad, and it’s not perfect. It was quite hard to do.

I have recently had a better idea about how to create such a system.

Let’s call it an… O(bject)MS. Or an R(esource)MS. Or, maybe even a D(ocument) Management System. I would LOVE to call it a CMS, but it seems the internet has simply outvoted me.

My next post will explain this new idea.

Getting started with Entity Framework

I am really interested to see how this one goes. I am quite experienced with nHibernate, and like many people I turned up my nose at EF when it first came out. It was misguided and badly implemented, everyone said (and I agreed).

But when Marcel (http://www.marcdormey.com/) showed me Linq2Sql and the magic of deferred execution, I was intrigued and moderately delighted. Entity Framework (4) is its successor, and that is what I shall get into now.
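
If you haven’t seen it, this is roughly the trick that won me over – a minimal sketch, assuming the SchoolEntities context and Departments set from the edmx further down this post. The query is only a description until something enumerates it, at which point the SQL is generated and run:

using System.Linq;

static class DeferredExecutionSketch
{
    static void Run()
    {
        using (var context = new SchoolEntities())
        {
            // Nothing hits the database here - the query is only an expression tree.
            IQueryable<Department> bigSpenders = context.Departments
                .Where(d => d.Budget > 1000000m);

            // SQL is generated and executed here, when the query is enumerated.
            var names = bigSpenders.Select(d => d.Name).ToList();
        }
    }
}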

I’m going to start with this article: http://msdn.microsoft.com/en-us/magazine/dvdarchive/ee336128.aspx. But it’s a bit terse and doesn’t fit well with the model/mapping mentality I have from my nHibernate background.

This looks better: http://msdn.microsoft.com/en-us/library/bb399182.aspx. But it generates the model from the database as well. Rats. Where is some POCO?

Aha! Maybe these will do it:

http://blogs.msdn.com/b/adonet/archive/2009/05/21/poco-in-the-entity-framework-part-1-the-experience.aspx
http://blogs.msdn.com/b/adonet/archive/2009/05/28/poco-in-the-entity-framework-part-2-complex-types-deferred-loading-and-explicit-loading.aspx
http://blogs.msdn.com/b/adonet/archive/2009/06/10/poco-in-the-entity-framework-part-3-change-tracking-with-poco.aspx

And this one for when things get more serious:
http://thedatafarm.com/blog/data-access/agile-entity-framework-4-repository-part-1-model-and-poco-classes/

It looks like this is going to work – a lot of things just have to match up. Woe betide you if you change namespaces after generating your edmx, for example: lots of runtime exceptions about not being able to find namespaces etc. Should get it working soon though. Following the first set of links is turning out to be really handy, as it includes a downloadable Northwind sample.

Having connected the dots, I found my entities could not be mapped due to a difference in casing between the model mappings generated by the edmx and what I had coded on my POCO classes. “Mapping and metadata information could not be found” was probably not the best exception to help me here! How do you debug mapping problems in EF? I guess I will find out soon.
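
For context, the POCO side is nothing more than a plain class, but the property names (and casing) have to line up exactly with the conceptual model – something like this hypothetical Department, matching the edmx shown further down:

using System;

// Plain POCO - no EF base class or attributes. Names must match the
// conceptual model's entity and property names exactly, casing included.
public class Department
{
    public int DepartmentId { get; set; }
    public string Name { get; set; }
    public decimal Budget { get; set; }
    public DateTime StartDate { get; set; }
    public int? Administrator { get; set; }  // nullable in the model, so nullable here
}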

Some other first impressions: looking at the edmx files, there seems to be a lot of repetition compared to nHibernate hbms, and definitely compared to Fluent nHibernate. Do I really need three separate references to every single model property? It seems there is a “storage model” (the table?), the “conceptual model” (the classes?) and then the “mapping” (which ties the other two together?). Seems… wordy. Maybe everything has its right place, but surely some convention over configuration would help here! Death by XML drowning is on the horizon.

Can you spot the identicalness below?

Storage section:

<!-- SSDL content -->
<edmx:StorageModels>
  <Schema Namespace="SchoolModel.Store" Alias="Self" Provider="System.Data.SqlClient" ProviderManifestToken="2008" xmlns:store="http://schemas.microsoft.com/ado/2007/12/edm/EntityStoreSchemaGenerator" xmlns="http://schemas.microsoft.com/ado/2009/02/edm/ssdl">
    <EntityContainer Name="SchoolModelStoreContainer">
      <EntitySet Name="Department" EntityType="SchoolModel.Store.Department" store:Type="Tables" Schema="dbo" />
    </EntityContainer>
    <EntityType Name="Department">
      <Key>
        <PropertyRef Name="DepartmentId" />
      </Key>
      <Property Name="DepartmentId" Type="int" Nullable="false" />
      <Property Name="Name" Type="nvarchar" Nullable="false" MaxLength="50" />
      <Property Name="Budget" Type="money" Nullable="false" />
      <Property Name="StartDate" Type="datetime" Nullable="false" />
      <Property Name="Administrator" Type="int" />
    </EntityType>
  </Schema>
</edmx:StorageModels>

Conceptual section:

<!-- CSDL content -->
<edmx:ConceptualModels>
  <Schema Namespace="SchoolModel" Alias="Self" xmlns:annotation="http://schemas.microsoft.com/ado/2009/02/edm/annotation" xmlns="http://schemas.microsoft.com/ado/2008/09/edm">
    <EntityContainer Name="SchoolEntities" annotation:LazyLoadingEnabled="true">
      <EntitySet Name="Departments" EntityType="SchoolModel.Department" />
    </EntityContainer>
    <EntityType Name="Department">
      <Key>
        <PropertyRef Name="DepartmentId" />
      </Key>
      <Property Name="DepartmentId" Type="Int32" Nullable="false" />
      <Property Name="Name" Type="String" Nullable="false" MaxLength="50" Unicode="true" FixedLength="false" />
      <Property Name="Budget" Type="Decimal" Nullable="false" Precision="19" Scale="4" />
      <Property Name="StartDate" Type="DateTime" Nullable="false" />
      <Property Name="Administrator" Type="Int32" />
    </EntityType>
  </Schema>
</edmx:ConceptualModels>

And now the Mappings:

<edmx:Mappings>
  <Mapping Space="C-S" xmlns="http://schemas.microsoft.com/ado/2008/09/mapping/cs">
    <EntityContainerMapping StorageEntityContainer="SchoolModelStoreContainer" CdmEntityContainer="SchoolEntities">
      <EntitySetMapping Name="Departments">
        <EntityTypeMapping TypeName="SchoolModel.Department">
          <MappingFragment StoreEntitySet="Department">
            <ScalarProperty Name="DepartmentId" ColumnName="DepartmentId" />
            <ScalarProperty Name="Name" ColumnName="Name" />
            <ScalarProperty Name="Budget" ColumnName="Budget" />
            <ScalarProperty Name="StartDate" ColumnName="StartDate" />
            <ScalarProperty Name="Administrator" ColumnName="Administrator" />
          </MappingFragment>
        </EntityTypeMapping>
      </EntitySetMapping>
    </EntityContainerMapping>
  </Mapping>
</edmx:Mappings>

It’s not like it is difficult to understand, just a bit excessively prescriptive. Maybe there are shortcuts, but the xml structure doesn’t suggest so.

Wow! I just added a second part of the model (with a 1-many), guessing that an IList<> would work, and it did. This is starting to feel a lot better.
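
For the record, that second part looked roughly like this – hypothetical names, a Course entity on the “many” side plus a collection property on Department:

using System.Collections.Generic;

public class Course
{
    public int CourseId { get; set; }
    public string Title { get; set; }
    public Department Department { get; set; }   // navigation back to the "one" end
}

public class Department
{
    public int DepartmentId { get; set; }
    public string Name { get; set; }
    // ... other properties as before ...
    public IList<Course> Courses { get; set; }   // the "many" end: a plain IList<> works
}
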
Overall, this has been easier to get started with than nHibernate was when I first started. Good sign?

Setting up Git on Windows with remote repository

Every project needs source control, so let’s set up Git. It’s newish and great. Twitter told me so.

First guess is to use Git Extensions, which needs msysGit and KDiff. This should give me a relatively familiar GUI-based experience, similar to VisualSVN. I’m reading through http://git.or.cz/course/svn.html to give me an idea of what to expect coming from an SVN world. It’s a bit abstract; I will see what happens when I create my repository. Which I imagine I will probably do at least 10 times before getting it right…

Right, so after lots of makefile-looking activity and downloading, “Git” is in my Visual Studio menu bar. Now to actually make it do something.

After a few minutes messing with Git Extensions settings, I’m not getting very far. Setting the paths to git.cmd, git.exe and kdiff just doesn’t seem to be working – the settings don’t change and the checklist items stay red. Looking at TortoiseGit now to see if I have more luck. There are lots of “it’s not ready” comments on the internet, but most are 6 months old. Fingers crossed.

TortoiseGit installed fine.

So now off to GitHub to see what I can find.

Repository created easily enough on GitHub. Now to hook it up locally. Problem is TortoiseGit is giving me a blank dialog when creating a repository. Erk. Maybe this is my problem: http://stackoverflow.com/questions/1286011/tortioisegit-trouble-creating-a-local-repository.

Ah yes. I installed the wrong msysGit! Silly me. Installing http://msysgit.googlecode.com/files/Git-1.6.4-preview20090730.exe and assigning this within the TortoiseGit settings does the trick. I think the first thing I downloaded was the msysGit source code – not needed for this type of stuff.

Time to hook things up. This looks like it should help: http://petermorlion.blogspot.com/2010/03/okay-i-finally-got-git-to-work.html.

Yep, works. Following this, and adjusting the key setup steps to apply to GitHub, and all is well. I can create the local repository and sync to the remote one on GitHub. And… having done all that, Git Extensions now works too. It was that wrong version of msysGit all along. I think I am set up in a similar way to VisualSVN/TortoiseSVN now.

Now onto my next thing on the list – Entity Framework.

Getting going

Righto. So this blog is going to be an outlet, and also a journal, of my efforts to get some more skills under my belt.

This is my second “day per fortnight” to really allow myself time to look into the things I want to look into. The end goal is to propel myself towards a great job when I get back to NZ. I am making no secret of the fact that that job might be at http://www.righthemisphere.com. The sooner they find out, the easier it will be for everyone.

I have figured the easiest way to get this all going will be to contrive some sort of half-real project, which, by sheer force of will, will have everything I am interested in involved. Somehow.

What are those things? So far my list is:

  • Rails
  • Deeper into Linq
  • Entity Framework
  • Umbraco
  • WordPress
  • Android (Java)
  • Silverlight (Windows Phone 7)
  • XNA (Windows Phone 7)
  • Git
  • oData
  • Powershell
  • Jquery
  • WPF
  • Raven
  • Mongo
  • IIS 7

It’s a bit of an eclectic mix. At the moment I think it will have to be some sort of website and/or suite of apps driven by some sort of service. Hmmm… now to decide what that will be…