In building Linkscape we've faced plenty of high-performance computing (HPC) optimization challenges. We've used Amazon Web Services (AWS) extensively, and I can heartily recommend it for the functionality at which it excels. One area I often see neglected in essays on HPC is cost optimization. What are the techniques you, as a technology leader, need to master to succeed on this playing field? Below I describe some of our experience with this aspect of optimization.
Of course, we've had to optimize on the traditional performance front too. Our data set is many terabytes; we use plenty of traditional and proprietary compression techniques. Every day we turn over many hundreds of gigabytes of data, pushed across the network, pushed to disk, pushed into memory, and pushed back out again. To grab hundreds of terabytes of web data, we have to pull in hundreds of megabytes per second. Every user request to the Linkscape index hits several servers and pages through tens or hundreds of megabytes of data, in well under a second for most requests. This is a quality, search-scale data source.
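To give a feel for where numbers like that come from, here is the kind of back-of-envelope arithmetic involved. The specific figures below are illustrative assumptions for the sketch, not our actual crawl targets.

```python
# Back-of-envelope crawl throughput estimate. The figures are
# illustrative assumptions, not Linkscape's actual numbers.

CRAWL_TARGET_TB = 300   # assume ~300 TB of raw web data per index cycle
CYCLE_DAYS = 30         # assume a roughly monthly crawl/index cycle

bytes_total = CRAWL_TARGET_TB * 10**12
seconds = CYCLE_DAYS * 24 * 3600

sustained_mb_per_s = bytes_total / seconds / 10**6
print(f"Sustained pull rate: ~{sustained_mb_per_s:.0f} MB/s")
# ~116 MB/s sustained -- i.e. hundreds of megabytes per second once you
# add headroom for retries, recrawls, and peak-vs-average variation.
```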
Our development began, as all things of this scale should, with several prototypes, the most serious of which started with the following alternative cost projections. You can imagine that the scale of these pies makes these decisions very important.
These charts paint a fairly straightforward picture: the biggest slice up there, on the colocation chart, is "savings". We spent a lot of energy producing these charts, and it was time well spent. We built at least two early prototypes using AWS. So at this point we had a fairly good idea, at a high level, of what our architecture would be, and especially of the key costs of our system. Unfortunately, after gathering quotes from colocation providers, it became clear that AWS, in aggregate, could not compete on a pure cost basis for the total system.
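For readers who want to run this exercise themselves, the structure of the comparison looks roughly like the sketch below. Every dollar figure here is a placeholder to show the shape of the analysis; our actual quotes and line items differed.

```python
# Sketch of the kind of monthly cost projection behind those charts.
# All dollar figures are hypothetical placeholders, not our real quotes.

aws_monthly = {
    "ec2_compute": 18000,   # hypothetical on-demand instance hours
    "s3_storage": 7000,     # hypothetical storage + request costs
    "bandwidth": 25000,     # hypothetical data-transfer charges
}

colo_monthly = {
    "hardware_amortized": 12000,  # hypothetical servers amortized over 36 months
    "rack_power_space": 6000,     # hypothetical rack, power, cooling
    "bandwidth_commit": 8000,     # hypothetical committed bandwidth contract
    "ops_staff": 9000,            # hypothetical remote-hands / ops time
}

aws_total = sum(aws_monthly.values())
colo_total = sum(colo_monthly.values())
print(f"AWS:  ${aws_total:,}/month")
print(f"Colo: ${colo_total:,}/month")
print(f"'Savings' slice: ${aws_total - colo_total:,}/month")
```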
However, what these charts fail to capture is the overall cost of development and maintenance, and many "soft" features. The reason AWS was so helpful during our prototype process has turned out to be the same reason we continue to use it. AWS' flexibility of elastic computing, elastic storage, and a variety of other features are aimed (in my opinion) at making the development process as smooth as possible. And the cost-benefit of these features goes beyond the (many) dollars we send them every month.
When we update our index we bring up a new request-processing cluster, install our data, and roll it into production seamlessly. When the roll-over is complete (which takes a couple of days), we terminate the old cluster, paying for the extra computing for only a day or so. We handle redundancy and scaling out the same way. Developing new features on such a large data set can be a challenge, but bringing up a development cluster of many machines becomes almost trivial, much to the chagrin of our COO and CFO.
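The lifecycle of that rollover looks roughly like the sketch below, written with today's boto3 library rather than the tooling we actually use. The AMI ID, instance type, counts, and cluster tags are all hypothetical; only the bring-up / swap / terminate pattern is the point.

```python
# Rough sketch of a blue/green index rollover on EC2 using boto3.
# All identifiers are hypothetical; production tooling is more involved,
# but the lifecycle (launch new, swap traffic, terminate old) is the same.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Bring up a fresh request-processing cluster with the new index baked in.
new_cluster = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical AMI containing the new index
    InstanceType="m5.2xlarge",         # hypothetical instance type
    MinCount=8,
    MaxCount=8,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "cluster", "Value": "index-v2"}],
    }],
)
new_ids = [i["InstanceId"] for i in new_cluster["Instances"]]

# 2. Roll traffic over (health checks, load-balancer swap, etc. -- omitted).

# 3. Once the rollover is complete, terminate the old cluster so the
#    overlapping capacity is only paid for during the transition.
old = ec2.describe_instances(
    Filters=[{"Name": "tag:cluster", "Values": ["index-v1"]},
             {"Name": "instance-state-name", "Values": ["running"]}],
)
old_ids = [i["InstanceId"]
           for r in old["Reservations"] for i in r["Instances"]]
if old_ids:
    ec2.terminate_instances(InstanceIds=old_ids)
```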
These things are difficult to quantify. But these are critical features which make our project feasible at any scale. And they're the same features the most respected leaders in HPC are using.
All of this analysis forced us to develop a hybrid solution. With it, we've been able to leverage the strengths of colocation for some of our most expensive system components (network bandwidth and crawling), along with the strengths of utility computing on AWS. We've captured virtually all of the potential savings (illustrated below), while retaining most of our computing (and system value) at AWS.
I would encourage any tech leaders out there to consider their system set-up carefully. What are your large cost components? Which of them require the kind of flexibility of EC2 or EBS? Which need the added reliability of something like S3, with its 4x redundancy (which we can't find for less anywhere else)? And which pieces need less of these things? Which can you install in a colocation facility for less?
As an aside, in my opinion AWS is the only utility computing solution worth investigating for this kind of HPC. Their primitives (EC2, S3, EBS, etc.) are exactly what we need for development and for cost projections. Recently we had a spate of EC2 instance issues, and I was personally in touch with four Amazon reps to get them resolved.
2 comments:
Thanks for the insights, Nick. I knew Linkscape was an awesome tool; I just had not considered the true depth of the data that had to be used to support it.
And thanks for your efforts and continuing development of this really phenomenal resource.
Do you use Amazon's SQS service to manage your crawl/request processes? I'm not sure if it's worth it, especially if there's already a centralised MySQL master-master DB on EC2.