Wednesday, March 25, 2009

Advertising fail

Amazon ad for hoity-toity

Not quite ironic enough for FailBlog, but still funny. Now if only the same ad appeared for [hanky-panky]...

Tuesday, March 17, 2009

Looking for a good home: Sammy


This is Sammy. He's a foster cat from the Humane Society who's currently living with us to get a break from the shelter. He's totally adorable so I'm looking for a good home for him!

Sammy's been with us for nearly 2 weeks and he's been super low-maintenance. He's very friendly; he'll immediately walk up to new people rather than hiding or shying away. He has long whiskers and soft, silky fur which only gets silkier the more you pet him. He loves to be petted, especially on his face and belly, and he'll purr pretty much as soon as you touch him.

Sammy on my lap He seems to really like being around people, and is very amenable to whatever ways you want to give him attention. He doesn't mind sitting on your lap, being picked up, cuddled, used as a pillow, etc. He's also very happy to just sit in the same room as you while you're doing whatever; we've spent many hours sewing, knitting, and surfing the web together. He doesn't get freaked out by noises (like the sewing machine, or even the vacuum cleaner). I think he would make a great match for anyone looking for a lap cat, or a companion to just hang out with while watching movies and putzing around the house.

Incidentally, he doesn't do any of the bad stuff that my cats do, like chewing on cords or tipping over wastebaskets and playing in the trash. And he doesn't scratch the furniture (although he does like to knead his claws into the carpet when he's really happy).

If you or anyone you know is looking for a cat like this, please let me know (or contact the Humane Society directly)! You're also welcome to come over and meet him. More photos here.

Sunday, March 15, 2009

Performance Measurement for Small and Large Scale Deployments

As well as powering a few cool tools, Linkscape is a data platform. Performance (and its measurement) isn't just important to reduce user latency, or cut costs. It's actually something we're hoping is part of our core competency, something that adds significant value to our startup. And the shortest path to performance is measurement.

For those in a hurry, jump straight to the tools we're using.

This post is inspired by (and at times borrowed from) an email I sent to some friends for a consulting gig I did recently. But it rings so true, and I come back to it so often, that I thought I would share it. Alex and Nick, I hope you don't mind me sharing some of the work we've done on your very neat, very fun Facebook game.

Let me motivate the need for performance monitoring with a couple of case studies taken from our infrastructure:

This dashboard (above) illustrates 28 hours of load on our API cluster. I can immediately see service issues on the first server (the red segment of the first graph). This is correlated with a spike in CPU and some strange request patterns on the second server (the layered, multi-colored bar on the graph below). The degraded service lasted for a few hours, which was a configuration issue I solved in our monitoring framework. It should have guaranteed downtimes of no more than 4 minutes.

Even after solving our monitoring issue I still needed to investigate the underlying issue: I can see the CPU and request pattern are related. Ultimately I solved this issue within two weeks. Without this kind of measurement I would not even have known we had an issue, and would not have had the data to solve it.

From part of our back-end batch-mode processing, we had thought we'd tuned our system about as well as we could. At times we were pulling data through at a very respectable pace, roughly 10MB/sec per node. but we had also observed occasional unresponsiveness on nodes, with a corresponding slowness in processing. We left the system alone for a while, thinking, "don't fix it if it ain't broke". But recently we've been tuning performance for cost reasons; so we came back to this system.

Once we instrumented our machines with performance monitoring (illustrated above) we saw that the anecdotes were actually part of a worrying trend: the red circles show this. Our periods of 10MB/sec throughput are punctuated by periods of extremely high load. The graphs above show load averages of 10 or more on 4 core nodes, along with one process spiking up to hundreds of megabytes and nearly exhausting system memory. This high system load dramatically reduced our processing throughput.

It turned out that the load was caused by a single rogue program which consumed all available system memory due to buffered I/O. Usually we have a few I/O pipelines and give each many megabytes for buffering. However, this program has many dozens of pipelines, altogether consuming nearly a gigabyte of memory. This lead to significant paging and finally thrashing on disk.

Once we reduced the size of buffers (from roughly 40-100MB per pipeline to just 1-2MB per pipeline) we saw dramatic improvements in performance: a nearly 60% boost! And the nodes have become dramatically more responsive—no more load averages of 10+. The graphs above show load average maxing out at 4 and plenty of memory available. The data suggest that we might even be able to nearly double our performance with the same hardware by increasing parallelism and running another pipeline on each node.

All of this work is powered by simple monitoring and measurement techniques. Sometimes this has lead to significant, but necessary engineering work. But sometimes it's lead to a single afternoon's efforts yielding a 60% performance boost, with an opportunity to nearly double performance on top of that.

We're using a few tools:

  • collectd measures the system health dimensions (cpu, mem usage, disk usage, etc.) and sends those measurements to a central server for logging.
  • RRDTool records and visualizes the data in an industry standard way.
  • drraw gives me a very simple web interface to view and manage my visualizations.
  • Monit watches processes and system resources, bringing things back up if they crash and sending emails if things go wrong.

These tools work together, in an open, plug-in powered way. I could swap out individual components and move to other tools, such as Nagios (which I've used for other projects) or Cacti (which I have not used).

Whether you're an on-the-ground operations engineer looking to watch system health and fix issues before they turn into downtime, or you're managing large-scale engineering, looking to cut costs and squeeze out more page or API hits, these tools and techniques point you in the right direction and give you hard data to justify your efforts after the fact. We've had many high ROI efforts initiated and justified by this kind of measurement.

Sunday, March 8, 2009

Why is this Report So Slow: Let the Database Handle the Data

We have a data-rich report in our Linkscape tool with even more in our Advanced Report. We think the data is great. But the advanced report can be awfully slow to load. Don't get me wrong, we think it's worth the wait. But this kind of latency is a challenge for many products, and clearly, there's room for improvement. We're finding improvements by porting logic from the front-end into the data layer, and by paging through data in small chunks.

We present our data (links) in two forms. One is an aggregated view, showing the frequency of anchor text, one attribute of each link:

We also present a paged list of links, showing all the attributes we've got:

The time we spend on each request is very roughly illustrated by this diagram. From it you can see each component in our system: disk access, data processing, and a front-end scripting environment. I've included the aggregate time the user experiences as well. We have a custom data management system rather than using a SQL RDBMS such as MySQL. But I list it as SQL because SQL presents the same challenge.

In total the user can experience between 15 seconds to three minutes of latency! The slowness comes from a couple of design flaws. The first is that we're doing a lot of data processing outside our data processing system. Saying that programming environment doesn't matter is a growing trend, which has some advantages; rapid development comes to mind. But (and I'm a back-end data guy, so I'm a bit biased) it's important to let each part of your system do the work it's best at. For presentation and rapid prototyping that means your scripting environment. But for data that means data processing.

We're currently working on moving data processing into our data layer, resulting in performance something like that illustrated in the diagram below. The orange bars represent time spent in this new solution; the original blue bars are included for comparison.

In addition to latency improvements, pulling this logic out of our front-end adds that data to our platform and consequently makes it re-usable by many users and by applications. The maintenance of this feature will then lie in the hands of our data processing team, rather than our front-end developers. And we've taken substantial load off of our front-end servers, in exchange for a smaller amount of extra load on our data-processing layer. For us this is a win across the board.

The other problem we've got is that we're pulling up to 3000 records for every report, even though the user has a paged interface. And those 3000 records are generated from a join which is distributed across our data management platform, involving many machines, and potentially several megabytes of final data pulled from many gigabytes of source data.

The other big improvement we want to introduce is to implement paging at the data-access level. Since our users already get the data in a paged interface, this will have no negative effect on usability. And it'll make things substantially faster, as illustrated (again very roughly) below. The yellow bars illustrate the new solution's projected performance. Orange and blue bars are included for comparison.

The key challenge here is to build appropriate indexes for fast, paged, in-order retrieval of the data by many attributes. Without such indexes we would still have to pull all the data and sort it at run-time, which defeats the purpose.

In the solution we're currently working on we've addressed two issues. First, we've been processing data in the least appropriate segment of our system. Process data in your data management layer if possible. Second, we've been pulling much more data than we need to show to a user. Only pull as much data as you need to present to users; show small pages if you can. The challenges have been to port this logic from a rapid prototyping language like Ruby into a higher-performance language like C or stored procedures, and to build appropriate indexes for fast retrieval. But the advantages of this work are substantial, and are clearly worth it.

These issues are part of many systems out there and result in both end-user latency problems, as well as overall system scalability problems. Fixing those problems results in higher user satisfaction (and hopfully higher revenue), and reduces overall system costs.

By the way, we haven't released anything around these improvements in performance yet. If you want to keep up to date on Linkscape improvements watch the SEOmoz Blog or follow me on Twitter @gerner.

Wednesday, March 4, 2009

High Performance Computing at Amazon: A Cost Study

In building Linkscape we've had a lot of high performance computing optimization challenges. We've used Amazon Web Services (AWS) extensively and I can heartily recommend it for the functionality at which it excels. One area of optimization I often see neglected in these kinds of essays on HPC is cost optimization. What are the techniques you, as a technology leader, need to master to succeed on this playing field? Below I describe some of our experience with this aspect of optimization.

Of course, we've had to optimize the traditional performance front too. Our data set is many terabytes; we use plenty of traditional and proprietary compression techniques. Every day we turn over many hundreds of gigabytes of data, pushed across the network, pushed to disk, pushed into memory, and pushed back out again. In order to grab hundreds of terabytes of web data, we have to pull in hundreds of megabytes per second. Every user request to the Linkscape index hits several servers, and pages through tens or hundreds megabytes of data in well under a second for most requests. This is a quality, search-scale data source.

Our development began, as all things of this scale should, with several prototypes, the most serious of which started with the following alternative cost projections. You can imagine the scale of these pies makes these deicions very important.

These charts paint a fairly straight forward picture: the biggest slice up there, on the colocation chart, is "savings". We spent a lot of energy to produce these charts, and it was time well spent. We built at least two early prototypes using AWS. So at this point we had a fairly good idea, at a high-level of what our architecture would be, especially the key costs of our system. Unfortunately, after gathering quotes from colocation providers, it became clear that AWS, in aggregate, could not compete on a pure cost basis for the total system.

However, what these charts fail to capture is the overall cost of development and maintenance, and many "soft" features. The reason AWS was so helpful during our prototype process has turned out to be the same reason we continue to use it. AWS' flexibility of elastic computing, elastic storage, and a variety of other features are aimed (in my opinion) at making the development process as smooth as possible. And the cost-benefit of these features goes beyond the (many) dollars we send them every month.

When we update our index we bring up a new request processing cluster, install our data, and roll it into production seamlessly. When the roll-over is complete (which takes a couple of days), we terminate the old cluster, only paying for the extra computing for a day or so. We handle redundancy and scaling out the same way. Developing new features on such a large data set can be a challenge, but bringing up a development cluster of many machines becomes almost trivial on the development end, much to the chagrin of our COO and CFO.

These things are difficult to quantify. But these are critical features which make our project feasible at any scale. And they're the same features the most respected leaders in HPC are using.

All of this analysis forced us to develop a hybrid solution. Using this solution, we have been able to leverage the strength of co-location for some of our most expensive system components (network bandwidth and crawling), along with the strengths of utility computing with AWS. We've captured virtually all of the potential savings (illustrated below), while retaining most of our computing (and system value) at AWS.

I would encourage any tech leaders out there to consider their system set-up carefully. What are your large cost compentents? Which of them require the kind of flexibility of EC2 or EBS? Which ones need the added reliability of something like S3 with it's 4x redundancy? (which we can't find for less anywhere else). And which pieces need less of these things? Which you can install in a colocation facility, for less?

As an aside, in my opinion AWS is the only utility computing solution worth investigating for this kind of HPC. Their primitives (EC2, S3, EBS, etc.) are exactly what we need for development and for cost projections. Recently we had a spate of EC2 instance issues, and I was personally in touch with four Amazon reps to address my issue.