Sunday, March 8, 2009

Why Is This Report So Slow: Let the Database Handle the Data

We have a data-rich report in our Linkscape tool, with even more data in our Advanced Report. We think the data is great, but the Advanced Report can be awfully slow to load. Don't get me wrong: we think it's worth the wait. But this kind of latency is a challenge for many products, and clearly there's room for improvement. We're finding improvements by porting logic from the front-end into the data layer, and by paging through data in small chunks.

We present our data (links) in two forms. One is an aggregated view, showing the frequency of anchor text, one attribute of each link:

We also present a paged list of links, showing all the attributes we've got:

The time we spend on each request is very roughly illustrated by this diagram. From it you can see each component in our system: disk access, data processing, and a front-end scripting environment. I've included the aggregate time the user experiences as well. We use a custom data management system rather than a SQL RDBMS such as MySQL, but I've labeled that component SQL because SQL presents the same challenge.

In total the user can experience between 15 seconds and three minutes of latency! The slowness comes from a couple of design flaws. The first is that we're doing a lot of data processing outside our data processing system. Saying that the programming environment doesn't matter is a growing trend, and it has some advantages; rapid development comes to mind. But (and I'm a back-end data guy, so I'm a bit biased) it's important to let each part of your system do the work it's best at. For presentation and rapid prototyping, that means your scripting environment. For data, that means your data processing layer.
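
To make that concrete, here's a sketch of the kind of push-down we mean, written in SQL since that's the common vocabulary (our own data layer isn't SQL, and the links table and column names here are hypothetical). Instead of fetching every link and tallying anchor text in the scripting environment, we ask the data layer for the tallies directly:

    -- Hypothetical schema: one row per link, with a column for its anchor text.
    -- The data layer does the counting, so the front-end receives one row per
    -- distinct anchor text instead of one row per link.
    SELECT anchor_text,
           COUNT(*) AS frequency
    FROM   links
    WHERE  target_url = 'http://www.example.com/'
    GROUP  BY anchor_text
    ORDER  BY frequency DESC;

The front-end's job shrinks to rendering rows, which is the part it's good at.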

We're currently working on moving data processing into our data layer, resulting in performance something like that illustrated in the diagram below. The orange bars represent time spent in this new solution; the original blue bars are included for comparison.

In addition to the latency improvements, pulling this logic out of our front-end adds that data to our platform and consequently makes it re-usable by many users and applications. Maintenance of this feature will then lie in the hands of our data processing team rather than our front-end developers. And we've taken substantial load off our front-end servers in exchange for a smaller amount of extra load on our data processing layer. For us this is a win across the board.

The other problem we've got is that we're pulling up to 3,000 records for every report, even though the user has a paged interface. And those 3,000 records are generated from a join that is distributed across our data management platform, involving many machines and potentially several megabytes of final data pulled from many gigabytes of source data.

The other big improvement we're working on is paging at the data-access level. Since our users already get the data in a paged interface, this will have no negative effect on usability. And it'll make things substantially faster, as illustrated (again very roughly) below. The yellow bars illustrate the new solution's projected performance; orange and blue bars are included for comparison.
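
Sketched again in illustrative SQL (same hypothetical links table and columns as above), a paged request pulls only the rows for the page being viewed:

    -- Fetch the third page of 25 links, ordered by one attribute.
    -- 25 rows cross the wire instead of up to 3,000.
    SELECT *
    FROM   links
    WHERE  target_url = 'http://www.example.com/'
    ORDER  BY anchor_text
    LIMIT  25 OFFSET 50;

The page size and offset come straight from the paged interface the user already sees, so nothing changes on the usability side.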

The key challenge here is to build appropriate indexes for fast, paged, in-order retrieval of the data by many attributes. Without such indexes we would still have to pull all the data and sort it at run-time, which defeats the purpose.
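
In the same illustrative SQL (hypothetical attribute names again), that means one composite index per attribute the user can sort by, so the data layer can walk rows already in order instead of sorting at query time:

    -- With these in place, the ORDER BY ... LIMIT query above becomes an
    -- in-order index walk rather than a full pull-and-sort.
    CREATE INDEX idx_links_by_anchor ON links (target_url, anchor_text);
    CREATE INDEX idx_links_by_source ON links (target_url, source_domain);

The trade-off is extra index build time and storage in the data layer, but that's exactly where we're willing to pay it.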

In the solution we're currently working on, we've addressed two issues. First, we've been processing data in the least appropriate segment of our system: process data in your data management layer if possible. Second, we've been pulling much more data than we need to show to a user: only pull as much data as you need to present, and show small pages if you can. The challenges have been porting this logic from a rapid-prototyping language like Ruby into a higher-performance language like C, or into stored procedures, and building appropriate indexes for fast retrieval. But the advantages of this work are substantial, and clearly worth it.

These issues show up in many systems and cause both end-user latency problems and overall system scalability problems. Fixing them results in higher user satisfaction (and hopefully higher revenue) and reduces overall system costs.

By the way, we haven't released anything around these performance improvements yet. If you want to keep up to date on Linkscape improvements, watch the SEOmoz Blog or follow me on Twitter @gerner.
