Retrofitting Table Level Delete/Archive Strategies - Unbounded Table Growth by Design

by Bill Wunder

In this discussion we will look at some data maintenance techniques for logging-type tables. This will not be a discussion of the SQL Server transaction log: the database file that records all transactions and the database modifications made by each transaction, and that SQL Server uses for recoverability and replication. Instead, we'll be examining a particular type of application table that records events at the application level. Typically a logging table sees a relatively high volume of inserts and is designed so that rows are always added to the end of the data set. Optimally, to achieve this end, the data set is ordered by the clustering key of the table. The main benefit of this design is that new rows do not suffer the overhead of the b-tree's split-and-promote processing at insert. Logged data is sometimes referenced later by the application, but ideally it is not changed once it is inserted into the logging table. In practice it is far too common for a logging table to be used to track the state of a main "thing" or entity described in the initial logged event. Likewise, it is far too common for an application design to consider only rows being added to a logging table and to omit any consideration of managing unbounded growth of the data set. In this discussion we'll try to cover some ways to accommodate these table and application design deficiencies without the need to rewrite applications solely for the purpose of being able to maintain the data.
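To make the append-only design concrete, here is a minimal sketch of such a logging table. The table and column names are hypothetical, not from any particular application; the point is the monotonically increasing identity column serving as the clustering key, so new rows always land at the end of the b-tree:

```sql
-- Hypothetical logging table (names are illustrative only).
-- The identity column increases monotonically, so the clustered
-- index always receives new rows at the end of the b-tree and
-- inserts avoid page split-and-promote overhead.
create table SampleLog
(
    SampleLogId    int identity(1,1) not null,
    SampleId       int          not null,   -- the "thing" the event describes
    EventMessage   varchar(255) not null,
    CompletionDate datetime     null,       -- set when processing finishes
    constraint PK_SampleLog primary key clustered (SampleLogId)
)
```

Note that CompletionDate, the column the purge examples below filter on, is not part of the clustering key - a detail that matters a great deal once a delete/archive job shows up.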

Examples of events that might get logged include DNA samples entering a university's research database, shipments sent from a clearing warehouse, story requests at an e-magazine, pump bearing temperature measurements for a water treatment plant, or even per-user accesses to an interesting web page. I'll be talking in more detail about these examples later in the discussion. First let's consider some general software development practices and begin to understand why and how managing the size and growth of logging tables can become problematic.

I rarely read about people having problems maintaining logging table growth using a datetime column index. At the same time I keep seeing applications in my shop choking on that very activity. Admittedly, part of the problem is that we wait until we have tens of GBs of data in a table before someone decides to try to schedule a job or execute a hastily thrown together one-off purge from Query Analyzer that is supposed to keep just the last six months, or two weeks, or only those rows with a Boolean value set to the incomplete state, or whatever comfort zone of history makes sense for a given logging table. Interestingly, the reasons for keeping n days or weeks or months or years of data, or only a particular state, aren't always well founded. Still, elusive business rules and questionable reasons notwithstanding, the impact on the system tends to be fairly predictable. The delete/archive processing is slower than expected, and the contention inflicted on an application as it attempts to slip additional records into the logging table during the longer than expected delete/archive operation is much worse than anticipated. In the worst cases, particularly when logging volume is high and the number of rows to be deleted is large, primary application queries will time out waiting for the delete operation to complete, or a hastily concocted ad hoc purge will irreversibly remove the wrong rows, or another surprising calamity will befall the application as a direct result of trying to remove the presumed-to-be-unneeded rows.

For a date based delete, for example, we might typically see a first iteration approach such as:

    declare @Cutoff datetime
    set @Cutoff = getdate() - 180   -- keep roughly six months of history

    -- one monolithic delete of every aged row
    delete SampleLog where CompletionDate < @Cutoff

And then, after discovering that the delete operation is still executing hours later, holding a table lock along with many more granular locks and thereby preventing all inserts, the developer may kill the job - or complain to a DBA that there is "something wrong with the SQL Server" and the DBA kills the job. Then the developer goes back to the cutting room and returns with something like:

    declare @Cutoff datetime
    set @Cutoff = getdate() - 180

    set rowcount 1000   -- limit each delete statement to 1000 rows
    delete SampleLog where CompletionDate < @Cutoff
    while @@rowcount = 1000   -- a short batch means we've caught up
        delete SampleLog where CompletionDate < @Cutoff
    set rowcount 0      -- restore the session default

Usually the second attempt fails as well: the system is brought to its knees by severe disk queuing, or the delete simply cannot remove rows as fast as they are inserted, or the log device fills a disk in Full Recovery mode, or another of the myriad "gotchas" gets us, and we have to come up with yet another alternative. In most cases the situation is somewhat critical in the moment - it may even be the middle of the night - so a truncate table issued on the log table can bring some temporary relief if the available disk space has been exhausted, or a server restart can get a server that has stopped responding to respond again, or maybe even a combination of a table truncation and a service restart is needed. Sooner or later, though, it becomes necessary to come up with a solution that is elegant, provides a means of preserving the desired log data, and gets the aged rows off the server.
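Before abandoning the delete approach altogether, one incremental improvement is worth sketching: a smaller batch size, the row count captured in a variable so a later statement can't reset @@rowcount out from under the loop, and a short pause between batches so concurrent inserts and log backups can keep pace. The batch size and delay here are assumptions to be tuned, and the table and column names are carried over from the earlier examples:

```sql
declare @Cutoff datetime, @Rows int
set @Cutoff = getdate() - 180

set rowcount 500                -- small batches hold fewer locks for less time
delete SampleLog where CompletionDate < @Cutoff
set @Rows = @@rowcount          -- capture before another statement resets it
while @Rows > 0
begin
    waitfor delay '00:00:02'    -- yield to the application and to log backups
    delete SampleLog where CompletionDate < @Cutoff
    set @Rows = @@rowcount
end
set rowcount 0                  -- always restore the session default
```

Even this gentler loop only spreads the contention out; it does nothing to reduce the total work, which is why a partitioning tactic is usually the better retrofit.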

One class of tactic that has proved flexible and effective for me in retrofitting some discipline into applications with unbounded logging table growth built into the design is local partitioning of the logging table, in one variation or another. There doesn't seem to be a single one-size-fits-all partitioning strategy because logging table data usage differs so much from application to application, and even from logging table to logging table. The range of partitioning possibilities runs from the classic fully partitioned view down to a legitimate, regularly scheduled table truncation. A good understanding of the possibilities and a complete consideration of a logging table's usage are key to unlocking the right solution.
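At the fully partitioned end of that range, the idea looks roughly like the sketch below - all table names, columns, and date ranges are hypothetical. Each member table carries a CHECK constraint on the partitioning column, so a UNION ALL view presents one logical logging table while aging out a month becomes a cheap drop (or a bcp out, then drop) of a single member table rather than a massive delete:

```sql
-- One member table per month; the CHECK constraint on the
-- partitioning column lets the optimizer eliminate partitions.
create table SampleLog_200401
(
    SampleLogId    int      not null,
    CompletionDate datetime not null
        check (CompletionDate >= '20040101' and CompletionDate < '20040201'),
    constraint PK_SampleLog_200401
        primary key clustered (SampleLogId, CompletionDate)
)
go
create table SampleLog_200402
(
    SampleLogId    int      not null,
    CompletionDate datetime not null
        check (CompletionDate >= '20040201' and CompletionDate < '20040301'),
    constraint PK_SampleLog_200402
        primary key clustered (SampleLogId, CompletionDate)
)
go
-- The application reads (and, if the view meets SQL Server's
-- partitioned view rules, writes) through the single view name.
create view SampleLogView as
    select SampleLogId, CompletionDate from SampleLog_200401
    union all
    select SampleLogId, CompletionDate from SampleLog_200402
go
```

Making such a view updatable imposes real restrictions (the partitioning column must be part of the primary key, for one), which is exactly the sort of compromise a retrofit has to weigh.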

My first preference for planning delete/archive strategies is to do so when the application or application subsystem is still in design. Of course I frequently hear the farfetched - even if somewhat founded in reality - argument that the system with gigabytes of log data is still in design. Reluctant as I am to intrude on other people's fantasies, I won't delve too deeply into that position. Suffice it to say that when I say planning delete/archive in the design phase, I mean empirically deciding how much log data to keep, exploiting the features of the architecture to support the delete/archive strategy, and actually testing the delete/archive processing in the integrated system testing environment with a specific test plan element to measure design compatibility and contention issues between the delete/archive process and the application. Designing for unbounded growth of data that does not need to be kept indefinitely is akin to planning a trip to Mars and holding off on how to achieve the return trip until you get to the red planet. The truth is I have yet to work in a shop that considers delete/archive a legitimate aspect of application development. (That is to say, I know many developers that are now permanent Martian residents.) I have, however, over and over again, lived through the production crises and failures that come from such denial in the software lifecycle. It's an interesting, even if frustrating, problem. My hope in sharing some solutions here is to alleviate some of the frustration for you if you find yourself having to fight such problems of inadequate planning and design, and also to provide some apologetics for proper design fundamentals. The latter owing to the fact that shoehorning a delete/archive strategy onto an application, as we must do in these cases, is somewhat complicated, potentially fragile, and always a more difficult path than thinking about application needs and realities during the early design phases of a project.
Or, to complete my metaphor, we are much more likely to have the resources to get home from Mars if we plan the whole trip before the launch date.

Since I'm describing only a retrofit to an existing application in this discussion, I want to be clear that one of the objectives of such an initiative needs to be to limit the changes needed in the application logic as much as possible. I also want to be clear that I am making no certification as to the legitimacy of logging table data in this discussion. At the point of disruption to production it is incumbent upon the DBA to first do what is necessary to stabilize the production environment and then - if necessary and appropriate - to invite a consideration toward redesign of the application or subsystem. If we were designing holistically and a priori, the possibilities would not be so limited; but in this discussion, where the application context is a pre-existing, in situ given (i.e. already in production), it is of the utmost importance that we do not introduce monumental application changes in order to provide an effective delete/archive process. It almost seems a little silly at one level - don't you agree? - to have to take this narrow approach, but in the context of regression testing and iterative development there is clearly an imperative not to create a new application solely so the application can effectively rid itself of unneeded data.

With that said, I want to next state that it may not always be possible to completely avoid a rewrite when delete/archive has been ignored at the fore of the development process. It is even more likely that a retrofit delete/archive can and will degrade other application behaviors. Adding delete/archive capabilities by definition involves touching the fundamental underpinnings of a data model, and therefore compromises are unavoidable. Knowing when to step back and alter an application design and when to forge ahead with a delete/archive retrofit can be treacherous, and the decision deserves the same care and consideration as any other software development iteration. There should always be checkpoints and collaboration in any software revision - just as these are crucial in the initial design - to keep from ending up with an unworkable solution.

Building from this groundwork, in my next installment I'll move into a discussion that attempts to classify logging tables, and then I'll propose some variations of data partitioning and suggest some tools and tricks that might be used to effectively support a delete/archive approach for each of those general classifications. And since we're all stuck at least 35 million allegorical miles from home until we can get this worked out, I'm hoping you find the time to check it out.