Retrofitting Table Level Delete/Archive Strategies - Classifications of Logging Tables

By Bill Wunder

Welcome back for the second part of a discussion concerning remedial delete/archive methods for logging type tables that suffer untenable and unbounded growth due to poor or inadequate application design. If you didn't catch part one I encourage you to give it a read before you continue. In that piece I defined a logging table as a genre of data container with a high volume of inserts that is well organized on its clustering key. Then I explored the problems and pitfalls brought on by otherwise successful application designs that hit the wall due to unbounded logging table growth, the possibility of using data partitioning to address those problems, and the advantages of better up-front design over the need to implement a delete/archive strategy after an application is deeply entrenched in the production environment. At this time it seems apropos to continue that discussion with a breakdown of some of the general types of logging tables. The value in classifying logging tables is that the general type of a logging table can guide us to a suitable partitioning design and rollover process.

Wholly Unnecessary

Sometimes the application does not directly use the logged data. The rows are thrown into a table "just in case". "Just in case" an ad hoc audit becomes necessary, or "just in case" a need to debug application behaviors arises, or "just in case" some other unknown, undefined, or speculative purpose suddenly becomes known, defined, or meaningful. If there truly is no defined purpose for the data, and the business and/or development staff is unable to see its way clear to eliminate the unnecessary processing cycles and storage required to support the data (I can't believe how many times I've run into this conundrum), then TRUNCATE TABLE may be as good a delete/archive strategy as any. If you can find a consensus that there is no need to maintain a history of the logged data once it reaches a certain and relatively young age then all that remains is to set a truncation frequency and create a SQL Agent job. If you do want to have access to the historically logged data - again probably "just in case" - then immediately after each regular full backup clear the table and start growing it again. If the data is ever needed you can go back to the tape archive for the desired day and load a snapshot of the database, complete with the desired log data, onto a playground SQL Server. Of course, this strategy is only as deep as your tape archive strategy, so be sure to understand and adjust the tape rotation policy as necessary to support whatever perceived business requirement you are trying to satisfy by saving the ostensibly worthless historical rows.
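
To make that concrete, here is a minimal sketch of the kind of cleanup routine I have in mind. The table and procedure names are purely illustrative, not from any particular application; you would arrange for it to run immediately after the regular full backup completes so the discarded rows remain recoverable from the tape archive.

-- clear the "just in case" log once the full backup has captured it
-- (table name is illustrative; TRUNCATE TABLE cannot be used if the
-- table is referenced by a foreign key)
CREATE PROCEDURE dbo.clear_just_in_case_log
AS
BEGIN
    TRUNCATE TABLE dbo.just_in_case_log
END
GO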

If you are keeping relatively large reserves of data "just in case" there ought to be a big neon yellow flag with a bold red question mark in the center waving over that pile of bits. Be sure not to be the one standing on such a mountain of data holding that flag, because it's more likely that you're perched on a pile of rotting garbage than on a store of gold ore, or, if there is gold, it's likely to be in such low concentration as to not be economically viable to extract. Keeping data "just in case" seems to be a surprisingly common phenomenon. I suggest that this is usually an ill-conceived attempt to anticipate future unspecified needs in software development: far too common an element of built-in inefficiency in otherwise quality application designs. In my view the realm of software design and development is quickly and surely evolving in the direction of rapid iteration cycles consistent with "just in time" (much different than "just in case") business process models, Rapid Application Development (RAD) philosophies, and Extreme Programming style software lifecycle models. This evolution runs counter to such well-meaning but unproductive tactics as keeping "just in case" data and other attempts to anticipate and deploy guesses about future needs in software development. Even proponents of anticipation in software design recognize the perils and heretofore wasted time involved in such development practices. In general I think it is a more effective development strategy to stick to the specifications. If there must be guessing, it should at least be limited to the design phase.

From the logging examples proposed in the first part of this discussion, consider the logging of each person's access to a specific web page. Let's suppose that the original requirement was to record the number of times each user accessed a web site. Then, perhaps in an attempt to be thorough, a well-meaning developer took it upon himself to extend that requirement and record the access of each web page on the site by user. As it happened, an interest in this information never materialized and the webmaster is relying on page hit counts readily available from the web server tool suite to determine usage patterns. We now have either a large table of unusable heterogeneous data containing an access log for every page on the site, or possibly multiple tables each containing access logging information for specific pages, depending upon which blind alley our developer took us down. In either case the data is steadily growing at whatever rate people are looking at the web pages and the growth is unbounded. Quite possibly the developer is long gone and the web site is now oozing outside of its allocated database storage limits. The right thing to do in this case is remove the unnecessary logging. In the interim a safety net solution might be to regularly truncate the table(s) via a SQL Agent job.
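
As a sketch of that safety net, the msdb job stored procedures can set up the scheduled truncation. The table name, database name, and nightly 2:00 AM schedule below are assumptions for illustration only; adjust them to match your own shop.

USE msdb
GO
-- create a job that truncates the page hit log every night at 2:00 AM
EXEC sp_add_job @job_name = 'Truncate page hit log'
EXEC sp_add_jobstep
     @job_name      = 'Truncate page hit log',
     @step_name     = 'truncate log table',
     @subsystem     = 'TSQL',
     @database_name = 'web_db',
     @command       = 'TRUNCATE TABLE dbo.page_hit_log'
EXEC sp_add_jobschedule
     @job_name          = 'Truncate page hit log',
     @name              = 'daily 2:00 AM',
     @freq_type         = 4,      -- daily
     @freq_interval     = 1,      -- every day
     @active_start_time = 20000   -- 02:00:00
EXEC sp_add_jobserver @job_name = 'Truncate page hit log'
GO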

Hopefully such situations are the exception, though if that is completely true then I'd have to say I seem to find myself in some truly exceptional shops.

History of the World

At the other end of the spectrum is logged data that is heavily referenced by the application and must be available in its purest form for an extended period of time or even in perpetuity. Often the data container is well known and must be quickly accessible. Well known in this case means that the table is referenced by multiple stored procedures, and quickly accessible implies that multiple indexes may be necessary. At its most extreme, the table is subject to updates as well as the logging inserts and selects. It may be difficult or impossible from a data consistency perspective to safely remove even the oldest of rows from the table. If such tables cannot be sent off to the data warehouse's star schema you may have one of the most difficult of logged data maintenance problems.

If the business decision comes down that you can never delete from the table, at least you won't have to address the delete processing dilemma. You will still need to create a means to monitor and manage the ever increasing storage requirements and - depending upon the impact of update processing and the variety of access paths - the table fragmentation maintenance overhead. If the business requirements dictate that a very large amount of data must remain yet the oldest records must be removed from the collection, you will indeed have your hands full. The local table partitioning schemes that are my focus in this discussion are not likely to be practical in this scenario. For this you'll want to fully investigate your scale-up and scale-out options to find a workable answer. Almost certainly in this situation it will be necessary to make significant application layer changes to support radical storage changes such as moving the data container into a distributed and partitioned storage design. A Books Online keyword search on "distributed partitioned views" is one good place to continue your research on managing a data design that must be supported in the unbounded growth mode.
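
For a flavor of where that research leads, a distributed partitioned view stitches member tables on separate servers into one logical table. Everything below - the server, database, table, and column names and the yearly ranges - is an assumption for illustration only.

-- each member server holds one year of rows, enforced by a CHECK
-- constraint on the partitioning column; for example, on SERVER1:
--   ALTER TABLE dbo.activity_log_2003
--     ADD CONSTRAINT ck_activity_log_2003
--     CHECK (log_date >= '20030101' AND log_date < '20040101')
-- the view on each participating server then unions the member tables
CREATE VIEW dbo.activity_log
AS
SELECT * FROM SERVER1.app_db.dbo.activity_log_2003
UNION ALL
SELECT * FROM SERVER2.app_db.dbo.activity_log_2004
GO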

The DNA sample logging would probably be kept around indefinitely. There may be tens of thousands of DNA samples, but fortunately not tens of millions of samples in this case. Most likely the logged row recording the receipt of each sample should never be updated, to assure an accurate and unchanged history of when DNA samples entered the system, yet access to even the oldest samples must remain possible. In all likelihood the range and domain of this data could effectively be housed in a single logging table, and unbounded table growth could be supported by a relatively modest storage requirement. Most likely unbounded table growth was not considered at design time and by pure luck can probably be supported unchanged for the life of the application. It's when the circumstances necessitate keeping much larger row collections around and intact in perpetuity that you will find the need to explore the possibilities listed above.
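
Purely for illustration, such a table might be as simple as the sketch below, where the clustered key keeps the insert-only receipt history in arrival order. All names and columns here are assumed, not taken from any real system.

-- insert-only receipt log; rows are never updated after they arrive
CREATE TABLE dbo.dna_sample_receipt (
    sample_id   int      NOT NULL,
    received_at datetime NOT NULL DEFAULT (GETDATE()),
    received_by int      NOT NULL,
    CONSTRAINT pk_dna_sample_receipt
        PRIMARY KEY CLUSTERED (received_at, sample_id)
)
GO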

Divide and Conquer

Hopefully you don't have to deal with the Wholly Unnecessary use case, and as stated the History of the World case likely has no delete/archive requirement appropriate for our discussion of effective local partitioning strategies. If you do find that you must attack such architectural issues with the strategies presented here, chances are that there are some forces at work in your shop that are well outside the realms of normal and reasonable software development philosophy (you have my sympathy). That's not to say the challenges are easy when a design does fall within the range of normal and reasonable development philosophies, but it does provide some hope that we can respond to the delete/archive requirements without the need to go back to the drawing board to support the History of the World or to implement complex and/or esoteric processes to support obtuse business requirements. Even though the unbounded growth model is usually the design default, it is almost never the correct or appropriate model. In most situations I suggest there is a practical way to locally partition a table to achieve optimal delete/archive efficiencies with minimal application effects and not too much brain pain. What I don't want to do is limit your possibilities to only the partitioning schemes presented here. The first step toward solving any delete/archive problem ought to involve some thought about what makes the most sense in that particular situation. In many - perhaps most - circumstances a data partitioning strategy will provide a satisfactory and relatively easy growth management resolution.
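
As a preview of where the next installment is headed, a local partitioned view in its simplest form is just a set of identically structured member tables, each constrained to a slice of the partitioning column and unioned behind a view. The monthly page hit tables below are a sketch with assumed names, not a finished design.

-- one member table per month; the CHECK constraint is what lets the
-- optimizer skip members outside the requested date range
CREATE TABLE dbo.page_hit_200401 (
    hit_date datetime NOT NULL
        CONSTRAINT ck_page_hit_200401
        CHECK (hit_date >= '20040101' AND hit_date < '20040201'),
    user_id  int NOT NULL,
    page_id  int NOT NULL
)
CREATE CLUSTERED INDEX cx_page_hit_200401 ON dbo.page_hit_200401 (hit_date)
GO
CREATE TABLE dbo.page_hit_200402 (
    hit_date datetime NOT NULL
        CONSTRAINT ck_page_hit_200402
        CHECK (hit_date >= '20040201' AND hit_date < '20040301'),
    user_id  int NOT NULL,
    page_id  int NOT NULL
)
CREATE CLUSTERED INDEX cx_page_hit_200402 ON dbo.page_hit_200402 (hit_date)
GO
-- the rollover amounts to dropping the oldest member table and
-- recreating the view over the remaining members
CREATE VIEW dbo.page_hit
AS
SELECT hit_date, user_id, page_id FROM dbo.page_hit_200401
UNION ALL
SELECT hit_date, user_id, page_id FROM dbo.page_hit_200402
GO

Bear in mind this read-only flavor is the simplest case; an updatable partitioned view adds requirements, such as including the partitioning column in each member table's primary key, that we'll take up along with the rollover details next time.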

Between those situations where we don't really need any of the data and those where we need to keep and use it all is where I suggest we will find most real world situations. The actual designs, and therefore the actual growth management requirements, tend to fall at different spots along a continuum between those two extremes, and your exact situation will probably not fit neatly into any one classification. For the purposes of this discussion I'm going to proceed with the understanding that you will adequately evaluate your unbounded growth problem and that upon completion of that evaluation you feel some confidence that a partitioning strategy will serve to manage that growth. You may need to keep only a small frame of logged data for Short Term purposes that we can think of like an episode of Saturday Night Live: make the most of it in the moment, but there's not much need to remember it tomorrow. Perhaps you need to keep just enough rows to satisfy the smallest degree of aggregation you want to derive from the data, such as reports requested for a day, or perhaps the data loses its validity within a full business lifecycle OLTP system, such as unfulfilled orders that have subsequently spawned back-orders. At the next classification you may have a somewhat more heterogeneous application requirement. Most of the time data models will fall into this Middle of the Road classification. A typical example would be the need to provide a month's worth of drill down capability for the daily report usage aggregations, or a processing requirement that needs data older than the primary application usage of the data, such as a 15 day moving average stock level computation for the aged unfulfilled orders (see the sketch below). And finally, getting close to the History of the World end of the spectrum, you may have data that needs to stay around for a fair amount of time and is interesting to multiple application components for the Life of the Party, such as the pump bearing temperature samples in our earlier laid out collection of locally partitioned candidates. In that case tracking the bearing temperature trends along with a variety of other control measurements in the treatment plant can be useful to identify holistic patterns within the bearing, the pump, and even the plant in a more sophisticated factory floor monitoring application. There are plenty of other real world examples of a Life of the Party situation and, as you might expect, the details can get pretty hairy as the Life of the Party problem approaches the History of the World extreme.
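
To show why that kind of requirement keeps older rows in play, here is one way the 15 day moving average might be expressed against a daily stock level log. The table and column names are assumed for illustration; the point is simply that every day's calculation reaches back two weeks into the logged history.

-- average each day's stock level with the 14 days that precede it
SELECT  cur.item_id,
        cur.sample_date,
        AVG(prior.stock_level) AS moving_avg_15_day
FROM    dbo.daily_stock_level cur
JOIN    dbo.daily_stock_level prior
          ON  prior.item_id = cur.item_id
          AND prior.sample_date BETWEEN DATEADD(day, -14, cur.sample_date)
                                    AND cur.sample_date
GROUP BY cur.item_id, cur.sample_date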

In the next and final installment of this series we will go into the local partitioning implementation details and the coding differences useful to convert an ever-growing problem table that falls somewhere along the Divide and Conquer continuum into a locally partitioned view. We'll consider tactics for converting to the partitioned arrangement as well as the rollover strategies necessary to keep growth under control. You'll find some T-SQL examples that I hope you can use, in concert with any analytical insights enabled by the discussion up to this point, to help things run smoothly in your shop.

Bill