Getting More From the SQL Agent

by Bill Wunder

The SQL Agent provides some great job automation capabilities. Out of the box you get a fair amount of scheduling flexibility, provided the repeating nature of a schedule is simple and static. Likewise there are intrinsic interfaces to the ActiveX, command line, and T-SQL subsystems. For those circumstances where a dependency on a desktop client application such as Outlook does not compromise the stability of the SQL Server installation, the SQL Agent provides notification via SQLMail. It is even possible to centralize job control through the multi-server administration configuration and tool set. In sum, I find the SQL Agent to be a solid 80% solution for my needs. Really, that says a lot of good things about the SQL Agent. What I'd like to do in this article is share some of the tactics I use for the other 20% of my needs, when I fiddle a bit to get the desired behavior from a SQL Agent scheduled job.


When something happens at the Exchange server, or when the Outlook client loses its mind, I can't afford to have my job scheduling system - or worse, my SQL Server - go down. For this reason alone I put a significant amount of effort into making the xp_smtp_sendmail extended stored procedure (available for free download) work asynchronously in all my environments for both the SQL Agent and SQL Server services. I no longer even install a MAPI client on my SQL Servers, and I'm happy to tell you my server stability has improved dramatically. Gone are the occasional and always inopportune occurrences of a hung SQL Server or SQL Agent due to MAPI problems, the phenomenon of an important job being stuck forever in a "performing completion tasks" state, and the inconvenience of needing to schedule a SQL Server outage while an upgrade, service pack, or patch is applied to the Outlook client and the machine is rebooted. If you'd like to see what I've done to completely eliminate MAPI from my SQL Servers, be sure to check out my previous series of articles on the topic:

    Take My SQLMail, Please! - for a philosophical overview

    From xp_sendmail to safe_sendmail  - to establish an asynchronous email delivery environment  

    From xp_sendmail to xp_smtp_sendmail - to make the switch from MAPI to the SMTP extended stored procedure within SQL Server

    From SQLMail to SMTP in the SQLAgent - to make the switch from MAPI to the SMTP extended stored procedure within SQL Agent

You'll find all the code I use to make this work in the Script Library. Most of it is in the scripts Asynchronous SMTP alternative to SQLMail, Run a job from all existing SQL Agent Alerts, and meaningful Agent job notification, but still check out the articles for the full scoop and plenty of additional code. To be sure, using an alternative email delivery system requires an additional job step in each scheduled job for either failure notification or completion notification; however, the configuration I propose in the articles is very standard, so in the end it's about the same effort as going to the Notifications tab to configure the built-in MAPI notifications.
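For a concrete point of reference, a failure notification job step built on xp_smtp_sendmail can be as simple as the call below. The server name and addresses are placeholders, and the parameter names reflect the copy of the extended stored procedure I use, so check the documentation that ships with your download:

declare @rc int
exec @rc = master.dbo.xp_smtp_sendmail
    @FROM    = N'sqlagent@myserver.mydomain.com',  -- placeholder sender address
    @TO      = N'dba-team@mydomain.com',           -- placeholder recipient list
    @server  = N'smtp.mydomain.com',               -- placeholder SMTP relay
    @subject = N'SQL Agent job failure',
    @message = N'A step in this job failed. See the job history for details.'
if @rc <> 0
    raiserror('xp_smtp_sendmail returned error %d', 16, 1, @rc)

Make a step like this the last step of the job and point each working step's "On failure" action at it, and you have the moral equivalent of the Notifications tab without the MAPI dependency.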

Clear Cutting or Selective Logging?

Whether you use MAPI or SMTP to deliver SQL Agent failure messages, it can be useful to archive error messages someplace other than msdb.dbo.sysjobhistory. With the built-in job history processing it's pretty easy to lose sight of why a job failed unless you raise the total number of msdb.dbo.sysjobhistory rows to keep, and the number of rows to keep per job, to very high values (configurable from the SQL Agent properties dialog in Enterprise Manager). Consider a job that runs once a minute on a SQL Agent subsystem with a few hundred jobs, many of which also run frequently. To keep track of a failure on Friday night until you get in on Monday morning, you've got to save thousands and thousands of "The job completed successfully" msdb.dbo.sysjobhistory rows just to be able to go back to that one failure message - often only to find that the actual error was truncated from the message stream because of the 255 character size of the message column in msdb.dbo.sysjobhistory. Even if you don't want to go all the way and implement SMTP for your agent jobs, you can still take advantage of this alternative that saves only failure messages to an archive. Refer to the script admin db - meaningful Agent job notification for some ideas on how to get started. Basically, if you change the reference to the safe_sendmail procedure to a reference to xp_sendmail, this procedure will work for you in a MAPI mail delivery environment.
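The heart of the archive approach is nothing more than copying the failed rows out of msdb before the history sweep discards them. Here is a minimal sketch, assuming a hypothetical admin.dbo.job_failure_archive table created ahead of time (the full treatment is in the script named above):

-- copy only failure rows (run_status = 0) into a local archive table
insert admin.dbo.job_failure_archive
    (job_name, step_id, step_name, run_date, run_time, message)
select j.name, h.step_id, h.step_name, h.run_date, h.run_time, h.message
from msdb.dbo.sysjobhistory h
join msdb.dbo.sysjobs j
  on j.job_id = h.job_id
where h.run_status = 0  -- 0 = failed
  and not exists (select *
                  from admin.dbo.job_failure_archive a
                  where a.job_name = j.name
                    and a.run_date = h.run_date
                    and a.run_time = h.run_time
                    and a.step_id  = h.step_id)

Schedule something like that to run a little more often than the history sweep and the Friday-night-to-Monday-morning problem goes away without keeping tens of thousands of success rows around.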

Expanding this notion to another level, you can also easily implement an escalating notification paradigm. Suppose for the first hour that a job is failing you want to be notified just so you can keep track of things. If the condition generating the notifications persists beyond an hour, you want to notify the 24x7 operations staff and make sure some corrective action is initiated. To achieve this effect, it's pretty straightforward to modify the admin db - meaningful Agent job notification script to behave like the admin.dbo.NotifySLSLogShippingFailure stored procedure included in the admin db - Heavy Duty LiteSpeed Log Shipping script.
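The escalation decision itself can be sketched in a few lines. In this sketch, admin.dbo.job_failure_archive and its insert_time column are hypothetical names for an archive of failure rows stamped as they arrive; the real logic lives in the scripts just mentioned:

declare @JobName sysname, @FirstFailure datetime, @recipients varchar(255)
set @JobName = 'nightly load'  -- placeholder job name
-- when was this job's oldest failure in the last day first recorded?
select @FirstFailure = coalesce(min(insert_time), getdate())
from admin.dbo.job_failure_archive
where job_name = @JobName
  and insert_time > dateadd(day, -1, getdate())
if datediff(minute, @FirstFailure, getdate()) <= 60
    set @recipients = 'dba-team@mydomain.com'         -- just keeping track
else
    set @recipients = 'operations-24x7@mydomain.com'  -- time for corrective action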

The State of the Onion

There are many layers of concern in so many of the processes that may need to be automated via the SQL Agent. One good example in my shop is the need to carefully watch log growth during index maintenance on databases running in the "Full" recovery model. There is nothing in the Agent that will let me monitor the used size of a log device and make job control decisions based on that value. As you know, DBCC INDEXDEFRAG can consume lots of log disk space in a relatively short period of time. If I don't do something to manage this I can easily fill my log device and bring the server to a grinding halt. What I've done to monitor and manage the growth issue is to spawn a job that does the DBCC INDEXDEFRAG from the main indexing job, and then put the original job in a busy wait that loops regularly and often, checking the log space used counter in master.dbo.sysperfinfo against a threshold size I've predefined. When the log usage hits the threshold, the controlling job stops the dynamically generated defrag job and resumes the busy wait, watching for the log to fall back below the usage threshold - which will occur after the next regularly scheduled log backup. Once the index defrag operation has completed, the spawned job is removed from the SQL Agent and the control job can go on about its index maintenance business. You can review the details of how I accomplish this in the script admin db - Index (fragmentation) Maintenance for Log Shipping.
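Stripped to its essentials, the busy wait is a polling loop against master.dbo.sysperfinfo. The sketch below stops the spawned job when the threshold is crossed; the database name, job name, and threshold are placeholders, and the full version in the script also resumes the wait until a log backup brings usage back down:

declare @PctUsed int, @Threshold int
set @Threshold = 60  -- placeholder: percent of log space used
while 1 = 1
begin
    select @PctUsed = cntr_value
    from master.dbo.sysperfinfo
    where object_name like '%Databases%'
      and counter_name = 'Percent Log Used'
      and instance_name = 'MyBigDatabase'  -- placeholder database name
    if @PctUsed >= @Threshold
    begin
        -- halt the dynamically generated defrag job before the log fills
        exec msdb.dbo.sp_stop_job @job_name = 'defrag MyBigDatabase'
        break
    end
    waitfor delay '00:00:10'  -- check regularly and often
end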

Stepping on Yourself

Sometimes it's useful to start a job from another process. Most often I see this capability used when there is a relatively long running process that can complete asynchronously from a controlling process, where something first must happen - or more usually must complete - in the controlling process before the other task can begin. An example might be a restore on the query box that cannot start until the backup is done on the production box. All the production backup really needs to do is tell the restore job on the query server to start running. The rub comes if, for any of dozens of reasons, the restore is already active on the query machine. Suppose for example that the restore needs to take care of that index maintenance we were just speaking of, and for some reason the log backup job has been stopped. If in this case we try to start the job on the query box from the production backup process, we are likely to get a message that the production backup failed. If that happens at 2AM, the 24x7 operations staff is probably going to get somebody out of bed to see why the backup failed. If it's someone else they call, then no problem (excuse my wicked laugh), but if it's going to be you, wouldn't you rather just check to see if the restore is already running before you try to start it, and do your best to eke out the full 5 available hours of sleep tonight?

I have a couple of different ways to tell if a job is in a state that is safe to start. One way uses a table function to get the SQL Agent job status using the OLE automation procedures. The other is a hack from sp_help_job that uses the undocumented xp_sqlagent_enum_jobs to find out if a SQL Agent job can be started now. It would be pretty easy to swap the pieces between the two: change the function to use xp_sqlagent_enum_jobs, or vice versa.
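The xp_sqlagent_enum_jobs flavor of that check looks something like the sketch below. The column layout of the temporary table has to match what the undocumented procedure returns on SQL Server 2000, so treat it as version-specific; the job name is a placeholder:

declare @job_id uniqueidentifier
select @job_id = job_id
from msdb.dbo.sysjobs
where name = 'restore on query box'  -- placeholder job name

create table #enum_jobs
   (job_id uniqueidentifier not null,
    last_run_date int, last_run_time int,
    next_run_date int, next_run_time int,
    next_run_schedule_id int,
    requested_to_run int, request_source int,
    request_source_id sysname null,
    running int,  -- 1 = job is currently executing
    current_step int, current_retry_attempt int,
    job_state int)

insert #enum_jobs
exec master.dbo.xp_sqlagent_enum_jobs 1, 'sa', @job_id

if not exists (select * from #enum_jobs where running = 1)
    exec msdb.dbo.sp_start_job @job_name = 'restore on query box'

drop table #enum_jobs

If the job is already running, the step simply does nothing and the production backup reports success instead of dragging someone out of bed.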

The Morning After

Consider the rather peculiar circumstance I find myself in once a month. I reboot my SQL Servers once a month to recover leaked memory, rather than waiting until the server chokes to do the reboot. It's just that much easier to get management and the SLAs (Service Level Agreements) to go along with a scheduled reboot than with a Heimlich maneuver every now and again in the middle of the production day. We have evolved this scheduled reboot to happen on the first Saturday of the month at midnight. This time the rub comes with the reality that the midnight reboot on the first Saturday of the month can affect other jobs that run in the minutes and hours after midnight on Sunday morning. Depending on where the first Saturday of the month falls, the next day can be the first or second Sunday of the month. That means I want those jobs that run after midnight to run when yesterday was the first Saturday, and not run when yesterday was not the first Saturday of the month. Maybe I'm missing something, but I don't see that option in the SQL Agent's job scheduler.

Actually our maintenance window runs from midnight to 4AM, so I added a job step to the beginning of the jobs that I don't want to execute during the maintenance window to help me make the right choice:

declare @dt datetime

set @dt = getdate()

-- job ran at regularly scheduled time today so don't run again
if Not(    datepart(weekday, @dt-1) = 7  -- yesterday was Saturday
       and datepart(day, @dt-1) < 8)     -- yesterday was in first week of month
   and datepart(hour, @dt) = 4           -- now 4AM
    raiserror('Request for delayed (4AM) Sunday start when not Sunday morning after monthly window aborted.', 16, 1)

-- job has to wait for the maintenance window to expire before it can run today
if  datepart(weekday, @dt-1) = 7         -- yesterday was Saturday
and datepart(day, @dt-1) < 8             -- yesterday was in first week of month
and datepart(hour, @dt) < 4              -- not yet 4AM
    raiserror('Request to run during monthly maintenance window aborted.', 16, 1)


Using the raiserror allows me to bail on the process using the SQL Agent flow control functionality. If a raiserror is triggered from this code, I can send a failure notification email or simply not run the job.


Spy -vs.- Spy

There must be a ton of other scheduling gymnastics people have gone through when the SQL Agent scheduler won't quite do it for them. Certainly I have a few others that I haven't included, mostly because a particular requirement is too obscure or because a particular workaround is less than elegant. Hopefully what I've conveyed in this article is not the gamut of possibilities but rather the diversity of possibilities. If you found something you can use in your environment, that's cool. If I helped you stretch your imagination enough to see a possibility that will work for you but is not discussed in this article or available in the off-the-shelf SQL Agent, even better!