I’ve had this goal for quite some time now, ever since my employer chose Sun X4540 storage systems as the data storage for our backup applications. The goal was to handle data replication at the ZFS file system level, removing the need for application-level awareness of the file system replication.
A couple of products from Sun seemingly accomplish this, one being ZFS itself via the ‘zfs send/recv’ function.
I will describe the concepts and provide a basic script that can be used for frequent volume snapshotting and replication.
ZFS has native replication via its send/recv options, an incredible feature that enables a lot of cool things. There are many resources online about how it works technically, so I would suggest reading up a bit, as I won’t explain it here.
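For orientation only, here is a minimal sketch of what a send/recv over SSH can look like. The function name `send_snapshot` and its argument order are made up for illustration. It assumes key-based SSH access to the remote host and a `zfs` binary on both ends. With a base snapshot it sends an incremental stream (`zfs send -i`), otherwise a full one.

```shell
#!/bin/sh
# Hypothetical helper, not taken from any existing tool.
# Usage: send_snapshot SNAPSHOT REMOTE TARGET [BASE]
#   SNAPSHOT  local snapshot to send, e.g. tank/backup@today
#   REMOTE    SSH destination, e.g. backuphost
#   TARGET    file system on the remote side that receives the stream
#   BASE      optional earlier snapshot; if given, only the changes
#             between BASE and SNAPSHOT are sent (incremental stream)
send_snapshot() {
    snap=$1
    remote=$2
    target=$3
    base=${4:-}

    if [ -n "$base" ]; then
        # Incremental stream: the remote side must already hold BASE.
        zfs send -i "$base" "$snap"
    else
        # Full stream: used for the very first replication.
        zfs send "$snap"
    fi | ssh "$remote" zfs recv "$target"
    # The function's exit status is that of the ssh/recv side of the pipe.
}
```

A call such as `send_snapshot tank/backup@today backuphost pool/backup tank/backup@yesterday` would send only the changes since `@yesterday`. The dataset and host names here are placeholders.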
I explored the ZFS Timeslider tools, which can execute an additional command (such as zfs send/recv via SSH) for every local snapshot you take. That worked for a while, but the suite was not designed to handle ZFS replication. When my snapshot sizes began to grow, the send/recv operation would take longer than the window before the next local snapshot was taken.
The resulting conflicting operations caused the service to enter maintenance mode every time.
Then I found a helpful blog post by Sun engineer Constantin Gonzalez (http://blogs.sun.com/constantin/entry/zfs_replicator_script_new_edition), where he described and made available a script that handles parts of ZFS replication, from taking the initial snapshot to sending it to a remote host over SSH. However, the same issues haunted me there: the send/recv operation would run past the scheduled window, and subsequent jobs would step on each other.
Clearly, these tools accomplished a lot of great things, but some additional logic was needed so that jobs can run past their window without the risk of other jobs trying to take snapshots while they are still in progress.
Enter ZFS user properties: you can set arbitrary properties at the file system level or the snapshot level. For example, you can “lock” a file system by setting a flag on it; your programs check whether the flag exists and, if so, quit gracefully and notify you.
Jobs running past their window will always happen, and a simple check to see if an existing job is running on your data set would avoid conflicting snapshots, failed jobs, etc.
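A minimal sketch of such a lock check could look like the following. The property name `replication:locked` and the function names are hypothetical; the only hard requirement is that a ZFS user property name contains a colon. An unset user property is reported as `-` by `zfs get`, and `zfs inherit` clears a locally set one.

```shell
#!/bin/sh
# Hypothetical property name used as the lock flag.
LOCK_PROP="replication:locked"

# Exit status 0 if the dataset carries the lock flag.
is_locked() {
    # -H drops the header line, -o value prints only the value column.
    [ "$(zfs get -H -o value "$LOCK_PROP" "$1")" = "true" ]
}

# Set the lock flag on a dataset.
lock_dataset() {
    zfs set "$LOCK_PROP=true" "$1"
}

# Clear the lock flag again.
unlock_dataset() {
    zfs inherit "$LOCK_PROP" "$1"
}

# Usage: run_locked DATASET COMMAND [ARGS...]
# Runs COMMAND only if DATASET is not already locked, holding the
# lock for the duration of the command.
run_locked() {
    dataset=$1
    shift
    if is_locked "$dataset"; then
        echo "$dataset is locked by another job, exiting" >&2
        return 1
    fi
    lock_dataset "$dataset" || return 1
    "$@"
    rc=$?
    unlock_dataset "$dataset"
    return $rc
}
```

A cron entry could then call something like `run_locked tank/backup /path/to/replication-job`, where both names are placeholders. Note that the check and the set are two separate commands, so this is not an atomic lock. It also leaves the flag in place if the job is killed. It is enough to keep scheduled jobs from overlapping, but not a general-purpose mutex.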
Short of using an enterprise job scheduling program like Control-M, this functionality is simple to add to existing shell scripts!
But there’s more: why not use ZFS user properties to assign additional flags, such as marking a snapshot when all operations on it have completed, recording whether a snapshot depends on another for incremental sends, or noting whether the local snapshot has been fully replicated?
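As a rough illustration of those extra flags, the sketch below marks a snapshot as replicated, records which snapshot it was sent incrementally against, and looks up the newest snapshot that carries the replicated flag. The property names `replication:sent` and `replication:base` and all function names are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical property names; user property names must contain a colon.
SENT_PROP="replication:sent"
BASE_PROP="replication:base"

# Flag a snapshot as fully replicated.
mark_replicated() {
    zfs set "$SENT_PROP=true" "$1"
}

# Usage: record_base SNAPSHOT BASE
# Remember which earlier snapshot SNAPSHOT was sent incrementally against.
record_base() {
    zfs set "$BASE_PROP=$2" "$1"
}

# Print the newest snapshot of dataset $1 that is flagged as replicated.
# Prints nothing if no snapshot carries the flag.
last_replicated() {
    # -H gives tab-separated output without a header, -s creation sorts
    # oldest first, and -r lists snapshots below the dataset, so the awk
    # filter keeps only snapshots of the dataset itself.
    zfs list -H -t snapshot -o "name,$SENT_PROP" -s creation -r "$1" |
        awk -F'\t' -v ds="$1" '
            index($1, ds "@") == 1 && $2 == "true" { last = $1 }
            END { if (last != "") print last }'
}
```

The output of `last_replicated` can serve as the base for the next `zfs send -i`. After a successful send, `mark_replicated` and `record_base` update the flags on the new snapshot.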
I took examples from the tools previously created (see above), and added some of those checks and flags. I’ve posted the script on my site, in the hope that others will find the additions helpful and improve on some of the things I’ve done incorrectly.
By no means am I great at writing scripts or programs, so if you see any bugs or possible improvements, please fix them!
Again, suggest or make any improvements, and enjoy!