Here is a memo on backing up a MediaWiki instance, say one deployed as part of a Web site mywebsite.com.

Here is a listing of concrete steps:

Change into the backup root directory on the local file system:

cd /Volumes/BACKUP/mywebsite.com

Back up using the backup_mediawiki.sh script

  1. Log in to the web server.
  2. Update the clone of the VCS repository https://github.com/lumeng/MediaWiki_Backup on the server.
  3. Back up using MediaWiki_Backup/backup_mediawiki.sh:
     # assuming the wiki's web directory is ~/mywebsite.com/wiki
     WIKI_PATH="mywebsite.com/wiki"
     # assuming the backup_YYYYMMDD subdirectory created by the backup should go under path/to/backup/mywebsite.com/wiki
     WIKI_BACKUP_PATH="path/to/backup/mywebsite.com/wiki"
     # go to the home directory before starting
     cd
     # Start the backup. This creates path/to/backup/mywebsite.com/wiki/backup_YYYYMMDD.
     path/to/backup_mediawiki.sh -d "$WIKI_BACKUP_PATH" -w "$WIKI_PATH"
    
  4. Rsync the backup to a local hard drive. Back up the whole web site user's home directory, which includes the backup files created above, using rsync:
     cd /Volumes/BACKUP/mywebsite.com
     rsync --exclude-from rsync_backup_exclusion.txt -thrivpbl user@webhost.com:/home/websiteuser rsync_backup/
  5. Ideally, upload the backup to cloud storage such as Dropbox. (A driver script consolidating steps 1-4 is sketched after this list.)
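
For convenience, here is a minimal sketch that consolidates steps 1-4 into a single driver script run from the local machine. It is an illustration only: the login websiteuser@webhost.com and all paths are placeholders, and it assumes the MediaWiki_Backup clone sits in the server user's home directory and is invoked with the same -d/-w options as above.

#!/usr/bin/env bash
# Illustrative driver for steps 1-4; adjust hosts, users, and paths to your setup.
set -euo pipefail

REMOTE="websiteuser@webhost.com"                      # placeholder web server login
WIKI_PATH="mywebsite.com/wiki"                        # wiki web directory, relative to the remote home
WIKI_BACKUP_PATH="path/to/backup/mywebsite.com/wiki"  # where backup_YYYYMMDD is created on the server
LOCAL_BACKUP_ROOT="/Volumes/BACKUP/mywebsite.com"     # local backup drive

# Steps 1-3: update the backup script on the server and run it there.
ssh "$REMOTE" "cd ~/MediaWiki_Backup && git pull && ./backup_mediawiki.sh -d '$WIKI_BACKUP_PATH' -w '$WIKI_PATH'"

# Step 4: pull the web site user's home directory, including the new backup, down to the local drive.
cd "$LOCAL_BACKUP_ROOT"
rsync --exclude-from rsync_backup_exclusion.txt -thrivpbl "$REMOTE":/home/websiteuser rsync_backup/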

HTML backup using wget for immediate reading

Optionally, one can also keep a crawled copy of a MediaWiki instance. It can be useful to have the HTML files at hand for immediate offline reading.

cd /Volumes/BACKUP/mywebsite.com/wget_backup
mkdir mywebsite.com-wiki__wget_backup_YYYYMMDD
cd mywebsite.com-wiki__wget_backup_YYYYMMDD
# crawl the whole Web site
# wget -k -p -r -E http://www.mywebsite.com/
# crawl the pages of the MediaWiki instance excluding the Help and Special pages
wget -k -p -r --user-agent='Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36' -R '*Special*,*Help*' -E http://www.mywebsite.com/wiki/
cd ..
7z a -mx=9 mywebsite.com-wiki__wget_backup_YYYYMMDD.7z mywebsite.com-wiki__wget_backup_YYYYMMDD
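
The YYYYMMDD placeholder above can be filled in automatically with date instead of being typed by hand; a minimal sketch, using the same directory and archive names as above:

# generate the date stamp once and reuse it for the crawl directory and the archive
DATE_STAMP=$(date +%Y%m%d)
BACKUP_NAME="mywebsite.com-wiki__wget_backup_${DATE_STAMP}"
mkdir "$BACKUP_NAME"
# ... run the wget crawl inside "$BACKUP_NAME" as above, then compress it ...
7z a -mx=9 "${BACKUP_NAME}.7z" "$BACKUP_NAME"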

Remarks:

  • -k: convert links to suit local viewing
  • -p: download page requisites/dependencies
  • -r: download recursively
  • -E: adjust file name extensions so downloaded pages are saved with an .html suffix
  • -R: reject files whose names match the given patterns (here, the Special and Help pages)
  • --user-agent: set a "fake" user agent to emulate a regular browser, since some sites check the user agent. Check user agent strings at useragentstring.com.

For reference, in one experiment it took about 30 minutes for wget to crawl a small MediaWiki installation with hundreds of user-created pages.

If only a small set of pages needs to be backed up, curl can be used instead, for example:

# download multiple pages
curl -O 'http://mywebsite.com/wiki/Foo_Bar[01-10]'
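
If the page titles do not follow a numeric pattern like the above, the URLs can instead be kept in a plain text file, one per line, and fed to curl; a small sketch, with wiki_pages.txt as a hypothetical file name:

# download every URL listed (one per line) in wiki_pages.txt
xargs -n 1 curl -O < wiki_pages.txt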

References

  • https://www.mediawiki.org/wiki/Manual:Backing_up_a_wiki
  • https://www.mediawiki.org/wiki/Fullsitebackup
  • https://www.mediawiki.org/wiki/Manual:DumpBackup.php
  • https://wikitech.wikimedia.org/wiki/Category:Dumps