Download web pages recursively under a URL [1]

 wget \
 --recursive \
 --no-clobber \
 --page-requisites \
 --adjust-extension \
 --convert-links \
 --restrict-file-names=windows \
 --domains example.com \
 -nH --cut-dirs=1 \
 -e robots=off \
 --random-wait \
 --wait 5 \
 --no-parent \
     www.example.com/subdirectory/
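
Before committing to the full mirror, a shallow trial run can help confirm that the placeholders and filters are right. The sketch below reuses the example.com placeholders and adds --level=1 to limit recursion depth, so only the start page, the pages it links to directly, and their page requisites are fetched:

 wget --recursive --level=1 --no-parent \
  --page-requisites --adjust-extension --convert-links \
  --domains example.com \
      www.example.com/subdirectory/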

  • Replace example.com and www.example.com/subdirectory/ with the domain and start URL for your own case.
  • -r --recursive: download the entire website.
  • -D --domains example.com: don't follow links outside example.com.
  • -np --no-parent: don't ascend above subdirectory/ when following links.
  • -p --page-requisites: get all the elements that compose the page (images, CSS and so on).
  • -E --adjust-extension: save HTML/CSS documents with the proper file extensions (e.g. .html).
  • -k --convert-links: convert links so that they work locally, off-line.
  • --restrict-file-names=windows: modify filenames so that they will work in Windows as well.
  • -nc --no-clobber: don't overwrite any existing files (used in case the download is interrupted and resumed).
  • -e robots=off: force crawling regardless of robots.txt setting.
  • -nH --cut-dirs=1: don't create a directory named after the host (-nH) and strip one leading directory component, here subdirectory, from the saved paths (see the path example after this list).
  • --random-wait: vary the time between requests between 0.5 and 1.5 times the value given by the --wait option.
  • -w --wait=5: number of seconds to wait between requests. (See --random-wait.)
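
To make the effect of -nH and --cut-dirs=1 concrete, here is a sketch of where a hypothetical file page.html under the placeholder URL would be saved locally:

 # Where www.example.com/subdirectory/page.html ends up:
 #   (default)            ./www.example.com/subdirectory/page.html
 #   -nH                  ./subdirectory/page.html
 #   -nH --cut-dirs=1     ./page.html
 # With the command above, downloaded files therefore land directly in
 # the current directory rather than under www.example.com/subdirectory/.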

References


  1. linuxjournal.com. Downloading an Entire Web Site with wget. 2008. https://www.linuxjournal.com/content/downloading-entire-web-site-wget