Crawl Websites with wget2 - Johannes Filter

There are many ways to crawl and scrape websites. The most popular is probably the Python package scrapy, but there are more bare-bones alternatives such as the well-known GNU wget. Its successor, wget2, adds some neat features (HTTP/2, parallel downloading) that make it even more useful for scraping. Here is the basic usage.

--accept-regex 'pattern'

This makes it possible to discard URLs that do not match the pattern. The option

--filter-urls

is important because it applies the filter before a request is sent. Without it, the request is sent first and the URL is only checked against the pattern afterward.

The full command below saves the whole website to disk. This may require some space, but on the other hand you do not have to scrape the same website over and over again when you realize you need to select different information from it.
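As a rough illustration of re-using a saved copy, the sketch below pulls the page title out of every HTML file in a mirror directory without touching the network. The function name extract_titles is hypothetical, and it assumes each title sits on one line inside a lowercase <title> tag, which will not hold for every site.

```shell
#!/bin/sh
# Hypothetical helper: print "file<TAB>title" for each saved HTML page.
# Assumes a single-line, lowercase <title>...</title> per file.
extract_titles() {
  find "$1" -type f -name '*.html' | sort | while IFS= read -r file; do
    title=$(grep -o '<title>[^<]*</title>' "$file" \
      | head -n 1 \
      | sed -e 's/<title>//' -e 's/<\/title>//')
    printf '%s\t%s\n' "$file" "$title"
  done
}

# Run against the mirror directory if it exists
# (wget2 names it after the host, here www.example.com).
if [ -d www.example.com ]; then
  extract_titles www.example.com
fi
```

If you later need a different field, you only change the grep and sed expressions and run the function again on the same files.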

One downside is that you cannot stop and resume the crawling process, so you have to start it and let it run to completion.

wget2 \
--no-clobber \
--mirror \
--page-requisites \
--adjust-extension \
--no-parent \
--convert-links \
--reject '*.js,*.css,*.ico,*.txt,*.gif,*.jpg,*.jpeg,*.png,*.mp3,*.pdf,*.tgz,*.flv,*.avi,*.mpeg,*.iso' \
--ignore-tags=img,link,script \
--header="Accept: text/html" \
--execute robots=off \
--user-agent=Mozilla \
--max-threads 5 \
--accept-regex '(^[^\?]+$)|(^.*\?page.*$)' \
--filter-urls \
--stats-site=csv:stats.csv \
www.example.com
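
Before starting a long crawl, it can help to check which URLs the --accept-regex pattern would let through. The sketch below uses grep -E as a stand-in for wget2's own matching, assuming wget2 treats the pattern like a POSIX extended regex. The sample URLs and the function name matches are made up for illustration.

```shell
#!/bin/sh
# The pattern from the command above: accept URLs without a query string,
# or URLs whose query string starts with "page".
pattern='(^[^\?]+$)|(^.*\?page.*$)'

# Hypothetical helper: succeeds if the URL matches the pattern.
matches() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Made-up sample URLs to try the pattern on.
for url in \
  'https://www.example.com/articles/' \
  'https://www.example.com/articles?page=2' \
  'https://www.example.com/articles?sort=asc' \
  'https://www.example.com/search?q=x&page=3'
do
  if matches "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

The first two URLs are accepted and the last two are rejected. Note that the last one is rejected because the pattern only matches when page comes directly after the question mark.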