siiky
2022/11/09
2022/11/09
en
Sent an email asking for permission to make a mirror/archive. This time it was for Pagat, a site with tons and tons of card games. And like last time, permission was given provided that I don't make any archives/mirrors public. Fair enough!
My Raspberry Pi has been busy downloading the whole thing:
wget -o download.log -w 30 --random-wait --mirror -k -K -p -i links.txt
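For anyone who doesn't have the short flags memorized, here is the same invocation spelled out with wget's long option names. It is only a reference sketch: the `mirror_site` function name is made up for illustration, and the command above is what was actually run.

```shell
# Hypothetical wrapper (the name mirror_site is not from the original post).
# It runs the same wget invocation, using long option names:
#   --output-file      (-o) write wget's log to download.log
#   --wait=30          (-w) base delay of 30 seconds between requests
#   --random-wait      vary each delay between 0.5x and 1.5x the base
#   --mirror           recursion, timestamping, and unlimited depth
#   --convert-links    (-k) rewrite links so the copy works locally
#   --backup-converted (-K) keep each unconverted original as *.orig
#   --page-requisites  (-p) also fetch images, CSS, etc. that pages need
#   --input-file       (-i) read the start URLs from links.txt
mirror_site() {
  wget \
    --output-file=download.log \
    --wait=30 \
    --random-wait \
    --mirror \
    --convert-links \
    --backup-converted \
    --page-requisites \
    --input-file=links.txt
}
```

With a 30 second base wait and `--random-wait`, each request is delayed by roughly 15 to 45 seconds, which keeps the load on the site low at the cost of a long download.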
The links.txt file was generated from the sitemap.xml with this CHICKEN script:
(import srfi-1 ssax)

(let* ((sitemap (ssax:xml->sxml (current-input-port) '()))
       (entries (cdaddr sitemap))
       (urls (map (o car (cute alist-ref 'http://www.sitemaps.org/schemas/sitemap/0.9:loc <>) cdr)
                  entries)))
  (for-each print urls))
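If CHICKEN isn't at hand, roughly the same list can be pulled out with standard text tools. This is a rough sketch and not what was used here: it does not parse XML properly, the `extract_locs` name is made up, and it assumes `<loc>` elements carry no namespace prefix or CDATA and that `grep` supports `-o` (GNU, BSD and BusyBox grep do).

```shell
# Hypothetical helper: print every <loc> URL from a sitemap on stdin,
# one per line. Text-based approximation, not a real XML parser.
extract_locs() {
  # Join all lines so a <loc> element split across lines still matches.
  tr -d '\n\r' |
    # Emit each <loc>...</loc> element on its own line.
    grep -o '<loc>[^<]*</loc>' |
    # Strip the tags, trim surrounding whitespace, decode &amp;.
    sed -e 's/^<loc>//' \
        -e 's/<\/loc>$//' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e 's/&amp;/\&/g'
}
```

Like the script above, it reads the sitemap from standard input, so something like `extract_locs < sitemap.xml > links.txt` would produce the input file for wget.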
Some details so far:
$ find www.pagat.com/ -type f | wc -l
2941
$ find www.pagat.com/ -type f -iname '*.html' | wc -l
1812
$ du -bchs www.pagat.com/
66M     www.pagat.com/
66M     total