Pagat Archive

siiky

2022/11/09

2022/11/09

en

I sent another email asking for permission to make a mirror/archive, this time of Pagat, a site with tons and tons of card games. Like last time, permission was given on the condition that I don't make any archives/mirrors public. Fair enough!

My Raspberry Pi has been busy downloading the whole thing:

wget -o download.log -w 30 --random-wait --mirror -k -K -p -i links.txt
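
For readers not fluent in wget's short flags, here is a sketch of the same invocation spelled out with GNU Wget's long option names. The wrapper name mirror_from_list and its two optional arguments are hypothetical, added only so the command can be shown with comments; the options themselves are the ones used above.

```shell
# Hypothetical wrapper around the command above, using long option names.
# $1: file with start URLs (default links.txt), $2: log file (default download.log)
mirror_from_list() {
  # --output-file:        write wget's log to a file instead of stderr  (-o)
  # --wait, --random-wait: wait around 30 s, randomly varied, between requests (-w 30)
  # --mirror:             recursive download with timestamping, unlimited depth
  # --convert-links:      rewrite links so the copy can be browsed locally (-k)
  # --backup-converted:   keep each original as *.orig before converting (-K)
  # --page-requisites:    also fetch images, stylesheets, etc. (-p)
  # --input-file:         read the URLs to fetch from a file (-i)
  wget --output-file="${2:-download.log}" \
       --wait=30 --random-wait \
       --mirror --convert-links --backup-converted --page-requisites \
       --input-file="${1:-links.txt}"
}

# Usage: mirror_from_list links.txt download.log
```

The long wait is what makes the download so slow, and also what keeps it polite towards the server.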

The links.txt file was generated from the site's sitemap.xml with this CHICKEN Scheme script:

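;; Reads sitemap.xml on stdin and prints one URL per line.
(import srfi-1 ssax)
(let* ((sitemap (ssax:xml->sxml (current-input-port) '()))
       ;; cdaddr: take the third item of the tree (here the root urlset) and drop its tag
       (entries (cdaddr sitemap))
       ;; the text of each entry's loc child; tag names carry the namespace URI as prefix
       (urls (map (o car (cute alist-ref 'http://www.sitemaps.org/schemas/sitemap/0.9:loc <>) cdr) entries)))
  (for-each print urls))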
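
Without CHICKEN at hand, roughly the same list can be produced with a quick-and-dirty shell sketch. This is not what was used for the download; the function name extract_locs is hypothetical, and it assumes the sitemap uses plain, unprefixed <loc> tags with no CDATA sections, and a grep that supports -o (GNU and BSD grep do).

```shell
# Hypothetical helper: print the text of every <loc>...</loc> element read from stdin.
# Text-based matching, not a real XML parser: namespace prefixes and CDATA
# are not handled, and XML entities such as &amp; are left undecoded.
extract_locs() {
  grep -o '<loc>[^<]*</loc>' |          # one match per output line
    sed -e 's:^<loc>::' -e 's:</loc>$::'  # strip the surrounding tags
}

# Usage: extract_locs < sitemap.xml > links.txt
```

The Scheme version is the more robust of the two, since it actually parses the XML.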

Some details so far:

$ find www.pagat.com/ -type f | wc -l
2941
$ find www.pagat.com/ -type f -iname '*.html' | wc -l
1812
$ du -bchs www.pagat.com/
66M	www.pagat.com/
66M	total