How to archive a website in a future-proof way (involves PDF hybrid)

evenwicht@lemmy.sdf.org · 2 months ago

In principle the ideal archive would contain the JavaScript for forensic (and similar) use cases, as there is both a document (HTML) and an app (JS) involved. But then we would want the choice of whether to run the app (or at least inspect it), while also having the option to faithfully restore the original rendering offline. You seem to imply that saving JS is an option. I wonder: if you choose to save the JS, does it then save the stock skeleton of the HTML, or the rendered result?

evenwicht@lemmy.sdf.org · edit-2 2 months ago

wget has a --load-cookies file option. It wants the original Netscape cookie file format. Depending on your GUI browser you may have to convert it. I recall in one case I had to parse the session ID out of a cookie file then build the expected format around it. I don’t recall the circumstances.
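For the "build the expected format around it" case, here is a rough Python sketch. It assumes you already have the session ID in hand. The cookie name `sessionid`, the domain and the helper names are placeholders I made up, not anything from a real site. The Netscape format is one cookie per line with seven tab-separated fields.

```python
import time


def netscape_cookie_line(domain, name, value, path="/", secure=True,
                         expires=None):
    """Format one cookie as a line of the Netscape cookie file format.

    Fields, tab-separated: domain, include-subdomains flag, path,
    secure flag, expiry (Unix time), name, value.
    """
    if expires is None:
        # Assumption: one day of validity is enough for a one-off fetch.
        expires = int(time.time()) + 24 * 3600
    # By convention the subdomain flag is TRUE when the domain starts with a dot.
    include_subdomains = "TRUE" if domain.startswith(".") else "FALSE"
    fields = [
        domain,
        include_subdomains,
        path,
        "TRUE" if secure else "FALSE",
        str(int(expires)),
        name,
        value,
    ]
    return "\t".join(fields)


def write_cookie_file(filename, lines):
    """Write the header plus cookie lines to a file for --load-cookies."""
    with open(filename, "w", encoding="utf-8") as fh:
        fh.write("# Netscape HTTP Cookie File\n")
        for line in lines:
            fh.write(line + "\n")
```

Usage would be something like `write_cookie_file("cookies.txt", [netscape_cookie_line("example.com", "sessionid", "PASTE-ID-HERE")])`, followed by `wget --load-cookies cookies.txt` and the URL.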

Another problem: some anti-bot mechanisms crudely look at user-agent headers and block curl attempts on that basis alone.
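When the block really is based on the user-agent header alone, sending a browser-like string is sometimes enough. A small sketch that builds the wget command line from Python; the user-agent string below is only an example value (copy the one your own browser sends), and `wget_argv` is a made-up helper name.

```python
import subprocess

# Example value only; substitute the string your own browser sends.
EXAMPLE_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"


def wget_argv(url, user_agent=EXAMPLE_UA, cookie_file=None):
    """Build a wget argument list with a custom User-Agent header."""
    argv = ["wget", "--user-agent=" + user_agent]
    if cookie_file is not None:
        # Optional: reuse a Netscape-format cookie file.
        argv += ["--load-cookies", cookie_file]
    argv.append(url)
    return argv


def fetch(url, **kwargs):
    """Run wget; raises CalledProcessError if wget exits non-zero."""
    subprocess.run(wget_argv(url, **kwargs), check=True)
```

This will not help against anti-bot checks that look at more than the header.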

(edit) when cookies are not an issue, wkhtmltopdf is a good way to get a PDF of a webpage. So you could have a script do a wget to get the HTML faithfully, and wkhtmltopdf to get a PDF, then pdfattach to put the HTML inside the PDF.
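A rough sketch of that script in Python, assuming wget, wkhtmltopdf and pdfattach (from poppler-utils) are all installed and on the PATH. The file names and function names are arbitrary choices of mine. Note that pdfattach writes to a separate output file instead of modifying the input PDF.

```python
import subprocess


def hybrid_pdf_commands(url, html_path="page.html", pdf_path="page.pdf",
                        out_path="page-hybrid.pdf"):
    """Return the three commands for the wget + wkhtmltopdf + pdfattach idea."""
    return [
        # 1. Save the HTML exactly as the server sends it.
        ["wget", "-O", html_path, url],
        # 2. Render the page to PDF (wkhtmltopdf fetches the URL itself).
        ["wkhtmltopdf", url, pdf_path],
        # 3. Embed the saved HTML as an attachment in a new PDF.
        ["pdfattach", pdf_path, html_path, out_path],
    ]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `run_all(hybrid_pdf_commands("https://example.com/"))` would leave the hybrid in `page-hybrid.pdf`. One caveat: the page is fetched twice, so the HTML and the PDF could differ if the content changes between requests.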

evenwicht@lemmy.sdf.org · 2 months ago

It’s perhaps the best way for someone who has a good handle on it. The docs say --mirror “sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.” So you would need to tune it so that it’s not grabbing objects that are irrelevant to the view, and probably exclude some file types like video and audio. If you get a well-tuned command worked out, that would be quite useful. But I do see a couple of shortcomings nonetheless:

If you’re on a page that required you to log in and do some interactive things to get there, then I think passing the cookie from the GUI browser to wget would be non-trivial.
If you’re on a capped internet connection, you might want to save from the browser’s cache rather than refetch everything.

But those issues aside I like the fact that wget does not rely on a plugin.
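As a starting point for the tuning, here is a sketch that builds a mirror command which rejects some media suffixes. The suffix list is a guess of mine, and whether --no-parent and --convert-links suit a given site is a judgment call. `mirror_argv` is a made-up helper name.

```python
import subprocess


def mirror_argv(url, reject_suffixes=("mp4", "webm", "mp3", "ogg", "wav")):
    """Build a wget mirror command that skips some video and audio types."""
    return [
        "wget",
        "--mirror",            # -r -N -l inf --no-remove-listing
        "--no-parent",         # do not climb above the starting directory
        "--page-requisites",   # also fetch the CSS and images a page needs
        "--convert-links",     # rewrite links so the copy browses offline
        "--reject=" + ",".join(reject_suffixes),
        url,
    ]


def mirror(url, **kwargs):
    """Run the mirror command; raises CalledProcessError on failure."""
    subprocess.run(mirror_argv(url, **kwargs), check=True)
```

Infinite depth can still pull in a lot, so a smaller `--level` may be needed on big sites.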

evenwicht@lemmy.sdf.org · 2 months ago

The other thing is, what about JavaScript? JS changes the presentation.

Markdown is probably ideal when saving an article, like a news story. It might even be quite useful to get it into a Gemini-compatible language. But what if you are saving the receipt for a purchase? A tax auditor would suspect shenanigans if handed a reformatted copy. So the idea with archival is generally to closely (faithfully) preserve the doc.

evenwicht@lemmy.sdf.org · edit-2 2 months ago

IIUC you are referring to this extension, which is Firefox-only (like the save page WE is).

Indeed the beauty of ZIP is stability. But the contents are not. HTML changes so rapidly, I bet if I unzip an old MAFF file it would not have stood the test of time well. That’s why I like the PDF wrapper. Nonetheless, this WebScrapBook could stand in place of the MHTML from the save page WE extension. In fact, save page WE usually fails to save all objects for some reason. So WebScrapBook is probably more complete.

(edit) Apparently WebScrapBook gives a choice between HTZ and MAFF. I like that it timestamps the content, which is a good idea for archived docs.

(edit2) Do you know what happens with JavaScript? I think JS can be quite disruptive to archival. If WebScrapBook saves the JS, it’s saving an app, in effect, and that language changes. The JS also may depend on being able to access the web, which makes a shitshow of archival because obviously you must be online and all the same external URLs must still be reachable. OTOH, saving the JS is probably desirable if doing the hybrid PDF save because the PDF version would contain the result, not the JS.
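One way to at least check whether an archive carries script files: HTZ and MAFF are both ZIP-based, so Python's zipfile module can list the entries. A sketch under that assumption; `script_entries` is a made-up name, and it only finds scripts stored as separate .js entries, not inline script tags or scripts referenced by remote URL.

```python
import zipfile


def script_entries(archive):
    """Return the names of .js entries in a ZIP-based archive.

    `archive` can be a path to an .htz or .maff file, or an open
    binary file object.
    """
    with zipfile.ZipFile(archive) as zf:
        return sorted(
            name for name in zf.namelist() if name.lower().endswith(".js")
        )
```

An empty list does not prove the page is script-free, for the reasons above.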
