wget has a --load-cookies file option. It wants the original Netscape cookie file format, so depending on your browser you may have to convert its cookie store. I recall one case where I had to parse the session ID out of a cookie file and then build the expected format around it by hand, though I don't remember the circumstances.
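For reference, the Netscape format is one tab-separated line per cookie. A minimal hand-built file plus the wget invocation might look like this (the domain, cookie name, and value are placeholders; the expiry is a Unix timestamp):

    # cookies.txt -- Netscape HTTP Cookie File, fields separated by TABs:
    # domain  incl-subdomains  path  secure  expiry(unix)  name  value
    .example.com	TRUE	/	FALSE	1767225600	SESSIONID	abc123

    wget --load-cookies cookies.txt https://example.com/page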
Another problem: some anti-bot mechanisms crudely inspect the User-Agent header and block curl attempts on that basis alone.
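Both wget and curl let you spoof that header, which is often enough for the crude checks (the UA string below is just an example of a browser-like value, and this alone won't defeat smarter anti-bot systems):

    wget --user-agent="Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0" https://example.com/page
    # or with curl:
    curl -A "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0" https://example.com/page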
(edit) When cookies are not an issue, wkhtmltopdf is a good way to get a PDF of a webpage. So you could have a script do a wget to fetch the HTML faithfully, wkhtmltopdf to render a PDF, and then pdfattach to embed the HTML inside the PDF.
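A rough sketch of that pipeline (the URL and filenames are placeholders):

    #!/bin/sh
    # Archive a page as raw HTML plus a rendered PDF,
    # with the HTML embedded in the PDF as an attachment.
    url="https://example.com/page"
    wget -O page.html "$url"                    # faithful HTML copy
    wkhtmltopdf "$url" page.pdf                 # rendered PDF snapshot
    pdfattach page.pdf page.html archive.pdf    # poppler: <in.pdf> <attachment> <out.pdf>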
In principle the ideal archive would contain the JavaScript for forensic (and similar) use cases, since there is both a document (the HTML) and an app (the JS) involved. But then we would want the choice of whether to run the app (or at least inspect it), while still being able to faithfully restore the original rendering offline. You seem to imply that saving the JS is an option. If you choose to save the JS, does it then save the stock HTML skeleton as served, or the rendered result?