C-Command Software Forum

Any best practice for saving web pages in terms of most useful format?

Oftentimes, I end up saving a web page in multiple formats… does anyone have suggestions on their workflow for easily clipping content? What I want:

  1. It needs to be as easy as a bookmark - I don’t want to “think” about which format. So, picking a format, or saving a format each time, isn’t working out. Yet I am afraid to commit to one format. I don’t want to go a year and then realize, shucks, I shoulda stuck with PDF, or shoulda went webarchive all the time.

  2. I would like to preserve the structure - i.e., text import with images is usually messy in other apps when I’ve tried it. Some Javascript ends up showing as text, all layout is lost, etc.

  3. I want it to be fully text-searchable - this means image screenshots are out, since they’d then need to be run through OCR. Tried DEVONthink for this reason, but it was slow to OCR, and not close to 100% accurate even on screenshots of simple fonts!

  4. I want to be sure it works without the Internet connected, and without the original site available.

  5. If it could clean out any banner ads, great, haha.

WebArchive seemed to be the obvious choice - I was okay with the lack of compatibility beyond Safari. Figured I could also do a batch convert to PDF on some or all of the archives later. However, I noticed sometimes they are slow to load! I thought that WebArchive would pull all images in things like image slideshows that use JavaScript, but now I think that is not the case.

So, if it won’t truly pull everything that could appear on a modern web page, I’d rather just use PDF capture - seems cleaner, lighter, more portable. I was surprised that the PDF kit that EagleFiler uses seems to get a very nice layout. Not sure if this is limited to sites that have a stylesheet for printing, though? I am guessing that is the case.

I also have a nice PDF capture tool in Safari, forget who makes it, that scrolls the whole page WITHOUT reloading - that gets a very accurate snapshot of what you’re seeing, of course. But then it’s not searchable.

Also, if sticking with PDF - continuous, or split? Do I give up the benefit of cleaner page breaks if I want split later, using another tool (or even re-capturing within EagleFiler from the continuous version’s record)? I figure continuous will look more like a web page, less space wasted when browsing on any computing device, etc.; just not as good for reading via e-reader apps, I believe (I think then one wants pages sized to fit the aspect ratio of the e-reader, right?). This use case is probably going to be very rare though (reading on an e-ink screen with slow refresh rate), so I guess I don’t mind having to make a conscious decision on a different capture for content like that, or converting it afterwards, as and when needed.

Thanks for any suggestions! I’ve also tried Pocket, Instapaper, and Raindrop.io. Raindrop.io had me excited since it does offline saving on their servers and has a Dropbox sync option, but I found out it only syncs an HTML bookmark file of all bookmarks. :frowning: Also, just saving a bookmark takes a few to several seconds, which is very annoying. I realize saving a WebArchive can take time too, but I think I can even navigate away from the page and it will still complete. I mean, I don’t need to wait while it captures in EagleFiler.

Pocket and Instapaper are okay. Instapaper is just too simple, only good for things you intend to read for a while, I’d say, not for general bookmark stuff. Pocket seems good, but the app is barebones on Mac (can’t even put the left column in Dark Mode, only the rest). I might keep using Pocket, but it doesn’t feel like a bookmark tool either. It is quick, though. But it can’t capture logged-in sites, since it captures from their server, from what I understand. And there’s no way to store offline content on your computer, only on their server, so again, no ability to browse saved copies on your computer with no Internet (I haven’t actually tested that, but I am pretty sure the Pocket app isn’t caching everything in your account locally).

Safari is supposed to save all the resources (images and such) into the web archive; however, there are complications for some pages that use JavaScript to load content later. Sometimes, on display, the page will load fresh copies of the resources, ignoring what’s in the archive. This can be disabled using the EnableJavaScript esoteric preference if you want. That will speed things up, although PDF will always be somewhat faster.

Secondly, some rare pages may not load all the images until certain user actions happen (e.g. clicking on a carousel) and so those won’t all get saved into the archive (or PDF).
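If you want to check which resources actually made it into a given web archive, here is a minimal Python sketch. It is not something EagleFiler provides; it assumes the file is a property list with the usual `WebMainResource` and `WebSubresources` keys that Safari web archives use, and it ignores nested frames (`WebSubframeArchives`). The function names and the example.com URLs in the sample data are made up for illustration.

```python
import plistlib


def list_webarchive_resources(data: bytes):
    """Return (url, mime_type, size_in_bytes) for each resource in a web archive.

    Assumes the common Safari layout: one "WebMainResource" dict plus an
    optional "WebSubresources" list. Nested frames are not inspected.
    """
    archive = plistlib.loads(data)
    resources = [archive["WebMainResource"]] + archive.get("WebSubresources", [])
    return [
        (
            res.get("WebResourceURL", ""),
            res.get("WebResourceMIMEType", ""),
            len(res.get("WebResourceData", b"")),
        )
        for res in resources
    ]


def build_sample_archive() -> bytes:
    """Build a tiny stand-in archive so the sketch runs without a real file."""
    sample = {
        "WebMainResource": {
            "WebResourceURL": "https://example.com/",  # hypothetical URL
            "WebResourceMIMEType": "text/html",
            "WebResourceData": b"<html></html>",
        },
        "WebSubresources": [
            {
                "WebResourceURL": "https://example.com/a.png",  # hypothetical URL
                "WebResourceMIMEType": "image/png",
                "WebResourceData": b"\x89PNG",
            }
        ],
    }
    return plistlib.dumps(sample, fmt=plistlib.FMT_BINARY)


if __name__ == "__main__":
    for url, mime, size in list_webarchive_resources(build_sample_archive()):
        print(f"{size:>8}  {mime:<12}  {url}")
```

For a real capture, pass the file's bytes instead of the sample, e.g. `list_webarchive_resources(open("page.webarchive", "rb").read())`. If a carousel image is missing from the listing, it was never saved, which is different from the case where the image is in the archive but the page reloads it anyway.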

Generally, EagleFiler will use the screen stylesheet, but for some sites it uses the print one, and some pages reformat for printing when using the default stylesheet.

Right, you can navigate away as soon as you hear the capture sound.

Ahh, that explains it all now! No wonder some pages seem like they’re “loading” even though they’re in the archive format. (not just in EagleFiler, but in any app I’ve tried) I was puzzled since I see the items, such as images, that it’s apparently reloading already in the archive’s contents. I had just started reading some of those esoteric preferences as I read the manual, but probably wouldn’t have picked up on that one that you mentioned, so thanks for that! I think probably web archive will serve my needs then since that will make it feel truly “offline” and snappier. :slight_smile: As long as Apple doesn’t drop support for loading webarchives, it seems like it’s the best choice since I can always “down convert” to PDF or whatever I want, later. (or even batch convert, I imagine - there must be some tool that does that, it appears there are utilities that do)

I saw just now that there is a script for EagleFiler to convert and then replace webarchives in a library to PDF, so yep, webarchive seems like the best choice! (https://c-command.com/scripts/eaglefiler/web-archive-to-pdf)

Thanks again for your help. I’m amazed I got both my questions answered/resolved in one day, on a Sunday no less! That alone truly puts EagleFiler in its own class.
