Optionally capture .html rather than .webarchive?

Roy · July 4, 2008, 3:24pm

Greetings to the group, and to Michael. I’ve been testing out EF for a little bit… and like many I am impressed all around. My library is 2.8GB, containing just over 10,000 records, but so far EF is handling all the data admirably. The RAM footprint is under 100 MB and search is amazingly fast. Having this identical library in DevonThink caused that program to consume 350MB at boot and over 750MB if I tried to use any of its AI features (emphasis on ‘try’ because there were many beach balls to be had). So color me impressed… I think I’ll be sticking around.

One question has come up during my experimenting with EF, however, and I hope you’ll indulge me in a discussion of it. One of the biggest assets of EF is that it stores everything in a normal folder structure, ensuring that everything can be read long into the future. In much the same vein, for the past several years I have always saved webpages in plain HTML format rather than Safari’s Webarchive because the latter is, so far as I know, a proprietary format only readable by WebKit-enabled applications. As such, I find capturing in plain HTML preferable for a number of reasons:

Data longevity. Despite being an open-source project, it is hard to assume that WebKit will be around forever, or that the latest OS incarnation of WebKit will always be able to read old Webarchives. However, so long as webpages are based on HTML, the OS rendering kit will always have to be able to parse .html files, so I see the latter as a much more secure format.
Web access from PCs. Another concern is that, should I ever want to put my EF directory structure in the cloud for web access, there’s a good chance that I would be accessing it on a PC without Safari.
Size. A typical news article on a complex site can be over 600KB in Archive format, but only 70KB in HTML. This is over an 8x difference in size. I know hard drives will always get cheaper and bigger, but I think it’s advantageous to keep the library as lean as possible.

I realize that it would be simple enough to just use Safari’s save dialog to save as Plain HTML into the To Import folder, but I think that the Capture and Capture with Options features (along with the additional metadata they gather) are even more useful. I suppose what I am asking is… is it possible in the future for EF to capture as Plain HTML instead of a WebArchive? If such an option would be prohibitively difficult (or just not supported by the WebKit framework), I understand, but I thought I would suggest it, as a .html + metadata Capture would be ideal for my library.

Once again, much kudos to you Michael for making a data organization app that not just works, but works well.

Michael_Tsai · July 5, 2008, 9:16am

I think the Web archive format is safe. It’s a simple format, and the code to read it is open-source. There are so many Web archives out there that if Apple dropped the ball someone would pick it up.
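
To illustrate how simple the format is, here is a minimal Python sketch that pulls the main HTML out of a Web archive using only the standard library. It assumes the file is a property list whose top-level WebMainResource dictionary stores the page bytes under WebResourceData and the address under WebResourceURL; the function names are made up for this example, and this is not how EagleFiler or WebKit reads the file.

```python
import plistlib
from pathlib import Path
from typing import Tuple


def extract_main_resource(archive_bytes: bytes) -> Tuple[str, bytes]:
    """Return (source URL, raw page bytes) for a Web archive's main resource.

    Assumes the archive is a property list with a "WebMainResource"
    dictionary; subresources (images, CSS, scripts) are ignored.
    """
    archive = plistlib.loads(archive_bytes)  # detects binary or XML plists
    main = archive["WebMainResource"]
    return main.get("WebResourceURL", ""), main["WebResourceData"]


def webarchive_to_html(src: Path, dest: Path) -> str:
    """Write the main resource of `src` to `dest` and return its source URL."""
    url, data = extract_main_resource(src.read_bytes())
    dest.write_bytes(data)  # bytes are written unchanged, original encoding kept
    return url
```

Calling webarchive_to_html(Path("article.webarchive"), Path("article.html")) would leave a plain .html file next to the archive. Images and stylesheets stay behind in the archive, so the result will look like any other page saved as plain HTML.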

That said, I can understand your wanting to use a more widely supported format. (It’s a mystery to me why Safari for Windows doesn’t read Web archives!) I’m not sure HTML is the right one, though. With so many sites taking advantage of CSS, JavaScript, images, etc., most pages will not look very good as plain .html files.

Is that really what you want? Or would you prefer PDF or RTF? You can generate PDFs of Web pages using “Save PDF to EagleFiler,” and there’s a command in the Records menu to convert Web archives to RTFs. If there’s demand, perhaps I’ll make an option to capture URLs directly into one of those formats.

Roy · July 5, 2008, 6:50pm

Thanks Michael for the reply.

At your suggestion, I’ve been playing around with RTF and PDF formats. Using my 700KB webarchive test article from a complicated website, a PDF is about 200KB. An RTF footprint takes about 400KB, but the conversion renders the article unreadable (i.e. graphics and layout get messed up and put before any of the text). So at least size-wise, HTML is still the winner at 70KB. I agree though that some pages don’t look that good as plain HTML.

So this is still a bit of a conundrum. Looking at the characteristics of my ideal storage format:

Readable cross-platform
Lean file size
Able to be generated from Capture script with ‘From’ and ‘URL’ metadata

It seems the best case scenario in this situation would be an RTF capture, perhaps not of the entire page but of a selection (so as to avoid RTF conversions of the page layout). I can currently do this in EF but the process is a bit involved: 1) drag the selected text into EF, 2) convert to RTF, 3) delete the original webarchive. This process gives me an 88KB file, basically just as good as plain HTML (that 88KB also includes the feature photo for the article, so it stands to reason that RTF in some cases may have a smaller file size than HTML).

While metadata does get preserved in the above process, it requires a click-hold-drag for every article I wish to import (which on a normal day would likely be a dozen or so times), and there’s no opportunity to assign title/tags during the import process. Perhaps the ideal situation would be to have a native RTF capture and an “import selected text only” checkbox in the “Capture with Options” dialog. But I would imagine that such features would likely require substantial work, so unless there’s lots of demand for them, I should be able to use the above method just fine.

Thanks again for hearing me out… I really like the open way in which EF organizes files and I want to make sure that the files themselves are just as open.

Michael_Tsai · July 6, 2008, 5:03pm

The RTF conversion is handled by the OS. I think in most cases it’s pretty good (better than plain HTML), although sometimes it messes up. In such cases it’s usually possible to delete the unwanted graphics and layout (even the formatting, if you want).
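
For anyone who wants to try an OS-level conversion outside EagleFiler, one command-line route on macOS is the textutil tool, which accepts Web archives as input. The Python sketch below only wraps that command. The helper names are hypothetical, and whether this produces the same output as EagleFiler's own conversion command is an assumption, not something confirmed here.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def textutil_command(src: Path, fmt: str = "rtf") -> List[str]:
    """Build the textutil invocation that converts `src` to `fmt`.

    The output file is placed next to the source with the new extension.
    """
    out = src.with_suffix("." + fmt)
    return ["textutil", "-convert", fmt, "-output", str(out), str(src)]


def convert(src: Path, fmt: str = "rtf") -> Path:
    """Run the conversion; textutil ships with macOS only."""
    if shutil.which("textutil") is None:
        raise RuntimeError("textutil not found; this sketch only runs on macOS")
    subprocess.run(textutil_command(src, fmt), check=True)
    return src.with_suffix("." + fmt)
```

On a Mac, convert(Path("article.webarchive")) should write article.rtf beside the original, which makes it easy to compare file sizes before deciding which copy to keep.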

Thanks for the feedback. I don’t think that everything you want is possible, but much of it is, and I’m working on it.

robrecord · July 29, 2008, 3:37pm

I would like pure URL (webloc) and RTF import. I suggest the ability to edit down the imported file to just the snippet you desire - or use Safari’s method of doing this and save to the ‘To Import’ folder - unless there’s a way EagleFiler can handle this?

Michael_Tsai · July 29, 2008, 5:34pm

Which Safari method are you referring to?

robrecord · July 31, 2008, 1:39pm

“Which Safari method are you referring to?”

Clip Web Archive

Right-click on something in Safari, use Clip Web Archive > With Selection/Element.
It allows you to edit and then save to (for instance) the To Import folder.

Is there a way to rig it so that you can capture this with options (to enter tags/folder/info)?

I am aware of the MyPage bookmarklet but it’s a little cumbersome sometimes.

Thanks for all the work you’re doing Michael

Michael_Tsai · July 31, 2008, 2:07pm

Oh, that’s not a feature of Safari. It’s part of the SafariStand input manager.

However, you can drag and drop the selection from Safari into EagleFiler to import just part of the page. Perhaps in the future this will work with EagleFiler’s options dialog.

Michael_Tsai · October 14, 2008, 9:19am

EagleFiler 1.4 adds support for these and other formats.

Michael_Tsai · November 23, 2010, 2:34pm

With EagleFiler 1.5 you can hold down the Option key when importing (via drag and drop, the Drop Pad, etc.) to have it show the Options window.