HTML
This is the least controlled format of all in terms of embedded metadata: literally hundreds of applications can create it, each adding its own flavor of syntax and scripts.
While <meta name=...> tagging in the <head> block is usually respected by all, we have seen the wildest excesses in what follows the equals sign.
Far from all originators use <dc... or <dc.terms>, and those that do often add fancy designations of their own. Recently we have seen custom tags with Facebook and Twitter prefixes, which often contain citation-relevant values.
Suitable software should contain a tool that allows unusual meta names to be remapped to valid dc.elements, dc.terms, citation variables or custom attribute/value pairs before they are added to a collection.
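The kind of remapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the way any particular product does it: the REMAP table, the MetaCollector class and the remap_meta function are hypothetical names, and the target names on the right-hand side of the table are examples that a user would adjust. The sketch also assumes that Facebook-style tags carry the "og:" prefix in a property= attribute and Twitter-style tags the "twitter:" prefix.

```python
from html.parser import HTMLParser

# Hypothetical remapping table: unusual meta names on the left,
# the Dublin Core style names they should be filed under on the right.
REMAP = {
    "og:title": "dc.title",
    "twitter:title": "dc.title",
    "og:description": "dc.description",
    "twitter:description": "dc.description",
    "author": "dc.creator",
    "keywords": "dc.subject",
}


class MetaCollector(HTMLParser):
    """Collects (name, content) pairs from every <meta> tag."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        # Open Graph style tags use property= instead of name=.
        name = a.get("name") or a.get("property")
        content = a.get("content")
        if name and content is not None:
            self.pairs.append((name.strip(), content.strip()))


def remap_meta(html, table=REMAP):
    """Return (mapped, unmapped) dictionaries of meta values.

    Names that already start with "dc." or "dcterms." are kept as they
    are (lowercased); other names are looked up in the table. Anything
    not found ends up in `unmapped` so that it can be reviewed by hand.
    """
    collector = MetaCollector()
    collector.feed(html)
    mapped, unmapped = {}, {}
    for name, value in collector.pairs:
        key = name.lower()
        if key.startswith(("dc.", "dcterms.")):
            target = key
        else:
            target = table.get(key)
        bucket = mapped if target else unmapped
        bucket.setdefault(target or key, []).append(value)
    return mapped, unmapped
```

Returning the unmapped names separately, rather than discarding them, lets the user decide whether a new entry in the table is worthwhile before the file goes into a collection.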
Another issue with HTML files is that many are virtual or created on the fly, e.g. in response to a query, and many are the result of multiple redirections and are not necessarily the file the user thought they had clicked on. This typically happens on sites built with framesets or master pages with many links.
Suitable software should allow collections and tables of contents to include dc.elements and dc.terms of virtual files as well.
As downloading such files can yield zero-length files, it is recommended that all downloads be verified before they are added to a collection.
For this reason digi-libris shows all the files that were opened during a single click on a link.
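As a rough illustration of both points (recording what was opened on the way to the final file, and checking that the download is not empty), here is a minimal Python sketch using only the standard library. It is not a description of how digi-libris works internally; RedirectRecorder, is_usable_download and fetch are hypothetical names, and the 30 second timeout and the one-byte minimum size are arbitrary assumptions.

```python
import os
import urllib.request


class RedirectRecorder(urllib.request.HTTPRedirectHandler):
    """Remembers every redirect target seen while following a link."""

    def __init__(self):
        self.chain = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Note the status code and the new location, then let the
        # standard handler decide whether to follow the redirect.
        self.chain.append((code, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def is_usable_download(path, min_bytes=1):
    """True only if the file exists and holds at least min_bytes bytes."""
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes


def fetch(url, dest):
    """Download url to dest.

    Returns (final_url, redirect_chain, ok), where ok is False for a
    missing or zero-length result.
    """
    recorder = RedirectRecorder()
    opener = urllib.request.build_opener(recorder)
    with opener.open(url, timeout=30) as resp, open(dest, "wb") as out:
        out.write(resp.read())
        final_url = resp.geturl()
    return final_url, recorder.chain, is_usable_download(dest)
```

A caller would add the file to a collection only when the third return value is true, and could show the recorded chain to the user so that it is clear which file was actually retrieved.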