The well-behaved document

A well-behaved document is an electronic document that is both user friendly and search friendly.

The Open Access idea is gaining momentum, and the sheer number of scientific, professional and other documents available on the Internet makes keeping an overview a real challenge. To organize the material that accumulates over time and to find information again when it is needed for work, documents must be easy to read and easy to classify with little or no manual intervention. You also need suitable software (such as digi-libris Reader) that automatically indexes and alphabetically sorts all newly added documents to help you keep track of it all.
A well-behaved PDF or ePub document that is user friendly and search friendly offers important advantages for document producers, distributors and end users:
      • facilitates scholarly communication
      • is easier to discover and retrieve
      • is easier to find again in one’s personal knowledge base
      • gives authors better exposure
      • has a better chance of being referenced.
User friendly
means a document that is easy to read and easy to navigate on any reading device and for which reading software is readily available. It is in an open format and does not depend on proprietary (paid) software for display, styles and multimedia content. It must be searchable and have bookmarks (in applications that allow for them, such as PDF files in Acrobat or Adobe Reader), an interactive table of contents, i.e. one with “clickable” links to the correct target page, and possibly an interactive index, cross references and links to external resources. Except for copyrighted material, it should not be password protected or encrypted; it must allow users to print it and to copy and paste portions of the text, and possibly to add bookmarks and comments of their own.
This applies not only to scientific papers, monographs and manuals but to all documents that one consults or refers to rather than reads in a continuous stream from cover to cover, as one does with novels or other literary works.

Making documents interactive and embedding metadata does not necessarily require extra work if it is properly planned and some simple rules (such as the consistent use of styles) are observed.
An author who has spent a year on a thesis can certainly spend ten more minutes writing down some keywords and a description; the typesetter who produces a table of contents anyway only has to check a single box before exporting to PDF; and the publisher can easily import an XMP file containing metadata into the final document.
Search friendly
is a document that has useful embedded metadata which librarians, digital asset managers and individuals can exploit to classify it in their knowledge base with little or no manual intervention.
University and public libraries prefer to keep the metadata of all their documents in separate catalogues or databases for reasons of integrity and maintainability. One approach does not exclude the other, however: embedding the same metadata, or a selection of it, directly in a digital resource automatically makes this data available to third parties who download or otherwise obtain such resources and may want to preserve them locally in their own knowledge base or consult them off-line. Notation in attribute/literal pairs is probably adequate for most private or local repositories.
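
As a rough illustration of how attribute/literal pairs can support classification without manual intervention, the following Python sketch builds an alphabetically sorted keyword index from a small collection. The file names, attribute names and values are invented for the example, and a real tool would read the pairs from the documents themselves.

```python
from collections import defaultdict

# Hypothetical collection: each document carries attribute/literal pairs.
# File names, attributes and values are made up for illustration.
collection = {
    "thesis.pdf": {"title": "Soil Erosion in Alpine Valleys",
                   "keywords": "erosion; geology"},
    "manual.epub": {"title": "Field Sampling Manual",
                    "keywords": "geology; sampling"},
}


def build_keyword_index(docs):
    """Map each keyword to the alphabetically sorted titles that carry it."""
    index = defaultdict(list)
    for pairs in docs.values():
        for keyword in pairs.get("keywords", "").split(";"):
            keyword = keyword.strip().lower()
            if keyword:
                index[keyword].append(pairs["title"])
    # Sort keywords and titles so the result reads like a printed index.
    return {kw: sorted(titles) for kw, titles in sorted(index.items())}


index = build_keyword_index(collection)
for keyword, titles in index.items():
    print(f"{keyword}: {', '.join(titles)}")
```

Because the index is derived entirely from the pairs, adding a document to the collection requires no manual filing.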

Search-friendly scholarly publications
Search-friendliness, or machine-readability, is increasingly important in view of the global influence of digitization and open access on the changing publishing and archiving environment. Most scholarly publications are becoming available on the Internet, which makes processing and systematically archiving them a real challenge. To organize the bulk of this content, scholarly papers should be easily classifiable with little or no manual intervention, which requires properly embedded metadata. Explicit metadata facilitate the work of librarians, digital asset managers and non-expert users because
  • sources are automatically classified and indexed for searching across a collection of documents
  • journal submissions and publications are easier to locate and cite
  • interdisciplinary networking and sharing of information is facilitated
  • authors get more and better exposure
  • there is no need to hunt for citation-relevant metadata on the Internet, which particularly benefits students who lack the patience or the wherewithal to locate relevant repositories
  • citations and bibliographic references can be generated off-line, a must for self-published articles, work in progress and editorial content.
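
To make the last point concrete, here is a minimal Python sketch of generating a citation off-line from metadata that travels with the document. The record, its field names and the author-year layout are assumptions chosen for illustration; the layout is not an official citation style.

```python
def format_citation(meta):
    """Build a simple author-year citation from embedded metadata.

    The layout is an ad hoc illustration, not an official citation style.
    """
    creators = meta.get("creator", [])
    if len(creators) > 2:
        authors = f"{creators[0]} et al."
    else:
        authors = " and ".join(creators) or "Anonymous"
    year = meta.get("date", "n.d.")[:4]
    parts = [f"{authors} ({year}).", f"{meta['title']}."]
    if meta.get("publisher"):
        parts.append(f"{meta['publisher']}.")
    if meta.get("identifier"):
        parts.append(meta["identifier"])
    return " ".join(parts)


# Hypothetical record, as it might be read from a document's embedded metadata.
record = {
    "title": "Soil Erosion in Alpine Valleys",
    "creator": ["Muster, A.", "Example, B."],
    "date": "2014-03-01",
    "publisher": "Example University Press",
    "identifier": "doi:10.0000/example",
}

citation = format_citation(record)
print(citation)
```

No network access is involved: everything the citation needs is already in the record.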

Metadata standards
Dozens of metadata standards are currently available, each linked to its own vocabulary. Unfortunately, none of them is universally applicable. A student seeks data to generate citations, while an expert searches a personal collection of papers using specific technical criteria. Information about book publishers, copyright holders of images or paintings, songwriters, or architects of ancient pyramids is essential metadata about the respective items. Metadata are processed to classify items in search engines and to share them with the global community. Different users seek different pieces of data. Art critics, veterinary specialists, physicists and lawyers download content from interdisciplinary web domains, and they would prefer to do so without manual intervention, relying on embedded metadata.

Universities and public libraries are challenged to upgrade their services and to contribute more actively to scientific research. Although they prefer to integrate and preserve the metadata of all their documents in separate catalogues or databases, I think that one approach should not exclude the other. Embedding a descriptive selection of this metadata in a digital resource automatically makes the data available to users for off-line consulting and referencing, and it saves them time. A notation in attribute/literal pairs is probably adequate for most private or local repositories. A separate Extensible Metadata Platform (XMP) sidecar file can be linked or sent along if direct embedding is impossible (e.g. because a checksum must be preserved).
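
The checksum case can be illustrated with a short Python sketch: the metadata goes into a sidecar file next to the document, so the document's own bytes, and therefore its checksum, stay untouched. The naming rule (same base name, .xmp extension) is a common convention assumed here rather than a requirement, and the placeholder file is not a real PDF.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def write_sidecar(document_path, xmp_text):
    """Store metadata next to the document instead of inside it.

    Assumption: the sidecar shares the document's base name and uses
    the .xmp extension, a common but not mandatory convention.
    """
    sidecar = Path(document_path).with_suffix(".xmp")
    sidecar.write_text(xmp_text, encoding="utf-8")
    return sidecar


with tempfile.TemporaryDirectory() as folder:
    document = Path(folder) / "paper.pdf"
    document.write_bytes(b"%PDF-1.4 placeholder content")  # stand-in, not a real PDF
    before = sha256_of(document)
    sidecar = write_sidecar(document, "<x:xmpmeta xmlns:x='adobe:ns:meta/'/>")
    after = sha256_of(document)
    sidecar_name = sidecar.name
    unchanged = before == after

print(sidecar_name, "written; document checksum unchanged:", unchanged)
```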

A pragmatic solution
Documents with embedded metadata are gradually increasing in number in open-access repositories and on publishers’ websites, partly because institutions require that metadata be provided along with documents. New forms of metadata, such as those on HTML pages pointing to Facebook and Twitter, are constantly developing. Citation-specific variables are currently used in conjunction with the Citation Style Language (CSL). Adding to the jumble, there is a wide range of proprietary namespaces in which each organisation defines metadata specific to different subjects. A document can therefore include hundreds of metadata variables, which may or may not be meaningful for users. The solution to this issue should be universal. I suggest an individually extensible and universally applicable metadata set that builds on the widely used Dublin Core standard (minus refinements) plus an unlimited number of customizable attribute/value pairs. Consider it an alternative Dublin Core application profile (DCAP) for individuals who may or may not have to rely on a single standard issued by a parent institution. For the exchange of metadata with third parties, it relies on Adobe®’s XMP technology.
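
One way to picture such a metadata set is the Python sketch below: values for the fifteen elements of simple (unrefined) Dublin Core are kept apart from free-form attribute/value pairs. The class name, method name and the two personal attributes are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass, field

# The fifteen elements of simple (unqualified) Dublin Core.
DUBLIN_CORE_ELEMENTS = frozenset({
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
})


@dataclass
class MetadataSet:
    """Dublin Core elements plus free-form attribute/value pairs."""
    core: dict = field(default_factory=dict)
    custom: dict = field(default_factory=dict)

    def add(self, attribute, value):
        """File a value under Dublin Core if it matches, else as a custom pair."""
        target = self.core if attribute in DUBLIN_CORE_ELEMENTS else self.custom
        target[attribute] = value


meta = MetadataSet()
meta.add("title", "Soil Erosion in Alpine Valleys")
meta.add("creator", "Muster, A.")
meta.add("reading_status", "to review")  # hypothetical personal attribute
meta.add("project", "Thesis chapter 3")  # hypothetical personal attribute

print(sorted(meta.core), sorted(meta.custom))
```

Keeping the two groups separate means the Dublin Core part stays interoperable while the custom part can grow without limit.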

Who should provide and embed metadata?
The ultimate responsibility for the inclusion of useful metadata lies with the publisher. However, all other stakeholders in the development of a document, from author through to distributor, should also contribute by adding metadata to the final versions of their documents because
  • authors know their subject best and should propose relevant tags for their papers’ abstracts and citations. Ideally, they should generate a list and submit it along with their manuscripts. XMP sidecar files are probably the best option to ensure the integrity of their metadata
  • reviewers may suggest changes to the titles and descriptions in addition to factual adjustments
  • editors and translators may include different data and add keywords for optimal searches through search engines
  • publishers can adapt metadata and add specifics such as Creative Commons licenses, copyright details, dates of submission and acceptance, and ISSN/DOI identifiers
  • libraries and content providers, who gather metadata for their catalogues, should ensure that useful metadata accompanies each document for automatic classification, indexing, and retrievability.

Adding metadata to PDF files
Adobe®’s XMP™ technology is well suited for embedding metadata, and it is the format implemented in PDF documents. It has placeholders for Dublin Core elements and other standard metadata types such as DICOM for medical applications and IPTC, which is used by the international press community and professional photographers to secure their copyrights. It also allows proprietary sets to be defined with their own namespaces, as well as an unlimited number of custom attribute/value pairs, which can be used to describe anything. Viewing, editing and exporting metadata requires suitable (free or low-cost) software. Embedding the metadata in a PDF document requires Acrobat® or another PDF tool that can import XMP files.
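
For readers who want to see what such metadata looks like, the following Python sketch assembles a small XMP document with Dublin Core elements and one custom pair, using only the standard library. It is a simplified illustration: the custom namespace URI and its `my` prefix are hypothetical, the optional xpacket wrapper is omitted, and embedding the result in a PDF still needs a PDF tool as described above.

```python
import xml.etree.ElementTree as ET

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "my": "http://example.org/ns/custom/",  # hypothetical custom namespace
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, name):
    """Return the qualified tag name ElementTree expects, e.g. {uri}title."""
    return f"{{{NS[prefix]}}}{name}"


def build_xmp(title, creators, keywords, custom):
    """Assemble an x:xmpmeta element holding Dublin Core and custom pairs."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # dc:title is a language alternative; x-default marks the default text.
    alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = title

    # dc:creator is an ordered list, dc:subject an unordered bag.
    for tag, container, values in (("creator", "Seq", creators),
                                   ("subject", "Bag", keywords)):
        holder = ET.SubElement(ET.SubElement(desc, q("dc", tag)),
                               q("rdf", container))
        for value in values:
            ET.SubElement(holder, q("rdf", "li")).text = value

    # Custom attribute/value pairs live in their own namespace.
    for attribute, value in custom.items():
        ET.SubElement(desc, q("my", attribute)).text = value
    return xmpmeta


xmp = build_xmp("Soil Erosion in Alpine Valleys", ["Muster, A."],
                ["erosion", "geology"], {"project": "Thesis"})
xml_text = ET.tostring(xmp, encoding="unicode")
print(xml_text)
```

The same text could be saved as a sidecar file or handed to a PDF tool that imports XMP.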
