Guidelines for Preserving New Forms of Scholarship

Guidelines

Embedded enhanced features, especially those that link to resources outside of the publication or use an unusual format, are at the highest risk of failing in the future. For this reason, a meaningful caption is vital: it tells future readers what they should expect to find at that location in the text and, preferably, gives them some means of finding and accessing it. Ideally, this caption would include a title, source, unique persistent identifier (e.g. DOI, ARK ID, or Handle), and a link to an archived copy if that differs from the identifier. Though any link could ultimately fail, this information would at least indicate where the user might find an archived copy. When creating captions, apply the standards available within the format you are using to support automated parsing. For example, HTML5 has the <figure> and <figcaption> elements. The “alt” attribute is also widely used to supply context if a feature cannot be viewed. In this respect, a meaningful caption may also meet standards for digital accessibility.
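
As a rough illustration of the caption described above, the following Python sketch assembles an HTML5 <figure> with a <figcaption> that carries a title, source, persistent identifier, and optional archived copy. The function name, its parameters, and the sample values are assumptions made for this example; they are not part of any standard or platform.

```python
from html import escape


def figure_with_caption(src, alt, title, source, identifier, archived_url=None):
    """Return an HTML5 <figure> whose caption names the title, source,
    persistent identifier, and (if different) an archived copy.

    All parameter names are hypothetical; adapt them to your own metadata.
    """
    parts = [
        f"{escape(title)}.",
        f"Source: {escape(source)}.",
        f'Identifier: <a href="{escape(identifier)}">{escape(identifier)}</a>.',
    ]
    # Only mention the archived copy when it differs from the identifier.
    if archived_url and archived_url != identifier:
        parts.append(
            f'Archived copy: <a href="{escape(archived_url)}">'
            f"{escape(archived_url)}</a>."
        )
    caption = " ".join(parts)
    return (
        "<figure>\n"
        f'  <img src="{escape(src)}" alt="{escape(alt)}"/>\n'
        f"  <figcaption>{caption}</figcaption>\n"
        "</figure>"
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(
        figure_with_caption(
            src="media/survey-map.png",
            alt="Map of survey sites",
            title="Survey sites",
            source="Example Press",
            identifier="https://doi.org/10.5555/example",
        )
    )
```

Keeping the caption fields in a fixed order and wrapping them in the format's own caption element is what makes them easier for a preservation workflow to parse automatically.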

Where non-text features are supplied as separate publication resources, this guideline may also be relevant:
24. Create metadata for each publication resource

Some platforms support assigning each publication resource its own descriptive metadata and landing page, making it possible to cite each resource independently of the text as a whole. In these cases, if the publisher has the capacity to assign unique persistent identifiers such as valid DOIs, ARK IDs, or Handles to each publication resource and to provide them as part of the metadata, this can help maintain connections between the components of a publication and sustain citation links. As an example, consider a video embedded in an EPUB with a caption under it that includes a registered DOI. The DOI points to a page dedicated to the published video. If the publisher no longer has that material, a preservation service may have the option to register the location of its preservation copy with doi.org so that the link points to the new location. If a resource is local to the publication and is not intended to be cited or described independently, then a meaningful caption provides useful context, but creating persistent identifiers isn’t necessary.

These guidelines also relate to the use of identifiers:
17. Use persistent identifiers to link or cite external resources
24. Create descriptive metadata for each publication resource, include identifiers
31. Assign persistent identifiers to significant versions

Correct handling of character encoding can make an enormous difference to whether a publication is properly rendered. The encoding should be expressed in the metadata and/or within the publication, as appropriate for the format. For example, websites may declare the encoding in a <meta> tag and/or in the charset parameter of the Content-Type HTTP header.
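
A minimal Python sketch, using only the standard library, of how a workflow might read the encoding declared in those two places and compare them. The function names are assumptions for this example, and the sketch only reports what is declared; it does not verify that the bytes are actually in that encoding.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()  # lowercased, or None if absent


class _MetaCharset(HTMLParser):
    """Record the first encoding declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset:
            return
        a = dict(attrs)
        if a.get("charset"):
            # <meta charset="utf-8">
            self.charset = a["charset"].lower()
        elif (a.get("http-equiv") or "").lower() == "content-type":
            # <meta http-equiv="Content-Type" content="text/html; charset=...">
            self.charset = charset_from_content_type(a.get("content") or "")


def declared_charsets(html_text, content_type_header=None):
    """Return the encodings declared in the HTTP header and in the markup."""
    parser = _MetaCharset()
    parser.feed(html_text)
    header = (
        charset_from_content_type(content_type_header)
        if content_type_header
        else None
    )
    return {"http_header": header, "meta_tag": parser.charset}
```

If the two values disagree, as in a page served as ISO-8859-1 but declaring UTF-8 in its markup, that mismatch is worth resolving before the publication is handed to a preservation service.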

EPUB3 has the potential to be a solid format for preservation when creators (1) abide by the official standard; (2) keep to core media types; (3) avoid encryption; and (4) encapsulate all required resources within the EPUB file. Including large files, remote resources, or interactive features, however, can make the EPUB large and therefore impractical for general distribution. Where the publishing platform has mechanisms for generating EPUBs, implementing a workflow for a preservation-specific EPUB3 that can be created alongside the public-facing ebook would be a boon to preservation services.

When preserving a website rather than an EPUB, this guideline may be relevant:
58. For a custom web application, consider encapsulation early

Each EPUB should have structured bibliographic metadata associated with it. This can be expressed in the package metadata within the EPUB (the OPF or equivalent), or as a separate file stored adjacent to the EPUB. Such a file may be generated during export in a format such as ONIX, JATS, or Dublin Core. In order to process metadata at scale, the file naming convention, the location of the file relative to the EPUB, and the format should all remain consistent.
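
To show what processing this metadata at scale might involve, here is a hedged Python sketch that reads Dublin Core fields from the package document inside an EPUB. It assumes the package document is located through META-INF/container.xml and that the fields use the Dublin Core element namespace; the function names and the list of fields collected are choices made for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def opf_path(epub):
    """Find the package document path declared in META-INF/container.xml."""
    container = ET.fromstring(epub.read("META-INF/container.xml"))
    return container.find("c:rootfiles/c:rootfile", NS).get("full-path")


def read_package_metadata(epub_file):
    """Return selected Dublin Core fields from the EPUB package metadata."""
    with zipfile.ZipFile(epub_file) as epub:
        package = ET.fromstring(epub.read(opf_path(epub)))
    metadata = package.find("opf:metadata", NS)
    if metadata is None:
        return {}
    fields = {}
    # The field list is illustrative, not exhaustive.
    for name in ("title", "creator", "identifier", "language", "publisher", "date"):
        values = [
            el.text.strip()
            for el in metadata.findall(f"dc:{name}", NS)
            if el.text and el.text.strip()
        ]
        if values:
            fields[name] = values
    return fields
```

Calling read_package_metadata on each file in a batch gives a quick view of which EPUBs are missing basic fields such as an identifier or language.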

These guidelines relate to bibliographic metadata for other formats:
21. Provide structured bibliographic metadata with exported publications
45. Embed bibliographic metadata in the <head> of a web-based publication

If a publication contains digital enhancements that are important enough to warrant preservation, the publication inclusive of its enhancements may be substantial enough to warrant a new ISBN, DOI, or other persistent identifier. This practice would ensure that the new version can be easily distinguished from other unenhanced versions of the publication in the preservation system.

These guidelines also relate to management of versions and use of identifiers:
9. Define the “version of record” in your context
17. Use persistent identifiers to link or cite external resources
23. Include version information in bibliographic metadata

If EPUBs are encrypted by the publisher for distribution and compatibility with specific EPUB readers, the publisher will need to make the unencrypted version available to the preservation institution. Encryption makes opening or even validating an EPUB difficult, perhaps impossible.

Some publishers may use copyrighted fonts and obfuscate them in order to protect those rights when the fonts are embedded in the EPUB. Because obfuscated fonts create both a technical and a copyright challenge for preservation, open fonts should be used.
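
Both encryption and font obfuscation are normally declared in META-INF/encryption.xml inside the EPUB container, so a simple pre-deposit check is possible. The Python sketch below lists the resources declared there and labels those that use the IDPF font obfuscation algorithm identifier. The function name is an assumption for this example, and a file that is encrypted without being declared this way would not be detected.

```python
import zipfile
import xml.etree.ElementTree as ET

ENC = "{http://www.w3.org/2001/04/xmlenc#}"
# Algorithm identifier used for EPUB font obfuscation.
FONT_OBFUSCATION = "http://www.idpf.org/2008/embedding"


def encrypted_resources(epub_file):
    """List (resource, kind) pairs declared in META-INF/encryption.xml.

    Returns an empty list when the EPUB declares no encrypted resources.
    """
    with zipfile.ZipFile(epub_file) as epub:
        if "META-INF/encryption.xml" not in epub.namelist():
            return []
        root = ET.fromstring(epub.read("META-INF/encryption.xml"))

    found = []
    for data in root.iter(ENC + "EncryptedData"):
        method = data.find(ENC + "EncryptionMethod")
        ref = data.find(f"{ENC}CipherData/{ENC}CipherReference")
        algorithm = method.get("Algorithm") if method is not None else None
        uri = ref.get("URI") if ref is not None else None
        kind = "font-obfuscation" if algorithm == FONT_OBFUSCATION else "encryption"
        found.append((uri, kind))
    return found
```

An empty result suggests the file can be opened and validated directly; any other result is a prompt to supply an unencrypted copy or switch to open fonts.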

The EPUB specification defines a list of core media types that are supported. Using formats outside of this list introduces an additional risk for preservation since EPUB reader tools may not support these formats. Publishers should therefore consider whether using something outside of these types is justifiable given that doing so may result in the loss of that media.
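
One way to review a manifest against that list is sketched below in Python. The set of media types here is an illustrative subset written from memory of the EPUB 3 specification and should be replaced with the list in the version of the specification you target; the function name is likewise an assumption.

```python
import xml.etree.ElementTree as ET

OPF = "{http://www.idpf.org/2007/opf}"

# Illustrative subset only: confirm against the EPUB specification in use.
CORE_MEDIA_TYPES = {
    "image/gif",
    "image/jpeg",
    "image/png",
    "image/svg+xml",
    "audio/mpeg",
    "audio/mp4",
    "text/css",
    "font/ttf",
    "font/otf",
    "font/woff",
    "font/woff2",
    "application/xhtml+xml",
    "application/javascript",
    "text/javascript",
    "application/x-dtbncx+xml",
    "application/smil+xml",
}


def non_core_items(opf_xml):
    """Return (href, media-type) for manifest items outside the core set."""
    package = ET.fromstring(opf_xml)
    return [
        (item.get("href"), item.get("media-type"))
        for item in package.iter(OPF + "item")
        if item.get("media-type") not in CORE_MEDIA_TYPES
    ]
```

Each item reported is a candidate for conversion to a core type or for a documented justification.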

This more general guideline may also be useful to consider:
11. Use non-proprietary, broadly supported and adopted open file formats

Embedding media resources within an EPUB ensures that a future reader will be able to locate these resources and view them in the original context of the work. In order to keep the overall size of an EPUB manageable for access, it may be advantageous to embed lower quality copies of the media and link to higher resolution versions via persistent links such as DOIs.

These guidelines also refer to where media content is hosted:
14. Avoid depending on externally hosted web services in general
29. Consider a preservation-specific version of the EPUB

Where there is a strong justification for using remote resources or non-core media, EPUB supports a fallback option that allows something else that is supported to be displayed in its place. This functionality should be used in these instances.

These guidelines may also be relevant when considering use of non-core media types:
29. Consider a preservation-specific version of the EPUB
41. Harvesting the content of iframes may have unpredictable outcomes

Basic errors in the formatting of an EPUB may not have an impact on the presentation in your favorite EPUB reader. Anything that does not follow the EPUB specification, however, may cause problems in other tools and with future playback. W3C’s EPUBCheck is a tool that can help identify any formatting issues and provides an opportunity to resolve them before distribution or preservation.

If externally linked web content must be visually embedded in an EPUB, recognize that it is at very high risk of loss. If the content cannot be moved inside the EPUB container using supported features, this material should have an informative caption and be described clearly in the structural metadata within the EPUB. Specifically, the package’s manifest metadata should have an item that (a) specifies the resource URL, (b) lists “remote-resources” as a property, and (c) defines a fallback item. If the embedded web content is not supplied to the preservation service but can be successfully harvested, this additional metadata could facilitate a preservation workflow to identify and capture these features using an appropriate harvesting tool. If, for example, a visually embedded Google Trends chart no longer displays active content in the future, an archived web page with this chart could be accessed instead. This content should be noted consistently and documented as part of the publication that needs to be preserved. In general, any consistency that makes it easy to automatically identify the visually embedded web-based features within the text increases the chance of designing a scalable workflow to manage them.
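
A preservation workflow could use the manifest to locate these features automatically. The Python sketch below reports manifest items whose href is an http(s) URL, whether each names a fallback that resolves to another manifest item, and which items declare the “remote-resources” property. The function name and report layout are assumptions for this example, and exactly which manifest item should carry the property is worth confirming against the EPUB specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

OPF = "{http://www.idpf.org/2007/opf}"


def remote_resource_report(opf_xml):
    """Summarize remote resources and fallbacks declared in an OPF manifest."""
    package = ET.fromstring(opf_xml)
    items = list(package.iter(OPF + "item"))
    ids = {item.get("id") for item in items}

    # Items that declare the remote-resources property.
    declaring = [
        item.get("id")
        for item in items
        if "remote-resources" in (item.get("properties") or "").split()
    ]

    # Items whose href points outside the EPUB container.
    remote = []
    for item in items:
        href = item.get("href") or ""
        if urlparse(href).scheme in ("http", "https"):
            fallback = item.get("fallback")
            remote.append(
                {
                    "id": item.get("id"),
                    "url": href,
                    "fallback": fallback,
                    "fallback_resolves": fallback is not None and fallback in ids,
                }
            )
    return {"remote_items": remote, "declaring_items": declaring}
```

The list of URLs under remote_items is the kind of input a harvesting tool would need in order to capture the embedded content alongside the EPUB.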

These guidelines may also be relevant to embedding web content in an EPUB:
16. Captions for non-text features add meaningful context
40. Indicate the license status of resource in the HTML around the object
41. Use HTML iframes with caution
42. Facilitate a local web archive workflow for iframe content

Iframe, short for “inline frame,” is an HTML tag that can be used to embed the content from any URL inside an HTML-based document such as an EPUB or webpage. Some publishers may use an iframe to embed things like YouTube videos or advanced media players in an EPUB. It is more sustainable to use the HTML <video> or <audio> elements when embedding audio or video. EPUB3 readers are not required to support iframes. If iframes are used, the content may not render in all EPUB3 readers and is at a high risk of loss through link rot.
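
To find where a publication relies on iframes rather than native media elements, a content document can be scanned as in the Python sketch below. The class and function names are assumptions for this example, and the scan only looks at the src attribute of each element, not at nested <source> children.

```python
from html.parser import HTMLParser


class EmbedFinder(HTMLParser):
    """Collect iframe sources and native <video>/<audio> elements."""

    def __init__(self):
        super().__init__()
        self.iframes = []  # src values of <iframe> elements
        self.media = []    # (tag, src) for <video> and <audio>

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "iframe":
            self.iframes.append(a.get("src"))
        elif tag in ("video", "audio"):
            self.media.append((tag, a.get("src")))


def find_embeds(html_text):
    """Return the iframes and native media found in one content document."""
    finder = EmbedFinder()
    finder.feed(html_text)
    return {"iframes": finder.iframes, "media": finder.media}
```

Every iframe URL reported is a dependency on an external site and a candidate for replacement with an embedded <video> or <audio> element.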

These guidelines may also be relevant to embedding media in EPUBs:
12. Start discussions around multimedia early in the process
14. Avoid external dependencies in general
34. Opt for core media types when embedding multimedia in an EPUB

A preservation service may not collect web content outside of the agreed-upon domain names unless copyright for the content being harvested is clear. If third-party pages and features that are visually embedded in an EPUB or a web-based publication are meant to be preserved, it should be possible to identify which content publishers have the right to collect so that a web crawler can be configured to include or exclude it. One way to differentiate could be to consistently express the rights in the metadata that is supplied to the preservation service. Another option is to apply structured metadata describing the rights status to the HTML. The Creative Commons REL documentation includes examples of this that cover both page- and object-level licenses; this approach could support automated harvesting decisions at either level. Alternatively, a publisher could supply a list of domain names to include for harvest during the initial preservation workflow configuration.
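
Two of the options above can be sketched in Python: reading license links marked with rel="license" in the HTML, and checking a URL against a publisher-supplied list of domain names. The function names and the example domains are assumptions for this illustration; a real crawler would apply its own scoping rules.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LicenseFinder(HTMLParser):
    """Collect hrefs of <a> and <link> elements marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        rels = (a.get("rel") or "").lower().split()
        if tag in ("a", "link") and "license" in rels and a.get("href"):
            self.licenses.append(a["href"])


def find_licenses(html_text):
    """Return the license URLs declared in a page or fragment."""
    finder = LicenseFinder()
    finder.feed(html_text)
    return finder.licenses


def in_harvest_scope(url, allowed_domains):
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)
```

A harvesting decision could then combine both signals, for example by collecting a third-party page only when it is in scope or declares a license that permits copying.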

These guidelines may also be useful to consider when embedding external web content:
25. Add license information to resource-level metadata
38. List the URLs for external web content in the metadata
45. Embed metadata that includes a license in the <head> of a web page

An HTML iframe can contain a wide range of content types from a wide range of sources, which makes iframes a challenge for preservation. The quality of automated website archiving in general can vary greatly. When an iframe is embedded in an EPUB or website, the more inconsistent, complex, and dynamic its content, the more likely that content is to be lost in an automated process. If these features are important to preserve, consider a manual process to capture and package the intellectual components of the iframe content in another form. For example, a video or screenshot with a caption that links to the website might be a sufficient fallback for conveying the contents of the iframe.

These guidelines may also be relevant to use of iframes:
38. List the URLs for each embedded iframe in the metadata
39. Avoid use of iframes in EPUBs
42. Facilitate a local web archiving workflow to support iframes

Preservation services might not support a workflow that automatically harvests the content of iframes embedded within an EPUB. Even with such a service, the quality could vary greatly, and the content might change following publication. If fallback options are not sufficient, a more stable approach would be for the publisher to create an archived copy of the web page featured in the iframe. While there are tools that can be run locally by the publisher to perform single-page archiving, there are also third-party archiving services, such as archive.today or Internet Archive’s Save Page Now service, that allow you to archive a single page before publication and generate a persistent link for the embedded web content. This link could be included in a descriptive caption under the embedded feature. Publishers should test the outcome of these single-page captures, as quality can vary depending on the complexity of the website and the harvest method applied.

These guidelines may also be relevant:
14. Avoid dependence on externally hosted platforms for core features
38. Avoid the use of iframes to embed multimedia