SPECTRUM Data Standard – comments

Nick Poole has recently written a proposal for the development of a SPECTRUM Data Standard.  While I fully support this idea, I think that the concept should be broadened, as outlined below.

A standard for assertions

The proposed SPECTRUM Data Standard would be “a modelling of the SPECTRUM Units of Information using the CIDOC CRM”.  As such, it will allow statements to be made about collections objects, their history and management processes applied to them.  While Nick’s position paper states that the SPECTRUM Data Standard will not in itself be an interchange standard, it will nonetheless require some concrete syntax.  Given the intention to use the ResearchSpace work as a starting point, I think it is reasonable to assume that this concrete syntax will be RDF.  This means that each SPECTRUM Unit of Information will be represented by one or more CIDOC CRM classes or properties, expressed as a template for a Linked Data assertion.

The need for shared Linked Data “instances”

While the proposed Data Standard will allow different institutions to make statements like “this object was donated by person X” in a consistent way, this will only be of marginal benefit if every institution defines “person X” in a different way.  The point Nick makes about needing agreement on field names such as “Title” applies equally to data values held within those fields.

There are two levels of problem here.  The first is that pretty much all existing data is stored in collections management systems as string values (e.g. “Light, Richard”). While it is perfectly allowable to include these strings in your RDF results, and valid to say that they conform to the SPECTRUM Data Standard, in practice the results are ambiguous. (Look for “Richard Light” on YouTube – that guy isn’t me!) In order to be clear which person called Richard Light you mean, you need to employ a Linked Data approach, and create a unique, persistent identifier for that person.  The standard way of doing this is to invent a URL for each person, and back it up with facts about the person (expressed, again, as RDF), so that others can determine which Richard Light you mean.  So there is a requirement on collections management systems to provide cultural heritage institutions with the toolset to convert their string-value data to URLs.
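To make the contrast concrete, here is a minimal sketch in Python which writes the same donation statement twice as N-Triples: once with a bare string for the donor, and once with a URL.  The example.org URLs are invented for illustration, and modelling a donation as an E8 Acquisition with the P23 and P24 properties is an assumption made for this example, not something taken from the SPECTRUM proposal.

```python
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def uri(value):
    """Format a URL as an N-Triples IRI term."""
    return "<" + value + ">"


def literal(value):
    """Format a string as an N-Triples plain literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def triple(subject, predicate, obj):
    return f"{subject} {predicate} {obj} ."


def donation(object_url, acquisition_url, donor_term):
    """Build the statements for one donation.

    donor_term is an already formatted term: either literal(...) or uri(...).
    The choice of CRM class and properties is an assumption for this sketch.
    """
    acquisition = uri(acquisition_url)
    return [
        triple(acquisition, uri(RDF_TYPE), uri(CRM + "E8_Acquisition")),
        triple(acquisition, uri(CRM + "P24_transferred_title_of"), uri(object_url)),
        triple(acquisition, uri(CRM + "P23_transferred_title_from"), donor_term),
    ]


# Hypothetical identifiers, invented for illustration only.
OBJ = "http://example.org/object/1234"
ACQ = "http://example.org/acquisition/1234"
PERSON = "http://example.org/person/richard-light-1"

# String value: syntactically fine, but it does not say which Richard Light.
ambiguous = donation(OBJ, ACQ, literal("Light, Richard"))

# URL: the name becomes just one fact attached to an identified person.
linked = donation(OBJ, ACQ, uri(PERSON))
linked.append(triple(uri(PERSON), uri(RDFS_LABEL), literal("Light, Richard")))

for line in linked:
    print(line)
```

In the second version the string has not been lost; it has simply moved to become a label on the person's URL, where further identifying facts can be added alongside it.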

The second problem is that inventing URLs for people (etc.) will not bring any shared benefit if each institution mints its own set of URLs.  While it is possible to “co-reference” URLs (i.e. to say that they refer to the same individual), this becomes a fruitless exercise in stable-door-shutting once the number of systems to co-reference gets much above three.  Much more effective would be the shared use of a standard Linked Data authority, so that everyone can use the same URLs when they are describing the same entity.  This raises the question: where are these standard authorities to come from?  In some cases they will already be out there: for example, there are Linked Data authorities for geographical information (Geonames; OpenStreetMap; Ordnance Survey) which could simply be adopted.  In other cases, the specialized needs of the cultural heritage sector have already been met – in particular the Getty vocabularies (AAT, ULAN, TGN) are being prepared for publication as Linked Data resources.  However, there will be other cases where no central authority exists.  In my view, it should be part of the SPECTRUM Data Standard brief (a) to identify and promote the use of existing Linked Data authorities which are suitable for cultural heritage use and (b) to enable the creation and maintenance/development of new authorities, where required by the community.
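The scaling problem with co-referencing can be shown with a little arithmetic.  The sketch below, in Python, assumes that every pair of institutional URL sets has to be reconciled separately; the institution names are placeholders.

```python
from itertools import combinations


def pairwise_links(systems):
    """Every pair of systems that would need its URLs co-referenced."""
    return list(combinations(systems, 2))


for n in (3, 5, 10):
    # Placeholder names; each stands for one institution minting its own URLs.
    systems = [f"institution-{i}" for i in range(1, n + 1)]
    print(n, "systems:", len(pairwise_links(systems)), "sets of co-reference links")
```

Under that assumption three systems need three sets of links, but ten need forty-five.  With a shared authority, no co-reference links are needed at all.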

Textual resources

Nick mentions in passing the different needs of a conservation description and a web site description.  These are both textual resources. The scenario he outlines suggests that these descriptions are simple string values, i.e. that they have no internal structure.  Such string values are often found in Linked Data applications (e.g. the multilingual abstracts which form a central part of DBpedia descriptions), but cultural heritage institutions produce a great deal of material which has a more complex structure than this. Even a simple exhibition label will typically have a heading, several paragraphs, and a reference number.

I think that a major opportunity will have been lost if the SPECTRUM Data Standard project does not make some attempt to apply the COPE (“Create Once, Publish Everywhere”) philosophy to richly-structured textual resources. Museums have a wealth of formally published material (think exhibition catalogues, wall texts) as well as “grey literature” (e.g. conservation reports, correspondence, email) which is increasingly held in a digital form.  Whereas in the past this material would have been stored in an impenetrable binary format, it is now increasingly possible to access it in a machine-processable form (typically some sort of XML).  This opens up the possibility of applying COPE to such resources as they stand, or (more realistically) of converting them to a standard format which can then be used for COPE delivery.  An example of such a format is the Text Encoding Initiative, which is a well-established framework for encoding humanities documents of any kind.

As well as allowing all or part of these textual resources to be freely repurposed using COPE, their publication as a set of stable web-accessible resources would enable cross-reference and annotation, using standards such as Open Annotation.

Comments on SPECTRUM DAM criteria

This post examines the proposed criteria for eligibility for the SPECTRUM DAM Partner scheme. First, the definition of a DAMS and selected parts of the mapping of DAM activity to SPECTRUM procedures are presented. These are extracted from the SPECTRUM DAM 1.0 document, available at http://www.collectionslink.org.uk/spectrum-dam-resources. The eligibility criteria (taken from http://www.collectionslink.org.uk/spectrum-resources/2104-spectrum-dam-partner-scheme-dam-partner-scheme) are then quoted.  Finally, I provide comments on, and suggestions for, the criteria, and a proposed re-wording of them all.

Definition of a DAMS

The Canadian Heritage Information Network (CHIN) defines digital assets as:

“Digital materials created or owned by your institution.

Digital assets exist in a variety of formats, and can include text, web, audio, video and image files. Digital images of objects in your collection are digital assets, as are logo image files, corporate Powerpoint presentations and any other digital resources created by your institution that generate revenue or that provide valuable content to employees or clients.

Digital assets may be used in many contexts, including sales, marketing, education, web development, collections management and digital preservation. Sometimes you will see the term ‘media asset’ used to refer more narrowly to audio or video content.”

In the broadest sense, DAM refers to the processes and practices involved in the creation, description, storage, discovery, re-use and preservation of digital assets.

Significant statements relating to SPECTRUM procedures

Many of the SPECTRUM procedures mentioned in the mapping of DAM activity to SPECTRUM procedures simply say that this procedure may generate digital media, which should be managed by the DAMS.  The following points are more significant/specific. Numbers in brackets refer to correspondences to the criteria given in the next section:

Object entry: The digital assets should be associated permanently with the object number of their corresponding collection item through a scheme of persistent identifiers. (3) [Incidentally, I think this is conflating two separate points. The DAMS should provide a means of assigning persistent identifiers to the digital resources it manages. Quite separately, it should ensure that there are persistent links between digital resources and relevant collections objects and processes.]
Acquisition: At the same time as the material object is formally accessioned into the permanent collection, all associated digital assets must be accessioned into the DAMS. (1)
Inventory control: Digital assets must be counted as assets of the organisation in the same sense as the physical collections items. The DAMS should serve as a central inventory of digital assets belonging to the organisation in the same context as inventory-level records in the Collections Management System. (1)
Location and movement control: In the sense that the purpose of the SPECTRUM Procedure is to promote the recoverability of the collection items, the equivalent DAMS activity must ensure that the organisation is able to retrieve associated digital assets as efficiently as possible.
Transport: Where digital assets are sent or transferred (for example, by email, cloud storage or FTP service), the organisation should ensure adequate management and documentation of these processes both to promote accountability and to mitigate the risk of infringement or misuse. (5)
Cataloguing: Where possible, the DAMS and associated policies and practices should support the capturing of information about the provenance, rights, usage, format and preservation requirements of the associated digital assets. The cataloguing of digital assets ought equally to provide information about connections or cross-references between assets. The cataloguing of digital assets should be based on a common taxonomy and controlled vocabularies with those used in the classification of the physical collections. (4)
Conservation and collections care: Principles of preventive conservation are also relevant to effective digital asset management. It is important to plan the conversion of redundant or non-standard digital asset formats as part of the ongoing maintenance of the DAMS and DAM Strategy. This also relates to the regular assessment of preferred formats when creating new digital assets. (6)
Risk Management: Policies and procedures for backup, recovery and disaster planning in relation to digital assets (7) ought to be clearly linked to associated SPECTRUM policies and procedures for overall risk management.
Audit: The DAMS should support the process of audit, including provision for the assessment of authenticity and provenance of digital media … (8)
Rights Management: Information about the specific rights management conditions associated with the digital assets should be created, captured and managed within the DAMS. (2)
Use of Collections: … the DAMS should be capable of interoperating with a wide range of other systems, such as web publishing and e-commerce platforms. (5)

Criteria for Eligibility (as specified by CT)

  1. You run a DAM system or have a DAM module or functionality plug-in that is interoperable with a collections management system that is based on, or compatible with SPECTRUM
  2. Your DAM system includes comprehensive rights management
  3. Your DAM system can share a common schema and persistent IDs, respecting the complexity of cultural heritage metadata
  4. Your DAM system can share common vocabularies/authorities
  5. Your system has the ability to batch-export and import data for interoperability with yours and other systems without losing quality and content
  6. Your DAM supports a wide range of current industry-standard media formats and is extensible to account for future formats
  7. Support for back up and data integrity is provided
  8. Support for data auditing and cleaning is provided
  9. The end-user has access to support/ a support community/ a user group / community of developers
  10. You can provide 2 years of accounting records for due diligence purposes plus a recent client reference

Questions and concerns

  1. Delete (or justify) ‘plug-in’. Before ‘interoperable with’, add ‘part of, or’ (and add a comma after ‘with’)
  2. Replace ‘comprehensive rights management’ by ‘rights management capabilities conforming to SPECTRUM standards’
  3. Seems to be two separate points (before and after the comma).  First part: delete (or justify/define) ‘common schema’; change ‘share … persistent IDs’ to ‘assign persistent IDs to digital resources’.  Second part:  presumably this is trying to say that the DAMS should be capable of storing [and delivering? – see (5)] custom cultural heritage metadata?
  4. Replace ‘share’ by ‘facilitate conformance to’
  5. I’m not sure that ‘batch-export and import’ captures the sense of ‘interoperating’ in the source text (above).  For example, a key capability might be the ability to deliver an image, at a specified resolution and containing the desired metadata, in response to a web request.  This may well result in a loss of quality – e.g. if a thumbnail image is requested. This delivery may also involve a change of format/encoding, e.g. TIFF -> JPEG; XML -> HTML. Also, there is potential confusion between ‘data’ and ‘digital resources’.  Finally, the suggested wording loses the point in the original text (above), which relates to recording delivery of a digital resource.
  6. Shouldn’t this include something about format conversion capabilities (as per the source text)?
  7. OK
  8. After ‘auditing’ add ‘conforming to SPECTRUM standards’. What does ‘[data] cleaning’ mean?  Is it tidying of metadata? Or doing something to the digital resource itself? What happened to the need to assess authenticity and provenance, mentioned in the source text?
  9. OK
  10. OK

The criteria cover all the points I extracted from the full SPECTRUM DAMS document, apart from retrievability (Location and Movement Control). Also, as noted under (5), the need to record delivery of a resource has not been included.  The point about the DAMS being able to support delivery of digital resources in a flexible way (e.g. choosing image size/resolution/encoding) needs either to appear in the criteria, or to form part of an (as yet missing) normative description of what a DAMS actually is/does.

Proposed re-wording of the criteria

  1. You run a DAM system or have a DAM module or functionality that is part of, or interoperable with, a collections management system based on, or compatible with, SPECTRUM
  2. Your DAM system includes rights management capabilities conforming to SPECTRUM standards
  3. Your DAM system can assign persistent IDs to the digital resources it manages, and allows these IDs to be used to request resources
  4. Your DAM system can store and deliver custom cultural heritage metadata
  5. Your DAM system can facilitate conformance to common cultural heritage vocabularies/authorities
  6. Your system has the ability to deliver each digital resource in an appropriate variety of formats and resolutions, while retaining required metadata
  7. Your DAM supports a wide range of current industry-standard media formats and is extensible to account for future formats
  8. Support for back up and for ensuring data integrity is provided
  9. Support is provided for data auditing which conforms to SPECTRUM standards
  10. The end-user has access to support/ a support community/ a user group / community of developers
  11. You can provide 2 years of accounting records for due diligence purposes plus a recent client reference

This still leaves the need to say something about searchability, bulk import/export (possibly), and the recording and/or protection of digital resource delivery actions.

Modes WordPress plugin documentation

The Modes plugin framework consists of a number of WordPress shortcodes. These work together to deliver the required content to the current page.  The control framework is Ajax-based.

On first loading the page, each Modes data source will use the active page parameters to frame an HTTP request, and will get back an XML response.  The response handler will in turn trigger each shortcode which uses that data source, passing the XML response DOM as a parameter. Each shortcode will use the information in the XML response to update its HTML contents, typically by applying an XSLT transform to it.

Thus, when the page is first loaded, each Modes shortcode function will simply write an empty named <div> element to the page, to act as the root for the content which will eventually be placed there.

The approach adopted will be based on the Culture Grid explorer (cg-search.htm), and it should be possible to put much of the functionality into JavaScript libraries.  This will limit the amount of “JavaScript written by PHP” code which is required.

Modes datasource

Every page requires a Modes datasource. This is a “non-visual” shortcode, whose purpose is to define the interface to a Modes data file.  When the datasource is initialised, it will build an array of the Modes components which make use of it.
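The loading sequence and the role of the datasource can be sketched as follows.  This is an illustrative stand-in written in Python, not the plugin itself (which will do this work in the browser, in JavaScript, typically with XSLT); the class names, the XML element names and the fake_fetch function are all invented for the example.

```python
import html
import xml.etree.ElementTree as ET


class Component:
    """One shortcode's output area, identified by the id of its <div>."""

    def __init__(self, div_id, render):
        self.div_id = div_id
        self.render = render  # function: XML root element -> HTML string

    def placeholder(self):
        # What the shortcode writes when the page is first loaded.
        return f'<div id="{self.div_id}"></div>'


class DataSource:
    """Non-visual: frames the request and passes the response to its components."""

    def __init__(self, fetch):
        self.fetch = fetch  # function: page parameters -> XML text
        self.components = []

    def register(self, component):
        self.components.append(component)

    def load(self, params):
        dom = ET.fromstring(self.fetch(params))
        # Each registered component gets the same parsed response.
        return {c.div_id: c.render(dom) for c in self.components}


def fake_fetch(params):
    # Stands in for the HTTP request; the response format is hypothetical.
    return ("<results><record><title>Teapot</title></record>"
            "<record><title>Sampler</title></record></results>")


def render_summary(dom):
    items = "".join(
        f"<li>{html.escape(r.findtext('title', default=''))}</li>"
        for r in dom.iter("record")
    )
    return f"<ul>{items}</ul>"


def render_count(dom):
    return f"<p>{len(dom.findall('record'))} records</p>"


source = DataSource(fake_fetch)
source.register(Component("modes-summary", render_summary))
source.register(Component("modes-count", render_count))

updates = source.load({"query": "tea"})
for div_id, content in updates.items():
    print(div_id, "->", content)
```

The point of the design is that a single response serves every component attached to the datasource, so adding another component to a page does not add another request.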

Modes summary/data

A paged view of a set of Modes records.  If there is no query specified, this can default to a Modes browser control, which displays a fixed set of records/searches.

Modes detail

Like Modes summary (maybe not a separate component?); shows a single record, with controls to navigate forward and backwards through the result set.

Modes navigator

A component which allows the user to move through a Modes summary/data result set.  Needs to specify the name of the control which it applies to.

Modes facets

A component which displays terms which are relevant to the current search result, relating to one or more facets. Selecting a term will add that term to the current search, thus refining it.

Modes search terms

This control displays each active search term, and provides a means to deselect any term.
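The interplay between the facets and search terms components can be sketched in Python as below.  The records, facet names and function names are invented for illustration, and the sketch assumes that all active terms must match (a logical AND), which the notes above do not actually specify.

```python
from collections import Counter

# Hypothetical records; a real result set would come from the Modes datasource.
RECORDS = [
    {"id": "1", "material": "ceramic", "place": "Stoke"},
    {"id": "2", "material": "ceramic", "place": "Derby"},
    {"id": "3", "material": "silver", "place": "Derby"},
]


def search(records, terms):
    """Return the records matching every (facet, value) pair in terms."""
    return [r for r in records if all(r.get(f) == v for f, v in terms)]


def facet_counts(results, facet):
    """Terms for one facet which occur in the current results, with counts."""
    return Counter(r[facet] for r in results if facet in r)


def select(terms, facet, value):
    """Facets component: add a term, refining the search."""
    return terms | {(facet, value)}


def deselect(terms, facet, value):
    """Search terms component: drop a term, widening the search again."""
    return terms - {(facet, value)}


terms = frozenset()
print(facet_counts(search(RECORDS, terms), "material"))

terms = select(terms, "material", "ceramic")
print([r["id"] for r in search(RECORDS, terms)])

terms = deselect(terms, "material", "ceramic")
print([r["id"] for r in search(RECORDS, terms)])
```

Both components act on the same set of active terms, which is why each needs to know which result set it applies to.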

Modes lightbox

Displays a result set as a grid of images, using a simple XPath expression to get the image filenames.
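As a rough illustration of the XPath step, the Python sketch below pulls image filenames out of a result set and arranges them in rows.  The XML structure and the path expression are hypothetical (they are not the Modes record format), and the standard library used here supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical result set; the element names are invented for this sketch.
RESULT_XML = """<results>
  <record><title>Teapot</title><image><filename>teapot.jpg</filename></image></record>
  <record><title>Sampler</title></record>
  <record><title>Spoon</title><image><filename>spoon.jpg</filename></image></record>
</results>"""

# The "simple XPath expression" which locates each image filename.
IMAGE_XPATH = "./record/image/filename"


def image_filenames(xml_text, xpath=IMAGE_XPATH):
    """Return the filenames found by the path; records without images are skipped."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(xpath) if el.text]


def lightbox_rows(filenames, columns=2):
    """Split the filenames into rows of the given width for the grid."""
    return [filenames[i:i + columns] for i in range(0, len(filenames), columns)]


print(lightbox_rows(image_filenames(RESULT_XML)))
```

Keeping the path expression as a parameter means the same component could be pointed at a differently structured data file without changing its code.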

Modes map

Displays a (Google) map, with selected items marked.

Modes timeline

Displays a (Simile) timeline, with selected items marked.

Modes and WordPress

We are working to produce a WordPress plugin for the Modes software family. This WordPress site will document our progress, and will also act as a test-bed to demonstrate what we have managed to achieve so far. If you have any thoughts or comments on this, you can leave them as comments on these articles, or send an email to me at richard at light.demon.co.uk.

The approach we are adopting is to support one or more WordPress “shortcodes”, which are codes enclosed in square brackets.  These shortcodes will be replaced by suitable Modes data and associated controls. (For example, there will be a “navigator”, which allows users to browse through a set of results from a search.) The first round of work produced a “Modes data” plugin, which was demonstrated at the Modes Workshop in September 2012. This supported searches on Modes indexes, and the display of specific records (or sets of records).  It is Ajax-based, meaning that the results are updated without needing to refresh the complete page.
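For readers who have not met shortcodes before, the idea can be sketched in a few lines of Python.  This is only an illustration of the substitution: WordPress does this work itself, in PHP, and the shortcode name modes_data and its index attribute are invented for the example.

```python
import re

# Matches [name attr="value" ...]; a simplification of what WordPress accepts.
SHORTCODE = re.compile(r'\[(\w+)((?:\s+\w+="[^"]*")*)\s*\]')
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def expand(text, handlers):
    """Replace each known shortcode in text with its handler's output."""

    def replace(match):
        name, raw_attrs = match.group(1), match.group(2)
        handler = handlers.get(name)
        if handler is None:
            return match.group(0)  # leave unknown shortcodes untouched
        return handler(dict(ATTRIBUTE.findall(raw_attrs)))

    return SHORTCODE.sub(replace, text)


def modes_data(attrs):
    # Hypothetical handler: writes an empty placeholder for later filling.
    return f'<div id="modes-data" data-index="{attrs.get("index", "")}"></div>'


page = 'Our collection: [modes_data index="maker"] See also [unknown].'
print(expand(page, {"modes_data": modes_data}))
```

The page author only ever types the bracketed code; everything else is supplied by the plugin.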

For the production-level plugin, we need to make it easy for the public to see what is in a Modes collection without having to run a search.  Therefore we want to develop a “Modes browser” which offers a number of pre-designed searches or subsets of the data, so that with a single click users can see a set of records, and with a second click they can be looking at the detail of one specific object.

Apart from the shortcode to support browsing and searching, we are interested in developing a plugin to support the production of user-generated content and the uploading of such data to a Modes data file.  This may range from a simple “comment” box through to complex structured data entry.  It could be used equally for public participation and for in-house collections management support.