Common language: integration and reconciliation

Overview

Reconciliation is the process of unifying instances in a dataset with a common vocabulary (essentially a dictionary of terms, people, things, places, or concepts). By making all references to "Marcel Duchamp" point to the same person record, we ensure that data expressing his roles in an object's history can be accessed along with other relevant data about him. Likewise, reconciling object classifications lets us move efficiently across art and archival systems.

Vocabularies used

The Duchamp Research Portal uses six primary sources to unify handling of entities:

  • AAT – Getty Art & Architecture Thesaurus
  • ULAN – Getty Union List of Artist Names
  • LCNAF – Library of Congress Name Authority File
  • LCREL – Library of Congress Relators
  • ISNI – International Standard Name Identifier
  • Wikidata – for inter-vocabulary walking

Basic identifier patterns

URIs from the AAT are used directly in entity classifications (such as classifying an object as 'paintings' or an identifier as 'primary') and in role technique classifications, e.g.:

{
    "@context": "https://linked.art/ns/v1/linked-art.json",
    "about": [],
    "classified_as": [
        {
            "id": "aat:300033618",
            "label": "paintings (visual works)",
            "type": "Type"
        },
        {
            "id": "aat:300133025",
            "label": "works of art",
            "type": "Type"
        }
    ],
    "produced_by": {
        "type": "Production",
        "consists_of": [
            {
                "carried_out_by": [
                    {
                        "id": "http://data.duchamparchives.org/pma/archive/actor/LCNAF/n80057220",
                        "label": "Duchamp, Marcel, 1887-1968",
                        "type": "Actor"
                    }
                ],
                "technique": [
                    {
                        "id": "aat:300025103",
                        "label": "artists (visual artists)",
                        "type": "Type"
                    }
                ],
                "type": "Production"
            }
        ]
    }
}
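
As a rough illustration of how a consumer might read these classifications, the sketch below collects every AAT identifier from a record shaped like the one above and expands the aat: prefix to a full URI. The function names are hypothetical and not part of the portal's code, and the expansion assumes the standard Getty namespace http://vocab.getty.edu/aat/.

```python
from typing import Iterator

AAT_NS = "http://vocab.getty.edu/aat/"  # assumed expansion of the "aat:" prefix


def iter_aat_ids(node) -> Iterator[str]:
    """Yield every 'aat:' identifier found anywhere in a JSON-like record."""
    if isinstance(node, dict):
        ident = node.get("id")
        if isinstance(ident, str) and ident.startswith("aat:"):
            yield ident
        # Recurse into nested objects such as produced_by or classified_as.
        for value in node.values():
            yield from iter_aat_ids(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_aat_ids(item)


def expand_aat(curie: str) -> str:
    """Turn 'aat:300033618' into a full AAT URI."""
    return AAT_NS + curie.split(":", 1)[1]


# Trimmed copy of the example record, for illustration only.
record = {
    "classified_as": [
        {"id": "aat:300033618", "label": "paintings (visual works)", "type": "Type"},
    ],
    "produced_by": {
        "type": "Production",
        "consists_of": [
            {"technique": [{"id": "aat:300025103", "type": "Type"}]},
        ],
    },
}

aat_uris = [expand_aat(curie) for curie in iter_aat_ids(record)]
```

Walking the whole record rather than only classified_as means technique classifications nested under produced_by are picked up as well.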

People and organizations are connected to their vocabulary terms using the same skos:exactMatch pattern (serialized as exact_match) that Linked Art uses:

{
    "id": "http://data.duchamparchives.org/pma/archive/actor/LCNAF/n80057220",
    "label": "Duchamp, Marcel, 1887-1968",
    "type": "Actor",
    "exact_match": [ "ulan:500115393" ]
}
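
One practical use of this pattern is looking up a local actor record from an external identifier. The following is a minimal sketch under that assumption; build_match_index is a hypothetical helper, not part of the portal's code, and it expects records shaped like the example above.

```python
def build_match_index(actors):
    """Map each exact_match identifier to the local actor id that declares it."""
    index = {}
    for actor in actors:
        # Records without an exact_match list simply contribute nothing.
        for match in actor.get("exact_match", []):
            index[match] = actor["id"]
    return index


actors = [
    {
        "id": "http://data.duchamparchives.org/pma/archive/actor/LCNAF/n80057220",
        "label": "Duchamp, Marcel, 1887-1968",
        "type": "Actor",
        "exact_match": ["ulan:500115393"],
    }
]

index = build_match_index(actors)
local_id = index.get("ulan:500115393")
```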

Term reconciliation and language-specific labels

Classifications that differ between institutions and cataloguing languages are reconciled with a table that provides equivalent terms and French labels for all classification terms in the dataset. Labels for dataset-internal classifications (for example, "preferred", or aat:300404670) are fetched from the Getty Vocabulary Program (GVP) on data refreshes and used in the raw data releases. Labels are extracted using this SPARQL query:

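PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX xl: <http://www.w3.org/2008/05/skos-xl#>

SELECT * WHERE {
  # Each concept and its GVP-preferred label node
  ?entity_uri a gvp:Concept ;
    gvp:prefLabelGVP ?pref_label .

  # The label literal, bound as ?label_with_lang
  ?pref_label a xl:Label ;
    gvp:term ?label_with_lang .
}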

The results are cached for use when the endpoint is unreachable. Where the GVP expresses multiple preferred labels for a term, we use the shortest.
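
The caching and label-selection behaviour described above could look roughly like the sketch below. It is not the portal's implementation: fetch is a hypothetical callable standing in for the SPARQL request, and the JSON cache file is an assumed storage format.

```python
import json
from pathlib import Path


def pick_label(labels):
    """Return the shortest label; ties keep the first one listed."""
    return min(labels, key=len)


def load_labels(fetch, cache_path):
    """Return {uri: label}, refreshing the cache or falling back to it.

    `fetch` is a hypothetical callable that returns {uri: [label, ...]}
    and raises OSError when the endpoint cannot be reached.
    """
    cache_file = Path(cache_path)
    try:
        fetched = fetch()
    except OSError:
        # Endpoint unreachable: reuse the labels saved by the last good run.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    labels = {uri: pick_label(candidates) for uri, candidates in fetched.items()}
    cache_file.write_text(json.dumps(labels, ensure_ascii=False), encoding="utf-8")
    return labels
```

Writing the cache only after a successful fetch keeps the last known-good labels available for the next outage.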

People and organization names

Person vocabularies tend to be opinionated and highly specialized, with editorial standards that vary widely for things like name shortening, birth names versus given names, and name language or kind. As a result, we keep whatever actor names a source system provides and, where sources differ, prefer the name provided by the Philadelphia Museum of Art's archival source system.
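
The naming preference above amounts to a simple priority lookup. A minimal sketch, assuming actor names arrive as a mapping from source system to name; the source key "pma_archives" and the function name are hypothetical.

```python
# Hypothetical source keys, highest priority first. "pma_archives" stands in
# for the Philadelphia Museum of Art's archival source system.
SOURCE_PRIORITY = ["pma_archives"]


def preferred_name(names_by_source, priority=SOURCE_PRIORITY):
    """Pick the display name for an actor from a {source: name} mapping."""
    for source in priority:
        name = names_by_source.get(source)
        if name:
            return name
    # No prioritized source supplied a name: take the first one available.
    for name in names_by_source.values():
        if name:
            return name
    return None
```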

To generate the reconciliation candidates report, we use Wikidata to walk between possible names in ULAN, LCNAF, and Wikidata. Along with relevant metadata, the report includes candidate biographies and Wikipedia links (in case that data needs to be added to source systems).
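
To make the walking step concrete, the sketch below builds a query for the public Wikidata SPARQL service that starts from an LCNAF identifier and returns the matching item's ULAN and ISNI identifiers where they exist. It relies on the Wikidata properties P244 (Library of Congress authority ID), P245 (ULAN ID), and P213 (ISNI). The function name is hypothetical, and this is not necessarily the query the report uses.

```python
def wikidata_walk_query(lcnaf_id: str) -> str:
    """Build a SPARQL query that walks from an LCNAF id to ULAN and ISNI ids."""
    # Guard against quote characters or spaces leaking into the query string.
    if not lcnaf_id.isalnum():
        raise ValueError(f"unexpected LCNAF id: {lcnaf_id!r}")
    return f"""
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?ulan ?isni WHERE {{
  ?item wdt:P244 "{lcnaf_id}" .          # Library of Congress authority ID
  OPTIONAL {{ ?item wdt:P245 ?ulan . }}  # ULAN ID
  OPTIONAL {{ ?item wdt:P213 ?isni . }}  # ISNI
}}
""".strip()


query = wikidata_walk_query("n80057220")
```

The OPTIONAL clauses keep an item in the results even when it lacks one of the other identifiers, which is useful for a candidates report that a person will review.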

References