DATA MODELING AS A HIGH-WIRE ACT

Balancing Requirements, Juggling Vocabularies, and not Falling (Short of Established Best Practice)

Bernhard Oberreither, ACDH-CH, Austrian Academy of Sciences

Limiting “Work” – a Single Solution for Modeling Occupations

A common part of biographical data is a person’s occupation. Our source data contains information on 16,500 persons with 18,000 occupations, modelled slightly differently depending on the data source: some records include time spans, employers, etc. While some ontologies for modelling occupations offer, e.g., a differentiation between profession and employment, for the purposes of SK it is sufficient to subsume these under one class of a vocabulary already in use, FRBRoo’s F51 Pursuit. This in turn can be linked to an employer – or to be more precise: to an E4 Period representing the period of existence of the employing entity. Turning to ontologies already in use in the project helped keep the list of reused ontologies short and minimized integration issues.
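The occupation pattern can be sketched as a handful of triples. The Python sketch below uses plain tuples instead of an RDF library. F51 Pursuit and E4 Period are taken from the text; the `ex:` identifiers, the typing of the person as E21 Person, and the choice of the properties P14, P2 and P10 are assumptions made for illustration and are not necessarily what SK uses.

```python
# Triples as (subject, predicate, object). The ex: identifiers and the
# property choices are illustrative assumptions, not SK's actual data.
TRIPLES = [
    ("ex:person_1", "rdf:type", "crm:E21_Person"),
    ("ex:pursuit_1", "rdf:type", "frbroo:F51_Pursuit"),
    ("ex:pursuit_1", "crm:P14_carried_out_by", "ex:person_1"),
    ("ex:pursuit_1", "crm:P2_has_type", "ex:occupation_editor"),
    ("ex:pursuit_1", "crm:P10_falls_within", "ex:period_of_employer_1"),
    ("ex:period_of_employer_1", "rdf:type", "crm:E4_Period"),
]


def objects(subject, predicate):
    """Return all objects of triples matching subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def occupations_of(person):
    """List (occupation type, employer period or None) for a person."""
    result = []
    for s, p, o in TRIPLES:
        if p == "crm:P14_carried_out_by" and o == person:
            # Only activities typed as F51 Pursuit count as occupations.
            if "frbroo:F51_Pursuit" not in objects(s, "rdf:type"):
                continue
            employers = objects(s, "crm:P10_falls_within") or [None]
            for occupation in objects(s, "crm:P2_has_type"):
                result.append((occupation, employers[0]))
    return result


print(occupations_of("ex:person_1"))
```

Because profession and employment share one class, a record without an employer simply omits the link to the period, and the same query still applies.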

Modeling Roles with a Minimalist’s Approach

The documents in Rechtsakten Karl Kraus (Karl Kraus Legal Files) are structured around the ca. 215 legal cases Kraus’s attorney Oskar Samek litigated on Kraus’s behalf. However, the data does not only represent the files but also lists the persons linked to each of these legal cases, information we wanted to preserve. While there are more complex vocabularies to represent a person’s role in an event, this CIDOC CRM-only approach is rather minimalist and, again, helped keep the list of reused ontologies short, thus minimizing integration issues. Our modeling states nothing but the facts that there is an activity that a) is being carried out by a person, b) is part of a larger event (a legal case) and c) has an E55 Type, e.g. “being a defendant”, “being a plaintiff”, etc.
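The three statements a) to c) can be sketched as a small helper that emits one triple per statement. This is a minimal Python illustration; the property names (P14, P9i, P2) are assumed CIDOC CRM choices, and all `ex:` identifiers are hypothetical placeholders rather than project data.

```python
def role_triples(activity, person, case, role):
    """Build the three statements of the minimalist role pattern."""
    return [
        (activity, "crm:P14_carried_out_by", person),  # a) carried out by a person
        (activity, "crm:P9i_forms_part_of", case),     # b) part of a legal case
        (activity, "crm:P2_has_type", role),           # c) typed with an E55 Type
    ]


# Two hypothetical participants in one hypothetical case.
graph = (
    role_triples("ex:act_1", "ex:person_a", "ex:case_1", "ex:type_plaintiff")
    + role_triples("ex:act_2", "ex:person_b", "ex:case_1", "ex:type_defendant")
)


def participants(graph, case):
    """Map each person involved in a case to the type of their activity."""
    acts = [s for s, p, o in graph if p == "crm:P9i_forms_part_of" and o == case]
    roles = {}
    for act in acts:
        person = next(o for s, p, o in graph
                      if s == act and p == "crm:P14_carried_out_by")
        role = next(o for s, p, o in graph
                    if s == act and p == "crm:P2_has_type")
        roles[person] = role
    return roles


print(participants(graph, "ex:case_1"))
```

A new role only requires a new E55 Type instance; no additional class or property has to be introduced.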

Bibliographical Standards upheld with FRBRoo

We are accustomed to our academic citation standards, but when using an event-based model, even the simple notion of a text with an author and a date breaks down into numerous entities. Still, we want to adhere to the standards our source data provides (and that are necessary to achieve a certain precision anyway). SK uses FRBRoo for the modeling of bibliographical information, thus differentiating, e.g., between a text as created by the author and its published expression. Breaking down the concept of a text in this way, and consequently introducing numerous events for the creation of these textual entities, brings ambiguities to the surface: dates, for example, must now be explicitly assigned as dates of composition, premiere, or publication, information that often remains implicit in traditional academic practice.
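To illustrate why a single "date" field no longer suffices, the following Python sketch attaches each date to its own event. The event kinds follow the three cases named in the text; the entity identifiers and all dates are invented for the example, and the mapping to concrete FRBRoo event classes (such as F28 Expression Creation or F30 Publication Event) is an assumption, not a statement about SK's data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    kind: str    # "composition", "premiere" or "publication"
    entity: str  # the textual entity the event concerns
    date: str    # ISO date; the values below are invented


# Invented example: one hypothetical play, three separately dated events.
EVENTS = [
    Event("composition", "ex:play_text", "1920-03-01"),
    Event("premiere", "ex:play_text", "1921-10-15"),
    Event("publication", "ex:play_published", "1922-01-10"),
]


def date_of(kind: str, events=EVENTS) -> Optional[str]:
    """Return the date of the first event of the given kind, if any."""
    for event in events:
        if event.kind == kind:
            return event.date
    return None


print(date_of("composition"), date_of("premiere"), date_of("publication"))
```

A citation-style record would have to pick one of these three values silently; the event-based form makes that choice explicit.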

Mixed Collections in a One-fits-all Solution

The project Rechtsakten Karl Kraus (Karl Kraus Legal Files) has edited about 4,000 documents from the ca. 215 legal cases Oskar Samek litigated for his most notorious client, Karl Kraus. These documents are legal motions, verdicts and appeals, but also invoices, newspaper clippings and small file notes. Many of these texts cannot be subsumed under the notion of a “Self-Contained Expression” otherwise favored in the SK data model: The FRBRoo class definition states that those are “immaterial realisations of individual works at a particular time that are regarded as a complete whole” – a rather high standard that is probably not met by, e.g., an invoice.
The whole collection was therefore moved up the class hierarchy to the broader CIDOC CRM term “Information Object”, each text being accompanied by its physical carrier, a “Manifestation Singleton”.
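The pairing of text and carrier lends itself to a simple consistency check. The Python sketch below is illustrative only: the class numbers E73 and F4, the property P128 carries, and the `ex:` identifiers are assumptions added for the example.

```python
# Hypothetical data: one complete text/carrier pair and one text whose
# carrier statement is missing.
TRIPLES = [
    ("ex:doc_1_text", "rdf:type", "crm:E73_Information_Object"),
    ("ex:doc_1_sheet", "rdf:type", "frbroo:F4_Manifestation_Singleton"),
    ("ex:doc_1_sheet", "crm:P128_carries", "ex:doc_1_text"),
    ("ex:doc_2_text", "rdf:type", "crm:E73_Information_Object"),
]


def texts_without_carrier(triples):
    """Return Information Objects that no physical carrier points to."""
    texts = {s for s, p, o in triples
             if p == "rdf:type" and o == "crm:E73_Information_Object"}
    carried = {o for s, p, o in triples if p == "crm:P128_carries"}
    return sorted(texts - carried)


print(texts_without_carrier(TRIPLES))
```

The check treats an invoice and a verdict alike, which is the point of the one-fits-all solution.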

Texts Containing Texts Containing Texts Containing Texts …

Regarding the text index of Die Fackel, data enrichment and modeling were among the most challenging aspects of the project. This was not a surprise, since the journal comprises 37 volumes from 1899 to 1936 with 415 issues and about 13,000 texts, and the structure of the issues changed fundamentally again and again. Over the years, Die Fackel would contain:

sequences of treatises from different authors which are introduced, re-contextualized and commented on by other texts,
monolithic essays filling whole issues
large sections containing satirical glosses, many of which consist of nothing but a title and an excerpt from yet another text (mostly newspaper articles)
rubrics with various editorial pieces, ranging from answers to letters to the editor to documentations of Kraus’s legal cases to information for subscribers
recurring segments containing aphorisms, poems, riddles
collections of quotes from prominent writers or politicians
sections documenting Kraus’s public readings including large quantities of press reviews about the events

While FRBRoo offers a number of classes that could be relevant to model these contents (F23 Expression Fragment comes to mind for all the text passages quoted one way or the other), classifying the contents in depth and applying a large number of different modelings was not feasible. Instead, some hard decisions had to be made: The core class SK relies on is FRBRoo’s F22 Self-contained Expression, which is defined, among other things, by a) completeness, b) as intended by an author. This had a huge impact on both the question of text boundaries and authorship attribution. For example, a collection of Bismarck quotes is now attributed to Kraus, since he is responsible for the composition as a whole, while the quotes themselves are not part of the data (due to their lack of completeness). In this way, the data model had a profound effect on the selection and shape of the data introduced into the dataset.
In other cases, the hierarchical structure of texts or rubrics that contain (or consist entirely of) other texts (complete texts, that is) is represented by F22 Self-contained Expressions that contain other F22 Self-contained Expressions, each with the corresponding author.
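The nesting of complete texts inside other complete texts can be sketched as a recursive structure. In the Python illustration below, each node stands for one F22 Self-contained Expression with its own author; the titles and authors are placeholders, and the containment link is modeled as a plain list rather than any specific FRBRoo property.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Expression:
    """One self-contained expression with its own author and sub-texts."""
    title: str
    author: str
    parts: List["Expression"] = field(default_factory=list)


def walk(expr: Expression, depth: int = 0) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, title, author) for an expression and all nested ones."""
    yield depth, expr.title, expr.author
    for part in expr.parts:
        yield from walk(part, depth + 1)


# Placeholder rubric: texts by different authors at different depths.
rubric = Expression("Rubric", "Author A", [
    Expression("Introductory note", "Author A"),
    Expression("Contributed treatise", "Author B", [
        Expression("Embedded complete text", "Author C"),
    ]),
])

for depth, title, author in walk(rubric):
    print("  " * depth + f"{title} ({author})")
```

Authorship is stored per node, so a rubric compiled by one author can contain texts attributed to others without any conflict.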

Person Mentions

While there are more straightforward ways to model the fact that a text mentions a person (something along the lines of “text > mentions > person”), it made sense to do it by stating the fact that a text passage shows an actualization of a feature, the feature being a reference, the reference referring to a person. – That’s a somewhat cumbersome modeling (as well as sentence) for sure, but besides re-using an ontology already in use elsewhere in the model, the resulting reification of the steps between text passage and person offers the future opportunity to directly add further (e.g. provenance) information to each node. Additionally, this modeling takes into account the crucial difference between a text feature and its actualization, the former being understood as a generalized category, the latter as a text phenomenon assigned to that category.
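The chain from text passage to person can be sketched as a short walk over intermediate nodes. The Python example below is a simplification with one value per link; the property labels are descriptive placeholders, not the actual identifiers of the ontology in use, and the provenance annotation is a hypothetical example of the extra information the reified nodes could carry.

```python
# passage -> actualization -> feature (a reference) -> person
# Property labels are descriptive placeholders, not ontology identifiers.
EDGES = {
    ("ex:passage_1", "shows_actualization"): "ex:actualization_1",
    ("ex:actualization_1", "actualizes_feature"): "ex:reference_1",
    ("ex:reference_1", "refers_to"): "ex:person_1",
}

# Each intermediate node can carry its own statements, e.g. provenance.
ANNOTATIONS = {"ex:actualization_1": {"asserted_by": "ex:editor_1"}}

CHAIN = ("shows_actualization", "actualizes_feature", "refers_to")


def mentioned_person(passage):
    """Follow the three links in order; return None if the chain breaks."""
    node = passage
    for prop in CHAIN:
        node = EDGES.get((node, prop))
        if node is None:
            return None
    return node


print(mentioned_person("ex:passage_1"))
```

A direct "text mentions person" property would offer no node to which the annotation could be attached.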

Texts and Text Passages – Prioritizing Precision

Similar to what FRBRoo’s “Self-contained Expression” and “Publication Expression” represent on the level of whole texts, INTRO’s “TextPassage” and “Segment” represent different notions of ‘parts of texts’: the former a part of a text as an intellectual creation or work, the latter a part of a specific version of a text (e.g. a print edition) – with the latter being the ‘carrier’ of the former.
The distinction made here is subtle, but necessary: The digital edition of Kraus’s late work Dritte Walpurgisnacht, for example, contains annotations to some 2,000 text passages quoting other texts. In many cases (often with regard to literary ‘classics’) these references target ‘texts’ in the broader sense of intellectual creations or works, with no indication of a specific edition. In other cases (e.g., when it comes to current newspaper articles or other more recent publications), Kraus indeed refers to specific editions. The former can be represented by a reference to a “TextPassage”, the latter by a reference to a “TextPassage” incorporated in a “Segment”, which in turn is linked to a text in its published form (FRBRoo’s “Publication Expression”).
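The two kinds of reference target can be sketched as one record with optional edition-level fields. This Python illustration assumes that a Segment and its Publication Expression always occur together; the field names and `ex:` identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Target:
    """What a quotation refers to: a passage, optionally in an edition."""
    passage: str                       # INTRO "TextPassage"
    segment: Optional[str] = None      # INTRO "Segment" carrying the passage
    publication: Optional[str] = None  # FRBRoo "Publication Expression"

    def __post_init__(self):
        # Assumption of this sketch: segment and publication come as a pair.
        if (self.segment is None) != (self.publication is None):
            raise ValueError("segment and publication must be given together")

    @property
    def level(self) -> str:
        """'edition' if a specific published version is targeted, else 'work'."""
        return "edition" if self.segment is not None else "work"


classic = Target("ex:passage_classic")
article = Target("ex:passage_news", "ex:segment_news", "ex:newspaper_issue")
print(classic.level, article.level)
```

Both kinds of reference share the passage as their entry point, so a work-level reference can later be refined to an edition-level one by adding the two optional links.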

Intertextuality – Reified

One of the key areas of the data model is concerned with intertextuality, since Karl Kraus is known for his highly differentiated art of quotation and his texts accordingly show a high frequency of references to other texts. There are many ways to categorize these references: In his literary writings, Kraus makes use of plain citation and quotation, parody, allusion, intentional misquotation, etc.; his legal files quote texts that are the subject of dispute or serve as evidence for legal argumentation; in turn, excerpts from these legal files end up in Die Fackel.
SK provides the same modeling for all of these cases: an instance of INT3 Intertextual Relation connected via two properties to the referring and the referred-to entities. Other vocabularies offer the possibility to represent intertextuality via properties specifying the mode of the reference. SK opted for INTRO’s more generic approach, since the reification that comes with it allows for the separation of representing the relation and categorizing it (by adding an E55 Type).
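The separation of relation and category can be sketched with a record whose type is optional. In the Python illustration below, the class stands in for an INT3 Intertextual Relation; the field names, the `ex:` identifiers and the type labels are hypothetical and chosen only to mirror the cases named in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IntertextualRelation:
    referring: str                       # the entity that refers
    referred_to: str                     # the entity referred to
    relation_type: Optional[str] = None  # E55 Type; may be added later


# Hypothetical relations; the second one has not been categorized yet.
RELATIONS = [
    IntertextualRelation("ex:fackel_text_1", "ex:newspaper_article_1",
                         "ex:type_quotation"),
    IntertextualRelation("ex:legal_file_1", "ex:fackel_text_1"),
    IntertextualRelation("ex:fackel_text_2", "ex:legal_file_1",
                         "ex:type_excerpt"),
]


def referring_to(target: str, relations=RELATIONS) -> List[str]:
    """All entities that refer to the target, whatever the relation type."""
    return [r.referring for r in relations if r.referred_to == target]


def uncategorized(relations=RELATIONS) -> List[IntertextualRelation]:
    """Relations that exist in the data but carry no E55 Type yet."""
    return [r for r in relations if r.relation_type is None]


print(referring_to("ex:fackel_text_1"))
print(len(uncategorized()))
```

With a property-per-mode vocabulary, the uncategorized relation could not be recorded until someone had decided on its mode; here it is simply a relation without a type.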
