Preparing a TEI Corpus for the Text Alignment Network

Kalvesmaki, Joel Douglass
Catholic University of America, United States of America

1. Preparing a TEI Corpus for the Text Alignment Network

Do you wish that your corpus of Text Encoding Initiative (TEI) XML files could be made more interoperable with those created by other projects? Are your TEI annotations missing from the Semantic Web? Do you struggle to organize and align a TEI corpus of multiple versions of the same works, perhaps fragmentary or disordered? Do you wish you had a set of core tools for developing applications for your TEI files?

In this workshop, you will learn one way to approach these problems, through the Text Alignment Network (TAN). TAN, a customized extension of TEI, is a suite of modular, highly regulated XML formats designed to maximize the syntactic and semantic interoperability of texts, annotations, and language resources. It was created specifically to help scholars make their TEI corpora more cross-project oriented and part of the Linked Open Data ecosystem. TAN is particularly suited for corpora that have multiple versions of the same text (copies, translations, paraphrases) and for specifying quotations and other types of text reuse across works. TAN also features one of the most extensive and useful XSLT function libraries for developing TEI applications.

This workshop, designed for intermediate and advanced users of TEI XML, is a bootstrap introduction to TAN. Participants will learn how to take a small set of TEI files of their choice (multiple versions of a single work) and bring them into the TAN ecosystem. By the end of the workshop, participants will be able to use and configure a flagship TAN application that renders those multiple versions in a collated parallel HTML reading edition.

2. Prerequisites

• At least basic familiarity with XML and TEI, and with the concepts of well-formed and schema-valid XML.

• A rudimentary knowledge of XPath and either XQuery or XSLT.

• Some familiarity with Unicode and regular expressions.

• A corpus of TEI texts. The corpus may be small and consist of short texts (this is, in fact, recommended), but it must include at least three versions of one work of the participant’s choice. The versions may be in different languages (e.g., translations) or in the same language (e.g., revisions, variations).

• A computer

• A licensed, local installation of Oxygen XML Editor

• An account with a file sharing service such as Google Drive or Dropbox (service TBD)

• Internet access

• Optional: ability to post XML files to the Internet (e.g., through a personal or institutional server)

Maximum participants: 12

Intended length: one full day

Cost: none

3. Session 1: TEI and TAN: foundations and design principles

We begin with an overview of TAN. TAN is an extension (not a replacement) of TEI, but follows a different set of design principles. We will explore how TAN schemas and other digital resources are organized, and how its metadata principles differ from but complement those adopted by TEI. The three classes of TAN formats will be surveyed, to show how TAN uses RDF, regular expressions, new inclusion patterns, and a novel pointing structure, to approach textual scholarship in a way that supplements existing conventions. Participants will install TAN, configure their work space, and familiarize themselves with the resources.

4. Session 2: TEI, TAN, and RDF: building vocabulary

TAN is Semantic Web–oriented, and the first task in developing a TAN project is to find IRIs (URIs) to say what we mean, and mean what we say. Participants will create a TAN-voc file, which declares vocabulary that will be tersely invoked throughout their TEI files, and thereby honor the programmer’s adage DRY, “Don’t Repeat Yourself.” They will become familiar with how vocabulary gets reused, and how other TAN inclusion patterns work. The two types of TAN validation will be explained, and it will be shown how to find TAN files and get them to “talk” to each other.
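The "declare once, invoke tersely" idea behind a vocabulary file can be sketched outside of XML. The following Python fragment is only an illustration of the DRY principle, not TAN syntax: the short ids, the example.org IRIs, and the function names `resolve` and `expand` are all hypothetical.

```python
# Hypothetical vocabulary: each short id is declared once and bound to an IRI.
# The ids and the example.org IRIs are placeholders, not real TAN vocabulary.
VOCABULARY = {
    "work-a": "http://example.org/work/a",
    "grc": "http://example.org/language/grc",
}


def resolve(idref, vocabulary=VOCABULARY):
    """Return the IRI bound to a short id, or fail loudly if it is undeclared."""
    try:
        return vocabulary[idref]
    except KeyError:
        raise ValueError(f"undeclared vocabulary item: {idref!r}") from None


def expand(idrefs, vocabulary=VOCABULARY):
    """Expand a space-separated string of short ids into a list of IRIs."""
    return [resolve(idref, vocabulary) for idref in idrefs.split()]
```

A file that cites `"work-a grc"` stays terse, while the full IRIs live in one place; an undeclared id raises an error instead of passing silently, which is the kind of check a validation step would be expected to perform.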

5. Session 3: Preparing TEI transcriptions for the network

Preparing a TEI corpus to be TAN-compliant is not difficult, but it does require some deliberation. Participants will survey the rather bewildering variety of approaches to TEI, and learn how to analyze a TEI file to find the best strategy to make it part of the TAN ecosystem. In this session emphasis will be placed on theoretical questions about normalization, text division typology, reference systems, and the scholarly gaze. The TAN Oxygen framework will be introduced, along with tools specially created for Oxygen’s Author mode. Under guidance, users will adjust their sample TEI corpus to make it TAN-compliant, and to connect it with the vocabulary file created in session 2.

6. Session 4: Annotating TAN-TEI files and connecting to the network

TEI files can get messy and illegible in the face of ambitious, extensive inline markup. Even stand-off annotation, which produces cleaner TEI files, is not easily navigated or maintained. In this session, participants will learn a different approach to annotating TEI files, through the TAN-A format, which supports both alignment and annotation. Users will set up a TAN-A file for their transcriptions, and learn how to coordinate the versions by making ad-hoc alignment adjustments. Each participant will link their corpus to that of another participant, and write cross-project annotations that connect the two corpora. Now part of a small, cross-project network, users will familiarize themselves with TAN’s distinctive approach to pointing, via reference system, word, and letter, and develop hands-on experience in annotating specific parts of their corpus.
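Pointing by reference system, then word, then letter can be pictured with a small Python sketch. It is a simplification under stated assumptions: the sample text, the `point` function, whitespace tokenization, and 1-based counting are all hypothetical choices made for illustration, and TAN's own tokenization rules and pointing syntax are not reproduced here.

```python
# Hypothetical version of a work, keyed by canonical reference.
VERSION = {
    "1.1": "In the beginning was the word",
    "1.2": "and the word was with us",
}


def point(version, ref, word=None, letter=None):
    """Return the text at a reference, narrowed to a word and then a letter.

    Positions are 1-based. Words are split on whitespace, which is an
    assumption made only for this sketch.
    """
    text = version[ref]
    if word is None:
        return text
    tokens = text.split()
    if not 1 <= word <= len(tokens):
        raise IndexError(f"no word {word} at reference {ref}")
    token = tokens[word - 1]
    if letter is None:
        return token
    if not 1 <= letter <= len(token):
        raise IndexError(f"no letter {letter} in word {word} at reference {ref}")
    return token[letter - 1]
```

Because the pointer names a reference and a position within it, not a character offset in a particular file, the same annotation can in principle be applied to any version that shares the reference system.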

7. Session 5: Using and developing applications for TAN-TEI corpora

Our TEI corpus is not an end in itself. We want to use it. It should come into conversation with other corpora, to facilitate comparative reading or statistical analysis. Participants will learn about standard TAN applications and, perhaps more important, the TAN function library, one of the most extensive and useful XSLT-based libraries of its kind for textual research. Participants will learn how to incorporate TAN functions into their own applications and workflow. They will process the alignment file created in session 4 through a seminal TAN application that renders the corpus as a collated parallel reading edition in HTML. We will get under the hood, learn how to configure the application to make select adjustments, and discuss strategies to develop pipelines into and out of TAN-TEI. We will conclude by considering next steps beyond this workshop.