Developing an XML Encoding Specification for Papyrological Analysis
by Matthew Brook O'Donnell (02/27/2001)
The fundamental issue involved in the development of an encoding/annotation scheme for a papyrus text is representation. Which features of the primary text or manuscript should be maintained and represented in the encoding and which are less important? Are line divisions important? Should the size of individual characters be recorded in some way? What about the textures and colours of the media?
These questions are the same for the production of a standard printed edition as for the encoding of a machine-readable version. The advantage of a machine-readable edition, and particularly one encoded in a markup language such as XML, is that it allows for the separation of encoding and display/rendering. In a printed edition the visual display format constrains how data from and about the manuscript can be encoded. Sublinear and superlinear markings and notes are the main ways of adding information to the basic character and word data. A change in either the encoding or the display requires the construction of a new edition. With an electronic text it is possible to display the encoded data in many formats, that is, one encoding resulting in many different views (see below).
The potential views of an encoded text are limited only by the quality and amount of information encoded in the base text. For example, line divisions can only be displayed if the character positions at which the breaks occur are noted in the base text.
The issue of how textual and manuscript data should be represented in an encoding scheme is, therefore, of prime importance. The work of the Text Encoding Initiative (TEI) represents a considerable and highly significant advance in this area. The TEI guidelines include a number and variety of tags and suggestions for dealing with primary texts, missing characters and words, abbreviations, lacunae, the physical characteristics of a manuscript, and the like. This flexibility and the recognized status of the TEI make it an attractive candidate for encoding papyrus manuscripts, such as P.Oxy. 119. However, as an initial proposal OpenText.org has opted to develop a domain-specific XML encoding scheme, making use of recent XML linking technologies (XLink and XPointer).
A printed edition of a manuscript encodes all levels of annotation in one text--character marking for clear, uncertain, illegible and missing letters, the reconstruction of uncertain and missing letters, word divisions, spelling correction and standardization, the addition of accents, and so on. In contrast, the OpenText.org papyrus annotation model aims to separate the annotation of manuscript data and editorial comment and amendment into distinct levels. The current proposal suggests three levels associated with the three editions discussed by Porter in the introductory article: (1) diplomatic or character level, where data pertaining to characters and physical information regarding the manuscript is encoded, (2) reconstructed or word level, where word divisions are introduced, missing characters reconstructed and abbreviations expanded, and (3) reading or regularized level, where standardized orthography and morphology are introduced, as well as variant readings and editorial interpretations.
The base level of annotation takes a character level view of the data. Full details of the scheme are given in the Character Level Specification. Each character (a <c> element) is given a unique identifier (e.g. 'c1') and two required attributes: (1) status and (2) visibility.
A character's visibility is how clearly it can be seen on the papyrus. Values are: (1) clear (default), (2) unclear (e.g. only part of the character is visible), (3) illegible (some ink is visible but the shape and nature of the character cannot be determined), and (4) none (the character is missing and cannot be seen at all). A character's status relates to its presence on the papyrus. Values are: (1) present (default), (2) missing (a hole or missing section where the character would have been), (3) deleted (erasure of the character can be detected), and (4) inserted (the character has been inserted above or between characters or written over a deleted character).
These two attributes in combination cover the common features marked for characters in a printed edition (e.g. missing characters, illegible and unclear characters). The Character Level Specification contains a comparison of the Leiden convention markings used in printed editions with the use of these two attributes. Other attributes that can be specified for a character element are: join (indicating an orthographic link between the character and the character that follows it) and decoration (indicating markings attached to or associated with the letter, such as a stroke above).
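The interplay of the two attributes can be illustrated with a minimal Python sketch. The <c> element and the status and visibility values come from the description above; the enclosing <chars> element, the placeholder Latin letters, and the particular mapping to Leiden-style marks (square brackets for lost text, a dot below for an uncertain letter, double brackets for a deletion, slashes for an insertion) are assumptions made for illustration and are not taken from the Character Level Specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical character-level fragment. The <c> element and its two
# attributes follow the prose; the <chars> wrapper is an assumption.
CHAR_XML = """
<chars>
  <c id="c1" status="present" visibility="clear">a</c>
  <c id="c2" status="present" visibility="unclear">b</c>
  <c id="c3" status="present" visibility="illegible"/>
  <c id="c4" status="missing" visibility="none"/>
  <c id="c5" status="deleted" visibility="clear">e</c>
  <c id="c6" status="inserted" visibility="clear">f</c>
</chars>
"""


def leiden_mark(c):
    """Return an approximate Leiden-style rendering for one <c> element."""
    text = c.text or ""
    status = c.get("status", "present")        # 'present' is the default
    visibility = c.get("visibility", "clear")  # 'clear' is the default
    if status == "missing" or visibility == "none":
        return "[.]"                           # lost character
    if visibility == "illegible":
        return "."                             # ink visible, letter unknown
    if visibility == "unclear":
        text = text + "\u0323"                 # combining dot below
    if status == "deleted":
        return "\u27e6" + text + "\u27e7"      # double square brackets
    if status == "inserted":
        return "\\" + text + "/"               # written above the line
    return text


root = ET.fromstring(CHAR_XML)
marks = [leiden_mark(c) for c in root.iter("c")]
print(" ".join(marks))
```

Because each mark is derived from the attribute values, a different marking convention could be produced by changing only the function, without touching the encoded characters.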
The second and third levels (and any subsequent levels added in the future) reference levels below them. For example, the second level marks word boundaries by pointing to the range of characters in the character level that make up each word.
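A rough sketch of this kind of range reference is given here in Python. The <w> element and its from and to attributes are hypothetical stand-ins for whatever XLink/XPointer syntax the word-level specification actually uses; the point is only that the word level holds pointers into the character level rather than a copy of the characters.

```python
import xml.etree.ElementTree as ET

# Character level: placeholder Latin letters with unique identifiers.
CHAR_XML = """
<chars>
  <c id="c1">a</c><c id="c2">b</c><c id="c3">c</c>
  <c id="c4">d</c><c id="c5">e</c>
</chars>
"""

# Hypothetical word level: 'from' and 'to' name the first and last
# character of each word (an assumption, not the real linking syntax).
WORD_XML = """
<words>
  <w id="w1" from="c1" to="c3"/>
  <w id="w2" from="c4" to="c5"/>
</words>
"""


def resolve_words(char_root, word_root):
    """Map each word id to the text of the character range it points to."""
    chars = list(char_root.iter("c"))
    index = {c.get("id"): i for i, c in enumerate(chars)}
    words = {}
    for w in word_root.iter("w"):
        start, end = index[w.get("from")], index[w.get("to")]
        words[w.get("id")] = "".join(c.text or "" for c in chars[start:end + 1])
    return words


words = resolve_words(ET.fromstring(CHAR_XML), ET.fromstring(WORD_XML))
print(words)
```

Keeping the word level as a set of pointers means that a competing word division can be recorded as a second word-level file over the same, unchanged character data.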
As discussed above, the separation of encoding and display made possible through the use of a markup language such as XML provides great flexibility for the presentation of the encoded data. XSLT stylesheets are one powerful mechanism for integrating the data encoded at the different levels into different views, such as a diplomatic edition, a reconstructed edition and a reading edition. The demonstration application for P.Oxy. 119 illustrates these three editions; each view is produced by an XSLT stylesheet.
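The idea of one encoding yielding several views can be sketched without XSLT. The following Python functions are a plain stand-in for stylesheets, not the demonstration application's actual transformations. The combined <text> document, the from and to pointers, and the supplied and reg attributes (holding an editor's restoration and a regularized form) are all hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical combined document: character level plus word level.
DOC_XML = """
<text>
  <chars>
    <c id="c1">k</c><c id="c2">a</c><c id="c3">i</c>
    <c id="c4">t</c><c id="c5" status="missing"/>
  </chars>
  <words>
    <w from="c1" to="c3"/>
    <w from="c4" to="c5" supplied="o" reg="to"/>
  </words>
</text>
"""


def word_chars(root, w):
    """Return the <c> elements in the range a <w> element points to."""
    chars = list(root.iter("c"))
    ids = [c.get("id") for c in chars]
    return chars[ids.index(w.get("from")): ids.index(w.get("to")) + 1]


def diplomatic(root):
    """Characters only: no word division, lost characters left unrestored."""
    return "".join("[.]" if c.get("status") == "missing" else (c.text or "")
                   for c in root.iter("c"))


def reconstructed(root):
    """Word divisions added; lost characters restored inside brackets."""
    out = []
    for w in root.iter("w"):
        supplied = iter(w.get("supplied", ""))
        out.append("".join(
            "[" + next(supplied, ".") + "]" if c.get("status") == "missing"
            else (c.text or "")
            for c in word_chars(root, w)))
    return " ".join(out)


def reading(root):
    """Regularized form where one is recorded, otherwise the word as written."""
    out = []
    for w in root.iter("w"):
        as_written = "".join(c.text or "" for c in word_chars(root, w))
        out.append(w.get("reg", as_written))
    return " ".join(out)


root = ET.fromstring(DOC_XML)
views = {
    "diplomatic": diplomatic(root),
    "reconstructed": reconstructed(root),
    "reading": reading(root),
}
for name, line in views.items():
    print(name + ": " + line)
```

All three functions read the same document, so adding a further view means writing another function (or, in the real system, another stylesheet) rather than re-encoding the text.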
In this demonstration the underlying character and marking data are transformed into graphic files for each character to avoid the difficulties associated with different Greek fonts. However, a further set of stylesheets could be written to present the encoded data in a specific font or in a transliterated form.