Annotation Schema

XML Annotation
Gloss Strings




There are two levels of annotation in the Yapese Corpora.

The first of these is XML annotation, in the form of XML tags. The XML annotation provides information about the level of embeddedness of particular items; for example, whether they are glosses, words, subclauses, clauses, or turns. It also records corpus metadata: identifying information such as the date on which a text was collected, the authors of written texts, and the speakers involved in conversation.

The second level of annotation is present within the tagged strings. Because these corpora were compiled from interlinearized data, a good deal of annotation is to be found in the gloss strings. For instance, a gloss string such as [1.sg.nom] indicates that the glossed item is a first person singular nominative form. The clause and subclause numbering scheme is also found within the tagged strings.

 

XML Annotation

Top Level Element

Metadata
    Metadata Template

 Data
   XML Template For Data Level (Written Corpus)
   Special Tags for Spoken Data Portion -- <introenglish>
   XML Tags for Introenglish and Data Level (Spoken Corpus)

Top Level Element

<interlineartext>            All XML documents require a top level element. The top level element for the Yapese corpora is <interlineartext>.

 

Metadata

<metadata>                  The metadata within the corpus is enclosed between <metadata> ... </metadata> tags.

<author>                      The author is the creator of the corpus.

<translator>                  The translator(s) of the particular text.

<textname>                  The name of the text.

<source>                      The source refers to the original print source of the text. In the case of the written portion, the source is the written document. For the spoken portion of the corpus, the source is the Colonia Corpus of Spoken Yapese.

<longcitation>               A full bibliographic citation for the text as it exists as part of the corpus.

<shortcitation>             A short citation for the text.

<date>                         The date the text was collected and incorporated into the corpus.

<textidentifier>             Each text has a unique letter code to identify it.

<ethnologuelanguagecode>       Identifies the language of the text according to the three-letter schema of the Ethnologue.

<speaker>                    Identifies the speaker in the text. Each spoken text has a single primary speaker. Two subtags are embedded under the <speaker> node: <name> and <speechcommunity>.

<name>                        The name of the speaker.

<speechcommunity>     The speech community to which the speaker belongs. The speech community is selected by the speakers, who were asked to provide information about which speech community they belonged to. Speakers were asked where they were from, and where they lived, and it was explained to them that this information was for the purposes of recognizing that there were differences in the way that people spoke Yapese. Speakers then chose the community that they felt was most representative of their speech. No systematic study of the sociolinguistic validity of these divisions was undertaken, and so scholars should bear in mind that these divisions are based on speaker selection only.

<interviewer>               The spoken portion of the corpus is in the form of interviews. The <interviewer> tag records the name of the primary interviewer.

<participant>                Other persons present at the interview.

<taperef>                     Each of the original tapes is assigned a unique reference number. The interview may be recorded on side A or side B. Finally, the tape counter across the span of the interview is recorded. Thus a <taperef> value of 002B:000-050 indicates that the interview is recorded on tape number 002, side B, and that the interview spans the time from 000 to 050 on the tape counter.
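A <taperef> value can be split into its parts mechanically. The Python sketch below assumes that every value has the same shape as the single example given above (tape number, side letter A or B, a colon, then the start and end counter readings separated by a hyphen). The names TapeRef and parse_taperef are illustrative only and are not part of any corpus tooling.

```python
import re
from typing import NamedTuple


class TapeRef(NamedTuple):
    """Parts of a <taperef> value (hypothetical helper type)."""
    tape: str    # tape reference number, e.g. "002"
    side: str    # "A" or "B"
    start: int   # tape counter reading at the start of the interview
    end: int     # tape counter reading at the end of the interview


# Assumed shape, inferred from the one documented example "002B:000-050".
_TAPEREF = re.compile(r"^(\d+)([AB]):(\d+)-(\d+)$")


def parse_taperef(value: str) -> TapeRef:
    """Split a <taperef> string into tape number, side and counter span."""
    match = _TAPEREF.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised taperef: {value!r}")
    tape, side, start, end = match.groups()
    return TapeRef(tape, side, int(start), int(end))


ref = parse_taperef("002B:000-050")
print(ref)
```

The tape number is kept as a string so that leading zeros survive, while the counter readings are converted to integers so that spans can be compared.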

XML Metadata Template


<metadata>
  <author />
  <speaker>
    <name />
    <speechcommunity />
  </speaker>
  <interviewer />
  <translator />
  <participant />
  <textname />
  <source />
  <shortcitation />
  <longcitation />
  <date />
  <textidentifier />
  <taperef />
  <ethnologuelanguagecode />
</metadata>
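As a minimal sketch of how a filled-in template might be read, the Python below loads a <metadata> block with the standard library's xml.etree.ElementTree and flattens it into a dictionary. The tag names follow the template; the values in SAMPLE are placeholders invented for illustration, and read_metadata is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Placeholder document: tag names follow the template, but the values are
# invented stand-ins and are NOT real corpus metadata.
SAMPLE = """<interlineartext>
  <metadata>
    <author>AUTHOR</author>
    <speaker>
      <name>SPEAKER NAME</name>
      <speechcommunity>COMMUNITY</speechcommunity>
    </speaker>
    <interviewer>INTERVIEWER</interviewer>
    <textname>TEXT NAME</textname>
    <textidentifier>TEXT-ID</textidentifier>
    <taperef>002B:000-050</taperef>
  </metadata>
</interlineartext>"""


def read_metadata(root: ET.Element) -> dict:
    """Return metadata fields as a dict; <speaker> subtags get a 'speaker.' prefix."""
    meta = root.find("metadata")
    if meta is None:
        raise ValueError("no <metadata> element found")
    fields = {}
    for child in meta:
        if child.tag == "speaker":
            # <name> and <speechcommunity> are embedded under <speaker>.
            for sub in child:
                fields[f"speaker.{sub.tag}"] = (sub.text or "").strip()
        else:
            fields[child.tag] = (child.text or "").strip()
    return fields


metadata = read_metadata(ET.fromstring(SAMPLE))
print(metadata["speaker.name"], metadata["taperef"])
```

A plain dictionary keeps only the last value for any tag that occurs more than once, so a document with several <participant> elements would need a list-valued variant of this helper.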


Data

<data>                         Text data is set off from metadata by the tag <data>.

 

<turn>                          Sets off individual turns. Note that for written data, the entire text is regarded as a single turn.

<spkrname>                 Embedded under the <turn> node; records the name of the speaker of that turn.

<turnnumber>               Each turn is assigned a unique alphanumeric code. The letter portion indicates the text identifier, and turns are numbered sequentially. Turns which are broken by short interjections or backchannelling are generally regarded as a single turn.

 

<clause>                      Sets off an individual clause. Clauses are defined as those strings which contain a finite verb, along with all of the dependents of that verb, including dependent subclauses. See Jensen et al.'s 1979 Yapese Reference Grammar for more on clause structure in Yapese.

<clausenumber>           Indicates the unique clause number for each clause (see information on subclause and clause numbering in Gloss Strings, below, for more on clause numbering conventions).

<subclause>                 Sets off an individual subclause. Note that a subclause is not identical to a subordinate clause: a main clause comprising a matrix clause and a relative clause will have two subclauses properly embedded under the <clause> node. Again, see Jensen et al.'s 1979 Yapese Reference Grammar for more on clause structure in Yapese.

<subclausenumber>     Indicates the subclause number (see information on subclause and clause numbering in Gloss Strings, below, for more on subclause numbering conventions).

<word>                        Sets off a word. The <word> tag dominates the tags <yapese>, <gloss>, and <tapetime>. For the written portion of the corpora, word boundaries are determined by spaces in the original document. For the spoken portion, they are determined by speaker consensus (see Ballantyne 2005, Chapter 2, for more on translation methodology). Because Yapese has only relatively recently become a written language, the notion of word boundary can be somewhat variable.

<yapese>                     The Yapese word.

<gloss>                        The English gloss.

<tapetime>                   In the spoken portion of the corpus, tape counts are embedded under the <word> tag at every 5-unit increment.

<freetrans>                   A free translation of the preceding unit. Note that for the written portion, the <freetrans> node is embedded under <clause>, and free translations are given for every individual clause. In the spoken corpus, however, <freetrans> is embedded under <turn>.

XML Template For Data Level (Written Corpus)


<data>
  <turn>
    <clause>
      <clausenumber> </clausenumber>
      <subclause>
        <subclausenumber> </subclausenumber>
        <word>
          <yapese> </yapese>
          <gloss> </gloss>
        </word>
      </subclause>
      <freetrans> </freetrans>
    </clause>
  </turn>
</data>
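The nesting in the written-corpus template can be walked with a few lines of Python. The sketch below is an illustration under stated assumptions rather than a description of any existing corpus tool: the element names come from the template, the contents of SAMPLE are invented placeholders (not corpus text), and clause_records is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Placeholder data: the structure follows the written-corpus template, but the
# identifiers, words, glosses and translation are invented stand-ins.
SAMPLE = """<data>
  <turn>
    <clause>
      <clausenumber>CLAUSE-ID</clausenumber>
      <subclause>
        <subclausenumber>SUBCLAUSE-ID</subclausenumber>
        <word><yapese>WORD-A</yapese><gloss>gloss.a</gloss></word>
        <word><yapese>WORD-B</yapese><gloss>gloss.b</gloss></word>
      </subclause>
      <freetrans>FREE TRANSLATION</freetrans>
    </clause>
  </turn>
</data>"""


def clause_records(data: ET.Element):
    """Yield one dict per <clause>: its number, word/gloss pairs, free translation."""
    for clause in data.iter("clause"):
        pairs = [
            (word.findtext("yapese", "").strip(), word.findtext("gloss", "").strip())
            for word in clause.iter("word")
        ]
        yield {
            "number": clause.findtext("clausenumber", "").strip(),
            "words": pairs,
            # In the written corpus <freetrans> sits directly under <clause>.
            "freetrans": clause.findtext("freetrans", "").strip(),
        }


for record in clause_records(ET.fromstring(SAMPLE)):
    print(record["number"], record["words"], record["freetrans"])
```

Words are collected across all subclauses of a clause, so the pairs line up with the clause-level free translation.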


Special Tags for Spoken Data Portion -- <introenglish>

<introenglish>               Each tape contains a short introduction in English stating the date, location and participants, which serves as a backup means of identifying the tape. This introduction is transcribed and set off by the tag <introenglish>. The <introenglish> node is dominated by the <interlineartext> node.

<English>                     Spoken English is set off by the tag <English>, and is parsed only to the <subclause> level.


XML Tags for Introenglish and Data Level (Spoken Corpus)



<introenglish>
  <turn>
    <turnnumber> </turnnumber>
    <spkrname> </spkrname>
    <clause>
      <clausenumber> </clausenumber>
      <subclause>
        <subclausenumber> </subclausenumber>
        <English> </English>
      </subclause>
    </clause>
  </turn>
</introenglish>
<data>
  <turn>
    <spkrname> </spkrname>
    <turnnumber> </turnnumber>
    <clause>
      <clausenumber> </clausenumber>
      <subclause>
        <subclausenumber> </subclausenumber>
        <word>
          <yapese> </yapese>
          <gloss> </gloss>
          <tapetime></tapetime>
        </word>
      </subclause>
    </clause>
    <freetrans></freetrans>
  </turn>
</data>
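The spoken-corpus layout differs from the written one in three ways that matter when reading it: each turn carries a speaker name and turn number, <freetrans> hangs from <turn> rather than <clause>, and only some words carry a tape count. The Python sketch below shows one way these might be handled. It assumes that a word without a tape count has either an empty <tapetime> element or none at all; the contents of SAMPLE are invented placeholders, and turn_records and intro_english are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Placeholder data: structure follows the spoken-corpus template; all
# identifiers, names, words and translations are invented stand-ins.
SAMPLE = """<interlineartext>
  <introenglish>
    <turn>
      <turnnumber>TURN-ID</turnnumber>
      <spkrname>SPEAKER</spkrname>
      <clause>
        <clausenumber>CLAUSE-ID</clausenumber>
        <subclause>
          <subclausenumber>SUBCLAUSE-ID</subclausenumber>
          <English>ENGLISH INTRODUCTION</English>
        </subclause>
      </clause>
    </turn>
  </introenglish>
  <data>
    <turn>
      <spkrname>SPEAKER</spkrname>
      <turnnumber>TURN-ID</turnnumber>
      <clause>
        <clausenumber>CLAUSE-ID</clausenumber>
        <subclause>
          <subclausenumber>SUBCLAUSE-ID</subclausenumber>
          <word><yapese>WORD-A</yapese><gloss>gloss.a</gloss><tapetime>005</tapetime></word>
          <word><yapese>WORD-B</yapese><gloss>gloss.b</gloss><tapetime></tapetime></word>
        </subclause>
      </clause>
      <freetrans>FREE TRANSLATION</freetrans>
    </turn>
  </data>
</interlineartext>"""


def turn_records(root: ET.Element):
    """Yield one dict per <turn> under <data>."""
    data = root.find("data")
    if data is None:
        return
    for turn in data.findall("turn"):
        yield {
            "speaker": turn.findtext("spkrname", "").strip(),
            "turn": turn.findtext("turnnumber", "").strip(),
            "words": [w.findtext("yapese", "").strip() for w in turn.iter("word")],
            # Keep only the tape counts that were actually recorded.
            "tapetimes": [t.text.strip() for t in turn.iter("tapetime")
                          if t.text and t.text.strip()],
            # In the spoken corpus <freetrans> sits directly under <turn>.
            "freetrans": turn.findtext("freetrans", "").strip(),
        }


def intro_english(root: ET.Element) -> list:
    """Return the transcribed English introduction, one string per <English> tag."""
    intro = root.find("introenglish")
    if intro is None:
        return []
    return [(e.text or "").strip() for e in intro.iter("English")]


root = ET.fromstring(SAMPLE)
for record in turn_records(root):
    print(record["speaker"], record["turn"], record["tapetimes"], record["freetrans"])
print(intro_english(root))
```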


Gloss Strings

Alphanumeric Identifiers
Text Identifiers
Clause Numbering System
Letter Key for Identifying Clause Type

Interlinear Glosses

The second level of annotation is present within the text. There are two elements to this level. The first is the set of alphanumeric identifiers for texts, turns, clauses and subclauses. The second is the set of interlinear glossing conventions.

Alphanumeric Identifiers

Text Identifiers

Each text has a unique letter code.

Honolulu Corpus
G        Guwchiig ‘Dolphins’
L         L’Agruw i Maabgol ‘The Married Couple’
M       Beaq Ni Ba Moqon Ngea Ba Raan’ I Moongkii ‘There was a Man and a Bunch of Monkeys’
T         Thiliig Kaakaroom ‘A Storm Long Ago’

Colonia Corpus

S       Schooldays – interview with Sheri Manna’
D      Dapael ‘Menstrual Houses’
W     M’uw ‘Canoes’

Turn, Clause & Subclause Identifiers

In the written corpus:
e.g. L2.3AC indicates that the subclause is from the text L’Agruw i Maabgol, that it is the third subclause embedded within the second clause of that text, and that it is an adverbial clause (AC).
 

In the spoken corpus:

e.g. W5.23.1.M is from the text M’uw ‘Canoes’; it is the first clause within the 23rd main clausal unit of the fifth conversational turn, and it is a main clause.

Letter Designations indicating clause type

M                   Main Clause
RC                 Relative Clause
AC                 Adverbial Clause
CC                 Complement Clause
QS                 Quoted Speech
Q                    Question
FS                  False Start
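Identifiers of this kind can be decoded programmatically. The Python sketch below is a hedged illustration: the two regular expressions are inferred only from the two examples given above (L2.3AC and W5.23.1.M) and may not cover every identifier in the corpora. The lookup tables are copied from the text and clause-type lists above, and parse_identifier is a hypothetical helper name.

```python
import re

# Text identifiers and clause-type letters, copied from the lists above.
TEXTS = {
    "G": "Guwchiig",
    "L": "L’Agruw i Maabgol",
    "M": "Beaq Ni Ba Moqon Ngea Ba Raan’ I Moongkii",
    "T": "Thiliig Kaakaroom",
    "S": "Schooldays",
    "D": "Dapael",
    "W": "M’uw",
}
CLAUSE_TYPES = {
    "M": "Main Clause", "RC": "Relative Clause", "AC": "Adverbial Clause",
    "CC": "Complement Clause", "QS": "Quoted Speech", "Q": "Question",
    "FS": "False Start",
}

# Assumed patterns, inferred only from the two documented examples.
WRITTEN = re.compile(r"^([A-Z])(\d+)\.(\d+)([A-Z]+)$")            # e.g. L2.3AC
SPOKEN = re.compile(r"^([A-Z])(\d+)\.(\d+)\.(\d+)\.([A-Z]+)$")   # e.g. W5.23.1.M


def parse_identifier(identifier: str) -> dict:
    """Decode a written- or spoken-corpus identifier into named parts."""
    spoken = SPOKEN.match(identifier)
    written = WRITTEN.match(identifier)
    if spoken:
        text, turn, unit, clause, kind = spoken.groups()
        numbers = {"turn": int(turn), "main_unit": int(unit), "clause": int(clause)}
    elif written:
        text, clause, subclause, kind = written.groups()
        numbers = {"clause": int(clause), "subclause": int(subclause)}
    else:
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    if text not in TEXTS or kind not in CLAUSE_TYPES:
        raise ValueError(f"unknown text or clause type in {identifier!r}")
    return {"text": TEXTS[text], "type": CLAUSE_TYPES[kind], **numbers}


print(parse_identifier("L2.3AC"))
print(parse_identifier("W5.23.1.M"))
```

The spoken pattern is tried first because it has more dot-separated fields; the two patterns cannot both match the same string, so the order affects only readability.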

Interlinear Glosses

The following abbreviations are used for interlinear glosses:

1         first person              incl        inclusive
2         second person             inf         infinitive
3         third person              intr        intransitive
acc       accusative                ints        intensifier
AdvP      adverbial phrase          irr         irrealis
caus      causative                 loc         locative
clsfr     classifier                locpro      locative pronoun
cmp       complementizer            neg         negative
dat       dative                    nom         nominative
def       definite                  non-pres    non-present
dimin     diminutive                non-sg      non-singular
dist      distal                    NPC         noun phrase connector
DM        discourse marker          perf        perfect
dmn       demonstrative             pl          plural
dpro      demonstrative pronoun     poss        possessive
du        dual                      prior       priorative
emph      emphatic                  prog        progressive
ex        exclusive                 prx         proximal
FM        focus marker              pst         past
h         hearer                    ref         referential
hbt       habitual                  relpro      relative pronoun
idf       indefinite                s           speaker
idfpro    indefinite pronoun        sg          singular
imper     impersonal                stat        stative
inc       inceptive                 tns         transitive

Multi-part glosses are indicated by dots; e.g. ngea inc.3.sg ‘inceptive third person singular’, raayog irr.can ‘irrealis can’, yoqor a.lot ‘a lot’. Note that all verbs are marked for valency.
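Because the parts of a gloss are separated by dots, a gloss string can be expanded with a simple lookup. The Python sketch below uses only a subset of the abbreviation key above and leaves any part it does not recognise (such as a lexical gloss) unchanged. The function name expand_gloss is hypothetical, and stripping square brackets is an assumption based on the [1.sg.nom] notation used earlier on this page.

```python
# Subset of the abbreviation key above; extend with the remaining entries as needed.
ABBREVIATIONS = {
    "1": "first person", "2": "second person", "3": "third person",
    "sg": "singular", "pl": "plural", "du": "dual",
    "nom": "nominative", "acc": "accusative", "dat": "dative",
    "inc": "inceptive", "irr": "irrealis", "neg": "negative",
}


def expand_gloss(gloss: str) -> list:
    """Split a multi-part gloss on dots and expand the abbreviations it contains."""
    parts = gloss.strip("[]").split(".")
    # Parts not found in the key (e.g. lexical glosses) are returned as-is.
    return [ABBREVIATIONS.get(part, part) for part in parts]


for gloss in ("inc.3.sg", "irr.can", "a.lot", "[1.sg.nom]"):
    print(gloss, "->", expand_gloss(gloss))
```

With the full key loaded, a lexical gloss that happens to coincide with an abbreviation (for example the single letters s and h) would be expanded as well, so a fuller implementation would need some way to tell grammatical parts from lexical ones.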

Conversation Analysis (CA) conventions for representing conversation:

square brackets            overlapping speech
ellipsis                          short pause