There are two levels of annotation in the Yapese Corpora.
The first of these is XML annotation, in the form of XML tags. The XML annotation provides information about the level of embeddedness of particular items; for example, whether or not they are glosses, words, subclauses, clauses, or turns. The XML annotation also provides information about corpus metadata, identifying information such as the date of collection of the text, the authors of written texts and the speakers involved in conversation.
The second level of annotation is present within the tagged strings. Because these corpora were compiled from interlinearized data, a good deal of annotation is to be found in the gloss strings. For instance, a gloss string such as [1.sg.nom] indicates that the glossed item is a first person singular nominative. The clause and subclause numbering scheme is also found within the tagged string.
Metadata
Metadata Template
Special Tags for Spoken Data Portion -- <introenglish>
XML Tags for Introenglish and Data Level (Spoken Corpus)
<interlineartext> All XML documents require a top level element. The top level element for the Yapese corpora is <interlineartext>.
<metadata> The metadata within the corpus is enclosed between <metadata> ... </metadata> tags.
<author> The author refers to the creator of the corpus.
<translator> The translator(s) of the particular text.
<textname> The name of the text.
<source> The source refers to the original print source of the text. In the case of the written portion, the source is the written document. For the spoken portion of the corpus, the source is the Colonia Corpus of Spoken Yapese.
<longcitation> A full bibliographic citation for the text as it exists as part of the corpus.
<shortcitation> A short citation for the text.
<date> The date the text was collected and incorporated into the corpus.
<textidentifier> Each text has a unique letter code to identify it.
<ethnologuelanguagecode> Identifies the language of the text according to the three-letter schema of the Ethnologue.
<speaker> Identifies the speaker in the text. Each spoken text has a single primary speaker. Two subtags are embedded under the <speaker> node: <name> and <speechcommunity>.
<name> The name of the speaker.
<speechcommunity> The speech community to which the speaker belongs. The speech community is selected by the speakers, who were asked to provide information about which speech community they belonged to. Speakers were asked where they were from, and where they lived, and it was explained to them that this information was for the purposes of recognizing that there were differences in the way that people spoke Yapese. Speakers then chose the community that they felt was most representative of their speech. No systematic study of the sociolinguistic validity of these divisions was undertaken, and so scholars should bear in mind that these divisions are based on speaker selection only.
<interviewer> The spoken portion of the corpus is in the form of interviews. The <interviewer> tag records the name of the primary interviewer.
<participant> Other persons present at the interview.
<taperef> Each of the original tapes is assigned a unique reference number. The interview may be recorded on side A or side B. Finally, the tape counter across the span of the interview is recorded. Thus a <taperef> value of 002B:000-050 indicates that the interview is recorded on tape number 002, side B, and that the interview spans the time from 000 to 050 on the tape counter.
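The <taperef> format described above can be split into its parts mechanically. The following is a minimal sketch, assuming every value follows the pattern of the single example given (digits, side letter A or B, colon, start counter, hyphen, end counter); the names TapeRef and parse_taperef are illustrative and not part of the corpus.

```python
import re
from typing import NamedTuple


class TapeRef(NamedTuple):
    tape: str   # tape reference number, kept as text to preserve leading zeros
    side: str   # "A" or "B"
    start: int  # tape counter at the start of the interview
    end: int    # tape counter at the end of the interview


# Assumed pattern, generalised from the documented example "002B:000-050".
_TAPEREF = re.compile(r"(\d+)([AB]):(\d+)-(\d+)")


def parse_taperef(value: str) -> TapeRef:
    """Split a <taperef> value into tape number, side and counter span."""
    match = _TAPEREF.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unrecognised taperef: {value!r}")
    tape, side, start, end = match.groups()
    return TapeRef(tape, side, int(start), int(end))


print(parse_taperef("002B:000-050"))
# TapeRef(tape='002', side='B', start=0, end=50)
```

Values that depart from the assumed pattern raise a ValueError rather than being silently guessed at.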
<metadata>
  <author />
  <speaker>
    <name />
    <speechcommunity />
  </speaker>
  <interviewer />
  <translator />
  <participant />
  <textname />
  <source />
  <shortcitation />
  <longcitation />
  <date />
  <textidentifier />
  <taperef />
  <ethnologuelanguagecode />
</metadata>
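As an illustration of how the metadata block might be read programmatically, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The SAMPLE document and all of its values are invented placeholders, and read_metadata is a hypothetical helper; only the tag names come from the template above.

```python
import xml.etree.ElementTree as ET

# Placeholder document: tag names follow the metadata template,
# but every value here is invented for illustration.
SAMPLE = """<interlineartext>
  <metadata>
    <author>Example Author</author>
    <speaker>
      <name>Example Speaker</name>
      <speechcommunity>Example Community</speechcommunity>
    </speaker>
    <textname>Example Text</textname>
    <textidentifier>X</textidentifier>
    <taperef>002B:000-050</taperef>
  </metadata>
</interlineartext>"""


def read_metadata(xml_text: str) -> dict:
    """Flatten <metadata> into a dict; subtags become 'parent/child' keys."""
    root = ET.fromstring(xml_text)
    meta = root.find("metadata")
    if meta is None:
        raise ValueError("no <metadata> element found")
    fields = {}
    for child in meta:
        if len(child):  # element has subtags, e.g. <speaker>
            for sub in child:
                fields[f"{child.tag}/{sub.tag}"] = (sub.text or "").strip()
        else:
            fields[child.tag] = (child.text or "").strip()
    return fields


for key, value in read_metadata(SAMPLE).items():
    print(f"{key}: {value}")
```

Flattening <speaker> into speaker/name and speaker/speechcommunity keeps the two subtags distinguishable from any top-level field of the same name.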
<data> Text data is set off from metadata by the tag <data>.
<turn> Sets off individual turns. Note that for written data, the entire text is regarded as a single turn.
<spkrname> Embedded under the <turn> node; records the name of the speaker of that turn.
<turnnumber> Each turn is assigned a unique alphanumeric code. The letter portion indicates the text identifier, and turns are numbered sequentially. Turns which are broken by short interjections or backchannelling are generally regarded as a single turn.
<clause> Sets off an individual clause. Clauses are defined as those strings which contain a finite verb, along with all of the dependents of that verb, including dependent subclauses. See Jensen et al.'s 1979 Yapese Reference Grammar for more on clause structure in Yapese.
<clausenumber> Indicates the unique clause number for each clause (see information on subclause and clause numbering in Gloss Strings, below, for more on clause numbering conventions).
<subclause> Sets off an individual subclause. Note that a subclause is not identical to a subordinate clause: a main clause comprising a matrix clause and a relative clause will have two subclauses properly embedded under the <clause> node. Again, see Jensen et al.'s 1979 Yapese Reference Grammar for more on clause structure in Yapese.
<subclausenumber> Indicates the subclause number (see information on subclause and clause numbering in Gloss Strings, below, for more on subclause numbering conventions).
<word> Sets off a word. The <word> tag dominates the tags <yapese>, <gloss>, and <tapetime>. For the written portion of the corpora, word boundaries are determined by spaces in the original document. For the spoken portion, they are determined by speaker consensus (see Ballantyne 2005, Chapter 2, for more on translation methodology). Because Yapese has relatively recently become a written language, the notion of word boundary can be somewhat variable.
<yapese> The Yapese word.
<gloss> The English gloss.
<tapetime> In the spoken portion of the corpus, tape counts are embedded under the <word> tag at every 5-unit increment.
<freetrans> A free translation of the preceding unit. Note that for the written portion, the <freetrans> node is embedded under <clause>, and free translations are given for every individual clause. In the spoken corpus, however, <freetrans> is embedded under <turn>.
<data>
  <turn>
    <clause>
      <clausenumber> </clausenumber>
      <subclause>
        <subclausenumber> </subclausenumber>
        <word>
          <yapese> </yapese>
          <gloss> </gloss>
        </word>
      </subclause>
      <freetrans> </freetrans>
    </clause>
  </turn>
</data>
<introenglish> Each tape contains a short introduction in English, stating the date, location and participants, for the purposes of a backup identification method for tapes. This data is transcribed and set off by the tag <introenglish>. The <introenglish> node is dominated by the <interlineartext> node.
<English> Spoken English is set off by the tag <English>, and is parsed only to the <subclause> level.
<introenglish>
  <turn>
    <turnnumber> </turnnumber>
    <spkrname> </spkrname>
    <clause>
      <clausenumber> </clausenumber>
      <subclause>
        <subclausenumber> </subclausenumber>
        <English> </English>
      </subclause>
    </clause>
  </turn>
</introenglish>
<data>
  <turn>
    <spkrname> </spkrname>
    <turnnumber> </turnnumber>
    <clause>
      <clausenumber> </clausenumber>
      <subclause>
        <subclausenumber> </subclausenumber>
        <word>
          <yapese> </yapese>
          <gloss> </gloss>
          <tapetime></tapetime>
        </word>
      </subclause>
    </clause>
    <freetrans></freetrans>
  </turn>
</data>
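To show how the data-level nesting can be traversed, the sketch below collects each word with its gloss and the number of the turn it belongs to. It assumes the nesting given in the spoken-corpus template (turn > clause > subclause > word). The SAMPLE fragment is invented: the two word/gloss pairs are borrowed from the gloss examples in this documentation, but their combination, the numbering values and the translation are placeholders, not corpus data. The helper name word_pairs is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented fragment following the spoken-corpus template; not real corpus data.
SAMPLE = """<interlineartext>
  <data>
    <turn>
      <spkrname>Speaker</spkrname>
      <turnnumber>X1</turnnumber>
      <clause>
        <clausenumber>placeholder</clausenumber>
        <subclause>
          <subclausenumber>placeholder</subclausenumber>
          <word><yapese>yoqor</yapese><gloss>a.lot</gloss></word>
          <word><yapese>ngea</yapese><gloss>inc.3.sg</gloss><tapetime>005</tapetime></word>
        </subclause>
      </clause>
      <freetrans>placeholder translation</freetrans>
    </turn>
  </data>
</interlineartext>"""


def word_pairs(xml_text: str) -> list:
    """Return (turn number, Yapese word, gloss) for every <word> under <data>."""
    root = ET.fromstring(xml_text)
    rows = []
    # Restrict to ./data/turn so that <introenglish> turns are skipped.
    for turn in root.findall("./data/turn"):
        turn_no = (turn.findtext("turnnumber") or "").strip()
        for word in turn.iter("word"):
            yapese = (word.findtext("yapese") or "").strip()
            gloss = (word.findtext("gloss") or "").strip()
            rows.append((turn_no, yapese, gloss))
    return rows


for row in word_pairs(SAMPLE):
    print(row)
```

Because <tapetime> appears only at intervals, the sketch reads it nowhere and treats every <word> the same whether or not a tape count is attached.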
Colonia Corpus
D Dapael ‘Menstrual Houses’
W M’uw ‘Canoes’
In the spoken corpus:
e.g. W5.23.1.M is from the text M’uw ‘Canoes’; it is the first clause within the 23rd main clausal unit, which is in the fifth conversational turn, and it is a main clause.
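The numbering example can be decomposed as follows. This is a minimal sketch that assumes all spoken-corpus identifiers share the shape of the one documented example, W5.23.1.M (text letter, turn, main clausal unit, clause within that unit, clause type); only the type code M (main clause) is attested here, so other type codes are accepted but left uninterpreted. ClauseId and parse_clause_id are illustrative names.

```python
import re
from typing import NamedTuple


class ClauseId(NamedTuple):
    text: str         # text identifier letter code, e.g. "W" for M’uw ‘Canoes’
    turn: int         # conversational turn
    clause: int       # main clausal unit within the turn
    subclause: int    # clause within the main clausal unit
    clause_type: str  # "M" = main clause; other codes are not documented here


# Assumed shape, generalised from the single documented example.
_CLAUSE_ID = re.compile(r"([A-Za-z]+)(\d+)\.(\d+)\.(\d+)\.([A-Za-z]+)")


def parse_clause_id(value: str) -> ClauseId:
    """Split a spoken-corpus clause identifier into its components."""
    match = _CLAUSE_ID.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unrecognised clause identifier: {value!r}")
    text, turn, clause, subclause, clause_type = match.groups()
    return ClauseId(text, int(turn), int(clause), int(subclause), clause_type)


print(parse_clause_id("W5.23.1.M"))
# ClauseId(text='W', turn=5, clause=23, subclause=1, clause_type='M')
```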
The following abbreviations are used for interlinear glosses:
1 | first person | incl | inclusive
2 | second person | inf | infinitive
3 | third person | intr | intransitive
acc | accusative | ints | intensifier
AdvP | adverbial phrase | irr | irrealis
caus | causative | loc | locative
clsfr | classifier | locpro | locative pronoun
cmp | complementizer | neg | negative
dat | dative | nom | nominative
def | definite | non-pres | non-present
dimin | diminutive | non-sg | non-singular
dist | distal | NPC | noun phrase connector
DM | discourse marker | perf | perfect
dmn | demonstrative | pl | plural
dpro | demonstrative pronoun | poss | possessive
du | dual | prior | priorative
emph | emphatic | prog | progressive
ex | exclusive | prx | proximal
FM | focus marker | pst | past
h | hearer | ref | referential
hbt | habitual | relpro | relative pronoun
idf | indefinite | s | speaker
idfpro | indefinite pronoun | sg | singular
imper | impersonal | stat | stative
inc | inceptive | tns | transitive
Multi-part glosses are indicated by dots; e.g. ngea inc.3.sg ‘inceptive third person singular’, raayog irr.can ‘irrealis can’, yoqor a.lot ‘a lot’. Note that all verbs are marked for valency.
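The dot convention lends itself to a simple expansion routine. The sketch below splits a gloss string on dots and replaces each part found in the abbreviation list with its full label, leaving anything else (such as English words) untouched. Only part of the abbreviation list is reproduced, and expand_gloss is a hypothetical helper. One caveat is assumed rather than resolved: a single-letter English word in a gloss could collide with an abbreviation such as s or h.

```python
# Partial copy of the abbreviation list; extend as needed.
ABBREVIATIONS = {
    "1": "first person",
    "2": "second person",
    "3": "third person",
    "sg": "singular",
    "du": "dual",
    "pl": "plural",
    "nom": "nominative",
    "acc": "accusative",
    "dat": "dative",
    "inc": "inceptive",
    "incl": "inclusive",
    "ex": "exclusive",
    "irr": "irrealis",
    "neg": "negative",
    "pst": "past",
    "perf": "perfect",
    "prog": "progressive",
}


def expand_gloss(gloss: str) -> str:
    """Expand a dot-separated gloss; parts not in the list are kept as-is."""
    parts = gloss.strip().strip("[]").split(".")
    return " ".join(ABBREVIATIONS.get(part, part) for part in parts)


for example in ("inc.3.sg", "irr.can", "a.lot", "[1.sg.nom]"):
    print(example, "->", expand_gloss(example))
# inc.3.sg -> inceptive third person singular
# irr.can -> irrealis can
# a.lot -> a lot
# [1.sg.nom] -> first person singular nominative
```

Splitting on dots alone is deliberate: hyphenated abbreviations such as non-pres and non-sg stay intact and can be added to the dictionary as single keys.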
CA conventions for representing conversation:
square brackets | overlapping speech
ellipsis | short pause