famvasup.blogg.se

Text encoding initiative















In this paper we present a hypertext dictionary of Japanese lexical units for Slovene students of Japanese at the Faculty of Arts of Ljubljana University. The dictionary is planned as a long-term project in which a simple dictionary is gradually enlarged and enhanced, taking into account the needs of the students. Initially, the dictionary was encoded in a tabular format, in a mixture of encodings, and subsequently rendered in HTML.

The largest effort in the area of standardisation of computer encoding of language resources has been the Text Encoding Initiative (TEI), established in 1987. TEI chose SGML (Standard Generalized Markup Language) as its underlying standard, and in the years before the inception of XML a number of projects encoded their data according to some SGML DTD, TEI-compliant or otherwise. These projects could now benefit from migrating their data to XML. Apart from validation, the most compelling reason for migration is the scarcity of SGML-aware software and the abundance of XML-based tools and related recommendations. However, despite the fact that XML is a subset of SGML, migration is not a trivial process, especially in the case of large holdings of legacy language resources. This is why in 2002 the TEI Consortium established a Task Force (TF) on SGML to XML migration. The TF has now produced a number of reports that simplify and make explicit the conversion of SGML TEI (version P3) to XML T.

The JOS language resources are meant to facilitate developments of human language technologies (HLT) and corpus linguistics for the Slovene language. They consist of the morphosyntactic specifications, which define the Slovene morphosyntactic features and tagset; two annotated corpora (jos100k and jos1M); and two web services (a concordancer and a text annotation tool). The paper introduces these components and concentrates on jos100k, a 100,000-word sampled, balanced, monolingual Slovene corpus, manually annotated for three levels of linguistic description. On the morphosyntactic level, each word is annotated with its morphosyntactic description and lemma; on the syntactic level, the sentences are annotated with dependency links; on the semantic level, all occurrences of the 100 top nouns in the corpus are annotated with their wordnet synset from the Slovene semantic lexicon sloWNet. The JOS corpora and specifications have a standardised encoding (Text Encoding Initiative Guidelines TEI P5) and are available for research.
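The word-level annotation described for jos100k, where each word carries a morphosyntactic description and a lemma, can be pictured with a small sketch. The XML fragment below is hypothetical: the element names (`s`, `w`, `c`), the attribute names (`lemma`, `msd`) and the one-letter tags are illustrative assumptions, not the actual JOS or TEI P5 schema, whose tagset is far more detailed. The code only shows how such token-level annotation might be read with the Python standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment. Element and attribute names are
# assumptions made for illustration; they are not taken from the JOS corpus.
SAMPLE = """<s n="1">
  <w lemma="pes" msd="N">Pes</w>
  <w lemma="biti" msd="V">je</w>
  <w lemma="lajati" msd="V">lajal</w>
  <c>.</c>
</s>"""


def read_tokens(xml_text):
    """Return (word form, lemma, msd) triples for every <w> element."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("lemma"), w.get("msd")) for w in root.iter("w")]


tokens = read_tokens(SAMPLE)
for form, lemma, msd in tokens:
    # One token per line: form, lemma, morphosyntactic description.
    print(f"{form}\t{lemma}\t{msd}")
```

Punctuation is kept in a separate element here so that only words carry a lemma and a tag. That is a design choice of this sketch and may differ from the real corpus encoding.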














