The MILE corpus for less commonly taught languages

Alison Alvarez, Lori Levin, Robert Frederking, Simon Fung, Donna Gates, Jeff Good

Research output: Contribution to conference › Paper › peer-review

4 Scopus citations

Abstract

This paper describes a small, structured English corpus that is designed for translation into Less Commonly Taught Languages (LCTLs), and a set of re-usable tools for creating similar corpora. The corpus systematically explores meanings that are known to affect morphology or syntax in the world's languages. Each sentence is associated with a feature structure showing the elements of meaning that are represented in the sentence. The corpus is highly structured so that it can support machine learning with only a small amount of data. As part of the REFLEX program, the corpus will be translated into multiple LCTLs, resulting in parallel corpora that can be used for training MT and other language technologies. Only the untranslated English corpus is described in this paper.

Original language: English
Pages: 5-8
Number of pages: 4
State: Published - 2006
Event: 2006 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2006 - New York, United States
Duration: Jun 4 2006 - Jun 9 2006

Conference

Conference: 2006 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2006
Country/Territory: United States
City: New York
Period: 06/4/06 - 06/9/06
