Chapter 3. Things computational linguists know

Table of Contents

A historical tour of the net
TEI
Cíbola and Oleada
EAGLES I and II (1995-1999)
MULTEXT (1994-1996)
Corpus textual especializado plurilingüe
MATE
ISLE (2000-2002)
POINTER (-1996)
ELRA
SALT (2000-2001)
LISA and OSCAR

A historical tour of the net

On both sides of the pond, public institutions and universities have developed a series of projects of great interest to us.

TEI

The TEI (Text Encoding Initiative) was launched in 1987 as a joint effort of the ACH, the ACL and the ALLC, and has already celebrated its tenth anniversary. Its Guidelines define an SGML-based scheme for encoding electronic texts, and several of the projects discussed below either apply or extend them.

Cíbola and Oleada

On the American side, Oleada is a development derived from TIPSTER II[19], created by Bill Ogden[20].

«Cíbola» and «Oleada» are two related systems that provide multilingual text processing technology to language instructors, learners, translators, and analysts. The systems consist of a set of component tools that have been designed with a user-centered methodology.

Oleada, then, provides a set of user-centered component tools for multilingual text processing.

EAGLES I and II (1995-1999)

In Europe, EAGLES I and II (Expert Advisory Group on Language Engineering Standards) stand out.

The first project ended in 1996; the second ran from 1997 to the spring of 1999. According to the introduction (http://www.ilc.pi.cnr.it/EAGLES96/intro.html):

EAGLES is an initiative of the European Commission (…) which aims to accelerate the provision of standards for:

  • Very large-scale language resources (such as text corpora, computational lexicons and speech corpora);

  • Means of manipulating such knowledge, via computational linguistic formalisms, mark up languages and various software tools;

  • Means of assessing and evaluating resources, tools and products.

The work towards common specifications is carried out by five working groups:

  • Text Corpora

  • Computational Lexicons

  • Grammar Formalisms

  • Evaluation

  • Spoken Language

One result of this work was the Corpus Encoding Standard (CES, http://www.cs.vassar.edu/CES/) and its XML version, XCES (http://www.cs.vassar.edu/XCES/).

The CES is designed to be optimally suited for use in language engineering research and applications, in order to serve as a widely accepted set of encoding standards for corpus-based work in natural language processing applications. The CES is an application of SGML compliant with the specifications of the TEI Guidelines.

The CES specifies a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and typographic information) as well as general architecture (so as to be maximally suited for use in a text database). It also provides encoding specifications for linguistic annotation, together with a data architecture for linguistic corpora.

In its present form, the CES provides the following:

  • a set of metalanguage level recommendations (particular profile of SGML use, character sets, etc.);

  • tagsets and recommendations for documentation of encoded data;

  • tagsets and recommendations for encoding primary data, including written texts across all genres, for the purposes of corpus-based work in language engineering.

  • tagsets and recommendations for encoding linguistic annotation commonly associated with texts in language engineering, currently including:

    • segmentation of the text into sentences and words (tokens),

    • morpho-syntactic tagging,

    • parallel text alignment.
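
To make this more concrete, here is a minimal sketch of what CES-style token-level annotation might look like and how a program can consume it. It is only an illustration: the element names (tok, orth, lex, msd) imitate the general shape of the CES linguistic annotation, not its exact tagset.

    import xml.etree.ElementTree as ET

    # A two-token fragment, hand-encoded: <s> marks a sentence, <tok>
    # a token with its orthography, lemma and morphosyntactic tag.
    SAMPLE = """
    <chunk>
      <s id="s1">
        <tok><orth>El</orth><lex><base>el</base><msd>DA</msd></lex></tok>
        <tok><orth>corpus</orth><lex><base>corpus</base><msd>NC</msd></lex></tok>
      </s>
    </chunk>
    """

    root = ET.fromstring(SAMPLE)
    for s in root.iter("s"):
        for tok in s.iter("tok"):
            # form, lemma, morphosyntactic description
            print(tok.findtext("orth"), tok.findtext("lex/base"),
                  tok.findtext("lex/msd"), sep="\t")

The point to notice is that segmentation, lemma and tag all live in the markup itself, so any SGML/XML-aware tool can exploit them.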

However, the most influential result of these projects is the EAGLES Guidelines. The group's work is continued by the ISLE project.

Related to EAGLES and CES were the projects PAROLE (Preparatory Action for Linguistic Resources Organisation for Language Engineering, LE2-4017, http://www.dcs.shef.ac.uk/research/groups/nlp/funded/parole.html) and MULTEXT, covered next.

MULTEXT (1994-1996)

MULTEXT (Multilingual Text Tools and Corpora, LRE 62-050, 1994-96, http://www.lpl.univ-aix.fr/projects/multext/). These were its initial objectives:

Existing tools for NLP and MT corpus-based research are typically embedded in large, non-adaptable systems which are fundamentally incompatible. Little effort has been made to develop software standards, and software reusability is virtually non-existent. As a result, there is a serious lack of generally usable tools to manipulate and analyze text corpora that are widely available for research, especially for multi-lingual applications. At the same time, the availability of data is hampered by a lack of well-established standards for encoding corpora. Although the TEI has provided guidelines for text encoding, they are so far largely untested on real-scale data, especially multi-lingual data. Further, the TEI guidelines offer a broad range of text encoding solutions serving a variety of disciplines and applications, and are not intended to provide specific guidance for the purposes of NLP and MT corpus-based research.

MULTEXT proposes to tackle both of these problems. First, MULTEXT will work toward establishing a software standard, which we see as an essential step toward reusability, and publish the standard to enable future development by others. Second, MULTEXT will test and extend the TEI standards on real-size data, and ultimately develop TEI-based encoding conventions specifically suited to multi-lingual corpora and the needs of NLP and MT corpus-based research.

Tools produced by the MULTEXT project are

Corpus textual especializado plurilingüe

A Catalan development, the IULA's multilingual specialised text corpus (http://www.iula.upf.es/corpus/corpuses.htm):

The Corpus project is the IULA's flagship research project. It compiles written texts in five languages (Catalan, Spanish, English, French and German) from the specialist areas of economics, law, the environment, medicine and computing. By building the corpus, the aim is to infer the laws that govern the behaviour of each language in each area.

The research planned on the corpus includes: detection of neologisms and terms, studies of linguistic variation, partial parsing, text alignment, extraction of data for second-language teaching, extraction of data for building electronic dictionaries, thesaurus construction, etc.

The texts are marked up according to the SGML standard, following the CES guidelines of the EAGLES initiative.

The processing of the corpus texts follows these steps (a toy sketch follows the list):

  • structural markup

  • preprocessing (detection of dates, numbers, idioms, proper nouns…)

  • morphological analysis and tagging according to the morphosyntactic tagsets designed at the IULA

  • linguistic and/or statistical disambiguation

  • storage in a textual database
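
As promised, a toy sketch of such a chain in Python. Everything here is a made-up placeholder (the tags, the two-entry lexicon, the take-the-first-tag disambiguation); it is not the IULA's actual tagset or tooling, only the shape of the pipeline:

    import re
    import sqlite3

    def structural_markup(raw: str) -> str:
        # Step 1: wrap the raw text in minimal structural tags.
        return f"<text><p>{raw}</p></text>"

    def preprocess(text: str) -> str:
        # Step 2: mark dates (and similarly numbers, proper nouns...)
        # before morphological analysis.
        return re.sub(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
                      r"<date>\g<0></date>", text)

    def analyse(tokens):
        # Step 3: list every candidate tag for each token, from a stub
        # lexicon standing in for a real morphosyntactic tagset.
        lexicon = {"la": ["DA", "PP"], "casa": ["NC", "VM"]}
        return [(tok, lexicon.get(tok, ["UNK"])) for tok in tokens]

    def disambiguate(analysed):
        # Step 4: keep one tag per token; naively the first candidate here,
        # where a real system applies linguistic rules and/or statistics.
        return [(tok, tags[0]) for tok, tags in analysed]

    def store(rows):
        # Step 5: store the disambiguated tokens in a textual database.
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE corpus (token TEXT, tag TEXT)")
        db.executemany("INSERT INTO corpus VALUES (?, ?)", rows)
        return db

    text = preprocess(structural_markup("la casa se vendió el 1/2/1999"))
    db = store(disambiguate(analyse("la casa".split())))
    print(text)
    print(db.execute("SELECT * FROM corpus").fetchall())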

Problem

I believe that much of what they have developed is not free software.

MATE

Somewhat marginal to our interests, but interesting where annotation is concerned, is MATE (Multilevel Annotation, Tools Engineering, Telematics Project LE4-8370, http://mate.nis.sdu.dk/):

MATE aims to develop a preliminary form of standard and a workbench for the annotation of spoken dialogue corpora. The annotation standard will:

  • allow multiple annotation levels, where the various annotation levels can be related to each other;

  • allow coexistence of a multitude of coding schemes and standards;

  • allow multilinguality;

  • integrate standardisation efforts in the US, Europe and Japan; and

  • be open with respect to the information levels and categories within each level.

The MATE results will be of particular relevance for:

  • the construction of SLDS (Spoken Language Dialogue Systems) lexicons

  • corpus-based learning procedures for the acquisition of language-models, part-of-speech-tagging, grammar induction, extraction of structures to be used in the dialogue control of SLDSs;

  • lexicon and grammar development based on explicit descriptions of the interrelationships between phenomena at different descriptive levels (e.g. lexical, grammatical, prosodic clues for semantics and discourse segmentation, for inferring dialogue acts, etc).

The code produced by the project, «The MATE Workbench», written in Java and GPL-licensed, can be downloaded from http://www.cogsci.ed.ac.uk/~dmck/MateCode/ .
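
The central idea, several annotation levels over the same material that can be related to one another, can be pictured with a small sketch. The following fragment is purely illustrative; the level names and labels are invented, not MATE's actual coding schemes:

    from dataclasses import dataclass

    @dataclass
    class Annotation:
        level: str   # e.g. "pos", "dialogue-act", "prosody"
        start: int   # index of the first token covered
        end: int     # index one past the last token covered
        label: str

    # The base level: a tokenized utterance from a spoken dialogue.
    tokens = ["ok", "book", "a", "flight", "to", "Pisa"]

    # Two independent coding schemes over the same tokens.
    layers = [
        Annotation("pos", 1, 2, "VB"),
        Annotation("dialogue-act", 0, 1, "acknowledge"),
        Annotation("dialogue-act", 1, 6, "request"),
    ]

    # Relating levels: which dialogue acts cover a given token?
    def acts_covering(i: int):
        return [a for a in layers
                if a.level == "dialogue-act" and a.start <= i < a.end]

    print([a.label for a in acts_covering(3)])  # -> ['request']

Because each layer only points into the shared base transcription, coding schemes can coexist and be added without disturbing one another, which is precisely what the MATE standard asks for.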

ISLE (2000-2002)

The project's website is at http://lingue.ilc.pi.cnr.it/EAGLES96/isle/ISLE_Home_Page.htm .

There we read:

The ISLE project which started on 1 January 2000 continues work carried out under the EAGLES initiative. ISLE (International Standards for Language Engineering) is both the name of a project and the name of an entire set of co-ordinated activities regarding the HLT field. ISLE acts under the aegis of the EAGLES initiative, which has seen a successful development and a broad deployment of a number of recommendations and de facto standards.[21]

The aim of ISLE is to develop HLT standards within an international framework, in the context of the EU-US International Research Cooperation initiative. Its objectives are to support national projects, HLT RTD projects and the language technology industry in general by developing, disseminating and promoting de facto HLT standards and guidelines for language resources, tools and products.[22]

ISLE targets the 3 areas: multilingual lexicons, natural interaction and multimodality (NIMM), and evaluation of HLT systems. These areas were chosen not only for their relevance to the current HLT call but also for their long-term significance. For multilingual computational lexicons, ISLE will:[23]

  • extend EAGLES work on lexical semantics, necessary to establish inter-language links;

  • design standards for multilingual lexicons;

  • develop a prototype tool to implement lexicon guidelines and standards;

  • create exemplary EAGLES-conformant sample lexicons and tag exemplary corpora for validation purposes;

  • develop standardised evaluation procedures for lexicons.

POINTER (-1996)

In the field of terminology, the POINTER project, co-funded by the European Community, was very important; it issued its Final Report (revision no. 54) in January 1996. According to http://www.computing.surrey.ac.uk/ai/pointer/ :

The aim of the POINTER project is to provide a set of concrete feasible proposals which will support users of terminology throughout Europe by facilitating the distribution of terminologies, as well as their re-use in different contexts and for different purposes.

The POINTER Final Report identifies the shortcomings of the terminology field as it then stood in Europe (in the interchange of terminologies, in validation and verification, in user interfaces and their «localisation», in the extraction of terminology from linguistic corpora, and in the need to improve information storage and retrieval techniques and to integrate terminologies into software), and recommends directions for addressing them.

ELRA

The European Language Resources Association (ELRA) was founded in February 1995 and is the recipient of EU funds under the MLIS (MultiLingual Information Society) programme on a shared-cost basis. Established at the instigation of the European Commission with the active participation of the POINTER, PAROLE (corpora/lexica) and SPEECHDAT (speech data) projects in conjunction with the RELATOR project (A European Network of Repositories for Linguistic Resources), ELRA aims to validate and distribute European language resources that are offered to it for that purpose. In addition, it acts as a clearinghouse for information on language engineering, gathering data on market needs and providing high-quality advice to potential and actual funders, including the European Commission and national governments. Equally, it supports the development and application of standards and quality control measures and methodologies for developing electronic resources in the European languages. In time ELRA aims, in its own words, «to become the focal point for pressure in the creation of high-quality and innovative language resources in Europe».

SALT (2000-2001)

«SALT» (Standards-based Access to multilingual Lexicons and Terminologies) was a project within the Fifth Framework Programme (2000-2001).

One of its web pages is at http://www.loria.fr/projets/SALT/saltsite.html . The project arose from the awareness of a need:

This project responds to the fact that many organizations in the localization industry are now using both human translation enhanced by productivity tools and MT with or without human post-editing. This duality of translation modes brings with it the need to integrate existing resources in the form of (a) the NLP lexicons used in MT (which we categorize as lexbases) and (b) the concept-oriented terminology databases used in human-translation productivity tools (which we call termbases). This integration facilitates consistency among various translation activities and leverages data from expensive information sources for both the lex side and the term side of language processing.

The SALT project combines two recently finalized interchange formats: «OLIF» (Open Lexicon Interchange Format), which focuses on the interchange of data among lexbase resources from various machine translation systems, (Thurmaier et al. 1999), and «MARTIF» (ISO 12200:1999, MAchine-Readable Terminology Interchange Format), which facilitates the interchange of termbase resources with conceptual data models ranging from simple to sophisticated. The goal of SALT is to integrate lexbase and termbase resources into a new kind of database, a lex/term-base called «XLT» (eXchange format for Lex/Term-data).

XLT is based on XML. The «Default XLT» is known as «TBX»: the ‘TermBase eXchange’ format.

Control of TBX has been handed over from the SALT project (…) to LISA (and its OSCAR SIG).
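
To give an idea of what this looks like: a TBX entry is concept-oriented, with one language section per language under a single concept. The following miniature is hypothetical; the element names (termEntry, langSet, tig, term, descrip) follow the general shape of TBX, but the entry and its handling are invented for illustration:

    import xml.etree.ElementTree as ET

    ENTRY = """
    <termEntry id="c42">
      <descrip type="subjectField">computing</descrip>
      <langSet xml:lang="en">
        <tig><term>database</term></tig>
      </langSet>
      <langSet xml:lang="es">
        <tig><term>base de datos</term></tig>
      </langSet>
    </termEntry>
    """

    # ElementTree exposes xml:lang under the predefined XML namespace.
    XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

    entry = ET.fromstring(ENTRY)
    print("subject:", entry.findtext("descrip"))
    for lang_set in entry.iter("langSet"):
        # One language section per language, grouped under one concept.
        print(lang_set.get(XML_LANG), "->", lang_set.findtext("tig/term"))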

LISA and OSCAR

LISA is the Localization Industry Standards Association; OSCAR is the LISA special interest group that maintains these standards. Pending and urgent: TMX (Translation Memory eXchange), TBX (TermBase eXchange), SRX (Segmentation Rules eXchange).



[20] Software from before 1997!?