Carla Parra
Detection of complex translation equivalents to improve Spanish – German/Norwegian word alignment
My project aims to improve the subsentential alignment of Spanish – German/Norwegian complex translation equivalents. One of the major problems that both human translators and machine translation systems encounter when dealing with Romance-Germanic language pairs is that complex nominal compounds in the Germanic languages (German, Norwegian, Swedish, etc.) are translated as phraseological units in the Romance languages (Spanish, French, Italian, Portuguese, etc.) and vice versa. This work is embedded in the CLARA (Common Language Resources and their Application) project, whose main objective is to create a large language research infrastructure for all of Europe.
German, like other Germanic languages such as Norwegian or Swedish, makes heavy use of nominal compounds, whereas Spanish typically renders these compounds as phraseological expressions. Machine translation systems usually fail to translate German compounds into the appropriate Spanish phraseological expression. In the opposite direction, Spanish-German MT systems fail to produce the compound nouns that a native German translator would use, so translations into German are also inaccurate. This phenomenon is also a great challenge for the induction of bilingual lexica, which usually relies on subsentential alignment, because the subsentential alignment of multiword expressions has not yet proven accurate enough. The main aim of my research project is therefore to improve the alignment of multiword expressions to single words for the Germanic and Romance language pairs that share this challenge. Concretely, we will start with the German-Spanish pair and then extend the work to Norwegian and other languages.
In order to improve subsentential alignment, we will run several experiments on a specialized corpus which is currently being compiled and aligned at sentence level. The corpus consists of texts originally drafted in Austria, Germany and Spain and translated into Spanish and German respectively. The texts have been retrieved from a database of the European Commission. They are classified into 13 different domains and subdomains and date from 1990 to 2010.
The starting hypothesis of this research project is that there may be latent linguistic clues that help us to automatically identify phraseological expressions in Romance languages which are likely to have a nominal compound as their translation equivalent in Germanic languages. We will use the compiled corpus to verify this hypothesis and then run a series of experiments with subsentential aligners, with the aim of improving their output using the conclusions of our preliminary study of the corpus. Once the word alignments have been shown to be accurate, it will be possible to use the results to gather "translational images", which can in turn be used by means of the semantic mirrors method to derive thesaurus-type databases.
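To make the kind of alignment we are interested in more concrete, the following minimal Python sketch groups the Spanish tokens that an aligner has linked to one German token and collects, for each German word, the set of Spanish expressions aligned to it. It is an illustration only and not the project's implementation: the function names, the representation of links as (Spanish index, German index) pairs, the reading of a "translational image" as a plain set of aligned expressions, and the hand-made example sentence pair are all assumptions.

```python
from collections import defaultdict


def many_to_one_links(src_tokens, tgt_tokens, links):
    """Return (source phrase, target word) pairs for many-to-one links.

    links is an iterable of (source_index, target_index) pairs, assumed
    here to be the output format of some word aligner.
    """
    by_target = defaultdict(list)
    for src_i, tgt_i in links:
        by_target[tgt_i].append(src_i)

    pairs = []
    for tgt_i in sorted(by_target):
        src_indices = sorted(set(by_target[tgt_i]))
        # Keep only targets that are aligned to two or more source tokens.
        if len(src_indices) >= 2:
            # Tokens are joined in sentence order, even if not contiguous.
            phrase = " ".join(src_tokens[i] for i in src_indices)
            pairs.append((phrase, tgt_tokens[tgt_i]))
    return pairs


def translational_images(pairs):
    """Map each target word to the set of source phrases aligned to it."""
    images = defaultdict(set)
    for phrase, word in pairs:
        images[word].add(phrase)
    return dict(images)


# Hand-made illustrative example, not taken from the project corpus.
es = ["tarjeta", "de", "crédito"]
de = ["Kreditkarte"]
links = [(0, 0), (1, 0), (2, 0)]

pairs = many_to_one_links(es, de, links)
images = translational_images(pairs)
print(pairs)
print(images)
```

In this sketch a one-to-one link is simply ignored, since only multiword-to-single-word correspondences are of interest here.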
The results of this project will not only improve subsentential alignment and dictionary induction techniques, but could also be used in other Natural Language Processing applications such as Computer Assisted Translation, Cross-lingual Information Retrieval, Computer Assisted Language Learning, and any other application involving more than one language.
Last updated 21.3.2012