ANGLABHARTI

ANGLABHARTI: A MULTILINGUAL MACHINE AIDED TRANSLATION METHODOLOGY FOR TRANSLATION FROM ENGLISH TO INDIAN LANGUAGES

Prof. R.M.K. Sinha Department of Computer Science and Engineering Indian Institute of Technology, Kanpur 208 016 India
e-mail : rmk@iitk.ac.in

ANGLABHARTI is a machine-aided translation methodology designed specifically for translating English into Indian languages. English is an SVO language, while Indian languages are SOV and have relatively free word order. Instead of designing a separate translator from English to each Indian language, Anglabharti uses a pseudo-interlingua approach. It analyses English only once and creates an intermediate structure called PLIL (Pseudo Lingua for Indian Languages). This is the basic translation process: the English source is translated into PLIL with most of the disambiguation already performed. The PLIL structure is then converted to each Indian language through a process of text generation. Analysing the English sentences and translating them into PLIL is estimated to account for about 70% of the effort, and text generation for the remaining 30%. Thus a new English to Indian language translator can be built with only about 30% additional effort. The major design considerations for Anglabharti have been aimed at:

- providing a practical aid for translation, in which the aim is to get 90% of the task done by the machine, with 10% left to human post-editing;

- a system which can grow incrementally to handle more complex situations;

- a uniform mechanism by which translation from English to the majority of Indian languages can be achieved by attaching appropriate text generator modules; and

- a human-engineered man-machine interface to facilitate both its usage and augmentation.
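The 70%/30% estimate above can be turned into a small worked comparison. The sketch below is only illustrative: it assumes that a direct translator for one language pair costs one full unit of effort, and the function names are hypothetical rather than part of Anglabharti.

```python
# Illustrative arithmetic only: one "unit" is assumed to be the effort of
# building a single English-to-one-language translator from scratch.

def direct_effort(n_languages, full_effort=1.0):
    """Effort if a separate translator is built for every target language."""
    return n_languages * full_effort

def plil_effort(n_languages, analysis_share=0.7, generation_share=0.3):
    """Effort if English is analysed into PLIL once (about 70%) and a text
    generator (about 30%) is added for each target language."""
    return analysis_share + n_languages * generation_share

if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, direct_effort(n), round(plil_effort(n), 2))
```

Under these assumptions the two approaches cost the same for a single target language, and the shared analysis pays off from the second language onward.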

Anglabharti is a pattern-directed, rule-based system with a context-free-grammar-like structure for English (the source language), which generates a `pseudo-target' (PLIL) applicable to a group of Indian languages (the target languages). A set of rules obtained through corpus analysis is used to identify plausible constituents, with respect to which the movement rules for the PLIL are constructed. The idea of using PLIL is primarily to exploit structural similarity and thereby obtain advantages similar to those of an interlingua approach. The system also uses an example-base to identify noun and verb phrasals and resolve their ambiguities.
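The idea of a movement rule over identified constituents can be sketched with a toy example. Everything here is an assumption made for illustration: the `Constituent` type, the three-word analyser, the role names and the tiny romanized Hindi lexicon are hypothetical and do not reflect the actual PLIL format or rule-base.

```python
from dataclasses import dataclass

@dataclass
class Constituent:
    role: str  # "subject", "verb" or "object" (hypothetical role labels)
    text: str

def analyse_svo(sentence):
    """Toy analysis: assumes exactly three words in subject-verb-object order."""
    words = sentence.rstrip(".").split()
    if len(words) != 3:
        raise ValueError("toy analyser handles only three-word SVO sentences")
    subject, verb, obj = words
    return [
        Constituent("subject", subject),
        Constituent("verb", verb),
        Constituent("object", obj),
    ]

# Movement rule: place constituents in verb-final (SOV) order.
SOV_ORDER = {"subject": 0, "object": 1, "verb": 2}

def to_pseudo_target(constituents):
    """Reorder constituents once; the result is shared by all generators."""
    return sorted(constituents, key=lambda c: SOV_ORDER[c.role])

# Toy romanized Hindi lexicon, for illustration only.
TOY_HINDI = {"Ram": "Ram", "mango": "aam", "eats": "khata hai"}

def generate(pseudo_target, lexicon):
    """Toy text generator: word-for-word lookup over the reordered structure."""
    return " ".join(lexicon.get(c.text, c.text) for c in pseudo_target)

if __name__ == "__main__":
    plil = to_pseudo_target(analyse_svo("Ram eats mango."))
    print([c.role for c in plil])
    print(generate(plil, TOY_HINDI))
```

The reordering is done once on the source side, so a generator for another verb-final language would only need its own lexicon and morphology, which a real generator would handle far more carefully than this lookup.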

Indian languages are verb-ending, free word-group order languages with a lot of structural similarity. They can be classified into four broad groups according to their origin: the Indo-Aryan family (Hindi, Bangla, Asamiya, Punjabi, Marathi, Oriya, Gujrati etc.); the Dravidian family (Tamil, Telugu, Kannada and Malayalam); the Austro-Asian family; and the Tibetan-Burmese family. Within each group the languages exhibit a high degree of structural homogeneity. The methodology exploits this similarity to a great extent in its design. The Paninian framework, based on Sanskrit grammar and using Karak (similar to 'case') relationships, provides a uniform way of designing the Indian language text generators using selectional constraints and preferences.

IIT Kanpur, in association with the Technology Development for Indian Language (TDIL) Programme of the Govt. of India, has now taken an initiative to make the AnglaBharti technology available to all thirteen Resource Centres in the country. These resource centres have been established across the country to develop Indian language technology solutions in their regional languages. The details of this mission can be found at AnglaBharti Mission.

The lexical database is the fuel of the translation engine. A number of ontological/semantic tags are used to resolve sense ambiguity in the source language. We use semantics to resolve most of the intra-sentence anaphora/pronoun references. Alternative meanings for the unresolved ambiguities are retained in the pseudo target language. A text generator module for each target language transforms the pseudo target language into that target language. These transformations may produce ill-formed sentences, so a corrector for ill-formed sentences is used for each target language. Finally, a human-engineered post-editing package is used to make the final corrections. The post-editor needs to know only the target language.
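A minimal sketch of tag-based sense resolution, with unresolved alternatives retained, might look as follows. The lexicon entries, tag names and verb preferences are invented for illustration and are not taken from the Anglabharti lexical database.

```python
# Hypothetical lexical database: each sense carries a set of semantic tags.
LEXICON = {
    "bank": [
        {"gloss": "financial institution", "tags": {"organisation"}},
        {"gloss": "river side", "tags": {"place", "natural"}},
    ],
}

# Hypothetical selectional preferences of governing verbs.
PREFERENCES = {
    "deposit": {"organisation"},
    "visit": {"organisation", "place"},
}

def resolve_sense(word, governing_verb):
    """Return the senses compatible with the verb's preferred tags.

    If no sense matches, or the verb is unknown, all senses are returned so
    that the alternatives stay available for later stages and post-editing.
    """
    senses = LEXICON.get(word, [])
    preferred = PREFERENCES.get(governing_verb, set())
    matching = [s for s in senses if s["tags"] & preferred]
    return matching if matching else senses

if __name__ == "__main__":
    print([s["gloss"] for s in resolve_sense("bank", "deposit")])
    print([s["gloss"] for s in resolve_sense("bank", "visit")])
```

Returning a list rather than a single sense mirrors the idea of carrying unresolved alternatives forward instead of forcing an early choice.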

The ANGLABHARTI methodology was used to design a functional prototype for English to Hindi on a Sun system. The feasibility of extending this to English to Telugu/Tamil was also demonstrated. Thereafter, during 1995-97, the DOE/MIT TDIL programme funded a project to port the English to Hindi translation software to a PC platform running Linux, for translating English health slogans into Hindi. ER & DCI Lucknow/Noida (now CDAC Noida) was associated with the project for field testing and packaging the software. In 2000 the project received further funding to make it more comprehensive. The outcome of this project has been the release of the first version of the software, named AnglaHindi (an English to Hindi version based on the Anglabharti approach), which accepts a wide variety of English text. The AnglaHindi software technology has been transferred to two organizations and is being made available on both the Linux and Windows platforms.

Text generation, as used here, differs from language generation in the sense that the latter also has to decide `what to say' (the strategic level) in addition to `how to say it' (the tactical level).

It may be noted that, by having different text generators use the same rule-base and sense disambiguator, a generic MT system is obtained for a host of target languages. We have used the Paninian framework for synthesising the Indian language text.

AnglaBharti-II

In 2004, phase II of system development was launched to address many of the shortcomings of the earlier architecture. The resulting system has been named AnglaBharti-II.

AnglaBharti-II uses a generalized example-base (GEB) for hybridization, in addition to a raw example-base (REB). During the development phase, when modifying the rule-base proves difficult and may produce unpredictable results, the example-base is grown interactively instead. At the time of actual usage, the system first attempts a match in the REB and GEB before invoking the rule-base. In AnglaBharti-II, we have made provision for automated pre-editing and paraphrasing, generalized and conditional multi-word expressions, and recognition of named entities, and have incorporated an error-analysis module and a statistical language model for automated post-editing. The purpose of the automatic pre-editing module is to transform/paraphrase the input sentence into a form which is more easily translatable. Automated pre-editing may even fragment an input sentence if the fragments are easily translatable and can be positioned in the final translation. Such fragmentation may be triggered by the 'failure analysis' module when translation fails. The failure analysis consists of heuristics that speculate on what might have gone wrong. The entire system is pipelined with various sub-modules. All these have contributed significantly to the greater accuracy and robustness of the system.
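The lookup order described above (REB first, then GEB, then the rule-base) can be sketched as a simple fallback chain. This is a toy illustration under stated assumptions: the example entries, the regular-expression form of a generalized example, the slot lexicon and the `rule_base_translate` stub are all hypothetical and say nothing about how AnglaBharti-II actually stores or matches its examples.

```python
import re

# Raw example-base (REB): whole sentences stored with their translations.
RAW_EXAMPLES = {"good morning": "suprabhat"}

# Generalized example-base (GEB): patterns with slots plus a target template.
GENERALIZED_EXAMPLES = [
    (re.compile(r"^i like (?P<x>\w+)$"), "mujhe {x} pasand hai"),
]

# Toy lexicon for translating slot fillers (illustration only).
SLOT_LEXICON = {"tea": "chai"}

def rule_base_translate(sentence):
    """Stand-in for the rule-based pipeline, which is not modelled here."""
    return f"<rule-base output for: {sentence}>"

def translate(sentence):
    """Try the REB, then the GEB, and only then fall back to the rule-base.

    Returns a (source_of_translation, translation) pair.
    """
    key = sentence.strip().rstrip(".").lower()
    if key in RAW_EXAMPLES:
        return "REB", RAW_EXAMPLES[key]
    for pattern, template in GENERALIZED_EXAMPLES:
        match = pattern.match(key)
        if match:
            fillers = {
                name: SLOT_LEXICON.get(value, value)
                for name, value in match.groupdict().items()
            }
            return "GEB", template.format(**fillers)
    return "RULES", rule_base_translate(sentence)

if __name__ == "__main__":
    for text in ("Good morning.", "I like tea", "The sky is blue."):
        print(translate(text))
```

Keeping the example-bases ahead of the rule-base means a problematic sentence can be handled by adding an example, without touching rules whose interactions are hard to predict.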