GRAMMAR ACQUISITION FROM REAL CORPUS

Rishabh Nigam
Shubhdeep Kochhar
Advisor: Amitabh Mukherjee
2011-12

ABSTRACT

This project deals with unsupervised learning of natural languages. Given a corpus of realistic, natural sentences, we extract patterns from it and in turn generate new sentences that were not part of the original corpus. This extraction and generation process is applied to several English and Hindi corpora, and the results are analysed. The algorithm used for this process is called ADIOS (Automatic Distillation of Structure). Given a corpus of strings (text or speech, DNA sequences, etc.), this algorithm recursively distills hierarchically structured patterns.

LINKS

link to code
proposal report presentation

ABOUT THE ALGORITHM

The algorithm tries to group similar lexical items into equivalence classes. To do so, it uses probabilistic inference of patterns and recursively generates more and more patterns. It uses the MEX criterion to decide whether two nodes are equivalent. The algorithm starts with a graph of N nodes (the distinct words in the corpus) and m paths (the sentences). As this initial graph is highly unstructured, we try to organize it. For this we use the MEX criterion, which is based on a matrix of ratios: the number of paths reaching a node divided by the number of paths reaching the node before it. The algorithm applies this condition iteratively to find more and more complex patterns.
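The path-count ratio described above can be illustrated with a minimal sketch. It is not the project's code and it is not the complete MEX criterion, which is more involved than this single ratio; it only shows how the ratio along one candidate path could be computed. The function names, the toy corpus and the `max_len` parameter are assumptions made for the illustration.

```python
from collections import Counter

def count_subpaths(sentences, max_len):
    """Count occurrences of every word sequence of up to max_len words."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for start in range(len(words)):
            stop = min(start + max_len, len(words))
            for end in range(start + 1, stop + 1):
                counts[tuple(words[start:end])] += 1
    return counts

def right_ratios(path, counts):
    """For each step along path, divide the number of paths that reach the
    current node by the number of paths that reached the node before it."""
    ratios = []
    for j in range(2, len(path) + 1):
        before = counts[tuple(path[:j - 1])]
        current = counts[tuple(path[:j])]
        ratios.append(current / before if before else 0.0)
    return ratios

# Toy corpus (hypothetical): each sentence is one path through the word graph.
corpus = [
    "the cat is here",
    "the dog is here",
    "the cat is asleep",
    "a dog is here",
]
nodes = {word for sentence in corpus for word in sentence.split()}
counts = count_subpaths(corpus, max_len=4)
ratios = right_ratios(["the", "cat", "is", "here"], counts)
print(len(nodes), "nodes,", len(corpus), "paths")
print(ratios)
```

A ratio that stays high and then drops sharply suggests that many paths run together up to that point and then diverge, which is the kind of signal a pattern boundary test can use.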

OUR APPROACH

The algorithm had previously been implemented on an English database. We extended this to a Hindi database to see whether similar structures could be found in Hindi. We ran our code on corpora of a few thousand sentences (2500 for Hindi and 12000 for English), obtained the patterns, and generated new sentences based on the grammar learnt.
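As a rough illustration of the generation step, the sketch below expands one learnt pattern whose slots are either literal words or equivalence classes, and keeps only the sentences that do not already occur in the corpus. The pattern, the class names E1 and E2 and the toy corpus are hypothetical and are not taken from our results; the function is an assumption about how such an expansion could be written, not the project's implementation.

```python
from itertools import product

def generate(pattern, classes):
    """Expand a pattern into sentences. A slot that names an equivalence
    class is replaced by each of its members; any other slot is a literal."""
    slots = [sorted(classes[slot]) if slot in classes else [slot]
             for slot in pattern]
    return [" ".join(words) for words in product(*slots)]

# Hypothetical equivalence classes and pattern, for illustration only.
classes = {"E1": {"cat", "dog"}, "E2": {"here", "asleep"}}
pattern = ["the", "E1", "is", "E2"]
corpus = {"the cat is here", "the dog is here", "the cat is asleep"}

# Keep only sentences that were not part of the original corpus.
novel = [s for s in generate(pattern, classes) if s not in corpus]
print(novel)
```

Here the pattern yields four sentences, three of which are already in the toy corpus, so a single new sentence remains.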

RESULTS

We were able to generate 101 equivalence classes for the CHILDES database, 12 for the Hindi corpus of 2500 sentences, 74 for the Hindi corpus of 5000 sentences, and 14 for the commentary corpus. The results for each corpus are listed below.

CHILDES	corpus	generate	labels	Image
HINDI(2500)	corpus	generate	labels	Image
HINDI(5000)	corpus	generate	labels
COMMENTARY	corpus	generate	labels	Image
SAMPLE	corpus	generate	labels	Image

REFERENCES

[1] Heidi R. Waterfall, Ben Sandbank, Luca Onnis and Shimon Edelman, An empirical generative framework for computational modeling of language acquisition, Cambridge University Press, 2010. http://kybele.psych.cornell.edu/~edelman/Waterfall-Sandbank-Onnis-Edelman-JCL10.pdf
[2] Zach Solan, Unsupervised Learning of Natural Languages, PhD thesis under Professor David Horn, Professor Shimon Edelman and Professor Eytan Ruppin, Senate of Tel Aviv University, 2006. http://horn.tau.ac.il/publications/ZachSolanThesis.pdf
[3] Zach Solan, ADIOS website, Tel Aviv University. http://adios.tau.ac.il/algorithm.html