GRAMMAR ACQUISITION FROM REAL CORPUS
Rishabh Nigam
Shubhdeep Kochhar
Advisor: Amitabh Mukherjee
2011-12
ABSTRACT
This project deals with unsupervised learning of natural languages. From a corpus of realistic,
natural sentences, we extract patterns and in turn generate new sentences that were not part of
the original corpus. This extraction and generation process is applied to several English and
Hindi corpora, and the results are analysed. The algorithm used for this process is called ADIOS
(Automatic Distillation of Structures). Given a corpus of strings (text or speech, DNA sequences,
etc.), this algorithm recursively distills hierarchically structured patterns.
LINKS
link to code | proposal | report | presentation
ABOUT THE ALGORITHM
The algorithm groups similar lexical items (words) into equivalence classes. To do so it uses probabilistic inference of patterns, recursively generating more and more patterns. It uses the MEX (motif extraction) criterion to decide whether two nodes are equivalent or not. The algorithm starts with a graph of N nodes (one per distinct word) and m paths (one per sentence). Because this initial graph is highly unstructured, we try to organize it. For this we use the MEX criterion, which is based on a matrix of ratios: the number of paths that continue to a node divided by the number of paths that reach the node before it. The algorithm applies this condition iteratively to find more and more complex patterns.
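The path-count ratio described above can be illustrated with a minimal sketch. It is not the project's code: the toy corpus and the names `count_subpaths` and `right_ratio` are assumptions made for illustration, and only the forward ratio is computed. The full MEX criterion also looks at the backward direction and tests whether a drop in the ratio is significant, which is left out here.

```python
from collections import Counter

def count_subpaths(sentences, max_len=4):
    """Count occurrences of every word sequence of up to max_len words."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for start in range(len(words)):
            stop = min(start + max_len, len(words))
            for end in range(start + 1, stop + 1):
                counts[tuple(words[start:end])] += 1
    return counts

def right_ratio(counts, prefix, word):
    """Fraction of the paths through `prefix` that continue to `word`."""
    prefix = tuple(prefix)
    through_prefix = counts[prefix]
    if through_prefix == 0:
        return 0.0
    return counts[prefix + (word,)] / through_prefix

# Toy corpus (hypothetical, not taken from the project's data).
sentences = ["the cat sat", "the cat ran", "the dog sat", "a cat sat"]

# N nodes = distinct words, m paths = sentences.
vocabulary = {word for s in sentences for word in s.split()}
N, m = len(vocabulary), len(sentences)

counts = count_subpaths(sentences)
ratio_the_cat = right_ratio(counts, ["the"], "cat")
ratio_the_cat_sat = right_ratio(counts, ["the", "cat"], "sat")
```

In this toy corpus three paths pass through "the" and two of them continue to "cat", so the first ratio is 2/3. A sharp fall in such ratios along a path is the kind of signal the criterion uses to mark the boundary of a candidate pattern.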
OUR APPROACH
The algorithm had previously been implemented and run on an English database. We extended this to a Hindi database to see if similar structures could be found in Hindi. We ran our code on corpora of a few thousand sentences (2500 for Hindi and 12000 for English) and tried to obtain the patterns and generate new sentences based on the grammar learnt.
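As a rough illustration of the generation step, the sketch below expands one learnt pattern into sentences and keeps those that do not occur in the corpus. The pattern, the equivalence classes `E1` and `E2`, and the function names are hypothetical toy values, not output of the project; the sketch only assumes that a pattern is a sequence of slots, each either a literal word or the name of an equivalence class.

```python
from itertools import product

# Hypothetical learnt grammar (toy values, not results from the project).
equivalence_classes = {
    "E1": ["cat", "dog"],
    "E2": ["sat", "ran"],
}
pattern = ["the", "E1", "E2"]

def expand(pattern, classes):
    """List every sentence the pattern can produce."""
    # A slot that names a class offers all its members; any other slot is a literal word.
    slots = [classes.get(slot, [slot]) for slot in pattern]
    return [" ".join(words) for words in product(*slots)]

def novel_sentences(pattern, classes, corpus):
    """Keep only generated sentences that are absent from the corpus."""
    seen = set(corpus)
    return [s for s in expand(pattern, classes) if s not in seen]

corpus = ["the cat sat", "the cat ran", "the dog sat"]
generated = expand(pattern, equivalence_classes)
new_sentences = novel_sentences(pattern, equivalence_classes, corpus)
```

Here the pattern yields four sentences, three of which are already in the toy corpus, so the only new sentence is "the dog ran".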
RESULTS
We were able to generate 101 equivalence classes for the CHILDES database, 12 for Hindi (2500 sentences), 74 for Hindi (5000 sentences), and 14 for the commentary corpus. The results:
CHILDES | corpus | generate | labels | Image |
HINDI(2500) | corpus | generate | labels | Image |
HINDI(5000) | corpus | generate | labels | |
COMMENTARY | corpus | generate | labels | Image |
SAMPLE | corpus | generate | labels | Image |
REFERENCES
[1] Heidi R. Waterfall, Ben Sandbank, Luca Onnis and Shimon Edelman, An empirical generative framework for computational modeling of language acquisition, Cambridge University Press, 2010. http://kybele.psych.cornell.edu/~edelman/Waterfall-Sandbank-Onnis-Edelman-JCL10.pdf
[2] Zach Solan, Unsupervised Learning of Natural Languages, PhD thesis, under Professor David Horn, Professor Shimon Edelman and Professor Eytan Ruppin, Senate of Tel Aviv University, 2006. http://horn.tau.ac.il/publications/ZachSolanThesis.pdf
[3] ADIOS website by Zach Solan, Tel Aviv University. http://adios.tau.ac.il/algorithm.html