Homework 1 - CS671, Natural Language Processing

Amlan Kar - 13105

Part A. List of top syllables and syllable bigrams

Acknowledgement:

Syllables were identified using a modified version of Choudhury's algorithm [1].
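For illustration only, the sketch below shows a much simpler orthographic segmentation into aksharas. It is not Choudhury's algorithm and not the modification used for this assignment. It assumes that every base letter starts a new unit unless the previous character is a virama, and that combining marks (vowel signs, nukta, anusvara, virama) attach to the current unit. The function name `split_aksharas` and the `VIRAMAS` set are hypothetical.

```python
import unicodedata

# Virama (halant) code points: Devanagari U+094D and Odia U+0B4D.
VIRAMAS = {"\u094d", "\u0b4d"}


def split_aksharas(word):
    """Split one word into orthographic syllables (aksharas).

    Assumption: this is a plain Unicode-category heuristic, not the
    rule-based method cited in the acknowledgement.
    """
    units = []
    for ch in word:
        # Mn/Mc are combining marks: vowel signs, nukta, anusvara, virama.
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        # A letter right after a virama continues a consonant cluster.
        joins_cluster = bool(units) and units[-1][-1] in VIRAMAS
        if units and (is_mark or joins_cluster):
            units[-1] += ch
        else:
            units.append(ch)
    return units
```

With this heuristic, the Hindi word नमस्ते splits into न, म and स्ते. Unlike the cited method, it does not handle pronunciation rules such as schwa deletion.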

Part 1. Indian language chosen - Odia

A.1.1 - The corpus for Odia was self-made by taking paragraphs from random articles on or.wikipedia.org.

A.1.2 - List of top syllable unigrams for Odia can be found here.

A.1.3 - List of top syllable bigrams for Odia can be found here.

Part 2. Language chosen - Hindi (Devanagari script)

A.2.1 - The corpus for Hindi was provided with the assignment as random articles from hi.wikipedia.org.

A.2.2 - List of top syllable unigrams for Hindi can be found here.

A.2.3 - List of top syllable bigrams for Hindi can be found here.
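The unigram and bigram lists in Part A amount to frequency counts over syllabified words. The following is a minimal sketch of such a count, not the assignment's actual code. It assumes each word has already been split into a list of syllables and that bigrams are counted within a word only (the report does not say whether bigrams cross word boundaries). The name `count_ngrams` and the toy input are hypothetical.

```python
from collections import Counter


def count_ngrams(words):
    """Count syllable unigrams and within-word syllable bigrams.

    `words` is an iterable of words, each given as a list of syllables.
    """
    unigrams = Counter()
    bigrams = Counter()
    for syllables in words:
        unigrams.update(syllables)
        # Pair each syllable with its successor inside the same word.
        bigrams.update(zip(syllables, syllables[1:]))
    return unigrams, bigrams


# Toy input (hypothetical): two words that are already syllabified.
words = [["ka", "ma", "la"], ["ka", "ma"]]
unigrams, bigrams = count_ngrams(words)

top_bigrams = bigrams.most_common(1)      # most frequent bigram(s)
total_syllables = sum(unigrams.values())  # number of syllable tokens
distinct_syllables = len(unigrams)        # number of syllable types
```

The last two values correspond to the "Number of syllables" and "Number of distinct syllables" statistics reported in Part B.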

Part B. Plot of log frequency distribution for top 1000 syllables

Part 1. For Odia language

Statistics:

Number of syllables = 109035
Number of distinct syllables = 2995

B.1.1 Log-Frequency vs Rank plot for Odia dataset

Part 2. For Hindi language (Devanagari script)

Statistics:

Number of syllables = 26307
Number of distinct syllables = 2083

B.2.1 Log-Frequency vs Rank plot for Hindi dataset
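A log-frequency vs rank plot such as those in Part B can be derived from the syllable counts as sketched below. This is an illustration under assumptions, not the code used for the plots: the logarithm base (10 here) is a guess, since the report does not state it, and the function names are hypothetical. Plotting requires matplotlib, which is imported only inside the plotting function.

```python
import math
from collections import Counter


def log_rank_frequency(counts, top_n=1000):
    """Return (ranks, log_freqs) for the top_n most frequent syllables."""
    top = counts.most_common(top_n)
    ranks = list(range(1, len(top) + 1))
    # Base 10 is an assumption; any base gives the same curve shape.
    log_freqs = [math.log10(freq) for _, freq in top]
    return ranks, log_freqs


def plot_log_rank_frequency(counts, title, top_n=1000):
    """Draw log-frequency against rank (needs matplotlib installed)."""
    import matplotlib.pyplot as plt

    ranks, log_freqs = log_rank_frequency(counts, top_n)
    plt.plot(ranks, log_freqs)
    plt.xlabel("Rank")
    plt.ylabel("log10(frequency)")
    plt.title(title)
    plt.show()


# Toy counts (hypothetical) standing in for real syllable frequencies.
counts = Counter({"ka": 100, "ma": 10, "la": 1})
ranks, log_freqs = log_rank_frequency(counts)
```

Calling `plot_log_rank_frequency(counts, "Odia")` on real counts would then draw the curve for the top 1000 syllables.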

Part C. Self-made dataset

C.1.1 - The self-made dataset for the Odia language can be found here.

C.2.1 - The dataset provided with the assignment (hwiki.txt) was used for Hindi in Devanagari script.

Part D. Code

The code for the assignment can be found in the directory ./code/

References

[1] Choudhury, Monojit. "Rule-based grapheme to phoneme mapping for Hindi speech synthesis." 90th Indian Science Congress of the International Speech Communication Association (ISCA), Bangalore, India. 2003.