CS671 Assignment 1

- Nishant Rai, 13449


A. List of top syllables and syllable bigrams



The format is "String : Frequency"

List of top syllables for Hindi (Individually made corpus)

List of top syllables for Latin Devanagari

List of top syllables for English (Optional)

List of top bigrams for Hindi (Individually made corpus)

NOTE : THE FOLLOWING RESULTS MIGHT NOT SEEM PERFECT DUE TO ERRORS IN THE SYLLABIFICATION PROCESS

List of top bigrams for Latin Devanagari

List of top bigrams for English (Optional)
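
As a rough illustration of how such lists can be produced, here is a minimal sketch (not the submitted code). It assumes the words have already been split into syllables, and that a bigram is a pair of adjacent syllables within the same word; both are assumptions, and the names `top_syllables_and_bigrams` and `format_entry` are hypothetical.

```python
from collections import Counter


def top_syllables_and_bigrams(syllabified_words, k=10):
    """Count syllables and adjacent syllable pairs, return the top k of each.

    `syllabified_words` is a list of words, each given as a list of
    syllables. The syllabification step itself is not shown here.
    """
    syllable_counts = Counter()
    bigram_counts = Counter()
    for syllables in syllabified_words:
        syllable_counts.update(syllables)
        # Assumption: bigrams do not cross word boundaries.
        bigram_counts.update(zip(syllables, syllables[1:]))
    return syllable_counts.most_common(k), bigram_counts.most_common(k)


def format_entry(item, count):
    """Render one entry in the "String : Frequency" format."""
    text = item if isinstance(item, str) else "".join(item)
    return f"{text} : {count}"


# Toy input, purely illustrative (not taken from any of the corpora).
sample_words = [["na", "ma", "ste"], ["na", "ma"], ["ma", "na"]]
top_syl, top_bi = top_syllables_and_bigrams(sample_words, k=2)
for item, count in top_syl + top_bi:
    print(format_entry(item, count))
```

If bigrams were counted across word boundaries instead, the whole corpus would be flattened into one syllable sequence before the `zip` step.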



B. A plot of the log-frequency distribution for the top 1000 syllables



Plot for SET A (Hindi (Personal Corpus)):



The algorithm used for syllabification is a small modification of Choudhari's Algorithm.

The size parameters are:
Number of syllables : 9678
Number of Words: 20445


Plot for SET B (Hindi (Latin Devanagari)):




The size parameters are:
Number of syllables : 1621
Number of Words: 9482


Plot for SET C (English (Optional)):



We can observe that the frequencies fall off fairly rapidly with rank, even on the log scale. Thus a few of the syllables (words) are used very often, while the majority of them are used only rarely.
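
The points behind such a plot can be computed as in the sketch below. It assumes the counts are held in a dictionary and uses the natural logarithm; the base used for the plots in this report is not stated, so that choice is an assumption, and `log_frequency_by_rank` is a hypothetical name.

```python
import math


def log_frequency_by_rank(counts, top_n=1000):
    """Return (rank, log frequency) pairs for the top_n most frequent items.

    Ranks start at 1. Natural log is an assumption; another base only
    rescales the vertical axis.
    """
    ranked = sorted(counts.values(), reverse=True)[:top_n]
    return [(rank, math.log(freq)) for rank, freq in enumerate(ranked, start=1)]


# Toy counts, purely illustrative.
toy_counts = {"a": 1000, "b": 100, "c": 10, "d": 1}
points = log_frequency_by_rank(toy_counts)
for rank, log_freq in points:
    print(rank, round(log_freq, 3))
```

The resulting pairs can be handed to any plotting tool, with rank on the x-axis and log frequency on the y-axis.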

The size parameters are:
Number of syllables : 16604
Number of Words: 208970

Let's plot the whole distribution up to rank 10000 [Here (LOG_freq)] and [Here (Normal_freq)]. If we observe closely, we can notice that the area under the curve for the first part (up to the point where the frequency drops to 0.1 * max_freq) and the area for the rest of the curve are roughly equal.

Sum_P1 = 3867153
Sum_P2 = 2980027
Sum_P1/Sum_P2 ≈ 1.30

(Sum refers to the sum of the frequencies, and consequently the area under the curve)
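
The split can be sketched as follows. This is only an illustration under assumptions: the head is taken to be every item whose frequency is at least 0.1 * max_freq (whether the boundary is inclusive is a guess), and `head_tail_sums` is a hypothetical name.

```python
def head_tail_sums(frequencies, threshold_ratio=0.1):
    """Split frequencies at threshold_ratio * max frequency and sum each part.

    Returns (head_sum, tail_sum). Items exactly at the cutoff are counted
    in the head, which is an assumption.
    """
    ranked = sorted(frequencies, reverse=True)
    if not ranked:
        return 0, 0
    cutoff = threshold_ratio * ranked[0]
    head_sum = sum(f for f in ranked if f >= cutoff)
    tail_sum = sum(f for f in ranked if f < cutoff)
    return head_sum, tail_sum


# Toy frequencies, purely illustrative.
head, tail = head_tail_sums([100, 50, 10, 9, 5, 5, 1])
print(head, tail)

# Share of the total carried by the tail, using the sums reported above.
sum_p1, sum_p2 = 3867153, 2980027
tail_share = sum_p2 / (sum_p1 + sum_p2)
print(round(sum_p1 / sum_p2, 2), round(tail_share, 3))
```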

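This effect is referred to as the long tail effect and is common in plots related to purchases of products. So we can also conclude that a substantial share of all occurrences (about 44% here, going by the sums above) comes from the less used words, even though the frequently used ones still account for slightly more than half.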
The following are the plots for normal frequencies (instead of log)



C. Link to the corpus (Individually created)



Hindi corpus : Random paragraphs from random pages of the Hindi Wikipedia.


D. Link to the code



Code