2 - Probabilities, Entropy

Probabilitiies on Given Data

On analysing the given text, the probablities of occurences various alphabets were clculated which are representated as follows

The numbers from 1 to 26 represent the corresponding alphabets and the height of the bar represents the probability of their occurence in the given data.

Entropy of Given Data

Using the formula than Entropy= -1* Summation of P(i)*ln(P(i)) where P(i) is the probability of the occurence of the ith alphabet , we get
Entropy = 2.8431

Analysis of Larger Data Set

For this part, I took the data from Wikipedia about the frequency of occurence of various alphabets in the whole of English Language. The probabilities of occurence of various alphabets are as follows -

The numbers from 1 to 26 represent the corresponding alphabets and the height of the bar represents the probability of their occurence in the given data.
The entropy of this distribution is 2.8944

The two distributions have many similarities like the alphabet 'e' occurs most frequency in both followed by 't'. Also, the letters like 'z','x' have low probabilities in both. This shows that there might be some kind of frequency pattern which the english alphabets tend to follow.
Also, the entropies of both the distributins are similar.

- Shubham Tulsiani