Natural Language Processing (NLP) is a field of Artificial Intelligence that focuses on enabling systems to understand and process human languages. In this article, I will use NLP to analyze my WhatsApp chats. For privacy reasons, I will refer to the participants as Person 1, Person 2 and so on.
If you have never exported your WhatsApp chats before, don't worry, it's very easy. To apply NLP to WhatsApp chats, you first need to export them from your smartphone: open any chat in WhatsApp and select the export chat option. The text file you get will look like this:
["[02/07/2017, 5:47:33 pm] Person_1: Hey there! This is the first message",
"[02/07/2017, 5:48:24 pm] Person_1: This is the second message",
"[02/07/2017, 5:48:44 pm] Person_1: Third…",
"[02/07/2017, 8:10:52 pm] Person_2: Hey Person_1! This is the fourth message",
"[02/07/2017, 8:14:11 pm] Person_2: Fifth …etc"]
I will use two different approaches for the NLP of WhatsApp chats. The first focuses on the fundamentals of NLP, and the second uses the datetime stamp at the start of every message.
To analyze our WhatsApp conversations, we first need to turn the conversation into structured data. This takes a few basic steps: we create a dictionary with two keys, one per person, where each value is a list of that person's tokenized messages.
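Both approaches start from the raw lines of the export. As a minimal sketch, assuming every message follows the `[date, time] Person: text` layout of the sample above and that the date is day/month/year, one line can be split into its parts like this. The names `LINE_RE`, `parse_line` and `load_chat` are hypothetical helpers, not part of any library.

```python
import re
from datetime import datetime

# Pattern for lines like "[02/07/2017, 5:47:33 pm] Person_1: Hey there!"
LINE_RE = re.compile(r"^\[(?P<stamp>[^\]]+)\]\s(?P<person>[^:]+):\s(?P<text>.*)$")


def parse_line(line):
    """Return (timestamp, person, text), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # Assumption: day/month/year and a 12-hour clock, as in the sample export.
    stamp = datetime.strptime(match["stamp"].upper(), "%d/%m/%Y, %I:%M:%S %p")
    return stamp, match["person"], match["text"]


def load_chat(path):
    """Read an exported chat file into a list of non-empty lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


sample = "[02/07/2017, 5:47:33 pm] Person_1: Hey there! This is the first message"
print(parse_line(sample))
```

Lines that do not match the pattern, such as the continuation of a multi-line message, come back as `None` and can be inspected separately.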
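from collections import defaultdict
import nltk  # sent_tokenize needs the NLTK 'punkt' tokenizer data

ppl = defaultdict(list)
for line in content:  # content: list of lines read from the exported chat
    try:
        person = line.split(':')[2][7:]
        text = nltk.sent_tokenize(':'.join(line.split(':')[3:]))
        ppl[person].extend(text)  # If key exists (person), extend list with value (text),
                                  # if not create a new key, with value added to list
    except IndexError:  # the line has no "[date, time] Person: text" structure
        print(line)  # in case reading a line fails, examine why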
The resulting dictionary looks like this:
ppl = {'Person_1': ['This is message 1', 'Another message',
                    'Hi Person_2', ... , 'My last tokenised message in the chat'],
       'Person_2': ['Hello Person_1!', "How's it going?", 'Another message', ...]}
The tokenized messages are then classified by a Naive Bayes classifier trained on a training set of pre-categorized, chat-style conversations.
The trained model can be tested on a test set or even on user input. It classifies any tokenized sentence into categories such as greetings, statements, emotions, questions and so on.
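The training step can be sketched as follows with NLTK's `NaiveBayesClassifier`. This is only an illustration: the tiny hand-labelled training set is invented here, and the bag-of-words `extract_features` is an assumption about how the features might be built, not the article's actual training data or feature extractor.

```python
import re

import nltk


def extract_features(sentence):
    """Bag-of-words features: one boolean flag per lowercase word."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return {f"contains({word})": True for word in words}


# Tiny hand-labelled training set, invented for illustration only.
labelled = [
    ("hi", "Greet"),
    ("hello there", "Greet"),
    ("hey there", "Greet"),
    ("how are you", "Question"),
    ("what time is it", "Question"),
    ("i am at home", "Statement"),
    ("the meeting is tomorrow", "Statement"),
]

# NLTK expects a list of (feature_dict, label) pairs.
train_set = [(extract_features(text), label) for text, label in labelled]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(extract_features("Hi there!")))
```

A realistic model would need a far larger labelled corpus of chat messages. The seven sentences here only show the shape of the data the classifier expects.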
classifier.classify(extract_features('Hi there!'))
'Greet'
Now let's run the model on the WhatsApp data, count how often each category occurs in each person's tokenized messages, and plot the counts.
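The counting step can be sketched with `collections.Counter`. To keep the example self-contained, `toy_classify` is a hypothetical rule-based stand-in for the trained model, and `count_categories` is an illustrative helper name. In practice the trained classifier would be passed in instead.

```python
from collections import Counter


def toy_classify(sentence):
    """Placeholder for the trained model; NOT the article's classifier."""
    words = sentence.lower().split()
    if sentence.rstrip().endswith("?"):
        return "Question"
    if words and words[0].strip("!,.") in {"hi", "hello", "hey"}:
        return "Greet"
    return "Statement"


def count_categories(ppl, classify):
    """Map each person to a Counter of predicted message categories."""
    return {
        person: Counter(classify(sentence) for sentence in sentences)
        for person, sentences in ppl.items()
    }


ppl = {
    "Person_1": ["Hi Person_2", "This is message 1", "Another message"],
    "Person_2": ["Hello Person_1!", "How's it going?"],
}
counts = count_categories(ppl, toy_classify)
print(counts)
```

Passing `counts` to `pandas.DataFrame.from_dict(counts, orient="index").fillna(0)` would give a frame with one row per person, which is the shape the `df.T.plot(...)` call expects, assuming `df` is built this way.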
import matplotlib.pyplot as plt

# df: pandas DataFrame with one row per person and one column per category
ax = df.T.plot(kind='bar', figsize=(10, 7),
               legend=True, fontsize=16, color=['y', 'g'])
ax.set_title("Frequency of Message Categories", fontsize=18)
ax.set_xlabel("Message Category", fontsize=14)
ax.set_ylabel("Frequency", fontsize=14)
# plt.savefig('plots/cat_message')  # uncomment to save
plt.show()
Almost everyone uses emojis, not only on WhatsApp but on every other chat platform. Now let's see which emojis are used most often in the conversations.
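One way to collect and count the emojis is sketched below. The code point ranges in `EMOJI_RE` are a rough assumption rather than a complete definition of what counts as an emoji, and `emoji_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Rough emoji ranges (an assumption; not an exhaustive list of emoji code points).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def emoji_counts(messages):
    """Count every emoji character found in a list of messages."""
    found = []
    for message in messages:
        found.extend(EMOJI_RE.findall(message))
    return Counter(found)


# Escapes are used so the example survives encoding problems:
# \u2600 is a sun, \U0001F602 is "face with tears of joy", \U0001F62D is "crying face".
messages = ["Good morning \u2600", "LOL \U0001F602\U0001F602", "So sad \U0001F62D \U0001F602"]
counts = emoji_counts(messages)
print(counts.most_common(2))
```

Calling `most_common(10)` on each person's counter gives a top-ten ranking per person.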
Person_1's emojis: [emoji sequence unreadable; the characters were corrupted by a text-encoding error]
Most common (top 10, counts only): 77, 68, 16, 13, 11, 10, 8, 6, 6, 6
Person_2's emojis: [emoji sequence unreadable; the characters were corrupted by a text-encoding error]
Most common (top 10, counts only): 138, 103, 91, 42, 29, 29, 28, 27, 25, 24
Plotting sentiment against datetime is not as easy as it looks. Because there are many messages with different sentiments on the same day, the first step is to group the messages by date and calculate the mean sentiment for each day.
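The grouping step can be sketched with the standard library alone. The example assumes each message already has a numeric sentiment score (here made-up values between -1 and 1) and does not reproduce whichever sentiment tool was used to produce such scores. `daily_mean_sentiment` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_mean_sentiment(records):
    """records: iterable of (datetime, score); returns {date: mean score}, sorted by date."""
    by_day = defaultdict(list)
    for stamp, score in records:
        by_day[stamp.date()].append(score)
    return {day: mean(scores) for day, scores in sorted(by_day.items())}


# Made-up scores for illustration only.
records = [
    (datetime(2017, 7, 2, 17, 47), 0.5),
    (datetime(2017, 7, 2, 20, 10), -0.5),
    (datetime(2017, 7, 3, 9, 0), 0.25),
]
daily = daily_mean_sentiment(records)
print(daily)
```

With the data in a pandas DataFrame, the same result could be obtained with something like `df.groupby(df["datetime"].dt.date)["sentiment"].mean()`, where the column names are assumptions.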
Now let's look at the frequency of the WhatsApp chats, which is not really part of NLP but of time series analysis. We can use a time series to see how often messages are sent. First, we need to create a colour palette ordered by the total number of messages for each day.
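A minimal sketch of that ordering, assuming "day" means day of the week and that busier days should get darker colours. The hex shades and the helper names `messages_per_weekday` and `palette_by_count` are hypothetical. A plotting library's palette could be substituted for the list of shades.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def messages_per_weekday(stamps):
    """Count messages per weekday name."""
    return Counter(DAYS[stamp.weekday()] for stamp in stamps)


def palette_by_count(counts, shades):
    """Pair shades (ordered light to dark) with days ranked from least to most busy.

    Assumes there are at least as many shades as days.
    """
    ordered = sorted(counts, key=counts.get)  # least busy day first
    return dict(zip(ordered, shades))


stamps = [
    datetime(2017, 7, 2, 17, 47),  # Sunday
    datetime(2017, 7, 2, 20, 10),  # Sunday
    datetime(2017, 7, 3, 9, 0),    # Monday
]
counts = messages_per_weekday(stamps)
palette = palette_by_count(counts, ["#c7e9c0", "#238b45"])  # light green, dark green
print(counts, palette)
```

The resulting mapping can be used to colour a bar chart of `counts`, so that the busiest days stand out.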