How to create an AI that chats like you on WhatsApp

A quick warning before we get started

It’s always better not to run random scripts on personal information (like personal chat messages).

I guarantee there’s no catch, but you can always check the source code that’s being used: messaging-chat-parser and pistoBot. All the sources are open source on GitHub.

1. Get the data

First of all, we need to gather the data from our chat applications. We will now learn how to export data from two of the most commonly used instant messaging apps: WhatsApp and Telegram.

1.1 WhatsApp export

We have to export one .txt file for each chat we want to include in the final dataset. So, as described on the official WhatsApp website:

Note that only one-to-one (individual) chats are supported. We suggest exporting the chats with the highest number of messages, to build a bigger dataset and get better final results.

Now you should have several files, each with a structure that looks like the snippet below:

Take note of the text that appears in the placeholder position in your exported chats. This is your name in the WhatsApp app, and we will use this value later.
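To double-check which name WhatsApp used for you, a small script can list the senders found in an export. This is only a minimal sketch: it assumes each message line looks like "<date>, <time> - <sender>: <message>", which may not match your locale or app version, and the names ChatLine, parse_line and senders are made up for this illustration (they are not part of messaging-chat-parser).

```python
import re
from typing import Iterable, NamedTuple, Optional, Set


class ChatLine(NamedTuple):
    timestamp: str
    sender: str
    text: str


# Assumed layout: "<date>, <time> - <sender>: <message>".
# The timestamp must not contain a hyphen for this pattern to work.
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>[^-]+?) - (?P<sender>[^:]+): (?P<text>.*)$"
)


def parse_line(line: str) -> Optional[ChatLine]:
    """Return the parsed message, or None for system or continuation lines."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return ChatLine(match["timestamp"], match["sender"], match["text"])


def senders(lines: Iterable[str]) -> Set[str]:
    """Collect every distinct sender name found in the given lines."""
    return {parsed.sender for parsed in map(parse_line, lines) if parsed}
```

Calling senders() on the lines of one exported .txt file should return two names for an individual chat; one of them is the value to note down.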

1.2 Telegram

The process here is faster than for WhatsApp, because Telegram exports everything in a single .json file, without the limit of exporting one chat at a time.

So, as described on the official Telegram website:

Now you should have one file named telegram_dump.json with this structure:
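If you want to peek inside the dump before uploading it, the sketch below walks over the one-to-one chats and yields their plain-text messages. The key names used here ("chats", "list", "type", "personal_chat", "messages", "from", "text") are assumptions about how Telegram Desktop lays out its JSON export and may differ in your version; the function name iter_personal_messages is invented for this example.

```python
import json
from typing import Iterator, Tuple


def iter_personal_messages(dump: dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (chat_name, sender, text) for plain-text messages in 1-to-1 chats."""
    for chat in dump.get("chats", {}).get("list", []):
        if chat.get("type") != "personal_chat":
            continue  # skip groups, channels and anything else
        for message in chat.get("messages", []):
            text = message.get("text")
            # Formatted messages may be stored as a list of fragments;
            # this sketch keeps only non-empty plain strings.
            if message.get("type") == "message" and isinstance(text, str) and text:
                yield chat.get("name"), message.get("from"), text


def load_dump(path: str) -> dict:
    """Read the exported JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, list(iter_personal_messages(load_dump("telegram_dump.json")))[:5] would show the first few messages, provided the assumed layout holds.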

2. Parse the data

To train a GPT-2 neural network, we first need to pre-process the data to obtain a single .txt file with a machine-learning-compatible structure.
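To give an idea of what such a structure might look like, here is a rough sketch that flattens a conversation into one text block, marking who wrote each line. The "[me]" and "[others]" tags and the function name to_training_text are assumptions chosen for illustration; the actual parser may use a different layout.

```python
from typing import Iterable, Tuple


def to_training_text(
    messages: Iterable[Tuple[str, str]],
    my_name: str,
    me_tag: str = "[me]",          # hypothetical tag for your own messages
    other_tag: str = "[others]",   # hypothetical tag for everyone else
) -> str:
    """Turn (sender, text) pairs into one newline-separated training text."""
    lines = []
    for sender, text in messages:
        tag = me_tag if sender == my_name else other_tag
        lines.append(f"{tag} {text}")
    return "\n".join(lines) + "\n"
```

The point of the tags is that the model can learn which lines are yours, so that at chat time it can be asked to continue with a line written "as you".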

2.1 Google Colab

For the sake of simplicity and since the ML model we will use requires a GPU to work, we are going to use Google Colab for the next step.

If you don’t know what Google Colab is, check this other article.

2.2 Start the notebook

Open this Colab notebook and follow these steps:

2.3 Load the data

To work with the data, we need to upload it to Colab, into the right folders.

WhatsApp chats

Select all your .txt files and upload everything into the following notebook folder:

./messaging-chat-parser/data/chat_raw/whatsapp/

Telegram JSON

Get the file telegram_dump.json and upload it into the following notebook folder:

./messaging-chat-parser/data/chat_raw/telegram/
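If your exports already sit in some other folder of the Colab file system, the same placement can be scripted instead of done by hand. The sketch below is an optional alternative, not part of the notebook: the helper name stage_exports is invented, and it assumes WhatsApp exports end in .txt and the Telegram file is named telegram_dump.json, as described above.

```python
import shutil
from pathlib import Path
from typing import List

WHATSAPP_DIR = Path("messaging-chat-parser/data/chat_raw/whatsapp")
TELEGRAM_DIR = Path("messaging-chat-parser/data/chat_raw/telegram")


def stage_exports(source_dir: str, root: str = ".") -> List[Path]:
    """Copy chat exports from source_dir into the folders the parser reads."""
    copied = []
    for path in sorted(Path(source_dir).iterdir()):
        if path.suffix == ".txt":
            target_dir = Path(root) / WHATSAPP_DIR
        elif path.name == "telegram_dump.json":
            target_dir = Path(root) / TELEGRAM_DIR
        else:
            continue  # ignore anything that is not a chat export
        target_dir.mkdir(parents=True, exist_ok=True)
        copied.append(Path(shutil.copy(path, target_dir)))
    return copied
```

Run from the notebook's working directory, stage_exports("/content/my_exports") would copy the files into place; the source path here is just an example.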

2.4 Parse the data

Now, run all the cells up until the block “2️⃣ Parse the data”.

Here we need to set the variable “whatsapp_user_name” to your WhatsApp name, the one you noted down in section 1.1.

You can also change the date format used for parsing if some of your exported data uses a different format because of your phone's regional settings.
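To see why the date format matters, here is a minimal sketch using Python's standard datetime.strptime. The two format strings are assumptions about what a US-style and a European-style export might look like; they are not values taken from the notebook, and detect_format is a hypothetical helper.

```python
from datetime import datetime
from typing import Optional, Sequence

US_FORMAT = "%m/%d/%y, %I:%M %p"   # assumed, e.g. "12/31/20, 10:15 PM"
EU_FORMAT = "%d/%m/%y, %H:%M"      # assumed, e.g. "31/12/20, 22:15"


def parse_timestamp(raw: str, fmt: str) -> datetime:
    """Parse one timestamp; raises ValueError if the format does not match."""
    return datetime.strptime(raw, fmt)


def detect_format(
    raw: str, candidates: Sequence[str] = (US_FORMAT, EU_FORMAT)
) -> Optional[str]:
    """Return the first candidate format that parses raw, or None."""
    for fmt in candidates:
        try:
            datetime.strptime(raw, fmt)
            return fmt
        except ValueError:
            continue
    return None
```

Trying detect_format on the timestamp of the first line of one of your exports is a quick way to find out which pattern your files follow before setting the format in the notebook.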

So, for example, if my name is “Bob” and I’m from America, the code I should use is the following:

3. Train a GPT-2 model

Now execute the cell under the “3️⃣ Train a GTP2 model” notebook chapter. It will start a new training run using the data you provided.

A progress bar will be shown. The training can take up to 10 hours, depending mostly on which GPU type Colab is running and how many messages were provided.

Wait until the process ends.

4. Chat with the model

After the training is completed, run all the remaining notebook cells: the last one will show a text block with a ✍ symbol on the left.

You can use this text box to insert the messages you want to “send” to the ML model. Write your message and then press Enter.

4.1 How to read the results

After the first message is sent, the system will print some information about the conversation.

You will now see the most interesting results as a list of messages:

After the reply is generated, you can continue to chat for a total of 5 messages. After that, you can re-run the cell to start a new conversation with the model.

5. Conclusion

So in this guide we have seen how simple it is to train your own GPT-2 model from scratch. The task is simple (but not trivial!) only thanks to the aitextgen package that runs under the pistoBot hood.

Note that if your chat messages are in English, you could easily obtain better results than the ones we got with this standard approach, since you could use transfer learning from a pretrained GPT-2 model.

The pistoBot repository allows you to train (or fine-tune) different models, including the option to start from a pretrained GPT-2 model: check the repository folder for more information.

We have chosen the standard, untrained GPT-2 model so that even non-English users can use this AI.

This article was written by Simone Guardati and originally published on Towards Data Science. You can read it here.

Story by Simone Guardati