Using Natural Language Processing to prevent suicide
There is no doubt that you, I, and the majority of teenagers around the globe use some sort of social media. I mean, why wouldn’t you?
It’s a good place to connect, and also a good place for teenagers to be someone they’re not.
Most content is harmless: the pictures of people and their friends, the political tweets and, my personal favourite, the reviews of terrible movies.
However, that isn’t always the case. Sometimes people post more personal content that shows signs of something going wrong in their life. It’s a call for help: these posts may contain words that convey fear, loneliness or hopelessness, yet we don’t even notice.
Don’t believe me?
This is just one example. You have probably encountered a tweet like this before but didn’t notice, or thought it was a joke.
Currently, social media platforms like Twitter have forms where you can send in a report if you see some sort of suicidal ideation.
The problem is that if someone only sees the tweet a couple of hours after it was posted, there is no guarantee that the person behind it is still safe.
Time is of the essence when it comes to suicide, and one minute could change someone’s life.
Twitter states “Many people use Twitter to express unique points of view and talk openly about concerns. As we grapple with the weight and reality of an unprecedented public health crisis, it is our job to ensure that Twitter remains a safe space for anyone interested in mental health tips and resources or opening up about their individual mental health concerns.”
However, the current methods are unreliable and time-costly.
I built an NLP model that classifies tweets as suicidal or not suicidal with 88% accuracy.
I am not going to go too in depth about Machine Learning (ML), but if you want to learn more, check out my series #AIwithAlisha where I cover all the topics in Machine Learning.
The way I like to look at ML is really just the science of having machines learn. ML is a subsection of AI: a system that can take in data, classify it, and make new predictions.
ML has many different sub-categories, such as reinforcement learning, supervised learning, and unsupervised learning.
If you wanted to predict the likelihood of an outcome from a certain piece of data, Machine Learning would be the tool, since ML uses algorithms to train a model to do a specific task without additional human intervention.
But let’s go a little deeper into Natural Language Processing (NLP) which is a subsection of Machine Learning.
Natural Language Processing
The way I like to look at NLP is as a field of AI that gives machines the ability to read, understand and derive meaning from human language.
Think about it like this: every line in a book or every tweet you read contains information that can be extracted. It’s easy to derive meaning from a sentence when there is only one of them. But imagine having to look at millions of tweets or text messages; it just isn’t manageable.
This is unstructured data (data generated from conversations, declarations or even tweets).
When building algorithms we usually have the traditional rows-and-columns structure of relational databases. All in all, “neat” data. Unstructured data, however, is messy and very difficult to manipulate.
NLP is not just about interpreting text or speech based on its keywords, but actually understanding the meaning behind those words. Using NLP we can perform sentiment analyses and even detect figures of speech like exaggeration.
A couple of days ago I was on a call with a friend and she was talking about how she hated her life. As humans we could tell from her tone that she was exaggerating, but can a machine really ever know?
That’s where NLP comes in!
So let’s go over some of the techniques used in NLP to build a sentiment analysis.
Bag of Words
Bag of words is a really commonly used model that counts all the words in a piece of text. It basically creates an occurrence matrix for the sentence; these word frequencies are then used as features for training a classifier.
As an example, I am using the song “Can’t Stop the Feeling” and taking two phrases from it:
I feel that hot blood in my body when it drops
I can’t take my eyes up off it, moving so phenomenally
Now let’s count the words:
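That counting step can be sketched with Sklearn’s CountVectorizer (a minimal sketch of the idea; note that, by default, it drops single-character tokens like “I” and splits “can’t” at the apostrophe):

```python
from sklearn.feature_extraction.text import CountVectorizer

phrases = [
    "I feel that hot blood in my body when it drops",
    "I can't take my eyes up off it, moving so phenomenally",
]

# Build the occurrence matrix: one row per phrase, one column per word
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(phrases)

# Print each word with its count in phrase 1 and phrase 2
for word, idx in sorted(vectorizer.vocabulary_.items()):
    print(word, counts[:, idx].toarray().ravel())
```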
However, there are several downsides: the model ignores the “semantics” and meaning of the words, and some words might not be weighted sensibly (“blood” carries no more weight than “can’t”).
To solve this problem, a scoring approach called “Term Frequency — Inverse Document Frequency” (TF-IDF) improves the bag of words by adding weights. With TF-IDF, terms that are frequent within a text are “rewarded”, but they get “punished” if they are also frequent in the other texts we include in the algorithm. Conversely, this method highlights and “rewards” terms that are unique or rare across all the texts. Nevertheless, this approach still captures no context or semantics.
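To make the weighting concrete, here is a small sketch using Sklearn’s TfidfVectorizer on the same two phrases: a word that appears in both (“it”) gets a lower inverse-document-frequency weight than a word that appears in only one (“blood”).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

phrases = [
    "I feel that hot blood in my body when it drops",
    "I can't take my eyes up off it, moving so phenomenally",
]

tfidf = TfidfVectorizer()
tfidf.fit(phrases)

vocab = tfidf.vocabulary_
# "it" appears in both phrases, "blood" only in the first,
# so "blood" receives the higher IDF weight
print("idf(it)    =", tfidf.idf_[vocab["it"]])
print("idf(blood) =", tfidf.idf_[vocab["blood"]])
```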
Another model used for building a stronger sentiment analysis is topic modelling.
Topic modelling uses unsupervised learning to extract the main topics from a collection of documents. A common topic-modelling technique is LDA, also known as Latent Dirichlet Allocation, which is used to assign the text in a document to a particular topic.
To learn more about LDA, I would refer you to this article: https://towardsdatascience.com/linear-discriminant-analysis-explained-f88be6c1e00b
Topic modelling can only be used once the text has been converted to a bag of words. After that has been done, we specify how many topics there are in the dataset, and the model is built!
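Those two steps (bag of words first, then choosing the number of topics) can be sketched like this; the documents and topic count below are stand-ins for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA

docs = [
    "I feel so hopeless and alone, I want it all to end",
    "what a great day, the movie was so good",
    "nobody cares about me, everything feels worthless",
    "congrats on the win, see you at the game",
]

# Step 1: convert the text to a bag of words
count_vectorizer = CountVectorizer(stop_words="english")
count_data = count_vectorizer.fit_transform(docs)

# Step 2: specify the number of topics and fit the model
lda = LDA(n_components=2, random_state=0)
lda.fit(count_data)

print(lda.components_.shape)  # (number of topics, vocabulary size)
```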
Sentiment Analysis to prevent Suicide
Sentiment analysis is simply the classification of emotions based on text. While it’s more commonly used to classify things like product reviews for companies, I used it to identify depressed and suicidal users on social media to possibly prevent self-harm or suicide.
I built a basic classifier that, after analyzing a tweet, will label it as (1) suicidal or (0) not suicidal.
If we look at these two tweets, the classifier will really just say which tweet is at risk and which is not.
At Risk Tweet
In order to build this model I used two of the NLP techniques above: topic modelling and bag of words!
The first step is to take our dataset and understand the frequency of each word (bag of words).
As you can see below, Sklearn (a Machine Learning library) is used, because it gives us ready-made tools for computing and weighting word frequencies.
We then import TF-IDF (the scoring approach that improves the bag of words by adding weights) in order to weight the frequencies of each word.
Term Frequency — Inverse Document Frequency
# TFIDF Vector
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

X = df['tweet']
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
X = tfidf.fit_transform(X)
y = df['intention']
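For completeness, here is a hedged end-to-end sketch of featurizing, training and scoring such a classifier. The tiny DataFrame is my own stand-in for the real dataset of over 100,000 labelled tweets, and accuracy_score is my addition; LinearSVC is the linear SVM whose import appears in the snippet above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Toy stand-in for the real dataset of labelled tweets
df = pd.DataFrame({
    "tweet": [
        "I just want my life to end already",
        "I feel so worthless and alone",
        "nobody would miss me if I was gone",
        "I can't do this anymore, I want to die",
        "congratulations, you have done it",
        "what a good day with my friends",
        "this movie review made me laugh",
        "alright, see you at the game tonight",
    ],
    "intention": [1, 1, 1, 1, 0, 0, 0, 0],  # 1 = at risk, 0 = normal
})

# TF-IDF over unigrams and bigrams, as in the snippet above
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
X = tfidf.fit_transform(df["tweet"])
y = df["intention"]

# Hold out a test set, then fit a linear SVM on the TF-IDF features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
clf = LinearSVC()
clf.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```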
For topic modelling we conduct an LDA analysis (the model used to assign text to topics); as you can see in the code, we load the LDA model from Sklearn.
We then let the LDA model determine the main topics in our data.
Topic Modelling LDA Analysis
import warnings
warnings.simplefilter("ignore", DeprecationWarning)

# Load the LDA model from sk-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA

# Helper function: print the top words for each topic
def print_topics(model, count_vectorizer, n_top_words):
    words = count_vectorizer.get_feature_names()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))

# Tweak the two parameters below
number_topics = 5
number_words = 10

# Create and fit the LDA model (count_data is the bag-of-words matrix
# produced by a CountVectorizer over the tweets)
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)

# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)
The output from the topic modelling code should look something like this! There are two topics: possibly suicidal and normal. Based on my dataset of over 100,000 tweets, these are the words the model distinguished as suicidal and not suicidal. Here are a few examples!
Topic 1: Possibly Suicidal
Words: "kill", "die", "death", "worthless", "murder", "self-murder", "depressed", "lonely"
Topic 2: Normal
Words: "american", "alright", "covid-19", "good", "lover"
Now that we have trained the model, let’s test it!
Let’s test the model!
At risk tweet
X = 'I just want my life to end already'
vec = tfidf.transform([X])
print(clf.predict(vec))  # clf: the LinearSVC classifier fit earlier on the TF-IDF features
(1) represents that the tweet is suicidal
X = 'congratulations, you have done it'
vec = tfidf.transform([X])
print(clf.predict(vec))
(0) represents that the tweet is normal
Training and testing our model
After a lot of tweaking and errors, I got the model to an 88% accuracy rate, which means that this model could correctly flag at-risk users 88% of the time!
Now that we can identify the problem, the next step is to think about what the response should be. If a suicidal tweet is identified, should 911 be called? We also need to consider questions like: does letting people know that their tweets will be analyzed stop them from posting in the first place? And is it a bad thing if someone no longer posts a tweet about wanting to die by suicide? Overall, there are many ethical concerns, so if you want to chat about them, shoot me a message!
I also want to keep building this model, adding more NLP techniques to improve its accuracy and, hopefully, save people’s lives!!