By some estimates, more than 99% of the world's data exists in unstructured formats such as text, images, audio, and video. Among these, text data is commonly encountered across domains like social media, customer reviews, emails, and documents. Raw text is often messy, making it unsuitable for direct use in machine learning models or data analysis. In this blog, we will perform text data preprocessing in Python and transform unstructured text into meaningful, structured data.
Implementation of Text Data Preprocessing in Python
1. Data Collection
First, download the Elon Musk tweet dataset from Kaggle. The dataset contains tweets related to Elon Musk's companies, with 36 columns and more than 10,000 rows.
Once the dataset is downloaded, upload the CSV file to the Google Colab environment.
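If you are working in Colab, one quick way to do this is the built-in file-upload helper shown below. This is a minimal sketch that only runs inside Colab; the filename "Tesla.csv" is simply the name used later in this post, so adjust it to whatever your download is called.
# Minimal sketch for uploading the CSV inside Google Colab (Colab-only)
from google.colab import files
uploaded = files.upload()  # opens a file picker; select the Kaggle CSV (e.g., Tesla.csv)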
2. Install and import the necessary libraries
The second step is to install the required libraries using the command below.
!pip install -q wordcloud emoji
- wordcloud helps visualize textual data.
- emoji helps detect and handle emojis present in the text (a small detection sketch follows this list).
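As a quick illustration, recent versions of the emoji package can count emojis and convert them to their text aliases. This is a minimal sketch with a made-up sample string; the exact alias names may vary slightly across package versions.
# Count and name the emojis in a piece of text
import emoji

sample = "Launching soon 🚀😎"
print(emoji.emoji_count(sample))   # 2
print(emoji.demojize(sample))      # e.g. "Launching soon :rocket::smiling_face_with_sunglasses:"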
Once the libraries are installed, the next step is to import them.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import emoji
from wordcloud import WordCloud, STOPWORDS
import re
import nltk
from nltk.corpus import stopwords #corpora
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import warnings
warnings.filterwarnings('ignore')
- pandas is used for loading datasets and performing data-cleaning operations.
- matplotlib and seaborn are used for data visualization.
- numpy is used for numerical computation.
- emoji is used for detecting emojis in the text.
- WordCloud is used for visualizing text data (a short usage sketch follows this list).
- stopwords are the most common words in a document, such as "is", "am", "are".
- re is the regular-expressions module, used to define patterns for extracting or removing parts of the text.
- nltk provides the text-preprocessing utilities.
- WordNetLemmatizer is used to apply lemmatization (converting words to their root form). For example, "reduced" becomes "reduce".
- word_tokenize is used for tokenization.
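As an example, here is a minimal sketch of how WordCloud can be used to visualize text. The sample string is made up; once the clean_tweet column is built later in this post, you could pass " ".join(df['clean_tweet']) instead.
# Minimal word cloud from a sample string
sample_text = "tesla spacex rocket launch tesla model electric car rocket"
wc = WordCloud(width=800, height=400, background_color="white",
               stopwords=STOPWORDS).generate(sample_text)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()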
After that, download a few resources from nltk: punkt and punkt_tab are tokenizer models, and wordnet is the lexical database used by the lemmatizer.
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
3. Load the dataset and perform basic data exploration
The next step is to load the dataset using the pandas library.
df=pd.read_csv("Tesla.csv")
df.head()
There are 36 columns in the dataset; only two of them are needed: created_at and tweet.
df=df[['created_at','tweet']]
df.head()

The next step is to check the shape of the dataset and basic information about its columns.
print(df.shape) # (10016, 2)
print(df.info())

The created_at column is of float type and tweet is of object type.
The next step is to check for missing values and duplicate rows. This dataset contains neither.
print(df.isna().sum())
print(df.duplicated().sum())
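If the checks above had reported non-zero counts, a minimal cleanup sketch would look like this (not required for this dataset):
# Only needed when missing or duplicate rows are actually present
df = df.drop_duplicates()           # remove exact duplicate rows
df = df.dropna(subset=['tweet'])    # drop rows whose tweet text is missing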
The next step is to define the contractions dictionary, which maps each contraction to its expanded form. Common slang abbreviations and their full forms are included in the same dictionary.
# A list of contractions from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
# i don't have car
# i do not have car
contractions = {"ain't": "am not", "aren't": "are not", "can't": "cannot",
"can't've": "cannot have", "'cause": "because", "could've": "could have",
"couldn't": "could not", "couldn't've": "could not have", "didn't": "did not",
"doesn't": "does not", "don't": "do not", "hadn't": "had not", "hadn't've": "had not have",
"hasn't": "has not", "haven't": "have not", "he'd": "he would", "he'd've": "he would have",
"he'll": "he will", "he's": "he is", "how'd": "how did", "how'll": "how will",
"how's": "how is", "i'd": "i would", "i'll": "i will", "i'm": "i am", "i've": "i have",
"isn't": "is not", "it'd": "it would", "it'll": "it will", "it's": "it is",
"let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have",
"mightn't": "might not", "must've": "must have", "mustn't": "must not",
"needn't": "need not", "oughtn't": "ought not", "shan't": "shall not",
"sha'n't": "shall not", "she'd": "she would", "she'll": "she will", "she's": "she is",
"should've": "should have", "shouldn't": "should not", "that'd": "that would",
"that's": "that is", "there'd": "there had", "there's": "there is", "they'd": "they would",
"they'll": "they will", "they're": "they are", "they've": "they have", "wasn't": "was not",
"we'd": "we would", "we'll": "we will", "we're": "we are", "we've": "we have",
"weren't": "were not", "what'll": "what will", "what're": "what are", "what's": "what is",
"what've": "what have", "where'd": "where did", "where's": "where is", "who'll": "who will",
"who's": "who is", "won't": "will not", "wouldn't": "would not", "you'd": "you would",
"you'll": "you will", "you're": "you are", "wfh": "work from home", "wfo": "work from office",
"idk": "i do not know", "brb": "be right back", "btw": "by the way", "tbh": "to be honest",
"omw": "on my way", "lmk": "let me know", "fyi": "for your information",
"imo": "in my opinion", "smh": "shaking my head", "nvm": "never mind",
"ikr": "i know right", "fr": "for real", "rn": "right now", "gg": "good game",
"dm": "direct message", "afaik": "as far as i know", "bff": "best friends forever",
"ftw": "for the win", "hmu": "hit me up", "ggwp": "good game well played"}
4. Text Data Cleaning
In this section, the cleaning steps are first applied to a sample text.
text= "I don't have a supercar! 😎 Visit https://cars.com now!"
Step 1: Convert the text to lowercase.
text = text.lower()
text
# i don't have a supercar! 😎 visit https://cars.com now!
Step 2: Replace contractions and slang words with their full forms. First, split the text into words using the split function.
text = text.split()
text

Next, an empty list is defined. A for loop iterates through the words; whenever a contraction is found, it is replaced with its full form.
new_text = []
for word in text:
    if word in contractions:
        new_text.append(contractions[word])
    else:
        new_text.append(word)
new_text

Now the split words are joined back into a single string using the join function.
text = " ".join(new_text)
text
# i do not have a supercar! 😎 visit https://cars.com now!
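The same expansion can also be written more compactly with dict.get, which falls back to the original word when no contraction is found. This is an equivalent one-line sketch, not the version used in the rest of the post.
# Equivalent one-liner: look each word up in the contractions dict, keep it unchanged otherwise
text = " ".join(contractions.get(word, word) for word in text.split())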
Step 3: Remove URLs from the text.
text = re.sub(r'https?:\/\/.*[\r\n]*', '', text)
text
# i do not have a supercar! 😎 visit
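Note that .* in this pattern is greedy, so everything after the URL on the same line (here the word "now!") is removed as well. If you prefer to drop only the URL itself, a narrower pattern such as the one below can be used; this is an optional variation, not the pattern used later in this post.
# Remove only the URL token, keeping any words that follow it on the same line
text = re.sub(r'https?://\S+', '', text)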
Step 4: Remove usernames from the text (e.g., @user).
text = re.sub(r'@[A-Za-z0-9]+', '', text)
text
# i do not have a supercar! 😎 visit
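One caveat: Twitter handles may also contain underscores, which the pattern above would leave behind. A slightly broader pattern (an optional variation, not the one used later) is shown below.
# Also match underscores inside handles, e.g. @elon_fan_123
text = re.sub(r'@[A-Za-z0-9_]+', '', text)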
Step 5: Remove HTML remnants and symbols (such as <br /> tags, anchor tags, and other punctuation and symbols), replacing them with spaces or empty strings.
text = re.sub(r'\<a href', ' ', text)
text = re.sub(r'&', '', text)
text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
text = re.sub(r'<br />', ' ', text)
text = re.sub(r'\'', ' ', text)
Step 6: Remove emojis. Emojis are represented as Unicode characters, so a pattern covering the emoji Unicode ranges is defined first and then every match is replaced with an empty string.
emoji_pattern = re.compile("[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF" \
"\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF" \
"\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF" \
"\U00002702-\U000027B0\U000024C2-\U0001F251]+", flags=re.UNICODE)
text = emoji_pattern.sub(r'', text)
text
# i do not have a supercar visit
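Since the emoji package is already imported, recent versions of it (2.x and later) offer a simpler alternative to hand-written Unicode ranges. This is an optional sketch, not what the rest of the post uses.
# Requires emoji >= 2.0: strip every emoji without listing Unicode ranges manually
text = emoji.replace_emoji(text, replace='')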
Step 7: Tokenize the text using the word_tokenize function.
words = word_tokenize(text)
words
# ['i', 'do', 'not', 'have', 'a', 'supercar', 'visit']
Step 8: Remove stopwords from the tokenized words. First the set of English stopwords is defined, then any matching words are filtered out.
stop_words = set(stopwords.words("english"))
words = [word for word in words if word not in stop_words]
words
# ['supercar', 'visit']
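Note that stopword removal also dropped the negation words "do" and "not", which can matter for tasks like sentiment analysis. If you want to keep negations, one option (a hedged variation, not used in the rest of this post) is to subtract them from the stopword set.
# Keep common negation words out of the stopword set
stop_words = set(stopwords.words("english")) - {"not", "no", "nor"}
words = [word for word in words if word not in stop_words]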
Step 9: Apply lemmatization to convert words to their root form (e.g., "reduced" to "reduce"). First a WordNetLemmatizer is created, then it is applied to each word.
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(word) for word in words]
words
# ['supercar', 'visit']
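One detail worth knowing: lemmatize treats words as nouns by default, so verb forms like "reduced" only become "reduce" when you pass pos="v". A small illustration:
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("reduced"))            # 'reduced' (default POS is noun)
print(lemmatizer.lemmatize("reduced", pos="v"))   # 'reduce'  (treated as a verb)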
Now steps 1 to 9 are combined into a single function, which is then applied to the tweet column.
def clean_text(text, remove_stopwords=True):
    text = text.lower()
    # Expand contractions and slang, e.g. "i don't have supercar" -> "i do not have supercar"
    text = text.split()
    new_text = []
    for word in text:
        if word in contractions:
            new_text.append(contractions[word])
        else:
            new_text.append(word)
    text = " ".join(new_text)
    # Remove URLs
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    # Remove usernames
    text = re.sub(r'@[A-Za-z0-9]+', '', text)
    # Remove HTML remnants and symbols
    text = re.sub(r'\<a href', ' ', text)
    text = re.sub(r'&', '', text)
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'\'', ' ', text)
    # Remove emojis
    emoji_pattern = re.compile("[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF"
                               "\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF"
                               "\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF"
                               "\U00002702-\U000027B0\U000024C2-\U0001F251]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    # Tokenize text
    words = word_tokenize(text)
    # Remove stopwords if needed
    if remove_stopwords:
        stop_words = set(stopwords.words("english"))
        words = [word for word in words if word not in stop_words]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return " ".join(words)
df['clean_tweet'] = df['tweet'].apply(lambda x:clean_text(x))
df.head()
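A quick way to sanity-check the function is to compare a few raw tweets with their cleaned versions (the random_state value below is arbitrary):
# Spot-check: raw tweet vs. cleaned tweet for a few random rows
print(df[['tweet', 'clean_tweet']].sample(3, random_state=42))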

Finally, the created_at column needs processing as well. Since its float values are Unix timestamps in milliseconds, it is converted to pandas datetime format with unit='ms'.
from datetime import datetime
df['created_at'] = pd.to_datetime(df['created_at'], unit='ms')
df.head()
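With the timestamps parsed, you can optionally derive simple time-based features from them; the column names below are just illustrative.
# Example derived features from the parsed timestamps
df['tweet_date'] = df['created_at'].dt.date
df['tweet_hour'] = df['created_at'].dt.hour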

Conclusion
Overall, in this blog we covered a complete text preprocessing workflow in Python, from cleaning and normalization to tokenization, stopword removal, and lemmatization, as well as handling date-time information in the created_at column.