Lesson 2
Mastering Text Cleaning for NLP: Techniques and Applications
Introduction

Welcome to today's lesson on Text Cleaning Techniques! In any Natural Language Processing (NLP) project, the quality of your results depends heavily on the quality of your input, so cleaning your textual data is critical to the project's accuracy. Our main objective today is to learn how to clean textual data using Python. By the end of this session, you will be comfortable creating and applying a simple text cleaning pipeline in Python.

Understanding Text Cleaning

Text cleaning is an essential step in NLP: it prepares raw text data for analysis. Why is it necessary? Imagine trying to perform text classification on social media posts; people often use colloquial language, abbreviations, and emojis, and posts might also be in different languages. These variations make it challenging for machines to understand context without preprocessing.

We get rid of superfluous variations and distractions to make the text understandable to algorithms, thereby increasing accuracy. These distractions range from punctuation, special symbols, and numbers to common words that do not carry significant meaning (commonly referred to as "stop words").
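
Stop-word removal is not part of the pipeline we build below, but to make the idea concrete, here is a minimal sketch using a small, hand-picked stop-word list (the list and helper name are illustrative only; in practice you would use a fuller list, such as the one shipped with NLTK):

Python
# A tiny, hand-picked stop-word list used purely for illustration
stop_words = {'the', 'is', 'a', 'an', 'and', 'of', 'to'}

def remove_stop_words(text):
    # Keep only the words that are not in the stop-word set
    return ' '.join(word for word in text.split() if word not in stop_words)

print(remove_stop_words('the cat is on a mat'))  # prints: cat on mat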

Python's regular expression (regex) library, re, is an ideal tool for such text cleaning tasks, as it is specifically designed to work with string patterns. Within this library, we will be using re.sub, a method for replacing parts of a string. Its basic form takes three arguments: re.sub(pattern, repl, string). Here, pattern is the character pattern we are looking to replace, repl is the replacement string, and string is the text being processed. In essence, any part of the string argument that matches the pattern argument gets replaced by the repl argument.
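
As a quick illustration (the pattern and sample string here are made up for demonstration), the call below replaces every digit in a string with a hash sign:

Python
import re

# Replace every digit with '#'
masked = re.sub(r'\d', '#', 'Order 66 shipped on 2024-01-15')
print(masked)  # prints: Order ## shipped on ####-##-##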

As we proceed, you will gain a clearer understanding of how re.sub works and where to apply it. Now, let's dive in!

Text Cleaning Process

The text cleaning process comprises multiple steps, each of which aims to reduce the complexity of the text. Let's walk through the process using a Python function, clean_text.

Python
import re

def clean_text(text):
    text = text.lower()                      # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)   # Remove email addresses
    text = re.sub(r'http\S+', '', text)      # Remove URLs
    text = re.sub(r'\W', ' ', text)          # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)          # Remove digits
    text = re.sub(r'\s\s+', ' ', text)       # Remove extra spaces

    return text

In the function above, we can see how each line corresponds to a step in the cleaning process (a step-by-step walkthrough follows the list):

  1. Lowercase: We convert all text to lower case so that capitalization alone doesn't make identical words look different. This way, words like 'The' and 'the' are no longer seen as different.
  2. Email addresses: Email addresses don't usually provide useful information unless we're specifically looking for them. This line of code removes any email addresses found.
  3. URLs: Similarly, URLs are removed as they are typically not useful in text classification tasks.
  4. Special Characters: We remove any non-word characters (\W) and replace them with a space using regex. This includes special characters and punctuation.
  5. Numbers: We're dealing with text data, so numbers are also considered distractions unless they carry significant meaning.
  6. Extra spaces: Any resulting extra spaces from the previous steps are removed.
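
To make the effect of each substitution visible, here is an illustrative walkthrough that applies the same steps one at a time to a made-up sample sentence (the email address and URL below are placeholders, not real):

Python
import re

sample = 'Contact me at demo@mail.com or visit http://example.com, room 42!'

step1 = sample.lower()                    # lower case
step2 = re.sub(r'\S*@\S*\s?', '', step1)  # remove the email address
step3 = re.sub(r'http\S+', '', step2)     # remove the URL
step4 = re.sub(r'\W', ' ', step3)         # punctuation and symbols -> spaces
step5 = re.sub(r'\d', ' ', step4)         # digits -> spaces
step6 = re.sub(r'\s\s+', ' ', step5)      # collapse repeated spaces

for step in (step1, step2, step3, step4, step5, step6):
    print(repr(step))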

Let's go ahead and run this function on some demo input to see it in action!

Python
print(clean_text('Check out the course at www.codesignal.com/course123'))

The output of the above code will be:

Plain text
check out the course at www codesignal com course
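
Notice that the www-style link is not fully removed: its dots are merely replaced with spaces, because the URL pattern http\S+ only matches links that start with http or https. If your data contains bare www links, you could broaden the pattern; the variant below is one possible (illustrative) adjustment, not part of the lesson's original function:

Python
import re

# Also match links that start with 'www.' (illustrative broader pattern)
text = 'Check out the course at www.codesignal.com/course123'
print(re.sub(r'(https?://|www\.)\S+', '', text))  # prints: Check out the course at
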
Implementing Text Cleaning Function

Now that you are familiar with how the function works, let's apply it to the 20 Newsgroups dataset.

To apply our cleaning function on the dataset, we will make use of the DataFrame data structure from Pandas, another powerful data manipulation tool in Python.

Python
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Fetch the 20 Newsgroups dataset
newsgroups_data = fetch_20newsgroups(subset='train')
nlp_df = pd.DataFrame(newsgroups_data.data, columns=['text'])

# Apply the cleaning function to the text data
nlp_df['text'] = nlp_df['text'].apply(lambda x: clean_text(x))

# Check the cleaned text
print(nlp_df.head())

The output of the above code will be:

Plain text
                                                text
0  from where s my thing subject what car is this...
1  from guy kuo subject si clock poll final call ...
2  from thomas e willis subject pb questions orga...
3  from joe green subject re weitek p organizatio...
4  from jonathan mcdowell subject re shuttle laun...

In this code, we apply the clean_text function to each entry of the 'text' column in our DataFrame using the apply function. The apply function passes every value of the column to clean_text one by one.
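
If apply is new to you, here is a tiny standalone illustration on a made-up DataFrame (the column values are invented for this example):

Python
import pandas as pd

df = pd.DataFrame({'text': ['Hello WORLD!', 'Another ROW']})
# apply calls the given function once for every value in the column
df['text'] = df['text'].apply(lambda x: x.lower())
print(df)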

Understanding the Effectiveness of the Text Cleaning Function

To understand the impact of our text cleaning function, let's look at some text before and after cleaning, using a few new examples:

Python
test_texts = ['This is an EXAMPLE!', 'Another ex:ample123 with special characters $#@!', 'example@mail.com is an email address.']
for text in test_texts:
    print(f'Original: {text}')
    print(f'Cleaned: {clean_text(text)}')
    print('--')

The output of the above code will be:

Plain text
Original: This is an EXAMPLE!
Cleaned: this is an example
--
Original: Another ex:ample123 with special characters $#@!
Cleaned: another ex ample with special characters
--
Original: example@mail.com is an email address.
Cleaned: is an email address
--

In the example above, you will see that our function successfully transforms all text to lower case and removes punctuation, digits, and email addresses!

Lesson Summary and Practice Exercises

Today we delved into the text cleaning process in Natural Language Processing. We discussed why it is necessary and how to implement it in Python, and we then applied our text cleaning function to a textual dataset.

We have a few exercises lined up based on what we learned today. Keep swimming ahead, and remember, you learn the most by doing. Happy cleaning!
