Welcome to today's lesson on Text Cleaning Techniques! In any Natural Language Processing (NLP) project, the quality of your results depends heavily on the quality of your input, so cleaning our textual data is critical to the accuracy of our project. Our main objective today is to learn how to clean textual data using Python. By the end of this session, you will be comfortable creating and applying a simple text cleaning pipeline in Python.
Text cleaning is an essential step in NLP: it prepares raw text data for analysis. Why is it necessary? Imagine trying to perform text classification on social media posts; people often use colloquial language, abbreviations, and emojis, and in many cases posts might also be in different languages. These variations make it challenging for machines to understand context without preprocessing.
We get rid of superfluous variations and distractions to make the text understandable for algorithms, thereby increasing accuracy. These distractions range from punctuation, special symbols, and numbers to common words that do not carry significant meaning (commonly referred to as "stop words").
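As a quick taste of that last idea, here is a minimal sketch of stop-word removal. The stop-word list and sample sentence are just illustrations; real projects typically use a curated list such as NLTK's stopwords corpus.

```python
# A tiny, hand-picked stop-word list used purely for illustration
STOP_WORDS = {'the', 'a', 'an', 'is', 'to', 'and', 'of'}

def remove_stop_words(text):
    # Keep only the words that are not in the stop-word set
    return ' '.join(word for word in text.split() if word not in STOP_WORDS)

print(remove_stop_words('the quick brown fox is an example'))
# Output: quick brown fox example
```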
Python's Regex (Regular Expression) library, `re`, is an ideal tool for such text cleaning tasks, as it is specifically designed to work with string patterns. Within this library, we will be using `re.sub`, a method employed to replace parts of a string. This method operates by accepting three arguments: `re.sub(pattern, repl, string)`. Here, `pattern` is the character pattern we're looking to replace, `repl` is the replacement string, and `string` is the text being processed. In essence, any part of the `string` argument that matches the `pattern` argument gets replaced by the `repl` argument.
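For a quick illustration before we build the full pipeline, here is `re.sub` replacing every digit in a made-up string with a `#` character:

```python
import re

# Replace every digit in the string with '#'
result = re.sub(r'\d', '#', 'Order 66 shipped on 2024-01-15')
print(result)  # Order ## shipped on ####-##-##
```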
As we proceed, you'll gain a clearer understanding of how `re.sub` works in practice. Now, let's delve into it!
The text cleaning process comprises multiple steps, each of which aims to reduce the complexity of the text. Let's walk through the process using a Python function, `clean_text`.
```python
import re

def clean_text(text):
    text = text.lower()                     # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)     # Remove URLs
    text = re.sub(r'\W', ' ', text)         # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)         # Remove digits
    text = re.sub(r'\s\s+', ' ', text)      # Remove extra spaces

    return text
```
In the function above, we can see how each line corresponds to a step in the cleaning process (each step is also demonstrated one at a time in the sketch after this list):
- Lowercase: We convert all text to lower case so that identical words are treated the same; this way, words like 'The' and 'the' are no longer seen as different.
- Email addresses: Email addresses don't usually provide useful information unless we're specifically looking for them. This line of code removes any email addresses found.
- URLs: Similarly, URLs are removed as they are typically not useful in text classification tasks.
- Special Characters: We remove any non-word characters (`\W`) and replace them with a space using regex. This includes special characters and punctuation.
- Numbers: We're dealing with text data, so numbers are also considered distractions unless they carry significant meaning.
- Extra spaces: Any resulting extra spaces from the previous steps are removed.
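To make each step concrete, the sketch below applies the same substitutions one at a time and prints the text after each step; the sample sentence is just a made-up example:

```python
import re

# Each cleaning step from clean_text, applied one at a time
steps = [
    ('Lowercase',     str.lower),
    ('Emails',        lambda t: re.sub(r'\S*@\S*\s?', '', t)),
    ('URLs',          lambda t: re.sub(r'http\S+', '', t)),
    ('Special chars', lambda t: re.sub(r'\W', ' ', t)),
    ('Digits',        lambda t: re.sub(r'\d', ' ', t)),
    ('Extra spaces',  lambda t: re.sub(r'\s\s+', ' ', t)),
]

text = 'Email me@site.com or visit http://a.io NOW!! Offer ends 2024.'
for name, step in steps:
    text = step(text)
    print(f'{name:14} -> {text}')
```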
Let's go ahead and run this function on some demo input to see it in action!
```python
print(clean_text('Check out the course at www.codesignal.com/course123'))
```
The output of the above code will be:
```
check out the course at www codesignal com course
```
Notice that the URL survived in pieces: our regex only removes URLs that start with `http`, so `www.codesignal.com/course123` was instead broken up by the punctuation and digit substitutions. Now that you are familiar with the workings of the function, let's apply it to the 20 Newsgroups dataset.
To apply our cleaning function to the dataset, we will make use of the `DataFrame` data structure from `Pandas`, another powerful data manipulation tool in Python.
```python
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Fetch the 20 Newsgroups dataset
newsgroups_data = fetch_20newsgroups(subset='train')
nlp_df = pd.DataFrame(newsgroups_data.data, columns=['text'])

# Apply the cleaning function to the text data
nlp_df['text'] = nlp_df['text'].apply(lambda x: clean_text(x))

# Check the cleaned text
print(nlp_df.head())
```
The output of the above code will be:
```
                                                text
0  from where s my thing subject what car is this...
1  from guy kuo subject si clock poll final call ...
2  from thomas e willis subject pb questions orga...
3  from joe green subject re weitek p organizatio...
4  from jonathan mcdowell subject re shuttle laun...
```
In this code, we're applying the `clean_text` function to each 'text' entry in our DataFrame using the `apply` method, which passes every value of the DataFrame column to the `clean_text` function one by one.
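As a side note, since `clean_text` takes a single argument, the lambda wrapper is optional; passing the function directly to `apply` behaves identically. Here is a small sketch on a toy DataFrame (the sample rows are made up):

```python
import pandas as pd

# Assumes the clean_text function defined earlier in this lesson
toy_df = pd.DataFrame({'text': ['Hello WORLD!', 'Email me@site.com please']})

# Passing clean_text directly is equivalent to .apply(lambda x: clean_text(x))
toy_df['text'] = toy_df['text'].apply(clean_text)
print(toy_df)
```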
To understand the impact of our text cleaning function, let's look at some text before and after cleaning, using a few new examples:
```python
test_texts = [
    'This is an EXAMPLE!',
    'Another ex:ample123 with special characters $#@!',
    'example@mail.com is an email address.'
]

for text in test_texts:
    print(f'Original: {text}')
    print(f'Cleaned: {clean_text(text)}')
    print('--')
```
The output of the above code will be:
```
Original: This is an EXAMPLE!
Cleaned: this is an example
--
Original: Another ex:ample123 with special characters $#@!
Cleaned: another ex ample with special characters
--
Original: example@mail.com is an email address.
Cleaned: is an email address
--
```
In the examples above, you can see that our function successfully converts all text to lower case and removes punctuation, digits, and email addresses!
Today we delved into the text cleaning process in Natural Language Processing. We discussed why it is necessary and how to implement it in Python, and then applied our text cleaning function to a textual dataset.
We have a few exercises lined up based on what we learned today. Keep swimming ahead, and remember, you learn the most by doing. Happy cleaning!