Welcome to today's lesson! As data science and machine learning professionals, particularly in the Natural Language Processing (NLP) field, we often deal with textual data. Today, we dive into the 'Introduction to Textual Data Collection'. Specifically, we'll explore how to collect, understand, and analyze text data using Python.
Textual data is usually unstructured, which makes it much harder to analyze than structured data. It can take many forms, such as emails, social media posts, books, or transcripts of conversations. Understanding how to handle such data is a critical part of building effective machine learning models, especially for text classification tasks, where we 'classify' or categorize texts. The quality of the data we use for these tasks is of utmost importance: well-structured, high-quality data leads to better-performing models.
The dataset we'll be working with in today's lesson is the 20 Newsgroups dataset. For some historical background, newsgroups were the precursors to modern internet forums: people gathered there to discuss specific topics, and the messages were exchanged through Usenet, a global discussion system that predates much of the modern internet. Our dataset consists of approximately 20,000 documents drawn from these newsgroup discussions.
The dataset is divided nearly evenly across 20 different newsgroups, each corresponding to a separate topic. This segmentation is one of the main reasons it is especially useful for text classification: the clean separation of classes makes it excellent for training models to distinguish between them, or in our case, between newsgroup topics.
From science and religion to politics and sports, the topics covered provide a diversified range of discussions. This diversity adds another layer of complexity and richness, similar to what we might experience with real-world data.
To load this dataset, we use the `fetch_20newsgroups()` function from the `sklearn.datasets` module in Python. This function retrieves the 20 Newsgroups dataset in a format that's convenient for machine learning purposes. Let's fetch and examine the dataset.
First, let's import the necessary libraries and fetch the data:
```python
# Importing necessary libraries
from sklearn.datasets import fetch_20newsgroups

# Fetch data
newsgroups = fetch_20newsgroups(subset='all')
```
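Here, `subset='all'` loads both the train and test splits. As a side note, `fetch_20newsgroups()` also accepts a `categories` parameter to restrict the download to specific newsgroups, and a `remove` parameter to strip metadata that models tend to overfit on. A minimal sketch (the choice of categories here is just for illustration):

```python
from sklearn.datasets import fetch_20newsgroups

# Fetch only the training split, limited to two topics, with
# headers, footers, and quoted replies stripped from each article
train_subset = fetch_20newsgroups(
    subset='train',
    categories=['sci.space', 'rec.autos'],
    remove=('headers', 'footers', 'quotes'),
)
print(f'Articles fetched: {len(train_subset.data)}')
```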
The dataset objects fetched from sklearn typically have three attributes: `data`, `target`, and `target_names`. `data` refers to the actual content, `target` refers to the labels for the texts, and `target_names` provides names for the target labels.
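In practice, each entry in `target` is an integer that indexes into `target_names`. A quick sketch of that mapping:

```python
# Each label in `target` is an integer index into `target_names`
first_label = newsgroups.target[0]
print(first_label)                           # an integer between 0 and 19
print(newsgroups.target_names[first_label])  # the human-readable topic name
```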
Next, let's understand the structure of the fetched data:
```python
# Understanding the structure of the data
print("\n\nData Structure\n-------------")
print(f'Type of data: {type(newsgroups.data)}')
print(f'Type of target: {type(newsgroups.target)}')
```
We are fetching the data and observing the types of `data` and `target`. The type of `data` tells us what kind of data structure is used to store the text data, while the type of `target` shows what type of structure is used to store the labels. Here is what the output looks like:
```text
Data Structure
-------------
Type of data: <class 'list'>
Type of target: <class 'numpy.ndarray'>
```
As printed out, `data` is stored as a list, and `target` as a NumPy array.
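Because `target` is a NumPy array, we can apply vectorized operations to it. As a quick illustration beyond the original walkthrough, here is how we might use a boolean mask to pull out every article from a single newsgroup:

```python
# Build a boolean mask that is True for every document labeled 'sci.space'
space_label = newsgroups.target_names.index('sci.space')
space_mask = newsgroups.target == space_label

# Keep only the articles whose mask entry is True
space_articles = [doc for doc, keep in zip(newsgroups.data, space_mask) if keep]
print(f'Articles in sci.space: {len(space_articles)}')
```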
Now, let's explore the data points, target variables and the potential classes in the dataset:
Python1print("\n\nData Exploration\n----------------") 2print(f'Number of datapoints: {len(newsgroups.data)}') 3print(f'Number of target variables: {len(newsgroups.target)}') 4print(f'Possible classes: {newsgroups.target_names}')
We get the length of the `data` list to fetch the number of data points, and the length of the `target` array to confirm there is one label per data point. Lastly, we fetch the possible classes, or newsgroups, in the dataset. Here is what we get:
```text
Data Exploration
----------------
Number of datapoints: 18846
Number of target variables: 18846
Possible classes: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
```
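Earlier we noted that the dataset is split nearly evenly across the 20 newsgroups. Since `target` is a NumPy array, one quick way to check that claim is `np.bincount`, which counts how many documents carry each label:

```python
import numpy as np

# Count how many documents fall into each of the 20 classes
counts = np.bincount(newsgroups.target)
for name, count in zip(newsgroups.target_names, counts):
    print(f'{name}: {count}')
```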
Finally, let's fetch a sample data point and see what it and its corresponding label look like:
Python1print("\n\nSample datapoint\n----------------") 2print(f'\nArticle:\n-------\n{newsgroups.data[10]}') 3print(f'\nCorresponding Topic:\n------------------\n{newsgroups.target_names[newsgroups.target[10]]}')
The `Article` fetched is the data point at index 10 in the dataset (the eleventh article, since indexing starts at 0), and the `Corresponding Topic` is the actual topic that the article belongs to. Because `newsgroups.target[10]` is a numeric label, we look up its readable name in `newsgroups.target_names`. Here's the output:
```text
Sample datapoint
----------------

Article:
-------
From: sandvik@newton.apple.com (Kent Sandvik)
Subject: Re: 14 Apr 93 God's Promise in 1 John 1: 7
Organization: Cookamunga Tourist Bureau
Lines: 17

In article <1qknu0INNbhv@shelley.u.washington.edu>, > Christian: washed in
the blood of the lamb.
> Mithraist: washed in the blood of the bull.
>
> If anyone in .netland is in the process of devising a new religion,
> do not use the lamb or the bull, because they have already been
> reserved. Please choose another animal, preferably one not
> on the Endangered Species List.

This will be a hard task, because most cultures used most animals
for blood sacrifices. It has to be something related to our current
post-modernism state. Hmm, what about used computers?

Cheers,
Kent
---
sandvik@newton.apple.com. ALink: KSAND -- Private activities on the net.


Corresponding Topic:
------------------
talk.religion.misc
```
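If you'd like to inspect other examples, a small helper like the hypothetical `show_sample` below (not part of the dataset's API, just a convenience for this lesson) wraps the same two lookups:

```python
def show_sample(index):
    """Print the article at `index` along with its newsgroup topic."""
    print(f'Article:\n-------\n{newsgroups.data[index]}')
    print(f'Topic: {newsgroups.target_names[newsgroups.target[index]]}')

# Try a different data point
show_sample(42)
```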
Nice work! Through today's lesson, you've learned how to fetch text data for classification tasks and examine its structure, size, and labels using Python.
But our journey to text classification is just starting. In upcoming lessons, we'll dive deeper into related topics such as cleaning textual data, handling missing values, and restructuring textual data for analysis. Each step forward improves your expertise in text classification. Keep going!