Welcome to today's lesson! As data science and machine learning professionals, particularly in the Natural Language Processing (NLP) field, we often deal with textual data. Today, we dive into the 'Introduction to Textual Data Collection'. Specifically, we'll explore how to collect, understand, and analyze text data using Python.
Textual data is usually unstructured, which makes it much harder to analyze than structured data. It can take many forms, such as emails, social media posts, books, or transcripts of conversations. Understanding how to handle such data is a critical part of building effective machine learning models, especially for text classification tasks, where we 'classify' or categorize texts. The quality of the data we use for these tasks is of utmost importance: well-structured, high-quality data leads to better-performing models.
The dataset we'll be working with in today's lesson is the 20 Newsgroups dataset. For some historical background, newsgroups were the precursors to modern internet forums: people gathered there to discuss specific topics, and the messages were exchanged through Usenet, a global discussion system that predates much of the modern internet. Our dataset consists of approximately 20,000 documents drawn from these newsgroup discussions.
The dataset is divided nearly evenly across 20 different newsgroups, each corresponding to a separate topic. This segmentation is one of the main reasons it is especially useful for text classification: the clean separation of classes makes it excellent for training models to distinguish between them, or in our case, between newsgroup topics.
From science and religion to politics and sports, the topics covered provide a diversified range of discussions. This diversity adds another layer of complexity and richness, similar to what we might experience with real-world data.
To load this dataset, we use the `fetch_20newsgroups()` function from the `sklearn.datasets` module in Python. This function retrieves the 20 Newsgroups dataset in a format that's convenient for machine learning purposes. Let's fetch and examine the dataset.
First, let's import the necessary libraries and fetch the data:
```python
# Importing necessary libraries
from sklearn.datasets import fetch_20newsgroups

# Fetch data
newsgroups = fetch_20newsgroups(subset='all')
```
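Here, `subset='all'` loads both the train and test splits. As a side note, `fetch_20newsgroups()` also accepts a `categories` parameter to restrict the download to specific newsgroups, and a `remove` parameter to strip metadata that models tend to overfit on. A minimal sketch (the choice of categories here is just for illustration):

```python
from sklearn.datasets import fetch_20newsgroups

# Fetch only the training split, limited to two topics, with
# headers, footers, and quoted replies stripped from each article
train_subset = fetch_20newsgroups(
    subset='train',
    categories=['sci.space', 'rec.autos'],
    remove=('headers', 'footers', 'quotes'),
)
print(f'Articles fetched: {len(train_subset.data)}')
```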
The dataset objects fetched from sklearn typically have three attributes: `data`, `target`, and `target_names`. `data` refers to the actual content, `target` refers to the labels for the texts, and `target_names` provides names for the target labels.
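In practice, each entry in `target` is an integer that indexes into `target_names`. A quick sketch of that mapping:

```python
# Each label in `target` is an integer index into `target_names`
first_label = newsgroups.target[0]
print(first_label)                           # an integer between 0 and 19
print(newsgroups.target_names[first_label])  # the human-readable topic name
```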
Next, let's understand the structure of the fetched data:
```python
# Understanding the structure of the data
print("\n\nData Structure\n-------------")
print(f'Type of data: {type(newsgroups.data)}')
print(f'Type of target: {type(newsgroups.target)}')
```
We are fetching the data and observing the types of `data` and `target`. The type of `data` tells us what kind of data structure is used to store the text data, while the type of `target` shows what type of structure is used to store the labels. Here is what the output looks like:
```text
Data Structure
-------------
Type of data: <class 'list'>
Type of target: <class 'numpy.ndarray'>
```
As printed out, `data` is stored as a list, and `target` as a NumPy array.
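Because `target` is a NumPy array, we can apply vectorized operations to it. As a quick illustration beyond the original walkthrough, here is how we might use a boolean mask to pull out every article from a single newsgroup:

```python
# Build a boolean mask that is True for every document labeled 'sci.space'
space_label = newsgroups.target_names.index('sci.space')
space_mask = newsgroups.target == space_label

# Keep only the articles whose mask entry is True
space_articles = [doc for doc, keep in zip(newsgroups.data, space_mask) if keep]
print(f'Articles in sci.space: {len(space_articles)}')
```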
Now, let's explore the data points, target variables and the potential classes in the dataset:
Python1print("\n\nData Exploration\n----------------") 2print(f'Number of datapoints: {len(newsgroups.data)}') 3print(f'Number of target variables: {len(newsgroups.target)}') 4print(f'Possible classes: {newsgroups.target_names}')
We get the length of the `data` list to fetch the number of data points, and the length of the `target` array to confirm there is one label per data point. Lastly, we fetch the possible classes, or newsgroups, in the dataset. Here is what we get:
```text
Data Exploration
----------------
Number of datapoints: 18846
Number of target variables: 18846
Possible classes: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
```
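Earlier we noted that the dataset is split nearly evenly across the 20 newsgroups. Since `target` is a NumPy array, one quick way to check that claim is `np.bincount`, which counts how many documents carry each label:

```python
import numpy as np

# Count how many documents fall into each of the 20 classes
counts = np.bincount(newsgroups.target)
for name, count in zip(newsgroups.target_names, counts):
    print(f'{name}: {count}')
```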
Finally, let's fetch a sample data point and see what it and its corresponding label look like:
Python1print("\n\nSample datapoint\n----------------") 2print(f'\nArticle:\n-------\n{newsgroups.data[10]}') 3print(f'\nCorresponding Topic:\n------------------\n{newsgroups.target_names[newsgroups.target[10]]}')
The `Article` fetched is the data point at index 10 in the dataset (the eleventh article, since indexing starts at 0), and the `Corresponding Topic` is the actual topic that the article belongs to. Because `newsgroups.target[10]` is a numeric label, we look up its readable name in `newsgroups.target_names`. Here's the output:
```text
Sample datapoint
----------------

Article:
-------
From: sandvik@newton.apple.com (Kent Sandvik)
Subject: Re: 14 Apr 93 God's Promise in 1 John 1: 7
Organization: Cookamunga Tourist Bureau
Lines: 17

In article <1qknu0INNbhv@shelley.u.washington.edu>, > Christian: washed in
the blood of the lamb.
> Mithraist: washed in the blood of the bull.
>
> If anyone in .netland is in the process of devising a new religion,
> do not use the lamb or the bull, because they have already been
> reserved. Please choose another animal, preferably one not
> on the Endangered Species List.

This will be a hard task, because most cultures used most animals
for blood sacrifices. It has to be something related to our current
post-modernism state. Hmm, what about used computers?

Cheers,
Kent
---
sandvik@newton.apple.com. ALink: KSAND -- Private activities on the net.


Corresponding Topic:
------------------
talk.religion.misc
```
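If you'd like to inspect other examples, a small helper like the hypothetical `show_sample` below (not part of the dataset's API, just a convenience for this lesson) wraps the same two lookups:

```python
def show_sample(index):
    """Print the article at `index` along with its newsgroup topic."""
    print(f'Article:\n-------\n{newsgroups.data[index]}')
    print(f'Topic: {newsgroups.target_names[newsgroups.target[index]]}')

# Try a different data point
show_sample(42)
```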
Nice work! Through today's lesson, you've learned how to fetch text data for classification tasks and examine its structure, size, and labels using Python.
But our journey to text classification is just starting. In upcoming lessons, we'll dive deeper into related topics such as cleaning textual data, handling missing values, and restructuring textual data for analysis. Each step forward improves your expertise in text classification. Keep going!