Lesson 1
Text Classification for Spam Detection Using spaCy
Introduction to Text Classification

Welcome! In today's lesson, we will delve into the world of Text Classification: the process of categorizing text into organized groups. It plays a significant role in Natural Language Processing (NLP), as it helps in organizing data, simplifying search, and keeping labels consistent.

In this practical exercise of spam detection, you will see how Text Classification plays a considerable role. Spam detection is a real-world problem where we differentiate unwanted emails (spam) from legitimate ones (ham).

At a high level, text classification involves converting text data into numerical feature vectors, which machine learning models can then use to assign categories. We will dig deeper into this as we progress through the lesson. Let's get started.
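As a minimal illustration of this idea, plain Python with no spaCy involved (the vocabulary and example text here are invented for demonstration), a text can be turned into a vector of word counts over a fixed vocabulary:

```python
# Minimal sketch: turning text into a numerical feature vector.
# The vocabulary and example text are invented for illustration.
def text_to_vector(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

vocabulary = ["buy", "cheap", "meeting", "discount"]
print(text_to_vector("Buy cheap watches now!", vocabulary))  # [1, 1, 0, 0]
```

Real pipelines use more robust tokenization and richer features, but the principle is the same: text in, numbers out, ready for a model.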

Setting up a spaCy Text Classification Pipeline

Before beginning the coding exercise, we need to understand the concept of pipelines in spaCy. In simple terms, a pipeline is a sequence of data processing components in spaCy. It is designed to take raw text data and perform several operations to convert the text data into valuable insights and information.

To set up a text classification pipeline in spaCy, the first step involves creating a blank spaCy model for the English language, as demonstrated in the provided code. This blank model will serve as the structure where we will add NLP components for processing the text data.

Python
import spacy

# Load a blank spaCy model
nlp = spacy.blank("en")

In previous courses, spacy.load("en_core_web_sm") was utilized to load a pre-trained model with components for common NLP tasks. In contrast, spacy.blank("en") initializes a blank English model, allowing for customization by adding only the necessary components for specific tasks like text classification for spam detection.

Data Preparation and Labeling

Effective data preparation is a critical step in any machine learning project. In text classification, labeled data plays a crucial role. The labels act as a form of instruction set for the model to learn and understand the patterns in the data.

We'll start by creating a small sample dataset with examples of text labeled as "SPAM" or "HAM".

Python
# Sample dataset with labels
training_data = [
    ("Buy cheap watches now!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Get your discount codes today", {"cats": {"SPAM": 1, "HAM": 0}}),
    # More examples...
]

This dataset contains a mix of both types of messages that will help train the model effectively. Each text is paired with annotations indicating whether it is spam or ham.
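If you are labeling many examples, a small helper can reduce repetition. The make_example function below is a hypothetical convenience wrapper, not part of spaCy; it simply builds tuples in the annotation format shown above:

```python
# Hypothetical helper (not part of spaCy) that builds labeled examples
# in the (text, {"cats": {...}}) annotation format used for training.
def make_example(text, is_spam):
    """Pair a text with mutually exclusive SPAM/HAM category scores."""
    return (text, {"cats": {"SPAM": int(is_spam), "HAM": int(not is_spam)}})

training_data = [
    make_example("Buy cheap watches now!", True),
    make_example("Can we reschedule the meeting?", False),
]
print(training_data[1])
# ('Can we reschedule the meeting?', {'cats': {'SPAM': 0, 'HAM': 1}})
```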

Adding and Configuring Text Classifier in the Pipeline

With our dataset and initial pipeline setup in place, the next step is to incorporate a text classifier into the pipeline. We'll utilize the TextCatBOW (Bag-of-Words) configuration for this purpose. The Bag-of-Words model represents a text as a 'bag' of its words, disregarding grammar and word order but focusing on word frequency, which is effective for capturing patterns. For instance, the BOW representation of "John likes to walk and likes to sing" would be { "John": 1, "likes": 2, "to": 2, "walk": 1, "and": 1, "sing": 1 }.
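The bag-of-words representation described above can be reproduced in plain Python with collections.Counter (this sketch uses whitespace tokenization only, for illustration):

```python
from collections import Counter

# Bag-of-words: count word frequencies, ignoring grammar and word order
text = "John likes to walk and likes to sing"
bow = Counter(text.split())
print(dict(bow))
# {'John': 1, 'likes': 2, 'to': 2, 'walk': 1, 'and': 1, 'sing': 1}
```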

Here's how we define the configuration and integrate the TextCatBOW model into our pipeline, ensuring it can classify "SPAM" and "HAM" labels:

Python
# Add the text classifier to the pipeline
config = {
    "threshold": 0.5,  # Decision threshold for labels
    "model": {
        "@architectures": "spacy.TextCatBOW.v1",  # Classifier architecture
        "exclusive_classes": True,  # Mutually exclusive labels
        "ngram_size": 1,  # Use unigrams
        "no_output_layer": False,  # Include output layer
    },
}
textcat = nlp.add_pipe("textcat", config=config)
textcat.add_label("SPAM")
textcat.add_label("HAM")

This configuration allows the pipeline to leverage the Bag-of-Words architecture in distinguishing between different types of text data, tailored specifically for spam detection.

Training the Text Classifier

Now that our pipeline has the text classifier, we can proceed to train it. Model training involves running the model through a loop where it learns from the training data; the size of its mistakes is measured by a 'loss' value. Each pass over the data is an iteration, and it is common to repeat the loop many times to fine-tune the model.

Python
from spacy.training import Example

# Training the classifier
def train_spam_detector(training_data, nlp, textcat, n_iter=30):
    optimizer = nlp.initialize(lambda: (
        Example.from_dict(nlp.make_doc(text), annotations)
        for text, annotations in training_data
    ))

    for i in range(n_iter):
        losses = {}
        for text, annotations in training_data:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            nlp.update([example], sgd=optimizer, losses=losses)
        print(f"Iteration {i} - Loss: {losses}")

train_spam_detector(training_data, nlp, textcat)

In the train_spam_detector() function, several critical methods are employed:

  • nlp.initialize: This method prepares the pipeline and optimizer for training. It sets up the weights of the model according to the architecture and configuration.

  • nlp.make_doc: This converts a text string into a spaCy Doc object, which is a container that holds the processed text and is integral to how spaCy handles text data.

  • nlp.update: This function performs an optimization step during training. It takes a batch of examples and updates the model's weights, improving its accuracy based on the loss computed.

These methods are used together to teach the model by matching each text with its corresponding labels (spam or ham) and iteratively updating the model.

Plain text
Iteration 0 - Loss: {'textcat': 1.2402415126562119}
Iteration 1 - Loss: {'textcat': 1.1603770107030869}
Iteration 2 - Loss: {'textcat': 1.087008148431778}

This output shows the loss at each iteration during the training process, which helps in understanding how the model is learning and improving. The decreasing loss value indicates the model is classifying texts more accurately over iterations.

Model Evaluation

Once the model is trained, it's time to evaluate its performance. To achieve this, we test the model on some new data that it hasn't seen during the training process. This provides insights into how well the model generalizes its learning to unseen data.

In the provided code, we've included a few text examples. The model classification scores are calculated, and the confidence scores for "SPAM" and "HAM" are printed, which tell us how confident the model is in its classification.

Python
# Test the trained model
test_texts = [
    "Exclusive deal just for you!",
    "Can we reschedule the meeting?",
    ...
]

for text in test_texts:
    doc = nlp(text)
    print(f"Text: {text}")
    for cat, score in doc.cats.items():
        print(f"  {cat}: {score:.4f}")
    print("\n")

# Output:
# Text: Exclusive deal just for you!
#   SPAM: 0.5940
#   HAM: 0.4060

# Text: Can we reschedule the meeting?
#   SPAM: 0.3539
#   HAM: 0.6461
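To turn these confidence scores into a single predicted label, a common approach is to pick the highest-scoring category. The sketch below operates on a plain dictionary shaped like doc.cats; the score values are illustrative, not actual model output:

```python
# Sketch: pick the highest-scoring category from a scores dictionary
# shaped like doc.cats (the values here are illustrative).
def predict_label(cats):
    """Return the category name with the highest confidence score."""
    return max(cats, key=cats.get)

print(predict_label({"SPAM": 0.5940, "HAM": 0.4060}))  # SPAM
print(predict_label({"SPAM": 0.3539, "HAM": 0.6461}))  # HAM
```

With exclusive classes, this argmax rule is equivalent to comparing the SPAM score against the 0.5 threshold set in the configuration.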
Lesson Summary

Great work! In this lesson, you have delved deep into Text Classification using spaCy, specifically focusing on the scenario of spam detection. You learned how to set up a text classification pipeline, prepare data for training, add and configure a text classifier to the pipeline using the Bag-of-Words model, train the model, and evaluate its performance. The practical exercise allowed you to apply these concepts and gain real hands-on experience in this versatile field of Natural Language Processing. In the next lessons, we will continue to build on these foundations and explore more advanced topics. Keep practicing and happy coding!
