Text Processing with awk

Lesson 5

Introduction to Text Processing with `awk`

Hello! Welcome to your next step in mastering Bash scripting. In this lesson, we will immerse ourselves in the world of text processing with the versatile command-line tool awk. awk is a powerful tool that allows you to manipulate and analyze text files with ease. By the end of this lesson, you’ll be equipped to efficiently handle and process text files, extracting meaningful data and performing relevant computations directly from your Bash scripts.

Let's get started by diving into how we can leverage awk for various text processing tasks.

Creating Initial Data

First, let's create a sample data file to work with. This file will help us learn and practice various awk commands effectively.

The heredoc (short for "here document") is a special syntax in Unix shell scripting that allows you to create a multi-line string. It is particularly useful for creating files or including large blocks of text within your script. The syntax is <<EOF ... EOF, where EOF (End of File) is a marker indicating the beginning and end of the block of text. You can actually use any marker, but EOF is conventionally used.

Let's create a file called data.txt that includes data about computers in inventory.

Bash
#!/bin/bash

# Create a sample data file
cat << EOF > data.txt
Brand   Model     RAM
Apple   MacBook    32
Apple   iPad       16
Dell    XPS        32
Dell    Inspiron  128
Lenovo  ThinkPad  128
Lenovo  Yoga      256
Apple   MacBook    64
EOF

Let's break this code down:

cat << EOF: This starts the heredoc. The shell feeds the lines that follow to the cat command as its input until it encounters the closing EOF marker.
> data.txt: This redirects the output of the cat command to a file named data.txt.
The lines between << EOF and EOF are the content that will be written to data.txt.
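As a small sketch of the point that any marker works, the example below uses a marker named INVENTORY. The marker, the variable owner, and the file names expanded.txt and literal.txt are all hypothetical and chosen only for illustration. It also shows one behavior worth knowing: with an unquoted marker the shell expands variables inside the block, while quoting the marker keeps the text exactly as typed.

```shell
#!/bin/bash

owner="Alice"   # hypothetical variable, for illustration only

# Unquoted marker: $owner is expanded before the text is written
cat << INVENTORY > expanded.txt
Owner: $owner
INVENTORY

# Quoted marker: the text is written literally, with no expansion
cat << 'INVENTORY' > literal.txt
Owner: $owner
INVENTORY

cat expanded.txt   # prints: Owner: Alice
cat literal.txt    # prints: Owner: $owner
```

Quoting the marker is a reasonable default whenever the block contains dollar signs that should reach the file unchanged.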

Basic Syntax of `awk`

The basic syntax of the awk command in Unix-like systems is:

Plain text
awk options 'selection_criteria {action}' input-file > output-file

Here's a detailed breakdown of each component:

awk: The command itself.
options: These are optional flags you can pass to awk to modify its behavior (e.g., -F to specify the field separator).
selection_criteria: This is an optional condition or pattern that specifies which lines of the input file to process. It can be a regular expression or a logical condition based on field values.
{action}: This is the block of code to execute for each line that matches the selection criteria. Actions are enclosed in curly braces {}.
input-file: The file that awk processes.
> output-file: This optional part redirects the output to a file. If omitted, awk prints the output to the terminal.
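To make these components concrete, here is a minimal sketch that uses every part of the syntax at once. The file names prices.csv and expensive.txt, and the data inside them, are hypothetical and exist only for this illustration.

```shell
#!/bin/bash

# Hypothetical comma-separated input file: item,price
cat << EOF > prices.csv
pen,2
laptop,900
desk,150
EOF

# options:            -F ',' sets the field separator to a comma
# selection_criteria: $2 > 100 keeps lines whose second field exceeds 100
# action:             {print $1} prints the first field of those lines
# output-file:        expensive.txt receives the result
awk -F ',' '$2 > 100 {print $1}' prices.csv > expensive.txt

cat expensive.txt   # prints laptop and desk, one per line
```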

With this understanding of awk syntax, let's dive into some examples.

Printing Entire File Using `awk`

Let's begin with the most basic awk command to print the entire content of the file. The print command is used to output text, fields, or expressions to the terminal or another output stream. It offers flexibility in how the data is displayed and allows custom formatting of text.

Bash
#!/bin/bash

# Using `awk` to print the entire file
awk '{print}' data.txt

In this awk command:

There are no options or selection_criteria, so the action applies to every line.
The action is {print}, enclosed in curly braces. With no arguments, the print statement outputs each line of the file unchanged.
There is no output-file, so the result is displayed on the terminal.

Running this command will display all lines of the data.txt file, mirroring the functionality of the cat command.
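One way to check this claim is to compare the two outputs directly. The sketch below recreates data.txt so that it runs on its own, and uses the standard cmp utility for the comparison. The output file names awk_out.txt and awk_out0.txt are hypothetical, and {print $0} is included as an equivalent spelling of {print}.

```shell
#!/bin/bash

cat << EOF > data.txt
Brand   Model     RAM
Apple   MacBook    32
Apple   iPad       16
Dell    XPS        32
Dell    Inspiron  128
Lenovo  ThinkPad  128
Lenovo  Yoga      256
Apple   MacBook    64
EOF

awk '{print}' data.txt > awk_out.txt
awk '{print $0}' data.txt > awk_out0.txt

# cmp -s prints nothing and succeeds only when the two files are identical
if cmp -s data.txt awk_out.txt && cmp -s data.txt awk_out0.txt; then
    echo "awk output matches the original file"
fi
```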

Field Numbers

In awk, field numbers are used to refer to specific columns in a line of text. Fields are denoted by a dollar sign ($) followed by the field number. The records (lines of text) are automatically split into fields based on a delimiter. By default, the delimiter is any run of spaces or tabs, so columns padded with several spaces still count as separate fields. The delimiter can be changed using the -F option.

$1 denotes the first field of a line of text, $2 represents the second field, and so forth. $0 refers to the whole line.

Often, we need to extract specific columns from a file. Suppose we only want to extract the "Brand" and "Model" of each line. The code is:

Bash
#!/bin/bash

# Using `awk` to print specific columns (Brand and Model)
awk '{print $1, $2}' data.txt

$1 and $2 refer to the first and second fields (columns) of each line in the file.
This command omits the third column (RAM), displaying only the brand and model of each item.

The output of the command is:

Plain text
Brand Model
Apple MacBook
Apple iPad
Dell XPS
Dell Inspiron
Lenovo ThinkPad
Lenovo Yoga
Apple MacBook

The output successfully shows only the "Brand" and "Model" columns of the text file.
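The -F option mentioned at the start of this section can be sketched in the same way. The colon-separated file users.txt below and its contents are hypothetical and serve only to show fields being split on a different delimiter.

```shell
#!/bin/bash

# Hypothetical colon-separated data: name, role, shell
cat << EOF > users.txt
alice:admin:/bin/bash
bob:developer:/bin/zsh
EOF

# -F ':' tells awk to split each line on colons instead of whitespace
awk -F ':' '{print $1, $3}' users.txt
# prints:
# alice /bin/bash
# bob /bin/zsh
```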

Conditional Text Selection

Often, you will need to filter lines based on specific conditions. To do this, you place the condition before the {action} block, inside the same single-quoted (') program. For instance, we may want to find all entries where the RAM is 64 GB or greater.

Bash
#!/bin/bash

# Using `awk` to print lines where RAM is greater than or equal to 64
awk '$3 >= 64 {print $0}' data.txt

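$3 >= 64 {print $0} instructs awk to print any line where the third field (RAM) is 64 or greater.
$0 represents the entire line.
The header line passes the test as well: its third field is the text "RAM", which awk compares with 64 as a string rather than as a number.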

The output is:

Plain text
Brand   Model     RAM
Dell    Inspiron  128
Lenovo  ThinkPad  128
Lenovo  Yoga      256
Apple   MacBook    64
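If the header row is unwanted in a numeric filter like this, one possible approach is to add 0 to the field. This forces awk to treat the field as a number, and non-numeric text such as "RAM" counts as 0. The sketch below assumes every data row holds a number in the third column, and it recreates data.txt so that it runs on its own.

```shell
#!/bin/bash

cat << EOF > data.txt
Brand   Model     RAM
Apple   MacBook    32
Apple   iPad       16
Dell    XPS        32
Dell    Inspiron  128
Lenovo  ThinkPad  128
Lenovo  Yoga      256
Apple   MacBook    64
EOF

# $3 + 0 converts the field to a number; the text "RAM" becomes 0,
# so the header no longer satisfies the condition
awk '$3 + 0 >= 64 {print $0}' data.txt
```

With this change, only the four data rows with 64 GB or more are printed.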

Built-in NR Variable

NR stands for "Number of Records" in awk. It is a built-in variable that keeps track of the current line number being processed in the input file. Each time awk reads a new line, it increments NR by one. This makes NR useful for actions based on the line number, such as skipping headers, processing specific lines, or adding line numbers to output.

Skip Header Line

Suppose we want to print every line except the header line ("Brand Model RAM"). The header line has an NR value of 1. To skip this line, we use the condition NR > 1.

Bash
#!/bin/bash

awk 'NR > 1 {print}' data.txt

NR > 1: This condition skips the first line (header) and prints the remaining lines.

The output is:

Plain text
Apple   MacBook    32
Apple   iPad       16
Dell    XPS        32
Dell    Inspiron  128
Lenovo  ThinkPad  128
Lenovo  Yoga      256
Apple   MacBook    64

Process a Specific Line

Now let's print only the 3rd line of data.txt.

Bash
#!/bin/bash

awk 'NR == 3 {print}' data.txt

The output is:

Plain text
Apple   iPad       16
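NR can be used in other conditions and actions too. As a sketch, the first command below prefixes each line with its line number, and the second selects a range of lines by combining two comparisons with && (logical AND). The script recreates data.txt so that it runs on its own.

```shell
#!/bin/bash

cat << EOF > data.txt
Brand   Model     RAM
Apple   MacBook    32
Apple   iPad       16
Dell    XPS        32
Dell    Inspiron  128
Lenovo  ThinkPad  128
Lenovo  Yoga      256
Apple   MacBook    64
EOF

# Add a line number in front of every line (the header becomes "1: Brand ...")
awk '{print NR ": " $0}' data.txt

# Print only lines 2 through 3 (the first two data rows)
awk 'NR >= 2 && NR <= 3 {print}' data.txt
```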

Pattern Matching with `awk`

Pattern matching is one of the core strengths of awk, allowing you to perform actions only on lines that match specific patterns. The syntax for pattern matching in awk involves enclosing regular expressions within slashes (/pattern/). Suppose we only want to print lines that contain "Apple":

Bash
#!/bin/bash

# Using `awk` with pattern matching
awk '/Apple/ {print}' data.txt

The /Apple/ pattern checks each line for the string "Apple".
If a line contains "Apple", the {print} action prints it.

The output of the command is:

Plain text
Apple   MacBook    32
Apple   iPad       16
Apple   MacBook    64
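Because the text between the slashes is a regular expression, anchors and alternation are available as well. A brief sketch, recreating data.txt so that it runs on its own: ^ ties the match to the start of the line, and | matches either alternative.

```shell
#!/bin/bash

cat << EOF > data.txt
Brand   Model     RAM
Apple   MacBook    32
Apple   iPad       16
Dell    XPS        32
Dell    Inspiron  128
Lenovo  ThinkPad  128
Lenovo  Yoga      256
Apple   MacBook    64
EOF

# Lines that start with "Dell"
awk '/^Dell/ {print}' data.txt

# Lines that contain either "iPad" or "Yoga"
awk '/iPad|Yoga/ {print}' data.txt
```

The first command prints the two Dell rows, and the second prints the iPad and Yoga rows.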

Performing Calculations: Variables and END

The END keyword is used to specify an action to be executed after all lines have been processed. The syntax is:

Plain text
awk '{action1} END {action2}'

This command will perform action1 for every line of text. After all lines have been processed, action2 is run once.
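As a minimal sketch of this structure, the following script counts the data rows. The first action runs once for every line after the header, and the END action prints the total once. The variable name rows is arbitrary, and the script recreates data.txt so that it runs on its own.

```shell
#!/bin/bash

cat << EOF > data.txt
Brand   Model     RAM
Apple   MacBook    32
Apple   iPad       16
Dell    XPS        32
Dell    Inspiron  128
Lenovo  ThinkPad  128
Lenovo  Yoga      256
Apple   MacBook    64
EOF

# rows starts out empty (treated as 0) and grows by 1 per data line
awk 'NR > 1 {rows++} END {print "Data rows:", rows}' data.txt
# prints: Data rows: 7
```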

Now let's write a command that calculates the average RAM across all entries. To do this:

We use a sum variable and a count variable. awk variables need no declaration and start out as 0.
For each line, we add the RAM value (column $3) to sum and increment the count variable by 1.
After all lines have been processed, we print sum/count.

Bash
#!/bin/bash

# Using `awk` to calculate the average RAM
awk 'NR>1 {sum+=$3; count++} END {print "Average RAM:", sum/count}' data.txt

NR>1 skips the header line because it does not contain a RAM value.
{sum+=$3; count++} sums the values of the third field (RAM) and increments the count.
END {print "Average RAM:", sum/count} executes after processing all lines, printing the calculated average RAM.

The output of the code is:

Plain text
Average RAM: 93.7143
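The same approach extends to other aggregates. The sketch below tracks the largest RAM value alongside the sum, and guards the division so that an input with no data rows does not divide by zero. The variable names max and stats are illustrative, and the script recreates data.txt so that it runs on its own.

```shell
#!/bin/bash

cat << EOF > data.txt
Brand   Model     RAM
Apple   MacBook    32
Apple   iPad       16
Dell    XPS        32
Dell    Inspiron  128
Lenovo  ThinkPad  128
Lenovo  Yoga      256
Apple   MacBook    64
EOF

# Capture the awk output in a shell variable so the script can reuse it
stats=$(awk 'NR > 1 {
                 sum += $3
                 count++
                 if ($3 + 0 > max) max = $3 + 0   # keep the largest value so far
             }
             END {
                 if (count > 0) {
                     print "Max RAM:", max
                     print "Average RAM:", sum / count
                 } else {
                     print "No data rows found"
                 }
             }' data.txt)

echo "$stats"
# prints:
# Max RAM: 256
# Average RAM: 93.7143
```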

Custom Line Messages

You can customize the output format for each line as well. We can add text to our print statement by separating strings/field references with commas. Let’s create a message for each entry.

Bash
#!/bin/bash

# Using `awk` to print a custom message for each line
awk 'NR>1 {print "Brand:", $1, "- Model:", $2, "- RAM:", $3}' data.txt

The output of this command is:

Plain text
Brand: Apple - Model: MacBook - RAM: 32
Brand: Apple - Model: iPad - RAM: 16
Brand: Dell - Model: XPS - RAM: 32
Brand: Dell - Model: Inspiron - RAM: 128
Brand: Lenovo - Model: ThinkPad - RAM: 128
Brand: Lenovo - Model: Yoga - RAM: 256
Brand: Apple - Model: MacBook - RAM: 64

The output is still a bit difficult to read. Let's continue to see how to use printf to format the output.
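A note on the commas in print before moving on: each comma inserts the output field separator, which is a single space by default, while items written side by side with no comma are joined with nothing in between. The short sketch below shows the difference using one line of inline input rather than data.txt.

```shell
#!/bin/bash

# With a comma: the two fields are separated by a space
echo "Apple MacBook 32" | awk '{print $1, $2}'
# prints: Apple MacBook

# Without a comma: the two fields are concatenated directly
echo "Apple MacBook 32" | awk '{print $1 $2}'
# prints: AppleMacBook

# Concatenation gives full control over the text between items
echo "Apple MacBook 32" | awk '{print $1 "/" $2 " (" $3 " GB)"}'
# prints: Apple/MacBook (32 GB)
```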

Table Formatting with `awk`

The printf function in awk offers more control over the formatting of the output compared to the print command. The syntax for printf is:

Plain text
awk '{printf format, item1, item2, ..., itemN}' input-file

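The format string includes text and format specifiers that begin with %. Common format specifiers include:

%d: Integer
%s: String

The format string can also contain escape sequences such as \n (newline). Unlike print, printf does not add a newline on its own, so \n is usually placed at the end of the format string.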

Modifiers can also be added to control the width and alignment:

Minimum Field Width: The number between % and the format specifier defines the minimum width of the field.

Positive Width: Right-justified by default. For example, %10s formats a string, right-aligned, with a minimum width of 10 characters.

Negative Width: Left-justified if prefixed with a minus sign. For example, %-10s formats a string, left-aligned, with a minimum width of 10 characters.
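The width and alignment rules above can be seen in a short sketch. The square brackets are plain text in the format string, added only to make the padding visible, and the inline input "awk 42" is an arbitrary example.

```shell
#!/bin/bash

# %10s  -> right-aligned string, at least 10 characters wide
# %-10s -> left-aligned string, at least 10 characters wide
# %5d   -> right-aligned integer, at least 5 characters wide
echo "awk 42" | awk '{printf "[%10s] [%-10s] [%5d]\n", $1, $1, $2}'
# prints: [       awk] [awk       ] [   42]
```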

Now, let’s format our output as a neatly aligned table:

Bash
#!/bin/bash

# Using `awk` to format output as a table
awk 'BEGIN {print "Brand    Model     RAM"} NR>1 {printf "%-8s %-10s %2d\n", $1, $2, $3}' data.txt

Breakdown of the Command

BEGIN {print "Brand    Model     RAM"}

The BEGIN block is executed before any lines from the input file are processed.
{print "Brand    Model     RAM"}: This prints the header row before processing the actual data. The string contains specific spaces to align the header with the columns that will follow.

NR > 1 ensures that the action {printf ...} is applied only to lines after the first one, which is the header line.

%-8s: Left-align (-) a string (s) with a width of 8 characters.
%-10s: Left-align (-) a string (s) with a width of 10 characters.
%2d: Print an integer (d), right-aligned, with a minimum width of 2 characters. Longer values such as 128 are printed in full.
\n: Newline character to move to the next line after printing.
$1, $2, $3: These are the fields to be printed according to the format specifiers.
data.txt: The input file that contains the data to be processed.

The output of the command is:

Plain text
Brand    Model     RAM
Apple    MacBook    32
Apple    iPad       16
Dell     XPS        32
Dell     Inspiron   128
Lenovo   ThinkPad   128
Lenovo   Yoga       256
Apple    MacBook    64

Using printf and format specifiers, our table looks much cleaner!
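One refinement to consider: %2d reserves only two characters, so three-digit values such as 128 run one character wider than the other rows. A possible variation, sketched below, prints the header with printf as well and widens the RAM column to three characters, so the header and every row share the same layout. The variable name table is illustrative, and the script recreates data.txt so that it runs on its own.

```shell
#!/bin/bash

cat << EOF > data.txt
Brand   Model     RAM
Apple   MacBook    32
Apple   iPad       16
Dell    XPS        32
Dell    Inspiron  128
Lenovo  ThinkPad  128
Lenovo  Yoga      256
Apple   MacBook    64
EOF

# The header uses %3s and the rows use %3d, so both right-align in 3 columns
table=$(awk 'BEGIN  {printf "%-8s %-10s %3s\n", "Brand", "Model", "RAM"}
             NR > 1 {printf "%-8s %-10s %3d\n", $1, $2, $3}' data.txt)

echo "$table"
# first lines of the output:
# Brand    Model      RAM
# Apple    MacBook     32
# Apple    iPad        16
# Dell     XPS         32
# Dell     Inspiron   128
```

Because the header goes through the same format as the data, changing a column width later only requires editing the numbers in the two format strings.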

Summary and Next Steps

Great job! In this lesson, you learned how to:

Create and manipulate data files using heredoc syntax.
Print the entire content of a file using awk.
Extract specific columns from a file using field numbers.
Filter lines based on conditions using awk.
Perform pattern matching with awk.
Calculate and display average values using the END block.
Customize and format output using printf in awk.

Now, it’s time to apply what you’ve learned. Head to the practice section to sharpen your awk skills through hands-on exercises. Happy coding!
