Lesson 5
Text Processing with awk
Introduction to Text Processing with `awk`

Hello! Welcome to your next step in mastering Bash scripting. In this lesson, we will explore text processing with awk, a versatile command-line tool for manipulating and analyzing text files. By the end of this lesson, you’ll be able to extract meaningful data from text files and perform computations on it directly from your Bash scripts.

Let's get started by diving into how we can leverage awk for various text processing tasks.

Creating Initial Data

First, let's create a sample data file to work with. This file will help us learn and practice various awk commands effectively.

The heredoc (short for "here document") is a special syntax in Unix shell scripting that lets you pass a multi-line block of text to a command as its input. It is particularly useful for creating files or including large blocks of text within your script. The syntax is <<EOF ... EOF, where EOF (End of File) is a marker indicating the beginning and end of the block of text. You can actually use any marker, but EOF is conventionally used.

Let's create a file called data.txt that includes data about computers in inventory.

```bash
#!/bin/bash

# Create a sample data file
cat << EOF > data.txt
Brand Model RAM
Apple MacBook 32
Apple iPad 16
Dell XPS 32
Dell Inspiron 128
Lenovo ThinkPad 128
Lenovo Yoga 256
Apple MacBook 64
EOF
```

Let's break this code down:

  • cat << EOF: This starts the heredoc. The shell passes the subsequent lines to the cat command as its input until it encounters the closing EOF marker on a line by itself.
  • > data.txt: This redirects the output of the cat command to a file named data.txt.
  • The lines between << EOF and EOF are the content that will be written to data.txt.
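
As a small sketch of the marker rules described above: the marker name is arbitrary, and quoting it changes how the shell treats the text inside the block. The variable `owner` and the marker `INVENTORY` are made up for this illustration and are not part of the lesson's data.

```shell
#!/bin/bash

owner="lab"  # hypothetical variable, used only for this illustration

# Unquoted marker: the shell expands $owner inside the block
expanded=$(cat << INVENTORY
Owner: $owner
INVENTORY
)

# Quoted marker: the block is passed through literally
literal=$(cat << 'INVENTORY'
Owner: $owner
INVENTORY
)

echo "$expanded"
echo "$literal"
```

The first echo prints `Owner: lab`, while the second prints the text `Owner: $owner` unchanged.
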
Basic Syntax of `awk`

The basic syntax of the awk command in Unix-like systems is:

```
awk options 'selection_criteria {action}' input-file > output-file
```

Here's a detailed breakdown of each component:

  • awk: The command itself.
  • options: These are optional flags you can pass to awk to modify its behavior (e.g., -F to specify the field separator).
  • selection_criteria: This is an optional condition or pattern that specifies which lines of the input file to process. It can be a regular expression or a logical condition based on field values.
  • {action}: This is the block of code to execute for each line that matches the selection criteria. Actions are enclosed in curly braces {}.
  • input-file: The file that awk processes.
  • > output-file: This optional part redirects the output to a file. If omitted, awk prints the output to the terminal.

With this understanding of awk syntax, let's dive into some examples.
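
Because both the selection criteria and the action are optional, either one can be left out. Here is a minimal sketch, using a made-up three-line input piped in on standard input rather than read from a file:

```shell
#!/bin/bash

# Selection criteria only: awk falls back to printing each matching line
pattern_only=$(printf 'alpha\nbeta\ngamma\n' | awk '/gamma/')

# Action only: with no criteria, the action runs once for every line
action_only=$(printf 'alpha\nbeta\ngamma\n' | awk '{print "line read"}')

echo "$pattern_only"
echo "$action_only"
```

The first command prints only `gamma`; the second prints `line read` three times, once per input line.
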

Printing Entire File Using `awk`

Let's begin with the most basic awk command to print the entire content of the file. The print command is used to output text, fields, or expressions to the terminal or another output stream. It offers flexibility in how the data is displayed and allows custom formatting of text.

```bash
#!/bin/bash

# Using `awk` to print the entire file
awk '{print}' data.txt
```

In this awk command:

  • There are no options or selection_criteria, so the action applies to every line.
  • The action is {print}, enclosed in curly braces. With no arguments, the print statement outputs the current line unchanged.
  • There is no output-file, so the result is displayed on the terminal.

Running this command will display all lines of the data.txt file, mirroring the functionality of the cat command.

Field Numbers

In awk, field numbers are used to refer to specific columns in a line of text. Fields are denoted by a dollar sign ($) followed by the field number. The records (lines of text) are automatically split into fields based on a delimiter, which by default is any run of spaces or tabs but can be changed using the -F option.

$1 denotes the first field of a line of text, $2 represents the second field, and so forth. $0 refers to the whole line.

Often, we need to extract specific columns from a file. Suppose we only want to extract the "Brand" and "Model" of each line. The code is:

```bash
#!/bin/bash

# Using `awk` to print specific columns (Brand and Model)
awk '{print $1, $2}' data.txt
```
  • $1 and $2 refer to the first and second fields (columns) of each line in the file.
  • This command omits the third column (RAM), displaying only the brand and model of each item.

The output of the command is:

```
Brand Model
Apple MacBook
Apple iPad
Dell XPS
Dell Inspiron
Lenovo ThinkPad
Lenovo Yoga
Apple MacBook
```

The output successfully shows only the "Brand" and "Model" columns of the text file.
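
The -F option mentioned earlier works the same way with other delimiters. As a brief sketch, assuming a comma-separated version of two inventory rows (invented here, not part of data.txt):

```shell
#!/bin/bash

# Hypothetical comma-separated rows, piped in on standard input.
# -F',' makes awk split each line on commas instead of whitespace.
models=$(printf 'Apple,MacBook,32\nDell,XPS,32\n' | awk -F',' '{print $2, $3}')

echo "$models"
```

This prints `MacBook 32` and `XPS 32`, since $2 and $3 now refer to the second and third comma-separated values.
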

Conditional Text Selection

Often, you will need to filter lines based on specific conditions. To do this, you place the condition before the {action} block, inside the same single-quoted (') awk program. For instance, we may want to find all entries where the RAM is 64 GB or greater.

```bash
#!/bin/bash

# Using `awk` to print lines where RAM is greater than or equal to 64
awk '$3 >= 64 {print $0}' data.txt
```
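  • $3 >= 64 {print $0} instructs awk to print any line where the third field (RAM) is 64 or greater.
  • $0 represents the entire line.
  • The header line is printed too. Its third field, "RAM", is not a number, so awk compares it with 64 as text, and "R" sorts after "6".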

The output is:

```
Brand Model RAM
Dell Inspiron 128
Lenovo ThinkPad 128
Lenovo Yoga 256
Apple MacBook 64
```
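
A numeric filter like this can also be written so that non-numeric fields, such as the header's "RAM", never match. One possible approach, sketched here with the same rows fed on standard input so the example is self-contained, is to add 0 to the field, which forces a numeric comparison:

```shell
#!/bin/bash

# $3 + 0 converts the field to a number; the text "RAM" becomes 0
numeric_only=$(awk '$3 + 0 >= 64 {print $0}' << EOF
Brand Model RAM
Apple MacBook 32
Apple iPad 16
Dell XPS 32
Dell Inspiron 128
Lenovo ThinkPad 128
Lenovo Yoga 256
Apple MacBook 64
EOF
)

echo "$numeric_only"
```

With this change only the four rows with 64 GB or more are printed, without the header.
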
Built-in NR Variable

NR stands for "Number of Records" in awk. It is a built-in variable that keeps track of the current line number being processed in the input file. Each time awk reads a new line, it increments NR by one. This makes NR useful for actions based on the line number, such as skipping headers, processing specific lines, or adding line numbers to output.
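
For instance, adding line numbers to the output could look like the following sketch, which pipes three sample rows in on standard input instead of reading data.txt:

```shell
#!/bin/bash

# Prefix each line with its record number, a colon, and a space
numbered=$(printf 'Brand Model RAM\nApple MacBook 32\nApple iPad 16\n' | awk '{print NR ": " $0}')

echo "$numbered"
```

Each line is printed with its number in front, starting with `1: Brand Model RAM`.
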

Skip Header Line

Suppose we want to print every line, excluding the header line ("Brand Model RAM"). The header line has an NR value of 1. To skip this line, we use the condition NR > 1.

```bash
#!/bin/bash

awk 'NR > 1 {print}' data.txt
```
  • NR > 1: This condition skips the first line (header) and prints the remaining lines.

The output is:

```
Apple MacBook 32
Apple iPad 16
Dell XPS 32
Dell Inspiron 128
Lenovo ThinkPad 128
Lenovo Yoga 256
Apple MacBook 64
```

Process a Specific Line

Now let's print only the 3rd line of data.txt.

```bash
#!/bin/bash

awk 'NR == 3 {print}' data.txt
```

The output is:

```
Apple iPad 16
```
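
The same idea extends to a range of lines. As a short sketch, two NR conditions can be combined with awk's && (logical AND) operator; the four sample rows here are piped in on standard input:

```shell
#!/bin/bash

# Print only lines 2 through 3 of the input
range=$(printf 'Brand Model RAM\nApple MacBook 32\nApple iPad 16\nDell XPS 32\n' | awk 'NR >= 2 && NR <= 3 {print}')

echo "$range"
```

This prints the `Apple MacBook 32` and `Apple iPad 16` rows and skips everything else.
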
Pattern Matching with `awk`

Pattern matching is one of the core strengths of awk, allowing you to perform actions only on lines that match specific patterns. The syntax for pattern matching in awk involves enclosing regular expressions within slashes (/pattern/). Suppose we only want to print lines that contain "Apple":

```bash
#!/bin/bash

# Using `awk` with pattern matching
awk '/Apple/ {print}' data.txt
```
  • The /Apple/ pattern checks each line for the string "Apple".
  • If a line contains "Apple", the {print} action prints it.

The output of the command is:

```
Apple MacBook 32
Apple iPad 16
Apple MacBook 64
```
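
A pattern like /Apple/ is tested against the whole line. To restrict the match to one column, awk's ~ (match) operator can be applied to a single field. Here is a brief sketch with a few sample rows on standard input; the ^ anchors the match to the start of the field:

```shell
#!/bin/bash

# Print rows whose second field (Model) starts with "Think"
thinkpads=$(printf 'Brand Model RAM\nDell Inspiron 128\nLenovo ThinkPad 128\nLenovo Yoga 256\n' | awk '$2 ~ /^Think/ {print}')

echo "$thinkpads"
```

Only the `Lenovo ThinkPad 128` row matches.
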
Performing Calculations: Variables and END

The END keyword is used to specify an action to be executed after all lines have been processed. The syntax is:

```
awk '{action1} END {action2}'
```

This command will perform action1 for every line of text. After all lines have been processed, action2 is run once.
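
As a minimal sketch of this pattern, the following counts the data rows in a small sample piped in on standard input. The variable name `rows` is arbitrary:

```shell
#!/bin/bash

# Increment rows for every line after the header, then report once at the end
row_count=$(printf 'Brand Model RAM\nApple MacBook 32\nApple iPad 16\n' | awk 'NR > 1 {rows++} END {print "Data rows:", rows}')

echo "$row_count"
```

This prints `Data rows: 2`.
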

Now let's write a command that calculates the average RAM across all entries. To do this:

  • We create a sum variable and a count variable. awk variables start out empty and are treated as 0 in arithmetic, so no explicit initialization is needed.
  • For each line, we add the RAM value (column $3) to sum and increment the count variable by 1.
  • After all lines have been processed, we print sum/count.
```bash
#!/bin/bash

# Using `awk` to calculate the average RAM
awk 'NR>1 {sum+=$3; count++} END {print "Average RAM:", sum/count}' data.txt
```
  • NR>1 skips the header line because it does not contain a RAM value.
  • {sum+=$3; count++} sums the values of the third field (RAM) and increments the count.
  • END {print "Average RAM:", sum/count} executes after processing all lines, printing the calculated average RAM.

The output of the code is:

```
Average RAM: 93.7143
```
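
The same variables-plus-END approach works for other summaries. As one possible variation, this sketch tracks the largest RAM value and the model it belongs to; the variable names `max` and `model` are chosen for illustration, and the rows are fed on standard input:

```shell
#!/bin/bash

# Remember the largest RAM value seen so far and its model, then report at the end
largest=$(awk 'NR > 1 && $3 + 0 > max {max = $3 + 0; model = $2}
               END {print "Largest RAM:", max, "(" model ")"}' << EOF
Brand Model RAM
Apple MacBook 32
Apple iPad 16
Dell XPS 32
Dell Inspiron 128
Lenovo ThinkPad 128
Lenovo Yoga 256
Apple MacBook 64
EOF
)

echo "$largest"
```

This prints `Largest RAM: 256 (Yoga)`.
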
Custom Line Messages

You can customize the output format for each line as well. We can add text to our print statement by separating strings and field references with commas; awk prints a single space between the comma-separated items. Let’s create a message for each entry.

```bash
#!/bin/bash

# Using `awk` to print a custom message for each line
awk 'NR>1 {print "Brand:", $1, "- Model:", $2, "- RAM:", $3}' data.txt
```

The output of this command is:

```
Brand: Apple - Model: MacBook - RAM: 32
Brand: Apple - Model: iPad - RAM: 16
Brand: Dell - Model: XPS - RAM: 32
Brand: Dell - Model: Inspiron - RAM: 128
Brand: Lenovo - Model: ThinkPad - RAM: 128
Brand: Lenovo - Model: Yoga - RAM: 256
Brand: Apple - Model: MacBook - RAM: 64
```
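
The commas matter here. As a small sketch on a single sample row, compare comma-separated items with items simply placed next to each other, which awk joins with nothing in between:

```shell
#!/bin/bash

# Commas: items are joined with a single space
with_commas=$(printf 'Dell XPS 32\n' | awk '{print $1, $2}')

# No commas: items are concatenated directly
concatenated=$(printf 'Dell XPS 32\n' | awk '{print $1 $2}')

# Concatenation with an explicit separator string
custom=$(printf 'Dell XPS 32\n' | awk '{print $1 "-" $2}')

echo "$with_commas"
echo "$concatenated"
echo "$custom"
```

The three commands print `Dell XPS`, `DellXPS`, and `Dell-XPS` respectively.
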

Because the values vary in length and are separated by single spaces, these messages do not line up in columns and are still a bit difficult to scan. Next, let's see how printf can format the output.

Table Formatting with `awk`

The printf function in awk offers more control over the formatting of the output compared to the print command. The syntax for printf is:

```
awk '{printf format, item1, item2, ..., itemN}' input-file
```

The format string includes text and format specifiers that begin with %. Common format specifiers include:

  • %d: Integer
  • %s: String
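  • \n: Newline character. This is an escape sequence rather than a format specifier; unlike print, printf does not add a newline automatically.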

Modifiers can also be added to control the width and alignment:

Minimum Field Width: The number between % and the format specifier defines the minimum width of the field.

Positive Width: Right-justified by default. For example, %10s formats a string, right-aligned, with a minimum width of 10 characters.

Negative Width: Left-justified if prefixed with a minus sign. For example, %-10s formats a string, left-aligned, with a minimum width of 10 characters.
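
To make the effect of the width visible, this sketch wraps each formatted value in square brackets. The sample values are arbitrary:

```shell
#!/bin/bash

# Right-aligned in a field of 10 characters
right=$(echo "Yoga" | awk '{printf "[%10s]\n", $1}')

# Left-aligned in a field of 10 characters
left=$(echo "Yoga" | awk '{printf "[%-10s]\n", $1}')

# The width is a minimum: 5 is padded to 2 characters, 128 is printed in full
numbers=$(echo "5 128" | awk '{printf "[%2d] [%2d]\n", $1, $2}')

echo "$right"
echo "$left"
echo "$numbers"
```

The first two results are both 12 characters long, with the padding on the left and on the right of `Yoga` respectively. The last line prints `[ 5] [128]`.
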

Now, let’s format our output as a neatly aligned table:

```bash
#!/bin/bash

# Using `awk` to format output as a table
awk 'BEGIN {print "Brand    Model      RAM"} NR>1 {printf "%-8s %-10s %2d\n", $1, $2, $3}' data.txt
```
Breakdown of the Command

BEGIN {print "Brand Model RAM"}

  • The BEGIN block is executed before any lines from the input file are processed.
  • {print "Brand Model RAM"} This prints the header row "Brand Model RAM" before processing the actual data. The string contains specific spaces to align the header with the columns that will follow.

NR > 1 ensures that the action {printf ...} is applied only to lines after the first one, which is the header line.

  • %-8s: Left-align (-) a string (s) with a width of 8 characters.

  • %-10s: Left-align (-) a string (s) with a width of 10 characters.

  • %2d: Print an integer (d) with a minimum width of 2 characters. Wider values such as 128 are printed in full.

  • \n: Newline character to move to the next line after printing.

  • $1, $2, $3: These are the fields to be printed according to the format specifiers.

  • data.txt: The input file that contains the data to be processed.

The output of the command is:

```
Brand    Model      RAM
Apple    MacBook    32
Apple    iPad       16
Dell     XPS        32
Dell     Inspiron   128
Lenovo   ThinkPad   128
Lenovo   Yoga       256
Apple    MacBook    64
```

Using printf and format specifiers, our table output looks much cleaner!
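
One optional refinement, sketched here under the assumption that the RAM column should be right-aligned: print the header with the same widths as the rows, so the alignment does not depend on hand-counted spaces. The %3s and %3d widths are a choice made for this sketch, and the rows are a shortened sample fed on standard input.

```shell
#!/bin/bash

# Header and data rows share the same column widths: 8, 10, and 3
table=$(awk 'BEGIN {printf "%-8s %-10s %3s\n", "Brand", "Model", "RAM"}
             NR > 1 {printf "%-8s %-10s %3d\n", $1, $2, $3}' << EOF
Brand Model RAM
Apple iPad 16
Lenovo Yoga 256
EOF
)

echo "$table"
```

Every line of the result is 23 characters wide, and the RAM values line up on their last digit.
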

Summary and Next Steps

Great job! In this lesson, you learned how to:

  • Create and manipulate data files using heredoc syntax.
  • Print the entire content of a file using awk.
  • Extract specific columns from a file using field numbers.
  • Filter lines based on conditions using awk.
  • Perform pattern matching with awk.
  • Calculate and display average values using the END block.
  • Customize and format output using printf in awk.

Now, it’s time to apply what you’ve learned. Head to the practice section to sharpen your awk skills through hands-on exercises. Happy coding!
