Part 2: Regular Expression, File IO & Text Processing

Regular Expression (Regexe)

A Regular Expression (or Regexe) is a pattern (or filter) that describes a set of strings that matches the pattern.  In other words, a regexe accepts a certain set of strings and rejects the rest. A string either matches a regexe or it does not.

A regular expression consists of a sequence of characters, meta-characters (such as ., \d, \s, \S) and operators (such as +, *, ?, |, ^).  In Perl, regexes are delimited by a pair of forward slashes /.../.

Regexes are constructed by combining many smaller sub-expressions. The fundamental building blocks are patterns that match a single character. Most characters, including all letters and digits, match themselves. For example, the regexe /Friday/ matches the string 'Friday' exactly (and rejects the others). To match a special character (such as +, *, \) literally, you precede the character with a backslash (such as \+, \*, and \\).  For example, regexe /10\+/ matches the string '10+'.

Substitution Operator s///

You can substitute a string (or a portion of a string) with another string using the s/// substitution operator. The syntax is:

s/regexe/replacement/modifiers

By default s/// operates on the default variable $_. It substitutes the portion of string in $_ that matches regexe by the replacement string. For example,

use strict;
use warnings;
while (<>) {            # Read input into default variable $_
   s/is/at/;            # Substitute the first 'is' in $_ with 'at'
   print 'Result: ', $_;
}

> perl
This is an apple
Result: That is an apple
That is a pineapple
Result: That at a pineapple
A beautiful oasis
Result: A beautiful oasat
Is that an apple?
Result: Is that an apple?

Modifiers (such as /g, /i, /e, /o, /s and /x) can be used to control the behavior of s///. By default, only the first occurrence of the matching string in each line is replaced; use the modifier /g to request global replacement. By default, matching is case-sensitive; use the modifier /i to enable case-insensitive matching.

#!/usr/bin/perl
use strict;
use warnings;
while (<>) {            # Read input into default variable $_
   s/is/at/gi;          # Global and case-insensitive substitution
   print 'Result: ', $_;
}

> perl
This is an OASIS
Result: That at an OASat

OR (|)

A vertical bar | can be used to include alternatives in a regexe, e.g.,

#!/usr/bin/perl
use strict;
use warnings;
while (<>) {              # Read input into default variable $_
   s/four|for|floor/4/gi; # Replace 'four', 'for' or 'floor' with 4
   print 'Result: ', $_;
}

> perl
For the four persons on the floor
Result: 4 the 4 persons on the 4

You can use variables in both the pattern and the replacement, e.g.,

my $replacement = 4;
my $pattern = 'four|for|floor';
s/$pattern/$replacement/gi;   # Same effect as s/four|for|floor/4/gi

Bracket [ ] and Range [ - ] Expressions

A bracket expression is a list of characters enclosed by [ ], also called a character class. It matches any single character in the list. However, if the first character of the list is the caret (^), then it matches any single character NOT in the list. For example, the regexe [02468] matches a single digit 0, 2, 4, 6, or 8; the regexe [^02468] matches any single character other than 0, 2, 4, 6, or 8.

Instead of listing all characters, you could use a range expression inside the bracket. A range expression consists of 2 characters separated by a hyphen (-). It matches any single character that sorts between the two characters, inclusive. For example, [a-d] is the same as [abcd]. You could include a caret (^) in front of the range to invert the matching. For example, [^a-d] is equivalent to [^abcd].

Some named classes of characters are pre-defined within bracket expressions. They are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] means [0-9A-Za-z]. (Note that the square brackets in these class names are part of the symbolic names, and must be included in addition to the square brackets delimiting the bracket list.)

#!/usr/bin/perl
use strict;
use warnings;
while (<>) {              # Input read into default variable $_
   s/[[:alnum:]][[:alnum:]]/x/g;  # Replace each pair of alphanumerics with x
   print 'Result: ', $_;
}

> perl
This is an apple
Result: xx x x xxe
These are apples
Result: xxe xe xxx

To include a ], place it first in the list. To include a ^, place it anywhere but first. Finally, to include a - place it last. Most metacharacters (such as +, *) lose their special meaning inside bracketed lists.
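
The placement rules above can be checked with a small sketch. The helper name mask and the sample string are made up for this illustration; Perl also accepts a backslash-escaped \], \^ or \- anywhere in the list.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character matched by $class with '#'.
sub mask {
    my ($str, $class) = @_;
    (my $copy = $str) =~ s/$class/#/g;
    return $copy;
}

my $sample = 'a]b^c-d+e';
print mask($sample, qr/[]x]/), "\n";   # ']' placed first is literal: a#b^c-d+e
print mask($sample, qr/[x^]/), "\n";   # '^' not placed first is literal: a]b#c-d+e
print mask($sample, qr/[x-]/), "\n";   # '-' placed last is literal: a]b^c#d+e
print mask($sample, qr/[+*]/), "\n";   # '+' and '*' are literal inside [ ]: a]b^c-d#e
```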

Metacharacters Dot (.), \w, \W, \d, \D, \s, \S

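A metacharacter is a symbol with a special meaning inside a regexe.

.    matches any single character except newline
\w   matches a word character (for ASCII text, [0-9A-Za-z_]); \W matches a non-word character
\d   matches a digit (for ASCII text, [0-9]); \D matches a non-digit
\s   matches a whitespace character (such as space, tab, newline); \S matches a non-whitespace character

For example,

s/\s\s/ /g;    # Replace two whitespaces with a single space
s/\S\S\s/ /g;  # Replace two non-whitespaces followed by a whitespace with a space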
s/\s+/ /g;     # Replace one or more whitespaces with one space

Occurrence Indicators (or Repetition Operators) +, *, ?, {}

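A regexe may be followed by an occurrence indicator (or repetition operator):

?      zero or one occurrence
*      zero or more occurrences
+      one or more occurrences
{m}    exactly m occurrences
{m,}   m or more occurrences
{m,n}  m to n occurrences, inclusive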

For example, /xy{2,4}/ ('x' followed by 2 to 4 'y') matches "xyy", "xyyy" and "xyyyy".


*, +, ?, { } repetition operators are greedy operators, and by default grasp as many characters as possible for a match.

#!/usr/bin/perl -w
use strict;
use warnings;
while (<>) {              # Read input into default variable $_
   s/xy{2,4}/z/g;         # Greedy: take as many 'y' as possible (up to 4)
   print 'Result: ', $_;
}

> perl
xyyxxyyyyxxxyxxyyyyyy
Result: zxzxxxyxzyy

In Perl, you can put an extra ? after the repetition operator to curb its greediness (i.e., stop at the shortest match). For example, *?, +?, ??, {}?.

#!/usr/bin/perl
use strict;
use warnings;
while (<>) {              # Read input into default variable $_
   s/xy{2,4}?/z/g;        # Curb greediness: stop at the shortest match
   print 'Result: ', $_;
}

> perl
xyyxxyyyyxxxyxxyyyyyy
Result: zxzyyxxxyxzyyyy
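
To compare the two behaviours side by side, the following sketch applies the greedy and the non-greedy form to one sample string. The string is an assumption made for this illustration, picked so that it yields the two results shown in this section.

```perl
use strict;
use warnings;

# Sample input, assumed for this sketch.
my $input = 'xyyxxyyyyxxxyxxyyyyyy';

(my $greedy = $input) =~ s/xy{2,4}/z/g;    # takes as many 'y' as allowed (up to 4)
(my $lazy   = $input) =~ s/xy{2,4}?/z/g;   # stops after the minimum of 2 'y'

print "Greedy: $greedy\n";   # zxzxxxyxzyy
print "Lazy:   $lazy\n";     # zxzyyxxxyxzyyyy
```

Note that 'xy' (a single 'y') matches in neither case, because {2,4} requires at least two occurrences.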

Positional Metacharacters (or Position Anchors) ^, $, \b, \B, \<, \>, \A, \Z

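Positional anchors do NOT match any actual character; they match a position in a string (e.g., begin-of-line, end-of-line, begin-of-word, and end-of-word).

^, $    begin and end of the string (begin and end of each line if the /m modifier is used); $ also matches just before a trailing newline
\b, \B  word boundary and non-word-boundary
\A, \Z  begin and end of the string, regardless of the /m modifier; \Z also matches just before a trailing newline
\<, \>  begin-of-word and end-of-word in some other regexe tools (such as GNU grep); Perl does not support them, use \b instead

You can use positional anchors liberally to increase the speed of matching. For example: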

s/ing$/xyz/      # Substitute ending 'ing' with 'xyz'.
/^testing 123$/  # Matches only one pattern. Should use eq instead.
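
As a small illustration of how anchors restrict where a pattern may match, the sketch below counts matches with and without \b and applies the $ anchor in a substitution. The sample strings are made up for this example.

```perl
use strict;
use warnings;

my $text = 'testing the nest in a test';

# \b matches at a word boundary, so only the standalone word 'test' counts.
my $whole_words = () = $text =~ /\btest\b/g;    # 1
# Without anchors, the 'test' inside 'testing' matches as well.
my $anywhere    = () = $text =~ /test/g;        # 2

# $ anchors the match to the end of the string: only the final 'ing' changes.
(my $ending = 'singing') =~ s/ing$/xyz/;        # 'singxyz'

print "$whole_words $anywhere $ending\n";       # 1 2 singxyz
```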


Instead of using slash (/) as delimiter for s/// operator, you can use most of the non-alphanumeric characters (e.g., !, @, #). For example,

s!this!that!g    # ! as delimiter
s#1/2#1/4#g      # # as delimiter

Changing the default delimiter is confusing and not recommended.

Matching Operator m//

You can use the matching operator m// to check if a pattern (in terms of a regexe) exists in a string. The syntax is:

m/regexe/modifiers

m//, by default, operates on the default variable $_. It returns true if $_ matches regexe; and false otherwise.

Similar to s/// operator, instead of using slash (/) as delimiter, you could use other non-alphanumeric characters such as @ and %. However, if slash (/) is used as the delimiter, the operator m can be omitted. For example, /.\./ matches any character (except newline) followed by a period.


if (/test/) { $test_mode = 'yes' }  # if $_ contains test.
print 'have space' if /\s/;         # if $_ contains whitespace.

Operators =~ and !~ for s/// and m//

By default, the substitution and matching operators operate on the default variable $_. To operate on another variable instead of $_, you could use the =~ and !~ operators as follows:

str =~ m/regexe/ is true if str matches regexe.
str !~ m/regexe/ is true if str does not match regexe.
str =~ s/regexe/replacement/modifiers replaces occurrences of regexe with replacement in str.

When used with m//, =~ behaves like comparison (== or eq). When used with s///, =~ behaves like assignment (=).

For example:

my $msg = 'hello, hello world';
if ($msg =~ /hello/) {     # Check if $msg contains 'hello'
   print 'Hello world';
}
$msg =~ s/hello/hi/g;      # Substitute 'hello' with 'hi' in $msg

print 'yes or no? ';
my $reply;
chomp($reply = <>);        # Read a line and remove its newline
if ($reply =~ /^y/) {      # Begins with 'y'
   print 'positive!';
} else {
   print 'negative!';
}

More String Functions: split and join

split(regexe, str) or split(regexe, str, numItems): splits the given str wherever the regular expression regexe matches, and returns the items as a list. The optional third parameter specifies the maximum number of items to return; the last item then holds the rest of the string.

join(joinStr, strList) joins the items in strList with the given joinStr (possibly empty).

For example,

use strict;
use warnings;
use 5.010;
my $msg = 'Hello, world again!';
my @words = split(/ /, $msg);  # ('Hello,', 'world', 'again!')
for (@words) { say; }          # Use default scalar variable
say join('--', @words);        # 'Hello,--world--again!'
my $newMsg = join '', @words;  # 'Hello,worldagain!'
say $newMsg;
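
The optional third parameter of split and a regexe separator are not shown in the example above. A minimal sketch, using made-up sample data:

```perl
use strict;
use warnings;

my $record = 'alice:x:1000:1000:Alice Smith:/home/alice';

# With a limit of 3, the third item keeps the unsplit remainder.
my @first3 = split(/:/, $record, 3);
# ('alice', 'x', '1000:1000:Alice Smith:/home/alice')

# A regexe separator: split on runs of whitespace and/or commas.
my @items = split(/[\s,]+/, 'red, green  blue,yellow');
# ('red', 'green', 'blue', 'yellow')

print scalar(@first3), ': ', join(' | ', @first3), "\n";
print join('-', @items), "\n";    # red-green-blue-yellow
```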

Parenthesized Back-References & Matched Variables $1,..., $9

Parentheses ( ) serve two purposes in regexes.

Firstly, parentheses ( ) can be used to group sub-expressions for overriding the precedence or applying a repetition operator. For example, /(a|e|i|o|u){3,5}/ matches 3 to 5 consecutive vowels in any mix, the same as /[aeiou]{3,5}/. Without the parentheses, /a|e|i|o|u{3,5}/ would apply {3,5} to u only.

Secondly, parentheses are used to provide the so-called back-references. A back-reference contains the matched sub-string. For example, the regexe /(\S+)/ creates one back-reference (\S+), which contains the first word (consecutive non-spaces) in the input string; the regexe /(\S+)\s+(\S+)/ creates two back-references: (\S+) and another (\S+), containing the first two words, separated by one or more spaces \s+.

The back-references are stored in special variables $1, $2, …, $9, where $1 contains the substring matched by the first pair of parentheses, and so on. For example, /(\S+)\s+(\S+)/ creates two back-references which match the first two words. The matched words are stored in $1 and $2, respectively.

For example, the following expression swaps the first and second words:

s/(\S+) (\S+)/$2 $1/;   # Swap the first and second words separated by a single space

Back-references can also be referenced in your program. For example,

(my $word) = ($str =~ /(\S+)/);

The parentheses create one back-reference, which matches the first word of $str if there is one, and is placed inside the scalar variable $word. If there is no match, $word is undef.

Another example,

(my $word1, my $word2) = ($str =~ /(\S+)\s+(\S+)/);

The 2 pairs of parentheses place the first two words (separated by one or more white-spaces) of $str into variables $word1 and $word2 if there are at least two words; otherwise, both $word1 and $word2 are undef. Note that the whole regexe must match; there is no partial matching.

\1, \2, \3 have the same meaning as $1, $2, $3, but are valid only inside the pattern of s/// or m// (use $1, $2, $3 in the replacement). For example, /(\S+)\s\1/ matches a pair of repeated words, separated by a white-space.
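
The difference between \1 (inside the pattern) and $1 (in the replacement) can be seen in one substitution. The helper name dedupe_words is made up for this sketch, which collapses a word that appears twice in a row.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse a word repeated twice in a row into one.
sub dedupe_words {
    my ($str) = @_;
    # \1 refers back to the first group inside the pattern;
    # $1 is the same text, used in the replacement.
    $str =~ s/\b(\w+)\s+\1\b/$1/g;
    return $str;
}

print dedupe_words('this is is a test test'), "\n";   # this is a test
```

The \b anchors keep the 'is' inside 'this' from being treated as the first half of a repeated pair.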

Special Variables $`, $&, $' and $+
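
After a successful match, Perl sets $` to the text before the match, $& to the matched text, $' to the text after the match, and $+ to the last parenthesized group that matched. A minimal sketch follows; the helper name match_parts, the pattern, and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return (before, matched, after, last group) or an empty list.
sub match_parts {
    my ($str) = @_;
    return unless $str =~ /(quick|slow) (brown|red)/;
    return ($`, $&, $', $+);
}

my ($before, $matched, $after, $last_group) = match_parts('The quick brown fox');
print "before=[$before] matched=[$matched] after=[$after] last=[$last_group]\n";
# before=[The ] matched=[quick brown] after=[ fox] last=[brown]
```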

Character Translation Operator tr///

You can use the translation operator tr/// to translate a character into another character. The syntax is:

tr/fromchars/tochars/modifiers

tr/// replaces or translates fromchars to tochars in $_, and returns the number of characters replaced.

For example,

tr/a-z/A-Z/         # converts $_ to uppercase.
tr/dog/cat/         # translates d to c, o to a, g to t.
$str =~ tr/0-9/a-j/ # replace 0 by a, etc in $str.
tr/A-CG/KX-Z/       # replace A by K, B by X, C by Y, G by Z.

Instead of forward slash (/), you can use parentheses (), brackets [], curly bracket {} as delimiter, e.g.,

tr[0-9][##########]  # replace numbers by #.
tr{!.}(.!)           # swap ! and ., one pass.

If tochars is shorter than fromchars, the last character of tochars is used repeatedly.

tr/a-z/A-E/       # f to z is replaced by E.

tr/// returns the number of replaced characters. You can use it to count the occurrences of certain characters. For example,

my $numLetters = ($string =~ tr/a-zA-Z/a-zA-Z/);
my $numDigits  = ($string =~ tr/0-9/0-9/);
my $numSpaces  = ($string =~ tr/ / /);

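Modifiers /c, /d and /s for tr///

/c complements fromchars, i.e., translates all the characters NOT in fromchars.
/d deletes the characters in fromchars that have no corresponding character in tochars.
/s squashes a run of characters that are translated to the same character into a single character.

For example,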

tr/A-Za-z/ /c  # replaces all non-alphabets with space
tr/A-Z//d      # deletes all uppercase (matched with no replacement).
tr/A-Za-z//dc  # deletes all non-alphabets
tr/!//s        # squashes duplicate !
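
To see the effect of each modifier on actual data, the sketch below applies them to copies of one made-up string, so that the original stays unchanged.

```perl
use strict;
use warnings;

my $text = 'Hello,   World!!! 123';

(my $letters_only = $text) =~ tr/A-Za-z//cd;   # delete everything except letters
(my $squeezed     = $text) =~ tr/ !//s;        # squash runs of spaces and of '!'
(my $blanked      = $text) =~ tr/A-Za-z/ /c;   # every non-letter becomes a space
my $upper_count   = ($text =~ tr/A-Z//);       # count only; $text is not modified

print "$letters_only\n";   # HelloWorld
print "$squeezed\n";       # Hello, World! 123
print "$upper_count\n";    # 2
```

With an empty tochars and no /d, tr/// maps each character to itself, which is why it can count characters without changing the string.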

Modifiers /m and /s for s/// and m//

/m lets the ^ and $ anchors match more than once. By default, s/// and m// treat the input as one string, even though it may contain newlines. With the /m modifier, ^ and $ also match at the beginning and end of each line within the string.

/s permits . to match the newline.
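
The following sketch contrasts matching with and without each modifier on a made-up three-line string.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\nthird line\n";

my @no_m   = ($text =~ /^(\w+)/g);     # ^ matches only at the start of the string
my @with_m = ($text =~ /^(\w+)/mg);    # ^ also matches after every newline

my $dot_plain = ($text =~ /first.*second/)  ? 1 : 0;   # . stops at the newline
my $dot_s     = ($text =~ /first.*second/s) ? 1 : 0;   # . may cross the newline

print "@no_m\n";               # first
print "@with_m\n";             # first second third
print "$dot_plain $dot_s\n";   # 0 1
```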

Functions grep, map
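
grep returns the elements of a list for which a condition (often a regexe match on $_) is true, and map builds a new list by evaluating an expression for each element. A short sketch, using a made-up word list:

```perl
use strict;
use warnings;

my @words = ('apple', 'Banana', 'cherry', 'avocado');

my @a_words = grep { /^a/i } @words;                    # keep items beginning with 'a'
my @lengths = map  { length($_) } @words;               # transform every item
my @upper   = map  { uc($_) } grep { /an/ } @words;     # the two can be chained

print "@a_words\n";   # apple avocado
print "@lengths\n";   # 5 6 6 7
print "@upper\n";     # BANANA
```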

File Input/Output


A filehandle is a data structure that your program uses to manipulate files.  A filehandle acts as a gate between your program and the files, directories, or other programs.  Your program first opens a gate, then sends or receives data through the gate, and finally closes the gate. There are many types of gates: one-way vs. two-way, slow vs. fast, wide vs. narrow.

Naming Convention: use uppercase for the name of the filehandle, e.g., FILE, DIR, FILEIN, FILEOUT, etc.

Once a filehandle is created and connected to a file (or a directory, or a program), you can read or write to the underlying file through the filehandle using angle brackets, e.g., <FILEHANDLE>.

Example: Read and print the content of a text file via a filehandle.

use strict;
use warnings;
# Read & print the content of a text file.
my $filename = shift;    # Get the filename from command line.
# Create a filehandle called FILE and connect to the file.
open(FILE, $filename) or die "Can't open $filename: $!";

while (<FILE>) {      # Set $_ to each line of the file
   print;             # Print $_
}
close(FILE);

Example: Search and print lines containing a particular search word.

use strict;
use warnings;
# Search for lines containing a search word.
(my $filename, my $word) = @ARGV;   # Get filename & search word.

# Create a filehandle called FILE and connect to the file.
open(FILE, $filename) or die "Can't open $filename: $!";
while (<FILE>) {           # Set $_ to each line of the file
   print if /\b$word\b/i;  # Match $_ with word, case insensitive
}
close(FILE);

Example: Print the content of a directory via a directory handle.

use strict;
use warnings;
# Print the content of a directory.
my $dirname = shift;       # Get directory name from command-line
opendir(DIR, $dirname) or die "Can't open directory $dirname: $!";
my @files = readdir(DIR);
closedir(DIR);
foreach my $file (@files) {
  # Display files not beginning with dot.
  print "$file\n" if ($file !~ /^\./);
}

You can use C-style printf for formatted output to a file.

File Handling Functions

Function open: open(filehandle, string) opens the file named by string and associates it with the filehandle. It returns true on success and undef otherwise.

Function close: close(filehandle) closes the file associated with the filehandle. Closing a file flushes the output buffer to the file.  When the program exits, Perl closes all opened filehandles, so an explicit close is needed only to guard against the program being aborted: it ensures that the buffered data has reached the file.

A common procedure for modifying a file is to:

  1. Read in the entire file with open(FILE, $filename) and @lines = <FILE>.
  2. Close the filehandle.
  3. Operate upon @lines (which is in the fast RAM) rather than FILE (which is in the slow disk).
  4. Write the new file contents using open(FILE, ">$filename") and print FILE @lines.
  5. Close the file handle.

Example: Read the contents of the entire file into memory; modify and write back to disk.

use strict;
use warnings;
my $filename = shift;       # Get the filename from command line.

# Create a filehandle called FILE and connect to the file.
open(FILE, $filename) or die "Can't open $filename: $!";
# Read the entire file into an array in memory.
my @lines = <FILE>;
close(FILE);

open(FILE, ">$filename") or die "Can't write to $filename: $!";
foreach my $line (@lines) {
   print FILE uc($line);   # Change to uppercase
}
close(FILE);

Example: Reading from a file

use strict;
use warnings;
open(FILEIN, "test.txt") or die "Can't open file: $!";
while (<FILEIN>) {     # set $_ to each line of the file.
   print;              # print $_
}
close(FILEIN);

Example: Writing to a file

use strict;
use warnings;
my $filename = shift;         # Get the file from command line.
open(FILE, ">$filename") or die "Can't write to $filename: $!";
print FILE "This is line 1\n";    # no comma after FILE.
print FILE "This is line 2\n";
print FILE "This is line 3\n";

Example: Appending to a file

use strict;
use warnings;
my $filename = shift;             # Get the file from command line.
open(FILE, ">>$filename") or die "Can't append to $filename: $!";
print FILE "This is line 4\n";     # no comma after FILE.
print FILE "This is line 5\n";

In-Place Editing

Instead of reading in one file and writing to another file, you could do in-place editing by specifying the -i flag or by using the special variable $^I.

Example: In-place editing using the -i flag

#!/usr/bin/perl -i.old
# In-place edit, backup as '.old'
use strict;
use warnings;

while (<>) {
  print;           # Print to the file, not STDOUT.
}

Example: In-place editing using the $^I special variable.

use strict;
use warnings;

$^I = '.bak';      # Enable in-place editing, backup in '.bak'.
while (<>) {
  print;           # Print to the file, not STDOUT.
}

Functions seek, tell, truncate

seek(filehandle, position, whence): moves the file pointer of the filehandle to position, as measured from whence. seek() returns 1 upon success and 0 otherwise.  File position is measured in bytes.  A whence of 0 measures from the beginning of the file; 1 measures from the current position; and 2 measures from the end.  For example:

seek(FILE, 0, 2);    # 0 bytes from end-of-file; tell(FILE) then gives the file size.
seek(FILE, -2, 2);   # 2 bytes before end-of-file.
seek(FILE, -10, 1);  # Move file pointer 10 bytes backward.
seek(FILE, 20, 0);   # 20 bytes from the begin-of-file.

tell(filehandle): returns the current file position of filehandle.

truncate(FILE, length): truncates FILE to length bytes.  FILE can be either a filehandle or a file name.

To find the length of a file, you could:

seek(FILE, 0, 2);   # Move file pointer to end of file.
print tell(FILE);   # Print the file size.
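
The three functions can be tried together on a scratch file. This is a sketch under stated assumptions: it uses the core module File::Temp to create a temporary file, and a lexical filehandle ($fh) with the three-argument form of open, neither of which appears elsewhere in this section.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file with 10 known bytes (removed automatically at exit).
my ($tmp, $tmpname) = tempfile(UNLINK => 1);
binmode($tmp);
print $tmp '0123456789';
close($tmp);

open(my $fh, '+<', $tmpname) or die "Can't open $tmpname: $!";
seek($fh, 0, 2);               # jump to end-of-file
my $size = tell($fh);          # 10
seek($fh, -3, 2);              # 3 bytes before end-of-file
read($fh, my $tail, 3);        # '789'
truncate($fh, 4);              # keep only the first 4 bytes
seek($fh, 0, 2);
my $new_size = tell($fh);      # 4
close($fh);

print "$size $tail $new_size\n";   # 10 789 4
```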

Example: Truncate the last 2 bytes if they begin with \x0D.

use strict;
use warnings;
my $filename = shift;            # Get the file from command line.
open(FILE, "+<$filename") or die "Can't open $filename: $!";
seek(FILE, -2, 2);        # 2 bytes before end-of-file.
my $pos = tell FILE;
my $data = <FILE>;        # Reading moves the file pointer.
if ($data =~ /^\x0D/) {   # Begins with 0D
  truncate FILE, $pos;    # Truncate the last 2 bytes.
}
close(FILE);

Function eof

eof(filehandle) returns 1 if the file pointer is positioned at the end of the file or if the filehandle is not opened.

Reading Bytes Instead of Lines

The function read(filehandle, var, length, offset) reads length bytes from filehandle starting from the current file pointer, and saves them into variable var starting from offset (if omitted, default is 0).  The bytes include \x0A, \x0D etc.


use strict;
use warnings;
(my $numbytes, my $filename) = @ARGV;
open(FILE, $filename) or die "Can't open $filename: $!";
my $data;
read(FILE, $data, $numbytes);
print $data, "\n----\n";
read(FILE, $data, $numbytes);    # continue from current file ptr
print $data, "\n----\n";
read(FILE, $data, $numbytes, 2);  # save in $data offset 2
print $data, "\n----\n";

Piping Data To and From a Process

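If you wish your program to receive data from a process, or to send data to a process, you could open a pipe to an external program:

open(filehandle, "command|") opens a pipe from the command, so that your program can read its output.
open(filehandle, "|command") opens a pipe to the command, so that your program can write to its input.

Both of these statements return the Process ID (PID) of the command.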

Example: The dir command lists the current directory.  By opening a pipe from dir, you can access its output.

use strict;
use warnings;
open(PIPEFROM, "dir|") or die "Pipe failed: $!";
while (<PIPEFROM>) {
   print;                 # Print each line of the command's output
}
close(PIPEFROM);

Example: This example shows how you can pipe input into the sendmail program.

use strict;
use warnings;
my $my_login = 'test';
open(MAIL, "| sendmail -t -f$my_login") or die "Pipe failed: $!";
print MAIL "From:\n";             # no comma after MAIL.
print MAIL "To:\n";
print MAIL "Subject: test\n";
print MAIL "\n";
print MAIL "Testing line 1\n";
print MAIL "Testing line 2\n";
close MAIL;

You cannot pipe data both to and from a command.  If you want to read the output of a command that you have opened with the |command, send the output to a file.  For example,

open (PIPETO, "|command > /output.txt");

Deleting file: Function unlink

unlink(FILES) deletes the FILES, returning the number of files deleted.  Do not use unlink() to delete a directory, use rmdir() instead. For example,

unlink $filename;
unlink "/var/adm/message";
unlink "message";

Inspecting Files

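You can inspect a file using the (-test FILE) condition.  The condition returns true if FILE satisfies test.  FILE can be a filehandle or filename.  The available tests include:

-e          FILE exists
-f, -d, -l  FILE is a plain file, a directory, a symbolic link
-r, -w, -x  FILE is readable, writable, executable
-s          FILE is not empty (returns the size in bytes)
-z          FILE has zero size
-T, -B      FILE looks like a text file, a binary file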


use strict;
use warnings;
my $dir = shift;
opendir(DIR, $dir) or die "Can't open directory: $!";
my @files = readdir(DIR);
foreach my $file (@files) {
   if (-f "$dir/$file") {
      print "$file is a file\n";
      print "$file seems to be a text file\n" if (-T "$dir/$file");
      print "$file seems to be a binary file\n" if (-B "$dir/$file");
      my $size = -s "$dir/$file";
      print "$file size is $size\n";
      print "$file is empty\n" if (-z "$dir/$file");
   } elsif (-d "$dir/$file") {
      print "$file is a directory\n";
   }
   print "$file is readable\n" if (-r "$dir/$file");
   print "$file is writable\n" if (-w "$dir/$file");
   print "$file is executable\n" if (-x "$dir/$file");
}

Function stat and lstat

The function stat(FILE) returns a 13-element array giving the vital statistics of FILE. lstat(SYMLINK) returns the same thing for the symbolic link SYMLINK.

The elements are:

Index Value
0 The device
1 The file's inode
2 The file's mode
3 The number of hard links to the file
4 The user ID of the file's owner
5 The group ID of the file
6 The raw device
7 The size of the file
8 The last accessed time
9 The last modified time
10 The last time the file's status changed
11 The block size of the system
12 The number of blocks used by the file

For example: The command

perl -e "$size= (stat('test.txt'))[7]; print $size"

prints the file size of "test.txt".
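
Inside a script, the same elements can be picked out with an array slice. The sketch below is only an illustration: it assumes the core module File::Temp to create a small scratch file, and checks that element 7 agrees with the -s file test.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file holding exactly 6 bytes (removed automatically at exit).
my ($tmp, $tmpname) = tempfile(UNLINK => 1);
binmode($tmp);
print $tmp "hello\n";
close($tmp);

my @info = stat($tmpname) or die "Can't stat $tmpname: $!";
my ($size, $mtime) = @info[7, 9];      # index 7: size, index 9: last modified time

print "Size: $size bytes\n";
print 'Last modified: ', scalar(localtime($mtime)), "\n";
print 'Same as -s: ', ((-s $tmpname) == $size ? 'yes' : 'no'), "\n";
```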

Accessing the Directories

Example: Print the contents of a given directory.

use strict;
use warnings;
my $dirname = shift;      # first command-line argument.
opendir(DIR, $dirname) or die "can't open $dirname: $!\n";
my @files = readdir(DIR);
closedir(DIR);
foreach my $file (@files) {
   print "$file\n";
}

Example:  Removing empty files in a given directory

use strict;
use warnings;

my $dirname = shift;
opendir(DIR, $dirname) or die "Can't open directory: $!";
my @files = readdir(DIR);
closedir(DIR);
foreach my $file (@files) {
   if ((-f "$dirname/$file") && (-z "$dirname/$file")) {
      print "deleting $dirname/$file\n";
      unlink "$dirname/$file";
   }
}

Example: Display files matching "*.txt"

my @files = glob('*.txt');
foreach (@files) { print; print "\n" }

Example: Display files matching the command-line pattern.

my $file = shift;
my @files = glob($file);
foreach (@files) {
   print;
   print "\n";
}

Standard Filehandles

Perl defines the following standard filehandles:

STDIN   standard input (by default, the keyboard)
STDOUT  standard output (by default, the screen)
STDERR  standard error (by default, the screen)
ARGV    the files named on the command line, read one after another

For example:

my $line = <STDIN>;    # Set $line to the next line of user input
my $item = <ARGV>;     # Set $item to the next line of the files named on the command line
my @items = <ARGV>;    # Put all the remaining lines of those files into the array.

When you use empty angle brackets <> to get input, Perl fills in ARGV or STDIN for you automatically.  Whenever you use the print() function without a filehandle, it uses the STDOUT filehandle.

<> behaves like <ARGV> when there is still data to be read from the command-line files, and behaves like <STDIN> otherwise.

Text Formatting

Function write

write(filehandle): prints formatted text to filehandle, using the format associated with filehandle. If filehandle is omitted, STDOUT is used.

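Declaring format

format name =
picture line
value1, value2, ...
.

A format consists of picture lines, each one that contains fields being followed by a line listing the values for those fields, and ends with a line containing a single dot. By default, the format associated with a filehandle has the same name as the filehandle.

Picture Field @<, @|, @>

@< formats a left-justified field, @| a centered field, and @> a right-justified field. @<, @>, @| can be repeated to control the number of characters to be formatted. The number of characters to be formatted is the same as the length of the picture field. @###.## formats numbers by lining up the decimal points under ".".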

For example,
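
A minimal sketch of a format declaration used with write follows. The format name ITEM_LINE, the variables, and the helper render_item are made up for illustration; the helper writes to an in-memory filehandle so that the formatted line can be returned as a string, and sets the special variable $~ to select the format by name.

```perl
use strict;
use warnings;

# Package variables referenced by the format (names assumed for this sketch).
our ($item, $qty, $price);

# One picture line, followed by the line of values that fill its fields.
format ITEM_LINE =
@<<<<<<<<< @>>> @##.##
$item,     $qty, $price
.

# Hypothetical helper: render one line with the ITEM_LINE format into a string.
sub render_item {
    ($item, $qty, $price) = @_;
    my $buffer = '';
    open(my $out, '>', \$buffer) or die "Can't open in-memory file: $!";
    my $previous = select($out);   # $~ applies to the currently selected handle
    $~ = 'ITEM_LINE';              # use ITEM_LINE instead of the default format name
    select($previous);
    write($out);
    close($out);
    return $buffer;
}

print render_item('apple', 3, 1.5);
print render_item('watermelon', 12, 10.25);
```

The name is left-justified in 10 columns, the quantity right-justified in 4 columns, and the price aligned on its decimal point.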


Printing Formatting String printf

printf(filehandle template, list): prints a formatted string to filehandle (similar to C's fprintf()). As with print, there is no comma after the filehandle. For example,

printf(FILE "The number is %d", 15);

The available formatting fields are:

Field Expected Value
%s String
%c Character
%d Decimal number
%ld Long decimal Number
%u Unsigned decimal number
%x Hexadecimal number
%lx Long hexadecimal number
%o Octal number
%lo Long octal number
%f Fixed-point floating-point number
%e Exponential floating-point number
%g Compact floating-point number
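
Several of these fields can be combined with width and precision settings. The sketch below uses sprintf, which takes the same template as printf but returns the string instead of printing it; the values are made up for illustration.

```perl
use strict;
use warnings;

# %-8s: left-justified string in 8 columns; %5d: integer in 5 columns;
# %x and %o: hexadecimal and octal; %8.3f: 8 columns, 3 decimal places.
my $line = sprintf("%-8s|%5d|%x|%o|%8.3f", 'total', 42, 255, 8, 3.14159);
print "$line\n";    # total   |   42|ff|10|   3.142

printf("%s has %d items\n", 'cart', 3);    # prints to STDOUT by default
```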



Latest version tested: Perl 5.10.0 (cygwin)
Last modified: October 6, 2009