Practice Makes Regexp


English Pages [206]


Table of contents :
Frontmatter
Regexp use from programming languages
Input data
Exercises
Simple regexps
Character classes
Alternation
Anchoring
Groups
Flags
Backreferences
Replace
Unix shell


Practice Makes Regexp 50 exercises to help you master regular expressions Reuven M. Lerner, PhD

Contents Preface: Practice Makes Regexp 1 About me 2 Acknowledgements

Chapter 1 Regexp use from programming languages 1.1 Python 1.1.1 Defining regexps 1.1.2 Finding one 1.1.3 Finding more than one 1.1.4 Substituting text 1.1.5 Flags 1.1.6 Advanced features 1.1.7 More information 1.1.8 About Python solutions 1.2 Ruby 1.2.1 Defining regexps 1.2.2 Finding one 1.2.3 Finding more than one 1.2.4 Substituting text 1.2.5 Flags 1.2.6 Advanced features 1.2.7 More information 1.2.8 About Ruby solutions 1.3 JavaScript 1.3.1 Defining regexps 1.3.2 Finding one or more 1.3.3 Substituting text 1.3.4 Advanced features 1.3.5 More information 1.3.6 About JavaScript solutions 1.4 PostgreSQL 1.4.1 Defining regexps 1.4.2 True/false operators 1.4.3 Extracting text 1.4.4 Splitting 1.4.5 More information 1.5 grep 1.5.1 Basic use 1.5.2 Backslashes

1.5.3 Context

Chapter 2 Input data 2.1 Dictionary (words.txt) 2.2 Alice in Wonderland (alice.txt) 2.3 Config (config.txt) 2.4 Apache logfile (access-log.txt) 2.5 Linux “passwd” file (passwd.txt) 2.6 Fakelog (fakelog.txt) 2.7 PostgreSQL database

Chapter 3 Exercises 3.1 Simple regexps 3.1.1 Find matches 3.1.2 Five-letter words 3.1.3 Double “f” in the middle 3.1.4 Extract timestamp 3.2 Character classes 3.2.1 End-of-sentence words 3.2.2 Hex numbers 3.2.3 Hexwords 3.2.4 IP addresses 3.2.5 Long, weird words 3.2.6 Matching URLs 3.2.7 Non-zero hours 3.2.8 Quoted text 3.2.9 Supervocalic 3.2.10 Double triple vowel 3.2.11 Postfix dollar 3.3 Alternation 3.3.1 Multiple date formats 3.3.2 “oo” and “ee” words 3.3.3 British and American spelling 3.4 Anchors 3.4.1 Capital vowel starts 3.4.2 Comment lines 3.4.3 Last five characters 3.4.4 u in the 2nd-to-last word 3.5 Groups 3.5.1 Date and time 3.5.2 Config pairs 3.5.3 Quote first and last words 3.5.4 Prices with symbols 3.5.5 Question first word 3.5.6 t, but no “ing” 3.5.7 Usernames and user IDs

3.5.8 Beheaded usernames 3.5.9 Final question words 3.5.10 “d” user shells 3.6 Flags 3.6.1 All usernames 3.6.2 abc 3.6.3 abcABC 3.6.4 abcABC, extended 3.6.5 No-error IP addresses 3.7 Backreferences 3.7.1 Doubled vowels 3.7.2 Hours and seconds 3.7.3 Seven-letter start-finish words 3.7.4 end-start 3.7.5 Singular and plural 3.8 Replace 3.8.1 Crunch whitespace 3.8.2 New hostname 3.8.3 Detagify 3.8.4 Deunixify paths 3.9 Unix command line 3.9.1 Disk space 3.9.2 Not-today files 3.9.3 Problem logs 3.9.4 Old and new Office files

Chapter 4 Simple regexps 4.1 Find matches 4.1.1 Solution 4.1.2 Python 4.1.3 Ruby 4.1.4 JavaScript 4.1.5 PostgreSQL 4.2 Five-letter words 4.2.1 Solution 4.2.2 Python 4.2.3 Ruby 4.2.4 JavaScript 4.2.5 PostgreSQL 4.3 Double “f” in the middle 4.3.1 Solution 4.3.2 Python 4.3.3 Ruby 4.3.4 JavaScript 4.3.5 PostgreSQL 4.4 Extract timestamp 4.4.1 Solution 4.4.2 Python 4.4.3 Ruby 4.4.4 JavaScript 4.4.5 PostgreSQL

Chapter 5 Character classes 5.1 End-of-sentence words 5.1.1 Solution 5.1.2 Python 5.1.3 Ruby 5.1.4 JavaScript 5.1.5 PostgreSQL 5.2 Hex numbers 5.2.1 Solution 5.2.2 Python 5.2.3 Ruby 5.2.4 JavaScript 5.2.5 PostgreSQL 5.3 Hexwords 5.3.1 Solution 5.3.2 Python 5.3.3 Ruby 5.3.4 JavaScript 5.3.5 PostgreSQL 5.4 IP addresses 5.4.1 Solution 5.4.2 Python 5.4.3 Ruby 5.4.4 JavaScript 5.4.5 PostgreSQL 5.5 Long, weird words 5.5.1 Solution 5.5.2 Python 5.5.3 Ruby 5.5.4 JavaScript 5.5.5 PostgreSQL 5.6 Matching URLs 5.6.1 Solution 5.6.2 Python 5.6.3 Ruby

5.6.4 JavaScript 5.6.5 PostgreSQL 5.7 Non-zero hours 5.7.1 Solution 5.7.2 Python 5.7.3 Ruby 5.7.4 JavaScript 5.7.5 PostgreSQL 5.8 Quoted text 5.8.1 Solution 5.8.2 Python 5.8.3 Ruby 5.8.4 JavaScript 5.8.5 PostgreSQL 5.9 Supervocalic 5.9.1 Solution 5.9.2 Python 5.9.3 Ruby 5.9.4 JavaScript 5.9.5 PostgreSQL 5.10 Double triple vowel 5.10.1 Solution 5.10.2 Python 5.10.3 Ruby 5.10.4 JavaScript 5.10.5 PostgreSQL 5.11 Postfix dollar 5.11.1 Solution 5.11.2 Python 5.11.3 Ruby 5.11.4 JavaScript 5.11.5 PostgreSQL

Chapter 6 Alternation 6.1 Multiple date formats 6.1.1 Solution 6.1.2 Python 6.1.3 Ruby 6.1.4 JavaScript 6.1.5 PostgreSQL 6.2 “oo” and “ee” words 6.2.1 Solution 6.2.2 Python 6.2.3 Ruby 6.2.4 JavaScript 6.2.5 PostgreSQL 6.3 British and American spelling 6.3.1 Solution 6.3.2 Python 6.3.3 Ruby 6.3.4 JavaScript 6.3.5 PostgreSQL

Chapter 7 Anchoring 7.1 Capital vowel starts 7.1.1 Solution 7.1.2 Python 7.1.3 Ruby 7.1.4 JavaScript 7.1.5 PostgreSQL 7.2 Comment lines 7.2.1 Solution 7.2.2 Python 7.2.3 Ruby 7.2.4 JavaScript 7.2.5 PostgreSQL 7.3 Last five characters 7.3.1 Solution 7.3.2 Python 7.3.3 Ruby 7.3.4 JavaScript 7.3.5 PostgreSQL 7.4 u in the 2nd-to-last word 7.4.1 Solution 7.4.2 Python 7.4.3 Ruby 7.4.4 JavaScript 7.4.5 PostgreSQL

Chapter 8 Groups 8.1 Date and time 8.1.1 Solution 8.1.2 Python 8.1.3 Ruby 8.1.4 JavaScript 8.1.5 PostgreSQL 8.2 Config pairs 8.2.1 Solution 8.2.2 Python 8.2.3 Ruby 8.2.4 JavaScript 8.2.5 PostgreSQL 8.3 Quote first and last words 8.3.1 Solution 8.3.2 Python 8.3.3 Ruby 8.3.4 JavaScript 8.3.5 PostgreSQL 8.4 Prices with symbols 8.4.1 Solution 8.4.2 Python 8.4.3 Ruby 8.4.4 JavaScript 8.4.5 PostgreSQL 8.5 Question first word 8.5.1 Solution 8.5.2 Python 8.5.3 Ruby 8.5.4 JavaScript 8.5.5 PostgreSQL 8.6 t, but no “ing” 8.6.1 Solution 8.6.2 Python 8.6.3 Ruby

8.6.4 JavaScript 8.6.5 PostgreSQL 8.7 Usernames and user IDs 8.7.1 Solution 8.7.2 Python 8.7.3 Ruby 8.7.4 JavaScript 8.7.5 PostgreSQL 8.8 Beheaded usernames 8.8.1 Solution 8.8.2 Python 8.8.3 Ruby 8.8.4 JavaScript 8.8.5 PostgreSQL 8.9 Final question words 8.9.1 Solution 8.9.2 Python 8.9.3 Ruby 8.9.4 JavaScript 8.9.5 PostgreSQL 8.10 “d” user shells 8.10.1 Solution 8.10.2 Python 8.10.3 Ruby 8.10.4 JavaScript 8.10.5 PostgreSQL

Chapter 9 Flags 9.1 All usernames 9.1.1 Solution 9.1.2 Python 9.1.3 Ruby 9.1.4 JavaScript 9.1.5 PostgreSQL 9.2 abc 9.2.1 Solution 9.2.2 Python 9.2.3 Ruby 9.2.4 JavaScript 9.2.5 PostgreSQL 9.3 abcABC 9.3.1 Solution 9.3.2 Python 9.3.3 Ruby 9.3.4 JavaScript 9.3.5 PostgreSQL 9.4 abcABC, extended 9.4.1 Solution 9.4.2 Python 9.4.3 Ruby 9.4.4 JavaScript 9.4.5 PostgreSQL 9.5 No-error IP addresses 9.5.1 Solution 9.5.2 Python 9.5.3 Ruby 9.5.4 JavaScript 9.5.5 PostgreSQL

Chapter 10 Backreferences 10.1 Doubled vowels 10.1.1 Solution 10.1.2 Python 10.1.3 Ruby 10.1.4 JavaScript 10.1.5 PostgreSQL 10.2 Hours and seconds 10.2.1 Solution 10.2.2 Python 10.2.3 Ruby 10.2.4 JavaScript 10.2.5 PostgreSQL 10.3 Seven-letter start-finish words 10.3.1 Solution 10.3.2 Python 10.3.3 Ruby 10.3.4 JavaScript 10.3.5 PostgreSQL 10.4 end-start 10.4.1 Solution 10.4.2 Python 10.4.3 Ruby 10.4.4 JavaScript 10.4.5 PostgreSQL 10.5 Singular and plural 10.5.1 Solution 10.5.2 Python 10.5.3 Ruby 10.5.4 JavaScript 10.5.5 PostgreSQL

Chapter 11 Replace 11.1 Replace 11.2 Crunch whitespace 11.2.1 Solution 11.2.2 Python 11.2.3 Ruby 11.2.4 JavaScript 11.2.5 PostgreSQL 11.3 New hostname 11.3.1 Solution 11.3.2 Python 11.3.3 Ruby 11.3.4 JavaScript 11.3.5 PostgreSQL 11.4 Detagify 11.4.1 Solution 11.4.2 Python 11.4.3 Ruby 11.4.4 JavaScript 11.4.5 PostgreSQL 11.5 Deunixify paths 11.5.1 Solution 11.5.2 Python 11.5.3 Ruby 11.5.4 JavaScript 11.5.5 PostgreSQL

Chapter 12 Unix shell 12.1 Disk space 12.1.1 Solution 12.2 Not-today files 12.2.1 Solution 12.3 Problem logs 12.3.1 Solution 12.4 Old and new Office files 12.4.1 Solution

Preface: Practice Makes Regexp

Regular expressions (“regexps”) are often seen as equal parts blessing and curse. On the one hand, they are generally acknowledged to be powerful, useful, and often indispensable tools for identifying and retrieving pieces of text from within a larger corpus. In an age in which we are inundated with text, being able to write programs that can search through gigabytes, finding us specific patterns of text, is nothing short of amazing.

And yet. Regular expressions, for all of their power, remain mysterious, unreadable, and scary. A large number of professional, established programmers I know, who are quite smart and educated, have expressed their doubts about regular expressions – or say that they’ll get around to it one of these days. Or not.

I have to admit that I understand their feelings; my first exposure to regular expressions was in 1988, when I read through the manual for GNU Emacs. The manual’s description of regular expressions seemed intriguing, but when I got to the part of the manual that described how to use them, I wondered whether this was really something that I had to learn, or that I wanted to learn. The answer was a resounding “no,” and I ignored regular expressions for about four more years, when I started to program in Perl. Perl didn’t invent regular expressions, but it did basically require that you use them if you wanted to use the language. It also expanded the standard regular-expression library in many new and different ways, providing additional power – and tricky syntax! – that made it possible to examine, identify, and extract text even more easily than before. If you could master the syntax, of course. So, regular expressions are a technology that is universally seen as powerful and important, but also hard to learn and even harder to put into practice. Much of my time is spent teaching programming courses to large multinational companies, and while a minority of developers there say that they have taught themselves regular expressions, the overwhelming majority are completely unfamiliar with the syntax or use a very small part of regular expressions’ power. I have been teaching regular expressions for years, but it was only in 2015 that I began to teach a separate class on the subject. For two days, we do nothing but drill, drill, drill regexp syntax until it’s coming out of their ears. At the conclusion of the course, participants have written several dozen regexps, and are as a result able to see how to apply them in their own work. (Indeed, one of my favorite things to do in such classes is have

people bring problems from their own work, so that we can build regexps that will be useful in their day-to-day jobs.)

The success of this course has led me to the conclusion that, as with so many things that appear to have inscrutable syntax, understanding of regular expressions comes through practice, experimentation, making mistakes, and then having the “aha!” moment in which it all makes sense. In theory, the workplace can provide such opportunities for practice, but in reality, work is often too busy, inflexible, or harried. Plus, when you’re working on a real problem for work, it is almost by definition a new problem – meaning that there isn’t anyone to walk you through the solution.

This book is aimed at people who have learned the basics of regular expressions, either in a course or from reading a manual, but don’t quite understand when and how to use each part of the regexp syntax. When (and how) do you use groups? When do you define character classes? How (and why) do you create non-capturing groups? This book doesn’t teach regular expressions; you can find numerous tutorials, lectures, and other resources online to get you that far. Rather, this book is intended to get you to understand and internalize regexp syntax through many different exercises.

Most of these exercises are quite short, with a simple requirement. That said, the fact that a regexp’s specification is short, and that the regexp that solves the problem is one line long, doesn’t mean that it’ll be easy for you to come up with the solution. For that reason, every exercise comes with not only the solution, but also explanations and working code in Python, Ruby, JavaScript, and PostgreSQL. A final chapter discusses the

Unix command line, concentrating on the venerable – and invaluable – grep program, which is where most of us first encountered regexps. I chose these technologies because they are used by a large (and growing) number of programmers, and because many of the people using them aren’t aware of the fact that they contain sophisticated regexp engines. (Fine, most Ruby developers probably are – but I have encountered many PostgreSQL developers who had no idea that regexps were baked into the database.) The differences between the various implementations, and the ways in which the languages work with regular expressions, also provide me with a chance to demonstrate the pitfalls that developers encounter when working with regular expressions.

1 About me

I am an independent consultant, and have been since 1995. For many years, I have split my time between developing Web applications, consulting to companies about how to use technology to improve their businesses, and teaching programming courses (in the United States, Europe, Israel, and China). I use regular expressions nearly every day in my work, often in multiple technologies. I got my start as a Web developer back in 1993, when I helped to set up one of the first 100 Web sites in the world for The Tech, MIT’s student newspaper. After working for Hewlett Packard and Time Warner in the United States, I moved to Israel in 1995, and began work as a freelance consultant. In 2014, I completed my PhD in Learning Sciences (computer science + cognitive science + design + education) at Northwestern

University. My dissertation research involved the creation and analysis of the Modeling Commons, an online collaborative community for agent-based models written in NetLogo. I have been the Web technology columnist for Linux Journal since 1996, wrote “Core Perl” for Prentice Hall back in 2000, and self-published Practice Makes Python in 2014. I also give frequent lectures at technology conferences, helping technical and non-technical audiences alike to put new technologies into context.

I live in Modi’in, Israel (halfway between Jerusalem and Tel Aviv) with my wife and three children. In my spare time, I enjoy reading, spending time with my children, and learning Chinese. (When people say that regexps are as difficult as Chinese, I can actually answer them!)

I am very curious to hear from you, the person reading this book. Were the exercises too easy or too hard? Did they focus on the right topics? Are there aspects of regexps that you believe would be more useful to learn and practice? Please let me know what you think, and what improvements, corrections, and additions might be useful in updated editions. You can always reach me at [email protected], or on the Web at http://lerner.co.il.

2 Acknowledgements

I have been fortunate to teach programming to many thousands of people over the years. These students have often given me insights and ideas for new problems, as well as improvements to the solutions that I have provided. I appreciate the feedback and input, and hope that readers of this

book will similarly help to improve my understanding of Python, and the answers provided here. I also thank my family for their constant support, even when they don’t quite know what it is that I do, let alone what “regexps” are.

Chapter 1 Regexp use from programming languages

This book is aimed at people using regular expressions in a variety of programming languages. There are three major problems with this approach, however:

Every programming language implements a slightly different version, or dialect, of the regexp language. Thus, regexps in Python will be slightly different from regexps in JavaScript, which are different from regexps on the Unix command line. Unfortunately, different versions of a language can sometimes support multiple, conflicting regexp dialects; the Unix command line, in particular, has programs that support a variety of regexp dialects, which can make things even more confusing and frustrating.

Every programming language has to implement an interface between the regexp engine and the rest of the language. Thus, how you define the regexp differs from language to language; do you use strings (as in Python and PostgreSQL), or regexp objects delimited by slashes (as in Ruby and JavaScript), or do you create distinct objects? Or do you have multiple options available to you? Then, once you have created your regexp, how do you apply it to a piece of text? What operators, methods, and/or functions are available to you?

Finally, every language and technology produces results from regexp operations in different ways. And in many cases, the ways in which you extract results – especially when working with groups – can affect the results that you get.

While this book is not meant to teach you regular expressions, I do feel compelled to provide a brief survey of how to use them from within each language. I’ll also provide a number of links for each language, so that you can learn about each in greater detail. The higher-level tiers of this book include the 300+ slides that I use in the class I teach in regular expressions, given to a number of Fortune 500 companies over the last few years. Those slides introduce the regexp syntax as used in Python, in part because of Python’s popularity but also because Python offers a rich version of regexps, with more features than many other languages.

1.1 Python

Python comes with a powerful regular expression engine. It is, in many ways, similar to the engine that comes with Perl 5; while this book does not use Perl in its examples, there is no doubt that Perl’s influence on the world of regexps was strong and long lasting. In particular, such options as non-greedy operators and non-capturing groups were innovations from Perl that have made their way into Python and others. As in Perl, and many other programming languages (but unlike grep and Emacs), you use backslashes in Python to neutralize a metacharacter. Thus, + is a metacharacter, indicating that the previous character must appear one or more times – but \+ matches the plain ol’ + character.

1.1.1 Defining regexps

In Python, all usage of regular expressions is handled via the re module. This means that if and when you want to work with regexps from within Python, you must include the line

    import re

somewhere before your first usage of regexps, preferably at the top of the file along with other import statements. You then define a regexp as a string, as in:

    s = 'abc.def'

It’s important to point out that because all regexps in Python are first created as strings, the Python parser may handle some regexps differently than you might expect. For example, let’s say that your regexp is looking for the string abc as a word on its own. You would likely want to use the \b (word boundary) metacharacter to indicate this in your regexp, as follows:

    s = '\babc\b'

However, this will fail. That’s because \b is treated by Python’s string parser as a special character (ASCII 8, or backspace). The regexp engine will thus think that it’s to look for the backspace character, rather than the \b metacharacter. The same is true if you use backreferences, which use a backslash followed by a number, such as \1. Python’s string parser treats \1 as an octal escape, so the regexp engine receives the character with code 1 rather than a reference to the first group, and the regexp silently fails to match what you intended. In both of these cases, what you need to do is double your backslash, as follows:

    s = '\\babc\\b'    # doubled backslashes

If this gets annoying, then you can always use a “raw string” – just put an r before the opening quote of a Python string, and backslashes are passed through untouched, just as if you had doubled each one. You can think of a raw string as a way to tell Python that you want the string to be precisely as you entered it:

    s = r'\babc\b'     # raw string
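The difference between the three spellings is easy to check interactively. The following is a minimal sketch; the sample text and variable names are made up for illustration:

```python
import re

text = 'abc abcdef'       # sample text, invented for this sketch

plain = '\babc\b'         # each \b becomes a backspace character (ASCII 8)
doubled = '\\babc\\b'     # doubled backslashes reach the regexp engine as \b
raw = r'\babc\b'          # raw string: backslashes are left alone

print(len(plain), len(raw))      # 5 7
print(doubled == raw)            # True
print(re.findall(plain, text))   # []
print(re.findall(raw, text))     # ['abc']
```

The plain string is two characters shorter because each \b collapsed into a single backspace, which is why the first search finds nothing.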

1.1.2 Finding one

Once you have created a regexp string, you can then search for it inside of text. Python provides you with two basic ways to search inside of text with regexps: You can either search for a single occurrence, or for all of the occurrences. To search for a single occurrence of your regexp within a string, you’ll use the re.match or re.search functions. Both of them work in precisely the same way, except that re.match automatically anchors your regexp to the start of the string. (You can think of re.match as automatically anchoring the regexp with \A, representing the start of the string. It’s not the same as anchoring with ^, because in multiline mode, ^ matches the start of any line, not just the start of the string.)

Some examples:

    text = 'hello, world'
    re.match('hello', text)     # Find "hello" at the start of text
    re.search('hello', text)    # Find "hello" anywhere in text
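To make the anchoring difference concrete, here is a small sketch using a two-line string (the text is invented for illustration). It compares re.match, re.search, \A, and ^ in multiline mode:

```python
import re

text = 'say hello\nhello again'   # two lines; "hello" starts the second one

at_start = re.match('hello', text)                    # anchored to the start of the string
anywhere = re.search('hello', text)                   # first occurrence anywhere
string_start = re.search(r'\Ahello', text)            # \A: start of the string only
line_start = re.search('^hello', text, re.MULTILINE)  # ^ may match at any line start

print(at_start)             # None
print(anywhere.start())     # 4
print(string_start)         # None
print(line_start.start())   # 10
```

Note that re.match stays anchored to the start of the string even if re.MULTILINE is passed, which is why it behaves like \A rather than like ^.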

Both re.search and re.match return either None (if no match was found) or a “match object” if one was. A match object, traditionally named m, has a number of useful methods, the most popular of which is m.group(0), which returns the entire string that the regexp matched. If there were any groups within the regexp, then you can retrieve the individual groups by passing the group number to m.group. In order to avoid trying to invoke group on None, it’s traditional to check whether m is None (which evaluates to False in a boolean context, such as an if statement). For example:

    text = 'hello, world'
    m = re.search(r'\b(h.)(..o)\b', text)
    if m:
        print("Full match: {}".format(m.group(0)))    # hello
        print("First part: {}".format(m.group(1)))    # he
        print("Last part: {}".format(m.group(2)))     # llo

A regexp string can be compiled into a regexp object. If you are planning to use a regexp within a loop, then it is advisable to reduce your program’s overhead, and compile the regexp a single time, before the first loop iteration. For example:

    text = 'hello, world'
    r = re.compile('(h.)(..o)')
    m = r.search(text)
    if m:
        print("Full match: {}".format(m.group(0)))    # hello
        print("First part: {}".format(m.group(1)))    # he
        print("Last part: {}".format(m.group(2)))     # llo

Notice how search is now invoked as a method on r, rather than as a function whose first argument is a regexp string.
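The loop scenario described above might look like the following sketch. The list of lines is hypothetical input, standing in for whatever your program iterates over:

```python
import re

# Hypothetical input lines, invented for this sketch
lines = ['hello, world', 'goodbye', 'hallo there']

# Compile once, before the loop, then reuse the regexp object
pattern = re.compile(r'\b(h.)(..o)\b')

found = []
for line in lines:
    m = pattern.search(line)
    if m:                       # skip lines where search returned None
        found.append(m.groups())

print(found)    # [('he', 'llo'), ('ha', 'llo')]
```

The re module also keeps its own cache of recently used patterns, so the saving is often modest; compiling up front mainly keeps the pattern in one named place.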

1.1.3 Finding more than one

To search for multiple occurrences within a string, use re.findall. This function also takes a regexp string and a text string, but is guaranteed to return a Python list, with all of the matches for your regexp. If there were no matches, then it returns an empty list. Note that if your regexp includes groups (i.e., parentheses), then re.findall returns a list of matches for your group (if there was one group) or a list of tuples (if there were multiple groups). For example:

    # Find all matches of "hello" in text
    text = 'hello, world and hello, trees!'
    re.findall('hello', text)    # ['hello', 'hello']

    # Find "h", three characters, and then o -- and match the three
    # inner characters. Result is a list of those three characters
    re.findall('h(...)o', text)    # ['ell', 'ell']

    # Find all words starting with h and ending with o.
    # Put the first two characters in a group, and the final three
    # characters in a separate group. Return a list of two-element
    # tuples, one with "h." and the other with "..o"
    re.findall(r'\b(h.)(..o)\b', text)    # [('he', 'llo'), ('he', 'llo')]

If you expect to find a large number of matches, then you might want to use re.finditer rather than re.findall. re.finditer returns an iterator that yields one match object at a time, so it won’t consume large amounts of memory. re.findall, by contrast, will return a list of all matches, which might be quite long.
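As a brief sketch of re.finditer, the loop below walks over match objects one at a time, so each match’s position and groups are available without building the full list first. The sample text is the one used earlier in this section:

```python
import re

text = 'hello, world and hello, trees!'

starts = []
for m in re.finditer(r'\b(h.)(..o)\b', text):
    # Each m is a match object, just like the one re.search returns
    print(m.start(), m.group(0), m.groups())
    starts.append(m.start())

print(starts)    # [0, 17]
```

Because each item is a match object rather than a string or tuple, you get m.start() and m.group for free, which re.findall does not give you.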

1.1.4 Substituting text

Substituting text is done with re.sub, which takes a regexp string, a replacement string, and the text in which to search. It returns the transformed string, leaving the original string untouched. (Which is to be expected in Python, where strings are immutable.) For example, the following replaces all vowels in a string with underscores:

    re.sub('[aeiou]', '_', 'The quick brown fox jumped over the lazy dog')
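Here is the same substitution as a small runnable sketch, showing the returned string, the untouched original, and the optional count argument that limits the number of replacements:

```python
import re

text = 'The quick brown fox jumped over the lazy dog'

replaced = re.sub('[aeiou]', '_', text)
print(replaced)    # Th_ q__ck br_wn f_x j_mp_d _v_r th_ l_zy d_g
print(text)        # The quick brown fox jumped over the lazy dog

# count limits how many matches are replaced, starting from the left
first_two = re.sub('[aeiou]', '_', text, count=2)
print(first_two)   # Th_ q_ick brown fox jumped over the lazy dog
```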

1.1.5 Flags

Python provides a number of flags that can be used to modify the behavior of regular expressions. Each flag has a short name and a long name (for example, re.I and re.IGNORECASE), and is passed as an additional, final argument to the functions in the re module. If you wish to pass more than one flag, then you should use bitwise or (the | character) to combine them.
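A short sketch of passing flags, using re.IGNORECASE and re.MULTILINE on an invented three-line string:

```python
import re

text = 'Hello\nhello again\nHELLO there'   # sample text for this sketch

no_flags = re.findall('^hello', text)
ignore_case = re.findall('^hello', text, re.I)
both = re.findall('^hello', text, re.IGNORECASE | re.MULTILINE)

print(no_flags)       # []
print(ignore_case)    # ['Hello']
print(both)           # ['Hello', 'hello', 'HELLO']
print(re.I == re.IGNORECASE)    # True
```

The short and long names refer to the same flag, so which one you use is a matter of readability.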

1.1.6 Advanced features

Python’s regular expressions are especially rich, taking many elements from the Perl world. As in Perl, and many other programming languages (but unlike grep and Emacs), you use backslashes in Python to neutralize a metacharacter. Thus, + is a metacharacter, indicating that the previous character must appear one or more times – but \+ matches the plain ol’ + character.

Another example of where Python took its cue from Perl is in the addition of a non-greedy operator: You can make a number of normally greedy metacharacters, such as + and ?, non-greedy by adding a ? to them – in other words, you write +? and ??, and these characters indicate that we’re looking for the minimum possible text match, rather than the maximum possible text match.

Python also supports non-capturing parentheses, written (?:...). This is especially useful, I have found, when using re.findall, and you want to use parentheses to have ? affect more than one character, but not be used as a group.

Python supports several other advanced regexp options, such as positive and negative lookahead and lookbehind (all four combinations), and even named groups. Named groups were actually pioneered by Python, which means that there are several styles of defining them. I find named groups to be particularly exciting, in that you can do something like this:

    s = 'The price is $123.45.'
    m = re.search(r'\$(?P<dollars>\d+)\.(?P<cents>\d+)', s)
    if m:
        print(m.group('dollars'))    # 123
        print(m.group('cents'))      # 45
        print(m.groupdict())         # {'dollars': '123', 'cents': '45'}

The syntax for defining named groups is admittedly a bit weird, but that’s what happens when you try to fit new functionality onto a decades-old, very terse syntax.
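The non-greedy, non-capturing, and lookahead features mentioned in this section can be tried out with a sketch like the following. All of the sample strings are invented for illustration:

```python
import re

markup = '<b>bold</b> and <i>italic</i>'

greedy = re.findall('<.+>', markup)     # + grabs as much as it can
lazy = re.findall('<.+?>', markup)      # +? grabs as little as it can
print(greedy)    # ['<b>bold</b> and <i>italic</i>']
print(lazy)      # ['<b>', '</b>', '<i>', '</i>']

words = 'ab abab ababab'
capturing = re.findall(r'\b(ab)+\b', words)        # returns the group's last repetition
non_capturing = re.findall(r'\b(?:ab)+\b', words)  # returns each whole match
print(capturing)        # ['ab', 'ab', 'ab']
print(non_capturing)    # ['ab', 'abab', 'ababab']

# Positive lookahead: digits only when followed by " dollars"
amounts = re.findall(r'\d+(?= dollars)', '10 dollars, 20 euros, 30 dollars')
print(amounts)    # ['10', '30']
```

The capturing and non-capturing results differ because re.findall returns the contents of a group whenever the regexp has one, which is exactly the situation where (?:...) helps.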

1.1.7 More information

More information about Python’s re module is available via the Python Web site (for Python 2 or Python 3). A nice summary is also available at the handy regexp site, http://www.regular-expressions.info/python.html. In addition, a Python-flavored Web site that allows you to test regexps is http://pythex.org/. I really love to use this site, especially when teaching courses, and encourage you to use it in your work, as well.

1.1.8 About Python solutions

Exercise solutions presented in this book will work in both Python 2.7 and 3.5, the latest versions of the language as of this writing. I doubt that any aspects of Python will change in the future so as to make these solutions less accurate. You can download and install Python from http://python.org/.

1.2 Ruby

The Ruby language has often been described as a combination of Perl and Smalltalk. And indeed, this is not a bad description, in that it includes a large helping of Perl-style operators and syntax, along with Smalltalk’s object model. This means that there are several ways to create and work with regexps from within Ruby, typically reflecting the two different language traditions.

1.2.1 Defining regexps

In Ruby, following Perl, we create an instance of Regexp (a class that comes with Ruby, and doesn’t need to be loaded from an external library) either with slashes (/regexp/) or with Regexp.new. The two are equivalent; the resulting object is normally displayed using slashes. For example:

    r = Regexp.new('.ain')    # returns Regexp object /.ain/
    r = /.ain/                # also returns Regexp object /.ain/

1.2.2 Finding one

We can then search in a string for this regexp with the =~ (regexp match) operator. The operator can be used with either the string or the regexp coming first:

    s = 'It will rain today'
    r = Regexp.new('.ain')    # returns Regexp object /.ain/
    r =~ s                    # Returns the Fixnum (integer) 8
    s =~ r                    # Also returns the Fixnum (integer) 8

Why 8? Because s[8] (i.e., the 9th character in the string s) is where the first match was found. What if you want the entire string that was matched? You can use the special variable $&, which contains whatever Ruby found:

    s = 'It will rain today'
    r = Regexp.new('.ain')    # returns Regexp object /.ain/
    r =~ s                    # Returns the Fixnum (integer) 8
    puts $&                   # Prints "rain"

If you prefer to use a more verbose (and less Perl-like) syntax, you can do so by applying the match method. This returns a MatchData object, which contains all of the information we need about the match. Printing a MatchData object, or turning it into a string, returns the string that was found. (If no match was found, then we get nil back, rather than an instance of MatchData.) Once again, we can invoke Regexp#match on our regexp, passing the string, or String#match on our string, passing the regexp:

    s = 'It will rain today'
    r = Regexp.new('.ain')    # returns Regexp object /.ain/
    puts r.match(s)           # prints "rain"
    puts s.match(r)           # also prints "rain"

1.2.3 Finding more than one

If we want to find all of the matches, then we must invoke the String#scan method on our string, passing the regexp. (There is no Regexp#scan that takes a string.) For example:

    s = "the rain in Spain falls mainly on the plain"
    r = Regexp.new('.ain')    # returns Regexp object /.ain/
    s.scan(r)                 # returns an array of four 4-character elements:
                              # ["rain", "pain", "main", "lain"]

1.2.4 Substituting text

Ruby’s String#sub method replaces the contents of a string. The first argument to String#sub can be a string or a regexp; the behavior of the method depends on the object passed to it. We pass to String#sub two arguments, the regexp we want to apply, and the string that should be used in its place. For example:

    s = "the rain in Spain falls mainly on the plain"
    r = Regexp.new('.ain')    # returns Regexp object /.ain/
    s.sub(r, 'XXXX')          # returns "the XXXX in Spain falls mainly on the plain"

If you want to replace all occurrences, then use String#gsub rather than String#sub:

    s = "the rain in Spain falls mainly on the plain"
    r = Regexp.new('.ain')    # returns Regexp object /.ain/
    s.gsub(r, 'XXXX')         # returns "the XXXX in SXXXX falls XXXXly on the pXXXX"

Both String#sub and String#gsub have alternate versions that modify the original string. As with many methods in Ruby, these add a ! character to the originals’ names:

    s = 'The quick brown fox jumped over the lazy dog'
    r = /[aeiou]/
    s.gsub(r, '_')
    puts s    # No change
    s.gsub!(r, '_')
    puts s    # Changed to "Th_ q__ck br_wn f_x j_mp_d _v_r th_ l_zy d_g"

1.2.5 Flags

You can modify the behavior of a regexp in Ruby in one of two ways:

If you use the // syntax to create your regexp, then you put the modifiers following the final slash. Thus, /abc/i is case insensitive and /abc/im is both case insensitive and multiline.

If you create regexps using Regexp.new, then you can pass an optional second argument. If this value is non-nil and non-false, then it’s assumed you want to make it case-insensitive. However, you can also pass one, two, or three modifiers joined with bitwise “or”.

1.2.6 Advanced features

As in Python, capturing is done with parentheses. In such cases, it’s probably a good idea to use String#match, which returns a MatchData object. Similar to Python’s match object, we can retrieve the entire matched string with m[0], and then the individual groups with m[1], m[2], and so forth:

    s = 'hello, world'
    r = /\b(h.)(..o)\b/
    m = s.match(r)
    puts m[0]    # hello
    puts m[1]    # he
    puts m[2]    # llo

Ruby also supports named groups, using the .NET-style syntax. This is slightly different from the Python syntax introduced above:

    s = 'The price is $123.45.'
    r = '\$(?<dollars>\d+)\.(?<cents>\d+)'
    m = s.match(r)
    if m
      puts m['dollars']    # 123
      puts m['cents']      # 45
    end

There isn’t a built-in Ruby equivalent to Python’s groupdict, but the MatchData object does have a names method that can be used to retrieve all of them:

    s = 'The price is $123.45.'
    r = '\$(?<dollars>\d+)\.(?<cents>\d+)'
    m = s.match(r)
    if m
      m.names.each do |name|
        puts "#{name}: #{m[name]}"
      end
    end

Finally, Ruby supports POSIX-style character classes. In addition to the traditional \w, \s, and \d character classes (and their inverses), you can use things like [[:xdigit:]] to indicate that you’re looking for a hex digit. You can also use Unicode properties as character classes, as in \p{ASCII} and \p{Hebrew}.

1.2.7

More information

More information about Ruby’s Regexp class is available via the Ruby Web site. A nice summary is also available at the useful regexp Web site, http://www.regular-expressions.info/ruby.html.

In addition, a Ruby-flavored Web site that allows you to test regexps is http://rubular.org/.

1.2.8

About Ruby solutions

Exercise solutions presented in this book will work in Ruby 2.3, the latest version of the language as of this writing. I doubt that any aspects of Ruby will change in the future so as to make these solutions less accurate. You can download and install Ruby from http://ruby-lang.org/.

1.3

JavaScript

JavaScript, also known by the more formal name of ECMAScript, is now considered to be the most popular programming language in the world – in no small part because it sits inside of every Web browser, and is quickly gaining favor on the server, as well.

1.3.1

Defining regexps

JavaScript is similar to Ruby in some ways, in that you can define regexps using either a Perl-like literal syntax or an object syntax, using the RegExp object. For example:

var re = /a.c/; // Perl-like syntax
var re = RegExp('a.c'); // object syntax

JavaScript supports three different flags: i (case-insensitive), m (multiline mode, changing the definitions of ^ and $) and g, which tells the regexp that it should search globally. There is no s modifier that changes the definition of . to include newline characters. (An s flag was added later, in ES2018.)

You can pass these flags to regexps when you create them. Note that the modifiers are passed unquoted in the // syntax, but quoted with the object syntax:

var re = /a.c/i; // case insensitive
var re = /a.c/im; // case insensitive + multiline
var re = RegExp('a.c', 'i'); // case insensitive
var re = RegExp('a.c', 'im'); // case insensitive + multiline

It should be noted that these two syntaxes create identical objects. Indeed, if you enter an expression in the JavaScript shell, you’ll get back the printed representation of your object, in the // format. This means that even if you define re using the final line of the above example, the printed representation will be /a.c/im.

Note that one advantage of defining your regexps with slashes, rather than the RegExp constructor, is that the latter requires you use a string. In such cases, you’ll often find yourself needing to double backslashes, in order to get around the interpretation of \ by the JavaScript interpreter for strings. Thus, be careful when using character classes such as \w, which work fine, but need a bit of love and attention (and extra escaping) in order to work.

1.3.2

Finding one or more

To find out whether a string matches a regular expression, invoke the “match” method on a string. The return value is an array of matches it found, or null if it didn’t find anything:


var s = 'The quick brown fox jumped over the lazy dog';

var re = /n...n/;

s.match(re) // result is null

var re = /b...n/;

s.match(re) // result is ["brown"]

var re = /[bq]...[kn]/;

s.match(re) // result is ["quick"]

var re = /[bq]...[kn]/g;

s.match(re) // result is ["quick", "brown"]

Note that in the above example, you must use the g modifier to invoke a global search. Alternatively, you can invoke the exec method on a RegExp object. Note, however, that exec will only return a single value each time; you must invoke exec multiple times, stopping when you get a null value, if there were multiple results:

var s = 'The quick brown fox jumped over the lazy dog';
var re = /n...n/;
re.exec(s); // result is null
var re = /b...n/;
re.exec(s) // result is ["brown"]
var re = /[bq]...[kn]/;
re.exec(s) // result is ["quick"]
var re = /[bq]...[kn]/g;
re.exec(s) // result is ["quick"]
re.exec(s) // result is ["brown"]
re.exec(s) // result is null

If you’re merely interested in knowing whether a regexp matches a particular string, you can also use the RegExp.prototype.test method, which returns a true or false value:

var s = 'The quick brown fox jumped over the lazy dog';

var re = /fox/;
re.test(s); // returns true

var re = /^fox$/;
re.test(s); // returns false


1.3.3

Substituting text

Substitution of text is performed using the String.prototype.replace method. If the regexp was defined with the g flag, then all of the regexp matches will be replaced; otherwise, only the first match is replaced. For example:

var s = "the rain in Spain falls mainly on the plain";

var r = /[aeiou]/;

s.replace(r, '_') // Returns "th_ rain in Spain falls mainly on the plain"

var r = /[aeiou]/g; // Make it global

s.replace(r, '_') // Returns "th_ r__n _n Sp__n f_lls m__nly _n th_ pl__n"

1.3.4

Advanced features

JavaScript’s regexps have traditionally not included some of the more advanced features found in other languages. Before ES2018, JavaScript didn’t have named capture groups or lookbehind, although it did have lookahead. It doesn’t support the \A and \Z anchors, although it does support multiline mode via the m flag.

1.3.5

More information

Information about JavaScript’s regexp syntax and usage can be found in a number of places. The official source is the ECMA 262 specification, which you can download and read.

More realistically, you can read about JavaScript’s regexp capabilities and syntax from http://www.regular-expressions.info/javascript.html. Another good source of information, particularly if you’re interested in the latest “ES6” version of JavaScript, is Axel Rauschmayer’s book, “Speaking JavaScript.” You can read the regexp chapter online. Finally, an open-source library for JavaScript called “XRegExp” provides a number of enhancements to the built-in regexp syntax. I won’t use these in the book, but you can learn more and download it from xregexp.com.

1.3.6

About JavaScript solutions

While JavaScript is best known for its work in Web browsers, it can also be used on servers, and is even available as a standard programming language. There are several options for doing this; for the purposes of this book, I am using the REPL (“read-eval-print loop”) for JavaScript included with the popular Node.js program and library. On my computer, I’m able to type node at the command line, and then to interact with JavaScript.

One big advantage of using Node.js is that it includes a number of the latest additions to JavaScript. This means that, among other things, I can require the fs object, giving me access to the filesystem, or the readline object, allowing me to query the user.

Reading from a file in the JavaScript REPL is a bit weird-looking at first, but it works pretty well:

"use strict";

var fs = require('fs');

fs.readFile('words.txt', 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    for (let line of data.split("\n")) {
        console.log(line);
    }

    process.exit();
});

In the above code, I invoke fs.readFile, which takes three arguments – the name of the file to open, the encoding of the file (which will normally be utf8 in this book), and a function which takes two arguments. The first argument represents an error, if it occurs. The second argument is a string with the contents of the file. However, if we want to iterate over the lines of the file, we’ll need to invoke split on the string, giving us an array object back. I use ES6’s for..of loop construct, along with the new let variable scope declaration, to iterate over the elements of that array, printing each line of the file.

Also note that I’m using console.log to display things on the screen. JavaScript programs in this book should all be in “strict” mode, giving us a greater chance of program errors being caught earlier.

1.4

PostgreSQL

PostgreSQL isn’t a language per se, but rather a relational database system. That said, PostgreSQL includes a powerful regexp engine. It can be used to test which rows match certain criteria, but it can also be used to retrieve selected text from columns inside of a table. Regexps in PostgreSQL are a hidden gem, one which many people don’t even know exists, but which can be extremely useful.

The PostgreSQL regexp engine is descended from the one used in the Tcl language, which differs from the other regexp engines used in many languages. Many flags are passed using single characters inside of parentheses inside of the regexp, for example. Other aspects of the syntax are just slightly off from other languages; for example, {min,max} cannot have an empty min, if it defines a range. Thus, {1,20} and {1,} are OK, but {,20} is not. Even if you’re used to working with regexps in other languages, it’s worth reading the documentation for PostgreSQL’s implementation to fully understand how it works.
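For contrast, here is a minimal sketch of how Python treats a quantifier with an empty minimum. The pattern and test strings are illustrative only, not taken from the exercises. In Python, a{,3} means "between zero and three", whereas in PostgreSQL you would need to spell out the lower bound, as in a{0,3}.

```python
import re

# In Python, omitting the minimum means "zero up to the maximum".
upper_only = re.compile(r'a{,3}')

matches_empty = bool(upper_only.fullmatch(''))     # zero repetitions are allowed
matches_three = bool(upper_only.fullmatch('aaa'))  # three is within the range
matches_four = bool(upper_only.fullmatch('aaaa'))  # four exceeds the maximum

print(matches_empty, matches_three, matches_four)  # True True False
```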

1.4.1

Defining regexps

Regexps in PostgreSQL are defined using strings. Thus, you will create a string (using single quotes only; you should never use double quotes in PostgreSQL), and then match that to another string. If there is a match, PostgreSQL returns “true.”

PostgreSQL’s regexp syntax is similar to that of Python and Ruby, in that you use backslashes to neutralize metacharacters. Thus, + is a metacharacter in PostgreSQL, whereas \+ is a plain “plus” character. However, there are differences between the regexp syntaxes – for example, PostgreSQL’s word-boundary metacharacter is \y whereas in Python and Ruby, it is \b. (This was likely done to avoid conflicts with the ASCII backspace character.)

Where things are truly different in PostgreSQL’s implementation is the set of operators and functions used to work with regexps. PostgreSQL’s operators are generally aimed at finding whether a particular regexp matches text, in order to include or exclude result rows from an SQL query. By contrast, the regexp functions are meant to retrieve some or all of a string from a column’s text value.

1.4.2

True/false operators

PostgreSQL comes with four regexp operators. In each case, the text string to be matched should be on the left, and the regexp should be on the right. All of these operators return true or false:

~ case-sensitive match
~* case-insensitive match
!~ case-sensitive non-match
!~* case-insensitive non-match

Thus, you can say:

select 'abc' ~ 'a.c'; -- returns "true"
select 'abc' ~ 'A.C'; -- returns "false"
select 'abc' ~* 'A.C'; -- returns "true"

In addition to the standard character classes, we can also use POSIX-style character classes:

select 'abc' ~* '^[[:xdigit:]]$'; -- returns "false"
select 'abc' ~* '^[[:xdigit:]]+$'; -- returns "true"
select 'abcq' ~* '^[[:xdigit:]]+$'; -- returns "false"

These operators, as mentioned above, are often used to include or exclude rows in a query’s WHERE clause:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('abc'), ('AbC'), ('Abq'), ('ABCq');
SELECT id, thing FROM Stuff WHERE thing ~* '^[abc]{3}$';

This final query should return three rows, those in which thing is equal to ABC, abc, and AbC.

1.4.3

Extracting text

If you’re interested in the text that was actually matched, then you’ll need to use one of the built-in regexp functions that PostgreSQL provides. For example, the regexp_matches function allows us not only to determine whether a regexp matches some text, but also to get the text that was matched. For each matching row, regexp_matches returns an array of text (even if that array contains a single element). For example:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('abc'), ('AbC'), ('Abq'), ('ABCq');
SELECT regexp_matches(thing, '^[abc]{3}$') FROM Stuff;

The above will return a single row:

{abc}

As you can see, the above returned only a single column (from the function) and a single row (i.e., the one matching it). Only one row matched because the match is case-sensitive by default. When you invoke regexp_matches, you can provide additional flags that modify the way in which it operates. These flags are similar to those used in Python, Ruby, and JavaScript. For example, we can use the i flag to make regexp_matches case-insensitive:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('abc'), ('AbC'), ('Abq'), ('ABCq');
SELECT regexp_matches(thing, '^[abc]{3}$', 'i') FROM Stuff;

Now we’ll get three rows back, since we have now made the match case-insensitive. regexp_matches can take several other flags as well, including g (for a global search). For example:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC');
SELECT regexp_matches(thing, '.', 'g') FROM Stuff;

Here is the output from regexp_matches:

{A}
{B}
{C}

Notice how regexp_matches, because of the g option, returned three rows, with each row containing a single (one-character) array. This indicates that there were three matches. Why is each returned row an array, rather than a string? Because if we use groups to capture parts of the text, the array will contain the groups:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('AqC');
SELECT regexp_matches(thing, '^(A)(..)$', 'ig') FROM Stuff;

Notice that in the above example, I combined the i and g flags, passing them in a single string. The result is a set of arrays:

| regexp_matches |
|----------------|
| {A,BC} |
| {A,qC} |

If we’re interested in retrieving a single element from that array, we’ll need to use [] to grab a particular element. Remember that in PostgreSQL, arrays are indexed starting with 1, not 0. Thus, in the above example, we can retrieve the first element of each array:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC');
SELECT (regexp_matches(thing, '.', 'g'))[1] FROM Stuff;

The result is:

A
B
C

That is, we get a column of text, rather than of one-element text arrays.

1.4.4

Splitting

A common function in many high-level languages is split, which takes a string and returns an array of items. PostgreSQL offers this with its split_part function, but that only works on strings.

However, PostgreSQL also offers two other functions. regexp_split_to_array splits text into a PostgreSQL text array, while regexp_split_to_table turns it into a table. These functions allow us to split a text string using a regexp, rather than a fixed string. For example, if we say:

select regexp_split_to_array('abc def    ghi   jkl', '\s+');
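For readers more at home in a programming language, the same idea can be sketched with Python's re.split. This is only a comparison, not PostgreSQL code, and the input string is an illustrative stand-in with runs of whitespace of different lengths.

```python
import re

# Split on any run of whitespace, as the \s+ pattern does in PostgreSQL.
text = 'abc def    ghi   jkl'
parts = re.split(r'\s+', text)

print(parts)  # ['abc', 'def', 'ghi', 'jkl']
```

The resulting list plays the role of the PostgreSQL text array: one element per piece of text between separators.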

The above will take any length of whitespace, and will use that to split the string. But you can use any regexp you want to split things, getting an array back. A similar function is regexp_split_to_table, which returns not a single row containing an array, but rather one row for each element. Repeating the above example:

select regexp_split_to_table('abc def    ghi   jkl', '\s+');

The above would return a table of four rows, with each split text string in its own row.

Substituting text

The regexp_replace function allows us to create a new text string based on an old one. For example:

SELECT regexp_replace('The quick brown fox jumped over the lazy dog',
'[aeiou]', '_');

The above returns:

Th_ quick brown fox jumped over the lazy dog

Why was only the first vowel replaced? Because when we invoked regexp_replace, we did so without the g option. Adding g makes the replacement global:

SELECT regexp_replace('The quick brown fox jumped over the lazy dog',
'[aeiou]', '_', 'g');

Now all occurrences are replaced:

Th_ q__ck br_wn f_x j_mp_d _v_r th_ l_zy d_g

1.4.5

More information

PostgreSQL’s regexp engine is surprisingly full-featured, and I’ve only scratched the surface here. The best and most complete place from which you can learn more is the PostgreSQL documentation. Additional information is available at http://www.regular-expressions.info/postgresql.html. In addition, the “Postgres Online” site contains a good article outlining regexp use in PostgreSQL.

1.5

grep

The grep program has been associated with the Unix command line for many years. Lore has it that the standalone grep program came into being after using a combination of “global” and “print” in the ed editor, with an arbitrary regular expression between the “g” and the “p.” Modern versions of Unix are almost unthinkable without grep.

At the same time, we have to realize that there are numerous versions of grep out there. For example, Linux uses the GNU version of grep, maintained by the Free Software Foundation as part of their GNU project. By contrast, FreeBSD and Apple’s OS X include a version of grep that has fewer features, but is directly descended from the traditional Unix grep. There are also variations on these, such as fgrep, egrep, and so forth.

Unrelated to these, but worth noting because it’s so incredibly useful, is ngrep, a “network grep” program that lets you use regexps to examine the current network traffic to and from your computer. I have used ngrep on numerous occasions when debugging network applications. You can learn more about ngrep from its home page.

1.5.1

Basic use

All versions of grep operate on the assumption that you want to search through a file, line by line, and find those lines that match a regular expression. Thus, certain options associated with regexps in programming languages are no longer relevant, such as multiline mode. Normally, grep is used to find all of the matches in a file:

grep 'a.c' myfile.txt

The output will contain all of the lines of the file containing the regexp. It doesn’t matter whether the regexp matches once or multiple times; the fact that there was even one match triggers the printing of the line. You can reverse this with the -v flag. Thus, assuming that I have a file containing Unix-style comments (i.e., # in the first column), I can use grep to find all of the comment lines, or all of the non-comment lines:

grep '^#' myfile.txt # Finds all comment lines
grep -v '^#' myfile.txt # Finds all non-comment lines

Another useful option to grep is -i, which makes the search case-insensitive.

1.5.2

Backslashes

One of the biggest issues for me when using grep is that it handles backslashes differently from all of the other programming languages mentioned above. In this sense, it’s more traditional, using the metacharacters as they were originally defined and used in Unix. However, I can see why Larry Wall flipped the meaning in Perl, in order to avoid what he called “backslashitis.” The basic idea is that many metacharacters, such as +, {min,max}, and |, are treated as standard characters without a backslash, and metacharacters when they are preceded by a backslash. (Others, such as ., *, and [ ], are metacharacters even without a backslash.) For example:

$ echo 'I want to eat breakfast' > file.txt
$ grep '[aeiou]+' file.txt # no match
$ grep '[aeiou]\+' file.txt # matches
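To make the contrast concrete, here is a small Python sketch showing the opposite convention: an unescaped + is a quantifier, while \+ matches a literal plus sign. The second test string, 'a+b', is a made-up example added for illustration.

```python
import re

text = 'I want to eat breakfast'

# Unescaped + is a quantifier: one or more lowercase vowels in a row.
vowel_runs = re.findall(r'[aeiou]+', text)
print(vowel_runs)  # ['a', 'o', 'ea', 'ea', 'a']

# Escaped \+ is a literal plus sign, so nothing matches in this sentence.
literal_plus = re.findall(r'[aeiou]\+', text)
print(literal_plus)  # []

# The escaped form does match a vowel followed by an actual "+" character.
vowel_then_plus = re.findall(r'[aeiou]\+', 'a+b')
print(vowel_then_plus)  # ['a+']
```

In other words, the two patterns in the grep example above swap roles when moved into Python.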

1.5.3

Context

grep, and especially GNU grep, takes a very large number of arguments. You can read more about these in the grep man page, either for BSD Unix or for GNU grep. However, one of the most useful options is what I call “ABC”:

The -A option shows you a number of lines /after/ a match
The -B option shows you a number of lines /before/ a match
The -C option shows you a number of lines of context (i.e., /both/ before and after)

I use these all of the time when I’m looking through logfiles; having a few lines of context above and/or below what I’m searching for, such as an IP address, can be quite useful.

Chapter 2

Input data

Regular expressions are not something that you learn or use in a vacuum. Rather, they are a way of consuming, identifying, and extracting text from within larger files. In order to make the exercises a bit more interesting and realistic, I have enclosed a number of files with this book.

2.1

Dictionary (words.txt)

The English-language dictionary that I have included in this ebook comes with Linux, and is thus available under an open-source license. The dictionary consists of one word per line in the file, which amounts to more than 235,000 words. I have learned over the years of teaching regexp classes that the dictionary contains a surprisingly large and varied number of words, such that even when you ask for all of the words that have 11 letters in them and start with t, you’ll still get a fairly long list! We will use the dictionary file in exercises where I want you to find “all of the words that…” for some condition that I’ll give in the exercise.

2.2

Alice in Wonderland (alice.txt)

Project Gutenberg is an attempt to make as many books as possible available, for free, over the Internet. It has been around for many years, and publishes as many books as it can – often, waiting until books are no longer copyrighted, and then publishing them. I have taken the text of “Alice in Wonderland” from Project Gutenberg. Several of the exercises will ask you to find certain types of text from Alice. Note that I have left the Project Gutenberg notices intact in the file; while they aren’t part of the story, they do provide us with more text to search through, which I see as a good thing in a book like this.

2.3

Config (config.txt)

I often use regular expressions to look through configuration files. Many of these config files are of the form “name = value”, with a # at the start of a line indicating that it’s a comment. I have included one simple config-style file, so that we can explore and extract data from it.

2.4

Apache logfile (access-log.txt)

Another type of file on which I often use regexps is a logfile. I have taken an excerpt from the Apache logfile on my server, from many years ago, and have extracted several hundred lines from it, in what I call the “mini access log.” We will explore this file, and try to find some interesting data points from it.

2.5

Linux “passwd” file (passwd.txt)

As another example of a configuration-type file, I have included a slightly modified version of a Linux “password” file. This file, called /etc/passwd, is traditionally included on Unix and Linux systems, and historically listed not only the usernames, but the passwords, as well. In recent years, despite the name, the file no longer contains the passwords. I have modified this file slightly, such that it includes several blank lines and comment lines starting with #.
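As a rough orientation to the file's layout, here is a short Python sketch that pulls apart one passwd-style line. The line itself is invented for illustration and does not come from passwd.txt; it assumes the usual layout of seven colon-separated fields, with an x placeholder where the password once appeared.

```python
# Hypothetical passwd-style line; the user and paths are made up.
line = 'alice:x:1001:1001:Alice Example:/home/alice:/bin/bash'

# Fields are separated by colons.
fields = line.split(':')
username = fields[0]
password_placeholder = fields[1]  # "x" means the password is stored elsewhere

print(len(fields))           # 7
print(username)              # alice
print(password_placeholder)  # x
```

The exercises ask for this kind of extraction with regexps rather than with a fixed-string split, so that blank lines and comment lines can be handled as well.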

2.6

Fakelog (fakelog.txt)

Some of the time, you need to work with logfiles whose values extend over more than a single line. In such cases, you need to write multiline regexps. For those exercises, I have prepared a simple file, fakelog.txt, which simulates such a situation.

2.7

PostgreSQL database

PostgreSQL is a relational database, rather than a programming language. As a result, it cannot easily work with files on disk. In order to make the examples more appropriate for PostgreSQL users, I created a database and dumped it to a file that you can load into PostgreSQL. The assumption is that all of your solutions should work against the appropriately named table in the database, rather than a file on disk. The dumpfile was made with PostgreSQL 9.5, but should be compatible with earlier versions, as well.

To import the file into PostgreSQL, you’ll first need to create a database on the Unix command line:

createdb practice_makes_regexp

The above assumes, of course, that the user via which you are logged in has permissions to create PostgreSQL databases. If not, then check your system configuration to give yourself that ability. Once the database has been created, you can import the dumpfile into PostgreSQL, from the Unix shell prompt:

psql practice_makes_regexp < practice_makes_regexp.sql

You can then check to see if it all worked by entering into the practice_makes_regexp database:

psql practice_makes_regexp

Then, ask to see the current list of tables:

\dt

You should see 16 defined tables there, two for each of the files mentioned above. Each file has been added twice – the first time, with each line of the file as a separate row in the database table, and the second time, with the entire file inserted into a single row. This was done to ensure that even those exercises in which you’re asked to find text that spans lines of the file can be solved using PostgreSQL.

Chapter 3

Exercises

This chapter contains all of the exercises presented later in the book, without the solutions. In this way, you can do the exercises without worrying about peeking at the answers. And no, you shouldn’t peek! Rather, you should work on the exercise, struggling a bit until you either find the answer or give up. But don’t give up too soon; I suggest that you engage in what I call “controlled frustration,” allowing yourself to get annoyed and frustrated, without having an actual work deadline or boss standing over you, waiting for you to finish.

3.1

Simple regexps

3.1.1

Find matches

Solution is in section 4.1

This exercise is deliberately very simple, to try to get you into the spirit of working with regular expressions. The idea is to ask the user to enter a regular expression, and then to print all of the lines in a file which match that regexp. In other words, you’re going to be creating a simple grep command.

Each programming language has a different way of asking the user for input – and in the case of PostgreSQL, there really isn’t any way, so I fudged it a bit in my solution. Nevertheless, taking a string and turning it into a regexp, then finding that regexp in a file, is a good way to start. In this exercise, you are to:

1. Ask the user to enter a regexp (via a string)
2. Print all lines in the dictionary that match that regexp.

Note that the regexp doesn’t have to match the /entire/ word. Thus if our regexp is abc, then any word containing the three characters abc in a row should be printed, regardless of whether it is a 3-letter word or a 10-letter word.

3.1.2

Five-letter words

Solution is in section 4.2

In this exercise, you are to display words in the dictionary that are either four letters long, or that are five letters long if they end with an s. The word – not just a subset of the word – should be precisely four or five letters long. For the purposes of this exercise, any character (not just a letter) can be counted in the first four letters of the word. However, if there is a fifth letter, it must be an s.

3.1.3

Double “f” in the middle

Solution is in section 4.3

In this exercise, you need to find all of the words in the dictionary that contain a “ff” in them, so long as those f’s are not the first or final characters in the word. Thus, “affable” would be fine, but “quaff” would not.

3.1.4

Extract timestamp

Solution is in section 4.4

It’s common to use regular expressions to extract information from logfiles. In the access-log.txt file that comes with this book, each HTTP request is accompanied by a timestamp, consisting of a date and time. In this exercise, you must match and retrieve the entire timestamp from each line, starting with [ and ending with ]. For the purposes of this exercise, you cannot assume that this will be the only pair of [ and ] in the logfile, so you cannot use a regexp such as:

\[[^]]+\]

which would mean, “start with [, end with ], and take everything in the middle.” You’ll need to specify the regexp more explicitly and carefully than that. For example, the first line of access-log.txt contains the following timestamp:

[30/Jan/2010:00:03:18 +0200]

You are to retrieve just that part of each line.
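To see why a loose "anything between brackets" pattern is not enough, consider the following Python sketch. The log line is hypothetical (it is not taken from access-log.txt) and was written to contain a second pair of square brackets; the sketch shows only the problem, not the solution.

```python
import re

# Hypothetical log line with two bracketed parts: the timestamp and "[42]".
line = '10.0.0.1 - - [30/Jan/2010:00:03:18 +0200] "GET /items[42] HTTP/1.1" 200'

# The loose pattern accepts any non-"]" characters between "[" and "]".
loose_matches = re.findall(r'\[[^]]+\]', line)

print(loose_matches)  # ['[30/Jan/2010:00:03:18 +0200]', '[42]']
```

Because the loose pattern also picks up [42], a solution needs to describe what a timestamp actually looks like inside the brackets.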

3.2

Character classes

3.2.1

End-of-sentence words

Solution is in section 5.1

In Alice in Wonderland, find all of the words that are at the end of a sentence. In other words, find and display all of the words that end with ., ?, or !. You should display the punctuation mark along with the word. For the purposes of this exercise, a word is any string of alphanumeric characters at least two characters long.

3.2.2

Hex numbers

Solution is in section 5.2

Given the following sentence:

I like the hex numbers 0xfa and 0X123 and 0xcab and 0xff

retrieve all of the hexadecimal numbers. That is, each number starts with 0x (or 0X), then has a string of digits or the letters a through f, capital or lowercase.

3.2.3

Hexwords

Solution is in section 5.3

Which words in the dictionary contain only the letters a through f?

3.2.4

IP addresses

Solution is in section 5.4

Each line of access-log.txt starts with an IP address. Each IP address has four numbers, each containing between one and three digits. The numbers are separated by periods (.). In this exercise, you are to retrieve the IP addresses from access-log.txt by building a character class, not by splitting the line across whitespace.

3.2.5

Long, weird words

Solution is in section 5.5

Find all of the words in the dictionary that have the following characteristics:

10 letters long
Start with a letter from the first half of the alphabet (a-m)
End with a letter from the second half of the alphabet (n-z)
Somewhere in the middle, there should be a “p”

3.2.6

Matching URLs

Solution is in section 5.6

Let’s assume that we have defined a string:

I love to visit https://example.com/foo.html every day!
More than http://abc-def.co.il/.

Write a regexp that will match both URLs, but not the characters before or after them. Include the /foo.html in the first URL, but not the trailing period (.) in the second.

3.2.7

Non-zero hours

Solution is in section 5.7

Once again, it’s time to search for certain patterns in access-log.txt: We want to find all of the records in which the hour doesn’t begin with a 0. (Remember that Apache logs, like many other logfiles, operate on a 24-hour clock. Thus, 11 p.m. is written as 23:00.) Thus, you should not show the records from 00:00 through 09:59, but should show those from 10:00 through 23:59. For the purposes of this exercise, you may assume that square brackets ([ and ]) only occur around the timestamp.

3.2.8

Quoted text

Solution is in section 5.8

In this exercise, we’re going to look for all of the quotations in Alice in Wonderland. I’m looking for any stretch of text that starts with the double-quote character (") and ends with that same character. I’m going to assume that quotes are never nested, and that there’s no use of a programmer’s backslash (\) to escape the double quotes. However, quotes might extend across more than one line.

3.2.9

Supervocalic

Solution is in section 5.9

A word is considered “supervocalic” if it contains all five of the English-language vowels (a, e, i, o, and u). Each letter should appear only once, and in that order. For this task, you want to find all of the supervocalic words in the dictionary.

3.2.10

Double triple vowel

Solution is in section 5.10

In English, doubled vowels are a pretty common occurrence. Tripled vowels, though, are a pretty rare thing. Your task is to try to find something even rarer: Words in the dictionary with two separate sets of triple vowels. (And yes, the dictionary I’ve included with this book contains 69 such words.)

3.2.11

Postfix dollar

Solution is in section 5.11

In the United States, we put the dollar sign before the price of something, as in $123.45. In my travels, I’ve noticed that many people, in many countries, aren’t used to this, and put the $ sign after the numbers. Consider the sentence:

They wrote 1,000$ (not $1000), and 123.45$ (not $123.45).

For this exercise, write a regular expression that finds all of the cases of numbers (including commas and decimal points) followed by dollar signs. Thus, the results should find 1,000$ and 123.45$.

3.3

Alternation

3.3.1

Multiple date formats

Solution is in section 6.1

Dates are a well-known problem in the world, in that the same representation can mean different things. If you see the date 1/2/2016, does that mean February 1st or January 2nd? It all depends on whether you’re in the United States or Europe. Asian countries write dates altogether differently, starting with the year, so 2016-2-1 would mean February 1st, 2016. For this exercise, write a regular expression that finds all dates in the following string:

I write it as 2015-09-02, but he wrote 2/9/2015, and she wrote 9.2.2015.
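The ambiguity described above can be shown with a few lines of Python. This sketch is about interpreting dates rather than finding them, so it does not give away the exercise; it uses the standard library's datetime.strptime with two different format strings on the same text.

```python
from datetime import datetime

ambiguous = '1/2/2016'

# Day-first reading (common in Europe): 1 February 2016.
day_first = datetime.strptime(ambiguous, '%d/%m/%Y')

# Month-first reading (common in the United States): 2 January 2016.
month_first = datetime.strptime(ambiguous, '%m/%d/%Y')

print(day_first.date())    # 2016-02-01
print(month_first.date())  # 2016-01-02
```

A regexp can locate text that looks like a date, but deciding which reading is intended takes knowledge that the text itself does not contain.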

3.3.2

“oo” and “ee” words

Solution is in section 6.2 Find all of the words containing the double-letter combination oo and/or ee in Alice in Wonderland, regardless of case.

3.3.3

British and American spelling

Solution is in section 6.3 The problem here is a relatively simple one. We have a sentence: The new box of cheques is blue in colour.

Or I might have this sentence: The new box of checks is blue in color.

Write a regexp that matches either of these.

3.4

Anchors

3.4.1

Capital vowel starts

Solution is in section 7.1 In this assignment, find and print all of the words that begin with a capital vowel (A, E, I, O, or U) and are at the start of a line.

3.4.2

Comment lines

Solution is in section 7.2 Many Unix-style files, including programs written in such languages as Python and Ruby, indicate comments by having a # at the start of the line. In this exercise, you are to print all comment lines – meaning, all lines that start with #, or in which the # is preceded only by whitespace. Comments that follow code on the same line can be ignored. Thus, given the following file:

# Comment 1

# Comment 2

print("Hello") # Comment 3

Your solution should print comments 1 and 2, but not comment 3.

3.4.3

Last five characters

Solution is in section 7.3 In Alice in Wonderland, print the last five characters of every line, in which the third-to-last character is a lowercase letter in the second half of the alphabet (i.e., starting with n).

3.4.4

u in the 2nd-to-last word

Solution is in section 7.4 Show the final two words of each line of Alice in Wonderland in which u is in the second-to-last word.

3.5

Groups

3.5.1

Date and time

Solution is in section 8.1 In access-log.txt, each line contains a timestamp, which looks like this: [30/Jan/2010:00:03:18 +0200]

Notice that the timestamp starts with [, ends with ], and contains both the date (in DD/MMM/YYYY format) and the time (in HH:MM:SS +TZ format). For this exercise, you are to grab the date and time in separate groups. Each language has a slightly different way of extracting the groups; the idea is that for each line, it should be possible to extract and display the date and time separately. The time should include the time zone; for now, we’ll leave it in the format used by the access log.

3.5.2

Config pairs

Solution is in section 8.2 config.txt is a simple configuration file. Simple, in that the configuration is set with lines that look like name:value

But as often happens in such files, the people writing the file have gone a bit crazy, and have added lots of extra whitespace. Some lines contain only whitespace, or are generally illegal, without either a name or a value. We want to extract all of the name-value pairs from this file, grabbing the name and value in separate groups from legal lines. Moreover, we want to ignore any leading and trailing whitespace surrounding the name and value.

3.5.3

Quote first and last words

Solution is in section 8.3 In an earlier exercise (5.8), we found all of the quotations in Alice in Wonderland. For this exercise, find the first and last words of each quotation, not including the quotation marks and punctuation. Thus, if the quote is "Hello out there!" you should find Hello and there. Note that quotes might extend across lines.

3.5.4

Prices with symbols

Solution is in section 8.4 [Note: This chapter uses Unicode symbols that aren’t printing correctly. I’m working on fixing this. In theory, there should be a dollar sign, a euro symbol, and a UK pound sign.] Assume that we have a string: We sell 10 in the US for $100, in Europe for €100, and in the UK for £100.

We want to retrieve all of the prices from this string, but we don’t want to retrieve the currency symbol as well. In other words, we want to find all of the digits (no commas or decimal points) that follow a currency symbol.

3.5.5

Question first word

Solution is in section 8.5 Once again, let’s extract some text from Alice in Wonderland: Retrieve the first word of every question – meaning, every sentence that ends with a question mark.

3.5.6

t, but no “ing”

Solution is in section 8.6 In this exercise, you are to find all of the words in Alice in Wonderland that start with t and end with ing. However, you are to return the portion of the word that precedes the ing. Thus, if the word is trailing, you should only match and return trail.

3.5.7

Usernames and user IDs

Solution is in section 8.7 In linux-etc-passwd, field index 0 is the username, field index 2 is the user ID, and field index -1 contains the user’s shell. For each user in the file, I want a regexp that extracts the user’s name, the user’s ID number, and the user’s shell. The regexp should extract each piece of information using a group. If the language supports it, retrieve each field using a named group, rather than a numbered one.

3.5.8

Beheaded usernames

In this exercise, display the final four characters of any username that starts with a and contains at least five characters. Thus, given the users nobody, root, amotz, atara, adam, and astronaut, we would see the following output:

motz

tara

naut

3.5.9

Final question words

Solution is in section 8.9 In this exercise, you are to retrieve the final word of each question in Alice in Wonderland. You can assume that a question always ends with a question mark (?). You should not retrieve the question mark, but just the word preceding it.

3.5.10

“d” user shells

Solution is in section 8.10 In /etc/passwd, each line contains a number of different fields, separated by : characters. The first field is the username, and the final field is the user’s shell (i.e., the command interpreter). On a typical Linux box, most people will be using /bin/sh or /bin/bash, whereas others will be using /usr/bin/zsh, or something like that. And then you have the internal system users, whose shells are often /bin/false (so that they cannot log in), or something of the like.

In this exercise, I want you to retrieve the shell from every user whose name contains d. For example, given the following line: daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin

This username (daemon) starts with d, and the user’s shell is /usr/sbin/nologin. But we also want shells from users with d elsewhere in the name, as in: redis:x:112:123:redis server,,,:/var/lib/redis:/bin/false

3.6

Flags

3.6.1

All usernames

Solution is in section 9.1 In this exercise, you are to find all of the usernames in passwd.txt. However, you are to do this not by looping over the lines in passwd.txt, but rather by applying a regexp to the entire contents of the file as a single string, and retrieving all of the matches found in that string. Just to remind you, the username is at the start of each line, until the first : character.

3.6.2

abc

Solution is in section 9.2 In Alice in Wonderland, find stretches of text that start with a, have a b in the middle, and end with c. Between each of these characters can be up to 20 other characters.

3.6.3

abcABC

Solution is in section 9.3 This exercise is a repeat of the previous one. But whereas the previous exercise asked you to find stretches of a, b, and c with up to 20 characters between each of these letters, here the search should be case-insensitive. That is, now we’re looking for either a or A, then up to 20 characters, then b or B, followed by up to 20 characters, then c or C, followed by up to 20 characters.

3.6.4

abcABC, extended

Solution is in section 9.4 The regexp in the previous exercise was starting to get a bit long and complex. In such cases, it’s a good idea to break the regexp into separate lines, taking advantage of the “extended mode” that many regexp engines offer. In this exercise, I want you to take the regexp from the previous exercise (9.3) and turn it into a multi-line regexp, using extended mode in your language of choice.

3.6.5

No-error IP addresses

Solution is in section 9.5 In this exercise, we’re going to work with fakelog.txt, a logfile using a format that I created for the purposes of my regexp courses. Each entry in the logfile is two lines long, and represents a response code of some sort, similar to HTTP. The first line contains the timestamp of the error message, followed by the (fake) IP address that caused the error. The second line contains the word Result, followed by a three-digit number indicating the error code, a colon, and a message. Your task is to extract the IP addresses associated with a response code starting with a 2.

3.7

Backreferences

3.7.1

Doubled vowels

Solution is in section 10.1 Find all of the words in Alice in Wonderland that contain doubled vowels – that is, the same vowel (a, e, i, o, or u) appears twice in a row. For example, “beer” contains a doubled vowel, but “bear” does not.

3.7.2

Hours and seconds

Solution is in section 10.2 In access-log.txt, find all of the entries in which the hour and second for the entry were identical. Thus, a request at 12:34:12 matches, but 12:34:56 does not.

3.7.3

Seven-letter start-finish words

Solution is in section 10.3 In the dictionary, find all seven-letter words that start and end with the same two letters. For example, restore starts with re and ends with re, and is seven letters long.

3.7.4

end-start

Solution is in section 10.4 Show all words in the dictionary in which the final two letters of one word are the same as the first two letters of the next word. Thus, if the word require is followed by the word requirement, then we’ll want to see require in our output.

3.7.5

Singular and plural

Solution is in section 10.5 Find all of the words in Alice in Wonderland that appear in both singular and plural forms. For the purposes of this exercise, we’ll generalize, and say that a “plural” is any word with an “s” or “es” on the end. Thus, if both cat and cats appear in the book, then I want to see cat. We’ll also say that the singular version of a word must be at least 2 letters long, and that the singular version must precede the plural version.

3.8

Replace

3.8.1

Crunch whitespace

Solution is in section 11.2 This is another simple exercise, but one that has great practical implications. The idea is that you have read some text into your program. That text contains a number of types of whitespace characters – spaces, tabs, newlines, and even carriage returns. You want to turn each of those characters, or each run of several of them, into a single space character. So if you have the string abc

def\n

\tghi \t \r \n jkl

You want to turn it into abc def ghi jkl

3.8.2

New hostname

Solution is in section 11.3 Our company is rebranding from “foocorp” to “barcorp”, and as such, all of the URLs must change. We’re also changing our URLs such that if there is a www. before the foocorp, that should go away as well. And our corporate security team has said we need to use HTTPS instead of HTTP, so all of our URLs that currently use http now need to use https. Can we take care of all three of these at once? In other words, given the text

Please visit http://www.foocorp.com/.

we should change it to

Please visit https://barcorp.com/.

3.8.3

Detagify

Solution is in section 11.4 While regexps shouldn’t be used for parsing HTML and XML, there are still times when they can be used to manipulate those formats. You have to be careful when doing this; a famous Stack Overflow answer about using regexp to parse XML demonstrates just how frustrated some programmers can get with some questions. However, there are some XML-related tasks for which regexps are perfectly suited. This exercise is one of them: Given a text string, you are to remove all of the XML/HTML tags, leaving everything else in place. It’s fine to leave some corner cases in place; we’re not trying to build the ultimate XML tag parser here. So if you have the string This is a headline

This is a paragraph with a link.



This is another paragraph,

this time on two lines!



We want to strip all of the HTML tags from the above, leaving us with: This is a headline

This is a paragraph with a link.



This is another paragraph,

this time on two lines!

3.8.4

Deunixify paths

Solution is in section 11.5 Our company hired a technical writer who thought we were using Unix, but we were actually using Windows. This means that the paths in our text were all written as dir1/dir2/filename

But they really needed to be dir1\dir2\filename

We want to change all of the / characters to \ characters. Well, not all of them; we only want to do this if there are non-whitespace characters after our / character. Thus, given the following string: My file might be in /tmp/foo or in /tmp/bar; that / is tricky!

We want it to be turned into My file might be in \tmp\foo or in \tmp\bar; that / is tricky!.

Can you save the day, and turn the slashes into backslashes, and make this a Windows-friendly company?

3.9

Unix command line

3.9.1

Disk space

Solution is in section 12.1 The df program returns the current disk usage for each of your filesystems. One of the columns indicates the percentage of disk space being used. Use a regexp (and grep) to find those filesystems that have at least 80% usage. You can assume that the output from df will only use a % sign when reporting that percentage. You can return the entire line with such a percentage.

3.9.2

Not-today files

Solution is in section 12.2 Find all of the files in a directory that were not modified today. In other words, if today is April 1st, and the directory listing (using ls -l for a “long” listing) looks like this:

-rw-r--r--  1 reuven  501   1967 Apr  1 10:02 UNIX-disk-space.md
-rw-r--r--  1 reuven  501    223 Apr  2 22:53 UNIX-files-not-today.md
-rw-r--r--  1 reuven  501    499 Mar  2 09:56 UNIX-old-new-office-files.md
-rw-r--r--  1 reuven  501    177 Mar  2 09:56 UNIX-python-ruby-programs.md
-rwxr-xr-x  1 reuven  501   3694 Mar  9 11:39 extract-exercises.py*
-rw-r--r--  1 reuven  501    678 Mar 30 09:10 ipython_log.py
-rw-r--r--  1 reuven  501  53769 Mar 23 16:03 solutions.zip
-rw-r--r--  1 reuven  501    939 Apr  1 11:31 template.md

We’re only interested in seeing the lines whose timestamp says Apr 1, and want to see those lines. However, we don’t want to insert a literal Apr 1 in there; it should reflect the current date. So if I issue that same command tomorrow, it’ll show files from April 2nd.

3.9.3

Problem logs

Solution is in section 12.3 In exercise 9.5, we found the IP addresses for all requests to our server that had no errors. In this exercise, we want to find all of the requests in fakelog.txt for which there were problems.

We can make this a bit simpler: In fakelog.txt, errors are indicated with a line that looks like: [2015-Sep-2 10:16:44] 11.22.33.44

Result 404: File not found

We can assume that all errors have either the code 404 or 500. Other result codes are not of interest to us. Your task is to use grep to find all of the result codes 404 or 500, and display not only the line on which this code appeared, but the line before it.

3.9.4

Old and new Office files

Solution is in section 12.4 Several years ago, Microsoft started to use the .docx and .xlsx suffix on their files, rather than the three-letter .doc and .xls. Given a directory listing, display all files that have those suffixes. Note that if a file contains .doc (or any other of these suffixes) in the middle, but not at the end of the filename, then it should not be displayed.

Assume that ls -1 gives you a listing of all files in a single column, such that you can treat each filename as a single row in the input to grep.

Chapter 4

Simple regexps

4.1

Find matches

This exercise is deliberately very simple, to try to get you into the spirit of working with regular expressions. The idea is to ask the user to enter a regular expression, and then to print all of the lines in a file which match that regexp. In other words, you’re going to be creating a simple grep command. Each programming language has a different way of asking the user for input – and in the case of PostgreSQL, there really isn’t any way, so I fudged it a bit in my solution. Nevertheless, taking a string and turning it into a regexp, then finding that regexp in a file, is a good way to start. In this exercise, you are to:

1. Ask the user to enter a regexp (via a string)
2. Print all lines in the dictionary that match that regexp.

Note that the regexp doesn’t have to match the entire word. Thus if our regexp is abc, then any word containing the three characters abc in a row should be printed, regardless of whether it is a 3-letter word or a 10-letter word.

4.1.1

Solution

There is no generic solution to this problem. Every language has its own way to ask the user for input, turn that input into a regexp, open a file, and then iterate over that file, looking for the regexp. In Ruby and JavaScript, you have two different ways to create regexps, using either the double-slash syntax or the object-constructor syntax. Because we’re getting input from the user as a string, the latter would appear to be a more appropriate solution in this case.

4.1.2

Python

In Python 2, we get input from the user with the raw_input builtin function. This function has been renamed input in Python 3; I hope that this will be one of the few places in the book where I indicate my preference for Python 2. (That preference is professional, not personal; nearly all of my clients have tons of legacy code, and cannot easily upgrade to Python 3.) After getting the regexp from the user, we then compile it into a regexp object, using re.compile. This is a common thing to do when applying a regexp many times; rather than compiling it inside of each loop iteration, we’ll compile it once and apply it many times. We then open the file with the open function, returning a file object that remains unnamed in this program. However, we are able to iterate over the file’s lines, one by one, using this standard Python syntax. We then use the compiled regexp’s search method (the counterpart of re.search) to look anywhere in the line for a match to our regexp. Any matching line is then printed to the user’s screen.

import re

r = raw_input("Enter a regexp: ")    # use input() in Python 3
ro = re.compile(r)

for line in open('words.txt'):
    if ro.search(line):
        print(line)
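One thing the solution above doesn’t handle is a user who types something that isn’t a legal regexp. Here is a minimal sketch of how that might be dealt with, assuming Python’s re.error exception; the function names and the short sample list (a stand-in for words.txt) are mine, and are only for illustration:

```python
import re

def compile_user_regexp(text):
    """Return a compiled regexp, or None if `text` is not a valid pattern."""
    try:
        return re.compile(text)
    except re.error as e:
        print("Invalid regexp: {0}".format(e))
        return None

def matching_lines(ro, lines):
    """Return the lines (minus trailing newlines) in which `ro` matches anywhere."""
    return [line.rstrip('\n') for line in lines if ro.search(line)]

# Hypothetical stand-in for the lines of words.txt
sample = ['abcess\n', 'cab\n', 'dabchick\n']

print(matching_lines(compile_user_regexp('abc'), sample))
print(compile_user_regexp('a('))   # unbalanced parenthesis, so this is None
```

Stripping the newline before printing also avoids the blank line that print(line) would otherwise produce after each match.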

4.1.3

Ruby

The Ruby version is similar in style to the above Python version: We ask the user for input, and receive that input in the form of a string. We turn the string into a regexp using Regexp.new, which automatically compiles it (thus avoiding the need for something like Python’s re.compile). Notice how I take the input from gets, and then apply String#chomp to it, in order to ensure that we remove the newline character from the end of the string. We then iterate over the lines of our dictionary file by opening it and then iterating over the file using File#each_line. We then print each line in which we found a match:

print "Enter a regexp: "
r = Regexp.new(gets.chomp)

File.open('words.txt').each_line do |line|
  if line =~ r
    puts line
  end
end

4.1.4

JavaScript

The JavaScript solution is similar in many ways to the example program given in the description of working with files in JavaScript. The big difference is that we also need to get user input. This is a bit weird and/or tricky in JavaScript; fortunately, most of the programs in this book won’t require us to get input from the user. What we need to do, in a nutshell, is use readline to provide an object on which we can invoke createInterface. This function lets us specify the source of its input (process.stdin) and output (process.stdout). We can then invoke the question method on the resulting readline interface object, passing it a function that then gets the answer. In other words, the solution will look something like this:

"use strict";

var readline = require('readline');
var fs = require('fs');

var rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout
});

rl.question("Enter regexp: ", function(user_input) {
    var r = RegExp(user_input);

    fs.readFile('words.txt', 'utf8', function (err, data) {
        if (err) {
            console.log("Error!\n");
            return console.log(err);
        }

        for (let line of data.split("\n")) {
            if (line.match(r)) {
                console.log(line);
            }
        }

        process.exit();
    });
});

4.1.5

PostgreSQL

PostgreSQL doesn’t allow us to get user input. Thus, we’ll just have to hard-code it within our query. For the purposes of this exercise, I’ll use the regexp a....b, meaning six characters starting with a and ending with b. The four interim characters can be anything but a newline, although the fact that each record contains a single line from the dictionary file means that this doesn’t make a difference. I’ll use the words database, which contains the dictionary, with one row of the dictionary file in each row of the table. We’ll thus create an SQL query that searches through all of the rows in the table, displaying those that match our regexp. This, like many PostgreSQL-related regexp queries, turns out to be surprisingly short and simple:

SELECT line FROM words
WHERE line ~ 'a....b';

In this case, all we’re doing is using the built-in ~ operator. We check the line column against our regexp, and then display the line column when the operator returns a true value.

4.2

Five-letter words

In this exercise, you are to display words in the dictionary that are either four letters long, or that are five letters long if they end with an s. The word – not just a subset of the word – should be precisely four or five letters long. For the purposes of this exercise, any character (not just a letter) can be counted in the first four letters of the word. However, if there is a fifth letter, it must be an s.

4.2.1

Solution

There are two parts to this exercise. First of all, we need to create a regexp that will match four-letter words and five-letter words ending with s. Another way of thinking about this is to say that we want to find four characters, followed by an optional s. In regexps, we can use the ? metacharacter to indicate that the preceding character is optional. Our regexp will thus be: ....s?

In other words, four characters that are not newlines (represented by .), and then an optional s. However, if we were merely to search for this regexp in each line of the dictionary, we would find that many longer words would match, as well. That’s because the regexp, left as it is above, will match any word with four or more letters in it. We have several ways to deal with this problem. One is to use anchors to connect the regexp to the start and end of the line. For example: ^....s?$

The ^ anchors the regexp to the front of the line, and the $ anchors it to the end of the line. That’s probably the best way to go about this, I’d say. Another solution is to use the programming language’s string-length function to determine whether the word is either four or five characters in length, and then check that it fits our criteria.

In the below solutions, I use anchors – however, if you aren’t yet familiar or comfortable with them, filtering out strings that are not four or five characters long is a reasonable solution to the problem, as well.
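To see the difference between these approaches, here is a small sketch in Python. The word list is a hypothetical sample, not the contents of words.txt, and the length-based filter is just one way the string-length alternative might be written:

```python
import re

# Hypothetical sample; the real exercise reads words.txt
words = ['cats', 'dog', 'dogs', 'house', 'boats', 'board']

unanchored = re.compile(r'....s?')
anchored = re.compile(r'^....s?$')

# Without anchors, anything with four or more characters matches
loose = [w for w in words if unanchored.search(w)]

# With anchors, only four-character words and five-character words ending in s
strict = [w for w in words if anchored.search(w)]

# The string-length alternative, with no regexp at all
by_length = [w for w in words
             if len(w) == 4 or (len(w) == 5 and w.endswith('s'))]

print(loose)
print(strict)
print(by_length)
```

On this sample, the anchored regexp and the length check pick out the same words, while the unanchored version also lets house and board through.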

4.2.2

Python

import re

ro = re.compile('^....s?$')

for line in open('words.txt'):
    if ro.search(line):
        print(line)

4.2.3

Ruby

r = Regexp.new('^....s?$')

File.open('words.txt').each_line do |line|
  if line =~ r
    puts line
  end
end

4.2.4

JavaScript

"use strict";

var fs = require('fs');
var r = RegExp('^....s?$');

fs.readFile('words.txt', 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    for (let line of data.split("\n")) {
        if (line.match(r)) {
            console.log(line);
        }
    }

    process.exit();
});

4.2.5

PostgreSQL

SELECT line FROM words
WHERE line ~ '^....s?$';

4.3

Double “f” in the middle

In this exercise, you need to find all of the words in the dictionary that contain a “ff” in them, so long as those f’s are not the first or final characters in the word. Thus, “affable” would be fine, but “quaff” would not.

4.3.1

Solution

We know that the regexp will need to include ff inside of it. But if we use the simple regexp ff then we are telling the regexp engine that it’s OK to find ff anywhere in our word, including the start or the finish. We could thus start to use all sorts of metacharacters, to ensure that we have at least one character before and after the ff. For example: .+ff.+

The above says that there can be any number of characters before and after the ff. But if we think about it for a moment, all we care about is having at least one character before and after the ff. We don’t care about anything else in the string. We can thus whittle our regexp down to a more minimal version: .ff.
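A quick way to convince yourself that the shorter regexp does the same job is to try all three on a few strings. This is only a sketch; the list below is a hypothetical sample (ffoo isn’t even a real word, but it shows what happens when ff comes first):

```python
import re

# Hypothetical sample strings, not taken from words.txt
words = ['affable', 'quaff', 'ffoo', 'off', 'coffee']

def matches(pattern):
    """Return the sample words in which `pattern` matches anywhere."""
    ro = re.compile(pattern)
    return [w for w in words if ro.search(w)]

anywhere = matches(r'ff')      # ff at the start or end is accepted, too
verbose = matches(r'.+ff.+')   # at least one character on each side
minimal = matches(r'.ff.')     # exactly one character on each side is enough

print(anywhere)
print(verbose)
print(minimal)
```

Because search looks anywhere in the string, .ff. and .+ff.+ accept exactly the same words; they differ only in how much of each word the match covers.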

4.3.2

Python

import re

ro = re.compile('.ff.')

for line in open('words.txt'):
    if ro.search(line):
        print(line)

4.3.3

Ruby

r = Regexp.new('.ff.')

File.open('words.txt').each_line do |line|
  if line =~ r
    puts line
  end
end

4.3.4

JavaScript

"use strict";

var fs = require('fs');
var r = RegExp('.ff.');

fs.readFile('words.txt', 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    for (let line of data.split("\n")) {
        if (line.match(r)) {
            console.log(line);
        }
    }

    process.exit();
});

4.3.5

PostgreSQL

SELECT line FROM words
WHERE line ~ '.ff.';

4.4

Extract timestamp

It’s common to use regular expressions to extract information from logfiles. In the access-log.txt file that comes with this book, each HTTP request is accompanied by a timestamp, consisting of a date and time. In this exercise, you must match and retrieve the entire timestamp from each line, starting with [ and ending with ]. For the purposes of this exercise, you cannot assume that this will be the only pair of [ and ] in the logfile, so you cannot use a regexp such as: \[[^]]+\]

which would mean, “start with [, end with ], and take everything in the middle.” You’ll need to specify the regexp more explicitly and carefully than that. For example, the first line of access-log.txt contains the following timestamp: [30/Jan/2010:00:03:18 +0200]

You are to retrieve just that part of each line.

4.4.1

Solution

There are a number of ways to do this. One of the trickiest parts of this task, however, is to recognize that [ and ] are both metacharacters, because they delimit character classes. To match them literally, we need to precede each of them with a backslash.

I’m going to use the built-in character classes \d (any digit) and \w (any letter, digit, or underscore), as well as the {min,max} way of indicating how many characters we want. Because + is normally a metacharacter, meaning “one or more of the preceding character,” we also need a backslash in order to match the literal + in the time zone: '\[\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} \+\d{4}\]'

The above basically says that we want:

- a literal opening [, so we precede it with a \
- two digits (date), followed by a slash
- three letters/numbers (month), followed by a slash
- four digits (year), followed by a colon
- two digits (hour), followed by a colon
- two digits (minute), followed by a colon
- two digits (seconds)
- space
- a literal +, so we add a \
- four digits (time zone)
- a literal closing ], so we precede it with a \

I often build my regexps in this way, slowly but surely, especially when they aren’t easy or obvious. I’ll write a regexp that captures the first part of the timestamp, and then move onto longer and more explicit descriptions of what I want, until I have captured the entire thing.
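One way to keep such a piece-by-piece regexp readable is to write it in Python’s verbose (extended) mode, with one piece and one comment per line. The following is only a sketch: the log line is a hypothetical example built around the timestamp shown in the exercise, and the trailing [extra] is invented to show why a looser regexp isn’t good enough.

```python
import re

timestamp = re.compile(r'''
    \[          # literal opening bracket
    \d{2}/      # two digits (date), then a slash
    \w{3}/      # three letters/numbers (month), then a slash
    \d{4}:      # four digits (year), then a colon
    \d{2}:      # two digits (hour), then a colon
    \d{2}:      # two digits (minute), then a colon
    \d{2}       # two digits (seconds)
    [ ]         # a space, written as a class since verbose mode skips bare spaces
    \+\d{4}     # literal +, then four digits (time zone)
    \]          # literal closing bracket
''', re.VERBOSE)

# Hypothetical log line; only the timestamp comes from the exercise text
line = '127.0.0.1 - - [30/Jan/2010:00:03:18 +0200] "GET / HTTP/1.1" 200 [extra]'

m = timestamp.search(line)
print(m.group(0))

# The looser "anything between brackets" regexp also grabs the second pair
loose = re.findall(r'\[[^]]+\]', line)
print(loose)
```

The explicit version returns only the timestamp, whereas the looser one returns both bracketed stretches.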

4.4.2

Python

import re

filename = 'access-log.txt'
ro = re.compile(r'\[\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} \+\d{4}\]')

for line in open(filename):
    m = ro.search(line)
    if m:
        print(m.group(0))    # just the timestamp, not the whole line

4.4.3

Ruby

filename = 'access-log.txt'
r = Regexp.new('\[\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} \+\d{4}\]')

File.open(filename).each_line do |line|
  if line =~ r
    puts line.match(r)
  end
end

4.4.4

JavaScript

"use strict";

var fs = require('fs');
var r = /\[\d{2}\/\w{3}\/\d{4}:\d{2}:\d{2}:\d{2} \+\d{4}\]/;
var filename = 'access-log.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    for (let line of data.split("\n")) {
        var m = line.match(r);
        if (m) {
            for (let match of m) {
                console.log(match);
            }
        }
    }

    process.exit();
});

4.4.5

PostgreSQL

Because we want to extract text, rather than just match it, we need to use regexp_matches with our regexp. That function returns an array of text, from which we’ll then grab the element at index 1:

SELECT (regexp_matches(line,
    '\[\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} \+\d{4}\]'))[1]
FROM access_log;

Chapter 5

Character classes

5.1

End-of-sentence words

In Alice in Wonderland, find all of the words that are at the end of a sentence. In other words, find and display all of the words that end with ., ?, or !. You should display the punctuation mark along with the word. For the purposes of this exercise, a word is any string of alphanumeric characters at least two characters long.

5.1.1

Solution

This is a classic case of using character classes. First of all, we’re looking for three specific characters (., ?, and !). This means that we can define the character class [.?!]. This might lead us to think that the regexp we want is: .[.?!]

But there are three problems with the above: First of all, it doesn’t restrict the character before the punctuation mark to be alphanumeric. Secondly, it only captures a single character, rather than the entire word. Thirdly, the specifications indicate that our word must be at least two characters long.

We can solve all of these problems together by using the built-in \w character class, which is the same as [A-Za-z0-9_]. We can then indicate that we want a minimum of two such characters by using the {min,max} specifier. Our final regexp thus looks like this: '\w{2,}[.?!]'

Note that because more than one sentence might appear on a single line of text, we’ll need to use the functionality that finds all matches, rather than just the first one on a line.
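The difference between finding the first match and finding all of them is easy to see on a line containing several sentences. Here is a minimal sketch in Python; the sample line is hypothetical, not a quotation from alice.txt:

```python
import re

ro = re.compile(r'\w{2,}[.?!]')

# Hypothetical line containing three sentences
line = 'Is it late? I am sure it is. Run!'

first = ro.search(line).group(0)   # only the first match on the line
every = ro.findall(line)           # every non-overlapping match

print(first)
print(every)
```

Here search stops at late?, while findall also returns is. and Run!.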

5.1.2

Python

import re

filename = 'alice.txt'
ro = re.compile(r'\w{2,}[.?!]')

for line in open(filename):
    m = ro.findall(line)
    if m:
        print(m)

5.1.3

Ruby

filename = 'alice.txt'
r = Regexp.new('\w{2,}[.?!]')

File.open(filename).each_line do |line|
  m = line.scan(r)
  if !m.empty?
    puts m
  end
end

5.1.4

JavaScript

In the below regexp, notice how I doubled the \ in order to avoid \w being interpreted as just a w. I also pass the g flag, so that match returns every match on the line, rather than just the first one.

"use strict";

var fs = require('fs');
var filename = 'alice.txt';
var r = RegExp('\\w{2,}[.?!]', 'g');

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    for (let line of data.split("\n")) {
        var m = line.match(r);
        if (m) {
            for (let match of m) {
                console.log(match);
            }
        }
    }

    process.exit();
});

5.1.5

PostgreSQL

SELECT (regexp_matches(line, '\w{2,}[.?!]', 'g'))[1]
FROM alice;

5.2

Hex numbers

Given the following sentence:

I like the hex numbers 0xfa and 0X123 and 0xcab and 0xff

retrieve all of the hexadecimal numbers. That is, each number starts with 0x (or 0X), followed by a string of digits or the letters a through f, capital or lowercase.

5.2.1

Solution

We cannot use the built-in \w character class here, because we want a more restricted set of characters. So our character class will look like [A-Fa-f]. However, we also want to allow for numeric digits, so we’ll add \d to our custom class. We want one or more of these following 0x, which means that our final regexp will be: 0[xX][A-Fa-f\d]+
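To see why \w would be too permissive here, we can compare the two character classes on a few strings. This is just a sketch in Python; the second test string, with its deliberately bogus “hex” numbers, is a hypothetical example:

```python
import re

sentence = 'I like the hex numbers 0xfa and 0X123 and 0xcab and 0xff'

hex_re = re.compile(r'0[xX][A-Fa-f\d]+')   # custom class: hex digits only
word_re = re.compile(r'0[xX]\w+')          # too broad: any letter, digit or _

found = hex_re.findall(sentence)
print(found)

# Hypothetical input with invalid hex digits
bogus = '0xfa 0xyz 0x12g'
too_many = word_re.findall(bogus)
just_hex = hex_re.findall(bogus)
print(too_many)
print(just_hex)
```

With \w, the invalid 0xyz and 0x12g are accepted whole; with the custom class, 0xyz is rejected and 0x12g is cut off after the last legal hex digit.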

5.2.2

Python

import re

s = 'I like the hex numbers 0xfa and 0x123 and 0xcab and 0xff'
ro = re.compile(r'0[xX][A-Fa-f\d]+')

print(ro.findall(s))

5.2.3

Ruby

s = 'I like the hex numbers 0xfa and 0x123 and 0xcab and 0xff'
r = Regexp.new('0[xX][A-Fa-f\d]+')

puts s.scan(r)

5.2.4

JavaScript

var s = 'I like the hex numbers 0xfa and 0x123 and 0xcab and 0xff';

// The backslash must be doubled inside of a string; '\d' alone would become just 'd'
var r = RegExp('0[xX][A-Fa-f\\d]+', 'g');
var m = s.match(r);

if (m) {
    for (let item of m) {
        console.log(item);
    }
}

5.2.5

PostgreSQL

To do this in PostgreSQL, I’ll use regexp_matches against the string:

SELECT (regexp_matches('I like the hex numbers 0xfa and 0x123 and 0xcab and 0xff',
                       '0[xX][A-Fa-f\d]+', 'g'))[1];

5.3

Hexwords

Which words in the dictionary contain only the letters a through f?

5.3.1

Solution

The solution to this exercise is a regexp that is anchored to the start and end of a word, and contains a character class with the letters a through f: ^[a-f]+$

Notice the +, which indicates that the word might be more than one character long. Forget to add that, and you’ll end up matching a much smaller set of words! Failing to anchor the regexp to the start and end of the word with ^ and $ will result in finding words in which at least one character is from the set [a-f], but other letters might not be.
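To make the effect of the + and of the anchors concrete, here is a minimal sketch that runs three variants of the regexp against a handful of words. The word list is a hypothetical stand-in for a few lines of words.txt, not taken from the actual file.

```python
import re

# Hypothetical sample words, standing in for a few lines of words.txt
words = ['a', 'bead', 'cabbage', 'deaf', 'zebra']

anchored = re.compile(r'^[a-f]+$')    # the full solution
no_plus = re.compile(r'^[a-f]$')      # the + was forgotten
no_anchors = re.compile(r'[a-f]+')    # ^ and $ were forgotten

hexwords = [w for w in words if anchored.search(w)]
one_letter_only = [w for w in words if no_plus.search(w)]
too_many = [w for w in words if no_anchors.search(w)]

print(hexwords)          # words made up entirely of a-f
print(one_letter_only)   # only single-letter words survive
print(too_many)          # any word with at least one a-f letter
```

Without the +, only one-letter words can match; without the anchors, every word in this sample matches, because each contains at least one letter from a through f.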

5.3.2

Python

import re

filename = 'words.txt'
ro = re.compile('^[a-f]+$')

for line in open(filename):
    if ro.search(line):
        print(line)

5.3.3

Ruby

filename = 'words.txt'
r = Regexp.new('^[a-f]+$')

File.open(filename).each_line do |line|
  if line =~ r
    puts line
  end
end

5.3.4

JavaScript

"use strict";

var fs = require('fs');
var r = RegExp('^[a-f]+$');
var filename = 'words.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    for (let line of data.split("\n")) {
        if (line.match(r)) {
            console.log(line);
        }
    }

    process.exit();
});

5.3.5

PostgreSQL

SELECT line FROM words
WHERE line ~ '^[a-f]+$';

5.4

IP addresses

Each line of access-log.txt starts with an IP address. Each IP address has four numbers, each containing between one and three digits. The numbers are separated by periods (.). In this exercise, you are to retrieve the IP addresses from access-log.txt by building a character class, not by splitting the line on whitespace.

5.4.1

Solution

If I were only interested in four characters separated by periods, I could use a generic regexp, such as: \w\.\w\.\w\.\w

Notice how we need to use \., and not just .. That’s because we don’t want to use the . metacharacter here, but rather a literal . character. To do that, we need to use \.. But the above regexp doesn’t do what we want, in two different ways: First of all, it captures only one \w, when we want to have between one and three. Beyond that, we actually want to have digits (\d), not alphanumeric characters (\w). So we can rewrite the regexp as follows: \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

The above will work, and isn’t a bad way to go about things. But we can do one better, albeit using the more advanced technique of grouping: We can notice that there is a pattern that repeats three times, put that pattern in parentheses, and indicate that it should occur three times: (\d{1,3}\.){3}\d{1,3}

In other words: We want to have 1-3 digits, followed by ., three times. Then, we want to have 1-3 digits. Finally, let’s ensure that we only find an IP address that is the first thing on its line, by adding ^ to the front: ^(\d{1,3}\.){3}\d{1,3}

Notice that this now means we’ve introduced a group to our regexp, via the parentheses. In some languages and environments, this will change the way in which we receive output.
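As a small illustration of how a group changes what we get back, the following sketch compares re.findall with group(0) on a match object. The log line is a hypothetical example written in the style of access-log.txt, not a line from the real file.

```python
import re

# Hypothetical log line in the style of access-log.txt
line = '192.168.1.10 - - [30/Jan/2016:12:01:05 +0200] "GET / HTTP/1.1" 200 512'

pattern = r'^(\d{1,3}\.){3}\d{1,3}'

# When the regexp contains a group, findall returns the group's contents.
# A repeated group only remembers its last repetition.
group_only = re.findall(pattern, line)

# A match object's group(0) is always the entire matched text.
whole_match = re.search(pattern, line).group(0)

print(group_only)
print(whole_match)
```

Here findall hands back only the final repetition of the group, while group(0) gives the full address. This is why the solutions below either ask for the whole match explicitly or wrap the entire regexp in an extra pair of parentheses.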

5.4.2

Python

In Python, we can always ask to see m.group(0), to see the entire string that the regexp matched:

import re

filename = 'access-log.txt'
ro = re.compile(r'^(\d{1,3}\.){3}\d{1,3}')

for line in open(filename):
    m = ro.search(line)
    if m:
        print(m.group(0))   # group 0 is the whole match, regardless of other groups

5.4.3

Ruby

In order to avoid problems using String#scan with groups in Ruby, I instead used String#match, which returns just the first match:

filename = 'access-log.txt'
r = Regexp.new('^((\d{1,3}\.){3}\d{1,3})')

File.open(filename).each_line do |line|
  result = line.match(r)
  if result
    puts result
  end
end

5.4.4

JavaScript

In the JavaScript program below, there are two things we need to watch out for: First of all, we cannot merely pass \d or \., but must double the backslash there, because JavaScript’s string parser would otherwise drop it. (If we were to use the slash style of defining regexps, that problem would not occur.) The second thing to notice is that because we have defined a group, we must be careful about what we print out from our match object. Element 0 of the match array always holds the entire match, so we retrieve it with m[0].

"use strict";

var fs = require('fs');
var r = RegExp('^(\\d{1,3}\\.){3}\\d{1,3}');
var filename = 'access-log.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    for (let line of data.split("\n")) {
        var m = line.match(r);
        if (m) {
            console.log(m[0]);
        }
    }

    process.exit();
});

5.4.5

PostgreSQL

In the PostgreSQL version of this regexp, we can get into a bit of trouble. That’s because regexp_matches returns an array of results, but if the regexp contains a group (delimited with parentheses), the groups are what show up in the array. We thus need to define an additional group, one which encloses the entire regexp. By doing this, group #1 is the entire match:

SELECT (regexp_matches(line, '^((\d{1,3}\.){3}\d{1,3})'))[1]
FROM access_log;

5.5

Long, weird words

Find all of the words in the dictionary that have the following characteristics:

10 letters long

Start with a letter from the first half of the alphabet (a-m)

End with a letter from the second half of the alphabet (n-z)

Somewhere in the middle, there should be a “p”

5.5.1

Solution

Our regular expression is basically defined by the specification here. Let’s start with the fact that it must start with a letter from the character class [a-m], and end with a letter from the character class [n-z]. If that, plus the need for the word to be 10 characters long, were the only requirements, then our regexp could look like this: [a-m].{8}[n-z]

Except that this isn’t enough – to begin with, regexps can match anywhere in the target string. This regexp will thus match 10 characters within a longer word, as well as a 10-letter word. We can add anchors to ensure that the word is precisely 10 characters long: ^[a-m].{8}[n-z]$
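The difference the anchors make can be seen in a short sketch. The three words below are hypothetical samples chosen for illustration; they are not claimed to come from words.txt.

```python
import re

# Hypothetical sample words: a 10-letter word, a 13-letter word, a short word
words = ['carpenters', 'complications', 'cat']

unanchored = re.compile(r'[a-m].{8}[n-z]')
anchored = re.compile(r'^[a-m].{8}[n-z]$')

# Without anchors, any 10-character stretch inside a longer word can match
loose = [w for w in words if unanchored.search(w)]

# With anchors, the whole word must be exactly 10 characters long
exact = [w for w in words if anchored.search(w)]

print(loose)
print(exact)
```

The unanchored version accepts "complications" because the ten characters from "m" through "n" satisfy the pattern, even though the word itself has 13 letters.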

But of course, we still haven’t indicated that there can or should be a letter p in there somewhere. And that’s where things get a bit complicated. One way to indicate that a p is in there is to add the following: ^[a-m][a-z]*p[a-z]*[n-z]$

The above tells the regexp engine that we want to start with a character from [a-m], end with a character from [n-z], and have a p somewhere in the middle. But what about the length? So far as I can tell, there isn’t any easy way to handle both specifications at the same time. The moment that the p could be anywhere inside of that field, we have lost the ability to specify that “we want eight letters, at least one of which must be p.” In cases like this, I thus rely on the programming language I’m using to do some of the checking for me.

We could, instead, check the length with the regexp and look for p inside of our string using a function or method within our chosen language. But to me, at least, that doesn’t seem as satisfying, and it’s likely to be less efficient as well, since many high-level languages can calculate the length of a string quickly, but cannot find a substring nearly as fast.

5.5.2

Python

import re

filename = 'words.txt'
ro = re.compile('^[a-m][a-z]*p[a-z]*[n-z]$')

for line in open(filename):
    word = line.rstrip('\n')    # don't count the trailing newline in the length
    if len(word) == 10 and ro.search(word):
        print(word)

5.5.3

Ruby

filename = 'words.txt'
r = Regexp.new('^[a-m][a-z]*p[a-z]*[n-z]$')

File.open(filename).each_line do |line|
  word = line.chomp             # drop the trailing newline before checking the length
  if word.size == 10 and word =~ r
    puts word
  end
end

5.5.4

JavaScript

"use strict";

var fs = require('fs');
var r = RegExp('^[a-m][a-z]*p[a-z]*[n-z]$');
var filename = 'words.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    for (let line of data.split("\n")) {
        if (line.match(r) && line.length == 10) {
            console.log(line);
        }
    }

    process.exit();
});

5.5.5

PostgreSQL

SELECT line FROM words
WHERE length(line) = 10
AND line ~ '^[a-m][a-z]*p[a-z]*[n-z]$';

5.6

Matching URLs

Let’s assume that we have defined a string:

I love to visit https://example.com/foo.html every day!
More than http://abc-def.co.il/.

Write a regexp that will match both URLs, but not the characters before or after them. Include the /foo.html in the first URL, but not the trailing period (.) in the second.

5.6.1

Solution

We often think of URLs as fairly simple. However, matching them can be a bit tricky, because of several variations in the URLs we see here. For example, the first begins with https://, and the second begins with http://. The first ends with a filename (including a “.html” suffix), while the second has a hostname containing a - character.

Starting from the beginning, we can match the start of each URL with https?://. The ? metacharacter indicates that the character preceding it (s) is optional, and can appear zero or one times. While URLs can start with any number of different protocol names, this particular exercise only requires that we match http and https at the start.

We then need to match the hostname. We don’t want to match every possible character, since not all characters are valid in hostnames. I’m going to assume, for these purposes, that hostnames might contain letters, numbers, underscores, and dashes. And, of course, they might contain periods as well, separating the host from the domain. (The solution I’m presenting here would also match illegal URLs, such as those containing two consecutive . characters.)

We can shorten this character class definition by using the built-in \w character class, which is defined to be the same as [A-Za-z0-9_]. If we want to create a character class that’ll match \w, ., /, and -, then the - character will need to be at the start or end of the character class. Otherwise, it’ll be interpreted as defining a range. Also note that . inside of a character class is treated literally, not as a metacharacter. We’ll match one or more of these characters, indicated by using a + sign following our character class.

Our URL ends with a repeat of our character class, but without any . inside (since our URL cannot end with it). This ensures that we won’t match trailing punctuation marks. Given all of this, our regular expression could be: https?://[\w./-]+[\w/-]

5.6.2

Python

Remember that in Python, ordinary quoted strings cannot include literal newlines. Thus, we must use a triple-quoted string, unless we want to use \n in our string:

import re

s = '''I love to visit https://example.com/foo.html every day!
More than http://abc-def.co.il/.'''

ro = re.compile(r'https?://[\w./-]+[\w/-]')
print(ro.findall(s))

5.6.3

Ruby

s = 'I love to visit https://example.com/foo.html every day!
More than http://abc-def.co.il/.'

r = Regexp.new('https?://[\w./-]+[\w/-]')
puts s.scan(r)

5.6.4

JavaScript

JavaScript’s quoted strings cannot contain literal newlines. We could combine two strings with +, or just have a very long string, but below I’ve used a \ at the end of the line to indicate that I want the string to continue onto the next line. To avoid problems with \w, in this case I decided to build the regexp using //. Note that because I want to find all of the matches, and not just the first one, I must pass the g modifier when I create the regexp. But of course, there’s a tradeoff for everything: using the // syntax to create our regexp means that we must precede every literal / outside of a character class with a backslash.

"use strict";

var s = 'I love to visit https://example.com/foo.html every day! \
More than http://abc-def.co.il/.';

var r = /https?:\/\/[\w./-]+[\w/-]/g;
console.log(s.match(r));

5.6.5

PostgreSQL

To do this in PostgreSQL, I’ll use regexp_matches against the string:

SELECT (regexp_matches('I love to visit https://example.com/foo.html
every day! More than http://abc-def.co.il/.',
'https?:\/\/[\w./-]+[\w/-]', 'g'))[1];

5.7

Non-zero hours

Once again, it’s time to search for certain patterns in access-log.txt: We want to find all of the records in which the hour doesn’t begin with a 0. (Remember that Apache logs, like many other logfiles, operate on a 24-hour clock. Thus, 11 p.m. is written as 23:00.) Thus, you should not show the records from 00:00 through 09:59, but should show those from 10:00 through 23:59. For the purposes of this exercise, you may assume that square brackets ([ and ]) only occur around the timestamp.

5.7.1

Solution

What we’re looking for is the hour, which consists of two digits surrounded by colons (:), in which the first digit is not a zero. That can be expressed as follows in a regexp: :[1-9]\d:

Normally, we can use \d to describe a digit. But in the case of the first digit, we’re willing to have any digit but 0. This means that we can just create our own, custom character class, setting a range from 1 to 9.

The problem is that while the above regexp will indeed find all of the nonzero hours, it’ll also find many others. That’s because we might have such patterns elsewhere in the line, and even elsewhere in the timestamp, thanks to the fact that we also have two-digit minutes, surrounded by colons. We’ll thus need to be a bit more specific.

One easy way to do this is to assume that the hour will come after the year, which is a four-digit number starting with 20. That’s probably enough to find what we need; if you want to be completely sure, then you can extend the regexp to match the opening [ or the closing ]. Our regexp thus looks like this: /20\d\d:[1-9]\d:

Again, we could get more specific than this. However, one of the lessons I try to teach people who are learning regexps is that you have to know your data, and you have to know it well enough to know how obsessive to get about correctness. For now, I believe that the above will be sufficient.
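The false positives described above are easy to reproduce. In the sketch below, both log lines are hypothetical examples written in the Apache style; they are not taken from access-log.txt.

```python
import re

# Hypothetical Apache-style log lines: one before 10:00, one after
early = '10.0.0.1 - - [30/Jan/2016:06:45:10 +0200] "GET / HTTP/1.1" 200 512'
late = '10.0.0.2 - - [30/Jan/2016:14:05:10 +0200] "GET / HTTP/1.1" 200 512'

naive = re.compile(r':[1-9]\d:')
specific = re.compile(r'/20\d\d:[1-9]\d:')

# The naive regexp is fooled by the minutes (:45:) in the early line
naive_hits_early = naive.search(early) is not None

# Requiring the year first pins the match to the hour field
specific_hits_early = specific.search(early) is not None
specific_hits_late = specific.search(late) is not None

print(naive_hits_early, specific_hits_early, specific_hits_late)
```

The naive regexp reports the 06:45 record as a match, because :45: fits the pattern. Anchoring on the year rejects that record and still accepts the 14:05 one.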

5.7.2

Python

import re

filename = 'access-log.txt'
ro = re.compile(r'/20\d\d:[1-9]\d:')

for line in open(filename):
    if ro.search(line):
        print(line)

5.7.3

Ruby

filename = 'access-log.txt'
r = Regexp.new('/20\d\d:[1-9]\d:')

File.open(filename).each_line do |line|
  if line =~ r
    puts line
  end
end

5.7.4

JavaScript

"use strict";

var fs = require('fs');
var r = /\/20\d\d:[1-9]\d:/;
var filename = 'access-log.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    for (let line of data.split("\n")) {
        if (line.match(r)) {
            console.log(line);
        }
    }

    process.exit();
});

5.7.5

PostgreSQL

SELECT line FROM access_log
WHERE line ~ '/20\d\d:[1-9]\d:';

5.8

Quoted text

In this exercise, we’re going to look for all of the quotations in Alice in Wonderland. I’m looking for any stretch of text that starts with the double-quote character (") and ends with that same character. I’m going to assume that quotes are never nested, and that there’s no use of a programmer’s backslash (\) to escape the double quotes. However, quotes might extend across more than one line.

5.8.1

Solution

My solution to this problem is to use the following regexp: "[^"]+"

As we can see here, the start and end of the regexp are the double-quote characters, which must appear at the start and finish of the matched text. Rather than using a . character to indicate that anything might appear between the double quotes, I’m just going to accept any character other than a double quote. This is a very common paradigm in regexp solutions; I often find myself wanting to look for everything in a sentence, where “sentence” means, “anything that isn’t a period ending a sentence.” Rather than create a regexp that matches what I do want (which can be tricky!), I create a regexp that excludes what I don’t want, using the negated character class [^?!.]. (Note that this can result in false positives, given that people can use punctuation inside of words and acronyms. The double quotes are far less likely to result in false positives!)

Now, you might be wondering why I didn’t make this non-greedy: "[^"]+?"

Remember that + always matches the maximum number of characters that it can, whereas +? matches the minimum number of characters that it can. In this particular case, though, there’s no difference between that minimum and maximum, because we’ve stated that we want the regexp to match all non-" characters, followed by a " character. There is only one string that will match that; while it won’t hurt to add the ? to the +, it won’t help, either.

Another important point here is that this regexp won’t work if we read the file line by line. (If we do that, then we will only see quotes that are on a single line.) Rather, we’ll need to read the file in as a string, and then find all of the matches caught by our regexp.
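Here is a minimal sketch comparing the greedy and non-greedy forms of the negated character class, alongside a greedy . for contrast. The sample text is made up for illustration, and includes one quotation that spans two lines.

```python
import re

# Hypothetical sample text; the second quotation spans a line break
text = 'She said "hello" and then "goodbye\nfor now" to us.'

greedy_class = re.findall(r'"[^"]+"', text)
lazy_class = re.findall(r'"[^"]+?"', text)

# For contrast: a greedy dot runs to the last quote it can reach on the line
greedy_dot = re.findall(r'".+"', text)

print(greedy_class)
print(lazy_class)
print(greedy_dot)
```

Both forms of the negated class return the same two quotations, including the one that crosses the line break, since [^"] happily matches a newline. The greedy dot, on the other hand, swallows the first closing quote and stops at the line break.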

5.8.2

Python

In the Python version of the program, we’ll read the entire file in as a string using file.read. Then, we’ll use re.findall to find all of the quotes that occur in that string. We iterate over the elements in the list returned by re.findall, and print them.

import re

filename = 'alice.txt'
ro = re.compile('"[^"]+"')
s = open(filename).read()

for quote in ro.findall(s):
    print(quote)

5.8.3

Ruby

filename = 'alice.txt'
r = Regexp.new('"[^"]+"')
s = File.open(filename).read

s.scan(r).each do |quote|
  puts quote
end

5.8.4

JavaScript

"use strict";

var fs = require('fs');
var r = /"[^"]+"/g;
var filename = 'alice.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    // match returns null when nothing is found, so fall back to an empty array
    for (let quote of data.match(r) || []) {
        console.log(quote);
    }

    process.exit();
});

5.8.5

PostgreSQL

In this case, we’re not going to use the alice table, but rather the alice_onerow table, in which the entire contents of the book are in a single row. Remember to use the g option to perform a global search, but then also to retrieve the first element of the returned array:

SELECT (regexp_matches(line, '"[^"]+"', 'g'))[1]
FROM alice_onerow;

5.9

Supervocalic

A word is considered “supervocalic” if it contains all five of the English-language vowels (a, e, i, o, and u). Each letter should appear only once, and in that order. For this task, you want to find all of the supervocalic words in the dictionary.

5.9.1

Solution

Let’s build this regexp up, slowly but surely: First of all, we want the word to contain the letter a, which can appear anywhere: a

However, after a appears once, it may not appear again. So we’ll modify our regexp to look as follows: [^a]*a[^a]*

In this way, we know that a appears only once, with zero or more non-a characters coming before it. But now, we want to do the same with e, the next vowel. Let’s do the same thing, indicating that e cannot come before a, and that it can come at some point after a: [^ae]*a[^ae]*e

But of course, this will still match only part of the word. So let’s do two things: Anchor the regexp to the start and end of the word we’re trying to match, and ensure that after e we can have characters, but not e again (nor a again, for that matter): ^[^ae]*a[^ae]*e[^ae]*$

We can continue with this for some time. The bottom line is that we want each of the vowels, in turn, with zero or more non-vowel characters coming between them. Our regexp ends up looking like this: ^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$

This regexp should now match supervocalic words.
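To check the final regexp against a few cases, the following sketch uses a short hand-picked word list. The list is an assumption made for illustration, and is not drawn from words.txt.

```python
import re

pattern = re.compile(
    r'^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$')

# Hand-picked sample words (hypothetical input, not read from words.txt)
words = ['facetious', 'abstemious', 'education', 'sequoia', 'banana']

supervocalic = [w for w in words if pattern.search(w)]
print(supervocalic)
```

"education" and "sequoia" each contain all five vowels, but not in alphabetical order, so the regexp rejects them. "banana" fails because it repeats a and lacks the other vowels.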

5.9.2

Python

import re

filename = 'words.txt'
ro = re.compile('^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$')

for line in open(filename):
    if ro.search(line):
        print(line)

5.9.3

Ruby

filename = 'words.txt'
r = Regexp.new('^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$')

File.open(filename).each_line do |line|
  if line =~ r
    puts line
  end
end

5.9.4

JavaScript

"use strict";

var fs = require('fs');
var r = RegExp('^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$');
var filename = 'words.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    for (let line of data.split("\n")) {
        if (line.match(r)) {
            console.log(line);
        }
    }

    process.exit();
});

5.9.5

PostgreSQL

SELECT line FROM words
WHERE line ~ '^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$';

5.10

Double triple vowel

In English, doubled vowels are a pretty common occurrence. Tripled vowels, though, are a pretty rare thing. Your task is to try to find something even rarer: Words in the dictionary with two separate sets of triple vowels. (And yes, the dictionary I’ve included with this book contains 69 such words.)

5.10.1

Solution

If we are looking for one vowel, then our regexp is [aeiou]

If we want three vowels in a row, then we can use the regexp

[aeiou]{3}

This does not mean that we want the same vowel three times! Rather, it means that three times in a row, the regexp engine should find one of the characters located inside of the character class. If we’re looking for a word with two such sets of letters, then we’ll want to modify our regexp such that it has that pattern twice – but with zero or more characters occurring between them: [aeiou]{3}.*[aeiou]{3}

But wait! What if the vowel is the first letter of the word, and is capitalized? We should thus apply the appropriate flag to make our search case-insensitive. Alternately, we could just modify our regexp to explicitly include [AEIOU], as well. I’ve heard that this is somewhat faster, because you’re limiting the range that the regexp engine should examine, but haven’t ever tested it. Here’s what it would look like, if you weren’t to use the case-insensitive flag: [AEIOUaeiou]{3}.*[aeiou]{3}

In theory, we could also make the second set case insensitive, but I don’t see a compelling reason to do that.

Now, some people might worry that the regexp engine will see four vowels in a row as two sets of three vowels. That is, if I have aeio, then will the regexp engine see this as aei followed by eio? The answer is “no”: a regexp is matched from left to right, and once a character has been consumed by one part of the regexp, it cannot also be consumed by another part. (The engine may backtrack and try a different division of the characters, and lookahead/lookbehind can examine characters without consuming them, but neither changes the outcome here.) Each character in the string is matched by exactly one portion of the regexp, which means that you needn’t worry about it.
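A quick experiment confirms that overlapping runs of vowels are not counted twice. The test strings below are assumptions chosen for illustration: a four-vowel run, a real word with five vowels in a row, and a six-vowel run.

```python
import re

pattern = re.compile(r'[aeiou]{3}.*[aeiou]{3}', re.IGNORECASE)

# Four vowels in a row: not enough for two separate sets of three
four_in_a_row = pattern.search('aeio')

# "queueing" has five vowels in a row (ueuei), still one short of six
five_in_a_row = pattern.search('queueing')

# Six vowels in a row can be divided into two sets of three
six_in_a_row = pattern.search('aeioua')

print(four_in_a_row, five_in_a_row, six_in_a_row)
```

Since each character can be consumed only once, the regexp needs at least six vowels in total before it can match.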

5.10.2

Python

import re

filename = 'words.txt'
ro = re.compile('[aeiou]{3}.*[aeiou]{3}', re.IGNORECASE)

for line in open(filename):
    if ro.search(line):
        print(line)

5.10.3

Ruby

filename = 'words.txt'
r = Regexp.new('[aeiou]{3}.*[aeiou]{3}', 'i')

File.open(filename).each_line do |line|
  if line =~ r
    puts line
  end
end

5.10.4

JavaScript

"use strict";

var fs = require('fs');
var r = RegExp('[aeiou]{3}.*[aeiou]{3}', 'i');
var filename = 'words.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    for (let line of data.split("\n")) {
        if (line.match(r)) {
            console.log(line);
        }
    }

    process.exit();
});

5.10.5

PostgreSQL

-- ~* is the case-insensitive version of the ~ operator
SELECT line FROM words
WHERE line ~* '[aeiou]{3}.*[aeiou]{3}';

5.11

Postfix dollar

In the United States, we put the dollar sign before the price of something, as in $123.45. In my travels, I’ve noticed that many people, in many countries, aren’t used to this, and put the $ sign after the numbers. Consider the sentence:

They wrote 1,000$ (not $1000), and 123.45$ (not $123.45).

For this exercise, write a regular expression that finds all of the cases of numbers (including commas and decimal points) followed by dollar signs. Thus, the results should find 1,000$ and 123.45$.

5.11.1

Solution

[\d.,]+\$

To find a decimal digit (0-9), we can use the built-in character class \d. But we don’t want to find just digits; we also need to find decimal points and commas. To that end, I create a new character class, containing not only \d, but also periods and commas.

But of course, we’re not only interested in numbers. We’re interested in numbers that have a trailing $. Normally, you might think that you can use a plain $ at the end of this regular expression. But we can’t do that in this case, because a $ in the final position of a regexp becomes a metacharacter, anchoring the regexp to the end of the string. (Or, if you’re in multi-line mode, it matches the end of a line.) So in order to match a trailing dollar sign, we’ll need to put a backslash before that final $.
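The difference between the escaped and unescaped $ can be checked directly against the exercise’s sentence. This is a small sketch rather than part of the book’s solution; the variable names are my own.

```python
import re

s = 'They wrote 1,000$ (not $1000), and 123.45$ (not $123.45).'

# Escaped: \$ matches a literal dollar sign after the number
with_escape = re.findall(r'[\d.,]+\$', s)

# Unescaped: $ anchors to the end of the string, so only the
# sentence's final period (which is in the character class) matches
without_escape = re.findall(r'[\d.,]+$', s)

print(with_escape)
print(without_escape)
```

Without the backslash, the regexp never looks for a dollar sign at all; it simply finds whatever run of digits, periods, and commas happens to sit at the very end of the string.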

5.11.2

Python

import re

s = 'They wrote 1,000$ (not $1000), and 123.45$ (not $123.45).'
print(re.findall(r'[\d.,]+\$', s))

5.11.3

Ruby

s = 'They wrote 1,000$ (not $1000), and 123.45$ (not $123.45).'

puts s.scan(/[\d.,]+\$/)

5.11.4

JavaScript

"use strict";

var s = 'They wrote 1,000$ (not $1000), and 123.45$ (not $123.45).';

var r = /[\d.,]+\$/g;

console.log(s.match(r));

5.11.5

PostgreSQL

SELECT regexp_matches('They wrote 1,000$ (not $1000), and 123.45$ (not $123.45).',

'[\d.,]+\$', 'g');

Chapter 6

Alternation

6.1

Multiple date formats

Dates are a well-known problem in the world, in that the same representation can mean different things. If you see the date 1/2/2016, does that mean February 1st or January 2nd? It all depends on whether you’re in the United States or Europe. Asian countries write dates altogether differently, starting with the year, so 2016-2-1 would mean February 1st, 2016. For this exercise, write a regular expression that finds all dates in the following string: I write it as 2015-09-02, but he wrote 2/9/2015, and she wrote 9.2.2015.

6.1.1

Solution

The key here, as you might imagine, is to use alternation. We can find all three of the above dates by hard-coding them in a regexp: 2015-09-02|2/9/2015|9\.2\.2015

This will work, but we need something a bit more robust and generic. We can take advantage of the \d character class, which matches digits. And we can use {min,max} to indicate how many numbers we want. Our regexp thus becomes: \d{4}-\d{1,2}-\d{1,2}|\d{1,2}/\d{1,2}/\d{4}|\d{1,2}\.\d{1,2}\.\d{4}

Let’s finish this off by making the symbols a bit more generic, using a character class: (\d{4}[-/.]\d{1,2}[-/.]\d{1,2})|(\d{1,2}[-/.]\d{1,2}[-/.]\d{4})|(\d{1,2}[-/.]\d{1,2}[-/.]\d{4})

Yes, this is a bit long and ugly. In such cases, it’s often a good idea to break the regexp up, using the verbose/extended flag. Notice that I also used parentheses, to ensure that each alternative is handled as a group, not as individual characters. As a result of these additional parentheses, we will get results that contain a bit more than we might like.

If you’re a bit more advanced with regexps, then you might want to use non-capturing parentheses (with ?: inside of parentheses) for this purpose:

(?:\d{4}[-/.]\d{1,2}[-/.]\d{1,2})|(?:\d{1,2}[-/.]\d{1,2}[-/.]\d{4})|(?:\d{1,2}[-/.]\d{1,2}[-/.]\d{4})

Using non-capturing parentheses is a bit advanced, and it makes the regexp uglier, but it’s extremely useful.
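The “bit more than we might like” is easiest to see with Python’s re.findall, which returns one tuple per match when the regexp contains capturing groups. The sketch below uses only two alternatives, because the third alternative in the text repeats the second; that shortening is my own simplification.

```python
import re

s = 'I write it as 2015-09-02, but he wrote 2/9/2015, and she wrote 9.2.2015.'

# Capturing groups: findall returns a tuple per match, one slot per group
capturing = (r'(\d{4}[-/.]\d{1,2}[-/.]\d{1,2})'
             r'|(\d{1,2}[-/.]\d{1,2}[-/.]\d{4})')

# Non-capturing groups: findall returns the matched text itself
non_capturing = (r'(?:\d{4}[-/.]\d{1,2}[-/.]\d{1,2})'
                 r'|(?:\d{1,2}[-/.]\d{1,2}[-/.]\d{4})')

with_groups = re.findall(capturing, s)
without_groups = re.findall(non_capturing, s)

print(with_groups)
print(without_groups)
```

With capturing groups, each result carries an empty string for the alternative that did not match. With non-capturing groups, we get back a plain list of dates.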

6.1.2

Python

import re

s = 'I write it as 2015-09-02, but he wrote 2/9/2015, and she wrote 9.2.2015.'

ro = re.compile(r"(\d{4}[-/.]\d{1,2}[-/.]\d{1,2})|(\d{1,2}[-/.]" +
                r"\d{1,2}[-/.]\d{4})|(\d{1,2}[-/.]\d{1,2}[-/.]\d{4})")

print(ro.findall(s))

6.1.3

Ruby

s = 'I write it as 2015-09-02, but he wrote 2/9/2015, and she wrote 9.2.2015.'

r = /(\d{4}[-\/.]\d{1,2}[-\/.]\d{1,2})|(\d{1,2}[-\/.]\d{1,2}[-\/.]\d{4})|(\d{1,2}[-\/.]\d{1,2}[-\/.]\d{4})/

puts s.scan(r)

6.1.4

JavaScript

"use strict";

var s = 'I write it as 2015-09-02, but he wrote 2/9/2015, and she wrote 9.2.2015.';

// The g flag makes match return every date, rather than the first match plus its groups
var r = /(\d{4}[-\/.]\d{1,2}[-\/.]\d{1,2})|(\d{1,2}[-\/.]\d{1,2}[-\/.]\d{4})|(\d{1,2}[-\/.]\d{1,2}[-\/.]\d{4})/g;

var m = s.match(r);

if (m) {
    for (let item of m) {
        console.log(item);
    }
}

6.1.5

PostgreSQL

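To do this in PostgreSQL, I’ll use regexp_matches against the string. I use the non-capturing version of the regexp here, so that the first element of the returned array is the entire match:

SELECT (regexp_matches('I write it as 2015-09-02, but he wrote 2/9/2015, and she wrote 9.2.2015.',
'(?:\d{4}[-/.]\d{1,2}[-/.]\d{1,2})|(?:\d{1,2}[-/.]\d{1,2}[-/.]\d{4})|(?:\d{1,2}[-/.]\d{1,2}[-/.]\d{4})', 'g'))[1];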

6.2

“oo” and “ee” words

Find all of the words containing the double-letter combination oo and/or ee in Alice in Wonderland, regardless of case.

6.2.1

Solution

We’re looking for either oo or ee. We’ll thus need to use alternation, the regexp for which looks as follows: oo|ee

We’re interested not just in the doubled vowel, but in the word in which the doubled vowel occurs. This means that we need to use parentheses to stop | from extending to the edge of the regexp, as follows: (oo|ee)

With that in place, now we can extend the regexp to look for words: \b\w*(oo|ee)\w*\b

Because of the way parentheses and grouping works, we’ll put one final group around the entire regexp: \b(\w*(oo|ee)\w*)\b
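The steps above can be traced in a short sketch, showing what each version of the regexp returns from re.findall. The sample sentence is hypothetical, chosen so that it contains both oo and ee words.

```python
import re

# Hypothetical sample sentence with both "oo" and "ee" words
text = 'The sheep took a look at the tree.'

# No parentheses: | splits the whole regexp into two halves
no_parens = re.findall(r'\b\w*oo|ee\w*\b', text)

# One group: findall returns only the group's contents
inner_group = re.findall(r'\b\w*(oo|ee)\w*\b', text)

# Outer group added: each result is (whole word, doubled vowel)
both_groups = re.findall(r'\b(\w*(oo|ee)\w*)\b', text)
whole_words = [word for word, vowels in both_groups]

print(no_parens)
print(inner_group)
print(whole_words)
```

Without parentheses, we get fragments of words. With only the inner group, we get the doubled vowels but not the words. The outer group is what lets us pull out the complete word from each result.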

6.2.2

Python

import re

filename = 'alice.txt'
ro = re.compile(r'\b(\w*(oo|ee)\w*)\b', re.IGNORECASE)
s = open(filename).read()

# With two groups, findall returns a (word, doubled vowel) tuple for each match
for word, vowels in ro.findall(s):
    print(word)

6.2.3

Ruby

filename = 'alice.txt'
r = Regexp.new('\b(\w*(oo|ee)\w*)\b', 'i')
s = File.open(filename).read

# With two groups, scan returns a [word, doubled vowel] array for each match
s.scan(r).each do |word, _vowels|
  puts word
end

6.2.4

JavaScript

"use strict";

var fs = require('fs');

// The g flag makes match return every whole match, rather than just the first
var r = /\b(\w*(oo|ee)\w*)\b/gi;
var filename = 'alice.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    // match returns null when nothing is found, so fall back to an empty array
    for (let word of data.match(r) || []) {
        console.log(word);
    }

    process.exit();
});

6.2.5

PostgreSQL

The regexp we use in PostgreSQL is identical to the above ones, except that PostgreSQL uses \y rather than \b to indicate a word boundary.

SELECT (regexp_matches(line, '\y(\w*(oo|ee)\w*)\y', 'ig'))[1]
FROM alice_onerow;

6.3

British and American spelling

The problem here is a relatively simple one. We have a sentence: The new box of cheques is blue in colour.

Or I might have this sentence: The new box of checks is blue in color.

Write a regexp that matches either of these.

6.3.1

Solution

One solution is to use a combination of alternation and the ? metacharacter: The new box of che(que|ck)s is blue in colou?r.

In the first case, we want to match either check or cheque. We could, of course, use something like (check|cheque), and that would work just fine. You could even argue that it would be more readable. But in many cases, we want our regexps to be short and to the point; thus, if only a few letters are different, we can put just those letters inside of the alternation.

Notice that we put the alternatives inside of parentheses. If we weren’t to do that, the alternation character (|) would look all the way to the front of the string, and all the way to the end of the string. Using parentheses in this way can have some surprising side effects, because it means we have created a group, even if we didn’t intend to do so.

In the second case, of color and colour, we could have used alternation. But when it’s just a single character that is optional, I find it easier and more intuitive to use ? to make a specific character optional.

Note that this regexp will also match the following sentence: The new box of checks is blue in colour.

Whether you see that as a bug or a feature is, of course, up to you; I’m willing to live with it.
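As a quick check of the pattern discussed above, here is a minimal Python sketch that tries it against several spellings. The sentences list and the variable names are illustrative assumptions, not part of the exercise files. The sketch also swaps in (?:...), a non-capturing group, to avoid the side effect of creating a group, and escapes the final period so that it matches only a literal dot.

```python
import re

# Non-capturing group for the alternation; the final period is escaped.
pattern = re.compile(r'The new box of che(?:que|ck)s is blue in colou?r\.')

# Illustrative inputs: the two target sentences, a mixed spelling, a misspelling.
sentences = [
    'The new box of cheques is blue in colour.',
    'The new box of checks is blue in color.',
    'The new box of checks is blue in colour.',
    'The new box of chex is blue in color.',
]

results = [bool(pattern.match(s)) for s in sentences]

for sentence, matched in zip(sentences, results):
    print(matched, sentence)
```

The mixed spelling matches for the reason given above: the two choices in the regexp are independent of one another.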

6.3.2

Python

import re

s1 = 'The new box of cheques is blue in colour.'

s2 = 'The new box of checks is blue in color.'

ro = re.compile('The new box of che(que|ck)s is blue in colou?r.')

if ro.match(s1) and ro.match(s2):
    print("Matches!")

6.3.3

Ruby

s1 = 'The new box of cheques is blue in colour.'

s2 = 'The new box of checks is blue in color.'

r = Regexp.new('The new box of che(que|ck)s is blue in colou?r.')

if (s1 =~ r) and (s2 =~ r)

puts "Matches!"

end

6.3.4

JavaScript

var s1 = 'The new box of cheques is blue in colour.';

var s2 = 'The new box of checks is blue in color.';

var r = RegExp('The new box of che(que|ck)s is blue in colou?r.');

if (s1.match(r) && s2.match(r)) {

console.log("Matches!");

}

6.3.5

PostgreSQL

To test this regexp with PostgreSQL, we’ll just create a temporary table, and then run the regexp against that table:

CREATE TEMP TABLE Stuff (id SERIAL, line TEXT);

INSERT INTO Stuff (line) VALUES

('The new box of cheques is blue in colour.'),

('The new box of checks is blue in color.');

SELECT line FROM Stuff

WHERE line ~ 'The new box of che(que|ck)s is blue in colou?r.';

Chapter 7

Anchoring

7.1

Capital vowel starts

In this assignment, find and print all of the words that begin with a capital vowel (A, E, I, O, or U) and are at the start of a line.

7.1.1

Solution

There are two basic ways to solve this problem. One, and the one I prefer, is to read through the file line by line. When we do that, we can use ^ to anchor our regexp to the start of the string. Then all we have to do is continue the word using \w, which represents any alphanumeric character, and then *, which matches zero or more of them.

Why would I use *, rather than +? Because two of the capital vowels (A and I) are words. If we were to use +, then the regexp would need to match at least two letters, not just one. Our regexp can thus look like this:

^[AEIOU]\w*

Another method would be to read the entire file as a single string, and then to look for our capital-vowel-word at the start of each line, either by looking for \n followed by our regexp, or by using a flag to indicate multiline mode, such that ^ matches the start of a line, rather than the start of the entire string. See Chapter 9 for some exercises involving multi-line mode.
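The second approach can be sketched in Python as follows. This is only an illustration under stated assumptions: the text variable is a made-up stand-in for the contents of a file read as one string, and re.MULTILINE is Python's name for the multiline flag.

```python
import re

# Hypothetical sample text standing in for a whole file read as one string.
text = "Apple pie\nbanana\nI\nOnly Eve\numbrella\nUnder"

# With re.MULTILINE, ^ matches at the start of every line,
# not only at the start of the whole string.
pattern = re.compile(r'^[AEIOU]\w*', re.MULTILINE)

words = pattern.findall(text)
print(words)
```

Note that Eve is not reported, because it does not sit at the start of its line, and that the one-letter word I is found thanks to the * quantifier.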

7.1.2

Python

import re

filename = 'words.txt'

ro = re.compile(r'^[AEIOU]\w*')

for line in open(filename):
    if ro.search(line):
        print(line)

7.1.3

Ruby

filename = 'words.txt'

r = Regexp.new('^[AEIOU]\w*')

File.open(filename).each_line do |line|

if line =~ r

puts line

end

end

7.1.4

JavaScript

"use strict";

var fs = require('fs');

var r = /^[AEIOU]\w*/;

var filename = 'words.txt';

fs.readFile(filename, 'utf8', function (err, data) {

if (err) {

console.log("Error!\n");

return console.log(err);

}

for (let line of data.split("\n")) {


if (line.match(r)) {

console.log(line);

}

}

process.exit();

});

7.1.5

PostgreSQL

SELECT line FROM words
WHERE line ~ '^[AEIOU]\w*';

7.2

Comment lines

Many Unix-style files, including programs written in such languages as Python and Ruby, indicate comments by having a # at the start of the line. In this exercise, you are to print all comment lines, meaning all lines that start with #, or in which the # is preceded only by whitespace. Comments that follow code on the same line can be ignored. Thus, given the following file:

# Comment 1
    # Comment 2
print("Hello") # Comment 3

Your solution should print comments 1 and 2, but not comment 3.

7.2.1

Solution

We’re only interested in comments that appear at the beginning of the line, or that come after whitespace at the start of the line. In other words, we’re looking for a # character just after the start of the line, or with optional whitespace before the #. We can thus use the following regexp:

^\s*#
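To see the regexp at work without a file, the following Python sketch runs it over the three-line example from the exercise. The sample string (including the indentation of the second comment) is an assumption made for illustration.

```python
import re

# The example file from the exercise, as one string (indentation assumed).
sample = '# Comment 1\n    # Comment 2\nprint("Hello")  # Comment 3\n'

pattern = re.compile(r'^\s*#')

# Keep only lines where # is the first non-whitespace character.
comment_lines = [line for line in sample.splitlines() if pattern.search(line)]

for line in comment_lines:
    print(line)
```

The third line is skipped because the ^ anchor, followed only by optional whitespace, cannot reach a # that comes after code.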

7.2.2

Python

import re

filename = 'words.txt'

ro = re.compile(r'^\s*#')

for line in open(filename):
    if ro.search(line):
        print(line)

7.2.3

Ruby

filename = 'words.txt'

r = Regexp.new('^\s*#')

File.open(filename).each_line do |line|

if line =~ r

puts line

end

end

7.2.4

JavaScript

"use strict";

var fs = require('fs');

var r = /^\s*#/;

var filename = 'words.txt';

fs.readFile(filename, 'utf8', function (err, data) {

if (err) {

console.log("Error!\n");

return console.log(err);

}

for (let line of data.split("\n")) {

if (line.match(r)) {

console.log(line);

}

}

process.exit();

});

7.2.5

PostgreSQL

SELECT line FROM words
WHERE line ~ '^\s*#';

7.3

Last five characters

In Alice in Wonderland, print the last five characters of every line, in which the third-to-last character is a lowercase letter in the second half of the alphabet (i.e., starting with n).

7.3.1

Solution

When you hear that you’re looking to match “the first” or “the last” characters on a line, then you almost certainly want to use an anchor. In this case, we’ll use $, which anchors the regexp to the end of a line. If we were looking for the last five characters, we could simply say:

.{5}$

But we’re looking for the final five characters, in which the first of those is in the range from n to z. In other words:

[n-z].{4}$

And that’s our regexp.
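The solutions below print each matching line in full. If you want only the five matched characters, one option is to print the match itself. Here is a minimal Python sketch of that idea; the sample lines are illustrative assumptions rather than the real contents of alice.txt.

```python
import re

pattern = re.compile(r'[n-z].{4}$')

# Illustrative lines, without trailing newlines.
lines = [
    'Alice was beginning',
    'to get very tired',
    'of sitting by her sister',
]

last_fives = []
for line in lines:
    m = pattern.search(line)
    if m:
        # group(0) is the text matched by the whole regexp
        last_fives.append(m.group(0))

print(last_fives)
```

The third line is dropped because the first of its final five characters is i, which falls outside the n-z range.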

7.3.2

Python

import re

filename = 'alice.txt'

ro = re.compile('[n-z].{4}$')

for line in open(filename):
    if ro.search(line):
        print(line)

7.3.3

Ruby

filename = 'alice.txt'

r = Regexp.new('[n-z].{4}$')

File.open(filename).each_line do |line|

if line =~ r

puts line

end

end

7.3.4

JavaScript

"use strict";

var fs = require('fs');

var r = RegExp('[n-z].{4}$');

var filename = 'alice.txt';

fs.readFile(filename, 'utf8', function (err, data) {

if (err) {

console.log("Error!\n");

return console.log(err);

}

for (let line of data.split("\n")) {

if (line.match(r)) {

console.log(line);

}

}

process.exit();

});

7.3.5

PostgreSQL

SELECT line FROM alice
WHERE line ~ '[n-z].{4}$';

7.4

u in the 2nd-to-last word

Show the final two words of each line of Alice in Wonderland in which u is in the second-to-last word.

7.4.1

Solution

If I want to see the final word in each line, then it’s probably easiest to iterate over each line of the file, grabbing the final non-whitespace characters:

\S+$

Note that the above is already potentially problematic: Because of the way in which Unix and Windows mark line endings, using $ to mark the end of the line, and then \S to indicate the non-whitespace characters right before it, means that you might miss lines that have a \r\n at the end, from Windows. We will assume, for now, that the file has the appropriate line endings for your operating system.

The thing is, we don’t want the final word. We want the final two words. We’ll thus have to capture two such words:

\S+\s+\S+$

This gives us the final two words, but we aren’t yet filtering through those words. The first of the two words (i.e., the second-to-last word on the line) must contain a u. We can do that with the following:

\b\w*u\w*\s+\S+$

It’s helpful to read this regexp from the back, because of the $ at the end: We want one or more non-whitespace characters at the end of the line. We could probably have used \w instead of \S; the question is whether we want to include punctuation or not. And indeed, the regexp

\b\W+\w+$

would have roughly the same result. That said, I’ll stick with the one that uses whitespace. The second-to-last word itself is found in the regexp’s first section:

\b\w*u\w*

This means that we want to have zero or more letters (well, alphanumeric characters), u, and then zero or more letters. This allows for words that start or end with u, as well as those with u in the middle. By having a \b at the start of the regexp, we ensure that we capture the entire word, rather than just a portion of it. Thus, our final regexp to match the final two words of any line in which the second-to-last word contains a u is:

\b\w*u\w*\s+\S+$
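The following Python sketch applies the final regexp to a few sample lines and keeps only the matched two words. The lines are made up for illustration and are not taken from alice.txt.

```python
import re

pattern = re.compile(r'\b\w*u\w*\s+\S+$')

# Hypothetical sample lines, without trailing newlines.
lines = [
    'what a curious feeling',
    'under the hedge',
    'she was quite surprised',
    'much too',
]

results = []
for line in lines:
    m = pattern.search(line)
    # Store the final two words, or None when the line does not qualify.
    results.append(m.group(0) if m else None)

print(results)
```

The second line fails even though under contains a u: that word is third from the end, and \S+$ cannot stretch across the whitespace between the and hedge.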

7.4.2

Python

Remember to use a raw string (or a doubled backslash) when your regexp includes \b. Otherwise, Python will interpret \b as the backspace character (ASCII 8), which will lead to a mismatch.

import re

filename = 'alice.txt'

ro = re.compile(r'\b\w*u\w*\s+\S+$')

for line in open(filename):
    if ro.search(line):
        print(line)

7.4.3

Ruby

filename = 'alice.txt'

r = Regexp.new('\b\w*u\w*\s+\S+$')

File.open(filename).each_line do |line|

if line =~ r

puts line

end

end

7.4.4

JavaScript

"use strict";

var fs = require('fs');

var r = /\b\w*u\w*\s+\S+$/;

var filename = 'alice.txt';

fs.readFile(filename, 'utf8', function (err, data) {

if (err) {

console.log("Error!\n");

return console.log(err);

}

for (let line of data.split("\n")) {

if (line.match(r)) {

console.log(line);

}

}

process.exit();

});

7.4.5

PostgreSQL

Remember that PostgreSQL uses \y to mark the word boundary, rather than \b.

SELECT line FROM alice
WHERE line ~ '\y\w*u\w*\s+\S+$';

Chapter 8

Groups

8.1

Date and time

In access-log.txt, each line contains a timestamp, which looks like this: [30/Jan/2010:00:03:18 +0200]

Notice that the timestamp starts with [, ends with ], and contains both the date (in DD/MMM/YYYY format) and the time (in HH:MM:SS +TZ format). For this exercise, you are to grab the date and time in separate groups. Each language has a slightly different way of extracting the groups; the idea is that for each line, it should be possible to extract and display the date and time separately. The time should include the time zone; for now, we’ll leave it in the format used by the access log.

8.1.1

Solution

When working on such a problem, in which I have to match multiple parts of a string, I always try to start by matching the first part, and only then by matching the second part. To match our date, we know that we’ll need to find two digits, three letters, and four digits, all separated by slashes. We can do that with:

\d{2}/\w{3}/\d{4}

Now, you might be thinking that the middle should use a character class, such as [a-z], rather than \w. But I don’t think that it’s crucial in this particular case; it’s true that \w is more general, and thus slightly slower, but this is a case in which I prefer readability to speed.

Now, the above regexp matches the date. But I want to grab it in a group, and be able to access the group later. Thus, I put it inside of parentheses:

(\d{2}/\w{3}/\d{4})

With that in place, I can start to attack the second part, namely the time. That consists of pairs of numbers separated by colons, followed by a space, followed by a + and then four digits indicating the time zone. In other words, the time, by itself, is identifiable as:

\d{2}:\d{2}:\d{2} \+\d{4}

Remember that + is a metacharacter, which means that matching a literal + requires using \+! We can then find this as a group by putting parentheses around it:

(\d{2}:\d{2}:\d{2} \+\d{4})

Now we can combine our two groups, joining them with the : that appears between the date and time in the access log:

(\d{2}/\w{3}/\d{4}):(\d{2}:\d{2}:\d{2} \+\d{4})

If we look for the above in access-log.txt, we’ll find that group #1 is the date, and group #2 is the time.
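As a small sanity check, the Python sketch below applies the combined regexp to a single log line. The line is a made-up example built around the timestamp quoted in the exercise; the IP address and request are assumptions for illustration.

```python
import re

pattern = re.compile(r'(\d{2}/\w{3}/\d{4}):(\d{2}:\d{2}:\d{2} \+\d{4})')

# Hypothetical log line containing the timestamp shown in the exercise.
line = ('192.0.2.10 - - [30/Jan/2010:00:03:18 +0200] '
        '"GET /robots.txt HTTP/1.1" 404 294')

m = pattern.search(line)
date, time = m.group(1), m.group(2)

print("Date = '{0}', Time = '{1}'".format(date, time))
```

Because the groups do not include the surrounding [ and ], neither bracket appears in the extracted values.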

8.1.2

Python

import re

filename = 'access-log.txt'

ro = re.compile(r'(\d{2}/\w{3}/\d{4}):(\d{2}:\d{2}:\d{2} \+\d{4})')

for line in open(filename):
    m = ro.search(line)
    if m:
        print("Date = '{0}', Time = '{1}'".format(m.group(1), m.group(2)))

8.1.3

Ruby


filename = 'access-log.txt'

r = Regexp.new('(\d{2}/\w{3}/\d{4}):(\d{2}:\d{2}:\d{2} \+\d{4})')

File.open(filename).each_line do |line|

m = line.match(r)

if m

puts "Date = '#{m[1]}', Time = '#{m[2]}'"

end

end

8.1.4

JavaScript

In this case, I decided to use the // syntax to define the regexp; otherwise, the quoting issues get to be too annoying. However, doing that means that we need to use a \ before each / character, since a / would otherwise close the regexp.

"use strict";

var fs = require('fs');

var r = /(\d{2}\/\w{3}\/\d{4}):(\d{2}:\d{2}:\d{2} \+\d{4})/;

var filename = 'access-log.txt';

fs.readFile(filename, 'utf8', function (err, data) {

if (err) {

console.log("Error!\n");

return console.log(err);

}

for (let line of data.split("\n")) {

var m = r.exec(line);

if (m) {

console.log("\tDate = '" + m[1] + "', Time = '" + m[2] + "'");

}

}

process.exit();

});

8.1.5

PostgreSQL

In the case of PostgreSQL, defining groups within a regexp means that invoking regexp_matches will return an array with multiple elements. Assuming that we’re interested in getting the array back, we can invoke the following query:

SELECT regexp_matches(line,
    '(\d{2}/\w{3}/\d{4}):(\d{2}:\d{2}:\d{2} \+\d{4})')
FROM access_log;

8.2

Config pairs

config.txt is a simple configuration file. Simple, in that the configuration is set with lines that look like

name:value

But as often happens in such files, the people writing the file have gone a bit crazy, and have added lots of extra whitespace. Some lines contain only whitespace, or are generally illegal, without either a name or a value. We want to extract all of the name-value pairs from this file, grabbing the name and value in separate groups from legal lines. Moreover, we want to ignore any leading and trailing whitespace surrounding the name and value.

8.2.1

Solution

As usual, it’s a good idea to start with the simple part of the regexp, and then work up to the more complex parts. The simplest possible regexp is the one that matches our basic name:value:

(\w+):(\w+)

In other words, we’re looking for all of the alphanumeric characters before :, and then all of those after :. Those will be our name and value. But our name and value might have whitespace before and after them. Thus, we need to account for that by using \s, along with *, indicating that the whitespace is optional:

(\w+)\s*:\s*(\w+)

Now, what about those illegal lines? We don’t need to worry about them, since they won’t match our regexp: If there isn’t at least one alphanumeric character before and after the colon, the line won’t match our regexp. This is also true for lines that contain only whitespace.

And what about whitespace either before the name or after the value? Again, we don’t need to worry about this, because they occur before and after our regexp’s groups, and thus won’t be captured.
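To illustrate how the regexp treats messy and illegal lines, here is a minimal Python sketch. The lines list is an assumption standing in for config.txt, whose actual contents are not shown here.

```python
import re

pattern = re.compile(r'(\w+)\s*:\s*(\w+)')

# Hypothetical config lines: some legal, some blank, some missing a part.
lines = [
    'name:value',
    '   color  :   blue   ',
    '',
    '    ',
    'novalue:',
    ':noname',
    'size :10',
]

pairs = []
for line in lines:
    m = pattern.search(line)
    if m:
        pairs.append((m.group(1), m.group(2)))

print(pairs)
```

The blank lines and the lines missing a name or a value are skipped, and none of the surrounding whitespace ends up inside a group.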

8.2.2

Python

import re

filename = 'config.txt'

ro = re.compile(r'(\w+)\s*:\s*(\w+)')

for line in open(filename):
    m = ro.search(line)
    if m:
        print("Name = '{0}', Value = '{1}'".format(m.group(1), m.group(2)))

8.2.3

Ruby

filename = 'config.txt'

r = Regexp.new('(\w+)\s*:\s*(\w+)')

File.open(filename).each_line do |line|

m = line.match(r)

if m

puts "Name = '#{m[1]}', Value = '#{m[2]}'"

end

end

8.2.4

JavaScript

In this case, I decided to use the // syntax to define the regexp; otherwise, the quoting issues get to be too annoying. However, doing that means that we would need to use a \ before any / character in the regexp, since a / would otherwise close it.

"use strict";

var fs = require('fs');

var r = /(\w+)\s*:\s*(\w+)/;

var filename = 'config.txt';

fs.readFile(filename, 'utf8', function (err, data) {

if (err) {

console.log("Error!\n");

return console.log(err);

}

for (let line of data.split("\n")) {

var m = r.exec(line);

if (m) {

console.log("\tName = '" + m[1] + "', Value = '" + m[2] + "'");

}

}

process.exit();

});

8.2.5

PostgreSQL

In the case of PostgreSQL, defining groups within a regexp means that invoking regexp_matches will return an array with multiple elements. Assuming that we’re interested in getting the array back, we can invoke the following query:

SELECT regexp_matches(line, '(\w+)\s*:\s*(\w+)')
FROM config;

8.3

Quote first and last words

In an earlier exercise (5.8), we found all of the quotations in Alice in Wonderland. For this exercise, find the first and last word of each quotation, not including the quotation marks and punctuation. Thus, if the quote is

"Hello out
there!"

you should find Hello and there. Note that quotes might extend across lines.

8.3.1

Solution

The solution to our previous exercise on quoting was:

"[^"]+"

Now we want to find the first and last words in that sentence. Let’s start with the first word, which will contain letters immediately following the opening quotes:

"([a-zA-Z']+)[^"]+"

In this case, I decided to match all of the letters (capital and lowercase), as well as apostrophes ('). If I run this regexp across the text of Alice (not line by line, but rather across the entire book, so that I can grab quotes that exist across newlines), then group #1 matches the first word.

Now let’s try to grab the last word. On the face of it, this should be the same as the first word. However, the instructions for this exercise indicated that we shouldn’t include any punctuation in our final word. Thus, we’ll need to grab optional punctuation at the end of the quote (i.e., immediately preceding the final quotes), and then letters and apostrophes before that:

"([a-zA-Z']+)[^"]+([a-zA-Z']+)[.?!]*"

The thing is, this doesn’t quite work. Instead of the final word in our second group, we get the final character of the final word. What went wrong?

The answer lies in the fact that regexps are greedy. This means that as the regexp engine tries to match text, it grabs as much as it can, from left to right. So the first expression in the regexp will get as much as it can, and then the second will get as much as it can, and so forth. The problem is that if you have two expressions in your regexp that are right next to each other, and which can potentially match the same text, the one on the left wins. For example:

(\w+)(\w+)

If we match the above against abcde, group #1 will be abcd, and group #2 will be e. This is normally a good thing, but in the case of this exercise, it causes trouble. We don’t want the middle characters of the quotation to come at the expense of the final word!

The solution is to make the middle section non-greedy. That is, we still want it to grab characters, but it should grab the minimum possible for a match, rather than the maximum. We can indicate that *, +, ?, and {} are non-greedy by putting a ? after them. For example, let’s try our sample regexp again:

(\w+?)(\w+)

Matched against the string abcde, group #1 will now be a, and group #2 will be bcde. To get the full final word, we thus modify the regexp one last time:

"([a-zA-Z']+)[^"]+?([a-zA-Z']+)[.?!]*"
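The difference between the greedy and non-greedy versions is easy to verify in Python. The sketch below repeats the abcde experiment and then applies both quote regexps to the two-line quote from the exercise statement; the variable names are illustrative.

```python
import re

# Greedy vs. non-greedy on the small example from the text.
greedy = re.match(r'(\w+)(\w+)', 'abcde').groups()
lazy = re.match(r'(\w+?)(\w+)', 'abcde').groups()

# The sample quote from the exercise, spanning two lines.
quote = '"Hello out\nthere!"'

# Greedy middle section: the second group is left with a single character.
broken = re.search(r'''"([a-zA-Z']+)[^"]+([a-zA-Z']+)[.?!]*"''', quote).groups()

# Non-greedy middle section: the second group gets the whole final word.
fixed = re.search(r'''"([a-zA-Z']+)[^"]+?([a-zA-Z']+)[.?!]*"''', quote).groups()

print(greedy, lazy)
print(broken, fixed)
```

The triple-quoted raw string is just one way to write a pattern that contains both kinds of quotation mark without backslashes.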

8.3.2

Python

Because this regexp includes both double and single quotes, we’ll need to use a backslash when defining our regexp string in Python, escaping the single quotes within the regexp string:

import re

filename = 'alice.txt'

ro = re.compile('"([a-zA-Z\']+)[^"]+?([a-zA-Z\']+)[.?!]*"')

s = open(filename).read()

for quote in ro.findall(s):
    print(quote)

8.3.3

Ruby

In the case of Ruby, we can avoid the backslashing of quotes by using the // syntax:

filename = 'alice.txt'

r = /"([a-zA-Z']+)[^"]+?([a-zA-Z']+)[.?!]*"/

s = File.open(filename).read

s.scan(r).each do |quote|

puts quote

end

8.3.4

JavaScript

In the JavaScript version, we’ll use the // syntax, much as in Ruby, to avoid having to escape our single quotes:

"use strict";

var fs = require('fs');
var r = /"([a-zA-Z']+)[^"]+?([a-zA-Z']+)[.?!]*"/g;
var filename = 'alice.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }
    // With the g flag, each exec() call returns the next match and its groups
    var m;
    while ((m = r.exec(data)) !== null) {
        console.log(m[1] + " " + m[2]);
    }
    process.exit();
});

8.3.5

PostgreSQL

In this case, we’re not going to use the alice table, but rather the alice_onerow table, in which the entire contents of the book are in a single row. PostgreSQL offers a variety of ways to quote text; in many ways, the easiest solution is to use $$ as the quotes at the start and end of text. This allows us to have " and ' without escaping. Also remember to use the g option to perform a global search, so that we get all of the results, rather than just one.

SELECT regexp_matches(line,
    $$"([a-zA-Z']+)[^"]+?([a-zA-Z']+)[.?!]*"$$, 'g')
FROM alice_onerow;

8.4

Prices with symbols

[Note: This chapter uses Unicode symbols that aren’t printing correctly. I’m working on fixing this. In theory, there should be a dollar sign, a euro symbol, and a UK pound sign.]

Assume that we have a string:

We sell 10 in the US for $100, in Europe for €100, and in the UK for £100.

We want to retrieve all of the prices from this string, but we don’t want to retrieve the currency symbol as well. In other words, we want to find all of the digits (no commas or decimal points) that follow a currency symbol.

8.4.1

Solution

[$€£](\d+)

The center of the above regexp, and the group I’ve defined, consists of \d, a digit, followed by +, meaning one or more digits. The number, which is what we want to capture, is in parentheses, defining a group, allowing us to retrieve it easily. Preceding that group is a character class containing the currency symbols.

8.4.2

Python

In Python, this regexp is going to be a bit tricky. That’s because the pound and euro symbols are both Unicode characters. For this reason, it’s important that the search string s and the regexp object ro are both defined using Unicode strings. In Python 3, that’s the default, and thus you don’t need to do anything special. In Python 2, you must explicitly preface the string with u. Fortunately, Python 3 ignores the leading u, so we can write the program a single time. Also note that the re.UNICODE flag is unnecessary here. That flag expands the definition of \w, but since we don’t use \w in this regexp, the flag would have no effect.

import re

s = u'We sell 10 in the US for $100, in Europe for €100, and in the UK for £100.'

ro = re.compile(u'[$€£](\\d+)')

print(ro.findall(s))

8.4.3

Ruby

Modern versions of Ruby use Unicode by default. Thus, nothing special is needed for this regexp:

s = 'We sell 10 in the US for $100, in Europe for €100, and in the UK for £100.'

puts s.scan(/[$€£](\d+)/)

8.4.4

JavaScript


"use strict";

var s = 'We sell 10 in the US for $100, in Europe for €100, and in the UK for £100.'

var r = RegExp('[$€£](\\d+)', 'g');

console.log(s.match(r));

8.4.5

PostgreSQL

SELECT regexp_matches('We sell 10 in the US for $100, in Europe for €100, and in the UK for £100.',

'[$€£](\d+)', 'g');

8.5

Question first word

Once again, let’s extract some text from Alice in Wonderland: Retrieve the first word of every question – meaning, every sentence that ends with a question mark.

8.5.1

Solution

The first thing we need to figure out in order to solve this problem is how we can describe a question using regular expressions. We know that a question starts with a word (and that word might be only one character long, as in I) and ends with a question mark. Maybe we could identify questions this way:

\w+\?

But of course, the above won’t work, because there might be spaces in the middle. We could also use a non-greedy regexp, such as:

.+?\?

But that won’t go over the newlines, at least not without invoking the single-line flag that most regexp engines offer. Instead, I’m going to use a technique similar to what we saw in Exercise 5.8, in which we said that a quote started with ", ended with ", and that in the middle we had everything that was not a ". That might lead us to the following:

\w[^?]\?

But this will likely pick up all sorts of other things. I’m thus going to expand the negated character class in the middle, to ensure that anything we capture will not cross the boundary of a sentence:

\w[^!.?]*\?

I use a * here after the negated character class, to allow for one-letter questions (e.g., I?). Finally, we can indicate that we want the first word, and then capture that word:

(\w+)[^.?!]*\?
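Here is a short Python sketch of the final regexp in action. The text variable is an invented passage (not a quotation from alice.txt) that mixes statements, exclamations, a one-word question, and a question that spans a line break.

```python
import re

pattern = re.compile(r'(\w+)[^.?!]*\?')

# Made-up sample text; the last question crosses a newline.
text = ("Alice sighed. What is the use of a book? I? "
        "Who are you! And where\nis my cat?")

first_words = pattern.findall(text)
print(first_words)
```

Because a negated character class matches newlines, the last question is found without any single-line flag, while the sentences ending in a period or an exclamation mark are never matched.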

8.5.2

Python

import re

filename = 'alice.txt'

ro = re.compile(r'(\w+)[^.?!]*\?')

s = open(filename).read()

for quote in ro.findall(s):
    print(quote)

8.5.3

Ruby

filename = 'alice.txt'

r = Regexp.new('(\w+)[^.?!]*\?')

s = File.open(filename).read

s.scan(r).each do |quote|

puts quote

end

8.5.4

JavaScript

"use strict";

var fs = require('fs');
var r = /(\w+)[^.?!]*\?/g;
var filename = 'alice.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }
    // With the g flag, each exec() call returns the next match and its groups
    var m;
    while ((m = r.exec(data)) !== null) {
        console.log(m[1]);
    }
    process.exit();
});

8.5.5

PostgreSQL

SELECT (regexp_matches(line, '(\w+)[^.?!]*\?', 'g'))[1]
FROM alice_onerow;

8.6

t, but no “ing”

In this exercise, you are to find all of the words in Alice in Wonderland that start with t and end with ing. However, you are to return the portion of the word that precedes the ing. Thus, if the word is trailing, you should only match and return trail.

8.6.1

Solution

Let’s start by defining a regexp that’ll give us all of the words that start with t:

\bt\w+\b

The above describes a word (because of the \b on either side). The word starts with t and then continues with at least one more letter (thanks to the +) until it reaches the end of the word. Now, let’s add a check to see if the word ends with ing:

\bt\w+ing\b

And finally, we’ll add parentheses to capture the initial part of the word:

\b(t\w+)ing\b
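The following Python sketch tries the final regexp on a made-up sentence (an assumption for illustration, not text from the book) to show which words qualify and what is captured.

```python
import re

pattern = re.compile(r'\b(t\w+)ing\b')

# Invented sample text with several -ing words.
text = ("The trailing cat was talking and thinking about a ring. "
        "Tying knots is tiring.")

stems = pattern.findall(text)
print(stems)
```

Note that ring is skipped because it does not start with t, and Tying is skipped because the regexp is case sensitive. For thinking, the \w+ backs off just far enough to leave the final ing for the rest of the pattern.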

8.6.2

Python

import re

filename = 'alice.txt'

ro = re.compile(r'\b(t\w+)ing\b')

s = open(filename).read()

for quote in ro.findall(s):
    print(quote)

8.6.3

Ruby

filename = 'alice.txt'

r = Regexp.new('\b(t\w+)ing\b')

s = File.open(filename).read

s.scan(r).each do |quote|

puts quote

end

8.6.4

JavaScript

"use strict";

var fs = require('fs');
var r = /\b(t\w+)ing\b/g;
var filename = 'alice.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }
    // With the g flag, each exec() call returns the next match and its groups
    var m;
    while ((m = r.exec(data)) !== null) {
        console.log(m[1]);
    }
    process.exit();
});

8.6.5

PostgreSQL

SELECT (regexp_matches(line, '\y(t\w+)ing\y', 'g'))[1]
FROM alice_onerow;

8.7

Usernames and user IDs

In passwd.txt, field index 0 is the username, field index 2 is the user ID, and field index -1 contains the user’s shell. For each user in the file, I want a regexp that extracts the user’s name, the user’s ID number, and the user’s shell. The regexp should extract each piece of information using a group. If the language supports it, retrieve each field using a named group, rather than a numbered one.

8.7.1

Solution

Each line in passwd.txt looks like the following:

root:x:0:0:root:/root:/bin/bash

We want the first, third, and final fields. Let’s start with the first one, which consists of all characters that aren’t : (our field separator):

^([^:]+):

Then we want to skip over one field, and grab the next one:

^([^:]+):[^:]+:([^:]+)

The above regexp captures the first and third fields, and puts them into the groups numbered 1 and 2. But how can we get the shell, which is in the final field? We can use .+ to go through the rest of the line, and then anchor the final field to the end:

^([^:]+):[^:]+:([^:]+).+:([^:\s]+)\s*$

Notice that we put \s in the final negated character class, and at the end (before $), along with *, so that if there is a newline at the end, we will ignore it. This ensures that we grab the name of the shell, but not the trailing newline.
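Before moving on to named groups, it may help to see the numbered version run on a couple of lines. This Python sketch uses the root line shown above plus one more typical-looking line, which is an assumption for illustration; both keep their trailing newline, as lines read from a file would.

```python
import re

pattern = re.compile(r'^([^:]+):[^:]+:([^:]+).+:([^:\s]+)\s*$')

# Sample passwd-style lines, each ending in a newline.
lines = [
    'root:x:0:0:root:/root:/bin/bash\n',
    'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n',
]

# groups() returns (username, user ID, shell) for each line.
fields = [pattern.search(line).groups() for line in lines]

for name, uid, shell in fields:
    print(name, uid, shell)
```

The greedy .+ first runs to the end of the line and then backtracks to the last colon, which is what leaves only the final field for the third group.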

8.7.2

Python

Python supports named groups; inside the opening parenthesis of a capturing group, you say (?P<name>...), where name is the name you give the group and ... is the regexp you want to capture in the group. You can then use m.groupdict() to give you a dictionary whose keys are the group names and whose values are the group values. In this example, we then use ** to turn the Python dictionary into keyword arguments that are passed to str.format:

import re

filename = 'passwd.txt'

ro = re.compile(r'^(?P<name>[^:]+):[^:]+:(?P<id>[^:]+).+:(?P<shell>[^:\s]+)\s*$')

for line in open(filename):
    m = ro.search(line)
    if m:
        print("{name}: id {id}, shell {shell}".format(**m.groupdict()))

8.7.3

Ruby

Ruby’s named capture groups look slightly different, in that you use (?<name>...) to capture them. You also retrieve them differently, invoking Regexp#match on a string argument. This returns a MatchData object, with which you can use [ and ] and the names of the captured groups to get the values:

filename = 'passwd.txt'

r = Regexp.new('^(?<name>[^:]+):[^:]+:(?<id>[^:]+).+:(?<shell>[^:\s]+)\s*$')

File.open(filename).each_line do |line|

m = r.match(line)

if m

puts "#{m[:name]}: id #{m[:id]}, shell #{m[:shell]}"

end

end

8.7.4

JavaScript

Older JavaScript engines don’t offer named capture groups (they were added to the language in ES2018). Thus, we’ll retrieve the groups by number, the same way as before, using the regexp from the “Solution” section:

"use strict";

var fs = require('fs');

var r = /^([^:]+):[^:]+:([^:]+).+:([^:\s]+)\s*$/;

var filename = 'passwd.txt';

fs.readFile(filename, 'utf8', function (err, data) {

if (err) {

console.log("Error!\n");

return console.log(err);

}

for (let line of data.split("\n")) {

var m = r.exec(line);

if (m) {

console.log("\tName = '" + m[1] + "', id = '" + m[2] + "', shell = '" + m[3] + "'");

}

}

process.exit();

});

8.7.5

PostgreSQL

SELECT regexp_matches(line,
    '^([^:]+):[^:]+:([^:]+).+:([^:\s]+)\s*$')
FROM passwd;

8.8

Beheaded usernames

In this exercise, display the final four characters of any username that starts with a and contains at least five characters. Thus, given the users nobody, root, amotz, atara, adam, and astronaut, we would see the following output:

motz

tara

naut

8.8.1

Solution

^a\w*(\w{4}):

This regexp requires the combination of several techniques. First of all, we want the a character to be at the start of a line. This means that we want to anchor it there, using a ^ character at the beginning. We then say that we want the final four characters of those usernames that begin with “a”. (If the username contains only four characters, then it doesn’t match, even if the first letter is “a”.)

We don’t know how many characters the username will contain. We thus use \w*, indicating that we might want to match zero (in the case of a five-character username), and we might want to match more. The \w* is the only truly flexible part of this regexp, and will match a variable number of elements. Following the \w*, we match a precise number of alphanumeric characters – four of them, using \w{4}. The {4} indicates that we must match precisely four characters. Following the username is a : character, which separates fields in /etc/passwd. The group helps us to extract and display the final four characters in our regexp-using program.
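The six usernames from the exercise statement make a convenient test. In the Python sketch below, each username is padded out into a passwd-style line; the remaining fields are placeholders invented for illustration.

```python
import re

pattern = re.compile(r'^a\w*(\w{4}):')

users = ['nobody', 'root', 'amotz', 'atara', 'adam', 'astronaut']

# Build hypothetical passwd-style lines; only the username field matters here.
lines = ['{0}:x:1000:1000::/home/{0}:/bin/bash'.format(u) for u in users]

tails = []
for line in lines:
    m = pattern.search(line)
    if m:
        # Group 1 holds the final four characters of the username
        tails.append(m.group(1))

print(tails)
```

The username adam is rejected: after the leading a, only three characters remain before the colon, so \w{4} cannot be satisfied.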

8.8.2

Python

import re

filename = 'passwd.txt'

ro = re.compile(r'^a\w*(\w{4}):')

for line in open(filename):
    m = ro.search(line)
    if m:
        print(m.group(1))

8.8.3

Ruby

filename = 'passwd.txt'

r = Regexp.new('^a\w*(\w{4}):')

File.open(filename).each_line do |line|

if line =~ r

puts $1

end

end

8.8.4

JavaScript

"use strict";

var fs = require('fs');

var r = /^a\w*(\w{4}):/;

var filename = 'passwd.txt';

fs.readFile(filename, 'utf8', function (err, data) {

if (err) {

console.log("Error!\n");

return console.log(err);

}



13 14 15 16 17 18 19



for (let line of data.split("\n")) {

if (line.match(r)) {

console.log(line);

}

}

process.exit();

});

8.8.5

PostgreSQL

SELECT regexp_matches(line, '^a\w*(\w{4}):')
FROM passwd;

8.9

Final question words

In this exercise, you are to retrieve the final word of each question in Alice in Wonderland. You can assume that a question always ends with a question mark (?). You should not retrieve the question mark, but just the word preceding it.

8.9.1

Solution

There are two basic ways to solve this problem. In all cases, you’re going to look for a question mark. While it would be nice to look for a literal ? character, in the world of regexps, this is a metacharacter. Thus, we’ll need to preface it with a backslash, as in \?. But we’re not interested in the ? itself. Rather, we want the word that precedes it. One way to do this is to use a group: (\w+)\?

In the above regexp, we look for one or more \w characters before the ?. To be honest, this is probably the easiest and most straightforward solution, and is the one I’ll use in the solution code below. By using a group, we can capture the word that’s of interest to us.

However, another way to approach this is with lookahead. Lookahead, as the name implies, allows us to divide the regexp into parts, with the second part not being captured, but rather describing the context in which the first part is found. Consider the following regexp: \w+(?=\?)

The ?= at the start of the group means that this isn’t just a group, but rather an extension to the regexp syntax. In this particular case, it means that we want to look just after the \w+, to make sure that ? follows it. We’re not interested in grabbing the ?, just in making sure it exists. And thus, lookahead can be useful.
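The difference between the two approaches is easiest to see side by side. The following is a small Python sketch; the sample text is invented for the comparison and is not a quotation from alice.txt.

```python
import re

# Sample text invented for this sketch.
text = "Who are you? I hardly know. Why not?"

# No group: the question mark is part of each match
with_mark = re.findall(r'\w+\?', text)

# Group: findall returns only the captured word
with_group = re.findall(r'(\w+)\?', text)

# Lookahead: the question mark is checked, but not consumed
with_lookahead = re.findall(r'\w+(?=\?)', text)

print(with_mark)       # ['you?', 'not?']
print(with_group)      # ['you', 'not']
print(with_lookahead)  # ['you', 'not']
```

Both the group and the lookahead give the same words here; the group version is simply easier to read.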

8.9.2

Python

import re

filename = 'alice.txt'
ro = re.compile(r'(\w+)\?')

for line in open(filename):
    # findall returns every captured word on the line, not just the first
    for word in ro.findall(line):
        print(word)

8.9.3

Ruby

filename = 'alice.txt'
r = /(\w+)\?/

File.open(filename).each_line do |line|
  line.scan(r).each do |word|
    puts word
  end
end

8.9.4

JavaScript

"use strict";

var fs = require('fs');
var r = /(\w+)\?/g;
var filename = 'alice.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    // With the g flag, data.match(r) would return whole matches, including the "?".
    // Calling exec in a loop gives us access to the captured group instead.
    var m;
    while ((m = r.exec(data)) !== null) {
        console.log(m[1]);
    }
    process.exit();
});

8.9.5

PostgreSQL

SELECT (regexp_matches(line, '(\w+)\?', 'g'))[1]
FROM alice_onerow;

8.10

“d” user shells

In /etc/passwd, each line contains a number of different fields, separated by : characters. The first field is the username, and the final field is the user’s shell (i.e., the command interpreter). On a typical Linux box, most people will be using /bin/sh or /bin/bash, whereas others will be using /usr/bin/zsh, or something like that. And then you have the internal system users, whose shells are often /bin/false (so that they cannot log in), or something of the like. In this exercise, I want you to retrieve the shell from every user whose name contains d. For example, given the following line:

daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin

This user (daemon) starts with d, and their shell is /usr/sbin/nologin. But we also want shells from users with d elsewhere in the name, as in:

redis:x:112:123:redis server,,,:/var/lib/redis:/bin/false

8.10.1

Solution

To solve this problem, we have to think in two directions at once. On the one hand, we want to look for usernames that contain d. Thus, let’s find all such lines: ^\w*d\w*:

The above starts with ^, to anchor our regexp to the start of the line. Because d can appear anywhere in the username, we thus say that between the start of the line and the first :, we’ll have a d with zero or more characters before or after it. I should note that the above regexp will not match blank lines and comment lines, so while we don’t want to see such lines in our output, we don’t need to worry about them slipping through. Now we turn our attention to the end of the line, namely the shell’s name. What we want to match is something like this: :[\w/]+$

In other words, following a : character, we want to have letters and / characters. But there’s an easier way to do this, namely to grab everything at the end of the string that isn’t a :: :[^:]+$

Now we combine the front and back to get a single regexp, with .* between them, matching the stuff in the middle that isn’t of interest to us: ^\w*d\w*:.*:[^:]+$

Finally, we’ll use a group to grab the matched shell name: ^\w*d\w*:.*:([^:]+)$
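Here is a minimal Python sketch of the finished regexp. The daemon and redis lines are the two examples from the exercise text; the root line is an assumed, typical passwd entry added to show a username without a d.

```python
import re

lines = [
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin",
    "redis:x:112:123:redis server,,,:/var/lib/redis:/bin/false",
    "root:x:0:0:root:/root:/bin/bash",  # assumed example; no d in the username
]

pattern = re.compile(r'^\w*d\w*:.*:([^:]+)$')

shells = []
for line in lines:
    m = pattern.search(line)
    if m:
        # Group 1 is everything after the final colon
        shells.append(m.group(1))

print(shells)  # ['/usr/sbin/nologin', '/bin/false']
```

The root line is rejected even though d appears later on in other lines' fields: \w cannot cross a colon, so the d has to sit inside the first field.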

8.10.2

Python

import re

filename = 'passwd.txt'
ro = re.compile(r'^\w*d\w*:.*:([^:]+)$')

for line in open(filename):
    # Strip the newline so that it isn't captured as part of the shell
    m = ro.search(line.rstrip('\n'))
    if m:
        print(m.group(1))

8.10.3

Ruby

filename = 'passwd.txt'
r = Regexp.new('^\w*d\w*:.*:([^:]+)$')

File.open(filename).each_line do |line|
  m = line.match(r)
  if m
    puts m[1]
  end
end

8.10.4

JavaScript

"use strict";

var fs = require('fs');
var r = /^\w*d\w*:.*:([^:]+)$/;
var filename = 'passwd.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    for (let line of data.split("\n")) {
        var m = r.exec(line);
        if (m) {
            console.log(m[1]);
        }
    }
    process.exit();
});

8.10.5

PostgreSQL

SELECT (regexp_matches(line,
    '^\w*d\w*:.*:([^:]+)$'))[1]
FROM passwd;

Chapter 9

Flags

9.1

All usernames

In this exercise, you are to find all of the usernames in passwd.txt. However, you are to do this not by looping over the lines in passwd.txt, but rather by applying a regexp to the entire contents of the file as a single string, and retrieving all of the matches found in that string. Just to remind you, the username is at the start of each line, until the first : character.

9.1.1

Solution

If we were to read through the file line by line, we could grab the username by grabbing the word preceding the initial :: ^\w+:

But if we were to apply the above regexp to the entire file, we would normally be in trouble. That’s because ^ forces our regexp to match at the start of the entire string. There’s only one start to the string, and thus if this regexp were to match, it would be to a username on the first line, starting in the first character position.

(Actually, that’s not quite true: In Ruby, ^ always matches the start of a line, rather than the start of the string. So in Ruby, you don’t have to do anything special. But in Ruby, you also can’t make ^ match only the start of the entire string; if you want to match the start and end of the entire string in Ruby, you can use \A and \Z.)

However, there’s a trick we can use, which you might have figured out given the subject of this chapter: We can apply a flag that modifies the behavior of the regexp, such that ^ matches the start of a line, and $ matches the end of the line. Note that these special characters don’t consume any space, and are only special at the start and end of the regexp; $ elsewhere, as we’ve seen in a few other exercise solutions, is considered a normal character.

So if we use the above regexp without the “multiline” modifier flag, then it’ll just match the start of the string. But if we use that flag (which is a little different in every language) then the ^ suddenly changes, so that it matches the start of every line. And then, we can match the username at the start of every line.

Finally, I’ll just make one adjustment to this regexp, employing lookahead so as not to include the : itself in our username: ^\w+(?=:)
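The effect of the flag can be sketched in a few lines of Python. The three-line string below is a made-up, abbreviated stand-in for the contents of passwd.txt.

```python
import re

# Abbreviated, invented stand-in for the whole file read into one string.
text = "root:x:0:0\ndaemon:x:1:1\nredis:x:112:123\n"

# Without the flag, ^ matches only at the very start of the string
without_flag = re.findall(r'^\w+(?=:)', text)

# With re.MULTILINE, ^ matches at the start of every line
with_flag = re.findall(r'^\w+(?=:)', text, re.MULTILINE)

print(without_flag)  # ['root']
print(with_flag)     # ['root', 'daemon', 'redis']
```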

9.1.2

Python

In Python, we use re.MULTILINE to indicate that ^ and $ should match the start and end of a line, rather than of the entire string.

import re

filename = 'passwd.txt'
ro = re.compile(r'^\w+(?=:)', re.MULTILINE)

s = open(filename).read()

print('\n'.join(ro.findall(s)))

9.1.3

Ruby

As mentioned above, Ruby requires no changes in order to make ^ and $ match the start and end of the line. Thus, we can write our regexp as per usual:

filename = 'passwd.txt'
r = /^\w+(?=:)/

s = File.open(filename).read

s.scan(r).each do |username|
  puts username
end

9.1.4

JavaScript

In JavaScript, we modify the behavior of our regexp by passing a Perl-style modifier after the trailing slash. In this case, we’re passing the m modifier, for “multiline” mode. Don’t forget to also pass the g modifier, for a “global” search:

"use strict";

var fs = require('fs');
var r = /^\w+(?=:)/gm;
var filename = 'passwd.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    // match returns null when nothing is found, so fall back to an empty list
    for (let username of data.match(r) || []) {
        console.log(username);
    }
    process.exit();
});

9.1.5

PostgreSQL

PostgreSQL’s modifiers stem from the Tcl language. This means that the modifiers go inside of parentheses, at the start of the regexp. To turn on multiline mode, or as PostgreSQL calls it, “newline mode,” you put (?n) at the start of the regexp.

SELECT (regexp_matches(line, '(?n)^\w+(?=:)', 'g'))[1]
FROM passwd_onerow;

9.2

abc

In Alice in Wonderland, find stretches of text that start with a, have a b in the middle, and end with c. Between each of these characters can be up to 20 other characters.

9.2.1

Solution

On the face of it, this is a simple regexp to write: a.{,20}b.{,20}c

But there are at least two problems with this possible solution. First of all, it’ll likely find very few of the matches. That’s because . matches all characters but newline, which means that if this text crosses a line boundary, you won’t match it. We’ll thus need to tell the language we’re using that we want . to match newlines. This is a standard thing to want to do; unfortunately, every language has its own way of doing this.

In Python, you pass an additional re.DOTALL flag to the regexp, allowing . to match newlines as well.

In Ruby, you pass the m flag to the regexp, putting it into “single-line mode.” (Yes, it’s quite confusing that Perl uses s to indicate single-line mode and m to indicate multi-line mode, and yet Ruby uses m to indicate single-line mode. Welcome to the world of regexp dialects!) You can pass the flag as /m at the end of a slash-style regexp, or as a parameter to an object-style regexp.

In JavaScript, there was traditionally no equivalent (an s flag only arrived with ES2018). The solution below thus uses a character class that includes all characters, such as [\s\S].

In both JavaScript and PostgreSQL, you must explicitly put both numbers in {min,max}. You cannot leave out the minimum, as in {,20}.

In PostgreSQL, you enter single-line mode by putting (?s) at the start of the regexp.

However, that’s still not quite enough. That’s because regexps are greedy by default, meaning that they’ll match the maximum number of characters. In many cases, that’s just what we want, but in others, it’s less desirable. Thus, while I don’t think that it affects the solution too hugely here, it’s always worth considering adding ? after a quantity modifier, so that it’ll take the minimum instead, as in: a.{,20}?b.{,20}?c
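Both problems can be demonstrated with a short Python sketch. The two sample strings are invented for illustration; they are not taken from alice.txt.

```python
import re

# Problem 1: without DOTALL, . refuses to cross the newline
two_lines = "alpha\nbeta c"
no_dotall = re.findall(r'a.{,20}?b.{,20}?c', two_lines)
dotall = re.findall(r'a.{,20}?b.{,20}?c', two_lines, re.DOTALL)

print(no_dotall)  # []
print(dotall)     # ['alpha\nbeta c']

# Problem 2: greedy quantifiers grab as much as they can
repeated = "abc abc"
greedy = re.findall(r'a.{,20}b.{,20}c', repeated, re.DOTALL)
lazy = re.findall(r'a.{,20}?b.{,20}?c', repeated, re.DOTALL)

print(greedy)  # ['abc abc']
print(lazy)    # ['abc', 'abc']
```

The greedy version swallows both occurrences in a single match, while the non-greedy version reports each one separately.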

9.2.2

Python

import re

filename = 'alice.txt'
ro = re.compile(r'a.{,20}?b.{,20}?c', re.DOTALL)

s = open(filename).read()

for text in ro.findall(s):
    print(text)

9.2.3

Ruby

Don’t forget that because we want . to match newline characters, we must pass the m option:

filename = 'alice.txt'
r = /a.{,20}?b.{,20}?c/m

s = File.open(filename).read

s.scan(r).each do |text|
  puts text
end

9.2.4

JavaScript

"use strict";

var fs = require('fs');
var r = /a[\s\S]{0,20}?b[\s\S]{0,20}?c/g;
var filename = 'alice.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    // match returns null when nothing is found, so fall back to an empty list
    for (let section of data.match(r) || []) {
        console.log(section);
    }
    process.exit();
});

9.2.5

PostgreSQL

Remember that with PostgreSQL’s syntax, you not only use (?s) at the start of the regexp to indicate that it should be in single-line mode, but you also cannot use {,max} to indicate that there’s a max but no min.

SELECT (regexp_matches(line,
    '(?s)a.{0,20}?b.{0,20}?c',
    'g'))[1]
FROM alice_onerow;

9.3

abcABC

This exercise is a repeat of the previous one. But whereas the previous exercise asked you to find stretches of a, b, and c with up to 20 characters between each of these letters, here the search should be case-insensitive. That is, now we’re looking for either a or A, then up to 20 characters, then b or B, followed by up to 20 characters, then c or C.

9.3.1

Solution

There are several ways to solve this exercise. One is to take our existing regexp: a.{,20}?b.{,20}?c

and use character classes. In other words: [aA].{,20}?[bB].{,20}?[cC]

This will certainly work, and in some cases it’s the best way to go. But it’s often just easier to invoke the original regexp with the case-insensitive flag turned on. Every language has a way to do this:

In Python, use the re.IGNORECASE flag,
In Ruby, use the /i flag,
In JavaScript, also use the /i flag, and
In PostgreSQL, use the case-insensitive operators or (if appropriate) use the i parameter passed to regexp_matches.

Thus, the regexp remains: a.{,20}?b.{,20}?c

The difference is how we define and use it. Moreover, now we’re going to need to combine flags; in most languages, we’ll need to combine the single-line mode with case insensitivity.
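As a quick check that the two approaches agree, here is a small Python sketch run against an invented sample sentence.

```python
import re

# Sample sentence invented for this sketch.
text = "A Big Cat and a big cat"

# Approach 1: spell out both cases in character classes
by_class = re.findall(r'[aA].{,20}?[bB].{,20}?[cC]', text, re.DOTALL)

# Approach 2: keep the original regexp and add the case-insensitive flag
by_flag = re.findall(r'a.{,20}?b.{,20}?c', text, re.DOTALL | re.IGNORECASE)

# For comparison: the original, case-sensitive search
case_sensitive = re.findall(r'a.{,20}?b.{,20}?c', text, re.DOTALL)

print(by_class)        # ['A Big C', 'at and a big c']
print(by_flag)         # ['A Big C', 'at and a big c']
print(case_sensitive)  # ['at and a big c']
```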

9.3.2

Python

In Python, we combine flags using a bitwise “or” operator. Thus, we can use both re.DOTALL and re.IGNORECASE by using | between them when we define the regexp.

import re

filename = 'alice.txt'
ro = re.compile(r'a.{,20}?b.{,20}?c', re.DOTALL | re.IGNORECASE)

s = open(filename).read()

for text in ro.findall(s):
    print(text)

9.3.3

Ruby

In Ruby, we can similarly use the /i syntax to make searches case-insensitive. We can combine the /i and /m switches by putting them both after the regexp definition, either after the trailing slash or as options (such as Regexp::IGNORECASE | Regexp::MULTILINE) in the second argument to Regexp.new.

filename = 'alice.txt'
r = /a.{,20}?b.{,20}?c/mi

s = File.open(filename).read

s.scan(r).each do |text|
  puts text
end

9.3.4

JavaScript

In JavaScript, we can use the /i flag for a case-insensitive match. Since we aren’t using a single-line mode setting in JavaScript, we’ll just add that single flag next to g:

"use strict";

var fs = require('fs');
// g finds every match; i makes the match case-insensitive
var r = /a[\s\S]{0,20}?b[\s\S]{0,20}?c/gi;
var filename = 'alice.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    // match returns null when nothing is found, so fall back to an empty list
    for (let section of data.match(r) || []) {
        console.log(section);
    }
    process.exit();
});

9.3.5

PostgreSQL

Building on the regexp from the previous exercise, now we need to add the i flag at the end, in addition to g, in order to make the search case-insensitive.

SELECT (regexp_matches(line,
    '(?s)a.{0,20}?b.{0,20}?c',
    'gi'))[1]
FROM alice_onerow;

9.4

abcABC, extended

The regexp in the previous exercise was starting to get a bit long and complex. In such cases, it’s a good idea to break the regexp into separate lines, taking advantage of the “extended mode” that many regexp engines offer. In this exercise, I want you to take the regexp from the previous exercise (9.3) and turn it into a multi-line regexp, using extended mode in your language of choice.

9.4.1

Solution

Let’s start with the solution from the previous exercise:

a.{,20}?b.{,20}?c

Extended mode is different in every language, but the basic idea is that we can break our regexp across multiple lines, and even include comments describing what we’re doing. Thus, in extended mode, we can write our regexp as follows:

a         # Look for an a
.{,20}?   # Look for any character (even newline)
b         # Look for a b
.{,20}?   # Look for any character (even newline)
c         # Look for a c

Breaking up regexps in this way makes it possible for others to (hopefully) read, understand, and maintain our regexps.

9.4.2

Python

In Python, extended mode is known as “verbose,” and means that whitespace and comments are ignored. Remember that if you want to define a multi-line string in Python, you probably want to use a triple-quoted string, which can extend over multiple lines. Make it a raw string as well; otherwise, the \n inside the comments would become a real newline, ending the comment early.

import re

filename = 'alice.txt'
ro = re.compile(r'''
    a         # look for "a" or "A"
    .{,20}?   # up to 20 characters, including \n (non-greedy)
    b         # look for "b" or "B"
    .{,20}?   # up to 20 characters, including \n (non-greedy)
    c         # look for "c" or "C"
    ''', re.DOTALL | re.IGNORECASE | re.VERBOSE)

s = open(filename).read()

for text in ro.findall(s):
    print(text)

9.4.3

Ruby

In Ruby, we use the x option to turn on “extended” mode:

filename = 'alice.txt'
r = /a        # Start with a
     .{,20}?  # up to 20 chars, including \n (non-greedy)
     b        # Continue with b
     .{,20}?  # up to 20 chars, including \n (non-greedy)
     c/mix    # Look for "c" or "C"

s = File.open(filename).read

s.scan(r).each do |text|
  puts text
end

9.4.4

JavaScript

JavaScript doesn’t support an extended or verbose mode for regexps. The XRegExp package does support them, though.

9.4.5

PostgreSQL

Just as we can use (?s) to indicate single-line mode in PostgreSQL, we can use (?x) to indicate extended mode. If we want to combine them, then we must do so at the start of the regexp, with (?sx). Also note that in contrast with the Python and Ruby versions, the expanded regexp below contains no comments.

SELECT (regexp_matches(line,
    '(?sx)a
     .{0,20}?
     b
     .{0,20}?
     c', 'gi'))[1]
FROM alice_onerow;

9.5

No-error IP addresses

In this exercise, we’re going to work with fakelog.txt, a logfile using a format that I created for the purposes of my regexp courses. Each entry in the logfile is two lines long, and represents a response code of some sort, similar to HTTP. The first line contains the timestamp of the error message, followed by the (fake) IP address that caused the error. The second line contains the word Result, followed by a three-digit number indicating the error code, a colon, and a message. Your task is to extract the IP addresses associated with a response code starting with a 2.

9.5.1

Solution

This problem is most easily solved using a combination of a group (to capture the IP address) and multiline mode, allowing us to grab the timestamp and the result code and message. It’s important to point out that while we could use something like ^\s+Result

to find the message, that won’t help if we need to find the IP address. We’ll need to write a regexp that looks for a timestamp, and then looks for an IP address, and only then looks for the result code and message on the following line.

Let’s start by finding the timestamp: I’m going to do this with the ^ anchor, which lets me find the start of a line. In some languages, I’ll need to indicate I’m in multiline mode for this to work correctly. Assuming that I have read the entire file into a string, I could match the string against: ^\[[^\]]+\]\s+([\d.]+)

The above will find all lines that start with an opening square bracket. We’re not interested in the timestamp, so we’ll go through it, finding everything through the closing square bracket, then some whitespace. Notice that in the above regexp, we want to match a literal square bracket at the start of the line, and then find anything but a closing square bracket using our character class. This means two uses of ^ in one regexp, but for very different reasons. We then get to the IP address, which I’ve represented here as a combination of \d and .. You might want to be more exact, but I’ll let that go here. Now things get interesting: We know that there will be some whitespace, including a newline, between the IP address and the Result. It’s probably easiest just to use \s to represent the whitespace, which will include the newline. That leaves our regexp looking like this: ^\[[^\]]+\]\s+([\d.]+)\s+

Once we’ve done that, we merely need to grab the error code, checking its first digit to ensure it’s 2: ^\[[^\]]+\]\s+([\d.]+)\s+Result 2\d\d:

The above should do the job. We need to be in multiline mode, to ensure that ^ will do its job, anchoring the timestamp to the start of the line. And because we’ll do this globally, don’t forget to include a g flag in those languages that require it.
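The following Python sketch tries the regexp on three invented log entries. Their exact layout (timestamp format, indentation of the Result line, message text) is an assumption based on the description above, not a copy of fakelog.txt.

```python
import re

# Invented entries that follow the two-line layout described in the exercise.
log = (
    "[2016-01-01 12:00:01] 10.0.0.1\n"
    "    Result 200: OK\n"
    "[2016-01-01 12:00:02] 10.0.0.2\n"
    "    Result 404: Not found\n"
    "[2016-01-01 12:00:03] 10.0.0.3\n"
    "    Result 201: Created\n"
)

pattern = re.compile(r'^\[[^\]]+\]\s+([\d.]+)\s+Result 2\d\d:', re.MULTILINE)

# One group in the regexp, so findall returns just the captured IP addresses
ok_addresses = pattern.findall(log)

print(ok_addresses)  # ['10.0.0.1', '10.0.0.3']
```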

9.5.2

Python

import re

filename = 'fakelog.txt'
ro = re.compile(r'^\[[^\]]+\]\s+([\d.]+)\s+Result 2\d\d:', re.MULTILINE)

s = open(filename).read()

for text in ro.findall(s):
    print(text)

9.5.3

Ruby

As mentioned earlier, ^ in Ruby already matches the start of every line, so we don’t need to pass any flags here:

filename = 'fakelog.txt'
r = /^\[[^\]]+\]\s+([\d.]+)\s+Result 2\d\d:/

s = File.open(filename).read

s.scan(r).each do |text|
  puts text
end

9.5.4

JavaScript

Because we want to capture groups from more than one match, we’ll use the exec method. Once exec returns null, there are no more matches to be found:

"use strict";

var fs = require('fs');
var r = /^\[[^\]]+\]\s+([\d.]+)\s+Result 2\d\d:/mg;
var filename = 'fakelog.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    var m;
    while (m = r.exec(data)) {
        console.log(m[1]);
    }
    process.exit();
});

9.5.5

PostgreSQL

In this regexp, we need . to match all characters (including newline), and for ^ and $ to match the start and end of each line, not just the string. The way that PostgreSQL handles this is with “weird” mode, which is turned on with (?w) at the start of the regexp:

SELECT regexp_matches(line,
    $$(?w)^\[[^\]]+\]\s+([\d.]+)\s+Result 2\d\d:$$,
    'g')
FROM fakelog_onerow;

Chapter 10

Backreferences

10.1

Doubled vowels

Find all of the words in Alice in Wonderland that contain doubled vowels; that is, the same vowel (a, e, i, o, or u) appears twice in a row. For example, “beer” contains a doubled vowel, but “bear” does not.

10.1.1

Solution

You might think that the following regexp will find two vowels in a row: [aeiou]{2}

And it will – but they won’t necessarily be the same two vowels. The above regexp indicates that we want to grab two characters from the character class, but we don’t indicate that we want the same one each time. The solution is to use a “backreference,” in which we put the first occurrence in a group, and then refer back to it. Every language has a slightly different syntax for doing this, but most use a backslash and then a number, to refer to a numbered group. We can thus use the following: ([aeiou])\1

The parentheses define a group, and then the \1 refers back to that group. But I’m not interested in finding the doubled vowel. Rather, I want to find the word containing the doubled vowel. I’ll thus need to surround the doubled vowel with some more options: \b\w*([aeiou])\1\w*\b

The above regexp indicates that my doubled vowel may have alphanumeric characters before or after, and that those must come before or after a word break. The only problem with the above is the fact that it contains a group. In many systems, such as Python and PostgreSQL, from the moment you have a group, that group is returned, rather than the entire match. In order to grab the entire matched word, we have a few options – but in many ways, the easiest is just to surround the matched word with a second set of parentheses. This will define a second group, which we can then retrieve: \b(\w*([aeiou])\1\w*)\b

But try to use the above regexp, and you’ll find that it no longer works! That’s because the new group we’ve added is group 1 – so the \1 we put in our regexp now points to itself, which isn’t legal. Besides, the vowel to which we’re referring in our backreference is now the second group, not the first, so we’ll need to use \2, not \1. The final, working regexp is thus: \b(\w*([aeiou])\2\w*)\b
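The steps above can be traced in Python with a handful of sample words (chosen for this sketch, not taken from the book’s data files). Note that with two groups in the regexp, re.findall returns a tuple per match.

```python
import re

# Sample words chosen for illustration.
text = "beer bear book seen"

# Any two vowels in a row, identical or not
any_two = re.findall(r'[aeiou]{2}', text)

# Whole word in group 1, the doubled vowel in group 2
pairs = re.findall(r'\b(\w*([aeiou])\2\w*)\b', text)
words = [word for word, vowel in pairs]

print(any_two)  # ['ee', 'ea', 'oo', 'ee']
print(pairs)    # [('beer', 'e'), ('book', 'o'), ('seen', 'e')]
print(words)    # ['beer', 'book', 'seen']
```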

10.1.2

Python

import re

filename = 'alice.txt'
ro = re.compile(r'\b(\w*([aeiou])\2\w*)\b')

s = open(filename).read()

# findall returns one (word, vowel) tuple per match; we only want the word
for word, vowel in ro.findall(s):
    print(word)

10.1.3

Ruby

filename = 'alice.txt'
r = Regexp.new('\b(\w*([aeiou])\2\w*)\b')

s = File.open(filename).read

# scan yields both groups for each match; we only want the word
s.scan(r).each do |word, vowel|
  puts word
end

10.1.4

JavaScript

"use strict";

var fs = require('fs');
// The g flag makes match return every matching word, not just the first
var r = /(\b\w*([aeiou])\2\w*\b)/g;
var filename = 'alice.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    for (let word of data.match(r) || []) {
        console.log(word);
    }
    process.exit();
});

10.1.5

PostgreSQL

SELECT (regexp_matches(line,
    '(\y\w*([aeiou])\2\w*\y)', 'g'))[1]
FROM alice_onerow;

10.2

Hours and seconds

In access-log.txt, find all of the entries in which the hour and second for the entry were identical. Thus, a request at 12:34:12 matches, but 12:34:56 does not.

10.2.1

Solution

In order to solve this problem, we’ll first need to extract the time from each line. I believe that the easiest way to do this is to look for the date, and then to carry on forward toward the time. We’ve already seen how to do this before: \[\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}

The above will find the date, in dd/mmm/yyyy format, followed by the time, in HH:MM:SS format. But we want the final two digits (of the seconds) to be the same as the hours. We can thus use the following regexp, using a backreference: \[\d{2}/\w{3}/\d{4}:(\d{2}):\d{2}:\1

The above regexp should then identify all of the lines that match our criteria.
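A minimal Python sketch of the backreference in action follows. The two log lines are invented, loosely following the common Apache log layout, and use the two times mentioned in the exercise.

```python
import re

# Invented log lines; only the bracketed timestamp matters to the regexp.
lines = [
    '10.0.0.1 - - [30/Jan/2010:12:34:12 +0200] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [30/Jan/2010:12:34:56 +0200] "GET / HTTP/1.1" 200 512',
]

pattern = re.compile(r'\[\d{2}/\w{3}/\d{4}:(\d{2}):\d{2}:\1')

# Keep only the lines whose seconds repeat the captured hour
matching = [line for line in lines if pattern.search(line)]

print(len(matching))  # 1
```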

10.2.2

Python

import re

filename = 'access-log.txt'
ro = re.compile(r'\[\d{2}/\w{3}/\d{4}:(\d{2}):\d{2}:\1')

for line in open(filename):
    if ro.search(line):
        # The line already ends with a newline
        print(line, end='')

10.2.3

Ruby

filename = 'access-log.txt'
r = Regexp.new('\[\d{2}/\w{3}/\d{4}:(\d{2}):\d{2}:\1')

File.open(filename).each_line do |line|
  if line =~ r
    puts line
  end
end

10.2.4

JavaScript

"use strict";

var fs = require('fs');
// Inside a regexp literal, each / in the date must be escaped
var r = /\[\d{2}\/\w{3}\/\d{4}:(\d{2}):\d{2}:\1/;
var filename = 'access-log.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    for (let line of data.split("\n")) {
        if (line.match(r)) {
            console.log(line);
        }
    }
    process.exit();
});

10.2.5

PostgreSQL

SELECT line FROM access_log
WHERE line ~ '\[\d{2}/\w{3}/\d{4}:(\d{2}):\d{2}:\1';

10.3

Seven-letter start-finish words

In the dictionary, find all seven-letter words that start and end with the same two letters. For example, restore starts with re and ends with re, and is seven letters long.

10.3.1

Solution

We’re looking here for a seven-letter word. That would start off as: ^\w{7}$

Notice how it’s important to anchor the word at the start and end of the line. If we don’t do that, then we might well find seven-letter subsets of longer words that fit our criteria. But of course, we want to capture the first two letters. And while we’re at it, let’s break out the first two letters and last two letters: ^\w{2}\w{3}\w{2}$

Now, this exercise asks us to look for all of the seven-letter words in which the first two letters and the final two letters are the same. We can do this easily by defining the first two inside of a group, and then using a backreference to refer back to that group: ^(\w{2})\w{3}\1$

Here’s a bonus question, while we’re at it: How could we find seven-letter words in which the first two letters and last two letters are the same, but in reversed order? For example, the word evasive has seven letters; the first and final letters are the same, as are the second and sixth letters. We can do this by capturing the first and second letters separately, and using separate backreferences: ^(\w)(\w)\w{3}\2\1$
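Both regexps can be tried out on a short word list in Python. The list below is a hand-picked sample for this sketch, not the contents of words.txt.

```python
import re

# Hand-picked sample words, all seven letters long.
words = ['restore', 'evasive', 'regular', 'require']

same_order = re.compile(r'^(\w{2})\w{3}\1$')
reversed_order = re.compile(r'^(\w)(\w)\w{3}\2\1$')

same = [w for w in words if same_order.search(w)]
mirrored = [w for w in words if reversed_order.search(w)]

print(same)      # ['restore', 'require']
print(mirrored)  # ['evasive']
```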

10.3.2

Python

import re

filename = 'words.txt'
ro = re.compile(r'^(\w{2})\w{3}\1$')

for line in open(filename):
    if ro.search(line):
        # The line already ends with a newline
        print(line, end='')

10.3.3

Ruby

filename = 'words.txt'
r = Regexp.new('^(\w{2})\w{3}\1$')

File.open(filename).each_line do |line|
  if line =~ r
    puts line
  end
end

10.3.4

JavaScript

"use strict";

var fs = require('fs');
// A regexp literal keeps \w and \1 intact; a quoted string would mangle them
var r = /^(\w{2})\w{3}\1$/;
var filename = 'words.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    for (let line of data.split("\n")) {
        if (line.match(r)) {
            console.log(line);
        }
    }
    process.exit();
});

10.3.5

PostgreSQL

SELECT line FROM words
WHERE line ~ '^(\w{2})\w{3}\1$';

10.4

end-start

Show all words in the dictionary in which the final two letters of one word are the same as the first two letters of the next word. Thus, if the word require is followed by the word requirement, then we’ll want to see require in our output.

10.4.1

Solution

We’re looking for a word in the dictionary. That’s easy enough to find: ^\w+$

But we’re looking to find not just a word, but a word whose final two letters match the first two letters of the next word. This means that we’ll need to capture the final two letters of the word: ^\w*(\w\w)$

Notice that I am now using * rather than +, since it’s possible that the entire word is two letters long. Also notice that I’ve put the final two characters inside of parentheses, creating a group to which we can refer later. Also realize that in order to use ^ to identify the start of the line, rather than the start of the entire string, most languages require that you indicate this in the regexp by passing a flag. Now I want to see if our group is at the start of the next word. We can do this with a backreference: ^\w*(\w\w)\n\1

However, there’s a problem with this: If the second word should also be displayed, then this will prevent that from happening. That’s because our backreference will advance the pointer within the file, and make it impossible for the second word to be considered a match. The solution to this problem is to use positive lookahead to search for the newline and backreference: ^\w*(\w\w)(?=\n\1)

With the above in place, we can find all of the matches. However, since we’re looking through the entire file at once – rather than looking through it one line at a time – we’ll likely want to grab the word in a group. Thus, let’s create a capture group for the word, and then change our backreference to mention group 2, rather than group 1: ^(\w*(\w\w))(?=\n\2)

And indeed, the above regexp appears to do the job, finding 853 words that match this specification.
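The reason for the lookahead shows up clearly on a tiny, invented word list in which three consecutive words chain together. This Python sketch compares the consuming and non-consuming versions.

```python
import re

# Invented mini-dictionary: each word's ending starts the next word.
text = "aware\nrear\narid\nidle\n"

# Consuming version: the backreference eats the start of the next word
consuming = re.compile(r'^(\w*(\w\w))\n\2', re.MULTILINE)

# Lookahead version: the next word is checked but left in place
lookahead = re.compile(r'^(\w*(\w\w))(?=\n\2)', re.MULTILINE)

# Two groups means findall returns (word, ending) tuples; keep the word
found_consuming = [word for word, ending in consuming.findall(text)]
found_lookahead = [word for word, ending in lookahead.findall(text)]

print(found_consuming)  # ['aware', 'arid']
print(found_lookahead)  # ['aware', 'rear', 'arid']
```

The consuming version misses rear, because the first two letters of that line were already used up by the previous match.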

10.4.2

Python

In the Python version of the program, we’ll read the entire file in as a string using file.read. Then, we’ll use re.findall to find all of the matching words in that string. Because the regexp contains two groups, re.findall returns a list of tuples; we iterate over that list and print the first element of each tuple, which is the word itself.

import re

filename = 'words.txt'
ro = re.compile(r'^(\w*(\w\w))(?=\n\2)', re.MULTILINE)

s = open(filename).read()

for word, ending in ro.findall(s):
    print(word)

10.4.3

Ruby

filename = 'words.txt'
r = Regexp.new('^(\w*(\w\w))(?=\n\2)')

s = File.open(filename).read

# scan yields both groups for each match; we only want the word
s.scan(r).each do |word, ending|
  puts word
end

10.4.4

JavaScript

"use strict";

var fs = require('fs');
// g finds every match; m lets ^ match at the start of each line
var r = /^(\w*(\w\w))(?=\n\2)/gm;
var filename = 'words.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    // The lookahead consumes nothing, so each full match is just the word
    for (let word of data.match(r) || []) {
        console.log(word);
    }
    process.exit();
});

10.4.5

PostgreSQL

PostgreSQL’s regexp implementation doesn’t allow for the use of backreferences within lookahead constraints. Thus, I don’t believe that there’s a regexp solution to this problem.

10.5

Singular and plural

Find all of the words in Alice in Wonderland that appear in both singular and plural forms. For the purposes of this exercise, we’ll generalize, and say that a “plural” is any word with an “s” or “es” on the end. Thus, if both cat and cats appear in the book, then I want to see cat. We’ll also say that the singular version of a word must be at least 2 letters long, and that the singular version must precede the plural version.

10.5.1

Solution

At first glance, this might seem to be a simple backreference problem. After all, we want to find a word, and then find the same word later on. We could thus use a simple regexp like this: (\w{2,}).*\1

In other words, we’ve defined a group here, using parentheses. That group – which is group #1, because it’s the first set of parentheses – contains two or more alphanumeric characters. We then say that there should be zero or more characters following that word, and then that same word. Of course, this doesn’t guarantee that we have captured a word. We might have captured part of a word. Thus, we need to add \b to ensure that our word sits on a word boundary: \b(\w{2,})\b.*\1

Now we want to say that the second occurrence of the word has to be followed by either s or es. Here’s how we can do that: \b(\w{2,})\b.*\1e?s

While we’re at it, let’s make sure that our second occurrence is also a word, with \b: \b(\w{2,})\b.*\b\1e?s\b

Run this, and you’ll find … that there are very few matches. (In my copy of Alice, there’s only one, matching eBook.) But why? Clearly there are some words that appear in both singular and plural, right? Yes, but you have to remember that when we told the regexp engine to find the \1 backreference, it moved the pointer forward. Thus, it only started to look for the second singular after the first plural’s location.

We don’t want that to happen. Rather, we want to look ahead, see if our backreference is somewhere off in the distance – and then continue searching for singular word #2 after singular word #1. The way to do this is with positive lookahead. We tell the regexp engine to look ahead, but not to move the pointer. We do this with the following syntax: \b(\w{2,})\b(?=.*\b\1e?s\b)

But what if we have to pass through a newline character in order to get to the plural version of the word? In many languages, we can indicate that . should match newlines. But we can make our regexp more universal by simply matching the combination of \s and \S in a character class: \b(\w{2,})\b(?=[\s\S]*\b\1e?s\b)
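To see why the [\s\S] character class matters, here is a minimal sketch that runs both lookahead variants over a three-line sample. The sample text and variable names are invented for illustration and do not come from alice.txt.

```python
import re

# Hypothetical sample: each plural sits on a later line than its singular
text = "The cat sat.\nTwo cats and a dog.\nMany dogs saw the cat.\n"

# With a plain dot, the lookahead cannot cross a newline
dot_only = re.findall(r'\b(\w{2,})\b(?=.*\b\1e?s\b)', text)

# [\s\S] matches any character at all, including newlines
any_char = re.findall(r'\b(\w{2,})\b(?=[\s\S]*\b\1e?s\b)', text)

print(dot_only)
print(any_char)
```

The final "cat" is not reported, because no plural follows it; that is the "singular must precede the plural" rule at work.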

10.5.2

Python

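Be sure to use a raw string with Python. Otherwise, your regexp will fail to match anything, and you won’t know why! (In a regular Python string, \b means a backspace character rather than a word boundary.)

import re

filename = 'alice.txt'

# Raw string, so that \b and \1 reach the regexp engine untouched
ro = re.compile(r'\b(\w{2,})\b(?=[\s\S]*\b\1e?s\b)')

with open(filename) as f:
    s = f.read()

print(ro.findall(s))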

10.5.3

Ruby

filename = 'alice.txt'

r = Regexp.new('\b(\w{2,})\b(?=[\s\S]*\b\1e?s\b)')

s = File.read(filename)

s.scan(r).each do |word|
  puts word
end

10.5.4

JavaScript

"use strict";

var fs = require('fs');

var r = /(\b\w{2,}\b)(?=[\s\S]*\b\1e?s\b)/g;

var filename = 'alice.txt';

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    // match returns null when nothing matches, so fall back to an empty array
    for (let match of data.match(r) || []) {
        console.log(match);
    }

    process.exit();
});

10.5.5

PostgreSQL

PostgreSQL’s regexp implementation doesn’t allow for the use of backreferences within lookahead constraints. Thus, I don’t believe that there’s a regexp solution to this problem.

Chapter 11

Replace

11.1

Replace

11.2

Crunch whitespace

This is another simple exercise, but one that has great practical implications. The idea is that you have read some text into your program. That text contains a number of types of whitespace characters – spaces, tabs, newlines, and even carriage returns. You want to turn every run of one or more of those characters into a single space character. So if you have the string

abc def\n \tghi \t \r \n jkl

You want to turn it into

abc def ghi jkl

11.2.1

Solution

The solution is to replace \s+, meaning one or more whitespace characters, with a ' ' (space) character. This will crunch multiple spaces into one, but it’ll also crunch newlines, joining everything into a single line. So this is probably not a regexp you’ll want to use when reading an entire file.
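If you do want to crunch whitespace in a whole file while keeping its line structure, one option is to apply the same replacement to each line separately. This is only a sketch of that idea; the function name crunch_lines and the sample string are hypothetical, and the book's own solutions below work on a single string.

```python
import re

def crunch_lines(text):
    """Crunch whitespace within each line, but keep the line breaks."""
    # splitlines() breaks on \n, \r\n and \r (among other line boundaries),
    # so \s+ never sees a newline here
    return "\n".join(re.sub(r'\s+', ' ', line) for line in text.splitlines())

sample = "abc   def\t\tghi\njkl \t mno"
crunched = crunch_lines(sample)
print(crunched)
```

Leading and trailing whitespace on a line is reduced to a single space rather than removed; add a call to strip() if that matters for your data.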

11.2.2

Python

import re

s = 'abc def\n \tghi \t \r \n jkl'

# Raw string, so that the backslash reaches the regexp engine untouched
ro = re.compile(r'\s+')

print(ro.sub(' ', s))

11.2.3

Ruby

s = "abc def\n \tghi \t \r \n jkl"

r = Regexp.new('\s+')

puts s.gsub(r, ' ')

11.2.4

JavaScript

"use strict";

var s = "abc def\n \tghi \t \r \n jkl";

// A regexp literal keeps \s intact; in the string '\s+' the backslash would be lost
var r = /\s+/g;

console.log(s.replace(r, ' '));

11.2.5

PostgreSQL

In PostgreSQL, we can use the regexp_replace function, in its four-parameter version (source, regexp, replacement, flags), to replace all of the occurrences of whitespace, which can also be identified with \s. However, a literal source string should be entered with a leading E, to ensure that the \n and \r, for example, are interpreted as newline and carriage-return characters. Notice also that I added the g flag, for a global replacement:

SELECT regexp_replace(E'abc def\n \tghi \t \r \n jkl',
    '\s+', ' ', 'g');

11.3

New hostname

Our company is rebranding from “foocorp” to “barcorp”, and as such, all of the URLs must change. We’re also changing our URLs such that if there is a www. before the foocorp, that should go away as well. And our corporate security team has said we need to use HTTPS instead of HTTP, so all of our URLs that currently use http now need to use https. Can we take care of all three of these at once? In other words, given the text Please visit http://www.foocorp.com/.

we should change it to Please visit https://barcorp.com/.

11.3.1

Solution

We need to find three things here: (1) the protocol, which might be http and might be https; (2) an optional www. before the hostname; and (3) the hostname itself. The following regexp should do the trick: https?://(www\.)?foocorp.com

Having ? after s makes that character optional, allowing us to match both http and https. We then make the entire www. optional by putting it in a group, and putting ? after that group. Finally, we also match our hostname. (Strictly speaking, the unescaped . in foocorp.com matches any character, not just a literal dot; write foocorp\.com if you want to be precise.) By replacing all of that with https://barcorp.com, we’ll catch all of these variations and standardize them.
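The three optional pieces can be checked against each combination of protocol and www. prefix. This is a minimal sketch: the list of URLs is invented for illustration, and unlike the solutions that follow, it escapes the dot in the hostname so that it only matches a literal ".".

```python
import re

# Assumption: the dot before "com" is escaped here; the solutions below leave it bare
ro = re.compile(r'https?://(www\.)?foocorp\.com')

# Hypothetical inputs covering both protocols, with and without www.
urls = [
    'http://www.foocorp.com/',
    'https://www.foocorp.com/about',
    'http://foocorp.com/',
    'https://foocorp.com/',
]

rewritten = [ro.sub('https://barcorp.com', url) for url in urls]

for url in rewritten:
    print(url)
```

Everything after the hostname, such as /about, is left alone because the regexp stops matching at the end of foocorp.com.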

11.3.2

Python

import re

s = 'Please visit http://www.foocorp.com/.'

# Raw string, so that \. reaches the regexp engine untouched
ro = re.compile(r'https?://(www\.)?foocorp.com')

print(ro.sub('https://barcorp.com', s))

11.3.3

Ruby

s = 'Please visit http://www.foocorp.com/.'

r = Regexp.new('https?://(www\.)?foocorp.com')

puts s.gsub(r, 'https://barcorp.com')

11.3.4

JavaScript

Don’t forget to escape the / characters in the regexp, since / also delimits a JavaScript regexp literal:

"use strict";

var s = 'Please visit http://www.foocorp.com/.';

// g replaces every occurrence, as the other solutions do
var r = /https?:\/\/(www\.)?foocorp.com/g;

console.log(s.replace(r, 'https://barcorp.com'));

11.3.5

PostgreSQL

SELECT regexp_replace(E'Please visit http://www.foocorp.com/.',
    'https?://(www\.)?foocorp.com', 'https://barcorp.com', 'g');

11.4

Detagify

While regexps shouldn’t be used for parsing HTML and XML, there are still times when they can be used to manipulate those formats. You have to be careful when doing this; a famous Stack Overflow answer about using regexp to parse XML demonstrates just how frustrated some programmers can get with some questions. However, there are some XML-related tasks for which regexps are perfectly suited. This exercise is one of them: Given a text string, you are to remove all of the XML/HTML tags, leaving everything else in place. It’s fine to leave some corner cases in place; we’re not trying to build the ultimate XML tag parser here. So if you have the string

<h1>This is a headline</h1>
<p>This is a paragraph with <a href="http://example.com/">a link</a>.</p>
<p>This is another paragraph,
this time on two lines!</p>

We want to strip all of the HTML tags from the above, leaving us with:

This is a headline
This is a paragraph with a link.
This is another paragraph,
this time on two lines!

11.4.1

Solution

The key to this solution is to use a non-greedy regexp. We might think that the following regexp will work: <.*>

If we replace the above regexp with an empty string, we won’t get an error message from the system. However, we’ll find that we get an empty string. Why? Because (assuming that . is allowed to match newlines) we asked the regexp system to remove everything, starting with the first < it can find and ending with the last > it can find. In other words, it replaced the entire original string with an empty string. One small change to our regexp will make it work perfectly: <.*?>

The above added ? after *, meaning that * should match the minimum possible, not the maximum. This effectively means that we’ll match a single tag. This is a great example of where the non-greedy operator can have a profound effect on what is matched.
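The effect of that single ? is easy to see side by side. This is a minimal sketch with a made-up two-line HTML snippet; the variable names are illustrative only.

```python
import re

# Hypothetical snippet that begins with "<" and ends with ">"
s = '<h1>Title</h1>\n<p>Some <b>bold</b> text</p>'

# Greedy: .* runs from the first "<" all the way to the last ">"
greedy = re.sub(r'<.*>', '', s, flags=re.DOTALL)

# Non-greedy: .*? stops at the nearest ">", so each tag is matched separately
non_greedy = re.sub(r'<.*?>', '', s, flags=re.DOTALL)

print(repr(greedy))
print(repr(non_greedy))
```

The greedy version leaves nothing behind, while the non-greedy version removes only the six tags and keeps the text between them.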

11.4.2

Python

import re

s = '''
<h1>This is a headline</h1>
<p>This is a paragraph with <a href="http://example.com/">a link</a>.</p>
<p>This is another paragraph,
this time on two lines!</p>
'''

# re.DOTALL lets . match newlines, so a tag may span more than one line
ro = re.compile(r'<.*?>', re.DOTALL)

print(ro.sub('', s))

11.4.3

Ruby

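s = '
<h1>This is a headline</h1>
<p>This is a paragraph with <a href="http://example.com/">a link</a>.</p>
<p>This is another paragraph,
this time on two lines!</p>
'

# .*? is non-greedy, so each match stops at the nearest >
r = Regexp.new('<.*?>')

puts s.gsub(r, '')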

11.4.4

JavaScript

JavaScript doesn’t allow ordinary quoted strings to include literal newlines. However, we can use a backslash at the end of a line to indicate that the string continues on the following line:

"use strict";

var s = '\n\
<h1>This is a headline</h1>\n\
<p>This is a paragraph with <a href="http://example.com/">a link</a>.</p>\n\
<p>This is another paragraph,\n\
this time on two lines!</p>\n\
';

// .*? is non-greedy, so each match stops at the nearest >
var r = /<.*?>/g;

console.log(s.replace(r, ''));

11.4.5

PostgreSQL

SELECT regexp_replace(E'<h1>This is a headline</h1>
<p>This is a paragraph with <a href="http://example.com/">a link</a>.</p>
<p>This is another paragraph,
this time on two lines!</p>
', '(?s)<.*?>', '', 'g');

11.5

Deunixify paths

Our company hired a technical writer who thought we were using Unix, but we were actually using Windows. This means that the paths in our text were all written as dir1/dir2/filename

But they really needed to be dir1\dir2\filename

We want to change all of the / characters to \ characters. Well, not all of them; we only want to do this if there are non-whitespace characters after our / character. Thus, given the following string: My file might be in /tmp/foo or in /tmp/bar; that / is tricky!

We want it to be turned into My file might be in \tmp\foo or in \tmp\bar; that / is tricky!.

Can you save the day, and turn the slashes into backslashes, and make this a Windows-friendly company?

11.5.1

Solution

On the face of it, we want to replace / with \. But we need to use lookahead to ensure that the following character is not whitespace. Thus, our regexp will be: /(?=\S)

The above means: Find a / character, but only if the following character is non-whitespace. We could instead use a negative lookahead to say that the following character should not be whitespace: /(?!\s) The two are nearly equivalent; the only difference is that the negative version also matches a / at the very end of the string, where there is no following character at all.
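The two lookahead forms can be compared on a string that has a slash at its very end, which is the one place where they behave differently. This is a minimal sketch; the sample string is invented for illustration.

```python
import re

# Hypothetical sample with a lone slash in the middle and a slash at the very end
s = 'see /tmp/foo and / and dir/'

# Positive lookahead: requires a non-whitespace character after the slash
positive = re.sub(r'/(?=\S)', r'\\', s)

# Negative lookahead: only forbids a whitespace character after the slash
negative = re.sub(r'/(?!\s)', r'\\', s)

print(positive)
print(negative)
```

Both versions leave the lone slash alone, but only the negative lookahead converts the trailing one.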

11.5.2

Python

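Notice that the replacement is a raw string containing a double backslash: a raw string cannot end with a single backslash, since that would escape the closing quote, and re.sub treats \\ in the replacement as one literal backslash:

import re

s = 'My file might be in /tmp/foo or in /tmp/bar; that / is tricky!'

# Raw string, so that \s reaches the regexp engine untouched
ro = re.compile(r'/(?!\s)')

print(ro.sub(r'\\', s))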

11.5.3

Ruby

s = 'My file might be in /tmp/foo or in /tmp/bar; that / is tricky!'

r = Regexp.new('/(?!\s)')

puts s.gsub(r, '\\')

11.5.4

JavaScript

"use strict";

var s = "My file might be in /tmp/foo or in /tmp/bar; that / is tricky!";

var r = /\/(?!\s)/g;

console.log(s.replace(r, '\\'));

11.5.5

PostgreSQL

SELECT regexp_replace(E'My file might be in /tmp/foo or in /tmp/bar; that / is tricky!',
    $$/(?!\s)$$, '\\', 'g');

Chapter 12

Unix shell

12.1

Disk space

The df program returns the current disk usage for each of your filesystems. One of the columns indicates the percentage of disk space being used. Use a regexp (and grep) to find those filesystems that have at least 80% usage. You can assume that the output from df will only use a % sign when reporting that percentage. You can return the entire line with such a percentage.

12.1.1

Solution

In order to solve this problem, we’ll need to invoke df and then pipe its output through grep. Indeed, I’d guess that at least half of the times I use grep in my work, it’s to find matching lines in the output from another program. If we want all of the disk usage, we could use the following: $ df | grep --color '[0-9]\+%'

Notice that because we’re using grep, the + metacharacter must be prefaced with a backslash in order to be seen as special. Notice also that I use [0-9] rather than \d, since \d is not part of grep’s basic regexp syntax and many versions of grep don’t recognize it.

But we’re not interested in all percentages; only those that are at least 80% are of interest. Let’s ignore 100% for now; those that are in the range from 80% - 99% will consist of two digits, in which the first is either 8 or 9. We can thus say: $ df | grep --color '[89][0-9]%'

This will indeed match all percentages from 80 - 99. But it fails to match 100%. In order to find that, it’s probably easiest to use alternation, using the | character. However, this has two problems: First, in grep, | is only a metacharacter when prefaced by a backslash. Second, without grouping, the alternation would apply to everything on either side of the |, including the %. Thus, we need to put the numbers inside of parentheses, for them to limit the scope of the |. But even that won’t work, because if we want parentheses to be seen as metacharacters, we need to precede them with backslashes, too! We thus end up with the following: $ df | grep '\(100\|[89][0-9]\)%'

The above will then match all lines with disk usage between 80% and 100%, inclusive.
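The same filter can be tried without a real df, which makes the edge cases easier to see. This is a sketch in Python rather than grep; the four lines of "df output" are invented for illustration, and in Python's syntax the parentheses and | need no backslashes.

```python
import re

# Hypothetical df-style lines; only the percentage column contains a % sign
df_lines = [
    "/dev/sda1  100G   85G   15G   85% /",
    "/dev/sda2   50G   10G   40G   20% /home",
    "tmpfs        1G    1G     0  100% /run",
    "/dev/sdb1  200G   16G  184G    8% /data",
]

# Either 100, or a two-digit number starting with 8 or 9, followed by %
ro = re.compile(r'(100|[89][0-9])%')

nearly_full = [line for line in df_lines if ro.search(line)]

for line in nearly_full:
    print(line)
```

Note that the single-digit 8% on the last line is correctly rejected, because the regexp insists on a second digit after the 8 or 9.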

12.2

Not-today files

Find all of the files in a directory that were not modified today. In other words, if today is April 1st, and the directory listing (using ls -l for a “long” listing) looks like this:

-rw-r--r--  1 reuven  501   1967 Apr  1 10:02 UNIX-disk-space.md
-rw-r--r--  1 reuven  501    223 Apr  2 22:53 UNIX-files-not-today.md
-rw-r--r--  1 reuven  501    499 Mar  2 09:56 UNIX-old-new-office-files.md
-rw-r--r--  1 reuven  501    177 Mar  2 09:56 UNIX-python-ruby-programs.md
-rwxr-xr-x  1 reuven  501   3694 Mar  9 11:39 extract-exercises.py*
-rw-r--r--  1 reuven  501    678 Mar 30 09:10 ipython_log.py
-rw-r--r--  1 reuven  501  53769 Mar 23 16:03 solutions.zip
-rw-r--r--  1 reuven  501    939 Apr  1 11:31 template.md

We’re only interested in seeing the lines whose timestamp says Apr 1, and want to see those lines. However, we don’t want to insert a literal Apr 1 in there; it should reflect the current date. So if I issue that same command tomorrow, it’ll show files from April 2nd.

12.2.1

Solution

Solving this problem requires using the Unix date command. This command displays the current date and time when invoked by itself, but it can also display them in a variety of formats. Depending on what version of Unix you’re using, and whether (and under what names) you have installed the GNU date utility, invoking man date will either give you clear documentation for how to format things, or will say nothing, forcing you to look elsewhere – sometimes, under man strftime, in my experience.

To get the current date in the format used by ls, in which months are abbreviated to three letters and single-digit dates are padded with spaces rather than 0, you’ll need to use the format %b %e, as in: date +'%b %e'

That will give us the current date. But now we need to use grep to find matching lines. If we were interested in finding all files with a in the line, we could say ls -l | grep a

So you might think that we could similarly say ls -l | grep date +'%b %e'

But that won’t quite work, because we’re interested in using the result of invoking date. To run a command and get its result back as a string, we can use backticks: ls -l | grep `date +'%b %e'`

But even that won’t be quite enough, because there’s whitespace in the result from date. Thus, the Unix shell interprets our command as grep Apr 3, and grep treats the 3 as the name of a file to search. The solution is to put the backticks inside of double quotes, for a total of three types of quote: ls -l | grep "`date +'%b %e'`"

And sure enough, this works! We calculate the current date, and use that (in double quotes) as an argument to grep. We then use that grep command to filter through the output from ls -l.
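The same idea, building today's date in the layout that ls uses and then filtering on it, can be sketched in Python. The helper name ls_date, the year 2016, and the shortened listing are assumptions made for illustration; %b also assumes an English (or C) locale. To list the files that were not modified today, as the exercise title suggests, invert the test (with grep, that is the -v option).

```python
import datetime

def ls_date(day):
    """Format a date like date +'%b %e': short month, space-padded day."""
    # strftime('%b') gives the abbreviated month; :2d pads the day with a space
    return f"{day.strftime('%b')} {day.day:2d}"

# Hypothetical subset of the directory listing shown in the exercise
listing = [
    "-rw-r--r--  1 reuven  501   1967 Apr  1 10:02 UNIX-disk-space.md",
    "-rw-r--r--  1 reuven  501    223 Apr  2 22:53 UNIX-files-not-today.md",
    "-rw-r--r--  1 reuven  501    678 Mar 30 09:10 ipython_log.py",
    "-rw-r--r--  1 reuven  501    939 Apr  1 11:31 template.md",
]

today = datetime.date(2016, 4, 1)   # assumed date; use datetime.date.today() in practice
stamp = ls_date(today)

modified_today = [line for line in listing if stamp in line]
not_today = [line for line in listing if stamp not in line]

print(modified_today)
print(not_today)
```

Because the day is padded to two characters, "Apr  1" cannot accidentally match "Apr 10" through "Apr 19".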

12.3

Problem logs

In exercise 9.5, we found the IP addresses for all requests to our server that had no errors. In this exercise, we want to find all of the requests in fakelog.txt for which there were problems.

We can make this a bit simpler: In fakelog.txt, errors are indicated with a line that looks like: [2015-Sep-2 10:16:44] 11.22.33.44

Result 404: File not found

We can assume that all errors have either the code 404 or 500. Other result codes are not of interest to us. Your task is to use grep to find all of the result codes 404 or 500, and display not only the line on which this code appeared, but the line before it.

12.3.1

Solution

We can start with the following regexp: grep 'Result \(404\|500\):' fakelog.txt

The above uses alternation to find either 404 or 500. Notice that because we’re using grep, we need to preface (, ), and | with backslashes to make them metacharacters. I always like to have as much context as possible around such matches, to ensure a minimum of false positives. However, the above will only show the matching lines themselves. Because we’re interested not only in that line, but also in the line before it, we’ll use the -B option (“before”) to display a single line before the match: grep -B1 'Result \(404\|500\):' fakelog.txt

When applied to fakelog.txt, this shows not only the line with the error, but also the line before it.
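What -B1 does can be sketched in a few lines of Python: remember the previous line, and emit it whenever the current line matches. The function name and the sample log are hypothetical; in particular, the 200 and 500 result lines are invented, and the sketch assumes that each result sits on the line after its timestamp. Unlike grep, it does not print "--" separators, and it would repeat a line if two matches were adjacent.

```python
import re

def errors_with_context(lines):
    """Return each 404/500 result line together with the line before it."""
    ro = re.compile(r'Result (404|500):')
    found = []
    previous = None
    for line in lines:
        if ro.search(line):
            if previous is not None:
                found.append(previous)   # the "before" context line
            found.append(line)
        previous = line
    return found

# Hypothetical log in the assumed two-line-per-request layout
log = [
    "[2015-Sep-2 10:16:44] 11.22.33.44",
    "Result 404: File not found",
    "[2015-Sep-2 10:16:50] 11.22.33.45",
    "Result 200: OK",
    "[2015-Sep-2 10:17:02] 11.22.33.46",
    "Result 500: Server error",
]

problems = errors_with_context(log)
for line in problems:
    print(line)
```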

12.4

Old and new Office files

Several years ago, Microsoft started to use the .docx and .xlsx suffix on their files, rather than the three-letter .doc and .xls. Given a directory listing, display all files that have any of those suffixes. Note that if a filename contains .doc (or any other of these suffixes) in the middle, but not at the end, then it should not be displayed.

Assume that ls -1 gives you a listing of all files in a single column, such that you can treat each filename as a single row in the input to grep.

12.4.1

Solution

This exercise combines several different aspects of regexps that we’ve seen throughout the book. First and foremost, we want to use ls -1, because it means that the filenames will be displayed in a single column, which allows us to use the ^ and $ anchors. And indeed, that’s what we’re going to do: We know that the suffix will come at the end of a filename. Thus, if we were merely interested in .doc files, we could use: ls -1 | grep '\.doc$'

But we want to find all .doc and .docx files, meaning that our regexp must change to: ls -1 | grep '\.docx\?$'

Notice that I needed to use \?, not ? in the regexp. That’s because when using grep, you need to preface ? with a backslash to make it a metacharacter. But we’re not interested in just .doc and .docx. We’re also interested in .xls and .xlsx files. Thus, we’ll use some alternation:

ls -1 | grep '\.\(doc\|xls\)x\?$'

Perhaps now you can understand why Larry Wall said that the regexps in grep suffered from “backslashitis” – we need to backslash ( and ), as well as |, in order to say that we want to have a leading dot (escaped with a backslash as well), then either doc or xls, then an optional x, just before the end of the filename. While it might look ugly, this does indeed do the job, displaying all of the Excel and Word documents, regardless of suffix.
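For comparison, here is the same pattern without the backslashes, as a minimal Python sketch. The list of filenames is invented for illustration; it includes names with a suffix in the middle, which should be rejected.

```python
import re

# Leading literal dot, then doc or xls, then an optional x, at the end of the name
ro = re.compile(r'\.(doc|xls)x?$')

# Hypothetical directory contents, one filename per entry (as from ls -1)
filenames = [
    'report.doc',
    'budget.xlsx',
    'notes.docx.bak',
    'my.doc.txt',
    'data.xls',
    'readme.md',
]

office_files = [name for name in filenames if ro.search(name)]

for name in office_files:
    print(name)
```

The $ anchor is what rejects notes.docx.bak and my.doc.txt, whose suffixes appear only in the middle of the name.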