Practice Makes Regexp


English Pages [206]

Author / Uploaded
Reuven M. Lerner

Table of contents :
Frontmatter
Regexp use from programming languages
Input data
Exercises
Simple regexps
Character classes
Alternation
Anchoring
Groups
Flags
Backreferences
Replace
Unix shell

Practice Makes Regexp 50 exercises to help you master regular expressions Reuven M. Lerner, PhD

Contents Preface: Practice Makes Regexp 1 About me 2 Acknowledgements

Chapter 1 Regexp use from programming languages 1.1 Python 1.1.1 Defining regexps 1.1.2 Finding one 1.1.3 Finding more than one 1.1.4 Substituting text 1.1.5 Flags 1.1.6 Advanced features 1.1.7 More information 1.1.8 About Python solutions 1.2 Ruby 1.2.1 Defining regexps 1.2.2 Finding one 1.2.3 Finding more than one 1.2.4 Substituting text 1.2.5 Flags 1.2.6 Advanced features 1.2.7 More information 1.2.8 About Ruby solutions 1.3 JavaScript 1.3.1 Defining regexps 1.3.2 Finding one or more 1.3.3 Substituting text 1.3.4 Advanced features 1.3.5 More information 1.3.6 About JavaScript solutions 1.4 PostgreSQL 1.4.1 Defining regexps 1.4.2 True/false operators 1.4.3 Extracting text 1.4.4 Splitting 1.4.5 More information 1.5 grep 1.5.1 Basic use 1.5.2 Backslashes

1.5.3 Context

Chapter 2 Input data 2.1 Dictionary (words.txt) 2.2 Alice in Wonderland (alice.txt) 2.3 Config (config.txt) 2.4 Apache logfile (access-log.txt) 2.5 Linux “passwd” file (passwd.txt) 2.6 Fakelog (fakelog.txt) 2.7 PostgreSQL database

Chapter 3 Exercises 3.1 Simple regexps 3.1.1 Find matches 3.1.2 Five-letter words 3.1.3 Double “f” in the middle 3.1.4 Extract timestamp 3.2 Character classes 3.2.1 End-of-sentence words 3.2.2 Hex numbers 3.2.3 Hexwords 3.2.4 IP addresses 3.2.5 Long, weird words 3.2.6 Matching URLs 3.2.7 Non-zero hours 3.2.8 Quoted text 3.2.9 Supervocalic 3.2.10 Double triple vowel 3.2.11 Postfix dollar 3.3 Alternation 3.3.1 Multiple date formats 3.3.2 “oo” and “ee” words 3.3.3 British and American spelling 3.4 Anchors 3.4.1 Capital vowel starts 3.4.2 Comment lines 3.4.3 Last five characters 3.4.4 u in the 2nd-to-last word 3.5 Groups 3.5.1 Date and time 3.5.2 Config pairs 3.5.3 Quote first and last words 3.5.4 Prices with symbols 3.5.5 Question first word 3.5.6 t, but no “ing” 3.5.7 Usernames and user IDs

3.5.8 Beheaded usernames 3.5.9 Final question words 3.5.10 “d” user shells 3.6 Flags 3.6.1 All usernames 3.6.2 abc 3.6.3 abcABC 3.6.4 abcABC, extended 3.6.5 No-error IP addresses 3.7 Backreferences 3.7.1 Doubled vowels 3.7.2 Hours and seconds 3.7.3 Seven-letter start-finish words 3.7.4 end-start 3.7.5 Singular and plural 3.8 Replace 3.8.1 Crunch whitespace 3.8.2 New hostname 3.8.3 Detagify 3.8.4 Deunixify paths 3.9 Unix command line 3.9.1 Disk space 3.9.2 Not-today files 3.9.3 Problem logs 3.9.4 Old and new Office files

Chapter 4 Simple regexps 4.1 Find matches 4.1.1 Solution 4.1.2 Python 4.1.3 Ruby 4.1.4 JavaScript 4.1.5 PostgreSQL 4.2 Five-letter words 4.2.1 Solution 4.2.2 Python 4.2.3 Ruby 4.2.4 JavaScript 4.2.5 PostgreSQL 4.3 Double “f” in the middle 4.3.1 Solution 4.3.2 Python 4.3.3 Ruby 4.3.4 JavaScript 4.3.5 PostgreSQL 4.4 Extract timestamp 4.4.1 Solution 4.4.2 Python 4.4.3 Ruby 4.4.4 JavaScript 4.4.5 PostgreSQL

Chapter 5 Character classes 5.1 End-of-sentence words 5.1.1 Solution 5.1.2 Python 5.1.3 Ruby 5.1.4 JavaScript 5.1.5 PostgreSQL 5.2 Hex numbers 5.2.1 Solution 5.2.2 Python 5.2.3 Ruby 5.2.4 JavaScript 5.2.5 PostgreSQL 5.3 Hexwords 5.3.1 Solution 5.3.2 Python 5.3.3 Ruby 5.3.4 JavaScript 5.3.5 PostgreSQL 5.4 IP addresses 5.4.1 Solution 5.4.2 Python 5.4.3 Ruby 5.4.4 JavaScript 5.4.5 PostgreSQL 5.5 Long, weird words 5.5.1 Solution 5.5.2 Python 5.5.3 Ruby 5.5.4 JavaScript 5.5.5 PostgreSQL 5.6 Matching URLs 5.6.1 Solution 5.6.2 Python 5.6.3 Ruby

5.6.4 JavaScript 5.6.5 PostgreSQL 5.7 Non-zero hours 5.7.1 Solution 5.7.2 Python 5.7.3 Ruby 5.7.4 JavaScript 5.7.5 PostgreSQL 5.8 Quoted text 5.8.1 Solution 5.8.2 Python 5.8.3 Ruby 5.8.4 JavaScript 5.8.5 PostgreSQL 5.9 Supervocalic 5.9.1 Solution 5.9.2 Python 5.9.3 Ruby 5.9.4 JavaScript 5.9.5 PostgreSQL 5.10 Double triple vowel 5.10.1 Solution 5.10.2 Python 5.10.3 Ruby 5.10.4 JavaScript 5.10.5 PostgreSQL 5.11 Postfix dollar 5.11.1 Solution 5.11.2 Python 5.11.3 Ruby 5.11.4 JavaScript 5.11.5 PostgreSQL

Chapter 6 Alternation 6.1 Multiple date formats 6.1.1 Solution 6.1.2 Python 6.1.3 Ruby 6.1.4 JavaScript 6.1.5 PostgreSQL 6.2 “oo” and “ee” words 6.2.1 Solution 6.2.2 Python 6.2.3 Ruby 6.2.4 JavaScript 6.2.5 PostgreSQL 6.3 British and American spelling 6.3.1 Solution 6.3.2 Python 6.3.3 Ruby 6.3.4 JavaScript 6.3.5 PostgreSQL

Chapter 7 Anchoring 7.1 Capital vowel starts 7.1.1 Solution 7.1.2 Python 7.1.3 Ruby 7.1.4 JavaScript 7.1.5 PostgreSQL 7.2 Comment lines 7.2.1 Solution 7.2.2 Python 7.2.3 Ruby 7.2.4 JavaScript 7.2.5 PostgreSQL 7.3 Last five characters 7.3.1 Solution 7.3.2 Python 7.3.3 Ruby 7.3.4 JavaScript 7.3.5 PostgreSQL 7.4 u in the 2nd-to-last word 7.4.1 Solution 7.4.2 Python 7.4.3 Ruby 7.4.4 JavaScript 7.4.5 PostgreSQL

Chapter 8 Groups 8.1 Date and time 8.1.1 Solution 8.1.2 Python 8.1.3 Ruby 8.1.4 JavaScript 8.1.5 PostgreSQL 8.2 Config pairs 8.2.1 Solution 8.2.2 Python 8.2.3 Ruby 8.2.4 JavaScript 8.2.5 PostgreSQL 8.3 Quote first and last words 8.3.1 Solution 8.3.2 Python 8.3.3 Ruby 8.3.4 JavaScript 8.3.5 PostgreSQL 8.4 Prices with symbols 8.4.1 Solution 8.4.2 Python 8.4.3 Ruby 8.4.4 JavaScript 8.4.5 PostgreSQL 8.5 Question first word 8.5.1 Solution 8.5.2 Python 8.5.3 Ruby 8.5.4 JavaScript 8.5.5 PostgreSQL 8.6 t, but no “ing” 8.6.1 Solution 8.6.2 Python 8.6.3 Ruby

8.6.4 JavaScript 8.6.5 PostgreSQL 8.7 Usernames and user IDs 8.7.1 Solution 8.7.2 Python 8.7.3 Ruby 8.7.4 JavaScript 8.7.5 PostgreSQL 8.8 Beheaded usernames 8.8.1 Solution 8.8.2 Python 8.8.3 Ruby 8.8.4 JavaScript 8.8.5 PostgreSQL 8.9 Final question words 8.9.1 Solution 8.9.2 Python 8.9.3 Ruby 8.9.4 JavaScript 8.9.5 PostgreSQL 8.10 “d” user shells 8.10.1 Solution 8.10.2 Python 8.10.3 Ruby 8.10.4 JavaScript 8.10.5 PostgreSQL

Chapter 9 Flags 9.1 All usernames 9.1.1 Solution 9.1.2 Python 9.1.3 Ruby 9.1.4 JavaScript 9.1.5 PostgreSQL 9.2 abc 9.2.1 Solution 9.2.2 Python 9.2.3 Ruby 9.2.4 JavaScript 9.2.5 PostgreSQL 9.3 abcABC 9.3.1 Solution 9.3.2 Python 9.3.3 Ruby 9.3.4 JavaScript 9.3.5 PostgreSQL 9.4 abcABC, extended 9.4.1 Solution 9.4.2 Python 9.4.3 Ruby 9.4.4 JavaScript 9.4.5 PostgreSQL 9.5 No-error IP addresses 9.5.1 Solution 9.5.2 Python 9.5.3 Ruby 9.5.4 JavaScript 9.5.5 PostgreSQL

Chapter 10 Backreferences 10.1 Doubled vowels 10.1.1 Solution 10.1.2 Python 10.1.3 Ruby 10.1.4 JavaScript 10.1.5 PostgreSQL 10.2 Hours and seconds 10.2.1 Solution 10.2.2 Python 10.2.3 Ruby 10.2.4 JavaScript 10.2.5 PostgreSQL 10.3 Seven-letter start-finish words 10.3.1 Solution 10.3.2 Python 10.3.3 Ruby 10.3.4 JavaScript 10.3.5 PostgreSQL 10.4 end-start 10.4.1 Solution 10.4.2 Python 10.4.3 Ruby 10.4.4 JavaScript 10.4.5 PostgreSQL 10.5 Singular and plural 10.5.1 Solution 10.5.2 Python 10.5.3 Ruby 10.5.4 JavaScript 10.5.5 PostgreSQL

Chapter 11 Replace 11.1 Replace 11.2 Crunch whitespace 11.2.1 Solution 11.2.2 Python 11.2.3 Ruby 11.2.4 JavaScript 11.2.5 PostgreSQL 11.3 New hostname 11.3.1 Solution 11.3.2 Python 11.3.3 Ruby 11.3.4 JavaScript 11.3.5 PostgreSQL 11.4 Detagify 11.4.1 Solution 11.4.2 Python 11.4.3 Ruby 11.4.4 JavaScript 11.4.5 PostgreSQL 11.5 Deunixify paths 11.5.1 Solution 11.5.2 Python 11.5.3 Ruby 11.5.4 JavaScript 11.5.5 PostgreSQL

Chapter 12 Unix shell 12.1 Disk space 12.1.1 Solution 12.2 Not-today files 12.2.1 Solution 12.3 Problem logs 12.3.1 Solution 12.4 Old and new Office files 12.4.1 Solution

Preface: Practice Makes Regexp

Regular expressions (“regexps”) are often seen as equal parts blessing and curse. On the one hand, they are generally acknowledged to be powerful, useful, and often indispensable tools for identifying and retrieving pieces of text from within a larger corpus. In an age in which we are inundated with text, being able to write programs that can search through gigabytes, finding specific patterns of text for us, is nothing short of amazing.

And yet. Regular expressions, for all of their power, remain mysterious, unreadable, and scary. A large number of professional, established programmers I know, who are quite smart and educated, have expressed their doubts about regular expressions – or say that they’ll get around to it one of these days. Or not.

I have to admit that I understand their feelings; my first exposure to regular expressions was in 1988, when I read through the manual for GNU Emacs. The manual’s description of regular expressions seemed intriguing, but when I got to the part of the manual that described how to use them, I wondered whether this was really something that I had to learn, or that I wanted to learn. The answer was a resounding “no,” and I ignored regular expressions for about four more years, until I started to program in Perl.

Perl didn’t invent regular expressions, but it did basically require that you use them if you wanted to use the language. It also expanded the standard regular-expression library in many new and different ways, providing additional power – and tricky syntax! – that made it possible to examine, identify, and extract text even more easily than before. If you could master the syntax, of course.

So, regular expressions are a technology that is universally seen as powerful and important, but also hard to learn and even harder to put into practice. Much of my time is spent teaching programming courses at large multinational companies, and while a minority of developers there say that they have taught themselves regular expressions, the overwhelming majority are completely unfamiliar with the syntax or use a very small part of regular expressions’ power.

I have been teaching regular expressions for years, but it was only in 2015 that I began to teach a separate class on the subject. For two days, we do nothing but drill, drill, drill regexp syntax until it’s coming out of their ears. At the conclusion of the course, participants have written several dozen regexps, and are as a result able to see how to apply them in their own work. (Indeed, one of my favorite things to do in such classes is have people bring problems from their own work, so that we can build regexps that will be useful in their day-to-day jobs.)

The success of this course has led me to the conclusion that, as with so many things that appear to have inscrutable syntax, understanding of regular expressions comes through practice, experimentation, making mistakes, and then having the “aha!” moment in which it all makes sense. In theory, the workplace can provide such opportunities for practice, but in reality, work is often too busy, inflexible, or harried. Plus, when you’re working on a real problem for work, it is almost by definition a new problem – meaning that there isn’t anyone to walk you through the solution.

This book is aimed at people who have learned the basics of regular expressions, either in a course or from reading a manual, but don’t quite understand when and how to use each part of the regexp syntax. When (and how) do you use groups? When do you define character classes? How (and why) do you create non-capturing groups?

This book doesn’t teach regular expressions; you can find numerous tutorials, lectures, and other resources online to get you that far. Rather, this book is intended to get you to understand and internalize regexp syntax through many different exercises. Most of these exercises are quite short, with a simple requirement. That said, the fact that a regexp’s specification is short, and that the regexp that solves the problem is one line long, doesn’t mean that it’ll be easy for you to come up with the solution.

For that reason, every exercise comes with not only the solution, but also explanations and working code in Python, Ruby, JavaScript, and PostgreSQL. A final chapter discusses the Unix command line, concentrating on the venerable – and invaluable – grep program, which is where most of us first encountered regexps.

I chose these technologies because they are used by a large (and growing) number of programmers, and because many of the people using them aren’t aware of the fact that they contain sophisticated regexp engines. (Fine, most Ruby developers probably are – but I have encountered many PostgreSQL developers who had no idea that regexps were baked into the database.) The differences between the various implementations, and the ways in which the languages work with regular expressions, also provide me with a chance to demonstrate the pitfalls that developers encounter when working with regular expressions.

1

About me

I am an independent consultant, and have been since 1995. For many years, I have split my time between developing Web applications, consulting to companies about how to use technology to improve their businesses, and teaching programming courses (in the United States, Europe, Israel, and China). I use regular expressions nearly every day in my work, often in multiple technologies.

I got my start as a Web developer back in 1993, when I helped to set up one of the first 100 Web sites in the world for The Tech, MIT’s student newspaper. After working for Hewlett Packard and Time Warner in the United States, I moved to Israel in 1995, and began work as a freelance consultant. In 2014, I completed my PhD in Learning Sciences (computer science + cognitive science + design + education) at Northwestern University. My dissertation research involved the creation and analysis of the Modeling Commons, an online collaborative community for agent-based models written in NetLogo.

I have been the Web technology columnist for Linux Journal since 1996, wrote “Core Perl” for Prentice Hall back in 2000, and self-published Practice Makes Python in 2014. I also give frequent lectures at technology conferences, helping technical and non-technical audiences alike to put new technologies into context.

I live in Modi’in, Israel (halfway between Jerusalem and Tel Aviv) with my wife and three children. In my spare time, I enjoy reading, spending time with my children, and learning Chinese. (When people say that regexps are as difficult as Chinese, I can actually answer them!)

I am very curious to hear from you, the person reading this book. Were the exercises too easy or too hard? Did they focus on the right topics? Are there aspects of regexps that you believe would be more useful to learn and practice? Please let me know what you think, and what improvements, corrections, and additions might be useful in updated editions. You can always reach me at [email protected], or on the Web at http://lerner.co.il.

2

Acknowledgements

I have been fortunate to teach programming to many thousands of people over the years. These students have often given me insights and ideas for new problems, as well as improvements to the solutions that I have provided. I appreciate the feedback and input, and hope that readers of this book will similarly help to improve my understanding of Python, and the answers provided here.

I also thank my family for their constant support, even when they don’t quite know what it is that I do, let alone what “regexps” are.

Chapter 1

Regexp use from programming languages

This book is aimed at people using regular expressions in a variety of programming languages. There are three major problems with this approach, however:

Every programming language implements a slightly different version, or dialect, of the regexp language. Thus, regexps in Python will be slightly different from regexps in JavaScript, which are different from regexps on the Unix command line. Unfortunately, different versions of a language can sometimes support multiple, conflicting regexp dialects; the Unix command line, in particular, has programs that support a variety of regexp dialects, which can make things even more confusing and frustrating.

Every programming language has to implement an interface between the regexp engine and the rest of the language. Thus, how you define the regexp differs from language to language; do you use strings (as in Python and PostgreSQL), or regexp objects delimited by slashes (as in Ruby and JavaScript), or do you create distinct objects? Or do you have multiple options available to you? Then, once you have created your regexp, how do you apply it to a piece of text? What operators, methods, and/or functions are available to you?

Finally, every language and technology produces results from regexp operations in different ways. And in many cases, the ways in which you extract results – especially when working with groups – can affect the results that you get.

While this book is not meant to teach you regular expressions, I do feel compelled to provide a brief survey of how to use them from within each language. I’ll also provide a number of links for each language, so that you can learn about each in greater detail.

The higher-level tiers of this book include the 300+ slides that I use in the class I teach on regular expressions, given to a number of Fortune 500 companies over the last few years. Those slides introduce the regexp syntax as used in Python, in part because of Python’s popularity but also because Python offers a rich version of regexps, with more features than many other languages.

1.1

Python

Python comes with a powerful regular expression engine. It is, in many ways, similar to the engine that comes with Perl 5; while this book does not use Perl in its examples, there is no doubt that Perl’s influence on the world of regexps was strong and long lasting. In particular, such options as non-greedy operators and non-capturing groups were innovations from Perl that have made their way into Python and others. As in Perl, and many other programming languages (but unlike grep and Emacs), you use backslashes in Python to neutralize a metacharacter. Thus, + is a metacharacter, indicating that the previous character must appear one or more times – but \+ matches the plain ol’ + character.

1.1.1

Defining regexps

In Python, all usage of regular expressions is handled via the re module. This means that if and when you want to work with regexps from within Python, you must include the line

import re

somewhere before your first usage of regexps, preferably at the top of the file along with other import statements. You then define a regexp as a string, as in:

s = 'abc.def'

It’s important to point out that because all regexps in Python are first created as strings, the Python parser may handle some regexps differently than you might expect. For example, let’s say that your regexp is looking for the string abc as a word on its own. You would likely want to use the \b (word boundary) metacharacter to indicate this in your regexp, as follows:

s = '\babc\b'

However, this will fail. That’s because \b is treated by Python’s string parser as a special character (ASCII 8, or backspace). The regexp engine will thus think that it’s to look for the backspace character, rather than the \b metacharacter. The same is true if you use backreferences, which use backslashes followed by numbers, such as \1. Python’s string parser reads \1 as an octal escape (the character with ASCII code 1), so the regexp engine never sees the backreference; you won’t get an error message, just a regexp that quietly fails to match. In both of these cases, what you need to do is double your backslash, as follows:

s = '\\babc\\b'  # doubled backslashes
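The difference between the plain and doubled spellings is easy to check at the Python prompt. The following is a minimal sketch; the sample text and the variable names are made up for illustration, and the last comparison uses the raw-string spelling that the next paragraph describes.

```python
import re

text = 'abc and abcdef'  # made-up sample text

# Without doubling, Python's string parser turns \b into a backspace
# character (code 8) before the regexp engine ever sees the pattern.
plain = '\babc\b'
print(len(plain), ord(plain[0]))   # five characters, the first is code 8

# Likewise, '\1' is read as an octal escape: the character with code 1.
print(ord('\1'))

# Doubling the backslashes keeps them in the string, so the regexp
# engine receives \b and treats it as a word boundary.
doubled = '\\babc\\b'
print(doubled == r'\babc\b')       # same string as the raw-string form

matches_plain = re.findall(plain, text)      # looks for backspaces
matches_doubled = re.findall(doubled, text)  # looks for the word abc
print(matches_plain, matches_doubled)
```

Only the doubled pattern finds the standalone word abc; the plain one searches for literal backspace characters that the text does not contain.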

If this gets annoying, then you can always use a “raw string” – just put an r before the opening quote of a Python string, and Python leaves the backslashes alone, just as if you had doubled each one. You can think of a raw string as a way to tell Python that you want the string to be precisely as you entered it:

s = r'\babc\b'  # raw string

1.1.2

Finding one

Once you have created a regexp string, you can then search for it inside of text. Python provides you with two basic ways to search inside of text with regexps: You can either search for a single occurrence, or for all of the occurrences. To search for a single occurrence of your regexp within a string, you’ll use the re.match or re.search functions. Both of them work in precisely the same way, except that re.match automatically anchors your regexp to the start of the string. (You can think of re.match as automatically anchoring the regexp with \A, representing the start of the string. It’s not the same as anchoring with ^, because in multiline mode, ^ matches the start of each line, not just the start of the string.)

Some examples:

text = 'hello, world'
re.match('hello', text)   # Find "hello" at the start of text
re.search('hello', text)  # Find "hello" anywhere in text
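To make the anchoring behavior concrete, here is a small sketch that runs the same pattern through re.match and re.search on a two-line string. The sample text and variable names are invented for this illustration.

```python
import re

text = 'first line\nhello, world'  # made-up two-line sample

# re.match only succeeds at the very start of the string.
at_start = re.match('hello', text)

# re.search scans forward and finds the match on the second line.
anywhere = re.search('hello', text)

# In multiline mode, ^ matches at the start of each line...
caret_multiline = re.search('^hello', text, re.MULTILINE)

# ...but \A still means the start of the whole string,
anchored = re.search(r'\Ahello', text, re.MULTILINE)

# and re.match keeps anchoring at the start of the string as well.
match_multiline = re.match('hello', text, re.MULTILINE)

print(at_start, anywhere.start(), caret_multiline.start())
print(anchored, match_multiline)
```

The results for re.match and for \A agree in every case here, which is why thinking of re.match as an implicit \A is a reasonable mental model.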

Both re.search and re.match return either None (if no match was found) or a “match object” if one was. A match object, traditionally named m, has a number of useful attributes and methods, the most popular of which is m.group(0). This returns the entire string that the regexp matched. If there were any groups within the regexp, then you can retrieve the individual groups by passing the group number to m.group. In order to avoid trying to invoke group on None, it’s traditional to check m before using it (None evaluates to False in a boolean context, such as an if statement). For example:

text = 'hello, world'
m = re.search(r'\b(h.)(..o)\b', text)
if m:
    print("Full match: {}".format(m.group(0)))  # hello
    print("First part: {}".format(m.group(1)))  # he
    print("Last part: {}".format(m.group(2)))   # llo

A regexp string can be compiled into a regexp object. If you are planning to use a regexp within a loop, then it is advisable to reduce your program’s overhead, and compile the regexp a single time, before the first loop iteration. For example:

text = 'hello, world'
r = re.compile('(h.)(..o)')
m = r.search(text)
if m:
    print("Full match: {}".format(m.group(0)))  # hello
    print("First part: {}".format(m.group(1)))  # he
    print("Last part: {}".format(m.group(2)))   # llo

Notice how search is now invoked as a method on r (that is, r.search), rather than as the re.search function whose first argument is a regexp string.
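The loop case mentioned above might look something like the following sketch. The list of lines and the names word and found are assumptions made for the example, not part of the book's exercises.

```python
import re

# Made-up sample data standing in for lines read from a file.
lines = ['hello, world', 'goodbye', 'hallo there']

# Compile once, before the loop, and reuse the resulting object.
word = re.compile(r'\b(h.)(..o)\b')

found = []
for line in lines:
    m = word.search(line)      # method on the compiled regexp object
    if m:                      # skip lines where search returned None
        found.append((m.group(0), m.group(1), m.group(2)))

print(found)
```

Python's re module also keeps an internal cache of recently used patterns, so the saving from compiling by hand is often modest; the compiled object is still convenient because it gives the pattern a name.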

1.1.3

Finding more than one

To search for multiple occurrences within a string, use re.findall. This function also takes a regexp string and a text string, but is guaranteed to return a Python list, with all of the matches for your regexp. If there were no matches, then it returns an empty list. Note that if your regexp includes groups (i.e., parentheses), then re.findall returns a list of matches for your group (if there was one group) or a list of tuples (if there were multiple groups). For example:

# Find all matches of "hello" in text
text = 'hello, world and hello, trees!'
re.findall('hello', text)  # ['hello', 'hello']

# Find "h", three characters, and then o -- and match the three
# inner characters. Result is a list of those three characters
re.findall('h(...)o', text)  # ['ell', 'ell']

# Find all words starting with h and ending with o.
# Put the first two characters in a group, and the final three
# characters in a separate group. Return a list of two-element
# tuples, one with "h." and the other with "..o"
re.findall(r'\b(h.)(..o)\b', text)  # [('he', 'llo'), ('he', 'llo')]

If you expect to find a large number of matches, then you might want to use re.finditer rather than re.findall. re.finditer returns an iterator that yields one match object at a time, so it won’t consume large amounts of memory. re.findall, by contrast, will return a list of all matches, which might be quite long.
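As a rough illustration of the iterator style, the sketch below walks over the matches one at a time and records where each was found. The variable names are invented; the text is the same sample string used for re.findall.

```python
import re

text = 'hello, world and hello, trees!'

spans = []
# Each item produced by re.finditer is a match object, so we can ask
# it for the matched text as well as its position in the string.
for m in re.finditer('hello', text):
    spans.append((m.group(0), m.start(), m.end()))

print(spans)
```

Because each item is a full match object rather than a plain string, this form is also handy when you need positions or several groups from every match.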

1.1.4

Substituting text

Substituting text is done with re.sub, which takes a regexp string, a replacement string, and the text in which to search. It returns the transformed string, leaving the original string untouched. (Which is to be expected in Python, where strings are immutable.) For example, the following replaces all vowels in a string with underscores:

re.sub('[aeiou]', '_', 'The quick brown fox jumped over the lazy dog')
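A short sketch of the same call with its result captured may help. The variable names are made up, and the second substitution, which refers to groups as \1 and \2 in the replacement string, is an extra illustration rather than something taken from the text above.

```python
import re

original = 'The quick brown fox jumped over the lazy dog'

# re.sub returns a new string; original itself is not modified.
replaced = re.sub('[aeiou]', '_', original)
print(replaced)
print(original)

# The replacement string may refer to groups from the regexp.
# Here the two words captured by the groups are swapped.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')
print(swapped)
```

Note that the replacement is written as a raw string too, for the same backslash reasons that apply to the regexp itself.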

1.1.5

Flags

Python provides a number of flags that can be used to modify the behavior of regular expressions. Each flag has a short name and a long name, and is passed as an additional, final argument to the re. family of functions. If you wish to pass more than one flag, then you should use bitwise or (the | character) to set them.
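As one possible illustration, the sketch below combines the case-insensitive and multiline flags, using both the long names (re.IGNORECASE, re.MULTILINE) and the short ones (re.I, re.M). The sample text and variable names are invented for the example.

```python
import re

text = 'Hello\nhello\nHELLO'  # made-up three-line sample

# No flags: ^ and $ anchor to the whole string, and case matters.
default = re.findall('^hello$', text)

# Case-insensitive only: still anchored to the whole string.
ignore_case = re.findall('^hello$', text, re.IGNORECASE)

# Both flags, joined with bitwise or: every line now matches.
both = re.findall('^hello$', text, re.IGNORECASE | re.MULTILINE)

# The short names are interchangeable with the long ones.
short = re.findall('^hello$', text, re.I | re.M)

# For re.sub, the fourth positional argument is count, so it is
# safest to pass flags by keyword.
farewell = re.sub('hello', 'bye', text, flags=re.IGNORECASE)

print(default, ignore_case, both, short)
print(farewell)
```

Passing flags by keyword works for every function in the re module, and it avoids accidentally setting count in re.sub.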

1.1.6

Advanced features

Python’s regular expressions are especially rich, taking many elements from the Perl world. As in Perl, and many other programming languages (but unlike grep and Emacs), you use backslashes in Python to neutralize a metacharacter. Thus, + is a metacharacter, indicating that the previous character must appear one or more times – but \+ matches the plain ol’ + character.

Another example of where Python took its cue from Perl is in the addition of a non-greedy operator: You can make a number of normally greedy metacharacters, such as + and ?, non-greedy by adding a ? to them – in other words, you write +? and ??, and these indicate that we’re looking for the minimum possible text match, rather than the maximum possible text match.

Python also supports non-capturing parentheses, written (?:...). This is especially useful, I have found, when using re.findall, and you want to use parentheses to have a quantifier such as ? affect more than one character, but not be used as a group.

Python supports several other advanced regexp options, such as positive and negative lookahead and lookbehind (all four combinations), and even named groups. Named groups were actually pioneered by Python, which means that there are several styles of defining them. I find named groups to be particularly exciting, in that you can do something like this:

s = 'The price is $123.45.'
m = re.search(r'\$(?P<dollars>\d+)\.(?P<cents>\d+)', s)
if m:
    print(m.group('dollars'))  # 123
    print(m.group('cents'))    # 45
    print(m.groupdict())       # {'dollars': '123', 'cents': '45'}

The syntax for defining named groups is admittedly a bit weird, but that’s what happens when you try to fit new functionality onto a decades-old, very terse syntax.
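The non-greedy operators and non-capturing parentheses described earlier in this section are easiest to appreciate side by side with their ordinary counterparts. The following is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

html = '<b>bold</b> and <i>italic</i>'  # made-up sample

# Greedy: .+ grabs as much as it can, from the first < to the last >.
greedy = re.findall('<.+>', html)

# Non-greedy: .+? stops at the first > it can, giving one tag per match.
lazy = re.findall('<.+?>', html)

text = 'ha haha hahaha'  # made-up sample

# With capturing parentheses, re.findall returns the group, which
# holds only the last repetition of "ha" for each match.
capturing = re.findall(r'\b(ha)+\b', text)

# With (?:...), the parentheses only group, so re.findall returns
# the whole of each match.
non_capturing = re.findall(r'\b(?:ha)+\b', text)

print(greedy)
print(lazy)
print(capturing)
print(non_capturing)
```

The last two calls differ only in the ?: inside the parentheses, yet they return different lists, which is the re.findall situation the text has in mind.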

1.1.7

More information

More information about Python’s re module is available via the Python Web site (for Python 2 or Python 3). A nice summary is also available at the handy regexp site, http://www.regular-expressions.info/python.html. In addition, a Python-flavored Web site that allows you to test regexps is http://pythex.org/. I really love to use this site, especially when teaching courses, and encourage you to use it in your work, as well.

1.1.8

About Python solutions

Exercise solutions presented in this book will work in both Python 2.7 and 3.5, the latest versions of the language as of this writing. I doubt that any aspects of Python will change in the future so as to make these solutions less accurate. You can download and install Python from http://python.org/.

1.2

Ruby

The Ruby language has often been described as a combination of Perl and Smalltalk. And indeed, this is not a bad description, in that it includes a large helping of Perl-style operators and syntax, along with Smalltalk’s object model. This means that there are several ways to create and work with regexps from within Ruby, typically reflecting the two different language traditions.

1.2.1

Defining regexps

Following Perl’s lead, Ruby lets us create an instance of Regexp (a class that comes with Ruby, and doesn’t need to be loaded from an external library) either with slashes (/regexp/) or with Regexp.new. The two are equivalent; the resulting object is normally displayed using slashes. For example:

r = Regexp.new('.ain')  # returns Regexp object /.ain/
r = /.ain/              # also returns Regexp object /.ain/

1.2.2

Finding one

We can then search in a string for this regexp with the =~ (regexp match) operator. The operator can be used with either the string or the regexp coming first:

s = 'It will rain today'
r = Regexp.new('.ain')  # returns Regexp object /.ain/
r =~ s                  # returns the Fixnum (integer) 8
s =~ r                  # also returns the Fixnum (integer) 8

Why 8? Because s[8] (i.e., the 9th character in the string s) is where the first match was found. What if you want the entire string that was matched? You can use the special variable $&, which contains whatever Ruby found:

s = 'It will rain today'
r = Regexp.new('.ain')  # returns Regexp object /.ain/
r =~ s                  # returns the Fixnum (integer) 8
puts $&                 # prints "rain"

If you prefer to use a more verbose (and less Perl-like) syntax, you can do so by applying the match method. This returns a MatchData object, which contains all of the information we need about the match. Printing a MatchData object, or turning it into a string, returns the string that was found. (If no match was found, then we get nil back, rather than an instance of MatchData.) Once again, we can invoke Regexp#match on our regexp or String#match on our string:

s = 'It will rain today'
r = Regexp.new('.ain')  # returns Regexp object /.ain/
puts r.match(s)         # prints "rain"
puts s.match(r)         # also prints "rain"

1.2.3

Finding more than one

If we want to find all of the matches, then we must invoke the String#scan method on the string, passing it the regexp. (There is no Regexp#scan to invoke on a string.) For example:

s = "the rain in Spain falls mainly on the plain"
r = Regexp.new('.ain')  # returns Regexp object /.ain/
s.scan(r)               # returns an array of four 4-character elements

1.2.4

Substituting text

Ruby’s String#sub method replaces the contents of a string. The first argument to String#sub can be a string or a regexp; the behavior of the method depends on the object passed to it. We pass to String#sub two arguments, the regexp we want to apply, and the string that should be used in its place. For example:

s = "the rain in Spain falls mainly on the plain"
r = Regexp.new('.ain')  # returns Regexp object /.ain/
s.sub(r, 'XXXX')        # returns "the XXXX in Spain falls mainly on the plain"

If you want to replace all occurrences, then use String#gsub rather than String#sub:

s = "the rain in Spain falls mainly on the plain"
r = Regexp.new('.ain')  # returns Regexp object /.ain/
s.gsub(r, 'XXXX')       # returns "the XXXX in SXXXX falls XXXXly on the pXXXX"

Both String#sub and String#gsub have alternate versions that modify the original string. As with many methods in Ruby, these add a ! character to the originals’ names:


s = 'The quick brown fox jumped over the lazy dog'

r = /[aeiou]/

s.gsub(r, '_')

puts s # No change

s.gsub!(r, '_')

puts s # Changed to "Th_ q__ck br_wn f_x j_mp_d _v_r th_ l_zy d_g"

1.2.5

Flags

You can modify the behavior of a regexp in Ruby in one of two ways. If you use the // syntax to create your regexp, then you put the modifiers following the final slash. Thus, /abc/i is case insensitive and /abc/im is both case insensitive and multiline. If you create regexps using Regexp.new, then you can pass an optional second argument. If this value is non-nil and non-false, then it’s assumed you want to make the regexp case-insensitive. However, you can also pass one, two, or three of the modifier constants (Regexp::IGNORECASE, Regexp::EXTENDED, and Regexp::MULTILINE), joined with bitwise “or”.
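As a rough sketch of the second approach, the following builds a regexp with Regexp.new and two option constants joined with bitwise “or”, then compares it with the equivalent // literal. The helper name case_insensitive_multiline is an assumption of mine, used only for illustration:

```ruby
# Hypothetical helper (the name is mine): builds a regexp from a string,
# turning on the same options as the /.../im literal syntax.
def case_insensitive_multiline(source)
  Regexp.new(source, Regexp::IGNORECASE | Regexp::MULTILINE)
end

r = case_insensitive_multiline('a.c')
puts r.options == /a.c/im.options        # true
puts r.match("xA\nCx")[0].inspect        # "A\nC"

# A plain true as the second argument only turns on case insensitivity.
puts Regexp.new('abc', true).casefold?   # true
```

Note that in Ruby, the multiline option means that . also matches a newline, which is why the middle line above finds a match across the line break.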

1.2.6

Advanced features

As in Python, capturing is done with parentheses. In such cases, it’s probably a good idea to use String#match, which returns a MatchData object. Similar to Python’s match object, we can retrieve the entire matched string with m[0], and then the individual groups with m[1], m[2], and so forth:

s = 'hello, world'

r = /\b(h.)(..o)\b/

m = s.match(r)

puts m[0] # hello

puts m[1] # he

puts m[2] # llo

Ruby also supports named groups, using the .NET-style syntax. This is slightly different from the Python syntax introduced above:

s = 'The price is $123.45.'

r = '\$(?<dollars>\d+)\.(?<cents>\d+)'

m = s.match(r)

if m

puts m['dollars'] # 123

puts m['cents'] # 45

end

There isn’t a built-in Ruby equivalent to Python’s groupdict, but the MatchData object does have a names method that can be used to retrieve all of them:

s = 'The price is $123.45.'

r = '\$(?<dollars>\d+)\.(?<cents>\d+)'

m = s.match(r)

if m

m.names.each do |name|

puts "#{name}: #{m[name]}"

end

end
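If you do want something closer to Python’s groupdict, one possible approach is to build a hash from the names yourself. This is only a sketch; group_hash is a hypothetical helper of my own, not part of Ruby. (Ruby 2.4 and later also provide MatchData#named_captures, which returns such a hash directly.)

```ruby
# Hypothetical helper (not from the book): builds a Hash of group names to
# captured text, roughly what Python's groupdict returns.
def group_hash(match_data)
  return {} if match_data.nil?   # no match, so no groups
  Hash[match_data.names.map { |name| [name, match_data[name]] }]
end

s = 'The price is $123.45.'
r = /\$(?<dollars>\d+)\.(?<cents>\d+)/

group_hash(s.match(r)).each do |name, value|
  puts "#{name}: #{value}"   # prints "dollars: 123", then "cents: 45"
end
```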

Finally, Ruby supports POSIX-style character classes. In addition to the traditional \w, \s, and \d character classes (and their inverses), you can use things like [[:xdigit:]] to indicate that you’re looking for a hex digit. You can also use Unicode properties as character classes, as in \p{ASCII} and \p{Hebrew}.
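To illustrate, here is a small sketch using one POSIX-style class and one Unicode property. The two helper methods are hypothetical names of my own, and the Hebrew word is written with Unicode escapes so that the example does not depend on how the file is saved:

```ruby
# Hypothetical helper: true if the whole string consists of hex digits.
def hex_string?(text)
  !!(text =~ /\A[[:xdigit:]]+\z/)
end

# Hypothetical helper: true if the string contains at least one Hebrew letter.
def hebrew?(text)
  !!(text =~ /\p{Hebrew}/)
end

puts hex_string?('cafe42')                 # true
puts hex_string?('coffee')                 # false ("o" is not a hex digit)
puts hebrew?("\u05E9\u05DC\u05D5\u05DD")   # true (the word "shalom")
puts hebrew?('hello')                      # false
```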

1.2.7

More information

More information about Ruby’s Regexp class is available via the Ruby Web site. A nice summary is also available at the useful regexp Web site, http://www.regular-expressions.info/ruby.html.

In addition, a Ruby-flavored Web site that allows you to test regexps is http://rubular.com/.

1.2.8

About Ruby solutions

Exercise solutions presented in this book will work in Ruby 2.3, the latest version of the language as of this writing. I doubt that any aspects of Ruby will change in the future so as to make these solutions less accurate. You can download and install Ruby from http://ruby-lang.org/.

1.3

JavaScript

JavaScript, also known by the more formal name of ECMAScript, is now considered to be the most popular programming language in the world – in no small part because it sits inside of every Web browser, and is quickly gaining favor on the server, as well.

1.3.1

Defining regexps

JavaScript is similar to Ruby in some ways, in that you can define regexps using either a Perl-like literal syntax or the object syntax, using the RegExp object. For example:

var re = /a.c/;          // Perl-like syntax
var re = RegExp('a.c');  // object syntax

JavaScript supports three different flags: i (case-insensitive), m (multiline mode, changing the definitions of ^ and $) and g, which tells the regexp that it should search globally. There is no s modifier that changes the definition of . to include newline characters.

You can pass these flags to regexps when you create them. Note that the modifiers are passed unquoted in the // syntax, but quoted with the object syntax:

var re = /a.c/i;               // case insensitive
var re = /a.c/im;              // case insensitive + multiline
var re = RegExp('a.c', 'i');   // case insensitive
var re = RegExp('a.c', 'im');  // case insensitive + multiline

It should be noted that these two syntaxes create identical objects. Indeed, if you enter an expression in the JavaScript shell, you’ll get back the printed representation of your object, in the // format. This means that even if you define re using the final line of the above example, the printed representation will be /a.c/im.

Note that one advantage of defining your regexps with slashes, rather than the RegExp constructor, is that the latter requires you to use a string. In such cases, you’ll often find yourself needing to double backslashes, in order to get around the interpretation of \ by the JavaScript interpreter for strings. Thus, be careful when using character classes such as \w, which work fine in the // syntax, but need a bit of love and attention (and extra escaping) when written inside of a string.

1.3.2

Finding one or more

To find out whether a string matches a regular expression, invoke the “match” method on a string. The return value is an array of matches it found, or null if it didn’t find anything:


var s = 'The quick brown fox jumped over the lazy dog';

var re = /n...n/;

s.match(re) // result is null

var re = /b...n/;

s.match(re) // result is ["brown"]

var re = /[bq]...[kn]/;

s.match(re) // result is ["quick"]

var re = /[bq]...[kn]/g;

s.match(re) // result is ["quick", "brown"]

Note that in the above example, you must use the g modifier to invoke a global search. Alternatively, you can invoke the exec method on a RegExp object. Note, however, that exec will only return a single value each time; you must invoke exec multiple times, stopping when you get a null value, if there were multiple results:

var s = 'The quick brown fox jumped over the lazy dog';

var re = /n...n/;
re.exec(s);  // result is null

var re = /b...n/;
re.exec(s)   // result is ["brown"]

var re = /[bq]...[kn]/;
re.exec(s)   // result is ["quick"]

var re = /[bq]...[kn]/g;
re.exec(s)   // result is ["quick"]
re.exec(s)   // result is ["brown"]
re.exec(s)   // result is null

If you’re merely interested in knowing whether a regexp matches a particular string, you can also use the RegExp.prototype.test method, which returns a true or false value:

var s = 'The quick brown fox jumped over the lazy dog';

var re = /fox/;
re.test(s);   // returns true

var re = /^fox$/;
re.test(s);   // returns false

Groups

1.3.3

Substituting text

Substitution of text is performed using the String.prototype.replace method. If the regexp was defined with the g flag, then all of the regexp matches will be replaced; otherwise, only the first match is replaced. For example:

var s = "the rain in Spain falls mainly on the plain";

var r = /[aeiou]/;

s.replace(r, '_') // Returns "th_ rain in Spain falls mainly on the plain"

var r = /[aeiou]/g; // Make it global

s.replace(r, '_') // Returns "th_ r__n _n Sp__n f_lls m__nly _n th_ pl__n"

1.3.4

Advanced features

JavaScript’s regexps have traditionally not included some of the more advanced features found in other languages. It doesn’t have named capture groups, or lookbehind, although it does have lookahead. It doesn’t support the \A and \Z anchors, although it does support multiline mode via the m flag.

1.3.5

More information

Information about JavaScript’s regexp syntax and usage can be found in a number of places. The official source is the ECMA 262 specification, which you can download and read.

More realistically, you can read about JavaScript’s regexp capabilities and syntax from http://www.regular-expressions.info/javascript.html. Another good source of information, particularly if you’re interested in the latest “ES6” version of JavaScript, is Axel Rauschmayer’s book, “Speaking JavaScript.” You can read the regexp chapter online. Finally, an open-source library for JavaScript called “XRegExp” provides a number of enhancements to the built-in regexp syntax. I won’t use these in the book, but you can learn more and download it from xregexp.com.

1.3.6

About JavaScript solutions

While JavaScript is best known for its work in Web browsers, it can also be used on servers, and is even available as a standard programming language. There are several options for doing this; for the purposes of this book, I am using the REPL (“read-eval-print loop”) for JavaScript included with the popular Node.js program and library. On my computer, I’m able to type node at the command line, and then to interact with JavaScript.

One big advantage of using Node.js is that it includes a number of the latest additions to JavaScript. This means that, among other things, I can require the fs object, giving me access to the filesystem, or the readline object, allowing me to query the user. Reading from a file in the JavaScript REPL is a bit weird-looking at first, but it works pretty well:

"use strict";

var fs = require('fs');

// Read the whole file; the callback gets an error (if any) and the contents
fs.readFile('words.txt', 'utf8', function (err, data) {
    if (err) {
        console.log("Error!\n");
        return console.log(err);
    }

    // Split the contents into lines, and print each one
    for (let line of data.split("\n")) {
        console.log(line);
    }

    process.exit();
});

In the above code, I invoke fs.readFile, which takes three arguments – the name of the file to open, the encoding of the file (which will normally be utf8 in this book), and a function which takes two arguments. The first argument represents an error, if it occurs. The second argument is a string with the contents of the file. However, if we want to iterate over the lines of the file, we’ll need to invoke split on the string, giving us an array object back. I use ES6’s for..of loop construct, along with the new let variable scope declaration, to iterate over the elements of that array, printing each line of the file. Also note that I’m using console.log to display things on the screen. JavaScript programs in this book should all be in “strict” mode, giving us a greater chance of program errors being caught earlier.

1.4

PostgreSQL

PostgreSQL isn’t a language per se, but rather a relational database system. That said, PostgreSQL includes a powerful regexp engine. It can be used to test which rows match certain criteria, but it can also be used to retrieve selected text from columns inside of a table. Regexps in PostgreSQL are a hidden gem, one which many people don’t even know exists, but which can be extremely useful.

The PostgreSQL regexp engine is descended from the one used in the Tcl language, which differs from the regexp engines used in many other languages. Many flags are passed using single characters inside of parentheses inside of the regexp, for example. Other aspects of the syntax are just slightly off from other languages; for example, {min,max} cannot have an empty min or max, if it defines a range. Thus, {1,20} is OK, but {,20} is not. Even if you’re used to working with regexps in other languages, it’s worth reading the documentation for PostgreSQL’s implementation to fully understand how it works.

1.4.1

Defining regexps

Regexps in PostgreSQL are defined using strings. Thus, you will create a string (using single quotes only; you should never use double quotes in PostgreSQL), and then match that to another string. If there is a match, PostgreSQL returns “true.” PostgreSQL’s regexp syntax is similar to that of Python and Ruby, in that you use backslashes to neutralize metacharacters. Thus, + is a metacharacter in PostgreSQL, whereas \+ is a plain “plus” character. However, there are differences between the regexp syntaxes – for example, PostgreSQL’s word-boundary metacharacter is \y whereas in Python and Ruby, it is \b. (This was likely done to avoid conflicts with the ASCII backspace character.)

Where things are truly different in PostgreSQL’s implementation is the set of operators and functions used to work with regexps. PostgreSQL’s operators are generally aimed at finding whether a particular regexp matches text, in order to include or exclude result rows from an SQL query. By contrast, the regexp functions are meant to retrieve some or all of a string from a column’s text value.

1.4.2

True/false operators

PostgreSQL comes with four regexp operators. In each case, the text string to be matched should be on the left, and the regexp should be on the right. All of these operators return true or false:

~    case-sensitive match
~*   case-insensitive match
!~   case-sensitive non-match
!~*  case-insensitive non-match

Thus, you can say:

select 'abc' ~ 'a.c';   -- returns "true"
select 'abc' ~ 'A.C';   -- returns "false"
select 'abc' ~* 'A.C';  -- returns "true"

In addition to the standard character classes, we can also use POSIX-style character classes:

select 'abc' ~* '^[[:xdigit:]]$';    -- returns "false"
select 'abc' ~* '^[[:xdigit:]]+$';   -- returns "true"
select 'abcq' ~* '^[[:xdigit:]]+$';  -- returns "false"

These operators, as mentioned above, are often used to include or exclude rows in a query’s WHERE clause:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('abc'), ('AbC'), ('Abq'), ('ABCq');
SELECT id, thing FROM Stuff WHERE thing ~* '^[abc]{3}$';

This final query should return three rows, those in which thing is equal to ABC, abc, and AbC.

1.4.3

Extracting text

If you’re interested in the text that was actually matched, then you’ll need to use one of the built-in regexp functions that PostgreSQL provides. For example, the regexp_matches function allows us not only to determine whether a regexp matches some text, but also to get the text that was matched. For each matching row, regexp_matches returns an array of text (even if that array contains a single element). For example:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('abc'), ('AbC'), ('Abq'), ('ABCq');
SELECT regexp_matches(thing, '^[abc]{3}$') FROM Stuff;

The above will return a single row: {abc}

As you can see, the above returned only a single column (from the function) and a single row (i.e., the one matching it). That’s because the match is case-sensitive by default. When you invoke regexp_matches, you can provide additional flags that modify the way in which it operates. These flags are similar to those used in Python, Ruby, and JavaScript. For example, we can use the i flag to make regexp_matches case-insensitive:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('abc'), ('AbC'), ('Abq'), ('ABCq');
SELECT regexp_matches(thing, '^[abc]{3}$', 'i') FROM Stuff;

Now we’ll get three rows back, since we have now made the match case-insensitive. regexp_matches can take several other flags as well, including g (for a global search). For example:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC');
SELECT regexp_matches(thing, '.', 'g') FROM Stuff;

Here is the output from regexp_matches: {A}

{B}

{C}

Notice how regexp_matches, because of the g option, returned three rows, with each row containing a single (one-character) array. This indicates that there were three matches. Why is each returned row an array, rather than a string? Because if we use groups to capture parts of the text, the array will contain the groups:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('AqC');
SELECT regexp_matches(thing, '^(A)(..)$', 'ig') FROM Stuff;

Notice that in the above example, I combined the i and g flags, passing them in a single string. The result is a set of arrays:

| regexp_matches |
|----------------|
| {A,BC} |
| {A,qC} |

If we’re interested in retrieving a single element from that array, we’ll need to use [] to grab a particular element. Remember that in PostgreSQL, arrays are indexed starting with 1, not 0. Thus, in the above example, we can say:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC');
SELECT (regexp_matches(thing, '.', 'g'))[1] FROM Stuff;

The result is: A

B

C

That is, we get a column of text, rather than of one-element text arrays.

1.4.4

Splitting

A common function in many high-level languages is split, which takes a string and returns an array of items. PostgreSQL offers this with its split_part function, but that only works with fixed-string delimiters.

However, PostgreSQL also offers two other functions. regexp_split_to_array splits text into a PostgreSQL text array, while regexp_split_to_table turns it into a table. These functions allow us to split a text string using a regexp, rather than a fixed string. For example, if we say:

select regexp_split_to_array('abc def   ghi    jkl', '\s+');

The above will take any length of whitespace, and will use that to split the text. But you can use any regexp you want to split things, getting an array back. A similar function is regexp_split_to_table, which returns not a single row containing an array, but rather one row for each element. Repeating the above example:

select regexp_split_to_table('abc def   ghi    jkl', '\s+');

The above would return a table of four rows, with each split text string in its own row.

Substituting text

The regexp_replace function allows us to create a new text string based on an old one. For example:

SELECT regexp_replace('The quick brown fox jumped over the lazy dog',
                      '[aeiou]', '_');

The above returns: Th_ quick brown fox jumped over the lazy dog

Why was only the first vowel replaced? Because when we invoked regexp_replace, we did so without the g option. Adding g makes the replacement global:

SELECT regexp_replace('The quick brown fox jumped over the lazy dog',
                      '[aeiou]', '_', 'g');

Now all occurrences are replaced:

Th_ q__ck br_wn f_x j_mp_d _v_r th_ l_zy d_g

1.4.5

More information

PostgreSQL’s regexp engine is surprisingly full featured, and I’ve only scratched the surface here. The best and most complete place from which you can learn more is the PostgreSQL documentation. Additional information is available at http://www.regular-expressions.info/postgresql.html. In addition, the “Postgres Online” site contains a good article outlining regexp use in PostgreSQL.

1.5

grep

The grep program has been associated with the Unix command line for many years. Lore has it that the standalone grep program came into being after people kept using a combination of “global” and “print” in the ed editor, with an arbitrary regular expression between the “g” and the “p.” Modern versions of Unix are almost unthinkable without grep.

At the same time, we have to realize that there are numerous versions of grep out there. For example, Linux uses the GNU version of grep, maintained by the Free Software Foundation as part of their GNU project. By contrast, FreeBSD and Apple’s OS X include a version of grep that has fewer features, but is directly descended from the traditional Unix grep. There are also variations on these, such as fgrep, egrep, and so forth.

Unrelated to these, but worth noting because it’s so incredibly useful, is ngrep, a “network grep” program that lets you use regexps to examine the current network traffic to and from your computer. I have used ngrep on numerous occasions when debugging network applications. You can learn more about ngrep from its home page.

1.5.1

Basic use

All versions of grep operate on the assumption that you want to search through a file, line by line, and find those lines that match a regular expression. Thus, certain options associated with regexps in programming languages are no longer relevant, such as multiline mode. Normally, grep is used to find all of the matches in a file:

grep 'a.c' myfile.txt

The output will contain all of the lines of the file that match the regexp. It doesn’t matter whether the regexp matches once or multiple times; the fact that there was even one match triggers the printing of the line. You can reverse this with the -v flag. Thus, assuming that I have a file containing Unix-style comments (i.e., # in the first column), I can use grep to find all of the comment lines, or all of the non-comment lines:

grep '^#' myfile.txt      # Finds all comment lines
grep -v '^#' myfile.txt   # Finds all non-comment lines

Another useful option to grep is -i, which makes the search case-insensitive.

1.5.2

Backslashes

One of the biggest issues for me when using grep is that it handles backslashes differently from all of the other programming languages mentioned above. In this sense, it’s more traditional, using the metacharacters as they were originally defined and used in Unix. However, I can see why Larry Wall flipped the meaning in Perl, in order to avoid what he called “backslashitis.” The basic idea is that many metacharacters, such as +, ?, ( ), {min,max}, and |, are treated as standard characters without a backslash, and as metacharacters when they are preceded by a backslash. (Others, such as ., *, and [ ], are metacharacters even without a backslash.) For example:

$ echo 'I want to eat breakfast' > file.txt
$ grep '[aeiou]+' file.txt # no match
$ grep '[aeiou]\+' file.txt # matches

1.5.3

Context

grep, and especially GNU grep, takes a very large number of arguments. You can read more about these in the grep man page, either for BSD Unix or for GNU grep. However, one of the most useful options is what I call “ABC”:

The -A option shows you a number of lines /after/ a match
The -B option shows you a number of lines /before/ a match
The -C option shows you a number of lines of context (i.e., /both/ before and after)

I use these all of the time when I’m looking through logfiles; having a few lines of context above and/or below what I’m searching for, such as an IP address, can be quite useful.

Chapter 2

Input data

Regular expressions are not something that you learn or use in a vacuum. Rather, they are a way of consuming, identifying, and extracting text from within larger files. In order to make the exercises a bit more interesting and realistic, I have enclosed a number of files with this book.

2.1

Dictionary (words.txt)

The English-language dictionary that I have included in this ebook comes with Linux, and is thus available under an open-source license. The dictionary consists of one word per line in the file, which amounts to more than 235,000 words. I have learned over the years of teaching regexp classes that the dictionary contains a surprisingly large and varied number of words, such that even when you ask for all of the words that have 11 letters in them and start with t, you’ll still get a fairly long list! We will use the dictionary file in exercises where I want you to find “all of the words that…” for some condition that I’ll give in the exercise.

2.2

Alice in Wonderland (alice.txt)

Project Gutenberg is an attempt to make as many books as possible available, for free, over the Internet. It has been around for many years, and publishes as many books as it can – often, waiting until books are no longer copyrighted, and then publishing them. I have taken the text of “Alice in Wonderland” from Project Gutenberg. Several of the exercises will ask you to find certain types of text from Alice. Note that I have left the Project Gutenberg notices intact in the file; while they aren’t part of the story, they do provide us with more text to search through, which I see as a good thing in a book like this.

2.3

Config (config.txt)

I often use regular expressions to look through configuration files. Many of these config files are of the form “name = value”, with a # at the start of a line indicating that it’s a comment. I have included one simple config-style file, so that we can explore and extract data from it.

2.4

Apache logfile (access-log.txt)

Another type of file on which I often use regexps is a logfile. I have taken an excerpt from the Apache logfile on my server, from many years ago, and have extracted several hundred lines from it, in what I call the “mini access log.” We will explore this file, and try to find some interesting data points from it.

2.5

Linux “passwd” file (passwd.txt)

As another example of a configuration-type file, I have included a slightly modified version of a Linux “password” file. This file, called /etc/passwd, is traditionally included on Unix and Linux systems, and originally listed not only the usernames, but the passwords, as well. In recent years, despite the name, the file no longer contains the passwords. I have modified this file slightly, such that it includes several blank lines and comment lines starting with #.

2.6

Fakelog (fakelog.txt)

Some of the time, you need to work with logfiles whose values extend over more than a single line. In such cases, you need to write multiline regexps. For those exercises, I have prepared a simple file, fakelog.txt, which simulates such a situation.

2.7

PostgreSQL database

PostgreSQL is a relational database, rather than a programming language. As a result, it cannot easily work with files on disk. In order to make the examples more appropriate for PostgreSQL users, I created a database and dumped it to a file that you can load into PostgreSQL. The assumption is that all of your solutions should work against the appropriately named table in the database, rather than a file on disk. The dumpfile was made with PostgreSQL 9.5, but should be compatible with earlier versions, as well.

To import the file into PostgreSQL, you’ll first need to create a database on the Unix command line:

createdb practice_makes_regexp

The above assumes, of course, that the user via which you are logged in has permissions to create PostgreSQL databases. If not, then check your system configuration to give yourself that ability. Once the database has been created, you can import the dumpfile into PostgreSQL, from the Unix shell prompt:

psql practice_makes_regexp < practice_makes_regexp.sql

You can then check to see if it all worked by entering into the practice_makes_regexp database:

psql practice_makes_regexp

Then, ask to see the current list of tables:

\dt

You should see 16 defined tables there, two for each of the files mentioned above. Each file has been added twice – the first time, with each line of the file as a separate row in the database table, and the second time, with the entire file inserted into a single row. This was done to ensure that even those exercises in which you’re asked to find text that spans lines of the file can be solved using PostgreSQL.

Chapter 3

Exercises

This chapter contains all of the exercises presented later in the book, without the solutions. In this way, you can do the exercises without worrying about peeking at the answers. And no, you shouldn’t peek! Rather, you should work on the exercise, struggling a bit until you either find the answer or give up. But don’t give up too soon; I suggest that you engage in what I call “controlled frustration,” allowing yourself to get annoyed and frustrated, without having an actual work deadline or boss standing over you, waiting for you to finish.

3.1

Simple regexps

3.1.1

Find matches

Solution is in section 4.1

This exercise is deliberately very simple, to try to get you into the spirit of working with regular expressions. The idea is to ask the user to enter a regular expression, and then to print all of the lines in a file which match that regexp. In other words, you’re going to be creating a simple grep command.

Each programming language has a different way of asking the user for input – and in the case of PostgreSQL, there really isn’t any way, so I fudged it a bit in my solution. Nevertheless, taking a string and turning it into a regexp, then finding that regexp in a file, is a good way to start. In this exercise, you are to:

1. Ask the user to enter a regexp (via a string)
2. Print all lines in the dictionary that match that regexp.

Note that the regexp doesn’t have to match the /entire/ word. Thus if our regexp is abc, then any word containing the three characters abc in a row should be printed, regardless of whether it is a 3-letter word or a 10-letter word.

3.1.2

Five-letter words

Solution is in section 4.2

In this exercise, you are to display words in the dictionary that are either four letters long, or that are five letters long if they end with an s. The word – not just a subset of the word – should be precisely four or five letters long. For the purposes of this exercise, any character (not just a letter) can be counted in the first four letters of the word. However, if there is a fifth letter, it must be an s.

3.1.3

Double “f” in the middle

Solution is in section 4.3

In this exercise, you need to find all of the words in the dictionary that contain a “ff” in them, so long as those f’s are not the first or final characters in the word. Thus, “affable” would be fine, but “quaff” would not.

3.1.4

Extract timestamp

Solution is in section 4.4

It’s common to use regular expressions to extract information from logfiles. In the access-log.txt file that comes with this book, each HTTP request is accompanied by a timestamp, consisting of a date and time. In this exercise, you must match and retrieve the entire timestamp from each line, starting with [ and ending with ]. For the purposes of this exercise, you cannot assume that this will be the only pair of [ and ] in the logfile, so you cannot use a regexp such as:

\[[^]]+\]

which would mean, “start with [, end with ], and take everything in the middle.” You’ll need to specify the regexp more explicitly and carefully than that. For example, the first line of access-log.txt contains the following timestamp:

[30/Jan/2010:00:03:18 +0200]

You are to retrieve just that part of each line.

3.2

Character classes

3.2.1

End-of-sentence words

Solution is in section 5.1

In Alice in Wonderland, find all of the words that are at the end of a sentence. In other words, find and display all of the words that end with ., ?, or !. You should display the punctuation mark along with the word. For the purposes of this exercise, a word is any string of alphanumeric characters at least two characters long.

3.2.2

Hex numbers

Solution is in section 5.2

Given the following sentence:

I like the hex numbers 0xfa and 0X123 and 0xcab and 0xff

retrieve all of the hexadecimal numbers. That is, each one starts with 0x (or 0X), then has a string of digits or the letters a through f, capital or lowercase.

3.2.3

Hexwords

Solution is in section 5.3

Which words in the dictionary contain only the letters a through f?

3.2.4

IP addresses

Solution is in section 5.4

Each line of access-log.txt starts with an IP address. Each IP address has four numbers, each containing between one and three digits. The numbers are separated by periods (.). In this exercise, you are to retrieve the IP addresses from access-log.txt by building a character class, not by splitting the line across whitespace.

3.2.5

Long, weird words

Solution is in section 5.5

Find all of the words in the dictionary that have the following characteristics:

10 letters long
Start with a letter from the first half of the alphabet (a-m)
End with a letter from the second half of the alphabet (n-z)
Somewhere in the middle, there should be a “p”

3.2.6

Matching URLs

Solution is in section 5.6

Let’s assume that we have defined a string:

I love to visit https://example.com/foo.html every day!
More than http://abc-def.co.il/.

Write a regexp that will match both URLs, but not the characters before or after them. Include the /foo.html in the first URL, but not the trailing period (.) in the second.

3.2.7

Non-zero hours

Solution is in section 5.7

Once again, it’s time to search for certain patterns in access-log.txt: We want to find all of the records in which the hour doesn’t begin with a 0. (Remember that Apache logs, like many other logfiles, operate on a 24-hour clock. Thus, 11 p.m. is written as 23:00.) Thus, you should not show the records from 00:00 through 09:59, but should show those from 10:00 through 23:59. For the purposes of this exercise, you may assume that square brackets ([ and ]) only occur around the timestamp.

3.2.8

Quoted text

Solution is in section 5.8 In this exercise, we’re going to look for all of the quotations in Alice in Wonderland. I’m looking for any stretch of text that starts with the double-quote character (") and ends with that same character. I’m going to assume that quotes are never nested, and that there’s no use of a programmer’s backslash (\) to escape the double quotes. However, quotes might extend across more than one line.

3.2.9

Supervocalic

Solution is in section 5.9 A word is considered “supervocalic” if it contains all five of the English-language vowels (a, e, i, o, and u). Each letter should appear only once, and in that order. For this task, you want to find all of the supervocalic words in the dictionary.
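The definition can be sketched in Python with a negated character class between the vowels. The word list here is hypothetical (a stand-in for the dictionary file), and the sketch assumes lowercase, letters-only words:

```python
import re

# Any run of characters that are not vowels (possibly empty)
c = "[^aeiou]*"

# a, e, i, o, u in order, each exactly once, with only non-vowels around them
pattern = re.compile(f"^{c}a{c}e{c}i{c}o{c}u{c}$")

# Hypothetical sample words
words = ["facetious", "abstemious", "education", "sequoia"]
supervocalic = [word for word in words if pattern.search(word)]
print(supervocalic)
```

Note that education and sequoia contain all five vowels, but not in alphabetical order, so they are rejected.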

3.2.10

Double triple vowel

Solution is in section 5.10 In English, doubled vowels are a pretty common occurrence. Tripled vowels, though, are a pretty rare thing. Your task is to try to find something even rarer: Words in the dictionary with two separate sets of triple vowels. (And yes, the dictionary I’ve included with this book contains 69 such words.)

3.2.11

Postfix dollar

Solution is in section 5.11 In the United States, we put the dollar sign before the price of something, as in $123.45. In my travels, I’ve noticed that many people, in many countries, aren’t used to this, and put the $ sign after the numbers. Given the sentence: They wrote 1,000$ (not $1000), and 123.45$ (not $123.45).

For this exercise, write a regular expression that finds all of the cases of numbers (including commas and decimal points) followed by dollar signs. Thus, the results should find 1,000$ and 123.45$.
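A minimal sketch in Python, offered as one possible approach rather than the official solution: require a digit first, allow digits, commas, and periods after it, and then insist on a literal (escaped) dollar sign.

```python
import re

sentence = "They wrote 1,000$ (not $1000), and 123.45$ (not $123.45)."

# A digit, then any mix of digits, commas, and periods, then a literal $
postfix_prices = re.findall(r"\d[\d,.]*\$", sentence)
print(postfix_prices)
```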

3.3

Alternation

3.3.1

Multiple date formats

Solution is in section 6.1 Dates are a well-known problem in the world, in that the same representation can mean different things. If you see the date 1/2/2016, does that mean February 1st or January 2nd? It all depends on whether you’re in the United States or Europe. Asian countries write dates altogether differently, starting with the year, so 2016-2-1 would mean February 1st, 2016. For this exercise, write a regular expression that finds all dates in the following string: I write it as 2015-09-02, but he wrote 2/9/2015, and she wrote 9.2.2015.
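One way to sketch this in Python is with two alternatives: a year-first form separated by hyphens, and a year-last form separated by slashes or periods. This is an illustration under those assumptions only; it does not check that the numbers form a valid date.

```python
import re

text = "I write it as 2015-09-02, but he wrote 2/9/2015, and she wrote 9.2.2015."

# Either YYYY-M-D, or D/M/YYYY (also with periods as separators)
date_pattern = r"\d{4}-\d{1,2}-\d{1,2}|\d{1,2}[/.]\d{1,2}[/.]\d{4}"
dates = re.findall(date_pattern, text)
print(dates)
```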

3.3.2

“oo” and “ee” words

Solution is in section 6.2 Find all of the words containing the double-letter combination oo and/or ee in Alice in Wonderland, regardless of case.

3.3.3

British and American spelling

Solution is in section 6.3 The problem here is a relatively simple one. We have a sentence: The new box of cheques is blue in colour.

Or I might have this sentence: The new box of checks is blue in color.

Write a regexp that matches either of these.
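A hedged sketch in Python: alternation handles cheques versus checks, and an optional u handles colour versus color. Note that this pattern also accepts mixed spellings (for example, cheques with color), which may or may not be what you want.

```python
import re

# (?:que|ck) is a non-capturing alternation; u? makes the u optional
pattern = re.compile(r"The new box of che(?:que|ck)s is blue in colou?r\.")

british = "The new box of cheques is blue in colour."
american = "The new box of checks is blue in color."

results = [bool(pattern.fullmatch(s)) for s in (british, american)]
print(results)
```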

3.4

Anchors

3.4.1

Capital vowel starts

Solution is in section 7.1 In this assignment, find and print all of the words that begin with a capital vowel (A, E, I, O, or U) and are at the start of a line.

3.4.2

Comment lines

Solution is in section 7.2 Many Unix-style files, including programs written in such languages as Python and Ruby, indicate comments by having a # at the start of the line. In this exercise, you are to print all comment lines, meaning all lines that start with #, or in which the # is preceded only by whitespace. Comments that follow code on the same line can be ignored. Thus, given the following file: # Comment 1

# Comment 2

print("Hello") # Comment 3

Your solution should print comments 1 and 2, but not comment 3.
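The idea can be sketched in Python by anchoring to the start of the line and allowing optional whitespace before the #. The indentation of the second line is an assumption, since the sample file's exact whitespace is not shown here.

```python
import re

# Assumption: comment 2 is indented; the other lines are as shown above
lines = [
    "# Comment 1",
    "    # Comment 2",
    'print("Hello") # Comment 3',
]

# ^ anchors to the start of the line; \s* allows leading whitespace
comment_lines = [line for line in lines if re.search(r"^\s*#", line)]
print(comment_lines)
```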

3.4.3

Last five characters

Solution is in section 7.3 In Alice in Wonderland, print the last five characters of every line, in which the third-to-last character is a lowercase letter in the second half of the alphabet (i.e., starting with n).

3.4.4

u in the 2nd-to-last word

Solution is in section 7.4 Show the final two words of each line of Alice in Wonderland in which u is in the second-to-last word.

3.5

Groups

3.5.1

Date and time

Solution is in section 8.1 In access-log.txt, each line contains a timestamp, which looks like this: [30/Jan/2010:00:03:18 +0200]

Notice that the timestamp starts with [, ends with ], and contains both the date (in DD/MMM/YYYY format) and the time (in HH:MM:SS +TZ format). For this exercise, you are to grab the date and time in separate groups. Each language has a slightly different way of extracting the groups; the idea is that for each line, it should be possible to extract and display the date and time separately. The time should include the time zone; for now, we’ll leave it in the format used by the access log.
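In Python, one possible sketch uses two capturing groups split at the first colon inside the brackets. The surrounding log line is hypothetical; only the timestamp comes from the example above.

```python
import re

# Hypothetical log line built around the sample timestamp
line = '192.0.2.1 - - [30/Jan/2010:00:03:18 +0200] "GET / HTTP/1.1" 200 512'

# Group 1: everything up to the first colon (the date)
# Group 2: everything after it, up to the closing bracket (time and zone)
match = re.search(r"\[([^:]+):([^\]]+)\]", line)
date, time = match.group(1), match.group(2)
print(date)
print(time)
```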

3.5.2

Config pairs

Solution is in section 8.2 config.txt is a simple configuration file. Simple, in that the configuration is set with lines that look like name:value

But as often happens in such files, the people writing the file have gone a bit crazy, and have added lots of extra whitespace. Some lines contain only whitespace, or are generally illegal, without either a name or a value. We want to extract all of the name-value pairs from this file, grabbing the name and value in separate groups from legal lines. Moreover, we want to ignore any leading and trailing whitespace surrounding the name and value.

3.5.3

Quote first and last words

Solution is in section 8.3 In an earlier exercise (5.8), we found all of the quotations in Alice in Wonderland. For this exercise, find the first and last words of each quotation, not including the quotation marks and punctuation. Thus, if the quote is "Hello out

there!"

You should find Hello and there. Note that quotes might extend across lines.

3.5.4

Prices with symbols

Solution is in section 8.4 [Note: This chapter uses Unicode symbols that aren’t printing correctly. I’m working on fixing this. In theory, there should be a dollar sign, a euro symbol, and a UK pound sign.] Assume that we have a string: We sell 10 in the US for $100, in Europe for €100, and in the UK for £100.

We want to retrieve all of the prices from this string, but we don’t want to retrieve the currency symbol as well. In other words, we want to find all of the digits (no commas or decimal points) that follow a currency symbol.

3.5.5

Question first word

Solution is in section 8.5 Once again, let’s extract some text from Alice in Wonderland: Retrieve the first word of every question – meaning, every sentence that ends with a question mark.

3.5.6

t, but no “ing”

Solution is in section 8.6 In this exercise, you are to find all of the words in Alice in Wonderland that start with t and end with ing. However, you are to return the portion of the word that precedes the ing. Thus, if the word is trailing, you should only match and return trail.

3.5.7

Usernames and user IDs

Solution is in section 8.7 In linux-etc-passwd, field index 0 is the username, field index 2 is the user ID, and field index -1 contains the user’s shell. For each user in the file, I want a regexp that extracts the user’s name, the user’s ID number, and the user’s shell. The regexp should extract each piece of information using a group. If the language supports it, retrieve each field using a named group, rather than a numbered one.

3.5.8

Beheaded usernames

In this exercise, display the final four characters of any username that starts with a and contains at least five characters. Thus, given the users nobody, root, amotz, atara, adam, and astronaut, we would see the following output: motz

tara

naut
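A minimal Python sketch of one way to get that output: anchor both ends of the username, require the leading a, and capture exactly four trailing characters in a group. It assumes usernames consist of word characters only.

```python
import re

usernames = ["nobody", "root", "amotz", "atara", "adam", "astronaut"]

# a, then any number of characters, then a group of exactly four;
# "adam" fails because only three characters follow the a
pattern = re.compile(r"^a\w*(\w{4})$")

tails = []
for name in usernames:
    match = pattern.search(name)
    if match:
        tails.append(match.group(1))

print(tails)
```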

3.5.9

Final question words

Solution is in section 8.9 In this exercise, you are to retrieve the final word of each question in Alice in Wonderland. You can assume that a question always ends with a question mark (?). You should not retrieve the question mark, but just the word preceding it.

3.5.10

“d” user shells

Solution is in section 8.10 In /etc/passwd, each line contains a number of different fields, separated by : characters. The first field is the username, and the final field is the user’s shell (i.e., the command interpreter). On a typical Linux box, most people will be using /bin/sh or /bin/bash, whereas others will be using /usr/bin/zsh, or something like that. And then you have the internal system users, whose shells are often /bin/false (so that they cannot log in), or something of the like.

In this exercise, I want you to retrieve the shell from every user whose name contains d. For example, given the following line: daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin

This user’s name (daemon) starts with d, and their shell is /usr/sbin/nologin. But we also want shells from users with d elsewhere in the name, as in: redis:x:112:123:redis server,,,:/var/lib/redis:/bin/false
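One possible sketch in Python: keep the username inside the first field by using [^:], require a d somewhere in it, and capture the last colon-free field as the shell. The root line below is a hypothetical extra, added only to show a non-matching user.

```python
import re

lines = [
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin",
    "redis:x:112:123:redis server,,,:/var/lib/redis:/bin/false",
    "root:x:0:0:root:/root:/bin/bash",  # hypothetical line, no d in the name
]

# First field must contain a d; the group grabs the final field
pattern = re.compile(r"^[^:]*d[^:]*:.*:([^:]*)$")

shells = []
for line in lines:
    match = pattern.search(line)
    if match:
        shells.append(match.group(1))

print(shells)
```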

3.6

Flags

3.6.1

All usernames

Solution is in section 9.1 In this exercise, you are to find all of the usernames in passwd.txt. However, you are to do this not by looping over the lines in passwd.txt, but rather by applying a regexp to the entire contents of the file as a single string, and retrieving all of the matches found in that string. Just to remind you, the username is at the start of each line, until the first : character.

3.6.2

abc

Solution is in section 9.2 In Alice in Wonderland, find stretches of text that start with a, have a b in the middle, and end with c. Between each of these characters can be up to 20 other characters.

3.6.3

abcABC

Solution is in section 9.3 This exercise is a repeat of the previous one. But whereas the previous exercise asked you to find stretches of a, b, and c with up to 20 characters between each of these letters, here the search should be case-insensitive. That is, now we’re looking for either a or A, then up to 20 characters, then b or B, followed by up to 20 characters, then c or C.

3.6.4

abcABC, extended

Solution is in section 9.4 The regexp in the previous exercise was starting to get a bit long and complex. In such cases, it’s a good idea to break the regexp into separate lines, taking advantage of the “extended mode” that many regexp engines offer. In this exercise, I want you to take the regexp from the previous exercise (9.3) and turn it into a multi-line regexp, using extended mode in your language of choice.

3.6.5

No-error IP addresses

Solution is in section 9.5 In this exercise, we’re going to work with fakelog.txt, a logfile using a format that I created for the purposes of my regexp courses. Each entry in the logfile is two lines long, and represents a response code of some sort, similar to HTTP. The first line contains the timestamp of the error message, followed by the (fake) IP address that caused the error. The second line contains the word Result, followed by a three-digit number indicating the error code, a colon, and a message. Your task is to extract the IP addresses associated with a response code starting with a 2.

3.7

Backreferences

3.7.1

Doubled vowels

Solution is in section 10.1 Find all of the words in Alice in Wonderland that contain doubled vowels – that is, the same vowel (a, e, i, o, or u) appears twice in a row. For example, “beer” contains a doubled vowel, but “bear” does not.
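The beer/bear distinction can be sketched in Python with a backreference: capture one vowel, then require the same captured text again with \1. The sentence below is a hypothetical stand-in for the book's text.

```python
import re

# Hypothetical sample text
text = "A beer for the bear, and a book too."

# Optional word characters, a captured vowel, the same vowel again (\1),
# then the rest of the word
pattern = re.compile(r"\w*([aeiou])\1\w*", re.IGNORECASE)

# finditer gives access to the whole match, not just the captured vowel
doubled = [match.group(0) for match in pattern.finditer(text)]
print(doubled)
```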

3.7.2

Hours and seconds

Solution is in section 10.2 In access-log.txt, find all of the entries in which the hour and second for the entry were identical. Thus, a request at 12:34:12 matches, but 12:34:56 does not.
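As a rough Python sketch, the hour can be captured and then reused as the seconds with \1. The two timestamps are hypothetical, built from the times mentioned above in the access-log format; the leading colon and trailing space keep the pattern from latching onto the year or the minutes.

```python
import re

# Hypothetical timestamps in the access-log format
timestamps = [
    "[30/Jan/2010:12:34:12 +0200]",
    "[30/Jan/2010:12:34:56 +0200]",
]

# :HH:MM:SS followed by a space, where SS must repeat the captured HH
pattern = re.compile(r":(\d\d):\d\d:\1 ")

same_hour_and_second = [ts for ts in timestamps if pattern.search(ts)]
print(same_hour_and_second)
```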

3.7.3

Seven-letter start-finish words

Solution is in section 10.3 In the dictionary, find all seven-letter words that start and end with the same two letters. For example, restore starts with re and ends with re, and is seven letters long.
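A short Python sketch of one way to express this: capture the first two letters, allow exactly three letters in the middle, and require the captured pair again at the end. The word list is hypothetical and stands in for the dictionary file.

```python
import re

# Hypothetical sample words
words = ["restore", "require", "regular", "restored"]

# Two captured letters + three letters + the same two letters = seven
pattern = re.compile(r"^(\w\w)\w{3}\1$")

start_finish = [word for word in words if pattern.search(word)]
print(start_finish)
```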

3.7.4

end-start

Solution is in section 10.4 Show all words in the dictionary in which the final two letters of one word are the same as the first two letters of the next word. Thus, if the word require is followed by the word requirement, then we’ll want to see require in our output.

3.7.5

Singular and plural

Solution is in section 10.5 Find all of the words in Alice in Wonderland that appear in both singular and plural forms. For the purposes of this exercise, we’ll generalize, and say that a “plural” is any word with an “s” or “es” on the end. Thus, if both cat and cats appear in the book, then I want to see cat. We’ll also say that the singular version of a word must be at least 2 letters long, and that the singular version must precede the plural version.

3.8

Replace

3.8.1

Crunch whitespace

Solution is in section 11.2 This is another simple exercise, but one that has great practical implications. The idea is that you have read some text into your program. That text contains a number of types of whitespace characters – spaces, tabs, newlines, and even carriage returns. You want to turn each of those characters, or each run of several such characters, into a single space character. So if you have the string abc

def\n

\tghi \t \r \n jkl

You want to turn it into abc def ghi jkl
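In Python, this can be sketched with a single substitution. The exact spacing of the input string is an assumption, since the original layout of the example is not fully recoverable here.

```python
import re

# Assumed input: a mix of spaces, tabs, newlines, and a carriage return
text = "abc   def\n  \tghi \t \r \n jkl"

# \s+ matches any run of one or more whitespace characters
crunched = re.sub(r"\s+", " ", text)
print(crunched)
```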

3.8.2

New hostname

Solution is in section 11.3 Our company is rebranding from “foocorp” to “barcorp”, and as such, all of the URLs must change. We’re also changing our URLs such that if there is a www. before the foocorp, that should go away as well. And our corporate security team has said we need to use HTTPS instead of HTTP, so all of our URLs that currently use http now need to use https. Can we take care of all three of these at once? In other words, given the text Please visit http://www.foocorp.com/.

we should change it to Please visit https://barcorp.com/.
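All three changes can be sketched as one substitution in Python. This is only one possible approach; it assumes the hostname always ends in .com, as in the example.

```python
import re

text = "Please visit http://www.foocorp.com/."

# http or https, an optional www., then the old hostname;
# the replacement always uses https, no www., and the new hostname
updated = re.sub(r"https?://(?:www\.)?foocorp\.com", "https://barcorp.com", text)
print(updated)
```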

3.8.3

Detagify

Solution is in section 11.4 While regexps shouldn’t be used for parsing HTML and XML, there are still times when they can be used to manipulate those formats. You have to be careful when doing this; a famous Stack Overflow answer about using regexp to parse XML demonstrates just how frustrated some programmers can get with some questions. However, there are some XML-related tasks for which regexps are perfectly suited. This exercise is one of them: Given a text string, you are to remove all of the XML/HTML tags, leaving everything else in place. It’s fine to leave some corner cases in place; we’re not trying to build the ultimate XML tag parser here. So if you have the string This is a headline