Prep-course Extra Credit: An Intro to Regular Expressions

This post is written by Mark, a student participating in the March 2014 cohort.

At the start of the second week at Bitmaker (after the introduction to object-oriented programming but before Rails), the curriculum challenges students to solve a fairly straightforward problem using Ruby. It's straightforward if you know something about regular expressions. If not, the solution is about three times as long. Regular expressions (or regex) aren't covered in the prework, but learning something about them before you enter the course is super handy and, more generally, they are a key addition to a hacker's toolbox.

So what are regular expressions — or regex or regexp?

Simply put, regular expressions are a way of saying "Hey computer, find all the text (within a file or a string) that matches pattern X." So if you wanted to find all words that start with a certain capital letter, you could use a regular expression. If you wanted to find all carriage returns preceded by a period, you could use a regular expression. If you wanted to find all email addresses in a thousand-line text block, you could use a regular expression. Most people are familiar with the find and replace functionality in word processing software; regex is a far more flexible version of the same idea, and many text editors offer it as a search option. If you have used grep, the 're' in grep stands for regular expression.
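
To make the first of those cases concrete, here is a minimal sketch in Ruby. The helper name `words_starting_with` and the sample sentence are made up for illustration, and the pattern leans on `\b` (a word boundary) and `*` (zero or more), neither of which is covered in this post.

```ruby
# Hypothetical helper: returns every word that starts with the given capital letter.
def words_starting_with(text, capital)
  # \b is a word boundary; [a-z]* allows any run of lowercase letters after the capital.
  text.scan(/\b#{Regexp.escape(capital)}[a-z]*/)
end

sample = "Bitmaker students Build things in Ruby, but Bob prefers bash."
p words_starting_with(sample, "B")  # => ["Bitmaker", "Build", "Bob"]
```

Note that "but" and "bash" are skipped because the pattern asks for a capital B.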

Where to start?

Like so many other things UNIX-y and pre-web, how to write (and read) regular expressions is not obvious. Much ink has been spilled to explain them. But don't be intimidated. Even if you use them every day, mastery can be elusive. So it's enough to know what they are and the basics of how they work, then google everything else.

The best resource I have found for noobs is written by the folks at Bare Bones Software as part of the PDF user manual for the awesomeness that is TextWrangler. Even if you are not a Mac user, I suggest you download this guide and read chapter eight (glancing at chapter seven wouldn't hurt either). Most of what follows is a summary of what I have learned from this guide.

![Some rights reserved by flickr user Lasse Havelund](http://farm2.staticflickr.com/1182/542439223_79fe963a6f_n.jpg)

Some rights reserved by flickr user Lasse Havelund

Some basics

  • like with plaintext search, most characters match themselves
  • unlike plaintext search, the edge cases are where you find the majority of regex uses
  • non-printing characters are prefixed with a backslash (\)

    • so \r matches a line break
    • \t matches a tab
  • in fact, any special character can be 'escaped' with a \ so that it matches itself literally

    • \\ matches a backslash
    • . will match any character except a return
  • other wildcard characters include the following:

    • ^ will match the beginning of a line
    • $ will match the end of a line.
      Example: ^From will match 'From' at the start of any line, which finds all lines that begin with the word 'From'.
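
Here is a rough sketch of these basics in Ruby. The helper names and the sample text are invented for illustration. One assumption to flag: the TextWrangler guide uses \r for line breaks, while Ruby strings usually use \n, so that is what the sample uses.

```ruby
# Hypothetical helper: lines that begin with 'From'.
def lines_starting_with_from(text)
  # In Ruby, ^ matches at the start of every line, not only the start of the string.
  text.lines.map(&:chomp).grep(/^From/)
end

# Hypothetical helper: true if the text contains a tab character.
def contains_tab?(text)
  text.match?(/\t/)
end

sample = "From: mark@example.com\nSent From: my phone\nTotal: 5.00\tpaid"

p lines_starting_with_from(sample)  # => ["From: mark@example.com"]
p contains_tab?(sample)             # => true

# An unescaped . is a wildcard; an escaped \. matches only a literal period.
p "a.b".scan(/./)   # => ["a", ".", "b"]
p "a.b".scan(/\./)  # => ["."]
```

The second line of the sample contains 'From' too, but not at the beginning, so ^From skips it.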

The characters ^ and $ behave a little differently if they are included in classes, so we will explain that next.

Basic classes

Classes match a range or a set of characters

  • classes are defined between square brackets []
  • so [abc] matches any one of the characters a, b, or c
  • [^abc] matches any character but excludes a, b, or c
  • [a-z] matches any one lowercase letter
  • similarly [0-9] matches any one digit
    Example: [a-z0-9] will match any lowercase letter or any digit.
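
A quick sketch of the classes above in Ruby, using a made-up helper `chars_matching` that lists every single character a class matches in a sample string:

```ruby
# Hypothetical helper: every individual match of the pattern in the text.
def chars_matching(text, pattern)
  text.scan(pattern)
end

p chars_matching("ruby 2014", /[aeiou]/)   # => ["u"]
p chars_matching("ruby 2014", /[^aeiou]/)  # => ["r", "b", "y", " ", "2", "0", "1", "4"]
p chars_matching("ruby 2014", /[0-9]/)     # => ["2", "0", "1", "4"]
p chars_matching("Ruby", /[a-z]/)          # => ["u", "b", "y"]
```

The negated class [^aeiou] matches the space as well, since a space is simply "not a vowel". In the last line the capital R is missed because the class only lists lowercase letters.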

Regular expressions are case sensitive by default, so [a-z] is different from [A-Z]. Got that? OK. Imagine you want to find words that have 'zz' somewhere in them. A range won't help you. You need a quantifier.

Basic quantifiers

Operators like + specify how many times the pattern preceding them may repeat.

  • p+ will match one or more p characters in a row
  • p{count} will match exactly that number of p characters in a row
    Example: [0-9]+ will match any string of digits (e.g. 2013) while 0{2} will match the last two digits of 1900.
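
The quantifiers above can be tried out in Ruby roughly like this. The helper names and sample strings are made up for illustration.

```ruby
# Hypothetical helper: every run of one or more digits in the text.
def digit_runs(text)
  text.scan(/[0-9]+/)
end

# Hypothetical helper: the words that contain exactly the sequence 'zz'.
def double_z_words(text)
  # z{2} means "z, twice in a row".
  text.split.select { |word| word =~ /z{2}/ }
end

p digit_runs("The March 2014 cohort, not the 1900 one")  # => ["2014", "1900"]
p double_z_words("pizza buzz lazy jazz")                   # => ["pizza", "buzz", "jazz"]
p "1900"[/0{2}/]                                           # => "00"
```

Without the +, [0-9] would return each digit separately instead of whole numbers.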

I am just scratching the surface here. There are ways to find patterns in standalone words. There are 'greedy' expressions. You can make 'assertions' with regex. You can replace text, and so on.

Seriously, go read the Bare Bones document to get your feet wet. I suggest learning about regex on its own before compounding things with Ruby code. When you see code like word =~ /^ ([^aeiouyq](qu)?)(.)$/x ...you will want to know which syntax is from Ruby and which is regex.

One advanced feature I should mention is subpatterns. I used regex for months before learning about subpatterns (sometimes called subexpressions) and they totally changed my game. In a nutshell, subpatterns allow you to find, then save, a matched pattern so it can be passed to another action. They are super useful when transposing first and last names. For example, finding John Doe and replacing it with Doe, John. In any case, be sure to look up how this works in your reading.
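
Here is a minimal sketch of the name swap in Ruby. The helper `last_name_first` is hypothetical, and it assumes a name is exactly two words made of plain letters separated by one space, which real names often are not.

```ruby
# Hypothetical helper: turns "First Last" into "Last, First".
def last_name_first(full_name)
  # Each pair of parentheses saves a subpattern.
  # In the replacement, \1 is the first saved match and \2 is the second.
  full_name.sub(/^([A-Za-z]+) ([A-Za-z]+)$/, '\2, \1')
end

p last_name_first("John Doe")  # => "Doe, John"
```

If the input does not fit the pattern (a single word, say), sub finds nothing to replace and returns the string unchanged.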

Conclusion

So is the power of regular expressions worth the hassle? I would say yes, but as BM instructor Mina Mikhail recently repeated:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.‡

Yes, they aren't for everybody. They are painful to learn. But they will make your time at Bitmaker less so. Don't say I didn't tell you. Haha.

‡ As an aside, that quotation is by Jamie Zawinski. Read what Jeff Atwood says about it.