Cheat Guide to Regex

Thursday, 26 Nov 2009, 13:43

Edited by Mitchell Cooper, Tuesday, 7 Sept 2010, 18:12

Years went by with me safely avoiding regular expressions. They were such happy times. Then twice this year I was forced to make their acquaintance. But it was a good thing - maybe even the best of things - because now I no longer feel a tingle in my bottom when I see one in our source. I thought I'd lay down what I've learnt, for my own reference should more years come between us, and in case it helps others find their way into the arms of this barbed stranger without drawing as much blood or cursing as much.

The following was tested using the .NET framework and the static method System.Text.RegularExpressions.Regex.Match(String, String). To ensure the entire string is matched, rather than just part of it, my usage is:

VB.NET:

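' Assumes Imports System.Text.RegularExpressions
If Regex.Match(searchText, searchPattern).Length = searchText.Length Then
 ' Matched: the first match spans the whole string
Else
 ' No match, or only part of the string matched
End If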
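
For readers without .NET to hand, the same whole-string check can be sketched in Python (a swapped-in language, not what was used for the testing above). The helper name matches_whole is hypothetical, and unlike the VB.NET snippet it rejects a failed search explicitly instead of treating it as a zero-length match.

```python
import re

def matches_whole(search_text, search_pattern):
    """Return True when the first match covers the entire text."""
    match = re.search(search_pattern, search_text)
    # A failed search returns None in Python, so check for that first.
    return match is not None and len(match.group(0)) == len(search_text)

print(matches_whole("8", "[0-9]"))             # a single digit, fully matched
print(matches_whole("01543 MWC66", "[0-9]+"))  # only the leading "01543" matches
```

Python also offers re.fullmatch, which does this comparison for you.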

The Formatted Telephone Number

Consider this list of user-entered numbers. Let's say, of this list, H is the only valid number.

 A: AbcH0W_BOThemy84kies
 B: 01543 MWC66
 C: 8
 D: 91543278966
 E: 01543278966
 F: 0145355 56 544 144
 G: 01543 1231981
 H: 01543 278966

As a first cut, let's accept only string C (a single digit).

Pattern:
 [0-9]

Tokens described:
 [	...start of character list...
 0-9	...any number between 0 and 9...
 ]	...end of character list

Now let's accept C, D & E (all numbers made up only of digits, with no spaces)

Pattern:
 [0-9]+

Tokens described:
 [	...start of character list...
 0-9	...any number between 0 and 9...
 ]	...end of character list
 + 	...one or more of the previous token ([0-9])

Now let's accept C, D, E, F, G & H (all numbers including those with spaces)

Pattern:
 [0-9 ]+

Tokens described:
 [	...start of character list...
 0-9  	...any number between 0 and 9...
  	...or a space...
 ] 	...end of character list
 +  	...one or more of the previous token ([0-9 ])
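
To see the first three patterns side by side, here is a minimal sketch in Python (not the .NET code used for testing in this post). The dictionary and the helper name accepted are my own illustration, and re.fullmatch stands in for the whole-string length check described earlier.

```python
import re

numbers = {
    "A": "AbcH0W_BOThemy84kies",
    "B": "01543 MWC66",
    "C": "8",
    "D": "91543278966",
    "E": "01543278966",
    "F": "0145355 56 544 144",
    "G": "01543 1231981",
    "H": "01543 278966",
}

def accepted(pattern):
    """Labels of the entries the pattern matches from start to end."""
    return [label for label, text in numbers.items()
            if re.fullmatch(pattern, text)]

for pattern in ("[0-9]", "[0-9]+", "[0-9 ]+"):
    print(pattern, accepted(pattern))
```

Each pattern widens the net: first only C, then C, D and E, then everything except A and B.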

Now let's accept E, F, G, H (all numbers, including those with spaces, beginning with 0)

Pattern:
 ^0[0-9 ]+

Tokens:
 ^  	Start of string...
 0  	...must be 0...
 [	...start of character list...
 0-9  	...any number between 0 and 9...
  	...or a space...
 ] 	...end of character list
 +  	...one or more of the previous token ([0-9 ])
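
Note that this pattern has no end-of-string token, so on its own it will happily match just the front of a string. A small Python sketch (again, not the .NET code behind this post) shows why the whole-string length check still matters here, using entry B from the list:

```python
import re

pattern = "^0[0-9 ]+"
text = "01543 MWC66"  # entry B from the list

partial = re.search(pattern, text)
matched_part = partial.group(0)

print(repr(matched_part))                # the part of B that the pattern matched
print(len(matched_part) == len(text))    # the whole-string check rejects B
```

The pattern matches the leading digits and the space, then stops at the first letter, so the match is shorter than the text and B is rejected.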

Now let's accept G & H (all multi-digit numbers beginning with 0, containing exactly one space and ending with digits)

Pattern:
 ^0[0-9]+ [0-9]+$

Tokens:
 ^	Start of string...
 0 	...must be 0...
 [   	...start of character list...
 0-9  	...any number between 0 and 9...
 ] 	...end of character list
 +  	...one or more of the previous token ([0-9])
   	...followed by a space
 [   	...start of character list...
 0-9  	...any number between 0 and 9...
 ] 	...end of character list
 +  	...one or more of the previous token ([0-9])
 $ 	...end of string.
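
With both the start and end tokens in place, the pattern no longer depends on a separate length check. The Python sketch below (a stand-in for the .NET code used in this post) compares it with the same pattern stripped of its anchors; the string in noisy is a made-up input, not one from the list.

```python
import re

anchored = "^0[0-9]+ [0-9]+$"
unanchored = "0[0-9]+ [0-9]+"
noisy = "call 01543 278966 now"  # hypothetical input with text around the number

found_unanchored = re.search(unanchored, noisy) is not None
found_anchored = re.search(anchored, noisy) is not None
found_clean = re.search(anchored, "01543 278966") is not None

print(found_unanchored)  # the number is found inside the sentence
print(found_anchored)    # rejected: the string does not start with 0
print(found_clean)       # accepted: nothing before or after the number
```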

Now let's accept only H (a 0 followed by exactly four digits, then exactly one space, then exactly six digits)

Pattern:
 ^0[0-9]{4} [0-9]{6}$

Tokens:
 ^	Start of string...
 0  	...must be 0...
 [  	...start of character list...
 0-9  	...any number between 0 and 9...
 ] 	...end of character list
 {4}  	...exactly four of the previous token ([0-9])
   	...followed by a space
 [   	...start of character list...
 0-9  	...any number between 0 and 9...
 ] 	...end of character list
 {6}  	...exactly six of the previous token ([0-9])
 $ 	...end of string.
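
As a rough illustration of how the finished pattern might be wrapped up for reuse, here is a Python sketch (not the .NET code used in this post). The names PHONE and is_valid_phone are hypothetical.

```python
import re

# Compile once so the pattern can be reused for many inputs.
PHONE = re.compile("^0[0-9]{4} [0-9]{6}$")

def is_valid_phone(text):
    """True only for a 0, four digits, one space, then six digits."""
    return PHONE.search(text) is not None

print(is_valid_phone("01543 278966"))   # H
print(is_valid_phone("01543 1231981"))  # G: seven digits after the space
print(is_valid_phone("01543278966"))    # E: no space at all
```

Only H passes; G and E are now turned away by the exact counts.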

Useful Misc

The next example is included because it uses operations I couldn't fit into the sample above without killing its simplicity. These include:

specifying ranges of character frequency e.g. xxx to xxxxxxx
escaping Regex operators so they match literally e.g. ^ + [ .
choosing between longer strings such as CAT, Bat, Coffee (character groups)
ranges of characters e.g. D-K

Consider these highly fanciful, currency-esque strings:

 A: X:GBR_£100.23p +17.05
 B: Y:USA_$1981.01c =1.99
 C: X:USA_$56.03c -0.23
 D: D:GBR_£1.00p +0.50
 E: X:GBR_£956.21p =0.211
 F: X:USA_$6000.00k +29.02

Pattern:
 ^[X-Z]{1}:(GBR|USA|EUR)_[£\$][0-9]{2,6}\.[0-9]{2}[cp]{1} [=\-\+][0-9]{1,2}\.[0-9]{2}$

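This Regex pattern will match the first three strings (A, B and C). D fails on its leading letter and its single-digit amount, E on its three decimal places and F on its trailing k.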

Tokens:
 ^	Start of string...
 [   	...start of character list...
 X-Z  	...any capital letter from X to Z...
 ] 	...end of character list
 {1} 	...exactly one of the previous token ([X-Z])
 : 	...followed by a colon
 (  	...start of list of character groups...
 GBR 	...this character group...
 | 	...or...
 USA 	...this character group...
 | 	...or...
 EUR 	...this character group...
 ) 	...end of list of character groups
 _ 	...followed by an underscore
 [   	...start of character list...
 £ 	...a pound sign
 \$ 	...or a dollar sign
 ] 	...end of character list 
 [   	...start of character list...
 0-9  	...any number between 0 and 9...
 ] 	...end of character list
 {2,6}	...between two and six of the previous token ([0-9])
 \.	...followed by a full stop
 [   	...start of character list...
 0-9  	...any number between 0 and 9...
 ] 	...end of character list
 {2} 	...exactly two of the previous token ([0-9])
 [   	...start of character list...
 c 	...a c
 p 	...or a p
 ] 	...end of character list 
 {1}  	...exactly one of the previous token ([cp])
   	...followed by a space
 [   	...start of character list...
 =  	...an equals sign...
 \- 	...or a minus sign...
 \+ 	...or a plus sign...
 ] 	...end of character list
 [   	...start of character list...
 0-9  	...any number between 0 and 9...
 ] 	...end of character list
 {1,2}	...one or two of the previous token ([0-9])
 \.	...followed by a full stop
 [   	...start of character list...
 0-9  	...any number between 0 and 9...
 ] 	...end of character list
 {2} 	...exactly two of the previous token ([0-9])
 $ 	...end of string.
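
To check the claim about which strings match, here is a short Python sketch (a swap from the .NET code used for testing in this post; the variable names are my own). The pattern is copied unchanged from above.

```python
import re

PATTERN = (r"^[X-Z]{1}:(GBR|USA|EUR)_[£\$][0-9]{2,6}\.[0-9]{2}[cp]{1} "
           r"[=\-\+][0-9]{1,2}\.[0-9]{2}$")

strings = {
    "A": "X:GBR_£100.23p +17.05",
    "B": "Y:USA_$1981.01c =1.99",
    "C": "X:USA_$56.03c -0.23",
    "D": "D:GBR_£1.00p +0.50",
    "E": "X:GBR_£956.21p =0.211",
    "F": "X:USA_$6000.00k +29.02",
}

# The pattern carries its own ^ and $, so a plain search is enough.
matched = [label for label, text in strings.items()
           if re.search(PATTERN, text)]
print(matched)
```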

Links & Source:

The code I used in testing the above Regex can be found here.

Derek Slager has a great online tester here.

And easily the best Regex resource on the web is here.

The above expressions are presented merely as examples to illustrate basic Regex concepts. These are not conclusive and I do not credit them, by any means, as definitive for the solutions described, nor do I promote their use in anything other than learning scenarios. Your statutory rights are the things most affected and you need not send no money now.