Almost every application has a search facility: we type a string in a dialog box and search for it in the given text, and often we can replace it with another string too. We can provide this kind of facility in our program using the methods of the System.String class. But if we wish to count how many times a string is repeated, or to find strings that start and end with a given character, strings that represent dates in a given range, strings that contain repeated characters, and so on, then searching becomes difficult. For this kind of situation the regular expression language is of great use. In the regular expression language we can write search expressions. Using these search expressions we can extract, delete or replace a substring. We can also split a string into substrings.
One of the most common uses of regular expressions is to validate user input such as an e-mail address, credit card number, ZIP code, etc. Another common use is ‘screen scraping’. Suppose a web site extracts real-time information from a database and displays it on an HTML page. The information can be a weather report, stock exchange prices, air ticket reservations and so on. If we want to use this information in our application we would access the HTML page. What we would get is the HTML markup, which we would need to parse to obtain the required information. Regular expressions can be used effectively in this task.
The regular expression language consists of literals and metacharacters (some of which are written as escape sequences). A metacharacter acts as a command to the regular expression parser. The parser is the engine responsible for interpreting the regular expression. The System.Text.RegularExpressions namespace provides classes to create and use regular expressions.
Let us take an example. The metacharacter '\b' marks a word boundary, that is, either the beginning or the end of a word. If we wish to search for a word that starts with the character 'a' then the search expression will be @"\ba". To search for a word that ends with the characters 'ing' the expression will be @"ing\b". If we have to write an expression that matches words starting with 'a' and ending with 'ing', what about the characters in the middle? For this, we must use the '\S' escape sequence. '\S' stands for any character except whitespace. So, our expression would be @"\ba\S*ing\b". The * here means zero or more occurrences of the preceding element, so "\S*" matches zero or more non-whitespace characters.
Let us use this search expression in a program.
using System ;
using System.Text.RegularExpressions ;

class Program
{
    static void Main ( string [ ] args )
    {
        // Input text, split across two literals only to keep the lines short
        String instr = "It's amazing that the amount of news that happens in the world everyday always just " +
            "exactly fits the newspaper." ;
        // A word that starts with 'a' and ends with 'ing'
        String pattern = @"\ba\S*ing\b" ;
        Match m = Regex.Match ( instr, pattern, RegexOptions.IgnoreCase ) ;
        // Visit every match in turn
        while ( m.Success )
        {
            Console.WriteLine ( m.Value ) ;
            m = m.NextMatch( ) ;
        }
    }
}
This program would output ‘amazing’. The Regex class contains several methods that we can use to perform operations using the specified regular expression. Here, the string pattern contains our regular expression. The static method Regex.Match( ) finds the first substring of the input string instr that matches the specified pattern. Matching is case-insensitive since we have passed the enumerated value RegexOptions.IgnoreCase. The Match( ) method returns a reference to an object of the Match class, which we have collected in m (don’t confuse the Match( ) method with the Match class). If the Match( ) method succeeds, the Value property of the Match object contains the matched substring and the Success property contains true. Another property of the Match class, Index, contains the index of the first character of the matched substring. To locate and display all the matching substrings, we have called the NextMatch( ) method of the Match class. The NextMatch( ) method returns a Match reference for the next matching substring. If NextMatch( ) is called after the last match has been found, the Match it returns has its Success property set to false, which ends the loop. Instead of m.Value we can pass m directly to the WriteLine( ) method, because WriteLine( ) calls ToString( ), and the ToString( ) method of a Match object returns the same substring that the Value property holds.
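The Index property and the implicit ToString( ) call described above can be tried out with a short sketch. The class name MatchIndexDemo, the helper Locate( ) and the shortened input sentence are made up for illustration; only Regex.Match( ), NextMatch( ), Success and Index come from the library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class MatchIndexDemo
{
    // Hypothetical helper: collects "text@index" for every match of pattern in input.
    public static List<string> Locate(string input, string pattern)
    {
        var found = new List<string>();
        Match m = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
        while (m.Success)
        {
            // Concatenating m calls ToString( ), which yields the same text as m.Value.
            found.Add(m + "@" + m.Index);
            m = m.NextMatch();
        }
        return found;
    }

    static void Main()
    {
        foreach (string s in Locate("It's amazing that the amount of news fits.", @"\ba\S*ing\b"))
            Console.WriteLine(s);   // prints amazing@5
    }
}
```

The index is zero-based, so 5 is the position of the 'a' that follows "It's ".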
The following table displays the metacharacters we can use in regular expressions.

| Expression | Meaning |
| --- | --- |
| . | Matches any character except \n |
| [characters] | Matches a single character in the list |
| [^characters] | Matches a single character not in the list |
| [charX-charY] | Matches a single character in the specified range |
| \w | Matches a word character (a word character is any alphanumeric character or the underscore) |
| \W | Matches a non-word character |
| \s | Matches a whitespace character |
| \S | Matches a non-whitespace character |
| \d | Matches a decimal digit |
| \D | Matches a non-digit character |
| ^ | Beginning of the line (of the whole input unless RegexOptions.Multiline is set) |
| $ | End of the line (of the whole input unless RegexOptions.Multiline is set) |
| \b | On a word boundary |
| \B | Not on a word boundary |
| * | Zero or more matches |
| + | One or more matches |
| ? | Zero or one matches |
| {n} | Exactly n matches |
| {n,} | At least n matches |
| {n,m} | At least n but no more than m matches |
| ( ) | Capture matched substring |
| (?<name> ) | Capture matched substring into the group called name |
| \| | Logical OR |
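As a small illustration of how several of these metacharacters combine, here is a sketch that validates a ZIP code, one of the uses mentioned earlier. The accepted format (five digits, optionally followed by a hyphen and four more digits), the class name ZipCheck and the method IsZip( ) are assumptions chosen for the example, not rules taken from any postal standard.

```csharp
using System;
using System.Text.RegularExpressions;

class ZipCheck
{
    // Assumed format: ^ start, \d{5} five digits,
    // (-\d{4})? an optional hyphen plus four digits, $ end.
    const string ZipPattern = @"^\d{5}(-\d{4})?$";

    // Hypothetical helper: true when the whole input fits the assumed format.
    public static bool IsZip(string input)
    {
        return Regex.IsMatch(input, ZipPattern);
    }

    static void Main()
    {
        Console.WriteLine(IsZip("12345"));       // True
        Console.WriteLine(IsZip("12345-6789"));  // True
        Console.WriteLine(IsZip("1234"));        // False
        Console.WriteLine(IsZip("12345-67"));    // False
    }
}
```

Without the ^ and $ anchors the pattern would also accept longer strings that merely contain five digits somewhere, which is rarely what a validation check wants.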
We would now write a program that splits a given input string using a specified delimiter. Suppose we have collected e-mail addresses separated by semicolons ( ; ). Then we can split the string into individual addresses as shown in the following code.
String instr = "rucha_200@hotmail.com;malviya@yahoo.com;rdeshpande@hotmail.com" ;
Regex ex = new Regex ( ";" ) ;
string [ ] str = ex.Split ( instr ) ;
foreach ( string s in str )
Console.WriteLine ( s ) ;
We have created an object of the Regex class, passing the delimiter pattern ";" to its constructor. The Split( ) method of the Regex class splits the string passed to it, using the pattern given to the Regex constructor, and returns an array of the substrings. We have collected this array and displayed its elements. We can use multiple delimiters to split a string. The multiple delimiters must be separated by a pipe ( | ) character.
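Splitting on more than one delimiter might look like the following sketch. The comma as a second delimiter, the class name SplitDemo and the helper SplitAddresses( ) are assumptions added for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class SplitDemo
{
    // Hypothetical helper: splits on a semicolon OR a comma.
    public static string[] SplitAddresses(string input)
    {
        Regex ex = new Regex(";|,");
        return ex.Split(input);
    }

    static void Main()
    {
        string instr = "rucha_200@hotmail.com;malviya@yahoo.com,rdeshpande@hotmail.com";
        foreach (string s in SplitAddresses(instr))
            Console.WriteLine(s);   // one address per line, three lines in all
    }
}
```

For single-character delimiters such as these, the character class "[;,]" would be an equivalent way to write the pattern.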
We saw how to find a pattern in a string using the Match( ) method. The Regex class also has a Matches( ) method, which returns all the matches at once in a MatchCollection. We can use these methods to find multiple patterns in a string too. For example,
String instr = @"The brain is a wonderful organ, it does not stop until you get into the office." ;
Regex ex = new Regex ( "(is)|(in)|(it)" ) ;
MatchCollection mc = ex.Matches ( instr ) ;
foreach ( Match m in mc )
Console.WriteLine ( "{0} found at position {1}", m, m.Index ) ;
We have specified the multiple patterns “is”, “in” and “it” separated by ‘|’. This will display multiple instances of the given patterns along with their positions. The character ‘i’ is common in all the patterns, so alternatively, we can use a regular expression as "i(s|n|t)".
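To check that the two ways of writing the alternation find the same substrings at the same positions, a sketch along these lines can be used. The class name AlternationDemo and the helper Find( ) are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class AlternationDemo
{
    // Hypothetical helper: lists every match of pattern as "text@index".
    public static List<string> Find(string input, string pattern)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
            found.Add(m.Value + "@" + m.Index);
        return found;
    }

    static void Main()
    {
        string instr = "The brain is a wonderful organ, it does not stop until you get into the office.";
        // Both lines print the same list of matches and positions.
        Console.WriteLine(string.Join(", ", Find(instr, "(is)|(in)|(it)")));
        Console.WriteLine(string.Join(", ", Find(instr, "i(s|n|t)")));
    }
}
```

The matched text and positions agree, but the capture groups differ: the first pattern has three groups, of which only one participates in any given match, while the second has a single group holding the letter after 'i'.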
Regular expressions can be used effectively in applications that do HTML processing, HTTP header parsing, etc. Here is an example that scans HTML markup for the ‘href’ attribute used for specifying links.
String instr = @"<HTML>
<A href = ""freevb.htm"">Free VB.NET source code</A>
<A href = ""freec.htm"">Free VC++ source code</A>
</HTML>" ;
string pattern = @"href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))" ;
MatchCollection mc = Regex.Matches ( instr, pattern,RegexOptions.IgnoreCase ) ;
foreach ( Match m in mc )
Console.WriteLine ( m ) ;
Interpreting the expression used to find hrefs is now your job.
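As a hint, one possible reading of the expression is: the literal text href, optional whitespace, an equals sign, optional whitespace, and then either a quoted value whose contents are captured into group 1 or an unquoted run of non-whitespace characters captured into group 1. The (?: ) construct groups the two alternatives without capturing them. The sketch below prints only the captured link targets; the class name HrefDemo and the helper Links( ) are assumptions added for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class HrefDemo
{
    // Same pattern as in the text: group 1 holds the link target.
    const string Pattern = @"href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))";

    // Hypothetical helper: returns the value of every href found in html.
    public static List<string> Links(string html)
    {
        var links = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.IgnoreCase))
            links.Add(m.Groups[1].Value);   // the target only, without href= or quotes
        return links;
    }

    static void Main()
    {
        string instr = @"<HTML>
<A href = ""freevb.htm"">Free VB.NET source code</A>
<A href = ""freec.htm"">Free VC++ source code</A>
</HTML>";
        foreach (string link in Links(instr))
            Console.WriteLine(link);   // freevb.htm, then freec.htm
    }
}
```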
We can use the brackets [ ] to match a set or range of characters in a string. For example, the expression "[aeiou]" matches any lowercase vowel in the string. On the other hand, if we use ^ as the first character in [ ] then it matches any character other than those specified. For example, "[^aeiou]" matches every character that is not a lowercase vowel: the consonants, but also digits, spaces and punctuation. To match characters that fall in a range, we use a hyphen (-) to separate the two end characters. For example, the expression "[0-25-9]" matches any digit in the range 0 to 2 or 5 to 9. Written as "\d[0-25-9]", it would instead match a digit followed by a second character in those ranges, since \d itself matches one digit.
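The behaviour of sets, negated sets and ranges can be seen in a short sketch. The sample input, the class name CharClassDemo and the helper Collect( ) are made up for illustration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

class CharClassDemo
{
    // Hypothetical helper: joins every match of pattern in input into one string.
    public static string Collect(string input, string pattern)
    {
        var sb = new StringBuilder();
        foreach (Match m in Regex.Matches(input, pattern))
            sb.Append(m.Value);
        return sb.ToString();
    }

    static void Main()
    {
        string instr = "Room 3471905";
        Console.WriteLine(Collect(instr, "[aeiou]"));    // oo
        Console.WriteLine(Collect(instr, "[^aeiou]"));   // Rm 3471905
        Console.WriteLine(Collect(instr, "[0-25-9]"));   // 71905
    }
}
```

The second line of output shows that a negated set keeps the space and the digits as well as the consonants.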
Groups And Captures
One of the features of regular expressions is that we can group parts of a match together. For example, suppose a training institute has maintained a list of machine names and their IP addresses in a string. We can create two different groups. One group will contain the machine name from each line and the second will contain the IP address. The following program illustrates how to create and use groups.
using System ;
using System.Text.RegularExpressions ;

class Program
{
    static void Main ( string [ ] args )
    {
        string instr = "User1 192.168.90.45\n" + "User2 192.168.90.46\n" + "User3 192.168.90.47" ;
        // name: a run of non-whitespace; ip: a run of digits and dots
        Regex ex = new Regex ( @"(?<name>\S+)\s"+@"(?<ip>(\d|\.)+)" ) ;
        MatchCollection mc = ex.Matches ( instr ) ;
        foreach ( Match m in mc )
        {
            Console.WriteLine ( "\nEntire string: {0}", m ) ;
            Console.WriteLine ( "Name: {0}", m.Groups [ "name" ] ) ;
            Console.WriteLine ( "IP: {0,5}", m.Groups [ "ip" ] ) ;
        }
    }
}
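Groups are always created with a pair of parentheses. Our two groups, name and ip, are marked by the two outer pairs of parentheses, and the ?<name> and ?<ip> prefixes give them their names. The inner pair in (\d|\.) forms a third, unnamed group.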