"Linux Gazette...making Linux just a little more fun!"

Regular Expressions in C

By Ben Tindale

Scope

In this series of articles I intend to explore the varying implementations of strings in languages that are common on the Linux platform. The first article will explore the regular expression library provided with GNU libc. In future articles I hope to look at other common libraries and languages: hashing functions in Java, and strings in KDE versus strings in Gnome.

Each language has its strengths and its weaknesses. I hope that by doing a little grunt work on your behalf, I'll be able to give you a brief overview of the abilities and weaknesses of the common languages and their libraries with respect to string handling.

I won't be talking about internationalisation and localisation in this series of articles, since those subjects are worthy of volumes of study, not a short summary.

The GNU C Library and Regular Expressions

From a programmer's perspective, the GNU C library (libc) is the most fundamental component of any Linux installation. Most higher-level libraries are built on libc, and much of what we think of as the "C language" is really a set of functions provided by libc.

Strings in C are just null-terminated arrays of chars or wide chars. This is the simplest and most efficient implementation of strings in terms of computer resources, but probably the trickiest and least efficient implementation in terms of programmer resources. Since strings are either constants (i.e. literals) or arrays reached through pointers, the programmer has the power to manipulate strings down to the bit level and has all kinds of opportunities to optimise their code. On the other hand, null termination and the absence of built-in length checking mean that problems such as infinite loops and buffer overflows creep into code all too easily.
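To make the length-checking problem concrete, here is a minimal sketch of a copy routine that checks the destination size itself, since the language will not. The name copy_checked() is hypothetical and is not part of libc; only strlen() and memcpy() are standard calls.

```c
#include <string.h>

/* Copy src into dst, which can hold dst_size bytes, and always leave
 * dst null-terminated. Returns 0 if the whole string fitted and -1 if
 * it had to be truncated. copy_checked is a hypothetical helper name. */
int copy_checked(char *dst, size_t dst_size, const char *src)
{
    size_t len;

    if (dst_size == 0)
        return -1;                      /* no room even for the '\0' */

    len = strlen(src);                  /* walks memory until it finds '\0' */
    if (len >= dst_size) {
        memcpy(dst, src, dst_size - 1); /* keep only what fits */
        dst[dst_size - 1] = '\0';
        return -1;
    }
    memcpy(dst, src, len + 1);          /* +1 copies the terminator too */
    return 0;
}
```

Every caller has to pass the right size by hand (typically sizeof on the destination array), which is exactly the kind of bookkeeping that goes wrong in practice.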

The GNU C library is rich in string manipulation functions. There are standard calls to copy, move, concatenate, compare and find the length of a string (or a section of memory). In addition to these, libc also supports tokenization and regular expression searches.
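As a small illustration of the tokenization support, the sketch below counts the whitespace-separated words in a string with strtok(). The function name count_words() and the 256 byte working buffer are assumptions made for this example only.

```c
#include <string.h>

/* Count the whitespace-separated words in text. strtok() overwrites
 * the delimiters in its argument with '\0', so we work on a local copy
 * (inputs longer than 255 bytes are truncated in this sketch). */
size_t count_words(const char *text)
{
    char copy[256];
    char *tok;
    size_t n = 0;

    strncpy(copy, text, sizeof copy - 1);
    copy[sizeof copy - 1] = '\0';       /* strncpy may not terminate */

    for (tok = strtok(copy, " \t\n"); tok != NULL; tok = strtok(NULL, " \t\n"))
        n++;
    return n;
}
```

Note that strtok() keeps hidden state between calls, so it is not safe to use from several threads or to nest; strtok_r() is the re-entrant variant.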

Regular expressions are a powerful method for searching for text that matches a particular pattern. Most users will have first encountered a related idea on the command line, where shell wildcards such as '*' have a special meaning (in the shell, '*' matches zero or more characters; in a regular expression, '*' instead means zero or more repetitions of the preceding element). To illustrate the power of regular expressions and how they are used, we will implement a simple form of grep.
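The meaning of '*' in a regular expression is easy to check with a few lines of C. This is only a sketch: regex_matches() is a hypothetical helper, and the choice of POSIX extended syntax (REG_EXTENDED) is an assumption of the example.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if text contains a match for the POSIX extended regular
 * expression pattern, 0 if it does not, and -1 if the pattern does not
 * compile. regex_matches is a hypothetical helper name. */
int regex_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: we only want a yes/no answer, not match positions */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this helper, the pattern "^ab*c$" matches "ac" and "abbbc" but not "abxc", because 'b*' repeats only the 'b'. The regular expression closest to the shell's '*' is ".*", so "^a.*c$" does match "abxc".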

Mygrep.c

Mygrep.c uses the regular expression functions declared in regex.h to search through a text file for lines that match a given pattern.

	bash> ./mygrep -f mygrep.c -p int
	Line 17: int match_patterns(regex_t *r, FILE *FH)
	Line 36: printf("Line %d: %s", line_no, line);
	Line 52: printf("In error\n");
	bash>

Libc makes the use of regular expressions comparatively easy. Of course, it would be much easier to use a language with regular expression matching as part of its core definition (such as Perl) for this example, but the C library does have the advantage of easy integration with existing code, and possibly of speed (although in languages such as Perl the regular expression matching is highly optimised).

If you examine the program listing, you will see that mygrep.c consists of a main function that handles the user options and two functions that perform the actual regular expression matching. The first of these functions, logically, is the function do_regex(). This function takes in as its parameters a pointer to a regular expression structure, a string holding the pattern to search for and a string holding the filename. The first task that do_regex() performs is to "compile" the regular expression pattern into the format native to the GNU library by calling regcomp(). This format is a data structure optimised for pattern matching, the details of which are hidden from the user. Next, the file to be scanned is opened, then the file handle and the compiled regular expression are passed to match_patterns() to execute the search and output the results.
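The compile step described above might look roughly like the following. This is a sketch and not the code from the mygrep.c listing: compile_pattern() is a hypothetical name, and the REG_EXTENDED flag is an assumption. The calls regcomp() and regerror() are the standard regex.h interface.

```c
#include <regex.h>
#include <stdio.h>

/* Compile pattern into *r. On failure, print the library's own
 * description of the error to stderr and return -1; on success return
 * 0, and the caller must eventually release *r with regfree().
 * compile_pattern is a hypothetical helper name. */
int compile_pattern(regex_t *r, const char *pattern)
{
    int status = regcomp(r, pattern, REG_EXTENDED);

    if (status != 0) {
        char msg[256];

        /* regerror() turns the numeric code into readable text */
        regerror(status, r, msg, sizeof msg);
        fprintf(stderr, "Bad pattern '%s': %s\n", pattern, msg);
        return -1;
    }
    return 0;
}
```

Because the compiled form lives in an opaque regex_t, the same structure can be reused for every line of the file, which is why the pattern is compiled once, before the file is opened.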

match_patterns() scans through each line of the file, looking for text that matches the regular expression. We read the lines one by one; note that we have assumed that each line is at most 1023 bytes long (the array called "line" is 1024 bytes long and we need one byte for the null termination). If a line is longer than 1023 bytes, the remainder is read as if it were a new line, and so on until the '\n' character is reached. The function regexec() scans the line for a set of characters that match the user-specified pattern. Each time it finds such a set of characters, regexec() returns 0, at which point we print out the matching line and its line number. If a regular expression matches more than once in a line, then the line is printed out more than once. After each match, the offset from the beginning of the line is updated so that we do not match on the same text again.
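The scanning loop can be sketched as follows. Again this is an illustration under stated assumptions, not the original listing: match_lines() is a hypothetical name, the return value (total number of matches) is an addition for the example, and the pattern is assumed to have been compiled without REG_NOSUB so that regexec() reports match offsets.

```c
#include <regex.h>
#include <stdio.h>

/* Read fh line by line and print a line once for every match of *r
 * found in it. Returns the total number of matches. match_lines is a
 * hypothetical name; *r must be compiled without REG_NOSUB. */
int match_lines(const regex_t *r, FILE *fh)
{
    char line[1024];                    /* 1023 bytes of text + '\0' */
    int line_no = 0;
    int total = 0;

    while (fgets(line, sizeof line, fh) != NULL) {
        const char *p = line;           /* where the next search starts */
        regmatch_t m;
        int flags = 0;

        line_no++;
        while (*p != '\0' && regexec(r, p, 1, &m, flags) == 0) {
            printf("Line %d: %s", line_no, line);
            total++;
            /* Move the offset past this match; step one byte on an
             * empty match so the loop always makes progress. */
            p += (m.rm_eo > 0) ? m.rm_eo : 1;
            /* p no longer points at the start of the line */
            flags = REG_NOTBOL;
        }
    }
    return total;
}
```

fgets() stops after 1023 bytes if it has not seen a '\n', which produces the wrap-around behaviour described above: the rest of an over-long line arrives on the next call and is counted as a new line.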

This example, while fairly trivial, illustrates how powerful the GNU C library can be. Some of the more salient features of the library that we have used include:

The ability to automagically handle extremely long lines.
Optimised data structures for particular functions.
Standard, portable error handling.
Standard, portable handling of command line options.
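For the last item in the list, one standard and portable way to handle options such as "-f mygrep.c -p int" is getopt(). Whether the original listing uses getopt() or some other libc facility is an assumption here, and parse_options() is a hypothetical name; the sketch only shows the general shape.

```c
#define _POSIX_C_SOURCE 200809L  /* ask for getopt() under strict -std modes */
#include <stdio.h>
#include <unistd.h>

/* Parse "-f FILE -p PATTERN" from the command line. On success store
 * the two arguments through file and pattern and return 0; return -1
 * if an option is unknown or missing. parse_options is a hypothetical
 * helper name. */
int parse_options(int argc, char *argv[],
                  const char **file, const char **pattern)
{
    int opt;

    *file = NULL;
    *pattern = NULL;

    /* "f:p:" means both -f and -p take an argument */
    while ((opt = getopt(argc, argv, "f:p:")) != -1) {
        switch (opt) {
        case 'f':
            *file = optarg;
            break;
        case 'p':
            *pattern = optarg;
            break;
        default:                /* getopt() has already printed a diagnostic */
            return -1;
        }
    }
    if (*file == NULL || *pattern == NULL) {
        fprintf(stderr, "Usage: %s -f file -p pattern\n", argv[0]);
        return -1;
    }
    return 0;
}
```

A main() function would call parse_options(argc, argv, &file, &pattern) first and exit with an error status if it returns -1.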

In particular, we explored the capable GNU regular expression library, regex.h, which simplifies the inclusion of regular expression matching into your program, and provides a safe and simple interface to these capabilities.