COMP3148›Regular expressions/Regular languages

Theory of AutomataTopic 3 of 25

Regular expressions/Regular languages

7 minread

1,190words

Intermediatelevel

Regular Expressions and Regular Languages

Regular expressions and regular languages are central concepts in the theory of automata and formal languages. Regular expressions provide a concise and powerful way to describe patterns in strings, and regular languages are the set of languages that can be recognized by finite automata or described by regular expressions.

1. Regular Expressions (RE)

A regular expression (RE) is a symbolic notation used to describe patterns in strings over a given alphabet. Regular expressions are used to describe the syntax of regular languages, and they are used in many practical applications, such as text search, validation, and processing in programming languages.

Basic Components of Regular Expressions

Literal Characters: A single symbol from the alphabet.
- Example: In the alphabet Σ = {a, b}, 'a' and 'b' are literal characters.
Union ( | ): The "or" operator. It is used to denote the union of two regular expressions, meaning either one pattern or the other.
- Example: a|b matches either the string "a" or the string "b".
Concatenation: Placing two regular expressions together represents the concatenation of the strings they describe.
- Example: ab matches the string "ab" (the concatenation of "a" and "b").
Kleene Star ( * ): The Kleene star applied to a regular expression denotes zero or more repetitions of the pattern.
- Example: a* matches any string consisting of zero or more 'a's (including the empty string, ε). This includes "", "a", "aa", "aaa", and so on.
Kleene Plus ( + ): The Kleene plus is similar to the Kleene star, but it requires at least one occurrence of the pattern.
- Example: a+ matches any string consisting of one or more 'a's, such as "a", "aa", "aaa", etc.
Parentheses ( () ): Parentheses are used to group expressions and change the order of operations in regular expressions.
- Example: (a|b)c matches either "ac" or "bc".
Character Classes ( [ ] ): A character class is used to match any one character from a specified set of characters.
- Example: [abc] matches any one of the characters 'a', 'b', or 'c'.
Ranges in Character Classes ( [ ] ): You can specify ranges of characters within character classes.
- Example: [a-z] matches any lowercase letter from 'a' to 'z'.
Negation in Character Classes ( ^ ): The caret (^) at the beginning of a character class negates the class, meaning it matches any character except the ones in the class.
- Example: [^a] matches any character except 'a'.

2. Regular Languages

A regular language is a language that can be described by a regular expression or recognized by a finite automaton (DFA, NFA, or ε-NFA). Regular languages form the simplest class of languages in the Chomsky hierarchy of formal languages.

Properties of Regular Languages

Closed under Operations: Regular languages are closed under several operations, meaning that applying these operations to regular languages results in another regular language. These operations include:
1. Union: The union of two regular languages is regular.
2. Intersection: The intersection of two regular languages is regular.
3. Complement: The complement of a regular language is regular.
4. Concatenation: The concatenation of two regular languages is regular.
5. Kleene Star: The Kleene star of a regular language is regular.
Finite Automata Representation: A regular language can be recognized by a finite automaton, either a deterministic finite automaton (DFA) or a nondeterministic finite automaton (NFA). This is a crucial property because it ties regular languages to automata theory.
Equivalence with Regular Expressions: Every regular language can be described by a regular expression, and conversely, every regular expression describes a regular language.

Examples of Regular Languages

Simple strings:
- The language L = {a, b} is a regular language. It consists of the strings "a" and "b", and it can be represented by the regular expression a|b.
Languages with repetition:
- The language L = {a, aa, aaa, aaaa, ...} is the set of strings with one or more 'a's. It can be represented by the regular expression a+.
Alternating strings:
- The language L = {ab, abab, ababab, ...} consists of alternating "ab" patterns. It can be described by the regular expression (ab)+.
Binary strings with even length:
- The language L = {ε, aa, ab, ba, bb, aaaa, abab, ...} consists of binary strings of even length. It can be described by the regular expression (aa|ab|ba|bb)*.

3. Constructing Regular Expressions

To construct a regular expression for a language, you generally combine basic expressions using the operations (union, concatenation, and Kleene star). Here are some examples:

Even number of 'a's:
- The language consists of strings with an even number of 'a's (e.g., "", "aa", "abab", "bbaa").
- A regular expression for this would be: (bb|ab|ba)*.
Strings with exactly one 'a':
- The language consists of strings that have exactly one 'a' (e.g., "a", "ba", "ab").
- A regular expression for this would be: b*a(b*).
Strings that contain at least one 'a':
- The language consists of strings that contain at least one 'a'.
- A regular expression for this would be: .*a.*.

4. Applications of Regular Expressions

Regular expressions are widely used in computing for various tasks, such as:

Text Searching: Searching for specific patterns in strings (e.g., using grep or find in Unix systems).
String Validation: Validating input, such as checking if a string matches a specific format (e.g., email addresses, phone numbers).
Syntax Highlighting: Identifying keywords, operators, and other elements in code editors.
Data Extraction: Extracting specific data from larger text bodies (e.g., extracting URLs from a web page).

5. From Regular Expressions to Finite Automata

A regular expression can be converted into an equivalent finite automaton (DFA, NFA, or ε-NFA). This is a critical concept because finite automata provide a concrete model for recognizing regular languages, and regular expressions offer a compact way to describe those same languages.

Thompson's Construction: This is an algorithm that converts a regular expression into an NFA.
Subset Construction: This algorithm converts an NFA (or ε-NFA) into a DFA.

These conversions illustrate that regular expressions, NFAs, and DFAs are equivalent in terms of the languages they can recognize.

6. Regular Language Theorems

Several important theorems apply to regular languages:

Myhill-Nerode Theorem: This theorem provides a characterization of regular languages by defining equivalence classes of strings.
Pumping Lemma for Regular Languages: The pumping lemma is a tool used to prove that certain languages are not regular by showing that they fail to satisfy a property that all regular languages have.
Closure Properties: Regular languages are closed under union, intersection, complement, concatenation, and Kleene star, as mentioned earlier.

Conclusion

Regular expressions and regular languages are foundational to automata theory and formal language theory. Regular expressions provide a convenient syntax for describing patterns in strings, and regular languages are the class of languages that can be recognized by finite automata or described by regular expressions. The close relationship between finite automata and regular expressions is what makes the theory of regular languages so powerful, and it has practical applications in fields like text processing, syntax analysis, and pattern recognition.

Previous topic 2

Language definitions preliminaries

Next topic 4

Finite automata (FAs)

Past Papers

Open this section to load past papers

Click on Show Past Papers to see past papers.

COMP3148›Regular expressions/Regular languages

Theory of AutomataTopic 3 of 25

Regular expressions/Regular languages

7 minread

1,190words

Intermediatelevel

Regular Expressions and Regular Languages

1. Regular Expressions (RE)

Basic Components of Regular Expressions

Literal Characters: A single symbol from the alphabet.
- Example: In the alphabet Σ = {a, b}, 'a' and 'b' are literal characters.
Union ( | ): The "or" operator. It is used to denote the union of two regular expressions, meaning either one pattern or the other.
- Example: a|b matches either the string "a" or the string "b".
Concatenation: Placing two regular expressions together represents the concatenation of the strings they describe.
- Example: ab matches the string "ab" (the concatenation of "a" and "b").
Kleene Star ( * ): The Kleene star applied to a regular expression denotes zero or more repetitions of the pattern.
- Example: a* matches any string consisting of zero or more 'a's (including the empty string, ε). This includes "", "a", "aa", "aaa", and so on.
Kleene Plus ( + ): The Kleene plus is similar to the Kleene star, but it requires at least one occurrence of the pattern.
- Example: a+ matches any string consisting of one or more 'a's, such as "a", "aa", "aaa", etc.
Parentheses ( () ): Parentheses are used to group expressions and change the order of operations in regular expressions.
- Example: (a|b)c matches either "ac" or "bc".
Character Classes ( [ ] ): A character class is used to match any one character from a specified set of characters.
- Example: [abc] matches any one of the characters 'a', 'b', or 'c'.
Ranges in Character Classes ( [ ] ): You can specify ranges of characters within character classes.
- Example: [a-z] matches any lowercase letter from 'a' to 'z'.
Negation in Character Classes ( ^ ): The caret (^) at the beginning of a character class negates the class, meaning it matches any character except the ones in the class.
- Example: [^a] matches any character except 'a'.

2. Regular Languages

Properties of Regular Languages

Closed under Operations: Regular languages are closed under several operations, meaning that applying these operations to regular languages results in another regular language. These operations include:
1. Union: The union of two regular languages is regular.
2. Intersection: The intersection of two regular languages is regular.
3. Complement: The complement of a regular language is regular.
4. Concatenation: The concatenation of two regular languages is regular.
5. Kleene Star: The Kleene star of a regular language is regular.
Finite Automata Representation: A regular language can be recognized by a finite automaton, either a deterministic finite automaton (DFA) or a nondeterministic finite automaton (NFA). This is a crucial property because it ties regular languages to automata theory.
Equivalence with Regular Expressions: Every regular language can be described by a regular expression, and conversely, every regular expression describes a regular language.

Examples of Regular Languages

Simple strings:
- The language L = {a, b} is a regular language. It consists of the strings "a" and "b", and it can be represented by the regular expression a|b.
Languages with repetition:
- The language L = {a, aa, aaa, aaaa, ...} is the set of strings with one or more 'a's. It can be represented by the regular expression a+.
Alternating strings:
- The language L = {ab, abab, ababab, ...} consists of alternating "ab" patterns. It can be described by the regular expression (ab)+.
Binary strings with even length:
- The language L = {ε, aa, ab, ba, bb, aaaa, abab, ...} consists of binary strings of even length. It can be described by the regular expression (aa|ab|ba|bb)*.

3. Constructing Regular Expressions

To construct a regular expression for a language, you generally combine basic expressions using the operations (union, concatenation, and Kleene star). Here are some examples:

Even number of 'a's:
- The language consists of strings with an even number of 'a's (e.g., "", "aa", "abab", "bbaa").
- A regular expression for this would be: (bb|ab|ba)*.
Strings with exactly one 'a':
- The language consists of strings that have exactly one 'a' (e.g., "a", "ba", "ab").
- A regular expression for this would be: b*a(b*).
Strings that contain at least one 'a':
- The language consists of strings that contain at least one 'a'.
- A regular expression for this would be: .*a.*.

4. Applications of Regular Expressions

Regular expressions are widely used in computing for various tasks, such as:

Text Searching: Searching for specific patterns in strings (e.g., using grep or find in Unix systems).
String Validation: Validating input, such as checking if a string matches a specific format (e.g., email addresses, phone numbers).
Syntax Highlighting: Identifying keywords, operators, and other elements in code editors.
Data Extraction: Extracting specific data from larger text bodies (e.g., extracting URLs from a web page).

5. From Regular Expressions to Finite Automata

Thompson's Construction: This is an algorithm that converts a regular expression into an NFA.
Subset Construction: This algorithm converts an NFA (or ε-NFA) into a DFA.

These conversions illustrate that regular expressions, NFAs, and DFAs are equivalent in terms of the languages they can recognize.

6. Regular Language Theorems

Several important theorems apply to regular languages:

Myhill-Nerode Theorem: This theorem provides a characterization of regular languages by defining equivalence classes of strings.
Pumping Lemma for Regular Languages: The pumping lemma is a tool used to prove that certain languages are not regular by showing that they fail to satisfy a property that all regular languages have.
Closure Properties: Regular languages are closed under union, intersection, complement, concatenation, and Kleene star, as mentioned earlier.

Conclusion

Previous topic 2

Language definitions preliminaries

Next topic 4

Finite automata (FAs)

Past Papers

Open this section to load past papers

Click on Show Past Papers to see past papers.