Write a regex parser for '*', '+', and '.'.

Hard
14 years ago

Write code to parse regular expressions including "*" (0 or more characters), "+" (1 or more characters), and "." (1 exact character). For example:

  1. parse("a.", "ab") should return true.
  2. parse("a*", "aaa") should return true.
  3. parse("a+", "a") should return true.
  4. parse("a+", "") should return false.
  5. parse("a.*b", "axyzb") should return true.
  6. parse("a.*b", "axyz") should return false.
  7. parse("abc", "abc") should return true.
  8. parse("abc", "abd") should return false.

Your function should take two strings as input: the regular expression and the string to match against. It should return a boolean indicating whether the string matches the regular expression. Focus on handling the special characters ".", "*", and "+". Assume that other characters in the regular expression must match literally. Explain your approach and provide a clear, well-documented code implementation. Consider edge cases such as empty strings and null inputs. Also think about the time and space complexity of your solution, and try to optimize it for performance. Can you provide a solution using dynamic programming?

Sample Answer

Regular Expression Parser

This response provides a code implementation for parsing regular expressions that include the special characters * (0 or more characters), + (1 or more characters), and . (1 exact character). It starts with a naive recursive approach, then presents an optimized dynamic programming solution, along with detailed explanations of time and space complexity, edge case handling, and illustrative examples.

1. Naive Recursive Approach

Approach

The basic idea is to use recursion to explore all possible matches. We traverse both the regular expression and the input string. When we encounter a special character, we handle it accordingly. The . character matches any single character. The * character matches zero or more occurrences of the preceding character. The + character matches one or more occurrences of the preceding character.

Code

def parse_recursive(regex, text):
    if not regex:
        return not text

    if not text and regex:
        if len(regex) > 1 and regex[1] in ('*', '+'):
            return parse_recursive(regex[2:], text)
        else:
            return False

    if len(regex) > 1 and regex[1] == '*':
        # Case 1: Zero occurrences
        if parse_recursive(regex[2:], text):
            return True

        # Case 2: One or more occurrences
        if regex[0] == '.' or regex[0] == text[0]:
            return parse_recursive(regex, text[1:])
        else:
            return False

    elif len(regex) > 1 and regex[1] == '+':
        # Case 1: One or more occurrences
        if regex[0] == '.' or regex[0] == text[0]:
            if parse_recursive(regex[2:], text[1:]):
                return True
            return parse_recursive(regex, text[1:])
        else:
            return False

    elif regex[0] == '.' or regex[0] == text[0]:
        return parse_recursive(regex[1:], text[1:])
    else:
        return False

Example Usage

print(parse_recursive("a.", "ab"))  # True
print(parse_recursive("a*", "aaa")) # True
print(parse_recursive("a+", "a"))   # True
print(parse_recursive("a+", ""))    # False
print(parse_recursive("a.*b", "axyzb")) # True
print(parse_recursive("a.*b", "axyz"))  # False
print(parse_recursive("abc", "abc"))  # True
print(parse_recursive("abc", "abd"))  # False

2. Dynamic Programming Approach (Optimal)

Approach

Dynamic programming offers a more efficient solution by avoiding redundant calculations. We create a 2D table dp where dp[i][j] is True if the first i characters of the regular expression match the first j characters of the text, and False otherwise.

Code

def parse_dp(regex, text):
    n = len(regex)
    m = len(text)

    dp = [[False] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = True

    for i in range(1, n + 1):
        if regex[i - 1] == '*' and i > 1:
            dp[i][0] = dp[i - 2][0]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if regex[i - 1] == '.' or regex[i - 1] == text[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            elif regex[i - 1] == '*':
                if i > 1:
                    dp[i][j] = dp[i - 2][j]  # Zero occurrences
                    if regex[i - 2] == '.' or regex[i - 2] == text[j - 1]:
                        dp[i][j] = dp[i][j] or dp[i][j - 1] # One or more occurrences
            elif regex[i-1] == '+':
                if i > 1:
                  if regex[i-2] == '.' or regex[i-2] == text[j-1]:
                    dp[i][j] = dp[i-2][j-1] or dp[i][j-1]


            else:
                dp[i][j] = False

    return dp[n][m]

Example Usage

print(parse_dp("a.", "ab"))  # True
print(parse_dp("a*", "aaa")) # True
print(parse_dp("a+", "a"))   # True
print(parse_dp("a+", ""))    # False
print(parse_dp("a.*b", "axyzb")) # True
print(parse_dp("a.*b", "axyz"))  # False
print(parse_dp("abc", "abc"))  # True
print(parse_dp("abc", "abd"))  # False

3. Time Complexity Analysis

Recursive Approach

  • Time Complexity: O(2<sup>(m+n)</sup>), where n is the length of the regular expression and m is the length of the text. This is because, in the worst case (especially with * and +), the recursive function might branch into multiple paths for each character, leading to exponential time complexity.

Dynamic Programming Approach

  • Time Complexity: O(n*m), where n is the length of the regular expression and m is the length of the text. We iterate through the dp table once, which has dimensions (n+1) x (m+1). Each cell calculation takes constant time.

4. Space Complexity Analysis

Recursive Approach

  • Space Complexity: O(m+n) in the worst case, due to the call stack during recursion, where n is the length of the regular expression and m is the length of the text.

Dynamic Programming Approach

  • Space Complexity: O(n*m), where n is the length of the regular expression and m is the length of the text. This is because we store the dp table with dimensions (n+1) x (m+1). The space complexity can be reduced to O(min(n, m)) by storing only the previous row/column of the DP table, but at the cost of code readability.

5. Edge Case Handling

  • Empty Regular Expression: If the regular expression is empty, the text should also be empty for a match.
  • Empty Text: If the text is empty, the regular expression must consist of characters followed by *, or be empty itself, to result in a match.
  • Null Inputs: Handle null (None in Python) inputs by returning False or raising an exception, depending on the specific requirements.
  • Consecutive * or +: The regex should be parsed correctly, as the algorithm considers 0 or more/1 or more occurrences of the preceding character.

6. Optimizations and Considerations

  • Dynamic Programming: The dynamic programming solution significantly improves performance compared to the naive recursive approach.
  • Precompilation: For repeated use of the same regular expression, precompiling the regex pattern (not shown here, but possible using regex libraries) can further optimize performance.
  • Early Termination: In some cases, the algorithm can be optimized to terminate early if a mismatch is detected that cannot be recovered.

7. Summary

This response has presented two approaches to regular expression parsing with the special characters *, +, and .. The dynamic programming approach provides a significant performance improvement over the naive recursive approach, making it the preferred solution for more complex regular expressions or larger input texts. The code examples include clear implementations, and the analysis covers time and space complexity, edge case handling, and optimization strategies.