ScholarQuill logoScholarQuillUniversity Notes
  • Notes
  • Past Papers
  • Blogs
  • Todo
Login
ScholarQuill logoScholarQuillUniversity Notes
Login
NotesPast PapersBlogsTodo
More
SubjectsDiscussionCGPA CalculatorGPA CalculatorStudent PortalCourse Outline
About
About usPrivacy PolicyReportContact
Notes
Past Papers
Blogs
Todo
Analytics
    Current Subject
    🧩
    Analysis of Algorithms
    COMP4121
    Progress0 / 28 topics
    Topics
    1. Introduction2. Role of Algorithms in Computing3. Analysis on Nature of Input and Size of Input4. Asymptotic Notations5. Big-O Notation6. Big-Ω Notation7. Big-Θ Notation8. Little-o Notation9. Little-ω Notation10. Sorting Algorithm Analysis11. Loop Invariants12. Recursion and Recurrence Relations13. Algorithm Design Techniques14. Brute Force Approach15. Divide-and-Conquer Approach16. Merge Sort17. Quick Sort18. Greedy Approach19. Dynamic Programming20. Elements of Dynamic Programming21. Search Trees22. Heaps23. Hashing24. Graph Algorithms25. Shortest Paths26. Sparse Graphs27. String Matching28. Introduction to Complexity Classes
    COMP4121›String Matching
    Analysis of AlgorithmsTopic 27 of 28

    String Matching

    9 minread
    1,574words
    Intermediatelevel

    String Matching

    String matching (or pattern matching) is the process of finding one or more occurrences of a substring (called the pattern) within a larger string (called the text). This is a fundamental problem in computer science and is widely used in various applications, including text searching, DNA sequence analysis, spell-checking, and more.

    In string matching, the goal is to find all occurrences of the pattern PPP of length mmm within the text TTT of length nnn.

    Types of String Matching Problems

    1. Exact String Matching: The objective is to find all occurrences of a pattern in the text such that the pattern matches the text exactly (with no deviations).

    2. Approximate String Matching: The goal is to find occurrences of a pattern within the text, but with some allowed mismatches, insertions, or deletions (i.e., finding strings that are close to the pattern).

    Basic String Matching Algorithms

    1. Naive String Matching Algorithm

    The Naive String Matching Algorithm is a simple approach where we check all possible positions in the text where the pattern could fit and compare the characters of the pattern with the corresponding characters in the text.

    Algorithm Steps:

    1. Start at the first position of the text.
    2. Compare the pattern with the substring of the text starting at the current position.
    3. If all characters match, record the position.
    4. Move one position forward and repeat the process until all possible positions in the text are checked.

    Time Complexity:

    • In the worst case, the algorithm compares every character in the text with every character in the pattern, leading to a time complexity of O((n−m+1)×m)O((n - m + 1) \times m)O((n−m+1)×m), where nnn is the length of the text and mmm is the length of the pattern.
    • This results in a worst-case time complexity of O(n×m)O(n \times m)O(n×m).

    Advantages:

    • Simple and easy to implement.

    Disadvantages:

    • Inefficient for large texts or patterns due to the high worst-case time complexity.

    2. Knuth-Morris-Pratt (KMP) Algorithm

    The KMP Algorithm improves upon the naive algorithm by avoiding redundant comparisons. It uses the concept of the prefix function (also called the partial match table) to store information about previously matched characters and skip unnecessary comparisons.

    Algorithm Steps:

    1. Preprocess the pattern to create a partial match table (or prefix table).
    2. Use this table to determine how far to skip when a mismatch occurs.
    3. Traverse the text and the pattern, using the table to avoid re-checking characters that have already been matched.

    The prefix table for a pattern tells you, when a mismatch occurs, the next position to start comparing in the pattern.

    Time Complexity:

    • Preprocessing the pattern takes O(m)O(m)O(m) time.
    • The matching phase runs in O(n)O(n)O(n) time.
    • Overall, the time complexity is O(n+m)O(n + m)O(n+m).

    Advantages:

    • More efficient than the naive algorithm, especially for large inputs.

    Disadvantages:

    • Slightly more complex to implement.

    3. Rabin-Karp Algorithm

    The Rabin-Karp Algorithm uses hashing to perform string matching. Instead of comparing the entire pattern with the text, it computes the hash value of the pattern and compares it with hash values of substrings of the text.

    Algorithm Steps:

    1. Calculate the hash value of the pattern.
    2. Compute the hash value of each substring of the text that has the same length as the pattern.
    3. If the hash values match, perform a direct comparison to check if the strings are equal (since hash collisions can occur).
    4. If a match is found, record the position.

    Time Complexity:

    • In the average case, the time complexity is O(n+m)O(n + m)O(n+m), where nnn is the length of the text and mmm is the length of the pattern.
    • In the worst case (due to hash collisions), it can degrade to O(n×m)O(n \times m)O(n×m).

    Advantages:

    • Efficient in practice for large texts and patterns when hash functions are good and collisions are rare.

    Disadvantages:

    • Worst-case time complexity can be high due to hash collisions.
    • Requires extra space to store the hash values.

    4. Boyer-Moore Algorithm

    The Boyer-Moore Algorithm is considered one of the most efficient string matching algorithms for practical use, especially when the pattern is relatively long. It works by scanning the text from right to left (as opposed to left to right like other algorithms), and uses two heuristics to skip sections of the text that cannot possibly match the pattern.

    Algorithm Steps:

    1. Preprocess the pattern to create two tables:
      • Bad character rule: If a mismatch occurs, shift the pattern so that the mismatched character in the text aligns with the last occurrence of that character in the pattern (or shift it past the mismatched character if it's not in the pattern).
      • Good suffix rule: If a mismatch occurs, shift the pattern so that a suffix of the pattern that matched a substring of the text aligns with a matching substring of the text.
    2. Use the preprocessed tables to efficiently shift the pattern over the text to find all occurrences.

    Time Complexity:

    • In the best case, the time complexity is O(n/m)O(n/m)O(n/m), where nnn is the length of the text and mmm is the length of the pattern.
    • In the worst case, the time complexity is O(n×m)O(n \times m)O(n×m), though this is rare in practice.

    Advantages:

    • Very fast in practice, especially with long patterns, due to the efficient use of the bad character and good suffix rules.

    Disadvantages:

    • More complex to implement compared to other algorithms.

    Approximate String Matching Algorithms

    In many real-world applications, exact matches are not always expected. Approximate string matching allows for small differences between the pattern and the text. These differences can be in the form of:

    • Substitutions (mismatched characters)
    • Insertions (extra characters in the text)
    • Deletions (missing characters in the pattern)

    1. Levenshtein Distance (Edit Distance)

    The Levenshtein distance is a measure of the difference between two strings. It counts the minimum number of single-character edits (insertions, deletions, or substitutions) required to convert one string into the other.

    Dynamic Programming Approach:

    • Use a matrix where the cell [i][j][i][j][i][j] represents the minimum number of operations needed to convert the first iii characters of the pattern into the first jjj characters of the text.
    • The algorithm fills the matrix by considering the three operations (insert, delete, substitute) and chooses the minimum cost.

    Time Complexity:

    • O(m×n)O(m \times n)O(m×n), where mmm is the length of the pattern and nnn is the length of the text.

    2. Hamming Distance

    The Hamming distance measures the number of positions at which two strings of equal length differ. It is useful when the strings are of equal length and only substitutions are allowed.

    Time Complexity:

    • O(m)O(m)O(m), where mmm is the length of the strings.

    Summary of String Matching Algorithms

    Algorithm Time Complexity Space Complexity Key Advantage
    Naive Algorithm O(n×m)O(n \times m)O(n×m) O(1)O(1)O(1) Simple to implement.
    KMP Algorithm O(n+m)O(n + m)O(n+m) O(m)O(m)O(m) Efficient with preprocessing.
    Rabin-Karp Algorithm O(n+m)O(n + m)O(n+m) (avg) O(n)O(n)O(n) Efficient with hashing.
    Boyer-Moore Algorithm O(n/m)O(n/m)O(n/m) (best) O(m)O(m)O(m) Fast in practice for long patterns.
    Levenshtein Distance O(m×n)O(m \times n)O(m×n) O(m×n)O(m \times n)O(m×n) Used for approximate matching.

    Applications of String Matching

    • Text Searching: Used in search engines and text editors to find specific words or phrases.
    • DNA Sequence Analysis: Finding similar gene sequences in biological data.
    • Spell Checkers: Identifying correctly spelled words and suggesting corrections.
    • Data Compression: Algorithms like Lempel-Ziv use string matching to compress data efficiently.

    Conclusion

    String matching is a crucial problem in computer science with various algorithms designed to solve it efficiently under different circumstances. While simple algorithms like the naive approach are easy to implement, more advanced algorithms like Boyer-Moore and KMP offer significantly better performance, especially with large inputs. For approximate matching, algorithms like Levenshtein distance are key in dealing with minor mismatches, insertions, and deletions.

    Previous topic 26
    Sparse Graphs
    Next topic 28
    Introduction to Complexity Classes

    Past Papers

    Open this section to load past papers

    Click on Show Past Papers to see past papers.
    On This Page
      Reading Stats
      Est. reading time9 min
      Word count1,574
      Code examples0
      DifficultyIntermediate