String Matching Task

AKA: Pattern Matching in Strings Task, String Search Task, String Pattern Matching Task, String Search, String Matching, String Matching Problem.
Context:
- Input:
  - a String Pattern.
  - a Target String.
- output: one or more Substring Locations.
- It can range from being an Exact String Matching Task to being an Approximate String Matching Task.
- It can range from being a Shortest Common Supersequence Matching Task to being a Longest Common Subsequence Matching Task.
- It can range from being a Shortest Common Substring Matching Task to being a Shortest Common Superstring Matching Task.
- It can range from being a Single String Matching Task to being a Multiple String Matching Task.
- It can range from being a Simple String Matching Task to being an Extended String Matching Task.
- It can be, depending on the String Pattern Type: a Regular Expression Matching Task.
- It can be solved by a String Pattern Matching System (that implements a String Pattern Matching algorithm).
Example(s):
- a Text Matching Task.
- a DNA Sequence Matching Task.
- …
Counter-Example(s):
- a String Parsing Task.
- a Graph Matching Search Task.
- a Tuple Matching Task.
See: Detection Task, String, Computer Character, Pattern Recognition Task, String-Level Text Error Correction System.

References

(Wikipedia, 2009) ⇒ http://en.wikipedia.org/wiki/String_searching_algorithm
- String searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text.
- Let Σ be an alphabet (finite set). Formally, both the pattern and searched text are concatenations of elements of Σ. The Σ may be a usual human alphabet (for example, the letters A through Z in English). Other applications may use binary alphabet (Σ = {0,1}) or DNA alphabet (Σ = {A,C,G,T}) in bioinformatics.
- In practice how the string is encoded can affect the feasible string search algorithms. In particular if a variable width encoding is in use then it is slow (time proportional to N) to find the Nth character. This will significantly slow down many of the more advanced search algorithms. A possible solution is to search for the sequence of code units instead, but doing so may produce false matches unless the encoding is specifically designed to avoid it.

(Navarro & Raffinot, 2002) ⇒ Gonzalo Navarro, and Mathieu Raffinot. (2002). “Flexible Pattern Matching in Strings." Cambridge University Press.
- QUOTE: String matching problems range from the relatively simple task of searching a single text for a string of characters to searching a database for approximate occurrences of a complex pattern. Recent years have witnessed a dramatic increase of interest in sophisticated string matching problems, especially in information retrieval and computational biology. This book presents a practical approach to string matching problems, focusing on the algorithms and implementations that perform best in practice. It covers searching for simple, multiple and extended strings, as well as regular expressions, and exact and approximate searching. It includes all the most significant new developments in complex pattern searching. The clear explanations, step-by-step examples, algorithm pseudocode, and implementation efficiency maps will enable researchers, professionals and students in bioinformatics, computer science, and software engineering to choose the most appropriate algorithms for their applications.

(Knuth et al., 1977) ⇒ Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. (1977). “Fast Pattern Matching in Strings.” In: SIAM Journal on Computing, 6(2).
- ABSTRACT: An algorithm is presented which finds all occurrences of one given string within another, in running time proportional to the sum of the lengths of the strings. The constant of proportionality is low enough to make this algorithm of practical use, and the procedure can also be extended to deal with some more general pattern-matching problems. A theoretical application of the algorithm shows that the set of concatenations of even palindromes, i.e., the language $\{\alpha \alpha ^R\}^*$, can be recognized in linear time. Other algorithms which run even faster on the average are also considered.

(Aho & Corasick, 1975) ⇒ Alfred V. Aho, and Margaret J. Corasick. (1975). “Efficient String Matching: An aid to bibliographic search.” In: Communications of the ACM, 18(6).
- ABSTRACT: This paper describes a simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text. The algorithm consists of constructing a finite state pattern matching machine from the keywords and then using the pattern matching machine to process the text string in a single pass. Construction of the pattern matching machine takes time proportional to the sum of the lengths of the keywords. The number of state transitions made by the pattern matching machine in processing the text string is independent of the number of keywords. The algorithm has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.