Shortest Common Superstring(SCS)

From GM-RKB
Jump to navigation Jump to search

A Shortest Common Superstring(SCS) is a Shortest Common Supersequence in which the shortest string that contains two more strings as Substrings.



References

2019

  • (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/Shortest_common_supersequence_problem#Shortest_common_superstring Retrieved:2019-12-12.
    • The closely related problem of finding a string which is a superstring of a finite set of strings [math]\displaystyle{ S=\{s_1,s_2,\cdots,s_n\} }[/math] is NP-Complete. Also, good (constant factor) approximations have been found for the average case but not for the worst case. However, it can be formulated as an instance of weighted set cover in such a way that the weight of the optimal solution to the set cover instance is less than twice the length of the shortest superstring . One can then use the O(log())-approximation for weighted set-cover to obtain an O(log())-approximation for the shortest superstring (note that this is not a constant factor approximation).

      For any string [math]\displaystyle{ x }[/math] in this alphabet, define [math]\displaystyle{ P(x) }[/math] to be the set of all strings which are substrings of [math]\displaystyle{ x }[/math]. The instance [math]\displaystyle{ I }[/math] of set cover is formulated as follows:

      • Let [math]\displaystyle{ M }[/math] be empty.
      • For each pair of strings [math]\displaystyle{ s_i }[/math] and [math]\displaystyle{ s_j }[/math], if the last [math]\displaystyle{ k }[/math] symbols of [math]\displaystyle{ s_i }[/math] are the same as the first [math]\displaystyle{ k }[/math] symbols of [math]\displaystyle{ s_j }[/math], then add a string to [math]\displaystyle{ M }[/math] that consists of the concatenation with maximal overlap of [math]\displaystyle{ s_i }[/math] with [math]\displaystyle{ s_j }[/math].
      • Define the universe [math]\displaystyle{ \mathcal{U} }[/math] of the set cover instance to be [math]\displaystyle{ S }[/math]
      • Define the set of subsets of the universe to be [math]\displaystyle{ \{P(x)| x \in S \cup M\} }[/math]
      • Define the cost of each subset [math]\displaystyle{ P(x) }[/math] to be [math]\displaystyle{ |x| }[/math], the length of [math]\displaystyle{ x }[/math].
The instance [math]\displaystyle{ I }[/math] can then be solved using an algorithm for weighted set cover, and the algorithm can output an arbitrary concatenation of the strings [math]\displaystyle{ x }[/math] for which the weighted set cover algorithm outputs [math]\displaystyle{ P(x) }[/math].

2002

  • (Navarro & Raffinot, 2002) ⇒ Gonzalo Navarro, and Mathieu Raffinot. (2002). “Flexible Pattern Matching in Strings." Cambridge University Press.
    • Sequence comparison is about determining similarities and correspondences between two or more strings. It is related to approximate searching (Chapter 6) and has many applications in computational biology, speech recognition, computer science, coding theory, chromatography, and so on. These applications look for similarities between sequences of symbols. The general goal is to perform basic operation over the strings until they become equal.
    • A concept of "distance" between two strings can be defined according to the minimum cost of making them equal.
    • In some cases it is useful to measure the degree of similarity rather than of dissimilarity (i.e., a distance). One example is the LCS, a heavily studeis measure. Other examples are the shortest common supersequence (SCS), longest common substring (LCG, different from LCS because the common string has to be a contiguous substring of both sequences), and shortest common superstring (SCG), as well as their version or more than two strings.