Sequence Alignment Task

A Sequence Alignment Task is a Range-type String Matching Task that requires the identification of strings with a log string distance to a given string.

AKA: String Alignment Task.
Context:
- It can be solved by a Sequence Alignment System (that implements a sequence alignment algorithm).
- It can range from being a Local Sequence Alignment Task to being a Global Sequence Alignment Task.
- It can range from being a Pairwise Sequence Alignment Task, to being a Multiple Sequence Alignment Task, to being a Structural Alignment Task.
- It can range from being a Genetic Sequence Alignment Task to being a Linguistic Sequence Alignment Task.
Example(s):
- a DNA/RNA Sequence Alignment Task,
- a Text Alignment Task,
- a Word Alignment Task.
- …
Counter-Example(s):
- a Data Alignment Task,
- a Sequence Matching Task,
- a String Parsing Task.
See: Needleman-Wunsch Algorithm, Sequence Homology, Longest Common Subsequence, Shortest Common Supersequence, Longest Common Substring, Shortest Common Superstring, Approximate String Matching, Phylogenetic Analysis Task, Alignment-free Sequence Analysis Task, Levenshtein Distance, Edit Distance, Alignment Distance, Sequential Pattern Mining Task.

References

2021a

(Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Sequence_alignment Retrieved:2021-2-21.
- In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.^[1] Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.
  Sequence alignments are also used for non-biological sequences, such as calculating the distance cost between strings in a natural language or in financial data.
  (...)
  Very short or very similar sequences can be aligned by hand. However, most interesting problems require the alignment of lengthy, highly variable or extremely numerous sequences that cannot be aligned solely by human effort. Instead, human knowledge is applied in constructing algorithms to produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences). Computational approaches to sequence alignment generally fall into two categories: global alignments and local alignments. Calculating a global alignment is a form of global optimization that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity.^[2] A variety of computational algorithms have been applied to the sequence alignment problem. These include slow but formally correct methods like dynamic programming. These also include efficient, heuristic algorithms or probabilistic methods designed for large-scale database search, that do not guarantee to find best matches.

↑ Mount DM. (2004). Bioinformatics: Sequence and Genome Analysis (2nd ed.). Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY. ISBN 978-0-87969-608-5.
↑ Polyanovsky, V. O.; Roytberg, M. A.; Tumanyan, V. G. (2011). "Comparative analysis of the quality of a global algorithm and a local algorithm for alignment of two sequences". Algorithms for Molecular Biology. 6 (1): 25. doi:10.1186/1748-7188-6-25. PMC 3223492. PMID 22032267. S2CID 2658261.

2021b

(Xie, 2021) ⇒ https://www.ics.uci.edu/~xhx/courses/python/tmp/alignment.pdf Retrieved:2021-2-21.
- QUOTE: An alignment of two sequences $S$ and $T$ is obtained by first inserting spaces ('−') either into, before or at the ends of $S$ and $T$ to obtain $S'$ and $T'$ such that $|S'|= |T'|$, and then placing $S'$ on top of $T'$ such that every character in $S'$ is uniquely aligned with a character in $T'$.

2020

(Wijffels, 2020) ⇒ Jan Wijffels (2020). Package ‘text.alignment’: https://cran.r-project.org/web/packages/text.alignment/text.alignment.pdf
- QUOTE: Align text using the Smith-Waterman algorithm. The Smith-Waterman algorithm performs local sequence alignment. It finds similar regions between two strings. Similar regions are a sequence of either characters or words which are found by matching the characters or words of 2 sequences of strings.
  If the word/letter is the same in each text, the alignment score is increased with the match score, while if they are not the same the local alignment score drops by the gap score. If one of the 2 texts contains extra words letters, the score drops by the mismatch score.

2019

(Mirabello & Wallner, 2019) ⇒ Claudio Mirabello, and Bjorn Wallner (2019). "rawMSA: End-to-end Deep Learning using raw Multiple Sequence Alignments". In: PLoS ONE 14(8):e0220182. DOI:10.1371/journal.pone.0220182.

2015

(Eger, 2015) ⇒ Steffen Eger (2015, July). "Multiple Many-to-Many Sequence Alignment for Combining String-Valued Variables: A G2P Experiment". In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

2011

(Will, 2011) ⇒ S. Will (2011). "Alignment". In: MIT Course 18.417, Fall 2011.
- QUOTE: Motivation: assess similarity of sequences and learn about their evolutionary relationship.
  (...)
  For pairwise sequence comparison: define edit distance, define alignment distance, show equivalence of distances, define alignment problem and efficient algorithm gap penalties, local alignment.

2002

(Navarro & Raffinot, 2002) ⇒ Gonzalo Navarro, and Mathieu Raffinot. (2002). “Flexible Pattern Matching in Strings." Cambridge University Press.
- QUOTE: Sequence comparison is about determining similarities and correspondences between two or more strings. It is related to approximate searching (Chapter 6) and has many applications in computational biology, speech recognition, computer science, coding theory, chromatography, and so on. These applications look for similarities between sequences of symbols. The general goal is to perform basic operation over the strings until they become equal.
  ... A concept of "distance" between two strings can be defined according to the minimum cost of making them equal.
  ... In some cases it is useful to measure the degree of similarity rather than of dissimilarity (i.e., a distance). One example is the LCS, a heavily studied measure. Other examples are the shortest common supersequence (SCS), longest common substring (LCG, different from LCS because the common string has to be a contiguous substring of both sequences), and shortest common superstring (SCG), as well as their version or more than two strings.

1999

(Tiedemann, 1999) ⇒ . Jorg Tiedemann (1999). "Word Alignment Step by Step". In: Proceedings of the 12th Nordic Conference of Computational Linguistics (NODALIDA 1999).
- QUOTE: Word alignment aims at the identification of translation equivalents between linguistic units below the sentence level within parallel text (Merkel 1999^[1]), mainly bilingual text ibitext). Those units include single-word units (SlVUs) and multi-word units (MWUs) and will be referred to as link units further on. The basic terminology for describing parallel text and word alignment inthis paper follows Ahrenberg et al (1999)^[2] and Ahrenberg et al (forthcoming)^[3]. In particular, each word correspondence in the bitext describes a link instance, or simply a link. A pair of link units that is instantiated in the bitext will be referred to as link type. Word alignment systems usually assume segmented bitext (sentence aligned bitext). Common bitext segments are sentence fragments, sentences, and sequences of sentences that have corresponding units in the translation.

↑ Merkel, M. 1999. Understanding and enhancing translation by parallel text processing. Linköping Studies in Science and Technology. Dissertation No. 607. Linköping University. Dept, of Computer and Information Science.
↑ Ahrenberg, L., Merkel, M., Sagvall Hein, A., and Tiedemann, J. 1999. Evaluating LWA and UWA. PLUG deliverable 3A.1. Internal report.
↑ Ahrenberg, L., Merkel, M., Sagvall Hein, A., and Tiedemann, J. forthcoming. Evaluation of Word Alignment Systems. In: Proceedings of the International Conference on Language Resources and Evaluation, LREC-2000, Athens, Greece, 2000.

1981

(Smith & Waterman, 1981) ⇒ Temple F. Smith, Michael S. Waterman. (1981). “Identification of Common Molecular Subsequences.” In: Journal of Molecular Biology, 147. doi:10.1016/0022-2836(81)90087-5.

1974

(Wagner & Fischer, 1974) ⇒ Robert A. Wagner, and Michael J. Fischer. (1974). “The String to String Correction Problem.” In: Journal of the ACM (JACM), 21(1).
- QUOTE: The string-to-string correction problem is to determine the distance between two strings as measured by the minimum cost sequence of “edit operations” needed to change the one string into the other. The edit operations investigated allow changing one symbol of a string into another single symbol, deleting one symbol from a string, or inserting a single symbol into a string. An algorithm is presented which solves this problem in time proportional to the product of the lengths of the two strings. Possible applications are to the problems of automatic spelling correction and determining the longest subsequence of characters common to two strings.

[mount-1] Mount DM. (2004). Bioinformatics: Sequence and Genome Analysis (2nd ed.). Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY. ISBN 978-0-87969-608-5.

[Polyanovsky2011-2] Polyanovsky, V. O.; Roytberg, M. A.; Tumanyan, V. G. (2011). "Comparative analysis of the quality of a global algorithm and a local algorithm for alignment of two sequences". Algorithms for Molecular Biology. 6 (1): 25. doi:10.1186/1748-7188-6-25. PMC 3223492. PMID 22032267. S2CID 2658261.

[3] Merkel, M. 1999. Understanding and enhancing translation by parallel text processing. Linköping Studies in Science and Technology. Dissertation No. 607. Linköping University. Dept, of Computer and Information Science.

[4] Ahrenberg, L., Merkel, M., Sagvall Hein, A., and Tiedemann, J. 1999. Evaluating LWA and UWA. PLUG deliverable 3A.1. Internal report.

[5] Ahrenberg, L., Merkel, M., Sagvall Hein, A., and Tiedemann, J. forthcoming. Evaluation of Word Alignment Systems. In: Proceedings of the International Conference on Language Resources and Evaluation, LREC-2000, Athens, Greece, 2000.

[1]

[2]

[1]

[2]

[3]