Sentence Boundary Detection Task
A Sentence Boundary Detection Task is a text segmentation task that requires the segmentation of a linguistic expression into its component natural language sentences.
- AKA: SST, End-of-Sentence Detection, Sentence Splitting.
- Context:
- Input: a Text Item.
- Output: a Segmented Text Item demarcated by sentences.
- Measure: F1 Measure.
- It can range from being a Rule-based Sentence Boundary Detection Task to being a Data-driven Sentence Boundary Detection Task (such as a Supervised Sentence Boundary Detection Task).
- It can range from being a Written Sentence Boundary Detection Task to being a Spoken Sentence Boundary Detection Task.
- It can be solved by a Sentence Boundary Detection System (that implements a Sentence Boundary Detection Algorithm).
- Example(s):
- "I saw E. coli under the microscope with Dr. Smith. They were moving." ⇒
<SENT>I saw E. coli under the microscope with Dr. Smith.</SENT> <SENT>They were moving.</SENT>
- Counter-Example(s):
- See: Punctuation Mark, Full Stop, Abbreviation, Decimal Point, Ellipsis, Question Mark, Exclamation Mark.
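The F1 Measure named above can be illustrated on the example sentence pair. This is a minimal sketch that assumes boundaries are represented as character offsets just past each sentence-final punctuation mark. The function name `boundary_f1` and the offset representation are illustrative assumptions, not a standard evaluation protocol.

```python
def boundary_f1(gold, predicted):
    """Return (precision, recall, F1) for two collections of boundary offsets."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


text = "I saw E. coli under the microscope with Dr. Smith. They were moving."

# Gold boundaries (assumed representation): offsets just past each sentence end.
first_end = text.index("Smith.") + len("Smith.")
gold = {first_end, len(text)}

# A naive baseline that treats every period as a sentence boundary.
naive = {i + 1 for i, ch in enumerate(text) if ch == "."}

precision, recall, f1 = boundary_f1(gold, naive)
```

On this text the naive baseline proposes four boundaries (after "E.", "Dr.", "Smith." and "moving."), of which two are correct, so precision is 0.5, recall is 1.0, and F1 is about 0.67.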
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation Retrieved:2015-4-11.
- QUOTE: Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Often natural language processing tools require their input to be divided into sentences for a number of reasons. However sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, decimal point, an ellipsis, or an email address - not the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. As well, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang.
Languages like Japanese and Chinese have unambiguous sentence-ending markers.
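The ambiguity of the period described in this quote can be made concrete with a small heuristic that labels each period by its likely role. This is only a sketch: the label names, the email pattern, and the tiny `ABBREVIATIONS` set are all assumptions chosen for illustration, and a practical system would need far richer evidence.

```python
import re

# Hypothetical, deliberately tiny abbreviation list used only for illustration.
ABBREVIATIONS = {"dr", "mr", "mrs", "e", "inc"}


def classify_periods(text):
    """Label each '.' as ellipsis, decimal, email, abbreviation, or sentence_end."""
    # Character spans covered by email-like tokens (rough pattern, an assumption).
    email_spans = [m.span() for m in re.finditer(r"\S+@\S+\.\w+", text)]
    labels = []
    for i, ch in enumerate(text):
        if ch != ".":
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        # Word immediately before the period, lowercased.
        match = re.search(r"(\w+)$", text[:i])
        word = match.group(1).lower() if match else ""
        if prev_ch == "." or next_ch == ".":
            label = "ellipsis"
        elif prev_ch.isdigit() and next_ch.isdigit():
            label = "decimal"
        elif any(start <= i < end for start, end in email_spans):
            label = "email"
        elif word in ABBREVIATIONS:
            label = "abbreviation"
        else:
            label = "sentence_end"
        labels.append((i, label))
    return labels


sample = "Dr. Smith paid 3.50 dollars... Write to a.b@example.com today."
period_labels = [label for _, label in classify_periods(sample)]
```

In the sample string only the final period is labelled `sentence_end`. The other seven are attributed to an abbreviation, a decimal point, an ellipsis, or an email address.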
2011
- (Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/Text_segmentation#Sentence_segmentation
- QUOTE: Sentence segmentation is the problem of dividing a string of written language into its component sentences. In English and some other languages, using punctuation, particularly the full stop character is a reasonable approximation. However even in English this problem is not trivial due to the use of the full stop character for abbreviations, which may or may not also terminate a sentence. For example Mr. is not its own sentence in "Mr. Smith went to the shops in Jones Street." When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.
As with word segmentation, not all written languages contain punctuation characters which are useful for approximating sentence boundaries.
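The abbreviation-table idea mentioned in the quote might be sketched as follows. The table contents, the function name `split_sentences`, and the rule that a boundary candidate is terminal punctuation followed by whitespace or end of text are assumptions made for illustration, not a description of any particular system.

```python
import re

# Hypothetical abbreviation table; a real one would be larger and domain-specific.
ABBREVIATION_TABLE = {"mr.", "mrs.", "dr.", "e."}


def split_sentences(text):
    """Split after '.', '?' or '!' unless the token is a listed abbreviation."""
    sentences = []
    start = 0
    # Candidate boundaries: terminal punctuation followed by whitespace or end of text.
    for match in re.finditer(r"[.?!](?=\s|$)", text):
        end = match.end()
        # Token that ends at the candidate boundary, e.g. "Mr." or "Street."
        token = text[start:end].split()[-1].lower()
        if token in ABBREVIATION_TABLE:
            continue  # Treat the period as part of the abbreviation.
        sentences.append(text[start:end].strip())
        start = end
    # Keep any trailing text that lacks terminal punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


result = split_sentences("Mr. Smith went to the shops in Jones Street. He bought milk.")
```

Here `result` holds two sentences, with "Mr." kept inside the first. As the quote notes, an abbreviation may also terminate a sentence, and a pure table lookup like this one would miss that boundary.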