AutoPhrase Text Segmenter

From GM-RKB

An AutoPhrase Text Segmenter is an informative phrase text segmenter.



References

2018.03.04

2018

  • https://github.com/shangjingbo1226/AutoPhrase
    • QUOTE:
      • Fix a few bugs during the pre-processing and post-processing, i.e., Tokenizer.java. Previously, when the corpus contains characters like /, the results could be wrong or errors may occur.
      • When the phrasal segmentation is serving new text, for the phrases (every token is seen in the training corpus) provided in the knowledge base (wiki_quality.txt), the score is set as 1.0. Previously, it was kind of infinite.
      • Support extremely large corpus (e.g., 100GB or more). Please comment out the // define LARGE in the beginning of src/utils/parameters.h before you run AutoPhrase on such a large corpus.
      • Quality phrases (every token is seen in the raw corpus) provided in the knowledge base will be incorporated during the phrasal segmentation, even if their frequencies are smaller than MIN_SUP.
      • Stopwords will be treated as low quality single-word phrases.
      • Model files are saved separately. Please check the variable MODEL in both auto_phrase.sh and phrasal_segmentation.sh.
      • The end of line is also a separator for sentence splitting.
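
Three of the rules quoted above (knowledge-base phrases scored as 1.0, stopwords treated as low-quality single-word phrases, and the end of line acting as a sentence separator) can be illustrated with a minimal Python sketch. This is not AutoPhrase's own code, which is written in C++ and Java. The names split_sentences, phrase_score, and model_score are hypothetical, as is the placeholder score of 0.0 for stopwords and the choice of ., !, and ? as the other sentence separators.

```python
import re

# Illustrative sketch only: all names below are hypothetical and are not
# taken from the AutoPhrase code base.
LOW_QUALITY = 0.0  # assumed placeholder score for stopwords


def split_sentences(text):
    """Split text on ., !, ? and on line ends, dropping empty pieces."""
    parts = re.split(r"[.!?\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def phrase_score(phrase, kb_phrases, seen_tokens, stopwords, model_score):
    """Score a candidate phrase following the quoted rules."""
    tokens = phrase.split()
    # A stopword on its own is treated as a low-quality single-word phrase.
    if len(tokens) == 1 and tokens[0] in stopwords:
        return LOW_QUALITY
    # A knowledge-base phrase whose tokens were all seen in the training
    # corpus gets a fixed score of 1.0.
    if phrase in kb_phrases and all(t in seen_tokens for t in tokens):
        return 1.0
    # Otherwise fall back to whatever the learned model assigns (assumed
    # here to be a caller-supplied function).
    return model_score(phrase)
```

In this sketch a knowledge-base phrase containing a token that never appeared in the training corpus falls through to the model score, which mirrors the parenthetical condition in the quoted note.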

2018b

2015