AutoPhrase Text Segmenter

An AutoPhrase Text Segmenter is an informative-phrase text segmenter.

See: SegPhrase.

References

2018.03.04

2018

https://github.com/shangjingbo1226/AutoPhrase
- QUOTE:
  - Fix a few bugs during the pre-processing and post-processing, i.e., Tokenizer.java. Previously, when the corpus contains characters like /, the results could be wrong or errors may occur.
  - When the phrasal segmentation is serving new text, for the phrases (every token is seen in the training corpus) provided in the knowledge base (wiki_quality.txt), the score is set as 1.0. Previously, it was kind of infinite.
  - Support extremely large corpus (e.g., 100GB or more). Please comment out the // define LARGE in the beginning of src/utils/parameters.h before you run AutoPhrase on such a large corpus.
  - Quality phrases (every token is seen in the raw corpus) provided in the knowledge base will be incorporated during the phrasal segmentation, even if their frequencies are smaller than MIN_SUP.
  - Stopwords will be treated as low quality single-word phrases.
  - Model files are saved separately. Please check the variable MODEL in both auto_phrase.sh and phrasal_segmentation.sh.
  - The end of line is also a separator for sentence splitting.
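
The rules in the release notes above can be illustrated with a minimal Python sketch. It is not AutoPhrase's code and does not reproduce its segmentation algorithm; the function names, the toy stopword list, the toy knowledge base, the MIN_SUP value, and the 0.0 score used for stopwords are all assumptions made for illustration.

```python
import re

# Hypothetical toy settings; none of these values come from AutoPhrase itself.
MIN_SUP = 3
STOPWORDS = {"the", "of", "a", "is"}
KNOWLEDGE_BASE = {("support", "vector", "machine")}  # stands in for wiki_quality.txt


def split_sentences(text):
    """Split on sentence-ending punctuation and also on end of line."""
    parts = re.split(r"[.!?]+|\n", text)
    return [p.strip() for p in parts if p.strip()]


def in_knowledge_base(phrase, vocab):
    """A knowledge-base phrase counts only if every token was seen in the corpus."""
    return phrase in KNOWLEDGE_BASE and all(tok in vocab for tok in phrase)


def keep_candidate(phrase, frequency, vocab):
    """Knowledge-base phrases are kept even when rarer than MIN_SUP."""
    return in_knowledge_base(phrase, vocab) or frequency >= MIN_SUP


def serving_score(phrase, model_score, vocab):
    """Score a phrase when segmenting new text."""
    if len(phrase) == 1 and phrase[0] in STOPWORDS:
        return 0.0  # assumption: "low quality" is modelled here as a zero score
    if in_knowledge_base(phrase, vocab):
        return 1.0  # fixed score for knowledge-base phrases
    return model_score
```

For example, under these assumptions `split_sentences("first line\nsecond line. third")` yields three sentences, and a knowledge-base phrase observed only once still passes `keep_candidate`.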

2018b

(Shang et al., 2018) ⇒ Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss, and Jiawei Han. (2018). “Automated Phrase Mining from Massive Text Corpora.” In: IEEE Transactions on Knowledge and Data Engineering Journal, PP(99). doi:10.1109/TKDE.2018.2812203
- QUOTE: … In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. Besides, AutoPhrase can be extended to model single-word quality phrases.

2015

(Liu et al., 2015) ⇒ Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, and Jiawei Han. (2015). “Mining Quality Phrases from Massive Text Corpora.” In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ISBN:978-1-4503-2758-9 doi:10.1145/2723372.2751523
