Text Mining (TM)
- Topics (From Tamara Polajnar, 2006)
- Natural Language Processing (NLP)
- corpus: A body of text used in language processing is usually
referred to as a corpus.
- token
- tagger: A tagger is a tool which automatically marks parts of
text with certain information.
- annotation: The process of marking up text is called annotation.
- Parsing:
- Named Entity Recognition (NER)
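
The terms above (corpus, token, tagger, annotation) can be illustrated with a minimal sketch. It is not taken from the survey. The regular-expression tokenizer, the tiny `LEXICON`, and the function names are all hypothetical, and a real tagger would normally be a trained model rather than a dictionary lookup.

```python
import re

# Hypothetical lexicon for illustration only; a real tagger is usually learned.
LEXICON = {"brca1": "GENE", "p53": "GENE"}

def tokenize(text):
    """Split text into word and punctuation tokens with character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def tag(tokens):
    """Annotate each token with a label; 'O' means no label applies."""
    return [{"text": tok, "start": start, "end": end,
             "label": LEXICON.get(tok.lower(), "O")}
            for tok, start, end in tokens]

# A corpus is simply a collection of texts.
corpus = ["BRCA1 interacts with p53.", "Mutations were found."]

# Annotation: run the tagger over every document in the corpus.
annotated = [tag(tokenize(doc)) for doc in corpus]
```

Each entry in `annotated` keeps the character offsets of its token, so the labels can be mapped back onto the original text.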
- Information Retrieval (IR)
- IR works on large document collections and returns entire
documents, whereas TM works on smaller collections corresponding to IR
results, and returns paragraphs, sentences, or phrases as results.
TM is also sometimes called information extraction (IE).
- Machine Learning Algorithms
- Feature types vary with the problem. For NLP, features can
include words, part-of-speech tags, punctuation, orthography,
morphology, and the location in the sentence.
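
As a rough illustration of the feature types listed above, the sketch below turns each token of a sentence into a feature dictionary. The function name `token_features` and the specific feature names are assumptions made for this example. Part-of-speech tags are left out because they would require a trained tagger, and the three-character suffix is only a crude stand-in for morphology.

```python
import string

def token_features(tokens, i):
    """Build a simple feature dictionary for the token at position i."""
    tok = tokens[i]
    return {
        "word": tok.lower(),                                   # the word itself
        "is_punct": all(ch in string.punctuation for ch in tok),  # punctuation
        "is_capitalized": tok[:1].isupper(),                   # orthography
        "is_all_caps": tok.isupper(),                          # orthography
        "has_digit": any(ch.isdigit() for ch in tok),          # orthography
        "suffix3": tok[-3:].lower(),                           # crude morphology
        "position": i,                                         # location in sentence
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }

tokens = ["BRCA1", "interacts", "with", "p53", "."]
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Dictionaries like these are a common input format for machine learning toolkits, which typically convert them into numeric vectors before training.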
- Related Papers
- Survey of Text Mining of Biomedical Corpora, Tamara Polajnar, 2006