Text Mining

Text Mining (TM)

Topics (From Tamara Polajnar, 2006)
- Natural Language Processing (NLP)
  - corpus: A body of text used in language processing is usually referred to as a corpus.
  - token
  - tagger: A tagger is a tool which automatically marks parts of text with certain information.
  - annotation: The process of marking up text is called annotation.
  - Parsing:
- Named Entity Recognition (NER)
  - Anaphora resolution.
- Information Retrieval (IR)
  - IR works on large document collections and returns entire documents, whereas TM works on smaller collections, typically the results of an IR step, and returns paragraphs, sentences, or phrases. TM is also sometimes called information extraction (IE).
- Machine Learning Algorithms
  - Feature types vary with the problem. For NLP, features can include words, part-of-speech tags, punctuation, orthography, morphology, and the location in the sentence.
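
A minimal sketch of how some of the feature types listed above could be computed per token, using only the Python standard library. The regex tokenizer, the function names (`tokenize`, `token_features`, `sentence_features`), and the specific feature keys are assumptions made for illustration and are not taken from the survey. Part-of-speech tags are left out because they require a trained tagger.

```python
import re

# Assumed tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(sentence)


def token_features(tokens, i):
    """Build a feature dictionary for the token at index i."""
    tok = tokens[i]
    return {
        "word": tok.lower(),                                  # the word itself
        "is_punct": not any(ch.isalnum() for ch in tok),      # punctuation
        "is_capitalized": tok[:1].isupper(),                  # orthography
        "is_all_caps": tok.isupper(),                         # orthography
        "has_digit": any(ch.isdigit() for ch in tok),         # orthography
        "suffix3": tok[-3:].lower(),                          # crude morphology
        "position": i,                                        # location in sentence
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }


def sentence_features(sentence):
    """Return one feature dictionary per token in the sentence."""
    tokens = tokenize(sentence)
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    for feats in sentence_features("BRCA1 binds p53."):
        print(feats)
```

Each dictionary could then be passed to a learning algorithm as the representation of one token. The three-character suffix is only a rough stand-in for real morphological analysis.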

Related Papers
- Survey of Text Mining of Biomedical Corpora, Tamara Polajnar, 2006