Data Mining
(A special program on Text Mining)
jdwang@asia.edu.tw
Room: I517, ext: 1847
Time: Monday, 1:10pm~4:00pm, Room: I528
- Text Book
- Reference Book
- Perl Programming
Outline
PreCourse
PubMed Articles Information Extraction
Transform Instances Into Vectors
SVM Classifier
Weka 3: Data Mining Software in Java
(1) PubMed Articles Download and Analysis (50%)
- PubMed Articles Statistics.
- Topic: choose your own keywords.
- Statistics of PubMed articles by year, month, and day.
- Compute the TF, IDF, and TF*IDF of each word.
- Select the representative words according to TF*IDF.
- Obtain the trend of your research from the web site: significant pattern history
- (Include your own comments and conclusions.)
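The TF, IDF, and TF*IDF steps above can be sketched in Python as follows. This is a minimal illustration, not the course's reference program. It assumes TF is the total count of a word over the whole collection and IDF is log(N / DF), where DF is the number of articles containing the word; other variants exist, so check which definition the assignment expects. The names `tf_idf`, `top_words`, and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return (tf, idf, tfidf) dictionaries for a list of tokenized documents.

    Assumed definitions (other variants exist):
      TF(w)  = total occurrences of w over all documents
      IDF(w) = log(N / DF(w)), DF(w) = number of documents containing w
    """
    n_docs = len(docs)
    tf = Counter()
    df = Counter()
    for tokens in docs:
        tf.update(tokens)       # every occurrence counts toward TF
        df.update(set(tokens))  # each document counts once toward DF
    idf = {w: math.log(n_docs / df[w]) for w in df}
    tfidf = {w: tf[w] * idf[w] for w in tf}
    return tf, idf, tfidf

def top_words(tfidf, k):
    """Select the k words with the highest TF*IDF (ties broken alphabetically)."""
    return sorted(tfidf, key=lambda w: (-tfidf[w], w))[:k]

# Toy tokenized "articles" (hypothetical data, for illustration only).
docs = [
    ["svm", "kernel", "svm", "margin"],
    ["gene", "cancer", "gene", "gene"],
    ["cancer", "svm", "classifier"],
]
tf, idf, tfidf = tf_idf(docs)
print(top_words(tfidf, 2))
```

A word that appears in every article gets IDF = 0 under this definition, so it never ranks as representative regardless of how often it occurs.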
- (2014.11.17) Presentation (PPT)
- Please summarize your experimental results and present them using PPT (15~20 min).
- (2014.11.24) Report (Please just email your report to me; there is no class on 11/24.)
- Report in MS-Word format (Title, Keywords, Introduction, Approaches, Experimental Results, Conclusion and Discussion).
- Report format (Title, Background, Methods, Experimental Results, Conclusion and Future Work)
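For the year, month, and day statistics in part (1), one possible approach is to read the publication date of each downloaded record and count. The sketch below uses Python's standard `xml.etree.ElementTree`. The embedded sample is a simplified, hypothetical fragment modeled on PubMed XML (`PubmedArticle`, `PubDate`, `Year`, `Month`, `Day`); real records carry many more fields, and some give a free-text `MedlineDate` instead of `Year`, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Simplified, hypothetical sample; PMIDs and titles are made up.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>1</PMID><Article>
    <Journal><JournalIssue><PubDate>
      <Year>2014</Year><Month>Nov</Month><Day>17</Day>
    </PubDate></JournalIssue></Journal>
    <ArticleTitle>Example title A</ArticleTitle>
  </Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>2</PMID><Article>
    <Journal><JournalIssue><PubDate>
      <Year>2014</Year><Month>Dec</Month>
    </PubDate></JournalIssue></Journal>
    <ArticleTitle>Example title B</ArticleTitle>
  </Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>3</PMID><Article>
    <Journal><JournalIssue><PubDate>
      <Year>2013</Year><Month>Jan</Month><Day>5</Day>
    </PubDate></JournalIssue></Journal>
    <ArticleTitle>Example title C</ArticleTitle>
  </Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

def publication_dates(xml_text):
    """Return a (year, month, day) tuple per article; missing parts are None."""
    root = ET.fromstring(xml_text)
    dates = []
    for article in root.iter("PubmedArticle"):
        pub_date = article.find(".//PubDate")
        if pub_date is None:
            continue  # no publication date recorded for this article
        dates.append((
            pub_date.findtext("Year"),
            pub_date.findtext("Month"),
            pub_date.findtext("Day"),
        ))
    return dates

dates = publication_dates(SAMPLE_XML)
per_year = Counter(y for y, _, _ in dates if y)
per_month = Counter((y, m) for y, m, _ in dates if y and m)
per_day = Counter((y, m, d) for y, m, d in dates if y and m and d)
print(per_year)
```

Counting at three granularities separately keeps records with incomplete dates in the yearly totals instead of dropping them.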
(2) PubMed Articles Classification (50%)
- Source Program: program13_SVMClassifier_jdwang2014_12_30_Pattern10000Test.7z
- Experimental Resource
- DataSet2:
- Download PubMed articles for at least three types (keywords) to create your own data set.
- Parse the articles in XML format.
- Extract the features according to the tf*idf weighting method.
- Select features according to their tf*idf weights.
- Transform all articles into vectors with class labels via the selected features.
- Try the SVM classifier with k-fold Cross-Validation (CV).
- Find the Accuracy, Recall, Precision, F1, and Confusion Matrix, and report your observations.
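The step of transforming articles into labeled vectors can be sketched as below. The output follows the sparse text format read by LIBSVM and LIBLINEAR, where each line is `<label> <index>:<value> ...` with 1-based indices in ascending order. The weighting used here (word count in the article times a per-word weight such as IDF) is an assumption, and the function names, feature words, and weights are hypothetical; the provided source program may do this differently.

```python
from collections import Counter

def build_feature_index(selected_words):
    """Map each selected feature word to a 1-based column index."""
    return {word: i for i, word in enumerate(sorted(selected_words), start=1)}

def to_sparse_vector(tokens, feature_index, weights):
    """Return {index: value}, with value = count in this article * word weight."""
    counts = Counter(t for t in tokens if t in feature_index)
    return {feature_index[w]: c * weights[w] for w, c in counts.items()}

def to_libsvm_line(label, vector):
    """Format one instance as '<label> <index>:<value> ...', indices ascending."""
    pairs = " ".join(f"{i}:{v:.4f}" for i, v in sorted(vector.items()))
    return f"{label} {pairs}".rstrip()

# Hypothetical selected features and weights, for illustration only.
feature_index = build_feature_index(["gene", "svm", "cancer"])
weights = {"gene": 1.0, "svm": 0.5, "cancer": 0.25}

tokens = ["gene", "cancer", "gene", "the"]  # "the" is not a selected feature
vector = to_sparse_vector(tokens, feature_index, weights)
line = to_libsvm_line(1, vector)
print(line)
```

Only non-zero entries are written, which keeps the files small when 10000 or 20000 features are selected and each article uses a small fraction of them.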
- Comparison:
- Different numbers of selected features (1000, 3000, 5000, 10000, 20000).
- Different values of k for k-fold CV (k=5, 10).
- Computation Time = Training Time + Testing Time
- Accuracy + Computation Time => LIBSVM vs. LIBLINEAR (A Library for Large Linear Classification)
- Confusion Matrix
- Do you have another feature selection or weighting method?
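The comparison items above (k-fold CV, Computation Time = Training Time + Testing Time, the confusion matrix, and the derived metrics) can be wired together as in the following sketch. To stay self-contained it uses a nearest-centroid classifier as a stand-in; it is not an SVM, and in the actual experiments the training and prediction calls would be replaced by LIBSVM or LIBLINEAR. The fold assignment (sample i goes to fold i mod k) is a simplifying assumption with no shuffling or stratification, and all names and data are hypothetical.

```python
import time
from collections import defaultdict

def k_fold_indices(n, k):
    """Split range(n) into k folds; sample i goes to fold i % k (no shuffling)."""
    return [list(range(f, n, k)) for f in range(k)]

def train_centroids(X, y):
    """Stand-in for SVM training: the mean vector of each class."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, v in enumerate(vec):
            acc[j] += v
        counts[label] += 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

def predict(centroids, vec):
    """Stand-in for SVM prediction: the class with the nearest centroid."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[c]))
    return min(sorted(centroids), key=dist)

def cross_validate(X, y, k):
    """Return (confusion[true][predicted], training time + testing time)."""
    labels = sorted(set(y))
    confusion = {t: {p: 0 for p in labels} for t in labels}
    train_time = test_time = 0.0
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        t0 = time.perf_counter()
        model = train_centroids([X[i] for i in train_idx],
                                [y[i] for i in train_idx])
        t1 = time.perf_counter()
        for i in test_idx:
            confusion[y[i]][predict(model, X[i])] += 1
        t2 = time.perf_counter()
        train_time += t1 - t0
        test_time += t2 - t1
    return confusion, train_time + test_time

def accuracy(confusion):
    total = sum(sum(row.values()) for row in confusion.values())
    return sum(confusion[c][c] for c in confusion) / total

def per_class_metrics(confusion):
    """Return {class: (precision, recall, f1)} from a confusion matrix."""
    metrics = {}
    for c in confusion:
        tp = confusion[c][c]
        fp = sum(confusion[t][c] for t in confusion if t != c)
        fn = sum(confusion[c][p] for p in confusion[c] if p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = (precision, recall, f1)
    return metrics

# Toy, well-separated data (hypothetical) with two classes.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]
confusion, total_time = cross_validate(X, y, k=4)
print(accuracy(confusion), per_class_metrics(confusion))
```

Running the same loop for each feature count and each k, and recording accuracy next to the summed time, gives the rows needed for the LIBSVM vs. LIBLINEAR comparison.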
- (2015.1.19) Presentation (PPT)
- (2015.1.21) Report (Email to jdwang@asia.edu.tw)
- Title Page (Final Project, Your Name, ID, Date)
- Method
- Experimental Results
- Data Set
- Data Preprocessing.
- Comparison
- Conclusion and Discussion