Central News Agency (CNA) news
- Create the subdirectories:
- Program (for Perl programs)
- SrcData (for test data)
- Split news articles.
- Extract the reporter from a news article.
- Extract the date of one news article.
- Extract the types of one news article.
- Extract all information of one news article.
=======================PerlCNAParseKgram_jdwang2010_3_30.zip===============================
- How to read filenames from a given directory
- Parse the news article files under one directory "../SrcData"
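Reading the filenames under a directory can be sketched as follows. This is a minimal illustration rather than the code shipped in the archive: the helper name `list_files` is hypothetical, and only the directory path "../SrcData" comes from the outline above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original scripts):
# return the sorted names of the plain files directly inside $dir.
sub list_files {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    # readdir also returns ".", ".." and subdirectories; keep plain files only.
    my @files = sort grep { -f "$dir/$_" } readdir($dh);
    closedir($dh);
    return @files;
}

# Usage: list the article files if the directory from the outline exists.
my $src_dir = "../SrcData";
if (-d $src_dir) {
    print "$_\n" for list_files($src_dir);
}
```

Each returned name can then be joined with the directory path and passed to the parsing step.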
==================================================================
- Check the ParsedData
- Check the ParsedData (Sentences)
- Check the ParsedData (Sentences Length)
- Statistics of Chinese characters (1-gram) (one sentence)
- Statistics of Chinese characters (k-mers, k-grams) (one news article)
- Statistics of Chinese characters (k-mers, k-grams) => output file
- >perl 2_8CNANewStatistics_SplitOneNews.pl
- input:>../ParsedData/CNA9101_Temp.txt_Parsed.txt
- output:>../StatisticsData/CNA_CharStatistic.txt
- Statistics of Chinese characters (k-mers, k-grams) => output file (+ tf distribution)
- >perl 2_9CNANewStatistics_SplitOneNews.pl
- input:>../ParsedData/CNA9101_Temp.txt_Parsed.txt
- output:>../StatisticsData/CNA_CharStatistic.txt
- Statistics of Chinese characters (k-mers, k-grams) => output file (tf_df)
- >perl 2_10CNANewStatistics_SplitOneNews.pl
- input:>../ParsedData/CNA9101_Temp.txt_Parsed.txt
- output:>../StatisticsData/TF_DF_IDF_Statistic.txt
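The k-gram statistics steps above can be sketched as follows. This is an illustrative outline, not the contents of the listed scripts: the function names `count_kgrams`, `tf_df`, and `idf` are hypothetical, the two sample strings are made up, and idf = log(N / df) is one common definition that the original scripts may or may not use. Text read from files would also need to be decoded first (for example by opening with the `:encoding(UTF-8)` layer) so that `substr` works on characters rather than bytes.

```perl
use strict;
use warnings;
use utf8;   # string literals in this file are UTF-8

# Count every k-gram (k consecutive characters) in one string.
sub count_kgrams {
    my ($text, $k) = @_;
    my %count;
    # The range is empty when the text is shorter than $k.
    for my $i (0 .. length($text) - $k) {
        $count{ substr($text, $i, $k) }++;
    }
    return \%count;
}

# Accumulate term frequency (total occurrences) and document frequency
# (number of articles containing the k-gram) over a list of articles.
sub tf_df {
    my ($k, @articles) = @_;
    my (%tf, %df);
    for my $article (@articles) {
        my $count = count_kgrams($article, $k);
        for my $gram (keys %$count) {
            $tf{$gram} += $count->{$gram};
            $df{$gram}++;
        }
    }
    return (\%tf, \%df);
}

# Assumed definition: idf = natural log of (number of articles / df).
sub idf {
    my ($n_docs, $df) = @_;
    return log($n_docs / $df);
}

# Usage with two made-up sample strings standing in for parsed articles.
my @articles = ("中央社台北電", "中央社記者");
my ($tf, $df) = tf_df(2, @articles);
binmode(STDOUT, ":encoding(UTF-8)");
for my $gram (sort keys %$tf) {
    printf "%s\ttf=%d\tdf=%d\tidf=%.3f\n",
        $gram, $tf->{$gram}, $df->{$gram}, idf(scalar @articles, $df->{$gram});
}
```

Keeping tf and df in separate hashes keyed by the k-gram makes it straightforward to write one tab-separated line per k-gram, which is a convenient shape for a later database import.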
==================================================================
- perl Creat_MySQLTable_kgram_tf.pl (imports CNA_CharStatistic.txt and TF_DF_IDF_Statistic.txt)