Cloud Computing Practices
(Hadoop MapReduce Practices with Windoop and AWS EMR)

Score



Class period
Thu., periods 2, 3, 4, I627

Artificial Intelligence and Institutional Research
Time: 2018-12-27, 9:00-12:00
Room: Asia University, Harvard Lecture Hall A116
Registration: (Registration)

Rescheduled Classes:

  • 2018/9/20 (Thursday, periods 2, 3, 4 (9:10am~12:00pm)) => 2018/9/20 (Thursday, periods 7, 8, 9 (7:10pm~10:00pm))
  • Prof. Jing-Doo Wang is invited to give a talk at NCCU (Time: 2018-09-20 (Thu.) 13:30-15:00)

    On-Line program
    (1) Apply for an AWS Educate (By Suca)
    (2) AWS Educate : Construct two EC2s(VM) on AWS
    Example: Launch a Linux Virtual Machine (Amazon EC2)
    Homework 1 (Deadline: 2018/9/28, 12:00PM)


  • 2018/11/29 (Thursday, periods 2, 3, 4 (9:10am~12:00pm)) => 2018/9/27 (Thursday (7:10pm~10:00pm))
  • Prof. Jing-Doo Wang attends Adv.Bioinformatics 2018, Dublin, Ireland (2018/11/26-27)

    On-Line program:
    (1) Install Windoop
    (2) Run "Wordcount" program in Windoop
    Homework 2 (Deadline: 2018/10/12, 12:00PM)



    Textbooks:



  • Python+Spark 2.0+Hadoop機器學習與大數據分析實戰 (Python+Spark 2.0+Hadoop Machine Learning and Big Data Analytics in Practice), by 林大貴. Publisher: 博碩; Published: 2016-10-03; Language: Traditional Chinese; ISBN: 9864341537; ISBN-13: 9789864341535
    Blog: http://pythonsparkhadoop.blogspot.tw
    Facebook group: Python+Spark 2.0+Hadoop機器學習與大數據分析社團
    Example code: P21622_example.zip
  • 大數據基礎與實務 (Big Data Fundamentals and Practices), 2017, by 胡嘉璽, ISBN: 9789869527767, 普林斯頓 (高立圖書)



  • Contents:
  • AWS Educate Program
  • Hadoop MapReduce Programming with Java
  • Free course: Intro to Hadoop and MapReduce by Cloudera (Udacity)
  • Big Data 2014: Introduction to MapReduce
  • How to Download & Install Java JDK 8 in Windows (From: Guru99)
    Windoop PC Cluster Setup
  • (Windoop 2.0) (Thanks to 賴敬勳, 王俊平, and 楊松儒 for environment testing)
    Windoop_WorkerNode_jdwang2018_10_16.zip (Thanks to 陳咨雅)
    Horizontal Scaling (Scale Out): How to add worker nodes efficiently?

    (Please change the original "Windoop" to "Windoop_Localhost")
    windoop_ClusterIP_10.36.27.170.7z (Modified from: Windoop, 林奇暻)
    SpeedUp Problem?
    Commercial Products: (1) Cloudera (2) Hortonworks (Cloudera + Hortonworks)

    1. Check the IP address: DOS> ipconfig
       (Make sure that the IPs of all PCs are in the same network segment (e.g. 172.168.115.?))
    2. Modify the file "windoop\hadoop\etc\hadoop\core-site.xml"
       ("localhost" => IP)
    3. Modify the file "windoop\hadoop\etc\hadoop\yarn-site.xml"
       ("localhost" => IP)
    4. Modify the file "windoop\hadoop\etc\hadoop\hdfs-site.xml"
       ("localhost" => IP)
       "windoop/dfs/name"
       "windoop/dfs/data"
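
    The "localhost" => IP edits in the three files above could also be scripted. The following is only a minimal Java sketch, assuming each file contains the literal text "localhost" wherever the node address is configured; the class name WindoopHostPatcher and its methods are hypothetical helpers, not part of Windoop or Hadoop.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not shipped with Windoop): swaps "localhost" for a
// cluster IP in core-site.xml, yarn-site.xml and hdfs-site.xml.
public class WindoopHostPatcher {

    // Replace every literal "localhost" in the XML text with the given IP.
    static String replaceHost(String xml, String ip) {
        return xml.replace("localhost", ip);
    }

    // Rewrite one configuration file in place (JDK 8 compatible I/O).
    static void patchFile(Path file, String ip) throws IOException {
        String xml = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        Files.write(file, replaceHost(xml, ip).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: java WindoopHostPatcher <windoop-dir> <ip>");
            return;
        }
        // Directory layout taken from the steps above: windoop\hadoop\etc\hadoop
        Path confDir = Paths.get(args[0], "hadoop", "etc", "hadoop");
        for (String name : new String[] {"core-site.xml", "yarn-site.xml", "hdfs-site.xml"}) {
            patchFile(confDir.resolve(name), args[1]);
        }
    }
}
```

    Because the files are overwritten in place, it is safer to keep a copy of the original configuration directory before running such a script.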
  • Hadoop Cluster Setup
  • Chapter 6. Hadoop HDFS commands
  • HDFSOperation.7z

  • HADOOP_HOME ${eclipse_home}\..\hadoop
  • PATH %PATH%;${eclipse_home}\..\hadoop\bin
  • Chapter 7 Hadoop MapReduce
  • (Optional) Google: Machine Learning Crash Course

  • Score: Late submission penalty (Original Score * 0.9 per day of delay)
  • (10%) (Homework 1): On-Line program
  • On-Line program
    (1) Apply for an AWS Educate (By Suca)
    (2) AWS Educate : Construct two EC2s(VM) on AWS
    Example: Launch a Linux Virtual Machine (Amazon EC2)

    Report (pdf): Submit to Moodle (Deadline: 2018/9/28, 12:00PM)
    (1) Save some screenshots (showing the VM with your ID and name) to show your work.
    (2) Describe what you learned from this work.
    (3) Record a video (1~3 min) explaining your work and upload it to YouTube (the URL should be embedded in your report).


  • (10%)(Homework 2):
  • Download: (Hadoop 2.7.1) windoop_2.7.1_with_HBase_jre8_x64_zh_TW.7z (Thanks to Mr. 林奇暻 of Windoop for providing it)

    Hadoop Java Programming: Import, add Jar Library (PDF guide)

    WordCount Examples: WordCount_jdwang_2016_10_12.zip

    Report (pdf): Submit to Moodle (Deadline: 2018/10/12, 12:00PM)
    (0) Create a VM with Windoop (CPU: 4~8 cores, RAM at least 8GB) on AWS, and run your MapReduce program on it.
    (1) Save some screenshots (change the project name to "WordCount_YourAsiaID" instead of "WordCount_jdwang") to show your work.
    (2) Describe what you learned from this work.
    (3) Record a video (1~3 min) explaining your work and upload it to YouTube (the URL should be embedded in your report).
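
    For orientation before opening the WordCount project, the map and reduce steps of a word count can be sketched in plain Java. This is not the code in WordCount_jdwang_2016_10_12.zip and it uses no Hadoop classes; the class WordCountSketch and its method names are hypothetical, chosen only to mirror the two phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java illustration of the WordCount idea (no Hadoop dependency).
public class WordCountSketch {

    // "Map" step: split one input line into words and emit (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(st.nextToken(), 1));
        }
        return out;
    }

    // "Shuffle + reduce" step: group the pairs by word and sum the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    // Run both steps over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line));
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, windoop=1}
        System.out.println(wordCount(Arrays.asList("hello hadoop", "hello windoop")));
    }
}
```

    In a real Hadoop job the framework performs the grouping between the two steps and distributes them across nodes; here everything runs in one process for readability.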

  • (30%)(Middle Project)
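  • (Gantry Information) (國道計費門架座標及里程牌價表104.09.04版.csv: freeway toll gantry coordinates and mileage fare table, version 104.09.04)
    (The locations of all gantries in Google Maps) National freeway toll gantry coordinates
    E.g.: "03F-186.0S" (National Freeway No. 3, Longjing to Hemei) => GantryID="03F1860S"
    (How to import the gantry locations into Google Maps?) Freeway toll ramp locations: Google Maps import tutorial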

    Information Extraction from the "Traffic Data Collection System (TDCS)"
    The locations of the gantries on the national freeways
    E.g.: "03F-186.0S" => GantryID="03F1860S"
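
    The conversion in the example above can be sketched in Java. The rule used here (drop the "-" and the ".") is inferred only from the single example "03F-186.0S" => "03F1860S" and is an assumption; the class GantryIds and the method normalize are hypothetical names.

```java
// Hypothetical helper: converts a gantry label from the location table
// into the GantryID form used in the TDCS records.
public class GantryIds {

    // Assumed rule: remove '-' and '.', e.g. "03F-186.0S" -> "03F1860S".
    static String normalize(String label) {
        return label.trim().replace("-", "").replace(".", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("03F-186.0S")); // prints 03F1860S
    }
}
```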

    How to import the locations of the gantries into Google Maps

    Big Data Processing Project: TDCS-06A, How to use Web Robot (pdf)

    Example: Web Robot (TDCS_WebURLDownload_jdwang_2017_10_20.zip)

    Java Project for TDCS Gantry parsing (24 hours) (TDCS_GIDSequence_MapReduceParser_24Hour_jdwang2017_10_13.zip)
    Testing Data (201701_1-1.7z) (24 hours)
    TDCS Gantry parsing (24 hours) (pdf)
    (The frequency distribution of "VehicleType", "GantryID" or "Specific GantryID" within 24 hours)
    (Freeway Gantry Location 104.09.04.csv)
    E.g.: "03F-186.0S" (Freeway: National Freeway No. 3) => GantryID="03F1860S"

    Dataset: (1) 2018/9 (2) 2018/8 (3) 2018/7 (4) 2018/6 (5) 2018/5 (6) 2018/4
    (Select at least one month for observing the 24 hours frequency distribution of one specific gantry)
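
    The 24-hour frequency distribution of one specific gantry can be sketched in plain Java before moving the logic into a MapReduce job. This is not the parser from TDCS_GIDSequence_MapReduceParser_24Hour_jdwang2017_10_13.zip. It assumes comma-separated records in which Gantry_On is the 3rd field (as stated in the check points), the 2nd field is the matching timestamp, and the timestamp looks like "yyyy-MM-dd HH:mm:ss"; the last two points are assumptions, and the sample records in main are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration: count records per hour of day for one gantry.
public class GantryHourlyCount {

    // Returns counts[h] = number of records whose Gantry_On equals gantryId
    // and whose timestamp falls in hour h (0-23).
    static int[] countByHour(List<String> records, String gantryId) {
        int[] counts = new int[24];
        for (String record : records) {
            String[] f = record.split(",");
            if (f.length < 5) continue;                  // skip malformed lines
            if (!gantryId.equals(f[2].trim())) continue; // 3rd field: Gantry_On
            String ts = f[1].trim();                     // assumed timestamp field
            if (ts.length() < 13) continue;
            int hour = Integer.parseInt(ts.substring(11, 13)); // "HH" part
            counts[hour]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample records, not real TDCS data.
        List<String> sample = Arrays.asList(
            "31,2018-09-01 08:15:30,03F1860S,2018-09-01 08:40:02,03F2100S",
            "31,2018-09-01 08:59:59,03F1860S,2018-09-01 09:20:00,03F2100S",
            "42,2018-09-01 17:00:00,03F1860S,2018-09-01 17:30:00,03F2100S");
        int[] counts = countByHour(sample, "03F1860S");
        for (int h = 0; h < 24; h++) {
            if (counts[h] > 0) System.out.println(h + "\t" + counts[h]);
        }
    }
}
```

    In a MapReduce version, the mapper would emit the hour as the key with a count of 1, and the reducer would sum the counts per hour.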

    Check Points:
    (1) Gantry_On (the 3rd field) and Gantry_Off (the 5th field) vs. Intersection (On and Off)
    Check the two directions (south and north) of the freeway in Google Maps
    How to evaluate the amount of traffic flow at one intersection per (hour, day, week, month)?
    How to compute the traffic flow that passes one gantry (not Gantry_On or Gantry_Off) per (hour, day, week, month)?

    How to evaluate different kinds of vehicles?

    (2) SpeedUp Experiment (WindoopExecuteTime.xlsx)
    Single node vs. multiple nodes (1, 2, 4, 8)?
    small dataset vs. large dataset
    many small files (? MB << 128 MB) vs. one packed file (e.g. ? GB)
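
    When filling in the execution times, speedup can be computed with the usual definition: speedup(n) = T(1) / T(n), and efficiency(n) = speedup(n) / n. The sketch below assumes that this standard definition is the intended measure; the class name SpeedUp is hypothetical and the times in main are placeholders, not measurements.

```java
// Illustration of the usual speedup and efficiency formulas.
public class SpeedUp {

    // speedup(n) = T(1) / T(n)
    static double speedup(double singleNodeSeconds, double multiNodeSeconds) {
        return singleNodeSeconds / multiNodeSeconds;
    }

    // efficiency(n) = speedup(n) / n
    static double efficiency(double singleNodeSeconds, double multiNodeSeconds, int nodes) {
        return speedup(singleNodeSeconds, multiNodeSeconds) / nodes;
    }

    public static void main(String[] args) {
        int[] nodes = {1, 2, 4, 8};
        // Placeholder times in seconds; replace with your own measurements.
        double[] seconds = {800.0, 420.0, 230.0, 150.0};
        for (int i = 0; i < nodes.length; i++) {
            System.out.printf("nodes=%d speedup=%.2f efficiency=%.2f%n",
                nodes[i],
                speedup(seconds[0], seconds[i]),
                efficiency(seconds[0], seconds[i], nodes[i]));
        }
    }
}
```

    An efficiency well below 1 as nodes are added suggests that overhead (for example, many small files or job startup cost) is dominating the run time.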

    (3) How to build interactive views of the traffic flow for observing experimental results?
    (Excel? PowerBI? Tableau?)

  • (15%) (Homework 3: Deadline: 2018/12/6)
    (1) AWS EMR: Run the "word_Count" program
    (2) AWS EMR: Run the Java Project for TDCS Gantry parsing (24 hours) (TDCS_GIDSequence_MapReduceParser_24Hour_jdwang2017_10_13.zip)
    Amazon Simple Storage Service (S3)
    Amazon EMR
    AWS HBase

  • (35%) (Final Project: Project with AWS services)
  • Regular pattern mining from the statistics of One Significant Time Interval Patterns of Vehicles extracted from Freeway Gantry Timestamp Sequences
    On AWS Cloud Platform (e.g. AWS S3, AWS EMR and Amazon Redshift)

    TDCS_CandidateMR_2018_9_2018_9_Date-Weekday-Vehicle_M1_TF2_CF1_Length1_MRP (One Month, 100 files, 26.2 GB)
    TDCS_CandidateMR_2018_7_2018_9_Date-Weekday-Vehicle_M1_TF2_CF1_Length1_MRP (Three Months, 100 files, 88.2 GB)
    TDCS_CandidateMR_2018_4_2018_9_Date-Weekday-Vehicle_M1_TF2_CF1_Length1_MRP (Six Months, 100 files, ~200 GB)
  • Example of Parser TDCS MRP :
    TDCS_MRP_Statistic_jdwang2018_6_6.zip


  • (IEEE ICASI 2017) PPT
  • A Novel Approach to Extract Significant Time Intervals of Vehicles from Superhighway Gantry Timestamp Sequences
    Jing-Doo Wang and Ming-Chorng Hwang,
    2017 IEEE International Conference on Applied System Innovation (IEEE ICASI 2017), May 13-17, 2017, Hotel Emisia, Sapporo, Japan (First Prize Paper Award) (Extended version in Applied Sciences, Special Issue "Selected Papers from IEEE ICASI 2017")
  • Presentation: (2018/1/3, PPT)
    Report: (2018/1/10, Conference or Journal paper + YouTube)
    (1) Topic: Distinct Statistics and Performance Comparison
    (2) Topic: Mining for regular patterns of travel time intervals
    (3) Topic: Visualization system for the statistics of TDCS Travel Time Intervals



    Extra


    Official Amazon Web Services (Choose one)
    Introduce One of AWS services (Select at least one of AWS services)
    (Learn to Build on AWS:Big Data)

    Analyzing Big Data with Amazon EMR

    (Chinese: Amazon EMR) (English: Amazon EMR)
    AWS Member Login (AWS Educate for students without a credit card)
    How to quickly launch a Hadoop cluster with the AWS EMR service within five minutes?
    Amazon EMR Hadoop Demonstration
    Analyzing Big Data with Amazon EMR
    How To Connect To Amazon EC2 With PuTTY On Windows – Quick Tutorial
    100. How to Launch Amazon EMR Cluster with sample data in AWS EMR service

    Using Amazon ML to Predict Responses to a Marketing Offer
    Build a Log Analytics Solution on AWS
    AWS (VM, IAM, VPC)

    Amazon Virtual Private Cloud (VPC)

    AWS Auto Scaling - Amazon.com