Global cancer statistics for the most common cancers
PubMed articles (top 10 cancer types: 10,000 articles = 1,000 articles per cancer type * 10 cancer types)
Follow the Python code example "Ch07c - Document Clustering",
but use the dataset of PubMed articles on the top 10 cancer types collected in the midterm project
instead of 'tmdb_5000_movies.csv.gz'.
(70%) Repeat the "K-Means" process ("NUM_CLUSTERS = 10").
Select one article from each cluster (by "PMID" instead of 'popularity') and extract its top 5 most similar articles (instances),
then describe what relationships you observe among those articles and the 10 clusters.
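
A minimal sketch of this step, assuming scikit-learn is available. TF-IDF features and cosine similarity are choices made here for illustration and may differ from the "Ch07c" example; the toy records, `cluster_articles`, and `top_similar` are hypothetical names, and picking the first PMID in each cluster is an arbitrary selection rule.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy records standing in for the PubMed dataset: (PMID, abstract).
ARTICLES = [
    ("1001", "breast cancer mammography screening reduces mortality in women"),
    ("1002", "hormone receptor status guides breast cancer therapy"),
    ("1003", "BRCA1 mutation carriers and breast cancer risk"),
    ("2001", "lung cancer risk rises with smoking exposure"),
    ("2002", "EGFR mutation in non small cell lung cancer"),
    ("2003", "low dose CT screening for lung cancer in smokers"),
    ("3001", "colorectal cancer screening by colonoscopy"),
    ("3002", "KRAS mutation predicts colorectal cancer treatment response"),
    ("3003", "dietary fiber intake and colorectal cancer risk"),
]
NUM_CLUSTERS = 3  # the assignment uses 10; 3 fits this toy sample


def cluster_articles(texts, num_clusters, seed=42):
    """Vectorize the texts with TF-IDF and assign each one a K-Means cluster."""
    features = TfidfVectorizer(stop_words="english").fit_transform(texts)
    model = KMeans(n_clusters=num_clusters, n_init=10, random_state=seed)
    return features, model.fit_predict(features)


def top_similar(features, pmids, query_pmid, top_n=5):
    """Return the top_n (PMID, cosine similarity) pairs closest to query_pmid."""
    idx = pmids.index(query_pmid)
    sims = cosine_similarity(features[[idx]], features).ravel()
    # Rank by descending similarity and drop the query article itself.
    ranked = [i for i in np.argsort(-sims) if i != idx][:top_n]
    return [(pmids[i], float(sims[i])) for i in ranked]


pmids = [pmid for pmid, _ in ARTICLES]
features, labels = cluster_articles([text for _, text in ARTICLES], NUM_CLUSTERS)

for cluster_id in sorted(set(labels)):
    # Arbitrary pick for illustration: the first PMID found in the cluster.
    query = next(p for p, c in zip(pmids, labels) if c == cluster_id)
    print(f"cluster {cluster_id}, query PMID {query}")
    for pmid, sim in top_similar(features, pmids, query):
        print(f"  {pmid}  cluster={labels[pmids.index(pmid)]}  sim={sim:.3f}")
```

Printing the cluster label next to each similar article makes it easy to see whether the nearest neighbors stay inside the query's cluster or cross into others, which is the relationship the task asks about.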
(80%) (70%) + Compare "K-Means" with two other clustering methods: "Affinity Propagation" and the "Ward clustering algorithm".
Observe the results of the three clustering methods above and, if possible, compare the content of the articles within these 10 clusters.
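
One possible way to set up the comparison, sketched under the assumption that scikit-learn is used for all three methods. The toy abstracts, `run_three_methods`, and the use of the adjusted Rand index as an agreement score are illustrative choices, not part of the assignment. Note that Affinity Propagation chooses its own number of clusters, so it may not return exactly `NUM_CLUSTERS` groups.

```python
from itertools import combinations

from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy abstracts standing in for the PubMed dataset.
TEXTS = [
    "breast cancer mammography screening reduces mortality in women",
    "hormone receptor status guides breast cancer therapy",
    "BRCA1 mutation carriers and breast cancer risk",
    "lung cancer risk rises with smoking exposure",
    "EGFR mutation in non small cell lung cancer",
    "low dose CT screening for lung cancer in smokers",
    "colorectal cancer screening by colonoscopy",
    "KRAS mutation predicts colorectal cancer treatment response",
    "dietary fiber intake and colorectal cancer risk",
]
NUM_CLUSTERS = 3  # the assignment uses 10; 3 fits this toy sample


def run_three_methods(texts, num_clusters, seed=42):
    """Cluster the same TF-IDF features with three different algorithms."""
    # A dense array is used because Ward linkage does not take sparse input.
    dense = TfidfVectorizer(stop_words="english").fit_transform(texts).toarray()
    return {
        "K-Means": KMeans(
            n_clusters=num_clusters, n_init=10, random_state=seed
        ).fit_predict(dense),
        # No n_clusters argument: the algorithm decides the number itself.
        "Affinity Propagation": AffinityPropagation(
            random_state=seed
        ).fit_predict(dense),
        "Ward": AgglomerativeClustering(
            n_clusters=num_clusters, linkage="ward"
        ).fit_predict(dense),
    }


results = run_three_methods(TEXTS, NUM_CLUSTERS)
for name, labels in results.items():
    print(f"{name}: {len(set(labels))} clusters, labels={list(labels)}")

# Pairwise agreement between the label assignments (1.0 means identical groupings).
agreement = {
    (a, b): adjusted_rand_score(results[a], results[b])
    for a, b in combinations(results, 2)
}
for (a, b), score in agreement.items():
    print(f"{a} vs {b}: ARI={score:.3f}")
```

The agreement score only says how similar the groupings are; comparing the content of the articles still means reading a sample of titles or abstracts from matching clusters.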
(90%) (80%) + Find one misclassified instance from text classification and evaluate its neighboring instances within the same cluster.
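
A rough sketch of how this could be approached, assuming scikit-learn. The classifier (logistic regression with 2-fold cross-validated predictions), the toy records, and the helper names `misclassified_indices` and `cluster_neighbors` are all assumptions made for illustration; the assignment does not prescribe them. On a sample this small there may be no misclassified instance at all, so the code handles that case.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import cross_val_predict

# Hypothetical toy records: (PMID, cancer type, abstract).
DOCS = [
    ("1001", "breast", "breast cancer mammography screening in women"),
    ("1002", "breast", "hormone receptor status guides breast cancer therapy"),
    ("1003", "breast", "BRCA1 mutation carriers and breast cancer risk"),
    ("1004", "breast", "screening and smoking exposure in cancer patients"),
    ("2001", "lung", "lung cancer risk rises with smoking exposure"),
    ("2002", "lung", "EGFR mutation in non small cell lung cancer"),
    ("2003", "lung", "low dose CT screening for lung cancer in smokers"),
    ("2004", "lung", "lung cancer therapy response and mutation status"),
    ("3001", "colorectal", "colorectal cancer screening by colonoscopy"),
    ("3002", "colorectal", "KRAS mutation predicts colorectal cancer response"),
    ("3003", "colorectal", "dietary fiber intake and colorectal cancer risk"),
    ("3004", "colorectal", "colonoscopy follow up after colorectal cancer therapy"),
]


def misclassified_indices(y_true, y_pred):
    """Return positions where the predicted label differs from the true label."""
    return [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]


def cluster_neighbors(features, cluster_labels, idx, top_n=5):
    """Return (index, cosine similarity) for the nearest articles in idx's cluster."""
    same = [
        i for i, c in enumerate(cluster_labels)
        if c == cluster_labels[idx] and i != idx
    ]
    if not same:
        return []
    sims = cosine_similarity(features[[idx]], features[same]).ravel()
    return [(same[j], float(sims[j])) for j in np.argsort(-sims)[:top_n]]


pmids, y_true, texts = map(list, zip(*DOCS))
features = TfidfVectorizer(stop_words="english").fit_transform(texts)
# Out-of-fold predictions, so every article is scored by a model that never saw it.
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), features, y_true, cv=2)
clusters = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(features)

wrong = misclassified_indices(y_true, y_pred)
if wrong:
    idx = wrong[0]
    print(f"PMID {pmids[idx]}: true={y_true[idx]}, predicted={y_pred[idx]}")
    for j, sim in cluster_neighbors(features, clusters, idx):
        print(f"  neighbor PMID {pmids[j]}  true={y_true[j]}  sim={sim:.3f}")
else:
    print("No misclassified instance in this toy sample.")
```

Listing the true labels of the neighbors shows whether the misclassified article sits in a cluster dominated by the label the classifier predicted, which is one way to evaluate why the error occurred.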
Report (PDF) with an embedded YouTube video (share the URL link)