AUTOMATIC REPRESENTATIVE NEWS GENERATION USING AUTOMATIC CLUSTERING

Thirty-two online news sites in Indonesia publish more than 2000 news items in a single day. Users who have little time to browse them find it difficult to choose which items are worth reading, because many of the items share the same topic and content. Clustering the news automatically and providing one representative item for each group of similar news is a practical solution to this redundancy problem. This final project presents a new approach to automatic representative news generation using automatic clustering, built as a combination of Data Acquisition, Keyword Extraction, Metadata Aggregation, Automatic Clustering, and Representative News Generation. Data Acquisition collects the news from RSS feeds and supplies the news descriptions, which are tokenized and filtered in the Keyword Extraction process. Keyword Extraction produces tokens, token values, and token links, which are passed to the Metadata Aggregation process to build a matrix of token values for each link. Using the Automatic Clustering method, the system identifies the appropriate number of clusters and clusters the news automatically to provide representative news for users. The representative news item of each cluster is the item with the shortest distance to the cluster centroid. The resulting representatives depend on the token values of each link: if the difference value of a cluster is very small, the news belongs to a condensed data distribution, whereas if the difference value is very large, the news belongs to a scattered data distribution. The longer the refresh time, the more accurate the automatic clustering results, because more data are available to form clusters.

Keywords: Data Acquisition, Keyword Extraction, Metadata Aggregation, Automatic Clustering, Representative News Generation
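
The final step described in the abstract, choosing the news item closest to each cluster centroid, can be illustrated with a minimal Python sketch. It assumes the token-value matrix and the cluster labels are already available; the function name `pick_representatives`, the sample matrix, and the use of Euclidean distance are illustrative assumptions, since the abstract does not state the distance measure or the clustering algorithm.

```python
import math
from collections import defaultdict


def pick_representatives(token_matrix, labels):
    """Return {cluster_label: row_index} for the row nearest each centroid.

    token_matrix: one row of token values per news link (hypothetical layout).
    labels: cluster label per row, assumed to come from a clustering step
            that is not implemented here.
    """
    # Group row indices by their cluster label.
    members = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)

    representatives = {}
    for label, indices in members.items():
        dims = len(token_matrix[indices[0]])
        # Centroid = per-dimension mean of the rows in the cluster.
        centroid = [
            sum(token_matrix[i][d] for i in indices) / len(indices)
            for d in range(dims)
        ]
        # Euclidean distance is an assumption; the source only says "distance".
        representatives[label] = min(
            indices, key=lambda i: math.dist(token_matrix[i], centroid)
        )
    return representatives


# Hypothetical token-value matrix: 6 news links, 3 tokens.
token_matrix = [
    [3, 0, 1],
    [2, 0, 1],
    [4, 0, 1],
    [0, 5, 0],
    [0, 4, 1],
    [0, 3, 2],
]
labels = [0, 0, 0, 1, 1, 1]  # assumed output of the clustering step

representatives = pick_representatives(token_matrix, labels)
```

Here each key of `representatives` is a cluster label and each value is the row index of the link that would be shown to the user as that cluster's representative news.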