
Research and Implementation of a Text Clustering Algorithm Based on In-Memory Computing

Posted on: 2016-09-09
Degree: Master
Type: Thesis
Country: China
Candidate: M D Li
GTID: 2308330503976552
Subject: Communication and Information System

Abstract/Summary:
The news clustering system, the core of a personalized news recommendation engine, was born in the wave of the Internet. The results of the clustering algorithm have a direct impact on the news that gets recommended. A complete news clustering system includes a web crawler module, responsible for obtaining the data source; a text extraction module, responsible for cleaning the data and removing noise; and a clustering module, responsible for the final grouping. The quality of the data source is not only a necessary condition for the effective operation of the algorithm but also an important part of the practical application, so how to design a faster web crawler with less duplication, together with a highly accurate text extraction algorithm, has long been a research focus. This thesis is written in that context: its goal is to build a working news clustering system, and along the way to develop an efficient web deduplication algorithm and a precise text extraction algorithm. The main contributions are as follows.

First, the traditional web crawler architecture is analyzed in depth, with particular attention to the web deduplication mechanism. In the traditional strategy, most duplicate URLs are removed before pages are downloaded, which filters out pages with identical URLs. In a news recommendation system, however, besides a large number of first-hand pages the Internet also carries a large number of second-hand news pages, as when major news websites reprint other sites' pages. In that case the URLs differ, and the full pages differ because different advertisements are inserted, yet the plain text content may be identical; for readers, a page with the same content is not something they want to see again. To handle this situation, this thesis adds a second deduplication pass over the downloaded pages: for each original HTML page the title is extracted and hashed, and a page is discarded when its title hash collides with one already seen (a sketch follows below). Extensive experiments show the algorithm is effective, improving on the existing deduplication strategy by 0.48% on average.

Next, existing content extraction algorithms are reviewed, and a new extraction algorithm based on statistics and on the positional relationship between title and body is proposed on top of the statistical approach. The algorithm takes the title into consideration: first, the text inside the <title> tag is extracted as a reference; then every element in the DOM tree is compared against it, with the attributes of each element also taken into account, and the most likely candidate is chosen as the news title. Once the title is determined, the scope of the body is narrowed according to the positional relationship between title and body text, which raises the extraction accuracy. After the body has been extracted, the correctness of the title can in turn be verified through the similarity between title and body, so the two checks reinforce each other. Extensive experiments show the algorithm is effective: its average accuracy reaches 97.83 percent, far higher than traditional content extraction algorithms.
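As an illustration of the second deduplication pass, the sketch below hashes each downloaded page's <title> text and discards pages whose hash has been seen before. This is a minimal single-machine sketch: the regex-based title extractor, the MD5 hash, and the in-memory set are assumptions for illustration, since the abstract does not specify these details.

```python
# Second-pass deduplication sketch: hash the extracted <title> of each
# downloaded page and drop pages whose title hash collides with one seen before.
import hashlib
import re

seen_hashes = set()

def extract_title(html: str) -> str:
    """Pull the text inside the first <title> tag (naive regex sketch)."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""

def is_duplicate(html: str) -> bool:
    """Return True if a page with the same normalized title was seen before."""
    title = extract_title(html)
    digest = hashlib.md5(title.lower().encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True   # collision: treat as a reprinted page
    seen_hashes.add(digest)
    return False
```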
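The title-anchored extraction step can be sketched in the same spirit. The code below scores every DOM element's text against the <title> string, treats the best match as the headline node, and then searches only the elements after it for the body, keeping the longest text block. BeautifulSoup, difflib similarity, and the longest-block heuristic are stand-ins for illustration, not the thesis's exact statistical scoring.

```python
# Title-anchored content extraction sketch: locate the headline node by its
# similarity to the <title> text, then restrict the body search to elements
# that follow it in document order.
from difflib import SequenceMatcher
from bs4 import BeautifulSoup

def extract_body(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    page_title = soup.title.get_text(strip=True) if soup.title else ""

    # Find the element whose text is most similar to the <title> text.
    best_node, best_score = None, 0.0
    for node in soup.find_all(["h1", "h2", "h3", "p", "div", "span"]):
        text = node.get_text(strip=True)
        if not text:
            continue
        score = SequenceMatcher(None, page_title, text).ratio()
        if score > best_score:
            best_node, best_score = node, score

    if best_node is None:
        return ""

    # Narrow the search to elements after the headline node and keep the
    # longest text block as a crude statistical stand-in for the body.
    candidates = [el.get_text(" ", strip=True)
                  for el in best_node.find_all_next(["p", "div"])]
    return max(candidates, key=len, default="")
```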
Finally, on top of the web crawler and content extraction algorithm developed above, the news clustering system is designed and implemented. A distributed architecture is designed to handle the large volume of web pages, and the system runs on the Hadoop platform using the MapReduce model. The entire system consists of five MapReduce jobs: word segmentation together with stop-word removal and word frequency statistics; counting the number of words per page; computing TF-IDF; building the document vectors; and the K-Means algorithm itself. The jobs are chained and run one after another. The value of K is determined from the relationship between the silhouette coefficient and K, and the number of iterations can be bounded so that the system completes all jobs within an acceptable time. The algorithm is shown to be effective and practical, and the clustering results, examined manually, meet the requirements of practical application.
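For illustration, the final stages of the pipeline can be condensed into a single-machine sketch: TF-IDF vectors followed by K-Means, with K chosen by the best silhouette coefficient as described above. scikit-learn here stands in for the five chained MapReduce jobs, and the K range, n_init, and random seed are assumptions; it also assumes the documents are already segmented and stop-word filtered, mirroring the first job.

```python
# Single-machine sketch of the pipeline's last stages: TF-IDF vectors and
# K-Means, with K selected by the mean silhouette coefficient.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

def cluster_news(documents, k_range=range(2, 11)):
    """Cluster pre-segmented documents; pick K by the best silhouette score.
    Assumes len(documents) is larger than max(k_range)."""
    vectors = TfidfVectorizer().fit_transform(documents)
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(vectors)
        score = silhouette_score(vectors, labels)  # mean silhouette coefficient
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```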
Keywords/Search Tags:News Clustering System, Web Crawler, Web Deduplication, Content Extraction, Distributed Systems, K-Means Algorithm