
Research and Implementation of a Text Clustering Algorithm Based on In-Memory Computing

Posted on: 2016-09-09
Degree: Master
Type: Thesis
Country: China
Candidate: M D Li
GTID: 2308330503976552
Subject: Communication and Information System

Abstract/Summary:
The news clustering system, the core of a personalized news recommendation engine, was born in the wave of the Internet. The results of the clustering algorithm have a direct impact on the news that gets recommended. A complete news clustering system includes a web crawler module, responsible for obtaining the data source; a text extraction module, responsible for cleaning the data and removing noise; and a clustering module, responsible for the final grouping. The quality of the data source is not only a necessary condition for the effective operation of the algorithm but also an important part of the practical application, so how to design a faster web crawler with less duplication, together with a highly accurate text extraction algorithm, has long been a research focus. This thesis is written in that context: its goal is to build a working news clustering system, and along the way to develop an efficient web deduplication algorithm and a precise text extraction algorithm. The main contributions are as follows.

First, the traditional web crawler architecture is analyzed in depth, with particular attention to the web deduplication mechanism. In the traditional strategy, most duplicate URLs are removed before pages are downloaded, which filters out pages with identical URLs. In a news recommendation system, however, besides a large number of first-hand pages the Internet also carries a large number of second-hand news pages, as when major news websites reprint other sites' pages. In that case the URLs differ, and the full pages differ because different advertisements are inserted, yet the plain text content may be identical; for readers, a page with the same content is not something they want to see again. To handle this situation, this thesis adds a second deduplication pass over the downloaded pages: for each original HTML page the title is extracted and hashed, and a page is discarded when its title hash collides with one already seen (a sketch follows below). Extensive experiments show the algorithm is effective, improving on the existing deduplication strategy by 0.48% on average.

Next, existing content extraction algorithms are reviewed, and a new extraction algorithm based on statistics and on the positional relationship between title and body is proposed on top of the statistical approach. The algorithm takes the title into consideration: first, the text inside the <title> tag is extracted as a reference; then every element in the DOM tree is compared against it, with the attributes of each element also taken into account, and the most likely candidate is chosen as the news title. Once the title is determined, the scope of the body is narrowed according to the positional relationship between title and body text, which raises the extraction accuracy. After the body has been extracted, the correctness of the title can in turn be verified through the similarity between title and body, so the two checks reinforce each other. Extensive experiments show the algorithm is effective: its average accuracy reaches 97.83 percent, far higher than traditional content extraction algorithms.
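As an illustration of the second deduplication pass, the sketch below hashes each downloaded page's <title> text and discards pages whose hash has been seen before. This is a minimal single-machine sketch: the regex-based title extractor, the MD5 hash, and the in-memory set are assumptions for illustration, since the abstract does not specify these details.

```python
# Second-pass deduplication sketch: hash the extracted <title> of each
# downloaded page and drop pages whose title hash collides with one seen before.
import hashlib
import re

seen_hashes = set()

def extract_title(html: str) -> str:
    """Pull the text inside the first <title> tag (naive regex sketch)."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""

def is_duplicate(html: str) -> bool:
    """Return True if a page with the same normalized title was seen before."""
    title = extract_title(html)
    digest = hashlib.md5(title.lower().encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True   # collision: treat as a reprinted page
    seen_hashes.add(digest)
    return False
```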
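The title-anchored extraction step can be sketched in the same spirit. The code below scores every DOM element's text against the <title> string, treats the best match as the headline node, and then searches only the elements after it for the body, keeping the longest text block. BeautifulSoup, difflib similarity, and the longest-block heuristic are stand-ins for illustration, not the thesis's exact statistical scoring.

```python
# Title-anchored content extraction sketch: locate the headline node by its
# similarity to the <title> text, then restrict the body search to elements
# that follow it in document order.
from difflib import SequenceMatcher
from bs4 import BeautifulSoup

def extract_body(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    page_title = soup.title.get_text(strip=True) if soup.title else ""

    # Find the element whose text is most similar to the <title> text.
    best_node, best_score = None, 0.0
    for node in soup.find_all(["h1", "h2", "h3", "p", "div", "span"]):
        text = node.get_text(strip=True)
        if not text:
            continue
        score = SequenceMatcher(None, page_title, text).ratio()
        if score > best_score:
            best_node, best_score = node, score

    if best_node is None:
        return ""

    # Narrow the search to elements after the headline node and keep the
    # longest text block as a crude statistical stand-in for the body.
    candidates = [el.get_text(" ", strip=True)
                  for el in best_node.find_all_next(["p", "div"])]
    return max(candidates, key=len, default="")
```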
Finally, on top of the web crawler and content extraction algorithm developed above, the news clustering system is designed and implemented. A distributed architecture is designed to handle the large volume of web pages, and the system runs on the Hadoop platform using the MapReduce model. The entire system consists of five MapReduce jobs: word segmentation together with stop-word removal and word frequency statistics; counting the number of words per page; computing TF-IDF; building the document vectors; and the K-Means algorithm itself. The jobs are chained and run one after another. The value of K is determined from the relationship between the silhouette coefficient and K, and the number of iterations can be bounded so that the system completes all jobs within an acceptable time. The algorithm is shown to be effective and practical, and the clustering results, examined manually, meet the requirements of practical application.
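For illustration, the final stages of the pipeline can be condensed into a single-machine sketch: TF-IDF vectors followed by K-Means, with K chosen by the best silhouette coefficient as described above. scikit-learn here stands in for the five chained MapReduce jobs, and the K range, n_init, and random seed are assumptions; it also assumes the documents are already segmented and stop-word filtered, mirroring the first job.

```python
# Single-machine sketch of the pipeline's last stages: TF-IDF vectors and
# K-Means, with K selected by the mean silhouette coefficient.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

def cluster_news(documents, k_range=range(2, 11)):
    """Cluster pre-segmented documents; pick K by the best silhouette score.
    Assumes len(documents) is larger than max(k_range)."""
    vectors = TfidfVectorizer().fit_transform(documents)
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(vectors)
        score = silhouette_score(vectors, labels)  # mean silhouette coefficient
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```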
Keywords/Search Tags:News Clustering System, Web Crawler, Web Deduplication, Content Extraction, Distributed Systems, K-Means Algorithm