Research on a Web Document Clustering Method Based on Hadoop

Posted on: 2013-03-02
Degree: Master
Type: Thesis
Country: China
Candidate: X He
Full Text: PDF
GTID: 2248330374975858
Subject: Computer system architecture
Abstract/Summary:
Web documents are the primary form of information on the Internet; people use them to publish and obtain information. As the information age advances, the number of web documents is growing explosively. With hundreds of millions of web documents, how can useful information be mined effectively? How can useless information be identified quickly? How can the information be classified conveniently? Data mining methods are one effective approach, and web document clustering is a useful data mining technique. Web documents can be partitioned by semantics using unsupervised or semi-supervised web document clustering methods.

Web document clustering can be widely used in many applications. For instance, a web search engine can provide more related information by clustering its search results, and the dynamically created class information can help users navigate. Web document clustering can also be used to recognize junk web pages. Web document clustering is therefore an active research field, and many of its problems deserve deeper study.

Web document clustering can be divided into smaller problems, such as web content recognition and extraction, distance measurement, dimensionality reduction, clustering analysis, determination of the number of clusters, and cluster label generation. This paper focuses on clustering analysis and proposes a new web document clustering method based on Multiclass spectral clustering. A web search engine with result clustering is also built in this paper, with Multiclass spectral clustering and Normalized cuts clustering integrated into the system.

Web document clustering using spectral clustering has high accuracy, but spectral clustering requires a matrix containing the pairwise similarity values. The size of this matrix grows faster than the size of the dataset to be clustered, and the memory of a single machine can hardly hold the matrix when the dataset is too large. A scalable web document clustering method is therefore proposed in this paper. The method uses Hadoop, so the matrix can easily be stored in HDFS. Experimental results are given in the paper.
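To make the memory argument concrete, the following is a minimal sketch rather than the thesis's implementation. It assumes a dense n-by-n similarity matrix stored as 8-byte floating-point values, and the function names `dense_similarity_bytes` and `row_blocks` are hypothetical. The row-block split only illustrates how such a matrix could be divided into independent pieces for distributed storage; the abstract does not state how the thesis actually partitions it.

```python
def dense_similarity_bytes(n_docs: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense n_docs x n_docs similarity matrix.

    Assumption: one 8-byte float per document pair, no sparsity
    and no exploitation of symmetry.
    """
    return n_docs * n_docs * bytes_per_value


def row_blocks(n_docs: int, rows_per_block: int):
    """Yield (start, end) row ranges that together cover the matrix.

    Each range is a piece that could be computed and stored on its own,
    for example by one task in a distributed job (illustrative only).
    """
    for start in range(0, n_docs, rows_per_block):
        yield start, min(start + rows_per_block, n_docs)


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        gib = dense_similarity_bytes(n) / 2**30
        print(f"{n:>9} documents -> {gib:,.1f} GiB")
```

Under these assumptions, a tenfold increase in the number of documents means a hundredfold increase in matrix size, which is why a single machine's memory is exhausted long before the document collection itself becomes large.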
Keywords/Search Tags:Normalized Cuts, Multiclass Spectral Clustering, Web document clustering, Hadoop, MapReduce