A Study On Machine Learning Algorithms For The Document Analysis

Posted on: 2009-12-31
Degree: Master
Type: Thesis
Country: China
Candidate: X L Chang
GTID: 2178360272486768
Subject: Computer software and theory
Abstract/Summary:
With the exponential growth of information on the Internet, automatic analysis of very large text collections has become an increasingly urgent research topic. In recent years, text clustering and hotspot information detection, two important means of automatic text analysis, have attracted growing attention from researchers. Clustering Internet information gives people a high-level view of how its themes are distributed and lets them choose texts on different themes to browse according to their own interests. Automatically detecting hotspot information on the Internet lets users easily learn about the hotspots in different categories.

This dissertation concentrates on improving and efficiently implementing text clustering and hotspot information detection algorithms, in order to advance the practical use of automatic text analysis on large-scale data in engineering environments. Firstly, because the clustering result of the K-Means algorithm depends heavily on the choice of initial points, this dissertation introduces a delta-approximate K-Center algorithm with optimized centers into K-Means, constructing the improved clustering algorithm KWOC (K-Means With Optimized Centers) to select centers more effectively. Experiments show that KWOC noticeably improves the robustness of the final clustering result. For the concrete implementation of KWOC, this dissertation designs a novel transactional file system that efficiently caches the intermediate results of the K-Center algorithm and shares results at the file level. This scheme reduces the running time of the KWOC algorithm.

Secondly, to mine hotspot information from massive web data effectively, we design a new web hotspot information detection algorithm.
This algorithm is based on a staged matrix of streaming frequency-change data. Combining this matrix with the historical fluctuation of the streaming frequency changes, the algorithm computes an evaluation indicator for effective hotspot information clusters, and finally selects hotspot documents based on the information chosen by that indicator. The concrete implementation of this algorithm also uses the dedicated transactional file system, so it has high time efficiency.

Finally, this dissertation gives the design scheme and implementation methods of the transactional file system dedicated to the above clustering and hotspot information detection algorithms. Based on consistent hashing theory, the system shares results during computation through fast hash files, converts computational dependencies into transactional dependencies, and strongly guarantees the reliability of the algorithms within the framework of transactional rebuilding theory.

Experiments demonstrate the good performance of the clustering algorithm, the hotspot algorithm, and the system design that implements them, and show that they can be applied to large-scale data in real engineering environments.
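
The KWOC idea summarized in this abstract, seeding K-Means with centers produced by a K-Center procedure, can be illustrated with a minimal Python sketch. The abstract does not specify the delta-approximate K-Center algorithm, so the sketch substitutes the classic greedy farthest-first traversal as a stand-in. The function names `farthest_first_centers` and `kmeans`, the `first_index` parameter, and the toy data are all assumptions made for illustration, not the dissertation's implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centers(points, k, first_index=0):
    """Greedy K-Center seeding (assumed stand-in for the thesis' K-Center step).

    Starts from points[first_index] and repeatedly adds the point that is
    farthest from all centers chosen so far.
    """
    centers = [points[first_index]]
    # nearest[i] holds the squared distance from points[i] to its closest center.
    nearest = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(points[idx])
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], squared_distance(p, points[idx]))
    return centers


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations starting from the supplied centers."""
    centers = [list(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: move every center to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, labels


if __name__ == "__main__":
    rng = random.Random(0)
    blobs = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]  # hypothetical toy clusters
    data = [
        [cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)]
        for cx, cy in blobs
        for _ in range(30)
    ]
    seeds = farthest_first_centers(data, k=3)
    final_centers, labels = kmeans(data, seeds)
    print(final_centers)
```

Because the seeds are spread far apart before the Lloyd iterations begin, the outcome tends to depend less on an unlucky starting configuration than with uniformly random initialization; in this sketch only the first center remains a free choice.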
Keywords/Search Tags: Text clustering, Hotspot detection, Transaction processing file system, Machine learning algorithms