A Study On Machine Learning Algorithms For The Document Analysis

Posted on:2009-12-31

Degree:Master

Type:Thesis

Country:China

Candidate:X L Chang

Full Text:PDF

GTID:2178360272486768

Subject:Computer software and theory

Abstract/Summary:

With the exponential growth of information on the Internet, the automatic analysis of massive text data has become an increasingly urgent research subject. In recent years, text clustering and hotspot information detection, two important means of automatic text analysis, have attracted growing attention from researchers. Clustering Internet information gives people a high-level view of how its themes are distributed and lets them choose texts on different themes to browse according to their own interests. Automatically detecting hotspot information on the Internet lets users easily learn about the hotspots in different categories.

This dissertation concentrates on improving and efficiently implementing text clustering and hotspot information detection algorithms, in order to advance the practical use of automatic text analysis on massive data in engineering environments.

Firstly, because the clustering result of the K-Means algorithm depends heavily on the initial points, this dissertation introduces a delta-approximate K-Center algorithm with optimized centers into K-Means and constructs an improved clustering algorithm, KWOC (K-Means With Optimized Centers), to choose centers more effectively. Experiments show that KWOC distinctly improves the robustness of the final clustering result. For the concrete implementation of KWOC, this dissertation designs a novel transactional file system that efficiently caches the intermediate results of the K-Center algorithm and shares results at the file level. This scheme reduces the running time of the KWOC algorithm.

Secondly, to mine hotspot information in massive web data effectively, we design a new web hotspot information detection algorithm. Based on a staged data matrix of streaming frequency changes, and taking into account the historical fluctuation of those changes, the algorithm computes an evaluation indicator for effective hotspot information clusters and then selects hotspot documents based on the information chosen by that indicator. Its concrete implementation also uses the dedicated transactional file system, so it has high time efficiency.

Finally, this dissertation gives the design scheme and implementation methods of the dedicated transactional file system for the above clustering and hotspot detection algorithms. Based on the theory of consistent hashing, the system shares results during computation through fast hash files, effectively converts computing dependencies into transaction dependencies, and, within the framework of transactional rebuilding theory, provides a strong guarantee of the algorithms' reliability.

Experiments demonstrate the good performance of the clustering algorithm, the hotspot algorithm, and the system design in which they are implemented, and show that they can be applied to massive data in real engineering environments.
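
The abstract does not specify which K-Center approximation KWOC uses, so the following is only a rough sketch of the general idea of seeding K-Means with K-Center style centers. It substitutes the standard greedy farthest-first traversal for the dissertation's delta-approximate algorithm, and the function names (`farthest_first_centers`, `assign`, `kmeans`) and the sample points are hypothetical, chosen purely for illustration.

```python
import math


def farthest_first_centers(points, k, first=0):
    """Greedy K-Center seeding (farthest-first traversal).

    Assumption: this stands in for the dissertation's delta-approximate
    K-Center step, whose details the abstract does not give.
    """
    centers = [points[first]]
    # dist[i] is the distance from points[i] to its nearest chosen center.
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[idx])
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return centers


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]


def kmeans(points, centers, iters=20):
    """Plain Lloyd iterations starting from the supplied centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members))
                )
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical toy data: three well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
seeds = farthest_first_centers(points, k=3)
centers, labels = kmeans(points, seeds)
```

Because each new seed is the point farthest from the seeds already chosen, the initial centers tend to be spread across the data rather than clustered together, which is one plausible way to reduce the sensitivity to initialization that the abstract describes.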

Keywords/Search Tags:

Text clustering, Hotspot Detecting, Transaction Processing file system, machine learning algorithms

Related items

1	Research On Hotspot Detection And Tracking In Social Medium
2	Email Processing System Based On Personal Information Management
3	Research On Transaction Propagation Algorithms And Decentralized Machine Learning Framework For Blockchains
4	Study Of The Natural Language Processing Based On Machine Learning Algorithms
5	Research On Hotspot Detection Technology Of Microblogging Public Opinion Based On Text Clustering
6	Significant Study Of Text Clustering Model Based On Machine Learning
7	A Study of Applying Machine Learning Algorithms in Application of Text Classification
8	Research Of Network Hotspot Content Classification Based On Improved Singular Value Decomposition And Cosine Theorem
9	The Study And Application Of New Clustering Algorithms In Image Processing And Text Clustering
10	Research On High Performance Chinese Text Classification Based On Machine Learning