
Study Of Component Analysis Algorithm And The Application In The Feature Extraction From Web Text

Posted on: 2006-02-22  Degree: Master  Type: Thesis
Country: China  Candidate: Z P Zhang
GTID: 2168360152466598  Subject: Computer applications
Abstract/Summary:
The World Wide Web (WWW) is growing rapidly in both depth and breadth, and the volume of information on it is increasing exponentially. Web text is usually expressed in HTML, so it is first transformed into a feature vector that reflects the content of the text. Such representations share a common shortcoming: the feature vectors have extremely high dimensionality. Feature extraction is therefore essential to Web text mining. Based on the processing of Web text information, this thesis studies feature extraction methods in depth from both the theoretical and the applied side.
The thesis begins with the Web text mining model, describing its definition, characteristics, processing steps, and common techniques. It then discusses word segmentation, feature representation, and feature extraction in detail. Finally, it analyzes and improves text feature extraction, the core step of Web text mining, and puts forward an algorithm that combines SVD with gene analysis. It validates the algorithm through experiments and proposes a genetic algorithm based on vector similarity and gene analysis.
The primary aim of the thesis is to study and implement feature extraction algorithms. Obtaining the feature vector is an NP problem. Many scholars are currently studying feature extraction, and several new methods have emerged. Many of them assign each word a weight based on its frequency and position and select the words with the largest weights. The thesis puts forward two methods based on the component analysis algorithm:
① A feature extraction algorithm based on principal component analysis. It combines SVD with principal component analysis to find the latent conceptual structure of the texts. The original features are expressed as combinations of the principal components, so that the representation captures the internal relations that explain the texts.
② A genetic algorithm based on vector similarity. Obtaining the text feature vector is recast as a search for an optimum in the Web text space. A better individual reflects the text more faithfully and contains the information carried by the other chromosomes; that is, it has greater similarity to the other individuals. The thesis maps each individual, which is composed of distinct feature words, to a vector in the space spanned by the common components of the component analysis algorithm. It then searches the problem space repeatedly, guided by vector similarity, to obtain the best feature vector.
Finally, the thesis introduces the design and implementation of the system and presents the experimental results of the two feature extraction algorithms.
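The two ideas summarized in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names (`latent_projection`, `fitness`), the toy term-document matrix, the encoding of an individual as a binary mask over terms, and the choice of mean cosine similarity as the fitness score are all assumptions made for illustration. The sketch uses a truncated SVD to obtain a low-dimensional component space, then scores each candidate feature-word set by how similar it is to the other candidates in that space.

```python
import numpy as np

def latent_projection(term_doc, k):
    """Truncated SVD of a (n_terms, n_docs) weight matrix.

    Returns the top-k left singular vectors (the component basis) and
    the (k, n_docs) coordinates of the documents in that basis.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :k]
    return U_k, U_k.T @ term_doc

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def fitness(population, U_k):
    """Assumed fitness: mean cosine similarity of each individual to the
    other individuals, measured in the component space."""
    coords = [U_k.T @ np.asarray(ind, dtype=float) for ind in population]
    scores = []
    for i, ci in enumerate(coords):
        others = [cosine(ci, cj) for j, cj in enumerate(coords) if j != i]
        scores.append(sum(others) / len(others))
    return scores

# Toy data (hypothetical): 4 terms x 3 documents, two unrelated topics.
term_doc = np.array([[2.0, 1.0, 0.0],
                     [1.0, 2.0, 0.0],
                     [0.0, 0.0, 3.0],
                     [0.0, 0.0, 1.0]])
U_k, doc_coords = latent_projection(term_doc, k=2)

# Each individual is a binary mask selecting feature words (an assumption).
population = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
scores = fitness(population, U_k)
print(scores)
```

In a full genetic algorithm these scores would drive selection, crossover, and mutation over many generations; only the similarity-based scoring step is shown here.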
Keywords/Search Tags: Web text mining, feature extraction, component analysis, singular value decomposition, genetic algorithm