
Research On Web Text Classification Key Technology

Posted on: 2009-12-08  Degree: Doctor  Type: Dissertation
Country: China  Candidate: S Q Yin  Full Text: PDF
GTID: 1118360242497054  Subject: Basic Psychology
Abstract/Summary:
Since the 1990s, the Internet has developed with surprising speed. The Web, as the main platform for producing, publishing, processing, and exchanging information, has given rise to massive, heterogeneous, dynamic, semi-structured or unstructured information resources. More than 80 percent of this Web information exists as text, and its volume is growing rapidly, by about 1 million pages per day on average. This expansion of the Internet and the emergence of massive online text mean that an extremely rich body of useful information, namely knowledge, is available. How to discover useful information and knowledge patterns and put them to good use has long been a research direction. Although computing technology has kept advancing, it is impossible for people to read, classify, and summarize the information on the Internet by hand. Search engines introduced classified (directory-style) browsing, whose catalogs are of high quality and effective for searching, and which helps users find the information they need. However, such catalogs require manual maintenance, so they suffer from high cost, slow updating, and a heavy maintenance workload; they also return few results and have limited recall. In addition, they cannot provide personalized services for individual users. One survey showed that, given the vast amount of information on the Web, 99 percent of Web information is of no use to 99 percent of users, so many of the resources returned by Web search engines are buried and go unused. Knowledge discovery based on Web text data arose in response. It can address the above problems effectively: it improves the efficiency of Web information search and classifies massive numbers of Web pages by the meaning of the text they contain, helping people grasp Web information, locate target knowledge precisely, and extract and discover valuable knowledge.
Web text classification derives from automatic text classification (ATC) technology and is a key part of Web text mining. Building on an analysis of the state of research on Web text mining and Web text classification and of the existing problems, and targeting complex, massive, semi-structured or unstructured text data, this dissertation takes the inner cognitive mechanism of knowledge discovery as its starting point and develops Web text classification along an integrated line of structural model, algorithm, and application. It studies the key technologies in the Web text classification process, such as text collection, segmentation, feature dimension reduction, feature weight calculation, and classification. It also presents hybrid improved algorithms for Web text categorization that combine rough sets, fuzzy sets, and the inner cognitive mechanism. The main research contents and innovations are as follows:
(1) Constructing the Web text classification system model. The dissertation presents the text preprocessing module, the classification module, the function and content of the classification quality assessment module, and the overall model framework. It studies the key technologies in the system model, such as text collection, segmentation, text feature representation, feature dimension reduction, weight calculation, and classification. It also describes five factors that influence classification performance assessment and several common methods of classification quality assessment.
(2) Giving a Web text collection algorithm and collection system. Web text collection technology, the database design approach for the text collection system, the functional design of the collection system, and the collection algorithm are studied.
The specific process of collecting Web text from the Internet to form sets of Web txt files is also described.
(3) Proposing an association rule mining algorithm for Web text classification based on rough sets and the double-base cooperating mechanism. Two dimension reduction methods are combined: an initial feature selection using mutual information, followed by further attribute reduction using rough set theory. This combination yields a more effective dimension reduction process and greatly shrinks the high-dimensional feature space of the text. Moreover, segmentation and dimension reduction are applied only to the training texts; the feature items of a text awaiting classification are selected by matching against the reduced feature space of the training set, which greatly speeds up text categorization. The algorithm combines rule mining for text categorization with the double-base cooperating mechanism, which is based on the inner cognitive mechanism, and with association analysis methods, in order to optimize and extract rules further. With the interruptive coordinator and this processing, the number of condition features in the rules and the number of rule conflicts are reduced as far as possible, so the rules are more adaptable. Text is then classified by combining the two feature reduction methods with a hybrid of several classification rule mining methods. Experiments comparing the traditional method with this hybrid association rule mining algorithm, based on rough set attribute reduction and the double-base cooperating mechanism, validate its feasibility.
(4) Proposing a modified text classification algorithm based on fuzzy inference.
Fuzzy inference and classification using the maximum-minimum composition method keeps only the main information and neglects many secondary factors. Although this method reduces the computational workload, the adaptability and credibility of its results are unsatisfactory. A modified fuzzy comprehensive weighted evaluation algorithm is therefore proposed: it uses the "comprehensive weighted model" operator (·,⊕) given in the dissertation and establishes a fuzzy inference mechanism that considers the influence of every factor according to its weight coefficient, to ensure the accuracy and credibility of the result. A method for adjusting the values of text feature items is also given. Taking into account the length and the label of the words of the feature items shows that weighting and adjusting the weight values benefits classification efficiency. Experiments comparing the traditional algorithm based on composite fuzzy reasoning with this modified algorithm validate its feasibility.
(5) Proposing an improved centroid-based Web text classification algorithm with feedback, based on the inner cognitive mechanism and drawing on cognitive science. The traditional two-stage method of training followed by classification cannot keep learning, so its classification ability stays fixed during later classification. An automatic feedback stage is therefore added after the training and classification stages to simulate the incremental learning mode of humans and the progressive mode of knowledge discovery. Automatically adjusting and correcting the classifier improves both the degree of intelligence of text categorization and its classification effectiveness.
Experiments comparing the improved algorithm with the traditional centroid-based Web text classification algorithm validate its feasibility.
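Contribution (3) starts its dimension reduction with a first-pass feature selection by mutual information. The abstract does not give the exact formula, so the following is only a minimal sketch under assumptions: it scores each term by its highest pointwise mutual information with any class, log(P(t,c) / (P(t)P(c))), estimated from document counts. The function names `mutual_information_scores` and `select_features` are hypothetical, and the rough-set attribute reduction that follows in the dissertation is not shown.

```python
import math
from collections import Counter, defaultdict


def mutual_information_scores(docs, labels):
    """Score each term by its highest pointwise mutual information with any class.

    docs:   list of token lists (already segmented training texts)
    labels: list of class names, one per document
    Assumption: probabilities are estimated from document frequencies.
    """
    n = len(docs)
    class_count = Counter(labels)   # documents per class
    term_count = Counter()          # documents containing each term
    joint = defaultdict(Counter)    # joint[term][label] = docs of that class containing term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_count[term] += 1
            joint[term][label] += 1

    scores = {}
    for term, by_class in joint.items():
        # MI(t, c) = log( P(t, c) / (P(t) * P(c)) ) = log( A * N / (df(t) * df(c)) )
        scores[term] = max(
            math.log(by_class[c] * n / (term_count[term] * class_count[c]))
            for c in by_class
        )
    return scores


def select_features(docs, labels, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    scores = mutual_information_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

In this sketch a term that appears in several classes scores lower than a term concentrated in one class, so it is the first to be dropped. As the abstract notes, only training texts go through this step; a text awaiting classification would simply be projected onto the retained terms.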
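Contribution (4) contrasts maximum-minimum composition with a weighted operator (·,⊕). The dissertation's own operator definition is not reproduced in this abstract, so the sketch below assumes the common reading from fuzzy comprehensive evaluation: product for "·" and bounded sum for "⊕", that is, b_j = min(1, Σ_i a_i r_ij). The function names and the toy numbers are illustrative assumptions only.

```python
def max_min_composition(weights, relation):
    """Maximum-minimum composition: b_j = max_i min(a_i, r_ij)."""
    n_classes = len(relation[0])
    return [
        max(min(a, row[j]) for a, row in zip(weights, relation))
        for j in range(n_classes)
    ]


def weighted_sum_composition(weights, relation):
    """Assumed (., +) model: b_j = min(1, sum_i a_i * r_ij), so every factor contributes."""
    n_classes = len(relation[0])
    return [
        min(1.0, sum(a * row[j] for a, row in zip(weights, relation)))
        for j in range(n_classes)
    ]


def classify(memberships, class_names):
    """Pick the class with the highest membership degree (first one wins on ties)."""
    best = max(range(len(memberships)), key=memberships.__getitem__)
    return class_names[best]


# Toy example: three feature items (rows) and two classes (columns).
weights = [0.5, 0.3, 0.2]      # feature weights a_i
relation = [                   # r_ij: membership of feature i in class j
    [0.6, 0.5],
    [0.1, 0.5],
    [0.1, 0.5],
]
max_min_result = max_min_composition(weights, relation)
weighted_result = weighted_sum_composition(weights, relation)
```

With these toy values, maximum-minimum composition yields 0.5 for both classes because only the dominant feature decides the outcome, while the weighted model yields about 0.35 and 0.5, so the two secondary features break the tie in favor of the second class. This is the loss of secondary information that the abstract attributes to the maximum-minimum method.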
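Contribution (5) adds an automatic feedback stage to a centroid-based classifier. The abstract does not specify the feedback rule, so the sketch below assumes one simple possibility: when a text is classified with cosine similarity at or above a threshold, it is folded into the winning class centroid. The class name `FeedbackCentroidClassifier`, the `threshold` parameter, and the update rule are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class FeedbackCentroidClassifier:
    """Centroid classifier whose centroids keep adapting after training (assumed rule)."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # minimum similarity for a text to be fed back
        self.sums = {}              # per-class sum of feature vectors
        self.counts = {}            # per-class number of vectors absorbed

    def _add(self, vector, label):
        if label not in self.sums:
            self.sums[label] = [0.0] * len(vector)
            self.counts[label] = 0
        self.sums[label] = [s + x for s, x in zip(self.sums[label], vector)]
        self.counts[label] += 1

    def train(self, vectors, labels):
        """Training stage: build one centroid per class."""
        for vector, label in zip(vectors, labels):
            self._add(vector, label)

    def centroid(self, label):
        return [s / self.counts[label] for s in self.sums[label]]

    def classify(self, vector, feedback=True):
        """Classification stage, followed by the assumed feedback stage."""
        scored = {label: cosine(vector, self.centroid(label)) for label in self.sums}
        best = max(scored, key=scored.get)
        if feedback and scored[best] >= self.threshold:
            self._add(vector, best)  # confident result adjusts the classifier
        return best
```

Under this assumption, confidently classified texts shift the centroids over time, while low-confidence results leave the classifier unchanged, which limits the risk of reinforcing mistakes.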
Keywords/Search Tags: Web Text Mining, Web Text Classification, Inner Cognitive Mechanism, Rough Sets, Fuzzy Reasoning