
Research On Key Problems In Web Text Mining

Posted on: 2013-01-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Y Wang
Full Text: PDF
GTID: 1228330374499653
Subject: Signal and Information Processing
Abstract/Summary:
With the rapid development of the Internet and telecommunication networks, web text has become an important and indispensable carrier of information. Web text mining draws on theories from data mining, pattern recognition, information retrieval, natural language processing, and related fields. It aims to extract comprehensible, easy-to-use knowledge from large and complex collections of text. This dissertation focuses on several key problems in web text mining, such as text categorization, SMS filtering, information retrieval, and complex networks.
(1) Multiclass text categorization. To address the weaknesses of Error Correcting Output Code (ECOC) decoding, this dissertation proposes a multiclass text categorization method based on Support Vector Machines (SVM) and probabilistic ECOC. Several binary classifiers are trained according to an appropriate encoding matrix, and the values of their decision functions are transformed into probabilities by a sigmoid-style function. Two decoding algorithms are introduced for classifying samples: one computes the probability of each class, and the other solves for the pseudo-inverse of the encoding matrix. Experiments on standard Chinese and English datasets show that these methods outperform traditional ECOC and other classic algorithms. Moreover, their precision remains stable when samples are unevenly distributed across classes.
(2) Evolutionary SMS filtering. This dissertation proposes a series of algorithms and systems for evolutionary SMS filtering that cope with rapidly changing messages, personalization, and a lack of training samples. First, a basic evolutionary system built on a Naive Bayes classifier is introduced. Its innovations lie in flexible user feedback, adaptive learning, and evolutionary learning. Three types of personalized feedback are put forward according to users' habits. Evolutionary learning and adaptive learning are used to update features and their weights.
Moreover, this dissertation proposes an interlayer mapping-based SMS filtering algorithm that targets high precision even when few training samples are available. Experimental results show that the proposed method can effectively process a stream of short messages and update the filter automatically. The interlayer mapping-based filtering algorithm reaches the required accuracy with rapid convergence. It can also be combined with traditional methods to boost performance when enough training samples are available.
(3) Web entity-oriented search. This dissertation proposes a set of algorithms and systems for entity mining and retrieval based on the Entity Track at the Text REtrieval Conference (TREC). Entity lexicons covering dozens of types are established for entity extraction through semi-automatic, rule-based, and statistics-based methods. A Document-Centered Model (DCM) and an Entity-Centered Model (ECM) are proposed for entity ranking. In addition, semantic category labels are introduced to improve accuracy. Because entities in web pages should be identified uniquely, a rule-based algorithm for homepage allocation is presented. The system ranked first in the official assessment, which attests to the effectiveness of the proposed methods. In addition, tests on the semi-structured English Wikipedia dataset indicate that semantic category labels improve the NDCG of DCM and ECM by 12.1% and 25.6%, respectively.
(4) Modeling and applications of complex networks based on activation force and affinity measure. Taking natural language text as an example, Word Activation Force (WAF), analogous to activation effects in biology and psychology, is proposed by merging statistics such as word frequency, co-occurrence, and distance. A word affinity measure and an undirected network for studying the semantic similarity between words are then derived from WAF. On this basis, WAF and the word affinity measure are applied to text representation, feature selection, and text categorization.
These methods are also suitable for Protein-Protein Interaction (PPI) network modeling and protein association analysis. In addition, an entity affinity measure contributes to re-ranking in entity retrieval. Experimental results demonstrate that complex network modeling based on activation force and affinity measure is of great significance for web text mining.
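As an illustration of the probabilistic ECOC decoding summarized in item (1), the following is a minimal Python sketch, not the dissertation's implementation. The abstract gives no formulas, so several details are assumptions: the sigmoid parameters `a` and `b` (Platt-style scaling), the product rule used to score each class, the reading of the pseudo-inverse decoder as a least-squares solve, and the hypothetical 4-class code matrix and decision values.

```python
import numpy as np


def sigmoid_probs(decision_values, a=-1.0, b=0.0):
    """Map binary-classifier decision values to P(bit = +1).

    Assumed Platt-style form 1 / (1 + exp(a * f + b)); a and b are
    placeholders that would normally be fitted on held-out data.
    """
    f = np.asarray(decision_values, dtype=float)
    return 1.0 / (1.0 + np.exp(a * f + b))


def decode_by_class_probability(code_matrix, bit_probs):
    """Score each class by the product of per-bit agreement probabilities.

    Assumption: bits are treated as independent, so a class scores
    prod_j p_j where its codeword bit is +1 and (1 - p_j) where it is -1.
    """
    m = np.asarray(code_matrix)
    p = np.asarray(bit_probs, dtype=float)
    agree = np.where(m == 1, p, 1.0 - p)  # shape: (classes, bits)
    scores = agree.prod(axis=1)
    return scores / scores.sum()


def decode_by_pseudo_inverse(code_matrix, bit_probs):
    """Estimate class scores q from the model p_j = sum_c q_c * B[c, j].

    Assumption: B[c, j] = 1 when class c is a positive class for bit j.
    The least-squares solution is q = pinv(B^T) p; q need not sum to one.
    """
    b = (np.asarray(code_matrix) == 1).astype(float)
    return np.linalg.pinv(b.T) @ np.asarray(bit_probs, dtype=float)


# Hypothetical exhaustive code for 4 classes (rows) and 7 classifiers (columns).
CODE_MATRIX = np.array([
    [+1, +1, +1, +1, +1, +1, +1],
    [-1, -1, -1, -1, +1, +1, +1],
    [-1, -1, +1, +1, -1, -1, +1],
    [-1, +1, -1, +1, -1, +1, -1],
])

# Made-up decision values close to the codeword of class 2, with one wrong bit.
DECISION_VALUES = [-1.2, -0.8, 0.9, 1.5, -0.3, 0.4, 1.1]

bit_probs = sigmoid_probs(DECISION_VALUES)
class_probs = decode_by_class_probability(CODE_MATRIX, bit_probs)
class_scores = decode_by_pseudo_inverse(CODE_MATRIX, bit_probs)

print("product-rule prediction:", int(np.argmax(class_probs)))
print("pseudo-inverse prediction:", int(np.argmax(class_scores)))
```

In this toy case both decoders select the class whose codeword lies closest to the soft outputs even though one classifier votes the wrong way, which is the error-correcting behaviour ECOC is meant to provide.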
Keywords/Search Tags: text categorization, SMS filtering, text retrieval, complex network, activation force