Study on Text Semantic Representation and Key Techniques of Hierarchical Classification

Posted on: 2013-10-17    Degree: Doctor    Type: Dissertation
Country: China    Candidate: S L Song
GTID: 1228330395457244    Subject: Computer application technology
Abstract/Summary:
The rapid growth of information technology and of the internet has brought us into an information age that is both rich and rapidly updated. With the emergence of various social networks in recent years in particular, massive amounts of text are produced and disseminated online every day. "Information poverty" has been replaced by "information overload". The problem we face is no longer how to obtain information, but how to extract the required information quickly and efficiently from a large volume of it. As a key technology of great practical value, text classification can to a large extent bring order to disorganized information, and makes it easier for users to locate the information they need and to distribute information. With the wide application of classification technology in information retrieval, public sentiment analysis, information filtering, news classification, digital libraries and other areas, the study of key text classification techniques has become a frontier topic in information processing, with broad application prospects and important research significance. This dissertation is mainly concerned with text semantic representation and key techniques of hierarchical classification. The author's major contributions are outlined as follows:

1. A text representation model based on the Text Semantic Graph is proposed. To address the loss of word-level semantic information caused by text representations based on word frequency statistics, a new Chinese text semantic representation model, the Text Semantic Graph, is proposed; it takes into account the contextual semantics and background information of the words in the text. The method captures the semantic relationships between words using Wikipedia as a knowledge base. Words with strong semantic relationships are combined into a word-package, represented by a graph node, which is weighted by the total number and frequency of the words it contains. A contextual relationship between words in different word-packages is represented by a directed edge, which is weighted with the maximum weight of its adjacent nodes. The model retains the contextual information of each word to a large extent while strengthening the semantic relationships between words.

2. A hierarchical text classification method based on a virtual category tree is proposed. To address two problems of existing hierarchical classification methods, namely top-down construction of the classification model and repeated learning of sample data, a new hierarchical text classification method based on a virtual category tree is proposed. The method builds its classifiers bottom-up, which reduces the cost of repeatedly learning samples and shortens training time. During top-down classification, the similarity between the preprocessed document vector and each associated classifier is calculated, and the category with the maximum similarity is selected at each level until the document reaches a leaf node.

3. Incremental learning algorithms for hierarchical text classification are proposed. Based on an analysis of two learning scenarios, the adjustment of a single document and the arrival of new sample sets, incremental learning algorithms built on the hierarchical classification model are proposed for both. For single-document adjustment, the classifier at the leftmost mismatching node between the document's classification path and its actual path in the virtual category tree is retrained, and the virtual category tree model is then updated. For new sample sets, the feature space is updated incrementally using an incremental feature selection algorithm, and the weights are recalculated to improve the accuracy of the classification model.

4. A performance evaluation method for hierarchical text classification is proposed. To evaluate hierarchical classification methods and overcome the limitations of conventional flat classification measures, and after studying hierarchical classification methods based on a concept tree, a set of extended measures is put forward that describes performance more accurately by making use of the level of, and the "affinity" among, the categories in the hierarchy. In addition, an Error Classification Concentration Ratio (ECCR) is defined based on the distribution of misclassified samples. Besides evaluating the classification result, ECCR can guide the selection of training samples so that the training set is more representative.

5. A text information processing model is designed. Based on an application scenario for intelligent text processing, a process model is designed that comprises four stages: text information collection, hotspot aggregation and classification, full-text information retrieval, and integrated compilation of text information. On this basis, a text information processing system is developed. The system supports pre-processing, analysis and integrated compilation of text information, and provides a software platform that helps information workers process information more efficiently.
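
To make the graph structure in contribution 1 concrete, here is a minimal sketch in Python. It is not the dissertation's implementation. The abstract does not give the exact weighting formula, so the node weight below (number of distinct words plus their total frequency) is an assumption. The `package_of` mapping is a hypothetical stand-in for the Wikipedia-based semantic grouping, and "adjacent nodes" of an edge is read here as its two endpoints.

```python
from collections import Counter


def build_text_semantic_graph(tokens, package_of):
    """Build a toy Text Semantic Graph from a token sequence.

    tokens:     words of one document, in reading order.
    package_of: hypothetical dict mapping a word to its word-package id;
                words missing from it form a package of their own.
    Returns (node_weight, edges), where edges maps a directed pair of
    package ids to the edge weight.
    """
    freq = Counter(tokens)

    # Collect the distinct words that belong to each word-package (node).
    members = {}
    for word in freq:
        members.setdefault(package_of.get(word, word), set()).add(word)

    # Assumed node weight: number of distinct words + their total frequency.
    node_weight = {
        pkg: len(words) + sum(freq[w] for w in words)
        for pkg, words in members.items()
    }

    # A directed edge links the packages of two consecutive words;
    # its weight is the larger of the two endpoint node weights.
    edges = {}
    for first, second in zip(tokens, tokens[1:]):
        src = package_of.get(first, first)
        dst = package_of.get(second, second)
        if src != dst:
            edges[(src, dst)] = max(node_weight[src], node_weight[dst])
    return node_weight, edges


tokens = ["car", "engine", "auto", "road", "car"]
package_of = {"car": "vehicle", "auto": "vehicle"}
node_weight, edges = build_text_semantic_graph(tokens, package_of)
```

In this toy input, "car" and "auto" collapse into a single "vehicle" node, so the contextual links to "engine" and "road" attach to the package rather than to the individual words.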
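
The top-down classification step in contribution 2 can be sketched as follows. This is an illustration under stated assumptions, not the author's code: each category's classifier is represented by a single sparse prototype vector, and cosine similarity is used as the similarity measure. The abstract specifies neither. The names `children`, `prototype` and `classify_top_down` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_top_down(doc, children, prototype, root="root"):
    """Descend the category tree, picking the most similar child each time.

    doc:       sparse document vector (term -> weight).
    children:  dict mapping a category to its list of child categories;
               a category without an entry is treated as a leaf.
    prototype: dict mapping a category to its (assumed) prototype vector.
    Returns the list of categories visited below the root.
    """
    path = []
    node = root
    while children.get(node):
        node = max(children[node], key=lambda c: cosine(doc, prototype[c]))
        path.append(node)
    return path


children = {"root": ["sports", "tech"], "sports": ["football", "tennis"]}
prototype = {
    "sports": {"ball": 1.0, "match": 1.0},
    "tech": {"code": 1.0, "chip": 1.0},
    "football": {"goal": 1.0, "ball": 1.0},
    "tennis": {"racket": 1.0, "ball": 1.0},
}
doc = {"ball": 1.0, "goal": 2.0, "match": 1.0}
print(classify_top_down(doc, children, prototype))  # ['sports', 'football']
```

Only the classifiers along the chosen path are consulted, so the number of similarity computations depends on the depth and branching of the tree rather than on the total number of categories.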
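
For the single-document adjustment in contribution 3, the node to retrain is found by comparing the predicted path with the actual path. The sketch below assumes both paths are written from the root towards the leaf, so the "leftmost" mismatch is the shallowest level at which they diverge. The function name `first_mismatch` is hypothetical, and the retraining itself is left to the caller because the abstract does not describe it.

```python
def first_mismatch(predicted_path, actual_path):
    """Find the shallowest level at which two category paths diverge.

    Both paths are lists of category names ordered from just below the
    root down to the leaf. Returns (depth, predicted_node, actual_node)
    for the first differing position, or None if no position differs.
    """
    for depth, (pred, actual) in enumerate(zip(predicted_path, actual_path)):
        if pred != actual:
            return depth, pred, actual
    return None


# The document was routed to "football" but actually belongs to "tennis".
result = first_mismatch(["sports", "football"], ["sports", "tennis"])
print(result)  # (1, 'football', 'tennis')
```

Restricting retraining to this one node is what keeps the adjustment incremental: classifiers above the divergence point already routed the document correctly and are left untouched.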
Keywords/Search Tags: Text Classification, Text Representation, Hierarchical Classification, Incremental Learning, Performance Evaluation, Text Semantic Graph, Virtual Category Tree