Research on Domain Text Classification Based on Semantics

Posted on: 2012-01-09
Degree: Master
Type: Thesis
Country: China
Candidate: G X Zhang
GTID: 2178330338991429
Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of information technology, and especially the spread of the Internet, people are confronted with an explosion of text information. This surge of information must be classified and filtered effectively so that people can find useful content in it more efficiently. Text classification is the task of having a computer automatically assign large numbers of texts to the appropriate categories of a given classification system, based on the content or attributes of each text. Because it can process large volumes of text, it can relieve information disorder to some extent and make it easier for users to find the information they need. Traditional text classification algorithms use keywords as features to build a vector space model in which the keywords are treated as mutually independent, with no semantic association between them. Although these algorithms have developed rapidly, they still face problems: if the structural information of a text and its rich semantic associations are not taken into account, classification accuracy remains unsatisfactory. In recent years, a large amount of semantic data has become available online, such as WordNet and Wikipedia. This thesis argues that making full use of this rich semantic data is one way to improve text categorization.

The primary problem in text categorization is the text representation model. Traditional text classification is mostly based on the vector space model. This representation is relatively simple, but it produces high-dimensional, sparse feature vectors. On the one hand, this makes text classification computationally very expensive; on the other hand, it ignores the semantic relationships between terms, so a large amount of semantic information is lost and the feature vectors represent the text poorly. All these issues reduce the efficiency and accuracy of text classification and degrade its performance.

To solve these problems, this thesis draws on the semantic information provided by the semantic dictionary WordNet, proposes a new text representation model, and designs and implements a prototype text categorization system. Depending on the type of document, the system selects either a concept-based text representation or a distance-graph-based text representation. The concept vector space model uses concepts as text features, mapping terms that are synonyms to a single concept. In the distance graph representation model, document structure analysis is added: the feature terms of a text are represented as the nodes of a distance graph, the co-occurrence relationships between features are represented as its edges, and the text is thereby mapped to a graph structure. Finally, a support vector machine (SVM) classifier is combined with the text representation model, and the Fudan University corpus is used to test system performance, comparing the system with a traditional text classification system in terms of recall, precision, and F-measure. The results show that the proposed method outperforms the traditional text classification system: on average, the overall classification effect improves by 12.49%, recall increases by 13.5%, and F-measure increases by 23.16%.

In short, this thesis addresses text categorization for several specific domains. It carries out theoretical analysis and experimental verification of key techniques including feature extraction, the text representation model, and the text classification algorithm, proposes a series of solutions, and demonstrates their effectiveness with experimental results. These algorithms and models should have some reference value for future research on text classification and other text processing problems.
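
The two representations described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the exact graph definition, so the sketch assumes one common formulation in which an edge links two terms that occur within k positions of each other. The names `TOY_CONCEPTS`, `to_concepts`, and `distance_graph` are hypothetical, and the small synonym table stands in for a lookup that a real system would derive from WordNet.

```python
from collections import Counter

# Hypothetical synonym-to-concept table (assumption for illustration only);
# a real system would derive these mappings from WordNet.
TOY_CONCEPTS = {"car": "automobile", "auto": "automobile", "automobile": "automobile"}


def to_concepts(tokens, concept_map=TOY_CONCEPTS):
    """Replace each token by its concept label; unknown tokens are kept as-is."""
    return [concept_map.get(token, token) for token in tokens]


def distance_graph(tokens, k=2):
    """Count directed edges (a, b) where b follows a by at most k positions."""
    edges = Counter()
    for i, a in enumerate(tokens):
        for b in tokens[i + 1 : i + 1 + k]:
            edges[(a, b)] += 1
    return edges


# Synonyms "car" and "auto" collapse to one concept before the graph is built.
tokens = to_concepts("the car dealer sold the auto".split())
graph = distance_graph(tokens, k=2)
```

In this toy sentence, "car" and "auto" become the same node, so the edge from "the" to "automobile" is counted twice. The edge counts could then be flattened into a feature vector for a classifier such as an SVM.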
Keywords/Search Tags: text classification, semantics, distance graph, SVM