Research On Domain Text Classification Based On Semantics

Posted on:2012-01-09

Degree:Master

Type:Thesis

Country:China

Candidate:G X Zhang

Full Text:PDF

GTID:2178330338991429

Subject:Information and Communication Engineering

Abstract/Summary:

With the rapid development of information technology, and especially the spread of the Internet, people are confronted with an explosion of text information. This surge of information must be classified and filtered effectively so that people can find useful content in it more efficiently. Text classification is the task of having a computer automatically assign large numbers of texts to the appropriate categories of a given classification scheme, based on each text's content or attributes. Because it can process large volumes of text, it can to some extent relieve the current disorder of information and help users find the information they need. Traditional text classification algorithms use keywords as features to build a vector space model in which the keywords are treated as mutually independent, with no semantic association between them. Although these algorithms have developed rapidly, they still face problems: when the structural information of a text and its rich semantic associations are ignored, classification accuracy remains unsatisfactory. In recent years a large amount of semantic data has become available online, such as WordNet and Wikipedia. This thesis takes the view that making full use of such data is one way to improve text categorization. The primary problem in text categorization is the text representation model. Most traditional approaches are based on the vector space model. This representation is relatively simple, but it produces high-dimensional, sparse vectors. On the one hand, this makes text classification computationally very expensive; on the other hand, it ignores the semantic relationships between terms, so a large amount of semantic information is lost and the feature vector represents the text poorly.
All these issues reduce the efficiency and accuracy of text classification and degrade its performance. To solve these problems, this thesis draws on the semantic information provided by the semantic dictionary WordNet, proposes a new text representation model, and designs and implements a prototype text categorization system. Depending on the type of document, the system selects either a concept-based text representation or a representation based on distance graphs. The concept vector space model uses concepts as text features and maps terms that are synonyms to a single concept. The distance graph model adds document structure analysis: feature terms become the nodes of a distance graph, co-occurrence relationships between features become its edges, and each text is thereby mapped to a graph structure. Finally, the support vector machine (SVM) classification algorithm is combined with these text representation models, and system performance is tested on the Fudan University corpus. The system is compared with a traditional text classification system in terms of recall, precision, and F-measure. The results show that the proposed method outperforms the traditional system: the overall effect improves by 12.49%, recall increases by 13.5%, and F-measure increases by 23.16% on average. In short, this thesis addresses text categorization for several specific domains. It provides theoretical analysis and experimental verification of key techniques, including feature extraction, text representation models, and text classification algorithms, puts forward a series of solutions, and demonstrates their effectiveness through experimental results. These algorithms and models should have some reference value for future research on text classification and other text processing problems.
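
The two representations described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the synonym table SYNONYM_TO_CONCEPT is a hypothetical stand-in for WordNet lookups, the function names are invented for illustration, and the distance graph uses a simplified definition (an edge from term a to term b whenever b follows a within `order` positions), since the abstract does not give the exact construction.

```python
from collections import Counter

# Hypothetical toy synonym table standing in for WordNet synsets.
# Each term maps to a single concept label.
SYNONYM_TO_CONCEPT = {
    "car": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
    "buy": "purchase",
    "purchase": "purchase",
}


def to_concepts(tokens):
    """Replace each token with its concept; unknown tokens are kept as-is."""
    return [SYNONYM_TO_CONCEPT.get(token, token) for token in tokens]


def concept_vector(tokens):
    """Bag-of-concepts counts: synonyms collapse into one feature."""
    return Counter(to_concepts(tokens))


def distance_graph(tokens, order=1):
    """Build a directed co-occurrence graph as an edge -> weight dict.

    Assumed definition: edge (a, b) is counted once for every position
    where b follows a by at most `order` positions.
    """
    edges = {}
    for i, source in enumerate(tokens):
        for step in range(1, order + 1):
            if i + step >= len(tokens):
                break
            edge = (source, tokens[i + step])
            edges[edge] = edges.get(edge, 0) + 1
    return edges


if __name__ == "__main__":
    doc_a = "i buy a car".split()
    doc_b = "i purchase an auto".split()
    # Both documents now share the features "purchase" and "automobile".
    print(concept_vector(doc_a))
    print(concept_vector(doc_b))
    print(distance_graph(to_concepts(doc_a), order=2))
```

In a full pipeline, the concept counts or the edge weights would be turned into numeric feature vectors and passed to an SVM classifier; that step is omitted here to keep the sketch free of external dependencies.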

Keywords/Search Tags:

text classification, semantic, distance graph, SVM

Related items

1	A Study On Chinese Text Classification Based On Semantic Graph
2	Research And Application Of Text Classification Based On Multi-Semantic Fusion Learning
3	Research On WEB Page Classification Algorithms Based On Text Semantic Graph
4	Research On Key Techniques Of Short-text Representation And Classification Based On Hybrid Semantic
5	Construction Of Hierarchical Semantic Graph And Its Application In Text Mining
6	Research And Implementation Of Sensitive Text Classification Algorithm Based On Artificial Immune System
7	Research On Text Summarization Technology Based On Abstract Meaning Representation Graph
8	Research And Application Of Short Text Semantic Analysis Based On Domain Knowledge Graph
9	Semi-supervised Text Classification Based On Graph Attention Neural Networks
10	Research On Ontology-Based Semantic Text Categorization