Categorization Corpus Construction And Research On Classification Method For Short Text

Posted on: 2016-09-01
Degree: Master
Type: Thesis
Country: China
Candidate: W X Wu
GTID: 2308330461992491
Subject: Computer software and theory
Abstract/Summary:
Corpora and dictionaries are the basic resources of natural language processing research and complement information processing technology. Because language keeps evolving in form and complexity, a useful corpus must reflect the features of the language it covers. Chinese corpora are now fairly mature, with results in both construction and application. However, with the rapid development of information processing technology, the demand for classification corpora in various fields has grown in both quantity and specialization. Corpora built earlier cannot fully meet current requirements for novelty, specialization, construction methods and so on. Constructing text classification corpora is therefore an important research direction. Text classification has become the core and foundation of large-scale data processing applications, and lagging corpus research has become a barrier to the development of information technology.

At the same time, with the rise of the social web, information in the form of short texts has poured into people's lives. Traditional corpora are no longer sufficient for current research, and the shortcomings of traditional methods also show up in short text classification. Large-scale short text corpora capture people's positions and points of view on a wide range of social phenomena, which opens up important applications, including public opinion surveys, hot topic mining, new word detection and topic detection.
Classification is also a key step for further mining of these texts, so short text classification is receiving more and more attention. In this thesis, we construct a Chinese short text corpus suitable for classification and topic modeling, and make some improvements to the short text classification method. The main work includes the following aspects:

1. Because distinctive, general-purpose corpora are lacking, a short text corpus is constructed in this work. Through the Sina API, we collect microblog data covering six categories and more than 20 thousand posts. The corpus is then processed through category annotation, topic modeling and dictionary building. Finally, classification experiments are run on the constructed corpus to verify its usefulness for classification.

2. Short texts typically have little content, a loose format, varied sentence length and relatively complex sentence structure, so traditional classification algorithms perform poorly on them. This thesis presents a combined method for short text classification that fuses BTM topic features with an improved weight calculation method.

First, for feature weighting, the TF component of the TF-IWF algorithm is improved and a distribution entropy variable is introduced.

Second, introducing the "document-topic" probability distribution of the BTM topic model strengthens the relationship between contexts. This helps resolve ambiguous words whose meanings are hard to determine because short texts lack context information. The experimental results show that the F1-measure improves.

Finally, because some texts are so short that the document is empty after feature selection, we use the "topic-word" probability distribution of the BTM topic model to extend the text.
The topic word set is selected as follows: for each document, choose the word set of the topic with the highest probability in that document's topic distribution. Experiments show that setting the number of added keywords to one improves classification considerably, without additional cost in computation time or space.
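
The weighting and expansion steps described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions: the abstract does not give the improved TF formula or the distribution entropy term, so `tf_iwf` below is the plain, unmodified TF-IWF weight (term frequency times the log of total corpus tokens over the term's corpus count), not the thesis's improved version. BTM training is not shown; `doc_topic` and `topic_word` stand for distributions that a trained topic model is assumed to provide. All function and variable names are hypothetical.

```python
import math
from collections import Counter


def tf_iwf(docs):
    """Plain TF-IWF weights for tokenized documents (lists of words).

    weight(t, d) = tf(t, d) * log(total_tokens / corpus_count(t))
    This is the baseline scheme, not the thesis's improved variant.
    """
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    total_tokens = sum(corpus_counts.values())
    weights = []
    for doc in docs:
        if not doc:  # empty documents get no weights
            weights.append({})
            continue
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(total_tokens / corpus_counts[term])
            for term, count in counts.items()
        })
    return weights


def expand_with_topic_words(doc, doc_topic, topic_word, k=1):
    """Append the k most probable words of the document's dominant topic.

    doc_topic  : list of topic probabilities for this document (assumed given)
    topic_word : list of dicts, one per topic, mapping word -> probability
    k          : number of words to add; the abstract reports k = 1 works well
    """
    # Pick the topic with the highest probability for this document.
    best_topic = max(range(len(doc_topic)), key=doc_topic.__getitem__)
    # Rank that topic's words by probability, highest first.
    ranked = sorted(topic_word[best_topic], key=topic_word[best_topic].get, reverse=True)
    # Skip words the document already contains.
    extra = [word for word in ranked if word not in doc][:k]
    return list(doc) + extra


if __name__ == "__main__":
    toy_topic_word = [
        {"goal": 0.6, "match": 0.4},
        {"stock": 0.5, "market": 0.3, "fund": 0.2},
    ]
    print(expand_with_topic_words(["stock"], [0.2, 0.8], toy_topic_word))
    print(tf_iwf([["a", "b"], ["a", "c"]]))
```

In this sketch a document that is empty after feature selection still receives one word from its dominant topic, which mirrors the motivation for the expansion step given above.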
Keywords/Search Tags:corpus, short text, annotation, TF-IWF, BTM theme model, Classification method