
The Study And Application Of Document Categorization

Posted on: 2015-10-27
Degree: Master
Type: Thesis
Country: China
Candidate: Q R Liu
Full Text: PDF
GTID: 2298330467463925
Subject: Communication and Information System
Abstract/Summary:
Full-text retrieval is a key technology in information retrieval and a foundation for text categorization. Text classification assigns documents to categories under a predefined classification system. As a field that spans machine learning and information retrieval, it has been applied in many areas and has become an important subject in information science. With the rapid development of the internet, Weibo has become an important platform: more than ten million new Weibo messages are posted on an average day, and they contain valuable information, so technology for processing Weibo messages has promising prospects. Because of the particular characteristics of Weibo messages, current mainstream classification methods may not be suitable for classifying them, and research on text categorization of Weibo messages therefore has practical significance. The main work of this thesis includes:
(1) Read high-quality papers on document classification technology, study the main problems of document classification, and complete an analysis report on the development of the technology.
(2) Study full-text retrieval technology, implement Chinese word segmentation, and build a full-text index system with Lucene.
(3) Study supervised and unsupervised feature extraction algorithms for document classification and test them experimentally.
(4) Improve and complete the existing Weibo indexing and search system, finishing its six main modules: initialization, crawler, index, document clustering, classification index, and query.
(5) Propose an incremental Weibo-oriented clustering algorithm that uses the previous cluster centers as references for comparison with the next centers; the new centers are obtained by merging categories.
(6) Propose an incremental clustering algorithm with unknown-word detection. The growing Weibo corpus is incrementally clustered into categories, and within each category a set of morphemes is derived from local term frequency; the correct unknown words are then extracted from the morpheme sets.
(7) Submit the final report on the system's functions and performance...
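The incremental clustering idea in item (5) can be illustrated with a minimal Python sketch. It is only one possible reading of the abstract, not the thesis's actual algorithm: it assumes messages are already represented as numeric feature vectors, uses plain k-means (listed in the keywords) seeded with the previous batch's centers, and merges centers that end up closer than a distance threshold. The function names, the `new_seeds` and `threshold` parameters, and the unweighted-average merge rule are all hypothetical.

```python
import math


def nearest(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, centers, iters=10):
    """Plain k-means that starts from the given centers."""
    centers = list(centers)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # A cluster that received no points keeps its old center.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


def merge_close(centers, threshold):
    """Merge centers closer than threshold (hypothetical merge rule:
    unweighted average of the two centers)."""
    merged = []
    for c in centers:
        for i, m in enumerate(merged):
            if math.dist(c, m) < threshold:
                merged[i] = mean([c, m])
                break
        else:
            merged.append(c)
    return merged


def incremental_step(prev_centers, batch, new_seeds, threshold):
    """Cluster a new batch, using the previous centers as reference seeds,
    then merge categories whose centers have become near-duplicates."""
    centers = kmeans(batch, list(prev_centers) + list(new_seeds))
    return merge_close(centers, threshold)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for message features (made-up data).
    prev = [(0.0, 0.0), (10.0, 10.0)]
    batch = [(0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
    print(incremental_step(prev, batch, new_seeds=[(10.5, 10.5)], threshold=2.0))
```

Seeding each batch with the earlier centers keeps category identities stable across batches, while the merge step stops the number of categories from growing without bound as new seeds are added. A real system would need a text-appropriate distance such as cosine distance over term vectors.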
Keywords/Search Tags: Full-text retrieval, Text categorization, Weibo incremental clustering, K-means, Unknown word extraction