The Study And Application Of Document Categorization

Posted on:2015-10-27

Degree:Master

Type:Thesis

Country:China

Candidate:Q R Liu

Full Text:PDF

GTID:2298330467463925

Subject:Communication and Information System

Abstract/Summary:

Full-text retrieval is a key technology in information retrieval and the basis of text categorization. Text classification assigns a collection of documents to categories under a predefined classification scheme. As an intersection of machine learning and information retrieval, it has been applied in many fields and has become an important subject in information science.

With the rapid development of the Internet, Weibo has become an important platform. More than ten million new Weibo messages are posted each day on average, and they contain valuable information, so technology for processing Weibo messages has promising prospects. Because of the particular characteristics of Weibo messages, the current mainstream classification methods may not be suitable for them. Research on text categorization of Weibo messages therefore has practical significance.

The main work of this thesis includes:
(1) Read high-quality papers on document classification technology, study the main problems of document classification, and complete an analysis report on the development of the technology.
(2) Study full-text retrieval technology, implement Chinese word segmentation, and build a full-text index system with Lucene.
(3) Study supervised and unsupervised feature extraction algorithms for document classification and evaluate them experimentally.
(4) Improve and complete the existing Weibo indexing and searching system, finishing its six main modules: initialization, crawler, index, document clustering, classification index, and query.
(5) Propose an incremental Weibo-oriented clustering algorithm that uses the centers from the previous round as references for comparison with the centers of the next round. The new centers are obtained by merging categories.
(6) Propose an incremental clustering algorithm with unknown-word detection. The growing Weibo corpus is clustered incrementally into categories, and within each category a set of morphemes is derived from local term frequency. The correct unknown words are then extracted from the morpheme sets.
(7) Submit the final report on the system's functions and performance...
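The incremental clustering idea in item (5) can be illustrated with a minimal sketch. The abstract gives no algorithmic details, so everything below is an assumption made for illustration: whitespace tokenization instead of Chinese word segmentation, raw term-frequency vectors, cosine similarity, plain K-means seeded with the first k messages of each batch, a merge threshold of 0.5, and all function names (`tf_vector`, `kmeans`, `merge_batch`, `incremental_cluster`). It is not the thesis's actual algorithm.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector from whitespace tokens (assumption: real Weibo
    text would first need Chinese word segmentation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(vectors, k, iterations=10):
    """Plain k-means under cosine similarity, seeded with the first k vectors
    so the run is deterministic. Returns (centers, cluster sizes)."""
    centers = [dict(v) for v in vectors[:k]]
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in vectors:
            nearest = max(range(len(centers)),
                          key=lambda i: cosine(v, centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a cluster ended up empty.
        centers = [mean_vector(g) if g else c for g, c in zip(groups, centers)]
    return centers, [len(g) for g in groups]


def merge_batch(categories, new_centers, new_sizes, threshold=0.5):
    """Compare each new center with the previous centers (the references).
    Merge it into the most similar one if the similarity reaches the
    threshold; otherwise keep it as a new category."""
    references = [center for center, _ in categories]  # snapshot of last centers
    merged = list(categories)
    for center, size in zip(new_centers, new_sizes):
        if size == 0:
            continue  # skip empty clusters
        sims = [cosine(center, ref) for ref in references]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            old_center, old_size = merged[best]
            total = old_size + size
            # Size-weighted average of the old and new centers.
            combined = {
                t: (old_center.get(t, 0.0) * old_size
                    + center.get(t, 0.0) * size) / total
                for t in set(old_center) | set(center)
            }
            merged[best] = (combined, total)
        else:
            merged.append((center, size))
    return merged


def incremental_cluster(batches, k=2, threshold=0.5):
    """Cluster each incoming batch, then merge its centers into the running
    list of (center, size) categories."""
    categories = []
    for batch in batches:
        centers, sizes = kmeans([tf_vector(text) for text in batch], k)
        categories = merge_batch(categories, centers, sizes, threshold)
    return categories
```

Calling `incremental_cluster` with a list of message batches returns the running categories as `(center, size)` pairs. Comparing new centers only against a snapshot of the previous centers keeps each round cheap, since earlier messages never need to be reclustered.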

Keywords/Search Tags:

Full-text retrieval, Text categorization, Weibo incremental clustering, K-means, Unknown word extraction

Related items

1	Statistical Law Of The Same Frequency Words For Text Mining And Short Text Categorization
2	Studies On Text Content Indexing: Based On Key Phrase
3	Research On Text Feature Selection Algorithm And Its Application In Micro-Blog
4	The Sense Guessing Of Chinese Unknown Words Research And Implementation For The Full Text Annotation
5	The Research And Implementation Of Full-Text System Based On Lucene And Textual Image
6	Research Of Automatic Categorization System For Chinese Text About Complaining Information
7	A Research On Weibo (Micro-blog) Data And The Construction Of A Blogger Analysis System
8	A Study On Text Categorization Based On Machine Learning
9	Full-text Search For The Modern Chinese Text Processing, Automatic Word Generic System
10	Research On Web Text Clustering And Classification Algorithm