| Information retrieval technology originated in the 1940s, when it was used to manage the growing body of scientific literature. With the arrival of the information age, both the volume of digital text and users' demand for access to information grew rapidly, making information retrieval technology increasingly important. Full-text retrieval and text categorization are two important technologies in information retrieval. Text categorization is a content-based document management technology that depends heavily on the basic techniques of full-text retrieval, so the two have much in common. Weibo is an information medium characterized by extremely fast dissemination, strong real-time character, and a wide range of information sources. This paper focuses on full-text retrieval and designs and builds a full-text retrieval system based on Lucene. Building on that system, it then studies weibo-oriented text categorization and designs and builds a weibo-oriented full-text retrieval and text categorization system. The work is divided into two parts: the research and application of full-text retrieval, and the research and application of weibo-oriented text categorization. The main contents are as follows: (1) Through research on full-text retrieval technology and analysis of the task, this paper solves key issues such as information acquisition, file management, and index file management, and designs and builds a full-text retrieval system based on Lucene; (2) This paper studies the factors that affect the results of text clustering based on Euclidean distance and cosine similarity. 
The study shows that cosine similarity is the better choice for the weibo-oriented text categorization system; (3) Based on the characteristics of weibo, this paper proposes an improved K-means method for weibo. The method can automatically compute the value of K and initialize the algorithm according to the weibo corpus; (4) Based on the improved K-means method, a weibo-oriented method for unknown word extraction is proposed. This method reduces computational complexity without degrading performance; (5) Based on the full-text retrieval system and the research on weibo-oriented text categorization, this paper solves key issues such as text clustering, classified indexing, and classified querying, and designs and builds a weibo-oriented full-text retrieval and text categorization system. |
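
The comparison between Euclidean distance and cosine similarity in item (2) can be illustrated with a minimal sketch. It is not the paper's experiment: the whitespace tokenizer, the raw term-frequency vectors, the function names, and the three sample texts are all assumptions chosen for illustration. The sketch shows one commonly noted difference between the two measures, namely that Euclidean distance is sensitive to text length while cosine similarity compares only the direction of the term vectors.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector (whitespace tokenization is an assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Euclidean distance over the union of terms; a Counter yields 0 for missing terms."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


# Hypothetical sample texts: the second repeats the first three times,
# the third shares no terms with the first.
short_post = term_vector("lucene index search")
long_post = term_vector("lucene index search " * 3)
other_post = term_vector("weather rain today")

same_topic_cos = cosine_similarity(short_post, long_post)
other_topic_cos = cosine_similarity(short_post, other_post)
same_topic_dist = euclidean_distance(short_post, long_post)
other_topic_dist = euclidean_distance(short_post, other_post)

print(same_topic_cos, other_topic_cos)
print(same_topic_dist, other_topic_dist)
```

In this toy case the Euclidean distance places the unrelated text closer to `short_post` than the longer text on the same topic, whereas cosine similarity ranks the same-topic text highest. Whether this effect accounts for the result reported in the paper is not stated in the abstract.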