Study And Design Of Text Information Extraction And Classification System

Posted on: 2011-10-29
Degree: Master
Type: Thesis
Country: China
Candidate: J F Yang
GTID: 2178360308969425
Subject: Software engineering
Abstract/Summary:
With the development of Internet technology, the amount of information on the Internet has grown rapidly, which brings great convenience to people's lives. At the same time, obtaining accurate information becomes more and more difficult as the volume of information increases, and how to obtain accurate information is now a widely recognized problem. This thesis discusses information extraction and text categorization. The main work is as follows:

(1) First, three models of text information extraction are introduced and their advantages and disadvantages are compared. We focus on the wrapper-based extraction model and introduce the design and implementation of an intelligent web text extraction system. To address the incomplete text data produced by traditional information extraction methods, an XML-based information integration method for rule extraction is proposed. The method converts an HTML page into an XML page and integrates the text information through the XML page. Compared with the general method, it integrates the extracted text information effectively.

(2) Based on a practical application and the wrapper-based extraction model, this thesis presents the design and implementation of an intelligent web text extraction system. The function of each module and the technologies required to design it are described in detail. After discussing the application of the DataSet and XML mechanisms in file storage, this thesis gives a semi-automated method of rule extraction. Finally, an application example for the system is given.

(3) After analyzing the feature selection and categorization algorithms used in automatic text categorization, a dimension-reduction method based on automatic stop-word selection is proposed. The method reduces the dimension of the feature space by applying stop-word processing to the test set. Analytical and experimental results show that the algorithm using automatic stop-word selection runs faster than the traditional algorithm without affecting the categorization results.

(4) Based on an improved KNN algorithm, a text categorization prototype system has been designed and implemented.
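The abstract does not give the details of the stop-word selection rule or of the improvements made to KNN, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes that a term counts as a stop-word when it occurs in more than a chosen fraction of the training documents (the `max_doc_ratio` threshold is hypothetical), that documents are whitespace-tokenized term-count vectors, and that classification is plain KNN with cosine similarity rather than the improved variant the thesis describes. All function names and the sample data are illustrative.

```python
import math
from collections import Counter

def select_stop_words(docs, max_doc_ratio=0.8):
    """Assumed rule: a term is a stop-word if it appears in more than
    max_doc_ratio of the documents (document-frequency threshold)."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {t for t, c in doc_freq.items() if c / len(docs) > max_doc_ratio}

def vectorize(tokens, stop_words):
    """Term-count vector with stop-words dropped (the dimension reduction)."""
    return Counter(t for t in tokens if t not in stop_words)

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_classify(query_tokens, training, stop_words, k=3):
    """Plain KNN: majority vote among the k most similar training vectors."""
    query_vec = vectorize(query_tokens, stop_words)
    scored = sorted(
        ((cosine(query_vec, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Illustrative toy corpus (not from the thesis).
SAMPLES = [
    ("the team won the football match", "sports"),
    ("the player scored in the match", "sports"),
    ("the coach praised the team", "sports"),
    ("the bank raised the interest rate", "finance"),
    ("the market fell as the rate rose", "finance"),
    ("the investor sold the stock", "finance"),
]
TOKENIZED = [(text.split(), label) for text, label in SAMPLES]
STOP_WORDS = select_stop_words([tokens for tokens, _ in TOKENIZED])
TRAINING = [(vectorize(tokens, STOP_WORDS), label) for tokens, label in TOKENIZED]

print(STOP_WORDS)
print(knn_classify("the team played the match".split(), TRAINING, STOP_WORDS))
```

In this sketch, removing the selected stop-words shrinks every vector before any similarity is computed, which is where a speed-up of the kind reported in the abstract would come from. Whether accuracy is preserved depends on the threshold and the corpus.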
Keywords/Search Tags: Information extraction, Chinese text categorization, Wrapper, KNN, Stop-word