Study And Design Of Text Information Extraction And Classification System

Posted on:2011-10-29

Degree:Master

Type:Thesis

Country:China

Candidate:J F Yang

Full Text:PDF

GTID:2178360308969425

Subject:Software engineering

Abstract/Summary:

With the development of Internet technology, the amount of information on the Internet is increasing rapidly, which brings great convenience to people's lives. At the same time, obtaining accurate information becomes more and more difficult as the volume of information grows, and how to obtain accurate information is now widely recognized as an important problem. This thesis discusses information extraction and text categorization. The main work is as follows:

(1) Three models of text information extraction are introduced, and their advantages and disadvantages are compared. The focus is on the wrapper-based extraction model. To address the incomplete text data produced by traditional extraction methods, an XML-based information integration method using rule extraction is proposed. The method converts an HTML page into an XML page and integrates the text information through the XML page. Compared with the general method, it integrates the extracted text information effectively.

(2) Based on a practical application and the wrapper-based extraction model, the design and implementation of an intelligent web text extraction system are presented. The function of each module and the technology required to design it are described in detail. After discussing the application of the DataSet and XML mechanisms in file storage, a semi-automated method of rule extraction is given. Finally, an application example of the system is given.

(3) After analyzing the feature selection and categorization algorithms used in automatic text categorization, a method of reducing the dimension of the feature space based on automatic stop-word selection is proposed. This method achieves dimension reduction by applying stop-word processing to the test set. Analytical and experimental results show that the algorithm using automatic stop-word selection runs faster than the traditional algorithm without affecting the categorization results.

(4) Based on an improved KNN algorithm, a text categorization prototype system has been designed and implemented.
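
The idea behind items (3) and (4) can be illustrated with a minimal Python sketch. It is not the implementation from the thesis: the abstract does not state how stop-words are selected or how KNN is improved, so the sketch assumes that a term counts as a stop-word when it appears in more than a given fraction of the training documents, and it uses a plain KNN with cosine similarity over term counts. The names select_stop_words, knn_classify, max_df_ratio and the TRAIN data are hypothetical.

```python
import math
from collections import Counter


def select_stop_words(docs, max_df_ratio=0.8):
    """Assumed criterion: a term is a stop-word if it occurs in more than
    max_df_ratio of the documents (the thesis's actual rule is not given)."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    limit = max_df_ratio * len(docs)
    return {term for term, count in doc_freq.items() if count > limit}


def vectorize(tokens, stop_words):
    """Term-count vector with stop-words dropped, which shrinks the space."""
    return Counter(t for t in tokens if t not in stop_words)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query_tokens, train, stop_words, k=3):
    """Plain KNN: majority label among the k most similar training documents."""
    query = vectorize(query_tokens, stop_words)
    scored = sorted(
        ((cosine(query, vectorize(tokens, stop_words)), label)
         for tokens, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set, tokenized by whitespace.
TRAIN = [
    ("the match ended with a late goal".split(), "sports"),
    ("the team won the league title".split(), "sports"),
    ("the striker scored a goal".split(), "sports"),
    ("the bank raised the interest rate".split(), "finance"),
    ("the market fell as the rate rose".split(), "finance"),
    ("the stock market closed higher".split(), "finance"),
]

STOP_WORDS = select_stop_words([tokens for tokens, _ in TRAIN])
print(STOP_WORDS)
print(knn_classify("the team scored a goal".split(), TRAIN, STOP_WORDS))
```

Dropping the selected stop-words before building the vectors shortens every vector that enters the similarity computation, which is where a speed gain of the kind reported in item (3) would come from. Chinese text would additionally need a word segmentation step before tokenization, which this sketch omits.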

Keywords/Search Tags:

Information extraction, Chinese text categorization, Wrapper, KNN, Stop-word

Related items

1	Research And Implementation Of Text Categorization System Based On VSM
2	Research And Implementation Of The Automatic Chinese Text Categorization
3	Research Of Chinese Text Categorization Algorithms Based On Information Entropy
4	Research Of Automatic Categorization System For Chinese Text About Complaining Information
5	The Studies On Chinese Text Categorization Based On Pso And Svm
6	The Study Of Chinese Text Categorization Based On Concept
7	Study On Chinese Text Categorization
8	Research Of The Automatic Chinese WEB Text Categorization In Search Engine
9	Research On Chinese Text Categorization Algorithms Based On Technology Text
10	Algorithm Research For Text Information Extraction Based On Wrapper Model