
Research On Specialty Knowledge Retrieval Method Based On Web Information Extraction

Posted on: 2008-05-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Hu
Full Text: PDF
GTID: 1118360242473059
Subject: Computer application technology
Abstract/Summary:
The rapid development of the Internet has made it an important resource for global information exchange and sharing. Data on the Web grow at a geometric rate, so acquiring a specific piece of useful information becomes ever more difficult, and "information overload" has become an urgent problem. Ideally, people could query data on the Web the same way they query a database; how to extract useful information from the vast and heterogeneous data on the Web, however, remains an open problem.

Characteristics such as large volume, heterogeneity, and dynamic variation make Web information extraction different from traditional information extraction and bring new challenges. In recent years, extraction techniques have been enriched as demand has grown, and many information extraction methods now exist at home and abroad. This dissertation investigates a method for automatically acquiring subject knowledge from the Web according to users' needs, in line with the subject knowledge base to be established in an intelligent tutoring system.

The specialized knowledge acquisition method based on Web information extraction proposed in this dissertation is mainly inspired by SRV's idea of treating information extraction as a classification problem. Together with a Web information extraction method based on HTML structure, we construct the framework of a Web specialized knowledge acquisition system based on Web information extraction and classification, and study several key techniques of this system. The main contents of the dissertation are as follows:

1. Large-scale Web page acquisition and preprocessing are analyzed. Specialized knowledge acquisition from the Web requires collecting a large quantity of web pages on the same topic.
The service currently provided by search engines cannot meet this need. We present a simple and efficient method that automatically acquires web pages in large quantity and matches pages on the same topic using regular expressions.

2. Page preprocessing is studied. According to the meaning of tags in the HTML file structure, an HTML container-tag tree is constructed. Based on the differing characteristics of noise blocks and subject-content blocks in the pages, noise nodes in the tag tree are deleted and the subject-content block is identified.

3. Subject information extraction from pages is discussed. Because existing information extraction methods require much manual intervention and much prior knowledge, and because different systems use different description languages, we adopt an information extraction method based on XML mapping: a JTree is built using DOM, the extraction path is acquired automatically from the tree nodes, and extraction rules are studied, so that information extraction is automated.

4. Chinese text feature representation and text classification algorithms are analyzed. In the vector space model, the number of feature words and the dimension of the search space are closely related to the efficiency of the classification algorithm. On this basis, we develop a feature-word extraction method based on part of speech, which reduces the dimension of the feature vector. We also propose two modified KNN algorithms, based on feature-word reduction and on data partitioning respectively, which improve the efficiency and performance of the classification algorithm.

5. Automatic construction of the training set is studied.
To improve the performance of the classification algorithm, a high-quality training set must be established. Past research has relied on training sets built in advance; in this work, a high-quality training set is generated automatically by Web mining, further increasing the degree of automation of specialized information acquisition.

6. Information organization and storage methods are analyzed. The extracted specialized knowledge is organized into a form that the client application, an intelligent tutoring system, can access directly, and the data are arranged according to the needs of that system.

In this dissertation, the key techniques in every stage of specialized knowledge acquisition based on Web information extraction are studied, a knowledge acquisition framework is established, and basic automation of the acquisition process is achieved.
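The topic-matching step in item 1 can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the actual regular-expression patterns are not given in the abstract, so the `TOPIC_PATTERNS` below are assumed examples.

```python
import re

# Hypothetical topic patterns: a fetched page is kept only if its visible
# text matches one of these (the dissertation's real patterns are not given).
TOPIC_PATTERNS = [re.compile(p, re.IGNORECASE) for p in [
    r"information\s+extraction",
    r"text\s+classification",
]]

TAG_RE = re.compile(r"<[^>]+>")   # crude tag stripper for illustration
WS_RE = re.compile(r"\s+")

def visible_text(html: str) -> str:
    """Strip markup so matching runs on page text, not on tag names."""
    return WS_RE.sub(" ", TAG_RE.sub(" ", html)).strip()

def on_topic(html: str) -> bool:
    """Keep a fetched page only if its text matches a topic pattern."""
    text = visible_text(html)
    return any(p.search(text) for p in TOPIC_PATTERNS)

pages = [
    "<html><body><h1>Web Information Extraction</h1></body></html>",
    "<html><body><p>Cooking recipes</p></body></html>",
]
kept = [p for p in pages if on_topic(p)]
```

In a real crawl the stripping and matching would run on downloaded pages; the filtering logic is the same.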
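The page-cleaning step in item 2 (building a container-tag tree and deleting noise nodes) can be sketched with the standard-library parser. The `NOISE_TAGS` set is an assumption standing in for the dissertation's noise-block heuristics.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"script", "style", "nav", "footer"}  # assumed noise markers

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""

class TagTreeBuilder(HTMLParser):
    """Build a container-tag tree from HTML, as in the page-cleaning step."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root
    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node
    def handle_endtag(self, tag):
        if self.cur.parent is not None:
            self.cur = self.cur.parent
    def handle_data(self, data):
        self.cur.text += data

def prune_noise(node):
    """Delete subtrees rooted at noise tags; return the cleaned node."""
    node.children = [prune_noise(c) for c in node.children
                     if c.tag not in NOISE_TAGS]
    return node

def collect_text(node, out):
    """Gather the remaining text, i.e. the subject-content block."""
    if node.text.strip():
        out.append(node.text.strip())
    for c in node.children:
        collect_text(c, out)
    return out

builder = TagTreeBuilder()
builder.feed("<html><body><div>Subject content</div>"
             "<script>var x=1;</script><footer>links</footer></body></html>")
clean = prune_noise(builder.root)
texts = collect_text(clean, [])
```

A production cleaner would also use structural cues (text-to-link ratio, block position) rather than tag names alone.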
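Item 3's idea of automatically acquiring the extraction path from tree nodes can be illustrated like this: record the root-to-node tag path of a known target once, then replay that path as an extraction rule on similarly structured pages. `ElementTree` stands in here for the DOM/JTree machinery the dissertation uses.

```python
import xml.etree.ElementTree as ET

def path_to(root, target):
    """Return an XPath-like (tag, index) path from root to target via DFS.
    This recorded path plays the role of an acquired extraction rule."""
    def dfs(node, trail):
        if node is target:
            return trail
        for i, child in enumerate(node):
            found = dfs(child, trail + [(child.tag, i)])
            if found is not None:
                return found
        return None
    return dfs(root, [])

def apply_path(root, path):
    """Follow a recorded path on another, similarly structured page."""
    node = root
    for tag, i in path:
        node = list(node)[i]
        if node.tag != tag:
            raise ValueError("page structure differs from the rule")
    return node.text

page1 = ET.fromstring(
    "<html><body><div><h1>Title A</h1><p>Fact A</p></div></body></html>")
target = page1.find("./body/div/p")
rule = path_to(page1, target)

page2 = ET.fromstring(
    "<html><body><div><h1>Title B</h1><p>Fact B</p></div></body></html>")
extracted = apply_path(page2, rule)
```

Because template-generated pages on one site share a layout, a path learned from one page generalizes to its siblings, which is what makes the extraction automatic.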
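Item 4 combines part-of-speech-based feature reduction with KNN classification. The sketch below shows plain cosine-similarity KNN over noun-only feature vectors; the POS tags are supplied by hand here as a stand-in for a real tagger, and the dissertation's two modified KNN variants (feature-word reduction and data partitioning) are not reproduced.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(doc, training, k=3):
    """Plain KNN by cosine similarity; the dissertation's modified variants
    would shrink the vectors or partition `training` before this step."""
    scored = sorted(training, key=lambda t: cosine(doc, t[0]), reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

def noun_features(tagged):
    """Keep only nouns ('n') as feature words: a stand-in for the
    part-of-speech-based dimensionality reduction described above."""
    return Counter(w for w, pos in tagged if pos == "n")

training = [
    (noun_features([("network", "n"), ("train", "v"), ("neuron", "n")]), "AI"),
    (noun_features([("goal", "n"), ("kick", "v"), ("match", "n")]), "sport"),
    (noun_features([("model", "n"), ("neuron", "n"), ("learn", "v")]), "AI"),
]
query = noun_features([("neuron", "n"), ("model", "n"), ("run", "v")])
label = knn_classify(query, training, k=3)
```

Dropping non-noun words before vectorization is what shrinks the feature space; the classifier itself is unchanged.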
Keywords/Search Tags: page acquisition, page cleaning, information extraction, specialty knowledge retrieval, characteristic extraction, text classification, information storage