Research On Extracting Information From Chinese Web Pages Based On Conceptual Model

Posted on: 2008-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Chen
GTID: 2178360212476269
Subject: Computer software and theory
Abstract/Summary:
With the spread of computer technology, and especially the rapid development of the World Wide Web in recent years, the amount of information on the Internet has increased dramatically. How to help users obtain information and use it more effectively is a problem that computer scientists have to face.

This thesis focuses on the conceptual retrieval model and puts forward a complete solution for extracting information from Chinese web pages. It constructs conceptual relationships from a knowledge base so as to give each web page a semantic description. That description is then compared with the description of the user's requirements, and the comparison drives a second round of screening of the search results.

The conceptual retrieval model is an active research area that has emerged recently in the information retrieval field. So far many researchers have proposed their own system designs, but these lack real running applications and verifiable research results.

The thesis first introduces the definitions of concept and attribute, the relationships between concepts, and the concept graph. It then presents the overall framework for extracting information from Chinese web pages based on the conceptual model. The framework is described in two parts: acquisition of conceptual knowledge and the extraction procedure. The first part covers the usefulness of a conceptual knowledge base, the products currently available, and its future prospects. With these resources, the extraction process is divided into three modules: text block filtering, text block classification, and text block information extraction. There are three types of text block: search, list, and pure text. A different method is proposed for extracting information from each type.

Next, the thesis details the technique for acquiring entity relationship templates. Extracting entity relationships helps in understanding the meaning of a text and thereby improves search accuracy. To this end, we put forward a bootstrapping method called Slim Template Getter (STG). This method uses a sequence matching technique from bioinformatics to generate semantic templates from the context of Chinese entities. A new evaluation model is presented to select better templates, while tuples are expanded to maintain high quality in the next training iteration. Test results show that the templates created by STG not only cover a large number of tuples but also reach 99% accuracy.

Finally, a real extraction system named Squib is implemented based on the methods above. In the experiments, a conceptual knowledge base on "train tickets" is built. The system filters the first fifteen web pages returned by Google, extracts the values of the attributes relevant to the requirements, and re-ranks the search results. The evaluation shows that the system can extract useful information and improve Google's search results to a certain degree.
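The abstract does not say which sequence matching algorithm STG uses, so the following is only a rough sketch of the general idea, assuming a plain longest-common-subsequence alignment as a stand-in. Two tokenized contexts, with the entity pair already replaced by the hypothetical slot markers <E1> and <E2>, are aligned; tokens shared by both are kept and each stretch of disagreement collapses into a wildcard. The function names, the slot markers, and the example sentences are illustrative assumptions, not taken from the thesis.

```python
from typing import List

WILDCARD = "*"  # hypothetical marker for "any tokens may appear here"


def lcs_table(a: List[str], b: List[str]) -> List[List[int]]:
    """table[i][j] = length of the longest common subsequence of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def align_to_template(a: List[str], b: List[str]) -> List[str]:
    """Align two token sequences; keep shared tokens, collapse gaps to one wildcard."""
    table = lcs_table(a, b)
    template: List[str] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            template.append(a[i])
            i += 1
            j += 1
        else:
            # Open a gap unless the previous template element is already one.
            if not template or template[-1] != WILDCARD:
                template.append(WILDCARD)
            # Skip the token whose removal preserves the longer common subsequence.
            if table[i + 1][j] >= table[i][j + 1]:
                i += 1
            else:
                j += 1
    # Unmatched tokens left over at the end of either sequence also form a gap.
    if (i < len(a) or j < len(b)) and (not template or template[-1] != WILDCARD):
        template.append(WILDCARD)
    return template


# Two contexts for the same (origin, destination) relation in the train ticket domain.
# Roughly: "<E1> bound-for <E2> 's train" and "<E1> direct-to <E2> 's train".
context_a = ["<E1>", "开往", "<E2>", "的", "列车"]
context_b = ["<E1>", "直达", "<E2>", "的", "火车"]
template = align_to_template(context_a, context_b)
print(template)
```

In a bootstrapping loop, templates obtained this way would be scored, the better ones kept, and then matched against new text to harvest further entity tuples for the next round. The scoring and tuple expansion steps are not sketched here because the abstract gives no detail about them.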
Keywords/Search Tags: Information Retrieval, Conceptual Model, Text Block, Machine Learning, Bootstrapping