Research On Extracting Information From Chinese Web Pages Based On Conceptual Model

Posted on:2008-05-28

Degree:Master

Type:Thesis

Country:China

Candidate:X Y Chen

GTID:2178360212476269

Subject:Computer software and theory

Abstract/Summary:

With the spread of computer technology, and especially the rapid development of the World Wide Web in recent years, the amount of information on the Internet has increased dramatically. How to help users obtain information and use it more effectively is a problem that computer scientists have to face.

This thesis focuses on the conceptual retrieval model and puts forward a complete solution for extracting information from Chinese web pages. It constructs conceptual relationships from a knowledge base so as to give each web page a semantic description. This description is then compared with the description of the user's requirements, which serves as a second round of screening of the search results.

The conceptual retrieval model is a hot research area that has emerged recently in the field of information retrieval. So far many researchers have proposed their own system designs, but real running applications and verifiable research results are still lacking.

The thesis first introduces the definitions of concept and attribute, the relationships between concepts, and the concept graph. It then presents the overall framework for extracting information from Chinese web pages based on a conceptual model. The framework is described in two parts: the acquisition of conceptual knowledge and the extraction procedure. First, it discusses the usefulness of conceptual knowledge bases, existing products, and their future prospects. With these resources, the extraction process is divided into three modules: text block filtering, text block classification, and text block information extraction. There are three types of text block: search, list, and pure text. A different method is proposed to extract information from each of them.

Next, the thesis details the technique for acquiring entity relationship templates. Extracting entity relationships helps to better understand the meaning of the text and so increases the correctness of searching. For this purpose, we put forward a bootstrapping method called Slim Template Getter (STG). This method uses a sequence matching technique from bioinformatics to generate semantic templates from the context of Chinese entities. A new evaluation model is presented to select the better templates, while the set of tuples is expanded so that the next iteration of training yields high quality. Test results show that the templates created by STG not only cover a large number of tuples but also reach 99% accuracy.

Finally, a real extraction system named Squib is implemented based on the methods above. In the experiments, a conceptual knowledge base on "train ticket" is built. The system filters the first fifteen web pages returned by the Google search engine, extracts the values of the attributes relevant to the requirements, and re-ranks the search results. The evaluation shows that the system can extract useful information and improve on Google's search results to a certain degree.
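
The abstract does not say which sequence matching algorithm STG uses, so the following is only a minimal sketch of the general idea, assuming a longest-common-subsequence alignment of two tokenized contexts. The function names, the `*` wildcard, the `<FROM>`/`<TO>` entity placeholders, and the English sample tokens are all hypothetical and are not taken from the thesis.

```python
from typing import List

WILDCARD = "*"  # hypothetical marker for a slot where the two contexts differ


def lcs_table(a: List[str], b: List[str]) -> List[List[int]]:
    """table[i][j] = length of the longest common subsequence of a[i:] and b[j:]."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def merge_contexts(a: List[str], b: List[str]) -> List[str]:
    """Align two token sequences and keep the shared tokens as a template.

    Each run of tokens that is not shared by both sequences collapses
    into a single wildcard.
    """
    table = lcs_table(a, b)
    template: List[str] = []
    i = j = 0
    gap = False  # True while we are skipping unmatched tokens
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            if gap:
                template.append(WILDCARD)
                gap = False
            template.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
            gap = True
        else:
            j += 1
            gap = True
    # Leftover tokens on either side also form an unmatched run.
    if gap or i < len(a) or j < len(b):
        template.append(WILDCARD)
    return template


# Two hypothetical contexts around the same pair of entity slots.
context_1 = ["<FROM>", "to", "<TO>", "train", "ticket", "price"]
context_2 = ["<FROM>", "to", "<TO>", "express", "train", "ticket"]
template = merge_contexts(context_1, context_2)
```

In a bootstrapping loop of the kind the abstract describes, templates built this way would be scored, the best ones would be used to find new entity tuples, and those tuples would supply fresh contexts for the next iteration. The scoring model itself is not described in the abstract and is left out of this sketch.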

Keywords/Search Tags:

Information Retrieval, Conceptual Model, Text Block, Machine Learning, Bootstrapping

Related items

1	Conceptual Graph Based Text Retrieval In Specified Domain
2	Information Retrieval Oriented Analysis Of Text Content
3	The Research Of Machine Learning Techniques And External Web Resources For Relevance Feedback
4	Development of a conceptual graph-based information retrieval model for medical question databases
5	Conceptual Network Technology Applications And Research, Information Organization And Information Retrieval Of The Digital City
6	Research On Information Retrieval Technology
7	Research And Implementation On Query Expansion Model Of Information Retrieval Based-on Conceptual Graph
8	Research On Text Cross-language Information Retrieval Technology Based On Conceptual Graph
9	Study On Information Retrieval Algorithm Guided With Query Conceptual Graph
10	Study On Information Retrieval Model Guided With Query Conceptual Graph