The Study On Automatical Domain-Specific Knowledge Extraction From Websites Based On Bootstrapping

Posted on:2013-01-25

Degree:Master

Type:Thesis

Country:China

Candidate:Q Kang

Full Text:PDF

GTID:2248330374982379

Subject:Computer system architecture

Abstract/Summary:

With the rapid expansion of the Internet and the fast development of online applications, the volume of web information keeps growing. The web has become an important knowledge repository, and people increasingly need to obtain the information they want efficiently. The web contains a large amount of semi-structured domain knowledge on movies, books, restaurants and so on, which is closely related to daily life. Today, people can easily retrieve information from the web through search engines, but the results are not always reliable. Moreover, the domain-specific knowledge encoded in semi-structured pages often comes from the underlying databases of commercial providers, so it is difficult for keyword-matching search engines to crawl and index it. How to automatically extract and organize such domain-specific knowledge has become a research hotspot in the field of information extraction.

Based on an analysis of current web information extraction methods, this thesis represents an HTML page by its tag paths instead of a DOM tree. Compared with the tag tree, this representation dramatically reduces the number of tags to be handled and improves the performance of the algorithm. The thesis proposes a novel bootstrapping-based algorithm for automatically extracting domain-specific knowledge from semi-structured websites: Domain-specific Knowledge Extraction from Websites (DKEW). DKEW uses an ontology to unify the labeling of the extracted domain-specific semi-structured data, which makes the knowledge easier to organize and query. DKEW first clusters the target pages by their tag paths to filter out noisy pages and keep the detail pages, which contain more semi-structured information. To extract information from detail pages, we propose a novel pattern based on the tag path representation. For each cluster of detail pages, DKEW uses a machine learning method to learn the pattern with the help of the known seed. DKEW then automatically extracts domain knowledge using the learned pattern and maps the extracted knowledge to the predefined ontology in the form of a table. Newly mapped knowledge with high reliability is used to expand the domain seed and the ontology for the next iteration. Finally, DKEW uses bootstrapping to iterate the whole process and integrates these steps into an automatic information extraction algorithm. DKEW requires only a small human effort to initialize the seed by annotating a few pages from Wikipedia in the specific domain.

To verify DKEW, we crawled large-scale data from several popular domains with our own web crawler. Experimental results show that DKEW outperforms RoadRunner, an automatic web information extraction method, in both effectiveness and efficiency. Moreover, our approach uses automatic mapping instead of the manual labeling that RoadRunner requires, which saves considerable labor and time. The experimental results also verify that DKEW generalizes to large-scale domains.
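
The tag path representation and the filtering of noisy pages described in the abstract can be illustrated with a minimal Python sketch. This is not the DKEW implementation: the abstract does not specify how tag paths are built or compared, so the names `TagPathCollector`, `tag_paths` and `path_similarity`, the choice of one root-to-text-node path per text node, and the use of Jaccard similarity between path sets are all assumptions made for illustration. The sample pages are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects one root-to-node tag path per non-empty text node (assumed design)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.paths = []   # list of (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append(("/".join(self.stack), text))


def tag_paths(html):
    """Return the (tag_path, text) pairs of an HTML page."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def path_similarity(html_a, html_b):
    """Jaccard similarity between the tag path sets of two pages (assumed measure)."""
    a = {path for path, _ in tag_paths(html_a)}
    b = {path for path, _ in tag_paths(html_b)}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical pages: two detail pages sharing a template and one navigation page.
detail_1 = ("<html><body><h1>Movie A</h1><table><tr>"
            "<td>Director</td><td>Person X</td></tr></table></body></html>")
detail_2 = ("<html><body><h1>Movie B</h1><table><tr>"
            "<td>Director</td><td>Person Y</td></tr></table></body></html>")
noisy = ("<html><body><ul><li><a href='/a'>Home</a></li>"
         "<li><a href='/b'>Login</a></li></ul></body></html>")

for path, text in tag_paths(detail_1):
    print(path, "->", text)
print(path_similarity(detail_1, detail_2))
print(path_similarity(detail_1, noisy))
```

In this sketch, pages generated from the same template share their tag paths and score high, while a navigation page shares none and scores low, so a clustering step could group pages by this score. The similarity threshold and the clustering method that DKEW actually uses are not stated in the abstract.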

Keywords/Search Tags:

Domain knowledge extraction, semi-structured websites, pattern learning, ontology mapping, bootstrapping

Related items

1	Ontology-Based Structured Information Extraction From Web Pages
2	Research Of Pattern Extraction From Semi-structured Data Based On Rules
3	Research On Structured Information Extraction Based On Pattern Matching
4	Research On The Methods Of Domain Semantic Knowledge Base Construction And Knowledge Service
5	Research On Domain Ontology Learning Based On Chinese Texts
6	Research On Feature Extraction Method Of Semi-structured Document
7	A Research On Methods Of Knowledge Acquisition From Domain-Specific Texts And Their Application In Knowledge Acquisition From Archaeological Texts
8	Research On Domain Ontology Assisted Construction Based On Knowledge Representation Learning
9	Research On Semantic Information Extraction For Semi-structured Documents
10	Construction And Implementation Of Domain Ontology Based On Plain Text