The Design And Implementation Of Enterprise Information-Oriented Web Focused Search

Posted on: 2014-04-27 | Degree: Master | Type: Thesis
Country: China | Candidate: X Fan | Full Text: PDF
GTID: 2298330467464496 | Subject: Computer application technology
Abstract/Summary:
Gathering company information from the web is important to a company's survival and development, because it supports customer relationship management, identification of potential competitors, and similar tasks. General-purpose search engines handle such needs poorly, so we design and implement a focused search engine oriented to company information.

Web pages concerning company information fall into two categories: POI pages, which present the company information as structured tables, and TOI pages, which present it as unstructured text. The two kinds of pages differ greatly in structure and therefore require different focused-search processing. The focused crawler and information extraction are the two core components of a focused search engine. For these two core tasks, taking the different presentations of company information into account, we carry out the following exploratory and experimental studies.

1. POI-oriented focused crawler. The existing literature on focused crawlers is mostly subject-oriented, and there is little work on POI-oriented user needs. In this thesis, we implement a POI-oriented focused crawler with Naive Bayes (NB) and Support Vector Machine (SVM) classifiers by designing effective feature templates. Experimental results show that targeting POI-oriented user needs with a crawler is feasible.

2. TOI-oriented focused crawler. Most traditional focused crawlers process the entire content of a text page directly, which introduces a lot of noise. In this thesis, we implement a TOI-oriented focused crawler with an improved page correlation analysis algorithm. The algorithm takes the five content blocks most relevant to the topic, assigns each a weight, and uses classifiers to make the overall relevance judgment. Experiments are conducted with both NB and SVM classifiers. Compared with a baseline focused crawler that uses all the text in a page, the harvest rate is higher by 20% on average, and the largest difference reaches 51.35%, which shows that the improved page correlation analysis algorithm is effective.

3. Information extraction for companies. We take the web pages obtained by the focused crawlers as the data source and extract company information from POI or TOI areas. Information in POI areas has a standard layout and strong structural regularity, so we use a wrapper to extract it. For the more complex TOI areas, a statistical model is used. The task is divided into two steps: first, sentences containing slots are found; second, the categories of the slots are judged. The final slot values are then determined by the joint probabilities of sentences and phrases. Experiments with eight company properties as slots yield an average F-measure of 93.8% over all slots, 7.6% higher than the rule-based baseline system, which demonstrates the effectiveness of the algorithm.
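
As a rough illustration of the weighted-block idea behind the TOI-oriented crawler, the sketch below merges the tokens of several page blocks into one weighted bag of words and computes a harvest rate. The block names, the weights, and the keyword-share score are hypothetical: the abstract does not name the five blocks or their weights, and the thesis feeds the weighted content to NB or SVM classifiers, not to the simple score used here as a stand-in.

```python
from collections import Counter

# Hypothetical block names and weights; the abstract does not list the
# five blocks or the weights actually used.
BLOCK_WEIGHTS = {
    "title": 3.0,
    "meta_keywords": 2.0,
    "meta_description": 2.0,
    "anchor_text": 1.5,
    "main_text": 1.0,
}


def weighted_term_counts(blocks):
    """Merge per-block token counts into one weighted bag of words."""
    features = Counter()
    for name, text in blocks.items():
        # Blocks outside the five weighted ones contribute nothing.
        weight = BLOCK_WEIGHTS.get(name, 0.0)
        for token in text.lower().split():
            features[token] += weight
    return features


def relevance_score(blocks, topic_terms):
    """Share of the weighted term mass that falls on topic terms (0.0 to 1.0).

    Stand-in for the classifier decision: a real system would pass the
    weighted features to a trained NB or SVM model instead.
    """
    features = weighted_term_counts(blocks)
    total = sum(features.values())
    if total == 0:
        return 0.0
    on_topic = sum(w for term, w in features.items() if term in topic_terms)
    return on_topic / total


def harvest_rate(relevant_flags):
    """Fraction of fetched pages that were judged relevant."""
    flags = list(relevant_flags)
    return sum(flags) / len(flags) if flags else 0.0


# Illustrative page; the content is made up.
page = {
    "title": "Acme company profile",
    "main_text": "weather today",
    "sidebar": "company company",  # not one of the weighted blocks
}
score = relevance_score(page, {"company", "profile"})
rate = harvest_rate([True, False, True, True])
```

Weighting a small set of informative blocks, and ignoring the rest of the page, is what keeps navigation text and other noise from dominating the relevance judgment.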
Keywords/Search Tags: Focused search, Focused crawler, Information extraction, Joint probability model
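
The joint probability model named in the keywords, used in the two-step TOI extraction, can be sketched as follows. This is a minimal illustration under stated assumptions: the joint score is taken as the product of a sentence-level probability (step one) and a phrase-level probability (step two), the candidate tuples and their probabilities are made up, and the function names are hypothetical. The abstract does not give the exact form of the model.

```python
def select_slot_values(candidates):
    """Pick, for each slot, the phrase with the highest joint score.

    candidates: iterable of (slot, phrase, p_sentence, p_phrase) tuples.
    p_sentence is the step-one probability that the sentence contains the
    slot; p_phrase is the step-two probability that the phrase belongs to
    that slot category.
    """
    best = {}
    for slot, phrase, p_sentence, p_phrase in candidates:
        # Assumption: joint score is the product of the two step scores.
        joint = p_sentence * p_phrase
        if slot not in best or joint > best[slot][1]:
            best[slot] = (phrase, joint)
    return best


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up candidates for two hypothetical slots.
candidates = [
    ("founded", "1998", 0.9, 0.8),
    ("founded", "2005", 0.6, 0.9),
    ("address", "12 Example Road", 0.7, 0.5),
]
chosen = select_slot_values(candidates)
```

Scoring the sentence and the phrase jointly lets a confident sentence-level decision outweigh a slightly better phrase match found in a sentence that probably does not mention the slot at all.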