
Research On Enterprise Competitive Intelligence Acquisition Based On Web Information Extraction

Posted on: 2016-05-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y G He
Full Text: PDF
GTID: 2208330464463531
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development and popularity of the Internet, the network has become an indispensable part of people's lives. Much of the information on the network reaches users through webpages, and the rich information these pages contain provides a new source of intelligence for the Enterprise Competitive Intelligence System (ECIS). The purpose of this thesis is to develop a general method for acquiring enterprise competitive intelligence. Building on a study of existing web information extraction technology, the thesis presents a new web information extraction algorithm based on the DOM tree and the DBSCAN algorithm. It then studies and constructs a model of enterprise competitive intelligence acquisition based on web information extraction.

First, the thesis surveys the current state of web information extraction and enterprise competitive intelligence, and discusses the basic theory of ECIS and of enterprise competitive intelligence acquisition. It then analyzes the web data processing technologies used in the thesis: web crawlers, Jsoup webpage parsing, the DOM, and the DBSCAN algorithm. After that, it introduces in detail the basic concepts, technologies, and evaluation standards of web information extraction.

Secondly, by studying the common rules underlying the varied and changing structures of webpages on the Internet, the thesis presents a new web information extraction algorithm that combines the DOM tree with the DBSCAN algorithm. The parts of the algorithm are described in detail: webpage preprocessing, DOM tree construction and acquisition of segmented text content, and webpage content extraction based on DBSCAN. The experimental results show that the algorithm effectively obtains the main text of a webpage. In addition, the algorithm is highly general, because it does not depend on the structure of a particular webpage.

Finally, a model of enterprise competitive intelligence acquisition is constructed for an enterprise in a given industry, based on web crawler technology, webpage parsing technology, and the web information extraction algorithm. Starting from a predefined website, the model uses a web crawler to collect the URLs of all links on the site. It then filters the webpages by judging whether each page's title is related to the industry. Next, it obtains the main text of each remaining webpage. After that, it extracts enterprise competitive information from the main text according to predefined information that the enterprise focuses on. Based on this model, a prototype system for enterprise competitive intelligence acquisition is designed and implemented. The experimental results support the validity of the model and indicate that it achieves a certain level of correctness.
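The abstract does not say which features the algorithm clusters, so the following is only a minimal sketch of the general idea (segment the page's text along DOM block boundaries, then use DBSCAN to find the dense run of prose), not the thesis's actual method. It assumes that each text segment is represented by its position in document order, that only segments of at least `min_len` characters are clustered, and that the cluster with the most characters is the main text. The names `SegmentParser`, `dbscan`, `extract_main_text`, and the parameters `min_len`, `eps`, and `min_pts` are hypothetical. The thesis uses Jsoup; Python's standard `html.parser` stands in for it here.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "article", "section"}
SKIP_TAGS = {"script", "style"}


class SegmentParser(HTMLParser):
    """Collect text segments in document order, split at block-level tags."""

    def __init__(self):
        super().__init__()
        self.segments = []  # finished text segments
        self._buf = []      # text pieces of the segment being built
        self._skip = 0      # nesting depth inside <script>/<style>

    def _flush(self):
        text = " ".join("".join(self._buf).split())  # normalize whitespace
        if text:
            self.segments.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def dbscan(points, eps, min_pts):
    """Plain DBSCAN on 1-D points; returns one label per point (-1 = noise)."""
    labels = [None] * len(points)
    cluster = -1

    def neighbours(i):
        # A point counts as its own neighbour, as in the usual definition.
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nb)
    return labels


def extract_main_text(html, min_len=40, eps=1, min_pts=2):
    """Return the densest run of long text segments (assumed main text)."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    segs = parser.segments
    # Assumption: a segment's feature is its position in document order.
    positions = [i for i, s in enumerate(segs) if len(s) >= min_len]
    labels = dbscan(positions, eps, min_pts)
    clusters = {}
    for pos, lab in zip(positions, labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(pos)
    if not clusters:
        return ""
    best = max(clusters.values(), key=lambda ps: sum(len(segs[p]) for p in ps))
    return "\n".join(segs[p] for p in best)
```

In this sketch, short navigation and footer segments never become candidate points, and an isolated long segment (for example a single advertisement line) ends up as DBSCAN noise because it has no nearby long neighbours. A practical implementation would likely need richer features, such as link density or DOM depth, and tuned values for `eps` and `min_pts`.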
Keywords/Search Tags:DOM Tree, DBSCAN, Web Information Extraction, Enterprise Competitive Intelligence, Competitive Intelligence Acquisition