On Research Of Deep Search And Information Extraction For E-commerce Websites

Posted on: 2012-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhou
GTID: 2218330338968319
Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of network and database technology, the Internet has become a carrier of a vast amount of information, and extracting meaningful information accurately and quickly from this mass of web data has become an urgent problem. E-commerce websites hold a great deal of such information; their pages are generated dynamically and are highly structured. E-commerce websites in fact belong to the Deep Web: information that cannot be reached by following hyperlinks and must instead be accessed through dynamic web technology, with users submitting queries through a specific interface (the query interface) to reach the data in a back-end database. Deep search is search aimed at the Deep Web. Because the Deep Web offers abundant information, a high degree of structure, and publicly accessible interfaces, we choose e-commerce websites as the object of our deep search research, with the goal of extracting product information. The search engines of e-commerce websites make this extraction convenient for researchers: given the keywords a user enters, the query interface dynamically returns the matching information from the web database. We use these query interfaces for deep search, obtaining the pages we need by simulating the filling-in and submission of keywords. The pages obtained by deep search contain a large amount of product information, and we use them as the source for extraction. The key issue in web information extraction is how to generate extraction rules. There are two approaches: automatic generation and manual generation. Each has advantages and shortcomings, and each has its own scope of application. The automatic approach suits websites with differing structures, but its precision is lower.
Although the manual approach requires someone to write the rules, the precision of its extraction results is higher. Because the page structures of e-commerce websites are largely the same, and the information we want to extract includes product name, product price, freight, and other product details, we choose the manual method for its more precise results.

The main work and innovations of this paper are as follows:

1. Design a keyword-file interface that allows the system to accept keyword files (text files with a carriage return between consecutive keywords) and feeds these keywords into the system for filling in and submitting query forms. We also address incremental keywords: the system does not accept keywords that are already in the old keyword library.
2. Extract the HTML code of e-commerce websites. By analyzing each site's HTML, we extract the portion belonging to the query form. Based on this code, we use the WebBrowser Control to simulate the filling-in and submission of keywords and obtain the initial result pages for those keywords.
3. Extract hyperlinks selectively: only hyperlinks to product information are extracted, not advertisement or other unrelated hyperlinks. Because product listings span multiple pages, we also follow the "next-page" hyperlinks to collect product hyperlinks more comprehensively. We introduce several approaches to obtaining the "next-page" hyperlinks and propose one with good applicability.
4. According to the structure of each website, generate extraction rules as regular expressions and use them to extract the information. The extraction results are saved as text files, which is convenient for updating the keyword files.
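
As a rough illustration of the incremental keyword filter and the hand-written regular-expression rules described above, the following Python sketch may help. It is not the thesis's implementation: the sample HTML, the class names (title, price, freight, next), and the function names are all assumptions made up for this example, and a real site would need its own rule written against its own markup.

```python
import re

# Hypothetical listing markup (an assumption for illustration only).
SAMPLE_HTML = """
<div class="item"><a class="title" href="/p/1">USB Cable</a>
<span class="price">12.50</span><span class="freight">5.00</span></div>
<div class="item"><a class="title" href="/p/2">Phone Case</a>
<span class="price">30.00</span><span class="freight">0.00</span></div>
<div class="ad"><a href="/promo">Big Sale</a></div>
<a class="next" href="/search?page=2">next-page</a>
"""

# Hand-written extraction rule for the assumed markup above.
PRODUCT_RULE = re.compile(
    r'<a class="title" href="(?P<url>[^"]+)">(?P<name>[^<]+)</a>\s*'
    r'<span class="price">(?P<price>[\d.]+)</span>'
    r'<span class="freight">(?P<freight>[\d.]+)</span>'
)

# Rule for the "next-page" hyperlink, again tied to the assumed markup.
NEXT_PAGE_RULE = re.compile(r'<a class="next" href="([^"]+)">')


def load_new_keywords(text, old_library):
    """Return keywords (one per line) that are not already in the old library."""
    seen = set(old_library)
    fresh = []
    for line in text.splitlines():
        keyword = line.strip()
        if keyword and keyword not in seen:
            seen.add(keyword)  # also drops duplicates within the same file
            fresh.append(keyword)
    return fresh


def extract_products(html):
    """Apply the product rule; advertisement links do not match and are skipped."""
    return [match.groupdict() for match in PRODUCT_RULE.finditer(html)]


def find_next_page(html):
    """Return the next-page link if the page has one, otherwise None."""
    match = NEXT_PAGE_RULE.search(html)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(load_new_keywords("usb cable\nphone case\n", ["phone case"]))
    for product in extract_products(SAMPLE_HTML):
        print(product["name"], product["price"], product["freight"])
    print(find_next_page(SAMPLE_HTML))
```

Because the rule matches only anchors carrying the assumed title class, the advertisement link in the sample is ignored, which mirrors the selective hyperlink extraction in item 3. A rule this tightly bound to one page layout is precise but brittle, which is the trade-off the abstract attributes to the manual approach.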
Keywords/Search Tags: Deep Search, Deep Web, Web Information Extraction, URL Collection, Regular Expression