Research And System Development Of Content Duplicate Checking In E-business Website Based On Semantics

Posted on: 2018-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: P Xu
Full Text: PDF
GTID: 2348330518496482
Subject: Electronic Science and Technology
Abstract/Summary:
With the growth in Internet users and the flourishing of e-commerce, e-commerce websites keep getting larger, and the business data they host is growing explosively. As online shopping has become part of people's daily life, data on e-commerce websites has also become an important object of research into people's everyday economic activity. Efficient collection of e-commerce website information is therefore very important. However, these sites contain not only a large amount of data but also a large amount of redundant data, which seriously hurts the time efficiency of data collection and reduces the accuracy of the data. To let users compare this information more effectively, the data must be checked for duplicates.

This paper first introduces the technologies used throughout the work. The automated testing framework Selenium is used for data capture, which is the basis of the entire system. The paper then introduces the WordNet semantic standard, which is used to establish the nodes of the semantic tree model. The standard semantic tree is used to compute the similarity between products.

(1) Using the Selenium framework to crawl e-commerce website information. Automated testing frameworks are generally used to test web services, but this paper uses Selenium's ability to analyze page JavaScript, tags, and XPath to extract page elements, combined with the PhantomJS browser core. Applied to crawling e-commerce data, this can greatly reduce front-end page rendering time and increase crawl speed.

(2) Constructing a semantic tree model that characterizes e-commerce websites. The paper investigates the structure of major e-commerce websites, compares the similarity of their hierarchical classifications, and maps them to a semantic tree with a common structure. WordNet standard semantics are used to unify the description of each layer's nodes for goods on different merchants' websites, so that the merchandise information of different e-commerce websites is unified into the same semantic tree.

(3) Using the semantic tree for duplicate checking of goods. Because the semantic tree already defines a standard expression for each commodity, whether two items are the same or similar goods can be determined by comparing whether the paths they map to on the semantic tree are the same.

(4) Designing the e-commerce data acquisition system and the product similarity comparison system. Because the e-commerce data is described by a tree structure, the database storage structure uses a hierarchical relationship model, which can greatly reduce redundant data storage. The entire service is multithreaded, allowing data to be crawled from multiple e-commerce sites simultaneously. Since data from all sites is represented using the same model and stored in the same database, there is no risk of data from different sites being confused. Commodity similarity is compared node by node using this semantic tree model.
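The mapping in (2) and the path comparison in (3) can be illustrated with a small sketch. It is not the thesis's implementation. The `CANONICAL` synonym table is a hypothetical stand-in for a WordNet lookup, and the class and function names, the sample category paths, and the shared-prefix scoring rule in `path_similarity` are all assumptions made for illustration; the abstract does not say how node-by-node similarity is scored.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical synonym table standing in for a WordNet lookup: it maps each
# site's own category label to one shared node name.
CANONICAL: Dict[str, str] = {
    "cell phones": "mobile phone",
    "mobiles": "mobile phone",
}


def normalize(path: Iterable[str]) -> Tuple[str, ...]:
    """Map a site-specific category path onto shared node names."""
    names = []
    for label in path:
        key = label.strip().lower()
        names.append(CANONICAL.get(key, key))
    return tuple(names)


def path_similarity(a: Tuple[str, ...], b: Tuple[str, ...]) -> float:
    """Fraction of levels shared from the root down (an assumed scoring rule)."""
    if not a and not b:
        return 1.0
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))


class SemanticTree:
    """Tree whose root-to-node paths identify products."""

    def __init__(self) -> None:
        self.children: Dict[str, "SemanticTree"] = {}
        self.is_product = False

    def insert(self, path: Tuple[str, ...]) -> bool:
        """Insert a normalized path; return True if it was already present."""
        node = self
        for name in path:
            if name not in node.children:
                node.children[name] = SemanticTree()
            node = node.children[name]
        duplicate = node.is_product
        node.is_product = True
        return duplicate


# Made-up listings from three hypothetical sites.
a = normalize(["Electronics", "Cell Phones", "Brand X", "Model 1"])
b = normalize(["electronics", "Mobiles", "brand x", "model 1"])
c = normalize(["Electronics", "Mobiles", "Brand Y", "Model 9"])

tree = SemanticTree()
first = tree.insert(a)    # new path
second = tree.insert(b)   # same normalized path as a, so a duplicate
third = tree.insert(c)    # shares only the first two levels with a
sim = path_similarity(a, c)
```

In this sketch two listings count as duplicates only when their normalized paths match exactly, while `path_similarity` gives a graded score for listings that share upper levels of the tree. Shared upper-level nodes are stored once, which is the same effect the hierarchical storage model in (4) aims for.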
Keywords/Search Tags:e-commerce data mining, semantic tree, similarity comparison, duplicate checking