Internet Product Information Extraction Technology Based On Structured Semantic Entropy

Posted on: 2010-10-31
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Wu
Full Text: PDF
GTID: 2208360275491686
Subject: Computer software and theory
Abstract/Summary:
Recent years have seen explosive growth in online commodities and trade volume, yet consumers' trust in the security of the Internet is declining. To address this inconsistency, IOEB of Fudan University Software School has studied Internet merchandise monitoring technology and conducted in-depth research into its core problem: how to extract commodity information from the Internet. A number of methods have been proposed for web information extraction, but most require people to label the extracted results, so accuracy declines when manual intervention is reduced. Many existing methods also cannot adapt to changes in web sites: once the web pages are altered, the extraction wrapper must be rebuilt. To address these issues, this thesis proposes web page recognition and extraction algorithms based on structured semantic entropy, which exploit the structure of web pages and recognize their main parts by computing an aggregation metric of commodity information. We first investigate how commodity information is published on the Internet and what its characteristics are, and on that basis construct a semantic dictionary for commodity information extraction. The dictionary helps locate the commodity information in which users are interested. Drawing on the features of web page structure and of commodities, the structured semantic entropy based extraction algorithm can recognize whether a page is a commodity sales page and extract information from it automatically. Combining the algorithm with meta-search technology and a web crawler, we present a framework that automatically discovers new e-business websites and extracts commodity information from them. Finally, an online drug monitoring system has been developed on the basis of this framework. By using the proposed algorithm, the system greatly expands information extraction coverage and raises the level of automation. It also shows that comprehensive monitoring of goods release information on the Internet is technically feasible, which helps to protect online transaction security.
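The abstract does not define structured semantic entropy or the aggregation metric, so the following is only an illustrative sketch of the general idea, not the thesis's algorithm. It assumes a page has already been split into text blocks (for example, one per DOM subtree), uses a small hypothetical semantic dictionary, and scores each block by the Shannon entropy of the attribute categories it mentions. A block that mentions many different commodity attributes scores high, and a page with no block above the (assumed) threshold is treated as not being a commodity sales page. The names `SEMANTIC_DICT`, `block_entropy`, and `find_commodity_block` are invented for this example.

```python
import math
from collections import Counter

# Hypothetical semantic dictionary: attribute category -> cue words.
# The thesis builds its own dictionary; these entries are placeholders.
SEMANTIC_DICT = {
    "name": ["product name", "drug name"],
    "price": ["price", "yuan"],
    "spec": ["specification", "dosage"],
    "maker": ["manufacturer", "produced by"],
}


def category_counts(text):
    """Count dictionary cue-word hits per attribute category in one block."""
    lowered = text.lower()
    counts = Counter()
    for category, cues in SEMANTIC_DICT.items():
        hits = sum(lowered.count(cue) for cue in cues)
        if hits:
            counts[category] = hits
    return counts


def block_entropy(text):
    """Shannon entropy (in bits) of the category distribution in a block.

    Assumption: this stands in for the thesis's aggregation metric, whose
    exact definition is not given in the abstract.
    """
    counts = category_counts(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_commodity_block(blocks, threshold=1.5):
    """Return (index, score) of the highest-scoring block.

    Returns None when there are no blocks or the best score is below the
    threshold, i.e. the page is judged not to be a commodity sales page.
    The threshold value is an arbitrary choice for this sketch.
    """
    if not blocks:
        return None
    scores = [block_entropy(block) for block in blocks]
    best = max(range(len(blocks)), key=scores.__getitem__)
    if scores[best] < threshold:
        return None
    return best, scores[best]


if __name__ == "__main__":
    page_blocks = [
        "Home | About | Contact",
        "Drug name: X. Price: 12 yuan. Specification: 10 tablets. "
        "Manufacturer: Y",
    ]
    print(find_commodity_block(page_blocks))
```

Scoring blocks rather than whole pages keeps the sketch independent of any site-specific wrapper, which is the property the abstract emphasizes; a real system would derive the blocks from the page structure rather than from a hand-written list.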
Keywords/Search Tags:Web Information Extraction, Structured-semantic Entropy, Aggregation Analysis, Meta-search Technology, Semantic Dictionary