
Research On Product Attribute Extraction From Semi-structured Web Pages

Posted on: 2014-01-11 Degree: Master Type: Thesis
Country: China Candidate: W Tang Full Text: PDF
GTID: 2248330398465496 Subject: Computer application technology
Abstract/Summary:
Recent years have witnessed explosive growth in online commodities and trade volume. Jingdong and Taobao are representative e-commerce platforms in China. Online commodity trading requires product attribute information to be shown on the page in as much detail as possible. If these product attributes are effectively extracted and organized, they can play a greater role in multiple applications, such as opinion mining of product reviews, sentiment analysis, personalized product recommendation, and product improvement.

Currently, a number of methods have been proposed for web information extraction, most of which require people to label the extracted results, so accuracy declines when manual intervention is reduced. In addition, many existing methods cannot adapt to changes in web sites: once the web pages are altered, the wrapper for web page information extraction must be reconstructed.

To address the issues mentioned above, we propose two new methods to extract product attribute information. The first method is based on attribute description block extraction. We use the VIPS algorithm to segment the web page into blocks and extract features to train a classifier that identifies the blocks describing product attributes. An attribute record alignment method is then used to extract the product information.

After analyzing the shortcomings of the first method, we put forward a second solution, which uses web page titles to assist unsupervised template construction. This method constructs an attribute word bag from a large number of titles of the same category. The word bag is then used to identify seed attribute name-value pairs. We construct templates based on these seed pairs and finally use the templates to extract product information from the web page. This method has the advantages of a high degree of automation and strong adaptability, and it achieved good results in the experiments.
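The word-bag step of the second method can be illustrated with a minimal sketch. The abstract does not say how the word bag is built or how seed pairs are matched, so everything here is an assumption made for illustration: whitespace tokenization, a document-frequency threshold (`min_doc_freq`), the helper names `build_word_bag` and `find_seed_pairs`, and the sample titles and candidate pairs.

```python
from collections import Counter


def build_word_bag(titles, min_doc_freq=2):
    """Collect tokens that appear in at least `min_doc_freq` titles.

    Assumption: titles of one product category repeat attribute values
    (brand, color, size), so frequent tokens are likely attribute words.
    """
    doc_freq = Counter()
    for title in titles:
        # Count each token once per title (document frequency).
        doc_freq.update(set(title.lower().split()))
    return {token for token, count in doc_freq.items() if count >= min_doc_freq}


def find_seed_pairs(candidate_pairs, word_bag):
    """Keep (name, value) pairs whose value shares a token with the word bag."""
    seeds = []
    for name, value in candidate_pairs:
        if set(value.lower().split()) & word_bag:
            seeds.append((name, value))
    return seeds


# Hypothetical titles from a single product category.
titles = [
    "Lenovo ThinkPad laptop 14 inch black",
    "Lenovo IdeaPad laptop 15 inch silver",
    "Dell Inspiron laptop 14 inch black",
]

# Hypothetical name-value candidates taken from one product page.
candidate_pairs = [
    ("Brand", "Lenovo"),
    ("Color", "black"),
    ("Shipping", "free delivery"),
]

word_bag = build_word_bag(titles)
seeds = find_seed_pairs(candidate_pairs, word_bag)
print(sorted(word_bag))
print(seeds)
```

In this sketch, the pairs that survive the filter would serve as the seeds from which extraction templates are built. The template construction itself is not shown, because the abstract gives no detail on it.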
Keywords/Search Tags: Web information extraction, Web page segmentation, Attribute description block, Template construction