Web Page Data Quality Assessment System Based On Wikipedia

Posted on: 2015-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: X P Chen
GTID: 2298330467955766
Subject: Computer application technology
Abstract/Summary:
In recent years, information on the Web has grown explosively, and the Web is full of duplicate, garbled, and false information. As a result, people browsing Web pages do not know whether the information they find is accurate and complete, and they often get lost in this ocean of information. Data quality assessment is the key step toward solving this problem.

After reviewing related work, the author proposes an approach to assessing the quality of a Web page that uses Wikipedia pages as a reference. First, given a Web page link provided by the user, it extracts the page's keywords and uses them to collect Web pages from Wikipedia. Then it uses machine learning methods to identify the quality of the collected Wikipedia pages and extracts the content of the high-quality pages as triples. Finally, based on these triples, it applies purpose-designed algorithms to analyze the quality of the source Web page along multiple dimensions.

This approach has the following advantages. First, it integrates related Web pages as a reference, which makes full use of collective intelligence and can reveal the quality flaws of the source Web page. Second, because Wikipedia pages vary in quality, it uses a Support Vector Machine to identify the high-quality pages and an LDA model to identify those with high topic relevance. Last but not least, existing approaches to assessing Web page quality are mainly non-semantic, whereas this approach is semantic and fully exploits the semantic information of Web pages. Theoretical analysis and experimental comparison demonstrate the feasibility and efficiency of the proposed approach.
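The final step, comparing a source page against reference triples, might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the quality dimensions or the algorithms used, so the `Triple` type, the `assess` function, and the two dimensions shown here (completeness and accuracy) are hypothetical choices, not the thesis's method.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Triple(NamedTuple):
    """A (subject, predicate, object) statement extracted from a page."""
    subject: str
    predicate: str
    obj: str


def _index(triples: Iterable[Triple]) -> Dict[Tuple[str, str], Set[str]]:
    """Group object values by their (subject, predicate) slot."""
    slots: Dict[Tuple[str, str], Set[str]] = {}
    for t in triples:
        slots.setdefault((t.subject, t.predicate), set()).add(t.obj)
    return slots


def assess(source: Iterable[Triple], reference: Iterable[Triple]) -> Dict[str, float]:
    """Score source triples against reference triples (hypothetical dimensions)."""
    ref = _index(reference)
    src = _index(source)

    # Completeness (assumed definition): share of reference slots the source covers.
    covered = [slot for slot in ref if slot in src]
    completeness = len(covered) / len(ref) if ref else 0.0

    # Accuracy (assumed definition): among covered slots, the share where the
    # source gives at least one value that the reference also gives.
    agreeing = [slot for slot in covered if src[slot] & ref[slot]]
    accuracy = len(agreeing) / len(covered) if covered else 0.0

    return {"completeness": completeness, "accuracy": accuracy}
```

With a reference set of four slots and a source page that covers two of them, one with a matching value and one with a conflicting value, both scores come out at 0.5. Exact string matching is a simplification; a semantic comparison of the kind the abstract describes would need entity and value normalization before matching.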
Keywords/Search Tags: Web data quality, Support Vector Machine, LDA Model, Semantic Triples, Quality Dimensions