
Research On The Co-occurrence Statistical Rules Of Similar Named Entities In Massive Web Pages

Posted on: 2012-05-15
Degree: Master
Type: Thesis
Country: China
Candidate: W Y Lin
Full Text: PDF
GTID: 2178330338991455
Subject: Computer Science and Technology
Abstract/Summary:
As the number of Web pages grows rapidly, Web data mining is becoming increasingly popular. When people browse the Internet, their attention centers on named entities, that is, proper names and specific quantifiers. Named entities reflect concrete or abstract entities in the real world. Named entities in Web pages are related in many ways, and the "co-occurrence" relationship is the simplest and most basic of them. Co-occurrence refers to words appearing together in the same document.

Research on the co-occurrence statistics of named entities in Web pages differs from traditional data mining research. It must be based on a large volume of Web page data and must use a named entity extraction algorithm to extract the named entities from the pages under analysis. The co-occurring named entities are then analyzed with statistical methods. Finally, the dissertation demonstrates that potential relationships exist between named entities and derives some rules governing their co-occurrence. Named entity co-occurrence rules are currently a hot research topic, and it is both important and practical to advance co-occurrence theory and the study of named entity co-occurrence in Web pages.

Named entity extraction is the basis of this dissertation. First, three named entity extraction models were chosen and their advantages and disadvantages compared. Based on the test results on a selected set of test pages, the algorithm of one model was chosen as the named entity extraction method for the experiments in this dissertation.

FDC is a word co-occurrence algorithm. It is used here to analyze the co-occurrence frequency of named entities, the relative distance between named entities, and the co-document rate of named entities, from which a co-occurrence value for named entities is obtained. This dissertation builds on the FDC algorithm and applies it to the study of named entity co-occurrence in Web pages. A test set of ten thousand Web pages was selected from CWT200G as samples, and the results show that the FDC algorithm is effective.

The FDC algorithm has some drawbacks when it is applied to calculating co-occurrence values. The dissertation seeks to improve the original FDC algorithm mainly in two respects: named entity co-occurrence frequency and named entity relative distance. When measuring co-occurrence frequency, a named entity set is used in place of the Cartesian product; at the same time, the position of the named entities in the context is taken into account when measuring the relative distance between them. The experiments show that the improved FDC algorithm is effective.
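The three statistics named in the abstract (co-occurrence frequency, relative distance, and co-document rate) can be illustrated with a minimal Python sketch. The abstract does not give the FDC formulas, so everything below is an assumption made for illustration: the function name `cooccurrence_stats`, the input layout (each document as a list of `(entity, token_position)` mentions), frequency counted over the Cartesian product of mention positions, distance taken as mean absolute token offset, and co-document rate taken as a Jaccard-style ratio. It is not the dissertation's algorithm and does not combine the three values into a single score.

```python
from collections import defaultdict
from itertools import combinations, product


def cooccurrence_stats(docs):
    """Per-pair statistics; docs is a list of [(entity, token_position), ...].

    Hypothetical helper: the definitions used here are illustrative
    assumptions, not the FDC formulas from the dissertation.
    """
    pair_count = defaultdict(int)   # number of mention pairs per entity pair
    dist_sum = defaultdict(int)     # summed token distance per entity pair
    docs_with = defaultdict(set)    # entity -> ids of documents mentioning it

    for doc_id, mentions in enumerate(docs):
        positions = defaultdict(list)
        for entity, pos in mentions:
            positions[entity].append(pos)
            docs_with[entity].add(doc_id)
        # Every unordered pair of distinct entities found in this document.
        for a, b in combinations(sorted(positions), 2):
            # Assumed frequency: Cartesian product of the mention positions.
            for pa, pb in product(positions[a], positions[b]):
                pair_count[(a, b)] += 1
                dist_sum[(a, b)] += abs(pa - pb)

    stats = {}
    for (a, b), n in pair_count.items():
        both = docs_with[a] & docs_with[b]
        either = docs_with[a] | docs_with[b]
        stats[(a, b)] = {
            "frequency": n,
            "mean_distance": dist_sum[(a, b)] / n,
            # Assumed co-document rate: shared documents over all documents
            # that mention either entity.
            "co_document_rate": len(both) / len(either),
        }
    return stats


# Toy input with made-up mentions, for illustration only.
docs = [
    [("Beijing", 0), ("China", 3), ("Beijing", 10)],
    [("Beijing", 2), ("Shanghai", 4)],
    [("China", 1), ("Shanghai", 5)],
]
stats = cooccurrence_stats(docs)
print(stats[("Beijing", "China")])
```

Counting over the Cartesian product lets a repeated mention inflate the frequency (here "Beijing" appears twice in the first document and contributes two pairs), which is one plausible reading of why the dissertation replaces it with a set-based count.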
Keywords/Search Tags: named entity, co-occurrence, FDC algorithm, massive Web pages