
Analysis Of Webpage Directories And Design Of Relevant Algorithms

Posted on: 2010-12-27  Degree: Master  Type: Thesis
Country: China  Candidate: Y. Jiang  Full Text: PDF
GTID: 2178360272996059  Subject: Computer application technology
Abstract/Summary:
Classification has long been used to catalogue and search objects in large collections. In the early days these objects were mainly books; today they also include web pages, photographs, and digital data of every kind. A classification describes its contents with natural-language labels, which have proved quite effective for manual classification. Natural-language labels show their weakness, however, when the process is automated, because classifications and their contents become very difficult to process. A Formal Classification turns out to be a form of lightweight ontology. This enables us to process classifications automatically, to associate with every node a normal-form formula that univocally describes its data, and to reduce document classification to reasoning about subsumption.

In today's information society, as the quantity of information grows, it becomes necessary to develop efficient ways to summarize and navigate large, multivariate data sets. The field of classification supports these needs: it investigates how sets of "objects" can be organized into a small number of classes, and it provides methods that assist the search for such objects. For centuries, classification was the province of librarians and archivists. More recently, much interest has turned to managing the information present on the Web: see for instance the WWW Virtual Library project, or the directories of search engines such as Google and Yahoo!. Web directories are often called lightweight ontologies, but they lack one important property of an ontology: an ontology must be represented in a formal language that can then be used for automated processing. None of the existing human-made classifications possesses this property. Because classification hierarchies are written in natural language, it is very hard to automate classification, and, as a consequence, standard classification approaches amount to manually assigning objects to classes.

The work reported in this paper is part of the Sweb project of the University of Trento and was carried out by the author while participating in the EU project EastWeb (contract ID: 111084) from January to October 2008 in Italy. The study focuses on testing technology for a BNF syntax. The main steps are named-entity locating, part-of-speech tagging, word sense disambiguation, and lexical and grammar examination. By analyzing the data set, we noticed the following characteristics of NEs (a heuristic detector built on these observations is sketched after the list):

(1) Rare labels tend to be NEs. A common label can appear thousands of times in a Web directory, while NE labels occur much more rarely; most NE labels appear only once.
(2) Web directories contain so-called letter bars, that is, one-letter labels "A", "B", ..., "Z" or two-letter labels "Aa", "Ab", ..., "Zz". These labels exist only for ease of searching, and they are good indicators of NEs, because nearly all children of these labels are NEs.
(3) In NE labels, initial articles such as "the", "a" and "an" are usually moved to the end after a comma. This is another good indicator of NEs.
(4) NE and non-NE labels differ in their lengths.
(5) NE and non-NE labels differ in their depths in the hierarchy.
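As a rough illustration only, the following Python sketch combines the five observations above into a rule-based detector of NE labels. The Label record, the thresholds, and the two-indicator cut-off are assumptions made for illustration; they are not the actual heuristics or parameters used in the thesis.

```python
from dataclasses import dataclass
import re

@dataclass
class Label:
    text: str          # the label string, e.g. "Shakespeare, William"
    frequency: int     # how often this label occurs in the whole directory
    depth: int         # depth of the node in the directory tree
    parent_text: str   # label of the parent node

# Hypothetical thresholds; the thesis does not report concrete values.
RARE_FREQUENCY = 2
LETTER_BAR = re.compile(r"^[A-Z][a-z]?$")                 # "A" ... "Z", "Aa" ... "Zz"
TRAILING_ARTICLE = re.compile(r",\s*(the|a|an)$", re.IGNORECASE)

def ne_score(label: Label) -> int:
    """Count how many of the five NE indicators fire for a label."""
    score = 0
    if label.frequency <= RARE_FREQUENCY:            # (1) rare labels tend to be NEs
        score += 1
    if LETTER_BAR.match(label.parent_text or ""):    # (2) children of letter bars are almost all NEs
        score += 1
    if TRAILING_ARTICLE.search(label.text):          # (3) "Godfather, The" style trailing article
        score += 1
    if len(label.text.split()) >= 3:                 # (4) length differs for NE labels (assumed cut-off)
        score += 1
    if label.depth >= 4:                             # (5) depth differs for NE labels (assumed cut-off)
        score += 1
    return score

def looks_like_ne(label: Label) -> bool:
    # Treat a label as an NE candidate when at least two indicators fire.
    return ne_score(label) >= 2
```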
Both CMEM and CRF are used in this thesis for part-of-speech tagging. To tag a token, CMEM considers the token's context through the notion of a feature. In part-of-speech tagging, the context of a token is the token itself and its surrounding tokens. A feature is a function that maps the context of a token to a value of 0 or 1; that is, it answers a yes/no question about the context. From a manually annotated corpus, CMEM learns how parts of speech are conditioned by contexts, in the form of a set of probability distributions; the learning process uses maximum-entropy parameter optimization. CMEM then tags tokens one by one, starting from the left-most token of the label, by assigning to each token the part of speech with the highest conditional probability given that token's context. Unlike CMEM, in a CRF the part of speech of a token is conditioned by the contexts of all tokens in the sentence, which enables overall coordination among the local tagging decisions and makes CRF a more advanced model for this task. In our experiments we used two part-of-speech taggers: the CRF-based FudanNLP tagger and the CMEM-based OpenNLP tagger. We retrained both tools on our data set and checked whether accuracy improved with respect to training them on full-fledged sentences. To avoid a negative influence of NE labels on tagger training, both tools were trained and tested only on the non-NE labels in the data set. A sketch of such a binary feature function is given below.

Word senses are defined by WordNet. One word may have many senses, and the goal of word sense disambiguation is to select a single sense for each word.

Lexical examination is implemented with a DFA, and its state transition diagram is designed; from this diagram the transition function sets can easily be obtained. The descriptive power of a DFA equals that of a regular expression: it can describe only words, not complex sentence structure (a toy DFA of this kind is also sketched below). Grammar checking uses the LR method: prefixes are reduced from the left, and a specific grammar-check algorithm is designed. An LR parser can recognize almost all language structures that can be described by context-free grammars; LR parsing is the best-known shift-reduce parsing method that works without backtracking, and it can be implemented as efficiently as other shift-reduce analyses. The set of grammars that LR parsers can analyse is a proper superset of the set analysable by predictive (top-down) methods. While scanning the input string from left to right, an LR parser can detect syntax errors as soon as they occur. The main drawback of the method is the heavy workload of constructing the LR parser, so future research could combine bottom-up and top-down methods to ease parser construction. Guided by this BNF syntax, the program can effectively match the label model of Web directories such as Dmoz and Yahoo!, which provides a useful testing tool for Web mining.
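To make the notion of a binary feature function concrete, the sketch below shows, in Python, what CMEM-style yes/no features over a token, its neighbours, and a candidate tag could look like, together with the conditional probability a maximum-entropy model assigns to a tag once feature weights have been learned. The feature definitions, tag names, and weights are illustrative assumptions; they do not reproduce the feature sets of OpenNLP or FudanNLP.

```python
import math
from typing import Callable, List, Optional

class Context:
    """Context of the token being tagged: the token itself and its neighbours."""
    def __init__(self, tokens: List[str], index: int, previous_tag: Optional[str]):
        self.token = tokens[index]
        self.prev_token = tokens[index - 1] if index > 0 else None
        self.next_token = tokens[index + 1] if index + 1 < len(tokens) else None
        self.previous_tag = previous_tag   # tag already assigned to the token on the left

# Each feature answers a yes/no question about (context, candidate tag) with 0 or 1.
def f_capitalized_nnp(ctx: Context, tag: str) -> int:
    return 1 if ctx.token[:1].isupper() and tag == "NNP" else 0

def f_prev_det_noun(ctx: Context, tag: str) -> int:
    return 1 if ctx.previous_tag == "DT" and tag == "NN" else 0

def f_suffix_ing_vbg(ctx: Context, tag: str) -> int:
    return 1 if ctx.token.lower().endswith("ing") and tag == "VBG" else 0

FEATURES: List[Callable[[Context, str], int]] = [
    f_capitalized_nnp, f_prev_det_noun, f_suffix_ing_vbg,
]

def tag_probability(weights: List[float], ctx: Context, tag: str, tagset: List[str]) -> float:
    """p(tag | context) in a max-entropy model: exponentiated weighted feature sum, normalized."""
    def score(t: str) -> float:
        return math.exp(sum(w * f(ctx, t) for w, f in zip(weights, FEATURES)))
    return score(tag) / sum(score(t) for t in tagset)

# Example with hypothetical weights: probability that "London" in ["London", "calling"] is tagged NNP.
ctx = Context(["London", "calling"], 0, previous_tag=None)
print(tag_probability([1.5, 0.8, 1.2], ctx, "NNP", ["NN", "NNP", "VBG"]))
```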
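The abstract does not reproduce the actual state transition diagram, so the following is only a minimal sketch of the general scheme it describes: a DFA encoded as a transition table, recognizing a toy word class equivalent to the regular expression [A-Za-z][A-Za-z0-9]*, which reflects the point that a DFA can check words but not sentence structure.

```python
# A DFA is a transition table plus a start state and a set of accepting states.
# This toy automaton accepts tokens of the form letter (letter | digit)*,
# i.e. the same language as the regular expression [A-Za-z][A-Za-z0-9]*.

TRANSITIONS = {
    ("start", "letter"): "word",
    ("word", "letter"): "word",
    ("word", "digit"): "word",
}
START_STATE = "start"
ACCEPTING_STATES = {"word"}

def char_class(ch: str) -> str:
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

def accepts(token: str) -> bool:
    """Run the DFA over a token; reject on any undefined transition."""
    state = START_STATE
    for ch in token:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return False
    return state in ACCEPTING_STATES
```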
Keywords/Search Tags: Natural language processing, BNF, syntax, web mining, document classification