| The lexical characteristics of commercial English are a fundamental element of professional knowledge in the field and are important for building its theoretical foundations. A quantitative lexical description of commercial English has academic significance for quantitative lexical research, machine translation, natural language processing, and other fields. Quantitative linguists have carried out many in-depth quantitative lexical studies of general English and of scientific and technological English, but very few of commercial English. This corpus-based thesis employs the theory and methodology of Quantitative Linguistics to investigate, both quantitatively and qualitatively, the English lexical characteristics of the Commerce Domain of the British National Corpus (hereafter referred to as CDBNC). The BNC, which contains 100,000,000 words, is the source of the research data. The CDBNC samples were drawn randomly from the BNC. As a reference, eight further groups of samples were drawn randomly from eight other BNC domains. The nine groups are of equal size: each is composed of 1,000 random samples of 2,000 words, totaling 2,000,000 words. The research covers the following: lexical statistics, vocabulary distribution, vocabulary richness, vocabulary growth, entropy and perplexity, the vocabulary and textual coverage of CDBNC by CET-4 and CET-6, and the fit of Brunet’s model and Tuldava’s model. To obtain the data, Perl programs were written for sampling, data extraction, processing, and the calculation of vocabulary growth and the mathematical models. Visual FoxPro was used mainly for lemmatization. With the assistance of the statistical software NLREG and Visual FoxPro, a variety of statistical tests, calculations, and analyses were carried out. The following conclusions are drawn from the research: 1. Generally speaking, CDBNC differs markedly from the other eight BNC domains. Among the nine domains, it has the smallest vocabulary size, the fewest hapax legomena, and the lowest TTR. Of the 30,044 lemmas in CDBNC, 10,622 are hapax legomena. 2. The top 200 high-frequency words of CDBNC have two characteristics: first, they share the characteristics of commercial English; second, the core words cannot be found among the top 200 high-frequency words of general English. 3. The entropy and perplexity of CDBNC are lower than those of the other eight domains. 4. For texts of the same length, TTR is normally distributed; as text length increases, TTR increases. This change can be described by Tuldava’s model. 5. The vocabulary coverage and the individual textual coverage of CET-4 and CET-6 are normally distributed. The mean vocabulary coverage over the 1,000 text samples is 0.7747 for CET-4 and 0.8170 for CET-6. The textual coverage of CET-4 and CET-6 is considerably higher than their vocabulary coverage: the mean textual coverage is 0.872 for CET-4 and 0.8955 for CET-6. 6. Brunet’s model was tested on the vocabulary growth of the CDBNC samples at 2,000-word intervals. The fit of Brunet’s model to the CDBNC vocabulary growth curves is very good. |
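The measures named in the abstract (vocabulary size, hapax legomena, TTR, entropy, perplexity, and CET coverage) can be illustrated with a minimal Python sketch. It is not the thesis's Perl implementation. It assumes that entropy means unigram Shannon entropy in bits, that perplexity is 2 raised to that entropy, that vocabulary coverage is the share of distinct word types found in a word list, and that textual coverage is the share of running tokens found in it. The function names and the toy word list are hypothetical.

```python
import math
from collections import Counter


def lexical_stats(tokens):
    """Basic lexical measures for a non-empty list of (ideally lemmatized) tokens."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    freqs = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(freqs)
    # Hapax legomena: types that occur exactly once in the sample.
    hapaxes = sum(1 for count in freqs.values() if count == 1)
    # Assumption: unigram Shannon entropy in bits.
    entropy = -sum(
        (count / n_tokens) * math.log2(count / n_tokens)
        for count in freqs.values()
    )
    return {
        "tokens": n_tokens,
        "types": n_types,
        "hapaxes": hapaxes,
        "ttr": n_types / n_tokens,      # type-token ratio
        "entropy": entropy,
        "perplexity": 2 ** entropy,     # assumption: perplexity = 2^H
    }


def coverage(tokens, word_list):
    """Return (vocabulary coverage, textual coverage) of tokens by word_list.

    Assumption: vocabulary coverage counts distinct types,
    textual coverage counts running tokens.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    freqs = Counter(tokens)
    known_types = sum(1 for word in freqs if word in word_list)
    known_tokens = sum(count for word, count in freqs.items() if word in word_list)
    return known_types / len(freqs), known_tokens / len(tokens)


# Toy usage; a real run would use a 2,000-word sample and the CET-4 or CET-6 list.
sample = "the bank sold the bond and the bank bought the share".split()
toy_word_list = {"the", "and", "bank"}  # hypothetical stand-in for a CET list
print(lexical_stats(sample))
print(coverage(sample, toy_word_list))
```

Under these definitions, frequent words weigh more heavily in textual coverage than in vocabulary coverage, which is consistent with the abstract's finding that textual coverage is the higher of the two.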