Computational lexicography
Corpus (corpus-based) linguistics deals mainly with compiling electronic corpora for research in different linguistic fields. Corpora occupy a special place in the study of language. Their importance for language research is tied to the importance of empirical data. Empirical data enable the linguist to make objective statements rather than subjective ones based on the individual’s own internalized perception of language. A large and well-constructed corpus gives excellent information about the frequency, distribution, and typicality of linguistic features such as words, collocations, spellings, pronunciations, and grammatical constructions.
The recent development of corpus linguistics has given birth to corpus-based lexicography and a new corpus-based generation of dictionaries. For example, the COBUILD English Dictionary used the Bank of English, a corpus of 20 million words of contemporary English developed at the University of Birmingham. The Longman Dictionary of Contemporary English and the Oxford Advanced Learner's Dictionary of Current English used the British National Corpus.
The British National Corpus is a very large (over 100 million words) corpus of modern English, both spoken and written. The Corpus is designed to represent as wide a range of modern British English as possible. The written part (90 %) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, and school and university essays, among many other kinds of text. Texts are selected for inclusion in the corpus according to three independent selection criteria: domain (75 % of texts come from informative writing, e.g. from the fields of applied sciences or art, and 25 % from imaginative writing, i.e. literary and creative works), time (mostly texts since 1975), and medium (60 % of written texts are books and 25 % are periodicals).
The spoken part (10 %) of the British National Corpus includes a large amount of unscripted informal conversation, recorded by volunteers selected from different ages, regions, and social classes in a demographically balanced way, together with spoken language collected in many different contexts, ranging from formal business or government meetings to radio shows and phone-ins.
The use of corpora in dictionary-making gives a compiler many opportunities. Among the most important are the opportunities:
1) to produce and revise dictionaries much more quickly than before, thus providing up-to-date information about language;
2) to give more complete and precise definitions since a larger number of natural examples are examined;
3) to keep track of new words entering the language and of existing words changing their meanings, thanks to the open-ended (constantly growing) monitor corpus;
4) to describe usages of particular words or phrases typical of particular varieties and genres, as corpus data contain a wealth of textual information: regional variety, author, date, part-of-speech tags, genre, etc.;
5) to organize examples extracted from corpora easily into more meaningful groups for analysis and to describe or present them with special stress on their collocation, for example by sorting the right-hand context of a word alphabetically so that all instances of a particular collocate can be seen together;
6) to treat phrases and collocations more systematically than was previously possible, owing to the ability to call up word-combinations rather than single words and to the existence of mutual information tools, which establish relationships between co-occurring words;
7) to register the cultural connotations and underlying ideologies embedded in a language.
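Items 5) and 6) above can be illustrated with a minimal sketch. It assumes a tiny invented corpus, a naive tokenizer, and pointwise mutual information computed over adjacent word pairs only, which is just one simple way of scoring co-occurrence and is not necessarily what any particular lexicographic tool does. The names TEXT, tokenize, kwic_sorted_right and pmi are hypothetical.

```python
import math
import re
from collections import Counter

# Toy corpus invented for illustration; real work would use a large corpus.
TEXT = (
    "She drank strong tea in the morning. He made strong tea for the guests. "
    "They reported strong winds at sea. We felt strong winds all night. "
    "He drank weak tea at noon."
)

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def kwic_sorted_right(tokens, node, width=3):
    """Return concordance lines for `node`, sorted by right-hand context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, node, right))
    # Alphabetical sort on the right context groups identical collocates.
    return sorted(lines, key=lambda line: line[2])

def pmi(tokens, w1, w2):
    """Pointwise mutual information (in bits) of the adjacent pair w1 w2."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair = bigrams[(w1, w2)]
    if pair == 0:
        return float("-inf")  # the pair never occurs side by side
    p_pair = pair / (n - 1)
    return math.log2(p_pair / ((unigrams[w1] / n) * (unigrams[w2] / n)))

tokens = tokenize(TEXT)
for left, node, right in kwic_sorted_right(tokens, "strong"):
    print(f"{left:>20} | {node} | {right}")
print(round(pmi(tokens, "strong", "tea"), 2))
```

In the sorted concordance the two "tea" lines appear together, followed by the two "winds" lines, which is the grouping effect described in item 5). A positive score from pmi suggests that the two words occur together more often than their individual frequencies alone would predict.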
Some of the major dictionary publishers have their own electronic text archives, which they use depending on the type of dictionary being compiled. For example, the Longman Corpus Network is a diverse, far-reaching group of databases consisting of many millions of words. Five highly sophisticated language databases form the nucleus of the Network: the Longman Learners' Corpus (comprising 10 million words of writing in English by learners of the language from over 125 different countries); the Longman Written American Corpus (comprising 140 million words of American newspaper and book text); the Longman Spoken American Corpus (a unique resource of 5 million words of everyday American speech); the Spoken British Corpus (which for the first time gives objective information on what spoken English is really like and how it differs from written British English); and the Longman/Lancaster Corpus (with over 30 million words, it covers an extensive range of written texts, from literature to bus timetables).
Computational linguistics is the branch of linguistics in which the techniques of computer science are applied to the analysis and synthesis of language and speech.
Computational lexicography deals with the design, compilation, use, and evaluation of electronic (machine-readable) dictionaries. Electronic dictionaries differ fundamentally in form, content, and function from conventional word-books. Among the most significant differences are: 1) the use of multimedia; 2) navigable help indices in window-oriented software; 3) the use of sound, animation, and audio and visual elements (pictures, videos), as well as interactive exercises and games; 4) varied search and access methods that allow the user to specify the output in a number of ways; 5) access to and retrieval of information that are no longer determined by the internal, traditionally alphabetical, organization of the dictionary but by a non-linear structure of the text; 6) the use of hyperlinks, which allow quick and easy cross-reference to words within an entry or to other words connected with that entry.
Two main types of electronic dictionaries are distinguished:
· online dictionaries;
· CD-ROM dictionaries.
To use online dictionaries, access to the Internet is required. To install a CD-ROM dictionary, the computer must meet the minimum system requirements, which are usually listed in the User Guide.
Among the online dictionaries are the following: the Oxford English Dictionary Online, the Merriam-Webster Online Dictionary, the Cambridge Dictionaries Online (including the Cambridge Advanced Learner’s Dictionary, the Cambridge International Dictionary of Idioms, the Cambridge Dictionary of American English, etc.), the American Heritage Dictionary of the English Language, and many others. Each dictionary has its own benefits and differs from the others, sometimes greatly, in interface, available material, content, number of options, organization of entries, search capabilities, etc.
The Oxford English Dictionary Online, for instance, contains the material of the 20-volume Oxford English Dictionary and the 3-volume Additions Series. In addition, revised and new entries are added to the online dictionary every quarter.
The Oxford English Dictionary Online is characterized by the following main features: 1) the display of entries according to a user’s needs, i.e. pronunciations, etymologies, variant spellings, and quotations can be turned on and off; 2) the search for pronunciations as well as accented and other special characters; 3) the search for words which have come into English via a particular language; 4) the search for quotations from a specified year, or from a particular author and/or work; 5) the search for a term when a user knows only its meaning; 6) the use of wildcards if a user is unsure of a spelling (wildcards make it possible to search for words when not all of their letters are known: the wildcard symbol ‘?’ represents any single character and the wildcard symbol ‘*’ represents any string of characters); 7) the restriction of a search to a previous results set; 8) the search for first cited date, authors, and works; 9) case-sensitive searches; and some others.
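The wildcard convention described in feature 6) can be sketched as follows. This is only an illustration of the general technique and not the dictionary's actual search engine: the headword list and the function names are invented, and the sketch assumes that ‘*’ may also stand for an empty string.

```python
import re

# Hypothetical headword list, invented for illustration.
HEADWORDS = ["colour", "color", "colony", "collar", "cooler", "gray", "grey"]

def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regular expression.

    '?' stands for exactly one character and '*' for any string of
    characters (assumed here to include the empty string); every other
    character matches itself literally.
    """
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def wildcard_search(pattern, headwords):
    """Return the headwords that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [word for word in headwords if regex.fullmatch(word)]

print(wildcard_search("gr?y", HEADWORDS))   # user unsure of one letter
print(wildcard_search("col*r", HEADWORDS))  # user unsure of the middle
```

With this toy list, the pattern gr?y finds both spellings gray and grey, while col*r finds colour, color, and collar but not colony.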
Among the CD-ROM dictionaries are the following: the Longman Dictionary of Contemporary English on CD-ROM, the Cambridge International Dictionary of English on CD-ROM, the Collins COBUILD on CD-ROM, the Concise Oxford Dictionary on CD-ROM, and many others.
In most cases CD-ROM dictionaries are electronic versions of printed reference books, supplemented by more visual information, pronunciation, and interactive exercises and games, and allowing the user to carry out searches that are impossible with printed dictionaries.
The Longman Dictionary of Contemporary English on CD-ROM, for example, differs from the paper dictionary in the following ways: 1) every word is pronounced in British and American English, and a user can also record his/her own pronunciation and compare it with the accepted form; 2) it gives 15,000 word origins or etymologies and contains 7,000 encyclopedic entries for people, places, and things, taken from the Longman Dictionary of English Language and Culture; 3) there are 80,000 additional examples given in the Longman Examples Bank; 4) over a million corpus sentences are included for very advanced learners and teachers of English; 5) it contains 150,000 extra words (collocates) that are used with the headword; 6) it has the Activator section, which helps the user choose the right word for a given context and offers guidance on essay writing; 7) there are many interactive activities in grammar, vocabulary, and culture, as well as exam practice exercises.
The Longman Dictionary of Contemporary English on CD-ROM has distinctive features that make it prominent among dictionaries of this kind. The CD-ROM dictionary has three main functions, each opening in the main window but with a slightly different look: the Dictionary, the Activator, and the Exercises. Users can choose the full-sized display or ‘Pop-Up Mode’. The dictionary interface includes a search bar, an area for viewing entries, and windows for the Phrase Bank, the Examples Bank, and the Activate Your Language tool.
HISTORICAL OUTLINE
I. Elementary bilingual (later multilingual) Latin-English and English-Latin glossaries (VIII–XIV centuries); ‘learners’ dictionaries’
English lexicography has very rich traditions and a long history. The forerunners of modern dictionaries appeared long ago in medieval England as the work of nameless scholars who wrote in the margins of Latin manuscripts English equivalents for some difficult Latin words. These were collected into lists called glosses. Then several glosses were combined into a book called a glossarium which may be called a short Latin-English or English-Latin dictionary of selected words.
· Corpus Glossary (2,000; alphabetical order) Lat-Eng; Eng-Lat;
· Aelfric Glossary (thematic order) Lat-Eng; Eng-Lat.
II. Complicated handwritten/printed Latin-English and English-Latin glossaries with a limited number of words
The first English-Latin dictionary was compiled in England around 1440. It had the Latin title ‘Promptorium Parvulorum’, which means ‘A Storehouse for Young Boys’. At that time Latin was the international language of scholars, of the church, and of the most important institutions of the Middle Ages.
· ‘Promptorium Parvulorum sive Clericorum’ (‘A Treasury for Educated Youth’; 1440; 12,000);
· ‘Catholicon Anglicum’ (‘The Universal English’; 1483; 8,000);
· ‘Medulla grammaticae’ (‘The Soul of Grammar’; 1460; 20,000);
· ‘Ortus vocabulorum’ (‘The Garden of Words’; 1550; 27,000);
III. Printed bilingual Latin-English and English-Latin dictionaries with a large number of words and various kinds of information about words (XVI century)
· Th. Elyot, The Dictionary (Lat-Eng, 1538);
· The Dictionary of 1552 (26,000 explanations in English);