In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.
The term treebank was coined by linguist Geoffrey Leech in the 1980s, by analogy to other repositories such as a seedbank or bloodbank. This is because both syntactic and semantic structure are commonly represented compositionally as a tree structure. The term parsed corpus is often used interchangeably with the term treebank, with the emphasis on the primacy of sentences rather than trees. Treebanks are often created on top of
A Platonistic ontology and an externalist view of meaning. Within linguistics, it is more common to view formal semantics as part of the study of linguistic cognition. As a result, philosophers put more of an emphasis on conceptual issues while linguists are more likely to focus on the syntax–semantics interface and crosslinguistic variation. The fundamental question of formal semantics
A computational linguistics perspective, treebanks have been used to engineer state-of-the-art natural language processing systems such as part-of-speech taggers, parsers, semantic analyzers and machine translation systems. Most computational systems utilize gold-standard treebank data. However, an automatically parsed corpus that is not corrected by human linguists can still be useful. It can provide evidence of rule frequency for
A corpus that has already been annotated with part-of-speech tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information. Treebanks can be created completely manually, where linguists annotate each sentence with syntactic structure, or semi-automatically, where a parser assigns some syntactic structure which linguists then check and, if necessary, correct. In practice, fully checking and completing
A function which takes some individual x as an argument and returns the truth value "true" if x indeed smokes. Assuming that the words "Nancy" and "smokes" are semantically composed via function application, this analysis would predict that the sentence as a whole is true if Nancy indeed smokes. Scope can be thought of as the semantic order of operations. For instance, in the sentence "Paulina doesn't drink beer but she does drink wine,"
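The function-application analysis of "Nancy smokes" described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy model (`SMOKERS`), the `LEXICON` table, and the helper names are hypothetical and are not part of any standard formal-semantics toolkit.

```python
# A toy model: the set of individuals who smoke in the world being described.
# (Hypothetical data, chosen only for illustration.)
SMOKERS = {"Nancy"}


def smokes(x):
    """Denotation of "smokes": a function from individuals to truth values."""
    return x in SMOKERS


# Lexicon mapping words to denotations. Names denote individuals
# (modelled here as plain strings); predicates denote functions.
LEXICON = {
    "Nancy": "Nancy",
    "Paulina": "Paulina",
    "smokes": smokes,
}


def function_application(function, argument):
    """Compose two denotations by applying one to the other."""
    return function(argument)


def interpret(subject, predicate):
    """Denotation of a simple subject-predicate sentence."""
    return function_application(LEXICON[predicate], LEXICON[subject])


if __name__ == "__main__":
    print(interpret("Nancy", "smokes"))
    print(interpret("Paulina", "smokes"))
```

The sentence denotation is computed purely from the denotations of its parts and the mode of composition, which is the point of the compositional analysis; a realistic system would add semantic types and more composition rules.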
A major subfield of linguistics in the late 1970s and early 1980s, due to the seminal work of Barbara Partee. Partee developed a linguistically plausible system which incorporated the key insights of both Montague Grammar and Transformational grammar. Early research in linguistic formal semantics used Partee's system to achieve a wealth of empirical and conceptual results. Later work by Irene Heim, Angelika Kratzer, Tanya Reinhart, Robert May and others built on Partee's work to further reconcile it with
A parser. A parser may be improved by applying it to large amounts of text and gathering rule frequencies. However, only by correcting and completing a corpus by hand is it possible to identify rules absent from the parser's knowledge base. In addition, frequencies are likely to be more accurate. In corpus linguistics, treebanks are used to study syntactic phenomena (for example, diachronic corpora can be used to study
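As a rough illustration of gathering rule frequencies from parsed text, the sketch below counts phrase-structure rules in a tiny bracketed corpus. The two-sentence corpus, the tag set, and all function names are assumptions made for the example; real treebanks need a more robust reader.

```python
from collections import Counter


def tokenize(text):
    """Split a labelled-bracket string into parentheses and symbols."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def read_tree(tokens):
    """Consume the tokens of one constituent and return (label, children)."""
    tokens.pop(0)  # opening "("
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        if tokens[0] == "(":
            children.append(read_tree(tokens))
        else:
            children.append(tokens.pop(0))  # a word under a part-of-speech tag
    tokens.pop(0)  # closing ")"
    return (label, children)


def count_rules(tree, counts):
    """Count phrasal rules (parent -> child labels), skipping tag -> word rules."""
    label, children = tree
    if children and all(isinstance(child, tuple) for child in children):
        counts[(label, tuple(child[0] for child in children))] += 1
        for child in children:
            count_rules(child, counts)


# Hypothetical two-sentence corpus in a Penn-Treebank-like notation.
CORPUS = [
    "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))",
    "(S (NP (NNP Mary)) (VP (VBZ sleeps)))",
]

rule_counts = Counter()
for sentence in CORPUS:
    count_rules(read_tree(tokenize(sentence)), rule_counts)

if __name__ == "__main__":
    for (parent, rhs), n in rule_counts.most_common():
        print(f"{parent} -> {' '.join(rhs)}: {n}")
```

Counts obtained this way from an uncorrected, automatically parsed corpus only reflect rules the parser already knows, which is why hand-corrected data is needed to discover missing rules.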
A revision of his politeness model. Formal semantics (linguistics) Formal semantics is the study of grammatical meaning in natural languages using formal concepts from logic, mathematics and theoretical computer science. It is an interdisciplinary field, sometimes regarded as a subfield of both linguistics and philosophy of language. It provides accounts of what linguistic expressions mean and how their meanings are composed from
A shallow semantic treebank is PropBank, which provides annotation of verbal propositions and their arguments, without attempting to represent every word in the corpus in logical form. Many syntactic treebanks have been developed for a wide variety of languages. To facilitate further research on multilingual tasks, some researchers have discussed universal annotation schemes that apply across languages. In this way, people try to utilize or merge
A single surface form can be semantically ambiguous between different scope construals. Some theories of scope posit a level of syntactic structure called logical form, in which an item's syntactic position corresponds to its semantic scope. Other theories compute scope relations in the semantics itself, using formal tools such as type shifters, monads, and continuations. Binding is
A term derived from P. L. Garvin's translation of the Czech term aktualisace, referring to the psychological prominence (against the background of ordinary language) of artistic effects in literature. In Leech's account, foregrounding in poetry is based on deviation from linguistic norms, which may take the form of unexpected irregularity (as in Dylan Thomas's A grief ago) as well as unexpected regularity (or parallelism, as in I kissed thee ere I killed thee from Othello). Further, Leech has distinguished three levels of deviation: Leech's interest in semantics
Is a collection of natural language sentences annotated with a meaning representation. These resources use a formal representation of each sentence's semantic structure. Semantic treebanks vary in the depth of their semantic representation. A notable example of deep semantic annotation is the Groningen Meaning Bank, developed at the University of Groningen and annotated using Discourse Representation Theory. An example of
Is determined by the denotations of its parts along with their mode of composition. For instance, the denotation of the English sentence "Nancy smokes" is determined by the meaning of "Nancy", the denotation of "smokes", and whatever semantic operations combine the meanings of subjects with the meanings of predicates. In a simplified semantic analysis, this idea would be formalized by positing that "Nancy" denotes Nancy herself, while "smokes" denotes
Is distinct from pragmatics, which encompasses aspects of meaning which arise from interaction and communicative intent. Formal semantics is an interdisciplinary field, often viewed as a subfield of both linguistics and philosophy, while also incorporating work from computer science, mathematical logic, and cognitive psychology. Within philosophy, formal semanticists typically adopt
Is what you know when you know how to interpret expressions of a language. A common assumption is that knowing the meaning of a sentence requires knowing its truth conditions, or in other words knowing what the world would have to be like for the sentence to be true. For instance, to know the meaning of the English sentence "Nancy smokes" one has to know that it is true when the person Nancy performs
The Linguistics Wars, and many linguists were initially puzzled by it. While linguists wanted a restrictive theory that could only model phenomena that occur in human languages, Montague sought a flexible framework that characterized the concept of meaning at its most general. At one conference, Montague told Barbara Partee that she was "the only linguist who it is not the case that I can't talk to". Formal semantics grew into
The Penn Treebank or ICE-GB) and those that annotate dependency structure (for example the Prague Dependency Treebank or the Quranic Arabic Dependency Treebank). It is important to clarify the distinction between the formal representation and the file format used to store the annotated data. Treebanks are necessarily constructed according to a particular grammar. The same grammar may be implemented by different file formats. For example,
The generative approach to syntax. The resulting framework is known as the Heim and Kratzer system, after the authors of the textbook Semantics in Generative Grammar which first codified and popularized it. The Heim and Kratzer system differs from earlier approaches in that it incorporates a level of syntactic representation called logical form which undergoes semantic interpretation. Thus, this system often includes syntactic representations and operations which were introduced by translation rules in Montague's system. However, work by others such as Gerald Gazdar proposed models of
The proposition that Paulina drinks beer occurs within the scope of negation, but the proposition that Paulina drinks wine does not. One of the major concerns of research in formal semantics is the relationship between operators' syntactic positions and their semantic scope. This relationship is not transparent, since the scope of an operator need not directly correspond to its surface position and
The 1990s, he took a leading role in the compilation of the British National Corpus (BNC). The Lancaster research group that he co-founded (UCREL) also developed programs for the annotation of corpora: especially corpus taggers and parsers. The term treebank, now generally applied to a parsed corpus, was coined by Leech in the 1980s. The LGSWE grammar (1999) was systematically based on corpus analysis. Leech's more recent corpus research has centred on grammatical change in recent and contemporary English. Leech has written extensively on
The Gricean model) rather than "reductionist" (reducing Grice's four maxims to a smaller number, as in Relevance theory, where the Maxim of Relation, or principle of relevance, is the only one that survives). Leech is also criticised for allowing the addition of new maxims to be unconstrained (in defiance of Occam's Razor), and for his postulation of an "absolute politeness" which does not vary according to situation, whereas most politeness theorists maintain that politeness cannot be identified out of context. In his article "Politeness: Is there an East-West divide?" (2007), Leech addresses these criticisms and presents
The action of smoking. However, many current approaches to formal semantics posit that there is more to meaning than truth-conditions. In the formal semantic framework of inquisitive semantics, knowing the meaning of a sentence also requires knowing what issues (i.e. questions) it raises. For instance "Nancy smokes, but does she drink?" conveys the same truth-conditional information as the previous example but also raises an issue of whether Nancy drinks. Other approaches generalize
The advantages of different treebank corpora. Examples include the universal annotation approach for dependency treebanks and the universal annotation approach for phrase structure treebanks. One of the key ways to extract evidence from a treebank is through search tools. Search tools for parsed corpora typically depend on the annotation scheme that was applied to the corpus. User interfaces range in sophistication from expression-based query systems aimed at computer programmers to full exploration environments aimed at general linguists. Wallis (2008) discusses
The communication itself). In the 1970s and 1980s Leech took part in the development of pragmatics as a newly emerging subdiscipline of linguistics deeply influenced by the ordinary-language philosophers J. L. Austin, J. R. Searle and H. P. Grice. In his main book on the subject, Principles of Pragmatics (1983), he argued for a general account of pragmatics based on regulative principles following
The concept of truth conditionality or treat it as epiphenomenal. For instance in dynamic semantics, knowing the meaning of a sentence amounts to knowing how it updates a context. Pietroski treats meanings as instructions to build concepts. The Principle of Compositionality is the fundamental assumption in formal semantics. This principle states that the denotation of a complex expression
The decision to use one grammatical construction tends to influence the decision to form others, and to try to understand how speakers and writers make decisions as they form sentences. Interaction research is particularly fruitful as further layers of annotation, e.g. semantic, pragmatic, are added to a corpus. It is then possible to evaluate the impact of non-syntactic phenomena on grammatical choices. In linguistics research, annotated treebank data has been used in syntactic research to test linguistic theories of sentence structure against large quantities of naturally occurring examples. A semantic treebank
The following: Leech contributed to three team projects resulting in large-scale descriptive reference grammars of English, all published as lengthy single-volume works: A Grammar of Contemporary English (with Randolph Quirk, Sidney Greenbaum and Jan Svartvik, 1972); A Comprehensive Grammar of the English Language (with Randolph Quirk, Sidney Greenbaum and Jan Svartvik, 1985); and the Longman Grammar of Spoken and Written English (LGSWE) (with Douglas Biber, Stig Johansson, Susan Conrad and Edward Finegan, 1999). These grammars have been broadly regarded as providing an authoritative "standard" account of English grammar, although
The meanings of countless natural language expressions including counterfactuals, propositional attitudes, evidentials, habituals and generics. The standard treatment of linguistic modality was proposed by Angelika Kratzer in the 1970s, building on an earlier tradition of work in modal logic. Formal semantics emerged as a major area of research in the early 1970s, with the pioneering work of
The meanings of their parts. The enterprise of formal semantics can be thought of as that of reverse-engineering the semantic components of natural languages' grammars. Formal semantics studies the denotations of natural language expressions. High-level concerns include compositionality, reference, and the nature of meaning. Key topic areas include scope, modality, binding, tense, and aspect. Semantics
The model of Grice's (1975) Cooperative principle (CP), with its constitutive maxims of Quantity, Quality, Relation and Manner. The part of the book that has had most influence is that dealing with the Principle of Politeness, seen as a principle having constituent maxims like Grice's CP. The politeness maxims Leech distinguished are: the Tact Maxim, Generosity Maxim, Approbation Maxim, Modesty Maxim, Agreement Maxim and Sympathy Maxim. This Gricean treatment of politeness has been much criticised: for example, it has been criticised for being "expansionist" (adding new maxims to
The parsing of natural language corpora is a labour-intensive project that can take teams of graduate linguists several years. The level of annotation detail and the breadth of the linguistic sample determine the difficulty of the task and the length of time required to build a treebank. Some treebanks follow a specific linguistic theory in their syntactic annotation (e.g. the BulTreeBank follows HPSG) but most try to be less theory-specific. However, two main groups can be distinguished: treebanks that annotate phrase structure (for example
Treebank - Misplaced Pages Continue
The phenomenon in which anaphoric elements such as pronouns are grammatically associated with their antecedents. For instance in the English sentence "Mary saw herself", the anaphor "herself" is bound by its antecedent "Mary". Binding can be licensed or blocked in certain contexts or syntactic configurations, e.g. the pronoun "her" cannot be bound by "Mary" in the English sentence "Mary saw her". While all languages have binding, restrictions on it vary even among closely related languages. Binding
The philosopher and logician Richard Montague. Montague proposed a formal system now known as Montague grammar which consisted of a novel syntactic formalism for English, a logical system called Intensional Logic, and a set of homomorphic translation rules linking the two. In retrospect, Montague Grammar has been compared to a Rube Goldberg machine, but it was regarded as earth-shattering when first proposed, and many of its fundamental insights survive in
The principles of searching treebanks in detail and reviews the state of the art around that time. Geoffrey Leech Geoffrey Neil Leech FBA (16 January 1936 – 19 August 2014) was a specialist in English language and linguistics. He was the author, co-author, or editor of more than 30 books and more than 120 published papers. His main academic interests were English grammar, corpus linguistics, stylistics, pragmatics, and semantics. Leech
The rather traditional framework employed has also been criticised, e.g. by Huddleston and Pullum (2002) in their Cambridge Grammar of the English Language. Inspired by the corpus-building work of Randolph Quirk at UCL, soon after his arrival at Lancaster, Leech pioneered computer corpus development. He initiated the first electronic corpus of British English, completed in 1978 as the Lancaster-Oslo/Bergen (LOB) Corpus. Later, in
The stylistics of literary texts. The two stylistic works for which he is best known are A Linguistic Guide to English Poetry (1969) and Style in Fiction (1981; 2nd edn. 2007), co-authored with Mick Short. The latter book won the PALA25 Silver Jubilee Prize for "the most influential book in stylistics" since 1980. The approach Leech has taken to literary style relies heavily on the concept of foregrounding,
The syntactic analysis for John loves Mary, shown in the figure on the right, may be represented by simple labelled brackets in a text file, like this (following the Penn Treebank notation): This type of representation is popular because it is light on resources, and the tree structure is relatively easy to read without software tools. However, as corpora become increasingly complex, other file formats may be preferred. Alternatives include treebank-specific XML schemes, numbered indentation and various types of standoff notation. From
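To make the labelled-bracket format concrete, the following sketch reads one bracketed analysis of John loves Mary into a nested structure and recovers the words. The bracket string and part-of-speech tags are an assumed Penn-Treebank-style rendering rather than a quotation from any treebank, and the function names are hypothetical.

```python
def parse_brackets(text):
    """Parse a labelled-bracket string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def parse_node():
        nonlocal position
        if tokens[position] != "(":
            raise ValueError("expected '(' at token %d" % position)
        position += 1
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(parse_node())      # nested constituent
            else:
                children.append(tokens[position])  # a word (leaf)
                position += 1
        position += 1  # skip the closing ")"
        return (label, children)

    tree = parse_node()
    if position != len(tokens):
        raise ValueError("unexpected material after the first tree")
    return tree


def leaves(tree):
    """Return the words of a tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Assumed Penn-Treebank-style bracketing, for illustration only.
EXAMPLE = "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))"

if __name__ == "__main__":
    tree = parse_brackets(EXAMPLE)
    print(tree[0])
    print(leaves(tree))
```

Because the format is plain text, a reader this small is enough for simple cases, which is part of why the notation is light on resources; richer corpora with traces, indices, or standoff layers need dedicated tooling.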
The time course of syntactic change). Once parsed, a corpus will contain frequency evidence showing how common different grammatical structures are in use. Treebanks also provide evidence of coverage and support the discovery of new, unanticipated, grammatical phenomena. Another use of treebanks in theoretical linguistics and psycholinguistics is interaction evidence. A completed treebank can help linguists carry out experiments as to how
The various semantic models which have superseded it. Montague Grammar was a major advance because it showed that natural languages could be treated as interpreted formal languages. Before Montague, many linguists had doubted that this was possible, and logicians of that era tended to view logic as a replacement for natural language rather than a tool for analyzing it. Montague's work was published during
Was Professor of English Linguistics from 1974 to 2001. In 2002 he became Emeritus Professor in the Department of Linguistics and English Language, Lancaster University. He was a Fellow of the British Academy, an Honorary Fellow of UCL and of Lancaster University, a Member of the Academia Europaea and the Norwegian Academy of Science and Letters, and an honorary doctor of three universities, most recently of Charles University, Prague (2012). He died in Lancaster, England on 19 August 2014. Leech's most important research contributions are
Was a major component of the government and binding theory paradigm. Modality is the phenomenon whereby language is used to discuss potentially non-actual scenarios. For instance, while a non-modal sentence such as "Nancy smoked" makes a claim about the actual world, modalized sentences such as "Nancy might have smoked" or "If Nancy smoked, I'll be sad" make claims about alternative scenarios. The most intensely studied expressions include modal auxiliaries such as "could", "should", or "must"; modal adverbs such as "possibly" or "necessarily"; and modal adjectives such as "conceivable" and "probable". However, modal components have been identified in
Was born in Gloucester, England on 16 January 1936. He was educated at Tewkesbury Grammar School, Gloucestershire, and at University College London (UCL), where he was awarded a BA (1959) and PhD (1968). He began his teaching career at UCL, where he was influenced by Randolph Quirk and Michael Halliday as senior colleagues. He spent 1964-5 as a Harkness Fellow at the Massachusetts Institute of Technology, Cambridge MA. In 1969 Leech moved to Lancaster University, UK, where he
Was strong in the period up to 1980, when it gave way to his interest in pragmatics. His PhD thesis at London University was on the semantics of place, time and modality in English, and was subsequently published under the title Towards a Semantic Description of English (1969). At a more popular level, he published Semantics (1974, 1981), in which the seven types of meaning discussed in Chapter 2 have been widely cited. These are sometimes compared to Roman Jakobson's six communication functions: conative (requesting an action or response), emotive (communication of emotions), referential (communicating facts and opinions), phatic (communication handshakes, acknowledgment, politeness etc.), poetic, and metalinguistic (self-referring to