CYK algorithm - Misplaced Pages

In computer science, the Cocke–Younger–Kasami algorithm (alternatively called CYK, or CKY) is a parsing algorithm for context-free grammars published by Itiroo Sakai in 1961. The algorithm is named after some of its rediscoverers: John Cocke, Daniel Younger, Tadao Kasami, and Jacob T. Schwartz. It employs bottom-up parsing and dynamic programming.

The standard version of CYK operates only on context-free grammars given in Chomsky normal form (CNF). However, any context-free grammar may be algorithmically transformed into a CNF grammar expressing the same language (Sipser 1997). The importance of the CYK algorithm stems from its high efficiency in certain situations. Using big O notation, the worst-case running time of CYK is $\mathcal{O}(n^{3} \cdot |G|)$, where $n$

$a$ is a terminal symbol, because Robert W. Floyd found in 1961 that any BNF syntax can be converted to the above one. But he withdrew this term, "since doubtless many people have independently used this simple fact in their own work, and the point is only incidental to the main considerations of Floyd's note." While Floyd's note cites Chomsky's original 1959 article, Knuth's letter does not. Besides its theoretical significance, CNF conversion

a compiler front end their internal structure is usually not considered by the parser. The terminal symbol "^" denoted exponentiation in Algol60. In step "START" of the above conversion algorithm, just a rule $S_0 \to Expr$ is added to the grammar. After step "TERM", the grammar looks like this: After step "BIN", the following grammar is obtained: Since there are no ε-rules, step "DEL" does not change

a context-free grammar, $G$, is said to be in Chomsky normal form (first described by Noam Chomsky) if all of its production rules are of the form $A \to BC$, $A \to a$, or $S \to \varepsilon$, where $A$, $B$, and $C$ are nonterminal symbols, the letter $a$ is a terminal symbol (a symbol that represents a constant value), $S$ is the start symbol, and ε denotes the empty string. Also, neither $B$ nor $C$ may be the start symbol, and

a new rule $S_0 \to S$, where $S$ is the previous start symbol. This does not change the grammar's produced language, and $S_0$ will not occur on any rule's right-hand side. To eliminate each rule $A \to X_1 \ldots a \ldots X_n$ in which a terminal symbol $a$ is not the only symbol on the right-hand side, introduce, for every such terminal, a new nonterminal symbol $N_a$ and a new rule $N_a \to a$. Change every rule $A \to X_1 \ldots a \ldots X_n$ to $A \to X_1 \ldots N_a \ldots X_n$. If several terminal symbols occur on

a sequence of simple transformations is applied in a certain order; this is described in most textbooks on automata theory. The presentation here follows Hopcroft, Ullman (1979), but is adapted to use the transformation names from Lange, Leiß (2009). Each of the following transformations establishes one of the properties required for Chomsky normal form. Introduce a new start symbol $S_0$, and

a very simple renaming of non-terminals, as shown by Lang (1994). As pointed out by Lange & Leiß (2009), the drawback of all known transformations into Chomsky normal form is that they can lead to an undesirable bloat in grammar size. The size of a grammar is the sum of the sizes of its production rules, where the size of a rule is one plus the length of its right-hand side. Using $g$ to denote
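The size measure defined above can be computed directly. The following is a minimal sketch; the representation of a rule as a `(lhs, rhs)` pair with `rhs` a tuple of symbols, and the name `grammar_size`, are assumptions made for illustration only.

```python
def grammar_size(rules):
    """Sum of rule sizes, where a rule's size is 1 + len(right-hand side)."""
    return sum(1 + len(rhs) for _lhs, rhs in rules)


# Hypothetical toy grammar: S -> A B, A -> a, B -> b
toy_rules = [
    ("S", ("A", "B")),
    ("A", ("a",)),
    ("B", ("b",)),
]

# Rule sizes are 3, 2 and 2 respectively.
print(grammar_size(toy_rules))
```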

is also possible to extend the CYK algorithm to parse strings using weighted and stochastic context-free grammars. Weights (probabilities) are then stored in the table P instead of booleans, so P[i,j,A] will contain the minimum weight (maximum probability) that the substring from i to j can be derived from A. Further extensions of the algorithm allow all parses of a string to be enumerated from lowest to highest weight (highest to lowest probability). When

is applied after UNIT. The table shows which orderings are admitted. Moreover, the worst-case bloat in grammar size depends on the transformation order. Using $|G|$ to denote the size of the original grammar $G$, the size blow-up in the worst case may range from $|G|^{2}$ to $2^{2|G|}$, depending on the transformation algorithm used. The blow-up in grammar size depends on the order between DEL and BIN. It may be exponential when DEL

is being) removed. The skipping of nonterminal symbol $B$ in the resulting grammar is possible due to $B$ being a member of the unit closure of nonterminal symbol $A$. When choosing the order in which the above transformations are to be applied, it has to be considered that some transformations may destroy the result achieved by other ones. For example, START will re-introduce a unit rule if it

is completed, the input string is generated by the grammar if the substring containing the entire input string is matched by the start symbol. This is an example grammar: Now the sentence "she eats a fish with a fork" is analyzed using the CYK algorithm. In the following table, in $P[i,j,k]$, $i$ is the number of the row (starting at

is done first, but is linear otherwise. UNIT can incur a quadratic blow-up in the size of the grammar. The orderings START, TERM, BIN, DEL, UNIT and START, BIN, DEL, UNIT, TERM lead to the least (i.e. quadratic) blow-up. The following grammar, with start symbol Expr, describes a simplified version of the set of all syntactically valid arithmetic expressions in programming languages like C or Algol60. Both number and variable are considered terminal symbols here for simplicity, since in

is only suitable for recognition. The dependence on efficient matrix multiplication cannot be avoided altogether: Lee (2002) has proved that any parser for context-free grammars working in time $O(n^{3-\varepsilon} \cdot |G|)$ can be effectively converted into an algorithm computing

is the length of the parsed string and $|G|$ is the size of the CNF grammar $G$ (Hopcroft & Ullman 1979, p. 140). This makes it one of the most efficient parsing algorithms in terms of worst-case asymptotic complexity, although other algorithms exist with better average running time in many practical scenarios. The dynamic programming algorithm requires

is the length of the parsed string and $|G|$ is the size of the CNF grammar $G$. This makes it one of the most efficient algorithms for recognizing general context-free languages in practice. Valiant (1975) gave an extension of the CYK algorithm. His algorithm computes the same parsing table as the CYK algorithm; yet he showed that algorithms for efficient multiplication of matrices with 0-1-entries can be utilized for performing this computation. Using

the Coppersmith–Winograd algorithm for multiplying these matrices, this gives an asymptotic worst-case running time of $O(n^{2.38} \cdot |G|)$. However, the constant term hidden by the big O notation is so large that the Coppersmith–Winograd algorithm is only worthwhile for matrices that are too large to handle on present-day computers (Knuth 1997), and this approach requires subtraction and so

the empty string can be transformed into Chomsky reduced form. In a letter where he proposed the term Backus–Naur form (BNF), Donald E. Knuth implied a BNF "syntax in which all definitions have such a form may be said to be in 'Floyd Normal Form'", where $\langle A\rangle$, $\langle B\rangle$ and $\langle C\rangle$ are nonterminal symbols, and

the above example, since the start symbol $S$ is in $M[7,1]$, the sentence can be generated by the grammar. The above algorithm is a recognizer that will only determine if a sentence is in the language. It is simple to extend it into a parser that also constructs a parse tree, by storing parse tree nodes as elements of

the array, instead of the boolean 1. The node is linked to the array elements that were used to produce it, so as to build the tree structure. Only one such node in each array element is needed if only one parse tree is to be produced. However, if all parse trees of an ambiguous sentence are to be kept, it is necessary to store in the array element a list of all the ways the corresponding node can be obtained in

the bottom at 1), and $j$ is the number of the column (starting at the left at 1). For readability, the CYK table for $P$ is represented here as a 2-dimensional matrix $M$ containing a set of non-terminal symbols, such that $R_k$ is in $M[i,j]$ if, and only if, $P[i,j,k]$. In

the context-free grammar to be rendered into Chomsky normal form (CNF), because it tests for possibilities to split the current sequence into two smaller sequences. Any context-free grammar that does not generate the empty string can be represented in CNF using only production rules of the forms $A \rightarrow \alpha$ and $A \rightarrow BC$; to allow for

the empty string, one can explicitly allow $S \to \varepsilon$, where $S$ is the start symbol. The algorithm in pseudocode is as follows: A probabilistic variant allows the most probable parse to be recovered, given the probabilities of all productions. In informal terms, this algorithm considers every possible substring of

the form $A \to BC$ or $A \to a$, where $A$, $B$ and $C$ are nonterminal symbols, and $a$ is a terminal symbol. When using this definition, $B$ or $C$ may be the start symbol. Only those context-free grammars which do not generate

the grammar's start symbol. To eliminate all rules of this form, first determine the set of all nonterminals that derive ε. Hopcroft and Ullman (1979) call such nonterminals nullable, and compute them as follows: if a rule $A \to \varepsilon$ exists, then $A$ is nullable; if a rule $A \to X_1 \ldots X_n$ exists and every $X_i$ is nullable, then $A$ is nullable too. Obtain an intermediate grammar by replacing each rule $A \to X_1 \ldots X_n$ by all versions with some nullable $X_i$ omitted. By deleting in this grammar each ε-rule, unless its left-hand side is
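The set of nullable nonterminals is a least fixed point: a nonterminal is nullable if it has an ε-rule, or a rule whose right-hand side consists only of nullable symbols. The sketch below iterates until nothing changes. The rule representation (`(lhs, rhs)` pairs, with an empty tuple standing for ε), the function name, and the toy grammar are assumptions for illustration, not taken from Hopcroft and Ullman.

```python
def nullable_nonterminals(rules):
    """Return the set of nonterminals that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            # An empty right-hand side (an epsilon-rule) satisfies all() trivially.
            if lhs not in nullable and all(sym in nullable for sym in rhs):
                nullable.add(lhs)
                changed = True
    return nullable


# Hypothetical grammar: lowercase symbols are terminals, () is the empty string.
example_rules = [
    ("S0", ("A", "b", "B")),
    ("S0", ("C",)),
    ("B", ("A", "A")),
    ("B", ("A", "C")),
    ("C", ("b",)),
    ("C", ("c",)),
    ("A", ("a",)),
    ("A", ()),
]

print(sorted(nullable_nonterminals(example_rules)))
```

In this toy grammar A is nullable through its ε-rule and B through B → A A, while C and S0 are not, since each of their right-hand sides contains a terminal or a non-nullable symbol.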

the grammar. After step "UNIT", the following grammar is obtained, which is in Chomsky normal form: The $N_a$ introduced in step "TERM" are PowOp, Open, and Close. The $A_i$ introduced in step "BIN" are AddOp_Term, MulOp_Factor, PowOp_Primary, and Expr_Close. Another way to define the Chomsky normal form is: A formal grammar is in Chomsky reduced form if all of its production rules are of

the input string and sets $P[l,s,v]$ to be true if the substring of length $l$ starting from $s$ can be generated from the nonterminal $R_v$. Once it has considered substrings of length 1, it goes on to substrings of length 2, and so on. For substrings of length 2 and greater, it considers every possible partition of

the parsing process. This is sometimes done with a second table B[n,n,r] of so-called backpointers. The end result is then a shared forest of possible parse trees, where common tree parts are factored between the various parses. This shared forest can conveniently be read as an ambiguous grammar generating only the sentence parsed, but with the same ambiguity as the original grammar, and the same parse trees up to

the probabilistic CYK algorithm is applied to a long string, the splitting probability can become very small due to multiplying many probabilities together. This can be dealt with by summing log-probabilities instead of multiplying probabilities. The worst-case running time of CYK is $\Theta(n^{3} \cdot |G|)$, where $n$
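The log-probability variant can be sketched as follows. Each table cell maps a nonterminal to the best log-probability found so far for deriving that substring, and products of probabilities become sums of logarithms. The function name `viterbi_cyk`, the dictionary-based rule encoding, and the toy grammar are all assumptions for illustration; the grammar is assumed to be in Chomsky normal form with strictly positive rule probabilities.

```python
import math


def viterbi_cyk(words, terminal_rules, binary_rules, start="S"):
    """Log-probability of the most probable parse, or -inf if there is none.

    terminal_rules maps (A, a) to P(A -> a);
    binary_rules maps (A, B, C) to P(A -> B C).
    """
    n = len(words)
    if n == 0:
        return -math.inf
    # best[l - 1][s][A]: best log-probability that A derives the
    # substring of length l starting at position s (0-based).
    best = [[{} for _ in range(n)] for _ in range(n)]
    for s, word in enumerate(words):
        for (a, terminal), prob in terminal_rules.items():
            if terminal == word:
                score = math.log(prob)
                if score > best[0][s].get(a, -math.inf):
                    best[0][s][a] = score
    for length in range(2, n + 1):
        for s in range(n - length + 1):
            cell = best[length - 1][s]
            for split in range(1, length):
                left = best[split - 1][s]
                right = best[length - split - 1][s + split]
                for (a, b, c), prob in binary_rules.items():
                    if b in left and c in right:
                        # Sum of logs replaces the product of probabilities.
                        score = math.log(prob) + left[b] + right[c]
                        if score > cell.get(a, -math.inf):
                            cell[a] = score
    return best[n - 1][0].get(start, -math.inf)


# Hypothetical stochastic grammar: S -> S S (0.3), S -> a (0.7)
toy_terminal = {("S", "a"): 0.7}
toy_binary = {("S", "S", "S"): 0.3}
print(viterbi_cyk(["a", "a", "a"], toy_terminal, toy_binary))
```

Storing only the best score per nonterminal keeps the table the same size as in the boolean recognizer; recovering the parse itself would additionally require backpointers.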

the product of $(n \times n)$-matrices with 0-1-entries in time $O(n^{3-\varepsilon/3})$, and this was extended by Abboud et al. to apply to a constant-size grammar.

Chomsky normal form

In formal language theory,

the right-hand side, simultaneously replace each of them by its associated nonterminal symbol. This does not change the grammar's produced language. Replace each rule $A \to X_1 X_2 \ldots X_n$ with more than 2 nonterminals $X_1, \ldots, X_n$ by rules $A \to X_1 A_1$, $A_1 \to X_2 A_2$, ..., $A_{n-2} \to X_{n-1} X_n$, where the $A_i$ are new nonterminal symbols. Again, this does not change the grammar's produced language. An ε-rule is a rule of the form $A \to \varepsilon$, where $A$ is not $S_0$,
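The step that shortens long right-hand sides can be sketched as a loop that peels off one symbol at a time and chains the remainder through a fresh nonterminal. The rule encoding, the function name `binarize`, and the `_A1`, `_A2`, ... naming scheme are assumptions for illustration; the sketch also assumes those generated names do not clash with symbols already in the grammar.

```python
def binarize(rules):
    """Replace every rule with more than two right-hand-side symbols by a chain of binary rules."""
    result = []
    counter = 0
    for lhs, rhs in rules:
        while len(rhs) > 2:
            counter += 1
            fresh = f"_A{counter}"  # hypothetical naming scheme for new nonterminals
            result.append((lhs, (rhs[0], fresh)))
            # Continue with the fresh nonterminal deriving the remaining symbols.
            lhs, rhs = fresh, rhs[1:]
        result.append((lhs, rhs))
    return result


for rule in binarize([("S", ("X", "Y", "Z", "W"))]):
    print(rule)
```

A rule with n right-hand-side symbols yields n − 1 binary rules, so this step increases the grammar size only linearly.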

the same language as the original example grammar, viz. {ab, aba, abaa, abab, abac, abb, abc, b, ba, baa, bab, bac, bb, bc, c}, but has no ε-rules. A unit rule is a rule of the form $A \to B$, where $A$, $B$ are nonterminal symbols. To remove it, for each rule $B \to X_1 \ldots X_n$, where $X_1 \ldots X_n$ is a string of nonterminals and terminals, add rule $A \to X_1 \ldots X_n$ unless this is a unit rule which has already been (or
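One way to sketch unit-rule removal is through unit closures: first compute, for every nonterminal A, the set of nonterminals reachable from A by unit rules alone, then give A a copy of every non-unit rule belonging to a member of that set. This is a common formulation and not necessarily the exact procedure of the cited textbooks; the function name, the rule encoding, and the toy grammar are assumptions for illustration.

```python
def remove_unit_rules(rules, nonterminals):
    """Return an equivalent rule list without rules of the form A -> B."""

    def is_unit(rhs):
        return len(rhs) == 1 and rhs[0] in nonterminals

    # closure[A]: nonterminals reachable from A using unit rules only.
    closure = {a: {a} for a in nonterminals}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if is_unit(rhs):
                for a in nonterminals:
                    if lhs in closure[a] and rhs[0] not in closure[a]:
                        closure[a].add(rhs[0])
                        changed = True

    result = []
    for a in sorted(nonterminals):
        for lhs, rhs in rules:
            # Copy each non-unit rule of every member of A's unit closure to A.
            if lhs in closure[a] and not is_unit(rhs) and (a, rhs) not in result:
                result.append((a, rhs))
    return result


# Hypothetical grammar: S -> A, A -> B, B -> b, A -> a c
unit_example = [("S", ("A",)), ("A", ("B",)), ("B", ("b",)), ("A", ("a", "c"))]
for rule in remove_unit_rules(unit_example, {"S", "A", "B"}):
    print(rule)
```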

the size of the original grammar, the size blow-up in the worst case may range from $g^{2}$ to $2^{2g}$, depending on the transformation algorithm used. For use in teaching, Lange and Leiß propose a slight generalization of the CYK algorithm, "without compromising efficiency of the algorithm, clarity of its presentation, or simplicity of proofs" (Lange & Leiß 2009). It

the start symbol, the transformed grammar is obtained. For example, in the following grammar, with start symbol $S_0$, the nonterminal $A$, and hence also $B$, is nullable, while neither $C$ nor $S_0$ is. Hence the following intermediate grammar is obtained: In this grammar, all ε-rules have been "inlined at the call site". In the next step, they can hence be deleted, yielding the grammar: This grammar produces

the substring into two parts, and checks to see if there is some production $A \to B\;C$ such that $B$ matches the first part and $C$ matches the second part. If so, it records $A$ as matching the whole substring. Once this process
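The bottom-up procedure described here, filling in substrings of length 1 first and then combining two shorter parts for each longer substring, can be sketched as a small recognizer. The function name `cyk_recognize`, the tuple-based rule encoding, and the toy grammar are assumptions for illustration; the grammar must already be in Chomsky normal form, and the sketch returns only a yes/no answer rather than a parse tree.

```python
def cyk_recognize(words, terminal_rules, binary_rules, start="S"):
    """Return True if the word sequence is derivable from `start`.

    terminal_rules: list of (A, a) pairs for rules A -> a.
    binary_rules: list of (A, B, C) triples for rules A -> B C.
    """
    n = len(words)
    if n == 0:
        return False  # the empty string is not handled in this sketch
    # table[l - 1][s]: nonterminals deriving the substring of length l
    # that starts at position s (0-based).
    table = [[set() for _ in range(n)] for _ in range(n)]
    for s, word in enumerate(words):
        table[0][s] = {a for a, terminal in terminal_rules if terminal == word}
    for length in range(2, n + 1):
        for s in range(n - length + 1):
            for split in range(1, length):
                left = table[split - 1][s]
                right = table[length - split - 1][s + split]
                for a, b, c in binary_rules:
                    if b in left and c in right:
                        table[length - 1][s].add(a)
    return start in table[n - 1][0]


# Hypothetical toy grammar in Chomsky normal form.
toy_terminal_rules = [
    ("NP", "she"), ("V", "eats"), ("VP", "eats"), ("Det", "a"),
    ("N", "fish"), ("N", "fork"), ("P", "with"),
]
toy_binary_rules = [
    ("S", "NP", "VP"), ("VP", "VP", "PP"), ("VP", "V", "NP"),
    ("PP", "P", "NP"), ("NP", "Det", "N"),
]

sentence = "she eats a fish with a fork".split()
print(cyk_recognize(sentence, toy_terminal_rules, toy_binary_rules))
```

The three nested loops over length, start position and split point account for the cubic dependence on the input length, and the innermost loop over the binary rules accounts for the grammar-size factor.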

the third production rule can only appear if ε is in $L(G)$, the language produced by the context-free grammar $G$. Every grammar in Chomsky normal form is context-free, and conversely, every context-free grammar can be transformed into an equivalent one which is in Chomsky normal form and has a size no larger than the square of the original grammar's size. To convert a grammar to Chomsky normal form,
